I am trying to parse an HTML file using Jsoup. Some of the text in the HTML is not wrapped in any tag.
<li class="inactive">
<span class="status label">inactive</span>
<a href="/officers/144662696" class="officer inactive" title="more info on MILLTOWN CORPORATE SERVICES">
MILLTOWN CORPORATE SERVICES
</a>
member,
<span class="status label">inactive</span>
<a href="/companies/us_wv/193180" class="company inactive revoked_(failure_to_file_annual_report)" title="More Free And Open Company Data On EASTBRIDGE L.L.C. (West Virginia (US), 193180)">
EASTBRIDGE L.L.C.
</a>
(West Virginia (US),
<span class="start_date">25 May 2000</span>-<span class="end_date"> 1 Aug 2002</span>)
</li>
I am able to read all the content inside the tags, but I am trying to get the values "(West Virginia (US)," and "member,", which sit directly inside the li.
Is there a way to get the text that is inside the li tag but outside its child elements?
You are probably looking for something like Element#ownText.
This only gets the text of the current element and not a combined text of all children.
Element listItem = doc.select("li.inactive").first();
System.out.println(listItem.ownText()); // prints "member, (West Virginia (US), -)"
You can also use the preceding tags to reach the text nodes that are not wrapped in any tag. If I understand correctly, you want the text node that follows each a tag. Try something like:
String html = "<li class=\"inactive\"> \n"
+ " <span class=\"status label\">inactive</span> \n"
+ " <a href=\"/officers/144662696\" class=\"officer inactive\" title=\"more info on MILLTOWN CORPORATE SERVICES\">\n"
+ " MILLTOWN CORPORATE SERVICES\n"
+ " </a>\n"
+ " member, \n"
+ " <span class=\"status label\">inactive</span> \n"
+ " <a href=\"/companies/us_wv/193180\" class=\"company inactive revoked_(failure_to_file_annual_report)\" title=\"More Free And Open Company Data On EASTBRIDGE L.L.C. (West Virginia (US), 193180)\">\n"
+ " EASTBRIDGE L.L.C.\n"
+ " </a> \n"
+ " (West Virginia (US), \n"
+ " <span class=\"start_date\">25 May 2000</span>-<span class=\"end_date\"> 1 Aug 2002</span>) \n"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
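for (Element e : links) {
    Node next = e.nextSibling(); // null when the link is the last child
    if (next instanceof TextNode) {
        System.out.println(((TextNode) next).text().trim());
    }
}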
I have the following HTML string:
<html>
<head>
<title>Repository</title>
</head>
<body>
<h2>Subversion</h2>
<ul>
<li>
..
</li>
<li>
branch_A
</li>
<li>
branch_B
</li>
</ul>
</body>
</html>
From this I want to get the li labels, which are branch_A and branch_B.
The number of li elements can vary, and I want to get all of them. How can I parse this string and get those values?
NOTE: I could have used the jsoup library for this, but due to a project restriction I cannot use it.
You can use an HTML parser for this. The code below uses jsoup (https://www.baeldung.com/java-with-jsoup), which is quick and easy.
Document doc = Jsoup.connect(url).get(); // url: the address of your page
doc.select("li").forEach(System.out::println); // "li", or whichever tag you want
Other tools are discussed here: https://tomassetti.me/parsing-html/
Using Java streams (String.lines() requires Java 11 or later):
String html = "<html>\n" +
" <head>\n" +
" <title>Repository</title>\n" +
" </head>\n" +
" <body>\n" +
" <h2>Subversion</h2>\n" +
" <ul>\n" +
" <li>\n" +
" ..\n" +
" </li>\n" +
" <li>\n" +
" branch_A\n" +
" </li>\n" +
" <li>\n" +
" branch_B\n" +
" </li>\n" +
" </ul>\n" +
" </body>\n" +
"</html>";
html.lines().filter(line -> line.contains("<a href")).forEach(System.out::println);
Output:
..
branch_A
branch_B
Keep in mind that you can run the stream in parallel if you have a huge HTML file.
You can also strip the HTML tags using map:
html.lines().filter(line -> line.contains("<a href")).map(line -> line.replaceAll("<[^>]*>","")).forEach(System.out::println);
Output:
..
branch_A
branch_B
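Since the question rules out jsoup, a dependency-free variant could look like the sketch below. It uses only java.util.regex; the class name LiLabels and the decision to skip the ".." parent entry are assumptions made for illustration. Regular expressions are not a general HTML parser, so this is only reasonable for small, predictable listings like the one in the question (no nested li elements, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiLabels {
    // One <li>...</li> pair; DOTALL lets the body span several lines.
    private static final Pattern LI =
            Pattern.compile("<li\\b[^>]*>(.*?)</li>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Any remaining tag inside the item, e.g. a wrapping link.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the trimmed text of every li element, in document order. */
    static List<String> extract(String html) {
        List<String> labels = new ArrayList<>();
        Matcher item = LI.matcher(html);
        while (item.find()) {
            String label = TAG.matcher(item.group(1)).replaceAll("").trim();
            // Assumption: ".." is the parent-directory entry and is not wanted.
            if (!label.isEmpty() && !label.equals("..")) {
                labels.add(label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String html = "<ul>\n<li>\n..\n</li>\n<li>\nbranch_A\n</li>\n<li>\nbranch_B\n</li>\n</ul>";
        extract(html).forEach(System.out::println);
    }
}
```

With the HTML string from the question, LiLabels.extract(html) returns a list containing branch_A and branch_B, however many li elements there are.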
We have a requirement where we ask our customers to fill in a BRD document that is an HTML file. The HTML consists of radio buttons, text boxes, etc., along with colors and a table. We will have a button which, when clicked, should call a Java class that exports the HTML, together with the data the customer entered, to a Word document. We have succeeded in converting HTML that is given directly as a string in the Java program to a Word document, but we are having issues sending the HTML along with the data.
Can anyone let me know how I can achieve this, or whether there is a better way to do it?
public class XhtmlToDocx {
public static void main(String[] args) throws Exception {
//String html = "<html><form><input type=\"checkbox\" name=\"xhtml_mp_tutorial_chapter\" value=\"1\"/></form></html>";
String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"+
"<html xmlns=\"http://www.w3.org/1999/xhtml\">"+
"<head>"+
"<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"+
"<title>Untitled Form</title>"+
"<link rel=\"stylesheet\" type=\"text/css\" href=\"view.css\" media=\"all\">"+
"<script type=\"text/javascript\" src=\"view.js\"></script>"+
"<script type=\"text/javascript\" src=\"calendar.js\"></script>"+
"</head>"+
"<body id=\"main_body\" >"+
" "+
" <img id=\"top\" src=\"top.png\" alt=\"\">"+
" <div id=\"form_container\">"+
" "+
" <h1><a>Untitled Form</a></h1>"+
" <form id=\"form_82495\" class=\"appnitro\" method=\"post\" action=\"\">"+
" <div class=\"form_description\">"+
" <h2>Untitled Form</h2>"+
" <p>This is your form description. Click here to edit.</p>"+
" </div> "+
" <ul >"+
" "+
" <li id=\"li_1\" >"+
" <label class=\"description\" for=\"element_1\">Text </label>"+
" <div>"+
" <input id=\"element_1\" name=\"element_1\" class=\"element text medium\" type=\"text\" maxlength=\"255\" value=\"\"/> "+
" </div> "+
" </li> <li id=\"li_3\" >"+
" <label class=\"description\" for=\"element_3\">Multiple Choice </label>"+
" <span>"+
" <input id=\"element_3_1\" name=\"element_3\" class=\"element radio\" type=\"radio\" value=\"1\" />"+
"<label class=\"choice\" for=\"element_3_1\">First option</label>"+
"<input id=\"element_3_2\" name=\"element_3\" class=\"element radio\" type=\"radio\" value=\"2\" />"+
"<label class=\"choice\" for=\"element_3_2\">Second option</label>"+
"<input id=\"element_3_3\" name=\"element_3\" class=\"element radio\" type=\"radio\" value=\"3\" />"+
"<label class=\"choice\" for=\"element_3_3\">Third option</label>"+
""+
" </span> "+
" </li> <li id=\"li_2\" >"+
" <label class=\"description\" for=\"element_2\">Date </label>"+
" <span>"+
" <input id=\"element_2_1\" name=\"element_2_1\" class=\"element text\" size=\"2\" maxlength=\"2\" value=\"\" type=\"text\"> /"+
" <label for=\"element_2_1\">MM</label>"+
" </span>"+
" <span>"+
" <input id=\"element_2_2\" name=\"element_2_2\" class=\"element text\" size=\"2\" maxlength=\"2\" value=\"\" type=\"text\"> /"+
" <label for=\"element_2_2\">DD</label>"+
" </span>"+
" <span>"+
" <input id=\"element_2_3\" name=\"element_2_3\" class=\"element text\" size=\"4\" maxlength=\"4\" value=\"\" type=\"text\">"+
" <label for=\"element_2_3\">YYYY</label>"+
" </span>"+
" "+
" <span id=\"calendar_2\">"+
" <img id=\"cal_img_2\" class=\"datepicker\" src=\"calendar.gif\" alt=\"Pick a date.\"> "+
" </span>"+
" <script type=\"text/javascript\">"+
" Calendar.setup({"+
" inputField : \"element_2_3\","+
" baseField : \"element_2\","+
" displayArea : \"calendar_2\","+
" button : \"cal_img_2\","+
" ifFormat : \"%B %e, %Y\","+
" onSelect : selectDate"+
" });"+
" </script>"+
" "+
" </li>"+
" "+
" <li class=\"buttons\">"+
" <input type=\"hidden\" name=\"form_id\" value=\"82495\" />"+
" "+
" <input id=\"saveForm\" class=\"button_text\" type=\"submit\" name=\"submit\" value=\"Submit\" />"+
" </li>"+
" </ul>"+
" </form> "+
" <div id=\"footer\">"+
" Generated by pForm"+
" </div>"+
" </div>"+
" <img id=\"bottom\" src=\"bottom.png\" alt=\"\">"+
" </body>"+
"</html>";
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
afiPart.setBinaryData(html.getBytes(java.nio.charset.StandardCharsets.UTF_8)); // match the charset declared in the HTML
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
// .. the bit in document body
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId() );
wordMLPackage.getMainDocumentPart().addObject(ac);
// .. content type
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
wordMLPackage.save(new java.io.File("C:/Users/****/Downloads/Word.docx"));
}
}
The problem seems to be that you are reading the static HTML page, not the submitted page.
To get the full content of the submitted HTML, you first need to submit the form with its data, generate a static HTML page from it, and then read that page (for example with XMLSerializer or URLStreamReader) to get the final markup to pass to the word-processing part of your program.
I am not providing an exact solution with code, as I assume you can implement it yourself and are mainly stuck on the logic.
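The "merge the submitted data into the page first" step could be sketched roughly as follows. This is an illustration under several assumptions, not a tested integration: FormSnapshot and fill are hypothetical names, the submitted values are assumed to arrive as a Map of field name to value (for example, collected from the request parameters on the server), and the regex-based rewriting only handles double-quoted attributes on text and radio inputs that already carry a value attribute.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormSnapshot {
    private static final Pattern INPUT = Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME = Pattern.compile("(?<=\\s)name=\"([^\"]*)\"");
    private static final Pattern TYPE = Pattern.compile("(?<=\\s)type=\"([^\"]*)\"");
    private static final Pattern VALUE = Pattern.compile("(?<=\\s)value=\"([^\"]*)\"");

    /** Returns a copy of the form markup with the submitted answers written into it. */
    static String fill(String html, Map<String, String> submitted) {
        Matcher input = INPUT.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (input.find()) {
            out.append(html, last, input.start());          // markup before this tag
            out.append(fillTag(input.group(), submitted));  // the (possibly rewritten) tag
            last = input.end();
        }
        return out.append(html, last, html.length()).toString();
    }

    private static String fillTag(String tag, Map<String, String> submitted) {
        String name = attr(NAME, tag);
        String answer = name == null ? null : submitted.get(name);
        if (answer == null) {
            return tag; // nothing was submitted for this field
        }
        String type = attr(TYPE, tag);
        if ("text".equals(type)) {
            // Overwrite the existing value attribute with the escaped answer.
            String filled = "value=\"" + escape(answer) + "\"";
            return VALUE.matcher(tag).replaceFirst(Matcher.quoteReplacement(filled));
        }
        if ("radio".equals(type) && answer.equals(attr(VALUE, tag))) {
            // Mark the chosen option as checked.
            return tag.replaceFirst("\\s*/?>$", " checked=\"checked\" />");
        }
        return tag;
    }

    private static String attr(Pattern pattern, String tag) {
        Matcher m = pattern.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;")
                .replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String form = "<input name=\"element_1\" type=\"text\" value=\"\"/>"
                + "<input name=\"element_3\" type=\"radio\" value=\"2\" />";
        System.out.println(fill(form, Map.of("element_1", "Acme", "element_3", "2")));
    }
}
```

The string returned by fill would then take the place of the hard-coded html variable that the question passes to afiPart.setBinaryData.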
I'm trying to get data, in order, from the HTML of a web page. The HTML looks like:
<div class="text">
First Text
<br>
<br>
<div style="margin:20px; margin-top:5px; ">
<table cellpadding="5">
<tbody><tr>
<td class="alt2">
<div>
Written by <b>excedent</b>
</div>
<div style="font-style:italic">quote message</div>
</td>
</tr>
</tbody></table>
</div>Second Text<br>
<br>
<img class="img" src="https://developer.android.com/_static/images/android/touchicon-180.png"><br>
<br>
Third Text
</div>
What I want to do is build an Android layout by scraping the HTML, but I need to preserve the order of the elements. In this case:
TextView => First Text
TextView => Quote Message
TextView => Second Text
ImageView => img
TextView => Third Text
The problem comes when I try to get the HTML values in order. Using JSoup, Element.ownText gives me a single String, "First Text Second Text Third Text", and then the img at the end, resulting in:
TextView => First Text Second Text Third Text
TextView => Quote Message
ImageView => img
What can I do to get that data in order?
Thanks in advance
You can parse the HTML into a list of nodes. The list preserves the DOM order and gives you what you want.
Check jsoup's Parser.parseFragment method, which returns a list of nodes.
Try this.
String html = ""
+ "<div class=\"text\">"
+ " First Text"
+ " <br>"
+ " <br>"
+ " <div style=\"margin:20px; margin-top:5px; \">"
+ " <table cellpadding=\"5\">"
+ " <tbody><tr>"
+ " <td class=\"alt2\">"
+ " <div>"
+ " Written by <b>excedent</b>"
+ " </div>"
+ " <div style=\"font-style:italic\">quote message</div>"
+ " </td>"
+ " </tr></tbody>"
+ " </table>"
+ " </div>Second Text<br>"
+ " <br>"
+ " <img class=\"img\" src=\"https://developer.android.com/_static/images/android/touchicon-180.png\"><br>"
+ " <br>"
+ " Third Text"
+ " </div>";
Document doc = Jsoup.parse(html);
List<String> rootTexts = doc.select("div.text").first().textNodes().stream()
.map(node -> node.text().trim())
.filter(s -> !s.isEmpty())
.collect(Collectors.toList());
System.out.println(rootTexts);
OUTPUT:
[First Text, Second Text, Third Text]
This answer is a little late, but the correct way to do this is as follows. For your outermost <div>, use Element.childNodes() instead of getting the child elements with Element.children().
Element.children() only returns child Elements, which do not include text.
Element.childNodes() returns all child nodes, which includes TextNodes and Elements.
This solution works for me.
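To make the ordering point concrete without depending on a particular jsoup version, here is a sketch of the same idea using the JDK's built-in org.w3c.dom API, where getChildNodes() plays the role described above for Element.childNodes(): it returns text nodes and elements interleaved in document order. The class name OrderedNodes, the simplified well-formed markup, and the "TextView =>" / "ImageView =>" labels are illustrative assumptions. With jsoup, the same loop would run over Element.childNodes() and test for TextNode and Element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class OrderedNodes {

    /** Parses well-formed markup and lists its text and images in document order. */
    static List<String> walk(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> views = new ArrayList<>();
            collect(doc.getDocumentElement(), views);
            return views;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Depth-first walk over ALL child nodes, so text and elements keep their order.
    private static void collect(Node parent, List<String> out) {
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.TEXT_NODE) {
                String text = kid.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add("TextView => " + text);
                }
            } else if (kid.getNodeType() == Node.ELEMENT_NODE) {
                if ("img".equals(kid.getNodeName())) {
                    out.add("ImageView => " + ((Element) kid).getAttribute("src"));
                } else {
                    collect(kid, out); // e.g. nested div; br has no children to add
                }
            }
        }
    }

    public static void main(String[] args) {
        String markup = "<div class=\"text\">First Text<br/>"
                + "<div><div>quote message</div></div>"
                + "Second Text<br/><img src=\"touchicon-180.png\"/><br/>Third Text</div>";
        walk(markup).forEach(System.out::println);
    }
}
```

The design point is simply that the walk visits every child node rather than only the child elements, so each piece of text is emitted at the position where it occurs instead of being concatenated.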
I am using jsoup to extract info from a web page. My code looks like this:
doc = Jsoup.connect(myurl).get();
Elements newsHeadlines = doc.select(".myclass");
If I do a System.out.println of newsHeadlines I obtain this:
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-minutoPartido="93'"></span>
<span class="blado"></span>
<span class="blahave">
Oh yeah!<br/></span>
</span>
</span>
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-health="97'"></span>
<span class="blado"></span>
<span class="blahave">
This is my world</span>
</span>
</span>
How can I save each block in an array? For example:
<span class="cmtComentario">
<span class="blaicon"></span>
<span class="blacoment"><span class="cmtHora" data-hora=""></span>
<span class="blathing" data-health="92'"></span>
<span class="blado"></span>
<span class="blahave">
This is my world</span>
</span>
</span>
Thank you so much
newsHeadlines is simply a List of Element, since Elements implements List.
So you can iterate over newsHeadlines the same way you iterate over any list.
for (Element element : newsHeadlines) {
    System.out.println(element.toString());
}
If that is not what you need (I did not test the code), you can try Element.children().
This again gives you Elements that you can iterate over.
You could also add a div tag for each comment and use some Java 8 syntactic sugar to collect the Element instances in a List:
Elements elements = Jsoup.parse(markup).getAllElements().select(".myclass");
List<Element> comments = elements.stream().collect(Collectors.<Element>toList());
for(Element comment : comments) {
System.out.println(comment.html());
}
For the sake of the test I used parse instead of the connect method.
It prints:
<span class="cmtComentario"> <span class="blaicon">1</span>.......
<span class="cmtComentario"> <span class="blaicon">2</span>........
Test markup:
String markup = "" +
"<div class=\"myclass\">\n" +
"<span class=\"cmtComentario\">\n" +
"<span class=\"blaicon\">1</span>\n" +
"<span class=\"blacoment\"><span class=\"cmtHora\" data-hora=\"\"></span>\n" +
"<span class=\"blathing\" data-minutoPartido=\"93'\"></span>\n" +
"<span class=\"blado\"></span>\n" +
"<span class=\"blahave\">\n" +
"Oh yeah!<br/></span>\n" +
"</span>\n" +
"</span>\n" +
"</div>" +
"<div class=\"myclass\">\n" +
"<span class=\"cmtComentario\">\n" +
"<span class=\"blaicon\">2</span>\n" +
"<span class=\"blacoment\"><span class=\"cmtHora\" data-hora=\"\"></span>\n" +
"<span class=\"blathing\" data-health=\"97'\"></span>\n" +
"<span class=\"blado\"></span>\n" +
"<span class=\"blahave\">\n" +
"This is my world</span>\n" +
"</span>\n" +
"</span>" +
"</div>";
Hope it helps!
I have a problem that is causing a lot of headaches. I need to dynamically create buttons/images which link to a JSF actionListener. Here is the code:
HTML:
<h:form>
<div class="carousel-container">
<div id="carousel">
<h:outputText value="#{courseBean.course}" escape="false"/>
</div>
</div>
</h:form>
courseBean.course returns the overridden toString, which produces the following:
@Override
public String toString() {
return "<div class=\"carousel-feature\"> "
+ "<h:commandLink id=\"" + courseID + "\" actionListener=\"#{courseBean.getCourseSelected}\">"
+ "<img class=\"carousel-image\" src=\"Images/testButton.jpg\"/>"
+ "<span style=\"display:block; position:absolute; top:20px; bottom:20px; left:0; right:0; "
+ "background:white; background:rgba(255, 255, 255, 0.25);\">" + courseName + "</span>"
+ "</h:commandLink> "
+ "<div class=\"carousel-caption\"> "
+ "</div>"
+ "</div>";
}//end method toString
The HTML is rendered fine and the image is displayed in the carousel; however, when it is clicked, the actionListener is not called, which is the issue here.
Edit: the actionListener only prints the courseID to the console, nothing major.
Thank you for taking your time :)
That approach is wrong. If you do "view source", you will see the literal <h:commandLink in the source of your page, because a string written out by h:outputText is never processed by the JSF life cycle. If you had an <h:commandLink in your XHTML page instead, the generated HTML source would contain an <a href....> element.
You had better rethink your original goal and ask a question about how to achieve it...