jsoup tag extraction problem

jsoup tag extraction problem - java

test: example test1:example1
Elements size = doc.select("div:contains(test:)");
how can i extract the value example and example1 from this html tag....using jsoup..

Since this HTML is not semantic enough for the final purpose you have (a <br> cannot have children and : is not HTML), you can't do much with a HTML parser like Jsoup. A HTML parser isn't intented to do the job of specific text extraction/tokenizing.
Best what you can do is to get the HTML content of the <div> using Jsoup and then extract that further using the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String[] parts = div.html().split("<br />"); // Jsoup transforms <br> to <br />.
for (String part : parts) {
int colon = part.indexOf(':');
if (colon > -1) {
System.out.println(part.substring(colon + 1).trim());
}
}
This results in
example
example1
If I was the HTML author, I would have used a definition list for this. E.g.
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus more easy parseable:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
Elements dts = document.select("#mydl dd");
for (Element dt : dts) {
System.out.println(dt.text());
}

Related

Convert String to arraylist using split

Is it possible to convert below String content to an arraylist using split, so that you get something like in point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am now using below code but it doesn`t bring the desired output. (mirror for instance is being added multiple times etc).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
Node previousSibling = link.previousSibling();
while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
previousSibling = previousSibling.previousSibling();
}
String identifier = previousSibling.toString();
if (identifier.contains("Mirror")) {
totalUrls.add("MIRROR(s):");
}
totalUrls.add(link.attr("href"));
}

Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
String html = yourHtml.replaceAll("<br/></a>", "</a>"); // get rid of bad HTML
String[] lines = html.split("<br/>");
for (String str : Arrays.asList(lines)) {
Jsoup.parse(str).text();
... // you can go further here, check if it has a link or not to display your semi-colon;
}
Now that the errant <br> tags are out of the links, you can split the string on the <br> tags that remain and print out your html result. It's not pretty, but it should work.

Parsing elements within div using Jsoup

Here is the html I'm trying to parse:
<div class="entry">
<img src="http://www.example.com/image.jpg" alt="Image Title">
<p>Here is some text</p>
<p>Here is some more text</p>
</div>
I want to get the text within the <p>'s into one ArrayList. I've tried using Jsoup for this.
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
for (Element desc : descs) {
String text = desc.getElementsByTag("p").first().text();
myArrayList.add(text);
}
But this doesn't work at all. I'm quite new to Jsoup but it seems it has its limitations. If I can get the text within <p> into one ArrayList using Jsoup, how can I accomplish that? If I must use some other means to parse the html, let me know.
I'm using a BufferedReader to read the html file one line at a time.

You could change your approach to the following:
Document doc = Jsoup.parse(line);
Elements pElems = doc.select("div.entry > p");
for (Element pElem : pElems) {
myArrayList.add(pElem.data());
}

Not sure why you are reading the html line by line. However if you want to read the whole html use the code below:
String line = "<div class=\"entry\">" +
"<img src=\"http://www.example.com/image.jpg\" alt=\"Image Title\">" +
"<p>Here is some text</p>" +
"<p>Here is some more text</p>" +
"</div>";
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
List<String> myArrayList = new ArrayList<String>();
for (Element desc : descs) {
Elements paragraphs = desc.getElementsByTag("p");
for (Element paragraph : paragraphs) {
myArrayList.add(paragraph.text());
}
}

In your for-loop:
Elements ps = desc.select("p");
(http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select(java.lang.String))

Try this:
Document doc = Jsoup.parse(line);
String text = doc.select("p").first().text();

Extract text from html: looking for a good sax-like parser or advices with a dom parser

I have an html document formatted this way:
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </p> followed by plain, <a>with a link at the end!</a>
</p>
I'd like to extract the text. With dom like parsers I could extract each paragraph , but the problem is inside: I'd have to extract text from inner tags too and have a resulting string with the same order, in the example above, first paragraph, I want to extract:
some plain text some emphatized text, some strong text
and for this purpose I guess a sax like parser would be better than a dom, given that I can't know inner tags number o sequence: a paragraph can have zero or more inner tags, of different type.

You can use dom parsers, get the text inside of the p tags (including child html elements) in to a string variable and use some other functionality to strip all the html tags out of the resulting string. This should leave you with all of the content between the p tags without any of the child element tags.
Example
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </p> followed by plain, <a>with a link at the end!</a>
</p>
Use some dom parser to extract the p tags to strings, you would then have a string like so:
String content = "some plain text <em>some emphatized text</em>, <strong> some strong text</strong>";
content = stripHtmlTags( content );
println( content ); // some plain text some emphatized text, some strong text

String extractedText=Html.fromHtml(Your HTML String).toString()
This gives u extracted text..
Hope this help you.

Add code to read CDATA by DOM pase
**childNode.getNodeType() == Node.CDATA_SECTION_NODE**
if Using XMLUtils modify like
public static String getNodeValue(Node node) {
node.normalize();
String response = node.getNodeValue();
if (response != null) {
return response;
} else {
NodeList list = node.getChildNodes();
int size = list == null ? 0 : list.getLength();
for (int j = 0; j < size; j++) {
Node childNode = list.item(j);
if (childNode.getNodeType() == Node.TEXT_NODE
|| childNode.getNodeType() == Node.CDATA_SECTION_NODE) {
response = childNode.getNodeValue();
return response;
}
}
}
return "";
}

How to parse HTML Heading

I have this HTML i am parsing.
<div id="articleHeader">
<h1 class="headline">Assassin's Creed Revelations: The Three Heroes</h1>
<h2 class="subheadline">Exclusive videos and art spanning three eras of assassins.</h2>
<h2 class="publish-date"><script>showUSloc=(checkLocale('uk')||checkLocale('au'));document.writeln(showUSloc ? '<strong>US, </strong>' : '');</script>
<span class="us_details">September 22, 2011</span>
What i want to do it parse the "headline" subheadline and publish date all to seperate Strings

Just use the proper CSS selectors to grab them.
Document document = Jsoup.connect(url).get();
String headline = document.select("#articleHeader .headline").text();
String subheadline = document.select("#articleHeader .subheadline").text();
String us_details = document.select("#articleHeader .us_details").text();
// ...
Or a tad more efficient:
Document document = Jsoup.connect(url).get();
Element articleHeader = document.select("#articleHeader").first();
String headline = articleHeader.select(".headline").text();
String subheadline = articleHeader.select(".subheadline").text();
String us_details = articleHeader.select(".us_details").text();
// ...

Android has a SAX parser built into it . You can use other standard XML parsers as well.
But I think if ur HTML is simple enough u could use RegEx to extract string.

How to get text & Other tags between specific tags using Jericho HTML parser?

I have a HTML file which contains a specific tag, e.g. <TABLE cellspacing=0> and the end tag is </TABLE>. Now I want to get everything between those tags. I am using Jericho HTML parser in Java to parse the HTML. Is it possible to get the text & other tags between specific tags in Jericho parser?
For example:
<TABLE cellspacing=0>
<tr><td>HELLO</td>
<td>How are you</td></tr>
</TABLE>
Answer:
<tr><td>HELLO</td>
<td>How are you</td></tr>

Once you have found the Element of your table, all you have to do is call getContent().toString(). Here's a quick example using your sample HTML:
Source source = new Source("<TABLE cellspacing=0>\n" +
" <tr><td>HELLO</td> \n" +
" <td>How are you</td></tr>\n" +
"</TABLE>");
Element table = source.getFirstElement();
String tableContent = table.getContent().toString();
System.out.println(tableContent);
Output:
<tr><td>HELLO</td>
<td>How are you</td></tr>

Aby, I walk down the code for all elements and show on screen. Maybe help you.
List<Element> elementListTd = source.getAllElements(HTMLElementName.TD);
//Scroll through the list of elements "td" page
for (Element element : elementListTd) {
if (element.getAttributes() != null) {
String td = element.getAllElements().toString();
String tag = "td";
System.out.println("TD: " + td);
System.out.println(element.getContent());
String conteudoAtributo = element.getTextExtractor().toString();
System.out.println(conteudoAtributo);
if (td.contains(palavraCompara)) {
tabela.add(conteudoAtributo);
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

jsoup tag extraction problem - java

test: example test1:example1 Elements size = doc.select("div:contains(test:)"); how can i extract the value example and example1 from this html tag....using jsoup..

Related

Convert String to arraylist using split

Parsing elements within div using Jsoup

Extract text from html: looking for a good sax-like parser or advices with a dom parser

How to parse HTML Heading

How to get text & Other tags between specific tags using Jericho HTML parser?

Categories

Resources