How to parse HTML Heading

I have this HTML that I am parsing:
<div id="articleHeader">
<h1 class="headline">Assassin's Creed Revelations: The Three Heroes</h1>
<h2 class="subheadline">Exclusive videos and art spanning three eras of assassins.</h2>
<h2 class="publish-date"><script>showUSloc=(checkLocale('uk')||checkLocale('au'));document.writeln(showUSloc ? '<strong>US, </strong>' : '');</script>
<span class="us_details">September 22, 2011</span>
What I want to do is parse the headline, subheadline, and publish date into separate Strings.

Just use the proper CSS selectors to grab them.
Document document = Jsoup.connect(url).get();
String headline = document.select("#articleHeader .headline").text();
String subheadline = document.select("#articleHeader .subheadline").text();
String us_details = document.select("#articleHeader .us_details").text();
// ...
Or a tad more efficient:
Document document = Jsoup.connect(url).get();
Element articleHeader = document.select("#articleHeader").first();
String headline = articleHeader.select(".headline").text();
String subheadline = articleHeader.select(".subheadline").text();
String us_details = articleHeader.select(".us_details").text();
// ...

Android has a SAX parser built in, and you can use other standard XML parsers as well.
But if your HTML is simple enough, you could use a regex to extract the strings.
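A minimal sketch of the regex route, using only java.util.regex. The class and method names (HeadingRegex, textOfClass) are made up for illustration, and the pattern assumes the target element contains no nested tags and that its class attribute holds exactly one value. It is only meant for markup as simple as the snippet in the question, with the script-laden publish-date line reduced to the span.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingRegex {
    // Simplified copy of the question's markup (assumption: the <script> part is left out).
    static final String SAMPLE =
        "<div id=\"articleHeader\">"
        + "<h1 class=\"headline\">Assassin's Creed Revelations: The Three Heroes</h1>"
        + "<h2 class=\"subheadline\">Exclusive videos and art spanning three eras of assassins.</h2>"
        + "<h2 class=\"publish-date\"><span class=\"us_details\">September 22, 2011</span></h2>"
        + "</div>";

    // Returns the text of the first element whose class attribute equals cssClass,
    // or null if there is no such element or it contains nested tags.
    static String textOfClass(String html, String cssClass) {
        Pattern p = Pattern.compile(
            "<(\\w+)[^>]*\\bclass=\"" + Pattern.quote(cssClass) + "\"[^>]*>([^<]*)</\\1>");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(2).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(textOfClass(SAMPLE, "headline"));
        System.out.println(textOfClass(SAMPLE, "subheadline"));
        System.out.println(textOfClass(SAMPLE, "us_details"));
    }
}
```

As soon as the markup gets nested or irregular, a real parser such as Jsoup is the safer choice.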

Related

Jsoup doesn't parse XML correctly, missing tags

I want to parse an XML text, but Jsoup seems to delete the <col> tags.
This is what happens:
Original:
<rowh> <col>DTC Code</col> <col>Description</col> </rowh>
Result:
<rowh> DTC Code Description
</rowh>
This is the code I am using to see the content.
Document jDoc = Jsoup.parse(contentXML);
Log.d("Original", contentXML);
Log.d("Document", jDoc.outerHtml());
I need to count how many <col> tags are inside each <rowh> tag but it always returns 0. I am using Jsoup version 1.11.2

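Maybe this helps. Jsoup.parse(String) uses the HTML parser, which drops <col> tags that appear outside a table; passing Parser.xmlParser() keeps the tags as written:
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><rowh><col>DTC Code</col><col>Description</col></rowh>";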
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Elements e = doc.select("rowh");
String text = e.text();
Log.i("TAG1", text);
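For the actual goal in the question (counting the <col> tags inside each <rowh>), here is a rough sketch that uses the JDK's built-in DOM parser instead of Jsoup, so it runs without extra libraries. The class name ColCounter and the <table> wrapper element in the sample are assumptions for illustration; the input must be well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ColCounter {
    // Returns the number of <col> elements under each <rowh>, in document order.
    static List<Integer> colsPerRow(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList rows = doc.getElementsByTagName("rowh");
            List<Integer> counts = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                counts.add(row.getElementsByTagName("col").getLength());
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: a <table> root wrapping two header rows.
        String xml = "<table>"
            + "<rowh><col>DTC Code</col><col>Description</col></rowh>"
            + "<rowh><col>Only one</col></rowh>"
            + "</table>";
        System.out.println(colsPerRow(xml)); // [2, 1]
    }
}
```

With Jsoup and the XML parser, selecting "col" inside each "rowh" element and taking the size of the result should give the same numbers.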

Convert String to arraylist using split

Is it possible to convert below String content to an arraylist using split, so that you get something like in point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am currently using the code below, but it doesn't produce the desired output (for instance, the mirror label is added multiple times).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
Node previousSibling = link.previousSibling();
while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
previousSibling = previousSibling.previousSibling();
}
String identifier = previousSibling.toString();
if (identifier.contains("Mirror")) {
totalUrls.add("MIRROR(s):");
}
totalUrls.add(link.attr("href"));
}

Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
// Move the stray <br/> out of each link; \\s* covers a line break before </a>.
String html = yourHtml.replaceAll("<br/>\\s*</a>", "</a>");
List<String> totalUrls = new ArrayList<>();
for (String str : html.split("<br/>")) {
    String text = Jsoup.parse(str).text(); // strips the remaining tags
    if (!text.isEmpty()) {
        // You can go further here, e.g. check whether the piece is a link or a label such as "Mirror:".
        totalUrls.add(text);
    }
}
Now that the errant <br> tags are out of the links, you can split the string on the <br> tags that remain and collect the text of each piece. It's not pretty, but it should work.
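For reference, a self-contained sketch of the same split-on-<br> idea without Jsoup. Here a crude regex that deletes anything between angle brackets stands in for Jsoup.parse(...).text(); that is an assumption that only holds for flat markup like the sample. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class BrSplitter {
    // Splits on every <br>, <br/> or <br />, strips leftover tags, and drops empty pieces.
    static List<String> toLines(String html) {
        List<String> lines = new ArrayList<>();
        for (String piece : html.split("<br\\s*/?>")) {
            String text = piece.replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html =
            "<a class=\"postlink\" href=\"http://test.site/i7xt1.htm\">http://test.site/i7xt1.htm<br/>\n"
            + "</a>\n"
            + "<br/>Mirror:<br/>\n"
            + "<a class=\"postlink\" href=\"http://information.com/qokp076wulpw\">http://information.com/qokp076wulpw<br/>\n"
            + "</a>";
        for (String line : toLines(html)) {
            System.out.println(line);
        }
    }
}
```

Because empty pieces are skipped, the stray <br/> inside each link does not need to be repaired first in this variant.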

How to access the subclass using jsoup

I want to access this webpage: https://www.google.com/trends/explore#q=ice%20cream and extract the data in the center line graph. The HTML is as follows (I have pasted only the part that I use):
<div class="center-col">
<div class="comparison-summary-title-line">...</div>
...
<div id="reportContent" class="report-content">
<!-- This tag handles the report titles component -->
...
<div id="report">
<div id="reportMain">
<div class="timeSection">
<div class = "primaryBand timeBand">...</div>
...
<div aria-lable = "one-chart" style = "position: absolute; ...">
<svg ....>
...
<script type="text/javascript">
var chartData = {...}
The data I need is stored in the script element (last line). My idea is to get the element with class "report-content" first, and then select the script inside it. My code is as follows:
String html = "https://www.google.com/trends/explore#q=ice%20cream";
Document doc = Jsoup.connect(html).get();
Elements center = doc.getElementsByClass("center-col");
Elements report = doc.getElementsByClass("report-content");
System.out.println(center);
System.out.println(report);
When I print the "center-col" elements, I get the content of all nested elements except "report-content", and when I print "report-content", the result is only:
<div id="reportContent" Class="report-content"></div>
I also tried this:
Element report = doc.select("div.report-content").first();
but it still does not work. How can I get the data in the script? I appreciate your help!

Try this url instead:
https://www.google.com/trends/trendsReport?hl=en&q=${keywords}&tz=${timezone}&content=1
where
${keywords} is a URL-encoded, space-separated list of keywords
${timezone} is a URL-encoded timezone in the Etc/GMT* form
SAMPLE CODE
String myKeywords = "ice cream";
String myTimezone = "Etc/GMT+2";
String url = "https://www.google.com/trends/trendsReport?hl=en&q=" + URLEncoder.encode(myKeywords, "UTF-8") + "&tz=" + URLEncoder.encode(myTimezone, "UTF-8") + "&content=1";
Document doc = Jsoup.connect(url).timeout(10000).get();
Element scriptElement = doc.select("div#TIMESERIES_GRAPH_0-time-chart + script").first();
if (scriptElement==null) {
throw new RuntimeException("Unable to locate trends data.");
}
String jsCode = scriptElement.html();
// parse jsCode to extract chartData...
References:
How to extract the text of a <script> element with Jsoup?
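The last step (pulling chartData out of the script text) is left open above. One rough way, assuming the script really contains a declaration of the form var chartData = {...} as shown in the question, is to find the declaration and scan for the matching closing brace. The class name, the method name and the sample data below are hypothetical, and the scan is naive: braces inside string literals would confuse it.

```java
public class ChartDataExtractor {
    // Returns the {...} literal assigned to "var <varName>", or null if it cannot be found.
    static String extractObjectLiteral(String jsCode, String varName) {
        int decl = jsCode.indexOf("var " + varName);
        if (decl < 0) {
            return null;
        }
        int start = jsCode.indexOf('{', decl);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        for (int i = start; i < jsCode.length(); i++) {
            char c = jsCode.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return jsCode.substring(start, i + 1);
            }
        }
        return null; // braces never balanced
    }

    public static void main(String[] args) {
        // Made-up script content; the real payload will look different.
        String jsCode = "var chartData = {\"rows\":[{\"v\":1},{\"v\":2}]}; draw(chartData);";
        System.out.println(extractObjectLiteral(jsCode, "chartData"));
    }
}
```

The returned string can then be handed to a JSON library, provided the literal happens to be valid JSON.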

Try getting the element by its id instead; you should then get the complete tag.

Sanitize HTML string

I have an HTML string like:
<p dir="ltr"><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>bold</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>all</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u>in</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u><b><i><u> </u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></u></i><i><u><b><i><u><b><i><u><b><i><u><b><i><u>one</u></i></b></u></i></b></u></i></b></u></i></b></u></i></b></p>
I want to sanitize the HTML so that it looks like <b><i><u> bold all in one </b></i></u>
I tried this method: webText = webText.replaceAll("(</?(?:b|i|u)>)\\1+", "$1").replaceAll("</(b|i|u)><\\1>", "");
But it is of no use; the HTML remains cluttered. What should I do to fix it? Is there any other regex or JSON way?

Regex may help here, but in general regular expressions do not serve well as an HTML parser once things get complex. Jsoup is a great HTML library, and I can really recommend it.
Unfortunately your HTML is still valid HTML, so the solution is tricky.
Best to start with the Jsoup documentation, especially the part on its selector syntax.
Here's something to start with:
final String html = ... // your html from above
// Parse the html string into a document
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
/*
* Select all elements which
*
* (a) have text of at least two characters
* (b) have no child elements of their own
*
* Iterate over those found and print them.
*/
for( Element element : doc.select("*:matches(^..+?$):not(:has(*))") )
{
System.out.println(element);
}
Result:
<u>bold</u>
<u>all</u>
<u>in</u>
<u>one</u>
If you need literally <b><i><u> bold all in one </b></i></u>:
final String html = ... // your html from above
// As above
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// All text of the document
String text = doc.text();
// Create an element and its children
Element element = new Element(Tag.valueOf("b"), "");
element.appendElement("i").appendElement("u").text(text);
System.out.println(element);
Result:
<b><i><u>bold all in one</u></i></b>

You could try the method below to remove unwanted HTML tags:
public String stripHtml(String html)
{
return Html.fromHtml(html).toString();
}
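Html.fromHtml is only available on Android. Off Android, a rough stand-in that strips the tags with a regex and then re-wraps the text once in <b><i><u> (the shape asked for in the question) could look like the sketch below. TagFlattener and flatten are invented names, and the regex approach assumes markup without comments, scripts, or a > inside attribute values; HTML entities are not decoded.

```java
public class TagFlattener {
    // Drops every tag, collapses whitespace, and wraps the remaining text exactly once.
    static String flatten(String html) {
        String text = html.replaceAll("<[^>]+>", "") // remove all tags
                          .replaceAll("\\s+", " ")   // collapse runs of whitespace
                          .trim();
        return "<b><i><u>" + text + "</u></i></b>";
    }

    public static void main(String[] args) {
        // Shortened version of the question's input.
        String html = "<p dir=\"ltr\"><b><i><u><b><i><u>bold</u></i></b></u></i>"
            + "<i><u> </u></i><i><u>all</u></i></b></p>";
        System.out.println(flatten(html)); // <b><i><u>bold all</u></i></b>
    }
}
```

This throws away any formatting differences between words, so it only fits the case where the whole text should share one set of tags.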

jsoup tag extraction problem

<div style="height:240px;"><br>test: example<br>test1:example1</div>
Elements size = doc.select("div:contains(test:)");
How can I extract the values example and example1 from this HTML using Jsoup?

Since this HTML is not semantic enough for the final purpose you have (a <br> cannot have children and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction/tokenizing.
The best you can do is to get the HTML content of the <div> using Jsoup and then extract it further using the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String[] parts = div.html().split("<br\\s*/?>"); // Matches <br>, <br/> and <br />; the serialized form depends on the Jsoup version.
for (String part : parts) {
int colon = part.indexOf(':');
if (colon > -1) {
System.out.println(part.substring(colon + 1).trim());
}
}
This results in
example
example1
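The java.util.Scanner route mentioned above could look roughly like this. The sketch works on the inner HTML of the <div> as a plain string (so it runs without Jsoup); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BrScanner {
    // Tokenizes on <br> variants and keeps whatever follows the first colon of each token.
    static List<String> valuesAfterColon(String divHtml) {
        List<String> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(divHtml)) {
            scanner.useDelimiter("<br\\s*/?>");
            while (scanner.hasNext()) {
                String part = scanner.next();
                int colon = part.indexOf(':');
                if (colon > -1) {
                    values.add(part.substring(colon + 1).trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Inner HTML of the <div> from the question.
        System.out.println(valuesAfterColon("<br>test: example<br>test1:example1"));
    }
}
```

Tokens without a colon are skipped, which also takes care of any empty token the scanner may produce between adjacent <br> tags.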
If I were the HTML author, I would have used a definition list for this, e.g.:
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus easier to parse:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
Elements dds = document.select("#mydl dd");
for (Element dd : dds) {
    System.out.println(dd.text());
}
