I have the following HTML:
<html>
<body>
...
<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
<tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
<tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
<tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
<tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
<h2> Blah Blah 2</h2>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
<tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
<tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
<tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
<tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
I would like to extract the 2nd DIV following an h2 tag that contains a given text.
As you may notice in the first and second div the p tags are not in the same position.
To extract the DIV following the first h2, the below formula would work:
h2:contains(Blah 1) + p + div +div
But to extract the 2nd, replacing "Blah 1" with "Blah 2" would not work as the ""p"" tag is located elsewhere , so a static selector would be :
h2:contains(Blah 2) + div + p +div
And what I need is a single selector formula where changing the text would make it work, wherever the p blocks may be
I tried several ways :
like ... The selector nth-of-type would not work either, because I know the position of the DIV only wrt the h2 that is not father of DIV but a preceding sibling ...
Help please
I have two ideas how to achieve this.
The first one is to remove every <p> and then you will only have to select "h2:contains(" + text + ")+div+div". Be careful and use it only when you're sure your <div> doesn't contain any <p>. Otherwise it will lack some content.
public void execute1(String html) {
Document doc = Jsoup.parse(html);
// first approach: remove every <p> to simplify document
Elements paragraphs = doc.select("p");
for (Element paragraph : paragraphs) {
paragraph.remove();
}
// then one selector will return what you want in both cases
System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText(Document doc, String text) {
return doc.select("h2:contains(" + text + ")+div+div").first();
}
The second approach would be to iterate over siblings of "h2:contains(" + text+ ")" and "manually" find second <div> ignoring anything else. It's better because it doesn't destroy the original document and it will skip any number of <p> elements.
public void execute2(String html) {
Document doc = Jsoup.parse(html);
System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
int counter = 2;
// find h2 with given text
Element h2 = doc.select("h2:contains(" + text + ")").first();
// select every sibling after this h2 element
Elements siblings = h2.nextElementSiblings();
// loop over them
for (Element sibling : siblings) {
// skip everything that's not a div
if (sibling.tagName().equals("div")) {
// count how many divs left to skip
counter--;
if (counter == 0) {
// return when found nth div
return sibling;
}
}
}
return null;
}
I had also third idea to use "h2:contains(" + text + ")~div:nth-of-type(2)". It works for the first case, but fails for the second one probably because there's a <p> between the divs.
A simple way to do this is by using the comma (,) query operator which does an OR between the selectors. So you can combine the two variations of where the P tag falls.
h2:contains(Blah 2) + div ~ div, h2:contains(Blah 2) ~ div + div
Here's an example on the try.jsoup playground.
Related
I am writing a code for detecting matching tags patterns in web page. Here is the example.
<body>
<table width="200" border="1">
<tr>
<td>Name</td>
<td>Place</td>
<td>Animal</td>
</tr>
<p>hello World</p>
<tr>
<td>Jack</td>
<td>New york</td>
<td>Lion</td>
</tr>
<b>Code Works</b>
<tr>
<td>George</td>
<td>Sydney</td>
<td>Tiger</td>
</tr>
<tr>
<td>Tina</td>
<td>Delhi</td>
<td>Cat</td>
</tr>
</table>
<table>
<tbody>
<tr>
<td> </td>
<td>
1
2
3
4
5
</td>
</tr>
</tbody>
</table>
</body>
For above Tag pattern, I need to find the tags which are occurring repeatedly. And to discard those that are not in the pattern like tags b and p. For first table tags tr and td are occurring . For 2nd table 'a' tag is repeated.
This is what I have done till now:
Parsed to DOM tree using Jsoup.
Then used node visitor class to traverse the tree. Using head and tail methods, I can enter and exit tags.
But I am confused about how to proceed further.
Note: The tags pattern are not fixed.Tag pattern will vary depending on web page structure. Any kind of help will be appreciated.
But I am confused about how to proceed further.
Your confusion is propagating and reach us too. However, I'll try to give you an hint.
You can count the tags in your HTML code. If a tag count reaches a certain threshold, you can consider this tag as "repeatedly occuring".
// Load document
String html = ...
Document doc = Jsoup.parse(html);
// Count tags
String tagsSelector = "*";
Map<Element, Integer> tagsCountByType = new Hashmap<>();
for(Element e : doc.select("*")) {
Integer count = tagsCountByType.get(e);
if (count == null) {
tagsCountByType.put(e, new Integer(1));
} else {
tagsCountByType.put(e, new Integer(count.intValue() + 1));
}
}
// Find tag with a count greater than a given threshold
// ...
I didn't test the code. Just take it as an idea, some sort of inspiration.
Another idea, you can narrow down the tagsSelector. For example:
// All elements (tags) under any table directly under body.
String tagsSelector = "body > table *";
I have the following HTML code:
<tr class="odd">
<td class="first name">
3i Group PLC
</td>
<td class="value">457.80</td>
<td class="change up">+10.90</td> <td class="delta up">+2.44%</td> <td class="value">1,414,023</td>
<td class="datetime">11:35:08</td>
For which I need to get the data
457.80
(ie. The value attribute), and I have this Java code currently:
String FTSE = "http://www.bloomberg.com/quote/UKX:IND/members";
doc = Jsoup.connect(FTSE).get();
Elements links = doc.select("a[href='/quote/III:LN']");
for (Element link : links) {
// get the value from href attribute
System.out.println("\nlink : " + link.attr("value"));
System.out.println("text : " + link.text());
When I run my program it terminates having output nothing. How do I make it so that it outputs the value, which in this case, is '457.80'?
links will contain the <a href...> element. What you are trying to retrieve is the text of a completely different element, i.e. a <td> tag which has the class value.
My guess is that you have multiple <tr> elements and you only want the one which contains the link you've selected. In which case you will need the following code:
String FTSE = "http://www.bloomberg.com/quote/UKX:IND/members";
doc = Jsoup.connect(FTSE).get();
Elements trs = doc.select("tr:has(a[href='/quote/III:LN'])");
Elements values = trs.select("td.value");
link = values.get(0);
System.out.println("text : " + link.text());
Or something similar...
i have been working on this for a while and just cant seem to work out how to get the correct table that corresponds to the header it has. the tables are split up into sections which i can retrieve however inside the section is a header with the title of the table. i need to find the section with the header that matches a string and then pull the data from it. I'm fine with getting the data out of the table its just getting the correct section for the table
HTML extract of the section:
<section class="blueTab">
<header><h2>Energy</h2></header> //<----- THE HEADER I NEED TO MATCH TO
<table class="infoTable">
<tr><th>Model</th><th>0-60 mph</th><th>Top Speed</th><th>BHP</th><th></th></tr>
<tr>
<td><p>1.4i 16V Energy 5d</p></td>
<td><p>12.8 secs</p></td>
<td><p>111 mph</p></td>
<td><p>88 bhp</p></td>
</tr>
<tr class="alternate">
<td><p>1.6i 16V Energy 5d</p></td>
<td><p>11.5 secs</p></td>
<td><p>115 mph</p></td>
<td><p>103 bhp</p></td>
</tr>
<tr>
<td><p>1.8i VVT Energy 5d Auto</p></td>
<td><p>10.7 secs</p></td>
<td><p>117 mph</p></td>
<td><p>138 bhp</p></td>
</tr>
<tr class="alternate">
<td><p>1.3 CDTi 16V Energy 5d</p></td>
<td><p>12.8 secs</p></td>
<td><p>107 mph</p></td>
<td><p>88 bhp</p></td>
</tr>
</table>
<div class="fr topMargin">
<div id="ctl00_contentHolder_topFullWidthContent" class="modelEnquiry">
<div id="ctl00_contentHolder_topFullWidthContent" class="buttonLinks">
</div>
<div class="cb"><!----></div>
</div>
</div>
<div class="cb"><!----></div>
</section>
Im guessing i will have to use doc.getElementsByClass("blueTab") in a for loop and for each element see if h2 equals the string im looking for, i am just not sure how to implement this
This should solve your problem
Document doc = Jsoup.parse(input, "UTF-8");
Elements elem = doc.select(".blueTab header h2");
for (Iterator<Element> iterator = elem.iterator(); iterator.hasNext();)
{
Element element = iterator.next();
if (element.text().equals("Energy")) // your comparison text
{
Element tableElement = element.parent().nextElementSibling(); //Your got the expected table Element as per your requirement
}
}
<div class="Class-feedbacks">
<div class="grading class2">
<div itemtype="http://xx.edu/grading" itemscope="" itemprop="studentgrading">
<div class="rating">
<img class="passportphoto" width="1500" height="20" src="http://greg.png" >
<meta content="4.0" itemprop="gradingvalue">
</div>
</div>
<meta content="2012-09-08" itemprop="gradePublished">
<span class="date smaller">9/8/2012</span>
</div>
<p class="review_comment feedback" itemprop="description">Greg is one the smart person in his batch</p>
</div>
I want to print:
date: 2012-09-08
Feedback : Greg is one the smart person in his batch
I was able to use this as suggested at - Jsoup getting a hyperlink from li
The doc.select(div div divn li ui ...) and get the class feedback.
How should I use the select command to get the values of the above values?
To get the value of an attribute, use the attr method. E.g.
Elements elements = doc.select("meta");
for(Element e: elements)
System.out.println(e.attr("content"));
In one single select ...have you tried the comma Combinator "," ?
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
Elements elmts = doc.select("div.Class-feedbacks meta, p")
Element elmtDate = elmts.get(0);
System.out.println("date: " + elmtDate.attr("content"));
Element elmtParag = elmts.get(1);
System.out.println("Feedback: " + elmtParag.text());
You should get back 2 elements in your list the date and the feedback after the select.
This is an old question and I might be late, but if anyone else wants to know how to do this easily, the below code will be helpful.
Document doc = Jsoup.parse(html);
// We select the meta tag whose itemprop property has value 'gradePublished'
String date = doc.select("meta[itemprop=gradePublished]").attr("content");
System.out.println("date: "+date);
// Now we select the text inside the p tag with itemprop value 'description'
String feedback = doc.select("p[itemprop=description]").text();
System.out.println("Feedback: "+feedback);
i need to extract an image tag using jsoup from this html
<div class="picture">
<img src="http://asdasd/aacb.jpgs" title="picture" alt="picture" />
</div>
i need to extract the src of this img tag ...
i am using this code i am getting null value
Element masthead2 = doc.select("div.picture").first();
String linkText = masthead2.outerHtml();
Document doc1 = Jsoup.parse(linkText);
Element masthead3 = doc1.select("img[src]").first();
String linkText1 = masthead3.html();
Here's an example to get the image source attribute:
public static void main(String... args) {
Document doc = Jsoup.parse("<div class=\"picture\"><img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /></div>");
Element img = doc.select("div.picture img").first();
String imgSrc = img.attr("src");
System.out.println("Img source: " + imgSrc);
}
The div.picture img selector finds the image element under the div.
The main extract methods on an element are:
attr(name), which gets the value of an element's attribute,
text(), which gets the text content of an element (e.g. in <p>Hello</p>, text() is "Hello"),
html(), which gets an element's inner HTML (<div><img></div> html() = <img>), and
outerHtml(), which gets an elements full HTML (<div><img></div> html() = <div><img></div>)
You don't need to reparse the HTML like in your current example, either select the correct element in the first place using a more specific selector, or hit the element.select(string) method to winnow down.
<tr> <td class="blackNoLine" nowrap="nowrap" valign="top" width="25" align="left"><b>CAST: </b></td> <td class="blackNoLine" valign="top" width="416">Jay, Shazahn Padamsee </td> </tr>
You can use:
Document doc = Jsoup.parse(...);
Elements els = doc.select("td[class=blackNoLine]");
Element el= els.get(1);
String castName = el.text();
With the following code I can extract the image correctly:
Document doc = Jsoup.parse("<div class=\"picture\"> <img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /> </div>");
Element elem = doc.select("div.picture img").first();
System.out.println("elem: " + elem.attr("src"));
I'm using jsoup release 1.2.2, the latest one.
Maybe you're trying to print the inner html of an empty tag like img.
From the documentation: "html() - Retrieves the element's inner HTML".
For the second portion of html you can use:
Document doc2 = Jsoup.parse("<tr> <td class=\"blackNoLine\" nowrap=\"nowrap\" valign=\"top\" width=\"25\" align=\"left\"><b>CAST: </b></td> <td class=\"blackNoLine\" valign=\"top\" width=\"416\">Jay, Shazahn Padamsee </td> </tr>");
Elements trElems = doc2.select("tr");
if (trElems != null) {
for (Element element : trElems) {
Element secondTd = element.select("td").get(1);
System.out.println("name: " + secondTd.text());
}
}
which prints "Jay, Shazahn Padamsee".