Jsoup image tag extraction


I need to extract an image tag using jsoup from this HTML:
<div class="picture">
<img src="http://asdasd/aacb.jpgs" title="picture" alt="picture" />
</div>
I need to extract the src of this img tag.
I am using the code below, but I am getting a null value:
Element masthead2 = doc.select("div.picture").first();
String linkText = masthead2.outerHtml();
Document doc1 = Jsoup.parse(linkText);
Element masthead3 = doc1.select("img[src]").first();
String linkText1 = masthead3.html();

Here's an example to get the image source attribute:
public static void main(String... args) {
Document doc = Jsoup.parse("<div class=\"picture\"><img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /></div>");
Element img = doc.select("div.picture img").first();
String imgSrc = img.attr("src");
System.out.println("Img source: " + imgSrc);
}
The div.picture img selector finds the image element under the div.
The main extract methods on an element are:
attr(name), which gets the value of an element's attribute,
text(), which gets the text content of an element (e.g. in <p>Hello</p>, text() is "Hello"),
html(), which gets an element's inner HTML (for <div><img></div>, html() is <img>), and
outerHtml(), which gets an element's full HTML (for <div><img></div>, outerHtml() is <div><img></div>).
You don't need to reparse the HTML as in your current example. Either select the correct element in the first place with a more specific selector, or call element.select(String) on the element you already have to narrow the match down.
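To make the point above concrete, here is a minimal sketch that selects the div once and then narrows down to the image with a second select, without re-parsing. The class and method names (ExtractMethodsDemo, findImage) are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractMethodsDemo {
    // Same markup as in the question.
    static final String HTML =
        "<div class=\"picture\"><img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /></div>";

    // Select the div first, then narrow down with a second select on that element.
    static Element findImage(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div.picture").first();
        return div == null ? null : div.select("img[src]").first();
    }

    public static void main(String[] args) {
        Element img = findImage(HTML);
        // attr() reads the attribute value, which is what the question is after.
        System.out.println("src:       " + img.attr("src"));
        // html() is the inner HTML; an img has no children, so this is empty.
        System.out.println("html:      [" + img.html() + "]");
        // outerHtml() includes the img tag itself.
        System.out.println("outerHtml: " + img.outerHtml());
    }
}
```

Calling html() on the img is the likely reason the original code produced no usable value: the element was found, but it has no inner HTML to return.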

<tr> <td class="blackNoLine" nowrap="nowrap" valign="top" width="25" align="left"><b>CAST: </b></td> <td class="blackNoLine" valign="top" width="416">Jay, Shazahn Padamsee </td> </tr>
You can use:
Document doc = Jsoup.parse(...);
Elements els = doc.select("td[class=blackNoLine]");
Element el= els.get(1);
String castName = el.text();

With the following code I can extract the image correctly:
Document doc = Jsoup.parse("<div class=\"picture\"> <img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /> </div>");
Element elem = doc.select("div.picture img").first();
System.out.println("elem: " + elem.attr("src"));
I'm using jsoup release 1.2.2, the latest one.
Maybe you're trying to print the inner HTML of an empty tag like img, which has no content to return.
From the documentation: "html() - Retrieves the element's inner HTML".
For the second portion of HTML you can use:
Document doc2 = Jsoup.parse("<tr> <td class=\"blackNoLine\" nowrap=\"nowrap\" valign=\"top\" width=\"25\" align=\"left\"><b>CAST: </b></td> <td class=\"blackNoLine\" valign=\"top\" width=\"416\">Jay, Shazahn Padamsee </td> </tr>");
Elements trElems = doc2.select("tr");
if (trElems != null) {
for (Element element : trElems) {
Element secondTd = element.select("td").get(1);
System.out.println("name: " + secondTd.text());
}
}
which prints "Jay, Shazahn Padamsee".
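A variation on the snippet above, offered as a sketch: it finds the label cell by its text and then takes the next sibling cell, so it does not depend on the cell's index. One assumption to flag: as far as I know, jsoup releases that follow HTML5 parsing rules drop tr and td tags that appear outside a table, so the fragment is wrapped in <table> here. The names CastCellDemo and castNames are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CastCellDemo {
    // The row from the question, unchanged.
    static final String ROW =
        "<tr> <td class=\"blackNoLine\" nowrap=\"nowrap\" valign=\"top\" width=\"25\" align=\"left\"><b>CAST: </b></td>"
        + " <td class=\"blackNoLine\" valign=\"top\" width=\"416\">Jay, Shazahn Padamsee </td> </tr>";

    static String castNames(String rowHtml) {
        // Assumption: wrap the bare row in a table so the parser keeps the tr/td elements.
        Document doc = Jsoup.parse("<table>" + rowHtml + "</table>");
        // Find the cell whose bold label contains "CAST".
        Element label = doc.select("td.blackNoLine:has(b:contains(CAST))").first();
        if (label == null) {
            return null;
        }
        // The value lives in the cell right after the label cell.
        Element value = label.nextElementSibling();
        return value == null ? null : value.text();
    }

    public static void main(String[] args) {
        System.out.println("name: " + castNames(ROW));
    }
}
```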

Related

Jsoup selectors: 2nd div after h2

I have the following HTML:
<html>
<body>
...
<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
<tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
<tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
<tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
<tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
<h2> Blah Blah 2</h2>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
<tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
<tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
<tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
<tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
I would like to extract the 2nd div following an h2 tag that contains a given text.
As you may notice, the p tags are not in the same position after the first and the second h2.
To extract the div following the first h2, the selector below works:
h2:contains(Blah 1) + p + div + div
But to extract the second one, replacing "Blah 1" with "Blah 2" does not work, because the p tag is located elsewhere, so a static selector would be:
h2:contains(Blah 2) + div + p + div
What I need is a single selector where changing only the text makes it work, wherever the p blocks may be.
I tried several ways:
like ... The nth-of-type selector does not work either, because I know the position of the div only with respect to the h2, which is not the parent of the div but a preceding sibling ...
Help, please.

I have two ideas how to achieve this.
The first one is to remove every <p>, so that you only have to select "h2:contains(" + text + ")+div+div". Be careful: use it only when you're sure the <div> you want doesn't contain any <p>, otherwise the result will be missing that content.
public void execute1(String html) {
Document doc = Jsoup.parse(html);
// first approach: remove every <p> to simplify document
Elements paragraphs = doc.select("p");
for (Element paragraph : paragraphs) {
paragraph.remove();
}
// then one selector will return what you want in both cases
System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText(Document doc, String text) {
return doc.select("h2:contains(" + text + ")+div+div").first();
}
The second approach is to iterate over the siblings of "h2:contains(" + text + ")" and "manually" find the second <div>, ignoring anything else. It's better because it doesn't destroy the original document and it skips any number of <p> elements.
public void execute2(String html) {
Document doc = Jsoup.parse(html);
System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
int counter = 2;
// find h2 with given text
Element h2 = doc.select("h2:contains(" + text + ")").first();
// select every sibling after this h2 element
Elements siblings = h2.nextElementSiblings();
// loop over them
for (Element sibling : siblings) {
// skip everything that's not a div
if (sibling.tagName().equals("div")) {
// count how many divs left to skip
counter--;
if (counter == 0) {
// return when found nth div
return sibling;
}
}
}
return null;
}
I also had a third idea: use "h2:contains(" + text + ")~div:nth-of-type(2)". It works for the first case but fails for the second one, because nth-of-type counts the divs among all children of the parent, not among the siblings that follow the h2.

A simple way to do this is by using the comma (,) query operator which does an OR between the selectors. So you can combine the two variations of where the P tag falls.
h2:contains(Blah 2) + div ~ div, h2:contains(Blah 2) ~ div + div
Here's an example on the try.jsoup playground.
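As a rough check of the combined selector, the sketch below runs it against a reduced version of the question's document. The ids a1, a2, b1, b2 are added only so the result is easy to read, and the class and method names are hypothetical. Note that this selector covers just the two layouts shown in the question (p before both divs, or p between them); other arrangements would need the sibling-iteration approach.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SecondDivSelector {
    // Reduced markup: same sibling order as in the question, with ids added for illustration.
    static final String HTML =
        "<h2>Blah Blah 1</h2><p>blah blah</p><div id=\"a1\"></div><div id=\"a2\"></div>"
        + "<h2>Blah Blah 2</h2><div id=\"b1\"></div><p>blah blah</p><div id=\"b2\"></div>";

    static String secondDivId(String html, String headingText) {
        Document doc = Jsoup.parse(html);
        String h2 = "h2:contains(" + headingText + ")";
        // Left of the comma: h2, div, (anything), div. Right of the comma: h2, (anything), div, div.
        Element div = doc.select(h2 + " + div ~ div, " + h2 + " ~ div + div").first();
        return div == null ? null : div.id();
    }

    public static void main(String[] args) {
        System.out.println(secondDivId(HTML, "Blah 1"));
        System.out.println(secondDivId(HTML, "Blah 2"));
    }
}
```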

unable to retrieve the Table th tag value using webdriver with java

From the HTML below, I want to check the header value of each row in the table and, if it matches, retrieve the td value.
Below is my HTML:
<table class="span-5" id="summaryTable" title="Table showing Summary data">
<tbody>
<tr>
<th class="width-40" id="num">
(12) App no:
</th>
<td headers="num">
(11)
<strong>2796179</strong>
</td>
</tr>
<tr>
<th class="noLines alignLeft width35" id="EnglishTitle">
(54) English Title:
</th>
<td class="noLines alignLeft width65" headers="EnglishTitle">
FRAME BIT-SIZE ALLOCATION
</td>
</tr>
<tr>
</tbody>
</table>
I want to collect each th tag value (i.e. (12) App no, (54) English Title).
My Java code:
WebElement summary = driver.findElement(By.xpath("//*[@id='summaryTable']/tbody"));
List<WebElement> rows = summary.findElements(By.tagName("tr"));
for (int i = 1; i <= rows.size(); i++) {
String dc = driver.findElement(By.xpath("//*[@id='summaryTable']/tbody/tr[" + i + "]/td/th/a")).getText();
if (dc.equalsIgnoreCase("(12) App no")) {
appNo = driver.findElement(By.xpath("//*[@id='summaryTable']/tbody/tr[" + i + "]/td/strong")).getText();
}
}
But I'm getting: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id='summaryTable']/tbody/tr[1]/td/th/a"}

Please use the below code for this
WebElement elem = driver.findElement(By.id("summaryTable"));
List<WebElement> lists = elem.findElements(By.tagName("th"));
for(WebElement el : lists){
WebElement element = el.findElement(By.tagName("a"));
String str = element.getAttribute("innerHTML");
System.out.println(str);
}

I think you are making it a bit complicated. Can you try a simpler version?
public String getRequiredDataFromTableFromRow(String header){
WebElement table = driver.findElement(By.id("summaryTable"));
List<WebElement> rows = table.findElements(By.tagName("tr"));
for (WebElement row:rows) {
if(row.getText().contains(header)){
return row.findElement(By.tagName("td")).getText();
}
}
return null;
}

Cells are indexed within the row, so you need to specify the position to get the text. Note that XPath positions start at 1, and that the th tag is a sibling of the td tag, not a child of it. The header text also ends with a colon, so compare with startsWith rather than an exact match.
Try the following code:
WebElement summary = driver.findElement(By.xpath("//*[@id='summaryTable']/tbody"));
List<WebElement> rows = summary.findElements(By.tagName("tr"));
for (int i = 1; i <= rows.size(); i++) {
String dc = driver.findElement(By.xpath("//*[@id='summaryTable']/tbody/tr[" + i + "]/th[1]")).getText();
if (dc.trim().startsWith("(12) App no")) {
appNo = driver.findElement(By.xpath("//*[@id='summaryTable']/tbody/tr[" + i + "]/td[1]")).getText();
}
}
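The row lookup can also be expressed as a single XPath that picks the tr by its th text and then steps to the td. The sketch below checks such an expression with the JDK's built-in XPath engine against a tidied copy of the table (the stray trailing <tr> is removed so it parses as XML); it does not use Selenium, and the class and method names are hypothetical. The same expression string should be usable with By.xpath, though that is not tested here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SummaryTableXPath {
    // Tidied copy of the question's table, made well-formed for the XML parser.
    static final String TABLE =
        "<table id='summaryTable'><tbody>"
        + "<tr><th id='num'> (12) App no: </th><td headers='num'> (11) <strong>2796179</strong> </td></tr>"
        + "<tr><th id='EnglishTitle'> (54) English Title: </th>"
        + "<td headers='EnglishTitle'> FRAME BIT-SIZE ALLOCATION </td></tr>"
        + "</tbody></table>";

    // Returns the whitespace-normalized td text of the row whose th contains headerText.
    // headerText must not contain a single quote, since it is pasted into the expression.
    static String valueForHeader(String xml, String headerText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String cell = "//*[@id='summaryTable']/tbody/tr[th[contains(normalize-space(.), '"
                    + headerText + "')]]/td";
            return xpath.evaluate("normalize-space(" + cell + ")", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valueForHeader(TABLE, "(12) App no"));
        System.out.println(valueForHeader(TABLE, "(54) English Title"));
    }
}
```

Appending /strong to the cell expression would narrow the first result down to the application number alone.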

The code below gets the text of each "th" element:
WebElement summary = driver.findElement(By.id("summaryTable"));
List<WebElement> headers = summary.findElements(By.tagName("th"));
for (WebElement header : headers) {
System.out.println(header.getText());
}
In the code above, I look up the table by its id and then use that same reference to get the list of "th" elements.
If you want to operate on the text that was found, you can do so through the header element reference inside the loop.

How to read image "alt" attributes within links using jsoup?

I need to read alt attributes with the jsoup library.
For example:
<a href="www.test.com">
<img src="http://test.org/images/icon/socialNetwork/telegram-icon.png" border="0" alt="telegram"/>
</a>
How can I read it?

Here is a code snippet, which reads all the alt attributes of the image tags:
String html = "<a href=\"www.test.com\"> <img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"></img>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("img");
for (Element e : elements) {
String alt = e.attr("alt");
System.out.println("alt: " + alt);
}
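Since the question asks specifically about images within links, a slightly narrower selector may be closer to what is wanted. The sketch below collects alt values only from img elements that sit inside an a element. The second image (logo.png) is invented here purely to show that it is skipped, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkedImageAlts {
    // First image is from the question; the second one is a made-up image outside any link.
    static final String HTML =
        "<a href=\"www.test.com\">"
        + "<img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/>"
        + "</a>"
        + "<img src=\"logo.png\" alt=\"not inside a link\">";

    static List<String> altsInsideLinks(String html) {
        Document document = Jsoup.parse(html);
        List<String> alts = new ArrayList<>();
        // "a img[alt]" matches img elements that have an alt attribute and an <a> ancestor.
        for (Element img : document.select("a img[alt]")) {
            alts.add(img.attr("alt"));
        }
        return alts;
    }

    public static void main(String[] args) {
        System.out.println(altsInsideLinks(HTML));
    }
}
```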

Java HTML Parsing not getting my data?

I have the following HTML code:
<tr class="odd">
<td class="first name">
3i Group PLC
</td>
<td class="value">457.80</td>
<td class="change up">+10.90</td> <td class="delta up">+2.44%</td> <td class="value">1,414,023</td>
<td class="datetime">11:35:08</td>
From this I need to get the data
457.80
(i.e. the text of the td with class "value"), and I currently have this Java code:
String FTSE = "http://www.bloomberg.com/quote/UKX:IND/members";
doc = Jsoup.connect(FTSE).get();
Elements links = doc.select("a[href='/quote/III:LN']");
for (Element link : links) {
// get the value from href attribute
System.out.println("\nlink : " + link.attr("value"));
System.out.println("text : " + link.text());
When I run my program it terminates having output nothing. How do I make it so that it outputs the value, which in this case, is '457.80'?

links will contain the <a href...> element. What you are trying to retrieve is the text of a completely different element, i.e. a <td> tag which has the class value.
My guess is that you have multiple <tr> elements and you only want the one which contains the link you've selected. In which case you will need the following code:
String FTSE = "http://www.bloomberg.com/quote/UKX:IND/members";
doc = Jsoup.connect(FTSE).get();
Elements trs = doc.select("tr:has(a[href='/quote/III:LN'])");
Elements values = trs.select("td.value");
Element link = values.get(0);
System.out.println("text : " + link.text());
Or something similar...
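Here is an offline sketch of the same idea that can be run without fetching the live page. It assumes the company name cell contains a link to /quote/III:LN, which the HTML excerpt in the question does not show, and it wraps the row in a <table> so the parser keeps the tr/td elements. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RowValueDemo {
    // Row from the question; the <a> in the name cell and the <table> wrapper are assumptions.
    static final String HTML =
        "<table><tr class=\"odd\">"
        + "<td class=\"first name\"><a href=\"/quote/III:LN\">3i Group PLC</a></td>"
        + "<td class=\"value\">457.80</td>"
        + "<td class=\"change up\">+10.90</td><td class=\"delta up\">+2.44%</td>"
        + "<td class=\"value\">1,414,023</td>"
        + "<td class=\"datetime\">11:35:08</td>"
        + "</tr></table>";

    static String firstValueForLink(String html, String href) {
        Document doc = Jsoup.parse(html);
        // Pick the row containing the link, then the first td.value cell inside that row.
        Element cell = doc.select("tr:has(a[href='" + href + "']) td.value").first();
        return cell == null ? null : cell.text();
    }

    public static void main(String[] args) {
        System.out.println("text : " + firstValueForLink(HTML, "/quote/III:LN"));
    }
}
```

The row has two td.value cells (price and volume), so first() is what selects the price; the null check covers the case where no row matches.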

how to extract data inside a specific td in html table using java

I have:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label"></td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_i1" ><img height="44" width="32" alt="Tim Robbins" title="Tim Robbins"src="http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._V379389446_.png"class="loadlate hidden " loadlate="http://ia.media-imdb.com/images/M/MV5BMTI1OTYxNzAxOF5BMl5BanBnXkFtZTYwNTE5ODI4._V1_SY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td class="itemprop" itemprop="actor" itemscope itemtype="http://schema.org/Person">
<a href="/name/nm0000209/?ref_=ttfc_fc_cl_t1" itemprop='url'> <span class="itemprop" itemprop="name">Tim Robbins</span>
</a> </td>
<td class="ellipsis">
...
</td>
How can I get only the information inside the second td (td class="itemprop")? I want to get "/name/nm0000209/?ref_=ttfc_fc_cl_t1" and "Tim Robbins".
This is my code:
Elements elms = doc.getElementsByClass("cast_list").first().getElementsByTag("table");
Elements tds = elms.select("td");
for(Element td : tds){
if(td.attr("class").contains("itemprop")){
Elements links = tds.select("a[href]");
for(Element link : links){
if(link.attr("href").contains("name/nm"))
{
String castname = link.text();
String castImdbId = link.attr("href");
System.out.println("CastName:" + castname + "\n");
System.out.println("CastImdbID:" + castImdbId + "\n");
}
But it also returns the text of the link inside td class="primary_photo", which is empty. This is part of my output:
CastName:
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_i1
CastName:Tim Robbins
CastImdbID:/name/nm0000209/?ref_=ttfc_fc_cl_t1
CastName:
......
Could someone please let me know where my problem is? I think the condition if(td.attr("class").contains("itemprop")) does not work at all.
Thanks,

Use a more specific CSS selector instead of td. Since the right <td> is identified by its class, why not use it:
td.itemprop
Your Java code would then start like this instead:
Elements tds = elms.select("td.itemprop");
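Carrying that suggestion through to the link itself, the selector can go straight to the anchor inside td.itemprop, which removes the need for the nested loops. The sketch below uses a shortened copy of the question's markup (image attributes trimmed, table closed); the class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CastLinkDemo {
    // Shortened version of the question's table; only the parts relevant to the selector are kept.
    static final String HTML =
        "<table class=\"cast_list\"><tr class=\"odd\">"
        + "<td class=\"primary_photo\"><a href=\"/name/nm0000209/?ref_=ttfc_fc_cl_i1\">"
        + "<img alt=\"Tim Robbins\"></a></td>"
        + "<td class=\"itemprop\" itemprop=\"actor\"><a href=\"/name/nm0000209/?ref_=ttfc_fc_cl_t1\">"
        + " <span class=\"itemprop\" itemprop=\"name\">Tim Robbins</span> </a></td>"
        + "</tr></table>";

    // Returns {name, href} for the first cast link, or null when nothing matches.
    static String[] firstCastEntry(String html) {
        Document doc = Jsoup.parse(html);
        // td.itemprop skips the photo cell, so only the name link is matched.
        Element link = doc.select("table.cast_list td.itemprop a[href]").first();
        if (link == null) {
            return null;
        }
        return new String[] { link.text(), link.attr("href") };
    }

    public static void main(String[] args) {
        String[] entry = firstCastEntry(HTML);
        System.out.println("CastName:" + entry[0]);
        System.out.println("CastImdbID:" + entry[1]);
    }
}
```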
