JSoup scrape HTML document by attribute value


I want to make a dynamic website and need some pics off the internet. I decided to scrape them off Flickr and credit the owners on my website, but I am running into problems with the scraping. I'll post part of the HTML below, but if you want to check the source code yourself, here's the website: https://www.flickr.com/explore
HTML:
<div class="thumb ">
<span class="photo_container pc_ju">
<a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="rapidnofollow photo-click"><img id="photo_img_15586482942" src="https://c2.staticflickr.com/4/3945/15586482942_6a7154363f_z.jpg"width="508" height="339" alt="Lake District" class="pc_img " border="0"><div class="play"></div></a>
</span>
<div class="meta">
<div class="title"><a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="title">Lake District</a></div>
<div class="attribution-block">
<span class="attribution">
<span>by </span>
******<a data-track="owner" href="/photos/sheilarogers13" title="sheilarogers22" class="owner">sheilarogers22</a>******
</span>
</div>
<span class="inline-icons">
<a data-track="favorite" href="#" class="rapidnofollow fave-star-inline canfave" title="Add this photo to your favorites?"><img width="12" height="12" alt="[★]" src="https://s.yimg.com/pw/images/spaceball.gif" class="img"><span class="fave-count count">99+</span></a>
<a title="Comments" href="#" class="rapidnofollow comments-icon comments-inline-btn">
<img width="12" height="12" alt="Comments" src="https://s.yimg.com/pw/images/spaceball.gif">
<span class="comment-count count">57</span>
</a>
<img width="12" height="12" alt="" src="https://s.yimg.com/pw/images/spaceball.gif">
</span>
</div>
</div>
I want the line where I put asterisks, in order to be able to give credit to the authors of the pictures.
My code:
Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
However, the above code gives me all four data-track elements in my div.meta, and I only want the one whose value is owner.
I checked the JSoup documentation and it says that attributes with values are found using [attr=value], but I can't seem to get it to work. I've tried:
.select("[data-track=owner]")
.select("[data-track='owner']")
but neither works. Thoughts?

Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
Elements ownerElements = new Elements();
for (Element element : pgElem) {
    if (!element.getElementsByAttributeValueContaining("data-track", "owner").isEmpty()) {
        ownerElements.add(element);
    }
}
Actually, I just gave it another try, and this works for me:
doc.select("div.thumb").select("div.meta").select("[data-track=owner]")

Related

Selenium - How to get following sibling?

So I want to target a textbox that appears in a list that comes after the label "First Name" and send a first name to the box, but can't seem to be able to target the textBox...
What I've tried:
WebElement firstNameLocTry = driver.findElement(By.xpath("//label[text()='First Name']/following-sibling::div"));
firstNameLocTry.sendKeys(firstName);
What the li looks like:
<li class="WPTO WKVO" role="presentation" data-automation-id="formLabelRequired">
<div class="WBUO WETO WDUO">
<label id="56$551056--uid24-formLabel" data-automation-id="formLabel" for="56$551056--uid24-input">First Name</label>
<div class="WEUO wd-c8594868-6b31-4526-9dda-7d146648964b" aria-hidden="true">First Name</div>
</div>
<div data-automation-id="decorationWrapper" id="56$551056" class="WFUO">
<div class="WICJ">
<div class="WMP2 textInput WLP2 WJ5" data-automation-id="textInput" id="56$551056--uid24" data-metadata-id="56$551056" style="visibility: visible;">
<input type="text" class="gwt-TextBox WEQ2" data-automation-id="textInputBox" tabindex="0" role="textbox" id="56$551056--uid24-input" aria-invalid="false" aria-required="true">
</div>
</div>
</div>
</li>
Any reason my sendKeys just leads to Element not interactable?

The given XPath points to the second "First Name" div (the one with aria-hidden="true"), not to the text box, so when you call sendKeys the error "Element not interactable" is thrown.
Try one of the two XPaths below. The first selects the wrapper div that follows the label's parent; the second goes all the way down to the input:
1. //label[text()='First Name']//parent::div/following-sibling::div
2. //label[text()='First Name']//parent::div/following-sibling::div//input
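To see why the original locator lands on the wrong element, here is a minimal sketch that evaluates both locators with the JDK's built-in XPath engine rather than Selenium. It assumes a trimmed, well-formed copy of the markup from the question (ids shortened, input self-closed), and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class SiblingXPathDemo {
    // Assumption: simplified, well-formed version of the <li> from the question.
    static final String MARKUP =
        "<li>"
      + "<div class='WBUO'>"
      + "<label for='uid24-input'>First Name</label>"
      + "<div aria-hidden='true'>First Name</div>"
      + "</div>"
      + "<div data-automation-id='decorationWrapper'>"
      + "<div data-automation-id='textInput'>"
      + "<input type='text' data-automation-id='textInputBox' id='uid24-input'/>"
      + "</div>"
      + "</div>"
      + "</li>";

    // Returns the named attribute of the first element matched by the XPath,
    // or null when nothing matches.
    static String attrOfFirstMatch(String xpath, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(MARKUP)));
            Element hit = (Element) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
            return hit == null ? null : hit.getAttribute(attribute);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Locator from the question: matches the hidden sibling div, not the text box.
        System.out.println(attrOfFirstMatch(
            "//label[text()='First Name']/following-sibling::div", "aria-hidden"));
        // Second locator from the answer: climbs to the parent div, then descends to the input.
        System.out.println(attrOfFirstMatch(
            "//label[text()='First Name']//parent::div/following-sibling::div//input",
            "data-automation-id"));
    }
}
```

In a Selenium test, the second expression would be passed to By.xpath in the same way as the original one; a real page may still need a wait before the input becomes interactable.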

Please use following xpath:
//div[text()='First Name']
or try:
//*[contains(text(),'First Name')]

How to get specific sub-elements of html data using Jsoup

So I am trying to get all prices from an HTML file using Jsoup. The simplified HTML is structured something like this:
//some html
<div class="price-point-wrap use-roundtrippricing">
<div class="price-point-wrap-top use-roundtrippricing">
<div class="pp-from-total use-roundtrippricing">Roundtrip</div>
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$509
</div>
<div class="fare-select-button-div">
<input type="button" aria-describedby="sr_product_ECONOMY_123-745|1975-UA" value="Select" class="fare-select-button">
<span class="visuallyhidden">fare for Economy (lowest)</span>
</div>
</div>
//some html
<div class="price-point-wrap use-roundtrippricing">
<div class="price-point-wrap-top use-roundtrippricing">
<div class="pp-from-total use-roundtrippricing">Roundtrip</div>
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$1,046
</div>
<div class="fare-select-button-div">
<input type="button" aria-describedby="sr_product_MIN-BUSINESS-OR-FIRST_123-745|1975-UA" value="Select" class="fare-select-button">
<span class="visuallyhidden">fare for First (2-cabin, lowest)</span>
</div>
<div class="pp-remaining-seats">5 tickets left at this price</div>
</div>
//some html
This is what I have tried so far:
File input = new File("Flights.html");
Document document = Jsoup.parse(input, "UTF-8", "");
Elements prices = document.getElementsByClass("price-point");
for (Element e : prices) {
    System.out.println(e.toString());
}
This gives me the following result:
<div class="price-point price-point-revised use-roundtrippricing">
$509
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$1,046
</div>
.....
But now I only want prices like:
509
1046
I tried a regex that keeps only the digits, e.toString().replaceAll("\\D+",""), when printing. This seems to work, but it is not how I want to achieve it. How can I get only the numbers using Jsoup?

Thanks to the comment from @Eritrean, I needed to use e.text() instead of e.toString(), which gave me
$509
$1,046
I still need to use a regex like e.text().replaceAll("[$,]", "") to get rid of the dollar signs and commas.
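The cleanup step can be sketched on its own with plain strings. This is only an illustration: the input strings stand in for whatever e.text() returns, the helper name is hypothetical, and it assumes whole-dollar prices like the ones in the question.

```java
class PriceText {
    // Strips the dollar sign and thousands separators, then parses what is left.
    // Assumption: whole-dollar amounts only, as in the sample HTML.
    static int toAmount(String priceText) {
        return Integer.parseInt(priceText.trim().replaceAll("[$,]", ""));
    }

    public static void main(String[] args) {
        // Stand-ins for e.text() on each price-point element.
        for (String raw : new String[] {"$509", " $1,046 "}) {
            System.out.println(toAmount(raw));
        }
    }
}
```

Parsing to an int instead of keeping the string makes it possible to sort or compare the prices afterwards.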

parse data of certain tag which is before a particular class

I need to parse data from a web page by tag ("p"). I tried this:
Elements content = document.getElementsByTag("p");
for (Element el : content) {
    System.out.println(el.text());
}
And it works fine. But I get superfluous data.
For example:
<div class="DicCellTerm">
<h1>Impossible</h1>
<div class=des>
<p class=par2><span class=hint><em>smth</em></span></p>
<p class=par2>1) (<em>with</em>) all, do</p>
<p class=par2>2) <span class=hint><em>text</em></span> some words</p>
<p class=par3>it is impossible</p>
</div>
</div>
</div><!--DicCell end-->
<div align="center" class="AdContent" id="adcontentnoprint">
<div class=SharedItems>
<div class=DicCellParent>
<span class=LinkOtherDic>+ dictionary <strong>impossible</strong> - translate</span>
<div class=DicCellOther id=diccellothershow>
<h2>impossible</h2>
<div class=des>
<p class=par1>1) important, is</p>
<p class=par1>what</p>
<p class=par1>2) true, false</p>
</div>
</div>
<!--DicCellOther end-->
</div>
<!--DicCellParent end-->
<div class=DicCellParent>
<span class=LinkOtherDic>+ translate <strong>important</strong> - dictionary</span>
<div class=DicCellOther id=diccellothershow>
<h2>importnant</h2>
<div class=des>
<p class=par1>1) müim, emiyetli; emiyet bar</p>
<p class=par1>it is very important - bu pek müimdir, bunıñ büyük emiyeti bar</p>
<p class=par1>2) qopayıp, qabarıp</p>
</div>
</div>
<!--DicCellOther end-->
</div>
<!--DicCellParent end-->
</div>
<!--SharedItems end-->
I need to get the data in the "p" tags that come before the SharedItems class.
I tried parsing the data by class "DicCellTerm" and I get the right data. But all of it is written on one line, and I need it laid out as on the web page.

Elements elements = document.select(".DicCellTerm p");
This grabs every p inside the .DicCellTerm class; you can then iterate over elements and print each paragraph on its own line. Here is a link to all possible selectors in jsoup, which is where I get most of my help =)
https://jsoup.org/apidocs/index.html?org/jsoup/select/Selector.html

How to retrieve atomic values inside tags in HTMLUnit

I am new to HtmlUnit and I don't know how to get the text inside the [...]
A part of my html file:
<ul ......somethin....>
<li data-role="list-divider" role="heading" style="font-size:16px;" class="ui-bar-f">
INFORMATION_LINE_1
</li>
<li data-theme="d" class="ui-li ui-btn-icon-right ui-btn-up-d ui-odd-match-column ">
<div class="ui-btn-inner ui-li">
<div class="">
<div class="ui-btn-text">
<a href="/x/cxntay/13113/ndzvsssl/g1" class=" ui-link-inherit ui-link-hover">
<h3 class="ui-li-heading">
<span class="xheader">INFORMATION_LINE_2</span>
<span class="label live">INFORMATION_LINE_3</span>
</h3>
<div class="ui-live-scores">
<span class="team1-scores">
<span class="ui-team-name">INFORMATION_LINE_4</span>
<span style="font-weight:bold">INFORMATION_LINE_5</span>
</span>
<span>INFORMATION_LINE_6</span>
</div>
</a>
</div>
</div>
</div>
</li>
</ul>
Now, I want to retrieve "INFORMATION_LINE_X"(1,2...6) in between these tags..
This is what I tried:
List<HtmlUnorderedList> ls = (List<HtmlUnorderedList>) page.getByXPath("/ul");
List<DomNode> dls = ls.get(0).getChildNodes();
System.out.println(dls.get(0).getFirstByXPath("//li[@data-role='list-divider']/text()"));
I just tried to get INFORMATION_LINE_1
But it printed null. I need to get all the INFORMATION_LINES.

It is better to use XPath alone rather than mixing it with HtmlUnit methods. Something like this should work to get you the first information line:
HtmlElement e = page.getFirstByXPath("//li[@data-role='list-divider']");
System.out.println(e.asText());
In order to fetch the other information lines you should follow the same approach but changing the XPath string.
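As a rough illustration of "same approach, different XPath string", the sketch below runs three expressions against a cut-down copy of the list using the JDK's own XPath engine instead of HtmlUnit. The markup is an assumption (shortened and made well-formed), the class and method names are hypothetical, and attribute tests are written with @.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InfoLinesXPathDemo {
    // Assumption: shortened, well-formed version of the <ul> from the question.
    static final String MARKUP =
        "<ul>"
      + "<li data-role='list-divider'>\n INFORMATION_LINE_1\n</li>"
      + "<li data-theme='d'><a href='/x'>"
      + "<h3 class='ui-li-heading'>"
      + "<span class='xheader'>INFORMATION_LINE_2</span>"
      + "<span class='label live'>INFORMATION_LINE_3</span>"
      + "</h3>"
      + "</a></li>"
      + "</ul>";

    // Returns the trimmed text of the first node matched by the XPath
    // (an empty string when nothing matches).
    static String textOf(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(MARKUP)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Only the XPath string changes from one information line to the next.
        System.out.println(textOf("//li[@data-role='list-divider']"));
        System.out.println(textOf("//span[@class='xheader']"));
        System.out.println(textOf("//span[@class='label live']"));
    }
}
```

With HtmlUnit, the same strings would go to page.getFirstByXPath, with the text read from the returned element as in the answer.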
Bear in mind that you should always debug the page by printing the output of page.asXml() and looking at the markup. A real browser does not necessarily show you exactly what HtmlUnit sees, and you can run into differences, particularly if the page executes JavaScript.

Extracting href from a class within other div/id classes with jsoup

Hello, I am trying to extract the first href from within the "title" class from the following source (the source is only part of the whole page; however, I am using the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all have given me a "null" value, such as:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.

The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
