Java jsoup link extracting(wrong output)

I am trying to get only the links with <a class="subHover", but the code I wrote returns every link on the page. Here is my code:
String website = "http://www.svensktnaringsliv.se/english/publications/?start=" +maxPage;
Document docOne = Jsoup.connect(website).get();
Elements elem = docOne.getElementsByAttributeValue("class", "search-result");
Elements el = elem.attr("class", "subHover");
System.out.println(el.select("a[href]"));
I don't really know what I am doing wrong.
The output of the code is:
<img class="border" src="http://www.svensktnaringsliv.se/migration_catalog/Rapporter_och_opinionsmaterial/Rapporters/corporate_governance_10017apdf_579280.html/ALTERNATES/PORTRAIT_170/Corporate_Governance_10017a.pdf">
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/corporate-governance-internal-control-and-compliance-from-an-info_578545.html"> <h2> Corporate Governance, Internal Control and Compliance - - From an Information Security Perspective</h2> </a>
<a class="noHover" href="http://www.svensktnaringsliv.se/personer/christer-magnusson_538711.html"><span class="entypo entypo-user"></span><span>Christer Magnusson</span></a>
<img class="border" src="http://www.svensktnaringsliv.se/migration_catalog/Rapporter_och_opinionsmaterial/Rapporter/proposed_guidelines_for_a_european_research_policypng_595932.html/ALTERNATES/PORTRAIT_170/Proposed_guidelines_for_a_European_research_policy.png">
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/from-stagnation-to-acceleration-proposed-guidelines-for-a-europea_595930.html"> <h2>From stagnation to acceleration - Proposed guidelines for a European research policy</h2> </a>
<a class="noHover" href="http://www.svensktnaringsliv.se/medarbetare/emil-gornerup_566685.html"><span class="entypo entypo-user"></span><span>Emil Görnerup</span></a>
<img class="border" src="http://www.svensktnaringsliv.se/migration_catalog/decision-usefulness_omslagjpg_588538.html/ALTERNATES/PORTRAIT_170/Decision%20usefulness_omslag.jpg">
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/decision-usefulness-explored-an-investigation-of-capital-market-a_588531.html"> <h2>Decision usefulness explored - An investigation of capital market actors´ use of financial reports</h2> </a>
<a class="subHover" href="http://www.svensktnaringsliv.se/english/publications/tax-reductions-and-public-resources_590643.html"> <h2>Tax reductions and public resources</h2> </a>
<a class="noHover" href="http://www.svensktnaringsliv.se/english/staff/mikael-witterblad_572108.html"><span class="entypo entypo-user"></span><span>Mikael Witterblad</span></a>
<a class="noHover" href="http://www.svensktnaringsliv.se/medarbetare/johan-fall_551949.html"><span class="entypo entypo-user"></span><span>Johan Fall</span></a>

The reason for your results is that the document contains HTML like this:
<div class="subHover">
<span class="subject">PUBLICATION</span>
<span class="subject-info"><b>Publicerad:</b> <time datetime="2005-06-30">30 June 2005 </time></span>
<div class="result-content clearfix">
<a class="subHover" href="http://www.svensktnaringsliv.se/material/rapporter/internationell-utblick-loner-och-arbetskraftskostnader-juni-2005-_565749.html"> <h2>Internationell utblick - Löner och arbetskraftskostnader juni 2005 / International Outlook - Wages, Salaries, Labour Costs June 2005</h2> </a>
<div class="info-block">
<p><a class="noHover" href="http://www.svensktnaringsliv.se/medarbetare/krister-b-andersson_560480.html"><span class="entypo entypo-user"></span><span>Krister B Andersson</span></a></p>
</div>
</div>
</div>
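You can see that the outer div has class subHover, and that is the element your code ends up working on. Note that Elements.attr("class", "subHover") does not filter by class: it sets the class attribute on every matched element and returns the same elements. You then select every a inside that has an href attribute, without requiring that the a itself also has class subHover.
It is simpler to use a CSS selector. This should work: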
String website = "http://www.svensktnaringsliv.se/english/publications/?start=" +maxPage;
Document docOne = Jsoup.connect(website).get();
Elements els = docOne.select("a.subHover");
for (Element el : els) {
    System.out.println(el);
}
I would recommend learning the power of CSS selectors, as described in the JSoup documentation.

Related

JSoup scrape HTML document by attribute value

I want to make a dynamic website and need some pictures from the internet. I decided to scrape them from Flickr and credit the owners on my website, but I am running into problems with the scraping. I'll post part of the HTML below, but if you want to check the source code yourself, here is the page: https://www.flickr.com/explore
HTML:
<div class="thumb ">
<span class="photo_container pc_ju">
<a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="rapidnofollow photo-click"><img id="photo_img_15586482942" src="https://c2.staticflickr.com/4/3945/15586482942_6a7154363f_z.jpg"width="508" height="339" alt="Lake District" class="pc_img " border="0"><div class="play"></div></a>
</span>
<div class="meta">
<div class="title"><a data-track="photo-click" href="/photos/sheilarogers13/15586482942/in/explore-2014-10-20" title="Lake District" class="title">Lake District</a></div>
<div class="attribution-block">
<span class="attribution">
<span>by </span>
******<a data-track="owner" href="/photos/sheilarogers13" title="sheilarogers22" class="owner">sheilarogers22</a>******
</span>
</div>
<span class="inline-icons">
<a data-track="favorite" href="#" class="rapidnofollow fave-star-inline canfave" title="Add this photo to your favorites?"><img width="12" height="12" alt="[★]" src="https://s.yimg.com/pw/images/spaceball.gif" class="img"><span class="fave-count count">99+</span></a>
<a title="Comments" href="#" class="rapidnofollow comments-icon comments-inline-btn">
<img width="12" height="12" alt="Comments" src="https://s.yimg.com/pw/images/spaceball.gif">
<span class="comment-count count">57</span>
</a>
<img width="12" height="12" alt="" src="https://s.yimg.com/pw/images/spaceball.gif">
</span>
</div>
</div>
I want the line where I put asterisks, in order to be able to give credit to the authors of the pictures.
My code:
Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
However, the code above gives me all 4 data-track elements in my div.meta, and I only want the one whose value is owner.
I checked the JSoup documentation and it says that attributes with values are found using [attr=value], but I can't seem to get it to work. I've tried:
.select("[data-track=owner]")
.select("[data-track='owner']")
but neither works. Thoughts?

Elements pgElem = doc.select("div.thumb").select("div.meta").select("[data-track]");
Elements ownerElements = new Elements();
for (Element element : pgElem) {
    if (!element.getElementsByAttributeValueContaining("data-track", "owner").isEmpty()) {
        ownerElements.add(element);
    }
}
Actually, I just gave it another spin and this works for me:
doc.select("div.thumb").select("div.meta").select("[data-track=owner]")

I cannot select a link using the span class in Selenium

I have the following HTML:
<a id="pt1:cb1" class="xfc p_AFTextOnly" href="#" onclick="return false;">
<span class="x106">Cadastro de cliente</span>
I need to select the "Cadastro de cliente" item on the page.
I used the following command, but it did not work:
WebElement menuCadCliente = driver.findElement(By.xpath("/html/body/div/form/div/div/div/div/div[1]/div[7]/div/div[6]/div/div[1]/div/div[1]/a/span"));
menuCadCliente.click();
I'm new to Selenium WebDriver, so I would appreciate your help.

If the span you are interested in is always a descendant of the anchor tag with the id "pt1:cb1", then I would suggest the following. The colon in the id must be escaped in a CSS selector, otherwise ":cb1" is parsed as a pseudo-class:
WebElement menuCadCliente = driver.findElement(By.cssSelector("#pt1\\:cb1 .x106"));
Otherwise, if that structure is not guaranteed, I would suggest looping over the candidates to find the right element:
List<WebElement> spans = driver.findElements(By.cssSelector(".x106"));
WebElement menuCadCliente = null;
for (WebElement span : spans) {
    if (span.getText().equals("Cadastro de cliente")) {
        menuCadCliente = span;
        break; // stop at the first exact match
    }
}
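Alternatively, you could match on a substring. Note that :contains is not part of standard CSS, so By.cssSelector will generally reject it; the XPath contains() function does the same job:
WebElement menuCadCliente = driver.findElement(By.xpath("//span[contains(text(), 'Cadastro de cliente')]"));
However, this would also match a span whose text is, for example, Cadastro de clientes1231.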
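The difference between an exact match and a substring match can be checked without a browser. The sketch below is only an illustration: it uses the XPath engine that ships with the JDK (javax.xml.xpath) instead of Selenium, on a made-up two-span fragment. The class name SpanTextMatch and the helper count are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class SpanTextMatch {
    // Made-up fragment: the wanted span plus one whose text merely starts the same way.
    static final String HTML =
        "<div>"
      + "<span class=\"x106\">Cadastro de cliente</span>"
      + "<span class=\"x106\">Cadastro de clientes1231</span>"
      + "</div>";

    // Returns how many nodes the given XPath expression selects in HTML.
    static int count(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Double n = (Double) xpath.evaluate(
                "count(" + expression + ")",
                new InputSource(new StringReader(HTML)),
                XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Exact text comparison selects only the wanted span.
        System.out.println(count("//span[text()='Cadastro de cliente']"));
        // Substring comparison also selects the longer text.
        System.out.println(count("//span[contains(text(), 'Cadastro de cliente')]"));
    }
}
```

The same two expressions can be passed to By.xpath in Selenium; the exact form is the safer choice when similar labels may appear on the page.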

That XPath looks overly complicated; it helps to keep things simple. You can search by id, but if the id attribute is dynamically generated, you can instead search by link text:
driver.findElement(By.linkText("Cadastro de cliente")).click();

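You are using the complete (absolute) XPath, which is almost always a bad idea because it breaks whenever the page layout changes. Judging from the snippet you posted, there are several simpler options. Any one of the following could work, keeping in mind that findElement returns only the first match: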
driver.findElement(By.className("x106"))
driver.findElement(By.tagName("span"))
driver.findElement(By.linkText("Cadastro de cliente"))
Have a read through the documentation.

Here is the snippet of the page that contains the link I need to click:
<div class="xwq" style="position:absolute;left:0px;right:0px;top:0px;bottom:0px">
<div style="position:absolute;width:100%;height:100%">
<div id="pt1:sdi1" class="af_showDetailItem" style="position:absolute;width:auto;height:auto;top:0px;left:0px;bottom:0px;right:0px">
<div>
<a id="pt1:cb1" class="xfc p_AFTextOnly" href="#" onclick="return false;">
<span class="x106">Cadastro de cliente</span>
</a>
</div>
<div>
<a id="pt1:cb2" class="xfc p_AFTextOnly" href="#" onclick="return false;">
<span class="x106">Relacionar cliente à Proposta de Venda</span>
</a>
</div>
<div>
<a id="pt1:cb3" class="xfc p_AFTextOnly" href="#" onclick="return false;">
<span class="x106">Iniciar processo de Análise de Crédito</span>
</a>
</div>
</div>
</div>
The Selenium command I used was:
WebElement menuCadCliente = driver.findElement(By.linkText("Cadastro de cliente"));
menuCadCliente.click();
What I noticed is that the same class is used for several elements, each with a different value. Please correct me if I have misunderstood.

How to retrieve atomic values inside tags in HTMLUnit

I am new to HtmlUnit and I don't know how to get the text inside the [...]
A part of my html file:
<ul ......somethin....>
<li data-role="list-divider" role="heading" style="font-size:16px;" class="ui-bar-f">
INFORMATION_LINE_1
</li>
<li data-theme="d" class="ui-li ui-btn-icon-right ui-btn-up-d ui-odd-match-column ">
<div class="ui-btn-inner ui-li">
<div class="">
<div class="ui-btn-text">
<a href="/x/cxntay/13113/ndzvsssl/g1" class=" ui-link-inherit ui-link-hover">
<h3 class="ui-li-heading">
<span class="xheader">INFORMATION_LINE_2</span>
<span class="label live">INFORMATION_LINE_3</span>
</h3>
<div class="ui-live-scores">
<span class="team1-scores">
<span class="ui-team-name">INFORMATION_LINE_4</span>
<span style="font-weight:bold">INFORMATION_LINE_5</span>
</span>
<span>INFORMATION_LINE_6</span>
</div>
</a>
</div>
</div>
</div>
</li>
</ul>
Now I want to retrieve the "INFORMATION_LINE_X" values (X = 1, 2, ..., 6) between these tags.
This is what I tried:
List<HtmlUnorderedList> ls = (List<HtmlUnorderedList>) page.getByXPath("/ul");
List<DomNode> dls = ls.get(0).getChildNodes();
System.out.println(dls.get(0).getFirstByXPath("//li[@data-role='list-divider']/text()"));
I was only trying to get INFORMATION_LINE_1, but it printed null. I need to get all the INFORMATION_LINEs.

It is better to use XPath alone rather than mixing it with HtmlUnit's DOM methods. Something like this should get you the first information line:
HtmlElement e = page.getFirstByXPath("//li[@data-role='list-divider']");
System.out.println(e.asText());
To fetch the other information lines, follow the same approach with a different XPath string.
Bear in mind that you should always debug by printing the output of page.asXml() and inspecting it. What a real browser shows is not necessarily what HtmlUnit sees, and you may run into differences, particularly if the page executes JavaScript.
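As a rough illustration of "same approach, different XPath string", the sketch below evaluates a few expressions against a trimmed, well-formed copy of the markup from the question. It uses the JDK's built-in javax.xml.xpath engine rather than HtmlUnit, so it runs without a page object; the class name InfoLineXPath and the helper text are hypothetical, and the span selectors are assumptions based on the class names shown in the question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class InfoLineXPath {
    // Trimmed, well-formed version of the markup from the question.
    static final String HTML =
        "<ul>"
      + "<li data-role=\"list-divider\" class=\"ui-bar-f\"> INFORMATION_LINE_1 </li>"
      + "<li data-theme=\"d\"><a href=\"/x/cxntay/13113/ndzvsssl/g1\">"
      + "<h3 class=\"ui-li-heading\">"
      + "<span class=\"xheader\">INFORMATION_LINE_2</span>"
      + "<span class=\"label live\">INFORMATION_LINE_3</span>"
      + "</h3>"
      + "<span class=\"ui-team-name\">INFORMATION_LINE_4</span>"
      + "</a></li>"
      + "</ul>";

    // Returns the whitespace-normalized text of the first node the expression selects.
    static String text(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(
                "normalize-space(" + expression + ")",
                new InputSource(new StringReader(HTML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Attributes are addressed with '@' in XPath.
        System.out.println(text("//li[@data-role='list-divider']"));
        System.out.println(text("//span[@class='xheader']"));
        System.out.println(text("//span[@class='label live']"));
        System.out.println(text("//span[@class='ui-team-name']"));
    }
}
```

With HtmlUnit, the same expression strings would be passed to page.getFirstByXPath, subject to the markup HtmlUnit actually sees.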

How to click a link from list selenium2

I'm using Selenium 2 and I want to click an 'invite' link for Name3. How can I do that?
Here is the HTML code:
<ul>
<li>
<label for="511565484">
<img src="pic1">Name1</label>
<a class="button_green sendInvite" href="javascript:;" title="Invite">Invite</a>
</li>
<li>
<label for="535963597">
<img src="pic2">Name2</label>
<a class="button_green sendInvite" href="javascript:;" title="Invite">Invite</a>
</li>
<li>
<label for="561708219">
<img src="pic3">Name3</label>
<a class="button_green sendInvite" href="javascript:;" title="Invite">Invite</a>
</li>
</ul>

XPath is probably the most direct way to do this:
//label[text()='Name3']/following-sibling::a
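To see how following-sibling picks the link next to a given label, here is a minimal sketch that runs the expression with the JDK's built-in XPath engine instead of a browser. The id attributes (invite1 to invite3) are not in the original markup; they are added here only so the selected link can be told apart. The class name InviteLinkXPath and the helper inviteIdFor are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class InviteLinkXPath {
    // Well-formed copy of the list; the id attributes are an addition for this demo.
    static final String HTML =
        "<ul>"
      + "<li><label for=\"511565484\"><img src=\"pic1\"/>Name1</label>"
      + "<a id=\"invite1\" class=\"button_green sendInvite\" href=\"javascript:;\">Invite</a></li>"
      + "<li><label for=\"535963597\"><img src=\"pic2\"/>Name2</label>"
      + "<a id=\"invite2\" class=\"button_green sendInvite\" href=\"javascript:;\">Invite</a></li>"
      + "<li><label for=\"561708219\"><img src=\"pic3\"/>Name3</label>"
      + "<a id=\"invite3\" class=\"button_green sendInvite\" href=\"javascript:;\">Invite</a></li>"
      + "</ul>";

    // Returns the id of the invite link that follows the label with the given text,
    // or an empty string if no such label exists.
    static String inviteIdFor(String name) {
        String expression =
            "//label[text()='" + name + "']/following-sibling::a/@id";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(HTML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inviteIdFor("Name3")); // the link in the third list item
    }
}
```

The label is matched by its text node, and following-sibling::a then moves to the anchor inside the same li, so the position of the item in the list does not matter.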

WebElement element2 = driver.findElement(By.xpath("//label[text()='Name3']/following-sibling::a"));
element2.click();

If XPath isn't the most usable thing for you, you can always do something like this (Ruby implementation of WebDriver, but the idea is the same in any binding):
invite_links = driver.find_elements(:class_name, "sendInvite")
invite_links now contains an array of all matches, in page order, so the next step is easy. Index 2 is the third link, the one for Name3:
invite_links[2].click()
Or the way I'd do it:
driver.find_elements(:class_name, "sendInvite")[2].click
This is a little easier for me to read than XPath, because I don't use XPath that often.

Extracting href from a class within other div/id classes with jsoup

Hello, I am trying to extract the first href inside the "title" class from the following source (this is only part of the page, but I am working with the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all of them have given me a "null" value, for example:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.

The CSS selector div.title returns a <div class="title">, not a link as you seem to expect. If you want an <a class="title">, use the a.title selector instead.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
