HtmlUnit running getbyxpath inside getbyxpath - java

I have html like so:
<div id="myId">
<a href="#">
<div class="name">Item1</div>
<span class="location">32143|2323</span>
</a>
<a href="#">
<div class="name">Item1</div>
<span class="location">32143|2323</span>
</a>
<a href="#">
<div class="name">Item1</div>
<span class="location">32143|2323</span>
</a>
</div>
Using HtmlUnit for Java, I need to grab the name and location of each anchor. I tried calling getByXPath to pick up all the anchors, then looping through them with a for loop and calling getByXPath again on each one to get the name in the div. However, the individual items in the list are apparently Objects, so I am unable to run a second getByXPath.
I also tried running the first getByXPath with "/a/div[@class='name']" and looping through the results, but those are Objects too, and I cannot find the proper method for returning the contents of the divs.

If your XPath selects elements, you can cast the list to the desired type:
List<HtmlElement> divs = (List<HtmlElement>) document.getByXPath("//a/div[@class='name']");
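The "XPath inside XPath" loop from the question can be sketched as follows. This is a minimal illustration that uses the JDK's built-in javax.xml.xpath instead of HtmlUnit, so that it runs without extra dependencies; the class name NestedXPathDemo and the method extract are made up for the example. The point it shows is that the inner expressions have no leading slash, so they are resolved relative to each anchor. In HtmlUnit the analogous step would be to cast each result to HtmlElement and call getByXPath or getFirstByXPath on it with a relative path.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NestedXPathDemo {
    // Returns "name @ location" for every <a> directly under the div with id "myId".
    static List<String> extract(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Outer query: select all anchors once.
            NodeList anchors = (NodeList) xpath.evaluate(
                    "//div[@id='myId']/a",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Node anchor = anchors.item(i);
                // Inner queries: no leading slash, so they start at the current <a>.
                String name = xpath.evaluate("div[@class='name']", anchor);
                String location = xpath.evaluate("span[@class='location']", anchor);
                rows.add(name + " @ " + location);
            }
            return rows;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id=\"myId\">"
                + "<a href=\"#\"><div class=\"name\">Item1</div>"
                + "<span class=\"location\">32143|2323</span></a>"
                + "<a href=\"#\"><div class=\"name\">Item1</div>"
                + "<span class=\"location\">32143|2323</span></a>"
                + "</div>";
        extract(xml).forEach(System.out::println);
    }
}
```

An inner expression that started with // would search the whole document again instead of the current anchor, which is a common reason for getting the same first match on every iteration.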

Is there a way to parse an entire HTML tag in JSoup?

Hi, I'm wondering if there's a way to parse an entire HTML tag using jsoup. In my example pictures below, the five elements (4 images and 1 string) are all inside the "li" container. However, when you open the "li" tag, there are multiple nested containers. Is there a way to parse it so that I have access to all 5 elements contained in this "li" tag? I'm thinking of using getElementsMatchingOwnText("Collins"), but that seems to only get me <span class="text text_14 mix-text_color7">Panorama. Any help would be appreciated, thanks!
Yes, you can iterate over the children of your <li> tag using jsoup.
Here is a simplified version of the HTML in your screenshot, showing the 5 elements:
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
Assuming you have selected this specific <li> tag in your document, you can use the following approach:
String html = "<li><span class=\"foo\"><img src=\"bar\" class=\"img\"></span><span class=\"bar\">Collins</span><i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i><i class=\"baz2\"><img src=\"baz2\" class=\"img\"></i><i class=\"baz3\"><img src=\"baz3\" class=\"img\"></i></li>";
Document document = Jsoup.parse(html);
Element element = document.selectFirst("li");
element.children().forEach(child -> {
    // do your processing here - this is just an example:
    if (child.hasText()) {
        System.out.println(child.text());
    } else {
        System.out.println(child.html());
    }
});
The above code prints the following output:
<img src="bar" class="img">
Collins
<img src="baz1" class="img">
<img src="baz2" class="img">
<img src="baz3" class="img">
UPDATE
If the starting point is a URL, then you would need to start with this:
Document document = Jsoup.connect("https://www...").get();
Then the exercise is about identifying a unique way to find your specific element. So, if we update my earlier example, let's assume your web page is like this:
<html>
<head>...</head>
<body>
<div>
<ul class="vList_4">
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
</ul>
</div>
</body>
</html>
Here we have a class in a <ul> tag called vList_4. If that is a unique class name, we can use it to jump to that section of the HTML page (IDs are better than class names because they are supposed to be unique within a page, but I did not see any ID names in your screenshot).
Now, instead of my previous selector:
Element element = document.selectFirst("li");
We can use this more specific selector:
Element element = document.selectFirst("ul.vList_4 li");
The same results will be printed as before.
So, it's all about you looking at the page structure and figuring out how to jump to the relevant section of the page.
See here for technical details describing how selectors are constructed.

differentiate two html elements with same class

I have this html code below and I want to differentiate between these two PagePostsSectionPagelet as I only want to find web elements from the first PagePostsSectionPagelet. Is there any way I can do it without using <div id="PagePostsSectionPagelet-183102686112-0" as the value will not always be the same?
<div id="PagePostsSectionPagelet-183102686112-0" data-referrer="PagePostsSectionPagelet-183102686112-0">
<div class="_1k4h _5ay5">
<div class="_5sem">
</div>
</div>
<div id="PagePostsSectionPagelet-183102686112-1" class="" data-referrer="PagePostsSectionPagelet-183102686112-1" style="">
<div class="_1k4h _5ay5">
<div class="_5dro _5drq">
<div class="clearfix">
<span class="_5em9 lfloat _ohe _50f4 _50f7">Earlier in 2015</span>
<div id="u_jsonp_3_4e" class="_6a uiPopover rfloat _ohf">
</div>
</div>
<div id="u_jsonp_3_4j" class="_5sem">
<div id="u_jsonp_3_4g" class="_5t6j">
<div class="_1k4h _5ay5">
<div class="_5sem">
</div>
</div>
I tried using //div[@class='_1k4h _5ay5']//div[@class ='_5sem'], but it returns both.
Using //div[@class='_5dro _5drq']//span[contains(@class,'_5em9 lfloat _ohe _50f4 _50f7') and contains(text(), '')] finds the second PagePostsSectionPagelet instead.
You need to use the following XPath:
//div[contains(@class,'_1k4h') and contains(@class,'_5ay5')]
because Selenium does not handle a search for several classes in one attribute well.
I mean that By.className("_1k4h _5ay5") will not find anything in any case, and By.xpath("//div[@class='_1k4h _5ay5']") can also find nothing if the class attribute is "_5ay5 _1k4h" or " _5ay5 _1k4h" (the classes are possibly generated automatically, so their order may differ on page reload).
For the best performance and correctness, though, I think the following XPath is the better choice:
".//div[contains(@id, 'PagePostsSectionPagelet')][1]" -- for first div
".//div[contains(@id, 'PagePostsSectionPagelet')][2]" -- for second div
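The difference between an exact class match and a per-token contains() match can be checked without a browser. The sketch below uses the JDK's javax.xml.xpath rather than Selenium, and the markup in SAMPLE is a hypothetical reduction of the page in which the two class tokens appear in different orders; the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassTokenXPathDemo {
    // Hypothetical markup: the same two class tokens, written in a different order.
    static final String SAMPLE =
            "<root>"
            + "<div id='PagePostsSectionPagelet-1-0' class='_5ay5 _1k4h'/>"
            + "<div id='PagePostsSectionPagelet-1-1' class='_1k4h _5ay5'/>"
            + "</root>";

    // Counts the nodes selected by the given XPath expression.
    static int count(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or input", e);
        }
    }

    // Returns the id of the first pagelet div in document order.
    static String firstPageletId(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(
                    "(//div[contains(@id,'PagePostsSectionPagelet')])[1]/@id",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or input", e);
        }
    }

    public static void main(String[] args) {
        // The exact attribute comparison depends on token order.
        System.out.println(count(SAMPLE, "//div[@class='_1k4h _5ay5']"));
        // Testing each token separately does not.
        System.out.println(count(SAMPLE,
                "//div[contains(@class,'_1k4h') and contains(@class,'_5ay5')]"));
        System.out.println(firstPageletId(SAMPLE));
    }
}
```

Two caveats apply. contains() is a substring test, so a token such as _1k4h would also match a longer class name that merely contains it. And wrapping the path in parentheses before [1] picks the first match in the whole document, whereas //div[...][1] picks the first matching div under each parent; the two agree only when the pagelets are siblings.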
I see that the only dynamic part of the div id is the number, so you can use something like:
WebElement element = driver.findElements(By.xpath("//div[contains(@id,'PagePostsSectionPagelet')]")).get(0);
This will take only the first web element.
Try using a css selector as below and refine further if required.
The code below returns a List of matching WebElements and then you grab the first one in the List.
List<WebElement> listOfElements = driver.findElements(By.cssSelector("div[data-referrer]"));
WebElement myElement = listOfElements.get(0);
Hint: use the Chrome console to test your css and xpath selectors directly. e.g. use
$$("div[data-referrer]") in the console to reveal what will get selected.

Click on Tabs in Selenium webdriver

I am trying to open different sections of a page. Each section opens when its tab is clicked.
Below is HTML Structure of Page
<div id="MainContentPlaceHolder_divMainContent">
<div id="MainContentPlaceHolder_tbCntrViewCase" class="Tab ajax__tab_container ajax__tab_default" style="width: 100%; visibility: visible;">
<div id="MainContentPlaceHolder_tbCntrViewCase_header" class="ajax__tab_header">
<span id="MainContentPlaceHolder_tbCntrViewCase_tbPnlCaseDetails_tab" class="ajax__tab_active">
<span id="MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle_tab" class="ajax__tab_hover">
<span class="ajax__tab_outer">
<span class="ajax__tab_inner">
<a id="__tab_MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle" class="ajax__tab_tab" style="text-decoration:none;" href="#">
<span>Vehicle</span>
</a>
</span>
</span>
</span>
I have written the lines below, but they are not working:
driver.findElement(By.id("__tab_MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle")).click();
driver.findElement(By.xpath("//a[text()='Vehicle']")).click();
I get a "Source Not Found" error.
As per the OP's comments, I am posting the XPaths that can be used to locate the element in question:
1- //span[@id='MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle_tab']//span[.='Vehicle']
This will locate the span element whose text is Vehicle and which is a descendant of the span with id MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle_tab.
OR
2- //span[@id='MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle_tab']//span[.='Vehicle']/..
This will locate the parent of the span element whose text is Vehicle and which is a descendant of the span with id MainContentPlaceHolder_tbCntrViewCase_tbPnlVehicle_tab. In this case the parent is an a element.
Please check if this works for you. Otherwise, let me know how many matching nodes each expression shows when you use it, and we will sort this one out.
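The effect of the trailing /.. step can be checked in isolation. The sketch below again uses the JDK's javax.xml.xpath instead of Selenium; the markup in SAMPLE is a shortened, hypothetical version of the tab header with the id reduced to "tab", and the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ParentStepDemo {
    // Shortened, hypothetical version of the tab markup; line breaks are kept on purpose.
    static final String SAMPLE =
            "<span id='tab'>\n"
            + "  <span class='outer'>\n"
            + "    <span class='inner'>\n"
            + "      <a id='link' href='#'>\n"
            + "        <span>Vehicle</span>\n"
            + "      </a>\n"
            + "    </span>\n"
            + "  </span>\n"
            + "</span>";

    // Returns the element name of the first node the expression selects, or null if none.
    static String firstNodeName(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : node.getNodeName();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or input", e);
        }
    }

    public static void main(String[] args) {
        String inner = "//span[@id='tab']//span[.='Vehicle']";
        System.out.println(firstNodeName(SAMPLE, inner));          // the innermost span
        System.out.println(firstNodeName(SAMPLE, inner + "/.."));  // its parent, the anchor
    }
}
```

Note that .='Vehicle' compares an element's full string value, including whitespace from nested markup. With the line breaks between tags, only the innermost span matches; if the markup were written on a single line, the wrapping spans would match as well, and span[text()='Vehicle'] would be the safer predicate.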

How to retrieve atomic values inside tags in HTMLUnit

I am new to HtmlUnit and I don't know how to get the text inside the [...]
A part of my html file:
<ul ......somethin....>
<li data-role="list-divider" role="heading" style="font-size:16px;" class="ui-bar-f">
INFORMATION_LINE_1
</li>
<li data-theme="d" class="ui-li ui-btn-icon-right ui-btn-up-d ui-odd-match-column ">
<div class="ui-btn-inner ui-li">
<div class="">
<div class="ui-btn-text">
<a href="/x/cxntay/13113/ndzvsssl/g1" class=" ui-link-inherit ui-link-hover">
<h3 class="ui-li-heading">
<span class="xheader">INFORMATION_LINE_2</span>
<span class="label live">INFORMATION_LINE_3</span>
</h3>
<div class="ui-live-scores">
<span class="team1-scores">
<span class="ui-team-name">INFORMATION_LINE_4</span>
<span style="font-weight:bold">INFORMATION_LINE_5</span>
</span>
<span>INFORMATION_LINE_6</span>
</div>
</a>
</div>
</div>
</div>
</li>
</ul>
Now, I want to retrieve "INFORMATION_LINE_X"(1,2...6) in between these tags..
This is what I tried:
List<HtmlUnorderedList> ls = (List<HtmlUnorderedList>) page.getByXPath("/ul");
List<DomNode> dls = ls.get(0).getChildNodes();
System.out.println(dls.get(0).getFirstByXPath("//li[@data-role='list-divider']/text()"));
I just tried to get INFORMATION_LINE_1
But it printed null. I need to get all the INFORMATION_LINES.
It is better to use XPath alone rather than mixing it with HtmlUnit's DOM methods. Something like this should work to get you the first information line:
HtmlElement e = page.getFirstByXPath("//li[@data-role='list-divider']");
System.out.println(e.asText());
To fetch the other information lines, follow the same approach with a different XPath string.
Bear in mind that you should always debug the page by printing the output of page.asXml() and looking at the markup. A real browser does not necessarily show you exactly what HtmlUnit sees, and you may run into differences, particularly if the page executes JavaScript.
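As an illustration of "same approach, different XPath", a single expression can select every non-blank text node under the list. The sketch below uses the JDK's javax.xml.xpath rather than HtmlUnit so that it runs on its own; SAMPLE is a trimmed, hypothetical version of the markup from the question, and the class and method names are made up for the example. Whether the same expression behaves identically on a real page loaded in HtmlUnit is an assumption worth checking against page.asXml().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodesDemo {
    // Trimmed, hypothetical version of the list from the question.
    static final String SAMPLE =
            "<html><body><ul>"
            + "<li data-role='list-divider'>\nINFORMATION_LINE_1\n</li>"
            + "<li><a href='#'><h3>"
            + "<span class='xheader'>INFORMATION_LINE_2</span>"
            + "<span class='label live'>INFORMATION_LINE_3</span>"
            + "</h3></a></li>"
            + "</ul></body></html>";

    // Collects the trimmed content of every non-blank text node below any <ul>.
    static List<String> informationLines(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The predicate [normalize-space()] drops whitespace-only text nodes.
            NodeList texts = (NodeList) xpath.evaluate(
                    "//ul//text()[normalize-space()]",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < texts.getLength(); i++) {
                lines.add(texts.item(i).getNodeValue().trim());
            }
            return lines;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or input", e);
        }
    }

    public static void main(String[] args) {
        informationLines(SAMPLE).forEach(System.out::println);
    }
}
```

The expression starts with //ul rather than /ul because a path with a single leading slash only matches an element at the document root, and the list here is nested inside html and body.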

Extracting href from a class within other div/id classes with jsoup

Hello I am trying to extract the first href from within the "title" class from the following source (the source is only part of the whole page however I am using the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all of them have given me a "null" value, for example:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.
The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
