Jsoup nested values inside list item - java

I have an HTML page from which I'm trying to extract the Logname value. I can get all the li text joined together as one string, but that is not quite what I want. I'd like just the second part of the Logname li, that is, the text after the </span>. Is there an easy way to get that? With what I have, I could do a split and get what I want, but it seems like there should be a more elegant way.
Current code
Elements detail = mHtml.select("div.alpha-first");
for (Element items : detail)
{
    // text() returns the combined text of the whole div, labels and values together
    Log.d (TAG, " label text " + items.text());
    if (items.text().equals ("ACID"))
    {
        Log.d (TAG, " got ACID ");
    }
}
HTML
<html>
<title>emp id chart</title>
<body>
<div class="alpha-first">
<ul class="account-detail">
<li><span class="label">ID</span>42</li>
<li><span class="label">Logname</span>George</li>
<li><span class="label">Surname</span>Glass</li>
<li><span class="label">ACID</span>15</li>
<li><span class="label">Dept</span>101348</li>
<li><span class="label">Empclass</span>Echo</li>
</ul>
<p class="last-swipe">3 Apr 9:53</p><br> </div>
<div class="detail-last-loc">
<p style="font-size: 8pt;">Current status</p>
<p class="current-location">Bldg #23 South Lot</p>
<p> current time 10:43 <br /></p>
<div class="detail-extra">
<p>More | 3 Day History</p>
</div>
</div>
</body>
</html>

From what I understood, given your example, you would want to obtain from: <li><span class="label">Logname</span>George</li>, the value: George.
You really don't need to iterate, you can get it directly. I would not go so far as to call this code elegant, but still, here it is:
//Select the <span> element with the text "Logname"
Elements select = mHtml.select(".account-detail span.label:contains(Logname)");
//Get the element itself, since the select returns a list
Element lognameSpan = select.get(0);
//Get the <li> parent of the <span>
Element parent = lognameSpan.parent();
//Access the text node of the <li> directly since there is only one
String logname = parent.textNodes().get(0).text();
Hope it helps.
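A variation on the same idea, as a minimal sketch rather than a definitive implementation: jsoup's :contains matches substrings case-insensitively, so comparing the label text exactly and then reading ownText() of the parent <li> avoids accidental matches and avoids an exception when the label is missing. It assumes jsoup is on the classpath; the class name LognameExample and the helper valueForLabel are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LognameExample {
    // Returns the text that follows the label <span> inside the matching <li>,
    // or null when no label with exactly that text exists.
    static String valueForLabel(Document doc, String label) {
        for (Element span : doc.select("ul.account-detail li > span.label")) {
            if (span.text().equals(label)) {
                // ownText() returns only the <li>'s own text, not the <span>'s.
                return span.parent().ownText();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"account-detail\">"
                + "<li><span class=\"label\">ID</span>42</li>"
                + "<li><span class=\"label\">Logname</span>George</li>"
                + "</ul>";
        Document doc = Jsoup.parse(html);
        System.out.println(valueForLabel(doc, "Logname"));
    }
}
```

The same helper can be reused for the other labels, for example valueForLabel(doc, "ACID").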

Related

Is there a way to parse an entire HTML tag in JSoup?

Hi I'm wondering if there's a way to parse an entire HTML tag using JSoup? In my example pictures below, the five elements (4 images and 1 string) are all inside the "li" container. However, when you open the "li" tag, there are multiple nested containers. Is there a way to parse it so that I have access to all 5 elements contained in this "li" tag? I'm thinking of using getElementsMatchingOwnText("Collins") but that seems to only get me "span class="text text_14 mix-text_color7">Panorama". Any help would be appreciated, thanks!
Yes, you can iterate over the children of your <li> tag using jsoup.
Here is a simplified version of the HTML in your screenshot, showing the 5 elements:
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
Assuming you have selected this specific <li> tag in your document, you can use the following approach:
String html = "<li><span class=\"foo\"><img src=\"bar\" class=\"img\"></span><span class=\"bar\">Collins</span><i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i><i class=\"baz2\"><img src=\"baz2\" class=\"img\"></i><i class=\"baz3\"><img src=\"baz3\" class=\"img\"></i></li>";
Document document = Jsoup.parse(html);
Element element = document.selectFirst("li");
element.children().forEach(child -> {
// do your processing here - this is just an example:
if (child.hasText()) {
System.out.println(child.text());
} else {
System.out.println(child.html());
}
});
The above code prints the following output:
<img src="bar" class="img">
Collins
<img src="baz1" class="img">
<img src="baz2" class="img">
<img src="baz3" class="img">
UPDATE
If the starting point is a URL, then you would need to start with this:
Document document = Jsoup.connect("https://www...").get();
Then the exercise is about identifying a unique way to find your specific element. So, if we update my earlier example, let's assume your web page is like this:
<html>
<head>...</head>
<body>
<div>
<ul class="vList_4">
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
</ul>
</div>
</body>
</html>
Here we have a class in a <ul> tag called vList_4. If that is a unique class name, we can use it to jump to that section of the HTML page (IDs are better than class names because they are guaranteed to be unique - but I did not see any ID names in your screenshot).
Now, instead of my previous selector:
Element element = document.selectFirst("li");
We can use this more specific selector:
Element element = document.selectFirst("ul.vList_4 li");
The same results will be printed as before.
So, it's all about you looking at the page structure and figuring out how to jump to the relevant section of the page.
See here for technical details describing how selectors are constructed.
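If the goal is specifically the image sources plus the one string, a narrower variant of the loop above is to select the <img> elements inside the <li> and read the text separately. This is only a sketch under assumptions: jsoup is on the classpath, the markup is the simplified HTML from this answer (shortened to two images), and the names ListItemParts and imageSources are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListItemParts {
    // Collects the src attribute of every <img> nested anywhere inside the element.
    static List<String> imageSources(Element li) {
        List<String> sources = new ArrayList<>();
        for (Element img : li.select("img")) {
            sources.add(img.attr("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"vList_4\"><li>"
                + "<span class=\"foo\"><img src=\"bar\" class=\"img\"></span>"
                + "<span class=\"bar\">Collins</span>"
                + "<i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i>"
                + "</li></ul>";
        Document document = Jsoup.parse(html);
        Element li = document.selectFirst("ul.vList_4 li");
        if (li != null) { // selectFirst returns null when nothing matches
            System.out.println(imageSources(li));
            System.out.println(li.text());
        }
    }
}
```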

href attribute of a li element coming up as null when there is data for the href

I am trying to iterate over a collection of li elements in an ordered list and then print the URL from each element.
The HTML:
<div id="global-atoz-navigation">
<nav role="navigation" aria-label="Suppliers">
<ol>
<li class="selected">
<span class="visuallyhidden">Suppliers starting with </span>
<strong>A</strong>
</li>
<li>
<span class="visuallyhidden">Suppliers starting with </span>
B
</li>
<li>
<span class="visuallyhidden">Suppliers starting with </span>
C
</li>
</ol>
</nav>
</div>
My code so far:
WebElement navDiv = driver.findElement(By.id("global-atoz-navigation"));
List<WebElement> links = navDiv.findElements(By.tagName("li"));
for (WebElement link : links) {
System.out.println(link.getAttribute("href"));
}
But for some reason this is printing 'null' for each li element. Any ideas why?
Your <li> elements don't have an href attribute. The <a> elements inside them do. Is there a reason you wouldn't do navDiv.findElements(By.tagName("a")) to find those instead? Or, if there could be anchors you want to avoid, either get all the anchors and filter the bad ones out, or get all the list items and do another li.findElements(By.tagName("a")) on each one.
Please note that <li> isn't short for "link". It's short for "list item".
Links are in the "anchor", or <a>, element.
Referencing from the answer here:
JSP getAttribute() returning null
"You are not typecasting it to String. request.getAttribute() will return an Object."
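Try using this and see if it works (note that Selenium's WebElement.getAttribute() is already declared to return a String, so the cast is likely redundant here):
String value = (String)link.getAttribute("href");
ErikE's answer is also correct: you should be reading the attribute from the <a> tags, not the <li> tags.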

How can I retrieve data from html using Jsoup

I'm new to HTML and I'm trying to learn a little about the HTML tags by trying to retrieve data from an HTML String.
<li>
<div class="item" data-youtube_code="code_for_youtuber" data-feature_code="data" data-feature_url="/movies/Truman">
<div class="title">
<span>the title of the video</span>
</div>
<div class="image">
<img src="/media/image.png" data-src="http://url_of_image.jpg" alt="">
</div>
</div> </li>
I'm using the Java Jsoup library, and so far I've managed to extract the <span> content using:
Document doc = Jsoup.connect("http://www.yesplanet.co.il/movies").get();
System.out.println(doc.html());
// text() returns a String, not Elements
String text = doc.select(".item").text();
How can I get other things, such as the data-youtube_code and the img src?
Edit:
For example:
System.out.println("doc...data-youtube_code"); // some code that retrieves
// data-youtube_code. The output will be "code_for_youtuber"
System.out.println("data-src"); // some code that retrieves
// data-src. The output will be "http://url_of_image.jpg"
You can simply select the first div and read the value from its attribute (here s is the HTML string):
Element elements = Jsoup.parse(s).select("div").first();
System.out.println(elements.attr("data-youtube_code"));
Output:
code_for_youtuber
EDIT :
Element elements = Jsoup.parse(s).select(".item").first();
System.out.println(elements.attr("data-youtube_code"));
Element element1 = elements.select(".image img").first();
System.out.println(element1.attr("data-src"));
Output:
code_for_youtuber
http://url_of_image.jpg
Since you are a beginner, I suggest you take a look at this link
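The attribute lookups above can be wrapped so that a missing element does not cause a NullPointerException. The following is a minimal sketch, assuming jsoup is on the classpath and the HTML from the question; DataAttributes and attrOfFirst are hypothetical names. It also shows Element.dataset(), which exposes the data-* attributes as a map keyed without the "data-" prefix.

```java
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DataAttributes {
    // Returns the given attribute of the first element matching the selector,
    // or an empty string when nothing matches.
    static String attrOfFirst(Document doc, String cssQuery, String attribute) {
        Element element = doc.selectFirst(cssQuery);
        return element == null ? "" : element.attr(attribute);
    }

    public static void main(String[] args) {
        String html = "<ul><li><div class=\"item\" data-youtube_code=\"code_for_youtuber\""
                + " data-feature_url=\"/movies/Truman\">"
                + "<div class=\"title\"><span>the title of the video</span></div>"
                + "<div class=\"image\"><img src=\"/media/image.png\""
                + " data-src=\"http://url_of_image.jpg\" alt=\"\"></div>"
                + "</div></li></ul>";
        Document doc = Jsoup.parse(html);

        System.out.println(attrOfFirst(doc, ".item", "data-youtube_code"));
        System.out.println(attrOfFirst(doc, ".item .image img", "data-src"));
        System.out.println(attrOfFirst(doc, ".item .image img", "src"));

        // dataset() lists all data-* attributes of one element.
        Element item = doc.selectFirst(".item");
        if (item != null) {
            for (Map.Entry<String, String> entry : item.dataset().entrySet()) {
                System.out.println(entry.getKey() + " = " + entry.getValue());
            }
        }
    }
}
```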

Jsoup: How to select direct parents until the root without their siblings?

I'm trying to get all direct parents of an element, but I also get their siblings.
For example, I have this DOM structure...
<div class="html">
<div class="head"></div>
<div class="body">
seznam
<h2>Foo</h2>
google
<p>
<img class="first">
</p>
<img class="second">
<ol>
<li>1</li>
<li>2</li>
</ol>
</div>
</div>
So I want to get all direct parents of the img element with class first, up to the div with class html.
I've tried using the following code
Element element = document.select("img").first();
Node root = element.root();
But the root variable then holds the whole DOM structure, including all siblings.
UPDATE
After this, the root variable again holds the whole DOM structure:
<div class="html">
<div class="head"></div>
<div class="body">
seznam
<h2>Foo</h2>
google
<p>
<img class="first">
</p>
<img class="second">
<ol>
<li>1</li>
<li>2</li>
</ol>
</div>
</div>
But I want something like this:
<div class="html">
<div class="body">
<p>
<img class="first">
</p>
</div>
</div>
If you are interested in the path only, use Element.cssSelector().
It gives you the whole DOM path, e.g. html > body > img
The "path" returned by Darshit Chokshi's approach is not unique.
First of all get all elements with class name 'first' using,
Elements childs = document.getElementsByClass("first");
Now, iterate all child elements to get their parent elements using,
for( Element child : childs){
Elements parents = child.parents();
for(Element parent: parents){
System.out.println(parent.tagName());
}
}
Try this, Hope it will work for you ;)
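Neither answer produces the pruned tree shown in the question. One possible way to build it, offered as a sketch under assumptions rather than an established recipe: walk parents() from the element upward and nest shallow copies of each ancestor, stopping at the requested container. It assumes a jsoup version that provides Element.shallowClone() (a copy with attributes but without child nodes); AncestorChain and ancestorChain are invented names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AncestorChain {
    // Builds a new tree containing only 'start' and its ancestors up to the
    // first ancestor matching stopSelector. Siblings are never copied.
    // If no ancestor matches, the chain runs up to the root element.
    static Element ancestorChain(Element start, String stopSelector) {
        Element current = start.shallowClone();
        for (Element parent : start.parents()) { // closest ancestor first
            Element copy = parent.shallowClone(); // attributes kept, children dropped
            copy.appendChild(current);
            current = copy;
            if (parent.is(stopSelector)) {
                break;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<div class=\"html\"><div class=\"head\"></div><div class=\"body\">"
                + "<h2>Foo</h2><p><img class=\"first\"></p><img class=\"second\">"
                + "<ol><li>1</li><li>2</li></ol></div></div>";
        Document document = Jsoup.parse(html);
        Element first = document.selectFirst("img.first");
        if (first != null) {
            System.out.println(ancestorChain(first, "div.html").outerHtml());
        }
    }
}
```

The original document is left untouched, because only copies are appended to the new chain.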

Extracting href from a class within other div/id classes with jsoup

Hello I am trying to extract the first href from within the "title" class from the following source (the source is only part of the whole page however I am using the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all have given me a "null" value, such as:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.
The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title"> then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">.
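To restrict the match to direct children only, the child combinator > can be used instead of a space. The sketch below also guards against a missing link and shows absUrl resolving a relative href against a base URI. It assumes jsoup is on the classpath; TitleLink, firstTitleHref, and the example.com URL are placeholders, not part of the original page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleLink {
    // Returns the absolute URL of the first <a class="title"> that is a direct
    // child of a <div class="title">, or null when there is no such link.
    static String firstTitleHref(Document doc) {
        Element link = doc.selectFirst("div.title > a.title");
        return link == null ? null : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<div class=\"data\"><div class=\"title\">"
                + "<a class=\"title titleHover\" href=\"/dp/0006754023\">The Hobbit</a>"
                + "<span class=\"ptBrand\">by J. R. R. Tolkien</span>"
                + "</div></div>";
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(html, "http://www.example.com/");
        System.out.println(firstTitleHref(doc));
    }
}
```

When the document comes from Jsoup.connect(url).get(), the base URI is set from the fetched URL, so absUrl works without passing it explicitly.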
