Jsoup: How to select direct parents until the root without their siblings? - java

I'm trying to get all direct parents of an element, but I also get their siblings.
For example, I have this DOM structure...
<div class="html">
<div class="head"></div>
<div class="body">
seznam
<h2>Foo</h2>
google
<p>
<img class="first">
</p>
<img class="second">
<ol>
<li>1</li>
<li>2</li>
</ol>
</div>
</div>
So I want to get all direct parents of the img element with class first, up to the div with class html.
I've tried using the following code:
Element element = document.select("img").first();
Node root = element.root();
But the root variable then holds the whole DOM structure, including all siblings.
UPDATE
After this in root var I have the whole DOM structure again:
<div class="html">
<div class="head"></div>
<div class="body">
seznam
<h2>Foo</h2>
google
<p>
<img class="first">
</p>
<img class="second">
<ol>
<li>1</li>
<li>2</li>
</ol>
</div>
</div>
But I want something like this:
<div class="html">
<div class="body">
<p>
<img class="first">
</p>
</div>
</div>

If you are interested in the path only, use Element.cssSelector().
It gives you the whole DOM path, e.g. html > body > img
The "path" returned by Darshit Chokshi's approach is not unique.

First of all, get all elements with the class name 'first' using:
Elements childs = document.getElementsByClass("first");
Now iterate over the child elements to get their parent elements using:
for (Element child : childs) {
    // parents() returns the ancestors only, nearest first, without their siblings
    Elements parents = child.parents();
    for (Element parent : parents) {
        System.out.println(parent.tagName());
    }
}
Try this; I hope it works for you ;)
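Neither approach above builds the pruned tree shown at the end of the question. A minimal sketch of one way to get it, assuming a jsoup version that provides Element.shallowClone() (1.11.1 or later): walk parents() from the nearest ancestor outwards and wrap a copy of the branch in a childless copy of each ancestor. The class name AncestorChain and the helper chainTo are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AncestorChain {

    /**
     * Rebuilds only the branch from start up to the first ancestor that
     * matches stopSelector. Each ancestor is copied without its children,
     * so siblings never end up in the result.
     */
    static Element chainTo(Element start, String stopSelector) {
        Element branch = start.clone();               // detached copy of the start element
        for (Element parent : start.parents()) {      // nearest ancestor first
            Element shell = parent.shallowClone();    // same tag and attributes, no children
            shell.appendChild(branch);
            branch = shell;
            if (parent.is(stopSelector)) {
                break;                                // do not climb past the requested root
            }
        }
        return branch;
    }

    public static void main(String[] args) {
        String html = "<div class=\"html\"><div class=\"head\"></div><div class=\"body\">"
                + "seznam<h2>Foo</h2>google<p><img class=\"first\"></p>"
                + "<img class=\"second\"><ol><li>1</li><li>2</li></ol></div></div>";
        Document document = Jsoup.parse(html);
        Element img = document.selectFirst("img.first");
        // Prints the div.html > div.body > p > img.first chain only
        System.out.println(chainTo(img, "div.html").outerHtml());
    }
}
```

The original document is left untouched because only clones are re-parented. If no ancestor matches the stop selector, the loop simply runs up to the document's html element.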

Related

Is there a way to parse an entire HTML tag in JSoup?

Hi I'm wondering if there's a way to parse an entire HTML tag using JSoup? In my example pictures below, the five elements (4 images and 1 string) are all inside the "li" container. However, when you open the "li" tag, there are multiple nested containers. Is there a way to parse it so that I have access to all 5 elements contained in this "li" tag? I'm thinking of using getElementsMatchingOwnText("Collins") but that seems to only get me "span class="text text_14 mix-text_color7">Panorama". Any help would be appreciated, thanks!
Yes, you can iterate over the children of your <li> tag using jsoup.
Here is a simplified version of the HTML in your screenshot, showing the 5 elements:
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
Assuming you have selected this specific <li> tag in your document, you can use the following approach:
String html = "<li><span class=\"foo\"><img src=\"bar\" class=\"img\"></span><span class=\"bar\">Collins</span><i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i><i class=\"baz2\"><img src=\"baz2\" class=\"img\"></i><i class=\"baz3\"><img src=\"baz3\" class=\"img\"></i></li>";
Document document = Jsoup.parse(html);
Element element = document.selectFirst("li");
element.children().forEach(child -> {
    // do your processing here - this is just an example:
    if (child.hasText()) {
        System.out.println(child.text());
    } else {
        System.out.println(child.html());
    }
});
The above code prints the following output:
<img src="bar" class="img">
Collins
<img src="baz1" class="img">
<img src="baz2" class="img">
<img src="baz3" class="img">
UPDATE
If the starting point is a URL, then you would need to start with this:
Document document = Jsoup.connect("https://www...").get();
Then the exercise is about identifying a unique way to find your specific element. So, if we update my earlier example, let's assume your web page is like this:
<html>
<head>...</head>
<body>
<div>
<ul class="vList_4">
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
</ul>
</div>
</body>
</html>
Here we have a class in a <ul> tag called vList_4. If that is a unique class name, we can use it to jump to that section of the HTML page (IDs are better than class names because they are supposed to be unique within a page - but I did not see any ID names in your screenshot).
Now, instead of my previous selector:
Element element = document.selectFirst("li");
We can use this more specific selector:
Element element = document.selectFirst("ul.vList_4 li");
The same results will be printed as before.
So, it's all about you looking at the page structure and figuring out how to jump to the relevant section of the page.
See here for technical details describing how selectors are constructed.

How can I retrieve data from html using Jsoup

I'm new to HTML and I'm trying to learn a little about the HTML tags by trying to retrieve data from an HTML String.
<li>
<div class="item" data-youtube_code="code_for_youtuber" data-feature_code="data" data-feature_url="/movies/Truman">
<div class="title">
<span>the title of the video</span>
</div>
<div class="image">
<img src="/media/image.png" data-src="http://url_of_image.jpg" alt="">
</div>
</div> </li>
I'm using the Java Jsoup library and so far I've managed to extract the <span> content using:
Document doc = Jsoup.connect("http://www.yesplanet.co.il/movies").get();
System.out.println(doc.html());
// text() returns a String, not Elements
String text = doc.select(".item").text();
How can I get other things, such as the data-youtube_code and the img src?
Edit:
For example:
System.out.println("doc...data-youtube_code"); // some code that retrieves
// data-youtube_code. The output will be "code_for_youtuber"
System.out.println("data-src");
// some code that retrieves
// data-src. The output will be "http://url_of_image.jpg"
You can simply select the first div and read the value from its attribute:
Element elements = Jsoup.parse(s).select("div").first();
System.out.println(elements.attr("data-youtube_code"));
Output:
code_for_youtuber
EDIT :
Element elements = Jsoup.parse(s).select(".item").first();
System.out.println(elements.attr("data-youtube_code"));
Element element1 = elements.select(".image img").first();
System.out.println(element1.attr("data-src"));
Output:
code_for_youtuber
http://url_of_image.jpg
Since you are a beginner, I suggest you have a look at this link

Retrieving the contents of an html label using XPath

I have the following html code:
<div id="ipsLayout_contentArea">
<div class="preContentPadding">
<div id="ipsLayout_contentWrapper">
<div id="ipsLayout_mainArea">
<a id="elContent"></a>
<div class="cWidgetContainer " data-widgetarea="header" data-orientation="horizontal" data-role="widgetReceiver" data-controller="core.front.widgets.area">
<div class="ipsPageHeader ipsClearfix">
<div class="ipsClearfix">
<div class="cTopic ipsClear ipsSpacer_top" data-feedid="topic-100269" data-lastpage="" data-baseurl="https://forum.com/forum/topic/100269-topic/" data-autopoll="" data-controller="core.front.core.commentFeed,forums.front.topic.view">
<div class="" data-controller="core.front.core.moderation" data-role="commentFeed">
<form data-role="moderationTools" data-ipspageaction="" method="post" action="https://forum.com/forum/topic/100269-topic/?csrfKey=b092dccccee08fdbc06c26d350bf3c2b&do=multimodComment">
<a id="comment-626016"></a>
<article id="elComment_626016" class="cPost ipsBox ipsComment ipsComment_parent ipsClearfix ipsClear ipsColumns ipsColumns_noSpacing ipsColumns_collapsePhone " itemtype="http://schema.org/Comment" itemscope="">
<aside class="ipsComment_author cAuthorPane ipsColumn ipsColumn_medium">
<div class="ipsColumn ipsColumn_fluid">
<div id="comment-626016_wrap" class="ipsComment_content ipsType_medium ipsFaded_withHover" data-quotedata="{"userid":3859,"username":"Admin","timestamp":1453221383,"contentapp":"forums","contenttype":"forums","contentid":100269,"contentclass":"forums_Topic","contentcommentid":626016}" data-commentid="626016" data-commenttype="forums" data-commentapp="forums" data-controller="core.front.core.comment">
<div class="ipsComment_meta ipsType_light">
<div class="cPost_contentWrap ipsPad">
<div class="ipsType_normal ipsType_richText ipsContained" data-controller="core.front.core.lightboxedImages" itemprop="text" data-role="commentContent">
<p> Hi, </p>
<p> </p>
<p> This is a post with multiple </p>
<p> lines of text </p>
I am trying to get the contents (in plain text) of the post. The XPath I'm currently using:
//div[@id='ipsLayout_contentArea']/div[2]/div/div[4]/div/form/article/div/div/div[2]/div//text()
retrieves each line of each post separately (as delimited by <p></p>). How can I get the whole contents of the post, i.e. everything inside:
<div class="ipsType_normal ipsType_richText ipsContained" data-controller="core.front.core.lightboxedImages" itemprop="text" data-role="commentContent"> Post content </div>
as plain text, so that <p></p> (and any other tags the post might include) is treated as text?
Edit:
I'm using the following XPath:
//div[@id='ipsLayout_contentArea']/div[2]/div/div[4]/div/form/article/div/div/div[2]/div
to retrieve the div that contains the body of the post.
// forumTemplate.getXpathElements().get(forumTemplate.XPATH_GET_THREAD_POSTS) = //div[@id='ipsLayout_contentArea']/div[2]/div/div[4]/div/form/article/div/div/div[2]/div
List<DomNode> posts = (List<DomNode>) firstPage.getByXPath(forumTemplate.getXpathElements().get(forumTemplate.XPATH_GET_THREAD_POSTS));
for (DomNode post : posts) {
    // Retrieve the contents of the post as a string
    String postContentStr = post.getNodeValue();
}
The variable postContentStr is always null. Why?
You specified //text(), which gets all text nodes under the specified path recursively. Depending on what you use, this could work better:
//div[@data-role='commentContent']
That will match the div holding the comment content that you are trying to get. If you evaluate the expression in code, you can go on from there. Don't match text(), though; that will not match any of the <p> tags.
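As for why the value is null: in the W3C DOM, getNodeValue() is defined to return null for element nodes, while getTextContent() returns the concatenated text of all descendants. The sketch below illustrates this with the JDK's built-in javax.xml.xpath instead of HtmlUnit, which is a substitution made only to keep the example self-contained. The trimmed-down markup and the class name PostText are assumptions for illustration. HtmlUnit's DomNode implements the same org.w3c.dom.Node interface, so the same two calls should behave alike there.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PostText {

    // Simplified, well-formed stand-in for the forum page in the question
    static final String XML =
            "<div id='ipsLayout_contentArea'>"
          + "<div data-role='commentContent'>"
          + "<p> Hi, </p><p> This is a post with multiple </p><p> lines of text </p>"
          + "</div></div>";

    /** Selects the element itself, not its text() nodes. */
    static Node findPost() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Node) xpath.evaluate(
                    "//div[@data-role='commentContent']", doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        Node post = findPost();
        System.out.println(post.getNodeValue());   // prints null: element nodes have no node value
        System.out.println(post.getTextContent()); // text of all <p> children joined together
    }
}
```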

How to access a specific child div using xpath? (Selenium Java)

I have the following html code:
<div class="panel">
<div class = "heading">
<span class="wName">Name</span>
<div class="foo1" style="display: none;"></div>
<div class="foo2" style="display: none;"></div>
</div>
</div>
I already located element panel and I'm trying to test when foo2 doesn't appear with the following line of code:
if (panel.findElement(By.xpath("../div[@class='foo2']")).getCssValue("display").equals("none"))
I'm not sure why this won't retrieve the element properly.
Your XPath is wrong! .. means "parent of". A single dot . means relative to the current location.
Try: panel.findElement(By.xpath(".//div[@class='foo2']"))
How about using the descendant axis:
panel.findElement(By.xpath("//div[@class='panel']/descendant::div[@class='foo2']"));
Source http://www.caucho.com/resin-3.1/doc/xpath.xtp#descendant
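The difference between .. and .// can be checked without a browser. The following is a minimal sketch using the JDK's javax.xml.xpath rather than Selenium, so it only demonstrates how the expressions resolve relative to a context node. The wrapping root element and the class name RelativeXPath are assumptions added for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPath {

    // The markup from the question, wrapped in a root element
    static final String XML =
            "<root><div class='panel'><div class='heading'>"
          + "<span class='wName'>Name</span>"
          + "<div class='foo1' style='display: none;'/>"
          + "<div class='foo2' style='display: none;'/>"
          + "</div></div></root>";

    /** Counts the nodes matched by expression when evaluated from the panel div. */
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node panel = (Node) xpath.evaluate(
                    "//div[@class='panel']", doc, XPathConstants.NODE);
            NodeList hits = (NodeList) xpath.evaluate(
                    expression, panel, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // ".." climbs to the parent of panel and looks at its direct div children
        System.out.println(count("../div[@class='foo2']"));  // prints 0
        // ".//" searches all descendants of panel itself
        System.out.println(count(".//div[@class='foo2']"));  // prints 1
    }
}
```

Note that an expression starting with // is evaluated from the document root even when it is run against an element, so the descendant-axis variant above searches the whole page rather than only the panel that was already located.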

Extracting href from a class within other div/id classes with jsoup

Hello I am trying to extract the first href from within the "title" class from the following source (the source is only part of the whole page however I am using the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and getElementsByClass, but all of them have given me a "null" value, for example:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.
The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
This will return the first <a class="title"> which is a descendant of a <div class="title">. To require a direct child, use div.title > a.title instead.
