Retrieving the contents of an HTML tag using XPath - Java

I have the following HTML code:
<div id="ipsLayout_contentArea">
<div class="preContentPadding">
<div id="ipsLayout_contentWrapper">
<div id="ipsLayout_mainArea">
<a id="elContent"></a>
<div class="cWidgetContainer " data-widgetarea="header" data-orientation="horizontal" data-role="widgetReceiver" data-controller="core.front.widgets.area">
<div class="ipsPageHeader ipsClearfix">
<div class="ipsClearfix">
<div class="cTopic ipsClear ipsSpacer_top" data-feedid="topic-100269" data-lastpage="" data-baseurl="https://forum.com/forum/topic/100269-topic/" data-autopoll="" data-controller="core.front.core.commentFeed,forums.front.topic.view">
<div class="" data-controller="core.front.core.moderation" data-role="commentFeed">
<form data-role="moderationTools" data-ipspageaction="" method="post" action="https://forum.com/forum/topic/100269-topic/?csrfKey=b092dccccee08fdbc06c26d350bf3c2b&do=multimodComment">
<a id="comment-626016"></a>
<article id="elComment_626016" class="cPost ipsBox ipsComment ipsComment_parent ipsClearfix ipsClear ipsColumns ipsColumns_noSpacing ipsColumns_collapsePhone " itemtype="http://schema.org/Comment" itemscope="">
<aside class="ipsComment_author cAuthorPane ipsColumn ipsColumn_medium">
<div class="ipsColumn ipsColumn_fluid">
<div id="comment-626016_wrap" class="ipsComment_content ipsType_medium ipsFaded_withHover" data-quotedata="{"userid":3859,"username":"Admin","timestamp":1453221383,"contentapp":"forums","contenttype":"forums","contentid":100269,"contentclass":"forums_Topic","contentcommentid":626016}" data-commentid="626016" data-commenttype="forums" data-commentapp="forums" data-controller="core.front.core.comment">
<div class="ipsComment_meta ipsType_light">
<div class="cPost_contentWrap ipsPad">
<div class="ipsType_normal ipsType_richText ipsContained" data-controller="core.front.core.lightboxedImages" itemprop="text" data-role="commentContent">
<p> Hi, </p>
<p> </p>
<p> This is a post with multiple </p>
<p> lines of text </p>
and am trying to get the contents (in plaintext) of the post. The XPath I'm currently using:
//div[@id='ipsLayout_contentArea']/div[2]/div/div[4]/div/form/article/div/div/div[2]/div//text()
retrieves each line of each post separately (as delimited by <p></p>). How can I get the whole contents of the post (inside:
<div class="ipsType_normal ipsType_richText ipsContained" data-controller="core.front.core.lightboxedImages" itemprop="text" data-role="commentContent"> Post content </div>),
as plain text, so that <p></p> and any other tags the post might include are treated as text?
Edit:
I'm using the following XPath:
//div[@id='ipsLayout_contentArea']/div[2]/div/div[4]/div/form/article/div/div/div[2]/div
to retrieve the div that contains the body of the post.
// forumTemplate.getXpathElements().get(forumTemplate.XPATH_GET_THREAD_POSTS) = //div[@id='ipsLayout_contentArea']/div[2]/div/div[4]/div/form/article/div/div/div[2]/div
List<DomNode> posts = (List<DomNode>) firstPage.getByXPath(forumTemplate.getXpathElements().get(forumTemplate.XPATH_GET_THREAD_POSTS));
for (DomNode post : posts) {
    // Retrieve the contents of the post as a string
    String postContentStr = post.getNodeValue();
}
The variable postContentStr is always null. Why?

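You specified //text(), which selects every text node under the given path recursively, so each line comes back as a separate node. Depending on what you use, this could work better:
//div[@data-role='commentContent']
That matches the element that holds the comment. If you evaluate the XPath from code, you can go from there. Don't select text() though, since that will not return any of the <p> tags. As for the null: getNodeValue() is null for element nodes by definition; only text, attribute, and similar nodes carry a node value, so read the element's text content or its serialized markup instead.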
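To illustrate the point about element nodes, here is a minimal sketch that uses only the JDK's built-in DOM and XPath classes rather than HtmlUnit. The markup is a cut-down, well-formed stand-in for the post in the question, and the names PostTextSketch, selectPost and markupOf are made up for this example. HtmlUnit's DomNode implements the same org.w3c.dom.Node interface, so getNodeValue() and getTextContent() should behave the same way there, and its asXml() method plays the role of markupOf below.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PostTextSketch {

    // Assumption: a well-formed stand-in for the real page. Real-world HTML
    // usually needs an HTML-tolerant parser (HtmlUnit, Jsoup) instead.
    static final String SAMPLE =
        "<div id=\"wrap\">"
      + "<div data-role=\"commentContent\">"
      + "<p> Hi, </p><p> </p><p> This is a post with multiple </p><p> lines of text </p>"
      + "</div></div>";

    /** Returns the first element matched by the XPath, or null if nothing matches. */
    static Node selectPost(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@data-role='commentContent']", doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the node, tags included, so that <p> shows up as text. */
    static String markupOf(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node post = selectPost(SAMPLE);
        System.out.println(post.getNodeValue());   // null: element nodes have no node value
        System.out.println(post.getTextContent()); // text of all descendants, tags dropped
        System.out.println(markupOf(post));        // the div with its <p> children as a string
    }
}
```

Which of the two you want depends on the goal: getTextContent() if the tags should disappear, the serialized form if the tags themselves should be kept as text.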

Related

How can I retrieve data from html using Jsoup

I'm new to HTML and I'm trying to learn a little about the HTML tags by trying to retrieve data from an HTML String.
<li>
<div class="item" data-youtube_code="code_for_youtuber" data-feature_code="data" data-feature_url="/movies/Truman">
<div class="title">
<span>the title of the video</span>
</div>
<div class="image">
<img src="/media/image.png" data-src="http://url_of_image.jpg" alt="">
</div>
</div> </li>
I'm using the Java Jsoup library, and so far I've managed to extract the <span> content using:
Document doc = Jsoup.connect("http://www.yesplanet.co.il/movies").get();
System.out.println(doc.html());
String text = doc.select(".item").text(); // text() returns a String, not Elements
How can I get other things, such as the data-youtube_code and the img src?
Edit:
For example:
System.out.println("doc...data-youtube_code"); // some code that retrieves
// data-youtube_code. The output will be "code_for_youtuber"
System.out.println("data-src");
// some code that retrieves
// data-src. The output will be "http://url_of_image.jpg"
You can simply select the first div and read the value by its attribute name:
Element elements = Jsoup.parse(s).select("div").first();
System.out.println(elements.attr("data-youtube_code"));
Output:
code_for_youtuber
EDIT :
Element elements = Jsoup.parse(s).select(".item").first();
System.out.println(elements.attr("data-youtube_code"));
Element element1 = elements.select(".image img").first();
System.out.println(element1.attr("data-src"));
Output:
code_for_youtuber
http://url_of_image.jpg
Since you are a beginner, I suggest you have a look at this link

Jsoup: How to select direct parents until the root without their siblings?

I'm trying to get all direct parents of an element, but I also get their siblings.
For example, I have this DOM structure...
<div class="html">
<div class="head"></div>
<div class="body">
seznam
<h2>Foo</h2>
google
<p>
<img class="first">
</p>
<img class="second">
<ol>
<li>1</li>
<li>2</li>
</ol>
</div>
</div>
So I want to get all direct parents of the img element with class first, up to the div with class html.
I've tried the following code:
Element element = document.select("img").first();
Node root = element.root();
But the root variable then holds the whole DOM structure, including all siblings.
UPDATE
After this, the root variable contains the whole DOM structure again:
<div class="html">
<div class="head"></div>
<div class="body">
seznam
<h2>Foo</h2>
google
<p>
<img class="first">
</p>
<img class="second">
<ol>
<li>1</li>
<li>2</li>
</ol>
</div>
</div>
But I want something like this:
<div class="html">
<div class="body">
<p>
<img class="first">
</p>
</div>
</div>
If you are interested in the path only, use Element.cssSelector().
It gives you the whole DOM path, e.g. html > body > img.
The "path" returned by Darshit Chokshi's approach is not unique.
First of all, get all elements with the class name 'first' using:
Elements childs = document.getElementsByClass("first");
Now iterate over the child elements to get their parent elements:
for (Element child : childs) {
    Elements parents = child.parents();
    for (Element parent : parents) {
        System.out.println(parent.tagName());
    }
}
Try this, I hope it works for you ;)
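Neither answer produces the pruned tree shown in the question, so here is a rough sketch of one way to build it: copy the target without its children, then wrap that copy in a childless copy of each ancestor in turn. It uses the JDK's W3C DOM rather than Jsoup, on a well-formed version of the sample markup (img tags self-closed), and the names AncestorChainSketch and ancestorChain are invented for the example. The same idea should carry over to Jsoup by walking Element.parents(), but that variant is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AncestorChainSketch {

    // Assumption: the question's markup, made well-formed so the JDK parser accepts it.
    static final String SAMPLE =
        "<div class=\"html\"><div class=\"head\"></div><div class=\"body\">seznam<h2>Foo</h2>google"
      + "<p><img class=\"first\"/></p><img class=\"second\"/>"
      + "<ol><li>1</li><li>2</li></ol></div></div>";

    /** Returns markup holding only the target and its ancestors, or "" if nothing matches. */
    static String ancestorChain(String xml, String targetXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node target = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(targetXPath, doc, XPathConstants.NODE);
            if (target == null) {
                return "";
            }

            // cloneNode(false) copies the element and its attributes but no children,
            // so siblings never make it into the result.
            Node chain = target.cloneNode(false);
            for (Node parent = target.getParentNode();
                 parent != null && parent.getNodeType() == Node.ELEMENT_NODE;
                 parent = parent.getParentNode()) {
                Node copy = parent.cloneNode(false);
                copy.appendChild(chain);
                chain = copy;
            }

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(chain), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the html > body > p > img chain without head, h2, ol or the second img.
        System.out.println(ancestorChain(SAMPLE, "//img[@class='first']"));
    }
}
```

The loop stops at the document node, so the outermost element in the result is the root div with class html, as requested.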

Display a string that contains HTML in Thymeleaf template

How can I display a string that contains HTML tags in Thymeleaf?
I have this piece of code:
<div th:each="content : ${cmsContent}">
<div class="panel-body" sec:authorize="hasRole('ROLE_ADMIN')">
<div th:switch="${content.reference}">
<div th:case="'home.admin'">
<p th:text="${content.text}"></p>
</div>
</div>
</div>
//More code....
At the ${content.text} line, the browser literally shows this:
<p>test</p>
But I want the browser to show this instead:
test
You can use th:utext (unescaped text) for such scenarios.
Simply change
<p th:text="${content.text}"></p>
to
<p th:utext="${content.text}"></p>
I also suggest having a look at the documentation here to learn more about using Thymeleaf.

How to click Buy Now button using selenium?

The source HTML looks like this:
<script id="during-reserve-tpl" type="text/x-lodash-template">
<div class="gd-row">
<div class="gd-col gu16">
<div class="emailModule message module-tmargin">
<div class="error-msg"></div>
<div class="register brdr-btm">
<div class="jbv jbv-orange jbv-buy-big jbv-reserve">Buy Now</div>
</div>
<div class="topTextWrap brdr-btm tmargin20">
<div class="subHeading">
Only one phone per registered user
<p>
First come, first serve!
</p>
</div>
</div>
</div>
</div>
</div>
</script>
When I write: IWebElement buy = driver.FindElement(By.CssSelector(".jbv.jbv-orange.jbv-buy-big.jbv-reserve")); it says the element was not found.
I tried By.ClassName with white spaces, but it says compound class names are not supported.
Is there any alternative way to click it?
driver.FindElement(By.CssSelector("div.jbv.jbv-orange.jbv-buy-big.jbv-reserve"))
In the above example, the CSS selector looks for a div tag, and each dot-prefixed name must match one of the element's space-separated classes.
Try By.XPath("//*[contains(@class, 'jbv')]") and see if it works.
You can try either of these:
IWebElement buy = driver.FindElement(By.CssSelector("div.register>div"));
OR
IWebElement buy = driver.FindElement(By.CssSelector("div.register"));

Extracting href from a class within other div/id classes with jsoup

Hello, I am trying to extract the first href from within the "title" class from the following source (the source shown is only part of the page, but I am parsing the entire page):
<div id="atfResults" class="list results ">
<div id="result_0" class="result firstRow product" name="0006754023">
<div id="srNum_0" class="number">1.</div>
<div class="image">
<a href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">
<img src="http://ecx.images-amazon.com/images/I/31ZcWU6HN4L._AA115_.jpg" class="productImage" alt="Product Details">
</a>
</div>
<div class="data">
<div class="title">
<a class="title titleHover" href="http://www.amazon.co.uk/Essential-Modern-Classics-J-Tolkien/dp/0006754023/ref=sr_1_1?ie=UTF8&qid=1316504574&sr=8-1">Essential Modern Classics - The Hobbit</a>
<span class="ptBrand">by J. R. R. Tolkien</span>
<span class="bindingAndRelease">(<span class="binding">Paperback</span> - 2 Apr 2009)</span>
</div>
I have tried several variations of both the select function and also getElementByClass but all have given me a "null" value such as:
Document firstSearchPage = Jsoup.connect(fullST).get();
Element link = firstSearchPage.select("div.title").first();
If someone could help me with a solution to this problem and recommend some areas of reading so I can avoid this problem in future it would be greatly appreciated.
The CSS selector div.title returns a <div class="title">, not a link as you seem to think. If you want an <a class="title">, then you should use the a.title selector.
Element link = document.select("a.title").first();
String href = link.absUrl("href");
// ...
Or if an <a class="title"> can appear elsewhere in the document outside a <div class="title"> before that point, then you need the following more specific selector:
Element link = document.select("div.title a.title").first();
String href = link.absUrl("href");
// ...
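This will return the first <a class="title"> which is a descendant of a <div class="title"> (the space in the selector is the descendant combinator, so the link does not have to be a direct child).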
