GetValue (JSoup) - java

<div class="Class-feedbacks">
<div class="grading class2">
<div itemtype="http://xx.edu/grading" itemscope="" itemprop="studentgrading">
<div class="rating">
<img class="passportphoto" width="1500" height="20" src="http://greg.png" >
<meta content="4.0" itemprop="gradingvalue">
</div>
</div>
<meta content="2012-09-08" itemprop="gradePublished">
<span class="date smaller">9/8/2012</span>
</div>
<p class="review_comment feedback" itemprop="description">Greg is one the smart person in his batch</p>
</div>
I want to print:
date: 2012-09-08
Feedback : Greg is one the smart person in his batch
I was able to reach the feedback class with the approach suggested in "Jsoup getting a hyperlink from li",
using doc.select(div div divn li ui ...).
How should I use select to get the two values shown above?

To get the value of an attribute, use the attr method. E.g.
Elements elements = doc.select("meta");
for(Element e: elements)
System.out.println(e.attr("content"));

To do it in one single select, have you tried the comma combinator ","?
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
// Narrow the meta match by itemprop: a bare "meta" would also match the gradingvalue tag and shift the indexes
Elements elmts = doc.select("div.Class-feedbacks meta[itemprop=gradePublished], div.Class-feedbacks p");
Element elmtDate = elmts.get(0);
System.out.println("date: " + elmtDate.attr("content"));
Element elmtParag = elmts.get(1);
System.out.println("Feedback: " + elmtParag.text());
After the select you should get back 2 elements in your list, in document order: the date and the feedback.

This is an old question and I might be late, but if anyone else wants to know how to do this easily, the below code will be helpful.
Document doc = Jsoup.parse(html);
// We select the meta tag whose itemprop property has value 'gradePublished'
String date = doc.select("meta[itemprop=gradePublished]").attr("content");
System.out.println("date: "+date);
// Now we select the text inside the p tag with itemprop value 'description'
String feedback = doc.select("p[itemprop=description]").text();
System.out.println("Feedback: "+feedback);

Related

how to display span class field

I am trying to print the two "text text-pass" values from the HTML shown in the Chrome browser to my console, but it does not work. Any advice, please?
The HTML in my browser:
<a href="/abc/123" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
<a href="/abc/1234" class="active">
<div class="sidebar-text">
<span class="text text-pass"> </span> </a>
My code:
String text123 = driver.findElement(By.xpath("//*[@id='js-app']/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(text123);
String text1234 = driver.findElement(By.xpath("//*[@id='js-app']/div/div/div[2]/div[1]/div/div/ul/li[5]/a")).getText();
System.out.println(text1234);
You can use findElements to get all the elements that match the same pattern; it returns a List.
UPDATE
Regarding your comment: collect the strings into a list, then check it with the Collection.contains() method:
List<String> results = new ArrayList<>();
List<WebElement> elements = driver.findElements(By.xpath("//div[@class='sidebar-text']//span"));
for(WebElement element: elements) {
String attr = element.getAttribute("class");
results.add(attr);
System.out.println(attr);
}
if(results.contains("text text-fail")) {
System.out.println("this is list contains 'text text-fail'");
}
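One caveat with the check above: results.contains("text text-fail") compares the whole attribute string, so it only matches when the classes appear in exactly that order with exactly that spacing. Below is a minimal sketch, using only the standard library, that splits the attribute value into individual class names first. The ClassAttr class and its method names are made up for illustration; the input is assumed to be whatever element.getAttribute("class") returned.

```java
import java.util.Arrays;
import java.util.List;

class ClassAttr {
    // Splits a raw class attribute value into its individual class names.
    static List<String> tokens(String classAttr) {
        String trimmed = classAttr == null ? "" : classAttr.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True when the attribute value contains the given class name as a whole token.
    static boolean hasToken(String classAttr, String token) {
        return tokens(classAttr).contains(token);
    }
}
```

With this, ClassAttr.hasToken(attr, "text-fail") can be called inside the loop instead of comparing the full string.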
Try this code:
String pass = driver.findElement(By.xpath("//*[@class='sidebar-text']/span")).getAttribute("class");
System.out.println(pass);

Jsoup selectors: 2nd div after h2

I have the following HTML:
<html>
<body>
...
<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
<tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
<tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
<tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
<tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
<h2> Blah Blah 2</h2>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
<tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
<tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
<tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
<tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
I would like to extract the 2nd DIV following an h2 tag that contains a given text.
As you may notice, the p tags are not in the same position after the first and the second h2.
To extract the DIV following the first h2, the selector below works:
h2:contains(Blah 1) + p + div +div
But replacing "Blah 1" with "Blah 2" does not work for the second h2, because the "p" tag is located elsewhere, so a static selector would be:
h2:contains(Blah 2) + div + p +div
What I need is a single selector where changing only the text makes it work, wherever the p blocks may be.
I tried several ways. For example, the nth-of-type selector does not work either, because I know the position of the DIV only with respect to the h2, which is not the parent of the DIV but a preceding sibling.
Help, please.
I have two ideas for how to achieve this.
The first one is to remove every <p>; then you only have to select "h2:contains(" + text + ")+div+div". Be careful to use it only when you're sure your <div> elements don't contain any <p>. Otherwise the result will be missing some content.
public void execute1(String html) {
Document doc = Jsoup.parse(html);
// first approach: remove every <p> to simplify document
Elements paragraphs = doc.select("p");
for (Element paragraph : paragraphs) {
paragraph.remove();
}
// then one selector will return what you want in both cases
System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText(Document doc, String text) {
return doc.select("h2:contains(" + text + ")+div+div").first();
}
The second approach is to iterate over the siblings of "h2:contains(" + text + ")" and "manually" find the second <div>, ignoring anything else. It's better because it doesn't destroy the original document and it skips any number of <p> elements.
public void execute2(String html) {
Document doc = Jsoup.parse(html);
System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
int counter = 2;
// find h2 with given text
Element h2 = doc.select("h2:contains(" + text + ")").first();
if (h2 == null) {
// no heading matches the text, so there is nothing to search
return null;
}
// select every sibling after this h2 element
Elements siblings = h2.nextElementSiblings();
// loop over them
for (Element sibling : siblings) {
// skip everything that's not a div
if (sibling.tagName().equals("div")) {
// count how many divs left to skip
counter--;
if (counter == 0) {
// return when found nth div
return sibling;
}
}
}
return null;
}
I also had a third idea: "h2:contains(" + text + ")~div:nth-of-type(2)". It works for the first case but fails for the second one, because nth-of-type counts a div's position among all the div children of the parent, not among the siblings that follow the h2.
A simple way to do this is to use the comma (,) query operator, which does an OR between the selectors. So you can combine the two variations of where the p tag falls:
h2:contains(Blah 2) + div ~ div, h2:contains(Blah 2) ~ div + div
Here's an example on the try.jsoup playground.

How to save Element from Jsoup to database

I use Jsoup to fetch all the data from a website, and I want to save an element when its content matches certain text. When an element matches, I want to save it to a database (MySQL, PostgreSQL, ...). My code looks like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed").get(0).children();
Elements list = new Elements();
Elements strings = new Elements();
for (Element element : elements) {
if (element.hasClass("post-feed-item")) {
list.add(element);
Element e = element.children().get(1).children().get(1).children().get(0);
if (e.text().matches("^.*?(Docker|docker|DOCKER).*$")) {
strings.add(e);
//save to element to DB
}
}
}
for (Element page : elements) {
if (links.add(URL)) {
//Remove the comment from the line below if you want to see it running on your editor
System.out.println(URL);
}
getPageLinks(page.attr("abs:href"));
}
If the title in an element contains "Docker", I want to save that element to the database. But the element contains divs, link URLs, images and other content. How do I save it to the database? Is it feasible to save each part of the element in its own database field? If not, can I convert the element to HTML and save that? Please help.
Example of the HTML I want to save to the database:
<div class="post-feed-item">
<img src="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png" srcset="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 1x, https://images.viblo.asia/avatar-retina/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 2x" class="avatar avatar--md mr-05">
<div class="post-feed-item__info">
<div class="post-meta--inline">
<div class="user--inline d-inline-flex">
<!---->
Hoàn Kì
<!---->
</div>
<div class="post-meta d-inline-flex align-items-center flex-wrap">
<div class="text-muted mr-05">
<span class="mr-05">about 3 hours ago</span>
<button title="Copy URL" class="icon-btn _13z_mK0hRyRB3dPzawysKe_0"><i aria-hidden="true" class="fa fa-link"></i></button>
</div>
<!---->
<!---->
</div>
</div>
<div class="post-title--inline">
<h3 class="word-break mr-05">Docker: Chưa biết gì đến biết dùng (Phần 3 docker-compose )</h3>
<div class="tags" data-v-cbe11868>
<a href="/tags/docker" class="el-tag _3wKNDsArij9ZFjXe8k4ryR_0 el-tag--info el-tag--mini" data-v-cbe11868>Docker</a>
</div>
</div>
<!---->
<div class="d-flex justify-content-between">
<div class="d-flex">
<div class="stats">
<span title="Views" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-eye"></i> 62 </span>
<span title="Clips" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-paperclip"></i> 1 </span>
<span title="Comments" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-comments"></i> 0 </span>
</div>
<!---->
</div>
<div title="Score" class="points">
<div class="carets">
<i aria-hidden="true" class="fa fa-caret-up"></i>
<i aria-hidden="true" class="fa fa-caret-down"></i>
</div>
<span class="text-muted">4</span>
</div>
</div>
</div>
</div>
First, change your logic for fetching post-feed-item like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed-item"); //This will get the whole element.
for (Element element : elements) {
String postFeeds = "";
if (element.toString().contains("docker")) {
postFeeds = postFeeds.concat(element.toString());
//save postFeeds to DB
}
}
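Note that contains("docker") is case-sensitive and runs against the whole markup, so it can match an attribute such as href="/tags/docker" rather than the title itself. Here is a small sketch of a case-insensitive check on the title text only. TitleFilter and mentionsDocker are made-up names, and the title string is assumed to come from something like element.select("h3").text(), since the sample HTML keeps the title in an h3.

```java
import java.util.regex.Pattern;

class TitleFilter {
    // Matches "docker" in any letter case, anywhere in the input.
    private static final Pattern DOCKER =
            Pattern.compile("docker", Pattern.CASE_INSENSITIVE);

    // True when the given title text mentions Docker; null counts as no match.
    static boolean mentionsDocker(String title) {
        return title != null && DOCKER.matcher(title).find();
    }
}
```

This replaces the three spelled-out alternatives (Docker|docker|DOCKER) from the question with a single flag.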
Extra
/**
* Your parsed element may contain single quote (').
* This will cause error while persisting.
* to avoid this you need to escape single quote (')
* with double single quote ('')
*/
if (element.toString().contains("docker")) {
postFeeds = postFeeds.concat(element.toString().replaceAll("'", "''"));
//save postFeeds to DB
}
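An alternative to escaping quotes by hand is a JDBC PreparedStatement, which passes the HTML as a bound parameter so quotes in it never become part of the SQL text. The sketch below is only an illustration: the post_feed table, its html column, and the PostFeedDao class are hypothetical names, and the caller is assumed to supply an open java.sql.Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class PostFeedDao {
    // Hypothetical table and column; the single ? is filled in by the driver.
    static final String INSERT_SQL = "INSERT INTO post_feed (html) VALUES (?)";

    // Stores one element's HTML and returns the number of rows inserted.
    static int save(Connection connection, String elementHtml) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            statement.setString(1, elementHtml); // no manual quote escaping needed
            return statement.executeUpdate();
        }
    }
}
```

If this lives in the same file as the scraping code, keep in mind that org.jsoup.Connection and java.sql.Connection share a simple name, so one of them has to be written fully qualified.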
Second: is it feasible to save each element in its own database field?
You don't need separate columns to store each element in the database. You can do so, but whether it is worthwhile depends on your use case. If you only want to store the post-feed-items in order to write them back to your web page, it is not worth it.
Third: how can I convert the element to HTML and save it?
You don't need to convert the element to HTML, but you do need to convert it to a String if you want to save it to the database.
All you need is a column of BLOB data type (you can also save it as VARCHAR, but BLOB is safer).
Update
How can I traverse all pages?
Looking at the source code of that page, I found that this is how you can get the total number of pages:
Elements pagination = doc.getElementsByAttributeValueMatching("href", "page=\\d");
int totalPageNo = Integer.parseInt(pagination.get(pagination.size() - 2).text());
Then loop through each page:
for(int page = 1; page <= totalPageNo; page++) {
Connection conn = Jsoup.connect("https://viblo.asia/?page=" + page);
//rest of your code
}
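The page count above depends on the second-to-last pagination link having a number as its text. A slightly more defensive sketch is to read the page= value out of every href and take the largest one. The Pagination class and lastPage method are made-up names; the list of hrefs is assumed to have been collected from the pagination links beforehand, for example with Elements.eachAttr("href").

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Pagination {
    // Captures the number in a "page=N" query parameter.
    private static final Pattern PAGE_PARAM = Pattern.compile("[?&]page=(\\d+)");

    // Returns the highest page number found in the hrefs, or 1 when none has one.
    static int lastPage(List<String> hrefs) {
        int last = 1;
        for (String href : hrefs) {
            Matcher matcher = PAGE_PARAM.matcher(href);
            if (matcher.find()) {
                last = Math.max(last, Integer.parseInt(matcher.group(1)));
            }
        }
        return last;
    }
}
```

The result can be used as the upper bound of the page loop in place of totalPageNo.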
I think I understand what you mean. Here are some suggestions. First, clarify what you are searching for and design the table fields in the database accordingly. For example, following your idea, you could create a table_docker table with fields such as field_id, field_content, field_start_time, field_links and so on. Second, write some utility classes: JsoupUtils to fetch the HTML and parse it, HtmlUtils to handle the HTML markup and download the pictures, DBUtils to connect to the database and save data, POIUtils to present your data, and DataUtils to process your data in your own way.

Jsoup I want select div , not select span or all a

<div class="conditions-race">
Çim: Ağır 4,9 Kum: Normal Hava: 14 C , PARÇALI BULUTLU , NEM %50
<span style="float: right;">
<a id="PDFBulten">PDF Programı</a>
<a id="PDFOzetBulten">Özet PDF Programı</a>
<a id="CSVBulten">CSV Programı</a>
1. AGF Tablosu
2. AGF Tablosu
</span>
</div>
I want only this line "Çim: Ağır 4,9 Kum: Normal Hava: 14 C , PARÇALI BULUTLU , NEM %50"
You want to use Element#ownText method.
Extract from Javadoc
Gets the text owned by this element only; does not get the combined text of all children.
For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!".
Note that the text within the b element is not returned, as it is not a direct child of the p element.
Sample code
Document doc = ...
for(Element div : doc.select("div.conditions-race")) {
System.out.println(div.ownText());
}

Using JSOUP to remove duplicate elementText value

I am scraping the Walmart page at URL http://www.walmart.com/search/search-ng.do?tab_value=all&search_query=camera&search_constraint=0&Find=Find&ss=false&ic=16_32 using the JSOUP DOM parser in Java.
I build the URL from user parameters and create a DOM object using
Document doc = Jsoup.parse(contentVar);
As the next step I want to print all the products and prices. I used the following code:
String price = doc.getElementsByClass("camelPrice").text();
String title = doc.getElementsByClass("ListItemLink").text();
System.out.println("Product: " + title);
System.out.println("Price: "+ price);
Here I am using the class names for the price and the product description. However, my results are:
Title/Product Name: C1, C2, ... C16 (C is a camera title)
Price: $279.95 $279.95 $479.00 $479.00 $60.00 $60.00 $99.00 $99.00 $429.00 $429.00 $129.00 $129.00 $109.00 $109.00 $89.00 $89.00 $384.00 $384.00 $69.00 $69.00 $279.00 $279.00 $129.00 $129.00 $55.20 - $69.00 $55.20 - $69.00 $74.00 $74.00 $119.00 $119.00
Here the prices are duplicated, possibly because of a quick-view tag. Is there any way to remove the duplicate prices using a JSOUP method?
Well, looking at the HTML DOM, I noticed that there are duplicates in the sense that there is a price
<div class="ItemShelfAvail"> <----------- SEE HERE
<div class="OnlinePriceAvail">
<div class="PriceHeader OnlineHead">Online</div>
<div class="PriceContent">
<div class="PriceDisplay" id="price_display_23204350_2">
<div class="PriceCompare">
<div class="camelPrice"><span class="prefixPriceText2"></span><span class="bigPriceText2">$279.</span><span class="smallPriceText2">00</span><span></span></div>
and a price
<div class="OnlinePriceAvail">
<div class="PriceHeader OnlineHead">Online</div>
<div class="PriceContent">
<div class="PriceDisplay" id="price_display_23204350_2">
<div class="PriceCompare">
<div class="camelPrice"><span class="prefixPriceText2"></span><span class="bigPriceText2">$279.</span><span class="smallPriceText2">00</span><span></span></div>
You must decide which of the two lists you want and then write a suitable selector. If you want both of them, just take the Elements list returned by getElementsByClass and process each price.
getElementsByClass returns Elements, which is a list in which every node is of type Element. You can do
Elements elPrice = doc.getElementsByClass("camelPrice");
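As a rough sketch of the "process each price" step, the helper below keeps each price and skips one identical copy that follows it directly. It assumes the price strings were already pulled out of the elements (for example with elPrice.eachText()) and that every duplicate sits immediately after its original, as in the output shown in the question. PricePairs and dropPairedDuplicates are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class PricePairs {
    // Keeps each price once and skips a single identical copy right after it.
    static List<String> dropPairedDuplicates(List<String> prices) {
        List<String> unique = new ArrayList<>();
        int i = 0;
        while (i < prices.size()) {
            String current = prices.get(i);
            unique.add(current);
            boolean repeatedNext =
                    i + 1 < prices.size() && current.equals(prices.get(i + 1));
            i += repeatedNext ? 2 : 1; // step over the duplicate when there is one
        }
        return unique;
    }
}
```

Skipping only one copy, instead of collapsing every run of equal values, keeps two neighbouring products that happen to have the same price as two separate entries.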
I know that this may no longer be useful to the creator of the thread, but I found it while looking for how to get the price of a product on Amazon UK. Here is the code to do it :)
String pricing = doc.getElementsByClass("priceLarge").text();
System.out.println("price : " + pricing);
