Jsoup: select only the div's own text, not the text of its span or a children


<div class="conditions-race">
Çim: Ağır 4,9 Kum: Normal Hava: 14 C , PARÇALI BULUTLU , NEM %50
<span style="float: right;">
<a id="PDFBulten">PDF Programı</a>
<a id="PDFOzetBulten">Özet PDF Programı</a>
<a id="CSVBulten">CSV Programı</a>
1. AGF Tablosu
2. AGF Tablosu
</span>
</div>
I want only this line: "Çim: Ağır 4,9 Kum: Normal Hava: 14 C , PARÇALI BULUTLU , NEM %50"

Use the Element#ownText method.
Extract from the Javadoc:
Gets the text owned by this element only; does not get the combined text of all children.
For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!".
Note that the text within the b element is not returned, as it is not a direct child of the p element.
Sample code:
Document doc = ...
for (Element div : doc.select("div.conditions-race")) {
    // ownText() returns only the text directly inside the div, skipping the span and its links
    System.out.println(div.ownText());
}

Related

Jsoup selectors: 2nd div after h2

I have the following HTML:
<html>
<body>
...
<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
<tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
<tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<div>
<div>
<table>
<tbody>
<tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
<tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
<tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
<h2> Blah Blah 2</h2>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
<tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
<tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
</tbody>
</table>
</div>
</div>
<p>blah blah</p>
<div>
<div>
<table>
<tbody>
<tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
<tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
<tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
I would like to extract the second div following an h2 tag that contains a given text.
As you may notice, the p tags are not in the same position after the first and the second h2.
To extract the second div following the first h2, the selector below works:
h2:contains(Blah 1) + p + div + div
But replacing "Blah 1" with "Blah 2" does not work for the second h2, because the p tag is located elsewhere, so a static selector there would be:
h2:contains(Blah 2) + div + p + div
What I need is a single selector where changing only the text makes it work, wherever the p blocks may be.
I tried several ways, like ... The nth-of-type selector does not work either, because I know the position of the div only relative to the h2, which is not the parent of the div but a preceding sibling ...
Any help would be appreciated.

I have two ideas for how to achieve this.
The first one is to remove every <p>, after which you only need to select "h2:contains(" + text + ")+div+div". Be careful: use it only when you're sure the <div> you want doesn't contain any <p>. Otherwise the result will be missing that content.
public void execute1(String html) {
    Document doc = Jsoup.parse(html);
    // first approach: remove every <p> to simplify document
    Elements paragraphs = doc.select("p");
    for (Element paragraph : paragraphs) {
        paragraph.remove();
    }
    // then one selector will return what you want in both cases
    System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 1"));
    System.out.println(selectSecondDivAfterH2WithText(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText(Document doc, String text) {
    return doc.select("h2:contains(" + text + ")+div+div").first();
}
The second approach is to iterate over the siblings of "h2:contains(" + text + ")" and "manually" find the second <div>, ignoring anything else. It's better because it doesn't modify the original document and it skips any number of <p> elements.
public void execute2(String html) {
    Document doc = Jsoup.parse(html);
    System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 1"));
    System.out.println(selectSecondDivAfterH2WithText2(doc, "Blah 2"));
}
private Element selectSecondDivAfterH2WithText2(Document doc, String text) {
    int counter = 2;
    // find h2 with given text
    Element h2 = doc.select("h2:contains(" + text + ")").first();
    if (h2 == null) {
        // no heading contains the text, so there is nothing to search
        return null;
    }
    // select every sibling after this h2 element
    Elements siblings = h2.nextElementSiblings();
    // loop over them
    for (Element sibling : siblings) {
        // skip everything that's not a div
        if (sibling.tagName().equals("div")) {
            // count how many divs left to skip
            counter--;
            if (counter == 0) {
                // return when found nth div
                return sibling;
            }
        }
    }
    return null;
}
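I also had a third idea: "h2:contains(" + text + ")~div:nth-of-type(2)". It works for the first case but fails for the second. The likely reason is that nth-of-type counts divs among all children of the parent rather than starting from the h2, and the second div in the body comes before the "Blah 2" heading.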

A simple way to do this is to use the comma (,) combinator, which ORs the selectors together, so you can combine the two variations of where the p tag falls.
h2:contains(Blah 2) + div ~ div, h2:contains(Blah 2) ~ div + div
Here's an example on the try.jsoup playground.

How to save a Jsoup Element to a database

I use Jsoup to fetch data from a website, and I want to save an element when it matches some content. When an element matches certain text, I want to save it to a database (MySQL, PostgreSQL, ...). My code looks like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed").get(0).children();
Elements list = new Elements();
Elements strings = new Elements();
for (Element element : elements) {
    if (element.hasClass("post-feed-item")) {
        list.add(element);
        Element e = element.children().get(1).children().get(1).children().get(0);
        if (e.text().matches("^.*?(Docker|docker|DOCKER).*$")) {
            strings.add(e);
            // save the element to the DB
        }
    }
}
for (Element page : elements) {
    if (links.add(URL)) {
        // Remove the comment from the line below if you want to see it running on your editor
        System.out.println(URL);
    }
    getPageLinks(page.attr("abs:href"));
}
If the title of an element contains "Docker", I want to save that element to the database. But the element contains divs, link URLs, images and other content. How do I save it to the database? Is it feasible to save each child element in its own database field? If not, can I convert the element to HTML and save that? Please help.
Example of the HTML I want to save to the database:
<div class="post-feed-item">
<img src="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png" srcset="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 1x, https://images.viblo.asia/avatar-retina/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 2x" class="avatar avatar--md mr-05">
<div class="post-feed-item__info">
<div class="post-meta--inline">
<div class="user--inline d-inline-flex">
<!---->
Hoàn Kì
<!---->
</div>
<div class="post-meta d-inline-flex align-items-center flex-wrap">
<div class="text-muted mr-05">
<span class="mr-05">about 3 hours ago</span>
<button title="Copy URL" class="icon-btn _13z_mK0hRyRB3dPzawysKe_0"><i aria-hidden="true" class="fa fa-link"></i></button>
</div>
<!---->
<!---->
</div>
</div>
<div class="post-title--inline">
<h3 class="word-break mr-05">Docker: Chưa biết gì đến biết dùng (Phần 3 docker-compose )</h3>
<div class="tags" data-v-cbe11868>
<a href="/tags/docker" class="el-tag _3wKNDsArij9ZFjXe8k4ryR_0 el-tag--info el-tag--mini" data-v-cbe11868>Docker</a>
</div>
</div>
<!---->
<div class="d-flex justify-content-between">
<div class="d-flex">
<div class="stats">
<span title="Views" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-eye"></i> 62 </span>
<span title="Clips" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-paperclip"></i> 1 </span>
<span title="Comments" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-comments"></i> 0 </span>
</div>
<!---->
</div>
<div title="Score" class="points">
<div class="carets">
<i aria-hidden="true" class="fa fa-caret-up"></i>
<i aria-hidden="true" class="fa fa-caret-down"></i>
</div>
<span class="text-muted">4</span>
</div>
</div>
</div>
</div>

First, change the logic for fetching the post-feed-item elements like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed-item"); // This will get the whole element.
for (Element element : elements) {
    String postFeeds = "";
    // toString() returns the element's outer HTML
    if (element.toString().contains("docker")) {
        postFeeds = postFeeds.concat(element.toString());
        // save postFeeds to DB
    }
}
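The check above is case-sensitive and looks at the whole markup of the element, so it can also match a tag link such as /tags/docker rather than the title. As a minimal sketch, assuming the title text has already been pulled out of the element as a String, a case-insensitive test could look like this. TitleFilter and mentionsDocker are hypothetical names used for illustration, not part of Jsoup.

```java
import java.util.List;
import java.util.regex.Pattern;

final class TitleFilter {
    // Matches "docker" in any letter case, anywhere in the text.
    private static final Pattern DOCKER =
            Pattern.compile("docker", Pattern.CASE_INSENSITIVE);

    /** Returns true when the title mentions Docker, regardless of case. */
    static boolean mentionsDocker(String title) {
        return title != null && DOCKER.matcher(title).find();
    }

    public static void main(String[] args) {
        // Sample titles; the first one is taken from the HTML in the question.
        List<String> titles = List.of(
                "Docker: Chưa biết gì đến biết dùng (Phần 3 docker-compose )",
                "DOCKER tips",
                "Kubernetes basics");
        for (String title : titles) {
            System.out.println(mentionsDocker(title) + "  " + title);
        }
    }
}
```

This replaces the hand-written (Docker|docker|DOCKER) alternation from the question with a single flag, so spellings such as "DoCkEr" are covered as well.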
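Extra
/**
 * Your parsed element may contain a single quote (').
 * This causes an error when the HTML is concatenated into an SQL string.
 * To avoid this, escape each single quote (')
 * as two single quotes ('').
 * A parameterized query (PreparedStatement) avoids the problem without manual escaping.
 */
if (element.toString().contains("docker")) {
    // replace() is enough here: no regular expression is needed
    postFeeds = postFeeds.concat(element.toString().replace("'", "''"));
    // save postFeeds to DB
}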
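Manual quote escaping is easy to get wrong. A hedged alternative is to bind the HTML as a parameter of a java.sql.PreparedStatement, which leaves the quoting to the JDBC driver. The sketch below assumes an already open Connection and a hypothetical post_feed_item table with title and html columns. PostFeedStore and save are illustrative names, not something from the original answer.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class PostFeedStore {
    // Hypothetical table and column names; adjust them to your own schema.
    static final String INSERT_SQL =
            "INSERT INTO post_feed_item (title, html) VALUES (?, ?)";

    /**
     * Stores the title and outer HTML of one matched element.
     * Returns the number of rows the driver reports as inserted.
     */
    static int save(Connection conn, String title, String html) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            // Values are sent as parameters, so quotes in the HTML need no escaping.
            stmt.setString(1, title);
            stmt.setString(2, html);
            return stmt.executeUpdate();
        }
    }
}
```

A call such as PostFeedStore.save(conn, title, element.outerHtml()) would then store the markup unchanged, single quotes included.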
Second: is it feasible to save each element in its own field in the database?
You don't need separate columns to store each element. You can do it, but whether it is worthwhile depends on your use case. If you only want to store the post-feed-items in order to write them back to your web page, separate columns are not worth it.
Third: how can I convert the element to HTML and save it?
You don't need a separate conversion step: the element's string form is already its HTML, so convert the element to a String and save that.
All you need is a column with a large text type such as TEXT or CLOB. VARCHAR works only if it is long enough for the markup, and BLOB also works but stores raw bytes.
Update
How can I traverse all pages?
Looking at the source code of that page, this is how you can get the total number of pages:
Elements pagination = doc.getElementsByAttributeValueMatching("href", "page=\\d");
// assumes the second-to-last matching link shows the last page number
int totalPageNo = Integer.parseInt(pagination.get(pagination.size() - 2).text());
Then loop through each page:
for (int page = 1; page <= totalPageNo; page++) {
    Connection conn = Jsoup.connect("https://viblo.asia/?page=" + page);
    // rest of your code
}
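Reading the second-to-last pagination link depends on the exact layout of the pager. A slightly more defensive sketch, assuming the href values of the links have already been collected into a list of strings (with Jsoup, for instance, via doc.select("a[href]").eachAttr("href")), takes the largest page=N value instead. PageCount and lastPage are hypothetical names, and the sample hrefs are made up for illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageCount {
    // Captures the number in a "page=N" query parameter.
    private static final Pattern PAGE_PARAM = Pattern.compile("[?&]page=(\\d+)");

    /** Returns the largest page number found in the hrefs, or 1 if none match. */
    static int lastPage(List<String> hrefs) {
        int max = 1;
        for (String href : hrefs) {
            Matcher m = PAGE_PARAM.matcher(href);
            if (m.find()) {
                max = Math.max(max, Integer.parseInt(m.group(1)));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical hrefs shaped like the pagination links described above.
        List<String> hrefs = List.of("/?page=2", "/?page=3", "/?page=57", "/?page=2", "/about");
        System.out.println(lastPage(hrefs)); // prints 57
    }
}
```

Falling back to 1 when nothing matches means a single-page listing is still fetched once by the loop.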

I think I understand what you mean. Here are some suggestions. First, clarify what you are searching for and design the table fields accordingly. For example, following your idea, you could create a table_docker table with fields such as field_id, field_content, field_start_time, field_links and so on. Second, write some utility classes, such as JsoupUtils to fetch and parse the HTML, HtmlUtils to handle the HTML markup and download the images, DBUtils to connect to the database and save data, POIUtils to present your data, and DataUtils to process the data in your own way.

XmlSlurper to parse XML and get value of inside elements using Groovy

I am trying to parse the below XML:
<body>
<section id="5f884f20-6638-461f-a3f5-3d237341c048" outputclass="definition_and_scope">
<title>Definition and Scope</title>
<p>A work that is modified for a purpose, use, or medium other than that for which it was originally intended.</p>
<p>This relationship applies to changes in form or to works completely rewritten in the same form.</p>
</section>
<section id="a7cf019f-dc82-46e2-b5ae-2e947d3c2509" outputclass="popup:ready_reference">
<title>Element Reference</title>
<div id="8472e205-3a32-40e3-a7ea-8bd7dbd43715" outputclass="iri">
<p id="e6ddf17a-6b4b-4de3-886e-a315d88545ea" outputclass="title">
<b>IRI</b>
</p>
<p id="c69f6279-27a3-4cd8-84a6-bb2c5a7b0424">
<xref format="html" href="http://rdaregistry.info/Elements/w/P10142" scope="external">http://rdaregistry.info/Elements/w/P10142</xref>
</p>
</div>
<div id="3e979983-cbac-4982-84c7-57ae9756e2bb" outputclass="domain">
<p id="9815dbdf-7483-4dcf-8166-7ea50138b3e5" outputclass="title">
<b>Domain</b>
</p>
<p id="328a1035-1eaf-4c4b-aead-d604586b3f64">
<xref keyref="rdacC10001/ala-c3e1fff8-0a79-35c6-bee1-39b6b4c9ed35">Work</xref>
</p>
</div>
<div id="13163eda-dcfd-48d9-aea4-cc8abef2f675" outputclass="range">
<p id="d07d4e37-dff1-4561-baab-f8f557d99662" outputclass="title">
<b>Range</b>
</p>
<p id="3873a6ab-5f73-47e2-9daa-441169e66c36">
<xref keyref="rdacC10001/ala-c3e1fff8-0a79-35c6-bee1-39b6b4c9ed35">Work</xref>
</p>
</div>
</section>
</body>
I want to extract the values of all the p tags inside section and section/div and append them to a StringBuilder.
Here is my code:
def docText = new StringBuilder();
def bodyObject = doc.topic.body.toXmlString(true) //I have only pasted a part of my XML in this question. My XML starts with a doc/topic/body etc
def parseBodyObject = new XmlSlurper().parse(new InputSource(new StringReader(bodyObject)));
def findAllSection = parseBodyObject.depthFirst().findAll{it.name()=='section'}
findAllSection.each {section->
docText.append(" " +section.p)
docText.append(" " +section.div.p + " ")
}
Output:
My docText looks like below:
A work that is modified for a purpose, use, or medium other than that for which it was originally intended.This relationship applies to changes in form or to works completely rewritten in the same form. IRIhttp://rdaregistry.info/Elements/w/P10142DomainWorkRangeWorkAlternate labelsUser tasksRecording methodsDublin Core TermsMARC 21 Bibliographic Recording an unstructured descriptionRecording a structured descriptionRecording an identifierRecording an IRI For the inverse of this element, see Work: adapted as work For broader elements, see Work: based on workFor narrower elements, see
I am stuck on adding a space between the texts. For example, when it goes through section/div/p, it joins all the p values together without any spaces, as below:
IRIhttp://rdaregistry.info/Elements/w/P10142DomainWorkRangeWorkAlternate
whereas the expected output is:
IRI http://rdaregistry.info/Elements/w/P10142 Domain Work
How can I get these values separated? Any help is appreciated.

I believe that depthFirst().findAll { it.name() == 'section' } returns a list of section nodes, and that section.p and section.div.p each evaluate to the concatenated inner text of all the matching p tags, which is why no spaces appear between them.
Let's define your sample XML as xmlDoc. Below is a snippet of code that works as expected:
def parseBodyObject = new XmlSlurper().parseText(xmlDoc)
// collect every p element individually instead of whole sections
def findAllPtags = parseBodyObject.children().depthFirst().findAll {
    it.name() == 'p'
}
def docText = new StringBuilder()
findAllPtags.each { p ->
    docText.append("\n" + p)
}
You can replace \n with a space.

Jsoup: select(div[class=rslt prod]) returns null when it shouldn't

I am trying to select all the divs with class="rslt prod" from this page: http://www.amazon.fr/s/field-keywords=samsung
Document doc = Jsoup.connect("http://www.amazon.fr/s/field-keywords=samsung").get();
Elements divProd = doc.select("div[class=rslt prod]");
System.out.println("\nsize: "+divProd.size());
But it returns 0 and it shouldn't. Any idea why?
Example of what should be selected:
<div id="result_4" class="rslt prod" name="B006O9QNHU">
[...]
</div>

You have to change the user agent; otherwise you get a different page from Amazon.
Document doc = Jsoup.connect("http://www.amazon.fr/s/field-keywords=samsung")
        .userAgent("Mozilla/17.0") // you can use any other user agent here
        .get();
for (Element element : doc.select("div[class=rslt prod]")) {
    System.out.println(element);
    System.out.println("");
}
Now the output is a list like:
<div id="result_1" class="rslt prod" name="B007XOM6SU">
...
</div>
<div id="result_2" class="rslt prod" name="B006SXSF4Q">
...
</div>
...

GetValue (JSoup)

<div class="Class-feedbacks">
<div class="grading class2">
<div itemtype="http://xx.edu/grading" itemscope="" itemprop="studentgrading">
<div class="rating">
<img class="passportphoto" width="1500" height="20" src="http://greg.png" >
<meta content="4.0" itemprop="gradingvalue">
</div>
</div>
<meta content="2012-09-08" itemprop="gradePublished">
<span class="date smaller">9/8/2012</span>
</div>
<p class="review_comment feedback" itemprop="description">Greg is one the smart person in his batch</p>
</div>
I want to print:
date: 2012-09-08
Feedback : Greg is one the smart person in his batch
Following the suggestion in "Jsoup getting a hyperlink from li", I was able to use doc.select(div div divn li ui ...) to get the element with the class feedback.
How should I use select to get the values shown above?

To get the value of an attribute, use the attr method, e.g.:
Elements elements = doc.select("meta");
for (Element e : elements) {
    System.out.println(e.attr("content"));
}

To do it in a single select, have you tried the comma combinator ","?
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
// each alternative is narrowed so that only the date meta tag and the feedback paragraph match;
// a plain "div.Class-feedbacks meta" would also return the gradingvalue meta tag first
Elements elmts = doc.select("meta[itemprop=gradePublished], p.feedback");
Element elmtDate = elmts.get(0);
System.out.println("date: " + elmtDate.attr("content"));
Element elmtParag = elmts.get(1);
System.out.println("Feedback: " + elmtParag.text());
You should get back 2 elements in your list after the select: the date and the feedback.

This is an old question and I might be late, but if anyone else wants to know how to do this easily, the below code will be helpful.
Document doc = Jsoup.parse(html);
// We select the meta tag whose itemprop property has value 'gradePublished'
String date = doc.select("meta[itemprop=gradePublished]").attr("content");
System.out.println("date: "+date);
// Now we select the text inside the p tag with itemprop value 'description'
String feedback = doc.select("p[itemprop=description]").text();
System.out.println("Feedback: "+feedback);
