Using JSOUP to remove duplicate elementText value - java

I am scraping the Walmart page at http://www.walmart.com/search/search-ng.do?tab_value=all&search_query=camera&search_constraint=0&Find=Find&ss=false&ic=16_32 using the JSOUP DOM parser in Java.
I build the URL from user parameters and build a DOM object using
Document doc = Jsoup.parse(contentVar);
As the next step I want to print all the products and prices. I used the following code:
String price = doc.getElementsByClass("camelPrice").text();
String title = doc.getElementsByClass("ListItemLink").text();
System.out.println("Product: " + title);
System.out.println("Price: "+ price);
Here I am using the class names for the price and the product description. However, my results are:
Title/Product Name: C1, C2, ... C16 (c is camera title)
Price: $279.95 $279.95 $479.00 $479.00 $60.00 $60.00 $99.00 $99.00 $429.00 $429.00 $129.00 $129.00 $109.00 $109.00 $89.00 $89.00 $384.00 $384.00 $69.00 $69.00 $279.00 $279.00 $129.00 $129.00 $55.20 - $69.00 $55.20 - $69.00 $74.00 $74.00 $119.00 $119.00
Here the prices are duplicated, possibly because of a quick-view tag. Is there any way to remove the duplicate prices using a JSOUP method?

Looking at the HTML DOM, I noticed that each price appears in two places. There is a price
<div class="ItemShelfAvail"> <----------- SEE HERE
<div class="OnlinePriceAvail">
<div class="PriceHeader OnlineHead">Online</div>
<div class="PriceContent">
<div class="PriceDisplay" id="price_display_23204350_2">
<div class="PriceCompare">
<div class="camelPrice"><span class="prefixPriceText2"></span><span class="bigPriceText2">$279.</span><span class="smallPriceText2">00</span><span></span></div>
and a price
<div class="OnlinePriceAvail">
<div class="PriceHeader OnlineHead">Online</div>
<div class="PriceContent">
<div class="PriceDisplay" id="price_display_23204350_2">
<div class="PriceCompare">
<div class="camelPrice"><span class="prefixPriceText2"></span><span class="bigPriceText2">$279.</span><span class="smallPriceText2">00</span><span></span></div>
You must decide which of the two lists you want and then write a selector that matches only that one. If you want both of them, just take the Elements list returned by getElementsByClass and handle each price yourself.
getElementsByClass returns Elements, which is a list in which every node is of type Element. You can do
Elements elPrice = doc.getElementsByClass("camelPrice");
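As a sketch of that per-price handling: if each product really does emit its price twice in a row (an assumption based on the output in the question), keeping every second entry removes the duplicates without merging two neighbouring products that happen to cost the same. The helper below works on plain strings; the list would come from calling text() on each Element in elPrice. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PriceDedup {
    /**
     * Keeps the entries at even positions (0, 2, 4, ...).
     * Assumes every price appears exactly twice in a row.
     */
    public static List<String> everySecond(List<String> prices) {
        List<String> unique = new ArrayList<>();
        for (int i = 0; i < prices.size(); i += 2) {
            unique.add(prices.get(i));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> scraped = Arrays.asList(
                "$279.95", "$279.95", "$479.00", "$479.00", "$60.00", "$60.00");
        System.out.println(everySecond(scraped)); // [$279.95, $479.00, $60.00]
    }
}
```

If the page layout changes so that prices are no longer doubled, this would silently drop real products, so a narrower selector is the more robust fix when one is available.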

I know that this may now be useless to the creator of the thread, but I found it while looking for how to get the price of a product on Amazon UK. Here is the code to do it :)
String pricing = doc.getElementsByClass("priceLarge").text();
System.out.println("price : " + pricing);

Related

How to save Element from Jsoup to database

I use Jsoup to fetch all the data from a website, and I want to save an element when it matches certain content. If an element's text matches some pattern, I want to save that element to a database (MySQL, PostgreSQL, ...). My code looks like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed").get(0).children();
Elements list = new Elements();
Elements strings = new Elements();
for (Element element : elements) {
if (element.hasClass("post-feed-item")) {
list.add(element);
Element e = element.children().get(1).children().get(1).children().get(0);
if (e.text().matches("^.*?(Docker|docker|DOCKER).*$")) {
strings.add(e);
//save to element to DB
}
}
}
for (Element page : elements) {
if (links.add(URL)) {
//Remove the comment from the line below if you want to see it running on your editor
System.out.println(URL);
}
getPageLinks(page.attr("abs:href"));
}
If the title in an element contains "Docker", I want to save that element to the database. But the element contains divs and other things such as link URLs, images and content. How do I save it to the database? Is it feasible to save each element in its own field in the database? If not, can I convert the element to HTML and save that? Please help.
Example of the HTML I want to save to the database:
<div class="post-feed-item">
<img src="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png" srcset="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 1x, https://images.viblo.asia/avatar-retina/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 2x" class="avatar avatar--md mr-05">
<div class="post-feed-item__info">
<div class="post-meta--inline">
<div class="user--inline d-inline-flex">
<!---->
Hoàn Kì
<!---->
</div>
<div class="post-meta d-inline-flex align-items-center flex-wrap">
<div class="text-muted mr-05">
<span class="mr-05">about 3 hours ago</span>
<button title="Copy URL" class="icon-btn _13z_mK0hRyRB3dPzawysKe_0"><i aria-hidden="true" class="fa fa-link"></i></button>
</div>
<!---->
<!---->
</div>
</div>
<div class="post-title--inline">
<h3 class="word-break mr-05">Docker: Chưa biết gì đến biết dùng (Phần 3 docker-compose )</h3>
<div class="tags" data-v-cbe11868>
<a href="/tags/docker" class="el-tag _3wKNDsArij9ZFjXe8k4ryR_0 el-tag--info el-tag--mini" data-v-cbe11868>Docker</a>
</div>
</div>
<!---->
<div class="d-flex justify-content-between">
<div class="d-flex">
<div class="stats">
<span title="Views" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-eye"></i> 62 </span>
<span title="Clips" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-paperclip"></i> 1 </span>
<span title="Comments" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-comments"></i> 0 </span>
</div>
<!---->
</div>
<div title="Score" class="points">
<div class="carets">
<i aria-hidden="true" class="fa fa-caret-up"></i>
<i aria-hidden="true" class="fa fa-caret-down"></i>
</div>
<span class="text-muted">4</span>
</div>
</div>
</div>
</div>
First, modify your logic for fetching post-feed-item like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed-item"); //This will get the whole element.
for (Element element : elements) {
    String postFeeds = "";
    if (element.toString().contains("docker")) {
        postFeeds = postFeeds.concat(element.toString());
        //save postFeeds to DB
    }
}
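One thing to watch in the check above: contains("docker") is case-sensitive, so an element whose title only says "Docker" matches only if the lowercase word also appears somewhere else in its HTML (for example in a tag link). A minimal sketch of a case-insensitive test on the title text follows; the class and method names are hypothetical, and the title string would come from something like element.select("h3").text().

```java
import java.util.Locale;

public class TitleFilter {
    /** Returns true if the title contains the keyword, ignoring case. */
    public static boolean mentions(String title, String keyword) {
        return title.toLowerCase(Locale.ROOT)
                .contains(keyword.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(mentions("Docker: getting started", "docker")); // true
        System.out.println(mentions("Kubernetes basics", "docker"));       // false
    }
}
```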
Extra
/**
* Your parsed element may contain single quote (').
* This will cause error while persisting.
* to avoid this you need to escape single quote (')
* with double single quote ('')
*/
if (element.toString().contains("docker")) {
postFeeds = postFeeds.concat(element.toString().replaceAll("'", "''"));
//save postFeeds to DB
}
Second: is it feasible to save each element in its own field in the database?
You don't need separate columns to store each element in the database. You can do that, but whether it is worthwhile depends on your use case. If you only want to store the post-feed-items in order to write them back to your web page, it is not worth it.
Third: how can I convert an element to HTML and save it?
You don't need to convert the element to HTML, but you do need to convert it to a String if you want to save it to the database.
All you need is a column of the BLOB data type (you can also save it as VARCHAR, but BLOB is safer).
Update
How can I traverse all pages?
By looking at the source code of that page I found this is how you can get the total page number -
Elements pagination = doc.getElementsByAttributeValueMatching("href", "page=\\d");
int totalPageNo = Integer.parseInt(pagination.get(pagination.size() - 2).text());
Then loop through each page:
for (int page = 1; page <= totalPageNo; page++) {
    Connection conn = Jsoup.connect("https://viblo.asia/?page=" + page);
    //rest of your code
}
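The page-count lookup above depends on where the last page link sits in the pagination bar. A slightly more defensive sketch, assuming the pagination links carry a page=N query parameter as the regex above suggests, is to read the number out of each href and take the largest. The class name, method name and sample hrefs are illustrative only; the hrefs would be collected with attr("href") on each matched element.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageCount {
    // Matches "?page=12" or "&page=12" and captures the digits.
    private static final Pattern PAGE_PARAM = Pattern.compile("[?&]page=(\\d+)");

    /** Returns the largest page number found in the hrefs, or 1 if none carries one. */
    public static int lastPage(List<String> hrefs) {
        int max = 1;
        for (String href : hrefs) {
            Matcher m = PAGE_PARAM.matcher(href);
            if (m.find()) {
                max = Math.max(max, Integer.parseInt(m.group(1)));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("/?page=2", "/?page=3", "/?page=120", "/?page=2");
        System.out.println(lastPage(hrefs)); // 120
    }
}
```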
I think I understand what you mean. Here are some thoughts. First, you should clarify what you are searching for and design the table fields in the database. For example, following your idea, you could create a table_docker table with fields such as field_id, field_content, field_start_time, field_links and so on. Second, you should write some utility classes: JsoupUtils to fetch the HTML and parse it, HtmlUtils to handle the HTML comments and download the pictures, DBUtils to connect to the database and save data, POIUtils to display your data, and DataUtils to process your data in whatever way you need.

XmlSlurper to parse XML and get value of inside elements using Groovy

I am trying to parse the below XML:
<body>
<section id="5f884f20-6638-461f-a3f5-3d237341c048" outputclass="definition_and_scope">
<title>Definition and Scope</title>
<p>A work that is modified for a purpose, use, or medium other than that for which it was originally intended.</p>
<p>This relationship applies to changes in form or to works completely rewritten in the same form.</p>
</section>
<section id="a7cf019f-dc82-46e2-b5ae-2e947d3c2509" outputclass="popup:ready_reference">
<title>Element Reference</title>
<div id="8472e205-3a32-40e3-a7ea-8bd7dbd43715" outputclass="iri">
<p id="e6ddf17a-6b4b-4de3-886e-a315d88545ea" outputclass="title">
<b>IRI</b>
</p>
<p id="c69f6279-27a3-4cd8-84a6-bb2c5a7b0424">
<xref format="html" href="http://rdaregistry.info/Elements/w/P10142" scope="external">http://rdaregistry.info/Elements/w/P10142</xref>
</p>
</div>
<div id="3e979983-cbac-4982-84c7-57ae9756e2bb" outputclass="domain">
<p id="9815dbdf-7483-4dcf-8166-7ea50138b3e5" outputclass="title">
<b>Domain</b>
</p>
<p id="328a1035-1eaf-4c4b-aead-d604586b3f64">
<xref keyref="rdacC10001/ala-c3e1fff8-0a79-35c6-bee1-39b6b4c9ed35">Work</xref>
</p>
</div>
<div id="13163eda-dcfd-48d9-aea4-cc8abef2f675" outputclass="range">
<p id="d07d4e37-dff1-4561-baab-f8f557d99662" outputclass="title">
<b>Range</b>
</p>
<p id="3873a6ab-5f73-47e2-9daa-441169e66c36">
<xref keyref="rdacC10001/ala-c3e1fff8-0a79-35c6-bee1-39b6b4c9ed35">Work</xref>
</p>
</div>
</section>
</body>
I want to extract the values of all the p tags inside section and section/div and append them to a StringBuilder.
Here is my code:
def docText = new StringBuilder();
def bodyObject = doc.topic.body.toXmlString(true) //I have only pasted a part of my XML in this question. My XML starts with a doc/topic/body etc
def parseBodyObject = new XmlSlurper().parse(new InputSource(new StringReader(bodyObject)));
def findAllSection = parseBodyObject.depthFirst().findAll{it.name()=='section'}
findAllSection.each {section->
docText.append(" " +section.p)
docText.append(" " +section.div.p + " ")
}
Output:
My docText looks like below:
A work that is modified for a purpose, use, or medium other than that for which it was originally intended.This relationship applies to changes in form or to works completely rewritten in the same form. IRIhttp://rdaregistry.info/Elements/w/P10142DomainWorkRangeWorkAlternate labelsUser tasksRecording methodsDublin Core TermsMARC 21 Bibliographic Recording an unstructured descriptionRecording a structured descriptionRecording an identifierRecording an IRI For the inverse of this element, see Work: adapted as work For broader elements, see Work: based on workFor narrower elements, see
I am stuck on adding a space between the texts. For example, when it goes through section/div/p, it joins all the p values together without any spaces, as below:
IRIhttp://rdaregistry.info/Elements/w/P10142DomainWorkRangeWorkAlternate
which should be output as (expected output):
IRI http://rdaregistry.info/Elements/w/P10142 Domain Work
How can I get these values separated? Any help is appreciated.
I believe that depthFirst().findAll { it.name() == 'section' } returns a list in which each element is the combined inner text of its p tags.
Let's define your sample XML as xmlDoc. The snippet below works as expected:
def parseBodyObject = new XmlSlurper().parseText(xmlDoc)
def findAllPtags = parseBodyObject.children().depthFirst().findAll {
    it.name() == 'p'
}
def docText = new StringBuilder()
findAllPtags.each { p ->
    docText.append("\n" + p)
}
You can replace \n by a space.
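For comparison, the same idea (collect every p element and join the texts with a separator) can be sketched in plain Java with the JDK's DOM parser. This is not the XmlSlurper API, just an equivalent under the assumption that the XML is available as a string; the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphText {
    /** Joins the text of every <p> element, in document order, with single spaces. */
    public static String joinParagraphs(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList paragraphs = doc.getElementsByTagName("p");
        StringJoiner joined = new StringJoiner(" ");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Collapse the indentation and line breaks inside each <p>.
            String text = paragraphs.item(i).getTextContent().trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                joined.add(text);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String xml = "<body><section><div><p><b>IRI</b></p>"
                + "<p><xref>http://rdaregistry.info/Elements/w/P10142</xref></p></div>"
                + "<div><p><b>Domain</b></p><p><xref>Work</xref></p></div></section></body>";
        // Prints: IRI http://rdaregistry.info/Elements/w/P10142 Domain Work
        System.out.println(joinParagraphs(xml));
    }
}
```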

How to get data (Claim Number and Status) from HTML using Java

In my application, I submitted a claim, and it generated the claim details along with a claim number and a status. I need to extract the Claim Number and Status from the claim details.
HTML Code for the table containing Claim number and Status:
<div id="claim-num-success" style="width:50%; margin:0 auto; padding:25px; background:none; border:1px solid #d3d3d3; line-height:24px;"> <b>Service Name:</b> 7,500 MILES - NON-TURBO ENGINE
<br> <b>Claim Number:</b> 02923240
<br> <b>R/O Number:</b> 12000
<br> <b>R/O Date:</b> 12/13/2017
<br> <b>Claim Amount:</b> $40.00
<br> <b>Status:</b> APPROVED
<br>
</div>
You can use the XPath expressions below to extract the values:
//div[@id='claim-num-success']/b[text()='Claim Number:']/following-sibling::text()[1]
and
//div[@id='claim-num-success']/b[text()='Status:']/following-sibling::text()[1]
However, Selenium doesn't let you locate an element with an XPath that resolves to a text node. Instead, you can use JavascriptExecutor to evaluate the XPath in the browser and read the text node's value.
This is how you can fulfill your requirement:
JavascriptExecutor js = (JavascriptExecutor) driver;
Object claimNo = js.executeScript("var value = document.evaluate(\"//div[@id='claim-num-success']/b[text()='Claim Number:']/following-sibling::text()[1]\", document, null, XPathResult.STRING_TYPE, null); return value.stringValue;");
// The text node includes the surrounding whitespace, so trim it.
System.out.println("Claim Number : " + claimNo.toString().trim());
Object status = js.executeScript("var value = document.evaluate(\"//div[@id='claim-num-success']/b[text()='Status:']/following-sibling::text()[1]\", document, null, XPathResult.STRING_TYPE, null); return value.stringValue;");
System.out.println("Status : " + status.toString().trim());
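If you would rather avoid JavaScript, a different technique is to read the whole block's visible text once (for example with driver.findElement(By.id("claim-num-success")).getText()) and pick the fields out with a regular expression. This is only a sketch, assuming the text comes back as one "Label: value" pair per line, which is how the br-separated markup above would normally render; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClaimFields {
    /** Returns the value that follows the given label on its line, or null if the label is absent. */
    public static String valueOf(String blockText, String label) {
        Pattern p = Pattern.compile(
                "^\\s*" + Pattern.quote(label) + "[ \\t]*(.*?)[ \\t]*$",
                Pattern.MULTILINE);
        Matcher m = p.matcher(blockText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Service Name: 7,500 MILES - NON-TURBO ENGINE\n"
                + "Claim Number: 02923240\n"
                + "R/O Number: 12000\n"
                + "Status: APPROVED\n";
        System.out.println(valueOf(text, "Claim Number:")); // 02923240
        System.out.println(valueOf(text, "Status:"));       // APPROVED
    }
}
```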

GetValue (JSoup)

<div class="Class-feedbacks">
<div class="grading class2">
<div itemtype="http://xx.edu/grading" itemscope="" itemprop="studentgrading">
<div class="rating">
<img class="passportphoto" width="1500" height="20" src="http://greg.png" >
<meta content="4.0" itemprop="gradingvalue">
</div>
</div>
<meta content="2012-09-08" itemprop="gradePublished">
<span class="date smaller">9/8/2012</span>
</div>
<p class="review_comment feedback" itemprop="description">Greg is one the smart person in his batch</p>
</div>
I want to print:
date: 2012-09-08
Feedback : Greg is one the smart person in his batch
I was able to use doc.select(div div divn li ui ...), as suggested in "Jsoup getting a hyperlink from li", to get the element with the class feedback.
How should I use the select command to get the values shown above?
To get the value of an attribute, use the attr method. E.g.
Elements elements = doc.select("meta");
for(Element e: elements)
System.out.println(e.attr("content"));
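To do it in one single select, have you tried the comma combinator ","? It matches elements that satisfy either selector:
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
Elements elmts = doc.select("div.Class-feedbacks meta[itemprop=gradePublished], div.Class-feedbacks p");
Element elmtDate = elmts.get(0);
System.out.println("date: " + elmtDate.attr("content"));
Element elmtParag = elmts.get(1);
System.out.println("Feedback: " + elmtParag.text());
You should get back 2 elements in your list after the select: the date first, then the feedback. The itemprop filter is needed because the block contains a second meta tag (gradingvalue) that a plain "meta" selector would also match.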
This is an old question and I might be late, but if anyone else wants to know how to do this easily, the below code will be helpful.
Document doc = Jsoup.parse(html);
// We select the meta tag whose itemprop property has value 'gradePublished'
String date = doc.select("meta[itemprop=gradePublished]").attr("content");
System.out.println("date: "+date);
// Now we select the text inside the p tag with itemprop value 'description'
String feedback = doc.select("p[itemprop=description]").text();
System.out.println("Feedback: "+feedback);

Get HTML nodes that have the same parent - JAVA

I have a document containing several forms similar to the example posted below. I want to extract all the name/value pairs from the hidden input fields of one of the forms, the form is identified by its name and I don't know in advance how many hidden fields will be present.
I am able to select all the relevant input fields in the document using the selector query: input[type=hidden][name][value]
Is there a way to select only the input fields that have FORM[name=lgo] as their parent? Using some kind of filter, maybe?
<FORM METHOD='POST' onSubmit='javascript:isWaitForm();' ACTION='https://abc-azerty.querty.se/carmon/servlet/action/change_1 ' name='lgo'>
<input type='hidden' name='LogInFlag' value='1'>
<input type='hidden' name='LogInTime' value='2011-07-26 11:10'>
<input type='hidden' name='cCode2' value='SE'>
<a href='javascript:isWaitForm();javascript:document.lgo.submit();' class='linkNone'>Business Monitor</a>
<a href='javascript:isWaitForm();javascript:document.lgo.submit();' class='linkNone'>
<input type='image' src='/images/button_arrow_right.gif' height=19 width=22 border=0 style='float:left;'></A>
</FORM>
Based on this info, at least one of following should work -
doc.select("form[name=lgo] > input[type=hidden]");
Or, you can chain your selects -
doc.select("form[name=lgo]").select("input[type=hidden]");
The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
<script type="text/javascript">
    var inputs = document.getElementsByName('lgo')[0].getElementsByTagName('input');
    for (var i = 0; i < inputs.length; i++) {
        if (inputs[i].getAttribute('type') == "hidden") {
            // This will get the name: inputs[i].getAttribute('name')
            // This will get the value: inputs[i].value
            console.log(inputs[i].getAttribute('name') + ": " + inputs[i].value);
        }
    }
</script>
