XmlSlurper to parse XML and get value of inside elements using Groovy - java

I am trying to parse the below XML:
<body>
<section id="5f884f20-6638-461f-a3f5-3d237341c048" outputclass="definition_and_scope">
<title>Definition and Scope</title>
<p>A work that is modified for a purpose, use, or medium other than that for which it was originally intended.</p>
<p>This relationship applies to changes in form or to works completely rewritten in the same form.</p>
</section>
<section id="a7cf019f-dc82-46e2-b5ae-2e947d3c2509" outputclass="popup:ready_reference">
<title>Element Reference</title>
<div id="8472e205-3a32-40e3-a7ea-8bd7dbd43715" outputclass="iri">
<p id="e6ddf17a-6b4b-4de3-886e-a315d88545ea" outputclass="title">
<b>IRI</b>
</p>
<p id="c69f6279-27a3-4cd8-84a6-bb2c5a7b0424">
<xref format="html" href="http://rdaregistry.info/Elements/w/P10142" scope="external">http://rdaregistry.info/Elements/w/P10142</xref>
</p>
</div>
<div id="3e979983-cbac-4982-84c7-57ae9756e2bb" outputclass="domain">
<p id="9815dbdf-7483-4dcf-8166-7ea50138b3e5" outputclass="title">
<b>Domain</b>
</p>
<p id="328a1035-1eaf-4c4b-aead-d604586b3f64">
<xref keyref="rdacC10001/ala-c3e1fff8-0a79-35c6-bee1-39b6b4c9ed35">Work</xref>
</p>
</div>
<div id="13163eda-dcfd-48d9-aea4-cc8abef2f675" outputclass="range">
<p id="d07d4e37-dff1-4561-baab-f8f557d99662" outputclass="title">
<b>Range</b>
</p>
<p id="3873a6ab-5f73-47e2-9daa-441169e66c36">
<xref keyref="rdacC10001/ala-c3e1fff8-0a79-35c6-bee1-39b6b4c9ed35">Work</xref>
</p>
</div>
</section>
</body>
I want to extract the values of all the p tags inside of section & section/div and append that value to a stringbuilder.
Here is my code:
def docText = new StringBuilder();
def bodyObject = doc.topic.body.toXmlString(true) //I have only pasted a part of my XML in this question. My XML starts with a doc/topic/body etc
def parseBodyObject = new XmlSlurper().parse(new InputSource(new StringReader(bodyObject)));
def findAllSection = parseBodyObject.depthFirst().findAll{it.name()=='section'}
findAllSection.each {section->
docText.append(" " +section.p)
docText.append(" " +section.div.p + " ")
}
Output:
My docText looks like below:
A work that is modified for a purpose, use, or medium other than that for which it was originally intended.This relationship applies to changes in form or to works completely rewritten in the same form. IRIhttp://rdaregistry.info/Elements/w/P10142DomainWorkRangeWorkAlternate labelsUser tasksRecording methodsDublin Core TermsMARC 21 Bibliographic Recording an unstructured descriptionRecording a structured descriptionRecording an identifierRecording an IRI For the inverse of this element, see Work: adapted as work For broader elements, see Work: based on workFor narrower elements, see
I am stuck on adding a space between the text values. For example, when it goes through section/div/p, it joins all the p values together without any spaces, as below:
IRIhttp://rdaregistry.info/Elements/w/P10142DomainWorkRangeWorkAlternate
The expected output is:
IRI http://rdaregistry.info/Elements/w/P10142 Domain Work
How should I get these values separated? Any help is appreciated.

I believe that depthFirst().findAll { it.name() == 'section' } returns a list of section nodes, and that section.div.p then refers to all the matching p tags at once, so its string value is their inner text concatenated without any separator.
Let's define your sample XML as xmlDoc. The snippet below visits each p tag individually and works as expected:
def parseBodyObject = new XmlSlurper().parseText(xmlDoc)
def findAllPtags = parseBodyObject.children().depthFirst().findAll {
it.name() == 'p'
}
def docText = new StringBuilder()
findAllPtags.each { p ->
docText.append("\n" + p)
}
You can replace \n by a space.
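Since the question is also tagged java, here is a minimal sketch of the same idea using the JDK's built-in DOM parser instead of XmlSlurper. The class name ParagraphText and the method joinParagraphs are made up for illustration; the point is only that each p is read on its own and the values are joined with an explicit separator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original question or answer.
class ParagraphText {
    // Reads the text of every <p> element separately and joins the values with one space.
    static String joinParagraphs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getElementsByTagName returns the elements in document order.
            NodeList paragraphs = doc.getElementsByTagName("p");
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // getTextContent() also includes the text of nested tags such as <b> and <xref>.
                String text = paragraphs.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    parts.add(text);
                }
            }
            return String.join(" ", parts);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><section><div><p><b>IRI</b></p>"
                + "<p><xref>http://rdaregistry.info/Elements/w/P10142</xref></p></div>"
                + "<div><p><b>Domain</b></p><p><xref>Work</xref></p></div></section></body>";
        System.out.println(joinParagraphs(xml));
    }
}
```

Passing a different separator to String.join (for example "\n") gives one value per line instead.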

Related

How to save Element from Jsoup to database

I use Jsoup to fetch all the data from a website, and I want to save an element to a database (MySQL, PostgreSQL, ...) whenever it matches certain text. My code looks like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed").get(0).children();
Elements list = new Elements();
Elements strings = new Elements();
for (Element element : elements) {
if (element.hasClass("post-feed-item")) {
list.add(element);
Element e = element.children().get(1).children().get(1).children().get(0);
if (e.text().matches("^.*?(Docker|docker|DOCKER).*$")) {
strings.add(e);
//save to element to DB
}
}
}
for (Element page : elements) {
if (links.add(URL)) {
//Remove the comment from the line below if you want to see it running on your editor
System.out.println(URL);
}
getPageLinks(page.attr("abs:href"));
}
If the title of an element contains "Docker", I want to save that element to the database. But the element also contains divs, link URLs, images and other content. How do I save it to the database? Is it feasible to save each child element in its own database field? If not, can I convert the element to HTML and save that? Please help.
Example of the HTML I want to save to the database:
<div class="post-feed-item">
<img src="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png" srcset="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 1x, https://images.viblo.asia/avatar-retina/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 2x" class="avatar avatar--md mr-05">
<div class="post-feed-item__info">
<div class="post-meta--inline">
<div class="user--inline d-inline-flex">
<!---->
Hoàn Kì
<!---->
</div>
<div class="post-meta d-inline-flex align-items-center flex-wrap">
<div class="text-muted mr-05">
<span class="mr-05">about 3 hours ago</span>
<button title="Copy URL" class="icon-btn _13z_mK0hRyRB3dPzawysKe_0"><i aria-hidden="true" class="fa fa-link"></i></button>
</div>
<!---->
<!---->
</div>
</div>
<div class="post-title--inline">
<h3 class="word-break mr-05">Docker: Chưa biết gì đến biết dùng (Phần 3 docker-compose )</h3>
<div class="tags" data-v-cbe11868>
<a href="/tags/docker" class="el-tag _3wKNDsArij9ZFjXe8k4ryR_0 el-tag--info el-tag--mini" data-v-cbe11868>Docker</a>
</div>
</div>
<!---->
<div class="d-flex justify-content-between">
<div class="d-flex">
<div class="stats">
<span title="Views" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-eye"></i> 62 </span>
<span title="Clips" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-paperclip"></i> 1 </span>
<span title="Comments" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-comments"></i> 0 </span>
</div>
<!---->
</div>
<div title="Score" class="points">
<div class="carets">
<i aria-hidden="true" class="fa fa-caret-up"></i>
<i aria-hidden="true" class="fa fa-caret-down"></i>
</div>
<span class="text-muted">4</span>
</div>
</div>
</div>
</div>
First, modify your logic for fetching post-feed-item like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed-item"); //This will get the whole element.
for (Element element : elements) {
String postFeeds = "";
if (element.toString().contains("docker")) {
postFeeds = postFeeds.concat(element.toString());
//save postFeeds to DB
}
}
Extra
/**
* Your parsed element may contain single quote (').
* This will cause error while persisting.
* to avoid this you need to escape single quote (')
* with double single quote ('')
*/
if (element.toString().contains("docker")) {
postFeeds = postFeeds.concat(element.toString().replaceAll("'", "''"));
//save postFeeds to DB
}
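As a hedged alternative to escaping by hand: with JDBC, binding the HTML as a PreparedStatement parameter normally makes the quote doubling unnecessary, because the driver handles the value. The sketch below shows both variants side by side. The class name PostFeedStore, the table post_feed and the column html are assumptions for illustration, not names from the question.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper class; table and column names are assumptions.
class PostFeedStore {
    // Manual variant: doubles every single quote, like replaceAll("'", "''") above.
    static String escapeSingleQuotes(String html) {
        return html.replace("'", "''");
    }

    // Parameterized variant: the HTML is bound as a parameter, so it is passed unescaped.
    static void save(Connection connection, String html) throws SQLException {
        String sql = "INSERT INTO post_feed (html) VALUES (?)";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, html);
            statement.executeUpdate();
        }
    }
}
```

With the parameterized variant you would call save(connection, element.toString()) directly, without calling escapeSingleQuotes first.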
Second, what if I want to save each element in its own field in the database? Is that feasible?
You don't need separate columns to store each element in the database. You can do it, but whether it is worthwhile depends on your use case. If you only want to store the post-feed-items in order to write them back to your web page, it is not worth it.
Third, how can I convert an element to HTML and save it?
You don't need to convert the element to HTML, but you do need to convert the element to a String if you want to save it to the database.
All you need is a column of the BLOB data type (you can also save it as VARCHAR, but BLOB is safer).
Update
How can I traverse all pages?
By looking at the source code of that page I found this is how you can get the total page number -
Elements pagination = doc.getElementsByAttributeValueMatching("href", "page=\\d");
int totalPageNo = Integer.parseInt(pagination.get(pagination.size() - 2).text());
then loop through each page.
for(int page = 1; page <= totalPageNo; page++) {
Connection conn = Jsoup.connect("https://viblo.asia/?page=" + page);
//rest of your code
}
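One way to keep the paging loop separate from the fetching is to build the list of page URLs first and then connect to each one. A minimal sketch; the class name PageUrls and the method build are made up for illustration, and the "/?page=" pattern is taken from the loop above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class PageUrls {
    // Builds one URL per page, numbered from 1 to totalPageNo inclusive.
    static List<String> build(String baseUrl, int totalPageNo) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= totalPageNo; page++) {
            urls.add(baseUrl + "/?page=" + page);
        }
        return urls;
    }
}
```

Each entry of PageUrls.build("https://viblo.asia", totalPageNo) can then be passed to Jsoup.connect in turn.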
I think I understand what you mean. Here are some suggestions. First, clarify what you are searching for and design the table fields in the database accordingly. For example, you could create a table_docker table with fields such as field_id, field_content, field_start_time, field_links and so on. Second, write some utility classes: JsoupUtils to fetch the HTML and parse it, HtmlUtils to handle the HTML markup and download the pictures, DBUtils to connect to the database and save data, POIUtils to present your data, and DataUtils to process your data in whatever way you need.

Jsoup - retrieving & manipulating data

I'm having trouble figuring out how to fully manipulate the data that I'm scraping with Jsoup. I know how to target the areas, but I don't know how to target them individually while still grouping them together.
For Example:
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading1 </p>
</div>
<div class="panel-body">
<p> Body1 <p>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading2 </p>
</div>
<div class="panel-body">
<p> Body2 <p>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading3 </p>
</div>
<div class="panel-body">
<p> Body3 <p>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading4 </p>
</div>
<div class="panel-body">
<p> Body4 <p>
</div>
I want to target different sections in this HTML and then place them in textViews a certain way. But when I try to for example target div.panel-heading & div.panel-body and I want to place the heading above the body it will repeat all of the div.panel-headings for the entire page first then below that it will repeat all of the div.panel-bodys. It's printing them in totally separate groups instead of one on top of the other.
Below is the code I'm using:
private void arbitrage() {
new Thread(new Runnable() {
@Override
public void run() {
final StringBuilder builder = new StringBuilder();
final StringBuilder builder2 = new StringBuilder();
try {
Document doc = Jsoup.connect("THE URL HERE").get();
Elements links = doc.select("div.panel.panel-default > div.panel-heading");
Elements links2 = doc.select("div.panel.panel-default > div.panel-body");
for (Element link : links) {
builder.append("\n").append(link.text());
builder2.append("\n").append(links2.text() + "\n");
}
} catch (IOException e) {
builder.append("Error : ").append(e.getMessage()).append("\n");
}
runOnUiThread(new Runnable() {
@Override
public void run() {
arbitrage.setText(builder.toString() + builder2.toString());
}
});
}
}).start();
}
==-=-=-=-=-=- EDITED =-=-=-=-=--
I've changed the HTML code to better reflect what the web URL looks like. When I run my current code it displays.
Heading1
Heading2
Heading3
Heading4
Body1
Body2
Body3
Body4
I want it to display as follows.
Heading1
Body1
Heading2
Body2
Heading3
Body3
Heading4
Body4
So essentially, I want to grab the panel-heading & panel-body individually, but display them together in a group. I can grab them both in one group by selecting div.panel.panel-default, but I don't have as much control on how this is displayed from a UI standpoint. At least I don't know how to manipulate that data when I scrape it all together like that.
EDIT TWO =-=-=-=-=-=-=-=-
I'm getting close. This code lets me manipulate the data individually, but I still can't do what I need. I want to style the heading and the body differently, say with different colors, and I can't figure out how.
for (Element panel : panels) {
Elements links = panel.select("div.panel-heading");
Elements links2 = panel.select("div.panel-body");
builder.append("\n").append(links.text()).append("\n").append("\n").append(links2.text())
.append("\n")
.append("\n");
}
changed my runOnUIThread to this:
runOnUiThread(new Runnable() {
@Override
public void run() {
arbitrageTextView.setText(builder.toString());
}
});
But if I want to give the heading a different text color from the body, I'm not able to. I also can't add a divider between the groups. It seems very limited on the UI side: I can pull the text in and display it, but not style it. I believe this is because everything is pulled into one TextView. Would I need to put the headings and bodies in two different TextViews?
Try this:
Elements panels = doc.select("div.panel.panel-default");
for (Element panel : panels) {
Elements links = panel.select("div.panel-heading");
Elements links2 = panel.select("div.panel-body");
builder.append("\n").append(links.text());
builder.append("\n").append(links2.text() + "\n");
}
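If the headings and bodies need to be styled separately later, one option is to keep each pair in a small value type instead of appending everything to a single StringBuilder. The sketch below leaves Jsoup and the Android views out so it stands alone. The names PanelGrouping, Panel and render are hypothetical; inside the loop above you would add new Panel(links.text(), links2.text()) to a list.

```java
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class PanelGrouping {
    // One heading together with the body that belongs to it.
    static final class Panel {
        final String heading;
        final String body;

        Panel(String heading, String body) {
            this.heading = heading;
            this.body = body;
        }
    }

    // Renders the panels as heading, body, blank line, in the order given.
    static String render(List<Panel> panels) {
        StringBuilder builder = new StringBuilder();
        for (Panel panel : panels) {
            builder.append(panel.heading).append("\n")
                   .append(panel.body).append("\n\n");
        }
        return builder.toString();
    }
}
```

Because heading and body stay separate fields, the UI code is free to put them into different views or apply different styles instead of calling render.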
Update
I changed code

Jsoup I want select div , not select span or all a

<div class="conditions-race">
Çim: Ağır 4,9 Kum: Normal Hava: 14 C , PARÇALI BULUTLU , NEM %50
<span style="float: right;">
<a id="PDFBulten">PDF Programı</a>
<a id="PDFOzetBulten">Özet PDF Programı</a>
<a id="CSVBulten">CSV Programı</a>
1. AGF Tablosu
2. AGF Tablosu
</span>
</div>
I want only this line "Çim: Ağır 4,9 Kum: Normal Hava: 14 C , PARÇALI BULUTLU , NEM %50"
You want to use Element#ownText method.
Extract from Javadoc
Gets the text owned by this element only; does not get the combined text of all children.
For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!".
Note that the text within the b element is not returned, as it is not a direct child of the p element.
Sample code
Document doc = ...
for(Element div : doc.select("div.conditions-race")) {
System.out.println(div.ownText());
}

GetValue (JSoup)

<div class="Class-feedbacks">
<div class="grading class2">
<div itemtype="http://xx.edu/grading" itemscope="" itemprop="studentgrading">
<div class="rating">
<img class="passportphoto" width="1500" height="20" src="http://greg.png" >
<meta content="4.0" itemprop="gradingvalue">
</div>
</div>
<meta content="2012-09-08" itemprop="gradePublished">
<span class="date smaller">9/8/2012</span>
</div>
<p class="review_comment feedback" itemprop="description">Greg is one the smart person in his batch</p>
</div>
I want to print:
date: 2012-09-08
Feedback : Greg is one the smart person in his batch
I was able to use the approach suggested in "Jsoup getting a hyperlink from li":
a doc.select(div div divn li ui ...) call to reach the element with the class feedback.
How should I use the select command to get the values shown above?
To get the value of an attribute, use the attr method. E.g.
Elements elements = doc.select("meta");
for(Element e: elements)
System.out.println(e.attr("content"));
To do it in one single select, have you tried the comma combinator ","?
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
// The sample contains two meta tags, so the first selector is limited to the gradePublished one.
Elements elmts = doc.select("div.Class-feedbacks meta[itemprop=gradePublished], div.Class-feedbacks p");
Element elmtDate = elmts.get(0);
System.out.println("date: " + elmtDate.attr("content"));
Element elmtParag = elmts.get(1);
System.out.println("Feedback: " + elmtParag.text());
You should get back 2 elements in your list after the select: the date and the feedback.
This is an old question and I might be late, but if anyone else wants to know how to do this easily, the below code will be helpful.
Document doc = Jsoup.parse(html);
// We select the meta tag whose itemprop property has value 'gradePublished'
String date = doc.select("meta[itemprop=gradePublished]").attr("content");
System.out.println("date: "+date);
// Now we select the text inside the p tag with itemprop value 'description'
String feedback = doc.select("p[itemprop=description]").text();
System.out.println("Feedback: "+feedback);

Get HTML nodes that have the same parent - JAVA

I have a document containing several forms similar to the example posted below. I want to extract all the name/value pairs from the hidden input fields of one of the forms, the form is identified by its name and I don't know in advance how many hidden fields will be present.
I am able to select all the relevant input fields in the document using the selector query: input[type=hidden][name][value]
Is there a way to select only the input fields that have FORM[name=lgo] as parent? Using some kind of filter, maybe?
<FORM METHOD='POST' onSubmit='javascript:isWaitForm();' ACTION='https://abc-azerty.querty.se/carmon/servlet/action/change_1 ' name='lgo'>
<input type='hidden' name='LogInFlag' value='1'>
<input type='hidden' name='LogInTime' value='2011-07-26 11:10'>
<input type='hidden' name='cCode2' value='SE'>
<a href='javascript:isWaitForm();javascript:document.lgo.submit();' class='linkNone'>Business Monitor</a>
<a href='javascript:isWaitForm();javascript:document.lgo.submit();' class='linkNone'>
<input type='image' src='/images/button_arrow_right.gif' height=19 width=22 border=0 style='float:left;'></A>
</FORM>
Based on this info, at least one of the following should work:
doc.select("form[name=lgo] > input[type=hidden]");
Or, you can chain your selects -
doc.select("form[name=lgo]").select("input[type=hidden]");
The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
<script type="text/javascript">
var inputs = document.getElementsByName('lgo')[0].getElementsByTagName('input');
for(var i = 0 ; i < inputs.length ; i++){
if(inputs[i].getAttribute('type') == "hidden") {
// This will get the name: inputs[i].getAttribute('name')
// This will get the value: inputs[i].value
console.log(inputs[i].getAttribute('name') + ": " + inputs[i].value);
}}
</script>
