Jsoup - retrieving & manipulating data - java

I'm having trouble working out how to manipulate the data I'm scraping with Jsoup. I know how to target the areas I need, but I don't know how to target them individually while still keeping them grouped together.
For example:
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading1 </p>
</div>
<div class="panel-body">
<p> Body1 </p>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading2 </p>
</div>
<div class="panel-body">
<p> Body2 </p>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading3 </p>
</div>
<div class="panel-body">
<p> Body3 </p>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">
<p> Heading4 </p>
</div>
<div class="panel-body">
<p> Body4 </p>
</div>
</div>
I want to target different sections of this HTML and place them in TextViews in a particular layout. But when I target, for example, div.panel-heading and div.panel-body and try to place each heading above its body, it prints all of the div.panel-headings for the entire page first and then, below that, all of the div.panel-bodys. They come out in two totally separate groups instead of one on top of the other.
Below is the code I'm using:
private void arbitrage() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();
            final StringBuilder builder2 = new StringBuilder();
            try {
                Document doc = Jsoup.connect("THE URL HERE").get();
                Elements links = doc.select("div.panel.panel-default > div.panel-heading");
                Elements links2 = doc.select("div.panel.panel-default > div.panel-body");
                for (Element link : links) {
                    builder.append("\n").append(link.text());
                    builder2.append("\n").append(links2.text() + "\n");
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    arbitrage.setText(builder.toString() + builder2.toString());
                }
            });
        }
    }).start();
}
EDIT
I've changed the HTML code to better reflect what the web URL looks like. When I run my current code it displays:
Heading1
Heading2
Heading3
Heading4
Body1
Body2
Body3
Body4
I want it to display as follows:
Heading1
Body1
Heading2
Body2
Heading3
Body3
Heading4
Body4
So essentially, I want to grab the panel-heading and panel-body individually but display them together as a group. I can grab both at once by selecting div.panel.panel-default, but then I have less control over how they are displayed in the UI. At least, I don't know how to manipulate the data when I scrape it all together like that.
EDIT TWO
I'm getting close. This code lets me handle the data individually, but I still can't do what I need: I want to style the heading and body differently, say in different colors, and I can't figure out how.
for (Element panel : panels) {
    Elements links = panel.select("div.panel-heading");
    Elements links2 = panel.select("div.panel-body");
    builder.append("\n").append(links.text()).append("\n").append("\n").append(links2.text())
            .append("\n")
            .append("\n");
}
I changed my runOnUiThread call to this:
runOnUiThread(new Runnable() {
    @Override
    public void run() {
        arbitrageTextView.setText(builder.toString());
    }
});
But if I want to give the header a different text color from the body, I can't. Nor can I add a divider between the groups. It seems very limited on the UI side: I can pull the text in and display it, but not style it. I believe this is because everything goes into one TextView. Would I need to put them in two different TextViews?

Try this:
Elements panels = doc.select("div.panel.panel-default");
for (Element panel : panels) {
    Elements links = panel.select("div.panel-heading");
    Elements links2 = panel.select("div.panel-body");
    builder.append("\n").append(links.text());
    builder.append("\n").append(links2.text() + "\n");
}
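On the styling part of the question, one possible direction is to keep each panel's heading and body as separate fields instead of appending everything to a single StringBuilder, so that the UI layer can render each part in its own view or with its own style. The following is only a sketch under that assumption: the Panel class and the pair and render methods are hypothetical names, and the plain strings stand in for the values of links.text() and links2.text().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PanelPairs {

    // Hypothetical holder for one panel's heading text and body text.
    static final class Panel {
        final String heading;
        final String body;

        Panel(String heading, String body) {
            this.heading = heading;
            this.body = body;
        }
    }

    // Pairs the i-th heading with the i-th body; stops at the shorter list.
    static List<Panel> pair(List<String> headings, List<String> bodies) {
        List<Panel> panels = new ArrayList<>();
        int count = Math.min(headings.size(), bodies.size());
        for (int i = 0; i < count; i++) {
            panels.add(new Panel(headings.get(i), bodies.get(i)));
        }
        return panels;
    }

    // Renders each heading above its body, followed by a blank line.
    static String render(List<Panel> panels) {
        StringBuilder builder = new StringBuilder();
        for (Panel panel : panels) {
            builder.append(panel.heading).append("\n")
                   .append(panel.body).append("\n\n");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        List<Panel> panels = pair(
                Arrays.asList("Heading1", "Heading2"),
                Arrays.asList("Body1", "Body2"));
        System.out.print(render(panels));
    }
}
```

With the data held this way, the heading and body of each Panel remain available separately, so they could be bound to two different TextViews per row rather than to one combined string.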
Update: I changed the code.

Related

Get number of words in a text

I'm using Java and Selenium, and I have to extract the number of words in a specific text. I'm stuck because I get more results than I expected.
Consider the following HTML:
<div data-v-2f952c88="" class="text1">
<section data-v-3b70ad5b="" data-v-2f952c88="" data-content-provider="ABC" class="description__section">
<div data-v-051a83e7="" data-v-3b70ad5b="" class="markdown" data-v-2f952c88="">
<p>Headline 1
Hello everyone i´m new at stack overflow</p>
<p> And I need your help
to get the total of words in this exemple
</p>
</div>
</section>
<section data-v-3b70ad5b="" data-v-2f952c88="" data-content-provider="DEF" class="description__section">
<div data-v-051a83e7="" data-v-3b70ad5b="" class="markdown" data-v-2f952c88="">
<p>I Love Coding
I use Java</p>
<p> Another Text
And Selenium
</p>
</div>
</section>
</div>
<div data-v-2f952c99="" class="querty">
<section data-v-3b755ad5b="" data-v-2f952288="" data-content-provider="DEF" class="description__section">
<div data-v-051a18e7="" data-v-3b789d5b="" class="markdown" data-v-2f962c88="">
<p>This is another text along the WEBPAGE
I don´t want to count this words in my total count</p>
</div>
</section>
</div>
In Java I've created this function:
private String countWords(WebDriver driver){
    int totalLetters = 0;
    try{
        List<WebElement> className = driver.findElements(By.cssSelector("[class*='text1']"));
        for(WebElement classElement: className){
            if(classElement!=null) {
                String[] tags = {"p", "section"};
                for (String tag: tags) {
                    List<WebElement> elements = driver.findElements(By.tagName(tag));
                    for (WebElement element: elements) {
                        String text=element.getText();
                        String[] words = text.split("\\s+");
                        if (words!=null) {
                            totalLetters = totalLetters + words.length;
                        }
                    }
                }
            }
        }
    }
    catch(NoSuchMethodError e){
        //e.printStackTrace();
        throw e;
    }
    String s=String.valueOf(totalLetters);
    System.out.println("How many word? " + s);
    return s;
}
So my problem is that my function counts the words inside every "p" and "section" tag on the web page, and I only want the "p" and "section" tags inside the first div with class="text1".
What am I doing wrong?
Please refer to the image to see why it counts every 'p' and 'section' tag.
Does this help you find the problem?
Or is your problem that it also counts the text under class='querty'?
<div data-v-2f952c99="" class="querty">
<section data-v-3b755ad5b="" data-v-2f952288="" data-content-provider="DEF" class="description__section">
<div data-v-051a18e7="" data-v-3b789d5b="" class="markdown" data-v-2f962c88="">
<p>This is another text along the WEBPAGE
I don´t want to count this words in my total count</p>
</div>
</section>
</div>
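Two details in the question's code are worth noting here. The inner lookup calls driver.findElements, which searches the whole page, whereas classElement.findElements would search only inside the matched div. Also, counting both "p" and "section" counts the same text twice, since each section's text includes its paragraphs. Separately, the split itself has an edge case: an empty string still yields one (empty) token, and leading whitespace yields an empty first token. The helper below is a small sketch of the counting step only; the class and method names are hypothetical, and the strings stand in for what element.getText() would return.

```java
import java.util.Arrays;
import java.util.List;

public class WordCounter {

    // Counts whitespace-separated tokens in one block of text.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") would otherwise report one (empty) word.
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Sums the word counts of several text blocks, e.g. one string per <p> element.
    static int countWordsInAll(List<String> paragraphs) {
        int total = 0;
        for (String paragraph : paragraphs) {
            total += countWords(paragraph);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Headline 1\nHello everyone",
                " And I need your help\n",
                "");
        System.out.println("How many words? " + countWordsInAll(paragraphs));
    }
}
```

Feeding this helper only the paragraph texts found under the text1 element keeps the total scoped to that div and avoids counting the same words once per enclosing tag.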

How to save Element from Jsoup to database

I use Jsoup to fetch data from a website, and I want to save an element when it matches some content. That is, when an element's text matches certain characters, I want to save that element to a database (MySQL, PostgreSQL, ...). My code looks like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed").get(0).children();
Elements list = new Elements();
Elements strings = new Elements();
for (Element element : elements) {
    if (element.hasClass("post-feed-item")) {
        list.add(element);
        Element e = element.children().get(1).children().get(1).children().get(0);
        if (e.text().matches("^.*?(Docker|docker|DOCKER).*$")) {
            strings.add(e);
            // save the element to the DB
        }
    }
}
for (Element page : elements) {
    if (links.add(URL)) {
        // Remove the comment from the line below if you want to see it running on your editor
        System.out.println(URL);
    }
    getPageLinks(page.attr("abs:href"));
}
If the title of an element contains "Docker", I want to save that element to the database. But the element contains divs and other things such as link URLs, images and content. How do I save it to the database? Is it feasible to save each element in its own field in the database? If not, can I convert the element to HTML and save that? Please help.
Example of the HTML I want to save to the database:
<div class="post-feed-item">
<img src="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png" srcset="https://images.viblo.asia/avatar/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 1x, https://images.viblo.asia/avatar-retina/1d0e5458-ad41-4d1c-89db-292dc198b4fa.png 2x" class="avatar avatar--md mr-05">
<div class="post-feed-item__info">
<div class="post-meta--inline">
<div class="user--inline d-inline-flex">
<!---->
Hoàn Kì
<!---->
</div>
<div class="post-meta d-inline-flex align-items-center flex-wrap">
<div class="text-muted mr-05">
<span class="mr-05">about 3 hours ago</span>
<button title="Copy URL" class="icon-btn _13z_mK0hRyRB3dPzawysKe_0"><i aria-hidden="true" class="fa fa-link"></i></button>
</div>
<!---->
<!---->
</div>
</div>
<div class="post-title--inline">
<h3 class="word-break mr-05">Docker: Chưa biết gì đến biết dùng (Phần 3 docker-compose )</h3>
<div class="tags" data-v-cbe11868>
<a href="/tags/docker" class="el-tag _3wKNDsArij9ZFjXe8k4ryR_0 el-tag--info el-tag--mini" data-v-cbe11868>Docker</a>
</div>
</div>
<!---->
<div class="d-flex justify-content-between">
<div class="d-flex">
<div class="stats">
<span title="Views" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-eye"></i> 62 </span>
<span title="Clips" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-paperclip"></i> 1 </span>
<span title="Comments" class="stats-item text-muted"><i aria-hidden="true" class="stats-item__icon fa fa-comments"></i> 0 </span>
</div>
<!---->
</div>
<div title="Score" class="points">
<div class="carets">
<i aria-hidden="true" class="fa fa-caret-up"></i>
<i aria-hidden="true" class="fa fa-caret-down"></i>
</div>
<span class="text-muted">4</span>
</div>
</div>
</div>
</div>
First, modify your logic for fetching post-feed-item like this:
Connection conn = Jsoup.connect("https://viblo.asia");
Document doc = conn.userAgent("Mozilla").get();
Elements elements = doc.getElementsByClass("post-feed-item"); // This will get the whole element.
for (Element element : elements) {
    String postFeeds = "";
    if (element.toString().contains("docker")) {
        postFeeds = postFeeds.concat(element.toString());
        // save postFeeds to DB
    }
}
Extra
/**
 * Your parsed element may contain a single quote (').
 * This will cause an error while persisting.
 * To avoid this, escape each single quote (')
 * with a doubled single quote ('').
 */
if (element.toString().contains("docker")) {
    postFeeds = postFeeds.concat(element.toString().replaceAll("'", "''"));
    // save postFeeds to DB
}
Second: is it feasible to save each element in its own field in the database?
You don't need separate columns to store each element in the database. You can do it, but whether it is worthwhile depends on your use case: if you only store the post-feed-items in order to write them back to your web page, separate columns are not worth it.
Third: how can I convert an element to HTML and save it?
You don't need to convert the element to HTML, but you do need to convert it to a String if you want to save it to the database.
All you need is a column that can hold long values, such as a BLOB column (VARCHAR also works, but its length limit makes it less safe for large elements).
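As an alternative to doubling single quotes by hand, the element's HTML can be bound as a parameter of a java.sql.PreparedStatement, which leaves quoting to the JDBC driver. The sketch below assumes a hypothetical table post_feed_item with columns title and html, and a Connection obtained elsewhere; none of these names come from the original post. It also shows a case-insensitive title check as a simpler stand-in for the (Docker|docker|DOCKER) regular expression.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Locale;

public class PostFeedStore {

    // Hypothetical table and column names; adjust to the real schema.
    static final String INSERT_SQL =
            "INSERT INTO post_feed_item (title, html) VALUES (?, ?)";

    // Case-insensitive check, so "Docker", "docker" and "DOCKER" all match.
    static boolean mentionsDocker(String title) {
        return title != null && title.toLowerCase(Locale.ROOT).contains("docker");
    }

    // Binds both values as parameters, so no manual quote escaping is needed.
    static void save(Connection connection, String title, String html) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            statement.setString(1, title);
            statement.setString(2, html);
            statement.executeUpdate();
        }
    }

    public static void main(String[] args) {
        // Only the filter is exercised here; save() needs a live database connection.
        System.out.println(mentionsDocker("Docker: part 3 (docker-compose)"));
        System.out.println(mentionsDocker("Kubernetes basics"));
    }
}
```

In the scraping loop, save would be called with the title text and element.toString() for each element that passes the filter.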
Update
How can I traverse all pages?
By looking at the source code of that page, I found that this is how you can get the total number of pages:
Elements pagination = doc.getElementsByAttributeValueMatching("href", "page=\\d");
int totalPageNo = Integer.parseInt(pagination.get(pagination.size() - 2).text());
Then loop through each page:
for (int page = 1; page <= totalPageNo; page++) {
    Connection conn = Jsoup.connect("https://viblo.asia/?page=" + page);
    // rest of your code
}
I think I understand what you mean. Here are some thoughts. First, clarify what you are searching for and design the table fields in the database. For example, following your idea, you could create a table_docker table with fields such as field_id, field_content, field_start_time, field_links and so on. Second, write some utility classes: JsoupUtils to fetch the HTML and parse it, HtmlUtils to handle the HTML comments and download the pictures, DBUtils to connect to the database and save data, POIUtils to display your data, and DataUtils to process your data in whatever way you need.

Select and iterate through elements and sub elements with same name (Jsoup)

I need to parse a page with Jsoup. The page has elements with tags such as div, h3 and a. I want to go through the elements and select each a (i.e. the title) to be displayed in a jList.
As an example, the page looks like:
<div class="start">
<div class="g">
<div class="abc">
<a class="picture" href="www.img.com"><img src="img" alt="image1"></a>
<div class="xyz">
<h3 class="_r">
<a class="title" href="www.example.com" onmousedown="return rwt(this,'','','','1','adf','','ahahh','','',event)">THIS IS <em>example</em>1</a>
</h3>
</div>
</div>
</div>
<div class="g">
<div class="abc">
<a class="picture" href="www.img.com"><img src="img" alt="image2"></a>
<div class="xyz">
<h3 class="_r">
<a class="title" href="www.example.com" onmousedown="return rwt(this,'','','','1','adf','','ahahh','','',event)">lead by this<em>example</em></a>
</h3>
</div>
</div>
</div>
<div class="g">
<div class="abc">
<a class="picture" href="www.img.com"><img src="img" alt="image3"></a>
<div class="xyz">
<h3 class="_r">
<a class="title" href="www.example.com" onmousedown="return rwt(this,'','','','1','adf','','ahahh','','',event)">showed<em>example</em>for the people</a>
</h3>
</div>
</div>
</div>
<div class="g">
<div class="abc">
<a class="picture" href="www.img.com"><img src="img" alt="image4"></a>
<div class="xyz">
<h3 class="_r">
<a class="title" href="www.example.com" onmousedown="return rwt(this,'','','','1','adf','','ahahh','','',event)">we set<em>example</em>for people</a>
</h3>
</div>
</div>
</div>
</div>
This is the code:
String url = "http://www.google.com/search?q=example&tbm=nws&source=lnms";
String title = "";
try {
    Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
    Elements e = doc.select("div.g");
    for (Element e1 : e) {
        title = e1.getElementsByTag("a").text();
    }
    DefaultListModel<String> listModel = new DefaultListModel<>();
    listModel.addElement(title);
    jList.setModel(listModel);
} catch (IOException ex) {
    Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
}
The output that I got was the title of the last element div.g:
we set example for people
I want to select the title from each div.g and display each title as a separate item in the jList, like this:
THIS IS example 1
lead by this example
showed example for the people
we set example for people
Currently you assign the scraped data to title in a loop and then outside the loop you assign title to the jlist. So, the value of title once the loop has completed will always be the last value.
Replace this ...
for (Element e1 : e) {
    title = e1.getElementsByTag("a").text();
}
DefaultListModel<String> listModel = new DefaultListModel<>();
listModel.addElement(title);
With this ...
DefaultListModel<String> listModel = new DefaultListModel<>();
for (Element e1 : e) {
    listModel.addElement(e1.getElementsByTag("a").text());
}
You don't actually add the title each time. The loop replaces title with each new value it finds, and only after the loop do you add it to the list. Something like this might work the way you want:
DefaultListModel<String> listModel = new DefaultListModel<>();
for (Element e1 : e) {
    listModel.addElement(e1.getElementsByTag("a").text());
}

How to find same element from item grid using loop in java selenium?

I am trying to use a loop to check whether the Add to cart button is present in every item box in the following code:
<div class="page-body">
<div class="product-selectors">
<div class="product-filters-wrapper">
<div class="product-grid">
<div class="item-box">
<div class="item-box">
<div class="item-box">
<div class="item-box">
</div>
Each item box contains the following code:
<div class="item-box">
<div class="product-item" data-productid="20">
<div class="picture">
<div class="details">
<h2 class="product-title">
<div class="product-rating-box" title="1 review(s)">
<div class="description"> 12x optical zoom; SuperRange Optical Image Stabilizer </div>
<div class="add-info">
<div class="prices">
<div class="buttons">
<input class="button-2 product-box-add-to-cart-button" type="button" onclick="AjaxCart.addproducttocart_catalog('/addproducttocart/catalog/20/1/1 ');return false;" value="Add to cart">
</div>
</div>
</div>
</div>
</div>
I need to check, using a loop, whether every item box has an Add to cart button. Can anyone help, please?
I suggest avoiding the loop unless there is an explicit need for it. You can find the count of Add to cart buttons and compare it with a known value:
By byCss = By.cssSelector(".item-box>div input[value='Add to cart']");
int cartCount = driver.findElements(byCss).size();
if (cartCount != 4){
    // fail the test
}
If you do want to loop and check whether the input button exists in each box:
By itemBoxes = By.className("item-box");
By button = By.cssSelector("[type='button'][value='Add to cart']");
List<WebElement> webElementList = driver.findElements(itemBoxes);
for (WebElement element: webElementList){
    // simply take the size; if the button exists it will be 1
    if (element.findElements(button).size() != 1){
        // fail
    }
}
You can search by XPath inside the loop.
Something like
".//input[@value='Add to cart'][1]"
".//input[@value='Add to cart'][2]"
".//input[@value='Add to cart'][3]"
etc.
I'm not sure that this XPath is correct, but generally it will work for you, bro.
Or something like this:
String xpath = ".//input[@value='Add to cart']";
List<WebElement> addToCartButtons = driver.findElements(By.xpath(xpath));
for (WebElement button : addToCartButtons) {
    button.click();
}

How do I correctly parse data using JSoup (java)

I want to parse the data (CompanyName, Location, jobDescription, ...) out of this HTML using JSoup (Java). I get stuck when trying to iterate over the job listings.
The extract below is one of many "joblisting" divs that I want to iterate over to extract the data. I just can't work out how to iterate over the specific div objects. Sorry for the beginner question, but maybe someone who already knows which function to use can help. select?
<div class="between_listings"><!-- local.spacer --></div>
<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">
<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
<img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
</a>
</div>
<div class="job_info">
<div class="h3 job_title">
<a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
<span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
</a>
</div>
<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Delivery Hero Holding GmbH</span>
</div>
</div>
<div class="job_location_date">
<div class="job_location target-location">
<div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
<div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality"> Berlin</span>
</div>
<span class="location_actions">
<a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
<span class="location-icon"><!-- --></span>
<span class="location-label">Google Maps</span>
</a>
</span>
</div>
</div>
<div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>
<div class="job_actions">
</div>
</div>
<div class="between_listings"><!-- local.spacer --></div>
Java code:
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements jobListingElements = ParseResult.select(".joblisting");
for (Element jobListingElement: jobListingElements) {
    jobListingElement.select(".companyName span[itemprop=\"name\"]");
    // other element properties
    System.out.println(jobListingElements);
}
Thank you!
So you have your Jsoup document, right? Then it seems pretty easy, as long as the CSS class joblisting does not appear anywhere else.
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
    Elements jobTitleElement = element.select(".job_title span");
    Elements companyNameElement = element.select(".company_name span[itemprop=name]");
    String companyName = companyNameElement.text();
    String jobTitle = jobTitleElement.text();
    System.out.println(companyName);
    System.out.println(jobTitle);
}
I don't know why the attribute [itemprop*=\"name\"] selector does not find the span (Further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax )
Got it: span[itemprop=name] without any quotes or escapes. Other attributes or values also should work to get a more specific selection.
