Hi, I need to scrape a web site using Jsoup, and I need the output in key-value pairs. Can anyone suggest an approach?
The url which i need to scrape is https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=
The code I have written is:
package com.jaysons;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeBody {
    public static void main(String[] args) throws IOException {
        String url = "https://www.cpsc.gov/Recalls?field_rc_date_value%5Bmin%5D&field_rc_date_value%5Bmax%5D&field_rc_heading_value=&field_rc_hazard_description_value=&field_rc_manufactured_in_value=&field_rc_manufacturers_value=&field_rc_number_value=";
        Document doc = Jsoup.connect(url).get();
        Elements content = doc.select("div.views-field views-field-php");
        doc = Jsoup.parse(content.html().replaceAll("</div>", "</div><span>")
                .replaceAll("<div", "</span><div"));
        Elements labels = doc.select("div.remedy");
        for (Element label : labels) {
            System.out.println(String.format("%s %s", label.text().trim(),
                    label.nextElementSibling().text()));
        }
    }
}
I need the output in key-value pairs, like:
Date:OCTOBER 20, 2017
remedy:
units:
website:http://www.bosch-home.com/us
phone:(888) 965-5813
Kindly let me know where I made a mistake.
There's no need to reassign and re-parse the value of the content variable.
Elements content = doc.select("div.views-field > span");
for (Element viewField : content) {
    /*
      each viewField corresponds to one
      <div class="views-field views-field-php">
        <span class="field-content">
          <a href="/Recalls/2018/BSH-Home-Appliances-amplía-retiro-del-mercado-de-lavavajillas">
            <div class="date">
              October 20, 2017
            </div>
            ...
        </span>
      </div>
    */
    Elements divs = viewField.getElementsByTag("div");
    for (Element div : divs) {
        String className = div.className();
        if (className.equals("date")) {
            // store and extract date
        } else if (className.equals("...")) {
            // do something else
        } // else...
    }
}
You can select subelements not only by tag, but also by name, by attributes, etc. Check the official documentation for more info: https://jsoup.org/cookbook/extracting-data/dom-navigation
Disclaimer: I could not test the code right now.
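To get the key-value output described above, one approach is to use each child div's class name as the key and its text as the value. Below is a minimal sketch against an inline HTML fragment; the class names ("date", "remedy") are assumptions based on the snippets above, not verified against the live CPSC page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class KeyValueSketch {
    public static void main(String[] args) {
        // Inline fragment standing in for one recall entry; the real page's
        // class names ("date", "remedy", ...) are assumptions here.
        String html = "<div class=\"views-field views-field-php\">"
                + "<div class=\"date\">October 20, 2017</div>"
                + "<div class=\"remedy\">Repair</div>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        Map<String, String> pairs = new LinkedHashMap<>();
        // Each child div's class name becomes the key, its text the value.
        for (Element div : doc.select("div.views-field > div")) {
            pairs.put(div.className(), div.text().trim());
        }
        System.out.println(pairs); // e.g. {date=October 20, 2017, remedy=Repair}
    }
}
```

The same loop would run over each recall entry on the real page once the correct per-entry selector is known.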
Very new to JSoup, trying to retrieve a changeable value stored inside a tag, specifically from the following website and HTML.
Snapshot of HTML
The results after "constituency/" are changeable and dependent on the input of the user. I am able to retrieve the h2 tags themselves, but not the information within. At the moment, the best return I can get is just the tags, using the method below.
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
    String constit = "";
    String h2 = "h2";
    String url = "https://www.whoismytd.com/search?utf8=✓&form-input=" + aLine1 + "%2C+" + aLine3 + "+Ireland";
    // Switch to try/catch if time
    Document doc = Jsoup.connect(url)
            .timeout(6000).get();
    // Scrape elements from relevant section
    Elements body = doc.select("well.col-md-4.h2");
    Element e = new Element("well.col-md-4.h2");
    constit = e.toString();
    return constit;
}
I am extremely new to JSoup and scraping in general. Would appreciate any input from someone who knows what they're doing or any alternate ways to try and get the desired result
Change your scraping elements from relevant section code as follows:
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
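The three steps above combine into a small runnable sketch; the inline HTML fragment here is an assumption standing in for the real results page, which I have not re-checked.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ConstituencySketch {
    public static void main(String[] args) {
        // Inline stand-in for the search results page; the real markup may differ.
        String html = "<div class=\"well col-md-4\">"
                + "<a href=\"/constituency/dublin-bay-south\">Dublin Bay South</a>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        Element tdsDiv = doc.select("div.well").first();  // first "well" panel
        Element constLink = tdsDiv.select("a").first();   // first link inside it
        String constit = constLink.text();                // the link's text content
        System.out.println(constit); // Dublin Bay South
    }
}
```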
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

@DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {

    private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";

    @Test
    void findSomeText() throws IOException {
        String expected = "Dublin Bay South";
        Document document = Jsoup.connect(URL).get();
        String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
        Assertions.assertEquals(expected, actual);
    }
}
I'm attempting to create a crawler using Jsoup that will...
Go to a web page (specifically, a google sheets publicly published page like this one https://docs.google.com/spreadsheets/d/1CE9HTe2rdgPsxMHj-PxoKRGX_YEOCRjBTIOVtLa_2iI/pubhtml) and collect all href url links found in each cell.
Next, I want it to go to each individual url found on the page, and crawl THAT url's headline and main image.
Ideally, if the urls on the google sheets page were for example, a specific Wikipedia page and a Huffington Post article, it would print out something like:
Link: https: //en.wikipedia.org/wiki/Wolfenstein_3D
Headline: Wolfenstein 3D
Image: https: //en.wikipedia.org/wiki/Wolfenstein_3D#/media/File:Wolfenstein-3d.jpg
Link: http: //www.huffingtonpost.com/2012/01/02/ron-pippin_n_1180149.html
Headline: Ron Pippin’s Mythical Archives Contain History Of Everything (PHOTOS)
Image: http: //i.huffpost.com/gen/453302/PIPPIN.jpg
(excuse the spaces in the URLs. Obviously I don't want the crawler to add spaces and break up URLS... stack overflow just wouldn't let me post more links in this question)
So far, I've got the jsoup working for the first step (pulling the links from the initial url) using this code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class mycrawler {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://docs.google.com/spreadsheets/d/1CE9HTe2rdgPsxMHj-PxoKRGX_YEOCRjBTIOVtLa_2iI/pubhtml").get();
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println(link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I'm now having trouble figuring out how to create the second aspect of the crawler where it cycles through each link (could be a variable number of links) and finds the headline and main image from each.
// Also requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static void main(String[] args) {
    Document doc;
    String url = "https://docs.google.com/spreadsheets/d/1CE9HTe2rdgPsxMHj-PxoKRGX_YEOCRjBTIOVtLa_2iI/pubhtml";
    try {
        doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String innerurl = link.text();
            if (!innerurl.contains("://")) {
                continue;
            }
            System.out.println("*******");
            System.out.println(innerurl);
            Document innerDoc = Jsoup.connect(innerurl).get();
            Elements headerLinks = innerDoc.select("h1");
            for (Element innerLink : headerLinks) {
                System.out.println("Headline : " + innerLink.text());
            }
            Elements imgLinks = innerDoc.select("img[src]");
            for (Element innerLink : imgLinks) {
                String innerImgSrc = innerLink.attr("src");
                if (innerurl.contains("huffingtonpost") && innerImgSrc.contains("i.huffpost.com/gen")) {
                    System.out.println("Image : " + innerImgSrc);
                }
                if (innerurl.contains("wikipedia")) {
                    Pattern pattern = Pattern.compile("(jpg)$", Pattern.CASE_INSENSITIVE);
                    Matcher matcher = pattern.matcher(innerImgSrc);
                    if (matcher.find()) {
                        System.out.println("Image : " + innerImgSrc);
                        break;
                    }
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output
*******
https://en.wikipedia.org/wiki/Wolfenstein_3D
Headline : Wolfenstein 3D
Image : //upload.wikimedia.org/wikipedia/en/0/05/Wolfenstein-3d.jpg
*******
http://www.huffingtonpost.com/2012/01/02/ron-pippin_n_1180149.html
Headline : Ron Pippin's Mythical Archives Contain History Of Everything (PHOTOS)
Image : http://i.huffpost.com/gen/453302/PIPPIN.jpg
Image : http://i.huffpost.com/gen/453304/PIPSHIP.jpg
I think you should get the href attribute of the link with link.attr("href") instead of link.text(), since on that page the displayed text and the underlying href are different. Track all the links in a list, then iterate that list in a second step to get the corresponding Document from which you can extract the headline and image URL.
For wiki pages we can extract the heading with Jsoup as follows:
Element heading = document.select("#firstHeading").first();
System.out.println("Heading : " + heading.text());
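If any of the collected hrefs turn out to be relative, Jsoup's abs: attribute prefix resolves them against the document's base URI. A small sketch on an inline fragment (the link target is just an illustrative example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefSketch {
    public static void main(String[] args) {
        // Parse with a base URI so relative links can be resolved.
        String html = "<a href=\"/wiki/Wolfenstein_3D\">Wolfenstein 3D</a>";
        Document doc = Jsoup.parse(html, "https://en.wikipedia.org");
        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("href"));     // /wiki/Wolfenstein_3D
        System.out.println(link.attr("abs:href")); // https://en.wikipedia.org/wiki/Wolfenstein_3D
    }
}
```

When fetching with Jsoup.connect(url).get(), the base URI is set automatically, so abs:href works out of the box.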
I have looked through multiple forums before asking this question. Basically, what I need is to select part of the text in an HTML file. The HTML is constructed something like this:
<div class = "pane big">
<code>
<pre>
SomeText
<a id="par1" href="#par1">¶</a>
MoreText
.
.
.
<a id="par2" href="#par2">¶</a>
MoreText
</pre>
</code>
</div>
So what I need to do is extract the text under the href tag par1 by itself, and then get the text under the par2 href tag separately. I tried to use Jsoup, but all I could do is select the whole text within the div. I also tried XPath, but the expression that I'm evaluating is not accepted; I'm not sure, maybe because it's not an XML file to begin with.
An example of the XPath expressions that I used is:
/html/body/div/div[2]/code[2]/pre/text()[3]
and the CSS:
body > div > div.pane.big > code:nth-child(7) > pre
It's not possible to do that with pure CSS selectors; additional extracting and appending logic is needed in the Java code:
Select the pre element.
Split its content into a sequence of text parts, using the a elements as separators.
Skip the first part and join the next two (or more) parts.
Here is a simple code sample for that (JDK 1.8 style with the stream API, and older JDK 1.5-1.7 style):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
import static java.util.Arrays.stream;
import static java.util.stream.Collectors.joining;
public class SimpleParser {
    public static void main(String[] args) throws IOException {
        final Document document = Jsoup.parse(new File("div.html"), "UTF-8");
        final Elements elements = document.select("div.pane.big pre");

        System.out.println("JDK 1.8 style");
        System.out.println(
                stream(elements.html().split("\\s+<a.+</a>\\s+"))
                        .skip(1)
                        .collect(joining("\n")));

        System.out.println("\nJDK 1.7 style");
        String[] textParts = elements.html().split("\\s+<a.+</a>\\s+");
        StringBuilder resultText = new StringBuilder();
        for (int i = 1; i < textParts.length; i++) {
            resultText.append(textParts[i]).append("\n");
        }
        System.out.println(resultText.toString());
    }
}
P.S. Note that the last div tag in your HTML code sample should be a closing tag.
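The split-and-skip logic is independent of Jsoup, so it can be tried in isolation. Here is a small self-contained sketch using the same regex on a hard-coded stand-in for the <pre> block's inner HTML:

```java
import static java.util.Arrays.stream;
import static java.util.stream.Collectors.joining;

public class SplitSketch {
    public static void main(String[] args) {
        // Stand-in for elements.html(): text with <a> elements acting as separators.
        String preHtml = "SomeText\n <a id=\"par1\" href=\"#par1\">¶</a>\n MoreText1\n"
                + " <a id=\"par2\" href=\"#par2\">¶</a>\n MoreText2";
        String joined = stream(preHtml.split("\\s+<a.+</a>\\s+"))
                .skip(1)                 // drop the text before the first anchor
                .collect(joining("\n")); // re-join the paragraph bodies
        System.out.println(joined);      // MoreText1 then MoreText2
    }
}
```

Note that the . in the regex does not cross newlines, so each anchor must sit on its own line for the split to behave as above.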
Wait, so you need the part inside the href attribute, right? Say we have
<a id="par1" href="#iNeedThisPart">¶</a>, then do you want 'iNeedThisPart'?
If that is indeed what you want, then you need to use the CSS query a[href], which selects all 'a' tags with an 'href' attribute. The Jsoup code for that is as follows:
public List<String> getTextWithinHrefAttribute(final File file) throws IOException {
    final List<String> hrefTexts = new ArrayList<>();
    final Document document = Jsoup.parse(file, "utf-8");
    final Elements ahrefs = document.select("a[href]");
    for (final Element ahref : ahrefs) {
        hrefTexts.add(ahref.attr("href"));
    }
    return hrefTexts;
}
I am assuming that you are parsing from a file, and not crawling a web page.
I am trying to parse this URL: http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y
Document jDoc1 = Jsoup.connect(url1).get();
System.out.println(jDoc1.text());
But the output of the second line above is all the TAGS inside the textarea, plus the text of the other tags. The output is like:
..
..
<ul class="">
<li><a data-time="1dy" data-frequency="1mi" class="mdm_time">1 Day</a></li>
<li><a data-time="5dy" data-frequency="15mi" class="mdm_time">5 Days</a></li>
..
..
All the HTML inside the textarea is getting printed, along with the text of the other tags. I either want to remove this tag from the Document, or get it as an element so that I can remove it by hand.
I hope I explained everything clearly. Please help me solve this.
EDIT :
As per suggestion, I did this :
System.out.println(jDoc1.select("textarea"));
And output comes is :
<textarea id="wsj_autocomplete_template" style="display:none">
<div>
<div class="acHeadline hidden" >
</div>
<div class="dropdownContainerClass">
<div class="suggestionblock hidden" templateType="C1">
....
...
..
Certainly it is selecting the textarea, but it is not able to parse the inner elements, possibly because the content is escaped (&lt; instead of <). Is there any workaround for this?
If you want to remove the entire textarea tag, use doc.select("textarea").remove();. Or, if you want the content of the textarea, use doc.select("textarea").text(). Note that here I'm using the text() method instead of the toString() or html() methods; this gives the exact text rather than HTML escape codes.
Again, if you want to manipulate this HTML, you can parse it again, like Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
Example
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class WSJParser {
    public static void main(String[] args) {
        String url = "http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y";
        try {
            Document doc = Jsoup.connect(url).get();
            //doc.select("textarea").remove(); // Removes the entire textarea tag
            Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
            System.out.println(textareaDoc);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
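The text()-then-reparse trick can be demonstrated without hitting the WSJ page. Here is a small sketch on an inline fragment; the acHeadline class is borrowed from the output pasted in the question, and the content is an invented example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TextareaSketch {
    public static void main(String[] args) {
        // Textarea content is escaped markup: text() (not html()) unescapes it.
        String page = "<textarea>&lt;div class=\"acHeadline\"&gt;Hello&lt;/div&gt;</textarea>";
        Document doc = Jsoup.parse(page);
        String inner = doc.select("textarea").text();
        // Re-parse the unescaped string to get a navigable DOM.
        Document textareaDoc = Jsoup.parseBodyFragment(inner);
        System.out.println(textareaDoc.select("div.acHeadline").text()); // Hello
    }
}
```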
If I understand correctly, what you want is this:
Elements textareas = Jsoup.connect(url1).get().select("textarea");
for (Element textarea : textareas) {
    Elements elements = textarea.select("*");
    for (Element element : elements) {
        System.out.println(element.ownText());
    }
}
How can I find the table id of the large table at the following URL: http://en.wikipedia.org/wiki/States_and_territories_of_India
I was able to see the classes wikitable sortable jquery-tablesorter.
This is the table which has the list of states in India. I was able to confirm from Firebug that this table (wikitable sortable jquery-tablesorter) holds the list of states. How can I get the ID of that table?
What is the CSS equivalent to get all the names in that table?
I want to get only the states... the first column. I am using Jsoup.
If this is still a pending issue, here is how you can get the list of states in India:
public static void main(String[] args) throws IOException {
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/States_and_territories_of_India").get();
    Elements tables = doc.select("table");
    for (Element table : tables) {
        Element tableCaption = table.getElementsByTag("big").first();
        if (tableCaption != null && tableCaption.text().equals("States of India")) {
            Document statesDoc = Jsoup.parse(table.toString());
            Elements states = statesDoc.select("tr td:eq(0)");
            for (Element state : states) {
                // \d+ also strips multi-digit citation markers like [12]
                System.out.println(state.text().replaceAll("\\[\\d+\\]", ""));
            }
        }
    }
}
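The td:eq(0) selector used above keeps only cells at element sibling index 0, i.e. the first column. A small sketch on an inline stand-in table (the real wiki table has more columns and citation markers):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstColumnSketch {
    public static void main(String[] args) {
        // Minimal stand-in for the wiki table.
        String html = "<table class=\"wikitable\">"
                + "<tr><td>Andhra Pradesh</td><td>Hyderabad</td></tr>"
                + "<tr><td>Assam</td><td>Dispur</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        // td:eq(0) matches only the cell at sibling index 0 in each row.
        for (Element state : doc.select("table.wikitable tr td:eq(0)")) {
            System.out.println(state.text());
        }
    }
}
```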
There is no ID on that table. If you want to get the content of the table which has the class "wikitable", use Jsoup with this code:
package com.main;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String args[]) {
        Document doc;
        try {
            doc = Jsoup.connect("http://en.wikipedia.org/wiki/States_and_territories_of_India").get();
            Elements newsHeadlines = doc.select("table.wikitable").get(0).select("td:eq(0) a");
            System.out.println(newsHeadlines.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
So it looks like you're trying to screen-scrape this table.
The answer to your question is that there is no id on that particular <table>.
The html that starts the table is:
<table class="wikitable sortable jquery-tablesorter" style="width:70%;">
As you can see there is no id attribute for that element.
What libraries are you using to parse the HTML? In JavaScript you could use document.getElementsByClassName('wikitable')[0] and find that table uniquely on the page. But the syntax you would use depends on what kind of HTML DOM traversal is available to you.
The id element is optional; not every element on a page will have one. This table doesn't.
Using jQuery, you want the first table with the classes wikitable sortable jquery-tablesorter:
$(".wikitable.sortable.jquery-tablesorter").first()
Although the CSS classes could change at any time, so I wouldn't rely on that. It might be worth asking someone who can edit the wiki page to add an id to all the tables.