Screen Scraping Using Jsoup to Extract Sentences - java

I want to do some screen scraping, and after doing a little research, it appears that Jsoup is the best tool for this task. I want to be able to extract all the sentences on a web page; for example, given this Wikipedia page, http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping, I want to get all the sentences on that page and print them out to the console. I'm still not familiar with how Jsoup works, though, so if somebody could help me out that would be greatly appreciated. Thanks!

First download Jsoup and include it in your project. Then the best place to start is the Jsoup cookbook (http://jsoup.org/cookbook/) as it provides examples for the most common methods you will use with Jsoup. I recommend that you spend some time working through those examples to familiarize yourself with the API. Another good resource is the javadocs.
Here is a quick example to pull some text from the Wikipedia link you provided:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
// Download the HTML and parse it into a Document (throws IOException on failure)
Document doc = Jsoup.connect(url).get();
// Select all <p> elements from the document
Elements paragraphs = doc.select("p");
// For each selected <p> element, print out its text
for (Element e : paragraphs) {
    System.out.println(e.text());
}
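The snippet above prints whole paragraphs. If you specifically want sentences, one option (my addition, not something the answer above covered) is the JDK's java.text.BreakIterator, which does locale-aware sentence splitting:
import java.text.BreakIterator;
import java.util.Locale;

// Split one paragraph's text into sentences; paragraphText stands in
// for e.text() from the loop above.
String paragraphText = "First sentence. Second one? A third!";
BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
it.setText(paragraphText);
int start = it.first();
for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
    System.out.println(paragraphText.substring(start, end).trim());
}
BreakIterator's splitting is heuristic, but for ordinary English prose it handles abbreviations and punctuation far better than splitting on periods.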

Related

How to parse result by Custom Search Engine

I'm using Jsoup to parse data from a website, but I don't know how to parse the search results produced by a CSE (custom search engine).
From each search result, I want to get the image, title, link, and description. If you know how, please suggest a solution!
Search link:
http://www.truyenngan.com.vn/tim-kiem.html?q=love&cx=000993172113723111222%3Auprumhk-rde&cof=FORID%3A11&ie=UTF-8&siteurl=www.truyenngan.com.vn%2F&ref=&ss=419j62441j4
When you parse page.asXml() (e.g. HtmlUnit's HtmlPage.asXml()), you get the full rendered page source, including content injected by JavaScript. You then need to apply some parsing logic: the links will sit inside a particular div, class, or id, so you can fetch them by looping over the matching elements.
Document doc = Jsoup.parse(page.asXml());
Elements elements = doc.getElementsByTag("div"); // or target a specific id/class with doc.select(...)
Iterate over the elements to extract the value of each link and description.
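A minimal sketch of that flow, assuming HtmlUnit supplies the rendered page (CSE results are injected by JavaScript, so a plain Jsoup fetch won't see them); the gs-* class names are guesses at the CSE markup, so inspect the live page to confirm them:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Fetch with HtmlUnit so the JavaScript-rendered results are present
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
HtmlPage page = webClient.getPage("http://www.truyenngan.com.vn/tim-kiem.html?q=love");
// Hand the rendered markup to Jsoup for selection
Document doc = Jsoup.parse(page.asXml());
for (Element result : doc.select("div.gs-webResult")) {      // assumed class name
    String title = result.select("a.gs-title").text();       // assumed class name
    String link = result.select("a.gs-title").attr("href");
    String snippet = result.select("div.gs-snippet").text(); // assumed class name
    System.out.println(title + " -> " + link + "\n" + snippet);
}
webClient.close();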
Alternatively, use the Google Custom Search API, which returns the results already parsed as JSON.

How to integrate a part of one html website into java program?

Given an HTML website which displays a temperature outside, among other unimportant pieces of information:
<div style="">15</div>
15 is my target number, which I want to extract into a variable.
What I want is for the Java program to go to the website, search for that particular line of HTML (temperature=15;), and, once it is found, display it like this: http://i.stack.imgur.com/lY0qi.jpg
All I want to know is what syntax I should use to let the program request that number.
Extracting information from a website is called crawling or scraping.
You basically go to the web site, get the HTML source, and search it for your element. You can search with a regular expression or (more commonly) with a parser like Jsoup.
You will find a lot of working examples on the official site of Jsoup (e.g. http://jsoup.org/cookbook/extracting-data/example-list-links). Jsoup will parse the HTML source into a DOM-like structure with elements and nodes. You can search for specific nodes, e.g. for all DIV elements. Then you can iterate over them and get your temperature.
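For example, a minimal Jsoup sketch; the URL and the div#temperature selector are placeholders, since you would need to inspect the real page for a stable id or class:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Placeholder URL and selector -- adapt both to the real page's markup
Document doc = Jsoup.connect("http://example.com/weather").get();
Element tempDiv = doc.select("div#temperature").first();
if (tempDiv != null) {
    int temperature = Integer.parseInt(tempDiv.text().trim());
    System.out.println("Outside temperature: " + temperature);
}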
Tools that extract information from the web are called scrapers. There are many Java APIs that let you write your own scraper; you can try Jsoup, HtmlUnit, or Jaunt.

Google API, Java: reading data from a webpage

Can anybody help me with reading data from a Google web page? For example, I want to read the links, the author names below the links, and the PDF or HTML links on the right side into my database using Java.
Please find the link here:
http://scholar.google.com/scholar?hl=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp=
What you're asking about is called data extraction. You need to load the HTML page and then logically select the pieces of information from the HTML.
Start by using an HTML parser to read the HTML page, and then look for patterns in how Google lays out its Scholar links. You might find that things are listed in an unordered list, or maybe certain elements have an identifying tag or class that you can use to extract the data you want.
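For instance, a rough Jsoup sketch; the div.gs_r, h3.gs_rt, and div.gs_a selectors are assumptions from one snapshot of the Scholar markup, and Google may block automated requests, so treat this only as a starting point:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Selectors below are assumptions -- inspect the live page, since Google
// changes class names and throttles non-browser clients.
Document doc = Jsoup.connect("http://scholar.google.com/scholar?hl=en&q=visualization")
        .userAgent("Mozilla/5.0") // Scholar may reject requests without a browser-like UA
        .get();
for (Element result : doc.select("div.gs_r")) {
    String title = result.select("h3.gs_rt").text();
    String authors = result.select("div.gs_a").text();
    System.out.println(title + " | " + authors);
}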

Page scrape for a particular div

I am wondering if there is a way to read the html output of a given webpage using Java?
I know in php you can do something like:
$handle = @fopen("http://www.google.com", "r");
$source_code = fread($handle, 9000);
I am looking for the Java equivalent.
Additionally, once I have the rendered HTML, are there any Java utilities that would allow me to strip out a single div by its id?
Thanks for any help with this.
Use jsoup.
You have the choice between a tree model and a powerful query syntax similar to CSS or jQuery selectors, plus utility methods to quickly get the source of a webpage.
To quote from their website:
Fetch the Wikipedia homepage, parse it to a DOM, and select the
headlines from the In the news section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Once you have found the Element representing the div you want to remove, just call remove() on it.
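A minimal sketch of both steps; the URL and the "sidebar" id are placeholders for your actual page and div:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the page and grab a single div by its id
Document doc = Jsoup.connect("http://www.google.com").get();
Element div = doc.getElementById("sidebar"); // placeholder id
if (div != null) {
    System.out.println(div.outerHtml()); // the div's own markup
    div.remove(); // or strip it out of the document instead
}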

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h', then 't', then 't', then 'p', but I don't think that is a great solution.
Any good ideas?
Added: I'm looking for some kind of pseudocode, but, just in case, I'm using Java for this particular project.
Try using an HTML parsing library, then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); // 'input' is a File containing the HTML
Elements links = doc.select("a[href]"); // a with href
Not all URLs are in <a> tags, though; some are plain text, and some are in <link> or other tags.
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page; i.e., you could end up with 'links' to JS scripts in the page, for example.
Best way is still that you use a tool made for the job.
You should grab the HTML tags most likely to have links inside them (say <h1>, <p>, <div>, etc.). HTML parsers provide regex-like functionality for filtering the content of those tags, similar to your 'starts with http' logic.
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")
See: jSoup.
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
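A sketch of that two-pass idea (a DOM pass for <a href>, then a regex pass for bare URLs in the text); the https?://\S+ pattern and the example HTML are my own loose placeholders:
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Pass 1: collect href attributes via the DOM.
// Pass 2: regex-scan the visible text for bare http(s):// URLs.
String html = "<p>See <a href='http://example.com/a'>this</a> or http://example.com/b</p>";
Set<String> urls = new LinkedHashSet<>();
Document doc = Jsoup.parse(html, "http://example.com/"); // base URI resolves relative hrefs
for (Element a : doc.select("a[href]")) {
    urls.add(a.attr("abs:href"));
}
Matcher m = Pattern.compile("https?://\\S+").matcher(doc.text());
while (m.find()) {
    urls.add(m.group());
}
System.out.println(urls);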
The best way may be to search for existing URL regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()+,;=:#]|%[0-9A-F]{2})))(?:\?((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})))?(?:#((?:[a-z0-9\-._~!$&'()+,;=:\/?#]|%[0-9A-F]{2})*))?$/i
found in a Hacker News article. As far as I can follow it, it looks good, but as far as I know there is no formal regex for this problem. So the best solution is to find a few candidates and test which one matches most of what you want.
