Scraping Data. Save File? - java

I am trying to scrape data from a website which uses JavaScript to load much of its content. Right now I am using jsoup to parse HTML pages; however, since much of the content is loaded via JavaScript, I haven't been able to parse the data I want.
How should I go about getting this JavaScript-loaded content? Should I first save the page, then load and parse it using jsoup? If so, what should I use to render the JavaScript content before I save? Is there an API you would recommend that could output the resulting HTML?
I am currently using Java.

You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context - among other things, you can define a "ready" function for the page and wait to scrape until the function (which might check for the existence of certain DOM elements, etc) returns true.
The other option, depending on the page, is to use a console like Firebug to figure out what data is being loaded (i.e. what URLs are being retrieved by the AJAX calls on the page), and scrape the data directly from those URLs.
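For the second approach, once Firebug has revealed the underlying URL, the scrape reduces to a plain HTTP fetch. A minimal Java sketch (the endpoint URL here is hypothetical):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL endpoint = new URL("http://example.com/api/data?id=42"); // hypothetical AJAX endpoint
BufferedReader in = new BufferedReader(new InputStreamReader(endpoint.openStream()));
StringBuilder payload = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    payload.append(line);
}
in.close();
// payload now holds the raw response (often JSON) that the page loads via AJAX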

If the data is generated by JavaScript that ships with the page, the raw values are often already present in the downloaded source (for example, inside a script block).
It is better to parse them directly on the fly, as you do with plain HTML or text parsing.
If you cannot isolate the tokens with the jsoup API, just parse them using direct String operations, as plain text.
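For example, if the value sits in a script block, a plain-text extraction might look like this (the variable name "price" and its pattern are made-up examples):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// pageSource holds the downloaded HTML as a String
Pattern p = Pattern.compile("var\\s+price\\s*=\\s*(\\d+)");
Matcher m = p.matcher(pageSource);
if (m.find()) {
    String price = m.group(1); // e.g. "100"
}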

I tried using HtmlUnit; however, I found it very slow.
I ended up invoking the curl command-line tool from within Java, which worked for my purposes:
String command = "curl " + url;
Process p = Runtime.getRuntime().exec(command);
StringBuilder html = new StringBuilder();
BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
String s;
while ((s = stdInput.readLine()) != null) {
    html.append(s).append("\n");
}
stdInput.close();
return html.toString();

Related

How to read dynamic website content in java

As per the HTML, the source code is:
{result.data}
When the URL is requested, result.data is set to 100 and I am able to see the value 100 in the browser. Whereas when I execute the Java program with the same URL request, I am unable to see the value that I saw in the browser.
URL url = new URL(site);
url.openConnection(); // etc.
I want to get the same content through the Java program as I see in the browser.
Your question is not very descriptive, but I guess you are trying to scrape data from the site.
You can use the following libraries for this task:
Jaunt (http://jaunt-api.com)
Jsoup (http://jsoup.org/cookbook/extracting-data/dom-navigation)
HtmlUnit (http://htmlunit.sourceforge.net)
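Of the three, HtmlUnit is the one that actually executes JavaScript, which matters here since result.data is filled in by a script. A rough sketch (site is the URL string from your code; the 5-second wait is an arbitrary choice):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient client = new WebClient()) { // AutoCloseable in recent HtmlUnit versions
    client.getOptions().setJavaScriptEnabled(true);
    HtmlPage page = client.getPage(site);
    client.waitForBackgroundJavaScript(5000); // let the on-load AJAX finish
    System.out.println(page.asXml());         // the DOM after scripts have run
}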
From what I understand, you want to do one of the following things:
Instead of reading the result line by line, you want to parse it as XML so as to traverse the div(s) and other HTML tags.
For this purpose I would suggest the jsoup library (a short sketch follows this list).
When you hit the URL www.abcd.com/number=500 in a browser, it loads an empty div, and on load it fetches data from somewhere; you want to fetch this on-load data using Java?
For this, there must be some JS in the resulting page which fetches the data by hitting some service on page load. You will need to look through the page to find the service details, and instead of hitting this URL (www.abcd.com/number=500) you will need to hit that service to get the data.
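For the first case, a minimal jsoup sketch (the URL comes from the question; the div id is a made-up example):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://www.abcd.com/number=500").get();
Element div = doc.select("div#result").first(); // "result" is a hypothetical id
if (div != null) {
    System.out.println(div.text());
}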

using jsoup to parse html but not follow/fetch links

What is the "correct" way to use jsoup to parse an HTML string or stream without fetching external data for link/img/area/iframe (and whatever other) tags? Right now I am doing something like this after I fetch a page using Apache HttpComponents:
HttpEntity entity = response.getEntity();
InputStream is = entity.getContent();
Document doc = Jsoup.parse(is, null, "");
Which actually works fine. But passing the baseUri as an empty string just feels wrong, because I am betting jsoup tries to use it, only to fail and move on. I only want to use jsoup as an HTML parser and DOM-manipulation kit, not an HTTP framework. I am also a bit worried that jsoup might try to look up href="/foo"-style resources in the current directory or something. What does it do with an empty string? I tried passing null as the baseUri, which would be a natural interface for doing what I want, but it dies with an IllegalStateException.
Is there a way to do this, or am I worried about nothing?
... I don't think that jsoup does that. The baseUri parameter is just for the canonicalization of relative URLs; what you do with them is your responsibility. jsoup will not by itself try to access resources.
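If it helps, here is a minimal sketch (the markup is made up) showing that the baseUri is only consulted when you explicitly ask for an absolute URL:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse("<a href='/foo'>foo</a>", "http://example.com/");
Element a = doc.select("a").first();
String rel = a.attr("href");   // "/foo" - stored verbatim, nothing is fetched
String abs = a.absUrl("href"); // "http://example.com/foo" - resolved only on request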

Any command line utility (like wget,curl etc) and/or java methods to get meta data of a shortened URL?

I have a stream of shortened URLs received from Twitter feeds. I don't need the entire page at each URL, just basic meta information such as the expanded original URL, the page title, timestamps, etc. I can get the entire page containing this meta data using curl or wget, but is there a quicker way to get only the meta data? Also, are there any Java classes/methods to do this, like curl?
HTTP HEAD requests may be what you are looking for. Here is an example that uses curl (via its Python binding, though): Python: Convert those TinyURL (bit.ly, tinyurl, ow.ly) to full URLS
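In Java you can issue a HEAD request with plain HttpURLConnection; a minimal sketch (shortUrl is a placeholder for one of the URLs from the feed):
import java.net.HttpURLConnection;
import java.net.URL;

URL url = new URL(shortUrl);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD");
conn.setInstanceFollowRedirects(false);   // inspect the redirect rather than follow it
String expanded = conn.getHeaderField("Location"); // the expanded URL on a 301/302
conn.disconnect();
Note that the page title is not in the headers; for that you would still have to fetch at least part of the expanded page.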

Java getting source code from a website

I have a problem once again where I can't find the source code because it's hidden or something... When my Java program indexes the page it finds everything but the info I need... I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I run the code below:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine); // inspect each line of the raw source
}
in.close();
I really have no idea how to attempt to get the info that I need...
The reason for this behavior is probably that those tags are dynamically injected into the DOM using JavaScript and are not part of the initial HTML, which is what you can fetch with a URLConnection. They might even be created using AJAX. You will need a JavaScript interpreter on your server if you want to fetch those.
If they don't show up in the page source, they're likely being added dynamically by JavaScript code. There's no way to get them from your server-side script short of including a JavaScript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
Try using jsoup:
Document doc = Jsoup.parse(new URL("my url"), 10000); // 10-second timeout
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using JavaScript, the following SO question is pertinent:
What's a good tool to screen-scrape with Javascript support?

Are there any java libraries for validating user supplied HTML, on the server side?

I have a service which takes user-supplied rich text (it can contain HTML tags) and saves it into the database. That data gets used by some other application. But sometimes the user-supplied data has missing HTML tags or wrong closing tags. I want to validate whether the user-supplied data is valid HTML and, depending on that, warn the user.
Are there any Java libraries to do HTML validation?
You can try JTidy, but it's too slow if all you need is simple HTML cleaning.
If you just want to process HTML, you can try NekoHTML; it's lightweight and fast.
You can try JTidy.
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer.
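A minimal sketch of how a JTidy-based check might look (assuming the user-supplied markup is in a String called input):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy();
tidy.setQuiet(true);          // suppress console messages
tidy.setShowWarnings(false);
tidy.parse(new ByteArrayInputStream(input.getBytes("UTF-8")),
           new ByteArrayOutputStream());     // discard the cleaned markup
boolean valid = (tidy.getParseErrors() == 0); // no structural errors found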
You can use jsoup (see the project README). Here is an example:
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
...
String markup = "<body><head>...";
boolean valid = Jsoup.isValid(markup, Whitelist.relaxed());
You can pass any Whitelist object (for example Whitelist.basic() or Whitelist.relaxed()) as the second parameter to isValid; it defines which tags and attributes are considered acceptable. Note that passing null is not allowed, and that isValid is a whitelist/safety test rather than a full syntax check.
Plus, you can easily add this library as a dependency using Gradle.
Validator.nu, which implements the HTML5 spec, is worth a look, IMO.
There's a great thing called NekoHTML which is just a thin wrapper over the Apache Xerces parser that turns on error-recovery/correction. It doesn't validate so much as error-correct, so you can process the result as XML, i.e. run it through XPaths or XSLTs. It has worked flawlessly for me for several months on completely arbitrary HTML from 3rd-party sites.
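A minimal sketch of the error-correcting parse described above (html stands for the raw third-party markup):
import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(html)));
Document doc = parser.getDocument(); // a DOM you can run XPath or XSLT over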
