Advice on crawling website content - Java

I am trying to crawl some website content using a combination of jsoup and Java, save the relevant details to my database, and repeat the same activity daily.
Here is the problem: when I open the website in a browser I get the fully rendered HTML (with all the element tags in place). The JavaScript part works just fine when I test it (it is the part I'm supposed to use to extract the correct data).
But when I do a parse/get with jsoup (from a Java class), only the initial HTML of the website is downloaded for parsing. In other words, some parts of the website are dynamic, and since they are rendered after the initial GET, asynchronously in the browser, I'm unable to capture that data with jsoup.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd appreciate your advice.

You need to check first whether the website you're crawling requires any of the following to show all of its content:
Authentication with login/password
Some sort of session validation via HTTP headers
Cookies
Some sort of time delay to load all the content (sites heavy on JavaScript libraries, CSS and asynchronous data may need this)
A specific User-Agent string
A proxy password if, for example, you're inside a corporate network security configuration
If anything on this list is needed, you can supply that data as parameters to your jsoup.connect() call (see the sketch below). Please refer to the official doc:
http://jsoup.org/cookbook/input/load-document-from-url
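As a rough illustration of what that can look like (the URL, header values, cookie name and timeout below are placeholders, not anything specific to your site):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupConnectSketch {
    public static void main(String[] args) throws IOException {
        // Placeholders throughout: swap in whatever your target site actually requires.
        Document doc = Jsoup.connect("http://example.com/data")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64)") // a specific User-Agent
                .header("Accept-Language", "en-US")               // extra HTTP headers
                .cookie("JSESSIONID", "your-session-id")          // cookies, e.g. from a prior login
                .timeout(30 * 1000)                               // generous timeout for slow pages
                .get();
        System.out.println(doc.title());
    }
}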

Related

How to control the life cycle of eclipse browser and monitor its HTML content change

The scenario is that I would like to launch a browser pointing to some web pages from an Eclipse plugin, monitor content changes in the HTTP/S messages, and perform corresponding operations.
The corresponding operations might look like fetching the raw fields of the HTTP/S messages from the browser.
For example, when a user performs some operation that triggers an AJAX call and the title or other fields of the HTML page change, I would like to know that this is happening and fetch the "raw" content from the body of the HTTP/S message.
I found some ways to launch the browser. The first is to use the SWT browser (org.eclipse.swt.browser.Browser). However, I don't see it exposing any listener API to monitor this kind of change, let alone fetch the raw content of HTTP/S messages.
The second is org.eclipse.ui.browser.IWebBrowser, but I don't see any suitable API exposed by IWebBrowser either.
Does anyone know how to achieve this? Thanks for your help.
There are no APIs for this.
The Browser class does have listeners for changes to the location (addLocationListener) and the page title (addTitleListener), but these are fairly limited.
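For what it's worth, a minimal sketch of those two listeners (plain SWT, nothing Eclipse-plugin specific) looks like this; Browser.getText() can give you the HTML of the current page, but the raw HTTP/S messages are still not exposed:

import org.eclipse.swt.SWT;
import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.browser.LocationAdapter;
import org.eclipse.swt.browser.LocationEvent;
import org.eclipse.swt.browser.TitleEvent;
import org.eclipse.swt.browser.TitleListener;
import org.eclipse.swt.layout.FillLayout;
import org.eclipse.swt.widgets.Display;
import org.eclipse.swt.widgets.Shell;

public class BrowserListenerSketch {
    public static void main(String[] args) {
        Display display = new Display();
        Shell shell = new Shell(display);
        shell.setLayout(new FillLayout());
        Browser browser = new Browser(shell, SWT.NONE);

        // Fires whenever the browser navigates to a new location.
        browser.addLocationListener(new LocationAdapter() {
            @Override
            public void changed(LocationEvent event) {
                System.out.println("Location changed: " + event.location);
            }
        });

        // Fires whenever the page title changes, including changes made by script
        // after an AJAX call; browser.getText() would return the current HTML here.
        browser.addTitleListener(new TitleListener() {
            @Override
            public void changed(TitleEvent event) {
                System.out.println("Title changed: " + event.title);
            }
        });

        browser.setUrl("http://example.com");
        shell.open();
        while (!shell.isDisposed()) {
            if (!display.readAndDispatch()) {
                display.sleep();
            }
        }
        display.dispose();
    }
}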

How to parse HTML source code without getting the entire source code

I am interested in extracting a particular div from the source code of a website. I am able to do this using JSoup, by getting the entire source code with
Document doc = Jsoup.connect("http://example.com").get();
Element div = doc.getElementById("importantDiv");
However, the problem is that I need to do this about 20,000 times a day to be able to catch all the changes happening in the div. Creating the whole document every time would use a lot of network bandwidth, which I would like to avoid. Is there a way to extract the required element without re-creating the entire document on the client side?
NOTE: The code snippet is an example; it is not the actual URL or ID I need to extract.
I don't believe you can request specific portions of a web page. JSoup is basically a web client class, and the web client has no control over what the server sends it. The server is the one that dictates what is sent, so you can't really request a segment of a webpage without requesting the entire web page.
Do you have access to this webpage, or is it an external website?
If you don't have control of the server side, you cannot do it. You will need to download the complete HTML. Note, though, that it's just the HTML, not the rest of the resources like stylesheets, images, JavaScript, etc.
To save bandwidth you would need to install some code on the server, so that it serves just the bits of information required; see the sketch below for the idea.
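If you do control that server, the server-side piece could be as small as a servlet that renders only the fragment in question (a sketch; ImportantDivServlet and renderImportantDiv() are made-up names):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ImportantDivServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Serve only the bit of markup the client cares about, not the whole page.
        resp.setContentType("text/html");
        resp.getWriter().print(renderImportantDiv());
    }

    private String renderImportantDiv() {
        // Stand-in for whatever actually produces the div's current contents.
        return "<div id=\"importantDiv\">...current contents...</div>";
    }
}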
Take a look at the URLConnection class. You can use it to open a connection to a URL, get the connection's input stream, and read only as many bytes as you need. This works and you won't have to download the entire document, but unfortunately you can't start the download from an offset: you always have to read the document from its beginning.
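A rough sketch of that, reading only the first 8 KB of the response:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class PartialReadSketch {
    public static void main(String[] args) throws IOException {
        URLConnection connection = new URL("http://example.com").openConnection();
        try (InputStream in = connection.getInputStream()) {
            byte[] buffer = new byte[8192]; // read at most the first 8 KB
            int total = 0;
            int read;
            while (total < buffer.length
                    && (read = in.read(buffer, total, buffer.length - total)) != -1) {
                total += read;
            }
            // Only the beginning of the document; the rest is never downloaded.
            System.out.println(new String(buffer, 0, total, "UTF-8"));
        }
    }
}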

Is it possible to get cookies from a Flash application using Java?

I'm trying to get data from a website, but first I need to log in to the site using Java. The script worked until now, but the site has installed an anti-bot system. Until now the procedure was simple: I created an HttpStreamWriter and submitted my details to the login.php page, then got the cookies; later, when I wanted to get data from the site, I resubmitted the cookie from the login.php page. But now there is a problem: an anti-bot system.
I'm not sure, but I think this is the system:
https://github.com/yuri-gushin/Roboo/blob/master/Roboo.pm
The anti-bot system creates a cookie called anti-bot, and I can't access the page without that cookie. The problem is that the cookie is generated by a Flash application only after the page loads, so I can't get the cookie from the page.
Any ideas how to "hack" this ? Thanks!
What you need is cookie extraction; here is how to do it (see also the Oracle site).
That is, you need to connect to the site and look through the response headers until you find Set-Cookie. Once you have the right HTTP header, you'll be able to parse it very easily.
After that, you'll have to set the cookie back on your subsequent requests.
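A minimal sketch of that flow with plain HttpURLConnection (the URLs and paths are placeholders):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class CookieRelaySketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical URLs; the site, paths and cookie layout are placeholders.
        HttpURLConnection login = (HttpURLConnection) new URL("http://example.com/login.php").openConnection();
        login.connect();

        // Browse the response headers until Set-Cookie and keep the name=value parts.
        StringBuilder cookies = new StringBuilder();
        for (Map.Entry<String, List<String>> header : login.getHeaderFields().entrySet()) {
            if ("Set-Cookie".equalsIgnoreCase(header.getKey())) {
                for (String setCookie : header.getValue()) {
                    String nameValue = setCookie.split(";", 2)[0]; // drop Path, Expires, etc.
                    if (cookies.length() > 0) {
                        cookies.append("; ");
                    }
                    cookies.append(nameValue);
                }
            }
        }

        // Set the cookies back on the next request.
        HttpURLConnection data = (HttpURLConnection) new URL("http://example.com/data.php").openConnection();
        data.setRequestProperty("Cookie", cookies.toString());
        data.connect();
        System.out.println("Response code: " + data.getResponseCode());
    }
}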
Edit
Flash cookies, or Local Shared Objects, are stored in AMF. AMF can be used to store anything; the problem with your use case is that you don't know which values (or maybe class instances) have been serialized...
However, you could try AMFConnection to retrieve the information (it will take time, at least to pull in all the necessary libs). But I wouldn't bet on that.
Could you contact the webmaster to get some information about it? Or doesn't this website have a login API?

How to enable downloading of dynamically generated files from a browser?

I have a web application in which, in one of the workflows, users can download files that are dynamically generated. The input is a form with the parameters needed to generate the file.
My current solution is to let them submit this form; on the servlet side I set the Content-Disposition response header to attachment and also provide an appropriate mime-type.
But I find this approach inadequate, because the file generation can take a very long time; in such cases, after a certain timeout, I get 500 or 503 errors directly in the browser. I guess this is to be expected with the current approach.
I want my workflow to be flexible enough to tell users, as soon as they submit the form, that it might take a while to generate the file, and that we will display a link to the file as soon as it is ready. I guess I could also email the file or this message to them, but that is not ideal.
Can you suggest an approach for this problem? Should I be more specific in providing information? Any help appreciated.
If you want to do this synchronously (i.e. make the user wait for the document to be ready rather than have them go off and do other things while waiting) a traditional approach is to bring them to a "report loading" page.
This would be a page that:
1) informs them that the report is loading.
2) refreshes itself (either using the meta refresh tag or javascript)
3) upon refresh, checks to see if the report is ready and either:
a) goes back to step 1 if it isn't ready
b) gives them the document if it is ready.
Synchronous is kind of old-school, but your question sounded like that was the approach you wanted.
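A rough servlet sketch of that loading page (ReportStore is a made-up helper that knows whether a given report has finished generating and can stream it out):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ReportLoadingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String reportId = req.getParameter("id");
        if (ReportStore.isReady(reportId)) {
            // Step 3b: the document is ready, hand it over as a download.
            resp.setContentType("application/pdf"); // assumed mime-type
            resp.setHeader("Content-Disposition", "attachment; filename=report.pdf");
            ReportStore.writeTo(reportId, resp.getOutputStream());
        } else {
            // Steps 1-3a: say it is loading and refresh again in a few seconds.
            resp.setContentType("text/html");
            resp.getWriter().println(
                "<html><head><meta http-equiv='refresh' content='5'></head>"
              + "<body>Your report is being generated, please wait...</body></html>");
        }
    }
}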
Asynchronous approaches would include:
Use Ajax to make a link to the document appear on the page once it is ready.
Have a separate page that shows previously generated documents. The user can go to this page at their leisure, and, meanwhile, they can browse the rest of the site. This requires keeping a history of generated documents.
As you suggested, send it via e-mail.
You can make an asynchronous Ajax call to the server with the form data instead of submitting the form directly.
On the server you create a temp file and return a link to the client with the download URL.
After submitting the request via JavaScript you can show the user a hint that the download link will appear in a minute. Don't forget to clean up the temp file!
For making the Ajax call I would suggest using a JavaScript framework. Have a look at jQuery:
http://api.jquery.com/category/ajax/
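On the server side, the endpoint receiving that Ajax call could look roughly like this (a sketch; FileGenerator and the /downloads/ mapping are assumptions, and the temp files still need to be cleaned up later):

import java.io.File;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GenerateFileServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Generate into a temp file; the page is not blocked while this request runs.
        File tempFile = File.createTempFile("download-", ".pdf");
        FileGenerator.generate(req.getParameterMap(), tempFile);

        // Return the download URL so the page can show the link when the call completes.
        resp.setContentType("application/json");
        resp.getWriter().print("{\"url\": \"/downloads/" + tempFile.getName() + "\"}");
    }
}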

Making AJAX Applications Crawlable? How to build a simple web service on Google App Engine to produce HTML Snapshots?

Real World Problem:
I have my app hosted on Heroku, which (to my knowledge) cannot offer a way to run a headless (GUI-less) browser such as HTMLUnit for generating HTML snapshots so Googlebot can index my AJAX content.
My Proposed Solution:
If you haven't already, I suggest reading Google's Full Specification for Making AJAX Applications Crawlable.
Imagine I have:
a Sinatra app hosted on Heroku on the domain http://example.com
the app has tabs along the top of the page TabA, TabB and TabC
under each tab is SubTab1, SubTab2, SubTab3
onload if the url is http://example.com#!tab=TabA&subtab=SubTab3 then client-side Javascript takes the location.hash and loads in TabA, SubTab3 content via AJAX.
Note: the Hash Bang (#!) is part of the google spec.
I would like to build a simple "web service" hosted on Google App Engine (GAE) that:
Accepts a URL param e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Runs HTMLUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and run the client-side JavaScript on the server.
HTMLUnit returns the DOM once everything is complete (or something like 45 seconds has passed).
The returned content could be sent back via JSON/JSONP, or alternatively a URL could be returned to a file generated and stored on the Google App Engine server (for file-based "cached" results)... open to suggestions here. If a URL to a file were returned, you could cURL it to get the source code (aka an HTML snapshot).
My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com... basically:
Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
Send request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Render the returned HTML Snapshot to the frontend.
Google Indexes the content and we rejoice!
I don't have any experience with Google App Engine or Java or HTMLUnit.
I might be able to figure it out... and will post my results if I do.
Otherwise I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.
This would introduce more people to the excellent (and free!) Google App Engine. It would also undoubtedly encourage more people to adopt Google's specs for crawlable AJAX content... something we can all benefit from!
As Google's specification gains more acceptance the "hurdle" of setting up a Headless Browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).
Hit me up on twitter #_chrisjacob if you would like to discuss solutions.
I have successfully used HTMLUnit on AppEngine. My GWT code for doing this is available in the gwt-platform project; the results I got were similar to those of the HTMLUnit-AppEngine test application by Amit Manjhi.
It should be relatively easy to use GWTP's current HTMLUnit support to do exactly what you describe, although you could likely do it in a simpler app. One problem I see is that AppEngine requests have a 30-second timeout, so you can't have a page that takes HTMLUnit longer than that to process.
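For reference, the snapshot step itself with plain HTMLUnit looks roughly like this (a sketch assuming a recent HtmlUnit release where WebClient is AutoCloseable; the GWTP support mentioned above wraps something similar):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlSnapshotSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // Load the page and let the client-side JavaScript run.
            HtmlPage page = webClient.getPage("http://example.com#!tab=TabA&subtab=SubTab3");
            webClient.waitForBackgroundJavaScript(10000); // stay well under the 30 second limit

            // The rendered DOM serialized back to markup: the HTML snapshot.
            System.out.println(page.asXml());
        }
    }
}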
UPDATE:
It's been a while, but I finally closed the long standing issue about making GWT applications crawlable using GWTP. The documentation is not entirely there, but check out the issue:
http://code.google.com/p/gwt-platform/issues/detail?id=1
