I have a task where I need to auto-login and scrape a particular website.
I have seen people suggesting HtmlUnit and HttpClient, mostly with Java. HtmlUnit looks like a testing tool; I am not sure what to do with it. Is there an example that explains auto-login and web scraping with HtmlUnit or HttpClient?
I'm a Java developer. Can anyone who works closely with these please share any ideas?
Your problem can be broken down into two parts:
Log in to the website.
Scrape the data from the website.
So, for the first part:
Install the LiveHTTPHeaders Firefox add-on and then read all the HTTP headers that your browser sent and received while logging in.
Try to send those same headers from your Java code; essentially, you have to emulate an HTTP POST request in Java (a minimal sketch follows below). For more, google "make post request from java".
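For instance, a minimal sketch using HttpURLConnection from the JDK; the URL, header values, and form-field names are placeholders for whatever LiveHTTPHeaders shows for your target site:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class LoginPost {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://example.com/login"); // hypothetical login action
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // replay the headers you captured in the browser
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

            String body = "username=" + URLEncoder.encode("user", "UTF-8")
                        + "&password=" + URLEncoder.encode("secret", "UTF-8");
            OutputStream out = conn.getOutputStream();
            out.write(body.getBytes("UTF-8"));
            out.close();

            // the session cookie comes back in the response headers
            String cookie = conn.getHeaderField("Set-Cookie");
            System.out.println("Status: " + conn.getResponseCode() + ", cookie: " + cookie);
        }
    }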
After you have logged in to the website, scrape the data using the API of your choice.
I personally use HtmlCleaner.
To scrape data you can use XPath expressions with HtmlCleaner. Take a look at XPath + HtmlCleaner tutorials.
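For instance, a short HtmlCleaner + XPath sketch (the URL and the XPath expression are placeholders):

    import java.net.URL;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class Scrape {
        public static void main(String[] args) throws Exception {
            HtmlCleaner cleaner = new HtmlCleaner();
            // assumes you are already logged in, or the page is public
            TagNode root = cleaner.clean(new URL("https://example.com/data"));
            // XPath against the cleaned DOM; the expression is just an example
            Object[] titles = root.evaluateXPath("//div[@class='title']");
            for (Object o : titles) {
                System.out.println(((TagNode) o).getText());
            }
        }
    }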
You can also use Jsoup instead of HtmlCleaner. The advantage of Jsoup is that it can handle both the login (POST request) and the data scraping. Take a look here: http://pastebin.com/E0WzpuhF
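Along the same lines as that pastebin, a minimal Jsoup sketch covering both steps (URLs and field names are placeholders):

    import java.util.Map;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupLogin {
        public static void main(String[] args) throws Exception {
            // step 1: POST the login form and keep the session cookies
            Connection.Response login = Jsoup.connect("https://example.com/login")
                    .data("username", "user")
                    .data("password", "secret")
                    .method(Connection.Method.POST)
                    .execute();
            Map<String, String> cookies = login.cookies();

            // step 2: fetch the protected page with those cookies and scrape it
            Document doc = Jsoup.connect("https://example.com/data")
                    .cookies(cookies)
                    .get();
            System.out.println(doc.select("div.title").text());
        }
    }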
I know it seems like a lot of work. I have given you two alternative solutions to your problem, but divide the problem into smaller chunks and then try to solve it.
Related
I'm trying to automate a few things in my workplace, where we are not allowed to use the internet (only a few whitelisted websites).
Req: I have a form with a single text box and a single submit button. I have to put something in the text box and submit the form, then parse the HTML response and extract a specific piece of text. The pages are written in JSP.
Constraint: I don't have access to third-party libraries and have to work with Java 6.
Please point me in the right direction.
HttpURLConnection comes with Java by default. You may consider using this API; it provides most of the functionality of Apache HttpClient. Here is a simple example of how to use HttpURLConnection.
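A rough sketch under the stated constraints (Java 6, JDK classes only); the URL, the form-field name, and the <span id="result"> marker are assumptions to adapt to your actual JSP page:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class FormSubmit {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://intranet/form.jsp").openConnection(); // hypothetical URL
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            OutputStreamWriter w = new OutputStreamWriter(conn.getOutputStream());
            w.write("query=" + URLEncoder.encode("some value", "UTF-8")); // the single text box
            w.close();

            // read the whole response
            BufferedReader r = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) {
                html.append(line).append('\n');
            }
            r.close();

            // no third-party parser allowed, so locate the value by its surrounding markup
            String marker = "<span id=\"result\">"; // assumed marker around the text you need
            int start = html.indexOf(marker);
            if (start >= 0) {
                int end = html.indexOf("</span>", start);
                System.out.println(html.substring(start + marker.length(), end));
            }
        }
    }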
I would use something like the Tamper Data extension for Firefox to capture the HTTP request that gets sent to the server, and then use HttpURLConnection (part of the JDK) to re-create that request.
I am trying to crawl some website content using a Jsoup and Java combination, save the relevant details to my database, and repeat the same activity daily.
But here is the deal: when I open the website in a browser, I get the fully rendered HTML (with all the element tags there). When I test the JavaScript (the part I'm supposed to use to extract the correct data), it works just fine.
But when I do a parse/get with Jsoup (from a Java class), only the initial HTML is downloaded for parsing. Some parts of the website are dynamic: they are rendered after the GET, asynchronously, so I am unable to capture that data with Jsoup.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd appreciate your advice.
Before crawling, you need to check whether the website requires any of the following to show all of its content:
Authentication with login/password
Some sort of session validation in the HTTP headers
Cookies
Some sort of time delay before all the content loads (sites heavy on JavaScript libraries, CSS, and asynchronous data may need this)
A specific User-Agent string
A proxy password if, for example, you are behind a corporate network security configuration
If anything on this list is needed, you can supply that data as parameters in your jsoup.connect() call (see the sketch after the link below). Please refer to the official doc:
http://jsoup.org/cookbook/input/load-document-from-url
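For example, a sketch of wiring several of those parameters through jsoup.connect(); the values shown are placeholders:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class ConnectWithParams {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("http://example.com/page")
                    .userAgent("Mozilla/5.0")                 // a specific User-Agent
                    .cookie("JSESSIONID", "abc123")           // session cookie, if required
                    .header("Referer", "http://example.com/") // extra HTTP headers
                    .timeout(30 * 1000)                       // give slow pages time to answer
                    .get();
            System.out.println(doc.title());
        }
    }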
I'm basically trying to use Java to read some data from my school's website (homework assignments, which lessons I have and when, etc.) for personal use. However, my school requires one to be logged in to access this information.
Could anyone point me in the right direction for logging in with code and accessing this information?
Apache has an API for HTTP client simulation.
Link: http://hc.apache.org/httpcomponents-client-ga/
You need to find out how the server handles logins: what authentication mechanism it uses (cookies, URL session IDs, etc.).
Then you can send the server a login form submission over HTTP (as you would have done manually) and save the authentication token.
With that token you can then access the secured pages.
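As a minimal sketch with Apache HttpClient 4.x (the URLs and form-field names are assumptions; DefaultHttpClient keeps the session cookie automatically once the login POST succeeds):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.http.HttpResponse;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    public class SchoolLogin {
        public static void main(String[] args) throws Exception {
            DefaultHttpClient client = new DefaultHttpClient();

            // 1. Submit the login form (field names are assumptions; inspect the real form)
            HttpPost post = new HttpPost("https://school.example.com/login");
            List<NameValuePair> params = new ArrayList<NameValuePair>();
            params.add(new BasicNameValuePair("username", "mike"));
            params.add(new BasicNameValuePair("password", "secret"));
            post.setEntity(new UrlEncodedFormEntity(params));
            HttpResponse login = client.execute(post);
            EntityUtils.consume(login.getEntity()); // release the connection

            // 2. The client now holds the session cookie, so secured pages are reachable
            HttpGet get = new HttpGet("https://school.example.com/homework");
            String html = EntityUtils.toString(client.execute(get).getEntity());
            System.out.println(html);
        }
    }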
I would use HtmlUnit, which is a programmatic web browser.
You tell it to load the authentication page, fill the form with your credentials, and click the submit button. Then you click on links, just as you would with a real browser, but programmatically, using Java instructions.
And it even supports JavaScript, if necessary.
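A minimal HtmlUnit sketch, assuming a login form whose fields are named "username", "password", and "submit" (adjust to the real page):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlForm;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

    public class HtmlUnitLogin {
        public static void main(String[] args) throws Exception {
            WebClient client = new WebClient(); // JavaScript support is on by default
            HtmlPage login = client.getPage("https://school.example.com/login"); // hypothetical URL
            HtmlForm form = login.getForms().get(0);
            form.getInputByName("username").setValueAttribute("mike");
            form.getInputByName("password").setValueAttribute("secret");
            HtmlSubmitInput button = form.getInputByName("submit");
            HtmlPage home = button.click();
            // from here, follow links programmatically, e.g.:
            HtmlPage homework = home.getAnchorByText("Homework").click();
            System.out.println(homework.asText());
        }
    }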
I'm trying to use DefaultHttpClient to log into xbox.com. I realize that you can't be logged in without visiting http://login.live.com, so I was going to submit to the form on that page and then use the cookies in any requests to xbox.com.
The problem is that requesting anything from live.com using DefaultHttpClient returns the following message.
Windows Live ID requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.
How do I tell DefaultHttpClient to tell the server that JavaScript is available for use? I looked in the default options and also tried adding it as a parameter object, but I can't see what I have to do.
The reason this is happening is that this line of HTML is getting parsed from live:
<noscript><meta http-equiv="Refresh" content="0; URL=http://login.live.com/jsDisabled.srf?mkt=EN-US&lc=1033"/>Windows Live ID requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.<br /><br />To find out whether your browser supports JavaScript, or to allow scripts, see the browser's online help.</noscript>
This redirects you if your client does not have JavaScript enabled (and will therefore honor <noscript> tags).
You could try using a less intelligent HTTP library that does no parsing of the content but instead simply handles the transport, leaving the parsing to you.
Use Wireshark to trace the communication using both a browser and your program, and look for the differences. It's hard to say what, exactly, live.com/xbox.com are looking for, but there is likely some AJAX-y code used to get the actual content.
Real World Problem:
I have my app hosted on Heroku, which (to my knowledge) is unable to offer a solution for running a headless (GUI-less) browser, such as HTMLUnit, for generating HTML snapshots for Googlebot to index my AJAX content.
My Proposed Solution:
If you haven't already, I suggest reading Google's Full Specification for Making AJAX Applications Crawlable.
Imagine I have:
a Sinatra app hosted on Heroku on the domain http://example.com
the app has tabs along the top of the page TabA, TabB and TabC
under each tab is SubTab1, SubTab2, SubTab3
onload, if the URL is http://example.com#!tab=TabA&subtab=SubTab3, then client-side JavaScript takes the location.hash and loads in the TabA, SubTab3 content via AJAX.
Note: the Hash Bang (#!) is part of the google spec.
I would like to build a simple "web service" hosted on Google App Engine (GAE) that:
Accepts a URL param e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Runs HTMLUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and runs the client-side JavaScript on the server.
HTMLUnit returns the DOM once everything is complete (or something like 45 seconds has passed).
The returned content could be sent back via JSON/JSONP, or alternatively a URL could be returned to a file generated and stored on the Google App Engine server (for file-based "cached" results)... open to suggestions here. If a URL to a file were returned, you could then cURL it to get the source code (aka an HTML snapshot). A rough sketch of the service follows below.
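To make the idea concrete, here is an untested sketch of what I imagine such a GAE servlet could look like (remember, I have no HTMLUnit experience, so the calls and the 20-second wait are assumptions to verify):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class SnapshotServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String url = req.getParameter("url"); // URL-encoded target, e.g. http://example.com#!tab=TabA
            WebClient client = new WebClient();
            try {
                HtmlPage page = client.getPage(url);
                // let the AJAX calls finish (GAE caps requests at ~30s, so stay well under)
                client.waitForBackgroundJavaScript(20000);
                resp.setContentType("text/html");
                resp.getWriter().write(page.asXml()); // the rendered DOM = the HTML snapshot
            } finally {
                client.closeAllWindows();
            }
        }
    }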
My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com... basically:
Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
Send request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Render the returned HTML Snapshot to the frontend.
Google Indexes the content and we rejoice!
I don't have any experience with Google App Engine or Java or HTMLUnit.
I might be able to figure it out... and will post my results if I do.
Otherwise I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.
It would introduce more people to the excellent (and free!) Google App Engine, and it would undoubtedly encourage more people to adopt Google's specs for crawlable AJAX content... something we can all benefit from!
As Google's specification gains more acceptance the "hurdle" of setting up a Headless Browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).
Hit me up on twitter #_chrisjacob if you would like to discuss solutions.
I have successfully used HTMLUnit on AppEngine. My GWT code to do this is available in the gwt-platform project; the results I got were similar to those of the HTMLUnit-AppEngine test application by Amit Manjhi.
It should be relatively easy to use GWTP's current HTMLUnit support to do exactly what you describe, although you could likely do it in a simpler app. One problem I see is that AppEngine requests have a 30-second timeout, so you can't have a page that takes HTMLUnit longer than that to process.
UPDATE:
It's been a while, but I finally closed the long-standing issue about making GWT applications crawlable using GWTP. The documentation is not entirely there yet, but check out the issue:
http://code.google.com/p/gwt-platform/issues/detail?id=1