Scraping websites with PHP or Java is easy to implement. My question, however, is whether, if I wanted the client computer to do the scraping, I could do this with JavaScript rather than on the server side.
The background is that websites could block my servers or server farms; if I instead let the user's computer scrape and then post that information to my server, we would avoid that blocking.
Can we scrape a website with JavaScript and use CSS selectors or regular expressions to parse the HTML in order to extract certain information?
Will we be able to protect the code we use in the JavaScript, or will our scraping algorithm have to be human-readable?
If we then post the results via AJAX to our server, how do we make sure it was our script and not data manipulated by a malevolent user?
Is there a good framework for accomplishing this or should I continue server-side scraping?
Recently I used a Mac application called Spotflux. I think it's written in Java (because if you hover over its icon it literally says "java"...).
It's just a VPN app. However, to support itself, it can show you ads... while browsing. You can be browsing in Chrome, and the page will load with a banner at the bottom.
Since it is a VPN app, it obviously can control what goes in and out of your machine, so I guess it simply appends some HTML to any website response before passing it to your browser.
I'm not interested in making a VPN or anything like that. The real question is: how, using Java, can you intercept the HTML response from a website and append more HTML to it before it reaches your browser? Suppose I want to make an app that literally puts a picture at the bottom of every site you visit.
This is, of course, a hypothetical answer - I don't really know how Spotflux works.
However, I'm guessing that as part of its VPN, it installs a proxy server. Proxy servers intercept HTTP requests and responses, for a variety of reasons - most corporate networks use proxy servers for caching, monitoring internet usage, and blocking access to NSFW content.
Since a proxy server can see all HTTP traffic between your browser and the internet, it can modify that traffic; a proxy server will often inject an HTTP header, for instance, and injecting an additional HTML tag for an image would be relatively easy.
Here's a sample implementation of a proxy server in Java.
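To make the idea concrete, here is a minimal, hypothetical sketch of such an injecting proxy (not the linked sample); it ignores HTTPS, chunked transfer encoding, keep-alive connections, and charset detection, all of which a real proxy must handle:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class InjectingProxy {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();
                new Thread(() -> handle(client)).start();
            }
        }
    }

    private static void handle(Socket client) {
        try (client) {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    client.getInputStream(), StandardCharsets.ISO_8859_1));
            // Proxy requests carry an absolute URI: "GET http://example.com/ HTTP/1.1"
            String requestLine = in.readLine();
            if (requestLine == null) return;
            String[] parts = requestLine.split(" ");
            if (parts.length < 2) return;

            HttpURLConnection conn = (HttpURLConnection)
                    new URL(parts[1]).openConnection(Proxy.NO_PROXY);
            byte[] body = conn.getInputStream().readAllBytes();
            String contentType = conn.getContentType();
            if (contentType == null) contentType = "application/octet-stream";

            // Only rewrite HTML responses; pass everything else through untouched.
            // The banner URL is a placeholder, and UTF-8 is a naive assumption.
            if (contentType.contains("text/html")) {
                String html = new String(body, StandardCharsets.UTF_8);
                html = html.replace("</body>",
                        "<img src=\"http://example.com/banner.png\"></body>");
                body = html.getBytes(StandardCharsets.UTF_8);
            }

            OutputStream out = client.getOutputStream();
            out.write(("HTTP/1.0 200 OK\r\nContent-Type: " + contentType
                    + "\r\nContent-Length: " + body.length + "\r\n\r\n")
                    .getBytes(StandardCharsets.ISO_8859_1));
            out.write(body);
        } catch (IOException ignored) {
            // A real proxy would log this and return an error response.
        }
    }
}
```

Point the browser's HTTP proxy setting at localhost:8080 to try it.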
There are many ways to do this. Probably the easiest would be to proxy HTTP requests through a web proxy like RabbIT (written in Java), then extend the proxy to modify the response being sent back to the browser.
In the case of RabbIT, this can be done either with custom code or with a special Filter. See their FAQ.
WARNING: this is not as simple as you might think. Adding an image to the bottom of every page will be hard to do, depending on what kind of HTML is returned by the server. Depending on the CSS, JavaScript, etc. that the remote site uses, you can't just insert the same HTML markup and expect it to behave the same everywhere.
Here's what I would like to do.
I have a PHP file on my server from which I would like to call a Java applet. The applet function will send a GET request to read a page from a third-party server. I then want the page read by the applet function to be sent to the PHP script. To put it simply, I want the return value of the applet's request function in a PHP variable. Is it possible to do?
I want to do it this way because I already have the code to parse the page information in PHP, so I don't want to rewrite it in Java.
I wanted the Java applet because the request has to be sent using the client's information, like its IP address, so I don't want to use proxies.
Note: I am not trying to hack anyone's server. I am not an advanced programmer in either Java or PHP. Please reply in a descriptive manner, possibly with pseudocode.
I already have the code to parse the page information in PHP, so I don't want to rewrite that in java again.
PHP should be able to get that page more easily than a Java applet can. The applet would need to be trusted, or be in communication with a site whose cross-domain policy file explicitly allows hot-linking.
Searches on 'php proxy' seem to turn up around 7.32 million hits. I'd start there.
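That said, if the trusted-applet route is taken, the asker's flow could look roughly like this hedged sketch (the third-party URL and the receive.php endpoint are placeholders; reading the third-party site requires a signed applet or a permissive cross-domain policy):

```java
import java.applet.Applet;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FetchAndForwardApplet extends Applet {
    public void fetchAndForward() throws IOException {
        // GET the third-party page from the client machine, so the request
        // carries the client's IP rather than the server's.
        URL page = new URL("http://third-party.example.com/page");
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(page.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }

        // POST the HTML to a PHP script on the applet's own host, where
        // $_POST['html'] hands it to the existing PHP parsing code.
        URL endpoint = new URL(getCodeBase(), "receive.php");
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        String body = "html=" + URLEncoder.encode(html.toString(), "UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        conn.getResponseCode(); // force the request to complete
    }
}
```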
I was trying to crawl some website content using a combination of jsoup and Java, saving the relevant details to my database and repeating the activity daily.
But here is the deal: when I open the website in a browser, I get the rendered HTML (with all the element tags in place). The JavaScript part, when I test it, works just fine (it's the part I'm supposed to use to extract the correct data).
But when I do a parse/get with jsoup (from a Java class), only the initial page is downloaded for parsing. That is, some parts of the website are rendered dynamically, asynchronously after the GET, and since jsoup never sees them I'm unable to capture that data.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd welcome your advice.
You need to check first whether the website you're crawling demands any of the following to show all of its content:
Authentication with login/password
Some sort of session validation via HTTP headers
Cookies
Some sort of time delay to load all the content (sites heavy on JavaScript libraries, CSS, and asynchronous data may need this)
A specific User-Agent browser
A proxy password if, for example, you're inside a corporate network security configuration
If anything on this list is needed, you can supply that data as parameters in your jsoup.connect() call. Please refer to the official doc.
http://jsoup.org/cookbook/input/load-document-from-url
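As a hedged illustration (the URL, User-Agent, header, cookie value, and timeout are all placeholders), supplying several of those items through jsoup.connect() looks like this:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CrawlExample {
    public static void main(String[] args) throws Exception {
        // Every value below is a placeholder; substitute what the target site needs.
        Document doc = Jsoup.connect("http://example.com/page")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // specific User-Agent
                .header("Accept-Language", "en-US")                     // extra HTTP header
                .cookie("SESSIONID", "abc123")                          // session cookie
                .timeout(30_000)                                        // allow slow responses
                .get();
        System.out.println(doc.title());
    }
}
```

Note, though, that jsoup only fetches and parses the raw HTML; it does not execute JavaScript, so content rendered asynchronously after page load will still not appear in the parsed document.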
I have a task where I need to auto-login and scrape a particular website.
I have seen people suggest mostly HtmlUnit and HttpClient with Java. HtmlUnit looks like a testing tool; I am not sure what to do with it. Is there an example that explains auto-login and web scraping with HtmlUnit or HttpClient?
I'm a Java developer. Can anyone who works closely with these please share any ideas?
Your problem can be broken down into:
Logging into the website
Scraping the data from the website
So, for the first part:
Install the LiveHTTPHeaders Firefox addon and then read all the HTTP headers that were sent and received by your browser while trying to log in.
Try to send these headers using your Java code; basically, you have to emulate an HTTP POST request in Java. For that, google "make post request from java"; a minimal sketch follows.
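A minimal sketch of such a POST with java.net.HttpURLConnection (the URL and form fields are hypothetical; copy the real ones from the captured headers):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LoginPost {
    public static void main(String[] args) throws Exception {
        // Hypothetical login endpoint and form fields; use the ones captured
        // from the browser with the LiveHTTPHeaders addon.
        URL url = new URL("http://example.com/login");
        String form = "username=alice&password=secret";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(form.getBytes(StandardCharsets.UTF_8));
        }

        // The Set-Cookie header carries the session to reuse on later requests.
        System.out.println(conn.getResponseCode());
        System.out.println(conn.getHeaderField("Set-Cookie"));
    }
}
```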
After you have logged into the website, scrape the data using the API of your choice.
I personally use HtmlCleaner.
To scrape data you can use XPath expressions with HtmlCleaner.
Take a look at tutorials on XPath + HtmlCleaner.
You can also use jsoup instead of HtmlCleaner. The advantage of jsoup is that it can handle both the login (POST request) and the data scraping. Take a look here: http://pastebin.com/E0WzpuhF
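Along the lines of that pastebin, here is a hedged sketch of the combined login-then-scrape flow with jsoup (the URLs, form field names, and CSS selector are placeholders):

```java
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLoginScrape {
    public static void main(String[] args) throws Exception {
        // Step 1: log in with a POST request; field names are hypothetical.
        Connection.Response login = Jsoup.connect("http://example.com/login")
                .data("username", "alice")
                .data("password", "secret")
                .method(Connection.Method.POST)
                .execute();
        Map<String, String> cookies = login.cookies(); // session cookies

        // Step 2: fetch a protected page with those cookies and scrape it.
        Document doc = Jsoup.connect("http://example.com/account")
                .cookies(cookies)
                .get();
        for (Element row : doc.select("table.data tr")) { // placeholder selector
            System.out.println(row.text());
        }
    }
}
```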
I know it seems like a lot of work; I have provided you with two alternative solutions to your problem, but divide your problem into smaller chunks and then try to solve it.
I'm developing (with Java) a P2P application. One of the features includes a chat service. When a user sends a message to all of the application users, each user gets the message and updates its chat HTML page.
How can I access, from my Java code, the DOM of this page and change it, without the need to refresh the page in order to see the new message?
Is there any object in Java that can get me this access? For example, can I call a JavaScript function that inserts the new message?
If by "from Java" you mean an applet, then:
You can define some JavaScript functions in your HTML page to return/modify what you want and then call that JavaScript from the applet, as sketched below.
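For example, here is a hedged sketch using the netscape.javascript.JSObject bridge (addMessage is a hypothetical function your HTML page would define):

```java
import java.applet.Applet;
import netscape.javascript.JSObject;

public class ChatApplet extends Applet {
    // Call this from your Java code whenever a new chat message arrives.
    public void pushMessage(String text) {
        // Grab the browser window hosting this applet and invoke a
        // JavaScript function defined in the HTML page (hypothetical name).
        JSObject window = JSObject.getWindow(this);
        window.call("addMessage", new Object[] { text });
    }
}
```

The page-side addMessage function would then insert the text into the chat page's DOM, so no refresh is needed.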
If by "Java" you mean a web server, then you have to use some AJAX solution; you can look, for example, at jQuery.
What you're really looking for is a technique known as Comet, or reverse Ajax. It uses long-lived HTTP connections to hold a connection open from the client browser to the server, so that the server can push updates back to the browser.
The basic flow is that the server pushes a command back to the browser in the response, JavaScript parses the response via a callback function, and the JavaScript then updates the DOM, all without reloading the page.
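As a hedged sketch of the server half of that flow, using the Servlet 3.0 async API (the URL pattern and message plumbing are hypothetical; a library like CometD adds fallbacks and reliability on top of this idea):

```java
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/poll", asyncSupported = true)
public class ChatPollServlet extends HttpServlet {
    private final Queue<AsyncContext> waiting = new ConcurrentLinkedQueue<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        // Park the request instead of answering immediately (long polling).
        AsyncContext ctx = req.startAsync();
        ctx.setTimeout(30_000); // client re-polls if nothing arrives in 30s
        waiting.add(ctx);
    }

    // Call this when a new chat message arrives to push it to every waiter.
    public void broadcast(String message) throws IOException {
        AsyncContext ctx;
        while ((ctx = waiting.poll()) != null) {
            ctx.getResponse().getWriter().write(message);
            ctx.complete();
        }
    }
}
```

The client-side JavaScript issues an XMLHttpRequest to /poll, appends whatever the callback receives to the chat DOM, and immediately polls again.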
You can learn more about Comet on the CometD Website, and if you're developing on Google App Engine, this blog post on the ChannelAPI will be helpful.