I'm trying to automate a few things in my workplace, where we are not allowed to use the internet (only a very few websites are allowed).
Requirement: I have a form with a single text box and a single submit button. I have to put something in the text box and submit the form, then parse the HTML response and extract a specific piece of text. The pages are written in JSP.
Constraint: I don't have access to third-party libraries and have to work with Java 6.
Please point me in the right direction.
HttpURLConnection ships with Java by default, so you may consider using that API. It covers most of the same functionality as Apache HttpClient. Here is a simple example of how to use HttpURLConnection.
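A minimal sketch (the URL, the form field name "searchText", and the sample value are placeholders for your JSP form):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class FormPost {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and field name - replace with your JSP page's values.
        URL url = new URL("http://intranet.example.com/myform.jsp");
        String body = "searchText=" + URLEncoder.encode("my value", "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are writing a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        writer.write(body);
        writer.flush();
        writer.close();

        // Read the HTML response; extract the specific text from it afterwards.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            html.append(line).append('\n');
        }
        reader.close();
        conn.disconnect();

        System.out.println(html);
    }
}

Since you cannot use third-party parsers, extracting the text can then be done with plain String/regex operations on the returned HTML.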
I would use something like Tamper Data for Firefox to capture the HTTP request that gets sent to the server, and then use HttpURLConnection (part of the JDK) to re-create that request.
I sanitize a string in Angular like so:
var sanitized = $sanitize($scope.someHtml);
This works well if the user tries to enter malicious HTML/JavaScript on the application screen.
But if the user presses F12 and sends an HTTP request to the server directly, bypassing the UI code, the string arrives unsanitized and the server will accept it. Is there a way to run sanitization on the server side as well? I'm using Scala/Java.
Take a look at Jsoup, a Java library (which you can easily use from Scala) for HTML parsing, DOM manipulation, and so on.
The given link explains how to clean a document using a Whitelist (so that only the specified elements/tags are accepted).
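For server-side use, a minimal sketch with Jsoup's clean() and a Whitelist (the choice of Whitelist.basic() and the sample input are just for illustration):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class ServerSideSanitizer {

    // Strips anything not allowed by the whitelist, including <script> elements
    // and event-handler attributes such as onclick.
    public static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String input = "<p onclick=\"steal()\">Hello <script>alert('xss')</script>world</p>";
        System.out.println(sanitize(input)); // roughly: <p>Hello world</p>
    }
}

Since it is plain Java, it can be called from your Scala code directly.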
I'm trying to crawl some website content using a combination of jsoup and Java, save the relevant details to my database, and repeat the same activity daily.
Here is the deal: when I open the website in a browser I get the fully rendered HTML (with all the element tags in place). The JavaScript part, when I test it, works just fine (it is what I'm supposed to use to extract the correct data).
But when I do a parse/get with jsoup (from a Java class), only the initial page is downloaded for parsing. In other words, parts of the website are dynamic; I want that data, but since it is rendered asynchronously after the initial GET, I'm unable to capture it with jsoup.
Does anybody know a way around this? Am I using the right toolset? More experienced people, I'd appreciate your advice.
Before anything else, check whether the website you're crawling requires any of the following to show all of its content:
Authentication with Login/Password
Some sort of session validation on HTTP headers
Cookies
Some sort of time delay to load all the content (sites heavy on JavaScript libraries, CSS and asynchronous data may need this).
A specific User-Agent (browser) string
A proxy password if, for example, you're behind a corporate network security configuration.
If anything on this list is needed, you can supply that data through the parameters of your jsoup.connect() call (a sketch follows the link below). Please refer to the official documentation.
http://jsoup.org/cookbook/input/load-document-from-url
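A minimal sketch of passing such parameters to jsoup.connect() (every value below is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CrawlerExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.example.com/page")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0")
                .header("Accept-Language", "en-US,en;q=0.5")
                .cookie("JSESSIONID", "your-session-id")
                .timeout(15000) // 15 seconds, for slow pages
                .get();

        System.out.println(doc.title());
    }
}

Keep in mind that jsoup never executes JavaScript, so content that is generated only on the client side will still be missing no matter which parameters you pass; for that you would need a tool that runs scripts, such as HtmlUnit.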
I have a task where I need to log in automatically and scrape a particular website.
I have seen people suggesting HtmlUnit and HttpClient, mostly with Java. HtmlUnit looks like a testing tool, and I'm not sure what to do with it. Is there an example that explains auto-login and web scraping with HtmlUnit or HttpClient?
I'm a Java developer. Can anyone who works closely with these please share some ideas?
Your problem can be broken down into:
Logging into the website
Scraping the data from the website.
So, for the first part:
Install the Live HTTP Headers Firefox add-on, then read all the HTTP headers that were sent and received by your browser while trying to log in.
Try to send those same headers from your Java code; basically, you have to emulate an HTTP POST request in Java. For that, google "make POST request from Java".
After you have logged into the website, scrape the data using the API of your choice.
I personally use HtmlCleaner.
To scrape the data you can use XPath expressions with HtmlCleaner.
Take a look at XPath + HtmlCleaner examples; a rough sketch follows below.
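A rough sketch of the HtmlCleaner + XPath combination (the URL and the XPath expression are made up for illustration; HtmlCleaner's built-in XPath support is limited but usually sufficient for this kind of extraction):

import java.net.URL;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class HtmlCleanerXPathExample {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Download and clean the page into a DOM-like tree.
        TagNode root = cleaner.clean(new URL("http://www.example.com/profile"));

        // evaluateXPath returns the matching nodes as a plain Object[].
        Object[] titles = root.evaluateXPath("//div[@class='title']");
        for (Object t : titles) {
            TagNode node = (TagNode) t;
            System.out.println(node.getText().toString().trim());
        }
    }
}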
You can also use Jsoup instead of HtmlCleaner. The advantage of Jsoup is that it can handle both the login (POST request) and the data scraping. Take a look here: http://pastebin.com/E0WzpuhF
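The general Jsoup flow for that looks roughly like this (the URLs, form field names and CSS selector are hypothetical; take the real ones from the captured headers and the page source):

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupLoginAndScrape {
    public static void main(String[] args) throws Exception {
        // Step 1: emulate the login form's POST request.
        Connection.Response login = Jsoup.connect("http://www.example.com/login")
                .data("username", "myUser")
                .data("password", "myPassword")
                .method(Connection.Method.POST)
                .execute();

        // Step 2: keep the session cookies the server handed back.
        Map<String, String> cookies = login.cookies();

        // Step 3: request the protected page with those cookies and scrape it.
        Document page = Jsoup.connect("http://www.example.com/account/data")
                .cookies(cookies)
                .get();
        System.out.println(page.select("table.report td").text());
    }
}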
I know it seems like a lot of work. I have given you two alternative solutions to your problem; divide it into smaller chunks and then try to solve each one.
I'm trying to use DefaultHttpClient to log into xbox.com. I realize that you can't be logged in without visiting http://login.live.com, so I was going to submit the form on that page and then use the cookies in any requests to xbox.com.
The problem is that requesting anything from live.com using DefaultHttpClient returns the following message.
Windows Live ID requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.
How do I tell DefaultHttpClient to tell the server that JavaScript is available? I tried looking in the default options and also adding it as a parameter object, but I can't see what I need to do.
The reason this is happening is that this piece of HTML is coming back from live.com:
<noscript><meta http-equiv="Refresh" content="0; URL=http://login.live.com/jsDisabled.srf?mkt=EN-US&lc=1033"/>Windows Live ID requires JavaScript to sign in. This web browser either does not support JavaScript, or scripts are being blocked.<br /><br />To find out whether your browser supports JavaScript, or to allow scripts, see the browser's online help.</noscript>
It is used to redirect you if your client does not have JavaScript enabled (and therefore renders the content of <noscript> tags).
You could try to use a less intelligent HTTP library which does no parsing of the content, but which instead simply does the transport and leaves the parsing to you.
Use Wireshark to trace the communication using both a browser and your program, and look for the differences. It's hard to say what, exactly, live.com/xbox.com are looking for, but there is likely some AJAX-y code used to get the actual content.
This problem relates to the Restlet framework and Java
When a client wants to discover the resources available on a server, it must send an HTTP request with OPTIONS as the request type. This is fine, I guess, for non-human clients, i.e. code rather than a browser.
The problem I see here is that browsers (human-facing) use GET, so they will NOT be able to quickly discover the resources available to them and find extra help documentation etc., because they do not use OPTIONS as a request type.
Is there a way to make a browser send an OPTIONS request instead of a GET, so the server can fire back formatted XML to the client (this is what happens in Restlet, i.e. the server sends all the information back as XML) and display it in the browser?
Or have I got my thinking all wrong, i.e. is the point of OPTIONS that it is meant to be used inside a client's code and not read via a browser?
Use the TunnelService (which by default is already enabled) and simply add the method=OPTIONS query parameter to your URL.
(The Restlet FAQ Q19 is a similar question.)
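As a rough sketch (Restlet 2.x style; the resource class and the route are placeholders), you can make the tunnel settings explicit in your Application, after which a browser URL like http://host/app/hello?method=OPTIONS is treated as an OPTIONS request:

import org.restlet.Application;
import org.restlet.Restlet;
import org.restlet.resource.Get;
import org.restlet.resource.ServerResource;
import org.restlet.routing.Router;

public class MyApplication extends Application {

    // Minimal placeholder resource, just to have something to attach.
    public static class HelloResource extends ServerResource {
        @Get
        public String represent() {
            return "hello";
        }
    }

    @Override
    public Restlet createInboundRoot() {
        // The TunnelService is enabled by default; these calls only make the intent explicit.
        // With the method tunnel on, GET ...?method=OPTIONS is converted to OPTIONS
        // before it reaches the resource.
        getTunnelService().setEnabled(true);
        getTunnelService().setMethodTunnel(true);

        Router router = new Router(getContext());
        router.attach("/hello", HelloResource.class);
        return router;
    }
}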
I think OPTIONS is not designed to be 'user-visible'.
How would you dispatch an OPTIONS request from the browser? (Note that the form element only allows GET and POST.)
You could send it using XMLHttpRequest, get the XML back in your JavaScript callback, and render it appropriately. But I'm not convinced this is something your user should really need to know about!