how to parse web page over firewall - java

I am trying to parse a web url with this Jsoup code:
Document doc = Jsoup.connect("http://www.*url*.com").get();
String title = doc.title();
System.out.println("title: "+title);
I always get the error below
Exception in thread "main" java.io.IOException: 403 error loading URL http://www.*url*.com
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:327)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:130)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:119)
at HttpRequestPoster.main(HttpRequestPoster.java:151)
My computer is on a network controlled by a Kerio WinRoute firewall. Before we can reach the internet, we have to log in to the firewall from our web browsers. That must be the reason. How can I parse the URL?

Setting the user agent worked for me.
Document document = Jsoup.connect(url).header("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2").get();

HTTP error 403 just means Forbidden.
The server understood the request, but is refusing to fulfill it.
In other words, the server side decided based on the request-specific information that the requester isn't allowed to receive the response. That can have many causes: specific information is missing in request headers, the IP address is disallowed, the user agent is disallowed, etcetera.
Your most honest bet would be contacting the admin of the website in question, asking for clarification and permission. You could also use a real web browser together with an HTTP traffic tracker like Firebug or Fiddler2 to check the request/response details. Jsoup's HttpConnection class at least offers several methods to set headers, cookies and/or the user agent whenever necessary.

Related

How to follow redirect from 404 with Jsoup

I want to scrape the redirected Tumblr site that comes up if you try to go to a Tumblr page that doesn't exist. If I put the URL in the browser, I get to that redirected site. Jsoup, however, just gives back an "HTTP error fetching URL. Status=404" error. Any suggestions?
String userAgent = "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";
Document doc = Jsoup.connect("http://www.faszokvagyunk.tumblr.com").userAgent(userAgent).followRedirects(true).get();
Thank you.
Your code seems to handle other types of redirects just fine. With Tumblr, however, you get a 404 page with a 404 status, hence the exception. There could be many reasons for this:
Redirect might not happen at all
Tumblr does redirect in a strange way
Tumblr unnecessarily returns 404, which causes the exception
Other possibilities
I don't know if this solution will help you, but you can instruct your Jsoup connection to ignore HTTP errors by chaining ignoreHttpErrors as follows (this at least allows you to inspect the HTTP error response):
Document doc = Jsoup.connect("http://oddhouredproductivity.tumblr.com/tagged/tips").userAgent(userAgent).followRedirects(true).ignoreHttpErrors(true).get();
ignoreHttpErrors instructs the connection not to throw an exception when it comes across error status codes such as 404 or 500.
Connection ignoreHttpErrors(boolean ignoreHttpErrors)
Configures the connection to not throw exceptions when a HTTP error occurs. (4xx - 5xx, e.g. 404 or 500). By default this is false; an IOException is thrown if an error is encountered. If set to true, the response is populated with the error body, and the status message will reflect the error.
Parameters: ignoreHttpErrors - false (default) if HTTP errors should be ignored.
Returns: this Connection, for chaining
If you set ignoreHttpErrors to true, you get the Document of the error page. If not, get() throws an exception instead of returning a Document.
I also came across this site, which seems to demonstrate an actual Tumblr redirect. You might want to use the URLs on that page for your tests, as they are proper Tumblr redirects. If you look inside the document retrieved for this page, you will see a JavaScript redirect function that triggers after 3 seconds, as follows:
//redirect to new blog
setTimeout( redirectTumblr, 3000 );
function redirectTumblr() {
location.replace('http://oddhour.tumblr.com' + location.pathname);
}
When I connect to the URL given in your question, I see a 404 page, and the content of that 404 page returned in the Document contains no sign of a redirect (unlike the other page).
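Jsoup does not execute JavaScript, so a script-based redirect like the one above has to be followed by hand. A minimal sketch of how the target could be pulled out of the fetched page with a regular expression follows. The class name JsRedirectFinder and the assumption that the URL is passed to location.replace as a quoted literal are mine, not something Tumblr guarantees. Note that the script also appends location.pathname, which you would have to add yourself.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class JsRedirectFinder {
    // Matches location.replace('...') or location.replace("...") and captures the quoted URL.
    private static final Pattern LOCATION_REPLACE =
            Pattern.compile("location\\.replace\\(\\s*['\"]([^'\"]+)['\"]");

    /** Returns the first URL literal passed to location.replace, if the page contains one. */
    static Optional<String> findTarget(String html) {
        Matcher m = LOCATION_REPLACE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String script = "setTimeout( redirectTumblr, 3000 );\n"
                + "function redirectTumblr() {\n"
                + "location.replace('http://oddhour.tumblr.com' + location.pathname);\n"
                + "}";
        System.out.println(findTarget(script).orElse("no redirect found"));
    }
}
```

You could pass the result of doc.html() to findTarget and, if a target is present, issue a second Jsoup.connect(...) against it.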

Permanently retrieving a HttpStatusException with status 500 using Jsoup

I want to use Jsoup to access some data of a website located on a network server. Every time I try to connect via a valid URL, I get an HttpStatusException with the following error:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://sv.thisismydomain.de/path/xyz.jsp (I've changed the URL)
This is my attempt:
System.out.println(Jsoup.connect(urlBase + urlLoginForm).userAgent(userAgent).timeout(10000).get().html());
I'm sure that this is the correct URL. The URL works fine if I copy it out of the StackTrace into my browser - so this can't be the problem.
This is the user agent I'm using:
String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/30.0.1599.101 Safari/537.36";
Do you have any ideas? This drives me crazy!
Status 500 is an internal server error. It means the server encountered an unexpected condition which prevented it from fulfilling the request. There is no way around it other than handling it on the server. Since you say the URL works perfectly in the browser, there are certain possibilities that we could re-check. These may not exactly be the cause of a 500.
1) When you say urlBase + urlLoginForm, there is a chance of missing a /. Say you have urlBase = http://sv.thisismydomain.de/path and urlLoginForm = xyz.jsp; when you concatenate them you could get http://sv.thisismydomain.de/pathxyz.jsp instead of http://sv.thisismydomain.de/path/xyz.jsp
If urlLoginForm is a parameter list, you should re-check how it is constructed.
** This should ideally return a 404, but since the domain part is correct there is a chance it fails with a 500 instead.
2) The site you are trying to reach might be checking the source of the request. So you could rely on the referrer method of Jsoup in this case.
Document doc = Jsoup.connect(urlBase + urlLoginForm).referrer(urlBase + urlLoginForm).userAgent(userAgent).timeout(10000).get();
** ideally this should return a Forbidden 403 error or access denied.
3) Make sure the GET method is supported; try using POST. Again, this should return a 405 Method Not Allowed, but just in case.. ;)
4) The URL doesn't show any issue. Since it's behind a proxy, you could try setting the proxy properties before invoking Jsoup.connect(). But again, this should result in a timeout and not a 500.
System.setProperty("http.proxyHost", "<your host ip>");
System.setProperty("http.proxyPort", "<proxy port>");
Sorry to give all these suggestions which are unrelated to a 500. Since I don't have access to your URL, this is the best I could suggest. :)
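To rule out the missing slash described in point 1, the two parts can be joined through a small helper instead of plain string concatenation. This is only a sketch: UrlJoiner and join are hypothetical names, and it assumes urlLoginForm is a plain path rather than a query string.

```java
// Hypothetical helper for building the request URL.
final class UrlJoiner {
    /** Joins base and path with exactly one '/' between them. */
    static String join(String base, String path) {
        // Drop one trailing slash from the base and one leading slash from the path.
        String left = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        String right = path.startsWith("/") ? path.substring(1) : path;
        return left + "/" + right;
    }

    public static void main(String[] args) {
        System.out.println(join("http://sv.thisismydomain.de/path", "xyz.jsp"));
    }
}
```

The result can then be passed to Jsoup.connect(...) in place of urlBase + urlLoginForm.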

Apache HttpComponents request.get.execute is not returning correct HTML source code

I am using Apache HttpComponents with 150 threads to download HTML source code for roughly 5000 different URLs.
The URLs are contained in a LinkedBlockingQueue and the SourceGetterThreads take from the queue when possible. A thread will then attempt to download the source code using EntityUtils.toString(HttpClient.execute().getEntity). The string representation of the HTML source code is then put on another LinkedBlockingQueue where I have a further 10 threads ready to perform useful work on the source code they take from the second queue.
My problem is that I have noticed errors in the work being performed on the source code. I am using Matcher to match specific patterns and record the patterns found. However, sometimes the source code is incorrect and does not match the URL (i.e. the source code saved in my Java memory is not the same as the source code when viewed in Chrome or Firefox). This is seemingly random, and thus sometimes the source code is correct and sometimes it is not.
Does anybody know why this is?
Most likely the sites you're trying to fetch pages from perform some kind of request filtering based on request headers, like User-Agent, so they can return different content depending on the result of that analysis. There are many reasons to do so:
Provide search robots with appropriate info
Deny web-crawlers from fetching site's content
Recognize mobile devices and supply their user with different HTML/CSS/JS
If you're querying the same site intensively, some kind of DoS protection may be triggered, causing a stub error page to be returned instead of the regular content
If you want to get the same content as in browser, then the most basic recommendation for your client is to behave like a browser:
Always provide User-Agent header
Follow HTTP redirects (e.g. HTTP codes 301, 302, 303, 307)
Maintain cookies and authentication routines if required
Be ready to support HTTPS scheme
Without the source code you're trying to run it's hard to say more, but the Apache HTTP Client you're using is definitely capable of doing all these things. Here's, for example, how to set the User-Agent value:
String url = "http://www.google.com/search?q=httpClient";
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(url);
request.addHeader("User-Agent","Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
HttpResponse response = client.execute(request);
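For comparison, the "behave like a browser" checklist above can also be sketched with only the JDK's HttpURLConnection instead of Apache HttpClient. This is an illustration under assumptions, not the original poster's setup: the class name BrowserLikeRequest and the 10-second timeouts are made up, and the method only prepares the request without sending it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper using only JDK classes.
final class BrowserLikeRequest {
    static final String USER_AGENT =
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2";

    /** Prepares (but does not send) a GET request that looks more like a browser's. */
    static HttpURLConnection prepare(String url) {
        // Keep cookies between requests made through HttpURLConnection in this JVM.
        if (CookieHandler.getDefault() == null) {
            CookieHandler.setDefault(new CookieManager());
        }
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", USER_AGENT);
            // Follows 3xx responses, but not across protocols (http to https or back).
            conn.setInstanceFollowRedirects(true);
            conn.setConnectTimeout(10_000); // assumed value, in milliseconds
            conn.setReadTimeout(10_000);    // assumed value, in milliseconds
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = prepare("http://www.google.com/search?q=httpClient");
        // Nothing has been sent yet; conn.getResponseCode() would perform the request.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

HTTPS URLs work through the same call, since openConnection() then returns an HttpsURLConnection.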

JSOUP throws url status 503 in Eclipse but URL works fine in browser

Specifically, this happens with amazon.com. I receive a 503 error for their domain, but I can successfully parse other domains.
I am using the line
Document doc = Jsoup.connect(url).timeout(30000).get();
to connect to the URL.
You have to set a User Agent:
Document doc = Jsoup.connect(url).timeout(30000).userAgent("Mozilla/17.0").get();
(Or another one; ideally choose a real browser's user agent.)
Otherwise you'll get blocked.
Please see also: Jsoup: select(div[class=rslt prod]) returns null when it shouldn't
You can try:
val ret=Jsoup.connect(url)
    .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
    .timeout(2*1000)
    .followRedirects(true)
    .maxBodySize(1024*1024*3) //3Mb Max
    //.ignoreContentType(true) //for download xml, json, etc
    .get()
This may work; amazon.com may need followRedirects set to true.

Identify whether HTTP requests from Android App or not? and then respond appropriately

My Android App has an App Widget associated with it which is updated every 10 minutes on an Android Device. These updates send HTTP requests for data to the servers and parse the server response and updates the App as required.
As of now if you ping that URL from the browsers on your laptop or PC the server will respond and update whatever is required in the database on the server.
What I want to do is this: when the HTTP requests are received at the server, I want to identify whether the request came from my Android App on an Android device and only then respond with the data. I would like to change the PHP scripts on the server so that they display or redirect to some page if the HTTP request came from a browser or anything else except my Android App.
Typical HTTP requests from the Apps are like http://example.com/abc.php?usera=abc&datab=xyz
I don't want to respond to this URL in the same way if it is coming from anywhere else except from the Android App. Is this possible? What would be a good way to achieve this..
Thanks for your help.
You can add a signature to the request and then check it on server-side.
Just take the query, append a secret word, and compute an MD5 of the result, which you can send as a header (or use as the user agent). On the server you do the same and check whether the checksums match.
To make it a bit safer you can add a timestamp so the request is only valid for a short time.
Make your query look like http://example.com/abc.php?usera=abc&datab=xyz&timestamp=123456789 where timestamp is the current time (as a Unix timestamp) and add this in your app:
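import java.math.BigInteger;
import java.net.MalformedURLException;
import java.net.URL;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static String makeCheck(String url) throws MalformedURLException, NoSuchAlgorithmException
{
    URL u=new URL(url);
    MessageDigest md = MessageDigest.getInstance("MD5");
    // Hash the query string followed by the shared secret.
    md.update(u.getQuery().getBytes());
    BigInteger bn = new BigInteger(1,md.digest("A_SECRET_WORD".getBytes()));
    // Pad to 32 hex digits so leading zeros match the output of PHP's md5().
    return String.format("%032x", bn);
}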
And when you need to add the header use something like:
request.addHeader("X-CHECKSUM", makeCheck(url) );
Then on your server you can use:
// PHP exposes the X-CHECKSUM request header as $_SERVER['HTTP_X_CHECKSUM'].
if (md5($_SERVER['QUERY_STRING']."A_SECRET_WORD")!=$_SERVER['HTTP_X_CHECKSUM']) {
// Wrong checksum
}
$timediff=60;
if ( $_GET['timestamp']>(time()+$timediff) || $_GET['timestamp']<(time()-$timediff) ) {
// Bad timestamp
}
Remember to allow some slack on the timestamp, since the server's clock and the phone's clock can be slightly out of sync.
The typical way of doing this is using the User-Agent header in the HTTP request. If the request comes from the standard browser, it will identify both the hardware and software. For example, a Nexus One running Froyo will have the following User-Agent:
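The two server-side checks can also be written in Java, for instance to unit test the app's checksum against what the server is expected to compute. This is a sketch under assumptions: the class and method names are hypothetical, and it presumes the server hashes the raw query string followed by the secret, as in the PHP snippet.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper mirroring the checksum and timestamp checks.
final class RequestSignature {
    /** Hex MD5 of query + secret, zero-padded to 32 characters like PHP's md5(). */
    static String checksum(String query, String secret) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest((query + secret).getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** True if the timestamp is within toleranceSeconds of now (both in Unix seconds). */
    static boolean isFresh(long timestamp, long now, long toleranceSeconds) {
        return Math.abs(now - timestamp) <= toleranceSeconds;
    }

    /** Recomputes the checksum for the query and compares it with the received one. */
    static boolean isValid(String query, String secret, String receivedChecksum) {
        byte[] expected = checksum(query, secret).getBytes(StandardCharsets.US_ASCII);
        byte[] received = receivedChecksum.getBytes(StandardCharsets.US_ASCII);
        return MessageDigest.isEqual(expected, received);
    }

    public static void main(String[] args) {
        String query = "usera=abc&datab=xyz&timestamp=123456789";
        String sent = checksum(query, "A_SECRET_WORD");
        System.out.println(isValid(query, "A_SECRET_WORD", sent));
        System.out.println(isFresh(123456789L, 123456789L + 30, 60));
    }
}
```

The 60-second tolerance passed to isFresh corresponds to $timediff in the PHP code.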
Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
However, if you're using HttpClient to make requests from your app, you can customise the User-Agent header that HttpClient uses as demonstrated in this answer: Android HTTP User Agent.
On the server side you can use a regex match on the User-Agent header to determine whether a request has originated from your Android app, and send the appropriate response.
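The regex match itself could look like the sketch below. It is written in Java for illustration even though the server in the question runs PHP, and the token MyWidgetApp/<version> is a hypothetical value the app would have to send in its User-Agent.

```java
import java.util.regex.Pattern;

// Hypothetical check for a custom app token in the User-Agent header.
final class UserAgentCheck {
    // Assumed token format: "MyWidgetApp/" followed by a dotted version number.
    private static final Pattern APP_TOKEN =
            Pattern.compile("\\bMyWidgetApp/\\d+(\\.\\d+)*\\b");

    /** True if the User-Agent header contains the app's token. */
    static boolean isFromApp(String userAgent) {
        return userAgent != null && APP_TOKEN.matcher(userAgent).find();
    }

    public static void main(String[] args) {
        String browser = "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) "
                + "AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1";
        System.out.println(isFromApp(browser));
        System.out.println(isFromApp("MyWidgetApp/1.2 (Android 2.2)"));
    }
}
```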
If the actual request is the same (for instance, you are not able to add a POST or GET variable to actively identify your request), you'd have to rely on other things, like user-agent.
While you can set it according to your wishes in your app (also see @mark_bakker's and @mark_allison's answers), you should be aware that there are ways to tamper with it, so don't use it for content you really don't want other users to see.
An Android user could in theory change the User-Agent string between the request leaving your app and the request leaving his/her network. So don't use it for "Android users didn't pay, so should not see this/that info" applications.
The other way around, non-Android users can obviously change their User-Agent too, so if you have content that only your paying Android users should see, others might fake the string.
In the end it might be better to just change your request if you can: if you want a different reply, you should make a different request, in my opinion.
When you create the HttpClient in Android, you can set the following:
client.getParams().setParameter(CoreProtocolPNames.USER_AGENT, "MY Android device identifier");
This sets the User-Agent for each HTTP request sent to your server. On your server you can read the User-Agent to determine whether the request came from your Android device.
