Permanently retrieving a HttpStatusException with status 500 using Jsoup

I want to use Jsoup to access some data from a website located on a network server. Every time I try to connect via a valid URL, I get an HttpStatusException with the following error:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://sv.thisismydomain.de/path/xyz.jsp (I've changed the URL)
This is my attempt:
System.out.println(Jsoup.connect(urlBase + urlLoginForm).userAgent(userAgent).timeout(10000).get().html());
I'm sure that this is the correct URL. The URL works fine if I copy it out of the stack trace into my browser, so this can't be the problem.
This is the user agent I'm using:
String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/30.0.1599.101 Safari/537.36";
Do you have any ideas? This drives me crazy!

Status 500 (Internal Server Error) is a server-side error. It means the server encountered an unexpected condition which prevented it from fulfilling the request. There is no way around it other than handling it on the server. Since you say the URL works perfectly in the browser, there are a few possibilities you could re-check. These may not be the exact cause of the 500.
1) When you build urlBase + urlLoginForm, there is a chance of missing a /. Say you have urlBase = http://sv.thisismydomain.de/path and urlLoginForm = xyz.jsp; the concatenation would be http://sv.thisismydomain.de/pathxyz.jsp instead of http://sv.thisismydomain.de/path/xyz.jsp
If urlLoginForm is a parameter list, you should re-check how it is constructed.
** This should ideally return a 404, but since the domain part is correct, there is a chance it fails with a 500 instead.
2) The site you are trying to access might be checking the source of the request. In that case you could use Jsoup's referrer method.
Document doc = Jsoup.connect(urlBase + urlLoginForm).referrer(urlBase + urlLoginForm).userAgent(userAgent).timeout(10000).get();
** Ideally this should return a 403 (Forbidden) or an access-denied error.
3) Make sure the GET method is supported. Try using POST. Again, this should return a 405 (Method Not Allowed), but just in case.. ;)
4) The URL doesn't show any issue. If it's behind a proxy, you could try setting the proxy properties before invoking Jsoup.connect(). But again, this should result in a timeout and not a 500.
System.setProperty("http.proxyHost", "<your host ip>");
System.setProperty("http.proxyPort", "<proxy port>");
Sorry for giving all these suggestions that are unrelated to the 500 itself. Since I don't have access to your URL, this is the best I can suggest. :)
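
Point 1 above can be guarded against with a small helper that joins the two parts with exactly one slash. This is only a sketch: the class name UrlJoin and the method join are made up for illustration, and it assumes urlLoginForm is a plain relative path rather than a full URL or a parameter list.

```java
public final class UrlJoin {

    // Hypothetical helper: joins base and path with exactly one '/' between them.
    public static String join(String base, String path) {
        String b = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
        String p = path.startsWith("/") ? path.substring(1) : path;
        return b + "/" + p;
    }

    public static void main(String[] args) {
        String urlBase = "http://sv.thisismydomain.de/path";
        String urlLoginForm = "xyz.jsp";
        // Naive concatenation loses the slash between "path" and "xyz.jsp".
        System.out.println(urlBase + urlLoginForm);
        // The helper inserts a single slash regardless of how the parts end or start.
        System.out.println(join(urlBase, urlLoginForm));
    }
}
```

The joined string could then be passed to Jsoup.connect(...) in place of urlBase + urlLoginForm.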

Related

How to follow redirect from 404 with Jsoup

I want to scrape the redirected Tumblr site which comes up if you try to go to a Tumblr page that doesn't exist. If I put the URL in the browser, I get to that redirected site. Jsoup, however, just gives back an "HTTP error fetching URL. Status=404" error. Any suggestions?
String userAgent = "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";
Document doc = Jsoup.connect("http://www.faszokvagyunk.tumblr.com").userAgent(userAgent).followRedirects(true).get();
Thank you.

Your code seems to handle other kinds of redirects just fine. With Tumblr, however, you get a 404 page with a 404 status, hence the exception. There could be several reasons for this:
The redirect might not happen at all
Tumblr redirects in a strange way
Tumblr unnecessarily returns 404, which causes the exception
Other possibilities
I don't know if this solution will help you, but you can instruct your Jsoup connection to ignore HTTP errors by chaining ignoreHttpErrors as follows (this at least allows you to inspect the HTTP error response):
Document doc = Jsoup.connect("http://oddhouredproductivity.tumblr.com/tagged/tips").userAgent(userAgent).followRedirects(true).ignoreHttpErrors(true).get();
ignoreHttpErrors instructs the connection not to throw an exception when it comes across error status codes such as 404 or 500.
Connection ignoreHttpErrors(boolean ignoreHttpErrors)
Configures the connection to not throw exceptions when a HTTP error occurs. (4xx - 5xx, e.g. 404 or 500). By default this is false; an IOException is thrown if an error is encountered. If set to true, the response is populated with the error body, and the status message will reflect the error.
Parameters: ignoreHttpErrors - false (default) if HTTP errors should be ignored.
Returns: this Connection, for chaining
If you set ignoreHttpErrors to true, you get the Document for the error page. If not, get() throws an exception and no Document is returned.
I also came across this site, which might demonstrate an actual Tumblr redirect. You might want to use the URLs on that page for your tests, as they are proper Tumblr redirects. If you look inside the document retrieved for this page, you will see a JavaScript redirect function that triggers after 3 seconds, as follows:
//redirect to new blog
setTimeout( redirectTumblr, 3000 );
function redirectTumblr() {
location.replace('http://oddhour.tumblr.com' + location.pathname);
}
When I connect to the URL given in your question, I see a 404 page, and the content of that page returned in the Document contains no sign of a redirect (unlike the other page).
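
Jsoup parses HTML but does not execute JavaScript, so a script-based redirect like the one quoted above is never followed automatically. One option is to pull the target out of the script text yourself. The following is a rough sketch that assumes the script calls location.replace with a single-quoted literal, as in the snippet above. The class JsRedirect and the method findTarget are hypothetical names, and real pages may need a different pattern.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class JsRedirect {

    // Matches location.replace('...') and captures the single-quoted literal.
    private static final Pattern LOCATION_REPLACE =
            Pattern.compile("location\\.replace\\(\\s*'([^']+)'");

    // Hypothetical helper: returns the first redirect target found in the HTML, if any.
    public static Optional<String> findTarget(String html) {
        Matcher m = LOCATION_REPLACE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String script = "setTimeout( redirectTumblr, 3000 );\n"
                + "function redirectTumblr() {\n"
                + "location.replace('http://oddhour.tumblr.com' + location.pathname);\n"
                + "}";
        // Prints the captured base URL; the path of the original request still has to be appended.
        System.out.println(findTarget(script).orElse("no redirect found"));
    }
}
```

With Jsoup, the html argument could come from doc.html(), and a non-empty result could be passed to a second Jsoup.connect(...) call.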

java.io.IOException: Server returned HTTP response code: 403 for URL: <url>

I am trying to check whether a service is available, but I always get the same error:
java.io.IOException: Server returned HTTP response code: 403 for URL
Searching online suggested that I needed to set the "User-Agent" header, so I did, but the error remains the same:
openConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.91 Safari/537.36");
The complete code is:
url = cadenaURL + cadenaEndpoint;
URLConnection openConnection = new URL(url).openConnection();
openConnection.connect();
is = openConnection.getInputStream();
if ("gzip".equals(openConnection.getContentEncoding()))
is = new GZIPInputStream(is);
and the error is in:
is = openConnection.getInputStream();
Could someone help me?
Thank you,
regards,

I am trying to check if a service is available
Simply put, a 403 response means the service is not available to you: the server refuses to fulfill the request. There's your answer.
There's a possibility you're using the wrong URL and attempting to connect to the wrong service.
However, it also seems that
is = openConnection.getInputStream();
may not be enough to talk to an HTTP server. You need to format the request properly. Please show more of the code and how you are using it so that we can help you further.

In the end I replaced URLConnection with HttpURLConnection and it works perfectly.
Thank you
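
The switch to HttpURLConnection described above might look roughly like the sketch below. It sets the User-Agent before connecting and reads the status code instead of calling getInputStream() directly, which is where the IOException was thrown. The class name ServiceProbe, the methods isAvailable and probe, the 10-second timeouts, and the rule that 2xx and 3xx count as "available" are all assumptions made for illustration, not details from the posts.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public final class ServiceProbe {

    // Assumption: any 2xx or 3xx status counts as "available".
    public static boolean isAvailable(int status) {
        return status >= 200 && status < 400;
    }

    // Opens the URL and returns the HTTP status code without reading the body.
    public static int probe(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            // Request headers have to be set before the connection is opened.
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            // getResponseCode() returns 403 as a number; getInputStream() would throw instead.
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: ServiceProbe <url>");
            return;
        }
        int status = probe(args[0], "Mozilla/5.0");
        System.out.println("status=" + status + " available=" + isAvailable(status));
    }
}
```

Keeping the status check separate from the I/O makes it easy to change what counts as "available" without touching the connection code.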

Http post request not working

I want to search the following in Google query box:
http://www.cmu.edu/silicon-valley/ faculty directory
Unfortunately, the following code does not work:
Jsoup.connect("http://www.google.com/search?hl=en&q=http%3A%%2F%%2F%www.cmu.edu%2F%silicon-valley%2F%20faculty20directory").get();
nor does this one:
Jsoup.connect("http://www.google.com/search?hl=en&q=http%3A%%2F%%2F%www.cmu.edu%2F%silicon-valley%2F%20faculty20or20directory").get();
What am I missing here?
Edit: "not working" means Google didn't return the results we see in the browser.
Jsoup.connect("http://www.google.com/search?hl=en&q=http%3A%"%2F%%2F%www.cmu.edu%2F%silicon-valley%2F%20faculty").get();
The code above works though. It's equivalent to Googling "http://www.cmu.edu/silicon-valley/ faculty".
Edit: I have the following trick in my program, so bot-rule is not an issue:
.userAgent("Mozilla")

Jsoup.connect("http://www.google.com/search?hl=en&q=http%3A%2F%2Fwww.cmu.edu%2Fsilicon-valley%2F+faculty+directory") leads to a 403 error (Forbidden), as Google forbids robots from accessing its results.
You'll have to change the User-Agent string if you want to do that.
doc = Jsoup.connect("http://www.google.com/search?hl=en&q=http%3A%2F%2Fwww.cmu.edu%2Fsilicon-valley%2F+faculty+directory").header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17").get() should work as expected, but could be against Google's Terms of Use.
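
The hand-written query strings in the question contain malformed escapes such as %% and %www, so it may be safer to let the JDK do the encoding. A minimal sketch using java.net.URLEncoder follows. The class GoogleQuery and the method searchUrl are hypothetical names, and the hl=en parameter is simply copied from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public final class GoogleQuery {

    // Builds the search URL, encoding the query as form data (spaces become '+').
    public static String searchUrl(String query) {
        try {
            return "http://www.google.com/search?hl=en&q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("http://www.cmu.edu/silicon-valley/ faculty directory"));
    }
}
```

For the search text from the question, this produces the same URL that the answer above passes to Jsoup.connect(...).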

how to parse web page over firewall

I am trying to parse a web url with this Jsoup code:
Document doc = Jsoup.connect("http://www.*url*.com").get();
String title = doc.title();
System.out.println("title: "+title);
I always get the error below
Exception in thread "main" java.io.IOException: 403 error loading URL http://www.*url*.com
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:327)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:130)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:119)
at HttpRequestPoster.main(HttpRequestPoster.java:151)
My computer is on a network controlled by a Kerio WinRoute firewall. Before we can use the internet, we have to log in to the firewall from our web browsers. This must be the reason. How can I parse the URL?

Setting the user agent worked for me.
Document document = Jsoup.connect(url).header("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2").get();

HTTP error 403 just means Forbidden.
The server understood the request, but is refusing to fulfill it.
In other words, the server side decided based on the request-specific information that the requester isn't allowed to receive the response. That can have many causes: specific information is missing in request headers, the IP address is disallowed, the user agent is disallowed, etcetera.
Your most honest bet would be to contact the admin of the website in question and ask for clarification and permission. You could also use a real web browser together with an HTTP traffic monitor like Firebug or Fiddler2 to check the request/response details. Jsoup's HttpConnection class at least offers several methods to set headers, cookies and/or the user agent whenever necessary.

Why do I get the proper URL response from a browser, but empty within Android's "HttpGet"?

I am writing an Android application which uses a REST-based API on the server. So far the login works perfectly using HttpGet: I send the credentials, and it sends me back a JSON response object containing a session id or a failure. I then moved on to another GET API (this one is passed the session id), and the response I get back looks valid, "200 - Ok", but the response body contains nothing at all.
If I take the same URL and drop it into a browser, I get all the JSON text I expect displayed in the browser window. So what is the difference between a browser request/response and that of HttpGet? Any clues as to why my HttpGet might return a 'valid' nothing?

I had the same problem. Setting user agent solved my problem:
HttpParams params = new BasicHttpParams();
...
params.setParameter(CoreProtocolPNames.USER_AGENT, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71");

This is the pull() I have written:
mHttpGet.setURI(url.toURI());
mResponse = mHttpClient.execute(mHttpGet);
mResponse.getEntity().getContent(); // returns inputstream
How did you do yours?!

It turned out to be a server-side issue. They were actually sending me empty strings when the requester was not a browser. Too bad I can't delete a question. :(
