Jsoup connect with url with special chars - java

I tried this code to download the source of a page, but it returns a 404 error, although the same URL works when I open it in a browser.
org.jsoup.Connection req = Jsoup.connect("http://www.3mfrance.fr/3M/fr_FR/notre-societe-fr/tous-les-produits-3M/~/Colle-structurale-%C3%A9poxyde-3M-Scotch-Weld-DP190?N=5002385+8709320+8710676+8710815+8711017+8711736+8713609+3293242432&rt=rud");
Response rep=req.execute();
String codeSource= rep.body();

Related

Links give invalid response code from code but valid response code from browser

I'm validating links by requesting them and checking the response codes (in Java). From code I get error response codes (403 or 404), but in the browser I get a 200 status code when I inspect the network activity. Here's my code that gets the response code. [I do basic validation on the URLs beforehand, like making them lowercase, etc.]
static int getResponseCode(String link) throws IOException {
URL url = new URL(link);
HttpURLConnection http = (HttpURLConnection) url.openConnection();
return http.getResponseCode();
}
For a link like http://science.sciencemag.org/content/220/4599/868, I get a 403 status when I run this code, but in the browser (Chrome) I get a 200 status. I also get a 200 status code with the curl command below.
curl -Is http://science.sciencemag.org/content/220/4599/868
The only way to overcome that is to:
check which HTTP headers your program sends (for instance, by sending requests to http://scooterlabs.com/echo and checking the response)
check which HTTP headers your browser sends (for instance, by visiting https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending )
spot the differences
change your program to send the same headers as your browser (the ones that work)
I did this analysis for you, and it turns out this website requires an Accept header that resembles the Accept header of an existing browser. By default Java sends something valid, but not resembling that.
You just need to change your program like so:
static int getResponseCode(String link) throws IOException {
URL url = new URL(link);
HttpURLConnection http = (HttpURLConnection) url.openConnection();
http.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
return http.getResponseCode();
}
(Or any other value that an actual browser uses)
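If more than one header turns out to matter, the same idea can be generalized. Here is a minimal sketch, assuming you have copied a header set from your own browser; the class name `BrowserLikeRequest`, the `Accept-Language` and `User-Agent` values, and the `open` helper are illustrative assumptions, not something the site is known to require.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

final class BrowserLikeRequest {

    // Hypothetical header set; replace the values with what your browser really sends.
    static Map<String, String> browserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        headers.put("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0");
        return headers;
    }

    // Prepares the connection and sets the headers; nothing is sent
    // until the response is actually requested.
    static HttpURLConnection open(String link) throws IOException {
        HttpURLConnection http = (HttpURLConnection) new URL(link).openConnection();
        for (Map.Entry<String, String> header : browserHeaders().entrySet()) {
            http.setRequestProperty(header.getKey(), header.getValue());
        }
        return http;
    }

    // Same contract as the original getResponseCode, with browser-like headers.
    static int getResponseCode(String link) throws IOException {
        return open(link).getResponseCode();
    }
}
```

Keeping the headers in one map makes it easy to remove them one at a time and find out which one the server actually checks.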

How to redirect to a site from proxy java

I am using Apache HttpClient to open a URL through a proxy and read the response. Instead, I want to redirect the browser to the site through the proxy, passing POST parameters.
This is my code; it is a servlet:
String parameter= request.getParameter("parameter");
HttpClient httpClient = new HttpClient();
httpClient.getHostConfiguration().setProxy(proxyhost, proxyport);
log.info("message:::"+message);
PostMethod postMethod = new PostMethod(url);
NameValuePair[] data = new NameValuePair[1];
data[0] = new NameValuePair("parameter", parameter);
postMethod.setRequestBody(data);
int code = httpClient.executeMethod(postMethod);
response.setContentType("text/html");
PrintWriter out=response.getWriter();
out.print(postMethod.getResponseBodyAsString());
The actual problem is that I get the response from the other site and its HTML is rendered, but when I click a link in the browser, the URL is opened from my server, because the response contains links in this format:
var url "../../someparams"
It should instead open http://url/someparams (where url is the one used in the code above). Alternatively, as soon as the URL is hit through the proxy, can we redirect the browser to that page too? I mean opening the URL through that proxy and removing the URL of the servlet being called.
@Law Anthony:
response.sendRedirect("http://www.google.com"); does not help.
I need your help to resolve this.
Is this a servlet?
If it is,
response.sendRedirect("http://www.google.com");
This will send a 302 redirect to the client.
EDIT: What exactly do you want?
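A relative link such as "../../someparams" is resolved by the browser against the URL of the page it was served from, which here is the servlet, so that is presumably why the click lands on your own server. One possible approach, sketched below, is to rewrite relative links to absolute ones before writing the body to the response. This assumes the servlet knows the URL of the remote page; `RelativeLinks`, `toAbsolute`, and the `example.com` URL in the usage note are hypothetical, and locating the links inside the HTML is not covered.

```java
import java.net.URI;

final class RelativeLinks {

    // Resolves a link found in the proxied HTML against the URL of the remote
    // page it came from. Links that are already absolute are returned unchanged.
    static String toAbsolute(String remotePageUrl, String link) {
        return URI.create(remotePageUrl).resolve(link).toString();
    }
}
```

For example, `RelativeLinks.toAbsolute("http://example.com/a/b/page.html", "../../someparams")` returns `http://example.com/someparams`.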

JSOUP throws url status 503 in Eclipse but URL works fine in browser

This happens specifically with amazon.com. I receive a 503 error for their domain, but I can successfully parse other domains.
I am using the line
Document doc = Jsoup.connect(url).timeout(30000).get();
to connect to the URL.
You have to set a User Agent:
Document doc = Jsoup.connect(url).timeout(30000).userAgent("Mozilla/17.0").get();
(Or another value; ideally use the user agent string of a real browser.)
Otherwise you'll get blocked.
Please see also: Jsoup: select(div[class=rslt prod]) returns null when it shouldn't
You can try:
val ret=Jsoup.connect(url)
.userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
.timeout(2*1000)
.followRedirects(true)
.maxBodySize(1024*1024*3) //3Mb Max
//.ignoreContentType(true) //for download xml, json, etc
.get()
This may work; amazon.com may need followRedirects set to true.

404 error when parsing URL using jsoup

I am getting a 404 error when using Jsoup. The call is Document doc = Jsoup.parse(url, 30000) and the URL string is http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94
and the URL displays fine in Chrome. The error I am getting is java.io.IOException: 404 error loading URL http://www.myland.co.il/vmchk/××ש×-×שק××
Any ideas?
Don't use the parse() method for websites; use connect() instead, so you can set more connection settings.
final String url = "http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";
Document doc = Jsoup.connect(url).get();
However the problem is the url-encoding:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.myland.co.il/vmchk/××ש×-×שק××
Even decoding the url back to utf-8 doesn't solve this.
Do you have an "alternative" url?
Try decoding the URL with java.net.URLDecoder (String has no decodeURL() method):
String url = "http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";
Document doc = Jsoup.connect(URLDecoder.decode(url, "UTF-8")).get();
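For the opposite direction, going from a readable path to the percent-encoded form, here is a minimal sketch using only java.net.URI. It does not explain the request ending up at /vmchk/ shown in the error message; it only shows one way to produce an ASCII-only URL from a path containing spaces or non-ASCII characters. `PathEncoder` and `encode` are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PathEncoder {

    // Builds an ASCII-only URL from a scheme, a host, and a readable (unencoded)
    // path. The path must start with "/".
    static String encode(String scheme, String host, String rawPath) {
        try {
            // The multi-argument constructor quotes characters that are illegal
            // in a path (such as spaces); toASCIIString() then percent-encodes
            // any non-ASCII characters as UTF-8 bytes.
            return new URI(scheme, host, rawPath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URI for path: " + rawPath, e);
        }
    }
}
```

For example, `PathEncoder.encode("http", "example.com", "/a b")` returns `http://example.com/a%20b`. Pass the unencoded path only: a path that is already percent-encoded would have its `%` signs encoded a second time.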

403 error in accessing an URL but works fine in browsers

String url = "http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false";
URL google = new URL(url);
HttpURLConnection con = (HttpURLConnection) google.openConnection();
When I use a BufferedReader to print the content, I get a 403 error.
The same URL works fine in the browser. Could anyone suggest a fix?
The reason it works in a browser but not in java code is that the browser adds some HTTP headers which you lack in your Java code, and the server requires those headers. I've been in the same situation - and the URL worked both in Chrome and the Chrome plugin "Simple REST Client", yet didn't work in Java. Adding this line before the getInputStream() solved the problem:
connection.addRequestProperty("User-Agent", "Mozilla/4.0");
..even though I have never used Mozilla. Your situation might require a different header. It might be related to cookies ... I was getting text in the error stream advising me to enable cookies.
Note that you might get more information by looking at the error text. Here's my code:
try {
HttpURLConnection connection = ((HttpURLConnection)url.openConnection());
connection.addRequestProperty("User-Agent", "Mozilla/4.0");
InputStream input;
if (connection.getResponseCode() == 200) // this must be called before 'getErrorStream()' works
input = connection.getInputStream();
else input = connection.getErrorStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(input));
String msg;
while ((msg =reader.readLine()) != null)
System.out.println(msg);
} catch (IOException e) {
System.err.println(e);
}
HTTP 403 is a Forbidden status code. You would have to read the HttpURLConnection.getErrorStream() to see the response from the server (which can tell you why you have been given a HTTP 403), if any.
This code should work fine. If you have been making a number of requests, it is possible that Google is just throttling you. I have seen Google do this before. You can try using a proxy to verify.
Most browsers automatically encode URLs when you enter them, but the Java URL class doesn't.
You should encode the query parameter values with URLEncoder before building the URL.
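As a rough sketch of that idea: URLEncoder is meant for individual query names and values, not for a whole URL, so the query string is assembled piece by piece. The parameter values are taken from the question; `QueryBuilder`, `buildQuery`, and `directionsUrl` are hypothetical names, and the `encode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class QueryBuilder {

    // Encodes each name and value separately and joins the pairs with '&'.
    static String buildQuery(Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> param : params.entrySet()) {
            query.add(URLEncoder.encode(param.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
        }
        return query.toString();
    }

    // Rebuilds the directions URL from the question with encoded values.
    static String directionsUrl() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("origin", "Chicago,IL");
        params.put("destination", "Los Angeles,CA");
        params.put("waypoints", "Joplin,MO|Oklahoma City,OK"); // '|' is not legal unencoded
        params.put("sensor", "false");
        return "http://maps.googleapis.com/maps/api/directions/xml?" + buildQuery(params);
    }
}
```

With this, the `|` separator becomes `%7C`, commas become `%2C`, and spaces become `+`. Whether that alone removes the 403 is not certain, since the other answers point to missing headers and throttling as possible causes.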
I know this is a bit late, but the easiest way to get the contents of a URL is to use the Apache HttpComponents HttpClient project: http://hc.apache.org/httpcomponents-client-ga/index.html
Your original page (with the link) and the linked target page are not on the same domain; call them original-domain and target-domain.
I found that the difference is in the request headers.
With the 403 Forbidden error, the request headers include this line:
Referer: http://original-domain/json2tree/ipfs/ipfsList.html
When I enter the URL directly, there is no 403 Forbidden, and the request headers do NOT include the Referer line above.
I finally figured out how to fix this error! On your original-domain web page, you have to add:
<meta name="referrer" content="no-referrer" />
This prevents the browser from sending the Referer header, and it works both for links and for Ajax requests.
