I am getting a 404 error when using Jsoup. The call is Document doc = Jsoup.parse(url, 30000) and the URL string is http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94, which displays fine in Chrome.
The error I am getting is: java.io.IOException: 404 error loading URL http://www.myland.co.il/vmchk/××ש×-×שק×× (note how the Hebrew path comes out mangled in the error message).
Any ideas?
Don't use the parse() method for fetching websites; use connect() instead, which lets you set more connection settings:
final String url = "http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";
Document doc = Jsoup.connect(url).get();
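For example, a minimal sketch of such extra settings (the user agent string here is only illustrative; the timeout matches your original call):
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0")   // pretend to be a regular browser
        .timeout(30 * 1000)         // 30 second timeout, as in your original call
        .followRedirects(true)
        .get();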
However, the problem is the URL encoding:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.myland.co.il/vmchk/××ש×-×שק××
Even decoding the URL back to UTF-8 doesn't solve this.
Do you have an "alternative" url?
Try decoding the URL first, for example with java.net.URLDecoder:
String url = "http://www.myland.co.il/%D7%9E%D7%97%D7%A9%D7%91-%D7%94%D7%A9%D7%A7%D7%99%D7%94";
Document doc = Jsoup.connect(URLDecoder.decode(url, "UTF-8")).get();
(URLDecoder.decode declares a checked UnsupportedEncodingException, so catch it or declare it on your method.)
Related
I'm trying to connect to a URL in an AsyncTask but am having trouble.
Document doc = Jsoup.connect("https://www.gasbuddy.com/home?search=33444&fuel=1").get();
gives me an error in my log saying:
Error Connecting: failed to connect to www.gasbuddy.com/104.17.148.191 (port 443) after 30000ms
After doing some research, I believe I need to connect to the baseURL. However, I'm not sure how to do this. I've tried:
URL absoluteUrl = new URL(new URL("https://www.gasbuddy.com/home?search=33444&fuel=1"), "script.js");
Document doc = Jsoup.connect(String.valueOf(absoluteUrl)).get();
But I get the same error in my logs.
Edit: I get the same error on any website I try connecting to ("https://www.yahoo.com/", "https://www.google.com/", etc.). I'm on a public Starbucks Wi-Fi at the moment, not sure if that matters.
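For reference, a minimal sketch of how a Jsoup call is usually placed in an AsyncTask's doInBackground() (the FetchTitleTask name is just illustrative, and android.permission.INTERNET must be declared in the manifest). That said, a connect timeout on every site usually points at the network itself rather than the code; a captive-portal Wi-Fi like the Starbucks one can block requests until you accept its terms in a browser:
private static class FetchTitleTask extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... urls) {
        try {
            // The network call runs off the main thread here.
            Document doc = Jsoup.connect(urls[0])
                    .userAgent("Mozilla/5.0")
                    .timeout(30000)
                    .get();
            return doc.title();
        } catch (IOException e) {
            return "Error Connecting: " + e.getMessage();
        }
    }

    @Override
    protected void onPostExecute(String result) {
        Log.d("FetchTitleTask", result);
    }
}

// Usage: new FetchTitleTask().execute("https://www.gasbuddy.com/home?search=33444&fuel=1");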
I'm trying to connect and retrieve the page title from here. The code works fine if I remove everything after ".com" from the link. The following code does not work:
try {
    Document doc = Jsoup.connect("https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en")
            .data("query", "Java")
            .userAgent("Chrome")
            .cookie("auth", "token")
            .timeout(3000)
            .post();
    String title = doc.title();
    Log.d("hellomate", title);
} catch (IOException e) {
    Log.d("hellomatee", e.toString());
}
If the code worked, the title returned should be "Sammamish Washington - Google News".
The error returned from the code is: "org.jsoup.HttpStatusException: HTTP error fetching URL. Status=405, URL=https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en"
What does status 405 mean? Does Jsoup not allow the kind of url I used?
Thanks.
Status 405 is an HTTP status code that means "Method Not Allowed"; Microsoft has documentation on it as well. As @Andreas said, you can fix this by changing .post(); to .get();.
If you look at the jsoup docs under examples, they show how you would probably want to structure your requests:
Jsoup.connect("http://en.wikipedia.org/").get();
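Applied to the request from your question, that would look roughly like this (the .data("query", "Java") call is dropped, since it was only meaningful for the form POST):
Document doc = Jsoup.connect("https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en")
        .userAgent("Chrome")
        .cookie("auth", "token")
        .timeout(3000)
        .get();   // GET instead of POST avoids the 405
String title = doc.title();
Log.d("hellomate", title);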
I tried this code to download the source of a page, but it returns a 404 error, even though the same URL works when I open it in a browser:
org.jsoup.Connection req = Jsoup.connect("http://www.3mfrance.fr/3M/fr_FR/notre-societe-fr/tous-les-produits-3M/~/Colle-structurale-%C3%A9poxyde-3M-Scotch-Weld-DP190?N=5002385+8709320+8710676+8710815+8711017+8711736+8713609+3293242432&rt=rud");
Connection.Response rep = req.execute();
String codeSource = rep.body();
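One thing worth trying (a sketch based on the suggestions elsewhere in this thread, not tested against the 3M site) is to send a browser-like user agent and set ignoreHttpErrors(true), so you can at least see the real status code and the returned body:
Connection.Response res = Jsoup.connect("http://www.3mfrance.fr/3M/fr_FR/notre-societe-fr/tous-les-produits-3M/~/Colle-structurale-%C3%A9poxyde-3M-Scotch-Weld-DP190?N=5002385+8709320+8710676+8710815+8711017+8711736+8713609+3293242432&rt=rud")
        .userAgent("Mozilla/5.0")      // browser-like user agent
        .ignoreHttpErrors(true)        // don't throw on 404; return the body instead
        .execute();
System.out.println(res.statusCode() + " " + res.statusMessage());
String codeSource = res.body();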
I want to scrape the redirected Tumblr site that comes up if you try to go to a Tumblr page that doesn't exist. If I put the URL in the browser, I get to that redirected site. Jsoup, however, just gives back an "HTTP error fetching URL. Status=404" error. Any suggestions?
String userAgent = "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";
Document doc = Jsoup.connect("http://www.faszokvagyunk.tumblr.com").userAgent(userAgent).followRedirects(true).get();
Thank you.
Your code seems to handle other types of redirects just fine. With Tumblr, however, you get a 404 page that causes a 404 status, hence the exception. There could be many reasons for this:
The redirect might not happen at all
Tumblr does the redirect in a strange way
Tumblr unnecessarily returns 404, which causes the exception
Other possibilities
I don't know if this solution can help you, but you can instruct your Jsoup connection to ignore HTTP errors by chaining ignoreHttpErrors(true) as follows (this at least allows you to inspect the HTTP errors):
Document doc = Jsoup.connect("http://oddhouredproductivity.tumblr.com/tagged/tips").userAgent(userAgent).followRedirects(true).ignoreHttpErrors(true).get();
ignoreHttpErrors instructs the connection not to throw an exception when it comes across 404, 500, and other error status codes.
Connection ignoreHttpErrors(boolean ignoreHttpErrors)
Configures the connection to not throw exceptions when a HTTP error occurs (4xx - 5xx, e.g. 404 or 500). By default this is false; an IOException is thrown if an error is encountered. If set to true, the response is populated with the error body, and the status message will reflect the error.
Parameters: ignoreHttpErrors - false (default) if HTTP errors should be ignored.
Returns: this Connection, for chaining
If you set ignoreHttpErrors to true, you will get the Document even for an error page. If you don't, the call throws an HttpStatusException for those status codes instead, as you saw.
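A minimal sketch of that (same URL and userAgent as above), using execute() so the status code can be checked before parsing:
Connection.Response res = Jsoup.connect("http://www.faszokvagyunk.tumblr.com")
        .userAgent(userAgent)
        .followRedirects(true)
        .ignoreHttpErrors(true)
        .execute();
int status = res.statusCode();   // 404 for the missing blog
Document doc = res.parse();      // body of the 404/redirect page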
I also came across this site that might actually demonstrate a real Tumblr redirect. You might want to use the URLs on that page for your tests, as they are proper Tumblr redirects. If you look inside the retrieved document for that page, you will see a JavaScript redirect function that triggers after 3 seconds:
//redirect to new blog
setTimeout( redirectTumblr, 3000 );
function redirectTumblr() {
    location.replace('http://oddhour.tumblr.com' + location.pathname);
}
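If you do want to follow such a JavaScript redirect yourself, a rough sketch (my own approach, using a simple java.util.regex match on the location.replace call; not something the Jsoup API does for you) could look like this:
Document doc = Jsoup.connect("http://oddhouredproductivity.tumblr.com/tagged/tips")
        .userAgent(userAgent)
        .ignoreHttpErrors(true)
        .get();
// Pull the redirect target out of the inline script.
Matcher m = Pattern.compile("location\\.replace\\('([^']+)'").matcher(doc.html());
if (m.find()) {
    // The script appends location.pathname, so re-attach the original path.
    Document redirected = Jsoup.connect(m.group(1) + "/tagged/tips")
            .userAgent(userAgent)
            .get();
}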
When I connect to the URL that you gave in your question, I see the 404 page, and the content of the 404 page returned in the Document contains no sign of a redirect (unlike the other page).
This is with the website amazon.com specifically. I am receiving a 503 error for their domain, but I can successfully parse other domains.
I am using the line
Document doc = Jsoup.connect(url).timeout(30000).get();
to connect to the URL.
You have to set a User Agent:
Document doc = Jsoup.connect(url).timeout(30000).userAgent("Mozilla/17.0").get();
(Or another one; it's best to use a real browser's user agent string.)
Otherwise you'll get blocked.
Please see also: Jsoup: select(div[class=rslt prod]) returns null when it shouldn't
You can also try:
val ret = Jsoup.connect(url)
    .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
    .timeout(2 * 1000)
    .followRedirects(true)
    .maxBodySize(1024 * 1024 * 3) // 3 MB max
    //.ignoreContentType(true)    // for downloading xml, json, etc.
    .get()
It may work; amazon.com may also need followRedirects set to true.