I'm trying to query Google using Jsoup. Unfortunately, after about 300 queries I now get the error shown below. I use the following code snippet. How can I fix the problem?
String request = "http://www.google.com/search?q=" + query + "&num=1";
System.out.println("Sending request..." + request);
Document doc = Jsoup
        .connect(request)
        .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
        .timeout(5000)
        .get();
HTTP error fetching URL. Status=429 (Too many requests)
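Status 429 means Google is rate-limiting the client after too many requests in a short window. One common mitigation (not specific to Jsoup) is to slow down and retry with exponential backoff. The sketch below uses the Java 11 HTTP client; the helper names `backoffMillis` and `fetchWithRetry` are my own, made up for illustration, and the returned body can then be handed to Jsoup.parse:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;

public class BackoffFetch {
    // Delay before retry attempt n: 1s, 2s, 4s, ... capped at 60s
    static long backoffMillis(int attempt) {
        return Math.min(60_000L, 1000L * (1L << attempt));
    }

    // Retry on 429 with exponential backoff; return the body on success
    static String fetchWithRetry(String url, int maxAttempts) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder().uri(URI.create(url)).build();
        for (int attempt = 0; ; attempt++) {
            HttpResponse<String> res = client.send(req, BodyHandlers.ofString());
            if (res.statusCode() != 429) {
                return res.body(); // hand this to Jsoup.parse(...)
            }
            if (attempt + 1 >= maxAttempts) {
                throw new RuntimeException("Still rate-limited after " + maxAttempts + " attempts");
            }
            Thread.sleep(backoffMillis(attempt)); // wait before retrying
        }
    }
}
```

Note that even with backoff, Google actively blocks scrapers, so spacing requests out is only a partial fix.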
I am trying to get the list of stops (stations) for a train by passing the train number and other required parameters (taken from the Firefox web developer tools) to the URL with a POST request, but I get a 404 Page Not Found error. When I tried the same request with Postman, it returned the page with the requested data. What is wrong with the code?
Document doc = Jsoup.connect("https://enquiry.indianrail.gov.in/mntes/q?")
        .data("opt", "TrainRunning")
        .data("subOpt", "FindStationList")
        .data("trainNo", trainNumber)
        .data("jStation", "")
        .data("jDate", "25-Aug-2021")
        .data("jDateDay", "Wed")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0")
        .referrer("https://enquiry.indianrail.gov.in/mntes/")
        .ignoreHttpErrors(true)
        .post();
System.out.println(doc.text());
Thank you in advance.
I've tried to make the request work with Jsoup, but to no avail. An odd way of sending form data is being used: the form data is passed as URL query parameters of a POST request.
Jsoup uses a simplified HTTP API in which this particular use case was not foreseen. It is debatable whether it is appropriate to send form parameters the way https://enquiry.indianrail.gov.in/mntes expects them to be sent.
If you're using Java 11 or later, you could simply fetch the response of your POST request via the modern Java HTTP Client. It fully supports the HTTP protocol. You can then feed the returned String into Jsoup.
Here's what you could do:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.net.http.HttpResponse;
import java.net.http.HttpResponse.BodyHandlers;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// 1. Get the response
HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://enquiry.indianrail.gov.in/mntes/q?opt=TrainRunning&subOpt=ShowRunC&trainNo=07482&jDate=25-Aug-2021&jDateDay=Wed&jStation=DVD%23false"))
        .POST(BodyPublishers.noBody())
        .build();
HttpResponse<String> response = client.send(request, BodyHandlers.ofString());

// 2. Parse the response via Jsoup
Document doc = Jsoup.parse(response.body());
System.out.println(doc.text());
I've simply copy-pasted the proper URL from Postman. You might want to build your query string in a more robust way. See:
Java URL encoding of query string parameters
How to convert map to url query string?
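As a rough sketch of the "more robust way": you can build the query string from a map with java.net.URLEncoder (the Charset overload needs Java 10+; the class and parameter names here are my own, and the parameter values are just the ones from the question):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryStringBuilder {
    // Percent-encode each key/value pair and join the pairs with '&'
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("opt", "TrainRunning");
        params.put("subOpt", "ShowRunC");
        params.put("jStation", "DVD#false"); // '#' must be percent-encoded
        System.out.println(toQueryString(params));
        // prints: opt=TrainRunning&subOpt=ShowRunC&jStation=DVD%23false
    }
}
```

One caveat: URLEncoder produces application/x-www-form-urlencoded encoding (spaces become '+'), which is what this endpoint appears to expect, but differs slightly from strict URI path encoding.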
I'm trying to connect and retrieve the page title from here. The code works fine if I remove everything after ".com" from the link. The following code does not work:
try {
    Document doc = Jsoup.connect("https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en")
            .data("query", "Java")
            .userAgent("Chrome")
            .cookie("auth", "token")
            .timeout(3000)
            .post();
    String title = doc.title();
    Log.d("hellomate", title);
}
catch (IOException e) {
    Log.d("hellomatee", e.toString());
}
If the code worked, the title returned should be "Sammamish Washington - Google News".
The error returned from the code is: "org.jsoup.HttpStatusException: HTTP error fetching URL. Status=405, URL=https://news.google.com/news/local/section/geo/Sammamish,%20WA%2098075,%20United%20States/Sammamish,%20Washington?ned=us&hl=en"
What does status 405 mean? Does Jsoup not allow the kind of url I used?
Thanks.
Status 405 is an HTTP status code that means "Method Not Allowed". You can find some documentation from Microsoft on it here. As @Andreas said, you can fix this by changing .post(); to .get();.
If you look at the jsoup docs under example, it shows you how you would probably want to structure your requests:
Jsoup.connect("http://en.wikipedia.org/").get();
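Once the request uses .get(), doc.title() simply reads the page's <title> element. It behaves the same whether the Document came over the network or from a string, so you can see it without any network call at all (the HTML string below is made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleDemo {
    public static void main(String[] args) {
        // Jsoup.parse builds a Document from any HTML string,
        // exactly like the Document returned by a fetch
        Document doc = Jsoup.parse(
                "<html><head><title>Sammamish Washington - Google News</title></head><body></body></html>");
        System.out.println(doc.title()); // prints: Sammamish Washington - Google News
    }
}
```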
I want to scrape the redirect page tumblr shows when you try to go to a tumblr page that doesn't exist. If I put the URL in the browser, I get to that redirect page. Jsoup, however, just gives back an "HTTP error fetching URL. Status=404" error. Any suggestions?
String userAgent = "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6";
Document doc = Jsoup.connect("http://www.faszokvagyunk.tumblr.com").userAgent(userAgent).followRedirects(true).get();
Thank you.
Your code seems to handle other types of redirects just fine. With tumblr, however, you get a 404 page, which produces a 404 status and hence the exception. There could be many reasons for this:
The redirect might not happen at all
Tumblr does the redirect in a strange way
Tumblr unnecessarily returns a 404, which causes the exception
Other possibilities
I don't know if this solution helps you, but you can instruct your Jsoup connection to ignore HTTP errors by chaining ignoreHttpErrors as follows (this at least allows you to inspect the HTTP errors):
Document doc = Jsoup.connect("http://oddhouredproductivity.tumblr.com/tagged/tips").userAgent(userAgent).followRedirects(true).ignoreHttpErrors(true).get();
ignoreHttpErrors instructs the connection not to throw an exception when it comes across 404, 500, or other error status codes. From the Javadoc:
Connection ignoreHttpErrors(boolean ignoreHttpErrors)
Configures the connection to not throw exceptions when an HTTP error occurs (4xx-5xx, e.g. 404 or 500). By default this is false; an IOException is thrown if an error is encountered. If set to true, the response is populated with the error body, and the status message will reflect the error.
Parameters: ignoreHttpErrors - false (default) if HTTP errors should be ignored.
Returns: this Connection, for chaining
If you set ignoreHttpErrors to true, you will get the Document even for an error status; if not, the connection throws an exception and you get no Document at all.
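Building on that, you can use .execute() instead of .get() to obtain a Connection.Response, check statusCode() first, and only parse on success. A sketch using the URL from the question (the user-agent string here is shortened):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StatusCheck {
    public static void main(String[] args) throws Exception {
        Connection.Response res = Jsoup.connect("http://www.faszokvagyunk.tumblr.com")
                .userAgent("Mozilla/5.0")
                .followRedirects(true)
                .ignoreHttpErrors(true)   // don't throw on 4xx/5xx
                .execute();
        System.out.println("Status: " + res.statusCode());
        if (res.statusCode() == 200) {
            Document doc = res.parse(); // only parse when the fetch succeeded
            System.out.println(doc.title());
        }
    }
}
```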
I also came across this site, which seems to demonstrate an actual tumblr redirect. You might want to use the URLs on that page for your tests, as they are proper tumblr redirects. If you look inside the retrieved document for that page, you will see a JavaScript redirect function that triggers after 3 seconds:
//redirect to new blog
setTimeout( redirectTumblr, 3000 );
function redirectTumblr() {
location.replace('http://oddhour.tumblr.com' + location.pathname);
}
When I connect to the URL that you gave in your question, I see a 404 page, and the content of the 404 page returned in the Document contains no sign of a redirect (unlike the other page).
I have for example:
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
Document doc = Jsoup.connect("http://example.com")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
Connection.Response response1 = Jsoup.connect("http://example.com")
        .method(Connection.Method.GET)
        .data("name", "peksak")
        .execute();
How can I check, for each of these three examples, how many kilobytes were downloaded from the internet?
I use this in my Android application. For internet access I use Wi-Fi on my phone.
Well, you can simply get the document's HTML and check its length (note that this counts characters, which matches the byte count only for single-byte content such as ASCII):
Log.e("TAG", "Document size is " + doc.outerHtml().length());
With the third example:
Log.e("TAG", "Document size is " + response1.body().length());
Then you can do an HTTP HEAD request for each image and read its size from the response headers. Just grab all the images on the page using Jsoup, iterate through them, and make the requests.
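One caveat worth spelling out: String.length() counts UTF-16 code units, not bytes, so it equals the download size only for plain-ASCII pages. A small stdlib-only sketch of the difference (the sample string is made up):

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteCount {
    public static void main(String[] args) {
        String body = "naïve café";          // stand-in for doc.outerHtml()
        int chars = body.length();           // UTF-16 code units: 10
        int bytes = body.getBytes(StandardCharsets.UTF_8).length; // UTF-8 bytes: 12
        System.out.println(chars + " chars, " + bytes + " bytes");
    }
}
```

For the bytes Jsoup actually received in the third example, Connection.Response also offers bodyAsBytes(), so response1.bodyAsBytes().length avoids the character/byte mismatch entirely.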
This happens specifically with the website amazon.com: I am receiving a 503 error for their domain, but I can successfully parse other domains.
I am using the line
Document doc = Jsoup.connect(url).timeout(30000).get();
to connect to the URL.
You have to set a User Agent:
Document doc = Jsoup.connect(url).timeout(30000).userAgent("Mozilla/17.0").get();
(Or another one; it's best to choose a real browser user agent.)
Otherwise you'll get blocked.
Please see also: Jsoup: select(div[class=rslt prod]) returns null when it shouldn't
You can try:
val ret = Jsoup.connect(url)
    .userAgent("Mozilla/5.0 Chrome/26.0.1410.64 Safari/537.31")
    .timeout(2 * 1000)
    .followRedirects(true)
    .maxBodySize(1024 * 1024 * 3) // 3 MB max
    //.ignoreContentType(true) // for downloading xml, json, etc.
    .get()
It may work; perhaps amazon.com needs followRedirects set to true.