jsoup throws 204 status despite a status code check - java

I am connecting to a URL through jsoup. Here is a snippet of my code:
for (int j = 0; j < unq_urls.size(); j++) {
    Response response2 = Jsoup.connect(unq_urls.get(j))
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(100 * 1000)
            .ignoreContentType(true)
            .execute();
    if (response2.statusCode() == 200) {
        ...
    }
}
When the connection is executed, jsoup throws the following error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=204, URL=https://www.google.com/gen_204?reason=EmptyURL
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:459)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:475)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:475)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
at cseapiandparsing.CSE_Author_Name_Dis.<init>(CSE_Author_Name_Dis.java:187)
at cseapiandparsing.CSE_Author_Name_Dis.main(CSE_Author_Name_Dis.java:263)
How can I overcome this? I want jsoup to move on to the next URL if it cannot connect to the current one. Related to this, jsoup also throws a timeout error when connecting to a URL takes too long. To that end I have already set the .timeout(100*1000) option. However, I was wondering: is there a way of moving on to the next URL if the attempt for the current one takes too long?
Thanks in advance.

I believe you are looking for a try-catch mechanism here.
Surround the Jsoup.connect part with a try clause, then handle the exceptions gracefully in your catch clause, which in your case means continuing to the next loop iteration.
To skip the current URL if it takes too long, simply set the timeout() value to your desired waiting period; if the connection exceeds that period, a timeout exception is thrown, which again will be caught by the catch clause.
Try the code I posted below:
for (int j = 0; j < unq_urls.size(); j++) {
    try {
        Response response2 = Jsoup.connect(unq_urls.get(j))
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(100 * 1000)
                .ignoreContentType(true)
                .execute();
        if (response2.statusCode() == 200) {
            // process the response here
        }
    } catch (IOException e) { // HttpStatusException and SocketTimeoutException are both IOExceptions
        continue; // skip to the next URL if an exception occurs
    }
}
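Alternatively, jsoup's ignoreHttpErrors(true) setting (the same one used in an answer further down on this page) makes execute() return a Response for non-2xx statuses instead of throwing, so you can branch on the status code yourself. A minimal sketch of that variant:
for (int j = 0; j < unq_urls.size(); j++) {
    try {
        Response response2 = Jsoup.connect(unq_urls.get(j))
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(100 * 1000)
                .ignoreContentType(true)
                .ignoreHttpErrors(true) // a 204 no longer raises HttpStatusException
                .execute();
        if (response2.statusCode() == 200) {
            // process the response; any other status simply falls through to the next URL
        }
    } catch (IOException e) { // timeouts and network failures still throw
        continue;
    }
}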

Related

Java 403 Exception When My Bot Tries To Send An Embedded Link

I have a Discord bot that I've made in Java, and one of its purposes is to send an embedded link (I don't own the site) every time someone leaves the server. It worked the first 2-3 times, and every time after that I get the following exception:
java.io.IOException: Server returned HTTP response code: 403 for URL: ...
Example link:
https://signature.hzgaming.net/sig.php?name=Juntao_Lubu&style=1
I tried numerous solutions I've found online (with User-Agents and all that fancy stuff), but none of them seem to work for me.
Is there any other workaround for this?
Code:
String link = "https://signature.hzgaming.net/sig.php?name=" + allMembers.get(mEvent.getUser().getDiscriminatedName()).replace(" ", "_") + "&style=1";
URLConnection urlCon = new URL(link).openConnection();
urlCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.29 Safari/537.36");
InputStream is = urlCon.getInputStream();
StringBuilder textBuilder = new StringBuilder();
Reader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8));
int c;
while ((c = reader.read()) != -1) {
    textBuilder.append((char) c);
}
String result = textBuilder.toString().replaceAll("<[^>]*>", "");
if (!result.equalsIgnoreCase("Non-Existant Player") && !result.equalsIgnoreCase("Non-ExistantPlayer")) {
    new MessageBuilder().append(link).send((TextChannel) server.getChannelById(973242211623895080L).get());
}
Thanks in advance.

GET Request to a url (to accept terms and conditions) returns the same page as the response body but works in POSTMAN

There is a page I need to access after accepting the Terms and Conditions in my crawler. However, even after using the url in the source code:
'/auth/submitterms.do?randomNum=' + randomNum
the response body returned is the same terms and conditions page.
When I do the same in Postman, it works fine and takes me to the next page.
randomNum is obtained using regex from the response body. I used the CookieHandler API to handle the session.
Code Snippet:
GET request to terms and conditions page:
connection = new URL(URL_NEXIS + "auth/ipmdelegate.do").openConnection();
res = connection.getInputStream();
try (Scanner scanner = new Scanner(res)) {
    response = scanner.useDelimiter("\\A").next();
    System.out.println("Terms Page: \n" + response);
}
Regex used to obtain the randomNum from the response:
Pattern pattern = Pattern.compile("randomNum=[0-9].[0-9]*");
Matcher matcher = pattern.matcher(response);
if (matcher.find()) {
    System.out.println(matcher.group());
    randomNum = matcher.group().split("=")[1];
    System.out.println(randomNum);
} else {
    throw new IOException("Error: Could not accept terms and conditions.");
}
GET request to URL which 'accepts' the terms.
System.out.println(URL_NEXIS + "auth/submitterms.do?randomNum=" + randomNum);
redirect_page = URL_NEXIS + "auth/submitterms.do?randomNum=" + randomNum;
connection = new URL(redirect_page).openConnection();
connection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
connection.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
connection.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
connection.setRequestProperty("Connection","keep-alive");
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
res = connection.getInputStream();
How can this be achieved and why is the same request behaving differently with Postman?
EDIT:
I used connection.setInstanceFollowRedirects(false) (with the connection cast to HttpURLConnection) and was able to obtain a 302 response code. I can now see the Location header and the Set-Cookie variable. However, I think the CookieHandler API already handles the session.
Now, when I try to send a GET request to the new URL from the Location header, I get taken back to the terms and conditions page.
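If the session cookie from that Set-Cookie header is not being replayed on the follow-up request, one thing worth checking is that a default CookieManager is installed before the very first request, so HttpURLConnection stores and resends the session cookie automatically. A minimal sketch, reusing the URL_NEXIS constant from the question:
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

// Install once, before the first request, so every subsequent
// HttpURLConnection shares the same cookie store (session id included).
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

// Fetch the terms page; the session cookie is now stored.
HttpURLConnection conn = (HttpURLConnection) new URL(URL_NEXIS + "auth/ipmdelegate.do").openConnection();
conn.getInputStream().close();

// The accept-terms request (and any manual redirect you follow from its
// Location header) now carries the same session cookie automatically.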

Bing Search with Jsoup - how can I avoid captcha?

keywordexist = false;
try {
    res = Jsoup.connect(bingSearchUrl.replaceAll("keyword", "intitle:\"" + keyword + "\""))
            .userAgent("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.15 (KHTML, like Gecko) Chrome/24.0.1295.0 Safari/537.15")
            .referrer("http://www.bing.com")
            .method(Connection.Method.GET)
            .execute();
    doc = res.parse();
    System.out.println(bingSearchUrl.replaceAll("keyword", "intitle:\"" + keyword + "\""));
    elements = doc.select("li[class^=b_algo]");
    System.out.println(doc.html());
    System.out.println(elements.html());
    if (elements.html().contains("<strong>" + keyword + "</strong>")) {
        keywordexist = true;
        System.out.println("keyword exists");
    }
} catch (IOException e) {
    e.printStackTrace();
}
I'm trying to use jsoup to check a list of keywords I have against Bing Search, but whenever I run my program, jsoup always connects to Bing's captcha page. Is there any way I can avoid this? I thought this would be remedied by adding a user agent and referrer, but it doesn't seem to have any effect.
I used code similar to yours and got all the results. However, here are two points I noticed:
I think you should slow down between two searches. For example, add a random pause from 3000 to 5000 ms (see the sketch after the sample code).
Don't forget to escape the query parameters.
SAMPLE CODE
String bingSearchUrl = "http://www.bing.com/search?q=keyword";
String keyword = "stackoverflow jsoup";
String uaString = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.15 (KHTML, like Gecko) Chrome/24.0.1295.0 Safari/537.15";
String url = bingSearchUrl.replaceAll("keyword", URLEncoder.encode("intitle:\"" + keyword + "\"", "UTF-8"));
Document doc = Jsoup.connect(url).userAgent(uaString).get();
System.out.println(doc.select("li h2"));
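For the first point, here is a minimal sketch of such a pause between two searches; the 3000-5000 ms range is the one suggested above:
import java.util.concurrent.ThreadLocalRandom;

// Sleep for a random 3000-5000 ms between two searches so the
// request pattern looks less like an automated scraper.
long pauseMs = ThreadLocalRandom.current().nextLong(3000, 5001);
try {
    Thread.sleep(pauseMs);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
}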

Jsoup malformed url

I'm having trouble connecting to a URL with Jsoup.
The URL I am trying to test is www.xbox.com/en-US/security, which is a 302 (I think) redirect to http://www.xbox.com/en-US/Live/Account-Security. I have set up jsoup to follow the redirect and get the new URL using .header("location"). The URL returned is /en-US/Live/Account-Security. I'm not sure how to handle it; my code is below:
while (i < retries) {
    try {
        response = Jsoup.connect(checkUrl)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .followRedirects(false)
                .timeout(10000)
                .execute();
        success = true;
        break;
    } catch (SocketTimeoutException ex) {
        timeout = true;
    } catch (MalformedURLException ep) {
        malformedUrl = true;
    } catch (IOException e) {
        statusCode = 404;
    }
}

private void getStatus() {
    if (success) {
        statusCode = response.statusCode();
        success = false;
    }
    if (statusCode >= 300 && statusCode <= 399) {
        //System.out.println("redirect: " + statusCode + " " + checkUrl);
        checkUrl = response.header("location");
        //System.out.println(checkUrl);
        connect();
        getStatus();
    }
}
Has anyone got suggestions on how to handle this? Or should I check whether checkUrl = response.header("location"); gives a valid URL, and skip the test if it doesn't?
First things first: if you try to access "www.xbox.com/en-US/security" without a protocol prefix, it will throw a MalformedURLException and thus never redirect you to where you want.
Then there's the issue that I'd use only the boolean variable success, and set it to false whenever any exception is caught. Then again, I don't know whether you're using the timeout or malformedUrl variables for anything.
After that, I'd say that the line right after IOException is never useful. Again I can't tell, since I can't see the full code.
Now, to your question: the returned string is a path relative to the site root of the first URL you provided. It goes simply like this:
// Assuming you won't ever change it, make it a final variable.
// Note: a Location value starting with "/" is relative to the site
// root, so the base must be scheme + host only.
final String BASE_URL = "http://www.xbox.com";

// Whatever piece of processing here.

// Some tests just to make sure you'll get what you're fetching:
String newUrl = "";
if (checkUrl.startsWith("http://")) {
    newUrl = checkUrl;
} else if (checkUrl.startsWith("www")) {
    newUrl = "http://" + checkUrl;
} else if (checkUrl.startsWith("/")) {
    newUrl = BASE_URL + checkUrl;
}
This piece of code will basically make sure you can navigate through URLs without getting a MalformedURLException. I'd also suggest adding a manageUrl() method somewhere to test whether the fetched URL is within the domain you're searching, or else you might end up on e-commerce or advertising websites.
Hope it helps =)
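As an aside, java.net.URL can do this resolution for you: its two-argument constructor resolves a relative spec (such as the /en-US/Live/Account-Security value above) against a base URL. A minimal sketch, where resolveLocation is a hypothetical helper name:
import java.net.MalformedURLException;
import java.net.URL;

static String resolveLocation(String pageUrl, String location) throws MalformedURLException {
    // new URL(base, spec) resolves a relative spec against the base URL,
    // so "/en-US/Live/Account-Security" becomes an absolute address.
    return new URL(new URL(pageUrl), location).toString();
}

// resolveLocation("http://www.xbox.com/en-US/security", "/en-US/Live/Account-Security")
// -> "http://www.xbox.com/en-US/Live/Account-Security"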

Handling connection errors and JSoup

I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:
for (String locale : langList) {
    sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
    try {
        Document doc = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .get();
        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } catch (IOException e) {
        System.out.println(e);
    }
}
Everything works perfectly most of the time. However, there are a few things I want to be able to do.
First off, sometimes a 404 status will be returned, or a 500, maybe a 301. With my code above it will just print the error and move on to the next URL. What I would like to do is return the URL status for all links: if the page connects, print a 200; if not, print the relevant status code.
Secondly, I sometimes catch the error "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try to connect 3 times; upon failing the 3rd time, I want to add the URL to a "failed" array so I can retry the failed connections in the future.
Can someone with more knowledge than me help me out?
The above returned an IOException for me rather than execute() returning the correct status code.
Using Jsoup 1.6.1, I had to change the above code to use ignoreHttpErrors(true).
Now the code returns the response rather than throwing an exception, and you can check the error codes/messages.
Connection.Response response = null;
try {
    response = Jsoup.connect(bad_url)
            .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5")
            .timeout(100000)
            .ignoreHttpErrors(true)
            .execute();
} catch (IOException e) {
    System.out.println("io - " + e);
}
System.out.println("Status code = " + response.statusCode());
System.out.println("Status msg = " + response.statusMessage());
Output:
Status code = 404
Status msg = Not Found
For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle, like so:
Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(10000)
        .execute();
int statusCode = response.statusCode();
if (statusCode == 200) {
    Document doc = response.parse(); // parse the body already fetched by execute()
    Elements element = doc.select("loc");
    for (Element urls : element) {
        System.out.println(urls.text());
    }
} else {
    System.out.println("received error code : " + statusCode);
}
Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, etc., so you'll need to handle that. However, as long as the server said something that made sense, you'll be able to read the status code and continue. Also, if you've asked Jsoup to follow redirects, you won't see 30x response codes, because Jsoup will set the status code from the final page fetched.
As for your second question, all you need is a loop around the code sample I just gave, wrapped in a try/catch block for SocketTimeoutException. When you catch the exception, the loop should continue; if you're able to get data, then return or break. Shout if you need more help with it!
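A minimal sketch of that retry loop, under the assumption of a urlsToCheck list and a failedUrls collection (both hypothetical names) for the third-strike case:
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

List<String> failedUrls = new ArrayList<>(); // URLs to retry in a future run

for (String url : urlsToCheck) {
    for (int attempt = 1; attempt <= 3; attempt++) {
        try {
            Connection.Response response = Jsoup.connect(url)
                    .timeout(10000)
                    .ignoreHttpErrors(true) // report 404/500 etc. instead of throwing
                    .execute();
            System.out.println(url + " -> " + response.statusCode());
            break; // success: stop retrying this URL
        } catch (SocketTimeoutException e) {
            if (attempt == 3) {
                failedUrls.add(url); // third strike: park it for later
            }
        } catch (IOException e) {
            System.out.println(url + " -> " + e);
            break; // non-timeout failure: don't retry
        }
    }
}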
