I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:
for (String locale : langList) {
    sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
    try {
        Document doc = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .get();
        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } catch (IOException e) {
        System.out.println(e);
    }
}
Everything works perfectly most of the time. However, there are a few things I want to be able to do.
First, sometimes the request returns a 404 or 500 status, or maybe a 301. With the code above it just prints the error and moves on to the next URL. What I would like is to return the status for every link: if the page connects, print 200; if not, print the relevant status code.
Secondly, I sometimes catch the error "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try to connect three times; if the third attempt fails, I want to add the URL to a "failed" array so I can retry those connections in the future.
Can someone with more knowledge than me help me out?
The approach above throws an IOException for me rather than having execute() return the correct status code.
Using Jsoup 1.6.1, I had to change the code to use ignoreHttpErrors(true).
Now the code returns the response instead of throwing an exception, and you can check the error codes/messages:
Connection.Response response = null;
try {
    response = Jsoup.connect(bad_url)
            .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5")
            .timeout(100000)
            .ignoreHttpErrors(true)
            .execute();
} catch (IOException e) {
    System.out.println("io - " + e);
}

if (response != null) { // null only if the connection itself failed
    System.out.println("Status code = " + response.statusCode());
    System.out.println("Status msg = " + response.statusMessage());
}
Output:
Status code = 404
Status msg = Not Found
For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle like so:
Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(10000)
        .execute();

int statusCode = response.statusCode();
if (statusCode == 200) {
    Document doc = response.parse(); // parse the body we already fetched
    Elements element = doc.select("loc");
    for (Element urls : element) {
        System.out.println(urls.text());
    }
} else {
    System.out.println("received error code : " + statusCode);
}
Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, etc., so you'll need to handle that. However, as long as the server said something that made sense, you'll be able to read the status code and continue. Also, if you've asked Jsoup to follow redirects, you won't see 30x response codes, because Jsoup will set the status code from the final page fetched.
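If you do want to see the 30x codes yourself, one option is to turn redirect following off before calling execute(). A minimal sketch, again assuming the sitemapPath variable from your question:
Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .followRedirects(false)   // report the 30x itself instead of the final page
        .timeout(10000)
        .execute();
int statusCode = response.statusCode();          // e.g. 301 or 302
String location = response.header("location");   // redirect target, if any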
As for your second question, all you need is a loop around the code sample above, wrapped in a try/catch for SocketTimeoutException. When you catch the exception, the loop should continue; if you're able to get data, return or break. Shout if you need more help with it!
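For illustration, here is one minimal sketch of that retry idea. It assumes the sitemapPath variable from your question and a List<String> named failedUrls (the name is just a placeholder) that you keep for a later pass:
List<String> failedUrls = new ArrayList<>();
int maxRetries = 3;
boolean fetched = false;

for (int attempt = 1; attempt <= maxRetries && !fetched; attempt++) {
    try {
        Connection.Response response = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
        System.out.println(sitemapPath + " -> " + response.statusCode());
        fetched = true;
    } catch (SocketTimeoutException e) {
        // timed out; fall through and try again
    } catch (IOException e) {
        System.out.println(e);
        break; // not a timeout, so retrying is unlikely to help
    }
}

if (!fetched) {
    failedUrls.add(sitemapPath); // retry these in a later run
}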
Related
There is a page I need to access after accepting the Terms and Conditions in my crawler. However, even after requesting the URL found in the page source:
'/auth/submitterms.do?randomNum=' + randomNum
the response body returned is the same terms and conditions page.
When I do the same in Postman it works fine and takes me to the next page.
randomNum is obtained with a regex from the response body. I used the CookieHandler API to handle the session.
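For reference, that session handling usually amounts to installing a default cookie manager once, before any request is made. A minimal sketch (the ACCEPT_ALL policy here is an assumption, not necessarily the exact configuration used):
// Install a process-wide cookie store so Set-Cookie headers from one
// HttpURLConnection response are replayed on the following requests.
CookieManager cookieManager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
CookieHandler.setDefault(cookieManager);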
Code Snippet:
GET request to terms and conditions page:
connection = new URL(URL_NEXIS + "auth/ipmdelegate.do").openConnection();
res = connection.getInputStream();

try (Scanner scanner = new Scanner(res)) {
    response = scanner.useDelimiter("\\A").next();
    System.out.println("Terms Page: \n" + response);
}
Regex used to obtain the randomNum from the response:
Pattern pattern = Pattern.compile("randomNum=[0-9].[0-9]*");
Matcher matcher = pattern.matcher(response);

if (matcher.find()) {
    System.out.println(matcher.group());
    randomNum = matcher.group().split("=")[1];
    System.out.println(randomNum.toString());
} else {
    throw new IOException("Error: Could not accept terms and conditions.");
}
GET request to URL which 'accepts' the terms.
System.out.println(URL_NEXIS + "auth/submitterms.do?randomNum=" + randomNum);
redirect_page = URL_NEXIS + "auth/submitterms.do?randomNum=" + randomNum;
connection = new URL(redirect_page).openConnection();
connection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
connection.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
connection.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
connection.setRequestProperty("Connection","keep-alive");
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
res = connection.getInputStream();
How can this be achieved and why is the same request behaving differently with Postman?
EDIT:
I used connection.setInstanceFollowRedirects(false); and was able to obtain a 302 response code. I can now see the Location header and the Set-Cookie header. However, I think the CookieHandler API already handles the session.
Now when I try to send a GET request to the new URL from the Location header, I get taken back to the terms and conditions page.
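For what it's worth, here is a hedged sketch of following that 302 by hand, resolving the Location header against the original URL and relying on the installed cookie manager to replay the session cookie (connection is the variable from the snippets above, cast to HttpURLConnection):
HttpURLConnection http = (HttpURLConnection) connection;
String location = http.getHeaderField("Location");

// The Location value may be relative, so resolve it against the URL we just hit.
URL nextUrl = new URL(http.getURL(), location);
HttpURLConnection follow = (HttpURLConnection) nextUrl.openConnection();
follow.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");

// If CookieHandler.setDefault(...) was called earlier, the session cookie from
// the 302 response should be attached to this request automatically.
try (Scanner scanner = new Scanner(follow.getInputStream())) {
    System.out.println(scanner.useDelimiter("\\A").next());
}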
I am getting an error while connecting to a URL through Jsoup. Here is a snippet of my code:
for (int j = 0; j < unq_urls.size(); j++) {
    Response response2 = Jsoup.connect(unq_urls.get(j))
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(100*1000)
            .ignoreContentType(true)
            .execute();
    if (response2.statusCode() == 200) {
        ...
    }
}
When the connection is executed, Jsoup throws the following error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=204, URL=https://www.google.com/gen_204?reason=EmptyURL
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:459)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:475)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:475)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
    at cseapiandparsing.CSE_Author_Name_Dis.<init>(CSE_Author_Name_Dis.java:187)
    at cseapiandparsing.CSE_Author_Name_Dis.main(CSE_Author_Name_Dis.java:263)
How can I overcome this? I want Jsoup to move on to the next URL if it cannot connect to a specific one. Related to this, Jsoup also throws a timeout error when it takes too long to connect to a URL. I have already set the .timeout(100*1000) option, but is there a way to move on to another URL if the attempt for the current one takes too long?
Thanks in advance.
I believe you are looking for a try-catch mechanism here.
Surround the Jsoup.connect part with a try clause, then handle the exceptions gracefully in your catch clause, which in your case means continuing to the next iteration of the loop.
To skip the current URL if it takes too long, set the timeout() value to your desired waiting period; if that period is exceeded, a timeout exception is thrown, which again will be caught by the catch clause.
Try the code I posted below:
for (int j = 0; j < unq_urls.size(); j++) {
    try {
        Response response2 = Jsoup.connect(unq_urls.get(j))
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(100*1000)
                .ignoreContentType(true)
                .execute();
    } catch (Exception e) {
        continue; // continue to the next loop iteration if an exception occurs
    }
}
I want to update the web UI of a Rails app via an AJAX request triggered from Java.
However, it doesn't work, even though I see a 200 OK message.
These are the steps I took.
Make Ajax request in Java application (using HttpURLConnection object)
- X-Requested-With: XMLHttpRequest
String url = "http://127.0.0.1:3000/java/fromjava";
String charset = "UTF-8";
String query = String.format("param1=%s&param2=%s", param1, param2);

try {
    HttpURLConnection urlConnection = (HttpURLConnection) new URL(url).openConnection();
    urlConnection.setDoOutput(true);
    urlConnection.setUseCaches(true);
    urlConnection.setRequestMethod("POST");
    urlConnection.setRequestProperty("Accept-Encoding", "gzip,deflate,sdch");
    urlConnection.setRequestProperty("Accept-Language", "ko-KR,ko;q=0.8,en-US;q=0.6,en;q=0.4");
    urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
    urlConnection.setRequestProperty("Accept", "*/*");
    urlConnection.setRequestProperty("X-Requested-With", "XMLHttpRequest");
    urlConnection.setRequestProperty("Referer", "http://192.168.43.79:3000/");
    urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    urlConnection.connect();

    OutputStreamWriter writer = null;
    writer = new OutputStreamWriter(urlConnection.getOutputStream(), charset);
    writer.write(query);
    writer.flush();
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException logOrIgnore) {
        }
    }
Respond to JS in the controller
- Set up 'rack-cors' to resolve the same-origin policy
class JavaController < ApplicationController
  skip_before_filter :verify_authenticity_token

  def fromjava
    respond_to do |format|
      format.js
    end
  end
end
Change the web UI
[fromjava.js.erb]
$('#container').empty().append('This is [ajax] text');
[java.html.erb]
<div id="container">
This is default text
</div>
Log message in log/development.log
Started POST "/java/fromjava" for 127.0.0.1 at [current date]
Processing by JavaController#fromjava as */*
Rendered java/fromjava.js.erb (0.0ms)
Complete 200 OK in 43ms (View: 39.0ms | ActiveRecord: 0.0ms)
I see the 200 OK message in the log, but the web UI is not updated. What is the problem?
Also, is there another way to track the progress of the request and response?
First, you need to make sure you are including .js in your request URL in order to get the response rendered from fromjava.js.erb. The URL in your Java code should be
String url = "http://127.0.0.1:3000/java/fromjava.js";
Second, as far as I can see, fromjava.js.erb just contains jQuery code. That's fine, but you need to make sure the jQuery code is actually executed once the response is received on the client side; at the moment you are just sending the response, which is code, but nothing executes it on the client's machine.
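As a quick check of both points, you can read back whatever the Rails action returns in the Java client. A small sketch, reusing the urlConnection and charset variables from your snippet, that just prints the body; for a format.js response this is plain JavaScript text and will not execute anywhere on its own:
// Read the response so you can see the rendered fromjava.js.erb content.
System.out.println("HTTP status: " + urlConnection.getResponseCode());
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(urlConnection.getInputStream(), charset))) {
    String responseLine;
    while ((responseLine = reader.readLine()) != null) {
        System.out.println(responseLine); // the jQuery code, as plain text
    }
}
// Note: if the server honours the Accept-Encoding header sent above, the stream
// may be gzip-compressed and would need to be wrapped in a GZIPInputStream first.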
I am trying to retrieve the HTML of a Google search query result using Java. That is, if I do a search in Google.com for a particular phrase, I would like to retrieve the HTML of the resulting web page (the page containing the links to possible matches along with their descriptions, URLs, etc.).
I tried doing this using the following code that I found in a related post:
import java.io.*;
import java.net.*;
import java.util.*;

public class Main {

    public static void main(String args[]) {
        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;

        try {
            url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            is = url.openStream(); // throws an IOException
            dis = new DataInputStream(new BufferedInputStream(is));

            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}
From: How do you Programmatically Download a Webpage in Java
The URL used in this code was obtained by doing a Google search query from the Google homepage. For some reason I do not understand, if I write the phrase that I want to search for in the URL bar of my web browser and then use the URL of the resulting search result page in the code I get a 403 error.
This code, however, did not return the html of the search query result page. Instead, it returned the source code of the Google homepage.
After doing further research I noticed that if you view the source code of a Google search query result (by right clicking on the background of the search result page and selecting “View page source”) and compare it with the source code of the Google homepage, they are both identical.
If instead of viewing the source code of the search result page I save the html of the search result page (by pressing ctrl+s), I can get the html that I am looking for.
Is there a way to retrieve the html of the search result page using Java?
Thank you!
Rather than parsing the resulting HTML page from a standard google search, perhaps you would be better off looking at the official Custom Search api to return results from Google in a more usable format. The API is definitely the way to go; otherwise your code could simply break if Google were to change some features of the google.com front-end's html. The API is designed to be used by developers and your code would be far less fragile.
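For illustration only, a call to the Custom Search JSON API looks roughly like this; YOUR_API_KEY and YOUR_SEARCH_ENGINE_ID are placeholders you get from the API console, and parsing the returned JSON is left out:
// The endpoint returns JSON describing the search results, not HTML.
String endpoint = "https://www.googleapis.com/customsearch/v1"
        + "?key=YOUR_API_KEY"
        + "&cx=YOUR_SEARCH_ENGINE_ID"
        + "&q=UCF";
URLConnection conn = new URL(endpoint).openConnection();
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line); // raw JSON result
    }
}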
To answer your question, though: We can't really help you just from the information you've provided. Your code seems to retrieve the html of stackoverflow; an exact copy-and-paste of the code from the question you linked to. Did you attempt to change the code at all? What URL are you actually trying to use to retrieve google search results?
I tried to run your code using url = new URL("http://www.google.com/search?q=test"); and I personally get an HTTP error 403 forbidden. A quick search of the problem says that this happens if I don't provide the User-Agent header in the web request, though that doesn't exactly help you if you're actually getting HTML returned. You will have to provide more information if you wish to receive specific help - though switching to the Custom Search API will likely solve your problem.
edit: new information provided in original question; can directly answer question now!
I figured out your problem after packet-capturing the web request that java was sending and applying some basic debugging... Let's take a look!
Here's the web request that Java was sending with your provided example URL:
GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Notice that the request seemed to ignore most of the URL... leaving just the "GET /". That is strange. I had to look this one up.
As per the documentation of the Java URL class (and this is standard for all web pages), A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL.
Let's take a look at your example URL...
https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951
Notice that "#" is the first character in the file path? Java is simply ignoring everything after the "#" because sharp signs are only used by the client / web browser - this leaves you with the URL https://www.google.com/. Hey, at least it was working as intended!
I can't tell you exactly what Google is doing, but the sharp-symbol URL definitely means that Google is returning results of the query through some client-side (AJAX / JavaScript) scripting. I'd be willing to bet that any query you send directly to the server (i.e. no "#" symbol) without the proper headers will return a 403 forbidden error - looks like they're encouraging you to use the API :)
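You can see this split directly with the URL class itself; a tiny sketch using a shortened version of the example URL:
URL u = new URL("https://www.google.com/#hl=en&output=search&q=UCF");
System.out.println(u.getPath());  // "/"   - the only thing sent to the server
System.out.println(u.getQuery()); // null  - there is no real query string
System.out.println(u.getRef());   // "hl=en&output=search&q=UCF" - the client-side fragment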
edit2: As per Tengji Zhang's answer to the question, here is working code that returns the result of the Google query for "test":
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;

try {
    url = new URL("https://www.google.com/search?q=test");
    c = url.openConnection();
    c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
    c.connect();
    is = c.getInputStream();
    dis = new DataInputStream(new BufferedInputStream(is));

    while ((line = dis.readLine()) != null) {
        System.out.println(line);
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
} finally {
    try {
        is.close();
    } catch (IOException ioe) {
        // nothing to see here
    }
}
I suggest you try http://seleniumhq.org/
There is a good tutorial on searching Google:
http://code.google.com/p/selenium/wiki/GettingStarted
You don't set the User-Agent in your code:
URLConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
Or you can read "http://www.google.com/robots.txt". This file tells you which URLs are allowed by the Google servers.
The code below works successfully.
package org.test.stackoverflow;

import java.io.*;
import java.net.*;
import java.util.*;

public class SearcherRetriver {

    public static void main(String args[]) {
        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;
        URLConnection c;

        try {
            url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            c = url.openConnection();
            c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
            c.connect();
            is = c.getInputStream();
            dis = new DataInputStream(new BufferedInputStream(is));

            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}
I'm having trouble connecting to a URL with Jsoup.
The URL I am trying to test is www.xbox.com/en-US/security, which is a 302 (I think) redirect to http://www.xbox.com/en-US/Live/Account-Security. I have set up Jsoup to detect the redirect and get the new URL using .header("location"). The URL returned is /en-US/Live/Account-Security, and I'm not sure how to handle it. My code is below:
while (i < retries) {
    try {
        response = Jsoup.connect(checkUrl)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .followRedirects(false)
                .timeout(10000)
                .execute();
        success = true;
        break;
    } catch (SocketTimeoutException ex) {
        timeout = true;
    } catch (MalformedURLException ep) {
        malformedUrl = true;
    } catch (IOException e) {
        statusCode = 404;
    }
}

private void getStatus() {
    if (success) {
        statusCode = response.statusCode();
        success = false;
    }
    if (statusCode >= 300 && statusCode <= 399) {
        //System.out.println("redirect: " + statusCode + " " + checkUrl);
        checkUrl = response.header("location");
        //System.out.println(checkUrl);
        connect();
        getStatus();
    }
}
Does anyone have suggestions on how to handle this? Or should I check whether the value from checkUrl = response.header("location"); is a valid URL and, if it isn't, skip testing it?
First things first: if you try to access "www.xbox.com/en-US/security" without a protocol, it will throw a MalformedURLException and thus never redirect you to where you want.
Then there's the issue that I would use only the boolean variable success and set it to false if any exception is caught; then again, I don't know whether you use the timeout or malformedUrl variables for anything else.
After that, I'd say the line right after the IOException catch is never useful; again, I can't tell for sure, since I can't see the full code.
Now, to your question: the returned string is a path relative to the domain of the first URL you provided. It goes simply like this:
//Assuming you won't ever change it, make it a final
//variable for less memory usage. Keep it to the scheme + host only,
//so that root-relative locations resolve correctly.
final String BASE = "http://www.xbox.com";

//Whatever piece of processing here

//Some tests just to make sure you'll get what you're
//fetching:
String newUrl = "";
if (checkUrl.startsWith("/"))
    newUrl = BASE + checkUrl;          // root-relative, e.g. "/en-US/Live/Account-Security"
else if (checkUrl.startsWith("http://"))
    newUrl = checkUrl;                 // already absolute
else if (checkUrl.startsWith("www"))
    newUrl = "http://" + checkUrl;     // absolute but missing the protocol
This piece of code will basically make sure you can navigate through URLs without getting a MalformedURLException. I'd suggest adding a manageUrl() method somewhere and testing whether the fetched URL is within the domain you're searching, or else you might end up on e-commerce or advertising websites.
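A slightly more robust alternative, just as a sketch, is to let java.net.URL do the resolution, since it already understands absolute, root-relative and protocol-relative locations:
// Resolve the Location header against the URL the response actually came from.
String location = response.header("location");
URL base = new URL(checkUrl);             // the URL you just requested
URL resolved = new URL(base, location);   // handles "/en-US/...", "http://...", "//host/..."
checkUrl = resolved.toString();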
Hope it helps =)