Jsoup malformed URL - Java

I'm having trouble connecting to a URL with Jsoup.
The URL I am trying to test is www.xbox.com/en-US/security, which is a 302 (I think) redirect to http://www.xbox.com/en-US/Live/Account-Security. I have set up Jsoup to handle the redirect myself and get the new URL using .header("location"). The URL returned is /en-US/Live/Account-Security. I'm not sure how to handle it; my code is below:
while (i < retries) {
    try {
        response = Jsoup.connect(checkUrl)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .followRedirects(false)
                .timeout(10000)
                .execute();
        success = true;
        break;
    } catch (SocketTimeoutException ex) {
        timeout = true;
    } catch (MalformedURLException ep) {
        malformedUrl = true;
    } catch (IOException e) {
        statusCode = 404;
    }
}
private void getStatus() {
    if (success) {
        statusCode = response.statusCode();
        success = false;
    }
    if (statusCode >= 300 && statusCode <= 399) {
        //System.out.println("redirect: " + statusCode + " " + checkUrl);
        checkUrl = response.header("location");
        //System.out.println(checkUrl);
        connect();
        getStatus();
    }
}
Has anyone got suggestions on how to handle this? Or should I check whether checkUrl = response.header("location"); gives a valid URL and, if not, skip testing it?

First things first: if you try to access "www.xbox.com/en-US/security" without a protocol, it'll throw a MalformedURLException and never get as far as the redirect.
Then there's the issue that I'd use only the boolean variable success and set it to false if any exception is caught. Then again, I don't know if you're using the timeout or malformedUrl variables for anything.
After that I'd say that the statusCode = 404 line inside the IOException catch is never useful. Again, I can't really tell, since I can't see the full code.
Now... to your question: the returned string is a path relative to the host of the first URL you provided. It'll go simply like this:
//Assuming you won't ever change it, make it a constant:
final String BASE = "http://www.xbox.com";
//Whatever piece of processing here
//Some tests just to make sure you'll get what you're fetching:
String newUrl = "";
if (checkUrl.startsWith("/"))
    newUrl = BASE + checkUrl;      //relative path: prepend scheme + host
if (checkUrl.startsWith("http://"))
    newUrl = checkUrl;             //already absolute
if (checkUrl.startsWith("www"))
    newUrl = "http://" + checkUrl; //missing scheme only
This piece of code will basically make sure you can navigate through URLs without getting a MalformedURLException. I'd suggest putting a manageUrl() method somewhere and testing whether the fetched URL is within the domain you're searching, or else you might end up on e-commerce or publicity websites.
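A less error-prone option is to let java.net.URL resolve the Location header against the URL you just requested. Here is a minimal sketch; the RedirectResolver class and resolve method are just illustrative names, not part of Jsoup:
import java.net.MalformedURLException;
import java.net.URL;

public class RedirectResolver {
    // Illustrative helper: resolves a possibly-relative Location header against the
    // URL that was just requested, so "/en-US/Live/Account-Security" becomes
    // "http://www.xbox.com/en-US/Live/Account-Security".
    public static String resolve(String requestedUrl, String location)
            throws MalformedURLException {
        URL base = new URL(requestedUrl);
        return new URL(base, location).toString(); // handles absolute and relative specs
    }
}
In your getStatus() that would be roughly checkUrl = RedirectResolver.resolve(checkUrl, response.header("location"));.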
Hope it helps =)

Related

How to check if the URL is there or not in Java?

I have run into an issue where I need to check a URL: when I search on a particular code, it won't show the URL that I have listed. Is there a way to make it work with my code? I tried to handle it by creating the generateLink method below. Thanks for the help.
First, since generateLink will either return a valid URL or null, you need to change this:
jgen.writeStringField("pay_grade_description_link", generateLink(XXX_URL + value.jobClassCd) + ".pdf");
to this:
jgen.writeStringField("pay_grade_description_link", generateLink(value.jobClassCd));
If you concatenate ".pdf" to it, null return values will be meaningless, since null + ".pdf" results in the eight-character string "null.pdf".
Second, you can check the response code of an HttpURLConnection to test a URL’s validity. (In theory, you should be able to use the "OPTIONS" HTTP method to test the URL, but not all servers support it.)
private String generateLink(String jobClassCd) {
    String url = XXX_URL + jobClassCd + ".pdf";
    try {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
        if (connection.getResponseCode() < 400) {
            return url;
        }
    } catch (IOException e) {
        Logger.getLogger(JobSerializer.class.getName()).log(Level.FINE,
                "URL \"" + url + "\" is not reachable.", e);
    }
    return null;
}
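If you would rather not pull down the PDF just to test it, a hedged variation on the method above is to switch the connection to a HEAD request before asking for the response code (some servers reject HEAD, so keep the GET version as a fallback):
HttpURLConnection connection =
        (HttpURLConnection) new URL(url).openConnection();
connection.setRequestMethod("HEAD"); // headers only; the PDF body is not transferred
if (connection.getResponseCode() < 400) {
    return url;
}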

Getting new URL if Moved Permanently

I am developing code for a project where one part checks that each URL (web site) in a list is live and confirms it.
So far everything is working as planned, except for some pages in the list that are Moved Permanently with status 301. In the case of a 301, I need to get the new URL and pass it to a method before returning true.
The following example just moves to https, but other cases could be moved to a different URL entirely. So if you call this site:
http://en.wikipedia.org/wiki/HTTP_301
it moves to
https://en.wikipedia.org/wiki/HTTP_301
Which is fine; I just need to get the new URL.
Is this possible, and how?
This is my working code part so far:
boolean isUrlOk(String urlInput) {
    HttpURLConnection connection = null;
    try {
        URL url = new URL(urlInput);
        connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.connect();
        urlStatusCode = connection.getResponseCode();
    } catch (IOException e) {
        // other error types to be reported
        e.printStackTrace();
    }
    if (urlStatusCode == 200) {
        return true;
    } else if (urlStatusCode == 301) {
        // call a method with the correct url name
        // before returning true
        return true;
    }
    return false;
}
You can get the new URL with
String newUrl = connection.getHeaderField("Location");
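For example, the 301 branch of your isUrlOk method could look roughly like this. The handleNewUrl call is a placeholder for whatever method needs the new address, and the Location header may be relative, so it is resolved against the original URL first:
if (urlStatusCode == 301) {
    String location = connection.getHeaderField("Location");
    try {
        // Location may be relative, so resolve it against the URL that was requested
        String newUrl = new URL(new URL(urlInput), location).toString();
        handleNewUrl(newUrl); // placeholder for your own method
    } catch (MalformedURLException e) {
        e.printStackTrace();
    }
    return true;
}
Note that HttpURLConnection follows same-protocol redirects on its own, so call connection.setInstanceFollowRedirects(false) before connecting if you want to see every 301/302 yourself; the http-to-https move in your Wikipedia example shows up as a 301 because redirects across protocols are never followed automatically.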

Java - Quickest way to check if URL exists

Hi I am writing a program that goes through many different URLs and just checks if they exist or not. I am basically checking if the error code returned is 404 or not. However as I am checking over 1000 URLs, I want to be able to do this very quickly. The following is my code, I was wondering how I can modify it to work quickly (if possible):
final URL url = new URL("http://www.example.com");
HttpURLConnection huc = (HttpURLConnection) url.openConnection();
int responseCode = huc.getResponseCode();
if (responseCode != 404) {
    System.out.println("GOOD");
} else {
    System.out.println("BAD");
}
Would it be quicker to use JSoup?
I am aware some sites give the code 200 and have their own error page, however I know the links that I am checking don't do this, so this is not needed.
Try sending a "HEAD" request instead of a GET request. That should be faster, since the response body is not downloaded.
huc.setRequestMethod("HEAD");
Also, instead of checking that the response status is not 404, check that it is 200. That is, check for the positive rather than the negative: 404, 403, 402... all 40x statuses are nearly equivalent to an invalid or non-existent URL.
You may make use of multi-threading to make it even faster.
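A rough sketch of combining HEAD requests with a thread pool (Java 8+); the 20-thread pool size, the 5-second timeouts, and the example URL list are arbitrary choices, not anything your code requires:
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UrlChecker {
    public static void main(String[] args) throws InterruptedException {
        List<String> urls = Arrays.asList("http://www.example.com"); // your 1000+ URLs here
        ExecutorService pool = Executors.newFixedThreadPool(20);     // arbitrary pool size
        for (String u : urls) {
            pool.submit(() -> {
                try {
                    HttpURLConnection huc = (HttpURLConnection) new URL(u).openConnection();
                    huc.setRequestMethod("HEAD"); // headers only, no body download
                    huc.setConnectTimeout(5000);
                    huc.setReadTimeout(5000);
                    System.out.println(u + " -> " + (huc.getResponseCode() == 200 ? "GOOD" : "BAD"));
                } catch (Exception e) {
                    System.out.println(u + " -> BAD (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}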
Try asking the DNS server first:
import java.net.InetAddress;
import java.net.UnknownHostException;

class DNSLookup
{
    public static void main(String args[])
    {
        String host = "stackoverflow.com";
        try
        {
            InetAddress inetAddress = InetAddress.getByName(host);
            // show the Internet Address as name/address
            System.out.println(inetAddress.getHostName() + " " + inetAddress.getHostAddress());
        }
        catch (UnknownHostException exception)
        {
            // no DNS record for the host, so the URL cannot exist
            System.err.println("ERROR: No DNS record for '" + host + "'");
            exception.printStackTrace();
        }
    }
}
It seems you can set the timeout property; make sure it is acceptable. And if you have many URLs to test, check them in parallel, it will be much faster. Hope this will be helpful.
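For the timeouts, the question's huc connection can be configured before calling getResponseCode(); the 3-second values here are just an example:
huc.setConnectTimeout(3000); // give up if the TCP connection takes longer than 3 s
huc.setReadTimeout(3000);    // give up if the server stops responding for 3 s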

How can I Retrieve the HTML of a Search Engine Query Result?

I am trying to retrieve the HTML of a Google search query result using Java. That is, if I do a search on Google.com for a particular phrase, I would like to retrieve the HTML of the resulting web page (the page containing the links to possible matches along with their descriptions, URLs, etc.).
I tried doing this using the following code that I found in a related post:
import java.io.*;
import java.net.*;
import java.util.*;

public class Main {
    public static void main(String args[]) {
        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;
        try {
            url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            is = url.openStream(); // throws an IOException
            dis = new DataInputStream(new BufferedInputStream(is));
            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}
From: How do you Programmatically Download a Webpage in Java
The URL used in this code was obtained by doing a Google search query from the Google homepage. For some reason I do not understand, if I type the phrase I want to search for into the URL bar of my web browser and then use the URL of the resulting search result page in the code, I get a 403 error.
This code, however, did not return the html of the search query result page. Instead, it returned the source code of the Google homepage.
After doing further research I noticed that if you view the source code of a Google search query result (by right clicking on the background of the search result page and selecting “View page source”) and compare it with the source code of the Google homepage, they are both identical.
If instead of viewing the source code of the search result page I save the html of the search result page (by pressing ctrl+s), I can get the html that I am looking for.
Is there a way to retrieve the html of the search result page using Java?
Thank you!
Rather than parsing the resulting HTML page from a standard google search, perhaps you would be better off looking at the official Custom Search api to return results from Google in a more usable format. The API is definitely the way to go; otherwise your code could simply break if Google were to change some features of the google.com front-end's html. The API is designed to be used by developers and your code would be far less fragile.
To answer your question, though: We can't really help you just from the information you've provided. Your code seems to retrieve the html of stackoverflow; an exact copy-and-paste of the code from the question you linked to. Did you attempt to change the code at all? What URL are you actually trying to use to retrieve google search results?
I tried to run your code using url = new URL("http://www.google.com/search?q=test"); and I personally get an HTTP error 403 forbidden. A quick search of the problem says that this happens if I don't provide the User-Agent header in the web request, though that doesn't exactly help you if you're actually getting HTML returned. You will have to provide more information if you wish to receive specific help - though switching to the Custom Search API will likely solve your problem.
edit: new information provided in original question; can directly answer question now!
I figured out your problem after packet-capturing the web request that java was sending and applying some basic debugging... Let's take a look!
Here's the web request that Java was sending with your provided example URL:
GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Notice that the request seemed to ignore most of the URL... leaving just the "GET /". That is strange. I had to look this one up.
As per the documentation of the Java URL class (and this is standard for all web pages), A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL.
Let's take a look at your example URL...
https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951
notice that "#" is the first character in the file path? Java is simply ignoring everything after the "#" because sharp-signs are only used by the client / web browser - this leaves you with the url https://www.google.com/. Hey, at least it was working as intended!
I can't tell you exactly what Google is doing, but the sharp-symbol url definitely means that Google is returning results of the query through some client-side (ajax / javascript) scripting. I'd be willing to bet that any query you send directly to the server (i.e- no "#" symbol) without the proper headers will return a 403 forbidden error - looks like they're encouraging you to use the API :)
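You can check this split yourself with java.net.URL: everything after the "#" lands in the fragment ("ref") rather than in the path or query. A minimal demo, using a shortened version of the question's URL:
import java.net.URL;

public class FragmentDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.google.com/#hl=en&q=UCF");
        System.out.println("path:  " + url.getPath());  // prints "/"
        System.out.println("query: " + url.getQuery()); // prints "null"
        System.out.println("ref:   " + url.getRef());   // prints "hl=en&q=UCF"
    }
}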
edit2: As per Tengji Zhang's answer to the question, here is working code that returns the result of the Google query for "test":
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;
try {
    url = new URL("https://www.google.com/search?q=test");
    c = url.openConnection();
    c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
    c.connect();
    is = c.getInputStream();
    dis = new DataInputStream(new BufferedInputStream(is));
    while ((line = dis.readLine()) != null) {
        System.out.println(line);
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
} finally {
    try {
        is.close();
    } catch (IOException ioe) {
        // nothing to see here
    }
}
I suggest you try http://seleniumhq.org/
There is a good tutorial on searching in Google:
http://code.google.com/p/selenium/wiki/GettingStarted
You don't set the User-Agent in your code:
c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
Or you can read http://www.google.com/robots.txt. This file tells you which URLs the Google servers allow you to fetch.
The code below works:
package org.test.stackoverflow;

import java.io.*;
import java.net.*;
import java.util.*;

public class SearcherRetriver {
    public static void main(String args[]) {
        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;
        URLConnection c;
        try {
            url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            c = url.openConnection();
            c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
            c.connect();
            is = c.getInputStream();
            dis = new DataInputStream(new BufferedInputStream(is));
            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}

Handling connection errors and JSoup

I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:
for (String locale : langList) {
    sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
    try {
        Document doc = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .get();
        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } catch (IOException e) {
        System.out.println(e);
    }
}
Everything works perfectly most of the time. However there are a few things I want to be able to do.
First off, sometimes a 404 status is returned, or a 500, or maybe a 301. With my code above it will just print the error and move on to the next URL. What I would like is to return the URL status for all links: if the page connects, print a 200; if not, print the relevant status code.
Secondly, I sometimes catch the error "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try to connect 3 times and, upon failing the 3rd time, add the URL to a "failed" array so I can retry those connections in the future.
Can someone with more knowledge than me help me out?
The above returns an IOException for me rather than the execute() returning the correct status code.
Using JSoup-1.6.1 I had to change the above code to use ignoreHttpErrors(true).
Now the code returns the response rather than throwing an exception, and you can check the error codes/messages.
Connection.Response response = null;
try {
    response = Jsoup.connect(bad_url)
            .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5")
            .timeout(100000)
            .ignoreHttpErrors(true)
            .execute();
} catch (IOException e) {
    System.out.println("io - " + e);
}
System.out.println("Status code = " + response.statusCode());
System.out.println("Status msg = " + response.statusMessage());
Output:
Status code = 404
Status msg = Not Found
For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle like so:
Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(10000)
        .execute();

int statusCode = response.statusCode();
if (statusCode == 200) {
    Document doc = response.parse(); // parse the body that was already fetched
    Elements element = doc.select("loc");
    for (Element urls : element) {
        System.out.println(urls.text());
    }
} else {
    System.out.println("received error code : " + statusCode);
}
Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, etc., so you'll need to handle that. However, as long as the server said something that made sense, you'll be able to read the status code and continue. Also, if you've asked Jsoup to follow redirects, you won't be seeing 30x response codes, because Jsoup will set the status code from the final page fetched.
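If you do want to see the 30x codes themselves, a small sketch (exception handling omitted) is to turn redirect-following off and HTTP-error suppression on, using the question's sitemapPath:
Connection.Response response = Jsoup.connect(sitemapPath)
        .followRedirects(false)  // report the redirect instead of chasing it
        .ignoreHttpErrors(true)  // return 4xx/5xx responses instead of throwing
        .timeout(10000)
        .execute();
System.out.println(response.statusCode() + " -> " + response.header("location"));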
As for your second question, all you need is a loop around the code sample I just gave you, wrapped in a try/catch block for SocketTimeoutException. When you catch the exception, the loop should continue. If you're able to get data, then return or break. Shout if you need more help with it!
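A rough sketch of that retry logic, meant to sit inside the question's for (String locale : langList) loop; failedUrls and MAX_RETRIES are illustrative names, not Jsoup API:
final int MAX_RETRIES = 3;
List<String> failedUrls = new ArrayList<String>(); // declare once, outside the locale loop
int attempts = 0;
while (attempts < MAX_RETRIES) {
    try {
        Connection.Response response = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
        // success: process the response here, e.g. Document doc = response.parse();
        break;
    } catch (SocketTimeoutException e) {
        attempts++;
        if (attempts == MAX_RETRIES) {
            failedUrls.add(sitemapPath); // gave up after 3 timeouts; retry later
        }
    } catch (IOException e) {
        failedUrls.add(sitemapPath);     // other I/O failure: record it and move on
        break;
    }
}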
