Looking for an alternate way to validate URLs in Java

I'm using HttpURLConnection to validate URLs coming out of a database. With certain URLs I get an exception; I assume they are timing out, but they are in fact reachable (no 400-range error).
Increasing the timeout doesn't seem to matter; I still get an exception. Is there a second check I could do in the catch block to verify whether the URL is actually bad? The relevant code is below. It works with 99.9% of URLs; it's that remaining 0.1% I'm stuck on.
try {
    HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
    connection.setConnectTimeout(timeout);
    connection.setReadTimeout(timeout);
    connection.setRequestMethod("GET");
    connection.setRequestProperty("User-Agent",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");
    connection.connect();
    int responseCode = connection.getResponseCode();
    if (responseCode >= 401) {
        String prcMessage = "ERROR: URL " + url + " not found, response code was " + responseCode + "\r";
        System.out.println(prcMessage);
        VerifyUrl.writeToFile(prcMessage);
        return (false);
    }
}
catch (IOException exception) {
    String errorMessage = ("ERROR: URL " + url + " did not load in the given time of " + timeout + " milliseconds.");
    System.out.println(errorMessage);
    VerifyUrl.writeToFile(errorMessage);
    return false;
}

Depends on what you want to check, but I guess Validating URL in Java has you covered.
You have two possibilities:
Check the syntax ("Is this a well-formed URL or just made-up text?")
The rules for this are laid out in RFC 3986, and checks like this have been implemented already (Apache Commons Validator's UrlValidator, for instance).
Check the semantics ("Is the URL actually reachable?")
There is no real shortcut for this, though different tools are available for sending an HTTP request in Java. You can send a HEAD request instead of a GET: HEAD omits the response body, which may result in faster requests and fewer timeouts. Both checks are sketched below.
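Here is a minimal sketch of both checks, assuming a plain HttpURLConnection is acceptable (note that some servers reject HEAD requests, in which case you would have to fall back to GET):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlChecker {

    // Syntax check: URL(...).toURI() enforces RFC 3986 rules more
    // strictly than the URL constructor alone.
    public static boolean isWellFormed(String url) {
        try {
            new URL(url).toURI();
            return true;
        } catch (MalformedURLException | URISyntaxException e) {
            return false;
        }
    }

    // Semantic check: HEAD skips the response body, so it is usually
    // cheaper than a full GET.
    public static boolean isReachable(String url, int timeoutMillis) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("HEAD");
            con.setConnectTimeout(timeoutMillis);
            con.setReadTimeout(timeoutMillis);
            int code = con.getResponseCode();
            return code >= 200 && code < 400; // treat redirects as "reachable"
        } catch (IOException e) {
            return false;
        }
    }
}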

Related

Why does the server return Response Code 403 for a valid file in Java?

I want to get the Content-Length of this file with Java:
https://www.subf2m.co/subtitles/farsi_persian-text/SImp4fRrRnBK6j-u2RiPdXSsHSuGVCDLz4XZQLh05FnYmw92n7DZP6KqbHhwp6gfvrxazMManmskHql6va6XEfasUDxGevFRmkWJLjCzsCK50w1lwNajPoMGPTy9ebCC0&name=Q2FwdGFpbiBNYXJ2ZWwgRmFyc2lQZXJzaWFuIGhlYXJpbmcgaW1wYWlyZWQgc3VidGl0bGUgLSBTdWJmMm0gW3N1YmYybS5jb10uemlw
When I open this URL in Firefox or Google Chrome, it downloads a file, but when I try to read that file's size with Java's HttpsURLConnection, the server returns Response Code 403 and Content-Length -1. Why does this happen? Thanks
try {
    System.out.println("program started -----------------------------------------");
    String str_url = "https://www.subf2m.co/subtitles/farsi_persian-text/SImp4fRrRnBK6j-u2RiPdXSsHSuGVCDLz4XZQLh05FnYmw92n7DZP6KqbHhwp6gfvrxazMManmskHql6va6XEfasUDxGevFRmkWJLjCzsCK50w1lwNajPoMGPTy9ebCC0&name=Q2FwdGFpbiBNYXJ2ZWwgRmFyc2lQZXJzaWFuIGhlYXJpbmcgaW1wYWlyZWQgc3VidGl0bGUgLSBTdWJmMm0gW3N1YmYybS5jb10uemlw";
    URL url = new URL(str_url);
    HttpsURLConnection con = (HttpsURLConnection) url.openConnection();
    con.setConnectTimeout(150000);
    con.setReadTimeout(150000);
    con.setRequestMethod("HEAD");
    con.setInstanceFollowRedirects(false);
    con.setRequestProperty("Accept-Encoding", "identity");
    con.setRequestProperty("connection", "close");
    con.connect();
    System.out.println("responseCode: " + con.getResponseCode());
    System.out.println("contentLength: " + con.getContentLength());
} catch (IOException e) {
    System.out.println("error | " + e.toString());
    e.printStackTrace();
}
output:
program started -----------------------------------------
responseCode: 403
contentLength: -1
The default Java user-agent is blocked by some online services (most notably, Cloudflare). You need to set the User-Agent header to something else.
con.setRequestProperty("User-Agent", "My-User-Agent");
In my experience, it doesn't matter what you set it to, as long as it's not the default one:
con.setRequestProperty("User-Agent", "aaa"); // works perfectly fine
EDIT: It looks like this site uses Cloudflare with DDoS protection active; your code won't run the JavaScript challenge needed to actually get the content of the file.

Count number of redirects using Selenium Webdriver

I am using Selenium WebDriver to verify that a webpage or URL has a certain element present, for example a drop-down box.
However, I have seen that on landing on a certain page, it gets redirected to another, and there is one more redirection after that; only after landing on this final page can I verify whether the element is present.
My question is: if I open a URL using WebDriver, how do I count how many redirects happened before reaching the final URL?
A redirect could be implemented by a 301 or 302 response code, by a meta refresh, or by a JavaScript redirect.
I found the code below, but I'm not sure whether it handles counting all kinds of redirects.
HttpURLConnection con = (HttpURLConnection) new URL(myURL).openConnection();
con.setInstanceFollowRedirects(false);
con.connect();
int responseCode = con.getResponseCode();
System.out.println("Original URL " + myURL + " " + responseCode);
int numberHops = 0;
// 3xx means a redirect; anything else is the final destination
while (responseCode >= 300 && responseCode < 400) {
    // resolve the Location header against the current URL, in case it is relative
    URL newUrl = new URL(con.getURL(), con.getHeaderField("Location"));
    con = (HttpURLConnection) newUrl.openConnection();
    con.setInstanceFollowRedirects(false);
    responseCode = con.getResponseCode();
    numberHops++;
    System.out.println("location is " + newUrl);
    System.out.println("number of hops before reaching " + con.getURL() + " is " + numberHops);
    if (numberHops > 2)
        break;
}
Also, this is Java code. Is there a way to do this with WebDriver code and cover all three kinds of redirects? If not, how do I count the number of redirects using Java code?
Thanks

Java - Quickest way to check if URL exists

Hi, I am writing a program that goes through many different URLs and just checks whether they exist. Basically, I check whether the returned error code is 404. However, as I am checking over 1000 URLs, I want to do this very quickly. The following is my code; I was wondering how I can modify it to work quickly (if possible):
final URL url = new URL("http://www.example.com");
HttpURLConnection huc = (HttpURLConnection) url.openConnection();
int responseCode = huc.getResponseCode();
if (responseCode != 404) {
    System.out.println("GOOD");
} else {
    System.out.println("BAD");
}
Would it be quicker to use JSoup?
I am aware some sites return code 200 and display their own error page, but I know the links I am checking don't do this, so handling that is not needed.
Try sending a "HEAD" request instead of a GET request. That should be faster, since the response body is not downloaded.
huc.setRequestMethod("HEAD");
Also, instead of checking whether the response status is not 404, check whether it is 200; that is, check for the positive case instead of the negative one. 404, 403, 402 and the other 40x statuses are all roughly equivalent to an invalid, non-existent URL.
You may make use of multi-threading to make it even faster; a sketch follows.
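For example, here is a rough sketch combining HEAD requests with a fixed thread pool (the pool size and timeouts are placeholder values to tune for your setup):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BulkUrlChecker {

    // one cheap HEAD request per URL; anything but 200 counts as bad here
    static boolean exists(String url) {
        try {
            HttpURLConnection huc = (HttpURLConnection) new URL(url).openConnection();
            huc.setRequestMethod("HEAD");   // no response body is downloaded
            huc.setConnectTimeout(5000);    // placeholder timeout
            huc.setReadTimeout(5000);       // placeholder timeout
            return huc.getResponseCode() == 200;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = Arrays.asList("http://www.example.com"); // your 1000+ URLs here
        ExecutorService pool = Executors.newFixedThreadPool(20);     // placeholder pool size
        for (final String u : urls) {
            pool.submit(() -> System.out.println(u + " -> " + (exists(u) ? "GOOD" : "BAD")));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}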
Try asking the DNS server: if the host name does not resolve at all, the URL cannot be reachable, and the lookup is much cheaper than a full HTTP request.

import java.net.InetAddress;
import java.net.UnknownHostException;

class DNSLookup
{
    public static void main(String args[])
    {
        String host = "stackoverflow.com";
        try
        {
            InetAddress inetAddress = InetAddress.getByName(host);
            // show the Internet Address as name/address
            System.out.println(inetAddress.getHostName() + " " + inetAddress.getHostAddress());
        }
        catch (UnknownHostException exception)
        {
            // no DNS record, so the host cannot be reached
            System.err.println("ERROR: Cannot access '" + host + "'");
        }
    }
}
It seems you can set the timeout property; make sure it is acceptable. And if you have many URLs to test, check them in parallel (as sketched above); it will be much faster. Hope this helps.

Java HttpURLConnection returns 500 status code

I'm trying to GET a URL using HttpURLConnection; however, I always get a 500 code, yet when I access the same URL from the browser or with curl, it works fine!
This is the code
try {
    URL url = new URL("theurl");
    HttpURLConnection httpcon = (HttpURLConnection) url.openConnection();
    httpcon.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    httpcon.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:14.0) Gecko/20100101 Firefox/14.0.1");
    System.out.println(httpcon.getHeaderFields());
} catch (Exception e) {
    System.out.println("exception " + e);
}
When I print the header fields, I see the 500 code. When I change the URL to something else, like google.com, it works fine. But I don't understand why it fails here while working fine in the browser and with curl.
Any help would be highly appreciated..
Thank you,
This mostly happens because of encoding.
If the URL works in the browser but you get a 500 (Internal Server Error) in your program, it is because browsers carry highly sophisticated logic for charsets and content types.
Here is my code; it works with ISO8859_1 as the charset and English-language content.
public void sendPost(String Url, String params) throws Exception {
    String url = Url;
    URL obj = new URL(url);
    HttpsURLConnection con = (HttpsURLConnection) obj.openConnection();
    con.setRequestProperty("Accept-Charset", "en-us");
    con.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
    con.setRequestProperty("charset", "EN-US");
    con.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    String urlParameters = params;

    // send the POST request
    con.setDoOutput(true);
    con.setDoInput(true);
    con.connect();

    DataOutputStream wr = new DataOutputStream(con.getOutputStream());
    wr.writeBytes(urlParameters);
    wr.flush();
    wr.close();

    int responseCode = con.getResponseCode();
    System.out.println("\nSending 'POST' request to URL : " + url);
    System.out.println("Post parameters : " + urlParameters);
    System.out.println("Response Code : " + responseCode);

    BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    in.close();

    // print the result
    System.out.println(response.toString());
    this.response = response.toString();
    con.disconnect();
}
And in the main program, call it like this:
myclassname.sendPost("https://change.this2webaddress.desphilboy.com/websitealias/orwebpath/someaction", "paramname=" + URLEncoder.encode(urlparam, "ISO8859_1"))
A status code of 500 suggests that code on the web server has crashed. Use HttpURLConnection#getErrorStream() to get a better idea of the error (see HTTP status code 500); a sketch follows.
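A minimal sketch of reading the error stream, assuming the server writes a diagnostic body for the failed request (the dumpResponse name is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

static void dumpResponse(String url) throws IOException {
    HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
    int code = con.getResponseCode();
    // getInputStream() throws on 4xx/5xx; any diagnostic body the server
    // sent is available on the error stream instead
    InputStream body = (code >= 400) ? con.getErrorStream() : con.getInputStream();
    if (body != null) {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(body))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}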
I ran into the same problem of "URL works in the browser, but when I do an HTTP GET in Java I get a 500 error".
In my case the problem was that the regular HTTP GET ended up in an infinite redirect loop between /default.aspx and /login.aspx.
URL oUrl = new URL(url);
HttpURLConnection con = (HttpURLConnection) oUrl.openConnection();
con.setRequestMethod("GET");
...
int responseCode = con.getResponseCode();
What was happening was: the server served up a three-part cookie, and con.getResponseCode() only used one of the parts. The cookie data in the header looked like this:
header.key = null
value = HTTP/1.1 302 Found
...
header.key = Location
value = /default.aspx
header.key = Set-Cookie
value = WebCom-lbal=qxmgueUmKZvx8zjxPftC/bHT/g/rUrJXyOoX3YKnYJxEHwILnR13ojZmkkocFI7ZzU0aX9pVtJ93yNg=; path=/
value = USE_RESPONSIVE_GUI=1; expires=Wed, 17-Apr-2115 18:22:11 GMT; path=/
value = ASP.NET_SessionId=bf0bxkfawdwfr10ipmvviq3d; path=/; HttpOnly
...
So the server, receiving only a third of the data it needed, got confused: you're logged in! No wait, you have to log in. No, you're logged in, ...
To work around the infinite redirect loop I had to check for redirects manually and parse the headers for "Set-Cookie" entries myself.
con = (HttpURLConnection) oUrl.openConnection();
con.setRequestMethod("GET");
...
log.debug("Disable auto-redirect. We have to look at each redirect manually");
con.setInstanceFollowRedirects(false);
....
int responseCode = con.getResponseCode();
This code does the cookie parsing when the response code indicates a redirect:
private String getNewCookiesIfAny(String origCookies, HttpURLConnection con) {
    String result = null;
    String key;
    Set<Map.Entry<String, List<String>>> allHeaders = con.getHeaderFields().entrySet();
    for (Map.Entry<String, List<String>> header : allHeaders) {
        key = header.getKey();
        if (key != null && key.equalsIgnoreCase(HttpHeaders.SET_COOKIE)) {
            // collect the cookie if needed, for login
            List<String> values = header.getValue();
            for (String value : values) {
                if (result == null || result.isEmpty()) {
                    result = value;
                } else {
                    result = result + "; " + value;
                }
            }
        }
    }
    if (result == null) {
        log.debug("Reuse the original cookie");
        result = origCookies;
    }
    return result;
}
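A hypothetical usage sketch tying it together: follow each redirect by hand and carry the accumulated cookies along (the starting URL and variable names here are made up):

String currentUrl = "http://example.com/default.aspx"; // hypothetical starting URL
String cookies = null;
int responseCode;
do {
    HttpURLConnection con = (HttpURLConnection) new URL(currentUrl).openConnection();
    con.setInstanceFollowRedirects(false);         // inspect each redirect ourselves
    if (cookies != null) {
        con.setRequestProperty("Cookie", cookies); // send everything collected so far
    }
    responseCode = con.getResponseCode();
    cookies = getNewCookiesIfAny(cookies, con);
    if (responseCode / 100 == 3) {
        currentUrl = con.getHeaderField("Location");
    }
} while (responseCode / 100 == 3);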
Make sure that your connection allows following redirects; this is one of the possible reasons for the difference in behaviour between your connection and the browser (which follows redirects by default).
The server should then return a 3xx code, but something else along the way may be turning it into a 500 for your connection.
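For reference, a two-line sketch (HttpURLConnection follows redirects by default, so this only matters if it has been switched off somewhere):

con.setInstanceFollowRedirects(true);       // this connection only
HttpURLConnection.setFollowRedirects(true); // JVM-wide default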
I faced the same issue; in our case there was a special symbol in one of the parameter values. We fixed it by using URLEncoder.encode(String, String), as in the sketch below.
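A minimal sketch (the parameter name and value are made up):

String raw = "a&b c";                             // hypothetical value with special symbols
String encoded = URLEncoder.encode(raw, "UTF-8"); // yields "a%26b+c"
String target = "http://example.com/action?param=" + encoded;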
In my case it turned out that the server always returns HTTP/1.1 500 (in the browser as well as in Java) for the page I wanted to access, but successfully delivers the page content nonetheless.
A human accessing the page via the browser simply doesn't notice, since they see the page and no error message; in Java I had to read the error stream instead of the input stream (thanks @Muse).
I have no idea why, though. It might be some obscure way to keep crawlers out.
This is an old question, but I had the same issue and solved it this way; it might help others in the same situation.
In my case I was developing on a local environment: everything worked fine when I checked my REST API from the browser, but I always got HTTP error 500 from my Android app.
The problem is that on Android the app runs in a VM (virtual machine), which means your computer's firewall might be preventing the VM from accessing the local URL (IP) address.
You just need to allow that in your computer's firewall. The same applies if you are trying to access the system from outside your network.
Check the parameter
httpURLConnection.setDoOutput(false);
Use false only for the GET method and set it to true for POST; this saved me a lot of time!

Getting java.io.IOException: Error writing to server at getInputStream

If the temp string is very large, I get java.io.IOException: Error writing to server at getInputStream.
String tmp = js.deepSerialize(taskEx);
URL url = new URL("http://"
        + "localhost"
        + ":"
        + "8080"
        + "/Myproject/TestServletUpdated?command=startTask&taskeId=" + taskId + "'&jsonInput={\"result\":"
        + URLEncoder.encode(tmp) + "}");
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();
Why is that?
This call goes to the servlet mentioned in the URL.
Use the HTTP POST method instead of putting all the data into the URL of a GET request. There is an upper limit on the length of a URL, so you need the POST method if you want to send data of arbitrary length.
You may want to shorten the URL to http://localhost:8080/Myproject/TestServletUpdated and put the rest,
command = "startTask&taskeId=" + taskId + "'&jsonInput={\"result\":" + URLEncoder.encode(tmp) + "}"
in the body of the POST request.
I think you might have a "too long URL": a common maximum is around 2,000 characters (see this SO post for more info). GET requests are not meant to handle such long data.
If you can also change the servlet code, change the request from a GET to a POST. The client code would look pretty similar:
public static void main(String[] args) throws IOException {
    URL url = new URL("http", "localhost", 8080, "/Myproject/TestServletUpdated");
    URLConnection conn = url.openConnection();
    conn.setDoOutput(true);
    OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
    wr.write("command=startTask" +
            "&taskeId=" + taskId +
            "&jsonInput={\"result\":" + URLEncoder.encode(tmp) + "}");
    wr.flush();
    .... handle the answer ...
}
I didn't see it at first, but it seems you have a single-quote character in your request string:
...sk&taskeId=" + taskId + "'&jso.....
^
Try removing it; it might help!
getInputStream() only reads data from the server. To send data, use getOutputStream().
It could be because the request is being sent as a GET, which can carry only a limited number of characters; when that limit is exceeded, you get an IOException. Convert it to a POST and it should work.
For POST:
URLConnection conn = url.openConnection();
conn.setDoOutput(true); // must be set before writing the request body
OutputStream writer = conn.getOutputStream();
writer.write("yourString".getBytes());
Remove the temp string from the URL that you are passing, and move the "command" string into the "yourString".getBytes() call shown above.
