Verify random URLs on a network in Java

This question may be a bit too low-level, but I couldn't find an answer already.
I'm including the next paragraph so that you can correct me or explain the things I'm referring to without realizing it.
You know how in a web browser you can type directory paths from your own computer, and it will bring them up? Apparently, it also works with pages within a local network. If there's another page on the same subnet, you can access it with "http://pagename/".
On the network I'm a part of, there are a lot of these pages, and they all (or mostly) have common, single-word names, such as "http://word/". I want to test, using Java, a dictionary of common words to see which exist as locations on the network. Of course, there's probably an easier way if I know the range of IP addresses on the network, which I do. However, I get the "page not found" page if I type the IP address of, say, "http://word/" (which I get from ping) into the address bar. This is true even when "http://word/" itself works.
So say I loop through my word bank. How can I test if a URL is real?
I've worked out how to load my word bank. Here's what I have right now:
URL article = new URL("http://word"); // sample URL
URLConnection myConn = article.openConnection();
Scanner myScan = new Scanner(new InputStreamReader(myConn.getInputStream()));
System.out.println(myScan.hasNext()); // diagnostic output
This works when the URL object is constructed with a valid URL. When it gets passed a bad URL, the program simply skips the System.out.println, not even printing a new line. I know that different browsers show different "page not found" screens, and that these have their own HTML source code. Maybe that's related to my problem?
How can I test if a URL is real using this method?
Is there a way to test it with IP addresses, given my problem? Or, why am I having a problem when typing in the IP address but not the URL?

You should check the HTTP response code. If the URL is "real" (in your terms), the response code should be 200; otherwise you will get a different response code.
Do it using HttpURLConnection.getResponseCode().
HttpURLConnection is a subclass of URLConnection. When you are connecting over HTTP, that is actually what you get back from openConnection(), so you can say:
URL article = new URL("http://word"); // sample URL
HttpURLConnection myConn = (HttpURLConnection) article.openConnection();

If you are testing only HTTP URLs, you can cast the URLConnection to an HttpURLConnection and check the HTTP response code for 200 = HTTP_OK:
URL article = new URL("http://word"); // sample URL
HttpURLConnection myConn = (HttpURLConnection) article.openConnection();
if (myConn.getResponseCode() == HttpURLConnection.HTTP_OK) {
    // Site exists and has valid content
}
Additionally, if you want to test IP addresses, you can simply use the address as the URL:
http://10.0.0.1
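To connect this back to the word-bank loop in the question, here is a minimal sketch built on the answer above. The class name, the sample words, and the timeout values are placeholders chosen for illustration, not anything from the original posts.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

public class WordScan {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("word", "wiki", "intranet"); // placeholder word bank
        for (String word : words) {
            try {
                URL url = new URL("http://" + word + "/");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(2000); // don't hang on hosts that resolve but never answer
                conn.setReadTimeout(2000);
                if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                    System.out.println(word + " exists on the network");
                }
                conn.disconnect();
            } catch (UnknownHostException e) {
                // the name doesn't resolve on this network, so it's not a real page
            } catch (IOException e) {
                // resolved but unreachable or broken; treat as not found
            }
        }
    }
}

Note that getResponseCode() is what actually opens the connection and talks to the server; openConnection() alone sends nothing over the network.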

I think I've figured it out.
This code wouldn't compile without catching IOException (because of URL, URLConnection, and Scanner), so I had wrapped everything in a try { /*code*/ } catch (IOException oops) {} block and did nothing with the caught exception. I didn't think the try/catch was important enough to include in my question. UnknownHostException and MalformedURLException both extend IOException, so with a bad URL I was already unwittingly triggering one of them in Scanner.hasNext() or in HttpURLConnection.getResponseCode(), catching it, and exiting the try block. That's why I never got a response code when I had a bad URL. So I need to write:
try
{
    URL article = new URL("http://word");
    HttpURLConnection myConn = (HttpURLConnection) article.openConnection();
    myConn.getResponseCode(); // this call actually connects and throws for a bad host
    // code to store "http://word" as a working URL
}
catch (UnknownHostException ex) { /* code if "http://word" is not a working URL */ }
catch (IOException oops) { oops.printStackTrace(); }
Thanks for everyone's help; I learned a lot. If you have a different or better answer, or if you can explain why using the IP addresses didn't work, I'm still wondering about that.

Related

HTTP response code of text file is 460 when file contains 1.0.2?

I'm using an HttpsURLConnection to grab the only line in a text file hosted on Dropbox, as an update checker (for a Minecraft mod). The relevant code is below:
URL url = new URL(linkToVersionFile);
HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();
connection.setConnectTimeout(999);
int responseCode = connection.getResponseCode();
Normally, this works fine. However, if the text in the file is "1.0.2", it returns the 460 response code, which I can't seem to find in any list of response codes. The accompanying response message is "Restricted", though.
If the file contains "1.0.1", "1.0.3", "1.1.2", "1.2.2" or even "2.0.2" it works just fine. Nothing changes but the 5 characters located in the file. The same thing happens if different files are used and given the text "1.0.2", so it's not a corrupt file.
While I can get around it by either skipping 1.0.2 and moving straight to 1.0.3, or by writing it as "102" instead, it's just such an unusual problem that I was wondering if anyone had an explanation :P
If more information or test results are required, let me know.
Thanks in advance :)

Retrieve redirected URL with Java / HttpURLConnection

Given a URL (String ref), I am attempting to retrieve the redirected URL as follows:
HttpURLConnection con = (HttpURLConnection) new URL(ref).openConnection();
con.setInstanceFollowRedirects(false);
con.setRequestProperty("User-Agent", "");
int responseType = con.getResponseCode() / 100;
while (responseType == 1)
{
    Thread.sleep(10);
    responseType = con.getResponseCode() / 100;
}
if (responseType == 3)
    return con.getHeaderField("Location");
return con.getURL().toString();
I am having several (conceptual and technical) problems with it:
Conceptual problem:
It works in most cases, but I don't quite understand how.
All methods of the 'con' instance are called AFTER the connection is opened (when 'con' is instantiated).
So how do they affect the actual result?
How come calling 'setInstanceFollowRedirects' affects the returned value of 'getHeaderField'?
Is there any point in calling 'getResponseCode' over and over until the returned value is not 1xx?
Bottom line, my general question here: is there another request/response sent through the connection every time one of these methods is invoked?
Technical problem:
Sometimes the response-code is 3xx, but 'getHeaderField' does not return the "final" URL.
I tried calling my code with the returned value of 'getHeaderField' until the response-code was 2xx.
But in most other cases where the response-code is 3xx, 'getHeaderField' DOES return the "final" URL, and if I call my code with this URL then I get an empty string.
Can you please advise how to approach the two problems above in order to have a "100% proof" code for retrieving the "final" URL?
Please ignore cases where the response-code is 4xx or 5xx (or anything else other than 1xx / 2xx / 3xx for that matter).
Thanks
Conceptual problems:
0.) Can one URLConnection or HttpURLConnection object be reused?
No, you cannot reuse such an object. You can use it to fetch the content of one URL just once. You cannot use it to retrieve another URL, nor to fetch the content twice (speaking at the network level).
If you want to fetch another URL, or to fetch the same URL a second time, you have to call the openConnection() method of the URL class again to instantiate a new connection object.
1.) When is the URLConnection actually connected?
The method name openConnection() is misleading. It only instantiates the connection object. It does not do anything at the network level.
The interaction on the network level starts in this line, which implicitly connects the connection (= the TCP socket under the hood is opened and data is sent and received):
int responseType = con.getResponseCode()/100;
Alternatively, you can use HttpURLConnection.connect() to explicitly connect the connection.
2.) How does setInstanceFollowRedirects work?
setInstanceFollowRedirects(true) causes the URLs to be fetched "under the hood" again and again until there is a non-redirect response. The response code of the non-redirect response is returned by your call to getResponseCode().
UPDATE:
Yes, this allows you to write simple code if you do not want to deal with the redirects yourself. You can simply switch on following redirects, and then you can read the final response from the location you were redirected to as if no redirect had taken place.
I would be more careful in evaluating the response code, though. Not every 3xx code is automatically a kind of redirection. For example, the code 304 just stands for "Not Modified."
Look at the original definitions here.
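To illustrate points 0, 1, and 2 together, here is a sketch of a manual redirect-following loop: it opens a fresh connection per hop (since a connection object cannot be reused), lets getResponseCode() do the actual connecting, and resolves relative Location headers. The class name, hop cap, and timeout are arbitrary choices for this sketch.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectResolver {
    // Follows 3xx responses by hand and returns the final URL as a string.
    static String resolve(String ref) throws IOException {
        URL url = new URL(ref);
        for (int hop = 0; hop < 10; hop++) { // arbitrary cap to avoid redirect loops
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(false); // we handle redirects ourselves
            con.setConnectTimeout(5000);
            int code = con.getResponseCode(); // this call is what actually connects
            if (code / 100 != 3) {
                return url.toString(); // non-redirect response: this is the final URL
            }
            String location = con.getHeaderField("Location");
            if (location == null) {
                return url.toString(); // e.g. 304 Not Modified carries no Location
            }
            url = new URL(url, location); // resolves relative Location values
        }
        throw new IOException("Too many redirects: " + ref);
    }
}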

Get the redirected URL of a very specific URL (in Java)

How can I get the redirected URL of http://at.atwola.com/?adlink/5113/1649059/0/2018/AdId=4041444;BnId=872;itime=15692006;impref=13880156912668385284; in Java?
My code (given below) is constructed according to answers to similar questions on stack-overflow (https://stackoverflow.com/a/5270162/1382251 in particular).
But it just yields the original URL. I suspect there are other similar cases, so I would like to resolve this specific one and apply the solution in general.
String ref = "http://at.atwola.com/?adlink/5113/1649059/0/2018/AdId=4041444;BnId=872;itime=15692006;impref=13880156912668385284;";
try
{
    URLConnection con1 = new URL(ref).openConnection();
    con1.connect();
    InputStream is = con1.getInputStream();
    URL url = con1.getURL();
    is.close();
    String finalPage = url.toString();
    if (finalPage.equals(ref))
    {
        HttpURLConnection con2 = (HttpURLConnection) con1;
        con2.setInstanceFollowRedirects(false);
        con2.connect();
        if (con2.getResponseCode() / 100 == 3)
            finalPage = con2.getHeaderField("Location");
    }
    System.out.println(finalPage);
}
catch (Exception error)
{
    System.out.println("error");
}
I played a bit with your URL using telnet, wget, and curl, and I noticed that in some cases the server returns 200 OK and sometimes 302 Moved Temporarily. The main difference seems to be the User-Agent request header. Your code works if you add the following before con1.connect():
con1.setRequestProperty("User-Agent","");
That is, with empty User-Agent (or if the header is not present at all), the server issues a redirect. With the Java User-Agent (in my case User-Agent: Java/1.7.0_45) and with the default curl User-Agent (User-Agent: curl/7.32.0) the server responds with 200 OK.
In some cases you might need to also set:
System.setProperty("http.agent", "");
See Setting user agent of a java URLConnection
The server running the site is the Adtech Adserver and apparently it is doing user agent sniffing. There is a long history of user agent sniffing. So it seems that the safest thing to do would be to set the user agent to Mozilla:
con1.setRequestProperty("User-Agent","Mozilla"); //works with your code for your URL
Maybe the safest option would be to use a user agent used by some of the popular web browsers.
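Putting that together with the asker's code, here is a minimal sketch under the assumption that user agent sniffing is indeed the cause (the "Mozilla" value is one choice among several, as noted above):

import java.net.HttpURLConnection;
import java.net.URL;

public class AtwolaRedirect {
    public static void main(String[] args) throws Exception {
        String ref = "http://at.atwola.com/?adlink/..."; // truncated; use the full URL from the question
        HttpURLConnection con = (HttpURLConnection) new URL(ref).openConnection();
        con.setInstanceFollowRedirects(false);
        con.setRequestProperty("User-Agent", "Mozilla"); // look like a browser so the server issues the 302
        if (con.getResponseCode() / 100 == 3)
            System.out.println(con.getHeaderField("Location"));
    }
}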

Http request returning different response in browser

I'm getting a JSON object from a server. When I enter the following generated URL into my browser, I get a response with "num_match": 18; however, when running in my app I get a JSON object with "num_matches": 2.
The URL object is created like this:
URL request;
request = new URL(url);
and the connection like this:
HttpURLConnection connection = (HttpURLConnection) request.openConnection();
connection.setConnectTimeout(MAX_TIME);
connection.setReadTimeout(MAX_TIME);
url is a String and I am copying the string contents into my browser to test.
The string is:
http://search.3taps.com/?auth_token=xxxxxxxxxxxxxxxxxx&retvals=heading,body,timestamp,external_url,images,price&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ&category=PWSM&radius=200mi&lat=26.244&long=-80.2&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
The URL object has the following fields
query:
auth_token=xxxxxxxxxxxxxxxxxx&retvals=heading,body,timestamp,external_url,images,price&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ&category=PWSM&radius=200mi&lat=26.244&long=-80.2&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
file:
/?auth_token=xxxxxxxxxxxxxxxxxx&retvals=heading,body,timestamp,external_url,images,price&rpp=100&source=BKPGE|CRAIG|EBAYC|INDEE|KIJIJ&category=PWSM&radius=200mi&lat=26.244&long=-80.2&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
host:
search.3taps.com
The response comes back as "success":true in both cases, but with a discrepancy in the object returned. I don't know much about HTTP; what could be causing this?
UPDATE: On further testing, it seems there is only a problem when the annotations entry is present:
annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}
seems to be causing the problem.
Make sure you are encoding the URL request correctly when you are setting the URL for the server. The spaces, braces, and colons all need to be appropriately escaped. Spaces should be %20, etc. This may help: HTTP URL Address Encoding in Java
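For instance, here is a sketch using java.net.URI, whose multi-argument constructors percent-encode characters that are illegal in each component. The token value is elided, and pulling the query out into its own string is just for readability:

import java.net.URI;
import java.net.URL;

public class EncodeQuery {
    public static void main(String[] args) throws Exception {
        String query = "auth_token=xxxxxxxxxxxxxxxxxx&rpp=100"
                + "&annotations={age:18 OR age:19 OR age:20 OR age:21 OR age:22}";
        // The five-argument constructor quotes illegal characters per component:
        // spaces become %20, '{' and '}' become %7B and %7D, and so on.
        URI uri = new URI("http", "search.3taps.com", "/", query, null);
        URL url = uri.toURL();
        System.out.println(url);
    }
}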
Old Answer.... Comments indicate this does not affect the result... so moving down.
It is quite possible that the server is changing its behaviour based on the type of 'browser' you are reporting yourself to be. When connecting to an HTTP server, you tell the server what your UserAgent is (typically for a browser it is something like "Internet Explorer ...", "Mozilla ...", or "Google Chrome ..."). The server will often tailor the results of a request to suit the User Agent (different JavaScript files and HTML go to IE, etc.). This is also how servers redirect mobile devices to a mobile-friendly version of a site.
It is quite possible that the server is changing its response to match the UserAgent exposed by your Java code (which by default is something like "Java/1.7.0"). You can change this value a few ways. Have a look at this question Setting user agent of a java URLConnection and try to run your program with the Mozilla agent, and see if you get different results.

Guaranteed way to correctly get the contents of www.bing.com/

I have been working on a program that gets the contents of www.bing.com and saves it to a file, but of the two ways I have tried (one using sockets, the other using HtmlUnit), neither shows the contents 100% correctly when I open the file. I know there are other options out there, but I'm looking for one that is guaranteed to get the contents of www.bing.com/ correctly. I would therefore appreciate it if someone could point me to a means of accomplishing this.
The differences you see are likely due to the web server providing different content to different browsers based on the user agent string and other request headers.
Try setting the User-Agent header in your socket and HtmlUnit strategies to the one you are comparing against and see if the result is as expected. Moreover, you will likely have to replicate the request headers exactly as they are sent by your target browser.
What is "incorrect" about what is returned? Keep in mind, Bing is probably generating some of the content via JavaScript; your client will need to make additional requests to retrieve the JavaScript files, run the JavaScript, etc.
You can use URL.openConnection() to create a URLConnection and call URLConnection.getInputStream(). You can then read the InputStream contents and write them to a file.
If you need to override the User-Agent because the server is using it to serve different content you can do so by first setting the http.agent system property to empty string.
/* Somewhere in your code before you make requests */
System.setProperty("http.agent", "");
or using -Dhttp.agent= on your java command line
and then setting the User-Agent to something useful on the connection before you get the InputStream.
URLConnection conn = ... //Create your URL connection as described above.
String userAgent = ... //Some user-agent string here.
conn.setRequestProperty("User-Agent", userAgent);
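Put together, here is a minimal sketch of the approach described above; the Mozilla user-agent string and the output filename are placeholder choices:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SavePage {
    public static void main(String[] args) throws Exception {
        System.setProperty("http.agent", ""); // clear the default Java/1.x agent before any request
        URLConnection conn = new URL("http://www.bing.com/").openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // placeholder browser UA
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, Paths.get("bing.html"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}

Keep in mind (per the first answer) that content generated by JavaScript will still be missing from the saved file; this only captures the raw HTML response.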
