Weird problem accessing web page with Java

I am trying to write a program that reads the html source code of the website http://judgephilosophies.wikispaces.com. I wrote some simple java code that reads and outputs the source code, but it just prints out "null." Here's the bizarre thing, though - if I replace "http://judgephilosophies.wikispaces.com" in the code with any other website, it works just fine. It only seems to be for websites in the wikispaces.com domain that the program doesn't work, and I am utterly befuddled as to why. The code is below. Help is much appreciated.
import java.io.*;
import java.net.*;

public class AccessWebExample
{
    public static void main(String[] args) throws Exception
    {
        // Create reader to access html source code
        URL url = new URL("http://judgephilosophies.wikispaces.com/");
        InputStreamReader isr = new InputStreamReader(url.openStream());
        BufferedReader reader = new BufferedReader(isr);

        // Read and print the text
        do
        {
            System.out.println(reader.readLine());
        }
        while (reader.readLine() != null);
    }
}

Do an HTTP trace using Wireshark or some such tool and compare. It's probably a matter of cookies or headers, if the bare URLConnection is acting differently from a browser.

Using wget from the command line you'll find:
broach@broach-laptop:~$ wget http://judgephilosophies.wikispaces.com/
--2011-04-23 14:50:31-- http://judgephilosophies.wikispaces.com/
Resolving judgephilosophies.wikispaces.com... 208.43.192.33, 75.126.104.177
Connecting to judgephilosophies.wikispaces.com|208.43.192.33|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://session.wikispaces.com/1/auth/auth?authToken=e8ad55c0e2701a0e7da89807255609da [following]
It redirects (a couple more times, actually). Your bare URLConnection doesn't handle that. The interesting information is in the headers, and the redirect response has no page body to speak of, so your program currently prints null.
You really should look at using HttpURLConnection, as it can handle redirects for you. Doing it with a plain URLConnection would require you to read the returned headers and act on the HTTP response codes yourself (which is what HttpURLConnection does).
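For illustration, here's a sketch of the original program reworked around HttpURLConnection. One caveat: HttpURLConnection follows redirects by default, but (at least in Sun's implementation) it will not hop from http to https, and that is exactly the kind of redirect wikispaces issues here, so this sketch also chases the Location header by hand. The limit of 5 redirects is an arbitrary choice.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AccessWebExample
{
    public static void main(String[] args) throws Exception
    {
        URL url = new URL("http://judgephilosophies.wikispaces.com/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        // Follow 3xx responses manually so http -> https hops also work
        int redirects = 0;
        while (con.getResponseCode() / 100 == 3 && redirects++ < 5)
        {
            String location = con.getHeaderField("Location");
            con = (HttpURLConnection) new URL(location).openConnection();
        }

        BufferedReader reader = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null)
        {
            System.out.println(line);
        }
    }
}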

Related

HTTP Server - Serving up favicon.ico

I'm playing around setting up my own java http server to better understand http servers and what goes on under the hood of the web. I've developed a pretty simple server and have been able to serve both html pages and data in JSON form. Then I saw the browser (I'm using chrome but assuming it's the same for others) was sending a request for favicon.ico. I'm able to identify that request on my server, so I'm trying to serve up a random icon I downloaded and resized to 16x16 pixels in png format, as that's what the internet says the size needs to be. Here's my code; note it's not supposed to be anything professional, just something that works for my basic educational purposes:
[set up ServerSocket and listen]
public static String err_header = "HTTP/1.1 500 ERR\nAccess-Control-Allow-Origin: *";
public static String success_header = "HTTP/1.1 200 OK\nAccess-Control-Allow-Origin: *";
public static String end_header = "\r\n\r\n";
while (true) {
    try {
        System.out.println("Listening for new connections");
        clientSocket = server.accept();
        System.out.println("Connection established");
        InputStreamReader isr = new InputStreamReader(clientSocket.getInputStream());
        BufferedReader reader = new BufferedReader(isr);
        String getLine = reader.readLine(); // first line of HTTP request
        handleRequest(getLine, clientSocket);
    } // end of try
    catch (Exception e) {
        [error stuff]
    } // end of catch
} // end of while
HandleRequest method:
public static void handleRequest(String getLine, Socket clientSocket) throws Exception {
    if (getLine.substring(5, 16).equals("favicon.ico")) {
        List<String> iconTag = new ArrayList<String>();
        iconTag.add("\nContent-Type: image/png");
        handleFileRequest("[file]", iconTag, clientSocket);
    } // end of if
    else {
        handleFileRequest("[file]", clientSocket);
    } // end of else
} // end of handleRequest
handleFileRequest for images:
public static void handleFileRequest(String fileName, List<String> headerTags, Socket clientSocket) throws Exception {
    OutputStream out = clientSocket.getOutputStream();
    BufferedReader read = new BufferedReader(new FileReader(fileName));
    out.write(success_header.getBytes("UTF-8"));
    Iterator<String> itr = headerTags.iterator();
    while (itr.hasNext()) {
        out.write(itr.next().getBytes("UTF-8"));
    } // end of while
    out.write(end_header.getBytes("UTF-8"));
    String readLine = "";
    while ((readLine = read.readLine()) != null) {
        out.write(readLine.getBytes("UTF-8"));
    } // end of while
    out.flush();
    out.close();
} // end of handleFileRequest
And it appears to work: the server sends the file and the browser shows the 200 OK response, but no favicon appears. When I filter network requests to just images, the one image requested by the page being served is listed, but the favicon request is not (it shows up in the "other" section instead). Clicking on the other image shows it in the preview, whereas that's not the case for the favicon request. [screenshot omitted]
Meanwhile the other image shows up in the page just fine. [screenshot omitted]
I also tried including the Content-Length header, but that didn't seem to make a difference. Am I missing something obvious?
Also, just to clarify: I know I can include the favicon in the actual html page; the goal isn't just to get it working, but to understand how it works.
Reading binary files
It seems the content of the favicon is not served correctly.
I suspect this is most likely due to the way you read its content:
while ((readLine = read.readLine()) != null) {
    out.write(readLine.getBytes("UTF-8"));
}
Reading binary content line by line is inappropriate, because the concept of lines, and UTF-8 encoding, don't make sense in the context of binary files. And you cannot reconstruct the content correctly this way even if it were text: the readLine method of a BufferedReader doesn't return the full line, it strips the line terminator from the end. You cannot manually add a newline character back, because you cannot know what exactly the terminator was.
Here's a simpler and correct way to read the content of a binary file:
byte[] bytes = Files.readAllBytes(Paths.get("/path/to/file"));
Once you have this, it's easy to produce a correct Content-Length header too, using the value of bytes.length.
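Applied to the question's code, the image-serving method might look something like this sketch. It reuses the question's success_header and end_header constants and its bare-\n header separators, and needs java.nio.file.Files and java.nio.file.Paths:
public static void handleFileRequest(String fileName, List<String> headerTags, Socket clientSocket) throws Exception {
    byte[] bytes = Files.readAllBytes(Paths.get(fileName)); // raw bytes: no Reader, no charset, no lines
    OutputStream out = clientSocket.getOutputStream();
    out.write(success_header.getBytes("UTF-8"));
    for (String tag : headerTags) {
        out.write(tag.getBytes("UTF-8"));
    }
    out.write(("\nContent-Length: " + bytes.length).getBytes("UTF-8"));
    out.write(end_header.getBytes("UTF-8"));
    out.write(bytes); // binary content written as-is
    out.flush();
    out.close();
} // end of handleFileRequest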
What happens when you visit a page in a browser
It seems it would be useful to clarify a few things. When you open a URL in a browser, the browser sends a GET request to the web server to download the content of the URL you specified. Once it has the page content, it will send further GET requests:
Fetch a favicon if it doesn't have one already. The location of this may be specified in the HTML document, or else the browser will try to fetch SERVERNAME/favicon.ico by default
Fetch the images specified in src attribute of any (valid) <img/> tags in the document
Fetch the style sheets specified in the href attribute of any (valid) <link rel="stylesheet"> tags in the document
... and similarly for <script/> tags, and so on...
The favicon is purely cosmetic, shown in browser tab titles; the other resources are essential for rendering the page. In a text-based browser like lynx none of them are essential, and such browsers will simply not fetch these resources.
This is the explanation for why the favicon is requested, and how.
How does a web server serve files?
In the most basic case, serving a file has two important components:
Produce an appropriate HTTP header: each line of the header is in name: value format, and each line must end with \r\n (the spec mandates CRLF, though many clients tolerate a bare \n). There must be at least a Content-Type header, and the header must be terminated by a blank line.
After the blank line that terminates the header, send the content, which can be anything, even binary.
To illustrate with an example, consider the curl command, which dumps the content of a URL to standard output. If you run curl url-to-some-html-file, you will see the content of the html file. If you run curl url-to-some-image-file, you will see the content of the image file: it will be unreadable, and your terminal will probably make funny noises. You can redirect the output to a file with curl url-to-some-image-file > image.png, and that will give you an image file, binary content, that you can open in any image viewer tool.
In short, serving a file is really just printing the header, then a blank line to terminate the header, then the content (written to the client socket rather than stdout).
Debugging the serving of an image
An easy way to check that an image is served correctly is to save the URL to a file using curl, and then verify that the saved file and the original file are identical, for example using the cmp command:
curl -o file url-to-favicon
cmp file /path/to/original
The output of cmp should be empty: it only produces output if it finds a difference between the two files.
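The same check can be done in Java; here's a sketch, assuming the server runs on localhost:1234 as in the example below (InputStream.readAllBytes requires Java 9 or later):
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class CompareFavicon {
    public static void main(String[] args) throws Exception {
        byte[] original = Files.readAllBytes(Paths.get("/path/to/favicon"));
        try (InputStream in = new URL("http://localhost:1234/favicon.ico").openStream()) {
            byte[] served = in.readAllBytes();
            System.out.println(Arrays.equals(served, original) ? "identical" : "different");
        }
    }
}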
Implementing a simple HTTP server
Instead of using a ServerSocket,
here's a drastically simpler way to implement an HTTP server:
// com.sun.net.httpserver.HttpServer ships with the JDK; this also needs
// java.net.InetSocketAddress, java.nio.file.*, and java.nio.charset.*.
HttpServer server = HttpServer.create(new InetSocketAddress(1234), 0);

server.createContext("/favicon.ico", t -> {
    byte[] bytes = Files.readAllBytes(Paths.get("/path/to/favicon"));
    t.sendResponseHeaders(200, bytes.length); // exact Content-Length
    try (OutputStream os = t.getResponseBody()) {
        os.write(bytes);
    }
});

server.createContext("/", t -> {
    Charset charset = StandardCharsets.UTF_8;
    List<String> lines = Files.readAllLines(Paths.get("/path/to/index"), charset);
    t.sendResponseHeaders(200, 0); // 0 = arbitrary length, chunked
    try (OutputStream os = t.getResponseBody()) {
        for (String line : lines) {
            os.write((line + "\n").getBytes(charset));
        }
    }
});

server.start();

Taking text from a response web page using Java

I am sending commands to a server (an IP camera) using http, and I currently need to parse a response that the server sends back. (I am sending the command via the command line, and the server's response appears in my browser.)
There are a lot of resources such as this: Saving a web page to a file in Java, that clearly illustrate how to scrape a page such as cnn.com. However, since this is a response page that is only generated when the camera receives a specific command, my attempts to use the method described by Mike Deck (in the link above) have met with failure. (Specifically, when my program requests the page again the server returns a 401 error.)
The response from the server opens a new tab in my browser. Essentially, I need to know how to save the current web page using java, since reading in a file is probably the most simple way to approach this. Do any of you know how to do this?
TL;DR How do you save the current webpage to a webpage.html or webpage.txt file using java?
EDIT: I used Base64 from the Apache commons codec, which solved my 401 authentication issue. However, I am still getting a 400 error when I attempt to connect my InputStream (see below). Does this mean a connection isn't being established in the first place?
URL url = new URL("http://" + ipAddress + "/axis-cgi/record/record.cgi?diskid=SD_DISK");
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);

HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("POST");
connection.setDoInput(true);
connection.setRequestProperty("Authorization", "Basic " + encoding);
connection.connect();

InputStream content = connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
while ((line = in.readLine()) != null) {
    System.out.println(line);
}
EDIT 2: Changing the request to a GET resolved the issue.
So while scrutinizing my code above, I decided to change
connection.setRequestMethod("POST");
to
connection.setRequestMethod("GET");
This solved my problem. In hindsight, I think the server was rejecting the request because it is not set up to handle the various trappings that come along with POST.

Error 503 in HTTP during page parsing in Java

Today I'm developing a Java RMI server (and also the client) that gets info from a page and returns what I want. I put the code right down here. The problem is that sometimes the URL I pass to the method throws an IOException saying the given URL produced a 503 HTTP error. It would be easy if it always failed, but the thing is that it only happens sometimes.
I have this method structure because the page I parse is from a weather company and I want info for many cities, not only one, so some cities work perfectly on the first try and others fail. Any suggestions?
public ArrayList<Medidas> parse(String url) {
    medidas = new ArrayList<Medidas>();
    int v = 0;
    String sourceLine;
    String content = "";
    try {
        // The URL address of the page to open.
        URL address = new URL(url);

        // Open the address and create a BufferedReader with the source code.
        InputStreamReader pageInput = new InputStreamReader(address.openStream());
        BufferedReader source = new BufferedReader(pageInput);

        // Append each new HTML line into one string. Add a tab character.
        while ((sourceLine = source.readLine()) != null) {
            if (sourceLine.contains("<tbody>")) v = 1;
            else if (sourceLine.contains("</tbody>"))
                break;
            else if (v == 1)
                content += sourceLine + "\n";
        }

        ........................
        ........................ NOW THE PARSING CODE, NOT IMPORTANT
}
HTTP 5xx errors reflect server errors, so this likely has nothing to do with your client code.
You would get a 400 error if you were passing invalid parameters on your request.
503 is "Service Unavailable" and may be sent by the server when it is overloaded and cannot process your request. From a publicly accessible server, that could explain the erratic behavior.
Edit
Build a retry handler into your code for when you detect a 503. Apache HttpClient can do that automatically for you; a plain-JDK sketch follows below.
List of HTTP Status Codes
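Here is a minimal plain-JDK sketch of such a retry handler (the attempt limit and backoff values are arbitrary illustrations, not anything HttpClient prescribes):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RetryOn503 {
    // Retry the GET while the server answers 503 Service Unavailable,
    // backing off a little longer after each failed attempt.
    public static InputStream openWithRetry(String url, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            if (con.getResponseCode() != 503 || attempt == maxAttempts) {
                return con.getInputStream(); // throws IOException if the last attempt failed too
            }
            con.disconnect();
            Thread.sleep(1000L * attempt); // linear backoff: 1s, 2s, 3s...
        }
    }
}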
Check that the IOException is really not a MalformedURLException. Try printing out the URLs to verify a bad URL is not causing the IOException.
How large is the file you are parsing? Perhaps your JVM is running out of memory.

403 error when accessing a URL that works fine in browsers

String url = "http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false";
URL google = new URL(url);
HttpURLConnection con = (HttpURLConnection) google.openConnection();
When I use a BufferedReader to print the content, I get a 403 error.
The same URL works fine in the browser. Can anyone suggest what's wrong?
The reason it works in a browser but not in java code is that the browser adds some HTTP headers which you lack in your Java code, and the server requires those headers. I've been in the same situation - and the URL worked both in Chrome and the Chrome plugin "Simple REST Client", yet didn't work in Java. Adding this line before the getInputStream() solved the problem:
connection.addRequestProperty("User-Agent", "Mozilla/4.0");
...even though I have never used Mozilla. Your situation might require a different header. It might be related to cookies; I was getting text in the error stream advising me to enable cookies.
Note that you might get more information by looking at the error text. Here's my code:
try {
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.addRequestProperty("User-Agent", "Mozilla/4.0");
    InputStream input;
    if (connection.getResponseCode() == 200) // this must be called before 'getErrorStream()' works
        input = connection.getInputStream();
    else
        input = connection.getErrorStream();
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String msg;
    while ((msg = reader.readLine()) != null)
        System.out.println(msg);
} catch (IOException e) {
    System.err.println(e);
}
HTTP 403 is a Forbidden status code. You would have to read the HttpURLConnection.getErrorStream() to see the response from the server (which can tell you why you have been given a HTTP 403), if any.
This code should work fine. If you have been making a number of requests, it is possible that Google is just throttling you. I have seen Google do this before. You can try using a proxy to verify.
Most browsers automatically encode URLs when you enter them, but the Java URL function doesn't.
You should encode the URL with URLEncoder.
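For instance, here is a sketch of building the question's URL with its query values encoded; the "|" between waypoints is exactly the kind of character a browser quietly percent-encodes for you:
import java.net.URLEncoder;

public class EncodeExample {
    public static void main(String[] args) throws Exception {
        String url = "http://maps.googleapis.com/maps/api/directions/xml"
                + "?origin=" + URLEncoder.encode("Chicago,IL", "UTF-8")
                + "&destination=" + URLEncoder.encode("Los Angeles,CA", "UTF-8")
                + "&waypoints=" + URLEncoder.encode("Joplin,MO|Oklahoma City,OK", "UTF-8")
                + "&sensor=false";
        System.out.println(url); // "|" becomes %7C, spaces become +
    }
}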
I know this is a bit late, but the easiest way to get the contents of a URL is to use the Apache HttpComponents HttpClient project: http://hc.apache.org/httpcomponents-client-ga/index.html
My original page (containing the link) and the targeted linked page were not on the same domain: call them original-domain and target-domain. I found the difference in the request headers.
With the 403 Forbidden error, the request header had this line:
Referer: http://original-domain/json2tree/ipfs/ipfsList.html
When I entered the URL directly and got no 403 Forbidden, the request header did NOT have that Referer line.
I finally figured out how to fix this error: on your original-domain web page, you have to add
<meta name="referrer" content="no-referrer" />
It prevents the Referer header from being sent, and works both for links and for Ajax requests.

URLConnection FileNotFoundException for non-standard HTTP port sources

I was trying to use the Apache Ant Get task to get a list of WSDLs generated by another team in our company. They have them hosted on a WebLogic 9.x server at http://....com:7925/services/. I am able to get to the page through a browser, but the Get task gives me a FileNotFoundException when trying to copy the page to a local file to parse. The task did still work for URLs on the standard HTTP port 80.
I looked through the Ant source code and narrowed the error down to the URLConnection. It seems as though the URLConnection doesn't recognize the data as HTTP traffic, since it isn't on the standard port, even though the protocol is specified as HTTP. I sniffed the traffic using Wireshark and the page loads correctly across the wire, but I still get the FileNotFoundException.
Here's an example where you will see the error (with the URL changed to protect the innocent). The error is thrown on connection.getInputStream();
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class TestGet {
    private static URL source;

    public static void main(String[] args) {
        doGet();
    }

    public static void doGet() {
        try {
            source = new URL("http", "test.com", 7925, "/services/index.html");
            URLConnection connection = source.openConnection();
            connection.connect();
            InputStream is = connection.getInputStream();
        } catch (Exception e) {
            System.err.println(e.toString());
        }
    }
}
The response to my HTTP request returned with a status code 404, which resulted in a FileNotFoundException when I called getInputStream(). I still wanted to read the response body, so I had to use a different method: HttpURLConnection#getErrorStream().
Here's a JavaDoc snippet of getErrorStream():
Returns the error stream if the connection failed but the server sent useful data nonetheless. The typical example is when an HTTP server responds with a 404, which will cause a FileNotFoundException to be thrown in connect, but the server sent an HTML help page with suggestions as to what to do.
Usage example:
public static String httpGet(String url) {
    HttpURLConnection con = null;
    InputStream is = null;
    try {
        con = (HttpURLConnection) new URL(url).openConnection();
        con.connect();

        // 4xx: client error, 5xx: server error. See: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html.
        boolean isError = con.getResponseCode() >= 400;

        // In HTTP error cases, HttpURLConnection only gives you the input stream via #getErrorStream().
        is = isError ? con.getErrorStream() : con.getInputStream();

        String contentEncoding = con.getContentEncoding() != null ? con.getContentEncoding() : "UTF-8";
        return IOUtils.toString(is, contentEncoding); // Apache Commons IO
    } catch (Exception e) {
        throw new IllegalStateException(e);
    } finally {
        // Note: Closing the InputStream manually may be unnecessary, depending on the implementation of
        // HttpURLConnection#disconnect(). Sun/Oracle's implementation does close it for you in said method.
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                throw new IllegalStateException(e);
            }
        }
        if (con != null) {
            con.disconnect();
        }
    }
}
This is an old thread, but I had a similar problem and found a solution that is not listed here.
I was receiving the page fine in the browser, but got a 404 when I tried to access it via the HttpURLConnection. The URL I was trying to access contained a port number. When I tried it without the port number I successfully got a dummy page through the HttpURLConnection. So it seemed the non-standard port was the problem.
I started thinking the access was restricted, and in a sense it was. My solution was to tell the server the User-Agent and also to specify the file types I expect. I am trying to read a .json file, so I thought the file type might be a necessary specification as well.
I added these lines and it finally worked:
httpConnection.setRequestProperty("User-Agent","Mozilla/5.0 ( compatible ) ");
httpConnection.setRequestProperty("Accept","*/*");
Check the response code being returned by the server.
I know this is an old thread but I found a solution not listed anywhere here.
I was trying to pull data in json format from a J2EE servlet on port 8080 but was receiving the file not found error. I was able to pull this same json data from a php server running on port 80.
It turns out that in the servlet, I needed to change doGet to doPost.
Hope this helps somebody.
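For reference, here's a sketch of the servlet side (hypothetical class name and payload): a servlet only answers the HTTP methods whose do* handlers it overrides, so a client that POSTs to a doGet-only servlet gets an HTTP error back:
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class JsonServlet extends HttpServlet {
    // Overriding doPost lets the servlet answer POST requests;
    // without it, HttpServlet's default sends an error response.
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("application/json");
        resp.getWriter().write("{\"status\": \"ok\"}");
    }
}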
You could use OkHttp:
OkHttpClient client = new OkHttpClient();

String run(String url) throws IOException {
    Request request = new Request.Builder()
            .url(url)
            .build();
    Response response = client.newCall(request).execute();
    return response.body().string();
}
I've tried that locally - using the code provided - and I don't get a FileNotFoundException except when the server returns a status 404 response.
Are you sure that you're connecting to the webserver you intend to be connecting to? Is there any chance you're connecting to a different webserver? (I note that the port number in the code doesn't match the port number in the link)
I have run into a similar issue but the reason seems to be different, here is the exception trace:
java.io.FileNotFoundException: http://myhost1:8081/test/api?wait=1
at sun.reflect.GeneratedConstructorAccessor2.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139)
at com.doitnext.loadmonger.HttpExecution.getBody(HttpExecution.java:85)
at com.doitnext.loadmonger.HttpExecution.execute(HttpExecution.java:214)
at com.doitnext.loadmonger.ClientWorker.run(ClientWorker.java:126)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.io.FileNotFoundException: http://myhost1:8081/test/api?wait=1
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1434)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379)
at com.doitnext.loadmonger.HttpExecution.execute(HttpExecution.java:166)
... 2 more
So it would seem that just getting the response code causes the URLConnection to call getInputStream internally.
I know this is an old thread but just noticed something on this one so thought I will just put it out there.
Like Jessica mentioned, this exception is thrown when using a non-standard port.
It only seems to happen when using a DNS name, though. If I use the IP address directly, I can specify the port number and everything works fine.
