A question on webpage representation in Java - java

I followed a tutorial and came up with the following method to read a webpage's content into a CharSequence:
public static CharSequence getURLContent(URL url) throws IOException {
    URLConnection conn = url.openConnection();
    String encoding = conn.getContentEncoding();
    if (encoding == null) {
        encoding = "ISO-8859-1";
    }
    BufferedReader br = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), encoding));
    StringBuilder sb = new StringBuilder(16384);
    try {
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line);
            sb.append('\n');
        }
    } finally {
        br.close();
    }
    return sb;
}
It will return a representation of the webpage specified by the url.
However, this representation is hugely different from what I see when I use "View Page Source" in Firefox. Since I need to scrape data from the original webpage (a data segment that appears in the "View Page Source" output), my code always fails to find the required text in this Java representation.
Did I go wrong somewhere? I would appreciate your advice. Thanks a lot for helping!

You need to use an HTML-parsing library to build a data structure representing the HTML text on this webpage. My recommendation is to use this library: http://htmlparser.sourceforge.net.

Things like the request's User-Agent header and cookies can change what the server returns in the response. So the problem is more likely in the details of the request you are sending than in how you are reading the response.
A library such as HttpClient will let you more easily simulate the request a browser would send.
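As a rough sketch of the same idea using only java.net, request headers can be set on the connection before the response is read. The class name BrowserLikeRequest and all header values below are illustrative assumptions, not values from the question; the real ones can be copied from the browser's network inspector.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: sends headers similar to those a desktop browser would send.
final class BrowserLikeRequest {

    // Assumed example values; replace them with what your own browser actually sends.
    static Map<String, String> defaultHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0");
        headers.put("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        return headers;
    }

    // Applies the headers to a new connection. Nothing is sent over the
    // network until the caller asks for the response, e.g. getInputStream().
    static URLConnection open(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        for (Map.Entry<String, String> h : defaultHeaders().entrySet()) {
            conn.setRequestProperty(h.getKey(), h.getValue());
        }
        return conn;
    }
}
```

The connection returned by open(url) can be fed to the same reading loop as in the question. A Cookie header can be added in the same way if the site depends on one.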

Related

Getting incomplete HTML source on url.openConnection()

I am trying to get the HTML page source for a website, but I am not able to get some image links, which I think are populated dynamically on the webpage.
I am using Java as follows:
url = new URL(firstLevelURL);
connection = (HttpURLConnection) url.openConnection();
// Read all the text returned by the server
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
    // Read each line until done, looking for the image tag
    while ((str = br.readLine()) != null) {
        // str is one line of text; readLine() strips newline characters.
        // I am not able to get this image, as it is loaded dynamically using JavaScript/Ajax or something.
        if (str.contains("<img id=\"tileImage")) {
            response = str;
            break;
        }
    }
}
I tried using:
connection.setReadTimeout(15*1000);
But the page is still not loading completely.
Is there any way to wait for the page to load completely before fetching the HTML source?

Encoding ignored while reading InputStream

I'm having some encoding problems in a Java application that makes HTTP requests to an IIS server.
Iterating over the headers of the URLConnection object I can see the following (relevant) headers:
Transfer-Encoding: [chunked]
Content-Encoding: [utf-8]
Content-Type: [text/html; charset=utf-8]
The URLConnection.getContentEncoding() method returns utf-8 as the document encoding.
This is how my HTTP request and stream read are being made:
OutputStreamWriter sw = null;
BufferedReader br = null;
char[] buffer = null;
URL url;
url = new URL(this.URL);
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
sw = new OutputStreamWriter(connection.getOutputStream());
sw.write(postData);
sw.flush();
br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF8"));
StringBuilder totalResponse = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    totalResponse.append(line);
}
buffer = totalResponse.toString().toCharArray();
if (sw != null)
    sw.close();
if (br != null)
    br.close();
return buffer;
However the following string sent by the server "ÃÃÃção" is received by the client as "�����o".
What am I doing wrong?
Based on your comments, you are trying to receive a FIX message from an IIS server, and FIX uses ASCII. Only a small subset of tags supports other encodings, and those tags have to be treated in a special manner (the non-ASCII tags in the standard FIX spec are 349, 351, 353, 355, 357, 359, 361, 363, 365). If such tags are present, you will get a tag 347 with a value specifying the encoding (for example UTF-8), and each encoded tag will be preceded by a tag giving the length of the coming encoded value (for tag 349, you will always get 348 first, with an integer value).
In your case, it looks like the server is sending a custom tag 10411 (the 10xxx range) in some other encoding. By convention, the preceding tag 10410 should give you the length of the value in 10411, but it contains "0000" instead, which may have some other meaning.
Note that although FIX messages are very readable, they should still be treated as binary data. Tags and values are mostly ASCII characters, but the delimiter (SOH) is 0x01 and, as mentioned above, certain tags may use another encoding. The IIS service should really return the data as application/octet-stream so it can be received properly. Attempting to return it as text/html is asking for trouble :).
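If the server really sends a Content-Encoding of "UTF-8", then it is very confused: Content-Encoding names content codings such as gzip, while the character set belongs in the charset parameter of Content-Type. See http://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7231.html#header.content-encoding
For good order, a couple of corrections: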
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
connection.connect();
try (Writer sw = new OutputStreamWriter(connection.getOutputStream(),
        StandardCharsets.UTF_8)) {
    sw.write(postData);
    sw.flush();
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(connection.getInputStream(),
                    StandardCharsets.UTF_8))) {
        StringBuilder totalResponse = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            totalResponse.append(line).append("\r\n");
        }
        return totalResponse.toString().toCharArray();
    } // Close br.
} // Close sw.
Maybe:
postData = ... + "Accept-Charset: utf-8\r\n" + ...;
After totalResponse.toString(), you should have everything read correctly.
But when the text is displayed, the String is converted to bytes again, and that is where the encoding can fail. For instance, System.out.println may not work, because the Windows default encoding is probably used.
You can test the String by dumping its bytes:
String s = totalResponse.toString();
Logger.getLogger(getClass().getName()).log(Level.INFO, "{0}",
        Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
In some rare cases the font will not contain the special characters.
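To rule out the console conversion described above, one option is to wrap the output stream in a PrintStream with an explicit encoding. This is only a sketch: the class name Utf8Console is made up for illustration, and the terminal itself must also be set up to interpret UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper for printing text as UTF-8 regardless of the platform default.
final class Utf8Console {

    // Wraps the given stream so that characters are encoded as UTF-8 when written.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }
}
```

For example, Utf8Console.utf8(System.out).println(totalResponse) writes the response as UTF-8 bytes instead of using the default Windows code page.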
Can you try putting the stream in a request attribute and then printing it out on the client side? A request attribute will be received as is, without any encoding issues.

How to Pass a File through an HttpURLConnection

I'm trying to get an image hosted on our server to be displayed on a client. As per the specs of the project:
"When a Client receives such a URL, it must download the contents (i.e., bytes) of the file referenced by the URL. Before the Client can display the image to the user, it must first retrieve (i.e., download) the bytes of the image file from the Server. Similarly, if the Client receives the URL of a known data file or a field help file from the Server, it must download the content of those files before it can use them."
I'm pretty sure we have the server side stuff down, because if I put the url into a browser it retrieves and displays just fine. So it must be something with the ClientCommunicator class; can you take a look at my code and tell me what the problem is? I've spent hours on this.
Here is the code:
Where I actually call the function to get and display the file: (This part is working properly insofar as it is passing the right information to the server)
JFrame f = new JFrame();
JButton b = (JButton)e.getSource();
ImageIcon image = new ImageIcon(ClientCommunicator.DownloadFile(HOST, PORT, b.getLabel()));
JLabel l = new JLabel(image);
f.add(l);
f.pack();
f.setVisible(true);
From the ClientCommunicator class:
public static byte[] DownloadFile(String hostname, String port, String url) {
    String image = HttpClientHelper.doGetRequest("http://" + hostname + ":" + port + "/" + url, null);
    return image.getBytes();
}
The pertinent httpHelper:
public static String doGetRequest(String urlString, Map<String, String> headers) {
    URL url;
    HttpURLConnection connection = null;
    try {
        // Create connection
        url = new URL(urlString);
        connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Content-Language", "en-US");
        connection.setUseCaches(false);
        connection.setDoInput(true);
        connection.setDoOutput(true);
        if (connection.getResponseCode() == 500) {
            return "failed";
        }
        // Get response
        InputStream is = connection.getInputStream();
        BufferedReader rd = new BufferedReader(new InputStreamReader(is));
        String line;
        StringBuffer response = new StringBuffer();
        while ((line = rd.readLine()) != null) {
            response.append(line);
        }
        rd.close();
        return response.toString();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    } finally {
        if (connection != null) {
            connection.disconnect();
        }
    }
}
After that, it jumps into the server stuff, which as I stated I believe is working correctly because clients such as Chrome can retrieve the file and display it properly. The problem has to be somewhere in here.
I believe that it has to do with the way the bytes are converted into a string and then back, but I do not know how to solve this problem. I've looked at similar problems on StackOverflow and have been unable to apply them to my situation. Any pointers in the right direction would be greatly appreciated.
If your server is sending binary data, you do not want to use an InputStreamReader, or in fact a Reader of any sort. As the Java API documentation indicates (http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html), Readers are for reading streams of characters, not bytes. Decoding image bytes into a String and converting them back with getBytes() does not reliably round-trip, and readLine() also drops the line terminators, so the data ends up corrupted.
See this other Stack Overflow answer for how to read bytes from a stream:
Convert InputStream to byte array in Java
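A minimal sketch of that approach, using only java.io, is shown below. The class and method names (BinaryDownload, readFully) are placeholders and not part of the asker's project.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads a stream as raw bytes, with no character decoding.
final class BinaryDownload {

    // Copies everything from the stream into a byte array, chunk by chunk.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

In the question's code, the result of readFully(connection.getInputStream()) could be returned from DownloadFile and passed straight to new ImageIcon(byte[]), so the image never passes through a String.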
Do your homework.
Isolate the issue. Modify the server side to send only the 256 possible byte values. Then do a binary search to reduce the problem to a small set of bytes.
Use HTTP proxy tools to monitor the bytes as they are transmitted: Fiddler in the Windows world, or an equivalent for *nix environments.
Then see where the problem is happening, and search for your suspicions or share the result.

How to read UTF strings from web

I've a simple web service that lists a variable number of foreign languages.
Some of them are listed in native charset (like Chinese, for example).
I must read this from a webpage and dynamically add them to a JComboBox.
Currently I'm reading them this way:
public static Vector getSiteLanguages() {
    System.out.println("Reading Home from " + Constants.HOME);
    URL url;
    URLConnection connection;
    BufferedReader br;
    String inputLine;
    String regEx = "<option.*value=.([A-Z]*).>(.*)</option>";
    Pattern pattern = Pattern.compile(regEx);
    Matcher m;
    Vector siteLangs = new Vector();
    try {
        url = new URL(Constants.HOME);
        connection = url.openConnection();
        br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        while ((inputLine = br.readLine()) != null) {
            m = pattern.matcher(inputLine);
            while (m.find()) {
                System.out.println(m.group(1) + "->" + m.group(2));
                siteLangs.add(m.group(2));
            }
        }
        br.close();
    } catch (IOException e) {
        return siteLangs;
    }
    return siteLangs;
}
Then in the JFrame class I'm doing this:
Vector siteLangs = Language.getSiteLanguages();
JComboBox siteLangCombo = new JComboBox(siteLangs);
But in this way all non-latin languages are lost...
How do I preserve non-latin info in this situation?
By default, InputStreamReader uses the platform default character encoding to convert bytes to characters. The website is apparently using a different character encoding to convert characters to bytes in the HTTP response. Check the HTTP Content-Type response header to find out which one it is.
String contentType = connection.getHeaderField("Content-Type");
Assuming that it's UTF-8, which these days is the most commonly used character encoding on websites that strive for world domination, here's how you should specify it when constructing the InputStreamReader in your code:
br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
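Rather than hard-coding the encoding, the charset parameter can be pulled out of the Content-Type value. The following is a minimal sketch with deliberately simple parsing; the class name ContentTypeCharset and the choice of fallback are assumptions made for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper: extracts the charset from a Content-Type header value.
final class ContentTypeCharset {

    // Returns the charset named in e.g. "text/html; charset=utf-8",
    // or the fallback when the parameter is missing or not usable.
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }
}
```

The result of ContentTypeCharset.parse(connection.getHeaderField("Content-Type"), StandardCharsets.UTF_8) can be passed as the second argument of the InputStreamReader constructor.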
See also:
Using java.net.URLConnection to fire and handle HTTP requests
Unrelated to the concrete problem, Vector is a legacy class which has been replaced by the List interface since 1998. Are you sure that you're reading up-to-date resources during your Java learning spree? Further, regex should not be your first choice when you just need to parse HTML. This is Java, not PHP. Use a normal HTML parser. You may find Jsoup helpful for this. The whole code you have so far can then be reduced to two or three lines.

how to (simply) generate POST http request from java to do the file upload

I would like to upload files from a Java application/applet using an HTTP POST request. I would like to avoid any library not included in Java SE, unless there is no other (feasible) option.
So far I have come up with only a very simple solution:
- Create a String (Buffer) and fill it with a compatible header (http://www.ietf.org/rfc/rfc1867.txt).
- Open a connection to the server with URL.openConnection() and write the content of the file to the OutputStream.
I would also need to manually convert the binary file into the POST body.
I hope there is a better, simpler way to do this?
You need to use the java.net.URL and java.net.URLConnection classes.
There are some good examples at http://java.sun.com/docs/books/tutorial/networking/urls/readingWriting.html
Here's some quick and nasty code:
public void post(String url) throws Exception {
    URL u = new URL(url);
    URLConnection c = u.openConnection();
    c.setDoOutput(true);
    if (c instanceof HttpURLConnection) {
        ((HttpURLConnection) c).setRequestMethod("POST");
    }
    OutputStreamWriter out = new OutputStreamWriter(
            c.getOutputStream());
    // output your data here
    out.close();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(
                    c.getInputStream()));
    String s = null;
    while ((s = in.readLine()) != null) {
        System.out.println(s);
    }
    in.close();
}
Note that you may still need to URL-encode your POST data (for example with java.net.URLEncoder.encode) before writing it to the connection.
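URL-encoding applies to application/x-www-form-urlencoded bodies. For a file upload in the RFC 1867 style the question mentions, the body is multipart/form-data instead, and the file bytes are written unencoded between boundary lines. Below is a minimal sketch of building such a body for a single file. The class name MultipartBody and all parameter names are invented for illustration, and it assumes the field name and file name contain no quotes or line breaks.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a multipart/form-data body containing one file part.
final class MultipartBody {

    static byte[] build(String boundary, String fieldName, String fileName,
                        String fileContentType, byte[] fileBytes) {
        // Part headers, terminated by an empty line.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + fileContentType + "\r\n"
                + "\r\n";
        // Closing boundary marks the end of the whole body.
        String tail = "\r\n--" + boundary + "--\r\n";

        byte[] h = head.getBytes(StandardCharsets.UTF_8);
        byte[] t = tail.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(h, 0, h.length);
        out.write(fileBytes, 0, fileBytes.length); // raw bytes, no Writer involved
        out.write(t, 0, t.length);
        return out.toByteArray();
    }
}
```

To send it, set the request property Content-Type to "multipart/form-data; boundary=" followed by the same boundary string, then write the returned bytes directly to c.getOutputStream() rather than through an OutputStreamWriter. The boundary must not occur inside the file data, so a long random string is the usual choice.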
You need to learn about the chunked transfer encoding introduced in HTTP/1.1. The Apache HttpClient library is a good reference implementation to learn from.
