Wrong encoding with Java HttpURLConnection

I'm trying to read generated XML from a MS web service:
URL page = new URL(address);
StringBuffer text = new StringBuffer();
HttpURLConnection conn = (HttpURLConnection) page.openConnection();
conn.connect();
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent());
BufferedReader buff = new BufferedReader(in);
box.setText("Getting data ...");
String line;
while ((line = buff.readLine()) != null) {
    text.append(line).append("\n");
}
box.setText(text.toString());
or
URL u = new URL(address);
URLConnection uc = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
inputLine = java.net.URLDecoder.decode(inputLine, "UTF-8");
System.out.println(inputLine);
}
in.close();
Any page reads fine except the web service output: it reads the greater-than and less-than signs strangely, turning < into "& lt;" and > into "& gt;" (without the spaces; I've added them here because otherwise Stack Overflow renders the entities back as < and >).
Please help, thanks.

First, there seems to be some confusion on this line:
inputLine = java.net.URLDecoder.decode(inputLine, "UTF-8");
This effectively says that you expect every line in the document your server provides to be URL encoded. URL encoding is not the same as document encoding.
http://en.wikipedia.org/wiki/Percent-encoding
http://en.wikipedia.org/wiki/Character_encoding
Looking at your code snippet, I think URL encoding (percent encoding) is not what you're after.
As for the document character encoding: you are making a conversion on this line:
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent());
conn.getContent() returns an InputStream, which operates on bytes, whilst the reader operates on chars; the character encoding conversion happens here. Check out the other constructors of InputStreamReader, which take the encoding as a second argument. Without that argument you fall back on whatever the platform default is in Java.
InputStreamReader(InputStream in, String charsetName)
for instance, lets you change your code to:
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent(), "utf-8");
But the real question is "what encoding is your server providing the content in?" If you own the server code too, you may just hard-code it to something reasonable such as utf-8. But if it can vary, you need to look at the Content-Type HTTP header to figure it out.
String contentType = conn.getHeaderField("Content-Type");
The contents of contentType will look like
text/plain; charset=utf-8
Note that URLConnection also has a getContentEncoding() method, but despite its name that returns the Content-Encoding header (compression, such as gzip), not the charset, so you still need to pick the charset parameter out of the Content-Type value yourself. Also notice that it's entirely possible that no charset is provided, or no Content-Type header at all, in which case you must fall back on reasonable defaults.
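A minimal sketch of that fallback logic, reusing the conn variable from above (the parsing is deliberately naive and assumes a reasonably well-formed header; a real client might use a proper MIME parser):
String contentType = conn.getHeaderField("Content-Type");
String charset = "UTF-8"; // reasonable default when nothing is declared
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.regionMatches(true, 0, "charset=", 0, 8)) {
            charset = param.substring(8).trim();
        }
    }
}
InputStreamReader in = new InputStreamReader(conn.getInputStream(), charset);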

Mark Rotteveel is correct: the web service is the culprit here. For some reason it's sending the greater-than and less-than signs as the "& lt;" and "& gt;" entities (spaces added, as above).
Thanks Martin Algesten, but as I already stated, I've worked around it; I was just looking for why it was this way.

Related

What is wrong with my HttpURLConnection request?

I am trying to call a REST API with a PUT request, but I am receiving a 400 (Bad Request) error code. Can someone spot what I may be doing wrong?
I have successfully called this API with a REST Client, here are the headers and body used:
https://imgur.com/dZVyawn
https://imgur.com/lMtn2JB
String credentials = Base64.getEncoder().encodeToString(("wcadmin:wcadmin").getBytes());
URL url = new URL(getURL());
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("PUT");
connection.setDoOutput(true);
connection.setDoInput(true);
//Set Headers
String fileUrl = "c:\\0000000050.xml";
File fileToUpload = new File(fileUrl);
long length = fileToUpload.length();
String FORM_DATA_BOUNDARY = "------FormBoundary" + System.currentTimeMillis();
connection.setRequestProperty("csrf_nonce", getNonceValue());
connection.setRequestProperty("Accept", "application/xml");
connection.setRequestProperty("Authorization", "Basic " + credentials);
connection.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + FORM_DATA_BOUNDARY);
connection.setRequestProperty("Content-Length", Long.toString(length));
//Setup Request Body Writer
OutputStream requestBodyOutputStream = connection.getOutputStream();
BufferedWriter requestBodyWriter = new BufferedWriter(new OutputStreamWriter(requestBodyOutputStream));
//Write Body
requestBodyWriter.write("\r\n\r\n");
requestBodyWriter.write(FORM_DATA_BOUNDARY);
requestBodyWriter.write("\r\n");
requestBodyWriter.write("Content-Disposition: form-data; name=\"file\"; filename=\"" + fileUrl + "\"");
requestBodyWriter.write("\r\n");
requestBodyWriter.write("Content-Type: text/xml");
requestBodyWriter.write("\r\n\r\n");
requestBodyWriter.flush();
FileInputStream uploadFileStream = new FileInputStream(fileToUpload);
int bytesRead;
byte[] dataBuffer = new byte[1024];
while ((bytesRead = uploadFileStream.read(dataBuffer)) != -1) {
requestBodyOutputStream.write(dataBuffer, 0, bytesRead);
}
requestBodyOutputStream.flush();
requestBodyWriter.write("\r\n");
requestBodyWriter.write(FORM_DATA_BOUNDARY);
requestBodyWriter.flush();
//Close the streams
requestBodyOutputStream.close();
requestBodyWriter.close();
uploadFileStream.close();
//Read Response
String inputLine;
StringBuffer content = new StringBuffer();
InputStream inputStream = connection.getInputStream();
if (inputStream != null) {
BufferedReader responseReader = new BufferedReader(new InputStreamReader(inputStream));
if (responseReader != null) {
while ((inputLine = responseReader.readLine()) != null) {
content.append(inputLine);
}
responseReader.close();
}
}
connection.disconnect();
Error 400 Bad Request response received
First and most important: you cannot write to both an OutputStream and an OutputStreamWriter that wraps that same OutputStream; they will conflict with each other.
Do not use OutputStreamWriter at all; instead, convert text to bytes yourself:
OutputStream requestBodyOutputStream = connection.getOutputStream();
requestBodyOutputStream.write("\r\n\r\n".getBytes(StandardCharsets.UTF_8));
requestBodyOutputStream.write(FORM_DATA_BOUNDARY.getBytes(StandardCharsets.UTF_8));
// etc.
Second, you are converting between bytes and strings using the system’s default charset, which means exactly what gets written depends on the system where the code is running. Don’t call String.getBytes without specifying an explicit Charset. Usually getBytes(StandardCharsets.UTF_8) is what you want.
Similarly, you need to pass a Charset when creating your InputStreamReader (although this isn’t the cause of your problem, since you aren’t getting a valid response at the moment). Don’t assume a charset; use the charset MIME type parameter from the response’s Content-Type header. You can parse the Content-Type value with the javax.activation.MimeType class, though be aware that the javax.activation package was removed from Java SE in Java 11; for Java 11 and later, Java Activation has to be added as a stand-alone library. Another option is the JavaMail library, specifically its ContentType class.
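For illustration, a small sketch of that parsing with javax.activation.MimeType (an assumption: the activation classes are on your class path, which on Java 11+ means adding the stand-alone library):
// Extract the charset parameter from a Content-Type value such as
// "text/xml; charset=utf-8", falling back to UTF-8 when absent.
// Imports assumed: javax.activation.MimeType, javax.activation.MimeTypeParseException,
// java.nio.charset.Charset, java.nio.charset.StandardCharsets.
static Charset charsetOf(String contentType) throws MimeTypeParseException {
    String name = new MimeType(contentType).getParameter("charset");
    return name != null ? Charset.forName(name) : StandardCharsets.UTF_8;
}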
Third, the file’s length would only be correct as a Content-Length header inside the body part (between the boundaries). The Content-Length of the entire request body must be the length of everything you’ve written: the boundaries, the body part headers, and the file content.
The good news is, I think (though I’m not positive) that URLConnection will set the request’s overall Content-Length automatically, based on the bytes you write, so you probably don’t need to compute the length yourself; you can simply refrain from setting "Content-Length" at all.
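Putting those points together, here is roughly how the body could be written; a sketch only, reusing the question’s connection, FORM_DATA_BOUNDARY and fileToUpload variables and assuming java.nio.file.Files and java.nio.charset.StandardCharsets are imported:
// In the body, each boundary line is the declared boundary prefixed with
// "--", and the closing boundary carries an extra "--" suffix (RFC 2046).
OutputStream out = connection.getOutputStream();
out.write(("--" + FORM_DATA_BOUNDARY + "\r\n").getBytes(StandardCharsets.UTF_8));
out.write(("Content-Disposition: form-data; name=\"file\"; filename=\""
        + fileToUpload.getName() + "\"\r\n").getBytes(StandardCharsets.UTF_8));
out.write("Content-Type: text/xml\r\n\r\n".getBytes(StandardCharsets.UTF_8));
Files.copy(fileToUpload.toPath(), out); // streams the file into the request body
out.write(("\r\n--" + FORM_DATA_BOUNDARY + "--\r\n").getBytes(StandardCharsets.UTF_8));
out.flush();
out.close();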
When you do pass a correct request, you will find that you are dropping the newlines in the response. If the response is supposed to be human-readable text, those newlines are likely to matter. If you’re using Java 10 or later, you can use Reader.transferTo with a StringWriter:
StringWriter responseBody = new StringWriter();
responseReader.transferTo(responseBody);
String content = responseBody.toString();
If you’re using a version of Java older than 10, you can do the copy yourself with a char buffer; a minimal sketch, reusing the responseReader from above:
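// Pre-Java-10 equivalent of transferTo: copy the decoded chars,
// newlines included, through a buffer into a StringWriter.
StringWriter responseBody = new StringWriter();
char[] buf = new char[8192];
int charsRead;
while ((charsRead = responseReader.read(buf)) != -1) {
    responseBody.write(buf, 0, charsRead);
}
String content = responseBody.toString();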
Some other notes: new BufferedReader cannot return null, so checking for null is pointless. In Java, the new operator always, no matter what, returns a new object (unless an exception is thrown, in which case new doesn’t return at all).
You should use StringBuilder, not StringBuffer. They are identical except that StringBuffer is an older class that provides thread safety for every method, creating unnecessary overhead for nearly all use cases.
You are copying your file into the request without any buffering, which is going to be slow and inefficient. Consider using Files.copy(fileToUpload.toPath(), requestBodyOutputStream) instead.

Garbage when loading xml content with URLConnection

I'm trying to load the content of an XML page using URLConnection but I'm getting back garbage characters. The same code works for me for pretty much any other site so I'm not sure what's the issue.
Here's the relevant code:
String urlString = "http://myUrl";
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
conn.setConnectTimeout(60 * 2000); // wait up to two minutes for a connection
conn.setReadTimeout(60 * 2000); // and up to two minutes for each read
InputStreamReader isr = new InputStreamReader(conn.getInputStream(), encoding);
BufferedReader in = new BufferedReader(isr);
String inputLine;
String wholeDocument = "";
while ((inputLine = in.readLine()) != null) {
wholeDocument += inputLine;
}
Printing out wholeDocument produces a bunch of characters like this: er���;�pI.���$6
I am using encoding = 'UTF-8'.
I also tried using XML libraries, for example:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL(baseUrl).openStream());
System.out.println("doc = " + doc);
But the result is the same. When using curl in a terminal app (I'm on a mac) the result is similar although the characters look like this: ???0??KZV??????0N6?aH:$?X9v???$>???`
Any idea how to solve this?
If you check the headers of your response you will see Content-Encoding: gzip, indicating that the body of the response has been compressed; you need to uncompress it first, and that's why you get those weird characters. See HTTP compression for more details.
A good way to check the headers with curl is the verbose option -v; in this case, thanks to curl -v http://sites.one.co.il/XML/VOD/ | more, I could quickly see the response headers.
Expanding on the other answer, you can check whether the received content is gzip encoded, and decode it if so. Declaring the reader before the if keeps it in scope afterwards (in the original snippet each branch declared its own local that immediately went out of scope):
InputStreamReader isr;
if ("gzip".equalsIgnoreCase(conn.getHeaderField("Content-Encoding"))) {
    InputStream gzStream = new GZIPInputStream(conn.getInputStream());
    isr = new InputStreamReader(gzStream, encoding);
} else {
    isr = new InputStreamReader(conn.getInputStream(), encoding);
}
Alternatively, you can tell the server you don't want gzip-encoded data with:
conn.setRequestProperty("Accept-Encoding", "identity");

Encoding ignored while reading InputStream

I'm having some encoding problems in a Java application that makes HTTP requests to an IIS server.
Iterating over the headers of the URLConnection object I can see the following (relevant) headers:
Transfer-Encoding: [chunked]
Content-Encoding: [utf-8]
Content-Type: [text/html; charset=utf-8]
The URLConnection.getContentEncoding() method returns utf-8 as the document encoding.
This is how my HTTP request, and stream read is being made:
OutputStreamWriter sw = null;
BufferedReader br = null;
char[] buffer = null;
URL url;
url = new URL(this.URL);
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
sw = new OutputStreamWriter(connection.getOutputStream());
sw.write(postData);
sw.flush();
br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF8"));
StringBuilder totalResponse = new StringBuilder();
String line;
while((line = br.readLine()) != null) {
totalResponse.append(line);
}
buffer = totalResponse.toString().toCharArray();
if (sw != null)
sw.close();
if (br != null)
br.close();
return buffer;
However the following string sent by the server "ÃÃÃção" is received by the client as "�����o".
What am I doing wrong?
Based on your comments, you are trying to receive a FIX message from an IIS server, and FIX uses ASCII. Only a small subset of tags supports other encodings, and those have to be treated in a special manner (the non-ASCII tags in the standard FIX spec are 349, 351, 353, 355, 357, 359, 361, 363, 365). If such tags are present, you will get a tag 347 with a value specifying the encoding (for example UTF-8), and each encoded tag will then be preceded by a tag giving the length of the coming encoded value (for tag 349 you will always get 348 first, with an integer value).
In your case, it looks like the server is sending a custom tag 10411 (the 10xxx range) in some other encoding. By convention, the preceding tag 10410 should give you the length of the value in 10411, but it contains "0000" instead, which may have some other meaning.
Note that although FIX messages are very readable, they should still be treated as binary data. Tags and values are mostly ASCII characters, but the delimiter (SOH) is 0x01, and as mentioned above, certain tags may use another encoding. The IIS service should really return the data as application/octet-stream so it can be received properly; attempting to return it as text/html is asking for trouble :).
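To make "treat it as binary" concrete, a minimal sketch that collects the raw response bytes with no Reader and no charset conversion (reusing the question's connection variable):
// SOH delimiters (0x01) and any per-tag encodings survive intact,
// ready to be handed to a FIX parser.
InputStream is = connection.getInputStream();
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] buf = new byte[8192];
int n;
while ((n = is.read(buf)) != -1) {
    bos.write(buf, 0, n);
}
byte[] fixMessage = bos.toByteArray();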
If the server really sends a Content-Encoding of "UTF-8" then it is very confused. See http://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7231.html#header.content-encoding
For good order, a couple of corrections:
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
connection.connect();
try (Writer sw = new OutputStreamWriter(connection.getOutputStream(),
StandardCharsets.UTF_8)) {
sw.write(postData);
sw.flush();
try (BufferedReader br = new BufferedReader(
new InputStreamReader(connection.getInputStream(),
StandardCharsets.UTF_8))) {
StringBuilder totalResponse = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
totalResponse.append(line).append("\r\n");
}
return totalResponse.toString().toCharArray();
} // Close br.
} // Close sw.
Maybe:
postData = ... + "Accept-Charset: utf-8\r\n" + ...;
Once you have totalResponse.toString(), everything should have been read correctly.
But when displaying it, the String/chars are converted back to bytes, and that is where the encoding fails. For instance, System.out.println will not do, as it probably uses the Windows encoding.
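If you do want to print the response for inspection, a minimal sketch that at least removes the default-encoding step (whether the console then renders the characters correctly is a separate terminal/font matter):
// Wrap System.out in a PrintStream that encodes chars as UTF-8
// instead of the platform default.
PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
utf8Out.println(totalResponse.toString());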
You can test the String by dumping its bytes:
String s = totalResponse.toString();
Logger.getLogger(getClass().getName()).log(Level.INFO, "{0}",
        Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
In some rare cases the font will not contain the special characters.
Can you try putting the stream in a request attribute and then printing it out on the client side? A request attribute will be received as-is, without any encoding issues.

How to read UTF strings from web

I've a simple web service that lists a variable number of foreign languages.
Some of them are listed in native charset (like Chinese, for example).
I must read this from a webpage and dynamically add them to a JComboBox.
Currently I'm reading them this way:
public static Vector getSiteLanguages() {
System.out.println("Reading Home from " + Constants.HOME);
URL url;
URLConnection connection;
BufferedReader br;
String inputLine;
String regEx = "<option.*value=.([A-Z]*).>(.*)</option>";
Pattern pattern = Pattern.compile(regEx);
Matcher m;
Vector siteLangs = new Vector();
try {
url = new URL( Constants.HOME);
connection = url.openConnection();
br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
while ((inputLine = br.readLine()) != null) {
m = pattern.matcher(inputLine);
while ( m.find()) {
System.out.println(m.group(1) + "->" + m.group(2) );
siteLangs.add(m.group(2));
}
}
br.close();
} catch (IOException e) {
return siteLangs;
}
return siteLangs;
}
Then in the JFrame class I'm doing this:
Vector siteLangs = Language.getSiteLanguages();
JComboBox siteLangCombo = new JComboBox(siteLangs);
But this way all the non-Latin languages are lost...
How do I preserve the non-Latin info in this situation?
By default, InputStreamReader uses the platform default character encoding to convert bytes to characters. The website is apparently using a different character encoding to convert characters to bytes in the HTTP response. You need to check the HTTP Content-Type response header to see which one it is.
String contentType = connection.getHeaderField("Content-Type");
Assuming that it's UTF-8, which is these days the most commonly used character encoding among websites that strive for world domination, here's how you should specify it during the construction of the InputStreamReader in your code:
br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
See also:
Using java.net.URLConnection to fire and handle HTTP requests
Unrelated to the concrete problem: Vector is a legacy class which was replaced by the List interface back in 1998. Are you sure you're reading up-to-date resources during your Java learning spree? Further, regex should not be your first choice when you need to parse HTML. This is Java, not PHP. Use a normal HTML parser; you may find Jsoup helpful here. The whole code you have so far could then be brought back to two or three lines, roughly as sketched below.
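A sketch of what that could look like with Jsoup (untested against the actual page, and it assumes the <option> elements are all you need; Document here is org.jsoup.nodes.Document):
// Jsoup fetches the page, honors the charset declared by the response
// (which also fixes the encoding problem), and parses the HTML properly.
Document doc = Jsoup.connect(Constants.HOME).get();
List<String> siteLangs = doc.select("option").eachText();
JComboBox<String> siteLangCombo = new JComboBox<>(siteLangs.toArray(new String[0]));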

String received with utf8 format doesn't get displayed correctly

I want to know how to receive a string in Java that contains letters from different languages.
I used the UTF-8 format. This receives some languages' letters correctly, but the Latin letters can't be displayed correctly.
So, how can I receive the letters of all languages?
Alternatively, is there another format that will allow me to receive all of them?
Here's my code:
URL url = new URL("http://google.cm");
URLConnection urlc = url.openConnection();
BufferedReader buffer = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF-8"));
StringBuilder builder = new StringBuilder();
int charRead; // Reader.read() returns a decoded char value, not a raw byte
while ((charRead = buffer.read()) != -1)
{
    builder.append((char) charRead);
}
buffer.close();
text=builder.toString();
If I display the "text", the letters can't be displayed correctly.
Reading a UTF-8 file is fairly simple in Java:
Reader r = new InputStreamReader(new FileInputStream(filename), "UTF-8");
If that isn't working, the issue lies elsewhere.
EDIT: According to iconv, Google Cameroon is serving invalid UTF-8. It seems to actually be iso-8859-1.
EDIT2: Actually, I was wrong. It serves (and declares) valid UTF-8 if the user agent contains "Mozilla/5.0" (or higher), but valid iso-8859-1 in (some) other cases. Obviously, the best bet is to use getContentType to check before decoding.
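For instance, a sketch of that check, building on the question's code (the Mozilla/5.0 user-agent trick and the two charset values are just the findings above, not a general rule):
// Request the UTF-8 variant via the user agent, then trust the
// declared Content-Type when choosing the reader's charset.
URLConnection urlc = new URL("http://google.cm").openConnection();
urlc.setRequestProperty("User-Agent", "Mozilla/5.0");
String contentType = urlc.getContentType(); // e.g. "text/html; charset=UTF-8"
String charset = (contentType != null && contentType.toLowerCase().contains("iso-8859-1"))
        ? "ISO-8859-1" : "UTF-8";
BufferedReader buffer = new BufferedReader(
        new InputStreamReader(urlc.getInputStream(), charset));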
