I received some data from a server and read it with the following Java code:
is = new BufferedInputStream(connection.getInputStream());
reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
int length;
char[] buffer = new char[4096];
StringBuilder sb = new StringBuilder();
while ((length = reader.read(buffer)) != -1) {
    sb.append(new String(buffer, 0, length)); // buffer is already incorrect here
}
byte[] byteDatas = sb.toString().getBytes();
And I print byteDatas as a hex string:
Comparing to Wireshark's capture:
Some bytes are decoded as ef bf bd; that is the UTF-8 encoding of \ufffd (U+FFFD), the replacement character for invalid data.
So I think there must be a decoding error in my code. After debugging, I found that if I use connection.getInputStream() to read the data directly, there is no invalid data.
So the problem must happen in BufferedReader or InputStreamReader. But I have already specified "UTF-8", and the data in Wireshark doesn't look strange. Is UTF-8 not the right charset? The server doesn't declare one.
Please help me get BufferedReader to read the correct data.
UPDATE
My default charset is "UTF-8", and I have verified that in the debugger. The data is already wrong right after read() returns, so it's not String's fault.
String.getBytes() will use the platform's default encoding (not necessarily UTF-8) to convert the characters of the String to bytes.
Quoting from the javadoc of String.getBytes():
Encodes this String into a sequence of bytes using the platform's default charset...
You can't compare the UTF-8 encoded input data to a result that might have been encoded with a different charset. Instead, explicitly specify the encoding:
byte[] byteDatas = sb.toString().getBytes(StandardCharsets.UTF_8);
Note:
If your input data is NOT UTF-8 encoded text and you attempt to decode it as UTF-8, the decoder may replace invalid byte sequences with U+FFFD. As a result, the bytes you get by re-encoding the String will not match the original raw bytes.
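A small sketch of the difference (the sample string is just an illustration): encoding the same characters with the platform default versus an explicit charset can produce different byte sequences.

```java
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "héllo";
        // Platform default: depends on the JVM's default charset, so the
        // result can differ from machine to machine
        byte[] defaultBytes = s.getBytes();
        // Explicit UTF-8: deterministic everywhere; 'é' becomes two bytes
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("default charset: " + defaultBytes.length + " bytes");
        System.out.println("UTF-8:           " + utf8Bytes.length + " bytes"); // 6 bytes
    }
}
```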
Related
I'm trying to create a simple Java program that sends an HTTP request to an HTTP server hosted locally, using Socket.
This is my code:
try
{
    // Create connection
    Socket s = new Socket("localhost", 80);
    System.out.println("[CONNECTED]");
    DataOutputStream out = new DataOutputStream(s.getOutputStream());
    DataInputStream in = new DataInputStream(s.getInputStream());

    String header = "GET / HTTP/1.1\n"
            + "Host:localhost\n\n";
    byte[] byteHeader = header.getBytes();
    out.write(byteHeader, 0, header.length());

    String res = "";
    ///////////// READ PROCESS /////////////
    byte[] buf = new byte[in.available()];
    in.readFully(buf);
    System.out.println("\t[READ PROCESS]");
    System.out.println("\t\tbuff length->" + buf.length);
    for (byte b : buf)
    {
        res += (char) b;
    }
    System.out.println("\t[/READ PROCESS]");
    ///////////// END READ PROCESS /////////////
    System.out.println("[RES]");
    System.out.println(res);
    System.out.println("[CONN CLOSE]");
    in.close();
    out.close();
    s.close();
} catch (Exception e)
{
    e.printStackTrace();
}
But when I run it, the server responds with a '400 Bad Request' error.
What is the problem? Maybe I need to add some HTTP headers, but I don't know which ones.
There are a couple of issues with your request:
String header = "GET / HTTP/1.1\n"
+ "Host:localhost\n\n";
The line break to be used must be Carriage-Return/Line-Feed (CRLF), i.e. you should change that to
String header = "GET / HTTP/1.1\r\n"
+ "Host:localhost\r\n\r\n";
Next problem comes when you write the data to the OutputStream:
byte[] byteHeader = header.getBytes();
out.write(byteHeader,0,header.length());
The call to getBytes() without specifying a charset uses the system's default charset, which might differ from the one needed here; better use getBytes("8859_1") (or StandardCharsets.ISO_8859_1). When writing to the stream, you use header.length(), which can differ from the length of the resulting byte array if the charset maps one character to multiple bytes (e.g. with UTF-8 as the encoding). Use byteHeader.length instead.
out.write(byteHeader,0,header.length());
String res = "";
/////////////READ PROCESS/////////////
byte[] buf = new byte[in.available()];
After sending the header data you should call flush() on the OutputStream, to make sure no internal buffer in the streams being used prevents the data from actually being sent to the server.
in.available() only returns the number of bytes you can read from the InputStream without blocking. It's not the length of the data being returned from the server. As a simple solution for starters, you can add Connection: close\r\n to your header data and simply read the data you're receiving from the server until it closes the connection:
StringBuilder sb = new StringBuilder();
byte[] buf = new byte[4096];
int read;
while ((read = in.read(buf)) != -1) {
    sb.append(new String(buf, 0, read, "8859_1"));
}
String res = sb.toString();
Oh, and independent of the topic of doing an HTTP request on your own:
String res = "";
for (byte b : buf)
{
    res += (char) b;
}
This is a performance and memory nightmare: each += creates a brand-new String and copies everything accumulated so far into it, so the total work and allocation grow quadratically with the response size. For a response of 100 KB this means gigabytes of cumulative allocations, leading to a lot of garbage-collection runs in the process. Use a StringBuilder instead.
Oh, and about the server's response: it most likely comes from the invalid line breaks being used. The server regards the whole header, including the empty line, as one single line and complains about the malformed GET request because of the extra data after HTTP/1.1.
According to HTTP 1.1:
HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body [...].
So, you'll need all of your request to be ending with \r\n.
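Putting all of those fixes together — CRLF line endings, an explicit charset when encoding the header, writing byteHeader.length bytes, flushing, and Connection: close with a read-until-EOF loop — a corrected version of the program might look like this sketch (host and port taken from the question):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SimpleHttpGet {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket("localhost", 80)) {
            OutputStream out = s.getOutputStream();
            InputStream in = s.getInputStream();

            // CRLF line endings; Connection: close makes the server end the
            // stream when the response is complete
            String header = "GET / HTTP/1.1\r\n"
                    + "Host: localhost\r\n"
                    + "Connection: close\r\n\r\n";
            byte[] byteHeader = header.getBytes(StandardCharsets.ISO_8859_1);
            out.write(byteHeader, 0, byteHeader.length); // byte length, not char count
            out.flush();

            // Read until the server closes the connection
            StringBuilder sb = new StringBuilder();
            byte[] buf = new byte[4096];
            int read;
            while ((read = in.read(buf)) != -1) {
                sb.append(new String(buf, 0, read, StandardCharsets.ISO_8859_1));
            }
            System.out.println(sb);
        }
    }
}
```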
I need to receive a unicode (UTF-8) string sent by client on a server side. The length of the string is of course unknown.
ServerSocket serverSocket = new ServerSocket(567);
Socket clientSocket = serverSocket.accept();
PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));
I can read bytes using in.read() (until it returns -1), but the problem is that the string is Unicode; in other words, every character is represented by two bytes. So a plain cast of the result of read(), which would work with normal ASCII characters, makes no sense.
UPDATE
As per the suggestions below, I created the reader as follows:
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream(),"UTF-8"));
I've changed the client side to send a newline (#10#13) after each string.
But the new problem is that I get garbage instead of the real string when I call:
in.readLine();
When I print the result I get some nonsense string (I can't even copy it here), although I am not dealing with non-Latin characters or anything exotic.
To see what's going on I introduced following code:
int j = 0;
while (j < 255) {
    j++;
    System.out.print(in.read() + ", ");
}
So here I just print all bytes received. If I send "ab" I get:
97, 0, 98, 0, 10, 13,
This is what one would expect, but then why doesn't the readLine method produce "good" results?
Anyway, if we can't find the actual answer, I should probably collect the bytes (like above) and create my string from them. How do I do that?
P.S. Just a quick note - I am on windows.
Use new InputStreamReader(clientSocket.getInputStream(), "UTF-8") to set the charset to use when reading the InputStream coming from your client.
When creating InputStreamReader you can set encoding like this:
BufferedReader in =
new BufferedReader(
new InputStreamReader(clientSocket.getInputStream(), "UTF-8")
);
Try this way:
Reader in = new BufferedReader(
new InputStreamReader(
clientSocket.getInputStream(), StandardCharsets.UTF_8));
Note the StandardCharsets class. It has been available since Java 1.7 and provides a more elegant way to specify a standard encoding like UTF-8.
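A side note on the byte dump in the question's update: 97, 0, 98, 0 is "ab" encoded as little-endian UTF-16, not UTF-8, which would explain why a UTF-8 reader produces garbage. If that is indeed what the client sends, one way to "collect the bytes and create the string from them" is a sketch like this (the charset is an assumption inferred from the dump):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Collect every byte from the stream first, then decode once with a
    // known charset
    static String readAll(InputStream in, Charset cs) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            baos.write(b);
        }
        return new String(baos.toByteArray(), cs);
    }

    public static void main(String[] args) throws Exception {
        // The bytes printed in the question for "ab": 97, 0, 98, 0
        byte[] received = {97, 0, 98, 0};
        String s = readAll(new ByteArrayInputStream(received), StandardCharsets.UTF_16LE);
        System.out.println(s); // prints "ab"
    }
}
```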
When I receive a short HTTP request everything is fine, but long packets get corrupted. I captured a trace in Wireshark and printed the packet as hex values to the Java console, and some additional values show up in that output. Why?
How can I solve it?
Is there anything wrong with my conversion of the HTTP request to hex?
The following code is used for the conversion:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
InputStream responseData = request.getInputStream();
byte[] buffer = new byte[1000];
int bytesRead = 0;
while ((bytesRead = responseData.read(buffer)) > 0) {
    baos.write(buffer, 0, bytesRead);
    sb = baos.toString();
    str = baos.toString();
    sb.append(str);
    sb = new String(baos.toByteArray(), UTF8);
}
baos.close(); // connection.close();
You can't convert the read bytes to a String until all your input is read, because a read may stop in the middle of a multi-byte UTF-8 sequence, leaving an invalid fragment at the buffer boundary.
Also, don't use ByteArrayOutputStream.toString() without arguments, because it uses the platform's default charset to decode bytes to characters, which is non-deterministic across systems. Instead use ByteArrayOutputStream.toString(String charsetName) and specify the encoding.
Also, you should use ServletRequest.getCharacterEncoding() to detect the encoding, falling back to UTF-8, for example, if it is unknown.
First read all input, and then convert it to a String:
String encoding = request.getCharacterEncoding();
if (encoding == null)
    encoding = "UTF-8";
// First read all input data
while ((bytesRead = responseData.read(buffer)) > 0) {
    baos.write(buffer, 0, bytesRead);
}
// We have all input, now convert it to String:
String text = baos.toString(encoding);
Better Alternative
Since you convert the binary input to a String, you should use ServletRequest.getReader() instead of reading binary data using ServletRequest.getInputStream() and converting it to String manually.
E.g. reading all lines:
BufferedReader reader = request.getReader();
StringBuilder sb = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    // Process line; here I just append it to a StringBuilder
    sb.append(line);
    // If you want to preserve newline characters, keep the next line:
    sb.append('\n');
}
I'm trying to write a program that can read different types of encoding from webpage responses. Right now I'm trying to figure out how to successfully read AMF data's response. Sending it is no problem, and with my HttpWrapper, it gets the response string just fine, but many of the characters get lost in translation. For that purpose, I'm trying to receive the response as bytes, to then convert into readable text.
The big thing I'm seeing is that characters literally get lost in translation. I use a program called Charles 3.8.3 to get an idea of what I should be seeing in the response, both hex-wise and AMF-wise. It's generally fine when it comes to normal characters, but whenever the response contains a byte that isn't valid UTF-8, I always get "ef bf bd". My code for reading the HTTP response is as follows:
BufferedReader d = new BufferedReader(new InputStreamReader(new DataInputStream(conn.getInputStream())));
while (d.read() != -1) {
    String bytes = new String(d.readLine().getBytes(), "UTF-8");
    result += bytes;
}
I then try to convert it to hex, as follows:
for (int x = 0; x < result.length(); x++) {
    byte b = (byte) result.charAt(x);
    System.out.print(String.format("%02x", b & 0xFF));
}
My output is: 0000000001000b2f312f6f6e526573756c7400046e756c6c00000**bf**
Whereas Charles 3.8.3 is: 0000000001000b2f312f6f6e526573756c7400046e756c6c00000**0b**
I'm at my wits end on how to resolve this, so any help would be greatly appreciated!
Thank you for your time
It looks like you're using readLine() because you're used to working with text. Wikipedia says AMF is a binary encoding, so you should be able to do something like the following, rather than going through an encode/decode no-op with a String (for that round-trip to work you'd need ISO-8859-1, not UTF-8):
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[2048];
try (InputStream in = conn.getInputStream()) {
    int read;
    while ((read = in.read(buffer)) >= 0) {
        out.write(buffer, 0, read);
    }
}
byte[] data = out.toByteArray();
// Convert to hex if you want.
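As for that "convert to hex" step, a minimal sketch that formats the raw bytes directly, so a 0x0b byte stays 0b instead of being mangled into ef bf bd by a text decoder:

```java
public class HexDump {
    // Format raw bytes as a lowercase hex string
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[]{0x00, 0x0b, (byte) 0xef})); // prints "000bef"
    }
}
```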
Your code assumes that every stream uses UTF-8 encoding. This is simply incorrect. You will need to inspect the content-type response header field.
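A sketch of such an inspection, assuming a URLConnection is at hand: the helper pulls the charset parameter out of the Content-Type value and falls back to a default. The parsing is deliberately minimal; real-world values may quote the charset or carry extra parameters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFromContentType {
    // Extract "charset=..." from a Content-Type value, falling back to a
    // default when it is absent
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.toLowerCase().startsWith("charset=")) {
                    return Charset.forName(part.substring("charset=".length()).replace("\"", ""));
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        // With a URLConnection: charsetOf(conn.getContentType(), StandardCharsets.UTF_8)
    }
}
```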
I want to know how to receive a string from a file in Java which has letters from different languages.
I used UTF-8. This decodes some languages' letters correctly, but Latin letters can't be displayed correctly.
So, how can I receive all languages' letters correctly?
Alternatively, is there another encoding that would allow me to receive all of them?
Here's my code:
URL url = new URL("http://google.cm");
URLConnection urlc = url.openConnection();
BufferedReader buffer = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF-8"));
StringBuilder builder = new StringBuilder();
int byteRead;
while ((byteRead = buffer.read()) != -1)
{
    builder.append((char) byteRead);
}
buffer.close();
text = builder.toString();
If I display "text", the letters are not shown correctly.
Reading a UTF-8 file is fairly simple in Java:
Reader r = new InputStreamReader(new FileInputStream(filename), "UTF-8");
If that isn't working, the issue lies elsewhere.
EDIT: According to iconv, Google Cameroon is serving invalid UTF-8. It seems to actually be iso-8859-1.
EDIT2: Actually, I was wrong. It serves (and declares) valid UTF-8 if the user agent contains "Mozilla/5.0" (or higher), but valid iso-8859-1 in (some) other cases. Obviously, the best bet is to use getContentType to check before decoding.