I'm trying to write a program that can read different types of encoding from webpage responses. Right now I'm trying to figure out how to successfully read AMF data's response. Sending it is no problem, and with my HttpWrapper, it gets the response string just fine, but many of the characters get lost in translation. For that purpose, I'm trying to receive the response as bytes, to then convert into readable text.
The big thing is that characters get lost in translation, literally. I use a program called Charles 3.8.3 to get an idea of what I should be seeing in the response, both hex-wise and AMF-wise. It's generally fine with ordinary characters, but whenever the response contains a byte that isn't valid UTF-8, I always get "ef bf bd." My code for reading the HTTP response is as follows:
BufferedReader d = new BufferedReader(new InputStreamReader(new DataInputStream(conn.getInputStream())));
while (d.read() != -1) {
    String bytes = new String(d.readLine().getBytes(), "UTF-8");
    result += bytes;
}
I then try to convert it to hex, as follows:
for (int x = 0; x < result.length(); x++) {
    byte b = (byte) result.charAt(x);
    System.out.print(String.format("%02x", b & 0xFF));
}
My output is: 0000000001000b2f312f6f6e526573756c7400046e756c6c00000**bf**
Whereas Charles 3.8.3 is: 0000000001000b2f312f6f6e526573756c7400046e756c6c00000**0b**
I'm at my wits end on how to resolve this, so any help would be greatly appreciated!
Thank you for your time
It looks like you're using readLine() because you're used to working with text. Wikipedia says AMF is a binary encoding, so you should be able to do something like the following, rather than round-tripping the bytes through a String as an encode/decode no-op (for that to work you'd need ISO-8859-1, not UTF-8).
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[2048];
try (InputStream in = conn.getInputStream()) {
    int read;
    while ((read = in.read(buffer)) >= 0) {
        out.write(buffer, 0, read);
    }
}
byte[] data = out.toByteArray();
// Convert to hex if you want.
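For the hex dump, a small helper like this (the class and method names are my own) reproduces the %02x formatting used in the question:

```java
public class HexDump {
    // Convert raw bytes to a lowercase hex string, e.g. {0x00, 0x0b} -> "000b".
    // The & 0xFF mask widens each byte to an int without sign extension.
    public static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[]{0x00, 0x0b, (byte) 0x9f}));
    }
}
```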
Your code assumes that every stream uses UTF-8 encoding. This is simply incorrect. You will need to inspect the content-type response header field.
Related
I am reading this file: https://www.reddit.com/r/tech/top.json?limit=100 into a BufferedReader from an HttpURLConnection. I've got it to read some of the file, but only about a tenth of what it should. Changing the size of the input buffer doesn't help - it prints the same thing, just in smaller chunks:
try {
    URL url = new URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    StringBuilder sb = new StringBuilder();
    int charsRead;
    char[] inputBuffer = new char[500];
    while (true) {
        charsRead = reader.read(inputBuffer);
        if (charsRead < 0) {
            break;
        }
        if (charsRead > 0) {
            sb.append(String.copyValueOf(inputBuffer, 0, charsRead));
            Log.d(TAG, "Value read " + String.copyValueOf(inputBuffer, 0, charsRead));
        }
    }
    reader.close();
    return sb.toString();
} catch (Exception e) {
    e.printStackTrace();
}
I believe the issue is that the text is all on one line since it's not formatted in json correctly, and BufferedReader can only take a line so long. Is there any way around this?
read() will keep returning data as long as there is any: each call picks up where the previous one left off and continues until the stream is exhausted. There is no limit on the total amount it can read. The only limit is the size of the array you pass per call, not the overall size of the file.
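To convince yourself that the read(char[]) loop has no line-length limit, here is a quick offline check with a StringReader standing in for the network stream (requires Java 11+ for String.repeat; the helper name is my own):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class LongLineDemo {
    // Drain a Reader with a small char buffer; no line-based limits apply.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[500];
        int charsRead;
        while ((charsRead = reader.read(buffer)) > 0) {
            sb.append(buffer, 0, charsRead);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // One 100,000-character "line" with no newline anywhere
        String oneLongLine = "x".repeat(100_000);
        System.out.println(readAll(new StringReader(oneLongLine)).length());
    }
}
```

The whole single-line input comes back intact, so a long unformatted JSON line cannot be the cause of the truncation.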
You could try the following:
try (InputStream is = connection.getInputStream();
     ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
    int read;
    byte[] buffer = new byte[4096];
    while ((read = is.read(buffer)) > 0) {
        baos.write(buffer, 0, read);
    }
    return new String(baos.toByteArray(), StandardCharsets.UTF_8);
} catch (Exception ex) {
    ex.printStackTrace(); // at minimum, don't swallow the exception silently
}
The above method is using purely the bytes from the stream and reading it into the output stream, then creating the string from that.
I suggest using a third-party HTTP client. It could reduce your code to literally just a few lines, and you wouldn't have to worry about all those little details. The bottom line is: someone has already written the code you are trying to write, and it works and is well tested. A few suggestions:
Apache Http Client - A well known and popular Http client, but might be a bit bulky and complicated for a simple case like yours.
Ok Http Client - Another well-known Http client
And finally, my favorite (because it is written by me): the MgntUtils open-source library, which includes an HTTP client. Maven artifacts can be found here; the GitHub repository, which includes the library as a jar file, source code, and Javadoc, is here; and the JavaDoc is here.
Just to demonstrate the simplicity of what you want to do here is the code using MgntUtils library. (I tested the code and it works like a charm)
private static void testHttpClient() {
    HttpClient client = new HttpClient();
    client.setContentType("application/json; charset=utf-8");
    client.setConnectionUrl("https://www.reddit.com/r/tech/top.json?limit=100");
    String content = null;
    try {
        content = client.sendHttpRequest(HttpMethod.GET);
    } catch (IOException e) {
        content = client.getLastResponseMessage() + TextUtils.getStacktrace(e, false);
    }
    System.out.println(content);
}
My wild guess is that your platform's default charset is UTF-8 while the content was encoded differently, which is where the encoding problems arose. For remote content the encoding should be specified explicitly, not assumed to equal the default encoding on your machine.
The charset of the response data must be correct, so the headers must be inspected. The default is Latin-1 (ISO-8859-1), but browsers interpret that as Windows Latin-1 (Cp1252).
String charset = connection.getContentType().replaceFirst("^.*?(charset=|$)", "");
if (charset.isEmpty()) {
    charset = "Windows-1252"; // Windows Latin-1
}
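That extraction can be checked offline. Note that replaceFirst is required here: String.replace treats its argument as a literal, not a regular expression, so it would never match. A sketch (class and method names are my own):

```java
public class CharsetExtract {
    // Strip everything up to and including "charset=", or the whole string
    // if no charset parameter is present, then fall back to Windows-1252.
    public static String charsetOf(String contentType) {
        String charset = contentType.replaceFirst("^.*?(charset=|$)", "");
        return charset.isEmpty() ? "Windows-1252" : charset;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/json; charset=utf-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```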
Then it is better to read bytes rather than chars, as there is no exact correspondence between the number of bytes read and the number of chars read. If the first char of a surrogate pair (two UTF-16 chars that together form one code point above U+FFFF) falls at the end of a buffer, I do not know how efficiently the underlying decoder "repairs" the split.
BufferedInputStream in = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[512];
while (true) {
    int bytesRead = in.read(buffer);
    if (bytesRead < 0) {
        break;
    }
    if (bytesRead > 0) {
        out.write(buffer, 0, bytesRead);
    }
}
return out.toString(charset);
And indeed it is safe to do:
sb.append(inputBuffer, 0, charsRead);
(Taking a copy was probably a repair attempt.)
By the way char[500] takes almost twice the memory of byte[512].
I saw that the site uses gzip compression in my browser. That makes sense for text such as json. I mimicked it by setting a request header Accept-Encoding: gzip.
URL url = new URL("https://www.reddit.com/r/tech/top.json?limit=100");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");
try (InputStream rawIn = connection.getInputStream()) {
    String charset = connection.getContentType().replaceFirst("^.*?(charset=|$)", "");
    if (charset.isEmpty()) {
        charset = "Windows-1252"; // Windows Latin-1
    }
    boolean gzipped = "gzip".equals(connection.getContentEncoding());
    System.out.println("gzip=" + gzipped);
    try (InputStream in = gzipped ? new GZIPInputStream(rawIn)
                                  : new BufferedInputStream(rawIn)) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[512];
        while (true) {
            int bytesRead = in.read(buffer);
            if (bytesRead < 0) {
                break;
            }
            if (bytesRead > 0) {
                out.write(buffer, 0, bytesRead);
            }
        }
        return out.toString(charset);
    }
}
It might be that, for clients that do not announce gzip support, the Content-Length of the compressed content is erroneously set in the response, which would be a bug.
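The GZIPInputStream handling above can be exercised offline by compressing and decompressing in memory; this sketch (helper names are my own) uses the same 512-byte copy loop as the answer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress a byte array in memory, as the server would for Accept-Encoding: gzip.
    public static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        return bos.toByteArray();
    }

    // Decompress by wrapping the raw stream in a GZIPInputStream and copying bytes.
    public static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[512];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, bytesRead);
            }
        }
        return out.toByteArray();
    }
}
```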
I believe the issue is that the text is all on one line since it's not formatted in json correctly, and BufferedReader can only take a line so long.
This explanation is not correct:
You are not reading a line at a time, and BufferedReader is not treating the text as line based.
Even when you do read from a BufferedReader a line at a time (i.e. using readLine()) the only limits on the length of a line are the inherent limits of a Java String length (2^31 - 1 characters), and the size of your heap.
Also, note that "correct" JSON formatting is subjective. The JSON specification says nothing about formatting. It is common for JSON emitters not to waste CPU cycles and network bandwidth on formatting JSON that a human will only rarely read. Application code that consumes JSON needs to be able to cope with this.
So what is actually going on?
Unclear, but here are some possibilities:
A StringBuilder also has an inherent limit of 2^31 - 1 characters. However, with (at least) some implementations, if you attempt to grow a StringBuilder beyond that limit, it will throw an OutOfMemoryError. (This behavior doesn't appear to be documented, but it is clear from reading the source code in Java 8.)
Maybe you are reading the data too slowly (e.g. because your network connection is too slow) and the server is timing out the connection.
Maybe the server has a limit on the amount of data that it is willing to send in a response.
Since you haven't mentioned any exceptions and you always seem to get the same amount of data, I suspect the third explanation is the correct one.
I'm trying to create a simple Java program that makes an HTTP request to an HTTP server hosted locally, using a Socket.
This is my code:
try {
    // Create connection
    Socket s = new Socket("localhost", 80);
    System.out.println("[CONNECTED]");
    DataOutputStream out = new DataOutputStream(s.getOutputStream());
    DataInputStream in = new DataInputStream(s.getInputStream());

    String header = "GET / HTTP/1.1\n"
            + "Host:localhost\n\n";
    byte[] byteHeader = header.getBytes();
    out.write(byteHeader, 0, header.length());

    String res = "";
    ///////////// READ PROCESS /////////////
    byte[] buf = new byte[in.available()];
    in.readFully(buf);
    System.out.println("\t[READ PROCESS]");
    System.out.println("\t\tbuff length->" + buf.length);
    for (byte b : buf) {
        res += (char) b;
    }
    System.out.println("\t[/READ PROCESS]");
    ///////////// END READ PROCESS /////////////

    System.out.println("[RES]");
    System.out.println(res);
    System.out.println("[CONN CLOSE]");
    in.close();
    out.close();
    s.close();
} catch (Exception e) {
    e.printStackTrace();
}
But when I run it, the server responds with a '400 Bad Request' error.
What is the problem? Maybe I need to add some HTTP headers, but I don't know which ones.
There are a couple of issues with your request:
String header = "GET / HTTP/1.1\n"
+ "Host:localhost\n\n";
The line break to be used must be Carriage-Return/Newline, i.e. you should change that to
String header = "GET / HTTP/1.1\r\n"
+ "Host:localhost\r\n\r\n";
Next problem comes when you write the data to the OutputStream:
byte[] byteHeader = header.getBytes();
out.write(byteHeader,0,header.length());
The call to getBytes() without specifying a charset uses the system's default charset, which might differ from the one needed here; better use getBytes("8859_1"). When writing to the stream, you use header.length(), which can differ from the length of the resulting byte array if the charset converts one character into multiple bytes (e.g. with UTF-8 as the encoding). Better use byteHeader.length.
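A quick demonstration of that char-count/byte-count mismatch: a string with one accented character has the same char length in both charsets, but one extra byte in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLength {
    public static void main(String[] args) {
        String s = "h\u00e9ader"; // 'é' (U+00E9) is one char but two bytes in UTF-8
        System.out.println(s.length());                                     // 6 chars
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 7 bytes
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 6 bytes
    }
}
```

So out.write(byteHeader, 0, header.length()) would silently drop trailing bytes whenever such characters appear; byteHeader.length cannot drift out of sync.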
out.write(byteHeader,0,header.length());
String res = "";
/////////////READ PROCESS/////////////
byte[] buf = new byte[in.available()];
After sending the header data you should flush() the OutputStream to make sure no internal buffer in the streams being used prevents the data from actually being sent to the server.
in.available() only returns the number of bytes you can read from the InputStream without blocking. It's not the length of the data being returned from the server. As a simple solution for starters, you can add Connection: close\r\n to your header data and simply read the data you're receiving from the server until it closes the connection:
StringBuffer sb = new StringBuffer();
byte[] buf = new byte[4096];
int read;
while ((read = in.read(buf)) != -1) {
    sb.append(new String(buf, 0, read, "8859_1"));
}
String res = sb.toString();
Oh, and independent of the topic of doing an HTTP request on your own:
String res = "";
for (byte b : buf) {
    res += (char) b;
}
This is a performance and memory nightmare: String is immutable, so each += allocates a new String and copies every character accumulated so far. For a 100 KB response, that quadratic copying can churn through gigabytes of temporary allocations, triggering many garbage collection runs in the process. Use a StringBuilder, or decode the whole byte array in one call, instead.
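A sketch of the same conversion done linearly; ISO-8859-1 maps byte n directly to char n, so one decode call produces the identical string (note the & 0xFF mask in the slow variant, which also avoids sign-extension for bytes above 0x7F):

```java
import java.nio.charset.StandardCharsets;

public class BytesToString {
    // Quadratic: each += copies everything accumulated so far into a new String
    public static String slow(byte[] buf) {
        String res = "";
        for (byte b : buf) {
            res += (char) (b & 0xFF);
        }
        return res;
    }

    // Linear: a single decode call over the whole array
    public static String fast(byte[] buf) {
        return new String(buf, StandardCharsets.ISO_8859_1);
    }
}
```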
Oh, and about the server's response: this most likely comes from the invalid line breaks. The server regards the whole header, including the empty line, as a single line and complains about the malformed GET request because of the extra data after "HTTP/1.1".
According to HTTP 1.1:
HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body [...].
So, you'll need all of your request to be ending with \r\n.
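For reference, a minimal well-formed request assembled with CRLF line endings might look like this (the helper name is my own; the blank line after the headers is what terminates the header section):

```java
public class RequestBuilder {
    // Build a minimal HTTP/1.1 GET request with proper CRLF line endings.
    public static String minimalGet(String host) {
        return "GET / HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Connection: close\r\n"
                + "\r\n"; // blank line terminates the header section
    }

    public static void main(String[] args) {
        System.out.print(minimalGet("localhost"));
    }
}
```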
I received some data from a server and read it with the following Java code:
is = new BufferedInputStream(connection.getInputStream());
reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
int length;
char[] buffer = new char[4096];
StringBuilder sb = new StringBuilder();
while ((length = reader.read(buffer)) != -1) {
    sb.append(new String(buffer, 0, length)); // buffer is already incorrect here
}
byte[] byteDatas = sb.toString().getBytes();
And I print byteDatas as Hex string:
Comparing to the wireshark's result:
Some bytes are decoded as ef bf bd; I know that's \ufffd (65533), which stands for invalid data.
So I think there must be a decoding error in my code. After debugging, I found that if I read the data directly from connection.getInputStream(), there is no invalid data.
So the problem must happen in BufferedReader or InputStreamReader. But I have already specified "UTF-8", and the data in Wireshark doesn't look that strange. Is UTF-8 not the correct charset? The server does not report one.
Please help me get BufferedReader to read the correct data.
UPDATE
My default charset is "UTF-8", and I have debugged to prove it. The data is already wrong right after read() returns, so it's not String's fault.
String.getBytes() will use the platform's default encoding (not necessarily UTF-8) to convert the characters of the String to bytes.
Quoting from the javadoc of String.getBytes():
Encodes this String into a sequence of bytes using the platform's default charset...
You can't compare the UTF-8 encoded input data to a result that might not be UTF-8 encoded. Instead, explicitly specify the encoding like this:
byte[] byteDatas = sb.toString().getBytes(StandardCharsets.UTF_8);
Note:
If your input data is NOT UTF-8 encoded text and you attempt to decode it as UTF-8, the decoder may replace invalid byte sequences. As a result, the bytes you get by encoding the String will not be the same as the raw input bytes.
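A short self-contained demonstration of that replacement behavior (the helper name is my own): 0xC3 opens a two-byte UTF-8 sequence, but the following byte is not a valid continuation, so the decoder substitutes U+FFFD, which re-encodes as the familiar ef bf bd.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decode raw bytes as UTF-8; malformed sequences become U+FFFD.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xC3 needs a continuation byte, but 0x28 ('(') is not one.
        String decoded = decodeUtf8(new byte[]{(byte) 0xC3, 0x28});
        // Re-encoding the replacement char yields EF BF BD, then 0x28 survives.
        for (byte b : decoded.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xFF);
        }
    }
}
```

This is why the round trip through a Reader changes the bytes: the damage happens at decode time and cannot be undone by re-encoding.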
I have to send a short string as text from client to server and then, after that, send a binary file.
How would I send both the binary file and the string over the same socket connection?
The server is a Java desktop application and the client is an Android tablet. I have already set it up to send text messages between the client and server in both directions. I have not yet done the binary file sending part.
One idea is to set up two separate servers running at the same time. I think this is possible if I use two different port numbers and run the servers on two different threads in the application, and I would have to set up two concurrent clients running in two services in the Android app.
The other idea is to somehow use an if-else statement to determine which of the two types is being sent, either text or binary, and use the appropriate method to receive it.
example code for sending text
PrintWriter out;
BufferedReader in;
out = new PrintWriter(new BufferedWriter(
        new OutputStreamWriter(socket.getOutputStream())), true);
in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
out.println("test out");
String message = in.readLine();
example code for sending binary file
BufferedOutputStream out;
BufferedInputStream in;
byte[] buffer = new byte[4096];
int length = 0;
out = new BufferedOutputStream(new FileOutputStream("test.pdf"));
in = new BufferedInputStream(new FileInputStream("replacement.pdf"));
while ((length = in.read(buffer)) > 0) {
    out.write(buffer, 0, length);
}
I don't think using two threads would be necessary in your case. Simply use the socket's InputStream and OutputStream in order to send binary data after you have sent your text messages.
Server Code
OutputStream stream = socket.getOutputStream();
PrintWriter out = new PrintWriter(
new BufferedWriter(
new OutputStreamWriter(stream)
)
);
out.println("test output");
out.flush(); // ensure that the string is not buffered by the BufferedWriter
byte[] data = getBinaryDataSomehow();
stream.write(data);
Client Code
InputStream stream = socket.getInputStream();
String message = readLineFrom(stream);
int dataSize = getSizeOfBinaryDataSomehow();
int totalBytesRead = 0;
byte[] data = new byte[dataSize];
while (totalBytesRead < dataSize) {
    int bytesRemaining = dataSize - totalBytesRead;
    int bytesRead = stream.read(data, totalBytesRead, bytesRemaining);
    if (bytesRead == -1) {
        return; // socket has been closed
    }
    totalBytesRead += bytesRead;
}
In order to determine the correct dataSize on the client side you have to transmit the size of the binary block somehow. You could send it as a String right before out.flush() in the Server Code or make it part of your binary data. In the latter case the first four or eight bytes could hold the actual length of the binary data in bytes.
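If you go the length-prefix route, DataOutputStream/DataInputStream already provide framing primitives. Here is an offline sketch (helper names are my own) using in-memory streams in place of the socket; writeUTF/readUTF frame the text themselves with a 2-byte length prefix, standing in for the println/readLine pair:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FramingDemo {
    // Writer side: framed text first, then a 4-byte length prefix, then the payload.
    public static byte[] frame(String message, byte[] payload) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(wire);
        out.writeUTF(message);        // writeUTF adds its own 2-byte length prefix
        out.writeInt(payload.length); // 4-byte length prefix for the binary block
        out.write(payload);
        return wire.toByteArray();
    }

    // Reader side: read in the same order; readFully blocks until every byte arrives.
    public static byte[] payloadOf(byte[] wire) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        String message = in.readUTF(); // the text part
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        return data;
    }
}
```

On a real connection you would pass socket.getOutputStream() and socket.getInputStream() to the same Data streams instead of the byte-array stand-ins.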
Hope this helps.
Edit
As #EJP correctly pointed out, using a BufferedReader on the client side will probably result in corrupted or missing binary data because the BufferedReader "steals" some bytes from the binary data to fill its buffer. Instead you should read the string data yourself and either look for a delimiter or have the length of the string data transmitted by some other means.
/* Reads all bytes from the specified stream until it finds a line-feed character (\n).
 * For simplicity's sake I'm reading one character at a time.
 * It might be better to use a PushbackInputStream, read more bytes at
 * once, and push the surplus bytes back into the stream...
 */
private static String readLineFrom(InputStream stream) throws IOException {
    InputStreamReader reader = new InputStreamReader(stream);
    StringBuffer buffer = new StringBuffer();
    for (int character = reader.read(); character != -1; character = reader.read()) {
        if (character == '\n') {
            break;
        }
        buffer.append((char) character);
    }
    return buffer.toString();
}
You can read about how the HTTP protocol works: it essentially sends "ASCII and human readable" headers (so to speak), after which any content can follow with an appropriate encoding, such as base64. You could create something similar yourself.
You need to first send the String, then the size of the byte array, then the byte array itself; use the String.startsWith() method to check what is being sent.
I want to know how to receive a string in Java that contains letters from different languages.
I used the UTF-8 format. This receives some languages' letters correctly, but Latin letters can't be displayed correctly.
So, how can I receive all languages' letters?
Alternatively, is there any other format which will allow me to receive all languages' letters?
Here's my code:
URL url = new URL("http://google.cm");
URLConnection urlc = url.openConnection();
BufferedReader buffer = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF-8"));
StringBuilder builder = new StringBuilder();
int byteRead;
while ((byteRead = buffer.read()) != -1) {
    builder.append((char) byteRead);
}
buffer.close();
text = builder.toString();
If I display the "text", the letters can't be displayed correctly.
Reading a UTF-8 file is fairly simple in Java:
Reader r = new InputStreamReader(new FileInputStream(filename), "UTF-8");
If that isn't working, the issue lies elsewhere.
EDIT: According to iconv, Google Cameroon is serving invalid UTF-8. It seems to actually be iso-8859-1.
EDIT2: Actually, I was wrong. It serves (and declares) valid UTF-8 if the user agent contains "Mozilla/5.0" (or higher), but valid iso-8859-1 in (some) other cases. Obviously, the best bet is to use getContentType to check before decoding.
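A hedged sketch of that getContentType check, parsing the Content-Type value directly so it can be tested offline (the class and method names are my own; Latin-1 is the historical HTTP default fallback):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {
    // Pick the charset declared in a Content-Type header value, falling back
    // to ISO-8859-1 when none is declared (or the header is missing entirely).
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.regionMatches(true, 0, "charset=", 0, 8)) {
                    return Charset.forName(part.substring(8).trim());
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

With a URLConnection you would call charsetOf(urlc.getContentType()) and pass the result to the InputStreamReader instead of hard-coding "UTF-8".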