DataOutputStream: purpose of the "encoded string too long" restriction

There is a strange restriction in the java.io.DataOutputStream.writeUTF(String str) method, which limits the size of a UTF-8 encoded string to 65535 bytes:
if (utflen > 65535)
    throw new UTFDataFormatException(
        "encoded string too long: " + utflen + " bytes");
It is strange, because:
there is no information about this restriction in the Javadoc of this method
this restriction can easily be worked around by copying and modifying the internal static int writeUTF(String str, DataOutput out) method of this class
there is no such restriction in the opposite method, java.io.DataInputStream.readUTF().
Given the above, I cannot understand the purpose of such a restriction in the writeUTF method. What have I missed or misunderstood?

The Javadoc of DataOutputStream.writeUTF states:
First, two bytes are written to the output stream as if by the
writeShort method giving the number of bytes to follow. This value
is the number of bytes actually written out, not the length of the
string.
Two bytes means 16 bits: the maximum unsigned integer one can encode in 16 bits is 2^16 - 1 == 65535.
DataInputStream.readUTF has the exact same restriction, because it first reads the number of UTF-8 bytes to consume, in the form of a 2-byte integer, which again can only have a maximum value of 65535.
writeUTF first writes two bytes with the length, which has the same result as calling writeShort with the length and then writing the UTF-encoded bytes. writeUTF doesn't actually call writeShort - it builds up a single byte[] with both the 2-byte length and the UTF bytes. But that is why the Javadoc says "as if by the writeShort method" rather than just "by the writeShort method".
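If you control both ends and genuinely need longer strings, a common workaround is to length-prefix the raw UTF-8 bytes with a 4-byte int yourself. Here is a minimal sketch; note it uses standard UTF-8 rather than the modified UTF-8 of writeUTF, so it is deliberately not wire-compatible with readUTF:
import java.io.*;
import java.nio.charset.StandardCharsets;

static void writeLongUTF(DataOutputStream out, String s) throws IOException {
    byte[] utf = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(utf.length);   // 4-byte length prefix: up to Integer.MAX_VALUE bytes
    out.write(utf);
}

static String readLongUTF(DataInputStream in) throws IOException {
    byte[] utf = new byte[in.readInt()];
    in.readFully(utf);          // read exactly the announced number of bytes
    return new String(utf, StandardCharsets.UTF_8);
}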

Related

Java 11 Compact Strings magic behind char[] to byte[]

I have been reading about Unicode encoding and Java 9 compact Strings for the last two days, and I am getting along quite well. But there is something that I don't understand.
About the byte data type
1) It is 8-bit storage, ranging from -128 to 127.
Questions
1) Why didn't Java implement byte as unsigned, the way char is an unsigned 16-bit type? It would then have a range of 0 to 255. From 0 to 127 I can only hold an ASCII value, but what happens if I set the value 200? An extended-ASCII value would overflow to -56.
2) Does the negative value mean something? I have tried a simple example using Java 11:
final char value = (char) 200; // in a byte this would overflow
final String stringValue = new String(new char[]{value});
System.out.println(stringValue); // prints the same value as in Java 8
I have checked the String.value field and I see this byte array:
System.out.println(value[0]); // -56
The same question as before arises: does the -56 (the negative value) mean something? In other languages this overflow is detected and the value comes back as 200. How can Java know that the -56 value is the same as 200 in a char?
I have tried harder examples, like code point 128048, and I see in the String.value field an array of bytes like this:
0 = 61
1 = -40
2 = 48
3 = -36
I know this code point takes 4 bytes. I get how char[] is transformed to byte[], but I don't know how String handles this byte[] data.
Sorry if this question is simple, and sorry for any typos; English is not my native language. Thanks a lot.
Why didn't Java implement byte as unsigned, the way char is an unsigned 16-bit type? It would then have a range of 0 to 255. From 0 to 127 I can only hold an ASCII value, but what happens if I set the value 200? An extended-ASCII value would overflow to -56.
Java’s primitive data types were settled with Java 1.0, a quarter century ago. Compact strings were introduced in Java 9, less than two years ago. This new feature, which is merely an implementation detail, did not justify fundamental changes to Java’s type system.
Besides that, you are looking at one interpretation of the data stored in a byte. For the sake of representing iso-latin-1 units, it is entirely irrelevant whether interpreting the same data as Java’s built-in signed byte would result in a positive or negative number.
Likewise, Java’s I/O API allows reading a file into a byte[] array and writing byte[] arrays back to files, and these two operations are already sufficient to copy a file losslessly, regardless of its file format, which would only matter when interpreting its content.
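Such a lossless copy can be sketched with java.nio.file (the file names are placeholders, and the whole file is assumed to fit in memory):
import java.nio.file.*;

byte[] content = Files.readAllBytes(Paths.get("in.bin")); // raw bytes, no charset involved
Files.write(Paths.get("out.bin"), content);               // written back bit for bit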
Coming back to interpretations, the following works since Java 1.1:
byte[] bytes = "È".getBytes("iso-8859-1");
System.out.println(bytes[0]);
System.out.println(bytes[0] & 0xff);
This prints:
-56
200
The two numbers, -56 and 200, are just different interpretations of the bit pattern 11001000, whereas the iso-latin-1 interpretation of a byte containing that bit pattern is the character È.
A char value is also just an interpretation of a two byte quantity, i.e. as UTF-16 code unit. Likewise, a char[] array is a sequence of bytes in the computer’s memory with a standard interpretation.
We can also interpret other byte sequences this way.
StringBuilder sb = new StringBuilder().appendCodePoint(128048);
byte[] array = new byte[4];
StandardCharsets.UTF_16LE.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
System.out.println(Arrays.toString(array));
will print the value you’ve seen, [61, -40, 48, -36].
The advantage of using a byte[] array inside the String class is that the interpretation can now be chosen: iso-latin-1 when all characters are representable in that encoding, or utf-16 otherwise.
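The idea behind that choice can be illustrated with a sketch (an illustration only, not the JDK's actual internal code, which instead records the chosen encoding in a coder field):
import java.nio.charset.StandardCharsets;

static byte[] compactEncode(String s) {
    // If every char fits into a single byte, store iso-latin-1; otherwise use UTF-16.
    boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
    return s.getBytes(latin1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16LE);
}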
The possible numeric interpretations are irrelevant to the string. However, when you ask “How can Java know that -56 value is the same as 200”, you should ask yourself, how does it know that the bit pattern 11001000 of a byte is -56 in the first place?
System.out.println(value[0]);
involves an actually expensive operation compared to ordinary computer arithmetic: the conversion of a byte (or an int) to a String. This conversion is often overlooked, as it has been defined as the default way of printing a byte, but it is no more natural than a conversion to a String that interprets the value as an unsigned quantity. For further reading, I recommend Two's complement.
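Both readings of the same bit pattern can be printed explicitly (Byte.toUnsignedInt exists since Java 8):
byte b = (byte) 0b1100_1000;
System.out.println(b);                     // -56, the signed two's complement reading
System.out.println(Byte.toUnsignedInt(b)); // 200, the unsigned reading of the same bits
System.out.println(b & 0xFF);              // 200 again, the classic masking idiom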
This is because not all bytes in a string are interpreted the same way. It depends on the string's character encoding.
Example:
in a UTF-8 string, ASCII characters are one byte (8 bits) in size; other characters take two to four bytes.
in a UTF-16 string, characters are two bytes (16 bits) in size; supplementary characters take two such units.
etc...
This means that if the string is to be decoded as UTF-8, an ASCII character is made by reading 1 byte at a time; if UTF-16, a character is made by reading 2 bytes at a time.
Look at this code: a single byte array, data, is decoded to a String using UTF-8 and then UTF-16.
byte[] data = new byte[] {97, 98, 99, 100};
System.out.println(new String(data, StandardCharsets.UTF_8));
System.out.println(new String(data, StandardCharsets.UTF_16));
The output of this code is:
abcd // 4 bytes = 4 chars, 1 byte per char
慢捤 // 4 bytes = 2 chars, 2 bytes per char
Going back to the question: what motivated the developers was reducing the memory footprint of strings. Not all strings use all 16 bits a char offers.

Sockets: Passing unsigned char array from C to JAVA

C side:
unsigned char myBuffer[62];
fread(myBuffer,sizeof(char),62,myFile);
send(mySocket, myBuffer, 62,0);
Java side:
bufferedReader.read(tempBuffer,0,62);
Now, in the Java program, I receive (over the socket) values less than 0x80 from the C program with no problems, but I receive the value 0xFD for all values equal to or greater than 0x80 in the C program.
I need a solution for this problem.
Don't use a Reader to read bytes, use an InputStream!
A Reader is meant to read characters; it receives a stream of bytes and tries to convert these bytes to characters, so you lose the original bytes.
In more detail, a Reader will use a CharsetDecoder; this decoder is configured so that unknown byte sequences are replaced, and the encoding used here likely replaces unknown byte sequences with the character 0x00fd, hence your result.
Also, you don't care about signed vs unsigned; 1000 0000 may be 128 as an unsigned char in C and -128 as a byte in Java, but it still remains 1000 0000.
If what you send is really text, then it means the charset you have chosen for decoding is not the right one; you must know the encoding in which the files on your original system are written.
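A sketch of the receiving side using a raw InputStream (assuming a java.net.Socket named mySocket, to match the C side's fixed 62-byte message):
InputStream in = mySocket.getInputStream();
byte[] tempBuffer = new byte[62];
int off = 0;
while (off < tempBuffer.length) {
    int n = in.read(tempBuffer, off, tempBuffer.length - off);
    if (n < 0) break;                  // peer closed before all 62 bytes arrived
    off += n;
}
int firstValue = tempBuffer[0] & 0xFF; // recovers the C unsigned value (0..255)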

Initializing ByteArrayOutputStream?

I am new to MQTT and Android Open Accessory (AOA). While reading a tutorial I noticed that, before anything else is written to a variable of type ByteArrayOutputStream, 0 or 0x00 is written to it first.
Is this some kind of initialization? Below are two examples:
EX_1
variableHeader.write(0x00);
variableHeader.write(PROTOCOL_NAME.getBytes("UTF-8").length);
variableHeader.write(PROTOCOL_NAME.getBytes("UTF-8"));
EX_2
public static byte[] connect() throws UnsupportedEncodingException, IOException {
    String identifier = "android";
    ByteArrayOutputStream payload = new ByteArrayOutputStream();
    payload.write(0);
    payload.write(identifier.length());
    // (rest of the method omitted in the question)
}
This is not any kind of initialization needed by the ByteArrayOutputStream. Calling write(0) simply inserts a 0-byte as the first byte in the byte array.
Instead, the byte must have meaning to the MQTT protocol. I'm not familiar with it, but a quick look at the MQTT protocol specification reveals that strings are encoded by writing the string bytes in UTF-8, prefixed by a 2-byte length field, upper byte first.
In both of the examples you give, strings are being written, but only one length byte is written. The 0 byte, then, must be the other length byte. The code is a bit sloppy: it assumes the strings in your case are less than 256 bytes long, so the upper length byte can always be assumed to be 0.
If there is any possibility of the "protocol name" being 256 bytes or longer, then the proper way to write this code:
variableHeader.write(0x00);
variableHeader.write(PROTOCOL_NAME.getBytes("UTF-8").length);
variableHeader.write(PROTOCOL_NAME.getBytes("UTF-8"));
would be:
byte[] stringBytes = PROTOCOL_NAME.getBytes("UTF-8");
variableHeader.write(stringBytes.length >> 8); // upper length byte
variableHeader.write(stringBytes.length & 0xFF); // lower length byte
variableHeader.write(stringBytes); // actual data
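The same logic can be wrapped in a small helper (a sketch; the name writeMqttString is chosen here for illustration and is not part of any MQTT library):
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

static void writeMqttString(ByteArrayOutputStream out, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.write(bytes.length >> 8);      // upper length byte
    out.write(bytes.length & 0xFF);    // lower length byte
    out.write(bytes, 0, bytes.length); // the UTF-8 encoded string data
}
Note that it encodes the string once and uses the byte length, not String.length(), which counts chars and differs from the byte count for non-ASCII text.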

FileOutputStream write method (int) java

From the API, the method write(int b) takes an int representing a byte, so that, when EOF comes, -1 can be returned.
However, it is possible to do the following:
FileOutputStream fi = new FileOutputStream(file);
fi.write(100000);
I expected this not to compile, as the number exceeds the byte range.
How does the JVM interpret it exactly?
Thanks in advance.
From the OutputStream.write(int) doc:
Writes the specified byte to this output stream. The general contract for write is that one byte is written to the output stream. The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
Emphasis mine.
Note that the method takes an int. And since 100000 is a valid integer literal, there is no reason for it not to compile.
Where did you read that part about EOF and -1?
The method just writes one byte, which for some reason is passed along as an int.
I expected this not to compile, as the number exceeds the byte range
No, this will compile okay. The compiler just looks for an int. (A long would not compile).
Everything except the lowest 8 bits will be ignored.
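You can observe the truncation directly; 100000 is 0x000186A0, so only the low-order byte 0xA0 survives:
import java.io.ByteArrayOutputStream;

ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(100000);                   // keeps only the low-order 8 bits: 0xA0
byte[] bytes = out.toByteArray();
System.out.println(bytes.length);    // 1
System.out.println(bytes[0] & 0xFF); // 160, i.e. 0xA0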

OutputStream.write(int) only writes 1 byte to file?

Hello, I have the following code:
int i=12345;
DataOutputStream dos=new DataOutputStream(new FileOutputStream("Raw.txt"));
dos.write(i);
dos.close();
System.out.println(new File("Raw.txt").length());
The file size is being reported as 1 byte. Why is it not 4 bytes when an integer is 4 bytes long?
Thanks
Because you only wrote one byte to it. See the Javadoc for DataOutputStream.write(int). It writes a byte, not an int.
While the DataOutputStream.write method takes an int argument, it actually only writes the bottom 8 bits of that argument. So you actually wrote only one byte ... and hence the file is one byte long.
If you want to write the entire int you should use the writeInt(int) method.
The underlying reason for this strangeness is (I believe) that the write(int) method is defined to be consistent with OutputStream.write(int), which in turn is defined to be consistent with InputStream.read(). InputStream.read() reads a byte and returns it as an int, with the value -1 used to indicate the end-of-stream condition.
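For comparison, a sketch of the same program using writeInt, which writes all four bytes of the int, high byte first:
import java.io.*;

int i = 12345;                                    // 0x00003039
try (DataOutputStream dos = new DataOutputStream(new FileOutputStream("Raw.txt"))) {
    dos.writeInt(i);                              // writes 0x00 0x00 0x30 0x39
}
System.out.println(new File("Raw.txt").length()); // 4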
