Java 11 Compact Strings magic behind char[] to byte[]

I have been reading about Unicode encoding and Java 9 compact Strings for the last two days, and I am getting along quite well. But there is something that I don't understand.
About the byte data type
1) It is 8-bit storage with a range from -128 to 127.
Questions
1) Why didn't Java implement it as unsigned, like char (unsigned 16 bits)? I mean, it would then have a range of 0 to 255. From 0 to 127 I can only hold an ASCII value, so what happens if I set the value 200 (an extended ASCII value)? It overflows to -56.
2) Does the negative value mean something? I tried a simple example using Java 11:
final char value = (char)200;//in byte would overflow
final String stringValue = new String(new char[]{value});
System.out.println(stringValue);//THE SAME VALUE OF JAVA 8
I have checked the String.value field and I see a byte array:
System.out.println(value[0]);//-56
The same question as before arises: does the -56 (the negative value) mean something? In other languages, is this overflow detected and turned back into the value 200? How can Java know that the value -56 is the same as 200 in a char?
I have tried harder examples, like code point 128048, and I see in the String.value field an array of bytes like this:
0 = 61
1 = -40
2 = 48
3 = -36
I know this code point takes 4 bytes, and I get how the char[] is transformed to a byte[], but I don't know how String handles this byte[] data.
Sorry if this question is simple, and sorry for any typos; English is not my native language. Thanks a lot.

Why didn't Java implement it as unsigned, like char (unsigned 16 bits)? I mean, it would then have a range of 0 to 255. From 0 to 127 I can only hold an ASCII value, so what happens if I set the value 200 (an extended ASCII value)? It overflows to -56.
Java’s primitive data types were settled with Java 1.0 a quarter century ago. The compact strings were introduced in Java 9, less than two years ago. This new feature, which is merely an implementation detail, did not justify fundamental changes to Java’s type system.
Besides that, you are looking at one interpretation of the data stored in a byte. For the sake of representing iso-latin-1 units, it is entirely irrelevant whether interpreting the same data as Java’s built-in signed byte would result in a positive or negative number.
Likewise Java’s I/O API allows reading a file into a byte[] array and write byte[] arrays back to files and these two operations are already sufficient to copy a file losslessly, regardless of its file format which would be relevant when interpreting its content.
So the following works since Java 1.1:
byte[] bytes = "È".getBytes("iso-8859-1");
System.out.println(bytes[0]);
System.out.println(bytes[0] & 0xff);
-56
200
The two numbers, -56 and 200 are just different interpretations of the bit pattern 11001000 whereas the iso-latin-1 interpretation of a byte containing the bit pattern 11001000 is the character È.
A char value is also just an interpretation of a two byte quantity, i.e. as UTF-16 code unit. Likewise, a char[] array is a sequence of bytes in the computer’s memory with a standard interpretation.
We can also interpret other byte sequences this way.
StringBuilder sb = new StringBuilder().appendCodePoint(128048);
byte[] array = new byte[4];
StandardCharsets.UTF_16LE.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
System.out.println(Arrays.toString(array));
will print the value you’ve seen, [61, -40, 48, -36].
The advantage of using a byte[] array inside the String class is that the interpretation can now be chosen: iso-latin-1 when all characters are representable in that encoding, or utf-16 otherwise.
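To make the "chosen interpretation" idea concrete, here is a minimal sketch of such a scheme. It is a simplified model and not the JDK's actual code: the class CompactSketch, the LATIN1/UTF16 flags and the helper methods are names made up for illustration, and the low-byte-first order for two-byte units is an assumption chosen to match the [61, -40, 48, -36] output above.

```java
import java.util.Arrays;

public class CompactSketch {
    // Hypothetical coder flags (illustrative names, not the JDK's API).
    public static final byte LATIN1 = 0;
    public static final byte UTF16 = 1;

    // LATIN1 if every char fits into one byte, otherwise UTF16.
    public static byte chooseCoder(char[] chars) {
        for (char c : chars) {
            if (c > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // One byte per char for LATIN1; two bytes per char, low byte first
    // (an assumption of this sketch), for UTF16.
    public static byte[] pack(char[] chars, byte coder) {
        if (coder == LATIN1) {
            byte[] out = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                out[i] = (byte) chars[i];
            }
            return out;
        }
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) chars[i];
            out[2 * i + 1] = (byte) (chars[i] >> 8);
        }
        return out;
    }

    // Reads a char back; "& 0xff" discards the sign extension of the byte.
    public static char charAt(byte[] value, byte coder, int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xff);
        }
        int lo = value[2 * index] & 0xff;
        int hi = value[2 * index + 1] & 0xff;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        char[] latin = {(char) 200};
        byte coder = chooseCoder(latin);
        byte[] value = pack(latin, coder);
        System.out.println(Arrays.toString(value));        // [-56]
        System.out.println((int) charAt(value, coder, 0)); // 200

        char[] rabbit = Character.toChars(128048);
        byte[] wide = pack(rabbit, chooseCoder(rabbit));
        System.out.println(Arrays.toString(wide));         // [61, -40, 48, -36]
    }
}
```

The stored -56 never needs to "become" 200 again as a number; the mask in charAt simply reads the same eight bits as an unsigned quantity.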
The possible numeric interpretations are irrelevant to the string. However, when you ask “How can Java know that -56 value is the same as 200”, you should ask yourself, how does it know that the bit pattern 11001000 of a byte is -56 in the first place?
System.out.println(value[0]);
involves an operation that is actually expensive compared to ordinary computer arithmetic: the conversion of a byte (or an int) to a String. This conversion is often overlooked because it has been defined as the default way of printing a byte, but it is no more natural than a conversion to a String that interprets the value as an unsigned quantity. For further reading, I recommend Two's complement.
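As a small illustration of that point, the same byte can be converted to text in several ways, and the signed decimal form is only one of them. The class name ByteViews and the helper bits are made up for this sketch; Byte.toUnsignedInt and Integer.toBinaryString are standard library methods.

```java
public class ByteViews {
    // Renders the 8 bits of a byte as text, left-padded with zeros.
    public static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xff);
        return String.format("%8s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte b = (byte) 200;
        System.out.println(b);                     // -56 (signed view)
        System.out.println(Byte.toUnsignedInt(b)); // 200 (unsigned view)
        System.out.println(bits(b));               // 11001000 (raw bits)
    }
}
```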

This is because not all bytes in a string are interpreted the same way. It depends on the string's character encoding.
Example:
in a UTF-8 string, the code units are 8 bits in size (a character takes 1 to 4 of them).
in a UTF-16 string, the code units are 16 bits in size (a character takes 1 or 2 of them).
etc...
This means that if the string is represented as UTF-8, it is read 1 byte at a time; if it is represented as UTF-16, it is read 2 bytes at a time.
Look at this code: a single byte array is transformed to a string using UTF-8 and UTF-16.
byte[] data = new byte[] {97, 98, 99, 100};
System.out.println(new String(data, StandardCharsets.UTF_8));
System.out.println(new String(data, StandardCharsets.UTF_16));
The output of this code is:
abcd // 4 bytes = 4 chars, 1 byte per char
慢捤 // 4 bytes = 2 chars, 2 byte per char
Going back to the question: what motivated the developers was reducing the memory footprint of strings. Not all strings use all 16 bits a char offers.
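To see how much space different characters actually need in each encoding, a rough sketch like the following can help. The class name EncodedLengths and its helpers are made up for illustration; the strings are built from code points so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (no byte order mark).
    public static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        // 'a', U+00C8, the euro sign U+20AC, and code point 128048.
        int[] codePoints = {97, 200, 0x20AC, 128048};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            System.out.println(cp + ": UTF-8 = " + utf8Length(s)
                    + " bytes, UTF-16 = " + utf16Length(s) + " bytes");
        }
    }
}
```

For these four code points the UTF-8 lengths are 1, 2, 3 and 4 bytes, while the UTF-16 lengths are 2, 2, 2 and 4 bytes.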

Related

String hex hash to bytes

I have String hash in hex form ("e6fb06210fafc02fd7479ddbed2d042cc3a5155e") and I would like to compare it to crypt.digest().
One way, which works fine, is to convert crypt.digest() to hex, but I would like to avoid multiple conversions and rather convert hash from hex form (above) to byte array.
What I tried was:
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
but it does not match with crypt.digest(). When I convert hashBytes back to hex I get "00e6fb06210fafc02fd7479ddbed2d042cc3a5155e".
The leading zeros seem to be the reason why I fail to match byte arrays. Why do they occur? How can I get the same result using crypt.digest() and toByteArray?

The reason for the extra 00 is that e6 has its high (sign) bit set.
BigInteger adds a redundant 00 byte so that the value stays unsigned (positive).
String hash = "e6fb06210fafc02fd7479ddbed2d042cc3a5155e";
byte[] hashBytes = new BigInteger(hash, 16).toByteArray();
hashBytes = hashBytes.length > 1 && hashBytes[0] == 0
? Arrays.copyOfRange(hashBytes, 1, hashBytes.length) : hashBytes;
System.out.println(Arrays.toString(hashBytes));
The question arises, what if the hash actually starts with a 00?
Then you need the hash length, or do a lenient comparison.
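One way to sidestep both problems is to decode the hex string two digits at a time, so the array length always follows from the string length. This is only a sketch; the class HexBytes and the method fromHex are names made up here, not an existing library API.

```java
import java.security.MessageDigest;

public class HexBytes {
    // Decodes a hex string into bytes, two hex digits per byte.
    // Leading 00 bytes are preserved and no sign byte is added.
    public static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String hash = "e6fb06210fafc02fd7479ddbed2d042cc3a5155e";
        byte[] hashBytes = fromHex(hash);
        System.out.println(hashBytes.length); // 20
        // MessageDigest.isEqual compares two byte arrays for equality.
        System.out.println(MessageDigest.isEqual(hashBytes, fromHex(hash))); // true
    }
}
```

The result can then be compared directly with the array returned by crypt.digest().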

The answer can be found in the following answer from a thread about the highly related question Convert a string representation of a hex dump to a byte array using Java?:
The issue with BigInteger is that there must be a "sign bit". If the leading byte has the high bit set then the resulting byte array has an extra 0 in the 1st position. But still +1.
– Gray Oct 28 '11 at 16:20
Since the first bit has a special meaning (indicating the sign, 0 for positive, 1 for negative), BigInteger will prefix the data with an additional 0 in case your data started with a 1 on the high bit. Otherwise it would be interpreted as negative although it was not negative to begin with.
I.e. data like
101110
is turned into
0101110
You could easily undo this manually by using Arrays.copyOfRange(data, 1, data.length) if it happens.
However, instead of fixing that code, I would suggest using one of the other solutions posted in the linked thread. They are cleaner and easier to read and maintain.
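For reference, the behaviour described above is easy to reproduce with short inputs. The following sketch (the class name SignByteDemo and the helper are made up) shows the extra sign byte as well as the opposite case, where a leading 00 in the hex string is dropped:

```java
import java.math.BigInteger;
import java.util.Arrays;

public class SignByteDemo {
    // Parses the hex text as a number and returns BigInteger's byte form.
    public static byte[] viaBigInteger(String hex) {
        return new BigInteger(hex, 16).toByteArray();
    }

    public static void main(String[] args) {
        // High bit set: a 00 sign byte is prepended.
        System.out.println(Arrays.toString(viaBigInteger("e6")));   // [0, -26]
        // High bit clear: no extra byte.
        System.out.println(Arrays.toString(viaBigInteger("7f")));   // [127]
        // Leading zero digits vanish, because the value is just 18.
        System.out.println(Arrays.toString(viaBigInteger("0012"))); // [18]
    }
}
```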

What is the difference in bytes of a number as a string and as an integer?

Let's say we have my_string = "123456"
I do
my_string.getBytes()
and
BigInteger.valueOf(123456).toByteArray()
The resulting byte arrays are different in these two cases. Why is that so? Isn't "123456" the same as 123456, apart from the difference in data type?

They are different because the String type is made up of Unicode characters. The character '2' is not at all the same as the numeric value 2.

No. Why would they be? "123456" is a sequence of the ASCII character 1 (which is not represented as the number 1, but as the number 49), followed by the character 2 (50), and so on. 123456 as an int isn't even represented as a sequence of digits from 0-9; it is stored as a number in binary.
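Printing both arrays makes the difference visible. A small sketch (the class name TextVersusNumber and its helpers are made up; US-ASCII is chosen explicitly so the result does not depend on the platform default):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextVersusNumber {
    // One byte per ASCII character of the text.
    public static byte[] asText(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Big-endian two's-complement bytes of the numeric value.
    public static byte[] asNumber(long n) {
        return BigInteger.valueOf(n).toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(asText("123456"))); // [49, 50, 51, 52, 53, 54]
        System.out.println(Arrays.toString(asNumber(123456))); // [1, -30, 64]
    }
}
```

The second array is 0x01E240 split into bytes, since 123456 = 1 * 65536 + 226 * 256 + 64, and 226 is shown as -30 by the signed byte type.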

I assume that you are asking about the total memory used to represent a number as a String versus a byte[].
The String size will depend on the actual string representation used. This depends on the JVM version; see What is the Java's internal represention for String? Modified UTF-8? UTF-16?
For Java 8 and earlier (with some caveats), the String consists of a String object with 1 int field and 1 reference field. Assuming 64 bit references, that adds up to 8 bytes of header + 1 x 4 bytes + 1 x 8 bytes + 4 bytes of padding. Then add the char[] used to represent the characters: 12 bytes of header + 2 bytes per character. This needs to be rounded up to a multiple of 8.
For Java 9 and later, the main object has the same size. (There is an extra field ... but that fits into the "padding".) The char[] is replaced by a byte[], and since you are just storing ASCII decimal digits (see note 1), they will be encoded one character per byte.
In short, the asymptotic space usage is 1 byte per decimal digit for Java 9 or later and 2 bytes per decimal digit in Java 8 or earlier.
For the byte[] representation produced from a BigInteger, the representation consists of 12 bytes of header + 1 byte per byte ... rounded up to a multiple of 8. The asymptotic size is 1 byte per byte.
In both cases there is also the size of the reference to the representation; i.e. another 8 bytes.
If you do the sums, the byte[] representation is more compact than the String representation in all cases. But int or long are significantly more compact than either of these representations in all cases.
1 - If you are not ... or if you are curious why I added this caveat ... read the Q&A at the link above!

Converting String to UTF-8 byte array returns a negative value in Java

Let's say I have a byte array and I try to decode it as UTF-8 using the following
String tekst = new String(result2, StandardCharsets.UTF_8);
System.out.println(tekst);
//where result2 is the byte array
Then, I get the bytes using getBytes() with values from 0 to 128
byte[] orig = tekst.getBytes();
And then, I wish to do a frequency count of my byte[] orig using the following:
int[] frequencies = new int[256];
for (byte b: orig){
frequencies[b]++;
}
Everything goes well till I encounter an error which states
java.lang.ArrayIndexOutOfBoundsException: -61
Does that mean that my byte still contains negative values despite converting it to UTF-8? Is there something wrong with what I'm doing? Can someone please give me clarity on this, because I'm still a beginner on the subject. Thank you.

Answering the specific question
Does that mean that my byte still contains negative values despite converting it to UTF-8?
Yes, absolutely. That's because byte is signed in Java. A byte value of -61 would be 195 as an unsigned value. You should expect to get bytes which aren't in the range 0-127 when you encode any non-ASCII text with UTF-8.
The fix is easy: just map the value into the range 0-255 with a bit mask:
frequencies[b & 0xff]++;
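Put together, a complete version of the counting loop could look like the sketch below. The class name ByteFrequency and the method name are made up; the sample input is U+00C8, whose UTF-8 encoding starts with the byte 0xC3, which is exactly the -61 from the exception message.

```java
import java.nio.charset.StandardCharsets;

public class ByteFrequency {
    // Counts how often each unsigned byte value (0-255) occurs.
    public static int[] byteFrequencies(byte[] data) {
        int[] frequencies = new int[256];
        for (byte b : data) {
            // The mask turns -128..-1 into 128..255, a valid array index.
            frequencies[b & 0xff]++;
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String text = new String(Character.toChars(200)); // U+00C8
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // 0xC3 0x88
        int[] frequencies = byteFrequencies(utf8);
        System.out.println(utf8[0]);           // -61
        System.out.println(frequencies[0xC3]); // 1
        System.out.println(frequencies[0x88]); // 1
    }
}
```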
Addressing what you're attempting to do
This line:
String tekst = new String(result2, StandardCharsets.UTF_8);
... is only appropriate if result2 is genuinely UTF-8-encoded text. It's not appropriate if result2 is some arbitrary binary data such as an image, compressed data, or even text encoded in some other encoding.
If you want to preserve arbitrary binary data as a string, you should use something like Base64 or hex. Basically, you need to determine whether your data is inherently textual (in which case, you should use strings for as much of the time as possible, and use an appropriate Charset to convert to binary where necessary) or inherently binary (in which case you should use bytes for as much of the time as possible, and use base64 or hex to convert to text where necessary).
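To illustrate the difference, here is a short sketch that sends the same arbitrary bytes through a UTF-8 String and through Base64. The class name BinaryAsText and its helpers are made up; the sample bytes are deliberately not valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryAsText {
    // True if the bytes come back unchanged after decoding and
    // re-encoding them as UTF-8 text.
    public static boolean survivesUtf8(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, text.getBytes(StandardCharsets.UTF_8));
    }

    // True if the bytes come back unchanged after a Base64 round trip.
    public static boolean survivesBase64(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Arrays.equals(data, Base64.getDecoder().decode(text));
    }

    public static void main(String[] args) {
        // 0xC3 followed by 0x28, and a lone 0xFF, are malformed in UTF-8.
        byte[] binary = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00};
        System.out.println(survivesUtf8(binary));   // false
        System.out.println(survivesBase64(binary)); // true
    }
}
```

The UTF-8 round trip fails because malformed sequences are replaced during decoding, so the original bytes cannot be recovered.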
This line:
byte[] orig = tekst.getBytes();
... is almost always a bad idea. It uses the platform-default encoding to convert a string to bytes. If you really, really want to use the platform-default encoding, I would make that explicit:
byte[] orig = tekst.getBytes(Charset.defaultCharset());
... but this is an extremely unusual requirement these days. It's almost always better to stick to UTF-8 everywhere.

What is Java Byte Array and How should it be used?

What is meant by a byte array? Does it hold the 0s and 1s just like data is held in memory?
For example
String a = "32";
byte [] arr = a.getBytes() ;
What is inside the arr array now, and why and when should it be used?

A byte array is literally an array where each item is of the byte primitive data type. If you do not know the difference between a byte and a common int (Integer), the main difference is the bit width: bytes are 8-bit and integers are 32-bit.
If you do not know what an array is, an array is basically a sequence of items (in your case a sequence of bytes, declared as byte[]).
The function a.getBytes() takes a, which is a String, and returns an array of bytes. This can be done because the human-readable characters in a String can be represented as 8-bit numbers, where the mapping between number and character is determined by the Charset. Two common Charsets are ASCII and UTF-8. Now, arr is an array of bytes, where each byte in the array represents one character of the original string a. In both ASCII and UTF-8, the String "32" is represented by the bytes 51 and 50 in decimal, or 0x33 and 0x32 in hexadecimal.
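You can check this yourself with a few lines. In this sketch the class name GetBytesDemo and the helper hex are made up, and US-ASCII is passed explicitly instead of relying on the platform default:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    // Formats one byte as a two-digit hexadecimal literal such as 0x33.
    public static String hex(byte b) {
        return String.format("0x%02x", b & 0xff);
    }

    public static void main(String[] args) {
        String a = "32";
        byte[] arr = a.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(arr));           // [51, 50]
        System.out.println(hex(arr[0]) + " " + hex(arr[1])); // 0x33 0x32
        // Decoding with the same charset gives the original text back.
        System.out.println(new String(arr, StandardCharsets.US_ASCII)); // 32
    }
}
```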
Byte arrays are commonly used in applications that read and write data byte-wise, such as socket connections that send data in byte streams through TCP or UDP protocols.
Hope I could help!

Bit masking java, only showing last 6 bits of a hex

I am playing around with how to manipulate bytes from an input hex number. data is a hex value:
0x022DA822 == 10001011011010100000100010. After I run the following code:
byte mask= (byte) data;
mask will be 100010, only those last bits. Why does it only show the last 6 bits, or 22 in hex?
Does it mask the first 20 bits by default?

Your cast is causing a loss of data. A byte can hold (you guessed it) one byte of data, so the range of a byte is [-128, 127]; the most significant bit acts as the sign bit. When you write (byte) data, you are converting your int value into a variable of type byte, which has a smaller range, so only the lowest byte of your data (0x22) is kept. That byte is 00100010 in binary; it prints as 100010 because the two leading zeros are not shown.
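A short sketch of the cast, printing the result with and without leading zeros (the class name NarrowingCast and its helpers are made up for illustration):

```java
public class NarrowingCast {
    // Keeps only the lowest 8 bits of the int.
    public static byte lowByte(int data) {
        return (byte) data;
    }

    // Renders all 8 bits of the byte, including leading zeros.
    public static String eightBits(byte b) {
        String s = Integer.toBinaryString(b & 0xff);
        return String.format("%8s", s).replace(' ', '0');
    }

    public static void main(String[] args) {
        int data = 0x022DA822;
        byte mask = lowByte(data);
        System.out.println(mask);                         // 34, i.e. 0x22
        System.out.println(Integer.toBinaryString(mask)); // 100010
        System.out.println(eightBits(mask));              // 00100010
    }
}
```

So the cast keeps 8 bits, not 6; the other 24 bits of the int are discarded.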
