strange behaviour of java getBytes vs getBytes(charset)


consider the following:
public static void main(String... strings) throws Exception {
byte[] b = { -30, -128, -94 };
//section utf-32
String string1 = new String(b,"UTF-32");
System.out.println(string1); //prints ?
printBytes(string1.getBytes("UTF-32")); //prints 0 0 -1 -3
printBytes(string1.getBytes()); //prints 63
//section utf-8
String string2 = new String(b,"UTF-8");
System.out.println(string2); // prints •
printBytes(string2.getBytes("UTF-8")); //prints -30 -128 -94
printBytes(string2.getBytes()); //prints -107
}
public static void printBytes(byte[] bytes){
for(byte b : bytes){
System.out.print(b + " " );
}
System.out.println();
}
output:
?
0 0 -1 -3
63
•
-30 -128 -94
-107
so I have two questions:
In both sections: why are the outputs of getBytes() and getBytes(charset) different, even though I have specifically stated the string's charset?
Why are both byte outputs of getBytes in the UTF-32 section different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)

Question 1:
in both sections : why the output getBytes() and getBytes(charSet) are different even though I have specifically mentioned the string's charset
The character set you've specified is only used while converting between the byte array and the string (i.e. inside the constructor or method call itself). It is not part of the String instance. You are not setting the character set for the string; the character set is not stored.
A Java String does not carry a byte encoding with it; conceptually it is a sequence of 16-bit char values (UTF-16 code units). If you call String.getBytes() without specifying a character set, it will use the platform default, e.g. Windows-1252 on many Windows machines (since Java 18 the default is UTF-8 unless it is overridden).
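To make that concrete, here is a minimal sketch (the class and method names are made up for illustration) that encodes one and the same String with several charsets. The charset is only an argument to the encoding step, so the resulting bytes differ while the String stays the same:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetIsAnArgument {
    // Encodes the given text with whichever charset the caller passes in.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // the bullet character from the question

        System.out.println(Arrays.toString(encode(bullet, StandardCharsets.UTF_8)));      // [-30, -128, -94]
        System.out.println(Arrays.toString(encode(bullet, StandardCharsets.UTF_16BE)));   // [32, 34]
        // ISO-8859-1 has no bullet, so the encoder substitutes '?' (63).
        System.out.println(Arrays.toString(encode(bullet, StandardCharsets.ISO_8859_1))); // [63]

        // The no-argument overload depends on the JVM's default charset.
        System.out.println(Charset.defaultCharset() + " -> " + Arrays.toString(bullet.getBytes()));
    }
}
```

The last line is the only one whose output can change from machine to machine.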
Question 2:
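Why are both byte outputs of getBytes in the UTF-32 section different from the actual byte[] b? (i.e. how can I convert a string back to its original byte array?)
You cannot always do this. Not every byte sequence is a valid encoding of characters. If such an array is decoded, the invalid sequences are not reported; they are silently replaced with the replacement character U+FFFD, so the original bytes are lost. That is why getBytes("UTF-32") prints 0 0 -1 -3 (U+FFFD in UTF-32), and getBytes() presumably prints 63 (a '?') because your default charset cannot encode U+FFFD.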
This already happens during String string1 = new String(b,"UTF-32"); and String string2 = new String(b,"UTF-8");.
You can change this behavior using an instance of CharsetDecoder, retrieved using Charset.newDecoder.
If you want to encode a random byte array into a String instance then you should use a hexadecimal or base 64 encoder. You should not use a character decoder for that.

Java String / char (16 bits UTF-16!) / Reader / Writer are for Unicode text. So all scripts may be combined in a text.
Java byte (8 bits) / InputStream / OutputStream are for binary data. If that data represents text, one needs to know its encoding to make text out of it.
So a conversion from bytes to text always needs a Charset. Often there exists an overloaded method without the charset, and then it defaults to the System.getProperty("file.encoding") which can differ on every platform.
Using a default is absolutely non-portable, if the data is cross-platform.
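As a rough sketch of what "always name the charset" looks like when going from a byte stream to text (the class and method names here are invented for the example), the Reader is given the charset explicitly instead of relying on file.encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetReader {
    // Reads all bytes as text, decoding with the charset the caller names.
    static String readAll(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] buffer = new char[256];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] utf8Bullet = {-30, -128, -94};
        // Same bytes, two interpretations: one bullet character vs. three Latin-1 characters.
        System.out.println(readAll(utf8Bullet, StandardCharsets.UTF_8).length());      // 1
        System.out.println(readAll(utf8Bullet, StandardCharsets.ISO_8859_1).length()); // 3
    }
}
```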
So you had the misconception that the encoding belonged to the String. This is understandable, seeing that in C/C++ unsigned char and byte were largely interchangeable, and encodings a nightmare.

Related

Byte array with negative byte values can't be converted to String using UTF-8 [closed]

Consider this byte array:
byte[] by = {2, 126, 33, -66, -100, 4, -39, 108};
then if we execute the below code and print it,
String utf8_str = new String(by, StandardCharsets.UTF_8);
System.out.println(utf8_str);
the output is:
~!���l
All the negative values are converted to '�', which suggests that bytes with a negative value are not in the UTF-8 character set.
But the UTF-8 character set has a range of 0 to 255.
If only 0-127 can be represented as positive values in the byte datatype, then numbers greater than 127 can never be used when encoding to UTF-8, as Java does not support unsigned byte values.
Any solution for this?
I need to encode a byte array to a UTF-8 String and get the byte array back from that String.
All the characters are encoded and retrieved properly except '�'.
When I try to retrieve '�' (i.e. print its Unicode code point), it gives some other code point rather than that of the original encoded character.

tl;dr: You can't decode arbitrary bytes as UTF-8, because some byte streams are not conforming UTF-8 streams. If you need to represent arbitrary bytes as String, use something like Base64:
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
Not all byte sequences are valid UTF-8
UTF-8 has very specific rules about what bytes sequences are allowed. The short version is:
a byte in the range 0x00-0x7F can stand alone (and represents the equivalent character as its ASCII encoding).
a byte in the range 0xC2-0xF4 is a leading byte that starts a multi-byte sequence with the exact value indicating the number of continuation bytes
a byte in the range 0x80-0xBF is a continuation byte that has to come after a leading byte and possibly some other continuation bytes.
There's a few more rules and nuances to it, but that's the basic idea.
As you can see there are several byte values (0xC0, 0xC1, 0xF5-0xFF) that can't appear in a well-formed UTF-8 stream at all. Additionally some other bytes can only occur in specific sequences. For example a leading byte can never be followed by another leading byte or a stand-alone byte. Similarly a stand-alone byte must never be followed by a continuation byte.
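The simplified rules above can be sketched as a small classifier. This is only an illustration of the "short version" (the class and enum names are made up), not a full UTF-8 validator, since it looks at single bytes and ignores the sequencing rules:

```java
class Utf8ByteKind {
    enum Kind { STAND_ALONE, CONTINUATION, LEADING, NEVER_VALID }

    // Classifies a single byte according to the simplified ranges listed above.
    static Kind classify(byte b) {
        int value = b & 0xFF; // view the signed Java byte as 0..255
        if (value <= 0x7F) {
            return Kind.STAND_ALONE;
        }
        if (value <= 0xBF) {
            return Kind.CONTINUATION;
        }
        if (value >= 0xC2 && value <= 0xF4) {
            return Kind.LEADING;
        }
        return Kind.NEVER_VALID; // 0xC0, 0xC1, 0xF5-0xFF
    }

    public static void main(String[] args) {
        byte[] sample = {2, 126, 33, -66, -100, 4, -39, 108};
        for (byte b : sample) {
            System.out.printf("0x%02X %s%n", b & 0xFF, classify(b));
        }
    }
}
```

For the array from the question this reports two continuation bytes (0xBE, 0x9C) with no leading byte in front of them, and a leading byte (0xD9) followed by a stand-alone byte, which is exactly why decoding it fails.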
Note about "negative values": byte in Java is a signed data type. But the signed/unsigned debate is not relevant for this topic, as it only matters when calculating with the value or when printing it. It's the 8-bit type to use in Java and the fact that the byte 0xBE is represented as -66 in Java is mostly a visual distinction. For the purposes of this discussion "negative values" is equivalent to "byte values between 0x80 and 0xFF". It just so happens that the non-negative values are exactly the stand alone bytes in UTF-8 and are converted just fine.
All this means that decoding arbitrary byte[] as UTF-8 will not work in most cases!
Then why doesn't new String(...) throw an exception?
But if arbitraryBytes contains a byte[] that isn't valid UTF-8, then why doesn't new String(arbitraryBytes, StandardCharsets.UTF_8) throw an exception?
Good question! Maybe it should, but the designers of Java have decided that this specific way of decoding a byte[] into a String should be lenient:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
The "default replacement string" in this case is simply the Unicode character U+FFFD Replacement Character, which looks like a question mark in a filled rhombus: �
And as the documentation states, there is of course a way to decode a byte[] to a String and getting a real exception when it doesn't go right:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder().onMalformedInput(CodingErrorAction.REPORT);
String string = decoder.decode(ByteBuffer.wrap(arbitraryBytes)).toString();
This code will throw an exception:
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:820)
at org.example.Main.main(Main.java:13)
Okay, but I really need a String!
We have realized that decoding your byte[] to a String using UTF-8 doesn't work. One could use ISO-8859-1, which maps all 256 byte values to characters, but that would result in Strings with many unprintable control characters, which would be quite cumbersome to handle.
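For completeness, here is a small sketch of the ISO-8859-1 route (class and method names are invented for the example). It shows that the round trip is lossless for every possible byte value, even though the intermediate String is not pleasant to work with:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Decodes the bytes as ISO-8859-1 and encodes the resulting String again.
    static byte[] roundTrip(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = {2, 126, 33, -66, -100, 4, -39, 108};
        System.out.println(Arrays.equals(arbitraryBytes, roundTrip(arbitraryBytes))); // true
    }
}
```

This only works as long as both directions use ISO-8859-1; as soon as the String passes through another charset, the data can be corrupted again.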
Use Base64
The usual solution for this is to use Base64:
// encode byte[] to Base64
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
System.out.println(base64);
// decode Base64 to byte[]
byte[] decoded = Base64.getDecoder().decode(base64);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
With the same arbitraryBytes as before this will print
An4hvpwE2Ww=
true
Base64 is a common choice because it is able to represent arbitrary bytes with a reasonable number of characters (on average it will take about a third more characters than it has input bytes, depending on the exact formatting and/or padding used).
There are a few variations of Base64, which are used in various situations. Particularly common is the use of the URL- and filename-safe variant, which ensures that no characters with any special meaning in URLs and file names are used. Luckily it is directly supported in Java.
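A short sketch of the difference between the standard and the URL-safe variant (the wrapper class and method names are just for illustration; the Base64 calls are the regular java.util.Base64 API):

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Variants {
    // Standard alphabet, with '=' padding.
    static String standard(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // URL- and filename-safe alphabet ('-' and '_' instead of '+' and '/'), padding dropped.
    static String urlSafe(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    static byte[] fromUrlSafe(String encoded) {
        return Base64.getUrlDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        byte[] data = {-5, -1}; // chosen so that the two alphabets actually differ
        System.out.println(standard(data)); // +/8=
        System.out.println(urlSafe(data));  // -_8
        System.out.println(Arrays.equals(data, fromUrlSafe(urlSafe(data)))); // true
    }
}
```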
Format as a hex string
Base64 is neat and useful, but it somewhat obfuscates the individual byte values. Occasionally we want a format that allows us to interpret the values in some way. For this a hexadecimal representation of the data might be more useful, even though it takes up more characters than Base64:
// encode byte[] to hex
String hexFormatted = HexFormat.of().formatHex(arbitraryBytes);
System.out.println(hexFormatted);
// decode hex to byte[]
byte[] decoded = HexFormat.of().parseHex(hexFormatted);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
This will print
027e21be9c04d96c
true
This hex format (without separator) will take exactly 2 characters per input byte, making this format more verbose than Base64.
If you're not yet on Java 17 or later, there are plenty of other ways to do this.
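One of those other ways, as a minimal sketch that only needs String.format (the class and method names are made up; this is not meant to be the fastest option):

```java
class HexBeforeJava17 {
    // Formats each byte as two lowercase hex digits, without separators.
    static String toHex(byte[] data) {
        StringBuilder hex = new StringBuilder(data.length * 2);
        for (byte b : data) {
            // "& 0xFF" avoids sign extension, so -66 becomes "be" rather than "ffffffbe".
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = {2, 126, 33, -66, -100, 4, -39, 108};
        System.out.println(toHex(arbitraryBytes)); // 027e21be9c04d96c
    }
}
```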
But I've already converted my byte[] to String using UTF-8 and I really need my original data back.
Sorry, but you most likely can't. Unless you were very lucky and your original byte[] happened to be a well-formed UTF-8 stream, the conversion to String will have lost some data and you will only be able to recover a fraction of your original byte[].
String badString = new String(arbitraryBytes, StandardCharsets.UTF_8);
byte[] recoveredBytes = badString.getBytes(StandardCharsets.UTF_8);
This will give you something, but every time your input contained an encoding error, the result will contain the byte sequence 0xEF 0xBF 0xBD (or -17 -65 -67, when interpreted as signed bytes and printed in decimal). That byte sequence is how UTF-8 encodes the U+FFFD Replacement Character.
Depending on the specific input (and even the specific implementation of the UTF-8 decoder!) each replacement character can replace one or more bytes, so you can't even reliably tell the size of the original input array like this.
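If you want to check whether a given byte[] would survive the trip, a rough sketch like the following can help (class and method names are invented). It simply performs the lossy round trip and compares the result with the input:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyRoundTrip {
    // Decodes as UTF-8 (leniently) and encodes the resulting String as UTF-8 again.
    static byte[] roundTrip(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Counts U+FFFD characters in the decoded String. A genuine U+FFFD in the input
    // is counted too, so this is only an indication of how much was replaced.
    static long replacementCount(byte[] original) {
        return new String(original, StandardCharsets.UTF_8)
                .chars()
                .filter(c -> c == 0xFFFD)
                .count();
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = {2, 126, 33, -66, -100, 4, -39, 108};
        byte[] recovered = roundTrip(arbitraryBytes);
        System.out.println(Arrays.toString(recovered));
        System.out.println("identical: " + Arrays.equals(arbitraryBytes, recovered));
        System.out.println("replacement characters: " + replacementCount(arbitraryBytes));
    }
}
```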

What is the best way to get the size of text in bytes in Java?

I have implemented a cryptographic algorithm in Java. Now, I want to measure the size of the message before and after encryption in bytes.
How to get the size of the text in bytes?
For example, if I have a simple text Hi! I am alphanumeric (8÷4=2)
I have tried my best but can't find a good solution.
String temp = "Hi! I am alphanumeric (8÷4=2)"
temp.length() // this works because in ASCII every char takes one byte
// and in java every char in String takes two bytes so multiply by 2
temp.length() * 2
// also String.getBytes().length and getBytes("UTF-8").length
// returns same result
But in my case after decryption of message the chars becomes the mixture of ASCII and Unicode.
e.g. QÂʫP†ǒ!‡˜q‡Úy¦\dƒὥì£‰ὥ
The methods above return the length or length * 2
But I want to calculate the actual bytes (not in JVM). For example the char a takes one byte in general and Unicode ™ for example takes two bytes.
So how to implement this technique in Java?
I want something like the technique used on this website http://bytesizematters.com/
It gives me 26 bytes for this text QÂʫP†ǒ!‡˜q‡Úy¦\dƒὥì£‰ὥ although the length of text is 22.

Be aware: String is for Unicode text (able to mix all kinds of scripts) and char is a two-byte UTF-16 code unit.
This means that binary data (byte[]) must come with a known encoding/charset before it can be converted to a String, and the same holds in the other direction.
byte[] b = ...
String s = ...
b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
Without an explicit charset for the bytes, the platform default is taken, which gives non-portable code.
UTF-8 allows all text: not just some scripts, but also Greek, Arabic, Japanese.
However, as there is a conversion involved, non-text binary data can get corrupted (it will usually not be legal UTF-8), may cost double the memory, and will be slower because of the conversion.
Hence avoid String for binary data at all costs.
To respond to your question:
You might get away by StandardCharsets.ISO_8859_1 - which is a single byte encoding.
s.getBytes(StandardCharsets.ISO_8859_1).length will then correspond to s.length(), though the String might use double the memory as char is two bytes.
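For the size question itself, a minimal sketch (class and method names are made up): the size in bytes is simply the length of the encoded array, and it depends on which charset you pick. The ÷ sign is written as a Unicode escape here so the example does not depend on the source file encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Number of bytes the text occupies when encoded with the given charset.
    static int sizeInBytes(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "Hi! I am alphanumeric (8\u00F74=2)"; // \u00F7 is the division sign

        System.out.println(text.length());                                  // 29 chars
        System.out.println(sizeInBytes(text, StandardCharsets.ISO_8859_1)); // 29 bytes
        System.out.println(sizeInBytes(text, StandardCharsets.UTF_8));      // 30 bytes
        System.out.println(sizeInBytes(text, StandardCharsets.UTF_16BE));   // 58 bytes
    }
}
```

Websites that report a "byte size" for text are most likely counting UTF-8 bytes, which would explain a result larger than length() for text with non-ASCII characters.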
Alternatives to String:
byte[] themselves; Arrays provides utility functions, like Arrays.equals.
ByteArrayInputStream, ByteArrayOutputStream
ByteBuffer can wrap byte[]; can read and write short/int/...
Convert the byte[] to a Base64 String using Base64.getEncoder().encodeToString(bytes).
Converting a byte to some char
The goal is to convert a byte to a visible symbol displayable in a GUI text field, and where the length in chars is the same as the number of original bytes.
For instance, the font Lucida Sans Unicode has symbols from U+2400 onwards that represent the ASCII control characters. For the bytes with the 8th bit set, one could take Cyrillic, though confusion may arise because of the similarity between Cyrillic е and Latin e.
static char byte2char(byte b) {
if (b < 0) { // -128 .. -1
return (char)(0x400 - b);
} else if (b < 32) {
return (char)(0x2400 + b);
} else if (b == 127) {
return '\u25C1';
} else {
return (char) b;
}
}
A char is a UTF-16 code unit, but the values used here also correspond directly to Unicode code points (int).
A byte is signed, hence ranges from -128 to 127.

String.getBytes() returns array of Unicode chars

I was reading about getBytes, and the documentation states that it will return
the resultant byte array.
But when I ran the following program, I found that it returns an array of Unicode symbols.
public class GetBytesExample {
public static void main(String args[]) {
String str = new String("A");
byte[] array1 = str.getBytes();
System.out.print("Default Charset encoding:");
for (byte b : array1) {
System.out.print(b);
}
}
}
The above program prints output
Default Charset encoding:65
This 65 is equivalent to Unicode representation of A. My question is that where are the bytes whose return type is expected.

There is no PrintStream.print(byte) overload, so the byte needs to be widened to invoke the method.
Per JLS 5.1.2:
19 specific conversions on primitive types are called the widening primitive conversions:
byte to short, int, long, float, or double
...
There's no PrintStream.print(short) overload either.
The next most-specific one is PrintStream.print(int). So that's the one that's invoked, hence you are seeing the numeric value of the byte.
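A small sketch of that overload choice (the class and method names are invented; output is captured in memory so the result can be inspected). Printing the byte directly goes through print(int), while an explicit cast to char selects print(char):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class PrintByteDemo {
    // print(b) with a byte argument resolves to PrintStream.print(int).
    static String printedAsNumber(byte b) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink);
        out.print(b);
        out.flush();
        return sink.toString();
    }

    // The cast selects PrintStream.print(char); only meaningful for ASCII values here.
    static String printedAsChar(byte b) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink);
        out.print((char) b);
        out.flush();
        return sink.toString();
    }

    public static void main(String[] args) {
        byte b = 65;
        System.out.println(printedAsNumber(b)); // 65
        System.out.println(printedAsChar(b));   // A
    }
}
```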

String.getBytes() returns the encoding of the string using the platform encoding. The result depends on which machine you run this. If the platform encoding is UTF-8, or ASCII, or ISO-8859-1, or a few others, an 'A' will be encoded as 65 (aka 0x41).

This 65 is equivalent to Unicode representation of A
It is also equivalent to a UTF-8 representation of A
It is also equivalent to an ASCII representation of A
It is also equivalent to an ISO/IEC 8859-1 representation of A
It so happens that the encoding for A is the same in a lot of character encodings, and that it matches the Unicode code point. This is not a coincidence. It is a result of the history of character set / character encoding standards.
My question is that where are the bytes whose return type is expected.
In the byte array, of course :-)
You are (just) misinterpreting them.
When you do this:
for (byte b : array1) {
System.out.print(b);
}
you output a series of bytes as decimal numbers with no spaces between them. This is consistent with the way that Java distinguishes between text / character data and binary data. Bytes are binary. The getBytes() method gives a binary encoding (in some character set) of the text in the string. You are then formatting and printing the binary (one byte at a time) as decimal numbers.
If you want more evidence of this, replace the "A" literal with a literal containing (say) some Chinese characters. Or any Unicode characters greater than \u00ff ... expressed using \u syntax.
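For example, a quick sketch along those lines (class and method names are made up), using the Chinese character U+4E2D and printing the bytes with spaces between them so they are easier to tell apart:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NonAsciiBytes {
    // Encodes the text and renders each byte as a signed decimal number, separated by spaces.
    static String decimalBytes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(b);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decimalBytes("A", StandardCharsets.UTF_8));         // 65
        System.out.println(decimalBytes("\u4E2D", StandardCharsets.UTF_8));    // -28 -72 -83
        System.out.println(decimalBytes("\u4E2D", StandardCharsets.UTF_16BE)); // 78 45
    }
}
```

Here the bytes no longer look like the code point, and they change with the charset, which shows that getBytes really returns an encoding and not "Unicode symbols".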

Is a Java char array always a valid UTF-16 (Big Endian) encoding?

Say that I would encode a Java character array (char[]) instance as bytes:
using two bytes for each character
using big endian encoding (storing the most significant 8 bits in the leftmost and the least significant 8 bits in the rightmost byte)
Would this always create a valid UTF-16BE encoding? If not, which code points will result in an invalid encoding?
This question is very much related to this question about the Java char type and this question about the internal representation of Java strings.

No. You can create char instances that contain any 16-bit value you desire---there is nothing that constrains them to be valid UTF-16 code units, nor constrains an array of them to be a valid UTF-16 sequence. Even String does not require that its data be valid UTF-16:
char data[] = {'\uD800', 'b', 'c'}; // Unpaired lead surrogate
String str = new String(data);
The requirements for valid UTF-16 data are set out in Chapter 3 of the Unicode Standard (basically, everything must be a Unicode scalar value, and all surrogates must be correctly paired). You can test if a char array is a valid UTF-16 sequence, and turn it into a sequence of UTF-16BE (or LE) bytes, by using a CharsetEncoder:
CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException
(And similarly using a CharsetDecoder if you have bytes.)
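Wrapped up as a reusable check, this could look roughly like the following sketch (the class and method names are hypothetical; the encoder is the standard UTF-16BE one and is left at its default behaviour of reporting malformed input):

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class Utf16Validity {
    // Returns true if the chars form well-formed UTF-16, i.e. all surrogates are correctly paired.
    static boolean isWellFormed(char[] data) {
        try {
            StandardCharsets.UTF_16BE.newEncoder().encode(CharBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. MalformedInputException for an unpaired surrogate
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(new char[] {'\uD800', 'b', 'c'})); // false: unpaired lead surrogate
        System.out.println(isWellFormed(new char[] {'\uD83D', '\uDE00'})); // true: valid surrogate pair
        System.out.println(isWellFormed(new char[] {'a', 'b', 'c'}));      // true
    }
}
```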

SHA-1 Hashes Mixed with Strings

I have to parse something like the following "some text <40 byte hash>" can i read this whole thing in to a string without corrupting 40 byte hash part?
The thing is hash is not going to be there so i don't want to process it while reading.
EDIT: I forgot to mention that the 40 byte hash is 2x20 byte hashes no encoding raw bytes.

Read it from your input stream as a byte array, and then cut the text part out of the array like this (UTF-8 is only a placeholder; pass the charset the text was actually written in):
String s = new String(Arrays.copyOfRange(bytes, 0, bytes.length - 40), StandardCharsets.UTF_8);
Then get the 40 hash bytes as:
byte[] hash = Arrays.copyOfRange(bytes, bytes.length - 40, bytes.length);

SHA-1 hashes are 20 bytes (160 bits) in length. If you are dealing with 40 character hashes, then they are probably an ASCII representation of the hash, and therefore only contain the characters 0-9 and a-f. If this is the case, then you should be able to read and manipulate the strings in Java without any trouble.

Some more details could be useful, but I think the answer is that you should be okay.
You didn't say how the SHA-1 hash was encoded (common possibilities include "none" (the raw bytes), Base64 and hex). Since SHA-1 produces a 20 byte (160 bit) hash, I am guessing that it will be encoded using hex, since that doubles the space needed to the 40 bytes you mentioned. With that encoding, 2 characters will be used to encode each byte from the hash, using the symbols 0 through 9 and A through F. Those are all ASCII characters so you are safe.
Base64 encoding would also work (though probably not what you asked about since it increases the size by about 1/3 leaving you at well less than 40 bytes) as each of the characters used in Base64 are also ASCII.
If the raw bytes were used directly, you would have a problem, as some of the values are not valid characters.

OK, now that you've clarified that these are raw bytes
No, you cannot read this into Java as a string, you will need to read it as raw bytes.
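A minimal sketch of what "read it as raw bytes" could look like for this layout, assuming the message is already in a byte[] and always ends with exactly two 20-byte hashes (the class, method and constant names are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TextWithRawHashes {
    static final int HASH_LENGTH = 20;                 // SHA-1 digest size in bytes
    static final int TRAILER_LENGTH = 2 * HASH_LENGTH; // two raw hashes at the end

    // Decodes only the leading text part; the hash bytes never pass through a String.
    static String text(byte[] message, Charset charset) {
        return new String(message, 0, message.length - TRAILER_LENGTH, charset);
    }

    static byte[] firstHash(byte[] message) {
        int start = message.length - TRAILER_LENGTH;
        return Arrays.copyOfRange(message, start, start + HASH_LENGTH);
    }

    static byte[] secondHash(byte[] message) {
        return Arrays.copyOfRange(message, message.length - HASH_LENGTH, message.length);
    }

    public static void main(String[] args) {
        byte[] message = new byte[9 + TRAILER_LENGTH];
        System.arraycopy("some text".getBytes(StandardCharsets.US_ASCII), 0, message, 0, 9);
        Arrays.fill(message, 9, 29, (byte) -1); // stand-in for the first raw hash
        Arrays.fill(message, 29, 49, (byte) 7); // stand-in for the second raw hash

        System.out.println(text(message, StandardCharsets.US_ASCII)); // some text
        System.out.println(firstHash(message).length);                // 20
        System.out.println(secondHash(message).length);               // 20
    }
}
```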

WORKING CODE:
Converts byte string inputs into hex characters which should be safe in almost all string encodings. Use the code I posted in your other question to decode the hex chars back to raw bytes.
/** Lookup table: character for a half-byte */
static final char[] CHAR_FOR_BYTE = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
/** Encode byte data as a hex string... hex chars are UPPERCASE */
public static String encode(byte[] data){
if(data == null || data.length==0){
return null;
}
char[] store = new char[data.length*2];
for(int i=0; i<data.length; i++){
final int val = (data[i]&0xFF);
final int charLoc=i<<1;
store[charLoc]=CHAR_FOR_BYTE[val>>>4];
store[charLoc+1]=CHAR_FOR_BYTE[val&0x0F];
}
return new String(store);
}
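The decoding code from the other question is not shown here, so as a stand-in, this is one possible counterpart to encode(). It is a sketch with made-up names, not the original code:

```java
import java.util.Arrays;

class HexDecode {
    // Decodes a hex string (upper- or lowercase) back into raw bytes.
    static byte[] decode(String hex) {
        if (hex == null || hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have an even number of characters");
        }
        byte[] data = new byte[hex.length() / 2];
        for (int i = 0; i < data.length; i++) {
            // Character.digit returns -1 for anything that is not a hex digit.
            int high = Character.digit(hex.charAt(2 * i), 16);
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException("Not a hex digit near index " + (2 * i));
            }
            data[i] = (byte) ((high << 4) | low);
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode("027E21BE9C04D96C")));
        // [2, 126, 33, -66, -100, 4, -39, 108]
    }
}
```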
