String byte encoding issue - java

Given that I have the following function:
static void fun(String str) {
    System.out.println(String.format("%s | length in String: %d | length in bytes: %d | bytes: %s",
            str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}
on invoking fun("ó"); its output is
ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]
So it means the character ó needs 2 bytes to represent, and as per the Character class documentation the default in Java is UTF-16. Considering that, when I do the following:
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃
Why are none of the UTF_16, UTF_16BE, UTF_16LE charsets able to decode the bytes properly, given that the bytes represent a 16-bit character?
And how is UTF-8 able to decode it properly, given that UTF-8 considers each character only 8 bits long? Shouldn't it have printed 2 chars (1 char per byte), like ISO_8859_1 does?

The no-argument getBytes() always returns the bytes encoded in the platform's default charset, which is probably UTF-8 for you:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
So you are essentially trying to decode a bunch of UTF-8 bytes with non-UTF-8 charsets. No wonder you don't get expected results.
Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that the string is encoded correctly.
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII));
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));
You also seem to have some misunderstanding about encodings. It's not just about the number of bytes that a character takes. The byte-count-per-character for two encodings being the same doesn't mean that they are compatible with each other. Also, it is not always one byte per character in UTF-8. UTF-8 is a variable-length encoding.
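To see the variable-length point concretely, here is a minimal sketch (the class name is my own) printing the bytes of "ó" (U+00F3) under several charsets:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingLengths {
    public static void main(String[] args) {
        String s = "ó"; // U+00F3, a single char in Java's internal UTF-16 representation
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -77] (2 bytes)
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-13] (1 byte)
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));   // [0, -13] (2 bytes)
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));     // [-2, -1, 0, -13] (BOM + 2 bytes)
    }
}
The same single character takes 1, 2 or 4 bytes depending on the charset, which is why bytes produced with one charset generally cannot be decoded with another.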

Related

Byte array is a valid UTF8 encoded String in Java but not in Python

When I run the following in Python 2.7.6, I get an exception:
import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
The output:
base 64 of the bytes:
gAID
'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
When I try it in Java 8 (HotSpot 1.8.0_74), using this code:
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
I don't get any exception.
How so? Why is the same byte array valid in Java but invalid in Python, using UTF-8 decoding?
This is because the String constructor in Java just doesn't throw exceptions in the case of invalid characters. See the documentation:
public String(byte[] bytes, Charset charset)
... This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
It's not valid UTF8. https://en.wikipedia.org/wiki/UTF-8
Bytes between 0x80 and 0xBF cannot be the first byte of a multi-byte character. They can only be the second byte or later.
Java replaces bytes which cannot be decoded with the replacement character (often rendered as ?) rather than throwing an exception.
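If you want Java to fail the way Python does, a minimal sketch (class name my own) using the CharsetDecoder mentioned in the quoted documentation, with CodingErrorAction.REPORT:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x80, 0x02, 0x03}; // the same bytes as in the Python example
        try {
            String s = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            System.out.println(s);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e); // MalformedInputException, like Python's error
        }
    }
}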
So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
This is wrong if you decode the Base64-encoded bytes: the bytes of the string "gAID" itself are perfectly valid UTF-8.
import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
# store the Base64-encoded bytes
some_bytes = base64.b64encode(some_bytes)
encoded_bytes = [hex(ord(c)) for c in some_bytes]
print ("encoded bytes: ")
print (encoded_bytes)
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
output
gAID
['0x67', '0x41', '0x49', '0x44']
gAID
In Java you try to create a String from the Base64-decoded bytes using the UTF-8 charset, which results (as already answered) in the default replacement character �.
Running the following snippet
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
System.out.println("base 64 of the bytes:");
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
produces the following output:
base 64 of the bytes:
x80 x02 x03
?
There you can see the same bytes you are using in the Python snippet. What in Python leads to 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte leads here to a ? (which stands for the default replacement character on a non-Unicode console).
The following snippet uses the bytes of the string "gAID" itself to construct a String with the UTF-8 character set.
byte[] bytes = "gAID".getBytes(StandardCharsets.ISO_8859_1);
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
output
x67 x41 x49 x44
gAID

How does String.getBytes("utf-8") work in java?

OS default encoding: UTF-8
Converting a UTF-8 str to UTF-16 in Python:
utf8_str = "Hélô" # type(utf8_str) is str and encoded in UTF-8
unicode_str = utf8_str.decode("UTF-8") # type(unicode_str) is unicode
utf16_str = unicode_str.encode("UTF-16") #type(utf16_str) is str and encoded in UTF-16
As you can see, unicode is the bridge for converting a UTF-8 str to a UTF-16 str, and it's easy to understand.
But in Java, I am confused about the conversion:
String utf16Str = "Hélô";// String encoded in "UTF-16"
byte[] bytes = utf16Str.getBytes("UTF-8");// byte array encoded in UTF-8; getBytes calls an encode method
String newUtf16Str = new String(bytes, "UTF-8");// String encoded in "UTF-16"
There is no decode and no unicode step. So what happens in this process?
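A minimal sketch of the idea (class name my own): a Java String already holds the decoded form, a sequence of UTF-16 code units, so the Python unicode step is implicit; getBytes is the encode step, and new String(bytes, charset) is the decode step.
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "Hélô"; // already "decoded": a sequence of UTF-16 code units
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+0048 U+00E9 U+006C U+00F4
        }
        System.out.println();

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // encode: UTF-16 code units -> UTF-8 bytes
        String back = new String(utf8, StandardCharsets.UTF_8);  // decode: UTF-8 bytes -> UTF-16 code units
        System.out.println(back.equals(s)); // true
    }
}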

What is the target charset when I encrypt a string with Cipher and Base64.Encoder?

I'm trying to accommodate encrypted tokens in a DSL I'm designing (i.e. I need a char to be used as a delimiter). The encoder.encodeToString(...) docs say it uses the ISO-8859-1 charset. But when I encrypt a sampling of texts, it looks like it's not using all of the ISO-8859-1 charset: only upper/lower-case letters and some symbols appear, not certain punctuation and accented chars. What am I missing about this encodeToString() call, and what is the final char domain?
//import java.util.Base64;
//import javax.crypto.Cipher;
//import javax.crypto.SecretKey;
static Cipher cipher;
public static String decrypt(String encryptedText, SecretKey secretKey) throws Exception {
    Base64.Decoder decoder = Base64.getDecoder();
    byte[] encryptedTextByte = decoder.decode(encryptedText);
    cipher.init(Cipher.DECRYPT_MODE, secretKey);
    byte[] decryptedByte = cipher.doFinal(encryptedTextByte);
    String decryptedText = new String(decryptedByte);
    return decryptedText;
}
String has a constructor with charset; otherwise the default OS charset is taken.
new String(decryptedByte, StandardCharsets.ISO_8859_1);
As there often is a mix-up of Latin-1 (ISO-8859-1) with Windows Latin-1 (Windows-1252) you might try "Windows-1252" too.
Base64 is named just so because it uses 64 characters from the ASCII table (plus = for padding). The encoding used should not matter as long as it is compatible with ASCII.
If you want to use more than 64 characters you will have to use a different encoding.
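A small sketch (class name my own) confirming the char domain: whatever bytes go in, java.util.Base64 output only ever contains A-Z, a-z, 0-9, '+', '/' and '=' padding.
import java.util.Base64;

public class Base64Alphabet {
    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i; // every possible byte value
        }
        String encoded = Base64.getEncoder().encodeToString(all);
        System.out.println(encoded.matches("[A-Za-z0-9+/]+={0,2}")); // true: 64-char alphabet plus padding
    }
}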
Encryption returns bytes, no matter what the encryption algorithm is.
I personally use the following function to convert from decrypted bytes to string:
public static String getStringFromBytes(byte[] data) {
    StringBuilder sb = new StringBuilder();
    if (data != null && data.length > 0) {
        for (byte b : data) {
            sb.append((char) (b & 0xFF)); // mask off sign extension; maps each byte to the same char as ISO-8859-1
        }
    }
    return sb.toString();
}
Read the documentation carefully:
This method first encodes all input bytes into a base64 encoded byte
array and then constructs a new String by using the encoded byte array
and the ISO-8859-1 charset.
This means 2 things:
The encoding scheme used is Base64; that's the reason you are not getting the expected output, because ISO-8859-1 is not used as the encoding scheme.
ISO-8859-1 is only used as the charset for constructing the String.
The Base64 class gives you encoders and decoders for the Base64 encoding scheme, so if you are using its encoders and decoders then the output will be encoded/decoded using the Base64 encoding scheme.
This class consists exclusively of static methods for obtaining
encoders and decoders for the Base64 encoding scheme.
So, what you need is to properly encode your byte array using some charset, such as UTF-8 or ISO-8859-1.
I would personally recommend using "UTF-8" because it is widely used, and it encodes all ASCII characters in 1 byte, Latin and some other characters in 2 bytes, the remaining BMP characters in 3 bytes, and Unicode supplementary characters in 4 bytes. So it is not space-consuming for ASCII and Latin characters, plus it has the ability to encode all Unicode characters.
There are many ways to encode your String using the desired charset; an example below:
byte[] byteArr = new byte[3];
String decodeText = new String(byteArr, StandardCharsets.UTF_8);
ByteBuffer encoded = StandardCharsets.UTF_8.encode(decodeText); // the UTF-8 bytes of the string, as a ByteBuffer

Conversion of UTF-8 char to ISO-8859-1

I have tried to convert UTF-8 chars to ISO-8859-1, but some characters (like 0x84 and 0x96) are not converting to ISO-8859-1. See the code below in Java:
static byte[] encode(byte[] arr) throws CharacterCodingException {
    Charset utf8charset = Charset.forName("UTF-8");
    Charset iso88591charset = Charset.forName("ISO-8859-1");
    ByteBuffer inputBuffer = ByteBuffer.wrap(arr);
    // decode UTF-8
    CharBuffer data = utf8charset.decode(inputBuffer);
    // encode ISO-8859-1
    ByteBuffer outputBuffer = iso88591charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .encode(data);
    // copy only the encoded bytes, not the buffer's unused capacity
    byte[] outputData = new byte[outputBuffer.remaining()];
    outputBuffer.get(outputData);
    return outputData;
}
Please help to resolve it.
Thanks.
First you might use StandardCharsets.UTF_8 and StandardCharsets.ISO_8859_1.
However, better replace "ISO-8859-1" with "Windows-1252".
The reason is that browsers and others interpret an indication of ISO-8859-1 (Latin-1) as Windows-1252 (Windows Latin-1). In Windows Latin-1 the range 0x80 - 0xBF is used for comma-like quotes and such.
So with a bit of luck (I do not think you meant browsers) this will work.
BTW in browsers this will even work on the Mac, and is official since HTML5.
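To see the difference for one of the bytes from the question, a quick sketch (class name my own, assuming the JVM ships the windows-1252 charset, which standard JREs do):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1VsWindows1252 {
    public static void main(String[] args) {
        byte[] b = {(byte) 0x96};
        // ISO-8859-1 maps 0x96 to the invisible control character U+0096
        System.out.printf("U+%04X%n", (int) new String(b, StandardCharsets.ISO_8859_1).charAt(0));
        // Windows-1252 maps 0x96 to the en dash U+2013
        System.out.printf("U+%04X%n", (int) new String(b, Charset.forName("windows-1252")).charAt(0));
    }
}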
Give the following a go,
String str = new String(utf8Bytes, "UTF-8");
byte[] isoBytes = str.getBytes( "ISO-8859-1" );
If it gives you exactly the same result, then you have characters that do not map between those character sets.
My guess is that when you say "0x84, 0x96" you mean bytes in the byte array. If that is the case, you are taking those bytes and trying to interpret them as UTF-8, but that sequence of bytes is not a valid UTF-8 sequence.
from U+0000 to U+007F : 1 byte : 0xxxxxxx
from U+0080 to U+07FF : 2 bytes : 110xxxxx 10xxxxxx
from U+0800 to U+FFFF : 3 bytes : 1110xxxx 10xxxxxx 10xxxxxx
from U+10000 to U+1FFFFF : 4 bytes : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Since 0x84 0x96 is 10000100 10010110 in binary, it does not match the bit patterns above (note the 11... or 0... in the lead byte, never 10...; a byte starting with 10 is a "trailing byte").
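The bit tests can be written directly in Java; a rough sketch (my own helper, which does not detect overlong or otherwise invalid sequences):
public class Utf8ByteClass {
    // Classify a single byte value (0-255) by the lead-byte patterns above.
    static String classify(int b) {
        if ((b & 0x80) == 0x00) return "single-byte character (0xxxxxxx)";
        if ((b & 0xC0) == 0x80) return "trailing byte (10xxxxxx)";
        if ((b & 0xE0) == 0xC0) return "lead byte of a 2-byte character (110xxxxx)";
        if ((b & 0xF0) == 0xE0) return "lead byte of a 3-byte character (1110xxxx)";
        if ((b & 0xF8) == 0xF0) return "lead byte of a 4-byte character (11110xxx)";
        return "never valid in UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(classify(0x84)); // trailing byte: cannot start a character
        System.out.println(classify(0x96)); // trailing byte: cannot start a character
    }
}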

Encode byte[] as String

Heyho,
I want to convert byte data, which can be anything, to a String. My question is whether it is "secure" to encode the byte data with UTF-8, for example:
String s1 = new String(data, "UTF-8");
or by using base64:
String s2 = Base64.encodeToString(data, false); //migbase64
I'm just afraid that using the first method has negative side effects. I mean both variants work p̶e̶r̶f̶e̶c̶t̶l̶y̶, but s1 can contain any character of the UTF-8 charset, while s2 only uses "readable" characters. I'm just not sure if it's really necessary to use base64. Basically I just need to create a String, send it over the network and receive it again. (There is no other way in my situation :/)
The question is only about negative side effects, not if it's possible!
You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact but harder for humans to read.)
You claim "both variants work perfectly" but that's actually not true. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You're not trying to convert UTF-8-encoded text into a String, so don't write code which tries to do so.
Using ISO-8859-1 as an encoding will preserve all the data - but in very many cases the string that is returned will not be easily transported across other protocols. It may very well contain unprintable control characters, for example.
Only use the String(byte[], String) constructor when you've got inherently textual data, which you happen to have in an encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, just for example - you should use an approach which treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
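For comparison, a minimal sketch of both textual encodings mentioned above (class name and sample bytes are my own):
import java.util.Base64;

public class BinaryAsText {
    public static void main(String[] args) {
        byte[] data = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};

        // Base64: compact, uses the A-Za-z0-9+/= alphabet
        String b64 = Base64.getEncoder().encodeToString(data);
        System.out.println(b64); // 3q2+7w==

        // Hex: twice the size of the input, but trivially readable
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // deadbeef
    }
}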
You can store bytes in a String, though it's not a good idea. You can't use UTF-8, as this will mangle the bytes; a faster and more efficient way is to use ISO-8859-1 encoding or plain 8-bit. The simplest way to do this is to use
String s1 = new String(data, 0); // deprecated String(byte[] ascii, int hibyte) constructor
or
String s1 = new String(data, "ISO-8859-1");
From UTF-8 on Wikipedia: the following overlong sequences all decode to \0. As Jon Skeet notes, these encodings are not valid under the standard. Their behaviour in Java varies: DataInputStream treats the first three as the same, the next two throw an exception, and the Charset decoder silently treats them as separate characters.
00000000 is \0
11000000 10000000 is \0
11100000 10000000 10000000 is \0
11110000 10000000 10000000 10000000 is \0
11111000 10000000 10000000 10000000 10000000 is \0
11111100 10000000 10000000 10000000 10000000 10000000 is \0
This means if you see \0 in your String, you have no way of knowing for sure what the original byte[] values were. DataOutputStream uses the second option for compatibility with C, which sees \0 as a terminator.
BTW DataOutputStream is not aware of code points, so it writes characters above the BMP as a UTF-16 surrogate pair, with each surrogate then UTF-8 encoded (so-called modified UTF-8).
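A sketch demonstrating that second option (class name my own): DataOutputStream.writeUTF encodes '\0' as the overlong pair C0 80 after a two-byte length prefix.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF("\0");
        for (byte b : baos.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        // prints: 00 02 C0 80  (length prefix 2, then the overlong encoding of \0)
    }
}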
0xFE and 0xFF are never valid anywhere in UTF-8. Bytes of the form 11xxxxxx can only appear at the start of a character, never inside a multi-byte character.
Confirmed the accepted answer with Java. To repeat: UTF-8 and UTF-16 do not preserve all byte values; ISO-8859-1 does. But if the encoded bytes are to be transported beyond the JVM, use Base64.
@Test
public void testBase64() {
    final byte[] original = enumerate();
    final String encoded = Base64.encodeBase64String( original );
    final byte[] decoded = Base64.decodeBase64( encoded );
    assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testIso8859() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.ISO_8859_1 );
    final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
    assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf16() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.UTF_16 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
    assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf8() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.UTF_8 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
    assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testEnumerate() {
    final Set<Byte> byteSet = new HashSet<>();
    final byte[] bytes = enumerate();
    for ( byte b : bytes ) {
        byteSet.add( b );
    }
    assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
}

/**
 * Enumerates all the byte values.
 */
private byte[] enumerate() {
    final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
    final byte[] bytes = new byte[length];
    for ( int i = 0; i < length; i++ ) {
        bytes[i] = (byte) (i + Byte.MIN_VALUE);
    }
    return bytes;
}
