How does String.getBytes("utf-8") work in java?


OS default encoding: UTF-8
Converting a UTF-8 str to UTF-16 in Python:
utf8_str = "Hélô" # type(utf8_str) is str and encoded in UTF-8
unicode_str = utf8_str.decode("UTF-8") # type(unicode_str) is unicode
utf16_str = unicode_str.encode("UTF-16") #type(utf16_str) is str and encoded in UTF-16
As you can see, the unicode type is the bridge for converting a UTF-8 str to a UTF-16 str, and that is easy to understand.
But in Java, I am confused about the conversion:
String utf16Str = "Hélô"; // String encoded in UTF-16
byte[] bytes = utf16Str.getBytes("UTF-8"); // byte array encoded in UTF-8; getBytes calls an encode method
String newUtf16Str = new String(bytes, "UTF-8"); // String encoded in UTF-16
There is no explicit decode step and no unicode type. So what happens in this process?
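One way to read the Java snippet above is that a Java String plays the role of Python 2's unicode type: getBytes is the encode step and the String(byte[], Charset) constructor is the decode step. The sketch below illustrates this; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GetBytesRoundTrip {
    // Encode: Unicode text (a String) -> UTF-8 bytes, like unicode.encode("UTF-8") in Python 2.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode: UTF-8 bytes -> Unicode text, like str.decode("UTF-8") in Python 2.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Hélô" written with escapes so the source file encoding does not matter.
        String text = "H\u00E9l\u00F4";
        byte[] utf8 = encodeUtf8(text);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(utf8));
        System.out.println(utf8.length + " UTF-8 bytes, " + utf16.length + " UTF-16BE bytes");
        System.out.println("round trip ok: " + decodeUtf8(utf8).equals(text));
    }
}
```

The UTF-16 representation is an internal detail of String; byte-level encodings only appear when you cross the String/byte[] boundary.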

Related

convert UTF-8 String to ISO-8859-1 java

I have an application where I want to convert a UTF-8 encoded string to ISO-8859-1, because that is the encoding of my Oracle DB.
Currently this is what I'm inserting into my DB:
BelgiÃ«
But I expect this:
België
When I print my string in Java I get the following:
BelgiÃ«
Can anybody help me?
This is what I have already tried:
System.out.println(xmlString);
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(xmlString.getBytes(utf8charset));
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();
xmlt = new oracle.xdb.XMLType(con, new String(outputData, iso88591charset));
A suggestion from the comments didn't work either:
byte[] utf8 = xmlString.getBytes("UTF-8");
byte[] latin = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
ByteArrayInputStream bis = new ByteArrayInputStream(latin);
xmlt = new oracle.xdb.XMLType(con, bis);

UTF-8 can encode any Unicode code point. ISO-8859-1 can represent only a tiny fraction of them, so transcoding from UTF-8 to ISO-8859-1 is lossy: unsupported characters are replaced with a substitute (Java's String.getBytes emits ? for them), and there might be other issues.
Transcoding from ISO-8859-1 to UTF-8 does not have this problem.
I would suggest transcoding the text like this:
byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");
or
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
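As a rough illustration of the two conversions above, here is a small sketch. It also assumes that output such as BelgiÃ« comes from UTF-8 bytes having been decoded as ISO-8859-1 somewhere upstream, which is a guess about the asker's setup; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    // Transcode UTF-8 bytes to ISO-8859-1 bytes by going through a String.
    static byte[] utf8ToLatin1(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.ISO_8859_1);
    }

    // Undo a wrong decode: recover the original bytes, then decode them as UTF-8.
    // Only valid under the assumption that the text was UTF-8 misread as ISO-8859-1.
    static String repairMojibake(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = "Belgi\u00C3\u00AB"; // the mis-decoded form of "België"
        System.out.println(repairMojibake(garbled));

        byte[] latin1 = utf8ToLatin1("Belgi\u00EB".getBytes(StandardCharsets.UTF_8));
        System.out.println(latin1.length + " ISO-8859-1 bytes");
    }
}
```

If the repair step fixes the text, the real bug is most likely the place where the bytes were first turned into a String with the wrong charset.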

How to convert char in UTF-16 encoding to char in CP855 encoding?

For example, I have char a = 'п', and I want to get its value in the Cyrillic code page (CP855). Is that possible?

To convert text from one encoding to another you can use the java.nio.charset.Charset class:
byte[] data = ...; // your encoded data in UTF-16
Charset from = StandardCharsets.UTF_16;
Charset to = ...; // cyrillic charset
ByteBuffer encoded = to.encode(from.decode(ByteBuffer.wrap(data)));
byte[] converted = new byte[encoded.remaining()]; // array() may include unused trailing bytes
encoded.get(converted);
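For a single char, a shorter route is to encode a one-character String directly. The sketch below assumes the JVM exposes CP855 under the charset name "IBM855" (it is an optional charset, so the code checks for it first); the class and method names are invented for the example.

```java
import java.nio.charset.Charset;

class Cp855Sketch {
    // Returns the unsigned value of the first byte of c in the named charset.
    // Characters the charset cannot represent come back as its replacement byte (usually '?').
    static int codeIn(char c, String charsetName) {
        byte[] encoded = String.valueOf(c).getBytes(Charset.forName(charsetName));
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        String name = "IBM855"; // assumed charset name for CP855
        if (Charset.isSupported(name)) {
            // '\u043F' is the Cyrillic small letter pe
            System.out.printf("0x%02X%n", codeIn('\u043F', name));
        } else {
            System.out.println(name + " is not available on this JVM");
        }
    }
}
```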

Byte array is a valid UTF8 encoded String in Java but not in Python

When I run the following in Python 2.7.6, I get an exception:
import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
The output:
base 64 of the bytes:
gAID
'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
So in Python 2.7.6 the bytes represented as gAID are not valid UTF-8.
When I try it in Java 8 (HotSpot 1.8.0_74), using this code:
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
I don't get any exception.
How so? Why is the same byte array valid in Java and invalid in Python when decoded as UTF-8?

This is because the String constructor in Java simply doesn't throw exceptions for invalid byte sequences. From the documentation:
public String(byte[] bytes, Charset charset)
... This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
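To get Python-like behaviour in Java, the CharsetDecoder mentioned in the documentation can be asked to report errors instead of replacing them. The following is a minimal sketch; the class and method names are only illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Decodes bytes as UTF-8 and throws instead of silently substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0x80, 0x02, 0x03}; // the bytes behind "gAID"
        System.out.println("valid UTF-8: " + isValidUtf8(bad));
        // The lenient constructor replaces the bad byte with U+FFFD instead.
        String lenient = new String(bad, StandardCharsets.UTF_8);
        System.out.printf("first char: U+%04X%n", (int) lenient.charAt(0));
    }
}
```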

It's not valid UTF-8. https://en.wikipedia.org/wiki/UTF-8
Bytes between 0x80 and 0xBF cannot be the first byte of a multi-byte character. They can only be the second byte or later.
Java replaces bytes which cannot be decoded with the replacement character U+FFFD (often displayed as ?) rather than throwing an exception.

So in Python 2.7.6 the bytes represented as gAID are not a valid UTF8.
This is wrong as you try to decode the Base64 encoded bytes.
import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
# keep the Base64-encoded bytes
some_bytes = base64.b64encode(some_bytes)
decoded_bytes = [hex(ord(c)) for c in some_bytes]
print ("decoded bytes: ")
print (decoded_bytes)
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)
output
gAID
['0x67', '0x41', '0x49', '0x44']
gAID
In Java you create a String from the Base64-decoded bytes using the UTF-8 charset, which results (as already answered) in the default replacement character �.
Running the following snippet
java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
System.out.println("base 64 of the bytes:");
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
produces the following output
base 64 of the bytes:
x80 x02 x03
?
There you can see the same bytes you are using in the Python snippet. Where Python fails with 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte, Java prints a ? (which stands for the default replacement character on a non-Unicode console).
The following snippet uses the bytes of the text gAID itself to construct a String with the UTF-8 character set.
byte[] bytes = "gAID".getBytes(StandardCharsets.ISO_8859_1);
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);
output
x67 x41 x49 x44
gAID

What is the target charset when I encrypt a string with Cipher and Base64.Encoder?

I'm trying to accommodate encrypted tokens in a DSL I'm designing (i.e. I need a char to be used as a delimiter). The encoder.encodeToString(...) docs say it uses the ISO-8859-1 charset. But when I encrypt a sample of texts, the output does not seem to use all of ISO-8859-1: I see upper- and lower-case letters and some symbols, but not certain punctuation or accented chars. What am I missing about this encodeToString() call, and what is the final char domain?
//import java.util.Base64;
//import javax.crypto.Cipher;
//import javax.crypto.SecretKey;
static Cipher cipher;
public static String decrypt(String encryptedText, SecretKey secretKey) throws Exception {
Base64.Decoder decoder = Base64.getDecoder();
byte[] encryptedTextByte = decoder.decode(encryptedText);
cipher.init(Cipher.DECRYPT_MODE, secretKey);
byte[] decryptedByte = cipher.doFinal(encryptedTextByte);
String decryptedText = new String(decryptedByte);
return decryptedText;
}

String has a constructor that takes a charset; otherwise the platform default charset is used.
new String(decryptedByte, StandardCharsets.ISO_8859_1);
As Latin-1 (ISO-8859-1) is often mixed up with Windows Latin-1 (Windows-1252), you might try "Windows-1252" too.

Base64 is named just so because it uses 64 characters from the ASCII table. The encoding used should not matter as long as it is compatible with ASCII.
If you want to use more than 64 characters you will have to use a different encoding.
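A quick way to check the character domain is to encode every possible byte value and look at which characters come out. The sketch below does that for the basic and the URL-safe encoders of java.util.Base64; the class and method names are made up for the example.

```java
import java.util.Base64;

class Base64Alphabet {
    // Encodes the 256 byte values 0x00..0xFF with the given encoder.
    static String encodeAllByteValues(Base64.Encoder encoder) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return encoder.encodeToString(all);
    }

    public static void main(String[] args) {
        String basic = encodeAllByteValues(Base64.getEncoder());
        String urlSafe = encodeAllByteValues(Base64.getUrlEncoder());
        // Basic alphabet: letters, digits, '+', '/', plus '=' padding.
        System.out.println(basic.matches("[A-Za-z0-9+/=]+"));
        // URL-safe alphabet: '-' and '_' replace '+' and '/'.
        System.out.println(urlSafe.matches("[A-Za-z0-9_=-]+"));
    }
}
```

Any character outside these sets, for example ':' or '|', should therefore be usable as a delimiter between tokens.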

Encryption returns bytes, no matter which encryption is used.
I personally use the following function to convert decrypted bytes to a string:
public static String getStringFromBytes(byte[] data) {
    StringBuilder sb = new StringBuilder();
    if (data != null && data.length > 0) {
        for (byte b : data) {
            // mask to 0..255 so bytes >= 0x80 are not sign-extended (same result as ISO-8859-1 decoding)
            sb.append((char) (b & 0xFF));
        }
    }
    return sb.toString();
}

Read the documentation carefully:
This method first encodes all input bytes into a base64 encoded byte
array and then constructs a new String by using the encoded byte array
and the ISO-8859-1 charset.
This means 2 things:
The encoding scheme used is Base64. That is why you are not getting the expected output: ISO-8859-1 is not used as the encoding scheme.
ISO-8859-1 is only used as the charset for constructing the String.
The Base64 class gives you encoders and decoders for the Base64 encoding scheme, so if you use them, the output will be encoded/decoded using the Base64 encoding scheme.
This class consists exclusively of static methods for obtaining
encoders and decoders for the Base64 encoding scheme.
So, what you need is to properly encode your byte array using some encoding scheme, such as UTF-8 or ISO-8859-1.
I would personally recommend UTF-8 because it is widely used and can encode every Unicode character: ASCII takes 1 byte, the remaining characters up to U+07FF (including accented Latin letters) take 2 bytes, the rest of the BMP takes 3 bytes, and supplementary characters take 4 bytes. So it is compact for ASCII text while still covering all of Unicode.
There are many ways to encode your String using the desired encoding scheme; here is one example:
byte[] byteArr = new byte[3];
String decodeText = new String(byteArr);
Charset.forName("UTF-8").encode(decodeText);
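The byte counts per character in UTF-8 can be checked with a short sketch like the one below. The sample characters are arbitrary picks from each range, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // ASCII letter
        System.out.println(utf8Length("\u00E9"));       // e with acute accent, U+00E9
        System.out.println(utf8Length("\u20AC"));       // euro sign, U+20AC
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, a supplementary character
    }
}
```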

Conversion of UTF-8 char to ISO-8859-1

I have tried to convert UTF-8 characters to ISO-8859-1, but characters such as 0x84 and 0x96 are not being converted to ISO-8859-1. See the Java code below:
static byte[] encode(byte[] arr) throws CharacterCodingException{
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap( arr );
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer= iso88591charset.newEncoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE)
.encode(data);
data = iso88591charset.decode(outputBuffer);
byte[] outputData = outputBuffer.array();
return outputData;
}
Please help to resolve it.
Thanks.

First, you could use StandardCharsets.UTF_8 and StandardCharsets.ISO_8859_1.
However, it is better to replace "ISO-8859-1" with "Windows-1252".
The reason is that browsers and other software interpret a declaration of ISO-8859-1 (Latin-1) as Windows-1252 (Windows Latin-1). In Windows Latin-1 the range 0x80 - 0x9F is used for comma-like quotes and such.
So with a bit of luck (I do not think you meant browsers) this will work.
BTW, in browsers this even works on the Mac, and it has been official since HTML5.
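To see the difference for the two bytes from the question, the sketch below decodes 0x84 and 0x96 with both charsets and prints the resulting code points. It assumes the JVM provides "windows-1252", which mainstream JDKs do; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1VsCp1252 {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Formats each char of the text as U+XXXX, separated by spaces.
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x84, (byte) 0x96};
        Charset cp1252 = Charset.forName("windows-1252");
        // ISO-8859-1 maps these bytes to C1 control characters.
        System.out.println(codePoints(decode(bytes, StandardCharsets.ISO_8859_1)));
        // Windows-1252 maps them to a low double quote and an en dash.
        System.out.println(codePoints(decode(bytes, cp1252)));
    }
}
```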

Give the following a go,
String str = new String(utf8Bytes, "UTF-8");
byte[] isoBytes = str.getBytes( "ISO-8859-1" );
If it gives you exactly the same result, then you have characters that do not map between those character sets.

My guess is that when you say "0x84, 0x96" you mean bytes in the byte array.
If that is the case, you are taking those bytes and trying to interpret them as UTF-8, but
that sequence of bytes is not a valid UTF-8 sequence.
from U+0000 to U+007F : 1 byte : 0xxxxxxx
from U+0080 to U+07FF : 2 bytes : 110xxxxx 10xxxxxx
from U+0800 to U+FFFF : 3 bytes : 1110xxxx 10xxxxxx 10xxxxxx
from U+10000 to U+10FFFF : 4 bytes : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Since 0x84 0x96 is 10000100 10010110 in binary, it does not match the bit patterns above
(note that a lead byte starts with 11... or 0..., never 10..., which marks a "trailing byte")
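The bit patterns in the table above can be turned into a small classifier. This is only a sketch of the pattern check: it does not reject overlong lead bytes such as 0xC0 or lead bytes above 0xF4, which a full validator would, and the class and method names are hypothetical.

```java
class Utf8ByteKind {
    // Classifies one byte by its leading bits, following the UTF-8 bit patterns.
    static String kind(int value) {
        int b = value & 0xFF;
        if ((b & 0x80) == 0x00) return "single";       // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation"; // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead-2";       // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead-3";       // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead-4";       // 11110xxx
        return "invalid";
    }

    public static void main(String[] args) {
        int[] samples = {0x84, 0x96, 0xC3, 0x41};
        for (int b : samples) {
            System.out.printf("0x%02X -> %s%n", b, kind(b));
        }
    }
}
```

Both 0x84 and 0x96 come out as continuation bytes, so neither can start a UTF-8 sequence.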
