Decoding multibyte UTF-8 symbols with CharsetDecoder in a byte-by-byte manner? - java

I am trying to decode UTF-8 byte by byte with a CharsetDecoder. Is this possible?
The following code
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;
    byte[] source = new byte[] {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
    byte[] b = new byte[1];
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();
    b[0] = source[0]; // feed the first byte of the two-byte sequence
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());

    b[0] = source[1]; // overwrite the buffer with the second byte and try again
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());
}
gives the following output.
UNDERFLOW
1
MALFORMED[1]
1
Why?

My theory is that the problem is that, in the "underflow" condition, the decoder leaves any unconsumed bytes in the input buffer. At least, that is my reading of the javadoc.
Note this sentence in the javadoc:
"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "
But you are clobbering the (presumably) unread byte.
You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in bb after the first decode(...) call.
If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputting a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.
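A minimal sketch of that accumulate-and-retry approach (an illustration of the idea, not the only way to do it): append each new byte behind whatever the decoder has not yet consumed, flip() the buffer before decoding and compact() it afterwards so unconsumed bytes are preserved.
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class ByteByByteDecode {
    public static void main(String[] args) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        byte[] source = {(byte) 0xc3, (byte) 0xa6};   // LATIN SMALL LETTER AE in UTF-8

        ByteBuffer bb = ByteBuffer.allocate(4);       // big enough for one UTF-8 sequence
        CharBuffer cb = CharBuffer.allocate(2);

        decoder.reset();
        for (byte next : source) {
            bb.put(next);                             // append after any unconsumed bytes
            bb.flip();                                // switch to draining mode for the decoder
            CoderResult res = decoder.decode(bb, cb, false);
            System.out.println(res + ", chars decoded so far: " + cb.position());
            bb.compact();                             // keep unconsumed bytes for the next round
        }
        bb.flip();                                    // drain anything left (nothing here)
        decoder.decode(bb, cb, true);                 // tell the decoder the input is finished
        decoder.flush(cb);
        cb.flip();
        System.out.println("Decoded: " + cb);         // prints the single char æ
    }
}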

As has been said, UTF-8 uses 1 to 4 bytes per character (the original design allowed up to 6). You need to add all of the character's bytes to the ByteBuffer before you decode. Try this:
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;
    byte[] source = new byte[] {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
    byte[] b = new byte[2]; // two bytes for this char
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();
    b[0] = source[0];
    b[1] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false); // translates 2 bytes to 1 char
    System.out.println(cb.remaining());  // prints 0
    System.out.println(cb.get(0));       // prints latin ae
}

Related

Java Conversion from byte[] to char[]

I have written a simple snippet in which I try to convert a byte array into a char array and vice versa.
There are a lot of examples on the net, but this is the simplest way I found to do it.
I need to use arrays and not Strings because the content is a password that I have to manage.
So I ask you, is this snippet correct?
private static char[] fromByteToCharArrayConverter(byte[] byteArray) {
    ByteBuffer buffer = ByteBuffer.wrap(byteArray);
    clearArray(byteArray);
    CharBuffer charBuffer = Charset.forName("UTF-8").decode(buffer);
    char[] charArray = new char[charBuffer.remaining()];
    charBuffer.get(charArray);
    return charArray;
}

private static byte[] fromCharToByteArray(char[] charArray) {
    CharBuffer charBuffer = CharBuffer.wrap(charArray);
    ByteBuffer byteBuffer = Charset.forName("UTF-8").encode(charBuffer);
    byte[] byteArray = new byte[byteBuffer.remaining()];
    byteBuffer.get(byteArray);
    return byteArray;
}
Thanks
No, that won't work for (at least) the following reason:
ByteBuffer buffer = ByteBuffer.wrap(byteArray); // Wrap the array
clearArray(byteArray); // Clear the array -> ByteBuffer cleared too
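ByteBuffer.wrap() does not copy the array, so wiping the array also wipes the buffer's contents before anything has been decoded. One possible fix, as a sketch (not the poster's exact code; Arrays.fill stands in for the clearArray helper, and StandardCharsets needs Java 7+): decode first, and wipe the sensitive array only afterwards.
private static char[] fromByteToCharArrayConverter(byte[] byteArray) {
    ByteBuffer buffer = ByteBuffer.wrap(byteArray);       // wrap() does NOT copy the array
    CharBuffer charBuffer = StandardCharsets.UTF_8.decode(buffer);
    java.util.Arrays.fill(byteArray, (byte) 0);           // now safe: decoding is already done
    char[] charArray = new char[charBuffer.remaining()];
    charBuffer.get(charArray);
    return charArray;
}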

send data frame to a socket

byte[] demande=new byte[2];
Let's suppose that demande is a data frame which will be sent to a socket.
What should demande[0] and demande[1] be if I want to send 200? I tried to write demande[0] = 1 and demande[1] = -56 (1*256 - 56 = 200), but it doesn't work. How can I do it?
I assume that the number 200 is a decimal value.
As 200 is no greater than 255, it fits into a single byte; its hexadecimal value is 0xC8.
So in your case you have two options. Which one is correct depends on the protocol you are using.
Either
byte[] demande = { 0x00, (byte) 0xC8 }; // big endian
or
byte[] demande = { (byte) 0xC8, 0x00 }; // little endian
Or if you prefer
byte[] demande = new byte[2];
demande[0] = 0x00;
demande[1] = (byte) 0xC8;
(big endian; note the (byte) cast, since 0xC8 is out of range for a signed byte literal)
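For values that do not fit in a single byte, the general pattern is to split the number with bit shifts. A minimal sketch for big-endian order (the variable names just mirror the question):
int value = 200;                            // any value up to 65535
byte[] demande = new byte[2];
demande[0] = (byte) ((value >> 8) & 0xFF);  // high byte first: 0x00
demande[1] = (byte) (value & 0xFF);         // low byte: 0xC8, which prints as -56 in Java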
You can use the ByteBuffer class to create the byte array. If you wanted to convert the value 200 into a two-byte array:
ByteBuffer b = ByteBuffer.allocate(2);
b.putShort((short) 200); // writes exactly 2 bytes, big endian by default
byte[] result = b.array();
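If the protocol expects little-endian order instead, the buffer's byte order can be set explicitly (a small sketch using java.nio.ByteOrder; ByteBuffer is big endian by default):
ByteBuffer le = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN);
le.putShort((short) 200);
byte[] frame = le.array(); // { (byte) 0xC8, 0x00 }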

Encode byte[] as String

Heyho,
I want to convert byte data, which can be anything, to a String. My question is whether it is "secure" to encode the byte data with UTF-8, for example:
String s1 = new String(data, "UTF-8");
or by using base64:
String s2 = Base64.encodeToString(data, false); //migbase64
I'm just afraid that using the first method has negative side effects. I mean both variants work p̶e̶r̶f̶e̶c̶t̶l̶y̶, but s1 can contain any character of the UTF-8 charset while s2 only uses "readable" characters. I'm just not sure if I really need to use base64. Basically I just need to create a String, send it over the network and receive it again. (There is no other way in my situation :/)
The question is only about negative side effects, not if it's possible!
You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact but harder for humans to read.)
You claim "both variants work perfectly" but that's actually not true. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You're not trying to convert UTF-8-encoded text into a String, so don't write code which tries to do so.
Using ISO-8859-1 as an encoding will preserve all the data - but in very many cases the string that is returned will not be easily transported across other protocols. It may very well contain unprintable control characters, for example.
Only use the String(byte[], String) constructor when you've got inherently textual data, which you happen to have in an encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, just for example - you should use an approach which treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
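For illustration, here is a round trip with the JDK's own java.util.Base64 (Java 8+; the question used migbase64, so this class is just a stand-in) showing that arbitrary bytes survive intact:
byte[] data = { (byte) 0xFF, 0x00, (byte) 0x80, 0x7F };       // arbitrary binary, not text
String transported = java.util.Base64.getEncoder().encodeToString(data);
byte[] restored = java.util.Base64.getDecoder().decode(transported);
System.out.println(java.util.Arrays.equals(data, restored));  // true: nothing is lost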
You can store a byte in a String, though it's not a good idea. You can't use UTF-8, as this will mangle the bytes, but a faster and more efficient way is to use the ISO-8859-1 encoding or plain 8-bit. The simplest way to do this is to use
String s1 = new String(data, 0); // deprecated "hibyte" constructor
or
String s1 = new String(data, "ISO-8859-1");
From UTF-8 on Wikipedia: as Jon Skeet notes, the following overlong encodings are not valid under the standard, but their behaviour in Java varies. DataInputStream treats the first three forms as the same character, the next two throw an exception, and the Charset decoder silently treats them as separate characters.
00000000 is \0
11000000 10000000 is \0
11100000 10000000 10000000 is \0
11110000 10000000 10000000 10000000 is \0
11111000 10000000 10000000 10000000 10000000 is \0
11111100 10000000 10000000 10000000 10000000 10000000 is \0
This means that if you see \0 in your String, you have no way of knowing for sure what the original byte[] values were. DataOutputStream uses the second form for compatibility with C, which treats \0 as a terminator.
BTW, DataOutputStream is not aware of code points, so it writes characters above the BMP as UTF-16 surrogate pairs, each of which is then UTF-8 encoded.
0xFE and 0xFF are never valid anywhere in UTF-8, and bytes of the form 11xxxxxx (0xC0 and above) can only appear at the start of a character, never inside a multi-byte character.
Confirmed the accepted answer with Java. To repeat: UTF-8 and UTF-16 do not preserve all byte values; ISO-8859-1 does. But if the encoded string is to be transported beyond the JVM, use Base64.
@Test
public void testBase64() {
    final byte[] original = enumerate();
    final String encoded = Base64.encodeBase64String( original );
    final byte[] decoded = Base64.decodeBase64( encoded );
    assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testIso8859() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.ISO_8859_1 );
    final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
    assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf16() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.UTF_16 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
    assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testUtf8() {
    final byte[] original = enumerate();
    String s = new String( original, StandardCharsets.UTF_8 );
    final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
    assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
}

@Test
public void testEnumerate() {
    final Set<Byte> byteSet = new HashSet<>();
    final byte[] bytes = enumerate();
    for ( byte b : bytes ) {
        byteSet.add( b );
    }
    assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
}

/**
 * Enumerates all the byte values.
 */
private byte[] enumerate() {
    final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
    final byte[] bytes = new byte[length];
    for ( int i = 0; i < length; i++ ) {
        bytes[i] = (byte)(i + Byte.MIN_VALUE);
    }
    return bytes;
}

char and byte buffer encoding & decoding

I'm trying to understand how encoding works. Here is my code to encode and decode a string.
Charset utfset = Charset.forName("UTF-8");
CharsetEncoder encoder = utfset.newEncoder();
String text = "java.abcded.tocken";
CharBuffer cb = CharBuffer.wrap(text.toCharArray());
ByteBuffer bb = encoder.encode(cb);
byte[] bytes = bb.array();
CharsetDecoder isodecoder = utfset.newDecoder();
CharBuffer isodcb = isodecoder.decode(bb);
System.out.println(String.valueOf(cb.array()).equals(String.valueOf(isodcb.array())));
CharBuffer isodcb2 = isodecoder.decode(ByteBuffer.wrap(bytes));
System.out.println(String.valueOf(cb.array()).equals(String.valueOf(isodcb2.array())));
When the decode is performed on the ByteBuffer itself, the strings are equal, but when the decode is performed on ByteBuffer.wrap() of the byte array taken from that ByteBuffer, the strings are not equal. It appends spaces to the end. Is there a reason behind it?
CharsetEncoder.encode makes no guarantees about the size of the underlying array, nor even that the returned ByteBuffer is backed by an array at all. Here the array backing the buffer is larger than the number of bytes actually written into it, so when you wrap the whole array the unused zero bytes at the end are decoded too, which is where the trailing characters come from.
You should see different numbers if you run this code:
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
String text = "java.abcded.tocken";
CharBuffer cb = CharBuffer.wrap(text.toCharArray());
ByteBuffer bb = encoder.encode(cb);
System.out.println(bb.remaining());
System.out.println(bb.array().length);
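To get an array containing exactly the encoded bytes, copy only the remaining() bytes instead of exposing the whole backing array. A sketch continuing from the variables above:
byte[] exact = new byte[bb.remaining()];
bb.duplicate().get(exact);                  // duplicate() leaves bb's own position untouched
String roundTrip = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(exact)).toString();
System.out.println(text.equals(roundTrip)); // true: no trailing padding characters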

Java client server chat pads strings with squares when converting to byte[]

I'm building a chat client and server as part of a class project and am running into one problem I can't seem to fix. Text has to be passed in the form of a fixed-size byte[] (either 32 or 64 bytes) depending on the particular case.
When I change the strings to byte[] with the .getBytes() method, it pads out the length of the string with empty squares. This is fine during transit and receipt, but at some point I need to change the string back to its original format (currently done with new String(byte[])) and delete the empty squares.
I can't seem to find a good way to do this. Any suggestions?
Relevant code bits client side:
byte[] bigDataByte = new byte[64];
sendData[2] = (bigDataByte = message.getBytes());
for (int i = 0; i < sendData.length; i++) {
    if (sendData[i] != null) {
        DatagramPacket sendPacket = new DatagramPacket(sendData[i], sendData[i].length, IPAddress, clientPort);
        clientSocket.send(sendPacket);
    }
}
Relevant code bits server side:
String name = new String(getBytes(32));
private static byte[] getBytes(int size) throws IOException {
    byte[] dataByte = new byte[size];
    DatagramPacket dataPacket = new DatagramPacket(dataByte, dataByte.length);
    servSocket.receive(dataPacket);
    return dataPacket.getData();
}
Not sure, but the issue might be that you are not specifying the charset.
Try using the constructor String(byte[] bytes, String charsetName) and the method getBytes(String charsetName).
e.g.
byte[] bytes = str.getBytes("UTF-8");
and
String str = new String(bytes, "UTF-8");
The default ones use the platform's default charset, which could lead to a mismatch.
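A sketch applying that on the receiving side (reusing the poster's servSocket and a 32-byte buffer). Because the receive buffer is fixed size, it also helps to build the String from the packet's actual length, so the unused trailing zero bytes that render as the empty squares are left out:
// sending side: byte[] bytes = message.getBytes(java.nio.charset.StandardCharsets.UTF_8);
byte[] dataByte = new byte[32];
DatagramPacket dataPacket = new DatagramPacket(dataByte, dataByte.length);
servSocket.receive(dataPacket);
String name = new String(dataPacket.getData(), 0, dataPacket.getLength(),
                         java.nio.charset.StandardCharsets.UTF_8);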
