Reading special UTF-8 characters from ByteBuffer - java

How can I read special characters from a ByteBuffer? ByteBuffer.wrap("\u0001".getBytes()).asCharBuffer().get() throws a BufferUnderflow exception. As far as I understand ByteBuffer.getChar() always advances by two bytes, and there's only one byte in this string.

Related

Why is it said: CharacterStream classes are used to perform the input/output for the 16-bit Unicode characters?

When an I/O stream manages 8-bit bytes of raw binary data, it is
called a byte stream. And, when the I/O stream manages 16-bit Unicode
characters, it is called a character stream.
Byte stream is clear. It uses 8-bit bytes. So if I were to write a character that uses 3 bytes it would only write its last 8 bits! Thus making incorrect output.
So that is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store.
Are you seeing pattern? We can't know how much bytes it will take for writing any of these characters until we convert them. So my question is why is it said that character streams manage 16-bit Unicode characters? When in case where I wrote Ạ that takes 3 bytes it didn't cut it to last 16-bits like byte streams cut last 8-bits. What does that quote even mean then?
In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.
A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.
In UTF-16, Ạ is represented as the 16-bit char 0x1EA1. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.
If you converted it to bytes with the UTF-8 encoding, as here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();
Then you would get the 3 byte sequence 0xE1 0xBA 0xA1 as expected.
In Java, a character (char) is always 16 bits, as can be seen from its max value - 65535. This is why the quote is not wrong. 16 bit is indeed a character.
"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):
Every Unicode code point in the Basic Multilingual Plane is encoded in 16 bits. (Yes 16 bit is enough for that) Every code point outside of the BMP is encoded with a pair of 16 bit characters, called surrogate pairs.
"Ạ" (U+1EA0) is inside the BMP, so can be encoded with 16 bits.
You said:
Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!
That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you will give it with Java code. When you call println on a PrintStream, you are giving it a String, which is a bunch of chars under the hood, which is a bunch of 16-bits. So it is really managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.
It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character" which would refer to the upper/lower surrogates of the surrogate pair that you are printing.

What's the difference between writeUTF and writeChars?

What's the difference between writeUTF and writeChars? (methods of ObjectOutputStream)
Further I have not found the corresponding readChars in ObjectInputStream.
writeUTF writes text in UTF-8 format encoding preceeded with text length, so readUTF knows how many characters to read from stream.
writeChars writes text as a sequence of 2-bytes chars with no length. To read it, we should use readChar method and we need to know how many chars were written.
writeChars() uses Unicode values
Writes every character in the string s, to the output stream, in
order, two bytes per character. If s is null, a NullPointerException
is thrown. If s.length is zero, then no characters are written.
Otherwise, the character s[0] is written first, then s1, and so on;
the last character written is s[s.length-1]. For each character, two
bytes are actually written, high-order byte first, in exactly the
manner of the writeChar method.
writeUTF() uses a slightly-modified version of UTF-8
Writes two bytes of length information to the output stream, followed
by the modified UTF-8 representation of every character in the string
s. If s is null, a NullPointerException is thrown. Each character in
the string s is converted to a group of one, two, or three bytes,
depending on the value of the character.

Reading one byte by File IO operation

the read method of FileInputStream will read 1 byte everytime, but who it is reading a character in a file, as the character size in java is (16 bit-2 Bytes). Is it because read method is native it will convert to 8-bits?
The read method of FileInputStream reader returns an int (equivalent to a byte), not a character. It's your responsibility to transform the result into a character.
As the javadoc of FileInputStream suggest: "for reading streams of characters, consider using FileReader".
The number of bytes to encode a character depends on the encoding of the file. For example, if the file is encoded with ASCII, each byte is a character, but if your file is encoded in UTF-8, a character is 1, 2, 3 or 4 bytes.
If you want more information about encoding, I would suggest reading the following article : The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Storing characters as single bytes in java

I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in java characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input stores them in a single byte and prints them out. How are two byte characters fitting into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
Java stores characters as 2 bytes, but for normal ASCII characters the actual data fits in one byte. So as long as you can assume the file being read there is ASCII then that will work fine, as the actual numeric value of the character fits in a single byte.

Is there a simple way to append a byte to a StringBuffer and specify the encoding?

Question
What is the simplest way to append a byte to a StringBuffer (i.e. cast a byte to a char) and specify the character encoding used (ASCII, UTF-8, etc)?
Context
I want to append a byte to a stringbuffer. Doing so requires casting the byte to a char:
myStringBuffer.append((char)nextByte);
However, the code above uses the default character encoding for my machine (which is MacRoman). Meanwhile, other components in the system/network require UTF-8. So I need to so something like:
try {
myStringBuffer.append(new String(new Byte[]{nextByte}, "UTF-8"));
} catch (UnsupportedEncodingException e) {
//handle error
}
Which, frankly, is pretty ugly.
Surely, there's a better way (other than breaking the same code into multiple lines)???????
The simple answer is 'no'. What if the byte is the first byte of a multi-byte sequence? Nothing would maintain the state.
If you have all the bytes of a logical character in hand, you can do:
sb.append(new String(bytes, charset));
But if you have one byte of UTF-8, you can't do this at all with stock classes.
It would not be terribly difficult to build a juiced-up StringBuffer that uses java.nio.charset classes to implement byte appending, but it would not be one or two lines of code.
Comments indicate that there's some basic Unicode knowledge needed here.
In UTF-8, 'a' is one byte, 'á' is two bytes, '丧' is three bytes, and '𝌎' is four bytes. The job of CharsetDecoder is to convert these sequences into Unicode characters. Viewed as a sequential operation over bytes, this is obviously a stateful process.
If you create a CharsetDecoder for UTF-8, you can feed it only byte at a time (in a ByteBuffer) via this method. The UTF-16 characters will accumulate in the output CharBuffer.
I think the error here is in dealing with bytes at all. You want to deal with strings of characters instead.
Just interpose a reader on the input and output stream to do the mapping between bytes and characters for you. Use the InputStreamReader(InputStream in, CharsetDecoder dec) form of the constructor for the input, though, so that you can detect input encoding errors via an exception. Now you have strings of characters instead of buffers of bytes. Put an OutputStreamWriter on the other end.
Now you no longer have to worry about bytes or encodings. It’s much simpler this way.

Categories