Unicode character length in bytes - always the same? - java

I defined a Unicode character as a byte array:
private static final byte[] UNICODE_NEXT_LINE = Charsets.UTF_8.encode("\u0085").array();
At the moment the byte array length is 3; is it safe to assume the length of the array is always 3 across platforms?
Thank you

The UTF-8 encoding of any particular character is fixed by the standard, so its encoded length is the same on every platform. Note, though, that U+0085 actually encodes to two bytes (0xC2 0x85); the array length of 3 comes from Charset.encode() returning a ByteBuffer whose backing array can be larger than the encoded content, so check remaining() (or use getBytes(StandardCharsets.UTF_8)) rather than array().length.
But characters in UTF-8 can be one, two, three or even four bytes long, so no, you can't assume that converting an arbitrary character to UTF-8 will always produce the same number of bytes.
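A minimal sketch of that difference, using StandardCharsets.UTF_8 directly (Guava's Charsets.UTF_8 is the same java.nio.charset.Charset, so it behaves identically):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8EncodeDemo {
    public static void main(String[] args) {
        // The encoded content of U+0085 is 2 bytes: 0xC2 0x85.
        byte[] encoded = "\u0085".getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length);        // 2

        // Charset.encode() returns a ByteBuffer; its backing array may be
        // larger than the encoded content, which is where the 3 comes from.
        ByteBuffer buffer = StandardCharsets.UTF_8.encode("\u0085");
        System.out.println(buffer.remaining());    // 2 (the real encoded length)
        System.out.println(buffer.array().length); // may be larger, e.g. 3
    }
}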

The UTF-8 bytes for that particular character will always be the same (there are two of them, 0xC2 0x85; the third array slot is just spare ByteBuffer capacity), but other characters will be different. In UTF-8, a character takes anywhere from 1 to 4 bytes. The 8 in 'UTF-8' just means that it uses 8-bit code units.
The Wikipedia page on UTF-8 provides a pretty good overview of how that works. Basically, the first bits of the first byte tell you how many bytes long the character will be. For instance, if the first bit of the first byte is 0, as in 01111111, the character is only one byte long (in UTF-8, these are the ASCII characters). If the first bits are 110, as in 11011111, the character is two bytes long. The chart in the Wikipedia page provides a good illustration of this.
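A short sketch of both points, printing the UTF-8 length of a few sample characters (chosen arbitrarily) and the lead byte that announces that length:

import java.nio.charset.StandardCharsets;

public class Utf8LeadByteDemo {
    public static void main(String[] args) {
        // 1-, 2-, 3- and 4-byte examples: A, é, €, and the U+1F600 emoji.
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            // The lead byte's prefix (0..., 110..., 1110..., 11110...) encodes the length.
            String lead = String.format("%8s", Integer.toBinaryString(utf8[0] & 0xFF)).replace(' ', '0');
            System.out.println(s + " -> " + utf8.length + " byte(s), lead byte " + lead);
        }
    }
}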
There's also this question, which has some good answers as well.

Related

Why is it said: CharacterStream classes are used to perform the input/output for the 16-bit Unicode characters?

When an I/O stream manages 8-bit bytes of raw binary data, it is called a byte stream. And, when the I/O stream manages 16-bit Unicode characters, it is called a character stream.
Byte streams are clear: they use 8-bit bytes. So if I were to write a character that needs 3 bytes, it would only write its last 8 bits, producing incorrect output.
So that is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes to store it in UTF-8. But say I also want to store a 'normal' A. Now it would take 1 byte to store.
Do you see the pattern? We can't know how many bytes it will take to write any of these characters until we convert them. So my question is: why is it said that character streams manage 16-bit Unicode characters? In the case where I wrote Ạ, which takes 3 bytes, it didn't cut it down to the last 16 bits the way byte streams cut down to the last 8 bits. What does that quote even mean, then?
In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.
A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.
A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.
In UTF-16, Ạ is represented as the 16-bit char 0x1EA0. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.
If you converted it to bytes with the UTF-8 encoding, as here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
// Requires java.io.* and java.nio.charset.StandardCharsets.
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");   // one 16-bit char goes in
writer.close();
return baos.toByteArray();   // three UTF-8 bytes come out
Then you would get the 3-byte sequence 0xE1 0xBA 0xA0, as expected.
In Java, a character (char) is always 16 bits, as can be seen from its max value, 65535. This is why the quote is not wrong: a character here does indeed mean 16 bits.
"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):
Every Unicode code point in the Basic Multilingual Plane (BMP) is encoded in 16 bits (yes, 16 bits is enough for that). Every code point outside the BMP is encoded with a pair of 16-bit chars, called a surrogate pair.
"Ạ" (U+1EA0) is inside the BMP, so it can be encoded in 16 bits.
You said:
Say I want to write Latin Capital Letter Ạ. I would need 3 bytes to store it in UTF-8. But say I also want to store a 'normal' A. Now it would take 1 byte to store!
That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you give it from Java code. When you call println on a PrintStream, you are giving it a String, which is a bunch of chars under the hood, which is a bunch of 16-bit values. So it really is managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.
It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point"; it says "character", which would refer to the high and low surrogates of the surrogate pair that you are printing.
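A small sketch of that last point, using U+1F600 (an emoji, chosen only as an example of a non-BMP code point):

public class SurrogatePairDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00";                            // U+1F600, one code point outside the BMP
        System.out.println(s.length());                       // 2 -> two 16-bit chars (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));  // 1 -> one Unicode code point
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D83D DE00
    }
}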

Difference between String.length() and String.getBytes().length

I am a beginner, self-learning Java programming.
So, I want to know the difference between String.length() and String.getBytes().length in Java.
Which is more suitable for checking the length of a string?
String.length()
String.length() is the number of 16-bit UTF-16 code units needed to represent the string. That is, it is the number of char values used to represent the string, and thus also equal to toCharArray().length. For most characters used in Western languages this is typically the same as the number of Unicode characters (code points) in the string, but the number of code points will be less than the number of code units if any UTF-16 surrogate pairs are used. Such pairs are needed only to encode characters outside the BMP and are rarely used in most writing (emoji are a common exception).
String.getBytes().length
String.getBytes().length, on the other hand, is the number of bytes needed to represent your string in the platform's default encoding. For example, if the default encoding were UTF-16 (rare), it would be roughly 2x the value returned by String.length(), since each 16-bit code unit takes 2 bytes to represent. More commonly, your platform encoding will be a variable-width encoding like UTF-8.
This means the relationship between those two lengths is more complex. For ASCII strings, the two calls will almost always produce the same result (outside of unusual default encodings that don't encode the ASCII subset in 1 byte). Outside of ASCII strings, String.getBytes().length is likely to be larger, as it counts the bytes needed to represent the string, while length() counts 2-byte code units.
Which is more suitable?
Usually you'll use String.length() in concert with other string methods that take offsets into the string. E.g., to get the last character, you'd use str.charAt(str.length()-1). You'd only use the getBytes().length if for some reason you were dealing with the array-of-bytes encoding returned by getBytes.
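For instance, a quick sketch of how the two lengths diverge (the sample strings are arbitrary, and UTF-8 is passed explicitly instead of relying on the platform default):

import java.nio.charset.StandardCharsets;

public class LengthVsBytesDemo {
    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "caf\u00E9";   // é takes 2 bytes in UTF-8
        String emoji = "\uD83D\uDE00";   // one code point, but a surrogate pair in UTF-16

        System.out.println(ascii.length() + " vs " + ascii.getBytes(StandardCharsets.UTF_8).length);       // 3 vs 3
        System.out.println(accented.length() + " vs " + accented.getBytes(StandardCharsets.UTF_8).length); // 4 vs 5
        System.out.println(emoji.length() + " vs " + emoji.getBytes(StandardCharsets.UTF_8).length);       // 2 vs 4
    }
}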
The length() method returns the length of the string in chars (UTF-16 code units).
Characters may take more than a single byte. The expression String.getBytes().length returns the length of the string in bytes, using the platform's default character set.
The String.length() method returns the number of chars in the string, while String.getBytes().length returns the number of bytes used to store those chars.
In Java, chars are UTF-16 code units, so it takes 2 bytes to store one char.
Check this SO answer out.
I hope that it will help :)
In short, String.length() returns the number of chars in the string, while String.getBytes().length returns the number of bytes needed to represent those chars in the specified (or default) encoding.
In many cases String.length() will have the same value as String.getBytes().length. But when the encoding is UTF-8 and the string contains characters above code point 127, String.length() will not be the same as String.getBytes().length.
Here is an example of how the characters in a string are converted to bytes when calling String.getBytes(). This should give you a sense of the difference between String.length() and String.getBytes().length.
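A rough sketch along those lines (UTF-8 is passed explicitly here so the output is the same everywhere; the sample strings are arbitrary):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        for (String s : new String[] { "A", "\u00C4", "\u1EA0" }) {   // A, Ä, Ạ
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(s + ": length()=" + s.length()
                    + ", getBytes().length=" + bytes.length
                    + ", bytes=" + Arrays.toString(bytes));
        }
    }
}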

Storing characters as single bytes in java

I read that we should use Reader/Writer for reading/writing character data and InputStream/OutputStream for reading/writing binary data. Also, in Java, characters are 2 bytes. I am wondering how the following program works. It reads characters from standard input, stores them in a single byte, and prints them out. How are two-byte characters fitting into one byte here?
http://www.cafeaulait.org/course/week10/06.html
The comment explains it pretty clearly:
// Notice that although a byte is read, an int
// with value between 0 and 255 is returned.
// Then this is converted to an ISO Latin-1 char
// in the same range before being printed.
So basically, this assumes that the incoming byte represents a character in ISO-8859-1.
If you use a console with a different encoding, or perhaps provide a character which isn't in ISO-8859-1, you'll end up with problems.
Basically, this is not good code.
Java stores characters as 2 bytes, but for plain ASCII characters the actual data fits in one byte. So as long as you can assume the input being read is ASCII, this will work fine, since the numeric value of each such character fits in a single byte.
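If you do want to read characters safely from a byte source, a minimal sketch is to wrap the stream in an InputStreamReader with an explicit charset (UTF-8 here is just an assumption about the input):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharsetDemo {
    public static void main(String[] args) throws IOException {
        // The Reader decodes bytes to chars for us, including multi-byte sequences.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        int c;
        while ((c = in.read()) != -1) {
            System.out.print((char) c);
        }
    }
}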

Java - stream of bytes vs. stream of characters?

Title is pretty self-explanatory. In a lot of the JRE javadocs I see the phrases "stream of bytes" and "stream of characters" all over the place.
But aren't they the same thing? Or are they slightly different (e.g. interpreted differently) in Java-land? Thanks in advance.
In Java, a byte is not the same thing as a char. Therefore a byte stream is different from a character stream. Bytes are intended for arbitrary binary data; characters are specifically for data representing the building blocks of strings.
but if a char is only 1 byte in width
Except that it's not.
As per the JLS §4.2.1 a char is a number in the range:
from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
But a byte is a number in the range
from -128 to 127, inclusive
A stream of bytes is just plain bytes, like what you would see if you opened a file in a hex editor.
A character is different from a plain byte. The ASCII encoding uses exactly 1 byte per character, but that is not true for many other encodings. For example, UTF-8 may use from 1 to 4 bytes to encode a single character. A stream of characters is designed to abstract away the underlying encoding and hand you chars in one known form (in Java, char and String use UTF-16).
As a rule of thumb:
When you are dealing with text, you must use a character stream to decode the bytes into characters with the appropriate encoding.
When you are dealing with binary data, or a mix of binary and text, you must use a byte stream, since it doesn't make sense otherwise. If a sequence of bytes represents a String in a certain encoding, then you can always pick those bytes out and use the String(byte[] bytes, Charset charset) constructor to get the String back.
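A tiny sketch of that round trip (the string and the charset are chosen arbitrarily):

import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        String original = "caf\u00E9";
        // Text -> bytes: pick an encoding explicitly.
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // Bytes -> text: decode with the same encoding to get the original String back.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(original.equals(decoded));   // true
    }
}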
They are different. char is a 2-byte datatype in Java; byte is a 1-byte datatype.
Edit: char is also an unsigned type, while byte is not.
Generally it is better to talk about streams in terms of their unit size rather than what they carry. A stream of bytes is more intuitive than a stream of chars, because a stream of chars makes us double-check that a char really is one byte and not a Unicode char or anything fancy.
A char is a representation, which can often be represented by a byte, but a byte is always going to be a byte. The whole world will burn when bytes stop being 8 bits.

Isn't the size of character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);                   // read a single byte
    System.out.println(new String(cbuff));  // decode it with the platform default charset
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor, you ask Java to convert the byte[] to a String using the platform's default charset (**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8 and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way)
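So a safer variant of the method from the question passes the charset explicitly (ISO-8859-1 here is only an assumption about what the file contains; java.nio.charset.StandardCharsets must be imported):

public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);
    // Decode with an explicit charset instead of relying on the platform default.
    System.out.println(new String(cbuff, StandardCharsets.ISO_8859_1));
}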
Java stores all its "chars" internally as two bytes. However, when they are encoded to bytes (for example when a String is written out), the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 in binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.
It uses the platform default charset to decode bytes to characters. If you know your file contains text encoded in a different charset, you can use String(byte[] bytes, String charsetName) to apply the correct decoding (from bytes to characters).
In an ASCII text file, each character is just one byte.
It looks like your file contains ASCII characters, which are encoded in just 1 byte each. If the text file contained a non-ASCII character, e.g. a 2-byte UTF-8 sequence, you would get just the first byte, not the whole character.
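A small sketch of that failure mode, keeping only the first byte of a 2-byte UTF-8 character (the character is arbitrary):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialReadDemo {
    public static void main(String[] args) {
        byte[] full = "\u00E9".getBytes(StandardCharsets.UTF_8);   // é -> 0xC3 0xA9, two bytes
        byte[] firstByteOnly = Arrays.copyOfRange(full, 0, 1);

        System.out.println(new String(full, StandardCharsets.UTF_8));          // é
        System.out.println(new String(firstByteOnly, StandardCharsets.UTF_8)); // U+FFFD replacement character
    }
}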
There are some great answers here, but I wanted to point out that the JVM is free to store a char value in any size of space >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that Java chars are often misused. People don't realize they are writing code that won't properly handle code points that don't fit in 16 bits.
Java follows UTF-16, so it allocates characters in units of 2 bytes: a character occupies a minimum of 2 bytes (one char) and a maximum of 4 bytes (a surrogate pair). There is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, if the file is ASCII encoded (or contains only the ASCII subset of UTF-8), then you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little-endian. For example, the little-endian UTF-16 for A is [65, 0]. Then when you read the first byte, it returns 65, and after padding with 0 for the second byte you get A.
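A quick sketch of those byte orders (both for the letter A):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16OrderDemo {
    public static void main(String[] args) {
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
    }
}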
