Handling a char as a byte in Java, different results

Why are the following two results different?
bsh % System.out.println((byte)'\u0080');
-128
bsh % System.out.println("\u0080".getBytes()[0]);
63
Thanks for your answers.

(byte)'\u0080' just takes the numerical value of the code point, which does not fit into a byte, so it undergoes a narrowing primitive conversion: the bits that don't fit into the byte are dropped, and since the highest-order bit of the remaining value is set, the result is negative.
"\u0080".getBytes()[0] transforms the characters to bytes according to your platform default encoding (there is an overloaded getBytes() method that lets you specify the encoding). It looks like your platform default encoding cannot represent code point U+0080 and replaces it with "?" (code point U+003F, decimal value 63).

The Unicode character U+0080 (a control character) can't be represented in your system default encoding and is therefore replaced by ? (ASCII code 0x3F = 63) when the string is encoded by getBytes().

Here the byte array has 2 elements because the encoded form of this character does not fit in 1 byte.
On my machine the array contains [-62, -128], because my default encoding is UTF-8. Never use getBytes() without specifying an encoding.

When a character encoding cannot represent a character, the encoder turns it into '?', which is 63 in ASCII.
try
System.out.println(Arrays.toString("\u0080".getBytes("UTF-8")));
prints
[-62, -128]

Actually, if you want getBytes() to give the same result as the cast, specify UTF-16LE as the charset:
bsh % System.out.println("\u0080".getBytes("UTF-16LE")[0]);
-128
Java Strings are encoded internally as UTF-16, and since we want the low-order byte, just as with the char -> byte cast, we use little endian here. Big endian works too if we change the array index:
bsh % System.out.println("\u0080".getBytes("UTF-16BE")[1]);
-128

Related

Which charset should I use to encode and decode 8 bit values?

I have a problem with encoding and decoding specific byte values. I'm implementing an application where I need to take String data, do some bit manipulation on it, and return another String.
I'm currently getting byte[] values via String.getBytes(), doing the manipulation, and then returning a String via the constructor String(byte[] data). The issue is that when some of the bytes have specific values, e.g. -120, -127, etc., the decoding in the constructor yields the ? character, that is, byte value 63. As far as I know, these are values that can't be printed on Windows, given that -120 in Java is 10001000, which is the \b character according to the ASCII table.
Is there any charset that I could use to properly encode and decode every byte value (from -128 to 127)?
EDIT: I should also say that the ISO-8859-1 charset works fine, but does not encode language-specific characters such as ąęćśńźżół.
You seem to have some confusion regarding encodings, not specific to Java, so I'll try to help clear some of that up.
No charset or encoding uses negative code points such as -128 to -1. If you treat the byte as an unsigned integer, you get the range 0-255, which is valid for all the cp-* and iso-8859-* charsets.
ASCII characters are in the range 0-127 and so appear valid whether you treat the int as signed or unsigned.
UTF-8 encodes a character either as a single byte in the range 0-127 or as a multi-byte sequence (two to four bytes) whose bytes are all in the range 128-255.
You mention some Polish characters, so instead of ISO-8859-1 you should encode as ISO-8859-2 or (preferably) UTF-8.
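If the real goal is simply to round-trip arbitrary byte values through a String, ISO-8859-1 maps every byte 0-255 one-to-one onto the code points U+0000 to U+00FF. A minimal sketch under that assumption (names and sample values are illustrative, not from the original question):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteRoundTrip {
    public static void main(String[] args) {
        byte[] original = { -120, -127, 0, 65, 127 };

        // ISO-8859-1 maps each byte value 0..255 to the code point with the same value,
        // so any byte[] survives a String round trip unchanged.
        String asText = new String(original, StandardCharsets.ISO_8859_1);
        byte[] back = asText.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, back)); // true
    }
}

Note that this only guarantees byte-for-byte round-tripping; for Polish text such as ąęćśńźżół you still need ISO-8859-2 or UTF-8, as the answer above says.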

Size of a char in a byte array

As the Java doc states:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
But when I have a String (containing only ASCII characters) and convert it to a byte array, every character of the String is stored in one byte, which is less than the 16 bits the Java docs state. How does that work? I could imagine that the Java compiler/interpreter uses just one byte per char for an ASCII character for performance reasons.
Furthermore, what happens if I have a String containing only ASCII characters plus one non-ASCII character and convert it to a byte array? Does every character of the String use 2 bytes now?
Converting characters to bytes and vice versa is done using a character encoding.
The character encoding determines how characters are represented by bytes. For example, ASCII is a character encoding which uses 7 bits per character. Obviously, it can only represent 128 characters, far fewer than the 65,536 char values that exist in Java.
Other character encodings are UTF-8 and UTF-16. In fact, a Java char is really a UTF-16 code unit; if you directly cast it to an int, you get its UTF-16 value.
Here's a longer tutorial to character encodings: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
If you call getBytes() on a String, it will use the default character encoding of the system to convert the characters in the string to bytes. It's better to use the version of getBytes() that takes a character set name as an argument, so that you know what character set is used. For example:
byte[] bytes = str.getBytes("UTF-8");
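If you are on Java 7 or later, the StandardCharsets constants avoid both the checked UnsupportedEncodingException and typos in the charset name, e.g. (same hypothetical str as above):
byte[] bytes = str.getBytes(java.nio.charset.StandardCharsets.UTF_8);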
The internal format of a String uses 16 bits per character. When you convert it to a byte array, you use a certain character encoding which is either specified explicitly or the default platform encoding. The encoding may use fewer bits per character.
For example the ASCII encoding will store each character in a byte but it can only represent 128 different characters.
Another often-used encoding is UTF-8, which uses a variable number of bytes per character. The first 128 characters (corresponding to the characters available in ASCII) can be stored in one byte each. Characters with code points 128 or higher need two or more bytes.
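To see the effect the questioner asks about (mostly ASCII text plus one non-ASCII character), here is a small sketch assuming UTF-8; the string contents are just examples:

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // ASCII-only text: one byte per character in UTF-8.
        System.out.println("abc".getBytes(StandardCharsets.UTF_8).length);       // 3

        // Adding a single non-ASCII character ('\u00e9' = 'é') only grows that
        // character's representation; the ASCII characters still take one byte each.
        System.out.println("abc\u00e9".getBytes(StandardCharsets.UTF_8).length); // 5
    }
}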
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
Your platform's default charset is probably UTF-8. Hence, getBytes() will use one byte per character for characters in the ASCII range.
String.getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array". The platform's default charset (Charset.defaultCharset()) is probably UTF-8.
As for the second question, strings aren't actually required to use UTF-16. The way a JVM stores strings internally is irrelevant. The few occurrences of UTF-16 in the JVM spec apply only to chars.

Is char 'A' always 65 under all platforms and encodings?

Will char ch = 'A' always have the value 65, independent of the platform? I know that getting bytes out of a String is platform-dependent, but I am not sure how Java translates character literals to numerical values, i.e. which encoding it uses (or whether any encoding is in play at all).
Yes: The char type in Java uses the UTF-16 encoding (see JLS 3.2), in which 'A' has the numerical (decimal) code 65.
Yes. Java is specified to use Unicode, so 'A' is U+0041, with value 65.
Encoding comes into play when you try to convert a char or String (which are built from 16-bit code units) into a sequence of bytes, which can be done in a huge number of different ways. Many of those represent 'A' as a single byte with value 65, but plenty don't.
In Java every char literal follows the Unicode standard; Java's representation of Unicode in char values is UTF-16. A char has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
So 'A' always has the numeric value 65 (U+0041).
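A short sketch of the distinction between the char value (fixed by the language) and its byte representation (fixed only by the chosen encoding); the charsets below are just example choices:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharLiteralValue {
    public static void main(String[] args) {
        // The char value itself is defined by the language: 'A' is U+0041 = 65.
        System.out.println((int) 'A');                                                 // 65

        // The bytes depend entirely on the encoding used.
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.US_ASCII))); // [65]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_8)));    // [65]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]
    }
}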

Java - stream of bytes vs. stream of characters?

Title is pretty self-explanatory. In a lot of the JRE javadocs I see the phrases "stream of bytes" and "stream of characters" all over the place.
But aren't they the same thing? Or are they slightly different (e.g. interpreted differently) in Java-land? Thanks in advance.
In Java, a byte is not the same thing as a char. Therefore a byte stream is different from a character stream. Bytes are intended for arbitrary binary data; characters are specifically for data representing the building blocks of strings.
but if a char is only 1 byte in width
Except that it's not.
As per the JLS §4.2.1 a char is a number in the range:
from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
But a byte is a number in the range
from -128 to 127, inclusive
A stream of bytes is just plain bytes, like what you would see when you open a file in a hex editor.
A character is different from a plain byte. The ASCII encoding uses exactly 1 byte per character, but that is not true for many other encodings. For example, the UTF-8 encoding may use from 1 to 4 bytes to encode a single character. A stream of characters is designed to abstract away the underlying encoding and produce chars in one fixed internal representation (in Java, char and String use UTF-16).
As a rule of thumb:
When you are dealing with text, you must use a character stream to decode the bytes into characters with the appropriate encoding.
When you are dealing with binary data or a mix of binary and text, you must use a byte stream, since it doesn't make sense otherwise. If a sequence of bytes represents a String in a certain encoding, then you can always pick those bytes out and use the String(byte[] bytes, Charset charset) constructor to get the String back.
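A rough sketch of that rule of thumb (the file names are placeholders, and the text file is assumed to be UTF-8 encoded):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamsSketch {
    public static void main(String[] args) throws IOException {
        // Byte stream: raw bytes, no interpretation.
        try (InputStream in = new FileInputStream("data.bin")) {
            int firstByte = in.read(); // 0..255, or -1 at end of stream
            System.out.println(firstByte);
        }

        // Character stream: bytes are decoded into chars with an explicit charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}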
They are different. char is a 2-byte datatype in Java; byte is a 1-byte datatype.
Edit: char is also an unsigned type, while byte is not.
Generally it is better to talk about streams in terms of the size of the units they carry rather than what they represent. A stream of bytes is more intuitive than a stream of chars, because a stream of chars forces us to double-check whether a char really is a single byte and not a Unicode char or anything fancier.
A char is a representation, which can be encoded as one or more bytes, but a byte is always going to be a byte. The whole world will burn when bytes stop being 8 bits.

Isn't the size of a character in Java 2 bytes?

I used RandomAccessFile to read a byte from a text file.
public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);
    // Decodes the byte using the platform default charset.
    System.out.println(new String(cbuff));
}
Why am I seeing one full character being read by this?
A char represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset (**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
(**) Note that on Android the default character set is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (though it can still be configured to behave the legacy way).
Java stores all its chars internally as two bytes. However, when they are converted to bytes (when strings are encoded, etc.), the number of bytes depends on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode; thus, according to:
Java Character Docs
the max value supported is '\uffff' (hex FFFF, dec 65,535), or 11111111 11111111 in binary (two bytes).
The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.
It uses the platform default charset to decode bytes to characters. If you know your file contains text encoded in a different charset, you can use String(byte[] bytes, String charsetName) to apply the correct encoding when converting from bytes to characters.
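Following that advice, here is a hedged sketch of the question's method with an explicit charset (assuming the file is UTF-8 encoded; it still reads only one byte, so it only works for characters that the charset encodes in a single byte):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    public static void readFile(RandomAccessFile fr) throws IOException {
        byte[] cbuff = new byte[1];
        fr.read(cbuff, 0, 1);
        // Decode the byte with a known charset instead of the platform default.
        System.out.println(new String(cbuff, StandardCharsets.UTF_8));
    }
}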
In an ASCII text file each character is just one byte.
It looks like your file contains ASCII characters, which are encoded in just 1 byte each. If the text file contained a non-ASCII character, e.g. a 2-byte UTF-8 character, then you would get just the first byte, not the whole character.
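A small sketch of what happens when a single byte of a multi-byte UTF-8 character is decoded on its own (the character 'é' is just an example):

import java.nio.charset.StandardCharsets;

public class PartialUtf8 {
    public static void main(String[] args) {
        // U+00E9 ('é') is two bytes in UTF-8: [-61, -87].
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding only the first byte yields the replacement character U+FFFD,
        // not 'é': half of a multi-byte sequence is not a valid character.
        String partial = new String(utf8, 0, 1, StandardCharsets.UTF_8);
        System.out.println((int) partial.charAt(0)); // 65533
    }
}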
There are some great answers here but I wanted to point out that the JVM is free to store a char value in any space of size >= 2 bytes.
On many architectures there is a penalty for performing unaligned memory access so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing. https://en.wikipedia.org/wiki/False_sharing
It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".
I also want to point out that Java characters are often misused. People don't realize they are writing code that won't properly handle code points that don't fit in 16 bits.
Java allocates 2 bytes for a char, as it follows UTF-16. A character occupies a minimum of 2 bytes and, as a surrogate pair, a maximum of 4 bytes; there is no 1-byte or 3-byte storage for a character.
The Java char is 2 bytes. But the file encoding may be different.
So first you should know what encoding your file uses. For example, if the file is UTF-8 or ASCII encoded, you will retrieve the right chars by reading one byte at a time.
If the encoding of the file is UTF-16, it may still show you the correct char if your UTF-16 is little endian. For example, little-endian UTF-16 for 'A' is [65, 0]. When you read the first byte, it returns 65; after padding with 0 for the second byte, you get 'A'.
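A one-line sketch confirming the byte order described above:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16LeBytes {
    public static void main(String[] args) {
        // Little-endian UTF-16 puts the low-order byte first: 'A' (U+0041) becomes [65, 0].
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
    }
}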
