Will char ch = 'A' always have the value 65, independent of the platform? I know that getting bytes out of a String is platform-dependent, but I am not sure how Java translates character literals to numerical values - which encoding it uses, or whether any encoding is involved at all.
Yes: The char type in Java uses the UTF-16 encoding (see JLS 3.2), in which 'A' has the numerical (decimal) code 65.
Yes. Java is specified to use Unicode, so 'A' is U+0041, with value 65.
Encoding comes into play when you try to convert a char or a String (which is a sequence of 16-bit UTF-16 code units) into a sequence of bytes - which can be done in a huge number of different ways. Many of those will represent 'A' as a single byte of value 65, but plenty don't.
In Java, every char literal follows the Unicode standard; the char type is defined in terms of UTF-16. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
So 'A' always has the numeric value 65 (U+0041).
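A quick way to see this for yourself (a minimal sketch; the widening conversion involves no charset at all):

char ch = 'A';
int value = ch;                        // implicit char-to-int widening, no encoding in play
System.out.println(value);             // 65 on every platform, because 'A' is U+0041
System.out.println((int) 'A' == 0x41); // true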
Related
Regarding Java syntax, there is a NumericType which consists of IntegralType and FloatingPointType. IntegralTypes are byte, short, int, long and char.
At the same time, I can assign a single character to a char variable.
char c1 = 10;
char c2 = 'c';
So here is my question. Why is char a numeric type, and how does the JVM convert 'c' to a number?
Why is char a numeric type...
Using numbers to represent characters as indexes into a table is the standard way text is handled in computers. It's called character encoding and has a long history, going back at least to telegraphs. For a long time personal computers used ASCII (a 7-bit encoding = 127 characters plus NUL) and then "extended ASCII" (an 8-bit encoding of various forms where the "upper" 128 characters had a variety of interpretations), but these are now obsolete and suitable only for niche purposes thanks to their limited character set. Before personal computers, popular encodings were EBCDIC and its precursor BCD. Modern systems use Unicode (usually by storing one or more of its transformations such as UTF-8 or UTF-16) or various standardized "code pages" such as Windows-1252 or ISO-8859-1.
...and how does the JVM convert 'c' to a number?
Java's numeric char values map to and from characters via Unicode (which is how the JVM knows that 'c' is the value 0x0063, or that 'é' is 0x00E9). Specifically, a char value maps to a Unicode code point and strings are sequences of code points.
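For example, casting a char to int (or assigning in the other direction) simply exposes that code point number:

System.out.println((int) 'c');   // 99  (U+0063)
System.out.println((int) 'é');   // 233 (U+00E9)
char fromNumber = 99;            // the mapping works in both directions
System.out.println(fromNumber);  // c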
There's quite a lot about the char data type, including why the value is 16 bits wide, in the JavaDoc of the Character class:
Unicode Character Representations
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
Because underneath Java represents chars as Unicode. There is some convenience to this, for example you can run a loop from 'A' to 'Z' and do something. It's important to realize, however, that in Java Strings aren't strictly arrays of characters like they are in some other languages. More info here
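For instance, the 'A'-to-'Z' loop mentioned above works precisely because char values are ordinary numbers:

for (char c = 'A'; c <= 'Z'; c++) {
    System.out.print(c);                     // prints ABCDEFGHIJKLMNOPQRSTUVWXYZ
}
char ch = 'q';
System.out.println(ch >= 'a' && ch <= 'z');  // true: the comparison uses the numeric values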
Internally, a char is stored as its Unicode (UTF-16) code, which is just an integer. The difference lies in how it is processed after it is read from memory.
In C/C++, char and int are very close and are converted to one another implicitly. The similar behavior in Java reflects the relationship between C/C++ and Java, the JVM itself being written in C/C++.
Besides being able to do arithmetic operations on chars which sometimes comes handy (like c >= 'a' && c <= 'z') I would say it is a design decision driven by the similar approach taken in other languages when Java was invented (primarily C and C++).
The fact that Character does not extend Number (as other numeric primitive wrappers do) somehow indicates that Java designers tried to find some kind of a compromise between numeric and non-numeric nature of characters.
DISCLAIMER I was not able to find any official docs about this.
Someone asked a similar question, but I didn't really get the answer.
when I say
char myChar = 'k' in Java, is it going to reserve 16 bits for it (according to the Java docs below)?
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Now let's say I have a Unicode character '電', and assume that its code point is something like U+FFFF1. This code point could not be stored in 2 bytes, so would Java allocate extra bytes (a UTF-16 based string) for it?
In short when I have something like this -
char myChar = '電'
Assuming that its code point is large enough to require more than 2 bytes to represent:
How many bits will myChar have - 16 or 32?
Thanks
Java uses UTF-16, and yes, every Java char is 16 bits. From the Java Tutorial - Primitive Data Types,
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Further, the Character Javadoc says (in part),
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
So, supplementary characters (like your second example) aren't represented as a single 16-bit character.
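A small sketch of that behaviour, reusing the U+2F81A ideograph from the quoted docs:

char[] units = Character.toChars(0x2F81A);        // a supplementary code point becomes a surrogate pair
System.out.println(units.length);                 // 2
System.out.println(Character.isLetter(0x2F81A));  // true  (the int overload sees the whole character)
System.out.println(Character.isLetter(units[0])); // false (a lone high surrogate is not a letter)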
As the Java docs state:
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
But when I have a String (containing only ASCII characters) and convert it to a byte array, every character of the String is stored in one byte, which is less than the 16 bits the Java docs state. How does that work? I could imagine that the Java compiler/interpreter uses just one byte per char for an ASCII character for performance reasons.
Furthermore, what happens if I have a String with only ASCII characters plus one non-ASCII character and convert it to a byte array? Does every character of the String use 2 bytes now?
Converting characters to bytes and vice versa is done using a character encoding.
The character encoding determines how characters are represented by bytes. For example, ASCII is a character encoding which uses 7 bits per character. Obviously, it can only represent 128 characters, far fewer than the 65,536 char values that exist in Java.
Other character encodings are UTF-8 and UTF-16. In fact, a Java char is really a UTF-16 code unit - if you cast it directly to an int, you get its UTF-16 code unit value.
Here's a longer tutorial to character encodings: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
If you call getBytes() on a String, it will use the default character encoding of the system to convert the characters in the string to bytes. It's better to use the version of getBytes() that takes a character set name as an argument, so that you know what character set is used. For example:
byte[] bytes = str.getBytes("UTF-8");
The internal format of a String uses 16 bits per character. When you convert it to a byte array, you use a certain character encoding which is either specified explicitly or the default platform encoding. The encoding may use fewer bits per character.
For example the ASCII encoding will store each character in a byte but it can only represent 128 different characters.
Another often used encoding is UTF-8 which uses a variable number of bytes per character. The first 128 characters (corresponding to the characters available in ASCII) can be stored in one byte each. Characters with order numbers 128 or higher need two or more bytes.
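A small illustration, using the Charset overload of getBytes (StandardCharsets spares you the checked exception of the string-based overload); the byte counts follow from the encodings described above:

import java.nio.charset.StandardCharsets;

String ascii = "hello";
String mixed = "héllo";   // 'é' is U+00E9, outside ASCII
System.out.println(ascii.getBytes(StandardCharsets.US_ASCII).length); // 5: one byte per character
System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 6: 'é' needs two bytes in UTF-8
System.out.println(mixed.getBytes(StandardCharsets.UTF_16).length);   // 12: 2-byte BOM + five 2-byte code units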
getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
Your platform's default charset is probably UTF-8. Hence, getBytes() will use one byte per character for characters which fit comfortably into that size.
String.getBytes() "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array". The platform's default charset (Charset.defaultCharset()) is probably UTF-8.
As for the second question, strings aren't actually required to use UTF-16. The way a JVM stores strings internally is irrelevant. The few occurrences of UTF-16 in the JVM spec apply only to chars.
Why the two following results are different?
bsh % System.out.println((byte)'\u0080');
-128
bsh % System.out.println("\u0080".getBytes()[0]);
63
Thanks for your answers.
(byte)'\u0080' just takes the numerical value of the codepoint, which does not fit into a byte and thus is subject to a narrowing primitive conversion which drops the bits that don't fit into the byte and, since the highest-order bit is set, yields a negative number.
"\u0080".getBytes()[0] transforms the characters to bytes according to your platform default encoding (there is an overloaded getBytes() method that allows you to specify the encoding). It looks like your platform default encoding cannot represent codepoint U+0080, and replaces it by "?" (codepoint U+003F, decimal value 63).
Unicode character U+0080 <control> can't be represented in your system default encoding and therefore is replaced by ? (ASCII code 0x3F = 63) when string is encoded into your default encoding by getBytes().
On my machine the byte array has 2 elements, [-62, -128] - that's because my default encoding is UTF-8, in which U+0080 does not fit into a single byte. Never use getBytes() without specifying an encoding.
When a character encoding cannot represent a character, it turns it into '?', which is 63 in ASCII.
try
System.out.println(Arrays.toString("\u0080".getBytes("UTF-8")));
prints
[-62, -128]
Actually, if you want getBytes() to give the same result as the cast, specify UTF-16LE as the charset encoding:
bsh % System.out.println("\u0080".getBytes("UTF-16LE")[0]);
-128
Java Strings are encoded internally as UTF-16, and since we want the lower byte like for the cast char -> byte, we use little endian here. Big endian works too, if we change the array index:
bsh % System.out.println("\u0080".getBytes("UTF-16BE")[1]);
-128
A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?
Does this boil down to what character encoding you are using?
You can handle them all if you're careful enough.
Java's char is a UTF-16 code unit. Characters with code points greater than 0xFFFF are encoded as 2 chars (a surrogate pair).
See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.
(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)
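For example, here is roughly what that surrogate-pair encoding looks like for U+1F408, a code point well above 0xFFFF:

int codePoint = 0x1F408;                    // outside the BMP
char[] pair = Character.toChars(codePoint); // encoded as a surrogate pair
System.out.println(pair.length);            // 2
System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DC08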
Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().
And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
To add to the other answers, some points to remember:
A Java char always takes 16 bits.
A Unicode character, when encoded as UTF-16, "almost always" (but not always) takes 16 bits: that's because there are more than 64K Unicode characters. Hence, a Java char is NOT always a Unicode character (though it "almost always" is).
"Almost always", above, means the first 64K code points of Unicode, range 0x0000 to 0xFFFF (the BMP), which take 16 bits in the UTF-16 encoding.
A non-BMP ("rare") Unicode character is represented as two Java chars (a surrogate pair). This also applies to the literal representation inside a string: for example, the character U+20000 is written as "\uD840\uDC00".
Corollary: string.length() returns the number of Java chars, not the number of Unicode characters. A string containing just one "rare" Unicode character (e.g. U+20000) would return length() = 2. The same consideration applies to any method that deals with char sequences (see the sketch just after these points).
Java has little intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int ch). Those are the real fully-Unicode methods.
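Continuing the U+20000 example from the points above, a short sketch:

String rare = "\uD840\uDC00";                                 // a single "rare" character, U+20000
System.out.println(rare.length());                            // 2: counts char code units
System.out.println(rare.codePointCount(0, rare.length()));    // 1: counts Unicode characters
System.out.println(Integer.toHexString(rare.codePointAt(0))); // 20000: the real code point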
You said:
A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.
Unicode grows
Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow — and not just because of emojis.
143,859 characters in Unicode 13 (Java 15, release notes)
137,994 characters in Unicode 12.1 (Java 13 & 14)
136,755 characters in Unicode 10 (Java 11 & 12)
120,737 characters in Unicode 8 (Java 9)
110,182 characters in Unicode 6.2 (Java 8)
109,449 characters in Unicode 6.0 (Java 7)
96,447 characters in Unicode 4.0 (Java 5 & 6)
49,259 characters in Unicode 3.0 (Java 1.4)
38,952 characters in Unicode 2.1 (Java 1.1.7)
38,950 characters in Unicode 2.0 (Java 1.1)
34,233 characters in Unicode 1.1.5 (Java 1.0)
char is legacy
The char type is long outmoded, now legacy.
Use code point numbers
Instead, you should be working with code point numbers.
You asked:
Does this mean that you can't handle certain Unicode characters in a Java application?
The char type can address less than half of today's Unicode characters.
To represent any Unicode character, use code point numbers. Never use char.
Every character in Unicode is assigned a code point number. There are 1,114,112 possible code points, numbered 0 through 1,114,111 (0x10FFFF). Doing the math against the counts listed above, most of the numbers in that range have not been assigned to a character yet. Some of those numbers are reserved as Private Use Areas and will never be assigned.
The String class has gained methods for working with code point numbers, as did the Character class.
Get the code point number for any character in a string, by zero-based index number. Here we get 97 for the letter a.
int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.
For the more general CharSequence rather than String, use Character.codePointAt.
We can get the Unicode name for a code point number.
String name = Character.getName( 97 ) ; // letter `a`
LATIN SMALL LETTER A
We can get a stream of the code point numbers of all the characters in a string.
IntStream codePointsStream = "Cat".codePoints() ;
We can turn that into a List of Integer objects. See How do I convert a Java 8 IntStream to a List?.
List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;
Any code point number can be changed into a String of a single character by calling Character.toString.
String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A.
a
We can produce a String object from an IntStream of code point numbers. See Make a string from an IntStream of code point numbers?.
IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).
String output =
intStream
.collect( // Collect the results of processing each code point.
StringBuilder :: new , // Supplier<R> supplier
StringBuilder :: appendCodePoint , // ObjIntConsumer<R> accumulator
StringBuilder :: append // BiConsumer<R,R> combiner
) // Returns a `CharSequence` object.
.toString(); // If you would rather have a `String` than `CharSequence`, call `toString`.
Cat 🐈
You asked:
Does this boil down to what character encoding you are using?
A String in Java is always, logically, a sequence of UTF-16 code units (whatever storage a particular JVM uses internally).
You only use other character encodings when importing text into, or exporting text out of, Java strings.
So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.
When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8 as UTF-16 is now considered harmful.
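As a rough sketch of that export problem (the charsets here are just illustrative; unmappable characters are replaced, typically by '?'):

import java.nio.charset.StandardCharsets;

String s = "Cat é €";   // 'é' (U+00E9) fits in ISO-8859-1, '€' (U+20AC) does not
byte[] legacy = s.getBytes(StandardCharsets.ISO_8859_1);
byte[] modern = s.getBytes(StandardCharsets.UTF_8);
System.out.println(new String(legacy, StandardCharsets.ISO_8859_1)); // Cat é ?  (the euro sign is lost)
System.out.println(new String(modern, StandardCharsets.UTF_8));      // Cat é €  (round-trips losslessly)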
Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.
In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:
char is a UTF-16 code unit, not a code point
new low-level APIs use an int to represent a Unicode code point
high level APIs have been updated to understand surrogate pairs
a preference towards char sequence APIs instead of char based methods
Since Java doesn't have 32 bit chars, I'll let you judge if we can call this good Unicode support.
Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, more thorough documentation here.
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
From the OpenJDK7 documentation for String:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.