Is the Java char type guaranteed to be stored in any particular encoding?
Edit: I phrased this question incorrectly. What I meant to ask is: are char literals guaranteed to use any particular encoding?
"Stored" where? All Strings in Java are represented in UTF-16. When written to a file, sent across a network, or whatever else, it's sent using whatever character encoding you specify.
Edit: Specifically for the char type, see the Character docs. Specifically: "The char data type ... are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities." Therefore, casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it obviously won't necessarily be a valid UTF-16 character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, and if you're operating at this level, it might be important to get familiar with those semantics.
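For example, a minimal check of that cast (assuming the source file is compiled with the right -encoding so the literal is read correctly):

char c = 'é';                                      // U+00E9, a BMP character
int codeUnit = c;                                  // plain widening conversion, no table lookup
System.out.println(codeUnit);                      // 233
System.out.println(Integer.toHexString(codeUnit)); // e9, the UTF-16 code unit value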
A Java char is conventionally used to hold a Unicode code unit; i.e. a 16 bit unit that is part of a valid UTF-16 sequence. However, there is nothing to prevent an application from putting any 16 bit unsigned value into a char, irrespective of what it actually means.
So you could say that a Unicode code unit can be represented by a char and a char can represent a Unicode code unit ... but neither of these is necessarily true, in the general case.
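For instance, nothing stops you from storing a lone surrogate code unit, which is not a meaningful character on its own (a small sketch):

char anything = (char) 0xD800;                            // a lone high-surrogate code unit
System.out.println((int) anything);                       // 55296
System.out.println(Character.isHighSurrogate(anything));  // true
// The assignment is perfectly legal; whether the value "means" anything is up to the application.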
Your question about how a Java char is stored cannot be answered. Simply said, it depends on what you mean by "stored":
If you mean "represented in an executing program", then the answer is JVM implementation specific. (The char data type is typically represented as a 16 bit machine integer, though it may or may not be machine word aligned, depending on the specific context.)
If you mean "stored in a file" or something like that, then the answer is entirely dependent on how the application chooses to store it.
Is the Java char type guaranteed to be stored in any particular encoding?
In the light of what I said above the answer is "No". In an executing application, it is up to the application to decide what a char means / contains. When a char is stored to a file, the application decides how it wants to store it and what on-disk representation it will use.
FOLLOWUP
What about char literals? For example, 'c' must have some value that is defined by the language.
Java source code is required (by the language spec) to be Unicode text, represented in some character encoding that the tool chain understands; see the javac -encoding option. In theory, a character encoding could map the c in 'c' in your source code to something unexpected.
In practice though, the c will map to the Unicode lower-case C code-point (U+0063) and will be represented as the 16-bit unsigned value 0x0063.
To the extent that char literals have a meaning ascribed by the Java language, they represent (and are represented as) UTF-16 code units. Note that they may or may not be assigned Unicode code points ("characters"). Some Unicode code points in the range U+0000 to U+FFFF are unassigned.
Originally, Java used UCS-2 internally; now it uses UTF-16. The two are virtually identical, except for D800 - DFFF, which are used in UTF-16 as part of the extended representation for larger characters.
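To illustrate the literal case (again assuming the compiler reads the source with the correct encoding):

System.out.println((int) 'c');        // 99, i.e. U+0063
System.out.println('c' == '\u0063');  // true: the escape and the literal denote the same code unit
System.out.println((int) 'ё');        // 1105, i.e. U+0451, whatever encoding the source file itself used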
Related
// non-UTF source file encoding
char ch = 'й'; // some number within 0..65535 is stored in the char.
System.out.println(ch); // the same number is output to the console.
"Java's internal encoding is UTF-16" - where does that meaningfully come into play here?
Besides, I can perfectly well put a single UTF-16 code unit from the surrogate range into a char (say '\uD800'), making that char invalid Unicode on its own. And let's stay within the BMP, so we don't have to think about a supplementary symbol taking 2 chars (code units) - thinking about that case already makes "char internally uses UTF-16" sound like complete nonsense to me. But maybe "char internally uses UTF-16" makes sense within the BMP?
I could understand it if it worked like this: my source code file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 (which really happens), and then this number is automatically converted to another number (from the windows-1251 value to the UTF-16 value) - which is NOT taking place (am I right?!). That I could understand as "internally uses UTF-16". Then the stored number is written out as given (as the win-1251 value; none of my imaginary conversion from internal UTF-16 to the output/console encoding takes place), and the console shows it by converting the number to a glyph using the console encoding (which really happens).
So this "UTF-16 encoding used internally" is NEVER USED ANYHOW??? A char just stores any number (in [0..65535]) and, apart from its specific range and being "unsigned", has NO DIFFERENCE FROM int (within the scope of my example, of course)???
P.S. Experimentally, the code above with UTF-8 encoding of the source file and console outputs
й
1081
with win-1251 encoding of the source file and UTF-8 in the console it outputs
�
65533
Same output if we use String instead of char...
String s = "й";
System.out.println(s);
In the API, methods taking char as an argument usually never take an encoding as an argument, but methods taking byte[] as an argument often take an encoding as another argument. This implies that with char we don't need an encoding (meaning that we know the encoding for sure). But **how on earth do we know in what encoding something was put into a char???
If char is just storage for a number, don't we need to know what encoding that number originally came from?**
So char vs byte is just that char has two bytes of something with UNKNOWN encoding (instead of one byte of UNKNOWN encoding for a byte).
Given some initialized char variable, we don't know what encoding to use to display it correctly (i.e. which console encoding to choose for output), and we cannot tell what the encoding of the source file was where it was initialized with a char literal (not counting cases where the various encodings happen to be compatible).
Am I right, or am I a big idiot? Sorry for asking in latter case :)))
SO research shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.
Surrogate pairs are only meaningful in a character sequence. A single char cannot hold a character value outside the BMP. This is where the character abstraction breaks down.
Unicode is a system for expressing textual data as code points. These are typically characters, but not always. A Unicode code point is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a code unit. (For example, UTF-8 is encoded as 8-bit bytes, and UTF-16 is encoded as 16-bit words.)
While the first version of Unicode only allowed code points in the range U+0000 to U+FFFF, in Unicode 2.0 the range was changed to U+0000 to U+10FFFF.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less than or equal to U+FFFF. For larger code points, the UTF-16 representation consists of two 16-bit values: a so-called surrogate pair. And that will fit into 2 Java chars. So in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (U+0000 through U+FFFF). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes. A common case is the code points for Emojis.
If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly; that is, to deal with the surrogate pair cases.
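As a rough illustration, using U+1F600 (an emoji outside the BMP):

String s = "a\uD83D\uDE00";                                // "a" followed by U+1F600 as a surrogate pair
System.out.println(s.length());                            // 3 char values (UTF-16 code units)
System.out.println(s.codePointCount(0, s.length()));       // 2 Unicode code points
System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
System.out.println(Character.charCount(0x1F600));          // 2: this code point needs a surrogate pair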
So how does this relate to UTF-8, windows-1251 and so on?
Well these are 8-bit character encodings that are used at the OS level in text files and so on. When you read a file using a Java Reader your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer) you transcode in the other direction.
This doesn't always work.
Many character encodings such as windows-1251 are not capable of representing the full range of Unicode code points. So, if you attempt to write (say) a CJK character via a Writer configured for windows-1251, you will get ? characters instead.
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa) then the transcoding is liable to give garbage. This phenomenon is so common it has a name: Mojibake.
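You can reproduce that in memory, without files, by deliberately decoding bytes with the wrong charset (a sketch):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes = "й".getBytes(StandardCharsets.UTF_8);                  // 0xD0 0xB9
String garbled = new String(utf8Bytes, Charset.forName("windows-1251")); // decoded with the wrong charset
System.out.println(garbled);                                             // mojibake such as "Р№"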
You asked:
Does that mean that in char ch = 'й'; literal 'й' is always converted to utf16 from whatever encoding source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any character or String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
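For example, when reading a file at runtime you would typically name the charset explicitly rather than rely on the platform default (a sketch; the file name is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8)) {
    System.out.println(reader.readLine());   // bytes are transcoded from UTF-8 to UTF-16 chars here
} catch (IOException e) {
    e.printStackTrace();
}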
I am creating a set of tests for the size of a String. To do so I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java has a checked exception, UnsupportedEncodingException.
Just for curiosity, and to further consider other possible test scenarios, is there a text that cannot be represented by UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) does it specify that UTF-8/Unicode indeed includes ALL possible characters. I know that its size is 2^32 and many of them are still empty, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an
algorithmic mapping from every Unicode code point (except surrogate
code points) to a unique byte sequence. The ISO/IEC 10646 standard
uses the term “UCS transformation format” for UTF; the two terms are
merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round
tripping: mapping from any Unicode coded character sequence S to a
sequence of bytes and back will produce S again. To ensure round
tripping, a UTF mapping must map all code points (except surrogate
code points) to unique byte sequences. This includes reserved
(unassigned) code points and the 66 noncharacters (including U+FFFE
and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyways).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, the specified range of characters covers the whole assigned Unicode range (excluding the surrogate character range of course).
is there a text that cannot be represented by UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that does have some differences/limitations from standard UTF-8.
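As an aside, a minimal sketch of the size check from the question (MAX_SIZE is a made-up constant); using StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException, and any valid String round-trips losslessly through standard UTF-8:

import java.nio.charset.StandardCharsets;

final int MAX_SIZE = 255;                                 // hypothetical limit
String myString = "héllo \uD83D\uDE00";                   // mixes 1-, 2- and 4-byte UTF-8 sequences
System.out.println(myString.getBytes(StandardCharsets.UTF_8).length > MAX_SIZE); // the check, no checked exception
String back = new String(myString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
System.out.println(back.equals(myString));                // true: lossless round trip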
Regarding Java syntax, there is a NumericType which consists of IntegralType and FloatingPointType. IntegralTypes are byte, short, int, long and char.
At the same time, I can assign a single character to char variable.
char c1 = 10;
char c2 = 'c';
So here is my question: why is char a numeric type, and how does the JVM convert 'c' to a number?
Why is char a numeric type...
Using numbers to represent characters as indexes into a table is the standard way text is handled in computers. It's called character encoding and has a long history, going back at least to telegraphs. For a long time personal computers used ASCII (a 7-bit encoding = 127 characters plus nul) and then "extended ASCII" (an 8-bit encoding of various forms where the "upper" 128 characters had a variety of interpretations), but these are now obsolete and suitable only for niche purposes thanks to their limited character set. Before personal computers, popular encodings were EBCDIC and its precursor BCD. Modern systems use Unicode (usually by storing one or more of its transformations such as UTF-8 or UTF-16) or various standardized "code pages" such as Windows-1252 or ISO-8859-1.
...and how does the JVM convert 'c' to a number?
Java's numeric char values map to and from characters via Unicode (which is how the JVM knows that 'c' is the value 0x0063, or that 'é' is 0x00E9). Specifically, a char value holds a UTF-16 code unit, which for characters in the Basic Multilingual Plane is the same as the Unicode code point, and strings are sequences of those code units.
There's quite a lot about the char data type, including why the value is 16 bits wide, in the JavaDoc of the Character class:
Unicode Character Representations
The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:
The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
Because underneath, Java represents chars as Unicode values. There is some convenience to this; for example, you can run a loop from 'A' to 'Z' and do something with each letter. It's important to realize, however, that in Java Strings aren't strictly arrays of characters like they are in some other languages.
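For example, that loop idiom works because consecutive letters have consecutive numeric values:

for (char c = 'A'; c <= 'Z'; c++) {
    System.out.print(c);                 // prints ABCDEFGHIJKLMNOPQRSTUVWXYZ
}
System.out.println();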
Internally, a char is stored as its Unicode (UTF-16) code unit value, which is an integer; for characters in the ASCII range this matches the ASCII code. The difference is in how that value is interpreted after it is read from memory.
In C/C++, char and int are very close and are implicitly converted to one another. The similar behavior in Java shows the relationship between C/C++ and Java, as the JVM is written in C/C++.
Besides being able to do arithmetic operations on chars, which sometimes comes in handy (like c >= 'a' && c <= 'z'), I would say it is a design decision driven by the similar approach taken in other languages when Java was invented (primarily C and C++).
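For instance (ASCII-range letters only; production code should prefer Character.isLowerCase and Character.toUpperCase):

char c = 'q';
System.out.println(c >= 'a' && c <= 'z');    // true: the range check from above
System.out.println((char) (c - 'a' + 'A'));  // Q: arithmetic case conversion for ASCII letters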
The fact that Character does not extend Number (as other numeric primitive wrappers do) somehow indicates that Java designers tried to find some kind of a compromise between numeric and non-numeric nature of characters.
DISCLAIMER: I was not able to find any official docs about this.
In documentation of JNI function FindClass I can read about argument name:
name: a fully-qualified class name (...) The string is encoded in modified UTF-8.
According to the documentation, modified UTF-8 has to end with double '\0' chars:
the null character (char)0 is encoded using the two-byte format rather than the one-byte format
Does it mean that I should invoke FindClass from C in this way:
FindClass("java/lang/String\0")
i.e. with double '\0' at the end?
Character set, encoding and termination are three different things. Obviously, an encoding is designed for a specific character set but a character set can be encoded in multiple ways. And, often, a terminator (if used) is an encoded character, but with modified UTF-8, this is not the case.
Java uses the Unicode character set. For its String and char types, it uses the UTF-16 encoding. The String type is counted; it doesn't use a terminator.
In C, terminated strings are common, as are single-byte encodings of various character sets. C and C++ compilers terminate literal strings with the NUL character. In the destination character set encoding of the compiler, this is either one or two 0x00 bytes. Almost all common character sets and their encodings have the same byte representation for the non-control ASCII characters. This is true of the UTF-8 encoding of the Unicode character set. (But note that this is not true for characters outside of that limited subset.)
The JNI designers opted to exploit this limited "interoperability" with C strings. Many JNI functions accept 0x00-terminated modified UTF-8 strings. These are compatible with what a C compiler would produce from a literal string in the source code, again provided that the characters are limited to non-control ASCII characters. This covers the use case of writing Java package & class, method and field strings in JNI. (Well, almost: Java allows any Unicode currency symbol in an identifier.)
So, you can pass C string literals to JNI functions in a WYSIWYG style. There is no need to add a terminator; the compiler does that. The C compiler would encode extra '\0' characters as 0x00, so it wouldn't do any harm, but it isn't necessary.
There are a couple of modifications to the standard UTF-8 encoding. One is there to let C functions that expect a 0x00 terminator "handle" modified UTF-8 strings: the NUL character (U+0000) is encoded in two bytes to avoid the 0x00 byte that standard UTF-8 would use. That allows modified UTF-8 strings to be laid into a buffer with a 0x00 terminator beyond the bytes of the original encoded string. The other modification (supplementary characters are encoded as two 3-byte surrogate sequences instead of one 4-byte sequence) is a bit esoteric, but both modifications make a modified UTF-8 string incompatible with a strictly compliant UTF-8 function.
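You can observe that first modification from pure Java, because DataOutputStream.writeUTF also writes modified UTF-8 (a sketch; the output starts with a 2-byte length prefix):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

try (ByteArrayOutputStream buffer = new ByteArrayOutputStream();
     DataOutputStream out = new DataOutputStream(buffer)) {
    out.writeUTF("\u0000");                   // a single NUL character
    for (byte b : buffer.toByteArray()) {
        System.out.printf("%02x ", b);        // 00 02 c0 80 - length prefix, then 0xC0 0x80 for U+0000
    }
    System.out.println();
} catch (IOException e) {
    e.printStackTrace();
}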
You didn't ask, but there is another use of 0x00 terminated, modified UTF-8 strings in JNI. It is with the GetStringUTFChars and NewStringUTF functions. (The JNI documentation doesn't actually say that GetStringUTFChars returns a 0x00 terminated string, but there are no known JVM implementations that don't. Check your JVM implementor's documentation or source code.) These functions are designed on the same "interoperability" basis. However, the use cases are different, making them dangerous. They are generally used to pass Java strings between C functions. The C functions, generally, would have no idea what modified UTF-8 is, or possibly not even what UTF-8 or Unicode are. It is much more direct to use the Java String and Charset classes to convert to and from the character sets and encodings that the C functions are designed for. Often, it is a system setting, user setting, application setting or thread setting that determines which one a C function is using. The Java String class attempts to conform to such settings when not given a specific encoding for a conversion. But, in many cases, the desired encoding is fixed and can be specified with clear intent.
No, you don't encode the terminating zero; it is not part of the class name.
No, according to the first reference I found, it means it should be encoded like this:
FindClass("java/lang/String\xc0\x80");
// \xc0\x80 is not the shortest way to encode the code point U+0000, which is why it's "modified" UTF-8.
Note that this assumes that you're really looking for class names that end in U+0000, which is rather unlikely. The C string should be terminated just like normal, with a single 0-byte, as you get from just:
FindClass("java/lang/String");
The special 2-byte encoding of U+0000 provided by Modified UTF-8 only matters if you want to put U+0000 in a string, and still be able to differentiate it from the C terminator.
I couldn't find any documentation about this...
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
does anyone know what class to use?
I want to write a bunch of chars to a file and make sure that the file's size in bytes equals the number of chars.
Okay - so you need to pick an encoding which only uses a single byte per character, such as ISO-8859-1. Create a FileOutputStream, wrap it in an OutputStreamWriter specifying the encoding, and you're away. However, you need to be aware that you're limiting the range of characters which can be represented in your file.
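A minimal sketch of that (the file name is made up):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

try (Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), StandardCharsets.ISO_8859_1)) {
    writer.write("one byte per char");        // every char here maps to a single ISO-8859-1 byte
} catch (IOException e) {
    e.printStackTrace();
}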
Take a "Writer"; a Writer outputs chars:
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/FileWriter.html
http://docs.oracle.com/javase/1.4.2/docs/api/java/io/OutputStreamWriter.html
An OutputStream outputs bytes.
You may try to use another encoding. In that case you should supply a CharsetEncoder, as it has an onUnmappableCharacter method:
http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/charset/CharsetEncoder.html#onUnmappableCharacter%28java.nio.charset.CodingErrorAction%29
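A sketch of that approach; with CodingErrorAction.REPORT the encoder throws instead of silently substituting '?':

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    ByteBuffer bytes = encoder.encode(CharBuffer.wrap("abc\u20AC"));  // € cannot be encoded in ISO-8859-1
    System.out.println(bytes.remaining() + " bytes");
} catch (CharacterCodingException e) {
    System.out.println("unmappable character: " + e);                 // reported instead of written as '?'
}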
First figure out which kinds of chars you are going to be talking about.
In C a char is eight bits, even if you need two or more chars in sequence to represent one glyph, or in human-terms, one typed character. It gets worse, there are also glyphs that represent two "typed" characters, like the conjoined ff and ll glyphs you often see in typesetting.
If you are talking about C chars, then by definition every file contains the same number of chars as chars. If you are talking about any other meaning of the word character, then you need to make some choices.
Eight bit characters are guaranteed for the ASCII character set in UTF-8, which is by far the best character set to choose going forward, as it has explicit support in web protocols (thank you w3c!). This means that as long as you verify that every java char in your string is less than 128 (integer value), you are going to get one byte per char with UTF-8.
ISO-8859-1 is a character set which also uses only one byte per character. The downside to ISO-8859-1 is that it tends to not be the default character set of anything other than Microsoft systems. Even within the Microsoft realm, UTF-8 has been making a lot of headway.
The cost to convert between the two is not overly high, but the extensibility of the two differs dramatically. Basically, if you are using ISO-8859-1 and someone tells you that the next product must support language "X", then in some cases, you must first convert to a different character set and then add the language support. With UTF-8 such a need to convert to another character set prior to adding support is rare. I mean very rare, like so rare that you should consider just using images because the language is likely dead, is likely of historical interest only, and is likely to have been documented as a dialect from a lesser tribe on an island where the primary language has full support.