How does Java execute lexical translation? - java

In the Java Spec, I read that
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
Does this mean the lexical translation is only applied to ASCII characters? When I tried to write code with Cyrillic, Hebrew, or Kanji characters, there were no compile-time errors, even though these characters are not ASCII.
I don't understand why. Can anyone help me understand?

The quote doesn't say anything about what happens if you write a program containing a Cyrillic/Hebrew letter. In fact, the section just before the one you quoted says:
3.1 Unicode
Programs are written using the Unicode character set.
Note that "allows" here means that this translation step adds a new capability to Java. When you are allowed to do something, you can, but are not required to do it.
The quote merely says that the lexical translator will turn anything of the form \uxxxx to the corresponding Unicode character U+xxxx.
The natural consequence of this is that you can write a program containing any Unicode code point (i.e. "any program") using only an ASCII keyboard. How? Whenever you need to write a non-ASCII character, just write its Unicode escape.
As a concrete example:
These are valid Java statements:
int Д = 0;
System.out.println("Д");
But let's say my text editor can only handle ASCII text, or that I only have a US keyboard, so I can't type "Д". The language spec says that I can still write this in ASCII, like this:
int \u0414 = 0;
System.out.println("\u0414");
It will do exactly the same thing.
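As a quick, self-contained sketch of that equivalence (the class name EscapeDemo is just an illustrative choice), the variable can even be declared with the escape and read back with the literal character, because both spellings denote the same identifier after translation:
public class EscapeDemo {
    public static void main(String[] args) {
        int \u0414 = 42;               // declares a variable literally named Д
        System.out.println(Д);         // reads the same variable, spelled without the escape
        System.out.println("\u0414");  // prints Д, just like "Д" would
    }
}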

Related

When I assign a char (from a literal or otherwise), what does "Java's internal encoding is UTF-16" mean here? In what encoding is the value stored in the char?

// non-UTF source file encoding
char ch = 'ё'; // some number within 0..65535 is stored in the char
System.out.println(ch); // the same number is output to the console
"java internal encoding is UTF16". Where does it meanfully come to play in that?
Besides, I can perfectly well put into a char one UTF-16 code unit from the surrogate range (say '\uD800'), making that char invalid Unicode on its own. And let us stay within the BMP, to avoid thinking that we might have 2 chars (code units) for a supplementary symbol (thinking that way makes "char internally uses UTF-16" sound to me like complete nonsense). But maybe "char internally uses UTF-16" makes sense within the BMP?
I could understand it if it were like this: my source code file is in windows-1251 encoding, the char literal is converted to a number according to windows-1251 (which really happens), and then this number is automatically converted to another number (from the windows-1251 number to the UTF-16 number). That second conversion is NOT taking place (am I right?!), although it is what I could understand as "internally uses UTF-16". Then the stored number is written out (really it is written as given, as from win-1251, with none of my imaginary conversion from "internal UTF-16" to the output/console encoding taking place), and the console shows it by converting the number to a glyph using the console encoding (which really happens).
So this "UTF16 encoding used internally" is NEVER USED ANYHOW ??? char just stores any number (in [0..65535]), and besides specific range and being "unsigned" has NO DIFFERENCE FROM int (in scope of my example of course)???
P.S. Experimentally, the code above with UTF-8 encoding of the source file and the console outputs
й
1081
while with win-1251 encoding of the source file and UTF-8 in the console it outputs
�
65533
Same output if we use String instead of char...
String s = "й";
System.out.println(s);
In the API, methods taking a char as an argument usually never take an encoding as an argument, whereas methods taking a byte[] as an argument often take an encoding as another argument. The implication is that with a char we don't need an encoding (meaning that we know the encoding for sure). But how on earth do we know in what encoding something was put into a char?
If a char is just storage for a number, don't we need to know what encoding that number originally came from?
So is the difference between char and byte just that a char holds two bytes of something in an UNKNOWN encoding (instead of one byte in an unknown encoding for a byte)?
Given some initialized char variable, we don't know what encoding to use to display it correctly (i.e. what console encoding to choose for output), and we cannot tell what the encoding of the source file was where it was initialized from a char literal (not counting cases where the various encodings and UTF happen to be compatible).
Am I right, or am I a big idiot? Sorry for asking in the latter case :)))
My research on SO shows no direct answer to my question:
In what encoding is a Java char stored in?
What encoding is used when I type a character?
To which character encoding (Unicode version) set does a char object correspond?
In most cases it is best to think of a char just as a certain character (independent of any encoding), e.g. the character 'A', and not as a 16-bit value in some encoding. Only when you convert between char or a String and a sequence of bytes does the encoding play a role.
The fact that a char is internally encoded as UTF-16 is only important if you have to deal with its numeric value.
Surrogate pairs are only meaningful in a character sequence. A single char can not hold a character value outside the BMP. This is where the character abstraction breaks down.
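A quick sketch of that last point, assuming the source file is saved and compiled as UTF-8: the same String only acquires a byte-level encoding once you convert it to bytes.
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = "й";                                            // U+0439, numeric value 1081
byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);          // two bytes: 0xD0 0xB9
byte[] win = s.getBytes(Charset.forName("windows-1251"));  // one byte:  0xE9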
Unicode is a system for expressing textual data as code points. These are typically characters, but not always. A Unicode code point is always represented in some encoding. The common ones are UTF-8, UTF-16 and UTF-32, where the number indicates the number of bits in a code unit. (For example, UTF-8 is encoded as 8-bit bytes, and UTF-16 is encoded as 16-bit words.)
While the first version of Unicode only allowed code points in the range U+0000 to U+FFFF, in Unicode 2.0 they changed the range to U+0000 to U+10FFFF.
So, clearly, a Java (16 bit) char is no longer big enough to represent every Unicode code point.
This brings us back to UTF-16. A Java char can represent Unicode code points that are less than or equal to U+FFFF. For larger code points, the UTF-16 representation consists of 2 16-bit values: a so-called surrogate pair. And that will fit into 2 Java chars. So in fact, the standard representation of a Java String is a sequence of char values that constitute the UTF-16 representation of the Unicode code points.
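A tiny sketch of the surrogate-pair case, using a code point outside the BMP (U+1F600 is just an arbitrary example):
char[] units = Character.toChars(0x1F600);                 // a code point above U+FFFF
System.out.println(units.length);                          // 2 -- a surrogate pair
System.out.println(Character.isHighSurrogate(units[0]));   // true
System.out.println(Character.isLowSurrogate(units[1]));    // true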
If we are working with most modern languages (including CJK with simplified characters), the Unicode code points of interest are all found in code plane zero (U+0000 through U+FFFF). If you can make that assumption, then it is possible to treat a char as a Unicode code point. However, increasingly we are seeing code points in higher planes; a common case is the code points for emojis.
If you look at the javadoc for the String class, you will see a bunch of methods like codePointAt, codePointCount and so on. These allow you to handle text data properly; that is, to deal with the surrogate pair cases.
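For example, here is a rough sketch of the difference between counting chars and counting code points (the sample string is arbitrary):
String s = "A\uD83D\uDE00";                           // "A" followed by U+1F600
System.out.println(s.length());                       // 3 -- char (code unit) count
System.out.println(s.codePointCount(0, s.length()));  // 2 -- code point count
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));  // U+0041, U+1F600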
So how does this relate to UTF-8, windows-1251 and so on?
Well, these are 8-bit character encodings that are used at the OS level, in text files and so on. When you read a file using a Java Reader, your text is effectively transcoded from UTF-8 (or windows-1251) into UTF-16. When you write characters out (using a Writer), you transcode in the other direction.
This doesn't always work.
Many character encodings such as windows-1251 are not capable of representing the full range of Unicode code points. So, if you attempt to write (say) a CJK character via a Writer configured with windows-1251, you will get ? characters instead.
If you read an encoded file using the wrong character encoding (for example, if you attempt to read a UTF-8 file as windows-1251, or vice versa), then the transcoding is liable to give garbage. This phenomenon is so common it has a name: mojibake.
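As a rough sketch of that transcoding step (the file names and the windows-1251 source encoding are just assumptions for the example):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical file names, for illustration only.
try (BufferedReader reader = Files.newBufferedReader(
             Paths.get("input-win1251.txt"), Charset.forName("windows-1251"));
     BufferedWriter writer = Files.newBufferedWriter(
             Paths.get("output-utf8.txt"), StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {   // bytes -> UTF-16 chars in memory
        writer.write(line);                        // UTF-16 chars -> UTF-8 bytes on disk
        writer.newLine();
    }
} catch (IOException e) {
    e.printStackTrace();
}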
You asked:
Does that mean that in char ch = 'й'; the literal 'й' is always converted to UTF-16 from whatever encoding the source file was in?
Now we are (presumably) talking about Java source code. The answer is that it depends. Basically, you need to make sure that the Java compiler uses the correct encoding to read the source file. This is typically specified using the -encoding command line option. (If you don't specify the -encoding then the "platform default converter" is used; see the javac manual entry.)
Assuming that you compile your source code with the correct encoding (i.e. matching the actual representation in the source file), the Java compiler will emit code containing the correct UTF-16 representation of any String literals.
However, note that this is independent of the character encoding that your application uses to read and write files at runtime. That encoding is determined by what your application selects or the execution platform's default encoding.
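If in doubt, the runtime default can be inspected directly; a quick sketch:
import java.nio.charset.Charset;

System.out.println(Charset.defaultCharset());             // e.g. UTF-8 or windows-1251
System.out.println(System.getProperty("file.encoding"));  // the property it has traditionally been derived from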

String that cannot be represented in UTF-8

I am creating a set of tests for the size of a String. To do so, I am using something like myString.getBytes("UTF-8").length > MAX_SIZE, for which Java has the checked exception UnsupportedEncodingException.
Just out of curiosity, and to further consider other possible test scenarios: is there any text that cannot be represented by the UTF-8 character encoding?
BTW: I did my homework, but nowhere (that I can find) does it say that UTF-8/Unicode indeed includes ALL possible characters. I know that its size is 2^32 and many of them are still empty, but the question remains.
The official FAQ from the Unicode Consortium is pretty clear on the matter, and is a great source of information on all questions related to UTF-8, UTF-16, etc.
In particular, notice the following quote (emphasis mine):
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an
algorithmic mapping from every Unicode code point (except surrogate
code points) to a unique byte sequence. The ISO/IEC 10646 standard
uses the term “UCS transformation format” for UTF; the two terms are
merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round
tripping: mapping from any Unicode coded character sequence S to a
sequence of bytes and back will produce S again. To ensure round
tripping, a UTF mapping must map all code points (except surrogate
code points) to unique byte sequences. This includes reserved
(unassigned) code points and the 66 noncharacters (including U+FFFE
and U+FFFF).
So, as you can see, by definition, all UTF encodings (including UTF-8) must be able to handle all Unicode code points (except the surrogate code points of course, but they are not real characters anyways).
Additionally, here is a quote directly from the Unicode Standard that also talks about this:
The Unicode Standard supports three character encoding forms: UTF-32,
UTF-16, and UTF-8. Each encoding form maps the Unicode code points
U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
As you can see, the specified range of characters covers the whole assigned Unicode range (excluding the surrogate character range of course).
is there any text that cannot be represented by the UTF-8 character encoding?
Java strings use UTF-16, and standard UTF-8 is designed to handle every Unicode codepoint that UTF-16 can handle (and then some).
However, do be careful, because Java also uses a Modified UTF-8 in some areas, and that does have some differences/limitations from standard UTF-8.
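As a side note on the checked UnsupportedEncodingException mentioned in the question, passing a Charset constant instead of a charset name avoids it entirely; a small sketch (MAX_SIZE and exceedsLimit are hypothetical names):
import java.nio.charset.StandardCharsets;

static final int MAX_SIZE = 1024;  // hypothetical limit, for illustration only

static boolean exceedsLimit(String myString) {
    // getBytes(Charset) declares no checked exception, unlike getBytes(String name)
    return myString.getBytes(StandardCharsets.UTF_8).length > MAX_SIZE;
}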

Regex for matching Unicode pattern

I am trying to validate a file's content when it is uploaded, and I am stuck on the Unicode encoding. I am not interested in finding Unicode special characters that are not in the ASCII range. I am trying to find whether the content of the file contains at least one Unicode escape pattern, like \u0046 for example.
For example, I exclude any file that contains the word 'script', but what if the file contains this word written with Unicode escapes? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?
So, as far as I have searched on the Internet, I've seen Unicode characters written like \u0046, or like U+0046. Based on this, I have written the following regex:
(\\u|U\+)....
This means \u or U+ followed by four characters. This pattern accomplishes what I want, but I wonder if there are other ways to write a Unicode character. Is it always \u or U+? Can there be more or fewer than 4 characters after \u or U+?
Thanks
The notation U+ followed by any number of hex digits belongs to Unicode itself and will not be functional anywhere in code. In Java source code and *.properties files, \u followed by four hex digits is a UTF-16 Unicode escape that is automatically parsed.
The pattern to search for that is:
"\\\\u[0-9A-Fa-f]{4}"
Or a String.contains on:
"\\u"
In languages other than Java, \Uxxxxxx (six hex chars) is possible, covering the full UTF-32 range. Unfortunately, up to Java 8 at least, this is not so.
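A rough sketch of using that pattern (the sample content is made up):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern escapePattern = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");
String content = "alert('\\u0046oo')";   // sample content holding a textual backslash-u escape
Matcher m = escapePattern.matcher(content);
while (m.find()) {
    System.out.println("found escape: " + m.group());
}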

How does javac process Unicode glyphs?

I tried System.out.println("ñ"); and it prints ñ. Why didn't javac raise an error?
Javac can be configured to have a source file encoding. That way, you can use character literals (and symbol names!) with non-ASCII characters.
If that matches what the file encoding actually is, all works well.
If not, you may get an error, but more likely, just some broken strings.
In order to print the text back again, the program needs to know which encoding to use when printing as well. All this needs to be configured correctly (the defaults in Java are not portable), otherwise you can get all kinds of broken text output.
Java char and String are natively in UTF-16. It can handle 'ñ' and "ñ".
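A minimal sketch of that, assuming the source file is compiled with an encoding that matches how it was saved:
char c = 'ñ';                      // U+00F1, one UTF-16 code unit
System.out.println((int) c);       // 241
System.out.println("ñ".length());  // 1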
JLS-3.1. Unicode says (in part),
The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.
That is expanded on by JLS-3.2. Lexical Structure which explains,
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).
A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).

In what encoding is a Java char stored in?

Is the Java char type guaranteed to be stored in any particular encoding?
Edit: I phrased this question incorrectly. What I meant to ask is are char literals guaranteed to use any particular encoding?
"Stored" where? All Strings in Java are represented in UTF-16. When written to a file, sent across a network, or whatever else, it's sent using whatever character encoding you specify.
Edit: Specifically for the char type, see the Character docs. Specifically: "The char data type ... are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities." Therefore, casting char to int will always give you a UTF-16 value if the char actually contains a character from that charset. If you just poked some random value into the char, it obviously won't necessarily be a valid UTF-16 character, and likewise if you read the character in using a bad encoding. The docs go on to discuss how the supplementary UTF-16 characters can only be represented by an int, since char doesn't have enough space to hold them, and if you're operating at this level, it might be important to get familiar with those semantics.
A Java char is conventionally used to hold a Unicode code unit; i.e. a 16 bit unit that is part of a valid UTF-16 sequence. However, there is nothing to prevent an application from putting any 16 bit unsigned value into a char, irrespective of what it actually means.
So you could say that a Unicode code unit can be represented by a char and a char can represent a Unicode code unit ... but neither of these is necessarily true, in the general case.
Your question about how a Java char is stored cannot be answered. Simply said, it depends on what you mean by "stored":
If you mean "represented in an executing program", then the answer is JVM implementation specific. (The char data type is typically represented as a 16 bit machine integer, though it may or may not be machine word aligned, depending on the specific context.)
If you mean "stored in a file" or something like that, then the answer is entirely dependent on how the application chooses to store it.
Is the Java char type guaranteed to be stored in any particular encoding?
In the light of what I said above the answer is "No". In an executing application, it is up to the application to decide what a char means / contains. When a char is stored to a file, the application decides how it wants to store it and what on-disk representation it will use.
FOLLOWUP
What about char literals? For example, 'c' must have some value that is defined by the language.
Java source code is required (by the language spec) to be Unicode text, represented in some character encoding that the tool chain understands; see the javac -encoding option. In theory, a character encoding could map the c in 'c' in your source code to something unexpected.
In practice though, the c will map to the Unicode lower-case C code-point (U+0063) and will be represented as the 16-bit unsigned value 0x0063.
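A quick sketch of that mapping:
char c = 'c';
System.out.println((int) c);             // 99
System.out.printf("U+%04X%n", (int) c);  // U+0063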
To the extent that char literals have a meaning ascribed by the Java language, they represent (and are represented as) UTF-16 code units. Note that they may or may not be assigned Unicode code points ("characters"). Some Unicode code points in the range U+0000 to U+FFFF are unassigned.
Originally, Java used UCS-2 internally; now it uses UTF-16. The two are virtually identical, except for D800 - DFFF, which are used in UTF-16 as part of the extended representation for larger characters.
