Unicode escape syntax in Java - java

In Java, I learned that the following syntax can be used for mentioning Unicode characters that are not on the keyboard (eg. non-ASCII characters):
(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)
My question is:
What is the purpose of (u)* in the above syntax?
One use case that I understood which represents Yen symbol in Java is:
char ch = '\u00A5';

Interesting question. Section 3.3 of the JLS says:
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u
UnicodeMarker u
which translates to \\u+\p{XDigit}{4}
and
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
So you're right, there can be one or more u after the backslash. The reason is given further down:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
So this input
\u0020ä
becomes
\uu0020\u00e4
The first uu means here "this was a unicode escape sequence to begin with" while the second u says "An automatic tool converted a non-ASCII character to a unicode escape."
This information is useful when you want to convert back from ASCII to unicode: You can restore as much of the original code as possible.

It means you can add as many u as you want - for example these lines are equivalent:
char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';
(and all compile)

Java supports only \uXXXX (4 hex chars) notation for Unicode characters in the BMP but doesn't support the \u{YYYYY} (5 hex chars) notation for characters outside the BMP (16 other planes). So it's impossible to represent them into a single constant char, you'll have to write them as a surrogate pair.
For example, if you want to write MATHEMATICAL BOLD CAPITAL A (U+1D400) you can't write "u\{1D400}" it's an illegal Unicode escape sequence in Java. Writing "u\1D400" is only doing "u\1D40" + "0" so it will output ᵀ0. No you really have to use surrogates in Java. So you have to write "\uD835\uDC00" instead.
But writing surrogates is not handy, so if you want to write them directly from a code point you can use one of those tricks:
String test1 = new String(new int[] { 0x1D400 }, 0, 1);
String test2 = String.valueOf(Character.toChars(0x1D400));
String test3 = Character.toString(0x1D400):

Related

In Java, how are Unicode chars and Java UTF-16 codepoints handled?

I'm struggling with Unicode characters in Java 10.
I'm using the java.text.BreakIterator package.
For this output:
myString="a𝓞b" hex=0061d835dcde0062
myString.length()=4
myString.codePointCount(0,s.length())=3
BreakIterator output:
a hex=0061
𝓞 hex=d835dcde
b hex=0062
Seems correct.
Using the same Java code, then with this output:
myString="G̲íl" hex=0047033200ed006c
myString.length()=4
myString.codePointCount(0,s.length())=4
BreakIterator output:
G̲ hex=00470332
í hex=00ed
l hex=006c
Seems correct too, EXCEPT for the codePointCount=4.
Why isn't it 3, and is there a means of getting
a 3 value without using BreakIterator?
My goal is to determine if all (output) chars of a string are
16-bit, or are surrogate or combining chars present?
"G̲íl" is four code points: U+0047, U+0332, U+00ED, U+006C.
U+0332 is a combining character, but it is a separate code point. That's not the same as your first example, which requires using a surrogate pair (2 UTF-16 code units) to represent U+1D4DE - but the latter is still a single code point.
BreakIterator finds boundaries in text - the two code points here that are combined don't have a boundary between them in that sense. From the documentation:
Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored.
So I think everything is working correctly here.
A codepoint corresponds to one Unicode character.
Java represents Unicode in UTF-16, i.e., in 16-bit units. Characters with codepoint values larger than U+FFFF are represented by a pair of 'surrogate characters', as in your first example. Thus the first result of 3.
In the second case, you have an example that is not a single Unicode character. It is one character, LETTER G, followed by another character COMBINING CHARACTER LOW LINE. That is two codepoints per the definition. Thus the second result of 4.
In general, Unicode has tables of character attributes (I'm not sure if I have the right word here) and it is possible to find out that one of your codepoints is a combining character.
Take a look at the Character class. getType(character) will tell you if a codepoint is a combining character or a surrogate.

How do I use high-order unicode characters in java?

How do I use unicode characters in Java, like the Negative Squared Latin Capital Letter E? Using "\u1F174" doesn't work as the \u escape only accepts 4 hex-digits.
You need to specify it as a surrogate pair - two UTF-16 code units.
For example, if you copy and paste the character into my Unicode explorer you can see that U+1F174 is represented in UTF-16 code units as U+D83C U+DD74. (You can work this out manually, of course.) So you could write it in a Java string literal as:
String text = "\uD83C\uDD74";
Other options include:
String text = new StringBuilder().appendCodePoint(0x1f174).toString();
String text = new String(new int[] { 0x1f174 }, 0, 1);
char[] chars = Character.toChars(0x1f174);
"\uD83C\uDD74"
Or indeed
"🅴"
Because Java characters represent UTF-16 units rather than actual Unicode characters, you need to represent it as a string, that will have the two UTF-16 surrogates.

Clarification on Java Language Specification

I read the following phrase in the Java language specification.
It is a compile-time error for the character following the SingleCharacter or
EscapeSequence to be other than a '.'
I am not able to understand what is the meaning of above line. Could someone please explain it with example.
What is says is basically: A compile time error will be generated for every character different than a ', that comes after the "character" itself. Where the "character" is the content in the form of a character (like: a, 0, \u0093) or an escape sequence (like: \\, \b, \n).
So, this will be wrong:
'aa', because the second a is not a single quote (').
'\\a', because the second character (the a) is not a single quote.
'a, because the character which comes after the "content" is not a quote (but probably a newline or a space).
Side note: This won't work either: char c = '\u0027';. Because that is the code point for a single quote, so it gets translated into: char c = ''';.
I guess this is about character literals. Another way to say this is: character literals must be enclosed by apostrophes, it is an error if you forget the second apostrophe.
Hence:
'a' // correct
'\007' // correct
'ab // wrong
In Java, you can define character variable as an escape sequences or single characters. Those should be surrounded by single quotes.
char ch = 'a';
// Unicode for uppercase Greek omega character
char uniChar = '\u039A';
More information and examples can be found in Java tutorial on Characters.

Why can some ASCII characters not be expressed in the form '\uXXXX' in Java source code?

I stumbled over this (again) today:
class Test {
char ok = '\n';
char okAsWell = '\u000B';
char error = '\u000A';
}
It does not compile:
Invalid character constant in line 4.
The compiler seems to insist that I write '\n' instead. I see no reason for this, yet it's very annoying.
Is there a logical explanation why characters that have a special notation (like \t, \n, \r) must be expressed in that form in Java source?
Unicode characters are replaced by their value, so your line is replaced by the compiler with:
char error = '
';
which is not a valid Java statement.
This is dictated by the Language Specification:
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
This can lead to surprising stuff, for example, this is a valid Java program (it contains hidden unicode characters) - courtesy of Peter Lawrey:
public static void main(String[] args) {
for (char c‮h = 0; c‮h < Character.MAX_VALUE; c‮h++) {
if (Character.isJavaIdentifierPart(c‮h) && !Character.isJavaIdentifierStart(c‮h)) {
System.out.printf("%04x <%s>%n", (int) c‮h, "" + c‮h);
}
}
}
Unicode escape sequences like \u000a are replaced by the actual characters they represent before the Java compiler does anything else with the source code. And so, your program eventually ends up at
char ch = '
';
So the \u000a in your source code is replaced internally by a linefeed character. Note that this happens before the compiler actually reads and interprets your source code.
Referring to the Java Language Specification:
It is a compile-time error for a line terminator (§3.4) to appear after the opening ' and before the closing '.
And as well all know by heart, \n is a line terminator, quoting:
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character
Other symbols that could cause problems are \, ' and " for example.
I think the reason is that \uXXXX sequences are expanded when the code is being parsed, see JLS §3.2. Lexical Translations.
It is described in 3.3. Unicode Escapes http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html. Javac first finds \uxxxx sequences in .java and replaces them with real characters then compiles. In case of
char error = '\u000A';
\u000A will be replace with newline character code (10) and the actual text will be
char error = '
';
Because the compiler treats them the same as unescaped text.
This is valid code:
class \u00C9 {}

How to encode a string to replace all special characters

I have a string which contains special character. But I have to convert the string into a string without having any special character so I used Base64 But in Base64 we are using equals to symbol (=) which is a special character. But I want to convert the string into a string which will have only alphanumerical letters. Also I can't remove special character only i have to replace all the special characters to maintain unique between two different strings. How to achieve this, Which encoding will help me to achieve this?
The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)
There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...
Commons codec has a url safe version of base64, which emits - and _ instead of + and / characters
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])
The easiest way would be to use a regular expression to match all nonalphanumeric characters and replace them with an empty string.
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding any special characters to the above regex will skip that character.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
If you truly can only use alphanumerical characters you will have to come up with an escaping scheme that uses one of those chars for example, use 0 as the escape, and then encode the special char as a 2 char hex encoding of the ascii. Use 000 to mean 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e

Categories