How do I use unicode characters in Java, like the Negative Squared Latin Capital Letter E? Using "\u1F174" doesn't work as the \u escape only accepts 4 hex-digits.
You need to specify it as a surrogate pair - two UTF-16 code units.
For example, if you copy and paste the character into my Unicode explorer you can see that U+1F174 is represented in UTF-16 code units as U+D83C U+DD74. (You can work this out manually, of course.) So you could write it in a Java string literal as:
String text = "\uD83C\uDD74";
Other options include:
String text = new StringBuilder().appendCodePoint(0x1f174).toString();
String text = new String(new int[] { 0x1f174 }, 0, 1);
char[] chars = Character.toChars(0x1f174);
"\uD83C\uDD74"
Or indeed
"🅴"
Because a Java char is a UTF-16 code unit rather than a full Unicode character, you need to represent this character as a string containing the two UTF-16 surrogates.
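A quick way to convince yourself that the pair really encodes U+1F174 (a small sketch):
String text = "\uD83C\uDD74";
System.out.println(text.length());                            // 2 UTF-16 code units
System.out.println(text.codePointCount(0, text.length()));    // 1 code point
System.out.println(Integer.toHexString(text.codePointAt(0))); // 1f174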
Related
I have a Java program that takes in a string and escapes it so that it can be safely passed to a program in bash. The strategy is basically to escape any of the special characters mentioned here and wrap the result in double quotes.
The algorithm is pretty simple -- just loop over the input string and use input.charAt(i) to check whether the current character needs to be escaped.
This strategy works quite well for characters that aren't represented by surrogate pairs, but I have some concerns if non-latin characters or something like an emoji is embedded in the string. In that case, if we assumed that an emoji was the first character in my input string, input.charAt(0) would give me the first code unit while input.charAt(1) would return the second code unit. My concern is that some of these code units might be interpreted as one of the special characters that need to be escaped. If that happened, I'd try to escape one of the code units which would irrevocably garble the input.
Is such a thing possible? Or is it safe to use input.charAt(i) for something like this?
From the Java docs:
The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
From the UTF-16 Wikipedia page:
U+D800 to U+DFFF: The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.
From the charAt javadoc:
Returns the char value at the specified index. An index ranges from 0 to length() - 1. The first char value of the sequence is at index 0, the next at index 1, and so on, as for array indexing. If the char value specified by the index is a surrogate, the surrogate value is returned.
There is no overlap between the surrogate code unit range (U+D800-U+DFFF) and the range where my special characters ($, `, \ etc.) live, as they're all plain ASCII characters (i.e. they're all mapped below 128).
Therefore, if I scan through a string that contains, say, an emoji (which definitely lies outside the Basic Multilingual Plane and is therefore encoded as a surrogate pair), I won't mistake either half of the pair for a special character. Here's a simple test program:
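(A sketch of what such a test could look like; the "special" characters checked here are just an assumed sample of bash metacharacters.)
String special = "$`\\\"!";                  // assumed sample of characters needing escaping
String input = "\uD83D\uDE00 echo $HOME";    // starts with an emoji (U+1F600)
for (int i = 0; i < input.length(); i++) {
    char c = input.charAt(i);
    System.out.printf("index %d: U+%04X surrogate=%b special=%b%n",
            i, (int) c, Character.isSurrogate(c), special.indexOf(c) >= 0);
}
// Neither surrogate code unit (U+D83D, U+DE00) ever tests as "special", because surrogates
// lie in U+D800..U+DFFF while the special characters are all below U+0080.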
I am looking for a way to detect if a character in a java string "is a combining character" or not. For instance,
String khmerCombiningVowel =
        new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // U+17C0
represents a combining Khmer vowel sign. I have tried the "\\p{InCombiningDiacriticalMarks}" regex, but it doesn't seem to match these particular combining characters. Alternatively, if there is some comprehensive list of all Unicode combining character blocks, I could build a regex from them?
According to Algorithm to check for combining characters in Unicode, there are a number of blocks for combining characters.
Java has a number of helpful functions; try:
String codePointStr = new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // U+17C0
System.out.println(codePointStr.matches("\\p{Mc}"));
System.out.println(
        Character.COMBINING_SPACING_MARK == Character.getType(codePointStr.codePointAt(0)));
(prints true in both cases)
In this case, the COMBINING_SPACING_MARK (and related regex \p{gc=Mc}) both refer to the Unicode category "Mark, Spacing Combining" which is basically any character that combines with a previous character while also adding width.
Other regular expressions that may be useful: \p{M} for any kind of mark. If you want to use the Character.getType() constants, you can get the same behavior by checking whether the type is COMBINING_SPACING_MARK, ENCLOSING_MARK, or NON_SPACING_MARK.
ENCLOSING_MARK is a surrounding mark, like an enclosing circle; it also adds width to the character it combines with.
NON_SPACING_MARK includes the Latin alphabet diacritical combining marks, etc. (Marks that basically go on top or below, and don't add any width to the character).
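Putting these together, a small helper along these lines (a sketch, not a standard API) treats a code point as combining if it falls into any of the three mark categories:
static boolean isCombining(int codePoint) {
    int type = Character.getType(codePoint);
    return type == Character.NON_SPACING_MARK         // Mn: no extra width
        || type == Character.COMBINING_SPACING_MARK   // Mc: adds width
        || type == Character.ENCLOSING_MARK;          // Me: surrounds the base character
}
// e.g. isCombining(0x17C0) is true (the Khmer vowel sign above, category Mc),
// while isCombining('a') is false. The regex equivalent on a single-code-point
// string is str.matches("\\p{M}").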
I need to print a unicode literal string as an equivalent unicode character.
System.out.println("\u00A5"); // prints ¥
System.out.println("\\u"+"00A5"); //prints \u0045 I need to print it as ¥
How can I evaluate this string as a Unicode character?
As an alternative to the other options here, you could use:
int codepoint = 0x00A5; // Generate this however you want, maybe with Integer.parseInt
String s = String.valueOf(Character.toChars(codepoint));
This would have the advantage over other proposed techniques in that it would also work with Unicode codepoints outside of the basic multilingual plane.
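For example (the code points here are just for illustration), the same code handles a BMP code point and one outside the BMP:
int yen = Integer.parseInt("00A5", 16);    // U+00A5, inside the BMP
int boldA = Integer.parseInt("1D400", 16); // U+1D400, outside the BMP
System.out.println(String.valueOf(Character.toChars(yen)));   // ¥
System.out.println(String.valueOf(Character.toChars(boldA))); // 𝐀, stored as a surrogate pair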
If you have a string:
System.out.println((char)(Integer.parseInt("00A5",16)));
probably works (haven't tested it)
Convert it to a character.
System.out.println((char) 0x00A5);
This will of course not work for very high code points; those may require two chars (a surrogate pair).
In Java, I learned that the following syntax can be used to write Unicode characters that are not on the keyboard (e.g. non-ASCII characters):
(\u)(u)*(HexDigit)(HexDigit)(HexDigit)(HexDigit)
My question is:
What is the purpose of (u)* in the above syntax?
One use case that I understood which represents Yen symbol in Java is:
char ch = '\u00A5';
Interesting question. Section 3.3 of the JLS says:
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u
UnicodeMarker u
which translates to \\u+\p{XDigit}{4}
and
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
So you're right, there can be one or more u after the backslash. The reason is given further down:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
So this input
\u0020ä
becomes
\uu0020\u00e4
The uu in the first escape means "this was a Unicode escape sequence to begin with", while the single u in the second says "an automatic tool converted a non-ASCII character to a Unicode escape."
This information is useful when you want to convert back from ASCII to unicode: You can restore as much of the original code as possible.
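A rough sketch of that transformation (not the official algorithm; in particular it ignores the JLS rule about "eligible" backslashes, i.e. backslashes that are themselves escaped):
static String toAsciiForm(String source) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        char c = source.charAt(i);
        if (c == '\\' && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
            // Existing Unicode escape: emit the backslash plus one extra 'u';
            // the original 'u's and hex digits are copied by later iterations.
            out.append("\\u");
        } else if (c > 0x7F) {
            // Non-ASCII character: replace it with a single-'u' escape.
            out.append(String.format("\\u%04x", (int) c));
        } else {
            out.append(c);
        }
    }
    return out.toString();
}
// Given the source text \u0020ä (seven characters: backslash, u, 0, 0, 2, 0, ä),
// this returns \uu0020\u00e4, matching the example above.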
It means you can add as many u's as you want - for example these lines are equivalent:
char ch = '\u00A5';
char ch = '\uuuuu00A5';
char ch = '\uuuuuuuuuuuuuuuuuu00A5';
(and all compile)
Java supports only the \uXXXX (4 hex digit) notation for Unicode characters in the BMP, and doesn't support the \u{YYYYY} (5 hex digit) notation for characters outside the BMP (the 16 other planes). So it's impossible to represent them as a single char constant; you'll have to write them as a surrogate pair.
For example, if you want to write MATHEMATICAL BOLD CAPITAL A (U+1D400), you can't write "\u{1D400}": it's an illegal Unicode escape sequence in Java. Writing "\u1D400" is really "\u1D40" + "0", so it will output ᵀ0 (U+1D40 is MODIFIER LETTER CAPITAL T, followed by a literal zero). No, you really have to use surrogates in Java, so you have to write "\uD835\uDC00" instead.
But writing surrogates is not handy, so if you want to write them directly from a code point you can use one of those tricks:
String test1 = new String(new int[] { 0x1D400 }, 0, 1);
String test2 = String.valueOf(Character.toChars(0x1D400));
String test3 = Character.toString(0x1D400); // Java 11+
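Quick sanity check (a sketch; note that Character.toString(int) needs Java 11 or later):
System.out.println(test1.equals("\uD835\uDC00"));               // true
System.out.println(test1.equals(test2) && test2.equals(test3)); // true
System.out.println(test1.codePointAt(0) == 0x1D400);            // true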
I have a string which contains special characters, and I have to convert it into a string without any special characters. I used Base64, but Base64 uses the equals symbol (=), which is a special character. I want to convert the string into one containing only alphanumeric characters. Also, I can't simply remove the special characters; I have to replace them in a way that keeps two different input strings distinct. How do I achieve this, and which encoding will help me do it?
The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)
There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...
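A minimal sketch of the hex approach (UTF-8 bytes written as two hex digits each, so the output is purely alphanumeric and reversible):
import java.nio.charset.StandardCharsets;

static String toHex(String text) {
    StringBuilder sb = new StringBuilder();
    for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
        sb.append(String.format("%02x", b & 0xff)); // two lowercase hex digits per byte
    }
    return sb.toString();
}

static String fromHex(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
        bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return new String(bytes, StandardCharsets.UTF_8);
}
// toHex("a¥") is "61c2a5", and fromHex("61c2a5") gives back "a¥".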
Commons Codec has a URL-safe version of Base64, which emits - and _ instead of the + and / characters:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])
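Rough usage (a sketch; double-check the method names against the commons-codec version you're using):
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;

byte[] data = "some text with special characters".getBytes(StandardCharsets.UTF_8);
String urlSafe = new String(Base64.encodeBase64URLSafe(data), StandardCharsets.US_ASCII);
System.out.println(urlSafe); // '-' and '_' instead of '+' and '/', and no '=' padding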
The easiest way would be to use a regular expression to match all nonalphanumeric characters and replace them with an empty string.
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding any special character to the character class in the regex above will exempt that character from removal.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
/[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
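The examples above are JavaScript; the same idea in Java (a sketch, assuming stringToReplace is a String) looks like:
String cleaned = stringToReplace.replaceAll("[^\\w\\s]", "");
// replaceAll already replaces every match, so no /g flag is needed.
// As in the second example, add characters inside the class to keep them, e.g. the period:
String cleanedKeepingPeriods = stringToReplace.replaceAll("[^\\w\\s.]", "");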
If you truly can only use alphanumerical characters, you will have to come up with an escaping scheme that uses one of those chars. For example, use 0 as the escape character, then encode each special char as a two-character hex encoding of its ASCII value, and use 000 to mean a literal 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e
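A sketch of that scheme in Java (my reading of the rules above: '0' is the escape character, any other non-alphanumeric ASCII character becomes '0' plus two hex digits of its value, and a literal '0' becomes 000):
static String encode(String input) {
    StringBuilder sb = new StringBuilder();
    for (char c : input.toCharArray()) {
        if (c == '0') {
            sb.append("000");                    // literal zero
        } else if ((c >= '1' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            sb.append(c);                        // plain alphanumeric, kept as-is
        } else {
            // assumes c <= 0xFF; wider characters would need a longer escape
            sb.append('0').append(String.format("%02x", (int) c));
        }
    }
    return sb.toString();
}
// encode("This is my special sentence with a 0.") returns
// "This020is020my020special020sentence020with020a02000002e", matching the example above.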