Print unicode literal string as Unicode character - java

I need to print a unicode literal string as an equivalent unicode character.
System.out.println("\u00A5"); // prints ¥
System.out.println("\\u"+"00A5"); //prints \u0045 I need to print it as ¥
How can I evaluate this string as a Unicode character?

As an alternative to the other options here, you could use:
int codepoint = 0x00A5; // Generate this however you want, maybe with Integer.parseInt
String s = String.valueOf(Character.toChars(codepoint));
This would have the advantage over other proposed techniques in that it would also work with Unicode codepoints outside of the basic multilingual plane.
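For instance, a minimal sketch along those lines (variable names are my own), parsing the question's hex string:
int codepoint = Integer.parseInt("00A5", 16);             // 0xA5
String s = String.valueOf(Character.toChars(codepoint));
System.out.println(s);                                     // prints ¥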

If you have a string:
System.out.println((char)(Integer.parseInt("00A5",16)));
probably works (haven't tested it)

Convert it to a character.
System.out.println((char) 0x00A5);
This will of course not work for very high code points; those may require two "characters" (a surrogate pair).


In Java, how are Unicode chars and Java UTF-16 codepoints handled?

I'm struggling with Unicode characters in Java 10.
I'm using the java.text.BreakIterator class.
For this output:
myString="a𝓞b" hex=0061d835dcde0062
myString.length()=4
myString.codePointCount(0,s.length())=3
BreakIterator output:
a hex=0061
𝓞 hex=d835dcde
b hex=0062
Seems correct.
Running the same Java code against another string gives this output:
myString="G̲íl" hex=0047033200ed006c
myString.length()=4
myString.codePointCount(0,s.length())=4
BreakIterator output:
G̲ hex=00470332
í hex=00ed
l hex=006c
Seems correct too, EXCEPT for the codePointCount=4.
Why isn't it 3, and is there a means of getting a value of 3 without using BreakIterator?
My goal is to determine whether all (output) chars of a string are 16-bit, or whether surrogate or combining chars are present.
"G̲íl" is four code points: U+0047, U+0332, U+00ED, U+006C.
U+0332 is a combining character, but it is a separate code point. That's not the same as your first example, which requires using a surrogate pair (2 UTF-16 code units) to represent U+1D4DE - but the latter is still a single code point.
BreakIterator finds boundaries in text - the two code points here that are combined don't have a boundary between them in that sense. From the documentation:
Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored.
So I think everything is working correctly here.
A codepoint corresponds to one Unicode character.
Java represents Unicode in UTF-16, i.e., in 16-bit units. Characters with codepoint values larger than U+FFFF are represented by a pair of 'surrogate characters', as in your first example. Thus the first result of 3.
In the second case, you have an example that is not a single Unicode character. It is one character, LATIN CAPITAL LETTER G, followed by another character, COMBINING LOW LINE. That is two codepoints per the definition. Thus the second result of 4.
In general, Unicode has tables of character attributes (I'm not sure if I have the right word here) and it is possible to find out that one of your codepoints is a combining character.
Take a look at the Character class. getType(character) will tell you if a codepoint is a combining character or a surrogate.
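For example, a small sketch (not part of the original answer) that walks a string's code points and reports which ones need a surrogate pair and which are combining marks:
static void classify(String s) {
    s.codePoints().forEach(cp -> {
        int type = Character.getType(cp);
        boolean needsSurrogatePair = Character.isSupplementaryCodePoint(cp); // stored as 2 chars in UTF-16
        boolean combining = type == Character.COMBINING_SPACING_MARK
                || type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        System.out.printf("U+%04X surrogatePair=%b combining=%b%n", cp, needsSurrogatePair, combining);
    });
}
Running it on "G̲íl" flags U+0332 as combining; running it on "a𝓞b" flags U+1D4DE as needing a surrogate pair.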

Java - Regex Replace All will not replace matched text

Trying to remove a lot of unicodes from a string but having issues with regex in java.
Example text:
\u2605 StatTrak\u2122 Shadow Daggers
Example Desired Result:
StatTrak Shadow Daggers
The current regex code I have that will not work:
list.replaceAll("\\\\u[0-9]+","");
The code will execute but the text will not be replaced. From looking at other solutions, people seem to use only two backslashes ("\\"), but anything fewer than four throws the typical error:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 2
\u[0-9]+
I've tried the current regex solution in online test environments like RegexPlanet and FreeFormatter and both give the correct result.
Any help would be appreciated.
Assuming that you would like to replace these "special" characters with an empty string: as I see it, \u2605 and \u2122 fall outside the POSIX printable character class (\p{Print}), so we can replace every non-printable character with "". The result then matches your expectation.
Sample would be:
list = list.replaceAll("\\P{Print}", "");
Hope this helps.
In Java, something like your \u2605 is not a literal sequence of six characters; it represents a single Unicode character, so your pattern "\\\\u[0-9]{4}" will not match it.
Your pattern describes a literal character \ followed by the character u followed by exactly four numeric characters 0 through 9 but what is in your string is the single character from the unicode code point 2605, the "Black Star" character.
This is just like other escape sequences: in the string "some\tmore" there is no character \ and there is no character t; there is only the single character 0x09, a tab character. Because it is an escape sequence known to Java (and other languages), it gets replaced by the character it represents, and the literal \ and t are no longer characters in the string.
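A quick sketch of the point (using the question's text, and assuming it was typed as a Java string literal as described above):
String s = "\u2605 StatTrak\u2122 Shadow Daggers"; // the \uXXXX escapes are resolved at compile time
System.out.println(s.length());        // 26, not 36: each escape became a single character
System.out.println(s.contains("\\u")); // false: there is no literal backslash in the string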
Kenny Tai Huynh's answer, replacing non-printables, may be the easiest way to go, depending on what sorts of things you want removed. Alternatively, you could list the characters you want to keep (if that is a very limited set) and remove the complement of those, such as mystring.replaceAll("[^A-Za-z0-9]", "");
I'm an idiot. I was calling replaceAll on the string but not assigning the result, as I thought it altered the string in place.
What I had previously:
list.replaceAll("\\\\u[0-9]+","");
What I needed:
list = list.replaceAll("\\\\u[0-9]+","");
Result works fine now, thanks for the help.

detect any combining character in Java

I am looking for a way to detect if a character in a java string "is a combining character" or not. For instance,
String khmerCombiningVowel =
    new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // unicode 17c0
represents a combining Khmer vowel sign. I have tried "\\p{InCombiningDiacriticalMarks}" regex but it doesn't seem to apply to these particular combining characters. Or even if there is some comprehensive list of all unicode combining character blocks I might be able to make a regex for them?
According to Algorithm to check for combining characters in Unicode, there are a number of blocks for combining characters.
Java has a number of helpful functions, try:
String codePointStr = new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, "UTF-8"); // unicode 17c0
System.out.println(codePointStr.matches("\\p{Mc}"));
System.out.println(
Character.COMBINING_SPACING_MARK == Character.getType(codePointStr.codePointAt(0)));
(prints true in both cases)
In this case, the COMBINING_SPACING_MARK (and related regex \p{gc=Mc}) both refer to the Unicode category "Mark, Spacing Combining" which is basically any character that combines with a previous character while also adding width.
Other regular expressions that may be useful: \p{M} for any kind of mark. If you want to use the Character getType() constants, you can get the same behavior by checking whether the type is COMBINING_SPACING_MARK, ENCLOSING_MARK, or NON_SPACING_MARK.
ENCLOSING_MARK is a surrounding character, like a circle; it also adds width to the character it combines with.
NON_SPACING_MARK includes the Latin alphabet diacritical combining marks, etc. (Marks that basically go on top or below, and don't add any width to the character).
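Putting those together, a small helper (my own sketch, not from the original answer) that reports whether a string contains any combining mark:
static boolean hasCombiningMark(String s) {
    return s.codePoints().anyMatch(cp -> {
        int type = Character.getType(cp);
        return type == Character.COMBINING_SPACING_MARK   // spacing combining marks such as U+17C0
                || type == Character.NON_SPACING_MARK     // e.g. Latin diacritics
                || type == Character.ENCLOSING_MARK;      // e.g. enclosing circles
    });
}
For the Khmer example above, hasCombiningMark(codePointStr) returns true.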

How do I use high-order unicode characters in java?

How do I use Unicode characters in Java, like the Negative Squared Latin Capital Letter E (U+1F174)? Using "\u1F174" doesn't work, as the \u escape only accepts four hex digits.
You need to specify it as a surrogate pair - two UTF-16 code units.
For example, if you copy and paste the character into my Unicode explorer you can see that U+1F174 is represented in UTF-16 code units as U+D83C U+DD74. (You can work this out manually, of course.) So you could write it in a Java string literal as:
String text = "\uD83C\uDD74";
Other options include:
String text = new StringBuilder().appendCodePoint(0x1f174).toString();
String text = new String(new int[] { 0x1f174 }, 0, 1);
char[] chars = Character.toChars(0x1f174);
"\uD83C\uDD74"
Or indeed
"🅴"
Because Java chars represent UTF-16 code units rather than actual Unicode characters, you need to represent this character as a string, which will contain the two UTF-16 surrogate code units.
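A small sketch of working the pair out with the JDK's own helpers rather than by hand:
int codePoint = 0x1F174;
char high = Character.highSurrogate(codePoint); // 0xD83C
char low  = Character.lowSurrogate(codePoint);  // 0xDD74
String text = new String(new char[] { high, low });
System.out.println(text.equals("\uD83C\uDD74")); // true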

How to encode a string to replace all special characters

I have a string which contains special characters, but I need to convert it into a string without any special characters. I used Base64, but Base64 uses the equals sign (=), which is itself a special character. I want to convert the string into one containing only alphanumeric characters. I also can't simply remove the special characters, because the result must stay unique between two different strings. How can I achieve this, and which encoding will help me?
The simplest option would be to encode the text to binary using UTF-8, and then convert the binary back to text as hex (two characters per byte). It won't be terribly efficient, but it will just be alphanumeric.
You could use base32 instead to be a bit more efficient, but that's likely to be significantly more work, unless you can find a library which supports it out of the box. (Libraries to perform hex encoding are very common.)
There are a number of variations of base64, some of which don't use padding. (You still have a couple of non-alphanumeric characters for characters 62 and 63.)
The Wikipedia page on base64 goes into the details, including the "standard" variations used for a number of common use-cases. (Does yours match one of those?)
If your strings have to be strictly alphanumeric, then you'll need to use hex encoding (one byte becomes 2 hex digits), or roll your own encoding scheme. Your stated requirements are rather unusual ...
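A minimal sketch of the hex option (the method name is my own):
import java.nio.charset.StandardCharsets;

static String toHex(String input) {
    byte[] bytes = input.getBytes(StandardCharsets.UTF_8);  // text -> binary via UTF-8
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
        sb.append(String.format("%02x", b & 0xff));          // two hex digits per byte
    }
    return sb.toString();
}
The output contains only 0-9 and a-f, and since the encoding is reversible, two different inputs can never map to the same result.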
Commons Codec has a URL-safe version of Base64, which emits - and _ instead of the + and / characters:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64URLSafe(byte[])
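If you are on Java 8 or later, the JDK's own java.util.Base64 has a URL-safe encoder as well; a quick sketch (input stands in for your string), though note the result can still contain - and _, so it is not strictly alphanumeric:
import java.util.Base64;
import java.nio.charset.StandardCharsets;

String encoded = Base64.getUrlEncoder()
        .withoutPadding()                    // drops the trailing '=' padding
        .encodeToString(input.getBytes(StandardCharsets.UTF_8));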
The easiest way would be to use a regular expression to match all non-alphanumeric characters and replace them with an empty string. (The snippets below are JavaScript; the same pattern works with Java's String.replaceAll, as sketched after the regex explanation below.)
// This will remove all special characters except space.
var cleaned = stringToReplace.replace(/[^\w\s]/gm, '')
Adding any special characters to the above regex will skip that character.
// This will remove all special characters except space and period.
var cleaned = stringToReplace.replace(/[^\w\s.]/gm, '')
A working example.
const regex = /[^\w\s]/gm;
const str = `This is a text with many special characters.
Hello, user, your password is 543#!\$32=!`;
const subst = ``;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Regex explained.
/[^\w\s]/gm
Match a single character not present in the list below [^\w\s]
\w matches any word character (equivalent to [a-zA-Z0-9_])
\s matches any whitespace character (equivalent to [\r\n\t\f\v \u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
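For a Java caller the rough equivalent would be the following sketch (stringToReplace is hypothetical); note that Java's replaceAll is global by default, so no /g flag is needed, and the result must be assigned because Java strings are immutable:
String cleaned = stringToReplace.replaceAll("[^\\w\\s]", "");   // remove everything except word characters and whitespace
String cleaned2 = stringToReplace.replaceAll("[^\\w\\s.]", ""); // same, but also keep periods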
If you truly can only use alphanumeric characters, you will have to come up with an escaping scheme that uses one of those characters as the escape. For example, use 0 as the escape and encode each special character as the two-character hex encoding of its ASCII value; use 000 to mean a literal 0.
e.g.
This is my special sentence with a 0.
encodes to:
This020is020my020special020sentence020with020a02000002e
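A minimal sketch of that scheme (my own illustration, assuming plain ASCII input as in the example):
static String escape(String input) {
    StringBuilder sb = new StringBuilder();
    for (char c : input.toCharArray()) {
        if (c == '0') {
            sb.append("000");                            // a literal '0' becomes the escape for itself
        } else if (Character.isLetterOrDigit(c)) {
            sb.append(c);                                // other alphanumerics pass through unchanged
        } else {
            sb.append('0').append(String.format("%02x", (int) c)); // escape + two hex digits of the ASCII value
        }
    }
    return sb.toString();
}
Running escape("This is my special sentence with a 0.") produces exactly the encoded string shown above.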
