Encoding pinyin - java

I'm currently developing a program in Java, and I want to display Chinese pinyin that I fetch from a remote website.
But I have the following problem: the pinyin is displayed this way: ji&#462;
Whereas it should be displayed this way: jiǎ
(I just typed the same sequence, except I stripped the slashes).
I think the answer to this question is really simple but I'm struggling to find it.

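The problem is that you have an HTML-encoded Unicode character, and what you want is its decoded form. A library like commons-lang3 (part of Apache Commons) can take the HTML-encoded string and decode it for Java to display. In commons-lang3 the method is named unescapeHtml4 (unescapeHtml belongs to the older commons-lang 2.x), and since version 3.6 the class is deprecated in favour of the one in Apache Commons Text:
String decoded = StringEscapeUtils.unescapeHtml4("ji&#462;");
You can also write the character in Java source as a Unicode escape: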
String jia = "ji\u01ce";
This clever one-liner will take a Unicode character and show you its escaped form:
System.out.println( "\\u" + Integer.toHexString('ǎ' | 0x10000).substring(1) );
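If adding a library is not an option, numeric entities like the one in the question can be decoded with the standard library alone. The following is only a minimal sketch under that assumption: the class name NumericEntityDecoder is made up for illustration, and it handles only numeric references (decimal and hexadecimal), not named entities such as &amp;amp;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: decodes numeric HTML
// character references such as &#462; (decimal) or &#x1ce; (hexadecimal).
final class NumericEntityDecoder {
    // Group 1 captures hex digits, group 2 captures decimal digits.
    // The digit counts are capped so Integer.parseInt cannot overflow.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the reference untouched if it is outside the Unicode range.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling NumericEntityDecoder.decode("ji&#462;") returns the three-character string jiǎ, because 462 is the decimal value of U+01CE.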

Related

Convert non-English characters that look like English letters to English letters in Java?

Suppose a name such as "ОХ699" is typed using a different keyboard layout. As a result, the “ОХ” is flagged as non-English characters, even though they look like English characters. Is there any way to convert characters like these to English characters directly?
I tried the following code to convert the non-English "ОХ" to the English letters "OX":
String subjectString = "ОХ699";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
But it does not convert them to English letters.
Showing output : "699"
Expected output : "OX699"
This is not possible with the standard library alone; you have to implement your own translation table. Normalizer does not help here, because the Cyrillic letters О and Х have no decomposition to Latin letters, so the regex simply strips them. The mapping is also ambiguous: some people want Р (R in Cyrillic) translated to p, and others want r. Chinese characters and emoji are a further problem.
There is a Linux program, uni2ascii, that does exactly what you want. You can see how it is implemented at https://salsa.debian.org/debian/uni2ascii/-/blob/master/uni2ascii.c (see the extremely big switch statements).
There is also a very simplified Python clone of this program: https://github.com/ajanin/uni2ascii/blob/master/uni2ascii/__init__.py#L65 . You can copy that switch and implement the translation in your app.
Or install uni2ascii on the server and just call it (or call it using JNI).
Either way, the common practice is simply to ignore and skip non-ASCII characters.
EDIT: I found an implementation in the Lucene engine. Note that this link points to the Lucene.NET port, which is written in C#; Apache Lucene has a Java class of the same name: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
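To make the "implement your own translation" advice concrete, here is a minimal sketch of such a table in Java. The class name HomoglyphFolder and the particular set of letters are assumptions chosen for illustration; it maps by visual shape (so Cyrillic Р becomes P), covers only a handful of Cyrillic lookalikes, and leaves every other character unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: replaces a few Cyrillic letters with the Latin
// letters they visually resemble. The table is deliberately incomplete.
final class HomoglyphFolder {
    private static final Map<Character, Character> LOOKALIKES = new HashMap<>();

    static {
        // Cyrillic capitals A, Ve, Ie, Ka, Em, En, O, Er, Es, Te, Ha,
        // followed by lowercase a, ie, o, er, es, ha.
        String cyrillic = "\u0410\u0412\u0415\u041a\u041c\u041d\u041e\u0420\u0421\u0422\u0425"
                + "\u0430\u0435\u043e\u0440\u0441\u0445";
        // Latin letters in the same order, chosen by shape rather than sound.
        String latin = "ABEKMHOPCTX" + "aeopcx";
        for (int i = 0; i < cyrillic.length(); i++) {
            LOOKALIKES.put(cyrillic.charAt(i), latin.charAt(i));
        }
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character mapped = LOOKALIKES.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

With this table, folding the Cyrillic input from the question yields "OX699". Anyone who prefers a mapping by sound instead of shape would need a different table, which is exactly why no single standard conversion exists.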

? is the only output for all unicode above U+0080 in java [duplicate]

While printing certain unicode characters in java we get output as '?'. Why is it so and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you have or the supported characters on the screen.
I would check which encoding you are using throughout, and try to determine whether you are reading, storing, and printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
Java's default behaviour when decoding bytes that are invalid in the expected encoding is to replace them with the Replacement Character (\uFFFD). This character is often rendered as a question mark. The reverse happens on output: a character that the output encoding cannot represent is written as '?'.
In your case, the text you're reading is probably not encoded in a Unicode encoding such as UTF-8; it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
I wrote an open-source library that has a utility that converts any String to a Unicode escape sequence and vice versa. It helps to diagnose such issues. So, for instance, to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it, and how to use it in the article "Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison". See the paragraph "String Unicode converter".
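The encoding mismatch described in these answers can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not a diagnosis of the asker's exact setup: the class name EncodingCheck and its two helper methods are hypothetical. It shows that a charset which cannot represent U+200D turns it into '?', while UTF-8 preserves it, and how to wrap an output stream so that it writes UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical helper for checking what an encoding does to a string.
final class EncodingCheck {
    // Encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Wraps a stream in a PrintStream that always encodes as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, EncodingCheck.roundTrip(strg.toString(), StandardCharsets.US_ASCII) gives "unico?decharacter", matching the output in the question, whereas the same call with StandardCharsets.UTF_8 returns the string unchanged. Printing through EncodingCheck.utf8Stream(System.out) only helps if the console itself is configured to interpret UTF-8.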

php json_encode euro symbol for android

I am loading data into an Android app from a PHP service.
In PHP I use json_encode to convert my data.
Now I have a string with a € character in it.
json_encode converts this to \u0080, but as far as I know the correct Unicode escape should be \u20AC.
Usually that's not a problem, but the Droid Sans font only renders \u20AC as the euro symbol.
My question: is there a way to make the € character convert correctly (I don't care whether that's in Java or in PHP, although I would prefer a PHP solution) without using string replacement, regex, etc.?
Replacing seems ugly, and there might be more symbols I don't know of yet that don't get converted properly.
\u0080 means that the input character was \x80, which is the euro sign in Windows-1252. So I assume your string is encoded in this charset; you should convert it to UTF-8, because json_encode only works with UTF-8 input:
$string = iconv('Windows-1252', 'UTF-8', $string);
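Since the asker would also accept a Java-side fix, the same conversion can be sketched there. This is only an illustration under assumptions: the class name EuroBytes and its methods are hypothetical, and it presumes the client has access to the raw bytes and that the JVM provides the windows-1252 charset (common in practice, but not one of the charsets every Java platform must support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing how the single byte 0x80 is interpreted
// under different charsets, and how to re-encode it as UTF-8.
final class EuroBytes {
    // Decodes raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Java counterpart of iconv('Windows-1252', 'UTF-8', $string):
    // interpret the bytes as Windows-1252, then encode them as UTF-8.
    static byte[] windows1252ToUtf8(byte[] bytes) {
        return decode(bytes, "windows-1252").getBytes(StandardCharsets.UTF_8);
    }
}
```

Decoding the byte 0x80 as windows-1252 gives the euro sign U+20AC, while decoding it as ISO-8859-1 gives the control character U+0080, which is the value the asker saw in the JSON. After conversion the euro sign occupies the three UTF-8 bytes E2 82 AC.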

Automatic Unicode string formatting in Java

I just came across something like this:
String sample = "somejunk+%3cfoobar%3e+morestuff";
Printed out, sample looks like this:
somejunk+<foobar>+morestuff
How does that work? U+003c and U+003e are the Unicode codes for the less than and greater than signs, respectively, which seems like more than a coincidence, but I've never heard of Java automatically doing something like this. I figured it'd be an easy thing to pop into Google, but it turns out Google doesn't like the percent sign.
That string is probably URL-encoded. You'd decode it in Java using URLDecoder, which also turns each '+' into a space:
String res = java.net.URLDecoder.decode(sample, "UTF-8");
If the plus signs should stay literal instead of being decoded to spaces, you can escape them before decoding:
String sample = "somejunk+%3cfoobar%3e+morestuff";
String result = URLDecoder.decode(sample.replaceAll("\\+", "%2B"), "UTF-8");
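The difference between the two decoding variants is easiest to see side by side. The following is a small sketch rather than a recommended utility: the class name UrlDecodeDemo and its method names are made up for illustration, and it wraps the checked exception so the calls stay short.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper comparing plain URL decoding with a variant
// that keeps '+' as a literal plus sign.
final class UrlDecodeDemo {
    // Standard form decoding: %XX sequences are decoded and '+' becomes a space.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Escapes every '+' as %2B first, so it survives decoding as '+'.
    static String decodeKeepingPlus(String encoded) {
        return decode(encoded.replace("+", "%2B"));
    }
}
```

For the sample string, decode returns "somejunk <foobar> morestuff" and decodeKeepingPlus returns "somejunk+<foobar>+morestuff". Which one is right depends on whether the plus signs in the source were meant as encoded spaces.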
Java does support Unicode escapes in char and String literals, but not URL encoding.
The Unicode escapes use '\uXXXX', where XXXX is a UTF-16 code unit in hexadecimal (for characters in the Basic Multilingual Plane, this is the code point itself).
Curious tidbit: The grammar allows 'u' to occur multiple times, so that '\uuuuuuuu0041' is a valid Unicode escape (for 'A').
