Split combined Arabic characters into individual characters - Java

I'm trying to convert "combined Arabic characters" (like ﻼ) into the individual characters that compose them (e.g. ﻝ ا). I wasn't able to do this in Java or C#, and I need to split the complete list of such characters.
In C#, I tried taking the Unicode character and converting it to Windows-1256, expecting to get 2 or 3 bytes that correspond to the individual characters the combined character uses, but that didn't work.
string unicodeWord = ((char)sc).ToString();
byte[] arabicBytes = System.Text.Encoding.GetEncoding(1256).GetBytes(unicodeWord);
but the result is always ?.
Can you help me with this? I have no problem using either Java or C#.
Thanks a lot!

string input = "ﻼ";
string normalized = input.Normalize(NormalizationForm.FormKC);
Note that the different normalization forms give different results; FormKC (compatibility decomposition followed by canonical composition) yields ل followed by ا.
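Since the question allows Java as well, here is a minimal sketch of the same approach using java.text.Normalizer. The class and method names (LigatureSplit, decompose) are made up for illustration, and the input is assumed to be the presentation form U+FEFC (lam with alef, final form).

```java
import java.text.Normalizer;

final class LigatureSplit {
    // NFKC replaces compatibility characters such as Arabic presentation
    // forms with their base letters.
    static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String result = decompose("\uFEFC");
        for (char c : result.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+0644, then U+0627
        }
    }
}
```

The Windows-1256 attempt returns ? because that code page has no slot for the ligature itself, so the encoder substitutes a question mark instead of splitting it.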

Related

Convert non-English characters to English letters (characters that look the same as English letters) in Java?

If a name such as "ОХ699" is typed using a different keyboard layout, "ОХ" is flagged as non-English characters even though they look like English characters. Is there any way to convert characters like these to English characters directly?
I tried the following code to convert "ОХ" to the English letters "OX":
String subjectString = "ОХ699";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
but it does not convert them to English letters.
Actual output: "699"
Expected output: "OX699"
This is not possible using the standard library. You have to implement your own translation table. The mapping is also ambiguous: some people want to translate Р (Cyrillic Er, which looks like P but sounds like R) to p, and others want r. Chinese characters and emoji are a further problem.
There is a Linux program, uni2ascii, that does exactly what you want. You can see how it is implemented at https://salsa.debian.org/debian/uni2ascii/-/blob/master/uni2ascii.c (see the extremely big switch statements).
There is also a very, very simplified Python clone of this program: https://github.com/ajanin/uni2ascii/blob/master/uni2ascii/__init__.py#L65 . You can copy that switch and implement the translation in your app.
Or install uni2ascii on the server and just call it (or call it using JNI).
Either way, the common practice is just to ignore and skip non-ASCII chars.
EDIT: I found an implementation in the Lucene engine, ASCIIFoldingFilter. Note that this link points to the C# port (Lucene.NET); the Java original is the class of the same name in Apache Lucene: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
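The "implement your own translations" advice could look roughly like the sketch below. The class name HomoglyphFolder and the tiny table are assumptions for illustration only; a real table would be much larger, and each mapping (for example Cyrillic Р to P rather than R) is a choice you have to make yourself.

```java
import java.util.HashMap;
import java.util.Map;

final class HomoglyphFolder {
    // Hypothetical, deliberately tiny look-alike table (visual matches only).
    private static final Map<Character, Character> LOOKALIKES = new HashMap<>();
    static {
        LOOKALIKES.put('\u0410', 'A'); // Cyrillic A (U+0410)
        LOOKALIKES.put('\u0415', 'E'); // Cyrillic Ie (U+0415)
        LOOKALIKES.put('\u041E', 'O'); // Cyrillic O (U+041E)
        LOOKALIKES.put('\u0420', 'P'); // Cyrillic Er (U+0420), mapped by shape
        LOOKALIKES.put('\u0425', 'X'); // Cyrillic Ha (U+0425)
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character mapped = LOOKALIKES.get(c);
            if (mapped != null) {
                out.append(mapped.charValue()); // known look-alike
            } else if (c < 0x80) {
                out.append(c);                  // already ASCII, keep as-is
            }
            // anything else is dropped, as the original replaceAll did
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u041E\u0425699")); // prints OX699
    }
}
```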

How to display string from Latin-Extended-A charset

I can't display a string that contains Latin Extended-A characters properly.
I have tried getting the bytes with different encodings and creating a new String with another encoding.
Suppose you have the string "ăăăăăăăăă". How can I output it properly?
Java supports Unicode characters. If you have:
String x = "ăăăăăăăăă";
System.out.println(x);
You'll get
ăăăăăăăăă
If you get question marks or funky-looking characters, then it's most likely not a problem with Java or the code, but with the fonts on your computer not supporting those characters.
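Besides fonts, another common cause of question marks is that the encoding Java uses for System.out does not match what the console expects. A minimal sketch, assuming the terminal is set to UTF-8 (the class name Utf8Console is made up for illustration):

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    // Wraps a stream so that text is always written as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("\u0103\u0103\u0103"); // three LATIN SMALL LETTER A WITH BREVE
    }
}
```

If the terminal uses a different encoding, this will produce garbage instead, so check the console settings first.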

How do I store accented characters in S3 metadata?

I am trying to store accented characters such as ò in the metadata of an S3 object. I am using the REST API which according to this page only accepts US-ASCII: http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html
Is there a way to convert Strings in Scala or Java from Bòrd to B\u00F2rd?
I have tried using Normalizer.normalize(str, Normalizer.Form.NFD) but the character when submitted to S3 is still causing an error because it appears as ò. When I try to print out the returned String it is also showing ò.
A normalized Unicode string is just normalized in terms of composing characters, not necessarily converted to ASCII. Using NFKC would be more likely to convert characters to ASCII forms, but it certainly would not do so reliably.
It sounds like what you want is to escape non-ASCII characters. You could use e.g. UnicodeEscaper from commons-lang, and UnicodeUnescaper to translate back.
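If you would rather not pull in commons-lang, the escaping can be sketched with the JDK alone. This is an illustration, not the library's implementation; the class and method names (AsciiEscaper, escapeNonAscii) are made up, and it assumes the input is in composed form (not NFD-normalized), otherwise ò arrives as o plus a combining accent.

```java
final class AsciiEscaper {
    // Replaces every UTF-16 code unit outside US-ASCII with a
    // backslash-u escape of four uppercase hex digits.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains U+00F2; the output is pure ASCII.
        System.out.println(escapeNonAscii("B\u00F2rd"));
    }
}
```

Whatever reads the metadata back has to reverse the escaping, which is the role UnicodeUnescaper plays in the commons-lang approach.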

Android- remove URL percent symbols from string

I have a URL that looks like this:
Liberty%21%20ft.%20Whiskey%20Pete%20-%20Thunderfist%20%28Original%20Mix%29.mp3
I'm trying to extract just the words from it. Right now, I'm using string.replace("%21", "!") for each and every %20, %29, etc., because each segment represents a different character or a space. Is there a way to just convert those symbols and numbers to what they actually mean?
Thanks.
Those symbols are URL-encoded representations of characters that can't legally exist in a URL (%20 = a single space, etc.).
You need to URL-decode those strings:
http://icfun.blogspot.com/2009/08/java-urlencode-and-urldecode-options.html
Official documentation here:
http://download.oracle.com/javase/6/docs/api/java/net/URLDecoder.html
It seems the input string is URL-encoded. Instead of writing all possible replacements manually (you can hardly cover all possibilities), you can use the URLDecoder class in Java.
String input = "Liberty%21%20ft.%20Whiskey%20Pete...";
String decoded = URLDecoder.decode(input, "UTF-8");
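A complete, runnable version of the snippet above, using the full file name from the question. The class name FileNameDecoder is made up for illustration. One caveat: URLDecoder follows form-encoding rules, so a literal + in the input is turned into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class FileNameDecoder {
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input =
            "Liberty%21%20ft.%20Whiskey%20Pete%20-%20Thunderfist%20%28Original%20Mix%29.mp3";
        // prints: Liberty! ft. Whiskey Pete - Thunderfist (Original Mix).mp3
        System.out.println(decode(input));
    }
}
```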

Replacing Unicode character codes with characters in String in Java

I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
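Note that my string has 12 chars (the ones we see, which all happen to be in the ASCII range).
Actually, if that is a string literal in Java source, the compiler has already replaced the escape with the real character, so the string is "pequeño".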
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.
I remember giving the same response last week: if the string really contains a literal backslash-u sequence at runtime, use org.apache.commons.lang.StringEscapeUtils (its unescapeJava method).
If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings into source, remember to also choose a suitable file encoding like UTF-8.
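For the case where the 12-char string really does arrive at runtime (for example read from a file), a minimal JDK-only sketch of the unescaping is shown below. The class name UnicodeUnescape is made up for illustration, and the sketch only handles the four-hex-digit form; it does not treat a doubled backslash specially the way a full Java unescaper would.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescape {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or backslash in the result.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "peque" + "\\" + "u00f1o";   // 12 chars, all ASCII
        String fixed = unescape(raw);
        System.out.println(raw.length());         // prints 12
        System.out.println(fixed.length());       // prints 7
    }
}
```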
