The problem is that, as you know, there are thousands of characters in the Unicode chart, and I want to convert all the look-alike characters to the letters of the English alphabet.
For instance here are a few conversions:
ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...
I also saw that there are more than 20 variants of the letter A/a, and I don't know how to classify them. Finding them all is like looking for needles in a haystack.
The complete list of unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and see the variations of letters.
How can I convert all these with Java? Please help me :(
Reposting my post from How do I remove diacritics (accents) from a string in .NET?
This method works fine in Java (purely for the purpose of removing diacritical marks, a.k.a. accents).
It decomposes each accented character into its base letter followed by its combining diacritics. You can then use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;
public String deAccent(String str) {
    // NFD decomposition separates base letters from their combining marks.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
It's a part of Apache Commons Lang as of ver. 3.0.
org.apache.commons.lang3.StringUtils.stripAccents("Añ");
returns An
Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/
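To see both what this approach catches and what it cannot, here is a minimal sketch (the Cyrillic look-alike "ђ" has no Unicode decomposition, so it survives):

import java.text.Normalizer;

public class DeAccentDemo {
    public static void main(String[] args) {
        // "ñ" decomposes into "n" plus a combining tilde, which the regex
        // strips; "ђ" (U+0452) has no decomposition and passes through.
        String nfd = Normalizer.normalize("Añ tђє", Normalizer.Form.NFD);
        System.out.println(nfd.replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
        // prints: An tђє
    }
}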
Attempting to "convert them all" is the wrong approach to the problem.
First, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language, with their own meaning and sound; removing those marks is just the same as replacing random letters in an English word. This is before you even go on to consider the Cyrillic languages and other script-based texts such as Arabic, which simply cannot be "converted" to English.
If you must, for whatever reason, convert characters, then the only sensible way to approach this is to first reduce the scope of the task at hand. Consider the source of the input - if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.
By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary based lookup / replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find and replaces tend to be blindingly quick.
This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (in our case) took perhaps one person-day to produce, covering all diacritic marks for all Western European languages.
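As a rough sketch of such a dictionary-based replace (the entries here are just the handful from the question; a real table would be far larger, and building it is the boring part):

import java.util.Map;

public class LookupFolder {
    // Tiny hand-built sample table mapping look-alikes to Latin letters.
    private static final Map<Character, String> TABLE = Map.of(
            'ҥ', "H", 'Ѷ', "V", 'Ȳ', "Y", 'Ǭ', "O", 'Ƈ', "C");

    public static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Replace when the table knows the character, else keep it.
            sb.append(TABLE.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }
}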
Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and not following any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.
You will need to build a mapping from Unicode characters to the Latin characters they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode code points, but I think the effort for this would be greater than manually building the mapping, especially if you have a good number of examples from which to build it.
To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the Latin characters they resemble.
Examples:
"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any Latin character at all, yet in your example it is used to represent "a".
Test string: ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
Results:
Output from Apache Commons Lang3: AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j: AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode: AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý, plus another issue)
Output from Unidecode: AAAAAAECEEEEIIIIDNOOOOOOUUUUYss
The last choice is the best.
The original request has been answered already.
However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.
Naive meaning of transliteration: the translated string, in its final form/target charset, sounds like the string in its original form.
If we want to transliterate any charset to Latin (the English alphabet), then ICU4J (the ICU4J library in Java) will do the job.
Here is the code snippet in Java:
import com.ibm.icu.text.Transliterator; // ICU4J library import

private static final String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
private static final String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

/**
 * Returns the transliterated string, converting any charset to Latin.
 */
public static String transliterate(String input) {
    Transliterator transliterator =
            Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
    return transliterator.transliterate(input);
}
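Example usage (a sketch; the exact output depends on the transliteration rules shipped with your ICU version):

System.out.println(transliterate("Ελληνικά")); // prints something like "Ellenika"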
If the need is to convert "òéışöç" to "oeisoc", you can use this as a starting point:
public class AsciiUtils {
    private static final String PLAIN_ASCII =
          "AaEeIiOoUu"    // grave
        + "AaEeIiOoUuYy"  // acute
        + "AaEeIiOoUuYy"  // circumflex
        + "AaOoNn"        // tilde
        + "AaEeIiOoUuYy"  // umlaut
        + "Aa"            // ring
        + "Cc"            // cedilla
        + "OoUu"          // double acute
        ;
    private static final String UNICODE =
          "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
        + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
        + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
        + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
        + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
        + "\u00C5\u00E5"
        + "\u00C7\u00E7"
        + "\u0150\u0151\u0170\u0171"
        ;

    // Private constructor: this utility class can't be instantiated.
    private AsciiUtils() { }

    // Remove accented characters from a string and replace them with
    // their ASCII equivalents.
    public static String convertNonAscii(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            int pos = UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(PLAIN_ASCII.charAt(pos));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(AsciiUtils.convertNonAscii(s));
        // output:
        // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
JDK 1.6 and later provide the java.text.Normalizer class, which can be used for this task. See the deAccent() example earlier in this thread.
The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, “ß” to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to “B”.
Add to that the fact that Unicode has multiple code points for the same glyphs.
The upshot is that the only way to do this is to create a massive table mapping each Unicode character to the ASCII character(s) you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD (NFKD), but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".
Here is a tiny excerpt from an app that does this:
switch (c)
{
    case 'A':
    case '\u00C0': // À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1': // Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2': // Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    // and so on for about 20 lines...
        return "A";
    case '\u00C6': // Æ LATIN CAPITAL LIGATURE AE
        return "AE";
    // And so on for pages...
}
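For comparison, the KD-normalization shortcut mentioned above can be sketched in a couple of lines (assuming java.text.Normalizer is imported; \p{M} matches the combining marks left over after decomposition):

String folded = Normalizer.normalize("Ǆ ﬁ Ä", Normalizer.Form.NFKD)
        .replaceAll("\\p{M}+", "");
System.out.println(folded); // prints: DZ fi A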
You could try using Unidecode, which is available as a Ruby gem and as a Perl module on CPAN (JUnidecode, mentioned above, is a Java port). Essentially, it works as a huge lookup table where each Unicode code point maps to an ASCII character or string.
There is no easy or general way to do what you want, because it is just your subjective opinion that these letters look like the Latin letters you want to convert to. They are actually separate letters with their own distinct names and sounds which just happen to superficially look like a Latin letter.
If you want that conversion, you have to create your own translation table based on what latin letters you think the non-latin letters should be converted to.
(If you only want to remove diacritical marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However, you describe a more general problem.)
I'm late to the party, but after facing this issue today, I found this answer to be very good:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "");
Reference:
https://stackoverflow.com/a/16283863
The following class does the trick:
org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter
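For example (a sketch, assuming Lucene is on the classpath; the output buffer is oversized because a single character may fold to several):

import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

char[] in = "Añ stöd".toCharArray();
char[] out = new char[in.length * 4]; // folding can expand a character
int outLength = ASCIIFoldingFilter.foldToASCII(in, 0, out, 0, in.length);
System.out.println(new String(out, 0, outLength)); // prints: An stod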
Related
I'd like to sort strings in Japanese (which may contain various Japanese characters as well as Latin characters), and the Latin characters should be sorted to the end.
final Collator collator = Collator.getInstance(Locale.JAPANESE);
List<String> objcts = new ArrayList<>();
objcts.add("Alpha");
objcts.add("家事問屋");
Collections.sort(objcts, collator);
System.out.println(objcts);
Out: [Alpha, 家事問屋]
Desired Out: [家事問屋, Alpha]
Is there a simple way known how to achieve this?
You could probably implement a Comparator (or extend Collator) that ranks CJK before Latin, using regexes like this:
import java.text.Collator;
import java.util.Comparator;

public class CjkBeforeLatinCollator implements Comparator<String> {
    private final Collator collator;

    public CjkBeforeLatinCollator(Collator collator) {
        this.collator = collator;
    }

    @Override
    public int compare(String source, String target) {
        if (source.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+") && target.matches("\\p{IsLatin}+")) {
            return -1; // CJK sorts before Latin
        }
        if (source.matches("\\p{IsLatin}+") && target.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+")) {
            return 1;
        }
        return collator.compare(source, target);
    }
}
I used the Unicode character classes from the answer to this question:
How can I detect japanese text in a Java string?
You might need to customize the matching (e.g. all letters are Latin, the first letter is Latin, etc.) to your needs.
When used like this:
final Comparator<String> comparator = new CjkBeforeLatinCollator(Collator.getInstance(Locale.JAPANESE));
List<String> strings = List.of("Alpha", "Beta", "問屋", "家事問屋");
System.out.println(strings.stream().sorted(comparator).collect(Collectors.joining(",")));
Then the output would appear sorted like this:
家事問屋,問屋,Alpha,Beta
I guess the letters are in Unicode. The ranges of the Latin letters are given in this wiki article, which says:
As of version 13.0 of the Unicode Standard, 1,374 characters in the following blocks are classified as belonging to the Latin script:
Basic Latin, 0000–007F. This block corresponds to ASCII.
Latin-1 Supplement, 0080–00FF
Latin Extended-A, 0100–017F
Latin Extended-B, 0180–024F
IPA Extensions, 0250–02AF
Spacing Modifier Letters, 02B0–02FF
Phonetic Extensions, 1D00–1D7F
Phonetic Extensions Supplement, 1D80–1DBF
Latin Extended Additional, 1E00–1EFF
Superscripts and Subscripts, 2070–209F
Letterlike Symbols, 2100–214F
Number Forms, 2150–218F
Latin Extended-C, 2C60–2C7F
Latin Extended-D, A720–A7FF
Latin Extended-E, AB30–AB6F
Alphabetic Presentation Forms (Latin ligatures), FB00–FB4F
Halfwidth and Fullwidth Forms, FF00–FFEF
So most of them come before the Japanese blocks. Using these ranges, you could make sure that Japanese letters are put in front.
And the ranges for Japanese are:
Japanese-style punctuation (3000–303F)
Hiragana (3040–309F)
Katakana (30A0–30FF)
Full-width Roman characters and half-width katakana (FF00–FFEF)
CJK unified ideographs, common and uncommon kanji (4E00–9FAF)
as listed here, according to this post.
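A minimal sketch of that idea (the isJapanese check below is an assumption: it treats a string as Japanese when its first code point falls in the kana or kanji blocks listed above; extend the ranges as needed):

import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class JapaneseFirstSort {
    private static boolean isJapanese(String s) {
        int cp = s.codePointAt(0);
        return (cp >= 0x3040 && cp <= 0x30FF)   // Hiragana + Katakana
            || (cp >= 0x4E00 && cp <= 0x9FAF);  // CJK unified ideographs
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.JAPANESE);
        Comparator<String> japaneseFirst = (a, b) -> {
            boolean ja = isJapanese(a);
            boolean jb = isJapanese(b);
            if (ja != jb) {
                return ja ? -1 : 1; // Japanese entries sort first
            }
            return collator.compare(a, b);
        };
        List<String> words = new ArrayList<>(List.of("Alpha", "家事問屋"));
        words.sort(japaneseFirst);
        System.out.println(words); // [家事問屋, Alpha]
    }
}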
Does the order of the Japanese and English strings matter? If yes, you need to implement your own comparison method for the collator.
If the order does not matter, you can just do:
Collections.sort(objcts, Collections.reverseOrder());
To add a bit more to this: a collator is usually used for a single language, so you need a way to differentiate the characters of the two alphabets. I would strongly suggest using two separate lists for English and Japanese text: detect which script each word is in and decide which list to put it in. Then you can sort both lists accordingly and combine/use them as you wish.
I don't code much in Java, but I can explain the steps you can take.
As far as I know, there is no alphabet string provided in Java, so you can create a string variable that contains the alphabet (both upper- and lower-case). Let's call it alphabet. The string would look like this: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
Then you'll have to make a variable containing the last index number (a.k.a. the size of the list). We will call it last.
Assuming each item is either fully Japanese or fully Latin and assuming that your list is already full, you can loop through the list and perform these steps on each item:
Get the first character in the string.
Test to see if it is in alphabet.
If it is, move the item to the end of the list (position last). If it is not, leave it as it is.
That's basically it! I sincerely apologise for not being able to provide the code, as I code mostly in Python, but I hope this helped!
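A literal Java rendering of those steps could look like this (a sketch; the method name and the use of two working lists are illustrative):

import java.util.ArrayList;
import java.util.List;

// Reorders the list so entries starting with a Latin letter come last;
// assumes each entry is entirely Japanese or entirely Latin.
static List<String> latinLast(List<String> items) {
    String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    List<String> result = new ArrayList<>();
    List<String> latinTail = new ArrayList<>();
    for (String item : items) {
        if (alphabet.indexOf(item.charAt(0)) >= 0) {
            latinTail.add(item); // first character is Latin -> goes last
        } else {
            result.add(item);
        }
    }
    result.addAll(latinTail);
    return result;
}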
I have a String named fancy that contains "𝖑𝖒𝖆𝖔"; however, I need to make "lmao" out of it.
I've tried calling String#trim, but with no success.
Example code:
var fancy = "𝖑𝖒𝖆𝖔"
var normal = //Magic to convert 𝖑𝖒𝖆𝖔 to lmao
EDIT: I figured out that if I take the code point of such a fancy character and subtract 120101 from it, I get the original character. However, there are more types of these fancy texts, so this does not seem like a general solution to my problem.
You can take advantage of the fact that your "𝖆" character decomposes to a regular "a":
Decomposition LATIN SMALL LETTER A (U+0061)
Java's java.text.Normalizer class offers different normalization forms. The NFKD and NFKC forms use the above decomposition rule.
String normal = Normalizer.normalize(fancy, Normalizer.Form.NFKC);
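With the string from the question, normal then contains the plain "lmao".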
Using compatibility equivalence is what you need here:
Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors.
(The reason you do not lose diacritics is because this process simply separates these diacritic marks from their base letters - and then re-combines them if you use the relevant form.)
Those are unicode characters: https://unicode-table.com also provides reverse lookup to identify them (copy-paste them into the search).
The fancy characters identify as:
𝖑 Mathematical Bold Fraktur Small L (U+1D591)
𝖒 Mathematical Bold Fraktur Small M (U+1D592)
𝖆 Mathematical Bold Fraktur Small A (U+1D586)
𝖔 Mathematical Bold Fraktur Small O (U+1D594)
You can also find them as an 'old style english alphabet' on this list: https://unicode-table.com/en/sets/fancy-letters. There we notice that they are ordered in the same way as the plain alphabetic characters, so each character has a fixed offset:
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
You can thus transform the characters back by subtracting that offset.
Now comes the tricky part: these Unicode code points cannot all be represented by a single char value, which is only 16 bits wide; depending on the code point, one or two chars are needed.
The proper way to deal with this is to work with the code points directly:
String fancy = "𝖑𝖒𝖆𝖔";
int offset = 0x1D586 - 'a'; // 𝖆 is U+1D586
String plain = fancy.codePoints()
        .map(i -> i - offset)
        .mapToObj(c -> (char) c)
        .map(String::valueOf)
        .collect(java.util.stream.Collectors.joining());
System.out.println(plain);
This then prints lmao.
I am trying to clean up a text file in which I need to convert the all-uppercase words so that only their first character stays capitalized, and then write the result to a new file.
for example:
intext = In general, my primary concern regarding this patient was regarding her CHEST PAIN.
outtext = In general, my primary concern regarding this patient was regarding her Chest Pain.
I could only find .toLowerCase, which converts all characters to lower case.
Any help would be greatly appreciated.
WordUtils.capitalizeFully(str)
Using Apache Commons Lang, you can capitalize the first character of each word. From the Javadoc:
public static String capitalizeFully(String str)
Converts all the whitespace separated words in a String into
capitalized words, that is each word is made up of a titlecase
character and then a series of lowercase characters.
Whitespace is defined by Character.isWhitespace(char). A null input
String returns null. Capitalization uses the Unicode title case,
normally equivalent to upper case.
WordUtils.capitalizeFully(null) = null
WordUtils.capitalizeFully("") = ""
WordUtils.capitalizeFully("i am FINE") = "I Am Fine"
I receive bytes into a method and I want to send them over serial, but I only want to send valid bytes (i.e. a-zA-Z0-9"!£$%^&*()-_=+ and the like, plus spaces, new lines, etc.). I just want to filter out characters such as accented letters or �, in any order and any number of times.
Would something like this including all characters with | work?
^[a-z|A-Z|0-9|\\s|-<other characters>]*
Or, what would be the correct expression?
So if a string contained "exit����", I would only want to send "exit", and never send characters that are not valid, but send everything else.
public void write(byte[] bytes, int offset, int count) {
String str;
try {
str = new String(bytes, "ASCII");
Log.d(TAG, "data received in write: " +str );
//^[a-z|A-Z|0-9|\s|-]*
//test here, call next line on any character that is valid
GraphicsTerminalActivity.sendOverSerial(str.getBytes("ASCII"));
} catch (UnsupportedEncodingException e) {
Log.d(TAG, "exception" );
e.printStackTrace();
}
// appendToEmulator(bytes, 0, bytes.length);
}
EDIT: I tried [^\x00-\x7F] which is the range of ascii characters....but then the � symbols still get through, weird.
Try using a pattern like [\x20-\x7E]. These are the ASCII codes of the printable characters.
By the way, I assume you are asking about ASCII, because that is the encoding you decode with in your question.
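Applied to the write() method from the question, that could look like this (a sketch; \r and \n are added to the class so line breaks survive):

String clean = str.replaceAll("[^\\x20-\\x7E\\r\\n]", "");
GraphicsTerminalActivity.sendOverSerial(clean.getBytes("ASCII"));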
You want to do a search-replace:
String fixed = input.replaceAll("[^\\p{Print}\t\n]", "");
Edit: Add references:
Pattern Javadoc -> scroll down to POSIX Character Classes (US-ASCII ONLY)
The pattern above matches all characters that are not printable characters....
You may want to look into Java's Normalizer class if you haven't already. It would allow you to extract the "normal" character from its accented equivalent, as an alternative to throwing away the whole character.
I don't remember my exact source for this idea (I was trying to do accent-agnostic searching recently), but a quick search turned up this simple blog post that may offer a little more insight into how to use it.
The pipe is not the correct way to turn your list of characters into a regular expression. Put the characters in a character class with square brackets around them; all characters in a character class are ORed by default, so there is no need for pipes. Symbols that are not numbers or letters may need to be escaped.
[a-zA-Z0-9\"\!\£\$\%\^\&\*\(\)\-\_\=\+]
And then if you want to put that into a Java string, you need to double escape the escapes
Pattern p = Pattern.compile("[a-zA-Z0-9\\\"\\!\\£\\$\\%\\^\\&\\*\\(\\)\\-\\_\\=\\+]");
Keep in mind that the pound symbol (£) is not an ASCII character, so converting it to ASCII is not going to work.
The method should allow only the characters "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-" in URI strings.
What is the best way to make nice SEO URI string?
This is what the general consensus is:
Lowercase the string.
string = string.toLowerCase();
Normalize all characters and get rid of all diacritical marks (so that e.g. é, ö, à becomes e, o, a).
string = Normalizer.normalize(string, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
Replace all remaining non-alphanumeric characters by - and collapse when necessary.
string = string.replaceAll("[^\\p{Alnum}]+", "-");
So, summarized:
public static String toPrettyURL(String string) {
return Normalizer.normalize(string.toLowerCase(), Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
.replaceAll("[^\\p{Alnum}]+", "-");
}
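For example (note that leading/trailing punctuation leaves a dangling hyphen you may want to trim):

System.out.println(toPrettyURL("Écoute-moi, s'il te plaît!"));
// prints: ecoute-moi-s-il-te-plait-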
The following regex will do the same thing as your algorithm. I'm not aware of libraries for doing this type of thing.
String s = input
.replaceAll(" ?- ?","-") // remove spaces around hyphens
.replaceAll("[ ']","-") // turn spaces and quotes into hyphens
.replaceAll("[^0-9a-zA-Z-]",""); // remove everything not in our allowed char set
These are commonly called "slugs" if you want to search for more information.
You may want to check out other answers such as How can I create a SEO friendly dash-delimited url from a string? and How to make Django slugify work properly with Unicode strings?
They cover C# and Python more than Java but have some language-agnostic discussion about slug conventions and issues you may face when making them (such as uniqueness, Unicode normalization problems, etc.).