How do I match unicode characters in Java - java

I'm trying to match Unicode characters in Java.
Input String: informa
String to match : informátion
So far I've tried this:
Pattern p= Pattern.compile("informa[\u0000-\uffff].*", (Pattern.UNICODE_CASE|Pattern.CANON_EQ|Pattern.CASE_INSENSITIVE));
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}
It comes out as "No match". Any ideas?

The term "Unicode characters" is not specific enough: every character a Java string can hold is a Unicode character, including "normal" ASCII ones. The term is, however, very often used when one actually means "characters which are not in the printable ASCII range".
In regex terms that would be [^\x20-\x7E].
boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
Depending on what you'd like to do with this information, here are some useful follow-up answers:
Get rid of special characters
Get rid of diacritical marks
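One caveat with the matches() call above: "." does not match line terminators by default, so a string containing more than one line break is reported as clean. A minimal sketch that uses Matcher.find() instead; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

final class AsciiCheck {
    // Anything outside the printable ASCII range 0x20..0x7E.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    // True if at least one character is outside printable ASCII.
    static boolean containsNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("informa"));           // false
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // true
        System.out.println(containsNonPrintableAscii("a\nb\nc"));          // true
        // The matches() form misses this input because "." stops at line breaks.
        System.out.println("a\nb\nc".matches(".*[^\\x20-\\x7E].*"));       // false
    }
}
```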

Is it because informa isn't a substring of informátion at all?
How would your code work if you removed the last a from informa in your regex?

It sounds like you want to match letters while ignoring diacritical marks. If that's right, then normalize your strings to NFD form, strip out the diacritical marks, and then do your search.
String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
To learn more about NFD:
https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
http://unicode.org/faq/normalization.html
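Put together, the normalize-then-strip idea might look like the sketch below. The class name and the containsIgnoringDiacritics helper are hypothetical, and it assumes a plain substring search is what you need after stripping:

```java
import java.text.Normalizer;

final class DiacriticFolding {
    // Decompose to NFD, then drop combining marks from the U+0300..U+036F block.
    static String stripDiacritics(String text) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
        return normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: substring search that ignores diacritics on both sides.
    static boolean containsIgnoringDiacritics(String text, String query) {
        return stripDiacritics(text).contains(stripDiacritics(query));
    }

    public static void main(String[] args) {
        String text = "inform\u00e1tion"; // the 'a' carries an acute accent
        System.out.println(stripDiacritics(text));                       // information
        System.out.println(containsIgnoringDiacritics(text, "informa")); // true
    }
}
```

Note that the block property only covers marks in U+0300..U+036F; marks from other blocks would need \p{M} instead.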

Related

Java - \pL [\x00-\x7F]+ regex fails to get non English characters using String.match

I need to validate a name, stored in a String, which can be in any language and may contain spaces, using \p{L}:
You can match a single character belonging to the "letter" category with \p{L}
I tried to use String.matches, but it failed to match non English characters, even for 1 character, for example
String name = "อั";
boolean isMatch = name.matches("[\\p{L}]+"); // returns false
I tried with and without brackets, and with + for multiple letters, but it always fails to match non-English characters.
Is there an issue with using String.matches with \p{L}?
It also failed with [\\x00-\\x7F]+, as suggested in the Pattern documentation:
\p{ASCII} All ASCII:[\x00-\x7F]
You should bear in mind that \p{L} matches a single Unicode letter (one code point); it does not match the combining marks (diacritics) that can follow a base letter.
Since your input can contain letters and diacritics you should at least use both \p{L} and \p{M} Unicode property classes in your character class:
String regex = "[\\p{L}\\p{M}]+";
If the input string can contain words separated with whitespaces, you may add \s shorthand class and to match any kind of whitespace you may compile this regex with Pattern.UNICODE_CHARACTER_CLASS flag:
String regex = "(?U)[\\p{L}\\p{M}\\s]+";
Note that this regex allows entering diacritics, letters and whitespaces in any order. If you need a more precise regex (e.g. diacritics only allowed after a base letter) you may consider something like
String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";
Here, (?>\\p{L}\\p{M}*+)+ matches one or more letters each followed with zero or more diacritics, \s* matches zero or more whitespaces and \s+ matches 1 or more whitespaces.
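To see how the three patterns differ, here is a small sketch that runs them against the two-character Thai string from the question and against the same characters in reverse order. The class and constant names are illustrative only:

```java
import java.util.regex.Pattern;

final class NameRegexDemo {
    // Letters only.
    static final Pattern LETTERS = Pattern.compile("[\\p{L}]+");
    // Letters, marks and whitespace in any order.
    static final Pattern LOOSE = Pattern.compile("(?U)[\\p{L}\\p{M}\\s]+");
    // Marks only after a base letter; words separated by whitespace.
    static final Pattern STRICT = Pattern.compile(
            "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String thai = "\u0e2d\u0e31";      // Thai letter O ANG followed by the mark MAI HAN-AKAT
        String markFirst = "\u0e31\u0e2d"; // same two characters, mark before the letter
        System.out.println(matches(LETTERS, thai));     // false
        System.out.println(matches(LOOSE, thai));       // true
        System.out.println(matches(STRICT, thai));      // true
        System.out.println(matches(LOOSE, markFirst));  // true
        System.out.println(matches(STRICT, markFirst)); // false
    }
}
```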
\p{IsAlphabetic} vs. [\p{L}\p{M}]
If you check the source code, \p{Alphabetic} checks if Character.isAlphabetic(ch) is true. It is true if the char belongs to any of the following classes: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER, LETTER_NUMBER or it has contributory property Other_Alphabetic. It is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic.
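While all those L subclasses form the general L class, note that \p{IsAlphabetic} also includes the letter-number class Nl and the characters with the Other_Alphabetic property. Other_Alphabetic overlaps with \p{M} (it covers many vowel signs and similar marks) but is not the same set; see this reference (although it is in German, the categories and char names are in English).
So \p{IsAlphabetic} and [\p{L}\p{M}] accept different sets of characters: the former adds Nl characters such as Roman numerals, while the latter also accepts combining marks that are not Alphabetic, such as U+0301. You should make the right decision based on the languages you want to support.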
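A quick way to check the difference for a particular character is to try both classes on it. The sketch below uses a Roman numeral (category Nl) and the Thai mark from the question; the class and method names are invented for the example:

```java
final class AlphabeticVsLetterOrMark {
    static boolean isAlphabetic(String s) {
        return s.matches("\\p{IsAlphabetic}+");
    }

    static boolean isLetterOrMark(String s) {
        return s.matches("[\\p{L}\\p{M}]+");
    }

    public static void main(String[] args) {
        String romanOne = "\u2160"; // ROMAN NUMERAL ONE, general category Nl
        String thaiMark = "\u0e31"; // THAI CHARACTER MAI HAN-AKAT, general category Mn

        System.out.println(isAlphabetic(romanOne));   // true
        System.out.println(isLetterOrMark(romanOne)); // false
        System.out.println(isAlphabetic(thaiMark));   // true
        System.out.println(isLetterOrMark(thaiMark)); // true
    }
}
```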
The only solution I found is using \p{IsAlphabetic}
\p{Alpha} An alphabetic character:\p{IsAlphabetic}
boolean isMatch = name.matches("[ \\p{IsAlphabetic}]+");
This does not work in the demo on sites such as https://regex101.com/
There are two characters there. The first is a letter, the second is a non-letter mark.
String name = "\u0e2d";
boolean isMatch = name.matches("[\\p{L}]+"); // true
works, but
String name = "\u0e2d\u0e31";
boolean isMatch = name.matches("[\\p{L}]+"); // false
does not, because ั (U+0E31) is a non-spacing mark (general category Mn), not a letter.
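If you are unsure what a string really contains, you can print the general category of each code point. This is only a diagnostic sketch: the class name is made up, and the category helper labels just the two categories that matter here.

```java
final class CodePointTypes {
    // Short label for the general category; only the two categories discussed here are named.
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_LETTER:     return "Lo";
            case Character.NON_SPACING_MARK: return "Mn";
            default:                         return "other";
        }
    }

    public static void main(String[] args) {
        String name = "\u0e2d\u0e31";
        name.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, category(cp)));
        // prints:
        // U+0E2D Lo
        // U+0E31 Mn
    }
}
```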
I googled that character to find the language; it seems to be Thai. The Thai Unicode range is 0E00 to 0E7F.
When you are working with Unicode characters you can use \u escapes, so the regex should look like this:
[\u0E00-\u0E7F]
This finds a match for your character in this regex test.
If you want to match letters from any language, use this:
[\p{L}]
This finds a match in your example characters in this regex test.
Try including more categories:
[\p{L}\p{Mn}\p{Mc}\p{Nl}\p{Pc}\p{Pd}\p{Po}\p{Sk}]+
Note that it might be best to simply not validate names. People can't really complain if they entered it wrong but your system didn't catch it. However, it's much more of a problem if someone is unable to enter their name. If you do insist on adding validation, please make it overridable: that should have the advantages of each method without their disadvantages.

Java Replace Unicode Characters in a String

I have a string which contains multiple Unicode characters. I want to identify all of them, e.g. \ uF06C, and replace each one with a backslash and the four hex digits, without the "u".
Example:
Source String: "add \uF06Cd1 Clause"
Result String: "add \F06Cd1 Clause"
How can I achieve this in Java?
Edit:
The linked question, "Java Regex - How to replace a pattern or how to", is different from this one because my question deals with Unicode characters. Although the escape is written with several characters, the JVM treats it as one single character, and hence a regex won't work.
The correct way to do this is using a regex to match the entire unicode definition and use group-replacement.
The regex to match the unicode-string:
A unicode-character looks like \uABCD, so \u, followed by a 4-character hexnumber string. Matching these can be done using
\\u[A-Fa-f\d]{4}
But there's a problem with this:
In a String like "just some \\uabcd arbitrary text" the \u would still get matched. So we need to make sure the \u is preceded by an even number of backslashes:
(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
Now as an output, we want a backslash followed by the hexnum-part. This can be done by group-replacement, so let's start by grouping characters. The repeated pair is wrapped in an outer group, because a capturing group under * only keeps its last repetition:
(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})
As a replacement we want all backslashes from the group that matches the pairs of backslashes, followed by a backslash and the hexnum-part of the unicode-literal:
$1\\$3
Now for the actual code:
String pattern = "(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";
Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);
That's a lot of backslashes! Well, there's an issue with java, regex and backslash: backslashes need to be escaped in java and regex. So "\\\\" as a pattern-string in java matches one \ as regex-matched character.
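Here is the same idea as a runnable sketch, assuming the input really contains a literal backslash followed by u (not an actual non-ASCII character). The class and method names are hypothetical. The repeated backslash pair sits inside an outer capturing group so that $1 holds every leading pair rather than only the last repetition:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeRewriter {
    // Backslash, u and four hex digits, preceded by an even number of backslashes.
    private static final Pattern ESCAPE =
            Pattern.compile("(?<!\\\\)((?:\\\\\\\\)*)(\\\\u)([A-Fa-f\\d]{4})");

    // Keeps the leading backslash pairs, then writes one backslash and the hex digits.
    static String dropU(String text) {
        Matcher m = ESCAPE.matcher(text);
        return m.replaceAll("$1\\\\$3");
    }

    public static void main(String[] args) {
        // The doubled backslash in the literal is one real backslash in the string.
        String test = "add \\uF06Cd1 Clause";
        System.out.println(dropU(test)); // add \F06Cd1 Clause
    }
}
```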
EDIT:
On actual strings, the characters need to be filtered out and be replaced by their integer-representation:
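// 'in' is the input string
StringBuilder sb = new StringBuilder();
for (char c : in.toCharArray()) {
    if (c > 127) {
        // non-ASCII: backslash plus four hex digits (use %04X for upper-case digits)
        sb.append("\\").append(String.format("%04x", (int) c));
    } else {
        sb.append(c);
    }
}
String result = sb.toString();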
This assumes by "unicode-character" you mean non-ASCII-characters. This code will print any ASCII-character as is and output all other characters as backslash followed by their unicode-code. The definition "unicode-character" is rather vague though, as char in java always represents unicode-characters. This approach preserves any control-chars like "\n", "\r", etc., which is why I chose it over other definitions.
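Try using the String.replace() method with escaped backslashes. This only helps if the string contains a literal backslash followed by u, not an actual non-ASCII character:
s = s.replace("\\u", "\\");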

Regex to replace All turkish symbols to regular latin symbols

I have a class that replaces all Turkish letters with similar Latin letters and passes the result to a searcher.
These are the methods for the replacement:
@Override
String replaceTurkish(String words) {
    if (checkWithRegExp(words)) {
        return words.toLowerCase().replaceAll("ç", "c").replaceAll("ğ", "g").replaceAll("ı", "i")
                .replaceAll("ö", "o").replaceAll("ş", "s").replaceAll("ü", "u");
    } else return words;
}
public static boolean checkWithRegExp(String word){
    Pattern p = Pattern.compile("[öçğışü]");
    Matcher m = p.matcher(word);
    return m.matches();
}
But this always returns the words unmodified.
What am I doing wrong?
Thanks in advance!
Per the Java 7 API, Matcher.matches():
Attempts to match the entire region against the pattern.
Your pattern is "[öçğışü]", which regex101.com (an awesome resource) says will match
a single character in the list öçğışü literally
Perhaps you may see the problem already. Your regex is not going to match anything except a single Turkish character, since you are attempting to match the entire region against a regex which will only ever accept one character.
I recommend either using find(), per suggestion by Andreas in the comments, or using a regex like this:
".*[öçğışü].*"
which should actually find words that contain any Turkish-specific characters.
Additionally, I'll point out that regex is case-sensitive, so if there are upper-case variants of these letters, you should include those as well and modify your replace statements.
Finally (edit): you can make your Pattern case-insensitive, but your replaceAll's will still need to change to be case-insensitive. I am unsure of how this will work with non-Latin characters, so you should test that flag before relying on it.
Pattern p = Pattern.compile(".*[öçğışü].*", Pattern.CASE_INSENSITIVE);
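A minimal sketch of the find() variant follows. It is not the asker's class: the names are illustrative, it leaves out the toLowerCase() call, and it only handles the six lower-case letters. On the case-insensitivity question, the Pattern documentation says CASE_INSENSITIVE covers only US-ASCII unless UNICODE_CASE is also set, so test that combination carefully before relying on it for Turkish, where dotless and dotted i have unusual case mappings.

```java
import java.util.regex.Pattern;

final class TurkishFolding {
    // Lower-case c-cedilla, g-breve, dotless i, o-umlaut, s-cedilla, u-umlaut,
    // written as escapes so the example does not depend on the source file encoding.
    private static final Pattern TURKISH =
            Pattern.compile("[\u00e7\u011f\u0131\u00f6\u015f\u00fc]");

    // find() succeeds if the pattern occurs anywhere; matches() needs the whole input to match.
    static boolean containsTurkish(String word) {
        return TURKISH.matcher(word).find();
    }

    static String replaceTurkish(String words) {
        if (!containsTurkish(words)) {
            return words;
        }
        // replace(char, char) is a literal replacement, so no regex is involved here.
        return words.replace('\u00e7', 'c').replace('\u011f', 'g').replace('\u0131', 'i')
                .replace('\u00f6', 'o').replace('\u015f', 's').replace('\u00fc', 'u');
    }

    public static void main(String[] args) {
        System.out.println(replaceTurkish("\u00e7i\u00e7ek")); // cicek
        System.out.println(replaceTurkish("\u015feker"));      // seker
        System.out.println(replaceTurkish("hello"));           // hello
    }
}
```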

How to subString based on the special character?

I have a String like the one below, and I want to extract the substrings that follow a special character.
String myString="Regular $express&ions are <patterns <that can# be %matched against *strings";
I want output like this:
express
ions
patterns
that
matched
strings
Can anyone help me? Thanks in advance.
Note: as @MaxZoom pointed out, it seems that I didn't understand the OP's problem properly. The OP apparently does not want to split the string on special characters, but rather keep the words starting with a special character. The former is addressed by my answer, the latter by @MaxZoom's answer.
You should take a look at the String.split() method.
Give it a regexp matching all the characters you want, and you'll get an array of all the strings you want. For instance:
String myString = "Regular $express&ions are <patterns <that can# be %matched against *strings";
String[] words = myString.split("[$&<#%*]");
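For reference, this sketch shows what the split actually returns for that input (the class and method names are made up). Each piece keeps the surrounding words and spaces, which is why it does not produce the list the question asks for:

```java
import java.util.Arrays;

final class SplitDemo {
    // Split on any one of the listed special characters.
    static String[] splitOnSpecials(String s) {
        return s.split("[$&<#%*]");
    }

    public static void main(String[] args) {
        String myString =
                "Regular $express&ions are <patterns <that can# be %matched against *strings";
        System.out.println(Arrays.toString(splitOnSpecials(myString)));
        // [Regular , express, ions are , patterns , that can,  be , matched against , strings]
    }
}
```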
This regex will select words that start with a special character:
[$&<%*](\w*)
explanation:
[$&<%*] match a single character present in the list below
$&<%* a single character in the list $&<%* literally (case sensitive)
1st Capturing group (\w*)
\w* match any word character [a-zA-Z0-9_]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
g modifier: global. All matches (don't return on first match)
DEMO
MATCH 1 [9-16] express
MATCH 2 [17-21] ions
MATCH 3 [27-35] patterns
MATCH 4 [37-41] that
MATCH 5 [51-58] matched
MATCH 6 [68-75] strings
Solution in Java code:
String str = "Regular $express&ions are <patterns <that can# be %matched against *strings";
Matcher matcher = Pattern.compile("[$&<%*](\\w*)").matcher(str);
List<String> words = new ArrayList<>();
while (matcher.find()) {
    words.add(matcher.group(1));
}
System.out.println(words.toString());
// prints [express, ions, patterns, that, matched, strings]

Java ignore special characters in string matching

I want to match two strings in Java, e.g.:
text: János
searchExpression: Janos
Since I don't want to replace all special characters, I thought I could just make the á a wildcard, so everything would match for this character. For instance if I search in János with Jxnos, it should find it. Of course there could be multiple special characters in the text. Does anyone have an idea how I could achieve this via any pattern matcher, or do I have to compare char by char?
Use the Pattern and Matcher classes with J\\Snos as the regex. \\S matches any non-whitespace character.
String str = "foo János bar Jxnos";
Matcher m = Pattern.compile("J\\Snos").matcher(str);
while (m.find()) {
    System.out.println(m.group());
}
Output:
János
Jxnos
A possible solution would be to strip the accent with the help of Apache Commons StringUtils.stripAccents(input) method:
String input = StringUtils.stripAccents("János");
System.out.println(input); //Janos
Make sure to also read up on the more elaborate approaches based on the Normalizer class: Is there a way to get rid of accents and convert a whole string to regular letters?
