Java regex for support Unicode?


To match A to Z, we use the regex:
[A-Za-z]
How can a regex match Unicode (UTF-8) characters entered by the user, for example Chinese words like 环保部?

What you are looking for are Unicode properties.
For example, \p{L} matches any kind of letter from any language.
So a regex to match such a Chinese word could be something like
\p{L}+
There are many such properties; for more details see regular-expressions.info
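As a minimal sketch of the \p{L}+ suggestion, the class below checks whether a whole string consists of letters. The class and method names are made up for illustration, and the Chinese word is written with Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.regex.Pattern;

final class UnicodeLetterDemo {
    // \p{L} matches one code point whose general category is "letter", in any script.
    static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");

    // Returns true when the whole input consists of one or more letters.
    static boolean isLettersOnly(String input) {
        return LETTERS_ONLY.matcher(input).matches();
    }

    public static void main(String[] args) {
        String chinese = "\u73af\u4fdd\u90e8"; // the Chinese word from the question
        System.out.println(isLettersOnly(chinese));
        System.out.println(isLettersOnly("abc"));
        System.out.println(isLettersOnly("abc123")); // digits are not letters
    }
}
```

Note that matches() requires the entire input to match, so a single digit or space makes the check fail.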
Another option is to use the flag
Pattern.UNICODE_CHARACTER_CLASS
Java 7 introduced the flag Pattern.UNICODE_CHARACTER_CLASS, which enables the Unicode version of the predefined character classes.
You could do something like this:
Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
and \w would then match all letters and all digits from any language (and of course some word-connecting characters like _).
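To see the effect of the flag, the sketch below compiles the same \w+ expression with and without Pattern.UNICODE_CHARACTER_CLASS. The class name and constants are illustrative only.

```java
import java.util.regex.Pattern;

final class UnicodeWordDemo {
    // Default \w: ASCII letters, digits and underscore only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // With the flag, \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String chinese = "\u73af\u4fdd\u90e8"; // the Chinese word from the question
        System.out.println(ASCII_WORD.matcher(chinese).matches());
        System.out.println(UNICODE_WORD.matcher(chinese).matches());
        System.out.println(UNICODE_WORD.matcher("abc_123").matches());
    }
}
```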

To support national-language (NLS) input without accepting ASCII special characters, you can use a pattern like the one below:
[a-zA-Z0-9 \u0080-\u9fff]*+
For a code point reference, see http://www.utf8-chartable.de/unicode-utf8-table.pl
Code snippet:
String vowels = "అఆఇఈఉఊఋఌఎఏఐఒఓఔౠౡ";
String consonants = "కఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహ";
String signsAndPunctuations = "కఁకంకఃకాకికీకుకూకృకౄకెకేకైకొకోకౌక్కౕకౖ";
String symbolsAndNumerals = "౦౧౨౩౪౫౬౭౮౯";
String engChinesStr = "ABC導字會";
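// Requires: import java.util.regex.Pattern;
// ASCII letters, digits and spaces plus the Telugu block (U+0C00 to U+0C7F)
Pattern ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU =
        Pattern.compile("[a-zA-Z0-9 \\u0c00-\\u0c7f]*+");
System.out.println(ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU.matcher(vowels).matches());
// ASCII letters, digits and spaces plus CJK Unified Ideographs (U+4E00 to U+9FFF)
Pattern ALPHANUMERIC_AND_SPACE_PATTERN_CHINESE =
        Pattern.compile("[a-zA-Z0-9 \\u4e00-\\u9fff]*+");
// ASCII letters, digits and spaces plus every code point from U+0080 to U+9FFF
Pattern ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN =
        Pattern.compile("[a-zA-Z0-9 \\u0080-\\u9fff]*+");
System.out.println(ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN.matcher(engChinesStr).matches());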

To match individual characters, you can simply include them in a character class, either as literals or via the \u03FB syntax.
Obviously you often cannot list all allowed characters in ideographic languages. To make the regex treat Unicode characters according to their type or code block, various other escapes are supported; they are defined in the Pattern Javadoc. Look at the section "Unicode support", particularly the references to the Character class and to the Unicode Standard itself.
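The three approaches mentioned above (an explicit character, a script, a block) might look like this in practice. This is only a sketch: the class name is invented, and \p{IsHan} and \p{InGreek} are just two examples of the script and block names that Pattern accepts.

```java
import java.util.regex.Pattern;

final class ScriptAndBlockDemo {
    // A single character listed explicitly through its escape (U+03FB).
    static final Pattern ONE_CHAR = Pattern.compile("[\\u03FB]");
    // Script property: any code point assigned to the Han script.
    static final Pattern HAN_SCRIPT = Pattern.compile("\\p{IsHan}+");
    // Block property: any code point in the Greek block (U+0370 to U+03FF).
    static final Pattern GREEK_BLOCK = Pattern.compile("\\p{InGreek}+");

    public static void main(String[] args) {
        System.out.println(ONE_CHAR.matcher("\u03FB").matches());
        System.out.println(HAN_SCRIPT.matcher("\u73af\u4fdd\u90e8").matches()); // Chinese word from the question
        System.out.println(GREEK_BLOCK.matcher("\u03b1\u03b2").matches());      // Greek alpha, beta
    }
}
```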

The Java regular expression API works on the char type.
The char type is implicitly UTF-16.
If you have UTF-8 data, you will need to transcode it to UTF-16 on input if this is not already being done.
Unicode is the universal set of characters, and UTF-8 can describe all of it (including control characters, punctuation, symbols, letters, etc.). You will have to be more specific about what you want to include and what you want to exclude. Java regular expressions use the \p{category} syntax to match code points by category. See the Unicode standard for the list of categories.
If you want to identify and separate words in a sequence of ideographs, you will need to look at a more sophisticated API. I would start with the BreakIterator type.
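A rough sketch of both points above: decoding UTF-8 bytes into a String (which is UTF-16 internally) and then splitting text into words with BreakIterator. The helper names are hypothetical, and how ideographic text gets segmented depends on the JDK's break rules, so no particular output is promised for Chinese input.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakDemo {
    // Transcodes raw UTF-8 bytes into a Java String.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns the segments between word boundaries that contain a letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip segments made only of punctuation or whitespace.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] utf8 = "Hello, world".getBytes(StandardCharsets.UTF_8);
        System.out.println(words(decodeUtf8(utf8), Locale.ENGLISH));
        // Segmentation of ideographs depends on the JDK's break data.
        System.out.println(words("\u73af\u4fdd\u90e8", Locale.CHINESE));
    }
}
```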

Starting from Java 9, you can also use \X to match any Unicode extended grapheme cluster. See more at Java Doc: Pattern.
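A small sketch of \X, assuming Java 9 or later (on older versions the pattern fails to compile). It counts grapheme clusters, so a base letter followed by a combining mark counts once. The class and method names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeClusterDemo {
    // \X (Java 9 and later) matches one extended grapheme cluster.
    static final Pattern CLUSTER = Pattern.compile("\\X");

    // Counts user-perceived characters rather than chars or code points.
    static int countClusters(String text) {
        Matcher m = CLUSTER.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // e followed by COMBINING ACUTE ACCENT
        System.out.println(eAcute.length() + " chars, " + countClusters(eAcute) + " cluster(s)");
    }
}
```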

Related

Java - \pL [\x00-\x7F]+ regex fails to get non English characters using String.match

I need to validate a name, stored in a String, which can be in any language and may contain spaces, using \p{L}:
You can match a single character belonging to the "letter" category with \p{L}
I tried to use String.matches, but it failed to match non-English characters, even a single character, for example:
String name = "อั";
boolean isMatch = name.matches("[\\p{L}]+"); // returns false
I tried with and without brackets, and adding + for multiple letters, but it always fails to match non-English characters.
Is there an issue using String.matches with \p{L}?
I also failed using [\\x00-\\x7F]+ as suggested in the Pattern documentation:
\p{ASCII} All ASCII:[\x00-\x7F]

You should bear in mind that \p{L} matches a single letter code point; it does not match the diacritics (combining marks) that may be attached to a letter.
Since your input can contain letters and diacritics, you should at least use both the \p{L} and \p{M} Unicode property classes in your character class:
String regex = "[\\p{L}\\p{M}]+";
If the input string can contain words separated by whitespace, you may add the \s shorthand class, and to match any kind of Unicode whitespace you may compile the regex with the Pattern.UNICODE_CHARACTER_CLASS flag (written inline as (?U)):
String regex = "(?U)[\\p{L}\\p{M}\\s]+";
Note that this regex allows diacritics, letters and whitespace in any order. If you need a more precise regex (e.g. diacritics only allowed after a base letter) you may consider something like
String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";
Here, (?>\\p{L}\\p{M}*+)+ matches one or more letters, each followed by zero or more diacritics, \s* matches zero or more whitespace characters and \s+ matches one or more whitespace characters.
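The difference between the loose and the stricter expression can be tried out with a sketch like the following. The class name, constant names and the isValidName helper are hypothetical; the two regexes are the ones quoted in this answer.

```java
import java.util.regex.Pattern;

final class NameValidator {
    // Letters and marks in any order.
    static final Pattern LOOSE = Pattern.compile("[\\p{L}\\p{M}]+");
    // Marks are only accepted directly after a base letter; words are separated by whitespace.
    static final Pattern STRICT = Pattern.compile(
            "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*");

    static boolean isValidName(String name) {
        return STRICT.matcher(name).matches();
    }

    public static void main(String[] args) {
        String thai = "\u0e2d\u0e31";        // letter followed by its mark
        String markFirst = "\u0e31\u0e2d";   // mark before any base letter
        System.out.println(isValidName(thai));
        System.out.println(isValidName(markFirst));
        System.out.println(LOOSE.matcher(markFirst).matches());
        System.out.println(isValidName("Jean Marie"));
    }
}
```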
\p{IsAlphabetic} vs. [\p{L}\p{M}]
If you check the source code, \p{Alphabetic} checks if Character.isAlphabetic(ch) is true. It is true if the char belongs to any of the following classes: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER, LETTER_NUMBER or it has contributory property Other_Alphabetic. It is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic.
While all those L subclasses form the general L class, note that \p{IsAlphabetic} also includes the letter number class Nl, and that Other_Alphabetic does not cover the same characters as the \p{M} class (it contains only some of the marks), see this reference (although it is in German, the categories and char names are in English).
So, \p{IsAlphabetic} and [\p{L}\p{M}] do not match the same set of characters, and you should make the right decision based on the languages you want to support.
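One concrete difference between the two classes can be checked directly. The sketch below uses U+2160 (ROMAN NUMERAL ONE, general category Nl) as a sample character; the class name and the choice of sample are mine, not part of the answer.

```java
import java.util.regex.Pattern;

final class AlphabeticVsLetterDemo {
    static final Pattern ALPHABETIC = Pattern.compile("\\p{IsAlphabetic}");
    static final Pattern LETTER_OR_MARK = Pattern.compile("[\\p{L}\\p{M}]");

    // Builds a one-code-point string, which also works for supplementary characters.
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int romanOne = 0x2160; // ROMAN NUMERAL ONE, a letter number (Nl)
        String s = asString(romanOne);
        System.out.println("isAlphabetic:     " + Character.isAlphabetic(romanOne));
        System.out.println("\\p{IsAlphabetic}: " + ALPHABETIC.matcher(s).matches());
        System.out.println("[\\p{L}\\p{M}]:     " + LETTER_OR_MARK.matcher(s).matches());
    }
}
```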

The only solution I found is using \p{IsAlphabetic}:
\p{Alpha} An alphabetic character:\p{IsAlphabetic}
boolean isMatch = name.matches("[ \\p{IsAlphabetic}]+");
However, this doesn't work in the demo on sites such as https://regex101.com/

There are two characters there. The first is a letter, the second is a non-letter mark.
String name = "\u0e2d";
boolean isMatch = name.matches("[\\p{L}]+"); // true
works, but
String name = "\u0e2d\u0e31";
boolean isMatch = name.matches("[\\p{L}]+"); // false
does not because ั U+E31 is a Non-Spacing Mark [NSM], not a letter.
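To inspect a string like this yourself, you can print the general category of each code point with Character.getType. This is a small diagnostic sketch; the class and method names are made up, and only the two categories relevant here are spelled out.

```java
final class CategoryInspector {
    // Describes a code point by its general category as reported by Character.getType.
    static String describe(int codePoint) {
        int type = Character.getType(codePoint);
        String kind;
        if (type == Character.OTHER_LETTER) {
            kind = "OTHER_LETTER (Lo)";
        } else if (type == Character.NON_SPACING_MARK) {
            kind = "NON_SPACING_MARK (Mn)";
        } else {
            kind = "type " + type;
        }
        return String.format("U+%04X %s", codePoint, kind);
    }

    public static void main(String[] args) {
        String name = "\u0e2d\u0e31"; // the two code points from the question
        name.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```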

I googled that character to find the language; it seems to be Thai. The Thai Unicode character range is 0E00 to 0E7F.
When you are working with Unicode characters you can use the \u escape. So, the regex should look like this:
[\u0E00-\u0E7F]
This matches your character in this regex test.
If you want to match any language, use this:
[\p{L}]
This matches your example characters in this regex test.
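A sketch comparing these options in Java itself. Keep in mind that String.matches and Matcher.matches require the whole input to match, whereas online testers usually search for a match anywhere in the input (like Matcher.find), which would explain differing results. The class name is invented, and \p{IsThai} is an additional script-based alternative not mentioned in the answer.

```java
import java.util.regex.Pattern;

final class ThaiMatchDemo {
    static final Pattern THAI_RANGE = Pattern.compile("[\\u0E00-\\u0E7F]+");
    // Script-based alternative to the explicit range (scripts are supported since Java 7).
    static final Pattern THAI_SCRIPT = Pattern.compile("\\p{IsThai}+");
    static final Pattern ANY_LETTER = Pattern.compile("\\p{L}+");

    public static void main(String[] args) {
        String name = "\u0e2d\u0e31"; // the two code points from the question
        System.out.println(THAI_RANGE.matcher(name).matches());
        System.out.println(THAI_SCRIPT.matcher(name).matches());
        System.out.println(ANY_LETTER.matcher(name).matches()); // whole input must match
        System.out.println(ANY_LETTER.matcher(name).find());    // a match anywhere is enough
    }
}
```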

Try including more categories:
[\p{L}\p{Mn}\p{Mc}\p{Nl}\p{Pc}\p{Pd}\p{Po}\p{Sk}]+
Note that it might be best to simply not validate names. People can't really complain if they entered it wrong but your system didn't catch it. However, it's much more of a problem if someone is unable to enter their name. If you do insist on adding validation, please make it overridable: that should have the advantages of each method without their disadvantages.

Character class for Unicode digits

I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".
The first part is working well for me, it's doing a good job of identifying non-Latin characters as alphabetic characters. The problem is the second half. Obviously it will only work for Arabic Numerals. The character classes \\d and \p{Digit} are also just [0-9]. The javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?
For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.

Quoting the Java docs about isDigit:
A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.
So, I believe the pattern to match digits should be \p{Nd}.
Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
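Since the linked example is not reproduced here, the following is a comparable sketch (not the original ideone code). It compares \p{Nd} against Character.isDigit for a few sample code points and combines \p{Nd} with \p{IsAlphabetic} as asked in the question. All names are illustrative.

```java
import java.util.regex.Pattern;

final class UnicodeDigitDemo {
    // \p{Nd} is the general category DECIMAL_DIGIT_NUMBER.
    static final Pattern DECIMAL_DIGIT = Pattern.compile("\\p{Nd}");
    // Alphabetic characters or decimal digits from any script.
    static final Pattern ALPHA_OR_DIGIT = Pattern.compile("[\\p{IsAlphabetic}\\p{Nd}]+");

    static boolean matchesDigit(int codePoint) {
        return DECIMAL_DIGIT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // ASCII 3, Thai digit three, Devanagari digit three, and a letter.
        int[] samples = {'3', 0x0E53, 0x0969, 'a'};
        for (int cp : samples) {
            System.out.printf("U+%04X regex=%b isDigit=%b%n",
                    cp, matchesDigit(cp), Character.isDigit(cp));
        }
    }
}
```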

Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:
(?U)\d+
or in code:
System.out.println("3๓३".matches("(?U)\\d+")); // true
Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:
Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);

Java Regular Expression with International Letters

Here's my current code:
return str.matches("^[A-Za-z\\-'. ]+");
I want it to include international letters. How do I do that in Java?
Thanks.

It seems that what you want is to match all alphabetic characters. Typically you would do that with the POSIX \p{Alpha} class, extended by the punctuation you also want to permit. As the Java regular expressions documentation says, it matches ASCII only.
However, what the documentation does not say clearly is that you can make this class work with Unicode characters. To do just that, you need to turn Unicode character class matching on.
You can do this in one of two ways:
By creating the Pattern object with the UNICODE_CHARACTER_CLASS flag:
Pattern p = Pattern.compile("^[\\p{Alpha}\\-'. ]+", Pattern.UNICODE_CHARACTER_CLASS);
By using the (?U) embedded pattern flag:
str.matches("^(?U)[\\p{Alpha}\\-'. ]+");
Proof of concept:
String[] test = {"Jean-Marie Le'Blanc", "Żółć", "Ὀδυσσεύς", "原田雅彦"};
for (String str : test) {
System.out.print(str.matches("^(?U)[\\p{Alpha}\\-'. ]+") + " ");
}
The obvious result is:
true true true true
If you think that all is correct, I have two additional points to make:
原田雅彦 (Masahiko Harada) is composed of ideographic characters. Strictly speaking, they are not alphabetic characters.
You want to match the dot (.) symbol. That's OK, but please consider matching ideographic full stops as well.
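If you want to act on these two points, one possible approach (an assumption on my part, not something the original pattern does) is to test for ideographs explicitly with the Ideographic binary property and to add U+3002, the ideographic full stop, to the character class. Class and constant names are invented.

```java
import java.util.regex.Pattern;

final class IdeographicNameDemo {
    // Unicode binary property Ideographic.
    static final Pattern IDEOGRAPHS = Pattern.compile("\\p{IsIdeographic}+");
    // The name pattern from this answer, extended with U+3002 (ideographic full stop).
    static final Pattern NAME = Pattern.compile("^(?U)[\\p{Alpha}\\-'. \\u3002]+");

    public static void main(String[] args) {
        String harada = "\u539f\u7530"; // first two ideographs of the sample name
        System.out.println(IDEOGRAPHS.matcher(harada).matches());
        System.out.println(NAME.matcher(harada + "\u3002").matches());
        // The unextended pattern rejects the ideographic full stop.
        System.out.println((harada + "\u3002").matches("^(?U)[\\p{Alpha}\\-'. ]+"));
    }
}
```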

I assume you want to match alphabetic characters other than the ASCII letters A-Z. You can do this with the \p{IsAlphabetic} Unicode character class:
return str.matches("^[\\p{IsAlphabetic}\\-'. ]+");
You'll find more Unicode character classes in the full documentation.

Replace the pattern with:
"^[\\p{L}\\-'. ]+"
\p{L} includes all unicode letters.

Use the regex \p{L} to match any letter (national or international).
With [\p{L}&&[^\p{IsLatin}]], you can match all letters that are not Latin.
For Greek specifically, regex has \p{InGreek} to match Greek characters and \P{InGreek} (the difference is the capital P) to match non-Greek characters.
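The character-class intersection and the negated property can be tried out with a sketch like this one. The class and constant names are illustrative; note that \p{InGreek} refers to the Greek block, so it is true for any code point in that block.

```java
import java.util.regex.Pattern;

final class LetterSetDemo {
    // Intersection: letters that are not in the Latin script.
    static final Pattern NON_LATIN_LETTER = Pattern.compile("[\\p{L}&&[^\\p{IsLatin}]]");
    // Block membership and its negation (capital P).
    static final Pattern GREEK = Pattern.compile("\\p{InGreek}");
    static final Pattern NOT_GREEK = Pattern.compile("\\P{InGreek}");

    public static void main(String[] args) {
        String alpha = "\u03b1"; // GREEK SMALL LETTER ALPHA
        System.out.println(NON_LATIN_LETTER.matcher(alpha).matches());
        System.out.println(NON_LATIN_LETTER.matcher("a").matches());
        System.out.println(GREEK.matcher(alpha).matches());
        System.out.println(NOT_GREEK.matcher("a").matches());
    }
}
```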

The question cannot be answered completely unless you say what you mean by "international letters", but the general solution is to use named character classes, via the \p{name} syntax. There are many named character classes. Some are defined by the regex language, and others by the Unicode standard. Refer to the Pattern javadocs for a partial list, and to the relevant Unicode standard.

regex that allows chinese characters

I have a regex that blocks invalid characters in a string, but it's also blocking Chinese characters, which I don't want. Please help me with it. Below is the regex string that I am using.
String re = "[^\\x09\\x0A\\x0D\\x20-\\xD7FF\\xE000-\\xFFFD\\x10000-x10FFFF]";
Thanks in anticipation!

Since Java 7 you can make use of Unicode properties/scripts.
For example, you can use the property \p{L} to match a letter in any language, or the script \p{IsHiragana} to match a character contained in Hiragana. You need to check which script fits your needs.
See here on docs.Oracle.com for more details about regex and Unicode
It is also possible to match the opposite: \P{L} matches every character that is NOT a letter. Alternatively, just add \p{L} to your negated character class instead of the ranges that are meant to define letters.
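Separately from the property-based approach, the ranges in the question can probably be kept if the escapes are fixed. This is my reading of the Pattern documentation rather than something stated in the answer: the \x escape takes exactly two hex digits, so \xD7FF is parsed as \xD7 followed by the literal text FF. Four-digit values need the backslash-u form and larger code points the \x{...} form (Java 7 and later). The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

final class InvalidCharFilter {
    // Everything except: tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF.
    static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes every character that falls outside the allowed ranges.
    static String stripInvalid(String text) {
        return INVALID.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "a\u0001b\u4e2d\u6587"; // control character plus two Chinese characters
        System.out.println(stripInvalid(input).length()); // the control character is removed
    }
}
```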

Removing all non-word characters in a Cyrillic UTF-8 encoded String

Normally, in order to remove non-word characters from a String the replaceAll method can be used:
String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");
The above returns a cleaned string "somestringwithnonwordssuchas".
However, if the string contains Cyrillic characters they get recognised as non-word, and get removed from the string. It is expected that Cyrillic characters would remain. Hence the question.
What is a proper way to deal with the task of removing non-word characters regardless of the language, assuming that string has UTF-8 encoding?

Try [^\\p{L}]. That should match every Unicode codepoint except for letters.
The Pattern class has a pretty thorough description of the possible character classes. Note that the POSIX character classes are ASCII-only by default and won't help you a lot, you'll need to use the Unicode-specific classes.
Note that there's the UNICODE_CHARACTER_CLASS flag that changes the behavior of the POSIX classes to conform to this section of the Unicode Standard (basically making them equivalent to their closest Unicode-aware equivalents).
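Both suggestions can be sketched as follows: removing everything that is not a letter, or keeping the \W shorthand and switching it to its Unicode meaning with the inline (?U) flag. The class and method names are invented, and the Cyrillic sample text is written with escapes only to keep the source ASCII-safe.

```java
final class NonWordStripper {
    // Removes every code point that is not a letter (digits and underscores are removed too).
    static String stripNonLetters(String text) {
        return text.replaceAll("[^\\p{L}]", "");
    }

    // Removes non-word characters, with \W interpreted according to Unicode.
    static String stripNonWord(String text) {
        return text.replaceAll("(?U)\\W", "");
    }

    public static void main(String[] args) {
        // Cyrillic greeting with punctuation and a number.
        String input = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440 42!";
        System.out.println(stripNonLetters(input));
        System.out.println(stripNonWord(input));
    }
}
```

The first variant also drops digits, while the second keeps them because digits count as word characters.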
