Character class for Unicode digits - java

I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".
The first part is working well for me, it's doing a good job of identifying non-Latin characters as alphabetic characters. The problem is the second half. Obviously it will only work for Arabic Numerals. The character classes \\d and \p{Digit} are also just [0-9]. The javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?
For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.

Quoting the Java docs about isDigit:
A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.
So, I believe the pattern to match digits should be \p{Nd}.
Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
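Since the linked example is not reproduced here, the following is a minimal sketch of the same comparison. The class name NdVersusIsDigit and the sample code points are my own choices for illustration, not taken from the original demo.

```java
import java.util.regex.Pattern;

class NdVersusIsDigit {
    // \p{Nd} is the Unicode general category DECIMAL_DIGIT_NUMBER.
    static final Pattern ND = Pattern.compile("\\p{Nd}");

    // True when the regex and Character.isDigit give the same answer for this code point.
    static boolean agrees(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return ND.matcher(s).matches() == Character.isDigit(codePoint);
    }

    public static void main(String[] args) {
        // Illustrative samples: ASCII 3, Thai 3, Devanagari 3,
        // Roman numeral three (category Nl, so not a digit), and the letter A.
        int[] samples = {'3', 0x0E53, 0x0969, 0x2162, 'A'};
        for (int cp : samples) {
            System.out.printf("U+%04X isDigit=%b agrees=%b%n",
                    cp, Character.isDigit(cp), agrees(cp));
        }
    }
}
```

Every sample should report agrees=true, including the Roman numeral, which both checks reject.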

Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:
(?U)\d+
or in code:
System.out.println("3๓३".matches("(?U)\\d+")); // true
Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:
Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);
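To tie this back to the question, here is a small sketch comparing the default \d with the flagged version, plus one possible combined pattern for "alphabetic or digit". The class and constant names are made up for the example.

```java
import java.util.regex.Pattern;

class UnicodeDigitFlag {
    // Default behaviour: \d is [0-9] only.
    static final Pattern ASCII_DIGITS = Pattern.compile("\\d+");
    // With the flag, \d follows the Unicode definition of a decimal digit.
    static final Pattern UNICODE_DIGITS =
            Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS);
    // One way to express "alphabetic or digit in any script" as a single class.
    static final Pattern LETTER_OR_DIGIT =
            Pattern.compile("[\\p{IsAlphabetic}\\d]+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String mixed = "3\u0E53\u0969"; // ASCII 3, Thai 3, Devanagari 3
        System.out.println(ASCII_DIGITS.matcher(mixed).matches());   // false
        System.out.println(UNICODE_DIGITS.matcher(mixed).matches()); // true
        System.out.println(LETTER_OR_DIGIT.matcher("abc\u0416" + mixed).matches()); // true
    }
}
```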

Related

What is the equivalent in .Net of the :print: character class from PHP or Java? [duplicate]

Is there a special regex statement like \w that denotes all printable characters? I'd like to validate that a string only contains a character that can be printed--i.e. does not contain ASCII control characters like \b (bell), or null, etc. Anything on the keyboard is fine, and so are UTF chars.
If there isn't a special statement, how can I specify this in a regex?
Very late to the party, but this regexp works: /[ -~]/.
How? It matches all characters in the range from space (ASCII DEC 32) to tilde (ASCII DEC 126), which is the range of all printable ASCII characters.
If you want to strip non-ASCII characters, you could use something like:
$someString.replace(/[^ -~]/g, '');
NOTE: this is not valid .net code, but an example of regexp usage for those who stumble upon this via search engines later.
If your regex flavor supports Unicode properties, this is probably the best way:
\P{Cc}
That matches any character that's not a control character, whether it be ASCII -- [\x00-\x1F\x7F] -- or Latin1 -- [\x80-\x9F] (also known as the C1 control characters).
The problem with POSIX classes like [:print:] or \p{Print} is that they can match different things depending on the regex flavor and, possibly, the locale settings of the underlying platform. In Java, they're strictly ASCII-oriented. That means \p{Print} matches only the ASCII printing characters -- [\x20-\x7E] -- while \P{Cntrl} (note the capital 'P') matches everything that's not an ASCII control character -- [^\x00-\x1F\x7F]. That is, it matches any ASCII character that isn't a control character, or any non-ASCII character--including C1 control characters.
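The difference described above can be checked with a short Java sketch. The helper name is() and the three sample characters are assumptions chosen for illustration.

```java
class PrintableCheck {
    // Tests a single character against a one-character regex.
    static boolean is(String regex, char c) {
        return String.valueOf(c).matches(regex);
    }

    public static void main(String[] args) {
        char eAcute = 0x00E9; // a Latin-1 letter, outside ASCII
        char nel = 0x0085;    // NEXT LINE, a C1 control character
        char bell = 0x0007;   // an ASCII control character

        System.out.println(is("\\p{Print}", eAcute)); // false: \p{Print} is ASCII-only
        System.out.println(is("\\P{Cc}", eAcute));    // true: not a control character
        System.out.println(is("\\P{Cntrl}", nel));    // true: C1 controls slip through
        System.out.println(is("\\P{Cc}", nel));       // false: Cc covers C1 controls
        System.out.println(is("\\P{Cc}", bell));      // false
    }
}
```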
TLDR Answer
Use this Regex...
\P{Cc}\P{Cn}\P{Cs}
Working Demo
In this demo, I use this regex to search the string "Hello, World!_". I'm going to add a weird character at the end, (char)4, which is the END OF TRANSMISSION character.
using System;
using System.Text.RegularExpressions;

public class Test {
    public static void Main() {
        // Verbatim string (@"...") so the backslashes reach the regex engine unchanged.
        var regex = new Regex(@"![\P{Cc}\P{Cn}\P{Cs}]");
        // Append (char)4 so the input ends with a control character.
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            Console.WriteLine("Result: " + match);
        }
    }
}
Full Working Demo at IDEOne.com
TLDR Explanation
\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match surrogate code points (which are not valid in UTF-8).
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
\P{Cc}\P{Cn} : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
\P{Cc}\P{Cn}\P{Cs} : Match only non-control characters that have been assigned and are not surrogates. Do not match any control, unassigned, or surrogate (UTF-8-invalid) characters.
\P{Cc}\P{Cn}\P{Cs}\P{Cf} : Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate (UTF-8-invalid) characters.
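One caveat worth checking in your own flavor: when several negated properties are placed inside a single bracket expression, the class is the union of its members, so a control character still matches because it is not unassigned. A negated class listing the positive properties excludes all of them at once. The sketch below shows this in Java (the language used elsewhere on this page); the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

class ControlClassUnion {
    // Union: "not control" OR "not unassigned" OR "not surrogate".
    static final Pattern UNION = Pattern.compile("[\\P{Cc}\\P{Cn}\\P{Cs}]");
    // Negated class: none of control, unassigned, surrogate.
    static final Pattern EXCLUDE_ALL = Pattern.compile("[^\\p{Cc}\\p{Cn}\\p{Cs}]");

    public static void main(String[] args) {
        String eot = String.valueOf((char) 4); // END OF TRANSMISSION
        System.out.println(UNION.matcher(eot).matches());       // true
        System.out.println(EXCLUDE_ALL.matcher(eot).matches()); // false
        System.out.println(EXCLUDE_ALL.matcher("A").matches()); // true
    }
}
```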
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
There is a POSIX character class designation [:print:] that should match printable characters, and [:cntrl:] for control characters. Note that these match codes throughout the ASCII table, so they might not be suitable for matching other encodings.
Failing that, the expression [\x00-\x1f] will match through the ASCII control characters, although again, these could be printable in other encodings.
In Java, the \p{Print} option specifies the printable character class.
It depends wildly on what regex package you are using. This is one of these situations about which some wag said that the great thing about standards is there are so many to choose from.
If you happen to be using C, the isprint(3) function/macro is your friend.
Adding on to @Alan-Moore, \P{Cc} is actually an example of a negative Unicode category or Unicode block (ref: Character Classes in Regular Expressions). \P{name} matches any character that does not belong to the named Unicode general category or named block. See the referenced link for more examples of named blocks supported in .NET.

Java Regular Expression with International Letters

Here's my current code:
return str.matches("^[A-Za-z\\-'. ]+");
I want it to include international letters. How do I do that in Java?
Thanks.
It seems that what you want is to match all alphabetic characters. Typically you would do that with the POSIX \p{Alpha} class, extended by the punctuation you also want to permit. As the Java regular expressions documentation says, it matches ASCII only by default.
However, what the documentation does not say clearly is that you can make this class work with Unicode characters. To do that, you need to turn Unicode character class matching on.
You can do this in one of two ways:
By creating Pattern object passing the UNICODE_CHARACTER_CLASS constant:
Pattern p = Pattern.compile("^[\\p{Alpha}\\-'. ]+", Pattern.UNICODE_CHARACTER_CLASS);
By using (?U) embedded pattern flag:
str.matches("^(?U)[\\p{Alpha}\\-'. ]+");
Proof of concept:
String[] test = {"Jean-Marie Le'Blanc", "Żółć", "Ὀδυσσεύς", "原田雅彦"};
for (String str : test) {
System.out.print(str.matches("^(?U)[\\p{Alpha}\\-'. ]+") + " ");
}
The obvious result is:
true true true true
If you think that all is correct, I have two additional points to make:
原田雅彦 (Masahiko Harada) is composed of ideographic characters. In fact, they are not alphabetic characters.
You want to match the dot (.) symbol. That's OK, but please consider matching ideographic full stops as well.
I assume you want to match alphanumeric characters other than the ASCII letters A-Z. You can do this with the \p{IsAlphabetic} Unicode character class:
return str.matches("^[\\p{IsAlphabetic}\\-'. ]+");
You'll find more Unicode character classes in the full documentation.
Replace the pattern with:
"^[\\p{L}\\-'. ]+"
\p{L} includes all unicode letters.
Use the regex \p{L} to match any letter (national or international). Note the lowercase p: \P{L} with a capital P matches everything that is not a letter.
By adding [\p{L}&&[^\p{IsLatin}]], you can match all letters that are not latin.
Especially for Greek, regex has \p{InGreek} to match Greek letters and \P{InGreek}(the difference is capital P) to match non Greek letters.
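A quick sketch of the two constructs just mentioned, with sample strings of my own choosing. It also shows one thing to watch for: \p{InGreek} is the Greek block (U+0370 to U+03FF), so polytonic letters from the Greek Extended block need the script property \p{IsGreek} instead.

```java
import java.util.regex.Pattern;

class LetterClasses {
    // Intersection: letters that do not belong to the Latin script.
    static final Pattern NON_LATIN_LETTERS = Pattern.compile("[\\p{L}&&[^\\p{IsLatin}]]+");
    // Block-based match: characters in the Greek block only.
    static final Pattern GREEK_BLOCK = Pattern.compile("\\p{InGreek}+");
    // Script-based match: any character whose script is Greek.
    static final Pattern GREEK_SCRIPT = Pattern.compile("\\p{IsGreek}+");

    public static void main(String[] args) {
        String abg = "\u03B1\u03B2\u03B3";  // alpha beta gamma
        String polytonic = "\u1F48";        // omicron with psili, Greek Extended block
        System.out.println(NON_LATIN_LETTERS.matcher(abg).matches());   // true
        System.out.println(NON_LATIN_LETTERS.matcher("abc").matches()); // false
        System.out.println(GREEK_BLOCK.matcher(abg).matches());         // true
        System.out.println(GREEK_BLOCK.matcher(polytonic).matches());   // false
        System.out.println(GREEK_SCRIPT.matcher(polytonic).matches());  // true
    }
}
```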
The question cannot be answered completely unless you say what you mean by "international letters", but the general solution is to use named character classes, via the \p{name} syntax. There are many named character classes. Some are defined by the regex language, and others by the Unicode standard. Refer to the Pattern javadocs for a partial list, and to the relevant Unicode standard.

regex that allows chinese characters

I have a regex that blocks invalid characters in a string, but it's also blocking Chinese characters, and I don't want that. Please help me with it. Below is the regex string that I am using.
String re = "[^\\x09\\x0A\\x0D\\x20-\\xD7FF\\xE000-\\xFFFD\\x10000-x10FFFF]";
Thanks in anticipation!
Since Java 7 you can make use of Unicode properties/scripts.
E.g. you can use the property \p{L} to match a letter in any language. Or the script \p{IsHiragana} to match a character contained in Hiragana. You need to check what script is fitting your needs.
See here on docs.Oracle.com for more details about regex and Unicode
The opposite is also possible: \P{L} matches every character that is NOT a letter. Alternatively, add \p{L} to your negated character class in place of the ranges that are meant to cover letters.
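A likely reason the pattern in the question rejects Chinese text is that \xhh takes exactly two hex digits, so \xD7FF is read as \xD7 followed by the literal characters FF, and the intended ranges never form. Java 7 and later accept \x{h...h} for longer code points. The sketch below assumes the ranges were meant to be the ones written in the question; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class InvalidCharFilter {
    // Same ranges as in the question, written with \x{...} for values above FF.
    static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // Removes every character that falls outside the allowed ranges.
    static String strip(String s) {
        return INVALID.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "ok" + (char) 0 + "\u4E2D\u6587"; // NUL plus two Chinese characters
        // The NUL is dropped; the Chinese characters are kept.
        System.out.println(strip(input));
    }
}
```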

Java regex for support Unicode?

To match A to Z, we will use regex:
[A-Za-z]
How to allow regex to match utf8 characters entered by user? For example Chinese words like 环保部
What you are looking for are Unicode properties.
e.g. \p{L} is any kind of letter from any language
So a regex to match such a Chinese word could be something like
\p{L}+
There are many such properties, for more details see regular-expressions.info
Another option is to use the modifier
Pattern.UNICODE_CHARACTER_CLASS
Java 7 added the flag Pattern.UNICODE_CHARACTER_CLASS, which enables the Unicode version of the predefined character classes. See my answer here for some more details and links.
You could do something like this
Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
and \w would match all letters and all digits from any languages (and of course some word combining characters like _).
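As a rough check of that claim, the sketch below runs \w+ with and without the flag against a short Chinese string. The names are illustrative only.

```java
import java.util.regex.Pattern;

class UnicodeWord {
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // two Han ideographs
        System.out.println(ASCII_WORD.matcher(chinese).matches());   // false
        System.out.println(UNICODE_WORD.matcher(chinese).matches()); // true
        System.out.println(UNICODE_WORD.matcher("snake_case").matches()); // true
    }
}
```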
To address NLS support and avoid accepting English special characters, we can use the pattern below:
[a-zA-Z0-9 \u0080-\u9fff]*+
For UTF code point reference: http://www.utf8-chartable.de/unicode-utf8-table.pl
Code snippet:
String vowels = "అఆఇఈఉఊఋఌఎఏఐఒఓఔౠౡ";
String consonants = "కఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహ";
String signsAndPunctuations = "కఁకంకఃకాకికీకుకూకృకౄకెకేకైకొకోకౌక్కౕకౖ";
String symbolsAndNumerals = "౦౧౨౩౪౫౬౭౮౯";
String engChinesStr = "ABC導字會";
Pattern ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU = Pattern
.compile("[a-zA-Z0-9 \\u0c00-\\u0c7f]*+");
System.out.println(ALPHANUMERIC_AND_SPACE_PATTERN_TELUGU.matcher(vowels)
.matches());
Pattern ALPHANUMERIC_AND_SPACE_PATTERN_CHINESE = Pattern
.compile("[a-zA-Z0-9 \\u4e00-\\u9fff]*+");
Pattern ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN = Pattern
.compile("[a-zA-Z0-9 \\u0080-\\u9fff]*+");
System.out.println(ENGLISH_ALPHANUMERIC_SPACE_AND_NLS_PATTERN.matcher(engChinesStr)
.matches());
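As an alternative to hard-coded code point ranges, the same idea can be sketched with Unicode script properties (available since Java 7). This is not the author's code: the class name and sample strings are assumptions, and the script-based classes are not guaranteed to cover exactly the same characters as the ranges above.

```java
import java.util.regex.Pattern;

class ScriptPatterns {
    // ASCII alphanumerics and space, plus any character in the Telugu script.
    static final Pattern TELUGU = Pattern.compile("[a-zA-Z0-9 \\p{IsTelugu}]*+");
    // ASCII alphanumerics and space, plus any Han ideograph.
    static final Pattern HAN = Pattern.compile("[a-zA-Z0-9 \\p{IsHan}]*+");

    public static void main(String[] args) {
        String telugu = "\u0C05\u0C06";       // two Telugu vowels
        String mixed = "ABC \u4E2D\u6587";    // ASCII letters plus two Han ideographs
        System.out.println(TELUGU.matcher(telugu).matches()); // true
        System.out.println(HAN.matcher(mixed).matches());     // true
        System.out.println(HAN.matcher(telugu).matches());    // false
    }
}
```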
To match individual characters, you can simply include them in a character class, either as literals or via the \u03FB syntax.
Obviously you often cannot list all allowed characters in ideographic languages. To make the regex treat unicode characters according to their type or code block, various other escapes are supported that are defined here. Look at the section "Unicode support", particularly the references to the Character class and to the Unicode Standard itself.
the Java regular expression API works on the char type
the char type is implicitly UTF-16
if you have UTF-8 data you will need to transcode it to UTF-16 on input if this is not already being done
Unicode is the universal set of characters and UTF-8 can describe all of it (including control characters, punctuation, symbols, letters, etc.). You will have to be more specific about what you want to include and what you want to exclude. Java regular expressions use the \p{category} syntax to match code points by category. See the Unicode standard for the list of categories.
If you want to identify and separate words in a sequence of ideographs, you will need to look at a more sophisticated API. I would start with the BreakIterator type.
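A minimal sketch of the BreakIterator approach follows. The helper name and the sample text are my own, and where the boundaries fall for ideographic text depends on the JDK's break rules, so no particular output is claimed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordSegments {
    // Splits text at the word boundaries reported by BreakIterator.
    // Whitespace and punctuation come back as segments of their own.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("Hello \u4E2D\u6587 world", Locale.CHINESE)) {
            System.out.println("[" + part + "]");
        }
    }
}
```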
Starting from Java 9, you can also use \X to match any Unicode extended grapheme cluster. See more at Java Doc: Pattern.
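For illustration, here is a short sketch that counts grapheme clusters with \X. It assumes Java 9 or later, and the names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GraphemeCount {
    // \X matches one extended grapheme cluster (requires Java 9+).
    static final Pattern CLUSTER = Pattern.compile("\\X");

    static int count(String s) {
        Matcher m = CLUSTER.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "e\u0301a"; // 'e' + combining acute accent, then 'a'
        System.out.println(s.length()); // 3 chars
        System.out.println(count(s));   // 2 grapheme clusters
    }
}
```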

Regex matching capital characters, numbers and period

I'm trying to see if a input only contains capital letters, numbers and a period in regex. What would the regex pattern be for this in Java?
Are there any guides on how I can build this regex, or even some online tools?
Also is it possible to check length of string is no more than 50 using regex?
This is the Unicode answer:
^[\p{Lu}\p{Nd}.]{0,50}$
From regular-expressions.info
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
^ and $ are the start and the end of the string.
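In Java source the backslashes need doubling. A small sketch of this pattern in use, with a hypothetical class name and sample inputs:

```java
import java.util.regex.Pattern;

class UpperDigitPeriod {
    // Uppercase letters and decimal digits from any script, plus the period, up to 50 characters.
    static final Pattern ALLOWED = Pattern.compile("^[\\p{Lu}\\p{Nd}.]{0,50}$");

    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ABC.123"));       // true
        System.out.println(isValid("\u0416\u0E53"));  // true: Cyrillic capital, Thai digit
        System.out.println(isValid("abc"));           // false: lowercase
    }
}
```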
Regex pattern:
Pattern.compile("^[A-Z\\d.]*$")
To check the length of a string:
Pattern.compile("^.{0,50}$")
Both combined:
Pattern.compile("^[A-Z\\d.]{0,50}$")
Although I wouldn't use regular expressions to check for length if I were you, just call .length() on the string.
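Something along these lines, as a sketch: the length test is done with length() and the regex only checks the character set. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class InputCheck {
    // Character set only; the length limit is enforced separately.
    static final Pattern ALLOWED = Pattern.compile("[A-Z\\d.]*");
    static final int MAX_LENGTH = 50;

    static boolean isValid(String input) {
        return input.length() <= MAX_LENGTH && ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("VERSION.2"));  // true
        System.out.println(isValid("version.2"));  // false: lowercase letters
    }
}
```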
This website is really handy for building and testing regular expressions.
Regular expressions in Java have a lot in common with other languages when it comes to the simple syntax, with some predefined character classes that add more than you'd find in Perl for example. The Java API docs on Pattern show the various patterns that are supported. A friendlier introduction to regexes in Java is http://www.regular-expressions.info/java.html.
Some very quick Googling shows there are many tools online for testing Java regular expressions against input strings. Here is one.
To check for the type of input you are interested in, the following regex should work:
^[A-Z0-9.]{0,50}$
Broken down, this is saying:
^: start matching from the start of the input; do not allow the first character(s) to be skipped
[]: match one of the characters in this class
A-Z: within a class, - means to accept all values between the first and last character inclusive, so in this case all characters from A to Z.
0-9: add all digits to the class
.: periods are special in regexes, but all special characters become simple again within a character class ([])
{0,50}: match the character class just defined between 0 and 50 times. Java requires the lower bound to be written out; {,50} is rejected.
$: the match must reach the end of the input; do not allow the last character(s) to be skipped
This returns true for strings containing at most 50 characters, each of which is a digit, a capital letter or a dot.
string.matches("[0-9A-Z\\.]{0,50}")
In response to what tools you can use, I prefer Regex Coach
