What does the regex string "\\p{Cntrl}" match in Java?


I think it's matching all control characters (not sure what "all" might be) but I can't be certain, nor can I find it in any documentation other than some musings in a Perl forum. Does anyone know?

From the documentation of Pattern:
\p{Cntrl} A control character: [\x00-\x1F\x7F]
That is, it matches any character with hexadecimal value 00 through 1F or 7F.
The Wikipedia article on control characters lists each character and what it's used for if you're interested.
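To check the documented range for yourself, here is a minimal sketch. The class and method names (CntrlDemo, isCntrl, countInLatin1) are made up for illustration, and it assumes the default mode, without the UNICODE_CHARACTER_CLASS flag:

```java
import java.util.regex.Pattern;

final class CntrlDemo {
    // Compiled without any flags, so the POSIX class is US-ASCII only.
    private static final Pattern CNTRL = Pattern.compile("\\p{Cntrl}");

    /** Returns true if the single char c is matched by the Cntrl class. */
    static boolean isCntrl(char c) {
        return CNTRL.matcher(String.valueOf(c)).matches();
    }

    /** Counts how many chars in 0x00..0xFF are matched by the Cntrl class. */
    static int countInLatin1() {
        int count = 0;
        for (char c = 0; c <= 0xFF; c++) {
            if (isCntrl(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isCntrl('\t'));        // tab is 0x09
        System.out.println(isCntrl((char) 0x7F)); // DEL
        System.out.println(isCntrl('A'));         // an ordinary letter
        System.out.println(countInLatin1());      // 0x00-0x1F plus 0x7F
    }
}
```

The count covers the 32 characters 0x00 through 0x1F plus 0x7F, so 33 in total.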

\p{name} matches a Unicode character class; consult the appropriate Unicode spec to see what code points are in the class. Here is a discussion specific to the Java regex engine, with Cntrl being one of the examples: "Any ASCII control character in the range 0-127. This effectively means characters 0-31 and 127." The same thing applies to many other regex engines.

Related

Different Java Regex matching behavior when using UNICODE_CHARACTER_CLASS flag

I was testing the behavior of the Pattern.UNICODE_CHARACTER_CLASS flag for different punctuation characters and noticed that the matches for grave accent character (U+0060) ` occur differently depending on whether Pattern.UNICODE_CHARACTER_CLASS is used.
For example, see the below code:
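import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraceAccentTest {
    public static void main(String[] args) {
        // Default mode: \p{Punct} is the US-ASCII POSIX class
        Pattern p = Pattern.compile("\\p{Punct}");
        Matcher m = p.matcher("`");
        System.out.println(m.matches()); // prints true
        // Same pattern, compiled with the Unicode version of the class
        Pattern p1 = Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);
        Matcher m1 = p1.matcher("`");
        System.out.println(m1.matches()); // prints false
    }
}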
When I don't use Pattern.UNICODE_CHARACTER_CLASS flag grave accent character matches with \p{Punct} character class but when I use the flag it doesn't match. Can someone explain the reasoning for this ?

When you use Pattern p = Pattern.compile("\\p{Punct}");, then \p{Punct} refers to the following 32 characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Reference: the Pattern class.
These 32 characters correspond to the ASCII character set characters 0x21 through 0x7e, excluding letters and digits. They also happen to represent all the non-letter and non-digit symbols on my standard U.S. keyboard (your keyboard may be different, of course).
The grave accent (also known as a backtick) is in that list and on my keyboard.
That is a simple example of a "predefined character class" - and explains why your m.matches() returns true.
When you add the Pattern.UNICODE_CHARACTER_CLASS flag things get more complicated.
As the documentation for this flag explains, it:
Enables the Unicode version of Predefined character classes and POSIX character classes.
and:
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expressions Annex C: Compatibility Properties.
Looking at the Annex C referred to above, we find a table showing the "recommended assignments for compatibility property names".
For our property name (punct), the standard recommendation is to use characters defined by this:
\p{gc=Punctuation}
Here, "gc" stands for "general category". Unicode characters are assigned a "general category" value. In this case, that is Punctuation - also abbreviated to P and further broken down into various sub-categories such as Pc for connectors, Pd for dashes, and so on. There is also a catch-all Po for "other punctuation characters".
The grave character is assigned to the Symbol general category in Unicode - and to the Modifier subcategory. You can see that assignment to Sk here.
Contrast that with a character such as the ASCII exclamation mark (also part of our original \p{Punct} list, shown above). For that we can see that the general category assignment is Po.
That explains why the grave is no longer matched when we add the Pattern.UNICODE_CHARACTER_CLASS flag to our original pattern.
It is assigned to a different general category from the punctuation category we are using in our regex.
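You can confirm the category assignments from Java itself. This is only a sketch; the class and method names are invented for the example, but Character.getType and the MODIFIER_SYMBOL / OTHER_PUNCTUATION constants are the standard java.lang.Character API:

```java
import java.util.regex.Pattern;

final class GraveCategoryDemo {
    private static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern UNICODE_PUNCT =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    /** True if s matches the US-ASCII version of the Punct class. */
    static boolean asciiPunct(String s) {
        return ASCII_PUNCT.matcher(s).matches();
    }

    /** True if s matches the Unicode (gc=Punctuation) version of the class. */
    static boolean unicodePunct(String s) {
        return UNICODE_PUNCT.matcher(s).matches();
    }

    /** True if the code point has general category Sk. */
    static boolean isModifierSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.MODIFIER_SYMBOL;
    }

    /** True if the code point has general category Po. */
    static boolean isOtherPunctuation(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_PUNCTUATION;
    }

    public static void main(String[] args) {
        System.out.println(isModifierSymbol('`'));   // grave accent: Sk
        System.out.println(isOtherPunctuation('!')); // exclamation mark: Po
        System.out.println(asciiPunct("`") + " " + unicodePunct("`"));
        System.out.println(asciiPunct("!") + " " + unicodePunct("!"));
    }
}
```

The exclamation mark matches in both modes because it is in the ASCII list and in category Po; the grave accent only matches in the ASCII mode.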
The obvious next question is why did the grave character not get included in the Unicode Po general category? Why is it in Sk instead?
I do not have a good answer for that - I'm sure there are "historical reasons". It's worth noting, however, that the Sk category includes characters such as the acute accent, the cedilla, the diaeresis, and so on - and (as already noted) our grave accent.
All these are diacritics - typically used in combination with a base letter to alter the pronunciation. So maybe that is the underlying reason.
The grave is a bit of an oddity, perhaps, given it has a historical usage outside of being used as a diacritic.
It may be more relevant to ask how the grave ended up as part of the original ASCII character set, in the first place. Some background about this is provided in the Wikipedia page for the backtick.

Reading the documentation for UNICODE_CHARACTER_CLASS
When this flag is specified then the (US-ASCII only) Predefined
character classes and POSIX character classes are in conformance with
Unicode Technical Standard #18: Unicode Regular Expression Annex C:
Compatibility Properties.
So this is saying that, without the flag, these classes are US-ASCII only. If you check the tables of Unicode punctuation characters, you will see that a lot of them are missing from the US-ASCII class.
Tables :
https://www.fileformat.info/info/unicode/category/Po/list.htm
https://www.gaijin.at/en/infos/unicode-character-table-punctuation

java 8 regular expression for meta characters [duplicate]

I am trying to write a regular expression to check whether a sentence has metacharacters, for example "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence I need to identify whether metacharacters are present.
I tried \\\\$ and ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters? I don't need to replace them, just identify whether the sentence contains these characters.

If you want to know whether a string contains meta characters, you can use something like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly. Second, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c. Third, matches checks whether the entire string matches the pattern, rather than whether it contains an occurrence, so you'd need something like .*pattern.* to make matches behave like find, if you insist on it.
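To make the matches-versus-find point concrete, here is a rough sketch. The class name and the exact character class are my own choice for illustration (it lists the same characters as the chars() one-liner), not something prescribed by the API:

```java
import java.util.regex.Pattern;

final class MatchesVsFindDemo {
    // Same characters as in the chars() example, written as a character class.
    // The backslash and the square brackets are escaped inside the class.
    private static final String META_CLASS = "[\\\\.\\[\\]{}()*+?^$|]";
    private static final Pattern META = Pattern.compile(META_CLASS);

    /** matches(): the ENTIRE input must be a single metacharacter. */
    static boolean viaMatches(String s) {
        return META.matcher(s).matches();
    }

    /** find(): true if a metacharacter occurs anywhere in the input. */
    static boolean viaFind(String s) {
        return META.matcher(s).find();
    }

    /** matches() wrapped in .* on both sides, so it behaves like find() on one-line input. */
    static boolean viaWrappedMatches(String s) {
        return Pattern.matches(".*" + META_CLASS + ".*", s);
    }

    public static void main(String[] args) {
        String sentence =
                "I need to make payment of $50 for the purchase, should i use CASH|CC";
        System.out.println(viaMatches(sentence));        // whole sentence is not one metachar
        System.out.println(viaFind(sentence));           // $ and | are present
        System.out.println(viaWrappedMatches(sentence)); // same answer as find()
    }
}
```

Note that the .* wrapper stops at line terminators unless DOTALL is set, which is one more reason to prefer find().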
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
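A small sketch of both options, in case it helps. The class name, method names and the "use\\s+" prefix are invented for the example; Pattern.LITERAL and Pattern.quote are the real API:

```java
import java.util.regex.Pattern;

final class LiteralSearchDemo {
    /** Interprets needle as a regex, so CASH|CC means "CASH or CC". */
    static boolean findAsRegex(String text, String needle) {
        return Pattern.compile(needle).matcher(text).find();
    }

    /** Treats needle as plain text: no character in it has a special meaning. */
    static boolean findAsLiteral(String text, String needle) {
        return Pattern.compile(needle, Pattern.LITERAL).matcher(text).find();
    }

    /** Assembles a larger pattern; only the quoted part is taken literally. */
    static boolean findAfterUse(String text, String needle) {
        return Pattern.compile("use\\s+" + Pattern.quote(needle)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(findAsRegex("pay by CC", "CASH|CC"));   // alternation matches "CC"
        System.out.println(findAsLiteral("pay by CC", "CASH|CC")); // literal text is absent
        System.out.println(findAfterUse("should i use CASH|CC", "CASH|CC"));
    }
}
```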
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort; optimizing before you even have actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.

Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions sometimes depend on a syntactical subdefinition, like character classes or specific sequences (lookahead, lookbehind, ...), and are therefore not intrinsically metacharacters per se. Some are only meta characters in a specific context. However, the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. This means that characters will sometimes be matched in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for its own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?<name>X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. The last two rows of the citation I had to leave out, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
    // this is the found meta character...
    String metaCharacter = metaDetector.group(0);
    System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
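For completeness, the while variant could look roughly like this. It reuses the META_CHARS pattern from above unchanged; the class name and the findAll helper are hypothetical names for this sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharFinder {
    // Same pattern as suggested above.
    static final Pattern META_CHARS =
            Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");

    /** Collects every character of the input that the pattern matches, in order. */
    static String findAll(String stringToTest) {
        StringBuilder found = new StringBuilder();
        Matcher metaDetector = META_CHARS.matcher(stringToTest);
        while (metaDetector.find()) {
            found.append(metaDetector.group(0));
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String line = "I need to make \\payment{[ of $50 for !!the purc\"hase, "
                + "sh###ould i use CASH|CC.";
        System.out.println(findAll(line)); // prints \{[$!!,|.
    }
}
```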
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.

Compare non-English characters

I have a problem when I try to compare 'Đ' and 'D'. I need the comparison to return true, but with Locale English it returns false, because 'Đ' cannot be replaced with the regex:
"\\p{Block=CombiningDiacriticalMarks}+".

The character class [\u0110D] will match either a Đ or a D (Đ is code point U+0110).
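Here is a sketch of why the block-based regex does not help for Đ while the character class does. I am assuming the question first decomposes the text with java.text.Normalizer (NFD), which the question does not actually show; the class and method names are made up:

```java
import java.text.Normalizer;

final class DStrokeDemo {
    /**
     * Assumed approach from the question: decompose to NFD, then strip
     * everything in the Combining Diacritical Marks block.
     */
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Block=CombiningDiacriticalMarks}+", "");
    }

    /** True if s is exactly one character that is either U+0110 or a plain D. */
    static boolean isDOrDStroke(String s) {
        return s.matches("[\\u0110D]");
    }

    public static void main(String[] args) {
        // u with diaeresis decomposes into u plus a combining mark, so it is stripped
        System.out.println(stripMarks("\u00FC").equals("u"));
        // U+0110 has no decomposition, so the stroke survives normalization
        System.out.println(stripMarks("\u0110").equals("D"));
        // the explicit character class accepts both forms
        System.out.println(isDOrDStroke("\u0110") && isDOrDStroke("D"));
    }
}
```

The stroke in Đ is part of the letter itself, not a combining mark, so there is nothing for the regex to remove.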
Matching Non-English Characters, a Primer
One common example is the word 'über-geek'. How do I match that word whether or not there is an umlaut above the u? Simple: [\u00FCu]ber-geek will match either 'über-geek' or 'uber-geek' (ü is code point U+00FC, which is 252 in decimal).
Depending on your regular expression engine, there are multiple great ways to match locale-specific characters. Buy a book on your specific implementation to discover its wrinkles. By the way, you can find an excellent resource for Unicode-specific regex information at Regular-Expressions.info's Unicode page.
What if I want to match any character? If you have access to the \X character class, it will act as a . in a Unicode context. This means that multiple Unicode code points which combine to form one grapheme will register as one 'character' to the engine.
NOTE: I'm not trying to 'steal' an answer to this one, and I'll delete mine if Ted Hopp moves his out of the comments. I just wanted to make sure that people looking for non-English regex matches can see that this question did indeed get answered.

Character class for Unicode digits

I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".
The first part is working well for me, it's doing a good job of identifying non-Latin characters as alphabetic characters. The problem is the second half. Obviously it will only work for Arabic Numerals. The character classes \\d and \p{Digit} are also just [0-9]. The javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?
For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.

Quoting the Java docs about isDigit:
A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.
So, I believe the pattern to match digits should be \p{Nd}.
Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
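In case the ideone link goes away, a comparable check might look like the sketch below. The class name and the sample code points are my own picks for illustration:

```java
import java.util.regex.Pattern;

final class UnicodeDigitDemo {
    private static final Pattern ND = Pattern.compile("\\p{Nd}");

    /** True if the code point is matched by the Nd (decimal digit number) category. */
    static boolean matchesNd(int codePoint) {
        return ND.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // ASCII 3, Thai digit three, Devanagari digit three, letter a, superscript two
        int[] samples = {'3', 0x0E53, 0x0969, 'a', 0x00B2};
        for (int cp : samples) {
            System.out.printf("U+%04X regex=%b isDigit=%b%n",
                    cp, matchesNd(cp), Character.isDigit(cp));
        }
    }
}
```

Superscript two is included because it has category No rather than Nd, so both checks reject it.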

Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:
(?U)\d+
or in code:
System.out.println("3๓३".matches("(?U)\\d+")); // true
Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:
Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);
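A quick side-by-side of the three variants, as a sketch (the class and method names are invented; the test string is ASCII 3, Thai digit three and Devanagari digit three, written with escapes to avoid source-encoding surprises):

```java
import java.util.regex.Pattern;

final class UnicodeFlagDemo {
    static final String MIXED_DIGITS = "3\u0E53\u0969";

    /** Plain \d+ : only [0-9] counts as a digit. */
    static boolean plain(String s) {
        return s.matches("\\d+");
    }

    /** Embedded (?U) flag: \d uses the Unicode definition. */
    static boolean embeddedFlag(String s) {
        return s.matches("(?U)\\d+");
    }

    /** The same thing via the compile-time flag. */
    static boolean compileFlag(String s) {
        return Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(plain(MIXED_DIGITS));        // false
        System.out.println(embeddedFlag(MIXED_DIGITS)); // true
        System.out.println(compileFlag(MIXED_DIGITS));  // true
    }
}
```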

Java regular expression to match _all_ whitespace characters

I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some; it does not match &nbsp; and similar non-ASCII whitespace. I'm looking for a regular expression which matches all (common) white-space characters which can occur in a Java String.
[Edit]
To clarify: I do not mean the string sequence "&nbsp;"; I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar white-space meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), Word joiner encoded in Unicode 3.2 and above as U+2060, "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) and any other character that can be regarded as white-space.
[Answer]
For my purpose, i.e. catching all whitespace characters, Unicode + traditional, the following expression does the job:
[\p{Z}\s]
The answer is in the comments below but since it is a bit hidden I repeat it here.
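For anyone who wants to see what this expression does and does not cover, here is a small sketch (class and method names are hypothetical). One caveat worth knowing: U+2060 and U+FEFF have general category Cf (format), not Z, so this expression does not match them; they would have to be listed explicitly if you need them.

```java
import java.util.regex.Pattern;

final class UnicodeWhitespaceDemo {
    private static final Pattern ANY_SPACE = Pattern.compile("[\\p{Z}\\s]");
    private static final Pattern PLAIN_S = Pattern.compile("\\s");

    /** True if the code point matches the combined Z-category-or-\s class. */
    static boolean matchesAnySpace(int codePoint) {
        return ANY_SPACE.matcher(new String(Character.toChars(codePoint))).matches();
    }

    /** True if the code point matches plain \s (no flags). */
    static boolean matchesPlainS(int codePoint) {
        return PLAIN_S.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // tab, NO-BREAK SPACE, NARROW NO-BREAK SPACE, WORD JOINER, ZERO WIDTH NO-BREAK SPACE
        int[] samples = {'\t', 0x00A0, 0x202F, 0x2060, 0xFEFF};
        for (int cp : samples) {
            System.out.printf("U+%04X plain=%b combined=%b%n",
                    cp, matchesPlainS(cp), matchesAnySpace(cp));
        }
    }
}
```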

&nbsp; is not a whitespace character, as far as regexes are concerned. You need to either modify the regexp to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or previously parse the string contents to get the ASCII or Unicode representation of the data.
You are mixing abstraction levels here.
If, as seems to be the case after a careful reread of the question, you are after a way to match all whitespace characters, meaning standard ASCII plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the work.
You should really clarify your question because it has misled a lot of people (even making the correct answer to have some downvotes).

You clarified the question the way as I expected: you're actually not looking for the String literal as many here seem to think and for which the solution is too obvious.
Well, unfortunately, there's no way to match them using regex. Best is to include the particular codepoints in the pattern, for example: "[\\s\\xA0]".
Edit: as it turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, can you please leave a comment on how you found that out? This one is quite useful.

The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.

In case anyone runs into this question again looking for help, I suggest pursuing the following answer: https://stackoverflow.com/a/6255512/1678392
The short version: \\p{javaSpaceChar}
Why: Per the Pattern class, this maps the Character.isSpaceChar method:
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
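A short sketch of how that behaves in practice (the class and method names are made up; javaSpaceChar and javaWhitespace map to Character.isSpaceChar and Character.isWhitespace respectively):

```java
import java.util.regex.Pattern;

final class JavaSpaceCharDemo {
    private static final Pattern SPACE_CHAR = Pattern.compile("\\p{javaSpaceChar}");
    private static final Pattern WHITESPACE = Pattern.compile("\\p{javaWhitespace}");

    /** Regex counterpart of Character.isSpaceChar for a single char. */
    static boolean regexSpaceChar(char c) {
        return SPACE_CHAR.matcher(String.valueOf(c)).matches();
    }

    /** Regex counterpart of Character.isWhitespace for a single char. */
    static boolean regexWhitespace(char c) {
        return WHITESPACE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        char nbsp = (char) 0x00A0;
        // NO-BREAK SPACE: a Unicode space separator, but excluded by isWhitespace
        System.out.println(regexSpaceChar(nbsp) + " " + regexWhitespace(nbsp));
        // tab: a control character, so not a "space char", but Java whitespace
        System.out.println(regexSpaceChar('\t') + " " + regexWhitespace('\t'));
    }
}
```

So if tabs and line breaks matter too, javaSpaceChar alone is not enough; combining it with \s in one character class is one option.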

Click here for a summary I made of several competing definitions of "whitespace".
You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.

&nbsp; is not white space. It is a character encoding sequence that represents whitespace in HTML. You most likely want to convert HTML encoded text into plain text before running your string match against it. If that is the case, go look up javax.swing.text.html.

The regex characters are the only ones independent of encoding. Here is a list of some characters which - in Unicode - are non-printing:
How many non-printing characters are in common use?
