Java Regex include all letters of the alphabet except certain letters - java

What I need to do is to determine whether a word consists of letters except certain letters. For example I need to test whether a word consists of the letters from the English alphabet except letters: I, V and X.
Currently I have this long regex for the simple task above:
Pattern pattern = Pattern.compile("[ABCDEFGHJKLMNOPQRSTUWYZ]+");
Any of you know any shorthand way of excluding certain letters from a Java regex? Thanks.

You can use the && operator to create a compound character class using subtraction:
String regex = "[A-Z&&[^IVX]]+";

You could simply specify character ranges inside your character class:
[A-HJ-UWYZ]+

Just use a negative lookahead in your pattern.
Pattern pattern = Pattern.compile("^(?:(?![IVX])[A-Z])+$");
DEMO

'&&' didn't work for me,
used: (?:(?![IVX])[A-Z])

Use [A-Z&&[^IVX]]+ to exclude certain characters from the A-Z range - see Pattern

Related

Exclude a letter in Regex Pattern

I am trying to create a Regex pattern for <String>-<String>. This is my current pattern:
(\w+\-\w+).
The first String is not allowed to be "W". However, it can still contain "W"s if it's more than one letter long.
For example:
W-80 -> invalid
W42-80 -> valid
How can this be achieved?
So your first string can be either: one character but not W or 2+ characters. Simple pattern to achieve that is:
([^W]|\w{2,})-\w+
But this pattern is not entirely correct, because now it allows any character for first part, but originally only \w characters were expected to be allowed. So correct pattern is:
([\w&&[^W]]|\w{2,})-\w+
Pattern [\w&&[^W]] means any character from \w character class except W character.
Just restrict the last char to "any word char except 'W'".
There are a couple of ways to do this:
Negative look-behind (easy to read):
^\w+(?<!W)-\w+$
See live demo.
Negated intersection (trainwreck to read):
^\w*[\w&&[^W]]-\w+$
See live demo.
——
The question has shifted. Here’s a new take:
^.+(?<!^W)-\w+
This allows anything as the first term except just "W".

Java - \pL [\x00-\x7F]+ regex fails to get non English characters using String.match

I need to validate name,saved in a String, which can be in any language with spaces using \p{L}:
You can match a single character belonging to the "letter" category with \p{L}
I tried to use String.matches, but it failed to match non English characters, even for 1 character, for example
String name = "อั";
boolean isMatch = name.matches("[\\p{L}]+")); // return false
I tried with/without brackets, adding + for multiple letters, but it's always failing to match non English characters
Is there an issue using String.matches with \p{L}?
I failed also using [\\x00-\\x7F]+ suggested in Pattern
\p{ASCII} All ASCII:[\x00-\x7F]
You should bear in mind that Java regex parses strings as collections of Unicode code units, not code points. \p{L} matches any Unicode letter from the BMP plane, it does not match letters glued with diacritics after them.
Since your input can contain letters and diacritics you should at least use both \p{L} and \p{M} Unicode property classes in your character class:
String regex = "[\\p{L}\\p{M}]+";
If the input string can contain words separated with whitespaces, you may add \s shorthand class and to match any kind of whitespace you may compile this regex with Pattern.UNICODE_CHARACTER_CLASS flag:
String regex = "(?U)[\\p{L}\\p{M}\\s]+";
Note that this regex allows entering diacritics, letters and whitespaces in any order. If you need a more precise regex (e.g. diacritics only allowed after a base letter) you may consider something like
String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";
Here, (?>\\p{L}\\p{M}*+)+ matches one or more letters each followed with zero or more diacritics, \s* matches zero or more whitespaces and \s+ matches 1 or more whitespaces.
\p{IsAlphabetic} vs. [\p{L}\p{M}]
If you check the source code, \p{Alphabetic} checks if Character.isAlphabetic(ch) is true. It is true if the char belongs to any of the following classes: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER, LETTER_NUMBER or it has contributory property Other_Alphabetic. It is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic.
While all those L subclasses form the general L class, note that Other_Alphabetic also includes Letter number Nl class, and it includes more chars than \p{M} class, see this reference (although it is in German, the categories and char names are in English).
So, \p{IsAlphabetic} is broader than [\p{L}\p{M}] and you should make the right decision based on the languages you want to support.
The only solution I found is using \p{IsAlphabetic}
\p{Alpha} An alphabetic character:\p{IsAlphabetic}
boolean isMatch = name.matches("[ \\p{IsAlphabetic}]+"))
Which doesn't work in sites as https://regex101.com/ in demo
There are two characters there. The first is a letter, the second is a non-letter mark.
String name = "\u0e2d";
boolean isMatch = name.matches("[\\p{L}]+"); // true
works, but
String name = "\u0e2d\u0e31";
boolean isMatch = name.matches("[\\p{L}]+"); // false
does not because ั U+E31 is a Non-Spacing Mark [NSM], not a letter.
Googled that character to find the language. Seems to be Thai. Thai Unicode character range is: 0E00 to 0E7F:
When you are working with unicode characters you can use \u. So, the regex should be look like this:
[\u0E00-\u0E7F]
Which is match in this REGEX test with your character.
If you want to match any languages use this:
[\p{L}]
Which is match in this REGEX test with your example characters.
Try including more categories:
[\p{L}\p{Mn}\p{Mc}\p{Nl}\p{Pc}\p{Pd}\p{Po}\p{Sk}]+
Note that it might be best to simply not validate names. People can't really complain if they entered it wrong but your system didn't catch it. However, it's much more of a problem if someone is unable to enter their name. If you do insist on adding validation, please make it overridable: that should have the advantages of each method without their disadvantages.

How to avoid multiple sequential occurences of a character in regex?

I would like to match any alphanumerics that can be separated by one or multiple dots :
bertie
bert.123
bert.01.03.27
but not :
.bert (dot is not separating)
bert123. (dot is not separating)
bert...123 (multiple sequential occurences of dot)
Now i have this ^[^\\.][\w\.]+?[^\\.]$, but still cannot handle the multiple sequential occurences of the dot character.
What you want is ^\w++(\.\w++)*$
At least one alphanumeric character, followed by an arbitrary number of groups of only one dot followed by at least one alphanumeric character.
This should work:
^(\w+\.)*\w+$
You can replace \w with something more restrictive if you like (e.g., [a-z])
I think this should work
boolean ok = !str.matches(".*[^0-9a-zA-Z.].*|\\..*|.*\\.|.*\\.{2,}.*")
Try this (not sure if it is what you want ...):
Pattern p = Pattern.compile("^[^\\.][\\w\\.]+[^\\.]$");
Pattern dotSeparated = Pattern.compile("^\\w+(\.\w+)*$");
If needed, replace \w by [A-Za-z0-9] or \\p{Alnum}, since \w allows underscores as well.
This matches one or more alphanumeric characters, followed by zero or more of [a dot, followed by one or more alphanumeric characters].

Character class for Unicode digits

I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".
The first part is working well for me, it's doing a good job of identifying non-Latin characters as alphabetic characters. The problem is the second half. Obviously it will only work for Arabic Numerals. The character classes \\d and \p{Digit} are also just [0-9]. The javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?
For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.
Quoting the Java docs about isDigit:
A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.
So, I believe the pattern to match digits should be \p{Nd}.
Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:
(?U)\d+
or in code:
System.out.println("3๓३".matches("(?U)\\d+")); // true
Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:
Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);

Java Regular Expression with International Letters

Here's my current code:
return str.matches("^[A-Za-z\\-'. ]+");
I want it to include international letters. How do I do that in Java?
Thanks.
It seems that you want is, to match all the alphabetic characters. Typically you would do that by using Posix \p{Alpha} expression, extended by the punctuation you want also to permit. As Java Regular Expressions documentation says, it matches ASCII only.
However, what documentation does not say clearly is, you can make this class work with Unicode characters. To do just that you need to turn Unicode character class matching on.
You can do this in one of two ways:
By creating Pattern object passing the UNICODE_CHARACTER_CLASS constant:
Pattern p = Pattern.compile("^[p{Alpha}\\-'. ]+", UNICODE_CHARACTER_CLASS);
By using (?U) embedded pattern flag:
str.matches("^(?U)[\\p{Alpha}\\-'. ]+");
Prove of concept:
String[] test = {"Jean-Marie Le'Blanc", "Żółć", "Ὀδυσσεύς", "原田雅彦"};
for (String str : test) {
System.out.print(str.matches("^(?U)[\\p{Alpha}\\-'. ]+") + " ");
}
The obvious result is:
true true true true
If you think that all is correct, I have two additional points to make:
原田雅彦 (Masahiko Harada) is composed of Ideographic characters. In fact they are not the alphabetic characters,
You want to match the dot (.) symbol. It's OK, but please consider matching Ideographic fullstops as well.
I assume you want to match alphanumeric characters other than the ASCII letters A-Z. You can do this with the \p{IsAlphabetic} Unicode character class:
return str.matches("^[\\p{IsAlphabetic}\\-'. ]+");
You'll find more Unicode character classes the full documentation.
Replace the pattern with:
"^[\\p{L}\\-'. ]+"
\p{L} includes all unicode letters.
Use the regex \P{L} to match any letters (national or international)
By adding [\p{L}&&[^\p{IsLatin}]], you can match all letters that are not latin.
Especially for Greek, regex has \p{InGreek} to match Greek letters and \P{InGreek}(the difference is capital P) to match non Greek letters.
The question cannot be answered completely unless you say what you mean by "international letters", but the general solution is to use named character classes, via the \p{name} syntax. There are many named character classes. Some are defined by the regex language, and others by the Unicode standard. Refer to the Pattern javadocs for a partial list, and to the relevant Unicode standard.

Categories