Regex with conditional replacement

Regex with conditional replacement - java

I need to write a regex to validate phone numbers with the following criteria:
Return the input as-is if it's fewer than 7 digits. Otherwise, remove the first character if it is a 1 or 0. If we haven't returned yet and the number is < 10 digits, return it. If it's >= 10 digits, return the last 7.
This is performance-critical code converted from coded conditional statements so ideally it can be done in a single regex. I managed to hack together something that got me close but I'm having some trouble meeting all criteria without further breaking things.
(Spaces are just to break things up since there's a lot here).
var pattern = Pattern.compile("(?<=\A[01]?) ([0-9]{1,9}) (?![0-9]) | (?:[01]?) (?<=\A[01]?) (?:[0-9]{3,}) ([0-9]{7}) (.*)", "$1$2");
return pattern.replaceAll(phoneNum);
This passes all the test strings I gave it except it doesn't remove the 0 or 1 like it should if they exist as the first character of strings of length 7+.
// Returns input as-is if fewer than 7 digits
555123 --> 555123 Success
// If 7+ digits remove the first character if it is a 1 or 0
1234567 --> 234567 Failure, returned 1234567
// If we haven't returned yet and the number is < 10 digits, return
5551212 --> 5551212 Success
// If it's >= 10 digits, return the last 7
5551234567 --> 1234567 Success

Java isn't my forte, but as people have mentioned regex might not be the right solution to your question. Just in case you are still interested in a regular expression, I think the following covers all your criteria:
^(?:(?=\d{7,9}$)[01]?|\d*(?=\d{7}$)|)(\d+$)
See the online demo
^ - Start-of-string anchor.
(?: - Open non-capturing group.
(?=\d{7,9}$) - A positive lookahead asserting that 7 to 9 digits remain up to the end-of-string anchor.
[01]? - Optionally match a zero or a one.
| - Or:
\d* - Match as many digits as possible, up until:
(?=\d{7}$) - A positive lookahead for exactly 7 digits up to the end-of-string anchor.
| - Or: Match nothing.
) - Close non-capturing group.
(\d+$) - Capture all remaining digits in the 1st capture group, up to the end-of-string anchor.
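As a rough sketch of how this pattern could be wired into Java, the whole match can be replaced with capture group 1. The class name PhoneRegexDemo and the helper normalize are made up for illustration and are not from the question or the answer.

```java
import java.util.regex.Pattern;

public class PhoneRegexDemo {
    // The pattern from the answer above, escaped for a Java string literal.
    // Compiled once so repeated calls do not pay the compilation cost.
    private static final Pattern PHONE =
            Pattern.compile("^(?:(?=\\d{7,9}$)[01]?|\\d*(?=\\d{7}$)|)(\\d+$)");

    // Hypothetical helper: keeps only what capture group 1 matched.
    // Input that is not all digits does not match and is returned unchanged.
    static String normalize(String phoneNum) {
        return PHONE.matcher(phoneNum).replaceFirst("$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("555123"));     // 555123
        System.out.println(normalize("1234567"));    // 234567
        System.out.println(normalize("5551212"));    // 5551212
        System.out.println(normalize("5551234567")); // 1234567
    }
}
```

One thing worth checking against your own rules: for a 10-digit input that starts with 0 or 1, this pattern keeps the last 7 digits, whereas stripping the leading digit first would leave 9 digits.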

A replaceAll with a lambda might be sufficient. Its disadvantage is that the lambda call is a bit slower, although the regex itself is simpler and faster. It is also more maintainable, certainly for real-world business logic. Just time the result in a micro-benchmark.
var pattern = Pattern.compile("\\b(\\d+)\\b");
return pattern.matcher(phoneNum).replaceAll(mr -> {
String digits = mr.group(1);
if (digits.length() < 7) { // Or better: restrict the pattern itself, e.g. \\d{7,20}
return digits;
}
if (digits.startsWith("0") || digits.startsWith("1")) { // Can be optimized
digits = digits.substring(1);
}
if (digits.length() >= 10) {
digits = digits.substring(digits.length() - 7);
}
return digits;
});
Your test cases should be kept as unit tests, as such business rules tend to change "slightly" - especially if you prefer a single regex.

Here's the if version, as suggested in comments, I've also added your tests as unit tests:
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
public class SomeClass {
public String correctPhoneNumber(String number) {
if (number.length() < 7) {
return number;
}
// Strip a leading 0 or 1, then fall through to the length check below.
if (number.startsWith("0") || number.startsWith("1")) {
number = number.substring(1);
}
if (number.length() >= 10) {
return number.substring(number.length() - 7);
}
return number;
}
@Test
void correctPhoneNumberTest() {
SomeClass objectToTest = new SomeClass();
assertEquals("555123", objectToTest.correctPhoneNumber("555123"));
assertEquals("234567", objectToTest.correctPhoneNumber("1234567"));
assertEquals("5551212", objectToTest.correctPhoneNumber("5551212"));
assertEquals("1234567", objectToTest.correctPhoneNumber("5551234567"));
}
}

Related

Length of some characters in regex

I have following regex:
\+?[0-9\.,()\-\s]+$
which allows:
optional + at the beginning
then numbers, dots, commas, round brackets, dashes and white spaces.
In addition to that I need to make sure that amount of numbers and plus symbol (if exists) has length between 9 and 15 (so I'm not counting any special characters apart from + symbol).
And this last condition is what I'm having problem with.
valid inputs:
+358 (9) 1234567
+3 5 8.9,1-2(3)4..5,6.7 (25 characters but only 12 characters that counts (numbers and plus symbol))
invalid input:
+3 5 8.9,1-2(3)4..5,6.777777777 (33 characters and only 20 characters that counts (numbers and plus symbol) is too many)
It is important to use regex if possible because it's used in javax.validation.constraints.Pattern annotation as:
@Pattern(regexp = REGEX)
private String number;
where my REGEX is what I'm looking for here.
And if regex cannot be provided then it means that I need to rewrite my entity validation implementation. So is it possible to add such condition to regex or do I need a function to validate such pattern?

You may use
^(?=(?:[^0-9+]*[0-9+]){9,15}[^0-9+]*$)\+?[0-9.,()\s-]+$
See the regex demo
Details
^ - start of string
(?=(?:[^0-9+]*[0-9+]){9,15}[^0-9+]*$) - a positive lookahead whose pattern must match for the regex to find a match:
(?:[^0-9+]*[0-9+]){9,15} - 9 to 15 repetitions of
[^0-9+]* - any 0+ chars other than digits and + symbol
[0-9+] - a digit or +
[^0-9+]* - 0+ chars other than digits and +
$ - end of string
\+? - an optional + symbol
[0-9.,()\s-]+ - 1 or more digits, ., ,, (, ), whitespace and - chars
$ - end of string.
In Java, when used with matches(), the ^ and $ anchors may be omitted:
s.matches("(?=(?:[^0-9+]*[0-9+]){9,15}[^0-9+]*$)\\+?[0-9.,()\\s-]+")
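A minimal sketch of this check run against the inputs from the question. The class name CountedCharsDemo and the method isValid are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class CountedCharsDemo {
    // Lookahead: the string must contain 9 to 15 "counted" characters
    // (digits or +) in total. The consuming part then restricts which
    // characters are allowed at all.
    private static final Pattern NUMBER = Pattern.compile(
            "(?=(?:[^0-9+]*[0-9+]){9,15}[^0-9+]*$)\\+?[0-9.,()\\s-]+");

    // Hypothetical helper; matches() requires the whole string to match.
    static boolean isValid(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("+358 (9) 1234567"));                // true
        System.out.println(isValid("+3 5 8.9,1-2(3)4..5,6.7"));         // true
        System.out.println(isValid("+3 5 8.9,1-2(3)4..5,6.777777777")); // false
        System.out.println(isValid("12345678"));                        // false
    }
}
```

The same string literal should work in the regexp attribute of the validation annotation, since that annotation also requires the whole value to match.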

Not using regex, you could simply loop and count the numbers and +s:
int count = 0;
for (int i = 0; i < str.length(); i++) {
if (Character.isDigit(str.charAt(i)) || str.charAt(i) == '+') {
count++;
}
}

Since you're using Java, I wouldn't rely solely on a regex here:
String input = "+123,456.789";
int count = input.replaceAll("[^0-9+]", "").length();
if (input.matches("^\\+?[0-9.,()\\-\\s]+$") && count >= 9 && count <= 15) {
System.out.println("PASS");
}
else {
System.out.println("FAIL");
}
This approach lets us use your original regex as-is. We handle the length requirement on the digits (and the optional plus) with plain Java string calls.

Figuring out regex for the mentioned condition

I came across the concept of regex recently and set out to solve a problem using just a regex inside the matches() and length() methods of the String class. The problem is about password validation. Here are the three conditions that need to be considered:
A password must have at least eight characters.
A password consists of only letters and digits.
A password must contain at least two digits.
I was able to solve this problem using various other String and Character class methods, but I need to do it with regex only. What I have tried passes most of the test cases, but some of them still fail. Since I am still learning regex, please help me see what I am missing or doing wrong. Below is what I tried:
public class CheckPassword {
public static void main(String[]args){
Scanner sc = new Scanner(System.in);
System.out.println("Enter your password:\n");
String str1 = sc.next();
//String dig2 = "\\d{2}";
//String letter = ".*[A-Z].*";
//String letter1 = ".*[a-z].*";
//if(str1.length() >= 8 && str1.matches(dig2) &&(str1.matches(letter) || str1.matches(letter1)) )
if(str1.length() >= 8 && str1.matches("^(?=.*[A-Z])(?=.*[a-z])(?=.*\\d{2,})(?=.*[0-9])[A-Z0-9a-z]+$"))
System.out.println("Valid Password");
else
System.out.println("Invalid Password");
}
}
EDIT
Okay, so I figured out the first and second conditions. I am only having trouble combining them with the third one, i.e. the password must contain at least 2 digits.
if(str1.length() >= 8 && str1.matches("[a-zA-Z0-9]*"))
//works exclusive of the third criterion

You may actually use a single regex inside matches() to validate all 3 conditions:
A password must have at least eight characters and
A password consists of only letters and digits - use \p{Alnum}{8,} in the consuming part
A password must contain at least two digits - use the (?=(?:[a-zA-Z]*\d){2}) positive lookahead anchored at the start
Combining all three:
.matches("(?=(?:[a-zA-Z]*\\d){2})\\p{Alnum}{8,}")
Since matches() method anchors the pattern by default (i.e. it requires a full string match) no ^ and $ anchors are necessary.
Details
^ - implicit in matches() - start of string
(?=(?:[a-zA-Z]*\d){2}) - a positive lookahead ((?=...)) that requires two consecutive occurrences of the following sequence, which means at least two digits overall:
[a-zA-Z]* - zero or more ASCII letters
\d - an ASCII digit
\p{Alnum}{8,} - 8 or more alphanumeric chars (ASCII only)
$ - implicit in matches() - end of string.
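A small sketch that exercises the combined pattern on a few sample passwords. The class name PasswordCheckDemo, the method isValid, and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

public class PasswordCheckDemo {
    // Lookahead: at least two digits. Consuming part: 8+ ASCII letters/digits.
    private static final Pattern PASSWORD =
            Pattern.compile("(?=(?:[a-zA-Z]*\\d){2})\\p{Alnum}{8,}");

    // Hypothetical helper; matches() anchors the pattern at both ends.
    static boolean isValid(String password) {
        return PASSWORD.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abcdef12"));   // true: 8 chars, two digits
        System.out.println(isValid("1abcdefg2"));  // true: digits need not be adjacent
        System.out.println(isValid("abcdefg1"));   // false: only one digit
        System.out.println(isValid("abc12"));      // false: too short
        System.out.println(isValid("abcdefg12!")); // false: '!' is not alphanumeric
    }
}
```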

Okay, thank you @TDG and M.Aroosi for giving your precious time. I have figured out the solution, and it satisfies all cases.
// answer edited based on OP's working comment.
String dig2 = "^(?=.*?\\d.*\\d)[a-zA-Z0-9]{8,}$";
if(str1.matches(dig2))
{
//body
}

trouble with writing regex java

String t always consists of two distinct alternating characters. For example, if string t's two distinct characters are x and y, then t could be xyxyx or yxyxy but not xxyy or xyyx.
But a.matches() always returns false and the output becomes 0. Help me understand what's wrong here.
public static int check(String a) {
char on = a.charAt(0);
char to = a.charAt(1);
if(on != to) {
if(a.matches("["+on+"("+to+""+on+")*]|["+to+"("+on+""+to+")*]")) {
return a.length();
}
}
return 0;
}

Use regex (.)(.)(?:\1\2)*\1?.
(.) Match any character, and capture it as group 1
(.) Match any character, and capture it as group 2
\1 Match the same characters as was captured in group 1
\2 Match the same characters as was captured in group 2
(?:\1\2)* Match 0 or more pairs of group 1+2
\1? Optionally match a dangling group 1
Input must be at least two characters long. Empty string and one-character string will not match.
As java code, that would be:
if (a.matches("(.)(.)(?:\\1\\2)*\\1?")) {
See regex101.com for working examples [1].
[1] Note that regex101 requires use of ^ and $, which are implied by the matches() method. It also requires use of flags g and m to showcase multiple examples at the same time.
UPDATE
As pointed out by Austin Anderson:
fails on yyyyyyyyy or xxxxxx
To prevent that, we can add a zero-width negative lookahead, to ensure input doesn't start with two of the same character:
(?!(.)\1)(.)(.)(?:\2\3)*\2?
See regex101.com.
Or you can use Austin Anderson's simpler version:
(.)(?!\1)(.)(?:\1\2)*\1?
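For illustration, here is a small sketch that runs the simpler version against the examples from the question. The class name AlternatingDemo and the method isAlternating are hypothetical.

```java
import java.util.regex.Pattern;

public class AlternatingDemo {
    // Group 1 = first char, group 2 = a different second char,
    // then repeated (1,2) pairs and an optional trailing group 1.
    private static final Pattern ALTERNATING =
            Pattern.compile("(.)(?!\\1)(.)(?:\\1\\2)*\\1?");

    // Hypothetical helper; requires the whole string to match.
    static boolean isAlternating(String s) {
        return ALTERNATING.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlternating("xyxyx"));  // true
        System.out.println(isAlternating("yxyxy"));  // true
        System.out.println(isAlternating("xxyy"));   // false
        System.out.println(isAlternating("xyyx"));   // false
        System.out.println(isAlternating("xxxxxx")); // false
    }
}
```

Because the pattern uses backreferences instead of splicing the input characters into the regex, characters such as . or * in the input need no escaping.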

Actually, your regex is almost correct. The problem is that you have enclosed it in two character classes, and you also need to match an optional trailing character at the end.
You just need to use this regex:
public static int check(String a) {
if (a.length() < 2)
return 0;
char on = a.charAt(0);
char to = a.charAt(1);
if(on != to) {
String re = on+"("+to+on+")*"+to+"?|"+to+"("+on+to+")*"+on+"?";
System.out.println("re: " + re);
if(a.matches(re)) {
return a.length();
}
}
return 0;
}
Code Demo

Java Regex mimicking if-else for numbers

Using this as a guide to attempt to emulate an if-else Java regex, I came up with:
[0-2]?(?:(?<=2)(?![6-9])|(?<!2)(?=[0-9])) to do the following:
An optional digit between 0-2 inclusive as the leftmost digit; However, if the first digit is a 2, then the next digit to the right can be maximum 5. If it is a 0 or 1, or left blank, then 0-9 is valid. I am trying to ultimately end up allowing a user to only write the numbers 0-255.
Testing the regular expression on both Regex101 as well as javac doesn't work on test cases, despite the Regex101 explanation being congruent with what I want.
When I test the regex:
System.out.println("0".matches("[0-2]?(?:(?<=2)(?![6-9])|(?<!2)(?=[0-9]))")); ---> false
System.out.println("2".matches("[0-2]?(?:(?<=2)(?![6-9])|(?<!2)(?=[0-9]))")); ----> true
System.out.println("25".matches("[0-2]?(?:(?<=2)(?![6-9])|(?<!2)(?=[0-9]))")); ----> false
System.out.println("22".matches("[0-2]?(?:(?<=2)(?![6-9])|(?<!2)(?=[0-9]))")); ----> false
System.out.println("1".matches("[0-2]?(?:(?<=2)(?![6-9])|(?<!2)(?=[0-9]))")); ----> false
It appears so far, from few test cases, 2 is the only valid case that is accepted by the regex.
For reference, here is my initial regex, using if-else that limits a number to the range of 0-255: [0-2]?(?(?<=2)[0-5]|[0-9])(?(?<=25)[0-5]|[0-9])

I don't see why you would mimic if-else to check a range. It's just a matter of putting some patterns together.
^(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$
^ start anchor
(?: opens a non capture group for alternation
[1-9]?\d matches 0-99
1\d\d matches 100-199
2[0-4]\d matches 200-249
25[0-5] matches 250-255
$ end anchor
See demo at regex101
If leading zeros are allowed, you can reduce it to ^(?:[01]?\d\d?|2[0-4]\d|25[0-5])$
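A quick sketch of the strict variant (no leading zeros) in Java, checking the boundaries of each alternative. The class name ByteRangeDemo and the method isByte are invented for this example.

```java
import java.util.regex.Pattern;

public class ByteRangeDemo {
    // One alternative per sub-range: 0-99, 100-199, 200-249, 250-255.
    private static final Pattern BYTE_RANGE =
            Pattern.compile("^(?:[1-9]?\\d|1\\d\\d|2[0-4]\\d|25[0-5])$");

    // Hypothetical helper: true only for "0" through "255" without leading zeros.
    static boolean isByte(String s) {
        return BYTE_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isByte("0"));   // true
        System.out.println(isByte("99"));  // true
        System.out.println(isByte("199")); // true
        System.out.println(isByte("255")); // true
        System.out.println(isByte("256")); // false
        System.out.println(isByte("007")); // false: leading zeros rejected
    }
}
```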

As you are trying to only allow a range of numbers (0-255), why use regex at all? Instead, parse the string as an int and check if it falls within the range.
public static boolean isInRange(String input, int min, int max) {
try {
int val = Integer.parseInt(input);
return val >= min && val < max;
} catch (NumberFormatException e) {
return false;
}
}

Java Regex hung on a long string

I am trying to write a regex to validate a string. The string may contain only:
Uppercase and lowercase English letters (a to z, A to Z) (ASCII: 65 to 90, 97 to 122) and/or digits 0 to 9 (ASCII: 48 to 57).
The characters - _ ~ (ASCII: 45, 95, 126), provided that they are not the first or last character.
The character . (dot, period, full stop) (ASCII: 46), provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
I have tried using the following:
Pattern.compile("^[^\\W_*]+((\\.?[\\w\\~-]+)*\\.?[^\\W_*])*$");
It works fine for shorter strings, but not for long ones: I am seeing hung threads and huge CPU spikes. Please help.
Test cases for invalid strings:
"aB78."
"aB78..ab"
"aB78,1"
"aB78 abc"
".Abc12"
Test cases for valid strings:
"abc-def"
"a1b2c~3"
"012_345"

Your regex suffers from catastrophic backtracking, which leads to O(2^n) (i.e. exponential) solution time.
Following the link will provide a far more thorough explanation, but briefly: when the input doesn't match, the engine backtracks the first * term to try different combinations of the quantities of the terms. Because all the groups match more or less the same thing, the number of ways to group the input grows exponentially with the length of the backtracking, which in the case of non-matching input is the entire input.
The solution is to rewrite the regex so it won't catastrophically backtrack:
don't use groups of groups
use possessive quantifiers eg .*+ (which never backtrack)
fail early on non-match (eg using an anchored negative look ahead)
limit the number of times terms may appear using {n,m} style quantifiers
Or otherwise mitigate the problem

Problem
It is due to catastrophic backtracking. Let me show where it happens, by simplifying the regex to a regex which matches a subset of the original regex:
^[^\W_*]+((\.?[\w\~-]+)*\.?[^\W_*])*$
Since [^\W_*] and [\w\~-] can match [a-z], let us replace them with [a-z]:
^[a-z]+((\.?[a-z]+)*\.?[a-z])*$
Since \.? are optional, let us remove them:
^[a-z]+(([a-z]+)*[a-z])*$
You can see ([a-z]+)*, which is the classic example of a regex that causes catastrophic backtracking, (A*)*. The fact that the outermost repetition (([a-z]+)*[a-z])* can expand to ([a-z]+)*[a-z]([a-z]+)*[a-z]([a-z]+)*[a-z] further exacerbates the problem (imagine the number of ways to split the input string to match all the expansions your regex can have). And this is not to mention the [a-z]+ in front, which adds insult to injury, since it is of the form A*A*.
Solution
You can use this regex to validate the string according to your conditions:
^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$
As Java string literal:
"^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$"
Breakdown of the regex:
^ # Assert beginning of the string
(?=[a-zA-Z0-9]) # Must start with alphanumeric, no special
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
(?<=[a-zA-Z0-9]) # Must end with alphanumeric, no special
$ # Assert end of the string
Since . can't appear consecutively, and can't start or end the string, we can consider it a separator between strings of [a-zA-Z0-9_~-]+. So we can write:
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
All quantifiers are made possessive to reduce stack usage in Oracle's implementation and make the matching faster. Note that it is not appropriate to use them everywhere. Due to the way my regex is written, there is only one way to match a particular string to begin with, even without possessive quantifier.
Shorthand
Since this is Java and in default mode, you can shorten a-zA-Z0-9_ to \w and [a-zA-Z0-9] to [^\W_] (though the second one is a bit hard for other programmers to read):
^(?=[^\W_])[\w~-]++(\.[\w~-]++)*+(?<=[^\W_])$
As Java string literal:
"^(?=[^\\W_])[\\w~-]++(\\.[\\w~-]++)*+(?<=[^\\W_])$"
If you use the regex with String.matches(), the anchors ^ and $ can be removed.
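As a rough sketch, the shorthand version can be checked against the test cases from the question, plus one long non-matching input of the kind that caused the hang. The class name SafeNameDemo, the method isValid, and the 10,000-character input are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class SafeNameDemo {
    // Shorthand pattern from above, without ^ and $ because matches() is used.
    private static final Pattern VALID =
            Pattern.compile("(?=[^\\W_])[\\w~-]++(\\.[\\w~-]++)*+(?<=[^\\W_])");

    // Hypothetical helper; the whole string must match.
    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc-def", "a1b2c~3", "012_345"}) {
            System.out.println(s + " -> " + isValid(s)); // each prints true
        }
        for (String s : new String[] {"aB78.", "aB78..ab", "aB78,1", "aB78 abc", ".Abc12"}) {
            System.out.println(s + " -> " + isValid(s)); // each prints false
        }

        // A long string that fails only at the very last character.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append('a');
        }
        sb.append('!');
        System.out.println("long input -> " + isValid(sb.toString())); // false
    }
}
```

With the possessive quantifiers there is only one way to consume the run of valid characters, so the failure on the final character is reported without the engine retrying other splits.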

As @MarounMaroun already commented, you don't really have a pattern. It might be better to iterate over the string as in the following method:
public static boolean validate(String string) {
char chars[] = string.toCharArray();
if (!isSpecial(chars[0]) && !isLetterOrDigit(chars[0]))
return false;
if (!isSpecial(chars[chars.length - 1])
&& !isLetterOrDigit(chars[chars.length - 1]))
return false;
for (int i = 1; i < chars.length - 1; ++i)
if (!isPunctiation(chars[i]) && !isLetterOrDigit(chars[i])
&& !isSpecial(chars[i]))
return false;
return true;
}
public static boolean isPunctiation(char c) {
return c == '.' || c == ',';
}
public static boolean isSpecial(char c) {
return c == '-' || c == '_' || c == '~';
}
public static boolean isLetterOrDigit(char c) {
return (Character.isDigit(c) || (Character.isLetter(c) && (Character
.getType(c) == Character.UPPERCASE_LETTER || Character
.getType(c) == Character.LOWERCASE_LETTER)));
}
Test code:
public static void main(String[] args) {
System.out.println(validate("aB78."));
System.out.println(validate("aB78..ab "));
System.out.println(validate("abcdef"));
System.out.println(validate("aB78,1"));
System.out.println(validate("aB78 abc"));
}
Output:
false
false
true
true
false

A solution should try and find negatives rather than try and match a pattern over the entire string.
// Any character outside [-\w.~], two consecutive dots, or a leading/trailing dot is "bad".
Pattern bad = Pattern.compile( "[^-\\w.~]|\\.\\.|^\\.|\\.$" );
for( String str: new String[]{ "aB78.", "aB78..ab", "abcdef",
"aB78,1", "aB78 abc" } ){
Matcher mat = bad.matcher( str );
System.out.println( mat.find() );
}
(It is remarkable to see how the initial statement "string...should have only" leads programmers to try and create positive assertions by parsing or matching valid characters over the full length rather than the much simpler search for negatives.)
