Java Regex hung on a long string

Java Regex hung on a long string - java

I am trying to write a REGEX to validate a string. It should validate to the requirement which is that it should have only Uppercase and lowercase English letters (a to z, A to Z) (ASCII: 65 to 90, 97 to 122) AND/OR Digits 0 to 9 (ASCII: 48 to 57) AND Characters - _ ~ (ASCII: 45, 95, 126). Provided that they are not the first or last character. It can also have Character. (dot, period, full stop) (ASCII: 46) Provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively. I have tried using the following
Pattern.compile("^[^\\W_*]+((\\.?[\\w\\~-]+)*\\.?[^\\W_*])*$");
It works fine for smaller strings but it doesn't for long strings as i am experiencing thread hung issues and huge spikes in cpu. Please help.
Test cases for invalid strings:
"aB78."
"aB78..ab"
"aB78,1"
"aB78 abc"
".Abc12"
Test cases for valid strings:
"abc-def"
"a1b2c~3"
"012_345"

Your regex suffers from catastrophic backtracking, which leads to O(2n) (ie exponential) solution time.
Although following the link will provide a far more thorough explanation, briefly the problem is that when the input doesn't match, the engine backtracks the first * term to try different combinations of the quantitys of the terms, but because all groups more or less match the same thing, the number of combinations of ways to group grows exponentially with the length of the backtracking - which in the case of non- matching input is the entire input.
The solution is to rewrite the regex so it won't catastrophically backtrack:
don't use groups of groups
use possessive quantifiers eg .*+ (which never backtrack)
fail early on non-match (eg using an anchored negative look ahead)
limit the number of times terms may appear using {n,m} style quantifiers
Or otherwise mitigate the problem

Problem
It is due to catastrophic backtracking. Let me show where it happens, by simplifying the regex to a regex which matches a subset of the original regex:
^[^\W_*]+((\.?[\w\~-]+)*\.?[^\W_*])*$
Since [^\W_*] and [\w\~-] can match [a-z], let us replace them with [a-z]:
^[a-z]+((\.?[a-z]+)*\.?[a-z])*$
Since \.? are optional, let us remove them:
^[a-z]+(([a-z]+)*[a-z])*$
You can see ([a-z]+)*, which is the classical example of regex which causes catastrophic backtracking (A*)*, and the fact that the outermost repetition (([a-z]+)*[a-z])* can expand to ([a-z]+)*[a-z]([a-z]+)*[a-z]([a-z]+)*[a-z] further exacerbate the problem (imagine the number of permutation to split the input string to match all expansions that your regex can have). And this is not mentioning [a-z]+ in front, which adds insult to injury, since it is of the form A*A*.
Solution
You can use this regex to validate the string according to your conditions:
^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$
As Java string literal:
"^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$"
Breakdown of the regex:
^ # Assert beginning of the string
(?=[a-zA-Z0-9]) # Must start with alphanumeric, no special
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
(?<=[a-zA-Z0-9]) # Must end with alphanumeric, no special
$ # Assert end of the string
Since . can't appear consecutively, and can't start or end the string, we can consider it a separator between strings of [a-zA-Z0-9_~-]+. So we can write:
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
All quantifiers are made possessive to reduce stack usage in Oracle's implementation and make the matching faster. Note that it is not appropriate to use them everywhere. Due to the way my regex is written, there is only one way to match a particular string to begin with, even without possessive quantifier.
Shorthand
Since this is Java and in default mode, you can shorten a-zA-Z0-9_ to \w and [a-zA-Z0-9] to [^\W_] (though the second one is a bit hard for other programmer to read):
^(?=[^\W_])[\w~-]++(\.[\w~-]++)*+(?<=[^\W_])$
As Java string literal:
"^(?=[^\\W_])[\\w~-]++(\\.[\\w~-]++)*+(?<=[^\\W_])$"
If you use the regex with String.matches(), the anchors ^ and $ can be removed.

As #MarounMaroun already commented, you don't really have a pattern. It might be better to iterate over the string as in the following method:
public static boolean validate(String string) {
char chars[] = string.toCharArray();
if (!isSpecial(chars[0]) && !isLetterOrDigit(chars[0]))
return false;
if (!isSpecial(chars[chars.length - 1])
&& !isLetterOrDigit(chars[chars.length - 1]))
return false;
for (int i = 1; i < chars.length - 1; ++i)
if (!isPunctiation(chars[i]) && !isLetterOrDigit(chars[i])
&& !isSpecial(chars[i]))
return false;
return true;
}
public static boolean isPunctiation(char c) {
return c == '.' || c == ',';
}
public static boolean isSpecial(char c) {
return c == '-' || c == '_' || c == '~';
}
public static boolean isLetterOrDigit(char c) {
return (Character.isDigit(c) || (Character.isLetter(c) && (Character
.getType(c) == Character.UPPERCASE_LETTER || Character
.getType(c) == Character.LOWERCASE_LETTER)));
}
Test code:
public static void main(String[] args) {
System.out.println(validate("aB78."));
System.out.println(validate("aB78..ab "));
System.out.println(validate("abcdef"));
System.out.println(validate("aB78,1"));
System.out.println(validate("aB78 abc"));
}
Output:
false
false
true
true
false

A solution should try and find negatives rather than try and match a pattern over the entire string.
Pattern bad = Pattern.compile( "[^-\\W.~]|\\.\\.|^\\.|\\.$" );
for( String str: new String[]{ "aB78.", "aB78..ab", "abcdef",
"aB78,1", "aB78 abc" } ){
Matcher mat = bad.matcher( str );
System.out.println( mat.find() );
}
(It is remarkable to see how the initial statement "string...should have only" leads programmers to try and create positive assertions by parsing or matching valid characters over the full length rather than the much simpler search for negatives.)

Related

Figuring out regex for the mentioned condition

I came across the concept of regex recently and was poised to solve the problem using just the regex inside matches() and length() method of String class. The problem was related to password matching.Here are the three conditions that need to be considered:
A password must have at least eight characters.
A password consists of only letters and digits.
A password must contain at least two digits.
I was able to do this problem by using various other String and Character class methods but I need to do them only by regex.What I have tried helps me with most of the test cases but some of them(test cases) are still failing.Since, I am learning regex implementation so please help me with what I am missing or doing wrong. Below is what I tried:
public class CheckPassword {
public static void main(String[]args){
Scanner sc = new Scanner(System.in);
System.out.println("Enter your password:\n");
String str1 = sc.next();
//String dig2 = "\\d{2}";
//String letter = ".*[A-Z].*";
//String letter1 = ".*[a-z].*";
//if(str1.length() >= 8 && str1.matches(dig2) &&(str1.matches(letter) || str1.matches(letter1)) )
if(str1.length() >= 8 && str1.matches("^(?=.*[A-Z])(?=.*[a-z])(?=.*\\d{2,})(?=.*[0-9])[A-Z0-9a-z]+$"))
System.out.println("Valid Password");
else
System.out.println("Invalid Password");
}
}
EDIT
Okay So I figured out the first and second case just I am having problem in appending the third case with them i.e. contains at least 2 digits.
if(str1.length() >= 8 && str1.matches("[a-zA-Z0-9]*"))
//works exclusive of the third criterion

You may actually use a single regex inside matches() to validate all 3 conditions:
A password must have at least eight characters and
A password consists of only letters and digits - use \p{Alnum}{8,} in the consuming part
A password must contain at least two digits - use the (?=(?:[a-zA-Z]*\d){2}) positive lookahead anchored at the start
Combining all three:
.matches("(?=(?:[a-zA-Z]*\\d){2})\\p{Alnum}{8,}")
Since matches() method anchors the pattern by default (i.e. it requires a full string match) no ^ and $ anchors are necessary.
Details
^ - implicit in matches() - start of string
(?=(?:[a-zA-Z]*\d){2}) - a positive lookahead ((?=...)) that requires the presence of exactly two sequences of:
[a-zA-Z]* - zero or more ASCII letters
\d - an ASCII digit
\p{Alnum}{8,} - 8 or more alphanumeric chars (ASCII only)
$ - implicit in matches() - end of string.

Okay Thank you #TDG and M.Aroosi for giving your precious time. I have figured out the solution and this solution satisfies all cases
// answer edited based on OP's working comment.
String dig2 = "^(?=.*?\\d.*\\d)[a-zA-Z0-9]{8,}$";
if(str1.matches(dig2))
{
//body
}

Regex to identify strings containing a particular symbol?

I have set of inputs ++++,----,+-+-.Out of these inputs I want the string containing only + symbols.

If you want to see if a String contains nothing but + characters, write a loop to check it:
private static boolean containsOnly(String input, char ch) {
if (input.isEmpty())
return false;
for (int i = 0; i < input.length(); i++)
if (input.charAt(i) != ch)
return false;
return true;
}
Then call it to check:
System.out.println(containsOnly("++++", '+')); // prints: true
System.out.println(containsOnly("----", '+')); // prints: false
System.out.println(containsOnly("+-+-", '+')); // prints: false
UPDATE
If you must do it using regex (worse performance), then you can do any of these:
// escape special character '+'
input.matches("\\++")
// '+' not special in a character class
input.matches("[+]+")
// if "+" is dynamic value at runtime, use quote() to escape for you,
// then use a repeating non-capturing group around that
input.matches("(?:" + Pattern.quote("+") + ")+")
Replace final + with * in each of these, if an empty string should return true.

The regular expression for checking if a string is composed of only one repeated symbol is
^(.)\1*$
If you only want lines composed by '+', then it's
^\++$, or ^++*$ if your regex implementation does not support +(meaning "one or more").

For a sequence of the same symbol, use
(.)\1+
as the regular expression. For example, this will match +++, and --- but not +--.

Regex pattern: ^[^\+]*?\+[^\+]*$
This will only permit one plus sign per string.
Demo Link
Explanation:
^ #From start of string
[^\+]* #Match 0 or more non plus characters
\+ #Match 1 plus character
[^\+]* #Match 0 or more non plus characters
$ #End of string
edit, I just read the comments under the question, I didn't actually steal the commented regex (it just happens to be intellectual convergence):
Whoops, when using matches disregard ^ and $ anchors.
input.matches("[^\\+]*?\+[^\\+]*")

Regex for detecting repeating symbols

I'm looking for the regex expression that will detect repeating symbols in a String. And currently I didn't found solution that fits all my requirements.
Requirements are pretty simple:
detect any repeating symbol in a String;
to be able to setup repeating count (eg. more than twice)
Examples of required detection (of symbol 'a', more than 2 times, true if detects, false otherwise)
"Abcdefg" - false
"AbcdaBCD" - false
"abcd_ab_ab" - true (symbol 'a' used three times)
"aabbaabb" - true (symbols 'a' used four times)
Since I'm not a pro in regex and usage of them - code snippet and explanation would be appreciated!
Thanks!

I think that
(.).*\1
would work:
(.) match a single character and capture
.* match any intervening characters
\1 match the captured group again.
(You'd need to compile with the DOTALL flag, or replace . with [\s\S] or similar if the string contains characters not ordinarily matched by .)
and if you want to require that it is found at least 3 times, just change the quantifier of the second two bullets:
(.)(.*\1){2}
etc.
This is going to be pretty inefficient, though, because it's going to have to do the "search for the next matching character" between every character in the string and the end of the string, making it at least quadratic.
You might be as well off not using regular expressions, e.g.
char[] cs = str.toCharArray();
Arrays.sort(cs);
int n = numOccurrencesRequired - 1;
for (int i = n; i < cs.length; ++i) {
boolean allSame = true;
for (int j = 1; j <= n && allSame; ++j) {
allSame = cs[i] == cs[i - j];
}
if (allSame) return true;
}
return false;
This sorts all of the same characters together, allowing you just to pass over the string once looking for adjacent equal characters.
Note that this doesn't quite work for any symbol: it will split up multi-char codepoints like 🍕. You can adapt the code above to work with codepoints, rather than chars.

Try this regex: (.)(?:.*\1)
It basically matches any character (.) is followed by anything .* and itself \1. If you want to check for 2 or more repeats only add {n,} at the end with n being the number of repeats you want to check for.

Yea, such regex exists but just because the set of characters is finite.
regex: .*(a.*a|b.*b|c.*c|...|y.*y|z.*z).*
It makes no sense. Use another approach:
String string = "something";
int[] count = new int[256];
for (int i = 0; i < string.length; i++) {
int temp = int(string.charAt(i));
count[temp]++;
}
Now you have all characters counted and you can use them as you wish.

Regex for password matching

I have searched the site and not finding exactly what I am looking for.
Password Criteria:
Must be 6 characters, 50 max
Must include 1 alpha character
Must include 1 numeric or special character
Here is what I have in java:
public static Pattern p = Pattern.compile(
"((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])|(?=.*[\\d~!##$%^&*\\(\\)_+\\{\\}\\[\\]\\?<>|_]).{6,50})"
);
The problem is that a password of 1234567 is matching(it is valid) which it should not be.
Any help would be great.

I wouldn't try to use a single regular expression to do that. Regular expressions tend not to perform well when they get long and complicated.
boolean valid(String password){
return password != null &&
password.length() >= 6 &&
password.length() <= 50 &&
password.matches(".*[A-Za-z].*") &&
password.matches(".*[0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_+\\{\\}\\[\\]\\?<>|_].*");
}

Make sure you use Matcher.matches() method, which assert that the whole string matches the pattern.
Your current regex:
"((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])|(?=.*[\\d~!##$%^&*\\(\\)_+\\{\\}\\[\\]\\?<>|_]).{6,50})"
means:
The string must contain at least a digit (?=.*\\d), a lower case English alphabet (?=.*[a-z]), and an upper case character (?=.*[A-Z])
OR | The string must contain at least 1 character which may be digit or special character (?=.*[\\d~!##$%^&*\\(\\)_+\\{\\}\\[\\]\\?<>|_])
Either conditions above holds true, and the string must be between 6 to 50 characters long, and does not contain any line separator.
The correct regex is:
"(?=.*[a-zA-Z])(?=.*[\\d~!##$%^&*()_+{}\\[\\]?<>|]).{6,50}"
This will check:
The string must contain an English alphabet character (either upper case or lower case) (?=.*[a-zA-Z]), and a character which can be either a digit or a special character (?=.*[\\d~!##$%^&*()_+{}\\[\\]?<>|])
The string must be between 6 and 50 characters, and does not contain any line separator.
Note that I removed escaping for most characters, except for [], since {}?() loses their special meaning inside character class.

A regular expression can only match languages which can be expressed as a deterministic finite automaton, i.e. which doesn't require memory. Since you have to count special and alpha characters, this does require memory, so you're not going to be able to do this in a DFA. Your rules are simple enough, though that you could just scan the password, determine its length and ensure that the required characters are available.

I'd suggest you to separate characters and length validation:
boolean checkPassword(String password) {
return password.length() >= 6 && password.length() <= 50 && Pattern.compile("\\d|\\w").matcher(password).find();
}

I would suggest splitting into separate regular expressions
$re_numbers = "/[0-9]/";
$re_letters = "/[a-zA-Z]/";
both of them must match and the length is tested separately, too.
The code looks quite cleaner then and is easier to understand/change.

This way too complex for such a simple task:
Validate length using String#length()
password.length() >= 6 && password.length() <= 50
Validate each group using Matcher#find()
Pattern alpha = Pattern.compile("[a-zA-Z]");
boolean hasAlpha = alpha.matcher(password).find();
Pattern digit = Pattern.compile("\d");
boolean hasDigit = digit.matcher(password).find();
Pattern special = Pattern.compile("[\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_+\\{\\}\\[\\]\\?<>|_]");
boolean hasSpecial = special.matcher(password).find();

Match exactly N repetitions of the same character

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.

Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken (see this link).
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]

I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).
But that said...I suspect the simplest cross-language solution is just write it yourself without using regexes.
UPDATE:
^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.

It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!
Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:
use strict;
use warnings;
my $str = 'xxaaaayyybbbbbzzccccxx';
while ( $str =~ /((.)\2*)/g ) {
next unless length $1 == 4;
my $substr = $1;
print "$substr\n";
}
output
aaaa
cccc

Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.
sub runs_of_length {
my($n,$str) = #_;
my $n_minus_1 = $n - 1;
my $_run_pattern = qr/
(?:
# In the middle of the string, we have to force the
# run being matched to start on a new character.
# Otherwise, the regex engine will give a false positive
# by starting in the middle of a run.
(.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
#$1 $2 $3
# Don't forget about a potential run that starts at
# the front of the target string.
^(.) (\4{$n_minus_1}) (?!\4)
# $4 $5
)
/x;
my #runs;
while ($str =~ /$_run_pattern/g) {
push #runs, defined $4 ? "$4$5" : "$2$3";
}
#runs;
}
A few test cases:
my #tests = (
"xxaaaayyybbbbbzzccccxx",
"aaaayyybbbbbzzccccxx",
"xxaaaa",
"aaaa",
"",
);
$" = "][";
for (#tests) {
my #runs = runs_of_length 4, $_;
print qq<"$_":\n>,
" - [#runs]\n";
}
Output:
"xxaaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"xxaaaa":
- [aaaa]
"aaaa":
- [aaaa]
"":
- []
It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.

How about this in python?
def match(string, n):
parts = []
current = None
for c in string:
if not current:
current = c
else:
if c == current[-1]:
current += c
else:
parts.append(current)
current = c
result = []
for part in parts:
if len(part) == n:
result.append(part)
return result
Testing with your string with various sizes:
match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ["bbbbb"]
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ["yyy"]
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz']
Explanation:
The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the current length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.
Anyways, I think this'll do!

Why not leave to regexp engine what it does best - finding longest string of same symbols and then check length yourself?
In Perl:
my $str = 'xxaaaayyybbbbbzzccccxx';
while($str =~ /(.)\1{3,}/g){
if(($+[0] - $-[0]) == 4){ # insert here full match length counting specific to language
print (($1 x 4), "\n")
}
}

>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]
>>> z
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']
From there you can iterate through the list and check for occasions when N==4 very easily, like this:
>>> [item for item in z if len(item)==4]
['cccc', 'aaaa']

In Java we can do like below code
String test ="xxaaaayyybbbbbzzccccxx uuuuuutttttttt";
int trimLegth = 4; // length of the same characters
Pattern p = Pattern.compile("(\\w)\\1+",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE);
Matcher m = p.matcher(test);
while (m.find())
{
if(m.group().length()==trimLegth) {
System.out.println("Same Characters String " + m.group());
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java Regex hung on a long string - java

Related

Figuring out regex for the mentioned condition

Regex to identify strings containing a particular symbol?

Regex for detecting repeating symbols

Regex for password matching

Match exactly N repetitions of the same character

Categories

Resources