Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.
The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)
The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!
In my code, I have tried a couple of ways to match regex:
if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}
All three suffer from poor performance when the regex mentioned above is added to the list of regexes. Any suggestions to improve regex performance or change the regex itself?
The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.
I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group
(?>(.+?)#011)
Once the group has matched, the engine can no longer backtrack into it. It is done with that field and moves on to the next group.
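For the "just split" suggestion, here is a minimal Java sketch (the same call works from Scala). The class and method names are made up for illustration. It assumes that every occurrence of #011 in a line is a separator and never part of a field.

```java
import java.util.regex.Pattern;

final class FieldSplit {
    // "#011" is the flattened tab from the question; quote() keeps it literal.
    private static final Pattern SEPARATOR = Pattern.compile(Pattern.quote("#011"));

    // A limit of -1 keeps trailing empty fields instead of dropping them.
    static String[] fields(String log) {
        return SEPARATOR.split(log, -1);
    }

    public static void main(String[] args) {
        for (String field : fields("first field#011second#011third one")) {
            System.out.println(field);
        }
    }
}
```

Checking the length of the returned array against the expected number of fields then replaces the "does the regex match" test.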
Negating a string
The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:
((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).
An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:
((?: (?!#011). )+)#011
It's all pretty complicated and most probably less performant than simply using the atomic group.
Optimizations
Of my regexes above, the first one is best. However, as Casimir et Hippolyte wrote, there's room for improvement (a factor of 1.8):
( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.
A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.
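To show how the optimized group could be assembled into a full pattern, here is a rough Java sketch. The class name, the helper names, and the decision to let the last field take the rest of the line are my own assumptions, not part of the original question. As noted above, empty fields are accepted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledFields {
    // One field: a possessive run of non-#, then any '#' that does not start "#011".
    private static final String FIELD = "([^#]*+(?:#(?!011)[^#]*)*+)";

    // Builds FIELD#011FIELD#011...(.*) for the requested number of fields.
    static Pattern build(int fieldCount) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < fieldCount - 1; i++) {
            regex.append(FIELD).append("#011");
        }
        regex.append("(.*)"); // last field: whatever is left on the line
        return Pattern.compile(regex.toString());
    }

    // Returns the captured fields, or null when the line does not match.
    static String[] extract(Pattern pattern, String log) {
        Matcher m = pattern.matcher(log);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] fields = extract(build(3), "a#b#011c d#011e");
        System.out.println(String.join(" | ", fields));
    }
}
```

Because each field is possessive, a line with too few separators fails quickly instead of retrying every way of distributing the text across the groups.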

Related

Regex to match if string *only* contains *all* characters from a character set, plus an optional one

I ran into a wee problem with Java regex. (I must say in advance, I'm not very experienced in either Java or regex.)
I have a string, and a set of three characters. I want to find out if the string is built from only these characters. Additionally (just to make it even more complicated), two of the characters must be in the string, while the third one is optional.
I do have a solution, my question is rather if anyone can offer anything better/nicer/more elegant, because this makes me cry blood when I look at it...
The set-up
The mandatory characters are: | (pipe) and - (dash).
The string in question should be built from a combination of these. They can be in any order, but both have to be in it.
The optional character is: : (colon).
The string can contain colons, but it does not have to. This is the only other character allowed, apart from the above two.
Any other characters are forbidden.
Expected results
Following strings should work/not work:
"------" = false
"||||" = false
"---|---" = true
"|||-|||" = true
"--|-|--|---|||-" = true
...and...
"----:|--|:::|---::|" = true
":::------:::---:---" = false
"|||:|:::::|" = false
"--:::---|:|---G---n" = false
...etc.
The "ugly" solution
Now, I have a solution that seems to work, based on this stackoverflow answer. The reason I'd like a better one will become obvious when you've recovered from seeing this:
if (string.matches("^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$") || string.matches("^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$")) {
//do funny stuff with a meaningless string
} else {
//don't do funny stuff with a meaningless string
}
Breaking it down
The first regex
"^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$"
checks for all three characters
The next one
"^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$"
checks for the two mandatory ones only.
...Yea, I know...
But believe me I tried. Nothing else gave the desired result, but allowed through strings without the mandatory characters, etc.
The question is...
Does anyone know how to do it a simpler / more elegant way?
Bonus question: There is one thing I don't quite get in the regexes above (more than one, but this one bugs me the most):
As far as I understand(?) regular expressions, (?\\|)? should mean that the character | is either contained or not (unless I'm very much mistaken), still in the above setup it seems to enforce that character. This of course suits my purpose, but I cannot understand why it works that way.
So if anyone can explain what I'm missing there, that'd be really great. Besides, I suspect this holds the key to a simpler solution (checking for both mandatory and optional characters in one regex would be ideal).
Thank you all for reading (and suffering ) through my question, and even bigger thanks for those who reply. :)
PS
I did try stuff like ^[\\|\\-(?:\\:)?)]$, but that would not enforce all mandatory characters.
Use a lookahead based regex.
^(?=.*\\|)(?=.*-)[-:|]+$
or
^(?=.*\\|)[-:|]*-[-:|]*$
or
^[-:|]*(?:-:*\\||\\|:*-)[-:|]*$
(?=.*\\|) expects at least one pipe.
(?=.*-) expects at least one hyphen.
[-:|]+ any char from the list one or more times.
$ End of the line.
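A small Java sketch that runs the first lookahead pattern against the strings from the question. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

final class PipeDashCheck {
    // Two lookaheads demand a pipe and a dash; the class restricts the alphabet.
    private static final Pattern VALID = Pattern.compile("^(?=.*\\|)(?=.*-)[-:|]+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"------", "---|---", "----:|--|:::|---::|", "--:::---|:|---G---n"};
        for (String s : samples) {
            System.out.println(s + " = " + isValid(s));
        }
    }
}
```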
Here is a simple answer:
(?=.*\|.*-|.*-.*\|)^([-|:]+)$
This says that the string needs to have a '-' followed by '|', or a '|' followed by a '-', via the look-ahead. Then the string only matches the allowed characters.
Demo: http://fiddle.re/1hnu96
Here is one without lookahead or lookbehind. The alternatives are wrapped in a group so that ^ and $ apply to both of them:
^(?:[-:|]*\\|[-:|]*-[-:|]*|[-:|]*-[-:|]*\\|[-:|]*)$
This doesn't scale, so Avinash's solution is to be preferred, provided your regex engine supports lookarounds.

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern match the string add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is (regex1)|(regex2)|(regex3)|...|(regexN), but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manner of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in Java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
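For reference, the naive implementation described above might look roughly like the following Java sketch. The names are hypothetical. It assumes that "matched" means the pattern is found anywhere in the string (find()); swap in matches() if whole-string matching is intended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class UnmatchedFilter {
    // Returns the inputs that none of the regexes match.
    static List<String> unmatched(List<String> regexes, List<String> inputs) {
        // Compile each pattern once instead of once per input string.
        List<Pattern> compiled = new ArrayList<>();
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex));
        }
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            boolean hit = false;
            for (Pattern p : compiled) {
                if (p.matcher(input).find()) { // assumption: substring match
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                result.add(input);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> regexes = List.of("(a)\\1", "^b+$");
        List<String> inputs = List.of("xaax", "bbb", "abc");
        System.out.println(unmatched(regexes, inputs));
    }
}
```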
Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot be actually part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Note that Java group names must start with a letter and may contain only letters and digits, so put a letter in front of the number. Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace all those numbers with k<name> where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
I hope this gave you a pointer in the right direction.
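The scanning rules listed above (skip escaped characters, track character classes, ignore non-capturing constructs) can be sketched in Java as a group counter. This is a deliberately simplified illustration with made-up names: it only counts groups rather than renaming them, and it does not handle \Q...\E quoting, comments mode, or a literal ] placed first in a character class.

```java
import java.util.regex.Pattern;

final class GroupScanner {
    // Counts capturing groups (numbered and named) in a regex string.
    static int countCapturingGroups(String p) {
        int count = 0;
        int classDepth = 0; // Java allows nested classes such as [a[bc]]
        for (int i = 0; i < p.length(); i++) {
            char c = p.charAt(i);
            if (c == '\\') {        // escaped character: skip it entirely
                i++;
                continue;
            }
            if (classDepth > 0) {   // inside [...] parentheses are literals
                if (c == '[') classDepth++;
                else if (c == ']') classDepth--;
                continue;
            }
            if (c == '[') {
                classDepth = 1;
                continue;
            }
            if (c != '(') {
                continue;
            }
            boolean special = i + 1 < p.length() && p.charAt(i + 1) == '?';
            if (!special) {
                count++;            // plain capturing group
            } else if (i + 3 < p.length() && p.charAt(i + 2) == '<'
                    && p.charAt(i + 3) != '=' && p.charAt(i + 3) != '!') {
                count++;            // named group (?<name>, not a lookbehind
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String p = "(a)(?:b)(?<n>c)(?<=d)[(]\\((?=e)";
        System.out.println(countCapturingGroups(p));
        // Cross-check against what the regex engine itself reports.
        System.out.println(Pattern.compile(p).matcher("").groupCount());
    }
}
```

A real renaming pass would record each group's position at the points where this sketch increments the counter, then rewrite the backreferences in a second pass.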

with regex, is using both "is" and "is not" range definitons within the same range possible?

Note: I'm using a 3rd party app that uses regex for searches which has its own flavor but almost always works like java's flavor of regex. Of course this may not matter.
After searching for many different ways of this same question (phrased many ways), I did not see any tutorials, examples, or even mentions of whether it is possible to use both an "is" (positive?) and "is not" (negative?) definition within the same range.
I can't test the example in the app right now, because the amount of data being searched is massive and a test would screw up the matches it has already gathered. That's the only reason I'm asking.
Here are examples of what I thought might work but caused tester to act weird:
[\w^\s<>.!?]{2}
[\w|^\s<>.!?]{2}
I would rather have it work the way I think the first one would work (any digit, lower case, or upper case character, or other normal character that is not a space, >, <, period, !, or ?) rather than the second, which only has an or operator.
The regex testers I used gave me different funky results which is what is confusing me.
Also note: I'm using this within a capture group which is followed by a catch-everything match, which I may or may not be using properly. So if you'd like to include how to follow what I'm attempting with how to properly do that, feel free. I AM MAINLY JUST CURIOUS AS TO WHETHER THIS IS POSSIBLE, OR WHETHER IT IS AN IMPROPER METHOD.
Why do you need the \w at all?
[^\s<>.!?]{2}
This already matches all alphanumeric characters since they are neither space nor any of the punctuation characters you mentioned.
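A quick way to check this in Java (which the question says the app's flavor resembles) is sketched below. The class name is made up. It also shows why the first attempt from the question misbehaves: a ^ that is not the first character of a class is a literal caret, so [\w^\s<>.!?] accepts whitespace and the listed punctuation instead of excluding them.

```java
import java.util.regex.Pattern;

final class NegatedClassCheck {
    // True when the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String negated = "[^\\s<>.!?]{2}";
        String attempt = "[\\w^\\s<>.!?]{2}"; // first attempt from the question
        System.out.println(matchesWhole(negated, "a1"));
        System.out.println(matchesWhole(negated, "a "));
        System.out.println(matchesWhole(attempt, "^ "));
    }
}
```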
In general, you can subtract character classes to some degree. For example, to match alphanumerics excluding digits, you can do
[^\W\d]
because [^\W] matches the same as \w, and \d is subtracted from that because it's in a negated character class.
Edit:
Some regex engines (like XPath, .NET and JGSoft) allow flexible character class subtraction like this:
[a-z-[e-g]]
to match any character from the range [a-z], excluding e, f and g. Java does not support this syntax, but its character class intersection gives the same result: [a-z&&[^e-g]].
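Both forms of subtraction can be tried in Java with a sketch like the one below. The class and constant names are illustrative. The && intersection operator is standard java.util.regex syntax, but whether the third-party app from the question supports it is unknown.

```java
import java.util.regex.Pattern;

final class ClassSubtraction {
    // Word characters minus digits, via a negated class.
    static final Pattern WORD_NOT_DIGIT = Pattern.compile("[^\\W\\d]");

    // a-z minus e-g, via Java's character class intersection.
    static final Pattern A_TO_Z_EXCEPT_E_TO_G = Pattern.compile("[a-z&&[^e-g]]");

    static boolean accepts(Pattern p, String oneChar) {
        return p.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(WORD_NOT_DIGIT, "a"));
        System.out.println(accepts(WORD_NOT_DIGIT, "5"));
        System.out.println(accepts(A_TO_Z_EXCEPT_E_TO_G, "d"));
        System.out.println(accepts(A_TO_Z_EXCEPT_E_TO_G, "f"));
    }
}
```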
Another possibility is to use two ranges and combine them; e.g.
([\w]|[^\s<>.!?]){2}
However, this does bring up the question of what you are actually trying to express here. Because this example (as I've rewritten it) doesn't make a lot of sense.
What it says is "a word character, or any character that is not whitespace or certain punctuation". But the class of characters that are not "whitespace or certain punctuation" ALREADY includes all of the word characters. So, unless you mean something different, the \w is redundant.
From your question, it looks like a no-space regex would match your needs, you can achieve that with:
[\S]{2}

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why this happens and how I can optimize the regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
As for why it is crashing, it's the second * around the group. I'm not sure why you need that, since that sounds like you are asking for :
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$
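As a sketch of the stated requirement (replace whatever follows the =), the lookbehind pattern from the first answer can be used like this in Java. The class and method names are hypothetical, and the replacement value is just an example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceValue {
    // Lookbehind anchors the match right after '='; there is no nested quantifier.
    private static final Pattern VALUE = Pattern.compile("(?<=\\=)([\\s\\w\\-.]*)$");

    static String replaceValue(String line, String newValue) {
        // quoteReplacement keeps '$' and '\' in newValue from being interpreted.
        return VALUE.matcher(line).replaceFirst(Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
    }
}
```

Because the key on the left of the = is never matched by the pattern, the input that stalled the original expression is handled without excessive backtracking.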

Regex for checking > 1 upper, lower, digit, and special char

^.*(?=.{15,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!##$%^&+=]).*$
This is the regex I am currently using which will evaluate on 1 of each: upper,lower,digit, and specials of my choosing. The question I have is how do I make it check for 2 of each of these? Also I ask because it is seemingly difficult to write a test case for this as I do not know if it is only evaluating the first set of criteria that it needs. This is for a password, however the requirement is that it needs to be in regex form based on the package we are utilizing.
EDIT
Well as it stands in my haste to validate the expression I forgot to validate my string length. Thanks to Ken and Gumbo on helping me with this.
This is the code I am executing:
I do apologize as regex is not my area.
The password I am using is the following string "$$QiouWER1245", the behavior I am experiencing at the moment is that it randomly chooses to pass or fail. Any thoughts on this?
Pattern pattern = Pattern.compile(regEx);
Matcher match = pattern.matcher(password);
while(match.find()){
System.out.println(match.group());
}
From what I see if it evaluates to true it will throw the value in password back to me else it is an empty string.
Personally, I think a password policy that forces use of all three character classes is not very helpful. You can get the same degree of randomness by letting people make longer passwords. Users will tend to get frustrated and write passwords down if they have to abide by too many password rules (which make the passwords too difficult to remember). I recommend counting bits of entropy and making sure they're greater than 60 (usually requires a 10-14 character password). Entropy per character would depend roughly on the number of characters, the range of character sets they use, and maybe how often they switch between character sets (I would guess that passwords like HEYthere are more predictable than heYThEre).
Another note: do you plan not to count the symbols to the right of the keyboard (period, comma, angle brackets, etc.)?
If you still have to find groups of two characters, why not just repeat each pattern? For example, make (?=.*\d) into (?=.*\d.*\d).
For your test cases, if you are worried that it only checks the first criterion, write a test case that makes sure each of the following passwords fails (each one misses exactly one of the criteria). Just for fun I reversed the order in which each character set is expected, though it probably won't make a difference unless someone removes or forgets the ?= at some future date.
!##TESTwithoutnumbers
TESTwithoutsymbols123
&*(testwithoutuppercase456
+_^TESTWITHOUTLOWERCASE3498
I should point out that technically none of these passwords should be acceptable because they use dictionary words, which have about 2 bits of entropy per character instead of something more like 6. However, I realize that it's difficult to write a (maintainable and efficient) regular expression to check for dictionary words.
Try this:
"^(?=(?:\\D*\\d){2})(?=(?:[^a-z]*[a-z]){2})(?=(?:[^A-Z]*[A-Z]){2})(?=(?:[^!##$%^&*+=]*[!##$%^&*+=]){2}).{15,}$"
Here non-capturing groups (?:…) are used to group the conditions and repeat them. I've also used the complement of each character class for optimization instead of the universal dot (.).
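The pattern above can be exercised in Java with the test passwords suggested earlier in this thread. This is only a sketch with invented names; the passing sample password is my own.

```java
import java.util.regex.Pattern;

final class PasswordRegex {
    // Each lookahead requires two characters of one class; .{15,} enforces length.
    private static final Pattern POLICY = Pattern.compile(
        "^(?=(?:\\D*\\d){2})(?=(?:[^a-z]*[a-z]){2})(?=(?:[^A-Z]*[A-Z]){2})"
        + "(?=(?:[^!##$%^&*+=]*[!##$%^&*+=]){2}).{15,}$");

    static boolean accepts(String password) {
        return POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "!##TESTwithoutnumbers",
            "TESTwithoutsymbols123",
            "&*(testwithoutuppercase456",
            "+_^TESTWITHOUTLOWERCASE3498",
            "!#TESTwithBOTH1234" // hypothetical password meeting every rule
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Using matches() on the whole string gives a single yes/no answer, which is easier to reason about than looping over find() as in the question.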
If I understand your question correctly, you want at least 15 characters, and to require at least 2 uppercase characters, at least 2 lowercase characters, at least 2 digits, and at least 2 special characters. In that case you could do it like this:
^.*(?=.{15,})(?=.*\d.*\d)(?=.*[a-z].*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[!##$%^&*+=].*[!##$%^&*+=]).*$
BTW, your original regex had an extra backslash before the \d
I'm not sure that one big regex is the right way to go here. It already looks far too complicated and will be very difficult to change in the future.
My suggestion is to structure the code in the following way:
check that the string has 2 lower case characters
return failure if not found or continue
check that the string has 2 upper case characters
return failure if not found or continue
etc.
This will also allow you to pass out a return code or errors string specifying why the password was not accepted and the code will be much simpler.
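A rough Java sketch of that step-by-step structure follows. All names and messages are invented for illustration, and the special-character set is copied from the patterns above. It assumes plain code is allowed, which the question says is not the case for the asker's package.

```java
import java.util.Optional;

final class PasswordCheck {
    private static final String SPECIALS = "!#$%^&*+="; // assumed from the regexes above
    private static final int MIN_LENGTH = 15;
    private static final int MIN_PER_CLASS = 2;

    // Returns the first rule the password breaks, or empty if it passes.
    static Optional<String> firstProblem(String password) {
        if (password.length() < MIN_LENGTH) {
            return Optional.of("needs at least " + MIN_LENGTH + " characters");
        }
        int lower = 0, upper = 0, digits = 0, specials = 0;
        for (int i = 0; i < password.length(); i++) {
            char c = password.charAt(i);
            if (c >= 'a' && c <= 'z') lower++;
            else if (c >= 'A' && c <= 'Z') upper++;
            else if (c >= '0' && c <= '9') digits++;
            else if (SPECIALS.indexOf(c) >= 0) specials++;
        }
        if (lower < MIN_PER_CLASS) return Optional.of("needs more lowercase letters");
        if (upper < MIN_PER_CLASS) return Optional.of("needs more uppercase letters");
        if (digits < MIN_PER_CLASS) return Optional.of("needs more digits");
        if (specials < MIN_PER_CLASS) return Optional.of("needs more special characters");
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstProblem("$$QiouWER1245"));
        System.out.println(firstProblem("TESTwithoutsymbols123"));
    }
}
```

Each rule is a separate line, so adding or relaxing one does not require editing a single large expression.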
