Speed up regular expression - java

This is a regex to extract the table name from a SQL statement:
(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)
It matches a token, optionally enclosed in [`'"], preceded by FROM etc. surrounded by whitespace, except for UPDATE which has no leading whitespace.
We execute many regexes, and this is the slowest one, and I'm not sure why. SQL strings can get up to 4k in size, and execution time is at worst 0,35ms on a 2.2GHz i7 MBP.
This is a slow input sample: https://pastebin.com/DnamKDPf
Can we do better? Splitting it up into multiple regexes would be an option, as well if alternation is an issues.

There is a rule of thumb:
Do not let engine make an attempt on matching each single one character if there are some boundaries.
Try the following regex (~2500 steps on the given input string):
(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\S*\s*|\w+\W*(\w[\w\.-]*)
Live demo
Note: What you need is in the first capturing group.
The final regex according to comments (which is a little bit slower than the previous clean one):
(?!(?:FROM|INTO|NEXTVAL|UPDATE|JOIN)\b)\S*\s*|\b(?:NEXTVAL\W*|\w+\s[\s`'"]*)([\[\]\w\.-]+)

Regex optimisation is a very complex topic and should be done with help of some tools. For example, I like Regex101 which calculates for us number of steps Regex engine had to do to match pattern to payload. For your pattern and given example it prints:
1 match, 22976 steps (~19ms)
First thing which you can always do it is grouping similar parts to one group. For example, FROM, INTO and JOIN look similar, so we can write regex as below:
(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)
For above example, Regex101, prints:
1 match, 15891 steps (~13ms)
Try to find some online tools which explain and optimise Regex such as myregextester and calculate how many steps engine needs to do.

Because matches are often near the end, one possibility would be to essentially start at the end and backtrack, rather than start at the beginning and forward-track, something along the lines of
^(?:UPDATE\s|.*(?:\s(?:(?:FROM|INTO|JOIN)\s|NEXTVAL[\s\W]*)))[\s`'\"]*([\w\.-_]+)
https://regex101.com/r/SO7M87/1/ (154 steps)
While this may be much faster when a match exists, it's only a moderate improvement when there's no match, because the pattern must backtrack all the way to the beginning (~9000 steps from ~23k steps)

Related

How to exclude previous captured group

Here is my requirement, I want to recognize a valid String definition in compiler design, the string should either start and end with double quote ("hello world"), or start and end with single quote('hello world').
I used (['"]).*\1 to achieve the goal, the \1 here is to reference previous first captured group, namely first single or double quote, as explanation from regex 101,
\1 matches the same text as most recently matched by the 1st capturing group
It works so far so good.
Then I got new requirement, which is to treat an inner single quote in external single quotes as invalid vase, and same to double quotes situation. Which means both 'hello ' world' and "hello " world" are invalid case.
I think the solution should not be hard if we can represent not previous 1st captured group, something like (['"])(?:NOT\1)*\1.
The (?:) here is used as a non capturing group, to make sure \1 represents to first quote always. But the key is how to replace NOT with correct regex symbol. It's not like my previous experience about exclusion, like [^abcd] to exclude abcd, but to exclude the previous capture group and the symbol ^ doesn't work that way.
The most efficient method for this is probably a simple alternation, like already mentioned by #LorenzHetterich in his first comment. Easy to read, a short pattern and it gets the job done.
^(?:"[^"]*"|'[^']*')$
See this demo at regex101
This just alternates between either pairs of quotes without any of the same quote-type inside.
The technique to exclude a capture between certain parts, that you were outlining is known as tempered greedy token. Best to use it if there are no other options available (not for this task).
^(['"])(?:(?!\1).)*\1$
Another demo at regex101
The greedy dot gets tempered by what was captured in the first group and won't skip over.
Similar to this solution but much more efficient:
• Unrolled star alternation solution: ^(['"])[^"']*+(?:(?!\1)['"][^"']*)*\1$ (efficient)
• Explicit greedy alternation solution: ^(['"])(?:[^"']++|(?!\1)["'])*\1$ (a bit slower)
Especially for the latter use of a possessive quantifier is crucial to avoid runaway issues.
Just for having it mentioned, another option is using a negative lookahead to check after capturing the first match if there are not two more ahead. Also not highly efficient but sometimes useful.
^(['"])(?!(?:.*?\1){2}).*
One more demo at regex101
FYI: If the pattern is used with Java matches(), the ^ start and $ end anchors are not needed.

Regex search with pattern containing (?:.|\s)*? takes increasingly long time

My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches.
I suspect catastrophic backtracking.
Please help! How can I optimize this regex:
String regex = "<tr bgcolor=\"ffffff\">\\s*?<td width=\"20%\"><b>((?:.|\\s)+?): *?</b></td>\\s*?<td width=\"80%\">((?:.|\\s)*?)(?=(?:</td>\\s*?</tr>\\s*?<tr bgcolor=\"ffffff\">)|(?:</td>\\s*?</tr>\\s*?</table>\\s*?<b>Tags</b>))";
EDIT: since it was not clear(my bad): i am trying to take a html formatted document and reformat by extracting the two search groups and adding formating afterwards.
The alternation (?:.|\\s)+? is very inefficient, as it involves too much backtracking.
Basically, all variations of this pattern are extremely inefficient: (?:.|\s)*?, (?:.|\n)*?, (?:.|\r\n)*? and there greedy counterparts, too ((?:.|\s)*, (?:.|\n)*, (?:.|\r\n)*). (.|\s)*? is probably the worst of them all.
Why?
The two alternatives, . and \s may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.
Patterns like (?:.|\n)*? / (?:.|\n)* in Java often cause a Stack Overflow issue, and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.|\n) is the only way to match any char).
Solution
If you want to match any characters including whitespace with regex, do it with
[\\s\\S]*?
Or enable singleline mode with (?s) (or Pattern.DOTALL Matcher option) and just use . (e.g. (?s)start(.*?)end).
NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.

Match contents surrounded by optional group in Java regex

I'm having trouble wrapping my head around how a particular Java regex should be written. The regex will be used in a sequence, and will match sections ending with /.
The problem is that using a simple split won't work because the text before the / can optionally be surrounded by ~. If it is, then the text inside can match anything - including / and ~. The key here is the ending ~/, which is the only way to escape this 'anything goes' sequence if it begins with ~.
Because the regex pattern will be used in a sequence (i.e. (xxx)+), I can't use ^ or $ for non-greedy matching.
Example matches:
foo/
~foo~/
~foo/~/
~foo~~/
~foo/bar~/
and some that wouldn't match:
foo~//
~foo~/bar~/
~foo/
foo~/ (see edit 2)
Is there any way to do this without being redundant with my regexes? What would be the best way to think about matching this? Java doesn't have a conditional modifier (?) so that complicated things in my head a bit more.
EDIT: After working on this in the meantime, the regex ((?:\~())?)(((?!((?!\2)/|\~/)).)+)\1/ gets close but #6 doesn't match.
EDIT 2: After Steve pointed out that there is ambiguity, it became clear #6 shouldn't match.
I don't think that this is a solvable problem. From your givens, these are all acceptable:
~foo/~/
~foo/
foo~/
So, now, let's consider this combination:
~foo/foo~/
What happens here? We have combined the second example and the third example to create an instance of the first example. How do you suggest a correct splitting? As far as I can tell, there's no way to tell if we should be taking the entire expression as one or two valid expressions. Hence, I don't think it's possible to break it up accurately based on your listed restrictions.

How to bound +/* for a regex group?

Say I have the regex:
(CC|NP)*
As such it creates problems in look-before regexes in Java. How shall I write it to avoid those problem?
I thought of re-writing it as:
(CC|NP){1,9}
Testing on regexr it seems like the upperbound is ignored completely.
In Java those quantitiers {} seem to work only on non-group regex elements as in:
\w+\[\S{1,9}\]
Sorry, look behind patterns usually have restrictions on the sub pattern. See f.x. Why doesn't finite repetition in lookbehind work in some flavors?p. Or search for "lookbehind pattern restrictions" on the web.
You may try to write down all fixed length variants of the look behind pattern as alternating pattern. But this might be many...
You may also simulate lookbehind by normally matching the inner pattern and match and group your actual target: (?:CC|NP)*(.*)
I'm not sure of where you percieve the problem. Quantifiers act on groups just like any entity.
So, \w+\[\S{1,9}\] could have been written \w+\[(\S){1,9}\] with the same result.
As far as your example on regexr, nothing is broken there. It matches what it's supposed to.
(PUN|CC|NP){1,3} will greedily try to match any of the alternations (in left-to-right priority). There will be no breaks in what it will match. It matches 1-3 consecutive occurances of PUN or CC or NP.
The sample string you provided had a space between CC's, so since a space does not exist in the regex, it is not matched. The only thing that is matching is a single CC.
If you want to account for a space, it can be added to the grouping like this:
(?:(?:PUN|CC|NP)\s*){1,3}
If you want to only allow spaces between the alternation's, it can be done like this:
(?:PUN|CC|NP)(?:\s*(?:PUN|CC|NP)){0,2}

Regular expression: who's greedier?

My primary concern is with the Java flavor, but I'd also appreciate information regarding others.
Let's say you have a subpattern like this:
(.*)(.*)
Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.
So both are greedy, in that they try to capture as much as possible, only taking less when they have to.
My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?
What about:
(.*)(.*)(.*)
Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?
Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?
Bonus question
What happens if you write something like this:
(.*)(.*?)(.*)
Now \2 is reluctant. Does that mean \1 spits out to \3, and \2 only reluctantly accepts \3's rejection?
Example
Maybe it was a mistake for me not to give concrete examples to show how I'm using these patterns, but here's some:
System.out.println(
"OhMyGod=MyMyMyOhGodOhGodOhGod"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><My><God>"
// same pattern, different input string
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><MyGod><>"
// now \2 is reluctant
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*?)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><><MyGod>"
\1 will have priority, \2 and \3 will always match nothing. \2 will then have priority over \3.
As a general rule think of it like this, back-tracking will only occur to satisfy a match, it will not occur to satisfy greediness, so left is best :)
explaining back tracking and greediness is to much for me to tackle here, i'd suggest friedl's Mastering Regular Expressions
The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.
Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.
Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.
In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.
In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.
It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D
Quantifiers aren't really greedy by default, they're just hasty. In your example, the first (.*) will start out by gobbling up everything it can without regard to the needs of the regex as a whole. Only then does it hand control to the next part, and if necessary it will give back some or all of what it just took (i.e., backtrack) so the rest of the regex can do its work.
That isn't necessary in this case because everything else can legally match zero characters. If the quantifiers were really greedy, the three groups would haggle until they had divided the input as evenly as possible; instead, the second and third groups let the first one keep what it took. They'll take it if it's put in front of them, but they won't fight for it. (That would be true even if they had possessive quantifiers, i.e, (.*)(.*+)(.*+).)
Making the second dot-star reluctant doesn't change anything, but switching the first one does. A reluctant quantifier starts out by matching only as much as it has to, then hands off to the next part. So the first group in (.*?)(.*)(.*) starts out by matching nothing, then the second group gobbles everything, and the third group cries "weee weee weee" all the way home.
Here's a bonus question for you: What happens if you make all three of the quantifiers reluctant? (Hint: In Java this is as much an API question as it is a regex question.)
Regular Expressions work in a sequence, that means the Regex-evaluator will only leave a group when he can't find a solution to that group anymore, and eventually do some backtracking to make the string fit to the next group. If you execute this regex, you will get all your chars evaluated in the first group, none in the next ones (Question-sign doesn't matter either).
As a simple general rule: leftmost quantifier wins. So as long as the following quantifiers identify purely optional subpatterns (regardless of them being made ungreedy), the first takes all.

Categories