Regular expression: who's greedier? - java

My primary concern is with the Java flavor, but I'd also appreciate information regarding others.
Let's say you have a subpattern like this:
(.*)(.*)
Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.
So both are greedy, in that they try to capture as much as possible, only taking less when they have to.
My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?
What about:
(.*)(.*)(.*)
Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?
Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?
Bonus question
What happens if you write something like this:
(.*)(.*?)(.*)
Now \2 is reluctant. Does that mean \1 spits out to \3, and \2 only reluctantly accepts \3's rejection?
Example
Maybe it was a mistake for me not to give concrete examples to show how I'm using these patterns, but here's some:
System.out.println(
"OhMyGod=MyMyMyOhGodOhGodOhGod"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><My><God>"
// same pattern, different input string
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><MyGod><>"
// now \2 is reluctant
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*?)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><><MyGod>"

\1 will have priority, \2 and \3 will always match nothing. \2 will then have priority over \3.
As a general rule think of it like this, back-tracking will only occur to satisfy a match, it will not occur to satisfy greediness, so left is best :)
explaining back tracking and greediness is to much for me to tackle here, i'd suggest friedl's Mastering Regular Expressions

The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.
Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.
Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.
In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.
In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.
It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D

Quantifiers aren't really greedy by default, they're just hasty. In your example, the first (.*) will start out by gobbling up everything it can without regard to the needs of the regex as a whole. Only then does it hand control to the next part, and if necessary it will give back some or all of what it just took (i.e., backtrack) so the rest of the regex can do its work.
That isn't necessary in this case because everything else can legally match zero characters. If the quantifiers were really greedy, the three groups would haggle until they had divided the input as evenly as possible; instead, the second and third groups let the first one keep what it took. They'll take it if it's put in front of them, but they won't fight for it. (That would be true even if they had possessive quantifiers, i.e, (.*)(.*+)(.*+).)
Making the second dot-star reluctant doesn't change anything, but switching the first one does. A reluctant quantifier starts out by matching only as much as it has to, then hands off to the next part. So the first group in (.*?)(.*)(.*) starts out by matching nothing, then the second group gobbles everything, and the third group cries "weee weee weee" all the way home.
Here's a bonus question for you: What happens if you make all three of the quantifiers reluctant? (Hint: In Java this is as much an API question as it is a regex question.)

Regular Expressions work in a sequence, that means the Regex-evaluator will only leave a group when he can't find a solution to that group anymore, and eventually do some backtracking to make the string fit to the next group. If you execute this regex, you will get all your chars evaluated in the first group, none in the next ones (Question-sign doesn't matter either).

As a simple general rule: leftmost quantifier wins. So as long as the following quantifiers identify purely optional subpatterns (regardless of them being made ungreedy), the first takes all.

Related

How to exclude previous captured group

Here is my requirement, I want to recognize a valid String definition in compiler design, the string should either start and end with double quote ("hello world"), or start and end with single quote('hello world').
I used (['"]).*\1 to achieve the goal, the \1 here is to reference previous first captured group, namely first single or double quote, as explanation from regex 101,
\1 matches the same text as most recently matched by the 1st capturing group
It works so far so good.
Then I got new requirement, which is to treat an inner single quote in external single quotes as invalid vase, and same to double quotes situation. Which means both 'hello ' world' and "hello " world" are invalid case.
I think the solution should not be hard if we can represent not previous 1st captured group, something like (['"])(?:NOT\1)*\1.
The (?:) here is used as a non capturing group, to make sure \1 represents to first quote always. But the key is how to replace NOT with correct regex symbol. It's not like my previous experience about exclusion, like [^abcd] to exclude abcd, but to exclude the previous capture group and the symbol ^ doesn't work that way.
The most efficient method for this is probably a simple alternation, like already mentioned by #LorenzHetterich in his first comment. Easy to read, a short pattern and it gets the job done.
^(?:"[^"]*"|'[^']*')$
See this demo at regex101
This just alternates between either pairs of quotes without any of the same quote-type inside.
The technique to exclude a capture between certain parts, that you were outlining is known as tempered greedy token. Best to use it if there are no other options available (not for this task).
^(['"])(?:(?!\1).)*\1$
Another demo at regex101
The greedy dot gets tempered by what was captured in the first group and won't skip over.
Similar to this solution but much more efficient:
• Unrolled star alternation solution: ^(['"])[^"']*+(?:(?!\1)['"][^"']*)*\1$ (efficient)
• Explicit greedy alternation solution: ^(['"])(?:[^"']++|(?!\1)["'])*\1$ (a bit slower)
Especially for the latter use of a possessive quantifier is crucial to avoid runaway issues.
Just for having it mentioned, another option is using a negative lookahead to check after capturing the first match if there are not two more ahead. Also not highly efficient but sometimes useful.
^(['"])(?!(?:.*?\1){2}).*
One more demo at regex101
FYI: If the pattern is used with Java matches(), the ^ start and $ end anchors are not needed.

Speed up regular expression

This is a regex to extract the table name from a SQL statement:
(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)
It matches a token, optionally enclosed in [`'"], preceded by FROM etc. surrounded by whitespace, except for UPDATE which has no leading whitespace.
We execute many regexes, and this is the slowest one, and I'm not sure why. SQL strings can get up to 4k in size, and execution time is at worst 0,35ms on a 2.2GHz i7 MBP.
This is a slow input sample: https://pastebin.com/DnamKDPf
Can we do better? Splitting it up into multiple regexes would be an option, as well if alternation is an issues.
There is a rule of thumb:
Do not let engine make an attempt on matching each single one character if there are some boundaries.
Try the following regex (~2500 steps on the given input string):
(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\S*\s*|\w+\W*(\w[\w\.-]*)
Live demo
Note: What you need is in the first capturing group.
The final regex according to comments (which is a little bit slower than the previous clean one):
(?!(?:FROM|INTO|NEXTVAL|UPDATE|JOIN)\b)\S*\s*|\b(?:NEXTVAL\W*|\w+\s[\s`'"]*)([\[\]\w\.-]+)
Regex optimisation is a very complex topic and should be done with help of some tools. For example, I like Regex101 which calculates for us number of steps Regex engine had to do to match pattern to payload. For your pattern and given example it prints:
1 match, 22976 steps (~19ms)
First thing which you can always do it is grouping similar parts to one group. For example, FROM, INTO and JOIN look similar, so we can write regex as below:
(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)
For above example, Regex101, prints:
1 match, 15891 steps (~13ms)
Try to find some online tools which explain and optimise Regex such as myregextester and calculate how many steps engine needs to do.
Because matches are often near the end, one possibility would be to essentially start at the end and backtrack, rather than start at the beginning and forward-track, something along the lines of
^(?:UPDATE\s|.*(?:\s(?:(?:FROM|INTO|JOIN)\s|NEXTVAL[\s\W]*)))[\s`'\"]*([\w\.-_]+)
https://regex101.com/r/SO7M87/1/ (154 steps)
While this may be much faster when a match exists, it's only a moderate improvement when there's no match, because the pattern must backtrack all the way to the beginning (~9000 steps from ~23k steps)

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:
(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))
This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.
Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.
Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Proof
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
Note
Component
Description
(?=\()
Make sure '(' follows before doing any hard work.
(?:
Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '('
(?=
This lookahead deals with finding the next '('.
.*?\((?!.*?\1)
Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again
(.*\)(?!.*\2).*)
Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')'
(?=
This lookahead deals with finding the next ')'
.*?\)(?!.*?\2)
Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)
Fill \2 with the rest of the string. The above.mentioned bug is not applicable here, so a simple expression is sufficient.
)
.
Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+?
Match as few times as possible until a balanced group has been found. This is validated by the following check
Final validation
.*?(?=\1)
Match up to and including the last '(' found.
[^(]*(?=\2$)
Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).
Conclusion
So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.
I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.
Brief
Input Corrections
First of all, your input is incorrect as there's an extra parenthesis (as shown below)
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:
Extra parenthesis removed
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
^
Additional parenthesis added to match extra closing parenthesis
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Regex Capabilities
Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).
This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.
Input
Considering the original input is actually invalid, we'll use the following inputs to test against.
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
Testing against these inputs should yield the following results:
INVALID (no match)
VALID (match)
VALID (match)
Code
There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).
See regex in use here
Using DEFINE block
(?(DEFINE)
(?<value>[^()\r\n]+)
(?<groupVal>(?&group)|(?&value))
(?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$
Note: This regex uses the flags gmx
Without DEFINE block
See regex in use here
^(?<group>
(?<value>[^()\r\n]+)*
\((?<groupVal>(?&group)|(?&value))\)
(?&groupVal)*
)$
Note: This regex uses the flags gmx
Without x modifier (one-liner)
See regex in use here
^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$
Without named (groups & references)
See regex in use here
^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$
Note: This is the shortest possible method that I could come up with.
Explanation
I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.
^ Assert position at the start of the line
(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
([^()\r\n]+)* Capture the following into capture group 2 any number of times
[^()\r\n]+ Match any character not present in the set ()\r\n one or more times
\( Match a left/opening parenthesis character ( literally
((?1)|(?2)) Capture either of the following into capture group 3
(?1) Recurse the first subpattern (1)
(?2) Recurse the second subpattern (2)
\) Match a right/closing parenthesis character ) literally
(?3)* Recurse the third subpattern (3) any number of times
$ Assert position at the end of the line

Regex to capture text with unknown number of repeated groups in between

I'm trying to parse the number that follows "Dining:" in the following text, under SECOND LEVEL. So '666' should be returned.
MAIN LEVEL
Entrance: 11
Dining: 33
SECOND LEVEL
Entrance: 4444
Living: 5555
Dining: 666
THIRD LEVEL
Dining: 999
Kitchen: 000
Family: 33332
If I use something like (?:\bDining:\s)(.*\b) then it captures the first occurrence under MAIN. I'm trying to therefore specify SECOND LEVEL in the regex, followed by a repeating pattern of: new lines, multiple spaces, and then any text, until Dining: is found. This demo illustrates the two problems I encounter. The regex used is: (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b)
A "Catastrophic backtracking" error appears until you delete the very last line containing Laundry: 1. Is this caused by too many matches or something?
Once you delete that line, the regex captures only the last match under OTHER LEVEL .. returning '2' as opposed to the match under SECOND LEVEL.
Sometimes Dining: will not exist under SECOND LEVEL and therefore nothing should be returned.
What is a regex that will only capture the SECOND LEVEL's Dining: number, and if it doesn't exist then returns nothing? Straight up regex preferred, no looping in Java if possible. Thanks
Use a negative lookahead based regex.
"(?m)^\\s*\\bSECOND LEVEL\\n(?:(?!\\n\\n)[\\s\\S])*\\bDining:\\s*(\\d+)"
DEMO
The best example I know of for catastrophic backtracking from here is (x+x+)+y. That is to say, it cannot work out the correct boundaries for the capture groups containing x because there are too many ways to divide them.
xxxxy is the first two + once, the third twice, or each of the first twice and the third once, or either of the first thrice, the other once and the last once. As you can see that gets dangerous!
You had (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b) note the (\n\s+.*)*
the .* can be a nightmare when combined with the previous \n\s and enclosed with a *. It should be rewritten (\n\s+[^\s\n][^\n]*)* this ensures each quantifier ends before the next begins, minimising backtracking.
With this kind of thinking in mind I came up with the following regex to match your string:
(?<=SECOND LEVEL\n)(?:\s+(?:[^\s\n:][^\n:]*):[^\n]*)*\s+Dining:\s*([^\s\n][^\n$]*)

This regex line exceeds my understanding "(?=(?:\d{3})++(?!\d))"

i am pretty ok with basic reg-ex. but this line of code used to make the thousand separation in large numbers exceeds my knowledge and googling it quite a bit did also not satisfy my curiosity. can one of u please take a minute to explain to me the following line of code?
someString.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
i especially don't understand the regex structure "(?=(?:\d{3})++(?!\d))".
thanks a lot in advance.
"(?=(?:\d{3})++(?!\d))" is a lookahead assertion.
It means "only match if followed by ((three digits that we don't need to capture) repeated one or more times (and again repeated one or more times) (not followed by a digit))". See this explanation about (?:...) notation. It's called non-capturing group and means you don't need to reference this group after the match.
"(\\G-?\\d{1,3})" is the part that should actually match (but only if the above-described conditions are met).
Edit: I think + must be a special character, otherwise it's just a plus. If it's a special character (and quick search suggests that it is in Java, too), the second one is redundant.
Edit 2: Thanks to Alan Moore, it's now clear. The second + means possessive matching, so it means that if after checking as many 3-digit groups as possible it won't find that they're not followed by a non-digit, the engine will immediately give up instead of stepping one 3-digit group back.
this expression has some advanced stuff in it.
first , the easiest: \d{3} means exactly three digits. These are your thousands.
then: the ++ is a variant of + (which means one or more), but possessive, which means it will eat all of the thousands. Im not completely sure why this is necessary.
?:means it is a non-capturing group - i think this is just there for performance reasons and could be omitted.
?=is a positive lookahead - i think this means it is only checked whether that group exists but will not count towards the matched string - meaning it wont be replaced.
?! is a negative lookahead - i dont quite understand that but i think it means it must NOT match, which in turn means there cannot be another digit at the end of the matched sequence. This makes sure the first group gets the right digits. E.g. 10000 can only be matched as 10(000) but not 1(000)0 if you see what i mean.
Through the lookaheads, if i understand it correctly (i havent tested it), only the first group would actually be replaced, as it is the one that matches.
To me, the most interesting part of that regex is the \G. It took me a while to remember what it's for: to prevent adding commas to the fraction part if there is one. If the regex were simply:
(-?\d{1,3})(?=(?:\d{3})++(?!\d))
...this number:
12345.67890
...would end up as:
12,345.67,890
But adding \G to the beginning means a match can only start at the beginning of the string or at the position where the previous match ended. So it doesn't match 345 because of the . following it, and it doesn't match 67 because it would have to skip over some of the string to do so. And so it correctly returns:
12,345.67890
I know this isn't an answer to the question, but I thought it was worth a mention.

Categories