How exactly does the possessive quantifier work? - java

At the end of the page there is at attempted explanation of how do greedy, reluctant and possessive quantifiers work: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
However I tried myself an example and I don't seem to understand it fully.
I will paste my results directly:
Enter your regex: .*+foo
Enter input string to search: xfooxxxxxxfoo
No match found.
Enter your regex: (.*)+foo
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Why does the first reg.exp. find no match and the second does?
What is the exact difference between those 2 reg.exp.?

The + after another quantifier means "don't allow the regex engine to backtrack into whatever the previous token has matched". (See a tutorial on possessive quantifiers here).
So when you apply .*foo to "xfooxxxxxxfoo", the .* first matches the entire string. Then, since foo can't be matched, the regex engine backtracks until that's possible, achieving a match when .* has matched "xfooxxxxxx" and foo has matched "foo".
Now the additional + prevents that backtracking from happening, so the match fails.
When you write (.*)+foo. the + takes on an entirely different meaning; now it means "one or more of the preceding token". You've created nested quantifiers, which is not a good idea, by the way. If you apply that regex to a string like "xfoxxxxxxxxxfox", you'll run into catastrophic backtracking.

The possessive quantifier takes the entire string and checks if it matches, if not it fails. In your case xfooxxxxxxfoo matches the .*+ but then you ask for another foo, which isn't present, so the matcher fails.
The greedy quantifier first does the same, but instead of failing it "backs off" and tries again:
xfooxxxxxxfoo fail
xfooxxxxxxfo fail
xfooxxxxxxf fail
xfooxxxxxx match
In your second regex you ask for something else by confusing the grouping mechanism. You ask for "one or more matches of (.*)" as the + now relates to the () and there is one match.

Related

Regex will not matching a word [duplicate]

What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

Confusion on Quantifiers? [duplicate]

I found this tutorial on regular expressions and while I intuitively understand what "greedy", "reluctant" and "possessive" qualifiers do, there seems to be a serious hole in my understanding.
Specifically, in the following example:
Enter your regex: .*foo // Greedy qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Enter your regex: .*?foo // Reluctant qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Enter your regex: .*+foo // Possessive qualifier
Enter input string to search: xfooxxxxxxfoo
No match found.
The explanation mentions eating the entire input string, letters been consumed, matcher backing off, rightmost occurrence of "foo" has been regurgitated, etc.
Unfortunately, despite the nice metaphors, I still don't understand what is eaten by whom... Do you know of another tutorial that explains (concisely) how regular expression engines work?
Alternatively, if someone can explain in somewhat different phrasing the following paragraph, that would be much appreciated:
The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends.
The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
I'll give it a shot.
A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!
A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!
In your example, it then starts the process over with the remaining unmatched portion of the string, "xxxxxxfoo", following the same process.
A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.
It is just my practice output to visualise the scene-
I haven't heard the exact terms 'regurgitate' or 'backing off' before; the phrase that would replace these is "backtracking", but 'regurgitate' seems like as good a phrase as any for "the content that had been tentatively accepted before backtracking threw it away again".
The important thing to realize about most regex engines is that they are backtracking: they will tentatively accept a potential, partial match, while trying to match the entire contents of the regex. If the regex cannot be completely matched at the first attempt, then the regex engine will backtrack on one of its matches. It will try matching *, +, ?, alternation, or {n,m} repetition differently, and try again. (And yes, this process can take a long time.)
The first example uses the greedy
quantifier .* to find "anything", zero
or more times, followed by the letters
"f" "o" "o". Because the quantifier is
greedy, the .* portion of the
expression first eats the entire input
string. At this point, the overall
expression cannot succeed, because the
last three letters ("f" "o" "o") have
already been consumed (by whom?).
The last three letters, f, o, and o were already consumed by the initial .* portion of the rule. However, the next element in the regex, f, has nothing left in the input string. The engine will be forced to backtrack on its initial .* match, and try matching all-but-the-last character. (It might be smart and backtrack to all-but-the-last-three, because it has three literal terms, but I'm unaware of implementation details at this level.)
So the matcher
slowly backs off (from right-to-left?) one letter at a time
until the rightmost occurrence of
"foo" has been regurgitated (what does this mean?), at which
This means the foo had tentatively been including when matching .*. Because that attempt failed, the regex engine tries accepting one fewer character in .*. If there had been a successful match before the .* in this example, then the engine would probably try shortening the .* match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .* in my hypothetical example.
point the match succeeds and the
search ends.
The second example, however, is
reluctant, so it starts by first
consuming (by whom?) "nothing". Because "foo"
The initial nothing is consumed by .?*, which will consume the shortest possible amount of anything that allows the rest of the regex to match.
doesn't appear at the beginning of the
string, it's forced to swallow (who swallows?) the
Again the .?* consumes the first character, after backtracking on the initial failure to match the entire regex with the shortest possible match. (In this case, the regex engine is extending the match for .*? from left-to-right, because .*? is reluctant.)
first letter (an "x"), which triggers
the first match at 0 and 4. Our test
harness continues the process until
the input string is exhausted. It
finds another match at 4 and 13.
The third example fails to find a
match because the quantifier is
possessive. In this case, the entire
input string is consumed by .*+, (how?)
A .*+ will consume as much as possible, and will not backtrack to find new matches when the regex as a whole fails to find a match. Because the possessive form does not perform backtracking, you probably won't see many uses with .*+, but rather with character classes or similar restrictions: account: [[:digit:]]*+ phone: [[:digit:]]*+.
This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3) to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)
But more than potential speed ups, this also can let you write regexs that match exactly what you need to match. I'm having trouble coming up with an easy example :) but writing a regex using possessive vs greedy quantifiers can give you different matches, and one or the other may be more appropriate.
leaving nothing left over to satisfy
the "foo" at the end of the
expression. Use a possessive
quantifier for situations where you
want to seize all of something without
ever backing off (what does back off mean?); it will outperform
"Backing off" in this context means "backtracking" -- throwing away a tentative partial match to try another partial match, which may or may not succeed.
the equivalent greedy quantifier in
cases where the match is not
immediately found.
http://swtch.com/~rsc/regexp/regexp1.html
I'm not sure that's the best explanation on the internet, but it's reasonably well written and appropriately detailed, and I keep coming back to it. You might want to check it out.
If you want a higher-level (less detailed explanation), for simple regular expressions such as the one you're looking at, a regular expression engine works by backtracking. Essentially, it chooses ("eats") a section of the string and tries to match the regular expression against that section. If it matches, great. If not, the engine alters its choice of the section of the string and tries to match the regexp against that section, and so on, until it's tried every possible choice.
This process is used recursively: in its attempt to match a string with a given regular expression, the engine will split the regular expression into pieces and apply the algorithm to each piece individually.
The difference between greedy, reluctant, and possessive quantifiers enters when the engine is making its choices of what part of the string to try to match against, and how to modify that choice if it doesn't work the first time. The rules are as follows:
A greedy quantifier tells the engine to start with the entire string (or at least, all of it that hasn't already been matched by previous parts of the regular expression) and check whether it matches the regexp. If so, great; the engine can continue with the rest of the regexp. If not, it tries again, but trimming one character (the last one) off the section of the string to be checked. If that doesn't work, it trims off another character, etc. So a greedy quantifier checks possible matches in order from longest to shortest.
A reluctant quantifier tells the engine to start with the shortest possible piece of the string. If it matches, the engine can continue; if not, it adds one character to the section of the string being checked and tries that, and so on until it finds a match or the entire string has been used up. So a reluctant quantifier checks possible matches in order from shortest to longest.
A possessive quantifier is like a greedy quantifier on the first attempt: it tells the engine to start by checking the entire string. The difference is that if it doesn't work, the possessive quantifier reports that the match failed right then and there. The engine doesn't change the section of the string being looked at, and it doesn't make any more attempts.
This is why the possessive quantifier match fails in your example: the .*+ gets checked against the entire string, which it matches, but then the engine goes on to look for additional characters foo after that - but of course it doesn't find them, because you're already at the end of the string. If it were a greedy quantifier, it would backtrack and try making the .* only match up to the next-to-last character, then up to the third to last character, then up to the fourth to last character, which succeeds because only then is there a foo left after the .* has "eaten" the earlier part of the string.
Here is my take using Cell and Index positions (See the diagram here to distinguish a Cell from an Index).
Greedy - Match as much as possible to the greedy quantifier and the entire regex. If there is no match, backtrack on the greedy quantifier.
Input String: xfooxxxxxxfoo
Regex: .*foo
The above Regex has two parts:
(i)'.*' and
(ii)'foo'
Each of the steps below will analyze the two parts. Additional comments for a match to 'Pass' or 'Fail' is explained within braces.
Step 1:
(i) .* = xfooxxxxxxfoo - PASS ('.*' is a greedy quantifier and will use the entire Input String)
(ii) foo = No character left to match after index 13 - FAIL
Match failed.
Step 2:
(i) .* = xfooxxxxxxfo - PASS (Backtracking on the greedy quantifier '.*')
(ii) foo = o - FAIL
Match failed.
Step 3:
(i) .* = xfooxxxxxxf - PASS (Backtracking on the greedy quantifier '.*')
(ii) foo = oo - FAIL
Match failed.
Step 4:
(i) .* = xfooxxxxxx - PASS (Backtracking on the greedy quantifier '.*')
(ii) foo = foo - PASS
Report MATCH
Result: 1 match(es)
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Reluctant - Match as little as possible to the reluctant quantifier and match the entire regex. if there is no match, add characters to the reluctant quantifier.
Input String: xfooxxxxxxfoo
Regex: .*?foo
The above regex has two parts:
(i) '.*?' and
(ii) 'foo'
Step 1:
.*? = '' (blank) - PASS (Match as little as possible to the reluctant quantifier '.*?'. Index 0 having '' is a match.)
foo = xfo - FAIL (Cell 0,1,2 - i.e index between 0 and 3)
Match failed.
Step 2:
.*? = x - PASS (Add characters to the reluctant quantifier '.*?'. Cell 0 having 'x' is a match.)
foo = foo - PASS
Report MATCH
Step 3:
.*? = '' (blank) - PASS (Match as little as possible to the reluctant quantifier '.*?'. Index 4 having '' is a match.)
foo = xxx - FAIL (Cell 4,5,6 - i.e index between 4 and 7)
Match failed.
Step 4:
.*? = x - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4.)
foo = xxx - FAIL (Cell 5,6,7 - i.e index between 5 and 8)
Match failed.
Step 5:
.*? = xx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 5.)
foo = xxx - FAIL (Cell 6,7,8 - i.e index between 6 and 9)
Match failed.
Step 6:
.*? = xxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 6.)
foo = xxx - FAIL (Cell 7,8,9 - i.e index between 7 and 10)
Match failed.
Step 7:
.*? = xxxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 7.)
foo = xxf - FAIL (Cell 8,9,10 - i.e index between 8 and 11)
Match failed.
Step 8:
.*? = xxxxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 8.)
foo = xfo - FAIL (Cell 9,10,11 - i.e index between 9 and 12)
Match failed.
Step 9:
.*? = xxxxxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 9.)
foo = foo - PASS (Cell 10,11,12 - i.e index between 10 and 13)
Report MATCH
Step 10:
.*? = '' (blank) - PASS (Match as little as possible to the reluctant quantifier '.*?'. Index 13 is blank.)
foo = No character left to match - FAIL (There is nothing after index 13 to match)
Match failed.
Result: 2 match(es)
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Possessive - Match as much as possible to the possessive quantifer and match the entire regex. Do NOT backtrack.
Input String: xfooxxxxxxfoo
Regex: .*+foo
The above regex has two parts: '.*+' and 'foo'.
Step 1:
.*+ = xfooxxxxxxfoo - PASS (Match as much as possible to the possessive quantifier '.*')
foo = No character left to match - FAIL (Nothing to match after index 13)
Match failed.
Note: Backtracking is not allowed.
Result: 0 match(es)
Greedy: "match the longest possible sequence of characters"
Reluctant: "match the shortest possible sequence of characters"
Possessive: This is a bit strange as it does NOT (in contrast to greedy and reluctant) try to find a match for the whole regex.
By the way: No regex pattern matcher implementation will ever use backtracking. All real-life pattern matcher are extremely fast - nearly independent of the complexity of the regular expression!
Greedy Quantification involves pattern matching using all of the remaining unvalidated characters of a string during an iteration. Unvalidated characters start in the active sequence. Every time a match does not occur, the character at the end is quarantined and the check is performed again.
When only leading conditions of the regex pattern are satisfied by the active sequence, an attempt is made to validate the remaining conditions against the quarantine. If this validation is successful, matched characters in the quarantine are validated and residual unmatched characters remain unvalidated and will be used when the process begins anew in the next iteration.
The flow of characters is from the active sequence into the quarantine. The resulting behavior is that as much of the original sequence is included in a match as possible.
Reluctant Quantification is mostly the same as greedy qualification except the flow of characters is the opposite--that is, they start in the quarantine and flow into the active sequence. The resulting behavior is that as little of the original sequence is included in a match as possible.
Possessive Quantification does not have a quarantine and includes everything in a fixed active sequence.

Regular expression not extracting correctly (Java Pattern / Matcher) [duplicate]

I found this tutorial on regular expressions and while I intuitively understand what "greedy", "reluctant" and "possessive" qualifiers do, there seems to be a serious hole in my understanding.
Specifically, in the following example:
Enter your regex: .*foo // Greedy qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Enter your regex: .*?foo // Reluctant qualifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Enter your regex: .*+foo // Possessive qualifier
Enter input string to search: xfooxxxxxxfoo
No match found.
The explanation mentions eating the entire input string, letters been consumed, matcher backing off, rightmost occurrence of "foo" has been regurgitated, etc.
Unfortunately, despite the nice metaphors, I still don't understand what is eaten by whom... Do you know of another tutorial that explains (concisely) how regular expression engines work?
Alternatively, if someone can explain in somewhat different phrasing the following paragraph, that would be much appreciated:
The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed [by whom?]. So the matcher slowly backs off [from right-to-left?] one letter at a time until the rightmost occurrence of "foo" has been regurgitated [what does this mean?], at which point the match succeeds and the search ends.
The second example, however, is reluctant, so it starts by first consuming [by whom?] "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow [who swallows?] the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+ [how?], leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off [what does back off mean?]; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
I'll give it a shot.
A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less character (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step, making the greedy quantifier match one less character again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!
A reluctant or "non-greedy" quantifier first matches as little as possible. So the .* matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more character (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!
In your example, it then starts the process over with the remaining unmatched portion of the string, "xxxxxxfoo", following the same process.
A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.
It is just my practice output to visualise the scene-
I haven't heard the exact terms 'regurgitate' or 'backing off' before; the phrase that would replace these is "backtracking", but 'regurgitate' seems like as good a phrase as any for "the content that had been tentatively accepted before backtracking threw it away again".
The important thing to realize about most regex engines is that they are backtracking: they will tentatively accept a potential, partial match, while trying to match the entire contents of the regex. If the regex cannot be completely matched at the first attempt, then the regex engine will backtrack on one of its matches. It will try matching *, +, ?, alternation, or {n,m} repetition differently, and try again. (And yes, this process can take a long time.)
The first example uses the greedy
quantifier .* to find "anything", zero
or more times, followed by the letters
"f" "o" "o". Because the quantifier is
greedy, the .* portion of the
expression first eats the entire input
string. At this point, the overall
expression cannot succeed, because the
last three letters ("f" "o" "o") have
already been consumed (by whom?).
The last three letters, f, o, and o were already consumed by the initial .* portion of the rule. However, the next element in the regex, f, has nothing left in the input string. The engine will be forced to backtrack on its initial .* match, and try matching all-but-the-last character. (It might be smart and backtrack to all-but-the-last-three, because it has three literal terms, but I'm unaware of implementation details at this level.)
So the matcher
slowly backs off (from right-to-left?) one letter at a time
until the rightmost occurrence of
"foo" has been regurgitated (what does this mean?), at which
This means the foo had tentatively been including when matching .*. Because that attempt failed, the regex engine tries accepting one fewer character in .*. If there had been a successful match before the .* in this example, then the engine would probably try shortening the .* match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .* in my hypothetical example.
point the match succeeds and the
search ends.
The second example, however, is
reluctant, so it starts by first
consuming (by whom?) "nothing". Because "foo"
The initial nothing is consumed by .?*, which will consume the shortest possible amount of anything that allows the rest of the regex to match.
doesn't appear at the beginning of the
string, it's forced to swallow (who swallows?) the
Again the .?* consumes the first character, after backtracking on the initial failure to match the entire regex with the shortest possible match. (In this case, the regex engine is extending the match for .*? from left-to-right, because .*? is reluctant.)
first letter (an "x"), which triggers
the first match at 0 and 4. Our test
harness continues the process until
the input string is exhausted. It
finds another match at 4 and 13.
The third example fails to find a
match because the quantifier is
possessive. In this case, the entire
input string is consumed by .*+, (how?)
A .*+ will consume as much as possible, and will not backtrack to find new matches when the regex as a whole fails to find a match. Because the possessive form does not perform backtracking, you probably won't see many uses with .*+, but rather with character classes or similar restrictions: account: [[:digit:]]*+ phone: [[:digit:]]*+.
This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3) to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)
But more than potential speed ups, this also can let you write regexs that match exactly what you need to match. I'm having trouble coming up with an easy example :) but writing a regex using possessive vs greedy quantifiers can give you different matches, and one or the other may be more appropriate.
leaving nothing left over to satisfy
the "foo" at the end of the
expression. Use a possessive
quantifier for situations where you
want to seize all of something without
ever backing off (what does back off mean?); it will outperform
"Backing off" in this context means "backtracking" -- throwing away a tentative partial match to try another partial match, which may or may not succeed.
the equivalent greedy quantifier in
cases where the match is not
immediately found.
http://swtch.com/~rsc/regexp/regexp1.html
I'm not sure that's the best explanation on the internet, but it's reasonably well written and appropriately detailed, and I keep coming back to it. You might want to check it out.
If you want a higher-level (less detailed explanation), for simple regular expressions such as the one you're looking at, a regular expression engine works by backtracking. Essentially, it chooses ("eats") a section of the string and tries to match the regular expression against that section. If it matches, great. If not, the engine alters its choice of the section of the string and tries to match the regexp against that section, and so on, until it's tried every possible choice.
This process is used recursively: in its attempt to match a string with a given regular expression, the engine will split the regular expression into pieces and apply the algorithm to each piece individually.
The difference between greedy, reluctant, and possessive quantifiers enters when the engine is making its choices of what part of the string to try to match against, and how to modify that choice if it doesn't work the first time. The rules are as follows:
A greedy quantifier tells the engine to start with the entire string (or at least, all of it that hasn't already been matched by previous parts of the regular expression) and check whether it matches the regexp. If so, great; the engine can continue with the rest of the regexp. If not, it tries again, but trimming one character (the last one) off the section of the string to be checked. If that doesn't work, it trims off another character, etc. So a greedy quantifier checks possible matches in order from longest to shortest.
A reluctant quantifier tells the engine to start with the shortest possible piece of the string. If it matches, the engine can continue; if not, it adds one character to the section of the string being checked and tries that, and so on until it finds a match or the entire string has been used up. So a reluctant quantifier checks possible matches in order from shortest to longest.
A possessive quantifier is like a greedy quantifier on the first attempt: it tells the engine to start by checking the entire string. The difference is that if it doesn't work, the possessive quantifier reports that the match failed right then and there. The engine doesn't change the section of the string being looked at, and it doesn't make any more attempts.
This is why the possessive quantifier match fails in your example: the .*+ gets checked against the entire string, which it matches, but then the engine goes on to look for additional characters foo after that - but of course it doesn't find them, because you're already at the end of the string. If it were a greedy quantifier, it would backtrack and try making the .* only match up to the next-to-last character, then up to the third to last character, then up to the fourth to last character, which succeeds because only then is there a foo left after the .* has "eaten" the earlier part of the string.
Here is my take using Cell and Index positions (See the diagram here to distinguish a Cell from an Index).
Greedy - Match as much as possible to the greedy quantifier and the entire regex. If there is no match, backtrack on the greedy quantifier.
Input String: xfooxxxxxxfoo
Regex: .*foo
The above Regex has two parts:
(i)'.*' and
(ii)'foo'
Each of the steps below will analyze the two parts. Additional comments for a match to 'Pass' or 'Fail' is explained within braces.
Step 1:
(i) .* = xfooxxxxxxfoo - PASS ('.*' is a greedy quantifier and will use the entire Input String)
(ii) foo = No character left to match after index 13 - FAIL
Match failed.
Step 2:
(i) .* = xfooxxxxxxfo - PASS (Backtracking on the greedy quantifier '.*')
(ii) foo = o - FAIL
Match failed.
Step 3:
(i) .* = xfooxxxxxxf - PASS (Backtracking on the greedy quantifier '.*')
(ii) foo = oo - FAIL
Match failed.
Step 4:
(i) .* = xfooxxxxxx - PASS (Backtracking on the greedy quantifier '.*')
(ii) foo = foo - PASS
Report MATCH
Result: 1 match(es)
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Reluctant - Match as little as possible to the reluctant quantifier and match the entire regex. if there is no match, add characters to the reluctant quantifier.
Input String: xfooxxxxxxfoo
Regex: .*?foo
The above regex has two parts:
(i) '.*?' and
(ii) 'foo'
Step 1:
.*? = '' (blank) - PASS (Match as little as possible to the reluctant quantifier '.*?'. Index 0 having '' is a match.)
foo = xfo - FAIL (Cell 0,1,2 - i.e index between 0 and 3)
Match failed.
Step 2:
.*? = x - PASS (Add characters to the reluctant quantifier '.*?'. Cell 0 having 'x' is a match.)
foo = foo - PASS
Report MATCH
Step 3:
.*? = '' (blank) - PASS (Match as little as possible to the reluctant quantifier '.*?'. Index 4 having '' is a match.)
foo = xxx - FAIL (Cell 4,5,6 - i.e index between 4 and 7)
Match failed.
Step 4:
.*? = x - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4.)
foo = xxx - FAIL (Cell 5,6,7 - i.e index between 5 and 8)
Match failed.
Step 5:
.*? = xx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 5.)
foo = xxx - FAIL (Cell 6,7,8 - i.e index between 6 and 9)
Match failed.
Step 6:
.*? = xxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 6.)
foo = xxx - FAIL (Cell 7,8,9 - i.e index between 7 and 10)
Match failed.
Step 7:
.*? = xxxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 7.)
foo = xxf - FAIL (Cell 8,9,10 - i.e index between 8 and 11)
Match failed.
Step 8:
.*? = xxxxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 8.)
foo = xfo - FAIL (Cell 9,10,11 - i.e index between 9 and 12)
Match failed.
Step 9:
.*? = xxxxxx - PASS (Add characters to the reluctant quantifier '.*?'. Cell 4 thru 9.)
foo = foo - PASS (Cell 10,11,12 - i.e index between 10 and 13)
Report MATCH
Step 10:
.*? = '' (blank) - PASS (Match as little as possible to the reluctant quantifier '.*?'. Index 13 is blank.)
foo = No character left to match - FAIL (There is nothing after index 13 to match)
Match failed.
Result: 2 match(es)
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Possessive - Match as much as possible to the possessive quantifer and match the entire regex. Do NOT backtrack.
Input String: xfooxxxxxxfoo
Regex: .*+foo
The above regex has two parts: '.*+' and 'foo'.
Step 1:
.*+ = xfooxxxxxxfoo - PASS (Match as much as possible to the possessive quantifier '.*')
foo = No character left to match - FAIL (Nothing to match after index 13)
Match failed.
Note: Backtracking is not allowed.
Result: 0 match(es)
Greedy: "match the longest possible sequence of characters"
Reluctant: "match the shortest possible sequence of characters"
Possessive: This is a bit strange as it does NOT (in contrast to greedy and reluctant) try to find a match for the whole regex.
By the way: No regex pattern matcher implementation will ever use backtracking. All real-life pattern matcher are extremely fast - nearly independent of the complexity of the regular expression!
Greedy Quantification involves pattern matching using all of the remaining unvalidated characters of a string during an iteration. Unvalidated characters start in the active sequence. Every time a match does not occur, the character at the end is quarantined and the check is performed again.
When only leading conditions of the regex pattern are satisfied by the active sequence, an attempt is made to validate the remaining conditions against the quarantine. If this validation is successful, matched characters in the quarantine are validated and residual unmatched characters remain unvalidated and will be used when the process begins anew in the next iteration.
The flow of characters is from the active sequence into the quarantine. The resulting behavior is that as much of the original sequence is included in a match as possible.
Reluctant Quantification is mostly the same as greedy qualification except the flow of characters is the opposite--that is, they start in the quarantine and flow into the active sequence. The resulting behavior is that as little of the original sequence is included in a match as possible.
Possessive Quantification does not have a quarantine and includes everything in a fixed active sequence.

Regular Expression: Backtracking in Possessive Quantifier

I was reviewing a test and noticed that a possessive quantifier actually worked in an str.split(). so I wrote the following code:
String str = "aaaaab";
if(str.matches("a*+b"))
System.out.println("I backtrack");
else
System.out.println("Nope.");
When run, this prints out I backtrack. Here's why this is confusing, I've been told that a possessive quantifier never backtracks, so why would a*+ give up the b in the String?
What I want is a more detailed explanation of when possessive quantifiers backtrack.
There is no backtracking in your example.
You are saying "any number of a characters". So, the engine will collect those 5 a chars and then stop; to then find the b.
That is all there is to this.
Backtracking means that the engine has to "track back" after collecting too much of the input string; see here for an example of that.
And beyond that: your if condition returns true when the pattern matches the input. Your conclusion that this means "it is backtracking" is not correct:
A match is a match; no matter if the engine had to backtrack in order to match (or not). In other words: your little test there doesn't tell you anything (it only tells you if the input matched the given pattern; period).

Java Regex how to get all the matching occurences of a pattern [duplicate]

What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

Categories