I'm reading about regular expression in Java. And I understand that possessive quantifiers do not backtrack and release characters to give a chance for other group to achieve a match.
But I couldn't figure any situations where possessive quantifiers are used in reality.
I have read some resources saying that since possessive quantifiers don't backtrack, they don't need to remember the position of each character in the input string, which helps to significantly improve performance of the regular expression engine.
I have tested this by writing an example:
I have a string containing about thousands of digits.
First I defined a greedy: String regex = "(\d+)";
Then I counted the time it took.
Second: I change to possessive: String regex = "(\d++)";
Also I counted the time it took but I don't see any difference in time
Am I misunderstanding something?
And besides, can anyone give me some specific cases where possessive quantifiers are in use?
And about the term: In the book "Java Regular Expressions Taming the Java.Util.Regex Engine by Mehran Habibi" he used the term "possessive qualifiers", while I read in the Internet, people used "Possessive quantifier". Which one is correct or both?
Possessive quantifiers are quantifiers that are greedy (they try to match as many characters as possible) and don't backtrack (it is possible matching fails if the possessive quantifiers go to far).
Example
Normal (greedy) quantifiers
Say you have the following regex:
^([A-Za-z0-9]+)([A-Z0-9][A-Z0-9])(.*)
The regex aims to match "one or more alphanumerical-characters (case independent) [A-Za-z0-9] and should end with two alphanumerical characters and then any character can occur.
Any string that obeys this constraint will match. AAA as well. One can claim that the second and the third A should belong to the second group, but that would result in the fact that the string will not match. The regex has thus the intelligence (using dynamic programming), to know when to leave the (first) ship.
Non-greedy quantifiers
Now a problem that can occur is that the first group is "too greedy" for data extraction purposes. Say you have the following string AAAAAAA. Several subdivisions are possible: (A)(AA)(AAAA), (AA)(AA)(AAA), etc. By default, each group in a regex is as greedy as possible (as long as this has no effect on the fact whether the string will still be matched). The regex will thus subdivide the string in (AAAAA)(AA)(). If you want to extract data in such a way, that from the moment one character has been passed, from the moment two characters in the [A-Z0-9] range occur, the regex should move to the next group.
In order to achieve this, you can write:
^([A-Za-z0-9]+?)([A-Z0-9][A-Z0-9])(.*)
The string AAAAAAA will match with (A)(AA)(AAAA).
Possessive quantifiers
Possessive quantifiers are greedy quantifiers, but once it is possible, they will never give a character back to another group. For instance:
^([A-Z]++)([H-Zw])(.*)
If you would write ^([A-Z]+)([H-Z])(.*) a string AH0 would be matched. The first group is greedy (taking A), but since eating (that's the word sometimes used) H would result in the string not being matched, it is willingly to give up H. Using the possessive quantifiers. The group is not willing to give up H as well. As a result it eats both A and H. Only 0 is left for the second group, but the second group cannot eat that character. As a result the regex fails where using the non possessive quantifiers would result in a successful match. The string Aw will however successfully match, since the first group is not interested in w...
By default, quantifers are greedy. They will try to match as much as possible. The possessive quantifier prevents backtracking, meaning what gets matched by the regular expression will not be backtracked into, even if that causes the whole match to fail. As stated in Regex Tutorial ( Possessive Quantifiers ) ...
Possessive quantifiers are a way to prevent the regex engine from
trying all permutations. This is primarily useful for performance
reasons. You can also use possessive quantifiers to eliminate certain
matches.
Related
What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""
I'm using a Java regex pattern in an application that only allows access to the whole match value (that is, I cannot use capturing groups).
I am trying to extract values from my sample text:
C02 SURVEY : 2010 F10446P BONAPARTE 2D
In the above example I need to check for the keyword SURVEY and have to extract value after that :. And I wanted my output to be:
2010 F10446P BONAPARTE 2D
I used the pattern (?<=(?i)survey\s{2}[:])(?:(?![\n]).)*
In this pattern, I have hardcoded the spaces to be 2 (\s{2}) which may vary and not constant value.
I need to use quantifiers with lookbehind operation.
If any other option is there please let me know.
You may leverage a feature in a Java regex engine that is called "constrained width lookbehind":
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
That means, you may replace the {2} limiting quantifier with a limiting quantifier with both minimum and maximum values, e.g. {0,100} to allow zero to a hundred whitespace symbols. Adjust them as you see fit.
Besides, you needn't use a tempered greedy token (?:(?![\n]).)* as the dot in Java regex does not match a newline. Just replace it with .* to match any zero or more chars other than newline. So, your pattern might look as simple as (?i)(?<=survey\s{0,100}:).*.
I was reviewing a test and noticed that a possessive quantifier actually worked in an str.split(). so I wrote the following code:
String str = "aaaaab";
if(str.matches("a*+b"))
System.out.println("I backtrack");
else
System.out.println("Nope.");
When run, this prints out I backtrack. Here's why this is confusing, I've been told that a possessive quantifier never backtracks, so why would a*+ give up the b in the String?
What I want is a more detailed explanation of when possessive quantifiers backtrack.
There is no backtracking in your example.
You are saying "any number of a characters". So, the engine will collect those 5 a chars and then stop; to then find the b.
That is all there is to this.
Backtracking means that the engine has to "track back" after collecting too much of the input string; see here for an example of that.
And beyond that: your if condition returns true when the pattern matches the input. Your conclusion that this means "it is backtracking" is not correct:
A match is a match; no matter if the engine had to backtrack in order to match (or not). In other words: your little test there doesn't tell you anything (it only tells you if the input matched the given pattern; period).
What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""
I'm trying to match a coordinate pair in a String using a Regex in Java. I explicitly want to exclude strings using negative lookahead.
to be matched:
558,228
558,228,
558,228,589
558,228,A,B,C
NOT to be matched:
558,228,<Text>
The Regex ^558,228(?!,<).* does the job, while ^\d{1,},\d{1,}(?!,<).* doesn't. It's the same regex with the metacharacter \d instead of values. Any ideas why?
The reason is the .* part at the end. It matches everything that wasn't matched earlier.
In combination with \d{1,}, which allows to match less than 3 digits, it will go like this:
^\d{1,},\d{1,}(?!,<) will match 558,22 and .* will match the remaining part 8,<Text>.
The problem is the \d{1,} part in combination with the .* at the end.
In your case
558,228,<Text>
The ^\d{1,},\d{1,}(?!,<) matches ">558,22" and the .* matches the rest "8,<Text>"
You can solve this using the possessive quanitifier ++
^\d+,\d++(?!,<)(.*)
See it here online on Regexr
\d++ is a seldom used possessive quantifier, which is here useful. ++ means match at least once as many as you can and do not backtrack. That means it will not give back the digits once it has found them.
Java Quantifier tutorial