Can I improve performance of this regular expression further - java

I am trying to fetch thread names from the thread dumps file.
The thread names are usually contained within "double quotes" in the first line of each thread dump.
It may look as simple as follows:
"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]
Or as big as follows:
"[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]
The regular expression I wrote is a simple one: "(.*)". It captures everything inside double quotes as a group. However, it causes heavy backtracking and thus requires a lot of steps, as can be seen here. Verbally, we can explain this regex as "capture anything that is enclosed inside double quotes as a group".
So I came up with another regex which performs the same task: "([^\"]*)". Verbally, we can describe this regex as "capture any number of non-double-quote characters that are enclosed inside double quotes". I did not find any regex faster than this one. It does not perform any backtracking and hence requires the minimum number of steps, as can be seen here.
I described the above to my colleague. He came up with yet another one: "(.*?)". I didn't get how it works. It performs considerably less backtracking than the first one but is a bit slower than the second one, as can be seen here.
However:
I don't get why the backtracking stops early.
I understand ? is a quantifier which means once or not at all. However, I don't understand how once or not at all is being used here.
In fact I am not able to guess how can we describe this regex verbally.
My colleague tried explaining it to me, but I am still not able to understand it completely. Can anyone explain?

Brief explanation and a solution
The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.
The negated character class solution "([^"]*)" is the best of the three because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression as efficient as possible, you can use possessive quantifiers.
If you need to match strings like " + no quotes here + ", use
"([^"]*+)"
or, since you do not even need to match the trailing quote in this situation:
"([^"]*+)
See regex demo
In fact I am not able to guess how can we describe this regex verbally.
The latter "([^"]*+) regex can be described as
" - find the first " symbol from the left of the string
([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.
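To tie this back to the thread dump lines from the question, here is a minimal Java sketch. It assumes each dump line is passed in as a String and uses the variant with the closing quote, "([^"]*+)"; the class and method names (ThreadNameExtractor, extract) are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadNameExtractor {
    // Opening quote, a possessive run of non-quote characters (group 1), closing quote.
    private static final Pattern THREAD_NAME = Pattern.compile("\"([^\"]*+)\"");

    /** Returns the first double-quoted substring of the line, if there is one. */
    static Optional<String> extract(String line) {
        Matcher m = THREAD_NAME.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "\"THREAD1\" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]";
        System.out.println(extract(line).orElse("<no thread name>")); // prints THREAD1
    }
}
```

Compiling the pattern once and reusing it matters more for a large dump file than the choice between the three regexes.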
Quantifiers
More information on quantifiers from Rexegg.com:
A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
As you see, ? is not a separate quantifier, it is a part of another quantifier.
I advise reading more about why Lazy Quantifiers are Expensive and why the Negated Class Solution is really safe and fast for your input string (where you just match a quote followed by non-quotes and then a final quote).
Difference between .*?, .* and [^"]*+ quantifiers
Greedy "(.*)" solution works like this: the engine checks each symbol from left to right looking for ". Once it is found, .* grabs the rest of the string up to the end, and the engine then backtracks one symbol at a time, checking whether each symbol is equal to ". Thus, in your input string, it backtracks 160 times.
Lazy "(.*?)" solution works like this: the engine finds the first " and then advances in the pattern and tries the next token (which is ") against the T in THREAD1. This fails, so the engine backtracks and allows the .*? to expand its match by one item, so that it matches the T. Once again, the engine advances in the pattern. It now tries the " against the H in THREAD1. This fails, so the engine backtracks and allows the .*? to expand and match the H. The process then repeats itself—the engine advances, fails, backtracks, allows the lazy .*? to expand its match by one item, advances, fails and so on. For each character matched by the .*?, the engine has to backtrack. From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive".
Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.
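The difference is not only in the number of steps. On a line with more than one quoted substring, the greedy version also captures a different result. A small Java sketch, using a made-up input line and a hypothetical helper firstGroup:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsLazyQuotes {
    /** Returns group 1 of the first match, or null when the pattern does not match. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical line with two quoted substrings.
        String line = "\"main\" waiting on \"lock-7\" prio=5";

        // Greedy: runs to the end of the line, then backtracks to the LAST quote.
        System.out.println(firstGroup("\"(.*)\"", line));     // main" waiting on "lock-7
        // Lazy: expands one character at a time until the NEXT quote.
        System.out.println(firstGroup("\"(.*?)\"", line));    // main
        // Negated class: cannot cross a quote at all.
        System.out.println(firstGroup("\"([^\"]*)\"", line)); // main
    }
}
```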
Possessive quantifier solution with a negated character class "([^"]*+)" works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the first ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the match will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].
Note that the possessive quantifier does not let the regex engine backtrack into the subexpression: once matched, the symbols between the quotes become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine, and the engine will be unable to shift any characters out of or into this block of text.
For the current expression, it does not make a big difference though.
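To see the "hard block" behaviour in isolation, here is a small Java sketch with a toy pattern (a*+a is my own illustration, not something from the question). The helper name wholeMatch is hypothetical.

```java
import java.util.regex.Pattern;

final class PossessiveDemo {
    /** True when the regex matches the entire input. */
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy a* first takes "aaa", then gives one 'a' back so the final a can match.
        System.out.println(wholeMatch("a*a", "aaa"));  // true
        // Possessive a*+ takes "aaa" and refuses to give anything back, so the final a fails.
        System.out.println(wholeMatch("a*+a", "aaa")); // false
        // With [^"]*+ nothing is lost: the class can never swallow the closing quote.
        System.out.println(wholeMatch("\"[^\"]*+\"", "\"THREAD1\"")); // true
    }
}
```

This is why the possessive quantifier is only safe when the quantified token and the token after it cannot match the same character.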

Related

Regex will not matching a word [duplicate]

What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
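Since the question is tagged Java, here is the same <em> example as a short Java sketch. The helper findAll is a hypothetical name; it simply collects every match of a pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagMatchDemo {
    /** Collects every match of the regex in the input, in order. */
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<em>Hello World</em>";
        System.out.println(findAll("<.+>", html));  // [<em>Hello World</em>]
        System.out.println(findAll("<.+?>", html)); // [<em>, </em>]
    }
}
```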
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier | Lazy quantifier | Description
* | *? | Star Quantifier: 0 or more
+ | +? | Plus Quantifier: 1 or more
? | ?? | Optional Quantifier: 0 or 1
{n} | {n}? | Quantifier: exactly n
{n,} | {n,}? | Quantifier: n or more
{n,m} | {n,m}? | Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflo (everything up to the last o)
lazy reg expression : s.*?o output: stacko (everything up to the first o)
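A quick way to check the lazy variants of the quantifiers in Java. This is only a sketch; the helper first and the short test strings other than stackoverflow are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyVariantsDemo {
    /** Returns the text of the first match, or null when there is none. */
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("s.*o", "stackoverflow"));  // stackoverflo
        System.out.println(first("s.*?o", "stackoverflow")); // stacko
        System.out.println(first("a{2,4}", "aaaa"));         // aaaa
        System.out.println(first("a{2,4}?", "aaaa"));        // aa
        System.out.println(first("a+?", "aaa"));             // a
        // a?? prefers zero repetitions, so the first match is the empty string.
        System.out.println("[" + first("a??", "a") + "]");   // []
    }
}
```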
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match (a.*?c) will match just the first abc.
As far as I know, most regex engines are greedy by default. Adding a question mark after a quantifier enables lazy matching.
As @Andre S mentioned in a comment:
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String money = "100000000999";

        // Greedy: 0* takes every zero that follows "100".
        String greedyRegex = "100(0*)";
        Pattern pattern = Pattern.compile(greedyRegex);
        Matcher matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
        }

        // Lazy: 0*? takes no zeros, because nothing after it forces it to expand.
        String lazyRegex = "100(0*?)";
        pattern = Pattern.compile(lazyRegex);
        matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
        }
    }
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: A greedy quantifier first tries to repeat the token as many times as possible, and gradually gives up matches as the engine backtracks to find an overall match.
Laziness: A lazy quantifier first repeats the token as few times as required, and gradually expands the match as the engine backtracks through the regex to find an overall match.
From Regular expression
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex.
By using a lazy quantifier, the expression tries the minimal match first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? after the quantifier (here, .* becomes .*?).
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS suddenly becomes non-greedy and returns as little as possible, i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-greedy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
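If you prefer to check the refund example in code rather than in an online tester, here is a small Java sketch. The class RefundDemo and the method groups are hypothetical names; the greedy variant (.{2,5})([0-9]*) is added by me for comparison.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RefundDemo {
    /** Returns groups 1 and 2 of the first match, or an empty array when nothing matches. */
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String input = "$50,000";
        // Lazy: .{2,5}? settles for the minimum of two characters.
        System.out.println(Arrays.toString(groups("(.{2,5}?)([0-9]*)", input))); // [$5, 0]
        // Greedy: .{2,5} takes the maximum of five characters.
        System.out.println(Arrays.toString(groups("(.{2,5})([0-9]*)", input)));  // [$50,0, 00]
    }
}
```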
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means the quantifier keeps consuming characters that fit its token until none are left or it can go no further.
Lazy means it stops as soon as the rest of the pattern can match.
One common example that I often encounter is the \s*-\s*? part of the regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is greedy because of the *: after the digits it matches as many whitespace characters as possible and then looks for a dash character "-". The second \s*? is lazy because of the *?: it first tries to match no whitespace at all, and takes one more whitespace character only when the following [0-9]{7} cannot match yet.
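Note that in this particular regex the lazy \s*? still ends up taking all the whitespace, because the seven digits after it cannot match otherwise. A Java sketch to show this, with an extra capturing group around the whitespace after the dash (my addition) and a made-up input string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyWhitespaceDemo {
    // The regex from above, with the whitespace after the dash wrapped in group 1.
    static final Pattern LAZY = Pattern.compile("[0-9]{2}\\s*-(\\s*?)[0-9]{7}");
    static final Pattern GREEDY = Pattern.compile("[0-9]{2}\\s*-(\\s*)[0-9]{7}");
    // Nothing follows the lazy group here, so it has no reason to expand.
    static final Pattern LAZY_AT_END = Pattern.compile("-(\\s*?)");

    /** Returns the whitespace captured after the dash, or null when there is no match. */
    static String spacesAfterDash(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "12 -   3456789"; // hypothetical input, three spaces after the dash
        System.out.println(spacesAfterDash(LAZY, input).length());        // 3
        System.out.println(spacesAfterDash(GREEDY, input).length());      // 3
        System.out.println(spacesAfterDash(LAZY_AT_END, input).length()); // 0
    }
}
```

So laziness changes the order in which the engine tries things, and it changes the result only when a shorter match is actually possible.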
Best shown by example: the string 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet, but it actually matches the whole string. Why? Because the .+ is greedy, and a greedy match consumes every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! From there it backtracks one character at a time until it finds a match for the 3rd token (\b); in this short string, \b already matches at the very end.
If the string were a 4GB text file and 192.168.1.1 was at the start, you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
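The octet example can be reproduced in Java with a few lines. This is just a sketch; the helper first is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OctetDemo {
    /** Returns the text of the first match, or null when there is none. */
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String ip = "192.168.1.1";
        // Greedy .+ runs to the end of the string, where \b already holds.
        System.out.println(first("\\b.+\\b", ip));  // 192.168.1.1
        // Lazy .+? creeps forward and stops at the first word boundary, after "192".
        System.out.println(first("\\b.+?\\b", ip)); // 192
    }
}
```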
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
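The same input and regex in Java, as a small sketch (the helper first is a hypothetical name, and the unanchored .*?# line is added by me for contrast):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyAnchorDemo {
    /** Returns the text of the first match, or null when there is none. */
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "some.email#domain.com#";
        // Without the $ anchor the lazy match really does stop at the first #.
        System.out.println(first(".*?#", input));   // some.email#
        // With $ the first # is not good enough, so .*? keeps expanding to the second one.
        System.out.println(first("^.*?#$", input)); // some.email#domain.com#
    }
}
```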
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = "  0014.2"; // two leading spaces
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " "

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:
(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))
This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.
Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.
Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Proof
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
Note | Component | Description
| (?=\() | Make sure '(' follows before doing any hard work.
| (?: | Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '(' | (?= | This lookahead deals with finding the next '('.
| .*?\((?!.*?\1) | Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again.
| (.*\)(?!.*\2).*) | Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
| ) |
Handle ')' | (?= | This lookahead deals with finding the next ')'.
| .*?\)(?!.*?\2) | Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
| (.*) | Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so a simple expression is sufficient.
| ) |
| . | Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
| )+? | Match as few times as possible until a balanced group has been found. This is validated by the following check.
Final validation | .*?(?=\1) | Match up to and including the last '(' found.
| [^(]*(?=\2$) | Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).
Conclusion
So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.
I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.
Brief
Input Corrections
First of all, your input is incorrect as there's an extra parenthesis (as shown below)
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
                                ^
Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:
Extra parenthesis removed
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
                                ^
Additional parenthesis added to match extra closing parenthesis
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Regex Capabilities
Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).
This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.
Input
Considering the original input is actually invalid, we'll use the following inputs to test against.
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
Testing against these inputs should yield the following results:
INVALID (no match)
VALID (match)
VALID (match)
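For readers on Java, where the recursive patterns below are not available, the expected results above can be cross-checked with a plain depth counter. To be clear, this is not a regex and not the method of this answer; it is only a sketch that checks whether the parentheses are balanced, and the names ParenBalance and isBalanced are made up.

```java
final class ParenBalance {
    /** True when every ')' closes an earlier '(' and no '(' is left open. */
    static boolean isBalanced(String s) {
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth < 0) {
                return false; // a closing parenthesis with nothing to close
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))",
            "(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))",
            "((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))"
        };
        for (String input : inputs) {
            System.out.println(isBalanced(input) ? "VALID" : "INVALID");
        }
    }
}
```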
Code
There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).
See regex in use here
Using DEFINE block
(?(DEFINE)
(?<value>[^()\r\n]+)
(?<groupVal>(?&group)|(?&value))
(?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$
Note: This regex uses the flags gmx
Without DEFINE block
See regex in use here
^(?<group>
(?<value>[^()\r\n]+)*
\((?<groupVal>(?&group)|(?&value))\)
(?&groupVal)*
)$
Note: This regex uses the flags gmx
Without x modifier (one-liner)
See regex in use here
^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$
Without named (groups & references)
See regex in use here
^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$
Note: This is the shortest possible method that I could come up with.
Explanation
I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.
^ Assert position at the start of the line
(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
([^()\r\n]+)* Capture the following into capture group 2 any number of times
[^()\r\n]+ Match any character not present in the set ()\r\n one or more times
\( Match a left/opening parenthesis character ( literally
((?1)|(?2)) Capture either of the following into capture group 3
(?1) Recurse the first subpattern (1)
(?2) Recurse the second subpattern (2)
\) Match a right/closing parenthesis character ) literally
(?3)* Recurse the third subpattern (3) any number of times
$ Assert position at the end of the line

Java Regex how to get all the matching occurences of a pattern [duplicate]


Regular expression to match strings in quotes with double-quotes inside

I face a challenge to match the input in the following format:
The input consists of key=value pairs. The key starts with slash. The value may be a number or a string in quotes.
The value may optionally contain escaped quotes, that is, a quote followed by a quote (""). Such an escaped quote should be considered part of the value. There is no need to check that escaped quotes are balanced (e.g. closed by another escaped quote).
The regular expression should match the given key=value part of the sequence and should not break for long inputs (e.g. value is 10000 characters).
First I came to this solution:
/(\w+)=(\d+|"(?:""|[^"])+"(?!"))
and it does not perform badly; however, it fails in Java 6 with a StackOverflowError for long inputs (it crashes regexplanet, for example). I tried to improve it a bit to run faster:
/(\w+)=(\d+|"(?:""|[^"]+)+"(?!"))
but then, if the input does not match, it enters an endless loop of backtracking while trying to match it.
Then I came to this regex:
/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!"))
which performs slower, but it seems to solve the task.
Can anyone suggest a better / faster regex?
Sample input:
/mol_type="protein" /transl_table=11 /note="[CDS] (""multi
line)" nn /organism="""Some"" Sequence" nn /organism="Some ""Sequence"""
/translation="MHPSSSRIPHIAVVGVSAIFPGSLDAHGFWRDILSGTDLITDVPSTHWLVE
DYYDPDPSAPDKTYAKRGAFLKDVPFDPLEWGVPPSIVPATDTTQLLALIVAKRVLEDAAQGQFE
SMSRERMSVILGVTSAQELLASMVSRIQRPVWAKALRDLGYPEDEVKRACDKIAGNYVPWQESSF
PGLLGNVVAGRIANRLDLGGTNCVTDAACASSLSAMSMAINELALGQSDLVIAGGCDTMNDAFMY
MCFSKTPALSKSGDCRPFSDKADGTLLGEGIAMVALKRLDDAERDGDRVYAVIRGIGSSSDGRSK
SVYAPVPEGQAKALRRTYAAAGYGPETVELMEAHGTGTKAGDAAEFEGLRAMFDESGREDRQWCA
LGSVKSQIGHTKAAAGAAGLFKAIMALHHKVLPPTIKVDKPNPKLDIEKTAFYLNTQARPWIRPG
DHPRRASVSSFGFGGSNFHVALEEYTGPAPKAWRVRALPAELFLLSADTPAALADRARALAKEAE
VPEILRFLARESVLSFDASRPARLGLCATDEADLRKKLEQVAAHLEARPEQALSAPLVHCASGEA
PGRVAFLFPGQGSQYVGMGADALMTFDPARAAWDAAAGVAIADAPLHEVVFPRPVFSDEDRAAQE
ARLRETRWAQPAIGATSLAHLALLAALGVRAEAFAGHSFGEITALHAAGALSAADLLRVARRRGE
LRTLGQVVDHLRASLPAAGPAASASPAAAASVPKASTAAVPAVASVAAPGAAEVERVVMAVVAET
TGYPAEMLGLQMELESDLGIDSIKRVEILSAVRDRTPGLSEVDASALAQLRTLGQVVDHLRASLP
AASAGPAVAAPAAKAPAVAAPTGVSGATPGAAEVERVVMAVVAETTGYPAEMLGLQMELESDLGI
DSIKRVEILSAVRDRTPGLAEVDASALAQLRTLGQVVDHLRASLGPAAVTAGAAPAEPAEEPAST
PLGRWTLVEEPAPAAGLAMPGLFDAGTLVITGHDAIGPALVAALAARGIAAEYAPAVPRGARGAV
FLGGLRELATADAALAVHREAFLAAQAIAAKPALFVTVQDTGGDFGLAGSDRAWVGGLPGLVKTA
ALEWPEASCRAIDLERAGRSDGELAEAIASELLSGGVELEIGLRADGRRTTPRSVRQDAQPGPLP
LGPSDVVVASGGARGVTAATLIALARASHARFALLGRTALEDEPAACRGADGEAALKAALVKAAT
SAGQRVTPAEIGRSVAKILANREVRATLDAIRAAGGEALYVPVDVNDARAVAAALDGVRGALGPV
TAIVHGAGVLADKLVAEKTVEQFERVFSTKVDGLRALLGATAGDPLKAIVLFSSIAARGGNKGQC
DYAMANEVLNKVAAAEAARRPGCRVKSLGWGPWQGGMVNAALEAHFAQLGVPLIPLAAGAKMLLD
ELCDASGDRGARGQGGAPPGAVELVLGAEPKALAAQGHGGRVALAVRADRATHPYLGDHAINGVP
VVPVVIALEWFARAARACRPDLVVTELRDVRVLRGIKLAAYESGGEVFRVDCREVSNGHGAVLAA
ELRGPQGALHYAATIQMQQPEGRVAPKGPAAPELGPWPAGGELYDGRTLFHGRDFQVIRRLDGVS
RDGIAGTVVGLREAGWVAQPWKTDPAALDGGLQLATLWTQHVLGGAALPMSVGALHTFAEGPSDG
PLRAVVRGQIVARDRTKADIAFVDDRGSLVAELRDVQYVLRPDTARGQA"
/note="primer of Streptococcus pneumoniae
Expected output (from regexhero.net):
In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...):
/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))
# (?>(?>""|[^"]+)+)
(?> # throw away the states created by (...)+
(?> # throw away the states created by [^"]+
""|[^"]+
)+
)
Your issue when using (?:""|[^"]+)+ on a string that will never match, is linked to the fact that each time you match a new [^"] character the regex engine can choose to use the inner or outer + quantifier.
This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.
We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.
See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)
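The fail-fast behaviour can also be checked directly in Java. The following is only a minimal sketch: the class name `AtomicGroupDemo` and the helpers `firstValue` and `repeat` are made up for illustration, and the input strings are invented rather than taken from the question.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomicGroupDemo {
    // The atomic-group pattern from above, escaped as a Java string literal.
    static final Pattern ATOMIC =
        Pattern.compile("/(\\w+)=(\\d+|\"(?>(?>\"\"|[^\"]+)+)\"(?!\"))");

    // Returns the matched value (group 2), or null if nothing matches.
    static String firstValue(String input) {
        Matcher m = ATOMIC.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    // Hypothetical helper: a string made of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Successful case: a quoted value containing escaped quotes.
        System.out.println(firstValue("/note=\"a \"\"quoted\"\" word\""));
        // Failing case: 10000 characters and no closing quote.
        // The atomic groups leave nothing to backtrack into, so this returns quickly.
        System.out.println(firstValue("/note=\"" + repeat('x', 10000)));
    }
}
```

The second call is the interesting one: with the non-atomic `(?:""|[^"]+)+` the same input is the kind that triggers the runaway backtracking described above.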
Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.
You should use
/(\w+)=(\d+|"(?:""|[^"])*"(?!"))
See it live on regex101.com.
Explanation:
/ # Slash
(\w+) # Identifier --> Group 1
= # Equals sign
( # Group 2:
\d+ # Either a number
| # or
"(?:""|[^"])*" # a quoted string
(?!") # unless another quote follows
) # End of group 2
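For reference, here is roughly how this pattern could be used from Java on a shortened version of the sample input. It is a sketch only: `KeyValueDemo` and `pairs` are illustrative names, and it deliberately uses short values, so it says nothing about the stack-depth problem with very long values mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueDemo {
    // Same pattern as above, escaped as a Java string literal.
    static final Pattern PAIR =
        Pattern.compile("/(\\w+)=(\\d+|\"(?:\"\"|[^\"])*\"(?!\"))");

    // Collects every key/value pair as "key -> value".
    static List<String> pairs(String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            out.add(m.group(1) + " -> " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "/mol_type=\"protein\" /transl_table=11 "
                     + "/organism=\"\"\"Some\"\" Sequence\"";
        for (String pair : pairs(input)) {
            System.out.println(pair);
        }
        // mol_type -> "protein"
        // transl_table -> 11
        // organism -> """Some"" Sequence"
    }
}
```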
How about this one:
/(\w+)=("(?:[^"]|"")*"|\d+)
(Note that the / is part of the regex here. Escape it as appropriate for your host language.)
If your regex engine supports it (Java does), make the * possessive:
/(\w+)=("(?:[^"]|"")*+"|\d+)
After some debugging the latter expression can be improved to:
/(\w+)=("(?:""|[^"]*+)*+"|\d++)
Note the double *+)*+ which allows matching contiguous text in one step while not being susceptible to catastrophic backtracking.
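A quick way to try the possessive version on a long value is sketched below. The names `PossessiveDemo`, `value` and `repeat` are illustrative only, and the 10000-character value is generated rather than taken from the sample input.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveDemo {
    // The possessive pattern from above, escaped as a Java string literal.
    static final Pattern POSSESSIVE =
        Pattern.compile("/(\\w+)=(\"(?:\"\"|[^\"]*+)*+\"|\\d++)");

    // Returns the matched value (group 2), or null if nothing matches.
    static String value(String input) {
        Matcher m = POSSESSIVE.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    // Hypothetical helper: a string made of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String longValue = repeat('A', 10000);
        // Properly terminated: the whole quoted value is captured.
        String hit = value("/translation=\"" + longValue + "\"");
        System.out.println(hit.length()); // 10002: the value plus both quotes
        // Unterminated: the possessive quantifiers give nothing back, so this fails fast.
        System.out.println(value("/translation=\"" + longValue));
    }
}
```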

Greedy Quantifiers

I was reading K. Sierra and found the following sentence:
The greedy quantifier does in fact read the entire source data, and then it works
backward (from the right) until it finds the rightmost match. At that point, it
includes everything from earlier in the source data up to and including the data that
is part of the rightmost match.
Now, Suppose we have a source as follows:
"proj3.txt,proj1sched.pdf,proj1,proj2,proj1.java"
and pattern: proj1([^,])*
Why doesn't it match the whole text? Being greedy, it should have matched the rightmost "proj1.java", and the returned match should have been the entire source before the rightmost match? Instead it returns:
proj1sched.pdf
proj1
proj1.java
why doesn't it match the whole text?
Because you stated it must start with proj1
Being greedy it should have matched the rightmost "proj1.java"
Correct.
and the returned match should have been the entire source before the right most match?
No idea why you would think that, or why that would be useful. You can just use .*proj1.* if that is what you want.
why doesn't it match the whole text?
Oh, it tried. But it found the sequence p, r, o, j, 1 in one spot, then went on to find zero or more characters not being a comma, therefore matching s, c, h, e, d, ., p, d, f. And it stopped there, since the next character, being a comma, did not match [^,].
Note that when an attempt fails, the next attempt starts one character further along. Here the input begins with proj3, so the attempt starting at the first p fails on the 3; the engine then retries from r, and so on until it finds another p; when it finds one, it will try r, etc etc.
The regex being satisfied in its entirety, the engine decided that it was a success and didn't try any further, even though there are matches beyond that point.
The text matched is therefore proj1sched.pdf. And not the entire input. Regex engines are lazy in that sense: they only match what they need to match and never go any further.
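The walk-through above can be reproduced with a short find() loop. This is just a sketch; the class name `Proj1Demo` and the method `allMatches` are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Proj1Demo {
    // Returns every successive match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {   // each call resumes after the previous match
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "proj3.txt,proj1sched.pdf,proj1,proj2,proj1.java";
        System.out.println(allMatches("proj1([^,])*", source));
        // [proj1sched.pdf, proj1, proj1.java]
    }
}
```

Each match begins at a literal proj1 and ends just before the next comma (or at the end of the input), which is why three separate matches come back instead of one long one.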
BUT. And this is where it gets interesting. Some engines don't work this way.
Consider the regex cat(|flap) and the input text catflap. POSIX has had a go at regex engines, and dictated that a regex engine should match the leftmost, longest match.
So, if a regex engine obeys POSIX, it should match catflap. But most regex engines in existence, here, will only match cat: the empty alternation matches first, the regex is satisfied, end of story!
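java.util.regex falls into the second camp, which a small experiment shows. Sketch only; `CatFlapDemo` and `firstMatch` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CatFlapDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The empty alternative comes first and succeeds, so the match is just "cat".
        System.out.println(firstMatch("cat(|flap)", "catflap")); // cat
        // Swapping the alternatives makes the engine try "flap" first.
        System.out.println(firstMatch("cat(flap|)", "catflap")); // catflap
    }
}
```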
Now, to the core of the question: quantifiers are of three types, greedy, lazy and possessive:
greedy: the default, *;
lazy, aka the overused: *?;
possessive: *+.
A greedy quantifier will try and match as much text as it can, and give back only when it has to; a lazy quantifier will try and match as little text as it can; a possessive quantifier will try and match as much text as it can, and it will not give text back.
Illustration: here is the input text:
The answer to everything is 42, says the mouse
Here are three regexes to match this text, with a capturing group:
.*(\d+) (greedy);
.*?(\d+) (lazy);
.*+(\d+) (possessive).
Question: what will the group capture in each of these expressions? Answer:
the first: 2;
the second: 42;
the third: will not even match the text! .*+ will have swallowed everything but will not give back, \d+ is therefore left with nothing which can match, regex failure.
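These three answers are easy to verify in Java. The snippet below is a minimal sketch; `QuantifierDemo` and `capture` are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns what group 1 captured, or null if the regex does not match at all.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "The answer to everything is 42, says the mouse";
        System.out.println(capture(".*(\\d+)", text));  // 2    (greedy gives back as little as possible)
        System.out.println(capture(".*?(\\d+)", text)); // 42   (lazy stops at the first digit)
        System.out.println(capture(".*+(\\d+)", text)); // null (possessive never gives back)
    }
}
```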
We have proj1([^,])*, in which ([^,])* matches any run of zero or more characters other than ','.
That run is appended to the literal string "proj1", so each match is "proj1" followed by a comma-free substring:
"sched.pdf", "" (the empty string) or ".java". None of the three contains a ',' character. Hence the result.
