My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches.
I suspect catastrophic backtracking.
Please help! How can I optimize this regex:
String regex = "<tr bgcolor=\"ffffff\">\\s*?<td width=\"20%\"><b>((?:.|\\s)+?): *?</b></td>\\s*?<td width=\"80%\">((?:.|\\s)*?)(?=(?:</td>\\s*?</tr>\\s*?<tr bgcolor=\"ffffff\">)|(?:</td>\\s*?</tr>\\s*?</table>\\s*?<b>Tags</b>))";
EDIT: Since it was not clear (my bad): I am trying to take an HTML-formatted document and reformat it by extracting the two capture groups and adding formatting afterwards.
The alternation (?:.|\\s)+? is very inefficient, as it involves too much backtracking.
Basically, all variations of this pattern are extremely inefficient: (?:.|\s)*?, (?:.|\n)*?, (?:.|\r\n)*? and their greedy counterparts, too ((?:.|\s)*, (?:.|\n)*, (?:.|\r\n)*). (.|\s)*? is probably the worst of them all.
Why?
The two alternatives, . and \s, may match the same text at the same location: they both match regular spaces, at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.
Patterns like (?:.|\n)*? / (?:.|\n)* in Java often cause a StackOverflowError. The main problem is the alternation (which already causes backtracking on its own) matching one character at a time, combined with a quantifier of unbounded length applied to the group. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended (only in the ElasticSearch Lucene regex engine is (.|\n) the only way to match any char).
Solution
If you want to match any characters including whitespace with regex, do it with
[\\s\\S]*?
Or enable singleline mode with (?s) (or the Pattern.DOTALL flag passed to Pattern.compile) and just use . (e.g. (?s)start(.*?)end).
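Both options can be sketched as follows. This is a minimal illustration, not the asker's full pattern: the class name AnyCharDemo, the helper firstCell and the simplified <td>...</td> pattern are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnyCharDemo {
    // [\s\S] matches any character, line breaks included, without alternation.
    static final Pattern CHAR_CLASS = Pattern.compile("<td>([\\s\\S]*?)</td>");

    // With Pattern.DOTALL the dot itself also matches line terminators.
    static final Pattern DOTALL = Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    // Returns the first captured cell body, or null when nothing matches.
    static String firstCell(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<tr><td>first line\nsecond line</td></tr>";
        // Both calls print the two-line cell content.
        System.out.println(firstCell(CHAR_CLASS, html));
        System.out.println(firstCell(DOTALL, html));
    }
}
```

Either form lets the lazy quantifier step through the input one character at a time without the engine having to choose between overlapping alternatives.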
NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.
Related
This is a regex to extract the table name from a SQL statement:
(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)
It matches a token, optionally enclosed in [`'"], preceded by FROM etc. surrounded by whitespace, except for UPDATE which has no leading whitespace.
We execute many regexes, and this is the slowest one, and I'm not sure why. SQL strings can get up to 4k in size, and execution time is at worst 0,35ms on a 2.2GHz i7 MBP.
This is a slow input sample: https://pastebin.com/DnamKDPf
Can we do better? Splitting it up into multiple regexes would be an option as well, if alternation is an issue.
There is a rule of thumb:
Do not let the engine attempt a match at every single character if there are boundaries it can use instead.
Try the following regex (~2500 steps on the given input string):
(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\S*\s*|\w+\W*(\w[\w\.-]*)
Live demo
Note: What you need is in the first capturing group.
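A minimal Java sketch of how this token-skipping pattern might be driven, assuming the first variant above is used unchanged. The class name TableNameDemo and the helper firstTable are made up for the illustration. Because most matches are skipped tokens with an empty group 1, the loop keeps calling find() until the group is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableNameDemo {
    // The pattern from the answer, escaped for a Java string literal.
    static final Pattern TABLE = Pattern.compile(
            "(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\\S*\\s*|\\w+\\W*(\\w[\\w\\.-]*)");

    // Returns the first table name found, or null if no keyword is present.
    static String firstTable(String sql) {
        Matcher m = TABLE.matcher(sql);
        while (m.find()) {
            // Group 1 is only set when the second alternative matched,
            // i.e. when the match started at one of the keywords.
            if (m.group(1) != null) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstTable("SELECT id FROM users WHERE id = 1")); // prints users
        System.out.println(firstTable("INSERT INTO orders VALUES (1)"));     // prints orders
    }
}
```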
The final regex according to comments (which is a little bit slower than the previous clean one):
(?!(?:FROM|INTO|NEXTVAL|UPDATE|JOIN)\b)\S*\s*|\b(?:NEXTVAL\W*|\w+\s[\s`'"]*)([\[\]\w\.-]+)
Regex optimisation is a very complex topic and should be done with the help of tools. For example, I like Regex101, which reports the number of steps the regex engine takes to match a pattern against the input. For your pattern and the given example it prints:
1 match, 22976 steps (~19ms)
The first thing you can always do is group similar parts together. For example, FROM, INTO and JOIN look similar, so we can write the regex as below:
(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)
For the above example, Regex101 prints:
1 match, 15891 steps (~13ms)
Try to find online tools that explain and optimise regexes, such as myregextester, and that count how many steps the engine needs.
Because matches are often near the end, one possibility would be to essentially start at the end and backtrack, rather than start at the beginning and forward-track, something along the lines of
^(?:UPDATE\s|.*(?:\s(?:(?:FROM|INTO|JOIN)\s|NEXTVAL[\s\W]*)))[\s`'\"]*([\w\.-_]+)
https://regex101.com/r/SO7M87/1/ (154 steps)
While this may be much faster when a match exists, it's only a moderate improvement when there's no match, because the pattern must backtrack all the way to the beginning (~9000 steps from ~23k steps)
The standard implementation of the Java Pattern class uses recursion to implement many forms of regular expressions (e.g., certain operators, alternation).
This approach causes stack overflow issues with input strings that exceed a (relatively small) length, which may not even be more than 1,000 characters, depending on the regex involved.
A typical example of this is the following regex using alternation to extract a possibly multiline element (named Data) from a surrounding XML string, which has already been supplied:
<Data>(?<data>(?:.|\r|\n)+?)</Data>
The above regex is used with the Matcher.find() method to read the "data" capturing group and works as expected, until the length of the supplied input string exceeds 1,200 characters or so, in which case it causes a stack overflow.
Can the above regex be rewritten to avoid the stack overflow issue?
Some more details on the origin of the stack overflow issue:
Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and sometimes when too many recursive calls are made this error occurs. See the description of the bug for more details. It seems it's triggered mostly by the use of alternations.
Your regex (that has alternations) is matching any 1+ characters between two tags.
You may either use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)) that will make the . match newline symbols as well:
(?s)<Data>(?<data>.+?)</Data>
See this regex demo
However, lazy dot matching patterns still consume lots of memory in case of huge inputs. The best way out is to use an unroll-the-loop method:
<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>
See the regex demo
Details:
<Data> - literal text <Data>
(?<data> - start of the capturing group "data"
[^<]* - zero or more characters other than <
(?:<(?!/?Data>)[^<]*)* - 0 or more sequences of:
<(?!/?Data>) - a < that is not followed with Data> or /Data>
[^<]* - zero or more characters other than <
) - end of the "data" group
</Data> - closing delimiter
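The unrolled pattern broken down above could be used from Java roughly like this. It is a sketch under assumptions: the class name UnrolledDataDemo, the helper extractData and the generated sample input are illustrative, and the input size is just an example above the roughly 1,200 characters mentioned in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnrolledDataDemo {
    // Unrolled-loop pattern from the answer: no alternation inside a quantifier.
    static final Pattern DATA = Pattern.compile(
            "<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    // Returns the content of the first <Data> element, or null if none is found.
    static String extractData(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Build a multiline body well above the size mentioned in the question.
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            body.append("line ").append(i).append('\n');
        }
        String xml = "<Root><Data>" + body + "</Data></Root>";
        String data = extractData(xml);
        System.out.println(data != null && data.length() == body.length());
    }
}
```

Note that the group (?:<(?!/?Data>)[^<]*)* still iterates once per < inside the element, so an element containing a very large number of nested tags may still stress the matcher.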
I had very bad performance with this regex pattern
(?s:.+<.*#.+\..+>.*:)
in my Java application.
The next day I installed a Java profiler and started trying some optimizations. After a few hours, I added "^" at the start of my pattern:
^(?s:.+<.*#.+\..+>.*:)
and performance is much better (7 seconds vs. 800ms on approximately 1500 operations).
My question is why?
As already stated in the comments: because in your first expression every character was tested a couple of times to find a possible match while your second expression is bound to the beginning of the line/string and when it is failing, no further characters will be checked (so the regex engine fails faster, a very important aspect when crafting good expressions).
But read as well the comments from #WiktorStribiżew he's certainly more regex-gifted and/or experienced than I am.
That's pretty clear .. Let me ask you, do you know what's the meaning of ^ when you use it in the beginning of regex?
^ assert position at start of the string
So when you put ^ at the beginning of your regex, you save the engine a lot of work. The pattern can then only match starting at the beginning of your string, and matching stops as soon as that single attempt fails.
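The effect can be observed with a small sketch like the one below. The class name AnchorDemo, the helpers found and nanosFor, and the sample inputs are assumptions for illustration. Timings depend on the machine and JIT warm-up, so the sketch only prints them and makes no claim about the exact ratio.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    static final Pattern UNANCHORED = Pattern.compile("(?s:.+<.*#.+\\..+>.*:)");
    static final Pattern ANCHORED = Pattern.compile("^(?s:.+<.*#.+\\..+>.*:)");

    // True if the pattern is found anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    // Rough single-shot timing in nanoseconds; not a proper benchmark.
    static long nanosFor(Pattern p, String input) {
        long start = System.nanoTime();
        found(p, input);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // A non-matching input: without ^ the engine retries at every start index.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append('x');
        }
        String noMatch = sb.toString();
        System.out.println("unanchored ns: " + nanosFor(UNANCHORED, noMatch));
        System.out.println("anchored ns:   " + nanosFor(ANCHORED, noMatch));
    }
}
```

Because the pattern starts with .+ in single-line mode, both versions accept the same inputs with find(); the anchor only removes the retries from later start positions.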
I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.
Pattern pattern = Pattern.compile(regex);
But suppose I used a .* to specify any number of characters. So it's basically a wildcard.
Pattern pattern = Pattern.compile(".*");
Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am doing this because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null I coalesce it to a .*
In your case, I could just use a possessive quantifier to avoid any backtracking:
.*+
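If the wrapper should skip the regex engine entirely for a missing parameter, one option is to keep the null instead of coalescing it to .*. The following is only a sketch of that idea: the class name OptionalRegex and its matches method are hypothetical, not part of java.util.regex.

```java
import java.util.regex.Pattern;

public class OptionalRegex {
    // null means "no constraint": every input is accepted without running a regex.
    private final Pattern pattern;

    public OptionalRegex(String regex) {
        this.pattern = (regex == null) ? null : Pattern.compile(regex);
    }

    // Returns true if the whole input matches, or if no regex was supplied.
    public boolean matches(CharSequence input) {
        if (pattern == null) {
            return true;
        }
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(new OptionalRegex(null).matches("anything")); // prints true
        System.out.println(new OptionalRegex("a+").matches("aaa"));      // prints true
        System.out.println(new OptionalRegex("a+").matches("b"));        // prints false
    }
}
```

One design point to be aware of: this shortcut is not strictly identical to .*, because without Pattern.DOTALL the dot does not match line terminators, so ".*" with matches() rejects multi-line input while the null shortcut accepts it.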
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.
Here is what Cristian Mocanu writes in his Optimizing regular expressions in Java about a case similar to .*:
Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster then my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).
Some of the hints on Java regex optimization from the same source:
If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().
Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).
Whenever you are using negated character classes to match something other than something else, use possessive quantifiers: instead of [^a]*a use [^a]*+a.
Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!
Beware of a known bug #5050507 (when the regex Pattern class throws a StackOverflowError), if you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.
Instead of lazy dot matching, use a tempered greedy token (e.g. (?:(?!something).)*) or the unrolling-the-loop technique (got downvoted for it today, no idea why).
Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.
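The hints about compiling once and reusing the Matcher through reset() can be sketched as below. The class name ReuseDemo, the helper countNumericLines and the \d+ pattern are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReuseDemo {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Counts the lines that consist of digits only.
    static int countNumericLines(List<String> lines) {
        // One Matcher, re-pointed at each new input with reset(CharSequence).
        // A Matcher is not thread-safe, so keep it local to the calling thread.
        Matcher m = DIGITS.matcher("");
        int count = 0;
        for (String line : lines) {
            if (m.reset(line).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumericLines(Arrays.asList("123", "abc", "42", "4 2"))); // prints 2
    }
}
```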
I'm reading about regular expressions in Java, and I understand that possessive quantifiers do not backtrack, i.e. they do not release characters to give other groups a chance to match.
But I couldn't figure out any situations where possessive quantifiers are used in practice.
I have read some resources saying that since possessive quantifiers don't backtrack, they don't need to remember the position of each character in the input string, which helps to significantly improve performance of the regular expression engine.
I have tested this by writing an example:
I have a string containing thousands of digits.
First I defined a greedy quantifier: String regex = "(\\d+)";
Then I measured the time it took.
Second, I changed it to a possessive one: String regex = "(\\d++)";
I measured the time again, but I don't see any difference.
Am I misunderstanding something?
And besides, can anyone give me some specific cases where possessive quantifiers are in use?
And about the term: in the book "Java Regular Expressions Taming the Java.Util.Regex Engine by Mehran Habibi" he uses the term "possessive qualifiers", while on the Internet people use "possessive quantifier". Which one is correct, or are both?
Possessive quantifiers are quantifiers that are greedy (they try to match as many characters as possible) and don't backtrack (so matching may fail if the possessive quantifier goes too far).
Example
Normal (greedy) quantifiers
Say you have the following regex:
^([A-Za-z0-9]+)([A-Z0-9][A-Z0-9])(.*)
The regex aims to match one or more alphanumeric characters of either case, [A-Za-z0-9], followed by two characters from [A-Z0-9] (uppercase letters or digits), after which any characters may occur.
Any string that obeys this constraint will match, AAA included. One could argue that the greedy first group should also take the second and the third A, but then the string would not match. The regex engine is thus smart enough (through backtracking) to know when the first group has to give characters back.
Non-greedy quantifiers
Now a problem that can occur is that the first group is "too greedy" for data extraction purposes. Say you have the following string AAAAAAA. Several subdivisions are possible: (A)(AA)(AAAA), (AA)(AA)(AAA), etc. By default, each group in a regex is as greedy as possible (as long as this does not prevent the string from matching). The regex will thus subdivide the string into (AAAAA)(AA)(). Suppose instead you want the first group to stop as early as possible: as soon as it has consumed at least one character and two characters in the [A-Z0-9] range follow, the regex should move on to the next group.
In order to achieve this, you can write:
^([A-Za-z0-9]+?)([A-Z0-9][A-Z0-9])(.*)
The string AAAAAAA will match with (A)(AA)(AAAA).
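The difference between the greedy and the non-greedy split can be checked with a short Java sketch. The class name GreedyLazyDemo and the helper split are made up for the illustration; the two patterns are the ones shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyLazyDemo {
    static final String GREEDY = "^([A-Za-z0-9]+)([A-Z0-9][A-Z0-9])(.*)";
    static final String LAZY = "^([A-Za-z0-9]+?)([A-Z0-9][A-Z0-9])(.*)";

    // Renders the captured groups as "(g1)(g2)(g3)", or returns null on no match.
    static String split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            sb.append('(').append(m.group(i)).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(split(GREEDY, "AAAAAAA")); // prints (AAAAA)(AA)()
        System.out.println(split(LAZY, "AAAAAAA"));   // prints (A)(AA)(AAAA)
        System.out.println(split(GREEDY, "AAA"));     // prints (A)(AA)()
    }
}
```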
Possessive quantifiers
Possessive quantifiers are greedy quantifiers, but once they have matched a character, they will never give it back to another group. For instance:
^([A-Z]++)([H-Zw])(.*)
If you wrote ^([A-Z]+)([H-Z])(.*), the string AH0 would be matched. The first group is greedy and initially takes both A and H, but since eating (that's the word sometimes used) H would result in the string not being matched, it is willing to give up H. With the possessive quantifier, the group is not willing to give up H. As a result it eats both A and H. Only 0 is left for the second group, but the second group cannot eat that character. As a result the regex fails where the non-possessive quantifier would have produced a successful match. The string Aw will, however, match successfully, since the first group is not interested in w...
By default, quantifiers are greedy. They will try to match as much as possible. The possessive quantifier prevents backtracking, meaning what gets matched by the regular expression will not be backtracked into, even if that causes the whole match to fail. As stated in Regex Tutorial ( Possessive Quantifiers ) ...
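The AH0 and Aw cases can be reproduced with a small sketch. The class name PossessiveDemo and the helper groups are illustrative assumptions; the patterns are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveDemo {
    static final String GREEDY = "^([A-Z]+)([H-Z])(.*)";
    static final String POSSESSIVE = "^([A-Z]++)([H-Zw])(.*)";

    // Renders the captured groups as "(g1)(g2)(g3)", or returns null on no match.
    static String groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            sb.append('(').append(m.group(i)).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Greedy: the first group gives H back so the second group can match.
        System.out.println(groups(GREEDY, "AH0"));     // prints (A)(H)(0)
        // Possessive: the first group keeps A and H, so nothing fits group two.
        System.out.println(groups(POSSESSIVE, "AH0")); // prints null
        // w is outside [A-Z], so the first group never claims it.
        System.out.println(groups(POSSESSIVE, "Aw"));  // prints (A)(w)()
    }
}
```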
Possessive quantifiers are a way to prevent the regex engine from
trying all permutations. This is primarily useful for performance
reasons. You can also use possessive quantifiers to eliminate certain
matches.