Does the Java regex library optimize for any characters .*? - java

I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.
Pattern pattern = Pattern.compile(regex);
But suppose I used a .* to specify any number of characters. So it's basically a wildcard.
Pattern pattern = Pattern.compile(".*");
Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am doing this because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null I coalesce it to a .*

In your case, I could just use a possessive quantifier to avoid any backtracking:
.*+
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.
Here is what Cristian Mocanu's writes in his Optimizing regular expressions in Java about a case similar to .*:
Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster then my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).
Some of the hints on Java regex optimization from the same source:
If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().
Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).
Whenever you are using negated character classes to match something other than something else, use possessive quantifiers: instead of [^a]*a use [^a]*+a.
Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!
Beware of a known bug #5050507 (when the regex Pattern class throws a StackOverflowError), if you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.
Instead of lazy dot matching, use tempered greedy token (e.g. (?:(?!something).)*) or unrolling the loop techinque (got downvoted for it today, no idea why).
Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

Related

java 8 regular expression for meta characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
Trying to write a regular expression to check if the sentence as metacharacters "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence i need to identify if metacharacters are present.
\\\\$ or ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters. I don't need to replace just identify if the sentence contains these characters.
If you want to know whether a string contains meta characters, you can use some like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly, then, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c, third, matches checks whether the entire string matches the pattern, rather than contains an occurrences, so you’d need something like .*pattern.* to make matches to behave like find, if you insist on it.
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort, optimizing before you even have an actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.
Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions are sometimes depending on a syntactical subdefinition, like character classes, specific sequences (lookahead, lookbehind,...) and are therefore not intrinsically metacaracters per se. Some are only meta characters in a specific context. However the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. However, this means, that sometimes characters will be matched, in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for it's own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. The last two rows of the citation I had ot leave out, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
// this is the found meta character...
String metaCharacter = metaDetector.group(0);
System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.

Speed up regular expression

This is a regex to extract the table name from a SQL statement:
(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)
It matches a token, optionally enclosed in [`'"], preceded by FROM etc. surrounded by whitespace, except for UPDATE which has no leading whitespace.
We execute many regexes, and this is the slowest one, and I'm not sure why. SQL strings can get up to 4k in size, and execution time is at worst 0,35ms on a 2.2GHz i7 MBP.
This is a slow input sample: https://pastebin.com/DnamKDPf
Can we do better? Splitting it up into multiple regexes would be an option, as well if alternation is an issues.
There is a rule of thumb:
Do not let engine make an attempt on matching each single one character if there are some boundaries.
Try the following regex (~2500 steps on the given input string):
(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\S*\s*|\w+\W*(\w[\w\.-]*)
Live demo
Note: What you need is in the first capturing group.
The final regex according to comments (which is a little bit slower than the previous clean one):
(?!(?:FROM|INTO|NEXTVAL|UPDATE|JOIN)\b)\S*\s*|\b(?:NEXTVAL\W*|\w+\s[\s`'"]*)([\[\]\w\.-]+)
Regex optimisation is a very complex topic and should be done with help of some tools. For example, I like Regex101 which calculates for us number of steps Regex engine had to do to match pattern to payload. For your pattern and given example it prints:
1 match, 22976 steps (~19ms)
First thing which you can always do it is grouping similar parts to one group. For example, FROM, INTO and JOIN look similar, so we can write regex as below:
(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)
For above example, Regex101, prints:
1 match, 15891 steps (~13ms)
Try to find some online tools which explain and optimise Regex such as myregextester and calculate how many steps engine needs to do.
Because matches are often near the end, one possibility would be to essentially start at the end and backtrack, rather than start at the beginning and forward-track, something along the lines of
^(?:UPDATE\s|.*(?:\s(?:(?:FROM|INTO|JOIN)\s|NEXTVAL[\s\W]*)))[\s`'\"]*([\w\.-_]+)
https://regex101.com/r/SO7M87/1/ (154 steps)
While this may be much faster when a match exists, it's only a moderate improvement when there's no match, because the pattern must backtrack all the way to the beginning (~9000 steps from ~23k steps)

Regex - Disallow certain characters to appear consecutively

I'm not sure if this is possible or not:
Writing a program to convert an infix notation to postfix notation. All is working well so far but trying to implement validation is proving difficult.
I'm trying to use a regex to validate an infix notation, conforming to the following rules:
String must only start with a number or ( (program does not allow negative numbers)
String must only end with number or )
String must only contain 0-9*/()-+
String must not allow following characters to appear together +*/-
I have a regex which conforms to the first 3 rules:
(^[0-9(])([0-9+()*]+)([0-9)]+$)
Is it possible to use regex to implement the last rule?
I will answer only to fourth rule as you have problem only with it.
Yes, there is a possibility, but I think regex is not appropriate tool to check that...
This pattern ^(?(?=.*\+)(?!.*[\*\/-])).+$ will match any string that contain + and not contain other characters: /,*,-. For one character is already lengthy and hard to read. See demo.
It uses conditional expression (?...) to check if lookahead checking for + was successfull, if it is, then negative lookahead assures that you won't have any of \*- characters.
For all characters, the regex will become very big and hard to maintain.
That's why I don't recommend it for this task.
I agree with Michał Turczyn that regex is not the task for this, but not with the reason. It is easy to implement your restrictions. However, your restrictions also allow expressions like (0+3, 2(*4), ((((1, and other things you likely don't want - so regex validation is kind of pointless. If you were writing this with a regex engine with some significant power like PCRE (Perl, PHP) or Onigmo (Ruby), you can fake a parser in regex; but in Java, regex is quite restricted in what it can do. It is enough for the requirements in the question, though:
^[0-9(](?:(?![+*/-][+*/-])[0-9+*/()-])*[0-9)]$
starts with digit or paren
any number of repetitions of any allowed character, such that that character and the next character aren't both operators
ends with digit or thesis.

Regex search with pattern containing (?:.|\s)*? takes increasingly long time

My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches.
I suspect catastrophic backtracking.
Please help! How can I optimize this regex:
String regex = "<tr bgcolor=\"ffffff\">\\s*?<td width=\"20%\"><b>((?:.|\\s)+?): *?</b></td>\\s*?<td width=\"80%\">((?:.|\\s)*?)(?=(?:</td>\\s*?</tr>\\s*?<tr bgcolor=\"ffffff\">)|(?:</td>\\s*?</tr>\\s*?</table>\\s*?<b>Tags</b>))";
EDIT: since it was not clear(my bad): i am trying to take a html formatted document and reformat by extracting the two search groups and adding formating afterwards.
The alternation (?:.|\\s)+? is very inefficient, as it involves too much backtracking.
Basically, all variations of this pattern are extremely inefficient: (?:.|\s)*?, (?:.|\n)*?, (?:.|\r\n)*? and there greedy counterparts, too ((?:.|\s)*, (?:.|\n)*, (?:.|\r\n)*). (.|\s)*? is probably the worst of them all.
Why?
The two alternatives, . and \s may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.
Patterns like (?:.|\n)*? / (?:.|\n)* in Java often cause a Stack Overflow issue, and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.|\n) is the only way to match any char).
Solution
If you want to match any characters including whitespace with regex, do it with
[\\s\\S]*?
Or enable singleline mode with (?s) (or Pattern.DOTALL Matcher option) and just use . (e.g. (?s)start(.*?)end).
NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.

Conditional Regex searches

I'm attempting to create a Regular Expressions code in Java that will have a conditional search term.
What I mean by this is let's say I have 5 words; tree, car, dog, cat, bird. Now I would like the expression to search for these terms, however is only required to match 3 out of the five, and it could be any of the 5 it chooses to match.
I thought perhaps a using a back reference ?(3) would work but doesn't seem to do the trick.
A standard optional search (?) wouldn't work either because all terms are optional, however the number of matches required is not. Essentially is there a way to create a string that must be 50% (or any percent) correct to provide a match?
Would anyone happen to know or could point me in the right direction?
(I would hopefully like it working client side if possible)
Does it have to be a free-standing regular expression without any further code? A simple loop testing for each word and counting matches should do this perfectly. Pseudocode assuming you want N unique matches (you can also swap the substring test with a regex, doesn't matter how you determine matches as long as you keep the counting of unique matches out of the regex):
bool has_N_words(int n, string[] words, string text) {
int matches = 0;
foreach word in words {
if (word.substringOf(text)) counter++
if (counter >= n) return true
}
return false
}
It seems to me the only (save mind-blowing uses of obscure regex extensions - not that I have something in mind, I've just been surprised again and again what modern regex implementations allow) way to do this with an regular expression goes like this:
Enumerate all unique (ignoring order or not depending on implementation, see below) permutations of words
For each permutation, build a sub-regex that matches a string containing those words, either by
joining the first three words with .*? (this requires all unique permutations)
using three lookahead assertions like (?=.*word) (this allows dropping word combinations that occured before in a different order)
Combine all sub-regexes in a giant or.
That's impractical to do by hand, ugly and complex (as in computational complexity, not in programming effort) to do automatically, and inefficient as well as quite hacky either way.
I don't see why you would want to do this with a regext but if you really need it to be a regex:
/(tree|car|dog|cat|bird)/
Then count the matches you get from that...
(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*
The (?i) is for case insensitive and the (?s) to match new lines with .* also, since you are looking at emails.
The ? at the end is the reluctant quantifier.
I haven't actually tried it.

Categories