Performance overhead/improvement using regular expressions - java

If I need to check if for example a word A or word B exists in a text (String), is there a performance difference if I do:
if(text.contains(wordA) || text.contains(wordB))
to using some regular expression that searches the string?
Does it depend on the regular expression format?
Or is it just a matter of taste?
UPDATE:
If text.contains(wordA) is false, then text.contains(wordB) will be evaluated.
This means that contains may be called twice.
I was wondering whether, in performance terms, a regex might be better than calling contains twice.

The code you have expresses your intent clearly, is more readable than a regexp, and is also probably faster.
Anyway, there is a very low probability that this part of your code causes any significant performance problem. So I wouldn't worry about performance here, but about readability and maintainability.

While the performance of regular expressions is lower, they have more expressive power, and often that is more important. For example:
"performance".contains("form") // is true
This may not be what you intended by a "word". Instead you can use the pattern
"\\bform\\b"
The \b word boundaries mean this will only match "form" as a complete word, wherever it occurs in the string.
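To make the distinction concrete, a minimal sketch in Java (the word "form" is just the illustrative substring from above):

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        String text = "performance";

        // Plain substring search: finds "form" inside "performance".
        System.out.println(text.contains("form"));                   // true

        // \b is a word boundary, so this only matches "form" as a whole word.
        Pattern word = Pattern.compile("\\bform\\b");
        System.out.println(word.matcher(text).find());               // false
        System.out.println(word.matcher("fill in the form").find()); // true
    }
}
```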

Yes, there is a difference. contains does a straightforward character-by-character search to find the words; a regex engine uses different logic, so it will behave differently, and performance will even change depending on how you use the regular expression matching.
Will it be significant? That's hard to tell. But the best thing you should realise:
First write your code and don't bother questioning performance until you run into problems, and profiling clearly indicates that this test is the issue.
I would just use the contains method. But that's an opinion without actually testing anything.

With this trivial example you shouldn't see much of a performance difference, but purely from the algorithms involved the regular expression
wordA|wordB
would indeed be faster, as it just makes a single pass through the string and employs a finite automaton to match one of the two substrings. However, this is offset by building the finite automaton first, which should be pretty much linear in the length of the regex in this case. You can compile the regex first to have that cost only once as long as the compiled object lives.
So essentially cost comes down to:
linear search through the string twice (2 · string length)
or linear search through the string once and building the DFA (string length + regex length)
if your text is very large and the substrings very small, then this could be worthwhile.
Still, you're optimising the wrong place, most likely. Use a profiler to find the actual bottlenecks in your code and optimise those; don't ever worry about such trivial “optimisations” unless you can prove them to make an impact.
One final thing to consider, though: With a regex you could make sure you're actually matching words (or things that look like words) instead of word parts, which might be an actual reason to consider regex instead of contains.
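Combining the two points above (compile once, match whole words), a hedged sketch; "cat" and "dog" are stand-in literals, and Pattern.quote guards against metacharacters in them:

```java
import java.util.regex.Pattern;

public class CompiledAlternation {
    // Compiled once and reused for every search, so the automaton-building
    // cost is paid a single time.
    private static final Pattern EITHER_WORD =
            Pattern.compile("\\b(" + Pattern.quote("cat") + "|"
                                   + Pattern.quote("dog") + ")\\b");

    static boolean containsEitherWord(String text) {
        return EITHER_WORD.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEitherWord("the dog barked"));  // true
        System.out.println(containsEitherWord("dogmatic cattle")); // false: only word parts
    }
}
```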

In my opinion it's a matter of taste. Avoid premature optimization; see Practical rules for premature optimization.
As a general rule, if you are looking for literal substrings and not patterns, don't use regular expressions.
There will be only a minor performance difference between such a simple regex and the plain text search, so if you do this search only once in a while it's not a performance issue. If you do it thousands of times or more in a loop, then benchmark it, if you actually have performance problems.

Related

Finding all regular expression/s from a value

I have a variable containing a URL and a file containing hundreds of regular expressions. How can I find which regular expression(s) hold true for that variable? I don't want to do a pattern match for each and every pattern in the file; I'm looking for a performance-efficient solution.
While, ultimately, you won't get away with a "true" performance-efficient solution, there are some simple heuristics you can utilize to help cut down on the number of patterns you need to evaluate.
For instance, try "grouping" patterns using simplified versions. Consider the two patterns
[a-z]\d[a-z]
[a-z]{3}
Any string matching either of these patterns will also match the pattern [a-z].[a-z]. If you test that more general pattern first, and skip the two specific patterns whenever it doesn't match, you'll (likely) save on overall processing time. The more you can generalize, the more patterns you can eliminate at once. The ultimate expression of this is hierarchical, in which patterns follow a file-system-like organization of groups. While the worst-case performance of this scheme is worse than just going through all the patterns, the average case will likely be somewhat better, as whole groups of patterns are eliminated.
You're not going to get better than O(n) performance on the number of regexes, but you're likely to have savings on the coefficient of n.
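A minimal sketch of the guard-pattern idea in Java; the patterns are the toy ones from above, and the grouping is hand-built rather than derived automatically:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class GuardedPatterns {
    public static void main(String[] args) {
        // General "guard" pattern: if it fails, both specific ones must fail too.
        Pattern guard = Pattern.compile("[a-z].[a-z]");
        List<Pattern> specifics = List.of(
                Pattern.compile("[a-z]\\d[a-z]"),
                Pattern.compile("[a-z]{3}"));

        for (String input : new String[] {"a1b", "abc", "A--"}) {
            List<String> hits = new ArrayList<>();
            if (guard.matcher(input).find()) {      // cheap test first
                for (Pattern p : specifics) {       // only then the specific ones
                    if (p.matcher(input).find()) hits.add(p.pattern());
                }
            }
            System.out.println(input + " -> " + hits);
        }
    }
}
```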

RegEx Vs If statement to validate number range

I have to validate that a number falls within the range 0-255.
I can do this with a regular expression or with an if statement.
RegEx:
\b([0-9]{1,2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\b
Or
if (number >= 0 && number <= 255)
I want to know which one is better to use to validate number range.
I use a simple rule:
If you can code it without a regexp and keep it simple, then do it without.
Regexps give you a lot of power, but they can be tricky to master.
In your case the "if" code will run faster and have much better readability.
A lot of the time, regexps grow into something very complex to understand and maintain as requirements change.
You will probably use String.matches() for the check, which is very inefficient: it internally compiles the pattern every time, uses synchronization, and so on.
So, bottom line, avoid regexes wherever possible. (Also, you would have to convert the number to a String first and then run the regex on it. What a waste of both space and time.)
PS: Also note that plain numeric operations are handled efficiently on every platform.
Number comparison is much more efficient than String-with-regex comparison. Comparing a number as a String is overcomplicating things.
Your regex will produce partial matches when used with the following data:
-123
+12
!12.
So it's better to use numeric comparison, both to avoid these unseen problems and to avoid maintaining a complex regex. See demo.
https://regex101.com/r/mS3tQ7/11
To keep it simple and easy to understand (w.r.t. your problem) I would suggest going for the if statement. But for complex validations I would suggest using a regex, and the reason is the same: to keep it simple and easy to understand. Why use 8-10 lines of if-then blocks when you can validate the same thing with a concise 25-30 character regex pattern? And if you put that pattern in a .config file, you can change the behavior of your app without recompiling. It's less code doing more work in a flexible way.
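For comparison, both validations side by side in Java; the regex is the one from the question, and note that the numeric route needs the string parsed first (on signed input like "+12" the two disagree, because parseInt accepts the sign while the regex does not):

```java
import java.util.regex.Pattern;

public class RangeCheck {
    private static final Pattern RANGE =
            Pattern.compile("([0-9]{1,2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])");

    // Regex route: the whole string must be a number from 0 to 255.
    static boolean inRangeRegex(String s) {
        return RANGE.matcher(s).matches();   // matches() anchors; no \b needed
    }

    // Numeric route: parse, then compare.
    static boolean inRangeNumeric(String s) {
        try {
            int n = Integer.parseInt(s);
            return n >= 0 && n <= 255;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "255", "256", "-1", "+12"}) {
            System.out.println(s + ": regex=" + inRangeRegex(s)
                                 + " numeric=" + inRangeNumeric(s));
        }
    }
}
```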

Does java world has the counterpart Regexp::Optimizer in perl? [duplicate]

I wrote a Java program which can generate a sequence of symbols, like "abcdbcdefbcdbcdefg". What I need is Regex optimizer, which can result "a((bcd){2}ef){2}g".
As the input may contain unicode characters, like "a\u0063\u0063\bbd", I prefer a Java version.
The reason I want to get a "shorter" expression is for saving space/memory. The sequence of symbols here could be very long.
In general, to find the "shortest" optimized regex is hard. So, here, I don't need ones that guarantee the "shortest" criteria.
I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg in terms of matching speed is likely to be abcdbcdefbcdbcdefg. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
Regular expressions are not a substitute for compression
Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.
I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.
I can't give you an existing program, but I can tell you how to approach writing one.
There is no canonical minimum regex form and determining the true minimum size regex is NP hard. Certainly your sets are finite, so this may be a simpler problem. I'll have to think about it.
But a good heuristic algorithm would be:
Construct a trivial non-deterministic finite automaton (NFA) that accepts all your strings.
Convert the NFA to a deterministic finite automaton (DFA) with the subset construction.
Minimize the DFA with the standard algorithm.
Use the construction from the proof of Kleene's theorem to get to a regex.
Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.
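Short of the full automaton construction, a greedy heuristic can be sketched directly on strings; collapseRepeats below is a hypothetical helper that finds the longest adjacent repeat, rewrites it as (X){n}, and recurses. It makes no minimality guarantee, but it does reproduce the example from the question:

```java
public class RepeatCollapser {

    // Collapse adjacent repeats of a substring into (X){n} form, recursively.
    // Greedy: tries the longest repeating unit first. The output is a regex
    // that matches exactly the input string (assuming the input contains no
    // regex metacharacters).
    static String collapseRepeats(String s) {
        for (int len = s.length() / 2; len >= 1; len--) {
            for (int i = 0; i + 2 * len <= s.length(); i++) {
                String unit = s.substring(i, i + len);
                int count = 1;
                while (i + (count + 1) * len <= s.length()
                        && s.startsWith(unit, i + count * len)) {
                    count++;
                }
                if (count >= 2) {
                    return collapseRepeats(s.substring(0, i))
                            + "(" + collapseRepeats(unit) + "){" + count + "}"
                            + collapseRepeats(s.substring(i + count * len));
                }
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String input = "abcdbcdefbcdbcdefg";
        String regex = collapseRepeats(input);
        System.out.println(regex);                                         // a((bcd){2}ef){2}g
        System.out.println(java.util.regex.Pattern.matches(regex, input)); // true
    }
}
```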

String manipulation vs Regexps

We are often told that Regexps are slow and should be avoided whenever possible.
However, taking into account the overhead of doing the string manipulation oneself (not talking about algorithmic mistakes, which are a different matter), especially in PHP or Perl (maybe Java): where is the limit? In which cases can we consider string manipulation a better alternative? Which regexps are particularly CPU-greedy?
For instance, for the following, in C++, Java, PHP or Perl, what would you recommend
The regexps would probably be faster:
s/abc/def/g or a ... while((i=index("abc",$x)>=0) ...$y .= substr()... based solution?
s/(\d)+/N/g or a scanning algorithm
But what about
an email validation regexp?
s/((0|\w)+?[xy]*[^xy]){2,7}/u/g
wouldn't a handmade and specific algorithm be faster (while longer to write)?
edit
The point of the question is to determine what kind of regexp would better be rewritten specifically for a given problem via string manipulation?
edit2
A common implementation is Perl regexp. For instance in Perl - that requires to know how they are implemented - what kind of regexp is to be avoided, because the implementation will make the process lengthy and ineffective? It may not be a complex regexp...
edit July 2011 (based on comments)
I'm not saying all regexps are slow. Some particular regexp patterns are known to be slow, due to the particular processing they require and due to their implementation. In recent Perl / PHP implementations, for instance, what is known to be rather slow and should be avoided? The answer is expected from people who have already done their own research (profiler...) and who are able to provide general guidelines about what is recommended / to be avoided.
Who said regexes were slow? At least in Perl they tend to be the preferred method of manipulating strings.
Regexes are bad at some things like email validation because the subject is too complex, not because they are slow. A proper email validation regex is over 6,000 characters long, and it doesn't even handle all of the cases (you have to strip out comments first).
At least in Perl 5, if it has a grammar it probably shouldn't be parsed with one regex.
You should also rewrite a regex as a custom function if the regex has grown to the point it can no longer be easily maintained (see the previous email validation example) or profiling shows that the regex is the slow component of your code.
You seem to be concerned with the speed of the regex vs the custom algorithm, but that is not a valid concern until you prove that it is with a profiler. Write the code in the most maintainable way. If a regex is clear, then use a regex. If a custom algorithm is clear, then use a custom algorithm. If you find that either is eating up a lot of time after profiling your code, then start looking for alternatives.
A nice feature of manipulating text with regular expressions is that patterns are high-level and declarative. This leaves the implementation considerable room for optimization such as factoring out the longest common prefix or using Boyer-Moore for static strings. Concise notation makes for quicker reading by experts. I understand immediately what
if (s/^(.)//) {
...
}
is doing, and substr($_, 0, 1) = "" looks noisy in comparison.
Rather than the lower bound, the important consideration for regular expressions is the upper bound. It's a powerful tool, so people believe it's capable of correctly extracting tokens from XML, email addresses, or C++ programs and don't realize that an even more powerful tool such as a parser is necessary.
Regular expressions will never be faster than a hand-made algorithm for a very specific purpose. Worse, in PHP they have to be compiled the first time they're used (a cache is used afterwards).
However, they are certainly more succinct. Moreover, writing custom algorithms is often slower than using regexes, because regular expression engines are usually implemented in a lower-level language, with less overhead in calling functions, etc.
For instance, preg_replace('/a/', 'b', $string) will almost certainly be faster than looping in PHP through the string. But this is a constant penalty in execution time and sometimes regular expressions, due to backtracking, can have a much worse asymptotic behavior.
You are strongly encouraged to know how regular expressions are implemented so that you can know when you're writing inefficient ones.
Some regular expressions are extremely fast and the difference between the regex and a custom solution may be negligible (or not worth anyone's time to bother).
The cases where regular expressions are slow, however, are when excessive backtracking occurs. A regex engine parses from left to right and may be able to match the text in more than one way. So when it reaches a point where it realizes the current attempt isn't going to match your test string, it may start over and try to match in another way. This repeated backtracking adds up and slows down the matching.
Often the regular expression can be rewritten to perform better. But the ultimate in performance would be to write your own optimized parser for the specific task. By writing your own parser you can for example parse from left to right while maintaining a memory (or state). If you use this technique in procedural code you can often achieve the effect you're looking for in one pass and without the slowness of backtracking.
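As an illustration of that one-pass idea, here is a hand-rolled scan in Java that does what the earlier s/(\d)+/N/g example does, keeping a single bit of state (am I inside a digit run?) and never backtracking:

```java
public class DigitRunReplacer {

    // One left-to-right pass; state = whether we are currently in a digit run.
    static String replaceDigitRuns(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean inRun = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                if (!inRun) {          // first digit of a run: emit the marker once
                    out.append('N');
                    inRun = true;
                }
            } else {
                out.append(c);
                inRun = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceDigitRuns("ab12cd345e"));       // abNcdNe
        System.out.println("ab12cd345e".replaceAll("\\d+", "N")); // same result via regex
    }
}
```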
I was faced with this decision earlier this year. In fact the task at hand was on the outer fringe of what was even possible with regular expressions. In the end I decided to write my own parser, an embedded pushdown automaton, which is incredibly efficient for what I was trying to do. The task, by the way, was to build something that can parse regular expressions and provide Intellisense-like code hinting for them.
It's somewhat ironic that I didn't use regular expressions to parse regular expressions, but you can read about the thought behind it all here...
http://blog.regexhero.net/2010/03/code-hinting-for-regular-expressions.html
what kind of regexp would better be rewritten specifically for a given problem via string manipulation?
Easy.
Determine if you ever need to rewrite anything at all (a positive answer applies to perhaps 1 in 10000 scripts: massive text parsing, resource-critical code).
Do profile the possible solutions.
Use the one that suits you for the given problem.
As for the remaining 9999 cases, don't waste your time on such a trifling problem; use whichever you like more.
Every time you ask yourself such a question, it is extremely useful to remember that by default all your extra-optimized and super-fast code is itself parsed char by char on every user request. No brain-cracking regexps, no devious string manipulation, just good old picking off chars one by one.
Regexes aren't slow. But an implementation can be slow, mostly because the pattern is often interpreted and rebuilt each time it is used. A good regexp library lets you use compiled versions, which are pretty fast.

Best practices for regex performance VS sheer iteration

I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?
While the decision given above for .contains() is trivial (why bother with a regex if you can do it in a single call), real life brings more complex choices. For example, is it better to do two .contains() calls or a single regex?
My rule of thumb has been to always use a regex, unless it can be replaced with a single API call. This prevents the code from bloating, but is probably not so good from a readability point of view, especially if the regex gets big.
Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") this regex requires? Would it be faster than sheer iteration? Somehow everybody assumes that once a regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if the regex cannot be pre-compiled in advance.
RegexBuddy has a built-in regular expressions debugger. It shows how many steps the regular expression engine needed to find a match or to fail to find a match. By using the debugger on strings of different lengths, you can get an idea of the complexity (big O) of the regular expression. If you look up "benchmark" in the index of RegexBuddy's help file you'll get some more tips on how to interpret this.
When judging the performance of a regular expression, it is particularly important to test situations where the regex fails to find a match. It is very easy to write a regular expression that finds its matches in linear time, but fails in exponential time in a situation that I call catastrophic backtracking.
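A classic trigger for catastrophic backtracking is a nested quantifier such as (x+)+y; the sketch below fails to match, but the number of engine steps grows roughly as 2^n in the count of x's (keep n small if you try this):

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // (x+)+y can partition a run of x's in exponentially many ways,
        // and on failure Java's backtracking engine tries them all.
        Pattern p = Pattern.compile("(x+)+y");

        String almost = "x".repeat(20) + "z";   // 20 x's, then a non-matching char
        long start = System.nanoTime();
        boolean matched = p.matcher(almost).matches();
        long micros = (System.nanoTime() - start) / 1_000;

        System.out.println("matched=" + matched + " in ~" + micros + " µs");
        // Each extra 'x' roughly doubles the failure time; at 40 x's this
        // attempt would take minutes.
    }
}
```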
To use your 5 if statements as an example, the regex one|two|three|four|five scans the input string once, doing a little bit of extra work when an o, t, or f is encountered. But 5 if statements checking if the string contains a word will search the entire string 5 times if none of the words can be found. If five occurs at the start of the string, then the regex finds the match instantly, while the first 4 if statements scan the whole string in vain before the 5th if statement finds the match.
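The scenario above, sketched in Java (the five words are the example's own; on a miss the alternation scans the text once, while the chained contains calls scan it up to five times):

```java
import java.util.regex.Pattern;

public class AlternationVsContains {
    private static final Pattern ANY =
            Pattern.compile("one|two|three|four|five");

    static boolean viaRegex(String text) {
        return ANY.matcher(text).find();   // single left-to-right scan
    }

    static boolean viaContains(String text) {
        return text.contains("one") || text.contains("two")
            || text.contains("three") || text.contains("four")
            || text.contains("five");      // up to five full scans on a miss
    }

    public static void main(String[] args) {
        String hit = "five golden rings";
        String miss = "no numbers here at all";
        System.out.println(viaRegex(hit) + " " + viaContains(hit));   // true true
        System.out.println(viaRegex(miss) + " " + viaContains(miss)); // false false
    }
}
```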
It's hard to estimate performance without using a profiler, generally the best strategy is to write what makes the most logical sense and is easier to understand/read. If two .contains() calls are easier to logically understand then that's the better route, the same logic applies if a regex makes more sense.
It's also important to consider that other developers on your team may not have a great understanding of regex. If at a later time in production the use of regex over .contains() (or vice versa) is identified as a bottleneck, try and profile both.
Rule of thumb: Write code to be readable, use a profiler to identify bottlenecks and only then replace the readable code with faster code.
I would strongly suggest that you write the code for both and time it. It's pretty simple to do this, and you'll get an answer that is not a generic "rule of thumb" but instead a very specific answer that holds for your problem domain.
Vance Morrison has an excellent post about micro benchmarking, and has a tool that makes it really simple for you to answer questions like this...
http://msdn.microsoft.com/en-us/magazine/cc500596.aspx
If you want my personal "rule of thumb" then it's that RegEx is often slower for this sort of thing, but you should ignore me and measure it yourself :-)
If, for non-performance reasons, you continue to use Regular Expressions then I can really recommend two things. Get a profiler (such as ANTS) and see what your code does in production. Then, get a copy of the Regular Expression Cookbook...
http://www.amazon.co.uk/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ref=sr_1_1?ie=UTF8&s=books&qid=1259147763&sr=8-1
... as it has loads of tips on speeding up RegEx code. I've optimized RegEx code by a factor of 10 following tips from this book.
The answer (as usual) is that it depends.
In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):
for (i = 0; i < stringLength; i++) {
    if (string at pos i starts with "this")
        found!
    if (string at pos i starts with "that")
        found!
}
There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.
On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.
To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.
And then you won't be so worried about a few cycles here and there.
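The last example above, expressed as an actual pattern; \b[AB]\w*[g-m]\b is one plausible reading of "words starting with 'A' or 'B' and ending in 'g'-'m'":

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordShapeDemo {
    public static void main(String[] args) {
        // Whole words that start with A or B and end with a letter from g to m.
        Pattern p = Pattern.compile("\\b[AB]\\w*[g-m]\\b");

        Matcher m = p.matcher("Bring an Apple, a Bag and your Arm");
        while (m.find()) {
            System.out.println(m.group());   // Bring, Bag, Arm
        }
    }
}
```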
