Best practices for regex performance VS sheer iteration

Best practices for regex performance VS sheer iteration - java

I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?
While above given decision for .contains() is trivial (why bother with regex if you can do this in a single call), real life brings more complex choices to make. For example, is it better to do two .contains() calls or a single regex?
My rule of thumb was to always use regex, unless this can be replaced with a single API call. This prevents code against bloating, but is probably not so good from code readability point of view, especially if regex tends to get big.
Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") does this regex require? Would it be faster than sheer iteration? Somehow everybody assumes that once regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if regex cannot be pre-compiled in advance.

RegexBuddy has a built-in regular expressions debugger. It shows how many steps the regular expression engine needed to find a match or to fail to find a match. By using the debugger on strings of different lengths, you can get an idea of the complexity (big O) of the regular expression. If you look up "benchmark" in the index of RegexBuddy's help file you'll get some more tips on how to interpret this.
When judging the performance of a regular expression, it is particularly important to test situations where the regex fails to find a match. It is very easy to write a regular expression that finds its matches in linear time, but fails in exponential time in a situation that I call catastrophic backtracking.
To use your 5 if statements as an example, the regex one|two|three|four|five scans the input string once, doing a little bit of extra work when an o, t, or f is encountered. But 5 if statements checking if the string contains a word will search the entire string 5 times if none of the words can be found. If five occurs at the start of the string, then the regex finds the match instantly, while the first 4 if statements scan the whole string in vain before the 5th if statement finds the match.

It's hard to estimate performance without using a profiler, generally the best strategy is to write what makes the most logical sense and is easier to understand/read. If two .contains() calls are easier to logically understand then that's the better route, the same logic applies if a regex makes more sense.
It's also important to consider that other developers on your team may not have a great understanding of regex. If at a later time in production the use of regex over .contains() (or vice versa) is identified as a bottleneck, try and profile both.
Rule of thumb: Write code to be readable, use a profiler to identify bottlenecks and only then replace the readable code with faster code.

I would strongly suggest that you write the code for both and time it. It's pretty simple to do this and you'll get an answers that is not a generic "rule of thumb" but instead a very specific answer that holds for your problem domain.
Vance Morrison has an excellent post about micro benchmarking, and has a tool that makes it really simple for you to answer questions like this...
http://msdn.microsoft.com/en-us/magazine/cc500596.aspx
If you want my personal "rule of thumb" then it's that RegEx is often slower for this sort of thing, but you should ignore me and measure it yourself :-)
If, for non-performance reasons, you continue to use Regular Expressions then I can really recommend two things. Get a profiler (such as ANTS) and see what your code does in production. Then, get a copy of the Regular Expression Cookbook...
http://www.amazon.co.uk/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ref=sr_1_1?ie=UTF8&s=books&qid=1259147763&sr=8-1
... as it has loads of tips on speeding up RegEx code. I've optimized RegEx code by a factor of 10 following tips from this book.

The answer (as usual) is that it depends.
In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):
for( i = 0; i < stringLength; i++ ) {
if( stringAt pos i starts with "this" )
found!
if( stringAt pos i starts with "that" )
found!
}
There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.
On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.
To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.
And then you won't be so worried about a few cycles here and there.

Related

RegEx Vs If statement to validate number range

I have to validate a number falls within the range (0-255).
I can do this with Regular expression or using if statement.
RegEx:
\b([0-9]{1,2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\b
Or
If(number>-1 && number <=255)
I want to know which one is better to use to validate number range.

I use a simple rule:
If you can code without regexp and keep it simple - than do it without.
Regexps gives you a lot of power, but it can be tricky to master.
In your case - the "if" code will run faster and will have much better readability.
A lot of times - regexps can amount to something which is very complex to understand and maintain as requirements change.

You will probably use String.matches() for matching / checking. Which is very inefficient. It internally compiles the pattern, uses synchronization blah blah..
So , bottom line, avoid regexes wherever possible (Also, you will have to convert the number to a String and then use regex. What a waste of both space and time)
PS : Also note that mathematical operations are always handled more efficiently across platforms.

Number comparison is much efficient than String with regex comparison. By comparing number as a String is over complication.

Your regex will get you partial matches when used with the following data:
-123
+12
!12.
So its better to use string comparison to avoid unseen problems and to maintain a complex regex.See demo.
https://regex101.com/r/mS3tQ7/11

To keep it simple and easy to understand(w.r.t your problem) I would suggest to go for If statement. But for complex validations I would suggest using regex. The reason for this is also the same - to keep it simple and easy to understand. Why use 8-10 lines of if-then blocks when you can validate the same with concise 25-30 character regex pattern! And if you put that same pattern in a .config file, you can now change the behavior of your app without recompiling. It's less code doing more work in a flexible way.

Performance overhead/improvement using regular expressions

If I need to check if for example a word A or word B exists in a text (String), is there a performance difference if I do:
if(text.contains(wordA) || text.contains(wordB))
to using some regular expression that searches the string?
Does it depend on the regular expression format?
Or is it just a matter of taste?
UPDATE:
If text.contains(wordA) is false then the text.contains(wordB) will be evaluated.
This means that contains will be called twice.
I was thinking if in performance terms a regex might be better than calling contains twice.

The code you have expresses your intent clearly, is more readable than a regexp, and is also probably faster.
Anyway, there is a very low probability that this part of your code causes any significant performance problem. So I wouldn't worry about performance here, but about readability and maintainability.

While the performance of regular expression is lower, it has more expressive power and often this is more important. For example.
"performance".contains("form") // is true
this may not be wheat you intended by a "word" Instead you can have a pattern
"\\bform\\b"
This will only match a complete word in a string which can be at the start or the end.

Yes their is a difference. Contains does various array manipulation to find the words, a regex uses diffent logic so it will be different, performance will even change depending how you uses the regular expression matching.
Will it be significant ? thats hard to tell. But the best thing you should realise:
First write your code and dont bother with questioning performance until you run into problems, after profiling clearly indicates that this test is the issue.
I would just use the contains method. But thats an opinion without actually testing anything.

With this trivial example you shouldn't see much of a performance difference, but purely from the algorithms involved the regular expression
wordA|wordB
would indeed be faster, as it just makes a single pass through the string and employs a finite automaton to match one of the two substrings. However, this is offset by building the finite automaton first, which should be pretty much linear in the length of the regex in this case. You can compile the regex first to have that cost only once as long as the compiled object lives.
So essentially cost comes down to:
linear search through the string twice (2 · string length)
or linear search through the string once and building the DFA (string length + regex length)
if your text is very large and the substrings very small, then this could be worthwhile.
Still, you're optimising the wrong place, most likely. Use a profiler to find the actual bottlenecks in your code and optimise those; don't ever worry about such trivial “optimisations” unless you can prove them to make an impact.
One final thing to consider, though: With a regex you could make sure you're actually matching words (or things that look like words) instead of word parts, which might be an actual reason to consider regex instead of contains.

In my opinion its a matter of taste. Avoid doing premature optimization, see Practical rules for premature optimization.
As a general rule, if you are looking for words substrings and not patterns, then don't use regular expressions.
There will be only a minor performance difference for such a simple regex against the text search, so if you do this search only once in a while its not a performance issue. If you do it for some thousand times or more, in a loop, then make a benchmark, if you have performance problems

Should we use regular expression in Java?

I know regular expressions are very powerful, and to become an expert with them is not easy.
One of my colleagues once wrote a java class to parse formatted text files. Unfortunately it caused a StackOverFlowError in the first integration test. It seems difficault to find the bug, before another colleague from structural programming world came over and fixed it quickly by thowing away all regular expressions and instead using many nested conditional statements and many split and trim methods, and it works very well!
Well, why do we need regular expression in a programming language like Java? As far as I know, the only necessary usage of regular expression is the find/replace function in text editors.

Like everything else: Use with care and KISS
I use regexes quite often, but I don't go over the top and write a 100 character regex, because I know that I (personally) won't understand it later... in fact I think my limit is about 30-40 characters, something larger than that makes me spend too much time scratching my head.

Anything that can be expressed as a regular expression can, by definition, be expressed as a chain of IFs. You use REGEX basically for two reasons:
REGEX libraries tends to have optimized implementation that most of the time will be better than a hand-coded "IF" chain for some expressions.
REGEX are usually easier to follow, if properly written, than the IF chains. Specially for more complex expressions.
If your expression gets too complex, the use the advice given by this answer. If it get truly nasty, think about learning how to use a parser generator like ANTLR or JavaCC. A simple grammar usually can replace a regex, and it is a lot easier to maintain.

So the multiple nested conditional statements with many split and trim methods are easier for you to debug than a single line or two with regular expressions?
My preference is regular expressions because once you learn them, they are far more maintainable and far easier to read than parsing huge nested if loops.

If you find that a regular expression would get too complex and unmaintable, use code instead. Regular expressions can get very complex even for things that sound very simple at first. For example validation of dates in the format mm/dd/yy[yy] is as "simple" as:
^(((((((0?[13578])|(1[02]))[\.\-/]?((0?[1-9])|([12]\d)|(3[01])))|(((0?[469])|(11))[\.\-/]?((0?[1-9])|([12]\d)|(30)))|((0?2)[\.\-/]?((0?[1-9])|(1\d)|(2[0-8]))))[\.\-/]?(((19)|(20))?([\d][\d]))))|((0?2)[\.\-/]?(29)[\.\-/]?(((19)|(20))?(([02468][048])|([13579][26])))))$
Nobody can maintain that. Manually parsing the date will need more code but can be much more readable and maintainable.
Regular expressions are very powerful and useful for matching TEXT patterns, but are bad for validation with numeric parts like dates.

As always, you should use the best tool for the job. I would define the "best tool" by the most simple, understandable, effective method that fulfills the requirements.
Often regexes will simplify code and make it more readable. But this is not always the case.
Also, I would not jump to conclusions that regexes caused the StackOverflowError.

Regular expressions are a tool (like many others). You should use it when the work to be done could best be done with that tool. To know which tool to use, it helps ask a question like "When could I use regular expressions?". And of course it will become easier to decide which tool to use when you have many different tools in your toolbox and you know them fairly well.

You can use regex cleverly by spliting those into smaller chunks, something like,
final String REGEX_SOMETHING = "something";
final String REGEX_WHATEVER = "whatever";
..
String REGEX_COMPLETE = REGEX_SOMETHING + REGEX_WHATEVER + ...

Regular expressions can be easier to read, but they can also be too complicated. It depends on the format of data you want to match.
The Java RE implementation still has some quirks, with the effect that some quite simple expressions (like '((?:[^'\\]|\\.)*)') cause a stack overflow when matching longer strings. So make sure you test with real life data (and more extreme examples, too) - or use a regex engine with a different implementation (there are several ones, also as Java libraries).

Regular expression is very powerful in looking for patterns in the content. You can certainly avoid using regular expression and rely on the conditional statements, but you will soon notice that it takes many lines of code to accomplish the same task. Using too many nested conditional statements increases the cyclomatic complexity of your code, as a result, it becomes even more difficult to test because there are too many branches to test. Further, it also makes the code difficult to read and understand.
Granted, your colleague should have written testcases to test his regular expressions first.
There's no right or wrong answer here. If the task is simple, then there's no need to use regular expression. Otherwise, it is nice to sprinkle a little regular expressions here and there to make your code easy to read.

When would it be worth using RegEx in Java?

I'm writing a small app that reads some input and do something based on that input.
Currently I'm looking for a line that ends with, say, "magic", I would use String's endsWith method. It's pretty clear to whoever reads my code what's going on.
Another way to do it is create a Pattern and try to match a line that ends with "magic". This is also clear, but I personally think this is an overkill because the pattern I'm looking for is not complex at all.
When do you think it's worth using RegEx Java? If it's complexity, how would you personally define what's complex enough?
Also, are there times when using Patterns are actually faster than string manipulation?
EDIT: I'm using Java 6.

Basically: if there is a non-regex operation that does what you want in one step, always go for that.
This is not so much about performance, but about a) readability and b) compile-time-safety. Specialized non-regex versions are usually a lot easier to read than regex-versions. And a typo in one of these specialized methods will not compile, while a typo in a Regex will fail miserably at runtime.
Comparing Regex-based solutions to non-Regex-bases solutions
String s = "Magic_Carpet_Ride";
s.startsWith("Magic"); // non-regex
s.matches("Magic.*"); // regex
s.contains("Carpet"); // non-regex
s.matches(".*Carpet.*"); // regex
s.endsWith("Ride"); // non-regex
s.matches(".*Ride"); // regex
In all these cases it's a No-brainer: use the non-regex version.
But when things get a bit more complicated, it depends. I guess I'd still stick with non-regex in the following case, but many wouldn't:
// Test whether a string ends with "magic" in any case,
// followed by optional white space
s.toLowerCase().trim().endsWith("magic"); // non-regex, 3 calls
s.matches(".*(?i:magic)\\s*"); // regex, 1 call, but ugly
And in response to RegexesCanCertainlyBeEasierToReadThanMultipleFunctionCallsToDoTheSameThing:
I still think the non-regex version is more readable, but I would write it like this:
s.toLowerCase()
.trim()
.endsWith("magic");
Makes the whole difference, doesn't it?

You would use Regex when the normal manipulations on the String class are not enough to elegantly get what you need from the String.
A good indicator that this is the case is when you start splitting, then splitting those results, then splitting those results. The code is getting unwieldy. Two lines of Pattern/Regex code can clean this up, neatly wrapped in a method that is unit tested....

Anything that can be done with regex can also be hand-coded.
Use regex if:
Doing it manually is going to take more effort without much benefit.
You can easily come up with a regex for your task.
Don't use regex if:
It's very easy to do it otherwise, as in your example.
The string you're parsing does not lend itself to regex. (it is customary to link to this question)

I think you are best with using endsWith. Unless your requirements change, it's simpler and easier to understand. Might perform faster too.
If there was a bit more complexity, such as you wanted to match "magic", "majik', but not "Magic" or "Majik"; or you wanted to match "magic" followed by a space and then 1 word such as "... magic spoon" but not "...magic soup spoon", then I think RegEx would be a better way to go.

Any complex parsing where you are generating a lot of Objects would be better done with RegEx when you factor in both computing power, and brainpower it takes to generate the code for that purpose. If you have a RegEx guru handy, it's almost always worthwhile as the patterns can easily be tweaked to accommodate for business rule changes without major loop refactoring which would likely be needed if you used pure java to do some of the complex things RegEx does.

If your basic line ending is the same everytime, such as with "magic", then you are better of using endsWith.
However, if you have a line that has the same base, but can have multiple values, such as:
<string> <number> <string> <string> <number>
where the strings and numbers can be anything, you're better of using RegEx.
Your lines are always ending with a string, but you don't know what that string is.

If it's as simple as endsWith, startsWith or contains, then you should use these functions. If you are processing more "complex" strings and you want to extract information from these strings, then regexp/matchers can be used.
If you have something like "commandToRetrieve someNumericArgs someStringArgs someOptionalArgs" then regexp will ease your task a lot :)

I'd never use regexes in java if I have an easier way to do it, like in this case the endsWith method. Regexes in java are as ugly as they get, probably with the only exception of the match method on String.
Usually avoiding regexes makes your core more readable and easier for other programmers. The opposite is true, complex regexes might confuse even the most experience hackers out there.
As for performance concerns: just profile. Specially in java.

If you are familiar with how regexp works you will soon find that a lot of problems are easily solved by using regexp.
Personally I look to using java String operations if that is easy, but if you start splitting strings and doing substring on those again, I'd start thinking in regular expressions.
And again, if you use regular expressions, why stop at lines. By configuring your regexp you can easily read entire files in one regular expression (Pattern.DOTALL as parameter to the Pattern.compile and your regexp don't end in the newlines). I'd combine this with Apache Commons IOUtils.toString() methods and you got something very powerful to do quick stuff with.
I would even bring out a regular expression to parse some xml if needed. (For instance in a unit test, where I want to check that some elements are present in the xml).
For instance, from some unit test of mine:
Pattern pattern = Pattern.compile(
"<Monitor caption=\"(.+?)\".*?category=\"(.+?)\".*?>"
+ ".*?<Summary.*?>.+?</Summary>"
+ ".*?<Configuration.*?>(.+?)</Configuration>"
+ ".*?<CfgData.*?>(.+?)</CfgData>", Pattern.DOTALL);
which will match all segments in this xml and pick out some segments that I want to do some sub matching on.

I would suggest using a regular expression when you know the format of an input but you are not necessarily sure on the value (or possible value(s)) of the formatted input.
What I'm saying, if you have an input all ending with, in your case, "magic" then String.endsWith() works fine (seeing you know that your possible input value will end with "magic").
If you have a format e.g a RFC 5322 message format, one cannot clearly say that all email address can end with a .com, hence you can create a regular expression that conforms to the RFC 5322 standard for verification.
In a nutshell, if you know a format structure of your input data but don't know exactly what values (or possible values) you can receive, use regular expressions for validation.

There's a saying that goes:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (link).
For a simple test, I'd proceed exactly like you've done. If you find that it's getting more complicated, then I'd consider Regular Expressions only if there isn't another way.

String manipulation vs Regexps

We are often told that Regexps are slow and should be avoided whenever possible.
However, taking into account the overhead of doing some string manipulation oneself (not talking about algorithm mistakes - this is a different matter), especially in PHP or Perl (maybe Java) what is the limit, in which case can we consider string manipulation to be a better alternative? What regexps are particularly CPU greedy?
For instance, for the following, in C++, Java, PHP or Perl, what would you recommend
The regexps would probably be faster:
s/abc/def/g or a ... while((i=index("abc",$x)>=0) ...$y .= substr()... based solution?
s/(\d)+/N/g or a scanning algorithm
But what about
an email validation regexp?
s/((0|\w)+?[xy]*[^xy]){2,7}/u/g
wouldn't a handmade and specific algorithm be faster (while longer to write)?
edit
The point of the question is to determine what kind of regexp would better be rewritten specifically for a given problem via string manipulation?
edit2
A common implementation is Perl regexp. For instance in Perl - that requires to know how they are implemented - what kind of regexp is to be avoided, because the implementation will make the process lengthy and ineffective? It may not be a complex regexp...
edit July 2011 (based on comments)
I'm not saying all regexps are slow. Some particular regexps patterns are known to be slow, due to the particular processing their and due to their implementation.In recent Perl / PHP implementations for instance, what is known to be rather slow - and should be avoided?The answer is expected from people who did already their own research (profiler...) and who are able to provide a kind of general guidelines about what is recommended/to be avoided.

Who said regexes were slow? At least in Perl they tend to be the preferred method of manipulating strings.
Regexes are bad at some things like email validation because the subject is too complex, not because they are slow. A proper email validation regex is over 6,000 characters long, and it doesn't even handle all of the cases (you have to strip out comments first).
At least in Perl 5, if it has a grammar it probably shouldn't be parsed with one regex.
You should also rewrite a regex as a custom function if the regex has grown to the point it can no longer be easily maintained (see the previous email validation example) or profiling shows that the regex is the slow component of your code.
You seem to be concerned with the speed of the regex vs the custom algorithm, but that is not a valid concern until you prove that it is with a profiler. Write the code in the most maintainable way. If a regex is clear, then use a regex. If a custom algorithm is clear, then use a custom algorithm. If you find that either is eating up a lot of time after profiling your code, then start looking for alternatives.

A nice feature of manipulating text with regular expressions is that patterns are high-level and declarative. This leaves the implementation considerable room for optimization such as factoring out the longest common prefix or using Boyer-Moore for static strings. Concise notation makes for quicker reading by experts. I understand immediately what
if (s/^(.)//) {
...
}
is doing, and index($_, 0, 1) = "" looks noisy in comparison.
Rather than the lower bound, the important consideration for regular expressions is the upper bound. It's a powerful tool, so people believe it's capable of correctly extracting tokens from XML, email addresses, or C++ programs and don't realize that an even more powerful tool such as a parser is necessary.

Regular expressions will never be faster than a hand-made algorithm for a very specific purpose. Worse, in PHP they have to be compiled the first time they're used (a cache is used afterwards).
However, they are certainly more succinct. Moreover, writing custom algorithms is often slower than regexes because the regular expressions engine are usually implemented in a more low level language, with less overhead in calling functions, etc.
For instance, preg_replace('/a/', 'b', $string) will almost certainly be faster than looping in PHP through the string. But this is a constant penalty in execution time and sometimes regular expressions, due to backtracking, can have a much worse asymptotic behavior.
You are strongly encouraged to know how regular expressions are implemented so that you can know when you're writing inefficient ones.

Some regular expressions are extremely fast and the difference between the regex and a custom solution may be negligible (or not worth anyone's time to bother).
The cases where regular expressions are slow, however, is when excessive backtracking occurs. Regular expressions parse from left to right and have the potential to match text in more than one way. So if they reach a point where the engine realizes that the pattern isn't going to match your test string, then it may start over and try to match in another way. This repeated backtracking adds up and slows down the algorithm.
Often the regular expression can be rewritten to perform better. But the ultimate in performance would be to write your own optimized parser for the specific task. By writing your own parser you can for example parse from left to right while maintaining a memory (or state). If you use this technique in procedural code you can often achieve the effect you're looking for in one pass and without the slowness of backtracking.
I was faced with this decision earlier this year. In fact the task at hand was on the outer fringe of what was even possible with regular expressions. In the end I decided to write my own parser, an embedded pushdown automaton, which is incredibly efficient for what I was trying to do. The task, by the way, was to build something that can parse regular expressions and provide Intellisense-like code hinting for them.
It's somewhat ironic that I didn't use regular expressions to parse regular expressions, but you can read about the thought behind it all here...
http://blog.regexhero.net/2010/03/code-hinting-for-regular-expressions.html

what kind of regexp would better be rewritten specifically for a given problem via string manipulation?
Easy.
Determine if you ever need to rewrite anything.
(positive answer would be for about 1 per 10000 scripts, massive text parsing, resource critical)
Do profile possible solutions.
Use one suits you for a given problem
As for the rest 9999 cases do not waste your time with such a trifle problem and use whatever you like more.
Every time you ask yourself such a question, it is extremely useful to remind yourself that by default all your extra-optimized and super-fast code being parsed char by char on every user request. No brain-cracking regexps, no devious string manipulation, but just old good picking chars one by one.

Regexes aren't slow. But implementation can be slow, mostly because it is often interpreted and build again each time when they are used. But good regexp library allows you to use compiled versions. They are pretty fast.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.