Does the Java world have a counterpart to Perl's Regexp::Optimizer? [duplicate] - java

I wrote a Java program that generates a sequence of symbols, like "abcdbcdefbcdbcdefg". What I need is a regex optimizer that can produce "a((bcd){2}ef){2}g" from it.
As the input may contain Unicode escapes, like "a\u0063\u0063\bbd", I prefer a Java version.
The reason I want a "shorter" expression is to save space/memory. The sequence of symbols could be very long.
In general, finding the "shortest" optimized regex is hard, so I don't need a tool that guarantees the shortest result.

I've got a nasty feeling that the problem of creating the shortest regex that matches a given input string or set of strings is going to be computationally "difficult". (There are parallels with the problem of computing Kolmogorov Complexity ...)
It is also worth noting that the optimal regex for abcdbcdefbcdbcdefg in terms of matching speed is likely to be abcdbcdefbcdbcdefg. Adding repeating groups may make the regex string shorter, but it won't make the regex faster. In fact, it is likely to be slower unless the regex engine unrolls the repeating groups.
The reason that I need this is due to the space/memory limits.
Do you have clear evidence that you need to do this?
I suspect that you won't save a worthwhile amount of space by doing this ... unless the input strings are really long. (And if they are long, then you'll get better results using a regular text compression algorithm to compress the strings.)
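To make the compression suggestion concrete, here is a minimal sketch using the JDK's java.util.zip classes. The class and method names (SymbolCompression, compress, decompress) are made up for illustration, and DEFLATE is just one possible choice of algorithm, not something the question prescribes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: stores a long symbol sequence as DEFLATE-compressed bytes.
final class SymbolCompression {

    // Compress the UTF-8 bytes of the sequence.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restore the original sequence; rejects truncated or corrupt input.
    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Unlike a hand-built regex, the compressed form always decodes back to exactly the original string, and highly repetitive sequences tend to shrink a lot.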

Regular expressions are not a substitute for compression
Don't use a regular expression to represent a string constant. Regular expressions are designed to be used to match one of many strings. That's not what you're doing.

I assume you are trying to find a small regex to encode a finite set of input strings. If so, you haven't chosen the best possible subject line.
I can't give you an existing program, but I can tell you how to approach writing one.
There is no canonical minimum regex form, and determining the true minimum-size regex is NP-hard. Your sets are finite, so this may be a simpler problem. I'll have to think about it.
But a good heuristic algorithm would be:
1. Construct a trivial non-deterministic finite automaton (NFA) that accepts all your strings.
2. Convert the NFA to a deterministic finite automaton (DFA) with the subset construction.
3. Minimize the DFA with the standard algorithm.
4. Use the construction from the proof of Kleene's theorem to get to a regex.
Note that step 3 does give you a unique minimum DFA. That would probably be the best way to encode your string sets.

Related

RegEx Vs If statement to validate number range

I have to validate a number falls within the range (0-255).
I can do this with Regular expression or using if statement.
RegEx:
\b([0-9]{1,2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\b
Or
if (number >= 0 && number <= 255)
I want to know which one is better to use to validate number range.
I use a simple rule:
If you can code it without a regexp and keep it simple, then do it without.
Regexps give you a lot of power, but they can be tricky to master.
In your case - the "if" code will run faster and will have much better readability.
A lot of times - regexps can amount to something which is very complex to understand and maintain as requirements change.
You will probably use String.matches() for the check, which is very inefficient: it compiles the pattern internally on every call, among other overhead.
So, bottom line, avoid regexes wherever possible. (You would also have to convert the number to a String before applying the regex, which wastes both space and time.)
PS: Also note that arithmetic comparisons are handled efficiently on every platform.
Comparing numbers is much more efficient than comparing strings with a regex. Treating a number as a String is an overcomplication.
Your regex will get you partial matches when used with the following data:
-123
+12
!12.
So it's better to use a simple comparison, which avoids unseen problems and the need to maintain a complex regex. See the demo:
https://regex101.com/r/mS3tQ7/11
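The partial-match problem can be reproduced in Java with a small sketch. The regex is the one from the question; the class and method names are made up for illustration. Note that find() reports a hit when any part of the input matches, while matches() requires the whole input to match.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing the question's regex with a numeric check.
final class RangeCheck {
    // The pattern from the question, compiled once.
    private static final Pattern RANGE =
            Pattern.compile("\\b([0-9]{1,2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\b");

    // True if any part of the input matches the pattern.
    static boolean regexFind(String text) {
        return RANGE.matcher(text).find();
    }

    // True only if the entire input matches the pattern.
    static boolean regexMatches(String text) {
        return RANGE.matcher(text).matches();
    }

    // Parse the text as an int and compare; non-numeric input is rejected.
    static boolean numericCheck(String text) {
        try {
            int n = Integer.parseInt(text);
            return n >= 0 && n <= 255;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

With find(), inputs such as -123 and !12. are accepted because the digits inside them match. One caveat of the numeric version: Integer.parseInt accepts a leading plus sign, so +12 parses as 12.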
To keep it simple and easy to understand(w.r.t your problem) I would suggest to go for If statement. But for complex validations I would suggest using regex. The reason for this is also the same - to keep it simple and easy to understand. Why use 8-10 lines of if-then blocks when you can validate the same with concise 25-30 character regex pattern! And if you put that same pattern in a .config file, you can now change the behavior of your app without recompiling. It's less code doing more work in a flexible way.

Java : Matcher.find using high cpu utilization

I am using the ModSecurity rules from https://github.com/SpiderLabs/owasp-modsecurity-crs to sanitize user input. I am seeing CPU spikes and delays when matching user input against the ModSecurity regular expressions. The rule set contains 500+ regular expressions that check for different types of attacks (XSS, bad robots, generic and SQL). For each request, I go through all parameters and check them against all 500 regular expressions, using Matcher.find. Some parameters caused effectively infinite loops, which I tackled using the technique below.
Cancelling a long running regex match?
Sanitizing a user request takes around 500 ms, and CPU usage shoots up. I analyzed this using visualvm.java.net with my test suite runner.
Cpu Profile Output
How can I reduce the CPU usage and the load average?
If possible, compile your regexes once and keep them around, rather than repeatedly (implicitly) compiling (especially inside a loop).
See java.util.regex - importance of Pattern.compile()? for more info.
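As a rough sketch of what "compile once and keep them around" could look like: the RuleSet class below is hypothetical, and the rule strings in the usage example are illustrative placeholders, not actual entries from the ModSecurity rule set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for precompiled rules; build it once and reuse it for every request.
final class RuleSet {
    private final List<Pattern> rules = new ArrayList<>();

    // Each rule source is compiled exactly once, when the rule set is built.
    RuleSet(List<String> sources) {
        for (String source : sources) {
            rules.add(Pattern.compile(source));
        }
    }

    // Returns true if any precompiled rule finds a match in the value.
    boolean isSuspicious(String value) {
        for (Pattern rule : rules) {
            if (rule.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, new RuleSet(Arrays.asList("(?i)<script", "(?i)union\\s+select")) could be stored in a static final field. Pattern instances are immutable and safe to share between threads; only the Matcher objects are per-use.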
I suggest you look at this paper:
"Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort"
There are better ways to do the matching you describe. Essentially you take the 500 patterns you want to match and compile them into a single suffix tree, which can very efficiently match an input against all the rules at once.
The paper explains that this approach was described as "Boyer-Moore Approach to Exact Set Matching" by Dan Gusfield.
Boyer-Moore is a well known algorithm for String matching. The paper describes a variation on Boyer-Moore for Set Matching.
I think this is the root of your problem, not the regex performance per-se:
For each request , I go through all parameters and check against all these 500 regular expressions
No matter how fast your regex will be, this is still plenty of work. I don't know how many parameters you have, but even if there are only a few, that's still checking thousands of regular expressions per request. That can kill your CPU.
Apart from the obvious things like improving your regex performance by precompiling and/or simplifying them, you can do the following things to reduce the amount of regex checking:
Use positive-validation of user input based on the parameter type. E.g. if some parameter must be a simple number, don't waste time checking if it contains malicious XML script. Just check whether it matches [0-9]+ (or something similarly simple). If it does, it is ok - skip checking all the 500 regexps.
Try to find simple regexps that could eliminate the whole classes of attacks - find common things in your regexps. If e.g. you've got 100 regexps checking for existence of certain HTML tags, check if the content contains at least one HTML tag first. If it doesn't, you immediately save on checking 100 regexps.
Cache results. Many parameters generated in webapps repeat themselves. Don't check the same content over and over again; just remember the final validation result. Be sure to limit the maximum size of the cache to avoid DoS attacks.
Also note that negative-validation is usually easy to bypass. Someone just changes a few chars in their malicious code and your regexps won't match. You'll have to grow your "database" of regexps in order to protect against new attacks. Positive validation (whitelisting) doesn't have this disadvantage and is much more effective.
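The positive-validation and caching ideas above can be sketched as follows. Everything here is an assumption made for illustration: the ParameterScreen class, the numericParameter flag, the cache bound of 10,000 entries, and the looksMalicious predicate that stands in for the expensive scan over all 500 rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical screen that tries cheap checks before the expensive rule scan.
final class ParameterScreen {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");
    private static final int MAX_ENTRIES = 10_000; // assumed bound, tune as needed

    private final Predicate<String> looksMalicious; // stand-in for the full rule scan

    // Access-ordered LinkedHashMap that evicts the least recently used entry,
    // so an attacker cannot grow the cache without limit.
    private final Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    ParameterScreen(Predicate<String> looksMalicious) {
        this.looksMalicious = looksMalicious;
    }

    // Returns true if the value is considered safe.
    synchronized boolean isSafe(String value, boolean numericParameter) {
        if (numericParameter) {
            // Positive validation: a numeric parameter is either all digits or rejected.
            return DIGITS.matcher(value).matches();
        }
        Boolean cached = cache.get(value);
        if (cached != null) {
            return cached;
        }
        boolean safe = !looksMalicious.test(value);
        cache.put(value, safe);
        return safe;
    }
}
```

The method is synchronized only to keep the sketch short; a real service would probably want a concurrent cache instead.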
Avoid expressions that use:
Multiline mode
Case-insensitive matching
etc.
You could also consider grouping the regular expressions and applying only a given group, depending on the user input.
If you have such a big number of regex, you could group (at least some of) them using a trie algorithm (http://en.wikipedia.org/wiki/Trie).
The idea is that if you have for example regexes like /abc[0-9-]/, /abde/, /another example/, /.something else/ and /.I run out of ideas/, you can combine them into the single regex
/a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)/
This way the matcher has to run only once instead of five times, and you avoid a lot of backtracking because the common prefixes are factored out in the combined regex.
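A quick way to gain confidence in such a merge is to check that the combined pattern agrees with the separate ones on sample inputs. The sketch below uses the five regexes and the combined regex from the text; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical comparison of five separate rules with their hand-merged equivalent.
final class CombinedRules {
    // The five separate rules from the text.
    static final List<Pattern> SEPARATE = Arrays.asList(
            Pattern.compile("abc[0-9-]"),
            Pattern.compile("abde"),
            Pattern.compile("another example"),
            Pattern.compile(".something else"),
            Pattern.compile(".I run out of ideas"));

    // The merged rule with shared prefixes factored out.
    static final Pattern COMBINED = Pattern.compile(
            "a(?:b(?:c[0-9-]|de)|nother example)|.(?:I run out of ideas|something else)");

    // Runs up to five separate searches over the input.
    static boolean anySeparate(String input) {
        for (Pattern p : SEPARATE) {
            if (p.matcher(input).find()) {
                return true;
            }
        }
        return false;
    }

    // Runs a single search over the input.
    static boolean combined(String input) {
        return COMBINED.matcher(input).find();
    }
}
```

Agreement on a handful of samples is not a proof of equivalence, so for a real rule set it would be safer to generate the merged regex mechanically than to merge by hand.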
There must be a subset of problematic regexes among these 500. For example, a regex like this:
String s = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB";
Pattern.compile("(A+)+").matcher(s).matches();
will take years to complete.
So in your case I would log all the problematic regexes together with their problematic inputs. Once you have found them, you can manually rewrite those few regexes and test them against the originals. A regex can always be replaced by plain Java code, which is often simpler and more readable.
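One possible way to find the problematic pairs is to give every match a time budget and log whatever exceeds it. The sketch below wraps the input in a CharSequence that checks a deadline on each character access; all class names and the budget value are assumptions, and how quickly a given pattern blows up depends on the JDK version.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper: every charAt() call checks whether the deadline has passed.
final class DeadlineCharSequence implements CharSequence {
    static final class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException() {
            super("regex match exceeded its time budget");
        }
    }

    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index) {
        // The regex engine reads characters through this method, so a runaway
        // match hits this check again and again until the deadline passes.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new MatchTimeoutException();
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

final class TimedMatch {
    enum Result { MATCH, NO_MATCH, TIMED_OUT }

    // Runs matches() with a time budget; TIMED_OUT marks a pair worth logging.
    static Result matchesWithin(Pattern pattern, String input, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        try {
            boolean ok = pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
            return ok ? Result.MATCH : Result.NO_MATCH;
        } catch (DeadlineCharSequence.MatchTimeoutException e) {
            // A real system would log pattern.pattern() and the input here.
            return Result.TIMED_OUT;
        }
    }
}
```

Calling TimedMatch.matchesWithin(Pattern.compile("(A+)+"), s, 100) with the string from the example above would then either finish or report TIMED_OUT after roughly the budget instead of hanging the request thread.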
Another option, though it would not resolve the problem above, is to use a faster (x20 in some cases) but more limited regex library. It is available in Maven Central.

Performance overhead/improvement using regular expressions

If I need to check if for example a word A or word B exists in a text (String), is there a performance difference if I do:
if(text.contains(wordA) || text.contains(wordB))
to using some regular expression that searches the string?
Does it depend on the regular expression format?
Or is it just a matter of taste?
UPDATE:
If text.contains(wordA) is false then the text.contains(wordB) will be evaluated.
This means that contains will be called twice.
I was thinking if in performance terms a regex might be better than calling contains twice.
The code you have expresses your intent clearly, is more readable than a regexp, and is also probably faster.
Anyway, there is a very low probability that this part of your code causes any significant performance problem. So I wouldn't worry about performance here, but about readability and maintainability.
While the performance of a regular expression is lower, it has more expressive power, and often this is more important. For example:
"performance".contains("form") // is true
This may not be what you intended by a "word". Instead you can use a pattern:
"\\bform\\b"
This will only match "form" as a complete word, including when it is at the start or the end of the string.
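A small sketch of the difference, using the word "form" from the example; the class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting substring search with whole-word search.
final class WholeWord {
    // \b matches at a boundary between a word character and a non-word character
    // (or the start or end of the input).
    private static final Pattern FORM = Pattern.compile("\\bform\\b");

    // True if "form" occurs anywhere, even inside a longer word.
    static boolean containsSubstring(String text) {
        return text.contains("form");
    }

    // True only if "form" occurs as a word of its own.
    static boolean containsWord(String text) {
        return FORM.matcher(text).find();
    }
}
```

Here containsSubstring("performance") is true while containsWord("performance") is false, and both are true for "fill in the form".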
Yes, there is a difference. contains() does various array manipulation to find the words, while a regex uses different logic, so the performance will differ; it will even change depending on how you use the regular expression matching.
Will it be significant? That's hard to tell. But the best thing to realise is this:
First write your code, and don't worry about performance until you run into problems and profiling clearly indicates that this test is the issue.
I would just use the contains method. But that's an opinion without actually testing anything.
With this trivial example you shouldn't see much of a performance difference, but purely from the algorithms involved the regular expression
wordA|wordB
would indeed be faster, as it just makes a single pass through the string and employs a finite automaton to match one of the two substrings. However, this is offset by building the finite automaton first, which should be pretty much linear in the length of the regex in this case. You can compile the regex first to have that cost only once as long as the compiled object lives.
So essentially cost comes down to:
linear search through the string twice (2 · string length)
or linear search through the string once and building the DFA (string length + regex length)
If your text is very large and the substrings are very small, this could be worthwhile.
Still, you're optimising the wrong place, most likely. Use a profiler to find the actual bottlenecks in your code and optimise those; don't ever worry about such trivial “optimisations” unless you can prove them to make an impact.
One final thing to consider, though: With a regex you could make sure you're actually matching words (or things that look like words) instead of word parts, which might be an actual reason to consider regex instead of contains.
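For comparison, here is a minimal sketch of both variants. The names are made up, and Pattern.quote is an addition not mentioned above: it keeps characters such as . or + in the words from being interpreted as regex syntax.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing both ways to test for either of two words.
final class EitherWord {
    // Two separate linear searches; the second runs only if the first fails.
    static boolean viaContains(String text, String wordA, String wordB) {
        return text.contains(wordA) || text.contains(wordB);
    }

    // Build the alternation once and reuse it, so the compile cost is paid a single time.
    static Pattern compileEither(String wordA, String wordB) {
        return Pattern.compile(Pattern.quote(wordA) + "|" + Pattern.quote(wordB));
    }

    // One search over the text with the precompiled alternation.
    static boolean viaRegex(Pattern either, String text) {
        return either.matcher(text).find();
    }
}
```

Which variant is faster for a given text is best settled by measuring, as the answer says.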
In my opinion it's a matter of taste. Avoid premature optimization; see Practical rules for premature optimization.
As a general rule, if you are looking for words or substrings and not patterns, then don't use regular expressions.
There will be only a minor performance difference between such a simple regex and a plain text search, so if you do this search only once in a while it's not a performance issue. If you do it thousands of times or more in a loop and you have performance problems, then write a benchmark.

When would it be worth using RegEx in Java?

I'm writing a small app that reads some input and does something based on that input.
Currently I'm looking for a line that ends with, say, "magic", so I would use String's endsWith method. It's pretty clear to whoever reads my code what's going on.
Another way to do it is to create a Pattern and try to match a line that ends with "magic". This is also clear, but I personally think it is overkill because the pattern I'm looking for is not complex at all.
When do you think it's worth using RegEx in Java? If it's a matter of complexity, how would you personally define what's complex enough?
Also, are there times when using Patterns is actually faster than string manipulation?
EDIT: I'm using Java 6.
Basically: if there is a non-regex operation that does what you want in one step, always go for that.
This is not so much about performance, but about a) readability and b) compile-time-safety. Specialized non-regex versions are usually a lot easier to read than regex-versions. And a typo in one of these specialized methods will not compile, while a typo in a Regex will fail miserably at runtime.
Comparing regex-based solutions to non-regex-based solutions:
String s = "Magic_Carpet_Ride";
s.startsWith("Magic"); // non-regex
s.matches("Magic.*"); // regex
s.contains("Carpet"); // non-regex
s.matches(".*Carpet.*"); // regex
s.endsWith("Ride"); // non-regex
s.matches(".*Ride"); // regex
In all these cases it's a No-brainer: use the non-regex version.
But when things get a bit more complicated, it depends. I guess I'd still stick with non-regex in the following case, but many wouldn't:
// Test whether a string ends with "magic" in any case,
// followed by optional white space
s.toLowerCase().trim().endsWith("magic"); // non-regex, 3 calls
s.matches(".*(?i:magic)\\s*"); // regex, 1 call, but ugly
And in response to RegexesCanCertainlyBeEasierToReadThanMultipleFunctionCallsToDoTheSameThing:
I still think the non-regex version is more readable, but I would write it like this:
s.toLowerCase()
.trim()
.endsWith("magic");
Makes the whole difference, doesn't it?
You would use Regex when the normal manipulations on the String class are not enough to elegantly get what you need from the String.
A good indicator that this is the case is when you start splitting, then splitting those results, then splitting those results. The code is getting unwieldy. Two lines of Pattern/Regex code can clean this up, neatly wrapped in a method that is unit tested....
Anything that can be done with regex can also be hand-coded.
Use regex if:
Doing it manually is going to take more effort without much benefit.
You can easily come up with a regex for your task.
Don't use regex if:
It's very easy to do it otherwise, as in your example.
The string you're parsing does not lend itself to regex. (it is customary to link to this question)
I think you are best with using endsWith. Unless your requirements change, it's simpler and easier to understand. Might perform faster too.
If there were a bit more complexity, such as wanting to match "magic" and "majik" but not "Magic" or "Majik", or wanting to match "magic" followed by a space and then exactly one word, such as "... magic spoon" but not "...magic soup spoon", then I think RegEx would be a better way to go.
Any complex parsing where you are generating a lot of Objects would be better done with RegEx when you factor in both computing power, and brainpower it takes to generate the code for that purpose. If you have a RegEx guru handy, it's almost always worthwhile as the patterns can easily be tweaked to accommodate for business rule changes without major loop refactoring which would likely be needed if you used pure java to do some of the complex things RegEx does.
If your basic line ending is the same every time, such as with "magic", then you are better off using endsWith.
However, if you have a line that has the same structure but can contain varying values, such as:
<string> <number> <string> <string> <number>
where the strings and numbers can be anything, you're better off using RegEx.
Your lines are always ending with a string, but you don't know what that string is.
If it's as simple as endsWith, startsWith or contains, then you should use these functions. If you are processing more "complex" strings and you want to extract information from these strings, then regexp/matchers can be used.
If you have something like "commandToRetrieve someNumericArgs someStringArgs someOptionalArgs" then regexp will ease your task a lot :)
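As an illustration of that kind of extraction, here is a minimal sketch with capturing groups. The line format (a command word, a number, a word, and an optional trailing word) is an assumption modeled loosely on the example above, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for lines like: <command> <number> <word> [<optional word>]
final class CommandLineParser {
    private static final Pattern COMMAND = Pattern.compile(
            "(\\w+)\\s+(\\d+)\\s+(\\w+)(?:\\s+(\\w+))?");

    // Returns {command, number, word, optional word or null},
    // or null if the line does not have the expected shape.
    static String[] parse(String line) {
        Matcher m = COMMAND.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // group(4) is null when the optional part is absent.
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```

Doing the same with split() and index checks is possible, but the validation and the extraction then live in separate pieces of code.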
I'd never use regexes in Java if I have an easier way to do it, like the endsWith method in this case. Regexes in Java are as ugly as they get, with the probable exception of the matches method on String.
Usually, avoiding regexes makes your code more readable and easier for other programmers. The opposite is also true: complex regexes might confuse even the most experienced hackers out there.
As for performance concerns: just profile. Especially in Java.
If you are familiar with how regexp works you will soon find that a lot of problems are easily solved by using regexp.
Personally I look to using java String operations if that is easy, but if you start splitting strings and doing substring on those again, I'd start thinking in regular expressions.
And again, if you use regular expressions, why stop at lines? By configuring your regexp you can easily read entire files with one regular expression (pass Pattern.DOTALL to Pattern.compile so that . also matches newlines). I'd combine this with the Apache Commons IOUtils.toString() methods, and you have something very powerful for doing quick stuff.
I would even bring out a regular expression to parse some xml if needed. (For instance in a unit test, where I want to check that some elements are present in the xml).
For instance, from some unit test of mine:
Pattern pattern = Pattern.compile(
"<Monitor caption=\"(.+?)\".*?category=\"(.+?)\".*?>"
+ ".*?<Summary.*?>.+?</Summary>"
+ ".*?<Configuration.*?>(.+?)</Configuration>"
+ ".*?<CfgData.*?>(.+?)</CfgData>", Pattern.DOTALL);
which will match all segments in this xml and pick out some segments that I want to do some sub matching on.
I would suggest using a regular expression when you know the format of an input but are not sure of its value (or possible values).
That is, if all your input ends with, in your case, "magic", then String.endsWith() works fine (since you know that every possible input value will end with "magic").
If you have a format such as the RFC 5322 message format, you cannot say that every email address ends with .com, so you can create a regular expression that conforms to the RFC 5322 standard for verification.
In a nutshell, if you know the format of your input data but don't know exactly what values (or possible values) you can receive, use regular expressions for validation.
There's a saying that goes:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
For a simple test, I'd proceed exactly like you've done. If you find that it's getting more complicated, then I'd consider Regular Expressions only if there isn't another way.

Best practices for regex performance VS sheer iteration

I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?
While the decision above for .contains() is trivial (why bother with a regex if you can do this in a single call), real life brings more complex choices. For example, is it better to do two .contains() calls or a single regex?
My rule of thumb was to always use a regex unless it can be replaced with a single API call. This keeps the code from bloating, but is probably not so good for readability, especially if the regex tends to get big.
Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") a regex requires? Would it be faster than sheer iteration? Somehow everybody assumes that once a regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if the regex cannot be compiled in advance.
RegexBuddy has a built-in regular expressions debugger. It shows how many steps the regular expression engine needed to find a match or to fail to find a match. By using the debugger on strings of different lengths, you can get an idea of the complexity (big O) of the regular expression. If you look up "benchmark" in the index of RegexBuddy's help file you'll get some more tips on how to interpret this.
When judging the performance of a regular expression, it is particularly important to test situations where the regex fails to find a match. It is very easy to write a regular expression that finds its matches in linear time, but fails in exponential time in a situation that I call catastrophic backtracking.
To use your 5 if statements as an example, the regex one|two|three|four|five scans the input string once, doing a little bit of extra work when an o, t, or f is encountered. But 5 if statements checking if the string contains a word will search the entire string 5 times if none of the words can be found. If five occurs at the start of the string, then the regex finds the match instantly, while the first 4 if statements scan the whole string in vain before the 5th if statement finds the match.
It's hard to estimate performance without using a profiler. Generally the best strategy is to write what makes the most logical sense and is easiest to understand and read. If two .contains() calls are easier to understand, then that's the better route; the same applies if a regex makes more sense.
It's also important to consider that other developers on your team may not have a great understanding of regex. If at a later time in production the use of regex over .contains() (or vice versa) is identified as a bottleneck, try and profile both.
Rule of thumb: Write code to be readable, use a profiler to identify bottlenecks and only then replace the readable code with faster code.
I would strongly suggest that you write the code for both and time it. It's pretty simple to do, and you'll get an answer that is not a generic "rule of thumb" but a very specific one that holds for your problem domain.
Vance Morrison has an excellent post about micro benchmarking, and has a tool that makes it really simple for you to answer questions like this...
http://msdn.microsoft.com/en-us/magazine/cc500596.aspx
If you want my personal "rule of thumb" then it's that RegEx is often slower for this sort of thing, but you should ignore me and measure it yourself :-)
If, for non-performance reasons, you continue to use Regular Expressions then I can really recommend two things. Get a profiler (such as ANTS) and see what your code does in production. Then, get a copy of the Regular Expression Cookbook...
http://www.amazon.co.uk/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ref=sr_1_1?ie=UTF8&s=books&qid=1259147763&sr=8-1
... as it has loads of tips on speeding up RegEx code. I've optimized RegEx code by a factor of 10 following tips from this book.
The answer (as usual) is that it depends.
In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):
for( i = 0; i < stringLength; i++ ) {
    if( stringAt pos i starts with "this" )
        found!
    if( stringAt pos i starts with "that" )
        found!
}
There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.
On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.
To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.
And then you won't be so worried about a few cycles here and there.
