I'm looking for how to create a regular expression, which is 100% equivalent to the "contains" method in the String class. Basically, I have thousands of phrases that I'm searching for, and from what I understand it is much better for performance reasons to compile the regular expression once and use it multiple times, vs calling "mystring.contains(testString)" over and over again on different "mystring" values, with the same testString values.
Edit: to expand on my question... I will have many thousands of "testString" values, and I don't want to have to convert those to a format that the regular expression mechanism understands. I just want to be able to directly pass in a phrase that users enter, and see if it is found in whatever value "mystring" happens to contain. "testString" will not change it's value ever, but there will be thousands of them so that is why I was thinking of creating the matcher object and re-using it over and over etc. (Obviously my regexp skills are not up to snuff)
You can use the LITERAL flag when compiling your pattern to tell the engine you're using a literal string, e.g.:
Pattern p = Pattern.compile(yourString, Pattern.LITERAL);
But are you really sure that doing that and then reusing the result is faster than just String#contains? Enough to make the complexity worth it?
Well you could use Pattern.quote to get a "piece of regular expression" for each input string. Do any of your terms contain line breaks? If so, that could at least make life slightly trickier, though far from impossible.
Anyway, you'd basically just join the quoted terms together as:
Pattern pattern = Pattern.compile("quoted1|quoted2|quoted3|...");
You might want to use Guava's Joiner to easily join the quoted strings together, although obviously it's not terribly hard to do manually.
However, I would try this and then test whether it's actually more efficient than just calling contains. Have you already got a benchmark which shows that contains is too slow?
Related
I have to validate a number falls within the range (0-255).
I can do this with Regular expression or using if statement.
RegEx:
\b([0-9]{1,2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\b
Or
If(number>-1 && number <=255)
I want to know which one is better to use to validate number range.
I use a simple rule:
If you can code without regexp and keep it simple - than do it without.
Regexps gives you a lot of power, but it can be tricky to master.
In your case - the "if" code will run faster and will have much better readability.
A lot of times - regexps can amount to something which is very complex to understand and maintain as requirements change.
You will probably use String.matches() for matching / checking. Which is very inefficient. It internally compiles the pattern, uses synchronization blah blah..
So , bottom line, avoid regexes wherever possible (Also, you will have to convert the number to a String and then use regex. What a waste of both space and time)
PS : Also note that mathematical operations are always handled more efficiently across platforms.
Number comparison is much efficient than String with regex comparison. By comparing number as a String is over complication.
Your regex will get you partial matches when used with the following data:
-123
+12
!12.
So its better to use string comparison to avoid unseen problems and to maintain a complex regex.See demo.
https://regex101.com/r/mS3tQ7/11
To keep it simple and easy to understand(w.r.t your problem) I would suggest to go for If statement. But for complex validations I would suggest using regex. The reason for this is also the same - to keep it simple and easy to understand. Why use 8-10 lines of if-then blocks when you can validate the same with concise 25-30 character regex pattern! And if you put that same pattern in a .config file, you can now change the behavior of your app without recompiling. It's less code doing more work in a flexible way.
Given a String containing a comma delimited list representing a proper noun & category/description pair, what are the pros & cons of using String.split() versus Pattern & Matcher approach to find a particular proper noun and extract the associated category/description pair?
The haystack String format will not change. It will always contain comma delimited data in the form of
PROPER_NOUN|CATEGORY/DESCRIPTION
Common variables for both approaches:
String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;
Using String.split():
for (String current : haystack.split(","))
if (current.contains(needle))
{
result=current.split("\\|")[1]);
break; // *edit* Not part of original code - added in response to comment from Pshemo
{
Using Pattern & Matcher:
Pattern pattern = pattern.compile("(" +needle+ "\|)(\w+/\w+)");
Matcher matches = pattern.matcher(haystack);
if (matches.find())
result=matches.group(2);
Both approaches provide the information I require.
I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project so this approach will require imports from java.util.regex
And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.
Thank you for your time!
Conclusion
I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.
For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:
.split()/.contains()/.split
304,212,973
Pattern/Matcher w/ Pattern.compile() invoked for each iteration
230,511,000
Pattern/Matcher w/Pattern.compile() invoked prior to iteration
111,545,646
In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.
Most string functions that use regular expressions (such as matches(), split(), replaceAll(), etc.) makes use of Matcher/Pattern directly. Thus it will create a Matcher object every time, causing inefficiency when used in a large loop.
Thus if you really want speed, you can use Matcher/Pattern directly and ideally only create a single Matcher object.
There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.
You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.
When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.
I would say that the split() version is much better here due to the following reasons:
The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
Regular expressions are more complex, and therefore the code becomes more error-prone.
I am doing string manipulations and I need more advanced functions than the original ones provided in Java.
For example, I'd like to return a substring between the (n-1)th and nth occurrence of a character in a string.
My question is, are there classes already written by users which perform this function, and many others for string manipulations? Or should I dig on stackoverflow for each particular function I need?
Check out the Apache Commons class StringUtils, it has plenty of interesting ways to work with Strings.
http://commons.apache.org/lang/api-2.3/index.html?org/apache/commons/lang/StringUtils.html
Have you looked at the regular expression API? That's usually your best bet for doing complex things with strings:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Along the lines of what you're looking to do, you can traverse the string against a pattern (in your case a single character) and match everything in the string up to but not including the next instance of the character as what is called a capture group.
It's been a while since I've written a regex, but if you were looking for the character A for instance, then I think you could use the regex A([^A]*) and keep matching that string. The stuff in the parenthesis is a capturing group, which I reference below. To match it, you'd use the matcher method on pattern:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#matcher%28java.lang.CharSequence%29
On the Matcher instance, you'd make sure that matches is true, and then keep calling find() and group(1) as needed, where group(1) would get you what is in between the parentheses. You could use a counter in your looping to make sure you get the n-1 instance of the letter.
Lastly, Pattern provides flags you can pass in to indicate things like case insensitivity, which you may need.
If I've made some mistakes here, then someone please correct me. Like I said, I don't write regexes every day, so I'm sure I'm a little bit off.
I'm writing a small app that reads some input and do something based on that input.
Currently I'm looking for a line that ends with, say, "magic", I would use String's endsWith method. It's pretty clear to whoever reads my code what's going on.
Another way to do it is create a Pattern and try to match a line that ends with "magic". This is also clear, but I personally think this is an overkill because the pattern I'm looking for is not complex at all.
When do you think it's worth using RegEx Java? If it's complexity, how would you personally define what's complex enough?
Also, are there times when using Patterns are actually faster than string manipulation?
EDIT: I'm using Java 6.
Basically: if there is a non-regex operation that does what you want in one step, always go for that.
This is not so much about performance, but about a) readability and b) compile-time-safety. Specialized non-regex versions are usually a lot easier to read than regex-versions. And a typo in one of these specialized methods will not compile, while a typo in a Regex will fail miserably at runtime.
Comparing Regex-based solutions to non-Regex-bases solutions
String s = "Magic_Carpet_Ride";
s.startsWith("Magic"); // non-regex
s.matches("Magic.*"); // regex
s.contains("Carpet"); // non-regex
s.matches(".*Carpet.*"); // regex
s.endsWith("Ride"); // non-regex
s.matches(".*Ride"); // regex
In all these cases it's a No-brainer: use the non-regex version.
But when things get a bit more complicated, it depends. I guess I'd still stick with non-regex in the following case, but many wouldn't:
// Test whether a string ends with "magic" in any case,
// followed by optional white space
s.toLowerCase().trim().endsWith("magic"); // non-regex, 3 calls
s.matches(".*(?i:magic)\\s*"); // regex, 1 call, but ugly
And in response to RegexesCanCertainlyBeEasierToReadThanMultipleFunctionCallsToDoTheSameThing:
I still think the non-regex version is more readable, but I would write it like this:
s.toLowerCase()
.trim()
.endsWith("magic");
Makes the whole difference, doesn't it?
You would use Regex when the normal manipulations on the String class are not enough to elegantly get what you need from the String.
A good indicator that this is the case is when you start splitting, then splitting those results, then splitting those results. The code is getting unwieldy. Two lines of Pattern/Regex code can clean this up, neatly wrapped in a method that is unit tested....
Anything that can be done with regex can also be hand-coded.
Use regex if:
Doing it manually is going to take more effort without much benefit.
You can easily come up with a regex for your task.
Don't use regex if:
It's very easy to do it otherwise, as in your example.
The string you're parsing does not lend itself to regex. (it is customary to link to this question)
I think you are best with using endsWith. Unless your requirements change, it's simpler and easier to understand. Might perform faster too.
If there was a bit more complexity, such as you wanted to match "magic", "majik', but not "Magic" or "Majik"; or you wanted to match "magic" followed by a space and then 1 word such as "... magic spoon" but not "...magic soup spoon", then I think RegEx would be a better way to go.
Any complex parsing where you are generating a lot of Objects would be better done with RegEx when you factor in both computing power, and brainpower it takes to generate the code for that purpose. If you have a RegEx guru handy, it's almost always worthwhile as the patterns can easily be tweaked to accommodate for business rule changes without major loop refactoring which would likely be needed if you used pure java to do some of the complex things RegEx does.
If your basic line ending is the same everytime, such as with "magic", then you are better of using endsWith.
However, if you have a line that has the same base, but can have multiple values, such as:
<string> <number> <string> <string> <number>
where the strings and numbers can be anything, you're better of using RegEx.
Your lines are always ending with a string, but you don't know what that string is.
If it's as simple as endsWith, startsWith or contains, then you should use these functions. If you are processing more "complex" strings and you want to extract information from these strings, then regexp/matchers can be used.
If you have something like "commandToRetrieve someNumericArgs someStringArgs someOptionalArgs" then regexp will ease your task a lot :)
I'd never use regexes in java if I have an easier way to do it, like in this case the endsWith method. Regexes in java are as ugly as they get, probably with the only exception of the match method on String.
Usually avoiding regexes makes your core more readable and easier for other programmers. The opposite is true, complex regexes might confuse even the most experience hackers out there.
As for performance concerns: just profile. Specially in java.
If you are familiar with how regexp works you will soon find that a lot of problems are easily solved by using regexp.
Personally I look to using java String operations if that is easy, but if you start splitting strings and doing substring on those again, I'd start thinking in regular expressions.
And again, if you use regular expressions, why stop at lines. By configuring your regexp you can easily read entire files in one regular expression (Pattern.DOTALL as parameter to the Pattern.compile and your regexp don't end in the newlines). I'd combine this with Apache Commons IOUtils.toString() methods and you got something very powerful to do quick stuff with.
I would even bring out a regular expression to parse some xml if needed. (For instance in a unit test, where I want to check that some elements are present in the xml).
For instance, from some unit test of mine:
Pattern pattern = Pattern.compile(
"<Monitor caption=\"(.+?)\".*?category=\"(.+?)\".*?>"
+ ".*?<Summary.*?>.+?</Summary>"
+ ".*?<Configuration.*?>(.+?)</Configuration>"
+ ".*?<CfgData.*?>(.+?)</CfgData>", Pattern.DOTALL);
which will match all segments in this xml and pick out some segments that I want to do some sub matching on.
I would suggest using a regular expression when you know the format of an input but you are not necessarily sure on the value (or possible value(s)) of the formatted input.
What I'm saying, if you have an input all ending with, in your case, "magic" then String.endsWith() works fine (seeing you know that your possible input value will end with "magic").
If you have a format e.g a RFC 5322 message format, one cannot clearly say that all email address can end with a .com, hence you can create a regular expression that conforms to the RFC 5322 standard for verification.
In a nutshell, if you know a format structure of your input data but don't know exactly what values (or possible values) you can receive, use regular expressions for validation.
There's a saying that goes:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (link).
For a simple test, I'd proceed exactly like you've done. If you find that it's getting more complicated, then I'd consider Regular Expressions only if there isn't another way.
I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?
While above given decision for .contains() is trivial (why bother with regex if you can do this in a single call), real life brings more complex choices to make. For example, is it better to do two .contains() calls or a single regex?
My rule of thumb was to always use regex, unless this can be replaced with a single API call. This prevents code against bloating, but is probably not so good from code readability point of view, especially if regex tends to get big.
Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") does this regex require? Would it be faster than sheer iteration? Somehow everybody assumes that once regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if regex cannot be pre-compiled in advance.
RegexBuddy has a built-in regular expressions debugger. It shows how many steps the regular expression engine needed to find a match or to fail to find a match. By using the debugger on strings of different lengths, you can get an idea of the complexity (big O) of the regular expression. If you look up "benchmark" in the index of RegexBuddy's help file you'll get some more tips on how to interpret this.
When judging the performance of a regular expression, it is particularly important to test situations where the regex fails to find a match. It is very easy to write a regular expression that finds its matches in linear time, but fails in exponential time in a situation that I call catastrophic backtracking.
To use your 5 if statements as an example, the regex one|two|three|four|five scans the input string once, doing a little bit of extra work when an o, t, or f is encountered. But 5 if statements checking if the string contains a word will search the entire string 5 times if none of the words can be found. If five occurs at the start of the string, then the regex finds the match instantly, while the first 4 if statements scan the whole string in vain before the 5th if statement finds the match.
It's hard to estimate performance without using a profiler, generally the best strategy is to write what makes the most logical sense and is easier to understand/read. If two .contains() calls are easier to logically understand then that's the better route, the same logic applies if a regex makes more sense.
It's also important to consider that other developers on your team may not have a great understanding of regex. If at a later time in production the use of regex over .contains() (or vice versa) is identified as a bottleneck, try and profile both.
Rule of thumb: Write code to be readable, use a profiler to identify bottlenecks and only then replace the readable code with faster code.
I would strongly suggest that you write the code for both and time it. It's pretty simple to do this and you'll get an answers that is not a generic "rule of thumb" but instead a very specific answer that holds for your problem domain.
Vance Morrison has an excellent post about micro benchmarking, and has a tool that makes it really simple for you to answer questions like this...
http://msdn.microsoft.com/en-us/magazine/cc500596.aspx
If you want my personal "rule of thumb" then it's that RegEx is often slower for this sort of thing, but you should ignore me and measure it yourself :-)
If, for non-performance reasons, you continue to use Regular Expressions then I can really recommend two things. Get a profiler (such as ANTS) and see what your code does in production. Then, get a copy of the Regular Expression Cookbook...
http://www.amazon.co.uk/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ref=sr_1_1?ie=UTF8&s=books&qid=1259147763&sr=8-1
... as it has loads of tips on speeding up RegEx code. I've optimized RegEx code by a factor of 10 following tips from this book.
The answer (as usual) is that it depends.
In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):
for( i = 0; i < stringLength; i++ ) {
if( stringAt pos i starts with "this" )
found!
if( stringAt pos i starts with "that" )
found!
}
There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.
On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.
To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.
And then you won't be so worried about a few cycles here and there.