String replaceAll() vs. Matcher replaceAll() (Performance differences) - java

Are there known differences between String.replaceAll() and Matcher.replaceAll() (on a Matcher object created from a java.util.regex.Pattern) in terms of performance?
Also, what are the high-level API differences between the two? (Immutability, handling nulls, handling empty strings, etc.)

According to the documentation for String.replaceAll, it has the following to say about calling the method:
An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression
Pattern.compile(regex).matcher(str).replaceAll(repl)
Therefore, invoking String.replaceAll and explicitly creating a Pattern and Matcher can be expected to perform the same.
Edit
As has been pointed out in the comments, this holds only for a single call to replaceAll. If you need to call replaceAll many times with the same regex, it should pay to hold onto a compiled Pattern, so that the relatively expensive compilation of the regular expression is not repeated on every call.

Source code of String.replaceAll():
public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
It has to compile the pattern first - if you're going to run it many times with the same pattern on short strings, performance will be much better if you reuse one compiled Pattern.

The main difference is that if you hold onto the Pattern used to produce the Matcher, you can avoid recompiling the regex every time you use it. Going through String, you don't get the ability to "cache" like this.
If you have a different regex every time, using the String class's replaceAll is fine. If you are applying the same regex to many strings, create one Pattern and reuse it.
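To make the "compile once, reuse" advice concrete, here is a minimal sketch. The class name WhitespaceNormalizer and the \s+ regex are made up for illustration; only Pattern.compile, Pattern.matcher, Matcher.replaceAll and String.replaceAll come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class WhitespaceNormalizer {
    // Compiled once when the class loads; Pattern is immutable and safe to share.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Goes through String.replaceAll, which compiles the regex on every call.
    static String viaString(String input) {
        return input.replaceAll("\\s+", " ");
    }

    // Reuses the precompiled Pattern; only a new Matcher is created per call.
    static String viaPattern(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    // Applies the same regex to many strings without recompiling it.
    static List<String> normalizeAll(List<String> inputs) {
        List<String> out = new ArrayList<>(inputs.size());
        for (String s : inputs) {
            out.add(viaPattern(s));
        }
        return out;
    }
}
```

Both methods return the same string for the same input; the difference is only in how often the regex gets compiled.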

Immutability / thread safety: compiled Patterns are immutable, Matchers are not. (see Is Java Regex Thread Safe?)
Handling empty strings: replaceAll should handle an empty input string gracefully (a pattern that needs at least one character simply finds nothing to match in it).
Making coffee, etc.: last I heard, neither String nor Pattern nor Matcher had any API features for that.
edit: as for handling NULLs, the documentation for String and Pattern doesn't explicitly say so, but I suspect they'd throw a NullPointerException since they expect a String.
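A quick way to check the empty-string and null points yourself is something like the sketch below. The class and method names are hypothetical; the behaviour shown is what I'd expect from the JDK regex classes, and it only covers a null regex, not a null input or replacement.

```java
import java.util.regex.Pattern;

// Hypothetical probe class, for illustration only.
final class ReplaceAllEdgeCases {
    // An empty input with a non-empty pattern has nothing to match.
    static String emptyInput() {
        return "".replaceAll("a", "b");
    }

    // An empty regex matches at every position, including the end of the input.
    static String emptyRegex() {
        return "abc".replaceAll("", "-");
    }

    // Reports whether Pattern.compile rejects a null regex with a NullPointerException.
    static boolean nullRegexThrows() {
        try {
            Pattern.compile(null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }
}
```

Since String.replaceAll delegates to Pattern.compile, a null regex passed to it should fail the same way.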

The implementation of String.replaceAll tells you everything you need to know:
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
(And the docs say the same thing.)
While I haven't checked for caching, I'd certainly expect that compiling a pattern once and keeping a static reference to that would be more efficient than calling Pattern.compile with the same pattern each time. If there's a cache it'll be a small efficiency saving - if there isn't it could be a large one.

The difference is that String.replaceAll() compiles the regex each time it's called. There's no equivalent for .NET's static Regex.Replace() method, which automatically caches the compiled regex. Usually, replaceAll() is something you do only once, but if you're going to be calling it repeatedly with the same regex, especially in a loop, you should create a Pattern object and use the Matcher method.
You can create the Matcher ahead of time, too, and use its reset() method to retarget it for each use:
Matcher m = Pattern.compile(regex).matcher("");
for (String s : targets)
{
    System.out.println(m.reset(s).replaceAll(repl));
}
The performance benefit of reusing the Matcher, of course, is nowhere near as great as that of reusing the Pattern.

The other answers sufficiently cover the performance part of the OP, but another difference between Matcher::replaceAll and String::replaceAll is also a reason to compile your own Pattern. When you compile a Pattern yourself, there are options like flags to modify how the regex is applied. For example:
Pattern myPattern = Pattern.compile(myRegex, Pattern.CASE_INSENSITIVE);
The Matcher will apply all the flags you set when you call Matcher::replaceAll.
There are other flags you can set as well. Mostly I just wanted to point out that the Pattern and Matcher API has lots of options, and that's the primary reason to go beyond the simple String::replaceAll
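As a small sketch of the flags point (the class name and the colou?r regex are invented for the example):

```java
import java.util.regex.Pattern;

// Hypothetical example class, for illustration only.
final class CaseInsensitiveReplace {
    // Flags are fixed at compile time and apply to every Matcher made from this Pattern.
    private static final Pattern COLOR = Pattern.compile("colou?r", Pattern.CASE_INSENSITIVE);

    // Replaces every spelling and casing of the word with a placeholder.
    static String redact(String input) {
        return COLOR.matcher(input).replaceAll("***");
    }
}
```

With String::replaceAll the nearest equivalent is an embedded flag in the regex itself, e.g. "(?i)colou?r", which gets recompiled on each call.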

Related

create reusable Java Matcher

I understand from Java Pattern Matcher: create new or reset? that it's best to reuse a Matcher if I'm in a single-threaded context.
So let's say I have a stream of paths using Files.list(basePath) and I want to filter them based upon matching their filenames against a regex. It seems I should use matcher.reset(filename) for each path filename in the stream. Wonderful.
But how do I initialize the Matcher so that I can reuse it, without first creating it and matching against something? Because I don't know what the first "something" will be—I won't even know if there will be a "something" (e.g. a file in some directory).
So do I do this to kick things off?
final Matcher filenamePatternMatcher=filenamePattern.matcher("");
That seems cumbersome and wasteful. But if I set filenamePatternMatcher to null, I'll have to do needless checks processing the individual files, like:
if((filenamePatternMatcher!=null
    ? filenamePatternMatcher.reset(filename)
    : filenamePattern.matcher(filename)).matches()) {…}
Besides, I can't even do that within a Stream<Path>, because the matcher must be effectively final.
So what's an elegant way to create a matcher that will later match against strings using Matcher.reset()? Did the Java API creators not think of this use case?
I did a few timings on some file name matching I do frequently, and calling Matcher.reset(String) improves matching speed by ~20% and cuts down the memory used.
Fortunately Matcher.reset() returns this, which makes it easy to reference within a stream filter. Although it appears a little wasteful to set up a blank matcher before use, it is worth the effort to change from:
stream.filter(s -> pattern.matcher(s).matches())
... to having an extra line to initialise the Matcher:
Matcher matcher = pattern.matcher("");
stream.filter(s -> matcher.reset(s).matches())
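Put together as something you can run, the reset-in-a-filter idea might look like the sketch below. FilenameFilter and its method are hypothetical names, and plain strings stand in for the Path file names from the question.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class FilenameFilter {
    static List<String> matching(List<String> filenames, Pattern pattern) {
        // One Matcher, created against an empty input and retargeted per element.
        // Matcher is not thread-safe, so this only suits a sequential stream.
        Matcher matcher = pattern.matcher("");
        return filenames.stream()
                .filter(name -> matcher.reset(name).matches())
                .collect(Collectors.toList());
    }
}
```

The matcher variable is assigned once, so it is effectively final and can be captured by the lambda. For a parallel stream I would go back to pattern.matcher(s) per element.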

Java - Parsing strings - String.split() versus Pattern & Matcher

Given a String containing a comma delimited list representing a proper noun & category/description pair, what are the pros & cons of using String.split() versus Pattern & Matcher approach to find a particular proper noun and extract the associated category/description pair?
The haystack String format will not change. It will always contain comma delimited data in the form of
PROPER_NOUN|CATEGORY/DESCRIPTION
Common variables for both approaches:
String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;
Using String.split():
for (String current : haystack.split(","))
    if (current.contains(needle))
    {
        result=current.split("\\|")[1];
        break; // *edit* Not part of original code - added in response to comment from Pshemo
    }
Using Pattern & Matcher:
Pattern pattern = Pattern.compile("(" +needle+ "\\|)(\\w+/\\w+)");
Matcher matches = pattern.matcher(haystack);
if (matches.find())
    result=matches.group(2);
Both approaches provide the information I require.
I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project so this approach will require imports from java.util.regex
And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.
Thank you for your time!
Conclusion
I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.
For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:
Approach                                                          Time (ns)
.split()/.contains()/.split()                                     304,212,973
Pattern/Matcher w/ Pattern.compile() invoked for each iteration   230,511,000
Pattern/Matcher w/ Pattern.compile() invoked prior to iteration   111,545,646
In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.
Most String methods that take regular expressions (such as matches(), split(), replaceAll(), etc.) use Pattern/Matcher internally. Each call therefore compiles a new Pattern and creates a new Matcher, which is inefficient when used in a large loop.
Thus if you really want speed, you can use Pattern/Matcher directly and ideally create only a single Matcher object.
There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.
You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.
When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.
I would say that the split() version is much better here due to the following reasons:
The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
Regular expressions are more complex, and therefore the code becomes more error-prone.
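For comparison, here is a sketch of both approaches side by side as small methods. The class and method names are made up, and two things differ from the question's code on purpose: the split version compares the name field with equals() rather than contains(), and the regex version wraps the needle in Pattern.quote and anchors it to the start of an entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class HaystackLookup {
    // split-based lookup: compares the name field exactly instead of using contains().
    static String viaSplit(String haystack, String needle) {
        for (String entry : haystack.split(",")) {
            String[] parts = entry.split("\\|", 2);
            if (parts.length == 2 && parts[0].equals(needle)) {
                return parts[1];
            }
        }
        return null; // needle not present
    }

    // regex-based lookup: Pattern.quote keeps any metacharacters in needle literal,
    // and (?:^|,) requires the needle to start an entry.
    static String viaRegex(String haystack, String needle) {
        Pattern p = Pattern.compile("(?:^|,)" + Pattern.quote(needle) + "\\|(\\w+/\\w+)");
        Matcher m = p.matcher(haystack);
        return m.find() ? m.group(1) : null;
    }
}
```

If the needle is fixed, the Pattern in viaRegex could be compiled once and kept in a field, which is where the timing difference in the benchmark above comes from.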

How do I determine if a string is not a regular expression?

I am trying to improve the performance of some code. It looks something like this:
public boolean isImportant(String token) {
    for (Pattern pattern : patterns) {
        if (pattern.matcher(token).find()) return true;
    }
    return false;
}
What I noticed is that many of the Patterns seem to be simple string literals with no regular expression constructs. So I want to simply store these in a separate list (importantList) and do an equality test instead of performing a more expensive pattern match, such as follows:
public boolean isImportant(String token) {
    if (importantList.contains(token)) return true;
    for (Pattern pattern : patterns) {
        if (pattern.matcher(token).find()) return true;
    }
    return false;
}
How do I programmatically determine if a particular string contains no regular expression constructs?
Edit:
I should add that the answer doesn't need to be performance-sensitive (i.e. regular expressions can be used). I'm mainly concerned with the performance of isImportant() because it's called millions of times, while the initialization of the patterns is only done once.
I normally hate answers that say this but...
Don't do that.
It probably won't make the code run faster, in fact it might even cause the program to take more time.
If you really need to optimize your code, there are likely much, much more effective places to look.
It's going to be difficult. You can check for the non-presence of any regex metacharacters; that should be a good approximation:
Pattern regex = Pattern.compile("[$^()\\[\\]{}.*+?|\\\\]");
Matcher regexMatcher = regex.matcher(subjectString);
boolean regexIsLikely = regexMatcher.find();
Whether it's worth it is another question. Are you sure a regex match is slower than a list lookup (especially since you'll be doing a regex match after that in many cases anyway)? I'd bet it's much faster to just keep the regex match.
There is no way to determine it, as every regex pattern is nothing but a string. Furthermore, there is nearly no performance difference, as regex engines are smart nowadays, and I'm pretty sure that, if the pattern and source lengths are the same, an equality check is the first thing that will be done.
This is wrong:
for (Pattern pattern : patterns)
You should create one big regex that ORs all patterns; then you only match once for each input.
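A rough sketch of the "one big regex" idea, combined with the literal lookup from the question. ImportanceChecker and its fields are hypothetical names. It assumes the individual regexes are independent of each other, so anything relying on numbered backreferences may need adjusting once the patterns are merged.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class, for illustration only.
final class ImportanceChecker {
    private final Set<String> literals; // tokens compared by plain equality
    private final Pattern combined;     // all regexes ORed together, or null if there are none

    ImportanceChecker(List<String> literalTokens, List<String> regexes) {
        this.literals = new HashSet<>(literalTokens);
        // Wrap each regex in a non-capturing group so a '|' inside one cannot leak into its neighbours.
        this.combined = regexes.isEmpty() ? null : Pattern.compile(
                regexes.stream().map(r -> "(?:" + r + ")").collect(Collectors.joining("|")));
    }

    boolean isImportant(String token) {
        if (literals.contains(token)) {
            return true;
        }
        return combined != null && combined.matcher(token).find();
    }
}
```

Whether this beats the loop over separate patterns is something to measure rather than assume.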

java regular expression for String.contains

I'm looking for how to create a regular expression, which is 100% equivalent to the "contains" method in the String class. Basically, I have thousands of phrases that I'm searching for, and from what I understand it is much better for performance reasons to compile the regular expression once and use it multiple times, vs calling "mystring.contains(testString)" over and over again on different "mystring" values, with the same testString values.
Edit: to expand on my question... I will have many thousands of "testString" values, and I don't want to have to convert those to a format that the regular expression mechanism understands. I just want to be able to directly pass in a phrase that users enter, and see if it is found in whatever value "mystring" happens to contain. "testString" will never change its value, but there will be thousands of them, which is why I was thinking of creating the matcher object and re-using it over and over. (Obviously my regexp skills are not up to snuff.)
You can use the LITERAL flag when compiling your pattern to tell the engine you're using a literal string, e.g.:
Pattern p = Pattern.compile(yourString, Pattern.LITERAL);
But are you really sure that doing that and then reusing the result is faster than just String#contains? Enough to make the complexity worth it?
Well you could use Pattern.quote to get a "piece of regular expression" for each input string. Do any of your terms contain line breaks? If so, that could at least make life slightly trickier, though far from impossible.
Anyway, you'd basically just join the quoted terms together as:
Pattern pattern = Pattern.compile("quoted1|quoted2|quoted3|...");
You might want to use Guava's Joiner to easily join the quoted strings together, although obviously it's not terribly hard to do manually.
However, I would try this and then test whether it's actually more efficient than just calling contains. Have you already got a benchmark which shows that contains is too slow?
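Without Guava, the quoting and joining can be sketched with the standard library alone. PhraseMatcher is a made-up name, and the sketch assumes the phrase list is non-empty.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class, for illustration only.
final class PhraseMatcher {
    private final Pattern anyPhrase;

    PhraseMatcher(List<String> phrases) {
        if (phrases.isEmpty()) {
            // An empty alternation would compile to a pattern that matches everything.
            throw new IllegalArgumentException("at least one phrase is required");
        }
        // Pattern.quote makes each phrase literal; '|' then means "any of these".
        this.anyPhrase = Pattern.compile(
                phrases.stream().map(Pattern::quote).collect(Collectors.joining("|")));
    }

    // True if text contains at least one of the phrases as a literal substring.
    boolean containsAny(String text) {
        return anyPhrase.matcher(text).find();
    }
}
```

The result should agree with looping over the phrases and calling text.contains(phrase), so that loop makes a handy baseline for the benchmark.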

Java string: classes or packages with advanced functions?

I am doing string manipulations and I need more advanced functions than the original ones provided in Java.
For example, I'd like to return a substring between the (n-1)th and nth occurrence of a character in a string.
My question is, are there classes already written by users which perform this function, and many others for string manipulations? Or should I dig on stackoverflow for each particular function I need?
Check out the Apache Commons class StringUtils, it has plenty of interesting ways to work with Strings.
http://commons.apache.org/lang/api-2.3/index.html?org/apache/commons/lang/StringUtils.html
Have you looked at the regular expression API? That's usually your best bet for doing complex things with strings:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Along the lines of what you're looking to do, you can traverse the string against a pattern (in your case a single character) and match everything in the string up to but not including the next instance of the character as what is called a capture group.
It's been a while since I've written a regex, but if you were looking for the character A for instance, then I think you could use the regex A([^A]*) and keep matching that string. The stuff in the parenthesis is a capturing group, which I reference below. To match it, you'd use the matcher method on pattern:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#matcher%28java.lang.CharSequence%29
On the Matcher instance, you'd keep calling find() and, each time it returns true, group(1) as needed, where group(1) would get you what is in between the parentheses. You could use a counter in your looping to make sure you get the n-1 instance of the letter.
Lastly, Pattern provides flags you can pass in to indicate things like case insensitivity, which you may need.
If I've made some mistakes here, then someone please correct me. Like I said, I don't write regexes every day, so I'm sure I'm a little bit off.
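For the specific example in the question, a plain indexOf loop also does the job, as an alternative to the regex route rather than an implementation of it. The class and method names are hypothetical, and I'm assuming occurrences are counted from 1, so n has to be at least 2.

```java
// Hypothetical helper, for illustration only.
final class OccurrenceSlicer {
    // Returns the text between the (n-1)th and nth occurrence of c (1-based, n >= 2),
    // or null if s contains fewer than n occurrences.
    static String betweenOccurrences(String s, char c, int n) {
        if (n < 2) {
            throw new IllegalArgumentException("n must be at least 2");
        }
        int prev = -1; // index of the previous occurrence
        int idx = -1;  // index of the current occurrence
        for (int i = 0; i < n; i++) {
            prev = idx;
            idx = s.indexOf(c, idx + 1);
            if (idx < 0) {
                return null; // ran out of occurrences
            }
        }
        return s.substring(prev + 1, idx);
    }
}
```

For instance, betweenOccurrences("a/b/c/d", '/', 2) picks out the text between the first and second slash.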
