Why are most string manipulations in Java based on regexp?

In Java there are a bunch of methods that all have to do with manipulating Strings.
The simplest example is the String.split("something") method.
Now the actual definition of many of those methods is that they all take a regular expression as their input parameter(s), which makes them all very powerful building blocks.
Now there are two effects you'll see in many of those methods:
They recompile the regular expression each time the method is invoked, which imposes a performance penalty.
I've found that in most "real-life" situations these methods are called with "fixed" texts. The most common usage of the split method is even worse: It's usually called with a single char (usually a ' ', a ';' or a '&') to split by.
So it's not only that the default methods are powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if it was known to be a single char. Both are significantly faster than the "standard" split method.
So I was wondering: why was the Java API chosen the way it is now?
What was the good reason to go for this instead of having something like split(char), split(String) and a splitRegex(String)?
Update: I slapped together a few calls to see how much time the various ways of splitting a string would take.
Short summary: It makes a big difference!
I did 10000000 iterations for each test case, always using the input
"aap,noot,mies,wim,zus,jet,teun"
and always using ',' or "," as the split argument.
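The harness itself isn't shown here; roughly, each test case is a timing loop like this (a sketch, with the method under test swapped in per case):
final String input = "aap,noot,mies,wim,zus,jet,teun";
final long start = System.currentTimeMillis();
for (int i = 0; i < 10000000; i++) {
    stringSplitChar(input, ','); // swap in the splitting variant under test
}
System.out.println((System.currentTimeMillis() - start) + " milliseconds");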
This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):
fastSplit STRING
Test 1 : 11405 milliseconds: Split in several pieces
Test 2 : 3018 milliseconds: Split in 2 pieces
Test 3 : 4396 milliseconds: Split in 3 pieces
homegrown fast splitter based on char
Test 4 : 9076 milliseconds: Split in several pieces
Test 5 : 2024 milliseconds: Split in 2 pieces
Test 6 : 2924 milliseconds: Split in 3 pieces
homegrown splitter based on char that always splits in 2 pieces
Test 7 : 1230 milliseconds: Split in 2 pieces
String.split(regex)
Test 8 : 32913 milliseconds: Split in several pieces
Test 9 : 30072 milliseconds: Split in 2 pieces
Test 10 : 31278 milliseconds: Split in 3 pieces
String.split(regex) using precompiled Pattern
Test 11 : 26138 milliseconds: Split in several pieces
Test 12 : 23612 milliseconds: Split in 2 pieces
Test 13 : 24654 milliseconds: Split in 3 pieces
StringTokenizer
Test 14 : 27616 milliseconds: Split in several pieces
Test 15 : 28121 milliseconds: Split in 2 pieces
Test 16 : 27739 milliseconds: Split in 3 pieces
As you can see it makes a big difference if you have a lot of "fixed char" splits to do.
To give you guys some insight: I'm currently working in the Apache logfiles and Hadoop arena with the data of a big website, so to me this stuff really matters :)
Something I haven't factored in here is the garbage collector. As far as I can tell, compiling a regular expression into a Pattern/Matcher/etc. will allocate a lot of objects that need to be collected at some point. So perhaps in the long run the differences between these versions are even bigger ... or smaller.
My conclusions so far:
Only optimize this if you have a LOT of strings to split.
If you use the regex methods, always precompile if you repeatedly use the same pattern.
Forget the (obsolete) StringTokenizer.
If you want to split on a single char, use a custom method, especially if you only need to split it into a specific number of pieces (like ... 2).
P.S. I'm giving you all my homegrown split-by-char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them ... yet. Have fun.
private static String[] stringSplitChar(final String input, final char separator) {
    // First we count how many pieces we will need to store ( = separators + 1 ).
    // Start the search position at -1 so a separator at index 0 is counted too.
    int pieces = 0;
    int position = -1;
    do {
        pieces++;
        position = input.indexOf(separator, position + 1);
    } while (position != -1);
    // Then we allocate memory.
    final String[] result = new String[pieces];
    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (piece < lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);
    return result;
}
private static String[] stringSplitChar(final String input, final char separator, final int maxpieces) {
    if (maxpieces <= 0) {
        return stringSplitChar(input, separator);
    }
    int pieces = maxpieces;
    // Then we allocate memory.
    final String[] result = new String[pieces];
    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (currentposition != -1 && piece < lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);
    // All remaining array elements are uninitialized and assumed to be null.
    return result;
}
private static String[] stringChop(final String input, final char separator) {
    String[] result;
    // Find the separator.
    final int separatorIndex = input.indexOf(separator);
    if (separatorIndex == -1) {
        result = new String[1];
        result[0] = input;
    } else {
        result = new String[2];
        result[0] = input.substring(0, separatorIndex);
        result[1] = input.substring(separatorIndex + 1);
    }
    return result;
}
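For illustration, a few hypothetical calls and the results they produce:
String[] all = stringSplitChar("aap,noot,mies", ',');      // { "aap", "noot", "mies" }
String[] two = stringSplitChar("key=value=rest", '=', 2);  // { "key", "value=rest" }
String[] chopped = stringChop("aap,noot,mies", ',');       // { "aap", "noot,mies" }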

Note that the regex need not be recompiled each time. From the Javadoc:
An invocation of this method of the form str.split(regex, n) yields the same result as the expression
Pattern.compile(regex).split(str, n)
That is, if you are worried about performance, you may precompile the pattern and then reuse it:
Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1);
String[] tokens2 = p.split(str2);
...
instead of
String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...
I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.
My feeling (which I can't back with any statistical evidence) is that in most cases String.split() is used in a context where performance is not an issue. E.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO rare are the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense.
It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.

I wouldn't say most string manipulations are regex-based in Java. Really we are only talking about split and replaceAll/replaceFirst. But I agree, it's a big mistake.
Apart from the ugliness of having a low-level language feature (strings) becoming dependent on a higher-level feature (regex), it's also a nasty trap for new users who might naturally assume that a method with the signature String.replaceAll(String, String) would be a string-replace function. Code written under that assumption will look like it's working, until a regex-special character creeps in, at which point you've got confusing, hard-to-debug (and maybe even security-significant) bugs.
It's amusing that a language that can be so pedantically strict about typing made the sloppy mistake of treating a string and a regex as the same thing. It's less amusing that there's still no builtin method to do a plain string replace or split. You have to use a regex replace with a Pattern.quote()d string. And you only even get that from Java 5 onwards. Hopeless.
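A small sketch of the trap (Pattern and Matcher are from java.util.regex; the values are made up):
String price = "cost: $5";
// Looks like a plain text replace, but "$5" is parsed as a regex; '$' is an
// end-of-input anchor, so this silently matches nothing at all.
String broken = price.replaceAll("$5", "10");
// The regex-safe version, available from Java 5 onwards:
String fixed = price.replaceAll(Pattern.quote("$5"), Matcher.quoteReplacement("$10"));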
@Tim Pietzcker:
Are there other languages that do the same?
JavaScript's Strings are partly modelled on Java's and are also messy in the case of replace(). By passing in a string, you get a plain string replace, but it only replaces the first match, which is rarely what's wanted. To get a replace-all you have to pass in a RegExp object with the /g flag, which again has problems if you want to create it dynamically from a string (there is no built-in RegExp.quote method in JS). Luckily, split() is purely string-based, so you can use the idiom:
s.split(findstr).join(replacestr)
Plus of course Perl does absolutely everything with regexen, because it's just perverse like that.
(This is a comment more than an answer, but is too big for one. Why did Java do this? Dunno, they made a lot of mistakes in the early days. Some of them have since been fixed. I suspect if they'd thought to put regex functionality in the box marked Pattern back in 1.0, the design of String would be cleaner to match.)

I imagine a good reason is that they can simply pass the buck on to the regex method, which does all the real heavy lifting for all of the string methods. I'm guessing they thought that, since they already had a working solution, it was less efficient, from a development and maintenance standpoint, to reinvent the wheel for each string manipulation method.

Interesting discussion!
Java was not originally intended as a batch programming language. As such, the APIs out of the box are more tuned towards doing one "replace", one "parse" etc., except on application initialization, when the app may be expected to be parsing a bunch of configuration files.
Hence optimization of these APIs was sacrificed on the altar of simplicity, IMO. But the question brings up an important point. Python's desire to keep the regex API distinct from the non-regex API stems from the fact that Python can be used as an excellent scripting language as well. In UNIX too, the original versions of fgrep did not support regex.
I was engaged in a project where we had to do some amount of ETL work in java. At that time, I remember coming up with the kind of optimizations that you have alluded to, in your question.

I suspect that the reason why things like String#split(String) use regexp under the hood is that it involves less extraneous code in the Java Class Library. The state machine resulting from a split on something like ',' or a space is so simple that it is unlikely to be significantly slower to execute than a statically implemented equivalent using a StringCharacterIterator.
Beyond that the statically implemented solution would complicate runtime optimization with the JIT because it would be a different block of code that also requires hot code analysis. Using the existing Pattern algorithms regularly across the library means that they are more likely candidates for JIT compilation.

Very good question.
I suppose when the designers sat down to look at this (and not for very long, it seems), they came at it from the point of view that it should be designed to suit as many different possibilities as possible. Regular expressions offered that flexibility.
They didn't think in terms of efficiency. There is the Java Community Process available to raise such issues.
Have you looked at using the java.util.regex.Pattern class, where you compile the expression once and then use it on different strings?
Pattern exp = Pattern.compile(":");
String[] array = exp.split(sourceString1);
String[] array2 = exp.split(sourceString2);

In looking at the Java String class, the uses of regex seem reasonable, and there are alternatives if regex is not desired:
http://java.sun.com/javase/6/docs/api/java/lang/String.html
boolean matches(String regex) - A regex seems appropriate; otherwise you could just use equals.
String replaceAll/replaceFirst(String regex, String replacement) - There are equivalents that take CharSequence instead, avoiding regex.
String[] split(String regex, int limit) - A powerful but expensive split; you can use StringTokenizer to split by tokens.
These are the only functions I saw that took regex.
Edit: After seeing that StringTokenizer is legacy, I would defer to Péter Török's answer to precompile the regex for split instead of using the tokenizer.

The answer to your question is that the Java core API did it wrong. For day-to-day work you can consider using the Guava library's CharMatcher, which fills the gap beautifully.
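By way of illustration, a sketch of what that looks like (assuming Guava is on the classpath; Splitter and CharMatcher live in com.google.common.base):
// Split on a literal character, no regex machinery involved:
Iterable<String> parts = Splitter.on(',')
        .trimResults()
        .omitEmptyStrings()
        .split("aap, noot,, mies");
// CharMatcher handles simple character-level manipulations directly:
String digits = CharMatcher.inRange('0', '9').retainFrom("abc123def45"); // "12345"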

...why was the Java API chosen the way it is now?
Short answer: it wasn't. Nobody ever decided to favor regex methods over non-regex methods in the String API, it just worked out that way.
I always understood that Java's designers deliberately kept the string-manipulation methods to a minimum, in order to avoid API bloat. But when regex support came along in JDK 1.4, of course they had to add some convenience methods to String's API.
So now users are faced with a choice between the immensely powerful and flexible regex methods, and the bone-basic methods that Java always offered.

Related

Is use of AtomicInteger for indexing in Stream a legit way?

I would like an answer pointing out the reasons why the idea described below, using a very simple example, is commonly considered bad, and to know its weaknesses.
I have a sentence of words and my goal is to make every second one uppercase. My starting point for both of the cases is exactly the same:
String sentence = "Hi, this is just a simple short sentence";
String[] split = sentence.split(" ");
The traditional and procedural approach is:
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < split.length; i++) {
    if (i % 2 == 0) {
        stringBuilder.append(split[i]);
    } else {
        stringBuilder.append(split[i].toUpperCase());
    }
    if (i < split.length - 1) {
        stringBuilder.append(" ");
    }
}
When I want to use a java-stream, the usage is limited by the effectively-final (or final) constraint on variables used in lambda expressions. I have to use a workaround with an array and its first and only index, which was suggested in the first comment of my question How to increment a value in Java Stream. Here is the example:
int[] index = {0};
String result = Arrays.stream(split)
        .map(i -> index[0]++ % 2 == 0 ? i : i.toUpperCase())
        .collect(Collectors.joining(" "));
Yeah, it's a bad solution, and I have heard a few good reasons why somewhere, hidden in the comments of a question I am unable to find (if you remind me of some of them, I'd upvote twice if possible). But what if I use AtomicInteger - does it make any difference, and is it a good and safe way with no side effects compared to the previous one?
AtomicInteger atom = new AtomicInteger(0);
String result = Arrays.stream(split)
        .map(i -> atom.getAndIncrement() % 2 == 0 ? i : i.toUpperCase())
        .collect(Collectors.joining(" "));
Regardless of how ugly it might look to anyone, I am asking for a description of the possible weaknesses and their reasons. I don't care about the performance, but about the design and the possible weaknesses of the 2nd solution.
Please don't associate AtomicInteger with multi-threading issues. I used this class only because it receives, increments and stores the value in the way I need for this example.
As I often say in my answers, the Java Stream API is not the bullet for everything. My goal is to explore and find the edge where that sentence is applicable, since I find the last snippet quite clear, readable and brief compared to the StringBuilder snippet.
Edit: Is there any alternative way applicable to the snippets above, and in general when one needs to work with both the item and its index while iterating with the Stream API?
The documentation of the java.util.stream package states that:
Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.
[...]
The ordering of side-effects may be surprising. Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
This means that the elements may be processed out of order, and thus the Stream-solutions may produce wrong results.
This is (at least for me) a killer argument against the two Stream-solutions.
By the process of elimination, we only have the "traditional solution" left. And honestly, I do not see anything wrong with this solution. If we wanted to get rid of the for-loop, we could re-write this code using a foreach-loop:
boolean toUpper = false; // 1st String is not capitalized
for (String word : split) {
    stringBuilder.append(toUpper ? word.toUpperCase() : word);
    toUpper = !toUpper;
}
For a streamified and (as far as I know) correct solution, take a look at Octavian R.'s answer.
Your question wrt. the "limits of streams" is opinion-based.
The answer to the question (s) ends here. The rest is my opinion and should be regarded as such.
In Octavian R.'s solution, an artificial index-set is created through an IntStream, which is then used to access the String[]. For me, this has a higher cognitive complexity than a simple for- or foreach-loop, and I do not see any benefit in using streams instead of loops in this situation.
In Java, compared with Scala, you must be inventive. One solution without mutation is this one:
String sentence = "Hi, this is just a simple short sentence";
String[] split = sentence.split(" ");
String result = IntStream.range(0, split.length)
        .mapToObj(i -> i % 2 == 0 ? split[i] : split[i].toUpperCase())
        .collect(Collectors.joining(" "));
System.out.println(result);
In Java streams you should avoid mutation. Your solution with AtomicInteger is ugly and bad practice.
Kind regards!
As explained in Turing85’s answer, your stream solutions are not correct, as they rely on the processing order, which is not guaranteed. This can lead to incorrect results with parallel execution today, but even if it happens to produce the desired result with a sequential stream, that’s only an implementation detail. It’s not guaranteed to work.
Besides that, there is no advantage in rewriting code to use the Stream API with logic that basically still is a loop, but obfuscated behind a different API. The best way to describe the idea of the new APIs is to say that you should express what to do, but not how.
Starting with Java 9, you could implement the same thing as
String result = Pattern.compile("( ?+[^ ]* )([^ ]*)").matcher(sentence)
.replaceAll(m -> m.group(1)+m.group(2).toUpperCase());
which expresses the wish to replace every second word with its upper case form, but doesn’t express how to do it. That’s up to the library, which likely uses a single StringBuilder instead of splitting into an array of strings, but that’s irrelevant to the application logic.
As long as you’re using Java 8, I’d stay with the loop and even when switching to a newer Java version, I would consider replacing the loop as not being an urgent change.
The pattern in the above example has been written in a way to do exactly the same as your original code splitting at single space characters. Usually, I’d encode “replace every second word” more like
String result = Pattern.compile("(\\w+\\W+)(\\w+)").matcher(sentence)
.replaceAll(m -> m.group(1)+m.group(2).toUpperCase());
which would behave differently when encountering multiple spaces or other separators, but usually is closer to the actual intention.

recommended way to find a word in a comma-separated string?

I want to find out if a utility is one of the utilities.
I have a JUnit test as follows:
@Test
public void testUtilityInUtilities() {
    final String utilities = "Pacific Gas & Electric (PG&E),San Diego Gas & Electric (SDG&E), Salt River Project (SRP),Southern California Edison (SCE)";
    final String utility = "San Diego Gas & Electric (SDG&E)";
    assertTrue(utilities.contains(utility));
}
Is it a good enough test? or shall I do something similar to following?
String[] splitString = utilities.split(",");
for (String string : splitString) {
    if (string.equals(utility)) {
        return true;
    }
}
return false;
Which method is recommended: split, contains, or anything else?
The contains way is faster, but it is prone to false positives: it will match a substring, say, "Gas & Electric", even though the actual string was "Pacific Gas & Electric (PG&E)". You can guard against this by requiring that the points around the match be at an end of the string or at a comma. You could improve upon the first method by constructing a regular expression from the regex-quoted search string, framed by end markers (i.e. commas, $ and ^), to require a complete match.
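A rough sketch of that guarded check (java.util.regex.Pattern assumed imported; Pattern.quote escapes the parentheses and other regex metacharacters in the search term):
String regex = "(^|,)\\s*" + Pattern.quote(utility) + "\\s*(,|$)";
boolean found = Pattern.compile(regex).matcher(utilities).find();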
The split way is more reliable, but it is wasteful: you end up creating a whole array of substrings, only to check for a presence of a single string, and throw away the rest.
All in all, I would prefer the first method in situations where performance matters, because it is not wasteful. If you run this method once in a while, though, the split-based method is easier to code and to read.
For the case that you have mentioned, contains should suffice. Split would unnecessarily end up creating an additional array which you are not using for your data processing (at least in the above-mentioned code).
Also, another point that you need to consider is how many searches you will be performing in the given String. If you are performing multiple searches for a utility String in the utilities String, then you should think of using more complex data structures that enable multiple fast searches, e.g. suffix trees.

Would Java indexOf (brute force method) be more practical for me or some other substring algorithm?

I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.
Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.
So for example, I want to find the number of food-related words that are used in an average text message log, such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave English examples, I will actually be using Korean for my program. Each of these designated words has its own respective score, which I put in a hashmap as key and value. I then show the top scorers for food-related words, as well as the most frequent food words used by those users.
My current method is to split each line of text on whitespace, and to process each individual word from the haystack by using the contains method (which uses the indexOf method and the naive substring search algorithm) to check whether the word contains the pattern.
wordFromInput.contains(wordFromPattern);
To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.
Originally, I tried to use a hash map and to merely get the pattern instead of searching for it in the ArrayList, but I then realized that this is not possible with a hash table for what I am trying to do with a substring.
I've looked around through Stackoverflow and found a lot of helpful and related questions, such as these two:
1 and 2. I'm somewhat more familiar with the various string algorithms (Boyer Moore, KMP, etc.)
I initially thought then that the naive method would of course be the worst type of algorithm for my case, but having found this question, I've realized that my case (short pattern, short text), might actually be more effective with the naive method. But I wanted to know if there was something that I was neglecting completely.
Here is a snippet of my code though if anyone wants to see my issue more concretely.
While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().
I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear it as well.
So to clean it up:
lines of text from chat logs (2000-10,000+), haystack
1600+ patterns, needle(s)
mostly using Korean characters, although some English is included
Brute force naive method is simply too slow, but debating whether there are other alternatives and even if there are, whether they are practical given the nature of short patterns and text.
I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.
You can replace the hashtable with a Trie.
Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.
This approach is O(C) where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once. Thus this approach should be as good as you can get, at least in terms of big O.
However, it sounds like you may not want to list all of the possible words you are searching for. Therefore, you might want to simply build a counting Trie from all of the words. If nothing else, that'll probably make it easier for any pattern matching algorithm you use. Although, it might require some modifications to the Trie.
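A minimal sketch of such a counting trie (all names are illustrative, not from the question's code):
import java.util.HashMap;
import java.util.Map;

class CountingTrie {
    private final Map<Character, CountingTrie> children = new HashMap<>();
    private int count = -1; // -1 means "no search word ends here"

    void add(String word) {
        CountingTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new CountingTrie());
        }
        node.count = Math.max(node.count, 0); // mark the end of a word
    }

    // Bumps the counter and returns true if the word is in the trie.
    boolean countIfPresent(String word) {
        CountingTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        if (node.count < 0) {
            return false;
        }
        node.count++;
        return true;
    }
}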
What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.
There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
Hope this helps!
I'm pretty sure string.contains is already highly optimized, so replacing it with something else is not going to do you a lot of good.
So the way to go, I suspect, is not to look for each and every bank-word in your chat words, but rather do multiple comparisons at once.
The first way to do it would be to create one huge regular expression that will match all your bank-words. Compile it and hope the regular expression package is efficient enough (chances are - it is). You will have a rather lengthy setup stage (the regex compilation), but matches should be a lot faster.
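A sketch of that approach (standard java.util and java.util.regex imports assumed; bankWords and chatLine are hypothetical names; quoting each word keeps regex metacharacters from leaking into the alternation):
List<String> bankWords = Arrays.asList("hamburger", "pizza", "coke");
Pattern all = Pattern.compile(
        bankWords.stream().map(Pattern::quote).collect(Collectors.joining("|")));
Map<String, Integer> counts = new HashMap<>();
Matcher m = all.matcher(chatLine); // chatLine: one line of the log
while (m.find()) {
    counts.merge(m.group(), 1, Integer::sum);
}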
You can build an index of the words you need to match and count them as you process them. If you can use a HashMap to look up the patterns for each word, the cost will be O(n * m).
You can use a HashMap for all the possible words, and you can then dissect the words later.
e.g. say you need to match red and apple, you can combine the sum of
redapple = 1
applered = 0
red = 10
apple = 15
This means that red is actually 11 (10 + 1), and apple is 16 (15 + 1)
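If I read that right, the combining step looks something like this (a guess; the map contents mirror the numbers above):
Map<String, Integer> raw = new HashMap<>();
raw.put("redapple", 1);
raw.put("applered", 0);
raw.put("red", 10);
raw.put("apple", 15);
int red = raw.get("red") + raw.get("redapple") + raw.get("applered");     // 11
int apple = raw.get("apple") + raw.get("redapple") + raw.get("applered"); // 16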
I don't know Korean, so I imagine the strategies used to tinker with Strings in English aren't necessarily possible in Korean in the same way, but perhaps this strategy in pseudocode can be applied with your knowledge of Korean to make it work. (Java is of course still the same, but, for example, in Korean is it still highly likely for the letters "ough" to be in succession? Are there even letters for "ough"? But with that being said, hopefully the principle can be applied.)
I would use String.toCharArray to create a two-dimensional array (or an ArrayList if a variable size is needed). The logic, in pseudocode:
if (first letter of word matches keyword's first letter)   // we have a candidate
    skip to last letter of the current word                // see comment below
    if (last letter of word matches keyword's last letter) // strong candidate
        iterate backwards to start+1 checking the remainder of the letters
The reason I suggest skipping to the last letter is that, statistically, a "consonant, vowel" start for the first two letters of a word is significantly common, especially for nouns, which will make up a lot of your keywords since any food is a noun (almost all the keyword examples you gave matched that consonant-vowel structure). And since there are only 5 vowels (plus y), the likelihood of the second letter "i" showing up in the keyword "pizza" is inherently high, yet after that point there is still a good chance that the word may turn out not to be a match.
However if you know that the first letter and the last letter match, then you probably have a much stronger candidate and can then iterate in reverse. I think over larger sets of data, this would eliminate candidates much faster than checking letters in order. Basically you'd be letting too many fake candidates past the second iteration, thus increasing your overall conditional operations. It might sound like something small, but in a project like this there's lots of reiterating, so micro-optimizations will accumulate very quickly.
If this approach can be applied in a language that's probably structurally very different from English(I'm speaking from ignorance here though), then I think it might provide some efficiency for you whether you make it happen through iterating a char array or with a scanner, or any other construct.
The trick is to realise that if you can describe the string you are searching for as a regular expression you can also, by definition, describe it with a state machine.
At every character in your message start a state machine for every one of your 1600 patterns and pass the character through it. This sounds scary but believe me most of them will terminate immediately anyway so you aren't really doing a huge amount of work. Bear in mind that a state machine can usually be encoded with a simple switch/case or a ch == s.charAt(i) comparison at each step, so they are close to the ultimate in light-weight.
Obviously you know what to do whenever one of your search machines terminates at the end of their search. Any that terminate before full-match can be discarded immediately.
private static class Matcher {
    private final int where;
    private final String s;
    private int i = 0;

    public Matcher(String s, int where) {
        this.s = s;
        this.where = where;
    }

    public boolean match(char ch) {
        return s.charAt(i++) == ch;
    }

    public int matched() {
        return i == s.length() ? where : -1;
    }
}
// Words I am looking for.
String[] watchFor = new String[] {"flies", "like", "arrow", "banana", "a"};
// Test string to search.
String test = "Time flies like an arrow, fruit flies like a banana";
public void test() {
    // Use a LinkedList because it is O(1) to remove anywhere.
    List<Matcher> matchers = new LinkedList<>();
    int pos = 0;
    for (char c : test.toCharArray()) {
        // Fire off all of the matchers at this point.
        for (String s : watchFor) {
            matchers.add(new Matcher(s, pos));
        }
        // Discard all matchers that fail here.
        for (Iterator<Matcher> i = matchers.iterator(); i.hasNext(); ) {
            Matcher m = i.next();
            // Should it be removed?
            boolean remove = !m.match(c);
            if (!remove) {
                // Still matches! Is it complete?
                int matched = m.matched();
                if (matched >= 0) {
                    // Todo - Should use getters.
                    System.out.println(" " + m.s + " found at " + m.where + " active matchers " + matchers.size());
                    // Complete!
                    remove = true;
                }
            }
            // Remove it where necessary.
            if (remove) {
                i.remove();
            }
        }
        // Step pos to keep track.
        pos += 1;
    }
}
prints
flies found at 5 active matchers 6
like found at 11 active matchers 6
a found at 16 active matchers 2
a found at 19 active matchers 2
arrow found at 19 active matchers 6
flies found at 32 active matchers 6
like found at 38 active matchers 6
a found at 43 active matchers 2
a found at 46 active matchers 3
a found at 48 active matchers 3
banana found at 45 active matchers 6
a found at 50 active matchers 2
There are several simple optimisations. With some simple pre-processing the most obvious is to use the current character to determine which matchers may be applicable.
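For instance, one such pre-processing step, sketched against the code above (standard java.util imports assumed):
// Index the watch words by first character once, up front:
Map<Character, List<String>> byFirstChar = new HashMap<>();
for (String s : watchFor) {
    byFirstChar.computeIfAbsent(s.charAt(0), k -> new ArrayList<>()).add(s);
}
// Then, inside the character loop, only start matchers that can possibly match:
for (String s : byFirstChar.getOrDefault(c, Collections.<String>emptyList())) {
    matchers.add(new Matcher(s, pos));
}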
This is a pretty broad question, so I won't go into too much detail, but roughly:
Pre-process the haystacks using something like a broad lemmatizer to create "topic word only" versions of the messages, by noting which topics all words in a message cover. For example, any occurrences of "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", or "McDonalds" would cause the "topic" word "food" to be collected for that message. Some words may have multiple topics, e.g. "McDonalds" may be in the topics "food" and "business". Most words won't have any topic.
After this process, you'll have haystacks consisting of only "topic" words. Then create a Map<String, Set<Integer>> and populate it with the topic words and the Set of chat message ids that contain each. This is a reverse index from topic word to the chat messages that contain it.
The runtime code to find all documents that contain all n words is then trivial and super fast - near O(#terms):
private Map<String, Set<Integer>> index; // pre-populated

Set<Integer> search(String... topics) {
    Set<Integer> results = null;
    for (String topic : topics) {
        Set<Integer> hits = index.get(topic);
        if (hits == null)
            return Collections.emptySet();
        if (results == null)
            results = new HashSet<Integer>(hits);
        else
            results.retainAll(hits);
        if (results.isEmpty())
            return Collections.emptySet(); // exit early
    }
    return results;
}
This will perform near O(1), and tell you which messages share all search terms. If you just want the number, use the trivial size() of the returned Set.

Pattern.compile.split vs StringBuilder iteration and substring

I have to split a very large string in the fastest way possible, and from what research I did I narrowed it down to two possibilities:
1. Pattern.compile("[delimiter]").split("[large_string]");
2. Iterate through a StringBuilder and call substring:
StringBuilder sb = new StringBuilder("[large_string]");
ArrayList<String> pieces = new ArrayList<String>();
int pos = 0;
int currentPos;
while ((currentPos = sb.indexOf("[delimiter]", pos)) != -1) {
    pieces.add(sb.substring(pos, currentPos));
    pos = currentPos + "[delimiter]".length();
}
pieces.add(sb.substring(pos)); // don't forget the piece after the last delimiter
Any help is appreciated; I will benchmark them, but I'm more interested in the theoretical part: why is one faster than the other?
Furthermore if you have other suggestions please post them.
UPDATE: So as I said, I did the benchmark: I generated 5 million strings, each having 32 chars, and put them into a single string delimited by ~~ :
The StringBuilder approach, surprisingly, was the slowest, with an avg of 2.50-2.55 sec
Pattern.compile.split came in 2nd place, with an avg of 2.47-2.49 sec
Splitter from Guava was the undisputed winner, with an avg of 1.12-1.18 sec, half the time of the others (special thanks to fge, who suggested it)
Thank you all for the help!
If your string is large, something to consider is whether any copies are made. If you don't use StringBuilder but use the plain String#substring(from,to), then no copies will be made of the contents of the string. There will be 1 instance of the whole String, and it will stick around as long as at least 1 substring persists.
Hmm... Source perusal of the Pattern class shows that split does the same thing, while the source of the StringBuilder shows that copies are made for each substring.
If this is a fixed pattern, and you do not need a regex, you might want to consider Guava's Splitter. It is very well written and performs admirably:
private static final Splitter SPLITTER = Splitter.on("myDelimiterHere");
Also, unlike .split(), you don't get nasty surprises with empty strings at the end... (with String.split() you must pass a negative integer as the limit argument in order for it to do a "real" split)
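The surprise in question, as a tiny check:
System.out.println("a,b,,".split(",").length);     // 2 -- trailing empty strings dropped
System.out.println("a,b,,".split(",", -1).length); // 4 -- negative limit keeps them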
You will also see that this class' .split() method returns an Iterable<String>; when the string is REALLY large, it only makes the copies you actually ask it to make!
If you have to use it multiple times, a static Pattern object would be the choice. Also look more closely at the StringBuilder approach: its indexOf method does the same thing, iterating through all the characters. Internally the String.split() method is also using Pattern to compile and split the string. Use the given methods and you should get the best performance...

Performance comparison of immutable string concatenation between Java and Python

UPDATES: thanks a lot to Gabe and Glenn for the detailed explanations. The test was written not as a language-comparison benchmark, but just for my study of VM optimization technologies.
I did a simple test to understand the performance of string concatenation between Java and Python.
The test is target for the default immutable String object/type in both languages. So I don't use StringBuilder/StringBuffer in Java test.
The test simply concatenates strings 100k times. Java consumes ~32 seconds to finish, while Python only uses ~13 seconds for Unicode strings and 0.042 seconds for non-Unicode strings.
I'm a bit surprised by the results. I thought Java should be faster than Python. What optimization technology does Python leverage to achieve better performance? Or is the String object designed too heavyweight in Java?
OS: Ubuntu 10.04 x64
JDK: Sun 1.6.0_21
Python: 2.6.5
Java test did use -Xms1024m to minimize GC activities.
Java code:
public class StringConcateTest {
    public static void test(int n) {
        long start = System.currentTimeMillis();
        String a = "";
        for (int i = 0; i < n; i++) {
            a = a.concat(String.valueOf(i));
        }
        long end = System.currentTimeMillis();
        System.out.println(a.length() + ", time:" + (end - start));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            test(1000 * 100);
        }
    }
}
Python code:
import time

def f(n):
    start = time.time()
    a = u''  # remove u to use non-Unicode string
    for i in xrange(n):
        a = a + str(i)
    print len(a), 'time', (time.time() - start) * 1000.0

for j in xrange(10):
    f(1000 * 100)
@Gabe's answer is correct, but needs to be shown clearly rather than hypothesized.
CPython (and probably only CPython) does an in-place string append when it can. There are limitations on when it can do this.
First, it can't do it for interned strings. That's why you'll never see this if you test with a = "testing"; a = a + "testing", because assigning a string literal results in an interned string. You have to create the string dynamically, as this code does with str(12345). (This isn't much of a limitation; once you do an append this way, the result is an uninterned string, so if you append string literals in a loop this will only happen the first time.)
Second, Python 2.x only does this for str, not unicode. Python 3.x does do this for Unicode strings. This is strange: it's a major performance difference--a difference in complexity. This discourages using Unicode strings in 2.x, when they should be encouraging it to help the transition to 3.x.
And finally, there can be no other references to the string.
>>> a = str(12345)
>>> id(a)
3082418720
>>> a += str(67890)
>>> id(a)
3082418720
This explains why the non-Unicode version is so much faster in your test than the Unicode version.
The actual code for this is string_concatenate in Python/ceval.c, and works for both s1 = s1 + s2 and s1 += s2. The function _PyString_Resize in Objects/stringobject.c also says explicitly: The following function breaks the notion that strings are immutable. See also http://bugs.python.org/issue980695.
My guess is that Python just does a realloc on the string rather than create a new one with a copy of the old one. Since realloc takes no time when there is enough empty space following the allocation, it is very fast.
So how come Python can call realloc and Java can't? Python's garbage collector uses reference counting so it can tell that nobody else is using the string and it won't matter if the string changes. Java's garbage collector doesn't maintain reference counts so it can't tell whether any other reference to the string is extant, meaning it has no choice but to create a whole new copy of the string on every concatenation.
EDIT: Although I don't know that Python actually does call realloc on a concat, here's the comment for _PyString_Resize in stringobject.c indicating why it can:
The following function breaks the notion that strings are immutable:
it changes the size of a string. We get away with this only if there
is only one module referencing the object. You can also think of it
as creating a new string object and destroying the old one, only
more efficiently. In any case, don't use this if the string may
already be known to some other part of the code...
I don't think your test means a lot, since Java and Python handle strings differently (I am no expert in Python but I do know my way around Java). StringBuilders/Buffers exist for a reason in Java. The language designers didn't do any kind of more efficient memory management/manipulation for plain Strings exactly for this reason: there are other tools than the "String" object for this kind of manipulation, and they expect you to use them when you code.
When you do things the way they are meant to be done in Java, you will be surprised how fast the platform is... But I have to admit that I have been pretty much impressed by the performance of some Python applications I have tried recently.
I do not know the answer for sure. But here are some thoughts. First, Java internally stores strings as char [] arrays containing the UTF-16 encoding of the string. This means that every character in the strings takes at least two bytes. So just in terms of raw storage, Java would have to copy around twice as much data as python strings. Python unicode strings are therefore the better test because they are similarly capable. Perhaps python stores unicode strings as UTF-8 encoded bytes. In that case, if all you are storing in these are ASCII characters, then again you'd have Java using twice as much space and therefore doing twice as much copying. To get a better comparison you should concatenate strings containing more interesting characters that require two or more bytes in their UTF-8 encoding.
I ran Java code with a StringBuilder in place of a String and saw an average finish time of 10ms (high 34ms, low 5ms).
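For reference, that StringBuilder variant is a small change to the loop body of the original test (a sketch):
StringBuilder sb = new StringBuilder();
for (int i = 0; i < n; i++) {
    sb.append(i);
}
String a = sb.toString();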
As for the Python code, using "Method 6" here (found to be the fastest method), I was able to achieve an average of 84ms (high 91ms, low 81ms) using unicode strings. Using non-unicode strings reduced these numbers by ~25ms.
As such, it can be said based on these highly unscientific tests that using the fastest available method for string concatenation, Java is roughly an order of magnitude faster than Python.
But I still <3 Python ;)
