How to use Streams instead of a loop that contains a variable? - java

I developed this code to check whether a string made up of parentheses is balanced. For instance,
Not balanced: ["(", ")", "((", "))", "()(", ")(", ...].
Balanced: ["()", "()()", "(())", "(())()", ...]
I want to use a stream instead of a for-loop. There is a conditional statement inside the loop that checks whether the counter variable is less than zero. I am unclear how to include this variable in a stream.
I would welcome feedback on this. Please see my code below:
public String findBalance(String input) {
    String[] arr = input.split("");
    int counter = 0;
    for (String s : arr) {
        counter += s.equals("(") ? 1 : -1;
        if (counter < 0) break;
    }
    return counter == 0 ? "Balanced" : "Not Balanced";
}
Thanks in advance.

Streaming isn't a good fit. Ask yourself what would happen if you used parallelStream(). How would you handle a simple edge case?
)(
You want to detect when the count dips below 0, even if it later goes back up. It's quite difficult to do that with a parallel stream. The string is best processed sequentially.
Stream operations work best when they are independent and stateless. Stateful actions are better suited for a regular for loop like you already have.

Let's see whether a Stream could fit.
Processing sequentially, character by character, one would maintain a nesting counter; a reduction would be feasible. However, the code is not much better: more verbose, and slower. Exploiting parallelism would require handling fragments like "...(" and ")...". Possible, but not nice.
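Here is a minimal sketch of that sequential reduction (my own illustration, not from the question). The accumulator is the running counter, pinned to Integer.MIN_VALUE once it dips below zero so the failure is remembered; the operator is not associative, so this is only correct on a sequential stream:
public String findBalance(String input) {
    int result = input.chars()
            .map(c -> c == '(' ? 1 : -1)
            .reduce(0, (count, delta) -> {
                int next = count + delta;
                // once negative, stay negative: ")(" must not recover
                return (count < 0 || next < 0) ? Integer.MIN_VALUE : next;
            });
    return result == 0 ? "Balanced" : "Not Balanced";
}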
Let's look at a more parallel-friendly approach: substituting the inner reducible expression (the redex "()") repeatedly:
public String findBalance(String input) {
    String s = input.replaceAll("[^()]", "");
    for (;;) {
        String t = s.replace("()", "");
        if (t.length() == s.length()) {
            break;
        }
        s = t;
    }
    return s.isEmpty() ? "Balanced" : "Not Balanced";
}
Though divide-and-conquer parallelism is feasible here, the problem is the repetition needed for inputs like (()).
For a Stream one might think of flatMap or the like. However, Stream processing steps do not repeat (in the current Java version), hence this is not easily feasible.
So a Stream solution is not a good fit.
One could combine the replacement loop with a Stream that splits the passed string, in order to use a parallel Stream, but even with 16 cores and strings of length 1000 that will not be considerably faster.


How is it that when I call a method 10000 times it throws out of memory error?

Situation
I was tasked with implementing a string anagram problem in a live coding interview. The problem: given two strings, code the logic for the method boolean isAnagram(String str1, String str2).
Solution
I presented the following solution (the mergeSort is an implementation of my own, and containsChar uses a binary search that is also my own implementation):
public static boolean isAnagram(String value, String valueToCompare) {
    String temp = valueToCompare.replaceAll("'", "").replaceAll(" ", "").toLowerCase();
    String t = value.replaceAll("'", "").replaceAll(" ", "").toLowerCase();
    if (t.length() == temp.length()) {
        char[] c = t.toCharArray();
        char[] orderedChars = MergeSort.mergeSort(temp.toCharArray());
        for (int i = 0; i < orderedChars.length; i++) {
            if (!containsChar(orderedChars, c[i], 0, orderedChars.length - 1))
                return false;
        }
        return true;
    }
    return false;
}
The efficiency of the solution is beside the point; I'm more concerned with what is happening in the background.
Question
Once I presented the solution, the interviewer asked me:
Let's suppose I have a computer with significantly low memory, and I want to run this algorithm 10,000 times with random strings of size between 1,000 and 10,000. What would happen to your code?
I didn't know what to answer, so he told me that I would get an OutOfMemoryError. I know (or at least think) that it is because of the efficiency of the algorithm that I would get such an exception.
So my questions are:
Why would it throw an OutOfMemoryError?
If I call that method 1000 times, is it because one call takes too long to complete that such an exception is thrown?
What is happening in the background when I call that method x amount of times?
Let's be clear on this.
The Interviewer has asked you a hypothetical question
The Interviewer did not specify the conditions properly (more later)
The Interviewer has asserted that something will happen ... with no proof, and no way to validate that assertion.
Let's suppose I have a computer with significantly low memory ... so he told me that I would get an OutOfMemoryError exception.
I think the Interviewer was probably wrong.
First of all, your code has no obvious memory leaks. Not that I can see, and not that other commenters can see.
Your solution code does generate a few temporary objects on each call. I count up to 6 temporary strings and 1 or 2 temporary arrays, plus potentially other temporary objects created by some library methods. You could probably reduce that ... if it were worth expending the developer time on the optimization.
But temporary objects by themselves should not lead to an OOME. Modern Oracle / OpenJDK garbage collectors are really good at collecting short term objects.
Except in a couple of pathological scenarios:
Scenario #1.
Suppose that you were already on the cusp of running out of memory. For instance, suppose that before you started the 1000 method calls, you only had a small amount of free (eden) space after running a full GC.
For your task to complete, it will generate in the order of 1000 times x 10 objects x 10,000 bytes of temporary space. That is about 100MB.
If you have 10MB of Eden space free that means that you will need to do roughly 10 Eden space collections in a short period of time.
If you have 1MB of Eden space free that means that you will need to do roughly 100 Eden space collections in a short period of time.
10 Eden space collections back to back could be sufficient to cause an OOME ("GC overhead limit exceeded"). With 100, it is much more likely.
But the bottom line is that if you are running close enough to a full heap, any piece of code that allocates an object could be the final straw. The real problem is that your heap is too small for the task ... or something else is creating / retaining too many long term objects.
Scenario #2.
Suppose that your application has stringent latency requirements. To implement this you configure the JVM to use a low-pause collector, and you set some really aggressive latency goals for the collector. And you don't have a lot of memory either.
Now if your application generates too much garbage too fast, a low-pause collector may not be able to keep up. If you push it beyond the limit, the GC will fall back to doing a stop-the-world collection to try to recover. You might get an OOME ... though I doubt it. But you will certainly fail to meet your latency goals.
But the bottom line is that if you have an application with requirements like this, it is essential that you run it on a machine with sufficient resources; i.e. enough spare memory, and enough cores that a (parallel) GC can keep up. You could possibly design your isAnagram method to be (erm) a bit more careful in the way it creates temporary objects ... but you would know up front that you needed to do that.
Recap
Returning to the question posed by your Interviewer (as relayed by you):
The Interviewer doesn't say how much free heap space there is, so we can't say if Scenario #1 would apply. But if it did, the real problem would be either a mismatch between the heap size and the problem, or a memory leak somewhere else in the application.
The Interviewer doesn't mention latency constraints. Even if they existed, the first step would be to spec the hardware and use appropriate (i.e. realistic) JVM GC settings.
If you did run into problems (OOMEs, missed latency goals), then you start looking for solutions. Use memory profiling to identify the nature of the problem (e.g. is it caused by temp objects, long term objects, a memory leak, etc) and to track down the source of the problematic objects.
Don't just assume a particular bit of code will cause OOMEs ... as the Interviewer is doing. Premature optimization is a bad idea.
Make It Work. Make It Right. Make It Fast.
It's way too early to think about performance or memory usage. Your method returns false positives, since it only checks that every letter in the first word is included in the second word.
With this check, 'aaa' and 'abc' are considered to be anagrams, but not 'abc' and 'aaa'.
Here's a complete class to test your code:
import java.util.Arrays;

public class AnagramTest
{
    public static void main(String[] args) {
        String[][] anagrams = {
            { "abc", "cba" },
            { "ABC", "CAB" },
            { "Clint Eastwood", "Old West action" }
        };
        for (String[] words : anagrams) {
            if (isAnagram(words[0], words[1])) {
                System.out.println(".");
            } else {
                System.out.println(
                        "OH NO! '" + words[0] + "' and '" + words[1] + "' are anagrams but isAnagram returned false.");
            }
        }
        String[][] notAnagrams = {
            { "hello", "world" },
            { "aabb", "aab" },
            { "abc", "aaa" },
            { "aaa", "abc" },
            { "aab", "bba" },
            { "aab", "bba" },
        };
        for (String[] words : notAnagrams) {
            if (isAnagram(words[0], words[1])) {
                System.out.println(
                        "OH NO! '" + words[0] + "' and '" + words[1] + "' are not anagrams but isAnagram returned true.");
            } else {
                System.out.println(".");
            }
        }
    }

    public static boolean isAnagram(String value, String valueToCompare) {
        String temp = valueToCompare.replaceAll("'", "").replaceAll(" ", "").toLowerCase();
        String t = value.replaceAll("'", "").replaceAll(" ", "").toLowerCase();
        if (t.length() == temp.length()) {
            char[] c = t.toCharArray();
            char[] orderedChars = mergeSort(temp.toCharArray());
            for (int i = 0; i < orderedChars.length; i++) {
                if (!containsChar(orderedChars, c[i], 0, orderedChars.length - 1))
                    return false;
            }
            return true;
        }
        return false;
    }

    // Dummy method. Warning: sorts chars in place.
    private static char[] mergeSort(char[] chars) {
        Arrays.sort(chars);
        return chars;
    }

    // Replace with your binary search if you want.
    private static boolean containsChar(char[] orderedChars, char c, int m, int n) {
        for (int i = m; i <= n; i++) {
            if (orderedChars[i] == c) {
                return true;
            }
        }
        return false;
    }
}
It outputs:
.
.
.
.
.
.
OH NO! 'aaa' and 'abc' are not anagrams but isAnagram returned true.
OH NO! 'aab' and 'bba' are not anagrams but isAnagram returned true.
OH NO! 'aab' and 'bba' are not anagrams but isAnagram returned true.
Here's an example implementation which should pass all the tests:
public static boolean isAnagram(String word1, String word2) {
    word1 = word1.replaceAll("'", "").replaceAll(" ", "").toLowerCase();
    word2 = word2.replaceAll("'", "").replaceAll(" ", "").toLowerCase();
    return Arrays.equals(mergeSort(word1.toCharArray()), mergeSort(word2.toCharArray()));
}
My best guess:
You have a problem in your MergeSort, which you haven't shown us;
It doesn't happen on every input, so the interviewer wants you to run it 10000 times with random inputs to make it happen with high probability;
The problem can cause your merge sort to recurse much too deeply. Maybe O(N) instead of O(log N) depth, or maybe an infinite recursion; and
Your merge sort unnecessarily allocates a new temporary array in every recursive call. Since there are way too many of them, this results in an out of memory error.
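For contrast, here is a hedged sketch (the question's MergeSort was never shown, so this is an assumption about its shape) of a merge sort that allocates a single scratch buffer up front rather than a new temporary array in every recursive call:
public final class MergeSort {
    public static char[] mergeSort(char[] a) {
        char[] scratch = new char[a.length]; // allocated once, reused by every merge
        sort(a, scratch, 0, a.length - 1);
        return a;
    }

    private static void sort(char[] a, char[] scratch, int lo, int hi) {
        if (lo >= hi) return;
        int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
        sort(a, scratch, lo, mid);
        sort(a, scratch, mid + 1, hi);
        merge(a, scratch, lo, mid, hi);
    }

    private static void merge(char[] a, char[] scratch, int lo, int mid, int hi) {
        System.arraycopy(a, lo, scratch, lo, hi - lo + 1);
        int i = lo, j = mid + 1, k = lo;
        while (i <= mid && j <= hi)
            a[k++] = scratch[i] <= scratch[j] ? scratch[i++] : scratch[j++];
        while (i <= mid)
            a[k++] = scratch[i++]; // right-half leftovers are already in place
    }
}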

Better way to check a String?

I have code that checks a string for spaces, commas, etc. I will be dealing with a scenario where my app is going to check, let's say, thousands of strings with a maximum length of 15 and a minimum length of 14. I am worried about whether this will affect performance, since it is on Android. Here is the code I used:
private final static char[] undefinedChars = {' ','/','.','<','>','*','!'};

public static boolean checkMessage(String message) {
    if (message == null)
        return false;
    char[] _message = message.toCharArray();
    for (char c : _message) {
        for (int i = 0;i > undefinedChars.length;i++)
            if (c == undefinedChars[i])
                return true;
    }
    return false;
}
Is this correct, or is there a way to improve it?
There is a change that you could make that might make a little difference:
Change
char[] _message = message.toCharArray();
for (char c : _message) {
to
for (int i = 0; i < message.length(); i++) {
char c = message.charAt(i);
However, I doubt that it will be significant.
Replacing the inner loop with a switch is more likely to be fruitful, though it depends on what the JIT compiler does with the code. (And a switch only works if the set of undefined characters can be hard-wired into the switch statement as compile-time constants.)
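A minimal sketch of that switch version (my own illustration, with the question's characters hard-wired):
public static boolean checkMessage(String message) {
    if (message == null)
        return false;
    for (int i = 0; i < message.length(); i++) {
        switch (message.charAt(i)) {
            case ' ': case '/': case '.': case '<':
            case '>': case '*': case '!':
                return true;
            default:
                break; // not an undefined character; keep scanning
        }
    }
    return false;
}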
I am worried if it will affect the performance since it is in android.
Don't "worry". Approach the problem scientifically.
Implement the code and then benchmark it.
If the measured performance is a concern, then:
profile the code
look at hotspots, and identify possible improvements
implement and test possible improvement
rerun the benchmark to see if the improvement actually made any difference
repeat ... until performance is good enough or you run out of options.
The other thing to note is that the same code could well perform differently across different Android platforms. The quality of JIT compilers has (apparently) improved markedly in more recent releases.
I would argue that it is a bad idea to "bend" your code just to get it to run well on old phones. The chances are that the user will upgrade their hardware soon anyway ... and it is conceivable that your optimization for the old platform actually makes your code slower on a new platform ... 'cos your hand-optimizations have made the code too tricky for the JIT compiler's optimizer to deal with.
This is also an argument for NOT trying to make your code go "as fast as possible" ...
First of all, I see a bug there.
for (int i = 0;i > undefinedChars.length;i++)
that I think you meant
for (int i = 0;i < undefinedChars.length;i++)
instead?
Anyway, it seems that your algorithm runs in O(m*n), where m is the length of the message and n is the number of undefined chars (here a fixed size of 7). Therefore it should be efficient from a run-time analysis perspective.
I would profile the scenario first and then decide how to improve it. For instance, if the message had already been sorted somewhere upfront, you could check only the first or the last character of the string; but as I said, only if it has been sorted elsewhere.
Or maybe think of parallelizing the routine. It should be straightforward.
Without using memory, you're about as fast as you can get. You can trade memory for performance. For example, you can put the characters you want to check into a HashSet. Then you can loop over the string you're checking and test whether each character is in that set or not. If the number of characters you want to check for is small, this will be less efficient. If the number is big, it will be more efficient (technically this algorithm is O(n) instead of O(n*m), but if m is small then the constants you're usually taught to ignore will matter).
Another way is to use an array of booleans, with each possible character in the string mapping to an index in that array. Set only the characters you care about to true (and save that array). Then you can avoid the hash calculation above, but at the cost of a lot of memory.
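A hedged sketch of that lookup-table idea, spending about 64 KB on one flag per possible char value:
private static final boolean[] UNDEFINED = new boolean[Character.MAX_VALUE + 1];
static {
    for (char c : new char[] {' ', '/', '.', '<', '>', '*', '!'}) {
        UNDEFINED[c] = true; // flag only the characters we care about
    }
}

public static boolean checkMessage(String message) {
    if (message == null)
        return false;
    for (int i = 0; i < message.length(); i++) {
        if (UNDEFINED[message.charAt(i)]) // no hashing, just an array index
            return true;
    }
    return false;
}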
Really, your original algorithm is likely good enough. But these (especially the hash set) are things you can consider if needed.
Try using a regular expression. I find it very clean and it should not hurt your performance.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static boolean checkMessage(String message)
{
    if (message == null)
        return false;
    String regex = " |\\.|/|<|>|\\*|!";
    Matcher matcher = Pattern.compile(regex).matcher(message);
    return matcher.find();
}
For symmetry and possibly some compiler optimization, why not use a for-each style loop for both loops. As an additional benefit, you wouldn't risk a typo like the one pointed out by glaze. Your code would then become:
private final static char[] undefinedChars = {' ','/','.','<','>','*','!'};

public static boolean checkMessage(String message) {
    if (message == null)
        return false;
    char[] _message = message.toCharArray();
    for (char c : _message) {
        for (char u : undefinedChars)
            if (c == u)
                return true;
    }
    return false;
}
An additional optimization would be to order the characters in undefinedChars in the order most likely to occur. That way you'll bail-out as quick as possible.
Use a Set to hold your undefinedChars
Set<Character> undefinedChars = new HashSet<Character>(Arrays.asList(' ', '/', '.'));

public boolean hasUndefinedChar(String str) {
    for (int i = 0; i < str.length(); i++) {
        if (undefinedChars.contains(str.charAt(i))) { // autoboxed to Character
            return true;
        }
    }
    return false;
}
This method is O(n) time and does not significantly affect space complexity. The contains calls on the Set are O(1) operations, and you make n of these calls in the worst case.

Best way to modify an existing string? StringBuilder or convert to char array and back to string?

I'm learning Java and am wondering what's the best way to modify strings here (both for performance and to learn the preferred method in Java). Assume you're looping through a string and checking each character/performing some action on that index in the string.
Do I use the StringBuilder class, or convert the string into a char array, make my modifications, and then convert the char array back to a string?
Example for StringBuilder:
StringBuilder newString = new StringBuilder(oldString);
for (int i = 0; i < oldString.length(); i++) {
    newString.setCharAt(i, 'X');
}
Example for char array conversion:
char[] newStringArray = oldString.toCharArray();
for (int i = 0; i < oldString.length(); i++) {
    newStringArray[i] = 'X';
}
String myString = String.valueOf(newStringArray);
What are the pros/cons to each different way?
I take it that StringBuilder is going to be more efficient, since converting to a char array makes copies of the array each time you update an index.
I say do whatever is most readable/maintainable until you know that String "modification" is slowing you down. To me, this is the most readable:
String s = "foo";
s += "bar";
s += "baz";
If that's too slow, I'd use a StringBuilder. You may want to compare this to StringBuffer. If performance matters and synchronization does not, StringBuilder should be faster. If synchronization is needed, then you should use StringBuffer.
Also, it's important to know that these strings are not actually being modified: in Java, Strings are immutable.
This is all context specific. If you optimize this code and it doesn't make a noticeable difference (and this is usually the case), then you just thought longer than you had to and you probably made your code more difficult to understand. Optimize when you need to, not because you can. And before you do that, make sure the code you're optimizing is the cause of your performance issue.
What are the pros/cons to each different way. I take it that StringBuilder is going to be more efficient since converting to a char array makes copies of the array each time you update an index.
As written, the code in your second example will create just two arrays: one when you call toCharArray(), and another when you call String.valueOf() (String stores data in a char[] array). The element manipulations you are performing should not trigger any object allocations. There are no copies being made of the array when you read or write an element.
If you are going to be doing any sort of String manipulation, the recommended practice is to use a StringBuilder. If you are writing very performance-sensitive code, and your transformation does not alter the length of the string, then it might be worthwhile to manipulate the array directly. But since you are learning Java as a new language, I am going to guess that you are not working in high frequency trading or any other environment where latency is critical. Therefore, you are probably better off using a StringBuilder.
If you are performing any transformations that might yield a string of a different length than the original, you should almost certainly use a StringBuilder; it will resize its internal buffer as necessary.
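For instance, here is a small sketch of my own of a length-changing transformation (doubling every vowel) where StringBuilder's automatic resizing pays off:
static String doubleVowels(String s) {
    StringBuilder sb = new StringBuilder(s.length()); // grows as needed
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        sb.append(c);
        if ("aeiou".indexOf(c) >= 0) {
            sb.append(c); // output is longer than the input
        }
    }
    return sb.toString();
}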
On a related note, if you are doing simple string concatenation (e.g., s = "a" + someObject + "c"), the compiler will actually transform those operations into a chain of StringBuilder.append() calls, so you are free to use whichever you find more aesthetically pleasing. I personally prefer the + operator. However, if you are building up a string across multiple statements, you should create a single StringBuilder.
For example:
public String toString() {
    return "{field1 =" + this.field1 +
           ", field2 =" + this.field2 +
           ...
           ", field50 =" + this.field50 + "}";
}
Here, we have a single, long expression involving many concatenations. You don't need to worry about hand-optimizing this, because the compiler will use a single StringBuilder and just call append() on it repeatedly.
String s = ...;
if (someCondition) {
    s += someValue;
}
s += additionalValue;
return s;
Here, you'll end up with two StringBuilders being created under the covers, but unless this is an extremely hot code path in a latency-critical application, it's really not worth fretting about. Given similar code, but with many more separate concatenations, it might be worth optimizing. Same goes if you know the strings might be very large. But don't just guess--measure! Demonstrate that there's a performance problem before you try to fix it. (Note: this is just a general rule for "micro optimizations"; there's rarely a downside to explicitly using a StringBuilder. But don't assume it will make a measurable difference: if you're concerned about it, you should actually measure.)
String s = "";
for (final Object item : items) {
s += item + "\n";
}
Here, we're performing a separate concatenation operation on each loop iteration, which means a new StringBuilder will be allocated on each pass. In this case, it's probably worth using a single StringBuilder since you may not know how large the collection will be. I would consider this an exception to the "prove there's a performance problem before optimizing rule": if the operation has the potential to explode in complexity based on input, err on the side of caution.
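A minimal sketch of that loop rewritten with a single explicit StringBuilder (names carried over from the snippet above):
StringBuilder sb = new StringBuilder();
for (final Object item : items) {
    sb.append(item).append('\n'); // one builder for the whole loop
}
String s = sb.toString();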
Which option will perform the best is not an easy question.
I did a benchmark using Caliper:
RUNTIME (NS)
array 88
builder 126
builderTillEnd 76
concat 3435
Benchmarked methods:
public static String array(String input)
{
    char[] result = input.toCharArray(); // COPYING
    for (int i = 0; i < input.length(); i++)
    {
        result[i] = 'X';
    }
    return String.valueOf(result); // COPYING
}

public static String builder(String input)
{
    StringBuilder result = new StringBuilder(input); // COPYING
    for (int i = 0; i < input.length(); i++)
    {
        result.setCharAt(i, 'X');
    }
    return result.toString(); // COPYING
}

public static StringBuilder builderTillEnd(String input)
{
    StringBuilder result = new StringBuilder(input); // COPYING
    for (int i = 0; i < input.length(); i++)
    {
        result.setCharAt(i, 'X');
    }
    return result;
}

public static String concat(String input)
{
    String result = "";
    for (int i = 0; i < input.length(); i++)
    {
        result += 'X'; // terrible COPYING, COPYING, COPYING... same as:
                       // result = new StringBuilder(result).append('X').toString();
    }
    return result;
}
Remarks
If we want to modify a String, we have to do at least 1 copy of that input String, because Strings in Java are immutable.
java.lang.StringBuilder extends java.lang.AbstractStringBuilder. StringBuilder.setCharAt() is inherited from AbstractStringBuilder and looks like this:
public void setCharAt(int index, char ch) {
    if ((index < 0) || (index >= count))
        throw new StringIndexOutOfBoundsException(index);
    value[index] = ch;
}
AbstractStringBuilder internally uses a plain char array: char value[]. So result[i] = 'X' is very similar to result.setCharAt(i, 'X'); however, the second makes a polymorphic method call (which probably gets inlined by the JVM) and checks bounds in an if, so it will be a bit slower.
Conclusions
If you can operate on StringBuilder until the end (you don't need String back) - do it. It's the preferred way and also the fastest. Simply the best.
If you want a String in the end and this is the bottleneck of your program, then you might consider using a char array. In the benchmark, the char array was ~25% faster than StringBuilder. Be sure to properly measure the execution time of your program before and after the optimization, because there is no guarantee about this 25%.
Never concatenate Strings in a loop with + or +=, unless you really know what you are doing. Usually it's better to use an explicit StringBuilder and append().
I'd prefer to use the StringBuilder class where the original string is modified.
For String manipulation, I like the StringUtils class. You'll need the Apache Commons Lang dependency to use it.

Java indexOf function more efficient than Rabin-Karp? Search Efficiency of Text

I posed a question to Stack Overflow a few weeks ago about creating an efficient algorithm to search for a pattern in a large chunk of text. Right now I am using the String function indexOf to do the search. One suggestion was to use Rabin-Karp as an alternative. I wrote the following little test program to test an implementation of Rabin-Karp:
public static void main(String[] args) {
    String test = "Mary had a little lamb whose fleece was white as snow";
    String p = "was";
    long start = Calendar.getInstance().getTimeInMillis();
    for (int x = 0; x < 200000; x++)
        test.indexOf(p);
    long end = Calendar.getInstance().getTimeInMillis();
    end = end - start;
    System.out.println("Standard Java Time->" + end);
    RabinKarp searcher = new RabinKarp("was");
    start = Calendar.getInstance().getTimeInMillis();
    for (int x = 0; x < 200000; x++)
        searcher.search(test);
    end = Calendar.getInstance().getTimeInMillis();
    end = end - start;
    System.out.println("Rabin Karp time->" + end);
}
And here is the implementation of Rabin-Karp that I am using:
import java.math.BigInteger;
import java.util.Random;

public class RabinKarp {
    private String pat;   // the pattern // needed only for Las Vegas
    private long patHash; // pattern hash value
    private int M;        // pattern length
    private long Q;       // a large prime, small enough to avoid long overflow
    private int R;        // radix
    private long RM;      // R^(M-1) % Q
    static private long dochash = -1L;

    public RabinKarp(int R, char[] pattern) {
        throw new RuntimeException("Operation not supported yet");
    }

    public RabinKarp(String pat) {
        this.pat = pat; // save pattern (needed only for Las Vegas)
        R = 256;
        M = pat.length();
        Q = longRandomPrime();
        // precompute R^(M-1) % Q for use in removing leading digit
        RM = 1;
        for (int i = 1; i <= M - 1; i++)
            RM = (R * RM) % Q;
        patHash = hash(pat, M);
    }

    // Compute hash for key[0..M-1].
    private long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // Las Vegas version: does pat[] match txt[i..i+M-1]?
    private boolean check(String txt, int i) {
        for (int j = 0; j < M; j++)
            if (pat.charAt(j) != txt.charAt(i + j))
                return false;
        return true;
    }

    // check for exact match
    public int search(String txt) {
        int N = txt.length();
        if (N < M)
            return -1;
        long txtHash;
        if (dochash == -1L) {
            txtHash = hash(txt, M);
            dochash = txtHash;
        } else
            txtHash = dochash;
        // check for match at offset 0
        if ((patHash == txtHash) && check(txt, 0))
            return 0;
        // check for hash match; if hash match, check for exact match
        for (int i = M; i < N; i++) {
            // Remove leading digit, add trailing digit, check for match.
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            // match
            int offset = i - M + 1;
            if ((patHash == txtHash) && check(txt, offset))
                return offset;
        }
        // no match
        return -1; // was N
    }

    // a random 31-bit prime
    private static long longRandomPrime() {
        BigInteger prime = new BigInteger(31, new Random());
        return prime.longValue();
    }

    // test client
}
The implementation of Rabin-Karp works, in that it returns the correct offset of the string I am looking for. What is surprising to me, though, are the timing statistics that resulted when I ran the test program. Here they are:
Standard Java Time->39
Rabin Karp time->409
This was really surprising. Not only is Rabin-Karp (at least as it is implemented here) not faster than the standard Java indexOf String function, it is slower by an order of magnitude. I don't know what is wrong (if anything). Anyone have thoughts on this?
Thanks,
Elliott
I answered this question earlier and Elliott pointed out I was just plain wrong. I apologise to the community.
There is nothing magical about the String.indexOf code. It is not natively optimised or anything like that. You can copy the indexOf method from the String source code and it runs just as quickly.
What we have here is the difference between O() efficiency and actual efficiency. For a String of length N and a pattern of length M, Rabin-Karp is O(N+M) with a worst case of O(NM). When you look into it, String.indexOf() also has a best case of O(N+M) and a worst case of O(NM).
If the text contains many partial matches to the start of the pattern, Rabin-Karp will stay close to its best-case performance, whilst String.indexOf will not. For example, I tested the above code (properly this time :-)) on a million '0's followed by a single '1', and searched for 1000 '0's followed by a single '1'. This forced String.indexOf to its worst-case performance. For this highly degenerate test, the Rabin-Karp algorithm was about 15 times faster than indexOf.
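For reference, a snippet of my own that reconstructs that degenerate input (String.repeat() needs Java 11+):
public class WorstCaseInput {
    public static void main(String[] args) {
        String text = "0".repeat(1_000_000) + "1"; // a million '0's then a '1'
        String pattern = "0".repeat(1_000) + "1";  // 1000 '0's then a '1'
        // indexOf is forced toward its O(N*M) worst case on this input,
        // while Rabin-Karp's rolling hash stays near O(N+M)
        System.out.println(text.indexOf(pattern));
    }
}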
For natural language text, Rabin-Karp will remain close to best-case and indexOf will only deteriorate slightly. The deciding factor is therefore the complexity of operations performed on each step.
In its innermost loop, indexOf scans for a matching first character. At each iteration it has to:
increment the loop counter
perform two logical tests
do one array access
In Rabin-Karp each iteration has to:
increment the loop counter
perform two logical tests
do two array accesses (actually two method invocations)
update a hash, which above requires 9 numerical operations
Therefore at each iteration Rabin-Karp will fall further and further behind. I tried simplifying the hash algorithm to just XOR characters, but I still had an extra array access and two extra numerical operations so it was still slower.
Furthermore, when a match is found, Rabin-Karp only knows the hashes match and must therefore test every character, whereas indexOf already knows the first character matches and therefore has one less test to do.
Having read on Wikipedia that Rabin-Karp is used to detect plagiarism, I took the Bible's Book of Ruth, removed all punctuation and made everything lower case, which left just under 10000 characters. I then searched for "andthewomenherneighboursgaveitaname", which occurs near the very end of the text. String.indexOf was still faster, even with just the XOR hash. However, if I removed String.indexOf's advantage of being able to access String's private internal character array and forced it to copy the character array, then, finally, Rabin-Karp was genuinely faster.
However, I deliberately chose that text as there are 213 "and"s in the Book of Ruth and 28 "andthe"s. If instead I searched just for the last characters "ursgaveitaname", well there are only 3 "urs"s in the text so indexOf returns closer to its best-case and wins the race again.
As a fairer test I chose random 20-character strings from the 2nd half of the text and timed them. Rabin-Karp was about 20% slower than the indexOf algorithm run outside of the String class, and 70% slower than the actual indexOf algorithm. Thus, even in the use case it is supposedly appropriate for, it was still not the best choice.
So what good is Rabin-Karp? No matter the length or nature of the text to be searched, at every character compared it will be slower. No matter what hash function we choose, we are surely required to make an additional array access and at least two numerical operations. A more complex hash function will give us fewer false matches, but require more numerical operations. There is simply no way Rabin-Karp can ever keep up.
As demonstrated above, if we need to find a match prefixed by an often-repeated block of text, indexOf can be slower; but if we know we are doing that, it seems we would still be better off using indexOf to search for the text without the prefix and then checking whether the prefix is present.
Based on my investigations today, I cannot see any time when the additional complexity of Rabin Karp will pay off.
Here is the source to java.lang.String. indexOf is line 1770.
My suspicion is that since you are using it on such a short input string, the extra overhead of the Rabin-Karp algorithm over the seemingly naive implementation of java.lang.String's indexOf means you aren't seeing the true performance of the algorithm. I would suggest trying it on a much longer input string to compare performance.
From my understanding, Rabin-Karp is best used when searching a block of text for multiple words/phrases.
Think about a bad word search, for flagging abusive language.
If you have a list of 2000 words, including derivations, then you would need to call indexOf 2000 times, once for each word you are trying to find.
RabinKarp helps with this by doing the search the other way around.
Make a 4 character hash of each of the 2000 words, and put that into a dictionary with a fast lookup.
Now, for each 4 characters of the search text, hash and check against the dictionary.
As you can see, the search is now the other way around - we're searching the 2000 words for a possible match instead.
Then we get the string from the dictionary and do an equals to check to be sure.
It's also a faster search this way, because we're searching a dictionary instead of string matching.
Now, imagine the WORST case scenario of doing all those indexOf searches - the very LAST word we check is a match ...
The Wikipedia article for Rabin-Karp even mentions its inferiority in the situation you describe. ;-)
http://en.wikipedia.org/wiki/Rabin-Karp_algorithm
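As a rough illustration of that scheme (my own sketch; a real Rabin-Karp would use a rolling hash instead of allocating a substring per position), index the word list by a short prefix and probe it once per text position:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MultiWordSearch {
    // find which of the given words (each at least 4 chars here) occur in text
    public static Set<String> findAny(String text, Set<String> words) {
        Map<String, List<String>> byPrefix = new HashMap<>();
        for (String w : words) {
            byPrefix.computeIfAbsent(w.substring(0, 4), k -> new ArrayList<>()).add(w);
        }
        Set<String> found = new HashSet<>();
        for (int i = 0; i + 4 <= text.length(); i++) {
            List<String> candidates = byPrefix.get(text.substring(i, i + 4));
            if (candidates == null) continue; // no word starts like this
            for (String w : candidates) {
                if (text.startsWith(w, i)) { // confirm the full word
                    found.add(w);
                }
            }
        }
        return found;
    }
}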
But this is only natural!
Your test input, first of all, is too trivial.
indexOf returns the index of "was" by searching a small buffer (the String's internal char array), while Rabin-Karp has to do preprocessing to set up its data, which takes extra time.
To see a difference, you would have to search a really large text for expressions.
Also, please note that more sophisticated string search algorithms can have "expensive" setup/preprocessing in order to provide the really fast search.
In your case you just search for "was" in a sentence. In any case, you should always take the input into account.
Without looking into details, two reasons come to my mind:
you are very likely to outperform standard API implementations only in very special cases. I do not consider "Mary had a little lamb whose fleece was white as snow" to be one.
microbenchmarking is very difficult and can give quite misleading results. Here is an explanation, and here is a list of tools you could use.
Don't just try a longer static string; try generating random long strings and inserting the search target at a random location each time. Without randomizing it, you will see a fixed result for indexOf.
EDITED:
Random is the wrong concept. Most text is not truly random. But you would need a lot of different long strings to be effective, not just the same String tested multiple times. I am sure there are ways to extract "random" large Strings from an even larger text source, or something like that.
For this kind of search, Knuth-Morris-Pratt may perform better. In particular, if the sub-string doesn't just repeat characters, then KMP should outperform indexOf(). In the worst case (a string of all the same characters) it will be the same.
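For completeness, a hedged sketch of Knuth-Morris-Pratt (standard textbook form, not tuned): the failure table lets the scan advance without re-examining text characters.
public class KMP {
    public static int indexOf(String text, String pattern) {
        int m = pattern.length();
        if (m == 0) return 0;
        // fail[i] = length of the longest proper prefix of the pattern
        // that is also a suffix of pattern[0..i]
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }
        // scan the text; k = number of pattern chars currently matched
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == m) return i - m + 1; // full match ends at i
        }
        return -1;
    }
}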

Compare first three characters of two strings

Strings s1 and s2 will always be of length 1 or higher.
How can I speed this up?
int l1 = s1.length();
if (l1 > 3) { l1 = 3; }
if (s2.startsWith(s1.substring(0, l1)))
{
    // do something..
}
Regex maybe?
Rewrite to avoid object creation
Your instincts were correct. The creation of new objects (substring()) is not very fast, and it means that each one created must incur GC overhead as well.
This might be a lot faster:
static boolean fastCmp(String s1, String s2) {
    return s1.regionMatches(0, s2, 0, 3);
}
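One caveat worth knowing (a usage note of my own): regionMatches returns false when the requested region runs past the end of either string, so for inputs shorter than 3 characters this behaves differently from the substring version, which compares only min(3, s1.length()) characters.
System.out.println(fastCmp("abcdef", "abcxyz")); // true: first three chars match
System.out.println(fastCmp("ab", "abc"));        // false: region runs past the end of s1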
This seems pretty reasonable. Is this really too slow for you? You sure it's not premature optimization?
if (s2.startsWith(s1.substring(0, Math.min(3, s1.length())))) { ... }
Btw, there is nothing slow in it. startsWith has complexity O(n)
Another option is to compare the char values, which might be more efficient:
boolean match = true;
for (int i = 0; i < Math.min(Math.min(s1.length(), 3), s2.length()); i++) {
    if (s1.charAt(i) != s2.charAt(i)) {
        match = false;
        break;
    }
}
My java isn't that good, so I'll give you an answer in C#:
int len = Math.Min(s1.Length, Math.Min(s2.Length, 3));
for (int i = 0; i < len; ++i)
{
    if (s1[i] != s2[i])
        return false;
}
return true;
Note that unlike yours and Bozho's, this does not create a new string, which would be the slowest part of your algorithm.
Perhaps you could do this
if (s1.length() >= 3 && s2.length() >= 3 && s1.indexOf(s2.substring(0, 3)) == 0)
{
    // do something..
}
There is context missing here:
What are you trying to scan for? What type of application? How often is it expected to run?
These things are important because different scenarios call for different solutions:
If this is a one-time scan then this is probably unneeded optimization. Even for a 20MB text file, it wouldn't take more than a couple of minutes in the worst case.
If you have a set of inputs and for each of them you're scanning all the words in a 20MB file, it might be better to sort/index the 20MB file to make it easy to look up matches and skip the 99% of unnecessary comparisons. Also, if inputs tend to repeat themselves it might make sense to employ caching.
Other solutions might also be relevant, depending on the actual problem.
But if you boil it down only to comparing the first 3 characters of two strings, I believe the code snippets given here are as good as you're going to get - they're all O(1)*, so there's no drastic optimization you can do.
*The only place where this might not hold true is if getting the length of the string is O(n) rather than O(1) (which is the case for the strlen function in C++), which is not the case for Java and C# string objects.
