'Substring' or 'if with length & min' - significantly better performance? - Java

Which piece of code has better performance?
Pseudocode:
1)
private String doSth(String s) {
    ...
    return s.substring(0, Math.min(s.length(), constparam));
}
2)
private String doSth(String s) {
    if (s.length() > constparam) {
        return s.substring(0, constparam);
    }
    return s;
}
In most cases (99%), s.length() < constparam.
This method is invoked 20-200 times per second.
Which solution would have better performance, and why?
Would the impact be significant?

Let's look at what each one does:
1) always finds the lower of two values and always calls substring, returning a new String.
2) always compares two values and only sometimes calls substring, so it only sometimes creates a new String.
So 2, because some of the time it will do less work and create fewer objects.

If s.length() < constparam in most cases, option 2 will be faster, since the substring() call can be skipped most of the time.

This is what Math.min does:
public static int min(int a, int b) {
    return (a <= b) ? a : b;
}
I would run a JMH test, and I would say that the two presented solutions won't show statistically significant differences.
If you are really concerned about performance, don't even use substring() (check its code: it has three ifs and creates a new String every time you call it); operate on char arrays instead.
And then again: in real life I don't think it will matter. Run the JMH test with some real parameters (string lengths and constant values). I think you'll see numbers that are good enough for almost every sane use case.
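If you want to measure it, a minimal JMH sketch along these lines would do (the class name, the @Param values and the constparam value below are placeholders, not taken from the question):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class SubstringBench {
    @Param({"10", "100"})        // one length below and one above the constant
    int length;
    int constparam = 50;
    String s;

    @Setup
    public void setup() {
        s = "x".repeat(length);  // Java 11+; build the string with a loop on older JDKs
    }

    @Benchmark
    public String withMin() {
        return s.substring(0, Math.min(s.length(), constparam));
    }

    @Benchmark
    public String withIf() {
        return s.length() > constparam ? s.substring(0, constparam) : s;
    }
}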

Calling substring(0, s.length()) returns the string unmodified.
The difference is the explicit check s.length() > constparam, but that is basically what Math.min does anyway.
So in my opinion there is almost no performance difference, assuming a substring() invocation takes much more time than the conditional or Math.min, and even without that assumption.
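This is easy to check directly; a small sketch (it relies on observable OpenJDK behavior, which the specification does not guarantee):

class SubstringIdentityDemo {
    public static void main(String[] args) {
        String s = "hello";
        // A full-range substring returns the same object on the OpenJDK builds I have seen
        // (an implementation detail, not a spec guarantee), so no copying happens here.
        System.out.println(s.substring(0, s.length()) == s);  // true on OpenJDK
        System.out.println(s.substring(0, 3) == s);           // false: a new String is created
    }
}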

The branching may cost something like a few nanoseconds. In both cases there is no needless char[] copying. The method call overhead is rather big, but it gets optimized out.
A few nanoseconds times 200 calls make a few microseconds per second at most. The difference between the two approaches is smaller still, so you may spend something like 0.0001% of the time on this. This is a very rough estimate, but even if I were off by a factor of a thousand, there would be no point in optimizing here.

Arrays.sort() vs sorting using map

I have a requirement where I have to loop through an array holding a list of strings:
String[] arr = {"abc", "cda", "cka", "snd"};
and match the string "bca", ignoring the order of the characters, which should return true as it's present in the array ("abc").
To solve this I have two approaches:
1) Use Arrays.sort() to sort both strings and then use Arrays.equals() to compare them.
2) Create two hashmaps, add the frequency of each letter of the string to them, and finally compare the two maps of chars using equals().
I read that the complexity of the Arrays.sort() method is higher, so I thought of working on the 2nd approach, but when I run both, the 1st approach takes much less time to execute.
Any suggestions why this is happening?
The time complexity only tells you how the approach will scale with (significantly) larger input. It doesn't tell you which approach is faster.
It's perfectly possible that a solution is faster for small input sizes (string lengths and/or array length) but scales badly for larger sizes due to its time complexity. It's even possible that you never encounter the point where an algorithm with a better time complexity becomes faster, because natural limits on the input sizes prevent it.
You didn’t show the code of your approaches, but it’s likely that your first approach calls a method like toCharArray() on the strings, followed by Arrays.sort(char[]). This implies that sort operates on primitive data.
In contrast, when your second approach uses a HashMap<Character,Integer> to record frequencies, it will be subject to boxing overhead, for the characters and the counts, and also use a significantly larger data structure that needs to be processed.
So it’s not surprising that the hash approach is slower for small strings and arrays, as it has a significantly larger fixed overhead and also a size dependent (O(n)) overhead.
So the first approach would have to suffer significantly from its O(n log n) time complexity to turn this result around. But that won't happen. That time complexity is a worst case for sorting in general. As explained in this answer, the algorithms specified in the documentation of Arrays.sort should not be taken for granted: when you call Arrays.sort(char[]) and the array size crosses a certain threshold, the implementation switches to Counting Sort, with O(n) time complexity (but temporarily more memory).
So even with large strings you won't suffer from a worse time complexity. In fact, Counting Sort shares similarities with the frequency map, but it is usually more efficient, as it avoids the boxing overhead by using an int[] array instead of a HashMap<Character,Integer>.
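For comparison, a frequency check built on a plain int[] (a sketch, assuming the strings contain only ASCII characters like the example data; the method name is made up here) avoids both the sorting and the boxing:

static boolean isAnagram(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }
    int[] counts = new int[128];           // one slot per ASCII character
    for (int i = 0; i < a.length(); i++) {
        counts[a.charAt(i)]++;             // count characters of the first string
        counts[b.charAt(i)]--;             // cancel them with the second string
    }
    for (int c : counts) {
        if (c != 0) {
            return false;                  // some character occurs a different number of times
        }
    }
    return true;
}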
Approach 1 will be O(N log N).
Approach 2 will be O(N*M), where M is the length of each string in your array.
You should search linearly in O(N):
for (String str : arr) {
    if (str.equals(target)) return true;
}
return false;
Let's decompose the problem:
You need a function to sort a string by its chars (bccabc -> abbccc) to be able to compare a given string with the existing ones.
Function<String, String> sortChars = s -> s.chars()
        .sorted()
        .mapToObj(i -> (char) i)
        .map(String::valueOf)
        .collect(Collectors.joining());
Instead of sorting the chars of the given strings anytime you compare them, you can precompute the set of unique tokens (values from your array, sorted chars):
Set<String> tokens = Arrays.stream(arr)
        .map(sortChars)
        .collect(Collectors.toSet());
This will result in the values "abc","acd","ack","dns".
Afterwards you can create a function which checks if a given string, when sorted by chars, matches any of the given tokens:
Predicate<String> match = s -> tokens.contains(sortChars.apply(s));
Now you can easily check any given string as follows:
boolean matches = match.test("bca");
Matching will only need to sort the given input and do a hash set lookup to check if it matches, so it's very efficient.
You can of course write the Function and Predicate as methods instead (String sortChars(String s) and boolean matches(String s)) if you're unfamiliar with functional programming.
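A method-based equivalent might look roughly like this (a sketch with the same logic; the wrapping class name is made up here):

import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

class AnagramLookup {
    private final Set<String> tokens;

    AnagramLookup(String[] arr) {
        // Precompute the set of char-sorted tokens, e.g. "cda" -> "acd".
        tokens = Arrays.stream(arr)
                .map(AnagramLookup::sortChars)
                .collect(Collectors.toSet());
    }

    static String sortChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    boolean matches(String s) {
        return tokens.contains(sortChars(s));  // one sort plus one hash lookup
    }
}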
More of an addendum to the other answers. Of course, your two options have different performance characteristics. But understand that performance is not necessarily the only factor in the decision!
Meaning: if you are talking about a search that runs hundreds or thousands of times per minute on large data sets, then for sure you should invest a lot of time in coming up with a solution that gives you the best performance. Most likely, that includes doing various experiments with actual measurements on real data. Time complexity is a theoretical construct; in the real world there are also elements such as CPU cache sizes, threading issues, IO bottlenecks, and whatnot that can have a significant impact on real numbers.
But when your code does its work just once a minute, even on a few dozen or hundred MB of data, then it might not be worth focusing on performance.
In other words: the "sort" solution sounds straightforward. It is easy to understand, easy to implement, and hard to get wrong (with some decent test cases). If that solution gets the job done "good enough", then consider using it: the simple solution.
Performance is a luxury problem. You only address it if there is a reason to.

How to analyze time complexity here?

Assume you are playing the following Flip Game with your friend: Given a string that contains only these two characters: + and -, you and your friend take turns to flip two consecutive "++" into "--". The game ends when a person can no longer make a move and therefore the other person will be the winner.
Write a function to determine if the starting player can guarantee a win.
For example, given s = "++++", return true. The starting player can guarantee a win by flipping the middle "++" to become "+--+".
Here is my code:
public boolean canWin(String s) {
    if (s == null || s.length() < 2) return false;
    char[] arr = s.toCharArray();
    return canWinHelper(arr);
}

public boolean canWinHelper(char[] arr) {
    for (int i = 0; i < arr.length - 1; i++) {
        if (arr[i] == '+' && arr[i + 1] == '+') {
            arr[i] = '-';
            arr[i + 1] = '-';
            boolean win = !canWinHelper(arr);
            arr[i] = '+';
            arr[i + 1] = '+';
            if (win) return true;
        }
    }
    return false;
}
It works, but I'm not sure how to calculate the time complexity here, since the function keeps calling itself until a false is returned. Can anyone share some ideas?
Also, during the search we will encounter duplicate computations, so I think I can use a HashMap to avoid those duplicates (key: String, value: Boolean).
My updated code using a hashmap:
public boolean canWin(String s) {
    if (s == null || s.length() < 2) return false;
    HashMap<String, Boolean> map = new HashMap<String, Boolean>();
    return helper(s, map);
}

public boolean helper(String s, HashMap<String, Boolean> map) {
    if (map.containsKey(s)) return map.get(s);
    for (int i = 0; i < s.length() - 1; i++) {
        if (s.charAt(i) == '+' && s.charAt(i + 1) == '+') {
            String flipped = s.substring(0, i) + "--" + s.substring(i + 2);
            if (!helper(flipped, map)) {
                map.put(s, true);
                return true;
            }
        }
    }
    map.put(s, false);
    return false;
}
Still, I want to know how to analyze the time and space complexity here.
Take n = arr.length - 1.
On the first pass you have n recursive calls. For each of them you have removed two +'s, so each will have at most n-2 recursive calls, and so on.
So you have at most n + n(n-2) + n(n-2)(n-4) + ... recursive calls.
In essence this is n!!*(1 + 1/2 + 1/(2*4) + 1/(2*4*6) + ...). Since 1 + 1/2 + 1/(2*4) + 1/(2*4*6) + ... is convergent (≤ 2), you have O(n!!).
Regarding memory, you have an array of length n for each recursive call, so you have n + n^2 + n^3 + ... (n/2 terms) ... + n^(n/2) = n*(n^(n/2) - 1)/(n - 1), and this is O(n^(n/2)).
This obviously points to performance not much better than an exhaustive search.
For the hashed improvement, you are asking for all the possible combinations your code manages to create. However, your code is not much different from code that would actually create all combinations, apart from the fact that you are replacing two +'s with two -'s, which reduces the complexity by some factor but not its order. Overall, the worst case is the same as the number of combinations of bits over n/2 locations, which is 2^(n/2). Observe that the hash function itself probably has some hidden log, so the total complexity would be O(2^(n/2)*ln(n/2)) for the search and O(2^(n/2)) for memory.
This is the worst-case scenario. However, if there are arrangements where you cannot win (when there is no winning strategy), this is really the complexity you need to count on.
The question of the average scenario is then a question of how many arrangements you can or cannot win and how they are distributed among all arrangements. That question has little to do with your algorithm and requires a totally different set of tools to solve.
After a few moments of checking whether the above reasoning is correct and to the point, I would be quite happy with the result, since it tells me all that I need to know. You cannot expect to be handed a favorable arrangement, and I really doubt that only 0.01% of arrangements are worst-case, so you need to be prepared for the worst case anyway; unless this is some special project, the back-of-the-envelope calculation is your friend.
Anyway, these kinds of calculations are there so that you can prepare test cases properly, not to produce a correct and final implementation. Using the tests you can find out what the hidden factors in O() really are, taking into account the compiler, memory consumption, paging and so on.
Still, not to leave it at that, we can always improve the back-of-the-envelope reasoning. For example, you do not actually get n-2 calls at each step, because it depends on the position you flip. For example, for ++++++++..., if you flip a pair in the middle, say obtaining +++--+++++..., it is obvious that you are going to have n-3, not n-2, recursive calls, or even n-4. So half of the calls may have n-3 recursive calls, which would give n/2*(n-3) + n/2*(n-2) = n*(n - 5/2).
Observe that since n! = n!!*(n-1)!!, we can take n!! ≈ √(n!); likewise n! = n!!!*(n-1)!!!*(n-2)!!!, or n!!! ≈ ∛(n!). This might lead to the conclusion that we should have something like O((n!)^(1/2)), and testing would tell how far the exponent in O((n!)^(1/x)) can be pushed from x = 2 toward x = 3.
(It is quite normal to look for the complexity in one particular form, here O((n!)^(1/x)) with 1 ≤ x ≤ 3, although it can be expressed differently. So I would continue with the complexity form O((n!)^(1/x)), 1 ≤ x ≤ 3.)

More efficient alternative to these "for" loops?

I'm taking an introductory course in Java, and one of my latest projects involves making sure an array doesn't contain any duplicate elements (has distinct elements). I used a for loop with an inner for loop, and it works, but I've heard that you should try to avoid using many iterations in a program (and other methods in my classes have a fair number of iterations as well). Is there any efficient alternative to this code? I'm not asking for code of course, just "concepts." Would there potentially be a recursive way to do this? Thanks!
The array sizes are generally <= 10.
/** Iterates through a String array ARRAY to see if each element in ARRAY is
 *  distinct. Returns false if ARRAY contains duplicates. */
boolean distinctElements(String[] array) { // Efficient?
    for (int i = 0; i < array.length; i += 1) {
        for (int j = i + 1; j < array.length; j += 1) {
            if (array[i] == array[j]) {
                return false;
            }
        }
    }
    return true;
}
"Efficiency" is almost always a trade-off. Occasionally, there are algorithms that are simply better than others, but often they are only better in certain circumstances.
For example, the code above has time complexity O(n^2).
One improvement might be to sort the strings: you can then find duplicates by checking whether an element is equal to its neighbour. The time complexity here is reduced to O(n log n) because of the sorting, which dominates the linear comparison of neighbouring elements.
However, what if you don't want to change the elements of the array, for instance because some other bit of your code relies on them being in their original order? Now you also have to copy the array, sort the copy, and then look for duplicates. This doesn't increase the overall time or storage complexity, but it does increase the actual time and storage, since more work is being done and more memory is required.
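A minimal sketch of that sort-then-scan idea, copying first so the caller's array keeps its order (the method name is made up here):

import java.util.Arrays;

static boolean distinctElementsBySorting(String[] array) {
    String[] copy = Arrays.copyOf(array, array.length); // leave the original order untouched
    Arrays.sort(copy);                                  // O(n log n)
    for (int i = 1; i < copy.length; i++) {
        if (copy[i].equals(copy[i - 1])) {              // equal neighbours mean a duplicate
            return false;
        }
    }
    return true;
}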
Big-O notation only gives you a bound on the time, ignoring multiplicative factors. Maybe you only have access to a really slow sorting algorithm: then it may turn out to be faster just to use your O(n^2) loops, because you don't have to invoke the very slow sort.
This can be the case when you have very small inputs. An oft-cited example of an algorithm that has poor time complexity but is actually useful in practice is Bubble Sort: it's O(n^2) in the worst case, but for a small and/or nearly-sorted array it can be pretty darn fast, and pretty darn simple to implement. Never forget the inefficiency of you having to write and debug the code, and of having to ask questions on SO when it doesn't work as you expect.
What if you know that the elements are already sorted, because you know something about their source? Now you can simply iterate through the array, comparing neighbours, and the time complexity is down to O(n). I can't remember where I read it, but I once saw a blog post saying (I paraphrase):
A given computer can never be made to go quicker; it can only ever do less work.
If you can exploit some property to do less work, that improves your efficiency.
So, efficiency is a subjective criterion:
Whenever you ask "is this efficient", you have to be able to answer the question: "efficient with respect to what?". It might be space; it might be time; it might be how long it takes you to write the code.
You have to know the constraints of the hardware that you're going to run it on - memory, disk, network requirements etc may influence your choices.
You need to know the requirements of the user on whose behalf you are running it. One user might want the results as soon as possible; another user might want the results tomorrow. There is never a need to find a solution better than "good enough" (although that can be a moving goal once the user sees what is possible).
You also have to know what inputs you want it to be efficient for, and what properties of that input you can exploit to avoid unnecessary work.
First, array[i] == array[j] tests reference equality. That's not how you test Strings for value equality.
I would add each element to a Set. If any element isn't successfully added (because it's a duplicate), Set.add(E) returns false. Something like:
static boolean distinctElements(String[] array) {
    Set<String> set = new HashSet<>();
    for (String str : array) {
        if (!set.add(str)) {
            return false;
        }
    }
    return true;
}
You could render the above without a short-circuit like
static boolean distinctElements(String[] array) {
    Set<String> set = new HashSet<>(Arrays.asList(array));
    return set.size() == array.length;
}

Is it more efficient to reset a counter or let it increase and use modulo

Say you need to track the number of times a method is called and print something when it has been called n times. What would be more efficient:
Use a long variable _counter and increase it each time the method is called. On each call, test the equality _counter % n == 0.
Use an int variable _counter and increase it each time the method is called. When _counter == n, print the message and reset _counter to 0.
Some would say the difference is negligible, and they are probably right. I am just curious which method is most commonly used.
In this particular case, since you need to have an if-statement ANYWAY, I would say that you should just set it to zero when it reaches the count.
However, for a case where you use the value every time, and just want to "wrap round to zero when we reach a certain value", then the case is less obvious.
If you can adjust n to be a power of 2 (2, 4, 8, 16, 32, ...), then you can use the trick that counter % n is the same as counter & (n-1), which makes the operation REALLY quick.
If n is not a power of two, chances are that you end up doing a real division, which is a bad idea: division is very expensive compared to regular instructions, and a compare-and-reset is highly likely to be faster than the division option.
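A minimal sketch of the masking variant, assuming n is a power of two (the class and field names here are made up):

class CallTracker {
    private static final int N = 64;        // must be a power of two for the mask trick
    private static final int MASK = N - 1;
    private int counter = 0;

    void onCall() {
        counter = (counter + 1) & MASK;     // same result as (counter + 1) % N, without a division
        if (counter == 0) {
            System.out.println("the method has been called another " + N + " times");
        }
    }
}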
Of course, as others have mentioned, if your counter ever reaches the MAX limit for the type, you could end up with all manner of fun and games.
Edit: And of course, if you are printing something, that probably takes 100 times longer than the divide, so it really is micro-optimization, unless n is quite large.
It depends on the value of n... but I bet resetting plus a simple equality check is faster.
Additionally, resetting the counter is safer: you will never reach the representation limit of your number type.
Edit: also consider readability; doing micro-optimizations may obscure your code.
Why not do both.
If it becomes a problem then look to see if it is worth optimizing.
But there is no point even looking at it until it is a problem (there will be much bigger problems in your algorithms).
count = (count+1) % countMax;
I believe that it is always better to reset the counter for the following reasons:
The code is clearer to an unfamiliar programmer (for example, the maintenance programmer).
There is less chance of an arithmetic overflow when you reset the counter.
Inspecting Guava's RateLimiter will give you some idea of a similar utility implementation: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/RateLimiter.html
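For context, typical RateLimiter usage looks roughly like this (a sketch; it throttles by time rather than counting every nth call, and the class name below is made up):

import com.google.common.util.concurrent.RateLimiter;

class ThrottledWorker {
    // Allow roughly 5 permits per second.
    private final RateLimiter limiter = RateLimiter.create(5.0);

    void handleRequest() {
        limiter.acquire();   // blocks until a permit is available
        // ... do the rate-limited work ...
    }
}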
Here are performance times for 100000000 iterations, in ms
modTime = 1258
counterTime = 449
po2Time = 108
As we can see, the power-of-2 version outperforms the other methods by far, but it only works for powers of 2. Our plain counter is also almost 2.5 times faster than the modulus version. So why would we want to use modulus increments at all? In my opinion they give clean code, and if used properly they are a good tool to know about.
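The measurement code is not reproduced here, but the three strategies being compared have roughly this shape (a sketch only; for serious numbers, use a harness like JMH rather than a hand-rolled loop):

// Modulus check
static long modCounter(long iterations, int n) {
    long hits = 0, counter = 0;
    for (long i = 0; i < iterations; i++) {
        counter++;
        if (counter % n == 0) hits++;
    }
    return hits;
}

// Compare and reset
static long resetCounter(long iterations, int n) {
    long hits = 0;
    int counter = 0;
    for (long i = 0; i < iterations; i++) {
        if (++counter == n) { hits++; counter = 0; }
    }
    return hits;
}

// Power-of-two mask (n must be a power of two)
static long maskCounter(long iterations, int n) {
    long hits = 0;
    int counter = 0;
    for (long i = 0; i < iterations; i++) {
        counter = (counter + 1) & (n - 1);
        if (counter == 0) hits++;
    }
    return hits;
}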

KMP prefix table running time

I wrote code for filling the prefix table for KMP. It is a small variation of this algorithm. I'm unable to convince myself that this algorithm/implementation runs in O(n) time. I have a hard time figuring out the effect of the second recursive call on the total running time. Any help?
public void fillFailTable(int[] failTable, String p) {
    failTable[failTable.length - 1] = preLength(failTable, p);
}

private int preLength(int[] failTable, String s) {
    if (s.length() == 1) {
        return 0;
    }
    int n = s.length();
    int k = preLength(failTable, s.substring(0, n - 1));
    failTable[n - 2] = k;
    if (s.charAt(k) == s.charAt(n - 1)) {
        return k + 1;
    } else {
        return preLength(failTable, s.substring(n - 1 - k));
    }
}
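For reference, the usual iterative way to fill the same kind of failure table looks roughly like this (a sketch, not the asker's code); it is the version for which the O(n) bound is normally argued, since the fallback index k can decrease at most as often as it has previously increased:

// Standard iterative KMP failure-table construction (sketch for comparison).
static int[] buildFailTable(String p) {
    int[] fail = new int[p.length()];
    int k = 0;                                   // length of the current border
    for (int i = 1; i < p.length(); i++) {
        while (k > 0 && p.charAt(i) != p.charAt(k)) {
            k = fail[k - 1];                     // fall back to a shorter border
        }
        if (p.charAt(i) == p.charAt(k)) {
            k++;                                 // extend the border by one character
        }
        fail[i] = k;
    }
    return fail;
}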
It's actually pretty interesting (I'm still wondering why no one smarter than me has answered this yet). Please take this explanation with a grain of salt, as I'm not 100% sure it is even close to being correct (although I can tell you for sure that this method runs in O(n), since that's what they told me at university years ago; they didn't bother explaining why, though, so I had to come up with the explanation on my own).
OK, so let's start with a very basic example of s.length = 2. Two things to mention beforehand:
During each example let's only worry about the worst-case scenario, since we're interested in Big O; that means we always enter the second preLength() call.
We can observe, when looking for the Big O, that k (and the values returned by preLength()) will always be 0 in this code, which is really important in what follows.
s.length == 2
We first enter preLength() with s.length = 2 (let's call this invocation *). Its first recursive call is made with a string of length 1 and returns 0 immediately. Since we're considering only the worst-case scenario (meaning s.charAt(k) != s.charAt(n-1)), we then enter the second preLength(), also with a string of length 1 (since n = 2 and k = 0). This one also returns 0 immediately to *. That ends the invocation. In total we had 3 method invocations: * and the two preLength() calls on strings of length 1.
s.length == 3
Now let's look at an example with a starting s.length = 3. As you can notice, we immediately invoke preLength() with s.length = 2, and from our previous example we know that this needs 3 method invocations. Now remember that when that preLength(2) returns, it returns to our original preLength(3), which will then invoke preLength(2) again (the one in the else branch), which again needs 3 method invocations. So in total we need 2*3 + 1 = 7 method invocations (picture an invocation tree in which each node is a call to preLength() on a string of the length shown).
Conclusions
Now, as you can see, all those method invocations are symmetrical, and that is because our k is always equal to 0, which means that the second preLength() will be invoked with a string of the same size as the first one. So we can tell how many invocations we need for s.length = m once we know how many we need for m-1: f(m) = 2*f(m-1) + 1, where f(m) is the number of method invocations needed to compute the table for a string of size m. This works because, as I said before, the method invocations are symmetrical (in the worst case k = 0 always and preLength() always returns 0, hence the 2*; the +1 is for the very first invocation itself).
So basically with each increment of our input size m the computational time grows by a factor of two plus one (f(m) = 2*f(m-1) + 1), which to my understanding means that this method is, in the worst case, O(n).
As I said please do take this with a grain of salt but I hope this makes some sense :)
