More efficient alternative to these "for" loops?

I'm taking an introductory course to Java and one of my latest projects involves making sure an array doesn't contain any duplicate elements (i.e., has distinct elements). I used a for loop with an inner for loop, and it works, but I've heard that you should try to avoid using many iterations in a program (and other methods in my classes have a fair number of iterations as well). Is there any efficient alternative to this code? I'm not asking for code of course, just "concepts." Would there potentially be a recursive way to do this? Thanks!
The array sizes are generally <= 10.
/** Iterates through a String array ARRAY to see if each element in ARRAY is
 *  distinct. Returns false if ARRAY contains duplicates. */
boolean distinctElements(String[] array) { //Efficient?
    for (int i = 0; i < array.length; i += 1) {
        for (int j = i + 1; j < array.length; j += 1) {
            if (array[i] == array[j]) {
                return false;
            }
        }
    }
    return true;
}

"Efficiency" is almost always a trade-off. Occasionally, there are algorithms that are simply better than others, but often they are only better in certain circumstances.
For example, this code above: it's got time complexity O(n^2).
One improvement might be to sort the strings: you can then compare the strings by comparing if an element is equal to its neighbours. The time complexity here is reduced to O(n log n), because of the sorting, which dominates the linear comparison of elements.
However - what if you don't want to change the elements of the array - for instance, some other bit of your code relies on them being in their original order - now you also have to copy the array and then sort it, and then look for duplicates. This doesn't increase the overall time or storage complexity, but it does increase the overall time and storage, since more work is being done and more memory is required.
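A minimal sketch of the copy-then-sort idea described above, assuming the array contains no null elements. The class name SortedDistinctCheck is made up for illustration. It compares neighbours with equals() rather than ==, since the goal is value equality:

```java
import java.util.Arrays;

final class SortedDistinctCheck {
    /** Returns true if no two elements of array are equal; leaves array untouched. */
    static boolean distinctElements(String[] array) {
        // Copy first so the caller's ordering is preserved (extra O(n) space).
        String[] copy = Arrays.copyOf(array, array.length);
        Arrays.sort(copy); // O(n log n) comparisons
        // After sorting, equal strings are adjacent, so one linear pass suffices.
        for (int i = 1; i < copy.length; i++) {
            if (copy[i - 1].equals(copy[i])) {
                return false;
            }
        }
        return true;
    }
}
```

For arrays of ten or fewer elements, as in the question, the nested loops are likely just as good in practice; the sketch only illustrates the asymptotic trade-off.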
Big-oh notation only gives you a bound on the time ignoring multiplicative factors. Maybe you only have access to a really slow sorting algorithm: actually, it turns out to be faster just to use your O(n^2) loops, because then you don't have to invoke the very slow sort.
This could be the case when you have very small inputs. An oft-cited example of an algorithm that has poor time complexity but actually is useful in practice is Bubble Sort: it's O(n^2) in the worst case, but if you have a small and/or nearly-sorted array, it can actually be pretty darn fast, and pretty darn simple to implement - never forget the inefficiency of you having to write and debug the code, and to have to ask questions on SO when it doesn't work as you expect.
What if you know that the elements are already sorted, because you know something about their source? Now you can simply iterate through the array, comparing neighbours, and the time complexity is now O(n). I can't remember where I read it, but I once saw a blog post saying (I paraphrase):
A given computer can never be made to go quicker; it can only ever do less work.
If you can exploit some property to do less work, that improves your efficiency.
So, efficiency is a subjective criterion:
Whenever you ask "is this efficient", you have to be able to answer the question: "efficient with respect to what?". It might be space; it might be time; it might be how long it takes you to write the code.
You have to know the constraints of the hardware that you're going to run it on - memory, disk, network requirements etc may influence your choices.
You need to know the requirements of the user on whose behalf you are running it. One user might want the results as soon as possible; another user might want the results tomorrow. There is never a need to find a solution better than "good enough" (although that can be a moving goal once the user sees what is possible).
You also have to know what inputs you want it to be efficient for, and what properties of that input you can exploit to avoid unnecessary work.

First, array[i] == array[j] tests reference equality. That's not how you test Strings for value equality; use array[i].equals(array[j]) for that.
I would add each element to a Set. If any element isn't successfully added (because it's a duplicate), Set.add(E) returns false. Something like,
static boolean distinctElements(String[] array) {
    Set<String> set = new HashSet<>();
    for (String str : array) {
        if (!set.add(str)) {
            return false;
        }
    }
    return true;
}
You could render the above without a short-circuit like
static boolean distinctElements(String[] array) {
    Set<String> set = new HashSet<>(Arrays.asList(array));
    return set.size() == array.length;
}

Related

Count distinct elements in an array in O(n) only with loops and arrays in java

I found this solution
https://www.geeksforgeeks.org/count-distinct-elements-in-an-array/
The problem is that the time complexity must be O(n) and the space complexity must be O(1), but I can't import any additional libraries and the code must be as short as possible. I wasn't able to find a sorting-based solution faster than O(n log n), so I guess I need to find a clever way. The answer is the third solution from the link above, but it requires an additional library. Is it even possible to find a better way?
Edit:
In fact, I need to create a function that works exactly like
java.util.Arrays.stream(myarray).distinct().count();
It must have O(n) time complexity and O(1) space complexity.
Basically, I have to create it using only loops, arrays and if statements. It is also forbidden to import anything other than import java.util.Scanner; so I can't use any ready-made methods such as those in java.util.Arrays.
For example:
Input:
{1,12,3,0,1,3,15,6}
Output:
6

Maximally short solution with O(n) time complexity, using only Java 8+ built-in APIs, i.e. no additional libraries needed.
The code assumes myarray is an array of int, long, double, or objects [1].
long count = java.util.Arrays.stream(myarray).distinct().count();
[1] The objects must have valid equals() and hashCode() implementations.

A solution in O(n) time complexity and O(1) space complexity is possible in theory, but it might not be very practical. The basic idea is this:
let aMin be the minimum value of an entry in arr
let aMax be the maximum value of an entry in arr
let seenOnce and seenTwice be boolean arrays
    whose indices are in the range [aMin..aMax]
initialize all elements of seenOnce and seenTwice to FALSE
countUnique = 0;
for a in arr {
    if (!seenOnce[a - aMin]) {
        // seeing `a` for the first time, so count it
        seenOnce[a - aMin] = TRUE
        countUnique = countUnique + 1
    } else if (!seenTwice[a - aMin]) {
        // seeing `a` for a second time, so un-count it
        countUnique = countUnique - 1
        seenTwice[a - aMin] = TRUE
    }
}
If the values in arr could be any ints at all, then each of the boolean arrays will contain 2^32 entries, for a total of over 8 billion booleans. That's 1 GB of memory, provided we're careful to store each boolean in a single bit. But it is O(1): the same 1 GB is consumed regardless of whether arr contains two elements or a billion...
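Note that the pseudocode above un-counts a value on its second sighting, so it ends up counting the values that occur exactly once (4 for the sample input), whereas distinct().count() yields 6. Below is a hedged Java sketch of the same flag-array idea adapted to count distinct values, using only loops, arrays and if statements. The class and method names are made up for illustration, and it assumes the range between the smallest and largest value is small enough to fit in a single boolean array:

```java
final class DistinctCounter {
    /** Counts distinct values in arr; assumes (max - min + 1) fits in one array. */
    static int countDistinct(int[] arr) {
        if (arr.length == 0) {
            return 0;
        }
        int min = arr[0];
        int max = arr[0];
        for (int a : arr) { // first O(n) pass: find the value range
            if (a < min) min = a;
            if (a > max) max = a;
        }
        long range = (long) max - min + 1; // long arithmetic avoids int overflow
        if (range > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("value range too large for this sketch");
        }
        boolean[] seen = new boolean[(int) range];
        int count = 0;
        for (int a : arr) { // second O(n) pass: count first sightings only
            int idx = (int) ((long) a - min);
            if (!seen[idx]) {
                seen[idx] = true;
                count++;
            }
        }
        return count;
    }
}
```

The flag array's size depends on the value range rather than on the number of elements, which is the sense in which the answer calls the space O(1).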

Comparison of these two algorithms?

So I'm presented with a problem that states: "Determine if a string contains all unique characters."
So I wrote up this solution that adds each character to a set, but if the character already exists it returns false.
private static boolean allUniqueCharacters(String s) {
    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        char currentChar = s.charAt(i);
        if (!charSet.contains(currentChar)) {
            charSet.add(currentChar);
        } else {
            return false;
        }
    }
    return true;
}
According to the book I am reading this is the "optimal solution"
public static boolean isUniqueChars2(String str) {
    if (str.length() > 128)
        return false;
    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}
My question is, is my implementation slower than the one presented? I assume it is, but if a Hash look up is O(1) wouldn't they be the same complexity?
Thank you.

As Amadan said in the comments, the two solutions have the same time complexity O(n) because you have a for loop looping through the string, and you do constant time operations in the for loop. This means that the time it takes to run your methods increases linearly with the length of the string.
Note that time complexity is all about how the time it takes changes when you change the size of the input. It's not about how fast it is with data of the same size.
For the same string, the "optimal" solution should be faster because sets have some overheads over arrays. Handling arrays is faster than handling sets. However, to actually make the "optimal" solution work, you would need an array of length 2^16. That is how many different char values there are. You would also need to remove the check for a string longer than 128.
This is one of the many examples of the tradeoff between space and time. If you want it to go faster, you need more space. If you want to save space, you have to go slower.
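As a rough sketch of the adjustment described above, here is the array approach widened to cover every possible char value (2^16 of them). The class and method names are invented for illustration. Instead of dropping the length check entirely, this version raises it to 65536, since a longer string must repeat some char:

```java
final class UniqueChars {
    private static final int CHAR_VALUES = 1 << 16; // 65536 distinct char values

    /** Returns true if no char value occurs twice in s. */
    static boolean allUnique(String s) {
        // Pigeonhole: more chars than distinct values means at least one repeat.
        if (s.length() > CHAR_VALUES) {
            return false;
        }
        boolean[] seen = new boolean[CHAR_VALUES]; // fixed space, independent of s
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (seen[c]) {
                return false;
            }
            seen[c] = true;
        }
        return true;
    }
}
```

This works on UTF-16 code units, so characters outside the Basic Multilingual Plane are seen as two chars each; handling full code points would need a larger table or a different structure.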

Both algorithms have time complexity of O(N). The difference is in their space complexity.
The book's solution will always require storage for 128 characters - O(1), while your solution's space requirement will vary linearly according to the input - O(N).
The book's space requirement is based on an assumed character set with 128 characters. But this may be rather problematic (and not scalable) given the likelihood of needing different character sets.

The hashmap is in theory acceptable, but is a waste.
A hashmap is built over an array (so it is certainly more costly than an array), and collision resolution requires extra space (at least double the number of elements). In addition, any access requires computing the hash and possibly resolving collisions.
This adds a lot of overhead in terms of space and time, compared to a straight array.
Also note that it is something of a folklore claim that a hash table has O(1) behavior. The worst case is much poorer: accesses can take up to O(N) time for a table of size N.
As a final remark, the time complexity of this algorithm is O(1), because at worst you conclude false when N>128.

Your algorithm is also O(1). You can think of complexity as how an algorithm reacts to a change in the number of elements processed. Therefore O(n) and O(2n) are effectively equal.
People are talking about O notation as growth rate here

Your solution could indeed be slower than the book's solution. Firstly, a hash lookup is ideally constant time, but retrieval of the object will not be if there are multiple hash collisions. Secondly, even if the lookup is constant time, there is usually significant overhead involved in executing the hash code function compared to looking up an element in an array by index. That's why you may want to go with the array lookup. However, if you start to deal with non-ASCII Unicode characters, then you might not want to go with the array approach due to the significant amount of space overhead.

The bottleneck of your implementation is that a set has a lookup (and insert) complexity* of O(log k), while the array has a lookup complexity of O(1).
This sounds like your algorithm must be much worse. But in fact it is not, as k is bounded by 128 (otherwise the reference implementation would be wrong and produce an out-of-bounds error) and can be treated as a constant. This makes the set lookup O(1) as well, just with somewhat bigger constants than the array lookup.
* assuming a sane implementation as a tree or hashmap. The hashmap time complexity is in general not constant, as filling it up needs log(n) resize operations to avoid the increase of collisions which would lead to linear lookup time, see e.g. here and here for answers on stackoverflow.
This article even explains that Java 8 by itself converts a hashmap to a binary tree (O(n log n) for the conversion, O(log n) for the lookup) before its lookup time degenerates to O(n) because of too many collisions.

What's the time complexity of sorting a list of objects with two properties?

Suppose I have a class:
public class Interval {
    int start;
    int end;
    Interval() { start = 0; end = 0; }
    Interval(int s, int e) { start = s; end = e; }
}
I would like to sort a list of intervals with Collections.sort() like this:
Collections.sort(intervals, new Comparator<Interval>(){
    @Override
    public int compare(Interval o1, Interval o2) {
        if (o1.start == o2.start) {
            return o1.end - o2.end;
        }
        else {
            return o1.start - o2.start;
        }
    }
});
I know that sorting an array with the built-in sorting function takes O(nlogn) time, and the question is if I am sorting a list of objects with two properties, what is the time complexity of sorting this list? Thanks!!

@PaulMcKenzie's brief answer in comments is on the right track, but the full answer to your question is more subtle.
Many people do what you've done and confuse time with other measures of efficiency. What's correct in nearly all cases when someone says a "sort is O(n log n)" is that the number of comparisons is O(n log n).
I'm not trying to be pedantic. Sloppy analysis can make big problems in practice. You can't claim that any sort runs in O(n log n) time without a raft of additional statements about the data and the machine where the algorithm is running. Research papers usually do this by giving a standard machine model used for their analysis. The model states the time required for low level operations - memory access, arithmetic, and comparisons, for example.
In your case, each object comparison requires a constant number (2) of value comparisons. So long as value comparison itself is constant time -- true in practice for fixed-width integers -- O(n log n) is an accurate way to express run time.
However, something as simple as string sorting changes this picture. String comparison itself has a variable cost. It depends on string length! So sorting strings with a "good" sorting algorithm is O(nk log n), where k is the length of strings.
Ditto if you're sorting variable-length numbers (Java BigIntegers, for example).
Sorting is also sensitive to copy costs. Even if you can compare objects in constant time, sort time will depend on how big they are. Algorithms differ in how many times objects need to be moved in memory. Some accept more comparisons in order to do less copying. An implementation detail: sorting pointers vs. objects can change asymptotic run time - a space for time trade.
But even this has complications. After you've sorted pointers, touching the sorted elements in order hops around memory in arbitrary order. This can cause terrible memory hierarchy (cache) performance. Analysis that incorporates memory characteristics is a big topic in itself.
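To make the "constant number of value comparisons" point concrete, here is a small sketch of a two-field comparator for the question's intervals. The wrapper class IntervalSortSketch and the helper sorted() are hypothetical names. Each call does at most two fixed-width int comparisons, so under the assumptions above the sort stays O(n log n). As a side note, comparingInt avoids the overflow that subtraction-based comparators can hit with extreme int values:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class IntervalSortSketch {
    static final class Interval {
        final int start;
        final int end;
        Interval(int s, int e) { start = s; end = e; }
    }

    // Compare by start, then by end: at most two int comparisons per call.
    static final Comparator<Interval> BY_START_THEN_END =
            Comparator.comparingInt((Interval iv) -> iv.start)
                      .thenComparingInt(iv -> iv.end);

    /** Returns a sorted copy; the input list is left unchanged. */
    static List<Interval> sorted(List<Interval> intervals) {
        List<Interval> copy = new ArrayList<>(intervals);
        Collections.sort(copy, BY_START_THEN_END);
        return copy;
    }
}
```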

Big O notation neglects the least-contributing terms.
For example, if your complexity is n+1, n is kept and the 1 is neglected.
So the answer is the same: n * log(n).
Your code just adds one more statement to each comparison, which translates into a constant number of extra instructions per comparison.

You should read the Collections.sort() documentation (link here).
This algorithm guarantees n log(n) performance.
Note: a Comparator doesn't change this complexity, as long as it doesn't use loops itself.

How to analyze time complexity here?

Assume you are playing the following Flip Game with your friend: Given a string that contains only these two characters: + and -, you and your friend take turns to flip two consecutive "++" into "--". The game ends when a person can no longer make a move and therefore the other person will be the winner.
Write a function to determine if the starting player can guarantee a win.
For example, given s = "++++", return true. The starting player can guarantee a win by flipping the middle "++" to become "+--+".
Here is my code:
public boolean canWin(String s) {
    if(s==null || s.length()<2) return false;
    char[] arr=s.toCharArray();
    return canWinHelper(arr);
}
public boolean canWinHelper(char[] arr){
    for(int i=0; i<arr.length-1; i++){
        if(arr[i]=='+' && arr[i+1]=='+'){
            arr[i]='-';
            arr[i+1]='-';
            boolean win=!canWinHelper(arr);
            arr[i]='+';
            arr[i+1]='+';
            if(win) return true;
        }
    }
    return false;
}
It works, but I'm not sure how to calculate the time complexity here, since the function keeps calling itself until a false is returned. Can anyone share some ideas?
Also during the search, we will encounter duplicate computation, so I think I can use a hashmap to avoid those duplicates. Key: String, Value: Boolean.
My updated code using a hashmap:
public boolean canWin(String s){
    if(s==null || s.length()<2) return false;
    HashMap<String,Boolean> map=new HashMap<String,Boolean>();
    return helper(s,map);
}
public boolean helper(String s, HashMap<String,Boolean> map){
    if(map.containsKey(s)) return map.get(s);
    for(int i=0; i<s.length()-1; i++){
        if(s.charAt(i)=='+' && s.charAt(i+1)=='+'){
            String fliped=s.substring(0,i)+"--"+s.substring(i+2);
            if(!helper(fliped,map)){
                map.put(s,true);
                return true;
            }
        }
    }
    map.put(s,false);
    return false;
}
Still, I want to know how to analyze the time and space complexity here.

Take that n = arr.length - 1
First pass you have n recursive calls. For each you have removed two +'s so each will have at most n-2 recursive calls, and so on.
So you have at most n+n(n-2)+n(n-2)(n-4)+... recursive calls.
In essence this is n!!(1+1/2+1/(2*4)+1/(2*4*8)+...) Since 1+1/2+1/(2*4)+1/(2*4*8)+... is convergent, ≤2, you have O(n!!)
Regarding memory, you have an array of length n for each recursive call, so you have n + n*n + n*n*n + ... + n*...*n (n/2 factors in the last term) = n(n^(n/2)-1)/(n-1), and this is O(n^(n/2))
This is obviously pointing to not much better performance than with an exhaustive search.
For the hashed improvement, you are asking for all possible combinations that you have managed to create with your code. However, your code is not much different than the code that would actually create all combinations, apart from the fact that you are replacing two +'s with two -'s, which is reducing the complexity by some factor but not the level of it. Overall, the worst case scenario is the same as with the number of combinations of bits among n/2 locations which is 2^(n/2). Observe that hash function itself has probably some hidden log so the total complexity would be for search O(2^(n/2)*ln(n/2)) and memory O(2^(n/2)).
This is the worst case scenario. However, if there are arrangements where you cannot win, when there is no winning strategy, this above is really the complexity you need to count on.
The question of the average scenario is then the question of the number of cases where you can/cannot win and their distribution among all arrangements. This question has not much to do with your algorithm and requires a totally different set of tools in order to be solved.
After a few moments of checking whether the above reasoning is correct and to the point or not, I would be quite happy with the result, since it is telling me all that I need to know. You cannot expect that you will have an arrangement that will be favorable, and I really doubt that you have like only 0.01% of worst case arrangements, so you need to prepare the worst case scenario anyway and unless this is some special project the back-of-the-envelope calculation is your friend.
Anyway, these type of calculations are there to have test cases correctly prepared, not to have a correct and final implementation. Using the tests you can find what the hidden factors in O() really are, taking into account the compiler, memory consumption, pagination and so on.
Still not to leave this as it is, we can always improve the back-of-the-envelope reasoning, of course. For example, you actually do not have n-2 at each step, because it depends on the parity. For example for ++++++++... if you replace third +++--+++++... it is obvious that you are going to have n-3, not n-2 recursive calls, or even n-4. So the half number of calls may have n-3 recursive calls which would be n/2(n-3)+n/2(n-2)=n(n-5/2)
Observe that since n!=n!!(n-1)!! we can take n!!≈√n!, again n!=n!!!(n-1)!!!(n-2)!!! or n!!!≈∛n! This might lead to a conclusion that we should have something like O((n!)^(5/2)). The testing would tell me how much we can reduce x=3 in O((n!)^(x)).
(It is quite normal to look for the complexity in one particular form just like we have O((n!)^(x)), although it can be expressed differently. So I would continue with the complexity form O((n!)^(x)),1≤x≤3)
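In the spirit of using tests to find what the bounds look like in practice, here is a hedged sketch that counts recursive calls for the plain backtracking search and for the memoized one. It mirrors the question's two versions; the class name FlipGameCallCounter and the static counters are invented for this illustration:

```java
import java.util.HashMap;
import java.util.Map;

final class FlipGameCallCounter {
    static long plainCalls = 0; // calls made by the plain backtracking search
    static long memoCalls = 0;  // calls made by the memoized search

    static boolean canWinPlain(char[] arr) {
        plainCalls++;
        for (int i = 0; i < arr.length - 1; i++) {
            if (arr[i] == '+' && arr[i + 1] == '+') {
                arr[i] = '-';
                arr[i + 1] = '-';
                boolean win = !canWinPlain(arr); // opponent moves next
                arr[i] = '+';                    // undo the move
                arr[i + 1] = '+';
                if (win) return true;
            }
        }
        return false; // no move leads to a win, or no move exists
    }

    static boolean canWinMemo(String s, Map<String, Boolean> memo) {
        memoCalls++;
        Boolean cached = memo.get(s);
        if (cached != null) return cached; // state already solved
        for (int i = 0; i < s.length() - 1; i++) {
            if (s.charAt(i) == '+' && s.charAt(i + 1) == '+') {
                String flipped = s.substring(0, i) + "--" + s.substring(i + 2);
                if (!canWinMemo(flipped, memo)) {
                    memo.put(s, true);
                    return true;
                }
            }
        }
        memo.put(s, false);
        return false;
    }
}
```

Resetting the counters and running both methods on strings of all '+' with growing length gives an empirical growth curve to compare against the estimates above.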

removing duplicate strings from a massive array in java efficiently?

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings. The array contains millions or tens of millions of strings. The array is already populated, so the optimization goal is only removing duplicates, not preventing them from being added in the first place.
I was thinking along the lines of doing a sort and then a binary search, to get a log(n) search instead of an n (linear) search. This would give me n log n + n searches, which, although better than an unsorted (n^2) search, still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)
Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!

Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:
HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));
If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
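If the Collections API really is off limits, a hand-rolled hash set could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name SimpleStringSet is invented, the input contains no nulls, and the table is sized once up front (at least twice the expected element count, capped at 2^30 slots) and never resized. It uses open addressing with linear probing and keeps the first occurrence of each string in its original order:

```java
final class SimpleStringSet {
    private final String[] slots; // power-of-two sized table, null = empty slot

    SimpleStringSet(int expectedSize) {
        int capacity = 16;
        // Keep the load factor at or below 0.5 so probe sequences stay short.
        while (capacity < 2L * expectedSize && capacity < (1 << 30)) {
            capacity <<= 1;
        }
        slots = new String[capacity];
    }

    /** Adds value; returns true if it was not already present. */
    boolean add(String value) {
        int mask = slots.length - 1;
        int h = value.hashCode();
        int i = (h ^ (h >>> 16)) & mask; // mix high bits into the index
        while (slots[i] != null) {
            if (slots[i].equals(value)) {
                return false; // duplicate found
            }
            i = (i + 1) & mask; // linear probing
        }
        slots[i] = value;
        return true;
    }

    /** Returns the distinct strings of input, in first-occurrence order. */
    static String[] removeDuplicates(String[] input) {
        SimpleStringSet seen = new SimpleStringSet(input.length);
        String[] kept = new String[input.length];
        int n = 0;
        for (String s : input) {
            if (seen.add(s)) {
                kept[n++] = s;
            }
        }
        String[] out = new String[n];
        System.arraycopy(kept, 0, out, 0, n);
        return out;
    }
}
```

The expected time is O(n) with O(n) extra references, which is the usual space-for-time trade against a sort-based approach.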

ANALYSIS
Let's perform some analysis:
1. Using HashSet. Time complexity - O(n). Space complexity O(n). Note that it requires about 8 * array size bytes (8-16 bytes - a reference to a new object).
2. Quick Sort. Time - O(n * log n). Space O(log n) (in the worst case, O(n * n) and O(n) respectively).
3. Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n).
4. Heap Sort. Time O(n * log n). Space O(1) (but it is slower than 2 and 3).
In the case of Heap Sort you can throw away duplicates on the fly, so you'll save a final pass after sorting.
CONCLUSION
If time is your concern, and you don't mind allocating 8 * array.length bytes for a HashSet, this solution seems to be optimal.
If space is a concern, then QuickSort + one pass.
If space is a big concern, implement a Heap that throws away duplicates on the fly. It's still O(n * log n) but without additional space.

I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in-place if needed (in this case in-place implementation is a bit harder than with normal mergesort because adjacent parts could contain gaps from the removed duplicates which also need to be closed when merging).
For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort

Creating a hashset to handle this task is way too expensive. Demonstrably, in fact the whole point of them telling you not to use the Collections API is because they don't want to hear the word hash. So that leaves the code following.
Note that you offered them binary search AFTER sorting the array: that makes no sense, which may be the reason your proposal was rejected.
OPTION 1:
public static void removeDuplicates(String[] input){
    Arrays.sort(input);//Use mergesort/quicksort here: n log n
    for(int i=1; i<input.length; i++){
        if(input[i-1].equals(input[i])) // compare values, not references
            input[i-1]=null;
    }
}
OPTION 2:
public static String[] removeDuplicates(String[] input){
    if (input.length == 0) return input; // nothing to de-duplicate
    Arrays.sort(input);//Use mergesort here: n log n
    int size = 1;
    for(int i=1; i<input.length; i++){
        if(!input[i-1].equals(input[i])) // compare values, not references
            size++;
    }
    System.out.println(size);
    String output[] = new String[size];
    output[0]=input[0];
    int n=1;
    for(int i=1;i<input.length;i++)
        if(!input[i-1].equals(input[i]))
            output[n++]=input[i];
    //final step: either return output or copy output into input;
    //here I just return output
    return output;
}
OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array, if that is unacceptable, you must make a copy.
public static String[] removeDuplicates(String[] input){
    if (input.length == 0) return input; // nothing to de-duplicate
    Arrays.sort(input);//Use mergesort/quicksort here: n log n
    int outputLength = 1; // the last element always survives
    for(int i=1; i<input.length; i++){
        // I think equals is safer, but are nulls allowed in the input???
        if(input[i-1].equals(input[i]))
            input[i-1]=null;
        else
            outputLength++;
    }
    // check if there were zero duplicates
    if (outputLength == input.length)
        return input;
    String[] output = new String[outputLength];
    int idx = 0;
    for ( int i=0; i<input.length; i++) // start at 0 so the first element is not skipped
        if (input[i] != null)
            output[idx++] = input[i];
    return output;
}

Hi, do you need to put them into an array? It would be faster to use a hash-based collection such as a set. There each value is unique, because duplicates are detected through their hash values.
If you put all entries into a set collection type, you can use the
HashSet(int initialCapacity)
constructor to prevent the set from growing at run time.
Set<T> mySet = new HashSet<T>(Arrays.asList(someArray))
Building the set from Arrays.asList() has runtime O(n) if the memory does not have to be expanded.

Since this is an interview question, I think they want you to come up with your own implementation instead of using the set api.
Instead of sorting first and then comparing again, you can build a binary tree and create an empty array to store the result.
The first element in the array will be the root.
If the next element is equal to the node, return. -> this removes the duplicate elements
If the next element is less than the node, compare it to the left child; otherwise compare it to the right child.
Keep repeating the above two steps until you reach the end of the tree; then you can create a new node, knowing this element has no duplicate so far.
Insert this new node's value into the array.
After traversing all elements of the original array, you have a new array with no duplicates, in the original order.
Traversing takes O(n) and searching the binary tree takes O(log n) as long as the tree stays reasonably balanced (insertion itself is only O(1), since you are just attaching a node and not re-allocating or rebalancing the tree), so the total should be O(n log n).
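The steps above can be sketched as follows. This is an illustrative, unbalanced binary search tree (the class name BstDedup is made up), and it assumes the input contains no nulls. Because nothing rebalances the tree, already-sorted input degrades it to a linked list and the total cost to O(n^2); the O(n log n) estimate holds only when the insertion order keeps the tree reasonably shallow:

```java
final class BstDedup {
    private static final class Node {
        final String value;
        Node left;
        Node right;
        Node(String value) { this.value = value; }
    }

    /** Returns the distinct strings of input, in first-occurrence order. */
    static String[] removeDuplicates(String[] input) {
        String[] kept = new String[input.length];
        int n = 0;
        Node root = null;
        for (String s : input) {
            if (root == null) { // first element becomes the root
                root = new Node(s);
                kept[n++] = s;
                continue;
            }
            Node cur = root;
            while (true) {
                int cmp = s.compareTo(cur.value);
                if (cmp == 0) {
                    break; // equal to an existing node: duplicate, skip it
                }
                if (cmp < 0) {
                    if (cur.left == null) { // reached the end: new value
                        cur.left = new Node(s);
                        kept[n++] = s;
                        break;
                    }
                    cur = cur.left;
                } else {
                    if (cur.right == null) {
                        cur.right = new Node(s);
                        kept[n++] = s;
                        break;
                    }
                    cur = cur.right;
                }
            }
        }
        String[] out = new String[n];
        System.arraycopy(kept, 0, out, 0, n);
        return out;
    }
}
```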

O.K., if they want super speed, let's use the hashcodes of the Strings as much as possible.
Loop through the array, get the hashcode for each String, and add it to your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for positives and one for negatives, and they will each be huge.
Loop again through the array, with another BitSet. True means the String passes. If the hashcode for the String does not exist in the Bitset, you can just mark it as true. Else, mark it as possibly duplicate, as false. While you are at it, count how many possible duplicates.
Collect all the possible duplicates into a big String[], named possibleDuplicates. Sort it.
Now go through the possible duplicates in the original array and binary Search in the possibleDuplicates. If present, well, you are still stuck, cause you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...
