Performance of array analysis using Arrays.sort - java

I have code that uses Arrays.sort(char[]) in the following manner:
void arrayAnalysis(String[] array) {
    for (int a = 0; a < array.length; a++) {
        char[] letters = array[a].toCharArray();
        Arrays.sort(letters);
        ...
        for (int b = a + 1; b < array.length; b++) {
            char[] letters2 = array[b].toCharArray();
            Arrays.sort(letters2);
            if (Arrays.equals(letters, letters2))
                print("equal");
        }
    }
}
In this case, n is equal to the array size. Due to the nested for loops, the performance is automatically O(n^2). However, I think Arrays.sort (with O(n log n)) also affects the performance and makes it worse than O(n^2). Is this thinking correct?
Would the final performance be O((n * n log n) * (n * n log n))? Or am I way off?
Thanks.
Edit: I should add that while n refers to the array size, Arrays.sort works on the number of letters in each array element. That is part of my confusion about whether this should be included in the performance analysis.
Edit2: It would be cool if the down-voter left a comment as to why it was deemed as a bad question.

If n is the length of the array, and m is the length of each array[i], then you will, on each of the n^2 iterations, perform an O(m log m) sort, so overall it's O(n^2 * m log m) (or O(n^3 log n) if n == m). [EDIT: now that I think more about this, your guess is right, and this is the wrong complexity. But what I say below is still correct!]
This is not really necessary, though. You could just make a sorted copy of each element up front, and do your nested for-loop using those. Look at what happens when a is 0: first you sort array[0], then in the inner for loop you sort array[1] through array[n-1].
Then when a is 1, you first sort array[1], then in the inner for loop array[2] through array[n-1]. But you already sorted all of that, and it's not as if it will have changed in the interim.
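A rough sketch of that idea, assuming the goal is just to print "equal" for matching pairs (names are illustrative): each element's letters are sorted exactly once (O(n * m log m)), and the nested loops only compare (O(n^2 * m)).

import java.util.Arrays;

// Sort each string's letters once, up front, then compare in the nested loops.
static void arrayAnalysis(String[] array) {
    char[][] sorted = new char[array.length][];
    for (int i = 0; i < array.length; i++) {
        sorted[i] = array[i].toCharArray();
        Arrays.sort(sorted[i]);                 // each element sorted exactly once
    }
    for (int a = 0; a < array.length; a++) {
        for (int b = a + 1; b < array.length; b++) {
            if (Arrays.equals(sorted[a], sorted[b]))
                System.out.println("equal");
        }
    }
}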

You run n outer loops, each of which runs n inner loops, each of which calls an O(n log n) algorithm, so the final result, absent any interaction between the levels, is O(n^3 log n).

Related

Best way to retrieve K largest elements from large unsorted arrays?

I recently had a coding test during an interview. I was told:
There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?
During this, I was strongly hinted that I needed to sort the array.
So, I suggested using the built-in sort() or maybe a custom implementation if performance really mattered. I was then told that by using a Collection or array to store the k largest and a for-loop, it is possible to achieve approximately O(N). In hindsight, I think it's O(N*k), because each iteration needs to compare against the K-sized array to find the smallest element to replace, whereas sorting the array would cause the code to be at least O(N log N).
I then reviewed this link on SO that suggests a priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N): Write a program to find 100 largest numbers out of an array of 1 billion numbers
Is the for-loop method bad? How should I justify pros/cons of using the for-loop or the priorityqueue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other method of retrieval is called on the sorted array, it should be constant time. Is there some performance factor when running the actual code that I didn't consider when theorizing pseudocode?
Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:
Find the kth largest number x using Quickselect (O(n))
Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
Return your saved elements
(If there are repeated elements, you can avoid them by keeping count of how many duplicates of x you need to add to the result.)
The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.
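A rough sketch of that approach (a randomized, Lomuto-partition Quickselect; method names are illustrative). Instead of a separate pass to collect elements >= x, this variant simply returns the right-hand slice after selection, which also sidesteps the duplicate-counting issue:

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Returns the k largest elements of arr (in no particular order).
static int[] kLargest(int[] arr, int k) {
    int[] a = arr.clone();                          // Quickselect reorders in place, so work on a copy
    int target = a.length - k;                      // index the k-th largest occupies once selected
    quickselect(a, 0, a.length - 1, target);
    return Arrays.copyOfRange(a, target, a.length); // everything right of target is >= a[target]
}

// Partially sorts a so that a[target] holds the value it would have in a fully sorted array.
static void quickselect(int[] a, int lo, int hi, int target) {
    while (lo < hi) {
        int p = partition(a, lo, hi);
        if (p == target) return;
        if (p < target) lo = p + 1;
        else hi = p - 1;
    }
}

// Lomuto partition with a random pivot (the random choice keeps the average at O(n)).
static int partition(int[] a, int lo, int hi) {
    swap(a, lo + ThreadLocalRandom.current().nextInt(hi - lo + 1), hi);
    int pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) swap(a, i++, j);
    }
    swap(a, i, hi);
    return i;
}

static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

public static void main(String[] args) {
    int[] data = {3, -1, 3, 12, 7, 8, -5, 9, 27};
    System.out.println(Arrays.toString(kLargest(data, 3))); // some ordering of 9, 12, 27
}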
There is a large unsorted array of one million ints. The user wants to retrieve the K largest elements.
During this, I was strongly hinted that I needed to sort the array.
So, I suggested using a built-in sort() or maybe a custom implementation
That wasn't really a hint I guess, but rather a sort of trick to deceive you (to test how strong your knowledge is).
If you choose to approach the problem by sorting the whole source array using the built-in Dual-Pivot Quicksort, you can't obtain time complexity better than O(n log n).
Instead, we can maintain a PriorityQueue to store the result. While iterating over the source array, for each element we check whether the queue has reached size K. If not, the element is added to the queue; otherwise (the size equals K) we compare the element against the lowest element in the queue: if the new element is smaller or equal we ignore it, and if it is greater we remove the lowest element and add the new one.
The time complexity of this approach is O(n log k), because adding a new element to a PriorityQueue of size k costs O(log k), and in the worst case this operation is performed n times (since we iterate over an array of size n).
Note that the best case time complexity would be Ω(n), i.e. linear.
So the difference between sorting and using a PriorityQueue in terms of Big O boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n, this approach gives a significant performance gain.
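For a rough sense of scale (my own numbers, not from the answer): with n = 1,000,000 and k = 10, n * log2(n) is about 1,000,000 * 20 = 2 * 10^7 comparisons, while n * log2(k) is about 1,000,000 * 3.3 = 3.3 * 10^6, roughly a sixfold difference before constant factors.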
Here's an implementation:
public static int[] getHighestK(int[] arr, int k) {
    Queue<Integer> queue = new PriorityQueue<>();            // min-heap of the current top k
    for (int next : arr) {
        // if the heap is full and next beats the smallest kept value, evict the smallest
        if (queue.size() == k && queue.peek() < next) queue.remove();
        // add next if there is room (either not full yet, or we just evicted)
        if (queue.size() < k) queue.add(next);
    }
    return toIntArray(queue);
}
public static int[] toIntArray(Collection<Integer> source) {
    return source.stream().mapToInt(Integer::intValue).toArray();
}
main()
public static void main(String[] args) {
    System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
}
Output:
[9, 12, 27]
Sorting in O(n)
We can achieve worst case time complexity of O(n) when there are some constraints regarding the contents of the given array. Let's say it contains only numbers in the range [-1000,1000] (sure, you haven't been told that, but it's always good to clarify the problem requirements during the interview).
In this case, we can use Counting sort which has linear time complexity. Or better, just build a histogram (first step of Counting Sort) and look at the highest-valued buckets until you've seen K counts. (i.e. don't actually expand back to a fully sorted array, just expand counts back into the top K sorted elements.) Creating a histogram is only efficient if the array of counts (possible input values) is smaller than the size of the input array.
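A minimal sketch of that histogram idea, assuming (as above) that all values fall in [-1000, 1000]; names are illustrative:

// Counting-based top-K: build a histogram, then walk the buckets from the highest
// value down until k elements have been emitted.
static int[] topKByHistogram(int[] arr, int k) {
    final int MIN = -1000, MAX = 1000;            // assumed value range
    int[] counts = new int[MAX - MIN + 1];
    for (int v : arr) counts[v - MIN]++;          // O(n) histogram

    int[] result = new int[k];
    int idx = 0;
    for (int v = MAX; v >= MIN && idx < k; v--) { // walk buckets from the top
        for (int c = counts[v - MIN]; c > 0 && idx < k; c--) {
            result[idx++] = v;
        }
    }
    return result;                                // the k largest, in descending order
}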
Another possibility is when the given array is partially sorted, consisting of several sorted chunks. In this case, we can use Timsort, which is good at finding sorted runs; it will deal with them in linear time.
And Timsort is already implemented in Java, where it's used to sort objects (not primitives). So we can take advantage of the well-optimized and thoroughly tested implementation instead of writing our own, which is great. But since we are given an array of primitives, using the built-in Timsort would have an additional cost: we need to copy the contents of the array into a list (or array) of a wrapper type.
This is a classic problem that can be solved with so-called heapselect, a simple variation on heapsort. It can also be solved with quickselect, but like quicksort that has a poor quadratic worst-case time complexity.
Simply keep a priority queue of size k, implemented as a binary min-heap, holding the k largest values seen so far. Walk through the array and insert values into the heap (worst case O(log k)). When the priority queue becomes too large, delete the minimum value at the root (worst case O(log k)). After going through the n array elements, you have removed the n-k smallest elements, so the k largest elements remain. It's easy to see that the worst-case time complexity is O(n log k), which is faster than O(n log n), at the cost of only O(k) space for the heap.
Here is one idea: create an int array whose size covers the maximum possible value (2147483647, the maximum int). Then, for every number obtained from the original array in a for-each loop, increment the element at the index equal to that number in the newly created array.
So at the end of this for-each I will have something like [1, 0, 2, 0, 3] (the array I created), which represents the numbers [0, 2, 2, 4, 4, 4] (the initial array).
To find the K biggest elements, you can iterate backward over the created array and count down from K every time you hit a non-zero element. If an element is, for example, 2, you have to count that number twice.
The limitation of this approach is that it works only with integers, because of the nature of array indices.
Also, since int in Java ranges from -2147483648 to 2147483647, only the non-negative numbers can be placed in the created array.
NOTE: if you know the maximum value of the ints, you can lower the size of the created array accordingly. For example, if the maximum int is 1000, then the array you need to create has size 1001 (indices 0 to 1000), and this algorithm should perform very fast.
I think you misunderstood what you needed to sort.
You need to keep the K-sized list sorted, you don't need to sort the original N-sized input array. That way the time complexity would be O(N * log(K)) in the worst case (assuming you need to update the K-sized list almost every time).
The requirements said that N was very large, but K is much smaller, so O(N * log(K)) is also smaller than O(N * log(N)).
You only need to update the K-sized list for each record that is larger than the K-th largest element before it. For a randomly distributed list with N much larger than K, that will be negligible, so the time complexity will be closer to O(N).
For the K-sized list, you can take a look at the implementation in "Is there a PriorityQueue implementation with fixed capacity and custom comparator?", which uses a PriorityQueue with some additional logic around it.
There is an algorithm to do this in worst-case time complexity O(n*log(k)) with very benign time constants (since there is just one pass through the original array, and the inner part that contributes to the log(k) is only accessed relatively seldom if the input data is well-behaved).
Initialize a priority queue implemented with a binary heap A of maximum size k (internally using an array for storage). In the worst case, this has O(log(k)) for inserting, deleting and searching/manipulating the minimum element (in fact, retrieving the minimum is O(1)).
Iterate through the original unsorted array, and for each value v:
If A is not yet full then
insert v into A,
else, if v>min(A) then (*)
insert v into A,
remove the lowest value from A.
(*) Note that A can return repeated values if some of the highest k values occur repeatedly in the source set. You can avoid that by a search operation to make sure that v is not yet in A. You'd also want to find a suitable data structure for that (as the priority queue has linear complexity), i.e. a secondary hash table or balanced binary search tree or something like that, both of which are available in java.util.
The java.util.PriorityQueue helpfully guarantees the time complexity of its operations:
This implementation provides O(log(n)) time for the enqueuing and dequeuing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).
Note that as laid out above, we only ever remove the lowest (first) element from A, so we enjoy the O(log(k)) for that. If you want to avoid duplicates as mentioned above, then you also need to search for any new value added to it (with O(k)), which opens you up to a worst-case overall scenario of O(n*k) instead of O(n*log(k)) in case of a pre-sorted input array, where every single element v causes the inner loop to fire.
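A sketch of that scheme, using a HashSet as the secondary structure for the duplicate check, so the result is the k largest distinct values (names are illustrative):

import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Keeps the k largest distinct values. The HashSet mirrors the heap's contents so the
// "is v already in A?" check from the answer above is O(1) instead of O(k).
static int[] kLargestDistinct(int[] arr, int k) {
    if (k <= 0) return new int[0];
    PriorityQueue<Integer> heap = new PriorityQueue<>();     // min-heap A, at most k values
    Set<Integer> inHeap = new HashSet<>();
    for (int v : arr) {
        if (inHeap.contains(v)) continue;                    // duplicate of a kept value: skip it
        if (heap.size() < k) {                               // A not yet full: just insert
            heap.offer(v);
            inHeap.add(v);
        } else if (v > heap.peek()) {                        // v > min(A): replace the minimum
            inHeap.remove(heap.poll());                      // remove the lowest value, O(log k)
            heap.offer(v);                                   // insert v, O(log k)
            inHeap.add(v);
        }
    }
    return heap.stream().mapToInt(Integer::intValue).toArray();
}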

How to calculate Big O time complexity for while loops

I am having trouble understanding how while loops affect the Big O time complexity.
For example, how would I calculate the time complexity for the code below?
Since it has a for loop that traverses each element in the array and two while loops nested inside it, my initial thought was O(n^3) for the time complexity, but I do not think that is right.
HashMap<Integer, Boolean> ht = new HashMap<>();
for (int j : array) {
    if (ht.get(j)) continue;
    int left = j - 1;
    // check if hashtable contains number
    while (ht.containsKey(left)) {
        // do something
        left--;
    }
    int right = j + 1;
    // check if hashtable contains number
    while (ht.containsKey(right)) {
        // do something
        right++;
    }
    int diff = right - left;
    if (max < diff) {
        // do something
    }
}
There is best case, average case, and worst case.
I'm going to have to assume there is something that constrains the two while loops so that neither iterates more than n times, where n is the number of elements in the array.
In the best case, you have O(n). That is because if(ht.get(j)) is always true, the continue path is always taken. Neither while loop is executed.
For the worst case, if(ht.get(j)) is always false, the while loops will be executed. Also, in the worst case, each while loop will have n passes. [1] The net result is 2 * n for both inner loops multiplied by n for the outer loop: (2 * n) * n. That would give you time complexity of O(n^2). [2]
The lookup time could potentially be a factor. A hash table lookup usually runs in constant time: O(1). That's the best case. But, the worst case is O(n). This happens when all entries have the same hash code. If that happens, it could potentially change your worst case to O(n^3).
[1] I suspect that in the worst case, the number of passes of the first while loop plus the number of passes of the second while loop is actually n, or close to it. But that doesn't change the result.
[2] In Big O, we choose the term that grows the fastest and ignore the coefficients. So, in this example, we drop the 2 in 2*n*n.
Assuming there are m and n entries in your HashMap and array, respectively.
Since you have n elements for the for loop, the complexity can be written as n * complexity_inside_for.
Inside the for loop, you have two consecutive (not nested) while loops, each contributing a complexity of m, since in the worst case each will need to go through all entries in your HashMap. Therefore, complexity_inside_for = m + m = 2m.
So overall, the time complexity is n * 2m. However, as m and n approach infinity, the factor 2 doesn't matter because it is not a function of m and/or n and can be discarded. This gives a big-O time complexity of O(m*n).
For a single nested loop, the time complexity works like this: O(n^2). In each iteration of i, the inner loop is executed n times; the time complexity of a loop is equal to the number of times its innermost statement is executed.
So for your case that would be O(n^2) + O(n).
You can find more explanation here:
Time-complexity

What is the time complexity for this algorithm?

public static void Comp(int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            for (int k = 1; k < n; k *= 2)
            {
                count++;
            }
        }
    }
    System.out.println(count);
}
Does anyone know what the time complexity is?
And what is the Big O()?
Please, can you explain this to me step by step?
Whoever gave you this problem is almost certainly looking for the answer n^2 log(n), for reasons explained by others.
However the question doesn't really make any sense. If n > 2^30, k will overflow, making the inner loop infinite.
Even if we treat this problem as being completely theoretical, and assume n, k and count aren't Java ints, but some theoretical integer type, the answer n^2 log n assumes that the operations ++ and *= have constant time complexity, no matter how many bits are needed to represent the integers. This assumption isn't really valid.
Update
It has been pointed out to me in the comments below that, based on the way the hardware works, it is reasonable to assume that ++, *=2 and < all have constant time complexity, no matter how many bits are required. This invalidates the third paragraph of my answer.
In theory this is O(n^2 * log(n)).
Each of the two outer loops is O(n) and the inner one is O(log(n)), because log base 2 of n is the number of times you have to divide n by 2 to get 1.
Also, this is a strict bound, i.e. the code is also Θ(n^2 * log(n)).
The time complexity is O(n^2 log n). Why? Each for-loop is a function of n, and you have to multiply by n for each for-loop, except the inner loop, which grows as log n. Why? Because for each iteration k is multiplied by 2. Think of merge sort or binary search trees.
details
For the first two loops: each runs n times (i and j from 0 to n-1), so together they give n * n = O(n^2).
For the k loop, k grows as 1, 2, 4, 8, 16, 32, ..., so after t iterations k = 2^t. The loop stops when 2^t reaches n; taking the log of both sides gives t = log n.
Still not clear? The values of k form a geometric series a^0, a^1, a^2, ... Using the partial-sum formula sum_{i=m}^{t} a^i = (a^m - a^{t+1}) / (1 - a), if we set m = 0 and a = 2 we get (1 - 2^{t+1}) / (-1) = 2^{t+1} - 1. Why is a = 2? Because that is the value of a for which the series yields 2, 4, 8, 16, ..., 2^k.
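If the derivation still feels abstract, here is a quick empirical check (my own addition; n is kept a power of two so log2(n) is exact):

public class CountCheck {
    public static void main(String[] args) {
        for (int n : new int[]{16, 64, 256}) {
            long count = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 1; k < n; k *= 2)
                        count++;                  // same triple loop as in the question
            long predicted = (long) n * n * Math.round(Math.log(n) / Math.log(2));
            System.out.println("n=" + n + "  count=" + count + "  n^2*log2(n)=" + predicted);
        }
    }
}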

Why don't we consider stack frame sizes while calculation Space Complexity of recursive procedures?

Consider the case of Merge Sort on an int array containing n elements: we need an additional array of size n in order to perform merges. We discard the additional array in the end, though. So the space complexity of Merge Sort comes out to be O(n).
But if you look at the recursive mergeSort procedure, on every recursive call mergeSort(something) one stack frame is added to the stack. And it does take some space, right?
public static void mergeSort(int[] a, int low, int high)
{
    if (low < high)
    {
        int mid = low + (high - low) / 2; // written this way to avoid int overflow of low + high
        mergeSort(a, low, mid);
        mergeSort(a, mid + 1, high);
        merge(a, mid, low, high);
    }
}
My questions are:
Why don't we take the size of the stack frames into consideration when calculating Merge Sort's space complexity?
Is it because the stack contains only a few integer variables and one reference, which don't take much memory?
What if my recursive function creates a new local array (let's say int[] a = new int[n];)? Will that then be considered when calculating space complexity?
The space consumed by the stack should absolutely be taken into consideration, but some may disagree here (I believe some algorithms even make complexity claims ignoring this - there's an unanswered related question about radix sort floating around here somewhere).
Since we split the array in half at each recursive call, the size of the stack will be O(log n).
So, if we take it into consideration, the total space will be O(n + log n), which is just O(n) (because, in big-O notation, we can discard asymptotically smaller terms), so it doesn't change the complexity.
And for creating a local array, a similar argument applies. If you create a local array at each step, you end up with O(n + n/2 + n/4 + n/8 + ...) = O(2n) = O(n) (because, in big-O notation, we can discard constant factors), so that doesn't change the complexity either.
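For reference, the merge step isn't shown in the question. A minimal sketch (matching the parameter order merge(a, mid, low, high) used above) makes the locally allocated buffer in that argument concrete:

// One possible merge step (not from the original question): merges a[low..mid] and
// a[mid+1..high], using a temporary buffer allocated locally on each call.
private static void merge(int[] a, int mid, int low, int high) {
    int[] temp = new int[high - low + 1];   // the local auxiliary array discussed above
    int i = low, j = mid + 1, t = 0;
    while (i <= mid && j <= high)
        temp[t++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid)
        temp[t++] = a[i++];
    while (j <= high)
        temp[t++] = a[j++];
    System.arraycopy(temp, 0, a, low, temp.length);
}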
Because you are not calculating the space complexity when you do that; that is called determining it: you run tests and try to conclude what the space complexity is by looking at the results. This is not a mathematical approach.
And yes, you are right with statement 2.

removing duplicate strings from a massive array in java efficiently?

I'm considering the best possible way to remove duplicates from an (unsorted) array of strings - the array contains millions or tens of millions of strings. The array is already prepopulated, so the optimization goal is only on removing dups, not preventing dups from initially populating!
I was thinking along the lines of doing a sort and then binary search to get a log(n) search instead of an n (linear) search. This would give me n log n + n searches, which, although better than an unsorted (n^2) search, still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)
Please help! Looking for an efficient solution that addresses both speed and memory since there are millions of strings involved without using Collections API!
Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:
HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));
If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
ANALYSIS
Let's perform some analysis:
Using HashSet. Time complexity - O(n). Space complexity O(n). Note that it requires about 8 * array size extra bytes (8-16 bytes per reference to a new object).
Quick Sort. Time - O(n*log n). Space O(log n) (the worst case O(n*n) and O(n) respectively).
Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n)
Heap Sort. Time O(n * log n). Space O(1). (but it is slower than 2 and 3).
In the case of Heap Sort you can throw away duplicates on the fly, so you'll save a final pass after sorting.
CONCLUSION
If time is your concern, and you don't mind allocating 8 * array.length bytes for a HashSet - this solution seems to be optimal.
If space is a concern - then QuickSort + one pass.
If space is a big concern - implement a Heap that throws away duplicates on the fly. It's still O(n * log n) but without additional space.
I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in-place if needed (in this case in-place implementation is a bit harder than with normal mergesort because adjacent parts could contain gaps from the removed duplicates which also need to be closed when merging).
For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort
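A rough, not-in-place sketch of that idea: the merge step skips any value equal to the last one it wrote, so duplicates never survive a merge (names are illustrative):

import java.util.Arrays;

// Mergesort that drops duplicates during the merge. Each recursive result is sorted
// and duplicate-free, so the final array is too. O(n log n) time, O(n) extra space.
static String[] dedupMergeSort(String[] a) {
    if (a.length <= 1) return a;
    int mid = a.length / 2;
    String[] left = dedupMergeSort(Arrays.copyOfRange(a, 0, mid));
    String[] right = dedupMergeSort(Arrays.copyOfRange(a, mid, a.length));
    return merge(left, right);
}

// Merge two sorted, duplicate-free arrays, skipping values equal to the last one written.
static String[] merge(String[] left, String[] right) {
    String[] tmp = new String[left.length + right.length];
    int i = 0, j = 0, t = 0;
    while (i < left.length || j < right.length) {
        String next;
        if (j >= right.length || (i < left.length && left[i].compareTo(right[j]) <= 0))
            next = left[i++];
        else
            next = right[j++];
        if (t == 0 || !tmp[t - 1].equals(next))   // drop duplicates as we go
            tmp[t++] = next;
    }
    return Arrays.copyOf(tmp, t);
}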
Creating a hash set to handle this task is way too expensive; in fact, the whole point of them telling you not to use the Collections API is probably that they don't want to hear the word hash. So that leaves the code below.
Note that you offered them binary search AFTER sorting the array: that makes no sense, which may be the reason your proposal was rejected.
OPTION 1:
public static void removeDuplicates(String[] input) {
    Arrays.sort(input); // use mergesort/quicksort here: n log n
    for (int i = 1; i < input.length; i++) {
        if (input[i - 1].equals(input[i])) // equals(), not ==, to compare String contents
            input[i - 1] = null;
    }
}
OPTION 2:
public static String[] removeDuplicates(String[] input) {
    if (input.length == 0) return input;
    Arrays.sort(input); // use mergesort here: n log n
    int size = 1;
    for (int i = 1; i < input.length; i++) {
        if (!input[i - 1].equals(input[i])) // equals(), not ==, to compare String contents
            size++;
    }
    System.out.println(size);
    String[] output = new String[size];
    output[0] = input[0];
    int n = 1;
    for (int i = 1; i < input.length; i++)
        if (!input[i - 1].equals(input[i]))
            output[n++] = input[i];
    // final step: either return output or copy output into input;
    // here I just return output
    return output;
}
OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array; if that is unacceptable, you must make a copy.
public static String[] removeDuplicates(String[] input) {
    if (input.length == 0) return input;
    Arrays.sort(input); // use mergesort/quicksort here: n log n
    int outputLength = 1; // the last element always survives
    for (int i = 1; i < input.length; i++) {
        // equals is safer than ==; this assumes no nulls in the input
        if (input[i - 1].equals(input[i]))
            input[i - 1] = null; // null out the earlier duplicate
        else
            outputLength++;
    }
    // check if there were zero duplicates
    if (outputLength == input.length)
        return input;
    String[] output = new String[outputLength];
    int idx = 0;
    for (int i = 0; i < input.length; i++)
        if (input[i] != null)
            output[idx++] = input[i];
    return output;
}
Hi, do you really need to keep them in an array? It would be faster to use a hash-based collection such as a Set, where each value is unique because of its hash value.
If you put all entries into a set collection type, you can use the
HashSet(int initialCapacity)
constructor to prevent the backing table from being expanded at run time.
Set<String> mySet = new HashSet<String>(Arrays.asList(someArray));
Building the HashSet from Arrays.asList(someArray) runs in O(n) if the table does not have to be expanded.
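As a small aside on capacity (an assumption on my part: the default load factor of 0.75), sizing the set up front like this should avoid any rehashing while it is filled:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: capacity chosen so that someArray.length elements stay below the 0.75 load factor.
static Set<String> toDistinctSet(String[] someArray) {
    Set<String> distinct = new HashSet<>((int) (someArray.length / 0.75f) + 1);
    Collections.addAll(distinct, someArray);
    return distinct;
}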
Since this is an interview question, I think they want you to come up with your own implementation instead of using the Set API.
Instead of sorting first and then comparing, you can build a binary tree and create an empty array to store the result.
The first element in the array will be the root.
If the next element is equal to the node, return (this removes the duplicate elements).
If the next element is less than the node, compare it to the left child; else compare it to the right child.
Keep doing the above 2 steps until you reach the end of the tree; then you can create a new node, knowing it has no duplicate yet.
Insert this new node's value into the result array.
After traversing all elements of the original array, you get a new copy of the array with no duplicates, in the original order.
Traversing takes O(n) and searching the binary tree takes O(log n) (insertion itself is only O(1), since you are just attaching a node and not re-allocating/balancing the tree), so the total should be O(n log n).
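A sketch of that approach without the Collections API (note the tree is unbalanced, so already-sorted input degrades it to O(n^2); class and method names are illustrative):

// Unbalanced BST used only as a "have I seen this string?" structure.
class DedupBst {
    private static final class Node {
        final String value;
        Node left, right;
        Node(String value) { this.value = value; }
    }

    private Node root;

    // Returns true if the value was not seen before and was inserted.
    boolean insertIfAbsent(String value) {
        if (root == null) { root = new Node(value); return true; }
        Node cur = root;
        while (true) {
            int cmp = value.compareTo(cur.value);
            if (cmp == 0) return false;                                        // duplicate: ignore
            if (cmp < 0) {
                if (cur.left == null) { cur.left = new Node(value); return true; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = new Node(value); return true; }
                cur = cur.right;
            }
        }
    }

    // Keeps the first occurrence of each string, in original order.
    static String[] dedupPreservingOrder(String[] input) {
        DedupBst tree = new DedupBst();
        String[] tmp = new String[input.length];
        int n = 0;
        for (String s : input)
            if (tree.insertIfAbsent(s))
                tmp[n++] = s;
        String[] out = new String[n];
        System.arraycopy(tmp, 0, out, 0, n);
        return out;
    }
}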
O.K., if they want super speed, let's use the hashcodes of the Strings as much as possible.
Loop through the array, get the hashcode for each String, and add it to your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for positives and one for negatives, and they will each be huge.
Loop through the array again, with another BitSet where true means the String passes. If the hashcode for the String does not already exist in the BitSet, you can just mark it as true; else, mark it as false, a possible duplicate. While you are at it, count how many possible duplicates there are.
Collect all the possible duplicates into a big String[], named possibleDuplicates. Sort it.
Now go through the possible duplicates in the original array and binary search in possibleDuplicates. If present, well, you are still stuck, because you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...
