TreeMap<String,ArrayList<String>> statesToPresidents = new TreeMap<String,ArrayList<String>>();
TreeMap<String,String> reversedMap = new TreeMap<String,String>();
TreeSet<String> presidentsWithoutStates = new TreeSet<String>();
TreeSet<String> statesWithoutPresidents = new TreeSet<String>();

while (infile2.ready())
{
    String president = infile2.readLine();
    if (!reversedMap.containsKey(president))
        presidentsWithoutStates.add(president);
}
infile2.close();
System.out.println( "\nThese presidents were born before the states were formed:\n"); // DO NOT REMOVE OR MODIFY
// YOUR CODE HERE TO PRINT THE NAME(S) Of ANY PRESIDENT(s)
// WHO WERE BORN BEFORE THE STATES WERE FORMED = 10%
Iterator<String> iterator = presidentsWithoutStates.iterator();
while (iterator.hasNext()) {
    System.out.println(iterator.next());
}
I was wondering if my program would run faster if I used an ArrayList instead of a TreeSet. I add the string president to the presidentsWithoutStates TreeSet if it isn't a key in reversedMap, and when I print it out I need it in sorted order. Should I use the TreeSet and sort as I go, or should I just use an ArrayList instead and sort at the end? I saw a similar question about this, but that person wasn't continually adding elements like I am.
Edit: There are no duplicates.
Let's break the running time down:
ArrayList:
n inserts taking amortized O(1) each, giving us O(n)
Sort takes O(n log n), assuming you use the built-in Collections.sort or another O(n log n) sorting algorithm.
Iterating through it takes O(n)
Total = O(n + n log n) = O(n log n)
TreeSet:
n inserts taking O(log n) each, giving us O(n log n).
Iterating through it takes O(n)
Total = O(n log n + n) = O(n log n)
Conclusion:
Asymptotically, we have the same performance.
In practice, ArrayList would probably be slightly faster.
Why do I say this? Well, let's assume it weren't so. Then we could use a TreeSet to sort an array faster than the method made specifically to sort it (the saving from not having to insert into the ArrayList is fairly small). That seems counter-intuitive, doesn't it? If this were (consistently) true, the Java developers would simply implement that method with a TreeSet, wouldn't they?
One could analyse the constant factors involved with the sort versus the TreeSet, but that would probably be fairly complex, and the conditions under which the program is run also affects the constant factors, so it can't be exact.
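For concreteness, here is a minimal sketch of the two options being compared (the method and variable names are just placeholders):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Option 1: collect into an ArrayList, sort once at the end: O(n) + O(n log n)
static List<String> viaArrayList(List<String> incoming) {
    List<String> result = new ArrayList<String>(incoming); // n amortized-O(1) adds
    Collections.sort(result);                              // one O(n log n) sort
    return result;
}

// Option 2: keep a TreeSet sorted as you go: n inserts at O(log n) each
static TreeSet<String> viaTreeSet(List<String> incoming) {
    return new TreeSet<String>(incoming); // iterating it later yields sorted order
}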
Note on duplication:
The above assumes there aren't any duplicates.
If there were duplicates, you definitely shouldn't be doing a contains check if you were to use an ArrayList, but rather do the duplication removal afterwards (which can be done by simply ignoring consecutive elements which are the same during iteration after the sort). The reason the contains check should be avoided is because it takes O(n), which could make the whole thing take O(n²) instead.
If there are many duplicates, TreeSet is likely to be faster, as inserting all n elements takes only O(n log m), where m is the number of distinct elements. The sorting option doesn't deal with duplicates as directly, so, unless m is really small or you get lucky, it still ends up taking O(n log n).
The exact point where TreeSet becomes faster than the sorting option is really something to benchmark.
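If duplicates were possible and you took the ArrayList route, the "ignore consecutive equal elements after sorting" idea above could look like this minimal sketch:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

static void printSortedUnique(List<String> names) {
    List<String> sorted = new ArrayList<String>(names);
    Collections.sort(sorted);                  // O(n log n)
    String previous = null;
    for (String s : sorted) {
        if (!s.equals(previous))               // duplicates are adjacent after the sort
            System.out.println(s);
        previous = s;
    }
}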
Related
I have a list, say
listA = [679, 890, 907, 780, 5230, 781]
and I want to delete the elements that exist in another list,
listB = [907, 5230]
in minimum time complexity.
I can do this with two nested "for loops", which means O(n²) time complexity, but I want to reduce this to O(n log(n)) or O(n).
Is it possible?
It's possible - if one of the lists is sorted. Assuming that list A is sorted and list B is unsorted, with respective dimensions M and N, the minimum time complexity to remove all of list B's elements from list A will be O((N+M)*log(M)). The way you can achieve this is by binary search - each lookup for an element in list A takes O(log(M)) time, and there are N lookups (one for each element in list B). Since it takes O(M*log(M)) time to sort A, it's more efficient for huge lists to sort and then remove all elements, with total time complexity O((N+M)*log(M)).
On the other hand, if you don't have a sorted list, just use Collection.removeAll, which has a time complexity of O(M*N) in this case. The reason for this time complexity is that removeAll does (by default) something like the following pseudocode:
public boolean removeAll(Collection<?> other)
    for each elem in this list
        if other contains elem
            remove elem from this list
Since contains has a time complexity of O(N) for lists, and you end up doing M iterations, this takes O(M*N) time in total.
Finally, if you want to minimize the time complexity of removeAll (with possibly degraded real world performance) you can do the following:
List<Integer> a = ...
List<Integer> b = ...
HashSet<Integer> lookup = new HashSet<>(b);
a.removeAll(lookup);
For bad values of b, the time to construct lookup could take up to O(N*log(N)), as shown here (see "pathologically distributed keys"). After that, each contains check inside removeAll is O(1), so the M iterations take O(M) time. Therefore, the time complexity of this approach is O(M + N*log(N)).
So, there are three approaches here. One provides you with time complexity O((N+M)*log(M)), another provides you with time complexity O(M*N), and the last provides you with time complexity O(M + N*log(N)). Considering that the first and last approaches are similar in time complexity (as log tends to be very small even for large numbers), I would suggest going with the naive O(M*N) for small inputs, and the simplest O(M + N*log(N)) for medium-sized inputs. At the point where your memory usage starts to suffer from creating a HashSet to store the elements of B (very large inputs), I would finally switch to the more complex O((N+M)*log(M)) approach.
You can find an AbstractCollection.removeAll implementation here.
Edit:
The first approach doesn't work so well for ArrayLists - removing from the middle of list A takes O(M) time, apparently. Instead, sort list B (O(N*log(N))), and iterate through list A, removing items as appropriate. This takes O((M+N)*log(N)) time and is better than the O(M*N*log(M)) that you end up with when using an ArrayList. Unfortunately, the "removing items as appropriate" part of this algorithm requires that you create data to store the non-removed elements in O(M), as you don't have access to the internal data array of list A. In this case, it's strictly better to go with the HashSet approach. This is because (1) the time complexity of O((M+N)*log(N)) is actually worse than the time complexity for the HashSet method, and (2) the new algorithm doesn't save on memory. Therefore, only use the first approach when you have a List with O(1) time for removal (e.g. LinkedList) and a large amount of data. Otherwise, use removeAll. It's simpler, often faster, and supported by library designers (e.g. ArrayList has a custom removeAll implementation that allows it to take linear instead of quadratic time using negligible extra memory).
You can achieve this in the following way:
Sort the second list (you can sort either list; here I have sorted the second one). Then loop through the first list and, for each of its elements, do a binary search in the second list.
You can sort a list using the Collections.sort() method.
Total complexity:
For sorting: O(m log m), where m is the size of the second list (only the second list is sorted).
For removing: O(n log m)
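A minimal sketch of that approach (the list names follow the question; it builds a new list instead of removing in place, which avoids O(n) per-element ArrayList removals):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

static List<Integer> removeAllSorted(List<Integer> listA, List<Integer> listB) {
    Collections.sort(listB);                          // O(m log m)
    List<Integer> result = new ArrayList<Integer>();
    for (Integer x : listA)                           // n binary searches: O(n log m)
        if (Collections.binarySearch(listB, x) < 0)   // negative return value means "not found"
            result.add(x);
    return result;
}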
I have a large number of strings and I need to print the unique ones in sorted order.
A TreeSet stores them in sorted order, but insertion takes O(log n) per element. A HashSet takes O(1) per add, but then I have to get a list of the set and sort it using Collections.sort(), which takes O(n log n) (I assume there is no memory overhead here, since only the references to the Strings are copied into the new collection, i.e. the List). Is it fair to say that either choice is the same, since the total time is the same at the end?
That depends on how close you look. Yes, the asymptotic time complexity is O(n log n) in either case, but the constant factors differ. So it's not like one method can get 100 times faster than the other, but it's certainly possible that one method is twice as fast as the other.
For most parts of a program, a factor of 2 is totally irrelevant, but if your program actually spends a significant part of its running time in this algorithm, it would be a good idea to implement both approaches, and measure their performance.
Measuring is the way to go, but if you're talking purely theoretically and ignoring reads after the sort, then consider, for number of strings = x:
HashSet:
x * O(1) add operations + one O(n log n) sort (where n is x) = approximately O(n + n log n) (OK, that's a gross oversimplification, but...)
TreeSet:
x * O(log n) inserts (where n increases from 1 to x) + no sort operation at the end = approximately O(n log (n/2)) (also a gross oversimplification, but...)
And continuing in the oversimplification vein, O(n + n log n) > O(n log (n/2)). Maybe TreeSet is the way to go?
If you distinguish the total number of strings (n) and number of unique strings (m), you get more detailed results for both approaches:
Hash set + sort: O(n) + O(m log m)
TreeSet: O(n log m)
So if n is much bigger than m, using a hash set and sorting the result should be slightly better.
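A minimal sketch of the two approaches side by side (strings stands in for the input collection; everything here is from java.util):

// TreeSet: de-duplicated and kept sorted as you insert: O(n log m)
Set<String> sortedUnique = new TreeSet<String>(strings);

// HashSet + sort: O(n) to de-duplicate, then O(m log m) to sort the m unique strings
List<String> unique = new ArrayList<String>(new HashSet<String>(strings));
Collections.sort(unique);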
You should take into account which methods will be executed more frequently and base your decision on that.
Apart from HashSet and TreeSet, you could use LinkedHashSet, which keeps insertion order (it does not sort) and iterates slightly faster than HashSet. If you want to learn more about their performance differences, I suggest you read 6 Differences between TreeSet HashSet and LinkedHashSet in Java.
I want to know the exact time complexity of the algorithm in this method. I think it is O(n log n), as it uses Arrays.sort:
public static int largestElement(int[] num) throws NullPointerException // O(1)
{
    int a = num.length;    // O(1)
    Arrays.sort(num);      // O(1)? yes
    if (num.length < 1)    // O(1)
        return (Integer) null;
    else
        return num[a-1];   // O(1)
}
You seem to grossly contradict yourself in your post. You are correct in that the method is O(nlogn), but the following is incorrect:
Arrays.sort(num); // O(1)? yes
If you were right, the method would be O(1)! After all, a bunch of O(1) processes in sequence is still O(1). In reality, Arrays.sort() is O(nlogn), which determines the overall complexity of your method.
Finding the largest element in an array or collection can always be O(n), though, since we can simply iterate through each element and keep track of the maximum.
"You are only as fast as your slowest runner" --Fact
So the significant run-time operations here are the sort and the stepping through the array. Since Arrays.sort(num) sorts the array efficiently, we can take it to be O(n lg(n)) (where lg(n) is log base 2 of n), with O notation describing the worst-case runtime. Furthermore, stepping through the array takes O(n).
So we have O(n lg(n)) + O(n) + O(1) + ...
which reduces to O(n lg(n)), since coefficients and lower-order terms are negligible in asymptotic notation.
So your runtime is O(n lg(n)), as stated above.
Indeed, it is O(n log n): Arrays.sort() uses an O(n log n) sort (a merge sort variant for object arrays; for an int[] like this one, a dual-pivot quicksort since Java 7). Sorting may not be the best way to find a max, though. You can just loop through your array, comparing the elements instead.
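For instance, a minimal O(n) version of the same method might look like this (throwing on an empty array is my own choice here; the original's (Integer) null would throw a NullPointerException anyway when unboxed):

public static int largestElement(int[] num) {
    if (num.length < 1)
        throw new IllegalArgumentException("empty array");
    int max = num[0];
    for (int i = 1; i < num.length; i++)  // one O(n) pass, no sorting needed
        if (num[i] > max)
            max = num[i];
    return max;
}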
I'm considering the best possible way to remove duplicates from an (unsorted) array of strings. The array contains millions or tens of millions of strings. The array is already pre-populated, so the optimization goal is only removing dups, not preventing dups from initially populating it!
I was thinking along the lines of doing a sort and then binary search, to get a log(n) search instead of an n (linear) search. This would give me n log n + n operations, which, although better than an unsorted search's n², still seems slow. (I was also considering hashing, but I'm not sure about the throughput.)
Please help! I'm looking for an efficient solution that addresses both speed and memory, since there are millions of strings involved, without using the Collections API!
Until your last sentence, the answer seemed obvious to me: use a HashSet<String> or a LinkedHashSet<String> if you need to preserve order:
HashSet<String> distinctStrings = new HashSet<String>(Arrays.asList(array));
If you can't use the collections API, consider building your own hash set... but until you've given a reason why you wouldn't want to use the collections API, it's hard to give a more concrete answer, as that reason could rule out other answers too.
ANALYSIS
Let's perform some analysis:
Using HashSet. Time complexity - O(n). Space complexity - O(n). Note that it requires roughly 8 * (array size) bytes of extra memory (8-16 bytes per entry for a reference to each object).
Quick Sort. Time - O(n*log n). Space O(log n) (the worst case O(n*n) and O(n) respectively).
Merge Sort (binary tree/TreeSet). Time - O(n * log n). Space O(n)
Heap Sort. Time O(n * log n). Space O(1). (but it is slower than 2 and 3).
In the case of heap sort you can throw away duplicates on the fly, so you'll save a final pass after sorting.
CONCLUSION
If time is your concern and you don't mind allocating about 8 * array.length extra bytes for a HashSet, this solution seems optimal.
If space is a concern, then quicksort + one final pass.
If space is a big concern, implement a heap that throws away duplicates on the fly. It's still O(n log n), but without additional space.
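As an illustration of that third option, here is a minimal sketch of "heap sort, dropping duplicates on the fly". It uses java.util.PriorityQueue for brevity; a hand-rolled binary heap over the input array would avoid both the Collections API and the extra space:

import java.util.Arrays;
import java.util.PriorityQueue;

static void printUniqueSorted(String[] array) {
    PriorityQueue<String> heap = new PriorityQueue<String>(Arrays.asList(array));
    String previous = null;
    while (!heap.isEmpty()) {                  // n polls at O(log n) each
        String s = heap.poll();
        if (!s.equals(previous))               // equal elements come out consecutively
            System.out.println(s);
        previous = s;
    }
}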
I would suggest that you use a modified mergesort on the array. Within the merge step, add logic to remove duplicate values. This solution is n*log(n) complexity and could be performed in place if needed (in that case the in-place implementation is a bit harder than with normal mergesort, because adjacent parts can contain gaps from the removed duplicates, which also need to be closed when merging).
For more information on mergesort see http://en.wikipedia.org/wiki/Merge_sort
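A minimal sketch of the idea (not in place, for clarity; the merge step skips any value equal to the last one emitted):

import java.util.Arrays;

static String[] dedupMergeSort(String[] a) {
    if (a.length <= 1) return a;
    String[] left  = dedupMergeSort(Arrays.copyOfRange(a, 0, a.length / 2));
    String[] right = dedupMergeSort(Arrays.copyOfRange(a, a.length / 2, a.length));
    String[] merged = new String[left.length + right.length];
    int i = 0, j = 0, k = 0;
    while (i < left.length || j < right.length) {
        String next = (j >= right.length
                || (i < left.length && left[i].compareTo(right[j]) <= 0))
                ? left[i++] : right[j++];
        if (k == 0 || !merged[k - 1].equals(next))   // duplicate removal inside the merge
            merged[k++] = next;
    }
    return Arrays.copyOf(merged, k);                 // trim the slots freed by duplicates
}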
Creating a hash set to handle this task is way too expensive here; in fact, the whole point of them telling you not to use the Collections API is probably that they don't want to hear the word "hash". So that leaves the code below.
Note that you offered them binary search AFTER sorting the array: that makes no sense (once the array is sorted, duplicates are already adjacent, so there is nothing to search for), which may be the reason your proposal was rejected.
OPTION 1:
public static void removeDuplicates(String[] input){
    Arrays.sort(input); // use mergesort/quicksort here: n log n
    for (int i = 1; i < input.length; i++){
        if (input[i-1].equals(input[i]))  // equals, not ==: distinct String objects can be equal
            input[i-1] = null;
    }
}
OPTION 2:
public static String[] removeDuplicates(String[] input){
    if (input.length == 0)
        return input;   // guard: output[0] below needs at least one element
    Arrays.sort(input); // use mergesort here: n log n
    int size = 1;
    for (int i = 1; i < input.length; i++){
        if (!input[i-1].equals(input[i]))  // equals, not !=: compare String contents
            size++;
    }
    System.out.println(size);
    String output[] = new String[size];
    output[0] = input[0];
    int n = 1;
    for (int i = 1; i < input.length; i++)
        if (!input[i-1].equals(input[i]))
            output[n++] = input[i];
    // final step: either return output or copy output into input;
    // here I just return output
    return output;
}
OPTION 3: (added by 949300, based upon Option 1). Note that this mangles the input array; if that is unacceptable, you must make a copy.
public static String[] removeDuplicates(String[] input){
    Arrays.sort(input); // use mergesort/quicksort here: n log n
    int outputLength = input.length == 0 ? 0 : 1;  // the final element always survives
    for (int i = 1; i < input.length; i++){
        // equals is safer than ==, but are nulls allowed in the input???
        if (input[i-1].equals(input[i]))
            input[i-1] = null;   // null out the earlier copy of each duplicate
        else
            outputLength++;
    }
    // check if there were zero duplicates
    if (outputLength == input.length)
        return input;
    String[] output = new String[outputLength];
    int idx = 0;
    for (int i = 0; i < input.length; i++)  // start at 0 so the first element isn't skipped
        if (input[i] != null)
            output[idx++] = input[i];
    return output;
}
Hi, do you need to put them into an array? It would be faster to use a hash-based collection like a set, where each value is stored only once.
If you put all entries into a set collection type, you can use the
HashSet(int initialCapacity)
constructor to prevent the backing table from being resized at run time.
Set<String> mySet = new HashSet<String>(Arrays.asList(someArray));
Arrays.asList() itself runs in O(1) (it just wraps the array); copying into the HashSet is O(n) when the table does not have to be resized.
Since this is an interview question, I think they want you to come up with your own implementation instead of using the Set API.
Instead of sorting first and then comparing, you can build a binary search tree and create an empty array to store the result.
The first element in the array becomes the root.
If the next element is equal to a node, return. -> this removes the duplicate elements
If the next element is less than the node, compare it to the left child; otherwise, compare it to the right child.
Keep applying the two steps above until you reach the end of a branch; then you can create a new node, knowing its value has no duplicate yet.
Insert this new node's value into the result array.
After traversing all the elements of the original array, you get a new copy of the array with no duplicates, in the original order.
Traversing takes O(n) and searching the binary tree takes O(log n) (insertion should only take O(1), since you are just attaching the node, not re-allocating/balancing the tree), so the total should be O(n log n).
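A minimal sketch of that idea (the class and method names are just for illustration):

import java.util.Arrays;

class Node {
    String value;
    Node left, right;
    Node(String value) { this.value = value; }
}

static String[] dedupPreservingOrder(String[] input) {
    String[] out = new String[input.length];
    int n = 0;
    Node root = null;
    for (String s : input) {
        if (root == null) {                        // first element becomes the root
            root = new Node(s);
            out[n++] = s;
            continue;
        }
        Node cur = root;
        while (true) {
            int cmp = s.compareTo(cur.value);
            if (cmp == 0) break;                   // duplicate: skip it
            Node next = (cmp < 0) ? cur.left : cur.right;
            if (next == null) {                    // first occurrence: attach and record
                if (cmp < 0) cur.left = new Node(s);
                else cur.right = new Node(s);
                out[n++] = s;
                break;
            }
            cur = next;
        }
    }
    return Arrays.copyOf(out, n);                  // unique strings, original order kept
}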
O.K., if they want super speed, let's use the hashcodes of the Strings as much as possible.
Loop through the array, get the hashcode for each String, and add it to your favorite data structure. Since you aren't allowed to use a Collection, use a BitSet. Note that you need two, one for positives and one for negatives, and they will each be huge.
Loop again through the array, with another BitSet, where true means the String passes. If the hashcode for the String has not been seen yet, mark it as true. Else, mark it as false: a possible duplicate. While you are at it, count how many possible duplicates there are.
Collect all the possible duplicates into a big String[], named possibleDuplicates. Sort it.
Now go through the possible duplicates in the original array and binary search in possibleDuplicates. If present, well, you are still stuck, because you want to include it ONCE but not all the other times. So you need yet another array somewhere. Messy, and I've got to go eat dinner, but this is a start...
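A rough sketch of the first passes described above (the names are mine; negative hashcodes are mapped to non-negative bit indices with ~h):

import java.util.BitSet;

static boolean[] flagPossibleDuplicates(String[] array) {
    BitSet seenPos = new BitSet();   // bits for non-negative hashcodes; grows on demand,
    BitSet seenNeg = new BitSet();   // worst case ~256 MB each
    boolean[] possibleDup = new boolean[array.length];
    for (int i = 0; i < array.length; i++) {
        int h = array[i].hashCode();
        BitSet bits = (h >= 0) ? seenPos : seenNeg;
        int idx = (h >= 0) ? h : ~h;         // ~h turns a negative hash non-negative
        if (bits.get(idx))
            possibleDup[i] = true;           // same hash seen before: *possible* duplicate
        else
            bits.set(idx);
    }
    return possibleDup;
}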
Does Collections.sort(list) check whether the list is already sorted, or is it maybe O(1) for some other reason?
Or is it a good idea to keep a sorted flag and set it to true/false upon calling sort()/adding an element to the list?
How can you determine whether any list is sorted without looking at it? It won't be O(1). Determining whether a list is sorted takes at least O(n).
That would mean that if Collections.sort did bother to check whether the list was sorted first, each sorting operation would take an average of O(n) + O(n log n).
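For illustration, even the cheapest possible check is a full O(n) pass, something like:

import java.util.List;

// a minimal sketch of an O(n) sortedness check
static <T extends Comparable<? super T>> boolean isSorted(List<T> list) {
    for (int i = 1; i < list.size(); i++)
        if (list.get(i - 1).compareTo(list.get(i)) > 0)
            return false;    // found an out-of-order adjacent pair
    return true;             // every adjacent pair is in order
}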
As a matter of fact, with Java 7, Java switched from merge sort to TimSort (named after Python dev Tim Peters, who implemented it for CPython first) for some sorting tasks.
While it's not O(1), sorting an already sorted or partially sorted list with TimSort is considerably more efficient than sorting completely random data (for the latter there's no way to beat O(n log n) with a comparison sort; that bound doesn't hold for non-random data).
There is no way it's O(1): you can't check whether a collection is sorted faster than O(n). Having a flag should be fine, but it's hard to say for sure without knowing what exactly you are doing.
Generally speaking, sorting an already sorted list doesn't make the sort faster (except for simple sorts like bubble sort). In some cases pre-sorted input is even slower.
In the case of Collections.sort(), it is no faster to sort an already sorted list.