Should I use TreeSet or HashSet? - java

I have a large number of strings, and I need to print the unique ones in sorted order.
TreeSet stores them in sorted order, but insertion costs O(log n) per element. HashSet takes O(1) per add, but then I have to copy the set into a list and sort it with Collections.sort(), which takes O(n log n) (I assume there is no memory overhead here, since only the references to the Strings are copied into the new collection, i.e. the List). Is it fair to say that overall either choice is the same, since the total time ends up the same?

That depends on how closely you look. Yes, the asymptotic time complexity is O(n log n) in either case, but the constant factors differ. So it's not as though one method can be 100 times faster than the other, but it's certainly possible that one method is twice as fast as the other.
For most parts of a program, a factor of 2 is totally irrelevant, but if your program actually spends a significant part of its running time in this algorithm, it would be a good idea to implement both approaches and measure their performance.
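If you do decide to measure, a rough sketch of the comparison might look like the following (a proper benchmark would use JMH with warm-up iterations; the synthetic input below is purely illustrative):

import java.util.*;

public class UniqueSortComparison {

    // Approach 1: TreeSet keeps the strings unique and sorted as they are inserted.
    static List<String> viaTreeSet(List<String> input) {
        return new ArrayList<>(new TreeSet<>(input));
    }

    // Approach 2: HashSet removes duplicates first, then one sort at the end.
    static List<String> viaHashSetThenSort(List<String> input) {
        List<String> unique = new ArrayList<>(new HashSet<>(input));
        Collections.sort(unique);
        return unique;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            input.add("str" + rnd.nextInt(100_000));   // many duplicates on purpose
        }

        long t1 = System.nanoTime();
        List<String> a = viaTreeSet(input);
        long t2 = System.nanoTime();
        List<String> b = viaHashSetThenSort(input);
        long t3 = System.nanoTime();

        System.out.println("TreeSet:        " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("HashSet + sort: " + (t3 - t2) / 1_000_000 + " ms");
        System.out.println("Same result: " + a.equals(b));
    }
}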

Measuring is the way to go, but if you're talking purely theoretically and ignoring the reads after sorting, then consider, for x strings:
HashSet:
x O(1) add operations + one O(n log n) sort (where n is x) = approximately O(n + n log n) (ok, that's a gross oversimplification, but...)
TreeSet:
x O(log n) insertions (where n grows from 1 to x) + no sort at all = approximately O(n log (n/2)) (also a gross oversimplification, but...)
And continuing in the oversimplified vein, O(n + n log n) > O(n log (n/2)). Maybe TreeSet is the way to go?

If you distinguish the total number of strings (n) and number of unique strings (m), you get more detailed results for both approaches:
Hash set + sort: O(n) + O(m log m)
TreeSet: O(n log m)
So if n is much bigger than m, using a hash set and sorting the result should be slightly better.

You should take into account which methods will be executed most frequently and base your decision on that.
Apart from HashSet and TreeSet you can use LinkedHashSet, which offers close to HashSet performance but keeps insertion order rather than sorted order, so it won't do the sorting for you. If you want to learn more about their differences in performance, I suggest you read 6 Differences between TreeSet HashSet and LinkedHashSet in Java.
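For clarity on what ordering each set actually gives you, a small illustration (the expected output is noted in the comments):

import java.util.*;

public class SetOrderDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "pear", "banana");

        // HashSet: unique elements, but no defined iteration order
        System.out.println(new HashSet<>(words));

        // LinkedHashSet: unique elements, iteration follows insertion order -> [pear, apple, banana]
        System.out.println(new LinkedHashSet<>(words));

        // TreeSet: unique elements, kept sorted -> [apple, banana, pear]
        System.out.println(new TreeSet<>(words));
    }
}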

Related

In Java, why is comparing two elements of an array relatively time consuming?

I am working on a program that uses insertion sort, selection sort, and merge sort. I time all of them and make a table of which is fastest. I understand why merge sort is more efficient than selection sort and insertion sort (because of the effectiveness of comparing elements).
My question is why comparing elements of an array is relatively time consuming, and why that makes insertion and selection sort less efficient.
Note: I am new to Java and couldn't find anything on this topic. Thanks for your responses.
My question is why comparing 2 elements of an array relatively consuming ....
Relative to what?
In fact, the time taken to compare two instances of some class depends on how the compareTo or compare method is implemented. However, comparison is typically expensive because of the nature of the computation.
For example, if you have to compare two strings that are equal (but different objects), you have to compare each character in one string with the corresponding character in the other one. For strings of length M, that is M character comparisons plus the overhead of looping over the characters. (Obviously, the comparison is cheaper in other cases ... depending on how different the strings are, for example.)
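As a sketch of the kind of work that implies, here is a simplified version of what a lexicographic string comparison has to do (the real String.compareTo is similar in spirit, though the JDK version is more heavily optimised):

public class StringCompareSketch {
    // Up to min(a.length, b.length) character comparisons; cheap if the strings
    // differ early, expensive if they share a long common prefix or are equal.
    static int compareStrings(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != cb) {
                return ca - cb;
            }
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        System.out.println(compareStrings("apple", "apricot"));        // differs at index 2: cheap
        System.out.println(compareStrings("aaaaaaaaab", "aaaaaaaaac")); // long shared prefix: costly
    }
}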
and why does it make insertion and selection sort less efficient.
The reason that insertion and selection sort are slower (for large datasets) is because they do more comparisons than other more complicated algorithms. Given a dataset with N elements:
The number of comparisons for quicksort and similar is proportional to N * logN
The number of comparisons for insertion sort and similar is proportional to N * N.
As N gets bigger N * N gets bigger than N * log N irrespective of the constants of proportionality.
Assuming that the datasets and element classes are the same, if you do more comparisons, that takes more CPU time.
The other thing to note is that the number of comparisons performed by a sort algorithm is typically proportional to other CPU overheads of the algorithm. That means that it is typically safe (though not mathematically sound) to use the comparison count as a proxy for the overall complexity of a sort algorithm.
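Since the comparison count is such a useful proxy, it is easy to measure directly by wrapping a comparator; a minimal sketch:

import java.util.*;
import java.util.concurrent.atomic.AtomicLong;

public class ComparisonCounter {
    public static void main(String[] args) {
        Random rnd = new Random(1);
        Integer[] data = new Integer[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextInt();
        }

        AtomicLong comparisons = new AtomicLong();
        Arrays.sort(data, (a, b) -> {
            comparisons.incrementAndGet();   // count every comparison the sort performs
            return a.compareTo(b);
        });

        // For n = 10,000 an n log n sort performs on the order of 130,000 comparisons,
        // whereas insertion sort on random data would perform roughly n*n/4 = 25,000,000.
        System.out.println("Comparisons: " + comparisons.get());
    }
}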
why comparing 2 elements of an array relatively consuming
As asked in Stephen C's answer, relative to what?
Selection sort and insertion sort have time complexity O(n^2), while merge sort has time complexity O(n log(n)), so for reasonably large n, merge sort will be faster, but not because of compare overhead compared to the O(n^2) sorts.
For merge sort with an optimizing compiler, where compared elements are loaded into registers (assuming the elements fit in registers), the compare overhead is small, since the subsequent move writes the value already held in a register rather than reading it from memory again.
As for compare overhead, if sorting an array of primitives, indexing is used to access the primitives, but if sorting an array of objects, which is usually implemented as an array of pointers to objects, the compare overhead is increased due to dereferencing of pointers. This would impact a comparison of quick sort versus merge sort (more moves, fewer compares), but the issue with merge sort versus insertion sort or selection sort is the O(n^2) time complexity versus merge sort O(n log(n)) time complexity.
In the case of sorting an array of objects, there's also the issue of sorting the pointers versus sorting the objects, which is a cache locality issue. Depending on object size, it may be better to sort the objects rather than sort the pointers, but this isn't really related to compare overhead as asked in the original question.
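One way to see the reference-versus-value effect in Java is to sort the same data as an int[] and as an Integer[]; the boxed version pays for pointer dereferences (and worse cache locality) on every comparison. A rough, illustrative sketch:

import java.util.*;

public class PrimitiveVsBoxedSort {
    public static void main(String[] args) {
        int n = 2_000_000;
        Random rnd = new Random(7);
        int[] primitives = new int[n];
        Integer[] boxed = new Integer[n];
        for (int i = 0; i < n; i++) {
            int v = rnd.nextInt();
            primitives[i] = v;
            boxed[i] = v;   // each element is a reference to a heap object
        }

        long t1 = System.nanoTime();
        Arrays.sort(primitives);   // sorts the values in place
        long t2 = System.nanoTime();
        Arrays.sort(boxed);        // sorts references; every compare dereferences two objects
        long t3 = System.nanoTime();

        System.out.println("int[]:     " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("Integer[]: " + (t3 - t2) / 1_000_000 + " ms");
    }
}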

What's the time complexity of sorting a list of objects with two properties?

Suppose I have a class:
public class Interval {
    int start;
    int end;

    Interval() { start = 0; end = 0; }
    Interval(int s, int e) { start = s; end = e; }
}
I would like to sort a list of intervals with Collections.sort() like this:
Collections.sort(intervals, new Comparator<Interval>() {
    @Override
    public int compare(Interval o1, Interval o2) {
        if (o1.start == o2.start) {
            return o1.end - o2.end;
        } else {
            return o1.start - o2.start;
        }
    }
});
I know that sorting an array with the built-in sorting function takes O(nlogn) time, and the question is if I am sorting a list of objects with two properties, what is the time complexity of sorting this list? Thanks!!
@PaulMcKenzie's brief answer in comments is on the right track, but the full answer to your question is more subtle.
Many people do what you've done and confuse time with other measures of efficiency. What's correct in nearly all cases when someone says a "sort is O(n log n)" is that the number of comparisons is O(n log n).
I'm not trying to be pedantic. Sloppy analysis can make big problems in practice. You can't claim that any sort runs in O(n log n) time without a raft of additional statements about the data and the machine where the algorithm is running. Research papers usually do this by giving a standard machine model used for their analysis. The model states the time required for low level operations - memory access, arithmetic, and comparisons, for example.
In your case, each object comparison requires a constant number (2) of value comparisons. So long as value comparison itself is constant time -- true in practice for fixed-width integers -- O(n log n) is an accurate way to express run time.
However, something as simple as string sorting changes this picture. String comparison itself has a variable cost. It depends on string length! So sorting strings with a "good" sorting algorithm is O(nk log n), where k is the length of strings.
Ditto if you're sorting variable-length numbers (java BigIntegers for example).
Sorting is also sensitive to copy costs. Even if you can compare objects in constant time, sort time will depend on how big they are. Algorithms differ in how many times objects need to be moved in memory. Some accept more comparisons in order to do less copying. An implementation detail: sorting pointers vs. objects can change asymptotic run time - a space for time trade.
But even this has complications. After you've sorted pointers, touching the sorted elements in order hops around memory in arbitrary order. This can cause terrible memory hierarchy (cache) performance. Analysis that incorporates memory characteristics is a big topic in itself.
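As a side note on the comparator itself: the same two-property ordering can be written with the Java 8 comparator combinators, which delegate to Integer.compare and therefore also avoid the overflow that the subtraction trick (o1.end - o2.end) can hit for extreme int values. A sketch, assuming the Interval class from the question is available in the same package:

import java.util.*;

public class IntervalSortDemo {
    public static void main(String[] args) {
        List<Interval> intervals = new ArrayList<>(Arrays.asList(
                new Interval(3, 5), new Interval(1, 9), new Interval(1, 2)));

        // Order by start, then by end - the same ordering as the hand-written comparator.
        intervals.sort(Comparator.comparingInt((Interval i) -> i.start)
                                 .thenComparingInt(i -> i.end));

        for (Interval i : intervals) {
            System.out.println(i.start + ".." + i.end);   // 1..2, 1..9, 3..5
        }
    }
}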
Big O notation neglects the least significant factors: for example, if your complexity is n + 1, the n is kept and the 1 is dropped.
So the answer is the same, n log(n).
Your comparator just adds a constant amount of work (a couple of int comparisons) to each comparison the sort performs, which is absorbed into the constant factor.
You should read the documentation for Collections.sort() (link here): it guarantees n log(n) performance.
Note: a comparator that does a constant amount of work doesn't change the complexity; it would be different if the comparator itself looped over the data.

Optimize code with ArrayList or TreeSet?

TreeMap<String, ArrayList<String>> statesToPresidents = new TreeMap<String, ArrayList<String>>();
TreeMap<String, String> reversedMap = new TreeMap<String, String>();
TreeSet<String> presidentsWithoutStates = new TreeSet<String>();
TreeSet<String> statesWithoutPresidents = new TreeSet<String>();

while (infile2.ready())
{
    String president = infile2.readLine();
    if (reversedMap.containsKey(president) == false)
        presidentsWithoutStates.add(president);
}
infile2.close();

System.out.println("\nThese presidents were born before the states were formed:\n"); // DO NOT REMOVE OR MODIFY
// YOUR CODE HERE TO PRINT THE NAME(S) Of ANY PRESIDENT(s)
// WHO WERE BORN BEFORE THE STATES WERE FORMED = 10%
Iterator<String> iterator = presidentsWithoutStates.iterator();
while (iterator.hasNext()) {
    System.out.println(iterator.next());
}
I was wondering if my program would run faster if I used an ArrayList instead of a TreeSet. I add the string president to the presidentsWithoutStates TreeSet if it's not a key in reversedMap, and when I print it out I need sorted order. Should I use the TreeSet and sort as I go, or should I just use an ArrayList instead and sort at the end? I saw a similar question about this, but that person wasn't continually adding elements like I am.
Edit: There are no duplicates
Let's break the running time down:
ArrayList:
n inserts taking amortized O(1) each, giving us O(n)
Sort takes O(n log n), assuming you use the built-in Collections.sort, or an O(n log n) sorting algorithm.
Iterating through it takes O(n)
Total = O(n + n log n) = O(n log n)
TreeSet:
n inserts taking O(log n) each, giving us O(n log n).
Iterating through it takes O(n)
Total = O(n log n + n) = O(n log n)
Conclusion:
Asymptotically, we have the same performance.
In practice, ArrayList would probably be slightly faster.
Why do I say this? Well, let's assume it isn't. Then we could use TreeSet to sort an array faster than the method made specifically to sort it (the saving gotten from not having to insert into the ArrayList is fairly small). That seems counter-intuitive, doesn't it? If this were (consistently) true, Java developers would simply replace that method with TreeSet, wouldn't they?
One could analyse the constant factors involved with the sort versus the TreeSet, but that would probably be fairly complex, and the conditions under which the program is run also affects the constant factors, so it can't be exact.
Note on duplication:
The above assumes there aren't any duplicates.
If there were duplicates, you definitely shouldn't be doing a contains check if you were to use an ArrayList, but rather do the duplicate removal afterwards (which can be done by simply skipping consecutive equal elements while iterating after the sort, as sketched below). The reason the contains check should be avoided is that it takes O(n), which could make the whole thing take O(n²) instead.
If there are many duplicates, TreeSet is likely to be faster, as it only takes O(n log m), where m is the number of distinct elements. The sorting option doesn't deal with duplicates as directly, so unless m is really small, or you get lucky, it still ends up taking O(n log n).
The exact point where TreeSet becomes faster than the sorting option is really something to benchmark.
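A minimal sketch of that sort-then-skip-duplicates approach with an ArrayList (the data is illustrative):

import java.util.*;

public class SortThenDedup {
    // Prints the unique strings in sorted order without using a Set:
    // sort first, then skip consecutive duplicates while iterating.
    static void printSortedUnique(List<String> presidents) {
        Collections.sort(presidents);
        String previous = null;
        for (String p : presidents) {
            if (!p.equals(previous)) {
                System.out.println(p);
            }
            previous = p;
        }
    }

    public static void main(String[] args) {
        printSortedUnique(new ArrayList<>(
                Arrays.asList("Adams", "Washington", "Adams", "Jefferson")));
    }
}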

Time Complexity of my program

I want to know the exact time complexity of my algorithm in this method. I think it is O(n log n), as it uses Arrays.sort:
public static int largestElement(int[] num) throws NullPointerException // O(1)
{
    int a = num.length;   // O(1)
    Arrays.sort(num); // O(1)? yes
    if (num.length < 1)   // O(1)
        return (Integer) null;
    else
        return num[a - 1]; // O(1)
}
You seem to grossly contradict yourself in your post. You are correct in that the method is O(nlogn), but the following is incorrect:
Arrays.sort(num); // O(1)? yes
If you were right, the method would be O(1)! After all, a bunch of O(1) processes in sequence is still O(1). In reality, Arrays.sort() is O(nlogn), which determines the overall complexity of your method.
Finding the largest element in an array or collection can always be O(n), though, since we can simply iterate through each element and keep track of the maximum.
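For comparison, a sketch of that O(n) single pass (returning OptionalInt for the empty case, since the return (Integer) null in the question would actually throw a NullPointerException when unboxed to int):

import java.util.OptionalInt;

public class LargestElement {
    // One pass, no sorting: O(n).
    static OptionalInt largest(int[] num) {
        if (num == null || num.length == 0) {
            return OptionalInt.empty();
        }
        int max = num[0];
        for (int i = 1; i < num.length; i++) {
            if (num[i] > max) {
                max = num[i];
            }
        }
        return OptionalInt.of(max);
    }

    public static void main(String[] args) {
        System.out.println(largest(new int[] {3, 9, 2}));   // OptionalInt[9]
        System.out.println(largest(new int[] {}));          // OptionalInt.empty
    }
}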
"You are only as fast as your slowest runner" --Fact
So the significant run-time operations here are the sorting and the stepping through the array. Arrays.sort(num) is an efficient comparison sort, so we can take it to be O(n lg(n)) (where lg(n) is log base 2 of n); big O here describes an upper bound on the running time. Furthermore, the stepping through the array takes O(n).
So we have O(n lg(n)) + O(n) + O(1) + ...
which reduces to O(n lg(n)), since lower-order terms and constant coefficients are negligible in asymptotic notation.
So your runtime is O(n lg(n)), as stated above.
Indeed, it is O(n log n). Arrays.sort() uses a dual-pivot quicksort for primitive arrays like your int[] (and a merge-sort variant, TimSort, for object arrays). Using this method may not be the best way to find a max, though. You can just loop through your array, comparing the elements instead.

How can I calculate the Big O complexity of my program?

I have a Big O notation question. Say I have a Java program that does the following things:
Read an array of Integers into a HashMap that keeps track of how many occurrences of each Integer exist in the array. [1,2,3,1] would become [1->2, 2->1, 3->1].
Then I grab the Keys from the HashMap and place them in an Array:
Set<Integer> keys = dictionary.keySet();
Integer[] keysToSort = new Integer[keys.size()];
keys.toArray(keysToSort);
Sort the keyArray using Arrays.sort.
Then iterate through the sorted keyArray grabbing the corresponding value from the HashMap, in order to display or format the results.
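(For concreteness, a minimal sketch of those four steps; the variable names follow the snippet above and are otherwise illustrative.)

import java.util.*;

public class FrequencyReport {
    public static void main(String[] args) {
        int[] input = {1, 2, 3, 1};

        // Step 1: count occurrences in a HashMap -> {1=2, 2=1, 3=1}
        Map<Integer, Integer> dictionary = new HashMap<>();
        for (int value : input) {
            dictionary.merge(value, 1, Integer::sum);
        }

        // Step 2: copy the keys into an array
        Set<Integer> keys = dictionary.keySet();
        Integer[] keysToSort = new Integer[keys.size()];
        keys.toArray(keysToSort);

        // Step 3: sort the key array
        Arrays.sort(keysToSort);

        // Step 4: walk the sorted keys, looking up each count for display
        for (Integer key : keysToSort) {
            System.out.println(key + " -> " + dictionary.get(key));
        }
    }
}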
I think I know the following:
Step 1 is O(n)
Step 3 is O(n log n) if I'm to believe the Java API
Step 4 is O(n)
Step 2: When doing this type of calculation I should know how Java implements the Set class toArray method. I would assume that it iterates through the HashMap retrieving the Keys. If that's the case I'll assume its O(n).
If sequential operations dictate I add each part then the final calculation would be
O(n + n·log n + n+n) = O(3n+n·log n).
Skip the constants and you have O(n+n log n). Can this be reduced any further or am I just completely wrong?
I believe O(n + nlogn) can be further simplified to just O(nlogn). This is because the n becomes asymptotically insignificant compared to the nlogn because they are different orders of complexity. The nlogn is of a higher order than n. This can be verified on the wikipedia page by scrolling down to the Order of Common Functions section.
When using complex data structures like hash maps you do need to know how they retrieve objects; not all data structures have the same retrieval process or the same time to retrieve elements.
This might help you with the finding the Big O of complex data types in Java:
http://www.coderfriendly.com/wp-content/uploads/2009/05/java_collections_v2.pdf
Step 2 takes O(capacity of the map).
Steps 1 and 4 can get bad if you have many keys with the same hash code (i.e. O(number of those keys) for a single lookup or change; multiply by the number of those lookups/changes), as illustrated below.
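To illustrate that degradation, here is a deliberately bad key type whose hashCode collides for every instance, so the HashMap can no longer spread entries across buckets and every operation has to search within one very long bucket (illustrative sketch; timings will vary):

import java.util.*;

public class CollidingKeys {
    static final class BadKey {
        final int value;
        BadKey(int value) { this.value = value; }
        @Override public int hashCode() { return 42; }   // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value == value;
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < 50_000; i++) {
            map.put(new BadKey(i), i);   // each put has to search the single shared bucket
        }
        System.out.println("50k colliding puts: "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}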
O(n + n·log n) = O(n·log n)
You are correct to worry a little about step 2. As far as I can tell the Java API does not specify running times for these operations.
As for O(n + n log n), Treebranch is right: you can reduce it to O(n log n). The reason is that for any constant c ≠ 0 there is some base value n0 such that n log n > c·n for all n > n0; this is obviously the case, since whatever number you choose for c, you can set n0 to 2^c + 1.
First,
Step 1 is only O(n) if inserting integers into a HashMap is O(1). In Perl, the worst case for inserting into a hash is O(N) for N items (i.e. amortised O(1)), and that's if you discount the length of the key (which is acceptable here). HashMap could be less efficient depending on how it addresses certain issues.
Second,
O(N) is O(N log N), so O(N + N log N) is O(N log N).
One thing big O doesn't tell you is how big the scaling factor is. It also assumes you have an ideal machine. The reason this is important is that reading from a file is likely to be far more expensive than everything else you do.
If you actually time this you will get something like startup cost + read time. The startup cost is likely to be the largest even for one million records. The read time will be proportional to the number of bytes read (i.e. the length of the numbers can matter). If you have 100 million records, the read time is likely to be more important. If you have one billion records, a lot will depend on the number of unique entries rather than the total number of entries. The number of unique entries is limited to ~2 billion.
BTW: to perform the counting more efficiently, try TIntIntHashMap, which can minimise object creation, making it several times faster.
Of course I am only talking about real machines which big O doesn't consider ;)
The point I am making is that you can do a big O calculation but it will not be informative as to how a real application will behave.
