I'm using Collections.sort with a custom comparator class. I've heard that this has O(N log N) runtime complexity. I'm curious to know what happens on subsequent sorts when the collection hasn't changed.
For example, let's say I have an ArrayList of Eggs, each of which has an approximate size field (which my comparator sorts by). If I insert ten eggs into the ArrayList and sort it, I can expect it to take O(N log N) time.
If I sort it again, without adding, removing, or changing any elements, will it still take N log N time?
The Javadoc says 'the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist'. That appears to mean the merge step does no work on already-ordered sublists, so a second sort should be quicker.
You could always test it.
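For instance, a rough sketch of such a test (the Egg class and its approximateSize field are just the question's hypothetical example, and this is a crude timing rather than a proper benchmark):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class SortTwiceTest {
    // Hypothetical Egg from the question, sorted by an approximate size field.
    static class Egg {
        final double approximateSize;
        Egg(double approximateSize) { this.approximateSize = approximateSize; }
    }

    public static void main(String[] args) {
        Random random = new Random();
        List<Egg> eggs = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            eggs.add(new Egg(random.nextDouble()));
        }
        Comparator<Egg> bySize = Comparator.comparingDouble(e -> e.approximateSize);

        long t0 = System.nanoTime();
        Collections.sort(eggs, bySize);   // first sort: random order
        long t1 = System.nanoTime();
        Collections.sort(eggs, bySize);   // second sort: already sorted
        long t2 = System.nanoTime();

        System.out.printf("first sort:  %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("second sort: %d ms%n", (t2 - t1) / 1_000_000);
        // A real benchmark should warm up the JIT and repeat the measurement.
    }
}
```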
I have not analysed the code in the current Sun Java library. However, the Javadoc states that a merge sort is used, and most merge sorts yield O(n) performance on an already-sorted collection, although this is not stated in the documentation. My personal experience has shown really good performance on sorted or nearly sorted lists.
Per the Javadoc, Collections.sort uses a merge sort algorithm.
You can see how the various algorithms behave for yourself here -> http://www.sorting-algorithms.com/
To expand on EJP's answer: if the documentation indicates that the merge pass is the step that gets skipped, the list is still split recursively into sublists across roughly lg N levels, even in this best case. What you save is the linear scan performed at each of those levels during the merge, and that is where the improvement in efficiency comes from.
Related
I have an array whose values I need to sort in increasing order. The possible values in the array are between 1 and 9, and there will be a lot of repeated values. (FYI: I'm working on a Sudoku solver and trying to solve the puzzle starting with the box with the fewest possibilities, using backtracking.)
The first idea that comes to my mind is to use Shell Sort.
I did some looking up and found out that the Java collections sort uses a "modified mergesort" (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist).
So I wish to know if the differences in performance will be noticeable if I implement my own sorting algorithm.
If you only have 9 possible values, you probably want counting sort - the basic idea is:
Create an array of counts of size 9.
Iterate through the array and increment the corresponding index in the count array for each element.
Go through the count array and recreate the original array.
The running time of this would be O(n + 9) = O(n), whereas the running time of the standard API sort will be O(n log n).
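A minimal sketch of that idea, assuming the values are plain ints known to be in the range 1-9:

```java
public class CountingSortDemo {
    // Counting sort for values known to be in the range 1-9.
    static void countingSort(int[] values) {
        int[] counts = new int[10];            // counts[v] = number of occurrences of v
        for (int v : values) {
            counts[v]++;
        }
        int i = 0;
        for (int v = 1; v <= 9; v++) {         // rebuild the array in sorted order
            for (int c = counts[v]; c > 0; c--) {
                values[i++] = v;
            }
        }
    }

    public static void main(String[] args) {
        int[] values = {9, 1, 4, 4, 2, 9, 1};
        countingSort(values);
        System.out.println(java.util.Arrays.toString(values));   // [1, 1, 2, 4, 4, 9, 9]
    }
}
```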
So yes, this will most likely be faster than the standard comparison-based sort that the Java API uses, but only a benchmark will tell you for sure (and it could depend on the size of your data).
In general, I'd suggest that you first try using the standard API sort and see if it's fast enough. It's literally just one line of code (plus a comparison function, if you need one), compared to quite a few more for creating your own sorting function, and quite a bit of effort has gone into making the library sort as fast as possible while keeping it generic.
If that's not fast enough, try to find and implement a sort that works well with your data. For example:
Insertion sort works well on data that's already almost sorted (although the running time is pretty terrible if the data is far from sorted).
Distribution sorts are worth considering if you have numeric data.
As noted in the comment, Arrays.parallelSort (from Java 8) is also an option worth considering, since it multi-threads the work (which sort doesn't do, and is certainly quite a bit of effort to do yourself ... efficiently).
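For reference, the parallel variant is also a one-liner (the array contents below are made up):

```java
import java.util.Arrays;

public class ParallelSortDemo {
    public static void main(String[] args) {
        int[] values = {9, 1, 4, 4, 2, 9, 1};
        // Java 8+: splits large arrays across the common fork/join pool and merges the
        // sorted pieces; for small arrays it simply falls back to a sequential sort.
        Arrays.parallelSort(values);
        System.out.println(Arrays.toString(values));   // [1, 1, 2, 4, 4, 9, 9]
    }
}
```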
I am optimizing an implementation of a sorted LinkedList.
To insert an element I traverse the list and compare each element until I have the correct index, then break the loop and insert.
I would like to know if there is any other way that I can insert the element at the same time as traversing the list to reduce the insert from O(n + (n capped at size()/2)) to O(n).
A ListIterator is almost what I'm after because of its add() method, but unfortunately, in the case where there are elements in the list equal to the one being inserted, the new element has to be placed after them. To implement this, ListIterator would need a peek(), which it doesn't have.
Edit: I have my answer, but will add this anyway since a lot of people haven't understood correctly:
I am searching for an insertion point AND inserting, which combined is higher than O(n).
You may consider a skip list, which is implemented using multiple linked lists at varying granularity. E.g. the linked list at level 0 contains all items, level 1 only links to every 2nd item on average, level 2 to only every 4th item on average, etc.... Searching starts from the top level and gradually descends to lower levels until it finds an exact match. This logic is similar to a binary search. Thus search and insertion is an O(log n) operation.
A concrete example in the Java class library is ConcurrentSkipListSet (although it may not be directly usable for you here).
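For illustration, here is roughly how it behaves; note the duplicate being dropped, which is why it may not fit this question directly:

```java
import java.util.concurrent.ConcurrentSkipListSet;

public class SkipListDemo {
    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<>();
        set.add(5);                    // expected O(log n) insert
        set.add(1);
        set.add(3);
        boolean added = set.add(3);    // duplicate: rejected, because it is a Set
        System.out.println(added);     // false
        System.out.println(set);       // [1, 3, 5], iterated in sorted order
    }
}
```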
I'd favor Péter Török's suggestion, but I'd still like to add something for the iterator approach:
Note that ListIterator provides a previous() method to iterate through the list backwards. Thus first iterate until you find the first element that is greater and then go to the previous element and call add(...). If you hit the end, i.e. all elements are smaller, then just call add(...) without going back.
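A minimal sketch of that iterator-based insert (the helper name insertSorted is mine):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

public class SortedInsert {
    // Inserts value into an already-sorted list, after any equal elements,
    // in a single forward pass using the iterator's add().
    static <T extends Comparable<? super T>> void insertSorted(List<T> list, T value) {
        ListIterator<T> it = list.listIterator();
        while (it.hasNext()) {
            if (it.next().compareTo(value) > 0) {
                it.previous();   // step back so add() places value before the larger element
                break;
            }
        }
        it.add(value);           // if we ran off the end, this simply appends
    }

    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>(Arrays.asList(1, 3, 3, 7));
        insertSorted(list, 3);
        System.out.println(list);   // [1, 3, 3, 3, 7]
    }
}
```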
I have my answer, but will add this anyway since a lot of people haven't understood correctly: I am searching for an insertion point AND inserting, which combined is higher than O(n).
You need to maintain a collection of (possibly) non-unique elements that can be iterated in an order given by an ordering function. This can be achieved in a variety of ways. (In the following I use "total insertion cost" to mean the cost of inserting a number (N) of elements into an initially empty data structure.)
A singly or doubly linked list offers O(N^2) total insertion cost (whether or not you combine the steps of finding the position and doing the insertion!), and O(N) iteration cost.
A TreeSet offers O(N log N) total insertion cost and O(N) iteration cost, but it has the restriction of no duplicates.
A tree-based multiset (e.g. TreeMultiset) has the same complexity as a TreeSet, but allows duplicates.
A skip-list data structure also has the same complexity as the previous two.
Clearly, the complexity measures say that a data structure that uses a linked list performs the worst as N gets large. For this particular group of requirements, a well-implemented tree-based multiset is probably the best, assuming there is only one thread accessing the collection. If the collection is heavily used by many threads (and it is a set), then a ConcurrentSkipListSet is probably better.
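If you don't want to pull in a third-party library like Guava for TreeMultiset, a tree-based multiset can be approximated with a TreeMap of element counts; a rough sketch:

```java
import java.util.Map;
import java.util.TreeMap;

public class TreeCountsDemo {
    public static void main(String[] args) {
        // Emulate a tree-based multiset with a TreeMap of element -> count.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int x : new int[] {5, 1, 5, 3, 1}) {
            counts.merge(x, 1, Integer::sum);   // O(log n) insert, duplicates allowed
        }
        // Iterate in sorted order, expanding the counts: 1 1 3 5 5
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                System.out.print(e.getKey() + " ");
            }
        }
        System.out.println();
    }
}
```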
You also seem to have a misconception about how "big O" measures combine. If I have one step of an algorithm that is O(N) and a second step that is also O(N), then the two steps combined are STILL O(N) ... not "more than O(N)". You can derive this from the definition of "big O". (I won't bore you with the details, but the math is simple.)
Does Collections.sort(list) check whether the list is already sorted, or is it maybe O(1) for some other reason?
Or, is it a good idea to have a flag sorted and set it to true/false upon calling sort()/adding an element to the list?
How can you determine whether a list is sorted without looking at it? It won't be O(1); determining whether a list is sorted takes at least O(n).
That would mean that if Collections.sort did bother to check whether the list was sorted first, each sorting operation would take an average of O(n) + O(n log n).
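For reference, the linear check itself is trivial (the helper name isSorted is mine):

```java
import java.util.List;

public class SortedCheck {
    // One pass over the list: O(n) comparisons; there is no general way to do better.
    static <T extends Comparable<? super T>> boolean isSorted(List<T> list) {
        T previous = null;
        for (T current : list) {
            if (previous != null && previous.compareTo(current) > 0) {
                return false;
            }
            previous = current;
        }
        return true;
    }
}
```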
As a matter of fact, as of Java 7, Java has switched from mergesort to TimSort (named after Python dev Tim Peters, who implemented it for CPython first) for some sorting tasks.
While it's not O(1), sorting an already sorted or partially sorted list with TimSort is considerably more efficient than sorting a completely random data set (for the latter there's no way to do better than O(n log n) with a comparison sort; that's not true for non-random data).
There is no way it's O(1); you can't check whether the collection is sorted faster than O(n). Having a flag should be fine, but it's hard to say for sure without knowing exactly what you are doing.
Generally speaking, sorting an already sorted list doesn't go any faster (except with simple sorts like bubble sort). In some cases pre-sorted input is actually slower.
In the case of Collections.sort(), it is no faster to sort a sorted list.
I am trying to clear up some things regarding complexity in some of the operations of TreeSet. On the javadoc it says:
"This implementation provides
guaranteed log(n) time cost for the
basic operations (add, remove and
contains)."
So far so good. My question is what happens on addAll(), removeAll() etc. Here the javadoc for Set says:
"If the specified collection is also a
set, the addAll operation effectively
modifies this set so that its value is
the union of the two sets."
Is it just explaining the logical outcome of the operation or is it giving a hint about the complexity? I mean, if the two sets are represented by e.g. red-black trees it would be better to somehow join the trees than to "add" each element of one to the other.
In any case, is there a way to combine two TreeSets into one with O(log n) complexity?
Thank you in advance. :-)
You could imagine how it would be possible to optimize special cases to O(log n), but the worst case has got to be O(m log n) where m and n are the number of elements in each tree.
Edit:
http://net.pku.edu.cn/~course/cs101/resource/Intro2Algorithm/book6/chap14.htm
Describes a special-case algorithm that can join trees in O(log(m + n)), but note the restriction: all members of S1 must be less than all members of S2. This is what I meant by there being optimizations for special cases.
Looking at the Java source for TreeSet, it looks like if the passed-in collection is a SortedSet then it uses an O(n) time algorithm. Otherwise it calls super.addAll, which I'm guessing will result in O(n log n).
EDIT - guess I read the code too fast; TreeSet can only use the O(n) algorithm if its backing map is empty.
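To put that in code (my own example values): the linear-time path only applies when building a fresh TreeSet from an already-sorted set, while addAll on a non-empty set falls back to element-by-element insertion:

```java
import java.util.Arrays;
import java.util.TreeSet;

public class TreeSetUnion {
    public static void main(String[] args) {
        TreeSet<Integer> a = new TreeSet<>(Arrays.asList(1, 3, 5));
        TreeSet<Integer> b = new TreeSet<>(Arrays.asList(2, 4, 6));

        // Building from an already-sorted set hits the linear-time path,
        // because the new set's backing map is empty at construction time.
        TreeSet<Integer> union = new TreeSet<>(a);

        // The target is no longer empty, so this inserts element by element:
        // roughly O(m log(n + m)).
        union.addAll(b);

        System.out.println(union);   // [1, 2, 3, 4, 5, 6]
    }
}
```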
According to this blog post:
http://rgrig.blogspot.com/2008/06/java-api-complexity-guarantees.html
it's O(n log n). Because the documentation gives no hints about the complexity, you might want to write your own algorithm if the performance is critical for you.
It is not possible to merge trees or join sets as in disjoint-set data structures, because you don't know whether the elements in the two trees are disjoint. Since neither data structure has knowledge of the other's contents, it is necessary to check whether each element already exists in the other tree before adding it, or at least attempt to add it and abort if it is found along the way.
So it should be O(M log N).
I think it is MergeSort, which is O(n log n).
However, the following output disagrees:
-1,0000000099000391,0000000099000427
1,0000000099000427,0000000099000346
5,0000000099000391,0000000099000346
1,0000000099000427,0000000099000345
5,0000000099000391,0000000099000345
1,0000000099000346,0000000099000345
I am sorting a nodelist of 4 nodes by sequence number, and the sort is doing 6 comparisons.
I am puzzled because 6 > (4 log(4)). Can someone explain this to me?
P.S. It is mergesort, but I still don't understand my results.
Thanks for the answers everyone. Thank you Tom for correcting my math.
O(n log n) doesn't mean that the number of comparisons will be equal to or less than n log n, just that the time taken will scale proportionally to n log n. Try doing tests with 8 nodes, or 16 nodes, or 32 nodes, and checking out the timing.
You sorted four nodes, so you didn't get merge sort; the sort switched to insertion sort. (With four elements, insertion sort makes at most 4·3/2 = 6 comparisons, which matches your output.)
In Java, the Arrays.sort() methods use merge sort or a tuned quicksort depending on the datatypes and for implementation efficiency switch to insertion sort when fewer than seven array elements are being sorted. (Wikipedia, emphasis added)
Arrays.sort is used indirectly by the Collections classes.
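If you want to count the comparisons yourself, a quick sketch with a counting comparator (the exact count depends on the JDK's sort implementation and on the input order):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CountComparisons {
    public static void main(String[] args) {
        List<Integer> nodes = new ArrayList<>(Arrays.asList(4, 3, 2, 1));
        AtomicInteger comparisons = new AtomicInteger();
        Comparator<Integer> counting = (a, b) -> {
            comparisons.incrementAndGet();   // tally every comparison the sort makes
            return Integer.compare(a, b);
        };
        Collections.sort(nodes, counting);
        System.out.println(nodes);                                // [1, 2, 3, 4]
        System.out.println("comparisons = " + comparisons.get()); // varies by JDK and input
    }
}
```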
A recently accepted bug report indicates that the Sun implementation of Java will use Python's timsort in the future: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124
(The timsort monograph, linked above, is well worth reading.)
An algorithm A(n) that processes an amount of data n is in O(f(n)), for some function f, if there exist two strictly positive constants C_inf and C_sup such that:
C_inf · f(n) < ExpectedValue(OperationCount(A(n))) < C_sup · f(n)
Two things to note:
The actual constants C could be anything, and they depend on the relative costs of operations (depending on the language, the VM, the architecture, or your actual definition of an operation). On some platforms, for instance, + and * have the same cost; on others, the latter is an order of magnitude slower.
The quantity ascribed as "in O(f(n))" is an expected operation count, based on some probably arbitrary model of the data you are dealing with. For instance, if your data is almost completely sorted, a merge-sort algorithm is going to be mostly O(n), not O(n log n).
I've written some stuff you may be interested in about the Java sort algorithm and taken some performance measurements of Collections.sort(). The algorithm at present is a mergesort with an insertion sort once you get down to a certain size of sublists (N.B. this algorithm is very probably going to change in Java 7).
You should really take the Big O notation as an indication of how the algorithm will scale overall; for a particular sort, the precise time will deviate from the time predicted by this calculation (as you'll see on my graph, the two sort algorithms that are combined each have different performance characteristics, and so the overall time for a sort is a bit more complex).
That said, as a rough guide, for every time you double the number of elements, if you multiply the expected time by 2.2, you won't be far out. (It doesn't make much sense really to do this for very small lists of a few elements, though.)
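As a rough sanity check of that factor: doubling n multiplies n log n by 2·log(2n)/log(n), which you can evaluate directly (the sizes below are arbitrary):

```java
public class DoublingFactor {
    public static void main(String[] args) {
        // Prints factors of roughly 2.1 to 2.2 for these sizes.
        for (int n = 1_000; n <= 1_000_000; n *= 10) {
            double factor = (2.0 * n * Math.log(2.0 * n)) / (n * Math.log(n));
            System.out.printf("n = %,d -> factor = %.2f%n", n, factor);
        }
    }
}
```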