What sort does Java Collections.sort(nodes) use?

I think it is MergeSort, which is O(n log n).
However, the following output disagrees:
-1,0000000099000391,0000000099000427
1,0000000099000427,0000000099000346
5,0000000099000391,0000000099000346
1,0000000099000427,0000000099000345
5,0000000099000391,0000000099000345
1,0000000099000346,0000000099000345
I am sorting a nodelist of 4 nodes by sequence number, and the sort is doing 6 comparisons.
I am puzzled because 6 > (4 log(4)). Can someone explain this to me?
P.S. It is mergesort, but I still don't understand my results.
Thanks for the answers everyone. Thank you Tom for correcting my math.

O(n log n) doesn't mean that the number of comparisons will be equal to or less than n log n, just that the time taken will scale proportionally to n log n. Try doing tests with 8 nodes, or 16 nodes, or 32 nodes, and checking out the timing.
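A quick way to see the actual counts for yourself is to wrap your comparator so it counts calls. A rough sketch (Java 8+; the class name is made up and the exact counts will vary by JDK version and input order):

    import java.util.*;

    public class CountingSortDemo {
        public static void main(String[] args) {
            for (int n : new int[] {4, 8, 16, 32}) {
                List<Integer> list = new ArrayList<>();
                Random rnd = new Random(42);
                for (int i = 0; i < n; i++) list.add(rnd.nextInt(1000));

                final int[] comparisons = {0};            // counter captured by the comparator
                Collections.sort(list, (a, b) -> {
                    comparisons[0]++;                     // count every comparison the sort performs
                    return a.compareTo(b);
                });
                System.out.println(n + " elements: " + comparisons[0]
                        + " comparisons (n log2 n ~= "
                        + Math.round(n * Math.log(n) / Math.log(2)) + ")");
            }
        }
    }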

You sorted four nodes, so you didn't get merge sort; the sort switched to insertion sort.
In Java, the Arrays.sort() methods use merge sort or a tuned quicksort depending on the datatypes and for implementation efficiency switch to insertion sort when fewer than seven array elements are being sorted. (Wikipedia, emphasis added)
Arrays.sort is used indirectly by the Collections classes.
A recently accepted bug report indicates that the Sun implementation of Java will use Python's timsort in the future: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6804124
(The timsort monograph, linked above, is well worth reading.)

An algorithm A(n) that processes an amount of data n is in O(f(n)), for some function f, if there exist two strictly positive constants C_inf and C_sup such that, for all sufficiently large n:
C_inf · f(n) < ExpectedValue(OperationCount(A(n))) < C_sup · f(n)
Two things to note:
The actual constants C could be anything, and depend on the relative costs of operations (depending on the language, the VM, the architecture, or your actual definition of an operation). On some platforms, for instance, + and * have the same cost; on others, the latter is an order of magnitude slower.
The quantity ascribed as "in O(f(n))" is an expected operation count, based on some probably arbitrary model of the data you are dealing with. For instance, if your data is almost completely sorted, a merge-sort algorithm is going to be mostly O(n), not O(n . Log(n)).

I've written some stuff you may be interested in about the Java sort algorithm and taken some performance measurements of Collections.sort(). The algorithm at present is a mergesort with an insertion sort once you get down to a certain size of sublists (N.B. this algorithm is very probably going to change in Java 7).
You should really take the Big O notation as an indication of how the algorithm will scale overall; for a particular sort, the precise time will deviate from the time predicted by this calculation (as you'll see on my graph, the two sort algorithms that are combined each have different performance characteristics, and so the overall time for a sort is a bit more complex).
That said, as a rough guide, for every time you double the number of elements, if you multiply the expected time by 2.2, you won't be far out. (It doesn't make much sense really to do this for very small lists of a few elements, though.)
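To see roughly where a factor like 2.2 comes from: if the time is approximately proportional to n log2 n, then doubling the input gives a ratio of 2n log2(2n) / (n log2 n) = 2 (log2 n + 1) / log2 n. For n = 100,000 that works out to about 2 × 17.6 / 16.6 ≈ 2.12; the constant overheads and cache effects of a real implementation account for the rest of the gap.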

Related

If array needs to be sorted would it count as part of the binary search algorithm

I am trying to understand the speed of the Binary Search algorithm.
I understand it needs to operate on a sorted array.
However, if the array comes in unsorted and we have to perform the sorting ourselves, wouldn't the sorting be part of the Binary Search, and thus make its performance slower?
I am confused because I think that there is very little chance to use this algorithm if the data does not come in sorted.
And if my code needs to sort it, then why doesn't that count towards the search algorithm?
Sorry if I am being confusing.
Thank you for helping.
You can't just point at an algorithm and say: It's got O(n^2) complexity!
That's what people usually say, sure. But that's shorthand. They're omitting things; assuming that the listener / reader will make assumptions.
You need to fully describe the exact algorithm, the conditions under which it is applied, and the precise definition of n and any other variable.
Then, you can answer that question. The problem you're having here is that the definition of 'what is the performance of binary search' is unclear. If you assume it means X whilst your buddy assumes it means Y, and you then argue about the answers, you're not actually having a constructive debate at all. You're just tilting at windmills; the real problem is that neither of you figured out the problem is communicating the basics.
Given that there is some confusion here, I'll give you 3 different, more or less equally sensible, fleshed-out definitions, along with the actual answer for each one. Hint: for one of them, 'binary search' isn't the fastest algorithm!
Given [1] a list that is already sorted, and [2] a single value, write me an algorithm that determines if this value is in the list or not.
The best answer would be: A binary search algorithm, and its complexity would be O(log n).
Given [1] a list that is not sorted, and [2] a single value, write me an algorithm that determines if this value is in the list or not.
The best answer would be: Just iterate through the list. Its complexity would be O(n), and binary search is not part of this answer at all.
Given [1] a list that is not sorted, and [2] a list of tests, whereby each individual test is defined by a single value, but they all use the same unsorted input list, write an algorithm that will, for each test, determine if the value for that test is in the list or not, and then give me the amortized complexity (basically, the complexity of the whole thing, divided by the # of tests we ran).
Then the best answer would be: First sort the list, spending O(n log n) time to do so, but we get to amortize that over the test case count, and then use binary search for each individual test, adding an O(log n) complexity to each test. If we term n the size of the input list and t the number of tests we have, this gets us:
O((n log n)/t + log n)
Which is the actual answer to the question, complex as it may look. But, if t is large or even considered effectively infinite in size, OR we add one more rider to the question:
The list from [1] is given to you in advance and, within reasonable time and memory limits, you may preprocess this data without needing to amortize these costs across your test cases
then that boils down to just O(log n), as the large value for t makes that (n log n) / t factor approach zero.
In communicating this to your buddy, given that we don't talk in entire scientific papers, one might then say: "The algorithmic complexity of the binary search algorithm is O(log n)", even if that omits a gigantic chunk of the full story.
You interpret the question as per the second case (input is unsorted, the input comprises both the list and the value to search for, no multi-test clause). Someone who says 'binary search is O(log n)' is labouring under either the first or third. You're both right.
NB: The third definition seems unusually complicated. However, it matches common scenarios. For example: 'we have compiled a list of the folks living in town and their phone numbers, and we want to print it in a giant book so that recipients of this book can look up phone numbers. Over the lifetime of a single print run we expect the 100,000 recipients of the book to each do on average about 50 lookups, for a grand total of 5 million lookups against this single list.' That gives you t = 5 million and n = 200,000 (let's say 200k people live here, half of whom get a phonebook). Plug those numbers in and sorting the phonebook wins by a landslide versus releasing the phonebook in arbitrary, unsorted order, even though, yes, you start out 'down' the effort of sorting it and won't make up for that loss until a fair few folks have speedily looked up a few phone numbers.
Yes. If
the data comes in unsorted
you only need to search for one element
...then you would have to first sort the data to use binary search, which would take a total of O(n log n + log n) = O(n log n) time.
But once the data is sorted, you can then binary search on that data as many times as you want. You don't have to sort it again each time.
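A minimal Java sketch of that 'sort once, then search as often as you like' idea (the data and the queried values are made up for illustration):

    import java.util.Arrays;

    public class LookupDemo {
        public static void main(String[] args) {
            int[] data = {42, 7, 19, 3, 88, 51};   // unsorted input

            Arrays.sort(data);                      // pay the O(n log n) cost once

            // every subsequent lookup is O(log n); no re-sorting needed
            for (int query : new int[] {19, 60, 88}) {
                boolean found = Arrays.binarySearch(data, query) >= 0;
                System.out.println(query + " found: " + found);
            }
        }
    }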

Formula for determining number of comparisons in a sort?

I'm curious if there's a formula/rule to find the total number of comparisons done in a sorting algorithm, particularly merge sort, selection sort, and insertion sort. I'm pretty sure with selection sort the rule is n(n-1)/2, where n is the number of elements being sorted. I thought the same was the case for insertion sort, but apparently that's not true according to a practice Java test I took (with a list of 6 items the insertion sort makes 14 comparisons, according to the answer key, and 15 comparisons with a selection sort). So now I'm confused.
In general, there are no precise formulae for a couple of reasons:
There is sufficient scope for variation in implementing various sort algorithms that there can be small discrepancies between different implementations of the same algorithm.
With some algorithms, the number of comparisons depends on the elements and their initial order.
And certainly, there is no "one size fits all" formula that covers all algorithms in (say) the same complexity class.
One clue: if a sort algorithm has a best case or worst case complexity that is different from its average complexity, then the number of comparisons or the number of moves depends on the inputs.
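To make that input dependence concrete, here is a rough sketch with my own toy implementations (not the ones from any particular test): selection sort always performs n(n-1)/2 comparisons, while insertion sort's count depends on how close the input already is to sorted order.

    import java.util.Arrays;

    public class ComparisonCounts {
        static int insertionSortComparisons(int[] a) {
            int count = 0;
            for (int i = 1; i < a.length; i++) {
                int key = a[i], j = i - 1;
                while (j >= 0) {
                    count++;                                  // one comparison against the key
                    if (a[j] > key) { a[j + 1] = a[j]; j--; } else break;
                }
                a[j + 1] = key;
            }
            return count;                                     // depends on the initial order
        }

        static int selectionSortComparisons(int[] a) {
            int count = 0;
            for (int i = 0; i < a.length - 1; i++) {
                int min = i;
                for (int j = i + 1; j < a.length; j++) {
                    count++;                                  // always n(n-1)/2 in total
                    if (a[j] < a[min]) min = j;
                }
                int t = a[i]; a[i] = a[min]; a[min] = t;
            }
            return count;
        }

        public static void main(String[] args) {
            int[] input = {5, 2, 4, 6, 1, 3};                 // any 6-element list will do
            System.out.println("insertion: " + insertionSortComparisons(Arrays.copyOf(input, input.length)));
            System.out.println("selection: " + selectionSortComparisons(Arrays.copyOf(input, input.length)));
        }
    }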

QuickSort and MergeSort performance on Sequential data fit in memory vs Slow to Access Sequential data on disk

The following quote is from the "Comparison with other sort algorithms" section of the Wikipedia Merge Sort page:
On typical modern architectures, efficient quicksort implementations generally outperform mergesort for sorting RAM-based arrays.[citation needed] On the other hand, merge sort is a stable sort and is more efficient at handling slow-to-access sequential media.
My questions:
Why does Quicksort outperform Mergesort when the data to be sorted can all fit into memory? If all data needed are cached or in memory wouldn't it be fast for both Quicksort and Mergesort to access?
Why is Mergesort more efficient at handling slow-to-access sequential data (such as from disk in the case where the data to be sorted can't all fit into memory)?
(Moved from my comments below to here.) Consider an array arr of primitives (so the data is sequential in memory) with n elements. The pair of elements that has to be read and compared in MergeSort can be arr[0] and arr[n/2] (this happens in the final merge). Now consider the pair of elements that has to be read and compared in QuickSort: arr[0] and arr[n-1] (this happens in the first partition, assuming we swap the randomly chosen pivot with the first element). We know data is read in blocks and loaded into cache, or from disk into memory (correct me if I am wrong), so isn't there a better chance that the needed data gets loaded together in one block when using MergeSort? It just seems to me MergeSort would always have the upper hand because it is likely comparing elements that are closer together. I know this is false (see graph below) because QuickSort is obviously faster. I know MergeSort is not in-place and requires extra memory, and that is likely to slow things down. Other than that, what pieces am I missing in my analysis?
images are from Princeton CS MergeSort and QuickSort slides
My Motive:
I want to understand these concepts because they are among the main reasons why mergeSort is preferred when sorting a LinkedList, or non-sequential data, and quickSort is preferred when sorting an Array, or sequential data, and also why mergeSort is used to sort Objects in Java while quickSort is used to sort primitive types.
Update: the Java 7 API actually uses TimSort to sort Objects, which is a hybrid of MergeSort and InsertionSort, and Dual-Pivot QuickSort for primitives. These changes were introduced starting in Java SE 7. This has to do with the stability of the sorting algorithm. See: Why does Java's Arrays.sort method use two different sorting algorithms for different types?
Edit:
I will appreciate an answer that addresses the following aspects:
I know the two sorting algorithms differ in the number of moves, reads, and comparisons. If those are the reasons behind the behaviours listed in my questions (I suspect they are), then a thorough explanation of how the steps of each sorting algorithm give it advantages or disadvantages when fetching data from disk or memory would be much appreciated.
Examples are welcome. I learn better with examples.
Note: if you are reading rcgldr's answer, check out our conversation in the chat room; it has lots of good explanations and details. https://chat.stackoverflow.com/rooms/161554/discussion-between-rcgldr-and-oliver-koo
The main difference is that merge sort does more moves but fewer compares than quick sort. Even in the case of sorting an array of native types, quick sort is only around 15% faster, at least when I've tested it on large arrays of pseudo-random 64-bit unsigned integers, which should be quick sort's best case, on my system (Intel 3770K 3.5 GHz, Windows 7 Pro 64-bit, Visual Studio 2015; sorting 16 million pseudo-random 64-bit unsigned integers took 1.32 seconds for quick sort and 1.55 seconds for merge sort, and 1.32/1.55 ~= 0.85, so quick sort was about 15% faster than merge sort). My test was with a quick sort that had no checks to avoid worst-case O(n^2) time or O(n) space. As checks are added to quick sort to reduce or prevent worst-case behavior (like falling back to heap sort if recursion becomes too deep), the speed advantage decreases to less than 10% (which is the difference I get between VS2015's implementation of std::sort (a modified quick sort) and std::stable_sort (a modified merge sort)).
If sorting "strings", it's more likely that what is being sorted is an array of pointers (or references) to those strings. This is where merge sort is faster, because the moves involve pointers, while the compares involve a level of indirection and comparison of strings.
The main reason for choosing quick sort over merge sort is not speed, but space requirement. Merge sort normally uses a second array the same size as the original. Quick sort and top down merge sort also need log(n) stack frames for recursion, and for quick sort limiting stack space to log(n) stack frames is done by only recursing on the smaller partition, and looping back to handle the larger partition.
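The 'recurse on the smaller partition, loop on the larger one' idea looks roughly like this. This is a sketch with a plain Lomuto partition, not the tuned library code:

    class QuickSortSketch {
        // Call as quickSort(a, 0, a.length - 1). Recursing only into the smaller
        // partition keeps the stack depth at O(log n).
        static void quickSort(int[] a, int lo, int hi) {
            while (lo < hi) {
                int p = partition(a, lo, hi);
                if (p - lo < hi - p) {          // left side is smaller
                    quickSort(a, lo, p - 1);    // recurse on the smaller side
                    lo = p + 1;                 // loop back for the larger side
                } else {                        // right side is smaller
                    quickSort(a, p + 1, hi);
                    hi = p - 1;
                }
            }
        }

        static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi];                  // Lomuto scheme: last element as pivot
            int i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) {
                    int t = a[i]; a[i] = a[j]; a[j] = t;
                    i++;
                }
            }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;
            return i;                           // pivot's final position
        }
    }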
In terms of cache issues, most recent processors have 4 or 8 way associative caches. For merge sort, during a merge, the two input runs will end up in 2 of the cache lines, and the one output run in a 3rd cache line. Quick sort scans the data before doing swaps, so the scanned data will be in cache, although in separate lines if the two elements being compared / swapped are located far enough from each other.
For an external sort, some variation of bottom-up merge sort is used. This is because merge sort's merge operations are sequential (the only random access occurs when starting up a new pair of runs), which is fast in the case of hard drives, or in legacy times, tape drives (a minimum of 3 tape drives is needed). Each read or write can be for very large blocks of data, reducing average access time per element in the case of a hard drive, since a large number of elements are read or written at a time with each I/O.
It should also be noted that most merge sorts in libraries are also some variation of bottom up merge sort. Top down merge sort is mostly a teaching environment implementation.
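For reference, a bare-bones bottom-up merge sort looks roughly like this (a sketch, not any particular library's implementation): runs of width 1, 2, 4, ... are merged in passes, with no recursion.

    class BottomUpMergeSortSketch {
        static void sort(int[] a) {
            int n = a.length;
            int[] work = new int[n];                        // the usual second array of the same size
            for (int width = 1; width < n; width *= 2) {
                for (int lo = 0; lo < n; lo += 2 * width) { // merge neighbouring runs a[lo..mid) and a[mid..hi)
                    int mid = Math.min(lo + width, n);
                    int hi = Math.min(lo + 2 * width, n);
                    merge(a, work, lo, mid, hi);
                }
                System.arraycopy(work, 0, a, 0, n);         // copy the merged pass back
            }
        }

        static void merge(int[] src, int[] dst, int lo, int mid, int hi) {
            int i = lo, j = mid;
            for (int k = lo; k < hi; k++) {                 // sequential reads from both runs, sequential writes
                if (i < mid && (j >= hi || src[i] <= src[j])) dst[k] = src[i++];
                else dst[k] = src[j++];
            }
        }
    }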
If sorting an array of native types on a processor with 16 registers, such as an X86 in 64 bit mode, 8 of the registers used as start + end pointers (or references) for 4 runs, then a 4-way merge sort is often about the same or a bit faster than quick sort, assuming a compiler optimizes the pointers or references to be register based. It's a similar trade off, like quick sort, 4-way merge sort does more compares (1.5 x compares), but fewer moves (0.5 x moves) than traditional 2-way merge sort.
It should be noted that these sorts are cpu bound, not memory bound. I made a multi-threaded version of a bottom up merge sort, and in the case of using 4 threads, the sort was 3 times faster. Link to Windows example code using 4 threads:
https://codereview.stackexchange.com/questions/148025/multithreaded-bottom-up-merge-sort

Collections.Sort performance on subsequent sorts?

I'm using Collections.sort with a custom comparator class. I've heard that this has O(N log N) runtime complexity. I'm curious to know what happens on subsequent sorts when the collection hasn't changed.
For example, let's say I have an ArrayList of Eggs, each of which has an approximate size field (which my comparator sorts by). If I insert ten eggs into the array list and sort it, I can expect it to take O(N log N) time.
If I sort it again, without adding, removing, or changing any elements, will it still take N log N time?
The Javadoc says 'the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist'. That appears to mean no merging work is done in that case, so it should be quicker.
You could always test it.
I have not analysed the code in the current Sun Java library. However, the Javadoc states that a merge sort is used. Most merge sorts yield O(n) performance on an already-sorted collection, although this is not stated in the documentation. My personal experience has shown really good performance on sorted or nearly sorted lists.
Per the Javadoc, Collections.sort uses a merge sort algorithm.
You can see how it does, for yourself, here -> http://www.sorting-algorithms.com/
To expand on EJP's answer: if the documentation indicates that the merge pass is the step that gets skipped, then in this best case the list is still split into lg N levels of subproblems, but each merge collapses to a single comparison, so the linear merge work that normally multiplies against those lg N levels is what gets saved.
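The check the Javadoc describes amounts to the single if test below. This is a sketch of the idea in a plain top-down merge sort, not the actual JDK source; on an already-sorted list every merge is skipped and only that one comparison per merge step remains:

    class MergeSkipSketch {
        // Call as sort(a, new int[a.length], 0, a.length); sorts a[lo..hi).
        static void sort(int[] a, int[] work, int lo, int hi) {
            if (hi - lo <= 1) return;
            int mid = (lo + hi) >>> 1;
            sort(a, work, lo, mid);
            sort(a, work, mid, hi);
            if (a[mid - 1] <= a[mid]) return;   // highest of low half already <= lowest of high half: skip the merge
            merge(a, work, lo, mid, hi);        // otherwise do the usual linear merge
        }

        static void merge(int[] a, int[] work, int lo, int mid, int hi) {
            System.arraycopy(a, lo, work, lo, hi - lo);     // copy the range, then merge back into a
            int i = lo, j = mid;
            for (int k = lo; k < hi; k++) {
                if (i < mid && (j >= hi || work[i] <= work[j])) a[k] = work[i++];
                else a[k] = work[j++];
            }
        }
    }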

Why does java.util.Arrays.sort(Object[]) use 2 kinds of sorting algorithms?

I found that java.util.Arrays.sort(Object[]) uses 2 kinds of sorting algorithms (in JDK 1.6).
pseudocode:
if (array.length < 7)
    insertionSort(array);
else
    mergeSort(array);
Why does it need 2 kinds of sorting here? For efficiency?
It's important to note that an algorithm that is O(N log N) is not always faster in practice than an O(N^2) algorithm. It depends on the constants, and the range of N involved. (Remember that asymptotic notation measures relative growth rate, not absolute speed).
For small N, insertion sort in fact does beat merge sort. It's also faster for almost-sorted arrays.
Here's a quote:
Although it is one of the elementary sorting algorithms with O(N^2) worst-case time, insertion sort is the algorithm of choice either when the data is nearly sorted (because it is adaptive) or when the problem size is small (because it has low overhead).
For these reasons, and because it is also stable, insertion sort is often used as the recursive base case (when the problem size is small) for higher overhead divide-and-conquer sorting algorithms, such as merge sort or quick sort.
Here's another quote from Best sorting algorithm for nearly sorted lists paper:
straight insertion sort is best for small or very nearly sorted lists
What this means is that, in practice:
Some algorithm A1 with a higher asymptotic upper bound may be preferable to another known algorithm A2 with a lower asymptotic upper bound
Perhaps A2 is just too complicated to implement
Or perhaps it doesn't matter in the range of N considered
See e.g. Coppersmith–Winograd algorithm
Some hybrid algorithms may adapt different algorithms depending on the input size
Related questions
Which sorting algorithm is best suited to re-sort an almost fully sorted list?
Is there ever a good reason to use Insertion Sort?
A numerical example
Let's consider these two functions:
f(x) = 2x^2; this function has a quadratic growth rate, i.e. "O(N^2)"
g(x) = 10x; this function has a linear growth rate, i.e. "O(N)"
Now let's plot the two functions together:
Source: WolframAlpha: plot 2x^2 and 10x for x from 0 to 10
Note that between x=0..5, f(x) <= g(x), but for any larger x, f(x) quickly outgrows g(x).
Analogously, if A1 is a quadratic algorithm with a low overhead, and A2 is a linear algorithm with a high overhead, for smaller input, A1 may be faster than A2.
Thus, you can, should you choose to do so, create a hybrid algorithm A3 which simply selects one of the two algorithms depending on the size of the input. Whether or not this is worth the effort depends on the actual parameters involved.
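As a sketch of such an A3, using the threshold of 7 from the pseudocode in the question (everything else here is illustrative, not the JDK's actual implementation):

    class HybridSortSketch {
        private static final int INSERTION_SORT_THRESHOLD = 7;   // cutoff taken from the question's pseudocode

        // Call as hybridSort(a, 0, a.length); sorts a[lo..hi).
        static void hybridSort(int[] a, int lo, int hi) {
            if (hi - lo < INSERTION_SORT_THRESHOLD) {
                insertionSort(a, lo, hi);                         // low overhead wins on tiny ranges
                return;
            }
            int mid = (lo + hi) >>> 1;                            // otherwise divide and conquer
            hybridSort(a, lo, mid);
            hybridSort(a, mid, hi);
            merge(a, lo, mid, hi);
        }

        static void insertionSort(int[] a, int lo, int hi) {
            for (int i = lo + 1; i < hi; i++)
                for (int j = i; j > lo && a[j - 1] > a[j]; j--) {
                    int t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
                }
        }

        static void merge(int[] a, int lo, int mid, int hi) {
            int[] left = java.util.Arrays.copyOfRange(a, lo, mid); // buffer for the low half
            int i = 0, j = mid;
            for (int k = lo; k < hi; k++) {
                if (i < left.length && (j >= hi || left[i] <= a[j])) a[k] = left[i++];
                else a[k] = a[j++];
            }
        }
    }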
Many tests and comparisons of sorting algorithms have been made, and it was decided that because insertion sort beats merge sort for small arrays, it was worth it to implement both for Arrays.sort.
It's for speed. The overhead of mergeSort is high enough that for short arrays it would be slower than insertion sort.
Quoted from: http://en.wikipedia.org/wiki/Insertion_sort
Some divide-and-conquer algorithms such as quicksort and mergesort sort by recursively dividing the list into smaller sublists which are then sorted. A useful optimization in practice for these algorithms is to use insertion sort for sorting small sublists, where insertion sort outperforms these more complex algorithms. The size of list for which insertion sort has the advantage varies by environment and implementation, but is typically between eight and twenty elements.
It appears that they believe mergeSort(array) is slower for short arrays. Hopefully they actually tested that.
