I was going through the Java 8 features mentioned here and couldn't understand what parallelSort() does exactly. Can someone explain the actual difference between sort() and parallelSort()?
Parallel sort uses threading - each thread gets a chunk of the array and the chunks are sorted in parallel. These sorted chunks are then merged into the final result.
It's faster when there are a lot of elements in the collection. The overhead for parallelization (splitting into chunks and merging) becomes tolerably small on larger collections, but it is large for smaller ones.
Take a look at the benchmark table in this article (of course, the results depend on the CPU, number of cores, background processes, etc.): http://www.javacodegeeks.com/2013/04/arrays-sort-versus-arrays-parallelsort.html
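If you want to get a feel for the crossover point on your own machine, a minimal (and unscientific) micro-benchmark along these lines is enough; the array size, the use of ThreadLocalRandom, and the single timing pass without warm-up are just assumptions for illustration:

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class SortVsParallelSort {
    public static void main(String[] args) {
        // Fill a large array with pseudo-random values, then sort two identical copies.
        long[] data = ThreadLocalRandom.current().longs(20_000_000).toArray();
        long[] copy = data.clone();

        long t0 = System.nanoTime();
        Arrays.sort(data);                 // sequential sort in the calling thread
        long t1 = System.nanoTime();
        Arrays.parallelSort(copy);         // parallel sort-merge on the ForkJoin common pool
        long t2 = System.nanoTime();

        System.out.printf("sort:         %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("parallelSort: %d ms%n", (t2 - t1) / 1_000_000);
    }
}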
Arrays.parallelSort() :
The method uses a threshold value: any array smaller than the threshold is sorted with the sequential Arrays#sort() API. The threshold is calculated from the parallelism of the machine and the size of the array, like this:
private static final int getSplitThreshold(int n) {
    // parallelism of the common Fork/Join pool (roughly the number of cores)
    int p = ForkJoinPool.getCommonPoolParallelism();
    // with p workers, aim for chunks of about n / (8 * p) elements;
    // with no parallelism, the whole array is one chunk (i.e. sequential sort)
    int t = (p > 1) ? (1 + n / (p << 3)) : n;
    // never go below the minimum granularity (MIN_ARRAY_SORT_GRAN = 1 << 13 = 8192)
    return t < MIN_ARRAY_SORT_GRAN ? MIN_ARRAY_SORT_GRAN : t;
}
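To put numbers on that formula: assuming MIN_ARRAY_SORT_GRAN is 1 << 13 (8,192), as in the JDK sources, and a common-pool parallelism of 8, a 1,000,000-element array gives t = 1 + 1,000,000 / (8 << 3) = 15,626. That is above the minimum granularity, so parts of roughly 15,626 elements or fewer end up being sorted sequentially.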
Once it is decided whether to sort the array in parallel or serially, the next question is how to divide the array into multiple parts, assign each part to a Fork/Join task that takes care of sorting it, and then use another Fork/Join task to merge the sorted parts. The implementation in JDK 8 uses this approach:
Divide the array into 4 parts.
Sort the first two parts and then merge them.
Sort the next two parts and then merge them.
The above steps are repeated recursively for each part until the size of the part falls below the threshold value calculated above, at which point it is sorted sequentially.
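For illustration only, here is a heavily simplified sketch of that split-then-merge recursion on the Fork/Join framework. It is not the JDK implementation (which splits into four parts and lives in internal helper classes); the two-way split, the threshold constant and the class name are assumptions made to keep the sketch short:

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ParallelMergeSortTask extends RecursiveAction {
    private static final int THRESHOLD = 1 << 13;    // assumed granularity, for illustration only
    private final long[] a;
    private final long[] buf;                         // scratch space for merging
    private final int lo, hi;                         // sorts the half-open range a[lo, hi)

    ParallelMergeSortTask(long[] a, long[] buf, int lo, int hi) {
        this.a = a; this.buf = buf; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(a, lo, hi);                   // small part: plain sequential sort
            return;
        }
        int mid = (lo + hi) >>> 1;
        invokeAll(new ParallelMergeSortTask(a, buf, lo, mid),   // sort both halves in parallel
                  new ParallelMergeSortTask(a, buf, mid, hi));
        merge(lo, mid, hi);                           // then merge the two sorted halves
    }

    private void merge(int lo, int mid, int hi) {
        System.arraycopy(a, lo, buf, lo, hi - lo);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) a[k++] = (buf[i] <= buf[j]) ? buf[i++] : buf[j++];
        while (i < mid) a[k++] = buf[i++];
        while (j < hi)  a[k++] = buf[j++];
    }
}

// usage:
// long[] data = ...;
// ForkJoinPool.commonPool().invoke(
//         new ParallelMergeSortTask(data, new long[data.length], 0, data.length));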
You can also read the implementation details in the Javadoc
The sorting algorithm is a parallel sort-merge that breaks the array into sub-arrays that are themselves sorted and then merged. When the sub-array length reaches a minimum granularity, the sub-array is sorted using the appropriate Arrays.sort method. If the length of the specified array is less than the minimum granularity, then it is sorted using the appropriate Arrays.sort method. The algorithm requires a working space no greater than the size of the specified range of the original array. The ForkJoin common pool is used to execute any parallel tasks.
Arrays.sort():
This sorts the contents sequentially in the calling thread: a dual-pivot quicksort for primitives and TimSort (a merge-sort variant) for objects. Even though these algorithms use a divide-and-conquer technique, all of the work is done on a single thread.
Source
The key differences between the two are as follows:
1. Arrays.sort(): a sequential sort.
The API uses a single thread for the operation.
The API takes a bit longer to perform the operation.
2. Arrays.parallelSort(): a parallel sort.
The API uses multiple threads.
The API takes less time compared to sort().
For more results, we all have to wait for Java 8, I guess. Cheers!
You can refer to the javadoc, which explains that the algorithm uses several threads if the array is large enough:
The sorting algorithm is a parallel sort-merge that breaks the array into sub-arrays that are themselves sorted and then merged. When the sub-array length reaches a minimum granularity, the sub-array is sorted using the appropriate Arrays.sort method. [...] The ForkJoin common pool is used to execute any parallel tasks.
In a nutshell, parallelSort uses multiple threads. This article has way more detail if you really want to know.
From this link
Current sorting implementations provided by the Java Collections Framework (Collections.sort and Arrays.sort) all perform the sorting operation sequentially in the calling thread. This enhancement will offer the same set of sorting operations currently provided by the Arrays class, but with a parallel implementation that utilizes the Fork/Join framework. These new APIs are still synchronous with regard to the calling thread, as it will not proceed past the sorting operation until the parallel sort is complete.
Instead of –
Arrays.sort(myArray);
you can now use –
Arrays.parallelSort(myArray);
This will automatically break up the target collection into several parts, which will be sorted independently across a number of cores and then grouped back together. The only caveat here is that when called in highly multi-threaded environments, such as a busy web container, the benefits of this approach will begin to diminish (by more than 90%) due to the cost of increased CPU context switches.
Source- link
Related
The following quote is from the "Comparison with other sort algorithms" section of the Wikipedia Merge Sort page:
On typical modern architectures, efficient quicksort implementations generally outperform mergesort for sorting RAM-based arrays. On the other hand, merge sort is a stable sort and is more efficient at handling slow-to-access sequential media.
My questions:
Why does Quicksort outperform Mergesort when the data to be sorted can all fit into memory? If all the needed data are cached or in memory, wouldn't access be fast for both Quicksort and Mergesort?
Why is Mergesort more efficient at handling slow-to-access sequential data (such as from disk in the case where the data to be sorted can't all fit into memory)?
(Moved here from my comments below.) Consider an array arr of primitives (the data are sequential) with n elements. In MergeSort, a pair of elements that has to be read and compared is arr[0] and arr[n/2] (this happens in the final merge). In QuickSort, a pair of elements that has to be read and compared is arr[1] and arr[n-1] (this happens in the first partition, assuming we swap the randomly chosen pivot with the first element). We know data are read in blocks and loaded into cache, or from disk into memory (correct me if I am wrong), so isn't there a better chance that the needed data get loaded together in one block when using MergeSort? It just seems to me that MergeSort would always have the upper hand because it is likely comparing elements that are closer together. I know this is false (see the graphs from the Princeton slides referenced below) because QuickSort is obviously faster. I know MergeSort is not in-place and requires extra memory, and that is likely to slow things down. Other than that, what pieces am I missing in my analysis?
(The graphs referred to above are from the Princeton CS MergeSort and QuickSort slides.)
My Motive:
I want to understand these concepts because they are among the main reasons why mergeSort is preferred when sorting a LinkedList or other non-sequential data, and quickSort is preferred when sorting an array or other sequential data, and why mergeSort is used to sort Objects in Java while quickSort is used to sort primitive types in Java.
Update: the Java 7 API actually uses TimSort to sort Objects, which is a hybrid of MergeSort and InsertionSort, and Dual-Pivot QuickSort for primitives. These changes were implemented starting in Java SE 7 and have to do with the stability of the sorting algorithm. Why does Java's Arrays.sort method use two different sorting algorithms for different types?
Edit:
I will appreciate an answer that addresses the following aspects:
I know the two sorting algorithms differ in the number of moves, reads, and comparisons. If those are the reasons that contribute to the behaviors listed in my questions (as I suspect), then a thorough explanation of how the steps and processes of each sorting algorithm give it an advantage or disadvantage when seeking data from disk or memory would be much appreciated.
Examples are welcome. I learn better with examples.
Note: if you are reading rcgldr's answer, check out our conversation in the chat room; it has lots of good explanations and details. https://chat.stackoverflow.com/rooms/161554/discussion-between-rcgldr-and-oliver-koo
The main difference is that merge sort does more moves, but fewer compares, than quick sort. Even in the case of sorting an array of native types, quick sort is only around 15% faster, at least when I've tested it on large arrays of pseudo-random 64-bit unsigned integers, which should be quick sort's best case, on my system (Intel 3770K 3.5 GHz, Windows 7 Pro 64-bit, Visual Studio 2015, sorting 16 million pseudo-random 64-bit unsigned integers: 1.32 seconds for quick sort, 1.55 seconds for merge sort, 1.32/1.55 ~= 0.85, so quick sort was about 15% faster than merge sort). My test was with a quick sort that had no checks to avoid worst-case O(n^2) time or O(n) space. As checks are added to quick sort to reduce or prevent worst-case behavior (like falling back to heap sort if recursion becomes too deep), the speed advantage decreases to less than 10% (which is the difference I get between VS2015's implementation of std::sort (a modified quick sort) and std::stable_sort (a modified merge sort)).
If sorting "strings", it's more likely that what is being sorted is an array of pointers (or references) to those strings. This is where merge sort is faster, because the moves involve pointers, while the compares involve a level of indirection and comparison of strings.
The main reason for choosing quick sort over merge sort is not speed, but space requirement. Merge sort normally uses a second array the same size as the original. Quick sort and top-down merge sort also need log(n) stack frames for recursion, and for quick sort, limiting stack space to log(n) frames is done by recursing only on the smaller partition and looping back to handle the larger partition.
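To make that last point concrete, here is a minimal sketch (not the benchmark code used above) of a quick sort that recurses only into the smaller partition and loops on the larger one, so the stack depth stays O(log n); the Lomuto partition scheme is just an assumption chosen for brevity:

// sorts a[lo..hi] inclusive; call as quickSort(a, 0, a.length - 1)
static void quickSort(long[] a, int lo, int hi) {
    while (lo < hi) {
        int p = partition(a, lo, hi);
        if (p - lo < hi - p) {          // left side is smaller: recurse there
            quickSort(a, lo, p - 1);
            lo = p + 1;                 // then loop on the larger right side
        } else {                        // right side is smaller: recurse there
            quickSort(a, p + 1, hi);
            hi = p - 1;                 // then loop on the larger left side
        }
    }
}

// Lomuto partition with the last element as pivot
static int partition(long[] a, int lo, int hi) {
    long pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) { long t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    long t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}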
In terms of cache issues, most recent processors have 4 or 8 way associative caches. For merge sort, during a merge, the two input runs will end up in 2 of the cache lines, and the one output run in a 3rd cache line. Quick sort scans the data before doing swaps, so the scanned data will be in cache, although in separate lines if the two elements being compared / swapped are located far enough from each other.
For an external sort, some variation of bottom-up merge sort is used. This is because merge sort's merge operations are sequential (the only random access occurs when starting up a new pair of runs), which is fast in the case of hard drives, or, in legacy times, tape drives (a minimum of 3 tape drives is needed). Each read or write can be for very large blocks of data, reducing average access time per element in the case of a hard drive, since a large number of elements are read or written with each I/O.
It should also be noted that most merge sorts in libraries are also some variation of bottom up merge sort. Top down merge sort is mostly a teaching environment implementation.
If sorting an array of native types on a processor with 16 registers, such as an X86 in 64 bit mode, 8 of the registers used as start + end pointers (or references) for 4 runs, then a 4-way merge sort is often about the same or a bit faster than quick sort, assuming a compiler optimizes the pointers or references to be register based. It's a similar trade off, like quick sort, 4-way merge sort does more compares (1.5 x compares), but fewer moves (0.5 x moves) than traditional 2-way merge sort.
It should be noted that these sorts are cpu bound, not memory bound. I made a multi-threaded version of a bottom up merge sort, and in the case of using 4 threads, the sort was 3 times faster. Link to Windows example code using 4 threads:
https://codereview.stackexchange.com/questions/148025/multithreaded-bottom-up-merge-sort
I watched a talk by José Paumard on InfoQ : http://www.infoq.com/fr/presentations/jdk8-lambdas-streams-collectors (French)
The thing is I got stuck on this one point.
To collect 10 million Longs using a stream AND multithreading, we can do it this way:
Stream<Long> stream =
    Stream.generate(() -> ThreadLocalRandom.current().nextLong());
List<Long> list1 =
    stream.parallel().limit(10_000_000).collect(Collectors.toList());
But given the fact that the threads always have to check the said limit, this hinders performance.
In that talk we also see this second solution :
Stream<Long> stream =
    ThreadLocalRandom.current().longs(10_000_000).mapToObj(Long::new);
List<Long> list =
    stream.parallel().collect(Collectors.toList());
and it seems to be better performance-wise.
So here is my question: why is the second code better, and is there a better, or at least less costly, way to do it?
This is an implementation-dependent limitation. One thing that developers concerned about parallel performance have to understand is that predictable stream sizes generally help parallel performance, as they allow balanced splitting of the workload.
The issue here is that the combination of an infinite stream, as created via Stream.generate(), and limit() does not produce a stream with a predictable size, even though it looks perfectly predictable to us.
We can examine it using the following helper method:
// requires java.util.Spliterator and java.util.stream.IntStream
static void sizeOf(String op, IntStream stream) {
    final Spliterator.OfInt s = stream.spliterator();
    System.out.printf("%-18s%5d, %d%n", op, s.getExactSizeIfKnown(), s.estimateSize());
}
Then
sizeOf("randoms with size", ThreadLocalRandom.current().ints(1000));
sizeOf("randoms with limit", ThreadLocalRandom.current().ints().limit(1000));
sizeOf("range", IntStream.range(0, 100));
sizeOf("range map", IntStream.range(0, 100).map(i->i));
sizeOf("range filter", IntStream.range(0, 100).filter(i->true));
sizeOf("range limit", IntStream.range(0, 100).limit(10));
sizeOf("generate limit", IntStream.generate(()->42).limit(10));
will print
randoms with size  1000, 1000
randoms with limit   -1, 9223372036854775807
range                 100, 100
range map             100, 100
range filter           -1, 100
range limit            -1, 100
generate limit         -1, 9223372036854775807
So we see that certain sources like Random.ints(size) or IntStream.range(…) produce streams with a predictable size, and certain intermediate operations like map are capable of carrying the information along, as they know that the size is not affected. Others, like filter and limit, do not propagate the size (as a known exact size).
It's clear that filter cannot predict the actual number of elements, but it provides the source size as an estimate, which is reasonable insofar as that's the maximum number of elements that could ever pass the filter.
In contrast, the current limit implementation does not provide a size, even if the source has an exact size and we know the predictable size is as simple as min(source size, limit). Instead, it even reports a nonsensical estimated size (the source's size) despite the fact that the resulting size will never be higher than the limit. In the case of an infinite stream we have the additional obstacle that the Spliterator interface, on which streams are based, doesn't have a way to report that it is infinite. In these cases, infinite stream + limit returns Long.MAX_VALUE as an estimate, which means "I can't even guess".
Thus, as a rule of thumb, with the current implementation, a programmer should avoid using limit when there is a way to specify the desired size beforehand at the stream's source. But since limit also has significant (documented) drawbacks in the case of ordered parallel streams (which doesn't apply to randoms or generate), most developers avoid limit anyway.
Why is the second code better?
In the first case you create an infinite source, split it for parallel execution into a bunch of tasks, each providing an infinite number of elements, then limit the overall size of the result. Even though the source is unordered, this implies some overhead: the individual tasks have to talk to each other to check when the overall size is reached. If they talk often, this increases contention. If they talk less, they actually produce more numbers than necessary and then drop some of them. I believe the actual stream API implementation has the tasks talk less, but this leads to producing more numbers than necessary. This also increases memory consumption and activates the garbage collector.
In contrast, in the second case you create a finite source of known size. When the task is split into subtasks, their sizes are also well defined, and in total they produce exactly the requested number of random numbers without any need to talk to each other at all. That's why it's faster.
Is there a better, or at least less costly way to do it?
The biggest problem in your code samples is boxing. If you need 10_000_000 random numbers, it's a very bad idea to box each of them and store them in a List<Long>: you create tons of unnecessary objects, perform many heap allocations, and so on. Replace this with primitive streams:
long[] randomNumbers = ThreadLocalRandom.current().longs(10_000_000).parallel().toArray();
This would be much much faster (probably an order of magnitude).
Also, you may consider the new Java 8 SplittableRandom class. It provides roughly the same performance, but the generated random numbers are of much higher quality (including passing DieHarder 3.31.1):
long[] randomNumbers = new SplittableRandom().longs(10_000_000).parallel().toArray();
The JDK docs have a good explanation of this behavior: it is the ordering constraint that kills performance for parallel processing.
Text from the doc for the limit function - https://docs.oracle.com/javase/8/docs/api/java/util/stream/LongStream.html
While limit() is generally a cheap operation on sequential stream pipelines, it can be quite expensive on ordered parallel pipelines, especially for large values of maxSize, since limit(n) is constrained to return not just any n elements, but the first n elements in the encounter order. Using an unordered stream source (such as generate(LongSupplier)) or removing the ordering constraint with BaseStream.unordered() may result in significant speedups of limit() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with limit() in parallel pipelines, switching to sequential execution with sequential() may improve performance.
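As a hedged illustration of the workaround that paragraph describes (generate(LongSupplier) is already an unordered source, so the explicit unordered() call below mainly documents the intent; the 10_000_000 count is just carried over from the earlier examples):

// assumes: import java.util.concurrent.ThreadLocalRandom;
//          import java.util.stream.LongStream;
long[] sample = LongStream.generate(() -> ThreadLocalRandom.current().nextLong())
        .parallel()
        .unordered()              // drop the encounter-order constraint before limit()
        .limit(10_000_000)
        .toArray();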
I'm writing a Java application that transforms numbers (long) into a small set of result objects. This mapping process is very critical to the app's performance, as it is needed very often.
public static Object computeResult(long input) {
    Object result;
    // ... calculate
    return result;
}
There are about 150,000,000 different key objects, and about 3,000 distinct values.
The transformation from the input number (long) to the output (immutable object) can be computed by my algorithm with a speed of 4,000,000 transformations per second. (using 4 threads)
I would like to cache the mapping of the 150M different possible inputs to make the translation even faster, but I found some difficulties creating such a cache:
public class Cache {
    private static long[] sortedInputs; // 150M length
    private static Object[] results;    // 150M length

    public static Object lookupCachedResult(long input) {
        int index = Arrays.binarySearch(sortedInputs, input);
        return results[index];
    }
}
I tried to create two arrays with a length of 150M. The first array holds all possible input longs, and it is sorted numerically. The second array holds a reference to one of the 3000 distinct, precalculated result objects at the index corresponding to the first array's input.
To get to the cached result, I do a binary search for the input number on the first array. The cached result is then looked up in the second array at the same index.
Sadly, this cache method is not faster than computing the results. Not even half as fast, only about 1.5M lookups per second (also using 4 threads).
Can anyone think of a faster way to cache results in such a scenario?
I doubt there is a database engine that is able to answer more than 4,000,000 queries per second on, let's say, an average workstation.
Hashing is the way to go here, but I would avoid using HashMap, as it only works with objects, i.e., it must build a Long each time you insert a long, which can slow it down. Maybe this performance issue is not significant due to the JIT, but I would recommend at least trying the following and measuring performance against the HashMap variant:
Save your longs in a long array of some length n > 3000 and do the hashing by hand via a very simple (and thus efficient) hash function like index = key % n. Since you know your 3000 possible values beforehand, you can empirically find an array length n such that this trivial hash function won't cause collisions. That way you circumvent rehashing etc. and have true O(1) performance.
Secondly I would recommend you to look at Java-numerical libraries like
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are backed by native Lapack and BLAS implementations that are usually highly optimized by very smart people. Maybe you can formulate your algorithm in terms of matrix/vector-algebra such that it computes the whole long-array at one time (or chunk-wise).
There are about 150,000,000 different key objects, and about 3,000 distinct values.
With the few values, you should ensure that they get re-used (unless they're pretty small objects). For this an Interner is perfect (though you can run your own).
i tried hashmap and treemap, both attempts ended in an outOfMemoryError.
There's a huge memory overhead for both of them, and there isn't much point in using a TreeMap, as it uses a sort of binary search, which you've already tried.
There are at least three implementations of a long-to-Object map available; google for "primitive collections". These should use only slightly more memory than your two arrays. With hashing usually being O(1) (let's ignore the worst case, as there's no reason for it to happen, is there?) and much better memory locality, it'll beat(*) your binary search by a factor of 20. Your binary search needs log2(150e6), i.e., about 27 steps, while hashing may need on average maybe two. This depends on how tightly you pack the hash table; that's usually a parameter given when it gets created.
In case you run your own (which you most probably shouldn't), I'd suggest using an array of size 1 << 28, i.e., 268,435,456 entries, so that you can use bitwise operations for indexing.
(*) Such predictions are hard, but I'm sure it's worth trying.
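If you do decide to run your own despite that advice, a bare-bones version of such a table could look roughly like the sketch below: open addressing with linear probing, the suggested power-of-two capacity, and bit masking for indexing. The class name, the mixing constant, and the use of 0 as the "empty slot" marker are all assumptions of this sketch, so adapt them to your data:

// Not a drop-in implementation: assumes the key 0 never occurs (0 marks an empty slot)
// and that the table is pre-sized so it never fills up.
class LongToObjectMap {
    private static final int CAPACITY = 1 << 28;     // 268,435,456 slots, as suggested above
    private static final int MASK = CAPACITY - 1;
    private final long[] keys = new long[CAPACITY];       // 0 == empty slot
    private final Object[] values = new Object[CAPACITY];

    private static int indexFor(long key) {
        long h = key * 0x9E3779B97F4A7C15L;           // cheap multiplicative mixing
        return (int) (h >>> 36) & MASK;               // take the 28 high bits as the index
    }

    void put(long key, Object value) {
        int i = indexFor(key);
        while (keys[i] != 0 && keys[i] != key) {      // linear probing on collision
            i = (i + 1) & MASK;
        }
        keys[i] = key;
        values[i] = value;
    }

    Object get(long key) {
        for (int i = indexFor(key); keys[i] != 0; i = (i + 1) & MASK) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;                                  // not cached
    }
}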
I have an array whose values I need to sort in increasing order. The possible values in the array are between 1 and 9, and there will be many repeating values. (FYI: I'm working on a sudoku solver and trying to solve the puzzle starting with the box with the fewest possibilities, using backtracking.)
The first idea that comes to my mind is to use Shell Sort.
I did some looking up and found out that the Java Collections API uses a "modified mergesort" (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist).
So I wish to know if the differences in performance will be noticeable if I implement my own sorting algorithm.
If you only have 9 possible values, you probably want counting sort - the basic idea is:
Create an array of counts of size 9.
Iterate through the array and increment the corresponding index in the count array for each element.
Go through the count array and recreate the original array.
The running time of this would be O(n + 9) = O(n), whereas the running time of the standard API sort will be O(n log n).
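A minimal sketch of that idea for values known to lie in 1-9 (the method name and the int[] element type are assumptions here):

static void countingSort(int[] a) {
    int[] counts = new int[10];            // indices 1..9 used, index 0 unused
    for (int v : a) {
        counts[v]++;                       // tally each value
    }
    int k = 0;
    for (int v = 1; v <= 9; v++) {         // rewrite the array in ascending order
        for (int c = 0; c < counts[v]; c++) {
            a[k++] = v;
        }
    }
}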
So yes, this will most likely be faster than the standard comparison-based sort that the Java API uses, but only a benchmark will tell you for sure (and it could depend on the size of your data).
In general, I'd suggest that you first try using the standard API sort, and see if it's fast enough - it's literally just 1 line of code (except if you have to define a comparison function), compared to quite a few more for creating your own sorting function, and quite a bit of effort has gone into making sure it's as fast as possible, while keeping it generic.
If that's not fast enough, try to find and implement a sort that works well with your data. For example:
Insertion sort works well on data that's already almost sorted (although the running time is pretty terrible if the data is far from sorted).
Distribution sorts are worth considering if you have numeric data.
As noted in the comment, Arrays.parallelSort (from Java 8) is also an option worth considering, since it multi-threads the work (which sort doesn't do, and is certainly quite a bit of effort to do yourself ... efficiently).
I am on a mission to sort a somewhat large array of unsigned, 64-bit, randomly generated integers (over 5E7 elements). Can you direct me to a parallel sorting algorithm that might exhibit almost linear speedup, at least in the case of random data?
I am working with Java, in case it makes any difference with regard to fast sorting.
Edit: Note that this question is primarily concerned with parallel sorts capable to achieve near-linear speedup. (Meaning, when the amount of executing cores grows from P to 2P, the time spent by a parallel sort drops to 55 - 50 percent of the computation performed on P cores.)
Well, if you have a lot of memory you can use Bucketsort. Another algorithm that goes well with parallelism is Quicksort.
From the Wikipedia article on Quicksort,
Like merge sort, quicksort can also be parallelized due to its divide-and-conquer nature. Individual in-place partition operations are difficult to parallelize, but once divided, different sections of the list can be sorted in parallel. The following is a straightforward approach: if we have p processors, we can divide a list of n elements into p sublists in O(n) average time, then sort each of these in O((n/p) log(n/p)) average time. Ignoring the O(n) preprocessing and merge times, this is linear speedup. If the split is blind, ignoring the values, the merge naïvely costs O(n). If the split partitions based on a succession of pivots, it is tricky to parallelize and naïvely costs O(n). Given O(log n) or more processors, only O(n) time is required overall, whereas an approach with linear speedup would achieve O(log n) time overall.
Obviously mergesort is another alternative. I think quicksort gives better average-case performance.
Quicksort and merge sort are both fairly easy to parallelize. Oracle has a fork/join-based integer merge sort here, which you could probably use (if not as-is, then at least as inspiration).
Say you have a few computers (5 on an Amazon cluster, right?) and you want ascending sorting. Split your array into smaller chunks so that each fits on one machine.
Assuming you have n chunks/arrays, have each machine quicksort its chunk. This sorting will happen in parallel (more or less, depending on chunk size, machine speed, etc.). When the sorting is done, have the machines merge the chunks.
You can do this in 2 ways:
2 machines at a time (you're building a merge tree). The merging will happen, again, in parallel. The problem is that the arrays grow big due to merging and you have to cache to disk, so when you merge again the machine reads from disk. So there is some penalty here.
You can do n machines at a time: have one coordinator machine which takes the min from all the other machines' arrays. This way the coordinator machine builds the entire sorted array by repeatedly taking the smallest number from each of the other sorted arrays.
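As a single-machine sketch of that second option, a k-way merge with a min-heap captures the idea of the coordinator repeatedly taking the smallest head element; in the distributed setting the array reads would be calls to the other machines (the method name and the long[][] chunk representation are just assumptions):

import java.util.PriorityQueue;

static long[] kWayMerge(long[][] sortedChunks) {
    int total = 0;
    for (long[] chunk : sortedChunks) {
        total += chunk.length;
    }

    // heap entries: {value, chunkIndex, positionInChunk}, ordered by value
    PriorityQueue<long[]> heap = new PriorityQueue<>((x, y) -> Long.compare(x[0], y[0]));
    for (int c = 0; c < sortedChunks.length; c++) {
        if (sortedChunks[c].length > 0) {
            heap.add(new long[] { sortedChunks[c][0], c, 0 });
        }
    }

    long[] result = new long[total];
    int k = 0;
    while (!heap.isEmpty()) {
        long[] top = heap.poll();              // smallest head element across all chunks
        int c = (int) top[1];
        int pos = (int) top[2];
        result[k++] = top[0];
        if (pos + 1 < sortedChunks[c].length) {
            heap.add(new long[] { sortedChunks[c][pos + 1], c, pos + 1 });
        }
    }
    return result;
}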
Bitonic sort is an algorithm targeted for parallel machines. Here is a sequential Java version and a parallel C++ version to help you get started.