Parallelism in Java: Divide and Conquer, Quick Sort [duplicate]

I am experimenting with parallelizing algorithms in Java. I began with merge sort, and posted my attempt in this question. My revised attempt is in the code below, where I now try to parallelize quick sort.
Are there any rookie mistakes in my multi-threaded implementation or approach to this problem? If not, shouldn't I expect more than a 32% speed increase between a sequential and a parallelized algorithm on a dual-core machine (see timings at bottom)?
Here is the multithreading algorithm:
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ThreadedQuick extends Thread
{
    final int MAX_THREADS = Runtime.getRuntime().availableProcessors();
    CountDownLatch doneSignal;
    static int num_threads = 1;
    int[] my_array;
    int start, end;

    public ThreadedQuick(CountDownLatch doneSignal, int[] array, int start, int end) {
        this.my_array = array;
        this.start = start;
        this.end = end;
        this.doneSignal = doneSignal;
    }

    public static void reset() {
        num_threads = 1;
    }

    public void run() {
        quicksort(my_array, start, end);
        doneSignal.countDown();
        num_threads--;
    }

    // medianOfThree and swap are helper methods defined elsewhere in the full class.
    public void quicksort(int[] array, int start, int end) {
        int len = end - start + 1;
        if (len <= 1)
            return;

        int pivot_index = medianOfThree(array, start, end);
        int pivotValue = array[pivot_index];
        swap(array, pivot_index, end);

        int storeIndex = start;
        for (int i = start; i < end; i++) {
            if (array[i] <= pivotValue) {
                swap(array, i, storeIndex);
                storeIndex++;
            }
        }
        swap(array, storeIndex, end);

        if (num_threads < MAX_THREADS) {
            num_threads++;
            CountDownLatch completionSignal = new CountDownLatch(1);
            new ThreadedQuick(completionSignal, array, start, storeIndex - 1).start();
            quicksort(array, storeIndex + 1, end);
            try {
                completionSignal.await(1000, TimeUnit.SECONDS);
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        } else {
            quicksort(array, start, storeIndex - 1);
            quicksort(array, storeIndex + 1, end);
        }
    }
}
Here is how I start it off:
ThreadedQuick.reset();
CountDownLatch completionSignal = new CountDownLatch(1);
new ThreadedQuick(completionSignal, array, 0, array.length - 1).start();
try {
    completionSignal.await(1000, TimeUnit.SECONDS);
} catch (Exception ex) {
    ex.printStackTrace();
}
I tested this against Arrays.sort and a similar sequential quick sort algorithm. Here are the timing results on an Intel dual-core Dell laptop, in seconds:
Elements     sequential   threaded    Arrays.sort
500,000      0.068592     0.046871    0.079677
1,000,000    0.14416      0.095492    0.167155
2,000,000    0.301666     0.205719    0.350982
4,000,000    0.623291     0.424119    0.712698
8,000,000    1.279374     0.859363    1.487671
Each number above is the average time of 100 tests, throwing out the 3 lowest and 3 highest cases. I used Random.nextInt(Integer.MAX_VALUE) to generate an array for each test, which was initialized once every 10 tests with the same seed. Each test consisted of timing the given algorithm with System.nanoTime. I rounded to six decimal places after averaging. And obviously, I did check to see if each sort worked.
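For reference, each timed run followed this basic shape (a sketch; the surrounding harness is omitted):

long t0 = System.nanoTime();
// ... run the sort under test on the pre-generated array ...
double seconds = (System.nanoTime() - t0) / 1e9;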
As you can see, there is about a 32% increase in speed between the sequential and threaded cases in every set of tests. As I asked above, shouldn't I expect more than that?

Making num_threads static can cause problems; since it is read and updated from multiple threads without synchronization, it is highly likely that you will end up with more than MAX_THREADS threads running at some point.
Probably the reason why you don't get a full doubling in performance is that your quick sort cannot be fully parallelised. Note that the first call to quicksort does a partitioning pass over the whole array in the initial thread before it starts to really run in parallel. There is also overhead in parallelising an algorithm, in the form of context switching and mode transitions when farming work off to separate threads.
Have a look at the Fork/Join framework; this problem would probably fit quite neatly there.
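To make that suggestion concrete, here is a self-contained sketch (mine, not part of this answer) of the same quicksort expressed as a RecursiveAction; a plain middle-element pivot stands in for the question's medianOfThree:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ForkJoinQuick extends RecursiveAction {
    private static final int THRESHOLD = 10_000; // below this, recurse sequentially (tunable)
    private final int[] a;
    private final int lo, hi;

    public ForkJoinQuick(int[] a, int lo, int hi) {
        this.a = a;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < 1)
            return;
        int p = partition(a, lo, hi);
        if (hi - lo + 1 <= THRESHOLD) {
            // small range: plain recursion, no forking overhead
            new ForkJoinQuick(a, lo, p - 1).compute();
            new ForkJoinQuick(a, p + 1, hi).compute();
        } else {
            // large range: let the pool's work stealing balance the halves
            invokeAll(new ForkJoinQuick(a, lo, p - 1),
                    new ForkJoinQuick(a, p + 1, hi));
        }
    }

    // Lomuto partition around a middle-element pivot; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + (hi - lo) / 2, hi);
        int pivot = a[hi], store = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] <= pivot)
                swap(a, i, store++);
        swap(a, store, hi);
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        java.util.Random rand = new java.util.Random(42);
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++)
            data[i] = rand.nextInt();
        new ForkJoinPool().invoke(new ForkJoinQuick(data, 0, data.length - 1));
    }
}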
A couple of points on the implementation. Implement Runnable rather than extending Thread. Extending Thread should be reserved for creating some new version of the Thread class itself; when you just want some job to run in parallel, you are better off with Runnable. While implementing Runnable you can also still extend another class, which gives you more flexibility in OO design. Use a thread pool that is restricted to the number of threads available in the system. Also, don't use num_threads to decide whether to fork off a new thread; you can calculate this up front. Use a minimum partition size, which is the size of the total array divided by the number of processors available. Something like:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ThreadedQuick implements Runnable {

    public static final int MAX_THREADS = Runtime.getRuntime().availableProcessors();

    static final ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);

    final int[] my_array;
    final int start, end;
    private final int minPartitionSize;

    public ThreadedQuick(int minPartitionSize, int[] array, int start, int end) {
        this.minPartitionSize = minPartitionSize;
        this.my_array = array;
        this.start = start;
        this.end = end;
    }

    public void run() {
        quicksort(my_array, start, end);
    }

    public void quicksort(int[] array, int start, int end) {
        int len = end - start + 1;
        if (len <= 1)
            return;

        int pivot_index = medianOfThree(array, start, end);
        int pivotValue = array[pivot_index];
        swap(array, pivot_index, end);

        int storeIndex = start;
        for (int i = start; i < end; i++) {
            if (array[i] <= pivotValue) {
                swap(array, i, storeIndex);
                storeIndex++;
            }
        }
        swap(array, storeIndex, end);

        if (len > minPartitionSize) {
            // Sort the left half on the pool while this thread sorts the right half.
            ThreadedQuick quick = new ThreadedQuick(minPartitionSize, array, start, storeIndex - 1);
            Future<?> future = executor.submit(quick);
            quicksort(array, storeIndex + 1, end);
            try {
                future.get(1000, TimeUnit.SECONDS);
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        } else {
            quicksort(array, start, storeIndex - 1);
            quicksort(array, storeIndex + 1, end);
        }
    }
}
You can kick it off by doing:
ThreadedQuick quick = new ThreadedQuick(array.length / ThreadedQuick.MAX_THREADS, array, 0, array.length - 1);
quick.run();
This will start the sort in the same thread, which avoids an unnecessary thread hop at start up.
Caveat: Not sure the above implementation will actually be faster as I haven't benchmarked it.
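One more caveat, an observation of mine rather than something benchmarked: the fixed thread pool above is a static field and is never shut down, so its non-daemon worker threads will keep the JVM alive after sorting completes. Releasing it at the end of the program would look something like:

// Assumed cleanup step: let in-flight tasks finish, then release the pool's threads.
ThreadedQuick.executor.shutdown();
try {
    ThreadedQuick.executor.awaitTermination(1, TimeUnit.MINUTES);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}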

This uses a combination of quick sort and merge sort.
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSortMain {

    public static void main(String... args) throws InterruptedException {
        Random rand = new Random();
        final int[] values = new int[100 * 1024 * 1024];
        for (int i = 0; i < values.length; i++)
            values[i] = rand.nextInt();

        // Sort one block per processor in parallel.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService es = Executors.newFixedThreadPool(threads);
        int blockSize = (values.length + threads - 1) / threads;
        for (int i = 0; i < values.length; i += blockSize) {
            final int min = i;
            final int max = Math.min(min + blockSize, values.length);
            es.submit(new Runnable() {
                @Override
                public void run() {
                    Arrays.sort(values, min, max);
                }
            });
        }
        es.shutdown();
        es.awaitTermination(10, TimeUnit.MINUTES);

        // Merge pairs of sorted blocks, doubling the run length each pass.
        for (int blockSize2 = blockSize; blockSize2 < values.length; blockSize2 *= 2) {
            for (int i = 0; i < values.length; i += blockSize2 * 2) {
                final int min = i;
                final int mid = Math.min(min + blockSize2, values.length);
                final int max = Math.min(min + blockSize2 * 2, values.length);
                mergeSort(values, min, mid, max);
            }
        }
    }

    // Merges the sorted ranges [left, mid) and [mid, end) back into values.
    private static void mergeSort(int[] values, int left, int mid, int end) {
        int[] results = new int[end - left];
        int l = left, r = mid, m = 0;
        while (l < mid && r < end) {
            int lv = values[l];
            int rv = values[r];
            if (lv < rv) {
                results[m++] = lv;
                l++;
            } else {
                results[m++] = rv;
                r++;
            }
        }
        while (l < mid)
            results[m++] = values[l++];
        while (r < end)
            results[m++] = values[r++];
        System.arraycopy(results, 0, values, left, results.length);
    }
}
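A cheap sanity check to run at the end of main, after the merge passes (my addition, not part of the original snippet):

// Confirm the final array really is sorted end to end.
for (int i = 1; i < values.length; i++)
    if (values[i - 1] > values[i])
        throw new AssertionError("not sorted at index " + i);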

Couple of comments if I understand your code right:
I don't see a lock around the num_threads field even though it can be accessed from multiple threads. Perhaps you should make it an AtomicInteger (a sketch follows this list).
Use a thread pool, and arrange the tasks, i.e. a single call to quicksort, to take advantage of it. Use Futures.
The way you currently divide the work can give a smaller partition its own thread while a larger partition goes without one; that is, it doesn't prioritize larger segments for their own threads.
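A minimal sketch of that AtomicInteger idea (the class and method names here are mine, not from the question):

import java.util.concurrent.atomic.AtomicInteger;

// Makes the check-and-increment on the thread count a single atomic step,
// so the MAX_THREADS cap cannot be overshot by concurrent callers.
class ThreadBudget {
    static final int MAX_THREADS = Runtime.getRuntime().availableProcessors();
    private static final AtomicInteger numThreads = new AtomicInteger(1);

    // Returns true if the caller may start one more thread.
    static boolean tryReserve() {
        while (true) {
            int current = numThreads.get();
            if (current >= MAX_THREADS)
                return false;
            if (numThreads.compareAndSet(current, current + 1))
                return true;
        }
    }

    // Called by a worker thread when it finishes.
    static void release() {
        numThreads.decrementAndGet();
    }
}

In the question's quicksort, if (num_threads < MAX_THREADS) { num_threads++; ... } would become if (ThreadBudget.tryReserve()) { ... }, with ThreadBudget.release() replacing num_threads--.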

Related

Is my multi-threaded linear search flawed?

In the pursuit of learning I have written a multi-threaded linear search, designed to operate on an int[] array. I believe the search works as intended; however, after completing it I tested it against a standard for loop and was surprised to see that the for loop beat my search in terms of speed every time. I've tried tinkering with the code, but cannot get the search to beat a basic for loop. At the moment I am wondering the following:
Is there an obvious flaw in my code that I am not seeing?
Is my code perhaps not well optimised for CPU caches?
Is this just the overheads of multi-threading slowing down my program and so I need a larger array to reap the benefits?
Unable to work it out myself, I am hoping someone here may be able to point me in the right direction, leading to my question:
Is there an inefficiency/flaw in my code that is making it slower than a standard loop, or is this just the overheads of threading slowing it down?
The Search:
public class MLinearSearch {

    private MLinearSearch() {}

    public static int[] getMultithreadingPositions(int[] data, int processors) {
        int pieceSize = data.length / processors;
        int remainder = data.length % processors;
        int curPosition = 0;
        int[] results = new int[processors + 1];
        for (int i = 0; i < results.length - 1; i++) {
            results[i] = curPosition;
            curPosition += pieceSize;
            if (i < remainder) {
                curPosition++;
            }
        }
        results[results.length - 1] = data.length;
        return results;
    }

    public static int search(int target, int[] data) {
        MLinearSearch.processors = Runtime.getRuntime().availableProcessors();
        MLinearSearch.foundIndex = -1;
        int[] domains = MLinearSearch.getMultithreadingPositions(data, processors);
        Thread[] threads = new Thread[MLinearSearch.processors];
        for (int i = 0; i < MLinearSearch.processors; i++) {
            MLSThread searcher = new MLSThread(target, data, domains[i], domains[i + 1]);
            searcher.setDaemon(true);
            threads[i] = searcher;
            searcher.run();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                return MLinearSearch.foundIndex;
            }
        }
        return MLinearSearch.foundIndex;
    }

    private static class MLSThread extends Thread {

        private MLSThread(int target, int[] data, int start, int end) {
            this.counter = start;
            this.dataEnd = end;
            this.target = target;
            this.data = data;
        }

        @Override
        public void run() {
            while (this.counter < this.dataEnd && MLinearSearch.foundIndex == -1) {
                if (this.target == this.data[this.counter]) {
                    MLinearSearch.foundIndex = this.counter;
                    return;
                }
                counter++;
            }
        }

        private int counter;
        private int dataEnd;
        private int target;
        private int[] data;
    }

    private static volatile int foundIndex = -1;
    private static volatile int processors;
}
Note: "getMultithreadingPositions" is normally in a separate class. I have copied the method here for simplicity.
This is how I've been testing the code. Another test (omitted here, but in the same file and run) runs the basic for loop, which beats my multi-threaded search every time.
import static org.junit.Assert.assertEquals;

import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

import org.junit.Test;

public class SearchingTest {

    @Test
    public void multiLinearTest() {
        int index = MLinearSearch.search(TARGET, arrayData);
        assertEquals(TARGET, arrayData[index]);
    }

    private static int[] getShuffledArray(int[] array) {
        // https://stackoverflow.com/questions/1519736/random-shuffling-of-an-array
        Random rnd = ThreadLocalRandom.current();
        for (int i = array.length - 1; i > 0; i--) {
            int index = rnd.nextInt(i + 1);
            int a = array[index];
            array[index] = array[i];
            array[i] = a;
        }
        return array;
    }

    private static final int[] arrayData = SearchingTest.getShuffledArray(IntStream.range(0, 55_000_000).toArray());
    private static final int TARGET = 7;
}
The loop beating this is literally just a for loop that iterates over the same array. I would imagine that for smaller arrays the for loop would win out, as its simplicity lets it get going before my multi-threaded search can initiate its threads. At the array size I am trying, though, I would have expected a single thread to lose out.
Note: I had to increase my heap size with the following JVM argument to avoid an out-of-memory error:
-Xmx4096m
Thank you for any help offered.

Why should we call join after invokeAll method?

I am trying to learn about the ForkJoinPool framework and came across the below example:
import java.util.concurrent.RecursiveTask;

public class ArrayCounter extends RecursiveTask<Integer> {

    int[] array;
    int threshold = 100_000;
    int start;
    int end;

    public ArrayCounter(int[] array, int start, int end) {
        this.array = array;
        this.start = start;
        this.end = end;
    }

    protected Integer compute() {
        if (end - start < threshold) {
            return computeDirectly();
        } else {
            int middle = (end + start) / 2;
            ArrayCounter subTask1 = new ArrayCounter(array, start, middle);
            ArrayCounter subTask2 = new ArrayCounter(array, middle, end);
            invokeAll(subTask1, subTask2);
            return subTask1.join() + subTask2.join();
        }
    }

    protected Integer computeDirectly() {
        Integer count = 0;
        for (int i = start; i < end; i++) {
            if (array[i] % 2 == 0) {
                count++;
            }
        }
        return count;
    }
}
Main:
import java.util.Random;
import java.util.concurrent.ForkJoinPool;

public class ForkJoinRecursiveTaskTest {

    static final int SIZE = 10_000_000;
    static int[] array = randomArray();

    public static void main(String[] args) {
        ArrayCounter mainTask = new ArrayCounter(array, 0, SIZE);
        ForkJoinPool pool = new ForkJoinPool();
        Integer evenNumberCount = pool.invoke(mainTask);
        System.out.println("Number of even numbers: " + evenNumberCount);
    }

    static int[] randomArray() {
        int[] array = new int[SIZE];
        Random random = new Random();
        for (int i = 0; i < SIZE; i++) {
            array[i] = random.nextInt(100);
        }
        return array;
    }
}
According to the Java docs, invokeAll() submits the tasks to the pool and returns the results as well. Hence there should be no need for a separate join(). Can someone please explain why a separate join is needed in this case?
In your example you are using RecursiveTask<Integer>, so you are expecting the compute() method to return a value.
Let's look at the invokeAll(t1, t2) signature:
static void invokeAll(ForkJoinTask<?> t1, ForkJoinTask<?> t2)
So invokeAll() doesn't return a value.
According to the documentation:
Forks the given tasks, returning when isDone holds for each task or an (unchecked) exception is encountered, in which case the exception is rethrown.
So:
return subTask1.join() + subTask2.join(); is the key in your example.
Both tasks are joined after each completes, and the result is passed recursively up to the next call of the compute() method.
task.join()
Returns the result of the computation when it is done.
As per javadoc, join
Returns the result of the computation when it is done. This method
differs from get() in that abnormal completion results in
RuntimeException or Error, not ExecutionException, and that interrupts
of the calling thread do not cause the method to abruptly return by
throwing InterruptedException.
So, when the task is done, join() gets you the computed value, and the two values are then added together:
return subTask1.join() + subTask2.join();
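For two subtasks, a sketch (mine, not from either answer) of what invokeAll plus the joins amounts to; forking one task and computing the other in the current thread is the usual hand-written equivalent:

// Equivalent shape to invokeAll(subTask1, subTask2) followed by the two joins:
subTask1.fork();                      // hand subTask1 to the pool
Integer right = subTask2.compute();   // run subTask2 in the current thread
Integer left = subTask1.join();       // wait for (and fetch) subTask1's result
return left + right;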

Java: how to optimize sum of big array

I am trying to solve a problem on Codeforces and I get a Time Limit Exceeded judgment. The only time-consuming operation is calculating the sum of a big array, so I've tried to optimize it, but with no result.
What I want: optimize the following function:
// The array could be up to Integer.MAX_VALUE in length.
private long canonicalSum(int[] array) {
    int sum = 0; // accumulates in an int, so the result wraps on overflow
    for (int i = 0; i < array.length; i++)
        sum += array[i];
    return sum;
}
Question1 [main]: Is it possible to optimize canonicalSum?
I've tried: to avoid operations on very big numbers, I decided to use auxiliary data. For instance, I convert array1[100] to array2[10], where array2[i] = array1[10*i] + array1[10*i+1] + ... + array1[10*i+9].
private long optimizedSum(int[] array, int step) {
    do {
        array = sumItr(array, step);
    } while (array.length != 1);
    return array[0];
}

private int[] sumItr(int[] array, int step) {
    boolean needCompensation = array.length % step != 0;
    int length = array.length / step + (needCompensation ? 1 : 0);
    int[] aux = new int[length];
    for (int i = 0, auxSum = 0, auxPointer = 0; i < array.length; i++) {
        auxSum += array[i];
        if ((i + 1) % step == 0) {
            aux[auxPointer++] = auxSum;
            auxSum = 0;
        }
        // The final, partial group is flushed separately.
        if (i == array.length - 1 && needCompensation) {
            aux[auxPointer++] = auxSum;
        }
    }
    return aux;
}
Problem: it appears that canonicalSum is ten times faster than optimizedSum. Here is my test:
@Test
public void sum_comparison() {
    final int ARRAY_SIZE = 100000000;
    final int STEP = 1000;
    int[] array = genRandomArray(ARRAY_SIZE);

    System.out.println("Start canonical Sum");
    long beg1 = System.nanoTime();
    long sum1 = canonicalSum(array);
    long end1 = System.nanoTime();
    long time1 = end1 - beg1;
    System.out.println("canon: " + TimeUnit.MILLISECONDS.convert(time1, TimeUnit.NANOSECONDS) + " milliseconds");

    System.out.println("Start optimizedSum");
    long beg2 = System.nanoTime();
    long sum2 = optimizedSum(array, STEP);
    long end2 = System.nanoTime();
    long time2 = end2 - beg2;
    System.out.println("custom: " + TimeUnit.MILLISECONDS.convert(time2, TimeUnit.NANOSECONDS) + " milliseconds");

    assertEquals(sum1, sum2);
    assertTrue(time2 <= time1);
}

private int[] genRandomArray(int size) {
    int[] array = new int[size];
    Random random = new Random();
    for (int i = 0; i < array.length; i++) {
        array[i] = random.nextInt();
    }
    return array;
}
Question2: Why does optimizedSum work slower than canonicalSum?
As of Java 9, vectorisation of this operation has been implemented but disabled, based on benchmarks measuring the all-in cost of the code plus its compilation. Depending on your processor, this leads to the relatively entertaining result that if you introduce artificial complications into your reduction loop, you can trigger autovectorisation and get a quicker result! So the fastest code, for now, assuming numbers small enough not to overflow, is:
public int sum(int[] data) {
    int value = 0;
    for (int i = 0; i < data.length; ++i) {
        value += 2 * data[i];
    }
    return value / 2;
}
This isn't intended as a recommendation! This is more to illustrate that the speed of your code in Java is dependent on the JIT, its trade-offs, and its bugs/features in any given release. Writing cute code to optimise problems like this is at best vain and will put a shelf life on the code you write. For instance, had you manually unrolled a loop to optimise for an older version of Java, your code would be much slower in Java 8 or 9 because this decision would completely disable autovectorisation. You'd better really need that performance to do it.
Question1 [main]: Is it possible to optimize canonicalSum?
Yes, it is. But I have no idea with what factor.
Some things you can do are:
Use the parallel streams introduced in Java 8 (a sketch follows the loop-unrolling example below). The processor has instructions for adding two arrays element-wise (and more); this can be observed in Octave, where summing two vectors with ".+" (element-wise addition) or "+" is way faster than using a loop.
Use multithreading. You could use a divide-and-conquer algorithm, maybe like this:
divide the array into two or more parts;
keep dividing recursively until you get arrays of a manageable size for a thread;
start computing the sums of the sub-arrays (the divided arrays) in separate threads;
finally, add the sums produced by all the threads together to produce the final result.
Maybe unrolling the loop would help a bit, too. By loop unrolling I mean reducing the number of iterations the loop has to make by manually doing more operations per iteration.
An example from http://en.wikipedia.org/wiki/Loop_unwinding :
for (int x = 0; x < 100; x++)
{
    delete(x);
}
becomes
for (int x = 0; x < 100; x += 5)
{
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}
But as mentioned, this must be done with caution and backed by profiling, since the JIT can probably do this kind of optimization itself.
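Here is the parallel-stream sketch referred to above (my addition, not from the original answer; it widens each element to long so the reduction cannot overflow):

import java.util.Arrays;

public class ParallelSum {
    // Sums on the common fork/join pool; asLongStream() widens to avoid int overflow.
    static long parallelSum(int[] array) {
        return Arrays.stream(array).parallel().asLongStream().sum();
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        java.util.Random rand = new java.util.Random(1);
        for (int i = 0; i < data.length; i++)
            data[i] = rand.nextInt();
        System.out.println(parallelSum(data));
    }
}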
An implementation of mathematical operations for the multithreaded approach can be seen here.
An example implementation with the Fork/Join framework, introduced in Java 7, that basically does what the divide-and-conquer algorithm above describes would be:
import java.util.concurrent.RecursiveTask;

public class ForkJoinCalculator extends RecursiveTask<Double> {

    public static final long THRESHOLD = 1_000_000;

    private final SequentialCalculator sequentialCalculator;
    private final double[] numbers;
    private final int start;
    private final int end;

    public ForkJoinCalculator(double[] numbers, SequentialCalculator sequentialCalculator) {
        this(numbers, 0, numbers.length, sequentialCalculator);
    }

    private ForkJoinCalculator(double[] numbers, int start, int end, SequentialCalculator sequentialCalculator) {
        this.numbers = numbers;
        this.start = start;
        this.end = end;
        this.sequentialCalculator = sequentialCalculator;
    }

    @Override
    protected Double compute() {
        int length = end - start;
        if (length <= THRESHOLD) {
            return sequentialCalculator.computeSequentially(numbers, start, end);
        }
        ForkJoinCalculator leftTask = new ForkJoinCalculator(numbers, start, start + length / 2, sequentialCalculator);
        leftTask.fork();
        ForkJoinCalculator rightTask = new ForkJoinCalculator(numbers, start + length / 2, end, sequentialCalculator);
        Double rightResult = rightTask.compute();
        Double leftResult = leftTask.join();
        return leftResult + rightResult;
    }
}
Here we develop a RecursiveTask that splits an array of doubles until the length of a subarray goes below a given threshold. At that point the subarray is processed sequentially, applying to it the operation defined by the following interface.
The interface used is this:
public interface SequentialCalculator {
    double computeSequentially(double[] numbers, int start, int end);
}
And the usage example:
public static double varianceForkJoin(double[] population) {
    final ForkJoinPool forkJoinPool = new ForkJoinPool();
    double total = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
        @Override
        public double computeSequentially(double[] numbers, int start, int end) {
            double total = 0;
            for (int i = start; i < end; i++) {
                total += numbers[i];
            }
            return total;
        }
    }));
    final double average = total / population.length;
    double variance = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
        @Override
        public double computeSequentially(double[] numbers, int start, int end) {
            double variance = 0;
            for (int i = start; i < end; i++) {
                variance += (numbers[i] - average) * (numbers[i] - average);
            }
            return variance;
        }
    }));
    return variance / population.length;
}
If you want to add N numbers then the runtime is O(N), so in this respect your canonicalSum cannot be "optimized".
What you can do to reduce the runtime is make the summation parallel, i.e. break the array into parts, pass them to separate threads, and at the end sum the results returned by each thread.
Update: this implies a multicore system, but there is a Java API to get the number of cores.
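The call in question is part of the standard API and is used elsewhere on this page:

// Number of processors (strictly, hardware threads) available to the JVM.
int cores = Runtime.getRuntime().availableProcessors();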

Using ExecutorService with a multithreaded version of Merge Sort

I am working on a homework problem where I have to create a Multithreaded version of Merge Sort. I was able to implement it, but I am not able to stop the creation of threads. I looked into using an ExecutorService to limit the creation of threads but I cannot figure out how to implement it within my current code.
Here is my current Multithreaded Merge Sort. We are required to implement a specific strategy pattern so that is where my sort() method comes from.
@Override
public int[] sort(int[] list) {
    int array_size = list.length;
    list = msort(list, 0, array_size - 1);
    return list;
}

int[] msort(int numbers[], int left, int right) {
    final int mid;
    final int leftRef = left;
    final int rightRef = right;
    final int array[] = numbers;
    if (left < right) {
        mid = (right + left) / 2;
        // new thread
        Runnable r1 = new Runnable() {
            public void run() {
                msort(array, leftRef, mid);
            }
        };
        Thread t1 = new Thread(r1);
        t1.start();
        // new thread
        Runnable r2 = new Runnable() {
            public void run() {
                msort(array, mid + 1, rightRef);
            }
        };
        Thread t2 = new Thread(r2);
        t2.start();
        // join threads back together
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        merge(numbers, leftRef, mid, mid + 1, rightRef);
    }
    return numbers;
}
void merge(int numbers[], int startA, int endA, int startB, int endB) {
    int finalStart = startA;
    int finalEnd = endB;
    int indexC = 0;
    int[] listC = new int[numbers.length];

    while (startA <= endA && startB <= endB) {
        if (numbers[startA] < numbers[startB]) {
            listC[indexC] = numbers[startA];
            startA = startA + 1;
        } else {
            listC[indexC] = numbers[startB];
            startB = startB + 1;
        }
        indexC++;
    }

    if (startA <= endA) {
        for (int i = startA; i < endA; i++) {
            listC[indexC] = numbers[i];
            indexC++;
        }
    }

    indexC = 0;
    for (int i = finalStart; i <= finalEnd; i++) {
        numbers[i] = listC[indexC];
        indexC++;
    }
}
Any pointers would be gratefully received.
Following @mcdowella's comment, I also think that the fork/join framework is your best bet if you want to limit the number of threads that run in parallel.
I know that this won't give you any help with your homework, because you are probably not allowed to use the fork/join framework in Java 7. However, it is all about learning something, isn't it? ;)
As I commented, I think your merge method is wrong. I can't pinpoint the failure, but I have rewritten it. I strongly suggest you write a test case with all the edge cases that can happen during that merge method, and once you have verified it works, plant it back into your multithreaded code.
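For instance, a minimal JUnit case against the question's merge signature could look like this (a sketch; MergeSorter stands in for whatever class holds your merge method, and the values are mine):

import static org.junit.Assert.assertArrayEquals;
import org.junit.Test;

public class MergeTest {
    @Test
    public void mergesTwoAdjacentSortedRuns() {
        int[] numbers = {1, 4, 9, 2, 3, 10};          // two sorted runs: [1,4,9] and [2,3,10]
        new MergeSorter().merge(numbers, 0, 2, 3, 5); // merge indices 0..2 with 3..5
        assertArrayEquals(new int[] {1, 2, 3, 4, 9, 10}, numbers);
    }
}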
@lbalazscs also gave you the hint that the fork/join sort is mentioned in the javadocs; however, I had nothing else to do, so I will show you how the solution would look if you implemented it with Java 7.
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class MultithreadedMergeSort extends RecursiveAction {

    private final int[] array;
    private final int begin;
    private final int end;

    public MultithreadedMergeSort(int[] array, int begin, int end) {
        this.array = array;
        this.begin = begin;
        this.end = end;
    }

    @Override
    protected void compute() {
        if (end - begin < 2) {
            // swap if we only have two elements
            if (array[begin] > array[end]) {
                int tmp = array[end];
                array[end] = array[begin];
                array[begin] = tmp;
            }
        } else {
            // overflow-safe method to calculate the mid
            int mid = (begin + end) >>> 1;
            // invoke recursive sorting actions for both halves
            invokeAll(new MultithreadedMergeSort(array, begin, mid),
                    new MultithreadedMergeSort(array, mid + 1, end));
            // merge both sides
            merge(array, begin, mid, end);
        }
    }

    void merge(int[] numbers, int startA, int startB, int endB) {
        int[] toReturn = new int[endB - startA + 1];
        int i = 0, k = startA, j = startB + 1;
        while (i < toReturn.length) {
            if (numbers[k] < numbers[j]) {
                toReturn[i] = numbers[k];
                k++;
            } else {
                toReturn[i] = numbers[j];
                j++;
            }
            i++;
            // if we hit the end of either run, copy the rest of the other
            if (j > endB) {
                System.arraycopy(numbers, k, toReturn, i, startB - k + 1);
                break;
            }
            if (k > startB) {
                System.arraycopy(numbers, j, toReturn, i, endB - j + 1);
                break;
            }
        }
        System.arraycopy(toReturn, 0, numbers, startA, toReturn.length);
    }

    public static void main(String[] args) {
        int[] toSort = { 55, 1, 12, 2, 25, 55, 56, 77 };
        ForkJoinPool pool = new ForkJoinPool();
        pool.invoke(new MultithreadedMergeSort(toSort, 0, toSort.length - 1));
        System.out.println(Arrays.toString(toSort));
    }
}
Note that the construction of your threadpool limits the number of active parallel threads to the number of cores of your processor.
ForkJoinPool pool = new ForkJoinPool();
According to its javadoc:
Creates a ForkJoinPool with parallelism equal to
java.lang.Runtime.availableProcessors, using the default thread
factory, no UncaughtExceptionHandler, and non-async LIFO processing
mode.
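If you want a different limit, ForkJoinPool also has a constructor that takes the parallelism level explicitly (standard API; the value here is just an example):

// Explicitly sized pool: here, at most four parallel workers.
ForkJoinPool pool = new ForkJoinPool(4);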
Also notice how my merge method differs from yours, because I think that is your main problem. At least your sorting works if I replace your merge method with mine.
As mcdowella pointed out, the Fork/Join framework in Java 7 is exactly for tasks that can be broken into smaller pieces recursively.
Actually, the Javadoc for RecursiveAction has a merge sort as the first example :)
Also note that ForkJoinPool is an ExecutorService.
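So it can be dropped in anywhere an ExecutorService is expected, e.g. (my sketch):

// ForkJoinPool implements ExecutorService, so plain Runnables can be submitted too.
java.util.concurrent.ExecutorService es = new java.util.concurrent.ForkJoinPool();
es.submit(new Runnable() {
    @Override
    public void run() {
        System.out.println("running on a work-stealing pool");
    }
});
es.shutdown();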

Java: Parallelizing quick sort via multi-threading

