In the pursuit of learning I have written a multi-threaded linear search, designed to operate on an int[] array. I believe the search works as intended; however, after completing it I tested it against a standard 'for loop' and was surprised to see that the 'for loop' beat my search in terms of speed every time. I've tried tinkering with the code, but cannot get the search to beat a basic 'for loop'. At the moment I am wondering the following:
Is there an obvious flaw in my code that I am not seeing?
Is my code perhaps not well optimised for CPU caches?
Is this just the overheads of multi-threading slowing down my program and so I need a larger array to reap the benefits?
Unable to work it out myself, I am hoping someone here may be able to point me in the right direction, leading to my question:
Is there an inefficiency/flaw in my code that is making it slower than a standard loop, or is this just the overheads of threading slowing it down?
The Search:
public class MLinearSearch {
private MLinearSearch() {};
public static int[] getMultithreadingPositions(int[] data, int processors) {
int pieceSize = data.length / processors;
int remainder = data.length % processors;
int curPosition = 0;
int[] results = new int[processors + 1];
for (int i = 0; i < results.length - 1; i++) {
results[i] = curPosition;
curPosition += pieceSize;
if(i < remainder) {
curPosition++;
}
}
results[results.length - 1] = data.length;
return results;
}
public static int search(int target, int[]data) {
MLinearSearch.processors = Runtime.getRuntime().availableProcessors();
MLinearSearch.foundIndex = -1;
int[] domains = MLinearSearch.getMultithreadingPositions(data, processors);
Thread[] threads = new Thread[MLinearSearch.processors];
for(int i = 0; i < MLinearSearch.processors; i++) {
MLSThread searcher = new MLSThread(target, data, domains[i], domains[(i + 1)]);
searcher.setDaemon(true);
threads[i] = searcher;
searcher.run();
}
for(Thread thread : threads) {
try {
thread.join();
} catch (InterruptedException e) {
return MLinearSearch.foundIndex;
}
}
return MLinearSearch.foundIndex;
}
private static class MLSThread extends Thread {
private MLSThread(int target, int[] data, int start, int end) {
this.counter = start;
this.dataEnd = end;
this.target = target;
this.data = data;
}
@Override
public void run() {
while(this.counter < (this.dataEnd) && MLinearSearch.foundIndex == -1) {
if(this.target == this.data[this.counter]) {
MLinearSearch.foundIndex = this.counter;
return;
}
counter++;
}
}
private int counter;
private int dataEnd;
private int target;
private int[] data;
}
private static volatile int foundIndex = -1;
private static volatile int processors;
}
Note: "getMultithreadingPositions" is normally in a separate class. I have copied the method here for simplicity.
This is how I've been testing the code. Another test (Omitted here, but in the same file & run) runs the basic for loop, which beats my multi-threaded search every time.
public class SearchingTest {
@Test
public void multiLinearTest() {
int index = MLinearSearch.search(TARGET, arrayData);
assertEquals(TARGET, arrayData[index]);
}
private static int[] getShuffledArray(int[] array) {
// https://stackoverflow.com/questions/1519736/random-shuffling-of-an-array
Random rnd = ThreadLocalRandom.current();
for (int i = array.length - 1; i > 0; i--)
{
int index = rnd.nextInt(i + 1);
int a = array[index];
array[index] = array[i];
array[i] = a;
}
return array;
}
private static final int[] arrayData = SearchingTest.getShuffledArray(IntStream.range(0, 55_000_000).toArray());
private static final int TARGET = 7;
}
The loop beating this is literally just a for loop that iterates over the same array. I would imagine for smaller arrays the for loop would win out as its simplicity allows it to get going before my multi-threaded search can initiate its threads. At the array size I am trying though I would have expected a single thread to lose out.
Note: I had to increase my heap size with the following JVM argument:
-Xmx4096m
To avoid a heap memory exception.
Thank you for any help offered.
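One detail in the search above that may be worth checking before looking at caches or threading overheads: `searcher.run()` is an ordinary method call, so each segment would be searched on the calling thread, one after another, whereas `Thread.start()` is the call that actually creates a new thread. The following is a minimal sketch of that difference; the class name `RunVersusStart`, the method `executingThread` and the thread name "worker" are made up for illustration and are not part of the original code.

```java
class RunVersusStart {
    /** Returns the name of the thread that actually executed the Thread's body. */
    static String executingThread(boolean useStart) {
        final String[] seen = new String[1];
        Thread worker = new Thread(() -> seen[0] = Thread.currentThread().getName(), "worker");
        if (useStart) {
            worker.start();          // body runs on the new thread named "worker"
            try {
                worker.join();       // wait, so the write to seen[0] is visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        } else {
            worker.run();            // plain method call: body runs on the caller
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("run()   executed on: " + executingThread(false));
        System.out.println("start() executed on: " + executingThread(true));
    }
}
```

If the search is switched to `start()`, it is probably still worth measuring again, since reading a volatile field on every loop iteration may also cost something.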
I am trying to learn about the ForkJoinPool framework and came across the below example:
public class ArrayCounter extends RecursiveTask<Integer> {
int[] array;
int threshold = 100_000;
int start;
int end;
public ArrayCounter(int[] array, int start, int end) {
this.array = array;
this.start = start;
this.end = end;
}
protected Integer compute() {
if (end - start < threshold) {
return computeDirectly();
} else {
int middle = (end + start) / 2;
ArrayCounter subTask1 = new ArrayCounter(array, start, middle);
ArrayCounter subTask2 = new ArrayCounter(array, middle, end);
invokeAll(subTask1, subTask2);
return subTask1.join() + subTask2.join();
}
}
protected Integer computeDirectly() {
Integer count = 0;
for (int i = start; i < end; i++) {
if (array[i] % 2 == 0) {
count++;
}
}
return count;
}
}
Main :
public class ForkJoinRecursiveTaskTest
{
static final int SIZE = 10_000_000;
static int[] array = randomArray();
public static void main(String[] args) {
ArrayCounter mainTask = new ArrayCounter(array, 0, SIZE);
ForkJoinPool pool = new ForkJoinPool();
Integer evenNumberCount = pool.invoke(mainTask);
System.out.println("Number of even numbers: " + evenNumberCount);
}
static int[] randomArray() {
int[] array = new int[SIZE];
Random random = new Random();
for (int i = 0; i < SIZE; i++) {
array[i] = random.nextInt(100);
}
return array;
}
}
According to the Java docs, invokeAll() submits the tasks to the pool and returns the results as well, so there should be no need for a separate join(). Can someone please explain why a separate join is needed in this case?
In your example you are using RecursiveTask<Integer>,
so you are expected to return a value from the compute() method.
Let's look at the invokeAll(t1, t2) signature:
static void invokeAll(ForkJoinTask<?> t1, ForkJoinTask<?> t2)
So invokeAll() doesn't return a value.
According to the documentation:
Forks the given tasks, returning when isDone holds for each task or an (unchecked) exception is encountered, in which case the exception is rethrown.
So:
return subTask1.join() + subTask2.join(); is the key for your example.
The two results are combined once both subtasks have completed, and the sum is passed back up to the calling compute() invocation.
task.join()
Returns the result of the computation when it is done.
As per javadoc, join
Returns the result of the computation when it is done. This method
differs from get() in that abnormal completion results in
RuntimeException or Error, not ExecutionException, and that interrupts
of the calling thread do not cause the method to abruptly return by
throwing InterruptedException.
So, when task is done, join helps you to get the computed value, which you are adding later together.
return subTask1.join() + subTask2.join();
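To make the division of labour between invokeAll() and join() concrete, here is a minimal sketch with a different, simpler task (summing a range of integers). The class `RangeSum`, its `THRESHOLD` and the helper `sum` are assumptions for illustration only; they are not taken from the example in the question.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class RangeSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // illustrative cut-off, not tuned
    private final int from; // inclusive
    private final int to;   // exclusive

    RangeSum(int from, int to) {
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += i;
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        RangeSum left = new RangeSum(from, mid);
        RangeSum right = new RangeSum(mid, to);
        invokeAll(left, right);            // returns void: it only waits for both tasks
        return left.join() + right.join(); // join() hands back each task's stored result
    }

    /** Sums the integers in [from, to) using the common pool. */
    static long sum(int from, int to) {
        return ForkJoinPool.commonPool().invoke(new RangeSum(from, to));
    }

    public static void main(String[] args) {
        System.out.println(sum(0, 10_000));
    }
}
```

Because both subtasks are already done when invokeAll() returns, the join() calls here do not block; they are simply the way to read the values.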
I've implemented a range tree which supports updates in the form of incrementing or decrementing the count of a specific value. It can also query the number of values lower or equal to the value provided.
The range tree has been tested to work in a single threaded environment, however I would like to know how to modify the implementation such that it can be updated and queried concurrently.
I know a simple solution would be to synchronise methods that access this tree, but I would like to know if there are ways to make RangeTree thread safe by itself with minimal effect on performance.
public class RangeTree {
public static final int ROOT_NODE = 0;
private int[] count;
private int[] min;
private int[] max;
private int levels;
private int lastLevelSize;
public RangeTree(int maxValue) {
levels = 1;
lastLevelSize = 1;
while (lastLevelSize <= maxValue) {
levels++;
lastLevelSize = lastLevelSize << 1;
}
int alloc = lastLevelSize * 2;
count = new int[alloc];
min = new int[alloc];
max = new int[alloc];
int step = lastLevelSize;
int pointer = ROOT_NODE;
for (int i = 0; i < levels; i++) {
int current = 0;
while (current < lastLevelSize) {
min[pointer] = current;
max[pointer] = current + step - 1;
current += step;
pointer++;
}
step = step >> 1;
}
}
public void register(int value) {
int index = lastLevelSize - 1 + value;
count[index]++;
walkAndRefresh(index);
}
public void unregister(int value) {
int index = lastLevelSize - 1 + value;
count[index]--;
walkAndRefresh(index);
}
private void walkAndRefresh(int node) {
int currentNode = node;
while (currentNode != ROOT_NODE) {
currentNode = (currentNode - 1) >> 1;
count[currentNode] = count[currentNode * 2 + 1] + count[currentNode * 2 + 2];
}
}
public int countLesserOrEq(int value) {
return countLesserOrEq0(value, ROOT_NODE);
}
private int countLesserOrEq0(int value, int node) {
if (max[node] <= value) {
return count[node];
} else if (min[node] > value) {
return 0;
}
return countLesserOrEq0(value, node * 2 + 1) + countLesserOrEq0(value, node * 2 + 2);
}
}
Louis Wasserman is right, this is a difficult question. But it may have a simple solution.
Depending on your updates/reads ratio and the contention for the data, it may be useful to use ReadWriteLock instead of synchronized.
Another solution, which may be efficient in some cases (depending on your workload), is to copy the whole RangeTree object before an update and then switch the reference to the 'actual' RangeTree, as is done in CopyOnWriteArrayList. But this also gives up atomic consistency and leaves you with eventual consistency.
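As a rough sketch of the ReadWriteLock idea, the class below guards a deliberately simplified stand-in for the tree (a flat count array with a linear query), so it only illustrates the locking pattern, not the RangeTree layout. The class name `LockedCounts` is hypothetical; the same lock/try/finally structure would wrap register, unregister and countLesserOrEq in the real class.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockedCounts {
    private final int[] count;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    LockedCounts(int maxValue) {
        count = new int[maxValue + 1];
    }

    /** Updates take the write lock, so they exclude readers and other writers. */
    void register(int value) {
        lock.writeLock().lock();
        try {
            count[value]++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    void unregister(int value) {
        lock.writeLock().lock();
        try {
            count[value]--;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Queries take the read lock, so several of them can run at the same time. */
    int countLesserOrEq(int value) {
        lock.readLock().lock();
        try {
            int total = 0;
            for (int v = 0; v <= value && v < count.length; v++) {
                total += count[v];
            }
            return total;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Whether this beats plain synchronized depends on the read/write ratio; with frequent updates the extra bookkeeping of the read-write lock may not pay off.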
This is probably a pretty easy question, but as I never worked with threads before I figured it would be best to ask instead of trying to find the optimal solution completely on my own.
I have a giant for loop that runs literally billions of times. On each iteration, based on the current index, the program calculates a final result in the form of a number. I am only interested in storing the top result (or top x results) and its corresponding index.
My question is simple: what would be the right way to run this loop in threads so that it uses all the available CPUs/cores?
int topResultIndex = -1;
double topResult = 0;
for (int i = 1; i < 1000000000; ++i) {
double result = // some complicated calculation based on the current index
if (result > topResult) {
topResult = result;
topResultIndex = i;
}
}
The calculation is completely independent for each index, no resources are shared. topResultIndex and topResult will be obviously accessed by each thread though.
* Update: Both Giulio's and rolfl's solution are good, also very similar. Could only accept one of them as my answer.
Let's assume that the result is computed by a calculateResult(long) method, which is private and static, and does not access any static field, (it can also be non-static, but still it must be thread-safe and concurrently-executable, hopefully thread-confined).
Then, I think this will do the dirty work:
public static class Response {
long index;
double result;
}
private static class MyTask implements Callable<Response> {
private final long from;
private final long to;
public MyTask(long fromIndexInclusive, long toIndexExclusive) {
this.from = fromIndexInclusive;
this.to = toIndexExclusive;
}
public Response call() {
// -1 means that no index in this chunk beat the initial topResult
long topResultIndex = -1;
double topResult = 0;
for (long i = from; i < to; ++i) {
double result = calculateResult(i);
if (result > topResult) {
topResult = result;
topResultIndex = i;
}
}
Response res = new Response();
res.index = topResultIndex;
res.result = topResult;
return res;
}
}
private static double calculateResult(long index) { ... }
public Response interfaceMethod() throws InterruptedException, ExecutionException {
//You might want to make this static/shared/global
ExecutorService svc = Executors.newCachedThreadPool();
int chunks = Runtime.getRuntime().availableProcessors();
long iterations = 1000000000;
MyTask[] tasks = new MyTask[chunks];
for (int i = 0; i < chunks; ++i) {
//Multiply before dividing so the remainder is spread over the chunks and the last chunk ends exactly at 'iterations'
tasks[i] = new MyTask(iterations * i / chunks, iterations * (i + 1) / chunks);
}
List<Future<Response>> resp = svc.invokeAll(Arrays.asList(tasks));
Iterator<Future<Response>> respIt = resp.iterator();
//You'll have to handle exceptions here
Response bestResponse = respIt.next().get();
while (respIt.hasNext()) {
Response r = respIt.next().get();
if (r.result > bestResponse.result) {
bestResponse = r;
}
}
return bestResponse;
}
In my experience, dividing the work into chunks like this is much faster than having a task for each index, especially if the computational load for each single index is small, as it probably is (by small, I mean less than half a second). It's a bit harder to code, though, because you need a two-step maximization (first at chunk level, then at global level). With this, if the computation is purely CPU-bound (does not push the RAM too much), you should get a speedup of roughly 80% of the number of physical cores.
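The two-step maximization described above can be sketched in a self-contained form as follows. This is only an illustration under assumptions: the class `ChunkedArgMax`, the holder `Best` and the `score` parameter (standing in for the "complicated calculation") are hypothetical names, and the pool is created per call only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongToDoubleFunction;

class ChunkedArgMax {
    /** Best score seen so far, together with the index that produced it. */
    static final class Best {
        final long index;
        final double value;
        Best(long index, double value) {
            this.index = index;
            this.value = value;
        }
    }

    /** Step 1: sequential scan of the half-open range [from, to). */
    static Best scan(long from, long to, LongToDoubleFunction score) {
        Best best = new Best(-1, Double.NEGATIVE_INFINITY);
        for (long i = from; i < to; i++) {
            double value = score.applyAsDouble(i);
            if (value > best.value) {
                best = new Best(i, value);
            }
        }
        return best;
    }

    /** Step 2: one task per chunk, then a reduction over the per-chunk winners. */
    static Best argMax(long iterations, int chunks, LongToDoubleFunction score) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Best>> futures = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                // Multiplying first keeps the chunks contiguous and covers every index.
                final long from = iterations * c / chunks;
                final long to = iterations * (c + 1) / chunks;
                Callable<Best> task = () -> scan(from, to, score);
                futures.add(pool.submit(task));
            }
            Best best = new Best(-1, Double.NEGATIVE_INFINITY);
            for (Future<Best> future : futures) {
                Best candidate = future.get();
                if (candidate.value > best.value) {
                    best = candidate;
                }
            }
            return best;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        Best best = argMax(1_000_000, cores, i -> -Math.abs(i - 123_456.0));
        System.out.println("best index: " + best.index);
    }
}
```

The scan allocates a new Best only when a better value is found, which keeps the sketch short; a hot loop would more likely track two local variables as the answers here do.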
Apart from the observation that a C program with OpenMP or some other parallel computing extensions would be a better idea, the Java way to do it would be to create a 'Future' Task that calculates a subset of the problem:
private static final class Result {
final int index;
final double result;
public Result (int index, double result) {
this.result = result;
this.index = index;
}
}
// Calculate 10,000 values in each task
int steps = 10000;
int cpucount = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(cpucount);
ArrayList<Future<Result>> results = new ArrayList<>();
for (int i = 0; i < 1000000000; i+= steps) {
final int from = i;
final int to = from + steps;
results.add(service.submit(new Callable<Result>() {
public Result call() {
int topResultIndex = -1;
double topResult = 0;
for (int j = from; j < to; j++) {
// do complicated things with 'j'
double result = // some complicated calculation based on the current index
if (result > topResult) {
topResult = result;
topResultIndex = j;
}
}
return new Result(topResultIndex, topResult);
}
}));
}
service.shutdown();
while (!service.isTerminated()) {
System.out.println("Waiting for threads to complete");
service.awaitTermination(10, TimeUnit.SECONDS);
}
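Result best = null;
for (Future<Result> fut : results) {
// get() returns the finished task's Result; it can throw InterruptedException or ExecutionException
Result candidate = fut.get();
if (best == null || candidate.result > best.result) {
best = candidate;
}
}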
System.out.printf("Best result is %f at index %d\n", best.result, best.index);
The easiest way would be to use an ExecutorService and submit your tasks as a Runnable or Callable. You can use Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()) to create an ExecutorService that will use the same number of threads as there are processors.
I am working on a homework problem where I have to create a Multithreaded version of Merge Sort. I was able to implement it, but I am not able to stop the creation of threads. I looked into using an ExecutorService to limit the creation of threads but I cannot figure out how to implement it within my current code.
Here is my current Multithreaded Merge Sort. We are required to implement a specific strategy pattern so that is where my sort() method comes from.
@Override
public int[] sort(int[] list) {
int array_size = list.length;
list = msort(list, 0, array_size-1);
return list;
}
int[] msort(int numbers[], int left, int right) {
final int mid;
final int leftRef = left;
final int rightRef = right;
final int array[] = numbers;
if (left<right) {
mid = (right + left) / 2;
//new thread
Runnable r1 = new Runnable(){
public void run(){
msort(array, leftRef, mid);
}
};
Thread t1 = new Thread(r1);
t1.start();
//new thread
Runnable r2 = new Runnable(){
public void run(){
msort(array, mid+1, rightRef);
}
};
Thread t2 = new Thread(r2);
t2.start();
//join threads back together
try {
t1.join();
t2.join();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
merge(numbers, leftRef, mid, mid+1, rightRef);
}
return numbers;
}
void merge(int numbers[], int startA, int endA, int startB, int endB) {
int finalStart = startA;
int finalEnd = endB;
int indexC = 0;
int[] listC = new int[numbers.length];
while(startA <= endA && startB <= endB){
if(numbers[startA] < numbers[startB]){
listC[indexC] = numbers[startA];
startA = startA+1;
}
else{
listC[indexC] = numbers[startB];
startB = startB +1;
}
indexC++;
}
if(startA <= endA){
for(int i = startA; i < endA; i++){
listC[indexC]= numbers[i];
indexC++;
}
}
indexC = 0;
for(int i = finalStart; i <= finalEnd; i++){
numbers[i]=listC[indexC];
indexC++;
}
}
Any pointers would be gratefully received.
Following @mcdowella's comment, I also think that the fork/join framework is your best bet if you want to limit the number of threads that run in parallel.
I know that this won't help you with your homework, because you are probably not allowed to use the fork/join framework in Java 7. However, this is about learning something, isn't it? ;)
As I commented, I think your merge method is wrong. I can't pinpoint the failure, but I have rewritten it. I strongly suggest you write a test case covering all the edge cases that can occur in that merge method and, once you have verified it works, put it back into your multithreaded code.
@lbalazscs also gave you the hint that the fork/join sort is mentioned in the javadocs; however, I had nothing else to do, so I will show you how you could implement it with Java 7.
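import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
public class MultithreadedMergeSort extends RecursiveAction {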
private final int[] array;
private final int begin;
private final int end;
public MultithreadedMergeSort(int[] array, int begin, int end) {
this.array = array;
this.begin = begin;
this.end = end;
}
@Override
protected void compute() {
if (end - begin < 2) {
// swap if we only have two elements
if (array[begin] > array[end]) {
int tmp = array[end];
array[end] = array[begin];
array[begin] = tmp;
}
} else {
// overflow safe method to calculate the mid
int mid = (begin + end) >>> 1;
// invoke recursive sorting action
invokeAll(new MultithreadedMergeSort(array, begin, mid),
new MultithreadedMergeSort(array, mid + 1, end));
// merge both sides
merge(array, begin, mid, end);
}
}
void merge(int[] numbers, int startA, int startB, int endB) {
int[] toReturn = new int[endB - startA + 1];
int i = 0, k = startA, j = startB + 1;
while (i < toReturn.length) {
if (numbers[k] < numbers[j]) {
toReturn[i] = numbers[k];
k++;
} else {
toReturn[i] = numbers[j];
j++;
}
i++;
// if we hit the limit of an array, copy the rest
if (j > endB) {
System.arraycopy(numbers, k, toReturn, i, startB - k + 1);
break;
}
if (k > startB) {
System.arraycopy(numbers, j, toReturn, i, endB - j + 1);
break;
}
}
System.arraycopy(toReturn, 0, numbers, startA, toReturn.length);
}
public static void main(String[] args) {
int[] toSort = { 55, 1, 12, 2, 25, 55, 56, 77 };
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new MultithreadedMergeSort(toSort, 0, toSort.length - 1));
System.out.println(Arrays.toString(toSort));
}
}
Note that the construction of your threadpool limits the number of active parallel threads to the number of cores of your processor.
ForkJoinPool pool = new ForkJoinPool();
According to its javadoc:
Creates a ForkJoinPool with parallelism equal to
java.lang.Runtime.availableProcessors, using the default thread
factory, no UncaughtExceptionHandler, and non-async LIFO processing
mode.
Also notice how my merge method differs from yours, because I think that is your main problem. At least your sorting works if I replace your merge method with mine.
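For the edge-case testing suggested earlier, a small stand-alone check could look like the sketch below. It uses its own textbook merge with inclusive bounds rather than either merge method from this thread, and the names `MergeCheck` and `merged` are made up for the example; the point is the set of cases (one run exhausted first, the other exhausted first, duplicates, two elements), which can be pointed at any merge implementation.

```java
import java.util.Arrays;

class MergeCheck {
    /** Merges the sorted runs numbers[lo..mid] and numbers[mid+1..hi] (all bounds inclusive). */
    static void merge(int[] numbers, int lo, int mid, int hi) {
        int[] merged = new int[hi - lo + 1];
        int a = lo, b = mid + 1, c = 0;
        while (a <= mid && b <= hi) {
            merged[c++] = numbers[a] <= numbers[b] ? numbers[a++] : numbers[b++];
        }
        while (a <= mid) {   // left run still has elements
            merged[c++] = numbers[a++];
        }
        while (b <= hi) {    // right run still has elements
            merged[c++] = numbers[b++];
        }
        System.arraycopy(merged, 0, numbers, lo, merged.length);
    }

    /** Convenience wrapper: merges a copy of the input split after index mid. */
    static int[] merged(int[] input, int mid) {
        int[] copy = input.clone();
        merge(copy, 0, mid, copy.length - 1);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(merged(new int[] {1, 2, 3, 4}, 1))); // left run exhausted first
        System.out.println(Arrays.toString(merged(new int[] {3, 4, 1, 2}, 1))); // right run exhausted first
        System.out.println(Arrays.toString(merged(new int[] {5, 5, 5, 5}, 1))); // duplicates
        System.out.println(Arrays.toString(merged(new int[] {2, 1}, 0)));       // two elements
    }
}
```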
As mcdowella pointed out, the Fork/Join framework in Java 7 is exactly for tasks that can be broken into smaller pieces recursively.
Actually, the Javadoc for RecursiveAction has a merge sort as the first example :)
Also note that ForkJoinPool is an ExecutorService.
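Since the question mentioned trying to fit an ExecutorService into the code, it may help to see that a ForkJoinPool can be used through that interface directly. A minimal sketch, with a made-up class name `PoolAsExecutor` and a trivial task chosen only for illustration:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;

class PoolAsExecutor {
    static int answer() {
        // A ForkJoinPool with parallelism 2, used only through the ExecutorService interface.
        ExecutorService service = new ForkJoinPool(2);
        try {
            Callable<Integer> task = () -> 6 * 7;
            return service.submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(answer());
    }
}
```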
I've coded a multi-threaded matrix multiplication. I believe my approach is right, but I'm not 100% sure. In respect to the threads, I don't understand why I can't just run a (new MatrixThread(...)).start() instead of using an ExecutorService.
Additionally, when I benchmark the multithreaded approach versus the classical approach, the classical is much faster...
What am I doing wrong?
Matrix Class:
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class Matrix
{
private int dimension;
private int[][] template;
public Matrix(int dimension)
{
this.template = new int[dimension][dimension];
this.dimension = template.length;
}
public Matrix(int[][] array)
{
this.dimension = array.length;
this.template = array;
}
public int getMatrixDimension() { return this.dimension; }
public int[][] getArray() { return this.template; }
public void fillMatrix()
{
Random randomNumber = new Random();
for(int i = 0; i < dimension; i++)
{
for(int j = 0; j < dimension; j++)
{
template[i][j] = randomNumber.nextInt(10) + 1;
}
}
}
@Override
public String toString()
{
String retString = "";
for(int i = 0; i < this.getMatrixDimension(); i++)
{
for(int j = 0; j < this.getMatrixDimension(); j++)
{
retString += " " + this.getArray()[i][j];
}
retString += "\n";
}
return retString;
}
public static Matrix classicalMultiplication(Matrix a, Matrix b)
{
int[][] result = new int[a.dimension][b.dimension];
for(int i = 0; i < a.dimension; i++)
{
for(int j = 0; j < b.dimension; j++)
{
for(int k = 0; k < b.dimension; k++)
{
result[i][j] += a.template[i][k] * b.template[k][j];
}
}
}
return new Matrix(result);
}
public Matrix multiply(Matrix multiplier) throws InterruptedException
{
Matrix result = new Matrix(dimension);
ExecutorService es = Executors.newFixedThreadPool(dimension*dimension);
for(int currRow = 0; currRow < multiplier.dimension; currRow++)
{
for(int currCol = 0; currCol < multiplier.dimension; currCol++)
{
//(new MatrixThread(this, multiplier, currRow, currCol, result)).start();
es.execute(new MatrixThread(this, multiplier, currRow, currCol, result));
}
}
es.shutdown();
es.awaitTermination(2, TimeUnit.DAYS);
return result;
}
private class MatrixThread extends Thread
{
private Matrix a, b, result;
private int row, col;
private MatrixThread(Matrix a, Matrix b, int row, int col, Matrix result)
{
this.a = a;
this.b = b;
this.row = row;
this.col = col;
this.result = result;
}
@Override
public void run()
{
int cellResult = 0;
for (int i = 0; i < a.getMatrixDimension(); i++)
cellResult += a.template[row][i] * b.template[i][col];
result.template[row][col] = cellResult;
}
}
}
Main class:
import java.util.Scanner;
public class MatrixDriver
{
private static final Scanner kb = new Scanner(System.in);
public static void main(String[] args) throws InterruptedException
{
Matrix first, second;
long timeLastChanged,timeNow;
double elapsedTime;
System.out.print("Enter value of n (must be a power of 2):");
int n = kb.nextInt();
first = new Matrix(n);
first.fillMatrix();
second = new Matrix(n);
second.fillMatrix();
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using threads:\n" +
first.multiply(second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Threaded took "+elapsedTime+" seconds");
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using classical:\n" +
Matrix.classicalMultiplication(first,second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Classical took "+elapsedTime+" seconds");
}
}
P.S. Please let me know if any further clarification is needed.
There is a fair amount of overhead involved in creating threads, even when using an ExecutorService. I suspect the reason your multithreaded approach is so slow is that you're spending 99% of the time creating new threads and only 1%, or less, doing the actual math.
Typically, to solve this problem you'd batch a whole bunch of operations together and run them on a single thread. I'm not 100% sure how to do that in this case, but I suggest breaking your matrix into smaller chunks (say, 10 smaller matrices) and running those on threads, instead of running each cell in its own thread.
You're creating a lot of threads. Not only is it expensive to create threads, but for a CPU bound application, you don't want more threads than you have available processors (if you do, you have to spend processing power switching between threads, which also is likely to cause cache misses which are very expensive).
It's also unnecessary to send a thread to execute; all it needs is a Runnable. You'll get a big performance boost by applying these changes:
1. Make the ExecutorService a static member, size it for the current processor, and send it a ThreadFactory so it doesn't keep the program running after main has finished. (It would probably be architecturally cleaner to send it as a parameter to the method rather than keeping it as a static field; I leave that as an exercise for the reader. ☺)
private static final ExecutorService workerPool =
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(), new ThreadFactory() {
public Thread newThread(Runnable r) {
Thread t = new Thread(r);
t.setDaemon(true);
return t;
}
});
2. Make MatrixThread implement Runnable rather than inherit Thread. Threads are expensive to create; POJOs are very cheap. You can also make it static which makes the instances smaller (as non-static classes get an implicit reference to the enclosing object).
private static class MatrixThread implements Runnable
3. Because of change (1), you can no longer use awaitTermination to make sure all tasks are finished (as this worker pool is shared and never shut down). Instead, use the submit method, which returns a Future<?>. Collect all the future objects in a list, and when you've submitted all the tasks, iterate over the list and call get for each object.
Your multiply method should now look something like this:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
for(int currCol = 0; currCol < multiplier.dimension; currCol++) {
Runnable worker = new MatrixThread(this, multiplier, currRow, currCol, result);
futures.add(workerPool.submit(worker));
}
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
Will it be faster than the single-threaded version? Well, on my arguably crappy box the multithreaded version is slower for values of n < 1024.
This is just scratching the surface, though. The real problem is that you create a lot of MatrixThread instances - your memory consumption is O(n²), which is a very bad sign. Moving the inner for loop into MatrixThread.run would improve performance by a factor of craploads (ideally, you don't create more tasks than you have worker threads).
Edit: As I have more pressing things to do, I couldn't resist optimizing this further. I came up with this (... horrendously ugly piece of code) that "only" creates O(n) jobs:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
Runnable worker = new MatrixThread2(this, multiplier, currRow, result);
futures.add(workerPool.submit(worker));
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
private static class MatrixThread2 implements Runnable
{
private Matrix self, mul, result;
private int row;
private MatrixThread2(Matrix a, Matrix b, int row, Matrix result)
{
this.self = a;
this.mul = b;
this.row = row;
this.result = result;
}
@Override
public void run()
{
for(int col = 0; col < mul.dimension; col++) {
int cellResult = 0;
for (int i = 0; i < self.getMatrixDimension(); i++)
cellResult += self.template[row][i] * mul.template[i][col];
result.template[row][col] = cellResult;
}
}
}
It's still not great, but basically the multi-threaded version can compute anything you'll be patient enough to wait for, and it'll do it faster than the single-threaded version.
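Taking the "no more tasks than worker threads" idea one step further, each task can handle a contiguous band of rows. The following is only a sketch under assumptions: it works on plain int[][] arrays instead of the Matrix class, the name `BandedMultiply` is invented, and it creates and shuts down its own pool per call for brevity, where a shared pool as described above would normally be preferable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BandedMultiply {
    static int[][] multiply(int[][] a, int[][] b, int workers) {
        final int rows = a.length;
        final int inner = b.length;
        final int cols = b[0].length;
        final int[][] result = new int[rows][cols];
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                // Each worker gets the contiguous row band [fromRow, toRow).
                final int fromRow = (int) ((long) rows * w / workers);
                final int toRow = (int) ((long) rows * (w + 1) / workers);
                Runnable band = () -> {
                    for (int i = fromRow; i < toRow; i++) {
                        for (int k = 0; k < inner; k++) {
                            int aik = a[i][k];
                            for (int j = 0; j < cols; j++) {
                                result[i][j] += aik * b[k][j];
                            }
                        }
                    }
                };
                futures.add(pool.submit(band));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every band before returning the result
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] product = multiply(new int[][] {{1, 2}, {3, 4}}, new int[][] {{5, 6}, {7, 8}}, 2);
        System.out.println(Arrays.deepToString(product));
    }
}
```

No two tasks write to the same row of the result, so the bands need no locking among themselves, and the i-k-j loop order walks both `b` and `result` row by row.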
First of all, you should use a newFixedThreadPool sized to the number of cores you have; on a quad-core machine, use 4. Second, don't create a new pool for each matrix.
If I make the ExecutorService a static member variable, I get almost consistently faster execution of the threaded version at a matrix size of 512.
Also, changing MatrixThread to implement Runnable instead of extending Thread speeds up execution further, to the point where the threaded version is 2x as fast on my machine at size 512.