I am trying to compare sequential and concurrent matrix multiplication. Every time, the sequential version is faster. For example, for a 60 x 60 matrix the sequential version finishes in 4 ms while the concurrent one takes 277 ms. Is something wrong in my code?
Concurrent:
private static void multiplyMatrixConcurent() {
    result_concurent = new Matrix(rows, columns);
    for (int i = 0; i < cell; i++) {
        Runnable task = new MatrixMultiplicationThread(i);
        Thread worker = new Thread(task);
        worker.start();
    }
}
private static class MatrixMultiplicationThread implements Runnable {
    private int cell;

    MatrixMultiplicationThread(int cell) {
        this.cell = cell;
    }

    @Override
    public void run() {
        int row = cell / columns;
        int column = cell % columns;
        for (int i = 0; i < rows; i++) {
            double t1 = matrix.getCell(row, i);
            double t2 = matrix.getCell(i, column);
            double temp = t1 * t2;
            double res = result_concurent.getCell(row, column) + temp;
            result_concurent.setCell(res, row, column);
        }
    }
}
Sequential:
private static void multiplyMatrixSequence() {
    result_sequantial = new Matrix(rows, columns);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < rows; j++) {
            for (int k = 0; k < columns; k++) {
                double t1 = matrix.getCell(i, k);
                double t2 = matrix.getCell(k, j);
                double temp = t1 * t2;
                double res = result_sequantial.getCell(i, j) + temp;
                result_sequantial.setCell(res, i, j);
            }
        }
    }
}
I don't see anything obviously wrong. You don't set cell to rows*columns in the concurrent startup code you posted, but I assume that is an issue with the posting, not with the code you actually ran.
Threads have overhead. They need memory to be allocated and require extra management of CPU resources. If the number of threads is modest and the hardware can handle multiple threads in parallel, you win. However, for purely CPU-bound tasks, having more threads than there are processing elements is just overhead with no gain. In this case, you have 3600 threads. I'm guessing you have a processor that can handle between 2 and 8 threads at once. Your thread count dwarfs the processor's capacity, so you get a slowdown.
Note that when the threads are performing blocking operations such as disk or network I/O then more threads can allow interleaving. The statements also don't apply in the GPU computing case where even memory accesses allow efficient thread interleaving.
BTW, if your goal is actually to produce a fast matrix multiply - use an existing library. These libraries are developed by people who take advantage of processor cache structures, specialized hardware instruction sets and subtle details of floating point to produce libraries that are faster and more accurate than anything a casual coder can produce.
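For example, here is a minimal sketch using Apache Commons Math (this assumes the commons-math3 dependency is available; the particular library is just an illustration, not something from the question):

import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class LibraryMultiply {
    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};

        RealMatrix ma = MatrixUtils.createRealMatrix(a);
        RealMatrix mb = MatrixUtils.createRealMatrix(b);

        // The library chooses the multiplication routine for you.
        RealMatrix product = ma.multiply(mb);
        System.out.println(product);
    }
}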
Creating a Thread takes some time (compared to other operations it is expensive). Instead of creating a new Thread for every cell, you could use a thread pool and re-use existing (finished) threads. This reduces the time spent creating new threads. But you are still in a scenario with very little work per task, where setting up the threads takes more time than running the computation sequentially.
private static void multiplyMatrixConcurent() {
    result_concurent = new Matrix(rows, columns);
    ExecutorService executor = Executors.newFixedThreadPool(4);
    for (int i = 0; i < cell; i++) {
        Runnable worker = new MatrixMultiplicationThread(i);
        executor.execute(worker);
    }
    executor.shutdown(); // note: shutdown() does not wait for the submitted tasks to finish
}
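If you want to time the work or read the result safely, the pool has to be drained first. A possible addition at the end of that method (the 1-minute timeout is an arbitrary choice of mine, not part of the original answer):

try {
    // Block until every submitted task has completed, or the timeout elapses.
    if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
        System.err.println("tasks did not finish within the timeout");
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}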
I am creating a program to calculate values of two arrays in steps of a simulation (they are initialized at the beginning; I did not include that here). I would like to do it with threads and an ExecutorService. I divided the arrays into blocks and I want the values of these blocks to be calculated by threads, one block = one thread. These two arrays - X and Y - take values from each other (as you can see in run()), and I want X to be calculated first and Y after that, so I made two separate Runnables:
public static class CountX implements Runnable {
    private int start;
    private int end;
    private CountDownLatch cdl;

    public CountX(int s, int e, CountDownLatch c) {
        this.start = s;
        this.end = e;
        this.cdl = c;
    }

    public void run() {
        for (int i = start + 1; i < end - 1; i++) {
            x[i] = x[i] - (y[i-1] - 2 * y[i] + y[i+1]) + y[i];
        }
        cdl.countDown();
    }
}
And the same for CountY. I would like to pass each task the information about where its block starts and ends.
This is, in short, what my main looks like, and this is my main problem:
int NN = 400; //length of X and Y
int threads = 8;
int block_size = (int) NN/threads;
final ExecutorService executor_X = Executors.newFixedThreadPool(threads);
final ExecutorService executor_Y = Executors.newFixedThreadPool(threads);
CountDownLatch cdl = new CountDownLatch(threads);
CountX[] runnables_X = new CountX[threads];
CountY[] runnables_Y = new CountY[threads];
for (int r = 0; r < threads; r++) {
    runnables_X[r] = new CountX((r*block_size), ((r+1)*block_size), cdl);
}
for (int r = 0; r < threads; r++) {
    runnables_Y[r] = new CountY((r*block_size), ((r+1)*block_size), cdl);
}

int sim_steps = 4000;
for (int m = 0; m < sim_steps; m++) {
    for (int e = 0; e < threads; e++) {
        executor_X.execute(runnables_X[e]);
    }
    for (int e = 0; e < threads; e++) {
        executor_Y.execute(runnables_Y[e]);
    }
}
executor_X.shutdown();
executor_Y.shutdown();
I get wrong values in arrays X and Y from this program; I know they are wrong because I also did the calculation without threads.
Is a CountDownLatch necessary here? Am I supposed to run the loop runnables_X[r] = new CountX((r*block_size), ((r+1)*block_size), cdl); in every m (sim_step) iteration? Or maybe I should use the ExecutorService in a different way? I tried many options but the results are still wrong.
Thank you in advance!
Your approach is one I probably wouldn't take for this task.
You can work with references and Runnables, but in your case a Callable might be the better choice. With a Callable, you just give it the array and let it calculate a partial value, if possible, and await the Futures. It's not really clear to me what you actually want to calculate, though, so I am taking a blind guess here.
You need neither a CountDownLatch nor two ExecutorServices; one ExecutorService is enough.
If you really want to use a Runnable for this, you should implement some sort of synchronization, either with a concurrent list, Atomic variables, volatile or a lock.
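For illustration only, a minimal sketch of that idea. countXBlock and countYBlock are hypothetical methods standing in for the bodies of your CountX.run and CountY.run; invokeAll blocks until the whole batch is done, which gives you the X-before-Y ordering without a latch:

// One pool; run all X blocks, then all Y blocks, once per simulation step.
ExecutorService executor = Executors.newFixedThreadPool(threads);

List<Callable<Void>> xTasks = new ArrayList<>();
List<Callable<Void>> yTasks = new ArrayList<>();
for (int r = 0; r < threads; r++) {
    final int from = r * block_size;
    final int to = (r + 1) * block_size;
    xTasks.add(() -> { countXBlock(from, to); return null; }); // placeholder for CountX logic
    yTasks.add(() -> { countYBlock(from, to); return null; }); // placeholder for CountY logic
}

try {
    for (int m = 0; m < sim_steps; m++) {
        executor.invokeAll(xTasks); // blocks until every X block is done
        executor.invokeAll(yTasks); // only then are the Y blocks computed
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
} finally {
    executor.shutdown();
}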
When I create one thread per value of the product of 2 matrices (so, say the first matrix is n x m and the second is m x l, then I create a total of n x l threads), it appears to be much slower than using just 1 thread to compute everything. Is that normal?
Here's the code if it helps:
The class that creates the threads:
package ex4;

import java.util.*;

public class OneThreadPerValue extends ComputeMethod {
    public Matrix compute(Matrix m1, Matrix m2) {
        int length = m1.get_m();
        int res_n = m1.get_n();
        int res_m = m2.get_m();
        int[] row;
        int[] col;
        int[][] listOfCol = new int[res_m][m2.get_n()]; // we store the values of each column of the second matrix
                                                        // because get_colonne is in Theta(n)
        for (int x = 0; x < res_m; x++) {
            listOfCol[x] = m2.get_colonne(x);
        }
        Matrix res = new Matrix(res_n, res_m, this);
        List<Thread> threads = new LinkedList<Thread>();
        try {
            for (int i = 0; i < res_n; i++) {
                row = m1.get_ligne(i);
                for (int j = 0; j < res_m; j++) {
                    col = listOfCol[j];
                    ThreadMatrixV1 thread = new ThreadMatrixV1(res, col, row, i, j, length);
                    threads.add(thread);
                    thread.start();
                }
            }
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return res;
    }
}
The threads:
package ex4;

public class ThreadMatrixV1 extends Thread {
    private Matrix matrix;
    private int[] col;            // the column we want to compute
    private int[] row;            // the row we want to compute
    private int toBeModifiedCol;  // the column we want to modify
    private int toBeModifiedRow;  // the row we want to modify
    private int length;           // the length of the column and row to compute

    public ThreadMatrixV1(Matrix matrix, int[] col, int[] row, int toBeModifiedRow, int toBeModifiedCol, int length) {
        this.matrix = matrix;
        this.col = col;
        this.row = row;
        this.toBeModifiedCol = toBeModifiedCol;
        this.toBeModifiedRow = toBeModifiedRow;
        this.length = length;
    }

    @Override
    public void run() {
        int res = Matrix.computeRowCol(this.row, this.col, this.length);
        this.matrix.set(this.toBeModifiedRow, this.toBeModifiedCol, res);
    }
}
Yes, this is expected behavior. Threads are generally expensive, requiring initializing new memory space, copying relevant data over, and then running them. They're best used for long-running or blocking operations, not for simple math like this.
Even if your matrices are large enough that a plain loop causes performance degradation, it would be better to use a ThreadPoolExecutor or similar to start tasks that each handle multiple elements of the matrices, rather than to use threads directly: an ExecutorService re-uses threads internally and avoids that extra overhead.
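As a rough sketch of that idea, reusing the names from the compute method above (one task per result row instead of one thread per cell; this is my own arrangement of the pieces, not code from the question):

// One task per result row, run on a small fixed pool.
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<?>> futures = new ArrayList<>();
for (int i = 0; i < res_n; i++) {
    final int[] rowValues = m1.get_ligne(i);
    final int rowIndex = i;
    futures.add(pool.submit(() -> {
        for (int j = 0; j < res_m; j++) {
            res.set(rowIndex, j, Matrix.computeRowCol(rowValues, listOfCol[j], length));
        }
    }));
}
for (Future<?> f : futures) {
    try {
        f.get(); // wait until every row has been written
    } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
    }
}
pool.shutdown();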
Threads do not come for free. They are resources owned and managed by the underlying operating system in the end.
In other words: using threads leads to overhead. That overhead only pays off when the threaded operations last (much) longer than that overhead.
When you have to move 2 bottles of beer from the car to your house, it is quicker to just carry them, instead of first getting a basket from the house. But when you have 200 bottles to move, it is worth first looking for a better way than carrying them one by one. Solutions to problems come with a price tag.
That is the essential difference between effective and efficient btw.
I am trying to write a Java multithreaded program performing a multiplication on 2 matrices given as a file and using a limited total of threads used.
For example, if I set the number of threads to 16, I want my thread pool to be able to reuse those 16 threads until all the tasks are done.
However I end up with a larger execution time for a larger number of threads and I am having a hard time trying to understand why.
Runnable:
class Task implements Runnable
{
    int _row = 0;
    int _col = 0;

    public Task(int row, int col)
    {
        _row = row;
        _col = col;
    }

    @Override
    public void run()
    {
        Application.multiply(_row, _col);
    }
}
Application:
public class Application
{
    private static Scanner sc = new Scanner(System.in);
    private static int _A[][];
    private static int _B[][];
    private static int _C[][];

    public static void main(final String[] args) throws InterruptedException
    {
        ExecutorService executor = Executors.newFixedThreadPool(16);
        ThreadPoolExecutor pool = (ThreadPoolExecutor) executor;

        _A = readMatrix();
        _B = readMatrix();
        _C = new int[_A.length][_B[0].length];

        long startTime = System.currentTimeMillis();
        for (int x = 0; x < _C.length; x++)
        {
            for (int y = 0; y < _C[0].length; y++)
            {
                executor.execute(new Task(x, y));
            }
        }
        long endTime = System.currentTimeMillis();

        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.HOURS);
        System.out.printf("Calculation Time: %d ms\n", endTime - startTime);
    }

    public static void multMatrix(int row, int col)
    {
        int sum = 0;
        for (int i = 0; i < _B.length; i++)
        {
            sum += _A[row][i] * _B[i][col];
        }
        _C[row][col] = sum;
    }

    ...
}
The matrix calculation and workload sharing seem correct, so the problem might come from a bad use of the thread pool.
Context switching takes time.
If you have 8 cores and you are executing 8 threads, they can all work simultaneously, and as soon as one finishes, its core can pick up the next piece of work.
On the other hand, if you have 16 threads for 8 cores, each thread competes for processor time, the scheduler keeps switching between them, and your total time becomes execution time plus context-switching time.
The more threads, the more context switching, and hence the longer the time.
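A common rule of thumb, shown here as a sketch rather than anything taken from your code, is to size the pool to the machine's core count instead of hard-coding 16:

// Let the pool size follow the machine instead of a fixed 16.
int cores = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(cores);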
Those threads are already being reused to execute the tasks, that's the expected behaviour of ThreadPoolExecutor.
http://www.codejava.net/java-core/concurrency/java-concurrency-understanding-thread-pool-and-executors
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
You're getting a higher computation time as you increase the number of threads because the time needed to create them is greater than the performance improvement that concurrency brings for such relatively short tasks.
Use submit instead of execute, and keep a list of the returned Futures so that you can wait for them:
List<Future<?>> futures = new ArrayList<>();
futures.add(executor.submit(new Task(x, y)));
Then just wait for these futures to complete.
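For example (a small sketch; get() also throws InterruptedException, which your main already declares):

for (Future<?> f : futures) {
    try {
        f.get(); // blocks until this particular task has finished
    } catch (ExecutionException e) {
        e.printStackTrace(); // the task itself threw an exception
    }
}

If you want the measured time to include the actual computation, capture endTime after this loop rather than right after submitting the tasks.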
I'm in trouble with a multithreaded Java program.
The program splits the sum of an array of integers across multiple threads and then adds up the partial sums of the slices.
The problem is that the computing time does not decrease as the number of threads increases (I know there is a limit on the number of threads, beyond which computing becomes slower than with fewer threads). I expect to see execution time decrease before that limit is reached (the benefit of parallel execution). I use the variable fake in the run method to make the time "readable".
public class MainClass {
    private final int MAX_THREAD = 8;
    private final int ARRAY_SIZE = 1000000;
    private int[] array;
    private SimpleThread[] threads;
    private int numThread = 1;
    private int[] sum;
    private int start = 0;
    private int totalSum = 0;
    long begin, end;
    int fake;

    MainClass() {
        fillArray();
        for (int i = 0; i < MAX_THREAD; i++) {
            threads = new SimpleThread[numThread];
            sum = new int[numThread];
            begin = (long) System.currentTimeMillis();
            for (int j = 0; j < numThread; j++) {
                threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
                threads[j].start();
                start += ARRAY_SIZE/numThread;
            }
            for (int k = 0; k < numThread; k++) {
                try {
                    threads[k].join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            end = (long) System.currentTimeMillis();
            for (int g = 0; g < numThread; g++) {
                totalSum += sum[g];
            }
            System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
            numThread++;
            start = 0;
            totalSum = 0;
        }
    }

    public static void main(String args[]) {
        new MainClass();
    }

    private void fillArray() {
        array = new int[ARRAY_SIZE];
        for (int i = 0; i < ARRAY_SIZE; i++)
            array[i] = 1;
    }

    private class SimpleThread extends Thread {
        int start;
        int size;
        int index;

        public SimpleThread(int start, int size, int sumIndex) {
            this.start = start;
            this.size = size;
            this.index = sumIndex;
        }

        public void run() {
            for (int i = start; i < start+size; i++)
                sum[index] += array[i];
            for (long i = 0; i < 1000000000; i++) {
                fake++;
            }
        }
    }
}
Unexpected Result Screenshot
As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation, even if you have enough cores to run all threads in parallel. And your benchmarking seems to suggest (1) that this point is at about 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read / write an array that is larger than the memory cache, then many / most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this then you can find that the number of main memory requests is too much for the memory system to keep up .... and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem if you can "factor out" the thread creation overheads.
(1) I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
Why is the sum sometimes wrong?
Because ARRAY_SIZE/numThread may have a fractional part (e.g. 1000000/3 = 333333.33...), which integer division truncates, so the start variable falls short and the sum may end up less than 1000000, depending on the divisor.
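One simple way to avoid losing those elements (a sketch, not part of your original code) is to let the last thread absorb the remainder:

int chunk = ARRAY_SIZE / numThread;
for (int j = 0; j < numThread; j++) {
    // the last thread also takes the ARRAY_SIZE % numThread leftover elements
    int size = (j == numThread - 1) ? ARRAY_SIZE - start : chunk;
    threads[j] = new SimpleThread(start, size, j);
    threads[j].start();
    start += size;
}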
Why does the time taken increase as the number of threads increases?
Because in the run function of each thread you do this:
for(long i = 0; i < 1000000000; i++) {
fake++;
}
which relates to this statement from your question, which I do not understand:
I use the variable fake in run method to make time "readable".
Whatever that means, every thread needs to increment your fake variable 1000000000 times, and that busy loop, not the array sum, dominates the measured time.
As a side note, for what you're trying to do there is the Fork/Join framework. It allows you to easily split tasks recursively and implements a work-stealing algorithm that distributes your workload automatically.
There is a guide available here; its example is very similar to your case, which boils down to a RecursiveTask like this:
class Adder extends RecursiveTask<Integer>
{
    private int[] toAdd;
    private int from;
    private int to;

    /** Add the numbers in the given array */
    public Adder(int[] toAdd)
    {
        this(toAdd, 0, toAdd.length);
    }

    /** Add the numbers in the given array between the given indices;
        internal constructor to split work */
    private Adder(int[] toAdd, int fromIndex, int upToIndex)
    {
        this.toAdd = toAdd;
        this.from = fromIndex;
        this.to = upToIndex;
    }

    /** This is the work method */
    @Override
    protected Integer compute()
    {
        int amount = to - from;
        int result = 0;
        if (amount < 500)
        {
            // base case: add ints and return the result
            for (int i = from; i < to; i++)
            {
                result += toAdd[i];
            }
        }
        else
        {
            // array too large: split it into two parts and distribute the actual adding
            int newEndIndex = from + (amount / 2);
            Collection<Adder> invokeAll = invokeAll(Arrays.asList(
                    new Adder(toAdd, from, newEndIndex),
                    new Adder(toAdd, newEndIndex, to)));
            for (Adder a : invokeAll)
            {
                result += a.invoke();
            }
        }
        return result;
    }
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);
Starting threads is heavy, and you'll only see the benefit of it on large tasks that don't compete for the same resources (none of which applies here).
I've coded a multi-threaded matrix multiplication. I believe my approach is right, but I'm not 100% sure. With respect to the threads, I don't understand why I can't just run a (new MatrixThread(...)).start() instead of using an ExecutorService.
Additionally, when I benchmark the multithreaded approach versus the classical approach, the classical is much faster...
What am I doing wrong?
Matrix Class:
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class Matrix
{
    private int dimension;
    private int[][] template;

    public Matrix(int dimension)
    {
        this.template = new int[dimension][dimension];
        this.dimension = template.length;
    }

    public Matrix(int[][] array)
    {
        this.dimension = array.length;
        this.template = array;
    }

    public int getMatrixDimension() { return this.dimension; }
    public int[][] getArray() { return this.template; }

    public void fillMatrix()
    {
        Random randomNumber = new Random();
        for (int i = 0; i < dimension; i++)
        {
            for (int j = 0; j < dimension; j++)
            {
                template[i][j] = randomNumber.nextInt(10) + 1;
            }
        }
    }

    @Override
    public String toString()
    {
        String retString = "";
        for (int i = 0; i < this.getMatrixDimension(); i++)
        {
            for (int j = 0; j < this.getMatrixDimension(); j++)
            {
                retString += " " + this.getArray()[i][j];
            }
            retString += "\n";
        }
        return retString;
    }

    public static Matrix classicalMultiplication(Matrix a, Matrix b)
    {
        int[][] result = new int[a.dimension][b.dimension];
        for (int i = 0; i < a.dimension; i++)
        {
            for (int j = 0; j < b.dimension; j++)
            {
                for (int k = 0; k < b.dimension; k++)
                {
                    result[i][j] += a.template[i][k] * b.template[k][j];
                }
            }
        }
        return new Matrix(result);
    }

    public Matrix multiply(Matrix multiplier) throws InterruptedException
    {
        Matrix result = new Matrix(dimension);
        ExecutorService es = Executors.newFixedThreadPool(dimension*dimension);
        for (int currRow = 0; currRow < multiplier.dimension; currRow++)
        {
            for (int currCol = 0; currCol < multiplier.dimension; currCol++)
            {
                //(new MatrixThread(this, multiplier, currRow, currCol, result)).start();
                es.execute(new MatrixThread(this, multiplier, currRow, currCol, result));
            }
        }
        es.shutdown();
        es.awaitTermination(2, TimeUnit.DAYS);
        return result;
    }

    private class MatrixThread extends Thread
    {
        private Matrix a, b, result;
        private int row, col;

        private MatrixThread(Matrix a, Matrix b, int row, int col, Matrix result)
        {
            this.a = a;
            this.b = b;
            this.row = row;
            this.col = col;
            this.result = result;
        }

        @Override
        public void run()
        {
            int cellResult = 0;
            for (int i = 0; i < a.getMatrixDimension(); i++)
                cellResult += a.template[row][i] * b.template[i][col];
            result.template[row][col] = cellResult;
        }
    }
}
Main class:
import java.util.Scanner;
public class MatrixDriver
{
    private static final Scanner kb = new Scanner(System.in);

    public static void main(String[] args) throws InterruptedException
    {
        Matrix first, second;
        long timeLastChanged, timeNow;
        double elapsedTime;

        System.out.print("Enter value of n (must be a power of 2):");
        int n = kb.nextInt();

        first = new Matrix(n);
        first.fillMatrix();
        second = new Matrix(n);
        second.fillMatrix();

        timeLastChanged = System.currentTimeMillis();
        //System.out.println("Product of the two using threads:\n" +
        first.multiply(second);
        timeNow = System.currentTimeMillis();
        elapsedTime = (timeNow - timeLastChanged)/1000.0;
        System.out.println("Threaded took "+elapsedTime+" seconds");

        timeLastChanged = System.currentTimeMillis();
        //System.out.println("Product of the two using classical:\n" +
        Matrix.classicalMultiplication(first,second);
        timeNow = System.currentTimeMillis();
        elapsedTime = (timeNow - timeLastChanged)/1000.0;
        System.out.println("Classical took "+elapsedTime+" seconds");
    }
}
P.S. Please let me know if any further clarification is needed.
There is a bunch of overhead involved in creating threads, even when using an ExecutorService. I suspect the reason your multithreaded approach is so slow is that you're spending 99% of the time creating new threads and only 1%, or less, doing the actual math.
Typically, to solve this problem you'd batch a whole bunch of operations together and run those on a single thread. I'm not 100% sure how to do that in this case, but I suggest breaking your matrix into smaller chunks (say, 10 smaller matrices) and running those on threads, instead of running each cell in its own thread.
You're creating a lot of threads. Not only is it expensive to create threads, but for a CPU bound application, you don't want more threads than you have available processors (if you do, you have to spend processing power switching between threads, which also is likely to cause cache misses which are very expensive).
It's also unnecessary to send a thread to execute; all it needs is a Runnable. You'll get a big performance boost by applying these changes:
Make the ExecutorService a static member, size it for the current processor, and send it a ThreadFactory so it doesn't keep the program running after main has finished. (It would probably be architecturally cleaner to send it as a parameter to the method rather than keeping it as a static field; I leave that as an exercise for the reader. ☺)
private static final ExecutorService workerPool =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(), new ThreadFactory() {
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        }
    });
Make MatrixThread implement Runnable rather than inherit Thread. Threads are expensive to create; POJOs are very cheap. You can also make it static which makes the instances smaller (as non-static classes get an implicit reference to the enclosing object).
private static class MatrixThread implements Runnable
From change (1), you can no longer use awaitTermination to make sure all tasks are finished (as this worker pool is shared and never shut down). Instead, use the submit method, which returns a Future<?>. Collect all the future objects in a list, and when you've submitted all the tasks, iterate over the list and call get for each object.
Your multiply method should now look something like this:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
    Matrix result = new Matrix(dimension);
    List<Future<?>> futures = new ArrayList<Future<?>>();
    for (int currRow = 0; currRow < multiplier.dimension; currRow++) {
        for (int currCol = 0; currCol < multiplier.dimension; currCol++) {
            Runnable worker = new MatrixThread(this, multiplier, currRow, currCol, result);
            futures.add(workerPool.submit(worker));
        }
    }
    for (Future<?> f : futures) {
        try {
            f.get();
        } catch (ExecutionException e) {
            throw new RuntimeException(e); // shouldn't happen, but might do
        }
    }
    return result;
}
Will it be faster than the single-threaded version? Well, on my arguably crappy box the multithreaded version is slower for values of n < 1024.
This is just scratching the surface, though. The real problem is that you create a lot of MatrixThread instances - your memory consumption is O(n²), which is a very bad sign. Moving the inner for loop into MatrixThread.run would improve performance by a factor of craploads (ideally, you don't create more tasks than you have worker threads).
Edit: Although I have more pressing things to do, I couldn't resist optimizing this further. I came up with this (... horrendously ugly piece of code) that "only" creates O(n) jobs:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
    Matrix result = new Matrix(dimension);
    List<Future<?>> futures = new ArrayList<Future<?>>();
    for (int currRow = 0; currRow < multiplier.dimension; currRow++) {
        Runnable worker = new MatrixThread2(this, multiplier, currRow, result);
        futures.add(workerPool.submit(worker));
    }
    for (Future<?> f : futures) {
        try {
            f.get();
        } catch (ExecutionException e) {
            throw new RuntimeException(e); // shouldn't happen, but might do
        }
    }
    return result;
}

private static class MatrixThread2 implements Runnable
{
    private Matrix self, mul, result;
    private int row;

    private MatrixThread2(Matrix a, Matrix b, int row, Matrix result)
    {
        this.self = a;
        this.mul = b;
        this.row = row;
        this.result = result;
    }

    @Override
    public void run()
    {
        for (int col = 0; col < mul.dimension; col++) {
            int cellResult = 0;
            for (int i = 0; i < self.getMatrixDimension(); i++)
                cellResult += self.template[row][i] * mul.template[i][col];
            result.template[row][col] = cellResult;
        }
    }
}
It's still not great, but basically the multi-threaded version can compute anything you'll be patient enough to wait for, and it'll do it faster than the single-threaded version.
First of all, you should use a newFixedThreadPool sized to the number of cores you have; on a quad-core, use 4. Second, don't create a new pool for each matrix multiplication.
If you make the ExecutorService a static member variable, I get almost consistently faster execution of the threaded version at a matrix size of 512.
Also, changing MatrixThread to implement Runnable instead of extending Thread speeds up execution further; on my machine the threaded version is then 2x as fast at size 512.