For my programming class we have to do the following:
Fill an integer array with 5 million integers ranging from 0 to 9.
Then count how many times each number (0-9) occurs and display the counts.
We have to measure the time it takes to count the occurrences both single-threaded and multi-threaded. Currently I average 9.3 ms single-threaded and 8.9 ms multi-threaded with 8 threads on my 8-core CPU. Why is the difference so small?
For multithreading I have one array filled with numbers, and I calculate lower and upper bounds for each thread to count occurrences. Here is my current attempt:
public void createThreads(int divisionSize) throws InterruptedException {
threads = new Thread[threadCount];
for(int i = 0; i < threads.length; i++) {
final int lower = (i*divisionSize);
final int upper = lower + divisionSize - 1;
threads[i] = new Thread(new Runnable() {
long start, end;
@Override
public void run() {
start = System.nanoTime();
for(int i = lower; i <= upper; i++) {
occurences[numbers[i]]++;
}
end = System.nanoTime();
milliseconds += (end-start)/1000000.0;
}
});
threads[i].start();
threads[i].join();
}
}
Could anyone shed some light? Cheers.
You are essentially doing all the work sequentially, because you join each thread immediately after creating it.
Move threads[i].join() out of the construction loop into its own loop. While you're at it, you should probably also start the threads in a separate loop: creating threads takes time, so starting some while others are still being created is not a good idea.
class ThreadTester {
private final int threadCount;
private final int numberCount;
int[] numbers = new int[5_000_000];
AtomicIntegerArray occurences;
Thread[] threads;
AtomicLong milliseconds = new AtomicLong();
public ThreadTester(int threadCount, int numberCount) {
this.threadCount = threadCount;
this.numberCount = numberCount;
occurences = new AtomicIntegerArray(numberCount);
threads = new Thread[threadCount];
Random r = new Random();
for (int i = 0; i < numbers.length; i++) {
numbers[i] = r.nextInt(numberCount);
}
}
public void createThreads() throws InterruptedException {
final int divisionSize = numbers.length / threadCount;
for (int i = 0; i < threads.length; i++) {
final int lower = (i * divisionSize);
final int upper = lower + divisionSize - 1;
threads[i] = new Thread(new Runnable() {
@Override
public void run() {
long start = System.nanoTime();
for (int i = lower; i <= upper; i++) {
occurences.addAndGet(numbers[i], 1);
}
long end = System.nanoTime();
milliseconds.addAndGet(end - start);
}
});
}
}
private void startThreads() {
for (Thread thread : threads) {
thread.start();
}
}
private void finishThreads() throws InterruptedException {
for (Thread thread : threads) {
thread.join();
}
}
public long test() throws InterruptedException {
createThreads();
startThreads();
finishThreads();
return milliseconds.get();
}
}
public void test() throws InterruptedException {
for (int threads = 1; threads < 50; threads++) {
ThreadTester tester = new ThreadTester(threads, 10);
System.out.println("Threads=" + threads + " ns=" + tester.test());
}
}
Note that even here the fastest solution uses one thread, but you can clearly see that an even number of threads is quicker, as I am using an i5, which has 2 cores but works as 4 via hyperthreading.
Interestingly though, as suggested by @biziclop, removing all contention between threads by giving each thread its own occurrences array gives a more expected result:
The other answers all explored the immediate problems with your code. I'll give you a different angle: one that's about the design of multi-threading in general.
The idea of parallel computing speeding up calculations depends on the assumption that the small bits you broke the problem up into can indeed be run in parallel, independently of each other.
And at first glance, your problem is exactly like that: chop the input range up into 8 equal parts, fire up 8 threads, and off they go.
There is a catch though:
occurences[numbers[i]]++;
The occurences array is a resource shared by all threads, and therefore you must control access to it to ensure correctness: either by explicit synchronization (which is slow) or something like an AtomicIntegerArray. But the Atomic* classes are only really fast if access to them is rarely contested. And in your case access will be contested a lot, because most of what your inner loop does is incrementing the number of occurrences.
So what can you do?
The problem is caused partly by the fact that occurences is such a small structure (an array with only 10 elements, regardless of input size) that threads will continuously try to update the same element. But you can turn that to your advantage: make each thread keep its own separate tally, and when they have all finished, just add up their results. This adds a small, constant overhead at the end of the process but makes the calculations truly parallel.
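As a rough illustration of the separate-tally idea, here is a minimal sketch. The class name LocalTallyCount and the method count are made up for this example and do not come from the question; it assumes every value in the input is in the range 0-9 and uses Java 8 lambdas for brevity.

```java
import java.util.Arrays;
import java.util.Random;

final class LocalTallyCount {

    /** Counts occurrences of the values 0-9, giving each thread a private tally. */
    static int[] count(int[] numbers, int threadCount) {
        int[][] partial = new int[threadCount][10];
        Thread[] threads = new Thread[threadCount];
        int chunk = numbers.length / threadCount;

        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            final int lower = t * chunk;
            // The last thread also takes the remainder of the division.
            final int upper = (t == threadCount - 1) ? numbers.length : lower + chunk;
            threads[t] = new Thread(() -> {
                int[] local = new int[10];           // private to this thread, no contention
                for (int i = lower; i < upper; i++) {
                    local[numbers[i]]++;
                }
                partial[id] = local;                 // visible to the caller after join()
            });
        }
        for (Thread thread : threads) thread.start();
        try {
            for (Thread thread : threads) thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", e);
        }

        // Small, constant-size merge step: add up the per-thread tallies.
        int[] total = new int[10];
        for (int[] local : partial) {
            for (int d = 0; d < 10; d++) total[d] += local[d];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] numbers = new int[5_000_000];
        Random random = new Random();
        for (int i = 0; i < numbers.length; i++) numbers[i] = random.nextInt(10);
        System.out.println(Arrays.toString(count(numbers, 8)));
    }
}
```

The only shared write is the single assignment into partial at the end of each thread, so the inner loop touches nothing that another thread touches.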
The join method makes one thread wait for the completion of another, so in your loop the second thread starts only after the first has finished.
Join the threads only after you have started all of them.
public void createThreads(int divisionSize) throws InterruptedException {
threads = new Thread[threadCount];
for(int i = 0; i < threads.length; i++) {
final int lower = (i*divisionSize);
final int upper = lower + divisionSize - 1;
threads[i] = new Thread(new Runnable() {
long start, end;
@Override
public void run() {
start = System.nanoTime();
for(int i = lower; i <= upper; i++) {
occurences[numbers[i]]++;
}
end = System.nanoTime();
milliseconds += (end-start)/1000000.0;
}
});
threads[i].start();
}
for(int i = 0; i < threads.length; i++) {
threads[i].join();
}
}
Also, there seems to be a race condition in the code at occurences[numbers[i]]++
So once the threads actually run concurrently, the output most probably won't be correct. You should use an AtomicIntegerArray: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/AtomicIntegerArray.html
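A minimal sketch of what the AtomicIntegerArray variant could look like follows. The class name AtomicCount and the method count are invented for this example, and the input is assumed to contain only values 0-9. This version gives correct counts, but all threads still contend on the same ten counters, so it is not necessarily fast.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

final class AtomicCount {

    /** Counts occurrences of 0-9 with all threads updating one shared atomic array. */
    static int[] count(int[] numbers, int threadCount) {
        AtomicIntegerArray occurrences = new AtomicIntegerArray(10);
        Thread[] threads = new Thread[threadCount];
        int chunk = numbers.length / threadCount;

        for (int t = 0; t < threadCount; t++) {
            final int lower = t * chunk;
            // The last thread also takes the remainder of the division.
            final int upper = (t == threadCount - 1) ? numbers.length : lower + chunk;
            threads[t] = new Thread(() -> {
                for (int i = lower; i < upper; i++) {
                    // Atomic read-modify-write, unlike occurences[numbers[i]]++ on an int[].
                    occurrences.incrementAndGet(numbers[i]);
                }
            });
        }
        for (Thread thread : threads) thread.start();
        try {
            for (Thread thread : threads) thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", e);
        }

        // Copy the final values into a plain array for the caller.
        int[] result = new int[10];
        for (int d = 0; d < 10; d++) result[d] = occurrences.get(d);
        return result;
    }

    public static void main(String[] args) {
        int[] numbers = new int[1_000_000];
        for (int i = 0; i < numbers.length; i++) numbers[i] = i % 10;
        System.out.println(java.util.Arrays.toString(count(numbers, 8)));
    }
}
```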
Use an ExecutorService with Callable and invoke all tasks; then you can safely aggregate the results. Also use TimeUnit for time manipulations (sleeping, joining, waiting, conversion, ...).
Start by defining the task with its input and output:
class Task implements Callable<Task> {
// input
int[] source;
int sliceStart;
int sliceEnd;
// output
int[] occurences = new int[10];
String runner;
long elapsed = 0;
Task(int[] source, int sliceStart, int sliceEnd) {
this.source = source;
this.sliceStart = sliceStart;
this.sliceEnd = sliceEnd;
}
@Override
public Task call() {
runner = Thread.currentThread().getName();
long start = System.nanoTime();
try {
compute();
} finally {
elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
}
return this;
}
void compute() {
for (int i = sliceStart; i < sliceEnd; i++) {
occurences[source[i]]++;
}
}
}
Then let's define some variables to manage the parameters:
// Parameters
int size = 5_000_000;
int parallel = Runtime.getRuntime().availableProcessors();
int slices = parallel;
Then generate random input:
// Generated source
int[] source = new int[size];
ThreadLocalRandom random = ThreadLocalRandom.current();
for (int i = 0; i < source.length; i++) source[i] = random.nextInt(10);
Start timing total computation and prepare tasks:
long start = System.nanoTime();
// Prepare tasks
List<Task> tasks = new ArrayList<>(slices);
int sliceSize = source.length / slices;
for (int sliceStart = 0; sliceStart < source.length;) {
int sliceEnd = Math.min(sliceStart + sliceSize, source.length);
Task task = new Task(source, sliceStart, sliceEnd);
tasks.add(task);
sliceStart = sliceEnd;
}
Execute all tasks on a thread pool (don't forget to shut it down!):
// Execute tasks
ExecutorService executor = Executors.newFixedThreadPool(parallel);
try {
executor.invokeAll(tasks);
} finally {
executor.shutdown();
}
Once the tasks have completed, just aggregate the data:
// Collect data
int[] occurences = new int[10];
for (Task task : tasks) {
for (int i = 0; i < occurences.length; i++) {
occurences[i] += task.occurences[i];
}
}
Finally, you can output the computation result:
// Display result
long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
System.out.printf("Computation done in %tT.%<tL%n", calendar(elapsed));
System.out.printf("Results: %s%n", Arrays.toString(occurences));
You can also output partial computations:
// Print debug output
int idxSize = (String.valueOf(size).length() * 4) / 3;
String template = "Slice[%," + idxSize + "d-%," + idxSize + "d] computed in %tT.%<tL by %s: %s%n";
for (Task task : tasks) {
System.out.printf(template, task.sliceStart, task.sliceEnd, calendar(task.elapsed), task.runner, Arrays.toString(task.occurences));
}
Which gives on my workstation:
Computation done in 00:00:00.024
Results: [500159, 500875, 500617, 499785, 500017, 500777, 498394, 498614, 499498, 501264]
Slice[ 0-1 250 000] computed in 00:00:00.013 by pool-1-thread-1: [125339, 125580, 125338, 124888, 124751, 124608, 124463, 124351, 125023, 125659]
Slice[1 250 000-2 500 000] computed in 00:00:00.014 by pool-1-thread-2: [124766, 125423, 125111, 124756, 125201, 125695, 124266, 124405, 125083, 125294]
Slice[2 500 000-3 750 000] computed in 00:00:00.013 by pool-1-thread-3: [124903, 124756, 124934, 125640, 124954, 125452, 124556, 124816, 124737, 125252]
Slice[3 750 000-5 000 000] computed in 00:00:00.014 by pool-1-thread-4: [125151, 125116, 125234, 124501, 125111, 125022, 125109, 125042, 124655, 125059]
The small trick to convert elapsed milliseconds into a stopwatch calendar:
static final TimeZone UTC= TimeZone.getTimeZone("UTC");
public static Calendar calendar(long millis) {
Calendar calendar = Calendar.getInstance(UTC);
calendar.setTimeInMillis(millis);
return calendar;
}
I am trying to write a multithreaded Java program that multiplies 2 matrices given as a file, using a limited total number of threads.
For example, if I set the number of threads to 16, I want my thread pool to reuse those 16 threads until all the tasks are done.
However, I end up with a longer execution time for a larger number of threads, and I am having a hard time understanding why.
Runnable:
class Task implements Runnable
{
int _row = 0;
int _col = 0;
public Task(int row, int col)
{
_row = row;
_col = col;
}
@Override
public void run()
{
Application.multiply(_row, _col);
}
}
Application:
public class Application
{
private static Scanner sc = new Scanner(System.in);
private static int _A[][];
private static int _B[][];
private static int _C[][];
public static void main(final String [] args) throws InterruptedException
{
ExecutorService executor = Executors.newFixedThreadPool(16);
ThreadPoolExecutor pool = (ThreadPoolExecutor) executor;
_A = readMatrix();
_B = readMatrix();
_C = new int[_A.length][_B[0].length];
long startTime = System.currentTimeMillis();
for (int x = 0; x < _C.length; x++)
{
for (int y = 0; y < _C[0].length; y++)
{
executor.execute(new Task(x, y));
}
}
long endTime = System.currentTimeMillis();
executor.shutdown();
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.HOURS);
System.out.printf("Calculation Time: %d ms\n" , endTime - startTime);
}
public static void multMatrix(int row, int col)
{
int sum = 0;
for (int i = 0; i < _B.length; i++)
{
sum += _A[row][i] * _B[i][col];
}
_C[row][col] = sum;
}
...
}
The matrix calculations and workload sharing seem correct, so the problem might come from bad use of the thread pool.
Context switching takes time.
If you have 8 cores and you are executing 8 threads, they can all work simultaneously, and as soon as one finishes a task it will be reused.
On the other hand, if you have 16 threads for 8 cores, the threads compete for processor time, the scheduler switches between them, and your total time increases to execution time + context switching.
The more threads, the more context switching, and hence the longer the run time.
Those threads are already being reused to execute the tasks, that's the expected behaviour of ThreadPoolExecutor.
http://www.codejava.net/java-core/concurrency/java-concurrency-understanding-thread-pool-and-executors
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
You're getting a higher computation time as you increase the number of threads because the time needed to create them is greater than the performance gain that concurrency gives when executing such relatively short tasks.
Use submit instead of execute
Make a list of returned Futures so that you can wait for them.
List<Future<?>> futures = new ArrayList<>();
futures.add(executor.submit(new Task(x, y)));
Then just wait for these futures to complete.
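The submit-then-wait pattern described above could be sketched as follows. The class WaitForFutures and the helper runAll are hypothetical names for this illustration, and the tiny 4x4 array in main is only a stand-in for the real matrix work. Note that the elapsed time is taken after every get() has returned, so it covers the computation and not just the submission.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class WaitForFutures {

    /** Runs every task on a fixed pool and returns only after all have finished. */
    static void runAll(List<Runnable> tasks, int poolSize) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable task : tasks) {
                futures.add(executor.submit(task));
            }
            for (Future<?> future : futures) {
                // Blocks until this task is done; a task failure surfaces
                // here wrapped in an ExecutionException.
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        int[][] c = new int[4][4];
        List<Runnable> tasks = new ArrayList<>();
        for (int x = 0; x < 4; x++) {
            for (int y = 0; y < 4; y++) {
                final int row = x;
                final int col = y;
                // Stand-in for the real per-cell work.
                tasks.add(() -> c[row][col] = row * col);
            }
        }
        long start = System.nanoTime();
        runAll(tasks, 4);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("Calculation time: " + elapsedMicros + " us");
    }
}
```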
My program tries to sum a range with a given number of threads in order to run in parallel, but it seems to run better with just one thread than with 4 (I have an 8-core CPU). It is my first time working with multithreading in Java, so maybe there is a problem in my code that makes it take longer?
My benchmarks (sum of range 0-10000) so far are:
1 thread: 1350 microsecs (average)
2 thread: 1800 microsecs (average)
4 thread: 2400 microsecs (average)
8 thread: 3300 microsecs (average)
Thanks in advance!
/*
Compile: javac RangeSum.java
Execute: java RangeSum nThreads initRange finRange
*/
import java.util.ArrayList;
import java.util.concurrent.*;
public class RangeSum implements Runnable {
private int init;
private int end;
private int id;
static public int out = 0;
Object lock = new Object();
public synchronized static void increment(int partial) {
out = out + partial;
}
public RangeSum(int init,int end) {
this.init = init;
this.end = end;
}//parameters to pass in threads
// the function called for each thread
public void run() {
int partial = 0;
for(int k = this.init; k < this.end; k++)
{
partial = k + partial + 1;
}
increment(partial);
}//thread: sum its id to the out variable
public static void main(String args[]) throws InterruptedException {
final long startTime = System.nanoTime()/1000;//start time: microsecs
//get command line values for
int NumberOfThreads = Integer.valueOf(args[0]);
int initRange = Integer.valueOf(args[1]);
int finRange = Integer.valueOf(args[2]);
//int[] out = new int[NumberOfThreads];
// an array of threads
ArrayList<Thread> Threads = new ArrayList<Thread>(NumberOfThreads);
// spawn the threads / CREATE
for (int i = 0; i < NumberOfThreads; i++) {
int initial = i*finRange/NumberOfThreads;
int end = (i+1)*finRange/NumberOfThreads;
Threads.add(i, new Thread(new RangeSum(initial,end)));
Threads.get(i).start();
}
// wait for the threads to finish / JOIN
for (int i = 0; i < NumberOfThreads; i++) {
try {
Threads.get(i).join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
System.out.println("All threads finished!");
System.out.println("Total range sum: " + out);
final long endTime = System.nanoTime()/1000;//end time
System.out.println("Time elapsed: "+(endTime - startTime));
}
}
Your workload is an entirely in-memory, non-blocking computation. As a general principle, in this kind of scenario a single thread will complete the work faster than multiple threads.
Multiple threads tend to interfere with the L1/L2 CPU caching and incur additional overhead for context switching.
Specifically, with respect to your code, you initialize final long startTime = System.nanoTime()/1000; too early, so you measure thread setup time as well as the time the threads actually take to complete. It's probably better to set up your Threads list first and then:
final long startTime =...
for (int i = 0; i < NumberOfThreads; i++) {
Threads.get(i).start();
}
But really, in this case, the expectation that multiple threads will improve processing time is not warranted.
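To make the timing advice concrete, here is a minimal sketch that builds all threads first and starts the clock only right before they are started. The class TimedRangeSum and its sum method are made up for this example; unlike the question's code, it sums the integers in [from, to) and keeps per-thread partial sums instead of a shared static field.

```java
final class TimedRangeSum {

    /** Sums the integers in [from, to) with the given number of threads. */
    static long sum(int threadCount, int from, int to) {
        long[] partial = new long[threadCount];
        Thread[] threads = new Thread[threadCount];
        long span = (long) to - from;

        // Setup phase: create the threads, but do not start them yet.
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            final int lo = (int) (from + span * t / threadCount);
            final int hi = (int) (from + span * (t + 1) / threadCount);
            threads[t] = new Thread(() -> {
                long local = 0;
                for (int k = lo; k < hi; k++) local += k;
                partial[id] = local;             // one write per thread, read after join()
            });
        }

        long start = System.nanoTime();          // setup is done; time only start..join
        for (Thread thread : threads) thread.start();
        try {
            for (Thread thread : threads) thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        }
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println(threadCount + " thread(s): " + elapsedMicros + " us");

        long total = 0;
        for (long p : partial) total += p;
        return total;
    }

    public static void main(String[] args) {
        for (int threads = 1; threads <= 8; threads *= 2) {
            System.out.println("Sum: " + sum(threads, 0, 10_000));
        }
    }
}
```

Even with the setup excluded, starting and joining threads is likely to dominate for a range this small, which is consistent with the point made above.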
I have code like the example below. In a loop it executes the method "process", and it runs sequentially. I want to run this method in parallel, but all calls should be finished before the loop ends so that I can sum the results in the next lines, i.e. even when running in parallel, all calls should finish before the second for loop executes.
How can I solve this in JDK 1.7, not JDK 1.8?
public static void main(String s[]){
int arrlen = 10;
int arr[] = new int[arrlen] ;
int t =0;
for(int i=0;i<arrlen;i++){
arr[i] = i;
t = process(arr[i]);
arr[i] = t;
}
int sum =0;
for(int i=0;i<arrlen;i++){
sum += arr[i];
}
System.out.println(sum);
}
public static int process(int arr){
return arr*2;
}
The example below might help you. I have used the fork/join framework to do that.
For a small array like the one in your example, the conventional method might be faster, and I suspect the fork/join way would take slightly longer. But for a larger array or a more expensive process method, the fork/join framework is suitable. Even Java 8 parallel streams use the fork/join framework underneath.
public class ForkMultiplier extends RecursiveAction {
int[] array;
int threshold = 3;
int start;
int end;
public ForkMultiplier(int[] array,int start, int end) {
this.array = array;
this.start = start;
this.end = end;
}
protected void compute() {
if (end - start < threshold) {
computeDirectly();
} else {
int middle = (end + start) / 2;
ForkMultiplier f1= new ForkMultiplier(array, start, middle);
ForkMultiplier f2= new ForkMultiplier(array, middle, end);
invokeAll(f1, f2);
}
}
protected void computeDirectly() {
for (int i = start; i < end; i++) {
array[i] = array[i] * 2;
}
}
}
Your main class would look like this:
public static void main(String s[]){
int arrlen = 10;
int arr[] = new int[arrlen] ;
for(int i=0;i<arrlen;i++){
arr[i] = i;
}
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new ForkMultiplier(arr, 0, arr.length));
int sum =0;
for(int i=0;i<arrlen;i++){
sum += arr[i];
}
System.out.println(sum);
}
You basically need to combine Executors and Futures, which have existed since Java 1.5 (see the Java documentation).
In the following example, I've created a main class that uses a helper class acting as the processor you want to parallelize.
The main class is split into 3 steps:
1. Create the thread pool and execute the tasks in parallel.
2. Wait for all tasks to finish their work.
3. Collect the results from the tasks.
For didactic reasons, I've added some logging and, more importantly, a random waiting time in each process's business logic, simulating a time-consuming algorithm run by the Process class.
The maximum waiting time for each process is 2 seconds, which is also the longest waiting time for step 2, even if you increase the number of parallel tasks (just try changing the variable totalTasks in the following code to test it).
Here is the Main class:
package com.example;
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class Main
{
public static void main(String[] args) throws InterruptedException, ExecutionException
{
int totalTasks = 100;
ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(totalTasks);
System.out.println("Step 1 - Starting parallel tasks");
ArrayList<Future<Integer>> tasks = new ArrayList<Future<Integer>>();
for (int i = 0; i < totalTasks; i++) {
tasks.add(newFixedThreadPool.submit(new Process(i)));
}
long ts = System.currentTimeMillis();
System.out.println("Step 2 - Wait for processes to finish...");
boolean tasksCompleted;
do {
tasksCompleted = true;
for (Future<Integer> task : tasks) {
if (!task.isDone()) {
tasksCompleted = false;
Thread.sleep(10);
break;
}
}
} while (!tasksCompleted);
System.out.println(String.format("Step 2 - End in '%.3f' seconds", (System.currentTimeMillis() - ts) / 1000.0));
System.out.println("Step 3 - All processes finished to run, let's collect results...");
Integer sum = 0;
for (Future<Integer> task : tasks) {
sum += task.get();
}
System.out.println(String.format("Total final sum is: %d", sum));
}
}
Here is the Process class:
package com.example;
import java.util.concurrent.Callable;
public class Process implements Callable<Integer>
{
private Integer value;
public Process(Integer value)
{
this.value = value;
}
public Integer call() throws Exception
{
Long sleepTime = (long)(Math.random() * 2000);
System.out.println(String.format("Starting process with value %d, sleep time %d", this.value, sleepTime));
Thread.sleep(sleepTime);
System.out.println(String.format("Stopping process with value %d", this.value));
return value * 2;
}
}
Hope this helps.
This is probably a pretty easy question, but as I never worked with threads before I figured it would be best to ask instead of trying to find the optimal solution completely on my own.
I have a giant for loop that runs literally billions of times. On each iteration, the program calculates a result, in the form of a number, from the current index. I am only interested in storing the top result (or top x results) and its corresponding index.
My question is simple: what would be the right way to run this loop in threads so that it uses all the available CPUs/cores?
int topResultIndex;
double topResult = 0;
for (i=1; i < 1000000000; ++i) {
double result = // some complicated calculation based on the current index
if (result > topResult) {
topResult = result;
topResultIndex = i;
}
}
The calculation is completely independent for each index, no resources are shared. topResultIndex and topResult will be obviously accessed by each thread though.
* Update: Both Giulio's and rolfl's solution are good, also very similar. Could only accept one of them as my answer.
Let's assume that the result is computed by a calculateResult(long) method, which is private and static, and does not access any static field, (it can also be non-static, but still it must be thread-safe and concurrently-executable, hopefully thread-confined).
Then, I think this will do the dirty work:
public static class Response {
int index;
double result;
}
private static class MyTask implements Callable<Response> {
private long from;
private long to;
public MyTask(long fromIndexInclusive, long toIndexExclusive) {
this.from = fromIndexInclusive;
this.to = toIndexExclusive;
}
public Response call() {
int topResultIndex = -1;
double topResult = 0;
for (long i = from; i < to; ++i) {
double result = calculateResult(i);
if (result > topResult) {
topResult = result;
topResultIndex = (int) i;
}
}
Response res = new Response();
res.index = topResultIndex;
res.result = topResult;
return res;
}
};
private static double calculateResult(long index) { ... }
public Response interfaceMethod() {
//You might want to make this static/shared/global
ExecutorService svc = Executors.newCachedThreadPool();
int chunks = Runtime.getRuntime().availableProcessors();
long iterations = 1000000000;
MyTask[] tasks = new MyTask[chunks];
for (int i = 0; i < chunks; ++i) {
//You'd better cast to double and round here
tasks[i] = new MyTask(iterations / chunks * i, iterations / chunks * (i + 1));
}
List<Future<Response>> resp = svc.invokeAll(Arrays.asList(tasks));
Iterator<Future<Response>> respIt = resp.iterator();
//You'll have to handle exceptions here
Response bestResponse = respIt.next().get();
while (respIt.hasNext()) {
Response r = respIt.next().get();
if (r.result > bestResponse.result) {
bestResponse = r;
}
}
return bestResponse;
}
In my experience, this division into chunks is much faster than having a task for each index (especially if the computational load for each single index is small, as it probably is; by small, I mean less than half a second). It's a bit harder to code, though, because you need a 2-step maximization (first at chunk level, then at global level). With this, if the computation is purely CPU-bound (does not push the RAM too much), you should get a speedup of roughly 80% of the number of physical cores.
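One detail of the chunk division is worth a small worked example. Computing the bounds as iterations / chunks * i truncates first, so with 1,000,000,000 iterations and 6 chunks the last bound is 999,999,996 and the final 4 indices are never visited. Multiplying before dividing avoids this. The sketch below uses a hypothetical helper class ChunkBounds and assumes total * chunks fits in a long.

```java
final class ChunkBounds {

    /** Returns {fromInclusive, toExclusive} pairs that cover [0, total) with no gaps. */
    static long[][] split(long total, int chunks) {
        long[][] bounds = new long[chunks][2];
        for (int i = 0; i < chunks; i++) {
            // Multiply first, then divide, so the remainder is spread over the chunks
            // and the last chunk always ends exactly at total.
            bounds[i][0] = total * i / chunks;
            bounds[i][1] = total * (i + 1) / chunks;
        }
        return bounds;
    }

    public static void main(String[] args) {
        for (long[] range : split(1_000_000_000L, 6)) {
            System.out.println("[" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

Each pair can be passed directly to a task constructor that takes an inclusive start and an exclusive end.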
Apart from the observation that a C program with OpenMP or some other parallel computing extensions would be a better idea, the Java way to do it would be to create a 'Future' Task that calculates a subset of the problem:
private static final class Result {
final int index;
final double result;
public Result (int index, double result) {
this.result = result;
this.index = index;
}
}
// Calculate 10,000 values in each task
int steps = 10000;
int cpucount = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(cpucount);
ArrayList<Future<Result>> results = new ArrayList<>();
for (int i = 0; i < 1000000000; i+= steps) {
final int from = i;
final int to = from + steps;
results.add(service.submit(new Callable<Result>() {
public Result call() {
int topResultIndex = -1;
double topResult = 0;
for (int j = from; j < to; j++) {
// do complicated things with 'j'
double result = // some complicated calculation based on the current index
if (result > topResult) {
topResult = result;
topResultIndex = j;
}
}
return new Result(topResultIndex, topResult);
}
}));
}
service.shutdown();
while (!service.isTerminated()) {
System.out.println("Waiting for threads to complete");
service.awaitTermination(10, TimeUnit.SECONDS);
}
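Result best = null;
for (Future<Result> fut : results) {
// get() returns the task's Result; it can throw InterruptedException or ExecutionException
Result candidate = fut.get();
if (best == null || candidate.result > best.result) {
best = candidate;
}
}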
System.out.printf("Best result is %f at index %d\n", best.result, best.index);
The easiest way would be to use an ExecutorService and submit your tasks as a Runnable or Callable. You can use Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()) to create an ExecutorService that uses as many threads as there are processors.
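A minimal sketch of that approach, applied to the "top result" loop, might look like this. The class TopResultSearch, the method argMax and the score function parameter are all assumptions made for illustration; the real calculation would take the place of score. One Callable is submitted per processor, and the per-chunk winners are compared at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongToDoubleFunction;

final class TopResultSearch {

    /** Returns the index in [from, to) with the highest score, one chunk per processor. */
    static long argMax(long from, long to, LongToDoubleFunction score) {
        if (from >= to) throw new IllegalArgumentException("empty range");
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            long span = to - from;
            for (int w = 0; w < workers; w++) {
                final long lo = from + span * w / workers;
                final long hi = from + span * (w + 1) / workers;
                Callable<Long> task = () -> {
                    if (lo >= hi) return null;            // empty chunk, nothing to report
                    long best = lo;
                    double bestScore = score.applyAsDouble(lo);
                    for (long i = lo + 1; i < hi; i++) {
                        double s = score.applyAsDouble(i);
                        if (s > bestScore) { bestScore = s; best = i; }
                    }
                    return best;
                };
                futures.add(pool.submit(task));
            }
            // Second step: compare the winners of the individual chunks.
            long best = from;
            double bestScore = 0;
            boolean found = false;
            for (Future<Long> future : futures) {
                Long candidate = future.get();
                if (candidate == null) continue;
                double s = score.applyAsDouble(candidate);
                if (!found || s > bestScore) { bestScore = s; best = candidate; found = true; }
            }
            return best;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long index = argMax(1, 1_000_000, i -> Math.sin(i * 0.001));
        System.out.println("Top result at index " + index);
    }
}
```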
I've coded a multi-threaded matrix multiplication. I believe my approach is right, but I'm not 100% sure. In respect to the threads, I don't understand why I can't just run a (new MatrixThread(...)).start() instead of using an ExecutorService.
Additionally, when I benchmark the multithreaded approach versus the classical approach, the classical is much faster...
What am I doing wrong?
Matrix Class:
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class Matrix
{
private int dimension;
private int[][] template;
public Matrix(int dimension)
{
this.template = new int[dimension][dimension];
this.dimension = template.length;
}
public Matrix(int[][] array)
{
this.dimension = array.length;
this.template = array;
}
public int getMatrixDimension() { return this.dimension; }
public int[][] getArray() { return this.template; }
public void fillMatrix()
{
Random randomNumber = new Random();
for(int i = 0; i < dimension; i++)
{
for(int j = 0; j < dimension; j++)
{
template[i][j] = randomNumber.nextInt(10) + 1;
}
}
}
@Override
public String toString()
{
String retString = "";
for(int i = 0; i < this.getMatrixDimension(); i++)
{
for(int j = 0; j < this.getMatrixDimension(); j++)
{
retString += " " + this.getArray()[i][j];
}
retString += "\n";
}
return retString;
}
public static Matrix classicalMultiplication(Matrix a, Matrix b)
{
int[][] result = new int[a.dimension][b.dimension];
for(int i = 0; i < a.dimension; i++)
{
for(int j = 0; j < b.dimension; j++)
{
for(int k = 0; k < b.dimension; k++)
{
result[i][j] += a.template[i][k] * b.template[k][j];
}
}
}
return new Matrix(result);
}
public Matrix multiply(Matrix multiplier) throws InterruptedException
{
Matrix result = new Matrix(dimension);
ExecutorService es = Executors.newFixedThreadPool(dimension*dimension);
for(int currRow = 0; currRow < multiplier.dimension; currRow++)
{
for(int currCol = 0; currCol < multiplier.dimension; currCol++)
{
//(new MatrixThread(this, multiplier, currRow, currCol, result)).start();
es.execute(new MatrixThread(this, multiplier, currRow, currCol, result));
}
}
es.shutdown();
es.awaitTermination(2, TimeUnit.DAYS);
return result;
}
private class MatrixThread extends Thread
{
private Matrix a, b, result;
private int row, col;
private MatrixThread(Matrix a, Matrix b, int row, int col, Matrix result)
{
this.a = a;
this.b = b;
this.row = row;
this.col = col;
this.result = result;
}
@Override
public void run()
{
int cellResult = 0;
for (int i = 0; i < a.getMatrixDimension(); i++)
cellResult += a.template[row][i] * b.template[i][col];
result.template[row][col] = cellResult;
}
}
}
Main class:
import java.util.Scanner;
public class MatrixDriver
{
private static final Scanner kb = new Scanner(System.in);
public static void main(String[] args) throws InterruptedException
{
Matrix first, second;
long timeLastChanged,timeNow;
double elapsedTime;
System.out.print("Enter value of n (must be a power of 2):");
int n = kb.nextInt();
first = new Matrix(n);
first.fillMatrix();
second = new Matrix(n);
second.fillMatrix();
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using threads:\n" +
first.multiply(second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Threaded took "+elapsedTime+" seconds");
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using classical:\n" +
Matrix.classicalMultiplication(first,second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Classical took "+elapsedTime+" seconds");
}
}
P.S. Please let me know if any further clarification is needed.
There is a bunch of overhead involved in creating threads, even when using an ExecutorService. I suspect the reason your multithreaded approach is so slow is that you're spending 99% of the time creating new threads and only 1%, or less, doing the actual math.
Typically, to solve this problem you'd batch a whole bunch of operations together and run each batch on a single thread. I'm not 100% sure how to do that in this case, but I suggest breaking your matrix into smaller chunks (say, 10 smaller matrices) and running those on threads, instead of running each cell in its own thread.
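One possible way to batch the work is to hand each worker a contiguous block of result rows, so there are only as many tasks as workers. The sketch below works on plain int[][] arrays rather than the question's Matrix class, and the names BlockedMultiply and multiply are invented for this example; it assumes the matrices are non-empty and have compatible dimensions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BlockedMultiply {

    /** Multiplies a by b, giving each worker one contiguous block of result rows. */
    static int[][] multiply(int[][] a, int[][] b, int workers) {
        final int rows = a.length;
        final int cols = b[0].length;
        final int inner = b.length;
        final int[][] c = new int[rows][cols];

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int firstRow = rows * w / workers;
                final int lastRow = rows * (w + 1) / workers;   // exclusive
                Runnable block = () -> {
                    // Each task writes only its own rows of c, so no locking is needed.
                    for (int i = firstRow; i < lastRow; i++) {
                        for (int j = 0; j < cols; j++) {
                            int sum = 0;
                            for (int k = 0; k < inner; k++) sum += a[i][k] * b[k][j];
                            c[i][j] = sum;
                        }
                    }
                };
                futures.add(pool.submit(block));
            }
            for (Future<?> future : futures) future.get();      // wait for every block
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(java.util.Arrays.deepToString(multiply(a, b, workers)));
    }
}
```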
You're creating a lot of threads. Not only is it expensive to create threads, but for a CPU bound application, you don't want more threads than you have available processors (if you do, you have to spend processing power switching between threads, which also is likely to cause cache misses which are very expensive).
It's also unnecessary to send a thread to execute; all it needs is a Runnable. You'll get a big performance boost by applying these changes:
Make the ExecutorService a static member, size it for the current processor, and send it a ThreadFactory so it doesn't keep the program running after main has finished. (It would probably be architecturally cleaner to send it as a parameter to the method rather than keeping it as a static field; I leave that as an exercise for the reader. ☺)
private static final ExecutorService workerPool =
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(), new ThreadFactory() {
public Thread newThread(Runnable r) {
Thread t = new Thread(r);
t.setDaemon(true);
return t;
}
});
Make MatrixThread implement Runnable rather than inherit Thread. Threads are expensive to create; POJOs are very cheap. You can also make it static which makes the instances smaller (as non-static classes get an implicit reference to the enclosing object).
private static class MatrixThread implements Runnable
Because the worker pool is now a shared static field that is never shut down, you can no longer use awaitTermination to make sure all tasks are finished. Instead, use the submit method, which returns a Future<?>. Collect all the future objects in a list, and when you've submitted all the tasks, iterate over the list and call get on each object.
Your multiply method should now look something like this:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
for(int currCol = 0; currCol < multiplier.dimension; currCol++) {
Runnable worker = new MatrixThread(this, multiplier, currRow, currCol, result);
futures.add(workerPool.submit(worker));
}
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
Will it be faster than the single-threaded version? Well, on my arguably crappy box the multithreaded version is slower for values of n < 1024.
This is just scratching the surface, though. The real problem is that you create a lot of MatrixThread instances - your memory consumption is O(n²), which is a very bad sign. Moving the inner for loop into MatrixThread.run would improve performance by a factor of craploads (ideally, you don't create more tasks than you have worker threads).
Edit: As I have more pressing things to do, I couldn't resist optimizing this further. I came up with this (... horrendously ugly piece of code) that "only" creates O(n) jobs:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
Runnable worker = new MatrixThread2(this, multiplier, currRow, result);
futures.add(workerPool.submit(worker));
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
private static class MatrixThread2 implements Runnable
{
private Matrix self, mul, result;
private int row, col;
private MatrixThread2(Matrix a, Matrix b, int row, Matrix result)
{
this.self = a;
this.mul = b;
this.row = row;
this.result = result;
}
@Override
public void run()
{
for(int col = 0; col < mul.dimension; col++) {
int cellResult = 0;
for (int i = 0; i < self.getMatrixDimension(); i++)
cellResult += self.template[row][i] * mul.template[i][col];
result.template[row][col] = cellResult;
}
}
}
It's still not great, but basically the multi-threaded version can compute anything you'll be patient enough to wait for, and it'll do it faster than the single-threaded version.
First of all, you should use a newFixedThreadPool sized to the number of cores you have; on a quad-core, use 4. Second, don't create a new one for each matrix.
If I make the ExecutorService a static member variable, I get almost consistently faster execution of the threaded version at a matrix size of 512.
Also, changing MatrixThread to implement Runnable instead of extending Thread speeds up execution to the point where the threaded version is 2x as fast on my machine at size 512.