I am trying to parallelize a matrix multiplication.
I have achieved parallelization by calculating each cell of matrix C in a separate task. (I hope I have done this correctly.)
My question is whether using a thread pool is the best way to create threads. (Sorry, I am unfamiliar with this, and someone suggested doing it this way.)
Also, will I see a big difference in the time the calculation takes compared to a sequential version of the program?
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ParallelMatrix {
public final static int N = 2000; //Random size of matrix
public static void main(String[] args) throws InterruptedException {
long startTime = System.currentTimeMillis();
//Create and multiply matrix of random size N.
double [][] a = new double [N][N];
double [][] b = new double [N][N];
double [][] c = new double [N][N];
int i,j,k;
for(i = 0; i < N ; i++) {
for(j = 0; j < N ; j++){
a[i][j] = i + j;
b[i][j] = i * j;
}
}
ExecutorService pool = Executors.newFixedThreadPool(1);
for(i = 0; i < N; i++) {
for(j = 0; j < N; j++) {
pool.submit(new Multi(N,i,j,a,b,c));
}
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.DAYS);
long endTime = System.currentTimeMillis();
System.out.println("Calculation completed in " +
(endTime - startTime) + " milliseconds");
}
static class Multi implements Runnable {
final int N;
final double [][] a;
final double [][] b;
final double [][] c;
final int i;
final int j;
public Multi(int N, int i, int j, double[][] a, double[][] b, double[][] c){
this.N=N;
this.i=i;
this.j=j;
this.a=a;
this.b=b;
this.c=c;
}
@Override
public void run() {
for(int k = 0; k < N; k++)
c[i][j] += a[i][k] * b[k][j];
}
}
}
You have to balance scheduling overhead, operation duration and the number of available cores. For a start, size your thread pool according to the number of available cores: newFixedThreadPool(Runtime.getRuntime().availableProcessors()).
To minimize scheduling overhead, you want to slice the operation into just as many independent tasks (ideally of equal execution time) as you have processors.
Generally, the smaller the operation you do in a slice, the more scheduling overhead you have. What you have now (N² tasks) has excessive overhead: you create and submit 2000 times 2000 Multi runnables, each of which does very little work.
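As a rough sketch of that advice (not the asker's code): the version below gives each worker one contiguous band of rows, so the number of tasks is about the number of cores instead of N². The class name RowSlicedMultiply and the method multiply are made up for this illustration, and the sketch assumes non-empty rectangular matrices.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; assumes non-empty, rectangular inputs.
class RowSlicedMultiply {

    // Multiplies a (n x m) by b (m x p); each task computes one band of rows of c.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        int workers = Math.max(1, Math.min(n, Runtime.getRuntime().availableProcessors()));
        int band = (n + workers - 1) / workers; // rows per task, rounded up
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final int from = w * band;
            final int to = Math.min(n, from + band);
            pool.submit(() -> {
                // Each task writes only to its own rows of c, so no locking is needed.
                for (int i = from; i < to; i++) {
                    for (int k = 0; k < m; k++) {
                        double aik = a[i][k];
                        for (int j = 0; j < p; j++) {
                            c[i][j] += aik * b[k][j];
                        }
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while multiplying", e);
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b))); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

Whether this beats the sequential version still depends on N and on the machine, so it is worth timing both.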
I am analyzing the brute-force Three Sum algorithm. Let's say the running time of this algorithm is T(N) = aN^3. I run this ThreeSum.java program with 8Kints.txt and use that running time to calculate the constant a. After calculating a, I estimate what the running time for 16Kints.txt will be. Here is my ThreeSum.java file:
public class ThreeSum {
public static int count(int[] a) {
// Count triples that sum to 0.
int N = a.length;
int cnt = 0;
for (int i = 0; i < N; i++)
for (int j = i + 1; j < N; j++)
for (int k = j + 1; k < N; k++)
if (a[i] + a[j] + a[k] == 0)
cnt++;
return cnt;
}
public static void main(String[] args) {
In in = new In(args[0]);
int[] a = in.readAllInts();
Stopwatch timer = new Stopwatch();
int count = count(a);
StdOut.println("elapsed time = " + timer.elapsedTime());
StdOut.println(count);
}
}
When I run like this:
$ java ThreeSum 8Kints.txt
I get this:
elapsed time = 165.335
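For reference, the arithmetic I have in mind can be sketched like this, assuming the cubic model T(N) = aN^3 holds exactly (CubicEstimate, fitConstant and predict are names made up for this sketch). Doubling N multiplies the time by 2^3 = 8, so the prediction for 16000 inputs is about 8 × 165.335 ≈ 1322.68 seconds.

```java
// Illustrative sketch only: fits the constant a from one measurement and extrapolates.
class CubicEstimate {

    // Solves T = a * N^3 for a, given one measured running time.
    static double fitConstant(double seconds, long n) {
        return seconds / ((double) n * n * n);
    }

    // Predicts T(N) = a * N^3 for another input size.
    static double predict(double a, long n) {
        return a * ((double) n * n * n);
    }

    public static void main(String[] args) {
        double a = fitConstant(165.335, 8000); // measured time for 8Kints.txt
        System.out.printf("a = %.4e%n", a);
        System.out.printf("predicted T(16000) = %.2f s%n", predict(a, 16000));
    }
}
```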
Next comes the doubling ratio experiment, where I call the same method from another client and run that client with multiple files as arguments. I wanted to compare the running time for 8Kints.txt with the result above, but I get a different, actually faster, result. Here is my DoublingRatio.java client:
public class DoublingRatio {
public static double timeTrial(int[] a) {
Stopwatch timer = new Stopwatch();
int cnt = ThreeSum.count(a);
return timer.elapsedTime();
}
public static void main(String[] args) {
In in;
int[][] inputs = new int[args.length][];
for (int i = 0; i < args.length; i++) {
in = new In(args[i]);
inputs[i] = in.readAllInts();
}
double prev = timeTrial(inputs[0]);
for (int i = 1; i < args.length; i++) {
double time = timeTrial(inputs[i]);
StdOut.printf("%6d %7.3f ", inputs[i].length, time);
StdOut.printf("%5.1f\n", time / prev);
prev = time;
}
}
}
When I run this like:
$ java DoublingRatio 1Kints.txt 2Kints.txt 4Kints.txt 8Kints.txt 16Kints.txt 32Kints.txt
I get a faster result and I wonder why:
N sec ratio
2000 2.631 7.8
4000 4.467 1.7
8000 34.626 7.8
I suspect this has to do with Java rather than the algorithm. Does Java optimize some things under the hood?
I'm trying to write a parallel program for Gaussian elimination in Java.
I create two random matrices A and B at the start, and I am not using pivoting.
My code for creating the threads is:
GaussianElimination threads[] = new GaussianElimination[T];
long startTime = System.currentTimeMillis();
for (int k = 0; k < N; k++) {
/** This for statement creates the threads, which are ordinary objects.
* start() executes each thread's run() method, and
* join() makes the main thread wait for all the other threads to finish.
*/
for (int i = 0; i < T; i++) {
threads[i] = new GaussianElimination(T, k, i, A, B);
threads[i].start();
}
for (int i = 0; i < T; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
System.err.println("this should not happen");
}
}
}
long endTime = System.currentTimeMillis();
float time = (endTime - startTime) / 1000.0f;
System.out.println("Computation time: " + time);
After this the run method is :
class GaussianElimination extends Thread {
private int myid;
private double[] B;
private double[][] A;
int k;
int threads;
GaussianElimination(int threads, int k, int myid, double[][] A, double[] B) {
this.A = A;//Matrix A
this.B = B;//Matrix B
this.myid = myid;//Id of thread
this.k = k;//k value from exterior loop
this.threads = threads; //size of threads
}
/**Method run() where the threads are running .
*
* The distribution of the data are in cyclic mode.
* e.g. For 3 threads the operation of each thread will be distribute like this:
* thread 1 = 1,4,7,10...
* thread 2= 2,5,8,11...
* thread 3 =3,6,9,12...
*/
public void run() {
int N = B.length;
long startTime = System.currentTimeMillis();//Clock starts
for (int i = myid + k + 1; i < N; i += threads) {
double factor = A[i][k] / A[k][k];
B[i] -= factor * B[k];
for (int j = k; j < N; j++)
A[i][j] -= factor * A[k][j];
long endTime = System.currentTimeMillis();
float time = (endTime - startTime) ;//clock ends
System.out.println("Computation time of thread: " + time);
}
}
}
After that I do the back substitution serially and then print the solution.
The program runs, but it does not run in parallel.
I tried measuring the time of every thread, but that did not lead to a solution.
I couldn't find many Java examples of similar problems, so I am asking here.
Is the architecture and logic bad, or are there coding errors in my program?
Also, here is the serial code that I used: https://www.sanfoundry.com/java-program-gaussian-elimination-algorithm/
Thanks for your cooperation!
I'd like to implement a quicksort algorithm for a 2D array with multithreading.
It works very fast in a single thread, but now I am trying to speed it up. This is my code to sort every row of the 2D array (the sorting algorithm itself should be very fast). It works directly on "c".
public static void sort(int[][] c) {
int[][] a = new int[][] { { 0, -4, 1, 2 }, { 1, 0, 3 }, { 2, 3, 0 } };
for (int i = 0; i < c.length; i++) {
sort(c[i],0,c[i].length-1);
}
}
What I have tried so far:
splitting the for loop into small "loopers" that each perform "x" iterations, but this slows the algorithm down.
Can someone help me speed it up?
A couple of possibilities:
public static void sort(int[][] c) {
for (int i = 0; i < c.length; i++) {
//sort(c[i],0,c[i].length-1);
Arrays.sort(c[i]);
}
}
public static void parallelSort(int[][] c) {
Arrays.asList(c).parallelStream().forEach(d -> Arrays.sort(d));
}
public static void threadedSort(int[][] c) throws InterruptedException {
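int count = 4;
int chunk = c.length / count;
Thread[] threads = new Thread[count];
for (int i = 0; i < count; i++) {
final int first = chunk * i;
// The last thread also takes the leftover rows when c.length is not a multiple of count.
final int length = (i == count - 1) ? c.length - first : chunk;
threads[i] = new Thread(
() -> sortOnThread(c, first, length),
"Thread " + i
);
threads[i].start();
}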
for (Thread thread : threads) {
thread.join();
}
}
private static void sortOnThread(int[][] c, int first, int length) {
for (int i = first; i < first + length; i++) {
Arrays.sort(c[i]);
}
}
public static void main(String[] args) throws InterruptedException {
int[][] c = new int[10_000_000][75];
shuffle(c);
System.out.println("Starting sort()");
long before = System.currentTimeMillis();
sort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
shuffle(c);
System.out.println("Starting parallelSort()");
before = System.currentTimeMillis();
parallelSort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
shuffle(c);
System.out.println("Starting threadedSort()");
before = System.currentTimeMillis();
threadedSort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
}
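private static void shuffle(int[][] c) {
Random random = new Random(); // java.util.Random
for (int i = 0; i < c.length; i++) {
for (int j = 0; j < c[i].length; j++)
c[i][j] = j;
// In-place Fisher-Yates shuffle; Arrays.asList(int[]) is a one-element list, so Collections.shuffle on it does nothing.
for (int j = c[i].length - 1; j > 0; j--) {
int r = random.nextInt(j + 1);
int t = c[i][r];
c[i][r] = c[i][j];
c[i][j] = t;
}
}
}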
Which produced these timings on a quad core (i5-2430M):
Starting sort()
Took 2486ms
Starting parallelSort()
Took 984ms
Starting threadedSort()
Took 875ms
The parallelStream() approach was the least code, but clearly comes with a bit more overhead (sending each sort through the ForkJoinPool) than direct threading. This was more noticeable when the array was the smaller [100_000][75]:
Starting sort()
Took 48ms
Starting parallelSort()
Took 101ms
Starting threadedSort()
Took 21ms
Just in case it's useful ... initially while coding this, I found the timings for the three approaches were much more similar:
Starting sort()
Took 2403ms
Starting parallelSort()
Took 2435ms
Starting threadedSort()
Took 2284ms
This turned out to be because I was naively allocating new sub-arrays each time in my shuffle() method. Clearly this was generating a lot of extra GC work; even a short sleep before calling the sort methods made all the difference.
I have a program that does a lot of matrix multiplication. I thought I'd speed it up by reducing the number of loops in the code to see how much faster it would be (I'll try a matrix math library later). It turns out it's not faster at all. I've been able to replicate the problem with some example code. My guess was that testOne() would be faster than testTwo() because it doesn't create any new arrays and because it has a third as many loops. On my machine, it takes twice as long to run:
Duration for testOne with 5000 epochs: 657, loopCount: 64000000
Duration for testTwo with 5000 epochs: 365, loopCount: 192000000
My guess is that multOne() is slower than multTwo() because in multOne() the CPU is not writing to sequential memory addresses like it is in multTwo(). Does that sound right? Any explanations would be appreciated.
import java.util.Random;
public class ArrayTest {
double[] arrayOne;
double[] arrayTwo;
double[] arrayThree;
double[][] matrix;
double[] input;
int loopCount;
int rows;
int columns;
public ArrayTest(int rows, int columns) {
this.rows = rows;
this.columns = columns;
this.loopCount = 0;
arrayOne = new double[rows];
arrayTwo = new double[rows];
arrayThree = new double[rows];
matrix = new double[rows][columns];
Random random = new Random();
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
matrix[i][j] = random.nextDouble();
}
}
}
public void testOne(double[] input, int epochs) {
this.input = input;
this.loopCount = 0;
long start = System.currentTimeMillis();
long duration;
for (int i = 0; i < epochs; i++) {
multOne();
}
duration = System.currentTimeMillis() - start;
System.out.println("Duration for testOne with " + epochs + " epochs: " + duration + ", loopCount: " + loopCount);
}
public void multOne() {
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
arrayOne[i] += matrix[i][j] * arrayOne[i] * input[j];
arrayTwo[i] += matrix[i][j] * arrayTwo[i] * input[j];
arrayThree[i] += matrix[i][j] * arrayThree[i] * input[j];
loopCount++;
}
}
}
public void testTwo(double[] input, int epochs) {
this.loopCount = 0;
long start = System.currentTimeMillis();
long duration;
for (int i = 0; i < epochs; i++) {
arrayOne = multTwo(matrix, arrayOne, input);
arrayTwo = multTwo(matrix, arrayTwo, input);
arrayThree = multTwo(matrix, arrayThree, input);
}
duration = System.currentTimeMillis() - start;
System.out.println("Duration for testTwo with " + epochs + " epochs: " + duration + ", loopCount: " + loopCount);
}
public double[] multTwo(double[][] matrix, double[] array, double[] input) {
double[] newArray = new double[rows];
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
newArray[i] += matrix[i][j] * array[i] * input[j];
loopCount++;
}
}
return newArray;
}
public static void main(String[] args) {
int rows = 100;
int columns = 128;
ArrayTest arrayTest = new ArrayTest(rows, columns);
Random random = new Random();
double[] input = new double[columns];
for (int i = 0; i < columns; i++) {
input[i] = random.nextDouble();
}
arrayTest.testOne(input, 5000);
arrayTest.testTwo(input, 5000);
}
}
There is a simple reason why your tests take different amounts of time: they don't do the same thing. Since the two loops you compare are not functionally identical, the number of iterations is not a good metric to look at.
testOne takes longer than testTwo because:
In multOne you update arrayOne[i] in place on every iteration of the inner j loop, so each iteration uses the value of arrayOne[i] computed by the previous one. This creates a loop-carried dependency, which is harder for the compiler to optimise, because the result of matrix[i][j] * arrayOne[i] * input[j] is needed on the very next iteration. Floating-point operations usually have a latency of a few clock cycles, so this results in stalls and therefore reduced performance.
In testTwo you replace arrayOne only once per epoch. The products in multTwo read only the unchanged input array, so there is no such carried dependency and the loop can be vectorised efficiently, which results in better cache and arithmetic performance.
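To make the "not functionally identical" point concrete, here is a small sketch (DependencyDemo, inPlace and outOfPlace are names invented for this illustration, not part of the question's code). It runs both update styles on the same tiny input and gets different results.

```java
// Illustrative sketch: the same loop body, updated in place versus into a fresh array.
class DependencyDemo {

    // Like multOne: each inner iteration reads the value the previous one just wrote.
    static double[] inPlace(double[][] matrix, double[] array, double[] input) {
        double[] a = array.clone();
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < input.length; j++) {
                a[i] += matrix[i][j] * a[i] * input[j];
            }
        }
        return a;
    }

    // Like multTwo: the products only read the unchanged source array.
    static double[] outOfPlace(double[][] matrix, double[] array, double[] input) {
        double[] result = new double[array.length];
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < input.length; j++) {
                result[i] += matrix[i][j] * array[i] * input[j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] matrix = {{1, 1}};
        double[] array = {1};
        double[] input = {1, 1};
        System.out.println(inPlace(matrix, array, input)[0]);    // prints 4.0
        System.out.println(outOfPlace(matrix, array, input)[0]); // prints 2.0
    }
}
```

In the in-place version the second iteration multiplies by the already updated value (2 instead of 1), which is exactly the dependency between iterations described above.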
To my surprise I get a longer time (10 milliseconds) when "optimizing" multiplications by pregenerating the results in an array compared to the original 8 milliseconds. Is that just a Java quirk or is that general of the PC architecture? I have a Core i5 760 with Java 7, Windows 8 64 Bit.
public class Test {
public static void main(String[] args) {
long start = System.currentTimeMillis();
long sum=0;
int[] sqr = new int[1000];
for(int a=1;a<1000;a++) {sqr[a]=a*a;}
for(int b=1;b<1000;b++)
// for(int a=1;a<1000;a++) {sum+=a*a+b*b;}
for(int a=1;a<1000;a++) {sum+=sqr[a]+sqr[b];}
System.out.println(System.currentTimeMillis()-start+"ms");
System.out.println(sum);
}
}
Konrad Rudolph commented on the issues with the benchmarking, so I am ignoring the benchmark and focusing on the question:
Is multiplication faster than array access?
Yes, it is very likely. It used to be the other way around 20 or 30 years ago.
Roughly speaking, you can do an integer multiplication in 3 cycles (pessimistic, if you don't get vector instructions), and a memory access costs you 4 cycles if you get it straight from the L1 cache but it is straight downhill from there. For reference, see
Latencies and throughput in Appendix C of the Intel 64 and IA-32 Architectures Optimization Reference Manual
Approximate cost to access various caches and main memory?
Herb Sutter's presentation on this very subject: Machine Architecture: Things Your Programming Language Never Told You
One thing specific to Java was pointed out by Ingo in a comment below: You also get bounds checking in Java, which makes the already slower array access even slower...
A more reasonable benchmark would be:
import java.math.BigDecimal;
import java.util.Random;
public abstract class Benchmark {
final String name;
public Benchmark(String name) {
this.name = name;
}
abstract int run(int iterations) throws Throwable;
private BigDecimal time() {
try {
int nextI = 1;
int i;
long duration;
do {
i = nextI;
long start = System.nanoTime();
run(i);
duration = System.nanoTime() - start;
nextI = (i << 1) | 1;
} while (duration < 1000000000 && nextI > 0);
return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public String toString() {
return name + "\t" + time() + " ns";
}
private static void shuffle(int[] a) {
Random chaos = new Random();
for (int i = a.length; i > 0; i--) {
int r = chaos.nextInt(i);
int t = a[r];
a[r] = a[i - 1];
a[i - 1] = t;
}
}
public static void main(String[] args) throws Exception {
final int[] table = new int[1000];
final int[] permutation = new int[1000];
for (int i = 0; i < table.length; i++) {
table[i] = i * i;
permutation[i] = i;
}
shuffle(permutation);
Benchmark[] marks = {
new Benchmark("sequential multiply") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += i * i;
}
}
return sum;
}
},
new Benchmark("sequential lookup") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += table[i];
}
}
return sum;
}
},
new Benchmark("random order multiply") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += permutation[i] * permutation[i];
}
}
return sum;
}
},
new Benchmark("random order lookup") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += table[permutation[i]];
}
}
return sum;
}
}
};
for (Benchmark mark : marks) {
System.out.println(mark);
}
}
}
which prints on my Intel Core Duo (yes, it's old):
sequential multiply 2218.666 ns
sequential lookup 1081.220 ns
random order multiply 2416.923 ns
random order lookup 2351.293 ns
So, if I access the lookup array sequentially, which minimizes the number of cache misses, and permits the hotspot JVM to optimize bounds checking on array access, there is a slight improvement on an array of 1000 elements. If we do random access into the array, that advantage disappears. Also, if the table is larger, the lookup gets slower. For instance, for 10000 elements, I get:
sequential multiply 23192.236 ns
sequential lookup 12701.695 ns
random order multiply 24459.697 ns
random order lookup 31595.523 ns
So, array lookup is not faster than multiplication, unless the access pattern is (nearly) sequential and the lookup array small.
In any case, my measurements indicate that a multiplication (and addition) takes merely 4 processor cycles (2.3 ns per loop iteration on a 2GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.
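The conversion from a measured time to cycles can be sketched as follows. CycleEstimate and cyclesPerIteration are hypothetical names, the 2 GHz clock rate is the figure assumed in the text, and the timing is the 10000-element "sequential multiply" result above, so the output is only a rough estimate of roughly 4 to 5 cycles per iteration.

```java
// Illustrative sketch: turns a benchmark time into cycles per loop iteration.
class CycleEstimate {

    // ns per iteration times clock rate in GHz (cycles per ns) gives cycles per iteration.
    static double cyclesPerIteration(double nanosPerIteration, double clockGHz) {
        return nanosPerIteration * clockGHz;
    }

    public static void main(String[] args) {
        double nanosPerPass = 23192.236;  // "sequential multiply" over a 10000-element table
        int iterationsPerPass = 10000;    // inner-loop iterations per pass
        double clockGHz = 2.0;            // assumed clock rate, as stated in the text
        double nanosPerIteration = nanosPerPass / iterationsPerPass;
        System.out.printf("%.2f ns per iteration%n", nanosPerIteration);
        System.out.printf("%.1f cycles per iteration%n",
                cyclesPerIteration(nanosPerIteration, clockGHz));
        System.out.printf("%.0f million iterations per second%n", 1e3 / nanosPerIteration);
    }
}
```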