I'm trying to write a parallel program for Gaussian elimination in Java.
I create two random matrices A and B at the start, and I am not using pivoting.
My code where I create the threads is:
GaussianElimination threads[] = new GaussianElimination[T];
long startTime = System.currentTimeMillis();
for (int k = 0; k < N; k++) {
/** This loop creates the threads (objects of the GaussianElimination class).
* With the start() method we execute each thread's run() method, and with
* join() the main thread waits for all the other threads to finish.
*/
for (int i = 0; i < T; i++) {
threads[i] = new GaussianElimination(T, k, i, A, B);
threads[i].start();
}
for (int i = 0; i < T; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
System.err.println("this should not happen");
}
}
}
long endTime = System.currentTimeMillis();
float time = (endTime - startTime) / 1000.0f;
System.out.println("Computation time: " + time);
After this, the run method is:
class GaussianElimination extends Thread {
private int myid;
private double[] B;
private double[][] A;
int k;
int threads;
GaussianElimination(int threads, int k, int myid, double[][] A, double[] B) {
this.A = A;//matrix A
this.B = B;//right-hand-side vector B
this.myid = myid;//id of this thread
this.k = k;//k value from the outer loop
this.threads = threads; //number of threads
}
/**Method run(), where each thread does its work.
*
* The rows are distributed among the threads cyclically.
* e.g. for 3 threads the work is distributed like this:
* thread 1 = 1, 4, 7, 10...
* thread 2 = 2, 5, 8, 11...
* thread 3 = 3, 6, 9, 12...
*/
public void run() {
int N = B.length;
long startTime = System.currentTimeMillis();//Clock starts
for (int i = myid + k + 1; i < N; i += threads) {
double factor = A[i][k] / A[k][k];
B[i] -= factor * B[k];
for (int j = k; j < N; j++)
A[i][j] -= factor * A[k][j];
}
long endTime = System.currentTimeMillis();//Clock ends
float time = (endTime - startTime);
System.out.println("Computation time of thread: " + time);
}
}
After that I do the back substitution serially and then I print the solution.
So the program runs, but it doesn't run in parallel.
I have tried checking the time taken by each thread, but that didn't lead me to a solution.
I couldn't find many Java examples for similar problems, so I am asking here.
Is it bad architecture and logic, or are there coding errors in my program?
Also, here is the serial code that I used: https://www.sanfoundry.com/java-program-gaussian-elimination-algorithm/
Thanks for your cooperation!
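One sketch of an alternative structure, assuming the same A, B and cyclic row distribution as above (the class and method names here are illustrative, not from the post): reuse a single fixed thread pool instead of creating and joining T brand-new threads for every value of k, so that thread-creation overhead does not dominate the actual row updates.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PooledGaussianElimination {
    static void eliminate(double[][] A, double[] B, int T) throws InterruptedException {
        int N = B.length;
        ExecutorService pool = Executors.newFixedThreadPool(T);
        for (int k = 0; k < N; k++) {
            final int kk = k;
            CountDownLatch done = new CountDownLatch(T);
            for (int id = 0; id < T; id++) {
                final int myid = id;
                pool.submit(() -> {
                    // cyclic distribution of the rows below the pivot row kk
                    for (int i = myid + kk + 1; i < N; i += T) {
                        double factor = A[i][kk] / A[kk][kk];
                        B[i] -= factor * B[kk];
                        for (int j = kk; j < N; j++)
                            A[i][j] -= factor * A[kk][j];
                    }
                    done.countDown();
                });
            }
            done.await(); // every row update for this k must finish before the next pivot
        }
        pool.shutdown();
    }
}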
I have to write a program that finds the sum of a 2D array of int.
I coded everything as best I know and there are no syntax errors, but when I check my code the threads are not working at all, or sometimes only some of the threads work, not all of them.
I filled the array with the value 1 so I can check the summation.
I added a lock to make sure no two threads are in the summation method at the same time, just to be safe.
And n is there to see how many times the add method is entered.
public class extend extends Thread {
int a, b;
private static int sum = 0;
static int n;
boolean lock;
int[][] arr;
public extend() {
arr = new int[45][45];
for (int i = 0; i < 45; i++) {
for (int j = 0; j < 45; j++)
arr[i][j] = 1;
}
n = 0;
lock = false;
}
public extend(int a, int b) {
arr = new int[45][45];
for (int i = 0; i < 45; i++) {
for (int j = 0; j < 45; j++)
arr[i][j] = 1;
}
n = 0;
lock = false;
this.a = a;
this.b = b;
}
public void run() {
add(a, b);
}
public void add(int st, int e) {
n++;
while (lock) ;
lock = true;
int sums = 0;
synchronized (this) {
for (int i = st; i < e; i++) {
for (int j = 0; j < 45; j++) {
sums += arr[i][j];
}
}
}
sum = sums;
lock = false;
}
public int getSum() {
return sum;
}
public static void main(String[] args) {
long ss = System.currentTimeMillis();
Thread t1 = new Thread(new extend(0, 9));
Thread t2 = new Thread(new extend(9, 18));
Thread t3 = new Thread(new extend(18, 27));
Thread t4 = new Thread(new extend(27, 36));
Thread t5 = new Thread(new extend(36, 45));
t1.start();
t2.start();
t3.start();
t4.start();
t5.start();
long se = System.currentTimeMillis();
System.out.println("The sum for 45*45 array is: " + sum);
System.out.println("time start;" + (se - ss));
System.out.print(n);
}
}
I'm sorry to say, but there's so much wrong with this code, it's hard to point at one problem:
You start your threads, but you don't wait for them to finish using .join()
Extending Thread when you actually meant implementing Runnable
Using busy waiting in your thread with the while (lock) ; loop
Using a static int for counting
But, if there's only one thing you must fix, wait for your threads:
t1.join();
...
t5.join();
Your locking of the sum variable may not even result in a speedup once you take the overhead of creating threads into account, but your main problem is that you are not adding sums to sum.
Change:
sum = sums;
to:
sum += sums;
This will make your code work some of the time. It is not guaranteed to work and will sometimes output weird results like 1620 instead of 2025. You should learn more about how to properly handle multithreading, race conditions, locks, and atomic operations.
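To make those fixes concrete, here is a minimal sketch of the same summation with the points above applied: the threads are joined before the result is read, each thread keeps a private partial sum, and the shared total is an AtomicInteger instead of an unsynchronized static int. Class and variable names are illustrative.

import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

public class SumRows {
    public static void main(String[] args) throws InterruptedException {
        int[][] arr = new int[45][45];
        for (int[] row : arr) Arrays.fill(row, 1); // every cell is 1, so the expected sum is 2025

        AtomicInteger sum = new AtomicInteger();
        Thread[] workers = new Thread[5];
        for (int t = 0; t < 5; t++) {
            final int start = t * 9, end = start + 9;  // rows [start, end), 9 rows per thread
            workers[t] = new Thread(() -> {
                int partial = 0;                       // private to this thread, no locking needed
                for (int i = start; i < end; i++)
                    for (int v : arr[i]) partial += v;
                sum.addAndGet(partial);                // one atomic update per thread
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();             // wait for all threads before reading the result
        System.out.println("The sum for 45*45 array is: " + sum.get());
    }
}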
I'd like to implement a quicksort algorithm for a 2D array with multithreading.
It works very fast single-threaded, but now I'm trying to speed it up. This is my code to sort every part of the 2D array correctly (the sorting algorithm itself should be very fast). It works directly on c.
public static void sort(int[][] c) {
int[][] a = new int[][] { { 0, -4, 1, 2 }, { 1, 0, 3 }, { 2, 3, 0 } };
for (int i = 0; i < c.length; i++) {
sort(c[i],0,c[i].length-1);
}
}
What I have tried so far:
splitting the for loop into small "loopers" which each perform a task of x iterations, but this slowed the algorithm down.
Can someone help me speed it up?
A couple of possibilities:
public static void sort(int[][] c) {
for (int i = 0; i < c.length; i++) {
//sort(c[i],0,c[i].length-1);
Arrays.sort(c[i]);
}
}
public static void parallelSort(int[][] c) {
Arrays.asList(c).parallelStream().forEach(d -> Arrays.sort(d));
}
public static void threadedSort(int[][] c) throws InterruptedException {
int count = 4;
Thread[] threads = new Thread[count];
for (int i = 0; i < count; i++) {
final int finalI = i;
threads[i] = new Thread(
() -> sortOnThread(c, (c.length / count) * finalI, c.length / count),
"Thread " + i
);
threads[i].start();
}
for (Thread thread : threads) {
thread.join();
}
}
private static void sortOnThread(int[][] c, int first, int length) {
for (int i = first; i < first + length; i++) {
Arrays.sort(c[i]);
}
}
public static void main(String[] args) throws InterruptedException {
int[][] c = new int[10_000_000][75];
shuffle(c);
System.out.println("Starting sort()");
long before = System.currentTimeMillis();
sort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
shuffle(c);
System.out.println("Starting parallelSort()");
before = System.currentTimeMillis();
parallelSort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
shuffle(c);
System.out.println("Starting threadedSort()");
before = System.currentTimeMillis();
threadedSort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
}
private static void shuffle(int[][] c) {
for (int i = 0; i < c.length; i++) {
for (int j = 0; j < c[i].length; j++)
c[i][j] = j;
// shuffle the primitive row in place; Collections.shuffle(Arrays.asList(c[i]))
// would wrap the whole int[] as a single list element and shuffle nothing
for (int j = c[i].length - 1; j > 0; j--) {
int k = (int) (Math.random() * (j + 1));
int tmp = c[i][j]; c[i][j] = c[i][k]; c[i][k] = tmp;
}
}
}
Which produced these timings on an i5-2430M (2 cores / 4 threads):
Starting sort()
Took 2486ms
Starting parallelSort()
Took 984ms
Starting threadedSort()
Took 875ms
The parallelStream() approach was the least code, but clearly comes with a bit more overhead (sending each sort through the ForkJoinPool) than direct threading. This was more noticeable when the array was the smaller [100_000][75]:
Starting sort()
Took 48ms
Starting parallelSort()
Took 101ms
Starting threadedSort()
Took 21ms
Just in case it's useful ... initially while coding this, I found the timings for the three approaches were much more similar:
Starting sort()
Took 2403ms
Starting parallelSort()
Took 2435ms
Starting threadedSort()
Took 2284ms
This turned out to be because I was naively allocating new sub-arrays each time in my shuffle() method. Clearly this was generating a lot of extra GC work - even a short sleep before calling the sort methods made all the difference.
I am trying to parallelize a matrix multiplication.
I have achieved parallelization by calculating each cell of matrix C in a separate thread. (I hope I have done this correctly.)
My question is whether using a thread pool is the best way of creating the threads. (Sorry, I am unfamiliar with this; someone suggested doing it this way.)
Also, will I see a big difference in computation time compared to a sequential version of the program?
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ParallelMatrix {
public final static int N = 2000; //Random size of matrix
public static void main(String[] args) throws InterruptedException {
long startTime = System.currentTimeMillis();
//Create and multiply matrix of random size N.
double [][] a = new double [N][N];
double [][] b = new double [N][N];
double [][] c = new double [N][N];
int i,j,k;
for(i = 0; i < N ; i++) {
for(j = 0; j < N ; j++){
a[i][j] = i + j;
b[i][j] = i * j;
}
}
ExecutorService pool = Executors.newFixedThreadPool(1);
for(i = 0; i < N; i++) {
for(j = 0; j < N; j++) {
pool.submit(new Multi(N,i,j,a,b,c));
}
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.DAYS);
long endTime = System.currentTimeMillis();
System.out.println("Calculation completed in " +
(endTime - startTime) + " milliseconds");
}
static class Multi implements Runnable {
final int N;
final double [][] a;
final double [][] b;
final double [][] c;
final int i;
final int j;
public Multi(int N, int i, int j, double[][] a, double[][] b, double[][] c){
this.N=N;
this.i=i;
this.j=j;
this.a=a;
this.b=b;
this.c=c;
}
@Override
public void run() {
for(int k = 0; k < N; k++)
c[i][j] += a[i][k] * b[k][j];
}
}
}
You have to balance scheduling overhead, operation duration, and the number of available cores. For a start, size your thread pool according to the number of cores available, e.g. newFixedThreadPool(Runtime.getRuntime().availableProcessors()).
To minimize scheduling overhead you want to slice the operation into just as many independent tasks (of ideally equal execution time) as you have processors.
Generally, the smaller the operation you do in a slice, the more scheduling overhead you have. What you have now (N*N tasks) has excessive overhead: you will create and submit 2000 times 2000 Multi runnables, each of which does very little work.
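As a sketch of that slicing (the class and method names are mine, and this is not a drop-in replacement for the code above): one task per row of c gives N reasonably sized tasks instead of N*N tiny ones, and the pool is sized to the machine.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class RowSlicedMultiply {
    // multiplies a (N x N) by b (N x N) into c, using one task per row of c
    static void multiply(final double[][] a, final double[][] b, final double[][] c, final int N)
            throws InterruptedException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < N; i++) {
            final int row = i;
            pool.submit(() -> {                        // N tasks instead of N*N
                for (int j = 0; j < N; j++) {
                    double sum = 0;
                    for (int k = 0; k < N; k++)
                        sum += a[row][k] * b[k][j];
                    c[row][j] = sum;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}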
I wrote the following test class in Java to reproduce the performance penalty introduced by "false sharing".
Basically, you can tweak the size of the array from 4 to a much larger value (e.g. 10000) to turn the false-sharing phenomenon on or off. To be specific, when size = 4, different threads are more likely to update values within the same cache line, causing many more cache misses. In theory, the test program should run much faster when size = 10000 than when size = 4.
I ran the same test on two different machines multiple times:
Machine A: Lenovo X230 laptop with an Intel® Core™ i5-3210M processor (2 cores, 4 threads), Windows 7 64-bit
size = 4 => 5.5 seconds
size = 10000 => 5.4 seconds
Machine B: Dell OptiPlex 780 with an Intel® Core™2 Duo E8400 processor (2 cores), Windows XP 32-bit
size = 4 => 14.5 seconds
size = 10000 => 7.2 seconds
I later ran the tests on a few other machines, and quite obviously false sharing only becomes noticeable on certain machines; I couldn't figure out the decisive factor that makes the difference.
Can anyone kindly take a look at this problem and explain why the false sharing introduced by this test class only becomes noticeable on certain machines?
public class FalseSharing {
interface Oper {
int eval(int value);
}
//try tweak the size
static int size = 4;
//try tweak the op
static Oper op = new Oper() {
@Override
public int eval(int value) {
return value + 2;
}
};
static int[] array = new int[10000 + size];
static final int interval = (size / 4);
public static void main(String args[]) throws InterruptedException {
long start = System.currentTimeMillis();
Thread t1 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + 5000);
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000] = op.eval(array[5000]);
}
}
}
});
Thread t2 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + (5000 + interval));
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000 + interval] = op.eval(array[5000 + interval]);
}
}
}
});
Thread t3 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + (5000 + interval * 2));
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000 + interval * 2] = op.eval(array[5000 + interval * 2]);
}
}
}
});
Thread t4 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + (5000 + interval * 3));
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000 + interval * 3] = op.eval(array[5000 + interval * 3]);
}
}
}
});
t1.start();
t2.start();
t3.start();
t4.start();
t1.join();
t2.join();
t3.join();
t4.join();
System.out.println("Finished!" + (System.currentTimeMillis() - start));
}
}
False sharing only occurs within a 64-byte block (a cache line). You need to be accessing the same 64-byte block in all four threads. I suggest you create an object or an array such as a long[8] and update different cells of that array in all four threads, then compare with the four threads accessing independent arrays.
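A minimal sketch of that comparison, assuming a 64-byte cache line (class and field names are illustrative): four threads increment neighbouring cells of a single long[8], which very likely share one line, versus four threads each incrementing the first element of its own array.

public class SharedVsPrivate {
    static final long[] shared = new long[8];          // 8 longs = 64 bytes, likely one cache line
    static final long[][] separate = new long[4][64];  // each thread gets its own array object

    static long run(final boolean sameLine) throws InterruptedException {
        Thread[] ts = new Thread[4];
        long t0 = System.currentTimeMillis();
        for (int t = 0; t < 4; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                for (int i = 0; i < 100_000_000; i++) {
                    if (sameLine) shared[id]++;        // neighbours inside one 64-byte block
                    else          separate[id][0]++;   // element 0 of a private array
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return System.currentTimeMillis() - t0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same cache line:  " + run(true) + " ms");
        System.out.println("separate arrays:  " + run(false) + " ms");
    }
}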
Your code is probably fine. Here is a simpler version, with results:
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
public class TestFalseSharing {
static long T0 = System.currentTimeMillis();
static void p(Object msg) {
System.out.format("%09.3f %-10s %s%n", new Double(0.001*(System.currentTimeMillis()-T0)), Thread.currentThread().getName(), " : "+msg);
}
public static void main(String args[]) throws InterruptedException {
int NT = Runtime.getRuntime().availableProcessors();
p("Available processors: "+NT);
int MAXSPAN = 0x1000; //4kB
final byte[] array = new byte[NT*MAXSPAN];
for(int i=1; i<=MAXSPAN; i<<=1) {
testFalseSharing(NT, i, array);
}
}
static void testFalseSharing(final int NT, final int span, final byte[] array) throws InterruptedException {
final int L1 = 10;
final int L2 = 10_000_000;
final CountDownLatch cl = new CountDownLatch(NT*L1);
long t0 = System.nanoTime();
for(int i=0 ; i<4; i++) {
final int startOffset = i*span;
Thread t = new Thread(new Runnable() {
@Override
public void run() {
//p("Offset:" + startOffset);
for (int j = 0; j < L1; j++) {
for (int k = 0; k < L2; k++) {
array[startOffset] += 1;
}
cl.countDown();
}
}
});
t.start();
}
while(!cl.await(10, TimeUnit.SECONDS)) {
p(""+cl.getCount()+" left");
}
long d = System.nanoTime() - t0;
p("Duration: " + 1e-9*d + " seconds, Span="+span+" bytes");
}
}
Results:
00000.000 main : Available processors: 4
00002.843 main : Duration: 2.837645384 seconds, Span=1 bytes
00005.689 main : Duration: 2.8454065760000002 seconds, Span=2 bytes
00008.659 main : Duration: 2.9697156340000004 seconds, Span=4 bytes
00011.640 main : Duration: 2.979306959 seconds, Span=8 bytes
00013.780 main : Duration: 2.140246744 seconds, Span=16 bytes
00015.387 main : Duration: 1.6061148440000002 seconds, Span=32 bytes
00016.729 main : Duration: 1.34128957 seconds, Span=64 bytes
00017.944 main : Duration: 1.215005455 seconds, Span=128 bytes
00019.208 main : Duration: 1.263007368 seconds, Span=256 bytes
00020.477 main : Duration: 1.269272208 seconds, Span=512 bytes
00021.719 main : Duration: 1.241061631 seconds, Span=1024 bytes
00022.975 main : Duration: 1.256024242 seconds, Span=2048 bytes
00024.171 main : Duration: 1.195086858 seconds, Span=4096 bytes
So, to answer: this confirms the 64-byte cache line theory, at least on my Core i5 laptop.
I've been running some tests to see how inlining function code (explicitly writing a function's algorithm in the calling code itself) affects performance. I wrote a simple byte-array-to-integer conversion, then wrapped it in a function, called it statically from another class, and called it statically from the class itself. The code is as follows:
public class FunctionCallSpeed {
public static final int numIter = 50000000;
public static void main (String [] args) {
byte [] n = new byte[4];
long start;
System.out.println("Function from Static Class =================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
StaticClass.toInt(n);
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
System.out.println("Function from Class ========================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
toInt(n);
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
int actual = 0;
int len = n.length;
System.out.println("Inline Function ============================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
for (int j = 0; j < len; j++) {
actual += n[len - 1 - j] << 8 * j;
}
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
}
public static int toInt(byte [] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
The results are as follows:
Function from Static Class =================
Elapsed time: 0.096559931s
Function from Class ========================
Elapsed time: 0.015741711s
Inline Function ============================
Elapsed time: 0.837626286s
Is there something weird going on with the bytecode? I've looked at the bytecode myself, but I'm not very familiar with it and I can't make heads or tails of it.
EDIT
I added assert statements to read the outputs and then randomized the bytes read, and the benchmark now behaves the way I thought it would. Thanks to Tomasz Nurkiewicz, who pointed me to the microbenchmarking article. The resulting code is:
public class FunctionCallSpeed {
public static final int numIter = 50000000;
public static void main (String [] args) {
byte [] n;
long start, end;
int checker, calc;
end = 0;
System.out.println("Function from Object =================");
for (int i = 0; i < numIter; i++) {
checker = (int)(Math.random() * 65535);
n = toByte(checker);
start = System.nanoTime();
calc = StaticClass.toInt(n);
end += System.nanoTime() - start;
assert calc == checker;
}
System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");
end = 0;
System.out.println("Function from Class ==================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
checker = (int)(Math.random() * 65535);
n = toByte(checker);
start = System.nanoTime();
calc = toInt(n);
end += System.nanoTime() - start;
assert calc == checker;
}
System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");
int len = 4;
end = 0;
System.out.println("Inline Function ======================");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
calc = 0;
checker = (int)(Math.random() * 65535);
n = toByte(checker);
start = System.nanoTime();
for (int j = 0; j < len; j++) {
calc += n[len - 1 - j] << 8 * j;
}
end += System.nanoTime() - start;
assert calc == checker;
}
System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
}
public static byte [] toByte(int val) {
byte [] n = new byte[4];
for (int i = 0; i < 4; i++) {
n[i] = (byte)((val >> 8 * i) & 0xFF);
}
return n;
}
public static int toInt(byte [] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
Results:
Function from Static Class =================
Elapsed time: 9.276437031s
Function from Class ========================
Elapsed time: 9.225660708s
Inline Function ============================
Elapsed time: 5.9512E-5s
It's always hard to guarantee what the JIT is doing, but if I had to guess, it noticed the return value of the function was never being used and optimized a lot of it away.
If you actually use the return value of your function, I bet it changes the speed.
I ported your test case to caliper:
import com.google.caliper.SimpleBenchmark;
public class ToInt extends SimpleBenchmark {
private byte[] n;
private int total;
@Override
protected void setUp() throws Exception {
n = new byte[4];
}
public int timeStaticClass(int reps) {
for (int i = 0; i < reps; i++) {
total += StaticClass.toInt(n);
}
return total;
}
public int timeFromClass(int reps) {
for (int i = 0; i < reps; i++) {
total += toInt(n);
}
return total;
}
public int timeInline(int reps) {
for (int i = 0; i < reps; i++) {
int actual = 0;
int len = n.length;
for (int i1 = 0; i1 < len; i1++) {
actual += n[len - 1 - i1] << 8 * i1;
}
total += actual;
}
return total;
}
public static int toInt(byte[] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
class StaticClass {
public static int toInt(byte[] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
And indeed it seems the inlined version is the slowest, while the two static versions are almost the same (as expected).
The reasons are hard to pin down. I can think of two factors:
The JVM is better at performing micro-optimizations when code blocks are as small and simple to reason about as possible. When the function is inlined, the whole code becomes more complex and the JVM gives up; with the smaller toInt() function the JIT can be more clever.
Cache locality - somehow the JVM performs better with two small chunks of code (the loop and the method) rather than one bigger chunk.
You have several problems, but the main one is that you are timing a single pass over code that the JIT may or may not have optimised yet. That is sure to give you mixed results. I suggest running the test for 2 seconds and ignoring the first 10,000 iterations or so.
If the result of a loop is not kept, the entire loop can be discarded after some random interval.
Breaking each test into a separate method:
public class FunctionCallSpeed {
public static final int numIter = 50000000;
private static int dontOptimiseAway;
public static void main(String[] args) {
byte[] n = new byte[4];
for (int i = 0; i < 10; i++) {
test1(n);
test2(n);
test3(n);
System.out.println();
}
}
private static void test1(byte[] n) {
System.out.print("from Static Class: ");
long start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
dontOptimiseAway = FunctionCallSpeed.toInt(n);
}
System.out.print((System.nanoTime() - start) / numIter + "ns ");
}
private static void test2(byte[] n) {
long start;
System.out.print("from Class: ");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
dontOptimiseAway = toInt(n);
}
System.out.print((System.nanoTime() - start) / numIter + "ns ");
}
private static void test3(byte[] n) {
long start;
int actual = 0;
int len = n.length;
System.out.print("Inlined: ");
start = System.nanoTime();
for (int i = 0; i < numIter; i++) {
for (int j = 0; j < len; j++) {
actual += n[len - 1 - j] << 8 * j;
}
dontOptimiseAway = actual;
}
System.out.print((System.nanoTime() - start) / numIter + "ns ");
}
public static int toInt(byte[] num) {
int actual = 0;
int len = num.length;
for (int i = 0; i < len; i++) {
actual += num[len - 1 - i] << 8 * i;
}
return actual;
}
}
This prints:
from Class: 7ns Inlined: 11ns from Static Class: 9ns
from Class: 6ns Inlined: 8ns from Static Class: 8ns
from Class: 6ns Inlined: 9ns from Static Class: 6ns
This suggests that when the inner loop is optimised separately, it is slightly more efficient.
However, if I use an optimised conversion of bytes to int:
public static int toInt(byte[] num) {
return num[0] + (num[1] << 8) + (num[2] << 16) + (num[3] << 24);
}
all the tests report
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
from Static Class: 0ns from Class: 0ns Inlined: 0ns
as it has realised the test doesn't do anything useful. ;)
Your test is flawed. The second test has the benefit of the first test already having been run. You need to run each test case in its own JVM invocation.
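A minimal sketch of that idea (illustrative only; it assumes the StaticClass.toInt from the question is on the classpath): choose one benchmark with a command-line argument, so each case is measured in a fresh JVM, e.g. java OneTestPerJvm static and then java OneTestPerJvm inline.

public class OneTestPerJvm {
    static int sink; // consume results so the JIT cannot discard the loops

    public static void main(String[] args) {
        String mode = args.length > 0 ? args[0] : "static";
        byte[] n = new byte[4];
        long start = System.nanoTime();
        if (mode.equals("static")) {
            for (int i = 0; i < 50_000_000; i++)
                sink += StaticClass.toInt(n);          // the wrapped conversion
        } else {
            for (int i = 0; i < 50_000_000; i++) {     // the inlined conversion
                int actual = 0;
                for (int j = 0; j < 4; j++)
                    actual += n[3 - j] << 8 * j;
                sink += actual;
            }
        }
        System.out.println(mode + ": " + (System.nanoTime() - start) / 1e9 + "s");
    }
}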