Using parallelism in Java makes program slower (four times slower!!!)

I'm writing an implementation of the conjugate-gradient method.
I use Java multithreading for the matrix back-substitution.
Synchronization is done with CyclicBarrier and CountDownLatch.
Why does it take so much time to synchronize the threads?
Are there other ways to do it?
code snippet
private void syncThreads() {
// barrier.await();
try {
barrier.await();
} catch (InterruptedException e) {
} catch (BrokenBarrierException e) {
}
}

You need to ensure that each thread spends more time doing useful work than it costs in overhead to pass a task to another thread.
Here is an example of where the overhead of passing a task to another thread far outweighs the benefits of using multiple threads.
final double[] results = new double[10*1000*1000];
{
long start = System.nanoTime();
// using a plain loop.
for(int i=0;i<results.length;i++) {
results[i] = (double) i * i;
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per square%n", (double) time / results.length);
}
{
ExecutorService ex = Executors.newFixedThreadPool(4);
long start = System.nanoTime();
// submitting one tiny task per array element.
for(int i=0;i<results.length;i++) {
final int i2 = i;
ex.execute(new Runnable() {
@Override
public void run() {
results[i2] = (double) i2 * i2; // cast first so the product does not overflow int
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per square%n", (double) time / results.length);
}
prints
With one thread it took 1.4 ns per square
With four threads it took 715.6 ns per square
Using multiple threads is much worse.
However, increase the amount of work each thread does and the result changes:
final double[] results = new double[10 * 1000 * 1000];
{
long start = System.nanoTime();
// using a plain loop.
for (int i = 0; i < results.length; i++) {
results[i] = Math.pow(i, 1.5);
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
{
int threads = 4;
ExecutorService ex = Executors.newFixedThreadPool(threads);
long start = System.nanoTime();
int blockSize = results.length / threads;
// submitting one task per block of the array.
for (int i = 0; i < threads; i++) {
final int istart = i * blockSize;
final int iend = (i + 1) * blockSize;
ex.execute(new Runnable() {
@Override
public void run() {
for (int i = istart; i < iend; i++)
results[i] = Math.pow(i, 1.5);
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
prints
With one thread it took 287.6 ns per pow 1.5
With four threads it took 77.3 ns per pow 1.5
That's an almost 4x improvement.

How many threads are being used in total? That is likely the source of your problem. Using multiple threads will only really give a performance boost if:
Each task in the thread does some sort of blocking, for example waiting on I/O. Using multiple threads in this case lets other threads use that blocking time.
or you have multiple cores. If you have 4 cores or 4 CPUs, you can run 4 tasks (or 4 threads) simultaneously.
It sounds like you are not blocking in the threads, so my guess is that you are using too many threads. If, for example, you are using 10 different threads to do the work at the same time but only have 2 cores, that would likely be much slower than running all of the tasks in sequence. Generally, start with a number of threads equal to your number of cores/CPUs. Then increase the number of threads slowly, gauging the performance each time. This will give you the optimal thread count to use.
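As a starting point for the "one thread per core" advice, here is a minimal sketch. The class name PoolSizing and the helper cpuBoundThreads are made up for illustration; the only real API assumed is Runtime.availableProcessors(), which reports the logical processors visible to the JVM.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolSizing {
    // Hypothetical helper: initial thread count for CPU-bound (non-blocking) work.
    static int cpuBoundThreads() {
        // Number of logical processors available to the JVM; never less than 1.
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = cpuBoundThreads();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println("Starting with " + threads + " worker threads");
        // ... submit the CPU-bound tasks here, then measure and adjust the count ...
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

From there you can vary the count by hand and keep whichever value measures best on your machine.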

Perhaps you could try to re-implement your code using fork/join from JDK 7 and see what it does?
By default it creates a thread pool with as many threads as you have cores in your system. If you choose a reasonable threshold for dividing your work into smaller chunks, this will probably execute much more efficiently.
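To make the fork/join suggestion concrete, here is a rough sketch that fills an array with Math.pow(i, 1.5), the same workload as the earlier answer. The class name PowTask and the THRESHOLD value are assumptions chosen for illustration, not tuned numbers.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class PowTask extends RecursiveAction {
    // Assumed cut-off: ranges of this size or smaller are computed sequentially.
    static final int THRESHOLD = 100_000;

    private final double[] results;
    private final int from; // inclusive
    private final int to;   // exclusive

    PowTask(double[] results, int from, int to) {
        this.results = results;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: do the work directly in this worker thread.
            for (int i = from; i < to; i++) {
                results[i] = Math.pow(i, 1.5);
            }
        } else {
            // Too big: split in half and let the pool run both halves.
            int mid = (from + to) >>> 1;
            invokeAll(new PowTask(results, from, mid), new PowTask(results, mid, to));
        }
    }

    public static void main(String[] args) {
        double[] results = new double[10 * 1000 * 1000];
        // The no-argument constructor uses one worker per available processor.
        ForkJoinPool pool = new ForkJoinPool();
        long start = System.nanoTime();
        pool.invoke(new PowTask(results, 0, results.length));
        long time = System.nanoTime() - start;
        System.out.printf("Fork/join took %.1f ns per pow 1.5%n", (double) time / results.length);
        pool.shutdown();
    }
}
```

A threshold that is too small brings back the per-task overhead described above, and one that is too large leaves cores idle, so it is worth measuring a few values.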

You are most likely aware of this, but in case you aren't, please read up on Amdahl's Law. It relates the expected speedup from parallelism to the portion of the program that must run sequentially.
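For reference, Amdahl's Law can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of threads. A small sketch to play with the numbers (the class and method names are made up for illustration):

```java
class Amdahl {
    // speedup = 1 / ((1 - p) + p / n)
    // p: fraction of run time that can be parallelized (0..1), n: number of threads.
    static double speedup(double parallelFraction, int threads) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / threads);
    }

    public static void main(String[] args) {
        // Example: assume 90% of the program can run in parallel.
        for (int n : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("p=0.90, n=%2d -> speedup %.2f%n", n, speedup(0.90, n));
        }
    }
}
```

With p = 0.9 the speedup can never exceed 1 / (1 - p) = 10, no matter how many threads you add, and that is before counting any synchronization overhead.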

Synchronizing across cores is much slower than in a single-core environment. See if you can limit the JVM to 1 core (see this blog post).
Or you can use an ExecutorService and use invokeAll to run the parallel tasks.
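A possible shape for the invokeAll approach is sketched below. It is not the asker's back-substitution code: the workload (a sum of squares) and all names are placeholders. The point is that invokeAll returns only after every task has finished, so each parallel phase needs no hand-written barrier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllDemo {
    // Placeholder workload: sum of i*i for i in [from, to).
    static double sumOfSquares(int from, int to) {
        double s = 0;
        for (int i = from; i < to; i++) {
            s += (double) i * i;
        }
        return s;
    }

    static double parallelSumOfSquares(int n, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Double>> tasks = new ArrayList<>();
            int block = n / threads;
            for (int t = 0; t < threads; t++) {
                final int from = t * block;
                // The last block also takes the remainder of the division.
                final int to = (t == threads - 1) ? n : from + block;
                tasks.add(() -> sumOfSquares(from, to));
            }
            double total = 0;
            // invokeAll blocks until all tasks are done, then we combine the partial results.
            for (Future<Double> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSumOfSquares(10_000_000, 4));
    }
}
```

In an iterative method you would keep one pool alive for the whole run and call invokeAll once per phase, rather than creating a pool per call as this short sketch does.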

Related

Threads in Java - Sum of N numbers

I tried to compute the sum of N numbers both with a conventional loop and with threads, to see how threads perform. I see that the conventional method runs faster than the thread-based one.
My plan is to break the upper limit (N) down into ranges, run a thread for each range, and finally add up the sums calculated by each thread.
stats in milliseconds :
248
500000000500000000
-----same with threads------
498
500000000500000000
Here I see the approach using threads took ~500 milliseconds and the conventional method took only ~250 milliseconds.
I wanted to know if I am correctly implementing threads for this problem.
Thanks
code :
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
class MyThread implements Runnable {
private long from, to, sum; // long: matches the constructor arguments and keeps the sum from overflowing
public MyThread(long from , long to) {
this.from = from;
this.to = to;
sum = 0;
}
public void run() {
for(long i=from;i<=to;i++) {
sum+=i;
}
}
public long getSum() {
return this.sum;
}
}
public class exercise {
public static void main(String args[]) {
long startTime = System.currentTimeMillis();
long sum = 0;
for(long i=1;i<=1000000000;i++) {
sum+=i;
}
long endTime = System.currentTimeMillis();
long duration = (endTime - startTime); //Total execution time in milli seconds
System.out.println(duration);
System.out.println(sum);
System.out.println("-----same with threads------");
ExecutorService executor = Executors.newFixedThreadPool(5);
MyThread one = new MyThread(1, 100000);
MyThread two = new MyThread(100001, 10000000);
MyThread three = new MyThread(10000001, 1000000000);
startTime = System.currentTimeMillis();
executor.execute(one);
executor.execute(two);
executor.execute(three);
executor.shutdown();
// Wait until all threads are finished
while (!executor.isTerminated()) {
}
endTime = System.currentTimeMillis();
System.out.println(endTime - startTime);
long thsum = one.getSum() + two.getSum() + three.getSum();
System.out.println(thsum);
}
}

It only makes sense to split the work into multiple threads when each thread is assigned roughly the same amount of work.
In your case, the first thread does almost nothing, the second thread does almost 1% of the work, and the third thread does 99% of the work.
Therefore, you pay the overhead for running multiple threads without benefiting from the parallel execution.
Splitting the work evenly, as follows, should yield better results:
MyThread one = new MyThread(1, 333333333);
MyThread two = new MyThread(333333334, 666666667);
MyThread three = new MyThread(666666668, 1000000000);
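If you want the even split computed rather than typed by hand, something along these lines could work. RangeSplit and split are hypothetical names; the ranges are inclusive on both ends, to match the MyThread constructor in the question.

```java
class RangeSplit {
    // Splits the inclusive range [1, n] into `parts` contiguous ranges
    // whose sizes differ by at most one. Each row is {from, to}, both inclusive.
    static long[][] split(long n, int parts) {
        long[][] ranges = new long[parts][2];
        long base = n / parts;   // minimum size of every range
        long extra = n % parts;  // the first `extra` ranges get one more element
        long from = 1;
        for (int p = 0; p < parts; p++) {
            long size = base + (p < extra ? 1 : 0);
            ranges[p][0] = from;
            ranges[p][1] = from + size - 1;
            from += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1_000_000_000L, 3)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each row can then be passed to new MyThread(from, to), and the same helper works for any thread count.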

The multithread part of your example includes the time for thread creation. Thread creation is an expensive operation and I presume that it is responsible for a large share of the difference between the single thread and multithread approaches.
Your question was if you are correctly implementing the threads. Did you mean implementing the runnable tasks? If so, I wonder why you have distributed the number ranges so unevenly. The task three seems to be far bigger than the others and as a result the performance will be close to a single thread version however you choose to set up the threads.

Multi threaded matrix multiplication performance issue

I am using Java for multi-threaded matrix multiplication, as practice in multi-threaded programming. The following code is taken from another Stack Overflow post.
public class MatMulConcur {
private final static int NUM_OF_THREAD =1 ;
private static Mat matC;
public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}
private static Mat mul(Mat matA,Mat matB) {
int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;
Worker[] myWorker = new Worker[NUM_OF_THREAD];
for (int j = 0; j < NUM_OF_THREAD; j++) {
if (j<NUM_OF_THREAD-1){
numRowForThread = (numRowA / NUM_OF_THREAD);
} else {
numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
}
myWorker[j] = new Worker(startRow, startRow+numRowForThread,matA,matB);
myWorker[j].start();
startRow += numRowForThread;
}
for (Worker worker : myWorker) {
try {
worker.join();
} catch (InterruptedException e) {
}
}
return matC;
}
private static class Worker extends Thread {
private int startRow, stopRow;
private Mat matA, matB;
public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
super();
this.startRow = startRow;
this.stopRow = stopRow;
this.matA = matA;
this.matB = matB;
}
@Override
public void run() {
for (int i = startRow; i < stopRow; i++) {
for (int j = 0; j < matB.getNColumns(); j++) {
double sum = 0;
for (int k = 0; k < matA.getNColumns(); k++) {
sum += matA.get(i, k) * matB.get(k, j);
}
matC.set(i, j, sum);
}
}
}
} // end of Worker
} // end of MatMulConcur
I ran this program with 1, 10, 20, ..., 100 threads, but performance gets worse as the thread count grows. The timings are as follows:
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any idea why?

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so [1].
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
1. A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
2. A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
3. A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
4. When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
5. Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1, 2 and 5, depending on how many cores are used, how big the CPU's memory caches are, how big the matrix is, and other factors.
[1] Indeed, if this were true, we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.
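On the benchmarking point above, a very rough way to reduce warm-up effects is to run the workload a few times untimed before measuring. The sketch below is only that: WarmupTimer and bestOfNanos are invented names, and the iteration counts are arbitrary. A dedicated harness such as JMH is the more rigorous option.

```java
class WarmupTimer {
    // Runs `work` `warmups` times without timing it (so the JIT can compile the hot code),
    // then returns the fastest of `runs` timed executions, in nanoseconds.
    static long bestOfNanos(Runnable work, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            work.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            work.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] sink = new double[1_000_000];
        // Placeholder workload; substitute the matrix multiplication being measured.
        Runnable work = () -> {
            for (int i = 0; i < sink.length; i++) {
                sink[i] = Math.sqrt(i);
            }
        };
        long ns = bestOfNanos(work, 5, 5);
        System.out.printf("Best run: %.2f ms%n", ns / 1e6);
    }
}
```

Timings of a few tens of milliseconds, like those in the question, are small enough that an unwarmed JVM can easily dominate the result.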

Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will eventually decrease performance. It's like starting more and more applications: you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

How to check threads timing?

I wrote an application which reads all the lines in text files and measures the time taken. I'm wondering what the time for a whole block of threads will be.
For example if I start 2 threads at the same time:
for (int i = 0; i < 2; i++) {
t[i] = new Threads(args[j], 2);
j++;
}
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("TIME for block 1 of threads; "
+ (max(new long[]{t[0].getTime(),t[1].getTime()})));
I wait for them to finish processing the files and then read the operation times (via getTime). With multithreading, is it right to assume that the time for the block of threads is the maximum time reported by any single thread? I think so, because all the other threads will have finished by the time the thread with the maximum time finishes.
Or should I think about it another way?

It's dangerous to reason about execution order when you have multiple threads! E.g. if you run your code on a single-core CPU, the threads will not really run in parallel but sequentially, so the total run time for both threads is the sum of each thread's run time, not the maximum of both.
Fortunately, there is a very easy way to just measure this if you use an ExecutorService instead of using Threads directly (by the way, this is always good advice):
// 1. init executor
int numberOfThreads = 2; // or any other number
int numberOfTasks = numberOfThreads; // is this true in your case?
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
long startTime = System.currentTimeMillis();
// 2. execute tasks in parallel using executor
for(int i = 0; i < numberOfTasks; i++) {
executor.execute(new Task()); // Task is your implementation of Runnable
}
// 3. initiate shutdown and wait until all tasks are finished
executor.shutdown();
executor.awaitTermination(1, TimeUnit.MINUTES); // we won't wait forever
// 4. measure time
long delta = System.currentTimeMillis() - startTime;
Now, delta holds the total running time of your tasks. You can play around with numberOfThreads to see if more or fewer threads give different results.
Important note: do not share a single Reader or InputStream between threads; each task should open its own.

As far as I know, you can use the System class's static methods.
Call one at the start of the block and again at the end, and subtract the earlier value from the later one.
They are:
System.currentTimeMillis(); // The current wall-clock time, in milliseconds.
or
System.nanoTime(); // The current value of the JVM's high-resolution time source, in nanoseconds.

You can use
At the start of the block:
long startTime = System.currentTimeMillis();
At the end of the block:
long elapsed = System.currentTimeMillis() - startTime;
This gives you the elapsed time in milliseconds.

Performance of ExecutorService in Java

I am trying to understand the ExecutorService in java. There is not much performance difference when I use 1 thread or 4 threads. I have a quad core CPU and I do not have any other process running.
ExecutorService exService = Executors.newFixedThreadPool(4);
exService.execute(new Test().new RunnableThread());
exService.awaitTermination(25, TimeUnit.SECONDS);
class RunnableThread implements Runnable {
@Override
public void run() {
StopWatch stopWatch = new StopWatch();
stopWatch.start();
long cnt = 0;
for (cnt = 0; cnt < 999999999; cnt++) {
try {
for (long j = 0; j < 20; j++){
x += j;
}
} catch (Exception e) {
e.printStackTrace();
}
}
stopWatch.stop();
System.out.println(stopWatch.getTime());
}
}
If my understanding is right, my task should have close to 4x performance improvement when I say newFixedThreadPool(4) right?

Unfortunately, there is no magic in the allocation of workload to threads.
Every task runs on its own thread. It does not somehow automatically get transformed into concurrent execution paths.
If you have only one task, the remaining three threads will be idle.
Multiple threads only speed up things if you can split your workload into multiple tasks that can run concurrently (and you have to do that splitting yourself).

If my understanding is right, my task should have close to 4x performance improvement when I say newFixedThreadPool(4) right?
Yes, if you're actually running 4 concurrent tasks.
Currently, you have a single task that you are submitting to the executor. Let's say that it takes 10 seconds. Even if you have 4 cores and 4 threads, Java will not be able to parallelize a single task. However, if you submit 4 independent tasks (that have no memory or lock contention), then you will see all of them complete in those 10 seconds that it took the 1 task.
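A quick way to see this for yourself is to record which pool threads actually run your tasks. The sketch below uses invented names (OneTaskOneThread, threadsUsed) and trivial tasks; it only illustrates that one submitted task occupies one thread, however large the pool is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OneTaskOneThread {
    // Submits `tasks` jobs to a 4-thread pool and returns the names of the threads that ran them.
    static Set<String> threadsUsed(int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Each job just reports the name of the thread it runs on.
            Callable<String> job = () -> Thread.currentThread().getName();
            List<Future<String>> futures = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                futures.add(pool.submit(job));
            }
            Set<String> names = new TreeSet<>();
            for (Future<String> f : futures) {
                names.add(f.get());
            }
            return names;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("1 task  -> threads used: " + threadsUsed(1));
        System.out.println("4 tasks -> threads used: " + threadsUsed(4));
    }
}
```

To get a speedup in your case, you would split the outer loop into four index ranges, submit one task per range, and add up the partial results at the end.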

Loop throttled with Thread.sleep() runs more than twice as often as intended

I use the following loop (with some stuff cut out) as the main loop for a game, but I can't get it to throttle down to the speed I want; it keeps running about twice as fast as I intend.
private void myLoop() throws InterruptedException {
long timer = TimeUtils.getMillis();
int achievedLoops = 0;
long currTime = 0l;
long loopTime = 0l;
long lastTime = TimeUtils.getNano();
while(!isRequestedToStop) {
currTime = TimeUtils.getNano();
loopTime = currTime - lastTime;
lastTime = currTime;
if(TimeUtils.getDeltaMillis(timer) > 1000) {
timer += 1000;
logger.debug(achievedLoops + " Loops");
achievedLoops = 0;
}
achievedLoops++;
if(loopTime < TIME_PER_LOOP) {
Thread.sleep( (TIME_PER_LOOP - loopTime) / 1000000l);
}
}
}
Alternative implementation of the sleeping, gets slightly better results (loop runs only 1.9 times too often):
while(loopTime < TIME_PER_LOOP) {
Thread.sleep(1l);
loopTime += 1000000l;
}
Another alternative:
while(loopTime < TIME_PER_LOOP) {
Thread.sleep(1l);
loopTime = TimeUtils.getNano() - lastTime;
}
Why does that happen?
Are there any other ways to throttle a thread down?
I could basically let it run uncontrolled, as the logic is tied to timed steps, but I would like to reduce the total number of loop iterations, as otherwise there's a marginal chance of it doing damage to a CPU.

LockSupport can disable the scheduling of a thread for a specified number of nanoseconds.
LockSupport.parkNanos(TIME_PER_LOOP - loopTime);
But as others have mentioned there are better ways to control timing (e.g. ScheduledExecutorService).
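For the ScheduledExecutorService route, a minimal sketch could look like the following. FixedRateLoop and runTicks are hypothetical names, and the 20 ms period is just an example value; scheduleAtFixedRate is the real API that replaces the manual sleep arithmetic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class FixedRateLoop {
    // Runs `step` at least `ticks` times, one tick every `periodMillis`,
    // and returns the elapsed wall time in milliseconds.
    static long runTicks(Runnable step, int ticks, long periodMillis) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(ticks);
        long start = System.nanoTime();
        // The scheduler keeps the rate; the loop body never has to sleep itself.
        scheduler.scheduleAtFixedRate(() -> {
            step.run();
            done.countDown();
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        done.await();            // wait until the requested number of ticks has run
        scheduler.shutdownNow(); // stop further ticks
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        long elapsed = runTicks(() -> { /* one game update would go here */ }, 50, 20);
        System.out.println("50 ticks at 20 ms took about " + elapsed + " ms");
    }
}
```

In a real game loop you would keep the scheduler running and cancel it when isRequestedToStop becomes true, rather than counting ticks.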

Each time you perform the division
(TIME_PER_LOOP - loopTime) / 1000000l
you are truncating the result and sleeping up to 1 ms less than you expect (0.5 ms on average). Given your 4 ms loop time, this would easily cause your loop to run twice as fast as you expect. As others have mentioned, there are much better ways to control timing.
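To see the truncation in numbers, and one way to avoid dropping the remainder, here is a small sketch. SleepTruncation and its methods are illustrative names; Thread.sleep(long millis, int nanos) is the standard two-argument overload, though how precisely the extra nanoseconds are honoured depends on the JVM and the operating system.

```java
class SleepTruncation {
    static final long NANOS_PER_MILLI = 1_000_000L;

    // The computation from the question: integer division drops the sub-millisecond part.
    static long truncatedMillis(long remainingNanos) {
        return remainingNanos / NANOS_PER_MILLI;
    }

    // Passes both the whole milliseconds and the leftover nanoseconds to Thread.sleep.
    static void sleepNanos(long remainingNanos) throws InterruptedException {
        if (remainingNanos <= 0) {
            return; // already late, do not sleep
        }
        long millis = remainingNanos / NANOS_PER_MILLI;
        int nanos = (int) (remainingNanos % NANOS_PER_MILLI); // always in 0..999999
        Thread.sleep(millis, nanos);
    }

    public static void main(String[] args) throws InterruptedException {
        long remaining = 3_900_000L; // 3.9 ms left in this loop iteration
        System.out.println(truncatedMillis(remaining)); // prints 3
        long start = System.nanoTime();
        sleepNanos(remaining);
        System.out.printf("Slept %.2f ms%n", (System.nanoTime() - start) / 1e6);
    }
}
```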
