Performance of Concurrent Program Degrading with Increase in Threads? - java

I have been trying to run the below code on a quad-core computer, and the average running times with the number of threads in the executor service, over 100 iterations, are as follows:
1 thread = 78404.95
2 threads = 174995.14
4 threads = 144230.23
But according to what I have studied, 2 * (number of cores) threads should give the optimal result, which is clearly not the case in my program: bizarrely, it gives the best time with a single thread.
Code :
import java.util.Collections;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TestHashSet {
    public static void main(String argv[]) {
        Set<Integer> S = Collections.newSetFromMap(new ConcurrentHashMap<Integer, Boolean>());
        S.add(1);
        S.add(2);
        S.add(3);
        S.add(4);
        S.add(5);
        long startTime = System.nanoTime();
        ExecutorService executor = Executors.newFixedThreadPool(8);
        int Nb = 0;
        for (int i = 0; i < 10; i++) {
            User runnable = new User(S);
            executor.execute(runnable);
            Nb = Thread.getAllStackTraces().keySet().size();
        }
        executor.shutdown();
        try {
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        long endTime = System.nanoTime();
        System.out.println(0.001 * (endTime - startTime) + " And " + Nb);
    }
}

class User implements Runnable {
    Set<Integer> S;

    User(Set<Integer> S) {
        this.S = S;
    }

    @Override
    public void run() {
        Set<Integer> t = Collections.newSetFromMap(new ConcurrentHashMap<Integer, Boolean>());
        for (int i = 0; i < 10; i++) {
            t.add(i + 5);
        }
        S.retainAll(t);
        Set<Integer> t2 = Collections.newSetFromMap(new ConcurrentHashMap<Integer, Boolean>());
        for (int i = 0; i < 10; i++) {
            t2.add(i);
        }
        S.addAll(t);
        /*
        ConcurrentHashSet<Integer> D = new ConcurrentHashSet<Integer>();
        for (int i = 0; i < 10; i++) {
            D.add(i + 3);
        }
        S.difference(D);
        */
    }
}
Update: If I increase the number of queries per thread to 1000, the 4-threaded version performs better than the single-threaded one. I think the overhead was higher than the run time when I used only about 4 queries per thread, and as the number of queries increased, the run time is now greater than the overhead. Thanks.

But aren't 5 threads supposed to increase the performance?
That's what >>you<< suppose. But in fact, there are no guarantees that adding threads will increase performance.
But according to what I have studied 2*(no of cores) of threads should give optimal result ...
If you read that somewhere, then you either misread it or it is plain wrong.
The reality is that the number of threads for optimal performance is highly dependent on the nature of your application, and also on the hardware you are running on.
Based on a cursory reading of your code, it appears that this is a benchmark to test how well Java deals with multi-threaded access and updates to a shared set (S). Each thread does some operations on a thread-confined set, then either adds all of its entries to the shared set or retains in the shared set only the entries that also appear in the thread-confined one.
The problem is that the addAll and retainAll calls are likely to be concurrency bottlenecks. A set based on ConcurrentHashMap will give better concurrent performance for point access / update than one based on HashMap. However, addAll and retainAll perform N such operations, on the same entries that the other threads are operating on. Given the nature of this pattern of operations, you are likely to get significant contention within the different regions of the ConcurrentHashMap. That is likely to lead to one thread blocking another ... and a slowdown.
Update: If I increase the number of queries per thread, the 4-threaded version performs better than the single-threaded one. I think the overhead was higher than the run time when I used only about 4 queries per thread, and as the number of queries increased, the run time became greater than the overhead.
I assume that you mean that you are increasing the number of hash map entries. This is likely to reduce the average contention, given the way that ConcurrentHashMap works. (The class divides the map into regions, and arranges that operations involving entries in different regions incur the minimum possible contention overheads. By increasing the number of distinct entries, you are reducing the probability that two simultaneous operations will lead to contention.)
So returning to the "2 x no of threads" factoid.
I suspect that the sources you have been reading don't actually say that that gives you optimal performance. I suspect that they really say one of the following:
"2 x no of threads" is a good starting point ... and you need to tune it for your application / problem / hardware, and/or
don't go above "2 x no of threads" for a compute intensive task ... because it is unlikely to help.
In your example, it is most likely that the main source of the contention is in the updates to the shared set / map ... and the overheads of ensuring that they happen atomically.
You can also get contention at a lower level; i.e. contention for memory bandwidth (RAM read/write) and memory cache contention. Whether that happens will depend on the specs of the hardware you are running on ...
The final thing to note is that your benchmark is flawed in that it does not allow for various VM warmup effects ... such as JIT compilation. The fact that your 2 thread times are more than double the 1 thread times points to that issue.
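As a rough illustration (my sketch, not part of the original answer; the warmup count is arbitrary), running the workload several times before timing it reduces the warmup distortion, though a proper harness such as JMH is still preferable:

// Minimal warmup sketch: run the workload untimed first so the JIT has compiled the hot paths.
public class WarmedUpTiming {
    public static void main(String[] args) {
        Runnable workload = () -> {
            // put the executor + tasks from the question here
        };
        final int warmupRuns = 20; // arbitrary; enough for compilation to kick in on hot paths
        for (int i = 0; i < warmupRuns; i++) {
            workload.run(); // untimed warmup iterations
        }
        long start = System.nanoTime();
        workload.run(); // timed run, after compilation has (hopefully) settled
        System.out.println("elapsed ns: " + (System.nanoTime() - start));
    }
}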
There are other questionable aspects about your benchmarking:
The amount of work done by the run() method is too small.
This benchmark does not appear to be representative of a real-world use-case. Measuring speed-up in a totally fictitious (nonsense) algorithm is not going to give you any clues about how a real algorithm is likely to perform when you scale the thread count.
Running the tests on a 4 core machine means that you probably wouldn't have enough data points to draw scientifically meaningful conclusions ... assuming that the benchmark was sound.
Having said that, the 2 to 4 thread slowdown that you seem to be seeing is not unexpected ... to me.

Related

Why do threads not cache the object locally?

I have a String and ThreadPoolExecutor that changes the value of this String. Just check out my sample:
String str_example = "";
ThreadPoolExecutor poolExecutor = new ThreadPoolExecutor(10, 30, (long) 10, TimeUnit.SECONDS, runnables);
for (int i = 0; i < 80; i++) {
    poolExecutor.submit(new Runnable() {
        @Override
        public void run() {
            try {
                Thread.sleep((long) (Math.random() * 1000));
                String temp = str_example + "1";
                str_example = temp;
                System.out.println(str_example);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}
So after executing this, I get something like this:
1
11
111
1111
11111
.......
So the question is: I expected a result like this only if my String object had the volatile modifier, but I get the same result with and without it.
There are several reasons why you see "correct" execution.
First, CPU designers do as much as they can so that our programs run correctly even in the presence of data races. Cache coherence operates on cache lines and tries to minimize possible conflicts. For example, only one CPU can write to a cache line at a given point in time; after the write is done, other CPUs have to request that cache line in order to write to it themselves. On top of that, the x86 architecture (which you most probably use) is quite strict compared to others.
Second, your program is slow and threads sleep for some random period of time. So they do almost all the work at different points of time.
How can you provoke inconsistent behavior? Try something with a for loop and without any sleep. In that case the field value will most probably be cached in CPU registers and some updates will not be visible.
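As an illustration of that kind of loop (a standard visibility demo of my own, not the questioner's code; the flag and the timings are assumptions), a busy-wait on a non-volatile field may never see the update:

import java.util.concurrent.TimeUnit;

public class VisibilityDemo {
    static boolean stop = false; // try adding 'volatile' here and compare the behavior

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            long spins = 0;
            while (!stop) { // without volatile, the JIT may hoist this read out of the loop
                spins++;
            }
            System.out.println("stopped after " + spins + " spins");
        });
        worker.setDaemon(true); // so the JVM can still exit if the update never becomes visible
        worker.start();
        TimeUnit.SECONDS.sleep(1);
        stop = true; // this write may never be observed by the spinning worker
        worker.join(5000);
        System.out.println("worker still running: " + worker.isAlive());
    }
}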
P.S. Updates of the field str_example are not atomic, so your program may produce duplicate string values even in the presence of the volatile keyword.
When you talk about concepts like thread caching, you're talking about the properties of a hypothetical machine that Java might be implemented on. The logic is something like "Java permits an implementation to cache things, so it requires you to tell it when such things would break your program". That does not mean that any actual machine does anything of the sort. In reality, most machines you are likely to use have completely different kinds of optimizations that don't involve the kind of caches that you're thinking of.
Java requires you to use volatile precisely so that you don't have to worry about what kinds of absurdly complex optimizations the actual machine you're working on might or might not have. And that's a really good thing.
Your code is unlikely to exhibit concurrency bugs because it executes with very low concurrency. You have 10 threads, each of which sleeps on average 500 ms before doing a string concatenation. As a rough guess, string concatenation takes about 1 ns per character, and because your string is only 80 characters long, each thread spends about 80 out of 500,000,000 ns executing. The chance of two or more threads running at the same time is therefore vanishingly small.
If we change your program so that several threads are running concurrently all the time, we see quite different results:
static String s = "";

public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (int i = 0; i < 10_000; i++) {
        executor.submit(() -> {
            s += "1";
        });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.MINUTES);
    System.out.println(s.length());
}
In the absence of data races, this should print 10000. On my computer, this prints about 4200, meaning over half the updates have been lost in the data race.
What if we declare s volatile? Interestingly, we still get about 4200 as a result, so the data race was not prevented. That makes sense, because volatile ensures that writes are visible to other threads, but it does not make the read-modify-write sequence atomic; what happens is something like:
Thread 1 reads s and starts making a new String
Thread 2 reads s and starts making a new String
Thread 1 stores its result in s
Thread 2 stores its result in s, overwriting the previous result
To prevent this, you can use a plain old synchronized block:
executor.submit(() -> {
    synchronized (Test.class) {
        s += "1";
    }
});
And indeed, this returns 10000, as expected.
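An alternative sketch (my addition, not part of the original answer) makes the read-modify-write atomic without a lock by keeping the string in an AtomicReference and using updateAndGet; under the same setup this should also print 10000:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AtomicAppend {
    static final AtomicReference<String> s = new AtomicReference<>("");

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(5);
        for (int i = 0; i < 10_000; i++) {
            executor.submit(() -> {
                s.updateAndGet(cur -> cur + "1"); // CAS loop: retried if another thread won the race
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(s.get().length()); // expected: 10000
    }
}

(Copying an ever-growing string on every update is expensive under contention, of course; the point here is only the atomicity of the update.)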
It is working because you are using Thread.sleep((long) (Math.random() * 1000)), so every thread has a different sleep time and they may execute one by one, while all other threads are still sleeping or have already finished. But even though your code appears to work, it is not thread safe. Even using volatile will not make it thread safe: volatile only ensures visibility, i.e. when one thread makes a change, other threads are able to see it.
In your case the operation is a multi-step process: reading the variable, updating it, and writing it back to memory. So you need a locking mechanism to make it thread safe.

why using two threads for a counter decreases performance in Java?

In Python, using two threads for a simple counter program (as demonstrated below) is slower than the single-threaded version. The reason given for this is the mechanism behind the Global Interpreter Lock.
I tested the same thing in Java to compare the performance. Here again, I see that a single thread outperforms the two-threaded version by a significant margin. Why is that?
Here is the code:
public class ThreadTiming {
    static void threadMessage(String message) {
        String threadName = Thread.currentThread().getName();
        System.out.format("%s: %s%n", threadName, message);
    }

    private static class Counter implements Runnable {
        private int count = 500000000;

        @Override
        public void run() {
            while (count > 0) {
                count--;
            }
            threadMessage("done processing");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(new Counter());
        Thread t2 = new Thread(new Counter());
        long startTime = System.currentTimeMillis();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        long endTime = System.currentTimeMillis();
        System.out.println("Time taken by two threads " + (endTime - startTime) / 1000.0);
        startTime = System.currentTimeMillis();
        Calculate(2 * 500000000);
        endTime = System.currentTimeMillis();
        System.out.println("Time taken by single thread " + (endTime - startTime) / 1000.0);
    }

    public static void Calculate(int x) {
        while (x > 0) {
            x--;
        }
        threadMessage("Done processing");
    }
}
Output:
Thread-1: done processing
Thread-2: done processing
Time taken by two threads 0.052
main: Done processing
Time taken by single thread 0.0010
Very simple. The single-threaded version uses a local variable, and HotSpot has no problem proving that it never leaves its scope, hence the whole function is reduced to a no-op.
On the other hand, proving that the instance variable never leaves its scope (hello, reflection!) is much harder, and obviously HotSpot cannot do it here, hence the loop isn't removed.
On a general note, benchmarking is hard (I count at least three other mistakes that could lead to "wrong" results) and requires tons of knowledge. You are better off using JMH (the Java Microbenchmark Harness), which takes care of most of these things.
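For reference, a minimal sketch of how such a counting loop could be expressed as a JMH benchmark (the class and method names are made up, and the usual JMH annotation-processor build setup is assumed):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class CountDownBench {
    int count = 500_000_000; // kept in a field so the value is not a compile-time constant

    @Benchmark
    public long countDown() {
        long sum = 0;
        for (int i = count; i > 0; i--) {
            sum += i; // gives the loop an observable result
        }
        return sum; // JMH consumes the return value, which blocks dead-code elimination
    }
}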
The basic answer is you have code the optimiser can eliminate and you are timing how long it takes to detect this. You are also adding the time it takes to start and stop two threads which could be more than half this time.
The second test doesn't start a new thread, it uses the current one so you just need to wait for it to detect the loop doesn't do anything.
For example you have timed that a single thread can do 1 billion loops in 1 ms. If you have a 3.33 GHz processor, this would have to do 300 iterations in a single clock cycle. If this sounds too good to be true, that is because it is. ;)
@Voo seems to be generally right, as you can see by moving ThreadTiming.Counter.count to be a local variable of ThreadTiming.Counter.run(). That eliminates any possibility of non-local references, and the resulting program exhibits much less single-thread vs. dual-thread performance difference.
HOWEVER, that doesn't eliminate all the difference. The timing reported for the dual-thread case is still worse by about a factor of 9 for me. But if I then swap so that the single-threaded case is measured first, the two-thread case wins by about a factor of 2.
But that, too, is illusory, because the two tests are running different -- albeit similar -- code. The single-thread case can easily be made to run exactly the same code as the dual thread case:
Counter c = new Counter();
c.run();
c.run();
(Using the version where count is local to run().) If that approach is used then I observe no difference in performance (at the resolution of the measurement) between single- and dual-threaded, regardless of which case is tested first.
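For concreteness, here is my reconstruction of that modified Counter (an assumption, not the answer's own code); the single-threaded measurement above then runs exactly this run() method twice:

private static class Counter implements Runnable {
    @Override
    public void run() {
        int count = 500000000; // local variable: easy for HotSpot to prove it never escapes
        while (count > 0) {
            count--;
        }
        threadMessage("done processing"); // threadMessage as defined in the question's class
    }
}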
As @Voo said, benchmarking is hard.
It just looks like it's from loading each thread and its context into the CPU. It's thrashing. There's probably a more detailed answer waiting to strike, but let's start by posting the basics...
When running two threads, your timer is including the time taken to launch the two threads. Creating and starting threads has some overhead, and in this case, the overhead is longer than the time to actually carry out the process.

Why does adding cores slow down my Java program after around 10 cores?

My program uses fork/join as shown below to run thousands of tasks:
private static class Generator extends RecursiveTask<Long> {
    final MyHelper mol;
    final static SatChecker satCheck = new SatChecker();

    public Generator(final MyHelper mol) {
        super();
        this.mol = mol;
    }

    @Override
    protected Long compute() {
        long count = 0;
        try {
            if (mol.isComplete(satCheck)) {
                count = 1;
            }
            ArrayList<MyHelper> molList = mol.extend();
            List<Generator> tasks = new ArrayList<>();
            for (final MyHelper child : molList) {
                tasks.add(new Generator(child));
            }
            for (final Generator task : invokeAll(tasks)) {
                count += task.join();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return count;
    }
}
My program makes heavy use of a third party library for isComplete and extend methods. The extend method also uses a native library. As far as the MyHelper class is concerned, there is no shared variable or synchronization between the tasks.
I use the taskset command from linux to restrict the number of cores used by my application. I get the best speed by using around 10 cores (say around 60 seconds). It means that using more than 10 cores results in slowing down the application, such that 16 cores finishes in the same time as 6 cores (around 90 seconds).
I am more confused because the selected cores are 100% busy (except for garbage collection every now and then).
Does anyone know what could cause such a slow down? And where should I look to solve this problem?
PS: I have also made implementations in Scala/Akka and using ThreadPoolExecutor, but with similar results (although slower than fork/join).
PPS: My guess is that down deep in MyHelper or SatCheck, someone crosses the memory barrier (poisoning the cache). But how can I find that and fix or go about it?
There might be overhead due to assigning threads/tasks to the different cores. Also, are you sure that your program is entirely parallelizable? Some programs cannot use all the available CPUs 100% efficiently, and the time taken to dispatch the tasks might slow the program down more than it helps.
I think that you should use a threshold on the size of your molList (or mol) variable to avoid forking on too small sets of data.
I'd been playing a bit with fork/join just to understand the framework, and my first examples did not take the threshold into consideration. Obviously I got awful performance. Setting a proper limit on the size of the problem did the trick.
Finding the right value for the threshold requires spending a bit of time trying different values and seeing how performance changes.
So, put an if at the very beginning of the compute method like this:
@Override
protected Long compute() {
    if (mol.getSize() < THRESHOLD) // getSize or whatever gives you the size of the problem
        return noForkJoinCompute(mol); // noForkJoinCompute gives you the count without FJ
    long count = 0;
    try {
        if (mol.isComplete(satCheck)) {
            count = 1;
        }
        ...
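The answer leaves noForkJoinCompute unspecified; a plausible sequential sketch (an assumption on my part, reusing isComplete, extend, and the static satCheck field from the question) could look like this:

// Hypothetical sequential fallback: same counting logic as compute(), but plain recursion
// in the current thread instead of creating and joining subtasks.
private static long noForkJoinCompute(MyHelper mol) {
    long count = 0;
    try {
        if (mol.isComplete(satCheck)) {
            count = 1;
        }
        for (MyHelper child : mol.extend()) {
            count += noForkJoinCompute(child); // no task-creation or scheduling overhead
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return count;
}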

Dynamically spawning threads in Java for a case like this

Suppose I have a List of integers. Each int I have must be multiplied by 100. To do this with a for loop I'd construct something like the following:
for (Integer i : numbers) {
    i = i * 100;
}
But suppose for performance reasons I wanted to simultaneously spawn a thread for each number in numbers and perform a single multiplication on each thread returning the result to the same List. What would be the best way of doing such a thing?
My actual problem isn't as trivial as multiplication of ints but rather a task that each iteration of the loop takes a substantial amount of time, and so I'd like to do them all at the same time in order to decrease execution time.
If you can use Java 7, the Fork/Join framework was created for precisely this problem. If not, the JSR 166 (fork/join proposal) source code is available at this link.
Essentially, you would create a task for each step (in your case, for each index in the array) and submit it to a service that can pool threads (the fork part). Then you wait for everything to complete and merge the results (the join part).
The reason to use a service as opposed to launching your own threads, is there can be an overhead in creating threads, and in some cases, you may want to limit the number of threads. For example, if you're on a four CPU machine, it wouldn't make much sense to have more than four threads running concurrently.
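A hedged sketch of that shape for the multiply-by-100 example (the class name, split threshold, and in-place update are my assumptions, not code from the answer):

import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Multiplies numbers[from, to) by 100 in place, splitting large ranges across worker threads.
// Subtasks write disjoint index ranges, so they do not conflict with each other.
class MultiplyTask extends RecursiveAction {
    private static final int THRESHOLD = 1_000; // assumed cutoff below which we stop splitting
    private final List<Integer> numbers;
    private final int from, to;

    MultiplyTask(List<Integer> numbers, int from, int to) {
        this.numbers = numbers;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                numbers.set(i, numbers.get(i) * 100); // small range: do the work directly
            }
        } else {
            int mid = (from + to) >>> 1;
            invokeAll(new MultiplyTask(numbers, from, mid), // fork both halves ...
                      new MultiplyTask(numbers, mid, to));  // ... and join them
        }
    }
}

// Usage: new ForkJoinPool().invoke(new MultiplyTask(numbers, 0, numbers.size()));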
If your tasks are independent of each other, you can use Executors framework.
Note that you would gain more speed if you create no more threads than you have CPU cores at your disposal.
Sample:
class WorkInstance {
    final int argument;
    final int result;

    WorkInstance(int argument, int result) {
        this.argument = argument;
        this.result = result;
    }

    public String toString() {
        return "WorkInstance{" +
                "argument=" + argument +
                ", result=" + result +
                '}';
    }
}

public class Main {
    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        int numOfCores = 4;
        final ExecutorService executor = Executors.newFixedThreadPool(numOfCores);
        List<Integer> toMultiplyBy100 = Arrays.asList(1, 3, 19);
        List<Future<WorkInstance>> tasks = new ArrayList<Future<WorkInstance>>(toMultiplyBy100.size());
        for (final Integer workInstance : toMultiplyBy100)
            tasks.add(executor.submit(new Callable<WorkInstance>() {
                public WorkInstance call() throws Exception {
                    return new WorkInstance(workInstance, workInstance * 100);
                }
            }));
        for (Future<WorkInstance> result : tasks)
            System.out.println("Result: " + result.get());
        executor.shutdown();
    }
}
Spawning a new thread for each number in numbers is not a good idea. However, using a fixed thread pool with a size matching the number of cores/CPUs might increase your performance slightly.
The quick and dirty way to get started is to use a thread pool, such as one returned by Executors.newCachedThreadPool(). Then create tasks that implement Runnable and submit() them to your thread pool. Also read up on the classes and interfaces linked by those Javadocs, lots of cool stuff you can try.
See the concurrency chapter in Effective Java, 2nd ed for a great introduction to multithreaded Java.
Take a look at ThreadPoolExecutor and create a task for each iteration. A prerequisite is that those tasks are independent though.
The use of a thread pool allows you to create a task per iteration but only run as many concurrently as there are threads, since you'd want to limit the number of threads, for example to the number of cores or hardware threads available. Creating a whole lot of threads would be counterproductive, since they'd require a lot of context switching, which hurts performance.
I assume you are on a commodity PC. You will at most have N threads executing at the same time on your machine, where N is the # of cores of your CPUs, so most likely in the [1, 4] range. Plus the contention on the shared list.
But even more importantly, the cost of spawning a new thread is much greater than the cost of doing a multiplication. One could have a thread pool... but in this specific case, it's not even worth talking about it. Really.
If it is the only application on a node, you should determine which number of threads finishes the job most quickly (call it max_throughput). This depends on the processor you use and on how much the JIT can optimize your code, so there is no general advice: measure.
After that you could distribute the jobs to a pool of worker threads by taking the job numbers modulo max_throughput, as sketched below.
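A small sketch of that modulo distribution (maxThroughput, the job count, and the work() call are placeholders I am assuming):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ModuloDistribution {
    public static void main(String[] args) throws InterruptedException {
        final int maxThroughput = 4; // the measured "best" thread count; 4 is just a placeholder
        final int jobCount = 1_000;
        ExecutorService pool = Executors.newFixedThreadPool(maxThroughput);
        for (int worker = 0; worker < maxThroughput; worker++) {
            final int w = worker;
            pool.execute(() -> {
                // each worker handles the job numbers congruent to w modulo maxThroughput
                for (int job = w; job < jobCount; job += maxThroughput) {
                    work(job);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    static void work(int job) {
        // placeholder for the real per-job computation
    }
}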

java threads vs java processes performance degradation

Here I would like to focus on the specific application where I observed the degradation (no need for a general discussion about the speed of threads versus processes).
I've got an MPI application in Java which solves a problem using an iterative method. A schematic view of the application is below; let's call it MyProcess(n), where "n" is the number of processes:
double[] myArray = new double[M * K];
for (int iter = 0; iter < iterationCount; ++iter) {
    // some communication between processes
    // main loop
    for (M)
        for (K) {
            // linear sequence of arithmetical instructions
        }
    // some communication between processes
}
To improve performance I've decided to use Java threads (let's call it MyThreads(n)). The code is almost the same: myArray becomes a matrix, where each row contains the array for the corresponding thread.
double[][] myArray = new double[threadNumber][M * K];

public void run() {
    for (int iter = 0; iter < iterationCount; ++iter) {
        // some synchronization primitives
        // main loop
        for (M)
            for (K) {
                // linear sequence of arithmetical instructions
                counter++;
            }
        // some synchronization primitives
    }
}
Threads are created and started using Executors.newFixedThreadPool(threadNumber).
The problem is that while MyProcess(n) gives adequate performance (n in [1,8]), in the case of MyThreads(n) performance degrades substantially (on my system by a factor of n).
Hardware: Intel(R) Xeon(R) CPU X5355(2 processors, 4 cores on each)
Java version: 1.5 (using the -d32 option).
At first I thought the threads got different workloads, but no: the variable "counter" shows that the number of iterations across different runs of MyThreads(n) (n in [1,8]) is identical.
And it isn't a synchronization fault, because I have temporarily commented out all synchronization primitives.
Any suggestions/ideas would be appreciated.
Thanks.
There are 2 issues I see in your piece of code.
Firstly, a caching problem. Since you try to do this with multiple threads/processes, I'd assume your M * K results in a large number; then when you do
double[][] myArray = new double[threadNumber][M*K];
You are essentially creating an array of references of size threadNumber, each pointing to a double array of size M*K. The interesting point here is that those threadNumber arrays are not necessarily allocated in the same block of memory; they are just references that can point anywhere inside the JVM heap. As a result, when multiple threads run, you may encounter a lot of cache misses and end up reading from main memory many times, which eventually slows down your program.
If the above is the root cause, you can try enlarging your JVM heap size, and then do
double[] myArray = new double[threadNumber * M * K];
and have the threads operate on different segments of the same array. You should see better performance; a minimal sketch of this layout is below.
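A hedged sketch of that layout (threadNumber, M, K, and the per-element arithmetic are placeholders), where each thread touches only its own contiguous segment of the flat array:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FlatArraySegments {
    public static void main(String[] args) throws InterruptedException {
        final int threadNumber = 8, M = 1_000, K = 1_000;          // placeholder sizes
        final double[] myArray = new double[threadNumber * M * K]; // one contiguous block
        ExecutorService pool = Executors.newFixedThreadPool(threadNumber);
        for (int t = 0; t < threadNumber; t++) {
            final int offset = t * M * K; // start of this thread's private segment
            pool.execute(() -> {
                for (int i = 0; i < M * K; i++) {
                    myArray[offset + i] += 1.0; // placeholder for the real arithmetic
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}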
Secondly, a synchronization issue. Note that the elements of a double (or any primitive) array are NOT volatile, so a result written by one thread isn't guaranteed to be visible to other threads. If you are using a synchronized block, this resolves the issue, since a side effect of synchronization is to ensure visibility across threads; if not, when you are reading and writing the array, always make sure you use Unsafe.putXXXVolatile() and Unsafe.getXXXVolatile() so that you can do volatile operations on array elements.
To take this further, Unsafe can also be used to allocate a contiguous segment of memory which you can use to hold your data structure and achieve better performance. In your case, I think (1) already does the trick.
