Performance of ThreadLocal variable - java

How much slower is reading from a ThreadLocal variable than from a regular field?
More concretely, is simple object creation faster or slower than access to a ThreadLocal variable?
I assume that it is fast enough that keeping a ThreadLocal<MessageDigest> instance is much faster than creating an instance of MessageDigest every time. But does that also apply to, say, a byte[10] or a byte[1000]?
Edit: The question is what really goes on when calling ThreadLocal's get()? If that is just a field, like any other, then the answer would be "it's always fastest", right?

In 2009, some JVMs implemented ThreadLocal using an unsynchronised HashMap in the Thread.currentThread() object. This made it extremely fast (though not nearly as fast as using a regular field access, of course), as well as ensuring that the ThreadLocal object got tidied up when the Thread died. Updating this answer in 2016, it seems most (all?) newer JVMs use a ThreadLocalMap with linear probing. I am uncertain about the performance of those – but I cannot imagine it is significantly worse than the earlier implementation.
Of course, new Object() is also very fast these days, and the garbage collectors are also very good at reclaiming short-lived objects.
Unless you are certain that object creation is going to be expensive, or you need to persist some state on a thread-by-thread basis, you are better off going for the simpler allocate-when-needed solution, and only switching over to a ThreadLocal implementation when a profiler tells you that you need to.
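For concreteness, here is a minimal sketch of the ThreadLocal<MessageDigest> pattern the question asks about (my example, not part of the original answer; SHA-256 is an arbitrary choice, and withInitial requires Java 8+):
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestHolder {
    // One MessageDigest per thread, created lazily on first get().
    private static final ThreadLocal<MessageDigest> DIGEST =
        ThreadLocal.withInitial(() -> {
            try {
                return MessageDigest.getInstance("SHA-256");
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        });

    static byte[] hash(byte[] input) {
        MessageDigest md = DIGEST.get(); // reused by this thread every call
        md.reset();                      // clear any previous state
        return md.digest(input);
    }
}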

Running unpublished benchmarks, ThreadLocal.get takes around 35 cycles per iteration on my machine. Not a great deal. In Sun's implementation, a custom linear-probing hash map in Thread maps ThreadLocals to values. Because it is only ever accessed by a single thread, it can be very fast.
Allocation of small objects takes a similar number of cycles, although because of cache exhaustion you may get somewhat lower figures in a tight loop.
Construction of MessageDigest is likely to be relatively expensive. It has a fair amount of state, and construction goes through the Provider SPI mechanism. You may be able to optimise this by, for instance, cloning a prototype instance or supplying the Provider directly (a hedged sketch of the cloning idea follows).
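A sketch of that cloning idea, as an assumption on my part rather than code from the answer: build one prototype through the expensive SPI path and clone it per use. Whether clone() succeeds depends on the provider and algorithm.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestFactory {
    // Prototype built once through the (expensive) Provider SPI path.
    private static final MessageDigest PROTOTYPE;
    static {
        try {
            PROTOTYPE = MessageDigest.getInstance("SHA-256"); // example algorithm
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static MessageDigest newDigest() {
        try {
            // Cheap copy of the prototype; not all providers support clone().
            return (MessageDigest) PROTOTYPE.clone();
        } catch (CloneNotSupportedException e) {
            try {
                return MessageDigest.getInstance("SHA-256"); // slow fallback
            } catch (NoSuchAlgorithmException ex) {
                throw new IllegalStateException(ex);
            }
        }
    }
}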
Just because it may be faster to cache in a ThreadLocal rather than create does not necessarily mean that system performance will increase. You will have additional overheads related to GC, which slow everything down.
Unless your application uses MessageDigest very heavily, you might want to consider using a conventional thread-safe cache instead; a minimal sketch of one follows.
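This sketch of such a cache is my own illustration, not the answer's code: a lock-free queue used as a borrow/return pool shared by all threads.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentLinkedQueue;

class DigestPool {
    // Thread-safe, lock-free pool of reusable digests.
    private final ConcurrentLinkedQueue<MessageDigest> pool =
        new ConcurrentLinkedQueue<>();

    MessageDigest borrow() throws NoSuchAlgorithmException {
        MessageDigest md = pool.poll();
        return (md != null) ? md : MessageDigest.getInstance("SHA-256");
    }

    void giveBack(MessageDigest md) {
        md.reset();
        pool.offer(md); // unbounded: fine for a sketch, cap it in real code
    }
}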

Good question, I've been asking myself that recently. To give you definite numbers, the benchmarks below (in Scala, compiled to virtually the same bytecode as the equivalent Java code):
var cnt: String = ""
val tlocal = new java.lang.ThreadLocal[String] {
  override def initialValue = ""
}

def loop_heap_write = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (cnt ne "") cnt = "!"
    i += 1
  }
  cnt
}

def threadlocal = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (tlocal.get eq null) i = until + i + 1
    i += 1
  }
  if (i > until) println("thread local value was null " + i)
}
available here, were performed on an AMD machine with four 2.8 GHz dual-core processors and on a quad-core i7 with hyperthreading (2.67 GHz).
These are the numbers:
i7
Specs: Intel i7 2x quad-core @ 2.67 GHz
Test: scala.threads.ParallelTests
Test name: loop_heap_read
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
9.0069 9.0036 9.0017 9.0084 9.0074 (avg = 9.1034 min = 8.9986 max = 21.0306 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
4.5563 4.7128 4.5663 4.5617 4.5724 (avg = 4.6337 min = 4.5509 max = 13.9476 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
2.3946 2.3979 2.3934 2.3937 2.3964 (avg = 2.5113 min = 2.3884 max = 13.5496 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
2.4479 2.4362 2.4323 2.4472 2.4383 (avg = 2.5562 min = 2.4166 max = 10.3726 )
Test name: threadlocal
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
91.1741 90.8978 90.6181 90.6200 90.6113 (avg = 91.0291 min = 90.6000 max = 129.7501 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
45.3838 45.3858 45.6676 45.3772 45.3839 (avg = 46.0555 min = 45.3726 max = 90.7108 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
22.8118 22.8135 59.1753 22.8229 22.8172 (avg = 23.9752 min = 22.7951 max = 59.1753 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
22.2965 22.2415 22.3438 22.3109 22.4460 (avg = 23.2676 min = 22.2346 max = 50.3583 )
AMD
Specs: AMD 8220 4x dual-core @ 2.8 GHz
Test: scala.threads.ParallelTests
Test name: loop_heap_read
Total work: 20000000
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
12.625 12.631 12.634 12.632 12.628 (avg = 12.7333 min = 12.619 max = 26.698 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
6.412 6.424 6.408 6.397 6.43 (avg = 6.5367 min = 6.393 max = 19.716 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
3.385 4.298 9.7 6.535 3.385 (avg = 5.6079 min = 3.354 max = 21.603 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
5.389 5.795 10.818 3.823 3.824 (avg = 5.5810 min = 2.405 max = 19.755 )
Test name: threadlocal
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
200.217 207.335 200.241 207.342 200.23 (avg = 202.2424 min = 200.184 max = 245.369 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
100.208 100.199 100.211 103.781 100.215 (avg = 102.2238 min = 100.192 max = 129.505 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
62.101 67.629 62.087 52.021 55.766 (avg = 65.6361 min = 50.282 max = 167.433 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
40.672 74.301 34.434 41.549 28.119 (avg = 54.7701 min = 28.119 max = 94.424 )
Summary
A thread-local read is around 10-20x slower than a heap read. It also seems to scale well with the number of processors on this JVM implementation and these architectures.

@Pete is correct: test before you optimise.
I would be very surprised if constructing a MessageDigest has any serious overhead compared to actually using it.
Misusing ThreadLocal can be a source of leaks and dangling references that don't have a clear life cycle; generally, I never use ThreadLocal without a very clear plan of when a particular resource will be removed.

Here goes another test. The results show that ThreadLocal is a bit slower than a regular field, but on the same order: approximately 12% slower.
import java.util.HashMap;
import java.util.Map;

public class Test {

    private static final int N = 100000000;
    private static int fieldExecTime = 0;
    private static int threadLocalExecTime = 0;

    public static void main(String[] args) throws InterruptedException {
        int execs = 10;
        for (int i = 0; i < execs; i++) {
            new FieldExample().run(i);
            new ThreadLocalExample().run(i);
        }
        System.out.println("Field avg:" + (fieldExecTime / execs));
        System.out.println("ThreadLocal avg:" + (threadLocalExecTime / execs));
    }

    private static class FieldExample {
        private Map<String, String> map = new HashMap<String, String>();

        public void run(int z) {
            System.out.println(z + "-Running field sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                map.put(s, "a");
                map.remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            fieldExecTime += t;
            System.out.println(z + "-End field sample:" + t);
        }
    }

    private static class ThreadLocalExample {
        private ThreadLocal<Map<String, String>> myThreadLocal = new ThreadLocal<Map<String, String>>() {
            @Override
            protected Map<String, String> initialValue() {
                return new HashMap<String, String>();
            }
        };

        public void run(int z) {
            System.out.println(z + "-Running thread local sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                myThreadLocal.get().put(s, "a");
                myThreadLocal.get().remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            threadLocalExecTime += t;
            System.out.println(z + "-End thread local sample:" + t);
        }
    }
}
Output:
0-Running field sample
0-End field sample:6044
0-Running thread local sample
0-End thread local sample:6015
1-Running field sample
1-End field sample:5095
1-Running thread local sample
1-End thread local sample:5720
2-Running field sample
2-End field sample:4842
2-Running thread local sample
2-End thread local sample:5835
3-Running field sample
3-End field sample:4674
3-Running thread local sample
3-End thread local sample:5287
4-Running field sample
4-End field sample:4849
4-Running thread local sample
4-End thread local sample:5309
5-Running field sample
5-End field sample:4781
5-Running thread local sample
5-End thread local sample:5330
6-Running field sample
6-End field sample:5294
6-Running thread local sample
6-End thread local sample:5511
7-Running field sample
7-End field sample:5119
7-Running thread local sample
7-End thread local sample:5793
8-Running field sample
8-End field sample:4977
8-Running thread local sample
8-End thread local sample:6374
9-Running field sample
9-End field sample:4841
9-Running thread local sample
9-End thread local sample:5471
Field avg:5051
ThreadLocal avg:5664
Env:
openjdk version "1.8.0_131"
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
Ubuntu 16.04 LTS

Build it and measure it.
Also, you only need one ThreadLocal if you encapsulate your message-digesting behaviour into an object. If you need a local MessageDigest and a local byte[1000] for some purpose, create an object with a messageDigest and a byte[] field and put that object into the ThreadLocal, rather than both individually (as in the sketch below).
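A sketch of that suggestion; the class and field names are illustrative, and SHA-256 is just an example algorithm:
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Bundle all per-thread scratch state into one object so a single
// ThreadLocal lookup covers both the digest and the buffer.
final class DigestScratch {
    final MessageDigest digest;
    final byte[] buffer = new byte[1000];

    DigestScratch() {
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static final ThreadLocal<DigestScratch> SCRATCH =
        ThreadLocal.withInitial(DigestScratch::new);
}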

Related

Multithreaded vs Asynchronous programming in a single core

If the CPU really performs only one task at a time, then how is multithreading different from asynchronous programming (in terms of efficiency) on a single-processor system?
Let's say, for example, we have to count from 1 to Integer.MAX_VALUE. In the following program for my multicore machine, the two-thread version takes almost half the time of the single-thread version. What if we ran this on a single-core machine? And is there any way we could achieve the same result there?
class Demonstration {
    public static void main(String args[]) throws InterruptedException {
        SumUpExample.runTest();
    }
}

class SumUpExample {
    long startRange;
    long endRange;
    long counter = 0;
    static long MAX_NUM = Integer.MAX_VALUE;

    public SumUpExample(long startRange, long endRange) {
        this.startRange = startRange;
        this.endRange = endRange;
    }

    public void add() {
        for (long i = startRange; i <= endRange; i++) {
            counter += i;
        }
    }

    static public void twoThreads() throws InterruptedException {
        long start = System.currentTimeMillis();
        SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
        SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
        Thread t1 = new Thread(() -> {
            s1.add();
        });
        Thread t2 = new Thread(() -> {
            s2.add();
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        long finalCount = s1.counter + s2.counter;
        long end = System.currentTimeMillis();
        System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
    }

    static public void oneThread() {
        long start = System.currentTimeMillis();
        SumUpExample s = new SumUpExample(1, MAX_NUM);
        s.add();
        long end = System.currentTimeMillis();
        System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
    }

    public static void runTest() throws InterruptedException {
        oneThread();
        twoThreads();
    }
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540
For a purely CPU-bound operation you are correct. Most (99.9999%) of programs need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading benefits you primarily when two conditions are met: the operations are CPU-intensive, and they can be efficiently divided into independent subsets. Or you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.
In the following program for my multicore machine, the two-thread version takes almost half the time of the single-thread version.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
Your benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
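For illustration, a minimal JMH version of the add() benchmark might look like the sketch below (my own example, assuming the JMH dependency is on the classpath; the class name is illustrative). JMH handles warmup iterations and keeps the result "live" so the JIT cannot eliminate the loop, which the hand-rolled timing above misses.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class SumUpBenchmark {

    @Benchmark
    public long sumSingleThread() {
        long counter = 0;
        for (long i = 1; i <= Integer.MAX_VALUE; i++) {
            counter += i;
        }
        return counter; // returning the value prevents dead-code elimination
    }
}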
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to correct the flaws above.
You are running on a system where hardware hyper-threading1 is disabled2.
Then ... I would expect the two-thread version to take more than twice as long as the one-thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyperthreads. Each hyperthread has its own set of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled or disabled at boot time, e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?

Does test order affect performance result?

I wrote two blocks of time-measurement code. The printed result of t1 is always much bigger than t2.
Block 1 and block 2 do the exact same thing. If I write block 2 before block 1, then the printed result of t2 is much bigger than t1.
I wonder why this happens.
@Test
fun test() {
    val list = (1..100000).toList()
    // block 1
    var t1 = System.nanoTime()
    list.filter { it % 7 == 0 }
    t1 = System.nanoTime() - t1
    // block 2
    var t2 = System.nanoTime()
    list.filter { it % 7 == 0 }
    t2 = System.nanoTime() - t2
    // print
    println(t1)
    println(t2)
}
What you are experiencing is called warmup. The first requests made to a Kotlin (or other JVM-based) application are often substantially slower than the average response time. This warm-up period is caused by lazy class loading and just-in-time compilation.
There are a few ways to measure performance more reliably. One of them is to perform a manual warmup before the test itself executes. An even more reliable method is to use a specialized library such as JMH.
Example of manual warmup:
// warmup
for (i in 1..9999) {
    val list = (1..100000).toList()
    list.filter { it % 7 == 0 }
}
// rest of the test
As a side note, Kotlin has built-in functions which you can use instead of manually calculating the time difference: measureTimeMillis and measureNanoTime.
It would be used like this:
val time = measureNanoTime {
    list.filter { it % 7 == 0 }
}

Multi threaded matrix multiplication performance issue

I am using Java for multi-threaded matrix multiplication. I am practicing multi-threaded programming. The following is code that I took from another Stack Overflow post.
public class MatMulConcur {

    private final static int NUM_OF_THREAD = 1;
    private static Mat matC;

    public static Mat matmul(Mat matA, Mat matB) {
        matC = new Mat(matA.getNRows(), matB.getNColumns());
        return mul(matA, matB);
    }

    private static Mat mul(Mat matA, Mat matB) {
        int numRowForThread;
        int numRowA = matA.getNRows();
        int startRow = 0;

        Worker[] myWorker = new Worker[NUM_OF_THREAD];
        for (int j = 0; j < NUM_OF_THREAD; j++) {
            if (j < NUM_OF_THREAD - 1) {
                numRowForThread = (numRowA / NUM_OF_THREAD);
            } else {
                numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
            }
            myWorker[j] = new Worker(startRow, startRow + numRowForThread, matA, matB);
            myWorker[j].start();
            startRow += numRowForThread;
        }
        for (Worker worker : myWorker) {
            try {
                worker.join();
            } catch (InterruptedException e) {
            }
        }
        return matC;
    }

    private static class Worker extends Thread {
        private int startRow, stopRow;
        private Mat matA, matB;

        public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
            super();
            this.startRow = startRow;
            this.stopRow = stopRow;
            this.matA = matA;
            this.matB = matB;
        }

        @Override
        public void run() {
            for (int i = startRow; i < stopRow; i++) {
                for (int j = 0; j < matB.getNColumns(); j++) {
                    double sum = 0;
                    for (int k = 0; k < matA.getNColumns(); k++) {
                        sum += matA.get(i, k) * matB.get(k, j);
                    }
                    matC.set(i, j, sum);
                }
            }
        }
    }
}
I ran this program for 1, 10, 20, ..., 100 threads, but performance decreases instead. Here is the timing table:
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any Idea?
People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so1.
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1, 2 and 5 ... depending on how many cores are used, how big the CPU's memory caches are, how big the matrix is, and other factors.
1 - Indeed, if this was true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.
Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will decrease performance. It's like starting more and more applications, you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?

When playing around with multithreading, I could observe some unexpected but serious performance issues related to AtomicLong (and classes using it, such as java.util.Random), for which I currently have no explanation. However, I created a minimalistic example, which basically consists of two classes: a class "Container", which keeps a reference to a volatile variable, and a class "DemoThread", which operates on an instance of "Container" during thread execution. Note that the references to "Container" and the volatile long are private, and never shared between threads (I know that there's no need to use volatile here, it's just for demonstration purposes) - thus, multiple instances of "DemoThread" should run perfectly parallel on a multiprocessor machine, but for some reason, they do not (Complete example is at the bottom of this post).
private static class Container {
    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {
    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    public void run() {
        for (int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
    }
}
During my test, I repeatedly create 4 DemoThreads, which are then started and joined. The only difference in each loop is the time when "prepare()" gets called (which is obviously required for the thread to run, as it otherwise would result in a NullPointerException):
DemoThread[] threads = new DemoThread[numberOfThreads];
for (int j = 0; j < 100; j++) {
    boolean prepareAfterConstructor = j % 2 == 0;
    for (int i = 0; i < threads.length; i++) {
        threads[i] = new DemoThread();
        if (prepareAfterConstructor) threads[i].prepare();
    }
    for (int i = 0; i < threads.length; i++) {
        if (!prepareAfterConstructor) threads[i].prepare();
        threads[i].start();
    }
    joinThreads(threads);
}
For some reason, if prepare() is executed immediately before starting the thread, it takes about twice as long to finish, and even without the "volatile" keyword the performance differences were significant, at least on two of the machines and OSes I tested the code on. Here's a short summary:
Mac OS Summary:
Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02-334
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Mac OS X
OS Arch: x86_64
OS Version: 10.6.5
Processors/Cores: 8
With volatile keyword:
Final results:
31979 ms. when prepare() was called after instantiation.
96482 ms. when prepare() was called before execution.
Without volatile keyword:
Final results:
26009 ms. when prepare() was called after instantiation.
35196 ms. when prepare() was called before execution.
Windows Summary:
Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Windows 7
OS Arch: amd64
OS Version: 6.1
Processors/Cores: 4
With volatile keyword:
Final results:
18120 ms. when prepare() was called after instantiation.
36089 ms. when prepare() was called before execution.
Without volatile keyword:
Final results:
10115 ms. when prepare() was called after instantiation.
10039 ms. when prepare() was called before execution.
Linux Summary:
Java Version: 1.6.0_20
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.0-b09
VM Name: OpenJDK 64-Bit Server VM
OS Name: Linux
OS Arch: amd64
OS Version: 2.6.32-28-generic
Processors/Cores: 4
With volatile keyword:
Final results:
45848 ms. when prepare() was called after instantiation.
110754 ms. when prepare() was called before execution.
Without volatile keyword:
Final results:
37862 ms. when prepare() was called after instantiation.
39357 ms. when prepare() was called before execution.
Mac OS Details (volatile):
Test 1, 4 threads, setting variable in creation loop
Thread-2 completed after 653 ms.
Thread-3 completed after 653 ms.
Thread-4 completed after 653 ms.
Thread-5 completed after 653 ms.
Overall time: 654 ms.
Test 2, 4 threads, setting variable in start loop
Thread-7 completed after 1588 ms.
Thread-6 completed after 1589 ms.
Thread-8 completed after 1593 ms.
Thread-9 completed after 1593 ms.
Overall time: 1594 ms.
Test 3, 4 threads, setting variable in creation loop
Thread-10 completed after 648 ms.
Thread-12 completed after 648 ms.
Thread-13 completed after 648 ms.
Thread-11 completed after 648 ms.
Overall time: 648 ms.
Test 4, 4 threads, setting variable in start loop
Thread-17 completed after 1353 ms.
Thread-16 completed after 1957 ms.
Thread-14 completed after 2170 ms.
Thread-15 completed after 2169 ms.
Overall time: 2172 ms.
(and so on, sometimes one or two of the threads in the 'slow' loop finish as expected, but most times they don't).
The given example looks theoretical, as it is of no practical use - however, if you use a 'java.util.Random' instance instead of the 'Container' class and call, for instance, nextInt() multiple times, the same effects occur: the thread executes fast if you create the object in the thread's constructor, but slowly if you create it within the run() method. I believe that the performance issues described in Java Random Slowdowns on Mac OS more than a year ago are related to this effect, but I have no idea why it is as it is - besides that, I'm sure it shouldn't be like that, as it would mean that it's always dangerous to create a new object within the run method of a thread, unless you know that no volatile variables will get involved within the object graph. Profiling doesn't help, as the problem disappears in that case (same observation as in Java Random Slowdowns on Mac OS cont'd), and it also does not happen on a single-core PC - so I'd guess it's some kind of thread synchronization problem... however, the strange thing is that there's actually nothing to synchronize, as all variables are thread-local.
Really looking forward to any hints - and just in case you want to confirm or falsify the problem, see the test case below.
Thanks,
Stephan
public class UnexpectedPerformanceIssue {

    private static class Container {
        // Remove the volatile keyword, and the problem disappears (on Windows)
        // or gets smaller (on Mac OS)
        private volatile long value;

        public long getValue() {
            return value;
        }

        public final void set(long newValue) {
            value = newValue;
        }
    }

    private static class DemoThread extends Thread {
        private Container variable;

        public void prepare() {
            this.variable = new Container();
        }

        @Override
        public void run() {
            long start = System.nanoTime();
            for (int j = 0; j < 10000000; j++) {
                variable.set(variable.getValue() + System.nanoTime());
            }
            long end = System.nanoTime();
            System.out.println(this.getName() + " completed after "
                    + ((end - start) / 1000000) + " ms.");
        }
    }

    public static void main(String[] args) {
        System.out.println("Java Version: " + System.getProperty("java.version"));
        System.out.println("Java Class Version: " + System.getProperty("java.class.version"));
        System.out.println("VM Vendor: " + System.getProperty("java.vm.specification.vendor"));
        System.out.println("VM Version: " + System.getProperty("java.vm.version"));
        System.out.println("VM Name: " + System.getProperty("java.vm.name"));
        System.out.println("OS Name: " + System.getProperty("os.name"));
        System.out.println("OS Arch: " + System.getProperty("os.arch"));
        System.out.println("OS Version: " + System.getProperty("os.version"));
        System.out.println("Processors/Cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println();

        int numberOfThreads = 4;

        System.out.println("\nReference Test (single thread):");
        DemoThread t = new DemoThread();
        t.prepare();
        t.run();

        DemoThread[] threads = new DemoThread[numberOfThreads];
        long createTime = 0, startTime = 0;
        for (int j = 0; j < 100; j++) {
            boolean prepareAfterConstructor = j % 2 == 0;
            long overallStart = System.nanoTime();
            if (prepareAfterConstructor) {
                System.out.println("\nTest " + (j + 1) + ", " + numberOfThreads + " threads, setting variable in creation loop");
            } else {
                System.out.println("\nTest " + (j + 1) + ", " + numberOfThreads + " threads, setting variable in start loop");
            }
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new DemoThread();
                // Either call DemoThread.prepare() here (in odd loops)...
                if (prepareAfterConstructor) threads[i].prepare();
            }
            for (int i = 0; i < threads.length; i++) {
                // or here (in even loops). Should make no difference, but does!
                if (!prepareAfterConstructor) threads[i].prepare();
                threads[i].start();
            }
            joinThreads(threads);
            long overallEnd = System.nanoTime();
            long overallTime = (overallEnd - overallStart);
            if (prepareAfterConstructor) {
                createTime += overallTime;
            } else {
                startTime += overallTime;
            }
            System.out.println("Overall time: " + (overallTime) / 1000000 + " ms.");
        }
        System.out.println("Final results:");
        System.out.println(createTime / 1000000 + " ms. when prepare() was called after instantiation.");
        System.out.println(startTime / 1000000 + " ms. when prepare() was called before execution.");
    }

    private static void joinThreads(Thread[] threads) {
        for (int i = 0; i < threads.length; i++) {
            try {
                threads[i].join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
It's likely that two volatile variables a and b are too close to each other and fall in the same cache line; although CPU A only reads/writes variable a, and CPU B only reads/writes variable b, they are still coupled through the same cache line. Such problems are called false sharing.
In your example, we have two allocation schemes:
new Thread              new Thread
new Container           new Thread
new Thread       vs     ...
new Container           new Container
...                     new Container
In the first scheme, it's very unlikely that two volatile variables are close to each other. In the 2nd scheme, it's almost certainly the case.
CPU caches don't work with individual words; instead, they deal with cache lines. A cache line is a contiguous chunk of memory, say 64 neighboring bytes. Usually this is nice - if a CPU accesses a cell, it's very likely that it will access the neighboring cells too. Except in your example, that assumption is not only invalid, but detrimental.
Suppose a and b fall in the same cache line L. When CPU A updates a, it notifies other CPUs that L is dirty. Since B caches L too, because it's working on b, B must drop its cached L. So next time B needs to read b, it must reload L, which is costly.
If B must access main memory to reload, that is extremely costly; it's usually 100x slower.
Fortunately, A and B can communicate directly about the new values without going through main memory. Nevertheless, it takes extra time.
To verify this theory, you can stuff an extra 128 bytes in Container, so that the volatile variables of two Containers will not fall in the same cache line; then you should observe that the two schemes take about the same time to execute.
Lesson learned: CPUs usually assume that adjacent variables are related. If we want independent variables, we had better place them far away from each other. A sketch of that padding experiment follows.
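Here is a minimal sketch of that experiment, assuming a 64-byte cache line; the padding fields are arbitrary names of my own, and newer JDKs offer a @Contended annotation (behind a JVM flag) for the same purpose:
// Widen Container well past a typical 64-byte cache line so the value
// fields of two instances cannot share a line. Crude, but enough to test
// the false-sharing theory above. (The JVM may reorder fields, so verify
// the actual layout with a tool if the numbers don't change.)
private static class PaddedContainer {
    private volatile long value;
    // 16 * 8 = 128 bytes of padding following the hot field
    private long p01, p02, p03, p04, p05, p06, p07, p08,
                 p09, p10, p11, p12, p13, p14, p15, p16;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}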
Well, you're writing to a volatile variable, so I suspect that's forcing a memory barrier - undoing some optimization which can otherwise be achieved. The JVM doesn't know that that particular field isn't going to be observed on another thread.
EDIT: As noted, there are problems with the benchmark itself, such as printing while the timer is running. Also, it's usually a good idea to "warm up" the JIT before starting timing - otherwise you're measuring time which wouldn't be significant in a normal long-running process.
I am not an expert in the internals of Java, but I read your question and find it fascinating. If I had to guess, I think what you have discovered:
Does NOT have anything to do with the instantiation of the volatile property. However, from your data, where the property gets instantiated affects how expensive it is to read/write to it.
Does have to do with finding the reference to the volatile property at runtime. That is, I would be interested to see how the delay grows with more threads that loop more often. Is it the number of calls to the volatile property that causes the delay, the addition itself, or the writing of the new value?
I would have to guess that there is probably a JVM optimization that attempts to quickly instantiate the property and later, if there is time, to alter the property in memory so it is easier to read/write to. Maybe there is a (1) quick-to-create read/write path for volatile properties and a (2) hard-to-create but quick-to-call path, and the JVM begins with (1) and later moves the volatile property to (2).
Perhaps if you prepare() right before the run() method gets called, the JVM does not have enough free cycles to optimize from (1) to (2).
The way to test this answer would be to:
prepare(), sleep(), then run(), and see if you get the same delay. If the sleep is the only thing causing the optimization to take place, it could mean the JVM needs cycles to optimize the volatile property;
OR (a bit more risky) ...
prepare() and run() the threads, then later, in the middle of the loop, pause() or sleep() or somehow stop access to the property in a way that lets the JVM attempt to move it to (2).
I'd be interested to see what you find out. Interesting question.
Well, the big difference I see is in the order in which objects are allocated. When preparing after the constructor, your Container allocations are interleaved with your Thread allocations. When preparing prior to execution, your Threads are all allocated first, then your Containers are all allocated.
I don't know a whole lot about memory issues in multi-processor environments, but if I had to guess, maybe in the second case the Container allocations are more likely to be allocated in the same memory page, and perhaps the processors are slowed down due to contention for the same memory page?
[edit] Following this line of thought, I'd be interested to see what happens if you don't try to write back to the variable, and only read from it, in the Thread's run method. I would expect the timings difference to go away.
[edit2] See irreputable's answer; he explains it much better than I could

Why does this code not see any significant performance gain when I use multiple threads on a quadcore machine?

I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any positive integer, you eventually get to 1:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments - N (the number to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREAD, the size of the threadpool.
My code works fine, but I saw much less speedup than I expected - on the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU-bound, and each subtask (checking CH for a fixed-size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed?
If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS), that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quad-core Intel Q6600 at 2.4GHz, 4GB. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code: (I can't get the code to post; I think it's too long for SO requirements. The source is available on Google Docs.)
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MyRunnable implements Runnable {
    public int lower;
    public int upper;

    MyRunnable(int lower, int upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public void run() {
        for (int i = lower; i <= upper; i++) {
            Collatz.check(i);
        }
        System.out.println("(" + lower + "," + upper + ")");
    }
}

public class Collatz {

    public static boolean check(BigInteger X) {
        if (X.equals(BigInteger.ONE)) {
            return true;
        } else if (X.getLowestSetBit() == 1) {
            // odd
            BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
            return check(Y);
        } else {
            BigInteger Z = X.shiftRight(1); // fast divide by 2
            return check(Z);
        }
    }

    public static boolean check(int x) {
        BigInteger X = new BigInteger(new Integer(x).toString());
        return check(X);
    }

    static int N = 10000000;
    static int RANGESIZE = 1000000;
    static int NTHREADS = 4;

    static void parseArgs(String[] args) {
        if (args.length >= 1) {
            N = Integer.parseInt(args[0]);
        }
        if (args.length >= 2) {
            RANGESIZE = Integer.parseInt(args[1]);
        }
        if (args.length >= 3) {
            NTHREADS = Integer.parseInt(args[2]);
        }
    }

    public static void maintest(String[] args) {
        System.out.println("check(1): " + check(1));
        System.out.println("check(3): " + check(3));
        System.out.println("check(8): " + check(8));
        parseArgs(args);
    }

    public static void main(String[] args) {
        long lDateTime = new Date().getTime();
        parseArgs(args);
        List<Thread> threads = new ArrayList<Thread>();
        ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
        for (int i = 0; i < (N / RANGESIZE); i++) {
            Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
            executor.execute(worker);
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
        }
        System.out.println("Finished all threads");
        long fDateTime = new Date().getTime();
        System.out.println("time in milliseconds for checking to " + N + " is " +
                (fDateTime - lDateTime) +
                " (" + N / (fDateTime - lDateTime) + " per ms)");
    }
}
Busy waiting can be a problem:
while (!executor.isTerminated() ) {
}
You can use awaitTermination() instead:
while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}
You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. All strings are stored in a shared string pool and so it may become a bottleneck, unless java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.
As #axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another is available for the busy loop and so there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available, or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, which means CPUs are sharing them, and no L3 cache. On my Core i5, each CPU has a dedicated L2 cache (256K), and then there is a larger 8MB shared L3 cache. 256K more per-CPU cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21 (64-bit), here are some basic results:
10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms
Faster, certainly - but not completing in a quarter of the time like you would expect, or even half... (though it is roughly just a bit more than half, more on that in a moment). Note that in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
10000000 500000 1 (avg of 3 runs) 32677 ms
10000000 500000 4 (avg of 3 runs) 8825 ms
10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)
The 4-thread version takes 27% of the time the 1-thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4 threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly...halving it may help you get closer to at least 2x the speed since 4 threads will be kept busy for a longer portion of the time. I don't think the Q6600 will do much better at this particular task...whether it is cache or some other inherent architecture thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where x = # of threads indicated.
The only changes I made to your Java file were to turn one of the println's into a print, so there were fewer line breaks for my runs with 500000 per work unit and I could see more results in my console at once, and I ditched the busy-wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.
You could try using the submit function and then watching the Futures that are returned, checking them to see if the threads have finished.
isTerminated() doesn't return true until there has been a shutdown.
Future submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
public static void main(String[] args) {
    long lDateTime = new Date().getTime();
    parseArgs(args);
    List<Thread> threads = new ArrayList<Thread>();
    List<Future> futures = new ArrayList<Future>();
    ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
    for (int i = 0; i < (N / RANGESIZE); i++) {
        Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
        futures.add(executor.submit(worker));
    }
    boolean done = false;
    while (!done) {
        for (Future future : futures) {
            done = true;
            if (!future.isDone()) {
                done = false;
                break;
            }
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    System.out.println("Finished all threads");
    long fDateTime = new Date().getTime();
    System.out.println("time in milliseconds for checking to " + N + " is " +
            (fDateTime - lDateTime) +
            " (" + N / (fDateTime - lDateTime) + " per ms)");
    System.exit(0);
}
