Does any JVM implement blocking with spin-waiting? - java

In Java Concurrency in Practice, the authors write:
When locking is contended, the losing thread(s) must block. The JVM can implement blocking either via spin-waiting (repeatedly trying to acquire the lock until it succeeds) or by suspending the blocked thread through the operating system. Which is more efficient depends on the relationship between context switch overhead and the time until the lock becomes available; spin-waiting is preferred for short waits and suspension is preferable for long waits. Some JVMs choose between the two adaptively based on profiling data of past wait times, but most just suspend threads waiting for a lock.
When I read this I was quite surprised. Are there any known JVMs that implement blocking by always spin-waiting, or that sometimes spin-wait based on profiling of past wait times? I find it hard to believe.

Here is evidence that JRockit can use spinlocks - http://forums.oracle.com/forums/thread.jspa?threadID=816625&tstart=494
And if you search for "spin" in the JVM options listed here, you will see evidence of the use of, and support for, spinlocks in HotSpot JVMs.
And if you want a current example, look at "src/hotspot/share/runtime/mutex.cpp" in the OpenJDK Java 11 source tree.

What the authors have written is right, and it only makes sense. This is true for Linux as well. The rationale for using spin locks is that most resources are protected for only a fraction of a millisecond. Suspending the thread, pushing the contents of its registers onto the stack, and relinquishing the CPU is simply too much overhead and not worth it. So even though the thread just spins in a tight set of instructions, sometimes wasting time, it is still far more efficient than being swapped out.
That being said, with VM profiling the JVM can ideally make your processing more efficient. So, is there a particular case where you would always want to suspend? Or always spin-wait?
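To make the trade-off concrete, here is a minimal, purely illustrative spin lock built on AtomicBoolean (a sketch, not how HotSpot implements monitor blocking internally): the waiting thread never leaves the CPU, which is cheap when the lock is held for a very short time and wasteful otherwise.

import java.util.concurrent.atomic.AtomicBoolean;

// Toy spin lock, for illustration only.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Keep retrying the CAS; the thread stays on the CPU the whole time.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait(); // spin-wait hint, available since Java 9
        }
    }

    void unlock() {
        locked.set(false);
    }
}

A blocking lock such as ReentrantLock instead parks the waiting thread (via LockSupport), trading a context switch for not burning CPU while it waits.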

Related

If same thread can get scheduled on different CPUs, why does it not create problems?

As threads execute on multi-processor/multi-core machines, they can cause CPU caches to load data from RAM.
If threads are supposed to 'see' the same data, that is not guaranteed, because thread1 may cause an update in its CPU's cache (i.e. the CPU where it is currently executing) and this change will not be immediately visible to thread2.
To solve this problem, programming languages like Java provide constructs like volatile.
It is clear to me what the problem with multiple threads executing on different CPUs is.
I am pretty sure that a given thread is not bound to a particular CPU for its lifetime and can get scheduled to run on a different CPU. But it is not clear to me why that does not cause problems similar to the one with different threads on different CPUs?
After all this thread may have caused an update in one CPU's cache which is yet to be written to RAM. If this thread now gets scheduled on another CPU wouldn't it have access to stale data?
The only possibility I can think of, as of now, is that context switching involves writing all the data visible to the thread back to RAM, and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent the thread from seeing stale values). However, this looks problematic from a performance point of view, as time-slicing means threads are getting scheduled all the time.
Can some expert please advise what the real story is?
Caches on modern CPU's are always coherent. So if a store is performed by one CPU, then a subsequent load on a different CPU will see that store. In other words: the cache is the source of truth, memory is just an overflow bucket and could be completely out of sync with reality. So since the caches are coherent, it doesn't matter on which CPU a thread will run.
Also, on a single-CPU system the lack of volatile can cause problems due to compiler optimizations. A compiler could, for example, hoist a variable out of a loop, and then a write made by one thread will never be seen by another thread, regardless of whether both run on the same CPU.
I would suggest not thinking in term of hardware. If you use Java, make sure you understand the Java Memory Model (JMM). This is an abstract model that prevents thinking in terms of hardware since the JMM needs to run independent of the hardware.
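As a small sketch of that hoisting problem (the class is made up for illustration), the classic symptom is a stop flag that is never observed; declaring it volatile forbids both the hoisting optimization and reliance on a stale cached value:

class Worker implements Runnable {
    // Without volatile, the JIT may hoist the read of 'running' out of the
    // loop, and the loop can then spin forever after stop() is called.
    private volatile boolean running = true;

    public void run() {
        while (running) {
            // do work
        }
    }

    public void stop() {
        running = false; // guaranteed to become visible to the worker thread
    }
}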
On a single thread, there is a happens-before relationship between actions that take place, regardless of how the scheduling is done. This is enforced by the implementation of the JVM as part of the Java memory model contract promised in the Java Language Specification:
Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.
If we have two actions x and y, we write hb(x, y) to indicate that x happens-before y.
If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).
How exactly this is achieved by the operating system is implementation dependent.
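A trivial sketch of that rule: within a single thread, program order alone establishes happens-before, no matter how often the OS migrates the thread between CPUs.

// Inside any single thread:
int x = 0;
int y = 0;
x = 5;      // action a
y = x + 1;  // action b; hb(a, b): same thread, a comes before b in program order
// y is always 6 here, even if the OS rescheduled this thread onto a
// different CPU between the two statements.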
it is not clear to me why that does not cause problems similar to the one with different threads on different CPUs? After all this thread may have caused an update in one CPU's cache which is yet to be written to RAM. If this thread now gets scheduled on another CPU wouldn't it have access to stale data?
Yes, it may have access to stale data, but it is more likely to have data in its cache that is simply unhelpful – not relevant to the memory that it needs. First off, the permissions from the OS (if written correctly) won't let one program see the data from another – yes, there are many stories about hardware vulnerabilities in the news these days, so I am talking about how it should work. The cache will be cleared if another process gets swapped onto a CPU.
Whether or not the cache memory is stale or not is a function of the timing of the cache coherence systems of the architectures or whether or not memory fences are crossed.
context switching of threads involves writing all the data visible to the thread back to RAM and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent thread seeing stale values).
That's pretty close to what happens, although the cache is not refreshed when the thread gets scheduled. When a thread is context-switched off the CPU, all dirty pages of memory are flushed to RAM. When a thread is swapped onto a CPU, the cache is either flushed (if it belonged to another process) or contains memory that may not be helpful to the incoming thread. This causes a much higher miss ratio on the initial memory accesses, meaning that a thread spends longer accessing memory until the lines it needs are loaded into the cache.
However this looks problematic from performance point of view as time-slicing means threads are getting scheduled all the time.
Yes, there is a performance hit. This highlights why it is so important to properly size your thread pools. If you have too many threads running CPU-intensive tasks, you can cause a loss in performance because of the context switches. If the threads are waiting for IO then increasing the number of threads is a must, but if you are just calculating something, using fewer threads can result in higher throughput, because each thread stays on a processor longer and the ratio of cache hits to misses goes up.
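As a rough illustration of that sizing advice (the IO multiplier below is an arbitrary assumption, not a rule):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolSizing {
    static ExecutorService cpuBoundPool() {
        // CPU-bound work: roughly one thread per core keeps each thread on a
        // CPU longer and improves the cache hit ratio.
        return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    static ExecutorService ioBoundPool() {
        // IO-bound work: threads spend most of their time waiting, so a larger
        // pool (the factor of 4 is an arbitrary assumption) keeps the CPUs busy.
        return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 4);
    }
}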
For those who might not go through all the comments on the different answers, here is a simplified summary of what I have modelled in my head (please feel free to comment if any point is not correct. I will edit this post)
1) http://tutorials.jenkov.com/java-concurrency/volatile.html is not accurate and gives rise to questions like this. CPU caches are always coherent. If CPU 1 has written to memory address X in its cache, and CPU 2 later reads the same memory address from its own cache, then CPU 2 will read what was written by CPU 1. No special instruction is required to enforce this.
2) However, modern CPUs also have store buffers. These are used to accumulate writes in a buffer in order to improve performance (so that the writes can be committed to the cache in their own time, freeing the CPU from waiting for the cache coherence protocol to finish).
3) Whatever is in the store buffer of a CPU is not yet visible to other CPUs.
4) In addition, in order to improve performance, CPUs and compilers are free to re-order instructions, as long as doing so does not change the outcome of the computation (from a single thread's point of view).
5) Also, some compiler optimizations may move a variable completely into CPU registers for a routine, thereby 'hiding' it from shared memory and hence making writes to that variable invisible to other threads.
Points 3, 4 and 5 above are the reason why Java exposes keywords like volatile. When you use volatile, the JVM itself does not re-order instructions if that would break the 'happens-before' guarantee. The JVM also asks the CPU not to re-order, by using memory barrier/fence instructions, and it does not apply any optimization that would break the 'happens-before' guarantee. Overall, if a write to a volatile field has already happened, any read of it thereafter by another thread is guaranteed to see the correct value, not only for that field but also for all fields which were visible to the first thread when it wrote the volatile field.
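A small sketch of that last guarantee (class and field names are made up): the volatile write to 'published' makes the earlier plain write to 'data' visible to any thread that subsequently reads published as true.

class Publisher {
    private int data;                   // plain field
    private volatile boolean published; // volatile field

    void write() {            // called by thread 1
        data = 42;            // ordinary write
        published = true;     // volatile write: everything before it becomes visible
    }

    void read() {             // called by thread 2
        if (published) {      // volatile read
            System.out.println(data); // guaranteed to print 42, not a stale value
        }
    }
}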
How does the above relate to this question about a single thread using different CPUs in its lifetime?
1) If the thread, while executing on a CPU, has already written to its cache, there is nothing more to consider here. Even if the thread later uses another CPU, it will be able to see its own writes due to cache coherency.
2) If the thread's writes are waiting in the store buffer and it gets moved off the CPU, context switching ensures that the thread's writes from the store buffer get committed to the cache. After that, it is the same as point 1.
3) Any state which is only in CPU registers gets backed up and restored anyway as part of context switching.
Due to the above points, a single thread does not face any problem when it executes on different CPUs during its lifetime.

volatile and synchronized on single core cpu (example - pentium pro)

I have read about, and know in detail, the implications of the Java volatile and synchronized keywords at the CPU level on SMP-architecture CPUs.
A great paper on that subject is here:
http://irl.cs.ucla.edu/~yingdi/web/paperreading/whymb.2010.06.07c.pdf
Now, leave SMP CPUs aside for this question. My question is: how do the volatile and synchronized keywords work on older single-core CPUs? For example a Pentium I/Pro/II/III or an early Pentium IV.
I want to know specifically:
1) Are the L1/L2 caches not used to read memory addresses, with all reads and writes performed directly to main memory? If yes, why? (Since there is only a single cache copy and no need for coherency protocols, why can't the cache be used directly by two threads that are time-slicing the single-core CPU?) I am asking this after reading dozens of internet forums about how volatile reads and writes go to/from the "master copy in main memory".
2) Apart from taking a lock on the this or specified object which is more of a Java platform thingy, what other effects does the synchronized keyword have on single core CPUs (compilers, assembly, execution, cache) ?
3) With a non-superscalar CPU (Pentium I), instructions are not re-ordered. So if that is the case, is the volatile keyword required at all when running on a Pentium I? (Atomicity, visibility and ordering would be a "no problemo", right, because there is only one cache, one core working on that cache, and no re-ordering.)
1) Are the L1/L2 caches not used to read memory addresses, with all reads and writes performed directly to main memory?
No. The caches are still enabled. That's not related to SMP.
2) Apart from taking a lock on the this or specified object which is more of a Java platform thingy, what other effects does the synchronized keyword have on single core CPUs (compilers, assembly, execution, cache) ?
3) Does anything change with respect to a superscalar/non superscalar architecture (out-of-order) processor w.r.t these two keywords?
Gosh, do you have to ask this question about Java? Remember that all things eventually boil down to good ol' fashioned machine instructions. I'm not intimately familiar with the guts of Java synchronization, but as I understand it, synchronized is just syntactic sugar for your typical monitor-style synchronization mechanism. Multiple threads are not allowed in a critical section simultaneously. Instead of simply spinning on a spinlock, the scheduler is leveraged: the waiting threads are put to sleep, and woken back up when the lock can be taken again.
The thing to remember is that even on a single-core, non-SMP system, you still have to worry about OS preemption of threads! These threads can be scheduled on and off of the CPU whenever the OS wants to. This is the purpose for the locks, of course.
Again, this question is much better asked in the context of assembly, or even C (whose compiled result can often be directly inferred), as opposed to Java, which has to deal with the VM, JITted code, etc.
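To make the preemption point concrete, here is a minimal sketch (the class is hypothetical): even on a single core, the read-modify-write below can be interrupted by the scheduler between the read and the write-back, so two threads can lose updates unless the methods are synchronized.

class Counter {
    private int count;

    // Without 'synchronized', a thread can be preempted after reading 'count'
    // but before writing it back, so another thread's increment is silently
    // overwritten, even with only one core.
    synchronized void increment() {
        count = count + 1;
    }

    synchronized int get() {
        return count;
    }
}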

What does Thread Affinity mean?

Somewhere I have heard about Thread Affinity and the Thread Affinity Executor. But I cannot find a proper reference for it, at least in Java. Can someone please explain to me what it is all about?
There are two issues. First, it’s preferable that threads have an affinity to a certain CPU (core) to make the most of their CPU-local caches. This must be handled by the operating system. This CPU affinity for threads is often also called “thread affinity”. In case of Java, there is no standard API to get control over this. But there are 3rd party libraries, as mentioned by other answers.
Second, in Java there is the observation that in typical programs objects are thread-affine, i.e. typically used by only one thread most of the time. So it's the task of the JVM's optimizer to place objects affine to one thread close to each other in memory (so they fit into one CPU's cache), but to place objects affine to different threads not too close to each other, to avoid them sharing a cache line, as otherwise two CPUs/cores would have to synchronize them too often.
The ideal situation is that a CPU can work on some objects independently to another CPU working on other objects placed in an unrelated memory region.
Practical examples of optimizations considering Thread Affinity of Java objects are
Thread-Local Allocation Buffers (TLABs)
With TLABs, each object starts its lifetime in a memory region dedicated to the thread which created it. According to the main hypothesis behind generational garbage collectors ("the majority of all objects will die young"), most objects will spend their entire lifetime in such a thread-local buffer.
Biased Locking
With Biased Locking, JVMs will perform locking operations with the optimistic assumption that the object will be locked by the same thread only, switching to a more expensive locking implementation only when this assumption does not hold.
@Contended
To address the other end, fields which are known to be accessed by multiple threads, HotSpot/OpenJDK has an annotation (currently not part of a public API) to mark them, directing the JVM to move that data away from the other, potentially unshared data.
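As a rough sketch of how it looks (field names are made up; in JDK 8 the annotation is sun.misc.Contended, in JDK 9+ it lives in jdk.internal.vm.annotation, and the JVM only honours it outside the JDK classes when run with -XX:-RestrictContended):

import sun.misc.Contended; // jdk.internal.vm.annotation.Contended in JDK 9+

class SharedCounters {
    // Written by different threads; padding each field onto its own cache
    // line avoids false sharing between them.
    @Contended
    volatile long writerSideCount;

    @Contended
    volatile long readerSideCount;
}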
Let me try explaining it. With the rise of multicore processors, message passing between threads and thread pooling, scheduling has become a more costly affair. To understand why this has become much heavier than before, we need the concept of "mechanical sympathy". For details you can go through a blog on it. But in crude words, when threads are distributed across different cores of a processor and they try to exchange messages, the cache miss probability is high. Now coming to your specific question: thread affinity means being able to assign specific threads to a particular processor/core. Here is one of the libraries for Java that can be used for it.
The Java Thread Affinity version 1.4 library attempts to get the best of both worlds, by allowing you to reserve a logical thread for critical threads, and reserve a whole core for the most performance-sensitive threads. Less critical threads will still run with the benefits of hyper-threading. E.g. the following code snippet:
AffinityLock al = AffinityLock.acquireLock();
try {
    // find a cpu on a different socket, otherwise a different core.
    AffinityLock readerLock = al.acquireLock(DIFFERENT_SOCKET, DIFFERENT_CORE);
    new Thread(new SleepRunnable(readerLock, false), "reader").start();

    // find a cpu on the same core, or the same socket, or any free cpu.
    AffinityLock writerLock = readerLock.acquireLock(SAME_CORE, SAME_SOCKET, ANY);
    new Thread(new SleepRunnable(writerLock, false), "writer").start();

    Thread.sleep(200);
} finally {
    al.release();
}

// allocate a whole core to the engine so it doesn't have to compete for resources.
al = AffinityLock.acquireCore(false);
new Thread(new SleepRunnable(al, true), "engine").start();
Thread.sleep(200);

System.out.println("\nThe assignment of CPUs is\n" + AffinityLock.dumpLocks());
Thread affinity (or process affinity) describes the processor cores on which a thread/process is allowed to run. Normally, this setting covers all (logical) CPUs in your system, and there's hardly a reason for changing this, because the operating system then has the best possibilities to schedule your tasks among the available processors.
See e.g. http://msdn.microsoft.com/en-us/library/windows/desktop/ms683213(v=vs.85).aspx for how this works in Windows. I don't know whether Java offers an API to set these.

Biased locking in java

I keep reading about how biased locking, using the flag -XX:+UseBiasedLocking, can improve the performance of uncontended synchronization. I couldn't find a reference to what it does and how it improves performance.
Can anyone explain to me what exactly it is, or maybe point me to some links/resources that explain it?
Essentially, if your objects are locked only by one thread, the JVM can make an optimization and "bias" that object toward that thread in such a way that subsequent atomic operations on the object incur no synchronization cost. I suppose this is typically geared toward overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
It is on by default in Java 6.
-XX:+UseBiasedLocking
Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.
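A minimal sketch of the kind of code this helps (purely illustrative): a single thread repeatedly acquiring the same uncontended monitor, for example through a legacy synchronized collection.

import java.util.Vector;

class BiasedLockingCandidate {
    // Only one thread ever touches this Vector, yet every add() acquires its
    // monitor. With biased locking the monitor is biased toward this thread
    // after the first acquisition, so later acquisitions avoid the atomic operation.
    static void fillValues() {
        Vector<Integer> values = new Vector<>();
        for (int i = 0; i < 1_000_000; i++) {
            values.add(i);
        }
    }
}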
Does this not answer your questions?
http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.5
Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.
Though I think you'll find it's on by default in 1.6. Use the PrintFlagsFinal diagnostic option to see what the effective flags are. Make sure you specify -server if you're investigating for a server application because the flags can differ:
http://www.jroller.com/ethdsy/entry/print_all_jvm_flags
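For example, on a Unix-like shell (the exact set of flags reported depends on the JVM version and vendor):

java -server -XX:+PrintFlagsFinal -version | grep BiasedLocking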
I've been wondering about biased locks myself.
However, it seems that Java's biased locks are slower on Intel's Nehalem processors than normal locks, and presumably on the two generations of processors since Nehalem. See http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
and here http://www.azulsystems.com/blog/cliff/2011-11-16-a-short-conversation-on-biased-locking
Also more information here https://blogs.oracle.com/dave/entry/biased_locking_in_hotspot
I've been hoping that there is some relatively cheap way to revoke a biased lock on Intel, but I'm beginning to believe that isn't possible. The articles I've seen on how it's done rely on either:
using the OS to stop the thread
sending a signal, i.e. running code in the other thread
having safe points that are guaranteed to run fairly often in the other thread and waiting for one to be executed (which is what Java does)
having similar safe points that are a call to a return, and the other thread MODIFIES THE CODE to a breakpoint...
Worth mentioning that biased locking is disabled by default from JDK 15 onwards.
JEP 374: Disable and Deprecate Biased Locking
The performance gains seen in the past are far less evident today. Many applications that benefited from biased locking are older, legacy applications that use the early Java collection APIs, which synchronize on every access (e.g., Hashtable and Vector). Newer applications generally use the non-synchronized collections (e.g., HashMap and ArrayList), introduced in Java 1.2 for single-threaded scenarios, or the even more-performant concurrent data structures, introduced in Java 5, for multi-threaded scenarios.
Further
Biased locking introduced a lot of complex code into the synchronization subsystem and is invasive to other HotSpot components as well. This complexity is a barrier to understanding various parts of the code and an impediment to making significant design changes within the synchronization subsystem. To that end we would like to disable, deprecate, and eventually remove support for biased locking.
And ya, no more System.identityHashCode(o) magic ;)
Two papers here:
https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/ccd39237cd4dc109d91786762fba41f0/qrl_oplocks_biasedlocking.pdf
https://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
A web page too:
https://blogs.oracle.com/dave/biased-locking-in-hotspot
JVM HotSpot source code:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/oops/markOop.hpp

Java Performance Degradation on removing locks

I am testing my java application for any performance bottlenecks. The application uses concurrent.jar for locking purposes.
I have a computation-heavy call which invokes lock and unlock functions for its operations.
On removing the lock/unlock mechanism from the code, I have seen performance degrade severalfold, contrary to my expectations. Among other things, I observed an increase in CPU consumption, which made me feel that the program was running faster, but actually it was not.
Q1. What can be the reason for this degradation in performance when we remove locks?
This can be quite a usual finding, depending on what you're doing and what you're using as an alternative to Locks.
Essentially, what happens is that constructs such as ReentrantLock have some logic built into them that knows "when to back off" when they realistically can't acquire the lock. This reduces the amount of CPU that's burnt just in the logic of repeatedly trying to acquire the lock, which can happen if you use simpler locking constructs.
As an example, have a look at the graph I've hurriedly put up here. It shows the throughput of threads continually accessing random elements of an array, using different constructs as the locking mechanism. Along the X axis is the number of threads; the Y axis is throughput. The blue line is a ReentrantLock; the yellow, green and brown lines use variants of a spinlock. Notice how with low numbers of threads, the spinlock gives higher throughput as you might expect, but as the number of threads ramps up, the back-off logic of ReentrantLock kicks in and it ends up doing better, while under high contention the spinlocks just sit burning CPU.
By the way, this was really a trial run done on a dual-processor machine; I also ran it in the Amazon cloud (effectively an 8-way Xeon) but I've, ahem... mislaid the file. I'll either find it or run the experiment again soon and post an update. But you get an essentially similar pattern as I recall.
Update: whether it's in locking code or not, a phenomenon that can happen on some multiprocessor architectures is that as the multiple processors do a high volume of memory accesses, you can end up flooding the memory bus, and in effect the processors slow each other down. (It's a bit like Ethernet: the more machines you add to the network, the more chance of collisions as they send data.)
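For anyone who wants to reproduce this kind of measurement, here is a rough, hypothetical harness in the spirit of the experiment above (thread count and duration are arbitrary choices); swapping the ReentrantLock for a spin-lock variant lets you compare the two curves yourself.

import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

public class LockThroughput {
    public static void main(String[] args) throws InterruptedException {
        int threadCount = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        ReentrantLock lock = new ReentrantLock();
        LongAdder operations = new LongAdder();
        long deadline = System.nanoTime() + 2_000_000_000L; // run for ~2 seconds

        Runnable worker = () -> {
            while (System.nanoTime() < deadline) {
                lock.lock();
                try {
                    operations.increment(); // stand-in for the protected work
                } finally {
                    lock.unlock();
                }
            }
        };

        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(threadCount + " threads: " + operations.sum() + " ops");
    }
}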
Profile it. Anything else here will be just a guess and an uninformed one at that.
Using a profiler like YourKit will not only tell you which methods are "hot spots" in terms of CPU time but it will also tell you where threads are spending most of their time BLOCKED or WAITING
Is it still performing correctly? For instance, there was a case in an app server where an unsynchronised HashMap caused an occasional infinite loop. It is not too difficult to see how work could simply be repeated.
The most likely culprit for seeing performance decline and CPU usage increase when you remove shared memory protection is a race condition. Two or more threads could be continually flipping a state flag back and forth on a shared object.
More description of the purpose of your application would help with diagnosis.
