Quote from the book Java Concurrency in Practice:
The performance cost of synchronization comes from several sources.
The visibility guarantees provided by synchronized and volatile may
entail using special instructions called memory barriers that can
flush or invalidate caches, flush hardware write buffers, and stall
execution pipelines. Memory barriers may also have indirect
performance consequences because they inhibit other compiler
optimizations; most operations cannot be reordered with memory
barriers. When assessing the performance impact of synchronization, it
is important to distinguish between contended and uncontended
synchronization. The synchronized mechanism is optimized for the
uncontended case (volatile is always uncontended), and at this
writing, the performance cost of a "fast-path" uncontended
synchronization ranges from 20 to 250 clock cycles for most systems.
Can you explain this more clearly?
What if I have a huge number of threads that read a volatile variable?
Can you provide a definition of contention?
Is there a tool to measure contention? In what units is it measured?
Can you explain this more clearly?
That is one dense paragraph that touches a lot of topics. Which topic or topics, specifically, are you asking to have clarified? Your question is too broad to answer satisfactorily. Sorry.
Now, if your question is specific to uncontended synchronization, it means that threads within a JVM do not have to block, get unblocked/notified, and then go back to a blocked state.
Under the hood, the JVM uses hardware-specific memory barriers that ensure that
a volatile field is always read from and written to main memory, not from the CPU/core cache, and
your thread will not block/unblock to access it.
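For example, here is a minimal sketch of that visibility guarantee (class and field names are just illustrative):

    import java.util.concurrent.TimeUnit;

    public class VolatileFlag {
        // Without volatile, the reader below might loop forever,
        // never observing the write made by the main thread.
        private static volatile boolean done = false;

        public static void main(String[] args) throws InterruptedException {
            Thread reader = new Thread(() -> {
                while (!done) {
                    // Busy-wait: each volatile read is cheap and never blocks.
                }
                System.out.println("Reader observed done = true");
            });
            reader.start();

            TimeUnit.MILLISECONDS.sleep(100);
            done = true; // volatile write: guaranteed visible to the reader
            reader.join();
        }
    }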
There is no contention. When you use a synchronized block, OTOH, all your threads are in a blocked state except one: the one reading whatever data is being protected by the synchronized block.
Let's call that thread, the one accessing the synchronized data, thread A.
Now, here is the kicker: when thread A is done with the data and exits the synchronized block, this causes the JVM to wake up all the other threads that are/were waiting for thread A to exit the synchronized block.
They all wake up (and that is expensive CPU/memory wise). And they all race trying to get a hold of the synchronization block.
Imagine a whole bunch of people trying to exit a crowded room through a tiny door. Yep, like that; that's how threads act when they try to grab a synchronization lock.
But only one gets it and gets in. All the others go back to sleep, kind of, in what is called a blocked state. This is also expensive, resource wise.
So every time one of the threads exits a synchronized block, all the other threads go crazy (best mental image I can think of) trying to get access to it; one gets it, and all the others go back to a blocked state.
That's what makes synchronized blocks expensive. Now, here is the caveat: it used to be very expensive, pre-JDK 1.4. JDK 1.4 started seeing some improvements (released in 2002).
Then Java 5 introduced even greater improvements in 2004, which made synchronized blocks less expensive.
It is important to keep such things in mind. There is a lot of outdated information out there.
What if I have a huge number of threads that read a volatile variable?
It wouldn't matter that much in terms of correctness. A volatile field will always show a consistent value, regardless of the number of threads.
Now, if you have a very large number of threads, performance can suffer because of context switches, memory utilization, etc. (and not necessarily or primarily because of accessing a volatile field).
Can you provide a definition of contention?
Please don't take it the wrong way, but if you are asking that question, I'm afraid you are not fully prepared to use a book like the one you are reading.
You will need a more basic introduction to concurrency, and contention specifically.
https://en.wikipedia.org/wiki/Resource_contention
Best regards.
Related
I found that almost all the high-level synchronization abstractions (like Semaphore, CountDownLatch, and Exchanger from java.util.concurrent) and the concurrent collections use methods from Unsafe (like compareAndSwapInt) to define critical sections. At the same time, I expected that a synchronized block or method would be used for this purpose.
Could you explain whether the Unsafe methods (I mean only the methods that atomically set a value) are more efficient than synchronized, and why that is so?
Using synchronized is more efficient if you expect to be waiting a long time (e.g. milliseconds), as the thread can fall asleep and release the CPU to do other work.
Using compareAndSwap is more efficient if you expect the operation to happen quite quickly. This is because it is a simple machine-code instruction and can take as little as 10 ns. However, if a resource is heavily contended, this instruction must busy-wait, and if it cannot obtain the value it needs, it can consume the CPU busily until it does.
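In practice you reach compareAndSwapInt through the java.util.concurrent.atomic classes rather than Unsafe itself. A minimal sketch of the CAS retry-loop idiom described above (essentially what AtomicInteger.incrementAndGet does internally):

    import java.util.concurrent.atomic.AtomicInteger;

    public class CasCounter {
        private final AtomicInteger count = new AtomicInteger();

        public int increment() {
            while (true) {
                int current = count.get();
                int next = current + 1;
                // One machine-level CAS: succeeds only if no other thread
                // changed the value since we read it.
                if (count.compareAndSet(current, next)) {
                    return next;
                }
                // Lost the race: re-read and retry. Under heavy contention
                // this loop is the "busy wait" mentioned above.
            }
        }
    }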
If you use off-heap memory, you can control the layout of the data being shared and avoid false sharing (where the same cache line is being updated by more than one CPU). This is important when you have multiple values you might want to update independently, e.g. in a ring buffer.
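You don't strictly need off-heap memory to see the idea; manual on-heap padding illustrates the same point. A rough sketch (the JVM is free to reorder fields, so this is a heuristic, not a guarantee; JDK 8 also added an internal @Contended annotation for this purpose):

    // Two counters updated by different threads. Without the padding they
    // could land on the same 64-byte cache line, so each write by one core
    // would invalidate the line in the other core's cache (false sharing).
    public class PaddedCounters {
        public volatile long counterA;
        // Padding intended to push counterB onto a different cache line.
        long p1, p2, p3, p4, p5, p6, p7;
        public volatile long counterB;
    }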
Note that the internal implementation of a typical JVM (e.g., HotSpot) will often use the compare-and-swap hardware instruction as part of the synchronized implementation if such an instruction is available (e.g., x86), while the other common alternative is LL/SC (e.g., POWER, ARM). A typical strategy is for the fast path to use compare-and-swap (or equivalent) to attempt to obtain the lock if it is free, followed possibly by a short spin loop, and finally, if that fails, falling back to an OS-level blocking primitive (e.g., futex, Events). The details go far beyond this, include techniques such as biased locking, and are ultimately implementation dependent.
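To make that strategy concrete, here is a deliberately simplified toy lock, not HotSpot's actual implementation (which adds lock inflation, biasing, proper wait queues, etc.): a CAS fast path, a short spin, then falling back to parking the thread:

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    public class ToyLock {
        private final AtomicBoolean locked = new AtomicBoolean(false);
        private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

        public void lock() {
            // 1. Fast path: a single CAS when the lock is free (the uncontended case).
            if (locked.compareAndSet(false, true)) return;
            // 2. Short spin: the owner often releases within a few cycles.
            for (int i = 0; i < 64; i++) {
                if (locked.compareAndSet(false, true)) return;
                Thread.onSpinWait(); // Java 9+; a hint to the CPU
            }
            // 3. Slow path: enqueue and block, analogous to futex/Events.
            Thread self = Thread.currentThread();
            waiters.add(self);
            while (!locked.compareAndSet(false, true)) {
                LockSupport.park(); // may return spuriously; the loop re-checks
            }
            waiters.remove(self);
        }

        public void unlock() {
            locked.set(false);
            Thread next = waiters.peek();
            if (next != null) LockSupport.unpark(next);
        }
    }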
The answer above is not fulfilling. The idea is: a mutex (synchronization) is not necessary because a single operation does all the work (everything that would otherwise be done inside the mutex), and that single operation cannot be interrupted. But that is only half the answer, because in a multicore system another CPU can write to the same memory location. For that reason, the compareAndSwap machine-code instruction does not read and write only the cache; it reads and writes real memory, which needs a little more access time to the RAM. The compareAndSwap operation checks whether the RAM content has changed in comparison to the previously read value, and only then stores the new value. If I have more time, I will write an example here.
In effect, a compareAndSwap access is always faster than a lock and unlock. But it can only be used if exactly one memory location has to be changed in the access. If more than one memory location must be changed together (and must always remain consistent together), compareAndSwap CANNOT be used; only synchronized can be used. As the answer above notes, compareAndSwap is often used to implement the synchronized operation. That is correct, because acquiring the mutex (entering synchronized) and releasing it (leaving synchronized) each need exactly one atomic instruction inside the task scheduler. Hence atomic access is the basis of it all. But between synchronized { ... }, the scheduler knows that a thread switch is guarded against.
This approach is valid not only for Java; it is also important and usable in C/C++ (and maybe other languages).
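A small sketch of the more-than-one-location case: moving money between two balances must keep both consistent, so a single CAS on either field is not enough and mutual exclusion is needed (names are illustrative):

    public class Accounts {
        private long balanceA = 100;
        private long balanceB = 0;

        // Two memory locations must change together atomically,
        // so we need a lock, not a single CAS.
        public synchronized void moveAtoB(long amount) {
            balanceA -= amount;
            balanceB += amount;
        }

        public synchronized long total() {
            return balanceA + balanceB; // always sees a consistent pair
        }
    }

(One escape hatch is to pack both values into a single immutable holder behind an AtomicReference, but then you are back to updating exactly one location.)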
Other than a (minor?) performance hit, why would I use a regular List instead of a Collections.synchronizedList?
The project I'm working on is under 10k entries, so I don't care, but if someone (maybe me) chooses to sub-class this, I need to document the behavior.
Besides performance (over 100k entries), why would I not synchronize?
That is, what penalty do I incur for using a synchronizedList? How bad is it? For my current application, it's not an issue. If it is a cheap addition, why not?
Other than a (minor?) performance hit ...
In fact, if the list is shared between threads, the performance hit of using a simple synchronized list (versus something more appropriate) could be large, depending on what you are doing. The synchronized operations could become a concurrency bottleneck, reducing the application to the performance of a single core.
Simple "black and white" rules are not sufficient when designing a multi-threaded application ... or a reusable library that needs to be thread-safe and also performant in multi-threaded applications.
That is, what penalty do I incur for using a synchronizedList? How bad is it? For my current application, it's not an issue. If it is a cheap addition, why not?
The synchronized list class uses primitive object locking (mutexes).
If the lock is uncontended, this is cheap; maybe 5 or 10 instructions each time you acquire and release the lock. However, the overhead may depend on whether there was previous contention on the lock. (Some locking schemes cause an object lock to be "inflated" the first time that contention occurs ...)
If the lock is contended, then it is more expensive because this will typically involve the blocked thread being de-scheduled and rescheduled ... and context switch overheads. There is another JVM-level implementation approach involving "spin locking", but that entails the blocked thread testing the lock object in a tight loop.
If the lock is held for a long time (e.g. in list.contains on a long list), that typically increases the probability of contention.
When you don't need the synchronization, or when you don't want to delude yourself that a synchronized list is thread-safe even when iterating, which it isn't.
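The iteration point deserves spelling out. Per the Collections.synchronizedList javadoc, you must manually synchronize on the list while iterating, or risk a ConcurrentModificationException if another thread modifies it mid-iteration:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SyncListIteration {
        public static void main(String[] args) {
            List<String> list = Collections.synchronizedList(new ArrayList<>());
            list.add("a");
            list.add("b");

            // Each individual call (add, get, size, ...) is synchronized,
            // but iteration is a sequence of calls, so it must be guarded:
            synchronized (list) {
                for (String s : list) {
                    System.out.println(s);
                }
            }
        }
    }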
I have read about, and know in detail, the implications of the Java volatile and synchronized keywords at the CPU level on SMP-architecture CPUs.
A great paper on that subject is here:
http://irl.cs.ucla.edu/~yingdi/web/paperreading/whymb.2010.06.07c.pdf
Now, leave SMP CPUs aside for this question. My question is: how do the volatile and synchronized keywords work on older single-core CPUs, for example a Pentium I/Pro/II/III or an early Pentium IV?
I want to know specifically:
1) Are the L1/L2 caches not used to read memory addresses, with all reads and writes performed directly to main memory? If yes, why? (Since there is only a single cache copy and no need for coherency protocols, why can't the cache be used directly by two threads that are time-slicing the single core?) I ask this after reading dozens of internet forums about how volatile reads and writes go to/from the "master copy in main memory".
2) Apart from taking a lock on this or the specified object, which is more of a Java-platform thing, what other effects does the synchronized keyword have on single-core CPUs (compilers, assembly, execution, cache)?
3) With a non-superscalar CPU (Pentium I), instructions are not reordered. If that is the case, is the volatile keyword required while running on a Pentium I? (Atomicity, visibility, and ordering would be a "no problemo", right? Because there is only one cache, one core working on that cache, and no reordering.)
1) Are the L1/L2 caches not used to read memory addresses, with all reads and writes performed directly to main memory?
No. The caches are still enabled. That's not related to SMP.
2) Apart from taking a lock on this or the specified object, which is more of a Java-platform thing, what other effects does the synchronized keyword have on single-core CPUs (compilers, assembly, execution, cache)?
3) Does anything change with respect to a superscalar versus non-superscalar (out-of-order) processor w.r.t. these two keywords?
Gosh, do you have to ask this question about Java? Remember that all things eventually boil down to good ol' fashioned machine instructions. I'm not intimately familiar with the guts of Java synchronization, but as I understand it, synchronized is just syntactic sugar for your typical monitor-style synchronization mechanism. Multiple threads are not allowed in a critical section simultaneously. Instead of simply spinning on a spinlock, the scheduler is leveraged: the waiting threads are put to sleep, and woken back up when the lock can be taken again.
The thing to remember is that even on a single-core, non-SMP system, you still have to worry about OS preemption of threads! These threads can be scheduled on and off of the CPU whenever the OS wants. This is the purpose of the locks, of course.
Again, this question is much better asked in the context of assembly, or even C (whose compiled result can often be directly inferred), as opposed to Java, which has to deal with the VM, JITted code, etc.
I keep reading about how biased locking, enabled with the flag -XX:+UseBiasedLocking, can improve the performance of uncontended synchronization. I couldn't find a reference to what it does and how it improves performance.
Can anyone explain what exactly it is, or maybe point me to some links/resources that explain it?
Essentially, if your objects are locked only by one thread, the JVM can make an optimization and "bias" the object to that thread, in such a way that subsequent atomic operations on the object incur no synchronization cost. I suppose this is typically geared towards overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
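The classic beneficiary is code whose locking is thread-confined in practice, for example the old synchronized collections used from a single thread. A sketch (assuming a JVM where biased locking is available and enabled):

    import java.util.Vector;

    public class SingleThreadedSync {
        public static void main(String[] args) {
            // Vector's methods are synchronized, but only this one thread
            // ever touches v. With biased locking, the monitor is biased to
            // this thread after the first acquisition, so the remaining
            // acquisitions can skip the atomic CAS entirely.
            Vector<Integer> v = new Vector<>();
            for (int i = 0; i < 1_000_000; i++) {
                v.add(i);
            }
            System.out.println(v.size());
        }
    }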
It is on by default in Java 6.
-XX:+UseBiasedLocking
Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.
Does this not answer your questions?
http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.5
Though I think you'll find it's on by default in 1.6. Use the PrintFlagsFinal diagnostic option to see what the effective flags are. Make sure you specify -server if you're investigating for a server application because the flags can differ:
http://www.jroller.com/ethdsy/entry/print_all_jvm_flags
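For example, running something like java -server -XX:+PrintFlagsFinal -version and searching the output for UseBiasedLocking should show whether the flag is in effect for that particular JVM and mode.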
I've been wondering about biased locks myself.
However, it seems that Java's biased locks are slower than normal locks on Intel's Nehalem processors, and presumably on the two generations of processors since Nehalem. See http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
and here http://www.azulsystems.com/blog/cliff/2011-11-16-a-short-conversation-on-biased-locking
Also more information here https://blogs.oracle.com/dave/entry/biased_locking_in_hotspot
I've been hoping that there is some relatively cheap way to revoke a biased lock on Intel, but I'm beginning to believe that isn't possible. The articles I've seen on how it's done rely on one of:
using the OS to stop the thread
sending a signal, i.e. running code in the other thread
having safe points that are guaranteed to run fairly often in the other thread and waiting for one to be executed (which is what Java does)
having similar safe points that are a call to a return - and the other thread MODIFIES THE CODE to a breakpoint...
Worth mentioning that biased locking is disabled by default from JDK 15 onwards:
JEP 374: Disable and Deprecate Biased Locking
The performance gains seen in the past are far less evident today. Many applications that benefited from biased locking are older, legacy applications that use the early Java collection APIs, which synchronize on every access (e.g., Hashtable and Vector). Newer applications generally use the non-synchronized collections (e.g., HashMap and ArrayList), introduced in Java 1.2 for single-threaded scenarios, or the even more-performant concurrent data structures, introduced in Java 5, for multi-threaded scenarios.
Further
Biased locking introduced a lot of complex code into the synchronization subsystem and is invasive to other HotSpot components as well. This complexity is a barrier to understanding various parts of the code and an impediment to making significant design changes within the synchronization subsystem. To that end we would like to disable, deprecate, and eventually remove support for biased locking.
And ya, no more System.identityHashCode(o) magic ;)
Two papers here:
https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/ccd39237cd4dc109d91786762fba41f0/qrl_oplocks_biasedlocking.pdf
https://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
A web page too:
https://blogs.oracle.com/dave/biased-locking-in-hotspot
JVM HotSpot source code:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/oops/markOop.hpp
The JVM performs a neat trick called lock elision to avoid the cost of locking on objects that are only visible to one thread.
There's a good description of the trick here:
http://www.ibm.com/developerworks/java/library/j-jtp10185/
Does the .Net CLR do something similar? If not, why not?
It's neat, but is it useful? I have a hard time coming up with an example where the compiler can prove that a lock is thread-local. Almost all classes don't use locking by default, and when you choose one that locks, in most cases it will be referenced from some kind of static variable, foiling the compiler optimization anyway.
Another thing is that the Java VM uses escape analysis in its proof. And AFAIK .Net hasn't implemented escape analysis. Other uses of escape analysis, such as replacing heap allocations with stack allocations, sound much more useful and should be implemented first.
IMO it's currently not worth the coding effort. There are many areas in the .Net VM which are not optimized very well and have a much bigger impact.
SSE vector instructions and delegate inlining are two examples from which my code would profit much more than from this optimization.
EDIT: As chibacity points out below, this is talking about making locks really cheap rather than completely eliminating them. I don't believe the JIT has the concept of "thread-local objects" although I could be mistaken... and even if it doesn't now, it might in the future of course.
EDIT: Okay, the below explanation is over-simplified, but has at least some basis in reality :) See Joe Duffy's blog post for some rather more detailed information.
I can't remember where I read this - probably "CLR via C#" or "Concurrent Programming on Windows" - but I believe that the CLR allocates sync blocks to objects lazily, only when required. When an object whose monitor has never been contested is locked, the object header is atomically updated with a compare-exchange operation to say "I'm locked". If a different thread then tries to acquire the lock, the CLR will be able to determine that it's already locked, and basically upgrade that lock to a "full" one, allocating it a sync block.
When an object has a "full" lock, locking operations are more expensive than locking and unlocking an otherwise-uncontested object.
If I'm right about this (and it's a pretty hazy memory) it should be feasible to lock and unlock a monitor on different threads cheaply, so long as the locks never overlap (i.e. there's no contention).
I'll see if I can dig up some evidence for this...
In answer to your question: no, the CLR/JIT does not perform the "lock elision" optimization, i.e. the CLR/JIT does not remove locks from code which is only visible to a single thread. This can easily be confirmed with simple single-threaded benchmarks on code where lock elision should apply, as you would expect in Java.
There are likely to be a number of reasons why it does not do this, but chiefly is the fact that in the .Net framework this is likely to be an uncommonly applied optimization, so is not worth the effort of implementing.
Also, in .Net uncontended locks are extremely fast due to the fact that they are non-blocking and executed in user space (JVMs appear to have similar optimizations for uncontended locks, e.g. IBM). To quote from C# 3.0 in a Nutshell's threading chapter:
"Locking is fast: you can expect to acquire and release a lock in less than 100 nanoseconds on a 3 GHz computer if the lock is uncontended."
A couple of example scenarios where lock elision could be applied, and why it's not worth it:
Using locks within a method in your own code that acts purely on locals
There is not really a good reason to use locking in this scenario in the first place, so unlike optimizations such as hoisting loop invariants or method inlining, this is a pretty uncommon case and the result of unnecessary use of locks. The runtime shouldn't be concerned with optimizing out uncommon, extremely bad usage.
Using someone else's type that is declared as a local which uses locks internally
Although this sounds more useful, the .Net framework's general design philosophy is to leave the responsibility for locking to clients, so it's rare that types have any internal lock usage. Indeed, the .Net framework is pathologically unsynchronized when it comes to instance methods on types that are not specifically designed and advertised to be concurrent. On the other hand, Java has common types that do include synchronization, e.g. StringBuffer and Vector. As the .Net BCL is largely unsynchronized, lock elision is likely to have little effect.
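For reference, this is the kind of Java case the linked article is about: a StringBuffer that never escapes the method, so escape analysis can prove its monitor is thread-local, and the JIT may elide the locking its synchronized methods would otherwise perform. A sketch:

    public class ElisionCandidate {
        public static String join(String a, String b) {
            // sb never escapes this method, so its monitor can be proven
            // thread-local; the JIT may then elide the lock acquisitions
            // inside StringBuffer's synchronized append() and toString().
            StringBuffer sb = new StringBuffer();
            sb.append(a);
            sb.append(b);
            return sb.toString();
        }
    }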
Summary
I think overall, there are fewer cases in .Net where lock elision would kick in, because there are simply not as many places where there would be thread-local locks. Locks are far more likely to be used in places which are visible to multiple threads and therefore should not be elided. Also, uncontended locking is extremely quick.
I had difficulty finding any real-world evidence that lock elision actually provides that much of a performance benefit in Java (for example...), and the latest docs for at least the Oracle JVM state that elision is not always applied for thread local locks, hinting that it is not an always given optimization anyway.
I suspect that lock elision is something that is made possible through the introduction of escape analysis in the JVM, but is not as important for performance as EA's ability to analyze whether reference types can be allocated on the stack.