Question regarding synchronization - java

Using synchronization slows down the execution of a program. Is there a way to improve the speed of execution?

Saying that a synchronization construct slows down execution is like saying that a parachute slows down a skydiver. Going without will be faster, but that's not exactly the point. Synchronization serves a purpose.
To improve the speed of execution, simply apply synchronization properly.
For example, using the Producer/Consumer design pattern may help you reduce the number of synchronization constructs required in your code.
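For illustration, here is a minimal producer/consumer sketch built on java.util.concurrent.BlockingQueue (the item type, queue capacity and counts are arbitrary choices, not from the question): the queue does all the locking internally, so the surrounding code needs no explicit synchronized blocks.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    queue.put("item-" + i);             // blocks if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    System.out.println(queue.take());   // blocks until an item arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}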

It's simply not true that "synchronization slows down programs" - it only does when the synchronized actions are done very frequently, or when you actually have a lot of threads contending for them. For most applications, neither is true.
Also, some kinds of concurrent operations can be implemented safely without synchronization by using clever data structures or hardware primitives. Examples (a short sketch follows the list):
ConcurrentHashMap
CopyOnWriteArrayList
AtomicInteger
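As a small illustration of the last of these, a counter built on AtomicInteger needs no synchronized block at all, because the increment is a single hardware compare-and-swap (the class and method names below are made up for the example):

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: a hit counter that needs no synchronized block,
// because AtomicInteger uses a compare-and-swap primitive under the hood.
class HitCounter {
    private final AtomicInteger hits = new AtomicInteger();

    void record() {
        hits.incrementAndGet();   // atomic increment, no lock acquired
    }

    int current() {
        return hits.get();
    }
}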

Profile your code, find out where the real bottlenecks lie.
Carefully re-analyse your critical regions. It's very easy to apply synchronization too broadly.
Sometimes changing the algorithm can lead to a completely different synchronization profile. This doesn't always have a positive effect, though!

Have you measured how much (if any) the slowdown is?
The early JVMs suffered a penalty when using synchronisation. However that situation has improved vastly over the years. I wouldn't worry about a performance penalty when synchronising. There will be many more candidates for optimisations.

You might want to synchronize just the block of code that needs it rather than the whole method. Without synchronization where it is needed, you are risking a whole lot more!
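A minimal sketch of the difference, with made-up method and field names: narrowing the synchronized region from the whole method to just the shared-state update shortens the window during which the lock is held.

// Hypothetical example: two ways to guard a shared counter.
class Totals {
    private long total;

    // Locking the whole method holds the monitor even while logging:
    public synchronized void updateWholeMethod(long delta) {
        log(delta);              // lock held here too
        total += delta;
    }

    // Locking only the critical section keeps the lock window small:
    public void updateCriticalSectionOnly(long delta) {
        log(delta);              // assumed not to touch shared state
        synchronized (this) {
            total += delta;      // only the shared field is guarded
        }
    }

    private void log(long delta) {
        System.out.println("delta=" + delta);
    }
}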

Related

When is the wrong time to use a Collections.synchronizedList vs. a List?

Other than a (minor?) performance hit, why would I use a regular List instead of a Collections.synchronizedList?
The project I'm working on is under 10k entries, so I don't care, but if someone (maybe me) chooses to sub-class this, I need to document the behavior.
Besides performance (over 100k entries), why would I not synchronize?
That is, what penalty do I incur for using a synchronizedList? How bad is it? For my current application, it's not an issue. If it is a cheap addition, why not?
Other than a (minor?) performance hit ...
In fact, if the list is shared between threads, the cost of using a simple synchronized list (versus something more appropriate) could be a large performance hit, depending on what you are doing. The synchronized operations could become a concurrency bottleneck, reducing the application to the performance of a single core.
Simple "black and white" rules are not sufficient when designing a multi-threaded application ... or a reusable library that needs to be thread-safe and also performant in multi-threaded applications.
That is, what penalty do I incur for using a synchronizedList? How bad is it? For my current application, it's not an issue. If it is a cheap addition, why not?
The synchronized list class uses primitive object locking (mutexes).
If the lock is uncontended, this is cheap; maybe 5 or 10 instructions each time you acquire and release the lock. However, the overhead may depend on whether there has been previous contention on the lock. (Some locking schemes cause an object lock to be "inflated" the first time that contention occurs ...)
If the lock is contended, then it is more expensive because this will typically involve the blocked thread being de-scheduled and rescheduled ... and context switch overheads. There is another JVM-level implementation approach involving "spin locking", but that entails the blocked thread testing the lock object in a tight loop.
If the lock is held for a long time (e.g. in list.contains ... for a long list), then that typically increases the probability of contention.
When you don't need the synchronization, or when you know better than to assume that a synchronized list is thread-safe even when iterating over it, which it isn't.
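To make the iteration point concrete: individual calls on a Collections.synchronizedList are synchronized, but traversal is not atomic, so the caller must lock the list manually around the loop (as the Collections javadoc itself requires).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SynchronizedListIteration {
    public static void main(String[] args) {
        List<String> names = Collections.synchronizedList(new ArrayList<>());
        names.add("a");                 // individual calls are synchronized
        names.add("b");

        // Iteration is NOT atomic: another thread could modify the list
        // mid-loop, so the caller must hold the list's lock while traversing.
        synchronized (names) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}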

Why does sharing a static variable between threads reduce performance?

I asked a question here and someone left a comment saying that the problem is that I'm sharing a static variable.
Why is that a problem?
Sharing a static variable in and of itself should have no adverse effect on performance. Global data is common in all programs, starting with the JVM and OS constructs.
Mutable shared data is a different story as the mutation of shared data can lead to both performance issues (cache misses at the very least) and correctness issues which are a pain and are often solved using locks, which lead to potentially other performance issues.
The wiki static variable looks like a pretty substantial part of your program. Not knowing anything about what it's doing or how it's coded, I would guess that it does locking in order to keep a consistent state. If most of your threads are spending their time blocked waiting to acquire access to this same object, then that would explain why you're not seeing any gain from using multiple threads.
For threads to make a difference to the performance of your program, they have to be reasonably independent and not all locking on the same thing. The more locking they have to do, the less gain you will see. So try to split out the work so that as much as possible can be done independently. For instance, if there are work items that can be gathered independently, you might be better off having multiple threads find the work items and feed them to a queue, with a dedicated thread pulling items off the queue and applying them to the wiki object.
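A rough sketch of that arrangement, with placeholder names (plain strings standing in for work items and a hypothetical Wiki class standing in for the real shared object): the gatherer threads only put items on a queue, and a single dedicated thread is the only one that ever touches the shared object, so the other threads no longer contend for its lock.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleWriterSketch {
    // Placeholder for the shared, mutable object (e.g. the "wiki").
    static class Wiki {
        void apply(String item) { /* mutate internal state */ }
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Wiki wiki = new Wiki();

        // Only this thread ever touches the wiki, so no locking is needed on it.
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    wiki.apply(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // stop when interrupted
            }
        });
        writer.setDaemon(true);
        writer.start();

        // Gatherer threads work independently and only enqueue results.
        for (int t = 0; t < 4; t++) {
            final int id = t;
            new Thread(() -> queue.add("item from gatherer " + id)).start();
        }
    }
}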

Biased locking in java

I keep reading about how biased locking, using the flag -XX:+UseBiasedLocking, can improve the performance of un-contended synchronization. I couldn't find a reference to what it does and how it improves the performance.
Can anyone explain what exactly it is, or maybe point me to some links/resources that explain it?
Essentially, if your objects are locked only by one thread, the JVM can make an optimization and "bias" that object to that thread in such a way that subsequent atomic operations on the object incur no synchronization cost. I suppose this is typically geared towards overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
It is on by default in Java 6.
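To make the scenario concrete, here is a minimal sketch (not from the original question) of the kind of code biased locking targets: a synchronized collection that only one thread ever locks, so every monitor acquisition is uncontended and the lock can be biased to that thread.

import java.util.Vector;

public class BiasedLockingSketch {
    public static void main(String[] args) {
        // Vector's methods are synchronized, but no other thread ever
        // touches this instance, so every lock acquisition is uncontended.
        Vector<Integer> values = new Vector<>();
        for (int i = 0; i < 1_000_000; i++) {
            values.add(i);   // monitor acquired and released on each call
        }
        System.out.println(values.size());
    }
}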
-XX:+UseBiasedLocking
Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.
Does this not answer your questions?
http://www.oracle.com/technetwork/java/tuning-139912.html#section4.2.5
Though I think you'll find it's on by default in 1.6. Use the PrintFlagsFinal diagnostic option to see what the effective flags are. Make sure you specify -server if you're investigating for a server application because the flags can differ:
http://www.jroller.com/ethdsy/entry/print_all_jvm_flags
I've been wondering about biased locks myself.
However it seems that Java's biased locks are slower on Intel's Nehalem processors than normal locks, and presumably on the two generations of processors since Nehalem. See http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
and here http://www.azulsystems.com/blog/cliff/2011-11-16-a-short-conversation-on-biased-locking
Also more information here https://blogs.oracle.com/dave/entry/biased_locking_in_hotspot
I've been hoping that there is some relatively cheap way to revoke a biased lock on Intel, but I'm beginning to believe that isn't possible. The articles I've seen on how it's done rely on one of the following:
using the os to stop the thread
sending a signal, ie running code in the other thread
having safe points that are guaranteed to run fairly often in the other thread and waiting for one to be executed (which is what Java does).
having similar safe points that are a call to a return - and the other thread MODIFIES THE CODE to a breakpoint...
Worth mentioning that biased locking will be disabled by default from JDK 15 onwards:
JEP 374: Disable and Deprecate Biased Locking
The performance gains seen in the past are far less evident today. Many applications that benefited from biased locking are older, legacy applications that use the early Java collection APIs, which synchronize on every access (e.g., Hashtable and Vector). Newer applications generally use the non-synchronized collections (e.g., HashMap and ArrayList), introduced in Java 1.2 for single-threaded scenarios, or the even more-performant concurrent data structures, introduced in Java 5, for multi-threaded scenarios.
Further
Biased locking introduced a lot of complex code into the synchronization subsystem and is invasive to other HotSpot components as well. This complexity is a barrier to understanding various parts of the code and an impediment to making significant design changes within the synchronization subsystem. To that end we would like to disable, deprecate, and eventually remove support for biased locking.
And ya, no more System.identityHashCode(o) magic ;)
Two papers here:
https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/f4a5b21d-66fa-4885-92bf-c4e81c06d916/File/ccd39237cd4dc109d91786762fba41f0/qrl_oplocks_biasedlocking.pdf
https://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
web page too:
https://blogs.oracle.com/dave/biased-locking-in-hotspot
jvm hotspot source code:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee27509/src/share/vm/oops/markOop.hpp

Should I inline long code in a loop, or move it in a separate method?

Assume I have a loop (any while or for) like this:
loop {
    // ... a long piece of code ...
}
From the point of view of time complexity, should I divide this code into parts, write a function outside the loop, and call that function repeatedly?
I read something about functions a long time ago: that calling a function repeatedly takes more time or memory, or something like that, but I don't remember it exactly. Can you also provide some good references about things like this (time complexity, coding style)?
Can you also provide some reference book or tutorial about heap memory, overheads, etc., which affect the performance of a program?
The performance difference is probably very minimal in this case. I would concentrate on clarity rather than performance until you identify this portion of your code to be a serious bottleneck.
It really does depend on what kind of code you're running in the loop, however. If you're just doing a tiny mathematical operation that isn't going to take any CPU time, but you're doing it a few hundred thousand times, then inlining the calculation might make sense. Anything more expensive than that, though, and performance shouldn't be an issue.
There is an overhead of calling a function.
So if the "long code" is fast compared to this overhead (and your application cares about performance), then you should definitely avoid the overhead.
However, if the performance is not noticeably worse, it's better to make the code more readable by using a function (or better, multiple functions).
Rule one of performance optimisation: measure it.
Personally, I go for readable code first and then optimise it IF NECESSARY. Usually, it isn't necessary :-)
See the first line in CHAPTER 3 - Measurement Is Everything
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." - Donald Knuth
In this case, the difference in performance will probably be minimal between the two solutions, so writing clearer code is the way to do it.
There really isn't a simple "tutorial" on performance; it is a very complex subject, and one that even seasoned veterans often don't fully understand. Anyway, to give you more of an idea of what the overhead of "calling" a function is: basically what you are doing is "freezing" the state of your function (in Java there are no "functions" per se; they are all called methods), calling the method, and then "unfreezing" back to where your method was before.
The "freezing" essentially consists of pushing state information (where you were in the method, what the values of the variables were, etc.) onto the stack; "unfreezing" consists of popping the saved state off the stack and updating the control structures to where they were before you called the function. Naturally, memory operations are far from free, but the VM is pretty good at keeping the performance impact to an absolute minimum.
Now keep in mind Java is almost entirely heap based; the only things that really have to get pushed on the stack are the values of pointers (small), your place in the program (again small), whatever primitives you have local to your method, and a tiny bit of control information, nothing else. Furthermore, although you cannot explicitly inline in Java (though I'm sure there are bytecode editors out there that essentially let you do that), most VMs, including the most popular HotSpot VM, will do this automatically for you. http://java.sun.com/developer/technicalArticles/Networking/HotSpot/inlining.html
So the bottom line is pretty much 0 performance impact, if you want to verify for yourself you can always run benchmarking and profiling tools, they should be able to confirm it for you.
From an execution speed point of view it shouldn't matter, and if you still believe this is a bottleneck, it is easy to measure.
From a development performance perspective, it is a good idea to keep the code short. I would vote for turning the loop contents into one (or more) properly named methods.
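A small sketch of what that refactoring looks like, with made-up names; HotSpot will typically inline a short private method like this, so the readability gain costs essentially nothing.

import java.util.List;

class BatchProcessor {
    void processAll(List<String> items) {
        for (String item : items) {
            processItem(item);          // the name documents what the loop body does
        }
    }

    // The former loop body, now a named unit that can be read (and tested) on its own.
    private void processItem(String item) {
        // ... the "long code" that used to live inside the loop ...
        System.out.println("processing " + item);
    }
}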
Forget it! You can't gain any performance by doing the job of the JIT. Let the JIT inline it for you. Keep the methods short for readability and also for performance, as the JIT works better with short methods.
There are micro-optimizations which may help you gain some performance, but don't even think about them. I suggest the following rules:
Write clean code using appropriate objects and algorithms for readability and for performance.
In case the program is too slow, profile and identify the critical parts.
Think about improving them using better objects and algorithms.
As a last resort, you may also consider micro-optimizations.

Does the CLR perform "lock elision" optimization? If not why not?

The JVM performs a neat trick called lock elision to avoid the cost of locking on objects that are only visible to one thread.
There's a good description of the trick here:
http://www.ibm.com/developerworks/java/library/j-jtp10185/
Does the .Net CLR do something similar? If not then why not?
It's neat, but is it useful? I have a hard time coming up with an example where the compiler can prove that a lock is thread local. Almost all classes don't use locking by default, and when you choose one that locks, then in most cases it will be referenced from some kind of static variable foiling the compiler optimization anyways.
Another thing is that the Java VM uses escape analysis in its proof, and AFAIK .NET hasn't implemented escape analysis. Other uses of escape analysis, such as replacing heap allocations with stack allocations, sound much more useful and should be implemented first.
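For reference, a sketch (not taken from the linked article) of the kind of Java code where the lock is provably thread-local: the StringBuffer's synchronized append() is the only locking involved, and the buffer never escapes the method, so escape analysis lets the JIT elide the synchronization.

class LockElisionSketch {
    // Every StringBuffer.append() acquires the buffer's monitor, but the
    // buffer never escapes this method, so the JIT can prove the lock is
    // thread-local and elide the synchronization entirely.
    static String join(String[] parts) {
        StringBuffer sb = new StringBuffer();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }
}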
IMO it's currently not worth the coding effort. There are many areas in the .NET VM which are not optimized very well and have a much bigger impact.
SSE vector instructions and delegate inlining are two examples from which my code would profit much more than from this optimization.
EDIT: As chibacity points out below, this is talking about making locks really cheap rather than completely eliminating them. I don't believe the JIT has the concept of "thread-local objects" although I could be mistaken... and even if it doesn't now, it might in the future of course.
EDIT: Okay, the below explanation is over-simplified, but has at least some basis in reality :) See Joe Duffy's blog post for some rather more detailed information.
I can't remember where I read this - probably "CLR via C#" or "Concurrent Programming on Windows" - but I believe that the CLR allocates sync blocks to objects lazily, only when required. When an object whose monitor has never been contested is locked, the object header is atomically updated with a compare-exchange operation to say "I'm locked". If a different thread then tries to acquire the lock, the CLR will be able to determine that it's already locked, and basically upgrade that lock to a "full" one, allocating it a sync block.
When an object has a "full" lock, locking operations are more expensive than locking and unlocking an otherwise-uncontested object.
If I'm right about this (and it's a pretty hazy memory) it should be feasible to lock and unlock a monitor on different threads cheaply, so long as the locks never overlap (i.e. there's no contention).
I'll see if I can dig up some evidence for this...
In answer to your question: no, the CLR/JIT does not perform the "lock elision" optimization, i.e. the CLR/JIT does not remove locks from code which is only visible to a single thread. This can easily be confirmed with simple single-threaded benchmarks on code where lock elision should apply, as you would expect in Java.
There are likely to be a number of reasons why it does not do this, but chiefly is the fact that in the .Net framework this is likely to be an uncommonly applied optimization, so is not worth the effort of implementing.
Also in .Net uncontended locks are extremely fast due to the fact they are non-blocking and executed in user space (JVMs appear to have similar optimizations for uncontended locks e.g. IBM). To quote from C# 3.0 In A Nutshell's threading chapter
Locking is fast: you can expect to acquire and release a lock in less than 100 nanoseconds on a 3 GHz computer if the lock is uncontended.
A couple of example scenarios where lock elision could be applied, and why it's not worth it:
Using locks within a method in your own code that acts purely on locals
There is not really a good reason to use locking in this scenario in the first place, so unlike optimizations such as hoisting loop invariants or method inlining, this is a pretty uncommon case and the result of unnecessary use of locks. The runtime shouldn't be concerned with optimizing out uncommon, extremely bad usage.
Using someone else's type that is declared as a local which uses locks internally
Although this sounds more useful, the .Net framework's general design philosophy is to leave the responsibility for locking to clients, so it's rare that types have any internal lock usage. Indeed, the .Net framework is pathologically unsynchronized when it comes to instance methods on types that are not specifically designed and advertised to be concurrent. On the other hand, Java has common types that do include synchronization e.g. StringBuffer and Vector. As the .Net BCL is largely unsynchronized, lock elision is likely to have little effect.
Summary
I think overall, there are fewer cases in .Net where lock elision would kick in, because there are simply not as many places where there would be thread-local locks. Locks are far more likely to be used in places which are visible to multiple threads and therefore should not be elided. Also, uncontended locking is extremely quick.
I had difficulty finding any real-world evidence that lock elision actually provides that much of a performance benefit in Java (for example...), and the latest docs for at least the Oracle JVM state that elision is not always applied for thread local locks, hinting that it is not an always given optimization anyway.
I suspect that lock elision is something that is made possible through the introduction of escape analysis in the JVM, but is not as important for performance as EA's ability to analyze whether reference types can be allocated on the stack.
