Use of volatile only makes sense in multiprocessor systems. Is this wrong?
I'm trying to learn about thread programming, so if you know any good articles/PDFs ... I like material that also covers a bit of how the operating system works, not just the language's syntax.
No. Volatile can be used in multi-threaded applications. These may or may not run on more than one processor.
volatile is used to ensure all threads see the same copy of the data. If only one thread reads and writes a field, it doesn't need to be volatile; it will still work just fine, just a bit slower.
In Java you don't have much visibility into the processor architecture; generally you talk in terms of threads and multi-threading.
I suggest Java Concurrency in Practice; it's good whatever your level of knowledge: http://www.javaconcurrencyinpractice.com/
The whole point of using Java is you don't need to know most of the details of how threads work etc. If you learn lots of stuff you don't use you are likely to forget it. ;)
Volatile makes sense in a multithreaded program; whether those threads run on a single processor or multiple processors makes no difference. The volatile keyword tells the JVM (and the Java Memory Model it implements) that it must not reorder or cache the value of a variable marked with this keyword. This guarantees that threads using a volatile variable will never see a stale version of it.
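As a minimal sketch of what that means in practice (class and method names are mine, not from the original answer): without volatile, the JIT is free to cache the flag in a register, and the worker thread may never notice the update.

public class StopFlag {
    private volatile boolean running = true; // without volatile, the worker may spin forever

    public void start() {
        new Thread(() -> {
            while (running) {
                // do some work
            }
            System.out.println("worker stopped");
        }).start();
    }

    public void stop() {
        running = false; // guaranteed to become visible to the worker thread
    }
}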
See this link on the Java Memory Model in general for more details, or this one for information about volatile.
No. Volatile is used to support concurrency. In certain circumstances, it can be used instead of synchronization.
This article by Brian Goetz really helped me understand volatile. It has several examples of the use of volatile, and it explains under what conditions it can be used.
Whilst trying to understand how SubmissionPublisher (source code in OpenJDK 10, Javadoc), a new class added to Java SE in version 9, has been implemented, I stumbled across a few API calls on VarHandle that I wasn't previously aware of:
fullFence, acquireFence, releaseFence, loadLoadFence and storeStoreFence.
After doing some research, especially regarding the concept of memory barriers/fences (I have heard of them previously, yes; but never used them, thus was quite unfamiliar with their semantics), I think I have a basic understanding of what they are for. Nonetheless, as my questions might arise from a misconception, I want to ensure that I got it right in the first place:
Memory barriers are reordering constraints regarding reading and writing operations.
Memory barriers can be categorized into two main categories: unidirectional and bidirectional memory barriers, depending on whether they set constraints on either reads or writes or both.
C++ supports a variety of memory barriers; however, these do not match up one-to-one with those provided by VarHandle. That said, some of the memory barriers available on VarHandle provide ordering effects that are compatible with their corresponding C++ memory barriers.
#fullFence is compatible with atomic_thread_fence(memory_order_seq_cst)
#acquireFence is compatible with atomic_thread_fence(memory_order_acquire)
#releaseFence is compatible with atomic_thread_fence(memory_order_release)
#loadLoadFence and #storeStoreFence have no compatible C++ counterpart
The word compatible seems to be really important here, since the semantics clearly differ when it comes to the details. For instance, all C++ barriers are bidirectional, whereas Java's barriers aren't (necessarily).
Most memory barriers also have synchronization effects. These depend in particular on the barrier type used and on previously-executed barrier instructions in other threads. As the full implications of a barrier instruction are hardware-specific, I'll stick with the higher-level (C++) barriers. In C++, for instance, changes made prior to a release barrier instruction are visible to a thread executing an acquire barrier instruction.
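To make that last point concrete, here is how I picture the release/acquire pairing with VarHandle's static fences (my own sketch, transplanted from the C++ fence idiom; note that the plain accesses are, strictly speaking, racy under the JMM):

import java.lang.invoke.VarHandle;

class FencedPublish {
    int data;      // plain, non-volatile fields
    boolean ready;

    void writer() {
        data = 42;
        VarHandle.releaseFence(); // writes above may not be reordered below this fence
        ready = true;
    }

    void reader() {
        if (ready) {
            VarHandle.acquireFence(); // reads below may not be reordered above this fence
            int observed = data;      // expected to be 42 if ready was seen as true
        }
    }
}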
Are my assumptions correct? If so, my resulting questions are:
Do the memory barriers available in VarHandle cause any kind of memory synchronization?
Regardless of whether they cause memory synchronization or not, what might reordering constraints be useful for in Java? The Java Memory Model already gives some very strong ordering guarantees when volatile fields, locks, or VarHandle operations like #compareAndSet are involved.
In case you're looking for an example: the aforementioned BufferedSubscription, an inner class of SubmissionPublisher (source linked above), establishes a full fence in line 1079, in the function growAndAdd. However, it is unclear to me what it is there for.
This is mainly a non-answer, really (I initially wanted to make it a comment, but as you can see, it's far too long). It's just that I questioned this myself a lot, did a lot of reading and research, and at this point in time I can safely say: this is complicated. I even wrote multiple tests with jcstress to figure out how they really work (while looking at the generated assembly code), and while some of them somehow made sense, the subject in general is by no means easy.
The very first thing you need to understand:
The Java Language Specification (JLS) does not mention barriers anywhere. For Java, that would be an implementation detail: the language really acts in terms of happens-before semantics. To be able to properly specify these according to the JMM (Java Memory Model), the JMM would have to change quite a lot.
This is work in progress.
Second, if you really want to scratch the surface here, this is the very first thing to watch. The talk is incredible. My favorite part is when Herb Sutter raises his 5 fingers and says, "This is how many people can really and correctly work with these." That should give you a hint of the complexity involved. Nevertheless, there are some trivial examples that are easy to grasp (like a counter updated by multiple threads that does not care about other memory guarantees, but only cares that it is itself incremented correctly).
Another example is when (in Java) you want a volatile flag to control whether threads stop or start. You know, the classic:
volatile boolean stop = false; // one thread writes, one thread reads this
If you work with Java, you would know that without volatile this code is broken (you can read about why double-checked locking is broken without it, for example). But do you also know that for some people who write high-performance code this is too much? A volatile read/write also guarantees sequential consistency; that is a strong guarantee, and some people want a weaker version of it.
A thread-safe flag, but not volatile? Yes, exactly: VarHandle::setOpaque/getOpaque.
And you would ask why someone might need that? Not everyone is interested in all the memory effects that are piggy-backed onto a volatile.
Let's see how we can achieve this in Java. First of all, such exotic things already existed in the API: AtomicInteger::lazySet. This is unspecified in the Java Memory Model and has no clear definition; still, people used it (LMAX, afaik, or this for more reading). IMHO, AtomicInteger::lazySet is VarHandle::releaseFence (or VarHandle::storeStoreFence).
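To give a feel for how lazySet is used in practice, here is a single-writer publication sketch in the style of the LMAX ring buffer (class and method names are my own illustration):

import java.util.concurrent.atomic.AtomicLong;

class SingleWriterSequence {
    private final AtomicLong sequence = new AtomicLong(-1);

    // only one thread ever calls this, so a release-store is enough
    void publish(long next) {
        sequence.lazySet(next); // cheaper than set(), which is a full volatile store
    }

    long lastPublished() {
        return sequence.get(); // consumers still perform a volatile read
    }
}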
Let's try to answer why someone needs these.
The JMM has basically two ways to access a field: plain and volatile (which guarantees sequential consistency). All the methods you mention are there to bring something in between these two: release/acquire semantics. There are cases, I guess, where people actually need this.
An even further relaxation beyond release/acquire would be opaque, which I am still trying to fully understand.
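To connect this back to the flag example, here is what the "thread-safe but not volatile" flag could look like in opaque mode (my own sketch; since these modes are unspecified, treat it as illustrative only):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class OpaqueFlag {
    private boolean stop; // accessed only through the VarHandle below
    private static final VarHandle STOP;

    static {
        try {
            STOP = MethodHandles.lookup()
                    .findVarHandle(OpaqueFlag.class, "stop", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void requestStop() {
        STOP.setOpaque(this, true); // coherent write, but no ordering w.r.t. other fields
    }

    boolean stopRequested() {
        return (boolean) STOP.getOpaque(this); // will eventually observe the write
    }
}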
Thus, bottom line (your understanding is fairly correct, btw): if you plan to use these in Java, they have no specification at the moment; do it at your own risk. If you do want to understand them, their C++ equivalent modes are the place to start.
I often hear about other languages promoted as being more suitable for multi-core/concurrent programming, e.g. Clojure, Scala, Erlang etc., but I'm a little confused about why I need to worry about multiple cores. Shouldn't the Java/.NET VM handle that automatically, and if not, what are the reasons behind it?
Is it because those languages mentioned are functional and have some intrinsic advantage over non-functional languages?
The reason you need to care is that processors are, generally speaking, not getting any faster. Instead, more of them are being added to computers in the form of additional cores. For a program to take advantage of the extra processors, it generally must use multiple threads. Java, and most other languages, will not automatically use more threads in your program; this is something you need to do manually.
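For instance, spreading work across the available cores in Java is an explicit step along these lines (a minimal sketch; the file list and process() method are placeholders):

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelFiles {
    static void processAll(List<File> files) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (File file : files) {
            pool.submit(() -> process(file)); // each file is handled on a pool thread
        }
        pool.shutdown(); // stop accepting new work; queued tasks still run
    }

    static void process(File file) { /* placeholder for real work */ }
}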
Some people prefer functional style languages like Scala, Erlang and F# to non-functional for multi-threaded programming. Functional languages tend to be at least shallowly immutable by default and hence in theory are easier to work with in multi-threaded situations.
Here is a very good article on this subject: The Free Lunch Is Over
Functional languages have the intrinsic advantage that most function/method calls are idempotent and that they typically use "immutability" a lot.
Doing concurrency with immutable "stuff" is way easier than with mutable "stuff".
But anyway: you need to worry about concurrency because we're moving to CPUs with more and more cores rather than faster and faster clocks. Java / C# do not automagically make your program concurrent: you still have to do the hard work yourself.
In the case of normal imperative languages, by default no, you do not get much help from the platform. No compiler is clever enough to parallelize normal imperative code.
There are, however, various helper libraries for different platforms. For example, the Task Parallel Library in .NET allows you to do this:
Parallel.ForEach(files, file =>
{
    Process(file);
});
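For comparison, modern Java (8+) has a rough counterpart in parallel streams (files and process are placeholders, as above):

files.parallelStream().forEach(file -> process(file)); // runs on the common fork/join pool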
Pure functional languages always have the benefit of minimal shared state which means that such code is more easily parallelized.
I have never programmed in functional languages, but they are supposed to be easier for concurrent coding since state does not change (or does not change much), so you can have the same object on two threads at once. If you have the same object on two threads at once in Java, you have to be very careful and use special synchronization constructs so that the two threads "see" the object in the same state when necessary.
The current problem in programming is that everybody is used to doing things sequentially, but the hardware is moving multi-core. There has to be a complete paradigm shift in the way we code and think about code. Right now, coding for multi-core in Java or C# is basically just sequential coding with hacks to make it parallelizable. Will functional programming turn out to be the required paradigm shift? I don't know.
The JVM performs a neat trick called lock elision to avoid the cost of locking on objects that are only visible to one thread.
There's a good description of the trick here:
http://www.ibm.com/developerworks/java/library/j-jtp10185/
Does the .Net CLR do something similar? If not then why not?
It's neat, but is it useful? I have a hard time coming up with an example where the compiler can prove that a lock is thread-local. Almost all classes don't use locking by default, and when you choose one that locks, in most cases it will be referenced from some kind of static variable, foiling the compiler optimization anyway.
Another thing is that the Java VM uses escape analysis in its proof, and AFAIK .NET hasn't implemented escape analysis. Other uses of escape analysis, such as replacing heap allocations with stack allocations, sound much more useful and should be implemented first.
IMO it's currently not worth the coding effort. There are many areas in the .NET VM which are not optimized very well and where work would have a much bigger impact.
SSE vector instructions and delegate inlining are two examples from which my code would profit much more than from this optimization.
EDIT: As chibacity points out below, this is talking about making locks really cheap rather than completely eliminating them. I don't believe the JIT has the concept of "thread-local objects" although I could be mistaken... and even if it doesn't now, it might in the future of course.
EDIT: Okay, the below explanation is over-simplified, but has at least some basis in reality :) See Joe Duffy's blog post for some rather more detailed information.
I can't remember where I read this - probably "CLR via C#" or "Concurrent Programming on Windows" - but I believe that the CLR allocates sync blocks to objects lazily, only when required. When an object whose monitor has never been contested is locked, the object header is atomically updated with a compare-exchange operation to say "I'm locked". If a different thread then tries to acquire the lock, the CLR will be able to determine that it's already locked, and basically upgrade that lock to a "full" one, allocating it a sync block.
When an object has a "full" lock, locking operations are more expensive than locking and unlocking an otherwise-uncontested object.
If I'm right about this (and it's a pretty hazy memory) it should be feasible to lock and unlock a monitor on different threads cheaply, so long as the locks never overlap (i.e. there's no contention).
I'll see if I can dig up some evidence for this...
In answer to your question: no, the CLR/JIT does not perform the "lock elision" optimization, i.e. the CLR/JIT does not remove locks from code which is only visible to a single thread. This can easily be confirmed with simple single-threaded benchmarks on code where lock elision should apply, as you would expect in Java.
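The Java side of such a benchmark might look like this (my own sketch; the lock object never escapes the method, so a JVM with escape analysis can elide the synchronization entirely):

class LockElisionBench {
    static long lockedSum(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Object lock = new Object(); // thread-local: never escapes this method
            synchronized (lock) {       // a candidate for elision
                sum += i;
            }
        }
        return sum;
    }
}

Timing this against the same loop without the synchronized block is the comparison meant above.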
There are likely a number of reasons why it does not do this, but chief among them is the fact that in the .NET framework this is likely to be an uncommonly applicable optimization, so it is not worth the effort of implementing.
Also, in .NET uncontended locks are extremely fast due to the fact that they are non-blocking and executed in user space (JVMs appear to have similar optimizations for uncontended locks, e.g. IBM's). To quote from C# 3.0 in a Nutshell's threading chapter:
"Locking is fast: you can expect to acquire and release a lock in less than 100 nanoseconds on a 3 GHz computer if the lock is uncontended."
A couple of example scenarios where lock elision could be applied, and why it's not worth it:
Using locks within a method in your own code that acts purely on locals
There is not really a good reason to use locking in this scenario in the first place, so unlike optimizations such as hoisting loop invariants or method inlining, this is a pretty uncommon case and the result of unnecessary use of locks. The runtime shouldn't be concerned with optimizing away uncommon and extremely bad usage.
Using someone else's type that is declared as a local which uses locks internally
Although this sounds more useful, the .NET framework's general design philosophy is to leave the responsibility for locking to clients, so it's rare that types have any internal lock usage. Indeed, the .NET framework is pathologically unsynchronized when it comes to instance methods on types that are not specifically designed and advertised to be concurrent. On the other hand, Java has common types that do include synchronization, e.g. StringBuffer and Vector. As the .NET BCL is largely unsynchronized, lock elision is likely to have little effect.
Summary
I think overall, there are fewer cases in .Net where lock elision would kick in, because there are simply not as many places where there would be thread-local locks. Locks are far more likely to be used in places which are visible to multiple threads and therefore should not be elided. Also, uncontended locking is extremely quick.
I had difficulty finding any real-world evidence that lock elision actually provides much of a performance benefit in Java (for example...), and the latest docs for at least the Oracle JVM state that elision is not always applied for thread-local locks, hinting that it is not a guaranteed optimization anyway.
I suspect that lock elision is something that is made possible through the introduction of escape analysis in the JVM, but is not as important for performance as EA's ability to analyze whether reference types can be allocated on the stack.
I read somewhere that x86 processors have cache coherency and can sync the value of fields across multiple cores anyway on each write.
Does that mean that we can code without using the 'volatile' keyword in Java if we plan on running only on x86 processors?
Update:
OK, assuming that we leave out the issue of instruction reordering: can we assume that the issue of an assignment to a non-volatile field not being visible across cores is not present on x86 processors?
No -- the volatile keyword has more implications than just cache coherency; it also places restrictions on what the runtime can and can't do, like delaying constructor calls.
About your update: no, we can't. Other threads could just read stale values without the variable being updated. And there is another problem: the JVM is allowed to optimize code as long as it can guarantee that the single-threaded behavior is correct.
That means that something like:
public boolean var = true;

private void test() {
    while (var) {
        // do something without changing var
    }
}
can be optimized by the JIT to while(true) if it wants to!
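For completeness (my addition, in the spirit of the answer): declaring the field volatile forbids exactly that transformation, forcing every iteration to re-read the field:

public volatile boolean var = true; // the loop must now re-read var on each pass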
There is a big difference between can sync the value of fields and always syncs the value of fields. x86 can sync fields if you have volatile, otherwise it doesn't and shouldn't.
Note: volatile access can be 10-30x slower than non-volatile access, which is a key reason it is not done all the time.
BTW: do you know of any multi-core, plain x86 processors? I would have thought most were x64 with x86 support.
There are very exact specifications on how the JVM should behave for volatile, and if it chooses to implement them using CPU-specific instructions, then good for you.
The only place where you should say "we know that on this platform the CPU behaves like..." is when linking in native code, where it needs to conform to the CPU. In all other cases, write to the specification.
Note that the volatile keyword is very important for writing robust code running on multiple CPUs, each with its own cache, as it tells the JVM to disregard the local cache and get the official value instead of a cached value from five minutes ago. You generally want that.
A write in bytecode doesn't even have to cause a write in machine code, unless it's a volatile write.
I can vouch for volatile having some use. I've been in the situation where one thread has 'null' for a variable and another has the proper value for the variable that was set in that thread. It's not fun to debug. Use volatile for all shared fields :)
How do the Boost Thread libraries compare against the java.util.concurrent libraries?
Performance is critical, so I would prefer to stay with C++ (although Java is a lot faster these days). Given that I have to code in C++, what libraries exist to make threading easy and less error-prone?
I heard recently that as of JDK 1.5, the Java memory model was changed to fix some concurrency issues. How about C++? The last time I did multithreaded programming in C++ was 3-4 years ago, when I used pthreads. However, I don't wish to use those anymore for a large project. The only other alternative I know of is Boost Threads, but I am not sure if it is good. I've heard good things about java.util.concurrent, but nothing yet about Boost threads.
java.util.concurrent and the Boost threads library have overlapping functionality, but java.util.concurrent also provides (a) higher-level abstractions and (b) lower-level functions.
Boost threads provide:
Thread (Java: java.util.Thread)
Locking (Java: java.lang.Object and java.util.concurrent.locks)
Condition Variables (Java: java.lang.Object and java.util.concurrent)
Barrier (Java: CyclicBarrier)
java.util.concurrent also has:
Semaphores
Reader-writer locks
Concurrent data structures, e.g. a BlockingQueue or a concurrent lock-free hash map.
The Executor services as a highly flexible producer-consumer system (see the sketch after this list).
Atomic operations.
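As a quick illustration of that producer-consumer style (my own minimal sketch; the capacity and element type are arbitrary):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ProducerConsumer {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(100);

    void produce(String task) throws InterruptedException {
        queue.put(task); // blocks while the queue is full
    }

    void consumeLoop() throws InterruptedException {
        while (true) {
            String task = queue.take(); // blocks until a task is available
            System.out.println("processing " + task);
        }
    }
}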
A side note: C++ currently has no memory model. The same C++ application may have to deal with a different memory model on a different machine. This makes portable, concurrent programming in C++ even trickier.
Boost threads are a lot easier to use than pthreads, and, in my opinion, slightly easier to use than Java threads. When a boost thread object is instantiated, it launches a new thread. The user supplies a function or function object which will run in that new thread.
It's really as simple as:
boost::thread* thr = new boost::thread(MyFunc()); // launches the new thread immediately
thr->join(); // wait for the thread to finish
delete thr;
You can easily pass data to the thread by storing values inside the function object. And in the latest version of Boost, you can pass a variable number of arguments to the thread constructor itself, which will then be passed to your function object's operator().
You can also use RAII-style locks with boost::mutex for synchronization.
Note that C++0x will use the same syntax for std::thread.
Performance-wise I wouldn't really worry. It is my gut feeling that a Boost/C++ expert could write faster code than a Java expert, but any advantage would have to be fought for.
I prefer Boost's design paradigms to Java's. Java is OO all the way, whereas Boost/C++ allows for OO if you like but uses the most useful paradigm for the problem at hand. In particular, I love RAII when dealing with locks. Java handles memory management beautifully, but sometimes it feels like the rest of the programmer's resources get shafted: file handles, mutexes, DB connections, sockets, etc.
Java's concurrent library is more extensive than Boost's: thread pools, concurrent containers, atomics, etc. But the core primitives are on par with each other: threads, mutexes, condition variables.
So for performance I'd say it's a wash. If you need lots of high-level concurrent library support, Java wins. If you prefer paradigm freedom, C++ does.
If performance is an issue in your multithreaded program, then you should consider a lock-free design.
Lock-free means threads do not compete for a shared resource, and that minimizes switching costs. In that department Java has the better story IMHO, with its concurrent collections: you can rapidly come up with a lock-free solution.
Having used the Boost Thread lib a bit (but not extensively), I can say that your thinking will be influenced by what's available, and that means essentially a locking solution.
Writing a lock-free C++ solution is very difficult, because of the lack of library support and also conceptually because C++ is missing a memory model that guarantees you can write truly immutable objects.
This book is a must-read: Java Concurrency in Practice
If you're targeting a specific platform, then a direct OS call will probably be a little faster than using Boost for C++. I would tend to use ACE, since you can generally make the right calls for your main platform and it will still be platform-independent. Java should be about the same speed so long as you can guarantee that it will be running on a recent version.
In C++ one can directly use pthreads (pthread_create() etc.) if one wants to. Internally, Java uses pthreads via its run-time environment; run "ldd <your JVM binary>" to see.