I read somewhere that x86 processors have cache coherency and keep the value of fields in sync across multiple cores on every write.
Does that mean we can code without using the 'volatile' keyword in Java if we plan on running only on x86 processors?
Update:
OK, assuming we leave out the issue of instruction reordering, can we assume that the problem of an assignment to a non-volatile field not being visible across cores is absent on x86 processors?
No -- the volatile keyword has more implications than just cache coherency; it also places restrictions on what the runtime can and can't do, such as reordering writes so that an object reference is published before its constructor has finished.
About your update: no, we can't. Other threads could simply read stale values without the variable being updated. And there is another problem: the JVM is allowed to optimize code as long as it can guarantee that the single-threaded behavior is correct.
That means that something like:
public boolean var = true;

private void test() {
    while (var) {
        // do something without changing var
    }
}
can be optimized by the JIT to while (true) if it wants to!
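A minimal sketch of the fix, assuming var really is shared between threads: declaring the field volatile forbids that hoisting, because every iteration must observe the latest write.

public volatile boolean var = true;

private void test() {
    while (var) { // re-read on every iteration; can no longer become while (true)
        // do something without changing var
    }
}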
There is a big difference between "can sync the value of fields" and "always syncs the value of fields". x86 can sync fields if you use volatile; otherwise it doesn't, and shouldn't.
Note: volatile access can be 10-30x slower than non-volatile access, which is a key reason it is not done all the time.
BTW: do you know of any multi-core, plain x86 processors? I would have thought most were x64 with x86 support.
There are very exact specifications on how the JVM should behave for volatile, and if it chooses to implement that using CPU-specific instructions, then good for you.
The only place where you should say "we know that on this platform the CPU behaves like ..." is when linking in native code, where it needs to conform to the CPU. In all other cases, write to the specification.
Note that the volatile keyword is very important for writing robust code running on multiple CPUs, each with its own cache, as it tells the JVM to disregard the local cache and get the official value instead of a cached value from 5 minutes ago. You generally want that.
A write in bytecode doesn't even have to cause a write in machine code, unless it's a volatile write.
I can vouch for volatile having some use. I've been in the situation where one thread has 'null' for a variable and another has the proper value for the variable that was set in that thread. It's not fun to debug. Use volatile for all shared fields :)
I'm reading about the Java volatile keyword and am confused about its 'visibility'.
A typical usage of volatile keyword is:
volatile boolean ready = false;
int value = 0;

void publisher() {
    value = 5;
    ready = true;
}

void subscriber() {
    while (!ready) {}
    System.out.println(value);
}
As explained by most tutorials, using volatile for ready makes sure that:
the change to ready on the publisher thread is immediately visible to other threads (the subscriber);
when ready's change is visible to another thread, any variable update preceding the write to ready (here, the write to value) is also visible to that thread;
I understand the 2nd, because a volatile variable prevents memory reordering via memory barriers: writes before a volatile write cannot be reordered after it, and reads after a volatile read cannot be reordered before it. This is how ready prevents printing value = 0 in the above demo.
But I have confusion about the 1st guarantee, which is visibility of the volatile variable itself. That sounds like a very vague definition to me.
In other words, my confusion is just on SINGLE variable's visibility, not multiple variables' reordering or something. Let's simplify the above example:
volatile boolean ready = false;

void publisher() {
    ready = true;
}

void subscriber() {
    while (!ready) {}
}
If ready is not declared volatile, is it possible that the subscriber gets stuck infinitely in the while loop? Why?
A few questions I want to ask:
What does 'immediately visible' mean? A write operation takes some time, so after how long can other threads see the volatile's change? Can a read in another thread that starts very shortly after the write begins, but before the write finishes, see the change?
Visibility on modern CPUs is guaranteed by the cache coherence protocol (e.g. MESI) anyway, so why do we need volatile here?
Some articles say a volatile variable uses memory directly instead of the CPU cache, which guarantees visibility between threads. That doesn't sound like a correct explanation.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
Hope I explained my question clearly.
Visibility on modern CPUs is guaranteed by the cache coherence protocol (e.g. MESI) anyway, so how does volatile help here?
That doesn't help you. You aren't writing code for a modern CPU; you are writing code for a Java virtual machine that is allowed to have a virtual CPU whose virtual caches are not coherent.
Some articles say a volatile variable uses memory directly instead of the CPU cache, which guarantees visibility between threads. That doesn't sound like a correct explanation.
That explanation is correct, but with respect to the virtual machine you are coding for. Its memory may well be implemented in your physical CPU's caches. That may allow your machine to use the caches and still have the memory visibility required by the Java specification.
Using volatile may ensure that writes go directly to the virtual machine's memory instead of the virtual machine's virtual CPU cache. The virtual machine's CPU cache does not need to provide visibility between threads because the Java specification doesn't require it to.
You cannot assume that characteristics of your particular physical hardware necessarily provide benefits that Java code can use directly. Instead, the JVM trades off those benefits to improve performance. But that means your Java code doesn't get those benefits.
Again, you are not writing code for your physical CPU; you are writing code for the virtual CPU that your JVM provides. That your CPU has coherent caches allows the JVM to do all kinds of optimizations that boost your code's performance, but the JVM is not required to pass those coherent caches through to your code, and real JVMs do not. Doing so would mean eliminating a significant number of extremely valuable optimizations.
Relevant bits of the language spec:
volatile keyword: https://docs.oracle.com/javase/specs/jls/se16/html/jls-8.html#jls-8.3.1.4
memory model: https://docs.oracle.com/javase/specs/jls/se16/html/jls-17.html#jls-17.4
The CPU cache is not a factor here, as you correctly said.
This is more about optimizations. If ready is not volatile, the compiler is free to interpret
// this
while (!ready) {}

// as this
if (!ready) while (true) {}
That's certainly an optimization: the condition has to be evaluated fewer times. The value is not changed in the loop, so it can be "reused". In terms of single-threaded semantics it is equivalent, but it won't do what you wanted.
That's not to say this would always happen. Compilers are free to do that; they don't have to.
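For illustration, here is a self-contained sketch of that hazard (the class name and the one-second delay are made up; whether it actually hangs depends on the JIT and the platform):

public class StaleFlagDemo {
    static boolean ready = false; // not volatile: the loop check below may be hoisted

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { } // may be compiled into an infinite loop
            System.out.println("saw ready == true");
        });
        reader.start();
        Thread.sleep(1000); // give the JIT time to compile the hot loop
        ready = true;       // without volatile, this write may never be observed
        reader.join();      // may block forever
    }
}

Declaring ready as volatile makes the loop terminate reliably.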
If ready is not declared volatile, is it possible that the subscriber gets stuck infinitely in the while loop?
Yes.
Why?
Because the subscriber may not ever see the results of the publisher's write.
Because ... the JLS does not require the value of a variable to be written to memory ... except to satisfy the specified visibility constraints.
What does 'immediately visible' mean? A write operation takes some time, so after how long can other threads see the volatile's change? Can a read in another thread that starts very shortly after the write begins, but before the write finishes, see the change?
(I think) the JMM specifies, or assumes, that it is physically impossible to read and write the same conceptual memory cell at the same time. So operations on a memory cell are time-ordered. Immediately visible means visible at the next possible opportunity to read following the write.
Visibility on modern CPUs is guaranteed by the cache coherence protocol (e.g. MESI) anyway, so how does volatile help here?
Compilers typically generate code that holds variables in registers, and only writes the values to memory when necessary. Declaring a variable as volatile means that the value must be written to memory. If you take this into consideration, you cannot rely on just the (hypothetical or actual) behavior of cache implementations to specify what volatile means.
While current generation modern CPU / cache architectures behave that way, there is no guarantee that all future computers will behave that way.
Some articles say a volatile variable uses memory directly instead of the CPU cache, which guarantees visibility between threads.
Some people say that is incorrect ... for CPUs that implement a cache coherency protocol. However, that is beside the point, because, as I described above, the current value of a variable may not have been written to the cache yet. Indeed, it may never be written to the cache.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
So let's assume that your diagram shows physical time and represents threads running on different physical cores, reading and writing a cache-coherent memory cell via their respective caches.
What would happen at the physical level would depend on how the cache-coherency is implemented.
I would expect Reader 1 to see the previous state of the cell (if it was available from its cache) or the new state if it wasn't. Reader 2 would see the new state. But it also depends on how long it takes for the writer thread's cache invalidation to propagate to the others' caches. And all sorts of other stuff that is hard to explain.
In short, we don't really know what would happen at the physical level.
But on the other hand, the writer and readers in the above picture can't actually observe the physical time like that anyway. And neither can the programmer.
What the program / programmer sees is that the reads and writes DO NOT OVERLAP. When the necessary happens-before relations are present, there will be guarantees about the visibility of memory writes by one thread to subsequent1 reads by another. This applies to volatile variables, and to various other things.
How that guarantee is implemented is not your problem. And it really doesn't help if you do understand what is going on at the hardware level, because you don't actually know what code the JIT compiler is going to emit (today!) anyway.
1 - That is, subsequent according to the synchronization order ... which you could view as a logical time. The JLS Memory model doesn't actually talk about time at all.
Answers to your 3 questions:
A volatile write doesn't need to be 'immediately' visible to a volatile load. A correctly synchronized Java program will behave as if it is sequentially consistent, and for sequential consistency the real-time order of loads/stores isn't relevant. So reads and writes can be skewed as long as the program order isn't violated (or as long as nobody can observe the skew). Linearizability = sequential consistency + respecting the real-time order. For more info see this answer.
I still need to dig into the exact meaning of visible, but AFAIK it is mostly a compiler-level concern, because the hardware will prevent loads/stores from being buffered indefinitely.
You are completely right about the articles being wrong. A lot of nonsense is written, and 'flushing volatile writes to main memory instead of using the cache' is the most common misunderstanding I'm seeing. I think 50% of all my SO comments are about informing people that caches are always coherent. A great book on the topic is 'A Primer on Memory Consistency and Cache Coherence, 2e', which is available for free.
The informal semantics of the Java Memory model contains 3 parts:
atomicity
visibility
ordering
Atomicity is about making sure that a read/write/RMW (read-modify-write) happens atomically in the global memory order, so nobody can observe an in-between state. This deals with access atomicity, like torn reads/writes, word tearing and proper alignment. It also deals with operational atomicity, like RMW.
IMHO it should also deal with store atomicity: making sure that there is a point in time where the store becomes visible to all cores. On the x86, for example, due to store buffering a store can become visible to the issuing core earlier than to other cores, and you have a violation of store atomicity. But I haven't seen it mentioned in the JMM.
Visibility: this deals mostly with preventing compiler optimizations since the hardware will prevent delaying loads and buffering stores indefinitely. In some literature they also throw ordering of surrounding loads/stores under visibility; but I don't believe this is correct.
Ordering: this is the bread and butter of memory models. It will make sure that loads/stores issued by a single processor don't get reordered. In the first example you can see the need for such behavior. This is the realm of the compiler barriers and cpu memory barriers.
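A classic litmus test for the ordering part, sketched in Java (the class and method names are illustrative; each method would run on its own thread):

class OrderingLitmus {
    static int x = 0, y = 0; // declare these volatile to forbid the 0/0 outcome
    static int r1, r2;

    static void thread1() { x = 1; r1 = y; }
    static void thread2() { y = 1; r2 = x; }
    // Without volatile, store buffering and reordering permit r1 == 0 && r2 == 0,
    // an outcome that no interleaving of the two threads can produce.
}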
For more info see:
https://download.oracle.com/otndocs/jcp/memory_model-1.0-pfd-spec-oth-JSpec/
I'll just touch on this part:
change to ready on publisher thread is immediately visible to other threads
That is not correct, and the articles are wrong. The documentation makes a very clear statement here:
A write to a volatile field happens-before every subsequent read of that field.
The complicated part here is subsequent. In plain English this means that when someone sees ready as being true, it will also see value as being 5. But this only applies once you actually observe ready to be true; until you do, you might observe something else. So this is not "immediately".
What people confuse this with is the fact that volatile offers sequential consistency, which means that if someone has observed ready == true, then everyone else will too (unlike release/acquire, for example).
I'm trying to understand the Java Memory Model but have been failing to grasp a point regarding CPU caches.
As far as I know it, in JVM we have the following locations to store local and shared variables:
local variables -- on thread stack
shared variables -- in memory, but every CPU cache has a copy of it
So my question is: why store local variables on the stack, and (cache) shared variables in the CPU cache? Why not the other way around (supposing that the CPU cache is too expensive to store both): cache local variables in CPU caches and just fetch shared variables from memory? Is this part of the Java language design or of the computer architecture?
Further: as simple as "CPU cache" sounds, what if several CPUs share one cache? And in systems with multi-level caches, which level of cache will the copy of shared variables be stored in? Further, if more than one thread is running on the same CPU core, does that mean they share the same set of cached shared variables, and hence, even if a shared variable is not declared volatile, accesses to it are still instantly visible to the other threads running on the same core?
"Local" and "shared" variables are meaningless outside the context of your code. They don't influence where or even if the state is cached. It's not even useful to think or reason in terms of where your state is stored; the entire reason the JMM exists is so that details like these, which vary from architecture to architecture are not exposed to the programmer. By relying on low-level hardware details, you are asking the wrong questions about the JMM. It's not useful to your application, it makes it fragile, easier to break, harder to reason with, and less portable.
That said, in general, you should assume that any program state, if not all states, are eligible to be cached. The fact is that what is cached does not actually matter, just that anything and everything can be, whether it be primitive types or reference types, or even state variables encapsulated by several fields. Whatever instructions a thread runs (and those instructions vary by architecture too - beware!), those instructions default back on the CPU to determine what is relevant to cache and what not to cache; it is impossible for programmers to do this themselves (although it is possible to influence where state variables may be cached , see what false sharing is).
Again, we can also make some generalizations about x86: active primitive types are probably put in registers, because P/ALUs will be able to work with them the fastest. Anything else goes, though. It's possible for primitives to be moved to the L1/L2 cache if they are core-local, and it's certainly possible that they would be overwritten quite quickly. The CPU might put state variables on a shared L3 if it thinks that there will be a context switch in the future, or it might not. A hardware expert would need to answer that.
Ideally, state variables will be stored in the closest cache (register, L1/2/3, then main memory) to the processor unit. That's up to the CPU to decide, though. It is impossible to reason about cache semantics at the Java level. Even if hyper-threading is enabled (I'm not sure what the AMD equivalent is), threads are not allowed to share resources, and even if they were, recall that visibility is not the only problem associated with shared state variables; in the case that the processor performs pipelining, you still need the appropriate instructions to ensure the correct ordering (even after you get rid of read/write buffering on the CPU), whether that be hwsync or the appropriate fences or others.
Again, reasoning about the properties of the cache is not useful, both because the JMM handles that for you and because it is indeterminable where/when/what is cached. Further, even if you did know the where/when/what, you STILL could not reason about data visibility; all caches treat cached data the same way anyway, and you would need to rely on the processor updating the cache state between the ME(O)SI states, instruction ordering, load/store buffering, write-back/through, etc... And you still haven't dealt with problems that can occur at the OS and JVM level yet. Luckily, the JDK gives you basic tools such as volatile, final, and atomics that work consistently across all platforms and produce code that is predictable and easy(er) to reason about.
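As an illustration of those portable tools, a hedged sketch of safe publication that requires no reasoning about caches at all (the Config class and its field are hypothetical):

import java.util.concurrent.atomic.AtomicReference;

final class Config { // final field: safely published once the constructor finishes
    final int timeoutMillis;
    Config(int timeoutMillis) { this.timeoutMillis = timeoutMillis; }
}

class ConfigHolder {
    private final AtomicReference<Config> current = new AtomicReference<>(new Config(100));

    void update(int millis) { current.set(new Config(millis)); } // volatile-write semantics
    Config get() { return current.get(); } // volatile-read semantics
}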
I know that writing to a volatile variable flushes it from the caches of all the CPUs; however, I want to know whether reads of a volatile variable are as fast as normal reads.
Can volatile variables ever be placed in the CPU cache, or are they always fetched from main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, also on x86.
Test 1 is a parallel read and write to a non-volatile variable. There is no visibility mechanism, and the results of the reads are potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However, it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. It demonstrates that volatile semantics allow the value to change on each loop iteration, so the JVM cannot optimize the read by hoisting it out of the loop. In Test 1, it is likely the value was read and cached once, so no actual repeated "read" occurred.
Credit to Marc Brooker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
See the JMM Cookbook from Doug Lea, in particular the architecture table near the bottom.
To clarify: there is no additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers: LoadLoad, LoadStore, StoreLoad, and StoreStore. Depending on the architecture, some of these barriers correspond to a no-op, meaning no action is taken; others require a fence. There is no implicit cost associated with the load itself, though one may be incurred if a fence is in place. In the case of x86, only a StoreLoad barrier results in a fence.
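Since Java 9, these barriers are exposed (approximately) as explicit fences on java.lang.invoke.VarHandle, which makes the correspondence concrete; a sketch (the class name is illustrative):

import java.lang.invoke.VarHandle;

class FenceSketch {
    static void fences() {
        VarHandle.loadLoadFence();   // LoadLoad
        VarHandle.storeStoreFence(); // StoreStore
        VarHandle.acquireFence();    // LoadLoad + LoadStore
        VarHandle.releaseFence();    // LoadStore + StoreStore
        VarHandle.fullFence();       // full barrier; includes StoreLoad, the only
                                     // one that compiles to a real fence on x86
    }
}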
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy-handed locking.
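One common such case, as a hedged sketch under the assumption of a single writer and any number of readers (class and method names are illustrative): a shutdown flag, where a volatile read replaces taking a lock on every iteration.

class Worker implements Runnable {
    private volatile boolean running = true; // one writer, many readers: volatile suffices

    public void shutdown() { running = false; } // no lock needed for a lone flag

    @Override
    public void run() {
        while (running) {
            // do a unit of work
        }
    }
}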
It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
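For the mutex case, a minimal spin-lock sketch using AtomicBoolean, which combines volatile read/write semantics with atomic compare-and-set (illustrative, not production code):

import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() { while (!locked.compareAndSet(false, true)) { } } // spin until acquired
    void unlock() { locked.set(false); } // volatile write releases the lock
}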
Volatile reads cannot be as quick, especially on multi-core CPUs (but also on single-core ones).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high-performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guarantee consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached, and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.
I have read about, and know in detail, the implications of the Java volatile and synchronized keywords at the CPU level on SMP-architecture-based CPUs.
A great paper on that subject is here:
http://irl.cs.ucla.edu/~yingdi/web/paperreading/whymb.2010.06.07c.pdf
Now, leave SMP CPUs aside for this question. My question is: how do the volatile and synchronized keywords work on older single-core CPUs? For example a Pentium I/Pro/II/III or an earlier Pentium IV.
I want to know specifically:
1) Are the L1-L2 caches not used to read memory addresses, with all reads and writes performed directly against main memory? If so, why? (Since there is only a single cache and no need for coherency protocols, why can't the cache be used directly by two threads that are time-slicing the single-core CPU?) I'm asking this after reading dozens of internet forum posts about how volatile reads and writes go to/from the "master copy in main memory".
2) Apart from taking a lock on this or the specified object, which is more of a Java platform thing, what other effects does the synchronized keyword have on single-core CPUs (compilers, assembly, execution, cache)?
3) With a non-superscalar CPU (Pentium I), instructions are not reordered. So if that is the case, is the volatile keyword required while running on a Pentium I? (Atomicity, visibility and ordering would be a "no problemo", right, because there is only one cache, one core working on the cache, and no reordering.)
1) Are the L1-L2 caches not used to read memory addresses, with all reads and writes performed directly against main memory?
No. The caches are still enabled. That's not related to SMP.
2) Apart from taking a lock on this or the specified object, which is more of a Java platform thing, what other effects does the synchronized keyword have on single-core CPUs (compilers, assembly, execution, cache)?
3) Does anything change with respect to a superscalar/non-superscalar (out-of-order) architecture w.r.t. these two keywords?
Gosh, do you have to ask this question about Java? Remember that all things eventually boil down to good ol' fashioned machine instructions. I'm not intimately familiar with the guts of Java synchronization, but as I understand it, synchronized is just syntactic sugar for your typical monitor-style synchronization mechanism. Multiple threads are not allowed in a critical section simultaneously. Instead of simply spinning on a spinlock, the scheduler is leveraged: the waiting threads are put to sleep, and woken back up when the lock can be taken again.
The thing to remember is that even on a single-core, non-SMP system, you still have to worry about OS preemption of threads! These threads can be scheduled on and off of the CPU whenever the OS wants. This is the purpose of the locks, of course.
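A sketch of why, even on one core (the Counter class is illustrative): a preemption between the read and the write of a counter loses updates.

class Counter {
    private int count = 0;

    // Unsafe even on a single core: the thread can be preempted between
    // reading count and writing count + 1, losing an increment.
    void unsafeIncrement() { count++; }

    // synchronized makes the read-modify-write atomic with respect to other
    // threads locking the same monitor, and publishes the result to them.
    synchronized void safeIncrement() { count++; }

    synchronized int get() { return count; }
}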
Again, this question is much better asked in the context of assembly, or even C (whose compiled result can often be directly inferred), as opposed to Java, which has to deal with the VM, JITted code, etc.
Use of volatile only makes sense in multiprocessor systems. Is this wrong?
I'm trying to learn about thread programming, so if you know any good articles/PDFs ... I like material that also mentions a bit about how the operating system works, not just the language's syntax.
No. Volatile can be used in multi-threaded applications. These may or may not run on more than one processor.
volatile is used to ensure all threads see the same copy of the data. If only one thread reads/writes a field, it doesn't need to be volatile. It will work just fine, just be a bit slower.
In Java you don't have much visibility as to the processor architecture, generally you talk in terms of threads and multi-threading.
I suggest Java Concurrency in Practice; it's good whatever your level of knowledge: http://www.javaconcurrencyinpractice.com/
The whole point of using Java is you don't need to know most of the details of how threads work etc. If you learn lots of stuff you don't use you are likely to forget it. ;)
Volatile makes sense in a multithreaded program, and running those threads on a single processor or multiple processors does not make a difference. The volatile keyword is used to tell the JVM (and the Java Memory Model it uses) that it must not reorder or cache the value of a variable marked with this keyword. This guarantees that threads using a volatile variable will never see a stale version of it.
See this link on the Java Memory Model in general for more details. Or this one for information about Volatile.
No. Volatile is used to support concurrency. In certain circumstances, it can be used instead of synchronization.
This article by Brian Goetz really helped me understand volatile. It has several examples of the use of volatile, and it explains under what conditions it can be used.