Why are shared variables cached in CPU caches?

I'm trying to understand the Java Memory Model but have been failing to get a point regarding CPU caches.
As far as I know it, in JVM we have the following locations to store local and shared variables:
local variables -- on thread stack
shared variables -- in memory, but every CPU cache has a copy of it
So my question is: why store local variables on stack, and (cache) shared variables in CPU cache? Why not the other way around (Supposing that CPU cache is too expensive to store both), we cache local variables in CPU caches and just fetch shared variables from memory? Is this part of the Java language design or the computer architecture?
Further: as simple as "CPU cache" sounds, what if several CPUs share one cache? And in systems with multi-level caches, which level of cache will the copy of shared variables be stored in? Further, if more than one thread is running on the same CPU core, does it mean that they share the same set of cached shared variables, and hence, even if the shared variable is not declared volatile, accesses to the variable are still instantly visible to the other threads running on the same CPU?

"Local" and "shared" variables are meaningless outside the context of your code. They don't influence where or even if the state is cached. It's not even useful to think or reason in terms of where your state is stored; the entire reason the JMM exists is so that details like these, which vary from architecture to architecture are not exposed to the programmer. By relying on low-level hardware details, you are asking the wrong questions about the JMM. It's not useful to your application, it makes it fragile, easier to break, harder to reason with, and less portable.
That said, in general, you should assume that any program state, if not all of it, is eligible to be cached. What is cached does not actually matter, just that anything and everything can be, whether it be primitive types or reference types, or even state encapsulated by several fields. Whatever instructions a thread runs (and those instructions vary by architecture too - beware!), those instructions defer to the CPU to determine what is relevant to cache and what is not; it is impossible for programmers to do this themselves (although it is possible to influence where state variables may be cached; see what false sharing is).
Again, we can make some more generalizations about x86: active primitive values are probably kept in registers because P/ALUs will be able to work with them the fastest. Anything else goes, though. It's possible for primitives to be moved to the L1/L2 cache if they are core-local, and it's certainly possible that they will be overwritten quite quickly. The CPU might put state variables in a shared L3 if it thinks that there will be a context switch in the future, or it might not. A hardware expert will need to respond to that.
Ideally, state variables will be stored in the closest storage (register, L1/L2/L3, then main memory) to the processing unit. That's up to the CPU to decide, though. It is impossible to reason about cache semantics at the Java level. Even if hyper-threading is enabled (AMD's equivalent is SMT) and two hardware threads on one core share that core's caches, you cannot rely on it; and even if you could, recall that visibility is not the only problem associated with shared state variables. When the processor performs pipelining, you still need the appropriate instructions to ensure the correct ordering (this is even after you get rid of read/write buffering on the CPU), whether this be hwsync, the appropriate fences, or others.
Again, reasoning about the properties of the cache is not useful, both because the JMM handles that for you and because it is indeterminable, where/when/what is cached. Further, even if you did know the where/when/what questions, you STILL cannot reason about data visibility; all caches treat cached data in the same way anyways, and you will need to rely on the processor updating the cache state between the ME(O)SI states, instruction ordering, load/store buffering, write-back/through, etc... And you still haven't dealt with problems that can occur at the OS and JVM level yet. Again, luckily, the JDK allows you to use basic tools such as volatile, final, and atomics that work consistently across all platforms and produce code that is predictable and easy(er) to reason with.
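As a rough illustration of leaning on those tools instead of on cache behavior, here is a minimal sketch using java.util.concurrent.atomic.AtomicLong. The class name AtomicCounterSketch, the method name, and the thread and iteration counts are made up for this example; the point is only that the total is exact on any platform, with no assumptions about where the counter physically lives.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; names and sizes are illustrative only.
class AtomicCounterSketch {

    // Several threads increment one shared counter; returns the final total.
    static long countWithAtomics(int threads, int incrementsPerThread) {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // wait for every worker before reading the total
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithAtomics(4, 100_000)); // prints 400000
    }
}
```

With a plain long and val++ instead of AtomicLong, the same program would be allowed to lose updates.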

Related

Java instruction reordering and CPU memory reordering

This is a follow up question to
How to demonstrate Java instruction reordering problems?
There are many articles and blogs referring to Java and JVM instruction reordering which may lead to counter-intuitive results in user operations.
When I asked for a demonstration of Java instruction reordering causing unexpected results, several comments were made to the effect that a more general area of concern is memory reordering, and that it would be difficult to demonstrate on an x86 CPU.
Is instruction reordering just a part of a bigger issue of memory reordering, compiler optimizations and memory models? Are these issues really unique to the Java compiler and the JVM? Are they specific to certain CPU types?

Memory reordering is possible without compile-time reordering of operations in source vs. asm. The order of memory operations (loads and stores) to coherent shared cache (i.e. memory) done by a CPU running a thread is also separate from the order it executes those instructions in.
Executing a load is accessing cache (or the store buffer), but "executing" a store in a modern CPU is separate from its value actually being visible to other cores (commit from store buffer to L1d cache). Executing a store is really just writing the address and data into the store buffer; commit isn't allowed until after the store has retired, thus is known to be non-speculative, i.e. definitely happening.
Describing memory reordering as "instruction reordering" is misleading. You can get memory reordering even on a CPU that does in-order execution of asm instructions (as long as it has some mechanisms to find memory-level parallelism and let memory operations complete out of order in some ways), even if asm instruction order matches source order. Thus that term wrongly implies that merely having plain load and store instructions in the right order (in asm) would be useful for anything related to memory order; it isn't, at least on non-x86 CPUs. It's also weird because instructions have effects on registers (at least loads, and on some ISAs with post-increment addressing modes, stores can, too).
It's convenient to talk about something like StoreLoad reordering as x = 1 "happening" after a tmp = y load, but the thing to talk about is when the effects happen (for loads) or are visible to other cores (for stores) in relation to other operations by this thread. But when writing Java or C++ source code, it makes little sense to care whether that happened at compile time or run-time, or how that source turned into one or more instructions. Also, Java source doesn't have instructions, it has statements.
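The StoreLoad case can be written down as the classic store-buffer litmus test. The sketch below is an illustration under assumptions: the class StoreLoadLitmus and its field names are hypothetical, and both x and y are declared volatile, so the Java Memory Model forbids the outcome where both loads see 0. Dropping volatile would make that outcome legal, although actually observing it depends on the JIT and the hardware.

```java
// Hypothetical litmus-test class; names are illustrative only.
class StoreLoadLitmus {
    static volatile int x, y; // volatile: accesses are sequentially consistent
    static int r1, r2;        // results, read only after join()

    // Runs the test repeatedly and counts runs where both loads saw 0.
    static int countBothZero(int runs) {
        int bothZero = 0;
        for (int i = 0; i < runs; i++) {
            x = 0;
            y = 0;
            Thread a = new Thread(() -> { x = 1; r1 = y; }); // store x, then load y
            Thread b = new Thread(() -> { y = 1; r2 = x; }); // store y, then load x
            a.start();
            b.start();
            try {
                a.join();
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (r1 == 0 && r2 == 0) {
                bothZero++; // only possible if a load moved ahead of the earlier store
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println("both zero: " + countBothZero(1000));
    }
}
```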
Perhaps the term could make sense to describe compile-time reordering between bytecode instructions in a .class vs. JIT compiler-generated native machine code, but if so then it's a misuse to use it for memory reordering in general, not just compile/JIT-time reordering excluding run-time reordering. It's not super helpful to highlight just compile-time reordering, unless you have signal handlers (like POSIX) or an equivalent that runs asynchronously in the context of an existing thread.
This effect is not unique to Java at all. (Although I hope this weird use of "instruction reordering" terminology is!) It's very much the same as C++ (and I think C# and Rust for example, probably most other languages that want to normally compile efficiently, and require special stuff in the source to specify when you want your memory operations ordered wrt. each other, and promptly visible to other threads). https://preshing.com/20120625/memory-ordering-at-compile-time/
C++ defines even less than Java about access to non-atomic<> variables without synchronization to ensure that there's never a write in parallel with anything else (undefined behaviour; see footnote 1).
And even present in assembly language, where by definition there's no reordering between source and machine code. All SMP CPUs except a few ancient ones like 80386 also do memory-reordering at run-time, so lack of instruction reordering doesn't gain you anything, especially on machines with a "weak" memory model (most modern CPUs other than x86): https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 is "strongly ordered", but not SC: it's program-order plus a store buffer with store forwarding. So if you want to actually demo the breakage from insufficient ordering in Java on x86, it's either going to be compile-time reordering or lack of sequential consistency via StoreLoad reordering or store-buffer effects. Other unsafe code like the accepted answer on your previous question that might happen to work on x86 will fail on weakly-ordered CPUs like ARM.
(Fun fact: modern x86 CPUs aggressively execute loads out of order, but check to make sure they were "allowed" to do that according to x86's strongly-ordered memory model, i.e. that the cache line they loaded from is still readable, otherwise roll back the CPU state to before that: machine_clears.memory_ordering perf event. So they maintain the illusion of obeying the strong x86 memory-ordering rules. Other ISAs have weaker orders and can just aggressively execute loads out of order without later checks.)
Some CPU memory models even allow different threads to disagree about the order of stores done by two other threads. So the C++ memory model allows that, too, so extra barriers on PowerPC are only needed for sequential consistency (atomic with memory_order_seq_cst, like Java volatile) not acquire/release or weaker orders.
Related:
How does memory reordering help processors and compilers?
How is load->store reordering possible with in-order commit? - memory reordering on in-order CPUs via other effects, like scoreboarding loads with a cache that can do hit-under-miss, and/or out-of-order commit from the store buffer, on weakly-ordered ISAs that allow this. (Also LoadStore reordering on OoO exec CPUs that still retire instructions in order, which is actually more surprising than on in-order CPUs which have special mechanisms to allow memory-level parallelism for loads, that OoO exec could replace.)
Are memory barriers needed because of cpu out of order execution or because of cache consistency problem? (basically a duplicate of this; I didn't say much there that's not here)
Are loads and stores the only instructions that gets reordered? (at runtime)
Does an x86 CPU reorder instructions? (yes)
Can a speculatively executed CPU branch contain opcodes that access RAM? - store execution order isn't even relevant for memory ordering between threads, only commit order from the store buffer to L1d cache. A store buffer is essential to decouple speculative exec (including of store instructions) from anything that's visible to other cores. (And from cache misses on those stores.)
Why is integer assignment on a naturally aligned variable atomic on x86? - true in asm, but not safe in C/C++; you need std::atomic<int> with memory_order_relaxed to get the same asm but in portably-safe way.
Globally Invisible load instructions - where does load data come from: store forwarding is possible, so it's more accurate to say x86's memory model is "program order + a store buffer with store forwarding" than to say "only StoreLoad reordering", if you ever care about this core reloading its own recent stores.
Why memory reordering is not a problem on single core/processor machines? - just like the as-if rule for compilers, out-of-order exec (and other effects) have to preserve the illusion (within one core and thus thread) of instructions fully executing one at a time, in program order, with no overlap of their effects. This is basically the cardinal rule of CPU architecture.
LWN: Who's afraid of a big bad optimizing compiler? - surprising things compilers can do to C code that uses plain (non-volatile / non-_Atomic) accesses. This is mostly relevant for the Linux kernel, which rolls its own atomics with inline asm for some things like barriers, but also just C volatile for pure loads / pure stores (which is very different from Java volatile; see footnote 2).
Footnote 1: C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered. In practice for memory ordering, the consequences are often predictable (for experts who are used to looking at compiler-generated asm) depending on the target machine and optimization level, e.g. hoisting loads out of loops breaking spin-wait loops that fail to use atomic. But of course you're totally at the mercy of whatever the compiler happens to do when your program contains UB, not at all something you can rely on.
Caches are coherent, despite common misconceptions
However, all real-world systems that Java or C++ run multiple threads across do have coherent caches; seeing stale data indefinitely in a loop is a result of compilers keeping values in registers (which are thread-private), not of CPU caches not being visible to each other. This is what makes C++ volatile work in practice for multithreading (but don't actually do that because C++11 std::atomic made it obsolete).
Effects like never seeing a flag variable change are due to compilers optimizing global variables into registers, not instruction reordering or cpu caching. You could say the compiler is "caching" a value in a register, but you can choose other wording that's less likely to confuse people that don't already understand thread-private registers vs. coherent caches.
Footnote 2: When comparing Java and C++, also note that C++ volatile doesn't guarantee anything about memory ordering, and in fact in ISO C++ it's undefined behaviour for multiple threads to be writing the same object at the same time even with volatile. Use std::memory_order_relaxed if you want inter-thread visibility without ordering wrt. surrounding code.
(Java volatile is like C++ std::atomic<T> with the default std::memory_order_seq_cst. The volatile keyword itself provides no way to relax that for more efficient atomic stores, although since Java 9 the VarHandle access modes (e.g. setRelease/getAcquire) do. Most algorithms only need acquire/release semantics for their pure-loads and pure-stores, which x86 can do for free. Draining the store buffer for sequential consistency costs extra. Not much compared to inter-thread latency, but significant for per-thread throughput, and a big deal if the same thread is doing a bunch of stuff to the same data without contention from other threads.)
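For comparison with the C++ acquire/release orders discussed here: since Java 9, java.lang.invoke.VarHandle exposes setRelease and getAcquire. The following is a minimal sketch, assuming Java 9 or later; the class ReleaseAcquireSketch and its field names are hypothetical. The release store to ready orders the earlier plain write to payload before it, and the acquire load orders the later plain read after it.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical demo class; names are illustrative only.
class ReleaseAcquireSketch {
    private int payload;   // plain field, published via the flag below
    private boolean ready; // accessed only through the READY VarHandle

    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(ReleaseAcquireSketch.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        payload = value;              // plain store
        READY.setRelease(this, true); // release store: earlier writes cannot move after it
    }

    int awaitPayload() {
        while (!((boolean) READY.getAcquire(this))) { // acquire load
            Thread.onSpinWait();
        }
        return payload; // ordered after the acquire load, so it sees the published value
    }

    // One publisher thread, the calling thread acts as the consumer.
    static int demo() {
        ReleaseAcquireSketch box = new ReleaseAcquireSketch();
        Thread publisher = new Thread(() -> box.publish(42));
        publisher.start();
        int seen = box.awaitPayload();
        try {
            publisher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```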

Why do we need the volatile keyword when the core cache synchronization is done on the hardware level?

So I’m currently listening to this talk.
At minute 28:50 the following statement is made: „the fact that on the hardware it could be in main memory, in multiple level 3 caches, in four level 2 caches […] is not your problem. That’s the problem for the hardware designers.“
Yet, in java we have to declare a boolean stopping a thread as volatile, since when another thread calls the stop method, it’s not guaranteed that the running thread will be aware of this change.
Why is this the case, when the hardware level should take care of updating every cache with the correct value?
I’m sure I’m missing something here.
Code in question:
public class App {
    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.start();
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        worker.signalStop();
        System.out.println(worker.isShouldStop());
        System.out.println(worker.getVal());
        System.out.println(worker.getVal());
    }

    static class Worker extends Thread {
        private /*volatile*/ boolean shouldStop = false;
        private long val = 0;

        @Override
        public void run() {
            while (!shouldStop) {
                val++;
            }
            System.out.println("Stopped");
        }

        public void signalStop() {
            this.shouldStop = true;
        }

        public long getVal() {
            return val;
        }

        public boolean isShouldStop() {
            return shouldStop;
        }
    }
}

You are assuming the following:
Compiler doesn't reorder the instructions
CPU performs the loads and stores in the order as specified by your program
Then your reasoning makes sense and this consistency model is called sequential consistency (SC): There is a total order over loads/stores and consistent with the program order of each thread. In simple terms: just some interleaving of the loads/stores. The requirements for SC are a bit more strict, but this captures the essence.
If Java and the CPU were SC, then there would be no purpose in making something volatile.
The problem is that you would get terrible performance. A lot of compiler optimizations rely on rewriting the instructions to something more efficient and this can lead to reordering of loads and stores. It could even decide to optimize-out a load or a store so that it doesn't happen. This is all perfectly fine as long as there is just a single thread involved because the thread will not be able to observe these reordering of loads/stores.
Apart from the compiler, the CPU also likes to reorder loads/stores. Imagine that a CPU needs to make a write, and the cache-line for that write isn't in the right state. The CPU would block, and this would be very inefficient. Since the store is going to be made anyway, it is better to queue the store in a buffer so that the CPU can continue; as soon as the cache-line is returned in the right state, the store is written to the cache-line and thereby committed to the cache. Store buffering is a technique used by a lot of processors (e.g. ARM/X86). One problem with it is that it can lead to an earlier store to some address being reordered with a newer load from a different address. So instead of having a total order over all loads and stores like SC, you only get a total order over all stores. This model is called TSO (Total Store Order) and you can find it on the x86 and SPARC v8/v9. This approach assumes that the stores in the store buffer are going to be written to the cache in program order; but there is also a relaxation possible such that stores in the store buffer to different cache-lines can be committed to the cache in any order; this is called PSO (Partial Store Order) and you can find it on the SPARC v8/v9.
SC/TSO/PSO are strong memory models because every load and store is a synchronization action; so they order surrounding loads/stores. This can be pretty expensive because for most instructions, as long as the data-dependency-order is preserved, any ordering is fine because:
most memory is not shared between different CPUs.
if memory is shared, often there is some external synchronization like an unlock/lock of a mutex or release-store/acquire-load that takes care of synchronization. So the synchronization can be delayed.
CPUs with weak memory models like ARM and Itanium make use of this. They make a separation between plain loads and stores and synchronizing loads/stores. And for plain loads and stores, any ordering is fine. And modern processors execute instructions out of order anyway; there is a lot of parallelism inside a single CPU.
Modern processors do implement cache coherence. The only modern processor that doesn't need to implement cache coherence is a GPU. Cache coherence can be implemented in 2 ways:
for small systems the caches can sniff the bus traffic. This is where you see the MESI protocol. This technique is called sniffing (or snooping).
for larger systems you can have a directory that knows the state of each cache-line and which CPUs are sharing the cache-line and which CPU is owning the cache-line (here there is some MESI-like protocol). And all requests for cache-line go through the directory.
The cache coherence protocol makes sure that the cache-line is invalidated on the other CPUs before a CPU can write to it. Cache coherence will give you a total order of loads/stores on a single address, but will not provide any ordering of loads/stores between different addresses.
Coming back to volatile:
So what volatile does is:
prevent reordering loads and stores by the compiler and CPU.
ensure that a load/store becomes visible; so it will prevent the compiler from optimizing-out a load or store.
the load/store is atomic; so you don't get problems like a torn read/write. This includes compiler behavior like natural alignment of the field.
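The atomicity point can be illustrated with a 64-bit field. The JLS allows a non-volatile long or double write to be split into two 32-bit halves, while volatile long accesses are always atomic. A minimal sketch (the class VolatileLongSketch and its method are hypothetical names): the writer alternates between all bits set and all bits clear, so any other observed value would be a torn read.

```java
// Hypothetical demo class; names are illustrative only.
class VolatileLongSketch {
    static volatile long value = 0L; // volatile: 64-bit reads and writes are atomic

    // Returns true if the reader only ever saw one of the two written values.
    static boolean sawOnlyWholeValues(int writes) {
        value = 0L;
        Thread writer = new Thread(() -> {
            for (int i = 0; i < writes; i++) {
                value = (i % 2 == 0) ? -1L : 0L; // all bits set, then all bits clear
            }
        });
        writer.start();
        boolean ok = true;
        while (writer.isAlive()) {
            long seen = value;
            if (seen != 0L && seen != -1L) {
                ok = false; // half old, half new: a torn read
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(sawOnlyWholeValues(1_000_000)); // prints true
    }
}
```

On common 64-bit JVMs a plain long is unlikely to tear in practice either, but only the volatile version is guaranteed not to.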
I have given you some technical information about what is happening behind the scenes. But to properly understand volatile, you need to understand the Java Memory Model. It is an abstract model that doesn't care about any implementation details as described above. If you would not apply volatile in your example, you would have a data race because a happens-before edge is missing between concurrent conflicting accesses.
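Applied to the example from the question, the missing happens-before edge can be supplied by making the flag volatile. This is a minimal sketch: Worker mirrors the question's class, while the wrapper StopFlagSketch, the helper stopsPromptly, and the 5 second join timeout are assumptions added for the demonstration.

```java
// Hypothetical wrapper around the question's Worker; names are illustrative only.
class StopFlagSketch {

    static class Worker extends Thread {
        private volatile boolean shouldStop = false; // volatile: the loop must re-read it
        private long val = 0;

        @Override
        public void run() {
            while (!shouldStop) {
                val++;
            }
        }

        void signalStop() {
            shouldStop = true; // volatile write, visible to the loop's volatile read
        }

        long getVal() {
            return val; // safe to read from another thread after join()
        }
    }

    // Starts a worker, asks it to stop, and reports whether it terminated.
    static boolean stopsPromptly() {
        Worker worker = new Worker();
        worker.start();
        try {
            Thread.sleep(10);
            worker.signalStop();
            worker.join(5_000); // generous timeout; the loop should exit almost at once
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("stopped: " + stopsPromptly());
    }
}
```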
A great book on this topic is
A Primer on Memory Consistency and Cache Coherence, Second Edition. You can download it for free.
I can't recommend you any book on the Java Memory Model because it is all explained in an awful manner. Best to get an understanding of memory models in general before diving into the JMM. Probably the best sources are this doctoral dissertation by Jeremy Manson, and Aleksey Shipilëv: One Stop Page.
PS:
There are situations when you don't care about any ordering guarantees, e.g.
stop flag for a thread
progress indicators
blackholes for microbenchmarks.
This is where the VarHandle.getOpaque/setOpaque can be useful. It provides visibility and atomicity, but it doesn't provide any ordering guarantees with respect to other variables. This is mostly a compiler concern. Most engineers will never need this level of control.
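A progress indicator with opaque access might look like the sketch below, assuming Java 9 or later. The class OpaqueProgressSketch and its members are hypothetical names. Opaque mode gives atomic access, per-variable coherence, and eventual visibility, so the monitor sees the counter advance without going backwards, but it implies nothing about any other variable.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical demo class; names are illustrative only.
class OpaqueProgressSketch {
    private long done; // accessed only through the DONE VarHandle

    private static final VarHandle DONE;
    static {
        try {
            DONE = MethodHandles.lookup()
                    .findVarHandle(OpaqueProgressSketch.class, "done", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void work(long units) {
        for (long i = 1; i <= units; i++) {
            DONE.setOpaque(this, i); // publish progress, no ordering with other data
        }
    }

    long progress() {
        return (long) DONE.getOpaque(this);
    }

    // Polls the counter until the job finishes; reports whether it ever went backwards.
    static boolean monitorSeesMonotonicProgress(long units) {
        OpaqueProgressSketch job = new OpaqueProgressSketch();
        Thread worker = new Thread(() -> job.work(units));
        worker.start();
        long last = 0;
        boolean monotonic = true;
        while (last < units) {
            long now = job.progress();
            if (now < last) {
                monotonic = false;
            }
            last = now;
            Thread.onSpinWait();
        }
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return monotonic;
    }

    public static void main(String[] args) {
        System.out.println(monitorSeesMonotonicProgress(1_000_000)); // prints true
    }
}
```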

What you're suggesting is that hardware designers just make the world all ponies and rainbows for you.
They cannot do that - what you want makes the notion of an on-core cache completely impossible. How could a CPU core possibly know that a given memory location needs to be synced up with another core before accessing it any further, short of just keeping the entire cache in sync on a permanent basis, completely invalidating the entire idea of an on-core cache?
If the talk is strongly suggesting that you as a software engineer can just blame hardware engineers for not making life easy for you, it's a horrible and stupid talk. I bet it's presented with a little more nuance than that.
At any rate, you took the wrong lesson from it.
It's a two-way street. The hardware engineering team works together with the JVM team, effectively, to set up a consistent model that is a good equilibrium between 'With these constraints and limited guarantees to the software engineer, the hardware team can make reliable and significant performance improvements' and 'A software engineer can build multicore software with this model without tearing their hair out'.
This happy equilibrium in Java is the JMM (Java Memory Model), which primarily boils down to: all field accesses may have a local thread cache or not, you do not know, and you cannot test if they do. Essentially the JVM has an evil coin and will flip it every time you read a field. Tails, you get the local copy. Heads, it syncs first. The coin is evil in that it is not fair and will land heads throughout development, testing, and the first week, every time, even if you flip it a million times. And then the important potential customer demoes your software and you start getting tails.
The solution is to make the JVM never flip it, and this means you need to establish Happens-Before/Happens-After relationships anytime you have a situation anywhere in your code where one thread writes a field and another reads it. volatile is one way to do it.
In other words, to give hardware engineers something to work with, you, the software engineer, effectively made the promise that you'll establish HB/HA if you care about synchronizing between threads. So that's your part of the 'deal'. Their part of the deal is that the hardware guarantees the behaviour if you keep up your end of the deal, and that the hardware is very very fast.
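volatile is not the only way to keep your end of that deal. As a small sketch (the class StartJoinSketch and its fields are hypothetical), Thread.start() and Thread.join() each establish a happens-before edge, so plain fields are enough here: the write to input before start() is visible inside the thread, and the thread's write to result is visible after join() returns.

```java
// Hypothetical demo class; names are illustrative only.
class StartJoinSketch {
    static int input;  // plain field, written before start()
    static int result; // plain field, read after join()

    static int doubleInBackground(int n) {
        input = n;
        Thread t = new Thread(() -> result = input * 2);
        t.start(); // start() happens-before every action in t
        try {
            t.join(); // every action in t happens-before join() returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(doubleInBackground(21)); // prints 42
    }
}
```

Locks (synchronized, java.util.concurrent.locks) and the java.util.concurrent collections create the same kind of edges.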

Understanding Java volatile visibility

I'm reading about the Java volatile keyword and have confusion about its 'visibility'.
A typical usage of volatile keyword is:
volatile boolean ready = false;
int value = 0;

void publisher() {
    value = 5;
    ready = true;
}

void subscriber() {
    while (!ready) {}
    System.out.println(value);
}
As explained by most tutorials, using volatile for ready makes sure that:
change to ready on publisher thread is immediately visible to other threads (subscriber);
when ready's change is visible to other thread, any variable update preceding to ready (here is value's change) is also visible to other threads;
I understand the 2nd, because volatile variable prevents memory reordering by using memory barriers, so writes before volatile write cannot be reordered after it, and reads after volatile read cannot be reordered before it. This is how ready prevents printing value = 0 in the above demo.
But I have confusion about the 1st guarantee, which is visibility of the volatile variable itself. That sounds a very vague definition to me.
In other words, my confusion is just on SINGLE variable's visibility, not multiple variables' reordering or something. Let's simplify the above example:
volatile boolean ready = false;

void publisher() {
    ready = true;
}

void subscriber() {
    while (!ready) {}
}
If ready is not defined volatile, is it possible that subscriber get stuck infinitely in the while loop? Why?
A few questions I want to ask:
What does 'immediately visible' mean? Write operation takes some time, so after how long can other threads see volatile's change? Can a read in another thread that happens very shortly after the write starts but before the write finishes see the change?
Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, then why do we need volatile here?
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads. That doesn't sound a correct explain.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
Hope I explained my question clearly.

Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, so what can volatile help here?
That doesn't help you. You aren't writing code for a modern CPU; you are writing code for a Java virtual machine, which is allowed to present a virtual CPU whose virtual CPU caches are not coherent.
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads. That doesn't sound a correct explain.
That is correct. But understand, that's with respect to the virtual machine that you are coding for. Its memory may well be implemented in your physical CPU's caches. That may allow your machine to use the caches and still have the memory visibility required by the Java specification.
Using volatile may ensure that writes go directly to the virtual machine's memory instead of the virtual machine's virtual CPU cache. The virtual machine's CPU cache does not need to provide visibility between threads because the Java specification doesn't require it to.
You cannot assume that characteristics of your particular physical hardware necessarily provide benefits that Java code can use directly. Instead, the JVM trades off those benefits to improve performance. But that means your Java code doesn't get those benefits.
Again, you are not writing code for your physical CPU, you are writing code for the virtual CPU that your JVM provides. That your CPU has coherent caches allows the JVM to do all kinds of optimizations that boost your code's performance, but the JVM is not required to pass those coherent caches through to your code, and real JVMs do not. Doing so would mean eliminating a significant number of extremely valuable optimizations.

Relevant bits of the language spec:
volatile keyword: https://docs.oracle.com/javase/specs/jls/se16/html/jls-8.html#jls-8.3.1.4
memory model: https://docs.oracle.com/javase/specs/jls/se16/html/jls-17.html#jls-17.4
The CPU cache is not a factor here, as you correctly said.
This is more about optimizations. If ready is not volatile, the compiler is free to interpret
// this
while (!ready) {}
// as this
if (!ready) while(true) {}
That's certainly an optimization, it has to evaluate the condition fewer times. The value is not changed in the loop, it can be "reused". In terms of single-thread semantics it is equivalent, but it won't do what you wanted.
That's not to say this would always happen. Compilers are free to do that, they don't have to.

If ready is not defined volatile, is it possible that subscriber get stuck infinitely in the while loop?
Yes.
Why?
Because the subscriber may not ever see the results of the publisher's write.
Because ... the JLS does not require the value of a variable to be written to memory ... except to satisfy the specified visibility constraints.
What does 'immediately visible' mean? Write operation takes some time, so after how long can other threads see volatile's change? Can a read in another thread that happens very shortly after the write starts but before the write finishes see the change?
(I think) that the JMM specifies or assumes that it is physically impossible to read and write the same conceptual memory cell at the same time. So operations on a memory cell are time ordered. Immediately visible means visible in the next possible opportunity to read following the write.
Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, so what can volatile help here?
Compilers typically generate code that holds variables in registers, and only writes the values to memory when necessary. Declaring a variable as volatile means that the value must be written to memory. If you take this into consideration, you cannot rely on just the (hypothetical or actual) behavior of cache implementations to specify what volatile means.
While current generation modern CPU / cache architectures behave that way, there is no guarantee that all future computers will behave that way.
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads.
Some people say that is incorrect ... for CPUs that implement a cache coherency protocol. However, that is beside the point, because as I described above, the current value of a variable may not yet have been written to the cache yet. Indeed, it may never be written to the cache.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
So let's assume that your diagram shows physical time and represents threads running on different physical cores, reading and writing a cache-coherent memory cell via their respective caches.
What would happen at the physical level would depend on how the cache-coherency is implemented.
I would expect Reader 1 to see the previous state of the cell (if it was available from its cache) or the new state if it wasn't. Reader 2 would see the new state. But it also depends on how long it takes for the writer thread's cache invalidation to propagate to the others' caches. And all sorts of other stuff that is hard to explain.
In short, we don't really know what would happen at the physical level.
But on the other hand, the writer and readers in the above picture can't actually observe the physical time like that anyway. And neither can the programmer.
What the program / programmer sees is that the reads and writes DO NOT OVERLAP. When the necessary happens-before relations are present, there will be guarantees about visibility of memory writes by one thread to subsequent¹ reads by another. This applies to volatile variables, and to various other things.
How that guarantee is implemented is not your problem. And it really doesn't help if you do understand what is going on at the hardware level, because you don't actually know what code the JIT compiler is going to emit (today!) anyway.
1 - That is, subsequent according to the synchronization order ... which you could view as a logical time. The JLS Memory model doesn't actually talk about time at all.

Answers to your 3 questions:
A volatile write doesn't need to be 'immediately' visible to a volatile load. A correctly synchronized Java program will behave as if it is sequentially consistent, and for sequential consistency the real-time order of loads/stores isn't relevant. So reads and writes can be skewed as long as the program order isn't violated (or as long as nobody can observe it). Linearizability = sequential consistency + respecting the real-time order. For more info see this answer.
I still need to dig into the exact meaning of visible, but AFAIK it is mostly a compiler level concern because hardware will prevent buffering loads/stores indefinitely.
You are completely right about the articles being wrong. A lot of nonsense is written, and 'flushing volatile writes to main memory instead of using the cache' is the most common misunderstanding I'm seeing. I think 50% of all my SO comments are about informing people that caches are always coherent. A great book on the topic is 'A primer on memory consistency and cache coherence 2e', which is available for free.
The informal semantics of the Java Memory model contains 3 parts:
atomicity
visibility
ordering
Atomicity is about making sure that a read/write/rmw happens atomically in the global memory order. So nobody can observe some in between state. This deals with access atomicity like torn read/write, word tearing and proper alignment. It also deals with operational atomicity like rmw.
IMHO it should also deal with store atomicity, i.e. making sure that there is a single point in time where the store becomes visible to all cores. If you take the X86, for example, then due to store buffering (store-to-load forwarding), a store can become visible to the issuing core earlier than to other cores, and you have a violation of atomicity. But I haven't seen it being mentioned in the JMM.
Visibility: this deals mostly with preventing compiler optimizations since the hardware will prevent delaying loads and buffering stores indefinitely. In some literature they also throw ordering of surrounding loads/stores under visibility; but I don't believe this is correct.
Ordering: this is the bread and butter of memory models. It will make sure that loads/stores issued by a single processor don't get reordered. In the first example you can see the need for such behavior. This is the realm of the compiler barriers and cpu memory barriers.
For more info see:
https://download.oracle.com/otndocs/jcp/memory_model-1.0-pfd-spec-oth-JSpec/
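As a small illustration of the operational atomicity (rmw) part, here is a sketch using AtomicInteger; the class and method names are invented for this example. A `volatile int` incremented with `++` is a separate load and store, so concurrent increments can be lost, whereas `incrementAndGet` performs the read-modify-write as one atomic step.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RmwDemo {
    // Runs `threads` workers that each increment a shared counter
    // `incrementsPerThread` times, and returns the final count.
    static int countWithAtomic(int threads, int incrementsPerThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join gives happens-before for the final read
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithAtomic(4, 10_000));
    }
}
```

With the atomic counter no increment is lost, so the result is always threads × incrementsPerThread.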

I'll just touch on this part :
change to ready on publisher thread is immediately visible to other threads
that is not correct and the articles are wrong. The documentation makes a very clear statement here:
A write to a volatile field happens-before every subsequent read of that field.
The complicated part here is subsequent. In plain English this means that when someone sees ready as being true, it will also see value as being 5. This implies that a reader first has to observe ready as true, and it may well observe false for some time before that. So this is not "immediately".
What people confuse this with, is the fact that volatile offers sequential consistency, which means that if someone has observed ready == true, then everyone will also (unlike release/acquire, for example).
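A minimal sketch of that guarantee, reusing the names ready and value from the question (the surrounding class and methods are assumptions made for illustration): the reader may spin for a while, but once it has seen ready == true it must also see value == 5.

```java
class Publication {
    private int value;               // plain, non-volatile field
    private volatile boolean ready;  // volatile guard

    void publish() {
        value = 5;     // ordinary write ...
        ready = true;  // ... published by the volatile write that follows it
    }

    // Spins until ready is observed as true, then returns value.
    int awaitValue() {
        while (!ready) {         // may observe false many times; not "immediate"
            Thread.onSpinWait();
        }
        return value;            // happens-before guarantees this reads 5
    }

    // Publishes from a second thread and reads from the calling thread.
    static int demo() {
        Publication p = new Publication();
        Thread publisher = new Thread(p::publish);
        publisher.start();
        return p.awaitValue();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Nothing here says how long the spin lasts; the guarantee is only about what is seen after ready has been observed.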

cost of volatile read in java when write are infrequent [duplicate]

I know that writing to a volatile variable flushes it from the memory of all the cpus, however I want to know if reads to a volatile variable are as fast as normal reads?
Can volatile variables ever be placed in the cpu cache or is it always fetched from the main memory?

You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, even on x86.
Test 1 is a parallel read and write to a non-volatile variable. There is no visibility mechanism and the results of the reads are potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However, it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. It demonstrates that, because the variable is volatile, its value may change on each loop iteration, so the JVM cannot optimize the read by hoisting it out of the loop. In Test 1, it is likely the value was read and stored once, so no actual "read" occurs inside the loop.
Credit to Marc Brooker for running these tests.

The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: There is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy handed locking.
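One such case, sketched under the assumption that writers replace the whole state and never depend on the previous value (the SettingsHolder name and its methods are hypothetical): a volatile reference to an immutable snapshot lets readers proceed without any lock.

```java
import java.util.Map;

class SettingsHolder {
    // Readers always see either the old or the new immutable snapshot.
    private volatile Map<String, String> settings = Map.of();

    // Replaces the whole snapshot; the volatile write publishes the new map.
    void replace(Map<String, String> newSettings) {
        settings = Map.copyOf(newSettings); // immutable copy (Java 10+)
    }

    // Lock-free read: a single volatile load followed by a map lookup.
    String get(String key) {
        return settings.get(key);
    }

    public static void main(String[] args) {
        SettingsHolder holder = new SettingsHolder();
        holder.replace(Map.of("mode", "fast"));
        System.out.println(holder.get("mode"));
    }
}
```

If an update needed to read the current map and modify it (a read-modify-write), volatile alone would not be enough and a lock or an atomic class would be the safer choice.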

It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.

Volatile reads cannot be as quick, especially on multi-core CPUs (but also on single-core ones).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!

volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guarantee consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.

What does flushing thread local memory to global memory mean?

I am aware that the purpose of volatile variables in Java is that writes to such variables are immediately visible to other threads. I am also aware that one of the effects of a synchronized block is to flush thread-local memory to global memory.
I have never fully understood the references to 'thread-local' memory in this context. I understand that data which only exists on the stack is thread-local, but when talking about objects on the heap my understanding becomes hazy.
I was hoping to get comments on the following points:
When executing on a machine with multiple processors, does flushing thread-local memory simply refer to the flushing of the CPU cache into RAM?
When executing on a uniprocessor machine, does this mean anything at all?
If it is possible for the heap to have the same variable at two different memory locations (each accessed by a different thread), under what circumstances would this arise? What implications does this have to garbage collection? How aggressively do VMs do this kind of thing?
(EDIT: adding question 4) What data is flushed when exiting a synchronized block? Is it everything that the thread has locally? Is it only writes that were made inside the synchronized block?
Object x = goGetXFromHeap(); // x.f is 1 here
Object y = goGetYFromHeap(); // y.f is 11 here
Object z = goGetZFromHeap(); // z.f is 111 here
y.f = 12;
synchronized(x)
{
    x.f = 2;
    z.f = 112;
}
// will only x be flushed on exit of the block?
// will the update to y get flushed?
// will the update to z get flushed?
Overall, I think I am trying to understand whether thread-local means memory that is physically accessible by only one CPU or if there is logical thread-local heap partitioning done by the VM?
Any links to presentations or documentation would be immensely helpful. I have spent time researching this, and although I have found lots of nice literature, I haven't been able to satisfy my curiosity regarding the different situations & definitions of thread-local memory.
Thanks very much.

The flush you are talking about is known as a "memory barrier". It means that the CPU makes sure that what it sees of the RAM is also viewable from other CPU/cores. It implies two things:
The JIT compiler flushes the CPU registers. Normally, the code may keep a copy of some globally visible data (e.g. instance field contents) in CPU registers. Registers cannot be seen from other threads. Thus, half the work of synchronized is to make sure that no such cache is maintained.
The synchronized implementation also performs a memory barrier to make sure that all the changes to RAM from the current core are propagated to main RAM (or that at least all other cores are aware that this core has the latest values -- cache coherency protocols can be quite complex).
The second job is trivial on uniprocessor systems (I mean, systems with a single CPU which has a single core), but uniprocessor systems are becoming rarer nowadays.
As for thread-local heaps, this can theoretically be done, but it is usually not worth the effort because nothing tells what parts of the memory are to be flushed with a synchronized. This is a limitation of the threads-with-shared-memory model: all memory is supposed to be shared. At the first encountered synchronized, the JVM should then flush all its "thread-local heap objects" to the main RAM.
Yet recent JVMs from Sun can perform an "escape analysis" in which the JVM succeeds in proving that some instances never become visible from other threads. This is typical of, for instance, StringBuilder instances created by javac to handle concatenation of strings. If the instance is never passed as a parameter to other methods then it does not become "globally visible". This makes it eligible for a thread-local heap allocation, or even, under the right circumstances, for stack-based allocation. Note that in this situation there is no duplication; the instance is not in "two places at the same time". It is only that the JVM can keep the instance in a private place which does not incur the cost of a memory barrier.
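On the fourth question (what is flushed when leaving a synchronized block), the memory model does not single out individual objects: everything the thread wrote before releasing the monitor is visible to a thread that subsequently acquires the same monitor. The following sketch mirrors the x, y, z example from the question; the Box class and the done flag are assumptions added to make it runnable.

```java
class MonitorVisibility {
    static class Box { int f; }

    private static boolean done; // guarded by the monitor of x

    // Returns {x.f, y.f, z.f} as seen by a reader that locks the same monitor.
    static int[] demo() {
        Box x = new Box(); x.f = 1;
        Box y = new Box(); y.f = 11;
        Box z = new Box(); z.f = 111;
        done = false;

        Thread writer = new Thread(() -> {
            y.f = 12;              // written before entering the block
            synchronized (x) {
                x.f = 2;
                z.f = 112;
                done = true;
            }                      // unlock: all writes above are released
        });
        writer.start();

        while (true) {
            synchronized (x) {     // lock on the same monitor
                if (done) {
                    // The writer's unlock happens-before this lock, so the
                    // updates to x, y and z are all visible here.
                    return new int[] { x.f, y.f, z.f };
                }
            }
            Thread.onSpinWait();
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```

The guarantee only holds between threads that synchronize on the same monitor; a reader that does not lock x gets no such promise.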

It is really an implementation detail if the current content of the memory of an object that is not synchronized is visible to another thread.
Certainly, there are limits, in that all memory is not kept in duplicate, and not all instructions are reordered, but the point is that the underlying JVM has the option if it finds it to be a more optimized way to do that.
The thing is that the heap is really "properly" stored in main memory, but accessing main memory is slow compared to accessing the CPU's cache or keeping the value in a register inside the CPU. By requiring that the value be written out to memory (which is what synchronization does, at least when the lock is released), it is forcing the write to main memory. If the JVM is free to ignore that, it can gain performance.
In terms of what will happen on a one CPU system, multiple threads could still keep values in a cache or register, even while executing another thread. There is no guarantee that there is any scenario where a value is visible to another thread without synchronization, although it is obviously more likely. Outside of mobile devices, of course, the single-CPU is going the way of floppy disks, so this is not going to be a very relevant consideration for long.
For more reading, I recommend Java Concurrency in Practice. It is really a great practical book on the subject.

It's not as simple as CPU-Cache-RAM. That's all wrapped up in the JVM and the JIT and they add their own behaviors.
Take a look at The "Double-Checked Locking is Broken" Declaration. It's a treatise on why double-checked locking doesn't work, but it also explains some of the nuances of Java's memory model.
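For reference, the declaration also notes that under the revised memory model (JSR-133, Java 5 and later) the idiom can be repaired by declaring the field volatile. A minimal sketch of that variant, with a hypothetical LazyHolder class:

```java
class LazyHolder {
    // volatile is what makes the idiom safe under the Java 5+ memory model:
    // a reader that sees a non-null reference also sees a fully built object.
    private static volatile LazyHolder instance;

    private LazyHolder() { }

    static LazyHolder getInstance() {
        LazyHolder local = instance;          // one volatile read on the fast path
        if (local == null) {
            synchronized (LazyHolder.class) {
                local = instance;             // re-check under the lock
                if (local == null) {
                    local = new LazyHolder();
                    instance = local;         // volatile write publishes the object
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        System.out.println(getInstance() == getInstance());
    }
}
```

Without the volatile modifier, the original problem described in the declaration applies again.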

One excellent document for highlighting the kinds of problems involved, is the PDF from the JavaOne 2009 Technical Session
This Is Not Your Father's Von Neumann Machine: How Modern Architecture Impacts Your Java Apps
By Cliff Click, Azul Systems; Brian Goetz, Sun Microsystems, Inc.
