What is the difference between memory consistency errors and thread interference?
How does the use of synchronization to avoid them differ or not? Please illustrate with an example. I couldn't get this from the Sun Java tutorial. Any recommendation of reading material(s) to understand this purely in the context of Java would be helpful.
Memory consistency errors can't be understood purely in the context of Java. The details of shared-memory behavior on multi-CPU systems are highly architecture-specific, and to make it worse, x86 (where most people coding today learned to code) has pretty programmer-friendly semantics compared to architectures that were designed for multi-processor machines from the beginning (like POWER and SPARC), so most people really aren't used to thinking about memory access semantics.
I'll give a common example of where memory consistency errors can get you into trouble. Assume for this example, that the initial value of x is 3. Nearly all architectures guarantee that if one CPU executes the code:
STORE 4 -> x // x is a memory address
STORE 5 -> x
and another CPU executes
LOAD x
LOAD x
then that CPU will see either 3,3, 3,4, 4,4, 4,5, or 5,5 from the perspective of its two LOAD instructions. Basically, CPUs guarantee that the order of writes to a single memory location is maintained from the perspective of all CPUs, even if the exact time that each of the writes becomes known to other CPUs is allowed to vary.
Where CPUs differ from one another tends to be in the guarantees they make about LOAD and STORE operations involving different memory addresses. Assume for this example that the initial values of both x and y are 4, and that one CPU executes:
STORE 5 -> x // x is a memory address
STORE 5 -> y // y is a different memory address
then another CPU executes
LOAD x
LOAD y
In this example, on some architectures, the second thread can see 4,4, 5,5, 4,5, OR 5,4. Ouch!
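A hedged Java translation of the example above (the class and field names are mine, purely illustrative): without volatile or synchronization, the Java memory model allows the reader thread to observe any of the four combinations, even though the writes happen in program order.

class ReorderingLitmus {
    int x = 4;
    int y = 4;

    void writer() {          // runs on one thread
        x = 5;
        y = 5;
    }

    void reader() {          // runs on another thread
        int a = x;
        int b = y;
        System.out.println(a + "," + b); // may print 4,4 or 4,5 or 5,4 or 5,5
    }
}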
Most architectures deal with memory at the granularity of a 32 or 64 bit word--this means that on a 32 bit POWER/SPARC machine, you can't update a 64-bit integer memory location and safely read it from another thread ever without explicit synchronization. Goofy, huh?
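Here is a hedged sketch of that risk in Java terms (class and variable names are my own): per JLS 17.7, a non-volatile long may be written as two separate 32-bit halves on some JVMs/platforms, so a reader can in principle observe a "torn" value; declaring the field volatile removes that permission. On a modern 64-bit JVM you will most likely never see a torn read, which is exactly the point -- the behavior is platform-specific.

class TornLongDemo {
    static long value = 0L;             // not volatile: tearing is permitted by the JLS
    // static volatile long value = 0L; // volatile would forbid torn reads/writes

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                value = 0L;              // all bits zero
                value = -1L;             // all bits one
            }
        });
        writer.setDaemon(true);
        writer.start();

        for (long i = 0; i < 100_000_000L; i++) {
            long v = value;
            if (v != 0L && v != -1L) {   // a mix of the two halves was observed
                System.out.println("Torn read: 0x" + Long.toHexString(v));
                return;
            }
        }
        System.out.println("No torn read observed (which proves nothing).");
    }
}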
Thread interference is much simpler. The basic idea is that Java doesn't guarantee that a single statement of Java code executes atomically. For example, incrementing a value requires reading the value, incrementing it, then storing it again. So if you have int x = 1 and two threads each execute x++, x can end up as 2 or 3 depending on how the lower-level code interleaved (the lower-level abstract code at work here presumably looks like LOAD x, INCREMENT, STORE x). The basic idea here is that Java code is broken down into smaller atomic pieces and you don't get to make assumptions about how they interleave unless you use synchronization primitives explicitly.
For more information, check out this paper. It's long and dry and written by a notorious asshole, but hey, it's pretty good too. Also check out this (or just google for "double checked locking is broken"). These memory reordering issues reared their ugly heads for many C++/java programmers who tried to get a little bit too clever with their singleton initializations a few years ago.
Thread interference is about threads overwriting each other's statements (say, thread A incrementing a counter and thread B decrementing it at the same time), leading to a situation where the actual value of counter is unpredictable. You avoid them by enforcing exclusive access, one thread at a time.
On the other hand, memory inconsistency is about visibility. Thread A may increment counter, but then thread B may not be aware of this change yet, so it might read some prior value. You avoid them by establishing a happens-before relationship, which is simply "a guarantee that memory writes by one specific statement are visible to another specific statement" (per Oracle).
The article to read on this is "Memory Models: A Case for Rethinking Parallel Languages and Hardware" by Adve and Boehm in the August 2010 vol. 53 number 8 issue of Communications of the ACM. This is available online for Association for Computing Machinery members (http://www.acm.org). It deals with the problem in general and also discusses the Java Memory Model.
For more information on the Java Memory Model, see http://www.cs.umd.edu/~pugh/java/memoryModel/
Memory Consistency problems are normally manifest as broken happens-before relationships.
Time A: Thread 1 sets int i = 1
Time B: Thread 2 sets i = 2
Time C: Thread 1 reads i, but still sees a value of 1, because of any number of reasons that it did not get the most recent stored value in memory.
You prevent this from happening either by using the volatile keyword on the variable, or by using the AtomicX classes from the java.util.concurrent.atomic package. Either of these mechanisms makes sure that no second thread will see a partially modified value, and no one will ever see a value that isn't the most current real value in memory.
(Synchronizing the getter and setter would also fix the problem, but may look strange to other programmers who don't know why you did it, and can also break down in the face of things like binding frameworks and persistence frameworks that use reflection.)
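A minimal sketch of the two fixes mentioned above (the class and field names are illustrative, not from the original): either mark the field volatile, which guarantees that a write is visible to subsequent reads in other threads, or use an AtomicInteger, which gives the same visibility plus atomic compound updates like incrementAndGet().

import java.util.concurrent.atomic.AtomicInteger;

class VisibilityFixes {
    private volatile int i = 1;                           // visibility via volatile
    private final AtomicInteger j = new AtomicInteger(1); // visibility + atomic updates

    void writer() {
        i = 2;        // visible to any later read of i in another thread
        j.set(2);     // likewise for j
    }

    int readI() { return i; }
    int readJ() { return j.get(); }
}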
--
Thread interference is when two threads interleave their operations, munge an object up, and see inconsistent states.
We have a PurchaseOrder object with an itemQuantity and an itemPrice; automatic logic generates the invoice total.
Time 0: Thread 1 sets itemQuantity 50
Time 1: Thread 2 sets itemQuantity 100
Time 2: Thread 1 sets itemPrice 2.50, invoice total is calculated $250
Time 3: Thread 2 sets itemPrice 3, invoice total is calculated at $300
Thread 1 performed an incorrect calculation because some other thread was messing with the object in between his operations.
You address this issue either by using the synchronized keyword, to make sure only one person can perform the entire process at a time, or alternately with a lock from the java.util.concurrent.locks package. Using java.util.concurrent is generally the preferred approach for new programs.
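A sketch of the fix described above (the class and method names are assumptions, not from the original post): the quantity/price/total update is performed as one unit, so another thread cannot change the quantity between the two steps, either with synchronized or with an explicit lock.

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class PurchaseOrder {
    private int itemQuantity;
    private double itemPrice;
    private double invoiceTotal;

    private final Lock lock = new ReentrantLock();

    // Option 1: synchronized keeps the whole calculation exclusive.
    synchronized void updateSynchronized(int quantity, double price) {
        itemQuantity = quantity;
        itemPrice = price;
        invoiceTotal = itemQuantity * itemPrice;
    }

    // Option 2: an explicit lock from java.util.concurrent.locks.
    void updateWithLock(int quantity, double price) {
        lock.lock();
        try {
            itemQuantity = quantity;
            itemPrice = price;
            invoiceTotal = itemQuantity * itemPrice;
        } finally {
            lock.unlock();
        }
    }

    synchronized double getInvoiceTotal() {
        return invoiceTotal;
    }
}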
1. Thread Interference
class Counter {
    private int c = 0;

    public void increment() {
        c++;    // not atomic: read c, add 1, write c back
    }

    public void decrement() {
        c--;    // not atomic either
    }

    public int value() {
        return c;
    }
}
Suppose there are two threads, Thread-A and Thread-B, working on the same Counter instance. Say Thread-A invokes increment() and at the same time Thread-B invokes decrement(). Thread-A reads the value c and increments it by 1. At the same time, Thread-B reads the value (which is 0, because the incremented value has not yet been set by Thread-A), decrements it and sets it to -1. Now Thread-A sets the value to 1.
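A runnable sketch (my own driver code, built around the Counter above) that makes the interference visible: one thread increments 100,000 times while another decrements 100,000 times, so the final value "should" be 0 but usually isn't.

class CounterInterferenceDemo {
    public static void main(String[] args) throws InterruptedException {
        Counter counter = new Counter();

        Thread a = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) counter.increment();
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) counter.decrement();
        });

        a.start();
        b.start();
        a.join();
        b.join();

        System.out.println("value = " + counter.value() + " (expected 0)");
    }
}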
2. Memory Consistency Errors
Memory consistency errors occur when different threads have inconsistent views of the shared data. In the Counter class above, let's say two threads working on the same Counter instance call the increment method to increase the counter's value by 1. There is no guarantee that the changes made by one thread are visible to the other.
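A hedged sketch of how the Counter above can be made safe against both problems (the class name is mine): AtomicInteger makes each update a single atomic operation (no interference) and its reads and writes have volatile semantics (no stale values). Making the methods synchronized would work just as well.

import java.util.concurrent.atomic.AtomicInteger;

class SafeCounter {
    private final AtomicInteger c = new AtomicInteger(0);

    public void increment() { c.incrementAndGet(); }
    public void decrement() { c.decrementAndGet(); }
    public int value()      { return c.get(); }
}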
For more visit this.
First, please note that your source is NOT the best place to learn what you're trying to learn. You will do well reading the papers from #blucz's answer (as well as his answer in general), even if they go beyond the scope of Java. The Oracle trails aren't bad per se, but they simplify matters and gloss over details, so you may find you don't really understand what you've just learned, whether it's useful, and how much.
Now, trying to answer primarily within Java context.
Thread interference
happens when thread operations interleave, that is, mingle. We need two executors (threads) and shared data (place to interfere).
Image by Daniel Stori, from turnoff.us website:
In the image you see that two threads in a GNU/Linux process can interfere with each other. Java threads are essentially Java objects pointing to native threads, and they can also interfere with each other if they operate on the same data (like here, where "Rick" messes up the data - the drawing - of his younger brother).
Memory Consistency Errors - MCE
Crucial points here are memory visibility, happens-before and - brought up by #blucz, hardware.
MCEs are - obviously - situations where memory becomes inconsistent. That is actually a term for humans - for computers the memory is consistent at all times (unless it's corrupted). The "inconsistencies" are something humans "see", because they don't understand what exactly happened and were expecting something else. "Why is it 1? It should be 2?!"
This "perceived inconsistency", this "gap", relates to memory visibility, that is, what different threads see when they look at memory. And therefore what those threads operate on.
You see, while reads from and writes to memory are linear when we reason about code (especially when thinking about how it is executed line by line)... actually they are not. Especially when threads are involved. So, the tutorial you read gives you an example of a counter being incremented by two threads and how thread 2 reads the same value as thread 1. Actual reasons for memory inconsistencies might be optimizations done to your code by javac, the JIT or the hardware memory consistency model (that is, things CPU designers did to speed up their CPUs and make them more efficient). These optimizations include prescient stores and branch prediction, and for now you may think of them as reordering code so that, in the end, it runs faster and uses/wastes fewer CPU cycles. However, to make sure optimizing doesn't go out of control (or too far), some guarantees are made. These guarantees form the "happens-before" relationship, where we can tell that things "happened-before" a given point. Imagine you are running a party and remember that Tom got here BEFORE Suzie, because you know that Rob came after Tom and before Suzie. Rob is the event you use to form a happens-before relationship between the events of Tom and Suzie coming in.
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html#MemoryVisibility
Link above tells you more about memory visibility and what establishes happens-before relationship in Java. It will not come as a surprise, but:
synchronization does
starting a Thread
joining a Thread
the volatile keyword tells you that a write happens-before subsequent reads, that is, reads AFTER the write will not be reordered to be "before" the write, as that would break the "happens-before" relationship.
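A small illustrative sketch (my own, not from the linked page) of two of the happens-before sources listed above, Thread.start() and Thread.join(): the field is deliberately not volatile, yet both reads are guaranteed to see the preceding writes because of those relationships.

class StartJoinHappensBefore {
    static int data = 0;   // deliberately not volatile

    public static void main(String[] args) throws InterruptedException {
        data = 42;         // write before start()

        Thread t = new Thread(() -> {
            // start() happens-before the first action in the new thread,
            // so this read is guaranteed to see 42.
            System.out.println("in thread: " + data);
            data = 43;     // write before the thread terminates
        });

        t.start();
        t.join();

        // every action in t happens-before join() returns, so this read sees 43
        System.out.println("after join: " + data);
    }
}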
Since all that touches memory, hardware is essential. Your platform has its own rules, and while the JVM tries to make them universal by making all platforms behave similarly, even that alone means that on platform A there will be more memory barriers than on platform B.
Your questions
What is the difference between memory consistency errors and thread interference?
MCEs are about visibility of memory to program threads and about NOT having a happens-before relationship between reads and writes, therefore having a gap between what humans think "should be" and what "actually is".
Thread interference is about thread operations overlapping, mingling, interleaving and touching shared data, screwing it up in the process, which may lead to thread A having its nice drawing destroyed by thread B. Interference being harmful usually marks a critical section, which is why synchronization works.
How does the use of synchronization to avoid them differ or not?
Please read also about thin locks, fat locks and thread contention.
Synchronization avoids thread interference by letting only one thread access the critical section; the other thread is blocked (which is costly: thread contention). For MCEs, synchronization establishes happens-before relationships when locking and unlocking the mutex; see the earlier link to the java.util.concurrent package description.
For examples: see both earlier sections.
Related
I'm reading about the Java volatile keyword and have confusion about its 'visibility'.
A typical usage of volatile keyword is:
volatile boolean ready = false;
int value = 0;

void publisher() {
    value = 5;
    ready = true;
}

void subscriber() {
    while (!ready) {}
    System.out.println(value);
}
As explained by most tutorials, using volatile for ready makes sure that:
the change to ready on the publisher thread is immediately visible to other threads (the subscriber);
when the change to ready is visible to another thread, any variable update preceding the write to ready (here, the change to value) is also visible to that thread;
I understand the 2nd, because a volatile variable prevents memory reordering by using memory barriers, so writes before a volatile write cannot be reordered after it, and reads after a volatile read cannot be reordered before it. This is how ready prevents printing value = 0 in the above demo.
But I have confusion about the 1st guarantee, which is the visibility of the volatile variable itself. That sounds like a very vague definition to me.
In other words, my confusion is just on SINGLE variable's visibility, not multiple variables' reordering or something. Let's simplify the above example:
volatile boolean ready = false;

void publisher() {
    ready = true;
}

void subscriber() {
    while (!ready) {}
}
If ready is not defined volatile, is it possible that the subscriber gets stuck infinitely in the while loop? Why?
A few questions I want to ask:
What does 'immediately visible' mean? A write operation takes some time, so after how long can other threads see the volatile's change? Can a read in another thread that happens very shortly after the write starts, but before the write finishes, see the change?
Visibility, for modern CPUs, is guaranteed by the cache coherence protocol (e.g. MESI) anyway, so why do we need volatile here?
Some articles say a volatile variable uses memory directly instead of the CPU cache, which guarantees visibility between threads. That doesn't sound like a correct explanation to me.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
Hope I explained my question clearly.
Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, so what can volatile help here?
That doesn't help you. You aren't writing code for a modern CPU, you are writing code for a Java virtual machine that is allowed to have a virtual CPU whose virtual caches are not coherent.
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads. That doesn't sound a correct explain.
That is correct. But understand, that's with respect to the virtual machine that you are coding for. Its memory may well be implemented in your physical CPU's caches. That may allow your machine to use the caches and still have the memory visibility required by the Java specification.
Using volatile may ensure that writes go directly to the virtual machine's memory instead of the virtual machine's virtual CPU cache. The virtual machine's CPU cache does not need to provide visibility between threads because the Java specification doesn't require it to.
You cannot assume that characteristics of your particular physical hardware necessarily provide benefits that Java code can use directly. Instead, the JVM trades off those benefits to improve performance. But that means your Java code doesn't get those benefits.
Again, you are not writing code for your physical CPU, you are writing code for the virtual CPU that your JVM provides. That your CPU has coherent caches allows the JVM to do all kinds of optimizations that boost your code's performance, but the JVM is not required to pass those coherent caches through to your code and real JVM's do not. Doing so would mean eliminating a significant number of extremely valuable optimizations.
Relevant bits of the language spec:
volatile keyword: https://docs.oracle.com/javase/specs/jls/se16/html/jls-8.html#jls-8.3.1.4
memory model: https://docs.oracle.com/javase/specs/jls/se16/html/jls-17.html#jls-17.4
The CPU cache is not a factor here, as you correctly said.
This is more about optimizations. If ready is not volatile, the compiler is free to interpret
// this
while (!ready) {}
// as this
if (!ready) while(true) {}
That's certainly an optimization: the condition has to be evaluated fewer times. The value is not changed in the loop, so it can be "reused". In terms of single-thread semantics it is equivalent, but it won't do what you wanted.
That's not to say this would always happen. Compilers are free to do that, they don't have to.
If ready is not defined volatile, is it possible that the subscriber gets stuck infinitely in the while loop?
Yes.
Why?
Because the subscriber may not ever see the results of the publisher's write.
Because ... the JLS does not require the value of a variable to be written to memory ... except to satisfy the specified visibility constraints.
What does 'immediately visible' mean? Write operation takes some time, so after how long can other threads see volatile's change? Can a read in another thread that happens very shortly after the write starts but before the write finishes see the change?
(I think) the JMM specifies or assumes that it is physically impossible to read and write the same conceptual memory cell at the same time. So operations on a memory cell are time-ordered. Immediately visible means visible at the next possible opportunity to read following the write.
Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, so what can volatile help here?
Compilers typically generate code that holds variables in registers, and only writes the values to memory when necessary. Declaring a variable as volatile means that the value must be written to memory. If you take this into consideration, you cannot rely on just the (hypothetical or actual) behavior of cache implementations to specify what volatile means.
While current generation modern CPU / cache architectures behave that way, there is no guarantee that all future computers will behave that way.
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads.
Some people say that is incorrect ... for CPUs that implement a cache coherency protocol. However, that is beside the point, because as I described above, the current value of a variable may not have been written to the cache yet. Indeed, it may never be written to the cache.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
So let's assume that your diagram shows physical time and represents threads running on different physical cores, reading and writing a cache-coherent memory cell via their respective caches.
What would happen at the physical level would depend on how the cache-coherency is implemented.
I would expect Reader 1 to see the previous state of the cell (if it was available from its cache) or the new state if it wasn't. Reader 2 would see the new state. But it also depends on how long it takes for the writer thread's cache invalidation to propagate to the others' caches. And all sorts of other stuff that is hard to explain.
In short, we don't really know what would happen at the physical level.
But on the other hand, the writer and readers in the above picture can't actually observe the physical time like that anyway. And neither can the programmer.
What the program / programmer sees is that the reads and writes DO NOT OVERLAP. When the necessary happens before relations are present, there will be guarantees about visibility of memory writes by one thread to subsequent1 reads by another. This applies for volatile variables, and for various other things.
How that guarantee is implemented is not your problem. And it really doesn't help if you do understand what is going on at the hardware level, because you don't actually know what code the JIT compiler is going to emit (today!) anyway.
1 - That is, subsequent according to the synchronization order ... which you could view as a logical time. The JLS Memory model doesn't actually talk about time at all.
Answers to your 3 questions:
A change made by a volatile write doesn't need to be 'immediately' visible to a volatile load. A correctly synchronized Java program will behave as if it is sequentially consistent, and for sequential consistency the real-time order of loads/stores isn't relevant. So reads and writes can be skewed as long as the program order isn't violated (or as long as nobody can observe it). Linearizability = sequential consistency + respect real-time order. For more info see this answer.
I still need to dig into the exact meaning of visible, but AFAIK it is mostly a compiler level concern because hardware will prevent buffering loads/stores indefinitely.
You are completely right about the articles being wrong. A lot of nonsense is written and 'flushing volatile writes to main memory instead of using the cache' is the most common misunderstanding I'm seeing. I think 50% of all my SO comments is about informing people that caches are always coherent. A great book on the topic is 'A primer on memory consistency and cache coherence 2e' which is available for free.
The informal semantics of the Java Memory model contains 3 parts:
atomicity
visibility
ordering
Atomicity is about making sure that a read/write/rmw (read-modify-write) happens atomically in the global memory order. So nobody can observe some in-between state. This deals with access atomicity like torn reads/writes, word tearing and proper alignment. It also deals with operational atomicity like rmw.
IMHO it should also deal with store atomicity: making sure that there is a point in time where the store becomes visible to all cores. If you have for example the X86, then due to store buffering (store-to-load forwarding), a store can become visible to the issuing core earlier than to other cores and you have a violation of atomicity. But I haven't seen it being mentioned in the JMM.
Visibility: this deals mostly with preventing compiler optimizations since the hardware will prevent delaying loads and buffering stores indefinitely. In some literature they also throw ordering of surrounding loads/stores under visibility; but I don't believe this is correct.
Ordering: this is the bread and butter of memory models. It will make sure that loads/stores issued by a single processor don't get reordered. In the first example you can see the need for such behavior. This is the realm of the compiler barriers and cpu memory barriers.
For more info see:
https://download.oracle.com/otndocs/jcp/memory_model-1.0-pfd-spec-oth-JSpec/
I'll just touch on this part :
change to ready on publisher thread is immediately visible to other threads
that is not correct and the articles are wrong. The documentation makes a very clear statement here:
A write to a volatile field happens-before every subsequent read of that field.
The complicated part here is subsequent. In plain English this means that when someone sees ready as being true, it will also see value as being 5. This automatically implies that you need to actually observe ready to be true, and it can happen that you observe something else instead. So this is not "immediately".
What people confuse this with, is the fact that volatile offers sequential consistency, which means that if someone has observed ready == true, then everyone will also (unlike release/acquire, for example).
As threads execute on a multi-processor/multi-core machines, they can cause CPU caches to load data from RAM.
If threads are supposed to 'see' the same data, that is not guaranteed, because thread1 may cause an update in its CPU's cache (i.e. the CPU where it is currently executing) and this change will not be immediately visible to thread2.
To solve this problem, programming languages like Java provide constructs like volatile.
It is clear to me what the problem with multiple threads executing on different CPUs is.
I am pretty sure that a given thread is not bound to a particular CPU for its lifetime and can get scheduled to run on a different CPU. But it is not clear to me why that does not cause problems similar to the one with different threads on different CPUs?
After all this thread may have caused an update in one CPU's cache which is yet to be written to RAM. If this thread now gets scheduled on another CPU wouldn't it have access to stale data?
The only possibility I can think of, as of now, is that context switching of threads involves writing all the data visible to the thread back to RAM, and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent the thread seeing stale values). However, this looks problematic from a performance point of view, as time-slicing means threads are getting scheduled all the time.
Can some expert please advise what the real story is?
Caches on modern CPUs are always coherent. So if a store is performed by one CPU, then a subsequent load on a different CPU will see that store. In other words: the cache is the source of truth, memory is just an overflow bucket and could be completely out of sync with reality. So since the caches are coherent, it doesn't matter on which CPU a thread will run.
Also on a single-CPU system, the lack of volatile can cause problems due to compiler optimizations. A compiler could, for example, hoist a variable out of a loop, and then a write made by one thread will never be seen by another thread, no matter whether it is running on the same CPU.
I would suggest not thinking in term of hardware. If you use Java, make sure you understand the Java Memory Model (JMM). This is an abstract model that prevents thinking in terms of hardware since the JMM needs to run independent of the hardware.
On a single thread, there is a happens-before relationship between actions that take place, regardless of how the scheduling is done. This is enforced by the implementation of the JVM as part of the Java memory model contract promised in the Java Language Specification:
Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.
If we have two actions x and y, we write hb(x, y) to indicate that x happens-before y.
If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).
How exactly this is achieved by the operating system is implementation dependent.
it is not clear to me why that does not cause problems similar to the one with different threads on different CPUs? After all this thread may have caused an update in one CPU's cache which is yet to be written to RAM. If this thread now gets scheduled on another CPU wouldn't it have access to stale data?
Yes, it may have access to stale data, but it is more likely to have data in its cache that is unhelpful – not relevant to the memory that it needs. First off, the permissions from the OS (if written correctly) won't let one program see the data from another – yes, there are many stories about hardware vulnerabilities in the news these days, so I am talking about how it should work. The cache will be cleared if another process gets swapped into a CPU.
Whether or not the cache memory is stale or not is a function of the timing of the cache coherence systems of the architectures or whether or not memory fences are crossed.
context switching of threads involves writing all the data visible to the thread back to RAM and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent thread seeing stale values).
That's pretty close to what happens, although its cache memory is not refreshed when it gets scheduled. When a thread is context-switched out of the CPU, all dirty pages of memory are flushed to RAM. When a thread is swapped into a CPU, the cache is either flushed (if from another process) or contains memory that may not be helpful to the incoming thread. This causes a much higher miss ratio on initial memory accesses, meaning that a thread spends longer accessing memory until the lines it needs are loaded into the cache.
However this looks problematic from performance point of view as time-slicing means threads are getting scheduled all the time.
Yes, there is a performance hit. This highlights why it is so important to properly size your thread-pools. If you have too many threads running CPU-intensive tasks, you can cause a loss in performance because of the context switches. If the threads are waiting for IO then increasing the number of threads is a must, but if you are just calculating something, using fewer threads can result in higher throughput, because each thread can stay on the processor longer and the ratio of cache hits to misses goes up.
For those who might not go through all the comments on the different answers, here is a simplified summary of what I have modelled in my head (please feel free to comment if any point is not correct. I will edit this post)
http://tutorials.jenkov.com/java-concurrency/volatile.html is not accurate and gives rise to questions like this. CPU caches are always coherent. If CPU 1 has written to a memory address X in its cache and CPU 2 reads the same memory address from its own cache (after the fact of CPU 1 writing to its cache), then CPU 2 will read what was written by CPU 1. No special instruction is required to enforce this.
However, modern CPUs also have store buffers. They are used to accumulate writes in order to improve performance (so that these writes can be committed to the cache in their own time, freeing the CPU from waiting for the cache coherence protocol to finish).
Whatever is in the store buffer of a CPU is not yet visible to other CPUs.
In addition, in order to improve performance, CPUs and compilers are free to re-order instructions as long as that does not change the outcome of the computation (from a single thread's point of view).
Also, some compiler optimizations may move a variable completely to CPU registers for a routine, thereby 'hiding' them from shared memory and hence making writes to that variable invisible to other threads.
Points 3, 4, 5 above are the reason why Java exposes keywords like volatile. When you use volatile, the JVM itself does not re-order instructions if that would break the 'happens-before' guarantee. The JVM also asks the CPU not to re-order, by using memory barrier/fence instructions. The JVM also does not use any optimization which would break the 'happens-before' guarantee. Overall, if a write to a volatile variable has already happened, any read thereafter by another thread will see the correct value, not only for that field but also for all fields which were visible to the first thread when it wrote the volatile field.
How does above relate to this question about single thread using different CPUs in its lifetime?
If the thread while executing on a CPU has already written to its cache, nothing more to be considered here. Even if the thread later uses another CPU, it will be able to see its own writes due to cache coherency.
If the thread's writes are waiting in the store buffer and it gets moved out of the CPU, context switching ensures that the thread's writes from the store buffer get committed to the cache. After that it is the same as point 1.
Any state which is only in CPU registers, anyway gets backed up and restored as part of context switching.
Due to the above points, a single thread does not face any problems when it executes over different CPUs during its lifetime.
I just thought of this question and couldn't find an answer
If I pass a static non-primitive variable, say X, into a thread which starts at a later time (so one would assume the thread holds a reference to X), is there any possibility that when the thread starts, instead of reading X from memory, it may use the old value of X from when X was passed to the thread?
or a similar scenario like:
thread A runs and writes X = a to RAM, then gets blocked by IO
thread B reads X = a from RAM, queues a unit of work to A, which would use X
thread A resumes and writes X = b to RAM, then it finishes what it had left
...
thread A resumes and runs that unit of work queued to it, which would use X
Is it possible that X would have the value a?
If so, is it possible for this to happen in mainstream languages like C, C++, Java, C# on mainstream platforms (all versions of the JVM for Java and all versions of .NET and Mono for C#)?
I would not expect this to happen, but I am rather curious whether it would on any sort-of-popular platform, caused maybe by crazy compiler optimizations, caching (always a possibility), extremely cheap hardware, etc.
I think you are a bit confused. Programs don't read data from memory into... something else (seriously, where do you think it would go?). It's always read from memory.
All the time, every time. So no, your scenario is simply not possible. It will always have the updated value.
Quick disclaimer: variables/data can always be cached by the CPU to avoid RAM hits, so if the hardware didn't synchronize, this could happen, but it's not an issue with the language/runtime. The software concept of "memory" includes RAM, cache, and virtual memory.
It's not about the object (concrete, fully built) but about the state of the object. So let's say you have a counter within X that constitutes its state; then yes, if thread A writes to the counter, it may happen that thread B can't see the changes.
I have read about and know in detail the implications of the Java volatile and synchronized keywords at the CPU level on SMP-architecture-based CPUs.
A great paper on that subject is here:
http://irl.cs.ucla.edu/~yingdi/web/paperreading/whymb.2010.06.07c.pdf
Now, leave SMP CPUs aside for this question. My question is: how do the volatile and synchronized keywords work as they relate to older single-core CPUs? For example, a Pentium I/Pro/II/III/early IV.
I want to know specifically:
1) Are the L1-L2 caches not used to read memory addresses, and are all reads and writes performed directly to main memory? If yes, why? (Since there is only a single cache copy and no need for coherency protocols, why can't the cache be used directly by two threads that are time-slicing the single-core CPU?) I'm asking this after reading dozens of internet forums about how volatile reads and writes go to/from the "master copy in main memory".
2) Apart from taking a lock on this or the specified object, which is more of a Java platform thingy, what other effects does the synchronized keyword have on single-core CPUs (compilers, assembly, execution, cache)?
3) With a non-superscalar CPU (Pentium I), instructions are not re-ordered. So if that is the case, then is the volatile keyword required while running on a Pentium I? (Atomicity, visibility and ordering would be a "no problemo", right, because there is only one cache, one core to work on the cache, and no re-ordering.)
1) Are the L1-L2 caches not used to read memory addresses, and are all reads and writes performed directly to main memory?
No. The caches are still enabled. That's not related to SMP.
2) Apart from taking a lock on this or the specified object, which is more of a Java platform thingy, what other effects does the synchronized keyword have on single-core CPUs (compilers, assembly, execution, cache)?
3) Does anything change with respect to a superscalar/non superscalar architecture (out-of-order) processor w.r.t these two keywords?
Gosh, do you have to ask this question about Java? Remember that all things eventually boil down to good ol' fashioned machine instructions. I'm not intimately familiar with the guts of Java synchronization, but as I understand it, synchronized is just syntactic sugar for your typical monitor-style synchronization mechanism. Multiple threads are not allowed in a critical section simultaneously. Instead of simply spinning on a spinlock, the scheduler is leveraged - the waiting threads are put to sleep, and woken back up when the lock is able to be taken again.
The thing to remember is that even on a single-core, non-SMP system, you still have to worry about OS preemption of threads! These threads can be scheduled on and off of the CPU whenever the OS wants to. This is the purpose for the locks, of course.
Again, this question is so much better asked under the context of assembly, or even C (whose compiled result can often times be directly inferred) as opposed to Java, which has to deal with the VM, JITted code, etc.
I am aware that the purpose of volatile variables in Java is that writes to such variables are immediately visible to other threads. I am also aware that one of the effects of a synchronized block is to flush thread-local memory to global memory.
I have never fully understood the references to 'thread-local' memory in this context. I understand that data which only exists on the stack is thread-local, but when talking about objects on the heap my understanding becomes hazy.
I was hoping that to get comments on the following points:
When executing on a machine with multiple processors, does flushing thread-local memory simply refer to the flushing of the CPU cache into RAM?
When executing on a uniprocessor machine, does this mean anything at all?
If it is possible for the heap to have the same variable at two different memory locations (each accessed by a different thread), under what circumstances would this arise? What implications does this have to garbage collection? How aggressively do VMs do this kind of thing?
(EDIT: adding question 4) What data is flushed when exiting a synchronized block? Is it everything that the thread has locally? Is it only writes that were made inside the synchronized block?
Object x = goGetXFromHeap(); // x.f is 1 here
Object y = goGetYFromHeap(); // y.f is 11 here
Object z = goGetZFromHeap(); // z.f is 111 here

y.f = 12;

synchronized (x) {
    x.f = 2;
    z.f = 112;
}

// will only x be flushed on exit of the block?
// will the update to y get flushed?
// will the update to z get flushed?
Overall, I think I am trying to understand whether thread-local means memory that is physically accessible by only one CPU, or whether there is logical thread-local heap partitioning done by the VM.
Any links to presentations or documentation would be immensely helpful. I have spent time researching this, and although I have found lots of nice literature, I haven't been able to satisfy my curiosity regarding the different situations & definitions of thread-local memory.
Thanks very much.
The flush you are talking about is known as a "memory barrier". It means that the CPU makes sure that what it sees of the RAM is also viewable from other CPUs/cores. It implies two things:
The JIT compiler flushes the CPU registers. Normally, the code may keep a copy of some globally visible data (e.g. instance field contents) in CPU registers. Registers cannot be seen from other threads. Thus, half the work of synchronized is to make sure that no such cache is maintained.
The synchronized implementation also performs a memory barrier to make sure that all the changes to RAM from the current core are propagated to main RAM (or that at least all other cores are aware that this core has the latest values -- cache coherency protocols can be quite complex).
The second job is trivial on uniprocessor systems (I mean, systems with a single CPU which has a single core), but uniprocessor systems are becoming rarer nowadays.
As for thread-local heaps, this can theoretically be done, but it is usually not worth the effort because nothing tells what parts of the memory are to be flushed with a synchronized. This is a limitation of the threads-with-shared-memory model: all memory is supposed to be shared. At the first encountered synchronized, the JVM should then flush all its "thread-local heap objects" to the main RAM.
Yet recent JVMs from Sun can perform an "escape analysis" in which a JVM succeeds in proving that some instances never become visible from other threads. This is typical of, for instance, StringBuilder instances created by javac to handle concatenation of strings. If the instance is never passed as a parameter to other methods then it does not become "globally visible". This makes it eligible for thread-local heap allocation, or even, under the right circumstances, for stack-based allocation. Note that in this situation there is no duplication; the instance is not in "two places at the same time". It is only that the JVM can keep the instance in a private place which does not incur the cost of a memory barrier.
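A hedged illustration of the escape-analysis case described above (the method is my own example): the StringBuilder never leaves the method, so the JVM is free to treat it as thread-local, or even allocate it on the stack, with no memory-barrier cost. Whether it actually does so is an implementation detail of the particular JVM.

class EscapeAnalysisExample {
    static String greet(String name) {
        StringBuilder sb = new StringBuilder();  // never stored in a field or passed out
        sb.append("Hello, ").append(name).append("!");
        return sb.toString();                    // only the resulting String escapes
    }
}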
It is really an implementation detail whether the current content of the memory of an object that is not synchronized is visible to another thread.
Certainly, there are limits, in that all memory is not kept in duplicate, and not all instructions are reordered, but the point is that the underlying JVM has the option if it finds it to be a more optimized way to do that.
The thing is that the heap is really "properly" stored in main memory, but accessing main memory is slow compared to accessing the CPU's cache or keeping the value in a register inside the CPU. By requiring that the value be written out to memory (which is what synchronization does, at least when the lock is released), it forces the write to main memory. If the JVM is free to ignore that, it can gain performance.
In terms of what will happen on a one-CPU system, multiple threads could still keep values in a cache or register, even while another thread is executing. There is no scenario where a value is guaranteed to be visible to another thread without synchronization, although it is obviously more likely to be visible. Outside of mobile devices, of course, the single-CPU machine is going the way of floppy disks, so this is not going to be a very relevant consideration for long.
For more reading, I recommend Java Concurrency in Practice. It is really a great practical book on the subject.
It's not as simple as CPU-Cache-RAM. That's all wrapped up in the JVM and the JIT and they add their own behaviors.
Take a look at The "Double-Checked Locking is Broken" Declaration. It's a treatise on why double-checked locking doesn't work, but it also explains some of the nuances of Java's memory model.
One excellent document for highlighting the kinds of problems involved, is the PDF from the JavaOne 2009 Technical Session
This Is Not Your Father's Von Neumann Machine: How Modern Architecture Impacts Your Java Apps
By Cliff Click, Azul Systems; Brian Goetz, Sun Microsystems, Inc.