I keep fighting to understand what VarHandle::setOpaque and VarHandle::getOpaque are really doing. It has not been easy so far - there are some things I think I get (but will not present them in the question itself, so as not to muddy the waters), but overall the documentation is misleading at best for me.
The documentation:
Returns the value of a variable, accessed in program order...
Well in my understanding if I have:
int xx = x; // read x
int yy = y; // read y
These reads can be re-ordered. On the other hand, if I have:
// simplified code, does not compile, but reads happen on the same "this" for example
int xx = VarHandle_X.getOpaque(x);
int yy = VarHandle_Y.getOpaque(y);
This time re-orderings are not possible? And is this what "program order" means? Are we talking about insertion of barriers here for this re-ordering to be prohibited? If so, since these are two loads, would the same be achieved via:
int xx = x;
VarHandle.loadLoadFence()
int yy = y;
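(For reference, a compiling version of the getOpaque snippet above could look like the following sketch; the Holder class and its field names are made up for illustration.)

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Holder {
    int x, y; // hypothetical fields

    static final VarHandle VarHandle_X, VarHandle_Y;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            VarHandle_X = l.findVarHandle(Holder.class, "x", int.class);
            VarHandle_Y = l.findVarHandle(Holder.class, "y", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void reads() {
        int xx = (int) VarHandle_X.getOpaque(this); // opaque read of x
        int yy = (int) VarHandle_Y.getOpaque(this); // opaque read of y
    }
}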
But it gets a lot trickier:
... but with no assurance of memory ordering effects with respect to other threads.
I could not come up with an example to even pretend I understand this part.
It seems to me that this documentation is targeted at people who know exactly what they are doing (and I am definitely not one)... So can someone shed some light here?
Well in my understanding if I have:
int xx = x; // read x
int yy = y; // read y
These reads can be re-ordered.
These reads may not only be reordered, they may not happen at all. The thread may use an old, previously read value for x and/or y, or values it previously wrote to these variables, even though the write may not have actually been performed yet - so the "reading thread" may use values no other thread knows of, values that are not in heap memory at that time (and probably never will be).
On the other hand if I have:
// simplified code, does not compile, but reads happen on the same "this" for example
int xx = VarHandle_X.getOpaque(x);
int yy = VarHandle_Y.getOpaque(y);
This time re-orderings are not possible? And is this what "program order" means?
Simply said, the main feature of opaque reads and writes is that they will actually happen. This implies that they cannot be reordered with respect to other memory accesses of at least the same strength, but that has no impact on ordinary reads and writes.
The term program order is defined by the JLS:
… the program order of t is a total order that reflects the order in which these actions would be performed according to the intra-thread semantics of t.
That’s the evaluation order specified for expressions and statements. The order in which we perceive the effects, as long as only a single thread is involved.
Are we talking about insertion of barriers here for this re-ordering to be prohibited?
No, there is no barrier involved, which might be the intention behind the phrase “…but with no assurance of memory ordering effects with respect to other threads”.
Perhaps we could say that opaque access works a bit like volatile did before Java 5, enforcing read access to see the most recent heap memory value (which only makes sense if the writing end also uses opaque or an even stronger mode), but with no effect on other reads or writes.
So what can you do with it?
A typical use case would be a cancellation or interruption flag that is not supposed to establish a happens-before relationship. Often, the stopped background task has no interest in perceiving actions made by the stopping task prior to signalling, but will just end its own activity. So writing and reading the flag with opaque mode would be sufficient to ensure that the signal is eventually noticed (unlike the normal access mode), but without any additional negative impact on the performance.
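A sketch of such a flag (the class and field names here are made up for illustration; this is one possible shape, not the only one):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class CancellableTask implements Runnable {
    private boolean cancelled; // accessed only through CANCELLED

    private static final VarHandle CANCELLED;
    static {
        try {
            CANCELLED = MethodHandles.lookup()
                    .findVarHandle(CancellableTask.class, "cancelled", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void cancel() {                      // called by the stopping thread
        CANCELLED.setOpaque(this, true); // the write is guaranteed to actually happen
    }

    @Override
    public void run() {
        while (!(boolean) CANCELLED.getOpaque(this)) { // re-read on every iteration
            // ... perform one unit of work ...
        }
        // no happens-before needed: the task just ends its own activity
    }
}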
Likewise, a background task could write progress updates, like a percentage number, which the reporting (UI) thread is supposed to notice timely, while no happens-before relationship is required before the publication of the final result.
It’s also useful if you just want atomic access for long and double, without any other impact.
Since truly immutable objects using final fields are immune to data races, you can use opaque modes for timely publishing immutable objects, without the broader effect of release/acquire mode publishing.
A special case would be periodically checking a status for an expected value update and, once available, querying the value with a stronger mode (or executing the matching fence instruction explicitly). In principle, a happens-before relationship can only be established between the write and its subsequent read anyway, but since optimizers usually don't have the horizon to identify such an inter-thread use case, performance critical code can use opaque access to optimize such a scenario.
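A sketch of that pattern (STATUS is a hypothetical VarHandle for a status field of this, DONE a hypothetical expected value, and the writer is assumed to publish the data and then set the status with at least release semantics):

while ((int) STATUS.getOpaque(this) != DONE) {
    Thread.onSpinWait();       // cheap polling; no ordering cost paid per iteration
}
VarHandle.acquireFence();      // matching fence; the data published before the status update is now visible
Object result = this.data;     // plain read of the (hypothetical) published field is safe after the fence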
Opaque means that the thread executing an opaque operation is guaranteed to observe its own actions in program order, but that's it.
Other threads are free to observe that thread's actions in any order. On x86 this is a common case, since it has a
"write ordered with store-buffer forwarding"
memory model. So even if the thread does a store before a load, the store can be cached in the store buffer, and some thread executing on any other core observes the thread's actions in the reverse order, load-store instead of store-load. So opaque operations come for free on x86 (on x86 we actually also have acquire for free; see this extremely exhaustive answer for details on some other architectures and their memory models: https://stackoverflow.com/a/55741922/8990329)
Why is it useful? Well, I could speculate that if some thread observed a value stored with opaque memory semantics, then a subsequent read will observe "at least this or a later" value (plain memory access does not provide such guarantees, does it?).
Also, since Java 9 VarHandles are somewhat related to the acquire/release/consume semantics in C, I think it is worth noting that opaque access is similar to memory_order_relaxed, which is defined in the standard as follows:
For memory_order_relaxed, no operation orders memory.
with some examples provided.
I have been struggling with opaque myself and the documentation is certainly not easy to understand.
From the above link:
Opaque operations are bitwise atomic and coherently ordered.
The bitwise atomic part is obvious. Coherently ordered means that loads/stores to a single address have some total order, each read sees the most recent write before it, and the order is consistent with the program order. For some coherence examples, see the following JCStress test.
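To make coherence concrete, here is the shape of such a test (a sketch with made-up names; X is a VarHandle for an int field of o, initially 0):

// writer thread:
X.setOpaque(o, 1);

// reader thread:
int r1 = (int) X.getOpaque(o);
int r2 = (int) X.getOpaque(o);

// Coherence forbids (r1, r2) == (1, 0): once a read of this single variable
// observes the new value, a later read cannot go back to the old one.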
Coherence doesn't provide any ordering guarantees between loads/stores to different addresses, so it doesn't need to provide any fences to order loads/stores to different addresses.
With opaque, the compiler will emit the loads/stores as it sees them. But the underlying hardware is still allowed to reorder load/stores to different addresses.
I upgraded your example to the message-passing litmus test:
thread1:
X.setOpaque(1);
Y.setOpaque(1);
thread2:
ry = Y.getOpaque();
rx = X.getOpaque();
if (ry == 1 && rx == 0) println("Oh shit");
The above could fail on a platform that allows the 2 stores or the 2 loads to be reordered (e.g., ARM or PowerPC). Opaque is not required to provide causality. JCStress has a good example for that as well.
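For completeness, here is a runnable (if crude) version of that message-passing test - a sketch using hypothetical class and field names; a hand-rolled loop like this rarely reproduces the reordering, which is why JCStress is the right tool:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class MessagePassing {
    int x, y;

    static final VarHandle X, Y;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            X = l.findVarHandle(MessagePassing.class, "x", int.class);
            Y = l.findVarHandle(MessagePassing.class, "y", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 1_000_000; i++) {
            MessagePassing m = new MessagePassing();
            Thread writer = new Thread(() -> {
                X.setOpaque(m, 1);
                Y.setOpaque(m, 1);
            });
            Thread reader = new Thread(() -> {
                int ry = (int) Y.getOpaque(m);
                int rx = (int) X.getOpaque(m);
                if (ry == 1 && rx == 0) {
                    System.out.println("observed ry == 1 && rx == 0");
                }
            });
            writer.start();
            reader.start();
            writer.join();
            reader.join();
        }
    }
}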
Also, the following IRIW example can fail:
thread1:
X.setOpaque(1);
thread2:
Y.setOpaque(1);
thread3:
rx_thread3 = X.getOpaque();
[LoadLoad]
ry_thread3 = Y.getOpaque();
thread4:
ry_thread4 = Y.getOpaque();
[LoadLoad]
rx_thread4 = X.getOpaque();
Can it be that we end up with rx_thread3 == 1, ry_thread3 == 0, ry_thread4 == 1 and rx_thread4 == 0?
With opaque this can happen. Even though the loads are prevented from being reordered, opaque accesses do not require multi-copy-atomicity (stores to different addresses issued by different CPUs can be seen in different orders).
Release/acquire is stronger than opaque; since this outcome is allowed to happen with release/acquire, it is also allowed to happen with opaque. So opaque is not required to provide consensus.
public class VisibleDemo {
    private boolean flag;

    public VisibleDemo setFlag(boolean flag) {
        this.flag = flag;
        return this;
    }

    public static void main(String[] args) throws InterruptedException {
        VisibleDemo t = new VisibleDemo();
        new Thread(() -> {
            long l = System.currentTimeMillis();
            while (true) {
                if (System.currentTimeMillis() - l > 600) {
                    break;
                }
            }
            t.setFlag(true);
        }).start();
        new Thread(() -> {
            long l = System.currentTimeMillis();
            while (true) {
                if (System.currentTimeMillis() - l > 500) {
                    break;
                }
            }
            while (!t.flag) {
                // if (System.currentTimeMillis() - l > 598) {
                //
                // }
            }
            System.out.println("end");
        }).start();
    }
}
If it does not have the following code, it will not show "end":
if (System.currentTimeMillis() - l > 598) {
}
If it has this code, it will probably show "end". Sometimes it does not show it.
When the number is less than 598 (e.g., 550), or the code is absent, it will not show "end".
When it is 598, it will probably show "end".
When it is greater than 598, it will show "end" every time.
Notes:
598 is the number on my computer; on your computer it may be a different number.
The flag is not volatile, so why can the reading thread see the newest value?
First: I want to know why.
Second: I need help - I want to know the scenarios in which the working memory of a JVM thread is refreshed to/from main memory.
OS: windows 10
java: jdk8u231
Your code is suffering from a data-race and that is why it is behaving unreliably.
The JMM is defined in terms of the happens-before relation. So if you have 2 actions A and B, and A happens-before B, then B should see A and everything before A. It is very important to understand that happens-before doesn't imply happening-before (so ordering based on physical time) and vice versa.
The 'flag' field is accessed concurrently; one thread is reading it while another thread is writing it. In JMM terms this is called conflicting access.
Conflicting accesses are fine as long as it is done using some form of synchronization because the synchronization will induce happens-before edges. But since the 'flag' accesses are plain loads/stores, there is no synchronization, and as a consequence, there will not be a happens-before edge to order the load and the store. A conflicting access, that isn't ordered by a happens-before edge, is called a data-race and that is the problem you are suffering from.
When there is a data-race, funny things can happen, but it will not lead to undefined behavior as is possible under C++ (undefined behavior can effectively lead to any possible outcome, including crashes and super weird behavior). So a load still needs to see a value that was actually written; it can't see a value coming out of thin air.
If we look at your code:
while (!t.flag) {
...
}
Because the flag field isn't updated within the loop and is just a plain load, the compiler is allowed to optimize this code to:
if (!t.flag) {
    while (true) { ... }
}
This particular optimization is called loop hoisting (or loop invariant code motion).
So this explains why the loop doesn't need to complete.
Why does it complete when you call System.currentTimeMillis? Because you got lucky; apparently this prevents the JIT from applying the above optimization. But keep in mind that System.currentTimeMillis doesn't have any formal synchronization semantics and therefore doesn't induce happens-before edges.
How to fix your code?
The simplest way to fix your code would be to make 'flag' volatile or to access both the read and the write from a synchronized block. If you want to go really hardcore: use VarHandle getOpaque/setOpaque. Officially it is still a data-race because opaque doesn't induce happens-before edges, but it will prevent the compiler from optimizing out the load/store. It is a benign data race. The primary advantage is slightly better performance because it doesn't prevent the reordering of surrounding loads/stores.
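A minimal sketch of the volatile variant (only the field declaration changes; the rest of the class stays as before):

public class VisibleDemo {
    private volatile boolean flag; // volatile: the read in the busy-wait loop can no longer be hoisted

    public VisibleDemo setFlag(boolean flag) {
        this.flag = flag;
        return this;
    }

    // main(...) unchanged; while (!t.flag) now terminates once the writer sets the flag
}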
I want to know the scenarios in which the working memory of a JVM thread is refreshed to/from main memory.
This is a fallacy. Caches on modern CPUs are always coherent; this is taken care of by the cache coherence protocol like MESI. Writing to main memory for every volatile read/write would be extremely slow. For more information see the following excellent post. If you want to know more about cache coherence and memory ordering, please check this excellent book which you can download for free.
I want to know the scenarios in which the working memory of a JVM thread is refreshed to/from main memory.
When Taylor Swift is playing on your music player, it'll be 598, unless it's tuesday, then it'll be 599.
No, really. It's that arbitrary. The JVM spec gives the JVM the right to come up with any old number for any reason if your code isn't properly guarded.
The problem is JVM diversity. There is a crazy combinatorial explosion:
There are about 8 OSes give or take.
There are like 20 different 'chip lines', with different pipelining behaviour.
These chips can be in various mitigating modes to mitigate against attacks like Spectre. Let's call it 3.
There are about 8 different major JVM vendors.
These come in ~10 or so different versions (java 8, java 9, java 10, java 11, etc).
That gives us about 38,400 different combinations.
The point of the JMM (Java Memory Model) is to remove the handcuffs from a JVM implementation. A JVM implementation is looking for this optimal case:
It wants the freedom to use the various tricks that CPUs use to run code as fast as possible. For example, it wants the freedom to be capable of 're-ordering' (given a(); b();, to run b() first and a() later - which is okay if a and b are utterly independent and are not in any way looking at each other's modifications). The reason it wants to do this is that CPUs are pipelines: even processing a single instruction is in fact a chain of many separate steps, and the 'parse the instruction' step can get cracking on parsing another instruction the very moment it is done, even if that instruction is still being processed by the rest of the pipe. In fact, the CPU could have 4 separate 'instruction parser units' and they can be parsing 4 instructions in parallel. This is NOT the kind of parallelism that multiple cores do: this is a single core that will parse 4 consecutive instructions in parallel, because parsing instructions is slightly slower than running them, for example.
But that's just Intel chips of the Z-whatever line. That's the point. If the memory model of the Java specification indicated that a JVM simply can't use this stuff, then JVMs on that particular Intel chip would run slow as molasses. We don't want that.
Nevertheless, the memory model rules can't be so preferential to giving the JVM the right to re-order and do all sorts of crazy things that it becomes impossible to write reliable code for JVMs. Imagine the Java language spec said that the JVM can re-order any 2 instructions in one method at any time, even if these 2 instructions touch the same field. That'd be great for JVM engineers; they could go nuts with optimizing code on the fly to re-order it optimally. But it would be impossible to write Java code.
So, a balance has been struck. This balance takes the following form:
The JMM gives you specific rules - these rules take the form of: "If you do X, then the JVM guarantees Y".
But that is all. In particular, there is nothing written about what happens if you do not do X. All you know is, that then Y is not guaranteed. But 'not guaranteed' does not mean: Will definitely NOT happen.
Here is an example:
class Data {
    static int a = 0;
    static int b = 0;
}

class Thread1 extends Thread {
    public void run() {
        Data.a = 5;
        Data.b = 10;
    }
}

class Thread2 extends Thread {
    public void run() {
        int a = Data.a;
        int b = Data.b;
        System.out.println(a);
        System.out.println(b);
    }
}

class Main {
    public static void main(String[] args) {
        new Thread1().start();
        new Thread2().start();
    }
}
This code:
Makes 2 fields, which start out at 0 and 0.
Runs one thread that first sets a to 5 and then sets b to 10.
Starts a second thread that reads these 2 fields into local vars and then prints these.
The JVM spec says that it is valid for a JVM to:
Print 0/0
Print 5/0
Print 0/10
Print 5/10
But it would not be legal for a JVM to e.g. print '20/20', or '10/5'.
Let's zoom in on the 0/10 case because that is utterly bizarre - how could a JVM possibly do that? Well, reordering!
WILL a JVM print 0/10? On some combinations of JVM vendor and version + architecture + OS + phase of the moon, YES IT WILL. On most, no it won't. Ever. Still, imagine you wrote this code, you rely on 0/10 NEVER occurring, you test the heck out of your code, and you verify that indeed, even running the test a million times, it never happens. You ship it to the production server and it runs fine for a week, and then, just as you are giving the demo to the really important potential customer, all heck breaks loose: your app is broken, as from time to time the 0/10 case does occur.
You file a bug with your JVM vendor. And they close it as 'intended behaviour - wontfix'. That will really happen, because that really is the intended behaviour. If you write code that relies on a thing being true that is NOT guaranteed by the JMM, then YOU wrote a bug, even if on your particular hardware on this particular day it is completely impossible for you to make this bug occur right now.
This means one simple and very nasty conclusion is the only correct one: You cannot test this stuff.
So, if you adhere to the rule that if there are no tests then you can't know if you code works, guess what? You cannot ever know if your code is fine. Ever.
That then leads to the conclusion that you don't want to write any such code.
This sounds crazy (how can you simply not ever, ever write anything multicore?) but it's not as nuts as you think. This only comes up if 2 threads are dependent on ordering relative to each other for some in-process action. For example, if two threads are both accessing the same field of the same instance. Simply... don't do that.
It's easier than you think: If all 'communication' between threads goes via the database and you know how to use transactions in databases, voila. Or you use a message bus service like RabbitMQ.
If for some job you really must write multithread code where the threads interact with each other, don't shoot the messenger: It is NOT POSSIBLE to test that you did it right. So write it very carefully.
A second conclusion is that the JMM doesn't explain how things work or what happens. It merely says: IF you follow these rules, I guarantee you that THIS will happen. If you don't follow these rules, anything can happen. A JVM is free to do all sorts of crazy shenanigans, and neither this documentation nor any other documentation will ever enumerate all the crazy things that could happen. After all, there are at least 38,400 different combinations, and it's crazy to attempt to document all 38,400!
So, what are the core rules?
The core rules are so-called happens-before relationships. The basic rule is simply this:
There are various ways to establish H-B relationships. Such a relationship is always between 2 lines of code. 2 lines of code might be unrelated, H-B wise. Or, the rules state that line A 'happens-before' line B.
If and only if the rules state this, then it will be impossible to observe a state of the universe (the values of all fields of all instances in the entire JVM) at line B as it was before line A ran.
That's it. For example, if line A 'happens before' line B, but line B does not attempt to witness any field change A made, then the JVM is still free to reorder and have B run before A. The point is that this shouldn't matter - you're not observing, so why does it matter?
We can 'fix' our weird 0/0, 5/0, 0/10 outcomes by setting up H-B: If the 'grab the static field values and save them to local a/b vars' code happens-after thread1's setting of them, then we can be sure that the code will always print 5/10, and the JMM guarantees mean that a JVM which doesn't print that is broken.
H-B are also transitive (if HB(A, B) is true, and HB(B, C) is true, then HB(A, C) is also true).
How do you set up HB?
If line B would run after line A as per the usual understanding of how things run, and both are being run by the same thread, HB(A, B). This is obvious: If you just write x(); y();, then y cannot observe state as it was before x ran.
HB(thread.start(), X) where X is the very first line in the started thread.
HB(EndS, StartS), where EndS is the exiting of a synchronized block on object ref Z, and StartS is another thread entering a synchronized block (on ref Z as well) later.
HB(W, R), where W is a write to volatile variable Z and R is a subsequent read of Z (subsequent in the synchronization order) - with volatiles, the edge runs from the write to the reads that observe it.
There are a few more exotic ways. There's also a separate HB relationship for constructors and final variables that they initialize, but generally this one is real easy to understand (once a constructor returns, whatever final fields it initialized are definitely set and cannot be observed to not be set, even if otherwise no actual HB relationship has been established. This applies only to final fields).
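To tie this back to the earlier Data example, here is one way to set up the edge using the volatile rule (a sketch; Thread2 now spins until it observes the write to b, so the 0/0 and 5/0 outcomes are waited out rather than printed):

class Data {
    static int a = 0;
    static volatile int b = 0;
}

class Thread1 extends Thread {
    public void run() {
        Data.a = 5;   // plain write
        Data.b = 10;  // volatile write "publishes" the write to a
    }
}

class Thread2 extends Thread {
    public void run() {
        while (Data.b != 10) { }     // volatile read; spin until the write is observed
        System.out.println(Data.a);  // guaranteed 5: HB(write of b, read of b) plus transitivity
        System.out.println(Data.b);  // 10
    }
}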
This explains why you observe weird values. This also explains why your question of 'I want to know when a JVM thread will refresh to/from main memory' is not answerable: Because the java memory model spec and the java virtual machine spec intentionally and specifically make no promises on how that works. One JVM can work one way, another JVM can do it completely differently.
The reason I started off making a seeming joke about playing Taylor Swift is: A CPU has cores, and the cores are limited. A modern computer, especially a desktop, is doing thousands of things at once, and will therefore be rotating apps through cores all the time. Whether a field update is 'flushed out' to main memory (NOTE: THAT IS DANGEROUS THINKING - THE DOCS DO NOT ACTUALLY ENFORCE THAT JVMS CAN BE UNDERSTOOD IN THOSE TERMS!) might depend on whether it gets rotated out of a core or not. And that in turn might depend on your music player dealing with a particular compressed music file that takes a few more cores to decompress the next block so that it can be queued up in the audio buffer.
Hence, and this is no joke, the song you are playing on your music player can in fact change the number you get. Hence, why you have to give up: You CANNOT enumerate 'if my computer is in this state, then this code will always produce Y number'. There are billions of states you'd have to enumerate. Impossible.
From the book Effective Java:
While the volatile modifier performs no mutual exclusion, it guarantees that any thread that reads the field will see the most recently written value
SO and many other sources claim similar things.
Is this true?
I mean really true, not a close-enough model, or true only on x86, or only in Oracle JVMs, or some definition of "most recently written" that's not the standard English interpretation...
Other sources (SO example) have said that volatile in Java is like acquire/release semantics in C++. Which I think do not offer the guarantee from the quote.
I found that in the JLS 17.4.4 it says "A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order)." But I don't quite understand.
There are quite some sources for and against this, so I'm hoping the answer is able to convince that many of those (on either side) are indeed wrong - for example reference or spec, or counter-example code.
Is this true?
I mean really true, not a close-enough model, or true only on x86, or only in Oracle JVMs, or some definition of "most recently written" that's not the standard English interpretation...
Yes, at least in the sense that a correct implementation of Java gives you this guarantee.
Unless you are using some exotic, experimental Java compiler/JVM (*), you can essentially take this as true.
From JLS 17.4.5:
A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.
(*) As Stephen C points out, such an exotic implementation that doesn't implement the memory model semantics described in the language spec can't usefully (or even legally) be described as "Java".
The quote per se is correct in terms of what it tries to prove, but it is incorrect on a broader view.
It tries to make a distinction between sequential consistency and release/acquire semantics, at least in my understanding. The difference between these two terms is rather "thin", but very important. I have tried to simplify the difference at the beginning of this answer or here.
The author is trying to say that volatile offers that sequential consistency, as implied by:
"... it guarantees that any thread.."
If you look at the JLS, it has this sentence:
A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.
The tricky part there is subsequent and its meaning, and it has been discussed here. What it really means is "subsequent that observes that write". So happens-before is guaranteed when the reader observes the value that the writer has written.
This already implies that a write is not necessarily seen on the next read, which can be the case where speculative execution is allowed. So in this regard, the quote is misleading.
The quote that you found:
A write to a volatile variable v (§8.3.1.4) synchronizes-with all subsequent reads of v by any thread (where "subsequent" is defined according to the synchronization order)
is complicated to understand without a much broader context. In simple words, it establishes a synchronizes-with order (and implicitly happens-before) between two threads, where the volatile variable v is a shared variable. Here is an answer with a broader explanation, which should make more sense.
It is not true. The JMM is based on sequential consistency, and for sequential consistency real-time ordering isn't guaranteed; for that you need linearizability. In other words, reads and writes can be skewed as long as the program order isn't violated (or as long as it can't be proven that the program order was violated).
A read of a volatile variable a needs to see the most recently written value before it in the memory order. But that doesn't imply real-time ordering.
Good read about the topic:
https://concurrency-interest.altair.cs.oswego.narkive.com/G8KjyUtg/relativity-of-guarantees-provided-by-volatile.
I'll make it concrete:
Imagine there are 2 CPU's and (volatile) variable A with initial value 0. CPU1 does a store A=1 and CPU2 does a load of A. And both CPUs have the cacheline containing A in SHARED state.
The store is first speculatively executed and written to the store buffer; eventually the store commits and retires, but since the stored value is still in the store buffer, it isn't visible yet to CPU2. Up to this point it wasn't required for the cacheline to be in an EXCLUSIVE/MODIFIED state, so the cacheline on CPU2 still contains the old value and hence CPU2 can still read the old value.
So in the real time order, the write of A is ordered before the read of A=0, but in the synchronization order, the write of A=1 is ordered after the read of A=0.
Only when the store leaves the store buffer and wants to enter the L1 cache is the request for ownership (RFO) sent to all other CPUs, which set the cacheline containing A to INVALID on CPU2 (RFO prefetching I'll leave out of the discussion). If CPU2 now reads A, it is guaranteed to see A=1 (the request will block till CPU1 has completed the store to the L1 cache).
On acknowledgement of the RFO the cacheline is set to MODIFIED on CPU1 and the store is written to the L1 cache.
So there is a period of time between when the store is executed/retired and when it is visible to another CPU. But the only way to determine this is if you would add special measuring equipment to the CPUs.
I believe a similar delaying effect can happen on the reading side with invalidation queues.
In practice this will not be an issue because store buffers have a limited capacity and need to be drained eventually (so a write can't be invisible indefinitely). So in day to day usage you could say that a volatile read, reads the most recent write.
A java volatile write/read provides release/acquire semantics, but keep in mind that the volatile write/read is stronger than release/acquire semantics. A volatile write/read is sequential consistent and release/acquire semantics isn't.
The CPU already guarantees cache coherence via protocols like MESI. Why do we also need volatile in some languages (like Java) to ensure visibility between threads?
The likely reason is that those protocols aren't enabled at boot and must be triggered by some instructions like LOCK.
If that is really so, why doesn't the CPU enable the protocol at boot?
Volatile prevents 3 different flavors of problems:
visibility
reordering
atomicity
I'm assuming X86...
First of all, caches on the X86 are always coherent. So it won't happen that after one CPU commits the store to some variable to the cache, another CPU will still load the old value for that variable. This is the domain of the MESI protocol.
Assuming that every put and get in the Java bytecode is translated (and not optimized away) to a store and a load on the CPU, then even without volatile, every get would see the most recent put to the same address.
The issue here is that the compiler (JIT in this case) has a lot of freedom to optimize code. For example if it detects that the same field is read in a loop, it could decide to hoist that variable out of the loop as is shown below.
for (...) {
    int tmp = a;
    println(tmp);
}
After hoisting:
int tmp = a;
for (...) {
    println(tmp);
}
This is fine if that field is only touched by 1 thread. But if the field is updated by another thread, the first thread will never see the change. Using volatile prevents such visibility problems and this is effectively the behavior of:
C style volatile
the Java volatile before the Java memory model was introduced with JSR-133.
A VarHandle with opaque access mode.
Then there is another very important aspect of volatile: volatile prevents loads and stores to different addresses in the instruction stream executed by some CPU from being reordered. The JIT compiler and the CPU have a lot of liberty to reorder loads and stores; although on the X86, only older stores can be reordered with newer loads to a different address, due to store buffers.
So imagine the following code:
int a;
volatile int b;
thread1:
a=1;
b=1;
thread2:
if(b==1) print(a);
The fact that b is volatile prevents the store of a=1 from jumping after the store of b=1. And it also prevents the load of a from jumping before the load of b. So this way thread 2 is guaranteed to see a=1 when it reads b==1.
So using volatile, you can ensure that non volatile fields are visible to other threads.
If you want to understand how volatile works, I would suggest digging into the Java memory model, which is expressed in synchronizes-with and happens-before rules, as Margaret Bloom already indicated. I have given some low-level details, but in the case of Java it is best to work with this high-level model instead of thinking in terms of hardware. Thinking exclusively in terms of hardware/fences is only for the experts, non-portable and very fragile.
I find that CAS will flush all CPU write cache to main memory. Is this similar to a memory barrier?
If this is true, does this mean CAS can make Java happens-before work?
For the answer:
The CAS is a CPU instruction.
The barrier in question is a StoreLoad barrier, because what I care about is whether data written before the CAS can be read after the CAS.
More Detail:
I have this question because I am writing a fork-join construct in Java. The implementation is like this:
{
    // initialize result container
    Object[] result = new Object[size]; // size: number of workers (elided in the original)
    // worker finish state count
    AtomicInteger state = new AtomicInteger(result.length);
}

// worker thread i
{
    result[i] = new Object();
    // this is a CAS operation
    state.getAndDecrement();
    if (state.get() == 0) {
        // do something using result array
    }
}
I want to know whether the (do something using result array) part can see all the result elements written by the other worker threads.
I find that CAS will flush all CPU write cache to main memory. Is this similar to a memory barrier?
It depends on what you mean by CAS. (A specific hardware instruction? An implementation strategy used in the implementation of some Java class?)
It depends on what kind of memory barrier you are talking about. There are a number of different kinds ...
It is not necessarily true that a CAS instruction flushes all dirty cache lines. It depends on how a particular instruction set / hardware implements the CAS instruction.
It is unclear what you mean by "make happens-before work". Certainly, under some circumstances a CAS instruction would provide the necessary memory coherency properties for a specific happens-before relationship. But not necessarily all relationships. It would depend on how the CAS instruction is implemented by the hardware.
To be honest, unless you are actually writing a Java compiler, you would do better not to try to understand the intricacies of what a JIT compiler needs to do to implement the Java Memory Model. Just apply the happens-before rules.
UPDATE
It turns out from your recent updates and comments that your actual question is about the behavior of AtomicInteger operations.
The memory semantics of the atomic types are specified in the package javadoc for java.util.concurrent.atomic as follows:
The memory effects for accesses and updates of atomics generally follow the rules for volatiles, as stated in The Java Language Specification (17.4 Memory Model):
get has the memory effects of reading a volatile variable.
set has the memory effects of writing (assigning) a volatile variable.
lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering constraints with ordinary non-volatile writes. Among other usage contexts, lazySet may apply when nulling out, for the sake of garbage collection, a reference that is never accessed again.
weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of any variables other than the target of the weakCompareAndSet.
compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.
As you can see, operations on the atomic types are specified to have memory semantics that are equivalent to those of volatile variables. This should be sufficient to reason about your use of Java atomic types ... without resorting to dubious analogies with CAS instructions and memory barriers.
Your example is incomplete and it is difficult to understand what it is trying to do. Therefore, I can't comment on its correctness. However, you should be able to analyze it yourself using happens-before logic, etc.
I find that CAS will flush all CPU write cache to main memory.
Is this similar to a memory barrier?
A CAS in Java on the X86 is implemented using a lock prefix, and it then depends on the type of CAS which instruction is actually used; but that isn't that relevant for this discussion. A locked instruction effectively is a full barrier, so it includes all 4 fences: LoadLoad/LoadStore/StoreLoad/StoreStore. Since the X86 provides all but StoreLoad due to TSO, only the StoreLoad needs to be added, just as with a volatile write.
A StoreLoad doesn't force changes to be written to main memory; it only forces the CPU to wait with executing loads till the store buffer has been drained to the L1d. However, with MESI (Intel) based cache coherence protocols, it can happen that a cache-line that is in MODIFIED state on a different CPU needs to be flushed to main memory before it can be returned as EXCLUSIVE. With MOESI (AMD) based cache coherence protocols, this is not an issue. If the cache-line is already in MODIFIED or EXCLUSIVE state on the core doing the StoreLoad, the StoreLoad doesn't cause the cache line to be flushed to main memory. The cache is the source of truth.
If this is true, does this mean CAS can make java Happens-Before work?
From a memory model perspective, a successful CAS in java is nothing else than a volatile read followed by a volatile write. So there is a happens before relation between a volatile write of some field on some object instance and a subsequent volatile read on the same field on the same object instance.
Since you are working with Java, I would focus on the Java Memory Model and not too much on how it is implemented in the hardware. The JMM is allowing for executions that can't be explained based purely by thinking in fences.
Regarding your example:
result[i] = new Object();
//this is a CAS operation
state.getAndDecrement();
if(state.get() == 0){
//do something using result array
}
I'm not sure what the intended logic is. In your example, multiple threads could see at the same time that the state is 0, so all of them could start to do something with the result array. If this behavior is undesirable, then it is caused by a race condition. I would use something like this:
result[i] = new Object();
// atomic decrement (a CAS-like read-modify-write operation);
// decrementAndGet returns the new value, so exactly one worker (the last one) observes 0
int s = state.decrementAndGet();
if (s == 0) {
    // do something using result array
}
Now the other question is if there is a data race on the array content. There is a happens-before edge between the write to the array content and the write to 'state' (program order rule). There is a happens before edge between the write of the state and the read (volatile variable rule) and there is a happens before relation between the read of the state and the read of the array content (program order rule). So there is a happens before edge between writing to the array and reading its content in this particular example due to the transitive nature of the happens-before relation.
Personally, I would not try to be too smart and would use something less error-prone like an AtomicReferenceArray; then at least you don't need to worry about a missing happens-before edge between the write of the array and the read.
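A sketch of that alternative (the class name and the worker-count field n are made up; each slot access on an AtomicReferenceArray has volatile semantics):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

class ForkJoinSketch {
    final int n;
    final AtomicReferenceArray<Object> result;
    final AtomicInteger state;

    ForkJoinSketch(int n) {
        this.n = n;
        this.result = new AtomicReferenceArray<>(n);
        this.state = new AtomicInteger(n);
    }

    void worker(int i) {
        result.set(i, new Object());        // volatile write of slot i
        if (state.decrementAndGet() == 0) { // exactly one worker observes 0
            for (int j = 0; j < n; j++) {
                Object o = result.get(j);   // volatile read; every slot is visible here
            }
        }
    }
}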
Could you explain in simple words what "the program satisfies intra-thread semantic" means? Is it possible to provide simple examples of programs which satisfy and which don't satisfy such semantics?
The notion of intra-thread semantics is discussed in the JLS section 17.4, which covers the Java Memory Model. The JMM is a set of requirements and constraints on the execution of Java programs by JVMs. Here's the relevant section of text from 17.4:
The memory model determines what values can be read at every point in the program. The actions of each thread in isolation must behave as governed by the semantics of that thread, with the exception that the values seen by each read are determined by the memory model. When we refer to this, we say that the program obeys intra-thread semantics. Intra-thread semantics are the semantics for single-threaded programs, and allow the complete prediction of the behavior of a thread based on the values seen by read actions within the thread. To determine if the actions of thread t in an execution are legal, we simply evaluate the implementation of thread t as it would be performed in a single-threaded context, as defined in the rest of this specification.
This means that, as far as a single thread is concerned, the values visible in objects' fields are either the fields' initial values (zero, false, or null) or are values that this thread has previously written.
This is so obvious as to be elementary; why bother stating it?
Consider a single-threaded Java program with a few int fields:
field1 = 1; // 1
field2 = 2; // 2
field3 = field1 + field2; // 3
then clearly the value of field3 must be 3. This is because the values visible in field1 and field2 at line 3 must reflect the earlier values written at lines 1 and 2. It would be incorrect if the initial value of zero for field1 or field2 were used in the computation at line 3, since the assignment of those fields occurs earlier in the program than the computation.
What is less obvious are the constraints that are not present. For example, there is no constraint here over the ordering of the writes of field1 and field2. The JVM could execute line 2 before line 1 and the result of the program would be the same. Or, it could delay the writes to field1 and field2 and keep these values in registers, and do register-based addition at line 3. The actual writes of all the fields could be delayed until much later. Or they could even be omitted entirely if the values are subsequently overwritten by this thread. Again, the outcome of the program would be the same.
And that's the point: JVMs are free to rearrange the execution of a program (mainly so that it can run faster), but only as long it doesn't change the results of that program if it were run single-threaded. These constraints are referred to as intra-thread semantics. Any rearrangements that don't violate intra-thread semantics are permitted.
(Note that the paragraph quoted above talks about a "program that obeys intra-thread semantics" but what it really means is the execution of the program obeys intra-thread semantics. Text in later sections, such as 17.4.7, is more precise, referring to whether an execution of a program obeys intra-thread consistency or whether a set of actions performed is in accord with intra-thread semantics.)
Let's break this down: https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.4
1 The memory model determines what values can be read at every point in the program. 2 The actions of each thread in isolation must behave as governed by the semantics of that thread, 3 with the exception that the values seen by each read are determined by the memory model. 4 When we refer to this, we say that the program obeys intra-thread semantics. Intra-thread semantics are the semantics for single-threaded programs, and allow the complete prediction of the behavior of a thread based on the values seen by read actions within the thread
1.
The memory model determines what values can be read at every point in the program
It means the JMM has rules for visibility and ordering that dictate the values for reads - rules that haven't been defined yet at that point, but the sentence makes the reader aware that they exist (in 3 and 4 they explain how this relates to intra-thread semantics).
2.
The actions of each thread in isolation must behave as governed by the semantics of that thread
It does not define what "isolation" means in this context, nor what "the semantics of that thread" are, but they probably meant actions that do not fall under inter-thread actions, so all actions done by a thread that don't influence other threads will be read line by line (the semantics of that thread). Their "definition" of intra-thread actions (sort of - they only give an example) is:
"This specification is only concerned with inter-thread actions. We do not need to concern ourselves with intra-thread actions (e.g., adding two local variables and storing the result in a third local variable)" - i.e., any action that won't affect any shared memory.
3.
with the exception that the values seen by each read are determined by the memory model.
It means that the JMM applies its rules even when a thread executes code that has no influence on any other thread (2) - it dictates the values of these reads.
4.
When we refer to this, we say that the program obeys intra-thread semantics. Intra-thread semantics are the semantics for single-threaded programs, and allow the complete prediction of the behavior of a thread based on the values seen by read actions within the thread
To sum it up, they are saying: if you have a thread that performs an action in isolation, meaning it manipulates (reads or writes) data not in any shared memory, the values of the reads in that code would still be *decided by the JMM, and all of it (complete) would be known to that same thread - this is what it means to obey intra-thread semantics.
They almost implicitly describe Sequential Consistency (SC) here without actually saying it; the only missing part is inter-thread actions. This is used later on in Program Order (*decided by the JMM), which needed all the definitions here to define when SC is used.
Program order is the rule they kept mentioning without naming before; it says that if you look at all (the total of) the actions (intra and inter) in a thread and consider them under intra-thread semantics, then they are all considered sequentially consistent.
You can understand what is SC by reading Stuart Marks answer
As a side note I just have to say the phrasing in JLS for JMM is horrendous, they could not have made it more confusing if they tried to.