AtomicInteger.lazyset() - visibility delay [duplicate]

AtomicInteger.lazyset() - visibility delay [duplicate] - java

What is the difference between the lazySet and set methods of AtomicInteger? The documentation doesn't have much to say about lazySet:
Eventually sets to the given value.
It seems that the stored value will not be immediately set to the desired value but will instead be scheduled to be set some time in the future. But, what is the practical use of this method? Any example?

Cited straight from "JDK-6275329: Add lazySet methods to atomic classes":
As probably the last little JSR166 follow-up for Mustang,
we added a "lazySet" method to the Atomic classes
(AtomicInteger, AtomicReference, etc). This is a niche
method that is sometimes useful when fine-tuning code using
non-blocking data structures. The semantics are
that the write is guaranteed not to be re-ordered with any
previous write, but may be reordered with subsequent operations
(or equivalently, might not be visible to other threads) until
some other volatile write or synchronizing action occurs).
The main use case is for nulling out fields of nodes in
non-blocking data structures solely for the sake of avoiding
long-term garbage retention; it applies when it is harmless
if other threads see non-null values for a while, but you'd
like to ensure that structures are eventually GCable. In such
cases, you can get better performance by avoiding
the costs of the null volatile-write. There are a few
other use cases along these lines for non-reference-based
atomics as well, so the method is supported across all of the
AtomicX classes.
For people who like to think of these operations in terms of
machine-level barriers on common multiprocessors, lazySet
provides a preceeding store-store barrier (which is either
a no-op or very cheap on current platforms), but no
store-load barrier (which is usually the expensive part
of a volatile-write).

lazySet can be used for rmw inter thread communication, because xchg is atomic, as for visibility, when writer thread process modify a cache line location, reader thread's processor will see it at the next read, because the cache coherence protocol of intel cpu will garantee LazySet works, but the cache line will be updated at the next read, again, the CPU has to be modern enough.
http://sc.tamu.edu/systems/eos/nehalem.pdf
For Nehalem which is a multi-processor platform, the processors have the ability to “snoop” (eavesdrop) the address bus for other processor’s accesses to system memory and to their internal caches. They use this snooping ability to keep their internal caches consistent both with system memory and with the caches in other interconnected processors.
If through snooping one processor detects that another processor intends to write to a memory location that it currently has cached in Shared state, the snooping processor will invalidate its cache block forcing it to perform a cache line fill the next time it accesses the same memory location.
oracle hotspot jdk for x86 cpu architecture->
lazySet == unsafe.putOrderedLong == xchg rw( asm instruction that serve as a soft barrier costing 20 cycles on nehelem intel cpu)
on x86 (x86_64) such a barrier is much cheaper performance-wise than volatile or AtomicLong getAndAdd ,
In an one producer, one consumer queue scenario, xchg soft barrier can force the line of codes before the lazySet(sequence+1) for producer thread to happen BEFORE any consumer thread code that will consume (work on) the new data, of course consumer thread will need to check atomically that producer sequence was incremented by exactly one using a compareAndSet (sequence, sequence + 1).
I traced after Hotspot source code to find the exact mapping of the lazySet to cpp code:
http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/9b0ca45cd756/src/share/vm/prims/unsafe.cpp
Unsafe_setOrderedLong -> SET_FIELD_VOLATILE definition -> OrderAccess:release_store_fence.
For x86_64, OrderAccess:release_store_fence is defined as using the xchg instruction.
You can see how it is exactly defined in jdk7 (doug lea is working on some new stuff for JDK 8):
http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/4fc084dac61e/src/os_cpu/linux_x86/vm/orderAccess_linux_x86.inline.hpp
you can also use the hdis to disassemble the lazySet code's assembly in action.
There is another related question:
Do we need mfence when using xchg

A wider discussion of the origins and utility of lazySet and the underlying putOrdered can be found here: http://psy-lob-saw.blogspot.co.uk/2012/12/atomiclazyset-is-performance-win-for.html
To summarize: lazySet is a weak volatile write in the sense that it acts as a store-store and not a store-load fence. This boils down to lazySet being JIT compiled to a MOV instruction that cannot be re-ordered by the compiler rather then the significantly more expensive instruction used for a volatile set.
When reading the value you always end up doing a volatile read(with an Atomic*.get() in any case).
lazySet offers a single writer a consistent volatile write mechanism, i.e. it is perfectly legitimate for a single writer to use lazySet to increment a counter, multiple threads incrementing the same counter will have to resolve the competing writes using CAS, which is exactly what happens under the covers of Atomic* for incAndGet.

From the Concurrent-atomic package summary
lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering constraints with ordinary non-volatile writes. Among other usage contexts, lazySet may apply when nulling out, for the sake of garbage collection, a reference that is never accessed again.
If you are curious about lazySet then you owe yourself other explanations too
The memory effects for accesses and updates of atomics generally
follow the rules for volatiles, as stated in section 17.4 of The Java™
Language Specification.
get has the memory effects of reading a volatile variable.
set has the memory effects of writing (assigning) a volatile variable.
lazySet has the memory effects of writing (assigning) a volatile variable except that it permits reorderings with subsequent (but not previous) memory actions that do not themselves impose reordering
constraints with ordinary non-volatile writes. Among other usage
contexts, lazySet may apply when nulling out, for the sake of garbage
collection, a reference that is never accessed again.
weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides
no guarantees with respect to previous or subsequent reads and writes
of any variables other than the target of the weakCompareAndSet.
compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing
volatile variables.

Here is my understanding, correct me if I am wrong:
You can think about lazySet() as "semi" volatile: it's basically a non-volatile variable in terms of reading by other threads, i.e. the value set by lazySet may not be visible to to other threads. But it becomes volatile when another write operation occurs (may be from other threads).
The only impact of lazySet I can imagine is compareAndSet. So if you use lazySet(), get() from other threads may still get the old value, but compareAndSet() will always have the new value since it is a write operation.

Re: attempt to dumb it down -
You can think of this as a way to treat a volatile field as if it wasn't volatile for a particular store (eg: ref = null;) operation.
That isn't perfectly accurate, but it should be enough that you could make a decision between "OK, I really don't care" and "Hmm, let me think about that for a bit".

Related

Java instruction reordering and CPU memory reordering

This is a follow up question to
How to demonstrate Java instruction reordering problems?
There are many articles and blogs referring to Java and JVM instruction reordering which may lead to counter-intuitive results in user operations.
When I asked for a demonstration of Java instruction reordering causing unexpected results, several comments were made to the effect that a more general area of concern is memory reordering, and that it would be difficult to demonstrate on an x86 CPU.
Is instruction reordering just a part of a bigger issue of memory reordering, compiler optimizations and memory models? Are these issues really unique to the Java compiler and the JVM? Are they specific to certain CPU types?

Memory reordering is possible without compile-time reordering of operations in source vs. asm. The order of memory operations (loads and stores) to coherent shared cache (i.e. memory) done by a CPU running a thread is also separate from the order it executes those instructions in.
Executing a load is accessing cache (or the store buffer), but executing" a store in a modern CPU is separate from its value actually being visible to other cores (commit from store buffer to L1d cache). Executing a store is really just writing the address and data into the store buffer; commit isn't allowed until after the store has retired, thus is known to be non-speculative, i.e. definitely happening.
Describing memory reordering as "instruction reordering" is misleading. You can get memory reordering even on a CPU that does in-order execution of asm instructions (as long as it has some mechanisms to find memory-level parallelism and let memory operations complete out of order in some ways), even if asm instruction order matches source order. Thus that term wrongly implies that merely having plain load and store instructions in the right order (in asm) would be useful for anything related to memory order; it isn't, at least on non-x86 CPUs. It's also weird because instructions have effects on registers (at least loads, and on some ISAs with post-increment addressing modes, stores can, too).
It's convenient to talk about something like StoreLoad reordering as x = 1 "happening" after a tmp = y load, but the thing to talk about is when the effects happen (for loads) or are visible to other cores (for stores) in relation to other operations by this thread. But when writing Java or C++ source code, it makes little sense to care whether that happened at compile time or run-time, or how that source turned into one or more instructions. Also, Java source doesn't have instructions, it has statements.
Perhaps the term could make sense to describe compile-time reordering between bytecode instructions in a .class vs. JIT compiler-generate native machine code, but if so then it's a mis-use to use it for memory reordering in general, not just compile/JIT-time reordering excluding run-time reordering. It's not super helpful to highlight just compile-time reordering, unless you have signal handlers (like POSIX) or an equivalent that runs asynchronously in the context of an existing thread.
This effect is not unique to Java at all. (Although I hope this weird use of "instruction reordering" terminology is!) It's very much the same as C++ (and I think C# and Rust for example, probably most other languages that want to normally compile efficiently, and require special stuff in the source to specify when you want your memory operations ordered wrt. each other, and promptly visible to other threads). https://preshing.com/20120625/memory-ordering-at-compile-time/
C++ defines even less than Java about access to non-atomic<> variables without synchronization to ensure that there's never a write in parallel with anything else (undefined behaviour1).
And even present in assembly language, where by definition there's no reordering between source and machine code. All SMP CPUs except a few ancient ones like 80386 also do memory-reordering at run-time, so lack of instruction reordering doesn't gain you anything, especially on machines with a "weak" memory model (most modern CPUs other than x86): https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 is "strongly ordered", but not SC: it's program-order plus a store buffer with store forwarding. So if you want to actually demo the breakage from insufficient ordering in Java on x86, it's either going to be compile-time reordering or lack of sequential consistency via StoreLoad reordering or store-buffer effects. Other unsafe code like the accepted answer on your previous question that might happen to work on x86 will fail on weakly-ordered CPUs like ARM.
(Fun fact: modern x86 CPUs aggressively execute loads out of order, but check to make sure they were "allowed" to do that according to x86's strongly-ordered memory model, i.e. that the cache line they loaded from is still readable, otherwise roll back the CPU state to before that: machine_clears.memory_ordering perf event. So they maintain the illusion of obeying the strong x86 memory-ordering rules. Other ISAs have weaker orders and can just aggressively execute loads out of order without later checks.)
Some CPU memory models even allow different threads to disagree about the order of stores done by two other threads. So the C++ memory model allows that, too, so extra barriers on PowerPC are only needed for sequential consistency (atomic with memory_order_seq_cst, like Java volatile) not acquire/release or weaker orders.
Related:
How does memory reordering help processors and compilers?
How is load->store reordering possible with in-order commit? - memory reordering on in-order CPUs via other effects, like scoreboarding loads with a cache that can do hit-under-miss, and/or out-of-order commit from the store buffer, on weakly-ordered ISAs that allow this. (Also LoadStore reordering on OoO exec CPUs that still retire instructions in order, which is actually more surprising than on in-order CPUs which have special mechanisms to allow memory-level parallelism for loads, that OoO exec could replace.)
Are memory barriers needed because of cpu out of order execution or because of cache consistency problem? (basically a duplicate of this; I didn't say much there that's not here)
Are loads and stores the only instructions that gets reordered? (at runtime)
Does an x86 CPU reorder instructions? (yes)
Can a speculatively executed CPU branch contain opcodes that access RAM? - store execution order isn't even relevant for memory ordering between threads, only commit order from the store buffer to L1d cache. A store buffer is essential to decouple speculative exec (including of store instructions) from anything that's visible to other cores. (And from cache misses on those stores.)
Why is integer assignment on a naturally aligned variable atomic on x86? - true in asm, but not safe in C/C++; you need std::atomic<int> with memory_order_relaxed to get the same asm but in portably-safe way.
Globally Invisible load instructions - where does load data come from: store forwarding is possible, so it's more accurate to say x86's memory model is "program order + a store buffer with store forwarding" than to say "only StoreLoad reordering", if you ever care about this core reloading its own recent stores.
Why memory reordering is not a problem on single core/processor machines? - just like the as-if rule for compilers, out-of-order exec (and other effects) have to preserve the illusion (within one core and thus thread) of instructions fully executing one at a time, in program order, with no overlap of their effects. This is basically the cardinal rule of CPU architecture.
LWN: Who's afraid of a big bad optimizing compiler? - surprising things compilers can do to C code that uses plain (non-volatile / non-_Atomic accesses). This is mostly relevant for the Linux kernel, which rolls its own atomics with inline asm for some things like barriers, but also just C volatile for pure loads / pure stores (which is very different from Java volatile2.)
Footnote 1: C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered. In practice for memory ordering, the consequences are often predictable (for experts who are used to looking at compiler-generated asm) depending on the target machine and optimization level, e.g. hoisting loads out of loops breaking spin-wait loops that fail to use atomic. But of course you're totally at the mercy of whatever the compiler happens to do when your program contains UB, not at all something you can rely on.
Caches are coherent, despite common misconceptions
However, all real-world systems that Java or C++ run multiple threads across do have coherent caches; seeing stale data indefinitely in a loop is a result of compilers keeping values in registers (which are thread-private), not of CPU caches not being visible to each other. This is what makes C++ volatile work in practice for multithreading (but don't actually do that because C++11 std::atomic made it obsolete).
Effects like never seeing a flag variable change are due to compilers optimizing global variables into registers, not instruction reordering or cpu caching. You could say the compiler is "caching" a value in a register, but you can choose other wording that's less likely to confuse people that don't already understand thread-private registers vs. coherent caches.
Footnote 2: When comparing Java and C++, also note that C++ volatile doesn't guarantee anything about memory ordering, and in fact in ISO C++ it's undefined behaviour for multiple threads to be writing the same object at the same time even with volatile. Use std::memory_order_relaxed if you want inter-thread visibility without ordering wrt. surrounding code.
(Java volatile is like C++ std::atomic<T> with the default std::memory_order_seq_cst, and AFAIK Java provides no way to relax that to do more efficient atomic stores, even though most algorithms only need acquire/release semantics for their pure-loads and pure-stores, which x86 can do for free. Draining the store buffer for sequential consistency costs extra. Not much compared to inter-thread latency, but significant for per-thread throughput, and a big deal if the same thread is doing a bunch of stuff to the same data without contention from other threads.)

Understanding Java volatile visibility

I'm reading about the Java volatile keyword and have confusion about its 'visibility'.
A typical usage of volatile keyword is:
volatile boolean ready = false;
int value = 0;
void publisher() {
value = 5;
ready = true;
}
void subscriber() {
while (!ready) {}
System.out.println(value);
}
As explained by most tutorials, using volatile for ready makes sure that:
change to ready on publisher thread is immediately visible to other threads (subscriber);
when ready's change is visible to other thread, any variable update preceding to ready (here is value's change) is also visible to other threads;
I understand the 2nd, because volatile variable prevents memory reordering by using memory barriers, so writes before volatile write cannot be reordered after it, and reads after volatile read cannot be reordered before it. This is how ready prevents printing value = 0 in the above demo.
But I have confusion about the 1st guarantee, which is visibility of the volatile variable itself. That sounds a very vague definition to me.
In other words, my confusion is just on SINGLE variable's visibility, not multiple variables' reordering or something. Let's simplify the above example:
volatile boolean ready = false;
void publisher() {
ready = true;
}
void subscriber() {
while (!ready) {}
}
If ready is not defined volatile, is it possible that subscriber get stuck infinitely in the while loop? Why?
A few questions I want to ask:
What does 'immediately visible' mean? Write operation takes some time, so after how long can other threads see volatile's change? Can a read in another thread that happens very shortly after the write starts but before the write finishes see the change?
Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, then why do we need volatile here?
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads. That doesn't sound a correct explain.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
Hope I explained my question clearly.

Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, so what can volatile help here?
That doesn't help you. You aren't writing code for a modern CPU, you are writing code for a Java virtual machine that is allowed to have a virtual machine that has a virtual CPU whose virtual CPU caches are not coherent.
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads. That doesn't sound a correct explain.
That is correct. But understand, that's with respect to the virtual machine that you are coding for. Its memory may well be implemented in your physical CPU's caches. That may allow your machine to use the caches and still have the memory visibility required by the Java specification.
Using volatile may ensure that writes go directly to the virtual machine's memory instead of the virtual machine's virtual CPU cache. The virtual machine's CPU cache does not need to provide visibility between threads because the Java specification doesn't require it to.
You cannot assume that characteristics of your particular physical hardware necessarily provide benefits that Java code can use directly. Instead, the JVM trades off those benefits to improve performance. But that means your Java code doesn't get those benefits.
Again, you are not writing code for your physical CPU, you are writing code for the virtual CPU that your JVM provides. That your CPU has coherent caches allows the JVM to do all kinds of optimizations that boost your code's performance, but the JVM is not required to pass those coherent caches through to your code and real JVM's do not. Doing so would mean eliminating a significant number of extremely valuable optimizations.

Relevant bits of the language spec:
volatile keyword: https://docs.oracle.com/javase/specs/jls/se16/html/jls-8.html#jls-8.3.1.4
memory model: https://docs.oracle.com/javase/specs/jls/se16/html/jls-17.html#jls-17.4
The CPU cache is not a factor here, as you correctly said.
This is more about optimizations. If ready is not volatile, the compiler is free to interpret
// this
while (!ready) {}
// as this
if (!ready) while(true) {}
That's certainly an optimization, it has to evaluate the condition fewer times. The value is not changed in the loop, it can be "reused". In terms of single-thread semantics it is equivalent, but it won't do what you wanted.
That's not to say this would always happen. Compilers are free to do that, they don't have to.

If ready is not defined volatile, is it possible that subscriber get stuck infinitely in the while loop?
Yes.
Why?
Because the subscriber may not ever see the results of the publisher's write.
Because ... the JLS does not require the value of an variable to be written to memory ... except to satisfy the specified visibility constraints.
What does 'immediately visible' mean? Write operation takes some time, so after how long can other threads see volatile's change? Can a read in another thread that happens very shortly after the write starts but before the write finishes see the change?
(I think) that the JMM specifies or assumes that it is physically impossible to read and write the same conceptual memory cell at the same time. So operations on a memory cell are time ordered. Immediately visible means visible in the next possible opportunity to read following the write.
Visibility, for modern CPUs is guaranteed by cache coherence protocol (e.g. MESI) anyway, so what can volatile help here?
Compilers typically generate code that holds variables in registers, and only writes the values to memory when necessary. Declaring a variable as volatile means that the value must be written to memory. If you take this into consideration, you cannot rely on just the (hypothetical or actual) behavior of cache implementations to specify what volatile means.
While current generation modern CPU / cache architectures behave that way, there is no guarantee that all future computers will behave that way.
Some articles say volatile variable uses memory directly instead of CPU cache, which guarantees visibility between threads.
Some people say that is incorrect ... for CPUs that implement a cache coherency protocol. However, that is beside the point, because as I described above, the current value of a variable may not yet have been written to the cache yet. Indeed, it may never be written to the cache.
Time : ---------------------------------------------------------->
writer : --------- | write | -----------------------
reader1 : ------------- | read | -------------------- can I see the change?
reader2 : --------------------| read | -------------- can I see the change?
So lets assume that your diagram shows physical time and represents threads running on different physical cores, reading and writing a cache-coherent memory cell via their respective caches.
What would happen at the physical level would depend on how the cache-coherency is implemented.
I would expect Reader 1 to see the previous state of the cell (if it was available from its cache) or the new state if it wasn't. Reader 2 would see the new state. But it also depends on how long it takes for the writer thread's cache invalidation to propagate to the others' caches. And all sorts of other stuff that is hard to explain.
In short, we don't really know what would happen at the physical level.
But on the other hand, the writer and readers in the above picture can't actually observe the physical time like that anyway. And neither can the programmer.
What the program / programmer sees is that the reads and writes DO NOT OVERLAP. When the necessary happens before relations are present, there will be guarantees about visibility of memory writes by one thread to subsequent1 reads by another. This applies for volatile variables, and for various other things.
How that guarantee is implemented, is not your problem. And it really doesn't help if you do understand what it going on at the hardware level, because you don't actually know what code the JIT compiler is going to emit (today!) anyway.
1 - That is, subsequent according to the synchronization order ... which you could view as a logical time. The JLS Memory model doesn't actually talk about time at all.

Answers to your 3 questions:
A change of a volatile write doesn't need to be 'immediately' visible to a volatile load. A correctly synchronized Java program will behave as if it is sequential consistent and for sequential consistency the real time order of loads/stores isn't relevant. So reads and writes can be skewed as long as the program order isn't violated (or as long as nobody can observe it). Linearizability = sequential consistency + respect real time order. For more info see this answer.
I still need to dig into the exact meaning of visible, but AFAIK it is mostly a compiler level concern because hardware will prevent buffering loads/stores indefinitely.
You are completely right about the articles being wrong. A lot of nonsense is written and 'flushing volatile writes to main memory instead of using the cache' is the most common misunderstanding I'm seeing. I think 50% of all my SO comments is about informing people that caches are always coherent. A great book on the topic is 'A primer on memory consistency and cache coherence 2e' which is available for free.
The informal semantics of the Java Memory model contains 3 parts:
atomicity
visibility
ordering
Atomicity is about making sure that a read/write/rmw happens atomically in the global memory order. So nobody can observe some in between state. This deals with access atomicity like torn read/write, word tearing and proper alignment. It also deals with operational atomicity like rmw.
IMHO it should also deal with store atomicity; so making sure that there is a point in time where the store becomes visibly to all cores. If you have for example the X86, then due to load buffering, a store can become visible to the issuing core earlier than to other cores and you have a violation of atomicity. But I haven't seen it being mentioned in the JMM.
Visibility: this deals mostly with preventing compiler optimizations since the hardware will prevent delaying loads and buffering stores indefinitely. In some literature they also throw ordering of surrounding loads/stores under visibility; but I don't believe this is correct.
Ordering: this is the bread and butter of memory models. It will make sure that loads/stores issued by a single processor don't get reordered. In the first example you can see the need for such behavior. This is the realm of the compiler barriers and cpu memory barriers.
For more info see:
https://download.oracle.com/otndocs/jcp/memory_model-1.0-pfd-spec-oth-JSpec/

I'll just touch on this part :
change to ready on publisher thread is immediately visible to other threads
that is not correct and the articles are wrong. The documentation makes a very clear statement here:
A write to a volatile field happens-before every subsequent read of that field.
The complicated part here is subsequent. In plain english this means that when someone sees ready as being true, it will also see value as being 5. This automatically implies that you need to observe that value to be true, and it can happen that you might observe a different thing. So this is not "immediately".
What people confuse this with, is the fact that volatile offers sequential consistency, which means that if someone has observed ready == true, then everyone will also (unlike release/acquire, for example).

cost of volatile read in java when write are infrequent [duplicate]

I know that writing to a volatile variable flushes it from the memory of all the cpus, however I want to know if reads to a volatile variable are as fast as normal reads?
Can volatile variables ever be placed in the cpu cache or is it always fetched from the main memory?

You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues volatile reads can be a lot slower (also for x86) than non-volatile reads on x86.
Test 1 is a parallel read and write to a non-volatile variable. There
is no visibility mechanism and the results of the reads are
potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically. However worth noting that a contended volatile can be very slow.
Test 3 is a read to a volatile in a tight loop. Demonstrated is that the semantics of what it means to be volatile indicate that the value can change with each loop iteration. Thus the JVM can not optimize the read and hoist it out of the loop. In Test 1, it is likely the value was read and stored once, thus there is no actual "read" occurring.
Credit to Marc Booker for running these tests.

The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: There is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barriers "LoadLoad, LoadStore, StoreLoad, and StoreStore". Depending on the architecture, some of these barriers correspond to a "no-op", meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy handed locking.

It is architecture dependent. What volatile does is tell the compiler not to optimise that variable away. It forces most operations to treat the variable's state as an unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So, reads will need to re-read the variable and operations will be of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.

Volatile reads cannot be as quick, especially on multi-core CPUs (but also only single-core).
The executing core has to fetch from the actual memory address to make sure it gets the current value - the variable indeed cannot be cached.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!

volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache. The cache will guaranty consistency between any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to insure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.

Unsafe compareAndSwapInt vs synchronize

I found that almost all high level synchronization abstractions(like Semaphore, CountDownLatch, Exchanger from java.util.concurrent) and concurrent collections are using methods from Unsafe(like compareAndSwapInt method) to define critical section. In the same time I expected that synchronize block or method will be used for this purpose.
Could you explain is the Unsafe methods(I mean only methods that could atomically set a value) more efficient than synchronize and why it is so?

Using synchronised is more efficient if you expect to be waiting a long time (e.g. milli-seconds) as the thread can fall asleep and release the CPU to do other work.
Using compareAndSwap is more efficient if you expect the operation to happen quite quickly. This is because it is a simple machine code instruction and take as little as 10 ns. However if a resources is heavily contented this instruction must busy wait and if it cannot obtain the value it needs, it can consume the CPU busily until it does.
If you use off heap memory, you can control the layout of the data being shared and avoid false sharing (where the same cache line is being updated by more than one CPU). This is important when you have multiple values you might want to update independently. e.g. for a ring buffer.
Note that the internal implementation of a typical JVM (e.g., hotspot) will often used the compare-and-swap hardware instruction as part of the synchronized implementation if such an instruction is available (e.g,. x86), while the other common alternative is LL/SC (e.g., POWER, ARM). A typical strategy to for the fast path to use compare-and-swap (or equivalent) to attempt to obtain the lock if it is free, followed possibly by a short spin-loop and finally if that fails falling back to an OS-level blocking primitive (e.g., futex, Events). The details go far beyond this and include techniques such as biased locking and are ultimately implementation dependent.

The answer above is not fulfilling. The approach is: A mutex (synchronization) is not necessary because the only one operation does the work (all what is to do in mutex), and this only one operation is not able to interrupt. But that is the half answer, because in a multicore system another CPU can write to the same memory location. For that reason the compareAndSwap machine code instruction reads and writes not only in the cache, it reads and writes to real memory. That needs a little more access time to the RAM. The CompareAndSwap machine code operation checks whether the RAM content is changed in comparison to the before read value, only then the new value is stored. If I have more time, I write an example here.
Effective, the compareAndSwap access is faster than a lock and unlock anytime. But it can only be used if only exact one memory location have to be changend in the access. If more as one memory locations should be commonly changed (should be always consistente together), the compareAndSwap CANNOT be used, only synchronzed can be used. In the answer above it was written, compareAndSwap is often used to implement the synchronized operation. That is correct, because the singular synchronized (get mutex) and end-synchronized (release mutex) need only exact one atomic instruction, inside the task scheduler. Hence atomic access is the base of all. But between synchronized{ .... } the scheduler knows that a thread switch is guarded.
This program approach is valid not only for Java, for C/++ (and maybe other languages - ) it is also important and able to use.

volatile and synchronized on single core cpu (example - pentium pro)

I have read and know in detail in the implications of the Java volatile and synchronized keyword at the cpu level in SMP architecture based CPUs.
A great paper on that subject is here:
http://irl.cs.ucla.edu/~yingdi/web/paperreading/whymb.2010.06.07c.pdf
Now, leave SMP cpus aside for this question. My question is: How does volatile and synchronized keyword work as it relates to older single core CPUs. Example a Pentium I/Pro/II/III/earlier IVs.
I want to know specifically:
1) Is the L1-L2 caches not used to read memory addresses and all reads and writes are performed directly to main memory? If yes why? (Since there is only a single cache copy and no need for coherency protocols, why can't the cache be directly used by two threads that are time slicing the single core CPU ?). This is me asking this question after reading dozens of internet forums about how volatile reads and writes to/from the "master copy in main memory".
2) Apart from taking a lock on the this or specified object which is more of a Java platform thingy, what other effects does the synchronized keyword have on single core CPUs (compilers, assembly, execution, cache) ?
3) With a non superscalar CPU (Pentium I), instructions are not re-ordered. So if that is the case, then is volatile keyword required while running on a Pentium 1? (atomicity, visibility and ordering would be a "no problemo" right, because there is only one cache, one core to work on the cache, and no re-ordering).

1) Is the L1-L2 caches not used to read memory addresses and all reads and writes are performed directly to main memory?
No. The caches are still enabled. That's not related to SMP.
2) Apart from taking a lock on the this or specified object which is more of a Java platform thingy, what other effects does the synchronized keyword have on single core CPUs (compilers, assembly, execution, cache) ?
3) Does anything change with respect to a superscalar/non superscalar architecture (out-of-order) processor w.r.t these two keywords?
Gosh, do you have to ask this question about Java? Remember that all things eventually boil down to good ol' fashioned machine instructions. I'm not intimitely familiar with the guts of Java synchronization, but as I understand it, synchronized is just syntactic sugar for your typical monitor-style synchronization mechanism. Multiple threads are not allowed in a critical section simultaneously. Instead of simply spinning on a spinlock, the scheduler is leveraged - the waiting threads are put to sleep, and woken back up when to lock is able to be taken again.
The thing to remember is that even on a single-core, non-SMP system, you still have to worry about OS preemption of threads! These threads can be scheduled on and off of the CPU whenever the OS wants to. This is the purpose for the locks, of course.
Again, this question is so much better asked under the context of assembly, or even C (whose compiled result can often times be directly inferred) as opposed to Java, which has to deal with the VM, JITted code, etc.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.