Why define the Java memory model?


Java's multithreaded code is finally mapped to the operating system thread for execution.
Is the operating system thread not thread safe?
Why use the Java memory model to ensure thread safety? Why define the Java memory model?
I hope someone can answer this question. I have looked up a lot of information on the Internet and still do not understand.
The material on the web is all about atomicity, visibility, and ordering, and uses the cache coherence model as an example, but I don't think it really answers the question.
Thank you very much!

The operating system thread is not thread safe (that statement does not make a lot of sense, but basically, the operating system does not ensure that the intended atomicity of your code is respected).
The problem is that whether two data items are related and therefore need to be synchronized is only really understood by your application.
For example, imagine you are defining a ListOfIntegers class which contains an int array and count of the number of items used in the array. These two data items are related and the way they are updated needs to be co-ordinated in order to ensure that if the object is accessed by two different threads they are always updated in a consistent manner, even if the threads update them simultaneously. Only your application knows how these data items are related. The operating system doesn't know. They are just two pieces of memory as far as it is concerned. That is why you have to implement the thread safety (by using synchronized or carefully arranging how the fields are updated).
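A minimal sketch of that ListOfIntegers idea, assuming plain synchronized methods are acceptable. The method names and the fillConcurrently helper are made up for illustration; the point is only that the array and the count change together under one lock.

```java
import java.util.Arrays;

// Hypothetical sketch of the ListOfIntegers class described above.
final class ListOfIntegers {
    private int[] items = new int[4]; // backing storage
    private int count = 0;            // number of slots of items in use

    // synchronized makes "store the value" and "bump the count" one atomic step.
    public synchronized void add(int value) {
        if (count == items.length) {
            items = Arrays.copyOf(items, count * 2);
        }
        items[count] = value;
        count++;
    }

    public synchronized int size() {
        return count;
    }

    public synchronized int get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return items[index];
    }

    // Illustration helper: several threads append concurrently; returns the final size.
    static int fillConcurrently(int threads, int perThread) {
        ListOfIntegers list = new ListOfIntegers();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    list.add(i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(4, 1000)); // prints 4000
    }
}
```

Without synchronized, two threads could both read the same count and overwrite each other's slot, and the operating system would have no way to know that this is wrong.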
The Java "memory model" is pretty close to the hardware model. There is a per-thread stack for local variables, and objects are allocated on the heap. Synchronization is provided to allow the programmer to lock access to shared data on the heap. In addition, there are rules that the optimizer must follow so that the optimisations don't defeat the synchronisations put in place.

Every programming language that takes concurrency seriously needs a memory model - and here is why.
The memory model is the crux of the concurrency semantics of shared-memory systems. It defines the possible values that a read operation is allowed to return for any given set of write operations performed by a concurrent program, thereby defining the basic semantics of shared variables. In other words, the memory model specifies the set of allowed outputs of a program's read and write operations, and constrains an implementation to produce only (but at least one) such allowed executions. The memory model may and often does allow executions where the outcome cannot be inferred from the order in which read and write operations occur in the program. It is impossible to meaningfully reason about a program or any part of the programming language implementation without an unambiguous memory model. The memory model defines the possible outcomes of a concurrent program's read and write operations. Conversely, the memory model also defines which instruction reorderings may be permitted, either by the processor, the memory system, or the compiler.
This is an excerpt from the paper Memory Models for C/C++ Programmers, which I co-authored. Even though a large part of it is dedicated to the C++ memory model, it also covers more general areas, starting with the reason why we need a memory model in the first place, explaining the (intuitive) sequentially consistent model, and finally the weaker memory models provided by current hardware in x86 and ARM/POWER CPUs.

The Java Memory Model answers the following question: What shall happen when multiple threads modify the same memory location?
And the answer the memory model gives is:
If a program has no data races, then all executions of the program will appear to be sequentially consistent.
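To make that guarantee concrete, here is a small sketch of a data-race-free program. The class and method names are invented for illustration. The only conflicting accesses to the plain field data are ordered by the volatile field ready, so the reader behaves as if the two threads were simply interleaved.

```java
// Hypothetical sketch: message passing through a volatile flag.
final class MessagePassing {
    private int data = 0;                   // plain field
    private volatile boolean ready = false; // volatile flag orders access to data

    void publish(int value) {
        data = value;  // (1) plain write
        ready = true;  // (2) volatile write, publishes (1)
    }

    int awaitAndRead() {
        while (!ready) {        // (3) volatile read, spins until (2) is seen
            Thread.onSpinWait();
        }
        return data;            // (4) must observe the value written in (1)
    }

    // Illustration helper: one writer thread, the caller acts as the reader.
    static int runOnce(int value) {
        MessagePassing mp = new MessagePassing();
        Thread writer = new Thread(() -> mp.publish(value));
        writer.start();
        int seen = mp.awaitAndRead();
        try {
            writer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(42)); // prints 42
    }
}
```

If ready were a plain field, accesses (1) and (4) would form a data race and the sequential-consistency promise would no longer apply: the reader could spin forever or return 0.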
There is a great paper by Sarita V. Adve and Hans-J. Boehm about why the Java and C++ memory models are designed the way they are: Memory Models: A Case For Rethinking Parallel Languages and Hardware.
From the paper:
We have been repeatedly surprised at how difficult it is to formalize the seemingly simple and fundamental property of “what value a read should return in a multithreaded program.”
Memory models, which describe the semantics of shared variables, are crucial to both correct multithreaded applications and the entire underlying implementation stack. It is difficult to teach multithreaded programming without clarity on memory models.
After much prior confusion, major programming languages are converging on a model that guarantees simple interleaving-based semantics for "data-race-free" programs and most hardware vendors have committed to support this model.

Related

Java instruction reordering and CPU memory reordering

This is a follow up question to
How to demonstrate Java instruction reordering problems?
There are many articles and blogs referring to Java and JVM instruction reordering which may lead to counter-intuitive results in user operations.
When I asked for a demonstration of Java instruction reordering causing unexpected results, several comments were made to the effect that a more general area of concern is memory reordering, and that it would be difficult to demonstrate on an x86 CPU.
Is instruction reordering just a part of a bigger issue of memory reordering, compiler optimizations and memory models? Are these issues really unique to the Java compiler and the JVM? Are they specific to certain CPU types?

Memory reordering is possible without compile-time reordering of operations in source vs. asm. The order of memory operations (loads and stores) to coherent shared cache (i.e. memory) done by a CPU running a thread is also separate from the order it executes those instructions in.
Executing a load is accessing cache (or the store buffer), but "executing" a store in a modern CPU is separate from its value actually being visible to other cores (commit from store buffer to L1d cache). Executing a store is really just writing the address and data into the store buffer; commit isn't allowed until after the store has retired, thus is known to be non-speculative, i.e. definitely happening.
Describing memory reordering as "instruction reordering" is misleading. You can get memory reordering even on a CPU that does in-order execution of asm instructions (as long as it has some mechanisms to find memory-level parallelism and let memory operations complete out of order in some ways), even if asm instruction order matches source order. Thus that term wrongly implies that merely having plain load and store instructions in the right order (in asm) would be useful for anything related to memory order; it isn't, at least on non-x86 CPUs. It's also weird because instructions have effects on registers (at least loads, and on some ISAs with post-increment addressing modes, stores can, too).
It's convenient to talk about something like StoreLoad reordering as x = 1 "happening" after a tmp = y load, but the thing to talk about is when the effects happen (for loads) or are visible to other cores (for stores) in relation to other operations by this thread. But when writing Java or C++ source code, it makes little sense to care whether that happened at compile time or run-time, or how that source turned into one or more instructions. Also, Java source doesn't have instructions, it has statements.
Perhaps the term could make sense to describe compile-time reordering between bytecode instructions in a .class vs. JIT-compiler-generated native machine code, but if so then it's a misuse to use it for memory reordering in general, not just compile/JIT-time reordering excluding run-time reordering. It's not super helpful to highlight just compile-time reordering, unless you have signal handlers (like POSIX) or an equivalent that runs asynchronously in the context of an existing thread.
This effect is not unique to Java at all. (Although I hope this weird use of "instruction reordering" terminology is!) It's very much the same as C++ (and I think C# and Rust for example, probably most other languages that want to normally compile efficiently, and require special stuff in the source to specify when you want your memory operations ordered wrt. each other, and promptly visible to other threads). https://preshing.com/20120625/memory-ordering-at-compile-time/
C++ defines even less than Java about access to non-atomic<> variables without synchronization to ensure that there's never a write in parallel with anything else (undefined behaviour; see footnote 1).
The effect is even present in assembly language, where by definition there's no reordering between source and machine code. All SMP CPUs except a few ancient ones like 80386 also do memory-reordering at run-time, so lack of instruction reordering doesn't gain you anything, especially on machines with a "weak" memory model (most modern CPUs other than x86): https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 is "strongly ordered", but not SC: it's program-order plus a store buffer with store forwarding. So if you want to actually demo the breakage from insufficient ordering in Java on x86, it's either going to be compile-time reordering or lack of sequential consistency via StoreLoad reordering or store-buffer effects. Other unsafe code like the accepted answer on your previous question that might happen to work on x86 will fail on weakly-ordered CPUs like ARM.
(Fun fact: modern x86 CPUs aggressively execute loads out of order, but check to make sure they were "allowed" to do that according to x86's strongly-ordered memory model, i.e. that the cache line they loaded from is still readable, otherwise roll back the CPU state to before that: machine_clears.memory_ordering perf event. So they maintain the illusion of obeying the strong x86 memory-ordering rules. Other ISAs have weaker orders and can just aggressively execute loads out of order without later checks.)
Some CPU memory models even allow different threads to disagree about the order of stores done by two other threads. So the C++ memory model allows that, too, so extra barriers on PowerPC are only needed for sequential consistency (atomic with memory_order_seq_cst, like Java volatile) not acquire/release or weaker orders.
Related:
How does memory reordering help processors and compilers?
How is load->store reordering possible with in-order commit? - memory reordering on in-order CPUs via other effects, like scoreboarding loads with a cache that can do hit-under-miss, and/or out-of-order commit from the store buffer, on weakly-ordered ISAs that allow this. (Also LoadStore reordering on OoO exec CPUs that still retire instructions in order, which is actually more surprising than on in-order CPUs which have special mechanisms to allow memory-level parallelism for loads, that OoO exec could replace.)
Are memory barriers needed because of cpu out of order execution or because of cache consistency problem? (basically a duplicate of this; I didn't say much there that's not here)
Are loads and stores the only instructions that gets reordered? (at runtime)
Does an x86 CPU reorder instructions? (yes)
Can a speculatively executed CPU branch contain opcodes that access RAM? - store execution order isn't even relevant for memory ordering between threads, only commit order from the store buffer to L1d cache. A store buffer is essential to decouple speculative exec (including of store instructions) from anything that's visible to other cores. (And from cache misses on those stores.)
Why is integer assignment on a naturally aligned variable atomic on x86? - true in asm, but not safe in C/C++; you need std::atomic<int> with memory_order_relaxed to get the same asm but in portably-safe way.
Globally Invisible load instructions - where does load data come from: store forwarding is possible, so it's more accurate to say x86's memory model is "program order + a store buffer with store forwarding" than to say "only StoreLoad reordering", if you ever care about this core reloading its own recent stores.
Why memory reordering is not a problem on single core/processor machines? - just like the as-if rule for compilers, out-of-order exec (and other effects) have to preserve the illusion (within one core and thus thread) of instructions fully executing one at a time, in program order, with no overlap of their effects. This is basically the cardinal rule of CPU architecture.
LWN: Who's afraid of a big bad optimizing compiler? - surprising things compilers can do to C code that uses plain (non-volatile / non-_Atomic) accesses. This is mostly relevant for the Linux kernel, which rolls its own atomics with inline asm for some things like barriers, but also just C volatile for pure loads / pure stores (which is very different from Java volatile; see footnote 2.)
Footnote 1: C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered. In practice for memory ordering, the consequences are often predictable (for experts who are used to looking at compiler-generated asm) depending on the target machine and optimization level, e.g. hoisting loads out of loops breaking spin-wait loops that fail to use atomic. But of course you're totally at the mercy of whatever the compiler happens to do when your program contains UB, not at all something you can rely on.
Caches are coherent, despite common misconceptions
However, all real-world systems that Java or C++ run multiple threads across do have coherent caches; seeing stale data indefinitely in a loop is a result of compilers keeping values in registers (which are thread-private), not of CPU caches not being visible to each other. This is what makes C++ volatile work in practice for multithreading (but don't actually do that because C++11 std::atomic made it obsolete).
Effects like never seeing a flag variable change are due to compilers optimizing global variables into registers, not instruction reordering or cpu caching. You could say the compiler is "caching" a value in a register, but you can choose other wording that's less likely to confuse people that don't already understand thread-private registers vs. coherent caches.
Footnote 2: When comparing Java and C++, also note that C++ volatile doesn't guarantee anything about memory ordering, and in fact in ISO C++ it's undefined behaviour for multiple threads to be writing the same object at the same time even with volatile. Use std::memory_order_relaxed if you want inter-thread visibility without ordering wrt. surrounding code.
(Java volatile is like C++ std::atomic<T> with the default std::memory_order_seq_cst. A volatile field access itself cannot be relaxed, but since Java 9 the VarHandle access modes (getAcquire/setRelease, getOpaque/setOpaque) allow more efficient atomic loads and stores. That matters because most algorithms only need acquire/release semantics for their pure-loads and pure-stores, which x86 can do for free. Draining the store buffer for sequential consistency costs extra. Not much compared to inter-thread latency, but significant for per-thread throughput, and a big deal if the same thread is doing a bunch of stuff to the same data without contention from other threads.)

Why do we need the volatile keyword when the core cache synchronization is done on the hardware level?

So I’m currently listening to this talk.
At minute 28:50 the following statement is made: „the fact that on the hardware it could be in main memory, in multiple level 3 caches, in four level 2 caches […] is not your problem. That’s the problem for the hardware designers.“
Yet, in Java we have to declare a boolean that stops a thread as volatile, since when another thread calls the stop method, it’s not guaranteed that the running thread will be aware of this change.
Why is this the case, when the hardware level should take care of updating every cache with the correct value?
I’m sure I’m missing something here.
Code in question:
public class App {
    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.start();
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        worker.signalStop();
        System.out.println(worker.isShouldStop());
        System.out.println(worker.getVal());
        System.out.println(worker.getVal());
    }

    static class Worker extends Thread {
        // Without volatile the loop in run() may never observe the update.
        private /*volatile*/ boolean shouldStop = false;
        private long val = 0;

        @Override
        public void run() {
            while (!shouldStop) {
                val++;
            }
            System.out.println("Stopped");
        }

        public void signalStop() {
            this.shouldStop = true;
        }

        public long getVal() {
            return val;
        }

        public boolean isShouldStop() {
            return shouldStop;
        }
    }
}

You are assuming the following:
Compiler doesn't reorder the instructions
CPU performs the loads and stores in the order as specified by your program
Then your reasoning makes sense, and this consistency model is called sequential consistency (SC): there is a total order over all loads/stores that is consistent with the program order of each thread. In simple terms: just some interleaving of the loads/stores. The requirements for SC are a bit more strict, but this captures the essence.
If Java and the CPU were SC, there would be no point in making anything volatile.
The problem is that you would get terrible performance. A lot of compiler optimizations rely on rewriting the instructions to something more efficient and this can lead to reordering of loads and stores. It could even decide to optimize-out a load or a store so that it doesn't happen. This is all perfectly fine as long as there is just a single thread involved because the thread will not be able to observe these reordering of loads/stores.
Apart from the compiler, the CPU also likes to reorder loads/stores. Imagine that a CPU needs to make a write, and the cache-line for that write isn't in the right state. The CPU would have to block, which would be very inefficient. Since the store is going to be made anyway, it is better to queue the store in a buffer so that the CPU can continue; as soon as the cache-line is returned in the right state, the store is written to the cache-line and thereby committed to the cache. Store buffering is a technique used by a lot of processors (e.g. ARM/x86). One problem with it is that it can lead to an earlier store to some address being reordered with a newer load to a different address. So instead of having a total order over all loads and stores like SC, you only get a total order over all stores. This model is called TSO (Total Store Order) and you can find it on the x86 and SPARC v8/v9. This approach assumes that the stores in the store buffer are going to be written to the cache in program order; but there is also a relaxation possible such that stores in the store buffer to different cache-lines can be committed to the cache in any order; this is called PSO (Partial Store Order) and you can find it on the SPARC v8/v9.
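The store-then-load pattern described above is usually written down as the "store buffering" litmus test. The sketch below is an illustration with invented names, not a reliable reproducer: both fields are volatile here, so the outcome r1 == 0 and r2 == 0 is forbidden. With plain fields that outcome becomes legal, but a thread-per-run harness like this one will rarely catch it; a tool such as jcstress is the better way to observe it.

```java
// Hypothetical sketch of the store buffering litmus test.
final class StoreBufferLitmus {
    private volatile int x = 0;
    private volatile int y = 0;
    private int r1; // written by thread A, read by the caller after join()
    private int r2; // written by thread B, read by the caller after join()

    // Runs the two-thread test once; true means both threads read 0.
    static boolean bothZeroOnce() {
        StoreBufferLitmus s = new StoreBufferLitmus();
        Thread a = new Thread(() -> { s.x = 1; s.r1 = s.y; }); // store x, then load y
        Thread b = new Thread(() -> { s.y = 1; s.r2 = s.x; }); // store y, then load x
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return s.r1 == 0 && s.r2 == 0;
    }

    // Counts how often the (0, 0) outcome shows up over many runs.
    static int countBothZero(int runs) {
        int hits = 0;
        for (int i = 0; i < runs; i++) {
            if (bothZeroOnce()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countBothZero(1000)); // prints 0 with volatile fields
    }
}
```

On x86 the volatile stores are what force the store buffer to drain before the following load, which is exactly the StoreLoad ordering that TSO does not give you for free.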
SC/TSO/PSO are strong memory models because every load and store is a synchronization action; so they order surrounding loads/stores. This can be pretty expensive because for most instructions, as long as the data-dependency-order is preserved, any ordering is fine because:
most memory is not shared between different CPUs.
if memory is shared, often there is some external synchronization like an unlock/lock of a mutex or release-store/acquire-load that takes care of synchronization. So the synchronization can be delayed.
CPUs with weak memory models like ARM and Itanium make use of this. They make a separation between plain loads and stores and synchronizing loads/stores. And for plain loads and stores, any ordering is fine. And modern processors execute instructions out of order anyway; there is a lot of parallelism inside a single CPU.
Modern processors do implement cache coherence. The only modern processor that doesn't need to implement cache coherence is a GPU. Cache coherence can be implemented in 2 ways:
for small systems the caches can sniff the bus traffic. This is where you see the MESI protocol. This technique is called sniffing (or snooping).
for larger systems you can have a directory that knows the state of each cache-line, which CPUs are sharing the cache-line, and which CPU owns the cache-line (here there is some MESI-like protocol). All requests for a cache-line go through the directory.
The cache coherence protocol makes sure that the cache-line is invalidated on other CPUs before a different CPU can write to the cache line. Cache coherence will give you a total order of loads/stores on a single address, but will not provide any ordering of loads/stores between different addresses.
Coming back to volatile:
So what volatile does is:
prevent reordering loads and stores by the compiler and CPU.
ensure that a load/store becomes visible; so it will prevent the compiler from optimizing-out a load or store.
the load/store is atomic; so you don't get problems like a torn read/write. This includes compiler behavior like natural alignment of the field.
I have given you some technical information about what is happening behind the scenes. But to properly understand volatile, you need to understand the Java Memory Model. It is an abstract model that doesn't care about any implementation details as described above. If you do not apply volatile in your example, you have a data race because a happens-before edge is missing between concurrent conflicting accesses.
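Applied to the Worker from the question, the fix might look like the sketch below. The class name and the stopsWithin helper are made up for this illustration. The volatile flag supplies the missing happens-before edge, and val is only read after join(), which supplies another one.

```java
// Hypothetical sketch: the question's worker with a volatile stop flag.
final class StoppableWorker extends Thread {
    private volatile boolean shouldStop = false; // every loop iteration re-reads this
    private long val = 0; // written only by this thread; read by others after join()

    @Override
    public void run() {
        while (!shouldStop) {
            val++;
        }
    }

    void signalStop() {
        shouldStop = true;
    }

    long getVal() {
        return val;
    }

    // Illustration helper: true if the worker terminated within the timeout.
    static boolean stopsWithin(long timeoutMillis) {
        StoppableWorker worker = new StoppableWorker();
        worker.setDaemon(true); // do not keep the JVM alive if something goes wrong
        worker.start();
        try {
            Thread.sleep(10);
            worker.signalStop();
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(stopsWithin(5000)); // prints true
    }
}
```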
A great book on this topic is
A Primer on Memory Consistency and Cache Coherence, Second Edition. You can download it for free.
I can't recommend you any book on the Java Memory Model because it is all explained in an awful manner. Best to get an understanding of memory models in general before diving into the JMM. Probably the best sources are this doctoral dissertation by Jeremy Manson, and Aleksey Shipilëv: One Stop Page.
PS:
There are situations when you don't care about any ordering guarantees, e.g.
stop flag for a thread
progress indicators
blackholes for microbenchmarks.
This is where the VarHandle.getOpaque/setOpaque can be useful. It provides visibility and atomicity, but it doesn't provide any ordering guarantees with respect to other variables. This is mostly a compiler concern. Most engineers will never need this level of control.
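A minimal sketch of such a stop flag using opaque access, assuming Java 9 or later. The class name and helper are invented for illustration. Note that the field itself is plain and every access goes through the VarHandle.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch: a stop flag with opaque (not volatile) access.
final class OpaqueStopFlag {
    private boolean stop = false; // plain field, only touched through STOP

    private static final VarHandle STOP;
    static {
        try {
            STOP = MethodHandles.lookup()
                    .findVarHandle(OpaqueStopFlag.class, "stop", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void requestStop() {
        STOP.setOpaque(this, true); // becomes visible eventually, no ordering with other fields
    }

    boolean stopRequested() {
        return (boolean) STOP.getOpaque(this); // cannot be hoisted out of a loop
    }

    // Illustration helper: spins in a worker until the flag is set; returns the loop count.
    static long spinUntilStopped(long delayMillis) {
        OpaqueStopFlag flag = new OpaqueStopFlag();
        long[] iterations = new long[1];
        Thread worker = new Thread(() -> {
            long n = 0;
            while (!flag.stopRequested()) {
                n++;
            }
            iterations[0] = n;
        });
        worker.start();
        try {
            Thread.sleep(delayMillis);
            flag.requestStop();
            worker.join(); // join() makes iterations[0] safely readable here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return iterations[0];
    }

    public static void main(String[] args) {
        System.out.println(spinUntilStopped(10) >= 0); // prints true once the worker has stopped
    }
}
```

The flag carries no data with it: anything else the stopping thread wrote before requestStop() is not guaranteed to be visible to the worker, which is the trade-off compared with volatile.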

What you're suggesting is that hardware designers just make the world all ponies and rainbows for you.
They cannot do that - what you want makes the notion of an on-core cache completely impossible. How could a CPU core possibly know that a given memory location needs to be synced up with another core before accessing it any further, short of just keeping the entire cache in sync on a permanent basis, completely invalidating the entire idea of an on-core cache?
If the talk is strongly suggesting that you as a software engineer can just blame hardware engineers for not making life easy for you, it's a horrible and stupid talk. I bet it was presented with a little more nuance than that.
At any rate, you took the wrong lesson from it.
It's a two-way street. The hardware engineering team works together with the JVM team, effectively, to set up a consistent model that is a good equilibrium between 'With these constraints and limited guarantees to the software engineer, the hardware team can make reliable and significant performance improvements' and 'A software engineer can build multicore software with this model without tearing their hair out'.
This happy equilibrium in Java is the JMM (Java Memory Model), which primarily boils down to: All field accesses may have a local thread cache or not, you do not know, and you cannot test if it does. Essentially the JVM has an evil coin and will flip it every time you read a field. Tails, you get the local copy. Heads, it syncs first. The coin is evil in that it is not fair and will land heads throughout development, testing, and the first week, every time, even if you flip it a million times. And then the important potential customer demos your software and you start getting tails.
The solution is to make the JVM never flip it, and this means you need to establish Happens-Before/Happens-After relationships anytime you have a situation anywhere in your code where one thread writes a field and another reads it. volatile is one way to do it.
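volatile is not the only way to get such a relationship. As a sketch with invented names: Thread.start() and Thread.join() also establish happens-before, so a plain field written by a worker can be read safely once the worker has been joined.

```java
// Hypothetical sketch: happens-before through Thread.start() and Thread.join().
final class JoinHappensBefore {
    private long result = 0; // plain field, deliberately not volatile

    // Sums 1..n in a worker thread and reads the plain field after join().
    static long computeInWorker(int n) {
        JoinHappensBefore box = new JoinHappensBefore();
        Thread worker = new Thread(() -> {
            long sum = 0;
            for (int i = 1; i <= n; i++) {
                sum += i;
            }
            box.result = sum; // plain write inside the worker
        });
        worker.start(); // everything before start() is visible to the worker
        try {
            worker.join(); // everything the worker did is visible after join()
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return box.result;
    }

    public static void main(String[] args) {
        System.out.println(computeInWorker(100)); // prints 5050
    }
}
```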
In other words, to give hardware engineers something to work with, you, the software engineer, effectively made the promise that you'll establish HB/HA if you care about synchronizing between threads. So that's your part of the 'deal'. Their part of the deal is that the hardware guarantees the behaviour if you keep up your end of the deal, and that the hardware is very very fast.

What are memory fences used for in Java?

Whilst trying to understand how SubmissionPublisher (source code in OpenJDK 10, Javadoc), a new class added to the Java SE in version 9, has been implemented, I stumbled across a few API calls to VarHandle I wasn't previously aware of:
fullFence, acquireFence, releaseFence, loadLoadFence and storeStoreFence.
After doing some research, especially regarding the concept of memory barriers/fences (I have heard of them previously, yes; but never used them, thus was quite unfamiliar with their semantics), I think I have a basic understanding of what they are for. Nonetheless, as my questions might arise from a misconception, I want to ensure that I got it right in the first place:
Memory barriers are reordering constraints regarding reading and writing operations.
Memory barriers can be categorized into two main categories: unidirectional and bidirectional memory barriers, depending on whether they set constraints on either reads or writes or both.
C++ supports a variety of memory barriers, but these do not match up exactly with those provided by VarHandle. Some of the memory barriers available in VarHandle do, however, provide ordering effects that are compatible with their corresponding C++ memory barriers.
#fullFence is compatible with atomic_thread_fence(memory_order_seq_cst)
#acquireFence is compatible with atomic_thread_fence(memory_order_acquire)
#releaseFence is compatible with atomic_thread_fence(memory_order_release)
#loadLoadFence and #storeStoreFence have no compatible C++ counterpart
The word compatible seems to be really important here since the semantics clearly differ when it comes to the details. For instance, all C++ barriers are bidirectional, whereas Java's barriers aren't (necessarily).
Most memory barriers also have synchronization effects. Those especially depend upon the used barrier type and previously-executed barrier instructions in other threads. As the full implications a barrier instruction has is hardware-specific, I'll stick with the higher-level (C++) barriers. In C++, for instance, changes made prior to a release barrier instruction are visible to a thread executing an acquire barrier instruction.
Are my assumptions correct? If so, my resulting questions are:
Do the memory barriers available in VarHandle cause any kind of memory synchronization?
Regardless of whether they cause memory synchronization or not, what may reordering constraints be useful for in Java? The Java Memory Model already gives some very strong guarantees regarding ordering when volatile fields, locks or VarHandle operations like #compareAndSet are involved.
In case you're looking for an example: The aforementioned BufferedSubscription, an inner class of SubmissionPublisher (source linked above), establishes a full fence in line 1079, function growAndAdd. However, it is unclear to me what it is there for.

This is mainly a non-answer, really (initially wanted to make it a comment, but as you can see, it's far too long). It's just that I questioned this myself a lot, did a lot of reading and research and at this point in time I can safely say: this is complicated. I even wrote multiple tests with jcstress to figure out how they really work (while looking at the assembly code generated) and while some of them somehow made sense, the subject in general is by no means easy.
The very first thing you need to understand:
The Java Language Specification (JLS) does not mention barriers, anywhere. This, for Java, would be an implementation detail: it really acts in terms of happens-before semantics. To be able to properly specify these according to the JMM (Java Memory Model), the JMM would have to change quite a lot.
This is work in progress.
Second, if you really want to scratch the surface here, this is the very first thing to watch. The talk is incredible. My favorite part is when Herb Sutter raises his 5 fingers and says, "This is how many people can really and correctly work with these." That should give you a hint of the complexity involved. Nevertheless, there are some trivial examples that are easy to grasp (like a counter updated by multiple threads that does not care about other memory guarantees, but only cares that it is itself incremented correctly).
Another example is when (in Java) you want a volatile flag to control threads to stop/start. You know, the classical:
volatile boolean stop = false; // one thread writes, one thread reads this
If you work with Java, you would know that without volatile this code is broken (you can read why double-checked locking is broken without it, for example). But do you also know that for some people who write high-performance code this is too much? A volatile read/write also guarantees sequential consistency, which comes with some strong guarantees, and some people want a weaker version of this.
A thread safe flag, but not volatile? Yes, exactly: VarHandle::set/getOpaque.
And you might ask why someone would need that. Not everyone is interested in all the changes that are piggy-backed by a volatile.
Let's see how we can achieve this in Java. First of all, such exotic things already existed in the API: AtomicInteger::lazySet. For a long time this was unspecified in the Java Memory Model and had no clear definition; still people used it (LMAX, afaik or this for more reading). Since Java 9 its Javadoc describes the memory effects in terms of VarHandle::setRelease. IMHO, AtomicInteger::lazySet is VarHandle::releaseFence (or VarHandle::storeStoreFence).
Let's try to answer why someone needs these.
JMM has basically two ways to access a field: plain and volatile (which guarantees sequential consistency). All these methods that you mention are there to bring something in-between these two - release/acquire semantics; there are cases, I guess, where people actually need this.
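As a sketch of that in-between mode (names invented, Java 9 or later assumed): the writer publishes a plain payload with setRelease, and the reader picks it up with getAcquire. This gives the message-passing guarantee without asking for full sequential consistency.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch: release/acquire publication through a VarHandle.
final class ReleaseAcquireBox {
    private int payload = 0;       // plain data
    private boolean ready = false; // plain field, accessed only through READY

    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(ReleaseAcquireBox.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        payload = value;              // plain write
        READY.setRelease(this, true); // earlier writes may not move below this store
    }

    private boolean isReady() {
        return (boolean) READY.getAcquire(this); // later reads may not move above this load
    }

    int awaitPayload() {
        while (!isReady()) {
            Thread.onSpinWait();
        }
        return payload; // sees the value written before setRelease
    }

    // Illustration helper: one writer thread, the caller is the reader.
    static int roundTrip(int value) {
        ReleaseAcquireBox box = new ReleaseAcquireBox();
        Thread writer = new Thread(() -> box.publish(value));
        writer.start();
        int seen = box.awaitPayload();
        try {
            writer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(7)); // prints 7
    }
}
```

What is given up compared with volatile is the StoreLoad ordering: a release store followed by an acquire load of a different variable may still be reordered, so this is not a drop-in replacement for Dekker-style algorithms.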
An even further relaxation from release/acquire would be opaque, which I am still trying to fully understand.
Thus bottom line (your understanding is fairly correct, btw): if you plan to use this in Java, they have no specification at the moment, so do it at your own risk. If you do want to understand them, their C++ equivalent modes are the place to start.

Under what circumstances do you need to synchronize an array in Java?

Under what circumstances do you need to synchronize an array?
My thoughts are, do you need to synchronize for access? Say two threads access the array at the same time, is that going to crash?
What if one edits, while one is reading? (separate values, and the same in different circumstances)
Both editing different things?
Or is there no JVM crash like for arrays when you don't synchronize?

Under what circumstances do you need to synchronize an array?
It's sort of you either always need to or never need to. Like @EJP said, he's never done it because there's almost always a better data structure than an array, anyway (edit: there are lots of good use cases for arrays, but they're almost always used in isolation, e.g. ArrayList). But if you insist on sharing arrays between threads, array elements aren't volatile, so because of possible caching, you'll get inconsistencies and corrupt data without using synchronized.
My thoughts are, do you need to synchronize for access? Say two threads access the array at the same time, is that going to crash?
Crash, no, but your data could be inconsistent, and extra inconsistent if the elements are 64-bit values (long or double) on a 32-bit architecture, where a single write may be split into two 32-bit halves.
What if one edits, while one is reading? (separate values, and the same in different circumstances)
Please don't. Wrapping your head around the Java memory model is hard enough. If you haven't established that a read or a write happened-before another read or write, the ultimate sequencing is undefined.
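As a rough illustration of what establishing happens-before buys you, here is a small sketch where one thread edits two related array elements while another reads them, with both sides going through the same lock. The class GuardedPair and its invariant (the two elements always sum to zero) are invented for this example.

```java
// Hypothetical example class; the invariant is values[0] + values[1] == 0.
final class GuardedPair {
    private final long[] values = new long[2];

    // Both elements change under the same lock, so readers never see a half-done update.
    synchronized void shift() {
        values[0]++;
        values[1]--;
    }

    // Reads both elements under the same lock the writer uses.
    synchronized long sum() {
        return values[0] + values[1];
    }

    // Runs a writer thread against a reading loop and counts broken invariants.
    static int countViolations(int iterations) {
        GuardedPair pair = new GuardedPair();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                pair.shift();
            }
        });
        writer.start();
        int violations = 0;
        for (int i = 0; i < iterations; i++) {
            if (pair.sum() != 0) {
                violations++;
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(countViolations(100_000)); // prints 0
    }
}
```

If you drop the synchronized keywords, the reader may observe the increment without the decrement (or stale values), and the count is no longer guaranteed to be zero.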

This is a difficult question because it touches on a lot of Concurrency topics.
First I'd start with, http://docs.oracle.com/javase/tutorial/essential/concurrency/sync.html
Threads communicate primarily by sharing access to fields and the objects reference fields refer to. This form of communication is extremely efficient, but makes two kinds of errors possible: thread interference and memory consistency errors. The tool needed to prevent these errors is synchronization.
A. Thread Interference describes how errors are introduced when multiple threads access shared data.
B. Memory Consistency Errors describes errors that result from inconsistent views of shared memory.
So to answer the main question directly: you synchronize an array when you believe that it may be accessed in a way that introduces thread interference or memory consistency errors.
You end up with what's called a race condition. Whether that crashes your application or not depends on your application.
So if you do not synchronize access to an array that is shared between multiple threads, you run the risk of threads interleaving modifications to this array (i.e. thread interference), or of threads reading inconsistent data from your array (i.e. memory consistency errors).
The solution is typically to synchronize access to the array, or use a Collection built for concurrency, such as those described at https://docs.oracle.com/javase/tutorial/essential/concurrency/collections.html
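For the thread interference case specifically, here is a minimal sketch using AtomicIntegerArray from java.util.concurrent.atomic. That class is my choice for the example (it is not one of the collections on the linked page), and the demo class and method names are made up.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical demo: several threads increment the same array slot.
final class AtomicArrayDemo {
    static int concurrentIncrements(int threads, int perThread) {
        AtomicIntegerArray counters = new AtomicIntegerArray(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counters.incrementAndGet(0); // atomic read-modify-write on element 0
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counters.get(0);
    }

    public static void main(String[] args) {
        System.out.println(concurrentIncrements(4, 10_000)); // prints 40000
    }
}
```

With a plain int[] and counters[0]++ instead, increments from different threads can overwrite each other and the final count may come out lower than expected.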

Does the CLR perform "lock elision" optimization? If not why not?

The JVM performs a neat trick called lock elision to avoid the cost of locking on objects that are only visible to one thread.
There's a good description of the trick here:
http://www.ibm.com/developerworks/java/library/j-jtp10185/
Does the .Net CLR do something similar? If not then why not?

It's neat, but is it useful? I have a hard time coming up with an example where the compiler can prove that a lock is thread local. Almost all classes don't use locking by default, and when you choose one that locks, then in most cases it will be referenced from some kind of static variable foiling the compiler optimization anyways.
Another thing is that the java vm uses escape analysis in its proof. And AFAIK .net hasn't implemented escape analysis. Other uses of escape analysis such as replacing heap allocations with stack allocations sound much more useful and should be implemented first.
IMO it's currently not worth the coding effort. There are many areas in the .net VM which are not optimized very well and have much bigger impact.
SSE vector instructions and delegate inlining are two examples from which my code would profit much more than from this optimization.

EDIT: As chibacity points out below, this is talking about making locks really cheap rather than completely eliminating them. I don't believe the JIT has the concept of "thread-local objects" although I could be mistaken... and even if it doesn't now, it might in the future of course.
EDIT: Okay, the below explanation is over-simplified, but has at least some basis in reality :) See Joe Duffy's blog post for some rather more detailed information.
I can't remember where I read this - probably "CLR via C#" or "Concurrent Programming on Windows" - but I believe that the CLR allocates sync blocks to objects lazily, only when required. When an object whose monitor has never been contested is locked, the object header is atomically updated with a compare-exchange operation to say "I'm locked". If a different thread then tries to acquire the lock, the CLR will be able to determine that it's already locked, and basically upgrade that lock to a "full" one, allocating it a sync block.
When an object has a "full" lock, locking operations are more expensive than locking and unlocking an otherwise-uncontested object.
If I'm right about this (and it's a pretty hazy memory) it should be feasible to lock and unlock a monitor on different threads cheaply, so long as the locks never overlap (i.e. there's no contention).
I'll see if I can dig up some evidence for this...
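To illustrate the idea of a cheap compare-exchange fast path that only "upgrades" under contention, here is a toy model written in Java. It is only a sketch of the concept described above: the class ThinLockModel and everything in it is invented, it does not block or allocate anything like a real sync block, and it says nothing about how the CLR is actually implemented.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only; NOT how the CLR (or any JVM) really implements monitors.
final class ThinLockModel {
    // Stands in for the "I'm locked" bits in the object header; null means unlocked.
    private final AtomicReference<Thread> owner = new AtomicReference<>();
    // Stands in for "a full lock / sync block has been allocated".
    private final AtomicBoolean inflated = new AtomicBoolean();

    // Fast path: a single compare-and-set. Not reentrant in this toy.
    boolean tryLockFast() {
        if (owner.compareAndSet(null, Thread.currentThread())) {
            return true;
        }
        inflated.set(true); // a real runtime would upgrade to a full lock and wait here
        return false;
    }

    // Releases the lock only if the calling thread owns it.
    void unlock() {
        owner.compareAndSet(Thread.currentThread(), null);
    }

    boolean isInflated() {
        return inflated.get();
    }

    // Helper: tries the fast path from a second thread and reports whether it succeeded.
    boolean tryLockFastFromOtherThread() {
        boolean[] result = new boolean[1];
        Thread other = new Thread(() -> result[0] = tryLockFast());
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }
}
```

As long as lock and unlock calls never overlap, every acquisition stays on the single compare-and-set path; only an overlapping attempt from another thread flips the "inflated" flag.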

In answer to your question: No, the CLR\JIT does not perform "lock elision" optimization i.e. the CLR\JIT does not remove locks from code which is only visible to single threads. This can easily be confirmed with simple single threaded benchmarks on code where lock elision should apply as you would expect in Java.
There are likely to be a number of reasons why it does not do this, but chief among them is the fact that in the .Net framework this is likely to be an uncommonly applied optimization, so it is not worth the effort of implementing.
Also, in .Net uncontended locks are extremely fast because they are non-blocking and executed in user space (JVMs appear to have similar optimizations for uncontended locks, e.g. IBM). To quote from C# 3.0 In A Nutshell's threading chapter:
"Locking is fast: you can expect to acquire and release a lock in less than 100 nanoseconds on a 3 GHz computer if the lock is uncontended"
A couple of example scenarios where lock elision could be applied, and why it's not worth it:
Using locks within a method in your own code that acts purely on locals
There is not really a good reason to use locking in this scenario in the first place, so unlike optimizations such as hoisting loop invariants or method inlining, this is a pretty uncommon case and the result of unnecessary use of locks. The runtime shouldn't be concerned with optimizing out uncommon and extremely bad usage.
Using someone else's type that is declared as a local which uses locks internally
Although this sounds more useful, the .Net framework's general design philosophy is to leave the responsibility for locking to clients, so it's rare that types have any internal lock usage. Indeed, the .Net framework is pathologically unsynchronized when it comes to instance methods on types that are not specifically designed and advertized to be concurrent. On the other hand, Java has common types that do include synchronization e.g. StringBuffer and Vector. As the .Net BCL is largely unsynchronized, lock elision is likely to have little effect.
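For comparison, this is roughly what the second scenario looks like on the Java side: a synchronized type (StringBuffer) used purely as a local. The class and method names are invented for the example, and whether a given JVM actually elides the locks here depends on its JIT and escape analysis; the code only shows a candidate for the optimization.

```java
// Hypothetical example of code where lock elision could apply.
final class ElisionCandidate {
    // The StringBuffer never escapes this method, so no other thread can ever
    // reach it. A JIT with escape analysis may therefore skip the monitor
    // operations inside the synchronized append() and toString() calls.
    static String join(String a, String b) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join("lock", "free")); // prints lockfree
    }
}
```

The result is the same with or without elision; the optimization only affects how much locking overhead is paid along the way.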
Summary
I think overall, there are fewer cases in .Net where lock elision would kick in, because there are simply not as many places where there would be thread-local locks. Locks are far more likely to be used in places which are visible to multiple threads and therefore should not be elided. Also, uncontended locking is extremely quick.
I had difficulty finding any real-world evidence that lock elision actually provides that much of a performance benefit in Java, and the latest docs for at least the Oracle JVM state that elision is not always applied for thread-local locks, hinting that it is not a guaranteed optimization anyway.
I suspect that lock elision is something that is made possible through the introduction of escape analysis in the JVM, but is not as important for performance as EA's ability to analyze whether reference types can be allocated on the stack.
