Java instruction reordering and CPU memory reordering - java

This is a follow-up question to
How to demonstrate Java instruction reordering problems?
There are many articles and blogs referring to Java and JVM instruction reordering, which may lead to counter-intuitive results in user code.
When I asked for a demonstration of Java instruction reordering causing unexpected results, several comments were made to the effect that a more general area of concern is memory reordering, and that it would be difficult to demonstrate on an x86 CPU.
Is instruction reordering just a part of a bigger issue of memory reordering, compiler optimizations and memory models? Are these issues really unique to the Java compiler and the JVM? Are they specific to certain CPU types?

Memory reordering is possible without compile-time reordering of operations in source vs. asm. The order of memory operations (loads and stores) to coherent shared cache (i.e. memory) done by a CPU running a thread is also separate from the order it executes those instructions in.
Executing a load is accessing cache (or the store buffer), but "executing" a store in a modern CPU is separate from its value actually being visible to other cores (commit from store buffer to L1d cache). Executing a store is really just writing the address and data into the store buffer; commit isn't allowed until after the store has retired, thus is known to be non-speculative, i.e. definitely happening.
Describing memory reordering as "instruction reordering" is misleading. You can get memory reordering even on a CPU that does in-order execution of asm instructions (as long as it has some mechanisms to find memory-level parallelism and let memory operations complete out of order in some ways), even if asm instruction order matches source order. Thus that term wrongly implies that merely having plain load and store instructions in the right order (in asm) would be useful for anything related to memory order; it isn't, at least on non-x86 CPUs. It's also weird because instructions have effects on registers (at least loads do, and on some ISAs with post-increment addressing modes, stores can, too).
It's convenient to talk about something like StoreLoad reordering as x = 1 "happening" after a tmp = y load, but the thing to talk about is when the effects happen (for loads) or are visible to other cores (for stores) in relation to other operations by this thread. But when writing Java or C++ source code, it makes little sense to care whether that happened at compile time or run-time, or how that source turned into one or more instructions. Also, Java source doesn't have instructions, it has statements.
Perhaps the term could make sense to describe compile-time reordering between bytecode instructions in a .class vs. the JIT-compiled native machine code, but if so then it's a misuse to apply it to memory reordering in general, rather than only to compile/JIT-time reordering excluding run-time reordering. Highlighting just compile-time reordering isn't super helpful, unless you have signal handlers (like POSIX) or an equivalent that runs asynchronously in the context of an existing thread.
This effect is not unique to Java at all. (Although I hope this weird use of "instruction reordering" terminology is!) It's very much the same as C++ (and I think C# and Rust for example, probably most other languages that want to normally compile efficiently, and require special stuff in the source to specify when you want your memory operations ordered wrt. each other, and promptly visible to other threads). https://preshing.com/20120625/memory-ordering-at-compile-time/
C++ defines even less than Java about access to non-atomic<> variables without synchronization: unless you ensure that there's never a write in parallel with anything else, it's undefined behaviour (see footnote 1).
And memory reordering is present even in assembly language, where by definition there's no reordering between source and machine code. All SMP CPUs except a few ancient ones like 80386 also do memory reordering at run-time, so lack of instruction reordering doesn't gain you anything, especially on machines with a "weak" memory model (most modern CPUs other than x86): https://preshing.com/20120930/weak-vs-strong-memory-models/ - x86 is "strongly ordered", but not SC: it's program order plus a store buffer with store forwarding. So if you want to actually demo the breakage from insufficient ordering in Java on x86, it's either going to be compile-time reordering, or lack of sequential consistency via StoreLoad reordering or store-buffer effects. Other unsafe code, like the accepted answer on your previous question, might happen to work on x86 but will fail on weakly-ordered CPUs like ARM.
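As a concrete illustration, here is a minimal sketch of the classic store-buffer litmus test in plain Java (class and field names invented for this example). The outcome r1 == 0 && r2 == 0 is forbidden under sequential consistency, yet with plain int fields it can occasionally be observed even on x86; thread-startup overhead makes the window tiny, so a harness like jcstress is the practical tool for exercising it:

    public class StoreLoadLitmus {
        static int x, y;    // plain fields: StoreLoad reordering is allowed
        static int r1, r2;

        public static void main(String[] args) throws InterruptedException {
            for (long trial = 1; ; trial++) {
                x = 0; y = 0;
                Thread t1 = new Thread(() -> { x = 1; r1 = y; });
                Thread t2 = new Thread(() -> { y = 1; r2 = x; });
                t1.start(); t2.start();
                t1.join(); t2.join();
                if (r1 == 0 && r2 == 0) {  // impossible under sequential consistency
                    System.out.println("StoreLoad reordering observed on trial " + trial);
                    break;
                }
            }
        }
    }

Making x and y volatile forbids that outcome, at the cost of a full barrier on each store.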
(Fun fact: modern x86 CPUs aggressively execute loads out of order, but check to make sure they were "allowed" to do that according to x86's strongly-ordered memory model, i.e. that the cache line they loaded from is still readable, otherwise roll back the CPU state to before that: machine_clears.memory_ordering perf event. So they maintain the illusion of obeying the strong x86 memory-ordering rules. Other ISAs have weaker orders and can just aggressively execute loads out of order without later checks.)
Some CPU memory models even allow different threads to disagree about the order of stores done by two other threads. The C++ memory model allows that, too, so extra barriers on PowerPC are only needed for sequential consistency (atomic with memory_order_seq_cst, like Java volatile), not for acquire/release or weaker orders.
Related:
How does memory reordering help processors and compilers?
How is load->store reordering possible with in-order commit? - memory reordering on in-order CPUs via other effects, like scoreboarding loads with a cache that can do hit-under-miss, and/or out-of-order commit from the store buffer, on weakly-ordered ISAs that allow this. (Also LoadStore reordering on OoO exec CPUs that still retire instructions in order, which is actually more surprising than on in-order CPUs which have special mechanisms to allow memory-level parallelism for loads, that OoO exec could replace.)
Are memory barriers needed because of cpu out of order execution or because of cache consistency problem? (basically a duplicate of this; I didn't say much there that's not here)
Are loads and stores the only instructions that gets reordered? (at runtime)
Does an x86 CPU reorder instructions? (yes)
Can a speculatively executed CPU branch contain opcodes that access RAM? - store execution order isn't even relevant for memory ordering between threads, only commit order from the store buffer to L1d cache. A store buffer is essential to decouple speculative exec (including of store instructions) from anything that's visible to other cores. (And from cache misses on those stores.)
Why is integer assignment on a naturally aligned variable atomic on x86? - true in asm, but not safe in C/C++; you need std::atomic<int> with memory_order_relaxed to get the same asm but in portably-safe way.
Globally Invisible load instructions - where does load data come from: store forwarding is possible, so it's more accurate to say x86's memory model is "program order + a store buffer with store forwarding" than to say "only StoreLoad reordering", if you ever care about this core reloading its own recent stores.
Why memory reordering is not a problem on single core/processor machines? - just like the as-if rule for compilers, out-of-order exec (and other effects) have to preserve the illusion (within one core and thus thread) of instructions fully executing one at a time, in program order, with no overlap of their effects. This is basically the cardinal rule of CPU architecture.
LWN: Who's afraid of a big bad optimizing compiler? - surprising things compilers can do to C code that uses plain (non-volatile / non-_Atomic) accesses. This is mostly relevant for the Linux kernel, which rolls its own atomics with inline asm for some things like barriers, but also uses just C volatile for pure loads / pure stores (which is very different from Java volatile; see footnote 2).
Footnote 1: C++ UB means not just an unpredictable value loaded, but that the ISO C++ standard has nothing to say about what can/can't happen in the whole program at any time before or after UB is encountered. In practice for memory ordering, the consequences are often predictable (for experts who are used to looking at compiler-generated asm) depending on the target machine and optimization level, e.g. hoisting loads out of loops breaking spin-wait loops that fail to use atomic. But of course you're totally at the mercy of whatever the compiler happens to do when your program contains UB, not at all something you can rely on.
Caches are coherent, despite common misconceptions
However, all real-world systems that Java or C++ run multiple threads across do have coherent caches; seeing stale data indefinitely in a loop is a result of compilers keeping values in registers (which are thread-private), not of CPU caches not being visible to each other. This is what makes C++ volatile work in practice for multithreading (but don't actually do that because C++11 std::atomic made it obsolete).
Effects like never seeing a flag variable change are due to compilers optimizing global variables into registers, not instruction reordering or CPU caching. You could say the compiler is "caching" a value in a register, but you can choose other wording that's less likely to confuse people who don't already understand thread-private registers vs. coherent caches.
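That register-"caching" effect is easy to reproduce. A minimal sketch (hypothetical class name) that typically hangs on HotSpot, because the JIT hoists the load of the plain boolean out of the loop:

    class StaleFlag {
        static boolean stop;  // plain field: the JIT may hoist the load out of the loop

        public static void main(String[] args) throws InterruptedException {
            Thread spinner = new Thread(() -> { while (!stop) { } });
            spinner.start();
            Thread.sleep(100);   // give the JIT time to compile the spin loop
            stop = true;         // the store is coherent, but the spinner may never reload
            spinner.join();      // frequently hangs forever
        }
    }

Declaring stop volatile forces a fresh load on each iteration; the CPU caches were never the problem.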
Footnote 2: When comparing Java and C++, also note that C++ volatile doesn't guarantee anything about memory ordering, and in fact in ISO C++ it's undefined behaviour for multiple threads to be writing the same object at the same time even with volatile. Use std::memory_order_relaxed if you want inter-thread visibility without ordering wrt. surrounding code.
(Java volatile is like C++ std::atomic<T> with the default std::memory_order_seq_cst, and classic Java provided no way to relax that to do more efficient atomic stores, even though most algorithms only need acquire/release semantics for their pure loads and pure stores, which x86 can do for free; since Java 9, VarHandle access modes fill that gap, as sketched below. Draining the store buffer for sequential consistency costs extra. Not much compared to inter-thread latency, but significant for per-thread throughput, and a big deal if the same thread is doing a bunch of stuff to the same data without contention from other threads.)
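A sketch of release/acquire publication via java.lang.invoke.VarHandle (Java 9+), with class and field names invented for illustration:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class Box {
        int data;
        int flag;  // accessed via the VarHandle below instead of being declared volatile

        static final VarHandle FLAG;
        static {
            try {
                FLAG = MethodHandles.lookup().findVarHandle(Box.class, "flag", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void publish() {
            data = 42;
            FLAG.setRelease(this, 1);  // release store: a plain mov on x86, no fence needed
        }

        int consume() {
            return ((int) FLAG.getAcquire(this) == 1) ? data : -1;  // acquire load
        }
    }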

Related

Why define the Java memory model?

Java's multithreaded code is ultimately mapped to operating-system threads for execution.
Is the operating system thread not thread safe?
So why do we need the Java memory model to ensure thread safety?
I hope someone can answer this question; I have looked up a lot of information on the Internet and still do not understand!
The material on the web is all about atomicity, visibility, and ordering, using cache coherence as an example, but I don't think it really answers the question.
Thank you very much!
The operating system thread is not thread safe (that statement does not make a lot of sense, but basically, the operating system does not ensure that the intended atomicity of your code is respected).
The problem is that whether two data items are related and therefore need to be synchronized is only really understood by your application.
For example, imagine you are defining a ListOfIntegers class which contains an int array and count of the number of items used in the array. These two data items are related and the way they are updated needs to be co-ordinated in order to ensure that if the object is accessed by two different threads they are always updated in a consistent manner, even if the threads update them simultaneously. Only your application knows how these data items are related. The operating system doesn't know. They are just two pieces of memory as far as it is concerned. That is why you have to implement the thread safety (by using synchronized or carefully arranging how the fields are updated).
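A minimal sketch of that example (array growth and most error handling omitted for brevity):

    class ListOfIntegers {
        private final int[] items = new int[16];  // the related pair: the array...
        private int count;                        // ...and how many slots are in use

        synchronized void add(int value) {
            items[count] = value;  // write the slot first,
            count++;               // then publish it; synchronized keeps the pair consistent
        }

        synchronized int get(int index) {
            if (index >= count) throw new IndexOutOfBoundsException();
            return items[index];
        }
    }

Without the synchronized, another thread could observe the incremented count before the slot's contents, because nothing in the hardware or the OS knows these two fields belong together.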
The Java "memory model" is pretty close to the hardware model. There is a stack for primitives and objects are allocated on the heap. Synchronization is provided to allow the programmer to lock access to shared data on the heap. In addition, there are rules that the optimizer must follow so that the opimisations don't defeat the synchronisations put in place.
Every programing language that takes concurrency seriously needs a memory model - and here is why.
The memory model is the crux of the concurrency semantics of shared-memory systems. It defines the possible values that a read operation is allowed to return for any given set of write operations performed by a concurrent program, thereby defining the basic semantics of shared variables. In other words, the memory model specifies the set of allowed outputs of a program's read and write operations, and constrains an implementation to produce only (but at least one) such allowed executions. The memory model may and often does allow executions where the outcome cannot be inferred from the order in which read and write operations occur in the program. It is impossible to meaningfully reason about a program or any part of the programming language implementation without an unambiguous memory model. The memory model defines the possible outcomes of a concurrent program's read and write operations. Conversely, the memory model also defines which instruction reorderings may be permitted, either by the processor, the memory system, or the compiler.
This is an excerpt from the paper Memory Models for C/C++ Programmers which I have co-authored. Even though a large part of it is dedicated to the C++ memory model, it also covers more general areas -
starting with the reason why we need a memory model in the first place, explaining the (intuitive) sequentially consistent model, and finally the weaker memory models provided by current hardware in x86 and ARM/POWER CPUs.
The Java Memory Model answers the following question: What shall happen when multiple threads modify the same memory location.
And the answer the memory model gives is:
If a program has no data races, then all executions of the program will appear to be sequentially consistent.
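In code, that guarantee looks like this (a sketch, names invented). Making ready volatile removes the data race, so only sequentially consistent interleavings remain; with a plain ready the program has a data race, and the JMM would allow the reader to print 0:

    class Publisher {
        static int data;
        static volatile boolean ready;  // delete volatile and this program has a data race

        static void writer() {
            data = 42;
            ready = true;   // the volatile store publishes data along with the flag
        }

        static void reader() {
            if (ready) {
                System.out.println(data);  // data-race-free, so this must print 42
            }
        }
    }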
There is a great paper by Sarita V. Adve, Hans-J. Boehm about why the Java and C++ Memory Model are designed the way they are: Memory Models: A Case For Rethinking Parallel Languages and Hardware
From the paper:
We have been repeatedly surprised at how difficult it is to formalize the seemingly simple and fundamental property of “what value a read should return in a multithreaded program.”
Memory models, which describe the semantics of shared variables, are crucial to both correct multithreaded applications and the entire underlying implementation stack. It is difficult to teach multithreaded programming without clarity on memory models.
After much prior confusion, major programming languages are converging on a model that guarantees simple interleaving-based semantics for "data-race-free" programs and most hardware vendors have committed to support this model.

Why are shared variables cached in CPU caches?

I'm trying to understand the Java Memory Model but have been failing to get a point regarding CPU caches.
As far as I know it, in JVM we have the following locations to store local and shared variables:
local variables -- on thread stack
shared variables -- in memory, but every CPU cache has a copy of it
So my question is: why store local variables on the stack, and (cache) shared variables in the CPU cache? Why not the other way around (supposing that the CPU cache is too expensive to store both): cache local variables in CPU caches and just fetch shared variables from memory? Is this part of the Java language design or of the computer architecture?
Further: as simple as "CPU cache" sounds, what if several CPUs share one cache? And in systems with multi-level caches, which level of cache will the copy of shared variables be stored in? Further, if more than 1 threads are running in the same CPU core, does it mean that they are sharing the same set of cached shared-variables, and hence even if the shared variable is not defined volatile, accesses of the variable is still instantly visible to the other threads running on the same CPU?
"Local" and "shared" variables are meaningless outside the context of your code. They don't influence where or even if the state is cached. It's not even useful to think or reason in terms of where your state is stored; the entire reason the JMM exists is so that details like these, which vary from architecture to architecture are not exposed to the programmer. By relying on low-level hardware details, you are asking the wrong questions about the JMM. It's not useful to your application, it makes it fragile, easier to break, harder to reason with, and less portable.
That said, in general, you should assume that any program state, if not all of it, is eligible to be cached. The fact is that what is cached does not actually matter, just that anything and everything can be, whether primitive types or reference types, or even state encapsulated by several fields. Whatever instructions a thread runs (and those instructions vary by architecture too - beware!), it is ultimately the CPU that determines what is relevant to cache and what is not; it is impossible for programmers to control this themselves (although it is possible to influence where state variables may be cached; see false sharing, sketched below).
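A hedged sketch of that influence, using the classic manual-padding trick to avoid false sharing (the JVM is free to reorder fields, so this is best-effort; JDK 8+ also has an internal @Contended annotation for the same purpose):

    class PaddedCounters {
        volatile long a;                  // hot field updated by thread 1
        long p1, p2, p3, p4, p5, p6, p7;  // ~56 bytes of padding between the hot fields
        volatile long b;                  // updated by thread 2: ideally a different cache line
    }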
Again, we can also make some more generalizations about x86: active primitive values are probably kept in registers, because the processor's ALUs can work with them the fastest. Anything else goes, though. It's possible for primitives to be moved to the L1/L2 caches if they are core-local, and it's certainly possible that they would be overwritten quite quickly. The CPU might put state variables in a shared L3 if it anticipates a context switch in the future, or it might not; a hardware expert would need to speak to that.
Ideally, state variables will be stored in the closest storage to the processing unit (registers, then L1/2/3, then main memory), but that's up to the CPU to decide. It is impossible to reason about cache semantics at the Java level. Even with hyper-threading enabled (AMD's equivalent is SMT), the sibling hardware threads give you no visibility guarantees; and even if they did, recall that visibility is not the only problem associated with shared state variables: when the processor pipelines instructions, you still need the appropriate instructions to ensure correct ordering (even after you get rid of read/write buffering on the CPU), whether that's hwsync or the appropriate fences or others.
Again, reasoning about the properties of the cache is not useful, both because the JMM handles that for you and because where/when/what is cached is indeterminable. Further, even if you could answer the where/when/what questions, you STILL cannot reason about data visibility; all caches treat cached data in the same way anyway, and you would need to rely on the processor moving lines between the ME(O)SI states, on instruction ordering, load/store buffering, write-back/through policies, etc. And you still haven't dealt with problems that can occur at the OS and JVM level. Luckily, the JDK gives you basic tools such as volatile, final, and the atomics, which work consistently across all platforms and produce code that is predictable and easier to reason about.

cost of volatile read in java when write are infrequent [duplicate]

I know that writing to a volatile variable flushes it from the memory of all the CPUs; however, I want to know whether reads of a volatile variable are as fast as normal reads.
Can volatile variables ever be placed in the CPU cache, or are they always fetched from main memory?
You should really check out this article: http://brooker.co.za/blog/2012/09/10/volatile.html. The blog article argues that volatile reads can be a lot slower than non-volatile reads, even on x86.
Test 1 is a parallel read and write to a non-volatile variable. There is no visibility mechanism and the results of the reads are potentially stale.
Test 2 is a parallel read and write to a volatile variable. This does not address the OP's question specifically, but it is worth noting that a contended volatile can be very slow.
Test 3 is a read of a volatile in a tight loop. The semantics of volatile mean the value can change with each loop iteration, so the JVM cannot optimize the read by hoisting it out of the loop. In Test 1, it is likely the value was read and stored once, so no actual repeated "read" was occurring.
Credit to Marc Brooker for running these tests.
The answer is somewhat architecture dependent. On an x86, there is no additional overhead associated with volatile reads specifically, though there are implications for other optimizations.
JMM cookbook from Doug Lea, see architecture table near the bottom.
To clarify: there is not any additional overhead associated with the read itself. Memory barriers are used to ensure proper ordering. JSR-133 classifies four barrier types: LoadLoad, LoadStore, StoreLoad, and StoreStore. Depending on the architecture, some of these barriers correspond to a no-op, meaning no action is taken; others require a fence. There is no implicit cost associated with the load itself, though one may be incurred if a fence is in place. In the case of x86, only a StoreLoad barrier results in a fence.
As pointed out in a blog post, the fact that the variable is volatile means there are assumptions about the nature of the variable that can no longer be made and some compiler optimizations would not be applied to a volatile.
Volatile is not something that should be used glibly, but it should also not be feared. There are plenty of cases where a volatile will suffice in place of more heavy handed locking.
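If you want numbers for your own hardware rather than folklore, a minimal JMH sketch (assuming the org.openjdk.jmh dependency; class and method names invented):

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class VolatileReadBench {
        int plain = 1;
        volatile int vol = 1;

        @Benchmark
        public int plainRead() { return plain; }   // eligible for the usual optimizations

        @Benchmark
        public int volatileRead() { return vol; }  // must perform a real load on every call
    }

On x86 the two typically measure nearly identically in isolation; the difference appears when volatile defeats an optimization such as hoisting a read out of a loop.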
It is architecture dependent. What volatile does is tell the compiler not to optimise accesses to that variable away. It forces most operations to treat the variable's state as unknown. Because it is volatile, it could be changed by another thread or some other hardware operation. So reads need to re-read the variable, and updates are of the read-modify-write kind.
This kind of variable is used for device drivers and also for synchronisation with in-memory mutexes/semaphores.
Volatile reads cannot be as quick as plain reads, especially on multi-core CPUs (but single-core machines are affected too).
The executing core has to make sure it observes the current value rather than one stashed in a register; thanks to cache coherence, though, the value can still live in the CPU caches.
As opposed to one other answer here, volatile variables are not used just for device drivers! They are sometimes essential for writing high performance multi-threaded code!
volatile implies that the compiler cannot optimize the variable by placing its value in a CPU register. It must be accessed from main memory. It may, however, be placed in a CPU cache; the cache will guarantee consistency with any other CPUs/cores in the system. If the memory is mapped to IO, then things are a little more complicated. If it was designed as such, the hardware will prevent that address space from being cached and all accesses to that memory will go to the hardware. If there isn't such a design, the hardware designers may require extra CPU instructions to ensure that the read/write goes through the caches, etc.
Typically, the 'volatile' keyword is only used for device drivers in operating systems.

Unsafe compareAndSwapInt vs synchronize

I found that almost all of the high-level synchronization abstractions (like Semaphore, CountDownLatch, and Exchanger from java.util.concurrent) and the concurrent collections use methods from Unsafe (like compareAndSwapInt) to define their critical sections, whereas I expected that a synchronized block or method would be used for this purpose.
Could you explain whether the Unsafe methods (I mean only the methods that atomically set a value) are more efficient than synchronized, and why that is so?
Using synchronised is more efficient if you expect to be waiting a long time (e.g. milliseconds), as the thread can fall asleep and release the CPU to do other work.
Using compareAndSwap is more efficient if you expect the operation to happen quite quickly: it is a single machine-code instruction and takes as little as 10 ns. However, if a resource is heavily contended, this instruction must busy-wait, and if it cannot obtain the value it needs, it can consume the CPU busily until it does.
If you use off heap memory, you can control the layout of the data being shared and avoid false sharing (where the same cache line is being updated by more than one CPU). This is important when you have multiple values you might want to update independently. e.g. for a ring buffer.
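A sketch of that fast path using the public API (AtomicInteger is built, in JDKs of that era, on the same Unsafe.compareAndSwapInt the question mentions):

    import java.util.concurrent.atomic.AtomicInteger;

    class CasCounter {
        private final AtomicInteger value = new AtomicInteger();

        int increment() {
            for (;;) {
                int current = value.get();
                if (value.compareAndSet(current, current + 1)) {  // one LOCK CMPXCHG on x86
                    return current + 1;
                }
                // CAS failed: another thread got there first; re-read and retry (busy wait)
            }
        }
    }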
Note that the internal implementation of a typical JVM (e.g., HotSpot) will often use the compare-and-swap hardware instruction as part of the synchronized implementation if such an instruction is available (e.g., x86), while the other common alternative is LL/SC (e.g., POWER, ARM). A typical strategy is for the fast path to use compare-and-swap (or equivalent) to attempt to obtain the lock if it is free, followed possibly by a short spin-loop, and finally, if that fails, falling back to an OS-level blocking primitive (e.g., futex, Events). The details go far beyond this and include techniques such as biased locking, and are ultimately implementation dependent.
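A drastically simplified sketch of that strategy: CAS fast path, brief spin, then yielding the CPU. Real JVM locks park the thread on an OS primitive instead of yielding, and add optimizations such as biased locking:

    import java.util.concurrent.atomic.AtomicBoolean;

    class SpinThenYieldLock {
        private final AtomicBoolean locked = new AtomicBoolean();

        void lock() {
            for (int spins = 0; !locked.compareAndSet(false, true); spins++) {
                if (spins > 100) Thread.yield();  // slow path: stop burning the CPU
            }
        }

        void unlock() {
            locked.set(false);  // a volatile-style store releases the lock
        }
    }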
The answers above don't give the full picture. The approach is: a mutex (synchronization) is not necessary, because a single operation does all the work that would otherwise happen inside the mutex, and that single operation cannot be interrupted. But that is only half the answer, because in a multicore system another CPU can write to the same memory location. For that reason the compareAndSwap machine instruction must be atomic with respect to every other core, not just the local cache, which makes it somewhat more expensive than an ordinary cached access. The compareAndSwap operation checks whether the memory content has changed compared with the previously read value, and only then stores the new value.
Effectively, a compareAndSwap access is faster than a lock/unlock pair. But it can only be used if exactly one memory location has to be changed by the access. If more than one memory location must be changed together (they should always be consistent with each other), compareAndSwap CANNOT be used; only synchronized can. As noted in the answer above, compareAndSwap is often used to implement the synchronized operation itself: acquiring the mutex and releasing it each need exactly one atomic instruction inside the task scheduler. Atomic access is thus the base of it all. And between synchronized { ... }, the scheduler knows that the guarded region must be respected across thread switches.
This approach is valid not only for Java; it is just as important and usable in C/C++ (and possibly other languages).

What does flushing thread local memory to global memory mean?

I am aware that the purpose of volatile variables in Java is that writes to such variables are immediately visible to other threads. I am also aware that one of the effects of a synchronized block is to flush thread-local memory to global memory.
I have never fully understood the references to 'thread-local' memory in this context. I understand that data which only exists on the stack is thread-local, but when talking about objects on the heap my understanding becomes hazy.
I was hoping that to get comments on the following points:
When executing on a machine with multiple processors, does flushing thread-local memory simply refer to the flushing of the CPU cache into RAM?
When executing on a uniprocessor machine, does this mean anything at all?
If it is possible for the heap to have the same variable at two different memory locations (each accessed by a different thread), under what circumstances would this arise? What implications does this have to garbage collection? How aggressively do VMs do this kind of thing?
(EDIT: adding question 4) What data is flushed when exiting a synchronized block? Is it everything that the thread has locally? Is it only writes that were made inside the synchronized block?
    Object x = goGetXFromHeap(); // x.f is 1 here
    Object y = goGetYFromHeap(); // y.f is 11 here
    Object z = goGetZFromHeap(); // z.f is 111 here

    y.f = 12;

    synchronized (x) {
        x.f = 2;
        z.f = 112;
    }

    // Will only x be flushed on exit of the block?
    // Will the update to y get flushed?
    // Will the update to z get flushed?
Overall, I think I am trying to understand whether thread-local means memory that is physically accessible by only one CPU, or whether there is some logical thread-local heap partitioning done by the VM?
Any links to presentations or documentation would be immensely helpful. I have spent time researching this, and although I have found lots of nice literature, I haven't been able to satisfy my curiosity regarding the different situations & definitions of thread-local memory.
Thanks very much.
The flush you are talking about is known as a "memory barrier". It means that the CPU makes sure that what it sees of the RAM is also viewable from other CPUs/cores. It implies two things:
The JIT compiler flushes the CPU registers. Normally, the code may keep a copy of some globally visible data (e.g. instance field contents) in CPU registers. Registers cannot be seen from other threads. Thus, half the work of synchronized is to make sure that no such cache is maintained.
The synchronized implementation also performs a memory barrier to make sure that all the changes to RAM from the current core are propagated to main RAM (or that at least all other cores are aware that this core has the latest values -- cache coherency protocols can be quite complex).
The second job is trivial on uniprocessor systems (I mean, systems with a single CPU which has a single core), but uniprocessor systems tend to become rarer nowadays.
As for thread-local heaps, this can theoretically be done, but it is usually not worth the effort because nothing says which parts of memory would need to be flushed with a synchronized. This is a limitation of the threads-with-shared-memory model: all memory is supposed to be shared. At the first encountered synchronized, the JVM would then have to flush all its "thread-local heap objects" to the main RAM.
Yet recent JVMs from Sun can perform an "escape analysis", in which the JVM succeeds in proving that some instances never become visible from other threads. This is typical of, for instance, the StringBuilder instances created by javac to handle concatenation of strings. If the instance is never passed as a parameter to other methods, then it does not become "globally visible". This makes it eligible for thread-local heap allocation, or even, under the right circumstances, for stack-based allocation. Note that in this situation there is no duplication; the instance is not in "two places at the same time". It is only that the JVM can keep the instance in a private place which does not incur the cost of a memory barrier.
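The StringBuilder case in miniature (a sketch; whether the JIT actually applies escape analysis here depends on the JVM and on flags such as -XX:+DoEscapeAnalysis):

    static String greet(String name) {
        StringBuilder sb = new StringBuilder();  // never escapes this method
        sb.append("Hello, ").append(name);
        return sb.toString();  // only the resulting String ever becomes globally visible
    }

Because sb can never be seen by another thread, the JVM may keep it out of the shared heap entirely, with no memory-barrier cost.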
It is really an implementation detail whether the current content of the memory of an object that is not synchronized is visible to another thread.
Certainly, there are limits, in that all memory is not kept in duplicate, and not all instructions are reordered, but the point is that the underlying JVM has the option if it finds that to be a more optimized way to do things.
The thing is that the heap is really "properly" stored in main memory, but accessing main memory is slow compared to accessing the CPU's cache or keeping the value in a register inside the CPU. By requiring that the value be written out to memory (which is what synchronization does, at least when the lock is released), it forces the write to main memory. If the JVM is free to ignore that, it can gain performance.
In terms of what happens on a one-CPU system, multiple threads could still keep values in a cache or register, even across the execution of another thread. There is no guarantee that a value will be visible to another thread without synchronization, although it is obviously more likely. Outside of mobile devices, of course, the single-CPU machine is going the way of the floppy disk, so this is not going to be a very relevant consideration for long.
For more reading, I recommend Java Concurrency in Practice. It is really a great practical book on the subject.
It's not as simple as CPU-Cache-RAM. That's all wrapped up in the JVM and the JIT and they add their own behaviors.
Take a look at The "Double-Checked Locking is Broken" Declaration. It's a treatise on why double-checked locking doesn't work, but it also explains some of the nuances of Java's memory model.
One excellent document for highlighting the kinds of problems involved is the PDF from the JavaOne 2009 Technical Session
This Is Not Your Father's Von Neumann Machine: How Modern Architecture Impacts Your Java Apps
By Cliff Click, Azul Systems; Brian Goetz, Sun Microsystems, Inc.
