Is memory allocation on the JVM lockless

Is memory allocation on the JVM lockless - java

When you do a new Object() in Java, does the jvm use a lockless algorithm to allocate memory or does it need to lock?
The JVM I am referring to in this case is the Hotspot VM. From the little I know about it, it just needs to increment a pointer to allocate memory super fast. But in the case of multiple threads, does that increment require locking or CAS?

as mentioned, default is to use a tlab. The behavious is described in this glossary as follows
TLAB
Thread-local allocation buffer. Used to allocate heap space quickly without synchronization. Compiled code has a "fast path" of a few instructions which tries to bump a high-water mark in the current thread's TLAB, successfully allocating an object if the bumped mark falls before a TLAB-specific limit address.
Further details on sizing in this blog & all the details you could want in this blog.
In short it's thread local unless the TLAB is full in which case you'll need to hit the shared pool and this is a CAS operation.
Another complicating factor could be this bug that describes false sharing in card marking which is not a lock as such but will hurt performance (if this is why you're asking about locking). It looks like this is fixed in java7 though.

It depends :) I believe that if you use the -XX:+UseTLAB option (which is the default for Sun/Oracle JVMs as noted by Peter), it will be contention-free in the "happy path" due to thread-local heaps. Of course, if garbage collection is required due to there not being enough space, we get into the territory of parallel GCs etc, where there are various implementations and it's all very complicated... and of course, this moves on all the time.
Even in the "single heap" model, I'd expect the allocation to be highly optimized - not so much acquiring a lock in the normal sense as performing atomic increments where possible. I can't say I know the details though.

Related

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
Something along the lines of http://wwwasd.web.cern.ch/wwwasd/lhc++/Objectivity/V5.2/Java/guide/jgdStorage.fm.html and specifically non-garbage-collectible containers there (non-garbage-collectable?).
The problem is that I have lots of ordinary temporary objects, but I have even bigger (several Gigs) of objects that are stored for Cache purposes. For no reason should the Java GC traverse all those Cache gigabytes trying to find anything to collect, because they contain cached data which have their own timeouts.
This way I could partition my data in a custom way into infinite-lived and normal-lived objects, and hopefully GC would be quite fast because normal objects don't live so long and amount to smaller amounts.
There are some workarounds to this problem, such as Apache DirectMemory and Commercial Terracotta BigMemory(http://terracotta.org/products/bigmemory), but a java-native solution would be nicer (I mean free and probably more reliable?). Also I want to avoid serialization overhead which means it should happen within same jvm. To my understanding DirectMemory and BigMemory operate mainly off heap which means that the objects must be serialized/deserialized to/from memory outside jvm. Simply marking non-gc regions within the jvm would seem a better solution. Using Files for cache is not an option either, it has the same unaffordable serialization/deserialization overhead - use case is a HA server with lots of data used in random (human) order and low latency needed.

Any memory the JVM manages is also garbage-collected by the JVM. And any “live” objects which are directly available to Java methods without deserialization have to live in JVM memory. Therefore in my understanding you cannot have live objects which are immune to garbage collection.
On the other hand, the usage you describe should make the generational approach to garbage collection quite efficient. If your big objects stay around for a while, they will be checked for reclamation less often. So I doubt there is much to be gained from avoiding those checks.

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
No it is not possible.
You can prevent objects from being garbage collected by keeping them reachable, but the GC will still need to trace them to check reachability on each full; GC (at least).
Is simply my assumption, that when the jvm is starving it begins scanning all those unnecessary objects too.
Yes. That is correct. However, unless you've got LOTS of objects that you want to be treated this way, the overhead is likely to be insignificant. (And anyway, a better idea is to give the JVM more memory ... if that is possible.)

Quite simply, for you to be able to do this, the garbage collection algorithm would need to be aware of such a flag, and take it into account when doing its work.
I'm not aware of any of the standard GC algorithms having such a flag, so for this to work you would need to write your own GC algorithm (after deciding on some feasible way to communicate this information to it).
In principle, in fact, you've already started down this track - you're deciding how garbage collection should be done rather than being happy to leaving it to the JVM's GC algo. Is the situation you describe a measurable problem for you; something for which the existing garbage collection is insufficient, but your plan would work? Garbage collectors are extremely well-tuned, so I wouldn't be surprised if the "inefficient" default strategy is actually faster than your naively-optimal one.
(Doing manual memory management is tricky and error-prone at the best of times; managing some memory yourself while using a stock garbage collector to handle the rest seems even worse. I expect you'd run into a lot of edge cases where the GC assumes it "knows" what's happening with the whole heap, which would no longer be true. Steer clear if you can...)

The recommended approaches would be to use either a commerical RTSJ implementation to avoid GC, or to use off heap memory. One could also look into soft references for caches as well (they do get collected).
This is not recommended:
If for some reason you do not believe these options are sufficient, you could look into direct memory access which is UNSAFE (part of sun.misc.Unsafe). You can use the 'theUnsafe' field to get the 'Unsafe' instance. Unsafe allows to allocation/deallocate memory via 'allocateMemory' and 'freeMemory'. This is not under GC control nor limited by JVM heap size. The impact on GC/application, once you go down this route, is not guaranteed - which is why using byte buffers might be the way to go (if you're not using a RTSJ like implementation).
Hope this helps.

Living Java objects will always be part of the GC life cycle. Or said another way, marking an object to be non-gc is the same order of overhead than having your object referenced by a root reference (a static final map for instance).
But thinking a bit further, data put in a cache are most likely to be temporary, and would eventually be evicted. At that point you will start again to like the JVM and the GC.
If you have 100's of GBs of permanent data, you may want to rethink the architecture of your application, and try to shard and distribute your data (horizontally scalability).
Last but not least, lots of work has been done around serialization, and the overhead of serialization (I'm not speaking about the poor reputation of ObjectInputStream and ObjectOutputStream) is not that big.
More than that, if your data is mainly composed of primitive types (including bytes array), there is efficient way to readInt() or readBytes() from off heap buffers (for instannce netty.io's ChannelBuffer). This could be a way to go.

Java SoftReference, panicing GC and GC behavior

I want to write a cache using SoftReferences using as much memory as possible, as long as it doesn't get too inefficient.
Trying to estimate the used size by calculating object sizes or by getting some used memory approximation of the JVM are dead ends.
The javadoc even states that SoftReferences are good for memory-aware caches, but there is no hard rule on how a JVM implementation shall handle SoftReferences. I'm only talking about the Oracle implementation of the JVM (Version 6.22 and above and Version 7).
Now my questions (please feel free to answer partial, grouped or in any way you please):
Does the JVM take the last access of the object into account and only remove the old ones? Javadoc states: Virtual machine implementations are, however, encouraged to bias against clearing recently-created or recently-used soft references.
What happens when memory gets tight? The JVM panics and just eats all objects?
Is there a parameter for telling the JVM to only eat as much to survive (no OOMEs) and live healthy (not having the CPU only run the GC)

I don't think there is an order. (I'm not sure though about the order of events)
But what happens with soft references is that it is always guaranteed that they will be released before there is an out of memory exception. Unless you have a hard reference pointing to them.
But you should be aware that you might try to access them and they are gone. My guess is that the garbage collector will just eat the first soft reference that fits the amount needed for the operation.

Although SoftReferences are a cool feature, I personally don't dare using them in large
projects where I don't know the memory requirements of every other component. Will a memory-hogging SoftReference cache make other parts perform badly?
I'd instead of using SoftReferences I'd consider using EHCache. It let's you limit the size of particular caches in terms of number of entries, or even better, the bytes used in memory (this is a new feature in the upcoming version 2.5). Different eviction strategies can be configured, of course, such as LRU. There's lots you can configure with EHCache.
If you're using Spring, then version 3.1 will also provide you with some nice #Cachable method-level annotations; EHCache can be used as a caching implementation there.

What happens when memory gets tight? The JVM panics and just eats all
objects?
I know for a fact that with Oracle 1.6 JVM this is not the case. I am aware of a situation where a server that processes concurrent requests uses a response the contains the actual data inside a soft reference. I have observed that when a low memory situation is reported by one thread the other threads' soft references continue to hold on to their contents (the referenced objects).
Is there a parameter for telling the JVM to only eat as much to
survive (no OOMEs) and live healthy (not having the CPU only run the
GC)
What is enough to survive? You mean that if X amount of memory is required then only reclaim soft-references till X is available? I didn't find any such tuning parameter but as I said JVM does not seem to be reclaiming all soft references when it needs to reclaim one.

Is object creation a bottleneck in Java in multithreaded environment?

Based on the understanding from the following:
Where is allocated variable reference, in stack or in the heap?
I was wondering since all the objects are created on the common heap. If multiple threads create objects then to prevent data corruption there has to be some serialization that must be happening to prevent the multiple threads from creating objects at same locations. Now, with a large number of threads this serialization would cause a big bottleneck. How does Java avoid this bottleneck? Or am I missing something?
Any help appreciated.

Modern VM implementations reserve for each thread an own area on the heap to create objects in. So, no problem as long as this area does not get full (then the garbage collector moves the surviving objects).
Further read: how TLAB works in Sun's JVM. Azul's VM uses slightly different approach (look at "A new thread & stack layout"), the article shows quite a few tricks JVMs may perform behind the scenes to ensure nowadays Java speed.
The main idea is keeping per thread (non-shared) area to allocate new objects, much like allocating on the stack with C/C++. The copy garbage collection is very quick to deallocate the short-lived objects, the few survivors, if any, are moved into different area. Thus, creating relatively small objects is very fast and lock free.
The lock free allocation is very important, especially since the question regards multithreaded environment. It also allows true lock-free algorithms to exist. Even if an algorithm, itself, is a lock-free but allocation of new objects is synchronized, the entire algorithm is effectively synchronized and ultimately less scalable.
java.util.concurrent.ConcurrentLinkedQueue which is based on the work of Maged M. Michael Michael L. Scott is a classic example.
What happens if an object is referenced by another thread? (due to discussion request)
That object (call it A) will be moved to some "survivor" area. The survivor area is checked less often than the ThreadLocal areas. It contains, like the name suggests, objects whose references managed to escape, or in particular A managed to stay alive. The copy (move) part occurs during some "safe point" (safe point excludes properly JIT'd code), so the garbage collector is sure the object is not being referenced. The references to the object are updated, the necessary memories fences issued and the application (java code) is free to continue. Further read to this simplistic scenario.
To the very interested reader and if possible to chew it: the highly advanced Pauseless GC Algorithm

No. The JVM has all sorts of tricks up its sleeves to avoid any sort of simpleminded serialization at the point of 'new'.

Sometimes. I wrote a recursive method that generates integer permutations and creates objects from those. The multithreaded version (every branch from root = task, but concurrent thread count limited to number of cores) of that method wasn't faster. And the CPU load wasn't higher. The tasks didn't share any object. After I removed the object creation from both methods the multithreaded method was ~4x faster (6 cores) and used 100% CPU. In my test case the methods generated ~4,500,000 permutations, 1500 per task.
I think TLAB didn't work because it's space is limited (see: Thread Local Allocation Buffers).

Does allocation speed depend on the garbage collector being used?

My app is allocating a ton of objects (>1mln per second; most objects are byte arrays of size ~80-100 and strings of the same size) and I think it might be the source of its poor performance.
The app's working set is only tens of megabytes. Profiling the app shows that GC time is negligibly small.
However, I suspect that perhaps the allocation procedure depends on which GC is being used, and some settings might make allocation faster or perhaps make a positive influence on cache hit rate, etc.
Is that so? Or is allocation performance independent on GC settings under the assumption that garbage collection itself takes little time?

Of course your performance depends on the allocator used. But you have profiled GC and saw that it is not much of an issue. Also, one of the strengths of the GC is fast allocation at the expense of slower collection.
I think you are having issues with resulting fragmentation which makes memory access pattern problematic for the cpu, since it may need to invalidate its cache too often. Most GC algorithms doesn't reclaim space in an optimum way.
Since your working set is limited and predictable, you might want to use an object pool which is allocated beforehand. You may also want to use reference counting to avoid much of the manual memory management. Technically it is still GC but not in the common sense of the GC.
Still, I don't think the performance is much affected by how you manage memory but how you actually use, access it. Most likely your profiler has the definite answer.

There are two distinct aspects to object allocation. The first is finding a suitable area of memory - with todays generational garbarge collectors, this is usually very fast (in the order of a few 10ths of machine cycles).
The second is the initialization of the objects you allocate. Since everything you allocate in Java is initialized, the cost for initialization can easily outweight the cost of allocation (except for the most simple, smallest objects). There is more. Since initialization requires writing the entire memory area the new object occupies (if you allocate a "new byte[1<<20]" for example, the entire megabyte needs to be set to zeros), this also usually pulls that memory into the cpu's cache, evicting other, older cache lines (which may or may not belong to your current "hot" working set).
If you do comparatively little processing on each of your arrays, those effects can severly affect the performance of your code. This can be partially avoided by re-using the same arrays over and over, but it usually makes the program logic more complex. It is also often not easy to determine if cache trashing is really the culprit. Its impossible to say from what little information is given in your question.

Does your VM try to pool strings? I had heard once, that IBM's VM did something like string interning but dynamically (no idea if its true) perhaps your VM is trying doing extra work to build an internal data structure of String internals.
Are you doing something like byte b[] = new byte[100]; String s = new String(b); by any chance? You might try not to allocate the String objects, and instead allocate some random object which has a reference to the byte[] (for comparison).

What does flushing thread local memory to global memory mean?

I am aware that the purpose of volatile variables in Java is that writes to such variables are immediately visible to other threads. I am also aware that one of the effects of a synchronized block is to flush thread-local memory to global memory.
I have never fully understood the references to 'thread-local' memory in this context. I understand that data which only exists on the stack is thread-local, but when talking about objects on the heap my understanding becomes hazy.
I was hoping that to get comments on the following points:
When executing on a machine with multiple processors, does flushing thread-local memory simply refer to the flushing of the CPU cache into RAM?
When executing on a uniprocessor machine, does this mean anything at all?
If it is possible for the heap to have the same variable at two different memory locations (each accessed by a different thread), under what circumstances would this arise? What implications does this have to garbage collection? How aggressively do VMs do this kind of thing?
(EDIT: adding question 4) What data is flushed when exiting a synchronized block? Is it everything that the thread has locally? Is it only writes that were made inside the synchronized block?
Object x = goGetXFromHeap(); // x.f is 1 here
Object y = goGetYFromHeap(); // y.f is 11 here
Object z = goGetZFromHead(); // z.f is 111 here
y.f = 12;
synchronized(x)
{
x.f = 2;
z.f = 112;
}
// will only x be flushed on exit of the block?
// will the update to y get flushed?
// will the update to z get flushed?
Overall, I think am trying to understand whether thread-local means memory that is physically accessible by only one CPU or if there is logical thread-local heap partitioning done by the VM?
Any links to presentations or documentation would be immensely helpful. I have spent time researching this, and although I have found lots of nice literature, I haven't been able to satisfy my curiosity regarding the different situations & definitions of thread-local memory.
Thanks very much.

The flush you are talking about is known as a "memory barrier". It means that the CPU makes sure that what it sees of the RAM is also viewable from other CPU/cores. It implies two things:
The JIT compiler flushes the CPU registers. Normally, the code may kept a copy of some globally visible data (e.g. instance field contents) in CPU registers. Registers cannot be seen from other threads. Thus, half the work of synchronized is to make sure that no such cache is maintained.
The synchronized implementation also performs a memory barrier to make sure that all the changes to RAM from the current core are propagated to main RAM (or that at least all other cores are aware that this core has the latest values -- cache coherency protocols can be quite complex).
The second job is trivial on uniprocessor systems (I mean, systems with a single CPU which has as single core) but uniprocessor systems tend to become rarer nowadays.
As for thread-local heaps, this can theoretically be done, but it is usually not worth the effort because nothing tells what parts of the memory are to be flushed with a synchronized. This is a limitation of the threads-with-shared-memory model: all memory is supposed to be shared. At the first encountered synchronized, the JVM should then flush all its "thread-local heap objects" to the main RAM.
Yet recent JVM from Sun can perform an "escape analysis" in which a JVM succeeds in proving that some instances never become visible from other threads. This is typical of, for instance, StringBuilder instances created by javac to handle concatenation of strings. If the instance is never passed as parameter to other methods then it does not become "globally visible". This makes it eligible for a thread-local heap allocation, or even, under the right circumstances, for stack-based allocation. Note that in this situation there is no duplication; the instance is not in "two places at the same time". It is only that the JVM can keep the instance in a private place which does not incur the cost of a memory barrier.

It is really an implementation detail if the current content of the memory of an object that is not synchronized is visible to another thread.
Certainly, there are limits, in that all memory is not kept in duplicate, and not all instructions are reordered, but the point is that the underlying JVM has the option if it finds it to be a more optimized way to do that.
The thing is that the heap is really "properly" stored in main memory, but accessing main memory is slow compared to access the CPU's cache or keeping the value in a register inside the CPU. By requiring that the value be written out to memory (which is what synchronization does, at least when the lock is released) it forcing the write to main memory. If the JVM is free to ignore that, it can gain performance.
In terms of what will happen on a one CPU system, multiple threads could still keep values in a cache or register, even while executing another thread. There is no guarantee that there is any scenario where a value is visible to another thread without synchronization, although it is obviously more likely. Outside of mobile devices, of course, the single-CPU is going the way of floppy disks, so this is not going to be a very relevant consideration for long.
For more reading, I recommend Java Concurrency in Practice. It is really a great practical book on the subject.

It's not as simple as CPU-Cache-RAM. That's all wrapped up in the JVM and the JIT and they add their own behaviors.
Take a look at The "Double-Checked Locking is Broken" Declaration. It's a treatise on why double-checked locking doesn't work, but it also explains some of the nuances of Java's memory model.

One excellent document for highlighting the kinds of problems involved, is the PDF from the JavaOne 2009 Technical Session
This Is Not Your Father's Von Neumann Machine: How Modern Architecture Impacts Your Java Apps
By Cliff Click, Azul Systems; Brian Goetz, Sun Microsystems, Inc.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.