My app is allocating a ton of objects (>1mln per second; most objects are byte arrays of size ~80-100 and strings of the same size) and I think it might be the source of its poor performance.
The app's working set is only tens of megabytes. Profiling the app shows that GC time is negligibly small.
However, I suspect that perhaps the allocation procedure depends on which GC is being used, and some settings might make allocation faster or perhaps make a positive influence on cache hit rate, etc.
Is that so? Or is allocation performance independent on GC settings under the assumption that garbage collection itself takes little time?
Of course your performance depends on the allocator used. But you have profiled GC and saw that it is not much of an issue. Also, one of the strengths of the GC is fast allocation at the expense of slower collection.
I think you are having issues with resulting fragmentation which makes memory access pattern problematic for the cpu, since it may need to invalidate its cache too often. Most GC algorithms doesn't reclaim space in an optimum way.
Since your working set is limited and predictable, you might want to use an object pool which is allocated beforehand. You may also want to use reference counting to avoid much of the manual memory management. Technically it is still GC but not in the common sense of the GC.
Still, I don't think the performance is much affected by how you manage memory but how you actually use, access it. Most likely your profiler has the definite answer.
There are two distinct aspects to object allocation. The first is finding a suitable area of memory - with todays generational garbarge collectors, this is usually very fast (in the order of a few 10ths of machine cycles).
The second is the initialization of the objects you allocate. Since everything you allocate in Java is initialized, the cost for initialization can easily outweight the cost of allocation (except for the most simple, smallest objects). There is more. Since initialization requires writing the entire memory area the new object occupies (if you allocate a "new byte[1<<20]" for example, the entire megabyte needs to be set to zeros), this also usually pulls that memory into the cpu's cache, evicting other, older cache lines (which may or may not belong to your current "hot" working set).
If you do comparatively little processing on each of your arrays, those effects can severly affect the performance of your code. This can be partially avoided by re-using the same arrays over and over, but it usually makes the program logic more complex. It is also often not easy to determine if cache trashing is really the culprit. Its impossible to say from what little information is given in your question.
Does your VM try to pool strings? I had heard once, that IBM's VM did something like string interning but dynamically (no idea if its true) perhaps your VM is trying doing extra work to build an internal data structure of String internals.
Are you doing something like byte b[] = new byte[100]; String s = new String(b); by any chance? You might try not to allocate the String objects, and instead allocate some random object which has a reference to the byte[] (for comparison).
Related
I read that garbage collection can lead to memory fragmentation problem at run-time. To solve this problem, compacting is done by the JVM where it takes all the active objects and assigns them contiguous memory.
This means that the object addresses must change from time to time? Also, if this happens,
Are the references to these objects also re-assigned?
Won't this cause significant performance issues? How does Java cope with it?
I read that garbage collection can lead to memory fragmentation problem at run-time.
This is not an exclusive problem of garbage collected heaps. When you have a manually managed heap and free memory in a different order than the preceding allocations, you may get a fragmented heap as well. And being able to have different lifetimes than the last-in-first-out order of automatic storage aka stack memory, is one of the main motivations to use the heap memory.
To solve this problem, compacting is done by the JVM where it takes all the active objects and assigns them contiguous memory.
Not necessarily all objects. Typical implementation strategies will divide the memory into logical regions and only move objects from a specific region to another, but not all existing objects at a time. These strategies may incorporate the age of the objects, like generational collectors moving objects of the young generation from the Eden space to a Survivor space, or the distribution of the remaining objects, like the “Garbage First” collector which will, like the name suggests, evacuate the fragment with the highest garbage ratio first, which implies the smallest work to get a free contiguous memory block.
This means that the object addresses must change from time to time?
Of course, yes.
Also, if this happens,
Are the references to these objects also re-assigned?
The specification does not mandate how object references are implemented. An indirect pointer may eliminate the need to adapt all references, see also this Q&A. However, for JVMs using direct pointers, this does indeed imply that these pointers need to get adapted.
Won't this cause significant performance issues? How does Java cope with it?
First, we have to consider what we gain from that. To “eliminate fragmentation” is not an end in itself. If we don’t do it, we have to scan the reachable objects for gaps between them and create a data structure maintaining this information, which we would call “free memory” then. We also needed to implement memory allocations as a search for matching chunks in this data structure or to split chunks if no exact match has been found. This is a rather expensive operation compared to an allocation from a contiguous free memory block, where we only have to bump the pointer to the next free byte by the required size.
Given that allocations happen much more often than garbage collection, which only runs when the memory is full (or a threshold has been crossed), this does already justify more expensive copy operations. It also implies that just using a larger heap can solve performance issues, as it reduces the number of required garbage collector runs, whereas the number of survivor objects will not scale with the memory (unreachable objects stay unreachable, regardless of how long you defer the collection). In fact, deferring the collection raises the chances that more objects became unreachable in the meanwhile. Compare also with this answer.
The costs of adapting references are not much higher than the costs of traversing references in the marking phase. In fact, non-concurrent collectors could even combine these two steps, transferring an object on first encounter and adapting subsequently encountered references, instead of marking the object. The actual copying is the more expensive aspect, but as explained above, it is reduced by not copying all objects but using certain strategies based on typical application behavior, like generational approaches or the “garbage first” strategy, to minimize the required work.
If you move an object around the memory, its address will change. Therefore, references pointing to it will need to be updated. Memory fragmentation occurs when an object in a contigous (in memory) sequence of objects gets deleted. This creates a hole in the memory space, which is generally bad because contigous chunks of memory have faster access times and a higher probability of fitting in chache lines, among other things. It should be noted that the use of indirection tables can prevent reference updates up to the maximum level of indirection used.
Garbage collection has a moderate performance overhead, not just in Java but in other languages as well, such as C# for example. As how Java copes with this, the strategies for performing garbage collection and how to minimize its impact on performance depends on the particular JVM being used, since each JVM can implement garbage collection however it pleases; the only requirement is that it meets the JVM specification.
However, as a programmer, there are some best practices you should follow to make the best out of garbage collection and to minimze its performance hit on your application. See this, also this, this, this blog post, and this other blog post. You might want to check the JVM specs but it's a bit dense.
Is there a way to predict how much memory my Java program is going to take? I come from a C++ background where I implemented methods such as "size_in_bytes()" on classes and I could fairly accurately predict the runtime memory footprint of my app. Now I'm in a Java world, and that is not so easy... There are references, pools, immutable objects that are shared... but I'd still like to be able to predict my memory footprint before I look at the process size in top.
You can inspect the size of objects if you use the instrumentation API. It is a bit tricky to use -- it requires a "premain" method and extra VM parameters -- but there are plenty of examples on the web. "java instrumentation size" should find you these.
Note that the default methods will only give you a shallow size. And unless you avoid any object construction outside of the constructor (which is next to impossible), there will be dead objects around waiting to be garbage collected.
But in general, you could use these to estimate the memory requirements of your application, if you have a good control on the amount of objects generated.
You can't predict the amount of memory a program is going to take. However, you can predict how much an object will take. Edit it turns out I'm almost completely wrong, this document describes the memory usage of objects better: http://www.javamex.com/tutorials/memory/object_memory_usage.shtml
In general, you can predict fairly closely what a given object will require. There's some overhead that is relatively fixed, plus the instance fields in the object, plus a modest amount of padding. But then object size is rounded up to at least (on most JVMs) a 16-byte boundary, and some JVMs round up some object sizes to larger boundaries (to allow the use of standard sized pre-allocated object frames). But all this is relatively fixed for a given JVM.
What varies, of course, is the overhead required for garbage collection. A naive garbage collector requires 100% overhead (at least one free byte for every allocated byte), though certain varieties of "generational" collectors can improve on this to a degree. But how much space is required for GC is highly dependent on the workload (on most JVMs).
The other problem is that when you're running at a relatively low level of allocation (where you're only using maybe 10% of max available heap) then garbage will accumulate. It isn't actively referenced, but the bits of garbage are interspersed with your active objects, so it takes up working set. As a result, your working set tends to be roughly equal to your current overall garbage-collected heap size (plus other system overhead).
You can, of course, "throttle" the heap size so that you run at a higher % utilization, but that increases the frequency of garbage collection (and the overall cost of GC to a lesser degree).
You can use profilers to understand the constant set of objects that are always in memory. Then you should execute all the code paths to check for memory leaks. JProfiler is a good one to start with.
Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
Something along the lines of http://wwwasd.web.cern.ch/wwwasd/lhc++/Objectivity/V5.2/Java/guide/jgdStorage.fm.html and specifically non-garbage-collectible containers there (non-garbage-collectable?).
The problem is that I have lots of ordinary temporary objects, but I have even bigger (several Gigs) of objects that are stored for Cache purposes. For no reason should the Java GC traverse all those Cache gigabytes trying to find anything to collect, because they contain cached data which have their own timeouts.
This way I could partition my data in a custom way into infinite-lived and normal-lived objects, and hopefully GC would be quite fast because normal objects don't live so long and amount to smaller amounts.
There are some workarounds to this problem, such as Apache DirectMemory and Commercial Terracotta BigMemory(http://terracotta.org/products/bigmemory), but a java-native solution would be nicer (I mean free and probably more reliable?). Also I want to avoid serialization overhead which means it should happen within same jvm. To my understanding DirectMemory and BigMemory operate mainly off heap which means that the objects must be serialized/deserialized to/from memory outside jvm. Simply marking non-gc regions within the jvm would seem a better solution. Using Files for cache is not an option either, it has the same unaffordable serialization/deserialization overhead - use case is a HA server with lots of data used in random (human) order and low latency needed.
Any memory the JVM manages is also garbage-collected by the JVM. And any “live” objects which are directly available to Java methods without deserialization have to live in JVM memory. Therefore in my understanding you cannot have live objects which are immune to garbage collection.
On the other hand, the usage you describe should make the generational approach to garbage collection quite efficient. If your big objects stay around for a while, they will be checked for reclamation less often. So I doubt there is much to be gained from avoiding those checks.
Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
No it is not possible.
You can prevent objects from being garbage collected by keeping them reachable, but the GC will still need to trace them to check reachability on each full; GC (at least).
Is simply my assumption, that when the jvm is starving it begins scanning all those unnecessary objects too.
Yes. That is correct. However, unless you've got LOTS of objects that you want to be treated this way, the overhead is likely to be insignificant. (And anyway, a better idea is to give the JVM more memory ... if that is possible.)
Quite simply, for you to be able to do this, the garbage collection algorithm would need to be aware of such a flag, and take it into account when doing its work.
I'm not aware of any of the standard GC algorithms having such a flag, so for this to work you would need to write your own GC algorithm (after deciding on some feasible way to communicate this information to it).
In principle, in fact, you've already started down this track - you're deciding how garbage collection should be done rather than being happy to leaving it to the JVM's GC algo. Is the situation you describe a measurable problem for you; something for which the existing garbage collection is insufficient, but your plan would work? Garbage collectors are extremely well-tuned, so I wouldn't be surprised if the "inefficient" default strategy is actually faster than your naively-optimal one.
(Doing manual memory management is tricky and error-prone at the best of times; managing some memory yourself while using a stock garbage collector to handle the rest seems even worse. I expect you'd run into a lot of edge cases where the GC assumes it "knows" what's happening with the whole heap, which would no longer be true. Steer clear if you can...)
The recommended approaches would be to use either a commerical RTSJ implementation to avoid GC, or to use off heap memory. One could also look into soft references for caches as well (they do get collected).
This is not recommended:
If for some reason you do not believe these options are sufficient, you could look into direct memory access which is UNSAFE (part of sun.misc.Unsafe). You can use the 'theUnsafe' field to get the 'Unsafe' instance. Unsafe allows to allocation/deallocate memory via 'allocateMemory' and 'freeMemory'. This is not under GC control nor limited by JVM heap size. The impact on GC/application, once you go down this route, is not guaranteed - which is why using byte buffers might be the way to go (if you're not using a RTSJ like implementation).
Hope this helps.
Living Java objects will always be part of the GC life cycle. Or said another way, marking an object to be non-gc is the same order of overhead than having your object referenced by a root reference (a static final map for instance).
But thinking a bit further, data put in a cache are most likely to be temporary, and would eventually be evicted. At that point you will start again to like the JVM and the GC.
If you have 100's of GBs of permanent data, you may want to rethink the architecture of your application, and try to shard and distribute your data (horizontally scalability).
Last but not least, lots of work has been done around serialization, and the overhead of serialization (I'm not speaking about the poor reputation of ObjectInputStream and ObjectOutputStream) is not that big.
More than that, if your data is mainly composed of primitive types (including bytes array), there is efficient way to readInt() or readBytes() from off heap buffers (for instannce netty.io's ChannelBuffer). This could be a way to go.
In a Java program when it is necessary to allocate thousands of similar-size objects, it would be better (in my mind) to have a "pool" (which is a single allocation) with reserved items that can be pulled from when needed. This single large allocation wouldn't fragment the heap as much as thousands of smaller allocations.
Obviously, there isn't a way to specifically point an object reference to an address in memory (for its member fields) to set up a pool. Even if the new object referenced an area of the pool, the object itself would still need to be allocated. How would you handle many allocations like this without resorting to native OS libraries?
You could try using the Commons Pool library.
That said, unless I had proof the JVM wasn't doing what I needed, I'd probably hold off on optimizing object creation.
Don't worry about it. Unless you have done a lot of testing and analysis on the actual code being run and know that it is a problem with garbage collection and that the JVM isn't doing a good enough job, spend your time elsewhere.
If you are building an application, where a predictable response time is very important, then pooling of objects, no matter how small they are will pay you dividends. Again, pooling is also a factor of how big of a data set you are trying to pool and how much physical memory your machine has.
There is ample proof on the web that shows that object pooling, no matter how small the objects are, is beneficial for application performance.
There are two levels of pooling you could do:
Pooling of the basic objects such as Vectors, which you retrieve from the pool each time you have to use the vector to form a map or such.
Have the higher level composite objects pooled, which are most commonly used, pooled.
This is generally an application design decision.
Also, in a multi-threaded application, you would like to be sensitive about how many different threads are going to be allocating and returning to the pool. You certainly do not want your application to be bogged down by contention - especially if you are dealing with thousands of objects at the same time.
#Dave and Casey, you don't need any proof to show that contiguous memory layout improves Cache efficiency, which is the major bottleneck in most OOP apps that need high performance but follow a "too idealistic" OOP-design trajectory.
People often think of the GC as the culprit causing low performance in high performance Java applications and after fixing it, just leave it at that, without actually profiling memory-behavior of the application. Note though that un-cached memory instructions are inherently more expensive than arithmetic instructions (and are getting more and more expensive due to the memory access <-> computation gap). So if you care about performance, you should certainly care about memory management.
Cache-aware, or more general, data-oriented programming, is the key to achieving high performance in many kinds of applications, such as games, or mobile apps (to reduce power consumption).
Here is a SO thread on DOP.
Here is a slideshow from the Sony R&D department that shows the usefulness of DOP as applied to a playstation game (high performance required).
So how to solve the problem that Java, does not, in general allow you to allocate a chunk of memory? My guess is that when the program is just starting, you can assume that there is very little internal fragmentation in the already allocated pages. If you now have a loop that allocates thousands or millions of objects, they will probably all be as contiguous as possible. Note that you only need to make sure that consecutive objects stretch out over the same cacheline, which in many modern systems, is only 64 bytes. Also, take a look at the DOP slides, if you really care about the (memory-) performance of your application.
In short: Always allocate multiple objects at once (increase temporal locality of allocation), and, if your GC has defragmentation, run it beforehand, else try to reduce such allocations to the beginning of your program.
I hope, this is of some help,
-Domi
PS: #Dave, the commons pool library does not allocate objects contiguously. It only keeps track of the allocations by putting them into a reference array, embedded in a stack, linked list, or similar.
I'm exploring options to help my memory-intensive application, and in doing so I came across Terracotta's BigMemory. From what I gather, they take advantage of non-garbage-collected, off-heap "native memory," and apparently this is about 10x slower than heap-storage due to serialization/deserialization issues. Prior to reading about BigMemory, I'd never heard of "native memory" outside of normal JNI. Although BigMemory is an interesting option that warrants further consideration, I'm intrigued by what could be accomplished with native memory if the serialization issue could be bypassed.
Is Java native memory faster (I think this entails ByteBuffer objects?) than traditional heap memory when there are no serialization issues (for instance if I am comparing it with a huge byte[])? Or do the vagaries of garbage collection, etc. render this question unanswerable? I know "measure it" is a common answer around here, but I'm afraid I would not set up a representative test as I don't yet know enough about how native memory works in Java.
Direct memory is faster when performing IO because it avoid one copy of the data. However, for 95% of application you won't notice the difference.
You can store data in direct memory, however it won't be faster than storing data POJOs. (or as safe or readable or maintainable) If you are worried about GC, try creating your objects (have to be mutable) in advance and reuse them without discarding them. If you don't discard your objects, there is nothing to collect.
Is Java native memory faster (I think this entails ByteBuffer objects?) than traditional heap memory when there are no serialization issues (for instance if I am comparing it with a huge byte[])?
Direct memory can be faster than using a byte[] if you use use non bytes like int as it can read/write the whole four bytes without turning the data into bytes.
However it is slower than using POJOs as it has to bounds check every access.
Or do the vagaries of garbage collection, etc. render this question unanswerable?
The speed has nothing to do with the GC. The GC only matters when creating or discard objects.
BTW: If you minimise the number of object you discard and increase your Eden size, you can prevent even minor collection occurring for a long time e.g. a whole day.
The point of BigMemory is not that native memory is faster, but rather, it's to reduce the overhead of the garbage collector having to go through the effort of tracking down references to memory and cleaning it up. As your heap size increases, so do your GC intervals and CPU commitment. Depending upon the situation, this can create a sort of "glass ceiling" where the Java heap gets so big that the GC turns into a hog, taking up huge amounts of processor power each time the GC kicks in. Also, many GC algorithms require some level of locking that means nobody can do anything until that portion of the GC reference tracking algorithm finishes, though many JVM's have gotten much better at handling this. Where I work, with our app server and JVM's, we found that the "glass ceiling" is about 1.5 GB. If we try to configure the heap larger than that, the GC routine starts eating up more than 50% of total CPU time, so it's a very real cost. We've determined this through various forms of GC analysis provided by our JVM vendor.
BigMemory, on the other hand, takes a more manual approach to memory management. It reduces the overhead and sort of takes us back to having to do our own memory cleanup, as we did in C, albeit in a much simpler approach akin to a HashMap. This essentially eliminates the need for a traditional garbage collection routine, and as a result, we eliminate that overhead. I believe that the Terracotta folks used native memory via a ByteBuffer as it's an easy way to get out from under the Java garbage collector.
The following whitepaper has some good info on how they architected BigMemory and some background on the overhead of the GC: http://www.terracotta.org/resources/whitepapers/bigmemory-whitepaper.
I'm intrigued by what could be accomplished with native memory if the serialization issue could be bypassed.
I think that your question is predicated on a false assumption. AFAIK, it is impossible to bypass the serialization issue that they are talking about here. The only thing you could do would be to simplify the objects that you put into BigMemory and use custom serialization / deserialization code to reduce the overheads.
While benchmarks might give you a rough idea of the overheads, the actual overheads will be very application specific. My advice would be:
Only go down this route if you know you need to. (You will be tying your application to a particular implementation technology.)
Be prepared for some intrusive changes to your application if the data involved isn't already managed using as a cache.
Be prepared to spend some time in (re-)tuning your caching code to get good performance with BigMemory.
If your data structures are complicated, expect a proportionately larger runtime overheads and tuning effort.