Java collections faster than C++ containers?

I was reading the comments on this answer and I saw this quote.
Object instantiation and object-oriented features are blazing fast to use (faster than C++ in many cases) because they're designed in from the beginning. and Collections are fast. Standard Java beats standard C/C++ in this area, even for most optimized C code.
One user (with really high rep I might add) boldly defended this claim, stating that
heap allocation in java is better than C++'s
and added this statement defending the collections in java
And Java collections are fast compared to C++ collections due largely to the different memory subsystem.
So my question is: can any of this really be true, and if so, why is Java's heap allocation so much faster?

This sort of statement is ridiculous; people making it are
either incredibly uninformed, or incredibly dishonest. In
particular:
The speed of dynamic memory allocation in the two cases will
depend on the pattern of dynamic memory use, as well as on the
implementation. It is trivial for someone familiar with the
algorithms used in both cases to write a benchmark proving
whichever one he wanted to be faster. (Thus, for example, programs
using large, complex graphs that are built, then torn down and
rebuilt, will typically run faster under garbage collection, as
will programs that never use enough dynamic memory to trigger
the collector. Programs using a few large, long-lived
allocations will often run faster with manual memory
management.)
When comparing the collections, you have to consider what is
in the collections. If you're comparing large vectors of
double, for example, the difference between Java and C++ will
likely be slight, and could go either way. If you're comparing
large vectors of Point, where Point is a value class containing
two doubles, C++ will probably blow Java out of the water,
because it uses pure value semantics (with no additional dynamic
allocation), whereas Java needs to dynamically allocate each
Point (and no dynamic allocation is always faster than even
the fastest dynamic allocation). If the Point class in Java
is correctly designed to act as a value (and thus immutable,
like java.lang.String), then doing a translation on the
Point in a vector will require a new allocation for every
Point; in C++, you could just assign.
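To make that concrete, here is a hedged Java sketch of the per-element allocation described above (the Point class is a made-up illustration, not code from any library):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical immutable value-like Point. In Java every instance is a separate
    // heap allocation; a C++ std::vector<Point> would store the two doubles inline.
    final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        Point translate(double dx, double dy) {
            return new Point(x + dx, y + dy); // every translation allocates a new object
        }
    }

    class PointListDemo {
        public static void main(String[] args) {
            List<Point> points = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                points.add(new Point(i, i));                  // one allocation per element
            }
            for (int i = 0; i < points.size(); i++) {
                points.set(i, points.get(i).translate(1, 1)); // and another one per update
            }
        }
    }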
Much depends on the optimizer. In Java, the optimizer works
with perfect knowledge of the actual use cases, in this
particular run of the program, and perfect knowledge of the
actual processor it is running on, in this run. In C++, the
optimizer must work with data from a profiling run, which will
never correspond exactly to any one run of the program, and the
optimizer must (usually) generate code that will run (and run
quickly) on a wide variety of processor versions. On the other
hand, the C++ optimizer may take significantly more time
analysing the different paths (and effective optimization can
require a lot of CPU); the Java optimizer has to be fairly
quick.
Finally, although not relevant to all applications, a C++
program can be single-threaded, in which case no locking is
needed in the allocator; this is never the case in Java.
With regard to the two quoted claims: C++ can use more or
less the same algorithms as Java in its heap allocator. I've
used C++ programs where the ::operator delete() function was
empty, and the memory was garbage collected. (If your
application allocates lots of short lived, small objects, such
an allocator will probably speed things up.) And as for the
second: the really big advantage C++ has is that its memory
model doesn't require everything to be dynamically allocated.
Even if allocation in Java takes only a tenth of the time it
would take in C++ (which could be the case, if you only count
the allocation, and not the time needed for the collector
sweeps), with large vectors of Point, as above, you're
comparing two or three allocations in C++ with millions of
allocations in Java.
And finally: "why is Java's heap allocation so much faster?" It
isn't, necessarily, if you amortise the time for the
collection phases. The time for the allocation itself can be
very cheap, because Java (or at least most Java implementations)
use a relocating collector, which results in all of the free
memory being in a single contiguous block. This is at least
partially offset by the time needed in the collector: to get
that contiguity, you've got to move data, which means a lot of
copying. In most implementations, it also means an additional
indirection in the pointers, and a lot of special logic to avoid
issues when one thread has the address in a register, or such.

Your questions don't have concrete answers. For example, C++ does not define memory management at all. It leaves allocation details up to the library implementation. Therefore, within the bounds of C++, a given platform may have a very slow heap allocation scheme, and Java would certainly be faster if it bypasses that. On another platform, memory allocations may be blazing fast, outperforming Java. As James Kanze pointed out, Java also places very few constraints on memory management (e.g. even the GC algorithm is entirely up to the JVM implementor). Because neither Java nor C++ places firm constraints on memory management, there is no concrete answer to that question. C++ is purposefully open about underlying hardware and kernel functions, and Java is purposefully open about JVM memory management. So the question becomes very fuzzy.
You may find that some operations are faster in Java, and some not. You never know until you try, however:
In practice, the real differences lie in your higher level algorithms and implementations. For all but the most absolutely performance-critical applications, the differences in performance of identical data structures in different languages are completely negligible compared to the performance characteristics of the algorithm itself. Concentrate on optimizing your higher level implementations. Only after you have done so, and after you have determined that your performance requirements are not being met, and after you have benchmarked and found (unlikely) that your bottleneck is in container implementations, should you start to think of things like this.
In general, as soon as you find yourself thinking or reading about C++ vs. Java issues, stop and refocus on something productive.

The Java heap is faster because (simplified) all you need to do to allocate is bump the heap-top pointer (just like on the stack). This is possible because the heap is periodically compacted. So your price for speed is:
Periodic GC pauses for heap compacting
Increased memory usage
There is no free lunch... So while collection operations may be fast, the cost comes back as an overall slowdown during GC work.
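Roughly what bump-pointer allocation looks like, as a toy sketch over a plain byte[] (an illustration of the idea, not how any real JVM is implemented):

    // Toy bump-pointer "allocator" over a pre-reserved byte[]. It only illustrates why
    // allocation in a compacted heap is cheap: a bounds check plus a pointer bump.
    public class BumpAllocator {
        private final byte[] heap;
        private int top; // next free offset, analogous to a thread-local allocation pointer

        public BumpAllocator(int capacity) {
            this.heap = new byte[capacity];
        }

        /** Returns the offset of a freshly "allocated" block, or -1 if the region is full. */
        public int allocate(int size) {
            if (top + size > heap.length) {
                return -1; // a real JVM would trigger a GC / compaction here
            }
            int offset = top;
            top += size; // the entire cost of allocation
            return offset;
        }
    }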

While I am a fan of Java, it is worth noting that C++ supports allocation of objects on the stack which is faster than heap allocation.
If you use C++ efficiently, with all its various ways of doing the same thing, it will be faster than Java (even if it takes you longer to find that optimal combination).
If you program in C++ as you would in Java, e.g. everything on the heap, all methods virtual, lots of runtime checks which don't do anything and can be optimised away dynamically, it will be slower. Java has optimised these things further because a) they are the only thing Java does, b) they can be optimised dynamically more efficiently, and c) Java has fewer features and side effects, so it is easier for the optimiser to get decent speed.

and Collections are fast. Standard Java beats standard C/C++ in this area, even for most optimized C code.
This may be true for particular collections, but most certainly isn't true for all collections in all usage patterns.
For instance, a java.util.HashMap will outperform a std::map, because the latter is required to be sorted. That is, the fastest Map in the Java standard library is faster than the fastest map in the C++ one (at least prior to C++11, which added std::unordered_map).
On the other side, a std::vector<int> is far more efficient than a java.util.ArrayList<Integer> (due to type erasure, you can't use a java.util.ArrayList<int>, and therefore end up with about 4 times the memory consumption, possibly poorer cache locality, and correspondingly slower iteration).
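A small illustration of the boxing overhead on the Java side (a sketch only; the numbers you measure will depend on your JVM and heap settings):

    import java.util.ArrayList;
    import java.util.List;

    public class BoxingOverheadDemo {
        public static void main(String[] args) {
            int n = 1_000_000;

            // Primitive array: one contiguous block of 4-byte ints.
            int[] primitive = new int[n];
            for (int i = 0; i < n; i++) primitive[i] = i;

            // ArrayList<Integer>: an Object[] of references, each pointing to a boxed
            // Integer with its own object header, so more memory and worse locality.
            List<Integer> boxed = new ArrayList<>(n);
            for (int i = 0; i < n; i++) boxed.add(i); // auto-boxing allocates for most values

            long sum = 0;
            for (int v : primitive) sum += v; // tight loop over contiguous memory
            for (int v : boxed) sum += v;     // unboxing plus an extra pointer chase per element
            System.out.println(sum);
        }
    }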
In short, like most sweeping generalizations, this one doesn't always apply. However, neither would the opposite assertion (that Java is always slower than C++). It really depends on the details, such as how you use the collection, or even which versions of the languages you compare.


4-ary heaps in Java

Binary heaps are commonly used in e.g. priority queues. The basic idea is that of an incomplete heap sort: you keep the data sorted "just enough" to get out the top element quickly.
While 4-ary heaps are theoretically worse than binary heaps, they do also have some benefits. For example, they require fewer heap restructuring operations (as the heap is much shallower), while obviously needing more comparisons at each level. But (and that probably is their main benefit?) they may have better CPU cache locality. So some sources say that 3-ary and 4-ary heaps outperform both Fibonacci and binary heaps in practice.
They should not be much harder to implement; the additional children just mean a few extra if cases.
Has anyone experimented with 4-ary heaps (and 3-ary) for priority queues and done some benchmarking?
In Java you never know if they are faster or slower until you have benchmarked them extensively.
And from all I've found via Google, it may be quite language and use case dependent. Some sources say that they found 3-ary to perform best for them.
Some more points:
PriorityQueue obviously is a binary heap. But the class for example also lacks bulk loading and bulk repair support, or replaceTopElement, which can make a huge difference. Bulk loading for example is O(n) instead of O(n log n); bulk repair is essentially the same after adding a larger set of candidates. Tracking which parts of the heap are invalid can be done with a single integer. replaceTopElement is much cheaper than poll + add (just consider how a poll is implemented: replace the top element with the very last); see the sketch after this list.
While heaps of course are popular for complex objects, the priority often is an integer or double value. It's not as if we are comparing strings here. Usually it is a (primitive) priority.
PQs are often used just to get the top k elements. For example A*-search can terminate when the goal is reached. All the less good paths are then discarded. So the queue is never completely emptied. In a 4-way heap, there is less order: approximately half as much (half as many parent nodes). So it will impose less order on these elements that are not needed. (This of course differs if you intend to empty your heap completely, e.g. because you are doing heap sort.)
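Here is a minimal sketch of the index arithmetic and the replaceTop operation for a 0-indexed 4-ary min-heap of primitive doubles (illustrative code, not ELKI's or the JDK's implementation; resizing, poll and error handling are omitted):

    // 0-indexed 4-ary min-heap of primitive doubles: index arithmetic, offer, and replaceTop.
    public class QuadDoubleMinHeap {
        private final double[] heap;
        private int size;

        public QuadDoubleMinHeap(int capacity) {
            this.heap = new double[capacity];
        }

        static int parent(int i)     { return (i - 1) >>> 2; } // (i - 1) / 4
        static int firstChild(int i) { return (i << 2) + 1; }  // 4 * i + 1

        /** Standard sift-up insert. */
        public void offer(double value) {
            int i = size++;
            while (i > 0 && heap[parent(i)] > value) {
                heap[i] = heap[parent(i)];
                i = parent(i);
            }
            heap[i] = value;
        }

        /** Replaces the minimum and sifts down: cheaper than poll() followed by add(). */
        public void replaceTop(double value) {
            int i = 0;
            while (true) {
                int first = firstChild(i);
                if (first >= size) break;
                int best = first;
                int last = Math.min(first + 4, size);
                for (int c = first + 1; c < last; c++) { // at most 4 children per node
                    if (heap[c] < heap[best]) best = c;
                }
                if (heap[best] >= value) break;
                heap[i] = heap[best]; // pull the smallest child up into the hole
                i = best;
            }
            heap[i] = value;
        }
    }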
As per @ErichSchubert's suggestion, I have taken the implementations from ELKI and modified them into a 4-ary heap. It was a bit tricky to get the indexing right, as a lot of the publications around 4-ary heaps use formulas for 1-indexed arrays.
Here are some early benchmark results, based on the ELKI unit test. 200000 Double objects are preallocated (to avoid measuring memory management too much) and shuffled.
As a warmup, 10 iterations are performed for each heap; for benchmarking, 100 iterations, but I'll probably try to scale this up further. 10-30 seconds isn't that reliable for benchmarking yet, and OTOH I should try to measure standard deviations, too.
In each iteration, the 200000 elements are added to the heap, then half of them are polled again. Yes, the workload could also be made more complex.
Here are the results:
My 4-ary DoubleMinHeap: 10.371
ELKI DoubleMinHeap: 12.356
ELKI Heap<Double>: 37.458
Java PriorityQueue<Double>: 45.875
So the difference between the 4-ary heap (probably not yet L1 cache-aligned!) and the ELKI heap for primitive doubles is not too big. Well, 10%-20% or so; it could be worse.
The difference between a heap for primitive doubles and a heap for Double objects is much larger. And the ELKI Heap is indeed quite clearly faster than the Java PriorityQueue (but that one seems to have a high variance).
There was a slight "bug" in ELKI, though - at least the primitive heaps did not use the bulk loading code yet. It's there, it's just not being used, as every element repairs the heap immediately instead of delaying this until the next poll(). I fixed this for my experiments, essentially by removing a few lines and adding one ensureValid() call. Furthermore, I also don't have a 4-ary object heap yet, and I haven't included ELKI's DoubleObjectMinHeap yet... quite a lot to benchmark, and I'll probably give Caliper a try for that.
I haven't benchmarked it myself, but I have a few relevant points to make.
Firstly, note that the standard Java implementation of PriorityQueue uses a binary heap:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/PriorityQueue.java
It is plausibly the case that, despite the cache locality benefit of n-ary heaps, binary heaps are still the best solution on average. Below are some slightly hand-wavy reasons why this might be the case:
For most interesting objects, comparison costs are probably much more significant than cache locality effects in the heap data structure itself. n-ary heaps require more comparisons. This probably is enough on its own to outweigh any cache locality effect in the heap itself.
If you were simply making a heap of numbers in place (i.e. backed by an array of ints or doubles) then I can see that the cache locality would be a worthwhile benefit. But this isn't the case: usually you will have a heap of object references. Cache locality on the object references themselves is then less useful, since each comparison will require following at least one extra reference to examine the referenced object and its fields.
The common case for priority heaps is probably quite a small heap. If you are hitting it sufficiently that you care about it from a performance perspective, it's probably all in the L1 cache anyway. So no cache locality benefit for the n-ary heap anyway.
It's easier to handle binary heap indices with bitwise ops (see the snippet after this list). Sure, it's not a big advantage, but every little helps....
Simpler algorithms are generally faster than more complex ones, all else being equal, simply because of a lower constant overhead. You get benefits like lower instruction cache usage, higher likelihood of the compiler being able to find smart optimisations etc. Again this works in favour of the binary heap.
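For reference, the index arithmetic that the bitwise-ops point refers to looks like this for a 0-indexed array (a trivial sketch):

    // Binary heap index arithmetic with shifts, for a 0-indexed backing array.
    final class BinaryHeapIndex {
        static int parent(int i)     { return (i - 1) >>> 1; } // (i - 1) / 2
        static int leftChild(int i)  { return (i << 1) + 1; }  // 2 * i + 1
        static int rightChild(int i) { return (i << 1) + 2; }  // 2 * i + 2
    }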
Obviously you'd need to do your own benchmarks on your own data before you come to a real conclusion about which performs best (and whether the difference is enough to care about, which I personally doubt....)
EDIT
Also, I did write a priority heap implementation using an array of primitive keys that may be of interest given the original poster mentioned primitive keys in the comment below:
https://github.com/mikera/mikera/blob/master/src/main/java/mikera/util/RankedQueue.java
This could probably be hacked into an n-ary version for benchmarking purposes relatively easily if anyone was interested in running a test.
I have not benchmarked 4-ary heaps yet. I'm currently trying to optimize our own heap implementations, and I'm trying 4-ary heaps there, too. And you are right: we will need to benchmark this carefully, as it is easy to get misled by implementation differences, and HotSpot optimization will heavily affect the results. Plus, small heaps will probably show different performance characteristics than large heaps.
The Java PriorityQueue is a very simple heap implementation, but that means HotSpot will optimize it well. It's not bad at all: most people would implement a worse heap. But, for example, it indeed does not do efficient bulk loads or bulk adds (bulk repairs). However, in my experiments it was hard to consistently beat this implementation even in simulations with repeated inserts, unless you go for really large heaps. Furthermore, in many situations it pays off to replace the top element in the heap instead of poll() + add(); this is not supported by Java's PriorityQueue.
Some of the performance gains in ELKI (and I've seen you are an ELKI user) across versions are actually due to improved heap implementations. But it goes up and down; it's hard to predict which heap variation performs best across real workloads. The key benefit of our implementation is probably having a replaceTopElement function. You can inspect the code here:
SVN de.lmu.ifi.dbs.elki.utilities.heap package
You will notice we have a whole set of heaps there. They are optimized for different stuff, and will need some more refactoring. A number of these classes are actually generated from templates, similar to what GNU Trove does. The reason is that Java can be quite costly when managing boxed primitives, so it does pay off to have primitive versions. (yes, there are plans to split this out into a separate library. It's just not of high priority.)
Note that ELKI deliberately does not endorse the Java Collections API. We have found in particular the java.util.Iterator interface to be quite costly, and thus try to encourage people to use C++-style iterators throughout ELKI:
for (Iter iter = ids.iter(); iter.valid(); iter.advance()) { /* ... */ }
This often saves a lot of unnecessary object creation compared to the java.util.Iterator API. Plus, these iterators can have multiple (and primitive) value getters, whereas Iterator.next() is a mixture of a getter and the advance operation.
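A hedged sketch of what such a C++-style iterator can look like (illustrative interface and names, not ELKI's actual API):

    // Illustrative C++-style iterator over primitive doubles: no boxing, no
    // Iterator object churn, and separate getter/advance operations.
    interface DoubleIter {
        boolean valid();    // is there a current element?
        void advance();     // move to the next element
        double getDouble(); // primitive getter, no Object allocation
    }

    final class DoubleArrayIter implements DoubleIter {
        private final double[] data;
        private int pos;

        DoubleArrayIter(double[] data) { this.data = data; }

        public boolean valid()    { return pos < data.length; }
        public void advance()     { pos++; }
        public double getDouble() { return data[pos]; }
    }

    // Usage, mirroring the loop shape above:
    //   for (DoubleIter iter = new DoubleArrayIter(values); iter.valid(); iter.advance()) {
    //       sum += iter.getDouble();
    //   }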
Ok, I have drifted off too much now, back to the topic of 4-ary heaps:
If you intend to try out 4-ary heaps, I suggest you start with the ObjectHeap class there.
Update: I've been microbenchmarking, but the results so far are inconclusive. It's hard to beat PriorityQueue consistently. In particular, bulk loading and bulk repairs do not seem to buy anything in my benchmark - probably they cause HotSpot to optimize less, or de-optimize at some point. As is often the case, simpler Java code is faster than complex logic. So far, 4-ary heaps without bulk-loading seem to work best. I haven't tried 5-ary yet. 3-ary heaps are about equal with 4-ary heaps, and the memory layout of 4-ary is a bit nicer. I'm also considering trying a heap-of-heaps approach to save array resizing. But I expect that the increased code complexity means it will run slower in practice.

Predicting Java memory

Is there a way to predict how much memory my Java program is going to take? I come from a C++ background where I implemented methods such as "size_in_bytes()" on classes and I could fairly accurately predict the runtime memory footprint of my app. Now I'm in a Java world, and that is not so easy... There are references, pools, immutable objects that are shared... but I'd still like to be able to predict my memory footprint before I look at the process size in top.
You can inspect the size of objects if you use the instrumentation API. It is a bit tricky to use -- it requires a "premain" method and extra VM parameters -- but there are plenty of examples on the web. "java instrumentation size" should find you these.
Note that the default methods will only give you a shallow size. And unless you avoid any object construction outside of the constructor (which is next to impossible), there will be dead objects around waiting to be garbage collected.
But in general, you could use these to estimate the memory requirements of your application, if you have good control over the number of objects generated.
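As a hedged sketch, the agent side looks roughly like this (the class and jar names are made up; the jar's manifest must declare Premain-Class and the JVM must be started with -javaagent):

    import java.lang.instrument.Instrumentation;

    // Must be packaged in a jar whose manifest declares "Premain-Class: ObjectSizer",
    // and the JVM must be started with -javaagent:sizer.jar (jar name is illustrative).
    public class ObjectSizer {
        private static volatile Instrumentation instrumentation;

        // Called by the JVM before main() when the agent is attached.
        public static void premain(String agentArgs, Instrumentation inst) {
            instrumentation = inst;
        }

        /** Shallow size in bytes of one object, not counting anything it references. */
        public static long shallowSizeOf(Object obj) {
            if (instrumentation == null) {
                throw new IllegalStateException("Agent not loaded; start the JVM with -javaagent");
            }
            return instrumentation.getObjectSize(obj);
        }
    }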
You can't predict the amount of memory a program is going to take. However, you can predict how much an object will take. Edit: it turns out I'm almost completely wrong; this document describes the memory usage of objects better: http://www.javamex.com/tutorials/memory/object_memory_usage.shtml
In general, you can predict fairly closely what a given object will require. There's some overhead that is relatively fixed, plus the instance fields in the object, plus a modest amount of padding. But then object size is rounded up to at least (on most JVMs) a 16-byte boundary, and some JVMs round up some object sizes to larger boundaries (to allow the use of standard sized pre-allocated object frames). But all this is relatively fixed for a given JVM.
What varies, of course, is the overhead required for garbage collection. A naive garbage collector requires 100% overhead (at least one free byte for every allocated byte), though certain varieties of "generational" collectors can improve on this to a degree. But how much space is required for GC is highly dependent on the workload (on most JVMs).
The other problem is that when you're running at a relatively low level of allocation (where you're only using maybe 10% of max available heap) then garbage will accumulate. It isn't actively referenced, but the bits of garbage are interspersed with your active objects, so it takes up working set. As a result, your working set tends to be roughly equal to your current overall garbage-collected heap size (plus other system overhead).
You can, of course, "throttle" the heap size so that you run at a higher % utilization, but that increases the frequency of garbage collection (and the overall cost of GC to a lesser degree).
You can use profilers to understand the constant set of objects that are always in memory. Then you should execute all the code paths to check for memory leaks. JProfiler is a good one to start with.

How to memory profile in Java?

I'm still learning the ropes of Java, so sorry if there's an obvious answer to this. I have a program that is taking a ton of memory and I want to figure out a way to reduce its usage, but after reading many SO questions I have the idea that I need to prove where the problem is before I start optimizing it.
So here's what I did: I added a breakpoint to the start of my program and ran it, then I started VisualVM and had it profile the memory (I also did the same thing in NetBeans just to compare the results, and they are the same). My problem is I don't know how to read them: I got the highest area just saying char[] and I can't see any code or anything (which makes sense, because VisualVM is connecting to the JVM and can't see my source, but NetBeans also does not show me the source as it does when doing CPU profiling).
Basically what I want to know is which variable (and hopefully more details, like in which method) all the memory is being used, so I can focus on working there. Is there an easy way to do this? Right now I am using Eclipse and Java to develop (and installed VisualVM and NetBeans specifically for profiling, but am willing to install anything else that you feel gets this job done).
EDIT: Ideally, I'm looking for something that will take all my objects and sort them by size (so I can see which one is hogging memory). Currently it returns generic information such as String[] or int[], but I want to know which object it's referring to so I can work on getting its size more optimized.
Strings are problematic
Basically, in Java, String references (things that use char[] behind the scenes) will dominate most business applications memory-wise.
That is simply because they are so fundamental to most business applications as a data type, and they are one of the most memory-hungry as well. This isn't just a Java thing: String data types take up lots of memory in pretty much every language and runtime library, because at the least they are arrays of 1 byte per character, or at worst (Unicode) arrays of multiple bytes per character.
Once when profiling CPU usage on a web app that also had an Oracle JDBC dependency I discovered that StringBuffer.append() dominated the CPU cycles by many orders of magnitude over all other method calls combined, much less any other single method call. The JDBC driver did lots and lots of String manipulation, kind of the trade off of using PreparedStatements for everything.
What you are concerned about you can't control, not directly anyway
What you should focus on is what is in your control, which is making sure you don't hold on to references longer than you need to, and that you are not duplicating things unnecessarily. The garbage collection routines in Java are highly optimized, and if you learn how their algorithms work, you can make sure your program behaves in the optimal way for those algorithms to work.
Java Heap Memory isn't like manually managed memory in other languages, those rules don't apply
What are considered memory leaks in other languages aren't the same thing/root cause as in Java with its garbage collection system.
Most likely in Java memory isn't consumed by one single uber-object that is leaking ( dangling reference in other environments ).
It is most likely lots of smaller allocations caused by StringBuffer/StringBuilder objects not being sized appropriately on first instantiation and then having to automatically grow their char[] arrays to hold subsequent append() calls.
These intermediate objects may be held around longer than expected by the garbage collector because of the scope they are in and lots of other things that can vary at run time.
EXAMPLE: the garbage collector may decide that there are candidates, but because it considers that there is plenty of memory still to be had, it may judge it too expensive time-wise to flush them out at that point, and it will wait until memory pressure gets higher.
The garbage collector is really good now, but it isn't magic; if you are doing degenerate things, you will cause it to not work optimally. There is lots of documentation on the internet about the garbage collector settings for all the versions of the JVMs.
These un-referenced objects may simply not yet have reached the point at which the garbage collector decides to expunge them from memory, or there could be references to them held by some other object (a List, for example) that you don't realize still points to that object. This is what is most commonly referred to as a leak in Java; more specifically, a reference leak.
EXAMPLE: If you know you need to build a 4K String using a StringBuilder, create it with new StringBuilder(4096); not the default, which is only 16 characters and will immediately start creating garbage that can represent many times what you think the object should be size-wise.
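A small illustration of that point (a sketch; the 4K size is just the example figure used above):

    import java.util.List;

    public class StringBuilderSizing {
        static String buildReport(List<String> lines) {
            // Sized for the expected ~4K result, so append() does not have to keep
            // growing and copying the internal char[] as the default capacity would.
            StringBuilder sb = new StringBuilder(4096);
            for (String line : lines) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }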
You can discover how many of which types of objects are instantiated with VisualVM; this will tell you what you need to know. There isn't going to be one big flashing light that points at a single instance of a single class and says, "This is the big memory consumer!" - unless there really is only one instance of some char[] that you are reading some massive file into, and even that is hard to single out, because lots of other classes use char[] internally; and in that case you pretty much knew it already.
I don't see any mention of OutOfMemoryError
You probably don't have a problem in your code; the garbage collection system just might not be getting put under enough pressure to kick in and deallocate objects that you think it should be cleaning up. What you think is a problem probably isn't, unless your program is crashing with OutOfMemoryError. This isn't C, C++, Objective-C, or any other manual memory management language/runtime. You don't get to decide what is in memory or not at the level of detail you are expecting to.
In JProfiler, you can go to the heap walker and activate the biggest objects view. You will see the objects that retain the most memory. "Retained" memory is the memory that would be freed by the garbage collector if you removed the object.
You can then open the object nodes to see the reference tree of the retained objects.
Disclaimer: My company develops JProfiler
I would recommend capturing heap dumps and using a tool like Eclipse MAT that lets you analyze them. There are many tutorials available. It provides a view of the dominator tree to provide insight into the relationships between the objects on the heap. Specifically for what you mentioned, the "path to GC roots" feature of MAT will tell you where the majority of those char[], String[] and int[] objects are being referenced. JVisualVM can also be useful in identifying leaks and allocations, particularly by using snapshots with allocation stack traces. There are quite a few walk-throughs of the process of getting the snapshots and comparing them to find the allocation point.
The Java JDK comes with JVisualVM under the bin folder. Once your application server (for example) is running, you can start VisualVM and connect it to your localhost, which will show you memory allocation and enable you to perform a heap dump.
If you use visualVM to check your memory usage, it focuses on the data, not the methods. Maybe your big char[] data is caused by many String values? Unless you are using recursion, the data will not be from local variables. So you can focus on the methods that insert elements into large data structures. To find out what precise statements cause your "memory leakage", I suggest you additionally
read Josh Bloch's Effective Java, Item 6 (Eliminate obsolete object references)
use a logging framework and log instance creations on the highest verbosity level.
There are generally two distinct approaches to analyse Java code to gain an understanding of its memory allocation profile. If you're trying to measure the impact of a specific, small section of code – say you want to compare two alternative implementations in order to decide which one gives better runtime performance – you would use a microbenchmarking tool such as JMH.
While you can pause the running program, the JVM is a sophisticated runtime that performs a variety of housekeeping tasks and it's really hard to get a "point in time" snapshot and an accurate reading of the "level of memory usage". It might allocate/free memory at a rate that does not directly reflect the behaviour of the running Java program. Similarly, performing a Java object heap dump does not fully capture the low-level machine specific memory layout that dictates the actual memory footprint, as this could depend on the machine architecture, JVM version, and other runtime factors.
Tools like JMH get around this by repeatedly running a small section of code, and observing a long-running average of memory allocations across a number of invocations. E.g. in the GC profiling sample JMH benchmark, the derived ·gc.alloc.rate.norm metric gives a reasonably accurate per-invocation normalised memory cost.
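As a hedged sketch, a minimal JMH benchmark run with the GC profiler (e.g. java -jar benchmarks.jar -prof gc; the class and method names here are made up) looks roughly like this:

    import java.util.ArrayList;
    import java.util.List;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class AllocationBenchmark {
        @Benchmark
        public List<Integer> boxedList() {
            // Each boxed Integer outside the small-value cache is a separate allocation,
            // which shows up in the gc.alloc.rate.norm metric.
            List<Integer> list = new ArrayList<>();
            for (int i = 0; i < 1_000; i++) {
                list.add(i);
            }
            return list; // returning the result keeps the JIT from eliminating the work
        }
    }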
In the more general case, you can attach a profiler to a running application and get JVM-level metrics, or perform a heap dump for offline analysis. Some commonly used tools for profiling full applications are Async Profiler and the newly open-sourced Java Flight Recorder in conjunction with Java Mission Control to visualise results.

Java allocation : allocating objects from a pre-existing/allocated pool

In a Java program when it is necessary to allocate thousands of similar-size objects, it would be better (in my mind) to have a "pool" (which is a single allocation) with reserved items that can be pulled from when needed. This single large allocation wouldn't fragment the heap as much as thousands of smaller allocations.
Obviously, there isn't a way to specifically point an object reference to an address in memory (for its member fields) to set up a pool. Even if the new object referenced an area of the pool, the object itself would still need to be allocated. How would you handle many allocations like this without resorting to native OS libraries?
You could try using the Commons Pool library.
That said, unless I had proof the JVM wasn't doing what I needed, I'd probably hold off on optimizing object creation.
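If you do go the Commons Pool route, a minimal sketch using the Commons Pool 2 API looks something like this (assuming the commons-pool2 jar is on the classpath; the pooled type and sizes here are arbitrary):

    import org.apache.commons.pool2.BasePooledObjectFactory;
    import org.apache.commons.pool2.PooledObject;
    import org.apache.commons.pool2.impl.DefaultPooledObject;
    import org.apache.commons.pool2.impl.GenericObjectPool;

    public class BufferPoolExample {
        public static void main(String[] args) throws Exception {
            GenericObjectPool<StringBuilder> pool = new GenericObjectPool<>(
                    new BasePooledObjectFactory<StringBuilder>() {
                        @Override
                        public StringBuilder create() {
                            return new StringBuilder(4096); // created only when the pool is empty
                        }
                        @Override
                        public PooledObject<StringBuilder> wrap(StringBuilder sb) {
                            return new DefaultPooledObject<>(sb);
                        }
                    });
            StringBuilder sb = pool.borrowObject();
            try {
                sb.setLength(0);           // reset state before reuse
                sb.append("pooled work");
            } finally {
                pool.returnObject(sb);     // hand the instance back instead of discarding it
            }
            pool.close();
        }
    }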
Don't worry about it. Unless you have done a lot of testing and analysis on the actual code being run and know that it is a problem with garbage collection and that the JVM isn't doing a good enough job, spend your time elsewhere.
If you are building an application where a predictable response time is very important, then pooling of objects, no matter how small they are, will pay dividends. Again, pooling is also a factor of how big a data set you are trying to pool and how much physical memory your machine has.
There is ample proof on the web that shows that object pooling, no matter how small the objects are, is beneficial for application performance.
There are two levels of pooling you could do:
Pooling of the basic objects such as Vectors, which you retrieve from the pool each time you have to use the vector to form a map or such.
Have the higher level composite objects, which are most commonly used, pooled.
This is generally an application design decision.
Also, in a multi-threaded application, you would like to be sensitive about how many different threads are going to be allocating and returning to the pool. You certainly do not want your application to be bogged down by contention - especially if you are dealing with thousands of objects at the same time.
@Dave and @Casey, you don't need any proof to show that contiguous memory layout improves cache efficiency, which is the major bottleneck in most OOP apps that need high performance but follow a "too idealistic" OOP design trajectory.
People often think of the GC as the culprit causing low performance in high-performance Java applications, and after fixing it they just leave it at that, without actually profiling the memory behavior of the application. Note though that un-cached memory instructions are inherently more expensive than arithmetic instructions (and are getting more and more expensive due to the memory access <-> computation gap). So if you care about performance, you should certainly care about memory management.
Cache-aware, or more generally, data-oriented programming (DOP) is the key to achieving high performance in many kinds of applications, such as games or mobile apps (to reduce power consumption).
Here is a SO thread on DOP.
Here is a slideshow from the Sony R&D department that shows the usefulness of DOP as applied to a playstation game (high performance required).
So how do you solve the problem that Java does not, in general, allow you to allocate a chunk of memory? My guess is that when the program is just starting, you can assume that there is very little internal fragmentation in the already allocated pages. If you now have a loop that allocates thousands or millions of objects, they will probably all be as contiguous as possible. Note that you only need to make sure that consecutive objects stretch over the same cache line, which in many modern systems is only 64 bytes. Also, take a look at the DOP slides if you really care about the (memory) performance of your application.
In short: always allocate multiple objects at once (to increase temporal locality of allocation), and, if your GC supports defragmentation, run it beforehand; otherwise try to move such allocations to the beginning of your program.
I hope this is of some help,
-Domi
PS: @Dave, the Commons Pool library does not allocate objects contiguously. It only keeps track of the allocations by putting them into a reference array, embedded in a stack, linked list, or similar.

Is Java Native Memory Faster than the heap?

I'm exploring options to help my memory-intensive application, and in doing so I came across Terracotta's BigMemory. From what I gather, they take advantage of non-garbage-collected, off-heap "native memory," and apparently this is about 10x slower than heap-storage due to serialization/deserialization issues. Prior to reading about BigMemory, I'd never heard of "native memory" outside of normal JNI. Although BigMemory is an interesting option that warrants further consideration, I'm intrigued by what could be accomplished with native memory if the serialization issue could be bypassed.
Is Java native memory faster (I think this entails ByteBuffer objects?) than traditional heap memory when there are no serialization issues (for instance if I am comparing it with a huge byte[])? Or do the vagaries of garbage collection, etc. render this question unanswerable? I know "measure it" is a common answer around here, but I'm afraid I would not set up a representative test as I don't yet know enough about how native memory works in Java.
Direct memory is faster when performing IO because it avoids one copy of the data. However, for 95% of applications you won't notice the difference.
You can store data in direct memory, but it won't be faster than storing data in POJOs (or as safe, readable, or maintainable). If you are worried about GC, try creating your objects (they have to be mutable) in advance and reuse them without discarding them. If you don't discard your objects, there is nothing to collect.
Is Java native memory faster (I think this entails ByteBuffer objects?) than traditional heap memory when there are no serialization issues (for instance if I am comparing it with a huge byte[])?
Direct memory can be faster than using a byte[] if you use non-byte types like int, as it can read/write the whole four bytes without assembling the value from individual bytes.
However, it is slower than using POJOs as it has to bounds-check every access.
Or do the vagaries of garbage collection, etc. render this question unanswerable?
The speed has nothing to do with the GC. The GC only matters when creating or discarding objects.
BTW: If you minimise the number of objects you discard and increase your Eden size, you can prevent even minor collections from occurring for a long time, e.g. a whole day.
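A small sketch of what storing primitives in direct (off-heap) memory looks like via ByteBuffer, next to a heap-backed buffer (illustrative only; whether it is faster depends heavily on the access pattern):

    import java.nio.ByteBuffer;

    public class DirectBufferExample {
        public static void main(String[] args) {
            // Off-heap (direct) buffer: not moved by the GC, useful for IO.
            ByteBuffer direct = ByteBuffer.allocateDirect(1024);
            direct.putInt(0, 42);         // writes four bytes in one bounds-checked call
            int value = direct.getInt(0); // reads them back as an int

            // On-heap equivalent backed by a plain byte[]:
            ByteBuffer heap = ByteBuffer.wrap(new byte[1024]);
            heap.putInt(0, value);
            System.out.println(heap.getInt(0)); // prints 42
        }
    }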
The point of BigMemory is not that native memory is faster, but rather, it's to reduce the overhead of the garbage collector having to go through the effort of tracking down references to memory and cleaning it up. As your heap size increases, so do your GC intervals and CPU commitment. Depending upon the situation, this can create a sort of "glass ceiling" where the Java heap gets so big that the GC turns into a hog, taking up huge amounts of processor power each time the GC kicks in. Also, many GC algorithms require some level of locking that means nobody can do anything until that portion of the GC reference tracking algorithm finishes, though many JVM's have gotten much better at handling this. Where I work, with our app server and JVM's, we found that the "glass ceiling" is about 1.5 GB. If we try to configure the heap larger than that, the GC routine starts eating up more than 50% of total CPU time, so it's a very real cost. We've determined this through various forms of GC analysis provided by our JVM vendor.
BigMemory, on the other hand, takes a more manual approach to memory management. It reduces the overhead and sort of takes us back to having to do our own memory cleanup, as we did in C, albeit in a much simpler approach akin to a HashMap. This essentially eliminates the need for a traditional garbage collection routine, and as a result, we eliminate that overhead. I believe that the Terracotta folks used native memory via a ByteBuffer as it's an easy way to get out from under the Java garbage collector.
The following whitepaper has some good info on how they architected BigMemory and some background on the overhead of the GC: http://www.terracotta.org/resources/whitepapers/bigmemory-whitepaper.
I'm intrigued by what could be accomplished with native memory if the serialization issue could be bypassed.
I think that your question is predicated on a false assumption. AFAIK, it is impossible to bypass the serialization issue that they are talking about here. The only thing you could do would be to simplify the objects that you put into BigMemory and use custom serialization / deserialization code to reduce the overheads.
While benchmarks might give you a rough idea of the overheads, the actual overheads will be very application specific. My advice would be:
Only go down this route if you know you need to. (You will be tying your application to a particular implementation technology.)
Be prepared for some intrusive changes to your application if the data involved isn't already managed as a cache.
Be prepared to spend some time in (re-)tuning your caching code to get good performance with BigMemory.
If your data structures are complicated, expect proportionately larger runtime overheads and tuning effort.
