4-ary heaps in Java

Binary heaps are commonly used in e.g. priority queues. The basic idea is that of an incomplete heap sort: you keep the data sorted "just enough" to get out the top element quickly.
While 4-ary heaps are theoretically worse than binary heaps, they also have some benefits. For example, they require fewer heap restructuring operations (as the heap is much shallower), while obviously needing more comparisons at each level. But (and that is probably their main benefit?) they may have better CPU cache locality. So some sources say that 3-ary and 4-ary heaps outperform both Fibonacci and binary heaps in practice.
They should not be much harder to implement; the additional children just mean a few extra if branches.
Has anyone experimented with 4-ary heaps (and 3-ary) for priority queues and done some benchmarking?
In Java you never know whether they are faster or slower until you have benchmarked them extensively.
And from all I've found via Google, it may be quite language and use case dependent. Some sources say they found 3-ary heaps to perform best for them.
Some more points:
PriorityQueue obviously is a binary heap, but the class also lacks bulk loading and bulk repair support, as well as a replaceTopElement operation, all of which can make a huge difference. Bulk loading, for example, is O(n) instead of O(n log n); bulk repair is essentially the same thing after adding a larger set of candidates, and tracking which parts of the heap are invalid can be done with a single integer. replaceTopElement is much cheaper than poll + add (just consider how a poll is implemented: it replaces the top element with the very last one, then repairs the heap).
While heaps are of course popular for complex objects, the priority is often an integer or double value. It's not as if we are comparing strings here; usually it is a (primitive) priority.
PQs are often used just to get the top k elements. For example, A*-search can terminate when the goal is reached; all the less good paths are then discarded, so the queue is never completely emptied. In a 4-ary heap, there is less order: approximately half as much (half as many parent nodes). So it imposes less order on the elements that are not needed. (This of course differs if you intend to empty your heap completely, e.g. because you are doing heap sort.)
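The top-k pattern can be illustrated with the standard PriorityQueue (a plain sketch, not an A* implementation; the elements left in the heap are simply discarded without ever being fully ordered):

```java
import java.util.PriorityQueue;

public class TopKDemo {
    // Take the k smallest elements; the order of the rest of the heap
    // is never needed and never fully established.
    public static double[] topK(double[] data, int k) {
        PriorityQueue<Double> pq = new PriorityQueue<>();
        for (double d : data) pq.add(d);
        double[] out = new double[k];
        for (int i = 0; i < k; i++) out[i] = pq.poll();
        return out; // remaining elements are simply discarded
    }

    public static void main(String[] args) {
        double[] best = topK(new double[]{4, 1, 3, 2, 5}, 2);
        System.out.println(best[0] + " " + best[1]); // 1.0 2.0
    }
}
```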

As per @ErichSchubert's suggestion, I have taken the implementations from ELKI and modified them into a 4-ary heap. It was a bit tricky to get the indexing right, as a lot of the publications on 4-ary heaps use formulas for 1-indexed arrays?!?
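For reference, the 0-indexed arithmetic works out like this (a sketch; the 1-indexed convention many publications use puts the children of i at 4i-2 through 4i+1 instead):

```java
public class FourAryIndex {
    // 0-indexed 4-ary heap navigation.
    static int parent(int i)     { return (i - 1) / 4; }
    static int firstChild(int i) { return 4 * i + 1; }
    static int lastChild(int i)  { return 4 * i + 4; }

    public static void main(String[] args) {
        // Children of the root are 1..4; their parent is 0 again.
        System.out.println(firstChild(0) + ".." + lastChild(0)); // 1..4
        System.out.println(parent(4)); // 0
        System.out.println(parent(5)); // 1
    }
}
```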
Here are some early benchmark results, based on the ELKI unit test. 200000 Double objects are preallocated (to avoid measuring memory management too much) and shuffled.
As a warmup, 10 iterations are performed for each heap; for benchmarking, 100 iterations, but I'll probably try to scale this up further. 10-30 seconds isn't that reliable for benchmarking yet, and OTOH I should try to measure standard deviations, too.
In each iteration, the 200000 elements are added to the heap, then half of them are polled again. Yes, the workload could also be made more complex.
Here are the results:
My 4-ary DoubleMinHeap: 10.371
ELKI DoubleMinHeap: 12.356
ELKI Heap<Double>: 37.458
Java PriorityQueue<Double>: 45.875
So the difference between the 4-ary heap (probably not yet L1 cache-aligned!) and the ELKI heap for primitive doubles is not too big. Well, 10%-20% or so; it could be worse.
The difference between a heap for primitive doubles and a heap for Double objects is much larger. And the ELKI Heap is indeed quite clearly faster than the Java PriorityQueue (but that one seems to have a high variance).
There was a slight "bug" in ELKI, though: at least the primitive heaps did not use the bulk loading code yet. It's there, it's just not being used, as every element added repairs the heap immediately instead of delaying this until the next poll(). I fixed this for my experiments, essentially by removing a few lines and adding one ensureValid() call. Furthermore, I don't have a 4-ary object heap yet, and I haven't included ELKI's DoubleObjectMinHeap yet... quite a lot to benchmark, and I'll probably give Caliper a try for that.

I've not benchmarked this myself, but I have a few relevant points to make.
Firstly, note that the standard Java implementation of PriorityQueue uses a binary heap:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/PriorityQueue.java
It is plausibly the case that, despite the cache locality benefit of n-ary heaps, binary heaps are still the best solution on average. Below are some slightly hand-wavy reasons why this might be the case:
For most interesting objects, comparison costs are probably much more significant than cache locality effects in the heap data structure itself. n-ary heaps require more comparisons. This probably is enough on its own to outweigh any cache locality effect in the heap itself.
If you were simply making a heap of numbers in place (i.e. backed by an array of ints or doubles) then I can see that the cache locality would be a worthwhile benefit. But this isn't the case: usually you will have a heap of object references. Cache locality on the object references themselves is then less useful, since each comparison will require following at least one extra reference to examine the referenced object and its fields.
The common case for priority heaps is probably quite a small heap. If you are hitting it often enough to care about it from a performance perspective, it's probably all in the L1 cache anyway. So there is no cache locality benefit for the n-ary heap.
It's easier to handle a binary heap with bitwise ops. Sure it's not a big advantage, but every little helps....
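Concretely, the bitwise navigation for a 0-indexed binary heap looks like this (shifts in place of multiplication and division):

```java
public class BinaryHeapBitops {
    // Binary heap navigation with shifts instead of mul/div.
    static int parent(int i) { return (i - 1) >>> 1; }
    static int left(int i)   { return (i << 1) + 1; }
    static int right(int i)  { return (i << 1) + 2; }

    public static void main(String[] args) {
        System.out.println(parent(9) + " " + left(4) + " " + right(4)); // 4 9 10
    }
}
```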
Simpler algorithms are generally faster than more complex ones, all else being equal, simply because of a lower constant overhead. You get benefits like lower instruction cache usage, higher likelihood of the compiler being able to find smart optimisations etc. Again this works in favour of the binary heap.
Obviously you'd need to do your own benchmarks on your own data before coming to a real conclusion about which performs best (and whether the difference is big enough to care about, which I personally doubt....)
EDIT
Also, I did write a priority heap implementation using an array of primitive keys that may be of interest, given that the original poster mentioned primitive keys in the comments:
https://github.com/mikera/mikera/blob/master/src/main/java/mikera/util/RankedQueue.java
This could probably be hacked into an n-ary version for benchmarking purposes relatively easily if anyone was interested in running a test.

I have not benchmarked 4-ary heaps yet. I'm currently trying to optimize our own heap implementations, and I'm trying 4-ary heaps there, too. And you are right: we will need to benchmark this carefully, as it is easy to get misled by implementation differences, and Hotspot optimization will heavily affect the results. Plus, small heaps will probably show different performance characteristics than large heaps.
The Java PriorityQueue is a very simple heap implementation, but that means Hotspot will optimize it well. It's not bad at all: most people would implement a worse heap. But, for example, it indeed does not do efficient bulk loads or bulk adds (bulk repairs). However, in my experiments it was hard to consistently beat this implementation even in simulations with repeated inserts, unless you go for really large heaps. Furthermore, in many situations it pays off to replace the top element in the heap instead of poll() + add(); this is not supported by Java's PriorityQueue.
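To illustrate why replacing the top can beat poll() + add(), here is a sketch on a minimal, hypothetical array-backed double min-heap (not ELKI's or the JDK's actual implementation): replaceTop does a single sift-down, where poll() + add() does a sift-down plus a sift-up.

```java
public class ReplaceTopDemo {
    private final double[] heap;
    private int size;

    public ReplaceTopDemo(int capacity) { heap = new double[capacity]; }

    public void add(double v) {
        int i = size++;
        heap[i] = v;
        while (i > 0) {                       // sift up
            int parent = (i - 1) >> 1;
            if (heap[parent] <= heap[i]) break;
            double t = heap[parent]; heap[parent] = heap[i]; heap[i] = t;
            i = parent;
        }
    }

    public double peek() { return heap[0]; }

    // Cheaper than poll() + add(): one sift-down instead of
    // a sift-down (for poll) plus a sift-up (for add).
    public double replaceTop(double v) {
        double old = heap[0];
        heap[0] = v;
        siftDown(0);
        return old;
    }

    private void siftDown(int i) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, smallest = i;
            if (l < size && heap[l] < heap[smallest]) smallest = l;
            if (r < size && heap[r] < heap[smallest]) smallest = r;
            if (smallest == i) return;
            double t = heap[smallest]; heap[smallest] = heap[i]; heap[i] = t;
            i = smallest;
        }
    }

    public static void main(String[] args) {
        ReplaceTopDemo h = new ReplaceTopDemo(8);
        h.add(3.0); h.add(1.0); h.add(2.0);
        double old = h.replaceTop(5.0);
        System.out.println(old + " " + h.peek()); // 1.0 2.0
    }
}
```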
Some of the performance gains in ELKI (and I've seen you are an ELKI user) across versions are actually due to improved heap implementations. But it's an up and down, it's hard to predict which heap variation performs best across real workloads. The key benefit of our implementation is probably to have a "replaceTopElement" function. You can inspect the code here:
SVN de.lmu.ifi.dbs.elki.utilities.heap package
You will notice we have a whole set of heaps there. They are optimized for different stuff, and will need some more refactoring. A number of these classes are actually generated from templates, similar to what GNU Trove does. The reason is that Java can be quite costly when managing boxed primitives, so it does pay off to have primitive versions. (yes, there are plans to split this out into a separate library. It's just not of high priority.)
Note that ELKI deliberately does not endorse the java.util collections API. We have found in particular the java.util.Iterator interface to be quite costly, and thus try to encourage people to use C++-style iterators throughout ELKI:
for (Iter iter = ids.iter(); iter.valid(); iter.advance()) { ... }
This often saves a lot of unnecessary object creations compared to the java.util.Iterator API. Plus, these iterators can have multiple (and primitive) value getters, whereas Iterator.next() is a mixture of a getter and the advance operation.
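Such an iterator might look like the following sketch (the DoubleIter interface here is invented for illustration, not ELKI's actual API): get() is a plain primitive getter, separate from advance(), and no boxing happens per element.

```java
public class CursorDemo {
    // A minimal C++-style iterator over a primitive double array.
    interface DoubleIter {
        boolean valid();
        void advance();
        double get(); // primitive getter, separate from advancing
    }

    static DoubleIter iter(double[] data) {
        return new DoubleIter() {
            private int pos = 0;
            public boolean valid() { return pos < data.length; }
            public void advance()  { pos++; }
            public double get()    { return data[pos]; }
        };
    }

    static double sum(double[] data) {
        double s = 0;
        // One iterator object for the whole loop, no boxing per element.
        for (DoubleIter it = iter(data); it.valid(); it.advance()) {
            s += it.get();
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new double[]{1, 2, 3})); // 6.0
    }
}
```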
Ok, I have drifted off too much now, back to the topic of 4-ary heaps:
If you intend to try out 4-ary heaps, I suggest you start with the ObjectHeap class there.
Update: I've been microbenchmarking, but the results so far are inconclusive. It's hard to beat PriorityQueue consistently. In particular, bulk loading and bulk repairs do not seem to gain anything in my benchmark - probably they cause HotSpot to optimize less, or to de-optimize at some point. As often, simpler Java code is faster than complex logic. So far, 4-ary heaps without bulk loading seem to work best. I haven't tried 5-ary yet. 3-ary heaps are about equal with 4-ary, and the memory layout of 4-ary is a bit nicer. I'm also considering a heap-of-heaps approach to save on array resizing. But I expect that the increased code complexity means it will run slower in practice.

Related

Java: is it faster to create new array or set all elements of current array to 0

In a performance-critical part of my code, I need to clear an int array buffer by setting it back to all 0s.
Should I do buffer = new int[size] or Arrays.fill(buffer, 0)? The first seems to be faster in my tests, but maybe it will slow down eventually because of garbage collection. I don't have confidence in my own tests (because of stuff like compiler optimization), so I am asking it here.
If it matters, buffer will be size of about 300, and I need to clear buffer when it fills up, so after 300 iterations of my main loop.
I read More efficient to create new array or reset array but it doesn't specifically say for larger arrays. Also it is for Objects, not ints, which I think could matter.
Is it faster to create a new array or set all elements of the current array to 0?
There is no simple answer. The JVM can allocate a default-initialized array faster than fill(array, 0) can fill an array of the same size. But the flip side is that there are GC-related overheads that are difficult to quantify:
The GC costs are typically proportional to amount of reachable data. For non-reachable objects, the cost is essentially the cost of zeroing memory.
The GC costs / efficiency will depend on the heap size, and on how full it is.
The GC overheads also depend on the lifetime of the objects. For example, a long-lived object will typically be tenured to the "old" generation and GC'd less often. But the flip side is that write barriers may make array writes slower.
Different GC's perform differently.
Different Java JIT compilers, etc perform differently.
And so on.
The bottom line is that it is not possible to give a clear answer without knowing ... more information than you can provide to create a valid model of the behavior.
Likewise, artificial benchmarks are liable to involve making explicit or implicit choices about various of the above (overt and hidden) variables. The result is liable to be that the benchmark results don't reflect real performance in your application.
So the best answer is to measure and compare the performance in the context of your actual application. In other words:
Get your application working
Write a benchmark for measuring your application's performance with realistic test data / inputs
Use the benchmark to compare the performance of the two alternatives in the context of your application.
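For reference, the two idioms under comparison look like this (any serious measurement of them should use a proper harness such as JMH rather than hand-rolled timing):

```java
import java.util.Arrays;

public class ClearBufferDemo {
    // Alternative 1: allocate a fresh, default-initialized (all-zero) array.
    static int[] clearByNew(int[] buffer) {
        return new int[buffer.length];
    }

    // Alternative 2: zero the existing array in place, creating no garbage.
    static int[] clearByFill(int[] buffer) {
        Arrays.fill(buffer, 0);
        return buffer;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        System.out.println(clearByFill(a) == a);     // true: same array, zeroed
        System.out.println(clearByNew(a).length);    // 3: fresh zeroed array
    }
}
```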
(Your question has the smell of premature optimization about it. You should be able to put off deciding which of these alternatives is better ... until you have the tools to make a well-founded decision.)

Why is java HashMap resize or rehash not taking gradual approach like Redis

I am just wondering why the JDK HashMap rehashing process does not take the gradual approach that Redis uses. Though the rehash calculation of the JDK HashMap is quite elegant and effective, it will still take noticeable time when the original HashMap contains quite a number of entries.
I am not an experienced Java user, so I always suppose that there must be some consideration by the Java designers that is beyond the limit of my cognitive capability.
A gradual rehash like Redis's can effectively distribute the workload across each put, delete or get on the HashMap, which could significantly reduce the resize/rehashing time.
I have also compared the two hash methods, which to my mind don't prevent the JDK from doing a gradual rehash.
I hope someone can give me a clue or some inspiration. Thanks a lot in advance.
If you think about the costs and benefits of incremental rehashing for something like HashMap, it turns out that the costs are not insignificant, and the benefits are not as great as you might like.
An incrementally rehashing HashMap:
Uses 50% more memory on average, because it needs to keep both the old table and new table around during the incremental rehash; and
Has a somewhat higher computational cost per operation. Also:
The rehashing is still not entirely incremental, because allocating the new hash table array has to be done all at once; so
There are no improvements in the asymptotic complexity of any operation. And finally:
Almost nothing that would really need incremental rehashing can be implemented in Java at all, due to the unpredictable GC pauses, so why bother?
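To make the trade-offs concrete, here is a toy sketch of Redis-style incremental rehashing (a chained hash set of ints, invented for illustration; this is not how java.util.HashMap works). Note the two costs from the list above: both tables coexist during migration, and lookups must check both.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalRehashSet {
    private List<Integer>[] oldTable;   // non-null only while rehashing
    private List<Integer>[] newTable;
    private int migrateIndex;           // next old bucket to migrate

    @SuppressWarnings("unchecked")
    public IncrementalRehashSet() { newTable = new List[4]; }

    public void add(int v) {
        step();                          // migrate a little on each operation
        if (contains(v)) return;
        bucket(newTable, v).add(v);
        // (a real implementation would trigger startResize() on load factor)
    }

    public boolean contains(int v) {
        step();
        // While rehashing, an element may still live in the old table.
        if (oldTable != null && bucket(oldTable, v).contains(v)) return true;
        return bucket(newTable, v).contains(v);
    }

    @SuppressWarnings("unchecked")
    public void startResize() {
        oldTable = newTable;
        newTable = new List[oldTable.length * 2]; // still one big allocation
        migrateIndex = 0;
    }

    // Move a single old bucket into the new table.
    private void step() {
        if (oldTable == null) return;
        if (migrateIndex < oldTable.length) {
            for (int v : bucketAt(oldTable, migrateIndex)) bucket(newTable, v).add(v);
            bucketAt(oldTable, migrateIndex).clear();
            migrateIndex++;
        }
        if (migrateIndex >= oldTable.length) oldTable = null; // rehash finished
    }

    private List<Integer> bucket(List<Integer>[] t, int v) {
        return bucketAt(t, Math.floorMod(v, t.length));
    }

    private List<Integer> bucketAt(List<Integer>[] t, int i) {
        if (t[i] == null) t[i] = new ArrayList<>();
        return t[i];
    }

    public static void main(String[] args) {
        IncrementalRehashSet s = new IncrementalRehashSet();
        for (int i = 0; i < 10; i++) s.add(i);
        s.startResize();
        boolean ok = true;
        for (int i = 0; i < 10; i++) ok &= s.contains(i); // visible mid-migration
        System.out.println(ok); // true
    }
}
```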

Java collections faster than c++ containers?

I was reading the comments on this answer and I saw this quote.
Object instantiation and object-oriented features are blazing fast to use (faster than C++ in many cases) because they're designed in from the beginning. and Collections are fast. Standard Java beats standard C/C++ in this area, even for most optimized C code.
One user (with really high rep I might add) boldly defended this claim, stating that
heap allocation in java is better than C++'s
and added this statement defending the collections in java
And Java collections are fast compared to C++ collections due largely to the different memory subsystem.
So my question is can any of this really be true, and if so why is java's heap allocation so much faster.
This sort of statement is ridiculous; people making it are either incredibly uninformed, or incredibly dishonest. In particular:
The speed of dynamic memory allocation in the two cases will depend on the pattern of dynamic memory use, as well as on the implementation. It is trivial for someone familiar with the algorithms used in both cases to write a benchmark proving whichever one he wanted to be faster. (Thus, for example, programs using large, complex graphs that are built, then torn down and rebuilt, will typically run faster under garbage collection. As will programs that never use enough dynamic memory to trigger the collector. Programs using few, large, long-lived allocations will often run faster with manual memory management.)
When comparing the collections, you have to consider what is in the collections. If you're comparing large vectors of double, for example, the difference between Java and C++ will likely be slight, and could go either way. If you're comparing large vectors of Point, where Point is a value class containing two doubles, C++ will probably blow Java out of the water, because it uses pure value semantics (with no additional dynamic allocation), whereas Java needs to dynamically allocate each Point (and no dynamic allocation is always faster than even the fastest dynamic allocation). If the Point class in Java is correctly designed to act as a value (and thus immutable, like java.lang.String), then doing a translation on the Point in a vector will require a new allocation for every Point; in C++, you could just assign.
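In Java, the usual workaround for this is a structure-of-arrays layout. A sketch (the Point class and counts are invented here for illustration) of the allocation difference being described:

```java
public class PointLayoutDemo {
    // Immutable value-like class: every element needs its own allocation.
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) {
        int n = 1000;

        // "Array of structs" in Java: n Point objects + 1 reference array
        // = n + 1 allocations, with the objects scattered on the heap.
        Point[] aos = new Point[n];
        for (int i = 0; i < n; i++) aos[i] = new Point(i, -i);

        // "Structure of arrays": 2 allocations total, contiguous memory,
        // closer to what a C++ std::vector<Point> gives for free.
        double[] xs = new double[n];
        double[] ys = new double[n];
        for (int i = 0; i < n; i++) { xs[i] = i; ys[i] = -i; }

        System.out.println(aos[3].x == xs[3]); // true
    }
}
```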
Much depends on the optimizer. In Java, the optimizer works with perfect knowledge of the actual use cases in this particular run of the program, and perfect knowledge of the actual processor it is running on in this run. In C++, the optimizer must work with data from a profiling run, which will never correspond exactly to any one run of the program, and the optimizer must (usually) generate code that will run (and run quickly) on a wide variety of processor versions. On the other hand, the C++ optimizer may take significantly more time analysing the different paths (and effective optimization can require a lot of CPU); the Java optimizer has to be fairly quick.
Finally, although not relevant to all applications, C++ can be single threaded. In which case, no locking is needed in the allocator, which is never the case in Java.
With regard to the two points above: C++ can use more or less the same algorithms as Java in its heap allocator. I've used C++ programs where the ::operator delete() function was empty, and the memory was garbage collected. (If your application allocates lots of short-lived, small objects, such an allocator will probably speed things up.) And as for the second: the really big advantage C++ has is that its memory model doesn't require everything to be dynamically allocated. Even if allocation in Java takes only a tenth of the time it would take in C++ (which could be the case, if you only count the allocation, and not the time needed for the collector sweeps), with large vectors of Point, as above, you're comparing two or three allocations in C++ with millions of allocations in Java.
And finally: "why is Java's heap allocation so much faster?" It isn't, necessarily, if you amortise the time for the collection phases. The time for the allocation itself can be very cheap, because Java (or at least most Java implementations) use a relocating collector, which results in all of the free memory being in a single contiguous block. This is at least partially offset by the time needed in the collector: to get that contiguity, you've got to move data, which means a lot of copying. In most implementations it also means an additional indirection in the pointers, and a lot of special logic to avoid issues when one thread has the address in a register, or such.
Your questions don't have concrete answers. For example, C++ does not define memory management at all. It leaves allocation details up to the library implementation. Therefore, within the bounds of C++, a given platform may have a very slow heap allocation scheme, and Java would certainly be faster if it bypasses that. On another platform, memory allocations may be blazing fast, outperforming Java. As James Kanze pointed out, Java also places very little constraints on memory management (e.g. even the GC algorithm is entirely up to the JVM implementor). Because Java and C++ do not place constraints on memory management, there is no concrete answer to that question. C++ is purposefully open about underlying hardware and kernel functions, and Java is purposefully open about JVM memory management. So the question becomes very fuzzy.
You may find that some operations are faster in Java, and some not. You never know until you try, however:
In practice, the real differences lie in your higher level algorithms and implementations. For all but the most absolutely performance critical applications, the differences in performance of identical data structures in different languages is completely negligible compared to the performance characteristics of the algorithm itself. Concentrate on optimizing your higher level implementations. Only after you have done so, and after you have determined that your performance requirements are not being met, and after you have benchmarked and found (unlikely) that your bottleneck is in container implementations, should you start to think of things like this.
In general, as soon as you find yourself thinking or reading about C++ vs. Java issues, stop and refocus on something productive.
The Java heap is faster because (simplified) all you need to do to allocate is increase the heap top pointer (just like on the stack). This is possible because the heap is periodically compacted. So the price you pay for speed is:
Periodic GC pauses for heap compacting
Increased memory usage
There is no free cheese... So while collection operations may be fast, this is amortized by the overall slowdown during GC work.
While I am a fan of Java, it is worth noting that C++ supports allocation of objects on the stack which is faster than heap allocation.
If you use C++ efficiently, with all its various ways of doing the same thing, it will be faster than Java (even if it takes you longer to find that optimal combination).
If you program in C++ as you would in Java, e.g. everything on the heap, all methods virtual, with lots of runtime checks which don't do anything and can be optimised away dynamically, it will be slower. Java has optimised these things further because a) they are the only thing Java does, b) they can be optimised dynamically more efficiently, and c) Java has fewer features and side effects, so it is easier for the optimiser to get decent speeds.
and Collections are fast. Standard Java beats standard C/C++ in this area, even for most optimized C code.
This may be true for particular collections, but most certainly isn't true for all collections in all usage patterns.
For instance, a java.util.HashMap will outperform a std::map, because the latter is required to be sorted. That is, the fastest Map in the Java standard library is faster than the fastest Map in the C++ one (at least prior to C++11, which added std::unordered_map).
On the other side, a std::vector<int> is far more efficient than a java.util.ArrayList<Integer> (due to type erasure, you can't use a java.util.ArrayList<int>, and therefore end up with about 4 times the memory consumption, possibly poorer cache locality, and correspondingly slower iteration).
In short, like most sweeping generalizations, this one doesn't always apply. However, neither would the opposite assertion (that Java is always slower than C++). It really depends on the details, such as how you use the collection, or even which versions of the languages you compare.
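A small sketch of the boxing overhead being described (illustrative only; the exact 4x figure depends on JVM, reference size, and object header layout):

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    // int[]: 4 bytes per element, stored contiguously.
    static long sumPrimitive(int[] a) {
        long s = 0;
        for (int v : a) s += v;
        return s;
    }

    // ArrayList<Integer>: a reference (4-8 bytes) per slot plus a separate
    // ~16-byte Integer object per element, scattered on the heap.
    static long sumBoxed(List<Integer> a) {
        long s = 0;
        for (Integer v : a) s += v; // unboxing on every access
        return s;
    }

    public static void main(String[] args) {
        int[] prim = {1, 2, 3, 4};
        List<Integer> boxed = new ArrayList<>(List.of(1, 2, 3, 4));
        System.out.println(sumPrimitive(prim) + " " + sumBoxed(boxed)); // 10 10
    }
}
```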

Is Java Native Memory Faster than the heap?

I'm exploring options to help my memory-intensive application, and in doing so I came across Terracotta's BigMemory. From what I gather, they take advantage of non-garbage-collected, off-heap "native memory," and apparently this is about 10x slower than heap-storage due to serialization/deserialization issues. Prior to reading about BigMemory, I'd never heard of "native memory" outside of normal JNI. Although BigMemory is an interesting option that warrants further consideration, I'm intrigued by what could be accomplished with native memory if the serialization issue could be bypassed.
Is Java native memory faster (I think this entails ByteBuffer objects?) than traditional heap memory when there are no serialization issues (for instance if I am comparing it with a huge byte[])? Or do the vagaries of garbage collection, etc. render this question unanswerable? I know "measure it" is a common answer around here, but I'm afraid I would not set up a representative test as I don't yet know enough about how native memory works in Java.
Direct memory is faster when performing IO because it avoids one copy of the data. However, for 95% of applications you won't notice the difference.
You can store data in direct memory, however it won't be faster than storing data in POJOs (or as safe, readable, or maintainable). If you are worried about GC, try creating your objects (they have to be mutable) in advance and reusing them without discarding them. If you don't discard your objects, there is nothing to collect.
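A sketch of the preallocate-and-reuse pattern being suggested (the Event class and pool size are invented for illustration):

```java
public class ReuseDemo {
    // Mutable on purpose, so instances can be recycled.
    static final class Event {
        long timestamp;
        double value;
        void set(long t, double v) { timestamp = t; value = v; }
    }

    public static void main(String[] args) {
        // Allocate the pool once, up front.
        Event[] pool = new Event[4];
        for (int i = 0; i < pool.length; i++) pool[i] = new Event();

        // Hot loop: overwrite pooled objects instead of allocating,
        // so nothing becomes garbage for the GC to collect.
        for (int i = 0; i < 100; i++) {
            Event e = pool[i % pool.length];
            e.set(i, i * 0.5);
        }
        System.out.println(pool[3].timestamp); // 99
    }
}
```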
Is Java native memory faster (I think this entails ByteBuffer objects?) than traditional heap memory when there are no serialization issues (for instance if I am comparing it with a huge byte[])?
Direct memory can be faster than using a byte[] if you work with non-byte types like int, as it can read/write the whole four bytes without turning the data into individual bytes.
However, it is slower than using POJOs, as it has to bounds-check every access.
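For illustration, here is what reading and writing whole ints in direct (off-heap) memory looks like with ByteBuffer (buffer size chosen arbitrarily):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // Off-heap storage: the Java heap (and thus the GC) never sees
        // these 4 KB of payload.
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);

        // Whole ints at a time, no manual byte shuffling --
        // but every absolute access is bounds-checked.
        buf.putInt(0, 42);
        buf.putInt(4, -7);
        System.out.println(buf.getInt(0) + " " + buf.getInt(4)); // 42 -7
    }
}
```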
Or do the vagaries of garbage collection, etc. render this question unanswerable?
The speed has nothing to do with the GC. The GC only matters when creating or discarding objects.
BTW: If you minimise the number of objects you discard and increase your Eden size, you can prevent even a minor collection from occurring for a long time, e.g. a whole day.
The point of BigMemory is not that native memory is faster, but rather, it's to reduce the overhead of the garbage collector having to go through the effort of tracking down references to memory and cleaning it up. As your heap size increases, so do your GC intervals and CPU commitment. Depending upon the situation, this can create a sort of "glass ceiling" where the Java heap gets so big that the GC turns into a hog, taking up huge amounts of processor power each time the GC kicks in. Also, many GC algorithms require some level of locking that means nobody can do anything until that portion of the GC reference tracking algorithm finishes, though many JVM's have gotten much better at handling this. Where I work, with our app server and JVM's, we found that the "glass ceiling" is about 1.5 GB. If we try to configure the heap larger than that, the GC routine starts eating up more than 50% of total CPU time, so it's a very real cost. We've determined this through various forms of GC analysis provided by our JVM vendor.
BigMemory, on the other hand, takes a more manual approach to memory management. It reduces the overhead and sort of takes us back to having to do our own memory cleanup, as we did in C, albeit in a much simpler approach akin to a HashMap. This essentially eliminates the need for a traditional garbage collection routine, and as a result, we eliminate that overhead. I believe that the Terracotta folks used native memory via a ByteBuffer as it's an easy way to get out from under the Java garbage collector.
The following whitepaper has some good info on how they architected BigMemory and some background on the overhead of the GC: http://www.terracotta.org/resources/whitepapers/bigmemory-whitepaper.
I'm intrigued by what could be accomplished with native memory if the serialization issue could be bypassed.
I think that your question is predicated on a false assumption. AFAIK, it is impossible to bypass the serialization issue that they are talking about here. The only thing you could do would be to simplify the objects that you put into BigMemory and use custom serialization / deserialization code to reduce the overheads.
While benchmarks might give you a rough idea of the overheads, the actual overheads will be very application specific. My advice would be:
Only go down this route if you know you need to. (You will be tying your application to a particular implementation technology.)
Be prepared for some intrusive changes to your application if the data involved isn't already managed as a cache.
Be prepared to spend some time in (re-)tuning your caching code to get good performance with BigMemory.
If your data structures are complicated, expect proportionately larger runtime overheads and tuning effort.

Optimizing processing and management of large Java data arrays

I'm writing some pretty CPU-intensive, concurrent numerical code that will process large amounts of data stored in Java arrays (e.g. lots of double[100000]s). Some of the algorithms might run millions of times over several days so getting maximum steady-state performance is a high priority.
In essence, each algorithm is a Java object that has a method API something like:
public double[] runMyAlgorithm(double[] inputData);
or alternatively a reference could be passed to the array to store the output data:
public void runMyAlgorithm(double[] inputData, double[] outputData);
Given this requirement, I'm trying to determine the optimal strategy for allocating / managing array space. Frequently the algorithms will need large amounts of temporary storage space. They will also take large arrays as input and create large arrays as output.
Among the options I am considering are:
Always allocate new arrays as local variables whenever they are needed (e.g. new double[100000]). Probably the simplest approach, but will produce a lot of garbage.
Pre-allocate temporary arrays and store them as final fields in the algorithm object - big downside would be that this would mean that only one thread could run the algorithm at any one time.
Keep pre-allocated temporary arrays in ThreadLocal storage, so that a thread can use a fixed amount of temporary array space whenever it needs it. ThreadLocal would be required since multiple threads will be running the same algorithm simultaneously.
Pass around lots of arrays as parameters (including the temporary arrays for the algorithm to use). Not good since it will make the algorithm API extremely ugly if the caller has to be responsible for providing temporary array space....
Allocate extremely large arrays (e.g. double[10000000]) but also provide the algorithm with offsets into the array so that different threads will use a different area of the array independently. Will obviously require some code to manage the offsets and allocation of the array ranges.
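Option 3 from the list above might be sketched like this (ScratchSpaceDemo, TEMP_SIZE, and the algorithm body are placeholders invented here, not the questioner's actual code):

```java
public class ScratchSpaceDemo {
    private static final int TEMP_SIZE = 100_000;

    // Per-thread scratch space: each thread lazily gets its own buffer,
    // so concurrent runs of the algorithm never share temporary storage.
    private static final ThreadLocal<double[]> SCRATCH =
            ThreadLocal.withInitial(() -> new double[TEMP_SIZE]);

    public static double runMyAlgorithm(double[] input) {
        double[] temp = SCRATCH.get();   // same array on every call in this thread
        double sum = 0;                  // stand-in for the real numeric work
        for (int i = 0; i < input.length; i++) {
            temp[i] = input[i] * 2;
            sum += temp[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runMyAlgorithm(new double[]{1, 2, 3})); // 12.0
    }
}
```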
Any thoughts on which approach would be best (and why)?
What I have noticed when working with memory in Java is the following. If your memory usage patterns are simple (mostly 2-3 types of memory allocations) you can usually do better than the default allocator. You can either preallocate a pool of buffers at application startup and use them as needed, or go the other route (allocate a huge array at the beginning and hand out pieces of it when needed). In effect you are writing your own memory allocator. But chances are you will do a worse job than Java's default allocator.
I would probably try to do the following: standardize the buffer sizes and allocate normally. That way, after a while, the only memory allocations/deallocations will be in fixed sizes, which will greatly help the garbage collector run fast. Another thing I would do is make sure, at algorithm design time, that the total memory needed at any one point does not exceed something like 80-85% of the machine's memory, in order not to trigger a full collection inadvertently.
Apart from those heuristics, I would test the hell out of any solution I picked and see how it works in practice.
Allocating big arrays is relatively cheap for the GC. You tend to use up your Eden space quickly, but the cost is largely per object. I suggest you write the code in the simplest manner possible and optimise it later, after profiling the application. A double[100000] is less than a MB, and you can fit over a thousand in a GB.
Memory is a lot cheaper than it used to be. An 8 GB server costs about £850; a 24 GB server about £1,800 (and a 24 GB machine could hold 24,000 double[100000] arrays). You may find that using a large heap size, or even a large Eden size, gives you the efficiency you want.
