Reuse HashMaps in an array - Java

I am holding an array of HashMaps. I want to get maximum performance and minimal memory usage, so I would like to reuse the HashMaps inside the array.
So when a HashMap in the array is no longer needed and I want to add a new HashMap to the array, I just clear the existing HashMap and use put() to add the new values.
I also need to copy values back when I retrieve a HashMap from the array.
I am not sure whether this is better than creating a new HashMap() every time.
Which is better?
UPDATE
I need to cycle through about 50 million HashMaps, and each HashMap has about 10 key-value pairs. If the size of the array is 20,000, I only need 20,000 HashMaps instead of 50 million calls to new HashMap().
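For reference, a rough sketch of the reuse pattern described above (the ring-buffer layout and the placeholder fill/consume steps are only for illustration):

// Rough sketch: the array holds a fixed pool of 20,000 HashMaps; a "new" map
// is obtained by clearing an old slot and repopulating it with put(), instead
// of calling new HashMap() for each of the 50 million maps.
import java.util.HashMap;
import java.util.Map;

public class ReuseDemo {
    private static final int SLOTS = 20000;

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        Map<String, String>[] ring = new HashMap[SLOTS];
        for (int i = 0; i < SLOTS; i++) {
            ring[i] = new HashMap<String, String>();       // allocated once
        }

        for (long task = 0; task < 50000000L; task++) {
            Map<String, String> slot = ring[(int) (task % SLOTS)];
            slot.clear();                                  // recycle the old map
            slot.put("key", "value-" + task);              // repopulate (~10 entries in reality)
            // ... consume the map here before the slot is reused ...
        }
    }
}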

Be very careful with this approach. Although it may be better performance-wise to recycle objects, you may get into trouble by modifying the same reference several times, as illustrated in the following example:
public class A {
    public int counter = 0;

    public static void main(String[] args) {
        A a = new A();
        a.counter = 5;
        A b = a;        // I want to save a into b and then recycle a for other purposes
        a.counter = 10; // now b.counter is also 10
    }
}
I'm sure you get the point; however, if you are not copying around references to the HashMaps in the array, it should be OK.

Doesn't matter. Premature optimization. Come back when you have profiler results telling you where you're actually spending most of your memory or CPU cycles.

It is entirely unclear why re-using maps in this manner would improve performance and/or memory usage. For all we know, it might make no difference, or might have the opposite effect.
You should do whatever results in the most readable code, then profile, and finally optimize the parts of the code that the profiler highlights as bottlenecks.

In most cases you will not feel any difference.
Typically, the number of map entries is MUCH higher than the number of map objects. When you populate a map, you create an instance of Map.Entry per entry. This is a relatively lightweight object, but you still invoke new. The map itself without data is lightweight too, so you will not get any benefit from these tricks unless your map is supposed to hold only 1-2 entries.
Bottom line:
Forget about premature optimization. Implement your application. If you have performance problems, profile the application, find the bottlenecks and fix them. I can 99% guarantee you that the bottleneck will never be the new HashMap() call.

I think what you want is an object pool kind of thing, where you get an object (in your case, a HashMap) from the object pool, perform your operations, and when the object is no longer needed you put it back in the pool.
Check out the Object Pool design pattern; for further reference see this link:
http://sourcemaking.com/design_patterns/object_pool
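As a rough illustration of the pattern (this sketch is not from the linked article, and a real pool would also cap its size and handle threading):

// A minimal, illustrative object pool for HashMaps: borrow a map, use it,
// and return it so it can be recycled instead of re-allocated. Not thread-safe.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class HashMapPool<K, V> {
    private final Deque<Map<K, V>> free = new ArrayDeque<Map<K, V>>();

    public Map<K, V> borrow() {
        Map<K, V> map = free.poll();
        return map != null ? map : new HashMap<K, V>();
    }

    public void release(Map<K, V> map) {
        map.clear();       // recycle: the next borrower starts with an empty map
        free.push(map);
    }
}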

The problem you have is that most of the objects are the Map.Entry objects inside the HashMap. While you can recycle the HashMap itself (and its array), these are only a small portion of the objects. One way around this is to use FastMap from Javolution, which recycles everything and has support for managing the lifecycle (it's designed to minimise garbage this way).
I suspect the most efficient way is to use an EnumMap if possible (if you have a known set of key attributes), or POJOs, even if most fields are not used.
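For illustration, if the roughly 10 keys are known up front they can be modelled as an enum; the Attribute names below are invented:

// Sketch: with a fixed set of keys, an EnumMap stores values in a plain array
// indexed by the enum ordinal, so put() and clear() create no Map.Entry objects.
import java.util.EnumMap;

public class EnumMapDemo {
    enum Attribute { ID, NAME, PRICE, QUANTITY }   // hypothetical key attributes

    public static void main(String[] args) {
        EnumMap<Attribute, String> row = new EnumMap<Attribute, String>(Attribute.class);
        row.put(Attribute.ID, "42");
        row.put(Attribute.NAME, "widget");
        row.clear();                               // the backing array is reused
        row.put(Attribute.PRICE, "9.99");
    }
}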

There are a few problems with reusing HashMaps.
Even if the key and value data were to take no memory (shared from other places), the Map.Entry objects would dominate memory usage but not be reused (unless you did something a bit special).
Because of generational GC, having old objects point to new ones is generally expensive (and it is relatively difficult to see what's going on). This might not be an issue if you are keeping millions of these.
More complicated code is more difficult to optimise. So keep it simple, and then do the big optimisations, which probably involve changing the data structures.

Related

Is there a reason to use a mapping of string => index into a vector, instead of string => object?

If I have a collection of objects that I'd like to be able to look up by name, I could of course use a { string => object } map.
Is there ever a reason to use a vector of objects along with a { string => index into this vector } companion map, instead?
I've seen a number of developers do this over the years, and I've largely dismissed it as an indication that the developer is unfamiliar with maps, or is otherwise confused. But in recent days, I've started second-guessing myself, and I'm concerned I might be missing a potential optimization or something, though I can't for the life of me figure out what that could optimize.
There is one reason I can think of:
Besides looking up objects by name, sometimes you also want to iterate through all the objects as efficiently as possible. Using a map + vector can achieve this. You pay a very small penalty for accessing the vector via an index, but you can gain a big performance improvement by iterating the vector rather than the map (because the vector is in contiguous memory and more cache-friendly).
Of course you can do a similar thing using boost::multi_index, but that has some restrictions on the object itself.
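A rough Java sketch of the same pattern (ArrayList playing the role of the vector; the Widget type is invented):

// Sketch of the name -> index companion-map pattern: lookups go through the
// map, while bulk iteration runs over the contiguous list, which is more
// cache-friendly than walking the map's entries.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexedCollection {
    static class Widget {                          // placeholder element type
        final String name;
        Widget(String name) { this.name = name; }
    }

    private final List<Widget> items = new ArrayList<Widget>();
    private final Map<String, Integer> indexByName = new HashMap<String, Integer>();

    public void add(Widget w) {
        indexByName.put(w.name, items.size());
        items.add(w);
    }

    public Widget byName(String name) {
        Integer i = indexByName.get(name);         // the extra hop: map gives an index
        return i == null ? null : items.get(i);
    }

    public List<Widget> all() {
        return items;                              // iterate this in tight loops
    }
}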
I can think of at least a couple reasons:
For unrelated reasons you need to retain insertion order.
You want to have multiple maps pointing into the vector (different indexes).
Not all items in the vector need to be pointed to by a string.
There's no optimization. If you think about it, it might actually decrease performance (albeit by a few micronanoseconds). This is because the vector-based "solution" would require an extra step to find the object in the vector, while the non-vector-based solution doesn't have to do that.

Performance issues when HashMap is used as Cache

Case 1:
One HashMap with 100,000 entries
Case 2:
Two HashMaps with 50,000 entries each.
Which of the above cases will take more execution time and more memory? Or is there a significant difference between the two?
Is it feasible to replace one HashMap holding a large number of entries with two HashMaps each holding fewer entries?
You're better off with a single hash map.
Look-ups are very efficient in a hash map, and hash maps are designed to hold lots of elements. It'll be slower overall if you have to put something in place to search one map and then look in the other one when you don't find it in the first.
(There won't be much difference in memory usage either way.)
If it's currently too slow, have a check that your .hashCode() and .equals() are not inefficient.
The memory requirements should be similar for the two cases (since the HashMap storage is an array of Entries whose size is the capacity of the map, so two arrays of 50K would take the same space as one array of 100K).
The get() and put() methods should also have similar performance in both cases, since the calculation of the hash code of the key and the matching index is not affected by the size of the map. The only thing that may affect the performance of these operations is the
average number of keys mapped to the same index, and that should also be independent of the size of the map, since as the size grows, the Entries array grows, so the average number of Entries per index should stay the same.
On the other hand, if you use two smaller maps, you have to add logic to decide which map to use. You can try to search for a key in the first map and if not found, search in the second map. Or you can have a criteria that determines which map is used (for example, String keys starting in A to M will be stored in the first map, and the rest in the second map). Therefore, each operation on the map will have an additional overhead.
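Purely to make that extra overhead concrete, the routing logic for the two-map approach might look roughly like this (illustrative names only):

// Illustrative only: the dispatch layer the two-map approach would need.
// Every get/put pays for this extra decision on top of the normal hash lookup.
import java.util.HashMap;
import java.util.Map;

public class SplitCache<V> {
    private final Map<String, V> first = new HashMap<String, V>();    // e.g. keys starting A-M
    private final Map<String, V> second = new HashMap<String, V>();   // the rest

    private Map<String, V> pick(String key) {
        char c = Character.toUpperCase(key.charAt(0));
        return (c >= 'A' && c <= 'M') ? first : second;
    }

    public void put(String key, V value) {
        pick(key).put(key, value);
    }

    public V get(String key) {
        return pick(key).get(key);
    }
}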
Therefore, I would suggest using a single HashMap.
The performance difference between using one or two HashMaps should not matter much, so go for the simpler solution, the single HashMap.
But since you ask this question, I assume that you have performance problems. Often, using a HashMap as a cache is a bad idea, because it keeps a reference to the cached objects alive, thus basically disabling garbage collection for them. Consider redesigning your cache, for example using SoftReferences (a class in the standard Java API), which allow the garbage collector to collect your cached objects while still letting you reuse them as long as they have not been garbage collected yet.
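A minimal sketch of that SoftReference idea (the class and method names are made up):

// Sketch of a SoftReference-based cache: the GC may reclaim cached values under
// memory pressure, so get() must cope with entries whose referent has vanished.
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            map.remove(key);   // the value was collected (or never cached); drop the stale entry
        }
        return value;          // the caller recomputes/reloads on null
    }
}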
As everyone mentioned, you should be using one single hash map. If you are having trouble with 100k entries, then there is a major problem with your code.
Here are some tips on HashMap:
Don't use overly complex objects as keys (in my opinion, using a String as the key is as far as you should go for a HashMap).
If you do use a complex object as a key, make sure its equals and hashCode methods are as efficient as possible, as too much calculation within these methods can greatly reduce the efficiency of the HashMap.

Map clear vs null

I have a map that I use to store dynamic data that is discarded soon after it is created (i.e. it is consumed quickly). It responds to user interaction in the sense that when the user clicks a button the map is filled, the data is used to do some work, and then the map is no longer needed.
So my question is: what's the better approach for emptying the map? Should I set it to null each time, or should I call clear()? I know clear() is linear in time, but I don't know how to compare that cost with the cost of creating the map each time. The size of the map is not constant, though it may run from n to 3n elements between creations.
If a map is not referenced from other objects where it may be hard to set a new one, simply null-ing out an old map and starting from scratch is probably lighter-weight than calling a clear(), because no linear-time cleanup needs to happen. With the garbage collection costs being tiny on modern systems, there is a good chance that you would save some CPU cycles this way. You can avoid resizing the map multiple times by specifying the initial capacity.
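For example, assuming the map grows to roughly 3n entries as described in the question, the replacement map can be pre-sized so it never resizes while being refilled (a sketch; the numbers are placeholders):

// Sketch: start fresh each time instead of clear(), sized so the new map will
// not resize while being refilled (HashMap resizes at capacity * loadFactor).
import java.util.HashMap;
import java.util.Map;

public class RecreateDemo {
    public static void main(String[] args) {
        int n = 1000;                          // placeholder for the real workload size
        int capacity = 3 * n * 4 / 3 + 1;      // room for ~3n entries at the default 0.75 load factor

        Map<String, String> map = new HashMap<String, String>(capacity);
        // ... fill the map on user interaction, use the data ...

        map = new HashMap<String, String>(capacity);   // discard instead of clear()
    }
}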
One situation where clear() is preferred would be when the map object is shared among multiple objects in your system. For example, if you create a map, give it to several objects, and then keep some shared information in it, setting the map to a new one in all these objects may require keeping references to objects that have the map. In situations like that it's easier to keep calling clear() on the same shared map object.
Well, it depends on how much memory you can throw at it. If you have a lot, then it doesn't matter. However, setting the map itself to null means that you have handed the work to the garbage collector: if only the map holds references to the instances inside it, the garbage collector can collect not only the map but also any instances inside it. clear() does empty the map, but it has to iterate over everything in the map to set each reference to null, and this happens during your program's execution time; the garbage collector essentially has to do this work anyway, so let it do its thing. Just note that setting the map to null doesn't let you reuse it. A typical pattern for reusing a map variable is:
Map<String, String> whatever = new HashMap<String, String>();
// ... do something with the map
whatever = new HashMap<String, String>();
This allows you to reuse the variable without setting it to null at all; you silently discard the reference to the old map. This would be terrible practice in a non-garbage-collected language, since losing the last pointer to the old map would leak it, but in Java, since nothing references the old map any more, the GC marks it as eligible for collection.
I feel that nulling the existing map is cheaper than clear(), as object creation is very cheap in modern JVMs.
Short answer: use Collection.clear() unless it is too complicated to keep the collection around.
Detailed answer: In Java, the allocation of memory is almost instantaneous; it is little more than a pointer that gets moved inside the VM. However, the initialization of those objects might add up to something significant. Also, all objects that use an internal buffer are sensitive to resizing and copying of their content. Using clear() makes sure that the buffers eventually stabilize at some size, so that reallocating memory and copying the old buffer to a new one is never necessary.
Another important issue is that allocating and then releasing a lot of objects will require more frequent runs of the garbage collector, which might cause sudden lag.
If you always hold on to the map, it will be promoted to the old generation. If each user has one corresponding map, the number of maps in the old generation is proportional to the number of users. This may trigger full GCs more frequently as the number of users increases.
You can use both with similar results.
One prior answer notes that clear is expected to take constant time in a mature map implementation. Without checking the source code of the likes of HashMap, TreeMap, ConcurrentHashMap, I would expect their clear method to take constant time, plus amortized garbage collection costs.
Another poster notes that a shared map cannot be nulled. Well, it can if you want it, but you do it by using a proxy object which encapsulates a proper map and nulls it out when needed. Of course, you'd have to implement the proxy map class yourself.
Map<Foo, Bar> myMap = new ProxyMap<Foo, Bar>();
// Internally, the above object holds a reference to a proper map,
// for example, a hash map. Furthermore, this delegates all calls
// to the underlying map. A true proxy.
myMap.clear();
// The clear method simply reinitializes the underlying map.
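A rough sketch of such a proxy class (assuming only standard Map behaviour is needed; this is not part of the original answer):

// Sketch of a "nullable" proxy map: clear() swaps in a brand-new inner HashMap
// instead of clearing in place, while callers keep holding the same ProxyMap.
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ProxyMap<K, V> extends AbstractMap<K, V> {
    private Map<K, V> inner = new HashMap<K, V>();

    @Override
    public V put(K key, V value) {
        return inner.put(key, value);
    }

    @Override
    public V get(Object key) {
        return inner.get(key);
    }

    @Override
    public Set<Map.Entry<K, V>> entrySet() {
        return inner.entrySet();
    }

    @Override
    public void clear() {
        inner = new HashMap<K, V>();   // drop the old map; the GC reclaims it wholesale
    }
}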
Unless you did something like the above, clear and nulling out are equivalent in the ways that matter, but I think it's more mature to assume your map, even if not currently shared, may become shared at a later time due to forces you can't foresee.
There is another reason to clear instead of nulling out, even if the map is not shared. Your map may be instantiated by an external client, like a factory, so if you clear your map by nulling it out, you might end up coupling yourself to the factory unnecessarily. Why should the object that clears the map have to know that you instantiate your maps using Guava's Maps.newHashMap() with God knows what parameters? Even if this is not a realistic concern in your project, it still pays off to align yourself to mature practices.
For the above reasons, and all else being equal, I would vote for clear.
HTH.

Unique tasks all along a program

In my program, I do some tasks, parametrized by a MyParameter object (I call doTask(MyParameter parameter) to run a task).
From the beginning to the end of the program I can create a lot of tasks (a few million at least), BUT I want to run each of them only once (if a task has already been executed, the method does nothing).
Currently, I'm using a HashSet to store the MyParameter objects of the tasks already executed, but if a MyParameter object is 100 bytes and my program runs 10M tasks, that is at least 1 GB of memory...
How can I optimize that to use as little memory as possible?
Thanks a lot guys
If all you need to know is whether a particular MyParameter has been processed or not, ditch the HashSet and use a BitSet instead.
Basically, if all you need to know is whether a particular MyParameter is done or not, then storing the entire MyParameter in the set is overkill - you only need to store a single bit, where 0 means "not done" and 1 means "done". This is exactly what a BitSet is designed for.
The hashes of your MyParameter values are presumably unique, otherwise your current approach of using a HashSet is pointless. If so, then you can use the hashCode() of each MyParameter as an index into the bit set, using the corresponding bit as an indicator of whether the given MyParameter is done or not.
That probably doesn't make much sense as is, so the following is a basic implementation. (Feel free to substitute the for loop, numParameters, getParameter(), etc with whatever it is that you're actually using to generate MyParameters)
BitSet doneSet = new BitSet();
for (int i = 0; i < numParameters; ++i) {
    MyParameter parameter = getParameter(i);
    // Assumes hashCode() is unique per parameter and non-negative,
    // since it is used directly as the bit index.
    if (!doneSet.get(parameter.hashCode())) {
        doTask(parameter);
        doneSet.set(parameter.hashCode());
    }
}
The memory usage of this approach is a bit contingent on how BitSet is implemented internally, but I would expect it to be significantly better than simply storing all your MyParameters in a HashSet.
If, in fact, you do need to hang onto your MyParameter objects once you've processed them because they contain the result of processing, then you can possibly save space by storing just the result portion of the MyParameter in the HashSet (if such a thing's possible - your question doesn't make this clear).
If, on the other hand, you really do need each MyParameter in its entirety once you're done processing it, then you're already doing pretty much the best you can do. You might be able to do a little better memory-wise by storing them as a vector (i.e. expandable array) of MyParameters (which avoids some of the memory overheads inherent in using a HashSet), but this will incur a speed penalty due to time needed to expand the vector and an O(n) search time.
A TreeSet will give you somewhat better memory performance than a HashSet, at the cost of log(n) lookups.
You can use a NoSql key-value store such as Cassandra or LevelDB, which are essentially external hash tables.
You may be able to compress the MyParameter representation, but if it's only 100 bytes at the moment then I don't know how much smaller you'd be able to get it.

Need an efficient Map or Set that does NOT produce any garbage when adding and removing

So because Javolution does not work (see here), I am in dire need of a Java Map implementation that is efficient and produces no garbage under simple usage. The standard java.util.Map implementations will produce garbage as you add and remove keys. I checked Trove and Guava, but it does not look like they have Set<E> implementations. Where can I find a simple and efficient alternative to java.util.Map?
Edit for EJP:
An entry object is allocated when you add an entry, and released to GC when you remove it. :(
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    if (size++ >= threshold)
        resize(2 * table.length);
}
Taken literally, I am not aware of any existing implementation of Map or Set that never produces any garbage on adding and removing a key.
In fact, the only way that it would even be technically possible (in Java, using the Map and Set APIs as defined) is if you were to place a strict upper bound on the number of entries. Practical Map and Set implementations need extra state proportional to the number of elements they hold. This state has to be stored somewhere, and when the current allocation is exceeded that storage needs to be expanded. In Java, that means that new nodes need to be allocated.
(OK, you could design a data structure class that held onto its old, useless nodes forever, and therefore never generated any collectable garbage ... but it would still be generating garbage in effect.)
So what can you do about this in practice to reduce the amount of garbage generated? Let's take HashMap as an example:
Garbage is created when you remove an entry. This is unavoidable, unless you replace the hash chains with an implementation that never releases the nodes that represent the chain entries. (And that's a bad idea ... unless you can guarantee that the free node pool size will always be small. See below for why it is a bad idea.)
Garbage is created when the main hash array is resized. This can be avoided in a couple of ways:
You can give a 'capacity' argument in the HashMap constructor to set the size of the initial hash array large enough that you never need to resize it. (But that potentially wastes space ... especially if you can't accurately predict how big the HashMap is going to grow.)
You can supply a ridiculous value for the 'load factor' argument to cause the HashMap to never resize itself. (But that results in a HashMap whose hash chains are unbounded, and you end up with O(N) behaviour for lookup, insertion, deletion, etc.) Both options are sketched below.
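A quick sketch of both constructor-based options (the sizes are arbitrary examples):

// Two ways to stop a HashMap from resizing (and thus from allocating a new
// bucket array): pre-size it, or use an absurdly high load factor.
import java.util.HashMap;
import java.util.Map;

public class NoResizeDemo {
    public static void main(String[] args) {
        // Option 1: large initial capacity; never resizes as long as the entry
        // count stays below capacity * loadFactor (here 1000000 * 0.75).
        Map<Long, String> presized = new HashMap<Long, String>(1000000);

        // Option 2: a huge load factor pushes the resize threshold out of reach,
        // at the cost of ever-longer hash chains and O(N) operations.
        Map<Long, String> neverResizes = new HashMap<Long, String>(1024, 1000000f);
    }
}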
In fact, creating garbage is not necessarily bad for performance. Indeed, hanging onto nodes so that the garbage collector doesn't collect them can actually be worse for performance.
The cost of a GC run (assuming a modern copying collector) is mostly in three areas:
Finding nodes that are not garbage.
Copying those non-garbage nodes to the "to-space".
Updating references in other non-garbage nodes to point to objects in "to-space".
(If you are using a low-pause collector there are other costs too ... generally proportional to the amount of non-garbage.)
The only part of the GC's work that actually depends on the amount of garbage, is zeroing the memory that the garbage objects once occupied to make it ready for reuse. And this can be done with a single bzero call for the entire "from-space" ... or using virtual memory tricks.
Suppose your application / data structure hangs onto nodes to avoid creating garbage. Now, when the GC runs, it has to do extra work to traverse all of those extra nodes, and copy them to "to-space", even though they contain no useful information. Furthermore, those nodes are using memory, which means that if the rest of the application generates garbage there will be less space to hold it, and the GC will need to run more often.
And if you've used weak/soft references to allow the GC to claw back nodes from your data structure, then that's even more work for the GC ... and space to represent those references.
Note: I'm not claiming that object pooling always makes performance worse, just that it often does, especially if the pool gets unexpectedly big.
And of course, that's why HashMap and similar general-purpose data structure classes don't do any object pooling. If they did, they would perform significantly worse in situations where the programmer doesn't expect it ... and they would be genuinely broken, IMO.
Finally, there is an easy way to tune a HashMap so that an add immediately followed by a remove of the same key produces no garbage (guaranteed). Wrap it in a Map class that caches the last entry "added", and only does the put on the real HashMap when the next entry is added. Of course, this is NOT a general purpose solution, but it does address the use case of your earlier question.
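A sketch of that wrapper idea (not a full Map implementation; the names are invented):

// The real HashMap only sees an entry once a *different* key is added, so an
// add immediately followed by a remove of the same key allocates no Map.Entry.
import java.util.HashMap;
import java.util.Map;

public class LastEntryCachingMap<K, V> {
    private final Map<K, V> real = new HashMap<K, V>();
    private K pendingKey;      // the most recently added entry, not yet in 'real'
    private V pendingValue;

    public void put(K key, V value) {
        if (key.equals(pendingKey)) {
            pendingValue = value;                  // overwrite the pending entry in place
            return;
        }
        if (pendingKey != null) {
            real.put(pendingKey, pendingValue);    // flush the previous pending entry
        }
        pendingKey = key;
        pendingValue = value;
    }

    public V remove(K key) {
        if (key.equals(pendingKey)) {              // add-then-remove never touches the HashMap
            V v = pendingValue;
            pendingKey = null;
            pendingValue = null;
            return v;
        }
        return real.remove(key);
    }

    public V get(K key) {
        return key.equals(pendingKey) ? pendingValue : real.get(key);
    }
}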
I guess you need a version of HashMap that uses open addressing, and you'll want something better than linear probing. I don't know of a specific recommendation though.
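To show the general shape of open addressing, here is a deliberately simplified, fixed-capacity sketch. It uses linear probing purely for brevity; as noted above, a real implementation would want a better probe sequence, plus tombstones and resizing to support removal and growth:

// Keys and values live in two preallocated parallel arrays, so put() allocates
// nothing. No removal or resizing; the table must never be filled completely.
public class OpenAddressingMap<K, V> {
    private final Object[] keys;
    private final Object[] values;

    public OpenAddressingMap(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    public void put(K key, V value) {
        int i = indexOf(key);
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = indexOf(key);
        return keys[i] == null ? null : (V) values[i];
    }

    // Probe linearly from the hash slot until the key or an empty slot is found.
    private int indexOf(K key) {
        int i = (key.hashCode() & 0x7fffffff) % keys.length;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;
        }
        return i;
    }
}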
http://sourceforge.net/projects/high-scale-lib/ has implementations of Set and Map which do not create garbage on add or remove of keys. The implementation uses a single array with alternating keys and values, so put(k,v) does not create an Entry object.
Now, there are some caveats:
Rehashing creates garbage, because it replaces the underlying array.
I think this map will rehash given enough interleaved put and delete operations, even if the overall size is stable (to harvest tombstone values).
This map will create Entry objects if you ask for the entry set (one at a time as you iterate).
The class is called NonBlockingHashMap.
One option is to try to fix the HashMap implementation to use a pool of entries. I have done that. :) There are also other optimizations for speed you can do there. I agree with you: that issue with Javolution FastMap is mind-boggling. :(
