Performance issues when HashMap is used as Cache


Case 1 :
One HashMap with 100,000 entries
Case 2 :
Two HashMaps with 50,000 entries each.
Which of the above cases will take more execution time and more memory? Or is there a significant difference between the two?
Is it feasible to replace one HashMap of large number of entries with two HashMaps of lesser number of entries?

You're better off with a single hash map.
Look-ups are very efficient in a hash map, and they're designed to have lots of elements in. It'll be slower overall if you have to put something in place to search one map and then look in the other one if you don't find it in the first one.
(There won't be much difference in memory usage either way.)
If it's currently too slow, check that your .hashCode() and .equals() implementations are not inefficient.

The memory requirements should be similar for the two cases (since the HashMap storage is an array of Entries whose size is the capacity of the map, so two arrays of 50K would take the same space as one array of 100K).
The get() and put() methods should also have similar performance in both cases, since the calculation of the hash code of the key and the matching index is not affected by the size of the map. The only thing that may affect the performance of these operations is the average number of keys mapped to the same index, and that should also be independent of the size of the map, since as the size grows, the Entries array grows, so the average number of Entries per index should stay the same.
On the other hand, if you use two smaller maps, you have to add logic to decide which map to use. You can try to search for a key in the first map and, if it is not found, search in the second map. Or you can have a criterion that determines which map is used (for example, String keys starting with A to M will be stored in the first map, and the rest in the second map). Either way, each operation on the map will have an additional overhead.
Therefore, I would suggest using a single HashMap.

The performance difference between using one or two HashMaps should not matter much, so go for the simpler solution, the single HashMap.
But since you ask this question, I assume that you have performance problems. Often, using a HashMap as a cache is a bad idea, because it keeps strong references to the cached objects, which prevents them from being garbage collected for as long as they stay in the map. Consider redesigning your cache, for example using SoftReferences (a class in the standard Java API), which allow the garbage collector to reclaim your cached objects while still letting you reuse them as long as they have not been collected yet.
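A minimal sketch of the SoftReference idea above, not a production cache. The class name SoftCache and its two methods are hypothetical; only java.lang.ref.SoftReference and HashMap come from the standard API. It assumes single-threaded use.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache: values are held through SoftReferences, so the
// garbage collector is allowed to reclaim them when memory runs low.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    // Returns the cached value, or null if it was never cached
    // or has been collected in the meantime.
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // the value was collected; drop the stale entry
        }
        return value;
    }
}
```

A caller has to treat a null result as a cache miss and recompute or reload the value.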

As everyone mentioned, you should be using a single hash map. If you are having trouble with 100k entries, then there is a major problem with your code.
Here are some pointers on hash maps:
Don't use overly complex objects as keys (in my opinion, a String key is as far as you should go for a HashMap).
If you do use a complex object as a key, make sure its equals and hashCode methods are as efficient as possible, as too much calculation within these methods can greatly reduce the efficiency of the hash map.

Related

Java: speed of searching in a Map and in a for loop

I have a list of modules, each with two fields, time and size, and I need to find one module for a given time/size.
I have two solutions:
Iterate over the list with for (Module module : myModuleList).
Create a Map and use Map.get().
Which would be faster or consume fewer resources? This lookup will run periodically against an ever larger module list.

Iterating through the list is O(myModuleList.size()) whereas using Map.get() is O(1) for a HashMap or O(log(myModuleList.size())) for a TreeMap. So, if you are optimizing for performance, then using a HashMap would be the best solution. In almost all cases, you should reach for a HashMap by default unless you also need to iterate over the elements in key order, in which case it would make sense to use a TreeMap. So, the short answer is that you should use a HashMap.
For a very large number of elements (which is not the case that you are describing), it's possible to improve the space usage of the HashMap (albeit at the expense of speed) by tuning the load factor and initial capacity (though you will not need to do this in ordinary usages). There are also alternative map datastructures that give different tradeoffs in performance and space, depending on your needs.
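To make the comparison concrete, here is a rough sketch of both approaches. The Module class, the ModuleLookup helper, and the choice to key the map by time alone are assumptions made for illustration; the question does not show its actual types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ModuleLookup {
    // Hypothetical Module with the two fields mentioned in the question.
    static final class Module {
        final long time;
        final int size;

        Module(long time, int size) {
            this.time = time;
            this.size = size;
        }
    }

    // O(n): walks the list until a module with the given time is found.
    static Module findByScan(List<Module> modules, long time) {
        for (Module module : modules) {
            if (module.time == time) {
                return module;
            }
        }
        return null;
    }

    // One-off O(n) indexing pass; every later get() is O(1) on average.
    static Map<Long, Module> indexByTime(List<Module> modules) {
        Map<Long, Module> index = new HashMap<>();
        for (Module module : modules) {
            index.put(module.time, module);
        }
        return index;
    }
}
```

The map only pays off if it is built once and then queried many times; rebuilding it before every lookup would cost as much as the scan.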

Faster to retrieve objects via hashmap or iterate over array?

Basically I have an enum with an id.
Would it be faster to create a hashmap that takes the id and gives you the enum, or to iterate over all the enum values, testing whether the provided id equals the id of each value and returning it if so?
If it matters then there are 5 enums.

HashMaps are specifically designed for retrieving objects via a key, and are fast even with many entries. Usually, a sequential scan of a list or similar collection would be much slower.
But if you have only 5 items, there's no real difference.
Edit: on second thoughts, with so few objects you might be better off with a sequential scan because the extra work in calculating the hash codes might outweigh the advantage. But the difference is so small it's not worth bothering about.

A hashmap has a lookup complexity of O(1), while iterating naturally has O(n). That said, for only 5 enum values the iteration will probably be faster on average, and it will not need an extra data structure when you just iterate over .values() anyway.
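Both options can live side by side in the enum itself. The sketch below uses a made-up enum named Status with five values and made-up ids; none of these names come from the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum with an id field, as described in the question.
enum Status {
    NEW(1), OPEN(2), PENDING(3), DONE(4), CLOSED(5);

    // Lookup table built once, when the enum class is initialised.
    private static final Map<Integer, Status> BY_ID = new HashMap<>();

    static {
        for (Status status : values()) {
            BY_ID.put(status.id, status);
        }
    }

    private final int id;

    Status(int id) {
        this.id = id;
    }

    // O(n) scan; with five values this is a handful of int comparisons.
    static Status byScan(int id) {
        for (Status status : values()) {
            if (status.id == id) {
                return status;
            }
        }
        return null;
    }

    // O(1) on average, at the price of boxing the id and hashing it.
    static Status byMap(int id) {
        return BY_ID.get(id);
    }
}
```

Note that values() returns a fresh copy of the array on each call, so a scan that runs very often may want to keep that array in a private static field.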

If you have an Enum as a key, you should use an EnumMap. This basically wraps an array of values and is faster than using a HashMap.

Need an efficient Map or Set that does NOT produce any garbage when adding and removing

So because Javolution does not work (see here) I am in deep need of a Java Map implementation that is efficient and produces no garbage under simple usage. java.util.Map will produce garbage as you add and remove keys. I checked Trove and Guava but it does not look like they have Set<E> implementations. Where can I find a simple and efficient alternative for java.util.Map?
Edit for EJP:
An entry object is allocated when you add an entry, and released to GC when you remove it. :(
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    if (size++ >= threshold)
        resize(2 * table.length);
}

Taken literally, I am not aware of any existing implementation of Map or Set that never produces any garbage on adding and removing a key.
In fact, the only way that it would even be technically possible (in Java, using the Map and Set APIs as defined) is if you were to place a strict upper bound on the number of entries. Practical Map and Set implementations need extra state proportional to the number of elements they hold. This state has to be stored somewhere, and when the current allocation is exceeded that storage needs to be expanded. In Java, that means that new nodes need to be allocated.
(OK, you could design a data structure class that held onto old useless nodes for ever, and therefore never generated any collectable garbage ... but it is still generating garbage.)
So what can you do about this in practice ... to reduce the amount of garbage generated. Let's take HashMap as an example:
Garbage is created when you remove an entry. This is unavoidable, unless you replace the hash chains with an implementation that never releases the nodes that represent the chain entries. (And that's a bad idea ... unless you can guarantee that the free node pool size will always be small. See below for why it is a bad idea.)
Garbage is created when the main hash array is resized. This can be avoided in a couple of ways:
You can give a 'capacity' argument in the HashMap constructor to set the size of the initial hash array large enough that you never need to resize it. (But that potentially wastes space ... especially if you can't accurately predict how big the HashMap is going to grow.)
You can supply a ridiculous value for the 'load factor' argument to cause the HashMap to never resize itself. (But that results in a HashMap whose hash chains are unbounded, and you end up with O(N) behaviour for lookup, insertion, deletion, etc.)
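As a sketch of the first option, the helper below sizes the table up front. The class and method names are hypothetical, and the arithmetic assumes the default load factor of 0.75. The load factor itself is the second argument of the two-argument constructor, HashMap(int initialCapacity, float loadFactor).

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMaps {
    // Returns a HashMap whose initial capacity is chosen so that
    // `expectedEntries` entries stay below the resize threshold
    // (capacity * load factor) at the default load factor of 0.75.
    static <K, V> Map<K, V> presized(int expectedEntries) {
        int capacity = (int) (expectedEntries / 0.75f) + 1;
        return new HashMap<>(capacity);
    }
}
```

Calling PresizedMaps.presized(100_000) trades some up-front memory for not reallocating the hash array while the map fills up to the expected size.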
In fact, creating garbage is not necessarily bad for performance. Indeed, hanging onto nodes so that the garbage collector doesn't collect them can actually be worse for performance.
The cost of a GC run (assuming a modern copying collector) is mostly in three areas:
Finding nodes that are not garbage.
Copying those non-garbage nodes to the "to-space".
Updating references in other non-garbage nodes to point to objects in "to-space".
(If you are using a low-pause collector there are other costs too ... generally proportional to the amount of non-garbage.)
The only part of the GC's work that actually depends on the amount of garbage, is zeroing the memory that the garbage objects once occupied to make it ready for reuse. And this can be done with a single bzero call for the entire "from-space" ... or using virtual memory tricks.
Suppose your application / data structure hangs onto nodes to avoid creating garbage. Now, when the GC runs, it has to do extra work to traverse all of those extra nodes, and copy them to "to-space", even though they contain no useful information. Furthermore, those nodes are using memory, which means that if the rest of the application generates garbage there will be less space to hold it, and the GC will need to run more often.
And if you've used weak/soft references to allow the GC to claw back nodes from your data structure, then that's even more work for the GC ... and space to represent those references.
Note: I'm not claiming that object pooling always makes performance worse, just that it often does, especially if the pool gets unexpectedly big.
And of course, that's why HashMap and similar general purpose data structure classes don't do any object pooling. If they did, they would perform significantly worse in situations where the programmer doesn't expect it ... and they would be genuinely broken, IMO.
Finally, there is an easy way to tune a HashMap so that an add immediately followed by a remove of the same key produces no garbage (guaranteed). Wrap it in a Map class that caches the last entry "added", and only does the put on the real HashMap when the next entry is added. Of course, this is NOT a general purpose solution, but it does address the use case of your earlier question.
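One possible reading of that wrapper idea, as a rough sketch only. The class DeferredPutMap and its methods are hypothetical, it implements just put/get/remove rather than the full Map interface, and it assumes non-null keys and single-threaded use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper: the most recent put is parked in plain fields and
// only written to the real HashMap when a put for a different key arrives.
class DeferredPutMap<K, V> {
    private final Map<K, V> backing = new HashMap<>();
    private K pendingKey;
    private V pendingValue;
    private boolean hasPending;

    void put(K key, V value) {
        if (hasPending && !pendingKey.equals(key)) {
            backing.put(pendingKey, pendingValue); // flush the previous entry
        }
        pendingKey = key;
        pendingValue = value;
        hasPending = true;
    }

    V get(K key) {
        if (hasPending && pendingKey.equals(key)) {
            return pendingValue;
        }
        return backing.get(key);
    }

    void remove(K key) {
        if (hasPending && pendingKey.equals(key)) {
            // The entry never reached the HashMap, so no node was allocated.
            hasPending = false;
            pendingKey = null;
            pendingValue = null;
        }
        backing.remove(key); // also clears any older value stored under this key
    }

    // Exposed only so the effect can be observed from outside.
    int backingSize() {
        return backing.size();
    }
}
```

With this sketch, put("a", 1) followed by remove("a") leaves the backing map untouched, while a second put under a different key pushes the first entry into the HashMap as usual.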

I guess you need a version of HashMap that uses open addressing, and you'll want something better than linear probing. I don't know of a specific recommendation though.

http://sourceforge.net/projects/high-scale-lib/ has implementations of Set and Map which do not create garbage on add or remove of keys. The implementation uses a single array with alternating keys and values, so put(k,v) does not create an Entry object.
Now, there are some caveats:
Rehash creates garbage b/c it replaces the underlying array
I think this map will rehash given enough interleaved put & delete operations, even if the overall size is stable. (To harvest tombstone values)
This map will create Entry object if you ask for the entry set (one at a time as you iterate)
The class is called NonBlockingHashMap.

One option is to try to fix the HashMap implementation to use a pool of entries. I have done that. :) There are also other optimizations for speed you can do there. I agree with you: that issue with Javolution FastMap is mind-boggling. :(

Reuse hashmaps in array

I am holding an array of hashmaps. I want to get the best performance and memory usage, so I would like to reuse the hashmaps inside the array.
So when there is a hashmap in the array that is not needed any more and I want to add a new hashmap to the array, I just clear the old hashmap and use put() to add the new values.
I also need to copy the values back out when I retrieve a hashmap from the array.
I am not sure if this is better than creating a new HashMap() every time.
What is better?
UPDATE
I need to cycle through about 50 million hashmaps, each with about 10 key-value pairs. If the size of the array is 20,000, I need just 20,000 hashmaps instead of 50 million new HashMap() calls.

Be very careful with this approach. Although it may be better performance-wise to recycle objects, you may get into trouble by modifying the same reference several times, as illustrated in the following example:
public class A {
    public int counter = 0;
    public static void main(String[] args) {
        A a = new A();
        a.counter = 5;
        A b = a; // I want to save a into b and then recycle a for other purposes
        a.counter = 10; // now b.counter is also 10
    }
}
I'm sure you got the point, however if you are not copying around references to HashMaps from the array, then it should be ok.
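The same pitfall applied to a recycled HashMap, as a small sketch. The class MapReuseDemo and its snapshot method are hypothetical names; the copy is made with the standard HashMap copy constructor.

```java
import java.util.HashMap;
import java.util.Map;

class MapReuseDemo {
    // Copies the contents out of a pooled map, so that later clear()/put()
    // calls on the pooled instance cannot change what the caller holds.
    static Map<String, Integer> snapshot(Map<String, Integer> pooled) {
        return new HashMap<>(pooled);
    }

    public static void main(String[] args) {
        Map<String, Integer> pooled = new HashMap<>();
        pooled.put("a", 1);

        Map<String, Integer> alias = pooled;          // same object as pooled
        Map<String, Integer> copy = snapshot(pooled); // independent contents

        pooled.clear();       // recycle the map for something else
        pooled.put("b", 2);

        System.out.println(alias.keySet()); // reflects the recycled contents
        System.out.println(copy.keySet());  // still the original contents
    }
}
```

The copy allocates a new map and new entries, so taking a snapshot on every retrieval gives back much of what recycling was meant to save.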

Doesn't matter. Premature optimization. Come back when you have profiler results telling you where you're actually spending most memory or CPU cycles

It is entirely unclear why re-using maps in this manner would improve performance and/or memory usage. For all we know, it might make no difference, or might have the opposite effect.
You should do whatever results in the most readable code, then profile, and finally optimize the parts of the code that the profiler highlights as bottlenecks.

In most cases you will not feel any difference.
Typically the number of map entries is MUCH higher than the number of map objects. When you populate a map, you create an instance of Map.Entry per entry. This is a relatively lightweight object, but you invoke new all the same. The map itself without data is lightweight too, so you will not get any benefit from these tricks unless your map is supposed to hold only 1-2 entries.
Bottom line.
Forget about premature optimization. Implement your application. If you have performance problems, profile the application, find the bottlenecks and fix them. I can 99% guarantee you that the bottleneck will never be in the new HashMap() call.

I think what you want is an object pool: you get an object (in your case, a HashMap) from the pool, perform your operations, and put it back in the pool when it is no longer needed.
Look up the Object Pool design pattern; for further reference, check this link:
http://sourcemaking.com/design_patterns/object_pool

The problem you have is that most of the objects are Map.Entry objects in the HashMap. While you can recycle the HashMap itself (and its array), these are only a small portion of the objects. One way around this is to use FastMap from Javolution, which recycles everything and has support for managing the lifecycle (it's designed to minimise garbage this way).
I suspect the most efficient way is to use an EnumMap if possible (if you have known key attributes), or POJOs even if most fields are not used.

There's a few problems with reusing HashMaps.
Even if the key and value data were to take no memory (shared from other places), the Map.Entry objects would dominate memory usage but not be reused (unless you did something a bit special).
Because of generational GC, generally having old objects point to new is expensive (and relatively difficult to see what's going on). Might not be an issue if you are keeping millions of these.
More complicated code is more difficult to optimise. So keep it simple, and then do the big optimisations, which probably involve changing the data structures.

TreeMap or HashMap? [duplicate]

This question already has answers here:
Difference between HashMap, LinkedHashMap and TreeMap
(17 answers)
What is the difference between a HashMap and a TreeMap? [duplicate]
(8 answers)
Closed 8 years ago.
When to use hashmaps or treemaps?
I know that I can use TreeMap to iterate over the elements when I need them to be sorted.
But is that all? Is there no optimization when I just want to look things up in the map, or some specific use where one is the better choice?

TreeMap provides guaranteed O(log n) lookup time (and insertion etc), whereas HashMap provides O(1) lookup time if the hash code disperses keys appropriately.
Unless you need the entries to be sorted, I'd stick with HashMap. Or there's ConcurrentHashMap of course. I can't remember the details of the differences between all of them, but HashMap is a perfectly reasonable "default" option :)
For completeness, I should point out that there was a discussion on Stack Overflow a month or so ago about the internals of various maps. See the comments in this question, which I will copy into this answer if bestsss is happy for me to do so.

Hashtables (usually) perform search operations (lookup) bounded by O(1) <= T(n) <= O(n), that is, O(1) in the best case and O(n) in the worst case, with an average case complexity of O(1 + n/k); binary search trees (BSTs), however, perform search operations (lookup) bounded by O(log_2(n)) <= T(n) <= O(n), with an average case complexity of O(log_2(n)). You should know the implementation of each data structure you use, in order to understand its advantages, drawbacks, the time complexity of its operations, and its code complexity.
For example, a hashtable often has some fixed number of slots (some of which may not be filled at all) with lists of collisions. Trees, on the other hand, usually have two pointers (references) per node, but this can be more if the implementation allows more than two child nodes per node, and this allows the tree to grow as nodes are added, but may not allow duplicates. (The default implementation of a Java TreeMap does not allow duplicate keys.)
There are special cases to consider as well. For example, what if the number of elements in a particular data structure increases without bound or approaches the limit of an underlying part of the data structure? What about amortized operations that perform some rebalancing or cleanup operation?
For example, in a hashtable, when the number of elements in the table becomes sufficiently large, an arbitrary number of collisions can occur. On the other hand, trees usually require some re-balancing procedure after an insertion (or deletion).
So, if you have something like a cache (e.g. the number of elements is bounded, or the size is known) then a hashtable is probably your best bet; however, if you have something more like a dictionary (e.g. populated once and looked up many times) then I'd use a tree.
This is only the general case, however (no further information was given). You have to understand which operations happen, and how they happen, to make the right choice of data structure.
When I need a multi-map (ranged lookup) or sorted flattening of a collection, then it can't be a hashtable.

The largest difference between the two is the underlying structure used in the implementation.
HashMaps use an array and a hashing function to store elements. When you try to insert or delete an item in the array, the hashing function converts the key into an index on the array where the object is/should be stored (ignoring conflicts). While hashmaps are generally very fast because they don't need to iterate over large amounts of data, they slow down when they're filled because they need to copy all the key/values into a new array.
TreeMaps store the data in a sorted tree structure. While this means that they'll never have to allocate more space and copy over to it, operations require that part of the data already stored be iterated over, and sometimes that large parts of the structure be changed.
Of the two, HashMaps will generally have better performance when you don't need sorting.

Inserting new elements into a HashMap will, on average, be a good deal faster than inserting elements into a TreeMap. Unless you need your elements sorted, I'd go with the HashMap.

Don't forget there is also LinkedHashMap which is nearly as fast as HashMap for add/contains/remove operations but also maintains the insertion order.
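A short sketch of how the three maps differ when iterated. The class MapOrderDemo and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    // Inserts the same three keys into the given map and returns the keys
    // in the order in which the map iterates over them.
    static List<String> keysAfterInserting(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("mango", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // TreeMap: sorted by key.
        System.out.println(keysAfterInserting(new TreeMap<>()));       // [apple, mango, pear]
        // LinkedHashMap: insertion order.
        System.out.println(keysAfterInserting(new LinkedHashMap<>())); // [pear, apple, mango]
        // HashMap: no ordering guarantee, so the output may vary.
        System.out.println(keysAfterInserting(new HashMap<>()));
    }
}
```

If the code never relies on iteration order, the plain HashMap remains the reasonable default the answers above recommend.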
