HashMap key update vs double entries - java

I am using a HashMap to store objects with a key that evolves over time.
HashMap<String, Stuff> hm = new HashMap<String, Stuff>();
Stuff stuff = new Stuff();
hm.put("OriginalKey", stuff);
I didn't find anything better than removing "OriginalKey" and calling put() with a new key for the same object.
hm.remove("OriginalKey");
hm.put("NewKey", stuff);
remove() seems to be taking a significant CPU toll, hence my questions:
What is the actual memory cost of leaving duplicate entries (there is no overlapping risk)?
Am I just missing some neat swapKey() method?

What is the actual memory cost of leaving duplicate entries (there is no overlapping risk)?
Well, you've got an extra entry, and the key itself can't be garbage collected. If the key is "large", that could be a problem. It also means that you'll never be able to get an accurate count, you'll never be able to sensibly iterate over all the values, etc. It seems like a bad idea to me.
Am I just missing some neat swapKey() method?
There's no such thing - and it feels like a fairly rare requirement to me. Any such method would pretty much have to do what you're doing anyway - it has to find the old key, remove it from the data structure, and insert an entry for the new key. I can't easily imagine any optimizations possible just by knowing about both operations at once.

Swapping the key is not easily possible, since the key is used for hashing.
Changing the key means the hash value will most probably be different too; in that case, changing the key amounts to a deletion followed by a reinsertion.
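For illustration, a minimal sketch of what such a helper would have to do anyway (swapKey is a hypothetical name, not a JDK method):
static <K, V> void swapKey(Map<K, V> map, K oldKey, K newKey) {
    V value = map.remove(oldKey); // find and unlink the old entry
    if (value != null) {
        map.put(newKey, value);   // reinsert the same object under the new key
    }
}
This is exactly the remove/put pair from the question; there is nothing for a combined method to save.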


Why does Java's HashSet use HashMap internally? Doesn't it waste memory? [duplicate]

Looking at the source of Java 6, HashSet<E> is actually implemented using HashMap<E,Object>, with a dummy object instance as the value of every entry of the Set.
I think that wastes 4 bytes (on 32-bit machines) for the size of the entry itself.
But, why is it still used? Is there any reason to use it besides making it easier to maintain the code?
Actually, it's not just HashSet. All implementations of the Set interface in Java 6 are based on an underlying Map. This is not a requirement; it's just the way the implementation is. You can see for yourself by checking out the documentation for the various implementations of Set.
Your main questions are:
But, why is it still used? Is there any reason to use it besides making it easier to maintain the code?
I assume that code maintenance is a big motivating factor. So is preventing duplication and bloat.
Set and Map are similar interfaces, in that duplicate elements (or keys) are not allowed. (I think the only Set not backed by a Map is CopyOnWriteArraySet, which is an unusual Collection because it is backed by a CopyOnWriteArrayList and copies its array on every write.)
Specifically:
From the documentation of Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)
The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).
And from Map:
An object that maps keys to values.
A map cannot contain duplicate keys; each key can map to at most one value.
If you can implement your Sets using existing code, any benefit (speed, for example) you can realize from existing code accrues to your Set as well.
If you choose to implement a Set without a Map backing, you have to duplicate code designed to prevent duplicate elements. Ah, the delicious irony.
That said, there's nothing preventing you from implementing your Sets differently.
My guess is that HashSet was originally implemented in terms of HashMap in order to get it done quickly and easily. In terms of lines of code, HashSet is a fraction of HashMap.
I would guess that the reason it still hasn't been optimized is fear of change.
However, the waste is much worse than you think. On both 32-bit and 64-bit, HashSet is 4x larger than necessary, and HashMap is 2x larger than necessary. HashMap could be implemented with an array with keys and values in it (plus chains for collisions). That means two pointers per entry, or 16 bytes on a 64-bit VM. In fact, HashMap contains an Entry object per entry, which adds 8 bytes for the pointer to the Entry and 8 bytes for the Entry object header. HashSet also uses 32 bytes per element, but the waste is 4x instead of 2x since it only requires 8 bytes per element.
Yes, you are right, a small amount of wastage is definitely there. Small because, for every entry, it uses the same object PRESENT (which is declared final). Hence the only wastage is each entry's value reference in the HashMap.
Mostly I think, they took this approach for maintainability and reusability. (The JCF developers would have thought, we have tested HashMap anyway, why not reuse it.)
But if you are dealing with huge collections and you are a memory freak, then you may opt for better alternatives like Trove or Google Collections.
I looked at your question and it took me a while to think about what you said. So here's my opinion regarding the HashSet implementation.
It is necessary to have the dummy instance to know if the value is or is not present in the set.
Take a look at the add method
public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}
And now let's take a look at put's return value:
Returns the previous value associated with key, or null if there was no mapping for key. (A null return can also indicate that the map previously associated null with key.)
So the PRESENT object is just used to represent that the set contains the value e. I think you asked why not use null instead of PRESENT. But then you would not be able to distinguish whether the entry was previously in the map, because map.put(key, value) would always return null and you would have no way to know if the key already existed.
That being said you could argue that they could have used an implementation like this
public boolean add(E e) {
    if (map.containsKey(e)) {
        return false;
    }
    map.put(e, null);
    return true;
}
I guess they waste 4 bytes to avoid computing the key's hashCode twice (which could be expensive) when the key is going to be added.
If your question referred to why they used a HashMap that wastes 8 bytes (because of the Map.Entry) instead of some other data structure using a similar Entry of only 4, then yes, I would say they did it for the reasons you mentioned.
I am guessing that it has never turned up as a significant problem for real applications or important benchmarks. Why complicate the code for no real benefit?
Also note that object sizes are rounded up in many JVM implementations, so there may not actually be an increase in size (I don't know for this example). Also, the code for HashMap is likely to be compiled and in cache. Other things being equal, more code => more cache misses => lower performance.
After searching through pages like this wondering why the standard implementation is mildly inefficient, I found com.carrotsearch.hppc.IntOpenHashSet.
Your question:
I think that wastes 4 bytes (on 32-bit machines) for the size of the entry itself.
Just one Object instance is created for the entire HashSet data structure, and reusing HashMap saves you from rewriting all of that map-like code again.
private static final Object PRESENT = new Object();
All the keys map to the same value, i.e. the PRESENT object.

Should we use HashSet?

A HashSet is backed by a HashMap. From its JavaDoc:
This class implements the Set interface, backed by a hash table (actually a HashMap instance)
When taking a look at the source we can also see how they relate to each other:
// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}
Therefore a HashSet<E> is backed by a HashMap<E,Object>. For all HashSets in our application we have one shared object PRESENT that is used as the value in the HashMap. While the memory needed to store PRESENT itself is negligible, we still store a reference to it for each entry in the map.
Would it not be more efficient to use null instead of PRESENT? A further consideration is whether we should forgo the HashSet altogether and use a HashMap directly, given that the circumstances permit the use of a Map instead of a Set.
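For concreteness, a minimal sketch of that direct-HashMap idea (names are illustrative):
Map<String, Object> items = new HashMap<String, Object>();
items.put("item", null);                       // null as the dummy value
boolean contained = items.containsKey("item"); // containment check via the key
Note that containsKey must be used here; get() == null would be ambiguous with a null value.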
My basic problem that triggered these thoughts is the following situation: I have a collection of objects with the following properties:
big collection of objects > 30'000
Insertion order is not relevant
Efficient check if an item is contained
Adding new items to the collection is not relevant
The chosen solution should perform optimally in the context of the above criteria as well as minimize memory consumption. On this basis the data structures HashSet and HashMap spring to mind. When thinking about alternative approaches, the key question is:
How to check containment efficiently?
The only answer that comes to my mind is using the item's hash to calculate the storage location. I might be missing something here. Are there any other approaches?
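For reference, a sketch of that idea, deriving a slot index from the hash (assuming a power-of-two table size, as java.util.HashMap uses internally):
// the item's hash masked down to a valid index; tableLength must be a power of two
static int slotFor(Object item, int tableLength) {
    return item.hashCode() & (tableLength - 1);
}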
I had a look at various questions that did shed some light on the issue, but did not quite answer my question:
Java : HashSet vs. HashMap
clarifying facts behind Java's implementation of HashSet/HashMap
Java HashSet vs HashMap
I am not looking for suggestions of alternative libraries or frameworks to address this; I want to understand if there is another way to think about efficient containment checking of an element in a Collection.
In short, yes you should use HashSet. It might not be the most possibly efficient Set implementation, but that hardly ever matters, unless you are working with huge amounts of data.
In that case, I would suggest using specialized libraries: EnumMaps if you can use enums, primitive maps like Trove if your data is mostly primitives, a bunch of other data structures that are optimized for certain data types, or even an in-memory database.
Don't get me wrong, I'm someone who likes performance tuning too, but replacing the built-in data structures should only be done when it's really necessary. For most cases they work perfectly fine.
What you could do, in case you really want to save the last bit of memory and do not care about insertion, is use a fixed-size array, sort it, and do a binary search every time. But I doubt that it's more efficient than a HashSet.
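A minimal sketch of that sorted-array approach, using java.util.Arrays:
int[] items = { 7, 42, 3, 19 };        // fixed-size data, no insertion needed
Arrays.sort(items);                     // paid once, O(n log n)
boolean contained = Arrays.binarySearch(items, 42) >= 0; // O(log n) per lookup
Compared to a HashSet of boxed Integers, this stores only the primitives themselves, at the cost of O(log n) lookups instead of expected O(1).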
Hashtables and HashSets are used entirely differently, so maybe the two shouldn't be compared as "which is more efficient". A HashSet is more suitable for the mathematical "set" (e.g. {1,2,3,4}): it contains no duplicates and allows at most one null element. A HashMap is more of a key->value pair system: it allows multiple null values as well as duplicate values, just not duplicate keys. I know this is probably answering "difference between a hashtable and hashset", but I think my point is they really can't be compared.

Reuse hashmaps in array

I am holding an array of hashmaps. I want maximum performance and minimal memory usage, so I would like to reuse the hashmaps inside the array.
So when a hashmap in the array is not needed any more and I want to add a new hashmap to the array, I just clear() the old hashmap and use put() to add new values.
I also need to copy values back when I retrieve a hashmap from the array.
I am not sure if this is better than creating a new HashMap() every time.
What is better?
UPDATE
I need to cycle through about 50 million hashmaps, each with about 10 key-value pairs. If the size of the array is 20,000, I need just 20,000 hashmaps instead of 50 million new HashMap()s.
Be very careful with this approach. Although it may be better performance-wise to recycle objects, you may get into trouble by modifying the same reference several times, as illustrated in the following example:
public class A {
    public int counter = 0;

    public static void main(String[] args) {
        A a = new A();
        a.counter = 5;
        A b = a;        // I want to save a into b and then recycle a for other purposes
        a.counter = 10; // now b.counter is also 10
    }
}
I'm sure you get the point; however, if you are not copying around references to the HashMaps from the array, then it should be OK.
Doesn't matter. Premature optimization. Come back when you have profiler results telling you where you're actually spending most of your memory or CPU cycles.
It is entirely unclear why re-using maps in this manner would improve performance and/or memory usage. For all we know, it might make no difference, or might have the opposite effect.
You should do whatever results in the most readable code, then profile, and finally optimize the parts of the code that the profiler highlights as bottlenecks.
In most cases you will not feel any difference.
Typically the number of map entries is MUCH higher than the number of map objects. When you populate a map you create an instance of Map.Entry per entry. This is a relatively lightweight object, but you still invoke new for each one. The map itself without data is lightweight too, so you will not get any benefit from these tricks unless your map is supposed to hold only 1-2 entries.
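Rough accounting with the question's figures (a sketch of the allocation counts, not a measurement; note that HashMap.clear() drops its Map.Entry objects rather than reusing them):
long maps = 50_000_000L;                      // maps cycled through, per the question
long entriesPerMap = 10;
long entryAllocations = maps * entriesPerMap; // 500,000,000 Map.Entry objects either way
long shellsSaved = maps - 20_000;             // only the HashMap shells themselves are avoided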
Bottom line.
Forget about premature optimization. Implement your application. If you have performance problems, profile the application, find the bottlenecks and fix them. I can 99% guarantee you that the bottleneck will never be in the new HashMap() call.
I think what you want is an object pool kind of thing, where you get an object (in your case, a HashMap) from the object pool, perform your operations, and if that object is no longer needed you put it back in the pool.
Check the Object Pool design pattern; for further reference see this link:
http://sourcemaking.com/design_patterns/object_pool
The problem you have is that most of the objects are Map.Entry objects in the HashMap. While you can recycle the HashMap itself (and its array), these are only a small portion of the objects. One way around this is to use FastMap from Javolution, which recycles everything and has support for managing the lifecycle (it's designed to minimise garbage this way).
I suspect the most efficient way is to use an EnumMap if possible (if you have known key attributes), or POJOs, even if most fields are not used.
There are a few problems with reusing HashMaps.
Even if the key and value data were to take no memory (shared from other places), the Map.Entry objects would dominate memory usage but not be reused (unless you did something a bit special).
Because of generational GC, having old objects point to new ones is generally expensive (and it is relatively difficult to see what's going on). This might not be an issue if you are keeping millions of these.
More complicated code is more difficult to optimise. So keep it simple, and then do the big optimisations, which probably involve changing the data structures.

Why does java.util.Map.values() allow you to remove entries from the returned Collection

Why does java.util.Map.values() allow you to delete entries from the returned Collection when it makes no sense to remove a key-value pair based on the value? The code which does this would have no idea which key the value being removed is mapped from (and hence which key is being removed). Especially when there are duplicate values, calling remove on that Collection would result in an unexpected key being removed.
it makes no sense to remove a key value pair based on the value
I don't think you're being imaginative enough. I'll admit there probably isn't wide use for it, but there will be valid cases where it would be useful.
As a sample use case, say you had a Map<Person, TelephoneNumber> called contactList. Now you want to filter your contact list by those that are local.
To accomplish this, you could make a copy of the map, localContacts = new HashMap<>(contactList), and remove all mappings where the TelephoneNumber starts with an area code other than your local area code. This would be a valid case where you want to iterate through the values collection and remove some of the values:
Map<Person, TelephoneNumber> contactList = getContactList();
Map<Person, TelephoneNumber> localContacts = new HashMap<Person, TelephoneNumber>(contactList);
for (Iterator<TelephoneNumber> valuesIt = localContacts.values().iterator(); valuesIt.hasNext(); ) {
    TelephoneNumber number = valuesIt.next();
    if (!number.getAreaCode().equals(myAreaCode)) {
        valuesIt.remove();
    }
}
Especially when there are duplicate values, calling remove on that Collection would result in an unexpected key being removed.
What if you wanted to remove all mappings with that value?
It has to have a remove method because that's part of Collection. Given that, it has the choice of allowing you to remove values or throwing an UnsupportedOperationException. Since there are legitimate reasons that you might want to remove values, why not choose to allow this operation?
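As a concrete illustration of removing all mappings with a given value through the values() view (removeIf is Java 8+; on older versions the iterator loop shown earlier does the same job):
Map<String, Integer> m = new HashMap<String, Integer>();
m.put("a", 1);
m.put("b", 2);
m.put("c", 1);
m.values().removeIf(v -> v == 1); // removes both mappings with value 1, leaving {b=2}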
Maybe there's a given value where you want to remove every instance of it from the Map.
Maybe you want to trim out every third key/value pair for some reason.
Maybe you have a map from hotel room number to occupancy count and you want to remove everything from the map where the occupancy count is greater than one in order to find a room for someone to stay in.
...if you think about it more closely, there are plenty more examples like this...
In short: there are plenty of situations where this might be useful and implementing it doesn't harm anyone who doesn't use it, so why not?
I think there is quite often a use for removing an entry based on its value; other answers show examples. Given that, if you want to remove a certain value, why would you only want one particular key of it removed? Even if you did, you'd have to know which key you wanted to remove (or not, as the case may be), and then you should just remove it by key anyway.
The Collection returned is a special Collection, and its semantics are such that it knows how values in it relate back to the Map it came from. The javadoc indicates what Collection operation the returned collection supports.

Timeout Mechanism for Hashtable

I have a hashtable that is under heavy traffic. I want to add a timeout mechanism to the hashtable to remove records that are too old. My concerns are:
- It should be lightweight.
- The remove operation is not time-critical. I mean, with a timeout value of 1 hour, removal can happen after 1 hour or after 1 hour 15 minutes; there is no problem.
My idea is:
I create a big array (as a ring buffer) that stores the put time and the hashtable key.
When adding to the hashtable, I use the array index to find the next slot in the array.
If the array slot is empty, I put the insertion time and the hashtable key there.
If the array slot is not empty, I compare the stored insertion time to check whether a timeout has occurred.
If a timeout has occurred, I remove the entry from the hashtable (if not removed yet).
If no timeout has occurred, I increment the index until I find an empty or timed-out array slot.
When removing from the hashtable there is no operation on the big array.
In short, every add operation on the hashtable may remove one timed-out element from the hashtable or do nothing.
Is there a more elegant and lightweight solution?
Thanks for your help.
My approach would be to use the Guava MapMaker:
ConcurrentMap<String, MyValue> graphs = new MapMaker()
    .maximumSize(100)
    .expireAfterWrite(1, TimeUnit.HOURS)
    .makeComputingMap(
        new Function<String, MyValue>() {
            public MyValue apply(String string) {
                return calculateMyValue(string);
            }
        });
This might not be exactly what you're describing, but chances are it's close enough. And it's much easier to produce (plus it's using a well-tested code base).
Note that you can tweak the behaviour of the resulting Map by calling different methods before the make*() call.
You should rather consider using a LinkedHashMap or maybe a WeakHashMap.
The former has a constructor to set the iteration order of its elements to the order of last access; this makes it trivial to remove elements that are too old. And its removeEldestEntry method can be overridden to define your own policy for removing the eldest entry automatically after the insertion of a new one.
The latter uses weak references to keys, so any key which has no other reference to it can be automatically garbage collected.
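A minimal sketch of that removeEldestEntry idea (the names ExpiringMap, Timestamped and TIMEOUT_MS are illustrative, not from any library):
class ExpiringMap<K, V> extends LinkedHashMap<K, ExpiringMap.Timestamped<V>> {
    static final long TIMEOUT_MS = 60 * 60 * 1000; // 1 hour

    static class Timestamped<T> {
        final T value;
        final long insertedAt = System.currentTimeMillis(); // put time
        Timestamped(T value) { this.value = value; }
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, Timestamped<V>> eldest) {
        // called after each insertion; evicts the oldest entry once it has expired
        return System.currentTimeMillis() - eldest.getValue().insertedAt > TIMEOUT_MS;
    }
}
Each put() checks only the single eldest entry, which matches the question's relaxed requirement of "remove one timed-out element per add, or do nothing".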
I think a much easier solution is to use LRUMap from Apache Commons Collections. Of course you can write your own data structures if you enjoy it or you want to learn, but this problem is so common that numerous ready-made solutions exist. (I'm sure others will point you to other implementations too; after a while your problem will be choosing the right one from them :))
Under the assumption that the currently most heavily accessed items in your cache structure are in the significant minority, you may well get by with randomly selecting items for removal (you have a low probability of removing something very useful). I've used this technique and, in this particular application, it worked very well and took next to no implementation effort.
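A minimal sketch of that random-eviction technique (illustrative; cache stands for the map being bounded, and snapshotting the key set per eviction is the simple but not cheapest variant):
List<String> keys = new ArrayList<String>(cache.keySet()); // snapshot of the current keys
if (!keys.isEmpty()) {
    String victim = keys.get(new Random().nextInt(keys.size()));
    cache.remove(victim); // evict one randomly chosen entry
}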
