Changing hashCode of object stored in hash-based collection


I have a hash-based collection of objects, such as HashSet or HashMap. What issues can I run into when the implementation of hashCode() is such that it can change with time because it's computed from some mutable fields?
How does it affect Hibernate? Is there any reason why having hashCode() return object's ID by default is bad? All not-yet-persisted objects have id=0, if that matters.
What is a reasonable implementation of hashCode for Hibernate-mapped entities? Once set, the ID is immutable, but that does not hold at the moment an entity is first saved to the database.
I'm not worried about performance of a HashSet with a dozen entities with key=0. What I care about is whether it's safe for my application and Hibernate to use ID as hash code, because ID changes as it is generated on persist.

If the hash code of the same object changes over time, the results are basically unpredictable. Hash collections use the hash code to assign objects to buckets -- if your hash code suddenly changes, the collection obviously doesn't know, so it can fail to find an existing object because it hashes to a different bucket now.
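To make the failure concrete, here is a minimal sketch. The class names `MutableKey` and `LostElementDemo` are made up for illustration, and the `id` field stands in for a database ID that starts at 0 and is assigned later.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical key whose hash code depends on a mutable field.
class MutableKey {
    int id; // mutable on purpose, e.g. 0 until the database assigns a value

    MutableKey(int id) { this.id = id; }

    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && ((MutableKey) o).id == id;
    }

    @Override public int hashCode() { return Integer.hashCode(id); }
}

class LostElementDemo {
    /** Returns whether the set still finds the element after its hash code changed. */
    static boolean foundAfterMutation() {
        Set<MutableKey> set = new HashSet<>();
        MutableKey key = new MutableKey(0);
        set.add(key);          // stored under the hash of id=0
        key.id = 12345;        // hash code changes while the object is in the set
        return set.contains(key);
    }

    /** The element is still physically in the set, it just cannot be looked up. */
    static int sizeAfterMutation() {
        Set<MutableKey> set = new HashSet<>();
        MutableKey key = new MutableKey(0);
        set.add(key);
        key.id = 12345;
        set.remove(key);       // silently fails for the same reason
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation());
        System.out.println(sizeAfterMutation());
    }
}
```

The object is still held by the set (and shows up when iterating), but `contains` and `remove` no longer reach it.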
Returning an object's ID by itself isn't bad, but if many of them have id=0 as you mentioned, it will reduce the performance of the hash table: all objects with the same hash code go into the same bucket, so your hash table is now no better than a linear list.
Update: Theoretically, your hash code can change as long as nobody else is aware of it -- this implies exactly what @bestsss mentioned in his comment, which is to remove your object from any collections that may be holding it before the hash code changes and to insert it again afterwards. In practice, a better alternative is to generate your hash code from the actual content fields of your object rather than relying on the database ID.

If you add an object to a hash-based collection, then mutate its state so as to change its hashcode (and by implication probably the behaviour in .equals() calls), you may see effects including but not limited to:
Stuff you put in the collection seeming to not be there any more
Getting something out which is different to what you asked for
This is surely not what you want. So, I recommend making the hashcode only out of immutable fields. This is usually done by making the fields final and setting their values in the constructor.
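A minimal sketch of that recommendation, assuming a hypothetical `Book` entity whose `isbn` is an immutable business key and whose `id` is assigned later (no Hibernate annotations are shown, and the names are illustrative only):

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical entity-like class: equals/hashCode use only the final field.
class Book {
    private final String isbn; // immutable, set once in the constructor
    private long id;           // 0 until assigned on save; excluded from hashCode

    Book(String isbn) { this.isbn = Objects.requireNonNull(isbn); }

    void setId(long id) { this.id = id; }
    long getId() { return id; }

    @Override public boolean equals(Object o) {
        return o instanceof Book && ((Book) o).isbn.equals(isbn);
    }

    @Override public int hashCode() { return isbn.hashCode(); }
}

class StableHashDemo {
    /** The ID changes while the book is in the set, but lookups keep working. */
    static boolean foundAfterIdAssigned() {
        Set<Book> books = new HashSet<>();
        Book book = new Book("978-0-13-468599-1");
        books.add(book);
        book.setId(42L); // simulates the ID being generated on persist
        return books.contains(book);
    }
}
```

Whether a suitable immutable business key exists depends on the domain; this only shows the mechanics.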

http://community.jboss.org/wiki/EqualsandHashCode

Don't change the hash code of an element in a hash-based collection after it has been put there.
Many programmers fall into this pitfall.
You can think of the hash code as a kind of address within the collection: you can't change an element's address after it has been put in the collection.

The Javadoc specifically says that the built-in Collections don't support this. So don't do it.

Related

Collections in java - how to choose the appropriate one

I'm learning about collections and trying to ascertain the best one to use for my practice exercise.....I've done a lot of reading on them, but still can't find the best approach.....this may sound a bit woolly but any guidance at all would be appreciated....
I need to associate a list of Travellers with a list of Boarding Passes. Both classes contain a mutable boolean field that will be modified during my programme; all other fields are immutable. That boolean field must exist. I'll need to create a collection of 10 travellers, and then when all criteria have been met, instantiate a boarding pass and associate it with them.
There won't be any duplicates of either due to each object having a unique reference variable associated with them, created through an object factory.
From doing some reading I understand that Sets must contain immutable objects, and don't allow duplicate elements, whereas Lists are the opposite.
Because I need to associate them with each other, I was thinking a Map, but I now know that the keys are stored in a set, which would be problematic due to the aforementioned reasons....
Could I override the hashCode() method so that it doesn't take the boolean field into consideration, so that as long as all of my other fields are immutable it should be fine? Or is that bad practice?
I also thought about creating a list of Travellers, and then trying to associate a Boarding Pass another way, but couldn't think of how that could be achieved....
Please don't give me any code - just some sort of a steer in the right direction would be really helpful.

If you are looking for a best practice, you need to think about what you are planning to do with the data now and in the (near) future. When you know what this is, you need to check which of the options (list, set and map) works best for you. If you want to compare the three, have a look here

You've been misled about the mutability requirements of set members and map keys.
When you do a lookup in a HashMap, you do it based on the key's hashCode. If you have mutable objects as keys, and mutating the object modifies the hashCode value, then this is a problem.
If a key was inserted into the table when it had a hashCode of 123, but later it's modified to have a hashCode of 345, you won't be able to find it again later since it's stored in the 123 bucket.
If the mutable boolean field does not influence your hashCode values (e.g., you didn't override hashCode or equals on your key class), then there's no issue.
That said, since you say you'll only have one unique instance of each passenger, Boris's suggestion in the comments about using an IdentityHashMap is probably the way to go. The IdentityHashMap gives the same behavior as a HashMap whose keys all use the default (identity-based) implementations for hashCode and equals. This way you'll get the expected behavior whether or not you've overridden equals and/or hashCode for other purposes.
(Note that you need to take equality into account as well as the hashCode.)
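As a rough illustration of the IdentityHashMap behaviour described above, the sketch below uses a hypothetical `Traveller` class whose equals and hashCode deliberately include the mutable boolean. The class and the boarding pass string are invented for the example.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical class whose hashCode changes when the boolean is flipped.
class Traveller {
    final String name;
    boolean checkedIn; // mutable, and (deliberately) part of equals/hashCode

    Traveller(String name) { this.name = name; }

    @Override public boolean equals(Object o) {
        return o instanceof Traveller
                && ((Traveller) o).name.equals(name)
                && ((Traveller) o).checkedIn == checkedIn;
    }

    @Override public int hashCode() { return Objects.hash(name, checkedIn); }
}

class IdentityDemo {
    /** Lookup by the same instance works even though its hashCode() changed. */
    static String lookupAfterMutation() {
        Map<Traveller, String> passes = new IdentityHashMap<>();
        Traveller t = new Traveller("Ada");
        passes.put(t, "BP-001");
        t.checkedIn = true; // would break a HashMap lookup, not an IdentityHashMap
        return passes.get(t);
    }

    /** An equal-looking but distinct instance is a different key. */
    static String lookupWithOtherInstance() {
        Map<Traveller, String> passes = new IdentityHashMap<>();
        passes.put(new Traveller("Ada"), "BP-001");
        return passes.get(new Traveller("Ada"));
    }
}
```

The second method shows the trade-off: only the exact same object finds the entry, which is fine when each traveller exists exactly once.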

Caching hashes in Java collections?

When I implement a collection that uses hashes for optimizing access, should I cache the hash values or assume an efficient implementation of hashCode()?
On the other hand, when I implement a class that overrides hashCode(), should I assume that the collection (i.e. HashSet) caches the hash?
This question is only about performance vs. memory overhead. I know that the hash value of an object should not change.
Clarification:
A mutable object would of course have to clear the cached value when it is changed, whereas the collection relies on objects not changing. But this is not relevant for my question.

When designing Guava's ImmutableSet and ImmutableMap classes, we opted not to cache hash codes. This way, you'll get better performance from hash code caching when and only when you care enough to do the caching yourself. If we cached them ourselves, we'd be costing you extra time and memory even in the case that you care deeply about speed and space!
It's true that HashMap does this caching, but it was HashMap's author (Josh Bloch) who strongly suggested we not follow that precedent!
Edit: oh, also, if your hashCode() is slow, the caching by the collection only addresses half of the problem anyway, as hashCode() still must be invoked on the object passed in to get() no matter what.
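Both points can be observed with a small experiment. This is a sketch that relies on how OpenJDK's HashMap stores the hash in each entry; `CountingKey` is a made-up class that counts calls to its own hashCode().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key that counts how often its hashCode() is invoked.
class CountingKey {
    int hashCodeCalls;

    @Override public int hashCode() {
        hashCodeCalls++;
        return 7;
    }
    // equals() is inherited from Object (identity comparison)
}

class HashCachingDemo {
    /** Returns { calls after put and resizes, calls after one extra get }. */
    static int[] countCalls() {
        Map<Object, String> map = new HashMap<>();
        CountingKey key = new CountingKey();
        map.put(key, "value");              // one hashCode() call
        for (int i = 0; i < 1000; i++) {
            map.put(i, "filler");           // forces the table to resize several times
        }
        int afterResizes = key.hashCodeCalls;
        map.get(key);                       // the argument of get() is hashed again
        int afterGet = key.hashCodeCalls;
        return new int[] { afterResizes, afterGet };
    }
}
```

On OpenJDK the resizes reuse the stored hash, so the count should stay at 1 until `get` is called, which hashes its argument once more.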

Considering that java.lang.String caches its hash, I guess that hashCode() is supposed to be fast.
So as a first approach, I would not cache hashes in my collection.
In my own objects, I would not cache the hash code unless it is obviously slow, and only if profiling tells me to.
If my objects will be used by others, I would probably consider caching hash codes sooner (but it needs measurements anyway).

On the other hand, when I implement a class that overrides hashCode(), should I assume that the collection (i.e. HashSet) caches the hash?
No, you should not make any assumptions beyond the scope of the class you are writing.
Of course you should try to make your hashCode cheap. If it isn't, and your class is immutable, create the hashCode on initialization or lazily upon the first request (see java.lang.String). If your class is not immutable, I don't see any other option than to re-calculate the hashCode every time.
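A sketch of the lazy variant for an immutable class, loosely modelled on the approach java.lang.String takes. `Fingerprint` is a hypothetical class; treating 0 as "not computed yet" is an assumption of this sketch.

```java
import java.util.Arrays;

// Hypothetical immutable value class with a lazily cached hash code.
final class Fingerprint {
    private final byte[] bytes;
    private int hash; // 0 means "not computed yet"

    Fingerprint(byte[] bytes) {
        this.bytes = bytes.clone(); // defensive copy keeps the object immutable
    }

    @Override public boolean equals(Object o) {
        return o instanceof Fingerprint && Arrays.equals(bytes, ((Fingerprint) o).bytes);
    }

    @Override public int hashCode() {
        int h = hash;
        if (h == 0) {               // first request, or the real hash happens to be 0
            h = Arrays.hashCode(bytes);
            hash = h;               // racing threads would compute the same value
        }
        return h;
    }
}
```

If the real hash code happens to be 0 it is simply recomputed on every call, which is harmless. The cache is only safe because the bytes can never change after construction.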

I'd say in most cases you can rely on efficient implementations of hashCode(). AFAIK, that method is only invoked on lookup methods (like contains, get etc.) or methods that change the collection (add/put, remove etc.).
Thus, in most cases there shouldn't be any need to cache hashes yourself.

Why do you want to cache it? You need to ask objects what their hashcode is while you're working with it to allocate it to a hash bucket (and any objects that are in the same bucket that may have the same hashcode), but then you can forget it.
You could store objects in a wrapper HashNode or something, but I would try implementing it first without caching (just like HashSet et al does) and see if you need the added performance and complexity before going there.

Java Collections with Mutable Objects

How does a TreeSet, HashSet or LinkedHashSet behave when the objects are mutable? I cannot imagine that they would work in any sense?
If I modify an object after I have added it; what is the behaviour of the list?
Is there a better option for dealing with a collection of mutable objects (which I need to sort/index/etc) other than a linked list or an array and simply iterating through them each time?

The Set interface addresses this issue directly: "Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element."
Addendum:
Is there a better option for dealing with a collection of mutable objects?
When trying to decide which collection implementation is most suitable, it may be worth looking over the core collection interfaces. For Set implementations in particular, as long as equals() and hashCode() are implemented correctly, any unrelated attributes may be mutable. By analogy with a database relation, any attribute may change, but the primary key must be inviolate.

Being mutable is only a problem for the collection if the objects' hashCode and behaviour of compare methods change after it is inserted.
The way you could handle this is to remove the object from the collection before such a change and re-add it afterwards, so that the object is stored under its new hash code or sort position.
In essence this results in an immutable object from the collection's point of view.
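For a sorted set, the remove, mutate, re-add sequence might look like the sketch below. `Task` and `TaskQueue` are hypothetical names; the important detail is that the removal happens while the object still has its old sort key.

```java
import java.util.TreeSet;

// Hypothetical mutable element, ordered by priority and then by name.
class Task {
    final String name;
    int priority;

    Task(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }
}

class TaskQueue {
    private final TreeSet<Task> tasks = new TreeSet<>((a, b) -> {
        int c = Integer.compare(a.priority, b.priority);
        return c != 0 ? c : a.name.compareTo(b.name);
    });

    void add(Task t) { tasks.add(t); }

    /** Changes the priority without corrupting the set's ordering. */
    void reprioritize(Task t, int newPriority) {
        boolean present = tasks.remove(t); // must happen BEFORE the mutation
        t.priority = newPriority;
        if (present) {
            tasks.add(t);                  // re-inserted at its new position
        }
    }

    Task first() { return tasks.first(); }
    int size() { return tasks.size(); }
}
```

All mutations have to go through `reprioritize`; changing `priority` directly while the task is in the set would leave it at a stale position.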
Another less performant way could be to keep a set containing all objects and creating a TreeSet/HashSet when you need the set to be sorted or indexed. This is no real solution for a situation where the objects change constantly and you need map access at the same time.

The "best" way to deal with this situation is to keep ancillary data structures for lookup, a bit like indexes in a database. Then all of your modifications need to make sure the indexes are updated. Good examples would be maps or multimaps - before an update, remove the entry from any indexes, and then after an update add them back in with the new values. Obviously this needs care with concurrency etc.
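The index idea can be sketched as follows, assuming a hypothetical `User` class looked up by a mutable e-mail address. Concurrency is ignored here, and all names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mutable object that we want to find by e-mail.
class User {
    private String email;

    User(String email) { this.email = email; }

    String getEmail() { return email; }
    void setEmail(String email) { this.email = email; }
}

// Ancillary lookup structure, kept in sync on every update.
class UserIndex {
    private final Map<String, User> byEmail = new HashMap<>();

    void add(User u) { byEmail.put(u.getEmail(), u); }

    /** Remove from the index, update, then add back under the new value. */
    void changeEmail(User u, String newEmail) {
        byEmail.remove(u.getEmail());
        u.setEmail(newEmail);
        byEmail.put(newEmail, u);
    }

    User find(String email) { return byEmail.get(email); }
}
```

The map keys are immutable Strings, so the mutable `User` objects themselves never have to be hashed.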

Do all Hash-based datastructures in java use the 'bucket' concept?

The hash structures I am aware of are Hashtable, HashSet and HashMap.
Do they all use the bucket structure, i.e. when two hash codes are exactly the same, one element does not overwrite the other; instead they are placed in the same bucket associated with that hash code?

In Sun's current implementation of the Java library, IdentityHashMap and the internal implementation in ThreadLocal use probing structures.
The general problem with probing hash tables in Java is that hashCode and equals may be relatively expensive. Therefore you want to cache the hash value. You can't have an array that mixes references and primitives, so you'd need to do something relatively complicated. On the other hand, if you are using == to check matches, then you can check many references without a performance problem.
IIRC, Azul had a fast concurrent quadratic probing hash map.

A linked list is used at each bucket for dealing with hash collisions. Note that the java HashSet is actually implemented by a HashMap underneath (all keys being mapped to the same singleton value across all HashSets) and hence uses the same bucket structure.
If an element is added, its equality is checked against all items in the linked list (via .equals) before it is added at the end. Hence having hash collisions is particularly bad, as this could be an expensive check as the linked list becomes larger.

I believe Java hash structures all use a form of chaining to deal with collisions when performing the hashing, which places the items that have the same hash into a list.
I do not believe that Java uses open addressing for its hash-based data structures (open addressing recomputes indices based on retry sequences until it finds an open slot in the table).

No -- open addressing is an alternate method of representing hash tables, where objects are stored directly in the table, instead of residing in a linked list. Only one object can be stored at a given index, so resolving collisions is more complicated.
When adding an object for which another object already resides at the same index, a probing sequence is used to determine the new index at which to store the new object. Removing objects is also more complicated, since if you remove an object, you need to leave a marker that says "there used to be an object here"; for more details, see Wikipedia.
Open addressing is preferable when the objects being stored are small and will rarely be deleted. Open addressing has improved cache performance, since you don't need to go through an extra level of indirection walking a linked list.
The classes you mentioned (Hashtable, HashSet, and HashMap) don't use open addressing, but you could easily create new classes that implement open addressing and provide the same APIs as those classes.
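To illustrate what such a class could look like, here is a deliberately small sketch of a set using open addressing with linear probing and removal markers. `ProbingSet` is a made-up name; it does not implement java.util.Set, does not accept null, and is not tuned for performance.

```java
// Minimal open-addressing set with linear probing (illustrative sketch only).
class ProbingSet {
    private static final Object TOMBSTONE = new Object(); // "there used to be an object here"
    private Object[] slots = new Object[16];
    private int size; // live elements
    private int used; // live elements plus tombstones

    boolean add(Object o) {
        if (contains(o)) return false;
        if ((used + 1) * 2 > slots.length) resize(); // keep the table at most half full
        int i = index(o, slots.length);
        while (slots[i] != null && slots[i] != TOMBSTONE) {
            i = (i + 1) % slots.length;              // probe the next slot
        }
        if (slots[i] == null) used++;                // reusing a tombstone keeps 'used' unchanged
        slots[i] = o;
        size++;
        return true;
    }

    boolean contains(Object o) { return find(o) >= 0; }

    boolean remove(Object o) {
        int i = find(o);
        if (i < 0) return false;
        slots[i] = TOMBSTONE; // leave a marker so later probes do not stop early
        size--;
        return true;
    }

    int size() { return size; }

    private int find(Object o) {
        int i = index(o, slots.length);
        while (slots[i] != null) {                   // an empty slot ends the probe sequence
            if (slots[i] != TOMBSTONE && slots[i].equals(o)) return i;
            i = (i + 1) % slots.length;
        }
        return -1;
    }

    private static int index(Object o, int length) {
        return (o.hashCode() & 0x7fffffff) % length;
    }

    private void resize() {
        Object[] old = slots;
        slots = new Object[old.length * 2];
        size = 0;
        used = 0;
        for (Object o : old) {
            if (o != null && o != TOMBSTONE) add(o); // rehash live elements, drop tombstones
        }
    }
}
```

With 16 slots, the Integers 1, 17 and 33 all start probing at index 1, so removing 17 and then looking up 33 exercises the marker logic.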

The APIs define the behaviour; the internals of how hash collisions are managed don't affect the guarantees of the API. The performance impact of bad hash value computation is another story. Let's just hash everything to 42 and see how it behaves.
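Taking that suggestion literally, a small experiment could look like this. `Always42` is a hypothetical class with a correct equals() but a constant hash code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical element type: valid equals(), worst possible hashCode().
final class Always42 {
    private final int value;

    Always42(int value) { this.value = value; }

    @Override public boolean equals(Object o) {
        return o instanceof Always42 && ((Always42) o).value == value;
    }

    @Override public int hashCode() { return 42; } // every instance collides
}

class ConstantHashDemo {
    /** Checks that the Set contract still holds for n colliding elements (n >= 1). */
    static boolean behavesLikeASet(int n) {
        Set<Always42> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(new Always42(i));
        }
        boolean duplicateRejected = !set.add(new Always42(0));
        boolean allFound = true;
        for (int i = 0; i < n; i++) {
            allFound &= set.contains(new Always42(i));
        }
        return set.size() == n && duplicateRejected && allFound;
    }
}
```

The set stays correct because a constant hash code still satisfies the hashCode contract; only the lookup cost degrades, since every operation has to search the single shared bucket.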

Maps and Sets are the interfaces that determine the behavior of a HashSet or HashMap. A HashSet is a Set, and so it behaves like a Set (ie duplicates are not allowed). A HashMap acts like a Map - it will not overwrite a key with a similar hashcode, but it will overwrite a key, if the same exact key is used again. This will be the same regardless of what data structure is backing the Map internally. See the javadoc for Sets and HashMaps for more.
Did you mean to ask something about the specific implementation of one of these structures?

Except the HashSet. Set is by definition unique elements.
This was a mistake. Please see the comments below.

Any disadvantage to using arbitrary objects as Map keys in Java?

I have two kinds of objects in my application where every object of one kind has exactly one corresponding object of the other kind.
The obvious choice to keep track of this relationship is a Map<type1, type2>, like a HashMap. But somehow, I'm suspicious. Can I use an object as a key in the Map, pass it around, have it sitting in another collection, too, and retrieve its partner from the Map any time?
After an object is created, all I'm passing around is an identifier, right? So probably no problem there. What if I serialize and deserialize the key?
Any other caveats? Should I use something else to correlate the object pairs, like a number I generate myself?

1. The key needs to implement .equals() and .hashCode() correctly.
2. The key must not be changed in any way that changes its .hashCode() value while it's used as the key.
3. Ideally any object used as a key in a HashMap should be immutable. This would automatically ensure that 2. always holds true.
4. Objects that could otherwise be GCed might be kept around when they are used as key and/or value.

I have two kinds of objects in my application where every object of one kind has exactly one corresponding object of the other kind.
This really sounds like a has-a relationship and thus could be implemented using a simple attribute.

It depends on the implementation of the map you choose:
HashMap uses equals() and hashCode(). By default (in Object) these are based on the object identity, which will work OK unless you serialize/deserialize. With a proper implementation of equals() and hashCode() based on the content of the object you will have no problems, as long as you don't modify it while it is a key in a hash map.
TreeMap uses compareTo(). There is no default implementation, so you need to provide one. The same limitations apply as for implementing hashCode() and equals() above.
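For the TreeMap case, a minimal sketch with a hypothetical immutable `Version` key that implements Comparable:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical immutable key; TreeMap only consults compareTo().
final class Version implements Comparable<Version> {
    private final int major;
    private final int minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override public int compareTo(Version other) {
        int c = Integer.compare(major, other.major);
        return c != 0 ? c : Integer.compare(minor, other.minor);
    }

    // equals/hashCode are kept consistent with compareTo
    @Override public boolean equals(Object o) {
        return o instanceof Version && compareTo((Version) o) == 0;
    }

    @Override public int hashCode() { return 31 * major + minor; }
}

class TreeMapKeyDemo {
    /** Looks the value up with a distinct but equal key instance. */
    static String lookup() {
        Map<Version, String> names = new TreeMap<>();
        names.put(new Version(1, 2), "stable");
        names.put(new Version(2, 0), "next");
        return names.get(new Version(1, 2));
    }
}
```

Because both fields are final, the key's position in the tree can never become stale.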

You could use a standard Map, but doing so you will keep strong references to your objects in the Map. If your objects are referenced in another structure and you need the Map just to link them together consider using a WeakHashMap.
And BTW you don't have to override equals and hashCode unless you have to consider several instances of an object as equal...

Can I use an object as a key in the Map, pass it around, have it sitting in another collection, too, and retrieve its partner from the Map any time?
Yes, no problem here at all.
After an object is created, all I'm passing around is an identifier, right? So probably no problem there. What if I serialize and deserialize the key?
That's right, you are only passing a reference around - they will all point to the same actual object. If you serialize or deserialize the object, that would create a new object. However, if your object implements equals and hashCode properly, you should still be able to use the new deserialized object to retrieve items from the map.
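The serialization case can be checked with an in-memory round trip. This is a sketch; `AccountId` and the helper names are invented, and standard Java serialization is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key with content-based equals/hashCode.
final class AccountId implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String value;

    AccountId(String value) { this.value = value; }

    @Override public boolean equals(Object o) {
        return o instanceof AccountId && ((AccountId) o).value.equals(value);
    }

    @Override public int hashCode() { return value.hashCode(); }
}

class SerializationDemo {
    /** Serializes and deserializes an object in memory, producing a new instance. */
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Looks up the partner using the deserialized copy of the key. */
    static String lookupWithCopy() {
        Map<AccountId, String> partners = new HashMap<>();
        AccountId key = new AccountId("A-1");
        partners.put(key, "partner");
        AccountId copy = (AccountId) roundTrip(key);
        return copy != key ? partners.get(copy) : null; // copy is a different object
    }
}
```

Without the equals/hashCode overrides the copy would hash by identity and the lookup would return null.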
Any other caveats? Should I use something else to correlate the object pairs, like a number I generate myself?
As for Caveats, yes, you can't change anything that would cause the hashCode of the object to change while the object is in the Map.

Any object can be a map key. The important thing here is to make sure that you override .equals() and .hashCode() for any objects that will be used as map keys.
The reason you do this is that if you don't, equals will be understood as object equality, and the only way you'll be able to find "equal" map keys is to have a handle to the original object itself.
You override hashcode because it needs to be consistent with equals. This is so that objects that you've defined as equals hash identically.

The failure points are the hashcode and equals functions. If they don't produce consistent and proper return values, the Map will behave strangely. Effective Java has a whole section on them and is highly, highly recommended.

You might consider Google Collection's BiMap.
