Consider a class with a comparable (consistent with equals) and a non-comparable field (of a class about which I do not know whether it overrides Object#equals or not).
The class' instances shall be compared, where the resulting order shall be consistent with equals, i.e. 0 returned iff both fields are equal (as per Object#equals) and consistent with the order of the comparable field. I used System.identityHashCode to cover most of the cases not covered by these requirements (the order of instances with same comparable, but different other value is arbitrary), but am not sure whether this is the best approach.
public class MyClass implements Comparable<MyClass> {
private Integer intField;
private Object nonCompField;
public int compareTo(MyClass other) {
int intFieldComp = this.intField.compareTo(other.intField);
if (intFieldComp != 0)
return intFieldComp;
if (this.nonCompField.equals(other.nonCompField))
return 0;
// ...and now? My current approach:
if (Systems.identityHashCode(this.nonCompField) < Systems.identityHashCode(other.nonCompField))
return -1;
else
return 1;
}
}
Two problems I see here:
If Systems.identityHashCode is the same for two objects, each is greater than the other. (Can this happen at all?)
The order of instances with same intField value and different nonCompField values need not be consistent between runs of the program, as far as I understand what Systems.identityHashCode does.
Is that correct? Are there more problems? Most importantly, is there a way around this?
The first problem, although highly unlikely, could happen (I think you would need an enormous amount of memory, and a very bad luck). But it's solved by Guava's Ordering.arbitrary(), which uses the identity hash code behind the scenes, but maintains a cache of comparison results for the cases where two different objects have the same identity hash code.
Regarding your second question, no, the identity hash codes are not preserved between runs.
Systems.identityHashCode […] the same for two objects […] (Can this happen at all?)
Yes it can. Quoting from the Java API Documentation:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects.
identityHashCode(Object x) returns the same hash code for the given object as would be returned by the default method hashCode(), whether or not the given object's class overrides hashCode().
So you may encounter hash collisions, and with memory ever growing but hash codes staying fixed at 32 bit, they will become increasingly more likely.
The order of instances with same intField value and different nonCompField values need not be consistent between runs of the program, as far as I understand what Systems.identityHashCode does.
Right. It might even be different during a single invocation of the same program: You could have (1,foo) < (1,bar) < (1,baz) even though foo.equals(baz).
Most importantly, is there a way around this?
You can maintain a map which maps each distinct value of the non-comparable type to a sequence number which you increase for each distinct value you encounter.
Memory management will be tricky, though: You cannot use a WeakHashMap as the code might make your key object unreachable but still hold a reference to another object of the same value. So either you maintain a list of weak references to all the objects of a given value, or you simply use strong references and accept the fact that any uncomparable value ever encountered will never be garbage collected.
Note that this scheme will still not result in reproducible sequence numbers unless you create values reproducibly in just the same order.
If the class of the nonCompField has implemented a reasonably good toString(), you might be able to use
return String.valueOf(this.nonCompField).compareTo(String.valueOf(other.nonCompField));
Unfortunately, the default Object.toString() uses the hashcode, which has potential issues as noted by others.
Related
I have a method that checks if two objects are equal(by reference).
public boolean isUnique( T uniqueIdOfFirstObject, T uniqueIdOfSecondObject ) {
return (uniqueIdOfFirstObject == uniqueIdOfSecondObject);
}
(Use case) Assuming that I don't have any control over creation of the object.
I have a method
void currentNodeExistOrAddToHashSet(Object newObject, HashSet<T> objectHash) {
// will it be 100% precise? Assuming both object have the same field values.
if(!objectHash.contains(newObject){
objectHash.add(newObject);
}
}
or I could do something like this
void currentNodeExistOrAddToHashSet(Object newObject, HashSet<T> objectHash){
//as per my knowledge, there might be collision for different objects.
int uniqueId = System.identityHashCode(newObject);
if(!objectHash.contains(uniqueId){
objectHash.add(uniqueId);
}
}
Is it possible to get a 100% collision proof Id in java i.e different object having different IDs, the same object having same ids irrespective of the content of the object?
Since you put them into a HashSet that uses hashcode/equals and hashCode is 32 bits long - this has a limit; thus collision will happen. Especially since a HashSet actually only cares about n-last bits before making itself bigger in size and thus adding one more bit and so on. You can read a lot more about this here for example.
The question is different here: why you want a collision free structure in the first place? If you define a fairly well distributed hashCode and a fairly decent equals - these things should not matter to you at all. If you worry about performance of a search, it is O(1) for HashSet.
You could define hashCode and equality based on UUID, like let's say UUID#randomUUID - but this still bounds your hashCode to the same 32-bits, thus collision could still happen.
Given that I some class with various fields in it:
class MyClass {
private String s;
private MySecondClass c;
private Collection<someInterface> coll;
// ...
#Override public int hashCode() {
// ????
}
}
and of that, I do have various objects which I'd like to store in a HashMap. For that, I need to have the hashCode() of MyClass.
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
I OPENED A SECOND QUESTION regarding the actual problem of uniquely IDing an object here: How do I generate an (almost) unique hash ID for objects?
Clarification:
is there anything more or less unqiue in your class? The String s? Then only use that as hashcode.
MyClass hashCode() of two objects should definitely differ, if any of the values in the coll of one of the objects is changed. HashCode should only return the same value if all fields of two objects store the same values, resursively. Basically, there is some time-consuming calculation going on on a MyClass object. I want to spare this time, if the calculation had already been done with the exact same values some time ago. For this purpose, I'd like to look up in a HashMap, if the result is available already.
Would you be using MyClass in a HashMap as the key or as the value? If the key, you have to override both equals() and hashCode()
Thus, I'm using the hashCode OF MyClass as the key in a HashMap. The value (calculation result) will be something different, like an Integer (simplified).
What do you think equality should mean for multiple collections? Should it depend on element ordering? Should it only depend on the absolute elements that are present?
Wouldn't that depend on the kind of Collection that is stored in coll? Though I guess ordering not really important, no
The response you get from this site is gorgeous. Thank you all
#AlexWien that depends on whether that collection's items are part of the class's definition of equivalence or not.
Yes, yes they are.
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
That's correct. It's not as onerous as it sounds because the rule of thumb is that you only need to override hashCode() if you override equals(). You don't have to worry about classes that use the default equals(); the default hashCode() will suffice for them.
Also, for your class, you only need to hash the fields that you compare in your equals() method. If one of those fields is a unique identifier, for instance, you could get away with just checking that field in equals() and hashing it in hashCode().
All of this is predicated upon you also overriding equals(). If you haven't overridden that, don't bother with hashCode() either.
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
Yes, you can rely on any collection type in the Java standard library to implement hashCode() correctly. And yes, any List or Set will take into account its contents (it will mix together the items' hash codes).
So you want to do a calculation on the contents of your object that will give you a unique key you'll be able to check in a HashMap whether the "heavy" calculation that you don't want to do twice has already been done for a given deep combination of fields.
Using hashCode alone:
I believe hashCode is not the appropriate thing to use in the scenario you are describing.
hashCode should always be used in association with equals(). It's part of its contract, and it's an important part, because hashCode() returns an integer, and although one may try to make hashCode() as well-distributed as possible, it is not going to be unique for every possible object of the same class, except for very specific cases (It's easy for Integer, Byte and Character, for example...).
If you want to see for yourself, try generating strings of up to 4 letters (lower and upper case), and see how many of them have identical hash codes.
HashMap therefore uses both the hashCode() and equals() method when it looks for things in the hash table. There will be elements that have the same hashCode() and you can only tell if it's the same element or not by testing all of them using equals() against your class.
Using hashCode and equals together
In this approach, you use the object itself as the key in the hash map, and give it an appropriate equals method.
To implement the equals method you need to go deeply into all your fields. All of their classes must have equals() that matches what you think of as equal for the sake of your big calculation. Special care needs to be be taken when your objects implement an interface. If the calculation is based on calls to that interface, and different objects that implement the interface return the same value in those calls, then they should implement equals in a way that reflects that.
And their hashCode is supposed to match the equals - when the values are equal, the hashCode must be equal.
You then build your equals and hashCode based on all those items. You may use Objects.equals(Object, Object) and Objects.hashCode( Object...) to save yourself a lot of boilerplate code.
But is this a good approach?
While you can cache the result of hashCode() in the object and re-use it without calculation as long as you don't mutate it, you can't do that for equals. This means that calculation of equals is going to be lengthy.
So depending on how many times the equals() method is going to be called for each object, this is going to be exacerbated.
If, for example, you are going to have 30 objects in the hashMap, but 300,000 objects are going to come along and be compared to them only to realize that they are equal to them, you'll be making 300,000 heavy comparisons.
If you're only going to have very few instances in which an object is going to have the same hashCode or fall in the same bucket in the HashMap, requiring comparison, then going the equals() way may work well.
If you decide to go this way, you'll need to remember:
If the object is a key in a HashMap, it should not be mutated as long as it's there. If you need to mutate it, you may need to make a deep copy of it and keep the copy in the hash map. Deep copying again requires consideration of all the objects and interfaces inside to see if they are copyable at all.
Creating a unique key for each object
Back to your original idea, we have established that hashCode is not a good candidate for a key in a hash map. A better candidate for that would be a hash function such as md5 or sha1 (or more advanced hashes, like sha256, but you don't need cryptographic strength in your case), where collisions are a lot rarer than a mere int. You could take all the values in your class, transform them into a byte array, hash it with such a hash function, and take its hexadecimal string value as your map key.
Naturally, this is not a trivial calculation. So you need to think if it's really saving you much time over the calculation you are trying to avoid. It is probably going to be faster than repeatedly calling equals() to compare objects, as you do it only once per instance, with the values it had at the time of the "big calculation".
For a given instance, you could cache the result and not calculate it again unless you mutate the object. Or you could just calculate it again only just before doing the "big calculation".
However, you'll need the "cooperation" of all the objects you have inside your class. That is, they will all need to be reasonably convertible into a byte array in such a way that two equivalent objects produce the same bytes (including the same issue with the interface objects that I mentioned above).
You should also beware of situations in which you have, for example, two strings "AB" and "CD" which will give you the same result as "A" and "BCD", and then you'll end up with the same hash for two different objects.
For future readers.
Yes, equals and hashCode go hand in hand.
Below shows a typical implementation using a helper library, but it really shows the "hand in hand" nature. And the helper library from apache keeps things simpler IMHO:
#Override
public boolean equals(Object o) {
if (this == o) {
return true;
}
if (o == null || getClass() != o.getClass()) {
return false;
}
MyCustomObject castInput = (MyCustomObject) o;
boolean returnValue = new org.apache.commons.lang3.builder.EqualsBuilder()
.append(this.getPropertyOne(), castInput.getPropertyOne())
.append(this.getPropertyTwo(), castInput.getPropertyTwo())
.append(this.getPropertyThree(), castInput.getPropertyThree())
.append(this.getPropertyN(), castInput.getPropertyN())
.isEquals();
return returnValue;
}
#Override
public int hashCode() {
return new org.apache.commons.lang3.builder.HashCodeBuilder(17, 37)
.append(this.getPropertyOne())
.append(this.getPropertyTwo())
.append(this.getPropertyThree())
.append(this.getPropertyN())
.toHashCode();
}
17, 37 .. those you can pick your own values.
From your clarifications:
You want to store MyClass in an HashMap as key.
This means the hashCode() is not allowed to change after adding the object.
So if your collections may change after object instantiation, they should not be part of the hashcode().
From http://docs.oracle.com/javase/8/docs/api/java/util/Map.html
Note: great care must be exercised if mutable objects are used as map
keys. The behavior of a map is not specified if the value of an object
is changed in a manner that affects equals comparisons while the
object is a key in the map.
For 20-100 objects it is not worth that you enter the risk of an inconsistent hash() or equals() implementation.
There is no need to override hahsCode() and equals() in your case.
If you don't overide it, java takes the unique object identity for equals and hashcode() (and that works, epsecially because you stated that you don't need an equals() considering the values of the object fields).
When using the default implementation, you are on the safe side.
Making an error like using a custom hashcode() as key in the HashMap when the hashcode changes after insertion, because you used the hashcode() of the collections as part of your object hashcode may result in an extremly hard to find bug.
If you need to find out whether the heavy calculation is finished, I would not absue equals(). Just write an own method objectStateValue() and call hashcode() on the collection, too. This then does not interfere with the objects hashcode and equals().
public int objectStateValue() {
// TODO make sure the fields are not null;
return 31 * s.hashCode() + coll.hashCode();
}
Another simpler possibility: The code that does the time consuming calculation can raise an calculationCounter by one as soon as the calculation is ready. You then just check whether or not the counter has changed. this is much cheaper and simpler.
What is the most efficient way to remember which objects are processed?
Obviously one could use a hash set:
Set<Foo> alreadyProcessed = new HashSet<>();
void process(Foo foo) {
if (!alreadyProcessed.contains(foo)) {
// Do something
alreadyProcessed.add(foo);
}
}
This makes me wonder why I would store the object, while I just want to check if the hash does exist in this set. Assuming that any hash of foo is unique.
Is there any more performant way to do this?
Take in mind that a very large number of objects will be processed and that the actual process code will not always be very heavy. Also it is not possible for me to have a precompiled worklist of objects, it will be build up dynamically during the processing.
Set#contains can be very fast. It depends how are your hashcode() and equals() methods are implemented. try to cache the hashcode value to make it more faster. (like String.java)
The other simple and fastes option is to add a boolean member to your Foo class: foo.done = true;
You can't use the hashcode since equality of the hashcode of two objects does not imply that the objects are equal.
Else depending on the use case you want to remember if you have already processed
a) the same object, tested by reference, or
b) an equal object, tested by a call to Object.equals(Object)
For b) you can use a standard Set implementation.
For a) also you can also use a standard Set implementation if you now that the equals-method is returning equality of references, or you would need something like a IdentityHashSet.
No mention of performance in this answer, you need to address correctness first!
Write good code. Optimize it for performance only if you can show that you need to in your use case.
There is no performance advantage in storing a hash code rather than the object. If you doubt that, remember that what is being stored is a reference to the object, not a copy of it. In reality that's going to be 64 bits, pretty much the same as the hash code. You've already spent a substantial amount of time thinking about a problem that none of your users will ever notice. (If you are doing this calculations millions of times in a tight, mission-critical loop that's another matter).
Using the set is simple to understand. Doing anything else runs a risk that a future maintainer will not understand the code and introduce a bug.
Also don't forget that a hash code it not guaranteed to be unique for every different object. Every so often storing the hash code will give you a false positive, causing you to fail to process an object you wanted to process. (As an aside, you need to make sure that equals() only considers two objects equal if they are the same object. The default Object.equals() does this, so don't override it)
Use the Set. If you are processing a very large number of objects, use a more efficient Set than HashSet. That is much more likely to give you a performance speedup than anything clever with hashing.
Is it possible for two instances of Object to have the same hashCode()?
In theory an object's hashCode is derived from its memory address, so all hashCodes should be unique, but what if objects are moved around during GC?
I think the docs for object's hashCode method state the answer.
"As much as is reasonably practical,
the hashCode method defined by class
Object does return distinct integers
for distinct objects. (This is
typically implemented by converting
the internal address of the object
into an integer, but this
implementation technique is not
required by the JavaTM programming
language.)"
Given a reasonable collection of objects, having two with the same hash code is quite likely. In the best case it becomes the birthday problem, with a clash with tens of thousands of objects. In practice objects a created with a relatively small pool of likely hash codes, and clashes can easily happen with merely thousands of objects.
Using memory address is just a way of obtaining a slightly random number. The Sun JDK source has a switch to enable use of a Secure Random Number Generator or a constant. I believe IBM (used to?) use a fast random number generator, but it was not at all secure. The mention in the docs of memory address appears to be of a historical nature (around a decade ago it was not unusual to have object handles with fixed locations).
Here's some code I wrote a few years ago to demonstrate clashes:
class HashClash {
public static void main(String[] args) {
final Object obj = new Object();
final int target = obj.hashCode();
Object clash;
long ct = 0;
do {
clash = new Object();
++ct;
} while (clash.hashCode() != target && ct<10L*1000*1000*1000L);
if (clash.hashCode() == target) {
System.out.println(ct+": "+obj+" - "+clash);
} else {
System.out.println("No clashes found");
}
}
}
RFE to clarify docs, because this comes up way too frequently: CR 6321873
Think about it. There are an infinite number of potential objects, and only 4 billion hash codes. Clearly, an infinity of potential objects share each hash code.
The Sun JVM either bases the Object hash code on a stable handle to the object or caches the initial hash code. Compaction during GC will not alter the hashCode(). Everything would break if it did.
Is it possible?
Yes.
Does it happen with any reasonable degree of frequency?
No.
I assume the original question is only about the hash codes generated by the default Object implementation. The fact is that hash codes must not be relied on for equality testing and are only used in some specific hash mapping operations (such as those implemented by the very useful HashMap implementation).
As such they have no need of being really unique - they only have to be unique enough to not generate a lot of clashes (which will render the HashMap implementation inefficient).
Also it is expected that when developer implement classes that are meant to be stored in HashMaps they will implement a hash code algorithm that has a low chance of clashes for objects of the same class (assuming you only store objects of the same class in application HashMaps), and knowing about the data makes it much easier to implement robust hashing.
Also see Ken's answer about equality necessitating identical hash codes.
Are you talking about the actual class Object or objects in general? You use both in the question. (And real-world apps generally don't create a lot of instances of Object)
For objects in general, it is common to write a class for which you want to override equals(); and if you do that, you must also override hashCode() so that two different instances of that class that are "equal" must also have the same hash code. You are likely to get a "duplicate" hash code in that case, among instances of the same class.
Also, when implementing hashCode() in different classes, they are often based on something in the object, so you end up with less "random" values, resulting in "duplicate" hash codes among instances of different classes (whether or not those objects are "equal").
In any real-world app, it is not unusual to find to different objects with the same hash code.
If there were as many hashcodes as memory addresses, then it would took the whole memory to store the hash itself. :-)
So, yes, the hash codes should sometimes happen to coincide.
If the hashCode() method is not overridden, what will be the result of invoking hashCode() on any object in Java?
In HotSpot JVM by default on the first invocation of non-overloaded Object.hashCode or System.identityHashCode a random number is generated and stored in the object header. The consequent calls to Object.hashCode or System.identityHashCode just extract this value from the header. By default it has nothing in common with object content or object location, just random number. This behavior is controlled by -XX:hashCode=n HotSpot JVM option which has the following possible values:
0: use global random generator. This is default setting in Java 7. It has the disadvantage that concurrent calls from multiple threads may cause a race condition which will result in generating the same hashCode for different objects. Also in highly-concurrent environment delays are possible due to contention (using the same memory region from different CPU cores).
5: use some thread-local xor-shift random generator which is free from the previous disadvantages. This is default setting in Java 8.
1: use object pointer mixed with some random value which is changed on the "stop-the-world" events, so between stop-the-world events (like garbage collection) generated hashCodes are stable (for testing/debugging purposes)
2: use always 1 (for testing/debugging purposes)
3: use autoincrementing numbers (for testing/debugging purposes, also global counter is used, thus contention and race conditions are possible)
4: use object pointer trimmed to 32 bit if necessary (for testing/debugging purposes)
Note that even if you set -XX:hashCode=4, the hashCode will not always point to the object address. Object may be moved later, but hashCode will stay the same. Also object addresses are poorly distributed (if your application uses not so much memory, most objects will be located close to each other), so you may end up having unbalanced hash tables if you use this option.
Typically, hashCode() just returns the object's address in memory if you don't override it.
From 1:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)
1 http://java.sun.com/javase/6/docs/api/java/lang/Object.html#hashCode
The implementation of hashCode() may differ from class to class but the contract for hashCode() is very specific and stated clearly and explicitly in the Javadocs:
Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable.
The general contract of hashCode is:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)
hashCode() is closely tied to equals() and if you override equals(), you should also override hashCode().
The default hashCode() implementation is nothing to do with object's memory address.
In openJDK, in version 6 and 7 it is a randomly generated number. In 8 and 9, it is a number based on the thread state.
Refer this link: hashCode != address
So the result of identity hash generation(the value returned by default implementation of hashCode() method) is generated once and cached in the object's header.
If you want to learn more about this you can go through OpenJDK which defines entry points for hashCode() at
src/share/vm/prims/jvm.h
and
src/share/vm/prims/jvm.cpp
If you go through this above directory, it seems hundred lines of functions that seems to be far more complicated to understand. So, To simplify this, the naively way to represent the default hashcode implementation is something like below,
if (obj.hash() == 0) {
obj.set_hash(generate_new_hash());
}
return obj.hash();
If hashcode is not overriden you will call Object's hashcode, here is an excerpt from its javadoc:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the JavaTM programming language.)
the default hashcode implementation gives the internal address of the object in the jvm, as a 32 bits integer. Thus, two different (in memory) objects will have different hashcodes.
This is consistent with the default implementation of equals. If you want to override equals for your objects, you will have to adapt hashCode so that they are consistent.
See http://www.ibm.com/developerworks/java/library/j-jtp05273.html for a good overview.
You must override hashCode in every class that overrides equals. Failure to do so will result in a violation of the general contract for Object.hashCode, which will prevent your class from functioning properly in conjunction with all hash-based collections, including HashMap, HashSet, and Hashtable.
A hashcode is useful for storing an object in a collection, such as a hashset. By allowing an Object to define a Hashcode as something unique it allows the algorithm of the HashSet to work effectively.
Object itself uses the Object's address in memory, which is very unique, but may not be very useful if two different objects (for example two identical strings) should be considered the same, even if they are duplicated in memory.
You should try to implement the hash code so that different objects will give different results. I don't think there is a standard way of doing this.
Read this article for some information.
Two objects with different hash code must not be equal with regard to equals()
a.hashCode() != b.hashCode() must imply !a.equals(b)
However, two objects that are not equal with regard to equals() can have the same hash code. Storing these objects in a set or map will become less efficient if many objects have the same hash code.
Not really an answer but adding to my earlier comment
internal address of the object cannot be guaranteed to remain unchanged in the JVM, whose garbage collector might move it around during heap compaction.
I tried to do something like this:
public static void main(String[] args) {
final Object object = new Object();
while (true) {
int hash = object.hashCode();
int x = 0;
Runtime r = Runtime.getRuntime();
List<Object> list = new LinkedList<Object>();
while (r.freeMemory() / (double) r.totalMemory() > 0.3) {
Object p = new Object();
list.add(p);
x += object.hashCode();//ensure optimizer or JIT won't remove this
}
System.out.println(x);
list.clear();
r.gc();
if (object.hashCode() != hash) {
System.out.println("Voila!");
break;
}
}
}
But the hashcode indeed doesn't change... can someone tell me how Sun's JDK actually implements Obect.hashcode?
returns 6 digit hex number. This is usually the memory location of the slot where the object is addressed. From an algorithmic per-se, I guess JDK does double hashing (native implementation) which is one of the best hashing functions for open addressing. This double hashing scheme highly reduces the possibility of collisions.
The following post will give a supportive idea -
Java - HashMap confusion about collision handling and the get() method