I was reading the JavaDoc for the Object.hashCode method, and it says:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer [...])
But whatever its implementation is, the hashCode method always returns a (let's assume positive) integer, so given Integer.MAX_VALUE + 1 different objects, two of them are going to have the same hashcode, by the pigeonhole principle.
Why is the JavaDoc here "denying" collisions? Is it a practical conclusion given that the internal address is used, as in "come on, you're never going to have Integer.MAX_VALUE + 1 objects in memory at once, so we can say it's practically always unique"?
EDIT
This bug entry (thank you Sleiman Jneidi) gives an exact idea of what I mean (it seems to be a discussion more than 10 years old):
It appears that many, perhaps the majority, of programmers take this to mean that the default implementation, and hence System.identityHashCode, will produce unique hashcodes.
The qualification "As much as is reasonably practical," is, in practice, insufficient to make clear that hashcodes are not, in practice, distinct.
The docs are indeed misleading. There is a bug opened ages ago which says that the docs are misleading, especially since the implementation is JVM-dependent; in practice, especially with massive heap sizes, it is quite likely that you will get collisions when mapping object identities to 32-bit integers.
There is an interesting discussion of hashcode collisions here:
http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/
In particular, this highlights that your practical conclusion, "you're never going to have Integer.MAX_VALUE + 1 objects in memory at once, so we can say it's practically always unique", is a long way from accurate due to the birthday paradox.
The conclusion from the link is that, assuming a random distribution of hashCodes, we only need 77,163 objects before we have a 50/50 chance of hashCode collision.
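To see where that figure comes from, here is a minimal sketch (the class name is made up) of the standard birthday-paradox approximation, solving exp(-n^2 / (2d)) = 0.5 for n, with d = 2^32 possible hash values:

public class BirthdayBound {
    public static void main(String[] args) {
        double d = 4294967296.0; // 2^32 possible int hash values
        // Solve exp(-n^2 / (2 * d)) = 0.5 for n: n = sqrt(2 * d * ln 2)
        double n = Math.sqrt(2.0 * d * Math.log(2.0));
        // Prints roughly 77164, matching the figure quoted above
        System.out.printf("~%.0f objects for a 50/50 collision chance%n", n);
    }
}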
When you read this carefully, you'll notice that this only means objects should try to avoid collisions ('as much as reasonably practical'), but also that you are not guaranteed to have different hashcodes for unequal objects.
So the promise is not very strong, but it is still very useful, for instance when using the hashcode as a quick test for inequality before doing the full equality check.
For instance, ConcurrentHashMap will use (a function performed on) the hashcode to assign a location to an object in the map. In practice the hashcode is used to find roughly where an object is located, and equals is used to pin it down precisely.
A hashmap could not use this optimization if objects didn't try to spread their hashcodes as much as possible.
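As a minimal sketch of that pattern (this is the general idea, not HashMap's actual source), a bucket lookup can use the cheap int comparison first and fall back to the potentially expensive equals() only when the hashes match:

import java.util.List;

class BucketLookup {
    static <K> boolean bucketContains(List<K> bucket, K key) {
        int h = key.hashCode(); // computed once per lookup
        for (K candidate : bucket) {
            // The cheap int comparison filters out almost all non-matches
            // before the full equals() check runs.
            if (candidate.hashCode() == h && candidate.equals(key)) {
                return true;
            }
        }
        return false;
    }
}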
Related
1.
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendable hashing, etc.). Which one does HashSet/HashMap make use of?
2.
I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function, but so far for strings I have not needed to implement this. What if I now want to hash a Java object that I create - do I need to implement the hash function? Or does Java have a built-in way of creating a hash code?
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
You could answer many of these questions yourself, by reading the source code for HashMap.
(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendable hashing, etc.). Which one does HashSet/HashMap make use of?
Chaining. See the source code. (Line 154 in the version I linked to).
How does Java's HashSet/HashMap hash the objects?
It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).
If you look at the code you will see some interesting wrinkles:
The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)
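For reference, in the Java 8 version of java.util.HashMap that scrambling step is the following one-liner (the older version linked above differs in detail):

// Spread the high bits of the hash downward, so that power-of-two
// table indexing (which only looks at the low bits) still sees them.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}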
What if I now want to hash a Java object that I create - do I need to implement the hash function?
You can do that.
Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirement on hashcode() and equals(Object):
If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
If !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are probable hash keys for the application.
(The last requirement is for performance reasons. If you have a "poor" hash function that results in a high probability that different keys hash to the same hashcode, you get lots of collisions. The hash chains become unbalanced, and you won't get the average O(1) performance normally expected of hash table operations. In the worst case, performance degrades to O(N); i.e. equivalent to a linear search of a linked list.)
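For illustration, here is a minimal sketch of a class that satisfies all three requirements (the class and its fields are made up):

import java.util.Objects;

final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals() compares, so equal
        // objects are guaranteed to get equal hash codes.
        return Objects.hash(x, y);
    }
}

Because the fields are final, the hash code can never change while the object is sitting in a HashSet, which takes care of the second requirement as well.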
Or does Java have a built-in way of creating a hash code?
Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is incorrect.
The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point in time, but it is NOT the object's memory address.
In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!
(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)
The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.
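A tiny check of the stability guarantee described above (System.gc() is only a hint, so the object may or may not actually be moved, but the printed result is true either way):

public class IdentityHashDemo {
    public static void main(String[] args) {
        Object o = new Object();
        int before = System.identityHashCode(o);
        System.gc(); // a collection cycle may relocate the object
        int after = System.identityHashCode(o);
        System.out.println(before == after); // always prints true
    }
}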
Question 1
The Java HashMap implementation uses chaining to deal with collisions. Think of it as an array of linked lists.
Question 2
Object has a default implementation of equals and hashCode. equals is implemented as return this == other and hashcode is (to all intents and purposes) implemented as assigning a random identifier to each instance and using that as the hashCode.
As all classes in Java extend Object, they all inherit these implementations.
Some classes override these implementations. String, as you mentioned, is a very good example. Others are the classes in the collections API; ArrayList, for instance, implements these methods based on the elements it contains.
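For example, List.hashCode() is specified in its Javadoc to be computed from the elements, which the following loop reproduces:

import java.util.List;

class ListHash {
    static int listHash(List<?> list) {
        // The exact formula from the List.hashCode() contract: the hash
        // depends only on the elements, so equal lists hash equally.
        int hashCode = 1;
        for (Object e : list) {
            hashCode = 31 * hashCode + (e == null ? 0 : e.hashCode());
        }
        return hashCode;
    }
}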
As far as implementing a good hashCode, this is a little bit of a dark art. Here's a pretty good summary of best practice.
Your final comment:
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is not correct. The default implementation of hashCode is constant for a given object within a single execution, as that is part of the method's contract. From the Javadoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
A friend of mine and I have the following bet going:
It is possible to get an Object back from memory by using the identity hashcode received for that Object via System.identityHashCode() in Java, with the restriction that it has not yet been cleaned up by the Garbage Collector.
I have been looking for an answer for quite a while now and am not able to find a definite one.
I think that it might be possible to do so using the JVMTI, but I haven't yet worked with it.
Does anyone of you have an answer to that? Will buy you a coffee if I can do so on your site ;)
Thanks in advance,
Felix
P.S.: I am saying this behaviour can be achieved, and my friend says it is not possible.
In theory it is possible; however, you have some issues:
It is randomly generated, so it is not unique. Any number of objects (though it's unlikely) could have the same identity hash code.
It is not a memory location; it doesn't change when the object is moved from Eden, around the Survivor spaces, or into tenured space.
You need to walk all the object roots to potentially find it.
If you can assume it is visible to a known object like a static collection, it should be easy to navigate via reflection.
BTW On the 64-bit OpenJDK/Oracle JVM, the identity hash code is stored in the header at offset 1; this means you can read it, or even change it, using sun.misc.Unsafe. ;)
BTW2 The 31-bit hashCode (not 32-bit) stored in the header is lazily set, and the same header bits are also used for biased locking; i.e. once you call Object.hashCode() or System.identityHashCode() you disable biased locking for the object.
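A heavily hedged sketch of that header trick: it assumes the 64-bit HotSpot mark-word layout on a little-endian machine plus the reflective sun.misc.Unsafe access hack, so it is not portable and may break on other JVMs or future releases:

import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class HeaderHashPeek {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        Object o = new Object();
        int hash = o.hashCode(); // forces the hash into the object header

        // Read 4 bytes of the mark word starting at byte offset 1, where
        // HotSpot lazily stores the identity hash (as described above).
        int fromHeader = unsafe.getInt(o, 1L);

        System.out.println(hash == fromHeader); // true on matching JVMs
    }
}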
I think your friend is going to win this bet. Java/the JVM manages the memory for you and there is no way to access it once you drop all your references to something.
Phantom references, weak references, etc. are designed to allow just what you are describing: if you keep a weak or phantom reference to something, you can get it back. An identityHashCode is neither of those, though.
C and C++ might let you do this since you have more direct control of the memory, but even then you would need the memory location not a hash of it.
No, because the identityHashCodes are not necessarily unique. They are not pointers to the objects.
No. The identityHashCode is not necessarily a memory address: it is only the default implementation of hashCode. It is also not guaranteed to be unique for all objects (though different instances will usually have different identityHashCodes).
Even if the identityHashCode is derived from a memory address, the object may be relocated (but the identityHashCode cannot change, by definition).
Possible Duplicate:
why do String.hashCode() in java is not implemented in a way with less conflicts?
For non-cryptographic hashes, how does Java's String.hashCode() perform?
Mostly I'm concerned about collisions.
Thanks.
You seem to be misunderstanding what .hashCode() is for with regards to Java, and more specifically, the .equals()/.hashCode() contract specified by java.lang.Object.
The only part of the contract that matters to anyone is this: if two objects are equal with regards to .equals(), then they must have the same hash code as returned by .hashCode(). There is no other obligation in that contract.
It is therefore perfectly legal to write a custom .hashCode() implementation like this, even though this is as suboptimal as one can think of:
@Override
public int hashCode()
{
// Legal, but useless
return 42;
}
Of course, JDK developers would never be that thick, and .hashCode() implementations for builtin types (including String) are good enough that you do not even need to worry about collisions. Even then, this implementation will more than likely vary from one JDK implementation to another, and so will its "cryptographic value".
But that's not the point.
The most important thing to consider is that .hashCode() has nothing to do with cryptography at all. Its only obligation is to obey the contract defined by java.lang.Object.
It's pretty good as a general purpose hash function. i.e. you shouldn't usually worry about it.
In particular:
It is fast, to the extent that it probably produces hashes as fast as the CPU can read the String from memory (i.e. you usually can't do better without skipping large parts of the String). It does just one multiply and one add per character in the String.
For typical sets of random Strings, it produces well-distributed hashes over the entire int range.
Obviously, it is not a cryptographic hash function, so don't use it for that. Also, be aware that you likely will get hash collisions as it is producing a 32-bit hash. So you just need to design your algorithms to take that into account.
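For reference, the String.hashCode() Javadoc specifies the value as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], which is exactly the one-multiply-one-add-per-character loop mentioned above:

// Equivalent to String.hashCode() as specified in its Javadoc:
// one multiply and one add per character.
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}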
How is hashCode() implemented?
My assumption is that it uses the object's memory location as the initial number (the seed) on which it runs the hash function. However, this is not the case.
I've also looked at Hash : How does it work internally? but it does not answer my question.
Yes I could download the SDK, but before I do that and look at the code, perhaps someone else already has knowledge of it.
Thanks :)
EDIT:
I know it should be overridden and such, so please try to stay on topic :)
No, no, no. All answers in this thread are wrong or at least only partially correct.
First:
Object.hashCode() is a native method, so its implementation depends solely on the JVM. It may vary between HotSpot and other VM implementations like JRockit or IBM J9.
If you are asking:
how is hashCode() implemented in Java?
Then the answer is: it depends on which VM you are using.
Assuming that you are using Oracle's default JVM, which is HotSpot, I can tell you that HotSpot has six hashCode() implementations. You can choose among them using the -XX:hashCode=n flag when launching the JVM from the command line, where n can be:
0 – Park-Miller RNG (default)
1 – f(address, global_state)
2 – constant 1
3 – Serial counter
4 – Object address
5 – Thread-local Xorshift
The above is copied from this post.
And if you dig around a little bit in the HotSpot source code, you may find the snippet below:
if (hashCode == 0) {
    value = os::random();
} else {
    ...
os::random() is just the implementation of the Park-Miller Pseudo Random Generator algorithm.
That's all. There isn't any notion of a memory address in the default implementation. Although two of the other implementations, 1 and 4, do use the object's memory address, the default one doesn't.
The notion that Object.hashCode() is based on the object's address is largely a historic artefact - it is no longer true.
I know that in the Object#hashCode() JavaDoc we can read:
(...) this is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.
But it is obsolete and misleading.
Of course it is implementation specific, but generally the hash code for an object will be computed lazily and stored in the object header. Odd things are done with headers to keep them small whilst allowing complex locking algorithms.
In the OpenJDK/Oracle JVM the usual method of computing the initial hash code is based on the memory address at the time of the first request. Objects move about in memory, so using the address each time would not be a good choice. The hash code isn't the actual address: that would typically be a multiple of eight, which isn't great for use directly in a hash table, particularly with a power-of-two size. Note that identity hash codes are not unique.
HotSpot has build-time options to always use zero, or to use a secure random number generator (SRNG), for testing purposes.
The implementation of the hashCode() function varies from class to class. If you want to know how a specific class implements hashCode(), you'll have to look it up for that class.
The hashCode method defined by class Object returns, as much as is reasonably practical, distinct integers for distinct objects. This could be implemented by converting the internal address of the object into an integer (but this implementation style is not required by the standard). It gets interesting with new classes which override hashCode in order to support hash tables (equals and hashCode):
http://www.javapractices.com/topic/TopicAction.do?Id=28
I'm assuming you're talking about the Object implementation of hashCode, since the method can and should be overridden.
It's implementation dependent. For the Sun JDK, it's based on the object's memory address.
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's POV.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible": e.g., for an enum containing up to 16 elements and a HashMap of size 16, there'd be no collisions at all (sure, using an EnumMap is better, but sometimes not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you? A sketch of this alternative follows below.
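For concreteness (the enum and method names here are made up; note that Enum.hashCode() itself is final, so this has to live in a separately named method):

enum Suit {
    CLUBS, DIAMONDS, HEARTS, SPADES;

    int deterministicHashCode() {
        // Stable across JVM runs: depends only on class name and ordinal.
        return ordinal() ^ getClass().getName().hashCode();
    }
}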
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for
  deterministic simulations
  ETag computation
  hunting down bugs depending e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no reason weighs much, maybe except for the speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in each field would be even faster, although negligibly.
The explanation of why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with the hashCode performance in general. Even when I understand them, there's still the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason I can imagine for using Object's hashCode() and for making it final is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue: "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have it?" Well, to begin with, you can have several distinct string references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by overriding it with a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were bigger with your algorithm than with the current one, due to its additional complexity.
I asked the same question, because I did not see this one: why does hashCode() in Enum refer to the Object hashCode() implementation instead of the ordinal() function?
I encountered it as a problem when defining my own hash function for an object relying on an enum's hashCode as one of its components. When checking values in a Set of objects returned by the function, I checked them in an order which I expected to be the same, since I define the hashCode myself, and so I expected elements to fall on the same nodes of the tree; but since the hashCode returned by the enum changes from run to run, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing hashCode for their objects realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object representing this constant, it doesn't matter that its hashcode is based on its address, since no other object can occupy the same address at the same time. It is guaranteed to be unique and "deterministic" (in the sense that within the same VM, the constant is always represented by the same object in memory, no matter what its reference is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashCode() practically never collides, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashCode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object¹ to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
¹ I thought it was clear enough: an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason I could imagine for it being implemented like this is the requirement for hashCode() and equals() to be consistent, and the design goal that enums should be simple to use and compile-time constant (to use them as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behaviour.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the de-serializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
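A small check of that serialization behaviour (the enum and class names are made up): the constant survives a round trip as the very same instance, so even == keeps working:

import java.io.*;

enum Color { RED, GREEN }

public class EnumRoundTrip {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(Color.RED); // only the name "RED" is written
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            Object back = in.readObject(); // resolved to the existing constant
            System.out.println(back == Color.RED); // prints true
        }
    }
}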