There seems to be an ongoing debate about whether it is safe to rely on the current implementation of String.hashCode() because, technically speaking, it is guaranteed by the specification (Javadoc).
Why did Sun specify String.hashCode()'s implementation in the specification?
Why would developers ever need to rely upon a specific implementation of hashCode()?
Why is Sun so afraid that the sky will fall if String.hashCode() is changed in the future? (This is probably explained by #2.)
A reason for relying on the specific implementation of hashCode() would be if it is ever persisted out into a database, file or any other storage medium. Bad Things(tm) would happen if the data was read back in when the hashing algorithm had changed. You could encounter unexpected hash collisions, and more worryingly, the inability to find something by its hash because the hash had changed between the data being persisted and "now".
In fact, that pretty much explains point #3 too =)
The reason for point #1 could be "to allow interoperability". If the hashCode implementation is locked down then data can be shared between different implementations of Java quite safely. i.e, the hash of a given object will always be the same irrespective of implementation.
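For reference, the String.hashCode() Javadoc does spell out the exact algorithm, s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], so a conforming implementation is pinned down to something like this sketch (the method name is mine):

// Sketch of the algorithm documented in the String.hashCode() Javadoc,
// computed with plain int arithmetic.
static int specifiedStringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}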
The implementation has changed since the original String class. If I recall, it used to be that only every 16th (?) character was used in the hash for "long" strings.
It may have been specified to promote serialization interoperability between subsequent versions of Java, or even between the runtimes of different vendors. I agree, a programmer should not rely on a particular implementation of hashCode() directly, but changing it could potentially break a lot of serialized collections.
Related
I was reading the JavaDoc for the Object.hashCode method, and it says that
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer [...])
But whatever its implementation is, the hashCode method always returns a (let's assume positive) integer, so given Integer.MAX+1 different objects, two of them are going to have the same hashcode.
Why is the JavaDoc here "denying" collisions? Is it a practical conclusion given that internal address is used and "come on, you're never going to have Integer.MAX+1 objects in memory at once, so we can say it's practically always unique"?
EDIT
This bug entry (thank you Sleiman Jneidi) gives an exact idea of what I mean (it seems to be a discussion that is more than 10 years old):
appears that many, perhaps majority, of programmers take this to mean that the default implementation, and hence System.identityHashCode, will produce unique hashcodes.
The qualification "As much as is reasonably practical," is, in practice, insufficient to make clear that hashcodes are not, in practice, distinct.
The docs are indeed misleading, and there is a bug opened ages ago saying exactly that. The implementation is JVM dependent, and in practice, especially with massive heap sizes, it is quite likely to get collisions when mapping object identities to 32-bit integers.
There is an interesting discussion of hashcode collisions here:
http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/
In particular, this highlights that your practical conclusion, "you're never going to have Integer.MAX+1 objects in memory at once, so we can say it's practically always unique" is a long way from accurate due to the birthday paradox.
The conclusion from the link is that, assuming a random distribution of hashCodes, we only need 77,163 objects before we have a 50/50 chance of hashCode collision.
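As a sanity check on that figure (this is not code from the linked article, just the standard birthday-paradox estimate n ≈ sqrt(2 * 2^32 * ln 2)):

// Back-of-the-envelope birthday-paradox estimate: the number of uniformly
// random 32-bit hash codes needed for a ~50% chance of at least one collision.
public class BirthdayBound {
    public static void main(String[] args) {
        double buckets = Math.pow(2, 32);                 // possible int hash values
        double n = Math.sqrt(2 * buckets * Math.log(2));
        System.out.printf("~%.0f objects for a 50%% collision chance%n", n);
        // prints roughly 77163, matching the figure above
    }
}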
When you read this carefully, you'll notice that this only means objects should try to avoid collisions ('as much as reasonably practical'), but also that you are not guaranteed to have different hashcodes for unequal objects.
So the promise is not very strong, but it is still very useful. For instance when using the hashcode as a quick indication of equality before doing the full check.
Take ConcurrentHashMap, for instance, which uses (a function performed on) the hashcode to assign a location to an object in the map. In practice the hashcode is used to find roughly where an object is located, and equals() is used to pinpoint the exact object.
A hashmap could not use this optimization if objects don't try to spread their hashcodes as much as possible.
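A minimal sketch of that "hash narrows, equals decides" pattern (this is not HashMap's actual code, which additionally spreads the hash bits and handles resizing; the helper below is purely illustrative):

// buckets is an array of lists indexed by a function of the hash code.
static <T> boolean bucketContains(java.util.List<T>[] buckets, T key) {
    int index = (key.hashCode() & 0x7fffffff) % buckets.length; // hash picks the bucket
    java.util.List<T> bucket = buckets[index];
    if (bucket == null) {
        return false;
    }
    for (T candidate : bucket) {
        if (candidate.equals(key)) { // equals() confirms the exact match
            return true;
        }
    }
    return false;
}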
Possible Duplicate:
why do String.hashCode() in java is not implemented in a way with less conflicts?
For non-cryptographic hashes, how does Java's String.hashCode() perform?
Mostly I'm concerned about collisions.
Thanks.
You seem to be misunderstanding what .hashCode() is for with regards to Java, and more specifically, the .equals()/.hashCode() contract specified by java.lang.Object.
The only part of the contract that matters to anyone is this: if two objects are equal with regards to .equals(), then they must have the same hash code as returned by .hashCode(). There is no other obligation in that contract.
It is therefore perfectly legal to write a custom .hashCode() implementation like this, even though this is as suboptimal as one can think of:
@Override
public int hashCode()
{
    // Legal, but useless
    return 42;
}
Of course, JDK developers would never be that thick, and .hashCode() implementations for builtin types (including String) are good enough that you do not even need to worry about collisions. Even then, such implementations may vary from one JDK implementation to another (String.hashCode() being a notable exception, since its algorithm is spelled out in the Javadoc), and so will their "cryptographic value".
But that's not the point.
The most important thing to consider is that .hashCode() has nothing to do with cryptography at all. Its only obligation is to obey the contract defined by java.lang.Object.
It's pretty good as a general purpose hash function. i.e. you shouldn't usually worry about it.
In particular:
It is fast, to the extent that it probably produces hashes as fast as the CPU can read the String from memory (i.e. you usually can't do better without skipping large parts of the String). It does just one multiply and one add per character in the String.
For typical sets of random Strings, it produces well-distributed hashes over the entire int range.
Obviously, it is not a cryptographic hash function, so don't use it for that. Also, be aware that you likely will get hash collisions as it is producing a 32-bit hash. So you just need to design your algorithms to take that into account.
How is hashCode() implemented?
My assumption is that it uses the object memory location as the initial number (the seed) on which it runs the hash function. However, this is not the case.
I've also looked at Hash : How does it work internally? but it does not answer my question.
Yes I could download the SDK, but before I do that and look at the code, perhaps someone else already has knowledge of it.
Thanks :)
EDIT:
I know it should be overridden and such, so please try to stay on topic :)
No, no, no. All answers in this thread are wrong or at least only partially correct.
First:
Object.hashCode() is a native method, so its implementation depends solely on the JVM. It may vary between HotSpot and other VM implementations like JRockit or IBM J9.
If you are asking:
how is hashCode() implemented in Java?
Then the answer is: it depends on which VM you are using.
Assuming that you are using Oracle's default JVM, which is HotSpot, I can tell you that HotSpot has six hashCode() implementations. You can choose between them with the -XX:hashCode=n flag when launching the JVM from the command line, where n can be:
0 – Park-Miller RNG (default)
1 – f(address, global_state)
2 – constant 1
3 – Serial counter
4 – Object address
5 – Thread-local Xorshift
The above is copied from this post.
And if you dig a little bit around in the HotSpot source code, you may find below snippet:
if (hashCode == 0) {
    value = os::random();
} else {
    ...
os::random() is just the implementation of the Park-Miller Pseudo Random Generator algorithm.
That's all. There isn't any notion of memory address. Although two other implementations, 1 and 4, use an object's memory address, the default one doesn't use it.
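If you want to see this for yourself, a tiny demo like the one below (class name mine) prints the identity hashes of two fresh objects. On HotSpot builds that still expose the flag, running with -XX:hashCode=2 should make both lines print 1; note that depending on the JDK version the flag may be a regular option, an experimental one (requiring -XX:+UnlockExperimentalVMOptions), or gone entirely.

public class IdentityHashDemo {
    public static void main(String[] args) {
        Object a = new Object();
        Object b = new Object();
        // With -XX:hashCode=2 both should print 1; with the default
        // strategy they print two unrelated-looking numbers.
        System.out.println(System.identityHashCode(a));
        System.out.println(System.identityHashCode(b));
    }
}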
The notion that Object.hashCode() is based on the object's address is largely a historic artefact - it is no longer true.
I know that inside Object#hashCode() JavaDoc we can read:
(...) this is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.
But it is obsolete and misleading.
Of course it is implementation specific, but generally the hash code for an object will be computed lazily and stored in the object header. Odd things are done with headers to keep them small whilst allowing complex locking algorithms.
In the OpenJDK/Oracle JVM the usual method of computing the initial hash code is based on the memory address at the time of the first request. Objects move about in memory, so using the address each time would not be a good choice. The hash code isn't the actual address - that would typically be a multiple of eight, which isn't great for use directly in a hash table, particularly one with a power-of-two size. Note that identity hash codes are not unique.
HotSpot has build time options to always use zero or use a secure random number generator (SRNG) for testing purposes.
The implementation of the hashCode() method varies from class to class. If you want to know how a specific class implements hashCode(), you'll have to look it up for that class.
The hashCode method defined by class Object typically returns distinct integers for distinct objects. This could be implemented by converting the internal address of the object into an integer (but this implementation technique is not required by the standard). It gets interesting with classes which override hashCode in order to support hash tables (equals and hashCode):
http://www.javapractices.com/topic/TopicAction.do?Id=28
I'm assuming you're talking about the Object implementation of hashCode, since the method can and should be overridden.
It's implementation dependent. For the Sun JDK, it's based on the object's memory address.
When I implement a collection that uses hashes for optimizing access, should I cache the hash values or assume an efficient implementation of hashCode()?
On the other hand, when I implement a class that overrides hashCode(), should I assume that the collection (i.e. HashSet) caches the hash?
This question is only about performance vs. memory overhead. I know that the hash value of an object should not change.
Clarification:
A mutable object would of course have to clear the cached value when it is changed, whereas the collection relies on objects not changing. But this is not relevant for my question.
When designing Guava's ImmutableSet and ImmutableMap classes, we opted not to cache hash codes. This way, you'll get better performance from hash code caching when and only when you care enough to do the caching yourself. If we cached them ourselves, we'd be costing you extra time and memory even in the case that you care deeply about speed and space!
It's true that HashMap does this caching, but it was HashMap's author (Josh Bloch) who strongly suggested we not follow that precedent!
Edit: oh, also, if your hashCode() is slow, the caching by the collection only addresses half of the problem anyway, as hashCode() still must be invoked on the object passed in to get() no matter what.
Considering that java.lang.String caches its hash, I guess that hashCode() is supposed to be fast.
So as a first approach, I would not cache hashes in my collection.
In the objects that I use, I would not cache the hash code unless it is obviously slow, and only do it if profiling tells me so.
If my objects will be used by others, I would probably consider caching hash codes sooner (but that needs measurements anyway).
On the other hand, when I implement a class that overrides hashcode(),
should I assume that the collection (i.e. HashSet) caches the hash?
No, you should not make any assumptions beyond the scope of the class you are writing.
Of course you should try to make your hashCode cheap. If it isn't, and your class is immutable, create the hashCode on initialization or lazily upon the first request (see java.lang.String). If your class is not immutable, I don't see any other option than to re-calculate the hashCode every time.
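For illustration, a minimal sketch of that lazy-caching idea for an immutable class, in the spirit of java.lang.String (the class and field names here are made up):

public final class Point {
    private final int x;
    private final int y;
    private int hash; // 0 means "not computed yet", like String's cache

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {          // recomputed only if the real hash happens to be 0
            h = 31 * x + y;
            hash = h;
        }
        return h;
    }
}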
I'd say in most cases you can rely on efficient implementations of hashCode(). AFAIK, that method is only invoked by lookup methods (contains, get, etc.) or by methods that change the collection (add/put, remove, etc.).
Thus, in most cases there shouldn't be any need to cache hashes yourself.
Why do you want to cache it? You need to ask objects for their hashcode while you're working with them, in order to allocate them to a hash bucket (and to compare them against any objects already in that bucket that may have the same hashcode), but after that you can forget it.
You could store objects in a wrapper HashNode or something, but I would try implementing it first without caching (just as HashSet et al. do) and see whether you need the added performance and complexity before going there.
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's point of view.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible"; e.g., for an enum containing up to 16 elements and a HashMap of size 16, there would certainly be no collisions (sure, using an EnumMap is better, but sometimes not possible, e.g. when there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
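Since Enum.hashCode() is final, the proposal above can only be approximated from the outside, for example with a helper like the following sketch (getDeclaringClass() is used instead of getClass() so that constant-specific bodies don't change the result):

// Sketch of the deterministic enum hash proposed above; it has to live in a
// helper because Enum.hashCode() itself cannot be overridden.
static int deterministicHash(Enum<?> e) {
    return e.ordinal() ^ e.getDeclaringClass().getName().hashCode();
}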
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for:
  deterministic simulations
  ETag computation
  hunting down bugs depending e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no reason weighs much, except maybe for speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per object you can get a deterministic hash code which is nearly four times faster. Storing the hash code in that field would be even faster, although only negligibly.
The explanation for why the standard hash code is not much faster is that it can't simply be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with hashCode performance in general. Even once I understand them, there's still the open question of why System.identityHashCode (which reads from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final I can imagine, is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue: "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have one?" Well, to begin with, you can have several distinct String references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to give them well-defined, deterministic hashCodes.
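Those wrapper hash codes really are pinned down in the Javadoc; for example, Long.hashCode() is specified as (int)(value ^ (value >>> 32)), which this small check simply restates:

// Long's Javadoc-specified hash: (int)(value ^ (value >>> 32)).
long v = 123_456_789_012L;
int h = (int) (v ^ (v >>> 32));
System.out.println(h == Long.valueOf(v).hashCode()); // prints true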
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and achieves this in practice, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were higher with your algorithm than with the current one, due to its additional complexity.
I've asked the same question, because I did not see this one: why does Enum's hashCode() refer to the Object hashCode() implementation instead of the ordinal() function?
I encountered it as a sort of a problem when defining my own hash function for an object that relies on an enum's hashCode as one of its components. When checking values in a Set of objects returned by that function, I checked them in an order which I expected to be the same, since I define the hashCode myself and so expected elements to fall on the same nodes of the tree. But since the hashCode returned by the enum changes from run to run, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using the ordinal instead. I am not sure everyone writing a hashCode for their object realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use the ordinal instead.
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object representing this constant, it doesn't matter that its hashcode is derived from its address, since no other object can occupy the same address at the same time. It is guaranteed to be unique and "deterministic" (in the sense that within the same VM, in memory, the constant is always the same object, no matter what its hash value is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashcode() will essentially never collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashcode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object1 to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
1 I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason I could imagine for it being implemented like this is the requirement for hashCode() and equals() to be consistent, and the design goal of enums that they should be simple to use and compile-time constant (to use them as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behaviour.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has special behaviour: it serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
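To illustrate that last point, a small round-trip sketch (the enum and class names are made up): the stream stores only the constant's name, and deserialization resolves it back to the canonical instance in the receiving JVM, so identity, and hence the identity hash, stays consistent within that JVM.

import java.io.*;

enum Color { RED, GREEN }

public class EnumSerDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(Color.RED); // only the name "RED" is written
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            Color c = (Color) in.readObject();
            System.out.println(c == Color.RED); // true: same canonical instance
        }
    }
}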