For non-cryptographic hashes, how does Java's String.hashCode() perform?
Mostly I'm concerned about collisions.
Thanks.
You seem to be misunderstanding what .hashCode() is for with regard to Java, and more specifically, the .equals()/.hashCode() contract specified by java.lang.Object.
The only part of the contract that matters to anyone is this: if two objects are equal with regard to .equals(), then they must have the same hash code as returned by .hashCode(). There is no other obligation in that contract.
It is therefore perfectly legal to write a custom .hashCode() implementation like this, even though this is as suboptimal as one can think of:
@Override
public int hashCode()
{
    // Legal, but useless
    return 42;
}
Of course, JDK developers would never be that thick, and the .hashCode() implementations for built-in types (including String) are good enough that you do not even need to worry about collisions. Even then, these implementations may well vary from one JDK to another, and so will their "cryptographic value".
But that's not the point.
The most important thing to consider is that .hashCode() has nothing to do with cryptography at all. Its only obligation is to obey the contract defined by java.lang.Object.
It's pretty good as a general-purpose hash function, i.e. you usually shouldn't need to worry about it.
In particular:
It is fast, to the extent that it probably produces hashes about as fast as the CPU can read the String from memory (i.e. you usually can't do better without skipping large parts of the String). It does just one multiply and one add per character in the String (see the sketch below).
For typical sets of random Strings, it produces well-distributed hashes over the entire int range.
Obviously, it is not a cryptographic hash function, so don't use it for that. Also, be aware that you likely will get hash collisions as it is producing a 32-bit hash. So you just need to design your algorithms to take that into account.
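For reference, the algorithm is actually specified in String's Javadoc as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. A minimal sketch of the same computation, just to make the "one multiply and one add per character" concrete (illustrative only, not the JDK source):

// Computes the same value String.hashCode() is documented to return:
// h = 31 * h + c for each character, starting from h = 0.
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}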
I was reading the JavaDoc for Object.hashCode method, it says that
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer [...])
But whatever its implementation is, the hashCode method always returns a (let's assume positive) integer, so given Integer.MAX+1 different objects, at least two of them are going to have the same hashcode.
Why is the JavaDoc here "denying" collisions? Is it a practical conclusion given that internal address is used and "come on, you're never going to have Integer.MAX+1 objects in memory at once, so we can say it's practically always unique"?
EDIT
This bug entry (thank you Sleiman Jneidi) gives an exact idea of what I mean (it seems to be a more than 10-year-old discussion):
appears that many, perhaps majority, of programmers take this to mean that the default implementation, and hence System.identityHashCode, will produce unique hashcodes.
The qualification "As much as is reasonably practical," is, in practice, insufficient to make clear that hashcodes are not, in practice, distinct.
The docs are indeed misleading, and there is a bug opened ages ago that says so, especially since the implementation is JVM dependent and, in practice, with massive heap sizes it is quite likely that you will get collisions when mapping object identities to 32-bit integers.
there is an interesting discussion of hashcode collisions here:
http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/
In particular, this highlights that your practical conclusion, "you're never going to have Integer.MAX+1 objects in memory at once, so we can say it's practically always unique" is a long way from accurate due to the birthday paradox.
The conclusion from the link is that, assuming a random distribution of hashCodes, we only need 77,163 objects before we have a 50/50 chance of hashCode collision.
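For the curious, the 77,163 figure follows from the birthday-paradox approximation P(all distinct) ≈ exp(-n(n-1)/2^33) for n uniformly random 32-bit hashes. A small throwaway program to reproduce it (the sample sizes are just illustrative):

public class HashCollisionOdds {
    public static void main(String[] args) {
        double space = Math.pow(2, 32); // number of possible int hash codes
        for (int n : new int[]{1_000, 10_000, 77_163, 200_000}) {
            // Birthday-paradox approximation: P(all distinct) ~ exp(-n(n-1)/(2*space))
            double allDistinct = Math.exp(-(double) n * (n - 1) / (2 * space));
            System.out.printf("n = %7d -> P(at least one collision) = %.3f%n", n, 1 - allDistinct);
        }
    }
}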
When you read this carefully, you'll notice that this only means objects should try to avoid collisions ('as much as reasonably practical'), but also that you are not guaranteed to have different hashcodes for unequal objects.
So the promise is not very strong, but it is still very useful. For instance when using the hashcode as a quick indication of equality before doing the full check.
Take ConcurrentHashMap, for instance, which uses (a function of) the hashcode to assign a location to an object in the map. In practice the hashcode is used to find roughly where an object is located, and equals is used to pinpoint it exactly.
A hashmap could not use this optimization if objects didn't try to spread their hashcodes as much as possible.
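To illustrate the idea (a sketch of the general pattern, not the actual ConcurrentHashMap source): a lookup typically compares the cached hash codes first and only pays for the full equals() call when they match.

// Hypothetical bucket-entry check: cheap int comparison first, equals() second.
static <K> boolean matchesEntry(int entryHash, K entryKey, int lookupHash, K lookupKey) {
    return entryHash == lookupHash
            && (entryKey == lookupKey || entryKey.equals(lookupKey));
}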
1.
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one does HashSet/HashMap make use of?
2.
I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function, but so far for strings I have not needed to implement this. What if I now want to hash a Java object that I create - do I need to implement the hash function? Or does Java have a built-in way of creating a hash code?
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
You could answer many of these questions yourself, by reading the source code for HashMap.
(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one does HashSet/HashMap make use of?
Chaining. See the source code. (Line 154 in the version I linked to).
How does Java's HashSet/HashMap hash the objects?
It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).
If you look at the code you will see some interesting wrinkles:
The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)
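For example, in OpenJDK 8's HashMap that "scrambling" step XORs the high half of the hash into the low half (the exact code differs between JDK versions, so treat this as a representative snapshot rather than the definitive implementation):

// Spread higher bits downward so that power-of-two sized tables
// still benefit from them (modelled on OpenJDK 8's HashMap.hash()).
static int spread(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}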
What if I now want to Hash a Java Object that I create - do I need to implement the hash function?
You can do that.
Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirement on hashcode() and equals(Object):
If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
If !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are likely to be used as hash keys by the application.
(The last requirement is there for performance reasons. If you have a "poor" hash function that results in a high probability that different keys hash to the same hashcode, you get lots of collisions. The hash chains will become unbalanced, and you won't get the average O(1) performance that is normally expected of hash table operations. In the worst case, performance will be O(N), i.e. equivalent to a linear search of a linked list.)
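Here is a minimal sketch of a class that satisfies all three requirements; Point is a made-up example, not something from the question:

import java.util.Objects;

// Immutable, so the hash code cannot change while the object is a key.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals() uses, so equal objects
        // always produce equal hash codes.
        return Objects.hash(x, y);
    }
}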
Or does Java have a built in way of creating a hash code?
Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is incorrect.
The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point in time, but it is NOT the object's memory address.
In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!
(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)
The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.
Question 1)
The Java HashMap implementation uses the chaining implementation to deal with collisions. Think of it as an array of linked lists.
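A rough sketch of that structure (illustrative only, not the real HashMap.Node source):

// Each slot of the bucket array holds the head of a singly linked list.
class Entry<K, V> {
    final int hash;     // cached hash of the key
    final K key;
    V value;
    Entry<K, V> next;   // next entry that landed in the same bucket

    Entry(int hash, K key, V value, Entry<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}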
Question 2)
Object has a default implementation of equals and hashCode. equals is implemented as return this == other and hashcode is (to all intents and purposes) implemented as assigning a random identifier to each instance and using that as the hashCode.
As all classes in Java extend Object, they all inherit these implementations.
Some classes override these implementations by default. String, as you mentioned, is a very good example. Another example is the classes in the collections API - ArrayList, for instance, implements these methods based on the elements it contains.
As far as implementing a good hashCode goes, it is a little bit of a dark art. Here's a pretty good summary of best practice.
Your final comment:
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is not correct. The default implementation of hashCode is consistent within a single execution of the application, as that is part of the method's contract. From the Javadoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
How is hashCode() implemented?
My assumption is that it uses the object memory location as the initial number (the seed) on which it runs the hash function. However, this is not the case.
I've also looked at Hash : How does it work internally? but it does not answer my question.
Yes, I could download the JDK source, but before I do that and look at the code, perhaps someone else already knows it.
Thanks :)
EDIT:
I know it should be overridden and such, so please try to stay on topic :)
No, no, no. All answers in this thread are wrong or at least only partially correct.
First:
Object.hashCode() is a native method, so its implementation depends solely on the JVM. It may vary between HotSpot and other VM implementations like JRockit or IBM J9.
If you are asking:
how is hashCode() implemented in Java?
Then the answer is: it depends on which VM you are using.
Assuming that you are using Oracle's default JVM, which is HotSpot, I can tell you that HotSpot has six hashCode() implementations. You can choose one with the -XX:hashCode=n flag when launching the JVM from the command line, where n can be:
0 – Park-Miller RNG (default)
1 – f(address, global_statement)
2 – constant 1
3 – Serial counter
4 – Object address
5 – Thread-local Xorshift
The above is copied from this post.
And if you dig around a little bit in the HotSpot source code, you may find the snippet below:
if (hashCode == 0) {
    value = os::random();
} else {
    ...
}
os::random() is just the implementation of the Park-Miller Pseudo Random Generator algorithm.
That's all. There isn't any notion of memory address. Although two other implementations, 1 and 4, use an object's memory address, the default one doesn't use it.
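For what it's worth, the Park-Miller ("minimal standard") generator is just next = 16807 * seed mod (2^31 - 1). A toy Java version to show the algorithm (HotSpot's actual C++ code is more elaborate and avoids the division):

// Toy Park-Miller / Lehmer generator; seed must start in [1, 2^31 - 2].
final class ParkMiller {
    private long seed;

    ParkMiller(long seed) {
        this.seed = seed;
    }

    int next() {
        seed = (16807 * seed) % 2147483647L; // modulus is 2^31 - 1
        return (int) seed;
    }
}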
The notion that Object.hashCode() is based on the object's address is largely a historic artefact - it is no longer true.
I know that inside Object#hashCode() JavaDoc we can read:
(...) this is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.
But it is obsolete and misleading.
Of course it is implementation specific, but generally the hash code for an object will be computed lazily and stored in the object header. Odd things are done with headers to keep them small whilst allowing complex locking algorithms.
In the OpenJDK/Oracle JVM the usual method of computing the initial hash code is based on the memory address at the time of the first request. Objects move about in memory, so using the address each time would not be a good choice. The hash code isn't the actual address - that would typically be a multiple of eight, which isn't great for using directly in a hash table, particularly one with a power-of-two size. Note that identity hash codes are not unique.
HotSpot has build time options to always use zero or use a secure random number generator (SRNG) for testing purposes.
The implementation of the hashcode() function varies from Object to Object. If you want to know how a specific class implements hashcode(), you'll have to look it up for that class.
As much as is reasonably practical, the hashCode method defined by class Object returns distinct integers for distinct objects. This could be implemented by converting the internal address of the object into an integer (but this implementation style is not required by the standard). It gets interesting with new classes which override hashCode in order to support hash tables (equals and hashCode):
http://www.javapractices.com/topic/TopicAction.do?Id=28
I'm assuming you're talking about the Object implementation of hashCode, since the method can and should be overridden.
It's implementation dependent. For the Sun JDK, it's based on the object's memory address.
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's point of view.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible", e.g., for an enum containing up to 16 elements and a HashMap of size 16, there'd be for sure no collisions (sure, using an EnumMap is better, but sometimes not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
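Since Enum.hashCode() is final and cannot be overridden, a deterministic variant like the one above can only live in a helper that you call when composing your own hash codes from enum-typed fields. A sketch (my own hypothetical helper, not part of the JDK); it uses getDeclaringClass() instead of getClass() so constants with bodies hash the same as plain constants:

// Deterministic hash for an enum constant, stable across JVM runs,
// following the ordinal ^ class-name idea above.
static int stableEnumHash(Enum<?> e) {
    return e.ordinal() ^ e.getDeclaringClass().getName().hashCode();
}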
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for:
  deterministic simulations
  ETag computation
  hunting down bugs depending e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no single reason weighs much, except maybe speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in each object would be even faster, although only negligibly.
The explanation for why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with hashCode performance in general. Even when I understand them, there's still the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final I can imagine, is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue: "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have it?" Well, to begin with, you can have several distinct string references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were bigger with your algorithm than with the current one, due to its additional complexity.
I've asked the same question, because I did not see this one. Why does Enum's hashCode() refer to the Object hashCode() implementation instead of the ordinal() function?
I encountered this as a problem when defining my own hash function for an object that relied on the enum's hashCode as one of its components. When checking values in a Set of such objects, returned by the function, I checked them in an order which I expected to be stable, since I define the hashCode myself, and so I expected elements to fall on the same nodes of the tree. But since the hashCode returned by the enum changes from run to run, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing hashCode for their object realizes this.
So basically, you can't define your own deterministic hashCode while relying on the enum's hashCode; you need to use ordinal instead.
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object representing this constant, it doesn't matter that its hashcode is based on its address, since no other object can occupy the same address at the same time. It is guaranteed to be unique and "deterministic" (in the sense that within the same VM, in memory, the constant will always be the same reference, no matter what it is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashcode() is guaranteed never to collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashcode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object1 to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
1 I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason I could imagine for why it is implemented like this is the requirement for hashCode() and equals() to be consistent, and the design goal of enums that they should be simple to use and compile-time constant (to use them as "case" constants in switch statements). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behavior.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
In a nutshell, the hashCode contract, according to Java's Object.hashCode():
The hash code shouldn't change unless something affecting equals() changes
equals() implies hash codes are ==
Let's assume interest primarily in immutable data objects - their information never changes after they're constructed, so #1 is assumed to hold. That leaves #2: the problem is simply one of confirming that equals implies hash code ==.
Obviously, we can't test every conceivable data object unless that set is trivially small. So, what is the best way to write a unit test that is likely to catch the common cases?
Since the instances of this class are immutable, there are limited ways to construct such an object; this unit test should cover all of them if possible. Off the top of my head, the entry points are the constructors, deserialization, and constructors of subclasses (which should be reducible to the constructor call problem).
[I'm going to try to answer my own question via research. Input from other StackOverflowers is a welcome safety mechanism to this process.]
[This could be applicable to other OO languages, so I'm adding that tag.]
EqualsVerifier is a relatively new open source project and it does a very good job at testing the equals contract. It doesn't have the issues the EqualsTester from GSBase has. I would definitely recommend it.
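Typical usage looks roughly like this, where Point is a hypothetical class under test (check the EqualsVerifier documentation for the exact API of the version you use):

import nl.jqno.equalsverifier.EqualsVerifier;
import org.junit.Test;

public class PointTest {
    @Test
    public void equalsAndHashCodeContract() {
        // Verifies the equals/hashCode contract for Point, including the
        // consistency between the two methods.
        EqualsVerifier.forClass(Point.class).verify();
    }
}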
My advice would be to think of why/how this might ever not hold true, and then write some unit tests which target those situations.
For example, let's say you had a custom Set class. Two sets are equal if they contain the same elements, but it's possible for the underlying data structures of two equal sets to differ if those elements are stored in a different order. For example:
MySet s1 = new MySet( new String[]{"Hello", "World"} );
MySet s2 = new MySet( new String[]{"World", "Hello"} );
assertEquals(s1, s2);
assertTrue( s1.hashCode()==s2.hashCode() );
In this case, the order of the elements in the sets might affect their hash, depending on the hashing algorithm you've implemented. So this is the kind of test I'd write, since it tests the case where I know it would be possible for some hashing algorithm to produce different results for two objects I've defined to be equal.
You should use a similar standard with your own custom class, whatever that is.
It's worth using the JUnit-addons library for this. Check out the class EqualsHashCodeTestCase (http://junit-addons.sourceforge.net/). You can extend it and implement createInstance and createNotEqualInstance; it will then check that the equals and hashCode methods are correct.
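A rough sketch of what such a test looks like, with Money as a hypothetical class under test (check the junit-addons documentation for the exact signatures in your version):

import junitx.extensions.EqualsHashCodeTestCase;

public class MoneyEqualsHashCodeTest extends EqualsHashCodeTestCase {

    public MoneyEqualsHashCodeTest(String name) {
        super(name);
    }

    @Override
    protected Object createInstance() throws Exception {
        // Any instance of the class under test.
        return new Money(10, "USD");
    }

    @Override
    protected Object createNotEqualInstance() throws Exception {
        // An instance that must NOT be equal to the one above.
        return new Money(20, "EUR");
    }
}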
I would recommend the EqualsTester from GSBase. It does basically what you want. I have two (minor) problems with it though:
The constructor does all the work, which I don't consider to be good practice.
It fails when an instance of class A is equal to an instance of a subclass of class A. This is not necessarily a violation of the equals contract.
[At the time of this writing, three other answers were posted.]
To reiterate, the aim of my question is to find standard cases of tests to confirm that hashCode and equals are agreeing with each other. My approach to this question is to imagine the common paths taken by programmers when writing the classes in question, namely, immutable data. For example:
Wrote equals() without writing hashCode(). This often means equality was defined to mean equality of the fields of two instances.
Wrote hashCode() without writing equals(). This may mean the programmer was seeking a more efficient hashing algorithm.
In the case of #2, the problem seems nonexistent to me. No additional instances have been made equal by equals(), so no additional instances are required to have equal hash codes. At worst, the hash algorithm may yield poorer performance for hash maps, which is outside the scope of this question.
In the case of #1, the standard unit test entails creating two instances of the same class with the same data passed to the constructor, and verifying equal hash codes. What about false positives? It's possible to pick constructor parameters that just happen to yield equal hash codes on a nonetheless unsound algorithm. A unit test that tends to avoid such parameters would fulfill the spirit of this question. The shortcut here is to inspect the source code for equals(), think hard, and write a test based on that, but while this may be necessary in some cases, there may also be common tests that catch common problems - and such tests also fulfill the spirit of this question.
For example, if the class to be tested (call it Data) has a constructor that takes a String, and instances constructed from Strings that are equals() yielded instances that were equals(), then a good test would probably test:
new Data("foo")
another new Data("foo")
We could even check the hash code for new Data(new String("foo")), to force the String to not be interned, although that's more likely to yield a correct hash code than Data.equals() is to yield a correct result, in my opinion.
Eli Courtwright's answer is an example of thinking hard of a way to break the hash algorithm based on knowledge of the equals specification. The example of a special collection is a good one, as user-made Collections do turn up at times, and are quite prone to muckups in the hash algorithm.
This is one of the only cases where I would have multiple asserts in a test. Since you need to test the equals method you should also check the hashCode method at the same time. So on each of your equals method test cases check the hashCode contract as well.
A one = new A(...);
A two = new A(...);
assertEquals("These should be equal", one, two);
int oneCode = one.hashCode();
assertEquals("HashCodes should be equal", oneCode, two.hashCode());
assertEquals("HashCode should not change", oneCode, one.hashCode());
And of course checking for a good hashCode is another exercise. Honestly, I wouldn't bother with the double check to make sure the hashCode wasn't changing within the same run; that sort of problem is better handled by catching it in a code review and helping the developer understand why that's not a good way to write hashCode methods.
You can also use something similar to http://code.google.com/p/guava-libraries/source/browse/guava-testlib/src/com/google/common/testing/EqualsTester.java
to test equals and hashCode.
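Usage is roughly as follows, with Point again standing in for whatever class you are testing; each equality group contains instances that should all be equal to each other (and have equal hash codes) and unequal to every other group:

import com.google.common.testing.EqualsTester;
import org.junit.Test;

public class PointEqualsTest {
    @Test
    public void equalsContract() {
        new EqualsTester()
            .addEqualityGroup(new Point(1, 2), new Point(1, 2))
            .addEqualityGroup(new Point(3, 4))
            .testEquals(); // also checks that equal objects have equal hash codes
    }
}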
If I have a class Thing, as most others do I write a class ThingTest, which holds all the unit tests for that class. Each ThingTest has a method
public static void checkInvariants(final Thing thing) {
...
}
and if the Thing class overrides hashCode and equals it has a method
public static void checkInvariants(final Thing thing1, Thing thing2) {
    ObjectTest.checkInvariants(thing1, thing2);
    // ... invariants that are specific to Thing
}
That method is responsible for checking all invariants that are designed to hold between any pair of Thing objects. The ObjectTest method it delegates to is responsible for checking all invariants that must hold between any pair of objects. As equals and hashCode are methods of all objects, that method checks that hashCode and equals are consistent.
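A sketch of what that shared ObjectTest helper might look like (my own illustration of the pattern described, not the poster's actual code):

import static org.junit.Assert.assertEquals;

public final class ObjectTest {

    // Invariants that must hold between ANY pair of objects.
    public static void checkInvariants(Object o1, Object o2) {
        boolean equal = o1.equals(o2);
        // equals must be symmetric
        assertEquals("equals must be symmetric", equal, o2.equals(o1));
        // equal objects must report equal hash codes
        if (equal) {
            assertEquals("equal objects must have equal hashCodes",
                    o1.hashCode(), o2.hashCode());
        }
    }
}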
I then have some test methods that create pairs of Thing objects, and pass them to the pairwise checkInvariants method. I use equivalence partitioning to decide what pairs are worth testing. I usually create each pair to be different in only one attribute, plus a test that tests two equivalent objects.
I also sometimes have a 3-argument checkInvariants method, although I find that it is less useful in finding defects, so I do not do this often.