The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is effectively a random number from the programmer's point of view.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible"; for example, for an enum containing up to 16 elements and a HashMap of size 16, there'd be no collisions at all (sure, using an EnumMap is better, but sometimes not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
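For illustration, such a deterministic hash could be computed by a small helper along these lines (a sketch only: EnumHashes and deterministicHashCode are made-up names, and since hashCode() is final in Enum this cannot replace the real implementation):
// Sketch of the deterministic hash proposed above; not part of the JDK.
final class EnumHashes {
    static int deterministicHashCode(Enum<?> e) {
        // ordinal() varies the low bits; the class name hash separates enum types
        return e.ordinal() ^ e.getClass().getName().hashCode();
    }
}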
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for
deterministic simulations
ETag computation
hunting down bugs depending e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO none of the reasons weighs much, except maybe for speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in each instance would be even faster, although only negligibly.
The explanation why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
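For completeness, the per-instance variant from the benchmark could look roughly like this (a sketch: Suit and stableHashCode are made-up names, and the value has to be exposed through a separate method because hashCode() itself is final in Enum):
// Hypothetical enum caching a deterministic hash in one field per constant.
enum Suit {
    CLUBS, DIAMONDS, HEARTS, SPADES;
    // computed once, when the constant is created
    private final int stableHash = ordinal() ^ getClass().getName().hashCode();
    int stableHashCode() {
        return stableHash;
    }
}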
UPDATE 2
There are some strange things going on with hashCode performance in general. Even when I understand them, there's still the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final I can imagine, is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue: "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't enums have that?" Well, to begin with, you can have several distinct string references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to give them well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object should return a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were higher with your algorithm than with the current one, due to its additional complexity.
I've asked the same question because I did not see this one. Why does Enum's hashCode() refer to the Object hashCode() implementation, instead of the ordinal() function?
I encountered it as a problem when defining my own hash function for an object that relies on an enum's hashCode as one of its components. When checking the values in a Set of objects returned by that function, I checked them in an order which I expected to stay the same, since I define the hashCode myself, and so I expected elements to fall into the same nodes of the tree; but since the hashCode returned by the enum changes from run to run, this assumption was wrong and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing a hashCode for their objects realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
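Roughly, the workaround looks like this (a sketch; Task and Status are made-up example types):
// Build the composite hash from ordinal(), which is stable across JVM runs,
// instead of from the enum's identity-based hashCode(), which is not.
enum Status { NEW, RUNNING, DONE }
final class Task {
    private final String name;
    private final Status status;
    Task(String name, Status status) {
        this.name = name;
        this.status = status;
    }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Task)) return false;
        Task other = (Task) o;
        return name.equals(other.name) && status == other.status;
    }
    @Override
    public int hashCode() {
        return 31 * name.hashCode() + status.ordinal();
    }
}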
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object to represent this constant, it doesn't matter that its hashcode is its address, since no other object can occupy the same address space at the same time. It is guaranteed to be unique and "deterministic" (in the sense that in the same VM, in memory, all objects will have the same reference, no matter what it is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashcode() is guaranteed never to collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashcode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object1 to a different JVM I see no reason for putting such a requirement on enums (and objects in general).
1 I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason I could imagine it is implemented like this is the requirement for hashCode() and equals() to be consistent, and the design goal that enums should be simple to use and compile-time constant (so they can be used as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behavior.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
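A small self-contained check of that behaviour (Direction is a made-up enum, used only for this demo):
import java.io.*;
enum Direction { NORTH, SOUTH }
class EnumSerializationDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(Direction.NORTH);   // only the constant's name is written
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Direction copy = (Direction) in.readObject();
            System.out.println(copy == Direction.NORTH);   // prints true: same instance
        }
    }
}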
Related
I was reading the JavaDoc for the Object.hashCode method, and it says that
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer [...])
But whatever its implementation is, the hashCode method always returns a (let's assume positive) integer, so given Integer.MAX_VALUE + 1 different objects, two of them are going to have the same hashcode.
Why is the JavaDoc here "denying" collisions? Is it a practical conclusion given that the internal address is used and "come on, you're never going to have Integer.MAX_VALUE + 1 objects in memory at once, so we can say it's practically always unique"?
EDIT
This bug entry (thank you Sleiman Jneidi) gives an exact idea of what I mean (it seems to be a discussion more than 10 years old):
appears that many, perhaps majority, of programmers take this to mean that the default implementation, and hence System.identityHashCode, will produce unique hashcodes.
The qualification "As much as is reasonably practical," is, in practice, insufficient to make clear that hashcodes are not, in practice, distinct.
The docs are misleading indeed, and there is a bug opened ages ago that says the docs are misleading, especially since the implementation is JVM-dependent; in practice, especially with massive heap sizes, it is quite likely to get collisions when mapping object identities to 32-bit integers.
There is an interesting discussion of hashcode collisions here:
http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/
In particular, this highlights that your practical conclusion, "you're never going to have Integer.MAX+1 objects in memory at once, so we can say it's practically always unique" is a long way from accurate due to the birthday paradox.
The conclusion from the link is that, assuming a random distribution of hashCodes, we only need 77,163 objects before we have a 50/50 chance of hashCode collision.
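As a sanity check of that figure, here is the standard birthday-paradox estimate, assuming hash codes are uniformly spread over all 2^32 int values:
class BirthdayBound {
    public static void main(String[] args) {
        double space = Math.pow(2, 32);
        int n = 77_163;
        // P(at least one collision among n objects) ~ 1 - exp(-n*(n-1) / (2*space))
        double pCollision = 1 - Math.exp(-((double) n * (n - 1)) / (2 * space));
        System.out.printf("n = %d -> collision probability ~ %.3f%n", n, pCollision);   // ~0.5
    }
}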
When you read this carefully, you'll notice that this only means objects should try to avoid collisions ('as much as reasonably practical'), but also that you are not guaranteed to have different hashcodes for unequal objects.
So the promise is not very strong, but it is still very useful. For instance when using the hashcode as a quick indication of equality before doing the full check.
Take ConcurrentHashMap, for instance, which uses (a function performed on) the hashcode to assign a location to an object in the map. In practice the hashcode is used to find where an object is roughly located, and equals is used to find the precise one.
A hashmap could not use this optimization if objects don't try to spread their hashcodes as much as possible.
I am required to create a class very similar to String however instead of storing an array of characters, the object must store an array of bytes because I will be dealing with binary data, not strings.
I am using HashMaps within my application. I am therefore keen to make my custom byteArray class immutable since immutable objects perform faster searches in hashmaps. (I would like a source for this fact please)
I'm pretty sure my class is immutable, but it's still performing poorly vs String in hashmap searches. How can I be sure it is immutable?
The most important thing is to copy the bytes into your array. If you have
this.bytes = passedInArray;
The caller can modify passedInArray and hence modify this.bytes. You must do
this.bytes = Arrays.copyOf(passedInArray, passedInArray.length);
(Or similar; clone() is o.k. too.) If this class will be mainly used as a key in Maps, I'd calculate the hashcode immediately (in the constructor); that's simpler than doing it lazily.
Implement the obvious equals() and I think you are done.
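Putting those pieces together, a minimal sketch could look like this (ByteArrayKey is a made-up name; the contents are defensively copied and the hash is computed eagerly in the constructor):
import java.util.Arrays;
final class ByteArrayKey {
    private final byte[] bytes;
    private final int hash;
    ByteArrayKey(byte[] passedInArray) {
        // copy, so later changes to the caller's array cannot affect this key
        this.bytes = Arrays.copyOf(passedInArray, passedInArray.length);
        this.hash = Arrays.hashCode(this.bytes);
    }
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteArrayKey)) return false;
        return Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
    }
    @Override
    public int hashCode() {
        return hash;
    }
}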
Your question is "How can I be sure that my class is immutable?" I'm not sure that's what you mean to ask, but the way to make your class immutable is listed by Josh Bloch in Effective Java, 2nd Ed. in item 15 here, and which I'll summarize in this answer:
Don't provide any mutator methods (methods that change the object's state, usually called "setters").
Ensure the class can't be extended. Generally, make the class final. This keeps others from subclassing it and modifying protected fields.
Make all fields final, so you can't change them.
Make all fields private, so others can't change them.
"Ensure exclusive access to mutable components." That is, if something else points to the data and therefore can alter it, make a defensive copy (as #user949300 pointed out).
Note that immutable objects don't automatically yield a big performance boost. The boost from immutable objects comes from not having to lock or copy the object, and from reusing it instead of creating a new one. I believe searches in a HashMap use the class's hashCode() method, and the lookup should be O(1), i.e. constant-time and fast. If you are having performance issues, you may need to look at whether there's slowness in your hashCode() method (unlikely), or issues elsewhere.
One possibility is that you have implemented hashCode() poorly (or not at all) and this is causing a large number of collisions in your HashMap; that is, calling that method with different instances of your class returns mostly similar or equal values, so the instances end up stored in a linked list at the location specified by hashCode(). Traversing this list turns your lookup from constant-time into linear-time, making performance much worse.
since immutable objects perform faster searches in hashmaps. (I would like a source for this fact please)
No, this isn't true. Performance as a hashmap key is determined by the running time, and collision avoidance, of hashCode.
I'm pretty sure my class is immutable, but it's still performing poorly vs String in hashmap searches. How can I be sure it is immutable?
Your problem is more likely to be a poor choice of hashCode implementation. Consider basing your implementation around Arrays.hashCode.
(Your question ArrayList<Byte> vs String in Java suggests you're trying to tune a specific implementation; the advice there to use byte[] is good.)
1.
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one do HashSet/HashMap make use of?
2.
I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function, but so far for strings I have not needed to implement this. What if I now want to hash a Java object that I create - do I need to implement the hash function? Or does Java have a built-in way of creating a hash code?
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
You could answer many of these questions yourself, by reading the source code for HashMap.
(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one do HashSet/HashMap make use of?
Chaining. See the source code. (Line 154 in the version I linked to).
How does Java's HashSet/HashMap hash the objects?
It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).
If you look at the code you will see some interesting wrinkles:
The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)
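For reference, the scrambling looks roughly like this in the OpenJDK 8 version of HashMap (the exact code differs between JDK versions):
final class HashMapHashSketch {
    // Roughly java.util.HashMap.hash(Object) as of OpenJDK 8: XOR the high
    // 16 bits into the low 16 bits so they influence the bucket choice.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
    // With a power-of-two table length, the bucket index is then a simple mask.
    static int bucketIndex(Object key, int tableLength) {
        return (tableLength - 1) & hash(key);
    }
}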
What if I now want to Hash a Java Object that I create - do I need to implement the hash function?
You can do that.
Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirements on hashCode() and equals(Object):
If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
If !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are probable hash keys for the application.
(The last requirement is for performance reasons. If you have a "poor" hash function that results in a high probability that different keys hash to the same hashcode, you get lots of collisions. The hash chains will become unbalanced, and you won't get the average O(1) performance that is normally expected of hash table operations. In the worst case, performance will be O(N); i.e. equivalent to a linear search of a linked list.)
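A tiny illustration of that degenerate case (BadKey is a made-up class): the hashCode below is legal under the contract, yet every instance collides, so a HashMap of these keys degrades into a scan of one long chain.
final class BadKey {
    private final int id;
    BadKey(int id) { this.id = id; }
    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override
    public int hashCode() {
        return 42;   // contract-compliant, but all keys land in the same bucket
    }
}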
Or does Java have a built in way of creating a hash code?
Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is incorrect.
The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point in time, but it is NOT the object's memory address.
In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!
(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)
The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.
Question 1)
The Java HashMap implementation uses the chaining implementation to deal with collisions. Think of it as an array of linked lists.
Question 2
Object has a default implementation of equals and hashCode. equals is implemented as return this == other, and hashCode is (to all intents and purposes) implemented as assigning a random identifier to each instance and using that as the hashCode.
As all classes in Java extend Object, they all inherit these implementations.
Some classes override these implementations by default. String, as you mentioned, is a very good example. Another is the classes in the collections API - so ArrayList implements these methods based on the elements it contains.
As far as implementing a good hashCode, this is a little bit of a dark art. Here's a pretty good summary of best practice.
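As one common shape of such an implementation, here is a sketch using java.util.Objects (Point is a made-up example class):
import java.util.Objects;
final class Point {
    private final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }
    @Override
    public int hashCode() {
        return Objects.hash(x, y);   // combine exactly the fields used in equals
    }
}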
Your final comment:
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is not correct. The default implementation of hashCode is constant as that is part of the method's contract. From the Javadoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
This Java tutorial
says that an immutable object cannot change its state after creation.
java.lang.String has a field
/** Cache the hash code for the string */
private int hash; // Default to 0
which is initialized on the first call of the hashCode() method, so it changes after creation:
// requires import java.lang.reflect.Field (and handling of the reflective exceptions)
String s = new String(new char[] {' '});
Field hash = s.getClass().getDeclaredField("hash");
hash.setAccessible(true);
System.out.println(hash.get(s));
s.hashCode();
System.out.println(hash.get(s));
output
0
32
Is it correct to call String immutable?
A better definition would be not that the object does not change, but that it cannot be observed to have changed. Its behavior will never change: .substring(x,y) will always return the same thing for that string, and ditto for equals and all the other methods.
That variable is calculated the first time you call .hashCode() and is cached for further calls. This is basically what they call "memoization" in functional programming languages.
Reflection isn't really a tool for "programming" but rather for meta-programming (i.e. writing programs that generate or manipulate programs), so it doesn't really count. It's the equivalent of changing a constant's value using a memory debugger.
The term "Immutable" is vague enough to not allow for a precise definition.
I suggest reading Kinds of Immutability from Eric Lippert's blog. Although it's technically a C# article, it's quite relevant to the question posed. In particular:
Observational immutability:
Suppose you’ve got an object which has the property that every time you call a method on it, look at a field, etc, you get the same result. From the point of view of the caller such an object would be immutable. However you could imagine that behind the scenes the object was doing lazy initialization, memoizing results of function calls in a hash table, etc. The “guts” of the object might be entirely mutable.
What does it matter? Truly deeply immutable objects never change their internal state at all, and are therefore inherently threadsafe. An object which is mutable behind the scenes might still need to have complicated threading code in order to protect its internal mutable state from corruption should the object be called on two threads “at the same time”.
Once created, all the methods on a String instance (called with the same parameters) will always provide the same result. You cannot change its behaviour (with any public method), so it will always represent the same entity. Also it is final and cannot be subclassed, so it is guaranteed that all instances will behave like this.
Therefore from public view the object is considered immutable. The internal state does not really matter in this case.
Yes it is correct to call them immutable.
While it is true that you can reach in and modify private ... and final ... variables of a class, it is an unnecessary and incredibly unwise thing to do on a String object. It is generally assumed that nobody is going to be crazy enough to do it.
From a security standpoint, the reflection calls needed to modify the state of a String all perform security checks. Unless you've mis-implemented your sandbox, the calls will be blocked for non-trusted code. So you should not have to worry about this as a way that untrusted code can break sandbox security.
It is also worth noting that the JLS states that using reflection to change final fields may break things (e.g. in multi-threading) or may have no effect at all.
From the viewpoint of a developer who is using reflection, it is not correct to call String immutable. There are actual Java developers using reflection to write real software every day. Dismissing reflection as a "hack" is preposterous. However, from the viewpoint of a developer who is not using reflection, it is correct to call String immutable. Whether or not it is valid to assume that String is immutable depends on context.
Immutability is an abstract concept and therefore cannot apply in an absolute sense to anything with a physical form (see the ship of Theseus). Programming language constructs like objects, variables, and methods exist physically as bits in a storage medium. Data degradation is a physical process which happens to all storage media, so no data can ever be said to be truly immutable. In addition, it is almost always possible in practice to subvert the programming language features intended to prevent the mutation of a particular datum. In contrast, the number 3 is 3, has always been 3, and will always be 3.
As applied to program data, immutability should be considered a useful assumption rather than a fundamental property. For example, if one assumes that a String is immutable, one may cache its hash code for reuse and avoid the cost of ever recomputing its hash code again later. Virtually all non-trivial software relies on assumptions that certain data will not mutate for certain durations of time. Software developers generally assume that the code segment of a program will not change while it is executing, unless they are writing self-modifying code. Understanding what assumptions are valid in a particular context is an important aspect of software development.
It cannot be modified from outside, and it is a final class, so it cannot be subclassed and made mutable. These are two requirements for immutability. Reflection is considered a hack; it's not a normal way of development.
A class can be immutable while still having mutable fields, as long as it doesn't provide access to its mutable fields.
It's immutable by design. If you use Reflection (getting the declared Field and resetting its accessibility), you are circumventing its design.
Reflection will allow you to change the contents of any private field. Is it therefore correct to call any object in Java immutable?
Immutability refers to changes that are either initiated by or perceivable by the application.
In the case of string, the fact that a particular implementation chooses to lazily calculate the hashcode is not perceptible to the application. I would go a step further, and say that an internal variable that is incremented by the object -- but never exposed and never used in any other way -- would also be acceptable in an "immutable" object.
Yes it is correct. When you modify a String as in your example, a new String is created, but the older one maintains its value.
When I implement a collection that uses hashes for optimizing access, should I cache the hash values or assume an efficient implementation of hashCode()?
On the other hand, when I implement a class that overrides hashCode(), should I assume that the collection (i.e. HashSet) caches the hash?
This question is only about performance vs. memory overhead. I know that the hash value of an object should not change.
Clarification:
A mutable object would of course have to clear the cached value when it is changed, whereas the collection relies on objects not changing. But this is not relevant for my question.
When designing Guava's ImmutableSet and ImmutableMap classes, we opted not to cache hash codes. This way, you'll get better performance from hash code caching when and only when you care enough to do the caching yourself. If we cached them ourselves, we'd be costing you extra time and memory even in the case that you care deeply about speed and space!
It's true that HashMap does this caching, but it was HashMap's author (Josh Bloch) who strongly suggested we not follow that precedent!
Edit: oh, also, if your hashCode() is slow, the caching by the collection only addresses half of the problem anyway, as hashCode() still must be invoked on the object passed in to get() no matter what.
Considering that java.lang.String caches its hash, I guess that hashCode() is supposed to be fast.
So as a first approach, I would not cache hashes in my collection.
In my own objects, I would not cache the hash code unless it is obviously slow, and only do it if profiling tells me so.
If my objects will be used by others, I would probably consider caching hash codes sooner (but that needs measurements anyway).
On the other hand, when I implement a class that overrides hashCode(), should I assume that the collection (i.e. HashSet) caches the hash?
No, you should not make any assumptions beyond the scope of the class you are writing.
Of course you should try to make your hashCode cheap. If it isn't, and your class is immutable, create the hashCode on initialization or lazily upon the first request (see java.lang.String). If your class is not immutable, I don't see any other option than to re-calculate the hashCode every time.
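A sketch of that lazy variant, in the style of java.lang.String (ImmutablePair is a made-up class; the unsynchronized caching is benign here because the object is immutable, so a race at worst recomputes the same value, and, as in String, a computed hash of 0 is simply recomputed on every call):
final class ImmutablePair {
    private final String left;
    private final String right;
    private int hash;   // 0 means "not computed yet"
    ImmutablePair(String left, String right) {
        this.left = left;
        this.right = right;
    }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ImmutablePair)) return false;
        ImmutablePair p = (ImmutablePair) o;
        return left.equals(p.left) && right.equals(p.right);
    }
    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {   // compute on first request only
            h = 31 * left.hashCode() + right.hashCode();
            hash = h;
        }
        return h;
    }
}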
I'd say in most cases you can rely on efficient implementations of hashCode(). AFAIK, that method is only invoked on lookup methods (like contains, get etc.) or methods that change the collection (add/put, remove etc.).
Thus, in most cases there shouldn't be any need to cache hashes yourself.
Why do you want to cache it? You need to ask objects for their hashcode while you're working with them, to allocate them to a hash bucket (and to compare against any objects in the same bucket that may have the same hashcode), but then you can forget it.
You could store objects in a wrapper HashNode or something, but I would try implementing it first without caching (just like HashSet et al does) and see if you need the added performance and complexity before going there.