Caching hashes in Java collections?

When I implement a collection that uses hashes for optimizing access, should I cache the hash values or assume an efficient implementation of hashCode()?
On the other hand, when I implement a class that overrides hashCode(), should I assume that the collection (i.e. HashSet) caches the hash?
This question is only about performance vs. memory overhead. I know that the hash value of an object should not change.
Clarification:
A mutable object would of course have to clear the cached value when it is changed, whereas the collection relies on objects not changing. But this is not relevant for my question.

When designing Guava's ImmutableSet and ImmutableMap classes, we opted not to cache hash codes. This way, you'll get better performance from hash code caching when and only when you care enough to do the caching yourself. If we cached them ourselves, we'd be costing you extra time and memory even in the case that you care deeply about speed and space!
It's true that HashMap does this caching, but it was HashMap's author (Josh Bloch) who strongly suggested we not follow that precedent!
Edit: oh, also, if your hashCode() is slow, the caching by the collection only addresses half of the problem anyway, as hashCode() still must be invoked on the object passed in to get() no matter what.

Considering that java.lang.String caches its hash, I guess that hashCode() is supposed to be fast.
So as a first approach, I would not cache hashes in my collection.
In the objects that I use, I would not cache the hash code unless it is obviously slow, and only do it if profiling tells me to.
If my objects will be used by others, I would probably consider caching hash codes sooner (but that needs measurement anyway).

On the other hand, when I implement a class that overrides hashCode(),
should I assume that the collection (i.e. HashSet) caches the hash?
No, you should not make any assumptions beyond the scope of the class you are writing.
Of course you should try to make your hashCode cheap. If it isn't, and your class is immutable, create the hashCode on initialization or lazily upon the first request (see java.lang.String). If your class is not immutable, I don't see any other option than to re-calculate the hashCode every time.
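For example, here is a minimal sketch of that String-style lazy caching for a hypothetical immutable class (PersonKey and its fields are made-up names):

    // Hypothetical immutable class with String-style lazy hash caching.
    public final class PersonKey {
        private final String firstName;
        private final String lastName;
        private int hash; // 0 means "not computed yet", as in java.lang.String

        public PersonKey(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof PersonKey)) return false;
            PersonKey other = (PersonKey) o;
            return firstName.equals(other.firstName) && lastName.equals(other.lastName);
        }

        @Override
        public int hashCode() {
            int h = hash;
            if (h == 0) { // benign race: at worst the value is recomputed
                h = 31 * firstName.hashCode() + lastName.hashCode();
                hash = h;
            }
            return h;
        }
    }

As with String, a hash that genuinely comes out as 0 gets recomputed on every call; that's the usual trade-off of this trick.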

I'd say in most cases you can rely on efficient implementations of hashCode(). AFAIK, that method is only invoked by lookup methods (like contains, get, etc.) and by methods that change the collection (add/put, remove, etc.).
Thus, in most cases there shouldn't be any need to cache hashes yourself.

Why do you want to cache it? You need to ask an object for its hash code while you're allocating it to a hash bucket (and when comparing it against other objects in the same bucket that may share its hash code), but then you can forget it.
You could store objects in a wrapper HashNode or something, but I would try implementing it first without caching (just like HashSet et al do) and see if you need the added performance and complexity before going there.
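For reference, such a wrapper might look roughly like this (HashNode here is a hypothetical name from this answer, not a real JDK class):

    // Hypothetical caching wrapper: stores the element's hash code once so
    // the collection never recomputes it. Only safe if the element does not
    // mutate in a way that changes its hash code while stored.
    final class HashNode<E> {
        final E element;
        final int cachedHash;
        HashNode<E> next; // for chaining entries within a bucket

        HashNode(E element) {
            this.element = element;
            this.cachedHash = element.hashCode(); // computed exactly once
        }
    }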

Related

How can I be sure that my class is immutable

I am required to create a class very similar to String however instead of storing an array of characters, the object must store an array of bytes because I will be dealing with binary data, not strings.
I am using HashMaps within my application. I am therefore keen to make my custom byteArray class immutable since immutable objects perform faster searches in hashmaps. (I would like a source for this fact please)
I'm pretty sure my class is immutable, but it's still performing poorly vs. String in HashMap searches. How can I be sure it is immutable?
The most important thing is to copy the bytes into your array. If you have
this.bytes = passedInArray;
The caller can modify passedInArray and hence modify this.bytes. You must do
this.bytes = Arrays.copyOf(passedInArray, passedInArray.length);
(Or similar; clone() is o.k. too.) If this class will mainly be used as a key in Maps, I'd calculate the hash code immediately (in the constructor); that's simpler than doing it lazily.
Implement the obvious equals() and I think you are done.
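Put together, a minimal sketch might look like this (the class name ByteArray is illustrative; Arrays.equals and Arrays.hashCode do the heavy lifting):

    import java.util.Arrays;

    // Illustrative sketch: immutable byte-array wrapper with a defensive copy
    // and an eagerly computed hash code (suitable for use as a map key).
    public final class ByteArray {
        private final byte[] bytes;
        private final int hash; // computed once in the constructor

        public ByteArray(byte[] passedInArray) {
            this.bytes = Arrays.copyOf(passedInArray, passedInArray.length);
            this.hash = Arrays.hashCode(this.bytes);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ByteArray)) return false;
            return Arrays.equals(bytes, ((ByteArray) o).bytes);
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }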
Your question is "How can I be sure that my class is immutable?" I'm not sure that's what you mean to ask, but the way to make your class immutable is listed by Josh Bloch in Effective Java, 2nd Ed., Item 15, which I'll summarize in this answer:
Don't provide any mutator methods (methods that change the object's state, usually called "setters").
Ensure the class can't be extended. Generally, make the class final. This keeps others from subclassing it and modifying protected fields.
Make all fields final, so you can't change them.
Make all fields private, so others can't change them.
"Ensure exclusive access to mutable components." That is, if something else points to the data and therefore can alter it, make a defensive copy (as #user949300 pointed out).
Note that immutable objects don't automatically yield a big performance boost. The boost from immutable objects comes from not having to lock or copy the object, and from reusing it instead of creating a new one. I believe the searches in a HashMap use the class's hashCode() method, and the lookup should be O(1), i.e. constant-time and fast. If you are having performance issues, you may need to look at whether there's slowness in your hashCode() method (unlikely) or issues elsewhere.
One possibility is that you have implemented hashCode() poorly (or not at all) and this is causing a large number of collisions in your HashMap -- that is, calling that method on different instances of your class returns mostly similar or identical values -- in which case the instances will be stored in a linked list at the location specified by hashCode(). Traversing this list degrades your lookups from constant-time to linear-time, making performance much worse.
since immutable objects perform faster searches in hashmaps. (I would like a source for this fact please)
No, this isn't true. Performance as a hashmap key will be determined by the runtime, and collision avoidance, of hashCode.
I'm pretty sure my class is immutable, but it's still performing poorly vs. String in HashMap searches. How can I be sure it is immutable?
Your problem is more likely to be a poor choice of hashCode implementation. Consider basing your implementation around Arrays.hashCode.
(Your question ArrayList<Byte> vs String in Java suggests you're trying to tune a specific implementation; the advice there to use byte[] is good.)

clarifying facts behind Java's implementation of HashSet/HashMap

1.
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one does HashSet/HashMap make use of?
2.
I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function, but so far for strings I have not needed to implement this. What if I now want to hash a Java object that I create -- do I need to implement the hash function? Or does Java have a built-in way of creating a hash code?
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
You could answer many of these questions yourself, by reading the source code for HashMap.
(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)
I understand the different hash map mechanisms and the ways in which key collisions are handled (open addressing with linear/quadratic probing, chaining, extendible hashing, etc.). Which one does HashSet/HashMap make use of?
Chaining. See the source code. (Line 154 in the version I linked to).
How does Java's HashSet/HashMap hash the objects?
It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).
If you look at the code you will see some interesting wrinkles:
The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)
What if I now want to Hash a Java Object that I create - do I need to implement the hash function?
You can do that.
Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirements on hashCode() and equals(Object):
If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
If !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are probable hash keys for the application.
(The last requirement is for performance reasons. If you have a "poor" hash function that gives a high probability that different keys map to the same hash code, you get lots of collisions. The hash chains will become unbalanced, and you won't get the average O(1) performance that is normally expected of hash table operations. In the worst case, performance will be O(N); i.e. equivalent to a linear search of a linked list.)
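To make the contract concrete, here is a sketch of a class (Point is a made-up name) whose hashCode() is derived from exactly the fields its equals() compares:

    import java.util.Objects;

    // Sketch: equal Points always produce equal hash codes, and unequal
    // ones usually differ, satisfying the requirements listed above.
    public final class Point {
        private final int x;
        private final int y;

        public Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y); // built from exactly the fields equals uses
        }
    }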
Or does Java have a built in way of creating a hash code?
Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is incorrect.
The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point in time, but it is NOT the object's memory address.
In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!
(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)
The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.
Question 1)
The Java HashMap implementation uses the chaining implementation to deal with collisions. Think of it as an array of linked lists.
Question 2
Object has a default implementation of equals and hashCode. equals is implemented as return this == other, and hashCode is (to all intents and purposes) implemented by assigning a random identifier to each instance and using that as the hash code.
As all classes in Java extend Object, they all inherit these implementations.
Some classes override these implementations. String, as you mentioned, is a very good example. Another is the classes in the collections API -- ArrayList, for instance, implements these methods based on the elements it contains.
As far as implementing a good hashCode, this is a little bit of a dark art. Here's a pretty good summary of best practice.
Your final comment:
I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.
This is not correct. The default implementation of hashCode is constant as that is part of the method's contract. From the Javadoc:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

Changing hashCode of object stored in hash-based collection

I have a hash-based collection of objects, such as HashSet or HashMap. What issues can I run into when the implementation of hashCode() is such that it can change with time because it's computed from some mutable fields?
How does it affect Hibernate? Is there any reason why having hashCode() return the object's ID by default is bad? All not-yet-persisted objects have id=0, if that matters.
What is a reasonable implementation of hashCode for Hibernate-mapped entities? Once set, the ID is immutable, but that's not true at the moment of saving an entity to the database.
I'm not worried about performance of a HashSet with a dozen entities with key=0. What I care about is whether it's safe for my application and Hibernate to use ID as hash code, because ID changes as it is generated on persist.
If the hash code of the same object changes over time, the results are basically unpredictable. Hash collections use the hash code to assign objects to buckets -- if your hash code suddenly changes, the collection obviously doesn't know, so it can fail to find an existing object because it hashes to a different bucket now.
Returning an object's ID by itself isn't bad, but if many of them have id=0 as you mentioned, it will reduce the performance of the hash table: all objects with the same hash code go into the same bucket, so your hash table is now no better than a linear list.
Update: Theoretically, your hash code can change as long as nobody else is aware of it -- this implies exactly what @bestsss mentioned in his comment: remove your object from any collections that may be holding it, and insert it again once the hash code has changed. In practice, a better alternative is to generate your hash code from the actual content fields of your object rather than relying on the database ID.
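A runnable sketch of that remove-then-re-insert discipline (Person is a made-up mutable class whose hash code depends on its name):

    import java.util.HashSet;
    import java.util.Set;

    public class RehashDemo {
        static final class Person {
            private String name;
            Person(String name) { this.name = name; }
            void setName(String name) { this.name = name; }
            @Override public boolean equals(Object o) {
                return o instanceof Person && name.equals(((Person) o).name);
            }
            @Override public int hashCode() { return name.hashCode(); }
        }

        public static void main(String[] args) {
            Set<Person> people = new HashSet<>();
            Person p = new Person("Alice");
            people.add(p);

            people.remove(p);    // remove while the hash code is still the old one
            p.setName("Alicia"); // now it is safe to mutate
            people.add(p);       // re-insert under the new hash code

            System.out.println(people.contains(p)); // true
            // Mutating p while it was still in the set would instead strand it
            // in the wrong bucket, and contains(p) could then return false.
        }
    }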
If you add an object to a hash-based collection, then mutate its state so as to change its hashCode (and, by implication, probably the behaviour of .equals() calls), you may see effects including but not limited to:
Stuff you put in the collection seeming to not be there any more
Getting something out which is different to what you asked for
This is surely not what you want. So, I recommend making the hash code only out of immutable fields. This is usually done by making the fields final and setting their values in the constructor.
http://community.jboss.org/wiki/EqualsandHashCode
Don't change the hash code of elements in a hash-based collection after they have been put in.
Many programmers fall into this pitfall.
You can think of the hash code as a kind of address within the collection, so you must not change an element's address after it has been put into the collection.
The Javadoc specifically says that the built-in collections don't support this. So don't do it.

What is the reason behind Enum.hashCode()?

The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's point of view.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible"; e.g., for an enum containing up to 16 elements and a HashMap of size 16, there'd be no collisions at all (sure, using an EnumMap is better, but sometimes it's not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
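Since Enum.hashCode() is final, such a deterministic value would have to live outside hashCode() itself. A hedged sketch of a helper along the lines proposed above (EnumHashes is a made-up name; getDeclaringClass() is used instead of getClass() so constants with constant-specific class bodies hash consistently):

    // Hypothetical helper computing a deterministic, JVM-independent hash
    // for any enum constant, as an alternative to its identity hash code.
    public final class EnumHashes {
        private EnumHashes() {}

        public static int deterministicHash(Enum<?> e) {
            return e.ordinal() ^ e.getDeclaringClass().getName().hashCode();
        }
    }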
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for:
deterministic simulations
ETag computation
hunting down bugs that depend e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no reason weighs much, except maybe the speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in each field would be even faster, although negligibly.
The explanation why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with hashCode performance in general. Even once I understand them, there's still the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final I can imagine, is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will be only one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue: "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have it?" Well, to begin with, you can have several distinct String references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were longer with your algorithm than with the current one, due to its additional complexity.
I've asked the same question, because I had not seen this one: why does Enum's hashCode() refer to the Object hashCode() implementation instead of the ordinal() function?
I encountered it as a problem when defining my own hash function for an object that relies on an enum's hashCode as one of its components. When checking values in a Set of objects returned by the function, I checked them in an order which I expected to be the same each time, since I define the hashCode myself and therefore expected elements to fall on the same nodes of the tree; but since the hashCode returned by the enum changes from start to start, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing a hashCode for their object realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two distinct instance objects of the same enum constant within a single VM -- not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object representing this constant, it doesn't matter that its hash code is its address, since no other object can occupy the same address at the same time. It is guaranteed to be unique and "deterministic" (in the sense that within the same VM, in memory, all references to this constant point to the same object, whatever its address is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashCode() is guaranteed never to collide, is good code reuse, and is very fast.
If equality is defined by identity, then Object.hashCode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object1 to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
1 I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason I could imagine for it being implemented like this is the requirement for hashCode() and equals() to be consistent, together with the design goal that enums should be simple to use and compile-time constant (usable as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behaviour.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.

Which data structure uses more memory?

Which type of data structure uses more memory?
Hashtable
Hashmap
ArrayList
Could you please give me a brief explanation of which one is less prone to memory leakage?
...which one to use for avoiding memory leakage
The answer is all of them and none of them.
Memory leakage is not related to the data structure, but to the way you use it.
The amount of memory used by each one is irrelevant when you aim to avoid "memory leakage".
The best thing you can do is: when you detect that an object won't be used any longer by the application, remove it from the collection (not only those you've listed, but any other you might use: List, Map, Set, or even arrays).
That way the garbage collector will be able to release the memory used by that object.
You can take a look at the article "How does the garbage collector work?" for further explanation of how Java releases the memory used by objects.
Edit:
There are other data structures in Java that help with reference management, such as WeakHashMap, but these may be considered "advanced topics".
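For illustration, a small sketch of WeakHashMap (collection timing is never guaranteed, so this is strictly a demonstration):

    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakCacheDemo {
        public static void main(String[] args) {
            Map<Object, String> cache = new WeakHashMap<>();
            Object key = new Object();
            cache.put(key, "payload");

            key = null;  // drop the only strong reference to the key
            System.gc(); // a hint only; the JVM may or may not collect now

            // Once the key is collected, its entry silently disappears:
            System.out.println(cache.size()); // typically 0, but not guaranteed
        }
    }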
Most likely you should really just use a Collection that suits your current need. In the most common cases, if you need a List, use ArrayList, and if you need a Map, use HashMap. For a tutorial, see e.g. http://java.sun.com/docs/books/tutorial/collections/
When your profiler shows you there is an actual memory leak related to the use of Java Collections, then do something about it.
Your question is woefully underspecified because the concrete data structures you specify are not of comparable structure.
HashMap/Hashtable are comparable, since they both function as maps (key -> value lookups).
ArrayLists (and lists in general) do not.
The HashMap/Hashtable part is easy to answer, as they are largely identical (the major difference being null keys): the former is not synchronized and the latter is, thus HashMap will generally be faster (assuming the synchronization is not required). Modern JVMs are reasonably fast at uncontended locks, though, so the difference will be small in a microbenchmark.
Well, I've actually been, recently, in a situation where I had to hold onto large collections of custom objects, where the size of the collections was one of the application's limiting factors. If that's your situation, a few suggestions:
There are a few implementations of collections using primitives (list here). I played around a bit with trove4j, and found a somewhat smaller memory footprint (as long as you're dealing with primitives, of course).
If you're dealing with large collections, you'll probably get more bang for your buck, in terms of reducing memory footprint, by optimizing the objects you're holding. After all, you've got a lot more of them; otherwise you wouldn't need a collection, right?
Some collections are naturally smaller than others (note, though, that a LinkedList typically uses more memory per element than an ArrayList, because of its per-node overhead), but the difference will probably be swamped by the differences in how they're used.
Most of the Java collections can be manually sized -- you can initialize an ArrayList that will hold 100 elements with a capacity of 100, and you can set your maps to keep less open space at the cost of slower performance. It's all in the javadocs.
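For example (the numbers are arbitrary):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SizingDemo {
        public static void main(String[] args) {
            // Reserve capacity for 100 elements up front, avoiding resizes.
            List<String> list = new ArrayList<>(100);

            // Initial capacity 64, load factor 0.9f: the table keeps less empty
            // space, at the cost of more collisions and thus slower lookups.
            Map<String, Integer> map = new HashMap<>(64, 0.9f);

            list.add("example");
            map.put("example", 1);
            System.out.println(list.size() + " " + map.size());
        }
    }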
Ultimately the simplest thing to do is to test for yourself.
You're not comparing like with like: HashMap and Hashtable implement the Map interface, and ArrayList implements the List interface.
In a direct comparison between Hashtable and HashMap, HashMap will probably offer better performance because Hashtable is synchronized.
If you give some indication about what you're using the collections for, you might get a more insightful answer.
Hash tables (be it HashMap or Hashtable) take a little more memory than what they need to actually store the information.
Hashing performance comes at a price.
A java.util.Collection stores references to objects.
A java.util.Map stores references to Map.Entry instances, which hold references to keys and objects.
So a java.util.Collection holding N references to objects will require less memory than a java.util.Map holding onto the same N references, because the Map has to point to the keys as well.
Performance for reading and writing differs depending on the implementation of each of these interfaces.
I don't see any java.util.Collection analogous to WeakHashMap. I'd read about that class if you're worried about garbage collection and memory leaks.
As others have pointed out, your question is underspecified.
Still, sometimes an ArrayList-based implementation can replace a HashMap-based one (I would not consider Hashtable at all; it's obsolete). You might need to search the ArrayList linearly, but for small lists that may still be fast enough, and the ArrayList will need less memory (for the same data) because it has less overhead.
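A sketch of that trade-off, using a hypothetical Pair class and a linear scan:

    import java.util.ArrayList;
    import java.util.List;

    public class SmallLookupDemo {
        static final class Pair {
            final String key;
            final int value;
            Pair(String key, int value) { this.key = key; this.value = value; }
        }

        // O(n) scan; for a handful of entries this is fast enough and uses
        // less memory than a HashMap holding the same data.
        static Integer lookup(List<Pair> pairs, String key) {
            for (Pair p : pairs) {
                if (p.key.equals(key)) return p.value;
            }
            return null;
        }

        public static void main(String[] args) {
            List<Pair> pairs = new ArrayList<>();
            pairs.add(new Pair("a", 1));
            pairs.add(new Pair("b", 2));
            System.out.println(lookup(pairs, "b")); // 2
        }
    }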
In most languages it depends on how good you are at picking up your toys after you're done with them.
In Java, garbage collection is done automatically, so you rarely need to worry about classic memory leaks -- though, as noted above, unintentionally retained references can still cause them.
