Why is HashSet called a "Hash"-set?
I understand why we call it a hashtable or a hashmap: it's a key-value store, and when we put(), the key is hashed and distributed evenly using a good hash function.
I assume it's called a HashSet because when we add(), the value is hashed and stored to keep it unique. But why the overkill? We don't really care about "equal distribution" of data like we do in a hash table.
We do care about equal distribution, because we want constant-time performance on our basic Collection operations. In order to respect the basic rule of a Set, that no two elements are equal, we want to find a potentially equal match quickly. A HashSet is one fairly good way of doing it. Compare it to a theoretical ArraySet, where adding a new element is a linear-time operation that iterates over and checks every single existing entry for equality.
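For illustration, here is a minimal sketch of what such a hypothetical ArraySet's add() could look like (the class is made up, not in the JDK); every insertion scans all existing entries:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical ArraySet: correct, but add() is O(n) because it must
    // scan every existing element for an equal match.
    class ArraySet<E> {
        private final List<E> elements = new ArrayList<>();

        public boolean add(E e) {
            for (E existing : elements) {      // linear scan
                if (existing.equals(e)) {
                    return false;              // sets store no duplicates
                }
            }
            return elements.add(e);
        }
    }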
A HashSet is called a HashSet because hashing is indeed important to its functionality. Operations like contains(Object) (arguably the most important method in a Set) and remove(Object) are able to work in constant time by making use of the hash code of the object (by way of HashMap).
HashSet (like HashMap) uses, well, hashing to achieve O(1) amortized insert/lookup/remove performance. (There were some incorrect assumptions in the question about HashSet not needing hashing.)
Now, in Java, all objects are "hashable" -- that is, they have a hashCode() function (as they are descendants of Object). The quality of this hashing function allows a hash algorithm to reach the anticipated performance characteristics above by "spreading the objects [uniformly] through the buckets". (The default Object implementations of hashCode/equals amount to object identity. Generally this should be changed for any subclass.)
However, if your class implements hashCode poorly (e.g. returns 1 for all values), then the performance of HashSet/HashMap will suffer greatly as a result (for any non-trivial n). It is important to note that hashCode determines the bucket, but equals determines, well, the actual equality, which may be used even if the hash code is unique and/or there are no collisions (for instance, to make sure a lookup doesn't return a false positive; it could conceivably be skipped on a collision-free insert).
Just make sure to follow the requirements set out in Object with respect to hashCode and equals, or objects may be lost. A poor hashing function that honors the rules will still work, albeit with potentially poor performance. (Mutable objects are especially problematic for use in hash ADTs, because the hash code and/or equality may not always be stable.)
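As a hedged illustration of a class that honors that contract (a made-up example, not from the question): equal instances produce equal hash codes, and the fields involved are immutable, so both values stay stable:

    import java.util.Objects;

    final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y);  // equal points yield equal hash codes
        }
    }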
What 'overkill'? The idea of a HashXXX for any XXX is to provide O(1) performance, and that is accomplished by hashing. If you don't want O(1) performance, don't use it; use a TreeSet, for example.
Related
I have a large number of interned strings (with a small number of possible values, so it makes sense to intern them) that I want to store in a Map (to use as a counter).
The TreeMap does a comparison at each level of the tree, which I imagine would involve an O(n) comparison of characters. HashMap will use the hash to bucket.
Given that I have a small set of interned Strings, which means that reference can be used for equality or ordering comparison (so neither the hash code nor the value need to be used), I wonder if there's a well-suited structure?
(Or indeed a more specialised one suited to counting)
My priorities are both speed and compact representation (I'm dealing with a large amount of data).
(To head off any "premature optimisation" comments, I'm processing about 200 million items).
IdentityHashMap
The java.util.IdentityHashMap works similarly to the HashMap class, but uses identity equality (==) and the identity hash code (System.identityHashCode) to compare keys. It also has a much smaller memory footprint, because it uses a single array to store both keys and values. Although == is as fast as it gets, the System.identityHashCode(Object) method has a native implementation that carries some overhead (though it is probably a JVM intrinsic).
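A minimal sketch of the counter use case with IdentityHashMap (the class and method names are illustrative); this is only safe because the keys are interned, so == comparison is reliable:

    import java.util.IdentityHashMap;
    import java.util.Map;

    class InternedCounter {
        // Counts occurrences of interned strings; keys are compared with ==
        // and System.identityHashCode, which is correct here only because
        // every key has been interned.
        static Map<String, Integer> count(Iterable<String> items) {
            Map<String, Integer> counts = new IdentityHashMap<>();
            for (String s : items) {
                counts.merge(s, 1, Integer::sum);  // insert 1, or increment
            }
            return counts;
        }
    }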
HashMap
The HashMap implementation, although it requires more memory (an entry object per mapping), should have similar performance for hash code computation and equality checks. This is because String.equals checks for reference equality first, and String.hashCode is cached for every String. In an 'emergency', the HashMap approach will also produce correct results for non-interned Strings. In terms of maintainability, this might be the better choice.
First of all, I want to make it clear that I would never use a HashMap to do things that require some kind of order in the data structure and that this question is motivated by my curiosity about the inner details of Java HashMap implementation.
You can read about the hashCode method in the Java documentation for Object.
I understand from there that the hashCode implementation for classes such as String and the wrappers for basic types (Integer, Long, ...) is predictable once the value contained by the object is given. An example: calls to hashCode on any String object containing the value "hello" should always return 99162322.
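A quick check of that claim; String.hashCode is specified to compute s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], which depends only on the characters:

    public class HashDemo {
        public static void main(String[] args) {
            System.out.println("hello".hashCode()); // always prints 99162322
        }
    }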
Suppose an algorithm always inserts the same values, in the same order, into an empty Java HashMap that uses Strings as keys. Then the order of its elements at the end should always be the same, am I wrong?
Since the hash code for a concrete value is always the same, if there are no collisions the order should be the same.
On the other hand, if there are collisions, I think (I don't know the facts) that collision resolution should result in the same order for exactly the same input elements.
So, isn't it right that two HashMap objects with the same elements, inserted in the same order, should be traversed (by an iterator) giving the same sequence of elements?
As far as I know, the order (assuming we call "order" the order of elements as returned by the values() iterator) of the elements in a HashMap is kept until a rehash of the map is performed. We can influence the probability of that event by providing a capacity and/or loadFactor to the constructor.
Nevertheless, we should never rely on this, because the internal implementation of HashMap is not part of its public contract and is subject to change in the future.
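A small demo of this point (a sketch; the output may differ between JRE versions): two maps built identically will typically iterate in the same order, but nothing in the contract guarantees it:

    import java.util.HashMap;
    import java.util.Map;

    public class OrderDemo {
        public static void main(String[] args) {
            Map<String, Integer> a = new HashMap<>();
            Map<String, Integer> b = new HashMap<>();
            for (String k : new String[] {"one", "two", "three"}) {
                a.put(k, k.length());
                b.put(k, k.length());
            }
            // Usually prints the same sequence twice on a given JRE,
            // but the iteration order itself is unspecified.
            System.out.println(a.keySet());
            System.out.println(b.keySet());
        }
    }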
I think you are asking "Is HashMap non-deterministic?". The answer is "probably not" (look at the source code of your favourite implementation to find out).
However, bear in mind that because the Java standard does not guarantee a particular order, the implementation is free to alter at any time (e.g. in newer JRE versions), giving a different (yet deterministic) result.
Whether or not that is true depends entirely upon the implementation. What's more important is that it isn't guaranteed. If order is important to you, there are options: you could create your own implementation of Map that preserves order, you could use a SortedMap or a LinkedHashMap, or you could use something like the Apache commons-collections OrderedMap: http://commons.apache.org/proper/commons-collections/javadocs/api-release/org/apache/commons/collections4/OrderedMap.html.
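For instance, LinkedHashMap guarantees insertion-order iteration by contract, so it is the natural choice when order matters (a minimal example):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LinkedOrderDemo {
        public static void main(String[] args) {
            Map<String, Integer> m = new LinkedHashMap<>();
            m.put("first", 1);
            m.put("second", 2);
            m.put("third", 3);
            System.out.println(m.keySet()); // always [first, second, third]
        }
    }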
If I override the hashCode() method, will it degrade the performance of my application? I am overriding this method in many places in my application.
Yes, you can degrade the performance of a hashed collection if the hashCode method is implemented badly. An ideal hashCode implementation generates a unique hash code for every unequal object. Unique hash codes avoid collisions, and an element can then be stored and retrieved with O(1) complexity. But the hashCode method alone cannot achieve this: you need to override the equals method as well, to help the collection confirm matches.
If the hashCode method is not able to generate a unique hash for each unequal object, there is a chance that a bucket will hold more than one object. This occurs when two elements have the same hash but equals returns false for them. Each time this happens, the element is added to the list at that hash bucket. This slows down both insertion and retrieval of elements, and leads to O(n) complexity for the get method, where n is the size of the list at a bucket.
Note: while you try to generate unique hashes for unequal objects in your hashCode implementation, be sure to use a simple algorithm for doing so. If your algorithm for generating the hash is too heavy, you will surely see poor performance for operations on your hashed collection, since the hashCode method is called for most operations on it. A degenerate example is sketched below.
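A hedged sketch of that degenerate case (a made-up key class): a constant hashCode puts every entry into one bucket, so lookups degrade toward O(n):

    // Every instance hashes to the same bucket: HashMap/HashSet operations
    // on many BadKey instances degrade from O(1) toward O(n).
    final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 1;  // legal per the contract, but kills performance
        }
    }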
It can also improve performance, if the right data structure is used in the right place.
For example, a proper hashCode() implementation can bring a HashMap lookup from nearly O(n) down to O(1), unless you do overly complicated work in the hashCode() method itself.
hashCode() is invoked every time a hash-based data structure deals with your object, so it should not be a heavy method.
It depends entirely on how you're implementing hashCode. If you're doing lots of expensive deep operations, then perhaps it might, and in that case you should consider caching a copy of the hash code (like String does). But a decent implementation, such as one built with HashCodeBuilder, won't be a big deal. Having a good hashCode implementation can make lookups in data structures like HashMap and HashSet much, much faster, and if you override equals, you need to override hashCode.
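A sketch of that caching idiom (similar in spirit to what String does; the class itself is made up): compute the hash lazily, then reuse it. This is only safe if the fields are immutable:

    import java.util.Objects;

    final class Expensive {
        private final String a;
        private final String b;
        private int hash;  // 0 means "not computed yet"

        Expensive(String a, String b) { this.a = a; this.b = b; }

        @Override
        public int hashCode() {
            int h = hash;
            if (h == 0) {                  // recomputes if the hash happens
                h = Objects.hash(a, b);    // to be 0: a benign, racy cache
                hash = h;
            }
            return h;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Expensive)) return false;
            Expensive e = (Expensive) o;
            return a.equals(e.a) && b.equals(e.b);
        }
    }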
Java's hashCode() is a virtual function anyway, so there is no performance loss by the sheer fact that it is overridden and the overridden method is used.
The real difference may be the implementation of the method. By default, hashCode() works like this (source):
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)
So, as long as your implementation is as simple as this, there will be no performance loss. However, if you perform complex computations based on many fields, calling many other functions, you will notice a performance loss, but only because your hashCode() does more work.
There is also the issue of inefficient hashCode() implementations. For example, if your hashCode() simply returns the value 1, then the use of a HashMap or HashSet will be significantly slower than with a proper implementation. There is a great question on SO which covers the topic of implementing hashCode() and equals(): What issues should be considered when overriding equals and hashCode in Java?
One more note: remember, that whenever you implement hashCode() you should also implement equals(). Moreover, you should do it with care, because if you write an invalid hashCode() you may break equality checks for various collections.
Overriding hashCode() in a class does not, in itself, cause any performance issues. However, when an instance of such a class is inserted into a HashMap, a HashSet, or an equivalent data structure, hashCode() and (optionally) equals() are invoked to identify the right bucket to put the element in. The same applies to retrieval, search, and deletion.
As posted by others, performance depends entirely on how hashCode() is implemented.
However, if a particular class's equals method is not used at all, then it is not mandatory to override equals() and hashCode(); but if equals() is overridden, hashCode() must be overridden as well.
As all the previous comments mentioned, the hash code is used for hashing in collections, and it can also be used as a negative condition in equals. So yes, you can slow your app down a lot. Obviously there are more use cases.
First of all, I would say that the approach (whether to override it at all) depends on the type of objects you are talking about.
The default implementation of hashCode is as fast as possible, because it is effectively unique per instance. That may be enough for many cases.
It is not good when you want to use a HashSet and, let's say, avoid storing two "same" objects in the collection. Now, the point is in the word "same".
"Same" can mean "same instance". "Same" can mean an object with the same (database) identifier, when your object is an entity. Or "same" can mean an object with all properties equal. This choice can affect performance.
But note that one of those properties can itself be an object that evaluates hashCode() on demand, so calling hashCode() on the root object may end up evaluating the hash code of a whole object tree.
So, what would I recommend? You need to define and clarify what you want. Do you really need to distinguish different object instances? Is the identifier crucial? Or is it a value object?
It also depends on immutability. It is possible to calculate the hash code once, when the object is constructed from its constructor properties (which have only getters), and reuse it whenever hashCode() is called. Another option is to recalculate the hash code whenever any property changes. You need to decide whether most callers read the value or write it.
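Sketching the first option (an illustrative class, not from the question): a fully immutable object can compute its hash once, eagerly, in the constructor:

    import java.util.Objects;

    final class Snapshot {
        private final long timestamp;
        private final String label;
        private final int hash;  // fixed for the object's lifetime

        Snapshot(long timestamp, String label) {
            this.timestamp = timestamp;
            this.label = label;
            this.hash = Objects.hash(timestamp, label); // computed once, here
        }

        @Override
        public int hashCode() { return hash; }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Snapshot)) return false;
            Snapshot s = (Snapshot) o;
            return timestamp == s.timestamp && label.equals(s.label);
        }
    }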
The last thing I would say is to override hashCode() method only when you know that you need it and when you know what are you doing.
The main purpose of the hashCode method is to allow an object to be a key in a hash map or a member of a hash set. In that case an object should also implement the equals(Object) method, consistently with the hashCode implementation:
If a.equals(b) then a.hashCode() == b.hashCode()
If hashCode() is called twice on the same object, it should return the same result, provided that the object was not changed
hashCode from the performance point of view
From the performance point of view, the main objective for your hashCode method implementation is to minimize the number of objects sharing the same hash code.
All JDK hash-based collections store their values in an array.
The hash code is used to calculate an initial lookup position in this array. After that, equals is used to compare the given value with the values stored in the internal array. So, if all values have distinct hash codes, this minimizes the possibility of hash collisions.
On the other hand, if all values have the same hash code, the hash map (or set) degrades into a list, with operations on it having O(n) complexity.
From Java 8 onwards, though, collisions do not impact performance as much as they did in earlier versions, because after a threshold the linked list in a bucket is replaced by a balanced tree, which gives you O(log n) performance in the worst case, as compared to O(n) for the linked list.
Never write a hashCode method which returns a constant.
The distribution of String.hashCode results is nearly perfect, so you can sometimes substitute Strings with their hash codes.
The next objective is to check how many identifiers with non-unique hash codes you still have. Improve your hashCode method, or increase the range of allowed hash code values, if you have too many non-unique hash codes. In the perfect case, all your identifiers will have unique hash codes.
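One hedged way to measure that (the helper class is made up): tally your values by hash code and count how many share a code with another value:

    import java.util.HashMap;
    import java.util.Map;

    class HashSpread {
        // Returns how many values share their hash code with at least one
        // other value; 0 means every hash code is unique.
        static long countColliding(Iterable<?> values) {
            Map<Integer, Integer> perHash = new HashMap<>();
            for (Object v : values) {
                perHash.merge(v.hashCode(), 1, Integer::sum);
            }
            long colliding = 0;
            for (int c : perHash.values()) {
                if (c > 1) colliding += c;
            }
            return colliding;
        }
    }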
The best look-up structure is a hash table. It provides constant-time access on average (linear in the worst case).
This depends on the hash function. Ok.
My question is the following: assuming a good implementation of a hash table, e.g. HashMap, is there a best practice concerning the keys passed to the map? I mean, it is recommended that the key be an immutable object, but I was wondering if there are other recommendations.
For example, the size of the key? In a good hash map (in the way described above), if we used Strings as keys, wouldn't the "bottleneck" be the string comparison in equals (while trying to find the key)? So should the keys be kept small? Or are there objects that should not be used as keys, e.g. a URL? In such cases, how can you choose what to use as a key?
The best-performing key for a HashMap is probably an Integer, where hashCode() and equals() are implemented like this:
    public int hashCode() {
        return value;
    }

    public boolean equals(Object obj) {
        if (obj instanceof Integer) {
            return value == ((Integer) obj).intValue();
        }
        return false;
    }
That said, the purpose of a HashMap is to map some objects (keys) to some others (values). The hash function is applied to the key in order to locate the value object quickly, providing constant-time access.
it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
The recommendation is to map objects to what you need: don't think about what is faster; think about what is best for your business logic when addressing the objects you retrieve.
The important requirement is that the key object must be immutable, because if you change the key object after storing it in the Map, it may not be possible to retrieve the associated value later.
The key word in HashMap is Map. Your object should just map. If you sacrifice the mapping task by optimizing the key, you are defeating the purpose of the Map, probably without achieving any performance boost.
I 100% agree with the first two comments in your question:
the major constraint is that it has to be the thing that you want to base the lookup on ;)
– Oli Charlesworth
The general rule is to use as the key whatever you need to look up with.
– Louis Wasserman
Remember the two rules of optimization:
1. Don't.
2. (For experts only) Don't yet.
The third rule is: profile before you optimize.
You should use whatever key you want to use to lookup things in the data structure, it's typically a domain-specific constraint. With that said, keep in mind that both hashCode() and equals() will be used in finding a key in the table.
hashCode() is used to find the position of the key, while equals() is used to determine if the key you are searching for is actually the key that we just found using hashCode().
For example, consider two keys a and b that have the same hash code in a table using separate chaining. Then a search for a would require testing if a.equals(key) for potentially both a and b in the table once we find the index of the list containing a and b from hashCode().
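A hedged toy implementation of separate chaining (illustrative only, far simpler than the JDK's): hashCode() picks the bucket, equals() confirms the match inside it:

    import java.util.LinkedList;
    import java.util.List;

    class TinyChainedSet<E> {
        @SuppressWarnings("unchecked")
        private final List<E>[] buckets = new List[16];

        private int indexFor(E e) {
            return (e.hashCode() & 0x7fffffff) % buckets.length; // hash -> bucket
        }

        public boolean contains(E e) {
            List<E> bucket = buckets[indexFor(e)];
            if (bucket == null) return false;
            for (E candidate : bucket) {
                if (candidate.equals(e)) return true; // equals() confirms
            }
            return false;
        }

        public boolean add(E e) {
            if (contains(e)) return false;
            int i = indexFor(e);
            if (buckets[i] == null) buckets[i] = new LinkedList<>();
            return buckets[i].add(e);
        }
    }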
it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
The key under which a value is stored should be final.
Most of the time a field of the object is used as the key; if that field changes, the map can no longer find the value:
    void foo(Employee e) {
        map.put(e.getId(), e);
        String newId = e.getId() + "new";
        e.setId(newId);
        Employee e2 = map.get(newId);
        // e2 == null: the entry was stored under the old id, so e != e2
    }
So Employee should not have a setId() method at all, but that is difficult because when you are writing Employee you don't know what it will be keyed by.
I dug into the implementation. I had an assumption that the effectiveness of the hashCode() method would be the key factor.
When I looked into the HashMap and Hashtable implementations, I found that they are quite similar (with one exception). Both compute and store an internal hash code for all entries, which is a good sign that hashCode() does not influence performance as heavily as one might expect.
Both have a number of buckets where the values are stored. There is an important balance between the number of buckets (say n) and the average number of keys within a bucket (say k). The bucket is found in O(1) time, and its contents are iterated in O(k) time, but the more buckets we have, the more memory is allocated. Also, if many buckets are empty, it means that the hashCode() method of the key class does not spread the hash codes widely enough.
The algorithm works like this:
1. Take the `hashCode()` of the key (and apply a slight bijective transformation to it)
2. Find the appropriate bucket
3. Loop through the contents of the bucket (which is some kind of LinkedList)
4. Compare the keys as follows (sketched in code below):
   a. compare the hash codes (calculated in the first step, and stored for the entry)
   b. check whether key `==` stored key (still no method call; this step is missing from Hashtable)
   c. compare the keys by `key.equals(storedKey)`
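Step 4 rendered as code (a sketch; the names are illustrative, not copied from the JDK source):

    class KeyComparison {
        // Cheapest checks first: int comparison, then reference check,
        // and only then the potentially expensive equals() call.
        static <K> boolean sameKey(int storedHash, K storedKey, int hash, K key) {
            return storedHash == hash
                && (storedKey == key || key.equals(storedKey));
        }
    }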
To summarize:
- `hashCode()` is called once per call (this is a must, you cannot do without it)
- `equals()` is called only if the hash codes are not so well spread, and two keys happen to have the same hash code
The same algorithm is used for get() and put() (because in the put() case you may set the value for an existing key). So, the most important thing is how the hashCode() method is implemented: it is the most frequently called method.
Two strategies are: make it fast, and make it effective (well spread). The JDK developers made efforts to achieve both, but it's not always possible to have them both.
- Numeric types are good
- Object (and classes that don't override it) are good (hashCode() is native), except that you cannot specify your own equals()
- String is not good: it iterates through the characters, but it caches the result after that (see my comment below)
- Any class with a synchronized hashCode() is not good
- Any class whose hashCode() performs an iteration is not good
- Classes that cache the hash code are a bit better (depends on the usage)

Comment on String: to make it fast, in the first versions of the JDK the String hash code was calculated from a sample of the characters only. But the hash codes it produced were not well spread, so they decided to take all the characters into the hash code.
Why did Java's designers impose the mandate that

    if obj1.equals(obj2) then
        obj1.hashCode() MUST be == obj2.hashCode()
Because a HashMap uses the following algorithm to find keys quickly:
get the hashCode() of the key in argument
deduce the bucket from this hash code
compare every key in the bucket with the key in argument (using equals()) to find the right one
If two equal objects didn't have the same hash code, the first two steps of the algorithm wouldn't work. And it's those first two steps that make a HashMap very fast (O(1)).
There is no mandate. It is good practice, since it is a required condition if your objects are meant to be used in hash-based data structures like HashMap, HashSet, etc.
As far as I know, that's not baked into the language: you could technically have objects whose equals() is inconsistent with their hashCode(), but you'll get pretty peculiar results.
In particular, if you put a bunch of these objects into a HashMap or HashSet, the map/set will use the hashCode() method to determine whether the objects may be duplicates, so you can have a situation where a collection stores two objects you've defined as equal (which should never happen), because they each return different hash codes.
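A short demo of exactly that situation (a made-up class): equals() says the two objects are equal, but their (almost certainly distinct) identity hash codes land them in different buckets, so the HashSet keeps both:

    import java.util.HashSet;
    import java.util.Set;

    class Broken {
        final int id;
        Broken(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Broken && ((Broken) o).id == id;
        }
        // hashCode() deliberately NOT overridden: identity hash codes differ

        public static void main(String[] args) {
            Set<Broken> set = new HashSet<>();
            set.add(new Broken(7));
            set.add(new Broken(7));
            System.out.println(set.size()); // prints 2, despite equals() == true
        }
    }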
Because hashcodes are used to quickly determine if two objects are not equal.
It's a sort of two-step matching, to improve performance.
First step: compare hashCode()
Second step: evaluate equals()
It's because, if you put your objects as keys in a collection like HashMap, your keys are compared on the hashCode() method first; only if a matching hash code is found does it go on to evaluate equals().
It's like indexing for better search performance.