How can a Hash Set incur collision? - java

If a hash set contains only one instance of any distinct element(s), how might collision occur at this case?
And how could load factor be an issue since there is only one of any given element?
While this is homework, it is not for me. I am tutoring someone, and I need to know how to explain it to them.

Let's assume you have a HashSet of Integers, and your Hash Function is mod 4. The integers 0, 4, 8, 12, 16, etc. will all colide, if you try to insert them. (mod 4 is a terrible hash function, but it illustrates the concept)
Assuming a proper function, the load factor is correlated to the chance of having a collision; please note that I say correlated and not equal because it depends on the strategy you use to handle collisions. In general, a high load factor increases the possibility of collisions. Assuming that you have 4 slots and you use mod 4 as the hash function, when the load factor is 0 (empty table), you won't have a collision. When you have one element, the probability of a collision is .25, which obviously degrades the performance, since you have to solve the collision.
Now, assuming that you use linear probing (i.e. on collision, use the next entry available), once you reach 3 entries in the table, you have a .75 probability of a collision, and if you have a collision, in the best case you will go to the next entry, but in the worst, you will have to go through the 3 entries, so the collision means that instead of a direct access, you need in average a linear search with an average of 2 items.
Of course, you have better strategies to handle collisions, and generally, in non-pathological cases, a load of .7 is acceptable, but after that collisions shoot up and performance degrades.

The general idea behind a "hash table" (which a "hash set" is a variety of) is that you have a number of objects containing "key" values (eg, character strings) that you want to put into some sort of container and then be able to find individual objects by their "key" values easily, without having to examine every item in the container.
One could, eg, put the values into a sorted array and then do a binary search to find a value, but maintaining a sorted array is expensive if there are lots of updates.
So the key values are "hashed". One might, for instance, add together all of the ASCII values of the characters to create a single number which is the "hash" of the character string. (There are better hash computation algorithms, but the precise algorithm doesn't matter, and this is an easy one to explain.)
When you do this you'll get a number that, for a ten-character string, will be in the range from maybe 600 to 1280. Now, if you divide that by, say, 500 and take the remainder, you'll have a value between 0 and 499. (Note that the string doesn't have to be ten characters -- longer strings will add to larger values, but when you divide and take the remainder you still end up with a number between 0 and 499.)
Now create an array of 500 entries, and each time you get a new object, calculate its hash as described above and use that value to index into the array. Place the new object into the array entry that corresponds to that index.
But (especially with the naive hash algorithm above) you could have two different strings with the same hash. Eg, "ABC" and "CBA" would have the same hash, and would end up going into the same slot in the array.
To handle this "collision" there are several strategies, but the most common is to create a linked list off the array entry and put the various "hash synonyms" into that list.
You'd generally try to have the array large enough (and have a better hash calculation algorithm) to minimize such collisions, but, using the hash scheme, there's no way to absolutely prevent collisions.
Note that the multiple entries in a synonym list are not identical -- they have different key values -- but they have the same hash value.

Related

Java Structure that is able to determine approximate number of elements less then x in an ordered set which is updated concurrently

Suppose U is an ordered set of elements, S ⊆U, and x ∈ U. S is being updated concurrently. I want to get an estimate of the number of elements in S that is less x in O(log(|S|) time.
S is being maintained by another software component that I cannot change. However, whenever e is inserted (or deleted) into S I get a message e inserted (deleted). I don't want to maintain my own version of S since memory is limited. I am looking for a structure, ES, (perhaps using O(log(|S|) space) where I can get a reasonable estimate of the number of elements less than any give x. Assume that the entire set S can periodically be sampled to recreate or update ES.
Update: I think that this problem statement must include more specific values for U. One obvious case is where U are numbers (int, double,etc). Another case is where U are strings ordered lexigraphically.
In the case of numbers one could use a probability distribution (but how can that be determined?).
I am wondering if the set S can be scanned periodically. Place the entire set into an array and sort. Then pick the log(n) values at n/log(n), 2n/log(n) ... n where n = |S|. Then draw a histogram based on those values?
More generally how can one find the appropriate probability distribution from S?
Not sure what the unit of measure would be for strings lexigraphically ordered?
By concurrently, I'm assuming you mean thread-safe. In that case, I believe what you're looking for is a ConcurrentSkipListSet, which is essentially a concurrent TreeSet. You can use ConcurrentSkipListSet#headSet.size() or ConcurrentSkipListSet#tailSet.size() to get the amount of elements greater/less than (or equal to) a single element where you can pass in a custom Comparator.
Is x constant? If so it seems easy to track the number less than x as they are inserted and deleted?
If x isn't constant you could still take a histogram approach. Divide up the range that values can take. As items are inserted / deleted, keep track of how many items are in each range bucket. When you get a query, sum up all the values from smaller buckets.
I accept your point that bucketing is tricky - especially if you know nothing about the underlying data. You could record the first 100 values of x, and use those calculate a mean and a standard deviation. Then you could assume the values are normally distributed and calculate the buckets that way.
Obviously if you know more about the underlying data you can use a different distribution model. It would be easy enough to have a modular approach if you want it to be generic.

Is HashTable/HashMap an array?

I am having confusion in hashing:
When we use Hashtable/HashMap (key,value), first I understood the internal data structure is an array (already allocated in memory).
Java hashcode() method has an int return type, so I think this hash value will be used as an index for the array and in this case, we should have 2 power 32 entries in the array in RAM, which is not what actually happens.
So does Java create an index from the hashcode() which is smaller range?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is an array. The hashcode() is rehashed again but still integer and the index in the array becomes: h & (length-1); so if the length of the array is 2^n then I think the index takes the first n bit from re-hashed value.
The structure for a Java HashMap is not just an array. It is an array, but not of 2^31 entries (int is a signed type!), but of some smaller number of buckets, by default 16 initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor) of the capacity, the array grows to a larger size.
Each element of the array does not hold only one entry. Each element of the array holds a structure (currently a red-black tree, formerly a list) of entries. Each entry of the structure has a hash code that transforms internally to the same bucket position in the array.
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes the chance of collisions more likely, since many objects could have a hash code that reduces to the same index (possible anyway since multiple objects might have exactly the same hash code, but now much more likely). There are different strategies for dealing with this, generally either by using a linked-list-like structure or a mechanism for picking another slot if the first slot that matched was occupied by a non-equal key.
Since this adds cost, the more often such collisions happen the slower things become and in the worse case lookup would in fact be O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store will generally improve this though, especially if it is not to a multiple of the previous size (so the operation that reduced the hash code to find an index won't take a bunch of items colliding on the same index and then give them all the same index again). Some mechanisms will increase the internal size before absolutely necessary (while there is some empty space remaining) in certain cases (certain percentage, certain number of collisions with objects that don't have the same full hash code, etc.)
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), the order of operation stays at O(1).

HashMap speed greater for smaller maps

This may be a strange question, but it is based on some results I get, using Java Map - is element retrieval speed greater in case of a HashMap, when the map is smaller?
I have some part of code that uses containsKey and get(key) methods of a HashMap, and it seems that runs faster if number of elements in the Map is smaller? Is that so?
My knowledge is that HashMap uses some hash function to access to certain field of a map, and there are versions in which that field is a reference to a linked list (because some keys can map to same value), or to other fields in the map, when implemented fully statically.
Is this correct - speed can be greater if Map has less elements?
I need to extend my question, with a concrete example.
I have 2 cases, in both the total number of elements is same.
In first case, I have 10 HashMaps, I'm not aware how elements are distributed. Time of execution of that part of algorithm is 141ms.
In second case, I have 25 HashMaps, same total number of elements. Time of execution of same algorithm is 69ms.
In both cases, I have a for loop that goes through each of the HashMaps, tries to find same elements, and to get elements if present.
Can it be that the execution time is smaller, because individual search inside HashMap is smaller, so is there sum?
I know that this is very strange, but is something like this somehow possible, or am I doing something wrong?
Map(Integer,Double) is considered. It is hard to tell what is the distribution of elements, since it is actually an implementation of KMeans clustering algorithm, and the elements are representations of cluster centroids. That means that they will mostly depend on the initialization of the algorithm. And the total number of elements will not mostly be the same, but I have tried to simplify the problem, sorry if that was misleading.
The number of collisions is decisive for a slow down.
Assume an array of some size, the hash code modulo the size then points to an index where the object is put. Two objects with the same index collide.
Having a large capacity (array size) with respect to number of elements helps.
With HashMap there are overloaded constructors with extra settings.
public HashMap(int initialCapacity,
float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
You might experiment with that.
For a specific key class used with a HashMap, having a good hashCode can help too. Hash codes are a separate mathematical field.
Of course using less memory helps on the processor / physical memory level, but I doubt an influence in this case.
Does your timing take into account only the cost of get / containsKey, or are you also performing puts in the timed code section? If so, and if you're using the default constructor (initial capacity 16, load factor 0.75) then the larger hash tables are going to need to resize themselves more often than will the smaller hash tables. Like Joop Eggen says in his answer, try playing around with the initial capacity in the constructor, e.g. if you know that you have N elements then set the initial capacity to N / number_of_hash_tables or something along those lines - this ought to result in the smaller and larger hash tables having sufficient capacity that they won't need to be resized

How to get the optimal number to use in hashtable?

As the question states, how calculate the optimal number to use and how to motivate it?
If we are going to build an hashtable which uses the following hash function:
h(k) = k mod m, k = key
So some sources tells me:
to use the number of elements to be inserted as the value of m
to use a close prime to m
that java simply use 31 as their value of m
And some people tell me to use the closed prime to 2^n as m
I'm so confused at this point that I don't know what value to use for m. Like for instance if we use the table size for m then what happens if we want to expand the table size? Will I then have to rehash all the values with the new value of m. If so why does Java use simply 31 as prime value for m.
I've also heard that the table size should be two times bigger then the total elements in the hashtable, that's for each time it rehashes. But how come we for instance use m=10 for a table of 10 elements when it should be m=20 to create that extra empty space?
Can someone please help me understand how to calculate the value of m to use based on different scenarios like when we want to have a static (where we know that we will only insnert like 10 elements) or dynamic (rehash after a certain limit) hashtable.
Lets illustrate my problem by the following examples:
I got the values {1,2,...,n}
Question: What would be a optimized value of m if I must use the division by mod in my hashfunction?
Senario 1: n = 100?
Senario 2: n = 5043?
Addition question:
Would the m value hashfunction be different if we used a open or closed hashtable?
Note that i'm now not in need to understand hashtable for java but hashtable in general where I must use a divsion mod hashfunction.
Thank you for your time!
You have several issues here:
1) What should m equal?
2) How much free space should you have in your hash table?
3) Should you make the size of your table be a prime number?
1) As was mentioned in the comments, the h(k) you describe isn't the hash function, it gives you the index into your hash table. The idea is that every object produces some hash code, which is a positive integer. You use the hash code to figure out where to put the object in the hash table (so that you can find it again later). You clearly don't want a hash table of size MAX_INT, so you choose some size m. Then for any object, you take its hash code, compute k % m, and now you have an integer in the interval [0, m-1], which is a valid index into your hash table.
2) Because a hash table works by using a hash code to find the place in a table where an object should go, you get into trouble if multiple items are assigned to the same location. This is called a collision. Every hash table implementation must deal with collisions, either by putting items into nearby spots or keeping a linked list of items in each location. No matter the solution, more collisions means lower performance for your hash table. For that reason, it is recommended that you not let your hash table fill up, otherwise, collisions are more likely. Keeping your hash table at least twice as large as the number of items is a common recommendation to reduce the probability of collisions. Obviously, this means you will have to resize your table as it fills up. Yes, this means that you have to rehash each item since it will go into a different location when you are taking a modulus by a different value. That is the hidden cost of a hash table: it runs in constant time (assuming few or no collisions), but it can have a large coefficient (ammortized resizing, rehashing, etc.).
3) It is also often recommended that you make the size of your hash table be a prime number. This is because it tends to produce a better distribution of items in your hash table in certain common use cases, thus avoiding collisions. Rather than giving a complete explanation here, I will refer you to this excellent answer: Why should hash functions use a prime number modulus?

Is a Java hashmap search really O(1)?

I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was bought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that unlike, say, balanced trees, its behavior is probabilistic. In these cases its usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring would be. For a hash map, that of course is the case of a collision with respect to how full the map happens to be. A collision is pretty easy to estimate.
pcollision = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k.
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
pcollision x 2 = (n / capacity)2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalzie this to
pcollision x k = (n / capacity)k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash-map has O(1) access with high probability
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how HashMap works?
Using hashCode to locate the corresponding bucket [inside buckets container model].
Each bucket is a LinkedList (or a Balanced Red-Black Binary Tree under some conditions starting from Java 8) of items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that o(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having o(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as TreeMaps once they exceed a threshold, which makes the actual time O(log n).
O(1+n/k) where k is the number of buckets.
If implementation sets k = n/alpha then it is O(1+alpha) = O(1) since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked list (node), each linked list in the array represents a bucket for unique hash value of one or more keys.
While adding an entry in the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses same way to determine the location of bucket for the key. Under the best case each key has unique hashcode and results in a unique bucket for each key, in this case the get method spends time only to determine the bucket location and retrieving the value which is constant O(1).
Under the worst case, all the keys have same hashcode and stored in same bucket, this results in traversing through the entire list which leads to O(n).
In the case of java 8, the Linked List bucket is replaced with a TreeMap if the size grows to more than 8, this reduces the worst case search efficiency to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining then the worst case scenario happens where every data element is hashed to the same value (poor choice of the hash function for example). In that case, data lookup is no different from a linear search on a linked list i.e. O(n). However, the probability of that happening is negligible and lookups best and average cases remain constant i.e. O(1).
Only in theoretical case, when hashcodes are always different and bucket for every hash code is also different, the O(1) will exist. Otherwise, it is of constant order i.e. on increment of hashmap, its order of search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend based on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have a very good performance (this is not strictly O(1) in every possible case but it is in most cases).
For example the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change - but it also disables biased locking, but that's an other discussion) so the chance of collisions is very low.

Categories