I am a CS student, so please bear with me if what I say sounds too ridiculous. It definitely does so to me, that is why I am here in search of an answer.
I read how strings are hashed in Java, and then I took a look at the ASCII table. The one-character strings "d" and "n" hash to 100 and 110 respectively. Now suppose I create a brand new HashMap in Java with 10 buckets (I know the actual default is 16, but assume 10 here). Even though the hash codes are unique, mod 10 they are both 0. This leads to a collision.
Having collisions on one-character strings just doesn't sit well with me, so is the process I described correct?
Thanks in advance.
What you described is essentially correct: both would fall into the same bucket, due to the pigeonhole principle, which basically means that if you have more items than holes to put them in, two or more will end up in the same hole. In this case, considering only the 95 printable ASCII characters and 10 buckets, the principle guarantees that at least one bucket holds 10 or more of them (regardless of the actual values, only the count matters).
However, shazin's answer is also correct: the hash values are not actually used as the identity of the values in a map. Instead they are used to find the bucket in which the key/value pair belongs, and then the values in the bucket are checked for equality with their equals() method (or with ==, if using IdentityHashMap).
In hash-based collections, the hash is used as an indexing or grouping mechanism, not as the actual reference to the element. The hash is used first, to find the bucket that may contain the element.
As you said, "d" and "n" can end up in the same bucket, but after that the actual keys are compared to locate the right entry. Since HashMaps don't allow duplicate keys, you can be sure that there will always be at most one "d" and one "n".
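To make this concrete, here is a small Java sketch (the 10-bucket table size is taken from the question; a real HashMap actually defaults to 16 buckets):

```java
import java.util.HashMap;

public class BucketDemo {
    public static void main(String[] args) {
        // A one-character string hashes to its character code
        int dHash = "d".hashCode(); // 100
        int nHash = "n".hashCode(); // 110

        int buckets = 10; // the table size assumed in the question
        System.out.println(dHash % buckets); // 0
        System.out.println(nHash % buckets); // 0 -> same bucket: a collision

        // The map still tells the keys apart via equals() inside the bucket
        HashMap<String, Integer> map = new HashMap<>();
        map.put("d", 1);
        map.put("n", 2);
        System.out.println(map.get("d")); // 1
        System.out.println(map.get("n")); // 2
    }
}
```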
Related
I implemented a custom HashMap class (in C++, but shouldn't matter). The implementation is simple -
A large array holds pointers to Items.
Each item contains the key - value pair, and a pointer to an Item (to form a linked list in case of key collision).
I also implemented an iterator for it.
My implementation of incrementing/decrementing the iterator is not very efficient. From the present position, the iterator scans the array of hashes for the next non-null entry. This is very inefficient, when the map is sparsely populated (which it would be for my use case).
Can anyone suggest a faster implementation, without affecting the complexity of other operations like insert and find? My primary use case is find, secondary is insert. Iteration is not even needed, I just want to know this for the sake of learning.
PS: Why did I implement a custom class? Because I need to find strings with some error tolerance, while the ready-made hash maps I have seen provide only exact matches.
EDIT: To clarify, I am talking about incrementing/decrementing an already obtained iterator. Yes, this is mostly done in order to traverse the whole map.
The errors in the strings (keys) in my case come from OCR errors, so I cannot use the error-handling techniques used to detect typing errors: the chance of the first character being wrong is almost the same as that of the last one.
Also, my keys are always strings, one word to be exact. The number of entries will be less than 5000, so a hash table size of 2^16 is enough for me. It will still be sparsely populated, but that's OK.
My hash function:
hash code size is 16 bits.
First 5 bits for the word length. ==> Max possible key length = 32. Reasonable, given that key is a single word.
Last 11 bits for the sum of the char codes. I only store English alphabet characters and do not need case sensitivity, so 26 codes are enough, 0 to 25. A key of 32 'z's sums to 25 * 32 = 800, which is well within 2^11. I even have scope to add case sensitivity if needed in the future.
Now when you compare a key containing an error with the correct one,
say "hell" with "hello"
1. Length of the keys is approx the same
2. sum of their chars will differ by the sum of the dropped/added/distorted chars.
In the hash code, since the first 5 bits encode the length, the table has a fixed section for every possible key length, all sections of the same size. The first section stores keys of length 1, the second keys of length 2, and so on.
Now 'hello' is stored in the 5th section, as its length is 5. When we try to find it using the distorted key 'helo':
Hashcode of 'hello' = (length - 1) (sum of chars) = (4) (7 + 4 + 11 + 11 + 14) = (4) (47)
= (00100)(00000101111)
similarly, hashcode of 'helo' = (3)(36)
= (00011)(00000100100)
We jump to its bucket, and don't find it there.
So we check for ONE distorted character. This does not change the length, but changes the sum of characters by at most -25 to +25. So we search from 25 places backwards to 25 places forwards, i.e., we check the sum part from (36 - 25) to (36 + 25) in the same section. We won't find it.
Next we check for one additional character. That would mean the correct string contains only 3 characters, so we go to the third section. The additional character would have increased the sum by at most 25, which has to be compensated for, so we search the appropriate 25 places of the third section, from (36 - 25) to (36 - 0). Again we don't find it.
Now we consider the case of a missing character. The original string would then contain 5 characters, and the sum of its characters would be larger by 0 to 25. So we search the corresponding buckets in the 5th section, from (36 + 0) to (36 + 25). As 47 (the sum part of 'hello') lies in this range, we find a matching hash code, and we also know that this match is due to a missing character. So we compare the keys allowing a tolerance of one missing character, and we get a match!
In reality, this has been implemented to allow more than one error in key.
It can also be optimized to use only 25 places for the first section (since it has only one character) and so on.
Also, checking 25 places seems overkill, as we already know the largest and smallest char of the key. But it gets complex in case of multiple errors.
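A minimal sketch of the hash described above (in Java rather than C++; the method name is my own, and the input is assumed to be a single English word in either case):

```java
public class OcrHash {
    // 5 high bits: length - 1; 11 low bits: sum of letter codes (a=0 .. z=25)
    static int hash16(String word) {
        int sum = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            sum += c - 'a';
        }
        return ((word.length() - 1) << 11) | sum;
    }

    public static void main(String[] args) {
        System.out.println(hash16("hello")); // (4 << 11) | 47 = 8239
        System.out.println(hash16("helo"));  // (3 << 11) | 36 = 6180
    }
}
```

A tolerant lookup would then scan the neighbouring sum values in the appropriate length section, exactly as walked through above.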
You mention an 'error tolerance' for the strings. Why not build the tolerance into the hash function itself, and thus obviate the need for iteration?
You could go the way of Java's LinkedHashMap class. It adds efficient iteration to a hash map by also making it a doubly-linked list.
The entries are key-value pairs that have pointers to the previous and next entries. The hashmap itself has the large array as well as the head of the linked list.
Insertion/deletion are constant time for both data structures, searches are done via the hashmap, and iteration via the linked list.
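For example, iteration over a LinkedHashMap visits entries in insertion order and costs O(size) rather than O(capacity), because it follows the linked list instead of scanning the sparse bucket array:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedIterationDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("banana", 2);
        map.put("apple", 1);
        map.put("cherry", 3);
        // Iteration walks the doubly-linked list, not the bucket array
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```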
I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.
The problem I ran into, however, is that the system I am working in borks if the id is negative and String.hashCode often returns negative values. One quick solution is to just use Math.abs() around the hashcode call to guarantee a positive result. What I was wondering about this approach is what are the chances of two distinct elements having the same hashcode?
For example, if one string returns a hashcode of -10 and another string returns a hashcode of 10 an error would occur. In my system we're talking about collections of objects that aren't more than 30 elements large typically so I don't think this would really be an issue, but I am curious as to what the math says.
Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.
Also, your idea to use Math.abs() alone is flawed: it does not always return a positive number! In two's complement arithmetic, the absolute value of Integer.MIN_VALUE is Integer.MIN_VALUE itself! Famously, the hash code of "polygenelubricants" is exactly this value.
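Both facts are easy to check, and masking off the sign bit is a safe alternative to Math.abs():

```java
public class AbsPitfall {
    public static void main(String[] args) {
        int h = "polygenelubricants".hashCode();
        System.out.println(h == Integer.MIN_VALUE); // true
        System.out.println(Math.abs(h));            // still negative!

        // Clearing the sign bit always yields a non-negative value
        int safe = h & 0x7fffffff;
        System.out.println(safe >= 0); // true
    }
}
```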
Hashes are not unique, hence they are not appropriate for a unique ID.
As for the probability of a hash collision, read about the birthday paradox: when drawing from a uniform distribution of N values, you should expect a collision after about √N draws (you could get a collision much earlier). The problem is that Java's implementation of hashCode (especially for short strings) doesn't provide a uniform distribution, so you'll get collisions even earlier.
You can already get two strings with the same hashcode. This should be obvious if you consider that there is an infinite number of strings and only 2^32 possible hashcodes.
Taking the absolute value just makes it a little more probable. The risk is small, but if you need a unique ID, this isn't the right approach.
What you can do when you only have 30-50 values, as you said, is register each String you get in a HashMap together with a running counter as the value:
HashMap<String, Integer> stringMap = new HashMap<>();
stringMap.put("Test", 1);
stringMap.put("AnotherTest", 2);
You can then get your unique ID by calling this:
stringMap.get("Test"); // returns 1
I understand that a hashing technique is applied to a key to decide where its value is stored in memory.
But I don't understand how collisions happen here. Which hashing algorithm does Java use to pick the location? Is it MD5?
The basic idea of HashMap is this:
A HashMap is really an array of special objects that hold both Key and Value.
The array has some amount of buckets (slots), say 16.
The hashing algorithm is provided by the hashCode() method that every object has. Therefore, when you are writing a new Class, you should take care of proper hashCode() and equals() methods implementation. The default one (of the Object class) takes the memory pointer as a number. But that's not good for most of the classes we would like to use. For example, the String class uses an algorithm that makes the hash from all the characters in the string - think of it like this: hashCode = 1.char + 2.char + 3.char... (simplified). Therefore, two equal Strings, even though they are on a different location in the memory, have the same hashCode().
The result of hashCode(), say "132", is then the number of bucket where the object should be stored if we had an array that big. But we don't, ours is only 16 buckets long. So we use the obvious calculation 'hashcode % array.length = bucket' or '132 mod 16 = 4' and store the Key-Value pair in a bucket number 4.
If there's no other pair yet, it's ok.
If there's one with Key equal to the Key we have, we remove the old one.
If there's another Key-Value pair (collision), we chain the new one after the old one into a linked list.
If the backing array becomes too full, so that we would have to make too many linked lists, we make a new array with double the length, rehash all the elements, and add them to the new array, then dispose of the old one. This is likely the most expensive operation on a HashMap, so you should tell your Maps how many buckets you'll use if you know it beforehand.
If somebody tries to get a value, he provides a key, we hash it, mod it, and then go through the potential linked list for the exact match.
An image, courtesy of Wikipedia:
In this case,
there is an array with 256 buckets (numbered, of course, 0-255)
there are five people. Their hashcodes, after being put through mod 256, point to four different slots in the array.
you can see Sandra Dee didn't have a free slot, so she has been chained after John Smith.
Now, if you tried to look up a telephone number of Sandra Dee, you would hash her name, mod it by 256, and look into the bucket 152. There you'd find John Smith. That's no Sandra, look further ... aha, there's Sandra chained after John.
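The put/get logic above can be sketched as a toy chained hash map (a simplified model for illustration, not the real java.util.HashMap; resizing is omitted):

```java
public class SimpleHashMap<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = new Node[16];

    private int index(K key) {
        // mask the sign bit so the index is never negative
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        int i = index(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) { n.value = value; return; } // replace old
        }
        Node<K, V> head = new Node<>(key, value); // collision: chain in front
        head.next = buckets[i];
        buckets[i] = head;
    }

    public V get(K key) {
        for (Node<K, V> n = buckets[index(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value; // exact match via equals()
        }
        return null;
    }
}
```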
Hash here does not mean a cryptographic hashing technique such as MD5. It is the hashCode() of the key, used to choose the bucket in which the Object for that particular key is stored.
Readings:
How does Java hashmap work?
How HashMap works in Java
This is a somewhat clearer explanation of how HashMap works:
The default hashCode() implementation in the Object class derives the hash from the object's identity (historically its memory address); that hash is what HashTable and HashMap use to place the key.
After going through @Slanec's answer, do see the javadoc from Java 8, as there are significant changes. For example 'TREEIFY', where a bucket's linked list is converted to a balanced tree once a threshold number of entries per bucket (currently 8) is reached.
If a hash set contains only one instance of any distinct element, how might a collision occur in this case?
And how could the load factor be an issue, since there is only one of any given element?
While this is homework, it is not for me. I am tutoring someone, and I need to know how to explain it to them.
Let's assume you have a HashSet of Integers, and your hash function is mod 4. The integers 0, 4, 8, 12, 16, etc. will all collide if you try to insert them. (mod 4 is a terrible hash function, but it illustrates the concept.)
Assuming a proper hash function, the load factor is correlated with the chance of a collision; note that I say correlated and not equal, because it depends on the strategy you use to handle collisions. In general, a higher load factor increases the probability of a collision. Assuming you have 4 slots and use mod 4 as the hash function: when the load factor is 0 (empty table), you cannot have a collision; once you have one element, the probability that the next insertion collides is .25, which degrades performance, since you have to resolve the collision.
Now, assuming you use linear probing (i.e. on a collision, take the next available entry), once you have 3 entries in the table there is a .75 probability that the next insertion collides. On a collision, in the best case you go to the next entry, but in the worst case you walk through all 3 entries; so instead of a direct access, you need on average a linear search over about 2 items.
Of course, there are better strategies to handle collisions, and generally, in non-pathological cases, a load factor of .7 is acceptable; beyond that, collisions shoot up and performance degrades.
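A minimal linear-probing table matching the mod-4 example (a sketch; the names are my own, and non-negative integer keys are assumed):

```java
public class ProbingTable {
    private final Integer[] slots = new Integer[4];

    // Insert with linear probing: on a collision, try the next slot
    public boolean insert(int key) {
        int h = key % 4;
        for (int i = 0; i < slots.length; i++) {
            int idx = (h + i) % slots.length;
            if (slots[idx] == null) {
                slots[idx] = key;
                return true;
            }
        }
        return false; // table is full
    }

    public Integer slot(int idx) {
        return slots[idx];
    }

    public static void main(String[] args) {
        ProbingTable t = new ProbingTable();
        t.insert(0); // slot 0
        t.insert(4); // collides at 0, probes to slot 1
        t.insert(8); // collides at 0 and 1, probes to slot 2
        System.out.println(t.slot(1)); // 4
    }
}
```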
The general idea behind a "hash table" (which a "hash set" is a variety of) is that you have a number of objects containing "key" values (eg, character strings) that you want to put into some sort of container and then be able to find individual objects by their "key" values easily, without having to examine every item in the container.
One could, eg, put the values into a sorted array and then do a binary search to find a value, but maintaining a sorted array is expensive if there are lots of updates.
So the key values are "hashed". One might, for instance, add together all of the ASCII values of the characters to create a single number which is the "hash" of the character string. (There are better hash computation algorithms, but the precise algorithm doesn't matter, and this is an easy one to explain.)
When you do this you'll get a number that, for a ten-character string, will be in the range from maybe 600 to 1280. Now, if you divide that by, say, 500 and take the remainder, you'll have a value between 0 and 499. (Note that the string doesn't have to be ten characters -- longer strings will add to larger values, but when you divide and take the remainder you still end up with a number between 0 and 499.)
Now create an array of 500 entries, and each time you get a new object, calculate its hash as described above and use that value to index into the array. Place the new object into the array entry that corresponds to that index.
But (especially with the naive hash algorithm above) you could have two different strings with the same hash. Eg, "ABC" and "CBA" would have the same hash, and would end up going into the same slot in the array.
To handle this "collision" there are several strategies, but the most common is to create a linked list off the array entry and put the various "hash synonyms" into that list.
You'd generally try to have the array large enough (and have a better hash calculation algorithm) to minimize such collisions, but, using the hash scheme, there's no way to absolutely prevent collisions.
Note that the multiple entries in a synonym list are not identical -- they have different key values -- but they have the same hash value.
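The naive character-sum hash and the "ABC"/"CBA" synonym pair look like this in Java:

```java
public class SumHash {
    // Naive hash: just add up the character codes
    static int naiveHash(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(naiveHash("ABC")); // 65 + 66 + 67 = 198
        System.out.println(naiveHash("CBA")); // 198: same hash, different key
    }
}
```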
This might sound like a very vague question up front, but it is not. I have gone through the Hash Function description on the wiki, but it is not very helpful for understanding.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow ?
What is the difference between HashMap, HashTable and HashList ?
What do we mean by 'constant time complexity', and why do different implementations of the hash give constant-time operations?
Lastly, why are Hash and LinkedList asked about in most interview questions? Is there any specific logic to it for testing an interviewee's knowledge?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
Here is a good explanation of hashing. Say you want to store the string "Rachel". You apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel") --> 10. The function may return 10 for the input "Rachel", so, assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call myHashFunction("Rachel") and it returns 10 again. Note that in this example the key is "Rachel" and the value is also "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in that case you have a collision, and if you are implementing your own hash table you have to take care of this, maybe using a linked list or other techniques.
Here are some common hash functions. A good hash function satisfies the following: each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One such method is the division method: we map a key k into one of n slots by taking the remainder of k divided by n, h(k) = k mod n. For example, if your table size is n = 100 and your key is the integer k = 15, then h(k) = 15.
Hashtable is synchronised and HashMap is not.
HashMap allows one null key and null values; Hashtable allows neither.
The purpose of a hash table is to have O(1), constant-time, complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key, and the hash function tells you where the desired element is. If the hash function is well implemented, this takes constant time, O(1): you don't have to traverse all the elements stored in the hash table, you get the element "instantly".
Of course, a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and HashTable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
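For instance, String.hashCode() is documented to compute s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], one multiply-add per character, so it is linear in the string's length:

```java
public class StringHashDemo {
    // Reimplementation of the documented String.hashCode() algorithm
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // one step per character: O(n)
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(stringHash("hello") == "hello".hashCode()); // true
    }
}
```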
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by the JVM, based on something like a memory address.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm, and modern Java IDEs can generate good hashCode methods.
Hashtable and HashMap are essentially the same thing (Hashtable is the older, synchronized version). They store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values, only keys.
Constant time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant: 1, or close to it.
This is basic computer science material, and everyone is expected to be familiar with it. I believe Google has said that the hashtable is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such list would have O(n) complexity, meaning that you have to parse the whole list (or half of it, on average) to perform such an operation.
Hashing is a very simple and effective way of speeding it up: consider splitting the whole list into a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, with a list of names, we could use the first letter to choose which small list to look in. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~26 smaller lists, so that each operation takes O(n)/26 time.
However, the results are not perfect. First, there are only 26 partitions, and we can't change that. Second, some letters are used more often than others, so the set for Y or Z will be much smaller than the set for A. For better results, we need a way to partition the items into sets of roughly the same size. How can we solve that? This is where hash functions come in: a hash function can create an arbitrary, configurable number of partitions with roughly the same number of items in each. In our example with names, we could use something like
#include <string.h>

unsigned int hash(const char* str) {
    unsigned int rez = 0;
    size_t len = strlen(str);  /* hoisted out of the loop condition */
    for (size_t i = 0; i < len; i++)
        rez = rez * 37 + (unsigned char)str[i];
    /* unsigned arithmetic, so the result of % is never negative */
    return rez % NUMBER_OF_PARTITIONS;
}
This would assure a quite even distribution and configurable number of sets (also called buckets).
What do we mean by Hashing, how does it work internally?
Hashing is the transformation of a string into a shorter, fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table: it contains an array of items, computes an index from each data item's key, and uses this index to place the data into the array.
What algorithm does it follow ?
In simple words, most hash algorithms follow the logic index = f(key, arrayLength).
Lastly, why in most interview questions Hash and LinkedList are asked, is there any specific logic for it from testing interviewee's knowledge?
It's about how good you are at logical reasoning. These are among the most important data structures, and every programmer should know them.