As the question states: how do I calculate the optimal number to use, and how do I justify the choice?
Suppose we are going to build a hash table which uses the following hash function:
h(k) = k mod m, where k is the key
Some sources tell me:
to use the number of elements to be inserted as the value of m
to use a prime close to m
that Java simply uses 31 as its value of m
And some people tell me to use the prime closest to 2^n as m.
I'm so confused at this point that I don't know what value to use for m. For instance, if we use the table size for m, what happens when we want to expand the table? Will I then have to rehash all the values with the new value of m? And if so, why does Java simply use 31 as its prime value for m?
I've also heard that the table size should be twice as big as the total number of elements in the hash table, and that this should hold each time it rehashes. But then why would we, for instance, use m = 10 for a table of 10 elements, when it should be m = 20 to create that extra empty space?
Can someone please help me understand how to calculate the value of m to use in different scenarios, such as a static hash table (where we know we will only insert, say, 10 elements) or a dynamic one (which rehashes after a certain limit)?
Let me illustrate my problem with the following examples:
I have the values {1, 2, ..., n}.
Question: What would be an optimal value of m if I must use division (mod) in my hash function?
Scenario 1: n = 100?
Scenario 2: n = 5043?
Additional question:
Would the value of m in the hash function be different if we used an open or a closed hash table?
Note that I'm not trying to understand hash tables in Java specifically, but hash tables in general, where I must use a division (mod) hash function.
Thank you for your time!
You have several issues here:
1) What should m equal?
2) How much free space should you have in your hash table?
3) Should you make the size of your table be a prime number?
1) As was mentioned in the comments, the h(k) you describe isn't the hash function, it gives you the index into your hash table. The idea is that every object produces some hash code, which is a positive integer. You use the hash code to figure out where to put the object in the hash table (so that you can find it again later). You clearly don't want a hash table of size MAX_INT, so you choose some size m. Then for any object, you take its hash code, compute k % m, and now you have an integer in the interval [0, m-1], which is a valid index into your hash table.
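As a minimal sketch (my own method name, not any particular library's API), that index computation might look like this:

// Minimal sketch: map an object's hash code to a table index in [0, m-1].
// The masking step is there because Java hash codes may be negative.
static int indexFor(Object key, int m) {
    int h = key.hashCode();
    return (h & 0x7FFFFFFF) % m; // strip the sign bit, then reduce modulo the table size
}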
2) Because a hash table works by using a hash code to find the place in a table where an object should go, you get into trouble if multiple items are assigned to the same location. This is called a collision. Every hash table implementation must deal with collisions, either by putting items into nearby spots or by keeping a linked list of items in each location. No matter the solution, more collisions mean lower performance for your hash table. For that reason, it is recommended that you not let your hash table fill up; otherwise, collisions become more likely. Keeping your hash table at least twice as large as the number of items is a common recommendation to reduce the probability of collisions. Obviously, this means you will have to resize your table as it fills up. Yes, this means that you have to rehash each item, since it will go into a different location when you take a modulus by a different value. That is the hidden cost of a hash table: it runs in constant time (assuming few or no collisions), but it can have a large coefficient (amortized resizing, rehashing, etc.).
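A rough sketch of that hidden cost (a hypothetical chained table of my own, not any particular library's implementation): when the table grows, every stored key has to be re-indexed with the new modulus.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a resize: every key moves, because k % newSize
// usually differs from k % oldSize.
class ResizeSketch {
    static List<List<Integer>> resize(List<List<Integer>> oldBuckets, int newSize) {
        List<List<Integer>> newBuckets = new ArrayList<>();
        for (int i = 0; i < newSize; i++) newBuckets.add(new ArrayList<>());
        for (List<Integer> bucket : oldBuckets) {
            for (int key : bucket) {
                newBuckets.get((key & 0x7FFFFFFF) % newSize).add(key); // rehash with the new modulus
            }
        }
        return newBuckets;
    }
}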
3) It is also often recommended that you make the size of your hash table be a prime number. This is because it tends to produce a better distribution of items in your hash table in certain common use cases, thus avoiding collisions. Rather than giving a complete explanation here, I will refer you to this excellent answer: Why should hash functions use a prime number modulus?
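If you want to see the effect for yourself, here is a small experiment of my own (a toy demo, not taken from the linked answer): hash keys that share a common factor and count how the buckets fill up for a non-prime versus a prime m.

import java.util.Arrays;

// Toy experiment: keys that are multiples of 4, distributed into m buckets.
// With m = 8 they pile up in just two buckets; with m = 7 they spread out evenly.
public class PrimeModulusDemo {
    public static void main(String[] args) {
        for (int m : new int[] {8, 7}) {
            int[] buckets = new int[m];
            for (int k = 0; k < 100; k += 4) { // keys 0, 4, 8, ... all share the factor 4
                buckets[k % m]++;
            }
            System.out.println("m = " + m + ": " + Arrays.toString(buckets));
        }
    }
}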
This may be a strange question, but it is based on some results I get using Java's Map: is element retrieval faster in a HashMap when the map is smaller?
I have some code that uses the containsKey and get(key) methods of a HashMap, and it seems that it runs faster if the number of elements in the Map is smaller. Is that so?
My understanding is that a HashMap uses some hash function to access a certain field of the map, and that there are versions in which that field is a reference to a linked list (because some keys can map to the same slot), or to other fields in the map, when implemented fully statically.
Is this correct: can retrieval be faster if the Map has fewer elements?
I need to extend my question, with a concrete example.
I have 2 cases; in both, the total number of elements is the same.
In the first case, I have 10 HashMaps, and I'm not aware how the elements are distributed among them. The execution time of that part of the algorithm is 141 ms.
In the second case, I have 25 HashMaps with the same total number of elements. The execution time of the same algorithm is 69 ms.
In both cases, I have a for loop that goes through each of the HashMaps, tries to find the same elements, and gets them if present.
Can it be that the execution time is smaller because each individual search inside a HashMap is faster, and so is their sum?
I know that this is very strange, but is something like this somehow possible, or am I doing something wrong?
A Map<Integer, Double> is being considered. It is hard to tell what the distribution of elements is, since this is actually an implementation of the KMeans clustering algorithm and the elements are representations of cluster centroids. That means they will mostly depend on the initialization of the algorithm. Also, the total number of elements will not always be the same, but I have tried to simplify the problem; sorry if that was misleading.
The number of collisions is the decisive factor in a slowdown.
Assume an array of some size; the hash code modulo that size then gives the index where the object is put. Two objects with the same index collide.
Having a large capacity (array size) with respect to the number of elements helps.
With HashMap there are overloaded constructors with extra settings.
public HashMap(int initialCapacity, float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
You might experiment with that.
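For example (the numbers here are only an assumption for illustration), if each map is expected to hold about 1000 entries, pre-sizing it keeps it below the resize threshold:

// Assumed numbers for illustration: ~1000 expected entries, default load factor 0.75.
// An initial capacity of 2048 keeps the map below its resize threshold (2048 * 0.75 = 1536).
Map<Integer, Double> map = new HashMap<>(2048, 0.75f);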
For a specific key class used with a HashMap, having a good hashCode can help too. Hash codes are a separate mathematical field.
Of course using less memory helps on the processor / physical memory level, but I doubt an influence in this case.
Does your timing take into account only the cost of get / containsKey, or are you also performing puts in the timed code section? If so, and if you're using the default constructor (initial capacity 16, load factor 0.75), then the larger hash tables are going to need to resize themselves more often than the smaller hash tables will. Like Joop Eggen says in his answer, try playing around with the initial capacity in the constructor, e.g. if you know that you have N elements then set the initial capacity to N / number_of_hash_tables or something along those lines. This ought to result in the smaller and larger hash tables having sufficient capacity that they won't need to be resized.
I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.
The problem I ran into, however, is that the system I am working in borks if the id is negative and String.hashCode often returns negative values. One quick solution is to just use Math.abs() around the hashcode call to guarantee a positive result. What I was wondering about this approach is what are the chances of two distinct elements having the same hashcode?
For example, if one string returns a hashcode of -10 and another string returns a hashcode of 10 an error would occur. In my system we're talking about collections of objects that aren't more than 30 elements large typically so I don't think this would really be an issue, but I am curious as to what the math says.
Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.
Also, your idea to use Math.abs() alone is flawed: it does not always return a positive number! In two's complement arithmetic, the absolute value of Integer.MIN_VALUE is Integer.MIN_VALUE itself! Famously, the hash code of the string "polygenelubricants" is exactly this value.
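If you still need a non-negative value derived from a hash code, a common trick (shown here as a sketch; it does nothing about uniqueness) is to mask off the sign bit instead of calling Math.abs():

int h = "polygenelubricants".hashCode(); // == Integer.MIN_VALUE
System.out.println(Math.abs(h));         // still prints -2147483648
int nonNegative = h & 0x7FFFFFFF;        // always in [0, Integer.MAX_VALUE]; 0 in this case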
Hashes are not unique, hence they are not appropriate as a unique id.
As to the probability of a hash collision, you could read about the birthday paradox. Roughly (from what I recall), when drawing from a uniform distribution of N values, you should expect a collision after about $\sqrt{N}$ draws (you could get a collision much earlier). The problem is that Java's implementation of hashCode (especially when hashing short strings) doesn't provide a uniform distribution, so you'll get a collision much earlier.
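For reference, the standard birthday-problem approximation: when drawing $n$ values uniformly from $N$ possibilities, the probability of at least one collision is roughly $1 - e^{-n^2/(2N)}$, which crosses 50% at about $n \approx 1.18\sqrt{N}$. For $N = 2^{31}$ (non-negative int hash codes) that is roughly 54,000 draws, matching the figure quoted in the other answer.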
You can already get two strings with the same hash code. This should be obvious if you consider that there is an infinite number of strings and only 2^32 possible hash codes.
Taking the absolute value just makes a collision a little more probable. The risk is small, but if you need a unique id, this isn't the right approach.
What you can do when you only have 30-50 values, as you said, is to register each String you get in a HashMap together with a running counter as the value:
HashMap<String, Integer> stringMap = new HashMap<String, Integer>();
stringMap.put("Test", 1);
stringMap.put("AnotherTest", 2);
You can then get your unique ID by calling this:
stringMap.get("Test"); // returns 1
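A small sketch of the same idea wrapped in a helper (my own class and method names, assuming the collection stays small, as in the question):

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: hands out a unique, always-positive id per distinct string.
class StringIds {
    private final Map<String, Integer> ids = new HashMap<>();
    private int nextId = 1;

    int idFor(String s) {
        Integer existing = ids.get(s);
        if (existing != null) return existing; // already registered, reuse its id
        ids.put(s, nextId);
        return nextId++;                       // first time seen: assign the next counter value
    }
}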
If a hash set contains only one instance of any distinct element, how might a collision occur in this case?
And how could the load factor be an issue, since there is only one of any given element?
While this is homework, it is not for me. I am tutoring someone, and I need to know how to explain it to them.
Let's assume you have a HashSet of Integers, and your hash function is mod 4. The integers 0, 4, 8, 12, 16, etc. will all collide if you try to insert them. (mod 4 is a terrible hash function, but it illustrates the concept.)
Assuming a proper function, the load factor is correlated to the chance of having a collision; please note that I say correlated and not equal because it depends on the strategy you use to handle collisions. In general, a high load factor increases the possibility of collisions. Assuming that you have 4 slots and you use mod 4 as the hash function, when the load factor is 0 (empty table), you won't have a collision. When you have one element, the probability of a collision is .25, which obviously degrades the performance, since you have to solve the collision.
Now, assuming that you use linear probing (i.e. on a collision, use the next available entry), once you reach 3 entries in the table you have a .75 probability of a collision on the next insert. If you do collide, in the best case you will go to the next entry, but in the worst case you will have to go through all 3 entries. So a collision means that, instead of a direct access, you need a linear search over about 2 items on average.
Of course, there are better strategies to handle collisions, and generally, in non-pathological cases, a load factor of .7 is acceptable; beyond that, collisions shoot up and performance degrades.
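As a rough sketch of the linear-probing case described above (a toy open-addressed table for positive int keys, not a production implementation):

// Toy open-addressing insert with linear probing; a slot containing 0 means "empty",
// so this sketch only handles positive keys. On a collision we step to the next slot.
class ProbingSketch {
    final int[] table = new int[4]; // 4 slots, hash function is key % 4

    boolean insert(int key) {
        int index = key % 4;
        for (int probe = 0; probe < 4; probe++) {
            int slot = (index + probe) % 4; // wrap around the end of the table
            if (table[slot] == 0) {         // empty slot found
                table[slot] = key;
                return true;
            }
        }
        return false;                       // table is full
    }
}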
The general idea behind a "hash table" (which a "hash set" is a variety of) is that you have a number of objects containing "key" values (eg, character strings) that you want to put into some sort of container and then be able to find individual objects by their "key" values easily, without having to examine every item in the container.
One could, eg, put the values into a sorted array and then do a binary search to find a value, but maintaining a sorted array is expensive if there are lots of updates.
So the key values are "hashed". One might, for instance, add together all of the ASCII values of the characters to create a single number which is the "hash" of the character string. (There are better hash computation algorithms, but the precise algorithm doesn't matter, and this is an easy one to explain.)
When you do this you'll get a number that, for a ten-character string, will be in the range from maybe 600 to 1280. Now, if you divide that by, say, 500 and take the remainder, you'll have a value between 0 and 499. (Note that the string doesn't have to be ten characters -- longer strings will add to larger values, but when you divide and take the remainder you still end up with a number between 0 and 499.)
Now create an array of 500 entries, and each time you get a new object, calculate its hash as described above and use that value to index into the array. Place the new object into the array entry that corresponds to that index.
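In code, the naive scheme described above might look something like this (a sketch of my own; the constant 500 is just the example table size from the text):

// Naive hash from the explanation above: add up the character values,
// then take the remainder modulo the table size (500 in this example).
static int naiveIndex(String key) {
    int sum = 0;
    for (char c : key.toCharArray()) {
        sum += c;       // add each character's numeric value
    }
    return sum % 500;   // index into an array of 500 entries
}

Note that naiveIndex("ABC") and naiveIndex("CBA") return the same value, which is exactly the collision described next.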
But (especially with the naive hash algorithm above) you could have two different strings with the same hash. Eg, "ABC" and "CBA" would have the same hash, and would end up going into the same slot in the array.
To handle this "collision" there are several strategies, but the most common is to create a linked list off the array entry and put the various "hash synonyms" into that list.
You'd generally try to have the array large enough (and have a better hash calculation algorithm) to minimize such collisions, but, using the hash scheme, there's no way to absolutely prevent collisions.
Note that the multiple entries in a synonym list are not identical -- they have different key values -- but they have the same hash value.
This might sound like a very vague question up front, but it is not. I have gone through the hash function description on the wiki, but it is not very helpful for understanding.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow ?
What is the difference between HashMap, HashTable and HashList ?
What do we mean by 'Constant Time Complexity', and why do different implementations of a hash give constant-time operations?
Lastly, why are hashes and linked lists asked about in most interview questions? Is there any specific logic to it in terms of testing the interviewee's knowledge?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
Here is a good explanation of hashing. Say you want to store the string "Rachel": you apply a hash function to that string to get a memory location, e.g. myHashFunction(key: "Rachel", value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call GetmyHashFunction("Rachel") and it will return 10. Note that in this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in that case you have a collision. If you are implementing your own hash table, you have to take care of this, perhaps using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies the property that each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One such method is called the division method: we map a key k into one of n slots by taking the remainder of k divided by n, i.e. h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 15, then h(k) = 15.
Hashtable is synchronized and HashMap is not.
HashMap allows a null key but Hashtable does not.
The purpose of a hash table is to have constant time complexity, O(1), for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you reach it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key and the hash function will return the desired element. If the hash function is well implemented it will run in constant time, O(1). This means you don't have to traverse all the elements stored in the hash table; you get the element "instantly".
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and Hashtable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
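For instance, a string hash in the style of Java's String.hashCode() walks every character, so its cost grows with the length of the string (this is a sketch of the idea, not the exact library source):

// Sketch of a polynomial string hash (the 31 multiplier mirrors what
// String.hashCode() documents); note the loop over every character.
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i); // one step per character: O(length of s)
    }
    return h;
}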
These are things that people ought to know.
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally (for java.lang.Object) it is implemented by the JVM, based on some memory address.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs allow generating good hashCode methods.
Hashtable and HashMap are essentially the same thing. They store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values, only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is, 1, or close to 1.
This is basic computer-science material, and it is assumed that everyone is familiar with it. I think Google has stated that the hash table is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list would have O(n) complexity, meaning that you have to scan the whole list (or half of it, on average) to perform the operation.
Hashing is a very simple and effective way of speeding this up: consider splitting the whole list into a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, with a list of names, we could use the first letter as the property that chooses which small list to look in. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation takes roughly 1/30 of the time.
However, the results are not perfect. First, there are only about 30 partitions, and we can't change that. Second, some letters are used more often than others, so the set for Y or Z will be much smaller than the set for A. For better results, it's better to find a way to partition the items into sets of roughly the same size. How can we do that? This is where hash functions come in: a hash function is a function that can create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
#include <string.h>

#define NUMBER_OF_PARTITIONS 64   /* however many buckets (partitions) you want */

int hash(const char* str) {
    unsigned rez = 0;             /* unsigned, so overflow wraps instead of going negative */
    for (size_t i = 0, len = strlen(str); i < len; i++)
        rez = rez * 37 + (unsigned char)str[i];
    return rez % NUMBER_OF_PARTITIONS;
}
This would assure a quite even distribution and configurable number of sets (also called buckets).
What do we mean by hashing? How does it work internally?
Hashing is the transformation of a string into a shorter, fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table. It contains an array of items. A hash table computes an index from the data item's key and uses this index to place the data into the array.
What algorithm does it follow ?
In simple words, most hash algorithms work on the logic index = f(key, arrayLength).
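One typical concrete form of that f (shown as a sketch; details vary between implementations) is:

// One common shape of "index = f(key, arrayLength)":
// take the key's hash code, force it non-negative, and reduce it modulo the array length.
static int f(Object key, int arrayLength) {
    return (key.hashCode() & 0x7FFFFFFF) % arrayLength;
}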
Lastly, why are hashes and linked lists asked about in most interview questions? Is there any specific logic to it in terms of testing the interviewee's knowledge?
It's about how good you are at logical reasoning. Hashing is among the most important data structures, and every programmer should know it.
why not:
public native long hashCode();
instead of:
public native int hashCode();
for higher chance of achieving unique hash codes?
Because the maximum length of an array is Integer.MAX_VALUE.
Since the primary use of hashCode() is to determine which slot of the backing array of a HashMap/Hashtable an object goes into, a hash code greater than Integer.MAX_VALUE could not be mapped to a valid index in that array.
Anyway, the hash code value is used to determine a row number in a table, and that is a relatively small value.
In HashMap, for instance, the default table contains only 16 rows (Sun JDK 1.6.0_17). This means that the row number is determined in a way like this:
int rowNumber = obj.hashCode() % rowsCount;
So the real distribution of row numbers is from 0 to rowsCount - 1.
UPD: Remember the implementation of ConcurrentHashMap. In a nutshell, ConcurrentHashMap contains many relatively small tables. First the hashCode value is used to determine the table number, and after that the same value is used to determine a row in the selected table.
This approach removes the limitation on array size (and even allows building a distributed hash table).
So I am inclined to conclude that hashCode returns an int because that covers the vast majority of use cases.
I'd assume it's a balance of computation cost vs. hash range. Hash codes are referenced so frequently that pushing around twice as much data every time you need a hash would be expensive, especially if you consider the more common use cases -
for example - if you create a small hash with 10, or 100, or 1000 values, the difference in the number of hash collisions you're going to see will be extremely negligible. For larger hashes, ... well, think of how big a hash will need to be for 10**32 values to start having frequent collisions, and whether that's even possible to do in a JVM given the amount of memory you'd need.