Why does HashMap use length minus 1 to calculate the index? - java

I am analyzing the source code of HashMap in JDK 7, and I found that when we invoke the put() method to add an element, it uses indexFor() to calculate the index at which the element is stored in the array. The method is listed below.
Now I am wondering why it uses h & (length-1) to get the index. Is it meant to produce a more random array index? Can we use length or length-2 (if it exists) instead?
Can anyone help me understand this? Thanks in advance!
static int indexFor(int h, int length) {
    // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
    return h & (length-1);
}

Is it meant to produce a more random array index?
No.
Can we use length or length-2 (if it exists) instead?
No.
This is just a mathematical trick. To map your hash to the range [0, L) you calculate hash % L. Division can be expensive, so the developers chose to take advantage of how numbers are stored in binary. This works only when L is a power of two: such a number has exactly one bit set, and subtracting one clears that bit and sets all less significant bits. Bitwise AND-ing this mask with the hash gives the same result as the modulo of a non-negative hash, because every bit more significant than the mask can hold is simply dropped. This is also why length and length-2 cannot be used: length itself has a single bit set, so the result could only be 0 or length, and length-2 has its lowest bit cleared, so odd indices would never be produced.
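The equivalence can be checked with a minimal sketch. The class name IndexForDemo and the sample hash values are assumptions made for illustration; indexFor has the same body as in the question, and Math.floorMod serves as the reference because Java's % operator can return a negative value for a negative h.

```java
public class IndexForDemo {
    // Same body as indexFor in the question; length must be a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // binary 10000, so length - 1 is binary 01111
        int[] hashes = {0, 1, 15, 16, 17, 12345, -1, -12345, Integer.MIN_VALUE};
        for (int h : hashes) {
            // For a power-of-two length the mask equals the non-negative remainder.
            System.out.println(h + " -> " + indexFor(h, length)
                    + " (floorMod: " + Math.floorMod(h, length) + ")");
        }
        // With a length that is not a power of two the mask skips buckets:
        // 10 - 1 is binary 1001, so only indices 0, 1, 8 and 9 can be produced.
        System.out.println(indexFor(6, 10)); // 6 & 9 = 0
    }
}
```

Note that the mask also handles negative hash values, where a plain h % length would yield a negative index.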

Related

Combination Algorithm from multiple sets

I am trying to write an algorithm that tells me how many pairs I could generate with items coming from multiple sets of values. For example, I have the following sets:
{1,2,3} {4,5} {6}
From these sets I can generate 11 pairs:
{1,4}, {1,5}, {1,6}, {2,4}, {2,5}, {2,6}, {3,4}, {3,5}, {3,6}, {4,6}, {5,6}
I wrote the following algorithm:
int result=0;
for(int k=0;k<numberOfSets;k++){ //map is a list where I store all my sets
    int size1 = map.get(k);
    for(int l=k+1;l<numberOfSets;l++){
        int size2 = map.get(l);
        result += size1*size2;
    }
}
But as you can see, the algorithm is not very scalable. If the number of sets increases, the algorithm starts performing very poorly.
Am I missing something? Is there an algorithm that can help me with this? I have been looking at combination and permutation algorithms, but I am not sure that is the right path for this.
Thank you very much in advance.
First of all, if the order within a pair matters, then starting the inner loop at int l=k+1 is wrong: you miss {4,1}, for example. If you consider {4,1} equal to {1,4}, the result is correct; otherwise it isn't.
Second, to complicate the matter further, you don't say whether the pairs need to be unique. E.g. {1,2}, {2,3}, {4} will generate {2,4} twice. If you need to count it only once, the result of your code is incorrect (you would need to scan the sets, actually generate the pairs, and keep a Set<Pair<Integer,Integer>> to remove the duplicates).
The good news: although the double loop is O(N^2) in the number of sets, even with thousands of sets the resulting millions of integer multiplications and additions are fast enough on today's computers. For comparison, Eigen deals quite well with the O(N^3) floating-point operations of matrix multiplication.
Assuming you only care about the number of pairs, and are counting duplicates, there is a more efficient algorithm:
We keep track of the current number of pairs and the number of elements encountered so far.
Go over the list from the end to the start.
For each new set, the number of new pairs we can make is the size of the set * the number of elements encountered so far. Add this to the current number of pairs.
Add the size of the new set to the number of elements encountered so far.
The code:
int numberOfPairs=0;
int elementsEncountered=0;
for(int k = numberOfSets - 1 ; k >= 0 ; k--) {
    int sizeOfCurrentSet = map.get(k);
    int numberOfNewPairs = sizeOfCurrentSet * elementsEncountered;
    numberOfPairs += numberOfNewPairs;
    elementsEncountered += sizeOfCurrentSet;
}
The key point to realize is that when we count the number of new pairs that each set contributes, it doesn't matter from which set we select the second element of the pair. That is, we don't need to keep track of the individual sets we have already processed, only of their total size.
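As a sanity check, both approaches can be run side by side on the sizes from the question. This is a sketch only: the class and method names are hypothetical, the list holds set sizes rather than the sets themselves, and long is used for the totals to reduce the risk of overflow. The linear version here walks the list front to back, which gives the same total as walking it from the end.

```java
import java.util.Arrays;
import java.util.List;

public class PairCountDemo {
    // Quadratic version from the question: sum of size_k * size_l over all k < l.
    static long countPairsQuadratic(List<Integer> sizes) {
        long result = 0;
        for (int k = 0; k < sizes.size(); k++) {
            for (int l = k + 1; l < sizes.size(); l++) {
                result += (long) sizes.get(k) * sizes.get(l);
            }
        }
        return result;
    }

    // Linear version: each set pairs with every element encountered so far.
    static long countPairsLinear(List<Integer> sizes) {
        long pairs = 0;
        long elementsEncountered = 0;
        for (int size : sizes) {
            pairs += size * elementsEncountered;
            elementsEncountered += size;
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Integer> sizes = Arrays.asList(3, 2, 1); // {1,2,3} {4,5} {6}
        System.out.println(countPairsQuadratic(sizes)); // 11
        System.out.println(countPairsLinear(sizes));    // 11
    }
}
```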

Create an array with given median, mean, mode, and range

I already know how to calculate the median, mean and mode of an array in Java. But is there a way to do the reverse, i.e. create an array from the median, mode and mean, given the range and the number of elements in the array?
The numbers in the array can be generated randomly as long as the array satisfies the conditions above. The search can stop as soon as it finds the first array that satisfies them. Also, the range is small, for example 0 to 10 or 0 to 100.
Yes / No
When you reduce an array to its mean, median and mode (mmm), you are applying a lossy algorithm: you are losing the original data.
There can be a very large number of different arrays that result in the same mmm.
Some naive ideas
Median: for example, if your array has 2n+1 elements, put in the median itself, n elements that are less than the median, and n elements that are greater than the median;
Mean: if your array has m elements, then Sum = m * Mean; generate the m elements by partitioning Sum;
Mode: count the frequency of the numbers you put into the array and make sure the mode has the highest frequency;
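The random approach from the question can be sketched as a checker plus a bounded search. Everything here is an assumption made for illustration: the class and method names are hypothetical, "range" is read as the bounds [0, maxValue] on each value, the mode must be the unique most frequent value, and the search gives up after maxAttempts tries instead of looping forever.

```java
import java.util.Arrays;
import java.util.Random;

public class StatsSearchDemo {
    // True if 'a' (values assumed to lie in [0, maxValue]) has the given mean and
    // median, and 'mode' is strictly more frequent than every other value.
    static boolean matches(int[] a, double mean, double median, int mode, int maxValue) {
        if (a.length == 0 || mode < 0 || mode > maxValue) {
            return false;
        }
        int[] sorted = a.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        long sum = 0;
        int[] freq = new int[maxValue + 1];
        for (int v : sorted) {
            sum += v;
            freq[v]++;
        }
        for (int v = 0; v <= maxValue; v++) {
            if (v != mode && freq[v] >= freq[mode]) {
                return false;
            }
        }
        double med = (n % 2 == 1) ? sorted[n / 2]
                                  : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        return Math.abs(sum - mean * n) < 1e-9 && med == median;
    }

    // Tries random arrays of length n; returns the first match, or null if none is found.
    static int[] search(int n, double mean, double median, int mode,
                        int maxValue, int maxAttempts, long seed) {
        Random random = new Random(seed);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int[] candidate = new int[n];
            for (int i = 0; i < n; i++) {
                candidate[i] = random.nextInt(maxValue + 1);
            }
            if (matches(candidate, mean, median, mode, maxValue)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Look for 5 values in [0, 10] with mean 3, median 2 and mode 2.
        int[] found = search(5, 3.0, 2.0, 2, 10, 1_000_000, 42L);
        System.out.println(found == null ? "no array found" : Arrays.toString(found));
    }
}
```

Blind random search gets slow as the array grows, so for larger inputs a constructive approach along the lines of the ideas above is likely the better choice.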

What to use if an array is too small in Java?

In Java, an array can have at most Integer.MAX_VALUE items because array indices are of type int.
What is the best object to use when I want to use a long as an index?
For example, if I want to calculate all prime numbers below 5 billion using a prime sieve, I cannot use an array, as 5000000000 is too large to fit in an int.
A sieve of 5000000000 elements does not need an array of 5,000,000,000 values; it needs 5,000,000,000 bits. Unfortunately, BitSet uses an int index as well, but you can implement your own bit set by allocating 5000000000 / 32 integers, and then using bit operations to access the corresponding bit:
Use long as the actual position pos
The location of the int inside the int array is (int)(pos / 32)
The location of the bit inside the int is (int)(pos % 32)
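The steps above can be sketched as a small class. The name LongBitSet and its methods are hypothetical, not part of the JDK. For 5,000,000,000 bits the backing array needs roughly 600 MB, so the JVM heap must be sized accordingly.

```java
public class LongBitSet {
    private final int[] words;

    // Allocates enough 32-bit words to hold 'size' bits, all initially clear.
    public LongBitSet(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must not be negative");
        }
        long wordCount = (size + 31) / 32; // round up so the last bits fit
        if (wordCount > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("size too large for a single int[]");
        }
        words = new int[(int) wordCount];
    }

    // Index of the int that holds bit 'pos'.
    static int wordIndex(long pos) {
        return (int) (pos / 32);
    }

    // Index of bit 'pos' inside its int.
    static int bitIndex(long pos) {
        return (int) (pos % 32);
    }

    public void set(long pos) {
        words[wordIndex(pos)] |= 1 << bitIndex(pos);
    }

    public void clear(long pos) {
        words[wordIndex(pos)] &= ~(1 << bitIndex(pos));
    }

    public boolean get(long pos) {
        return (words[wordIndex(pos)] & (1 << bitIndex(pos))) != 0;
    }
}
```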
Another approach would be to switch to a segmented sieve, which reduces the memory requirements to O(√N). A good explanation of how that works is given here.

Java HashMap implementation hashcode issue

Looking through the implementation of the Java HashMap here : http://www.docjar.com/html/api/java/util/HashMap.java.html I noticed the following :
The internal data structure used is an array which at each index stores the reference to the first entry in a linked list. The array index is based on the key's hashcode and the linked list represents the bucket for that particular hashcode.
What I found interesting is the method indexFor(int h, int length) which, for a given key, determines what bucket in the array to look in. But the implementation, return h & (length - 1) looks odd in the sense that for an indeterminate number of hashcodes which do not coincide with a given array index the method will return 0. So, no matter what unique hashcode you implement for your object, the 0 bucket in the array will most likely be full of objects and thus you don't benefit from what a unique hashcode is supposed to offer you, that is faster data access.
Am I missing something?
Cristian
You are missing the following Javadoc from the HashMap source code:
/**
* The table, resized as necessary. Length MUST Always be a power of two.
*/
transient Entry<K,V>[] table;
This means that table.length-1 will always be a sequence of 1's in binary.
I can't quite understand what you believe the problem to be.
The h & (length - 1) is a simple way to calculate h % n where n is a power of two. There isn't, in my understanding, any reason why h % n should give an unnaturally large number of zeros.
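A quick way to see that bucket 0 is not favored is to count where a run of hash codes lands. This is only a sketch: the class name and the choice of consecutive hash values 0..63 are assumptions made for illustration, and it applies the mask directly, without any extra hash mixing HashMap may perform beforehand.

```java
import java.util.Arrays;

public class BucketSpreadDemo {
    // Counts how many of the hashes 0..hashCount-1 land in each bucket
    // when the index is computed as h & (length - 1).
    static int[] bucketCounts(int hashCount, int length) {
        int[] counts = new int[length];
        for (int h = 0; h < hashCount; h++) {
            counts[h & (length - 1)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 64 consecutive hashes into 16 buckets: every bucket receives 4 of them.
        System.out.println(Arrays.toString(bucketCounts(64, 16)));
    }
}
```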

Is it possible to get k-th element of m-character-length combination in O(1)?

Do you know any way to get the k-th element of the m-element combinations in O(1)? The expected solution should work for any size of input data and any value of m.
Let me explain this problem by example (python code):
>>> import itertools
>>> data = ['a', 'b', 'c', 'd']
>>> k = 2
>>> m = 3
>>> result = [''.join(el) for el in itertools.combinations(data, m)]
>>> print result
['abc', 'abd', 'acd', 'bcd']
>>> print result[k-1]
abd
For the given data, the k-th (2nd in this example) element of the m-element combinations is abd. Is it possible to get that value (abd) without creating the whole list of combinations?
I am asking because I have data of ~1,000,000 characters, and it is impossible to create the full list of m-character combinations just to get the k-th element.
The solution can be pseudocode or a link to a page describing this problem (unfortunately, I didn't find one).
Thanks!
http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Basically, express the index in the factorial number system, and use its digits as a selection from the original sequence (without replacement).
Not necessarily O(1), but the following should be very fast:
Take the original combinations algorithm:
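def combinations(elems, m):
    # The k-th element depends on what order you use for
    # the combinations. Assuming it looks something like this...
    if m == 0:
        return [[]]
    combs = []
    for i, e in enumerate(elems):
        # Only elements after e may follow it, so each combination appears once.
        for rest in combinations(elems[i+1:], m-1):
            combs.append([e] + rest)
    return combs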
For n initial elements and combination length m, we have n!/((n-m)! m!) total combinations. We can use this fact to skip directly to our desired combination:
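from math import comb as ncombs  # binomial coefficient, Python 3.8+

def kth_comb(elems, m, k):
    # k is 0-based; combinations are ordered by the positions of their elements.
    if m == 0:
        return []
    for i, x in enumerate(elems):
        # Number of combinations whose first element is x.
        combs_with_x = ncombs(len(elems) - i - 1, m - 1)
        if k < combs_with_x:
            return [x] + kth_comb(elems[i+1:], m - 1, k)
        # Skip all combinations that start with x.
        k -= combs_with_x
    raise IndexError("k is out of range")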
First calculate r = n!/(m!*(n-m)!), where n is the number of elements.
Then floor(r/k) is the index of the first element in the result;
remove it (shift everything following to the left),
do m--, n-- and k = r%k,
and repeat until m is 0 (hint: when k is 0, just copy the following chars to the result).
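One way to reach the k-th combination without building the full list is to walk the elements once and use binomial coefficients to decide whether to take or skip each one. The sketch below is a standard unranking scheme for combinations ordered by element position, not necessarily the exact procedure described in any of the answers. The class and method names are hypothetical, k is 0-based here (the question's example uses 1-based k), and long arithmetic is assumed to be sufficient; for inputs on the scale mentioned in the question, BigInteger would be needed instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KthCombinationDemo {
    // Binomial coefficient C(n, r); intermediate products may overflow long for large inputs.
    static long binomial(int n, int r) {
        if (r < 0 || r > n) {
            return 0;
        }
        r = Math.min(r, n - r);
        long result = 1;
        for (int i = 0; i < r; i++) {
            // After this step result equals C(n, i + 1), so the division is exact.
            result = result * (n - i) / (i + 1);
        }
        return result;
    }

    // Returns the k-th (0-based) m-element combination of elems,
    // with combinations ordered by the positions of their elements.
    static <T> List<T> kthCombination(List<T> elems, int m, long k) {
        int n = elems.size();
        if (k < 0 || k >= binomial(n, m)) {
            throw new IndexOutOfBoundsException("k is out of range");
        }
        List<T> result = new ArrayList<>();
        int position = 0;
        while (m > 0) {
            // Number of remaining combinations that start with elems[position].
            long count = binomial(n - position - 1, m - 1);
            if (k < count) {
                result.add(elems.get(position)); // take this element
                m--;
            } else {
                k -= count; // skip every combination that starts with it
            }
            position++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d");
        // The question's 2nd combination (1-based) is index 1 here.
        System.out.println(String.join("", kthCombination(data, 3, 1))); // abd
    }
}
```

The loop visits each element at most once, so the cost is dominated by the binomial computations rather than by the number of combinations.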
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem appears to fall under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it too is faster than other published techniques.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to Java, Python, or C++.
