HashMap different bucket index for a key possible?

My concern was to check how a Java HashMap keeps giving the same index for a key, even as its size expands from the default 16 to much higher values while we keep adding entries.
I tried to reproduce the indexing algorithm of HashMap.
int length = 1 << 5;
int v = 6546546;
int h = Integer.hashCode(v);
// supplemental hash, as in the Java 6 HashMap
h ^= (h >>> 20) ^ (h >>> 12);
h = h ^ (h >>> 7) ^ (h >>> 4);
System.out.println("index :: " + (h & (length - 1)));
I ran my code for different values of "length".
For the same key I am getting a different index as the length of the HashMap changes. What am I missing here?
My Results:
length=1<<5;
index :: 10
length=1<<15;
index :: 7082
length=1<<30;
index :: 6626218

You're missing the fact that every time the length changes, the entries are redistributed - they're put in the new buckets appropriately. That's why it takes a bit of time (O(N)) when the map expands - everything needs to be copied from the old buckets to the new ones.
So long as you only ever have indexes for one length at a time (not a mixture), you're fine.
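To see the redistribution concretely, here is a small sketch (the class is mine; the spreading function is the Java 6 version reproduced in the question) that prints the bucket index of the same key at two capacities:

public class BucketIndexDemo {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1); // valid only while length is a power of two
    }

    public static void main(String[] args) {
        int h = hash(Integer.hashCode(6546546));
        // The index depends on the current capacity, which is exactly
        // why a resize has to move every entry to its new bucket.
        System.out.println(indexFor(h, 1 << 5));
        System.out.println(indexFor(h, 1 << 15));
    }
}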

Related

Finding mode for every window of size k in an array

Given an array of size n and a window size k, how do you find the mode of every contiguous subarray of size k?
For example
arr = 1 2 2 6 6 1 1 7
k = 3
ans = 2 2 6 6 1 1
I was thinking of having a HashMap where the key is the number and the value is its frequency, a TreeMap where the key is the frequency and the value is the number, and a queue to remove the oldest element when the size exceeds k. That gives O(n log n) time overall. Can we do this in O(n), i.e. O(1) per window?
This can be done in O(n) time
I was intrigued by this problem in part because, as I indicated in the comments, I felt certain that it could be done in O(n) time. I had some time over this past weekend, so I wrote up my solution to this problem.
Approach: Mode Frequencies
The basic concept is this: the mode of a collection of numbers is the number(s) which occur with the highest frequency within that set.
This means that whenever you add a number to the collection, if the number added was not already one of the mode-values then the frequency of the mode would not change. So with the collection (8 9 9) the mode-values are {9} and the mode-frequency is 2. If you add say a 5 to this collection ((8 9 9 5)) neither the mode-frequency nor the mode-values change. If instead you add an 8 to the collection ((8 9 9 8)) then the mode-values change to {9, 8} but the mode-frequency is still unchanged at 2. Finally, if you instead added a 9 to the collection ((8 9 9 9)), now the mode-frequency goes up by one.
Thus in all cases when you add a single number to the collection, the mode-frequency is either unchanged or goes up by only one. Likewise, when you remove a single number from the collection, the mode-frequency is either unchanged or goes down by at most one. So all incremental changes to the collection result in only two possible new mode-frequencies. This means that if we had all of the distinct numbers of the collection indexed by their frequencies, then we could always find the new Mode in a constant amount of time (i.e., O(1)).
To accomplish this I use a custom data structure ("ModeTracker") that has a multiset ("numFreqs") to store the distinct numbers of the collection along with their current frequency in the collection. This is implemented with a Dictionary<int, int> (I think that this is a Map in Java). Thus given a number, we can use this to find its current frequency within the collection in O(1).
This data structure also has an array of sets ("freqNums") that given a specific frequency will return all of the numbers that have that frequency in the current collection.
I have included the code for this data structure class below. Note that this is implemented in C# as I do not know Java well enough to implement it there, but I believe that a Java programmer should have no trouble translating it.
(pseudo)Code:
using System.Collections.Generic;
using System.Linq;

class ModeTracker
{
    HashSet<int>[] freqNums;        // numbers at each frequency
    Dictionary<int, int> numFreqs;  // frequency of each number
    int modeFreq_ = 0;              // frequency of the current mode

    public ModeTracker(int maxFrequency)
    {
        // maxFrequency + 2 slots: a frequency can momentarily reach
        // maxFrequency + 1 when a number is added before another is removed
        freqNums = new HashSet<int>[maxFrequency + 2];
        // populate every slot up front, so we don't have to null-check later
        for (int i = 0; i < freqNums.Length; i++)
        {
            freqNums[i] = new HashSet<int>();
        }
        numFreqs = new Dictionary<int, int>();
    }

    public int Mode { get { return freqNums[modeFreq_].First(); } }

    public void addNumber(int n)
    {
        adjustNumberCount(n, 1);
        // new mode-frequency is one greater or the same
        if (freqNums[modeFreq_ + 1].Count > 0) modeFreq_++;
    }

    public void removeNumber(int n)
    {
        adjustNumberCount(n, -1);
        // new mode-frequency is the same or one less
        if (freqNums[modeFreq_].Count == 0) modeFreq_--;
    }

    void adjustNumberCount(int num, int adjust)
    {
        // make sure we already have this number
        if (!numFreqs.ContainsKey(num))
        {
            // add entries for it
            numFreqs.Add(num, 0);
            freqNums[0].Add(num);
        }
        // now adjust this number's frequency
        int oldFreq = numFreqs[num];
        int newFreq = oldFreq + adjust;
        numFreqs[num] = newFreq;
        // move the number from its old frequency set to its new one
        freqNums[oldFreq].Remove(num);
        freqNums[newFreq].Add(num);
    }
}
Also, below is a small C# function that demonstrates how to use this data structure to solve the problem originally posed in the question.
int[] ModesOfSubarrays(int[] arr, int subLen)
{
    ModeTracker tracker = new ModeTracker(subLen);
    int[] modes = new int[arr.Length - subLen + 1];

    for (int i = 0; i < arr.Length; i++)
    {
        // add every number into the tracker
        tracker.addNumber(arr[i]);
        if (i >= subLen)
        {
            // remove the number that just rotated out of the window
            tracker.removeNumber(arr[i - subLen]);
        }
        if (i >= subLen - 1)
        {
            // add the new Mode to the output
            modes[i - subLen + 1] = tracker.Mode;
        }
    }
    return modes;
}
I have tested this and it does appear to work correctly for all of my tests.
Complexity Analysis
Going through the individual steps of the `ModesOfSubarrays()` function:
1. The new ModeTracker object is created in O(n) time or less.
2. The modes[] array is created in O(n) time.
3. The for loop executes n times:
   a. the addNumber() call takes O(1) time
   b. the removeNumber() call takes O(1) time
   c. reading the new Mode takes O(1) time
So the total time is O(n) + O(n) + n*(O(1) + O(1) + O(1)) = O(n).
Please let me know of any questions that you might have about this code.
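Since the question is tagged Java, here is one possible translation of the ModeTracker class (a sketch of my own, not part of the original answer; Dictionary<int, int> becomes a HashMap<Integer, Integer> and HashSet<int> becomes HashSet<Integer>):

import java.util.HashMap;
import java.util.HashSet;

class ModeTracker {
    private final HashSet<Integer>[] freqNums;         // numbers at each frequency
    private final HashMap<Integer, Integer> numFreqs;  // frequency of each number
    private int modeFreq = 0;                          // frequency of the current mode

    @SuppressWarnings("unchecked")
    ModeTracker(int maxFrequency) {
        freqNums = new HashSet[maxFrequency + 2];
        for (int i = 0; i < freqNums.length; i++) {
            freqNums[i] = new HashSet<>();
        }
        numFreqs = new HashMap<>();
    }

    int mode() {
        return freqNums[modeFreq].iterator().next();
    }

    void addNumber(int n) {
        adjustNumberCount(n, 1);
        if (!freqNums[modeFreq + 1].isEmpty()) modeFreq++; // one greater or the same
    }

    void removeNumber(int n) {
        adjustNumberCount(n, -1);
        if (freqNums[modeFreq].isEmpty()) modeFreq--;      // the same or one less
    }

    private void adjustNumberCount(int num, int adjust) {
        int oldFreq = numFreqs.getOrDefault(num, 0);
        int newFreq = oldFreq + adjust;
        numFreqs.put(num, newFreq);
        freqNums[oldFreq].remove(num); // harmless no-op for a brand-new number
        freqNums[newFreq].add(num);
    }
}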

Hash code for array size 600 with least collisions

So I'm working with a file that has 400 data values, all ints ranging from 4 to 20,000. I load these into an array of size 400. There is another, initially empty array of ListNodes of size 600 that I will move the data into using a self-written hash code (posted below).
Each index in the array of length 600 holds a ListNode, and if there are any collisions the data value is added to the back of that ListNode's list. I also have a method that returns the percentage of the array that is null. Since I'm loading 400 data values into an array of size 600, the lowest percentage of nulls I could have is 33.3%: with no collisions, 400 slots are taken and 200 are null. But this is not the case:
return (num+123456789/(num*9365))%600; //num is the value read from the array of 400
That hash code has given me my best result of 48.3% nulls, and I need it to be below 47% at least. Any suggestions or solutions to improve this hash code? I would greatly appreciate any help. If you need any more info or details please let me know. Thank you!!!
I did some experiments with random numbers: generate 400 uniformly distributed random numbers in the range [0, 599] and check how many values in that range are not generated. It turns out, on average 51.3% of the values are not generated, which matches the theoretical expectation: each bucket stays empty with probability (1 - 1/600)^400 ≈ e^(-400/600) ≈ 0.513. So your 48.3% is already better than expected for a uniform random hash.
The target of 47% seems unrealistic unless some form of perfect hashing is used.
If you want to make some experiments on your own, here is the program.
public static void main(String[] args) {
    Random r = new Random();
    int[] counts = new int[600];
    for (int i = 0; i < 400; i++) {
        counts[r.nextInt(600)]++;
    }
    int n = 0;
    for (int i = 0; i < 600; i++) {
        if (counts[i] == 0) {
            n++;
        }
    }
    System.out.println(100.0 * n / 600);
}
I'd use Java's own implementation of the hashing algorithm. Have a look at the OpenJDK HashMap:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Note that you have to add a modulo operation to make sure the value isn't greater than or equal to 600.
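For example, a minimal sketch (assuming the hash(int) method above, a table of size 600, and num as the value being inserted):

int index = Math.floorMod(hash(num), 600); // 600 is not a power of two, so HashMap's (h & (length-1)) trick doesn't apply

Math.floorMod also keeps the index non-negative even when the spread hash is negative.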
EDIT 1
>>> is the logical right shift (zero bits are shifted in from the left, with no sign extension).
EXAMPLE:
10000000 >>> 2 = 00100000

Hash Map entries collision

I am trying this code snippet:
Map<String, String> headers = new HashMap<>();
headers.put("X-Capillary-Relay", "abcd");
headers.put("Message-ID", "abcd");
When I do a get for either of the keys it works fine.
However, I am seeing a strange phenomenon in the Eclipse debugger.
When I debug and look inside the variables view at the table entries, at first I see this:
->table
--->[4]
------>key:X-Capillary-Relay
...........
However, after stepping over the 2nd put I see:
->table
--->[4]
------>key:Message-ID
...........
Instead of creating a new entry, it appears to overwrite the existing key. For any other key this overwrite does not occur. The size of the map is shown as 2, and get works for both keys. So what is the reason behind this discrepancy in the Eclipse debugger? Is it an Eclipse problem or a hashing problem? The hash codes of the two keys are different.
The hashCode of the keys is not used as is. Two transformations are applied to it (at least based on the Java 6 code):
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
and
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length - 1);
}
Since length is the initial capacity of the HashMap (16 by default), you get 4 for both keys:
System.out.println(hash("X-Capillary-Relay".hashCode()) & (16 - 1));
System.out.println(hash("Message-ID".hashCode()) & (16 - 1));
Therefore both entries are stored in a linked list in the same bucket of the map (index 4 of the table array, as you can see in the debugger). The fact that the debugger shows only one of them doesn't mean that the other was overwritten. It means that you see the key of the first Entry of the linked list, and each new Entry is added to the head of the list.
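A quick way to convince yourself that nothing was overwritten is to iterate over the entries instead of inspecting the internal table array (a small sketch):

import java.util.HashMap;
import java.util.Map;

Map<String, String> headers = new HashMap<>();
headers.put("X-Capillary-Relay", "abcd");
headers.put("Message-ID", "abcd");
for (Map.Entry<String, String> e : headers.entrySet()) {
    System.out.println(e.getKey() + " -> " + e.getValue()); // prints both keys
}
System.out.println(headers.size()); // prints 2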

How efficient is this hash function?

I am not sure of the best way to go about hashing a "dictionary" into a table.
The dictionary has 61,406 words; I determine the table size as sizeOfDictionary / 0.75, which gives me 81,874 buckets in the table.
I run it through my hash function (a generic random algorithm), and 31,690 buckets get used while some 50,000 are left empty. The largest bucket contains only 10 words.
My question: do these numbers suffice for a hashing project? I am unfamiliar with what I am trying to achieve; to me, 50-some thousand seems like a lot of empty buckets.
Here is my hashing function.
private void hashingAlgorithm(String word)
{
    int key = 1;
    // Multiplying ASCII values of string
    // to determine the index
    for (int i = 0; i < word.length(); i++) {
        key *= (int) word.charAt(i);
        // Accounting for integer overflow
        if (key < 0)
            key *= -1;
    }
    key %= sizeOfTable;
    // Inserting into the table
    table[key].addToBucket(word);
}
Performance analysis:
Your hashing function doesn't take the order of the characters into account: ignoring overflow, hash("ab") == hash("ba"). Your code depends on overflow alone to distinguish different orderings, so there is room for a lot of extra collisions, which can be removed if you treat the strings as base-N numbers.
Suggested Improvement:
2 * 3 == 3 * 2
but
2 * 223 + 3 != 3 * 223 + 2
So if we represent the strings as base-N numbers, the number of collisions will decrease dramatically.
If the dictionary contains words like:
abdc
abcd
dbca
dabc
dacb
they will all hash to the same value in the hash table, i.e. int(a)*int(b)*int(c)*int(d), which is not a good idea.
So use a rolling (polynomial) hash instead. For example:
hash = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]
where base is a prime number, say 31.
NOTE: [i] means word.charAt(i).
You can also apply a modulo p (p being a prime number) to avoid overflow and to limit the size of your hash table:
hash = ([0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]) mod p
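A sketch of that polynomial hash in Java, applied to the hashingAlgorithm from the question (base 31 is an illustrative choice, and sizeOfTable is the question's field):

private void hashingAlgorithm(String word)
{
    final int base = 31; // small prime base
    long key = 0;
    for (int i = 0; i < word.length(); i++) {
        // Horner's rule: computes word[0]*base^(n-1) + ... + word[n-1], mod sizeOfTable
        key = (key * base + word.charAt(i)) % sizeOfTable;
    }
    // Inserting into the table, as in the original
    table[(int) key].addToBucket(word);
}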

Need help in understanding Rolling Hash computation in constant time for Rabin-Karp Implementation

I've been trying to implement the Rabin-Karp algorithm in Java, and I'm having a hard time computing the rolling hash value in constant time. I found one implementation at http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html, but I still can't work out how these two lines work:
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;
I looked at a couple of articles on modular arithmetic, but none of them managed to penetrate my thick skull. Please give me some pointers to understand this.
First you need to understand how the hash is computed.
Let's take the simple case of base-10 strings. How would you guarantee that the hash code of a string is unique? Base 10 is what we use to represent numbers, and we don't have collisions!!
"523" = 5*10^2 + 2*10^1 + 3*10^0 = 523
Using the above hash function, you are guaranteed distinct hashes for distinct strings.
Given the hash of "523", if you want to calculate the hash of "238", i.e. by dropping the leftmost digit 5 and bringing in a new digit 8 from the right, you would have to do the following:
1) remove the effect of the 5 from the hash: hash = hash - 5*10^2 (523 - 500 = 23)
2) adjust the hash of the remaining chars by shifting them one place: hash = hash * 10 (23 * 10 = 230)
3) add the hash of the new character: hash = hash + 8 (230 + 8 = 238, which as we expected is the base-10 hash of "238")
Now let's extend this to all ASCII characters. This takes us to the base-256 world. The hash of the same string "523" is therefore
5*256^2 + 2*256^1 + 3*256^0 = 327680 + 512 + 3 = 328195.
You can imagine that as the string length increases, you will exceed the range of an integer/long in most programming languages relatively quickly.
How can we solve this? The way this is routinely solved is by working modulo a large prime number. The drawback of this method is that we will now get false positives as well, which is a small price to pay if it takes the runtime of your algorithm from quadratic to linear!
The complicated equation you quoted is nothing but the steps 1-3 above done with modulus math.
The two modulus properties used here are:
a) (a*b) % p = ((a % p) * (b % p)) % p
b) a % p = (a + p) % p
Let's go back to steps 1-3 mentioned above:
1) (expanded using property a) hash = hash - ((5 % p) * (10^2 % p)) % p
vs. what you quoted:
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
Here is how the two are related:
RM = 10^2 % p (in general R^(M-1) % Q, the place value of the leftmost character in the window)
txt.charAt(i-M) % Q = 5 % p
The additional + Q you see just ensures that the hash is never negative. See property b above.
2 & 3) hash = hash*10 + 8, vs. txtHash = (txtHash*R + txt.charAt(i)) % Q;
The same thing, but taking the mod of the final hash result.
Looking at properties a & b more closely should help you figure it out!
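To make the modular version concrete, here is the base-10 example above written out in Java (a small sketch; the prime p = 997 is an arbitrary illustrative choice):

int p = 997;            // illustrative prime modulus
int hash = 523 % p;     // hash of "523"
int RM = (10 * 10) % p; // 10^(M-1) % p, with M = 3
// step 1: remove the leading 5 (adding p guards against a negative intermediate)
hash = (hash + p - (RM * 5) % p) % p;   // 23
// steps 2 and 3: shift the remaining digits and bring in the 8
hash = (hash * 10 + 8) % p;             // 238
System.out.println(hash);               // prints 238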
This is the "rolling" aspect of the hash. It's eliminating the contribution of the oldest character (txt.charAt(i-M)), and incorporating the contribution of the newest character(txt.charAt(i)).
The hash function is defined as:
hash[i] = ( SUM for j = 0 to M-1 of input[i-j] * R^j ) % Q
(where I'm using ^ to denote "to the power of".)
But this can be written as an efficient recursive implementation as:
hash[i] = ( hash[i-1]*R - input[i-M]*(R^M) + input[i] ) % Q
Your reference code is doing this, but it uses various techniques to ensure that the result is always computed correctly (and efficiently).
So, for instance, the + Q in the first expression has no mathematical effect, but it ensures that the result of the sum is always positive (if it goes negative, % Q doesn't have the desired effect). It also breaks the calculation into stages, presumably to prevent numerical overflow.
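Putting it together, here is a commented sketch of the reference code's sliding loop (mirroring the Princeton implementation, which precomputes RM = R^(M-1) % Q before the loop):

for (int i = M; i < N; i++) {
    // remove the leading character of the previous window;
    // adding Q keeps the intermediate value non-negative
    txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
    // shift the remaining M-1 characters up one place and append the new one
    txtHash = (txtHash * R + txt.charAt(i)) % Q;
    // ... txtHash can now be compared against the pattern's hash ...
}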
