HashMap entries collision - Java

I am trying this code snippet
Map headers = new HashMap();
headers.put("X-Capillary-Relay", "abcd");
headers.put("Message-ID", "abcd");
Now when I do a get for either of the keys, it works fine.
However, I am seeing a strange phenomenon in the Eclipse debugger.
When I debug and look inside the Variables view at the table entries, at first I see this:
->table
--->[4]
------>key:X-Capillary-Relay
...........
However, after stepping over the second put I get
->table
--->[4]
------>key:Message-ID
...........
Instead of creating a new entry, it seems to overwrite the existing key. For any other key this overwrite does not occur. The size of the map is shown as 2, and get works for both keys. So what is the reason behind this discrepancy in the Eclipse debugger? Is it an Eclipse problem, or a hashing problem? The hashCode is different for the two keys.

The hashCode of the keys is not used as-is.
Two transformations are applied to it (at least based on the Java 6 code):
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
and
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length - 1);
}
Since length is the initial capacity of the HashMap (16 by default), you get 4 for both keys:
System.out.println(hash("X-Capillary-Relay".hashCode()) & (16 - 1));
System.out.println(hash("Message-ID".hashCode()) & (16 - 1));
Therefore both entries are stored in a linked list in the same bucket of the map (index 4 of the table array, as you can see in the debugger). The fact that the debugger shows only one of them doesn't mean that the other was overwritten. It means that you see the key of the first Entry of the linked list, and each new Entry is added to the head of the list.
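To see the collision end to end, the two snippets from this answer can be combined into a runnable class (hash() and indexFor() are copied from the Java 6 HashMap source; the default capacity of 16 is assumed):

```java
public class BucketIndexDemo {
    // Supplemental hash, copied from the Java 6 HashMap source.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Bucket index, copied from the Java 6 HashMap source.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int i1 = indexFor(hash("X-Capillary-Relay".hashCode()), 16);
        int i2 = indexFor(hash("Message-ID".hashCode()), 16);
        // Per the answer above, both keys map to the same bucket index.
        System.out.println(i1 + " " + i2);
    }
}
```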

Related

"Normalize" hash of Key in HashMap

HashMap keeps its data in buckets as:
transient Node<K,V>[] table;
To put something in a HashMap, we need a hash() function that returns the hash of a key in the range 0 to table.length - 1, right?
Suppose, I have:
String s = "15315";
// Just pasted the internal operation. Is it supposed to calculate the hash in the table.length range?
int h;
int hmhc = (h = s.hashCode()) ^ (h >>> 16);
System.out.println("String native hashCode: " + s.hashCode() + ", HashMap hash: " + hmhc);
This returns the following:
String native hashCode: 46882035, HashMap hash: 46882360
We should have approximately 256 buckets (so the hash of a key should be in the range 0 to 255), but the internal hash in HashMap gives us 46882360. How do I "normalize" this hash to our range? I just can't see it in the source code.
I looked at this JDK source (put() starts at line 610): http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
Generally, the hash code returned is taken modulo the number of buckets. Since the table length is always a power of two, HashMap implements this as a bitwise AND, tab[(n - 1) & hash], which is equivalent to % for non-negative values.
In your case, it will go into bucket 46882360 % 256 = 56.
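A minimal sketch using the hash value from the question, showing that the modulo and the bitwise AND agree when the bucket count is a power of two:

```java
public class BucketDemo {
    public static void main(String[] args) {
        int h = 46882360;  // the HashMap-internal hash from the question
        int buckets = 256;
        // For a power-of-two bucket count n, h % n == h & (n - 1) for
        // non-negative h; the AND form is the trick HashMap uses.
        System.out.println(h % buckets);        // 56
        System.out.println(h & (buckets - 1));  // 56
    }
}
```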

HashMap different bucket index for a key possible?

My concern was to check how Java's HashMap gets the same index for a key even when its size expands from the default 16 to much higher values as we keep adding entries.
I tried to reproduce the indexing algorithm of HashMap.
int length = 1 << 5;
int v = 6546546;
int h = new Integer(v).hashCode();
h = h ^ ((h >>> 20) ^ (h >>> 12));
h = h ^ (h >>> 7) ^ (h >>> 4);
System.out.println("index :: " + (h & (length - 1)));
I ran my code for different values of "length".
So for the same key I am getting a different index as the length of the HashMap changes. What am I missing here?
My Results:
length=1<<5;
index :: 10
length=1<<15;
index :: 7082
length=1<<30;
index :: 6626218
You're missing the fact that every time the length changes, the entries are redistributed - they're put in the new buckets appropriately. That's why it takes a bit of time (O(N)) when the map expands - everything needs to be copied from the old buckets to the new ones.
So long as you only ever have indexes for one length at a time (not a mixture), you're fine.
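To see this in practice, here is a minimal sketch (the key 6546546 is taken from the question): even though the bucket index for the key changes as the table grows, the map rehashes every entry on resize, so lookups keep working.

```java
import java.util.HashMap;
import java.util.Map;

public class ResizeDemo {
    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>(16);
        map.put(6546546, "marker");
        // Force several resizes; the bucket index of 6546546 changes each
        // time, but the entry is redistributed along with everything else.
        for (int i = 0; i < 10_000; i++) {
            map.put(i, "v" + i);
        }
        System.out.println(map.get(6546546)); // still found: "marker"
    }
}
```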

How to look up range from set of contiguous ranges for given number

so simply put, this is what I am trying to do:
I have a collection of Range objects that are contiguous (non overlapping, with no gaps between them), each containing a start and end int, and a reference to another object obj. These ranges are not of a fixed size (the first could be 1-49, the second 50-221, etc.). This collection could grow to be quite large.
I am hoping to find a way to look up the range (or more specifically, the object that it references) that includes a given number without having to iterate over the entire collection checking each range to see if it includes the number. These lookups will be performed frequently, so speed/performance is key.
Does anyone know of an algorithm/equation that might help me out here? I am writing in Java. I can provide more details if needed, but I figured I would try to keep it simple.
Thanks.
It sounds like you want to use a TreeMap, where the key is the bottom of the range and the value is the Range object.
Then, to identify the correct range, just use the floorEntry() method to very quickly get the closest (lesser or equal) entry, which should contain the key, like so:
TreeMap<Integer, Range> map = new TreeMap<>();
map.put(1, new Range(1, 10));
map.put(11, new Range(11, 30));
map.put(31, new Range(31, 100));
// int key = 0;   // null
// int key = 1;   // Range [start=1, end=10]
// int key = 11;  // Range [start=11, end=30]
// int key = 21;  // Range [start=11, end=30]
// int key = 31;  // Range [start=31, end=100]
// int key = 41;  // Range [start=31, end=100]
int key = 101;    // Range [start=31, end=100]
// etc.
Range r = null;
Map.Entry<Integer, Range> m = map.floorEntry(key);
if (m != null) {
    r = m.getValue();
}
System.out.println(r);
Since the tree is always sorted by the natural ordering of the bottom range boundary, all your searches will be at worst O(log(n)).
You'll want to add some sanity checking for when your key is completely out of bounds (for instance, when the key is beyond the end of the map, floorEntry() returns the last Range in the map), but this should give you an idea of how to proceed.
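One way to add that sanity check, sketched under the assumption that Range exposes start and end fields (the class shape is hypothetical, modeled on the snippet above):

```java
import java.util.Map;
import java.util.TreeMap;

public class RangeLookup {
    // Hypothetical Range class matching the snippet above.
    static class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    // Returns the Range containing key, or null if key is out of bounds.
    static Range find(TreeMap<Integer, Range> map, int key) {
        Map.Entry<Integer, Range> e = map.floorEntry(key);
        if (e == null) return null;      // key below the first range
        Range r = e.getValue();
        return key <= r.end ? r : null;  // key beyond the found range's end
    }

    public static void main(String[] args) {
        TreeMap<Integer, Range> map = new TreeMap<>();
        map.put(1, new Range(1, 10));
        map.put(11, new Range(11, 30));
        map.put(31, new Range(31, 100));
        System.out.println(find(map, 101)); // null instead of the last Range
    }
}
```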
Assuming that your lookups are of utmost importance, and you can spare O(N) memory and approximately O(N^2) preprocessing time, the algorithm would be:
introduce a class ObjectsInRange, which contains: start of range (int startOfRange) and a set of objects (Set<Object> objects)
introduce an ArrayList<ObjectsInRange> oir, which will contain ObjectsInRange sorted by the startOfRange
for each Range r, ensure that there exist ObjectsInRange (let's call them a and b) such that a.startOfRange = r.start and b.startOfRange = r.end + 1. Then, for all ObjectsInRange x from a until (but not including) b, add r.obj to their x.objects set
The lookup, then, is as follows:
for an integer x, find the i such that oir[i].startOfRange <= x and oir[i+1].startOfRange > x
note: i can be found by bisection (binary search) in O(log N) time!
your objects are oir[i].objects
If the collection is in order, then you can implement a binary search to find the right range in O(log(n)) time. It's not as efficient as hashing for very large collections, but if you have less than 1000 or so ranges, it may be faster (because it's simpler).
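The binary search described above can be sketched as follows; the Range class and its fields are assumptions based on the question:

```java
import java.util.Arrays;
import java.util.List;

public class RangeBinarySearch {
    // Hypothetical Range class, as described in the question.
    static class Range {
        final int start, end;
        final Object obj;
        Range(int start, int end, Object obj) {
            this.start = start; this.end = end; this.obj = obj;
        }
    }

    // Classic binary search over ranges sorted by start; O(log n) per lookup.
    static Range find(List<Range> ranges, int key) {
        int lo = 0, hi = ranges.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Range r = ranges.get(mid);
            if (key < r.start) hi = mid - 1;
            else if (key > r.end) lo = mid + 1;
            else return r;
        }
        return null; // key outside all ranges
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
            new Range(1, 49, "a"), new Range(50, 221, "b"), new Range(222, 300, "c"));
        System.out.println(find(ranges, 75).obj); // prints "b"
    }
}
```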

Java Hashcode gives integer overflow

Background information:
In my project I'm applying Reinforcement Learning (RL) to the Mario domain. For my state representation I chose to use a hashtable with custom objects as keys. My custom objects are immutable and have overridden .equals() and .hashCode() (which were generated by the IntelliJ IDE).
This is the resulting .hashcode(), I've added the possible values in comments as extra information:
@Override
public int hashCode() {
    int result = (stuck ? 1 : 0);               // 2 possible values: 0, 1
    result = 31 * result + (facing ? 1 : 0);    // 2 possible values: 0, 1
    result = 31 * result + marioMode;           // 3 possible values: 0, 1, 2
    result = 31 * result + (onGround ? 1 : 0);  // 2 possible values: 0, 1
    result = 31 * result + (canJump ? 1 : 0);   // 2 possible values: 0, 1
    result = 31 * result + (wallNear ? 1 : 0);  // 2 possible values: 0, 1
    result = 31 * result + nearestEnemyX;       // 33 possible values: -16 to 16
    result = 31 * result + nearestEnemyY;       // 33 possible values: -16 to 16
    return result;
}
The Problem:
The problem here is that the result in the above code can exceed Integer.MAX_VALUE. I've read online that this doesn't have to be a problem, but in my case it is. This is partly due to the algorithm used, which is Q-Learning (an RL method) that depends on the correct Q-values stored inside the hashtable. Basically, I cannot have conflicts when retrieving values. When running my experiments I see that the results are not good at all, and I'm 95% certain the problem lies with the retrieval of the Q-values from the hashtable. (If needed I can expand on why I'm certain about this, but that requires some extra information on the project which isn't relevant to the question.)
The Question:
Is there a way to avoid the integer overflow? Maybe I'm overlooking something here. Or is there another way (perhaps another data structure) to get the values reasonably fast given my custom key?
Remark:
After reading some comments I realise that my choice of a HashTable maybe wasn't the best one, as I want unique keys that do not cause collisions. If I still want to use the HashTable I will probably need a proper encoding.
You need a dedicated key field to guarantee uniqueness
.hashCode() isn't designed for what you are using it for
.hashCode() is designed to give good general results in bucketing algorithms, which can tolerate minor collisions. It is not designed to provide a unique key. The default algorithm is a trade-off of time, space and minor collisions; it isn't supposed to guarantee uniqueness.
Perfect Hash
What you need to implement is a perfect hash or some other unique key based on the contents of the object. This is possible within the boundaries of an int, but I wouldn't use .hashCode() for this representation. I would use an explicit key field on the object.
Unique Hashing
One way is to use the SHA-1 hashing that is built into the standard library, which has an extremely low chance of collisions for small data sets. You don't have a huge combinatorial explosion in the values you posted, so SHA-1 will work.
You should be able to calculate a way to generate a minimal perfect hash with the limited values that you are showing in your question.
A minimal perfect hash function is a perfect hash function that maps n
keys to n consecutive integers—usually [0..n−1] or [1..n]. A more
formal way of expressing this is: Let j and k be elements of some
finite set K. F is a minimal perfect hash function iff F(j) =F(k)
implies j=k (injectivity) and there exists an integer a such that the
range of F is a..a+|K|−1. It has been proved that a general purpose
minimal perfect hash scheme requires at least 1.44 bits/key.[2] The
best currently known minimal perfect hashing schemes use around 2.6
bits/key.[3]
A minimal perfect hash function F is order preserving if keys are
given in some order a1, a2, ..., an and for any keys aj and ak, j < k
implies F(aj) < F(ak).
A minimal perfect hash function F is monotone if it preserves the
lexicographical order of the keys. In this case, the function value is
just the position of each key in the sorted ordering of all of the
keys. If the keys to be hashed are themselves stored in a sorted
array, it is possible to store a small number of additional bits per
key in a data structure that can be used to compute hash values
quickly.[6]
Solution
Note: where it talks about a URL, it can be any byte[] representation of any String that you calculate from your object.
I usually override the toString() method to make it generate something unique, and then feed that into the UUID.nameUUIDFromBytes() method.
A Type 3 UUID can be just as useful as well: UUID.nameUUIDFromBytes()
Version 3 UUIDs use a scheme deriving a UUID via MD5 from a URL, a
fully qualified domain name, an object identifier, a distinguished
name (DN as used in Lightweight Directory Access Protocol), or on
names in unspecified namespaces. Version 3 UUIDs have the form
xxxxxxxx-xxxx-3xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit
and y is one of 8, 9, A, or B.
To determine the version 3 UUID of a given name, the UUID of the
namespace (e.g., 6ba7b810-9dad-11d1-80b4-00c04fd430c8 for a domain) is
transformed to a string of bytes corresponding to its hexadecimal
digits, concatenated with the input name, hashed with MD5 yielding 128
bits. Six bits are replaced by fixed values, four of these bits
indicate the version, 0011 for version 3. Finally, the fixed hash is
transformed back into the hexadecimal form with hyphens separating the
parts relevant in other UUID versions.
My preferred solution is a Type 5 UUID (the SHA-1 version of Type 3):
Version 5 UUIDs use a scheme with SHA-1 hashing; otherwise it is the
same idea as in version 3. RFC 4122 states that version 5 is preferred
over version 3 name based UUIDs, as MD5's security has been
compromised. Note that the 160 bit SHA-1 hash is truncated to 128 bits
to make the length work out. An erratum addresses the example in
appendix B of RFC 4122.
Key objects should be immutable
That way you can calculate toString() and .hashCode() and generate a unique primary key inside the constructor, setting them once instead of calculating them over and over.
Here is a straw man example of an idiomatic immutable object and calculating a unique key based on the contents of the object.
package com.stackoverflow;

import javax.annotation.Nonnull;

import java.util.Date;
import java.util.UUID;

public class Q23633894
{
    public static class Person
    {
        private final String firstName;
        private final String lastName;
        private final Date birthday;
        private final UUID key;
        private final String strRep;

        public Person(@Nonnull final String firstName, @Nonnull final String lastName, @Nonnull final Date birthday)
        {
            this.firstName = firstName;
            this.lastName = lastName;
            this.birthday = birthday;
            this.strRep = String.format("%s%s%d", firstName, lastName, birthday.getTime());
            this.key = UUID.nameUUIDFromBytes(this.strRep.getBytes());
        }

        @Nonnull
        public UUID getKey()
        {
            return this.key;
        }

        // Other getters/setters omitted for brevity

        @Override
        @Nonnull
        public String toString()
        {
            return this.strRep;
        }

        @Override
        public boolean equals(final Object o)
        {
            if (this == o) { return true; }
            if (o == null || getClass() != o.getClass()) { return false; }
            final Person person = (Person) o;
            return key.equals(person.key);
        }

        @Override
        public int hashCode()
        {
            return key.hashCode();
        }
    }
}
For a unique representation of your object's state, you would need 19 bits in total. Thus, it is possible to represent it by a "perfect hash" integer value (which can have up to 32 bits):
@Override
public int hashCode() {
    int result = (stuck ? 1 : 0);         // needs 1 bit (2 possible values)
    result += (facing ? 1 : 0) << 1;      // needs 1 bit (2 possible values)
    result += marioMode << 2;             // needs 2 bits (3 possible values)
    result += (onGround ? 1 : 0) << 4;    // needs 1 bit (2 possible values)
    result += (canJump ? 1 : 0) << 5;     // needs 1 bit (2 possible values)
    result += (wallNear ? 1 : 0) << 6;    // needs 1 bit (2 possible values)
    result += (nearestEnemyX + 16) << 7;  // needs 6 bits (33 possible values)
    result += (nearestEnemyY + 16) << 13; // needs 6 bits (33 possible values)
    return result;
}
Instead of using 31 as your magic number, you need to use the number of possibilities for each field (normalised to 0):
@Override
public int hashCode() {
    int result = (stuck ? 1 : 0);                // 2 possible values: 0, 1
    result = 2 * result + (facing ? 1 : 0);      // 2 possible values: 0, 1
    result = 3 * result + marioMode;             // 3 possible values: 0, 1, 2
    result = 2 * result + (onGround ? 1 : 0);    // 2 possible values: 0, 1
    result = 2 * result + (canJump ? 1 : 0);     // 2 possible values: 0, 1
    result = 2 * result + (wallNear ? 1 : 0);    // 2 possible values: 0, 1
    result = 33 * result + (16 + nearestEnemyX); // 33 possible values: -16 to 16
    result = 33 * result + (16 + nearestEnemyY); // 33 possible values: -16 to 16
    return result;
}
This will give you 104544 possible hash codes. BTW, you can reverse this process to get the original values back from the code by using a series of / and % operations.
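A sketch of that reversal, peeling the fields off with % and / in the opposite order from which they were mixed in (encode() mirrors the hashCode() above; field names are taken from the question):

```java
public class MixedRadixDemo {
    // Encode exactly as in the mixed-radix hashCode() above.
    static int encode(boolean stuck, boolean facing, int marioMode, boolean onGround,
                      boolean canJump, boolean wallNear, int nearestEnemyX, int nearestEnemyY) {
        int result = (stuck ? 1 : 0);
        result = 2 * result + (facing ? 1 : 0);
        result = 3 * result + marioMode;
        result = 2 * result + (onGround ? 1 : 0);
        result = 2 * result + (canJump ? 1 : 0);
        result = 2 * result + (wallNear ? 1 : 0);
        result = 33 * result + (16 + nearestEnemyX);
        result = 33 * result + (16 + nearestEnemyY);
        return result;
    }

    public static void main(String[] args) {
        int code = encode(false, true, 1, false, true, false, -16, 7);
        // Reverse with % and /, last-encoded field first.
        int nearestEnemyY = code % 33 - 16; code /= 33;
        int nearestEnemyX = code % 33 - 16; code /= 33;
        boolean wallNear  = code % 2 == 1;  code /= 2;
        boolean canJump   = code % 2 == 1;  code /= 2;
        boolean onGround  = code % 2 == 1;  code /= 2;
        int marioMode     = code % 3;       code /= 3;
        boolean facing    = code % 2 == 1;  code /= 2;
        boolean stuck     = code == 1;
        System.out.println(facing + " " + marioMode + " " + nearestEnemyX + " " + nearestEnemyY);
        // prints: true 1 -16 7
    }
}
```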
Try Guava's Objects.hashCode() method or JDK 7's Objects.hash(). It's way better than writing your own. Don't repeat yourself (or anyone else) when you can use an out-of-the-box solution.

Java : hash function

I am wondering: if we implement our own hash map that doesn't use power-of-two table lengths (for the initial capacity and whenever we resize), can we just use the object's hashCode mod the table size directly, instead of applying an extra hash function to the object's hashCode?
for example
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    // int hash = hash(key.hashCode()); // original way
    // Can we just use the key's hashCode if our table length is not a power of two?
    int hash = key.hashCode();
    int i = indexFor(hash, table.length);
    ...
    ...
}
Presuming we're talking about OpenJDK 7, the additional hash is used to stimulate avalanching; it is a mixing function. It is needed because the mapping from a hash to a bucket, since we're using a power of 2 for the capacity, is a mere bitwise & (a % b is equivalent to a & (b - 1) iff b is a power of 2); this means only the lower bits matter, so the mixing step helps protect against poorer hashes.
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
If you want to use sizes that aren't powers of 2, the above may not be needed.
Actually changing the mapping from hashes to buckets (which normally relies on the capacity being a power of 2) will require you to look at indexFor:
static int indexFor(int h, int length) {
    return h & (length - 1);
}
You could use (h & 0x7fffffff) % length here.
You can think of the mod function as a simple form of hash function: it maps a large range of data to a smaller space. Assuming the original hashCode is well designed, I see no reason why a mod cannot be used to transform the hashCode into an index for the table size you are using.
If your original hash function is not well implemented, e.g. it always returns an even number, you will create quite a lot of collisions using just a mod function as your hash function. To mitigate this, you can pick prime (or pseudo-prime) table sizes instead.
Note: indexFor needs to use % and compensate for the sign, instead of a simple &, which can actually make the lookup slower:
indexFor = (h & Integer.MAX_VALUE) % length
// or
indexFor = Math.abs(h % length)
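A small sketch of why the sign handling matters (the hash value and table size below are arbitrary illustrations): Java's % keeps the sign of the dividend, so a negative hashCode gives a negative remainder, which is unusable as an array index.

```java
public class IndexForDemo {
    // Mask off the sign bit before taking the remainder, keeping the
    // index in [0, length) even for negative hash codes.
    static int indexFor(int h, int length) {
        return (h & 0x7fffffff) % length;
    }

    public static void main(String[] args) {
        int h = -123456789; // e.g. a negative hashCode
        int length = 17;    // a non-power-of-two table size
        System.out.println(h % length);          // negative: unusable as an index
        System.out.println(indexFor(h, length)); // a valid bucket index
    }
}
```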
