How does HashSet create the hash key for an object or value? - java

I am a little confused about how HashSet works internally. As far as I know, HashSet uses a key (K) to find the right bucket and equals() to compare values, but how does HashSet generate the hash key?

Here it is:
final int hash(Object k) {
int h = hashSeed;
if (0 != h && k instanceof String) {
return sun.misc.Hashing.stringHash32((String) k);
}
h ^= k.hashCode();
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
It is actually in HashMap, which HashSet uses internally.

Internally, HashSet uses a HashMap: the hash key of the value is generated and used to store the element in the hash table.
To generate the hash code of the element, its hashCode() method is called.
Below is the HashMap method that puts an element, which HashSet uses internally to add an element:
public V put(K paramK, V paramV)
{
if (paramK == null)
return putForNullKey(paramV);
int i = hash(paramK.hashCode()); // <-- hashCode() of the key is called here
// More code
}
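The call chain described above can be observed with a small experiment. This is a minimal sketch, not JDK code: CountingKey and HashSetCallsHashCode are hypothetical names invented for illustration, and the key class simply counts how often its hashCode() is invoked while it is added to a HashSet.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical key class that counts calls to its own hashCode().
class CountingKey {
    static int hashCodeCalls = 0;
    private final String name;

    CountingKey(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        hashCodeCalls++;
        return name.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CountingKey && ((CountingKey) o).name.equals(name);
    }
}

class HashSetCallsHashCode {
    // Adds one element to a fresh HashSet and reports how often hashCode() ran.
    static int callsDuringAdd(String value) {
        CountingKey.hashCodeCalls = 0;
        Set<CountingKey> set = new HashSet<>();
        set.add(new CountingKey(value));
        return CountingKey.hashCodeCalls;
    }

    public static void main(String[] args) {
        System.out.println("hashCode() calls during add: " + callsDuringAdd("a"));
    }
}
```

The count is at least one because HashSet.add delegates to HashMap.put, which needs the key's hash code to pick a bucket.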

Related

"Normalize" hash of Key in HashMap

HashMap keeps its data in buckets as:
transient Node<K,V>[] table;
To put something into a HashMap, we need a hash() function that returns the hash of the key in the range from 0 to table.length - 1, right?
Suppose I have:
String s = "15315";
// Just pasted the internal operation. Is it supposed to calculate a hash in the table.length range?
int h;
int hmhc = (h = s.hashCode()) ^ (h >>> 16);
System.out.println("String native hashCode: "+s.hashCode() + ", HashMap hash: "+hmhc);
This returns the following:
String native hashCode: 46882035, HashMap hash: 46882360
We should have approximately 256 buckets (so the hash of the key should be in the range from 0 to 255), but the internal hash in HashMap gives us 46882360. How is this hash "normalized" to our range? I just can't see it in the source code.
I looked at this JDK source (put() starts at line 610): http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java
Generally the hash code returned is taken modulo the number of buckets. In the JDK 8 HashMap this is written as (n - 1) & hash, where n is table.length; because n is always a power of two, this gives the same result as hash % n for non-negative hashes.
In your case, with 256 buckets, it will go into bucket 46882360 % 256 = 56.
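The arithmetic can be checked with a short sketch. BucketIndexDemo and its method names are made up for illustration; the spreading step is the one pasted in the question, and the table length of 256 is the question's own assumption rather than something read from a live HashMap.

```java
class BucketIndexDemo {
    // Same spreading step as pasted in the question: XOR the high 16 bits into the low 16.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // Reduces a hash to a bucket index; assumes tableLength is a power of two.
    static int bucketIndex(int hash, int tableLength) {
        return (tableLength - 1) & hash;
    }

    public static void main(String[] args) {
        int hash = spread("15315".hashCode());
        System.out.println("spread hash: " + hash);                    // 46882360
        System.out.println("bucket index: " + bucketIndex(hash, 256)); // 56
    }
}
```

Masking with (tableLength - 1) keeps only the low bits, which is why the result matches 46882360 % 256.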

Hash Map entries collision

I am trying this code snippet
Map headers=new HashMap();
headers.put("X-Capillary-Relay","abcd");
headers.put("Message-ID","abcd");
Now when I call get for either key, it works fine.
However, I am seeing a strange phenomenon in the Eclipse debugger.
When I debug and inspect the table entry in the Variables view, at first I see this:
->table
--->[4]
------>key:X-Capillary-Relay
...........
However, after stepping over the second line I get:
->table
--->[4]
------>key:Message-ID
...........
Instead of creating a new entry, it seems to overwrite the existing key. For any other key this overwrite does not occur. The size of the map is shown as 2, and get works for both keys. So what is the reason behind this discrepancy in the Eclipse debugger? Is it an Eclipse problem or a hashing problem? The hash codes of the two keys are different.
The hashCode of a key is not used as is.
Two transformations are applied to it (at least in the Java 6 code):
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
and
/**
* Returns index for hash code h.
*/
static int indexFor(int h, int length) {
return h & (length-1);
}
Since length is the initial capacity of the HashMap (16 by default), you get 4 for both keys :
System.out.println (hash("X-Capillary-Relay".hashCode ())&(16-1));
System.out.println (hash("Message-ID".hashCode ())&(16-1));
Therefore both entries are stored in a linked list in the same bucket of the map (index 4 of the table array, as you can see in the debugger). The fact that the debugger shows only one of them doesn't mean that the other was overwritten. It means that you see the key of the first Entry of the linked list, and each new Entry is added to the head of the list.
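To try this outside the debugger, the two quoted Java 6 methods can be copied into a small class. This is only a sketch: SameBucketDemo, bucketOf and buildHeaders are hypothetical names, and the bucket numbers it prints describe the Java 6 scheme quoted above, not necessarily what a newer JDK does internally.

```java
import java.util.HashMap;
import java.util.Map;

class SameBucketDemo {
    // Copy of the Java 6 supplemental hash quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Copy of the Java 6 index calculation quoted above.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Bucket a key would land in for a table of the given capacity.
    static int bucketOf(String key, int capacity) {
        return indexFor(hash(key.hashCode()), capacity);
    }

    static Map<String, String> buildHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Capillary-Relay", "abcd");
        headers.put("Message-ID", "abcd");
        return headers;
    }

    public static void main(String[] args) {
        System.out.println("X-Capillary-Relay -> bucket " + bucketOf("X-Capillary-Relay", 16));
        System.out.println("Message-ID        -> bucket " + bucketOf("Message-ID", 16));
        // Both entries survive even when they share a bucket.
        System.out.println("size = " + buildHeaders().size());
    }
}
```

Whatever bucket the keys share, the map still holds two entries, which is consistent with the size of 2 reported in the question.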

How does HashSet check for duplicate elements?

Please look at my code:
HashSet<A> set = new HashSet<A>();
for (int i = 0; i < 10; i++)
set.add(new A());
System.out.println(set.contains(new A()));
Class A is defined as :
class A {
public boolean equals(Object o) {
return true;
}
public int hashCode() {
return (int) (Math.random()%100);
}
}
If HashSet uses a HashMap inside, why is the output true?
Different hash codes mean that the bucket locations are different,
so how does checking for new A() return true?
Also, if I always return 1 from hashCode, the output is true, which seems OK.
The reason is your hashCode function:
(int) (Math.random()%100);
always returns 0, so all A elements have the same hash code. Therefore all A elements end up in the same bucket in the HashSet, and since your equals always returns true, as soon as it finds an A in that bucket (in this case always) it reports that the A is already contained.
Math.random() returns a double in the range [0.0, 1.0), so taking it modulo 100 leaves it unchanged and the cast to int truncates it to 0.
You probably meant to use * instead of % to get random numbers between 0 and 99:
(int) (Math.random() * 100);
does what you want.
HashSet uses equals() on all objects with the same hash bucket to determine contains(). Because equals() is always true, it doesn't matter which bucket the new A matches, but all objects will be in the same bucket because (int)(Math.random() % 100) is always 0.
Try changing your hash to:
(int)(Math.random() * 100)
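Both points can be confirmed with a few lines. This is a sketch under the assumptions already stated in the answers; RandomModuloDemo and its helper methods are hypothetical names, and A mirrors the class from the question.

```java
import java.util.HashSet;
import java.util.Set;

class RandomModuloDemo {
    // The expression from the question: the remainder of a value in [0.0, 1.0) is the
    // value itself, and the cast to int truncates it to 0.
    static int brokenHash() {
        return (int) (Math.random() % 100);
    }

    // The suggested fix: scales to [0.0, 100.0) before truncating, giving 0..99.
    static int intendedHash() {
        return (int) (Math.random() * 100);
    }

    // Mirrors class A from the question.
    static class A {
        @Override
        public boolean equals(Object o) {
            return true;
        }

        @Override
        public int hashCode() {
            return brokenHash();
        }
    }

    static boolean brokenHashIsAlwaysZero(int trials) {
        for (int i = 0; i < trials; i++) {
            if (brokenHash() != 0) {
                return false;
            }
        }
        return true;
    }

    static int sizeAfterTenAdds() {
        Set<A> set = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            set.add(new A());
        }
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println("always zero: " + brokenHashIsAlwaysZero(1000));
        System.out.println("set size after 10 adds: " + sizeAfterTenAdds());
    }
}
```

Because every A hashes to 0 and equals every other A, the set treats all ten additions as the same element.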

Good hashcode function for 2D coordinates

I would like to use a HashMap
to map (x, y) coordinates to values.
What is a good hashCode() function definition?
In this case, I am only storing integer coordinates of the form (x, y)
where y - x = 0, 1, ..., M - 1 for some parameter M.
To get a unique value from two numbers, you can use a bijective pairing function such as:
<x, y> = x + (y + (x + 1) / 2)^2
This gives a unique value for non-negative x and y (using integer division, and as long as the result does not overflow an int), which can be used as the hash code:
public int hashCode()
{
int tmp = ( y + ((x+1)/2));
return x + ( tmp * tmp);
}
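A brute-force check over a small grid illustrates the uniqueness claim. This is a sketch only: PairingDemo, pair and uniqueOnGrid are hypothetical names, and the check assumes non-negative coordinates small enough that the square does not overflow an int.

```java
import java.util.HashSet;
import java.util.Set;

class PairingDemo {
    // Same arithmetic as the hashCode() above, using Java's integer division.
    static int pair(int x, int y) {
        int tmp = y + ((x + 1) / 2);
        return x + tmp * tmp;
    }

    // Returns true if no two points in [0, size) x [0, size) share a value.
    static boolean uniqueOnGrid(int size) {
        Set<Integer> seen = new HashSet<>();
        for (int x = 0; x < size; x++) {
            for (int y = 0; y < size; y++) {
                if (!seen.add(pair(x, y))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("unique on 100 x 100 grid: " + uniqueOnGrid(100));
    }
}
```

Negative coordinates are not covered: for example, pair(-1, 0) and pair(-2, 1) both evaluate to -1.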
I generally use Objects.hash(Object... value) for generating hash code for a sequence of items.
The hash code is generated as if all the input values were placed into an array, and that array were hashed by calling Arrays.hashCode(Object[]).
@Override
public int hashCode() {
return Objects.hash(x, y);
}
Use Objects.hash(x, y, z) for 3D coordinates.
If you wish to handle it manually, you could compute the hash code using:
// For 2D coordinates
hashCode = LARGE_PRIME * X + Y;
// For 3D coordinates (LARGE_PRIME squared; note that ^ is XOR in Java, not exponentiation)
hashCode = LARGE_PRIME * LARGE_PRIME * X + LARGE_PRIME * Y + Z;
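Put into a complete key class, the manual 2D variant might look like the sketch below. GridPoint and GridPointDemo are hypothetical names, and 7919 is an arbitrarily chosen prime standing in for LARGE_PRIME; the answer does not say which prime to use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable 2D key; equals() and hashCode() are kept consistent.
final class GridPoint {
    private static final int PRIME = 7919; // arbitrary prime, an assumption for illustration
    final int x;
    final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof GridPoint)) {
            return false;
        }
        GridPoint other = (GridPoint) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return PRIME * x + y; // int overflow simply wraps around in Java
    }
}

class GridPointDemo {
    // Stores a value under one GridPoint and looks it up with an equal but distinct instance.
    static String lookup() {
        Map<GridPoint, String> values = new HashMap<>();
        values.put(new GridPoint(3, 4), "treasure");
        return values.get(new GridPoint(3, 4));
    }

    public static void main(String[] args) {
        System.out.println(lookup());
    }
}
```

The lookup only succeeds because equals() is overridden together with hashCode(); a good hash code alone is not enough for a HashMap key.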
To calculate a hash code for objects with several properties, a generic solution is often implemented. This implementation uses a constant factor to combine the properties; the value of the factor is a subject of discussion. It seems that a factor of 33 or 397 often results in a good distribution of hash codes, so they are suited for dictionaries.
This is a small example in C#, though it should be easily adaptable to Java:
public override int GetHashCode()
{
unchecked // integer overflows are accepted here
{
int hashCode = 0;
hashCode = (hashCode * 397) ^ this.Hue.GetHashCode();
hashCode = (hashCode * 397) ^ this.Saturation.GetHashCode();
hashCode = (hashCode * 397) ^ this.Luminance.GetHashCode();
return hashCode;
}
}
This scheme should also work for your coordinates; simply replace the properties with the X and Y values. Note that integer overflow exceptions must be prevented; in .NET this is achieved with the unchecked block. In Java, int arithmetic wraps around silently, so no equivalent is needed.
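A possible Java adaptation of the C# scheme for two integer coordinates is sketched below. Cell is a hypothetical class name, the factor 397 is taken from the answer, and Integer.hashCode(int) stands in for GetHashCode().

```java
// Hypothetical coordinate class using the multiply-and-XOR scheme from the C# example.
final class Cell {
    final int x;
    final int y;

    Cell(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Cell)) {
            return false;
        }
        Cell other = (Cell) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // No "unchecked" block is needed: Java int multiplication wraps on overflow.
        int hashCode = 0;
        hashCode = (hashCode * 397) ^ Integer.hashCode(x);
        hashCode = (hashCode * 397) ^ Integer.hashCode(y);
        return hashCode;
    }

    public static void main(String[] args) {
        System.out.println(new Cell(3, 4).hashCode()); // (3 * 397) ^ 4 = 1187
    }
}
```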
Have you considered simply shifting either x or y by half the available bits?
For "classic" 8-bit values that's only 16 cells per axis, but with today's "standard" 32-bit int it grows to over 65k cells per axis.
@Override
public int hashCode() {
// y occupies the upper 16 bits, x the lower 16 bits
return x | (y << 16);
}
For obvious reasons this only works as long as both x and y are between 0 and 0xFFFF (0-65535, inclusive), but that's plenty of space: more than 4.2 billion cells.
Edit:
Another option, which requires you to know the actual size, is x + y * width (where width is the extent in the x direction).
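Both variants can be sketched as static helpers. PackedCoordinates and its method names are hypothetical; the packing shifts y by 16 bits (half of a 32-bit int) and assumes both coordinates lie in 0..0xFFFF, while rowMajor assumes 0 <= x < width.

```java
class PackedCoordinates {
    // Packs y into the upper 16 bits and x into the lower 16 bits.
    // Assumes 0 <= x <= 0xFFFF and 0 <= y <= 0xFFFF.
    static int pack(int x, int y) {
        return x | (y << 16);
    }

    static int unpackX(int packed) {
        return packed & 0xFFFF;
    }

    static int unpackY(int packed) {
        return packed >>> 16; // unsigned shift, so large y values come back intact
    }

    // Alternative when the grid width is known; assumes 0 <= x < width.
    static int rowMajor(int x, int y, int width) {
        return x + y * width;
    }

    public static void main(String[] args) {
        int packed = pack(1, 1);
        System.out.println(packed);                                  // 65537
        System.out.println(unpackX(packed) + "," + unpackY(packed)); // 1,1
        System.out.println(rowMajor(2, 3, 10));                      // 32
    }
}
```

Because the packing is reversible within the stated range, two different in-range points can never share a hash code.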
That depends on what you intend to use the hash code for:
If you plan to use it as a sort of index, e.g. so that knowing x and y leads directly to where the (x, y) data is stored, it's better to use a two-dimensional array for such a thing.
Coordinates[][] coordinatesBucket = new Coordinates[maxY][maxX];
But if you absolutely must have a unique hash for every (x, y) combination, then try concatenating the decimal digits of the coordinates (rather than adding or multiplying them). For example, x=20 y=40 would give you the simple code xy=2040. This is only unique if y always uses a fixed number of digits; otherwise x=204 y=0 would also give 2040.

Java : hash function

I am wondering: if we implement our own hash map that doesn't use power-of-two table lengths (for the initial capacity and whenever we resize), can we just use the object's hashCode and mod it by the table size directly, instead of using a hash function to rehash the object's hashCode?
For example:
public V put(K key, V value) {
if (key == null)
return putForNullKey(value);
// int hash = hash(key.hashCode()); original way
//can we just use the key's hashcode if our table length is not power-of-two ?
int hash = key.hashCode();
int i = indexFor(hash, table.length);
...
...
}
Presuming we're talking about OpenJDK 7, the additional hash is used to stimulate avalanching; it is a mixing function. It is used because the mapping from a hash to a bucket, since we're using a power of 2 for the capacity, is a mere bitwise & (a % b is equivalent to a & (b - 1) for non-negative a iff b is a power of 2). This means that only the lower bits matter, so applying this mixing step helps protect against poorer hashes.
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
If you want to use sizes that aren't powers of 2, the above may not be needed.
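The effect of the mixing step on a power-of-two table can be seen with two hash codes that differ only in their upper bits. This is a minimal sketch: MixingDemo is a hypothetical name, the two methods are copies of the OpenJDK 7 code quoted in this answer, and the input values 0x10000 and 0x20000 are chosen purely for illustration.

```java
class MixingDemo {
    // Copy of the OpenJDK 7 supplemental hash quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Copy of the power-of-two index calculation.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int a = 0x10000;
        int b = 0x20000;
        // Without mixing, only the low 4 bits count, so both land in bucket 0.
        System.out.println(indexFor(a, 16) + " " + indexFor(b, 16));             // 0 0
        // With mixing, the high bits influence the index.
        System.out.println(indexFor(hash(a), 16) + " " + indexFor(hash(b), 16)); // 1 2
    }
}
```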
Actually changing the mapping from hashes to buckets (which normally relies on the capacity being a power of 2) will require you to look at indexFor:
static int indexFor(int h, int length) {
return h & (length-1);
}
You could use (h & 0x7fffffff) % length here.
You can think of the mod function as a simple form of hash function: it maps a large range of data to a smaller space. Assuming the original hash code is well designed, I see no reason why mod cannot be used to map the hash code onto the size of the table you are using.
If your original hash function is not well implemented, e.g. it always returns an even number, you will create quite a lot of collisions using just mod as your hash function.
This is true; you can pick prime (or pseudo-prime) table sizes instead.
Note: indexFor needs to use % and compensate for the sign, instead of a simple &, which can actually make the lookup slower.
indexFor = (h & Integer.MAX_VALUE) % length
// or
indexFor = Math.abs(h % length)
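The two variants above can be compared on a negative hash code. This is a sketch with hypothetical names (ModuloIndexDemo and its methods); the Math.floorMod variant is an addition not mentioned in the answers and is shown only as a third option available since Java 8.

```java
class ModuloIndexDemo {
    // Clears the sign bit before taking the remainder.
    static int indexByMask(int h, int length) {
        return (h & Integer.MAX_VALUE) % length;
    }

    // Takes the remainder first, then drops the sign. The order matters:
    // Math.abs(h) % length would fail for Integer.MIN_VALUE, whose abs is still negative.
    static int indexByAbs(int h, int length) {
        return Math.abs(h % length);
    }

    // Not from the answers above: floorMod always returns a value in [0, length).
    static int indexByFloorMod(int h, int length) {
        return Math.floorMod(h, length);
    }

    public static void main(String[] args) {
        int h = -7;
        int length = 10; // deliberately not a power of two
        System.out.println(indexByMask(h, length));     // 1
        System.out.println(indexByAbs(h, length));      // 7
        System.out.println(indexByFloorMod(h, length)); // 3
    }
}
```

The three methods pick different buckets for negative hash codes, but each is consistent with itself, which is all a hash table needs as long as one of them is used everywhere.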
