I have a list of objects, each termed a "rule" in our case. A rule is itself a list of fields, and I have to compare hash codes to detect duplicates, since we can't have duplicate rules in the
system.
For example, say I have two rules R1 and R2, each with fields A and B.
If the values of A and B in R1 are 7 and 2 respectively,
and in R2 they are 3 and 4 respectively, then the process I have used to check for duplicate
rules in the system, namely hash code comparison, fails.
The method I have used is:
for (Rule rule : rules) {
    changeableAttrCode = 0;
    fieldCounter = 1;
    attributes = rule.getAttributes();
    for (RuleField ruleField : attributes) {
        changeableAttrCode = changeableAttrCode + (fieldCounter * ruleField.getValue().hashCode());
        fieldCounter++;
    }
    parameters = rule.getParameters();
    for (RuleField ruleField : parameters) {
        changeableAttrCode = changeableAttrCode + (fieldCounter * ruleField.getValue().hashCode());
        fieldCounter++;
    }
    changeableAttrCodes.add(changeableAttrCode);
}
Here changeableAttrCodes is the collection where we store the hash codes of all the rules.
Can you please suggest a better method, so that this kind of problem does not arise in the future and duplicate rules in the system can still be detected?
Thanks in advance
hashCode() is not meant to be used to check for equality: return 42; is a perfectly valid implementation of hashCode(). Why don't you override equals() (and hashCode() for that matter) in the rule objects and use that to check whether two rules are equal? You could still use the hash code to narrow down which objects you need to investigate, since two equal() objects must always have the same hash code, but that is a performance improvement you may or may not need, depending on your system.
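For illustration, here is a minimal sketch of that approach. The class and method names mirror the question's code, but the field types are assumptions; note that RuleField needs a value-based equals()/hashCode() too, or the list comparison won't work:

import java.util.List;
import java.util.Objects;

class RuleField {
    private final String value; // assumed type; the question never says
    RuleField(String value) { this.value = value; }
    String getValue() { return value; }

    // Value-based equality so that List.equals() compares element-wise.
    @Override public boolean equals(Object o) {
        return o instanceof RuleField && Objects.equals(value, ((RuleField) o).value);
    }
    @Override public int hashCode() { return Objects.hashCode(value); }
}

class Rule {
    private final List<RuleField> attributes;
    private final List<RuleField> parameters;

    Rule(List<RuleField> attributes, List<RuleField> parameters) {
        this.attributes = attributes;
        this.parameters = parameters;
    }

    // Two rules are equal exactly when all their field values match.
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Rule)) return false;
        Rule other = (Rule) o;
        return attributes.equals(other.attributes) && parameters.equals(other.parameters);
    }

    // Consistent with equals: equal rules always produce the same hash code.
    @Override public int hashCode() { return Objects.hash(attributes, parameters); }
}

With this in place you can detect duplicates with a HashSet<Rule> by checking set.contains(newRule) before inserting.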
Implement hashCode and equals in class Rule.
Implementation of equals has to compare its values.
Then use a HashSet<Rule> and ask if(mySet.contains(newRule))
HashSet + an equals implementation solves the problem of the non-uniqueness of the hash. It uses the hash for bucketing and speed, but it uses equals at the end to determine whether two Rules with the same hash really are the same Rule.
More on hashing: if you want to do it by hand, use the prime number suggestion and review the JDK code for String's hashCode. If you want a clean implementation, retrieve the hash codes of the elements, build an array of ints from them, and use Arrays.hashCode(int[]) to get a hash code for the combination.
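For illustration, a sketch of that Arrays.hashCode(int[]) idea as a helper method (RuleField is the question's class, assumed to expose getValue()):

import java.util.Arrays;
import java.util.List;

static int ruleHash(List<RuleField> attributes, List<RuleField> parameters) {
    // Collect the element hash codes into one int[].
    int[] parts = new int[attributes.size() + parameters.size()];
    int i = 0;
    for (RuleField f : attributes) parts[i++] = f.getValue().hashCode();
    for (RuleField f : parameters) parts[i++] = f.getValue().hashCode();
    // Arrays.hashCode combines the elements positionally (a 31-based
    // polynomial), so (7, 2) and (3, 4) no longer collide the way the
    // question's linear weighting does.
    return Arrays.hashCode(parts);
}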
Updated: Your hashing algorithm is not producing a good spread of hash values; it gives the same value for (7, 2) and (3, 4):
1 * 7 + 2 * 2 = 11
1 * 3 + 2 * 4 = 11
It would also give the same value for (11, 0), (-1, 6), ... and one can trivially make up an endless number of similar equivalence classes based on your current algorithm.
Of course you cannot avoid collisions entirely; with enough instances, hash collisions are inevitable. However, you should aim to minimize the chance of collisions. Good hashing algorithms strive to spread hash values evenly over a wide range of values. A typical way to achieve this is to generate the hash value for an object containing n independent fields as an n-digit number with a base big enough to hold the different hash values of the individual fields.
In your case, instead of multiplying by fieldCounter you should multiply by a prime constant, e.g. 31 (that would be the base of your number), and seed the result with another prime constant, e.g. 17. This gives you a better spread of hash values. (Of course the concrete base depends on what values your fields can take; I have no information about that.)
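Applied to the loop from the question, that change looks roughly like this (17 and 31 are just the conventional constants):

int changeableAttrCode = 17; // constant nonzero seed
for (RuleField ruleField : rule.getAttributes()) {
    // Multiply the running result by a prime base, then add the next field's hash.
    changeableAttrCode = 31 * changeableAttrCode + ruleField.getValue().hashCode();
}
for (RuleField ruleField : rule.getParameters()) {
    changeableAttrCode = 31 * changeableAttrCode + ruleField.getValue().hashCode();
}

With this scheme the original counterexample disappears: (7, 2) gives 31 * (31 * 17 + 7) + 2 = 16556, while (3, 4) gives 31 * (31 * 17 + 3) + 4 = 16434.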
Also if you implement hashCode, you are strongly advised to implement equals as well - and in fact, you should use the latter to test for equality.
Here is an article about implementing hashCode.
I don't understand what you are trying to do here. In most hash function scenarios, collision is inevitable, because there are far more objects to hash than there are possible hash values (the pigeonhole principle).
It is generally the case that two different objects may have the same hash value. You cannot rely on hash functions alone to eliminate duplicates.
Some hash functions are better than others in minimizing collisions, but it's still an inevitability.
That said, there are some simple guidelines that usually give a good enough hash function. Joshua Bloch gives the following in his book Effective Java, 2nd Edition:
Store some constant nonzero value, say 17, in an int variable called result.
Compute an int hashcode c for each field:
If the field is a boolean, compute (f ? 1 : 0)
If the field is a byte, char, short, or int, compute (int) f
If the field is a long, compute (int) (f ^ (f >>> 32))
If the field is a float, compute Float.floatToIntBits(f)
If the field is a double, compute Double.doubleToLongBits(f), then hash the resulting long as above.
If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If the value of the field is null, return 0.
If the field is an array, treat it as if each element is a separate field. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
Combine the hashcode c into result as follows: result = 31 * result + c;
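For illustration, here is a sketch applying the recipe to a made-up class with one field of each common kind (the class and its fields are hypothetical):

class Session {
    private boolean active;
    private int userId;
    private long token;
    private double weight;
    private String name; // object reference, may be null

    @Override
    public int hashCode() {
        int result = 17;                                       // constant nonzero seed
        result = 31 * result + (active ? 1 : 0);               // boolean
        result = 31 * result + userId;                         // int
        result = 31 * result + (int) (token ^ (token >>> 32)); // long
        long w = Double.doubleToLongBits(weight);              // double -> long bits
        result = 31 * result + (int) (w ^ (w >>> 32));
        result = 31 * result + (name == null ? 0 : name.hashCode()); // reference
        return result;
    }
    // A matching equals() comparing the same fields is omitted for brevity,
    // but must accompany any hashCode() override.
}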
I started to write that the only way you can achieve what you want is with Perfect Hashing.
But then I thought about the fact that you said you can't duplicate objects in your system.
Edit based on thought-provoking comment from helios:
Your solution depends on what you meant when you wrote that you "can't duplicate rules".
If you meant that literally you cannot, that there is guaranteed to be only one instance of a rule with a particular set of values, then your problem is trivial: you can do identity comparison using ==.
On the other hand, if you meant that you shouldn't for some reason (e.g. performance), then your problem is also trivial: just do value comparisons.
Given the way you've defined your problem, under no circumstances should you be considering the use of hashcodes as a substitute for equality. As others have noted, hashcodes by their nature yield collisions (false equality), unless you go to a Perfect Hashing solution, but why would you in this case?
Related
Recently I was looking for a good implementation of the hashCode() method in the Java API and looked through the Integer source code. I didn't expect it, but hashCode() just returns the backing int value.
public final class Integer ... {
    private final int value;
    ...
    public int hashCode() {
        return Integer.hashCode(value);
    }

    public static int hashCode(int value) {
        return value;
    }
    ...
}
It's really strange, as there are a lot of papers and pages, as well as packages, dedicated to this question of how to design a good hash function that distributes values well.
Finally I ended up with this conclusion:
Integer is the worst data type candidate for a key when used with HashMap, as all consecutive keys will be placed in one bin/bucket, as in the sample below.
Map<Integer, String> map = new HashMap<>();
for (int i = 1; i < 10; i++) {
map.put(Integer.valueOf(i), "string" + i);
}
There are two questions for which I didn't find answers while googling:
Am I right in my conclusion regarding the Integer data type?
If so, why isn't Integer's hashCode() method implemented in some tricky way using power operations, prime numbers, or binary shifting?
Integer is the worst data type candidate for a key when used with HashMap, as all consecutive keys will be placed in one bin
No, that statement is wrong.
In fact, the implementation of Integer's hashCode() is the best possible implementation. It maps each Integer value to a unique hashCode value, which reduces the chance of different keys being mapped into the same bucket.
Sometimes a simple implementation is the best.
From the Javadoc of hashCode() in the Object class:
It is not required that if two objects are unequal according to the java.lang.Object.equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
Integer is one of the few classes that actually guarantees that unequal objects will have different hashCode().
Adding to @Eran's answer, Java's HashMap also has built-in protection against "bad hash functions" (which Integer.hashCode() isn't, but still).
/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
So your "simple hash" of an integer will actually be a bit different when working with HashMap.
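A small demo of the effect: HashMap picks the bucket as (n - 1) & hash, where n is the table size (16 by default), so the asker's consecutive Integer keys actually spread across distinct buckets:

public class SpreadDemo {
    // Same transform as HashMap.hash() quoted above.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16; // HashMap's default table size
        for (int i = 1; i < 10; i++) {
            // For small ints the spread leaves the value unchanged,
            // and each key lands in its own bucket (bucket == i here).
            System.out.println("key=" + i + " -> bucket " + ((n - 1) & spread(i)));
        }
    }
}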
From the docs:
The general contract of hashCode is:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
--> Integer#hashCode fulfills this.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
--> Integer#hashCode fulfills this too.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
--> Integer#hashCode fulfills this to the maximum extent, i.e. two unequal Integers will never have the same hash code.
This is an extract from Core Java by C. Horstmann.
+++++
The hashCode method should return an integer (which can be negative). Just combine the hash codes of the instance fields so that the hash codes for different objects are likely to be widely scattered.
For example, here is a hashCode method for the Employee class:
class Employee
{
    public int hashCode()
    {
        return 7 * name.hashCode() + 11 * new Double(salary).hashCode() + 13 * hireDay.hashCode();
    }
    . . .
}
+++
I can't understand these 7, 11, and 13. Are they just pulled out of a hat? Without them the result (checking for equality of two objects) seems to be the same.
In general, testing for equality does not use the hash code.
The 7, 11, and 13 are all prime numbers. This lowers the possibility of two different employees having the same hash code (because of Bézout's theorem).
In fact, I would suggest (to widen the obtained hash) using much bigger but non-consecutive primes, e.g. 1039, 2011, 32029. On Linux, the /usr/games/primes utility from package bsdgames is very useful to get them.
What is important is that if two things compare equal they have the same hash code.
For performance reasons, you want the hash codes to be widely distributed (so that if two things are not equal, their hash codes are usually different), to lower the probability of hash collisions.
Read the Wikipedia page on hash tables.
The numbers are prime numbers.
You don't want to just add the hash codes, because that would give you more collisions.
e.g.
situation A: foo="bla", bar="111"
situation B: foo="111", bar="bla"
This means that foo.hash() + bar.hash() will return the same value in both situations. You use prime numbers because the function f: N/2^32 -> N/2^32: x -> x * p (mod 2^32) is bijective if p is a prime > 2 (i.e., you would lose bits if you multiplied by 256 instead).
Collisions only need to be avoided if you use something like hash sets.
Multiplying with primes is a common optimization which is often done for you by your IDE. I wouldn't do it if there is no need for optimization.
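A quick demo of the collision described above, and of how multiplying by a prime makes the combination order-sensitive:

public class CombineDemo {
    public static void main(String[] args) {
        String fooA = "bla", barA = "111"; // situation A
        String fooB = "111", barB = "bla"; // situation B

        // Plain addition is commutative, so both situations collide:
        System.out.println(fooA.hashCode() + barA.hashCode()); // same value...
        System.out.println(fooB.hashCode() + barB.hashCode()); // ...as this one

        // Weighting the first hash by a prime distinguishes the two orders:
        System.out.println(31 * fooA.hashCode() + barA.hashCode()); // differs...
        System.out.println(31 * fooB.hashCode() + barB.hashCode()); // ...from this
    }
}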
I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.
The problem I ran into, however, is that the system I am working in borks if the id is negative and String.hashCode often returns negative values. One quick solution is to just use Math.abs() around the hashcode call to guarantee a positive result. What I was wondering about this approach is what are the chances of two distinct elements having the same hashcode?
For example, if one string returns a hash code of -10 and another string returns a hash code of 10, an error would occur after taking absolute values. In my system we're talking about collections of objects that typically aren't more than 30 elements large, so I don't think this would really be an issue, but I am curious as to what the math says.
Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.
Also, your idea to use Math.abs() alone is flawed: it does not always return a positive number! In 2's complement arithmetic, the absolute value of Integer.MIN_VALUE is itself! Famously, the hash code of "polygenelubricants" is this value.
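A small demo; the string really does hash to Integer.MIN_VALUE, and masking off the sign bit is the usual safe alternative to Math.abs():

public class AbsPitfall {
    public static void main(String[] args) {
        int h = "polygenelubricants".hashCode();
        System.out.println(h);              // -2147483648, i.e. Integer.MIN_VALUE
        System.out.println(Math.abs(h));    // still -2147483648!
        System.out.println(h & 0x7fffffff); // 0 for this input; never negative
    }
}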
Hashes are not unique, hence they are not appropriate for a unique ID.
As to the probability of a hash collision, you can read about the birthday paradox. Actually (from what I recall), when drawing from a uniform distribution of N values, you should expect a collision after about $\sqrt{N}$ draws (you can get a collision much earlier). The problem is that Java's implementation of hashCode (especially when hashing short strings) doesn't provide a uniform distribution, so you'll get collisions much earlier.
You can already get two strings with the same hash code. This should be obvious if you consider that you have an infinite number of strings and only 2^32 possible hash codes.
Taking the absolute value just makes a collision a little more probable. The risk is small, but if you need a unique ID, this isn't the right approach.
What you can do, since you only have 30-50 values as you said, is register each String you get in a HashMap together with a running counter as its value:
Map<String, Integer> stringMap = new HashMap<>();
stringMap.put("Test", 1);
stringMap.put("AnotherTest", 2);
You can then get your unique ID by calling:
stringMap.get("Test"); // returns 1
If two objects are equal, then their hash codes must be the same. Then why does the check in HashMap do
if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k)))) {
Instead of simply
if ((k = e.key) == key || (key != null && key.equals(k))) {
Because the hash check is cheap, and the equals() method call might be expensive. If the hash check fails, we don't need to bother with the equals() check to return false, so we save time.
If two objects are equal then hashcode must be same.
In this case, take it the other way: "If hashcodes of two objects are different, they can't be equal"
So, here we are simply short-circuiting the comparison using equals() by first comparing the hashes.
Since hash is of type int, comparing two ints is not an expensive operation (it's a single bytecode instruction, if_icmp<cond>).
On the other hand, the equals() method of an object might involve complex operations, making it expensive compared to an int comparison. So we do the hash comparison first for an early exit.
On top of what #MarounMaroun said, another advantage of hash is that it returns an int. This lets you use it as an index into an array (which is how the implementation of a hash table works). equals returns a boolean, so it can't be used this way.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
However, when it comes to hashCode, we should be aware of the following:
If two objects are equal, they will have the same hash code.
If two objects have the same hash code, that doesn't mean they must be equal.
Also note that if two objects are not equal, they can still have the same hash code.
You can understand this with an example. Consider two numbers, 10 and 40, and a hash code calculated as % (mod) 6: 10 % 6 = 4 and 40 % 6 = 4.
Although 10 and 40 are different, they return the same hash code.
However, if I change my hash code logic to use % (mod) 37, then 10 % 37 = 10 and 40 % 37 = 3.
Now, with this hash logic, 10 and 40 are different!
Hence the hash code basically depends on the logic you use to calculate the hash.
In situations where the hash values of two objects to be compared will always be known, and objects which are not reference-identical will often have different hash values, comparing them will often be faster than comparing the objects themselves, sometimes considerably so, and will never be 'much' slower. For example, if one wishes to compare two 10,000-character strings which are identical except for the last few characters, and if one knows the hash values of the two strings, checking to see whether the hash values match will be cheaper than examining enough characters of each string to find the first difference.
If the hash values are not known, computing them will generally be slower than direct comparison of the objects. If they will sometimes be known, checking whether they are known and using them if so will sometimes be helpful and sometimes not, depending upon how frequently the objects are involved in comparisons with other things whose hash values happen to be different, and upon how long the comparisons in such cases would take. Note also that if collections of arbitrary-type objects need to support direct comparison with other collections to see if they hold identical content, it may be advantageous for them to compute and cache the hash values of all the items they contain, as well as a composite hash which combines the hashes of all the values. If one has two large collections of thousand-character strings which are identical except for the last few characters of the last string, being able to know that they're unequal without having to check almost every character of every string can be a big win.
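A sketch of that caching idea, assuming the elements are immutable (the class name is made up; List.copyOf requires Java 9+):

import java.util.List;

final class HashedList<T> {
    private final List<T> items;
    private final int hash; // composite hash, computed once up front

    HashedList(List<T> items) {
        this.items = List.copyOf(items);   // defensive, immutable copy
        this.hash = this.items.hashCode(); // combines the element hashes
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof HashedList)) return false;
        HashedList<?> other = (HashedList<?>) o;
        // Cheap O(1) rejection first; the element-by-element comparison
        // only runs when the cached hashes happen to match.
        return hash == other.hash && items.equals(other.items);
    }
}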
I have come across situations in interviews where I needed to use a hash function for integers or for strings. In such situations, which ones should we choose? I've been wrong in these situations because I ended up choosing ones which generate a lot of collisions, but hash functions tend to be mathematical enough that you cannot recall them in an interview. Are there any general recommendations, so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both kinds of input in an "interview situation"?
Here is a simple recipe from Effective Java, page 33:
Store some constant nonzero value, say, 17, in an int variable called result.
For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
Compute an int hash code c for the field:
If the field is a boolean, compute (f ? 1 : 0).
If the field is a byte, char, short, or int, compute (int) f.
If the field is a long, compute (int) (f ^ (f >>> 32)).
If the field is a float, compute Float.floatToIntBits(f).
If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.1.iii.
If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
Combine the hash code c computed in step 2.1 into result as follows: result = 31 * result + c;
Return result.
When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.
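Such a test might look like the following sketch, written against JUnit 5 and the book's PhoneNumber example class (the constructor signature is assumed):

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class PhoneNumberHashCodeTest {
    @Test
    void equalInstancesHaveEqualHashCodes() {
        PhoneNumber a = new PhoneNumber(707, 867, 5309);
        PhoneNumber b = new PhoneNumber(707, 867, 5309);
        assertEquals(a, b);                       // equal by value...
        assertEquals(a.hashCode(), b.hashCode()); // ...so hashes must match
    }
}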
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hash maps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to the same integer; this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note that the above is not cryptographically secure, but it will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure down to an integer :-)
For integers, I usually go with k % p, where p is the size of the hash table and a prime number; for strings I use the hashCode from the String class. Is this sufficient for an interview with a major tech company? – phoenix
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port, etc.
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys differing by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive, because a lot of the hashing functions I first saw used bit-shifts and XOR, where a flipped input bit usually flipped one output bit (usually in another bit position); the idea that one input bit should affect many output bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention, because it achieves this ideal and is easy to remember (though the memory usage may make it slower than mathematical approaches; it could be faster too, depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great int-sized hash value, indeed "perfect" if the inputs are randomly scattered through the int values (rather than, say, incrementing; see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shifting the values, etc.
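A sketch of that table-lookup idea for a 24-bit RGB value (the scheme is often called tabulation hashing; the details here are illustrative):

import java.util.Random;

public class TabulationHash {
    private static final int[][] TABLE = new int[3][256];
    static {
        Random rnd = new Random(42); // fixed seed for reproducibility
        for (int[] row : TABLE) {
            for (int i = 0; i < 256; i++) {
                row[i] = rnd.nextInt();
            }
        }
    }

    // XOR of three random table entries: flipping any single input bit
    // swaps in a different random word, re-randomizing the whole output.
    static int hash(int r, int g, int b) {
        return TABLE[0][r] ^ TABLE[1][g] ^ TABLE[2][b];
    }
}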
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....