The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public class MyClass {
    long a, b, c; // these are the only fields
    // some code and methods
    public int hashCode() {
        return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
                + (int) (c ^ (c >>> 32));
    }
}
The value is not important; it can be whatever you want. Prime numbers result in a better distribution of the hashCode values, so they are preferred.
You do not necessarily have to add them; you are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
Some algorithms make poor hashCode implementations, and simply adding the attribute values is one of them. If a class has two fields, Integer a and Integer b, and its hashCode() just sums them, the distribution of the hashCode values depends heavily on the values the instances store. For example, if most values of a and of b lie between 0 and 10, the hashCode values all lie between 0 and 20. If you store instances of this class in, e.g., a HashMap, many instances end up in the same bucket, because many instances with different a and b values share the same sum. This hurts the performance of map operations: on a lookup, the elements in the bucket have to be compared one by one using equals().
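The effect described above can be checked with a short experiment. The sketch below is illustrative only (the class and method names are made up here, not taken from the answer): it counts how many distinct hash codes each scheme produces when a and b both range over 0 to 10.

```java
import java.util.HashSet;
import java.util.Set;

final class SumVersusMultiplierHash {
    // Plain sum: many (a, b) pairs share a result, e.g. (1, 9) and (9, 1).
    static int sumHash(int a, int b) {
        return a + b;
    }

    // Multiply-and-add with 31, in the style IDEs generate.
    static int multiplierHash(int a, int b) {
        int result = 1;
        result = 31 * result + a;
        result = 31 * result + b;
        return result;
    }

    // Counts distinct hash codes over all pairs with 0 <= a, b <= 10.
    static int distinctHashes(boolean useMultiplier) {
        Set<Integer> seen = new HashSet<>();
        for (int a = 0; a <= 10; a++) {
            for (int b = 0; b <= 10; b++) {
                seen.add(useMultiplier ? multiplierHash(a, b) : sumHash(a, b));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("sum:        " + distinctHashes(false)); // 21 distinct values for 121 pairs
        System.out.println("multiplier: " + distinctHashes(true));  // 121 distinct values
    }
}
```

With the sum, 121 different objects share only 21 hash codes; with the multiplier every pair in this range gets its own.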
Regarding the algorithm, it looks fine. It is very similar to the one that Eclipse generates, except that Eclipse uses a different prime number, 31 rather than 37:
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (int) (a ^ (a >>> 32));
result = prime * result + (int) (b ^ (b >>> 32));
result = prime * result + (int) (c ^ (c >>> 32));
return result;
}
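If you are on Java 7 or later, the same computation is available through java.util.Objects.hash, at the cost of boxing the three longs. The sketch below uses a hypothetical class ThreeLongs (not from the question) to show that the two forms give the same result.

```java
import java.util.Objects;

final class ThreeLongs {
    final long a, b, c;

    ThreeLongs(long a, long b, long c) {
        this.a = a;
        this.b = b;
        this.c = c;
    }

    // Hand-written form, identical to the Eclipse-generated method.
    int manualHash() {
        final int prime = 31;
        int result = 1;
        result = prime * result + (int) (a ^ (a >>> 32));
        result = prime * result + (int) (b ^ (b >>> 32));
        result = prime * result + (int) (c ^ (c >>> 32));
        return result;
    }

    // Library form: boxes each long and combines the Long hash codes with 31.
    int libraryHash() {
        return Objects.hash(a, b, c);
    }
}
```

The hand-written form avoids the varargs array and the boxing, which may matter if hashCode() is called very often.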
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode(); // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and accumulating the sum is a common method of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).
I have been thinking about this but have run out of ideas. I have 10 arrays, each holding 18 double values. These 18 values are features of an image. Now I have to apply k-means clustering to them.
To implement k-means clustering I need a unique computational value for each array. Is there any mathematical, statistical or other logic that would help me create a value for each array that is unique to it, based on the values inside it? Thanks in advance.
Here is an example array. I have 10 more:
[0.07518284315321135
0.002987851573676068
0.002963866526639678
0.002526139418225552
0.07444872939213325
0.0037219653347541617
0.0036979802877177715
0.0017920256571474585
0.07499695903867931
0.003477831820276616
0.003477831820276616
0.002036159171625004
0.07383539747505984
0.004311312204791184
0.0043352972518275745
0.0011786937400740452
0.07353130134299131
0.004339580295941216]
Did you check Arrays.hashCode in Java 7?
/**
* Returns a hash code based on the contents of the specified array.
* For any two <tt>double</tt> arrays <tt>a</tt> and <tt>b</tt>
* such that <tt>Arrays.equals(a, b)</tt>, it is also the case that
* <tt>Arrays.hashCode(a) == Arrays.hashCode(b)</tt>.
*
* <p>The value returned by this method is the same value that would be
* obtained by invoking the {@link List#hashCode() <tt>hashCode</tt>}
* method on a {@link List} containing a sequence of {@link Double}
* instances representing the elements of <tt>a</tt> in the same order.
* If <tt>a</tt> is <tt>null</tt>, this method returns 0.
*
* @param a the array whose hash value to compute
* @return a content-based hash code for <tt>a</tt>
* @since 1.5
*/
public static int hashCode(double a[]) {
if (a == null)
return 0;
int result = 1;
for (double element : a) {
long bits = Double.doubleToLongBits(element);
result = 31 * result + (int)(bits ^ (bits >>> 32));
}
return result;
}
I don't understand why @Marco13 mentioned that "this is not returning unique for arrays".
UPDATE
See @Marco13's comment for the reason why it cannot be unique.
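To make the non-uniqueness concrete, here is a small sketch (the class name is made up for illustration) with two different double arrays that were chosen so that Arrays.hashCode returns the same value for both. It relies only on the algorithm shown in the source listing above.

```java
import java.util.Arrays;

final class ArrayHashCollision {
    // The per-element hash is (int) (bits ^ (bits >>> 32)), so the doubles with
    // raw bits 1 and 31 hash to 1 and 31, and 0.0 hashes to 0.
    static final double[] LEFT = { Double.longBitsToDouble(1L), 0.0 };
    static final double[] RIGHT = { 0.0, Double.longBitsToDouble(31L) };

    public static void main(String[] args) {
        // LEFT:  31 * (31 * 1 + 1) + 0  = 992
        // RIGHT: 31 * (31 * 1 + 0) + 31 = 992
        System.out.println(Arrays.equals(LEFT, RIGHT));   // false
        System.out.println(Arrays.hashCode(LEFT));        // 992
        System.out.println(Arrays.hashCode(RIGHT));       // 992
    }
}
```

An int has only 2^32 possible values, so collisions like this one must exist for any int-valued hash of arrays.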
UPDATE
If we plot your input points (18 elements), the pattern is one spike followed by three low values, repeated.
If that holds, you can take the mean of the peaks (elements 1, 5, 9, 13 and 17) and the mean of the remaining low values.
That gives you a peak mean and a low mean. You can then combine the two into a single number, while preserving both values, using the bijective pairing function described here.
The algorithm also provides formulas for the reverse direction, i.e. recovering the peak and low means from the single value.
The pairing is <x; y> = x + (y + ((x + 1) / 2))^2
Also see Exercise 1 on page 2 of the PDF for how to recover x and y.
Code for computing the means and the pairing value:
public static double mean(double[] array) {
    double peakSum = 0;
    double lowSum = 0;
    int peakCount = 0;
    int lowCount = 0;
    for (int i = 0; i < array.length; i++) {
        // In the sample data every fourth element, starting at index 0, is a peak.
        if (i % 4 == 0) {
            peakSum += array[i];
            peakCount++;
        } else {
            lowSum += array[i];
            lowCount++;
        }
    }
    double peakMean = peakSum / peakCount; // 5 peaks for 18 elements
    double lowMean = lowSum / lowCount;    // 13 low values for 18 elements
    return bijective(lowMean, peakMean);
}
public static double bijective(double x, double y) {
    double tmp = y + ((x + 1) / 2);
    return x + (tmp * tmp);
}
To test:
public static void main(String[] args) {
    double[] arrays = {0.07518284315321135, 0.002987851573676068, 0.002963866526639678, 0.002526139418225552, 0.07444872939213325, 0.0037219653347541617, 0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931, 0.003477831820276616, 0.003477831820276616, 0.002036159171625004, 0.07383539747505984, 0.004311312204791184, 0.0043352972518275745, 0.0011786937400740452, 0.07353130134299131, 0.004339580295941216};
    System.out.println(mean(arrays));
}
You can use these peak and low values to find similar images.
You can simply sum the values using double precision; the resulting value will be unique most of the time. On the other hand, if the position of each value is relevant, you can compute a sum that uses the index as a multiplier.
The code could be as simple as:
public static double sum(double[] values) {
double val = 0.0;
for (double d : values) {
val += d;
}
return val;
}
public static double hash_w_order(double[] values) {
double val = 0.0;
for (int i = 0; i < values.length; i++) {
val += values[i] * (i + 1);
}
return val;
}
public static void main(String[] args) {
double[] myvals =
{ 0.07518284315321135, 0.002987851573676068, 0.002963866526639678, 0.002526139418225552, 0.07444872939213325, 0.0037219653347541617, 0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931, 0.003477831820276616,
0.003477831820276616, 0.002036159171625004, 0.07383539747505984, 0.004311312204791184, 0.0043352972518275745, 0.0011786937400740452, 0.07353130134299131, 0.004339580295941216 };
System.out.println("Computed value based on sum: " + sum(myvals));
System.out.println("Computed value based on values and its position: " + hash_w_order(myvals));
}
The output for that code, using your list of values is:
Computed value based on sum: 0.41284176550504803
Computed value based on values and its position: 3.7396448842464496
Well, here's a method that works for any number of doubles.
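public BigInteger uniqueID(double[] array) {
    // 2^64: the base, one "digit" per double
    final BigInteger twoToTheSixtyFour = BigInteger.ONE.shiftLeft(64);
    BigInteger count = BigInteger.ZERO;
    for (double d : array) {
        long bitRepresentation = Double.doubleToRawLongBits(d);
        // Read the 64 bits as an unsigned digit in [0, 2^64). A plain
        // BigInteger.valueOf would turn negative doubles into negative digits.
        BigInteger digit = new BigInteger(Long.toUnsignedString(bitRepresentation)); // Java 8+
        count = count.multiply(twoToTheSixtyFour).add(digit);
    }
    return count;
}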
Explanation
Each double is a 64-bit value, which means there are 2^64 different possible double values. Since a long is easier to work with for this sort of thing, and it's the same number of bits, we can get a 1-to-1 mapping from doubles to longs using Double.doubleToRawLongBits(double).
This is awesome, because now we can treat this like a simple combinations problem. You know how you know that 1234 is a unique number? There's no other number with the same value. This is because we can break it up by its digits like so:
1234 = 1 * 10^3 + 2 * 10^2 + 3 * 10^1 + 4 * 10^0
The powers of 10 would be "basis" elements of the base-10 numbering system, if you know linear algebra. In this way, base-10 numbers are like arrays consisting of only values from 0 to 9 inclusively.
If we want something similar for double arrays, we can discuss the base-(2^64) numbering system. Each double value would be a digit in a base-(2^64) representation of a value. If there are 18 digits, there are (2^64)^18 unique values for a double[] of length 18.
That number is gigantic, so we're going to need to represent it with a BigInteger data-structure instead of a primitive number. How big is that number?
(2^64)^18 = 61172327492847069472032393719205726809135813743440799050195397570919697796091958321786863938157971792315844506873509046544459008355036150650333616890210625686064472971480622053109783197015954399612052812141827922088117778074833698589048132156300022844899841969874763871624802603515651998113045708569927237462546233168834543264678118409417047146496
There are that many unique configurations of 18-length double arrays and this code lets you uniquely describe them.
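One way to convince yourself that a base-(2^64) encoding really is unique is to show it can be reversed. The following is a sketch under that idea, not the answer's own code: the class Base2Pow64Codec and its methods are hypothetical names, and it assumes Java 8+ for Long.toUnsignedString.

```java
import java.math.BigInteger;

final class Base2Pow64Codec {
    private static final BigInteger BASE = BigInteger.ONE.shiftLeft(64); // 2^64

    // Encodes the array as a number whose base-2^64 digits are the raw bits of each double.
    static BigInteger encode(double[] array) {
        BigInteger id = BigInteger.ZERO;
        for (double d : array) {
            long bits = Double.doubleToRawLongBits(d);
            BigInteger digit = new BigInteger(Long.toUnsignedString(bits)); // digit in [0, 2^64)
            id = id.shiftLeft(64).add(digit);
        }
        return id;
    }

    // Recovers the original array; the length must be known because leading zero digits are lost.
    static double[] decode(BigInteger id, int length) {
        double[] out = new double[length];
        for (int i = length - 1; i >= 0; i--) {
            BigInteger[] quotientAndDigit = id.divideAndRemainder(BASE);
            // longValue() keeps the low 64 bits, which are exactly the original raw bits.
            out[i] = Double.longBitsToDouble(quotientAndDigit[1].longValue());
            id = quotientAndDigit[0];
        }
        return out;
    }
}
```

Because decode(encode(a), a.length) returns a again, two arrays of the same length can only share an ID if they are bit-for-bit identical.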
I'm going to suggest three methods, with different pros and cons which I will outline.
Hash Code
This is the obvious "solution", though it has been correctly pointed out that it will not be unique. However, it will be very unlikely that any two arrays will have the same value.
Weighted Sum
Your elements appear to be bounded; perhaps they range from a minimum of 0 to a maximum of 1. If this is the case, you can multiply the first number by N^0, the second by N^1, the third by N^2 and so on, where N is some large number (ideally the inverse of your precision). This is easily implemented, particularly if you use a matrix package, and very fast. We can make this unique if we choose.
Euclidean Distance from Mean
Subtract the mean of your arrays from each array, square the results, sum the squares. If you have an expected mean, you can use that. Again, not unique, there will be collisions, but you (almost) can't avoid that.
The difficulty of uniqueness
It has already been explained that hashing will not give you a unique solution. A unique number is possible in theory, using the Weighted Sum, but we have to use numbers of a very large size. Let's say your numbers are 64 bits in memory. That means that there are 2^64 possible numbers they can represent (slightly less using floating point). Eighteen such numbers in an array could represent 2^(64*18) different numbers. That's huge. If you use anything less, you will not be able to guarantee uniqueness due to the pigeonhole principle.
Let's look at a trivial example. If you have four letters, a, b, c and d, and you have to number them each uniquely using the numbers 1 to 3, you can't. That's the pigeonhole principle. You have 2^(18*64) possible numbers. You can't number them uniquely with less than 2^(18*64) numbers, and hashing doesn't give you that.
If you use BigDecimal, you can represent (almost) arbitrarily large numbers. If the largest element you can get is 1 and the smallest 0, then you can set N = 1/(precision) and apply the Weighted Sum mentioned above. This will guarantee uniqueness. The precision for doubles in Java is Double.MIN_VALUE. Note that the array of weights needs to be stored as BigDecimals!
That satisfies this part of your question:
create a computational value for each array, which is unique to it
based upon values inside it
However, there is a problem:
1 and 2 suck for K Means
I am assuming from your discussion with Marco13 that you are performing the clustering on the single values, not on the length-18 arrays. As Marco has already mentioned, hashing sucks for k-means. The whole idea of a hash is that the smallest change in the data results in a large change in the hash value. That means that two similar images, which produce two very similar arrays, produce two very different "unique" numbers. Similarity is not preserved. The result will be pseudo random!!!
Weighted Sums are better, but still bad. It will basically ignore all the elements except for the last one, unless the last element is the same. Only then will it look at the next to last, and so on. Similarity is not really preserved.
Euclidean distance from the mean (or at least some point) will at least group things together in a sort of sensible way. Direction will be ignored, but at least things that are far from the mean won't be grouped with things that are close. Similarity of one feature is preserved, the other features are lost.
In summary
1 is very easy, but is not unique and doesn't preserve similarity.
2 is easy, can be unique and doesn't preserve similarity.
3 is easy, but is not unique and preserves some similarity.
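Method 3 is short enough to sketch directly. The code below is an illustration with made-up names (DistanceFromMean, meanVector, squaredDistance); it assumes all arrays have the same length and returns the squared distance, exactly as the description says (sum of squares, no square root).

```java
final class DistanceFromMean {
    // Element-wise mean of a set of equally long arrays.
    static double[] meanVector(double[][] arrays) {
        double[] mean = new double[arrays[0].length];
        for (double[] array : arrays) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += array[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= arrays.length;
        }
        return mean;
    }

    // Subtract the mean, square the differences, sum the squares.
    static double squaredDistance(double[] array, double[] mean) {
        double sum = 0.0;
        for (int i = 0; i < array.length; i++) {
            double diff = array[i] - mean[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Note how two arrays on opposite sides of the mean get the same value; that is the "direction will be ignored" limitation mentioned above.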
Implementation of Weighted Sum. Not really tested.
public class Array2UniqueID {
private final double min;
private final double max;
private final double prec;
private final int length;
/**
* Used to provide a {@code BigDecimal} that is unique to the given array.
* <p>
* This uses weighted sum to guarantee that two IDs match if and only if
* every element of the array also matches. Similarity is not preserved.
*
* @param min smallest value an array element can possibly take
* @param max largest value an array element can possibly take
* @param prec smallest difference possible between two array elements
* @param length length of each array
*/
public Array2UniqueID(double min, double max, double prec, int length) {
this.min = min;
this.max = max;
this.prec = prec;
this.length = length;
}
/**
* A convenience constructor which assumes the array consists of doubles of
* full range.
* <p>
* This will result in very large IDs being returned.
*
* @see Array2UniqueID#Array2UniqueID(double, double, double, int)
* @param length length of each array
*/
public Array2UniqueID(int length) {
this(-Double.MAX_VALUE, Double.MAX_VALUE, Double.MIN_VALUE, length);
}
public BigDecimal createUniqueID(double[] array) {
// Validate the data
if (array.length != length) {
throw new IllegalArgumentException("Array length must be "
+ length + " but was " + array.length);
}
for (double d : array) {
if (d < min || d > max) {
throw new IllegalArgumentException("Each element of the array"
+ " must be in the range [" + min + ", " + max + "]");
}
}
// Work in BigDecimal so that a full-range (max - min) cannot overflow to Infinity
BigDecimal precision = BigDecimal.valueOf(prec);
BigDecimal lowest = BigDecimal.valueOf(min);
/* maxNums is the maximum number of numbers that could possibly exist
* between max and min.
* The ID will be in the range 0 to maxNums^length.
* maxNums = range / prec + 1
* Stored as a BigDecimal for convenience, but is an integer
*/
BigDecimal maxNums = BigDecimal.valueOf(max).subtract(lowest)
.divide(precision, 0, java.math.RoundingMode.DOWN)
.add(BigDecimal.ONE);
BigDecimal id = BigDecimal.ZERO;
// id = sum over i of ((array[i] - min) / prec) * maxNums^i
for (int i = 0; i < array.length; i++) {
BigDecimal num = BigDecimal.valueOf(array[i]).subtract(lowest)
.divide(precision, 0, java.math.RoundingMode.DOWN)
.multiply(maxNums.pow(i));
id = id.add(num);
}
return id;
}
}
As I understand it, you are going to do k-means clustering based on the double values.
Why not just wrap each double value in an object, together with an array and position identifier, so that you know which cluster it ended up in?
Something like:
public class Element {
final public double value;
final public int array;
final public int position;
public Element(double value, int array, int position) {
this.value = value;
this.array = array;
this.position = position;
}
}
If you need to cluster each array as a whole:
You can turn the original arrays of length 18 into arrays of length 19, with the first or last element being a unique id that you ignore during clustering but can refer to once clustering has finished. This has a small memory footprint, 8 additional bytes per array, and gives an easy association with the original values.
If space is absolutely a problem and all values in an array are less than 1, you can add a unique integer id to the values of each array (0 for the 1st array, 1 for the 2nd, and so on) and cluster on the remainder of division by 1. So 0.07518284315321135 stays 0.07518284315321135 for the 1st array and becomes 1.07518284315321135 for the 2nd, although this increases the complexity of the computation during clustering.
First of all, let's try to understand what you need mathematically:
Uniquely mapping an array of m real numbers to a single number is in fact a bijection between R^m and R, or at least N.
Since floating-point values are in fact rational numbers, your problem is to find a bijection between Q^m and N. This can be reduced to mapping N^m to N, because you know your values will always be greater than 0 (just scale your values by the inverse of the precision to get integers).
Thus you need to map N^m to N. Take a look at the Cantor Pairing Function for some ideas.
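For reference, the Cantor pairing function for two non-negative integers and its inverse can be sketched as follows. The class and method names are made up for this illustration, and the code uses long, so it only works while the result fits in 63 bits; folding 18 scaled values this way would need BigInteger.

```java
final class CantorPairing {
    // Maps a pair of non-negative integers to a single non-negative integer.
    static long pair(long x, long y) {
        long s = x + y;
        return s * (s + 1) / 2 + y;
    }

    // Inverts pair(): returns {x, y}.
    static long[] unpair(long z) {
        long w = (long) ((Math.sqrt(8.0 * z + 1) - 1) / 2);
        // Correct for possible floating-point rounding in the square root.
        while (w * (w + 1) / 2 > z) {
            w--;
        }
        while ((w + 1) * (w + 2) / 2 <= z) {
            w++;
        }
        long y = z - w * (w + 1) / 2;
        return new long[] { w - y, y };
    }

    public static void main(String[] args) {
        long z = pair(47, 32);
        long[] xy = unpair(z);
        System.out.println(z + " -> (" + xy[0] + ", " + xy[1] + ")"); // 3192 -> (47, 32)
    }
}
```

More than two values can be combined by pairing repeatedly, e.g. pair(pair(a, b), c), but the results grow very quickly.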
A guaranteed way to generate a unique result based on the array is to convert it to one big string, and use that for your computational value.
It may be slow, but it will be unique based on the array's values.
Implementation examples:
Best way to convert an ArrayList to a string
I created a class "Book":
public class Book {
public static int idCount = 1;
private int id;
private String title;
private String author;
private String publisher;
private int yearOfPublication;
private int numOfPages;
private Cover cover;
...
}
And then I need to override the hashCode() and equals() methods.
@Override
public int hashCode() {
int result = id; // !!!
result = 31 * result + (title != null ? title.hashCode() : 0);
result = 31 * result + (author != null ? author.hashCode() : 0);
result = 31 * result + (publisher != null ? publisher.hashCode() : 0);
result = 31 * result + yearOfPublication;
result = 31 * result + numOfPages;
result = 31 * result + (cover != null ? cover.hashCode() : 0);
return result;
}
There is no problem with equals(). I am just wondering about one thing in the hashCode() method.
Note: IntelliJ IDEA generated that hashCode() method.
So, is it OK to set the result variable to id, or should I use some prime number?
What is the better choice here?
Thanks!
Note that only the initial value of the result is set to id, not the final one. The final value is calculated by combining that initial value with hash codes of other parts of the object, multiplied by a power of a small prime number (i.e. 31). Using id rather than an arbitrary prime is definitely right in this context.
In general, there is no advantage to hash code being prime (it's the number of hash buckets that needs to be prime). Using an int as its own hash code (in your case, that's id and numOfPages) is a valid approach.
It helps to know what the hashCode is used for. It's supposed to help you map a theoretically infinite set of objects to fitting in a small number of "bins", with each bin having a number, and each object saying which bin it wants to go in based on its hashCode. The question is not whether it's okay to do one thing or another, but whether what you want to do matches what the hashCode function is for.
As per http://docs.oracle.com/javase/6/docs/api/java/lang/Object.html#hashCode(), it's not about the number you return, it's about how it behaves for different objects of the same class.
If the object doesn't change, the hashCode must be the same value every time you call the hashCode() function.
Two objects that are equal according to .equals, must have the same hashCode.
Two objects that are not equal may have the same hashCode. (if this wasn't the case, there would be no point in using the hashCode at all, because every object already has a unique object pointer)
If you're reimplementing the hashCode function, the most important thing is to either rely on a tool to generate it for you, or to use code you understand that obeys those rules. Java's String.hashCode uses a well-researched, seemingly simple bit of code, and the generated code you see applies the same multiply-by-31 scheme to the fields of your class.
If you don't know why that works, don't touch it. Just rely on it working and move on. That 31 is ridiculously important and ensures an even hashing distribution. See Why does Java's hashCode() in String use 31 as a multiplier? for the why on that one.
However, this might also be way more than you need. You could use id, but then you're basically negating the reason to use a hashCode (because now every object will want to be in a bin on its own, turning any hashed collection into a flat array. Kind of silly).
If you know the distribution of your id values, there are far easier hashCodes to come up with. Say you know they are always between 0 and Integer.MAX_VALUE, and you know there are never any gaps between ids; then you could simply generate a hashCode like
final int modulus = 255; // number of bins
@Override
public int hashCode() {
    return this.id % modulus;
}
Now you have a hashCode optimised for 255 bins, fulfilling the necessary requirements for an acceptable hashCode function.
Note: In my answer I am assuming that you know how a hash code is meant to be used. The following only discusses the potential benefit of using a non-zero constant as the initial value of result.
If id is rarely 0 then it's fine to use it. If it's 0 frequently you can use some constant instead (just using 1 should be fine), which is the conventional choice. It helps to see what the initial value actually does, though. Say object A has all fields null or 0 except for yearOfPublication = 1 and object B has all fields null or 0 except for numOfPages = 1. Expanding the generated method, the hash codes will be:
A.hashCode() => initialValue * 31 ^ 6 + 1 * 31 ^ 2
B.hashCode() => initialValue * 31 ^ 6 + 1 * 31 ^ 1
The initialValue term is the same in both, so A and B get different hash codes whether or not initialValue is 0; the position-dependent powers of 31 already separate them. A non-zero initial value makes a difference when the number of combined elements varies (for example in List.hashCode or Arrays.hashCode), because it stops leading zero-valued elements from vanishing. For a class with a fixed set of fields it is harmless, but it does not by itself reduce collisions in data structures that use the hash code, like HashMap.
That said, in your example of the Book class it is likely that id will never be 0. In fact, if id uniquely identifies the Book then you can have the hashCode() method just return the id.
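A minimal sketch of that last option follows. BookById is a hypothetical, cut-down stand-in for the Book class, and it assumes that equals() compares only id (the question does not show its equals() method); if equals() compared more fields, returning id would still be legal but would collide for books sharing an id.

```java
final class BookById {
    private final int id;
    private final String title;

    BookById(int id, String title) {
        this.id = id;
        this.title = title;
    }

    String getTitle() {
        return title;
    }

    // Assumption: identity of a book is fully determined by its id.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof BookById)) {
            return false;
        }
        return id == ((BookById) other).id;
    }

    // Equal objects (same id) always produce the same hash code, as the contract requires.
    @Override
    public int hashCode() {
        return id;
    }
}
```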
I need a hashCode implementation in Java which ignores the order of the fields in my class Edge. It should be possible that Node first could be Node second, and second could be Node first.
Here is my method, which depends on the order:
public class Edge {
private Node first, second;
@Override
public int hashCode() {
int hash = 17;
int hashMultiplikator = 79;
hash = hashMultiplikator * hash
+ first.hashCode();
hash = hashMultiplikator * hash
+ second.hashCode();
return hash;
}
}
Is there a way to compute a hash that is the same for both of the following Edges, but otherwise unique?
Node n1 = new Node("a");
Node n2 = new Node("b");
Edge ab = new Edge(n1,n2);
Edge ba = new Edge(n2,n1);
ab.hashCode() == ba.hashCode() should be true.
You can use some sort of commutative operation instead of what you have now, like addition:
@Override
public int hashCode() {
int hash = 17;
int hashMultiplikator = 79;
int hashSum = first.hashCode() + second.hashCode();
hash = hashMultiplikator * hash * hashSum;
return hash;
}
I'd recommend that you still use the multiplier since it provides some entropy to your hash code. See my answer here, which says:
Some good rules to follow for hashing are:
Mix up your operators. By mixing your operators, you can cause the results to vary more. Using simply x * y in this test, I had a very large number of collisions.
Use prime numbers for multiplication. Prime numbers have interesting binary properties that cause multiplication to be more volatile.
Avoid using shift operators (unless you really know what you're doing). They insert lots of zeroes or ones into the binary of the number, decreasing volatility of other operations and potentially even shrinking your possible number of outputs.
To solve your problem you have to combine the hashCodes of both components.
An example could be:
@Override
public int hashCode() {
int prime = 17;
return prime * (first.hashCode() + second.hashCode());
}
Please check if this matches your requirements. A multiplication or an XOR instead of the addition would also be possible.
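For completeness, an order-insensitive hashCode() only makes sense together with an order-insensitive equals(). The sketch below is an assumption-laden illustration: UndirectedEdge is a made-up class that uses String in place of the question's Node type so that it is self-contained.

```java
final class UndirectedEdge {
    private final String first;
    private final String second;

    UndirectedEdge(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Two edges are equal if they connect the same nodes, in either order.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof UndirectedEdge)) {
            return false;
        }
        UndirectedEdge that = (UndirectedEdge) other;
        return (first.equals(that.first) && second.equals(that.second))
                || (first.equals(that.second) && second.equals(that.first));
    }

    // Addition is commutative, so swapping first and second cannot change the result.
    // XOR would also be commutative, but gives 0 for every edge from a node to itself.
    @Override
    public int hashCode() {
        return 17 * (first.hashCode() + second.hashCode());
    }
}
```

Like any int hash, this is the same for ab and ba but not unique: different pairs of nodes can still collide.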
I have to write a hash function, under the following two conditions:
I don't know anything about the Object o that is passed to the method - it can be a String, an Integer, or an actual custom object;
I am not allowed to call hashCode() at all.
Approach that I am using now, to calculate the hash code:
Write object to the byte stream;
Convert byte stream to the byte array;
Loop through the byte array and calculate hash by doing something like this:
hash = hash * PRIME + byteArray[i]
My question is: is this a passable approach, and is there a way to improve it? Personally I feel that the scope of this function is too broad, since there is no information about what the objects are, but I have little say in this situation.
You could use HashCodeBuilder.reflectionHashCode instead of implementing your own solution.
The serialization approach only works for objects that are in fact serializable, so it cannot cover all types of objects.
Also, this compares objects by whether they have equivalent object graphs, which is not necessarily the same as being equal according to .equals().
For example, StringBuilder objects created by the same code (with the same data) will produce equal ObjectOutputStream output (and so equal hashes) while b1.equals(b2) is false, and an ArrayList and a LinkedList with the same elements will register as different while list1.equals(list2) is true.
You can skip the step of converting the byte stream to an array by creating a custom HashOutputStream, which simply takes the byte data and hashes it instead of saving it as an array for later iteration.
class HashOutputStream extends OutputStream {
private static final int PRIME = 13;
private int hash;
// all the other write methods delegate to this one
public void write(int b) {
this.hash = this.hash * PRIME + b;
}
public int getHash() {
return hash;
}
}
Then wrap your ObjectOutputStream around an object of this class.
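Put together, the wrapping could look like the sketch below. The names SerializationHash, HashingStream and hashOf are made up for this illustration; the stream is a variant of the class above that masks each byte to 0..255, and the hash is read only after the ObjectOutputStream has been closed so that all buffered bytes have been flushed.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationHash {
    // Hashes every byte written to it instead of storing the bytes.
    private static final class HashingStream extends OutputStream {
        private static final int PRIME = 13;
        private int hash;

        @Override
        public void write(int b) {
            hash = hash * PRIME + (b & 0xFF);
        }

        int getHash() {
            return hash;
        }
    }

    // Serializes the value straight into the hashing stream; no byte array is built.
    static int hashOf(Serializable value) {
        HashingStream sink = new HashingStream();
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.getHash();
    }
}
```

A call such as SerializationHash.hashOf("some text") then returns the same value for every object with the same serialized form.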
Instead of your y = y*13 + x method you might look at other checksum algorithms. For example, java.util.zip contains Adler32 (used in the zlib format) and CRC32 (used in the gzip format).
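As a rough sketch of the checksum route, java.util.zip.CRC32 can be fed a byte array directly. The wrapper class ChecksumHash is a made-up name; the byte array is assumed to be the serialized form from step 2.

```java
import java.util.zip.CRC32;

final class ChecksumHash {
    // Computes the CRC-32 of the data and narrows the 32-bit checksum to an int.
    static int crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return (int) crc.getValue();
    }
}
```

Adler32 has the same update/getValue interface, so it can be swapped in by changing the class name.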
hash = (hash * PRIME + byteArray[i]) % MODULO ?
Also, while you're at it, if you want to avoid collisions as much as possible, you can use a standardized (cryptographic if intentional collisions are an issue) hash function in step 3, like SHA-2 or so?
Have a look at DigestInputStream, which also spares you step 2.
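A sketch of the cryptographic variant, with assumptions stated: it hashes a byte array directly with java.security.MessageDigest rather than going through DigestInputStream, picks SHA-256 as the SHA-2 member, and keeps only the first four bytes of the digest as the int hash. The class and method names are made up.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestHash {
    // Returns the first 4 bytes (big-endian) of the SHA-256 digest of the data.
    static int sha256Hash(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] full = digest.digest(data);
            return ByteBuffer.wrap(full).getInt();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Truncating to 32 bits gives up collision resistance against a deliberate attacker, so this only buys good mixing; keep the full digest if intentional collisions are a concern.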
Take a look at Bob Jenkins's article on non-cryptographic hashing. He walks through a number of approaches and discusses their strengths, weaknesses, and tradeoffs between speed and the probability of collisions.
If nothing else, it will allow you to justify your algorithm decision. Explain to your instructor why you chose speed over correctness or vice versa.
As a starting point, try his One-at-a-time hash:
ub4 one_at_a_time(char *key, ub4 len)
{
ub4 hash, i;
for (hash=0, i=0; i<len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return (hash & mask);
}
It's simple, but does surprisingly well against more complex algorithms.
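Since the question is about Java, here is a sketch of how the same function might be ported. It is an adaptation, not Jenkins's own code: the class name is made up, bytes are masked to 0..255, the unsigned right shift >>> stands in for C's shift on an unsigned type, and the final "& mask" step is left out so the full 32-bit value is returned.

```java
final class OneAtATimeHash {
    static int oneAtATime(byte[] key) {
        int hash = 0;
        for (byte b : key) {
            hash += (b & 0xFF);   // treat the byte as unsigned
            hash += (hash << 10);
            hash ^= (hash >>> 6); // unsigned shift, as on a C unsigned type
        }
        hash += (hash << 3);
        hash ^= (hash >>> 11);
        hash += (hash << 15);
        return hash;
    }

    public static void main(String[] args) {
        // Hash of the single byte 'a'
        System.out.println(Integer.toHexString(oneAtATime(new byte[] { 'a' }))); // ca2e9442
    }
}
```

To use it as a bucket index, reduce the result afterwards, for example with Math.floorMod(hash, bucketCount).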