Is it a good approach to generate hash codes? - java

I have to write a hash function, under the following two conditions:
I don't know anything about the Object o that is passed to the method - it can be a String, an Integer, or an actual custom object;
I am not allowed to call hashCode() at all.
The approach I am using now to calculate the hash code:
Write object to the byte stream;
Convert byte stream to the byte array;
Loop through the byte array and calculate hash by doing something like this:
hash = hash * PRIME + byteArray[i]
My question: is this a passable approach, and is there a way to improve it? Personally I feel like the scope for this function is too broad - there is no information about what the objects are - but I have little say in this situation.

You could use HashCodeBuilder.reflectionHashCode instead of implementing your own solution.

The serialization approach only works for objects that are in fact serializable, so it is not really possible for all types of objects.
Also, it compares objects by whether they have equivalent object graphs, which is not necessarily the same as being equal by .equals().
For example, StringBuilder objects created by the same code (with the same data) will produce equal ObjectOutputStream output (and thus equal hashes) even though b1.equals(b2) is false, while an ArrayList and a LinkedList with the same elements will register as different even though list1.equals(list2) is true.
You can avoid the convert byte stream to array step by creating a custom HashOutputStream, which simply takes the byte data and hashes it, instead of saving it as an array for later iteration.
class HashOutputStream extends OutputStream {

    private static final int PRIME = 13;

    private int hash;

    // all the other write methods delegate to this one
    @Override
    public void write(int b) {
        this.hash = this.hash * PRIME + b;
    }

    public int getHash() {
        return hash;
    }
}
Then wrap your ObjectOutputStream around an object of this class.
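Putting the pieces together, a minimal sketch might look like this (the class name SerialHashDemo and the helper serialHash are just illustrative names, not part of the original answer):

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

class HashOutputStream extends OutputStream {
    private static final int PRIME = 13;
    private int hash;

    // ObjectOutputStream funnels all byte writes through here
    @Override
    public void write(int b) {
        this.hash = this.hash * PRIME + b;
    }

    public int getHash() {
        return hash;
    }
}

public class SerialHashDemo {
    static int serialHash(Object o) throws IOException {
        HashOutputStream hos = new HashOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(hos)) {
            oos.writeObject(o); // o must be Serializable
        }
        return hos.getHash();
    }

    public static void main(String[] args) throws IOException {
        // the same value serializes to the same bytes, hence the same hash
        System.out.println(serialHash("hello") == serialHash("hello")); // true
    }
}
```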
Instead of your y = y*13 + x method you might look at other checksum algorithms. For example, java.util.zip contains Adler32 (used in the zlib format) and CRC32 (used in the gzip format).
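For instance, a rough sketch of feeding serialized bytes through those two checksum classes (the class name and sample data are placeholders):

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class ChecksumDemo {
    public static void main(String[] args) {
        byte[] data = "example".getBytes();

        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        // getValue() returns a long, but a CRC32 checksum fits in 32 bits
        int crcHash = (int) crc.getValue();

        Adler32 adler = new Adler32();
        adler.update(data, 0, data.length);
        int adlerHash = (int) adler.getValue();

        System.out.println(crcHash + " " + adlerHash);
    }
}
```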

What about hash = (hash * PRIME + byteArray[i]) % MODULO?

Also, while you're at it: if you want to avoid collisions as much as possible, you can use a standardized hash function in step 3, such as SHA-2 (a cryptographic one, if intentional collisions are a concern).
Have a look at DigestInputStream, which also spares you step 2.
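A sketch of that idea using MessageDigest directly (folding the first four digest bytes into an int is an arbitrary choice made here for illustration):

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestHashDemo {
    static int sha256Hash(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(data); // 32 bytes for SHA-256
        // fold the first four bytes of the digest into an int hash
        return ByteBuffer.wrap(digest).getInt();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(sha256Hash("example".getBytes()));
    }
}
```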

Take a look at Bob Jenkin's article on non-cryptographic hashing. He walks through a number of approaches and discusses their strengths, weakness, and tradeoffs between speed and the probability of collisions.
If nothing else, it will allow you to justify your algorithm decision. Explain to your instructor why you chose speed over correctness or vice versa.
As a starting point, try his One-at-a-time hash:
ub4 one_at_a_time(char *key, ub4 len)
{
    ub4 hash, i;
    for (hash = 0, i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return (hash & mask); /* 'mask' selects the bits needed for the table size */
}
It's simple, but does surprisingly well against more complex algorithms.
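Since the question is about Java, here is a rough Java translation of the same routine (using >>> for the unsigned shifts; masking to the table size is left to the caller):

```java
public class OneAtATime {
    static int oneAtATime(byte[] key) {
        int hash = 0;
        for (byte b : key) {
            hash += (b & 0xff); // treat bytes as unsigned, as the C code does
            hash += hash << 10;
            hash ^= hash >>> 6;
        }
        hash += hash << 3;
        hash ^= hash >>> 11;
        hash += hash << 15;
        return hash; // mask with (tableSize - 1) for power-of-two tables
    }

    public static void main(String[] args) {
        System.out.println(oneAtATime("example".getBytes()));
    }
}
```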

Related

How can we write a polynomial hash function with given prime

So for a given prime number 31, how can I write a hash function for a string parameter?
Here is my attempt.
private int hash(String key) {
    int c = 31;
    int hash = 0;
    for (int i = 0; i < key.length(); i++) {
        int ascii = key.charAt(i);
        hash += c * hash + ascii;
    }
    return (hash % sizetable); // sizetable is an int declared outside; think of it as table.length
}
So, since I can not run any other function in my work and I need to be sure about the process here, I need your answers and help! Thank you so much.
Your implementation looks quite similar to the documented standard String.hashCode() implementation, which also uses 31 as the prime factor, so it should be good enough.
I just would not assign 31 to a local variable, but declare it as a private static final field - or use it directly as a magic number, which is not OK in general but might be OK in this case.
Additionally, you should add some tests - if you already know about the concept of unit tests - to prove that your method gives different hashes for different strings. And pick the sample strings cleverly, so that they really do produce different hashes (for the sake of the homework ;)
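For reference, here is a sketch that follows the String.hashCode() recurrence exactly - note the plain = rather than the += in the question, so each step computes 31 * hash + character, and Math.floorMod guards against a negative index once the int overflows (sizetable is passed as a parameter here instead of being read from a field):

```java
public class PolyHash {
    private static final int PRIME = 31;

    static int hash(String key, int sizetable) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = PRIME * h + key.charAt(i); // same recurrence as String.hashCode()
        }
        // plain % can return a negative value once h overflows
        return Math.floorMod(h, sizetable);
    }

    public static void main(String[] args) {
        System.out.println(hash("hello", 16));
    }
}
```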

Creating combinations of a BitSet

Assume I have a Java BitSet. I now need to make combinations of the BitSet such that only Bits which are Set can be flipped. i.e. only need combinations of Bits which are set.
For example: BitSet - 1010, Combinations - 1010, 1000, 0010, 0000
BitSet - 1100, Combinations - 1100, 1000, 0100, 0000
I can think of a few solutions E.g. I can take combinations of all 4 bits and then XOR the combinations with the original Bitset. But this would be very resource-intensive for large sparse BitSets. So I was looking for a more elegant solution.
It appears that you want to get the power set of the bit set. There is already an answer here about how to get the power set of a Set<T>. Here, I will show a modified version of the algorithm shown in that post, using BitSets:
private static Set<BitSet> powerset(BitSet set) {
    Set<BitSet> sets = new HashSet<>();
    if (set.isEmpty()) {
        sets.add(new BitSet(0));
        return sets;
    }
    int head = set.nextSetBit(0);
    BitSet rest = set.get(0, set.size());
    rest.clear(head);
    for (BitSet s : powerset(rest)) {
        BitSet newSet = s.get(0, s.size());
        newSet.set(head);
        sets.add(newSet);
        sets.add(s);
    }
    return sets;
}
You can perform the operation in a single linear pass instead of recursion, if you realize the integer numbers are a computer’s intrinsic variant of “on off” patterns and iterating over the appropriate integer range will ultimately produce all possible permutations. The only challenge in your case, is to transfer the densely packed bits of an integer number to the target bits of a BitSet.
Here is such a solution:
static List<BitSet> powerset(BitSet set) {
    int nBits = set.cardinality();
    if (nBits > 30) throw new OutOfMemoryError(
        "Not enough memory for " + BigInteger.ONE.shiftLeft(nBits) + " BitSets");
    int max = 1 << nBits;
    int[] targetBits = set.stream().toArray();
    List<BitSet> sets = new ArrayList<>(max);
    for (int onOff = 0; onOff < max; onOff++) {
        BitSet next = new BitSet(set.size());
        for (int bitsToSet = onOff, ix = 0; bitsToSet != 0; ix++, bitsToSet >>>= 1) {
            if ((bitsToSet & 1) == 0) {
                int skip = Integer.numberOfTrailingZeros(bitsToSet);
                ix += skip;
                bitsToSet >>>= skip;
            }
            next.set(targetBits[ix]);
        }
        sets.add(next);
    }
    return sets;
}
It uses an int value for the iteration, which is already enough to represent all combinations that can ever be stored in one of Java’s builtin collections. If your source BitSet had 31 one bits, the 2³¹ possible combinations would not only require on the order of a hundred GB of heap, but also a collection supporting 2³¹ elements, i.e. a size not representable as an int.
So the code above terminates early if the number exceeds the capabilities, without even trying. You could rewrite it to use a long or even BigInteger instead, to keep it busy in such cases, until it will fail with an OutOfMemoryError anyway.
For the working cases, the int solution is the most efficient variant.
Note that the code returns a List rather than a HashSet to avoid the costs of hashing. The values are already known to be unique and hashing would only pay off if you want to perform lookups, i.e. call contains with another BitSet. But to test whether an existing BitSet is a permutation of your input BitSet, you wouldn’t even need to generate all permutations, a simple bit operation, e.g. andNot would tell you that already. So for storing and iterating the permutations, an ArrayList is more efficient.
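That last remark can be sketched as a small helper (the method name is mine, not from the original answer):

```java
import java.util.BitSet;

public class SubsetCheck {
    // true if every set bit of candidate is also set in source,
    // i.e. candidate would appear in powerset(source)
    static boolean isSubsetOf(BitSet candidate, BitSet source) {
        BitSet extra = (BitSet) candidate.clone();
        extra.andNot(source); // clears every bit that is also in source
        return extra.isEmpty();
    }

    public static void main(String[] args) {
        BitSet source = new BitSet();
        source.set(1);
        source.set(3); // the 1010 pattern from the question
        BitSet candidate = new BitSet();
        candidate.set(3); // the 0010 pattern
        System.out.println(isSubsetOf(candidate, source)); // true
    }
}
```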

Hashcode generated by Eclipse

In SO I have read several answers related to the implementation of hashcode and the suggestion to use the XOR operator. (E.g. Why are XOR often used in java hashCode() but another bitwise operators are used rarely?).
When I use Eclipse to generate the hashcode function where field is an object and timestamp a long, the output is:
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((field == null) ? 0 : field.hashCode());
    return result;
}
Is there any reason for not using the XOR operator, like below?
result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
Eclipse takes the safe way out. Although the calculation method that uses a prime, a multiplication, and an addition is slower than a single XOR, it gives you an overall better hash code in situations when you have multiple fields.
Consider a simple example - a class with two Strings, a and b. You can use
a.hashCode() ^ b.hashCode()
or
a.hashCode() * 31 + b.hashCode()
Now consider two objects:
a = "ABC"; b = "XYZ"
and
a = "XYZ"; b = "ABC"
The first method will produce identical hash codes for them, because XOR is symmetric; the second method will produce different hash codes, which is good, because the objects are not equal. In general, you want non-equal objects to have different hash codes as often as possible, to improve performance of hash-based containers of these objects. The 31*a+b method achieves this goal better than XOR.
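The collision is easy to verify directly:

```java
public class XorCollision {
    public static void main(String[] args) {
        String a1 = "ABC", b1 = "XYZ";
        String a2 = "XYZ", b2 = "ABC";

        // XOR is symmetric, so the swapped fields collide
        System.out.println((a1.hashCode() ^ b1.hashCode())
                        == (a2.hashCode() ^ b2.hashCode())); // true

        // the 31 * a + b form keeps the two orders distinct
        System.out.println((a1.hashCode() * 31 + b1.hashCode())
                        == (a2.hashCode() * 31 + b2.hashCode())); // false
    }
}
```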
Note that when you are dealing with portions of the same object, as in
timestamp ^ (timestamp >>> 32)
the above argument is much weaker: encountering two timestamps whose only difference is that their upper and lower halves are swapped is much harder to imagine than encountering two objects with swapped a and b field values.

Which is faster, String or Integer as hashkey in Java?

I am working on a problem where execution times are becoming too large, and now I'm looking for possible optimizations.
The question: Is there any (considerable) difference in performance between using String or Integer as a hash key?
The problem is that I have a graph with nodes stored in a hashtable with String keys, for example "0011" or "1011" etc. I could convert these to integers as well if this would mean an improvement in execution time.
Integer will perform better than String. Following is code for the hashcode computation for both.
Integer hash code implementation
/**
 * Returns a hash code for this <code>Integer</code>.
 *
 * @return a hash code value for this object, equal to the
 *         primitive <code>int</code> value represented by this
 *         <code>Integer</code> object.
 */
public int hashCode() {
    return value;
}
String hash code implementation
/**
 * Returns a hash code for this string. The hash code for a
 * <code>String</code> object is computed as
 * <blockquote><pre>
 * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
 * </pre></blockquote>
 * using <code>int</code> arithmetic, where <code>s[i]</code> is the
 * <i>i</i>th character of the string, <code>n</code> is the length of
 * the string, and <code>^</code> indicates exponentiation.
 * (The hash value of the empty string is zero.)
 *
 * @return a hash code value for this object.
 */
public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;
        for (int i = 0; i < len; i++) {
            h = 31 * h + val[off++];
        }
        hash = h;
    }
    return h;
}
If you have a performance problem, it's quite unlikely that the issue is due to the HashMap/Hashtable. While hashing a string is slightly more expensive than hashing an integer, the difference is rather small, and since the hash code is cached it isn't recalculated when you reuse the same string object. You are unlikely to get any significant performance benefit from converting the keys to integers first.
It's probably more fruitful to look somewhere else for the source of your performance issue. Have you tried profiling your code yet?
There is a difference in speed. HashMaps use hashCode to determine the bucket, and the Integer implementation is much simpler than that of String.
Having said that, if you are having problems with execution times, you need to do some proper measurements and profiling yourself. That's the only way to find out what the problem is with the execution time, and using Integers instead of Strings will usually only have a minimal effect on performance, meaning that your performance problem might be elsewhere.
For example, look at this post if you want to do some proper micro benchmarks. There are many other resources available for profiling etc..
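If you do decide to try integer keys, the "0011"-style binary strings from the question convert directly (the map's value type here is just an example):

```java
import java.util.HashMap;
import java.util.Map;

public class IntKeyDemo {
    public static void main(String[] args) {
        // parse the binary-string keys from the question as base-2 integers
        int key = Integer.parseInt("1011", 2);
        System.out.println(key); // 11

        Map<Integer, String> nodes = new HashMap<>();
        nodes.put(key, "some node payload");
        System.out.println(nodes.get(Integer.parseInt("1011", 2)));
    }
}
```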

Hashcode for objects with only integers

How do you, in a general (and performant) way, implement hashCode while minimizing collisions for objects with 2 or more integers?
Update: as many stated, you can't of course eliminate collisions entirely (honestly, I didn't think about that). So my question should be: how do you minimize collisions in a proper way? Edited to reflect that.
Using NetBeans' autogeneration fails; for example:
public class HashCodeTest {

    @Test
    public void testHashCode() {
        int loopCount = 0;
        HashSet<Integer> hashSet = new HashSet<Integer>();
        for (int outer = 0; outer < 18; outer++) {
            for (int inner = 0; inner < 2; inner++) {
                loopCount++;
                hashSet.add(new SimpleClass(inner, outer).hashCode());
            }
        }
        org.junit.Assert.assertEquals(loopCount, hashSet.size());
    }

    private class SimpleClass {
        int int1;
        int int2;

        public SimpleClass(int int1, int int2) {
            this.int1 = int1;
            this.int2 = int2;
        }

        @Override
        public int hashCode() {
            int hash = 5;
            hash = 17 * hash + this.int1;
            hash = 17 * hash + this.int2;
            return hash;
        }
    }
}
Can you in a general (and performant) way implement hashcode without
collisions for objects with 2 or more integers?
It is technically impossible to have zero collision when hashing to 32 bits (one integer) something made of more than 32 bits (like 2 or more integers).
This is what eclipse auto-generates:
@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + getOuterType().hashCode();
    result = prime * result + int1;
    result = prime * result + int2;
    return result;
}
And with this code your testcase passes...
PS: And don't forget to implement equals()!
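For completeness, a self-contained sketch of SimpleClass with a matching equals() (the outer-type term from the Eclipse output is dropped here, since this variant is a top-level class):

```java
public class SimpleClass {
    final int int1;
    final int int2;

    public SimpleClass(int int1, int int2) {
        this.int1 = int1;
        this.int2 = int2;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + int1;
        result = prime * result + int2;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof SimpleClass)) return false;
        SimpleClass other = (SimpleClass) obj;
        return int1 == other.int1 && int2 == other.int2;
    }
}
```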
There is no way to eliminate hash collisions entirely. Your approach is basically the preferred one to minimize collisions.
Creating a hash method with zero collisions is impossible. The idea of a hash method is you're taking a large set of objects and mapping it to a smaller set of integers. The best you can do is minimize the number of collisions you get within a subset of your objects.
As others have said, it's more important to minimize collisions than to eliminate them - especially since you didn't say how many buckets you're aiming for. It's going to be much easier to have zero collisions with 5 items in 1000 buckets than with 5 items in 2 buckets! And even if there are plenty of buckets, your collisions could look very different with 1000 buckets vs 1001.
Another thing to note is that there's a good chance that the hash you provide won't even be the one the HashMap eventually uses. If you take a look at the OpenJDK HashMap code, for instance, you'll see that your keys' hashCodes are put through a private hash method (line 264 in that link) which re-hashes them. So, if you're going through the trouble of creating a carefully constructed custom hash function to reduce collisions (rather than just a simple, auto-generated one), make sure you also understand who's going to use it, and how.
