How do you in a general (and performant) way implement hashcode while minimizing collisions for objects with 2 or more integers?
Update: as many have stated, you of course can't eliminate collisions entirely (honestly, I didn't think about it). So my question should be how to minimize collisions in a proper way; I've edited it to reflect that.
Using NetBeans' autogeneration fails; for example:
import java.util.HashSet;
import org.junit.Test;

public class HashCodeTest {
    @Test
    public void testHashCode() {
        int loopCount = 0;
        HashSet<Integer> hashSet = new HashSet<Integer>();
        for (int outer = 0; outer < 18; outer++) {
            for (int inner = 0; inner < 2; inner++) {
                loopCount++;
                hashSet.add(new SimpleClass(inner, outer).hashCode());
            }
        }
        // fails if any two of the 36 objects share a hash code
        org.junit.Assert.assertEquals(loopCount, hashSet.size());
    }

    private class SimpleClass {
        int int1;
        int int2;

        public SimpleClass(int int1, int int2) {
            this.int1 = int1;
            this.int2 = int2;
        }

        @Override
        public int hashCode() {
            int hash = 5;
            hash = 17 * hash + this.int1;
            hash = 17 * hash + this.int2;
            return hash;
        }
    }
}
Can you in a general (and performant) way implement hashcode without collisions for objects with 2 or more integers.
It is technically impossible to have zero collisions when hashing to 32 bits (one integer) something made of more than 32 bits (like 2 or more integers).
This is what Eclipse auto-generates:
@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + getOuterType().hashCode();
    result = prime * result + int1;
    result = prime * result + int2;
    return result;
}
And with this code your test case passes...
PS: And don't forget to implement equals()!
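To see why the multiplier matters for this particular test, here is a small sketch that evaluates both generated formulas by hand. The class and method names are made up for illustration, and the Eclipse variant leaves out the getOuterType() term on the assumption that it is the same constant for every instance in the test. With 17 the hash reduces to 1445 + 17 * int1 + int2, so (1, 0) and (0, 17) collide; with 31 the second field would have to reach 31 before that happens, which the test's range of 0..17 never does.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the original test.
final class MultiplierDemo {
    // Same arithmetic as the NetBeans-generated hashCode above.
    static int hash17(int int1, int int2) {
        int hash = 5;
        hash = 17 * hash + int1;
        hash = 17 * hash + int2;
        return hash;
    }

    // Eclipse-style arithmetic without the outer-type term (a simplification).
    static int hash31(int int1, int int2) {
        int result = 1;
        result = 31 * result + int1;
        result = 31 * result + int2;
        return result;
    }

    // Counts distinct hash values over the same ranges the test uses.
    static int distinct(boolean use31) {
        Set<Integer> seen = new HashSet<>();
        for (int outer = 0; outer < 18; outer++) {
            for (int inner = 0; inner < 2; inner++) {
                seen.add(use31 ? hash31(inner, outer) : hash17(inner, outer));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(hash17(1, 0) == hash17(0, 17));             // true: both are 1462
        System.out.println(distinct(false) + " vs " + distinct(true)); // 35 vs 36
    }
}
```

This only shows that the test's ranges happen to fit under 31; widen the second field's range and the larger multiplier collides as well.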
There is no way to eliminate hash collisions entirely. Your approach is basically the preferred one to minimize collisions.
Creating a hash method with zero collisions is impossible. The idea of a hash method is you're taking a large set of objects and mapping it to a smaller set of integers. The best you can do is minimize the number of collisions you get within a subset of your objects.
As others have said, it's more important to minimize collisions than to eliminate them -- especially since you didn't say how many buckets you're aiming for. It's going to be much easier to have zero collisions with 5 items in 1000 buckets than if you have 5 items in 2 buckets! And even if there are plenty of buckets, your collisions could look very different with 1000 buckets vs 1001.
Another thing to note is that there's a good chance that the hash you provide won't even be the one the HashMap eventually uses. If you take a look at the OpenJDK HashMap code, for instance, you'll see that your keys' hashCodes are put through a private hash method (line 264 in that link) which re-hashes them. So, if you're going through the trouble of creating a carefully constructed custom hash function to reduce collisions (rather than just a simple, auto-generated one), make sure you also understand who's going to use it, and how.
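The exact re-hash depends on the JDK version, so the linked source may look different. As one concrete case, java.util.HashMap in Java 8 and later XORs the upper 16 bits of the hash code into the lower 16 before masking with the table length. The sketch below mirrors that step; the class and method names are illustrative, not the JDK's.

```java
// Sketch of how a Java 8+ HashMap turns a hashCode into a bucket index.
final class SpreadDemo {
    // Mirrors the spreading step: fold the high half into the low half.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // tableLength is assumed to be a power of two, as it is in HashMap.
    static int bucketIndex(int hashCode, int tableLength) {
        return spread(hashCode) & (tableLength - 1);
    }

    public static void main(String[] args) {
        // Two hash codes that differ only in their upper bits.
        int a = 0x10000;
        int b = 0x20000;
        System.out.println((a & 15) + " " + (b & 15));                     // 0 0
        System.out.println(bucketIndex(a, 16) + " " + bucketIndex(b, 16)); // 1 2
    }
}
```

Without the spreading step both values would land in bucket 0 of a 16-slot table, because only the low bits take part in the mask.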
So for a given prime number 31, how can I write a hash function for a string parameter?
Here is my attempt.
private int hash(String key) {
    int c = 31;
    int hash = 0;
    for (int i = 0; i < key.length(); i++) {
        int ascii = key.charAt(i);
        hash += c * hash + ascii;
    }
    // sizetable is an integer which is declared outside. You can see it as a table.length().
    return (hash % sizetable);
}
So, since I can not run any other function in my work and I need to be sure about the process here, I need your answers and help! Thank you so much.
Your implementation looks quite similar to what is documented as the standard String.hashCode() implementation, which also uses 31 as its prime factor, so it should be good enough.
I just would not assign 31 to a local variable, but declare a private static final field or use it directly as a magic number, which is not OK in general but might be OK in this case.
Additionally, you should add some tests, if you already know about the concept of unit tests, to show that your method gives different hashes for different strings. And pick the samples cleverly, so they are different (for the case of the homework ;)
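Two details in the attempt are worth checking against the documented String.hashCode() formula. First, hash += c * hash + ascii adds the old value on top of the product, so it effectively multiplies by 32 rather than 31. Second, % can return a negative index once the int overflows. A minimal sketch of the documented form, with the table size passed in as a parameter (an assumption, since sizetable is declared elsewhere in the question):

```java
final class StringHashDemo {
    private static final int MULTIPLIER = 31;

    // Polynomial hash in the style documented for String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int hash(String key, int tableSize) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = MULTIPLIER * hash + key.charAt(i); // '=' rather than '+='
        }
        // floorMod keeps the index in [0, tableSize) even after int overflow
        return Math.floorMod(hash, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(hash("ab", 1000)); // 97 * 31 + 98 = 3105, so 105
    }
}
```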
Assume I have a Java BitSet. I now need to generate combinations of the BitSet such that only bits which are set can be flipped, i.e. I only need combinations of the bits which are set.
E.g. BitSet - 1010, Combinations - 1010, 1000, 0010, 0000
BitSet - 1100, Combinations - 1100, 1000, 0100, 0000
I can think of a few solutions, e.g. I can take combinations of all 4 bits and then XOR the combinations with the original BitSet. But this would be very resource-intensive for large sparse BitSets. So I was looking for a more elegant solution.
It appears that you want to get the power set of the bit set. There is already an answer here about how to get the power set of a Set<T>. Here, I will show a modified version of the algorithm shown in that post, using BitSets:
private static Set<BitSet> powerset(BitSet set) {
    Set<BitSet> sets = new HashSet<>();
    if (set.isEmpty()) {
        sets.add(new BitSet(0));
        return sets;
    }
    int head = set.nextSetBit(0);
    BitSet rest = set.get(0, set.size());
    rest.clear(head);
    for (BitSet s : powerset(rest)) {
        BitSet newSet = s.get(0, s.size());
        newSet.set(head);
        sets.add(newSet);
        sets.add(s);
    }
    return sets;
}
You can perform the operation in a single linear pass instead of recursion, if you realize the integer numbers are a computer’s intrinsic variant of “on off” patterns and iterating over the appropriate integer range will ultimately produce all possible permutations. The only challenge in your case, is to transfer the densely packed bits of an integer number to the target bits of a BitSet.
Here is such a solution:
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

static List<BitSet> powerset(BitSet set) {
    int nBits = set.cardinality();
    if (nBits > 30) throw new OutOfMemoryError(
        "Not enough memory for " + BigInteger.ONE.shiftLeft(nBits) + " BitSets");
    int max = 1 << nBits;
    // indices of the one bits in the source, in ascending order
    int[] targetBits = set.stream().toArray();
    List<BitSet> sets = new ArrayList<>(max);
    for (int onOff = 0; onOff < max; onOff++) {
        BitSet next = new BitSet(set.size());
        for (int bitsToSet = onOff, ix = 0; bitsToSet != 0; ix++, bitsToSet >>>= 1) {
            if ((bitsToSet & 1) == 0) {
                // jump over a run of zero bits in one step
                int skip = Integer.numberOfTrailingZeros(bitsToSet);
                ix += skip;
                bitsToSet >>>= skip;
            }
            next.set(targetBits[ix]);
        }
        sets.add(next);
    }
    return sets;
}
It uses an int value for the iteration, which is already enough to represent all permutations that can ever be stored in one of Java’s built-in collections. If your source BitSet has 32 one bits, the 2³² possible combinations do not only require a hundred GB of heap, but also a collection supporting 2³² elements, i.e. a size not representable as int.
So the code above terminates early if the number exceeds the capabilities, without even trying. You could rewrite it to use a long or even BigInteger instead, to keep it busy in such cases, until it will fail with an OutOfMemoryError anyway.
For the working cases, the int solution is the most efficient variant.
Note that the code returns a List rather than a HashSet to avoid the costs of hashing. The values are already known to be unique and hashing would only pay off if you want to perform lookups, i.e. call contains with another BitSet. But to test whether an existing BitSet is a permutation of your input BitSet, you wouldn’t even need to generate all permutations, a simple bit operation, e.g. andNot would tell you that already. So for storing and iterating the permutations, an ArrayList is more efficient.
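The andNot check mentioned above could look roughly like this. The class and method names are made up for illustration; the only BitSet calls used are clone(), andNot() and isEmpty().

```java
import java.util.BitSet;

final class SubsetCheck {
    // True if every bit set in candidate is also set in source, i.e. candidate
    // is one of the combinations that powerset(source) would produce.
    static boolean isCombinationOf(BitSet candidate, BitSet source) {
        BitSet extra = (BitSet) candidate.clone(); // andNot mutates, so work on a copy
        extra.andNot(source);                      // keeps only bits missing from source
        return extra.isEmpty();
    }

    public static void main(String[] args) {
        BitSet source = new BitSet();
        source.set(1);
        source.set(3);
        BitSet ok = new BitSet();
        ok.set(3);
        BitSet bad = new BitSet();
        bad.set(2);
        System.out.println(isCombinationOf(ok, source));  // true
        System.out.println(isCombinationOf(bad, source)); // false
    }
}
```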
I am implementing a hash table (as per requirements). It works OK with a small input, but unfortunately it's way too slow when dealing with a large input. I tried BufferedInputStream but it doesn't make any difference. Basically I implemented it following the logic below. Any ideas how I can improve the speed? Is there a specific function that causes the bad performance? Or might we need to close the Scanner?
int[] table = new int[30000]; // create an array as the table
Scanner scan = new Scanner(System.in); // use scanner to read the input file
while (scan.hasNextLine()) {
    // read one line at a time, and a sequence of ints into an array list called keys
    // function used here is string.split(" ");
}
hashFunction {
    // use middle-squaring on each element of the array list keys
    // Math.pow() and % 30000, which is the table size, to generate the hash value
    // assign table[hashvalue] = value
}
So first, you should know which part of the program is slow. Optimizing everything is a stupid idea; optimizing the fast part is even worse.
Math.pow() and % 30000, which is the table size
This is pretty wrong.
Never use floating point operations for things like hashing. It's slow and badly distributed.
Never use a table size which is neither a power of two nor prime.
You failed to tell us anything about what you're hashing and why... so let's assume you need to map a pair of two ints into the table.
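class IntPair {
    private int x;
    private int y;

    @Override
    public int hashCode() {
        // the multiplier must be odd for good results
        // its exact value doesn't matter much, but it mustn't equal your table size; ideally, it should be co-prime
        return 54321 * x + y;
    }

    @Override
    public boolean equals(Object o) {
        // two pairs are equal when both components match
        if (!(o instanceof IntPair)) return false;
        IntPair other = (IntPair) o;
        return x == other.x && y == other.y;
    }
}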
//// Prime table size. The division is slow, but it works slightly better than power of two.
int[] table = new int[30011]; // this is a prime

int hashCodeToIndex(int hashCode) {
    int nonNegative = hashCode & Integer.MAX_VALUE;
    return nonNegative % table.length;
}

//// Power of two table size. No division, faster.
int[] table2 = new int[1 << 15]; // this is 2**15, i.e., 32768

int smear(int hashCode) {
    // doing nothing may be good enough, if the hashCode is well distributed
    // otherwise, see e.g., https://github.com/google/guava/blob/c234ed7f015dc90d0380558e663f57c5c445a288/guava/src/com/google/common/collect/Hashing.java#L46
    return hashCode;
}

int hashCodeToIndex(int hashCode) {
    // the "&" cleans all unwanted bits
    return smear(hashCode) & (table2.length - 1);
}

// an alternative, explanation upon request
int hashCodeToIndex2(int hashCode) {
    return smear(hashCode) >>> 17;
}
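The answer leaves the last alternative unexplained. One plausible reading: an unsigned shift by 17 keeps the top 15 bits of the 32-bit value, which is exactly an index into a table of 1 << 15 slots. That choice pairs well with a multiplicative smear, because in a product the high bits depend on all lower input bits while the low bits do not. The sketch below assumes such a smear; the constant and the names are illustrative and not taken from the answer.

```java
final class HighBitsIndex {
    static final int TABLE_BITS = 15; // table of 1 << 15 slots, as above

    // Hypothetical smear: multiplication by a large odd constant.
    static int smear(int hashCode) {
        return hashCode * 0x9E3779B9;
    }

    // Keep the top TABLE_BITS bits: 32 - 15 = 17, hence ">>> 17".
    static int indexFromHighBits(int hashCode) {
        return smear(hashCode) >>> (Integer.SIZE - TABLE_BITS);
    }

    public static void main(String[] args) {
        for (int h = 0; h < 4; h++) {
            System.out.println(h + " -> " + indexFromHighBits(h));
        }
    }
}
```

The unsigned shift also makes the result non-negative without a separate masking step.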
I need a hashCode implementation in Java which ignores the order of the fields in my class Edge. It should be possible that Node first could be Node second, and second could be Node first.
Here is my method, which depends on the order:
public class Edge {
    private Node first, second;

    @Override
    public int hashCode() {
        int hash = 17;
        int hashMultiplikator = 79;
        hash = hashMultiplikator * hash
                + first.hashCode();
        hash = hashMultiplikator * hash
                + second.hashCode();
        return hash;
    }
}
Is there a way to compute a hash which is the same for the following Edges, but otherwise unique?
Node n1 = new Node("a");
Node n2 = new Node("b");
Edge ab = new Edge(n1,n2);
Edge ba = new Edge(n2,n1);
ab.hashCode() == ba.hashCode() should be true.
You can use some sort of commutative operation instead of what you have now, like addition:
@Override
public int hashCode() {
    int hash = 17;
    int hashMultiplikator = 79;
    int hashSum = first.hashCode() + second.hashCode();
    hash = hashMultiplikator * hash * hashSum;
    return hash;
}
I'd recommend that you still use the multiplier since it provides some entropy to your hash code. See my answer here, which says:
Some good rules to follow for hashing are:
Mix up your operators. By mixing your operators, you can cause the results to vary more. Using simply x * y in this test, I had a very large number of collisions.
Use prime numbers for multiplication. Prime numbers have interesting binary properties that cause multiplication to be more volatile.
Avoid using shift operators (unless you really know what you're doing). They insert lots of zeroes or ones into the binary of the number, decreasing volatility of other operations and potentially even shrinking your possible number of outputs.
To solve your problem you have to combine the hashCodes of both components.
An example could be:
@Override
public int hashCode() {
    int prime = 17;
    return prime * (first.hashCode() + second.hashCode());
}
Please check whether this matches your requirements. A multiplication or an XOR instead of an addition would also be possible.
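An order-independent hashCode only behaves as expected in a HashSet or HashMap if equals ignores the order too. Here is a minimal sketch of both together. It uses a type parameter in place of the question's Node class so that it compiles on its own, and the class name is made up. Note that the hash is not unique: edges such as (a, d) and (b, c) can still collide whenever the sums of their node hashes match.

```java
import java.util.Objects;

final class UndirectedEdge<N> {
    private final N first, second;

    UndirectedEdge(N first, N second) {
        this.first = Objects.requireNonNull(first);
        this.second = Objects.requireNonNull(second);
    }

    @Override
    public int hashCode() {
        // Addition is commutative, so swapping the endpoints gives the same value.
        return 31 * (first.hashCode() + second.hashCode());
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UndirectedEdge)) return false;
        UndirectedEdge<?> other = (UndirectedEdge<?>) o;
        // Equal if the endpoints match in either order.
        return (first.equals(other.first) && second.equals(other.second))
            || (first.equals(other.second) && second.equals(other.first));
    }

    public static void main(String[] args) {
        UndirectedEdge<String> ab = new UndirectedEdge<>("a", "b");
        UndirectedEdge<String> ba = new UndirectedEdge<>("b", "a");
        System.out.println(ab.hashCode() == ba.hashCode()); // true
        System.out.println(ab.equals(ba));                  // true
    }
}
```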
I have a list of elements (let's say integers), and I need to make all possible 2-pair comparisons. My approach is O(n^2), and I am wondering if there is a faster way. Here is my implementation in Java.
public class Pair {
    public int x, y;

    public Pair(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

public List<Pair> getAllPairs(List<Integer> numbers) {
    List<Pair> pairs = new ArrayList<Pair>();
    int total = numbers.size();
    for (int i = 0; i < total; i++) {
        int num1 = numbers.get(i).intValue();
        for (int j = i + 1; j < total; j++) {
            int num2 = numbers.get(j).intValue();
            pairs.add(new Pair(num1, num2));
        }
    }
    return pairs;
}
Please note that I don't allow self-pairing, so there should be ((n(n+1))/2) - n possible pairs. What I have currently works, but as n increases, it is taking me an unbearably long time to get the pairs. Is there any way to turn the O(n^2) algorithm above into something sub-quadratic? Any help is appreciated.
By the way, I also tried the algorithm below, but when I benchmark it, empirically, it performs worse than what I had above. I had thought that avoiding an inner loop would speed things up. Shouldn't the algorithm below be faster? I would think that it's O(n)? If not, please explain and let me know. Thanks.
public List<Pair> getAllPairs(List<Integer> numbers) {
    List<Pair> pairs = new ArrayList<Pair>();
    int n = numbers.size();
    if (n < 2) {
        return pairs; // no pairs possible
    }
    int i = 0;
    int j = i + 1;
    while (true) {
        int num1 = numbers.get(i);
        int num2 = numbers.get(j);
        pairs.add(new Pair(num1, num2));
        j++;
        if (j >= n) {
            i++;
            j = i + 1;
        }
        if (i >= n - 1) {
            break;
        }
    }
    return pairs;
}
Well, you can't, right?
The result is a list with n*(n-1)/2 elements. No matter what those elements are, just populating this list (say with zeros) takes O(n^2) time, since n*(n-1)/2 = O(n^2)...
You cannot make it sub-quadratic, because as you said, the output is itself quadratic, and to create it you need at least #elements_in_output ops.
However, you could do some "cheating" and create your list on the fly:
You can create a class CombinationsGetter that implements Iterable<Pair>, and implement its Iterator<Pair>. This way, you will be able to iterate over the elements on the fly, without creating the list first, which might decrease latency for your application.
Note: It will still be quadratic! The time to generate the list on the fly will just be distributed between more operations.
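A minimal sketch of such a CombinationsGetter, assuming the input list supports fast random access (e.g. an ArrayList). Pair is nested here only so that the example compiles on its own; everything beyond the class name suggested above is illustrative.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class CombinationsGetter implements Iterable<CombinationsGetter.Pair> {
    static final class Pair {
        final int x, y;
        Pair(int x, int y) { this.x = x; this.y = y; }
    }

    private final List<Integer> numbers;

    CombinationsGetter(List<Integer> numbers) {
        this.numbers = numbers;
    }

    @Override
    public Iterator<Pair> iterator() {
        return new Iterator<Pair>() {
            private int i = 0, j = 1; // invariant: i < j

            @Override
            public boolean hasNext() {
                return j < numbers.size();
            }

            @Override
            public Pair next() {
                if (!hasNext()) throw new NoSuchElementException();
                Pair p = new Pair(numbers.get(i), numbers.get(j));
                if (++j >= numbers.size()) { // row finished, move on to the next i
                    i++;
                    j = i + 1;
                }
                return p;
            }
        };
    }
}
```

A caller can then write for (CombinationsGetter.Pair p : new CombinationsGetter(numbers)) { ... } and start working on the first pair immediately, with only one Pair alive at a time unless the caller keeps references.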
EDIT:
Another solution, which is faster than the naive approach, is multithreading.
Create a few threads; each will get its "slice" of the data, generate the relevant pairs, and create its own partial list.
Later, you can use ArrayList.addAll() to combine those different lists into one.
Note: though the complexity is still O(n^2), it is likely to be much faster, since the creation of pairs is done in parallel, and ArrayList.addAll() is implemented much more efficiently than trivially inserting elements one by one.
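As a rough sketch of the parallel idea, here is a version that uses a parallel stream instead of hand-managed threads and addAll(); the stream framework splits the index range and merges the partial lists itself. Pairs are represented as two-element int arrays to keep the example self-contained, and the class name is made up. Whether this actually beats the plain nested loop depends on n, the number of cores, and allocation cost, so it is worth measuring.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelPairs {
    // Returns every pair (numbers[i], numbers[j]) with i < j as a two-element array.
    static List<int[]> allPairs(List<Integer> numbers) {
        int n = numbers.size();
        return IntStream.range(0, n)
                .parallel()   // split the outer index range across worker threads
                .boxed()
                .flatMap(i -> IntStream.range(i + 1, n)
                        .mapToObj(j -> new int[] { numbers.get(i), numbers.get(j) }))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allPairs(List.of(1, 2, 3, 4, 5)).size()); // 10
    }
}
```

The rows are uneven (the first index produces n - 1 pairs, the last produces none), so the slices do not carry equal work.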
EDIT2:
Your second code is still O(n^2), even though it is a "single loop": the loop itself will repeat O(n^2) times. Have a look at your variable i. It increases only when j reaches n, and j is reset to i+1 when that happens. So it will result in (n-1) + (n-2) + ... + 1 iterations, which is the sum of an arithmetic progression and gets us back to O(n^2) as expected.
We cannot get better than O(n^2), because we are trying to create O(n^2) distinct Pair objects.