Generating Random Hash Functions for LSH Minhash Algorithm - java

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case) and run any number of integers through them (2000 at the moment).
In order to do that, I've been generating random numbers a, b, and c (from the range 1 - 2001) for each of the 240 hash functions. Then, my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.
Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?
This post was asking a similar question, but I'm still somewhat confused by the wording of the answer: Minhash implementation how to find hash functions for permutations

When I was working with Bloom filters a few years ago, I ran across an article that describes how to generate multiple hash functions very simply, with a minimum of code. The method he describes works very well. See Less Hashing, Same Performance: Building a Better Bloom Filter.
The basic idea is to create two hash functions, call them h1 and h2, with which you can then simulate multiple hash functions, g1 through gk, using the formula:
gi(x) = h1(x) + i*h2(x)
i varies from 1 to k (the number of hash functions you want).
The paper is well worth reading, even if you decide not to implement his idea. Although after reading it I can't imagine not wanting to implement it. It made my Bloom filter code a whole lot more tractable and didn't negatively impact performance.
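For example, a minimal Java sketch of that double-hashing trick (the base functions h1 and h2 below are just illustrative placeholders, not the ones from the paper; plug in whatever two hashes you already have):

// Simulate k hash functions g_1..g_k from two base hashes, per the paper:
// g_i(x) = h1(x) + i * h2(x), reduced to a bucket index.
static int h1(int x) {
    return x * 0x9E3779B1;                      // placeholder mixing step
}

static int h2(int x) {
    return ((x ^ (x >>> 16)) * 0x85EBCA6B) | 1; // placeholder; forced odd
}

static int g(int i, int x, int numBuckets) {
    int combined = h1(x) + i * h2(x);
    return Math.floorMod(combined, numBuckets); // non-negative index
}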

So the method that I described above was almost correct. The numbers a and b should be randomly generated. However, c needs to be a prime number that is slightly larger than the maximum possible value of x. Once those numbers have been chosen, finding hash value h using h = ((a*x)+b) % c is the standard, accepted way to generate hash functions.
Also, a and b should be random numbers from the range 1 to c-1.
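To make that concrete, here is a small sketch of one such randomly drawn hash function (the prime 2003 is just an example choice slightly above the maximum x of 2000 from the question):

import java.util.Random;

// One random hash function of the family h(x) = (a*x + b) mod p,
// with p a prime slightly larger than the maximum possible x
// and a, b drawn from the range [1, p-1].
class RandomHash {
    private final long a, b, p;

    RandomHash(Random rng, int prime) {
        this.p = prime;
        this.a = 1 + rng.nextInt(prime - 1);
        this.b = 1 + rng.nextInt(prime - 1);
    }

    int hash(int x) {
        return (int) ((a * x + b) % p);
    }
}

// Usage sketch: 240 independent hash functions for MinHash, p = 2003
// Random rng = new Random();
// RandomHash[] fns = new RandomHash[240];
// for (int i = 0; i < fns.length; i++) fns[i] = new RandomHash(rng, 2003);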

Related

Is there a way to pow 2 BigInteger Numbers in java?

I have to raise a BigInteger to the power of another BigInteger.
Unfortunately, only BigInteger.pow(int) is available.
I have no clue how I can solve this problem.
I have to raise a BigInteger to the power of another BigInteger.
No, you don't.
You read a crypto spec and it seemed to say that. But that's not what it said; you didn't read carefully enough. The mathematical 'universe' that the math in the paper / spec you're reading operates in is different from normal math. It's a modulo-space. All operations are implicitly performed modulo X, where X is some number the crypto algorithm explains.
You can do that just fine.
Alternatively, the spec is quite clear and says something like: C = (A^B) % M, and you've broken that down into steps (... first, I must calculate A to the power of B. I'll worry about what the % M part is all about later). That's not how that works - you can't split that operation into parts. (A^B) % M is quite doable, and has its own efficient algorithm. (A^B) on its own is simply not calculable without a few years' worth of the planet's entire energy and GDP output.
The reason I know that must be what you've been reading, is because (A ^ B) % M is a common operation in crypto. (Well, that, and the simple fact that A^B can't be done).
Just to be crystal clear: when I say impossible, I mean it in the same way 'travelling faster than the speed of light' is impossible. It's a law in the physics sense of the word: if you really just want to do A^B, not in a modulo-space, where B is so large it doesn't fit in an int, a computer cannot calculate it, and the result would be gigabytes large at the very least. An int can hold about 9 digits' worth. Just for fun, imagine doing X^Y where both X and Y are 20-digit numbers.
The result would have roughly Y * log10(X) digits, i.e. about 10^20 * 20 = 2 * 10^21 digits.
That's roughly equal to the total amount of disk space available worldwide. 10^12 bytes is a terabyte. You're asking to calculate a number where, forget about calculating it, merely storing it requires a thousand million hard disks of 1TB each.
Thus, I'm 100% certain that you do not want what you think you want.
TIP: If you can't follow the math (which is quite bizarre; it's not like you get modulo-space math in your basic AP math class!), generally rolling your own implementation of a crypto algorithm isn't going to work out. The problem with crypto is, if you mess up, often a unit test cannot catch it. No; someone will hack your stuff and then you know, and that's a high price to pay. Rely on experts to build the algorithm, spend your time ensuring the protocol is correct (which is still quite difficult to get right, don't take that lightly!). If you insist, make dang sure you have a heap of plaintext+keys / encrypted (or plaintext / hashed, or whatever it is you're doing) pairs to test against, and assume that whatever you wrote, even if it passes those tests, is still insecure because e.g. it is trivial to leak the key out of your algorithm using timing attacks.
Since you want to use it in a modulo operation with a prime number anyway, as @Progman said in the comments, you can use modPow().
Below is an example:
import java.math.BigInteger;

// Create BigInteger objects
BigInteger biginteger1, biginteger2, exponent, result;
// prime number used as the modulus
int pNumber = 5;
// Initializing all BigInteger objects
biginteger1 = new BigInteger("23895");
biginteger2 = BigInteger.valueOf(pNumber);
exponent = new BigInteger("15");
// Perform the modPow operation: (biginteger1 ^ exponent) % biginteger2
result = biginteger1.modPow(exponent, biginteger2);
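The key point is that modPow never materializes biginteger1^exponent; it reduces modulo biginteger2 as it goes (modular exponentiation), which is why it stays fast even when the exponent is itself a huge BigInteger.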

Even distribution of long integer identifiers into buckets

I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that being the criterion, is there a difference between using the long integer as-is and doing modulo (n) on it, versus generating a hashCode for the string version of the long integer (to improve the distribution) and then doing modulo (n) on that hash code? Is the additional string conversion necessary to get the uniform spread via the hash code?
Since I got feedback that my question does not have enough background information, I am adding some more.
The identifiers are basically auto-incrementing numeric row identifiers that are autogenerated in a database representing an item id. The reason for pockets of missing identifiers is because of deletes.
The identifiers themselves are long integers.
The identifiers (items) themselves are in the order of (10s-100)+ million in some cases and in the order of thousands in some cases.
Only in the case where the identifiers are in the order of millions do I really want to spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).
I was wondering whether, because items get deleted, I should resort to Long.toString().hashCode() to get the uniform spread instead of using the long value directly. I had a feeling that doing toString().hashCode() is not going to fetch me much, and I also did not like the fact that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation is documented and has been stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
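For the long identifiers in the question, the same idea spelled out as a small sketch (Long.hashCode folds the two 32-bit halves of the long together; Math.floorMod keeps the index non-negative for negative hash values):

// Bucket a long identifier into n buckets.
// Long.hashCode(id) is (int) (id ^ (id >>> 32)), so for ids that fit in an
// int this reduces to id % n, just as the answer above says for Integer.
static int bucketFor(long id, int n) {
    return Math.floorMod(Long.hashCode(id), n);
}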
Your question as it stands cannot be answered. @slim's answer is the best you will get, because crucial information is missing from your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly higher than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly higher than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets, otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get items from [0-199], [400-599] and [800-999] while the other 200 buckets get items only from [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
If they are not uniformly distributed, then, since the modulo operator doesn't change the distribution except by wrapping it, the output distribution is not uniform either.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution. The point of the hash function is to break the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming not too large inputs). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
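In Java that might look like the following sketch (the constant is the one from the formula above; the mask keeps the intermediate value in the unsigned 32-bit range before reducing to n buckets, where n is whatever bucket count you use):

// Knuth's multiplicative method: hash(input) = input * 2654435761 mod 2^32,
// then reduced to a bucket index.
static int knuthBucket(long input, int n) {
    long h = (input * 2654435761L) & 0xFFFFFFFFL; // mod 2^32
    return (int) (h % n);
}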
I found that this gist has an interesting compilation of diverse hashing functions and their characteristics; you might pick the one that best matches the characteristics of your dataset.

Can anybody explain how java design HashMap's hash() function? [duplicate]

After reading the JDK's source code, I find HashMap's hash() function interesting. Its source code looks like this:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
The parameter h is the hashCode of an object that was put into the HashMap. How does this method work, and why? Why can this method defend against poor hashCode functions?
Hashtable uses the 'classical' approach of prime numbers: to get the 'index' of a value, you take the hash of the key and perform the modulus against the size. Taking a prime number as size, gives (normally) a nice spread over the indexes (depending on the hash as well, of course).
HashMap uses a 'power of two' approach, meaning the sizes are a power of two. The reason is that it's supposed to be faster than prime number calculations. However, since a power of two is not a prime number, there would be more collisions, especially with hash values having the same lower bits.
Why? The modulus performed against the size to get the (bucket/slot) index is simply calculated by: hash & (size-1) (which is exactly what's used in HashMap to get the index!). That's basically the problem with the 'power-of-two' approach: if the length is limited, e.g. 16, the default value of HashMap, only the last bits are used and hence, hash values with the same lower bits will result in the same (bucket) index. In the case of 16, only the last 4 bits are used to calculate the index.
That's why an extra hash is calculated: it basically shifts the higher bit values down and combines them with the lower bit values. The reason for the specific numbers 20, 12, 7 and 4, I don't really know. They used to be different (in Java 1.5 or so, the hash function was a little different). I suppose there's more advanced literature available. You might find more info about why they use the numbers they use in all kinds of algorithm-related literature, e.g.
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
http://mitpress.mit.edu/books/introduction-algorithms
http://burtleburtle.net/bob/hash/evahash.html#lookup uses different algorithms depending on the length (which makes some sense).
http://www.javaspecialists.eu/archive/Issue054.html is probably interesting as well. Check the reaction of Joshua Bloch near the bottom of the article: "The replacement secondary hash function (which I developed with the aid of a computer) has strong statistical properties that pretty much guarantee good bucket distribution." So, if you ask me, the numbers come from some kind of analysis performed by Josh himself, probably assisted by who knows who.
So: a power of two gives faster calculation, but necessitates an additional hash calculation in order to get a nice spread over the slots/buckets.
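Putting those two steps together, a minimal sketch of how the JDK 6 HashMap turns a key's hashCode into a bucket index (hash() is the method quoted above; indexFor mirrors the mask-based modulus described here):

// Supplemental hash, as quoted from the JDK 6 source.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

// Index into a power-of-two sized table: equivalent to a modulus,
// but computed with a mask because length is a power of two.
static int indexFor(int hash, int length) {
    return hash & (length - 1);
}

// Example: bucket for a key in a table of the default size 16
// int bucket = indexFor(hash(key.hashCode()), 16);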

Explanation of the constants used while calculating hashcode value of java.util.hash

Can someone explain the significance of these constants and why they are chosen?
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
source: java-se6 library
Understanding what makes for a good hash function is tricky, as there are in fact a great many different functions that are used and for slightly different purposes.
Java's hash tables work as follows:
1. They ask the key object to produce its hash code. The implementation of the hashCode() method is likely to be of distinctly variable quality (in the worst case, returning a constant value!) and will definitely not be adapted to the particular hash table you're working with.
2. They then use the above function to mix the bits up a bit, so that information present in the high bits also gets moved down to the low bits. This is important because of the next step.
3. They take the mod of the hash code (w.r.t. the number of hash table array entries) to get the index into the array of hash table chains. There's a distinct possibility that the hash table array will have a size that is a power of 2, so the mixing down of the bits in step 2 is important to ensure that the high bits don't just get thrown away.
4. They then traverse the chain until they get to the entry with an equal key (according to the equals() method).
To complete the picture, the number of entries in the hash table array is non-constant; if the chains get too long the array gets replaced with a new larger array and everything gets rehashed. That's relatively fast and has good performance implications for normal use patterns (e.g., lots of put()s followed by lots of get()s).
The actual constants used are fairly arbitrary (and are probably chosen by experiment with some simple corpus including things like large numbers of Integer and String values) but their purpose is not: getting the information in the whole value spread to most of the low bits in the value ensures that such information as is present in the output of the hashCode() is used as well as possible.
(You wouldn't do this with perfect hashing or cryptographic hashing; despite the similar names, they have very different implementation strategies. The former requires knowledge of the key space so that collisions are avoided/reduced, and the latter needs information to be moved about in all directions, not just to the low bits.)
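A small demo of why that bit-spreading matters (the three sample keys below differ only above the low 16 bits, so without the extra mixing they would all land in slot 0 of a 16-entry table):

// Keys whose hashCodes differ only in the high bits collide in a small
// power-of-two table unless the supplemental hash mixes those bits down.
public class MixDemo {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int[] keys = {0x10000, 0x20000, 0x30000}; // identical low 16 bits
        for (int k : keys) {
            System.out.println("raw slot: " + (k & 15)
                    + ", mixed slot: " + (hash(k) & 15));
        }
    }
}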
I have also wondered about such "magic" numbers. As far as I know they are magic numbers.
It has been proven by extensive testing that odd and prime numbers have interesting properties that can be exploited in hashing (avoiding primary/secondary clustering etc.).
I believe that most of the numbers came out of research and testing that showed, statistically, that they give good distributions. Why specifically these numbers do that, I have no idea, but I have the impression (hopefully colleagues here can correct me if I am way off) that not even the implementers know exactly why these specific numbers exhibit these qualities.

A good hash function to use in interviews for integer numbers, strings?

I have come across situations in an interview where I needed to use a hash function for integer numbers or for strings. In such situations, which ones should we choose? I've gotten this wrong before because I end up choosing ones that generate a lot of collisions, and hash functions tend to be so mathematical that you cannot recall them in an interview. Are there any general recommendations, so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both inputs in an "interview situation"?
Here is a simple recipe from Effective Java, page 33:
1. Store some constant nonzero value, say 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      i. If the field is a boolean, compute (f ? 1 : 0).
      ii. If the field is a byte, char, short, or int, compute (int) f.
      iii. If the field is a long, compute (int) (f ^ (f >>> 32)).
      iv. If the field is a float, compute Float.floatToIntBits(f).
      v. If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in step 2.a.iii.
      vi. If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2.a into result as follows: result = 31 * result + c;
3. Return result.
When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition! If equal instances have unequal hash codes, figure out why and fix the problem.
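As an illustration, here is the recipe applied to a small value class (the class and field names are made up for the example):

// Hypothetical value class following the recipe above.
final class PhoneNumber {
    private final short areaCode;
    private final short prefix;
    private final short lineNumber;

    PhoneNumber(short areaCode, short prefix, short lineNumber) {
        this.areaCode = areaCode;
        this.prefix = prefix;
        this.lineNumber = lineNumber;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PhoneNumber)) return false;
        PhoneNumber p = (PhoneNumber) o;
        return p.areaCode == areaCode && p.prefix == prefix && p.lineNumber == lineNumber;
    }

    @Override
    public int hashCode() {
        int result = 17;                  // step 1: nonzero constant
        result = 31 * result + areaCode;  // step 2: fold in each significant field
        result = 31 * result + prefix;
        result = 31 * result + lineNumber;
        return result;                    // step 3
    }
}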
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to itself - this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
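For that case you would normally reach for the JDK's MessageDigest rather than write anything yourself; a minimal sketch:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// SHA-256 of a string via the JDK's built-in MessageDigest.
static byte[] sha256(String s) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    return md.digest(s.getBytes(StandardCharsets.UTF_8));
}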
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
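(This is essentially one step of an XOR-shift generator; note that an input of 0 hashes to 0.)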
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note the above is not cryptographically secure, but it will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure to an integer :-)
For integers, I usually go with k % p where p is the size of the hash table and a prime number, and for strings I choose the hashCode from the String class. Is this sufficient for an interview with a major tech company? – phoenix
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive, because a lot of the hashing functions I first saw used bit-shifts and XOR, and a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention because it achieves this ideal and is easy to remember, though the memory usage may make it slower than mathematical approaches (it could be faster too, depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and an int[3][256] table, table[0][r] ^ table[1][g] ^ table[2][b] is a great int-sized hash value - indeed "perfect" if inputs are randomly scattered through the int values (rather than, say, incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting tables and bit-shifting the values etc.
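A sketch of that table-lookup approach in Java for the 24-bit RGB example (tables filled with random ints once, up front):

import java.util.Random;

// Tabulation-style hash for a 24-bit RGB value: one 256-entry table of
// random ints per byte, XORed together, as described above.
class RgbTableHash {
    private final int[][] table = new int[3][256];

    RgbTableHash(long seed) {
        Random rng = new Random(seed);
        for (int[] t : table)
            for (int i = 0; i < 256; i++)
                t[i] = rng.nextInt();
    }

    int hash(int rgb) {
        int r = (rgb >>> 16) & 0xFF;
        int g = (rgb >>> 8) & 0xFF;
        int b = rgb & 0xFF;
        return table[0][r] ^ table[1][g] ^ table[2][b];
    }
}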
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....
