So for a given prime number 31, how can I write a hash function for a string parameter?
Here is my attempt.
private int hash(String key){
int c = 31;
int hash = 0;
for (int i = 0; i < key.length(); i++ ) {
int ascii = key.charAt(i);
hash += c * hash + ascii;
}
return (hash % sizetable);} // sizetable is an integer which is declared outside. You can see it as a table.length().
So, since I can not run any other function in my work and I need to be sure about the process here, I need your answers and help! Thank you so much.
Your implementation looks quite similar to the documented String.hashCode() implementation, which also uses 31 as its prime factor, so it should be good enough.
I just would not assign 31 to a local variable; declare a private static final field instead, or use it directly as a magic number (not OK in general, but might be OK in this case).
Additionally, you should add some tests (if you already know the concept of unit tests) to check that your method gives different hashes for different strings. And pick the samples cleverly, so that their hashes differ (for the sake of the homework ;)
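As a hedged sketch of the points above (the class name StringHashSketch and the tableSize parameter are made up for illustration): note that `hash += c * hash + ascii` in the question effectively multiplies by 32 rather than 31, and that `%` can return a negative index once the int overflows. Assuming the intent is the String.hashCode()-style polynomial, one way to write it is:

```java
final class StringHashSketch {
    // Assumption: 31 is the prime factor requested in the question.
    private static final int MULTIPLIER = 31;

    // Polynomial hash reduced to a table index in [0, tableSize).
    static int hash(String key, int tableSize) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            // Plain assignment, not +=, so the factor stays 31.
            hash = MULTIPLIER * hash + key.charAt(i);
        }
        // floorMod keeps the index non-negative even after int overflow.
        return Math.floorMod(hash, tableSize);
    }
}
```

Math.floorMod is used instead of % because the intermediate value may wrap around to a negative int for longer strings.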
I need to find 2 to-the-power N where N is a very large number (Java BigInteger type)
Java BigInteger Class has pow method but it takes only integer value as exponent.
So, I wrote a method as follows:
static BigInteger twoToThePower(BigInteger n)
{
BigInteger result = BigInteger.valueOf(1L);
while (n.compareTo(BigInteger.valueOf((long) Integer.MAX_VALUE)) > 0)
{
result = result.shiftLeft(Integer.MAX_VALUE);
n = n.subtract(BigInteger.valueOf((long) Integer.MAX_VALUE));
}
long k = n.longValue();
result = result.shiftLeft((int) k);
return result;
}
My code works fine, I am just sharing my idea and curious to know if there is any other better idea?
Thank you.
You cannot use BigInteger to store the result of your computation. From the javadoc :
BigInteger must support values in the range -2^Integer.MAX_VALUE (exclusive) to +2^Integer.MAX_VALUE (exclusive) and may support values outside of that range.
This is the reason why the pow method takes an int. On my machine, BigInteger.ONE.shiftLeft(Integer.MAX_VALUE) throws a java.lang.ArithmeticException (message is "BigInteger would overflow supported range").
Emmanuel Lonca's answer is correct. But, building on Manoj Banik's idea, I would like to share my idea too.
My code does the same thing as Manoj Banik's code, but faster. The idea is to initialize a byte buffer and put a single 1 bit in the correct location. I use the shift-left operator on one byte instead of the shiftLeft method.
Here is my code:
static BigInteger twoToThePower(BigInteger n){
BigInteger eight = BigInteger.valueOf(8);
BigInteger[] divideResult = n.divideAndRemainder(eight);
BigInteger bufferSize = divideResult[0].add(BigInteger.ONE);
int offset = divideResult[1].intValue();
byte[] buffer = new byte[bufferSize.intValueExact()];
buffer[0] = (byte)(1 << offset);
return new BigInteger(1,buffer);
}
But it is still slower than BigInteger.pow.
Then I found that the BigInteger class has a method called setBit. Like the pow method, it accepts an int parameter. Using this method is faster than BigInteger.pow.
The code can be:
static BigInteger twoToThePower(BigInteger n){
return BigInteger.ZERO.setBit(n.intValueExact());
}
The BigInteger class also has a method called modPow, but it needs one more parameter: you must specify a modulus, and the result is reduced modulo that value, so it is always smaller than the modulus. I did not do a performance test for modPow, but I think it should be slower than the pow method.
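To make the two options concrete, here is a small sketch (the class and method names are hypothetical). It assumes that, when the exponent does not fit in an int, a result reduced modulo some positive modulus is acceptable, since modPow takes a BigInteger exponent:

```java
import java.math.BigInteger;

final class PowerOfTwoSketch {
    // 2^n for an int exponent: set bit n of zero.
    static BigInteger twoToThePower(int n) {
        return BigInteger.ZERO.setBit(n);
    }

    // 2^n mod modulus for an arbitrarily large non-negative exponent.
    // modPow requires a positive modulus.
    static BigInteger twoToThePowerMod(BigInteger n, BigInteger modulus) {
        return BigInteger.valueOf(2).modPow(n, modulus);
    }
}
```

For example, twoToThePowerMod(new BigInteger("100000000000000000000"), BigInteger.valueOf(7)) stays cheap because the intermediate values never grow beyond the modulus.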
By using repeated squaring you can achieve your goal. I've posted sample code below to illustrate the logic of repeated squaring.
static BigInteger pow(BigInteger base, BigInteger exponent) {
BigInteger result = BigInteger.ONE;
while (exponent.signum() > 0) {
if (exponent.testBit(0)) result = result.multiply(base);
base = base.multiply(base);
exponent = exponent.shiftRight(1);
}
return result;
}
An interesting question. Just to add a little more information to the fine accepted answer, examining the OpenJDK 8 source code for BigInteger reveals that the bits are stored in an array final int[] mag;. Since arrays can contain at most Integer.MAX_VALUE elements this immediately puts a theoretical bound on this particular implementation of BigInteger of 2^(32 * Integer.MAX_VALUE). So even your method of repeated left-shifting can only exceed the size of an int by at most a factor of 32.
So, are you ready to produce your own implementation of BigInteger?
I'm trying to implement cuckoo hashing with hash functions:
hash1: key.hashcode() % capacity
hash2: key.hashcode() / capacity % capacity
This is combined with an infinite-loop check and a rehashing method that doubles the capacity. The program works fine with a small amount of data, but when the data gets big (around 20k elements) it keeps rehashing until the capacity overflows.
I figured out that the infinite rehashing is mostly caused by data with exactly the same hashcode. After rehashing, there is a chance that other data get the same hashcode, causing yet another rehash.
I already use Java's built-in hashCode, but the chance of identical hashcodes is still high when the data is large. Even after I modified the hashCode method a little, there is eventually still data with the same hashcode.
So which hash method should I use to prevent this?
A usual way to create a hash function is to use primes. I wrote the function below; I can't guarantee that it has no collisions, but there should be fewer of them.
int hashFunction1(String s){
int k = 7; //take a prime number, can be anything (I just chose 7)
for(int i = 0; i < s.length(); i++){
k *= (23 * (int)(s.charAt(i)));
k %= capacity; //capacity is the table size, declared elsewhere
}
return k;
}
//23 is another randomly chosen number.
You can write a similar hash function as hashFunction2, choosing two different prime numbers. But the main problem here is that the function only multiplies, so the order of the characters does not matter: the strings "stop" and "pots" get the same hash code.
So, an improvement over this function can be:
int hashFunction1(String s){
int k = 7; //take a prime number, can be anything (I just chose 7)
for(int i = 0; i < s.length(); i++){
k *= (23 * (int)(s.charAt(i)));
k += (int)(s.charAt(i)); //adding the character makes the result depend on order
k %= capacity;
}
return k;
}
which will resolve this (for most cases, if not all).
If you still find this function bad, instead of s.charAt(i), you can use a unique prime number mapped to every character, ie. a=3, b=5, c=7, d=11 and so on. This should resolve collision even more.
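The "stop"/"pots" claim can be checked with a short sketch. The class and method names here are invented for the example, capacity is passed in as a parameter, and long arithmetic is used (an assumption, not part of the functions above) so that int overflow does not blur the comparison:

```java
final class AnagramCollisionSketch {
    // Multiply-only variant: modular multiplication commutes,
    // so anagrams always land on the same value.
    static int multiplicative(String s, int capacity) {
        long k = 7;
        for (int i = 0; i < s.length(); i++) {
            k = (k * 23 * s.charAt(i)) % capacity;
        }
        return (int) k;
    }

    // Variant that also adds the character, making the result order-sensitive.
    static int mixed(String s, int capacity) {
        long k = 7;
        for (int i = 0; i < s.length(); i++) {
            k = (k * 23 * s.charAt(i) + s.charAt(i)) % capacity;
        }
        return (int) k;
    }
}
```

With a prime capacity such as 1009, multiplicative("stop", 1009) and multiplicative("pots", 1009) are equal by construction, while the mixed variant is no longer forced to agree on anagrams (collisions remain possible, just not systematic).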
EDIT:
You are using +n, which is a constant.
2 is not the prime to be used in such cases. Use an odd prime number, 3 works.
I have searched for this question on the website but do not understand most of the solutions posted.
My question is this.
I have a method:
public static int convertToDecimal(int number, int base)
It receives a number and a base, then multiplies each digit by the matching power of the base. For example:
(1023, 4) passed to the method is evaluated as (1 * Math.pow(4,3)) + (0 * Math.pow(4,2)) + (2 * Math.pow(4,1)) + (3 * Math.pow(4,0)), which returns 75.
Please note that I am very new to Java and do not yet understand the more sophisticated ways of doing this. The goal is to understand how to do the iteration. Thanks.
An integer value itself (e.g. an int) doesn't have a base, really. It's just a number.
Bases are mostly important for textual representations. (They're also important in floating point numbers as that determines what sort of point is floating, but let's leave that aside for the moment.) So it would make sense to have:
public static String convertToDecimal(String input, int base)
Or even:
public static int convertFromBase(String input, int base)
But your current signature doesn't make much sense. 1023 is not the same as "1023"... it's just a number.
Fortunately, Java has the second of these signatures built in, as Integer.parseInt(String, int), so you don't need to do it yourself. If you do want to do it yourself, I would write it something like:
public static int convertFromBase(String input, int base) {
int multiplier = 1;
int result = 0;
for (int index = input.length() - 1; index >= 0; index--) {
// This will handle hex for you. TODO: Validation!
int digitValue = Character.digit(input.charAt(index), base);
result += digitValue * multiplier;
multiplier *= base;
}
return result;
}
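The TODO about validation could be filled in roughly as follows. This is only a sketch under my own assumptions: the class name is hypothetical, it walks the digits left to right (Horner's scheme) instead of right to left, and it chooses NumberFormatException for bad digits and ArithmeticException (via the Math.*Exact methods) for overflow:

```java
final class BaseConversionSketch {
    static int convertFromBase(String input, int base) {
        if (input.isEmpty()) {
            throw new NumberFormatException("empty input");
        }
        int result = 0;
        for (int i = 0; i < input.length(); i++) {
            // Character.digit returns -1 for a character that is not
            // a valid digit in the given base.
            int digit = Character.digit(input.charAt(i), base);
            if (digit < 0) {
                throw new NumberFormatException(
                        "invalid digit '" + input.charAt(i) + "' for base " + base);
            }
            // result = result * base + digit, failing loudly on int overflow.
            result = Math.addExact(Math.multiplyExact(result, base), digit);
        }
        return result;
    }
}
```

Unlike Integer.parseInt, this sketch does not accept a leading sign.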
The method Integer.parseInt(String,int) does exactly what you are looking for.
Integer.parseInt("1023",4) //returns 75
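If the original (int number, int base) signature has to be kept, the iteration the question asks about can be sketched like this. It assumes that the decimal digits of number are meant to be read as digits in the given base, that number is non-negative, and that base is at most 10; the class name is made up:

```java
final class DigitIterationSketch {
    static int convertToDecimal(int number, int base) {
        int result = 0;
        int placeValue = 1; // base^0, then base^1, base^2, ...
        while (number > 0) {
            int digit = number % 10;      // rightmost remaining digit
            result += digit * placeValue; // digit * base^position
            placeValue *= base;
            number /= 10;                 // drop the digit just handled
        }
        return result;
    }
}
```

For (1023, 4) the loop adds 3*1, 2*4, 0*16 and 1*64, which matches the expansion in the question.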
How do you in a general (and performant) way implement hashcode while minimizing collisions for objects with 2 or more integers?
Update: as many stated, you can't of course eliminate collisions entirely (honestly, I didn't think about it). So my question should be how to minimize collisions in a proper way; edited to reflect that.
Using NetBeans' autogeneration fails; for example:
public class HashCodeTest {
@Test
public void testHashCode() {
int loopCount = 0;
HashSet<Integer> hashSet = new HashSet<Integer>();
for (int outer = 0; outer < 18; outer++) {
for (int inner = 0; inner < 2; inner++) {
loopCount++;
hashSet.add(new SimpleClass(inner, outer).hashCode());
}
}
org.junit.Assert.assertEquals(loopCount, hashSet.size());
}
private class SimpleClass {
int int1;
int int2;
public SimpleClass(int int1, int int2) {
this.int1 = int1;
this.int2 = int2;
}
@Override
public int hashCode() {
int hash = 5;
hash = 17 * hash + this.int1;
hash = 17 * hash + this.int2;
return hash;
}
}
}
Can you in a general (and performant) way implement hashcode without
collisions for objects with 2 or more integers?
It is technically impossible to have zero collision when hashing to 32 bits (one integer) something made of more than 32 bits (like 2 or more integers).
This is what eclipse auto-generates:
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + getOuterType().hashCode();
result = prime * result + int1;
result = prime * result + int2;
return result;
}
And with this code your testcase passes...
PS: And don't forget to implement equals()!
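Why the test fails with 17 but passes with 31 can be seen by counting distinct hashes for the same value ranges as in the test (first field 0..1, second field 0..17). With multiplier m the hash is a constant plus m * int1 + int2, so (1, 0) and (0, 17) collide when m is 17, while every second-field value stays below 31. A small sketch, with hypothetical names:

```java
import java.util.HashSet;
import java.util.Set;

final class MultiplierCollisionSketch {
    // Same shape as the generated hashCode methods above.
    static int hash(int multiplier, int seed, int int1, int int2) {
        int h = seed;
        h = multiplier * h + int1;
        h = multiplier * h + int2;
        return h;
    }

    // Number of distinct hashes over the 36 (inner, outer) pairs of the test.
    static int distinctHashes(int multiplier, int seed) {
        Set<Integer> seen = new HashSet<>();
        for (int outer = 0; outer < 18; outer++) {
            for (int inner = 0; inner < 2; inner++) {
                seen.add(hash(multiplier, seed, inner, outer));
            }
        }
        return seen.size();
    }
}
```

So the larger multiplier only helps here because the second field happens to stay below it; with wider value ranges collisions come back.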
There is no way to eliminate hash collisions entirely. Your approach is basically the preferred one to minimize collisions.
Creating a hash method with zero collisions is impossible. The idea of a hash method is you're taking a large set of objects and mapping it to a smaller set of integers. The best you can do is minimize the number of collisions you get within a subset of your objects.
As others have said, it's more important to minimize collisions than to eliminate them, especially since you didn't say how many buckets you're aiming for. It's going to be much easier to have zero collisions with 5 items in 1000 buckets than if you have 5 items in 2 buckets! And even if there are plenty of buckets, your collisions could look very different with 1000 buckets vs 1001.
Another thing to note is that there's a good chance that the hash you provide won't even be the one the HashMap eventually uses. If you take a look at the OpenJDK HashMap code, for instance, you'll see that your keys' hashCodes are put through a private hash method (line 264 in that link) which re-hashes them. So, if you're going through the trouble of creating a carefully constructed custom hash function to reduce collisions (rather than just a simple, auto-generated one), make sure you also understand who's going to use it, and how.
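As an illustration of that re-hashing step: the exact mixing differs between JDK versions, and the sketch below mirrors the spreading used by HashMap in OpenJDK 8 and later (XOR of the high 16 bits into the low 16 bits, followed by masking with the power-of-two table length). The class and method names are mine, not the JDK's:

```java
final class HashSpreadSketch {
    // Fold the high half of the hash code into the low half.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table whose length is a power of two.
    static int bucketIndex(int hashCode, int tableLength) {
        return (tableLength - 1) & spread(hashCode);
    }
}
```

Without the spreading step, hash codes that differ only in their upper bits (for example 0x10000 and 0x20000) would all fall into bucket 0 of a 16-slot table.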
I have to write a hash function, under the following two conditions:
I don't know anything about Object o that is passed to the method - it can be a String, an Integer, or an actual custom object;
I am not allowed to call hashCode() at all.
Approach that I am using now, to calculate the hash code:
Write object to the byte stream;
Convert byte stream to the byte array;
Loop through the byte array and calculate hash by doing something like this:
hash = hash * PRIME + byteArray[i]
My question: is this a passable approach, and is there a way to improve it? Personally I feel like the scope for this function is too broad - there is no information about what the objects are, but I have little say in this situation.
You could use HashCodeBuilder.reflectionHashCode instead of implementing your own solution.
The serialization approach only works for objects which are in fact serializable. Thus, it is not really possible for all types of objects.
Also, this compares objects by whether they have equivalent object graphs, which is not necessarily the same as being equal by .equals().
For example, StringBuilder objects created by the same code (with the same data) will have equal OOS output (i.e. also an equal hash), while b1.equals(b2) is false, and an ArrayList and a LinkedList with the same elements will register as different, while list1.equals(list2) is true.
You can avoid the "convert byte stream to byte array" step by creating a custom HashOutputStream, which simply takes the byte data and hashes it, instead of saving it as an array for later iteration.
import java.io.OutputStream;
class HashOutputStream extends OutputStream {
private static final int PRIME = 13;
private int hash;
// all the other write methods delegate to this one
@Override
public void write(int b) {
this.hash = this.hash * PRIME + b;
}
public int getHash() {
return hash;
}
}
Then wrap your ObjectOutputStream around an object of this class.
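The wrapping step might look like the sketch below. The stream class is restated (as a nested class) so the example is self-contained; the outer class name, the hashOf helper, and the decision to rethrow IOException as UncheckedIOException are my own assumptions:

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationHashSketch {
    // Same idea as the stream above: hash bytes as they arrive.
    static final class HashOutputStream extends OutputStream {
        private static final int PRIME = 13;
        private int hash;

        @Override
        public void write(int b) {
            hash = hash * PRIME + b;
        }

        int getHash() {
            return hash;
        }
    }

    // Serializes the object straight into the hashing stream.
    static int hashOf(Serializable o) {
        HashOutputStream sink = new HashOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(o);
        } catch (IOException e) {
            // Thrown, for example, when a referenced object is not serializable.
            throw new UncheckedIOException(e);
        }
        return sink.getHash();
    }
}
```

Two strings with equal content hash to the same value this way, but the caveats above about non-serializable objects and object-graph equality still apply.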
Instead of your y = y*13 + x method you might look at other checksum algorithms. For example, java.util.zip contains Adler32 (used in the zlib format) and CRC32 (used in the gzip format).
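Both checksum classes implement java.util.zip.Checksum and can be fed a byte array directly. A minimal sketch (the wrapper class and method names are hypothetical):

```java
import java.util.zip.Adler32;
import java.util.zip.CRC32;

final class ChecksumSketch {
    // CRC-32 of the given bytes, as an unsigned 32-bit value in a long.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Adler-32 of the given bytes, as an unsigned 32-bit value in a long.
    static long adler32(byte[] data) {
        Adler32 adler = new Adler32();
        adler.update(data, 0, data.length);
        return adler.getValue();
    }
}
```

getValue() returns a long, so the result needs a cast (or a modulo by the table size) before it is used as an int hash.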
hash = (hash * PRIME + byteArray[i]) % MODULO ?
Also, while you're at it, if you want to avoid collisions as much as possible, you can use a standardized (cryptographic if intentional collisions are an issue) hash function in step 3, like SHA-2 or so?
Have a look at DigestInputStream, which also spares you step 2.
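For the cryptographic route, a sketch using java.security.MessageDigest directly on a byte array (rather than the stream wrapper mentioned above) could look like this. Taking the first four bytes of the digest as the int hash is my own choice for the example, as are the names:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestHashSketch {
    // Hashes the bytes with SHA-256 and keeps the first 4 digest bytes as an int.
    static int sha256Hash(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            return ByteBuffer.wrap(digest).getInt(); // big-endian, first 4 bytes
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

When writing through an ObjectOutputStream, java.security.DigestOutputStream plays the same role as the custom hashing stream and also avoids the intermediate byte array.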
Take a look at Bob Jenkins' article on non-cryptographic hashing. He walks through a number of approaches and discusses their strengths, weaknesses, and tradeoffs between speed and the probability of collisions.
If nothing else, it will allow you to justify your algorithm decision. Explain to your instructor why you chose speed over correctness or vice versa.
As a starting point, try his One-at-a-time hash:
ub4 one_at_a_time(char *key, ub4 len)
{
ub4 hash, i;
for (hash=0, i=0; i<len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return (hash & mask);
}
It's simple, but does surprisingly well against more complex algorithms.
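Since the question is about Java, here is a hedged port of the C routine above. It assumes bytes are treated as unsigned, uses the unsigned shift >>> in place of C's shift on ub4, and leaves the final masking (hash & mask) to the caller; the class and method names are made up:

```java
final class OneAtATimeSketch {
    // One-at-a-time hash over a byte array, using wrapping 32-bit int arithmetic.
    static int oneAtATime(byte[] key) {
        int hash = 0;
        for (byte b : key) {
            hash += (b & 0xff);     // treat the byte as unsigned
            hash += (hash << 10);
            hash ^= (hash >>> 6);   // unsigned shift, like ub4 in C
        }
        hash += (hash << 3);
        hash ^= (hash >>> 11);
        hash += (hash << 15);
        return hash;
    }
}
```

To get a bucket index, the caller can apply a mask such as (tableLength - 1) for a power-of-two table, or Math.floorMod(hash, tableLength) otherwise.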