Universal Hashfunctions for string


I am trying to implement two different universal hash functions for strings.
But I have the problem that sometimes the hash value is 0.
This means I can't use the hash function as it is, because I want to implement double hashing and have to walk through the hash table with hash_func1(s) + i * hash_func2(s).
But if hash_func2 returns 0, the probe position never changes and I get an endless loop.
This is for collision resolution in a hash table.
I need two different universal hash functions for doing that.
I have tried different hash functions but can't find anything that works.
Can anyone help me with this problem?
These are some of the functions I have tried.
int h = 0 , r1 = 31415 , r2 = 27183;
for (int i =0; i < key.length (); i ++) {
h = ( r1 * h + key.charAt ( i )) % capacity ;
r1 = r1 * r2 % (capacity -1);
}
return h ;
Or this one
int seed = 131;
long hash = 0;
for(int i = 0; i < key.length(); i++)
{
hash = (hash * seed) + key.charAt(i);
}
return (int) (hash % capacity);

The Wikipedia article on double hashing suggests modifying the second hash function so that it can never become zero; the easiest way to do that is to simply add 1:
int h1 = hash_func1(s);
int h2 = (hash_func2(s) % (capacity - 1)) + 1;
// loop over (h1 + i * h2) % capacity
EDIT: Oops, I guess you also need to bound it by capacity - 1, otherwise with h2 == capacity, you would still run into an endless loop...
Or, even better, have hash_func2() already return a value less than capacity - 1, then adding 1 is sufficient.
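To make the idea concrete, here is a minimal sketch of the probe computation. The class and method names (DoubleHashProbe, h1, h2, probeIndex) and the multipliers 131 and 31 are assumptions for illustration, not taken from the question, and capacity is assumed to be prime so that every step size in [1, capacity - 1] visits all slots.

```java
// Sketch only: DoubleHashProbe, h1, h2 and probeIndex are hypothetical names.
// capacity is assumed to be a prime number greater than 2.
final class DoubleHashProbe {
    // Primary hash: polynomial hash kept inside [0, capacity) at every step.
    static int h1(String key, int capacity) {
        long h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (h * 131 + key.charAt(i)) % capacity;
        }
        return (int) h;
    }

    // Step size: reduced modulo (capacity - 1), then shifted by 1,
    // so the result lies in [1, capacity - 1] and is never zero.
    static int h2(String key, int capacity) {
        long h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (h * 31 + key.charAt(i)) % (capacity - 1);
        }
        return (int) h + 1;
    }

    // Table index of the i-th probe for the given key.
    static int probeIndex(String key, int i, int capacity) {
        long idx = h1(key, capacity) + (long) i * h2(key, capacity);
        return (int) (idx % capacity);
    }

    public static void main(String[] args) {
        int capacity = 11; // prime
        for (int i = 0; i < capacity; i++) {
            System.out.print(probeIndex("abc", i, capacity) + " ");
        }
        System.out.println();
    }
}
```

With a prime capacity the step size and the capacity share no common factor, so the first capacity probes hit every slot exactly once.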

Related

Rabin-Karp not working for large primes (gives wrong output)

So I was solving this problem (Rabin Karp's algorithm) and wrote this solution:
private static void searchPattern(String text, String pattern) {
int txt_len = text.length(), pat_len = pattern.length();
int hash_pat = 0, hash_txt = 0; // hash values for pattern and text's substrings
final int mod = 100005; // prime number to calculate modulo... larger modulo denominator reduces collisions in hash
final int d = 256; // to include all the ascii character codes
int coeff = 1; // stores the multiplier (or coeffecient) for the first index of the sliding window
/*
* HASHING PATTERN:
* say text = "abcd", then
* hashed text = 256^3 *'a' + 256^2 *'b' + 256^1 *'c' + 256^0 *'d'
*/
// The value of coeff would be "(d^(pat_len - 1)) % mod"
for (int i = 0; i < pat_len - 1; i++)
coeff = (coeff * d) % mod;
// calculate hash of the first window and the pattern itself
for (int i = 0; i < pat_len; i++) {
hash_pat = (d * hash_pat + pattern.charAt(i)) % mod;
hash_txt = (d * hash_txt + text.charAt(i)) % mod;
}
for (int i = 0; i < txt_len - pat_len; i++) {
if (hash_txt == hash_pat) {
// our chances of collisions are quite less (1/mod) so we dont need to recheck the substring
System.out.println("Pattern found at index " + i);
}
hash_txt = (d * (hash_txt - text.charAt(i) * coeff) + text.charAt(i + pat_len)) % mod; // calculating next window (i+1 th index)
// We might get negative value of t, converting it to positive
if (hash_txt < 0)
hash_txt = hash_txt + mod;
}
if (hash_txt == hash_pat) // checking for the last window
System.out.println("Pattern found at index " + (txt_len - pat_len));
}
Now this code simply does not work if mod = 1000000007, whereas as soon as I take some other, smaller number (like 1e5+7), the code starts working.
The line at which the code's logic fails is:
hash_txt = (d * (hash_txt - text.charAt(i) * coeff) + text.charAt(i + pat_len)) % mod;
Can someone please tell me why this is happening? Maybe this is a stupid question, but I just do not understand.

In Java, an int is a 32-bit integer. If a calculation with such numbers mathematically yields a result that needs more binary digits, the extra digits are silently discarded. This is called overflow.
To avoid this, the Rabin-Karp algorithm reduces results modulo some prime in each step, thereby keeping the number small enough that the next step will not overflow. For this to work, the prime chosen must be small enough that
d * (hash + max(char) * coeff) + max(char) < max(int)
Since
0 ≤ hash < p,
1 ≤ coeff < p,
max(char) = 2^16
max(int) = 2^31
any prime smaller than 2^7 = 128 will do. For larger primes, it depends on what their coeff ends up being, but even if we select one with the smallest possible coeff = 1, the prime must not exceed 2^23, which is much smaller than the prime you used.
In practice, one therefore uses Rabin-Karp with an integer datatype that is significantly bigger than the character type, such as a long (64 bits). Then, any prime < 2^39 will do.
Even then, it is worth noting that your reasoning
our chances of collisions are quite less (1/mod) so we dont need to recheck the substring
is flawed, because the probability is determined not by chance, but by the strings being checked. Unless you know the probability distribution of your inputs, you can't know what the probability of failure is. That's why Rabin-Karp rechecks the string to make sure.
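Both points can be put together in a short sketch: 64-bit arithmetic for the rolling hash, plus an explicit comparison whenever the hashes agree. The class name RabinKarpLong and the method search are made up for this illustration; the rolling-hash scheme itself follows the question's code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: RabinKarpLong and search are hypothetical names.
final class RabinKarpLong {
    // Returns every index at which pattern occurs in text.
    static List<Integer> search(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return hits;

        final long mod = 1_000_000_007L; // well below 2^39, so long math cannot overflow
        final long d = 256;

        long coeff = 1; // d^(m-1) % mod, the weight of the window's first character
        for (int i = 0; i < m - 1; i++) coeff = (coeff * d) % mod;

        long hashPat = 0, hashTxt = 0;
        for (int i = 0; i < m; i++) {
            hashPat = (d * hashPat + pattern.charAt(i)) % mod;
            hashTxt = (d * hashTxt + text.charAt(i)) % mod;
        }

        for (int i = 0; i + m <= n; i++) {
            // Equal hashes are only a hint; compare the characters to be sure.
            if (hashPat == hashTxt && text.regionMatches(i, pattern, 0, m)) hits.add(i);
            if (i + m < n) {
                long lead = (text.charAt(i) * coeff) % mod;
                // Adding mod before reducing keeps the intermediate value non-negative.
                hashTxt = (d * ((hashTxt - lead + mod) % mod) + text.charAt(i + m)) % mod;
            }
        }
        return hits;
    }
}
```

Calling RabinKarpLong.search("abracadabra", "abra") reports the matches at indices 0 and 7.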

Creating a pi(x) Table

Let pi(x) denote the number of primes <= x. For example, pi(100) = 25. I would like to create a table which stores values of pi(x) for all x <= L. I figured the quickest way would be to use the sieve of Eratosthenes. First I mark all primes, and then I use dynamic programming, summing the count of primes and increasing each time a new prime appears. This is implemented in the Java code below:
public static int [] piTableSimple (int L)
{
int sqrtl = (int) Math.sqrt(L);
int [] piTable = new int [L + 1];
Arrays.fill(piTable, 1);
piTable[0] = 0;
piTable[1] = 0;
for (int i = 2 ; i <= sqrtl ; i++)
if (piTable[i] == 1)
for (int j = i * i ; j <= L ; j += i)
piTable[j] = 0;
for (int i = 1 ; i < piTable.length ; i++)
piTable[i] += piTable[i - 1];
return piTable;
}
There are 2 problems with this implementation:
It uses large amounts of memory, as the space complexity is O(n)
Because Java arrays are "int"-indexed, the bound for L is 2^31 - 1
I can "cheat" a little, though: for even values of x > 2, pi(x) = pi(x - 1), which lets me both reduce memory usage by a factor of 2 and increase the bound for L by a factor of 2 (Lmax <= 2^32).
This is implemented with a simple modification to the above code:
public static long [] piTableSmart (long L)
{
long sqrtl = (long) Math.sqrt(L);
long [] piTable = new long [(int) (L/2 + 1)];
Arrays.fill(piTable, 1);
piTable[0] = 0;
piTable[1] = 0;
piTable[2] = 1;
for (int i = 3 ; i <= sqrtl ; i += 2)
if (piTable[(i + 1) / 2] == 1)
{
long li = (long) i;
long inc = li * 2L;
for (long j = li * li ; j <= L ; j += inc)
piTable[(int) ((j + 1) / 2)] = 0;
}
piTable[2] = 2;
for (int i = 1 ; i < piTable.length ; i++)
piTable[i] += piTable[i - 1];
return piTable;
}
Note that the value of pi(2) = 1 is not directly represented in this array, but simple workarounds and checks solve that. This implementation comes with a small cost: the value of pi(x) is not accessed in a straightforward way. Instead, to access the value of pi(x), one has to use
piTable[(x + 1) / 2]
And this works for both even and odd values of x of course. The latter completes creating a pi(x) table for x <= L = 10^9 in 10s on my rather slowish laptop.
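One possible shape for the pi(2) workaround, as a minimal sketch: a lookup helper that special-cases small x and otherwise uses the index formula above. The names PiLookup and pi are hypothetical, and the table passed in is assumed to have the layout produced by piTableSmart (index k holds pi(2k - 1) for k >= 2).

```java
// Sketch only: PiLookup and pi are hypothetical names. piTable is assumed to
// use the halved layout described above: piTable[k] = pi(2k - 1) for k >= 2.
final class PiLookup {
    static long pi(long[] piTable, long x) {
        if (x < 2) return 0;   // no primes below 2
        if (x == 2) return 1;  // the one value the halved table cannot represent
        return piTable[(int) ((x + 1) / 2)];
    }

    public static void main(String[] args) {
        // Hand-built table for x <= 10 in the halved layout.
        long[] table = {0, 0, 2, 3, 4, 4};
        for (long x = 0; x <= 10; x++) {
            System.out.println("pi(" + x + ") = " + pi(table, x));
        }
    }
}
```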
I would like to further reduce the space required and also increase the bound for L for my purposes, without severely deteriorating performance (for example, the cost of slightly more arithmetic operations to access the value of pi(x), as in the latter code, barely deteriorates performance). Can it be done in an efficient and smart way?

You should use a segmented Sieve of Eratosthenes, which reduces the memory requirement from O(n) to O(sqrt(n)). Here is an implementation.
Do you need to store all the pi? Here is a function that computes pi(x). It's reasonably quick up to 10**12.
If you find this useful, please upvote this answer and also the two linked answers.
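The linked implementation is not reproduced here. As a rough illustration of the segmented idea only, the sketch below counts the primes up to a limit while keeping just the base primes up to sqrt(limit) and one segment buffer in memory. The class name SegmentedPrimeCount, the method countPrimes and the minimum segment size of 2^15 are assumptions made for this example, and limit is assumed to be far below Long.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: SegmentedPrimeCount and countPrimes are hypothetical names.
final class SegmentedPrimeCount {
    // Counts the primes <= limit using O(sqrt(limit)) memory.
    static long countPrimes(long limit) {
        if (limit < 2) return 0;

        // Integer square root, corrected for floating point rounding.
        int root = (int) Math.sqrt((double) limit);
        while ((long) root * root > limit) root--;
        while ((long) (root + 1) * (root + 1) <= limit) root++;

        // Ordinary sieve for the base primes up to sqrt(limit).
        boolean[] composite = new boolean[root + 1];
        List<Integer> basePrimes = new ArrayList<>();
        for (int i = 2; i <= root; i++) {
            if (!composite[i]) {
                basePrimes.add(i);
                for (long j = (long) i * i; j <= root; j += i) composite[(int) j] = true;
            }
        }

        long count = basePrimes.size();
        int segmentSize = Math.max(root, 1 << 15);
        boolean[] segment = new boolean[segmentSize];

        // Sieve (root, limit] one segment at a time, reusing the same buffer.
        for (long low = root + 1L; low <= limit; low += segmentSize) {
            long high = Math.min(low + segmentSize - 1, limit);
            Arrays.fill(segment, false);
            for (int p : basePrimes) {
                // First multiple of p inside the segment, but never below p*p.
                long start = Math.max((long) p * p, ((low + p - 1) / p) * p);
                for (long j = start; j <= high; j += p) segment[(int) (j - low)] = true;
            }
            for (long x = low; x <= high; x++) {
                if (!segment[(int) (x - low)]) count++;
            }
        }
        return count;
    }
}
```

To turn this into a table of pi(x), one option is to record the running count at the end of each segment instead of storing a value for every x.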

Now that I understand better what you want to do, I can give a better answer.
The normal way to compute pi(x) starts with pre-computed tables arranged at intervals, then uses a segmented sieve to interpolate between the pre-computed points; the pre-computations may be done by sieving or by any of several other methods. Those tables get big, as you have pointed out. If you want to be able to compute pi(x) up to 10^20, and you are willing to sieve a range up to 10^12 each time someone calls your function, you will need a table with 10^8 64-bit integers, which will take nearly a gigabyte of space; calls to your function will take about half a minute each for the sieving, assuming a recent-vintage personal computer. Of course, you can choose where you want to be on the time/space trade-off curve by choosing how many pre-computed points you will have.
You are talking about computing pi(x) for x > 10^24, which will take lots more space, or lots more time, or both. Lots. Recent projects that have computed huge values of pi(x), for values of x like 10^24 or 10^25, have taken months to compute.
You might want to look at Kim Walisch's primesieve program, which has a very fast segmented sieve. You might also look at the website of Tomás Oliveira e Silva, where you will find tables of pi(x) up to 10^22.
Having said all that, what you want to do probably isn't feasible.

Hashing to Negative Values

Pretty much the title: I'm hashing a bunch of names (10000-ish) and some are outputting as negative. (table size is 20011).
The hash function in question is:
public static long hash2 ( String key ){
int hashVal = 0;
for( int i = 0; i < key.length(); i++ )
hashVal = (37 * hashVal) + key.charAt(i);
return hashVal % 20011;
}
I dug around and I think I have to do something to do with "wrap around." But I don't know how to go about that.

This is a clear case of integer overflow. hashVal is an int, and for a key of n characters the polynomial grows roughly like 37^n, so it exceeds the int range after only a handful of characters and wraps around to a negative value. A name of 20 characters is already far more than enough to overflow.
In number theory,
(A+B)%M = (A%M + B%M) % M;
(A*B)%M = (A%M * B%M) % M;
You should apply the modulo operation inside the for loop. Mathematically, reducing once at the end and reducing in every iteration give the same answer; in code they only agree as long as no overflow happens, which is why the reduction has to go inside the loop.
So make changes accordingly:
public static long hash2 ( String key ){
int hashVal = 0;
for( int i = 0; i < key.length(); i++ )
{
hashVal = (37 * hashVal) + key.charAt(i);
hashVal%=20011;
}
return hashVal;
}
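For comparison, here is a minimal sketch of two ways to keep the result inside [0, 20011). The class and method names (NameHash, hashReduced, hashFloorMod) are invented for this example. The first variant reduces inside the loop as described above; the second lets the int wrap around and maps the final value back with Math.floorMod, which returns a non-negative result for a positive modulus.

```java
// Sketch only: NameHash, hashReduced and hashFloorMod are hypothetical names.
final class NameHash {
    static final int TABLE_SIZE = 20011;

    // Variant 1: reduce in every iteration, so hashVal never leaves [0, TABLE_SIZE).
    // 37 * 20010 + 65535 fits comfortably in an int, so nothing can overflow.
    static int hashReduced(String key) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal = (37 * hashVal + key.charAt(i)) % TABLE_SIZE;
        }
        return hashVal;
    }

    // Variant 2: allow the int to wrap around, then map the (possibly negative)
    // result into [0, TABLE_SIZE) with floorMod instead of %.
    static int hashFloorMod(String key) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal = 37 * hashVal + key.charAt(i);
        }
        return Math.floorMod(hashVal, TABLE_SIZE);
    }
}
```

The two variants generally return different values for the same key, so a table has to use one of them consistently.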

hashVal is an int. Most likely your hash function is causing an integer overflow.
You can resolve this by using Math.abs() to ensure that hashVal is non-negative, e.g.
hashVal = hashVal == Integer.MIN_VALUE ? 0 : Math.abs(hashVal);
return hashVal % 20011;
The mod % ensures that the final index computed is within the bounds of the table (i.e. if it's >= 20011, it uses the remainder of the division to, as you say, 'wrap around').

Reduce treatment time of the FFT

I'm currently working in Java for Android. I am trying to implement the FFT in order to build a kind of frequency viewer.
I was actually able to do it, but the display is not fluid at all.
I added some traces to check the processing time of each part of my code, and it turns out that the FFT takes about 300ms on my complex array of 4096 elements. I need it to take less than 100ms, as my thread (which displays the frequencies) is refreshed every 100ms. I reduced the initial array so that the FFT result has only 1028 elements, and that works, but the result is degraded.
Does someone have an idea?
I used the default fft.java and Complex.java classes that can be found on the internet.
For information, my code computing the FFT is the following:
int bytesPerSample = 2;
Complex[] x = new Complex[bufferSize/2] ;
for (int index = 0 ; index < bufferReadResult - bytesPerSample + 1; index += bytesPerSample)
{
// 16BITS = 2BYTES
float asFloat = Float.intBitsToFloat(asInt);
double sample = 0;
for (int b = 0; b < bytesPerSample; b++) {
int v = buffer[index + b];
if (b < bytesPerSample - 1 || bytesPerSample == 1) {
v &= 0xFF;
}
sample += v << (b * 8);
}
double sample32 = 100 * (sample / 32768.0); // don't know the use of this compute...
x[index/bytesPerSample] = new Complex(sample32, 0);
}
Complex[] tx = new Complex[1024]; // size = 2048
///// reduce the size of the signal in order to improve the FFT processing time
for (int i = 0; i < x.length/4; i++)
{
tx[i] = new Complex(x[i*4].re(), 0);
}
// Signal retrieval thanks to the FFT
fftRes = FFT.fft(tx);

I don't know Java, but your way of converting between your input data and an array of complex values seems very convoluted. You're building two arrays of complex data where only one is necessary.
Also it smells like your complex real and imaginary values are doubles. That's way over the top for what you need, and ARMs are veeeery slow at double arithmetic anyway. Is there a complex class based on single precision floats?
Thirdly you're performing a complex fft on real data by filling the imaginary part of your complexes with zero. Whilst the result will be correct it is twice as much work straight off (unless the routine is clever enough to spot that, which I doubt). If possible perform a real fft on your data and save half your time.
And then as Simon says there's the whole issue of avoiding garbage collection and memory allocation.
Also it looks like your FFT has no preparatory step. This means that the routine FFT.fft() is calculating the complex exponentials every time. The longest part of the FFT calculation is working out the complex exponentials, which is a shame because for any given FFT length the exponentials are constants. They don't depend on your input data at all. In the real-time world we use FFT routines where we calculate the exponentials once at the start of the program, and then the actual FFT itself takes that constant array as one of its inputs. I don't know if your FFT class can do something similar.
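A rough sketch of what such a preparatory step could look like in Java follows. It is not the FFT class from the question: the name PrecomputedFft and its transform method are made up for this example, and the routine is an iterative in-place radix-2 FFT on separate float arrays rather than the recursive version. The cosine and sine tables are filled once in the constructor and reused by every call, and no objects are allocated during a transform.

```java
// Sketch only: PrecomputedFft and transform are hypothetical names.
final class PrecomputedFft {
    private final int n;
    private final float[] cos; // cos(-2*pi*k/n) for k in [0, n/2)
    private final float[] sin; // sin(-2*pi*k/n) for k in [0, n/2)

    // Preparatory step: compute the complex exponentials once per FFT length.
    PrecomputedFft(int n) {
        if (n < 1 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("n must be a power of 2: " + n);
        }
        this.n = n;
        cos = new float[n / 2];
        sin = new float[n / 2];
        for (int k = 0; k < n / 2; k++) {
            double angle = -2.0 * Math.PI * k / n;
            cos[k] = (float) Math.cos(angle);
            sin[k] = (float) Math.sin(angle);
        }
    }

    // In-place forward FFT of the n complex values (re[i], im[i]).
    void transform(float[] re, float[] im) {
        // Bit-reversal permutation puts the inputs in butterfly order.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) {
                float t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // Butterfly passes; twiddle factors come from the precomputed tables.
        for (int len = 2; len <= n; len <<= 1) {
            int half = len / 2, stride = n / len;
            for (int start = 0; start < n; start += len) {
                for (int k = 0; k < half; k++) {
                    float wr = cos[k * stride], wi = sin[k * stride];
                    int a = start + k, b = a + half;
                    float tr = wr * re[b] - wi * im[b];
                    float ti = wr * im[b] + wi * re[b];
                    re[b] = re[a] - tr;
                    im[b] = im[a] - ti;
                    re[a] += tr;
                    im[a] += ti;
                }
            }
        }
    }
}
```

An instance would be created once (for example in onCreate) and its transform method called for each new buffer, reusing the same re and im arrays.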
If you do end up going to something like FFTW then you're going to have to get used to calling C code from your Java. Also make sure you get a version that supports (I think) NEON, ARM's answer to SSE, AVX and Altivec. It's worth ploughing through their release notes to check. Also I strongly suspect that FFTW will only be able to offer a significant speed up if you ask it to perform an FFT on single precision floats, not doubles.
Google luck!
--Edit--
I meant of course 'good luck'. Give me a real keyboard quick, these touchscreen ones are unreliable...

First, thanks for all your answers.
I followed them and made two tests:
First, I replaced the double used in my Complex class with float. The result is just a bit better, but not enough.
Then I rewrote the fft method so that it no longer uses Complex, but a two-dimensional float array instead. For each row of this array, the first column contains the real part, and the second one the imaginary part.
I also changed my code in order to instantiate the float array only once, in the onCreate method.
And the result... is worse! Now it takes a little more than 500ms instead of 300ms.
I don't know what to do now.
Below you can find the initial fft function, and then the one I've rewritten.
Thanks for your help.
// compute the FFT of x[], assuming its length is a power of 2
public static Complex[] fft(Complex[] x) {
int N = x.length;
// base case
if (N == 1) return new Complex[] { x[0] };
// radix 2 Cooley-Tukey FFT
if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2 : " + N); }
// fft of even terms
Complex[] even = new Complex[N/2];
for (int k = 0; k < N/2; k++) {
even[k] = x[2*k];
}
Complex[] q = fft(even);
// fft of odd terms
Complex[] odd = even; // reuse the array
for (int k = 0; k < N/2; k++) {
odd[k] = x[2*k + 1];
}
Complex[] r = fft(odd);
// combine
Complex[] y = new Complex[N];
for (int k = 0; k < N/2; k++) {
double kth = -2 * k * Math.PI / N;
Complex wk = new Complex(Math.cos(kth), Math.sin(kth));
y[k] = q[k].plus(wk.times(r[k]));
y[k + N/2] = q[k].minus(wk.times(r[k]));
}
return y;
}
public static float[][] fftf(float[][] x) {
/**
* x[][0] = real part
* x[][1] = imaginary part
*/
int N = x.length;
// base case
if (N == 1) return new float[][] { x[0] };
// radix 2 Cooley-Tukey FFT
if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2 : " + N); }
// fft of even terms
float[][] even = new float[N/2][2];
for (int k = 0; k < N/2; k++) {
even[k] = x[2*k];
}
float[][] q = fftf(even);
// fft of odd terms
float[][] odd = even; // reuse the array
for (int k = 0; k < N/2; k++) {
odd[k] = x[2*k + 1];
}
float[][] r = fftf(odd);
// combine
float[][] y = new float[N][2];
double kth, wkcos, wksin ;
for (int k = 0; k < N/2; k++) {
kth = -2 * k * Math.PI / N;
//Complex wk = new Complex(Math.cos(kth), Math.sin(kth));
wkcos = Math.cos(kth) ; // real part
wksin = Math.sin(kth) ; // imaginary part
// y[k] = q[k].plus(wk.times(r[k]));
y[k][0] = (float) (q[k][0] + wkcos * r[k][0] - wksin * r[k][1]);
y[k][1] = (float) (q[k][1] + wkcos * r[k][1] + wksin * r[k][0]);
// y[k + N/2] = q[k].minus(wk.times(r[k]));
y[k + N/2][0] = (float) (q[k][0] - (wkcos * r[k][0] - wksin * r[k][1]));
y[k + N/2][1] = (float) (q[k][1] - (wkcos * r[k][1] + wksin * r[k][0]));
}
return y;
}

Actually, I think I don't understand everything.
First, about Math.cos and Math.sin: how do you want me not to compute them each time? Do you mean that I should compute all the values only once (e.g. store them in an array) and reuse them for each computation?
Second, about the N % 2 check: indeed it's not very useful, I could do the test before calling the function.
Third, about Simon's advice: I mixed what he said and what you said, which is why I replaced Complex with a two-dimensional float[][]. If that was not what he suggested, then what was it?
Finally, I'm not an FFT expert, so what do you mean by a "real FFT"? Do you mean that my imaginary part is useless? If so, I'm not sure, because later in my code I compute the magnitude of each frequency, i.e. sqrt(real[i]*real[i] + imag[i]*imag[i]). And I think that my imaginary part is not equal to zero...
Thanks!

How to round up the result of integer division?

I'm thinking in particular of how to display pagination controls, when using a language such as C# or Java.
If I have x items which I want to display in chunks of y per page, how many pages will be needed?

Found an elegant solution:
int pageCount = (records + recordsPerPage - 1) / recordsPerPage;
Source: Number Conversion, Roland Backhouse, 2001
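As a quick check of the formula, here is a minimal Java sketch. The class name Pagination and the method pageCount are made up for the example, and it assumes records >= 0 and recordsPerPage > 0.

```java
// Sketch only: Pagination and pageCount are hypothetical names.
// Assumes records >= 0 and recordsPerPage > 0.
final class Pagination {
    // Adding (recordsPerPage - 1) before dividing pushes any partial page
    // over the next multiple, so truncating division rounds up.
    static int pageCount(int records, int recordsPerPage) {
        return (records + recordsPerPage - 1) / recordsPerPage;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 10, 11, 20, 21};
        for (int records : samples) {
            System.out.println(records + " records -> " + pageCount(records, 10) + " pages");
        }
    }
}
```

With 10 records per page, 11 records become (11 + 9) / 10 = 2 pages, while exactly 10 records become (10 + 9) / 10 = 1 page.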

Converting to floating point and back seems like a huge waste of time at the CPU level.
Ian Nelson's solution:
int pageCount = (records + recordsPerPage - 1) / recordsPerPage;
Can be simplified to:
int pageCount = (records - 1) / recordsPerPage + 1;
AFAICS, this doesn't have the overflow bug that Brandon DuRette pointed out, and because it only uses it once, you don't need to store the recordsPerPage specially if it comes from an expensive function to fetch the value from a config file or something.
I.e. this might be inefficient, if config.fetch_value used a database lookup or something:
int pageCount = (records + config.fetch_value('records per page') - 1) / config.fetch_value('records per page');
This creates a variable you don't really need, which probably has (minor) memory implications and is just too much typing:
int recordsPerPage = config.fetch_value('records per page')
int pageCount = (records + recordsPerPage - 1) / recordsPerPage;
This is all one line, and only fetches the data once:
int pageCount = (records - 1) / config.fetch_value('records per page') + 1;

For C# the solution is to cast the values to a double (as Math.Ceiling takes a double):
int nPages = (int)Math.Ceiling((double)nItems / (double)nItemsPerPage);
In java you should do the same with Math.ceil().

This should give you what you want. You will definitely want x items divided by y items per page, the problem is when uneven numbers come up, so if there is a partial page we also want to add one page.
int x = number_of_items;
int y = items_per_page;
// without a library
int pages = x/y + (x % y > 0 ? 1 : 0);
// with library
int pages = (int)Math.Ceiling((double)x / (double)y);

The integer math solution that Ian provided is nice, but suffers from an integer overflow bug. Assuming the variables are all int, the solution could be rewritten to use long math and avoid the bug:
int pageCount = (int) ((-1L + records + recordsPerPage) / recordsPerPage);
If records is a long, the bug remains. The modulus solution does not have the bug.
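The overflow is easy to reproduce. The sketch below, with hypothetical names (SafePageCount, naive, viaLong, viaRemainder), puts the plain int formula next to the long-based and remainder-based variants; all three assume records >= 0 and recordsPerPage > 0.

```java
// Sketch only: SafePageCount and its method names are hypothetical.
// Assumes records >= 0 and recordsPerPage > 0.
final class SafePageCount {
    // Plain int formula: records + recordsPerPage - 1 can exceed Integer.MAX_VALUE.
    static int naive(int records, int recordsPerPage) {
        return (records + recordsPerPage - 1) / recordsPerPage;
    }

    // Same formula evaluated in 64-bit arithmetic, then narrowed back to int.
    static int viaLong(int records, int recordsPerPage) {
        return (int) ((records + (long) recordsPerPage - 1) / recordsPerPage);
    }

    // Remainder-based variant: never forms a sum, so it cannot overflow.
    static int viaRemainder(int records, int recordsPerPage) {
        return records / recordsPerPage + (records % recordsPerPage > 0 ? 1 : 0);
    }

    public static void main(String[] args) {
        int records = Integer.MAX_VALUE;
        System.out.println("naive:        " + naive(records, 10));
        System.out.println("viaLong:      " + viaLong(records, 10));
        System.out.println("viaRemainder: " + viaRemainder(records, 10));
    }
}
```

For records = Integer.MAX_VALUE the plain formula wraps around and yields a negative page count, while the other two agree.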

In need of an extension method:
public static int DivideUp(this int dividend, int divisor)
{
return (dividend + (divisor - 1)) / divisor;
}
No checks here (overflow, DivideByZero, etc), feel free to add if you like. By the way, for those worried about method invocation overhead, simple functions like this might be inlined by the compiler anyways, so I don't think that's where to be concerned. Cheers.
P.S. you might find it useful to be aware of this as well (it gets the remainder):
int remainder;
int result = Math.DivRem(dividend, divisor, out remainder);

HOW TO ROUND UP THE RESULT OF INTEGER DIVISION IN C#
I was interested to know what the best way is to do this in C# since I need to do this in a loop up to nearly 100k times. Solutions posted by others using Math are ranked high in the answers, but in testing I found them slow. Jarod Elliott proposed a better tactic in checking if mod produces anything.
int result = (int1 / int2);
if (int1 % int2 != 0) { result++; }
I ran this in a loop 1 million times and it took 8ms. Here is the code using Math:
int result = (int)Math.Ceiling((double)int1 / (double)int2);
Which ran at 14ms in my testing, considerably longer.

A variant of Nick Berardi's answer that avoids a branch:
int q = records / recordsPerPage, r = records % recordsPerPage;
int pageCount = q - (-r >> (Integer.SIZE - 1));
Note: (-r >> (Integer.SIZE - 1)) consists of the sign bit of -r, repeated 32 times (thanks to sign extension of the >> operator). This evaluates to 0 if r is zero or negative, -1 if r is positive. So subtracting it from q has the effect of adding 1 if records % recordsPerPage > 0.
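The shift trick can be checked against the ordinary conditional version with a small sketch. The names BranchFreePages, pageCount and pageCountWithBranch are invented for this example, and non-negative records with a positive recordsPerPage are assumed.

```java
// Sketch only: BranchFreePages and its method names are hypothetical.
// Assumes records >= 0 and recordsPerPage > 0.
final class BranchFreePages {
    // Branch-free variant from the answer above.
    static int pageCount(int records, int recordsPerPage) {
        int q = records / recordsPerPage, r = records % recordsPerPage;
        // -r >> 31 is -1 when r > 0 and 0 when r == 0, so this adds 1 for a partial page.
        return q - (-r >> (Integer.SIZE - 1));
    }

    // Reference version with an explicit conditional.
    static int pageCountWithBranch(int records, int recordsPerPage) {
        int q = records / recordsPerPage;
        return (records % recordsPerPage > 0) ? q + 1 : q;
    }

    public static void main(String[] args) {
        for (int records = 0; records <= 12; records++) {
            System.out.println(records + " -> " + pageCount(records, 5)
                    + " (with branch: " + pageCountWithBranch(records, 5) + ")");
        }
    }
}
```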

Another alternative is to use the mod() function (or '%'). If there is a non-zero remainder then increment the integer result of the division.

For records == 0, rjmunro's solution gives 1. The correct solution is 0. That said, if you know that records > 0 (and I'm sure we've all assumed recordsPerPage > 0), then rjmunro's solution gives correct results and does not have any of the overflow issues.
int pageCount = 0;
if (records > 0)
{
pageCount = (((records - 1) / recordsPerPage) + 1);
}
// no else required
All the integer math solutions are going to be more efficient than any of the floating point solutions.

I do the following, handles any overflows:
var totalPages = totalResults.IsDivisble(recordsperpage) ? totalResults/(recordsperpage) : totalResults/(recordsperpage) + 1;
And use this extension for if there's 0 results:
public static bool IsDivisble(this int x, int n)
{
return (x%n) == 0;
}
Also, for the current page number (wasn't asked but could be useful):
var currentPage = (int) Math.Ceiling(recordsperpage/(double) recordsperpage) + 1;

you can use
(int)Math.Ceiling(((decimal)model.RecordCount )/ ((decimal)4));

Alternative to remove branching in testing for zero:
int pageCount = (records + recordsPerPage - 1) / recordsPerPage * (records != 0);
Not sure if this will work in C#, should do in C/C++.

I made this for me, thanks to Jarod Elliott & SendETHToThisAddress replies.
public static int RoundedUpDivisionBy(this int @this, int divider)
{
var result = @this / divider;
if (@this % divider is 0) return result;
return result + Math.Sign(@this * divider);
}
Then I realized it is overkill for the CPU compared to the top answer.
However, I think it's readable and works with negative numbers as well.

A generic method, whose result you can iterate over may be of interest:
public static Object[][] chunk(Object[] src, int chunkSize) {
int overflow = src.length%chunkSize;
int numChunks = (src.length/chunkSize) + (overflow>0?1:0);
Object[][] dest = new Object[numChunks][];
for (int i=0; i<numChunks; i++) {
dest[i] = new Object[ (i<numChunks-1 || overflow==0) ? chunkSize : overflow ];
System.arraycopy(src, i*chunkSize, dest[i], 0, dest[i].length);
}
return dest;
}

The following should do rounding better than the above solutions, but at the expense of performance (due to floating point calculation of 0.5*rctDenominator):
uint64_t integerDivide( const uint64_t& rctNumerator, const uint64_t& rctDenominator )
{
// Ensure .5 upwards is rounded up (otherwise integer division just truncates - ie gives no remainder)
return (rctDenominator == 0) ? 0 : (rctNumerator + (int)(0.5*rctDenominator)) / rctDenominator;
}

I had a similar need where I needed to convert Minutes to hours & minutes. What I used was:
int hrs = 0; int mins = 0;
float tm = totalmins;
if ( tm >= 60 ) { hrs = (int) (tm / 60); }
mins = (int) (tm - (hrs * 60));
System.out.println("Total time in Hours & Minutes = " + hrs + ":" + mins);

You'll want to do floating point division, and then use the ceiling function, to round up the value to the next integer.
