Cuckoo Hashing Collisions lead to overflow - java

I'm trying to implement cuckoo hashing with hash functions:
hash1: key.hashcode() % capacity
hash2: key.hashcode() / capacity % capacity
With an infinite loop check and rehashing method doubling capacity. The program works fine with small amount of data, but when data gets big (around 20k elements) the program keeps getting rehashing until the capacity gets overflowed.
I figured that mostly the infinite rehashing causes by the data with exactly same hashcode. After rehashing, there will be chance other data get same hashcode and causing rehashing again.
I already use Java built-in hashcode but the chance of same hashcodes still high when data is large. Even I modified a little bit hashcode method, eventually there is still data with same hashcode.
So which hash method should I use to prevent this?

A usual method to create a hash function is generally to use primes. I write a function (below), with which, I don't guarantee no collisions, but it should be lessened.
hashFunction1(String s){
int k = 7; //take a prime number, can be anything (I just chose 7)
for(int i = 0; i < s.length(); i++){
k *= (23 * (int)(s.charAt(i)));
k %= capacity;
}
}
//23 is another randomly chosen number.
You can write a similar hash function as hashFunction2, choosing two different prime numbers. But here, the main problem is, for strings "stop" and "pots", this gives same hash code.
So, an improvization over this function can be:
hashFunction1(String s){
int k = 7; //take a prime number, can be anything (I just chose 7)
for(int i = 0; i < s.length(); i++){
k *= (23 * (int)(s.charAt(i)));
k += (int)(s.charAt(i));
k %= capacity;
}
}
which will resolve this (for most cases, if not all).
If you still find this function bad, instead of s.charAt(i), you can use a unique prime number mapped to every character, ie. a=3, b=5, c=7, d=11 and so on. This should resolve collision even more.
EDIT:
You are using +n, which is a constant.
2 is not the prime to be used in such cases. Use an odd prime number, 3 works.

Related

Big-Oh notation calculating constant steps

I've searched about Big O notation for some time, and I learned that when calculating we have to assume that every sentence that doesn't depend on the size of the input data takes a constant C number computational steps.
My program goes like this.
It "always" takes in a "48 bit" random seed and produces an output, and the actual moves that happen within the process of producing the output varies according to the seed value itself, not by the size because it's fixed.
I for looped this process for n times, in order to get n outputs.
Does this mean the Big O notation is O(n) for my program? Or am I completely misunderstanding something?
So, the number of loops I just write in the code. For example, if I set it to 1000, it takes in 1000 input seeds and produces 1000 outputs. The process within the loop, so the number of for loops or the number of if - else or switch statements inside the bigger loop are fixed. The only thing that changes inside the bigger loop is which "if statement" to choose depending on the value of the seed.
The complexity is always expressed as relative to something.
Since your input length is constant, it doesn't make much sense to express the complexity relative to that. It may have O(n) complexity relative to number of loops, but again, since the value is hard-coded this information will have little value to the user.
Perhaps the most useful in your case is information about what the complexity is relative to the input value. If that is constant, you can say that your program performs in constant time, because no matter what the (valid) user input will be, the time it will take your program to produce the output will roughly be the same.
int f(int[] a, int[] b, int c) {
int n = a.length;
int m = b.length;
int y = 0;
for (int i = 0; i < n; ++i) {
y += a[i];
for (int j = 0; j < m; ++j) {
y -= b[j];
}
}
for (int k = 0; k < c; ++k) {
y += 13;
}
return y;
}
So the complexity is O(n.m)+O(c) counting steps (loop, recursive calls).
for (long k = seed; k > 1; k /= 2) {
...;
}
would give O(²log seed), at most 48.
Strictly said, O(n) means that this n is algorithm parameter/input of some kind or derived from it. It can be length of input, or even output, but it's derived from algorithm parameter.
So, this O(n) has a meaning when we talk about your procedure "loop this process for n times", if you automated this procedure with some script. Algorithm itself still works in O(1) time. If you don't automate procedure, just forget about Big O - manual action makes it irrelevant.

How do I make this have a space complexity of O(1) instead of O(n)?

I'm trying to convert a decimal into binary number using iterative process. How can I make this have a space complexity of O(1) instead of O(n)?
int i = 0;
int j;
int bin[] = new int[n]; //n here is my paramater int n
while(n > 0) {
bin[i] = n % 2;
n /= 2;
i++;
}
//I'm reversing the order of index i with variable j to get right order (e.g. 26 has 11010, instead of 01011)
for(j = i -1; j >= 0; j--) {
System.out.print(bin[j]);
}
First, you don't need place for n bits if the value itself is n. You just need log2(n)+1. It won't give you wrong results to use n bits, but for big values of n, the memory available to your Java process might be not enough.
And, about O(1)... maybe not really what you were thinking, but:
Javas int has a specific fixed value range, which leads to the guarantee that a (positive) int value needs max 31 bit (if you have negative numbers too, storing the sign somewhere is necessary, that's bit 32).
With that information, strictly speaking, you can get O(1) just by rewriting your loops so that they loop exactly 31 times. Then, for each value of n, your code has exactly the same amount of steps, and that is O(1) per definition.
Going the bit fiddling route won't help here. There are some useful shortcuts if your values fulfil certain conditions, but if you want your code to work with any int value, the normal loop as you have here is likely the best you can get.
(Of yourse, CPU intrinsics may help, but not for Java...)

Checking whether there is a subset of size k of an array which has a sum multiple of n

Good evening, I have an array in java with n integer numbers. I want to check if there is a subset of size k of the entries that satisfies the condition:
The sum of those k entries is a multiple of m.
How may I do this as efficiently as possible? There are n!/k!(n-k)! subsets that I need to check.
You can use dynamic programming. The state is (prefix length, sum modulo m, number of elements in a subset). Transitions are obvious: we either add one more number(increasing the number of elements in a subset and computing new sum modulo m), or we just increase prefix lenght(not adding the current number). If you just need a yes/no answer, you can store only the last layer of values and apply bit optimizations to compute transitions faster. The time complexity is O(n * m * k), or about n * m * k / 64 operations with bit optimizations. The space complexity is O(m * k). It looks feasible for a few thousands of elements. By bit optimizations I mean using things like bitset in C++ that can perform an operation on a group of bits at the same time using bitwise operations.
I don't like this solution, but it may work for your needs
public boolean containsSubset( int[] a , int currentIndex, int currentSum, int depth, int divsor, int maxDepth){
//you could make a, maxDepth, and divisor static as well
//If maxDepthis equal to depth, then our subset has k elements, in addition the sum of
//elements must be divisible by out divsor, m
//If this condition is satisafied, then there exists a subset of size k whose sum is divisible by m
if(depth==maxDepth&&currentSum%divsor==0)
return true;
//If the depth is greater than or equal maxDepth, our subset has more than k elements, thus
//adding more elements can not satisfy the necessary conditions
//additionally we know that if it contains k elements and is divisible by m, it would've satisafied the above condition.
if(depth>=maxdepth)
return false;
//boolean to be returned, initialized to false because we have not found any sets yet
boolean ret = false;
//iterate through all remaining elements of our array
for (int i = currentIndex+1; i < a.length; i++){
//this may be an optimization or this line
//for (int i = currentIndex+1; i < a.length-maxDepth+depth; i++){
//by recursing, we add a[i] to our set we then use an or operation on all our subsets that could
//be constructed from the numbers we have so far so that if any of them satisfy our condition (return true)
//then the value of the variable ret will be true
ret |= containsSubset(a,i,currentSum+a[i],depth+1,divisor, maxDepth);
} //end for
//return the variable storing whether any sets of numbers that could be constructed from the numbers so far.
return ret;
}
Then invoke this method as such
//this invokes our method with "no numbers added to our subset so far" so it will try adding
// all combinations of other elements to determine if the condition is satisfied.
boolean answer = containsSubset(myArray,-1,0,0,m,k);
EDIT:
You could probably optimize this by taking everything modulo (%) m and deleting repeats. For examples with large values of n and/or k, but small values of m, this could be a pretty big optimization.
EDIT 2:
The above optimization I listed isn't helpful. You may need the repeats to get the correct information. My bad.
Happy Coding! Let me know if you have any questions!
If numbers have lower and upper bounds, it might be better to:
Iterate all multiples of n where lower_bound * k < multiple < upper_bound * k
Check if there is a subset with sum multiple in the array (see Subset Sum problem) using dynamic programming.
Complexity is O(k^2 * (lower_bound + upper_bound)^2). This approach can be optimized further, I believe with careful thinking.
Otherwise you can find all subsets of size k. Complexity is O(n!). Using backtracking (pseudocode-ish):
function find_subsets(array, k, index, current_subset):
if current_subset.size = k:
add current_subset to your solutions list
return
if index = array.size:
return
number := array[index]
add number to current_subset
find_subsets(array, k, index + 1, current_subset)
remove number from current_subset
find_subsets(array, k, index + 1, current_subset)

How to efficiently generate a set of unique random numbers with a predefined distribution?

I have a map of items with some probability distribution:
Map<SingleObjectiveItem, Double> itemsDistribution;
Given a certain m I have to generate a Set of m elements sampled from the above distribution.
As of now I was using the naive way of doing it:
while(mySet.size < m)
mySet.add(getNextSample(itemsDistribution));
The getNextSample(...) method fetches an object from the distribution as per its probability. Now, as m increases the performance severely suffers. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function remains in the while loop for too long. Generate 1000 such sets and you have an application that crawls.
Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?
UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is in the first run we'll definitely get a sample for the set. The 2nd iteration, it may be called at least twice and so on. If getNextSample is sequential in nature, i.e., goes through the entire cumulative distribution to find the sample, then the run time complexity of the loop is at least: n*m(m+1)/2, 'n' is the number of elements in the distribution. If m = cn; 0<c<=1 then the loop is at least Sigma(n^3). And that too is the lower bound!
If we replace sequential search by binary search, the complexity would be at least Sigma(log n * n^2). Efficient but may not be by a large margin.
Also, removing from the distribution is not possible since I call the above loop k times, to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.
Start out by generating a number of random points in two dimentions.
Then apply your distribution
Now find all entries within the distribution and pick the x coordinates, and you have your random numbers with the requested distribution like this:
The problem is unlikely to be the loop you show:
Let n be the size of the distribution, and I be the number of invocations to getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations to getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a poisson process with λ = 1 - i / n, and therefore exponentially distributed with λ. Therefore, E[C_i] = 1 / λ = therefore E[C_i] = 1 / (1 - i / n) <= 1 / (1 - m / n). Therefore, E[I] < m / (1 - m / n).
That is, sampling a set of size m = n/2 will take, on average, less than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample is slow. This is actually unsurprising, given the unsuitable way the distrubution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).
The following should be faster (if m < 0.8 n)
class Distribution<T> {
private double[] cummulativeWeight;
private T[] item;
private double totalWeight;
Distribution(Map<T, Double> probabilityMap) {
int i = 0;
cummulativeWeight = new double[probabilityMap.size()];
item = (T[]) new Object[probabilityMap.size()];
for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
item[i] = entry.getKey();
totalWeight += entry.getValue();
cummulativeWeight[i] = totalWeight;
i++;
}
}
T randomItem() {
double weight = Math.random() * totalWeight;
int index = Arrays.binarySearch(cummulativeWeight, weight);
if (index < 0) {
index = -index - 1;
}
return item[index];
}
Set<T> randomSubset(int size) {
Set<T> set = new HashSet<>();
while(set.size() < size) {
set.add(randomItem());
}
return set;
}
}
public class Test {
public static void main(String[] args) {
int max = 1_000_000;
HashMap<Integer, Double> probabilities = new HashMap<>();
for (int i = 0; i < max; i++) {
probabilities.put(i, (double) i);
}
Distribution<Integer> d = new Distribution<>(probabilities);
Set<Integer> set = d.randomSubset(max / 2);
//System.out.println(set);
}
}
The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.
As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:
Set<T> randomSubset(int size) {
Set<T> set = new HashSet<>();
while(set.size() < size) {
T randomItem = randomItem();
remove(randomItem); // removes the item from the distribution
set.add(randomItem);
}
return set;
}
To efficiently implement remove requires a different representation for the distribution, for instance a binary tree where each node stores the total weight of the subtree whose root it is.
But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.
If you are not concerning with randomness properties too much then I do it like this:
create buffer for pseudo-random numbers
double buff[MAX]; // [edit1] double pseudo random numbers
MAX is size should be big enough ... 1024*128 for example
type can be any (float,int,DWORD...)
fill buffer with numbers
you have range of numbers x = < x0,x1 > and probability function probability(x) defined by your probability distribution so do this:
for (i=0,x=x0;x<=x1;x+=stepx)
for (j=0,n=probability(x)*MAX,q=0.1*stepx/n;j<n;j++,i++) // [edit1] unique pseudo-random numbers
buff[i]=x+(double(i)*q); // [edit1] ...
The stepx is your accuracy for items (for integral types = 1) now the buff[] array has the same distribution as you need but it is not pseudo-random. Also you should add check if j is not >= MAX to avoid array overruns and also at the end the real size of buff[] is j (can be less than MAX due to rounding)
shuffle buff[]
do just few loops of swap buff[i] and buff[j] where i is the loop variable and j is pseudo-random <0-MAX)
write your pseudo-random function
it just return number from the buffer. At first call returns the buff[0] at second buff[1] and so on ... For standard generators When you hit the end of buff[] then shuffle buff[] again and start from buff[0] again. But as you need unique numbers then you can not reach the end of buffer so so set MAX to be big enough for your task otherwise uniqueness will not be assured.
[Notes]
MAX should be big enough to store the whole distribution you want. If it is not big enough then items with low probability can be missing completely.
[edit1] - tweaked answer a little to match the question needs (pointed by meriton thanks)
PS. complexity of initialization is O(N) and for get number is O(1).
You should implement your own random number generator (using a MonteCarlo methode or any good uniform generator like mersen twister) and basing on the inversion method (here).
For example : exponential law: generate a uniform random number u in [0,1] then your random variable of the exponential law would be : ln(1-u)/(-lambda) lambda being the exponential law parameter and ln the natural logarithm.
Hope it'll help ;).
I think you have two problems:
Your itemDistribution doesn't know you need a set, so when the set you are building gets
large you will pick a lot of elements that are already in the set. If you start with the
set all full and remove elements you will run into the same problem for very small sets.
Is there a reason why you don't remove the element from the itemDistribution after you
picked it? Then you wouldn't pick the same element twice?
The choice of datastructure for itemDistribution looks suspicious to me. You want the
getNextSample operation to be fast. Doesn't the map from values to probability force you
to iterate through large parts of the map for each getNextSample. I'm no good at
statistics but couldn't you represent the itemDistribution the other way, like a map from
probability, or maybe the sum of all smaller probabilities + probability to a element
of the set?
Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.
A good way to pick several unique random items from a list is to first shuffle the list and then pop items off the list. You can shuffle the list once with the given distribution. From then on, picking your m items ist just popping the list.
Here's an implementation of a probabilistic shuffle:
List<Item> prob_shuffle(Map<Item, int> dist)
{
int n = dist.length;
List<Item> a = dist.keys();
int psum = 0;
int i, j;
for (i in dist) psum += dist[i];
for (i = 0; i < n; i++) {
int ip = rand(psum); // 0 <= ip < psum
int jp = 0;
for (j = i; j < n; j++) {
jp += dist[a[j]];
if (ip < jp) break;
}
psum -= dist[a[j]];
Item tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
return a;
}
This in not Java, but pseudocude after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.
Here, I used integer probabilities. (The proabilities don't have to add to a special value, it's just "bigger is better".) You can use floating-point numbers but because of inaccuracies, you might end up going beyond the array when picking an item. You should use item n - 1 then. If you add that saftey net, you could even have items with zero probability that always get picked last.
There might be a method to speed up the picking loop, but I don't really see how. The swapping renders any precalculations useless.
Accumulate your probabilities in a table
Probability
Item Actual Accumulated
Item1 0.10 0.10
Item2 0.30 0.40
Item3 0.15 0.55
Item4 0.20 0.75
Item5 0.25 1.00
Make a random number between 0.0 and 1.0 and do a binary search for the first item with a sum that is greater than your generated number. This item would have been chosen with the desired probability.
Ebbe's method is called rejection sampling.
I sometimes use a simple method, using an inverse cumulative distribution function, which is a function that maps a number X between 0 and 1 onto the Y axis.
Then you just generate a uniformly distributed random number between 0 and 1, and apply the function to it.
That function is also called the "quantile function".
For example, suppose you want to generate a normally distributed random number.
It's cumulative distribution function is called Phi.
The inverse of that is called probit.
There are many ways to generate normal variates, and this is just one example.
You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table.
Then you can just invert it by table-lookup and interpolation.

Efficient way of altering data in an array with threads

I've been trying to figure out the most efficient way where many threads are altering a very big byte array on bit level. For ease of explaining I'll base the question around a multithreaded Sieve of Eratosthenes to ease explaining the question. The code though should not be expected to fully completed as I'll omit certain parts that aren't directly related. The sieve also wont be fully optimised as thats not the direct question. The sieve will work in such a way that it saves which values are primes in a byte array, where each byte contains 7 numbers (we can't alter the first bit due to all things being signed).
Lets say our goal is to find all the primes below 1 000 000 000 (1 billion). As a result we would need an byte array of length 1 000 000 000 / 7 +1 or 142 857 143 (About 143 million).
class Prime {
int max = 1000000000;
byte[] b = new byte[(max/7)+1];
Prime() {
for(int i = 0; i < b.length; i++) {
b[i] = (byte)127; //Setting all values to 1 at start
}
findPrimes();
}
/*
* Calling remove will set the bit value associated with the number
* to 0 signaling that isn't an prime
*/
void remove(int i) {
int j = i/7; //gets which array index to access
b[j] = (byte) (b[j] & ~(1 << (i%7)));
}
void findPrimes() {
remove(1); //1 is not a prime and we wanna remove it from the start
int prime = 2;
while (prime*prime < max) {
for(int i = prime*2; i < max; i = prime + i) {
remove(i);
}
prime = nextPrime(prime); //This returns the next prime from the list
}
}
... //Omitting code, not relevant to question
}
Now we got a basic outline where something runs through all numbers for a certain mulitplication table and calls remove to remove numbers set bits that fits the number to 9 if we found out they aren't primes.
Now to up the ante we create threads that do the checking for us. We split the work so that each takes a part of the removing from the table. So for example if we got 4 threads and we are running through the multiplication table for the prime 2, we would assign thread 1 all in the 8 times tables with an starting offset of 2, that is 4, 10, 18, ...., the second thread gets an offset of 4, so it goes through 6, 14, 22... and so on. They then call remove on the ones they want.
Now to the real question. As most can see that while the prime is less than 7 we will have multiple threads accessing the same array index. While running through 2 for example we will have thread 1, thread 2 and thread 3 will all try to access b[0] to alter the byte which causes an race condition which we don't want.
The question therefore is, whats the best way of optimising access to the byte array.
So far the thoughts I've had for it are:
Putting synchronized on the remove method. This obviously would be very easy to implement but an horrible ideas as it would remove any type of gain from having threads.
Create an mutex array equal in size to the byte array. To enter an index one would need the mutex on the same index. This Would be fairly fast but would require another very big array in memory which might not be the best way to do it
Limit the numbers stored in the byte to prime number we start running on. So if we start on 2 we would have numbers per array. This would however increase our array length to 500 000 000 (500 million).
Are there other ways of doing this in a fast and optimal way without overusing the memory?
(This is my first question here so I tried to be as detailed and thorough as possible but I would accept any comments on how I can improve the question - to much detail, needs more detail etc.)
You can use an array of atomic integers for this. Unfortunately there isn't a getAndAND, which would be ideal for your remove() function, but you can CAS in a loop:
java.util.concurrent.atomic.AtomicIntegerArray aia;
....
void remove(int i) {
int j = i/32; //gets which array index to access
do {
int oldVal = aia.get(j);
int newVal = oldVal & ~(1 << (i%32));
boolean updated = aia.weakCompareAndSet(j, oldVal, newVal);
} while(!updated);
}
Basically you keep trying to adjust the slot to remove that bit, but you only succeed if nobody else modifies it out from under you. Safe, and likely to be very efficient. weakCompareAndSet is basically an abstracted Load-link/Store conditional instruction.
BTW, there's no reason not to use the sign bit.
I think you could avoid synchronizing threads...
For example, this task:
for(int i = prime*2; i < max; i = prime + i) {
remove(i);
}
it could be partitioned in small tasks.
for (int i =0; i < thread_poll; i++){
int totalPos = max/8; // dividing virtual array in bytes
int partitionSize = totalPos /thread_poll; // dividing bytes by thread poll
removeAll(prime, partitionSize*i*8, (i + 1)* partitionSize*8);
}
....
// no colisions!!!
void removeAll(int prime, int initial; int max){
k = initial / prime;
if (k < 2) k = 2;
for(int i = k * prime; i < max; i = i + prime) {
remove(i);
}
}

Categories