Say I have a hash table with 59 elements (each element value is an integer). Index 15 is blank and the rest of the table is full of data. Depending on the number I want to insert, the quadratic probing formula never hits element 15!
Assume I want to insert the number 199 (which hashes to 22 using the hashFunc() function I'm using below):
public int hashFunc(int key)
{
    return key % arraySize; // 199 % 59 = 22
}
public void insert(DataItem item)
{
    int key = item.getKey();     // extract the key (199)
    int hashVal = hashFunc(key); // hash the key (22)
    int i = 1;
    // Keep probing while the slot is occupied; a key of -1 marks a deleted element
    while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1)
    {
        hashVal = hashFunc(key) + (i * i); // This never hits element 15!!!
        i++;
        hashVal %= arraySize; // wrap around when hashVal goes beyond 59
    }
    hashArray[hashVal] = item; // insert item
}
This is expected in a quadratic probing hash table. Using some modular arithmetic, you can show that for a prime table size p, only the first p / 2 probe locations in the probe sequence are guaranteed to be unique, meaning that an element's probe sequence may never visit up to half of the locations in the table.
To fix this, you should probably update your code so that you rehash any time p / 2 or more of the table locations are in use. Alternatively, you can use the technique suggested in the Wikipedia article of alternating the sign of your probe offset (+1, -4, +9, -16, +25, etc.); for a prime table size p with p ≡ 3 (mod 4), which includes 59, this probe sequence hits every possible location.
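For illustration, here is a minimal sketch of that alternating-sign variant applied to your insert() method (it assumes the hashArray, arraySize, and hashFunc() from your snippet; insertAlternating is just a name I picked):

public void insertAlternating(DataItem item)
{
    int key = item.getKey();
    int home = hashFunc(key); // the "home" slot, e.g. 22 for key 199
    int hashVal = home;
    int i = 1;
    while (hashArray[hashVal] != null && hashArray[hashVal].getKey() != -1)
    {
        int offset = i * i;
        if (i % 2 == 0)
            offset = -offset; // probe offsets: +1, -4, +9, -16, +25, ...
        // the double mod keeps the index non-negative after a negative offset
        hashVal = ((home + offset) % arraySize + arraySize) % arraySize;
        i++;
    }
    hashArray[hashVal] = item; // insert item
}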
Hope this helps!
I have been struggling to solve an array problem in linear time.
The problem is:
Assuming we are given an array A[1...n], write an algorithm that returns true if:
There are two numbers in the array x,y that have the following:
x < y
x repeats more than n/3 times
y repeats more than n/4 times
I have tried to write the following Java program to do so, assuming we have a sorted array, but I don't think it is the best implementation.
public static boolean solutionManma() {
    int[] arr = {2, 2, 2, 3, 3, 3};
    int n = arr.length;
    int xCount = 1;
    int yCount = 1;
    int maxXcount = xCount, maxYCount = yCount;
    int currX = arr[0];
    int currY = arr[n - 1];
    for (int i = 1; i < n - 2; i++) {
        int right = arr[n - 1 - i]; // == arr[n-2-i+1]
        int left = arr[i];
        if (currX == left) {
            xCount++;
        } else {
            maxXcount = Math.max(xCount, maxXcount);
            xCount = 1;
            currX = left;
        }
        if (currY == right) {
            yCount++;
        } else {
            maxYCount = Math.max(yCount, maxYCount);
            yCount = 1;
            currY = right;
        }
    }
    return (maxXcount > n / 3 && maxYCount > n / 4);
}
If anyone has an algorithm idea for this kind of issue (preferably O(n)) I would much appreciate it because I got stuck with this one.
The key part of this problem is to find in linear time and constant space the values which occur more than n/4 times. (Note: the text of your question says "more than" and the title says "at least". Those are not the same condition. This answer is based on the text of your question.)
There are at most three values which occur more than n/4 times, and a list of such values must also include any value which occurs more than n/3 times.
The algorithm we'll use returns a list of up to three values. It only guarantees that all values which satisfy the condition are in the list it returns. The list might include other values, and it does not provide any information about the precise frequencies.
So a second pass is necessary, which scans the vector a second time counting the occurrences of each of the three values returned. Once you have the three counts, it's simple to check whether the smallest value which occurs more than n/3 times (if any) is less than the largest value which occurs more than n/4 times.
To construct the list of candidates, we use a generalisation of the Boyer-Moore majority vote algorithm, which finds a value which occurs more than n/2 times. The generalisation, published in 1982 by J. Misra and D. Gries, uses k-1 counters, each possibly associated with a value, to identify values which might occur more than n/k times. In this case, k is 4 and so we need three counters.
Initially, all of the counters are 0 and are not associated with any value. Then for each value in the array, we do the following:
If there is a counter associated with that value, we increment it.
If no counter is associated with that value but some counter is at 0, we associate that counter with the value and increment its count to 1.
Otherwise, we decrement every counter's count.
Once all the values have been processed, the values associated with counters with positive counts are the candidate values.
For a general implementation where k is not known in advance, it would be possible to use a hash-table or other key-value map to identify values with counts. But in this case, since it is known that k is a small constant, we can just use a simple vector of three value-count pairs, making this algorithm O(n) time and O(1) space.
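As a concrete illustration, here is a minimal sketch of that candidate-finding pass for k = 4 (the three-counter case); the method name and structure are my own, and, as described above, a second counting pass is still needed to verify the candidates:

import java.util.ArrayList;
import java.util.List;

static List<Integer> candidates(int[] a) {
    int[] value = new int[3];
    int[] count = new int[3]; // a count of 0 means "not associated with any value"
    for (int x : a) {
        boolean handled = false;
        for (int c = 0; c < 3 && !handled; c++)      // case 1: x is already tracked
            if (count[c] > 0 && value[c] == x) { count[c]++; handled = true; }
        for (int c = 0; c < 3 && !handled; c++)      // case 2: claim a free counter
            if (count[c] == 0) { value[c] = x; count[c] = 1; handled = true; }
        if (!handled)                                // case 3: decrement every counter
            for (int c = 0; c < 3; c++) count[c]--;
    }
    List<Integer> out = new ArrayList<>();
    for (int c = 0; c < 3; c++)
        if (count[c] > 0) out.add(value[c]); // candidates only; exact frequencies unknown
    return out;
}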
I will suggest the following solution, based on this assumption:
In an array of length n there will be at most n different numbers (and, for the histogram indexing below to work, each value must lie in the range 0 <= a[ii] < n).
The key feature will be to count the frequency of occurrence of each different input using a histogram with n bins, meaning O(n) space. The algorithm will be as follows:
1. Create a histogram vector with n bins, initialized to zeros.
2. For index ii over the length of the input array a:
   2.1. Increase the value: hist[a[ii]] += 1
3. Set found_x and found_y to False.
4. For the ii-th bin in the histogram, check:
   4.1. If found_x == False:
      4.1.1. If hist[ii] > n/3, set found_x = True and set x = ii.
   4.2. Else if found_y == False:
      4.2.1. If hist[ii] > n/4, set y = ii and return x, y.
Explanation
In the first pass over the array you record the occurrence frequency of all the numbers. Then you pass over the histogram array, which also has a length of n, and check the occurrences. First you check whether there is a number that occurred more than n/3 times; if there is, then for the rest of the numbers (larger than x by default, since histogram bins are visited in increasing order) you check whether there is another number which occurred more than n/4 times. If there is, you return the found x and y, and if there isn't you simply return "not found" after covering all the bins in the histogram.
As far as time complexity, you go over the input array once and over the histogram of the same length once, therefore the time complexity is O(n), as requested.
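A minimal Java sketch of the above (my own naming; it relies on the stated assumption that every value is a valid index into the n-bin histogram):

static int[] findXY(int[] a) {
    int n = a.length;
    int[] hist = new int[n];
    for (int v : a)
        hist[v]++; // steps 1-2: count occurrences
    int x = -1;    // -1 is a safe "not found" sentinel, since values are non-negative
    for (int v = 0; v < n; v++) {
        if (x == -1) {
            if (hist[v] > n / 3)
                x = v;               // smallest value occurring more than n/3 times
        } else if (hist[v] > n / 4) {
            return new int[] {x, v}; // y = v > x by construction
        }
    }
    return null; // not found
}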
Say we have an array
a[] = {1, 2, -3, 3, -3, -3, 4, -4, 5}
and we want to find the position of 3 (which would be index 3).
There will be no multiple indexes for an answer.
It must be efficient, and NOT linear.
I was thinking of doing a Binary Search of the array, but instead of comparing the raw values, I wanted to compare the absolute values; abs(a[i]) and abs(n) [n is the input number]. Then if the values are equal, I do another comparison, now with the raw values a[i] and n.
But I run into a problem in the above situation with the same array {1,2,-3,3,-3,-3,4,-4,5}: when looking for 3, there are multiple -3 values in the way. Thus, if comparing the raw values a[i] and n does not work, I have to check a[i+1] and a[i-1] as well.
OK, I'm just rambling now. Am I overthinking this?
Help me out, thanks!!! :D
It is a modified binary search problem. The difference between this and regular binary search is that you need to find and test all of the elements that compare as equal according to the sorting criterion.
I would:
use a tweaked binary search algorithm to find the index of the left-most element that matches
iterate through the indexes until you find the element you are looking for, or an element whose absolute value no longer matches.
That should be O(logN) for the first step. The second step is O(1) on average if you assume that the element values are evenly distributed. (The worst case for the second step is O(N); e.g. when the elements all have the same absolute value, and the one you want is the last in the array.)
Here's the method to solve your problem:
/**
 * @param a array sorted by absolute value
 * @param key value to find (must be positive)
 * @return position of the first occurrence of the key or -1 if key not found
 */
public static int binarySearch(int[] a, int key) {
    int low = 0;
    int high = a.length - 1;
    while (low <= high) {
        int mid = (low + high) >>> 1;
        int midVal = Math.abs(a[mid]);
        if (midVal < key)
            low = mid + 1;
        else if (midVal > key || (midVal == key && mid > 0 && Math.abs(a[mid - 1]) == key))
            high = mid - 1;
        else
            return mid; // key found
    }
    return -1; // key not found
}
It's a modification of Arrays.binarySearch from the JDK. There are several changes. First, we compare absolute values. Second, since you want not just any position of the key but the first one, I modified a condition: when we find the key, we check whether the previous array item has the same absolute value; if it does, we continue searching to the left. This way the algorithm remains O(log N) even in special cases where many values are equal to the key.
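To answer the original question (finding the signed value, e.g. 3 among the -3s), you can follow this with a short forward scan. Here is a hypothetical wrapper (my own addition) using the method above:

public static int findSigned(int[] a, int n) {
    int i = binarySearch(a, Math.abs(n)); // first index with a matching absolute value
    while (i >= 0 && i < a.length && Math.abs(a[i]) == Math.abs(n)) {
        if (a[i] == n)
            return i; // raw value matches
        i++;
    }
    return -1; // not present
}

For the array {1, 2, -3, 3, -3, -3, 4, -4, 5} and n = 3, binarySearch returns 2 and the scan stops at index 3, as the question expects.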
I had an interview and there was the following question:
Find unique numbers from sorted array in less than O(n) time.
Ex: 1 1 1 5 5 5 9 10 10
Output: 1 5 9 10
I gave the solution but that was of O(n).
Edit: Sorted array size is approx 20 billion and unique numbers are approx 1000.
Divide and conquer:
look at the first and last element of a sorted sequence (the initial sequence is data[0]..data[data.length-1]).
If both are equal, the only element in the sequence is the first (no matter how long the sequence is).
If they are different, divide the sequence and repeat for each subsequence.
Solves in O(log(n)) in the average case, and O(n) only in the worst case (when each element is different).
Java code:
public static List<Integer> findUniqueNumbers(int[] data) {
    List<Integer> result = new LinkedList<Integer>();
    findUniqueNumbers(data, 0, data.length - 1, result, false);
    return result;
}

private static void findUniqueNumbers(int[] data, int i1, int i2, List<Integer> result, boolean skipFirst) {
    int a = data[i1];
    int b = data[i2];
    // homogeneous sequence a...a
    if (a == b) {
        if (!skipFirst) {
            result.add(a);
        }
    } else {
        // divide & conquer
        int i3 = (i1 + i2) / 2;
        findUniqueNumbers(data, i1, i3, result, skipFirst);
        findUniqueNumbers(data, i3 + 1, i2, result, data[i3] == data[i3 + 1]);
    }
}
I don't think it can be done in less than O(n). Take the case where the array contains 1 2 3 4 5: in order to get the correct output, each element of the array would have to be looked at, hence O(n).
If your sorted array of size n has m distinct elements, you can do it in O(m log n).
Note that this is going to be efficient when m << n (e.g. m = 2 and n = 100).
Algorithm:
Initialization: current element y = first element x[0]
Step 1: Do a binary search for the last occurrence of y in x (can be done in O(log(n)) time). Let its index be i.
Step 2: y = x[i+1] and go to step 1.
Edit: In cases where m = O(n) this algorithm is going to work badly. To alleviate it you can run it in parallel with regular O(n) algorithm. The meta algorithm consists of my algorithm and O(n) algorithm running in parallel. The meta algorithm stops when either of these two algorithms complete.
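A minimal Java sketch of steps 1-2 above (my own naming; the parallel meta-algorithm is omitted):

import java.util.ArrayList;
import java.util.List;

static List<Integer> uniques(int[] x) {
    List<Integer> out = new ArrayList<>();
    int i = 0;
    while (i < x.length) {
        int y = x[i];
        out.add(y);
        int lo = i, hi = x.length - 1; // binary search for the last occurrence of y
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1; // upper middle, so the search always progresses
            if (x[mid] == y) lo = mid; else hi = mid - 1;
        }
        i = lo + 1; // continue just past the last y
    }
    return out; // e.g. {1,1,1,5,5,5,9,10,10} -> [1, 5, 9, 10]
}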
Since the data consists of integers, there is a finite number of unique values that can occur between any two values. So, start by looking at the first and last value in the array. If a[length-1] - a[0] < length - 1, there will be some repeating values. Put a[0] and a[length-1] into some constant-access-time container like a hash set. If the two values are equal, you know that there is only one unique value in the array and you are done. You know that the array is sorted, so if the two values are different, you can look at the middle element now. If the middle element is already in the set of values, you know that you can skip the whole left part of the array and only analyze the right part recursively. Otherwise, analyze both the left and right parts recursively.
Depending on the data in the array you will be able to get the set of all unique values in a different number of operations. You get them in constant time O(1) if all the values are the same since you will know it after only checking the first and last element. If there are "relatively few" unique values, your complexity will be close to O(log N) because after each partition you will "quite often" be able to throw away at least one half of the analyzed sub-array. If the values are all unique and a[length-1] - a[0] = length - 1, you can also "define" the set in constant time because they have to be consecutive numbers from a[0] to a[length-1]. However, in order to actually list them, you will have to output each number, and there are N of them.
Perhaps someone can provide a more formal analysis, but my estimate is that this algorithm is roughly linear in the number of unique values rather than the size of the array. This means that if there are few unique values, you can get them in few operations even for a huge array (e.g. in constant time regardless of array size if there is only one unique value). Since the number of unique values is no greater than the size of the array, I claim that this makes this algorithm "better than O(N)" (or, strictly: "not worse than O(N) and better in many cases").
import java.util.*;

/**
 * Remove duplicates in a sorted array in average O(log(n)), worst O(n).
 * @author XXX
 */
public class UniqueValue {
    public static void main(String[] args) {
        int[] test = {-1, -1, -1, -1, 0, 0, 0, 0, 2, 3, 4, 5, 5, 6, 7, 8};
        UniqueValue u = new UniqueValue();
        System.out.println(u.getUniqueValues(test, 0, test.length - 1));
    }

    // i must be the start index, j must be the end index
    public List<Integer> getUniqueValues(int[] array, int i, int j) {
        if (array == null || array.length == 0) {
            return new ArrayList<Integer>();
        }
        List<Integer> result = new ArrayList<>();
        if (array[i] == array[j]) {
            result.add(array[i]);
        } else {
            int mid = (i + j) / 2;
            result.addAll(getUniqueValues(array, i, mid));
            // skip past duplicates of the middle element to avoid dividing into them again
            while (mid < j && array[mid] == array[++mid]);
            if (array[(i + j) / 2] != array[mid]) {
                result.addAll(getUniqueValues(array, mid, j));
            }
        }
        return result;
    }
}
I am not sure of the best way to go about hashing a "dictionary" into a table.
The dictionary has 61406 words; I determine the table size by sizeOfDictionary / 0.75 (a 0.75 load factor).
That gives me 81874 buckets in the table.
I run it through my hash function (a generic random algorithm): 31690 buckets get used and some 50 thousand are empty. The largest bucket contains only 10 words.
My question: Do these numbers suffice for a hashing project? I am unfamiliar with what I am trying to achieve; to me, some 50 thousand empty buckets seems like a lot.
Here is my hashing function.
private void hashingAlgorithm(String word)
{
    int key = 1;
    // Multiplying ASCII values of the string
    // to determine the index
    for (int i = 0; i < word.length(); i++) {
        key *= (int) word.charAt(i);
        // Accounting for integer overflow
        if (key < 0)
            key *= -1;
    }
    key %= sizeOfTable;
    // Inserting into the table
    table[key].addToBucket(word);
}
Performance analysis:
Your hashing function doesn't take the order of characters into account. According to your algorithm, if there's no overflow,
hash("ab") = hash("ba"). Your code depends only on overflow to distinguish different orderings, so there is room for a lot of extra collisions, which can be removed if you think of the strings as base-N numbers.
Suggested Improvement:
2 * 3 == 3 * 2
but
2 * 223 + 3 != 3 * 223 + 2
So if we represent the strings as base-N numbers, the number of collisions will decrease dramatically.
If the dictionary contains words like:
abdc
abcd
dbca
dabc
dacb
all will get hashed to the same value in the hash table, i.e. int(a)*int(b)*int(c)*int(d), which is not a good idea.
So, use a rolling hash.
For example:
hash = word[0]*base^(n-1) + word[1]*base^(n-2) + ... + word[n-1]
where base is a prime number, say 31. (NOTE: word[i] means word.charAt(i).)
You can also apply a modulo p (p being, obviously, a prime number) to avoid overflow and limit the size of your hash table:
hash = (word[0]*base^(n-1) + word[1]*base^(n-2) + ... + word[n-1]) mod p
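A minimal Java sketch of this polynomial hash, using base 31 and Horner's rule so the modulo is applied at every step (the modulus value and names here are illustrative, not prescribed):

private int polyHash(String word, int p) {
    long hash = 0;
    for (int i = 0; i < word.length(); i++) {
        // Horner's rule: equivalent to word[0]*31^(n-1) + ... + word[n-1], all mod p
        hash = (hash * 31 + word.charAt(i)) % p;
    }
    return (int) hash; // always in [0, p), so p can be your table size
}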
Here is an implementation of HashMap.
It provides this code for getting the index of a bin:
private int getIndex(K key)
{
    int hash = key.hashCode() % nodes.length;
    if (hash < 0)
        hash += nodes.length;
    return hash;
}
To make sure the hash value is not bigger than the size of the table, the result of the user-provided hash function is used modulo the length of the table. We need the index to be non-negative, but the modulus operator (%) will return a negative number if the left operand (the hash value) is negative, so we have to test for it and make it non-negative.
If hash turns out to be a very big negative value, the repeated additions hash += nodes.length in a loop could take a lot of processing.
I think there should be an O(1) algorithm for it (independent of the hash value).
If so, how can it be achieved?
It can't be a very big negative number.
The result of anything % nodes.length is always less than nodes.length in absolute value, so you need a single if, not a loop. This is exactly what the code does:
if (hash < 0) /* `if', not `while' */
    hash += nodes.length;
This is not the approach HashMap uses in reality.
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length - 1);
}
This works because length is always a power of 2, and masking with length - 1 is then the same as an unsigned % length.
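A quick demo of that equivalence (my own example, assuming a power-of-two length):

public class IndexForDemo {
    public static void main(String[] args) {
        int length = 16;                            // a power of two, as in HashMap
        int h = -23;                                // a negative hash code
        int masked = h & (length - 1);              // keeps the low 4 bits: 9
        int mod = ((h % length) + length) % length; // non-negative remainder: 9
        System.out.println(masked == mod);          // prints "true"
    }
}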
If hash turns out to be a very big negative value, the repeated additions hash += nodes.length in a loop could take a lot of processing.
The hash at this point must be between -length+1 and length-1, so it cannot be a very large negative value, and the code wouldn't work if it were. In any case, it doesn't matter how large the value is; the cost of the single addition is always the same.