How efficient is this hash function? - java

I am not sure of the best way to go about hashing a "dictionary" into a table.
The dictionary has 61406 words, and I determine the table size as SizeOfDictionary/.75 (a load factor of 0.75).
That gives me 81874 buckets in the table.
I run it through my hash function (a generic random algorithm), and 31690 buckets get used while 50-some thousand stay empty. The largest bucket contains only 10 words.
My question: are these numbers good enough for a hashing project? I am unfamiliar with what I should be aiming for; to me, 50-some thousand seems like a lot of empty buckets.
Here is my hashing function.
private void hashingAlgorithm(String word)
{
int key = 1;
//Multiplying ASCII values of string
//To determine the index
for(int i = 0 ; i < word.length(); i++){
key *= (int)word.charAt(i);
//Accounting for integer overflow
if(key<0)
key*=-1;
}
key %= sizeOfTable;
//Inserting into the table
table[key].addToBucket(word);
}
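For a rough yardstick against the numbers in the question, here is a minimal sketch that assumes an idealized hash spreading keys uniformly at random (an assumption, not something the question's function guarantees). Under that assumption the expected number of empty buckets is m * (1 - 1/m)^n for n keys and m buckets. The class and method names are made up for illustration.

```java
// Sketch: expected bucket occupancy if the hash behaved like a uniform random function.
class BucketEstimate {
    // Expected number of empty buckets after placing `keys` items into
    // `buckets` slots uniformly at random: m * (1 - 1/m)^n.
    static double expectedEmpty(int buckets, int keys) {
        return buckets * Math.pow(1.0 - 1.0 / buckets, keys);
    }

    public static void main(String[] args) {
        int buckets = 81874; // table size from the question
        int keys = 61406;    // dictionary size from the question
        double empty = expectedEmpty(buckets, keys);
        System.out.printf("expected empty: %.0f, expected used: %.0f%n",
                empty, buckets - empty);
    }
}
```

With the question's sizes this comes to a little under half of the buckets being empty, so the observed 50,184 empty buckets (81874 - 31690) is noticeably worse than the uniform ideal, though some empty buckets are unavoidable at this load factor.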

Performance analysis:
Your hashing function doesn't take character order into account. With your algorithm, if there is no overflow,
hash("ab") = hash("ba"). Your code depends on overflow to distinguish different orderings. That leaves room for many extra collisions, which can be avoided if you treat each string as a number written in base N.
Suggested Improvement:
2 * 3 == 3 * 2
but
2 * 223 + 3 != 3 * 223 + 2
So if we represent each string as a base-N number, the number of collisions drops dramatically.

If the dictionary contains words like:
abdc
abcd
dbca
dabc
dacb
all of them will hash to the same value, i.e. int(a)*int(b)*int(c)*int(d), which is not a good idea.
So use a polynomial hash (the same kind that rolling hashes are built on).
Example:
hash = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]
where base is a prime number such as 31.
NOTE: [i] means word.charAt(i).
You can also reduce the result modulo p (preferably a prime) to avoid overflow and to limit the size of your hash table:
hash = ([0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]) mod p
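The formula above can be sketched in Java using Horner's rule, reducing modulo the table size at every step so that nothing overflows. The names `PolyHashDemo`, `productHash` and `polyHash` are illustrative, and `productHash` is only a simplified stand-in for the question's multiplicative function (it reduces modulo the table size at each step, which the original code does not do).

```java
class PolyHashDemo {
    // Simplified product-of-characters hash: order-insensitive, so anagrams collide.
    static int productHash(String word, int tableSize) {
        long key = 1;
        for (int i = 0; i < word.length(); i++) {
            key = (key * word.charAt(i)) % tableSize;
        }
        return (int) key;
    }

    // Polynomial hash: ([0]*base^(n-1) + ... + [n-1]) mod tableSize,
    // evaluated left to right with Horner's rule.
    static int polyHash(String word, int base, int tableSize) {
        long hash = 0;
        for (int i = 0; i < word.length(); i++) {
            hash = (hash * base + word.charAt(i)) % tableSize;
        }
        return (int) hash;
    }

    public static void main(String[] args) {
        String[] words = {"abdc", "abcd", "dbca", "dabc", "dacb"};
        for (String w : words) {
            System.out.println(w + " product=" + productHash(w, 81874)
                    + " poly=" + polyHash(w, 31, 81874));
        }
    }
}
```

Running it on the five anagrams shows the same product hash for every word, while the polynomial hash depends on the position of each character.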

Related

Collision strength of Java's Arrays.hashCode

How strong is the hashing mechanism that is used in the Arrays.hashCode methods against collision? What is the possibility of two different arrays (of, say, double) to have an exact hash value calculated with these methods?
Arrays.hashCode(double[]) is specified to return the equivalent value of a List containing Double values representing the same numeric value.
List.hashCode in turn is specified with a fairly simple algorithm:
int hashCode = 1;
for (E e : list)
hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
In general the multiplication with a prime number is a good practice for general-purpose hash functions, but it's far from a cryptographically strong hash function.
This means that while collisions are unlikely in the general (effectively random) case, they can usually be constructed quite easily if you can influence (or select) the hashCode of the items in the List.
As a constructed example consider these two statements:
System.out.println(Arrays.hashCode(new double[] {4.753E-321d}));
System.out.println(Arrays.hashCode(new double[] {4.9E-324d, 4.9E-324d}));
Both of these will output 993, despite being clearly different arrays.
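The same kind of collision can be constructed for int arrays directly from the recurrence hashCode = 31*hashCode + element: adding 1 to one element and subtracting 31 from the next leaves the result unchanged. A minimal sketch, with a hypothetical helper `collidingPartner` that assumes the array has at least two elements:

```java
import java.util.Arrays;

class ArraysHashCollision {
    // Returns a different array with the same Arrays.hashCode value:
    // +1 on element 0 is worth +31 one step later, so -31 on element 1 cancels it.
    // Assumes a.length >= 2.
    static int[] collidingPartner(int[] a) {
        int[] b = a.clone();
        b[0] += 1;
        b[1] -= 31;
        return b;
    }

    public static void main(String[] args) {
        int[] first = {0, 31};
        int[] second = collidingPartner(first); // {1, 0}
        System.out.println(Arrays.hashCode(first) + " " + Arrays.hashCode(second));
        System.out.println(Arrays.equals(first, second));
    }
}
```

Because int arithmetic wraps around, the cancellation also holds when the intermediate values overflow.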
This is the implementation of Arrays.hashCode that you use
public static int hashCode(int a[]) {
if (a == null)
return 0;
int result = 1;
for (int element : a)
result = 31 * result + element;
return result;
}
If your values happen to be smaller than 31, they are treated like the digits of a number in base 31, so each array results in a different number (if we ignore overflow for now). Let's call those pure hashes.
Now of course 31^11 is way larger than the number of integers in Java, so we will get tons of overflows. But since the powers of 31 and the maximum integer are "very different", you don't get an almost random distribution, but a very regular, uniform one.
Let's consider a smaller example. Assume you have only 2 elements in your array, each in the range 0 to 4. I create a "hashCode" between 0 and 37 by taking the "pure hash" modulo 38. The result is streaks of 5 integers with small gaps in between, and not a single collision.
val hashes = for {
i <- 0 to 4
j <- 0 to 4
} yield (i * 31 + j) % 38
println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25
To verify whether this is what happens to your numbers, you might create a graph as follows: for each hash, take the first 16 bits as x and the second 16 bits as y, and color that dot black. I bet you will see an extremely regular pattern.

I want to generate four different random numbers in Android, but their total should not be greater than the specified range, like 0 to 69

This is my button's onClick() method. On click of the button, all four random numbers are displayed in Logcat.
public void onClick(View v) {
/* the Code for Four Random Numbers*/
final Random random = new Random();
final Set<Integer> mySet = new HashSet<>();
while (mySet.size() < 4) {
mySet.add(random.nextInt(69) + 1);
}
// Now Adding it to the ArrayList
ArrayList<Integer> Elements = new ArrayList<>(mySet);
Log.i("Elements","A:" + Elements.get(0));
Log.i("Elements","B:" + Elements.get(1));
Log.i("Elements","C:" + Elements.get(2));
Log.i("Elements","D:" + Elements.get(3));
}
});
The output looks like this (just one example; it is different every time I run the app):
A : 11
B : 28
C : 57
D : 1
Now the problem is:
The sum of all the numbers is greater than the specified range, which is 0 to 69.
Adding the values of A, B, C and D gives 97,
which is greater than 69.
So I want random numbers such that
the sum of A, B, C and D does not exceed 69.
So my question is: how can I do that?
Please help! I am stuck on this part of the code
and can't find a solution.
In addition to what other people have suggested, you might want to read this: find all subsets that sum to a particular value
First, note that this bears some resemblance to the subset sum problem. Given the set of all numbers between 1 and 69, you're looking for a subset of 4 of them that adds up to 69. For any natural number, there are only a finite number of such sets (although it obviously eventually gets computationally infeasible to enumerate all of them). Either way, your answer is guaranteed to be one of these sets regardless of what algorithm you use.
The linked question shows code to find all of the subsets that add up to a particular value. Once you have this, just filter on all the subsets that have length 4 and randomly pick one of them.
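A minimal sketch of the enumerate-then-pick idea, under one assumption that differs from the paragraph above: following the question's wording, it keeps sets whose sum is at most the limit rather than exactly equal to it. The names `BoundedSumPicker`, `candidates` and `pick` are hypothetical, and since the subset size is fixed at 4, plain nested loops stand in for a general subset-sum search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class BoundedSumPicker {
    // All sets {a < b < c < d} of positive integers with a + b + c + d <= maxSum.
    static List<int[]> candidates(int maxSum) {
        List<int[]> result = new ArrayList<>();
        for (int a = 1; a <= maxSum; a++)
            for (int b = a + 1; a + b <= maxSum; b++)
                for (int c = b + 1; a + b + c <= maxSum; c++)
                    for (int d = c + 1; a + b + c + d <= maxSum; d++)
                        result.add(new int[] {a, b, c, d});
        return result;
    }

    // Picks one candidate uniformly at random; assumes the list is non-empty.
    static int[] pick(List<int[]> candidates, Random random) {
        return candidates.get(random.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<int[]> all = candidates(69);
        int[] chosen = pick(all, new Random());
        System.out.println(chosen[0] + " " + chosen[1] + " " + chosen[2] + " " + chosen[3]);
    }
}
```

Every valid set is equally likely with this approach, at the cost of building the full candidate list once up front.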
There are two ways, depending on how you want to do it:
Reroll the random numbers until a valid sum is achieved. This will be truly random.
Change the max value for the 2nd, 3rd and 4th random number so that it can't be too big. However, this changes the distribution from every number being equally likely to smaller numbers being more likely.
This is a tricky question. There is no mention of a sample space, so one can assume that any valid integer will do. Generate signed 32-bit integers, i.e. values between -2147483648 and 2147483647, until you have four unique ones whose sum is between 0 and 69.
Best of luck!
you can use this code :
final Random random = new Random();
final Set<Integer> mySet = new HashSet<>();
int thesum;
do {
// Start over: draw four distinct numbers and total them up
mySet.clear();
thesum = 0;
while (mySet.size() < 4) {
int nmbr = random.nextInt(69) + 1;
// add() returns false for a duplicate, which must not count toward the sum
if (mySet.add(nmbr)) {
thesum = thesum + nmbr;
}
}
// Reject the whole set and retry if the total exceeds the limit
} while (thesum > 69);
ArrayList<Integer> Elements = new ArrayList<>(mySet);
Log.i("Elements","A:" + Elements.get(0));
Log.i("Elements","B:" + Elements.get(1));
Log.i("Elements","C:" + Elements.get(2));
Log.i("Elements","D:" + Elements.get(3));

Counting all permutations of a string (Cracking the Coding Interview, Chapter VI - Example 12)

In Gayle Laakmann McDowell's book "Cracking the Coding Interview", chapter VI (Big O), example 12, the problem asks for the complexity of the following Java code, which computes a string's permutations:
public static void permutation(String str) {
permutation(str, "");
}
public static void permutation(String str, String prefix) {
if (str.length() == 0) {
System.out.println(prefix);
} else {
for (int i = 0; i < str.length(); i++) {
String rem = str.substring(0, i) + str.substring(i + 1);
permutation(rem, prefix + str.charAt(i));
}
}
}
The book assumes that since there will be n! permutations, if we consider each of the permutations to be a leaf in the call tree, where each of the leaves is attached to a path of length n, then there will be no more than n*n! nodes in the tree (i.e.: the number of calls is no more than n*n!).
But shouldn't the number of nodes be the sum of n!/k! for k from 0 to n,
since the number of calls is equivalent to the number of nodes (take a look at the figure in the video Permutations Of String | Code Tutorial by Quinston Pimenta).
If we follow this method, the number of nodes will be 1 (for the first level/root of the tree) + 3 (for the second level) + 3*2 (for the third level) + 3*2*1 (for the fourth/bottom level)
i.e.: the number of nodes = 3!/3! + 3!/2! + 3!/1! + 3!/0! = 16
However, according to the aforementioned method, the number of nodes will be 3*3! = 18
Shouldn't we count shared nodes in the tree as one node, since they express one function call?
You're right about the number of nodes. That formula gives the exact number, whereas the method in the book counts some nodes multiple times.
Your sum also approaches e * n! for large n, so it can be simplified to O(n!).
It's still technically correct to say the number of calls is no more than n * n!, as this is a valid upper bound. Depending on how it is used, this can be fine, and it may be easier to prove.
For the time complexity, we need to multiply by the average work done for each node.
First, check the String concatenation. Each iteration creates 2 new Strings to pass to the next node. The length of one String increases by 1, and the length of the other decreases by 1, but the total length is always n, giving a time complexity of O(n) for each iteration.
The number of iterations varies for each level, so we can't just multiply by n. Instead look at the total number of iterations for the whole tree, and get the average for each node. With n = 3:
The 1 node in the first level iterates 3 times: 1 * 3 = 3
The 3 nodes in the second level iterate 2 times: 3 * 2 = 6
The 6 nodes in the third level iterate 1 time: 6 * 1 = 6
The total number of iterations is: 3 + 6 + 6 = 15. This is about the same as the number of nodes in the tree (16), so the average number of iterations per node is constant.
In total, we have O(n!) iterations that each do O(n) work giving a total time complexity of O(n * n!).
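The counts used in this argument can be checked empirically. The sketch below instruments the recursive method from the question with two counters (printing is omitted) and compares the number of calls with the sum of n!/k! for k from 0 to n; the class name, the counters and `nodeFormula` are additions for illustration only.

```java
class PermutationCallCount {
    static long calls;      // invocations of permutation(str, prefix)
    static long iterations; // executions of the loop body

    static void permutation(String str, String prefix) {
        calls++;
        if (str.length() == 0) {
            return; // leaf: prefix is one complete permutation
        }
        for (int i = 0; i < str.length(); i++) {
            iterations++;
            String rem = str.substring(0, i) + str.substring(i + 1);
            permutation(rem, prefix + str.charAt(i));
        }
    }

    // Resets the counters, runs the recursion, returns {calls, iterations}.
    static long[] count(String str) {
        calls = 0;
        iterations = 0;
        permutation(str, "");
        return new long[] {calls, iterations};
    }

    // Sum of n!/k! for k = 0..n, built up from the k = n term (which is 1).
    static long nodeFormula(int n) {
        long term = 1;
        long sum = 1;
        for (int k = n; k >= 1; k--) {
            term *= k; // n!/k! becomes n!/(k-1)!
            sum += term;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abcd", "abcde"}) {
            long[] c = count(s);
            System.out.println(s + ": calls=" + c[0] + ", iterations=" + c[1]
                    + ", formula=" + nodeFormula(s.length()));
        }
    }
}
```

Each loop iteration makes exactly one recursive call, so the iteration count is always one less than the call count, which is why the average number of iterations per node stays constant.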
According to your video where we have string with 3 characters (ABC), the number of permutations is 6 = 3!, and 6 happens to be equal to 1 + 2 + 3. However, if we have a string with 4 characters (ABCD), the number of permutations should be 4 * 3! as D could be in any position from 1 to 4. With each position of D you can generate 3! permutations for the rest. If you re-draw the tree and count the number of permutations you will see the difference.
According to your code, we have n! = str.length()! permutations, but in each call of the permutations, you also run a loop from 0 to n-1. Therefore, you have O(n * n!).
Update in response to the edited question
Firstly, in programming, we often have either 0->n-1 or 1->n not 0->n.
Secondly, we don't count the number of nodes in this case as if you take a look at the recursion tree in the clip again, you will see nodes duplicated. The permutations in this case should be the number of leaves which are unique among each other.
For instance, if you have a string with 4 characters, the number of leaves should be 4 * 3! = 24 and it would be the number of permutations. However, in your code snippet, you also have a 0->n-1 = 0->3 loop in each permutation, so you need to count the loops in. Thus, your code complexity in this case is O(n *n!) = O(4 * 4!).

Array Duplicate Efficiency Riddle

Recently in AP Computer Science A, our class learned about arrays. Our teacher posed a riddle to us.
Say you have 20 numbers, each between 10 and 100 inclusive (these numbers are read from another file using a Scanner).
As each number is read, we must print the number if and only if it is not a duplicate of a number already read. Now, here's the catch. We must use the smallest array possible to solve the problem.
That's the real problem I'm having. All of my solutions require a pretty big array that has 20 slots in it.
I am required to use an array. What would be the smallest array that we could use to solve the problem efficiently?
If anyone could explain the method with pseudocode (or in words) that would be awesome.
In the worst case we have to use an array of length 19.
Why 19? Each unique number has to be remembered in order to sort out duplicates from the following numbers. Since you know that there are 20 numbers incoming, but not more, you don't have to store the last number. Either the 20th number already appeared (then don't do anything), or the 20th number is unique (then print it and exit – no need to save it).
By the way: I wouldn't call an array of length 20 big :)
If your numbers are integers: You have a range from 10 to 100. So you need 91 Bits to store which values have already been read. A Java Long has 64 Bits. So you will need an array of two Longs. Let every Bit (except for the superfluous ones) stand for a number from 10 to 100. Initialize both longs with 0. When a number is read, check if the corresponding bit mapped to the read value is set to 1. If yes, the read number is a duplicate, if no set the bit to 1.
This is the idea behind the BitSet class.
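A minimal sketch of the two-long approach described above, assuming integer inputs in the range 10 to 100. The class name `SeenBits` and the method `markIfNew` are made up for illustration.

```java
class SeenBits {
    // 2 * 64 = 128 bits; only 91 are needed for the values 10..100.
    private final long[] bits = new long[2];

    // Marks value as seen and returns true if it had not been seen before.
    boolean markIfNew(int value) {
        if (value < 10 || value > 100) {
            throw new IllegalArgumentException("expected 10..100, got " + value);
        }
        int offset = value - 10;         // 0..90
        int word = offset / 64;          // which long holds the bit
        long mask = 1L << (offset % 64); // which bit inside that long
        boolean isNew = (bits[word] & mask) == 0;
        bits[word] |= mask;
        return isNew;
    }

    public static void main(String[] args) {
        SeenBits seen = new SeenBits();
        int[] input = {10, 42, 10, 100, 42, 77};
        for (int n : input) {
            if (seen.markIfNew(n)) {
                System.out.println(n); // printed only on first occurrence
            }
        }
    }
}
```

The array size here depends on the range of values rather than on how many numbers are read, so it stays at two elements no matter how long the input is.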
Agree with Socowi. If the number of inputs is known and equal to N, it is always possible to use an array of size N-1 to store the values seen so far. Once the last element of the input is received, and it is known to be the last one, there is no need to store it in that array.
Another idea: if your numbers are small and really lie in the range [10:100], you can pack at least 2 small Integers into 1 Long and extract them again using shifts and binary AND. In this case it is possible to use an array of size N/2. But it makes searching the array more complicated and does not save much memory; only the number of items in the array is decreased.
You technically don't need an array, since the input size is fixed, you can just declare 20 variables. But let's say it wasn't fixed.
As other answer says, worst case is indeed 19 slots in the array. But, assuming we are talking about integers here, there is a better case scenario where some numbers form a contiguous interval. In that case, you only have to remember the highest and lowest number, since anything in between is also a duplicate. You can use an array of intervals.
With the range of 10 to 100, the numbers can be spaced apart and you still need an array of 19 intervals, in the worst case. But let's say, that the best case occurs, and all numbers form a contiguous interval, then you only need 1 array slot.
The problem you'd still have to solve is to create an abstraction over an array that expands itself by 1 when an element is added, so that it uses the minimal size necessary. (This is similar to ArrayList, except that ArrayList grows by about half its capacity whenever the capacity is reached.)
Since an array cannot change size at run time, you need a companion variable to count the numbers that are not duplicates, and you fill the array partially with only those numbers.
Here is some simple code that uses a companion variable currentSize and fills the array partially.
Alternatively, you can use an ArrayList, which can change size at run time.
final int LENGTH = 20;
double[] numbers = new double[LENGTH];
int currentSize = 0;
Scanner in = new Scanner(System.in);
while (in.hasNextDouble()){
if (currentSize < numbers.length){
numbers[currentSize] = in.nextDouble();
currentSize++;
}
}
Edit
Now currentSize holds the count of actual numbers that are not duplicates, and not all 20 elements are filled if there were some duplicates. Of course, you still need some code to determine whether a number is a duplicate or not.
My last answer misunderstood what you needed, but I came up with this, which does it with an int array of 5 elements using bit shifting. Since we know the max number is 100, we can store (quite messily) four numbers in each index.
Random rand = new Random();
int[] numbers = new int[5];
int curNum;
for (int i = 0; i < 20; i++) {
curNum = rand.nextInt(100);
System.out.println(curNum);
boolean print = true;
for (int x = 0; x < i; x++) {
byte numberToCheck = ((byte) (numbers[(x - (x % 4)) / 4] >>> ((x%4) * 8)));
if (numberToCheck == curNum) {
print = false;
}
}
if (print) {
System.out.println("No Match: " + curNum);
}
int index = ((i - (i % 4)) / 4);
numbers[index] = numbers[index] | (curNum << (((i % 4)) * 8));
}
I use rand to get my ints but you could easily change this to a scanner.

Need help in understanding Rolling Hash computation in constant time for Rabin-Karp Implementation

I've been trying to implement the Rabin-Karp algorithm in Java. I have a hard time computing the rolling hash value in constant time. I've found one implementation at http://algs4.cs.princeton.edu/53substring/RabinKarp.java.html. Still, I could not work out how these two lines work.
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
txtHash = (txtHash*R + txt.charAt(i)) % Q;
I looked at a couple of articles on modular arithmetic, but none of them managed to penetrate my thick skull. Please give me some pointers to understand this.
First you need to understand how the hash is computed.
Let's take the simple case of strings of base 10 digits. How would you guarantee that the hash code of a string is unique? Base 10 is what we use to represent numbers, and we don't have collisions!
"523" = 5*10^2 + 2*10^1 + 3*10^0 = 523
Using the above hash function, you are guaranteed to get distinct hashes for distinct strings of the same length.
Given the hash of "523", if you want to calculate the hash of "238", i.e. by shifting out the leftmost digit 5 and bringing in a new digit 8 from the right, you would have to do the following:
1) remove the effect of the 5 from the hash:
hash = hash - 5*10^2 (523-500 = 23)
2) adjust the hash of the remaining chars by shifting by 1, hash = hash * 10
3) add the hash of the new character:
hash = hash + 8 (230 + 8 = 238, which as we expected is the base 10 hash of "238")
Now let's extend this to all ascii characters. This takes us to the base 256 world. Therefore the hash of the same string "523" now is
= 5*256^2 + 2*256^1 + 3*256^0 = 327680 + 512 + 3 = 328195.
You can imagine that as the string length increases, you will exceed the range of integer/long in most programming languages relatively quickly.
How can we solve this? The routine solution is to do all arithmetic modulo a large prime number. The drawback of this method is that we will now get false positives as well, which is a small price to pay if it takes the runtime of your algorithm from quadratic to linear!
The complicated equation you quoted is nothing but the steps 1-3 above done with modulus math.
The two modulus properties used above are ->
a) (a*b) % p = ((a % p) * (b % p)) % p
b) a % p = (a + p) % p
Lets go back to steps 1-3 mentioned above ->
1) (expanded using property a) hash = hash - ((5 % p)*(10^2 % p) % p)
vs. what you quoted
txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
Here is how the two are related:
RM = 10^2 % p (in general R^(M-1) % Q, the place value of the leftmost character)
RM*txt.charAt(i-M) % Q = (10^2 * 5) % p
The additional + Q you see is just to ensure that the hash is not negative. See property b above.
2 & 3) hash = hash*10 + 8, vs txtHash = (txtHash*R + txt.charAt(i)) % Q;
This is the same, but taking the mod of the final hash result!
Looking at properties a & b more closely, should help you figure it out!
This is the "rolling" aspect of the hash. It's eliminating the contribution of the oldest character (txt.charAt(i-M)), and incorporating the contribution of the newest character(txt.charAt(i)).
The hash function is defined as:
hash[i] = ( SUM over j = 0 .. M-1 of input[i-j] * R^j ) % Q
(where I'm using ^ to denote "to the power of".)
But this can be written as an efficient recursive implementation as:
hash[i] = (hash[i-1]*R - input[i-M]*(R^M) + input[i]) % Q
Your reference code is doing this, but it's using various techniques to ensure that the result is always computed correctly (and efficiently).
So, for instance, the + Q in the first expression has no mathematical effect, but it ensures that the result of the sum is always positive (if it goes negative, % Q doesn't have the desired effect). It's also breaking the calculation into stages, presumably to prevent numerical overflow.
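To see the two update lines in context, here is a small self-contained sketch that slides a window over a text and obtains each window's hash from the previous one. It follows the structure of the quoted lines but is not the reference implementation: the class name, the method names and the modulus 1,000,000,007 are choices made for this example.

```java
class RollingHashDemo {
    static final int R = 256;             // radix (alphabet size)
    static final long Q = 1_000_000_007L; // prime modulus, chosen for illustration

    // Hash of key[start .. start+m-1] computed from scratch (Horner's rule).
    static long hash(String key, int start, int m) {
        long h = 0;
        for (int j = 0; j < m; j++) {
            h = (R * h + key.charAt(start + j)) % Q;
        }
        return h;
    }

    // Hashes of all length-m windows of txt; each update costs O(1).
    // Assumes 1 <= m <= txt.length().
    static long[] rollingHashes(String txt, int m) {
        long rm = 1; // R^(m-1) % Q, the place value of the leading character
        for (int i = 1; i < m; i++) {
            rm = (rm * R) % Q;
        }
        long[] out = new long[txt.length() - m + 1];
        long h = hash(txt, 0, m);
        out[0] = h;
        for (int i = m; i < txt.length(); i++) {
            // Remove the leading character; + Q keeps the value non-negative,
            // since Java's % can return a negative result.
            h = (h + Q - rm * txt.charAt(i - m) % Q) % Q;
            // Shift left by one position and append the new character.
            h = (h * R + txt.charAt(i)) % Q;
            out[i - m + 1] = h;
        }
        return out;
    }

    public static void main(String[] args) {
        String txt = "abracadabra";
        long[] hashes = rollingHashes(txt, 3);
        for (int k = 0; k < hashes.length; k++) {
            System.out.println(txt.substring(k, k + 3) + " rolled=" + hashes[k]
                    + " direct=" + hash(txt, k, 3));
        }
    }
}
```

Comparing the rolled value with the directly computed one for every window is a quick way to convince yourself that the two lines really do implement steps 1 to 3.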
