How strong is the hashing mechanism used in the Arrays.hashCode methods against collisions? What is the probability that two different arrays (of, say, double) end up with exactly the same hash value calculated with these methods?
Arrays.hashCode(double[]) is specified to return the same value as the hashCode of a List containing Double values representing the same numeric values.
List.hashCode in turn is specified with a fairly simple algorithm:
int hashCode = 1;
for (E e : list)
hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
In general, multiplying by a prime number is good practice for general-purpose hash functions, but it's far from a cryptographically strong hash function.
This means that while collisions are unlikely in the general (effectively random) case, they can usually be constructed quite easily if you can influence (or select) the hashCode of the items in the List.
As a constructed example consider these two statements:
System.out.println(Arrays.hashCode(new double[] {4.753E-321d}));
System.out.println(Arrays.hashCode(new double[] {4.9E-324d, 4.9E-324d}));
Both of these will output 993, despite being clearly different arrays.
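A quick check of the arithmetic shows why, using the documented Double.hashCode formula (int)(bits ^ (bits >>> 32)) with bits = doubleToLongBits(value): 4.9E-324d is Double.MIN_VALUE, whose bit pattern is 1, so each element of the second array hashes to 1 and the array hashes to 31*(31*1 + 1) + 1 = 993. The bit pattern of 4.753E-321d is 962, so the first array hashes to 31*1 + 962 = 993 as well. Constructing such collisions only requires picking element hash codes that satisfy the same linear relation.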
This is the implementation of Arrays.hashCode that you use (the int[] overload; the double[] variant applies the same loop to each element's Double hash code):
public static int hashCode(int a[]) {
if (a == null)
return 0;
int result = 1;
for (int element : a)
result = 31 * result + element;
return result;
}
If your values happen to be smaller than 31, they are treated like digits of a base-31 number, so each array results in a different number (if we ignore overflow for now). Let's call those pure hashes.
Now of course 31^11 is way larger than the number of ints in Java, so we will get tons of overflows. But since the powers of 31 and the maximum integer are "very different", you don't get an almost random distribution, but a very regular, uniform one.
Let's consider a smaller example. Assume you have only 2 elements in your array, each in the range 0 to 4. I try to create a "hashCode" between 0 and 37 by taking the "pure hash" modulo 38. The result is that I get streaks of 5 integers with small gaps in between, and not a single collision.
val hashes = for {
i <- 0 to 4
j <- 0 to 4
} yield (i * 31 + j) % 38
println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25
To verify whether this is what happens to your numbers, you could create a graph as follows: for each hash, take the first 16 bits for x and the second 16 bits for y, and color that dot black. I bet you will see an extremely regular pattern.
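A minimal sketch of that visualization (the method name and the plain-text output format are my own choices; feed the printed pairs to gnuplot or any plotting tool):

import java.util.Arrays;
import java.util.List;

// Splits each hash into a 16-bit x and a 16-bit y coordinate and prints them,
// so the distribution can be plotted externally.
static void dumpHashPattern(List<double[]> arrays) {
    for (double[] a : arrays) {
        int h = Arrays.hashCode(a);
        int x = h >>> 16;      // upper 16 bits
        int y = h & 0xFFFF;    // lower 16 bits
        System.out.println(x + " " + y);
    }
}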
Assume I have a Java BitSet. I now need to generate combinations of the BitSet such that only bits which are set can be flipped, i.e. I only need combinations of the bits that are set.
For example: BitSet - 1010, Combinations - 1010, 1000, 0010, 0000
BitSet - 1100, Combination - 1100, 1000, 0100, 0000
I can think of a few solutions, e.g. I can take combinations of all 4 bits and then XOR the combinations with the original BitSet. But this would be very resource-intensive for large sparse BitSets, so I was looking for a more elegant solution.
It appears that you want to get the power set of the bit set. There is already an answer here about how to get the power set of a Set<T>. Here, I will show a modified version of the algorithm shown in that post, using BitSets:
private static Set<BitSet> powerset(BitSet set) {
Set<BitSet> sets = new HashSet<>();
if (set.isEmpty()) {
sets.add(new BitSet(0));
return sets;
}
Integer head = set.nextSetBit(0);
BitSet rest = set.get(0, set.size());
rest.clear(head);
for (BitSet s : powerset(rest)) {
BitSet newSet = s.get(0, s.size());
newSet.set(head);
sets.add(newSet);
sets.add(s);
}
return sets;
}
You can perform the operation in a single linear pass instead of recursion, if you realize the integer numbers are a computer’s intrinsic variant of “on off” patterns and iterating over the appropriate integer range will ultimately produce all possible permutations. The only challenge in your case, is to transfer the densely packed bits of an integer number to the target bits of a BitSet.
Here is such a solution:
static List<BitSet> powerset(BitSet set) {
int nBits = set.cardinality();
if(nBits > 30) throw new OutOfMemoryError(
"Not enough memory for "+BigInteger.ONE.shiftLeft(nBits)+" BitSets");
int max = 1 << nBits;
int[] targetBits = set.stream().toArray();
List<BitSet> sets = new ArrayList<>(max);
for(int onOff = 0; onOff < max; onOff++) {
BitSet next = new BitSet(set.size());
for(int bitsToSet = onOff, ix = 0; bitsToSet != 0; ix++, bitsToSet>>>=1) {
if((bitsToSet & 1) == 0) {
int skip = Integer.numberOfTrailingZeros(bitsToSet);
ix += skip;
bitsToSet >>>= skip;
}
next.set(targetBits[ix]);
}
sets.add(next);
}
return sets;
}
It uses an int value for the iteration, which is already enough to represent all permutations that can ever be stored in one of Java's builtin collections. If your source BitSet has 31 one bits, the 2³¹ possible combinations would not only require more than a hundred GB of heap, but also a collection supporting 2³¹ elements, i.e. a size not representable as an int.
So the code above terminates early, without even trying, if the number of combinations exceeds those capabilities. You could rewrite it to use a long or even BigInteger instead, to keep it busy in such cases until it eventually fails with an OutOfMemoryError anyway.
For the working cases, the int solution is the most efficient variant.
Note that the code returns a List rather than a HashSet to avoid the costs of hashing. The values are already known to be unique and hashing would only pay off if you want to perform lookups, i.e. call contains with another BitSet. But to test whether an existing BitSet is a permutation of your input BitSet, you wouldn’t even need to generate all permutations, a simple bit operation, e.g. andNot would tell you that already. So for storing and iterating the permutations, an ArrayList is more efficient.
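As a small usage illustration (assuming one of the powerset methods above is defined in the same class; the bit indices are arbitrary example values):

import java.util.BitSet;

public class PowerSetDemo {
    public static void main(String[] args) {
        // The question's example 1010: bits 1 and 3 set.
        BitSet input = new BitSet();
        input.set(1);
        input.set(3);

        for (BitSet combo : powerset(input)) {
            System.out.println(combo); // prints {}, {1}, {3}, {1, 3} in some order
        }
    }
}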
I am writing a program in which I am required to check if certain large numbers (permutations of cubes) are cubic (equal to n^3 for some n).
At the moment I simply use the method
static boolean isCube(long input) {
double cubeRoot = Math.pow(input,1.0/3.0);
return Math.round(cubeRoot) == cubeRoot;
}
but this is very slow when working with large numbers (10+ digits). Is there a faster way to determine if integer numbers are cubes?
There are only 2^21 cubes that don't overflow a long (2^22 - 1 if you allow negative numbers), so you could just use a HashSet lookup.
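A sketch of that lookup-table idea (class and constant names are mine; the set holds roughly 4 million boxed Longs, so it trades a couple hundred megabytes of memory for O(1) lookups):

import java.util.HashSet;
import java.util.Set;

class CubeTable {
    // Every cube that fits in a long: n^3 for n in [-2097152, 2097151].
    private static final Set<Long> CUBES = new HashSet<>(1 << 23);
    static {
        for (long n = -2_097_152L; n <= 2_097_151L; n++) {
            CUBES.add(n * n * n);
        }
    }

    static boolean isCube(long input) {
        return CUBES.contains(input);
    }
}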
The Hacker's Delight book has a short and fast function for integer cube roots which could be worth porting to 64bit longs, see below.
It appears that testing if a number is a perfect cube can be done faster than actually computing the cube root. Burningmath has a technique that uses the "digital root" (sum the digits. repeat until it's a single digit). If the digital root is 0, 1 or 8, your number might be a perfect cube.
This method could be extremely valuable for your case of permuting (the digits of?) numbers. If you can rule out a number by its digital root, all permutations are also ruled out.
They also describe a technique based on the prime factors for checking perfect cubes. This looks most appropriate for mental arithmetic, as I think factoring is slower than cube-rooting on a computer.
Anyway, the digital root is quick to compute, and you even have your numbers as a string of digits to start with. You'll still need a divide-by-10 loop, but your starting point is the sum of digits of the input, not the whole number, so it won't be many divisions. (Integer division is about an order of magnitude slower than multiplication on current CPUs, but division by a compile-time constant can be optimized to multiply+shift with a fixed-point inverse. Hopefully Java JIT compilers use that too, and maybe even for runtime constants.)
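A sketch of that digit-sum prefilter (the helper name is mine; it uses the fact that the digital root of n equals n mod 9, with 9 showing up as 0 here), working directly on the decimal string so the big number never needs to be parsed:

// Sums the decimal digits and reduces mod 9; cubes can only have residue 0, 1, or 8.
// Since the digit sum is the same for every permutation of the digits,
// one call covers all permutations of those digits at once.
static boolean digitalRootAllowsCube(String decimalDigits) {
    int sum = 0;
    for (int i = 0; i < decimalDigits.length(); i++) {
        sum += decimalDigits.charAt(i) - '0';
    }
    int root = sum % 9;
    return root == 0 || root == 1 || root == 8;
}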
This plus A. Webb's test (input % 819 -> search of a table of 45 entries) will rule out a lot of inputs as not possible perfect cubes.
IDK if binary search, linear search, or hash/set would be best.
These tests could be a front-end to David Eisenstat's idea of just storing the set of longs that are perfect cubes in a data structure that allows quick is-present checks. (e.g. HashSet). Yes, cache misses are expensive enough that at least the digital-root test is probably worth it before doing a HashSet lookup, maybe both.
You could use less memory on this idea by using it for a Bloom filter instead of an exact set (David Ehrman's suggestion). This would give another candidate-rejection frontend to the full calculation. The Guava BloomFilter implementation requires a "funnel" function to translate objects to bytes, which in this case should be f(x)=x.
I suspect that Bloom filtering isn't going to be a big win over an exact HashSet check, since it requires multiple memory accesses. It's appropriate when you really can't afford the space for a full table, and what you're filtering out is something really expensive like a disk access.
The integer cube root function (below) is probably faster than a single cache miss. If the cbrt check is causing cache misses, then probably the rest of your code will suffer more cache misses too, when its data is evicted.
Math.SE had a question about this for perfect squares, but that was about squares, not cubes, so none of this came up. The answers there did discuss and avoid the problems in your method, though. >.<
There are several problems with your method:
The problem with using pow(x, 1./3) is that 1/3 does not have an exact representation in floating point, so you're not "really" getting the cube root. So use cbrt. It's highly unlikely to be slower, unless it has higher accuracy that comes with a time cost.
You're assuming Math.pow or Math.cbrt always return a value that's exactly an integer, and not 41.999999 or something. Java docs say:
The computed result must be within 1 ulp of the exact result.
This means your code might not work on a conforming Java implementation. Comparing floating point numbers for exactly equal is tricky business. What Every Computer Scientist Should Know About Floating-Point Arithmetic has much to say about floating point, but it's really long. (With good reason. It's easy to shoot yourself in the foot with floating point.) See also Comparing Floating Point Numbers, 2012 Edition, Bruce Dawson's series of FP articles.
I think it won't work for all long values. double can only precisely represent integers up to 2^53 (size of the mantissa in a 64bit IEEE double). Math.cbrt of integers that can't be represented exactly is even less likely to be an exact integer.
FP cube root, and then testing the resulting integer, avoids all the problems that the FP comparison introduced:
static boolean isCube(long input) {
double cubeRoot = Math.cbrt(input);
long intRoot = Math.round(cubeRoot);
return (intRoot*intRoot*intRoot) == input;
}
(After searching around, I see other people on other stackoverflow / stackexchange answers suggesting that integer-comparison method, too.)
If you need high performance, and you don't mind having a more complex function with more source code, then there are possibilities. For example, use a cube-root successive-approximation algorithm with integer math. If you eventually get to a point where n^3 < input < (n+1)^3, then input isn't a cube.
There's some discussion of methods on this math.SE question.
I'm not going to take the time to dig into integer cube-root algorithms in detail, as the cbrt part is probably not the main bottleneck. Probably input parsing and string->long conversion is a major part of your bottleneck.
Actually, I got curious. Turns out there is already an integer cube-root implementation available in Hacker's Delight (use / copying / distributing even without attribution is allowed. AFAICT, it's essentially public domain code.):
// Hacker's delight integer cube-root (for 32-bit integers, I think)
int icbrt1(unsigned x) {
int s;
unsigned y, b;
y = 0;
for (s = 30; s >= 0; s = s - 3) {
y = 2*y;
b = (3*y*(y + 1) + 1) << s;
if (x >= b) {
x = x - b;
y = y + 1;
}
}
return y;
}
That 30 looks like a magic number based on the number of bits in an int. Porting this to long would require testing. (Also note that this is C code with unsigned types; a Java port needs a little care, since Java has no unsigned int.)
IDK if this is common knowledge among Java people, but the 32bit Windows JVM doesn't use the server JIT engine, and doesn't optimize your code as well.
You can first eliminate a large number of candidates by testing modulo given numbers. For example, a cube modulo the number 819 can only take on the following 45 values.
0 125 181 818 720 811 532 755 476
1 216 90 307 377 694 350 567 442
8 343 559 629 658 351 190 91 469
27 512 287 252 638 118 603 161 441
64 729 99 701 792 378 260 468 728
So, you could eliminate actually having to compute the cubic root in almost 95% of uniformly distributed cases.
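A sketch of that precheck in Java (the class and method names are mine; 819 = 7*9*13, and the 45 residues generated below match the table above):

class CubicResidueFilter {
    // true at index r if r is a cubic residue mod 819; exactly 45 entries end up true.
    private static final boolean[] IS_CUBIC_RESIDUE = new boolean[819];
    static {
        for (int k = 0; k < 819; k++) {
            IS_CUBIC_RESIDUE[(k * k * k) % 819] = true;
        }
    }

    // false means "definitely not a cube"; true means "might be, do the full check".
    static boolean mightBeCube(long input) {
        return IS_CUBIC_RESIDUE[(int) Math.floorMod(input, 819L)];
    }
}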
The Hacker's Delight routine seems to work on long numbers if you just change int to long and 30 to 60 (the starting shift has to stay a multiple of 3 so the loop's last iteration uses s = 0). If you change 30 to 61, it does not seem to work.
I didn't really understand the program, so I made another version that seems to work in Java.
private static int cubeRoot(long n) {
final int MAX_POWER = 21;
int power = MAX_POWER;
long factor;
long root = 0;
long next, square, cube;
while (power >= 0) {
factor = 1 << power;
next = root + factor;
while (true) {
if (next > n) {
break;
}
if (n / next < next) {
break;
}
square = next * next;
if (n / square < next) {
break;
}
cube = square * next;
if (cube > n) {
break;
}
root = next;
next += factor;
}
--power;
}
return (int) root;
}
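For reference, here is what that long transcription of the Hacker's Delight routine looks like in Java (a sketch; the method names are mine, and it handles the negative range by symmetry):

// Integer cube root for non-negative long values: the Hacker's Delight loop,
// widened to long, with the shift starting at 60 (a multiple of 3, as noted above).
static long icbrtLong(long x) {
    long y = 0;
    for (int s = 60; s >= 0; s -= 3) {
        y = 2 * y;
        long b = (3 * y * (y + 1) + 1) << s;
        if (x >= b) {
            x = x - b;
            y = y + 1;
        }
    }
    return y;
}

// Cube test built on top of it; Long.MIN_VALUE is (-2097152)^3, so it is a cube.
static boolean isPerfectCube(long input) {
    if (input < 0) {
        return input == Long.MIN_VALUE || isPerfectCube(-input);
    }
    long r = icbrtLong(input);
    return r * r * r == input;
}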
Please define very slow. Here is a test program:
public static void main(String[] args) {
for (long v = 1; v > 0; v = v * 10) {
long start = System.nanoTime();
for (int i = 0; i < 100; i++)
isCube(v);
long end = System.nanoTime();
System.out.println(v + ": " + (end - start) + "ns");
}
}
static boolean isCube(long input) {
double cubeRoot = Math.pow(input,1.0/3.0);
return Math.round(cubeRoot) == cubeRoot;
}
Output is:
1: 290528ns
10: 46188ns
100: 45332ns
1000: 46188ns
10000: 46188ns
100000: 46473ns
1000000: 46188ns
10000000: 45048ns
100000000: 45048ns
1000000000: 44763ns
10000000000: 45048ns
100000000000: 44477ns
1000000000000: 45047ns
10000000000000: 46473ns
100000000000000: 47044ns
1000000000000000: 46188ns
10000000000000000: 65291ns
100000000000000000: 45047ns
1000000000000000000: 44477ns
I don't see a performance impact of "large" numbers.
So I have a question about an algorithm I'm supposed to "invent"/"find". It's an algorithm which calculates 2^n - 1 in Ө(n^n), Ө(1), and Ө(n) time.
I was thinking about it for several hours, but I couldn't find a solution for the first two (the Ө(n) one was the easiest imo; I posted that algorithm below). I'm not skilled enough to "invent"/"find" one for a very slow and a very fast algorithm.
So far my algorithms are (In Pseudocode):
The one for Ө(n)
int f(int n) {
    int number = 2
    if(n == 0) then return 0
    if(n == 1) then return 1
    while(n > 1)
        number = number * 2
        n--
    number = number - 1
    return number
}
A simple and kinda obvious one which uses recursion, though I don't know how fast it is (it would be nice if someone could tell me that):
int f(int n) {
if(n==0) then return 0
if(n==1) then return 1
return 3*f(n-1) - 2*f(n-2)
}
Assuming n is not bounded by any constant (and the output is not a simple int but a data type that can hold arbitrarily large integers), there is no algorithm that yields 2^n - 1 in Ө(1): the output itself is Ө(n) bits long (2^n - 1 is n one-bits in binary), so if such an algorithm existed and made fewer than C operations, then for n = C + 1 you would need at least C + 1 operations just to write the output, which contradicts the assumption that C is an upper bound. So there is no such algorithm.
For Ө(n^n): if you have a more efficient algorithm (Ө(n), for example), you can add a pointless loop that runs an extra n^n iterations and does nothing important; that makes your algorithm Ө(n^n).
There is also an algorithm based on exponentiation by squaring: Ө(log n) multiplications, where the largest operands have Ө(n) digits, so roughly Ө(log(n) * M(n)) overall, where M(x) is the complexity of your multiplication for numbers of x digits. Then simply subtract 1 from the result.
As commented by @kajacx, you can improve the squaring approach further by using FFT-based multiplication.
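For illustration, a sketch of that exponentiation-by-squaring approach with BigInteger (in practice BigInteger.ONE.shiftLeft(n).subtract(BigInteger.ONE), as in the next answer, is simpler and at least as fast; this only demonstrates the technique):

import java.math.BigInteger;

// Computes 2^n - 1 via exponentiation by squaring: O(log n) multiplications.
static BigInteger twoToNMinusOne(int n) {
    BigInteger result = BigInteger.ONE;
    BigInteger base = BigInteger.valueOf(2);
    int e = n;
    while (e > 0) {
        if ((e & 1) == 1) {
            result = result.multiply(base);
        }
        base = base.multiply(base);
        e >>= 1;
    }
    return result.subtract(BigInteger.ONE);
}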
Something like:
HugeInt h = 1;
h = h << n;
h = h - 1;
Obviously HugeInt is pseudo-code for an integer type that can be of arbitrary size allowing for any n.
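In Java, for instance, BigInteger can play the role of HugeInt directly:

import java.math.BigInteger;

// 2^n - 1 as a shift followed by a subtraction, for any non-negative n.
static BigInteger twoToNMinusOneByShift(int n) {
    return BigInteger.ONE.shiftLeft(n).subtract(BigInteger.ONE);
}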
=====
Look at amit's answer instead!
The Ө(n^n) one is too tricky for me, but a real Ө(1) algorithm on any "binary" architecture would be:
return n bits filled with 1
(assuming your architecture can allocate and fill n bits in constant time)
;)
I am not sure the best way to go about hashing a "dictionary" into a table.
The dictionary has 61406 words; I determine the table size as sizeOfDictionary / 0.75.
That gives me 81874 buckets in the table.
I run it through my hash function (a generic random algorithm) and 31690 buckets get used, while some fifty thousand stay empty. The largest bucket contains only 10 words.
My question: do these numbers suffice for a hashing project? I'm unfamiliar with what I'm trying to achieve; to me, some fifty thousand empty buckets seems like a lot.
Here is my hashing function.
private void hashingAlgorithm(String word)
{
int key = 1;
//Multiplying ASCII values of string
//To determine the index
for(int i = 0 ; i < word.length(); i++){
key *= (int)word.charAt(i);
//Accounting for integer overflow
if(key<0)
key*=-1;
}
key %= sizeOfTable;
//Inserting into the table
table[key].addToBucket(word);
}
Performance analysis:
Your hashing function doesn't take the order of the characters into account. Multiplication is commutative, so (overflow aside) "ab" and "ba" hash to the same value; your code only distinguishes different orders through the sign-flip applied on overflow. That leaves room for a lot of avoidable collisions, which can be removed if you treat each string as a number written in base N.
Suggested Improvement:
2 * 3 == 3 * 2
but
2 * 223 + 3 != 3 * 223 + 2
So if we represent the strings as N-based numbers, the number of collisions will decrease dramatically.
If dictionary contains words like :
abdc
abcd
dbca
dabc
dacb
all will get hashed to the same value in the hash table, i.e. int(a)*int(b)*int(c)*int(d), which is not a good idea.
So, use a rolling (polynomial) hash instead.
example :
hash = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]
where base be a prime number like say 31.
NOTE : [i] means char.at(i) .
You can also apply a modulo p (where p is, obviously, a prime number) to avoid overflow and to limit the size of your hash table:
hash = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1] mod p
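A sketch of that hash in Java (base 31 as suggested; tableSize would be the question's 81874-bucket table, and reducing modulo it on every step keeps the intermediate value from overflowing an int for table sizes this small):

// Polynomial ("rolling") hash: [0]*31^(n-1) + [1]*31^(n-2) + ... + [n-1], mod tableSize.
static int polynomialHash(String word, int tableSize) {
    int h = 0;
    for (int i = 0; i < word.length(); i++) {
        h = (31 * h + word.charAt(i)) % tableSize;
    }
    return h;
}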
Here's my implementation of Fermat's little theorem. Does anyone know why it's not working?
Here are the rules I'm following:
Let n be the number to test for primality.
Pick any integer a between 2 and n-1.
compute a^n mod n.
check whether a^n = a mod n.
myCode:
int low = 2;
int high = n -1;
Random rand = new Random();
//Pick any integer a between 2 and n-1.
Double a = (double) (rand.nextInt(high-low) + low);
//compute:a^n = a mod n
Double val = Math.pow(a,n) % n;
//check whether a^n = a mod n
if(a.equals(val)){
return "True";
}else{
return "False";
}
This is a list of primes less than 100000. Whenever I input any of these numbers, instead of getting 'true', I get 'false'.
The First 100,008 Primes
This is the reason why I believe the code isn't working.
In Java, a double only has a limited precision of about 15 to 17 significant decimal digits. This means that while you can compute the value of Math.pow(a, n), for very large numbers you have no guarantee of getting an exact result once the value has more than about 15 digits.
With large values of a or n, your computation will exceed that limit. For example, Math.pow(3, 67) evaluates to 9.270946314789783e31, which means that every digit beyond those shown is lost. For this reason, after applying the modulo operation, you have no guarantee of getting the right result.
This means that your code does not actually test what you think it does. This is inherent to the way floating-point numbers work, and you must change the way you hold your values to solve this problem. You could use long, but then you would have problems with overflow (a long cannot hold a value greater than 2^63 - 1, so again, in the case of 3^67, you'd have a problem).
One solution is to use a class designed to hold arbitrarily large numbers, such as BigInteger, which is part of the Java SE API.
As the others have noted, taking the power will quickly overflow. For example, if you are picking a number n to test for primality that is as small as, say, 30, and the random number a is 20, then 20^30 is about 10^39, which is roughly 2^130 and far beyond what a long (or the exact range of a double) can hold.
You want to use BigInteger, which even has the exact method you want:
public BigInteger modPow(BigInteger exponent, BigInteger m)
"Returns a BigInteger whose value is (this^exponent mod m)"
Also, testing a single random number a between 2 and n-1 won't "prove" anything; the Fermat test is probabilistic, so you would normally repeat it with several different bases (and even then, as the answer below shows, some composite numbers still pass).
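A sketch putting those pieces together (the method name and the number of rounds are my own choices; as the next answer points out, Fermat pseudoprimes can still slip through):

import java.math.BigInteger;
import java.util.Random;

// Fermat test with BigInteger.modPow: checks a^n == a (mod n) for several random bases.
static boolean fermatProbablePrime(long n, int rounds) {
    if (n < 4) {
        return n == 2 || n == 3;
    }
    BigInteger bn = BigInteger.valueOf(n);
    Random rand = new Random();
    for (int i = 0; i < rounds; i++) {
        long a = 2 + (long) (rand.nextDouble() * (n - 3)); // base in [2, n - 2]
        BigInteger ba = BigInteger.valueOf(a);
        if (!ba.modPow(bn, bn).equals(ba)) {
            return false; // definitely composite
        }
    }
    return true; // probably prime (pseudoprimes for the chosen bases still pass)
}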
@evthim Even if you use the modPow function of the BigInteger class, you cannot correctly classify every number in the range you selected: you will get all the primes, but some of the numbers that pass are not prime. If you rewrite this code using the BigInteger class and try all 64-bit numbers, some non-prime numbers will also be reported as prime. These numbers are as follows:
341, 561, 645, 1105, 1387, 1729, 1905, 2047, 2465, 2701, 2821, 3277, 4033, 4369, 4371, 4681, 5461, 6601, 7957, 8321, 8481, 8911, 10261, 10585, 11305, 12801, 13741, 13747, 13981, 14491, 15709, 15841, 16705, 18705, 18721, 19951, 23001, 23377, 25761, 29341, ...
https://oeis.org/a001567
161038, 215326, 2568226, 3020626, 7866046, 9115426, 49699666, 143742226, 161292286, 196116194, 209665666, 213388066, 293974066, 336408382, ..., 2001038066, 2138882626, 2952654706, 3220041826, ...
https://oeis.org/a006935
As a solution, make sure that the number you are testing is not in these lists; you can get the full list of such numbers below 2^64 from the link below.
http://www.cecm.sfu.ca/Pseudoprimes/index-2-to-64.html
A solution in C# looks as follows.
public static bool IsPrime(ulong number)
{
    return number == 2
        ? true
        : (BigInteger.ModPow(2, number, number) == 2
            ? ((number & 1) != 0 && BinarySearchInA001567(number) == false)
            : false);
}
public static bool BinarySearchInA001567(ulong number)
{
    // Is number in the pseudoprime list?
    // todo: binary search in A001567 (https://oeis.org/A001567) below 2^64
    // Only 2.35 gigabytes as a text file: http://www.cecm.sfu.ca/Pseudoprimes/index-2-to-64.html
    throw new NotImplementedException();
}