I am storing a lot of objects with geographic positions as 2D points (x, y) at a granularity of meters. To represent the world I am using a grid divided into cells of 1 square km. Currently I am using HashMap<Position, Object> for this. Any other map or appropriate data structure is fine, but the solution works, so I am only interested in getting the details right.
I have been reading a lot about writing good hash functions, specifically for 2D points. So far, none of the solutions have been really good (rated in terms of being as collision-free as possible).
To test some ideas I wrote a very simple Java program that generates hash codes for all points from (-1000, -1000) to (1000, 1000) and stores them in a HashSet<Integer>. This is my result:
# java HashTest
4000000 number of unique positions
test1: 3936031 (63969 buckets, 1,60%) collisions using Objects.hash(x,y)
test2: 0 (4000000 buckets, 100,00%) collisions using (x << 16) + y
test3: 3998000 (2000 buckets, 0,05%) collisions using x
test4: 3924037 (75963 buckets, 1,90%) collisions using x*37 + y
test5: 3996001 (3999 buckets, 0,10%) collisions using x*37 + y*37
test6: 3924224 (75776 buckets, 1,89%) collisions using x*37 ^ y
test7: 3899671 (100329 buckets, 2,51%) collisions using x*37 ^ y*37
test8: 0 (4000000 buckets, 100,00%) collisions using PerfectlyHashThem
test9: 0 (4000000 buckets, 100,00%) collisions using x << 16 | (y & 0xFFFF)
Legend: number of collisions, buckets(collisions) = number of unique hash codes, perc(collisions) = unique hash codes as a percentage of all positions
Most of these hash functions perform really badly. In fact, the only good solution is the one that shifts x into the upper 16 bits of the integer. The limitation, I guess, is that the two most distant points must not be further apart than the square root of Integer.MAX_VALUE, i.e. the side of the area must be less than 46,340 km.
This is my test function (just copied for each new hash function):
public void test1() {
    HashSet<Integer> hashCodes = new HashSet<Integer>();
    int collisions = 0;
    for (int x = -MAX_VALUE; x < MAX_VALUE; ++x) {
        for (int y = -MAX_VALUE; y < MAX_VALUE; ++y) {
            final int hashCode = Objects.hash(x, y);
            if (hashCodes.contains(hashCode))
                collisions++;
            hashCodes.add(hashCode);
        }
    }
    System.console().format("test1: %1$s (%2$s buckets, %3$.2f%%) collisions using Objects.hash(x,y)\n", collisions, buckets(collisions), perc(collisions));
}
Am I thinking wrong here? Should I fine-tune the primes to get better results?
Edits:
Added more hash functions (test8 and test9). test8 comes from the response by @nawfal in Mapping two integers to one, in a unique and deterministic way (converted from short to int).
public void test1() {
    int MAX_VALUE = 1000;
    HashSet<Integer> hashCodes = new HashSet<Integer>();
    int collisions = 0;
    for (int x = -MAX_VALUE; x < MAX_VALUE; ++x) {
        for (int y = -MAX_VALUE; y < MAX_VALUE; ++y) {
            final int hashCode = ((x + MAX_VALUE) << 16) | ((y + MAX_VALUE) & 0xFFFF);
            if (hashCodes.contains(hashCode))
                collisions++;
            hashCodes.add(hashCode);
        }
    }
    System.out.println("Collisions: " + collisions + " // Buckets: " + hashCodes.size());
}
Prints: Collisions: 0 // Buckets: 4000000
I found a similar question with the answer being to use a Cantor pairing function. Here:
Mapping two integers to one, in a unique and deterministic way.
The Cantor pairing function can be used for negative integers as well, by first mapping the integers to the naturals with a bijection, as sketched below.
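For reference, a minimal Java sketch of that combination (the zig-zag bijection from the integers to the naturals is one common choice; folding the long result down to an int hash can of course reintroduce collisions):

// Cantor pairing: a bijection from pairs of naturals to the naturals.
static long cantor(long a, long b) {
    return (a + b) * (a + b + 1) / 2 + b;
}

// Zig-zag bijection from the integers to the naturals: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
static long toNatural(int n) {
    return n >= 0 ? 2L * n : -2L * n - 1;
}

static long pair(int x, int y) {
    return cantor(toNatural(x), toNatural(y));
}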
Related
I'm programming a 3-dimensional cellular automaton. The way I'm iterating through it right now in each generation is:
1. Create a list of all possible coordinates in the 3D space.
2. Shuffle the list.
3. Iterate through the list until all coordinates have been visited.
4. Go to step 2.
Here's the code:
I have a simple three-integer struct:
public class Coordinate
{
    public int x;
    public int y;
    public int z;

    public Coordinate(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
}
then at some point I do this:
List<Coordinate> all_coordinates = new ArrayList<>();
[...]
for (int z = 0; z < length; z++)
{
    for (int x = 0; x < diameter; x++)
    {
        for (int y = 0; y < diameter; y++)
        {
            all_coordinates.add(new Coordinate(x, y, z));
        }
    }
}
and then in the main algorithm I do this:
private void next_generation()
{
    Collections.shuffle(all_coordinates);
    for (int i = 0; i < all_coordinates.size(); i++)
    {
        [...]
    }
}
The problem is, once the automata gets too large, the list containing all possible points gets huge. I need a way to shuffle through all the points without having to actually store all the possible points in memory. How should I go about this?
One way to do this is to start by mapping your three dimensional coordinates into a single dimension. Let's say that your three dimensions' sizes are X, Y, and Z. So your x coordinate goes from 0 to X-1, etc. The full size of your space is X*Y*Z. We'll call that S.
To map any coordinate in 3-space to 1-space, you use the formula x + X*(y + Y*z), which expands to x + X*y + X*Y*z.
Of course, once you generate the numbers, you have to convert back to 3-space. That's a simple matter of reversing the conversion above. Assuming that coord is the 1-space coordinate:
x = coord % X
coord = coord / X
y = coord % Y
z = coord / Y
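As a concrete Java sketch of both directions (method names are mine; X and Y are the dimension sizes as above):

// Map (x, y, z), with 0 <= x < X, 0 <= y < Y, 0 <= z < Z, to an index in [0, X*Y*Z).
static int toIndex(int x, int y, int z, int X, int Y) {
    return x + X * (y + Y * z);
}

// Recover (x, y, z) from the 1-space coordinate.
static int[] fromIndex(int coord, int X, int Y) {
    int x = coord % X;
    coord /= X;
    int y = coord % Y;
    int z = coord / Y;
    return new int[] { x, y, z };
}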
Now, with a single dimension to work with, you've simplified the problem to one of generating all the numbers from 0 to S-1 in pseudo-random order, without duplication.
I know of at least three ways to do this. The simplest uses a multiplicative inverse, as I showed here: Given a number, produce another random number that is the same every time and distinct from all other results.
When you've generated all of the numbers, you "re-shuffle" the list by picking different x and m values for the multiplicative inverse calculation.
Another way of creating a non-repeating pseudo-random sequence in a particular range is with a linear feedback shift register. I don't have a ready example, but I have used them. To change the order, (i.e. re-shuffle), you re-initialize the generator with different parameters.
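As an illustrative sketch (not part of the original answer): a 16-bit Galois LFSR in Java, assuming the standard maximal-period tap mask 0xB400, which makes the register visit every value in [1, 0xFFFF] exactly once before repeating:

// One step of a 16-bit Galois LFSR. The state must never be zero.
static int nextLfsr(int state) {
    int lsb = state & 1;   // bit that falls out of the register
    state >>>= 1;          // shift right by one
    if (lsb != 0) {
        state ^= 0xB400;   // apply the feedback taps (16, 14, 13, 11)
    }
    return state;
}

To cover a range of size S smaller than the period, simply skip any generated value that falls outside [1, S].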
You might also be interested in the answers to this question: Unique (non-repeating) random numbers in O(1)?. That user was only looking for 1,000 numbers, so he could use a table, and the accepted answer reflects that. Other answers cover the LFSR, and a Linear congruential generator that is designed with a specific period.
None of the methods I mentioned require that you maintain much state. The amount of state you need to maintain is constant, whether your range is 20 or 20,000,000.
Note that all of the methods I mentioned above give pseudo-random sequences. They will not be truly random, but they'll likely be close enough to random to fit your needs.
So I'm working with a file that has 400 data values, all ints ranging from 4 to 20,000 in value. I load these into an array of size 400. There is also an empty array of ListNodes of size 600 that I will move the data into using a self-written hash code (posted below).
Each index in the array of length 600 holds a ListNode, and if there are any collisions, the data value is added to the back of that ListNode's chain. I also have a method that returns the percentage of the array that is null. Since I'm loading 400 data values into an array of size 600, the lowest percentage of nulls I can have is 33.3%: if there are no collisions, then 400 slots in the array are taken and 200 are null. But this is not the case:
return (num + 123456789 / (num * 9365)) % 600; // num is the value read from the array of 400
That hash code has given me my best result of 48.3% nulls, and I need it to be below 47% at least. Any suggestions or solutions to improve this hash code? I would greatly appreciate any help. If you need any more info or details, please let me know. Thank you!
I did some experiments with random numbers: generate 400 uniformly distributed random numbers in the range [0, 599] and check how many values in that range are not generated. It turns out that, on average, 51.3% of the values are not generated, which matches the expected fraction of empty slots, (599/600)^400 ≈ e^(-400/600) ≈ 0.513. So your 48.3% is already better than expected.
The target of 47 % seems unrealistic unless some form of perfect hashing is used.
If you want to make some experiments on your own, here is the program.
public static void main(String[] args) {
    Random r = new Random();
    int[] counts = new int[600];
    for (int i = 0; i < 400; i++) {
        counts[r.nextInt(600)]++;
    }
    int n = 0;
    for (int i = 0; i < 600; i++) {
        if (counts[i] == 0) {
            n++;
        }
    }
    System.out.println(100.0 * n / 600);
}
I'd use Java's own implementation of the supplemental hashing algorithm. Have a look at the OpenJDK HashMap:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Note that you have to add a modulo operation to make sure the value stays below 600, as in the sketch below.
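Put together, something like this sketch (the 600 buckets come from the question; masking the sign bit keeps the index non-negative):

static int bucketFor(int value) {
    int h = hash(value);           // the supplemental hash above
    return (h & 0x7FFFFFFF) % 600; // clear the sign bit, then reduce to 600 buckets
}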
EDIT 1
>>> is the logical (unsigned) right shift.
EXAMPLE:
10000000 >>> 2 = 00100000
I'm trying to apply L2 normalization to a double vector in Java.
double[] vector = {0.00423823948, 0.00000000000823285934, 0.0000342523505342, 0.000040240234023423, 0, 0};
Now if I apply the L2 normalization:
double squareVectorSum = 0;
for (double v : vector) {
    squareVectorSum += v * v;
}
double normalizationFactor = Math.sqrt(squareVectorSum);
// System.out.println(squareVectorSum + " " + normalizationFactor);
double[] vector_result = new double[vector.length];
for (int i = 0; i < vector.length; i++) {
    vector_result[i] = vector[i] / normalizationFactor;
}
My normalized vector looks like this:
Normalized vector (l2 normalization)
0.9999222784309146 1.9423676996312713E-9 0.008081112110203743 0.009493825603572155 0.0 0.0
Now if I compute the squared sum of all the normalized-vector components, I should get a sum equal to one. Instead, my squared sum is:
double sum = 0;
for (double v : vector_result) {
    sum += v * v;
}
Squared sum of the normalized-vector
1.0000000000000004
Why is my sum not exactly equal to one?
Are there problems in the code?
Or is it just because my numbers are very small and there is some approximation error with doubles?
As indicated above, this is a common issue, one you're going to have to deal with if you're going to use floating point binary arithmetic. The problem mostly crops up when you want to compare two floating point binary numbers for equality. Since the operations applied to arrive at the values may not be identical, neither will their binary representations.
There are at least a couple of strategies you can consider for dealing with this situation. The first involves comparing the absolute difference between two floating point numbers x and y against some small value ϵ > 0, rather than testing strict equality. This would look something like:
if (Math.abs(y - x) < epsilon) {
    // Assume x == y
} else {
    // Assume x != y
}
This works well when the possible values of x and y have a relatively tight bound on their exponents. When this is not the case, the values of x and y may be such that the difference always dominates the ϵ you choose (if the exponents are too large) or ϵ always dominates the difference (if the exponents are too small). To get around this, you can compare the ratio of x and y to 1.0 instead of comparing the absolute difference, and check whether that ratio differs from 1.0 by more than ϵ. That would look like:
if (Math.abs(x / y - 1.0) < epsilon) {
    // Assume x == y
} else {
    // Assume x != y
}
You will likely need to add another check to ensure y != 0 to avoid division by zero, but that's the general idea. A combined sketch of both checks follows.
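A combined sketch of both strategies (the helper name and the single shared epsilon are my own choices):

// Treats x and y as equal when they are close in absolute terms (useful near zero)
// or in relative terms (useful when exponents are large).
static boolean nearlyEqual(double x, double y, double epsilon) {
    double diff = Math.abs(x - y);
    if (diff < epsilon) {
        return true; // absolute check; also covers x == y == 0
    }
    // Relative check; the division is safe because at least one operand is non-zero here.
    return diff / Math.max(Math.abs(x), Math.abs(y)) < epsilon;
}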
Other options include using a fixed point library for Java or a rational number library for Java. I have no recommendations for that, though.
I would like to use a HashMap to map (x, y) coordinates to values. What is a good hashCode() function definition?
In this case, I am only storing integer coordinates of the form (x, y)
where y - x = 0, 1, ..., M - 1 for some parameter M.
To get a unique value from two numbers, you can use the bijective algorithm described here:
<x, y> = x + (y + ((x + 1) / 2))^2
This will give you a unique value, which can be used as a hash code:
public int hashCode()
{
    int tmp = y + ((x + 1) / 2);
    return x + (tmp * tmp);
}
I generally use Objects.hash(Object... values) for generating a hash code for a sequence of items.
The hash code is generated as if all the input values were placed into an array, and that array were hashed by calling Arrays.hashCode(Object[]).
@Override
public int hashCode() {
    return Objects.hash(x, y);
}
Use Objects.hash(x, y, z) for 3D coordinates.
If you wish to handle it manually, you could compute the hashCode using:
// For 2D coordinates
hashCode = LARGE_PRIME * X + Y;
// For 3D coordinates (LARGE_PRIME squared; note Java's ^ is XOR, not exponentiation)
hashCode = LARGE_PRIME * LARGE_PRIME * X + LARGE_PRIME * Y + Z;
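Spelled out as a runnable Java sketch for the 3D case (the choice of 31 as the prime is arbitrary; overflow simply wraps, which is fine for hashing):

@Override
public int hashCode() {
    final int PRIME = 31;
    // Horner form of PRIME*PRIME*x + PRIME*y + z
    return (x * PRIME + y) * PRIME + z;
}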
To calculate a hash code for objects with several properties, a generic solution is often implemented. This implementation combines the properties using a constant factor whose value is a subject of discussion; a factor of 33 or 397 will often result in a good distribution of hash codes, making it well suited for dictionaries.
This is a small example in C#, though it should be easily adaptable to Java:
public override int GetHashCode()
{
    unchecked // integer overflows are accepted here
    {
        int hashCode = 0;
        hashCode = (hashCode * 397) ^ this.Hue.GetHashCode();
        hashCode = (hashCode * 397) ^ this.Saturation.GetHashCode();
        hashCode = (hashCode * 397) ^ this.Luminance.GetHashCode();
        return hashCode;
    }
}
This scheme should also work for your coordinates; simply replace the properties with the X and Y values. Note that we should prevent integer overflow exceptions; in .NET this can be achieved by using the unchecked block.
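A direct Java adaptation might look like this sketch (Integer.hashCode is Java 8+ and mirrors GetHashCode; Java int arithmetic wraps silently on overflow, so no unchecked equivalent is needed):

@Override
public int hashCode() {
    int hashCode = 0;
    hashCode = (hashCode * 397) ^ Integer.hashCode(x);
    hashCode = (hashCode * 397) ^ Integer.hashCode(y);
    return hashCode;
}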
Have you considered simply shifting either x or y by half the available bits? For "classic" 8-bit that's only 16 cells per axis, but with today's standard 32-bit int it grows to over 65k cells per axis.
@Override
public int hashCode() {
    return x | (y << 16);
}
For obvious reasons this only works as long as both x and y are between 0 and 0xFFFF (0-65535, inclusive), but that's plenty of space: more than 4.2 billion cells.
Edit: Another option, but one that requires you to know the actual size of the grid, would be to compute x + y * width (where width is the extent in the x direction).
That depends on what you intend to use the hash code for:
If you plan on using it as a sort of index, e.g. knowing x and y will hash to the index where the (x, y) data is stored, it's better to use a two-dimensional array for such a thing.
Coordinates[][] coordinatesBucket = new Coordinates[maxY][maxX];
But if you absolutely must have a unique hash for every (x, y) combination, then try concatenating the coordinates decimally (rather than adding or multiplying them). For example, x=20, y=40 would give you the simple code xy=2040. Note that this is only unique while each coordinate occupies a fixed number of decimal digits; a sketch follows.
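Read concretely (my interpretation of the decimal concatenation), with two digits reserved for y:

// Unique only while 0 <= y < 100 and x * 100 + y fits in an int.
int hash = x * 100 + y; // x=20, y=40 -> 2040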
Here is a fun one: I need to generate random x/y pairs that are correlated at a given value of Pearson product moment correlation coefficient, or Pearson r. You can imagine this as two arrays, array X and array Y, where the values of array X and array Y must be re-generated, re-ordered or transformed until they are correlated with each other at a given level of Pearson r. Here is the kicker: Array X and Array Y must be uniform distributions.
I can do this with a normal distribution, but transforming the values without skewing the distribution has me stumped. I tried re-ordering the values in the arrays to increase the correlation, but I will never get arrays correlated at 1.00 or -1.00 just by sorting.
Any ideas?
--
Here is the AS3 code for random correlated Gaussians, to get the wheels turning:
public static function nextCorrelatedGaussians(r:Number):Array {
    var d1:Number;
    var d2:Number;
    var n1:Number;
    var n2:Number;
    var lambda:Number;
    var arr:Array = new Array();
    var isNeg:Boolean;
    if (r < 0) {
        r *= -1;
        isNeg = true;
    }
    lambda = ((r*r) - Math.sqrt((r*r) - (r*r*r*r))) / ((2*r*r) - 1);
    n1 = nextGaussian();
    n2 = nextGaussian();
    d1 = n1;
    d2 = ((lambda*n1) + ((1 - lambda)*n2)) / Math.sqrt((lambda*lambda) + (1 - lambda)*(1 - lambda));
    if (isNeg) { d2 *= -1; }
    arr.push(d1);
    arr.push(d2);
    return arr;
}
I ended up writing a short paper on this
It doesn't include your sorting method (although in practice I think it's similar to my first method, in a roundabout way), but it does describe two ways that don't require iteration.
Here is an implementation of twolfe18's algorithm written in ActionScript 3:

for (var j:int = 0; j < size; j++) {
    xValues[j] = Math.random();
}
var varX:Number = Util.variance(xValues);
var varianceE:Number = 1/(r*varX) - varX;
for (var i:int = 0; i < size; i++) {
    yValues[i] = xValues[i] + boxMuller(0, Math.sqrt(varianceE));
}
boxMuller is just a method that generates a random Gaussian with the arguments (mean, stdDev).
size is the size of the distribution.
Sample output
Target p: 0.8
Generated p: 0.04846346291280387
variance of x distribution: 0.0707786253165176
varianceE: 17.589920412141158
As you can see I'm still a ways off. Any suggestions?
This apparently simple question has been messing with my mind since yesterday evening! I looked into the topic of simulating distributions with a dependency, and the best I found is this: simulate dependent random variables. The gist of it is that you can easily simulate 2 normals with a given correlation, and the article outlines a method to transform these non-independent normals into uniforms, but this won't preserve the correlation: the transformed variables will still be correlated, so to speak, but not at the same level. See the paragraph "Rank correlation coefficients".
Edit: from what I gather from the second part of the article, the copula method would allow you to simulate/generate random variables with a given rank correlation.
Start with the model y = x + e, where e is the error (a normal random variable). e should have a mean of 0 and variance k.
Long story short, you can write a formula for the expected value of the Pearson coefficient in terms of k, and solve for k. Note that you cannot randomly generate data with the Pearson exactly equal to a specific value, only with an expected Pearson of a specific value.
I'll try to come back and edit this post to include a closed form solution when I have access to some paper.
EDIT: OK, I have a hand-wavy solution that is probably correct (but will require testing to confirm). For now, assume the desired Pearson r > 0 (you can figure out the r < 0 case). As I mentioned earlier, set your model to Y = X + E (X is uniform, E is normal).
Sample to get your x's.
Compute var(x).
The variance of E should be: 1/(r * var(x)) - var(x).
Generate your y's from your x's by sampling from your normal random variable E.
For r < 0, set Y = -X + E and proceed accordingly.
Basically, this follows from the definition of Pearson: cov(x,y)/(var(x)*var(y)). When you add noise to the x's (Y = X + E), the expected covariance cov(x,y) should not change from that with no noise. var(x) does not change. var(y) is the sum of var(x) and var(e); hence my solution.
SECOND EDIT: OK, I need to read definitions better. The definition of Pearson is cov(x, y)/(sd(x) * sd(y)). From that, I think the true value of var(E) should be ((1/r) * sd(x))^2 - var(x), i.e. var(x) * (1/r^2 - 1). See if that works.
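A Java sketch of that recipe with the corrected variance (all names are mine; note that y comes out as uniform plus Gaussian noise, so the uniform marginal for y is not preserved, which is the limitation discussed in this thread):

import java.util.Random;

public class CorrelatedPairs {
    // y = x + e with var(e) = ((1/r) * sd(x))^2 - var(x) = var(x) * (1/r^2 - 1),
    // giving an expected Pearson correlation of r between x and y (0 < r <= 1).
    public static double[][] sample(int n, double r, Random rng) {
        double[] x = new double[n];
        for (int i = 0; i < n; i++) x[i] = rng.nextDouble();

        // sample mean and variance of x
        double mean = 0;
        for (double v : x) mean += v;
        mean /= n;
        double varX = 0;
        for (double v : x) varX += (v - mean) * (v - mean);
        varX /= n;

        // standard deviation of the noise term e
        double sdE = Math.sqrt(varX * (1.0 / (r * r) - 1.0));

        double[][] pairs = new double[n][2];
        for (int i = 0; i < n; i++) {
            pairs[i][0] = x[i];
            pairs[i][1] = x[i] + sdE * rng.nextGaussian();
        }
        return pairs;
    }
}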
To get a correlation of 1, both X and Y should be the same, so copy X to Y and you have a correlation of 1. To get a correlation of -1, make Y = 1 - X (assuming the X values are in [0, 1]).
A strange problem demands a strange solution -- here is how I solved it.
- Generate array X.
- Clone array X to create array Y.
- Sort array X (you can use whatever method you want to sort array X: quicksort, heapsort, anything stable).
- Measure the starting level of Pearson's r with array X sorted and array Y unsorted.
WHILE the correlation is outside of the range you are hoping for
    IF the correlation is too low
        run one iteration of CombSort11 on array Y, then recheck the correlation
    ELSE IF the correlation is too high
        randomly swap two values, then recheck the correlation
And that's it! Combsort is the real key; it has the effect of increasing the correlation slowly and steadily. Check out Jason Harrison's demo to see what I mean. To get a negative correlation you can invert the sort or invert one of the arrays after the whole process is complete.
Here is my implementation in AS3:
public static function nextReliableCorrelatedUniforms(r:Number, size:int, error:Number):Array {
    var yValues:Array = new Array();
    var xValues:Array = new Array();
    var coVar:Number = 0;
    for (var e:int = 0; e < size; e++) { // create x values
        xValues.push(Math.random());
    }
    yValues = xValues.concat();
    if (r != 1.0) {
        xValues.sort(Array.NUMERIC);
    }
    var trueR:Number = Util.getPearson(xValues, yValues);
    while (Math.abs(trueR - r) > error) {
        if (trueR < r - error) { // combsort11 for y
            var gap:int = yValues.length;
            var swapped:Boolean = true;
            while (trueR <= r - error) {
                if (gap > 1) {
                    gap = Math.round(gap / 1.3);
                }
                var i:int = 0;
                swapped = false;
                while (i + gap < yValues.length && trueR <= r - error) {
                    if (yValues[i] > yValues[i + gap]) {
                        var t:Number = yValues[i];
                        yValues[i] = yValues[i + gap];
                        yValues[i + gap] = t;
                        trueR = Util.getPearson(xValues, yValues);
                        swapped = true;
                    }
                    i++;
                }
            }
        } else { // decorrelate
            while (trueR >= r + error) {
                var a:int = Random.randomUniformIntegerBetween(0, size - 1);
                var b:int = Random.randomUniformIntegerBetween(0, size - 1);
                var temp:Number = yValues[a];
                yValues[a] = yValues[b];
                yValues[b] = temp;
                trueR = Util.getPearson(xValues, yValues);
            }
        }
    }
    var correlates:Array = new Array();
    for (var h:int = 0; h < size; h++) {
        var pair:Array = new Array(xValues[h], yValues[h]);
        correlates.push(pair);
    }
    return correlates;
}