Numerical accuracy with log probability Java implementation

Sometimes when you do calculations with very small probabilities using common data types such as doubles, numerical inaccuracies cascade over multiple calculations and lead to incorrect results. Because of this it is recommended to use log probabilities, which improve numerical stability. I have implemented log probabilities in Java and my implementation works, but it has worse numerical stability than using raw doubles. What is wrong with my implementation? What is an accurate and efficient way to perform many consecutive calculations with small probabilities in Java?
I'm unable to provide a neatly contained demonstration of this problem because the inaccuracies cascade over many calculations. However, here is proof that a problem exists: this submission to a CodeForces contest fails due to numerical accuracy. Running test #7 and adding debug prints clearly show that from day 1774, numerical errors begin cascading until the sum of probabilities drops to 0 (when it should be 1). After replacing my Prob class with a simple wrapper over doubles the exact same solution passes tests.
My implementation of multiplying probabilities:
log(a * b) = Math.log(a) + Math.log(b)
My implementation of addition:
log(a + b) = Math.log(a) + Math.log(1 + Math.exp(Math.log(b) - Math.log(a)))
The stability problem is most likely contained within those 2 lines, but here is my entire implementation:
class Prob {
/** Math explained: https://en.wikipedia.org/wiki/Log_probability
* Quick start:
* - Instantiate probabilities, eg. Prob a = new Prob(0.75)
* - add(), multiply() return new objects, can perform on nulls & NaNs.
* - get() returns probability as a readable double */
/** Logarithmized probability. Note: 0% represented by logP NaN. */
private double logP;
/** Construct instance with real probability. */
public Prob(double real) {
if (real > 0) this.logP = Math.log(real);
else this.logP = Double.NaN;
}
/** Construct instance with already logarithmized value. */
static boolean dontLogAgain = true;
public Prob(double logP, boolean anyBooleanHereToChooseThisConstructor) {
this.logP = logP;
}
/** Returns real probability as a double. */
public double get() {
return Math.exp(logP);
}
@Override
public String toString() {
return ""+get();
}
/***************** STATIC METHODS BELOW ********************/
/** Note: returns NaN only when a && b are both NaN/null. */
public static Prob add(Prob a, Prob b) {
if (nullOrNaN(a) && nullOrNaN(b)) return new Prob(Double.NaN, dontLogAgain);
if (nullOrNaN(a)) return copy(b);
if (nullOrNaN(b)) return copy(a);
double x = a.logP;
double y = b.logP;
double sum = x + Math.log(1 + Math.exp(y - x));
return new Prob(sum, dontLogAgain);
}
/** Note: multiplying by null or NaN produces NaN (repping 0% real prob). */
public static Prob multiply(Prob a, Prob b) {
if (nullOrNaN(a) || nullOrNaN(b)) return new Prob(Double.NaN, dontLogAgain);
return new Prob(a.logP + b.logP, dontLogAgain);
}
/** Returns true if p is null or NaN. */
private static boolean nullOrNaN(Prob p) {
return (p == null || Double.isNaN(p.logP));
}
/** Returns a new instance with the same value as original. */
private static Prob copy(Prob original) {
return new Prob(original.logP, dontLogAgain);
}
}

The problem was caused by the way Math.exp(z) was used in this line:
log(a + b) = Math.log(a) + Math.log(1 + Math.exp(Math.log(b) - Math.log(a)))
When z reaches extreme values, a double cannot represent the output of Math.exp(z). We lose information and produce an inaccurate result, and these errors then cascade over multiple calculations.
When z >= 710 then Math.exp(z) = Infinity
When z <= -746 then Math.exp(z) = 0
In the original code I was calling Math.exp with y - x and arbitrarily choosing which is x and which is y. Let's instead choose x and y based on which is larger, so that z is never positive. The point where Math.exp stops being representable is further out on the negative side (746 rather than 710) and, more importantly, on that side it underflows to 0 rather than overflowing to Infinity, which is what we want for a very low probability.
double x = Math.max(a.logP, b.logP);
double y = Math.min(a.logP, b.logP);
double sum = x + Math.log(1 + Math.exp(y - x));
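The ordering trick above can be packaged as a small helper. This is only a minimal sketch, not the original Prob class: the LogSpace name and its methods are illustrative, it represents probability zero as Double.NEGATIVE_INFINITY instead of the NaN convention used in the question, and it swaps Math.log(1 + t) for Math.log1p(t), which keeps more precision when t is tiny.

```java
/** Minimal sketch of log-space arithmetic; class and method names are illustrative. */
final class LogSpace {
    /** Log of probability zero (an assumption: the original class used NaN here). */
    static final double LOG_ZERO = Double.NEGATIVE_INFINITY;

    /** Returns log(exp(x) + exp(y)) without leaving log space. */
    static double logAdd(double x, double y) {
        if (x == LOG_ZERO) return y;
        if (y == LOG_ZERO) return x;
        double hi = Math.max(x, y);
        double lo = Math.min(x, y);
        // lo - hi <= 0, so Math.exp can only underflow to 0, never overflow.
        return hi + Math.log1p(Math.exp(lo - hi));
    }

    /** Returns log(exp(x) * exp(y)) for log probabilities x and y. */
    static double logMultiply(double x, double y) {
        return x + y; // -Infinity plus a finite value stays -Infinity
    }

    public static void main(String[] args) {
        double a = Math.log(0.25);
        double b = Math.log(0.5);
        System.out.println(Math.exp(logAdd(a, b)));
        // Both inputs are far too small to represent as plain doubles.
        System.out.println(logAdd(-2000.0, -1000.0));
    }
}
```

With negative infinity as the zero value, the ordinary double rules handle multiplication by zero, so only addition needs special cases.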

Related

Sigmoid function return NaN in Java

I am trying to create a logistic regression algorithm in java but when I calculate the logarithm of the likelihood it is always returning NaN. My method which calculates the logarithm looks like this :
//Calculate log likelihood on given data
private double getLogLikelihood(double cat, double[] x) {
return cat * Math.log(findProbability(x))
+ (1 - cat) * Math.log(1 - findProbability(x));
}
The findProbability method just takes an instance from the dataset and returns the sigmoid function result, which is between 0 and 1.
//Calculate the sum of w * x for each weight and attribute
//call the sigmoid function with that s
public double findProbability(double[] x){
double s = 0;
for(int i = 0; i < this.weights.length; i++){
if(i >= x.length) break;
s += this.weights[i] * x[i];
}
return sigmoid(s);
}
private double sigmoid(double s){
return 1 / (1 + Math.exp(-s));
}
Moreover, my starting weights are :
[-0.2982955509135178, -0.4984900460081106, -1.816880187922516, -2.7325608512266073, 0.12542715714800834, 0.1516078084483485, 0.27631147403449774, 0.1371611094778011, 0.16029832096058613, 0.3117065974657231, 0.04262385176091778, 0.1948263133838624, 0.10788353525185314, 0.770608588466501, 0.2697281907888033, 0.09920694325563077, 0.003224073601703939, 0.021573742410541247, 0.21528348692817675, 0.3275511757298476, -0.1500597314893408, -0.7221692528386277, -2.062544912370121, 1.4315146889363015, 0.2522133355419722, 0.23919315019065995, 0.3200037377021523, 0.059466770771758076, 0.04012493980772944, 0.2553236501265919]
Finally, an instance from my dataset is :[M,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189]
I tried initializing the starting weights with different random numbers, but that didn't solve the problem.

The arithmetic is causing a rounding error, leaving you with 1.
double b = 1 + Math.exp(-3522);
b will be equal to 1, because representing the exact sum would need too many significant figures. You'll have to approximate the value to keep the precision: 1/(1+s) ~= 1 - s, which means you need to calculate log(1) and log(s).
Edit: sorry, I made a mistake; it appears Math.exp(-3522) is evaluated as 0 after rounding. I'll leave this answer because Math.exp(-x) might be too small to add to 1, or it might just be zero.

NaN is the result of an operation such as 0/0, 0 * Infinity, or Math.log on a negative number, so you should try to find where exactly this happens. I suggest debugging or adding code to print the values you take the logarithm of or divide by.
EDIT: it seems it is a rounding error: exp(-s) returns a result so small that, added to 1, the sum is still 1. This causes the logarithm of 1 - findProbability(x) to return -Inf, and multiplying -Inf by a zero factor gives NaN. I'd suggest you try to find a mathematical way around this, perhaps by approximating the log of the exponential.
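One mathematical way around the rounding problem is to compute the logarithm of the sigmoid directly, without ever forming the sigmoid value. The sketch below assumes the identities log(sigmoid(s)) = -log(1 + exp(-s)) and log(1 - sigmoid(s)) = log(sigmoid(-s)); the StableLogistic class and its method names are hypothetical, and logLikelihood takes the weighted sum s rather than the feature array used in the question.

```java
/** Sketch of a log-likelihood that never evaluates log(0); names are illustrative. */
final class StableLogistic {
    /** Returns log(1 / (1 + exp(-s))), arranged so that exp() cannot overflow. */
    static double logSigmoid(double s) {
        if (s >= 0) {
            return -Math.log1p(Math.exp(-s));
        }
        // For negative s, rewrite as s - log(1 + exp(s)) so exp() sees a negative argument.
        return s - Math.log1p(Math.exp(s));
    }

    /** Log-likelihood of one example: cat is the 0/1 label, s is the weighted sum w . x. */
    static double logLikelihood(double cat, double s) {
        // log(1 - sigmoid(s)) equals log(sigmoid(-s)).
        return cat * logSigmoid(s) + (1 - cat) * logSigmoid(-s);
    }

    public static void main(String[] args) {
        // The value 3522 is the magnitude mentioned in the answer above.
        System.out.println(logLikelihood(1.0, 3522.0));
        System.out.println(logLikelihood(0.0, 3522.0));
    }
}
```

Because both terms stay finite for any finite s, the 0 * -Infinity product that produced NaN no longer occurs.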

I found a solution to my problem so I post it here:
I added an overflow check:
private double sigmoid(double s){
if(s>20){
s=20;
}else if(s<-20){
s=-20;
}
double exp = Math.exp(s);
return exp/(1+exp);
}
Also, changing 1/(1+Math.exp(-s)) to exp/(1+exp) proved to be more stable under small disturbances of the inputs.

Adding and subtracting exact values to float

Question:
The total number of floats is finite; there are about 2^32 of them. With a float, you can go directly to the next or previous one using java.lang.Math.nextAfter. I call that a single leap. My main question, composed of sub-questions, is: how can I navigate floats using leaps?
First, how can I move a float to another by multiple leaps at once?
public static float moveFloat(float value, int leaps) {
for(int i = 0; i < Math.abs(leaps); i++)
value = Math.nextAfter(value, Float.POSITIVE_INFINITY * signum(leaps));
return value;
}
That should work in theory but is really unoptimized. How can I do it in a single addition?
I also need to know how many leaps there are between two floats. Here's an example implementation for this one:
public static int getLeaps(float value, float destination) {
int leaps = 0;
float direction = signum(destination - value);
while(value * direction < destination * direction) {
value = Math.nextAfter(value, Float.POSITIVE_INFINITY * direction);
leaps++;
}
return leaps;
}
Again, same problem here. This implementation isn't suitable.
Extra:
The thing I call a leap, does it have an actual name ?
Background:
I'm trying to make a simple 2D physics engine in Java and I have trouble with my floating-point operations. I learned about relative-error float comparison and it helped a bit, but it's not magic. What I want is to be exact with my floating points.
I already know that a lot of base-ten numbers cannot be exactly represented with floating points but, exceptionally, I don't care. All I want is exact float arithmetic in base 2.
To simplify: in my collision detection and response process, I check whether shapes overlap (let's stay in one dimension for this example) and I reposition the 2 overlapping shapes according to their weights.
See this example:
If the black lines are the float values (and the spaces between them are leaps), then whatever the precision is, I want to place both shapes (colored lines) exactly at the brown position. (The brown position is determined by the weight ratio and by rounding. What I call penetration is the overlapping area/distance. If the penetration had been 5, red would have been pushed by 1 and blue by 4.)
The problem is that, to do that, I have to keep the penetration of the collision (in this case the penetration is exactly the ULP of the float, or 1 leap) in a float, and I suspect this leads to inexactness. If the penetration value is bigger than the coordinates of the shapes, it will be less precise, so they won't be repositioned exactly at the right coordinate.
What I imagine is keeping the penetration of the collision as the number of leaps I need to get from one position to the other and using it afterwards.
This is a simplified version of the current code I have:
public class ReplaceResolver implements CollisionResolver {
@Override
public void resolve(Collision collision) {
float deltaB = collision.weightRatio * collision.penetration; //bodyA's weight over the sum of the 2 (pre calculated)
float deltaA = 1f - deltaB;
//the normal indicates where the shape should be pushed. For now, my engine is only AA so a component of the normal (x or y) is always 0 while the other is 1
if(deltaB > 0)
replace(collision.bodyA, collision.normalB, deltaA);
if(deltaA > 0)
replace(collision.bodyB, collision.normalA, deltaB);
}
private void replace(Body body, Vector2 normal, float delta) {
body.getPosition().x += normal.x * delta; //body.getPosition() is a Vector2
body.getPosition().y += normal.y * delta;
}
}
Obviously, this doesn't work properly and accumulates floating-point precision error. The error is handled well by my collision detection, which checks for float equality using ULPs. However, it breaks when crossing 0 because the ULP becomes extremely small.
I could simply fix an epsilon for a physics simulation, but that would remove the whole point of using floats. The technique I want to use lets the user choose the precision implicitly and should theoretically work with any precision.

The underlying IEEE 754 floating-point model has this property: if you reinterpret the bits as an integer, taking the next float after (or before, depending on the direction) is just like taking the next (or previous) integer, that is, adding or subtracting 1 to the bit pattern reinterpreted as an integer.
Stepping n times means adding (or subtracting) n to the bit pattern. It's as simple as that, as long as the sign does not change and you don't overflow to NaN or Inf.
And the number of different floats between two floats is the difference of the two integers if the signs agree.
If the signs differ, since the float has a sign-magnitude-like representation, which does not fit the integer representation, you'll have to do a bit of arithmetic.
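The "bit of arithmetic" for mixed signs can be sketched by first mapping each bit pattern to an integer that is ordered the same way as the floats. This is a hedged illustration rather than a complete solution: the FloatLeaps class and its method names are made up for this example, -0.0f and 0.0f are treated as the same position, and NaN inputs and steps that run past the infinities are not handled.

```java
/** Sketch of counting and applying "leaps" (steps of one ULP) through bit patterns. */
final class FloatLeaps {
    /** Maps a float to an int ordered like the float; -0.0f and 0.0f both map to 0. */
    static int toOrdered(float f) {
        int bits = Float.floatToIntBits(f);
        // Negative floats are sign-magnitude: flip them so larger magnitude means smaller int.
        return bits >= 0 ? bits : Integer.MIN_VALUE - bits;
    }

    /** Inverse of toOrdered (0 maps back to +0.0f). */
    static float fromOrdered(int ordered) {
        return Float.intBitsToFloat(ordered >= 0 ? ordered : Integer.MIN_VALUE - ordered);
    }

    /** Signed number of leaps from value to destination. */
    static long leapsBetween(float value, float destination) {
        return (long) toOrdered(destination) - toOrdered(value);
    }

    /** Moves value by the given number of leaps in a single addition. */
    static float moveFloat(float value, int leaps) {
        return fromOrdered(toOrdered(value) + leaps);
    }

    public static void main(String[] args) {
        System.out.println(leapsBetween(1.5f, Math.nextUp(1.5f)));
        System.out.println(moveFloat(-Float.MIN_VALUE, 2));
    }
}
```

The result of leapsBetween is a long because the span between a large negative and a large positive float does not fit in an int.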

I want to do the same calculation. So, if "leaps" means, as @aka.nice said, the integer difference/span/distance between two floating-point values according to the IEEE 754 floating-point "single format" bit layout (IEEE754 Format), I may have found a simple method:
public static native int floatToRawIntBits(float value) (implemented natively by Java_java_lang_Float_floatToRawIntBits) can be used for this purpose. It has similar functionality to my test code below, which reinterprets memory (like reinterpret_cast in C++).
#include <stdio.h>
/* https://stackoverflow.com/questions/44008357/adding-and-subtracting-exact-values-to-float */
int main(void) {
float float0 = 1.5f;
float float1 = 1.5000001f;
int intbits_of_float0 = *(int *)&float0;
int intbits_of_float1 = *(int *)&float1;
printf("float %.17g is reinterpreted as an integer %d\n", float0, intbits_of_float0);
printf("float %.17g is reinterpreted as an integer %d\n", float1, intbits_of_float1);
return 0;
}
And the Java code (online compiler) below is used to calculate the "leaps":
public class Toy {
public static void main(String args[]) {
int length = 0x82000000;
int x = length >>> 24;
int y = (length >>> 24) & 0xFF;
System.out.println("length = " + length + ", x = " + x + ", y = " + y);
float float0 = 1.5f;
float float1 = 1.5000001f;
float float2 = 1.5000002f;
float float4 = 1.5000004f;
float float5 = 1.5000005f;
// testLeaps(float0, float4);
// testLeaps(0, float5);
// testLeaps(0, -float1);
// testLeaps(-float1, 0);
System.out.println(Math.nextAfter(-float1, Float.POSITIVE_INFINITY));
System.out.println(INT_POWER_MASK & Float.floatToIntBits(-float0));
System.out.println(INT_POWER_MASK & Float.floatToIntBits(float0));
// testLeaps(-float1, -float0);
testLeaps(-float0, 0);
testLeaps(float0, 0);
}
public static void testLeaps(float value, float destination) {
System.out.println("optLeaps(" + value + ", " + destination + ") = " + optLeaps(value, destination));
System.out.println("getLeaps(" + value + ", " + destination + ") = " + getLeaps(value, destination));
}
public static final int INT_POWER_MASK = 0x7f800000 | 0x007fffff; // ~0x80000000
/**
* Retrieves the integer difference between two float-point values according to
* the IEEE 754 floating-point "single format" bit layout.
*
* <pre>
* mask 0x80000000 | 0x7f800000 | 0x007fffff
* sign | exponent | coefficient/significand/mantissa
* +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
* | | | |
* +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
* 31 30 23 22 0
* 0x7fc00000 => NaN
* 0x7f800000 +Infinity
* 0xff800000 -Infinity
* </pre>
*
* Using base (radix) 2, the numerical value of a normalized float is
* `(-1)^sign x 1.coefficient x 2^(exponent - 127)`, so for two floats of the
* same sign the bit patterns, read as integers, are ordered like the magnitudes.
*
* @param value the first operand
* @param destination the second operand
* @return the integer span from {@code value} to {@code destination}
*/
public static int optLeaps(float value, float destination) {
// TODO process possible cases for some special inputs.
int valueBits = Float.floatToIntBits(value); // IEEE 754 floating-point "single format" bit layout
int destinationBits = Float.floatToIntBits(destination); // IEEE 754 floating-point "single format" bit layout
int leaps; // Float.intBitsToFloat();
if ((destinationBits ^ valueBits) >= 0) {
leaps = Math.abs(destinationBits - valueBits);
} else {
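// Parentheses are required: + binds tighter than &. The sum can overflow int for large magnitudes.
leaps = (INT_POWER_MASK & destinationBits) + (INT_POWER_MASK & valueBits);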
}
return leaps;
}
public static int getLeaps(float value, float destination) {
int leaps = 0;
float signum = Math.signum(destination - value);
// float direction = Float.POSITIVE_INFINITY * signum;
// while (value * signum < destination * signum) {
// value = Math.nextAfter(value, direction); // Float.POSITIVE_INFINITY * direction
// leaps++;
// }
if (0 == signum) {
return 0;
}
if (0 < signum) {
while (value < destination) {
value = Math.nextAfter(value, Float.POSITIVE_INFINITY);
leaps++;
}
} else {
while (value > destination) {
value = Math.nextAfter(value, Float.NEGATIVE_INFINITY);
leaps++;
}
}
return leaps;
}
// optimize to reduce the elapsed time by roughly half
}

To start, I just want to say I don't like hacking into an object's implementation, and you should try using your own implementation (or another library) first, but sometimes you have to get creative.
Let's start with the key detail here, what you call the "leap" (I would call it rounding error). So what is rounding error, and why is there any? Floats (and doubles) are stored as integer x base^exponent (IEEE standard). So using base 10, if you have 1.2340 x 10^3 (or 1,234.0), your "leap" will be 0.1, since that is the size of your least significant digit (in storage, the . is implied).
(And I'm out, too much black magic here for me)

Unique Computational value for an array

I have been thinking about this but have run out of ideas. I have 10 arrays, each holding 18 double values. These 18 values are features of an image. Now I have to apply k-means clustering to them.
To implement k-means clustering I need a unique computational value for each array. Is there any mathematical, statistical, or other logic that would help me create a computational value for each array that is unique to it, based on the values inside it? Thanks in advance.
Here is an example array. I have 10 more.
[0.07518284315321135
0.002987851573676068
0.002963866526639678
0.002526139418225552
0.07444872939213325
0.0037219653347541617
0.0036979802877177715
0.0017920256571474585
0.07499695903867931
0.003477831820276616
0.003477831820276616
0.002036159171625004
0.07383539747505984
0.004311312204791184
0.0043352972518275745
0.0011786937400740452
0.07353130134299131
0.004339580295941216]

Did you check Arrays.hashCode in Java 7?
/**
* Returns a hash code based on the contents of the specified array.
* For any two <tt>double</tt> arrays <tt>a</tt> and <tt>b</tt>
* such that <tt>Arrays.equals(a, b)</tt>, it is also the case that
* <tt>Arrays.hashCode(a) == Arrays.hashCode(b)</tt>.
*
* <p>The value returned by this method is the same value that would be
* obtained by invoking the {@link List#hashCode() <tt>hashCode</tt>}
* method on a {@link List} containing a sequence of {@link Double}
* instances representing the elements of <tt>a</tt> in the same order.
* If <tt>a</tt> is <tt>null</tt>, this method returns 0.
*
* @param a the array whose hash value to compute
* @return a content-based hash code for <tt>a</tt>
* @since 1.5
*/
public static int hashCode(double a[]) {
if (a == null)
return 0;
int result = 1;
for (double element : a) {
long bits = Double.doubleToLongBits(element);
result = 31 * result + (int)(bits ^ (bits >>> 32));
}
return result;
}
I don't understand why @Marco13 mentioned "this is not returning unique for arrays".
UPDATE
See @Marco13's comment for the reason why it cannot be unique.
UPDATE
If we draw a graph of your input points (18 elements), it shows one spike followed by 3 low values, and the pattern repeats.
If that is true, you can find the mean of your peaks (1, 4, 8, 12, 16) and find the low mean from the remaining values.
That gives you a peak mean and a low mean. You can then find a unique number to represent these two, and also preserve the values, using the bijective algorithm described here.
This algorithm also provides formulas for the reverse, i.e. recovering the peak and low means from the unique value.
To find the unique pair value: <x, y> = x + (y + (x + 1) / 2)^2
Also refer to Exercise 1 on page 2 of the PDF to recover x and y.
To find the means and the pairing value:
public static double mean(double[] array){
double peakMean = 0;
double lowMean = 0;
int peakCount = 0;
int lowCount = 0;
for (int i = 0; i < array.length; i++) {
// Peaks sit at zero-based indices 0, 4, 8, 12, 16 of the 18-element array.
if (i % 4 == 0){
peakMean = peakMean + array[i];
peakCount++;
}else{
lowMean = lowMean + array[i];
lowCount++;
}
}
// Divide by the actual counts (5 peaks and 13 low values for 18 elements).
peakMean = peakMean / peakCount;
lowMean = lowMean / lowCount;
return bijective(lowMean, peakMean);
}
public static double bijective(double x,double y){
double tmp = ( y + ((x+1)/2));
return x + ( tmp * tmp);
}
For testing (using all 18 values from the question):
public static void main(String[] args) {
double[] arrays = {0.07518284315321135,0.002987851573676068,0.002963866526639678,0.002526139418225552,0.07444872939213325,0.0037219653347541617,0.0036979802877177715,0.0017920256571474585,0.07499695903867931,0.003477831820276616,0.003477831820276616,0.002036159171625004,0.07383539747505984,0.004311312204791184,0.0043352972518275745,0.0011786937400740452,0.07353130134299131,0.004339580295941216};
System.out.println(mean(arrays));
}
You can use this the peak and low values to find the similar images.

You can simply sum the values using double precision; the resulting value will be unique most of the time. On the other hand, if the position of each value is relevant, you can apply a sum using the index as a multiplier.
The code could be as simple as:
public static double sum(double[] values) {
double val = 0.0;
for (double d : values) {
val += d;
}
return val;
}
public static double hash_w_order(double[] values) {
double val = 0.0;
for (int i = 0; i < values.length; i++) {
val += values[i] * (i + 1);
}
return val;
}
public static void main(String[] args) {
double[] myvals =
{ 0.07518284315321135, 0.002987851573676068, 0.002963866526639678, 0.002526139418225552, 0.07444872939213325, 0.0037219653347541617, 0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931, 0.003477831820276616,
0.003477831820276616, 0.002036159171625004, 0.07383539747505984, 0.004311312204791184, 0.0043352972518275745, 0.0011786937400740452, 0.07353130134299131, 0.004339580295941216 };
System.out.println("Computed value based on sum: " + sum(myvals));
System.out.println("Computed value based on values and its position: " + hash_w_order(myvals));
}
The output for that code, using your list of values is:
Computed value based on sum: 0.41284176550504803
Computed value based on values and its position: 3.7396448842464496

Well, here's a method that works for any number of doubles.
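import java.math.BigInteger;
public BigInteger uniqueID(double[] array) {
// 2^64: the number of distinct 64-bit patterns, used as the base.
final BigInteger twoToTheSixtyFour = BigInteger.ONE.shiftLeft(64);
BigInteger count = BigInteger.ZERO;
for (double d : array) {
long bitRepresentation = Double.doubleToRawLongBits(d);
// Read the bits as an unsigned digit in [0, 2^64); a signed long would be
// negative for negative doubles and break the positional encoding.
BigInteger digit = new BigInteger(Long.toUnsignedString(bitRepresentation));
count = count.multiply(twoToTheSixtyFour);
count = count.add(digit);
}
return count;
}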
Explanation
Each double is a 64-bit value, which means there are 2^64 different possible double values. Since a long is easier to work with for this sort of thing, and it's the same number of bits, we can get a 1-to-1 mapping from doubles to longs using Double.doubleToRawLongBits(double).
This is awesome, because now we can treat this like a simple combinations problem. You know how you know that 1234 is a unique number? There's no other number with the same value. This is because we can break it up by its digits like so:
1234 = 1 * 10^3 + 2 * 10^2 + 3 * 10^1 + 4 * 10^0
The powers of 10 would be "basis" elements of the base-10 numbering system, if you know linear algebra. In this way, base-10 numbers are like arrays consisting of only values from 0 to 9 inclusively.
If we want something similar for double arrays, we can discuss the base-(2^64) numbering system. Each double value would be a digit in a base-(2^64) representation of a value. If there are 18 digits, there are (2^64)^18 unique values for a double[] of length 18.
That number is gigantic, so we're going to need to represent it with a BigInteger data-structure instead of a primitive number. How big is that number?
(2^64)^18 = 61172327492847069472032393719205726809135813743440799050195397570919697796091958321786863938157971792315844506873509046544459008355036150650333616890210625686064472971480622053109783197015954399612052812141827922088117778074833698589048132156300022844899841969874763871624802603515651998113045708569927237462546233168834543264678118409417047146496
There are that many unique configurations of 18-length double arrays and this code lets you uniquely describe them.

I'm going to suggest three methods, with different pros and cons which I will outline.
Hash Code
This is the obvious "solution", though it has been correctly pointed out that it will not be unique. However, it will be very unlikely that any two arrays will have the same value.
Weighted Sum
Your elements appear to be bounded; perhaps they range from a minimum of 0 to a maximum of 1. If this is the case, you can multiply the first number by N^0, the second by N^1, the third by N^2 and so on, where N is some large number (ideally the inverse of your precision). This is easily implemented, particularly if you use a matrix package, and very fast. We can make this unique if we choose.
Euclidean Distance from Mean
Subtract the mean of your arrays from each array, square the results, sum the squares. If you have an expected mean, you can use that. Again, not unique, there will be collisions, but you (almost) can't avoid that.
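The third method can be sketched as follows. This is an illustration under assumptions, not tested against the asker's data: the MeanDistance class and its method names are invented for the example, and all arrays are assumed to have the same length.

```java
/** Sketch of the "squared Euclidean distance from the mean" value; names are illustrative. */
final class MeanDistance {
    /** Element-wise mean of equally sized feature arrays. */
    static double[] meanOf(double[][] arrays) {
        double[] mean = new double[arrays[0].length];
        for (double[] a : arrays) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += a[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= arrays.length;
        }
        return mean;
    }

    /** Subtracts the mean from the array, squares the differences, and sums them. */
    static double squaredDistance(double[] a, double[] mean) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - mean[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] data = { { 0.0, 0.0 }, { 2.0, 4.0 } };
        double[] mean = meanOf(data);
        // Both arrays are equally far from the mean, so they share a value: a collision.
        System.out.println(squaredDistance(data[0], mean));
        System.out.println(squaredDistance(data[1], mean));
    }
}
```

The two-array example in main shows the collision mentioned above: different arrays at the same distance from the mean get the same value.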
The difficulty of uniqueness
It has already been explained that hashing will not give you a unique solution. A unique number is possible in theory, using the Weighted Sum, but we have to use numbers of a very large size. Let's say your numbers are 64 bits in memory. That means that there are 2^64 possible numbers they can represent (slightly less using floating point). Eighteen such numbers in an array could represent 2^(64*18) different numbers. That's huge. If you use anything less, you will not be able to guarantee uniqueness due to the pigeonhole principle.
Let's look at a trivial example. If you have four letters, a, b, c and d, and you have to number them each uniquely using the numbers 1 to 3, you can't. That's the pigeonhole principle. You have 2^(18*64) possible numbers. You can't number them uniquely with less than 2^(18*64) numbers, and hashing doesn't give you that.
If you use BigDecimal, you can represent (almost) arbitrarily large numbers. If the largest element you can get is 1 and the smallest 0, then you can set N = 1/(precision) and apply the Weighted Sum mentioned above. This will guarantee uniqueness. The precision for doubles in Java is Double.MIN_VALUE. Note that the array of weights needs to be stored in BigDecimals!
That satisfies this part of your question:
create a computational value for each array, which is unique to it
based upon values inside it
However, there is a problem:
1 and 2 suck for K Means
I am assuming from your discussion with Marco 13 that you are performing the clustering on the single values, not the length 18 arrays. As Marco has already mentioned, Hashing sucks for K means. The whole idea is that the smallest change in the data will result in a large change in Hash Values. That means that two images which are similar, produce two very similar arrays, produce two very different "unique" numbers. Similarity is not preserved. The result will be pseudo random!!!
Weighted Sums are better, but still bad. It will basically ignore all the elements except for the last one, unless the last element is the same. Only then will it look at the next to last, and so on. Similarity is not really preserved.
Euclidean distance from the mean (or at least some point) will at least group things together in a sort of sensible way. Direction will be ignored, but at least things that are far from the mean won't be grouped with things that are close. Similarity of one feature is preserved, the other features are lost.
In summary
1 is very easy, but is not unique and doesn't preserve similarity.
2 is easy, can be unique and doesn't preserve similarity.
3 is easy, but is not unique and preserves some similarity.
Implementation of Weighted Sum. Not really tested.
public class Array2UniqueID {
private final double min;
private final double max;
private final double prec;
private final int length;
/**
* Used to provide a {@code BigDecimal} that is unique to the given array.
* <p>
* This uses weighted sum to guarantee that two IDs match if and only if
* every element of the array also matches. Similarity is not preserved.
*
* @param min smallest value an array element can possibly take
* @param max largest value an array element can possibly take
* @param prec smallest difference possible between two array elements
* @param length length of each array
*/
public Array2UniqueID(double min, double max, double prec, int length) {
this.min = min;
this.max = max;
this.prec = prec;
this.length = length;
}
/**
* A convenience constructor which assumes the array consists of doubles of
* full range.
* <p>
* This will result in very large IDs being returned.
*
* @see Array2UniqueID#Array2UniqueID(double, double, double, int)
* @param length length of each array
*/
public Array2UniqueID(int length) {
this(-Double.MAX_VALUE, Double.MAX_VALUE, Double.MIN_VALUE, length);
}
public BigDecimal createUniqueID(double[] array) {
// Validate the data
if (array.length != length) {
throw new IllegalArgumentException("Array length must be "
+ length + " but was " + array.length);
}
for (double d : array) {
if (d < min || d > max) {
throw new IllegalArgumentException("Each element of the array"
+ " must be in the range [" + min + ", " + max + "]");
}
}
double range = max - min;
/* maxNums is the maximum number of numbers that could possibly exist
* between max and min.
* The ID will be in the range 0 to maxNums^length.
* maxNums = range / prec + 1
* Stored as a BigDecimal for convenience, but is an integer
*/
BigDecimal maxNums = BigDecimal.valueOf(range)
.divide(BigDecimal.valueOf(prec))
.add(BigDecimal.ONE);
// For convenience
BigDecimal id = BigDecimal.valueOf(0);
// Weighted sum: element i contributes (array[i] / prec) * maxNums^i
for (int i = 0; i < array.length; i++) {
BigDecimal num = BigDecimal.valueOf(array[i])
.divide(BigDecimal.valueOf(prec))
.multiply(maxNums.pow(i)); // raise only the weight to the power i
id = id.add(num);
}
return id;
}
}

As I understand it, you are going to do k-means clustering based on the double values.
Why not just wrap each double value in an object, with array and position identifiers, so you know which cluster it ended up in?
Something like:
public class Element {
final public double value;
final public int array;
final public int position;
public Element(double value, int array, int position) {
this.value = value;
this.array = array;
this.position = position;
}
}
If you need to cluster each array as a whole:
You can transform the original arrays of length 18 into arrays of length 19, with the last or first element being a unique id that you ignore during clustering but can refer to after clustering has finished. This has a small memory footprint, 8 additional bytes per array, and gives an easy association with the original values.
If space is absolutely a problem, and all values of an array are less than 1, you can add a unique id greater than or equal to 1 to each array and cluster based on the remainder of division by 1: 0.07518284315321135 stays 0.07518284315321135 for the 1st, and 0.07518284315321135 becomes 1.07518284315321135 for the 2nd, although this increases the complexity of computation during clustering.

First of all, let's try to understand what you need mathematically:
Uniquely mapping an array of m real numbers to a single number is in fact a bijection between R^m and R, or at least N.
Since floating points are in fact rational numbers, your problem is to find a bijection between Q^m and N, which can be transformed into one from N^m to N, because you know your values will always be greater than 0 (just multiply your values by the precision).
Thus you need to map N^m to N. Take a look at the Cantor pairing function for some ideas.
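As a rough sketch of that idea, the Cantor pairing function maps two non-negative integers to one, and folding it over an array maps N^m to N. The CantorPairing class and the pairAll helper are illustrative names; the long arithmetic overflows quickly and unpair relies on Math.sqrt, so real 18-element inputs would need BigInteger instead.

```java
/** Sketch of the Cantor pairing function on small non-negative longs. */
final class CantorPairing {
    /** pair(x, y) = (x + y)(x + y + 1) / 2 + y */
    static long pair(long x, long y) {
        long s = x + y;
        return s * (s + 1) / 2 + y;
    }

    /** Inverse of pair; returns {x, y}. Only reliable while Math.sqrt stays exact enough. */
    static long[] unpair(long z) {
        long w = (long) Math.floor((Math.sqrt(8.0 * z + 1) - 1) / 2);
        long t = w * (w + 1) / 2;
        long y = z - t;
        return new long[] { w - y, y };
    }

    /** Folds pair over the values: pair(pair(v0, v1), v2) and so on. */
    static long pairAll(long[] values) {
        long acc = values[0];
        for (int i = 1; i < values.length; i++) {
            acc = pair(acc, values[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(pair(1, 2));
        System.out.println(pairAll(new long[] { 1, 2, 3 }));
    }
}
```

Like the other unique-ID schemes, this gives uniqueness but not similarity: nearby arrays do not get nearby numbers.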

A guaranteed way to generate a unique result based on the array is to convert it to one big string, and use that for your computational value.
It may be slow, but it will be unique based on the array's values.
Implementation examples:
Best way to convert an ArrayList to a string

How do I generate normal cumulative distribution in Java? its inverse cdf? How about lognormal?

I am brand new to Java, second day! I want to generate samples with a normal distribution. I am using inverse transformation.
Basically, I want to find the normal cumulative distribution, then find its inverse, and generate samples.
My question is: Is there a built-in function for the inverse normal cdf? Or do I have to hand code it?
I have seen people refer to this in Apache Commons. Is this built in? Or do I have to download it?
If I have to do it myself, can you give me some tips? If I download it, doesn't my prof also have to have the "package" or special file installed?
Thanks in advance!
Edit: Just found out I can't use libraries. I also heard there is a simpler way of converting to normal using radians.

As it is mentioned here:
Apache Commons - Math has what you are looking for.
More specifically, check out the NormalDistributionImpl class.
And no, your professor doesn't need to download anything if you provide him with all the needed libraries.
UPDATE :
If you want to hand code it (I don't know the actual formula), you can check the following link:
http://home.online.no/~pjacklam/notes/invnorm/
There are 2 people who implemented it in java: http://home.online.no/~pjacklam/notes/invnorm/#Java

I had the same problem and found a solution. The following code gives results for the cumulative distribution function just like Excel does:
private static double erf(double x)
{
//A&S formula 7.1.26
double a1 = 0.254829592;
double a2 = -0.284496736;
double a3 = 1.421413741;
double a4 = -1.453152027;
double a5 = 1.061405429;
double p = 0.3275911;
x = Math.abs(x);
double t = 1 / (1 + p * x);
//Direct calculation using formula 7.1.26 is absolutely correct
//But calculation of nth order polynomial takes O(n^2) operations
//return 1 - (a1 * t + a2 * t * t + a3 * t * t * t + a4 * t * t * t * t + a5 * t * t * t * t * t) * Math.exp(-1 * x * x);
//Horner's method, takes O(n) operations for nth order polynomial
return 1 - ((((((a5 * t + a4) * t) + a3) * t + a2) * t) + a1) * t * Math.exp(-1 * x * x);
}
public static double NORMSDIST(double z)
{
double sign = 1;
if (z < 0) sign = -1;
double result=0.5 * (1.0 + sign * erf(Math.abs(z)/Math.sqrt(2)));
return result;
}

Mathematically, this is a hard problem, and there are a few solutions you might consider.
Disclaimer: Mathematical jargon ahead.
As you probably already know, the normalcdf function is used to calculate probabilities of normal random variables. Because a normal distribution is continuous, the corresponding probability density function (normalpdf) does not itself give probabilities (in contrast to discrete distributions like the binomial or geometric distributions). Instead, the area under the curve gives the probability that the normal random variable falls within a range of values. So, the normalcdf function you seek is the area under a section of the normalpdf function.
Mathematically, finding the area under a continuous curve is a fundamental problem of calculus. The solution to this type of problem is called an integral and integrating a function over a range of numbers means finding the area under the curve and between the lowest value in the range to the highest.
In most circumstances, we could just integrate the pdf function to get the cdf function, then evaluate it wherever we want. The heart of the problem, and the reason that an algorithm in Java is not as simple as one might think, is that the normalpdf function does not have a closed-form integral: its antiderivative cannot be written as a finite combination of elementary functions. So, values of the normalcdf function are particularly elusive.
There are two main classes of solutions for the problem.
1. Numerical Integration Techniques
Numerical integration techniques solve the problem by approximating the area under the curve geometrically. The area is divided into rectangles or other shapes of equal or varying widths, with the height of each given by the pdf function. The sum of the areas of the rectangles is an approximation of the area under the curve, which is the corresponding probability. These techniques can be used to compute values to arbitrary precision, but they are more computationally expensive than class 2. Using better approximations (e.g. Simpson's rule) improves accuracy for the same number of function evaluations. A simple numeric integration method follows.
public static double normCDF(double z)
{ double LeftEndpoint = -100;
int nRectangles = 100000;
double runningSum = 0;
double x;
for(int n = 0; n < nRectangles; n++){
x = LeftEndpoint + n*(z-LeftEndpoint)/nRectangles;
runningSum += Math.pow(Math.sqrt(2*Math.PI),-1)*Math.exp(-Math.pow(x,2)/2)*(z-LeftEndpoint)/nRectangles;
}
System.out.println(runningSum);
return runningSum;
}
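For comparison with the rectangle method above, here is a minimal sketch of composite Simpson's rule applied to the same problem. It is an illustration rather than a tuned implementation: the class name, the choice of 1000 subintervals, and the trick of integrating from 0 to z and adding 0.5 (by symmetry of the pdf) are all assumptions of this example.

```java
// Hypothetical helper illustrating Simpson's rule for the normal CDF.
class SimpsonNormalCdf {

    /** Standard normal probability density function. */
    static double pdf(double x) {
        return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
    }

    /** Approximates the standard normal CDF at z with composite Simpson's rule. */
    static double cdf(double z) {
        int n = 1000;                 // number of subintervals, must be even
        double h = z / n;             // negative when z < 0, which flips the sign of the integral
        double sum = pdf(0) + pdf(z); // endpoints have weight 1
        for (int i = 1; i < n; i++) {
            // interior points alternate between weight 4 (odd i) and weight 2 (even i)
            sum += pdf(i * h) * (i % 2 == 1 ? 4 : 2);
        }
        // area from 0 to z, plus the area 0.5 lying to the left of 0
        return 0.5 + sum * h / 3;
    }

    public static void main(String[] args) {
        System.out.println(cdf(1.96));
    }
}
```

With a fixed number of subintervals the step grows with |z|, so accuracy drops for arguments far from zero.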
2. Analytic Techniques
Analytic techniques take advantage of the fact that while the normalpdf does not have a closed-form integral, the pdf can be "converted" to a sum called a Taylor series, then integrated term by term. Basically, it turns the pdf into a sum of infinitely many simple functions, then integrates each one analytically, then adds together all of the integrals. Since this is an analytic procedure, a programmer need only include the integral series in the program after computing the coefficients. The precision of the result just depends on how many terms of the sum you include in the calculation, and tends to approach accurate values much sooner than numerical integration techniques. For example, the solution by Mohammad Aldefrawy computes just five coefficients. Below is a method that includes the computation of coefficients, so one could compute values to arbitrary precision. (Actually, the normalcdf series isn't computed directly. Instead, the coefficients of the related error function are computed, then converted by a linear transformation.) However, since computation of the coefficients involves the factorial function, one runs into overflow issues for substantially large numbers of coefficients (a double can hold factorials only up to 170!). Thankfully, this method returns values with much higher precision in a fraction of the iterations required by methods in the previous class of solutions to yield similar results.
public static double normalCDF(double x){
System.out.println(0.5*(1+erf(x/Math.sqrt(2))));
return 0.5*(1+erf(x/Math.sqrt(2)));
}
public static double erf(double z)
{
int nTerms = 315;
double runningSum = 0;
for(int n = 0; n < nTerms; n++){
runningSum += Math.pow(-1,n)*Math.pow(z,2*n+1)/(factorial(n)*(2*n+1));
}
return (2/Math.sqrt(Math.PI))*runningSum;
}
static double factorial(int n){
if(n == 0) return 1;
if(n == 1) return 1;
return n*factorial(n-1);
}
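One way to sidestep the factorial in the series above is to derive each term from the previous one, since consecutive terms differ only by a factor of -z*z/n. The following is a minimal sketch of that variant, not the original author's method. The class name, the iteration cap of 500 and the stopping threshold are assumptions of this example.

```java
// Hypothetical variant of the erf series that avoids computing factorials.
class ErfSeries {

    /** Maclaurin series of erf, with each term derived from the previous one. */
    static double erf(double z) {
        double power = z; // holds (-1)^n * z^(2n+1) / n!, starting at n = 0
        double sum = z;   // n = 0 term: z / 1
        for (int n = 1; n < 500; n++) {
            power *= -z * z / n;               // step from n-1 to n
            double term = power / (2 * n + 1);
            sum += term;
            // stop once the term no longer changes the sum at double precision
            if (Math.abs(term) <= 1e-17 * Math.abs(sum)) {
                break;
            }
        }
        return 2 / Math.sqrt(Math.PI) * sum;
    }

    /** Standard normal CDF expressed through erf. */
    static double normalCdf(double x) {
        return 0.5 * (1 + erf(x / Math.sqrt(2)));
    }

    public static void main(String[] args) {
        System.out.println(normalCdf(1.96));
    }
}
```

Because the series alternates in sign, large intermediate terms cancel for big |z|, so this sketch is best kept to moderate arguments (roughly |z| below 5).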
Other functions
For the inverse function, since we used the error function in the normalCDF method, we can use the inverse error function in a similar way. Again, we obtain the coefficients of the inverse error function analytically, then compute them as needed in the method.
public static double invErf(double z)
{
int nTerms = 315;
double runningSum = 0;
double[] a = new double[nTerms + 1];
double[] c = new double[nTerms + 1];
c[0]=1;
for(int n = 1; n < nTerms; n++){
double runningSum2=0;
for (int k = 0; k <= n-1; k++){
runningSum2 += c[k]*c[n-1-k]/((k+1)*(2*k+1));
}
c[n] = runningSum2;
runningSum2 = 0;
}
for(int n = 0; n < nTerms; n++){
a[n] = c[n]/(2*n+1);
runningSum += a[n]*Math.pow((0.5)*Math.sqrt(Math.PI)*z,2*n+1);
}
return runningSum;
}
public static double invNorm(double A){
return (2/Math.sqrt(2))*invErf(2*A-1);
}
I don't have a method for the lognormal function, but you could obtain one using the same idea.
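To make the "same idea" concrete: a variable X is log-normal when ln(X) is normal with mean mu and standard deviation sigma, so its CDF is the standard normal CDF evaluated at (ln(x) - mu) / sigma, and its inverse CDF is exp(mu + sigma * invNorm(p)). The sketch below is a hedged illustration with made-up names. It takes the standard normal functions as parameters so that any implementation can be plugged in, and it also shows sampling with java.util.Random.nextGaussian(), which ships with the JDK (whether that counts as a "library" for the assignment is an open question).

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper; mu and sigma are the mean and standard deviation of ln(X).
class LogNormalSketch {

    /** Log-normal CDF, given some implementation of the standard normal CDF. */
    static double cdf(double x, double mu, double sigma, DoubleUnaryOperator stdNormalCdf) {
        if (x <= 0) {
            return 0.0; // a log-normal variable is always positive
        }
        return stdNormalCdf.applyAsDouble((Math.log(x) - mu) / sigma);
    }

    /** Log-normal inverse CDF, given some implementation of the standard normal inverse CDF. */
    static double inverseCdf(double p, double mu, double sigma, DoubleUnaryOperator stdNormalInverse) {
        return Math.exp(mu + sigma * stdNormalInverse.applyAsDouble(p));
    }

    /** Draws one log-normal sample by exponentiating a normal sample. */
    static double sample(Random rng, double mu, double sigma) {
        return Math.exp(mu + sigma * rng.nextGaussian());
    }
}
```

The normalCDF and invNorm methods shown earlier could be passed in as method references for the two function parameters.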

I never tried it but the guys from algo team were using Colt and they were happy with the results.

Manipulating and comparing floating points in java

In Java the floating point arithmetic is not represented precisely. For example this java code:
float a = 1.2f;
float b = 3.0f;
float c = a * b;
if (c == 3.6f) {
System.out.println("c is 3.6");
}
else {
System.out.println("c is not 3.6");
}
Prints "c is not 3.6".
I'm not interested in precision beyond 3 decimals (#.###). How can I deal with this problem to multiply floats and compare them reliably?

As a general rule, floating point numbers should never be compared with (a == b), but rather with (Math.abs(a - b) < delta), where delta is a small number.
A floating point value with a fixed number of digits in decimal form does not necessarily have a fixed number of digits in binary form.
Addition for clarity:
Though a strict == comparison of floating point numbers makes very little practical sense, strict < and > comparisons, on the contrary, are a valid use case (example: logic triggering when a certain value exceeds a threshold: (val > threshold) && panic();)

If you are interested in fixed precision numbers, you should be using a fixed precision type like BigDecimal, not an inherently approximate (though high precision) type like float. There are numerous similar questions on Stack Overflow that go into this in more detail, across many languages.

I think it has nothing to do with Java; it happens with any IEEE 754 floating point number. It is because of the nature of floating point representation. Any language that uses the IEEE 754 format will encounter the same problem.
As suggested by David above, you should use the abs method of the java.lang.Math class to get the absolute value (dropping the positive/negative sign).
You can read this: http://en.wikipedia.org/wiki/IEEE_754_revision and a good numerical methods textbook will also address the problem sufficiently.
public static void main(String[] args) {
float a = 1.2f;
float b = 3.0f;
float c = a * b;
final float PRECISION_LEVEL = 0.001f;
if(Math.abs(c - 3.6f) < PRECISION_LEVEL) {
System.out.println("c is 3.6");
} else {
System.out.println("c is not 3.6");
}
}

I’m using this bit of code in unit tests to compare if the outcome of 2 different calculations are the same, barring floating point math errors.
It works by looking at the binary representation of the floating point number. Most of the complication is due to the fact that the sign of floating point numbers is not two’s complement. After compensating for that it basically comes down to just a simple subtraction to get the difference in ULPs (explained in the comment below).
/**
* Compare two floating points for equality within a margin of error.
*
* This can be used to compensate for inequality caused by accumulated
* floating point math errors.
*
* The error margin is specified in ULPs (units of least precision).
* A one-ULP difference means there are no representable floats in between.
* E.g. 0f and 1.4e-45f are one ULP apart. So are -6.1340704f and -6.13407f.
* Depending on the number of calculations involved, typically a margin of
* 1-5 ULPs should be enough.
*
* @param expected The expected value.
* @param actual The actual value.
* @param maxUlps The maximum difference in ULPs.
* @return Whether they are equal or not.
*/
public static boolean compareFloatEquals(float expected, float actual, int maxUlps) {
int expectedBits = Float.floatToIntBits(expected) < 0 ? 0x80000000 - Float.floatToIntBits(expected) : Float.floatToIntBits(expected);
int actualBits = Float.floatToIntBits(actual) < 0 ? 0x80000000 - Float.floatToIntBits(actual) : Float.floatToIntBits(actual);
int difference = expectedBits > actualBits ? expectedBits - actualBits : actualBits - expectedBits;
return !Float.isNaN(expected) && !Float.isNaN(actual) && difference <= maxUlps;
}
Here is a version for double precision floats:
/**
* Compare two double precision floats for equality within a margin of error.
*
* @param expected The expected value.
* @param actual The actual value.
* @param maxUlps The maximum difference in ULPs.
* @return Whether they are equal or not.
* @see Utils#compareFloatEquals(float, float, int)
*/
public static boolean compareDoubleEquals(double expected, double actual, long maxUlps) {
long expectedBits = Double.doubleToLongBits(expected) < 0 ? 0x8000000000000000L - Double.doubleToLongBits(expected) : Double.doubleToLongBits(expected);
long actualBits = Double.doubleToLongBits(actual) < 0 ? 0x8000000000000000L - Double.doubleToLongBits(actual) : Double.doubleToLongBits(actual);
long difference = expectedBits > actualBits ? expectedBits - actualBits : actualBits - expectedBits;
return !Double.isNaN(expected) && !Double.isNaN(actual) && difference <= maxUlps;
}
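A different way to express a ULP-based tolerance, without looking at the bit patterns, is to scale the allowed difference by Math.ulp. The sketch below is an alternative technique, not the code above rewritten. The class and method names are made up, and it measures the ULP at the larger of the two magnitudes, so results can differ slightly from the bit-based version near powers of two.

```java
// Hypothetical alternative using Math.ulp instead of bit manipulation.
class UlpCompare {

    /** Returns true when a and b differ by at most maxUlps ULPs of the larger magnitude. */
    static boolean nearlyEqual(float a, float b, int maxUlps) {
        if (Float.isNaN(a) || Float.isNaN(b)) {
            return false;
        }
        if (a == b) {
            return true; // also covers 0f == -0f and equal infinities
        }
        if (Float.isInfinite(a) || Float.isInfinite(b)) {
            return false; // an infinity is never close to a different value
        }
        float scale = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= maxUlps * Math.ulp(scale);
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(1.2f * 3.0f, 3.6f, 1));
    }
}
```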

This is a weakness of all floating point representations, and it happens because some numbers that appear to have a fixed number of decimals in the decimal system, actually have an infinite number of decimals in the binary system. And so what you think is 1.2 is actually something like 1.199999999997 because when representing it in binary it has to chop off the decimals after a certain number, and you lose some precision. Then multiplying it by 3 actually gives 3.5999999...
http://docs.python.org/py3k/tutorial/floatingpoint.html <- this might explain it better (even if it's for python, it's a common problem of the floating point representation)

Like the others wrote:
Compare floats with: if (Math.abs(a - b) < delta)
You can write a nice method for doing this:
public static int compareFloats(float f1, float f2, float delta)
{
if (Math.abs(f1 - f2) < delta)
{
return 0;
} else
{
if (f1 < f2)
{
return -1;
} else {
return 1;
}
}
}
/**
* Uses <code>0.001f</code> for delta.
*/
public static int compareFloats(float f1, float f2)
{
return compareFloats(f1, f2, 0.001f);
}
So, you can use it like this:
if (compareFloats(a * b, 3.6f) == 0)
{
System.out.println("They are equal");
}
else
{
System.out.println("They aren't equal");
}

There is an Apache class for comparing doubles: org.apache.commons.math3.util.Precision
It contains some interesting constants: EPSILON, an upper bound on the relative rounding error of a double, and SAFE_MIN, the smallest positive normalized double.
It also provides the necessary methods to compare, test for equality, or round doubles.

Rounding is a bad idea. Use BigDecimal and set its precision as needed.
Like:
public static void main(String... args) {
float a = 1.2f;
float b = 3.0f;
float c = a * b;
BigDecimal a2 = BigDecimal.valueOf(a);
BigDecimal b2 = BigDecimal.valueOf(b);
BigDecimal c2 = a2.multiply(b2);
BigDecimal a3 = a2.setScale(2, RoundingMode.HALF_UP);
BigDecimal b3 = b2.setScale(2, RoundingMode.HALF_UP);
BigDecimal c3 = a3.multiply(b3);
BigDecimal c4 = a3.multiply(b3).setScale(2, RoundingMode.HALF_UP);
System.out.println(c); // 3.6000001
System.out.println(c2); // 3.60000014305114740
System.out.println(c3); // 3.6000
System.out.println(c == 3.6f); // false
System.out.println(Float.compare(c, 3.6f) == 0); // false
System.out.println(c2.compareTo(BigDecimal.valueOf(3.6f)) == 0); // false
System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f)) == 0); // false
System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f).setScale(2, RoundingMode.HALF_UP)) == 0); // true
System.out.println(c3.compareTo(BigDecimal.valueOf(3.6f).setScale(9, RoundingMode.HALF_UP)) == 0); // false
System.out.println(c4.compareTo(BigDecimal.valueOf(3.6f).setScale(2, RoundingMode.HALF_UP)) == 0); // true
}

To compare two floats, f1 and f2, to a precision of #.###, I believe you would need to do this:
((int) (f1 * 1000 + 0.5)) == ((int) (f2 * 1000 + 0.5))
f1 * 1000 lifts 3.14159265... to 3141.59265, adding 0.5 gives 3142.09265, and the (int) cast chops off the decimals, leaving 3142. That is, it keeps 3 decimals and rounds the last digit properly.
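Note that the (int) cast truncates toward zero, so adding 0.5 rounds correctly only for non-negative values. A minimal sketch of the same comparison using Math.round, which handles negative values as well, could look like this (the class and method names are made up for the example):

```java
// Hypothetical helper comparing two floats after rounding to 3 decimals.
class RoundedCompare {

    /** Returns true when f1 and f2 round to the same value at 3 decimal places. */
    static boolean equalTo3Decimals(float f1, float f2) {
        // multiplying by the double 1000.0 promotes to double; Math.round(double) returns a long
        return Math.round(f1 * 1000.0) == Math.round(f2 * 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(equalTo3Decimals(1.2f * 3.0f, 3.6f));
    }
}
```

As with any rounding-based comparison, two values that are very close but sit on opposite sides of a rounding boundary (for example 1.00049 and 1.00051) still compare as different.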
