Efficient implementation of mutual information in Java

I'm looking to calculate mutual information between two features, using Java.
I've read Calculating Mutual Information For Selecting a Training Set in Java already, but that was a discussion of whether mutual information was appropriate for the poster, with only some light pseudocode for the implementation.
My current code is below, but I'm hoping there is a way to optimise it, as I have large quantities of information to process. I'm aware that calling out to another language/framework may improve speed, but would like to focus on solving this in Java for now.
Any help much appreciated.
public static double calculateNewMutualInformation(double frequencyOfBoth, double frequencyOfLeft,
double frequencyOfRight, int noOfTransactions) {
if (frequencyOfBoth == 0 || frequencyOfLeft == 0 || frequencyOfRight == 0)
return 0;
// supp = f11
double supp = frequencyOfBoth / noOfTransactions; // P(x,y)
double suppLeft = frequencyOfLeft / noOfTransactions; // P(x)
double suppRight = frequencyOfRight / noOfTransactions; // P(y)
double f10 = (suppLeft - supp); // P(x) - P(x,y)
double f00 = (1 - suppRight) - f10; // 1 - P(x) - P(y) + P(x,y)
double f01 = (suppRight - supp); // P(y) - P(x,y)
// -1 * ((P(x) * log(Px)) + ((1 - P(x)) * log(1-p(x)))
double HX = -1 * ((suppLeft * MathUtils.logWithoutNaN(suppLeft)) + ((1 - suppLeft) * MathUtils.logWithoutNaN(1 - suppLeft)));
// -1 * ((P(y) * log(Py)) + ((1 - P(y)) * log(1-p(y)))
double HY = -1 * ((suppRight * MathUtils.logWithoutNaN(suppRight)) + ((1 - suppRight) * MathUtils.logWithoutNaN(1 - suppRight)));
double one = (supp * MathUtils.logWithoutNaN(supp)); // P(x,y) * log(P(x,y))
double two = (f10 * MathUtils.logWithoutNaN(f10));
double three = (f01 * MathUtils.logWithoutNaN(f01));
double four = (f00 * MathUtils.logWithoutNaN(f00));
double HXY = -1 * (one + two + three + four);
return (HX + HY - HXY) / (HX == 0 ? MathUtils.EPSILON : HX);
}
public class MathUtils {
public static final double EPSILON = 0.000001;
public static double logWithoutNaN(double value) {
if (value == 0) {
return Math.log(EPSILON);
} else if (value < 0) {
return 0;
}
return Math.log(value);
}
}

I have found the following to be fast, but I have not compared it against your method - only that provided in weka.
It works on the premise of re-arranging the MI equation so that it is possible to minimise the number of floating point operations:
We start by defining p(x) as a count (frequency) divided by the number of samples (transactions). Let n be the number of items, |x| the number of times x occurs, |y| the number of times y occurs, and |x,y| the number of times they co-occur. We then get:
$$I(X;Y) = \sum_{x}\sum_{y} \frac{|x,y|}{n} \log\left(\frac{|x,y|/n}{(|x|/n)(|y|/n)}\right)$$
Now we can rearrange that by flipping the bottom of the inner divide, which gives us (n|x,y|)/(|x||y|). Also, precompute N = 1/n so we have one less divide operation. This gives us:
$$I(X;Y) = \sum_{x}\sum_{y} N\,|x,y| \log\left(\frac{n\,|x,y|}{|x|\,|y|}\right)$$
This gives us the following code:
/**
* Computes MI between variables t and a. Assumes that a.length == t.length.
* @param a candidate variable a
* @param avals number of values a can take (values lie in 0..avals-1)
* @param t target variable
* @param tvals number of values t can take (values lie in 0..tvals-1)
* @return the mutual information between a and t
*/
static double computeMI(int[] a, int avals, int[] t, int tvals) {
double numinst = a.length;
double oneovernuminst = 1/numinst;
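double sum = 0;
// Assumption: log2 is not defined in the original snippet; 1/ln(2) converts natural logs to bits.
final double log2 = 1.0 / Math.log(2);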
// longs are required here because of big multiples in calculation
long[][] crosscounts = new long[avals][tvals];
long[] tcounts = new long[tvals];
long[] acounts = new long[avals];
// Compute counts for the two variables
for (int i=0;i<a.length;i++) {
int av = a[i];
int tv = t[i];
acounts[av]++;
tcounts[tv]++;
crosscounts[av][tv]++;
}
for (int tv=0;tv<tvals;tv++) {
for (int av=0;av<avals;av++) {
if (crosscounts[av][tv] != 0) {
// Main fraction: (n|x,y|)/(|x||y|)
double sumtmp = (numinst*crosscounts[av][tv])/(acounts[av]*tcounts[tv]);
// Log bit (|x,y|/n) and update product
sum += oneovernuminst*crosscounts[av][tv]*Math.log(sumtmp)*log2;
}
}
}
return sum;
}
This code assumes that the values of a and t are dense (i.e. min(t) = 0 and tvals = max(t) + 1) for it to be efficient. Otherwise, large and unnecessary arrays are created.
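As a quick sanity check of the count-based formula, here is a minimal self-contained sketch. The class and method names are hypothetical, and it assumes values are dense integers starting at 0 and that the result is wanted in bits. Identical balanced binary variables should give 1 bit, and independent ones 0.

```java
final class MutualInformationCheck {
    // Hypothetical helper: MI in bits from raw counts, values assumed to lie in 0..vals-1.
    static double mutualInformationBits(int[] a, int aVals, int[] t, int tVals) {
        double n = a.length;
        long[][] joint = new long[aVals][tVals];
        long[] aCounts = new long[aVals];
        long[] tCounts = new long[tVals];
        for (int i = 0; i < a.length; i++) {
            aCounts[a[i]]++;
            tCounts[t[i]]++;
            joint[a[i]][t[i]]++;
        }
        double sum = 0;
        for (int av = 0; av < aVals; av++) {
            for (int tv = 0; tv < tVals; tv++) {
                if (joint[av][tv] != 0) {
                    // (n|x,y|) / (|x||y|), with the denominator computed in double
                    double ratio = (n * joint[av][tv]) / ((double) aCounts[av] * tCounts[tv]);
                    sum += joint[av][tv] * Math.log(ratio);
                }
            }
        }
        // Divide once by n (the 1/n factor) and by ln(2) (natural log to bits).
        return sum / (n * Math.log(2));
    }
}
```

Pulling the 1/n and 1/ln(2) factors out of the loop is a small extra saving on top of the rearrangement described above.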
I believe this approach improves further when computing MI between several variables at once (the count operations can be condensed - especially that of the target). The implementation I use is one that interfaces with WEKA.
Finally, it might be even more efficient to take the log out of the summations, but I am unsure whether log or power takes more computation within the loop. This is done by:
Applying a*log(b) = log(b^a)
Moving the log outside the summations, using log(a)+log(b) = log(ab)
which gives:
$$I(X;Y) = N \log \prod_{x}\prod_{y} \left(\frac{n\,|x,y|}{|x|\,|y|}\right)^{|x,y|}$$

I am not a mathematician, but...
There are just a bunch of floating-point calculations here. Some mathemagician might be able to reduce this to fewer calculations; try Math SE.
Meanwhile, you should be able to use a static final double for Math.log(EPSILON).
Your problem might not be a single call but the volume of data for which this calculation has to be done. That problem is better solved by throwing more hardware at it.
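The constant suggestion could look like the following minimal sketch. The class name MathUtilsCached and the constant LOG_EPSILON are made up for illustration; the behaviour is meant to match the question's logWithoutNaN.

```java
final class MathUtilsCached {
    static final double EPSILON = 0.000001;
    // Computed once at class initialisation instead of on every call with value == 0.
    static final double LOG_EPSILON = Math.log(EPSILON);

    static double logWithoutNaN(double value) {
        if (value == 0) {
            return LOG_EPSILON;
        } else if (value < 0) {
            return 0;
        }
        return Math.log(value);
    }
}
```

This only saves work on the zero branch, so any gain depends on how often zero probabilities occur in the data.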

Related

Trapez Method in Java

I found a formula on the Internet for the trapezoid method. It works as it should, but I do not see why the following lines are needed in the trapz method:
sum = 0.5 * bef + (h * sum);
i = i + 2
The first iteration performed by the following command in main :
tra[0] = 0.5 * ((b - a) / n) * (function(a) + function(b));
//calculates the first step value
The trapz method for the next iterations:
/**
* Calculates the next step with the trapezoid method.
* @param a lower limit
* @param b upper limit
* @param bef previous step value
* @param n number of dividing points
* @return integral area
*/
public static double trapz(double a, double b,double bef, int n)
{
double sum = 0;
double h = ((b - a)/n);
for (int i = 1; i <= n; i = i + 2) {
sum += function(a + (i) * h);
}
sum = 0.5 * bef + (h * sum);
return sum;
}

The function would be used in conjunction with a driver loop that doubles the number of subintervals at each iteration, refining the estimated integral until the difference from one iteration to the next is less than some threshold criterion. It is desirable in such an endeavor to avoid repeating computations that have already been performed, and that's the point of the lines you asked about.
Consider the function values that are needed when applying the trapezoid rule on a given number of subintervals. Now consider the function values needed for splitting each subinterval in half and applying the trapezoid rule to those subintervals. Half (give or take 1) of the function values needed in the latter case are the same ones needed in the former. The code presented simply reuses the previously computed values (0.5 * bef), adding to them only the new values (i = i + 2). It must scale down the previous estimate by a factor of two to account for splitting the subintervals in two.
Note that for the code to be right, it appears that argument n must represent the number of subintervals of the integration region, not the number of dividing points as its documentation claims.
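A driver loop of the kind described might look like the sketch below. The class name, the tolerance parameter, and the cap on the number of subintervals are assumptions for illustration; n is treated as the number of subintervals after halving, as noted above.

```java
import java.util.function.DoubleUnaryOperator;

final class TrapezoidRefinement {
    // One refinement step: n is the number of subintervals AFTER halving.
    static double refine(DoubleUnaryOperator f, double a, double b, double prev, int n) {
        double h = (b - a) / n;
        double sum = 0;
        // Odd indices are the new midpoints; even ones were already used in prev.
        for (int i = 1; i < n; i += 2) {
            sum += f.applyAsDouble(a + i * h);
        }
        // Halve the old estimate (its step was 2h) and add the new points.
        return 0.5 * prev + h * sum;
    }

    static double integrate(DoubleUnaryOperator f, double a, double b, double tol) {
        // Single-interval trapezoid estimate, as in the question's first step with n = 1.
        double estimate = 0.5 * (b - a) * (f.applyAsDouble(a) + f.applyAsDouble(b));
        for (int n = 2; n <= (1 << 20); n *= 2) {
            double next = refine(f, a, b, estimate, n);
            if (Math.abs(next - estimate) < tol) {
                return next;
            }
            estimate = next;
        }
        return estimate; // Gave up refining; returns the last estimate.
    }
}
```

For example, integrate(x -> x * x, 0, 1, 1e-8) should land close to 1/3 while evaluating the function only at points it has not seen before.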

BigInteger: count the number of decimal digits in a scalable method

I need to count the number of decimal digits of a BigInteger. For example:
99 returns 2
1234 returns 4
9999 returns 4
12345678901234567890 returns 20
I need to do this for a BigInteger with 184948 decimal digits and more. How can I do this fast and scalable?
The convert-to-String approach is slow:
public String getWritableNumber(BigInteger number) {
// Takes over 30 seconds for 184948 decimal digits
return "10^" + (number.toString().length() - 1);
}
This loop-divide-by-ten approach is even slower:
public String getWritableNumber(BigInteger number) {
int digitSize = 0;
while (!number.equals(BigInteger.ZERO)) {
number = number.divide(BigInteger.TEN);
digitSize++;
}
return "10^" + (digitSize - 1);
}
Are there any faster methods?

Here's a fast method based on Dariusz's answer:
public static int getDigitCount(BigInteger number) {
double factor = Math.log(2) / Math.log(10);
int digitCount = (int) (factor * number.bitLength() + 1);
if (BigInteger.TEN.pow(digitCount - 1).compareTo(number) > 0) {
return digitCount - 1;
}
return digitCount;
}
The following code tests the numbers 1, 9, 10, 99, 100, 999, 1000, etc. all the way to ten-thousand digits:
public static void test() {
for (int i = 0; i < 10000; i++) {
BigInteger n = BigInteger.TEN.pow(i);
if (getDigitCount(n.subtract(BigInteger.ONE)) != i || getDigitCount(n) != i + 1) {
System.out.println("Failure: " + i);
}
}
System.out.println("Done");
}
This can check a BigInteger with 184,948 decimal digits and more in well under a second.

This looks like it is working. I haven't run exhaustive tests yet, nor have I run any timing tests, but it seems to have a reasonable run time.
import java.math.BigInteger;
import java.util.Random;
public class Test {
/**
* Optimised for huge numbers.
*
* http://en.wikipedia.org/wiki/Logarithm#Change_of_base
*
* States that log[b](x) = log[k](x)/log[k](b)
*
* We can get log[2](x) as the bitLength of the number so what we need is
* essentially bitLength/log[2](10). Sadly that will lead to inaccuracies so
* here I will attempt an iterative process that should achieve accuracy.
*
* log[2](10) = 3.32192809488736234787 so if I divide by 10^(bitLength/4) we
* should not go too far. In fact repeating that process while adding (bitLength/4)
* to the running count of the digits will end up with an accurate figure
* given some twiddling at the end.
*
* So here's the scheme:
*
* While there are more than 4 bits in the number
* Divide by 10^(bits/4)
* Increase digit count by (bits/4)
*
* Fiddle around to accommodate the remaining digit - if there is one.
*
* Essentially - each time around the loop we remove a number of decimal
* digits (by dividing by 10^n) keeping a count of how many we've removed.
*
* The number of digits we remove is estimated from the number of bits in the
* number (i.e. log[2](x) / 4). The perfect figure for the reduction would be
* log[2](x) / 3.3219... so dividing by 4 is a good under-estimate. We
* don't go too far but it does mean we have to repeat it just a few times.
*/
private int log10(BigInteger huge) {
int digits = 0;
int bits = huge.bitLength();
// Serious reductions.
while (bits > 4) {
// 4 > log[2](10) so we should not reduce it too far.
int reduce = bits / 4;
// Divide by 10^reduce
huge = huge.divide(BigInteger.TEN.pow(reduce));
// Removed that many decimal digits.
digits += reduce;
// Recalculate bitLength
bits = huge.bitLength();
}
// Now 4 bits or less - add 1 if necessary.
if ( huge.intValue() > 9 ) {
digits += 1;
}
return digits;
}
// Random tests.
Random rnd = new Random();
// Limit the bit length.
int maxBits = BigInteger.TEN.pow(200000).bitLength();
public void test() {
// 100 tests.
for (int i = 1; i <= 100; i++) {
BigInteger huge = new BigInteger((int)(Math.random() * maxBits), rnd);
// Note start time.
long start = System.currentTimeMillis();
// Do my method.
int myLength = log10(huge);
// Record my result.
System.out.println("Digits: " + myLength+ " Took: " + (System.currentTimeMillis() - start));
// Check the result.
int trueLength = huge.toString().length() - 1;
if (trueLength != myLength) {
System.out.println("WRONG!! " + (myLength - trueLength));
}
}
}
public static void main(String args[]) {
new Test().test();
}
}
Took about 3 seconds on my Celeron M laptop so it should hit sub 2 seconds on some decent kit.

I think that you could use bitLength() to get a log2 value, then change the base to 10.
The result may be off by one digit, however, so this is just an approximation.
If that's acceptable, you could add 1 to the result to get an upper bound, or subtract 1 to get a lower bound.
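The bitLength() idea can be sketched as a pair of bounds. The class and method names are hypothetical, and the sketch assumes a positive input and that double precision is adequate for the bit lengths involved. Since 2^(b-1) <= n < 2^b for a number with bit length b, the digit count lies between the two values below, which differ by at most one.

```java
import java.math.BigInteger;

final class DigitBounds {
    private static final double LOG10_2 = Math.log10(2);

    // Lower bound on the number of decimal digits of a positive n.
    static int lowerBound(BigInteger n) {
        return (int) Math.floor((n.bitLength() - 1) * LOG10_2) + 1;
    }

    // Upper bound on the number of decimal digits of a positive n.
    static int upperBound(BigInteger n) {
        return (int) Math.floor(n.bitLength() * LOG10_2) + 1;
    }
}
```

When the two bounds agree the digit count is known exactly; when they differ, a single comparison against a power of ten can settle it.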

You can first convert the BigInteger to a BigDecimal and then use this answer to compute the number of digits. This seems more efficient than using BigInteger.toString() as that would allocate memory for String representation.
private static int numberOfDigits(BigInteger value) {
return significantDigits(new BigDecimal(value));
}
private static int significantDigits(BigDecimal value) {
return value.scale() < 0
? value.precision() - value.scale()
: value.precision();
}

This is another way to do it that is faster than the convert-to-String method. It is not the best run time, but still a reasonable 0.65 seconds versus 2.46 seconds with the convert-to-String method (at 180000 digits).
This method computes the integer part of the base-10 logarithm from the given value. However, instead of using loop-divide, it uses a technique similar to Exponentiation by Squaring.
Here is a crude implementation that achieves the runtime mentioned earlier:
public static BigInteger log(BigInteger base,BigInteger num)
{
/* The technique tries to get the products among the squares of base
* close to the actual value as much as possible without exceeding it.
* */
BigInteger resultSet = BigInteger.ZERO;
BigInteger actMult = BigInteger.ONE;
BigInteger lastMult = BigInteger.ONE;
BigInteger actor = base;
BigInteger incrementor = BigInteger.ONE;
while(actMult.multiply(base).compareTo(num)<1)
{
int count = 0;
while(actMult.multiply(actor).compareTo(num)<1)
{
lastMult = actor; //Keep the old squares
actor = actor.multiply(actor); //Square the base repeatedly until the value exceeds
if(count>0) incrementor = incrementor.multiply(BigInteger.valueOf(2));
//Update the current exponent of the base
count++;
}
if(count == 0) break;
/* If there is no way to multiply the "actMult"
* with squares of the base (including the base itself)
* without keeping it below the actual value,
* it is the end of the computation
*/
actMult = actMult.multiply(lastMult);
resultSet = resultSet.add(incrementor);
/* Update the product and the exponent
* */
actor = base;
incrementor = BigInteger.ONE;
//Reset the values for another iteration
}
return resultSet;
}
public static int digits(BigInteger num)
{
if(num.equals(BigInteger.ZERO)) return 1;
if(num.compareTo(BigInteger.ZERO)<0) num = num.multiply(BigInteger.valueOf(-1));
return log(BigInteger.valueOf(10),num).intValue()+1;
}
Hope this helps.

generating random integers between 0 and some value where half are in the set (0,5] and the other half (5,x]

I am looking for a way to generate a random integer from 0-x, where x is defined at runtime by the human user. However, half of those numbers must be greater than zero and less than or equal to 5 (0,5] and the other half must be in the set of [6,x].
I know that the following code will generate a number from 0-x. The main problem is ensuring that half of them will be in the set of (0,5]
Math.random() * x;
I'm not looking for someone to do this for me, just looking for some hints. Thank you!

You could first flip a coin and based on that generate upper or lower number:
final Random rnd = new Random();
while (true)
System.out.println(rnd.nextBoolean() ? 1 + rnd.nextInt(5) : 6 + rnd.nextInt(x-5));
Or, using the unwieldy Math.random() (bound to have trouble at the edges of the range):
while (true)
System.out.println(Math.floor(
Math.random() < 0.5 ? (1 + Math.random() * 5) : (6 + (x-5) * Math.random())
));
Consider this as a hint only :)
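For anyone who wants to check the coin-flip idea, here is a minimal sketch. The class and method names are invented, and it assumes x is at least 6 and that the lower half should cover the integers 1 to 5.

```java
import java.util.Random;

final class HalfAndHalf {
    // Returns 1..5 with probability 1/2, otherwise 6..x (both ranges uniform).
    static int sample(Random rnd, int x) {
        if (x < 6) {
            throw new IllegalArgumentException("x must be at least 6");
        }
        return rnd.nextBoolean()
                ? 1 + rnd.nextInt(5)       // 1, 2, 3, 4, 5
                : 6 + rnd.nextInt(x - 5);  // 6, 7, ..., x
    }
}
```

Counting how many of a few thousand samples fall at or below 5 should give roughly half, whatever the value of x.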

I'd do this:
double halfX= x / 2.0;
double random = Math.random() * x;
if( random< halfX ) {
random = random*5.0/(halfX);
} else {
random = (random/halfX - 1) * (x-5.0) + 5.0 ;
}
I think it is good now. This is less understandable and readable, but has only one call to random for each invocation. Apart from the fact MarkoTopolnic pointed out: the user needed an integer... I'd have to calculate what rounding would do to the distribution.
This is absolutely not easy... My head aches, so the best I can come up with:
double halfX= x / 2.0 + 1.0;
double random = Math.random() * (x+2.0);
int randomInt;
if( random< halfX ) {
randomInt = (int) (random*6.0/(halfX)); //truncating, means equal distribution from 0-5
} else {
randomInt = (int) ((random/halfX - 1.0) * (x-5.0) + 6.0) ; //notice x-5.0, this range before truncation is actually from 6.0 to x+1.0, after truncating it gets to [6;x], as this is integer
}
I'm not sure about the second part, though... A few hours of sleep would get it right... I hope the intentions and logic are clear, though...

In case anyone is curious, here's the solution I came up with based on Marko's solution.
I had the following class defined for another part of this program.
public class BooleanSource
{
private double probability;
BooleanSource(double p) throws IllegalArgumentException
{
if(p < 0.0)
throw new IllegalArgumentException("Probability too small");
if(p > 1.0)
throw new IllegalArgumentException("Probability too large");
probability = p;
}
public boolean occurs()
{
return (Math.random() < probability);
}
}
With that, I did the following
private static void setNumItems(Customer c, int maxItems)
{
BooleanSource numProb = new BooleanSource(0.5);
int numItems;
if(numProb.occurs())
{
double num = (Math.random()*4)+1;
numItems = (int) Math.round(num);
}
else
{
double num = 5 + (maxItems-5)*Math.random();
numItems = (int) Math.round(num);
}
c.setNumItems(numItems);
}

Newton's method with specified digits of precision

I'm trying to write a function in Java that calculates the n-th root of a number. I'm using Newton's method for this. However, the user should be able to specify how many digits of precision they want. This is the part with which I'm having trouble, as my answer is often not entirely correct. The relevant code is here: http://pastebin.com/d3rdpLW8. How could I fix this code so that it always gives the answer to at least p digits of precision? (without doing more work than is necessary)
import java.util.Random;
public final class Compute {
private Compute() {
}
public static void main(String[] args) {
Random rand = new Random(1230);
for (int i = 0; i < 500000; i++) {
double k = rand.nextDouble()/100;
int n = (int)(rand.nextDouble() * 20) + 1;
int p = (int)(rand.nextDouble() * 10) + 1;
double math = n == 0 ? 1d : Math.pow(k, 1d / n);
double compute = Compute.root(n, k, p);
if(!String.format("%."+p+"f", math).equals(String.format("%."+p+"f", compute))) {
System.out.println(String.format("%."+p+"f", math));
System.out.println(String.format("%."+p+"f", compute));
System.out.println(math + " " + compute + " " + p);
}
}
}
/**
* Returns the n-th root of a positive double k, accurate to p decimal
* digits.
*
* @param n
* the degree of the root.
* @param k
* the number to be rooted.
* @param p
* the decimal digit precision.
* @return the n-th root of k
*/
public static double root(int n, double k, int p) {
double epsilon = pow(0.1, p+2);
double approx = estimate_root(n, k);
double approx_prev;
do {
approx_prev = approx;
// f(x) / f'(x) = (x^n - k) / (n * x^(n-1)) = (x - k/x^(n-1)) / n
approx -= (approx - k / pow(approx, n-1)) / n;
} while (abs(approx - approx_prev) > epsilon);
return approx;
}
private static double pow(double x, int y) {
if (y == 0)
return 1d;
if (y == 1)
return x;
double k = pow(x * x, y >> 1);
return (y & 1) == 0 ? k : k * x;
}
private static double abs(double x) {
return Double.longBitsToDouble((Double.doubleToLongBits(x) << 1) >>> 1);
}
private static double estimate_root(int n, double k) {
// Extract the exponent from k.
long exp = (Double.doubleToLongBits(k) & 0x7ff0000000000000L);
// Format the exponent properly.
int D = (int) ((exp >> 52) - 1023);
// Calculate and return 2^(D/n).
return Double.longBitsToDouble((D / n + 1023L) << 52);
}
}

Just iterate until the update is less than say, 0.0001, if you want a precision of 4 decimals.
That is, set your epsilon to Math.pow(10, -n) if you want n digits of precision.

Let's recall what the error analysis of Newton's method says. Basically, it gives us an error for the nth iteration as a function of the error of the n-1 th iteration.
So, how can we tell if the error is less than k? We can't, unless we know the error at e(0). And if we knew the error at e(0), we would just use that to find the correct answer.
What you can do is say "e(0) <= m". You can then find n such that e(n) <= k for your desired k. However, this requires knowing the maximal value of f'' in your radius, which is (in general) just as hard a problem as finding the x intercept.
What you're checking is if the error changes by less than k, which is a perfectly acceptable way to do it. But it's not checking if the error is less than k. As Axel and others have noted, there are many other root-approximation algorithms, some of which will yield easier error analysis, and if you really want this, you should use one of those.
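For the specific case of n-th roots there is a simple way to get a real error bound, sketched below. This is a swapped-in stopping test, not the asker's code, and the class name and starting guess are assumptions. If the current iterate x is at or above the true root r, then k / x^(n-1) is at or below r, so the two values bracket the root and their gap bounds the error.

```java
final class BracketedRoot {
    // n-th root of k > 0, stopping when the bracket [k / x^(n-1), x] is narrower than eps.
    static double root(int n, double k, double eps) {
        double x = Math.max(k, 1.0); // Starts at or above the root for any k > 0.
        for (int iter = 0; iter < 10_000; iter++) {
            double lower = k / Math.pow(x, n - 1); // lower <= root <= x while x >= root
            double gap = x - lower;
            if (gap <= eps) {
                return x;
            }
            x -= gap / n; // Same Newton step as in the question: (x - k/x^(n-1)) / n
        }
        return x; // Iteration cap reached; the bound is not guaranteed here.
    }
}
```

Because Newton iterates for this function stay at or above the root when started above it, the gap is an actual bound on the error rather than just the size of the last update. It still cannot promise more digits than a double can hold.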

You have a bug in your code. Your pow() method's last line should read
return (y & 1) == 1 ? k : k * x;
rather than
return (y & 1) == 0 ? k : k * x;
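A quick way to settle which variant is right is to run both on small inputs. In the minimal sketch below (class and method names are invented for the comparison), the `== 0` form returns 1024 for 2^10, while the `== 1` form returns 8 for 2^2, which suggests the question's original line is the correct one.

```java
final class PowCheck {
    // Variant from the question: an even exponent returns k, an odd one multiplies by x.
    static double powEvenReturnsK(double x, int y) {
        if (y == 0) return 1d;
        if (y == 1) return x;
        double k = powEvenReturnsK(x * x, y >> 1);
        return (y & 1) == 0 ? k : k * x;
    }

    // Variant proposed in the answer above: the test is flipped.
    static double powOddReturnsK(double x, int y) {
        if (y == 0) return 1d;
        if (y == 1) return x;
        double k = powOddReturnsK(x * x, y >> 1);
        return (y & 1) == 1 ? k : k * x;
    }
}
```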

Java: Implementing simple equation

I am looking to implement the simple equation:
i,j = (-Q ± √(Q^2 - 4PR)) / 2P
To do so I have the following code (note: P = 10. Q = 7. R = 10):
//Q*Q – 4PR = -351 mod 11 = -10 mod 11 = 1, √1 = 1
double test = Math.sqrt(modulo(((Q*Q) - ((4*P)*R))));
// Works, but why *-10 needed?
i = (int)(((-Q+test)/(P*2))*-10); // i = 3
j = (int)(((-Q-test)/(P*2))*-10); // j = 4
To put it simply, test takes the first part of the equation and mods it to a non-zero integer in-between 0 and 11, then i and j are written. i and j return the right number, but for some reason *-10 is needed to get them right (a number I guessed to get the correct values).
If possible, I'd like to find a better way of performing the above equation because my way of doing it seems wrong and just works. I'd like to do it as the equation suggests, rather than hack it to work.

The quadratic equation is more usually expressed in terms of a, b and c. To satisfy ax^2 + bx + c = 0, you get (-b ± sqrt(b^2 - 4ac)) / (2a) as answers.
I think your basic problem is that you're using modulo for some reason instead of taking the square root. The factor of -10 is just a fudge factor which happens to work for your test case.
You should have something like this:
public static void findRoots(double a, double b, double c)
{
if (b * b < 4 * a * c)
{
throw new IllegalArgumentException("Equation has no roots");
}
double tmp = Math.sqrt(b * b - 4 * a * c);
double firstRoot = (-b + tmp) / (2 * a);
double secondRoot = (-b - tmp) / (2 * a);
System.out.println("Roots: " + firstRoot + ", " + secondRoot);
}
EDIT: Your modulo method is currently going to recurse pretty chronically. Try this instead:
public static int modulo(int x)
{
return ((x % 11) + 11) % 11;
}
Basically the result of the first % 11 will be in the range [-10, 10] - so after adding another 11 and taking % 11 again, it'll be correct. No need to recurse.
At that point there's not much reason to have it as a separate method, so you can use:
public static void findRoots(double a, double b, double c)
{
double squareMod11 = (((b * b - 4 * a * c) % 11) + 11) % 11;
double tmp = Math.sqrt(squareMod11);
double firstRoot = (-b + tmp) / (2 * a);
double secondRoot = (-b - tmp) / (2 * a);
System.out.println("Roots: " + firstRoot + ", " + secondRoot);
}

You need to take the square root. Note that Q^2-4PR yields a negative number, and consequently you're going to have to handle complex numbers (or restrict input to avoid this scenario). Apache Math may help you here.

Use Math.sqrt for the square root. Why do you cast i and j to ints? This equation gives you the roots of a quadratic function, so i and j can be any complex numbers. You should limit the discriminant to non-negative values for real (double) roots; otherwise use complex numbers.
double test = Q*Q - 4*P*R;
if (test < 0) throw new Exception("negative discriminant!");
else {
test = Math.sqrt(test);
double i = (-Q + test) / (2*P);
double j = (-Q - test) / (2*P);
}

Why are you doing modulo and not square root? Your code seems to be the way to get the roots of a quadratic equation ((-b±sqrt(b^2-4ac))/2a), so the code should be:
double delta = Q*Q - 4*P*R;
if (delta < 0.0) {
throw new Exception("no roots");
}
double d = Math.sqrt(delta);
double r1 = (-Q + d)/(2*P);
double r2 = (-Q - d)/(2*P);

As pointed out by others, your use of mod isn't even wrong. Why are you making up mathematics like this?
It's well known that the naive solution to the quadratic equation can have problems if the value of b is very nearly equal to the square root of the discriminant.
A better way to do it is suggested in section 5.6 of "Numerical Recipes in C++": if we define
$$q = -\frac{1}{2}\left[b + \operatorname{sgn}(b)\sqrt{b^2 - 4ac}\right]$$
Then the two roots are:
$$x_1 = \frac{q}{a}$$
and
$$x_2 = \frac{c}{q}$$
Your code also needs to account for pathological cases (e.g., a = 0).
Let's substitute your values into these formulas and see what we get. If a = 10, b = 7, and c = 10, then:
$$q = -\frac{1}{2}\left[7 + \sqrt{49 - 400}\right] \approx -3.5 - 9.3675i$$
Then the two roots are:
$$x_1 = \frac{q}{10} \approx -0.35 - 0.93675i$$
and
$$x_2 = \frac{10}{q} \approx -0.35 + 0.93675i$$
I think I have the signs right.
If your calculation is giving you trouble, it's likely due to the fact that you have complex roots that your method can't take into account properly. You'll need a complex number class.
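For the real-root case, the Numerical Recipes formulation can be sketched as below. The class and method names are made up, and the sketch deliberately rejects a = 0 and negative discriminants instead of handling complex roots.

```java
final class StableQuadratic {
    // Real roots of a*x^2 + b*x + c = 0, assuming a != 0 and b^2 - 4ac >= 0.
    static double[] realRoots(double a, double b, double c) {
        if (a == 0) {
            throw new IllegalArgumentException("Not a quadratic: a == 0");
        }
        double disc = b * b - 4 * a * c;
        if (disc < 0) {
            throw new IllegalArgumentException("Complex roots");
        }
        // b and the signed square root have the same sign, so nothing cancels here.
        double q = -0.5 * (b + Math.copySign(Math.sqrt(disc), b));
        if (q == 0) {
            return new double[] {0.0, 0.0}; // Only happens when b == 0 and c == 0.
        }
        return new double[] {q / a, c / q};
    }
}
```

With a = 1, b = 1e8, c = 1 the naive formula loses most of the digits of the small root to cancellation, whereas c / q keeps it close to -1e-8.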
