Compute weighted averages for large numbers - java

I'm trying to get the weighted average of a few numbers. Basically I have:
Price - 134.42
Quantity - 15236545
There can be as few as one or two or as many as fifty or sixty pairs of prices and quantities. I need to figure out the weighted average of the price. Basically, the weighted average should give very little weight to pairs like
Price - 100000000.00
Quantity - 3
and more to the pair above.
The formula I currently have is:
(price1 * quantity1 + price2 * quantity2 + ...) / totalQuantity
So far I have this done:
double optimalPrice = 0;
int totalQuantity = 0;
double rolling = 0;
System.out.println(rolling);

Iterator it = orders.entrySet().iterator();
while (it.hasNext()) {
    System.out.println("inside");
    Map.Entry order = (Map.Entry) it.next();
    double price = (Double) order.getKey();
    int quantity = (Integer) order.getValue();
    System.out.println(price + " " + quantity);
    rolling += price * quantity;
    totalQuantity += quantity;
    System.out.println(rolling);
}
System.out.println(rolling);
return rolling / totalQuantity;
The problem is I very quickly max out the "rolling" variable.
How can I actually get my weighted average?

A double can hold a pretty large number (about 1.7 x 10^308, according to the docs), but you probably shouldn't use it for values where exact precision is required (such as monetary values).
Check out the BigDecimal class instead. This question on SO talks about it in more detail.
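For illustration, a minimal BigDecimal version of the loop from the question, assuming orders is a Map<Double, Integer> of price to quantity (the exact type isn't shown in the question):

import java.math.BigDecimal;
import java.math.MathContext;
import java.util.Map;

// Weighted average with BigDecimal; assumes orders maps price -> quantity.
static double weightedAverage(Map<Double, Integer> orders) {
    BigDecimal rolling = BigDecimal.ZERO;
    BigDecimal totalQuantity = BigDecimal.ZERO;
    for (Map.Entry<Double, Integer> order : orders.entrySet()) {
        BigDecimal price = BigDecimal.valueOf(order.getKey());
        BigDecimal quantity = BigDecimal.valueOf(order.getValue());
        rolling = rolling.add(price.multiply(quantity));
        totalQuantity = totalQuantity.add(quantity);
    }
    // DECIMAL64 keeps divide() from throwing on non-terminating decimal expansions
    return rolling.divide(totalQuantity, MathContext.DECIMAL64).doubleValue();
}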

One solution is to use java.math.BigInteger for both rolling and totalQuantity, and only divide them at the end. This has better numerical stability, as you only have a single floating-point division at the end and everything else is integer operations.
BigInteger is basically unbounded so you shouldn't run into any overflows.
EDIT: Sorry, only upon re-reading did I notice your price is a double anyway. Maybe it's worth circumventing this by multiplying it by 100 and then converting to BigInteger - since I see in your example it has precisely 2 digits right of the decimal point - and then dividing by 100 at the end, although it's a bit of a hack.
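A rough sketch of that cents-scaling hack, assuming every price really has at most two digits after the decimal point:

import java.math.BigInteger;
import java.util.Map;

// Treat prices as integer cents, accumulate exactly in BigInteger,
// and do a single floating-point division at the very end.
static double weightedAverageCents(Map<Double, Integer> orders) {
    BigInteger rollingCents = BigInteger.ZERO;
    BigInteger totalQuantity = BigInteger.ZERO;
    for (Map.Entry<Double, Integer> order : orders.entrySet()) {
        // assumes the price has at most 2 digits after the decimal point
        BigInteger priceCents = BigInteger.valueOf(Math.round(order.getKey() * 100));
        BigInteger quantity = BigInteger.valueOf(order.getValue());
        rollingCents = rollingCents.add(priceCents.multiply(quantity));
        totalQuantity = totalQuantity.add(quantity);
    }
    // undo the x100 scaling after the division
    return rollingCents.doubleValue() / totalQuantity.doubleValue() / 100.0;
}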

For maximum flexibility, use BigDecimal for rolling, and BigInteger for totalQuantity. After dividing (note: the division should be rolling / totalQuantity, not the other way around), you can either return a BigDecimal, or use doubleValue at a loss of precision.

At any given point, you have recorded both the total value ax + by + cz + ... = pq and the total weight a + b + c + ... = p. Knowing both then gives you the average value pq/p = q. The problem is that pq and p are large sums that overflow, even though you just want the moderately sized q.
The next step adds, for example, a weight of r and a value s. You want to find the new sum (pq + rs) / (p + r) by using only the value of q, which can only happen if p and pq somehow "annihilate" by being in the numerator and denominator of the same fraction. That's impossible, as I'll show.
The value that you need to add in this iteration is, naturally,
(pq + rs) / (p + r) - q
which can't be simplified to a point where pq and p disappear. You can also find
(pq + rs) / q(p + r)
the factor by which you'd multiply q in order to get the next average; but again, pq and p remain. So there's no clever solution.
Others have mentioned arbitrary-precision variables, and that's a good solution here. The values of p and pq grow linearly with the number of entries, and the memory usage and calculation speed of arbitrary-precision integers/floats grow logarithmically with the size of the values. So the cost is O(log(n)), unlike the disaster it would be if p were somehow the product of many numbers.

First, I don't see how you could be "maxing out" the rolling variable. As @Ash points out, it can represent values up to about 1.7 x 10^308. The only possibility I can think of is that you have some bad values in your input. (Perhaps the real problem is that you are losing precision ...)
Second, your use of a Map to represent orders is strange and probably broken. The way you are currently using it, you cannot represent orders involving two or more items with the same price.

Your final result is just a weighted average of prices, so presumably you don't need to follow the rules used when calculating account balances etc. If that is correct, then you don't need to use BigDecimal; double will suffice.
The problem of overflow can be solved by storing a "running average" and updating it with each new entry. Namely, let
a_n = (sum_{i=1}^n x_i * w_i) / (sum_{i=1}^n w_i)
for n = 1, ..., N. You start with a_1 = x_1 and then add
d_n := a_{n+1} - a_n
to it. The formula for d_n is
d_n = w_{n+1} * (x_{n+1} - a_n) / W_{n+1}
where W_n := sum_{i=1}^n w_i. You need to keep track of W_n, but this problem can be solved by storing it as a double (it will be OK, as we're only interested in the average). You can also normalize the weights: if you know that all your weights are multiples of 1000, just divide them by 1000.
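A minimal Java sketch of this running update (variable names are mine; x holds the values, w the weights):

// Running weighted average: a is the current average, W the total weight seen so far.
// Each step applies d_n = w_{n+1} * (x_{n+1} - a_n) / W_{n+1} from above.
static double runningWeightedAverage(double[] x, double[] w) {
    double a = x[0];
    double W = w[0];
    for (int n = 1; n < x.length; n++) {
        W += w[n];
        a += w[n] * (x[n] - a) / W;
    }
    return a;
}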
To get additional accuracy, you can use compensated summation.
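For reference, a generic Kahan (compensated) summation sketch; it is not specific to this answer, but could be applied when accumulating the d_n increments or the weights:

// Kahan (compensated) summation: c carries the low-order bits lost in each add.
static double compensatedSum(double[] values) {
    double sum = 0.0;
    double c = 0.0;              // running compensation
    for (double value : values) {
        double y = value - c;
        double t = sum + y;
        c = (t - sum) - y;       // algebraically zero; numerically, the lost low-order part
        sum = t;
    }
    return sum;
}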
Preemptive explanation: it is OK to use floating-point arithmetic here. double has a relative precision of 2E-16. The OP is averaging positive numbers, so there will be no cancellation error. What the proponents of arbitrary-precision arithmetic don't tell you is that, leaving aside rounding rules, in the cases where it does give you lots of additional precision over IEEE 754 floating-point arithmetic, this comes at a significant memory and performance cost. Floating-point arithmetic was designed by very smart people (Prof. Kahan, among others), and if there were a way of cheaply increasing arithmetic precision over what floating point offers, they'd have done it.
Disclaimer: if your weights are completely crazy (one is 1, another is 10000000), then I am not 100% sure if you will get satisfying accuracy, but you can test it on some example when you know what the answer should be.

Do two loops: compute totalQuantity first in the first loop. Then in the second loop accumulate price * (quantity / totalQuantity), making sure the division is done in floating point (see the sketch below).
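A minimal sketch of this, assuming the same orders map (price -> quantity) as in the question; note the cast to double so quantity / totalQuantity is not truncated by integer division:

// First pass: total quantity. Second pass: accumulate each price's weighted share.
long totalQuantity = 0;
for (Map.Entry<Double, Integer> order : orders.entrySet()) {
    totalQuantity += order.getValue();
}

double weightedAverage = 0;
for (Map.Entry<Double, Integer> order : orders.entrySet()) {
    weightedAverage += order.getKey() * ((double) order.getValue() / totalQuantity);
}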

Related

Smart algorithm to randomize a Double in range but with odds

I use the following function to generate a random double in a specific range:
nextDouble(1.50, 7.00)
However, I've been trying to come up with an algorithm to make the randomization have a higher probability of generating a double that is close to 1.50 than to 7.00. Yet I don't even know where to start. Does anything come to mind?
Java is also welcome.
You should start by discovering what probability distribution you need. Based on your requirements, and assuming that random number generations are independent, perhaps Poisson distribution is what you are looking for:
a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The number of calls received during any minute has a Poisson probability distribution with mean 3: the most likely numbers are 2 and 3 but 1 and 4 are also likely and there is a small probability of it being as low as zero and a very small probability it could be 10.
The usual probability distributions are already implemented in libraries e.g. org.apache.commons.math3.distribution.PoissonDistribution in Apache Commons Math3.
I suggest not thinking about this problem in terms of generating a random number with irregular probability. Instead, think about generating a random number normally in some range, but then mapping this range into another one in a non-linear way.
Let's split our algorithm into 3 steps:
Generate a random number in [0, 1) range linearly (so using a standard random generator).
Map it into another [0, 1) range in non-linear way.
Map the resulting [0, 1) into [1.5, 7) linearly.
Steps 1. and 3. are easy, the core of our algorithm is 2. We need a way to map [0, 1) into another [0, 1), but non-linearly, so e.g. 0.7 does not have to produce 0.7. Classic math helps here, we just need to look at visual representations of algebraic functions.
In your case you expect that while the input number increases from 0 to 1, the result first grows very slowly (to stay near 1.5 for a longer time), but then it speeds up. This is exactly what e.g. the y = x^2 function looks like. Your resulting code could be something like:
import kotlin.math.pow
import kotlin.random.Random

fun generateDouble(): Double {
    val step1 = Random.nextDouble()   // uniform in [0, 1)
    val step2 = step1.pow(2.0)        // non-linear map, biased toward 0
    val step3 = step2 * 5.5 + 1.5     // linear map to [1.5, 7.0)
    return step3
}
or just:
fun generateDouble() = Random.nextDouble().pow(2.0) * 5.5 + 1.5
By changing the exponent to bigger numbers, the curve will be more aggressive, so it will favor 1.5 more. By making the exponent closer to 1 (e.g. 1.4), the result will be more close to linear, but still it will favor 1.5. Making the exponent smaller than 1 will start to favor 7.
You can also look at other algebraic functions with this shape, e.g. y = 2 ^ x - 1.
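Since Java is also welcome, a rough Java equivalent of the same mapping (using ThreadLocalRandom; names are mine):

import java.util.concurrent.ThreadLocalRandom;

// Same idea in Java: square a uniform [0, 1) value to bias it toward 0,
// then scale linearly into [1.5, 7.0).
static double generateDouble() {
    double step1 = ThreadLocalRandom.current().nextDouble(); // uniform in [0, 1)
    double step2 = step1 * step1;                            // biased toward 0
    return step2 * 5.5 + 1.5;                                // mapped to [1.5, 7.0)
}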
What you could do is 'correct' the random value with a factor in the direction of 1.5. You would create some sort of bias factor. Like this:
@Test
void doubleTest() {
    double origin = 1.50;
    // Random.nextDouble(origin, bound) requires Java 17+
    final double fairRandom = new Random().nextDouble(origin, 7);
    System.out.println(fairRandom);

    double biasFactor = 0.9;
    final double biasedDiff = (fairRandom - origin) * biasFactor;
    double biasedRandom = origin + biasedDiff;
    System.out.println(biasedRandom);
}
The lower you set the bias factor (must be >0 & <= 1), the stronger your bias towards 1.50.
You can take a straightforward approach. As you said, you want a higher probability of getting a value closer to 1.5 than to 7.00, and you can even set that probability yourself. The midpoint of the range is (1.5+7)/2 = 4.25.
So let's say I want a 70% probability that the random value will be closer to 1.5 and a 30% probability that it will be closer to 7.
double finalResult;
double mid = (1.5 + 7) / 2;
double p = nextDouble(0, 100);
if (p <= 70) finalResult = nextDouble(1.5, mid);
else finalResult = nextDouble(mid, 7);
Here, the final result has 70% chance of being closer to 1.5 than 7.
As you did not specify the 70% probability, you can even make it random: just generate nextDouble(50, 100), which will give you a value greater than or equal to 50 and less than 100, and use that as the probability in the calculation above. Thanks
I missed that I am using the same solution strategy as in the reply by Nafiul Alam Fuji. But since I have already formulated my answer, I post it anyway.
One way is to split the range into two subranges, say nextDouble(1.50, 4.25) and nextDouble(4.25, 7.0). You select one of the subranges by generating a random number between 0.0 and 1.0 using nextDouble() and comparing it to a threshold K. If the random number is less than K, you do nextDouble(1.50, 4.25). Otherwise nextDouble(4.25, 7.0).
Now if K=0.5, it is like doing nextDouble(1.50, 7). But by increasing K, you will do nextDouble(1.50, 4.25) more often and favor it over nextDouble(4.25, 7.0). It is like flipping an unfair coin where K determines the extent of the cheating.
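A minimal Java sketch of that threshold idea (names are mine; K is the probability of picking the lower subrange):

import java.util.concurrent.ThreadLocalRandom;

// Pick the lower subrange with probability K, the upper one otherwise.
static double biasedDouble(double K) {
    if (ThreadLocalRandom.current().nextDouble() < K) {
        return ThreadLocalRandom.current().nextDouble(1.50, 4.25);  // favored subrange
    }
    return ThreadLocalRandom.current().nextDouble(4.25, 7.00);
}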

Using a low pass filter to calculate average?

If I want to calculate an average of 400 data points (noise values from an accelerometer sensor), can I use a low pass function such as this one to do that?
private float lowPass(float alpha, float input, float previousOutput) {
    return alpha * previousOutput + (1 - alpha) * input;
}
I'm comparing this to simply storing the 400 data points in a List<Float>, summing them up and dividing by 400.
I'm getting quite different results even with high values for alpha. Am I doing something wrong? Can I use the low pass filter to calculate an average, or is it generally better to simply calculate the "real" average?
EDIT
My low pass function originally took a float[] as input and output, since my data comes from a 3-axis accelerometer. I changed this to float and removed the internal for loop to avoid confusion. This also means that the input/output is now passed as primitive values, so the method returns a float instead of operating directly on the output array.
If you can afford to compute the arithmetic mean (which doesn't even require extra storage if you keep a running sum), then that would probably be the better option in most cases, for the reasons described below.
Warning: maths ahead
For sake of comparing the arithmetic average with the first-order recursive low-pass filter you are using, let's start with a signal of N samples, where each sample has a value equal to m plus some Gaussian noise of variance v. Let's further assume that the noise is independent from sample to sample.
The computation of the arithmetic average on this signal will give you a random result with mean m and variance v/N.
Assuming the first previousOutput is initialized to zero, deriving the mean and variance for the last output (output[N-1]) of the low-pass filter, we would get a mean m * (1 - alpha^N) and variance v * (1-alpha)^2 * (1-alpha^(2*N)) / (1 - alpha^2).
An immediate problem that can be seen is that for large m, the estimated mean m * (1 - alpha^N) can be quite far from the true value m. This problem unfortunately gets worse as alpha gets closer to 1. It occurs because the filter does not have time to ramp up to its steady-state value.
To avoid this issue, one may consider initializing the first previousOutput with the first input sample.
In this case the mean and variance of the last output would be m and v * ((1-alpha)^2*(1-alpha^(2*N-2))/(1-alpha^2) + alpha^(2*N-2)) respectively. This time the problem is that for larger alpha the output variance is largely dominated by the variance of that first sample used for the initialization. This is particularly obvious if you plot the output variance (normalized by the input variance) against alpha.
So, either you get a bias in the estimated mean when initializing previousOutput with zero, or you get a large residual variance when initializing with the first sample (much more so than with the arithmetic mean computation).
Note in conclusion that actual performance may vary for your specific data, depending on the nature of the observed variations.
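For what it's worth, here is a small comparison sketch of my own (not from the answer): it computes the arithmetic mean and the question's recursive filter, initialized with the first sample, over the same data:

// Compare the arithmetic mean with the recursive low-pass filter from the
// question, initializing the filter state with the first sample.
static void compare(float[] samples, float alpha) {
    double sum = 0;
    float filtered = samples[0];
    for (int i = 0; i < samples.length; i++) {
        sum += samples[i];
        if (i > 0) {
            filtered = alpha * filtered + (1 - alpha) * samples[i];
        }
    }
    System.out.println("arithmetic mean: " + (sum / samples.length));
    System.out.println("low-pass output: " + filtered);
}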
What's output[] ? If it holds the results and you initialize with 0s, then this term will always be zero: alpha * output[i]
And in general:
A low-pass filter is a filter that passes signals with a frequency lower than a certain cutoff frequency and attenuates signals with frequencies higher than the cutoff frequency.
So it is not an average; it is basically a cutoff at a specific frequency threshold.

Modular arithmetic: Division over factorials % Prime

I want to efficiently calculate ((X+Y)!/(X!Y!))% P (P is like 10^9+7)
This discussion gives some insights on distributing modulo over division.
My concern is that a modular inverse does not necessarily exist for every number.
Basically, I am looking for a code implementation of solving the problem.
For multiplication it is very straightforward:
public static int mod_mul(int Z, int X, int Y, int P)
{
    // Z = (X + Y), the factorial we need to calculate; P is the prime
    long result = 1;
    while (Z > 1)
    {
        result = (result * Z) % P;
        Z--;
    }
    return (int) result;
}
I also realize that many factors can be cancelled in the division (before taking the modulus), but if the number of divisors increases, I find it difficult to come up with an efficient algorithm to divide (looping over the list of factors of X! and Y! to see which divides the current multiplying factor of the numerator).
Edit: I don't want to use BigInt solutions.
Is there any Java/Python based solution, or any standard algorithm/library, for cancellation of factors (if the inverse option is not fool-proof) or for approaching this type of problem?
((X+Y)!/(X!Y!)) is a low-level way of spelling a binomial coefficient ((X+Y)-choose-X). And while you didn't say so in your question, a comment in your code implies that P is prime. Put those two together, and Lucas's theorem applies directly: http://en.wikipedia.org/wiki/Lucas%27_theorem.
That gives a very simple algorithm based on the base-P representations of X+Y and X. Whether BigInts are required is impossible to guess because you didn't give any bounds on your arguments, beyond that they're ints. Note that your sample mod_mul code may not work at all if, e.g., P is greater than the square root of the maximum int (because result * Z may overflow then).
It's binomial coefficients - C(x+y, x).
You can calculate it differently: C(n,m) = C(n-1,m) + C(n-1,m-1).
If you are OK with time complexity O(x*y), the code will be much simpler.
http://en.wikipedia.org/wiki/Combination
What you need here is a way to do it efficiently:
C(n,k) = C(n-1,k) + C(n-1,k-1)
Use dynamic programming to calculate this efficiently in a bottom-up approach:
C(n,k)%P = ((C(n-1,k))%P + (C(n-1,k-1))%P)%P
Therefore F(n,k) = (F(n-1,k)+F(n-1,k-1))%P
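A minimal bottom-up sketch of this recurrence; the full table costs O(n*k) time and memory, so it is only practical for moderate n and k (it can be reduced to O(k) memory with a single row):

// Bottom-up Pascal's triangle mod P: returns C(n, k) % P.
static long binomialModDP(int n, int k, long P) {
    long[][] C = new long[n + 1][k + 1];
    for (int i = 0; i <= n; i++) {
        C[i][0] = 1 % P;                       // C(i, 0) = 1
        for (int j = 1; j <= Math.min(i, k); j++) {
            C[i][j] = (C[i - 1][j] + C[i - 1][j - 1]) % P;
        }
    }
    return C[n][k];
}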
Another, faster approach:
C(n,k) = C(n-1,k-1)*n/k
F(n,k) = ((F(n-1,k-1)*n)%P*inv(k)%P)%P
inv(k)%P means modular inverse of k.
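Since P is prime, inv(k) can be computed as k^(P-2) mod P by Fermat's little theorem. A minimal sketch using binary exponentiation (for P around 10^9+7 the intermediate long products do not overflow):

// Modular inverse for prime modulus P: inv(k) = k^(P-2) mod P.
static long inv(long k, long P) {
    return power(k % P, P - 2, P);
}

// Fast (binary) exponentiation: base^exp % mod.
static long power(long base, long exp, long mod) {
    long result = 1;
    base %= mod;
    while (exp > 0) {
        if ((exp & 1) == 1) {
            result = result * base % mod;
        }
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}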
Note: evaluate C(n, n-k) if (n-k < k), because C(n, n-k) = C(n, k).

Iterate over every possible double value

Consider the case where you want to test every possible input value. Creating a case where you can iterate over all the possible ints is fairly easy, as you can just increment the value by 1 and repeat.
How would you go about doing this same idea for all the possible double values?
You can iterate over all possible long values and then use Double.longBitsToDouble() to get a double for each possible 64-bit combination.
Note however that this will take a while. If you need 100 nanoseconds of processing per double value, it will take roughly 2^64 * 1e-7 / 86400 / 365 years (not all bit combinations are distinct double values, e.g. the NaNs), which is more than 1.6e12 / 86400 / 365 ≈ 50,700 years on a single CPU. Unless you have a datacenter to do the computation, it is a better idea to sample the range of possible input values at a configurable number of points.
The analogous feat for float is still difficult but doable: assuming you need 10 milliseconds of processing per input value, you need roughly 2^32 * 1e-2 / 86400 ≈ 497 days on a single CPU. You would use Float.intBitsToFloat() in this case.
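A sketch of that float enumeration; the loop counter is a long so it does not wrap around at Integer.MAX_VALUE:

// Visit every 32-bit pattern once and reinterpret it as a float.
// All NaN bit patterns decode to NaN, so they are skipped here.
for (long bits = Integer.MIN_VALUE; bits <= Integer.MAX_VALUE; bits++) {
    float value = Float.intBitsToFloat((int) bits);
    if (Float.isNaN(value)) {
        continue;
    }
    // process(value);
}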
Java's Double class lets you construct and take apart Double values into its constituent pieces. This, and an understanding of double representation, will allow you at least conceptually to enumerate all possible doubles. You will likely find that there are too many though.
Do a loop like:
for (double v = Double.MIN_VALUE; v <= Double.MAX_VALUE; v = Math.nextUp(v)) {
    // ... (note: Double.MIN_VALUE is the smallest positive double, so this
    // covers only the positive range; start at -Double.MAX_VALUE to include
    // negative values as well)
}
but as already explained in Adam's answer, it will take long to run.
(this will visit neither NaN nor Infinity)

How to handle multiplication of numbers close to 1

I have a bunch of floating point numbers (Java doubles), most of which are very close to 1, and I need to multiply them together as part of a larger calculation. I need to do this a lot.
The problem is that while Java doubles have no problem with a number like:
0.0000000000000000000000000000000001 (1.0E-34)
they can't represent something like:
1.0000000000000000000000000000000001
As a consequence I lose precision rapidly (the limit seems to be around 1.000000000000001 for Java's doubles).
I've considered just storing the numbers with 1 subtracted, so for example 1.0001 would be stored as 0.0001 - but the problem is that to multiply them together again I have to add 1 and at this point I lose precision.
To address this I could use BigDecimals to perform the calculation (convert to BigDecimal, add 1.0, then multiply), and then convert back to doubles afterwards, but I have serious concerns about the performance implications of this.
Can anyone see a way to do this that avoids using BigDecimal?
Edit for clarity: This is for a large-scale collaborative filter, which employs a gradient descent optimization algorithm. Accuracy is an issue because often the collaborative filter is dealing with very small numbers (such as the probability of a person clicking on an ad for a product, which may be 1 in 1000, or 1 in 10000).
Speed is an issue because the collaborative filter must be trained on tens of millions of data points, if not more.
Yep: because
(1 + x) * (1 + y) = 1 + x + y + x*y
In your case, x and y are very small, so x*y is going to be far smaller - way too small to influence the results of your computation. So as far as you're concerned,
(1 + x) * (1 + y) = 1 + x + y
This means you can store the numbers with 1 subtracted, and instead of multiplying, just add them up. As long as the results are always much less than 1, they'll be close enough to the mathematically precise results that you won't care about the difference.
EDIT: Just noticed: you say most of them are very close to 1. Obviously this technique won't work for numbers that are not close to 1 - that is, if x and y are large. But if one is large and one is small, it might still work; you only care about the magnitude of the product x*y. (And if both numbers are not close to 1, you can just use regular Java double multiplication...)
Perhaps you could use logarithms?
Logarithms conveniently reduce multiplication to addition.
Also, to take care of the initial precision loss, there is the function log1p (at least, it exists in C/C++), which returns log(1+x) without any precision loss. (e.g. log1p(1e-30) returns 1e-30 for me)
Then you can use expm1 to get the actual result minus 1 back.
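Java has both of these as Math.log1p and Math.expm1. A minimal sketch of the log-based product, assuming the numbers are stored as offsets x_i with value = 1 + x_i:

// Multiply numbers of the form (1 + x_i) by summing log1p(x_i), then
// recover the offset of the product from 1 with expm1.
static double productMinusOne(double[] offsets) {   // offsets are the x_i
    double logSum = 0;
    for (double x : offsets) {
        logSum += Math.log1p(x);   // log(1 + x) without losing x to rounding
    }
    return Math.expm1(logSum);     // (product of all (1 + x_i)) - 1
}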
Isn't this sort of situation exactly what BigDecimal is for?
Edited to add:
"Per the second-last paragraph, I would prefer to avoid BigDecimals if possible for performance reasons." – sanity
"Premature optimization is the root of all evil" - Knuth
There is a simple solution practically made to order for your problem. You are concerned it might not be fast enough, so you want to do something complicated that you think will be faster. The Knuth quote gets overused sometimes, but this is exactly the situation he was warning against. Write it the simple way. Test it. Profile it. See if it's too slow. If it is then start thinking about ways to make it faster. Don't add all this additional complex, bug-prone code until you know it's necessary.
Depending on where the numbers are coming from and how you are using them, you may want to use rationals instead of floats. Not the right answer for all cases, but when it is the right answer there's really no other.
If rationals don't fit, I'd endorse the logarithms answer.
Edit in response to your edit:
If you are dealing with numbers representing low response rates, do what scientists do:
Represent them as the excess / deficit (normalize out the 1.0 part)
Scale them. Think in terms of "parts per million" or whatever is appropriate.
This will leave you dealing with reasonable numbers for calculations.
It's worth noting that you are testing the limits of your hardware rather than of Java. Java uses the 64-bit floating point of your CPU.
I suggest you test the performance of BigDecimal before you assume it won't be fast enough for you. You can still do tens of thousands of calculations per second with BigDecimal.
As David points out, you can just add the offsets up.
(1+x) * (1+y) = 1 + x + y + x*y
However, it seems risky to choose to drop out the last term. Don't. For example, try this:
x = 1e-8
y = 2e-6
z = 3e-7
w = 4e-5
What is (1+x)*(1+y)*(1+z)*(1+w)? In double precision, I get:
(1+x)*(1+y)*(1+z)*(1+w)
ans =
1.00004231009302
However, see what happens if we just do the simple additive approximation.
1 + (x+y+z+w)
ans =
1.00004231
We lost the low order bits that may have been important. This is only an issue if some of the differences from 1 in the product are at least sqrt(eps), where eps is the precision you are working in.
Try this instead:
f = @(u,v) u + v + u*v;
result = f(x,y);
result = f(result,z);
result = f(result,w);
1+result
ans =
1.00004231009302
As you can see, this gets us back to the double precision result. In fact, it is a bit more accurate, since the internal value of result is 4.23100930230249e-05.
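For reference, a rough Java rendering of the same pairwise combination, keeping everything in the offset representation (value - 1); the helper name is mine:

// Combine offsets pairwise: if values are stored as x = value - 1, then the
// offset of the product (1+u)*(1+v) is u + v + u*v, which keeps the low-order bits.
static double combineOffsets(double... offsets) {
    double result = offsets[0];
    for (int i = 1; i < offsets.length; i++) {
        result = result + offsets[i] + result * offsets[i];
    }
    return result;   // product of all (1 + offsets[i]), minus 1
}

Calling combineOffsets(1e-8, 2e-6, 3e-7, 4e-5) gives about 4.23100930230249e-05, matching the internal value quoted above.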
If you really need the precision, you will have to use something like BigDecimal, even if it's slower than double.
If you don't really need the precision, you could perhaps go with David's answer. But even if you do a lot of multiplications, avoiding it might be premature optimization, so BigDecimal might be the way to go anyway.
When you say "most of which are very close to 1", how many, exactly?
Maybe you could have an implicit offset of 1 in all your numbers and just work with the fractions.
