Get random number with larger numbers increasingly unlikely [closed] - java

How can I get a random number in the range k to h such that numbers closer to h are increasingly unlikely to come up?
I'm going to need a number between 20 and 1980.

I've tried some things in Eclipse; here are the results.
import java.util.Random;

interface Generator {
    double generate(double low, double high);
}

abstract class AbstractGenerator implements Generator {
    protected final Random rand;

    public AbstractGenerator() {
        rand = new Random();
    }

    public AbstractGenerator(long seed) {
        rand = new Random(seed);
    }
}
Now results for various generator implementations:
I generated 100k numbers on a scale of 0 to 9; here they are shown as bars.
Catan 2 (add two dice)
class Catan2 extends AbstractGenerator {
    @Override
    public double generate(double low, double high) {
        return low + (high - low) * Math.abs(-1 + (rand.nextDouble() + rand.nextDouble()));
    }
}
Results:
0 : *******************
1 : ******************
2 : ****************
3 : **************
4 : ************
5 : *********
6 : *******
7 : *****
8 : ***
9 : *
Catan 3 (add three dice)
class Catan3 extends AbstractGenerator {
    @Override
    public double generate(double low, double high) {
        return low + (high - low) * Math.abs(-1.5 + (rand.nextDouble() + rand.nextDouble() + rand.nextDouble())) / 1.5;
    }
}
Results:
0 : ***********************
1 : *********************
2 : *******************
3 : ***************
4 : ***********
5 : *******
6 : *****
7 : ***
8 : *
9 : *
Catan 4 (add four dice)
class Catan4 extends AbstractGenerator {
    @Override
    public double generate(double low, double high) {
        return low + (high - low) * Math.abs(-2 + (rand.nextDouble() + rand.nextDouble() + rand.nextDouble() + rand.nextDouble())) / 2D;
    }
}
Results:
0 : ***************************
1 : ************************
2 : ********************
3 : **************
4 : *********
5 : *****
6 : ***
7 : *
8 : *
9 : *
I think "Catan 3" is the best of those.
Formula being: low+(high-low)*abs(-1.5+(RAND+RAND+RAND))/1.5
Basically, I get a "hill" distribution, then I center it and take it's abs value. Then I norm it to the desired values.

And yet another option. There are standard methods to produce random numbers on a Gaussian distribution. Set up a Gaussian RNG with an average of k and a standard deviation of h/5. Reject any number below k (about half the numbers generated) and reject all numbers greater than h (5% or less).
You can tweak the standard deviation if you want to optimise the results. Effectively this is a half-Gaussian RNG with a truncated tail, so the numbers are not uniform; you will get more of them close to k than to h.
ETA: Thanks to @MightyPork's comment, which got me thinking. A Gaussian distribution is symmetric, so there is no need to throw away any raw values less than k. Just shift them from below k to the same distance above k:
if (raw < k)
    raw <- k + (k - raw)
end if
Values above h will still need to be rejected.
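A minimal Java sketch of this reflect-and-truncate scheme (the method name and the Random plumbing are mine, using the standard deviation of h/5 suggested above):
import java.util.Random;

// Half-Gaussian on [k, h]: reflect raw values below k, reject values above h.
static double nextHalfGaussian(double k, double h, Random rand) {
    double raw;
    do {
        raw = k + rand.nextGaussian() * (h / 5.0); // mean k, std dev h/5
        if (raw < k) {
            raw = k + (k - raw);                   // shift below-k values above k
        }
    } while (raw > h);                             // reject the truncated tail
    return raw;
}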

Say our range is [0,4]. Create an array like this:
[000001111222334]
Now use a standard Random object to draw from the array. By doing this, we have gone from drawing from a uniform distribution to a distribution of our own design. In reality, we're not going to want to use an auxiliary array. You can do the following in lieu of an auxiliary array:
Draw from [0,14]; map [0,4] to 0, [5,8] to 1, [9,11] to 2, [12,13] to 3 and [14] to 4.
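As a sketch in Java (the thresholds are just the run boundaries of the array above; names are mine):
import java.util.Random;

static int weightedDraw(Random rand) {
    int r = rand.nextInt(15);    // uniform over [0,14]
    if (r <= 4)  return 0;       // 5/15 of draws
    if (r <= 8)  return 1;       // 4/15
    if (r <= 11) return 2;       // 3/15
    if (r <= 13) return 3;       // 2/15
    return 4;                    // 1/15
}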
It really depends on what your distribution looks like. You can approximate drawing from a non-uniform distribution via drawing multiple times from uniform distributions over varying ranges. Of course, if you know the probability mass function or probability density function of your distribution, then you're golden.

If you need good control over the distribution of numbers, then a good way to go is the method of inverses. Create a sorted table of (x,y) pairs where x and y both increase monotonically: x from 0 to 1 and y from the low to high value of pseudo-random numbers you need. The algorithm is:
x = uniform random float in [0..1)
Search the table to find (x[i],y[i]) such that x[i] <= x < x[i+1]
// Return linearly interpolated y value
return y[i] + (x - x[i]) / (x[i+1] - x[i]) * (y[i+1] - y[i])
You control the distribution of return values with the table entries.
If the table contains only (0,0) and (1,1), then obviously the return value is equal to x, and the distribution is uniform. To get more high numbers, describe a curve that increases more rapidly at the start and is flatter at the higher x values, say:
(0,0) (0.25,0.5) (1,1)
You should be able to see why this works. In the uniform distribution, half the numbers are between 0 and .5. With this table, only a quarter of the numbers are in that range, so the other three-quarters are in 0.5 to 1. The high numbers are more frequent as you require.
You can create as smooth a curve as you like and of any shape as long as it's monotonically increasing. If the table has more than a few pairs, consider binary search for speed.
For a range of 20 to 1980, the corresponding table would be something like:
(0, 20) (0.25, 1000) (1, 1980)
If you need integers, you'd want to use
(0, 20) (0.25, 1000) (1, 1981)
and then truncate the fraction from the result.
Again, you'd probably want more points in the table to make the ICDF smoother. This is for illustration.
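A minimal Java sketch of the lookup-and-interpolate step (the binary search and array framing are mine):
// xs ascends from 0 to 1; ys ascends from the low to the high output value.
static double interpolate(double[] xs, double[] ys, double x) {
    int lo = 0, hi = xs.length - 1;
    while (hi - lo > 1) {                // invariant: xs[lo] <= x < xs[hi]
        int mid = (lo + hi) / 2;
        if (xs[mid] <= x) lo = mid; else hi = mid;
    }
    // linearly interpolated y value
    return ys[lo] + (x - xs[lo]) / (xs[lo + 1] - xs[lo]) * (ys[lo + 1] - ys[lo]);
}
For the 20-1980 table this would be called as interpolate(new double[]{0, 0.25, 1}, new double[]{20, 1000, 1980}, rand.nextDouble()).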
The Math
The curve stored in the table is called the inverse cumulative distribution function (ICDF) for the returned pseudo-random numbers. A probability density function (PDF) is a non-negative function with area under the curve of 1. Commonly used PDFs are the uniform, exponential, and normal. The corresponding CDF is the running integral of the PDF. The ICDF is the inverse of the CDF. It's well known that to generate random numbers with any given PDF, you can find the ICDF and apply the algorithm above.

Related

How to find the point that gives the maximum value fast? Java or C++ code please

I need a fast way to find the maximum value when intervals overlap. Unlike finding the point covered by the most intervals, here there is an "order": the closer a point is to an interval's center, the larger that interval's value at that point. I have int[][] data with 2 values per int[], where the first number is the center and the second number is the radius. For example, if I am given data like:
int[][] data = new int[][]{
    {1, 1},
    {3, 3},
    {2, 4}};
Then on a number line, this is how it looks:
x axis: -2 -1  0  1  2  3  4  5  6  7
1 1:           1  2  1
3 3:           1  2  3  4  3  2  1
2 4:     1  2  3  4  5  4  3  2  1
So for the value of my point to be as large as possible, I need to pick the point x = 2, which gives a total value of 1 + 3 + 5 = 9, the largest possible value. Is there a way to do it fast, like with time complexity O(n) or O(n log n)?
This can be done with a simple O(n log n) algorithm.
Consider the value function v(x), and then consider its discrete derivative dv(x)=v(x)-v(x-1). Suppose you only have one interval, say {3,3}. dv(x) is 0 from -infinity to -1, then 1 from 0 to 3, then -1 from 4 to 6, then 0 from 7 to infinity. That is, the derivative changes by 1 "just after" -1, by -2 just after 3, and by 1 just after 6.
For n intervals, there are 3*n derivative changes (some of which may occur at the same point). So find the list of all derivative changes (x,change), sort them by their x, and then just iterate through the set.
Behold:
intervals = [(1, 1), (3, 3), (2, 4)]

events = []
for mid, width in intervals:
    before_start = mid - width - 1
    at_end = mid + width
    events += [(before_start, 1), (mid, -2), (at_end, 1)]
events.sort()

prev_x = -1000
v = 0
dv = 0
best_v = -1000
best_x = None
for x, change in events:
    dx = x - prev_x
    v += dv * dx
    if v > best_v:
        best_v = v
        best_x = x
    dv += change
    prev_x = x

print(best_x, best_v)
And the equivalent Java code:
TreeMap<Integer, Integer> ts = new TreeMap<>();
for (int i = 0; i < cows.size(); i++) {
    int mid = cows.get(i)[0];
    int width = cows.get(i)[1];
    // slope +1 just after (mid - width - 1), -2 at the peak, +1 after the end
    ts.merge(mid - width - 1, 1, Integer::sum);
    ts.merge(mid, -2, Integer::sum);
    ts.merge(mid + width, 1, Integer::sum);
}
int value = 0;
int best = 0;
int change = 0;
int indexBefore = -100000000;
for (Map.Entry<Integer, Integer> e : ts.entrySet()) {
    int index = e.getKey();
    value += (index - indexBefore) * change;
    best = Math.max(value, best);
    change += e.getValue();
    indexBefore = index;
}
where cows is the data (a List<int[]> of {center, radius} pairs)
Hmmm, a general O(n log n) or better would be tricky; it is probably solvable via linear programming, but that can get rather complex.
After a bit of wrangling, I think this can be solved via line intersections and summation of functions (represented by line segments). Basically, think of each input as a triangle on top of a line. If the input is (C, R), the triangle is centered on C and has a radius of R. The points on the line are C-R (value 0), C (value R) and C+R (value 0). Each line segment of the triangle represents a value.
Consider any 2 such "triangles"; the max value occurs in one of 2 places:
The peak of one of the triangles
The intersection point of the triangles, i.e. the point where the two triangles overlap. Multiple triangles just mean more possible intersection points; sadly the number of possible intersections grows quadratically, so O(N log N) or better may be impossible with this method (unless some good optimizations are found), unless the number of intersections is O(N) or less.
To find all the intersection points, we can just use a standard algorithm for that, but we need to modify things in one specific way: we need to add a line that extends from each peak high enough that it would be higher than any other line, basically from (C, C) to (C, Max_R). We then run the algorithm; output-sensitive intersection-finding algorithms are O(N log N + k), where k is the number of intersections. Sadly this can be as high as O(N^2) (consider the case (1,100), (2,100), (3,100), and so on up to (50,100): every line would intersect every other line). Once you have the O(N + K) intersections, at every intersection you can calculate the value by summing all the lines covering that point. The running sum can be kept as a cached value so it only changes O(K) times, though that might not be possible, in which case it would be O(N*K) instead, making it potentially O(N^3) (in the worst case for K) :(. Though that seems reasonable: for each intersection you need to sum up to O(N) lines to get the value for that point, though in practice performance would likely be better.
There are optimizations that could be done considering that you aim for the max and not just to find intersections. There are likely intersections not worth pursuing; however, I could also see situations so close that you can't cut anything down. It reminds me of convex hulls: in many cases you can easily discard 90% of the data, but there are cases where you see the worst-case results (every point or almost every point is a hull point). For example, in practice there are certainly cases where you can be sure that the sum is going to be less than the current known max value.
Another optimization might be building an interval tree.

Find a sum equal or greater than given target using only numbers from set

Example 1:
Shop selling beer, available packages are 6 and 10 units per package. Customer inputs 26 and algorithm replies 26, because 26 = 10 + 10 + 6.
Example 2:
Selling spices, available packages are 0.6, 1.5 and 3. Target value = 5. The algorithm returns 5.1, because it is the nearest number greater than the target achievable with the packages (3, 1.5, 0.6).
I need a Java method that will suggest that number.
A similar algorithm is described in the Bin packing problem, but it doesn't suit me.
I tried it, and when it returned a number smaller than the target I ran it again with an increased target. But that is not efficient when the number of packages is huge.
I need almost the same algorithm, but finding the equal or nearest greater number.
Similar question: Find if a number is a possible sum of two or more numbers in a given set - python.
First let's reduce this problem to integers rather than real numbers; otherwise we won't get a fast optimal algorithm out of this. For example, multiply all numbers by 100 and then round to the nearest integer. So say we have item sizes x1, ..., xn and target size Y. We want to minimize the value
k1 x1 + ... + kn xn - Y
under the conditions
(1) ki is a non-negative integer for all n ≥ i ≥ 1
(2) k1 x1 + ... + kn xn - Y ≥ 0
One simple algorithm for this would be to ask a series of questions like
Can we achieve k1 x1 + ... + kn xn = Y + 0?
Can we achieve k1 x1 + ... + kn xn = Y + 1?
Can we achieve k1 x1 + ... + kn xn = Y + z?
etc. with increasing z
until we get the answer "Yes". All of these problems are instances of the Knapsack problem with the weights set equal to the values of the items. The good news is that we can solve all those at once, if we can establish an upper bound for z. It's easy to show that there is a solution with z ≤ Y, unless all the xi are larger than Y, in which case the solution is just to pick the smallest xi.
So let's use the pseudopolynomial dynamic programming approach to solve knapsack: let f(i,j) be 1 iff we can reach total item size j with the first i items (x1, ..., xi). We have the recurrence
f(0,0) = 1
f(0,j) = 0 for all j > 0
f(i,j) = f(i - 1, j) or f(i - 1, j - x_i) or f(i - 1, j - 2 * x_i) ...
We can solve this DP array in O(n * Y) time and O(Y) space. The result will be the first j ≥ Y with f(n, j) = 1.
There are a few technical details that are left as an exercise to the reader:
How to implement this in Java (see the sketch after this list)
How to reconstruct the solution if needed. This can be done in O(n) time using the DP array (but then we need O(n * Y) space to remember the whole thing).
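A minimal Java sketch of the reachability DP (no solution reconstruction), under my assumption that the sizes have already been scaled to positive integers:
// Smallest reachable sum >= target, with unlimited copies of each size.
static int smallestSumAtLeast(int[] sizes, int target) {
    int min = Integer.MAX_VALUE;
    for (int s : sizes) min = Math.min(min, s);
    if (min >= target) return min;       // every single package already reaches Y
    int limit = 2 * target;              // z <= Y, so the answer is at most 2 * Y
    boolean[] reachable = new boolean[limit + 1];
    reachable[0] = true;
    for (int s : sizes)                  // ascending j allows reusing each size
        for (int j = s; j <= limit; j++)
            if (reachable[j - s]) reachable[j] = true;
    for (int j = target; j <= limit; j++)
        if (reachable[j]) return j;
    return -1;                           // unreachable (cannot happen as argued above)
}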
You want to solve the integer programming problem min(c·t) s.t. c·t ≥ T, c ≥ 0, where T is your target weight, c is a non-negative integer vector specifying how much of each package to purchase, and t is the vector specifying the weight of each package. You can either solve this with dynamic programming as pointed out in another answer, or, if your weights and target weight are too large, you can use general integer programming solvers, which have been highly optimized over the years to give good performance in practice.

Random number with nonuniform distribution [duplicate]

Possible Duplicate:
Generate random number with non-uniform density
I am trying to find or create a function (in Java) that gives me a nonuniformly distributed sequence of numbers.
If I have a function, say f(x) with x > 0, it should give me a random number
from 0 to x.
The function must work with any given x; what follows is only an example of what I want.
Say x = 100; then f(x) should return nonuniformly distributed values.
For example, I want:
0 to 20 to be approximately 20% of all cases.
21 to 50 to be approximately 50% of all cases.
51 to 70 to be approximately 20% of all cases.
71 to 100 to be approximately 10% of all cases.
In short, something that gives me numbers like a normal distribution, peaking at 30-40 when x is 100.
http://en.wikipedia.org/wiki/Normal_distribution
(I can use a uniform random generator as a source if needed; I only need a function that transforms the uniform result into a non-uniform one.)
EDIT
My final solution for this problem is:
/**
 * Returns a value in [0,1] with a mean of about 0.3: roughly 10% of values
 * are below 0.1, about 5% are above 0.8, and about 30% fall in the range
 * 0.25 to 0.45.
 *
 * @return the next skewed pseudorandom value
 */
public double nextMyGaussian() {
    double d = -1000;
    while (d < -1.5) {
        // RANDOM is Java's normal Random class; nextGaussian() practically
        // always falls in about -5 to +5.
        d = RANDOM.nextGaussian() * 1.5;
    }
    if (d > 3.5d) {
        return 1;
    }
    return (d + 1.5) / 5;
}
A simple solution would be to first generate a random number between 0 and 9.
0 means the first ten percent, 1 the following ten percent, and so on.
So if you get 0 or 1, you generate a second random number between 0 and 20; if you get 2, 3, 4, 5 or 6, you generate a second random number between 21 and 50; and so on for the remaining buckets.
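A sketch of that idea in Java, filling in the remaining buckets from the percentages in the question (the helper name is mine):
import java.util.Random;

static int nextSkewed(Random rand) {
    int decile = rand.nextInt(10);                  // which ten-percent slice
    if (decile <= 1) return rand.nextInt(21);       // 20% -> 0..20
    if (decile <= 6) return 21 + rand.nextInt(30);  // 50% -> 21..50
    if (decile <= 8) return 51 + rand.nextInt(20);  // 20% -> 51..70
    return 71 + rand.nextInt(30);                   // 10% -> 71..100
}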
Could you just write a function that sums a number of random numbers in the 1-X range and takes their average? This will tend to the normal distribution as n increases.
See:
Generate random numbers following a normal distribution in C/C++
I hacked up something like the below:
class CrudeDistribution {

    final int TRIALS = 20;

    public int getAverageFromDistribution(int upperLimit) {
        return getAverageOfRandomTrials(TRIALS, upperLimit);
    }

    private int getAverageOfRandomTrials(int trials, int upperLimit) {
        double d = 0.0;
        for (int i = 0; i < trials; i++) {
            d += getRandom(upperLimit);
        }
        return (int) (d / trials);
    }

    private int getRandom(int upperLimit) {
        return (int) (Math.random() * upperLimit) + 1;
    }
}
There are libraries in Commons Math that can generate distributions based on means and standard deviations (which measure the spread), and the link below has some algorithms that do this.
Probably a fun hour or so of hunting to find the relevant two-liner:
https://commons.apache.org/math/userguide/distribution.html
One solution would be to generate a random number between 1 and 100 and, based on the result, generate another random number in the appropriate range.
1-20 -> 0-20
21-70 -> 21-50
71-90 -> 51-70
91-100 -> 71-100
Hope that makes sense.
You need to create the f(x) first.
Assuming the values of x are equiprobable, your f(x) is a piecewise linear map along these lines:
double f(double x) {
    if (x <= 20) {
        return x;
    } else if (x <= 70) {
        return (x - 20) / 50 * 30 + 20;
    }
    // ... and so on for the remaining segments
}
Just generate a bunch of uniform random numbers between 0 and x, say at least 30, and take their mean. Following the central limit theorem, the mean will be a random number from an approximately normal distribution centered around x/2.

Generate a random number with max, min and mean(average) in Java

I need to generate random numbers with following properties.
Min should be 200
Max should be 20000
Average (mean) is 500.
Optional: 75th percentile to be 5000.
It is definitely not a uniform distribution, nor Gaussian; I need to give it some left skewness.
Java's Random probably won't work because it only gives you uniform or normal (Gaussian) distributions.
What you're probably looking for is an F distribution (see below). You can probably use the distlib library here and choose the F distribution, then use its random method to get your random number.
Say X is your target variable; let's normalize the range by setting Y = (X - 200)/(20000 - 200). So now you want some random variable Y that takes values in [0,1] with mean (500 - 200)/(20000 - 200) = 1/66.
You have many options; the most natural one seems to me a Beta distribution, Y ~ Beta(a,b) with a/(a+b) = 1/66. You have an extra degree of freedom, which you can use to fit the 75th-percentile requirement.
After that, you simply return X as Y*(20000 - 200) + 200.
To generate a Beta random variable, you can use Apache Commons or see here.
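A sketch with Apache Commons Math (assuming commons-math3 is on the classpath; I pick a = 1, so b = 65 satisfies a/(a+b) = 1/66, spending the spare degree of freedom arbitrarily rather than on the 75th-percentile fit):
import org.apache.commons.math3.distribution.BetaDistribution;

BetaDistribution beta = new BetaDistribution(1.0, 65.0); // mean a/(a+b) = 1/66

double next() {
    double y = beta.sample();         // Y ~ Beta(1, 65) on [0,1]
    return y * (20000 - 200) + 200;   // X = Y*(20000-200) + 200
}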
This may not be the answer you're looking for, but here is the specific case with 3 uniform distributions:
public int generate() {
    if (random(0, 65) == 0) {
        // 50-100 percentile
        if (random(1, 13) > 3) {
            // 50-75 percentile
            return random(500, 5000);
        } else {
            // 75-100 percentile
            return random(5000, 20000);
        }
    } else {
        // 0-50 percentile
        return random(200, 500);
    }
}
How I got the numbers
First, the area under the curve is equal between 200-500 and 500-20000. This means that the height relationship is 300 * leftHeight == 19500 * rightHeight, making leftHeight == 65 * rightHeight.
This gives us a 1/66 chance to choose the right side, and a 65/66 chance to choose the left.
I then made the same calculation for the 75th percentile, except the ratio was (500-5000 chance) == (5000-20000 chance) * 10 / 3. Again, this means we have a 10/13 chance to be in the 50-75 percentile, and a 3/13 chance to be in 75-100.
Kudos to @Stas; I am using his 'inclusive random' function.
And yes, I realise my numbers are slightly off, as this method works with discrete numbers while my calculations were continuous. It would be good if someone could correct my border cases.
You can have a function f working on [0,1] such that
Integral(f(x)dx) on [0;1] = 500
f(0) = 200
f(0.75) = 5000
f(1) = 20000
I guess a function of the form
f(x) = a*exp(x) + b*x + c
could be a solution, you just have to solve the related system.
Then, you do f(uniform_random(0,1)) and there you are!
Your question is vague as there are numerous random distributions with a given minimum, maximum, and mean.
Indeed, one solution among many is to choose max with probability (mean - min)/(max - min) and min otherwise. That is, this solution generates one of only two numbers: the minimum and the maximum.
The following is another solution.
The PERT distribution (or beta-PERT distribution) is designed to take a minimum and maximum and estimated mode. It's a "smoothed-out" version of the triangular distribution, and generating a random variate from that distribution can be implemented as follows:
startpt + (endpt - startpt) *
BetaDist(1.0 + (midpt - startpt) * shape / (endpt - startpt),
1.0 + (endpt - midpt) * shape / (endpt - startpt))
where—
startpt is the minimum,
midpt is the mode (not necessarily average or mean),
endpt is the maximum,
shape is a number 0 or greater, but usually 4, and
BetaDist(X, Y) returns a random variate from the beta distribution with parameters X and Y.
Given a known mean (mean), midpt can be calculated by:
3 * mean / 2 - (startpt + endpt) / 4
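A minimal sketch of that recipe, assuming Apache Commons Math's BetaDistribution stands in for BetaDist (the method name is mine):
import org.apache.commons.math3.distribution.BetaDistribution;

static double samplePert(double startpt, double midpt, double endpt, double shape) {
    double a = 1.0 + (midpt - startpt) * shape / (endpt - startpt);
    double b = 1.0 + (endpt - midpt) * shape / (endpt - startpt);
    return startpt + (endpt - startpt) * new BetaDistribution(a, b).sample();
}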

What is a good solution for calculating an average where the sum of all values exceeds a double's limits?

I have a requirement to calculate the average of a very large set of doubles (10^9 values). The sum of the values exceeds the upper bound of a double, so does anyone know any neat little tricks for calculating an average that doesn't require also calculating the sum?
I am using Java 1.5.
You can calculate the mean iteratively. This algorithm is simple, fast, you have to process each value just once, and the variables never get larger than the largest value in the set, so you won't get an overflow.
double mean(double[] ary) {
    double avg = 0;
    int t = 1;
    for (double x : ary) {
        avg += (x - avg) / t;
        ++t;
    }
    return avg;
}
Inside the loop, avg is always the average of all the values processed so far. In other words, if all the values are finite, you should not get an overflow.
The very first issue I'd like to ask you is this:
Do you know the number of values beforehand?
If not, then you have little choice but to sum, count, and divide to get the average. If Double isn't high enough precision to handle this, then tough luck: you can't use Double; you need to find a data type that can handle it.
If, on the other hand, you do know the number of values beforehand, you can look at what you're really doing and change how you do it, but keep the overall result.
The average of N values, stored in some collection A, is this:
A[0]   A[1]   A[2]   A[3]          A[N-1]   A[N]
---- + ---- + ---- + ---- + .... + ------ + ----
 N      N      N      N              N       N
To calculate subsets of this result, you can split the calculation into equally sized sets, so you can do this for 3-valued sets (assuming the number of values is divisible by 3; otherwise you need a different divisor):
/ A[0]   A[1]   A[2] \   / A[3]   A[4]   A[5] \          / A[N-2]   A[N-1]   A[N] \
| ---- + ---- + ---- |   | ---- + ---- + ---- |          | ------ + ------ + ---- |
\  3      3      3   /   \  3      3      3   /          \    3       3       3   /
--------------------- + --------------------- + .... + ---------------------------
          N                        N                                 N
         ---                      ---                               ---
          3                        3                                 3
Note that you need equally sized sets, otherwise numbers in the last set, which will not have enough values compared to all the sets before it, will have a higher impact on the final result.
Consider the numbers 1-7 in sequence; if you pick a set size of 3, you'll get this result:
/ 1   2   3 \   / 4   5   6 \   / 7 \
| - + - + - | + | - + - + - | + | - |
\ 3   3   3 /   \ 3   3   3 /   \ 3 /
-------------   -------------   -----
      y               y           y
which gives:
2   5   7/3
- + - + ---
y   y    y
If y is 3 for all the sets, you get this:
2   5   7/3
- + - + ---
3   3    3
which gives:
2*3   5*3   7
--- + --- + -
 9     9    9
which is:
6   15   7
- + -- + -
9    9   9
which totals:
28
-- ~ 3.111111...
 9
The average of 1-7 is 4, so obviously this won't work. Note that if you do the above exercise with the numbers 1, 2, 3, 4, 5, 6, 7, 0, 0 (note the two zeroes at the end), then you'll get the above result.
In other words, if you can't split the number of values into equally sized sets, the last set will be counted as though it has the same number of values as all the sets preceding it, but it will be padded with zeroes for all the missing values.
So, you need equally sized sets. Tough luck if your original input set consists of a prime number of values.
What I'm worried about here though is loss of precision. I'm not entirely sure Double will give you good enough precision in such a case, if it initially cannot hold the entire sum of the values.
Apart from using the better approaches already suggested, you can use BigDecimal for your calculations. (Bear in mind it is immutable.)
IMHO, the most robust way of solving your problem is
sort your set
split in groups of elements whose sum wouldn't overflow - since they are sorted, this is fast and easy
do the sum in each group - and divide by the group size
do the sum of the groups' sums (possibly calling this same algorithm recursively); be aware that if the groups are not equally sized, you'll have to weight them by their size
One nice thing about this approach is that it scales nicely if you have a really large number of elements to sum, and a large number of processors/machines to do the math; a simplified sketch follows.
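A simplified sketch of the idea, using a fixed group size instead of overflow-aware splitting, with the size-weighted merge from the last step (names are mine):
import java.util.Arrays;

static double groupedMean(double[] values, int groupSize) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);                                        // step 1: sort
    double mean = 0;
    long seen = 0;
    for (int start = 0; start < sorted.length; start += groupSize) {
        int end = Math.min(start + groupSize, sorted.length);   // step 2: group
        double sum = 0;
        for (int i = start; i < end; i++) sum += sorted[i];     // step 3: group sum
        seen += end - start;
        // step 4: merge the group average, weighted by group size
        mean += (sum / (end - start) - mean) * ((double) (end - start) / seen);
    }
    return mean;
}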
Please clarify the potential ranges of the values.
Given that a double has a range ~= +/-10^308, and you're summing 10^9 values, the apparent range suggested in your question is values of the order of 10^299.
That seems somewhat, well, unlikely...
If your values really are that large, then with a normal double you've got only 17 significant decimal digits to play with, so you'll be throwing away about 280 digits worth of information before you can even think about averaging the values.
I would also note (since no-one else has) that for any set of numbers X:
          sum(X[i] - c)
mean(X) = ------------- + c
                N
for any arbitrary constant c.
In this particular problem, setting c = min(X) might dramatically reduce the risk of overflow during the summation.
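A sketch of that trick with c = min(X) (the function name is mine):
static double shiftedMean(double[] x) {
    double c = x[0];
    for (double v : x) c = Math.min(c, v);  // c = min(X)
    double sum = 0;
    for (double v : x) sum += v - c;        // the deltas sum to something much smaller
    return sum / x.length + c;
}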
May I humbly suggest that the problem statement is incomplete...?
A double can be divided by a power of 2 without loss of precision. So if your only problem is the absolute size of the sum, you could pre-scale your numbers before summing them. But with a dataset of this size, there is still the risk that you will hit a situation where you are adding small numbers to a large one, and the small numbers will end up being mostly (or completely) ignored.
For instance, when you add 2.2e-20 to 9.0e20 the result is 9.0e20, because once the scales are adjusted so that the numbers can be added together, the smaller number is 0. Doubles can only hold about 17 digits, and you would need more than 40 digits to add these two numbers together without loss.
So, depending on your data set and how many digits of precision you can afford to lose, you may need to do other things. Breaking the data into sets will help, but a better way to preserve precision might be to determine a rough average (you may already know this number), then subtract each value from the rough average before you sum it. That way you are summing the distances from the average, so your sum should never get very large.
Then you take the average delta and add it to your rough average to get the correct average. Keeping track of the min and max delta will also tell you how much precision you lost during the summing process. If you have lots of time and need a very accurate result, you can iterate.
You could take the average of averages of equal-sized subsets of numbers that don't exceed the limit.
Divide all values by the set size and then sum them up.
Option 1 is to use an arbitrary-precision library so you don't have an upper-bound.
Other options (which lose precision) are to sum in groups rather than all at once, or to divide before summing.
So I don't repeat myself so much, let me state that I am assuming that the list of numbers is normally distributed, and that you can sum many numbers before you overflow. The technique still works for non-normal distributions, but some things will not meet the expectations I describe below.
--
Sum up a sub-series, keeping track of how many numbers you consume, until you approach the overflow, then take the average. This gives you an average a0 and a count n0. Repeat until you exhaust the list. Now you should have many ai, ni.
Each ai and ni should be relatively close, with the possible exception of the last bite of the list. You can mitigate that by under-biting near the end of the list.
You can combine any subset of these ai, ni by picking any ni in the subset (call it np) and dividing all the ni in the subset by that value. The max size of the subsets to combine is the roughly constant value of the n's.
The ni/np should be close to one. Now sum ni/np * ai, multiply by np/(sum ni), and keep track of sum ni. This gives you a new ni, ai combination, if you need to repeat the procedure.
If you will need to repeat (i.e., the number of ai, ni pairs is much larger than the typical ni), try to keep relative n sizes constant by combining all the averages at one n level first, then combining at the next level, and so on; a sketch of the merge follows.
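A sketch of merging the (ai, ni) partial results into one overall average, as a running weighted merge (names are mine):
static double combine(double[] a, long[] n) {
    double avg = 0;
    long total = 0;
    for (int i = 0; i < a.length; i++) {
        total += n[i];
        avg += (a[i] - avg) * ((double) n[i] / total); // weight each ai by its ni
    }
    return avg;
}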
First of all, make yourself familiar with the internal representation of double values. Wikipedia should be a good starting point.
Then, consider that doubles are stored as "value times exponent", where the exponent is a power of two. The limit of the largest double value is an upper limit on the exponent, not a limit on the value! So you may divide all large input numbers by a large enough power of two. This should be safe for all large enough numbers. You can re-multiply the result by the factor to check whether you lost precision in the division.
Here we go with an algorithm
public static double average(double[] numbers) {
    double eachSum = 0, tempSum = 0;
    double factor = Math.pow(2.0, 30); // about as large as 10^9
    for (double each : numbers) {
        double temp = each / factor;
        if (temp * factor != each) {
            eachSum += each;  // small value: dividing it would lose precision
        } else {
            tempSum += temp;  // large value: accumulate the scaled-down version
        }
    }
    return (tempSum / numbers.length) * factor + (eachSum / numbers.length);
}
And don't be worried by the additional division and multiplication. The FPU will optimize the hell out of them, since they are done with a power of two (for comparison, imagine adding and removing digits at the end of a decimal number).
PS: in addition, you may want to use Kahan summation to improve precision. Kahan summation avoids loss of precision when very large and very small numbers are summed up; a sketch follows.
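A minimal sketch of the standard compensated-summation loop (my code, not the answer's):
static double kahanSum(double[] values) {
    double sum = 0.0;
    double c = 0.0;             // running compensation for lost low-order bits
    for (double v : values) {
        double y = v - c;       // correct the next value by the carried error
        double t = sum + y;     // big + small: low-order bits of y can be lost here...
        c = (t - sum) - y;      // ...so recover them algebraically into c
        sum = t;
    }
    return sum;
}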
I posted an answer to a question spawned from this one, realizing afterwards that my answer is better suited to this question than to that one. I've reproduced it below. I notice though, that my answer is similar to a combination of Bozho's and Anon.'s.
As the other question was tagged language-agnostic, I chose C# for the code sample I've included. Its relative ease of use and easy-to-follow syntax, along with its inclusion of a couple of features facilitating this routine (a DivRem function in the BCL, and support for iterator functions), as well as my own familiarity with it, made it a good choice for this problem. Since the OP here is interested in a Java solution, but I'm not Java-fluent enough to write it effectively, it might be nice if someone could add a translation of this code to Java.
Some of the mathematical solutions here are very good. Here's a simple technical solution.
Use a larger data type. This breaks down into two possibilities:
Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.
I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.
If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple (untested) example in C# follows
class BigMeanSet{
    List<uint> list = new List<uint>();

    public double GetAverage(IEnumerable<uint> values){
        list.Clear();
        list.Add(0);

        uint count = 0;
        foreach(uint value in values){
            Add(0, value);
            count++;
        }

        return DivideBy(count);
    }

    void Add(int listIndex, uint value){
        if((list[listIndex] += value) < value){ // then overflow has occurred
            if(list.Count == listIndex + 1)
                list.Add(0);
            Add(listIndex + 1, 1);
        }
    }

    double DivideBy(uint count){
        const double shift = 4.0 * 1024 * 1024 * 1024;

        double rtn = 0;
        long remainder = 0;

        for(int i = list.Count - 1; i >= 0; i--){
            rtn *= shift;
            remainder <<= 32;
            rtn += Math.DivRem(remainder + list[i], count, out remainder);
        }

        rtn += remainder / (double)count;

        return rtn;
    }
}
Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.
This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 2^32 - 1. If more elements are needed, then the count variable will need to be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.
In terms of efficiency, it should be as fast or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.
If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.
I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).
I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 2^32 - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.
I then tested with 100* large series, with random lengths between 10^5 and 10^9. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.
*Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 10^9 takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.
For those interested, my test code is below:
static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
    for(uint i = lowerbound; i <= upperbound; i++)
        yield return i;
}

static void Test(){
    Console.BufferHeight = 1200;

    Random rnd = new Random();

    for(int i = 0; i < 1000; i++){
        uint[] numbers = new uint[rnd.Next(1, 1000)];
        for(int j = 0; j < numbers.Length; j++)
            numbers[j] = (uint)rnd.Next();

        double sum = 0;
        foreach(uint n in numbers)
            sum += n;

        double avg = sum / numbers.Length;
        double ans = new BigMeanSet().GetAverage(numbers);

        Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);

        if(avg != ans)
            Debugger.Break();
    }

    for(int i = 0; i < 100; i++){
        uint length = (uint)rnd.Next(100000, 1000000001);
        uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
        uint upperbound = lowerbound + length;

        double avg = ((double)lowerbound + upperbound) / 2;
        double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));

        Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);

        if(avg != ans)
            Debugger.Break();
    }
}
A random sampling of a small set of the full dataset will often result in a 'good enough' solution. You obviously have to make this determination yourself based on system requirements. Sample size can be remarkably small and still obtain reasonably good answers. This can be adaptively computed by calculating the average of an increasing number of randomly chosen samples - the average will converge within some interval.
Sampling not only addresses the double overflow concern, but is much, much faster. Not applicable for all problems, but certainly useful for many problems.
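A minimal sketch of the sampling approach with a fixed sample size (the adaptive stopping rule is omitted; the names and the size parameter are my choices):
import java.util.Random;

static double sampledMean(double[] data, int samples) {
    Random rand = new Random();
    double avg = 0;
    for (int i = 1; i <= samples; i++) {
        double x = data[rand.nextInt(data.length)];
        avg += (x - avg) / i;   // running mean: no large sum to overflow
    }
    return avg;
}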
Consider this:
avg(n1) : n1 = a1
avg(n1, n2) : ((1/2)*n1)+((1/2)*n2) = ((1/2)*a1)+((1/2)*n2) = a2
avg(n1, n2, n3) : ((1/3)*n1)+((1/3)*n2)+((1/3)*n3) = ((2/3)*a2)+((1/3)*n3) = a3
So for any set of doubles of arbitrary size, you could do this (this is in C#, but I'm pretty sure it could be easily translated to Java):
static double GetAverage(IEnumerable<double> values) {
    int i = 0;
    double avg = 0.0;
    foreach (double value in values) {
        avg = (((double)i / (double)(i + 1)) * avg) + ((1.0 / (double)(i + 1)) * value);
        i++;
    }
    return avg;
}
Actually, this simplifies nicely into (already provided by martinus):
static double GetAverage(IEnumerable<double> values) {
    int i = 1;
    double avg = 0.0;
    foreach (double value in values) {
        avg += (value - avg) / (i++);
    }
    return avg;
}
I wrote a quick test to try this function out against the more conventional method of summing up the values and dividing by the count (GetAverage_old). For my input I wrote this quick function to return as many random positive doubles as desired:
static IEnumerable<double> GetRandomDoubles(long numValues, double maxValue, int seed) {
    Random r = new Random(seed);
    for (long i = 0L; i < numValues; i++)
        yield return r.NextDouble() * maxValue;
    yield break;
}
And here are the results of a few test trials:
long N = 100L;
double max = double.MaxValue * 0.01;
IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 1.00535024998431E+306
double newWay = GetAverage(doubles); // 1.00535024998431E+306
doubles = GetRandomDoubles(N, max, 1);
oldWay = GetAverage_old(doubles); // 8.75142021696299E+305
newWay = GetAverage(doubles); // 8.75142021696299E+305
doubles = GetRandomDoubles(N, max, 2);
oldWay = GetAverage_old(doubles); // 8.70772312848651E+305
newWay = GetAverage(doubles); // 8.70772312848651E+305
OK, but what about for 10^9 values?
long N = 1000000000;
double max = 100.0; // we start small, to verify accuracy
IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 49.9994879713857
double newWay = GetAverage(doubles); // 49.9994879713868 -- pretty close
max = double.MaxValue * 0.001; // now let's try something enormous
doubles = GetRandomDoubles(N, max, 0);
oldWay = GetAverage_old(doubles); // Infinity
newWay = GetAverage(doubles); // 8.98837362725198E+305 -- no overflow
Naturally, how acceptable this solution is will depend on your accuracy requirements. But it's worth considering.
Check out the section on the cumulative moving average.
To keep the logic simple with acceptable (if not the best) performance, I recommend using BigDecimal together with a primitive type.
The concept is very simple: you use the primitive type to sum values together; whenever the value would underflow or overflow, you move the accumulated value into the BigDecimal and then reset the primitive for the next round of summing. One more thing you should be aware of: when you construct a BigDecimal, you ought to always use String instead of double.
BigDecimal average(double[] values){
    BigDecimal totalSum = BigDecimal.ZERO;
    double tempSum = 0.00;
    for (double value : values){
        if (isOutOfRange(tempSum, value)) {
            totalSum = sum(totalSum, tempSum);
            tempSum = 0.00;
        }
        tempSum += value;
    }
    totalSum = sum(totalSum, tempSum);
    BigDecimal count = new BigDecimal(values.length);
    // use a MathContext so divide() cannot throw on non-terminating decimals
    return totalSum.divide(count, MathContext.DECIMAL128);
}

BigDecimal sum(BigDecimal val1, double val2){
    BigDecimal val = new BigDecimal(String.valueOf(val2));
    return val1.add(val);
}

boolean isOutOfRange(double sum, double value){
    // because sum + value > max would overflow if both sum and value are positive,
    // the test is rearranged to value > max - sum
    if (sum >= 0.00 && value > Double.MAX_VALUE - sum){
        return true;
    }
    // because sum + value < -max would overflow if both sum and value are negative,
    // the test is rearranged to value < -max - sum
    // (note: -Double.MAX_VALUE is the most negative double, not Double.MIN_VALUE)
    if (sum < 0.00 && value < -Double.MAX_VALUE - sum){
        return true;
    }
    return false;
}
With this concept, every time the result would underflow or overflow, we keep that value in the bigger variable. This solution might slow down performance a bit due to the BigDecimal calculations, but it guarantees runtime stability.
Why so many complicated long answers? Here is the simplest way to keep a running average, without any need to know how many elements there are in advance:
int i = 0;
double average = 0;
while (/* there are still elements */) {
    average = average * ((double) i / (i + 1)) + X[i] / (double) (i + 1);
    i++;
}
return average;
