I was playing around with the Random class's nextDouble() method as shown below. I expected nextDouble() to return a pseudorandom double value on the interval [-50.0, 50.0), however, after running the loop 1 billion times the output came out to maximum: 49.99999995014588 minimum: -49.99999991024878. I ran the loop without my manipulations of the output interval, and I got maximum: 0.9999999998979311 minimum: 0.0. I find this strange, because all I have done to the 0.0 that was returned is multiply it by 100.0 and subtract 50.0 from it. Why does this code snippet below never return exactly -50.0?
EDIT: Just for fun I ran the loop another 500 million times, and the output is now: maximum: 49.99999994222232 minimum: -49.999999996750944.
import java.util.Random;
public class randomTest{
public static void main(String[] args) {
double max = 0;
double min = 0;
Random math = new Random();
for(int a = 0; a < 1000000000; a++) {
double rand = math.nextDouble() * 100.0 - (100.0 / 2.0);
max = Math.max(max, rand);
min = Math.min(min, rand);
}
System.out.println("maximum: " + max + " minimum: " + min);
}
}
The javadoc clearly states that the upper bound on nextDouble() is exclusive not inclusive. That means that 1.0 will not be returned.
According to the javadoc, 0.0 will be returned, with a probability of approximately 1 in 2^54. (That is one time in 18,014,398,509,481,984.)
(It boils down to determining whether two successive calls to next(27) will return zero. That is possible, if you examine the specification for the LCNG used by next(int).)
So, your code doesn't hit 50.0 because it can't. It should be able to hit -50.0 but you would probably need to run it in the order of 1.0E19 times for that to happen. You only ran it 5.0E8 times.
nextDouble() works by first generating a random long, i.e. an integer spread evenly between the numbers -2^63 and 2^63 - 1. If you generate one billion numbers, you are still generating only 10^9 / 2^64 = 5.421 x 10^-11 of the possibilities, a tiny fraction. Thus the odds that any particular number will be generated are extremely tiny.
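As a rough check of that arithmetic, the fraction can be computed directly. This is only a sketch: the class and method names are made up for illustration, and it ignores the possibility of repeated values.

```java
final class CoverageEstimate {
    /** Fraction of the 2^64 possible long values that `samples` draws could cover at most. */
    static double fractionOfLongs(double samples) {
        return samples / Math.pow(2.0, 64);
    }

    public static void main(String[] args) {
        // One billion draws, as in the question.
        System.out.println(fractionOfLongs(1e9));
    }
}
```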
Even accounting for rounding, the chance is still small. Note that your output contains 16 significant digits, which means that there are somewhere between 10^15 and 10^16 possible sequences of decimal digits you can generate. If you only generate 10^9 of those, the probability of generating any particular number is 10^-7.
Taken from oracle docs:
public double nextDouble() Returns the next pseudorandom, uniformly
distributed double value between 0.0 and 1.0 from this random number
generator's sequence. The general contract of nextDouble is that one
double value, chosen (approximately) uniformly from the range 0.0d
(inclusive) to 1.0d (exclusive), is pseudorandomly generated and
returned.
The method nextDouble is implemented by class Random as if by:
public double nextDouble() {
  return (((long)next(26) << 27) + next(27))
    / (double)(1L << 53);
}
The hedge "approximately" is used in the foregoing description only because the next method is only
approximately an unbiased source of independently chosen bits. If it
were a perfect source of randomly chosen bits, then the algorithm
shown would choose double values from the stated range with perfect
uniformity.
[In early versions of Java, the result was incorrectly calculated as:
return (((long)next(27) << 27) + next(27))
  / (double)(1L << 54);
This might seem to be equivalent, if not better, but in fact it introduced a large nonuniformity because of the
bias in the rounding of floating-point numbers: it was three times as
likely that the low-order bit of the significand would be 0 than that
it would be 1! This nonuniformity probably doesn't matter much in
practice, but we strive for perfection.]
So it's clear that the max value isn't included when generating the number.
If you need an inclusive upper bound, implement it yourself. Something like this works for me:
public double nextDoubleInclusive()
{
return myRandom.nextInt(Integer.MAX_VALUE) / (Integer.MAX_VALUE - 1.0);
}
Notice that you probably didn't get 0.0 when running without the offset.
Your "min" starts at 0.0, so it reports 0.0 even if nextDouble() never returns it.
With a small change to your code (min = 1) you can see that you aren't getting 0.0 (you could, but the odds are against you).
double max = 0;
double min = 1;
Random math = new Random();
for(int a = 0; a < 1000000000; a++) {
double rand = math.nextDouble();
max = Math.max(max, rand);
min = Math.min(min, rand);
}
System.out.println("maximum: " + max + " minimum: " + min);
maximum: 0.9999999989149039 minimum: 4.5566594941703897E-10
I have a map of items with some probability distribution:
Map<SingleObjectiveItem, Double> itemsDistribution;
Given a certain m I have to generate a Set of m elements sampled from the above distribution.
As of now I was using the naive way of doing it:
while (mySet.size() < m)
    mySet.add(getNextSample(itemsDistribution));
The getNextSample(...) method fetches an object from the distribution as per its probability. Now, as m increases the performance severely suffers. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function remains in the while loop for too long. Generate 1000 such sets and you have an application that crawls.
Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?
UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, in the first run we'll definitely get a sample for the set. In the 2nd iteration, it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., goes through the entire cumulative distribution to find the sample, then the run time complexity of the loop is at least n*m(m+1)/2, where 'n' is the number of elements in the distribution. If m = cn; 0<c<=1 then the loop is at least Omega(n^3). And that too is the lower bound!
If we replace sequential search by binary search, the complexity would be at least Omega(log n * n^2). More efficient, but maybe not by a large margin.
Also, removing from the distribution is not possible since I call the above loop k times, to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.
Start out by generating a number of random points in two dimensions.
Then apply your distribution.
Now find all points that fall within the distribution and pick their x coordinates, and you have your random numbers with the requested distribution.
The problem is unlikely to be the loop you show:
Let n be the size of the distribution, and I be the number of invocations to getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations to getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i / n, and therefore exponentially distributed with rate λ. Therefore, E[C_i] = 1 / λ = 1 / (1 - i / n) <= 1 / (1 - m / n), and so E[I] < m / (1 - m / n).
That is, sampling a set of size m = n/2 will take, on average, less than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).
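To get a feel for that bound, here is a small sketch that adds up the expected number of calls term by term. It assumes all n items are equally likely (a non-uniform distribution will need more calls), and the class and method names are invented for this example.

```java
final class ExpectedDraws {
    /** Expected number of draws needed to collect m distinct items out of n equally likely ones. */
    static double expectedCalls(int n, int m) {
        double total = 0.0;
        for (int i = 0; i < m; i++) {
            // While the set holds i items, a draw is new with probability 1 - i/n.
            total += 1.0 / (1.0 - (double) i / n);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1000, m = 500;
        System.out.println("expected calls: " + expectedCalls(n, m));
        System.out.println("upper bound:    " + m / (1.0 - (double) m / n));
    }
}
```

For n = 1000 and m = 500 the expectation stays well below the bound of 1000 calls.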
The following should be faster (if m < 0.8 n)
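import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {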
private double[] cummulativeWeight;
private T[] item;
private double totalWeight;
Distribution(Map<T, Double> probabilityMap) {
int i = 0;
cummulativeWeight = new double[probabilityMap.size()];
item = (T[]) new Object[probabilityMap.size()];
for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
item[i] = entry.getKey();
totalWeight += entry.getValue();
cummulativeWeight[i] = totalWeight;
i++;
}
}
T randomItem() {
double weight = Math.random() * totalWeight;
int index = Arrays.binarySearch(cummulativeWeight, weight);
if (index < 0) {
index = -index - 1;
}
return item[index];
}
Set<T> randomSubset(int size) {
Set<T> set = new HashSet<>();
while(set.size() < size) {
set.add(randomItem());
}
return set;
}
}
public class Test {
public static void main(String[] args) {
int max = 1_000_000;
HashMap<Integer, Double> probabilities = new HashMap<>();
for (int i = 0; i < max; i++) {
probabilities.put(i, (double) i);
}
Distribution<Integer> d = new Distribution<>(probabilities);
Set<Integer> set = d.randomSubset(max / 2);
//System.out.println(set);
}
}
The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.
As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:
Set<T> randomSubset(int size) {
Set<T> set = new HashSet<>();
while(set.size() < size) {
T randomItem = randomItem();
remove(randomItem); // removes the item from the distribution
set.add(randomItem);
}
return set;
}
To efficiently implement remove requires a different representation for the distribution, for instance a binary tree where each node stores the total weight of the subtree whose root it is.
But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.
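For reference, a minimal sketch of such a tree is shown below. It is not the only possible layout: it assumes the weights are non-negative, identifies items by their array index, and stores the tree in a flat array where each inner node holds the sum of its two children. The class name RemovableDistribution and its methods are hypothetical.

```java
import java.util.Random;

final class RemovableDistribution {
    private final double[] tree; // tree[1] is the root; leaves start at index `leaves`
    private final int leaves;
    private final Random random;

    RemovableDistribution(double[] weights, Random random) {
        int size = 1;
        while (size < weights.length) size <<= 1; // round up to a power of two
        this.leaves = size;
        this.tree = new double[2 * size];
        this.random = random;
        System.arraycopy(weights, 0, tree, leaves, weights.length);
        for (int node = leaves - 1; node >= 1; node--) {
            tree[node] = tree[2 * node] + tree[2 * node + 1];
        }
    }

    /** Picks an index with probability proportional to its remaining weight, then removes it. */
    int sampleAndRemove() {
        if (tree[1] <= 0) throw new IllegalStateException("no items with positive weight left");
        double r = random.nextDouble() * tree[1];
        int node = 1;
        while (node < leaves) {
            int left = 2 * node, right = left + 1;
            // Never descend into an empty subtree, even if rounding pushed r slightly off.
            if (tree[right] <= 0 || (tree[left] > 0 && r < tree[left])) {
                node = left;
            } else {
                r -= tree[left];
                node = right;
            }
        }
        int index = node - leaves;
        tree[node] = 0; // remove the item ...
        for (node /= 2; node >= 1; node /= 2) {
            tree[node] = tree[2 * node] + tree[2 * node + 1]; // ... and refresh the sums above it
        }
        return index;
    }
}
```

Both sampling and removal take O(log n). If several sets are needed from the same distribution, a fresh instance can be built from the original weights for each set.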
If you are not too concerned with randomness properties, then I do it like this:
create buffer for pseudo-random numbers
double buff[MAX]; // [edit1] double pseudo random numbers
MAX is the buffer size and should be big enough ... 1024*128 for example
the type can be anything (float, int, DWORD...)
fill buffer with numbers
you have a range of numbers x = <x0,x1> and a probability function probability(x) defined by your probability distribution, so do this:
for (i=0,x=x0;x<=x1;x+=stepx)
for (j=0,n=probability(x)*MAX,q=0.1*stepx/n;j<n;j++,i++) // [edit1] unique pseudo-random numbers
buff[i]=x+(double(i)*q); // [edit1] ...
stepx is your accuracy for items (for integral types = 1). Now the buff[] array has the same distribution as you need, but it is not pseudo-random. You should also add a check that i is not >= MAX to avoid array overruns; at the end, the real size of buff[] is i (it can be less than MAX due to rounding).
shuffle buff[]
do just a few loops of swapping buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0,MAX)
write your pseudo-random function
it just returns a number from the buffer. The first call returns buff[0], the second buff[1], and so on ... For standard generators, when you hit the end of buff[] you shuffle buff[] again and start from buff[0]. But as you need unique numbers, you cannot reach the end of the buffer, so set MAX to be big enough for your task; otherwise uniqueness will not be assured.
[Notes]
MAX should be big enough to store the whole distribution you want. If it is not big enough then items with low probability can be missing completely.
[edit1] - tweaked answer a little to match the question needs (pointed by meriton thanks)
PS. complexity of initialization is O(N) and for get number is O(1).
You should implement your own random number generator (using a Monte Carlo method or any good uniform generator such as the Mersenne Twister), based on the inversion method.
For example, for the exponential law: generate a uniform random number u in [0,1); then your exponentially distributed random variable is ln(1-u)/(-lambda), lambda being the exponential law parameter and ln the natural logarithm.
Hope it'll help ;).
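The exponential example above could be sketched in Java roughly as follows. java.util.Random stands in for whatever uniform generator you prefer, and the class and method names are only illustrative.

```java
import java.util.Random;

final class ExponentialSampler {
    /** Inversion method: maps a uniform u in [0,1) to an exponential variate with rate lambda. */
    static double fromUniform(double u, double lambda) {
        return -Math.log(1.0 - u) / lambda;
    }

    static double next(Random rng, double lambda) {
        return fromUniform(rng.nextDouble(), lambda);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(next(rng, 2.0));
        }
    }
}
```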
I think you have two problems:
Your itemDistribution doesn't know you need a set, so when the set you are building gets
large you will pick a lot of elements that are already in the set. If you start with the
set all full and remove elements you will run into the same problem for very small sets.
Is there a reason why you don't remove the element from the itemDistribution after you
picked it? Then you wouldn't pick the same element twice?
The choice of data structure for itemDistribution looks suspicious to me. You want the
getNextSample operation to be fast. Doesn't the map from values to probability force you
to iterate through large parts of the map for each getNextSample? I'm no good at
statistics, but couldn't you represent the itemDistribution the other way, like a map from
probability, or maybe the sum of all smaller probabilities + probability, to an element
of the set?
Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.
A good way to pick several unique random items from a list is to first shuffle the list and then pop items off the list. You can shuffle the list once with the given distribution. From then on, picking your m items is just popping the list.
Here's an implementation of a probabilistic shuffle:
List<Item> prob_shuffle(Map<Item, int> dist)
{
int n = dist.length;
List<Item> a = dist.keys();
int psum = 0;
int i, j;
for (i in dist) psum += dist[i];
for (i = 0; i < n; i++) {
int ip = rand(psum); // 0 <= ip < psum
int jp = 0;
for (j = i; j < n; j++) {
jp += dist[a[j]];
if (ip < jp) break;
}
psum -= dist[a[j]];
Item tmp = a[i];
a[i] = a[j];
a[j] = tmp;
}
return a;
}
This is not Java, but pseudocode after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.
Here, I used integer probabilities. (The probabilities don't have to add up to a special value, it's just "bigger is better".) You can use floating-point numbers, but because of inaccuracies, you might end up going beyond the array when picking an item. You should use item n - 1 then. If you add that safety net, you could even have items with zero probability that always get picked last.
There might be a method to speed up the picking loop, but I don't really see how. The swapping renders any precalculations useless.
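A possible Java rendering of the pseudocode above is sketched below. It assumes non-negative integer weights whose total fits in an int, and it includes the "use item n - 1" safety net, so zero-weight items end up last. The class name WeightedShuffle is made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class WeightedShuffle {
    /** Returns the keys in random order; heavier items tend to appear earlier. */
    static <T> List<T> shuffle(Map<T, Integer> weights, Random rng) {
        List<T> items = new ArrayList<>(weights.keySet());
        int n = items.size();
        int remaining = 0; // total weight of the unshuffled area
        for (int w : weights.values()) remaining += w;

        for (int i = 0; i < n; i++) {
            int j = i;
            if (remaining > 0) {
                int pick = rng.nextInt(remaining); // 0 <= pick < remaining
                int running = 0;
                // Stop at n - 1 at the latest, so j never runs past the end of the list.
                for (j = i; j < n - 1; j++) {
                    running += weights.get(items.get(j));
                    if (pick < running) break;
                }
            }
            remaining -= weights.get(items.get(j));
            Collections.swap(items, i, j); // move the picked item into the shuffled area
        }
        return items;
    }
}
```

Taking the first m entries of the returned list then gives m unique items.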
Accumulate your probabilities in a table
Probability
Item Actual Accumulated
Item1 0.10 0.10
Item2 0.30 0.40
Item3 0.15 0.55
Item4 0.20 0.75
Item5 0.25 1.00
Make a random number between 0.0 and 1.0 and do a binary search for the first item with a sum that is greater than your generated number. This item would have been chosen with the desired probability.
Ebbe's method is called rejection sampling.
I sometimes use a simple method, using an inverse cumulative distribution function, which is a function that maps a number X between 0 and 1 onto the Y axis.
Then you just generate a uniformly distributed random number between 0 and 1, and apply the function to it.
That function is also called the "quantile function".
For example, suppose you want to generate a normally distributed random number.
Its cumulative distribution function is called Phi.
The inverse of that is called probit.
There are many ways to generate normal variates, and this is just one example.
You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table.
Then you can just invert it by table-lookup and interpolation.
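The table-lookup idea might look like this in Java. It is a sketch under a few assumptions: the table is given as two parallel arrays with at least two entries, the cdf column is non-decreasing and runs from 0 to 1, and a linear scan is good enough (a binary search would be the obvious refinement for large tables). All names are hypothetical.

```java
final class TabulatedQuantile {
    private final double[] xs;  // sample points, ascending
    private final double[] cdf; // cumulative probabilities at those points

    TabulatedQuantile(double[] xs, double[] cdf) {
        if (xs.length != cdf.length || xs.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        this.xs = xs.clone();
        this.cdf = cdf.clone();
    }

    /** Approximate inverse CDF: maps u in [0,1] to x by linear interpolation in the table. */
    double quantile(double u) {
        int hi = 1;
        while (hi < cdf.length - 1 && cdf[hi] < u) hi++;
        int lo = hi - 1;
        double span = cdf[hi] - cdf[lo];
        double t = span == 0 ? 0 : (u - cdf[lo]) / span;
        return xs[lo] + t * (xs[hi] - xs[lo]);
    }
}
```

Feeding quantile() with the output of a uniform generator such as Random.nextDouble() then yields values that approximately follow the tabulated distribution.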
I have a requirement to calculate the average of a very large set of doubles (10^9 values). The sum of the values exceeds the upper bound of a double, so does anyone know any neat little tricks for calculating an average that doesn't require also calculating the sum?
I am using Java 1.5.
You can calculate the mean iteratively. This algorithm is simple, fast, you have to process each value just once, and the variables never get larger than the largest value in the set, so you won't get an overflow.
double mean(double[] ary) {
double avg = 0;
int t = 1;
for (double x : ary) {
avg += (x - avg) / t;
++t;
}
return avg;
}
Inside the loop avg always is the average value of all values processed so far. In other words, if all the values are finite you should not get an overflow.
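To illustrate the difference, the following sketch compares the iterative mean with a naive sum-then-divide on values close to the top of the double range. The class name and the naiveMean helper are only there for the comparison.

```java
final class RunningMeanDemo {
    /** Iterative mean, as described above. */
    static double mean(double[] values) {
        double avg = 0;
        int t = 1;
        for (double x : values) {
            avg += (x - avg) / t;
            ++t;
        }
        return avg;
    }

    /** Conventional mean; the intermediate sum can overflow to Infinity. */
    static double naiveMean(double[] values) {
        double sum = 0;
        for (double x : values) sum += x;
        return sum / values.length;
    }

    public static void main(String[] args) {
        double huge = Double.MAX_VALUE / 2;
        double[] data = {huge, huge, huge, huge};
        System.out.println("iterative: " + mean(data));
        System.out.println("naive:     " + naiveMean(data));
    }
}
```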
The very first issue I'd like to ask you is this:
Do you know the number of values beforehand?
If not, then you have little choice but to sum, and count, and divide, to do the average. If Double isn't high enough precision to handle this, then tough luck, you can't use Double, you need to find a data type that can handle it.
If, on the other hand, you do know the number of values beforehand, you can look at what you're really doing and change how you do it, but keep the overall result.
The average of N values, stored in some collection A, is this:
A[0]/N + A[1]/N + A[2]/N + A[3]/N + ... + A[N-2]/N + A[N-1]/N
To calculate subsets of this result, you can split up the calculation into equally sized sets, so you can do this for 3-valued sets (assuming the number of values is divisible by 3; otherwise you need a different divisor):
(A[0]/3 + A[1]/3 + A[2]/3) / (N/3) + (A[3]/3 + A[4]/3 + A[5]/3) / (N/3) + ... + (A[N-3]/3 + A[N-2]/3 + A[N-1]/3) / (N/3)
Note that you need equally sized sets, otherwise numbers in the last set, which will not have enough values compared to all the sets before it, will have a higher impact on the final result.
Consider the numbers 1-7 in sequence, if you pick a set-size of 3, you'll get this result:
(1/3 + 2/3 + 3/3) / y + (4/3 + 5/3 + 6/3) / y + (7/3) / y
which gives:
2/y + 5/y + (7/3)/y
If y is 3 for all the sets, you get this:
2/3 + 5/3 + (7/3)/3
which gives:
(2*3)/9 + (5*3)/9 + 7/9
which is:
6/9 + 15/9 + 7/9
which totals:
28/9 ≈ 3.111...
The average of 1-7 is 4. Obviously this won't work. Note that if you do the above exercise with the numbers 1, 2, 3, 4, 5, 6, 7, 0, 0 (note the two zeroes at the end there), then you'll get the above result.
In other words, if you can't split the number of values up into equally sized sets, the last set will be counted as though it has the same number of values as all the sets preceding it, but it will be padded with zeroes for all the missing values.
So, you need equally sized sets. Tough luck if your original input set consists of a prime number of values.
What I'm worried about here though is loss of precision. I'm not entirely sure Double will give you good enough precision in such a case, if it initially cannot hold the entire sum of the values.
Apart from using the better approaches already suggested, you can use BigDecimal to make your calculations. (Bear in mind it is immutable)
IMHO, the most robust way of solving your problem is
sort your set
split in groups of elements whose sum wouldn't overflow - since they are sorted, this is fast and easy
do the sum in each group - and divide by the group size
do the sum of the group sums (possibly calling this same algorithm recursively) - be aware that if the groups are not equally sized, you'll have to weight them by their size
One nice thing of this approach is that it scales nicely if you have a really large number of elements to sum - and a large number of processors/machines to use to do the math
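The weighting by group size mentioned in the last step can be sketched as follows. This simplified version skips the sorting and just uses a fixed group size, on the assumption that the sum of any single group does not overflow; the class name and the chunkSize parameter are invented for the example.

```java
final class ChunkedMean {
    /** Averages each group separately, then merges the group means weighted by group size. */
    static double mean(double[] values, int chunkSize) {
        double mean = 0.0;
        long seen = 0;
        for (int start = 0; start < values.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, values.length);
            double chunkSum = 0.0;
            for (int i = start; i < end; i++) chunkSum += values[i];
            int len = end - start;
            double chunkMean = chunkSum / len;
            seen += len;
            // Weighted update: a short final group only counts for its own length.
            mean += (chunkMean - mean) * len / seen;
        }
        return mean;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1, 2, 3, 4, 5, 6, 7}, 3));
    }
}
```

Because each group is weighted by its length, a short last group (such as the lone 7 when averaging 1 to 7 in groups of 3) does not distort the result.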
Please clarify the potential ranges of the values.
Given that a double has a range ~= +/-10^308, and you're summing 10^9 values, the apparent range suggested in your question is values of the order of 10^299.
That seems somewhat, well, unlikely...
If your values really are that large, then with a normal double you've got only 17 significant decimal digits to play with, so you'll be throwing away about 280 digits worth of information before you can even think about averaging the values.
I would also note (since no-one else has) that for any set of numbers X:
mean(X) = sum(X[i] - c) / N + c
for any arbitrary constant c.
In this particular problem, setting c = min(X) might dramatically reduce the risk of overflow during the summation.
May I humbly suggest that the problem statement is incomplete...?
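For what it's worth, the shift by c = min(X) could be written like this. It is only a sketch: it needs two passes over the data, it assumes the shifted sum itself stays within range, and the class name is hypothetical.

```java
final class ShiftedMean {
    /** Computes mean(X) as sum(X[i] - c) / N + c with c = min(X). */
    static double mean(double[] xs) {
        if (xs.length == 0) throw new IllegalArgumentException("empty input");
        double c = xs[0];
        for (double x : xs) c = Math.min(c, x); // first pass: find the minimum
        double sum = 0.0;
        for (double x : xs) sum += x - c;       // second pass: sum the offsets from c
        return sum / xs.length + c;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1e308, 1e308, 1e308}));
    }
}
```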
A double can be divided by a power of 2 without loss of precision (as long as the result does not underflow). So if your only problem is the absolute size of the sum, you could pre-scale your numbers before summing them. But with a dataset of this size, there is still the risk that you will hit a situation where you are adding small numbers to a large one, and the small numbers will end up being mostly (or completely) ignored.
For instance, when you add 2.2e-20 to 9.0e20 the result is 9.0e20, because once the scales are adjusted so that the numbers can be added together, the smaller number is 0. Doubles can only hold about 17 digits, and you would need more than 40 digits to add these two numbers together without loss.
So, depending on your data set and how many digits of precision you can afford to lose, you may need to do other things. Breaking the data into sets will help, but a better way to preserve precision might be to determine a rough average (you may already know this number), then subtract the rough average from each value before you sum it. That way you are summing the distances from the average, so your sum should never get very large.
Then you take the average delta, and add it to your rough average to get the correct average. Keeping track of the min and max delta will also tell you how much precision you lost during the summing process. If you have lots of time and need a very accurate result, you can iterate.
You could take the average of averages of equal-sized subsets of numbers that don't exceed the limit.
Divide all values by the set size and then sum them up.
Option 1 is to use an arbitrary-precision library so you don't have an upper-bound.
Other options (which lose precision) are to sum in groups rather than all at once, or to divide before summing.
So I don't repeat myself so much, let me state that I am assuming that the list of numbers is normally distributed, and that you can sum many numbers before you overflow. The technique still works for non-normal distributions, but some things will not meet the expectations I describe below.
--
Sum up a sub-series, keeping track of how many numbers you eat, until you approach the overflow, then take the average. This will give you an average a0, and count n0. Repeat until you exhaust the list. Now you should have many ai, ni.
Each ai and ni should be relatively close, with the possible exception of the last bite of the list. You can mitigate that by under-biting near the end of the list.
You can combine any subset of these ai, ni by picking any ni in the subset (call it np) and dividing all the ni in the subset by that value. The max size of the subsets to combine is the roughly constant value of the n's.
The ni/np should be close to one. Now sum ni/np * ai and multiply by np/(sum ni), keeping track of sum ni. This gives you a new ni, ai combination, if you need to repeat the procedure.
If you will need to repeat (i.e., the number of ai, ni pairs is much larger than the typical ni), try to keep relative n sizes constant by combining all the averages at one n level first, then combining at the next level, and so on.
First of all, make yourself familiar with the internal representation of double values. Wikipedia should be a good starting point.
Then, consider that doubles are expressed as "value plus exponent" where exponent is a power of two. The limit of the largest double value is an upper limit of the exponent, and not a limit of the value! So you may divide all large input numbers by a large enough power of two. This should be safe for all large enough numbers. You can re-multiply the result with the factor to check whether you lost precision with the multiplication.
Here we go with an algorithm
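// Returns the average of the numbers, scaling large values down by 2^30 before summing.
public static double average(double[] numbers) {
  double eachSum = 0, tempSum = 0;
  double factor = Math.pow(2.0, 30); // about as large as 10^9
  for (double each : numbers) {
    double temp = each / factor;
    if (temp * factor != each) {
      // scaling lost precision, so sum the original value instead
      eachSum += each;
    } else {
      tempSum += temp;
    }
  }
  return (tempSum / numbers.length) * factor + (eachSum / numbers.length);
}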
And don't be worried by the additional division and multiplication. The FPU will optimize the hell out of them since they are done with a power of two (for comparison, imagine adding and removing digits at the end of a decimal number).
PS: in addition, you may want to use Kahan summation to improve the precision. Kahan summation avoids loss of precision when very large and very small numbers are summed up.
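A minimal sketch of Kahan (compensated) summation in Java is given below. It addresses rounding error only, not overflow, so it would be combined with one of the scaling or averaging approaches; the class name is made up.

```java
final class KahanSum {
    /** Compensated summation: carries the rounding error of each addition into the next one. */
    static double sum(double[] values) {
        double sum = 0.0;
        double c = 0.0; // running compensation for lost low-order bits
        for (double v : values) {
            double y = v - c;    // apply the correction from the previous step
            double t = sum + y;  // low-order bits of y may be lost here
            c = (t - sum) - y;   // recover what was lost (algebraically zero)
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_001];
        data[0] = 1.0;
        java.util.Arrays.fill(data, 1, data.length, 1e-16);
        double naive = 0.0;
        for (double v : data) naive += v;
        System.out.println("naive: " + naive);
        System.out.println("kahan: " + sum(data));
    }
}
```

In this example every individual 1e-16 is too small to change a plain running sum of 1.0, while the compensated version keeps track of the accumulated small terms.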
I posted an answer to a question spawned from this one, realizing afterwards that my answer is better suited to this question than to that one. I've reproduced it below. I notice though, that my answer is similar to a combination of Bozho's and Anon.'s.
As the other question was tagged language-agnostic, I chose C# for the code sample I've included. Its relative ease of use and easy-to-follow syntax, along with its inclusion of a couple of features facilitating this routine (a DivRem function in the BCL, and support for iterator functions), as well as my own familiarity with it, made it a good choice for this problem. Since the OP here is interested in a Java solution, but I'm not Java-fluent enough to write it effectively, it might be nice if someone could add a translation of this code to Java.
Some of the mathematical solutions here are very good. Here's a simple technical solution.
Use a larger data type. This breaks down into two possibilities:
Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.
I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.
If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple (untested) example in C# follows
class BigMeanSet{
List<uint> list = new List<uint>();
public double GetAverage(IEnumerable<uint> values){
list.Clear();
list.Add(0);
uint count = 0;
foreach(uint value in values){
Add(0, value);
count++;
}
return DivideBy(count);
}
void Add(int listIndex, uint value){
if((list[listIndex] += value) < value){ // then overflow has occurred
if(list.Count == listIndex + 1)
list.Add(0);
Add(listIndex + 1, 1);
}
}
double DivideBy(uint count){
const double shift = 4.0 * 1024 * 1024 * 1024;
double rtn = 0;
long remainder = 0;
for(int i = list.Count - 1; i >= 0; i--){
rtn *= shift;
remainder <<= 32;
rtn += Math.DivRem(remainder + list[i], count, out remainder);
}
rtn += remainder / (double)count;
return rtn;
}
}
Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.
This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 2^32 - 1. If more elements are needed, then the count variable will need to be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.
In terms of efficiency, it should be as fast or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.
If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.
I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).
I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 2^32 - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.
I then tested with 100* large series, with random length between 10^5 and 10^9. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.
*Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 10^9 takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.
For those interested, my test code is below:
static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
for(uint i = lowerbound; i <= upperbound; i++)
yield return i;
}
static void Test(){
Console.BufferHeight = 1200;
Random rnd = new Random();
for(int i = 0; i < 1000; i++){
uint[] numbers = new uint[rnd.Next(1, 1000)];
for(int j = 0; j < numbers.Length; j++)
numbers[j] = (uint)rnd.Next();
double sum = 0;
foreach(uint n in numbers)
sum += n;
double avg = sum / numbers.Length;
double ans = new BigMeanSet().GetAverage(numbers);
Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);
if(avg != ans)
Debugger.Break();
}
for(int i = 0; i < 100; i++){
uint length = (uint)rnd.Next(100000, 1000000001);
uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
uint upperbound = lowerbound + length;
double avg = ((double)lowerbound + upperbound) / 2;
double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));
Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);
if(avg != ans)
Debugger.Break();
}
}
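For Java readers, the same "keep the sum in an integer type that cannot overflow" idea can be sketched with java.math.BigInteger instead of a hand-rolled list of 32-bit words. This is not a line-by-line translation of the C# code above: it takes long values, allocates a new BigInteger per addition (so it will be slower), and the class name is invented.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

final class BigIntegerMean {
    /** Exact integer sum in a BigInteger, followed by a single division at the end. */
    static double mean(long[] values) {
        if (values.length == 0) throw new IllegalArgumentException("empty input");
        BigInteger sum = BigInteger.ZERO;
        for (long v : values) {
            sum = sum.add(BigInteger.valueOf(v));
        }
        // DECIMAL128 keeps 34 significant digits, more than a double can hold.
        return new BigDecimal(sum)
                .divide(BigDecimal.valueOf(values.length), MathContext.DECIMAL128)
                .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(mean(new long[] {Long.MAX_VALUE, Long.MAX_VALUE}));
    }
}
```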
A random sampling of a small set of the full dataset will often result in a 'good enough' solution. You obviously have to make this determination yourself based on system requirements. Sample size can be remarkably small and still obtain reasonably good answers. This can be adaptively computed by calculating the average of an increasing number of randomly chosen samples - the average will converge within some interval.
Sampling not only addresses the double overflow concern, but is much, much faster. Not applicable for all problems, but certainly useful for many problems.
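A sampling-based estimate might be sketched like this. It assumes the data is available as an in-memory array, samples with replacement, and leaves the choice of sample size (or a convergence check) to the caller; the class and method names are hypothetical.

```java
import java.util.Random;

final class SampledMean {
    /** Estimates the mean from `sampleSize` randomly chosen elements (with replacement). */
    static double estimate(double[] data, int sampleSize, Random rng) {
        double avg = 0.0;
        for (int t = 1; t <= sampleSize; t++) {
            double x = data[rng.nextInt(data.length)];
            avg += (x - avg) / t; // running mean, so no large intermediate sum is built up
        }
        return avg;
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Random rng = new Random();
        for (int i = 0; i < data.length; i++) data[i] = rng.nextDouble() * 100.0;
        System.out.println(estimate(data, 10_000, rng));
    }
}
```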
Consider this:
avg(n1) : n1 = a1
avg(n1, n2) : ((1/2)*n1)+((1/2)*n2) = ((1/2)*a1)+((1/2)*n2) = a2
avg(n1, n2, n3) : ((1/3)*n1)+((1/3)*n2)+((1/3)*n3) = ((2/3)*a2)+((1/3)*n3) = a3
So for any set of doubles of arbitrary size, you could do this (this is in C#, but I'm pretty sure it could be easily translated to Java):
static double GetAverage(IEnumerable<double> values) {
int i = 0;
double avg = 0.0;
foreach (double value in values) {
avg = (((double)i / (double)(i + 1)) * avg) + ((1.0 / (double)(i + 1)) * value);
i++;
}
return avg;
}
Actually, this simplifies nicely into (already provided by martinus):
static double GetAverage(IEnumerable<double> values) {
int i = 1;
double avg = 0.0;
foreach (double value in values) {
avg += (value - avg) / (i++);
}
return avg;
}
I wrote a quick test to try this function out against the more conventional method of summing up the values and dividing by the count (GetAverage_old). For my input I wrote this quick function to return as many random positive doubles as desired:
static IEnumerable<double> GetRandomDoubles(long numValues, double maxValue, int seed) {
Random r = new Random(seed);
for (long i = 0L; i < numValues; i++)
yield return r.NextDouble() * maxValue;
yield break;
}
And here are the results of a few test trials:
long N = 100L;
double max = double.MaxValue * 0.01;
IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 1.00535024998431E+306
double newWay = GetAverage(doubles); // 1.00535024998431E+306
doubles = GetRandomDoubles(N, max, 1);
oldWay = GetAverage_old(doubles); // 8.75142021696299E+305
newWay = GetAverage(doubles); // 8.75142021696299E+305
doubles = GetRandomDoubles(N, max, 2);
oldWay = GetAverage_old(doubles); // 8.70772312848651E+305
newWay = GetAverage(doubles); // 8.70772312848651E+305
OK, but what about for 10^9 values?
long N = 1000000000;
double max = 100.0; // we start small, to verify accuracy
IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 49.9994879713857
double newWay = GetAverage(doubles); // 49.9994879713868 -- pretty close
max = double.MaxValue * 0.001; // now let's try something enormous
doubles = GetRandomDoubles(N, max, 0);
oldWay = GetAverage_old(doubles); // Infinity
newWay = GetAverage(doubles); // 8.98837362725198E+305 -- no overflow
Naturally, how acceptable this solution is will depend on your accuracy requirements. But it's worth considering.
Check out the section on the cumulative moving average.
To keep the logic simple, and to keep performance acceptable if not the best, I recommend using BigDecimal together with the primitive type.
The concept is very simple: you use the primitive type to sum values together, and whenever the sum would underflow or overflow, you move the accumulated value into the BigDecimal, then reset it for the next sum calculation. One more thing you should be aware of is that when you construct a BigDecimal, you ought to always use a String instead of a double.
import java.math.BigDecimal;

BigDecimal average(double[] values){
BigDecimal totalSum = BigDecimal.ZERO;
double tempSum = 0.00;
for (double value : values){
if (isOutOfRange(tempSum, value)) {
totalSum = sum(totalSum, tempSum);
tempSum = 0.00;
}
tempSum += value;
}
totalSum = sum(totalSum, tempSum);
BigDecimal count = new BigDecimal(values.length);
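// a MathContext avoids an ArithmeticException when the quotient has no finite decimal expansion
return totalSum.divide(count, java.math.MathContext.DECIMAL128);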
}
BigDecimal sum(BigDecimal val1, double val2){
BigDecimal val = new BigDecimal(String.valueOf(val2));
return val1.add(val);
}
boolean isOutOfRange(double sum, double value){
// sum + value > max would itself overflow if both sum and value are positive,
// so I rearrange the check to value > max - sum
if(sum >= 0.00 && value > Double.MAX_VALUE - sum){
return true;
}
// sum + value < min would itself overflow if both sum and value are negative,
// so I rearrange the check to value < min - sum (the most negative double is -Double.MAX_VALUE)
if(sum < 0.00 && value < -Double.MAX_VALUE - sum){
return true;
}
return false;
}
With this approach, every time the running sum is about to underflow or overflow, we move that value into the bigger variable. This solution might slow performance down a bit because of the BigDecimal calculations, but it guarantees runtime stability.
Why so many complicated long answers? Here is the simplest way to find the running average so far, without any need to know the number of elements, the size, etc.
long int i = 0;
double average = 0;
while(there are still elements)
{
average = average * ((double) i / (i + 1)) + X[i] / (i + 1); // cast avoids integer division; parentheses fix the precedence
i++;
}
return average;