Splitting an array into two subarrays with minimal sum

Splitting an array into two subarrays with minimal sum - java

My question is if given an array,we have to split that into two sub-arrays such that the absolute difference between the sum of the two arrays is minimum with a condition that the difference between number of elements of the arrays should be atmost one.
Let me give you an example.Suppose
Example 1: 100 210 100 75 340
Answer :
Array1{100,210,100} and Array2{75,340} --> Difference = |410-415|=5
Example 2: 10 10 10 10 40
Answer : Array1{10,10,10} and Array2{10,40} --> Difference = |30-50|=20
Here we can see that though we can divide the array into {10,10,10,10} and {40},we are not dividing because the constraint "the number of elements between the arrays should be atmost 1" will be violated if we do so.
Can somebody provide a solution for this ?
My approach:
->Calculate sum of the array
->Divide the sum by 2
->Let the size of the knapsack=sum/2
->Consider the weights of the array values as 1.(If you have come across the knapsack problem ,you may know about the weight concept)
->Then consider the array values as the values of the weights.
->Calculate the answer which will be array1 sum.
->Total sum-answer=array2 sum
This approach fails.
Calculating the two arrays sum is enough.We are not interested in which elements form the sum.
Thank you!
Source: This is an ICPC problem.

I have an algorithm that works in O(n3) time, but I have no hard proof it is optimal. It seems to work for every test input I give it (including some with negative numbers), so I figured it was worth sharing.
You start by splitting the input into two equally sized arrays (call them one[] and two[]?). Start with one[0], and see which element in two[] would give you the best result if swapped. Whichever one gives the best result, swap. If none give a better result, don't swap it. Then move on to the next element in one[] and do it again.
That part is O(2) by itself. The problem is, it might not get the best results the first time through. If you just keep doing it until you don't make any more swaps, you end up with an ugly bubble-type construction which makes it O(n3) total.
Here's some ugly Java code to demonstrate (also at ideone.com if you want to play with it):
static int[] input = {1,2,3,4,5,-6,7,8,9,10,200,-1000,100,250,-720,1080,200,300,400,500,50,74};
public static void main(String[] args) {
int[] two = new int[input.length/2];
int[] one = new int[input.length - two.length];
int totalSum = 0;
for(int i=0;i<input.length;i++){
totalSum += input[i];
if(i<one.length)
one[i] = input[i];
else
two[i-one.length] = input[i];
}
float goal = totalSum / 2f;
boolean swapped;
do{
swapped = false;
for(int j=0;j<one.length;j++){
int curSum = sum(one);
float curBestDiff = Math.abs(goal - curSum);
int curBestIndex = -1;
for(int i=0;i<two.length;i++){
int testSum = curSum - one[j] + two[i];
float diff = Math.abs(goal - testSum);
if(diff < curBestDiff){
curBestDiff = diff;
curBestIndex = i;
}
}
if(curBestIndex >= 0){
swapped = true;
System.out.println("swapping " + one[j] + " and " + two[curBestIndex]);
int tmp = one[j];
one[j] = two[curBestIndex];
two[curBestIndex] = tmp;
}
}
} while(swapped);
System.out.println(Arrays.toString(one));
System.out.println(Arrays.toString(two));
System.out.println("diff = " + Math.abs(sum(one) - sum(two)));
}
static int sum(int[] list){
int sum = 0;
for(int i=0;i<list.length;i++)
sum += list[i];
return sum;
}

Can you provide more information on the upper limit of the input?
For your algorithm, I think your are trying to pick floor(n/2) items and find it's maximum sum of value as array1 sum...(If this is not your original thought then please ignore the following lines)
If this is the case, then knapsack size should be n/2 instead of sum/2,
but even so, I think it's still not working. The ans is min(|a - b|) and maximizing a is a different issue. For eg, {2,2,10,10}, you will get a = 20, b = 4, while the ans is a = b = 12.
To answer the problem, I think I need more information of the upper limit of the input..
I cannot come up with a brilliant dp state but a 3-dimensional state
dp(i,n,v) := in first i-th items, pick n items out and make a sum of value v
each state is either 0 or 1 (false or true)
dp(i,n,v) = dp(i-1, n, v) | dp(i-1, n-1, v-V[i])
This dp state is so naive that it has a really high complexity which usually cannot pass a ACM / ICPC problem, so if possible please provide more information and see if I can come up another better solution...Hope I can help a bit :)

DP soluction will give lg(n) time. Two array, iterate one from start to end, and calculate the sum, the other iterate from end to start, and do the same thing. Finally, iterate from start to end and get minimal difference.

Related

Calculate absolute minimum difference between any two numbers of a huge integer array

I have a very long array in a Java program (300 000+ unsorted integers) and need to calculate the minimum absolute difference between any two numbers inside the array, and display the absolute difference and the corresponding pair of numbers as an output. The whole calculation should happen very quickly.
I have the following code, which would usually work:
private static void calcMinAbsDiff(int[] inputArray)
{
Arrays.sort(inputArray);
int minimum = Math.abs(inputArray[1] - inputArray[0]);
int firstElement = inputArray[0];
int secondElement = inputArray[1];
for (int i = 2; i < inputArray.length; i++)
{
if(Math.abs(inputArray[i] - inputArray[i-1]) < minimum)
{
minimum = Math.abs(inputArray[i] - inputArray[i-1]);
firstElement = inputArray[i-1];
secondElement = inputArray[i];
}
}
System.out.println("Minimum Absolute Difference : "+minimum);
System.out.println("Pair of Elements : ("+firstElement+", "+secondElement+")");
}
However, the output I receive is all 0s. I believe this is because the array is way too long.

If you have two or more zeros and no negative integers in your dataset, then your output is expected. After sorting, then inputArray[0] and inputArray[1] would both be 0, and the difference would be 0. No other pair of adjacent elements would have an absolute difference less than 0, so minimum, firstElement, and second Element would all be 0 at the end of the algorithm.
If you really have no zeros in your dataset, or if you do have negative integers, then you may have an initialization problem. Check this thread:
Why is my simple Array only printing zeros in java?
If that's not it, then only other thing I can think of is that you have a problem in the previous scope causing the data to get zeroed out.
I would try printing samples of your dataset at various points to see exactly where/when it's getting zeroed.
If you still have trouble, then post more info on the dataset and the scope which calls this function to help us see what's going on. Let us know how you make out!

Allocating N tonnes of food in K rooms with M capacity

I found this problem online:
You have N tonnes of food and K rooms to store them into. Every room has a capacity of M. In how many ways can you distribute the food in the rooms, so that every room has at least 1 ton of food.
My approach was to recursively find all possible variations that satisfy the conditions of the problem. I start with an array of size K, initialized to 1. Then I keep adding 1 to every element of the array and recursively check whether the new array satisfies the condition. However, the recursion tree gets too large too quickly and the program takes too long for slightly higher values of N, K and M.
What would be a more efficient algorithm to achieve this task? Are there any optimizations to be done to the existing algorithm implementation?
This is my implementation:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Scanner;
public class Main {
// keeping track of valid variations, disregarding duplicates
public static HashSet<String> solutions = new HashSet<>();
// calculating sum of each variation
public static int sum(int[] array) {
int sum = 0;
for (int i : array) {
sum += i;
}
return sum;
}
public static void distributionsRecursive(int food, int rooms, int roomCapacity, int[] variation, int sum) {
// if all food has been allocated
if (sum == food) {
// add solution to solutions
solutions.add(Arrays.toString(variation));
return;
}
// keep adding 1 to every index in current variation
for (int i = 0; i < rooms; i++) {
// create new array for every recursive call
int[] tempVariation = Arrays.copyOf(variation, variation.length);
// if element is equal to room capacity, can't add any more in it
if (tempVariation[i] == roomCapacity) {
continue;
} else {
tempVariation[i]++;
sum = sum(tempVariation);
// recursively call function on new variation
distributionsRecursive(food, rooms, roomCapacity, tempVariation, sum);
}
}
return;
}
public static int possibleDistributions(int food, int rooms, int roomCapacity) {
int[] variation = new int[rooms];
// start from all 1, keep going till all food is allocated
Arrays.fill(variation, 1);
distributionsRecursive(food, rooms, roomCapacity, variation, rooms);
return solutions.size();
}
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
int food = in.nextInt();
int rooms = in.nextInt();
int roomCapacity = in.nextInt();
int total = possibleDistributions(food, rooms, roomCapacity);
System.out.println(total);
in.close();
}
}

Yes, your recursion tree will become large if you do this in a naive manner. Let's say you have 10 tonnes and 3 rooms, and M=2. One valid arrangement is [2,3,5]. But you also have [2,5,3], [3,2,5], [3,5,2], [5,2,3], and [5,3,2]. So for every valid grouping of numbers, there are actually K! permutations.
A possibly better way to approach this problem would be to determine how many ways you can make K numbers (minimum M and maximum N) add up to N. Start by making the first number as large as possible, which would be N-(M*(K-1)). In my example, that would be:
10 - 2*(3-1) = 6
Giving the answer [6,2,2].
You can then build an algorithm to adjust the numbers to come up with valid combinations by "moving" values from left to right. In my example, you'd have:
6,2,2
5,3,2
4,4,2
4,3,3
You avoid the seemingly infinite recursion by ensuring that values are decreasing from left to right. For example, in the above you'd never have [3,4,3].
If you really want all valid arrangements, you can generate the permutations for each of the above combinations. I suspect that's not necessary, though.
I think that should be enough to get you started towards a good solution.

One solution would be to compute the result for k rooms from the result for k - 1 rooms.
I've simplified the problem a bit in allowing to store 0 tonnes in a room. If we have to store at least 1 we can just subtract this in advance and reduce the capacity of rooms by 1.
So we define a function calc: (Int,Int) => List[Int] that computes for a number of rooms and a capacity a list of numbers of combinations. The first entry contains the number of combinations we get for storing 0 , the next entry when storing 1 and so on.
We can easily compute this function for one room. So calc(1,m) gives us a list of ones up to the mth element and then it only contains zeros.
For a larger k we can define this function recursively. We just calculate calc(k - 1, m) and then build the new list by summing up prefixes of the old list. E.g. if we want to store 5 tons, we can store all 5 in the first room and 0 in the following rooms, or 4 in the first and 1 in the following and so on. So we have to sum up the combinations for 0 to 5 for the rest of the rooms.
As we have a maximal capacity we might have to leave out some of the combinations, i.e. if the room only has capacity 3 we must not count the combinations for storing 0 and 1 tons in the rest of the rooms.
I've implemented this approach in Scala. I've used streams (i.e. infinite Lists) but as you know the maximal amount of elements you need this is not necessary.
The time complexity of the approach should be O(k*n^2)
def calc(rooms: Int, capacity: Int): Stream[Long] =
if(rooms == 1) {
Stream.from(0).map(x => if(x <= capacity) 1L else 0L)
} else {
val rest = calc(rooms - 1, capacity)
Stream.from(0).map(x => rest.take(x+1).drop(Math.max(0,x - capacity)).sum)
}
You can try it here:
http://goo.gl/tVgflI
(I've replaced the Long by BigInt there to make it work for larger numbers)

First tip, remove distributionsRecursive and don't build up a list of solutions. The list of all solutions is a huge data set. Just produce a count.
That will let you turn possibleDistributions into a recursive function defined in terms of itself. The recursive step will be, possibleDistributions(food, rooms, roomCapacity) = sum from i = 1 to roomCapacity of possibleDistributions(food - i, rooms - 1, roomCapacity).
You will save a lot of memory, but still have your underlying performance problem. However with a pure recursive function you can now fix that with https://en.wikipedia.org/wiki/Memoization.

minimum sum subarray in O(N) by Kadane's algorithm

We all know about the maximum sum subarray and the famous Kadane's algorithm. But can we use the same algorithm to find minimum sum also?
My take is:
change the sign and find the max sum in that, same as the way we calculate the maximum sum subarray. Than change the sign of the
elements in the array to make it in initial state.
Please help me in correcting the algo if it has any issue.
corner case: I know there is an issue if all the elements are positive and we can handle this case by doing some preprocessing i.e. traverse the array if all are +ve than just return the minimum number from the array.
The above mention algorithm will work and well supported (explained) by dasblinkenlight.

Will the approach that I have mentioned work to find the minimum sum?
Yes, it will. You can re-state the problem of finding the minimum sum as finding a negative sum with the largest absolute value. When you switch the signs of your numbers and keep the rest of the algorithm in place, that's the number that the algorithm is going to return back to you.
I know there is an issue if all the elements are positive
No, there's no issue: consider the original Kadane's algorithm when all elements are negative. In this case the algorithm returns an empty sequence for the sum of zero - the highest one possible under the circumstances. In other words, when all elements are negative, your best solution is to take none of them.
Your modified algorithm is going to do the same in case when all numbers are positive: again, your best solution is to not take numbers at all.
If you add a requirement that the range returned back from the algorithm may not be empty, you could modify the algorithm slightly to find the smallest positive number (or the greatest negative number) in case when Kadane's algorithm returns an empty range as the optimal solution.

static void subArraySumMin(int a[]) {
int minendingHere = 0;
int minSoFar = a[0];
for (int i = 1; i < a.length; i++) {
minendingHere = Math.min(a[i], minendingHere + a[i]);
minSoFar = Math.min(minSoFar, minendingHere);
}
System.out.println(minSoFar);
}

Just replace max with min.
//O(n)
public static int minSubArraySum(int[] arr) {
int minSum = 0;
int curSum = 0;
for (int num : arr) {
curSum += num;
minSum = Math.min(minSum, curSum);
curSum = Math.min(curSum, 0);
}
return minSum;
}

generating all unique pairs from a list of numbers, n choose 2

i have a list of elements (let's say integers), and i need to make all possible 2-pair comparisons. my approach is O(n^2), and i am wondering if there is a faster way. here is my implementation in java.
public class Pair {
public int x, y;
public Pair(int x, int y) {
this.x = x;
this.y = y;
}
}
public List<Pair> getAllPairs(List<Integer> numbers) {
List<Pair> pairs = new ArrayList<Pair>();
int total = numbers.size();
for(int i=0; i < total; i++) {
int num1 = numbers.get(i).intValue();
for(int j=i+1; j < total; j++) {
int num2 = numbers.get(j).intValue();
pairs.add(new Pair(num1,num2));
}
}
return pairs;
}
please note that i don't allow self-pairing, so there should be ((n(n+1))/2) - n possible pairs. what i have currently works, but as n increases, it is taking me an unbearable long amount of time to get the pairs. is there any way to turn the O(n^2) algorithm above to something sub-quadratic? any help is appreciated.
by the way, i also tried the algorithm below, but when i benchmark, empirically, it performs worst than what i had above. i had thought that by avoiding an inner loop this would speed things up. shouldn't this algorithm below be faster? i would think that it's O(n)? if not, please explain and let me know. thanks.
public List<Pair> getAllPairs(List<Integer> numbers) {
int n = list.size();
int i = 0;
int j = i + 1;
while(true) {
int num1 = list.get(i);
int num2 = list.get(j);
pairs.add(new Pair(num1,num2));
j++;
if(j >= n) {
i++;
j = i + 1;
}
if(i >= n - 1) {
break;
}
}
}

Well, you can't, right?
The result is a list with n*(n-1)/2 elements, no matter what those elements are, just to populate this list (say with zeros) takes O(n^2) time, since n*(n-1)/2 = O(n^2)...

You cannot make it sub-quadric, because as you said - the output is itself quadric - and to create it, you need at least #elements_in_output ops.
However, you could do some "cheating" create your list on the fly:
You can create a class CombinationsGetter that implements Iterable<Pair>, and implement its Iterator<Pair>. This way, you will be able to iterate on the elements on the fly, without creating the list first, which might decrease latency for your application.
Note: It will still be quadric! The time to generate the list on the fly will just be distributed between more operations.
EDIT:
Another solution, which is faster then the naive approach - is multithreading.
Create a few threads, each will get his "slice" of the data - and generate relevant pairs, and create its own partial list.
Later - you can use ArrayList.addAll() to convert those different lists into one.
Note: though complexity is stiil O(n^2), it is likely to be much faster - since the creation of pairs is done in parallel, and ArrayList.addAll() is implemented much more effieciently then the trivial insert one by one elements.
EDIT2:
Your second code is still O(n^2), even though it is a "single loop" - the loop itself will repeat O(n^2) times. Have a look at your variable i. It increases only when j==n, and it decreases j back to i+1 when it does it. So, it will result in n + (n-1) + ... + 1 iterations, and this is sum of arithmetic progression, and gets us back to O(n^2) as expected.
We cannot get better then O(n^2), because we are trying to create O(n^2) distinct Pair objects.

What is a good solution for calculating an average where the sum of all values exceeds a double's limits?

I have a requirement to calculate the average of a very large set of doubles (10^9 values). The sum of the values exceeds the upper bound of a double, so does anyone know any neat little tricks for calculating an average that doesn't require also calculating the sum?
I am using Java 1.5.

You can calculate the mean iteratively. This algorithm is simple, fast, you have to process each value just once, and the variables never get larger than the largest value in the set, so you won't get an overflow.
double mean(double[] ary) {
double avg = 0;
int t = 1;
for (double x : ary) {
avg += (x - avg) / t;
++t;
}
return avg;
}
Inside the loop avg always is the average value of all values processed so far. In other words, if all the values are finite you should not get an overflow.

The very first issue I'd like to ask you is this:
Do you know the number of values beforehand?
If not, then you have little choice but to sum, and count, and divide, to do the average. If Double isn't high enough precision to handle this, then tough luck, you can't use Double, you need to find a data type that can handle it.
If, on the other hand, you do know the number of values beforehand, you can look at what you're really doing and change how you do it, but keep the overall result.
The average of N values, stored in some collection A, is this:
A[0] A[1] A[2] A[3] A[N-1] A[N]
---- + ---- + ---- + ---- + .... + ------ + ----
N N N N N N
To calculate subsets of this result, you can split up the calculation into equally sized sets, so you can do this, for 3-valued sets (assuming the number of values is divisable by 3, otherwise you need a different divisor)
/ A[0] A[1] A[2] \ / A[3] A[4] A[5] \ // A[N-1] A[N] \
| ---- + ---- + ---- | | ---- + ---- + ---- | \\ + ------ + ---- |
\ 3 3 3 / \ 3 3 3 / // 3 3 /
--------------------- + -------------------- + \\ --------------
N N N
--- --- ---
3 3 3
Note that you need equally sized sets, otherwise numbers in the last set, which will not have enough values compared to all the sets before it, will have a higher impact on the final result.
Consider the numbers 1-7 in sequence, if you pick a set-size of 3, you'll get this result:
/ 1 2 3 \ / 4 5 6 \ / 7 \
| - + - + - | + | - + - + - | + | - |
\ 3 3 3 / \ 3 3 3 / \ 3 /
----------- ----------- ---
y y y
which gives:
2 5 7/3
- + - + ---
y y y
If y is 3 for all the sets, you get this:
2 5 7/3
- + - + ---
3 3 3
which gives:
2*3 5*3 7
--- + --- + ---
9 9 9
which is:
6 15 7
- + -- + -
9 9 9
which totals:
28
-- ~ 3,1111111111111111111111.........1111111.........
9
The average of 1-7, is 4. Obviously this won't work. Note that if you do the above exercise with the numbers 1, 2, 3, 4, 5, 6, 7, 0, 0 (note the two zeroes at the end there), then you'll get the above result.
In other words, if you can't split the number of values up into equally sized sets, the last set will be counted as though it has the same number of values as all the sets preceeding it, but it will be padded with zeroes for all the missing values.
So, you need equally sized sets. Tough luck if your original input set consists of a prime number of values.
What I'm worried about here though is loss of precision. I'm not entirely sure Double will give you good enough precision in such a case, if it initially cannot hold the entire sum of the values.

Apart from using the better approaches already suggested, you can use BigDecimal to make your calculations. (Bear in mind it is immutable)

IMHO, the most robust way of solving your problem is
sort your set
split in groups of elements whose sum wouldn't overflow - since they are sorted, this is fast and easy
do the sum in each group - and divide by the group size
do the sum of the group's sum's (possibly calling this same algorithm recursively) - be aware that if the groups will not be equally sized, you'll have to weight them by their size
One nice thing of this approach is that it scales nicely if you have a really large number of elements to sum - and a large number of processors/machines to use to do the math

Please clarify the potential ranges of the values.
Given that a double has a range ~= +/-10^308, and you're summing 10^9 values, the apparent range suggested in your question is values of the order of 10^299.
That seems somewhat, well, unlikely...
If your values really are that large, then with a normal double you've got only 17 significant decimal digits to play with, so you'll be throwing away about 280 digits worth of information before you can even think about averaging the values.
I would also note (since no-one else has) that for any set of numbers X:
mean(X) = sum(X[i] - c) + c
-------------
N
for any arbitrary constant c.
In this particular problem, setting c = min(X) might dramatically reduce the risk of overflow during the summation.
May I humbly suggest that the problem statement is incomplete...?

A double can be divided by a power of 2 without loss of precision. So if your only problem if the absolute size of the sum you could pre-scale your numbers before summing them. But with a dataset of this size, there is still the risk that you will hit a situation where you are adding small numbers to a large one, and the small numbers will end up being mostly (or completely) ignored.
for instance, when you add 2.2e-20 to 9.0e20 the result is 9.0e20 because once the scales are adjusted so that they numbers can be added together, the smaller number is 0. Doubles can only hold about 17 digits, and you would need more than 40 digits to add these two numbers together without loss.
So, depending on your data set and how many digits of precision you can afford to loose, you may need to do other things. Breaking the data into sets will help, but a better way to preserve precision might be to determine a rough average (you may already know this number). then subtract each value from the rough average before you sum it. That way you are summing the distances from the average, so your sum should never get very large.
Then you take the average delta, and add it to your rough sum to get the correct average. Keeping track of the min and max delta will also tell you how much precision you lost during the summing process. If you have lots of time and need a very accurate result, you can iterate.

You could take the average of averages of equal-sized subsets of numbers that don't exceed the limit.

divide all values by the set size and then sum it up

Option 1 is to use an arbitrary-precision library so you don't have an upper-bound.
Other options (which lose precision) are to sum in groups rather than all at once, or to divide before summing.

So I don't repeat myself so much, let me state that I am assuming that the list of numbers is normally distributed, and that you can sum many numbers before you overflow. The technique still works for non-normal distros, but somethings will not meet the expectations I describe below.
--
Sum up a sub-series, keeping track of how many numbers you eat, until you approach the overflow, then take the average. This will give you an average a0, and count n0. Repeat until you exhaust the list. Now you should have many ai, ni.
Each ai and ni should be relatively close, with the possible exception of the last bite of the list. You can mitigate that by under-biting near the end of the list.
You can combine any subset of these ai, ni by picking any ni in the subset (call it np) and dividing all the ni in the subset by that value. The max size of the subsets to combine is the roughly constant value of the n's.
The ni/np should be close to one. Now sum ni/np * ai and multiple by np/(sum ni), keeping track of sum ni. This gives you a new ni, ai combination, if you need to repeat the procedure.
If you will need to repeat (i.e., the number of ai, ni pairs is much larger than the typical ni), try to keep relative n sizes constant by combining all the averages at one n level first, then combining at the next level, and so on.

First of all, make yourself familiar with the internal representation of double values. Wikipedia should be a good starting point.
Then, consider that doubles are expressed as "value plus exponent" where exponent is a power of two. The limit of the largest double value is an upper limit of the exponent, and not a limit of the value! So you may divide all large input numbers by a large enough power of two. This should be safe for all large enough numbers. You can re-multiply the result with the factor to check whether you lost precision with the multiplication.
Here we go with an algorithm
public static double sum(double[] numbers) {
double eachSum, tempSum;
double factor = Math.pow(2.0,30); // about as large as 10^9
for (double each: numbers) {
double temp = each / factor;
if (t * factor != each) {
eachSum += each;
else {
tempSum += temp;
}
}
return (tempSum / numbers.length) * factor + (eachSum / numbers.length);
}
and dont be worried by the additional division and multiplication. The FPU will optimize the hell out of them since they are done with a power of two (for comparison imagine adding and removing digits at the end of a decimal numbers).
PS: in addition, you may want to use Kahan summation to improve the precision. Kahan summation avoids loss of precision when very large and very small numbers are summed up.

I posted an answer to a question spawned from this one, realizing afterwards that my answer is better suited to this question than to that one. I've reproduced it below. I notice though, that my answer is similar to a combination of Bozho's and Anon.'s.
As the other question was tagged language-agnostic, I chose C# for the code sample I've included. Its relative ease of use and easy-to-follow syntax, along with its inclusion of a couple of features facilitating this routine (a DivRem function in the BCL, and support for iterator functions), as well as my own familiarity with it, made it a good choice for this problem. Since the OP here is interested in a Java solution, but I'm not Java-fluent enough to write it effectively, it might be nice if someone could add a translation of this code to Java.
Some of the mathematical solutions here are very good. Here's a simple technical solution.
Use a larger data type. This breaks down into two possibilities:
Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.
I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.
If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple (untested) example in C# follows
class BigMeanSet{
List<uint> list = new List<uint>();
public double GetAverage(IEnumerable<uint> values){
list.Clear();
list.Add(0);
uint count = 0;
foreach(uint value in values){
Add(0, value);
count++;
}
return DivideBy(count);
}
void Add(int listIndex, uint value){
if((list[listIndex] += value) < value){ // then overflow has ocurred
if(list.Count == listIndex + 1)
list.Add(0);
Add(listIndex + 1, 1);
}
}
double DivideBy(uint count){
const double shift = 4.0 * 1024 * 1024 * 1024;
double rtn = 0;
long remainder = 0;
for(int i = list.Count - 1; i >= 0; i--){
rtn *= shift;
remainder <<= 32;
rtn += Math.DivRem(remainder + list[i], count, out remainder);
}
rtn += remainder / (double)count;
return rtn;
}
}
Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.
This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 232 - 1. If more elements are needed, then the count variable will need be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.
In terms of efficiency, it should be as fast or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.
If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.
I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).
I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 232 - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.
I then tested with 100* large series, with random length between 105 and 109. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.
*Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 109 takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.
For those interested, my test code is below:
static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
for(uint i = lowerbound; i <= upperbound; i++)
yield return i;
}
static void Test(){
Console.BufferHeight = 1200;
Random rnd = new Random();
for(int i = 0; i < 1000; i++){
uint[] numbers = new uint[rnd.Next(1, 1000)];
for(int j = 0; j < numbers.Length; j++)
numbers[j] = (uint)rnd.Next();
double sum = 0;
foreach(uint n in numbers)
sum += n;
double avg = sum / numbers.Length;
double ans = new BigMeanSet().GetAverage(numbers);
Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);
if(avg != ans)
Debugger.Break();
}
for(int i = 0; i < 100; i++){
uint length = (uint)rnd.Next(100000, 1000000001);
uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
uint upperbound = lowerbound + length;
double avg = ((double)lowerbound + upperbound) / 2;
double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));
Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);
if(avg != ans)
Debugger.Break();
}
}

A random sampling of a small set of the full dataset will often result in a 'good enough' solution. You obviously have to make this determination yourself based on system requirements. Sample size can be remarkably small and still obtain reasonably good answers. This can be adaptively computed by calculating the average of an increasing number of randomly chosen samples - the average will converge within some interval.
Sampling not only addresses the double overflow concern, but is much, much faster. Not applicable for all problems, but certainly useful for many problems.

Consider this:
avg(n1) : n1 = a1
avg(n1, n2) : ((1/2)*n1)+((1/2)*n2) = ((1/2)*a1)+((1/2)*n2) = a2
avg(n1, n2, n3) : ((1/3)*n1)+((1/3)*n2)+((1/3)*n3) = ((2/3)*a2)+((1/3)*n3) = a3
So for any set of doubles of arbitrary size, you could do this (this is in C#, but I'm pretty sure it could be easily translated to Java):
static double GetAverage(IEnumerable<double> values) {
int i = 0;
double avg = 0.0;
foreach (double value in values) {
avg = (((double)i / (double)(i + 1)) * avg) + ((1.0 / (double)(i + 1)) * value);
i++;
}
return avg;
}
Actually, this simplifies nicely into (already provided by martinus):
static double GetAverage(IEnumerable<double> values) {
int i = 1;
double avg = 0.0;
foreach (double value in values) {
avg += (value - avg) / (i++);
}
return avg;
}
I wrote a quick test to try this function out against the more conventional method of summing up the values and dividing by the count (GetAverage_old). For my input I wrote this quick function to return as many random positive doubles as desired:
static IEnumerable<double> GetRandomDoubles(long numValues, double maxValue, int seed) {
Random r = new Random(seed);
for (long i = 0L; i < numValues; i++)
yield return r.NextDouble() * maxValue;
yield break;
}
And here are the results of a few test trials:
long N = 100L;
double max = double.MaxValue * 0.01;
IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 1.00535024998431E+306
double newWay = GetAverage(doubles); // 1.00535024998431E+306
doubles = GetRandomDoubles(N, max, 1);
oldWay = GetAverage_old(doubles); // 8.75142021696299E+305
newWay = GetAverage(doubles); // 8.75142021696299E+305
doubles = GetRandomDoubles(N, max, 2);
oldWay = GetAverage_old(doubles); // 8.70772312848651E+305
newWay = GetAverage(doubles); // 8.70772312848651E+305
OK, but what about for 10^9 values?
long N = 1000000000;
double max = 100.0; // we start small, to verify accuracy
IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 49.9994879713857
double newWay = GetAverage(doubles); // 49.9994879713868 -- pretty close
max = double.MaxValue * 0.001; // now let's try something enormous
doubles = GetRandomDoubles(N, max, 0);
oldWay = GetAverage_old(doubles); // Infinity
newWay = GetAverage(doubles); // 8.98837362725198E+305 -- no overflow
Naturally, how acceptable this solution is will depend on your accuracy requirements. But it's worth considering.

Check out the section for cummulative moving average

In order to keep logic simple, and keep performance not the best but acceptable, i recommend you to use BigDecimal together with the primitive type.
The concept is very simple, you use primitive type to sum values together, whenever the value will underflow or overflow, you move the calculate value to the BigDecimal, then reset it for the next sum calculation. One more thing you should aware is when you construct BigDecimal, you ought to always use String instead of double.
BigDecimal average(double[] values){
BigDecimal totalSum = BigDecimal.ZERO;
double tempSum = 0.00;
for (double value : values){
if (isOutOfRange(tempSum, value)) {
totalSum = sum(totalSum, tempSum);
tempSum = 0.00;
}
tempSum += value;
}
totalSum = sum(totalSum, tempSum);
BigDecimal count = new BigDecimal(values.length);
return totalSum.divide(count);
}
BigDecimal sum(BigDecimal val1, double val2){
BigDecimal val = new BigDecimal(String.valueOf(val2));
return val1.add(val);
}
boolean isOutOfRange(double sum, double value){
// because sum + value > max will be error if both sum and value are positive
// so I adapt the equation to be value > max - sum
if(sum >= 0.00 && value > Double.MAX - sum){
return true;
}
// because sum + value < min will be error if both sum and value are negative
// so I adapt the equation to be value < min - sum
if(sum < 0.00 && value < Double.MIN - sum){
return true;
}
return false;
}
From this concept, every time the result is underflow or overflow, we will keep that value into the bigger variable, this solution might a bit slowdown the performance due to the BigDecimal calculation, but it guarantee the runtime stability.

Why so many complicated long answers. Here is the simplest way to find the running average till now without any need to know how many elements or size etc..
long int i = 0;
double average = 0;
while(there are still elements)
{
average = average * (i / i+1) + X[i] / (i+1);
i++;
}
return average;

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Splitting an array into two subarrays with minimal sum - java

DP soluction will give lg(n) time. Two array, iterate one from start to end, and calculate the sum, the other iterate from end to start, and do the same thing. Finally, iterate from start to end and get minimal difference.

Related

Calculate absolute minimum difference between any two numbers of a huge integer array

Allocating N tonnes of food in K rooms with M capacity

minimum sum subarray in O(N) by Kadane's algorithm

generating all unique pairs from a list of numbers, n choose 2

What is a good solution for calculating an average where the sum of all values exceeds a double's limits?

Categories

Resources