I have been struggling to solve an array problem in linear time.
The problem is:
Given an array A[1...n], write an algorithm that returns true if:
There are two numbers x, y in the array such that:
x < y
x repeats more than n/3 times
y repeats more than n/4 times
I have tried to write the following Java program to do so, assuming we have a sorted array, but I don't think it is the best implementation.
public static boolean solutionManma() {
    int[] arr = {2, 2, 2, 3, 3, 3};
    int n = arr.length;
    int xCount = 1;
    int yCount = 1;
    int maxXcount = xCount, maxYCount = yCount;
    int currX = arr[0];
    int currY = arr[n - 1];
    for (int i = 1; i < n - 2; i++) {
        int right = arr[n - 2 - i + 1];
        int left = arr[i];
        if (currX == left) {
            xCount++;
        } else {
            maxXcount = Math.max(xCount, maxXcount);
            xCount = 1;
            currX = left;
        }
        if (currY == right) {
            yCount++;
        } else {
            maxYCount = Math.max(yCount, maxYCount);
            yCount = 1;
            currY = right;
        }
    }
    return (maxXcount > n / 3 && maxYCount > n / 4);
}
If anyone has an algorithm idea for this kind of problem (preferably O(n)), I would much appreciate it, because I got stuck on this one.
The key part of this problem is to find in linear time and constant space the values which occur more than n/4 times. (Note: the text of your question says "more than" and the title says "at least". Those are not the same condition. This answer is based on the text of your question.)
There are at most three values which occur more than n/4 times, and a list of such values must also include any value which occurs more than n/3 times.
The algorithm we'll use returns a list of up to three values. It only guarantees that all values which satisfy the condition are in the list it returns. The list might include other values, and it does not provide any information about the precise frequencies.
So a second pass is necessary, which scans the vector a second time counting the occurrences of each of the three values returned. Once you have the three counts, it's simple to check whether the smallest value which occurs more than n/3 times (if any) is less than the largest value which occurs more than n/4 times.
To construct the list of candidates, we use a generalisation of the Boyer-Moore majority vote algorithm, which finds a value which occurs more than n/2 times. The generalisation, published in 1982 by J. Misra and D. Gries, uses k-1 counters, each possibly associated with a value, to identify values which might occur more than n/k times. In this case, k is 4 and so we need three counters.
Initially, all of the counters are 0 and are not associated with any value. Then for each value in the array, we do the following:
If there is a counter associated with that value, we increment it.
If no counter is associated with that value but some counter is at 0, we associate that counter with the value and increment its count to 1.
Otherwise, we decrement every counter's count.
Once all the values have been processed, the values associated with counters with positive counts are the candidate values.
For a general implementation where k is not known in advance, it would be possible to use a hash-table or other key-value map to identify values with counts. But in this case, since it is known that k is a small constant, we can just use a simple vector of three value-count pairs, making this algorithm O(n) time and O(1) space.
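For concreteness, here is a minimal sketch in Java of the two passes described above (my own code, not part of the original answer; a HashMap that never holds more than three entries stands in for the vector of value-count pairs, which keeps the space O(1)):

import java.util.HashMap;
import java.util.Map;

public class FrequentValues {

    public static boolean solve(int[] a) {
        int n = a.length;
        // First pass: Misra-Gries with three counters (k = 4).
        Map<Integer, Integer> counters = new HashMap<>(); // never more than 3 entries
        for (int v : a) {
            if (counters.containsKey(v)) {
                counters.merge(v, 1, Integer::sum);
            } else if (counters.size() < 3) {
                counters.put(v, 1);
            } else {
                // decrement every counter, freeing those that reach 0
                counters.replaceAll((key, c) -> c - 1);
                counters.values().removeIf(c -> c == 0);
            }
        }
        // Second pass: exact counts for the (at most three) candidates.
        Map<Integer, Integer> exact = new HashMap<>();
        for (int v : a) {
            if (counters.containsKey(v)) {
                exact.merge(v, 1, Integer::sum);
            }
        }
        // Smallest value occurring more than n/3 times, largest occurring
        // more than n/4 times; the answer is true iff x < y.
        Integer x = null, y = null;
        for (Map.Entry<Integer, Integer> e : exact.entrySet()) {
            if (e.getValue() * 3 > n && (x == null || e.getKey() < x)) x = e.getKey();
            if (e.getValue() * 4 > n && (y == null || e.getKey() > y)) y = e.getKey();
        }
        return x != null && y != null && x < y;
    }
}

The first loop is the candidate-finding pass; the second establishes the true frequencies of the survivors, after which checking the x < y condition is a constant amount of work.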
I suggest the following solution, based on this assumption:
In an array of length n there will be at most n different numbers
The key feature will be to count the frequency of occurrence of each distinct input value using a histogram with n bins, meaning O(n) space. The algorithm will be as follows:
1. Create a histogram vector with n bins, initialized to zeros.
2. For index ii over the length of the input array a:
   2.1. Increase the value: hist[a[ii]] += 1
3. Set found_x and found_y to False.
4. For the ii-th bin in the histogram, check:
   4.1. If found_x == False:
      4.1.1. If hist[ii] > n/3, set found_x = True and set x = ii
   4.2. Else if found_y == False:
      4.2.1. If hist[ii] > n/4, set y = ii and return x, y
Explanation
In the first pass over the array you record the occurrence frequency of all the numbers. In the pass over the histogram array, which also has a length of n, you check the occurrences. First you check whether there is a number that occurred more than n/3 times, and if there is, then for the rest of the numbers (which are larger than x by construction, since the histogram is scanned in increasing order) you check whether there is another number which occurred more than n/4 times. If there is, you return the found x and y, and if there isn't, you simply return "not found" after covering all the bins in the histogram.
As far as time complexity goes, you go over the input array once and over the histogram of the same length once, therefore the time complexity is O(n), as requested.
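A direct Java translation of the steps above might look like this sketch (my code, not the answerer's; note it additionally assumes the values are integers in the range [0, n), so they can serve as bin indices):

public static int[] findXY(int[] a) {
    int n = a.length;
    int[] hist = new int[n];            // step 1: n bins, zero-initialized
    for (int v : a) hist[v]++;          // step 2: count occurrences
    boolean foundX = false;             // step 3
    int x = -1;
    for (int i = 0; i < n; i++) {       // step 4: scan bins in increasing order
        if (!foundX) {
            if (hist[i] * 3 > n) { foundX = true; x = i; }
        } else if (hist[i] * 4 > n) {
            return new int[]{x, i};     // y = i, and i > x by construction
        }
    }
    return null;                        // not found
}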
You are given two strings S and T. An infinitely long string is formed in the following manner:
Take an empty string,
Append S one time,
Append T two times,
Append S three times,
Append T four times,
and so on, appending the strings alternately and increasing the number of repetitions by 1 each time.
You will also be given an integer K.
You need to tell the Kth Character of this infinitely long string.
Sample Input (S, T, K):
a
bc
4
Sample Output:
b
Sample Explanation:
The string formed will be "abcbcaaabcbcbcbcaaaaa...". So the 4th character is "b".
My attempt:
public class FindKthCharacter {

    public char find(String S, String T, int K) {
        // lengths of S and T
        int s = S.length();
        int t = T.length();
        // Counters for S and T
        int sCounter = 1;
        int tCounter = 2;
        // To store final chunks of string
        StringBuilder sb = new StringBuilder();
        // Loop until K is greater than zero
        while (K > 0) {
            if (K > sCounter * s) {
                K -= sCounter * s;
                sCounter += 2;
                if (K > tCounter * t) {
                    K -= tCounter * t;
                    tCounter += 2;
                } else {
                    return sb.append(T.repeat(tCounter)).charAt(K - 1);
                }
            } else {
                return sb.append(S.repeat(sCounter)).charAt(K - 1);
            }
        }
        return '\u0000';
    }
}
But is there any better way to reduce its time complexity?
I've tried to give a guide here, rather than just give the solution.
If s and t are the lengths of the strings S and T, then you need to find the largest odd n such that
(1+3+5+...+n)s + (2+4+6+...+(n+1))t < K.
You can simplify these expressions to get a quadratic equation in n.
Let N be (1+3+..+n)s + (2+4+6+...+(n+1))t. You know that K will lie either in the next (n+2) copies of S, or the (n+3) copies of T that follow. Compare K to N+(n+2)s, and take the appropriate letter of either S or T using a modulo.
The only difficult step here is solving the large quadratic, but you can do it in O(log K) arithmetic operations easily enough by doubling n until it's too large, and then using a binary search on the remaining range. (If K is not too large so that floating point is viable, you can do it in O(1) time using the well-known quadratic formula).
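To make the guide concrete, here is one possible sketch of the doubling-plus-binary-search route in Java (my own code, not the answerer's; it assumes a 1-based K and that the involved lengths fit comfortably in a long):

public static char kth(String S, String T, long K) {
    long s = S.length(), t = T.length();
    // f(m) = length after m S-blocks and m T-blocks
    //      = (1 + 3 + ... + (2m-1))*s + (2 + 4 + ... + 2m)*t = m*m*s + m*(m+1)*t
    // (m pairs correspond to the answer's odd n via n = 2m - 1.)
    // Find the largest m with f(m) < K: double, then binary search.
    long lo = 0, hi = 1;
    while (hi * hi * s + hi * (hi + 1) * t < K) hi *= 2;
    while (lo < hi) {
        long mid = (lo + hi + 1) / 2;
        if (mid * mid * s + mid * (mid + 1) * t < K) lo = mid;
        else hi = mid - 1;
    }
    long m = lo;
    long N = m * m * s + m * (m + 1) * t;   // characters consumed so far
    long inS = (2 * m + 1) * s;             // next come 2m+1 copies of S
    if (K <= N + inS) return S.charAt((int) ((K - N - 1) % s));
    N += inS;                               // then 2m+2 copies of T
    return T.charAt((int) ((K - N - 1) % t));
}

On the sample input (S = "a", T = "bc", K = 4) this lands in the two copies of T that follow the first copy of S and returns 'b'.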
Here is my quick attempt; there probably is a better solution. Runtime is still O(sqrt n), but memory is O(1).
public static char find(String a, String b, int k) {
    int lenA = a.length();
    int lenB = b.length();
    int rep = 0;
    boolean isA = false;
    while (k >= 0) {
        ++rep;
        isA = !isA;
        k -= (isA ? lenA : lenB) * rep;
    }
    int len = (isA ? lenA : lenB);
    int idx = (len * rep + k) % len;
    return (isA ? a : b).charAt(idx);
}
Here's an O(1) solution that took me some time to come up with (read: I would have failed an interview on time). Hopefully the process is clear and you can implement it in code.
Our Goal is to return the char that maps to the kth index.
But How? Just 4 easy steps, actually.
Step 1: Find out how many iterations of our two patterns it would take to represent at least k characters.
Step 2: Using this above number of iterations i, return how many characters are present in the previous i-1 iterations.
Step 3: Get the number of characters n into iteration i that our kth character is. (k - result of step 2)
Step 4: Mod n by the length of the pattern to get index into pattern for the specific char. If i is odd, look into s, else look into t.
For step 1, we need to find a formula to give us the iteration i that character k is in. To derive this formula, it may be easier to first derive the formula needed for step 2.
Step 2's formula is basically: given an iteration i, return how many characters are present in that iteration. We are solving for k in this equation and are given i, while it's the opposite for step 1, where we are solving for i given k. If we can derive the equation to find k given i, then we can surely reverse it to find i given k.
Now, let's try to derive the formula for step 2 and find k given i. Here it's best to start with the most basic example to see the pattern.
s = "a", t = "b"
i=1 a
i=2 abb
i=3 abbaaa
i=4 abbaaabbbb
i=5 abbaaabbbbaaaaa
i=6 abbaaabbbbaaaaabbbbbb
Counting the total number of combined chars for each pattern during its next iteration gives us:
#iterations of pattern: 1 2 3 4 5 6 7 8 9 10
every new s iteration: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100
every new t iteration: 2, 6, 12, 20, 30, 42, 56, 72, 90, 110
You might notice some nice patterns here. For example, s has a really nice formula to find out how many combined characters it has at any given iteration. It's simply (# of s iterations)^2 * s.length. t also has a simple formula. It is (# of t iterations * (# of t iterations + 1)) * t.length. You may have noticed that these formulas are the formulas for the sums of odd and even numbers (if you did, you get a kudos). This makes sense because each pattern's sum for an iteration i is the sum of all of its previous iterations.
Using s,t as length of their respective patterns, we now have the following formula to find the total number of chars at a given iteration.
#chars = s*(# of s iterations)^2 + t * (# of t iterations * (# of t iterations + 1))
Now we just need to do some math to get the number of iterations for each pattern given i.
# of s iterations given i = ceil(i/2.0)
# of t iterations given i = floor(i/2), which integer division (/) gives us by default
Plugging these back into our formula we get:
total # of chars = s*(ceil(i/2.0)^2) + t*((i/2)*((i/2)+1))
We have just completed step 2, and we now know at any given iteration how many total chars there are. We could stop here and start picking random iterations and adjusting accordingly until we get near k, but we can do better than that. Let's use the above formula now to complete step 1 which we skipped. We just need to reorganize our equation to solve for i now.
Doing some simplifying we get:

s (i/2)^2 + t (i/2) ((i/2) + 1) = k
----------------------------
s i^2/4 + t i^2/4 + t i/2 = k
----------------------------
s i^2 + t i^2 + 2 t i - 4k = 0     (multiply both sides by 4)
----------------------------
(s + t) i^2 + 2 t i - 4k = 0
This looks like a polynomial. Wow! You're right! That means we can solve it using the quadratic formula.
A=(s+t), B=2t, C=-4k
quadratic formula: i = (-2t + sqrt(4t^2 + 16(s+t)k)) / (2(s+t))
This is our formula for step 1, and it will give us the iteration that the kth character is on. We just need to ceil it. I'm actually not smart enough to know why this works. It just does. Here is a desmos graph that graphs our two polynomials from step 2: s(Siterations)^2 and t(Titerations (Titerations + 1)).
The area under both curves is our total number of chars at an iteration (the vertical line). The formula from step 1 is also graphed, and we can see that for any s, t, k that the x intercept (which represents our xth iteration) is always: previous iteration < x <= current iteration, which is why the ceil works.
We have now completed steps 1 and 2. We have a formula to get the ith iteration that the kth character is on and a formula that gives us how many characters are in an ith iteration. Steps 3 and 4 should follow and we get our answer. This is constant time.
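Here is a sketch of the four steps in Java (my own translation, not code from the answer; k is 1-based, and for very large k you would want to guard against floating-point error in the square root):

public static char kthChar(String s, String t, long k) { // k is 1-based
    long sl = s.length(), tl = t.length();
    // Step 1: solve (s+t)i^2 + 2t*i - 4k = 0 and take the ceiling.
    double root = (-2.0 * tl + Math.sqrt(4.0 * tl * tl + 16.0 * (sl + tl) * k))
                / (2.0 * (sl + tl));
    long iter = (long) Math.ceil(root);
    // Step 2: total characters in the previous iter-1 iterations.
    long prev = iter - 1;
    long sIters = (prev + 1) / 2;   // ceil(prev / 2)
    long tIters = prev / 2;         // floor(prev / 2)
    long charsBefore = sl * sIters * sIters + tl * tIters * (tIters + 1);
    // Step 3: offset into iteration iter.
    long n = k - charsBefore;
    // Step 4: odd iterations are S, even iterations are T.
    String pattern = (iter % 2 == 1) ? s : t;
    return pattern.charAt((int) ((n - 1) % pattern.length()));
}

Checking against the earlier sample (s = "a", t = "bc", k = 4), the formula gives iteration 2 and the method returns 'b', as expected.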
I have a map of items with some probability distribution:
Map<SingleObjectiveItem, Double> itemsDistribution;
Given a certain m I have to generate a Set of m elements sampled from the above distribution.
As of now I was using the naive way of doing it:
while(mySet.size < m)
mySet.add(getNextSample(itemsDistribution));
The getNextSample(...) method fetches an object from the distribution as per its probability. Now, as m increases the performance severely suffers. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function remains in the while loop for too long. Generate 1000 such sets and you have an application that crawls.
Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?
UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, in the first run we'll definitely get a sample for the set; in the 2nd iteration it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., goes through the entire cumulative distribution to find the sample, then the run-time complexity of the loop is at least n*m(m+1)/2, where n is the number of elements in the distribution. If m = cn, 0 < c <= 1, then the loop is at least Ω(n^3). And that too is the lower bound!
If we replace sequential search by binary search, the complexity would be at least Ω(n^2 log n). Efficient, but perhaps not by a large margin.
Also, removing from the distribution is not possible since I call the above loop k times, to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.
Start out by generating a number of random points in two dimensions.
Then apply your distribution.
Now find all entries within the distribution and pick the x coordinates, and you have your random numbers with the requested distribution.
The problem is unlikely to be the loop you show:
Let n be the size of the distribution, and I be the number of invocations of getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations of getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i / n, and is therefore exponentially distributed with parameter λ. Therefore, E[C_i] = 1 / λ = 1 / (1 - i / n) <= 1 / (1 - m / n), and hence E[I] < m / (1 - m / n).
That is, sampling a set of size m = n/2 will take, on average, less than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).
The following should be faster (if m < 0.8 n)
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {
    private double[] cumulativeWeight;
    private T[] item;
    private double totalWeight;

    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;
        cumulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];
        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cumulativeWeight[i] = totalWeight;
            i++;
        }
    }

    T randomItem() {
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cumulativeWeight, weight);
        if (index < 0) {
            index = -index - 1;
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem());
        }
        return set;
    }
}

public class Test {
    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }
        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}
The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.
As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:
Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}
To efficiently implement remove requires a different representation for the distribution, for instance a binary tree where each node stores the total weight of the subtree whose root it is.
But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.
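For the curious, here is a rough sketch of that idea (my own illustration, not meriton's code) using a Fenwick tree over the weights instead of an explicit binary tree; drawing an item and removing it are both O(log n):

import java.util.Random;

class WeightedSampler {
    private final double[] tree;  // Fenwick tree over the item weights
    private final int n;
    private double total;
    private final Random rnd = new Random();

    WeightedSampler(double[] weights) {
        n = weights.length;
        tree = new double[n + 1];
        for (int i = 0; i < n; i++) add(i, weights[i]);
    }

    private void add(int i, double delta) {
        total += delta;
        for (i++; i <= n; i += i & -i) tree[i] += delta;
    }

    private double prefix(int i) { // sum of weights 0..i-1
        double s = 0;
        for (; i > 0; i -= i & -i) s += tree[i];
        return s;
    }

    /** Largest index whose prefix sum is <= target, found in O(log n). */
    private int search(double target) {
        int idx = 0;
        for (int step = Integer.highestOneBit(n); step > 0; step >>= 1) {
            if (idx + step <= n && tree[idx + step] <= target) {
                idx += step;
                target -= tree[idx];
            }
        }
        return idx; // 0-based index of the selected item
    }

    /** Draws an item proportionally to its weight, then removes it.
     *  Assumes it is called at most n times. */
    int drawWithoutReplacement() {
        int i = search(rnd.nextDouble() * total);
        double w = prefix(i + 1) - prefix(i);
        add(i, -w); // zero the weight so the item cannot be drawn again
        return i;
    }
}

Each node of the implicit tree stores the total weight of a block of items, which is exactly the "subtree weight" bookkeeping the answer describes.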
If you are not too concerned with randomness properties then I would do it like this:
create buffer for pseudo-random numbers
double buff[MAX]; // [edit1] double pseudo random numbers
MAX is the size; it should be big enough ... 1024*128 for example
the type can be anything (float, int, DWORD...)
fill buffer with numbers
you have a range of numbers x = <x0, x1> and a probability function probability(x) defined by your probability distribution, so do this:
for (i = 0, x = x0; x <= x1; x += stepx)
    for (j = 0, n = probability(x) * MAX, q = 0.1 * stepx / n; j < n; j++, i++) // [edit1] unique pseudo-random numbers
        buff[i] = x + (double(j) * q); // [edit1] offsetting by j keeps each value inside its own step
The stepx is your accuracy for the items (for integral types = 1). Now the buff[] array has the same distribution as you need, but it is not pseudo-random. You should also add a check that i does not reach MAX, to avoid array overruns; also, at the end the real size of buff[] is i (it can be less than MAX due to rounding).
shuffle buff[]
do just a few loops of swapping buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0, MAX)
write your pseudo-random function
it just returns numbers from the buffer: the first call returns buff[0], the second buff[1], and so on. With standard generators, when you hit the end of buff[] you shuffle buff[] again and start from buff[0]. But as you need unique numbers, you cannot reach the end of the buffer, so set MAX big enough for your task, otherwise uniqueness will not be assured.
[Notes]
MAX should be big enough to store the whole distribution you want. If it is not big enough then items with low probability can be missing completely.
[edit1] - tweaked the answer a little to match the question's needs (pointed out by meriton, thanks)
PS. The complexity of initialization is O(N), and getting a number is O(1).
You should implement your own random number generator (using a Monte Carlo method or any good uniform generator like the Mersenne Twister) based on the inversion method (here).
For example, the exponential law: generate a uniform random number u in [0,1], then your random variable of the exponential law would be ln(1-u)/(-lambda), lambda being the exponential law parameter and ln the natural logarithm.
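As a quick, hedged illustration of that exponential example in Java:

import java.util.Random;

public class InverseTransform {
    // Inverse CDF of the exponential law with parameter lambda > 0.
    public static double nextExponential(Random rnd, double lambda) {
        double u = rnd.nextDouble();         // uniform in [0, 1)
        return Math.log(1 - u) / (-lambda);  // ln(1-u)/(-lambda)
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int i = 0; i < 5; i++)
            System.out.println(nextExponential(rnd, 2.0));
    }
}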
Hope it'll help ;).
I think you have two problems:
1. Your itemDistribution doesn't know you need a set, so when the set you are building gets large you will pick a lot of elements that are already in the set. If you start with the set full and remove elements, you will run into the same problem for very small sets. Is there a reason why you don't remove the element from the itemDistribution after you pick it? Then you wouldn't pick the same element twice.
2. The choice of data structure for itemDistribution looks suspicious to me. You want the getNextSample operation to be fast. Doesn't the map from values to probabilities force you to iterate through large parts of the map for each getNextSample? I'm no good at statistics, but couldn't you represent the itemDistribution the other way around, like a map from probability (or from the sum of all smaller probabilities plus the element's own probability) to an element of the set?
Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.
A good way to pick several unique random items from a list is to first shuffle the list and then pop items off the list. You can shuffle the list once with the given distribution. From then on, picking your m items is just popping the list.
Here's an implementation of a probabilistic shuffle:
List<Item> prob_shuffle(Map<Item, int> dist)
{
    int n = dist.length;
    List<Item> a = dist.keys();
    int psum = 0;
    int i, j;
    for (i in dist) psum += dist[i];
    for (i = 0; i < n; i++) {
        int ip = rand(psum); // 0 <= ip < psum
        int jp = 0;
        for (j = i; j < n; j++) {
            jp += dist[a[j]];
            if (ip < jp) break;
        }
        psum -= dist[a[j]];
        Item tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
    return a;
}
This is not Java, but pseudocode after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.
Here, I used integer probabilities. (The probabilities don't have to add up to a special value; it's just "bigger is better".) You can use floating-point numbers, but because of inaccuracies you might end up going beyond the array when picking an item. You should use item n - 1 then. If you add that safety net, you could even have items with zero probability that always get picked last.
There might be a method to speed up the picking loop, but I don't really see how. The swapping renders any precalculations useless.
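For reference, here is one possible Java rendering of the pseudocode above (my own translation; it assumes all weights are positive):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

class ProbShuffle {
    static <T> List<T> probShuffle(Map<T, Integer> dist, Random rnd) {
        List<T> a = new ArrayList<>(dist.keySet());
        int n = a.size();
        int psum = 0;
        for (int w : dist.values()) psum += w;
        for (int i = 0; i < n; i++) {
            int ip = rnd.nextInt(psum);    // 0 <= ip < psum
            int jp = 0, j = i;
            for (; j < n; j++) {
                jp += dist.get(a.get(j));
                if (ip < jp) break;        // always breaks: ip < psum = sum of the rest
            }
            psum -= dist.get(a.get(j));
            T tmp = a.get(i);              // swap a[i] and a[j]
            a.set(i, a.get(j));
            a.set(j, tmp);
        }
        return a;
    }
}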
Accumulate your probabilities in a table
                 Probability
Item         Actual    Accumulated
Item1        0.10      0.10
Item2        0.30      0.40
Item3        0.15      0.55
Item4        0.20      0.75
Item5        0.25      1.00
Make a random number between 0.0 and 1.0 and do a binary search for the first item with a sum that is greater than your generated number. This item would have been chosen with the desired probability.
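A compact way to realize this in Java (my own sketch, not part of the answer) is a TreeMap keyed by the accumulated probability, so ceilingEntry performs the binary search:

import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

class CumulativeTable<T> {
    private final NavigableMap<Double, T> table = new TreeMap<>();
    private final Random rnd = new Random();
    private double acc = 0;

    void add(T item, double probability) {
        acc += probability;
        table.put(acc, item); // key = accumulated probability
    }

    T sample() {
        // first entry whose accumulated sum is >= the random draw
        return table.ceilingEntry(rnd.nextDouble() * acc).getValue();
    }
}

Filling it with the five items above and calling sample() repeatedly picks Item2 about 30% of the time, Item5 about 25% of the time, and so on.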
Ebbe's method is called rejection sampling.
I sometimes use a simple method, using an inverse cumulative distribution function, which is a function that maps a number X between 0 and 1 onto the Y axis.
Then you just generate a uniformly distributed random number between 0 and 1, and apply the function to it.
That function is also called the "quantile function".
For example, suppose you want to generate a normally distributed random number.
Its cumulative distribution function is called Phi.
The inverse of that is called probit.
There are many ways to generate normal variates, and this is just one example.
You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table.
Then you can just invert it by table-lookup and interpolation.
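Here is a rough sketch of that tabulated approach (my own illustration; the density, range, and bin count are placeholders):

import java.util.Random;
import java.util.function.DoubleUnaryOperator;

class TabulatedInverseCdf {
    private final double[] x, cdf;
    private final Random rnd = new Random();

    // Tabulates the CDF of an (unnormalized, not identically zero) density on [x0, x1].
    TabulatedInverseCdf(DoubleUnaryOperator pdf, double x0, double x1, int bins) {
        x = new double[bins + 1];
        cdf = new double[bins + 1];
        double dx = (x1 - x0) / bins;
        for (int i = 0; i <= bins; i++) x[i] = x0 + i * dx;
        for (int i = 1; i <= bins; i++)
            cdf[i] = cdf[i - 1] + pdf.applyAsDouble(x[i]) * dx;
        for (int i = 0; i <= bins; i++) cdf[i] /= cdf[bins]; // normalize to [0,1]
    }

    // Inverts the table by binary search + linear interpolation.
    double sample() {
        double u = rnd.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) lo = mid; else hi = mid;
        }
        double denom = cdf[hi] - cdf[lo];
        double frac = denom > 0 ? (u - cdf[lo]) / denom : 0;
        return x[lo] + frac * (x[hi] - x[lo]);
    }
}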
I've always taken it for granted that iterative search is the go-to method for finding maximum values in an unsorted list.
The thought came to me rather randomly, but in a nutshell: I believe I can accomplish the task in O(log n) time, with n being the input array's size.
The approach piggy-backs on merge sort: divide and conquer.
Step 1: divide the findMax() task into two sub-tasks, findMax(leftHalf) and findMax(rightHalf). This division should be finished in O(log n) time.
Step 2: merge the two maximum candidates back up. Each layer in this step should take constant time O(1), and there are, per the previous step, O(log n) such layers. So it should also be done in O(1) * O(log n) = O(log n) time (pardon the abuse of notation). (Edit: this is so wrong. Each comparison is done in constant time, but there are 2^j/2 such comparisons to be done at level j, since the 2^j candidates there form 2^j/2 pairs.)
Thus, the whole task should be completed in O(log n) time. (Edit: O(n) time.)
However, when I try to time it, I get results that clearly reflect a linear O(n) running time.
size = 100000000 max = 0 time = 556
size = 200000000 max = 0 time = 1087
size = 300000000 max = 0 time = 1648
size = 400000000 max = 0 time = 1990
size = 500000000 max = 0 time = 2190
size = 600000000 max = 0 time = 2788
size = 700000000 max = 0 time = 3586
How come?
Here's the code (I left the arrays uninitialized to save on pre-processing time; the method, as far as I've tested it, accurately identifies the maximum value in unsorted arrays):
public static short findMax(short[] list) {
    return findMax(list, 0, list.length);
}

public static short findMax(short[] list, int start, int end) {
    if (end - start == 1) {
        return list[start];
    } else {
        short leftMax = findMax(list, start, start + (end - start) / 2);
        short rightMax = findMax(list, start + (end - start) / 2, end);
        return (leftMax <= rightMax) ? rightMax : leftMax;
    }
}

public static void main(String[] args) {
    for (int j = 1; j < 10; j++) {
        int size = j * 100000000; // 100mil to 900mil
        short[] x = new short[size];
        long start = System.currentTimeMillis();
        int max = findMax(x);
        long end = System.currentTimeMillis();
        System.out.println("size = " + size + "\t\t\tmax = " + max + "\t\t\t time = " + (end - start));
        System.out.println();
    }
}
You should count the number of comparisons that actually take place:
In the final step, after you find the maximum of the first n/2 numbers and the last n/2 numbers, you need 1 more comparison to find the maximum of the entire set of numbers.
On the previous step you have to find the maximum of the first and second groups of n/4 numbers and the maximum of the third and fourth groups of n/4 numbers, so you have 2 comparisons.
Finally, at the end of the recursion, you have n/2 groups of 2 numbers, and you have to compare each pair, so you have n/2 comparisons.
When you sum them all you get :
1 + 2 + 4 + ... + n/2 = n-1 = O(n)
You indeed create log(n) layers.
But at the end of the day, you still go through each element of every created bucket. Therefore you go through every element. So overall you are still O(n).
With Eran's answer, you already know what's wrong with your reasoning.
But anyway, there is a theorem called the Master Theorem, which aids in the running time analysis of recursive functions.
It concerns recurrences of the following form:
T(n) = a*T(n/b) + O(n^d)
Where T(n) is the running time for a problem of size n.
In your case, the recurrence equation would be T(n) = 2*T(n/2) + O(1) So a=2, b=2, and d=0. That is the case because, for each n-sized instance of your problem, you break it into 2 (a) subproblems, of size n / 2 (b), and combines them in O(1) = O(n^0).
The master theorem simply states three cases:
if a = b^d, then the total running time is O(n^d*log n)
if a < b^d, then the total running time is O(n^d)
if a > b^d, then the total running time is O(n^(log a / log b))
Your case matches the third, so the total running time is O(n^(log 2 / log 2)) = O(n)
It is a nice exercise to try to understand the reason behind these three cases. They are merely the cases for which:
1st) We do the same total amount of work at each recursion level (this is the case of mergesort), so we only multiply the merging time, O(n^d), by the number of levels, log n.
2nd) We do less work at the second recursion level than at the first, and so on. Therefore the total work is basically that of the top-level merge step (the first recursion level), O(n^d).
3rd) We do more work at deeper levels (your case), so the running time is O(number of leaves in the recursion tree). In your case you have n leaves at the deepest recursion level, so O(n).
There are some short videos in a Stanford Coursera course which explain the Master Method very nicely, available at https://www.coursera.org/course/algo. I believe you can always preview the course, even if not enrolled.
I am developing a space combat game in Java as part of an ongoing effort to learn the language. In a battle, I have k ships firing their guns at a fleet of n of their nefarious enemies. Depending on how many of their enemies get hit by how many of the shots, (each ship fires one shot which hits one enemy), some will be damaged and some destroyed. I want to figure out how many enemies were hit once, how many were hit twice and so on, so that at the end I have a table that looks something like this, for 100 shots fired:
Number of hits | Number of occurrences | Total shots
----------------------------------------------------
1 | 30 | 30
2 | 12 | 24
3 | 4 | 12
4 | 7 | 28
5 | 1 | 5
Obviously, I can brute force this for small numbers of shots and enemies by randomly placing each shot on an enemy and then counting how many times each got shot at the end. This method, however, will be very impractical if I've got three million intrepid heroes firing on a swarm of ten million enemies.
Ideally, what I'd like is a way to generate a distribution of how many enemies are likely to be hit by exactly some number of shots. I could then use a random number generator to pick a point on that distribution, and then repeat this process, increasing the number of hits each time, until approximately all shots are accounted for. Is there a general statistical distribution / way of estimating approximately how many enemies get hit by how many shots?
I've been trying to work out something from the birthday problem to figure out the probability of how many birthdays are shared by exactly some number of people, but have not made any significant progress.
I will be implementing this in Java.
EDIT: I found a simplification of this that may be easier to solve: what's the distribution of probabilities that n enemies are not hit at all? I.e., what's the probability that zero are not hit, one is not hit, two are not hit, etc.?
It's a similar problem, (ok, the same problem but with a simplification), but seems like it might be easier to solve, and would let me generate the full distribution in a couple of iterations.
You should take a look at the multinomial distribution, constraining it to the case where all p_i are equal to 1/k (be careful to note that the Wikipedia article swaps the meaning of your k and n).
Previous attempt at answer
Maybe an approach like the following will be fruitful:
the probability that a particular ship is hit by a particular shot is 1/n;
the probability that a given ship is hit exactly once after k shots: h1 = (1/n) * (1-1/n)^(k-1);
as above, but exactly twice: h2 = (1/n)^2 * (1-1/n)^(k-2), and so on;
expected number of ships hit exactly once: n * h1, and so on.
If you have S ships and fire A shots at them, each individual ship's number of hits will follow a binomial distribution where p = 1/S and n = A:
http://en.wikipedia.org/wiki/Binomial_distribution
You can query this distribution and ask:
How likely is it for a ship to be hit 0 times?
How likely is it for a ship to be hit 1 time?
How likely is it for a ship to be hit 2 times?
How likely is it for a ship to be hit (max health) or more times? (Hint: just subtract the sum of everything below that from 1.0)
and multiply these by the number of ships, S, to get the number of ships that you expect to be hit 0, 1, 2, 3, etc times. However, as this is an expectation not a randomly rolled result, battles will go exactly the same way every time.
If you have a low number of ships yet a high number of shots, you can roll the binomial distribution once per ship. OR, if you have a low number of shots yet a high number of ships, you can randomly place each shot. I haven't yet thought of a cool way to get the random distribution (or a random approximation thereof) for a high number of shots AND a high number of ships, but it would be awesome to find out one :)
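To illustrate the querying idea (my own snippet, plain Java rather than a statistics library), here is how you might compute the expected number of ships hit exactly h times:

public class BinomialHits {
    // C(n, h) * p^h * (1-p)^(n-h), computed in log space to avoid overflow
    static double binomialPmf(int n, int h, double p) {
        double logC = 0;
        for (int i = 0; i < h; i++)
            logC += Math.log(n - i) - Math.log(i + 1);
        return Math.exp(logC + h * Math.log(p) + (n - h) * Math.log(1 - p));
    }

    public static void main(String[] args) {
        int ships = 1000, shots = 100;
        double p = 1.0 / ships;
        for (int h = 0; h <= 5; h++)
            System.out.printf("hit %d times: expect %.2f ships%n",
                    h, ships * binomialPmf(shots, h, p));
    }
}

With 1000 ships and 100 shots, for instance, this expects roughly 905 ships to be hit 0 times and about 90 to be hit exactly once.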
I'm assuming that each shot has probability h to hit any bad ship. If h = 0, all shots will miss. If h = 1, all shots will hit something.
Now, let's say you shoot b bullets. The expected value of ships hit is simply Hs = h * b, but these are not unique ships hit.
So we have a list of ships that is Hs long. The chance of any specific enemy ship being hit given N enemy ships is 1/N. Therefore, the chance to be in the first k slots but not the other slots is
(1/N)^k * (1-1/N)^(Hs-k)
Note that this is Marko Topolnik's answer. The problem is that this is a specific ship being in the FIRST k slots, as opposed to being in any combination of k slots. We must modify this by taking into the account the number of combinations of k slots in Hs total slots:
(Hs choose k) * (1/N)^k * (1-1/N)^(Hs-k)
Now we have the chance of a specific ship being in k slots. Well, now we need to consider the entire fleet of N ships:
(Hs choose k) * (1/N)^k * (1-1/N)^(Hs-k) * N
This expression represents the expected number of ships being hit k times within an N sized fleet that was hit with Hs shots in a uniform distribution.
Numerical Sanity Check:
Let's say two bullets hit (Hs=2) and we have two enemy ships (N=2). Assign each ship a binary ID, and let's enumerate the possible hit lists.
00 (ship 0 hit twice)
01
10
11
The number of ships hit once is:
(2 choose 1) * (1/2)^1 * (1-1/2)^(2-1) * 2 = 1
The number of ships hit twice is:
(2 choose 2) * (1/2)^2 * (1-1/2)^(2-2) * 2 = 0.5
To complete the sanity check, we need to make sure our total number of hits equals Hs. Every ship hit twice takes 2 bullets, and every ship hit once takes one bullet:
1*1 + 0.5*2 = 2 == Hs **TRUE**
One more quick example with Hs=3 and N=2:
(3 choose 1) * (1/2)^1 * (1-1/2)^(3-1) * 2
3 * 0.5 * 0.25 * 2 = 0.75
(3 choose 2) * (1/2)^2 * (1-1/2)^(3-2) * 2
3 * 0.5^2 * 0.5 * 2 = 0.75
(3 choose 3) * (1/2)^3 * (1-1/2)^(3-3) * 2
1 * 0.5^3 * 1 * 2 = 0.25
0.75 + 0.75*2 + 0.25*3 = 3 == Hs **TRUE**
Figured out a way of solving this, and finally got around to writing it up in Java. This gives an exact solution for computing the probability of m ships not being hit given k ships and n shots. It is, however, quite computationally expensive. First, a summary of what I did:
The probability is equal to the total number of ways to shoot the ships with exactly m not hit, divided by the total number of ways to shoot the ships.
P = m_misses / total
Total is k^n, since each shot can hit one of k ships.
To get the numerator, start with nCr(k,m). This is the number of ways of choosing m ships to not be hit. This, multiplied by the number of ways of hitting the remaining k-m ships without missing any of them, gives the numerator:
nCr(k,m)*(k-m_noMiss)
P = ---------------------
k^n
Now to calculate the second term in the numerator. This is the sum across all distributions of shots of how many ways there are for a certain shot distribution to happen. For example, if 2 ships are hit by 3 bullets, and each ship is hit at least once, they can be hit in the following ways:
100
010
001
110
101
011
The shot distributions are equal to the length k-m compositions of n. In this case, we would have [2,1] and [1,2], the length 2 compositions of 3.
For the first composition, [2,1], we can calculate the number of ways of generating this by choosing 2 out of the 3 shots to hit the first ship, and then 1 out of the remaining 1 shot to hit the second, i.e. nCr(3,2) * nCr(1,1). Note that we can simplify this to 3!/(2!*1!). This pattern applies to all shot patterns, so the number of ways that a certain pattern p can occur can be written as n!/prodSum(j=1,k-m,p_j!), in which the notation indicates the product over j from 1 to k-m, and p_j represents the jth term of p.
If we define P as the set of all length k-m compositions of n, the probability of m ships not being hit is then:
nCr(k,m)*sum(p is an element of P, n!/prodSum(j=1,k-m,p_j!))
P = --------------------------------------------------------------
k^n
The notation is a bit sloppy since there's no way of putting real equations or math symbols into SO, but that's the gist of it.
That being said, this method is horribly inefficient, but I can't seem to find a better one. If someone can simplify this, by all means post your method! I'm curious as to how it can be done.
And the Java code for doing this:
import java.util.ArrayList;
import java.util.Arrays;

import org.apache.commons.math3.util.ArithmeticUtils;

class Prob {

    public boolean listsEqual(Integer[] integers, Integer[] rootComp) {
        if (integers.length != rootComp.length) {
            return false;
        }
        for (int i = 0; i < integers.length; i++) {
            // use equals(): == / != on Integer compares references, not values
            if (!integers[i].equals(rootComp[i])) { return false; }
        }
        return true;
    }

    public Integer[] firstComp(int base, int length) {
        Integer[] comp = new Integer[length];
        Arrays.fill(comp, 1);
        comp[0] = base - length + 1;
        return comp;
    }

    public Integer[][] enumerateComps(int base, int length) {
        // Provides all compositions of base of size length
        if (length > base) { return null; }
        Integer[] rootComp = firstComp(base, length);
        ArrayList<Integer[]> compsArray = new ArrayList<Integer[]>();
        do {
            compsArray.add(rootComp);
            rootComp = makeNextComp(rootComp);
        } while (!listsEqual(compsArray.get(compsArray.size() - 1), rootComp));
        Integer[][] newArray = new Integer[compsArray.size()][length];
        int i = 0;
        for (Integer[] comp : compsArray) {
            newArray[i] = comp;
            i++;
        }
        return newArray;
    }

    public double getProb(int k, int n, int m) {
        // k = # of bins (ships)
        // n = number of objects (shots)
        // m = number of empty bins (unscathed ships)
        // First generate the list of length k-m compositions of n
        if ((n < (k - m)) || (m >= k)) {
            return 0;
        }
        Integer[][] L = enumerateComps(n, k - m);
        double num = 0;
        double den = Math.pow(k, n);
        double prodSum;
        int remainder;
        for (Integer[] thisComp : L) {
            remainder = n;
            prodSum = 1;
            for (Integer thisVal : thisComp) {
                prodSum = prodSum * ArithmeticUtils.binomialCoefficient(remainder, thisVal);
                remainder -= thisVal;
            }
            num += prodSum;
        }
        return num * ArithmeticUtils.binomialCoefficient(k, m) / den;
    }

    public Integer[] makeNextComp(Integer[] rootComp) {
        Integer[] comp = rootComp.clone();
        int i = comp.length - 1;
        int lastVal = comp[i];
        i--;
        for (; i >= 0; i--) {
            if (comp[i] != 1) {
                // Subtract 1 from comp[i]
                comp[i] -= 1;
                i++;
                comp[i] = lastVal + 1;
                i++;
                for (; i < comp.length; i++) {
                    comp[i] = 1;
                }
                return comp;
            }
        }
        return comp;
    }
}

public class numbersTest {
    public static void main(String[] args) {
        Prob getProbs = new Prob();
        int k = 10; // ships
        int n = 10; // shots
        int m = 4;  // unscathed
        double myProb = getProbs.getProb(k, n, m);
        System.out.printf("Probability of %s ships, %s shots, and %s unscathed: %s", k, n, m, myProb);
    }
}