I am developing a space combat game in Java as part of an ongoing effort to learn the language. In a battle, I have k ships firing their guns at a fleet of n of their nefarious enemies. Depending on how many of their enemies get hit by how many of the shots (each ship fires one shot, which hits one enemy), some will be damaged and some destroyed. I want to figure out how many enemies were hit once, how many were hit twice and so on, so that at the end I have a table that looks something like this, for 100 shots fired:
Number of hits | Number of occurrences | Total shots
----------------------------------------------------
             1 |                    30 |          30
             2 |                    12 |          24
             3 |                     4 |          12
             4 |                     7 |          28
             5 |                     1 |           5
Obviously, I can brute force this for small numbers of shots and enemies by randomly placing each shot on an enemy and then counting how many times each got shot at the end. This method, however, will be very impractical if I've got three million intrepid heroes firing on a swarm of ten million enemies.
Ideally, what I'd like is a way to generate a distribution of how many enemies are likely to be hit by exactly some number of shots. I could then use a random number generator to pick a point on that distribution, and then repeat this process, increasing the number of hits each time, until approximately all shots are accounted for. Is there a general statistical distribution / way of estimating approximately how many enemies get hit by how many shots?
I've been trying to work out something from the birthday problem to figure out the probability of how many birthdays are shared by exactly some number of people, but have not made any significant progress.
I will be implementing this in Java.
EDIT: I found a simplification of this that may be easier to solve: what's the distribution of probabilities that n enemies are not hit at all? I.e., what's the probability that zero are not hit, one is not hit, two are not hit, etc.?
It's a similar problem (OK, the same problem with a simplification), but it seems like it might be easier to solve, and it would let me generate the full distribution in a couple of iterations.
You should take a look at the multinomial distribution, constraining it to the case where all p_i are equal to 1/k (be careful to note that the Wikipedia article swaps the meaning of your k and n).
Previous attempt at answer
Maybe an approach like the following will be fruitful:
the probability that a particular ship is hit by a particular shot is 1/n;
the probability that a given ship is hit exactly once after k shots: h1 = (1/n) (1 - 1/n)^(k-1);
as above, but exactly twice: h2 = (1/n)^2 (1 - 1/n)^(k-2), and so on;
expected number of ships hit exactly once: n * h1, and so on.
If you have S ships and fire A shots at them, each individual ship's number of hits will follow a binomial distribution where p = 1/S and n = A:
http://en.wikipedia.org/wiki/Binomial_distribution
You can query this distribution and ask:
How likely is it for a ship to be hit 0 times?
How likely is it for a ship to be hit 1 time?
How likely is it for a ship to be hit 2 times?
How likely is it for a ship to be hit (max health) or more times? (Hint: Just subtract 1.0 from everything below)
and multiply these by the number of ships, S, to get the number of ships that you expect to be hit 0, 1, 2, 3, etc times. However, as this is an expectation not a randomly rolled result, battles will go exactly the same way every time.
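For illustration, a minimal sketch of querying that distribution with Apache Commons Math (the BinomialDistribution class and its probability/cumulativeProbability methods are from commons-math3; the ship count S, shot count A and maxHealth value are made-up example numbers):

import org.apache.commons.math3.distribution.BinomialDistribution;

public class ExpectedHits {
    public static void main(String[] args) {
        int S = 1000;        // ships
        int A = 100;         // shots fired at them
        int maxHealth = 5;   // hits needed to destroy a ship

        BinomialDistribution hits = new BinomialDistribution(A, 1.0 / S);
        for (int h = 0; h < maxHealth; h++) {
            // expected number of ships hit exactly h times = S * P(X = h)
            System.out.printf("hit %d times: %.2f ships%n", h, S * hits.probability(h));
        }
        // ships hit maxHealth or more times: subtract the CDF below maxHealth from 1.0
        double pDestroyed = 1.0 - hits.cumulativeProbability(maxHealth - 1);
        System.out.printf("destroyed (>= %d hits): %.2f ships%n", maxHealth, S * pDestroyed);
    }
}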
If you have a low number of ships yet a high number of shots, you can roll the binomial distribution once per ship. OR if you have a low number of shots yet a high number of ships, you can randomly place each shot. I haven't yet thought of a cool way to get the random distribution (or a random approximation thereof) for a high number of shots AND a high number of ships, but it would be awesome to find out one :)
I'm assuming that each shot has probability h to hit any bad ship. If h = 0, all shots will miss. If h = 1, all shots will hit something.
Now, let's say you shoot b bullets. The expected value of ships hit is simply Hs = h * b, but these are not unique ships hit.
So we have a list of hit ships that is Hs long. The chance of any specific enemy ship being hit, given N enemy ships, is 1/N. Therefore, the chance of it being in the first k slots but not the other slots is
(1/N)^k * (1-1/N)^(Hs-k)
Note that this is Marko Topolnik's answer. The problem is that this is a specific ship being in the FIRST k slots, as opposed to being in any combination of k slots. We must modify this by taking into the account the number of combinations of k slots in Hs total slots:
(Hs choose k) * (1/N)^k * (1-1/N)^(Hs-k)
Now we have the chance of a specific ship being in k slots. Well, now we need to consider the entire fleet of N ships:
(Hs choose k) * (1/N)^k * (1-1/N)^(Hs-k) * N
This expression represents the expected number of ships being hit k times within an N sized fleet that was hit with Hs shots in a uniform distribution.
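A small self-contained sketch of this expression in Java (the names are mine; it builds the binomial term with a multiplicative recurrence instead of factorials, so it also works for large Hs and N):

public class HitDistribution {

    /** Expected number of ships hit exactly k times when Hs hits land uniformly on N ships. */
    static double expectedShipsHitKTimes(int Hs, int N, int k) {
        // p_0 = (1 - 1/N)^Hs  and  p_{k+1} = p_k * (Hs - k) / ((k + 1) * (N - 1))
        double p = Math.pow(1.0 - 1.0 / N, Hs);
        for (int i = 0; i < k; i++) {
            p *= (Hs - i) / ((i + 1) * (double) (N - 1));
        }
        return p * N;
    }

    public static void main(String[] args) {
        // matches the sanity check below: Hs = 2, N = 2
        System.out.println(expectedShipsHitKTimes(2, 2, 1)); // 1.0
        System.out.println(expectedShipsHitKTimes(2, 2, 2)); // 0.5
    }
}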
Numerical Sanity Check:
Let's say two bullets hit (Hs=2) and we have two enemy ships (N=2). Assign each ship a binary ID, and let's enumerate the possible hit lists.
00 (ship 0 hit twice)
01
10
11
The number of ships hit once is:
(2 choose 1) * (1/2)^1 * (1-1/2)^(2-1) * 2 = 1
The number of ships hit twice is:
(2 choose 2) * (1/2)^2 * (1-1/2)^(2-2) * 2 = 0.5
To complete the sanity check, we need to make sure our total number of hits equals Hs. Every ship hit twice takes 2 bullets, and every ship hit once takes one bullet:
1*1 + 0.5*2 = 2 == Hs **TRUE**
One more quick example with Hs=3 and N=2:
(3 choose 1) * (1/2)^1 * (1-1/2)^(3-1) * 2
3 * 0.5 * 0.25 * 2 = 0.75
(3 choose 2) * (1/2)^2 * (1-1/2)^(3-2) * 2
3 * 0.5^2 * 0.5 * 2 = 0.75
(3 choose 3) * (1/2)^3 * (1-1/2)^(3-3) * 2
1 * 0.5^3 * 1 * 2 = 0.25
0.75 + 0.75*2 + 0.25*3 = 3 == Hs **TRUE**
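If you want to double-check the formula for other values, a rough Monte Carlo sketch is easy to write (this is my own check, not part of the derivation above):

import java.util.Random;

public class MonteCarloCheck {
    public static void main(String[] args) {
        int Hs = 3, N = 2, trials = 1_000_000;
        Random rnd = new Random();
        double[] shipsHitKTimes = new double[Hs + 1];

        for (int t = 0; t < trials; t++) {
            int[] hits = new int[N];
            for (int s = 0; s < Hs; s++) {
                hits[rnd.nextInt(N)]++;            // each hit lands on a uniformly random ship
            }
            for (int ship = 0; ship < N; ship++) {
                shipsHitKTimes[hits[ship]]++;
            }
        }
        for (int k = 1; k <= Hs; k++) {
            // should come out close to 0.75, 0.75, 0.25 for Hs = 3, N = 2
            System.out.printf("hit %d times: %.3f ships on average%n", k, shipsHitKTimes[k] / trials);
        }
    }
}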
Figured out a way of solving this, and finally got around to writing it up in Java. This gives an exact solution for computing the probability of m ships not being hit given k ships and n shots. It is, however, quite computationally expensive. First, a summary of what I did:
The probability is equal to the total number of ways to shoot the ships such that exactly m are not hit, divided by the total number of ways to shoot the ships.
P = m_misses / total
Total is k^n, since each shot can hit one of k ships.
To get the numerator, start with nCr(k,m). This is the number of ways of choosing m ships to not be hit. This, multiplied by the number of ways of hitting the remaining k-m ships so that none of them is missed, gives the numerator:
    nCr(k,m) * (k-m_noMiss)
P = -----------------------
              k^n
Now to calculate the second term in the numerator. This is the sum, across all distributions of shots, of how many ways there are for a certain shot distribution to happen. For example, if 2 ships are hit by 3 bullets and each ship is hit at least once, they can be hit in the following ways (each digit says which of the two ships the corresponding bullet hits):
100
010
001
110
101
011
The shot distributions are equal to the length k-m compositions of n. In this case, we would have [2,1] and [1,2], the length 2 compositions of 3.
For the first composition, [2,1], we can calculate the number of ways of generating this by choosing 2 out of the 3 shots to hit the first ship, and then 1 out of the remaining 1 shot to hit the second, i.e. nCr(3,2) * nCr(1,1). Note that we can simplify this to 3!/(2!*1!). This pattern applies to all shot patterns, so the number of ways that a certain pattern p can occur can be written as n!/prodSum(j=1,k-m,p_j!), in which prodSum denotes the product over j from 1 to k-m, and p_j is the jth term of p.
If we define P as the set of all length k-m compositions of n, the probability of m ships not being hit is then:
    nCr(k,m) * sum(p in P, n!/prodSum(j=1,k-m,p_j!))
P = -------------------------------------------------
                         k^n
The notation is a bit sloppy since there's no way of putting equations or math symbols into SO, but that's the gist of it.
That being said, this method is horribly inefficient, but I can't seem to find a better one. If someone can simplify this, by all means post your method! I'm curious as to how it can be done.
And the java code for doing this:
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.commons.math3.util.ArithmeticUtils;
class Prob {

    public boolean listsEqual(Integer[] integers, Integer[] rootComp) {
        if (integers.length != rootComp.length) {
            return false;
        }
        for (int i = 0; i < integers.length; i++) {
            // use equals(): == on Integer compares references, not values
            if (!integers[i].equals(rootComp[i])) {
                return false;
            }
        }
        return true;
    }

    public Integer[] firstComp(int base, int length) {
        Integer[] comp = new Integer[length];
        Arrays.fill(comp, 1);
        comp[0] = base - length + 1;
        return comp;
    }

    public Integer[][] enumerateComps(int base, int length) {
        // Provides all compositions of base of size length
        if (length > base) {
            return null;
        }
        Integer[] rootComp = firstComp(base, length);
        ArrayList<Integer[]> compsArray = new ArrayList<Integer[]>();
        do {
            compsArray.add(rootComp);
            rootComp = makeNextComp(rootComp);
        } while (!listsEqual(compsArray.get(compsArray.size() - 1), rootComp));
        Integer[][] newArray = new Integer[compsArray.size()][length];
        int i = 0;
        for (Integer[] comp : compsArray) {
            newArray[i] = comp;
            i++;
        }
        return newArray;
    }
    public double getProb(int k, int n, int m) {
        // k = # of bins (ships)
        // n = number of objects (shots)
        // m = number of empty bins (unhit ships)
        if ((n < (k - m)) || (m >= k)) {
            return 0;
        }
        // First generate the list of length k-m compositions of n
        Integer[][] L = enumerateComps(n, k - m);
        double num = 0;
        double den = Math.pow(k, n);
        double prodSum;
        int remainder;
        for (Integer[] thisComp : L) {
            remainder = n;
            prodSum = 1;
            for (Integer thisVal : thisComp) {
                prodSum = prodSum * ArithmeticUtils.binomialCoefficient(remainder, thisVal);
                remainder -= thisVal;
            }
            num += prodSum;
        }
        return num * ArithmeticUtils.binomialCoefficient(k, m) / den;
    }
    public Integer[] makeNextComp(Integer[] rootComp) {
        Integer[] comp = rootComp.clone();
        int i = comp.length - 1;
        int lastVal = comp[i];
        i--;
        for (; i >= 0; i--) {
            if (comp[i] != 1) {
                // Subtract 1 from comp[i]
                comp[i] -= 1;
                i++;
                comp[i] = lastVal + 1;
                i++;
                for (; i < comp.length; i++) {
                    comp[i] = 1;
                }
                return comp;
            }
        }
        return comp;
    }
}
public class numbersTest {
    public static void main(String[] args) {
        //System.out.println(ArithmeticUtils.binomialCoefficient(100,50));
        Prob getProbs = new Prob();
        int k = 10; // ships
        int n = 10; // shots
        int m = 4;  // unscathed
        double myProb = getProbs.getProb(k, n, m);
        System.out.printf("Probability of %s ships, %s hits, and %s unscathed: %s", k, n, m, myProb);
    }
}
Related
There are N cities connected by N-1 roads.
Each adjacent pair of cities is connected by bidirectional roads i.e.
i-th city is connected to i+1-th city for all 1 <= i <= N-1, given as below:
1 --- 2 --- 3 --- 4...............(N-1) --- N
We got M queries of type (c1, c2) to disconnect the pair of cities c1 and c2.
For that we decided to block some roads to meet all these M queries.
Now, we have to determine the minimum number of roads that need to be
blocked such that all queries will be served.
Example :
inputs:
- N = 5 // number of cities
- M = 2 // number of query requests
- C = [[1,4], [2,5]] // queries
output: 1
Approach :
1. Block the road connecting the cities C2 and C3 and all queries will be served.
2. Thus, the minimum number of roads that need to be blocked is 1.
Constraints :
- 1 <= T <= 2 * 10^5 // number of test cases
- 2 <= N <= 2 * 10^5 // number of cities
- 0 <= M <= 2 * 10^5 // number of queries
- 1 <= C(i,j) <= N
It is guaranteed that the sum of N over T test cases doesn't exceed 10^6.
It is also guaranteed that the sum of M over T test cases doesn't exceed 10^6.
My Approach:
I solved this problem using a min-heap, but I am not sure whether it works
on all edge (corner) test cases and whether it has optimal
time/space complexity.
public int solve(int N, int M, Integer[][] c) {
    int minCuts = 0;
    if (M == 0) return 0;
    // sorting based on the start city in increasing order
    Arrays.sort(c, (Integer[] a, Integer[] b) -> {
        return a[0] - b[0];
    });
    PriorityQueue<Integer> minHeap = new PriorityQueue<>();
    // as soon as I find any end city in my minHeap which is less than or equal to my current start city,
    // I increment minCuts and remove all elements from the minHeap
    for (int i = 0; i < M; i++) {
        int start = c[i][0];
        int end = c[i][1];
        if (!minHeap.isEmpty() && minHeap.peek() <= start) {
            minCuts += 1;
            while (!minHeap.isEmpty()) {
                minHeap.poll();
            }
        }
        minHeap.add(end);
    }
    return minCuts + 1;
}
Is there any any edge test-case for which this approach will fail?
For each query, there is an (inclusive) interval of acceptable cut points, so the task is to find the minimum number of cut points that intersect all intervals.
The usual algorithm for this problem, which you can see here, is an optimized implementation of this simple procedure (a rough Java sketch follows the list):
Select the smallest interval end as a cut point
Remove all the intervals that it intersects
Repeat until there are no more intervals.
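For reference, a minimal sketch of that greedy in Java (the method name and the int[]{lo, hi} interval representation are my own; intervals are assumed inclusive):

import java.util.Arrays;
import java.util.Comparator;

class MinCuts {

    /** Minimum number of cut points needed to intersect every [lo, hi] interval. */
    static int minCutPoints(int[][] intervals) {
        Arrays.sort(intervals, Comparator.comparingInt(iv -> iv[1])); // sort by interval end
        int cuts = 0;
        long lastCut = Long.MIN_VALUE;
        for (int[] iv : intervals) {
            if (iv[0] > lastCut) {   // not intersected by the last cut point
                lastCut = iv[1];     // cut at the smallest remaining interval end
                cuts++;
            }
        }
        return cuts;
    }
}

For the road problem, the interval for a query (c1, c2) with c1 < c2 would be the roads c1 .. c2-1 (road i joining city i and i+1).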
It's easy to prove that it's always optimal to select the smallest interval end:
The smallest cut point must be <= the smallest interval end, because otherwise that interval won't get cut.
If an interval intersects any point <= the smallest interval end, then it must also intersect the smallest interval end.
The smallest interval end is therefore an optimal choice for the smallest cut point.
It takes a little more work, but you can prove that your algorithm is also an implementation of this procedure.
First, we can show that the smallest interval end is always the first one popped off the heap, because nothing is popped until we find a starting point greater than a known endpoint.
Then we can show that the endpoints removed from the heap correspond to exactly the intervals that are cut by that first endpoint. All of their start points must be <= that first endpoint, because otherwise we would have removed them earlier. Note that you didn't adjust your queries into inclusive intervals, so your test says peek() <= start. If they were adjusted to be inclusive, it would say peek() < start.
Finally, we can trivially show that there are always unpopped intervals left on the heap, so you need that +1 at the end.
So your algorithm makes the same optimal selection of cut points. It's more complicated than the other one, though, and harder to verify, so I wouldn't use it.
I need a fast way to find the maximum value when intervals are overlapping. Unlike finding the point covered by the most intervals, here there is an "order": I have int[][] data with 2 values per int[], where the first number is the center and the second number is the radius; the closer a point is to the center, the larger the value at that point. For example, if I am given data like:
int[][] data = new int[][]{
{1, 1},
{3, 3},
{2, 4}};
Then on a number line, this is how it's going to looks like:
x axis: -2 -1  0  1  2  3  4  5  6  7
1 1:           1  2  1
3 3:           1  2  3  4  3  2  1
2 4:     1  2  3  4  5  4  3  2  1
So for the value of my point to be as large as possible, I need to pick the point x = 2, which gives a total value of 1 + 3 + 5 = 9, the largest possible value. Is there a way to do it fast, with a time complexity of O(n) or O(n log n)?
This can be done with a simple O(n log n) algorithm.
Consider the value function v(x), and then consider its discrete derivative dv(x)=v(x)-v(x-1). Suppose you only have one interval, say {3,3}. dv(x) is 0 from -infinity to -1, then 1 from 0 to 3, then -1 from 4 to 6, then 0 from 7 to infinity. That is, the derivative changes by 1 "just after" -1, by -2 just after 3, and by 1 just after 6.
For n intervals, there are 3*n derivative changes (some of which may occur at the same point). So find the list of all derivative changes (x,change), sort them by their x, and then just iterate through the set.
Behold:
intervals = [(1,1), (3,3), (2,4)]
events = []
for mid, width in intervals:
    before_start = mid - width - 1
    at_end = mid + width
    events += [(before_start, 1), (mid, -2), (at_end, 1)]
events.sort()

prev_x = -1000
v = 0
dv = 0
best_v = -1000
best_x = None
for x, change in events:
    dx = x - prev_x
    v += dv * dx
    if v > best_v:
        best_v = v
        best_x = x
    dv += change
    prev_x = x
print(best_x, best_v)
And also the java code:
TreeMap<Integer, Integer> ts = new TreeMap<Integer, Integer>();
for (int i = 0; i < cows.size(); i++) {
    int index = cows.get(i)[0] - cows.get(i)[1];
    if (ts.containsKey(index)) {
        ts.replace(index, ts.get(index) + 1);
    } else {
        ts.put(index, 1);
    }
    index = cows.get(i)[0] + 1;
    if (ts.containsKey(index)) {
        ts.replace(index, ts.get(index) - 2);
    } else {
        ts.put(index, -2);
    }
    index = cows.get(i)[0] + cows.get(i)[1] + 2;
    if (ts.containsKey(index)) {
        ts.replace(index, ts.get(index) + 1);
    } else {
        ts.put(index, 1);
    }
}
int value = 0;
int best = 0;
int change = 0;
int indexBefore = -100000000;
while (ts.size() > 1) {
    int index = ts.firstKey();
    value += (index - indexBefore) * change; // advance by the gap since the previous event key
    best = Math.max(value, best);
    change += ts.get(index);
    indexBefore = index;                     // remember the key we just processed
    ts.remove(index);
}
where cows is the data
Hmmm, a general O(n log n) or better would be tricky, probably solvable via linear programming, but that can get rather complex.
After a bit of wrangling, I think this can be solved via line intersections and summation of functions (represented by line segment intersections). Basically, think of each input as a triangle on top of a line. If an input is (C,R), the triangle is centered on C and has a radius of R. The points on the line are C-R (value 0), C (value R) and C+R (value 0). Each line segment of the triangle represents a value.
Consider any 2 such "triangles"; the max value occurs in one of 2 places:
The peak of one of the triangles
The intersection point of the triangles, i.e. the point where the two triangles overlap. Multiple triangles just mean more possible intersection points; sadly, the number of possible intersections grows quadratically, so O(N log N) or better may be impossible with this method (unless some good optimizations are found, or unless the number of intersections is O(N) or less).
To find all the intersection points, we can just use a standard algorithm for that, but we need to modify things in one specific way: we need to add a line that extends from each peak high enough that it would be higher than any other line, so basically from (C,C) to (C,Max_R). We then run the algorithm; output-sensitive intersection-finding algorithms are O(N log N + k), where k is the number of intersections. Sadly this can be as high as O(N^2) (consider the case (1,100), (2,100), (3,100), ... and so on up to (50,100): every line would intersect every other line). Once you have the O(N + k) intersections, at every intersection you can calculate the value by summing the values of all the lines currently in the queue. The running sum can be kept as a cached value so it only changes O(k) times, though that might not be possible, in which case it would be O(N*k) instead, making it potentially O(N^3) (in the worst case for k) :(. For each intersection you need to sum up to O(N) lines to get the value for that point, though in practice it would likely perform better.
There are optimizations that could be done considering that you aim for the max and not just to find intersections. There are likely intersections not worth pursuing; however, I could also see situations where it is so close that you can't cut them down. It reminds me of convex hull: in many cases you can easily reduce 90% of the data, but there are cases where you see the worst-case results (every point or almost every point is a hull point). For example, in practice there are certainly cases where you can be sure that the sum is going to be less than the current known max value.
Another optimization might be building an interval tree.
I have created a TicTacToe game. I use the minimax algorithm.
When the board is 3x3 I just calculate every possible move of the game till the end and score -1 for a loss, 0 for a tie, 1 for a win.
When it comes to 5x5 that can't be done (too many options, like 24^24), so I have created an evaluation method which gives: 10^0 for one CIRCLE in a line, 10^1 for 2 CIRCLEs in a line, ..., 10^4 for 5 CIRCLEs in a line, but it is useless.
Does anybody have a better idea for the assessment?
Example:
O|X|X| | |
----------
|O| | | |
----------
X|O| | | |
----------
| | | | |
----------
| | | | |
Evaluation -10: 2 circles on a diagonal once and in a line once (+200), 2 crosses in a line (-100), and -1 three times and +1 three times for single crosses and circles.
This is my evaluation method now:
public void setEvaluationForBigBoards() {
    int evaluation = 0;
    int howManyInLine = board.length;
    for (; howManyInLine > 0; howManyInLine--) {
        evaluation += countInlines(player.getStamp(), howManyInLine);
        evaluation -= countInlines(player.getOppositeStamp(), howManyInLine);
    }
    this.evaluation = evaluation;
}

public int countInlines(int sign, int howManyInLine) {
    int points = (int) Math.pow(10, howManyInLine - 1);
    int positiveCounter = 0;
    for (int i = 0; i < board.length; i++) {
        for (int j = 0; j < board[i].length; j++) {
            // check whether, starting from this cell, there is a sequence to the right,
            // straight down, diagonally down-right, or diagonally down-left
            if (toRigth(i, j, sign, howManyInLine))
                positiveCounter++;
            if (howManyInLine > 1) {
                if (toDown(i, j, sign, howManyInLine))
                    positiveCounter++;
                if (toRightDiagonal(i, j, sign, howManyInLine))
                    positiveCounter++;
                if (toLeftDiagonal(i, j, sign, howManyInLine))
                    positiveCounter++;
            }
        }
    }
    return points * positiveCounter;
}
The number of options (possible sequences of moves) after the first move is 24! and not 24^24. It is still far too high
a number, so it is correct to implement a heuristic.
Note that answers about good heuristics are necessarily based on the opinion of the writer, so I give my opinion, but to find
out what is "the best heuristic" you should make the various ideas play against one another in the following way:
take the two heuristics A and B that you want to compare
generate at random a starting configuration
let A play with O and B play with X
from the same starting configuration let A play with X and B play with O
take stats of which one wins more
Now my thoughts about possible good heuristic starting points for an n x n playfield with a winning sequence length of n:
since the winning condition for a player is to form a straight sequence of its marks, my idea is to use as base values the number of possibilities that each player still has available to build such a straight sequence.
in an empty field both O and X ideally have the possibility to realize the winning sequence in several ways:
horizontal possibilities: n
vertical possibilities: n
diagonal possibilities: 2
total possibilities: 2n+2
in the middle of a round, the number of remaining opportunities for a player is calculated as: "the number of rows without opponent's marks + the number of columns without opponent's marks + the number of diagonals without opponent's marks".
instead of recalculating everything each time, it can be considered that:
after a move by one player, the number of still-available possibilities is:
unchanged for him
equal or lowered for the opponent (lowered if the mark has been placed in a row/column/diagonal where the moving player had not already placed a mark)
as a heuristic I can propose (own remaining possibilities) - (opponent's remaining possibilities);
it is possible that (own possibilities) - k * (opponent's possibilities) with k > 1 gives better results, and in the end this can be related to how a draw is valued with regard to a loss.
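A minimal sketch of that count (my own illustration; it assumes an int[][] board where 0 means empty and each player has its own mark value):

class OpenLines {

    /** Number of rows, columns and main diagonals that contain no mark of the given opponent. */
    static int openLines(int[][] board, int opponentMark) {
        int n = board.length, open = 0;
        for (int r = 0; r < n; r++) {                        // rows
            boolean free = true;
            for (int c = 0; c < n; c++) if (board[r][c] == opponentMark) free = false;
            if (free) open++;
        }
        for (int c = 0; c < n; c++) {                        // columns
            boolean free = true;
            for (int r = 0; r < n; r++) if (board[r][c] == opponentMark) free = false;
            if (free) open++;
        }
        boolean d1 = true, d2 = true;                        // the two main diagonals
        for (int i = 0; i < n; i++) {
            if (board[i][i] == opponentMark) d1 = false;
            if (board[i][n - 1 - i] == opponentMark) d2 = false;
        }
        if (d1) open++;
        if (d2) open++;
        return open;
    }

    /** Heuristic proposed above: my open lines minus k times the opponent's open lines. */
    static int evaluate(int[][] board, int myMark, int opponentMark, int k) {
        return openLines(board, opponentMark) - k * openLines(board, myMark);
    }
}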
One side consideration:
playfield cells are n^2
winning possibilities are 2n+2 if we keep the winning length equal to the field edge size
this gives me the idea that the more the size is increased, the less interesting the game is to play, because the probability of a draw after a low number of moves (relative to the playfield area) becomes higher and higher.
for this reason I think that the game with a winning length lower than n (for example 3, independently of the playfield size) is more interesting.
Calling l the winning length, the number of possibilities is 2*((n+1-l)*(2n+1-l)) = O(n^2), and so well proportioned to the field area.
I have a map of items with some probability distribution:
Map<SingleObjectiveItem, Double> itemsDistribution;
Given a certain m I have to generate a Set of m elements sampled from the above distribution.
As of now I was using the naive way of doing it:
while (mySet.size() < m)
    mySet.add(getNextSample(itemsDistribution));
The getNextSample(...) method fetches an object from the distribution as per its probability. Now, as m increases the performance severely suffers. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function remains in the while loop for too long. Generate 1000 such sets and you have an application that crawls.
Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?
UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, in the first run we'll definitely get a sample for the set. In the 2nd iteration it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., goes through the entire cumulative distribution to find the sample, then the run time complexity of the loop is at least n*m(m+1)/2, where 'n' is the number of elements in the distribution. If m = cn, 0 < c <= 1, then the loop is at least Ω(n^3). And that is only the lower bound!
If we replace sequential search by binary search, the complexity would be at least Ω(n^2 log n). Efficient, but perhaps not by a large margin.
Also, removing from the distribution is not possible since I call the above loop k times, to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.
Start out by generating a number of random points in two dimensions.
Then apply your distribution.
Now find all entries within the distribution and pick the x coordinates, and you have your random numbers with the requested distribution.
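That is essentially rejection sampling (as another answer notes below). A hedged Java sketch, where f is any non-negative density-like function on [x0, x1] and fMax is an upper bound on it (all names here are mine):

import java.util.Random;
import java.util.function.DoubleUnaryOperator;

class RejectionSampler {

    static double sample(DoubleUnaryOperator f, double x0, double x1, double fMax, Random rnd) {
        while (true) {
            double x = x0 + rnd.nextDouble() * (x1 - x0);  // random point in two dimensions:
            double y = rnd.nextDouble() * fMax;            // (x, y) uniform over the bounding box
            if (y < f.applyAsDouble(x)) {                  // keep only points under the curve
                return x;                                  // the kept x coordinates follow f
            }
        }
    }
}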
The problem is unlikely to be the loop you show:
Let n be the size of the distribution, and I be the number of invocations to getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations to getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i / n, and therefore exponentially distributed with λ. Therefore, E[C_i] = 1 / λ = 1 / (1 - i / n) <= 1 / (1 - m / n). Therefore, E[I] < m / (1 - m / n).
That is, sampling a set of size m = n/2 will take, on average, less than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).
The following should be faster (if m < 0.8 n)
class Distribution<T> {
    private double[] cummulativeWeight;
    private T[] item;
    private double totalWeight;

    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;
        cummulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];
        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cummulativeWeight[i] = totalWeight;
            i++;
        }
    }

    T randomItem() {
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cummulativeWeight, weight);
        if (index < 0) {
            index = -index - 1;
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem());
        }
        return set;
    }
}

public class Test {
    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }
        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}
The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.
As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:
Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}
To efficiently implement remove requires a different representation for the distribution, for instance a binary tree where each node stores the total weight of the subtree whose root it is.
But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.
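For completeness, here is a sketch of one such representation: a Fenwick (binary indexed) tree over the weights, which supports weighted sampling and removal in O(log n). This is my own illustration of the idea, not code from the answer above, and it assumes non-negative weights:

import java.util.Random;

class FenwickSampler {
    private final double[] tree;    // 1-based Fenwick tree of partial weight sums
    private final double[] weight;  // current weight of each item (0 once removed)
    private final int n;
    private double total;
    private final Random rnd = new Random();

    FenwickSampler(double[] weights) {
        n = weights.length;
        tree = new double[n + 1];
        weight = weights.clone();
        for (int i = 0; i < n; i++) {
            add(i + 1, weights[i]);
            total += weights[i];
        }
    }

    private void add(int i, double delta) {
        for (; i <= n; i += i & -i) tree[i] += delta;
    }

    /** Draws an index with probability proportional to its remaining weight, then removes it. */
    int sampleAndRemove() {
        double r = rnd.nextDouble() * total;
        int idx = 0;
        // binary lifting: find the first position whose prefix sum exceeds r
        for (int step = Integer.highestOneBit(n); step > 0; step >>= 1) {
            int next = idx + step;
            if (next <= n && tree[next] <= r) {
                r -= tree[next];
                idx = next;
            }
        }
        // idx is now the 0-based index of the sampled item
        double w = weight[idx];
        add(idx + 1, -w);
        total -= w;
        weight[idx] = 0;
        return idx;
    }
}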
If you are not too concerned with randomness properties, then I do it like this:
create buffer for pseudo-random numbers
double buff[MAX]; // [edit1] double pseudo random numbers
MAX is the size; it should be big enough ... 1024*128 for example
the type can be any (float, int, DWORD, ...)
fill buffer with numbers
you have a range of numbers x = <x0, x1> and a probability function probability(x) defined by your probability distribution, so do this:
for (i=0, x=x0; x<=x1; x+=stepx)
    for (j=0, n=probability(x)*MAX, q=0.1*stepx/n; j<n; j++, i++) // [edit1] unique pseudo-random numbers
        buff[i] = x + (double(i)*q); // [edit1] ...
The stepx is your accuracy for items (for integral types = 1); now the buff[] array has the distribution you need, but it is not pseudo-random. Also, you should add a check that i does not go >= MAX to avoid array overruns, and at the end the real size of buff[] is i (it can be less than MAX due to rounding).
shuffle buff[]
do just a few loops of swapping buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0, MAX)
write your pseudo-random function
it just returns a number from the buffer: the first call returns buff[0], the second buff[1], and so on ... For standard generators, when you hit the end of buff[] you shuffle buff[] again and start from buff[0] again. But as you need unique numbers, you cannot reach the end of the buffer, so set MAX big enough for your task, otherwise uniqueness will not be assured.
[Notes]
MAX should be big enough to store the whole distribution you want. If it is not big enough, then items with low probability can be missing completely.
[edit1] - tweaked the answer a little to match the question's needs (pointed out by meriton, thanks)
PS. The complexity of initialization is O(N) and of getting a number is O(1).
You could implement your own random number generator (using a Monte Carlo method or any good uniform generator like the Mersenne Twister) based on the inversion method (here).
For example, the exponential law: generate a uniform random number u in [0,1]; then your random variable from the exponential law would be ln(1-u)/(-lambda), lambda being the exponential law's parameter and ln the natural logarithm.
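A hedged Java sketch of that inversion step for the exponential example (the class name, field names and constructor are mine):

import java.util.Random;

class ExponentialSampler {
    private final Random rnd = new Random(); // any good uniform generator would do here
    private final double lambda;

    ExponentialSampler(double lambda) {
        this.lambda = lambda;
    }

    /** Inverse-CDF method: u ~ U(0,1), then x = ln(1 - u) / (-lambda). */
    double next() {
        double u = rnd.nextDouble();
        return Math.log(1 - u) / (-lambda);
    }
}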
Hope it'll help ;).
I think you have two problems:
Your itemDistribution doesn't know you need a set, so when the set you are building gets
large you will pick a lot of elements that are already in the set. If you start with the
set all full and remove elements you will run into the same problem for very small sets.
Is there a reason why you don't remove the element from the itemDistribution after you
picked it? Then you wouldn't pick the same element twice?
The choice of data structure for itemDistribution looks suspicious to me. You want the
getNextSample operation to be fast. Doesn't the map from values to probability force you
to iterate through large parts of the map for each getNextSample? I'm no good at
statistics, but couldn't you represent the itemDistribution the other way around, like a map from
probability (or maybe the sum of all smaller probabilities plus the probability) to an element
of the set?
Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.
A good way to pick several unique random items from a list is to first shuffle the list and then pop items off the list. You can shuffle the list once with the given distribution. From then on, picking your m items is just popping the list.
Here's an implementation of a probabilistic shuffle:
List<Item> prob_shuffle(Map<Item, int> dist)
{
    int n = dist.length;
    List<Item> a = dist.keys();
    int psum = 0;
    int i, j;

    for (i in dist) psum += dist[i];

    for (i = 0; i < n; i++) {
        int ip = rand(psum);     // 0 <= ip < psum
        int jp = 0;

        for (j = i; j < n; j++) {
            jp += dist[a[j]];
            if (ip < jp) break;
        }
        psum -= dist[a[j]];
        Item tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
    return a;
}
This is not Java, but pseudocode after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.
Here, I used integer probabilities. (The probabilities don't have to add up to a special value, it's just "bigger is better".) You can use floating-point numbers, but because of inaccuracies you might end up going beyond the array when picking an item. You should use item n - 1 then. If you add that safety net, you could even have items with zero probability that always get picked last.
There might be a method to speed up the picking loop, but I don't really see how. The swapping renders any precalculations useless.
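Here is roughly the same idea written out in Java (my own translation of the pseudocode above, keeping the integer weights; it assumes all weights are positive so rnd.nextInt(psum) is always legal):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

class ProbShuffle {

    static <T> List<T> probShuffle(Map<T, Integer> dist, Random rnd) {
        List<T> a = new ArrayList<>(dist.keySet());
        int n = a.size();
        int psum = 0;
        for (int w : dist.values()) psum += w;

        for (int i = 0; i < n; i++) {
            int ip = rnd.nextInt(psum);          // 0 <= ip < psum
            int jp = 0;
            int j;
            for (j = i; j < n; j++) {            // walk the unshuffled tail until the weight is reached
                jp += dist.get(a.get(j));
                if (ip < jp) break;
            }
            psum -= dist.get(a.get(j));          // the picked item leaves the unshuffled pool
            T tmp = a.get(i);                    // swap it to the front of the tail
            a.set(i, a.get(j));
            a.set(j, tmp);
        }
        return a;
    }
}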
Accumulate your probabilities in a table
            Probability
Item       Actual   Accumulated
Item1      0.10     0.10
Item2      0.30     0.40
Item3      0.15     0.55
Item4      0.20     0.75
Item5      0.25     1.00
Make a random number between 0.0 and 1.0 and do a binary search for the first item with a sum that is greater than your generated number. This item would have been chosen with the desired probability.
Ebbe's method is called rejection sampling.
I sometimes use a simple method based on an inverse cumulative distribution function, i.e. a function that maps a number X between 0 and 1 onto the Y axis (the range of values of the target distribution).
Then you just generate a uniformly distributed random number between 0 and 1, and apply the function to it.
That function is also called the "quantile function".
For example, suppose you want to generate a normally distributed random number.
Its cumulative distribution function is called Phi.
The inverse of that is called probit.
There are many ways to generate normal variates, and this is just one example.
You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table.
Then you can just invert it by table-lookup and interpolation.
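A rough sketch of that table approach (all names are mine; it tabulates the CDF of an arbitrary non-negative density on a grid and inverts it by binary search plus linear interpolation):

import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

class TableInverseCdf {
    private final double[] x, cdf;

    TableInverseCdf(DoubleUnaryOperator density, double x0, double x1, int steps) {
        x = new double[steps + 1];
        cdf = new double[steps + 1];
        double dx = (x1 - x0) / steps, sum = 0;
        for (int i = 0; i <= steps; i++) {
            x[i] = x0 + i * dx;
            sum += density.applyAsDouble(x[i]) * dx;    // crude numerical integration of the density
            cdf[i] = sum;
        }
        for (int i = 0; i <= steps; i++) cdf[i] /= sum; // normalise the table to [0, 1]
    }

    double sample(Random rnd) {
        double u = rnd.nextDouble();
        int i = Arrays.binarySearch(cdf, u);
        if (i < 0) i = -i - 1;                          // first index with cdf[i] >= u
        if (i == 0) return x[0];
        double span = cdf[i] - cdf[i - 1];
        double t = span > 0 ? (u - cdf[i - 1]) / span : 0; // linear interpolation between grid points
        return x[i - 1] + t * (x[i] - x[i - 1]);
    }
}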
Possible Duplicate:
Generate random number with non-uniform density
I am trying to identify/create a function (in Java) that gives me a non-uniformly distributed sequence of numbers.
If I have a function f(x), with x > 0, it will give me a random number
from 0 to x.
The function must work with any given x, and what follows is only an example of what I want.
But if we say x = 100, the function f(x) will return a non-uniformly distributed result.
And I want, for example, say:
0 to 20 in approximately 20% of all cases.
21 to 50 in approximately 50% of all cases.
51 to 70 in approximately 20% of all cases.
71 to 100 in approximately 10% of all cases.
In short, something that gives me a number like a normal distribution, peaking at 30-40 (in this case x is 100).
http://en.wikipedia.org/wiki/Normal_distribution
(I can use a uniform random generator as a source if needed, and only need a function that will transform the uniform result into a non-uniform result.)
EDIT
My final solution for this problem is:
/**
 * Returns a value from [0,1] with a mean of about 0.3. It gives: 10% of values lower
 * than 0.1, 5% higher than 0.8, and 30% in the range 0.25 to 0.45.
 *
 * @return
 */
public double nextMyGaussian() {
    double d = -1000;
    while (d < -1.5) {
        // RANDOM is Java's normal Random() class.
        // nextGaussian() normally gives a value from roughly -5 to +5?
        d = RANDOM.nextGaussian() * 1.5;
    }
    if (d > 3.5d) {
        return 1;
    }
    return ((d + 1.5) / 5);
}
A simple solution would be to generate a first random number between 0 and 9.
0 means the first ten percent, 1 the following ten percent, etc.
So if you get 0 or 1, you generate a second random number between 0 and 20. If you get 2, 3, 4, 5 or 6, you generate a second random number between 21 and 50, etc.
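A small sketch of that two-stage idea, using the 20/50/20/10 split from the question (the class and method names are mine):

import java.util.Random;

class TwoStageSampler {
    private static final Random RND = new Random();
    // value range of each bucket and its weight in tenths, taken from the question's example
    private static final int[][] RANGES  = { {0, 20}, {21, 50}, {51, 70}, {71, 100} };
    private static final int[]   WEIGHTS = { 2, 5, 2, 1 };

    static int next() {
        int digit = RND.nextInt(10);          // first random number: which ten-percent slice
        int bucket = 0, acc = WEIGHTS[0];
        while (digit >= acc) {                // map the slice onto a bucket
            bucket++;
            acc += WEIGHTS[bucket];
        }
        int lo = RANGES[bucket][0], hi = RANGES[bucket][1];
        return lo + RND.nextInt(hi - lo + 1); // second random number, uniform inside the bucket
    }
}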
Could you just write a function that sums a number of random numbers in the 1-X range and takes an average? This will tend to the normal distribution as n increases.
See:
Generate random numbers following a normal distribution in C/C++
I hacked something like the below:
class CrudeDistribution {

    final int TRIALS = 20;

    public int getAverageFromDistribution(int upperLimit) {
        return getAverageOfRandomTrials(TRIALS, upperLimit);
    }

    private int getAverageOfRandomTrials(int trials, int upperLimit) {
        double d = 0.0;
        for (int i = 0; i < trials; i++) {
            d += getRandom(upperLimit);
        }
        return (int) (d / trials);
    }

    private int getRandom(int upperLimit) {
        return (int) (Math.random() * upperLimit) + 1;
    }
}
There are libraries in Commons Math that can generate distributions based on means and standard deviations (which measure the spread), and the link below describes some algorithms that do this.
Probably a fun hour or so of hunting to find the relevant two-liner:
https://commons.apache.org/math/userguide/distribution.html
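For example, with commons-math3 the relevant two-liner might look like this (the mean 35 and standard deviation 12 are placeholder values, not anything prescribed by the library):

import org.apache.commons.math3.distribution.NormalDistribution;

public class NormalSampleExample {
    public static void main(String[] args) {
        NormalDistribution dist = new NormalDistribution(35, 12); // mean, standard deviation
        System.out.println(dist.sample());                        // one normally distributed value
    }
}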
One solution would be to do a random number between 1-100 and based on the result do another random number in the appropriate range.
1-20 -> 0-20
21-70 -> 21-50
71-90 -> 51-70
91-100 -> 71-100
Hope that makes sense.
You need to create the f(x) first.
Assuming the input values x are equiprobable (uniform), your f(x) is:
double f(double x) {
    if (x <= 20) {
        return x;                        // 20% of inputs map to 0..20
    } else if (x > 20 && x <= 70) {
        return (x - 20) / 50 * 30 + 20;  // 50% of inputs map to 20..50
    } else if (x > 70 && x <= 90) {      // remaining branches completed from the question's percentages
        return (x - 70) / 20 * 20 + 50;  // 20% of inputs map to 50..70
    } else {
        return (x - 90) / 10 * 30 + 70;  // 10% of inputs map to 70..100
    }
}
Just generate a bunch, say at least 30, of uniform random numbers between 0 and x, and take their mean. By the central limit theorem, the mean will be approximately a random number from a normal distribution centered around x/2.