Smart algorithm to randomize a Double in range but with odds - java

I use the following function to generate a random double in a specific range:
nextDouble(1.50, 7.00)
However, I've been trying to come up with an algorithm that makes the randomization more likely to generate a double close to 1.50 than to 7.00, yet I don't even know where to start. Does anything come to mind?
Java is also welcome.

You should start by discovering what probability distribution you need. Based on your requirements, and assuming that random number generations are independent, perhaps Poisson distribution is what you are looking for:
a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The number of calls received during any minute has a Poisson probability distribution with mean 3: the most likely numbers are 2 and 3 but 1 and 4 are also likely and there is a small probability of it being as low as zero and a very small probability it could be 10.
The usual probability distributions are already implemented in libraries e.g. org.apache.commons.math3.distribution.PoissonDistribution in Apache Commons Math3.
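For illustration only, here is a minimal sketch that samples Commons Math3's PoissonDistribution and squeezes the integer sample into the asker's [1.5, 7.0) range. The mean of 1.0, the step size, and the clamping are my assumptions, not something from the question:

import org.apache.commons.math3.distribution.PoissonDistribution;

public class SkewedRandomSketch {
    // Mean of 1.0 is an assumed parameter; smaller means push results closer to 1.5.
    private static final PoissonDistribution POISSON = new PoissonDistribution(1.0);

    static double nextSkewedDouble() {
        int k = POISSON.sample();                    // 0, 1, 2, ... with decreasing probability
        double value = 1.5 + k * 0.5                 // small k -> close to 1.5 (assumed step size)
                     + Math.random() * 0.5;          // spread each integer over a sub-interval
        return Math.min(value, Math.nextDown(7.0));  // clamp rare large samples into the range
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(nextSkewedDouble());
        }
    }
}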

I suggest not thinking about this problem in terms of generating a random number with an irregular probability. Instead, think about generating a random number uniformly in some range, and then mapping that range onto another one in a non-linear way.
Let's split our algorithm into 3 steps:
Generate a random number in [0, 1) range linearly (so using a standard random generator).
Map it into another [0, 1) range in a non-linear way.
Map the resulting [0, 1) into [1.5, 7) linearly.
Steps 1. and 3. are easy, the core of our algorithm is 2. We need a way to map [0, 1) into another [0, 1), but non-linearly, so e.g. 0.7 does not have to produce 0.7. Classic math helps here, we just need to look at visual representations of algebraic functions.
In your case you expect that while the input number increases from 0 to 1, the result first grows very slowly (to stay near 1.5 for a longer time), but then it speeds up. This is exactly the shape of e.g. the function y = x^2. Your resulting code could be something like:
fun generateDouble(): Double {
    val step1 = Random.nextDouble()
    val step2 = step1.pow(2.0)
    val step3 = step2 * 5.5 + 1.5
    return step3
}
or just:
fun generateDouble() = Random.nextDouble().pow(2.0) * 5.5 + 1.5
By changing the exponent to bigger numbers, the curve will be more aggressive, so it will favor 1.5 more. By making the exponent closer to 1 (e.g. 1.4), the result will be more close to linear, but still it will favor 1.5. Making the exponent smaller than 1 will start to favor 7.
You can also look at other algebraic functions with this shape, e.g. y = 2 ^ x - 1.
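For completeness, here is a minimal Java sketch of the same three-step mapping (the class and method names are mine); it also shows the alternative y = 2^x - 1 curve mentioned above:

import java.util.Random;

public class SkewedTowardsLow {
    private static final Random RANDOM = new Random();

    // Same three-step idea: uniform [0,1), non-linear remap, then rescale to [1.5, 7).
    static double generatePow(double exponent) {
        double step1 = RANDOM.nextDouble();        // uniform in [0, 1)
        double step2 = Math.pow(step1, exponent);  // e.g. exponent = 2.0 favors values near 0
        return step2 * 5.5 + 1.5;                  // stretch into [1.5, 7)
    }

    // Alternative curve with the same shape: y = 2^x - 1, which also maps [0,1) onto [0,1).
    static double generateExp() {
        double x = RANDOM.nextDouble();
        double y = Math.pow(2.0, x) - 1.0;         // grows slowly near 0, faster near 1
        return y * 5.5 + 1.5;
    }
}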

What you could do is 'correct' the random value with a factor in the direction of 1.5. You would create some sort of bias factor. Like this:
@Test
void doubleTest() {
    double origin = 1.50;
    final double fairRandom = new Random().nextDouble(origin, 7);
    System.out.println(fairRandom);

    double biasFactor = 0.9;
    final double biasedDiff = (fairRandom - origin) * biasFactor;
    double biasedRandom = origin + biasedDiff;
    System.out.println(biasedRandom);
}
The lower you set the bias factor (must be >0 & <= 1), the stronger your bias towards 1.50.

You can take a straightforward approach. As you said, you want a higher probability of getting a value closer to 1.5 than to 7.00, and you can even set that probability yourself. The midpoint of the range is (1.5+7)/2 = 4.25.
So let's say I want a 70% probability that the random value will be closer to 1.5 and a 30% probability closer to 7.
double finalResult;
double mid = (1.5 + 7) / 2;
double p = nextDouble(0, 100);
if (p <= 70) finalResult = nextDouble(1.5, mid);
else finalResult = nextDouble(mid, 7);
Here, the final result has a 70% chance of being closer to 1.5 than to 7.
As you did not specify the 70% probability, you can even make it random.
You just have to generate nextDouble(50, 100), which will give you a value greater than or equal to 50 and less than 100, which you can then use as the probability in the calculation above.

I missed that I am using the same solution strategy as in the reply by Nafiul Alam Fuji. But since I have already formulated my answer, I post it anyway.
One way is to split the range into two subranges, say nextDouble(1.50, 4.25) and nextDouble(4.25, 7.0). You select one of the subranges by generating a random number between 0.0 and 1.0 using nextDouble() and comparing it to a threshold K. If the random number is less than K, you do nextDouble(1.50, 4.25). Otherwise nextDouble(4.25, 7.0).
Now if K=0.5, it is like doing nextDouble(1.50, 7). But by increasing K, you will do nextDouble(1.50, 4.25) more often and favor it over nextDouble(4.25, 7.0). It is like flipping an unfair coin where K determines the extent of the cheating.
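A minimal Java sketch of this split (the method name is mine; it assumes Java 17's bounded Random.nextDouble(origin, bound), otherwise compute origin + nextDouble() * (bound - origin)):

import java.util.Random;

public class SplitRangeRandom {
    private static final Random RANDOM = new Random();

    // K > 0.5 favors the lower subrange [1.5, 4.25); K = 0.5 behaves like nextDouble(1.5, 7).
    static double nextBiasedDouble(double k) {
        if (RANDOM.nextDouble() < k) {
            return RANDOM.nextDouble(1.50, 4.25);   // Java 17+: bounded nextDouble
        } else {
            return RANDOM.nextDouble(4.25, 7.00);
        }
    }
}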


math question about random (x) and random() % x - Java [duplicate]

This question already has answers here:
Why do people say there is modulo bias when using a random number generator?
So my question is about Java, but it could be in any programming language.
there is this declaration :
Random rnd = new Random();
We want to get a random number in the range 0 to x.
I want to know if there is any mathematical difference between the following:
rnd.nextInt() % x;
and
rnd.nextInt(x)
The main question is: is one of these solutions more random than the other? Is one more appropriate or "correct" than the other? If they are equivalent, I would be happy to see the mathematical proof.
Welcome to "mathematical insight" with "MS Paint".
So, from a statistical standpoint, it would depend on the distribution of the numbers being generated. First of all, we'll treat the probability of any one number coming up as an independent event (aka discarding the seed, which RNG, etc). Following that, a modulus simply takes a range of numbers (e.g. a from N, where 0 <= a < N) and subdivides them based on the divisor (the x in a % x). While the numbers are technically from a discrete population (integers), the range of integers for a probability mass function would be so large that it'd end up looking like a continuous graph anyhow. So let's consider the probability distribution function for a range of numbers.
If your random number generator doesn't generate with a uniform distribution across the range of numbers (aka, any number is as likely to come up as another number), then modulo would (potentially) be breaking up the results of a non-uniform distribution. When you consider the individual integers in those ranges as discrete (and individual) outcomes, the probability of any number i (0 <= i < x) being the result is the sum of the individual probabilities (i_1 + i_2 + ... + i_(N/x)) of the numbers that map to it. To think of it another way, if we overlaid the subdivisions of the ranges, it's plain to see that in non-symmetric distributions, it's much more likely that a modulo would not result in equally likely outcomes.
Remember, the likelihood of an outcome i in the graph above would be obtained by adding the likelihoods of the individual numbers (i_1, ..., i_(N/x)) in the range N that could result in i. For further clarity, if your range N doesn't divide evenly by the modular divisor x, there will always be some amount of numbers, N % x of them, that have one additional integer that could produce their result. This means that most modulus divisors that aren't a power of 2 (and similarly, ranges that are not a multiple of their divisor) could be skewed towards their lower results, regardless of having a uniform distribution.
So to summarize the point, Random#nextInt(int bound) takes all of these things (and more!) into consideration, and will consistently produce an outcome with uniform probability across the range of bound. Random#nextInt() % bound is only a halfway step that works in some specific scenarios. To your teacher's point, I would argue it's more likely you'll see some specific subset of numbers when using the modulus approach, not less.
new Random(x) just creates the Random object with the given seed; it does not itself yield a random value.
I presume you are asking what the difference is between nextInt() % x and nextInt(x).
The difference is as follows.
nextInt(x)
nextInt(x) yields a random number n where 0 ≤ n < x, evenly distributed.
nextInt() % x
nextInt() % x yields a random number in the full integer range¹, and then applies modulo x. The full integer range includes negative numbers, so the result could also be a negative number. In other words, the range is −x < n < x.
Furthermore, in the vast majority of cases the distribution is not even. nextInt() has 2^32 possibilities, but, for simplicity's sake, let's assume it has 2^4 = 16 possibilities, and we choose x to be less than 16. Let's assume that x is 10.
All possibilities are 0, 1, 2, …, 14, 15. After applying the modulo 10, the results are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5. That means that some numbers have a greater likelihood of occurring than others. It also means that the chance of some numbers occurring twice has increased.
As we see, nextInt() % x has two problems:
Range is not as required.
Uneven distribution.
So you should definitely use nextInt(int bound) here. If the requirement is to get only unique numbers, you must exclude the numbers already drawn from the number generator. See also Generating Unique Random Numbers in Java.
1 According to the Javadoc.
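If you want to see both effects empirically, a small (hypothetical) experiment like the following counts the outcomes of both approaches; for x = 10 the unevenness from splitting 2^32 values is tiny, but the negative results from nextInt() % x show up immediately:

import java.util.Random;

public class ModuloBiasDemo {
    public static void main(String[] args) {
        Random rnd = new Random();
        int x = 10;
        int negatives = 0;
        int[] boundedCounts = new int[x];
        int[] moduloCounts = new int[x];

        for (int i = 0; i < 1_000_000; i++) {
            boundedCounts[rnd.nextInt(x)]++;       // always in [0, x), evenly distributed

            int m = rnd.nextInt() % x;             // in (-x, x), unevenly distributed
            if (m < 0) {
                negatives++;                       // negative results are possible here
            } else {
                moduloCounts[m]++;
            }
        }

        System.out.println("negative results from nextInt() % x: " + negatives);
        for (int v = 0; v < x; v++) {
            System.out.println(v + ": nextInt(x)=" + boundedCounts[v]
                    + "  nextInt()%x=" + moduloCounts[v]);
        }
    }
}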

Do an action with some probability in java

In Java, I am trying to do an action with a probability p. p is a float variable in my code. I came up with this way of doing it:
if (new Random().nextFloat() < p)
    do action
I wanted to confirm if this is the correct way of doing it.
There is a TL;DR at the end.
From javadocs for nextFloat() (emphasis by me):
public float nextFloat()
Returns the next pseudorandom, uniformly distributed float value
between 0.0 and 1.0 from this random number generator's sequence.
If you understand what uniform distribution is, knowing this about nextFloat() is going to be enough for you. Yet I am going to explain a little about uniform distribution.
In a uniform distribution U(a,b), each number in the interval [a,b] is equally probable, and all sub-intervals of the same length within [a,b] have equal probability.
In the figure, on the left is the PDF, and on the right the CDF for uniform distribution.
For the uniform distribution U(0,1), the probability of getting a number less than or equal to n, P(x <= n), is equal to n itself (look at the right graph, which is the cumulative distribution function of the uniform distribution). That is, P(x <= 0.5) = 0.5 and P(x <= 0.9) = 0.9. You can learn more about the uniform distribution from any good statistics book, or some googling.
Fitting to your situation:
Now, probability of getting a number less than or equal to p generated using nextFloat() is equal to p, as nextFloat() returns uniformly distributed number. So, to make an action happen with a probability equal to p all you have to do is:
if (condition that is true with a probability p) {
    do action
}
From what is discussed about nextFloat() and uniform distribution, it turns out to be:
if (randObj.nextFloat() <= p) {
    do action
}
Conclusion:
What you did is almost the right way to do what you intended. Just adding the equal sign after < is all that's needed, and it doesn't hurt much to leave out the equal sign either!
P.S.: You don't need to create a new Random object each time in your conditional, you can create one, say randObj before your loop, and then invoke its nextFloat() method whenever you want to generate a random number, as I have done in my code.
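For illustration, a minimal sketch of that setup (class and method names are mine) with a single shared Random instance might look like this:

import java.util.Random;

public class ProbabilisticAction {
    private static final Random randObj = new Random(); // created once, reused for every draw

    static void maybeDoAction(float p) {
        if (randObj.nextFloat() <= p) {
            // do action
            System.out.println("action taken");
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            maybeDoAction(0.3f);   // roughly 30% of iterations should print
        }
    }
}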
Comment by pjs:
Take a look at the comment on the question by pjs, which is very important and well said. I quote:
Do not create a new Random object each time, that's not how PRNGs are
meant to be used! A single Random object provides a sequence of values
with good distributional properties. Multiple Random objects created
in rapid succession are 1) computationally expensive, and 2) may have
highly correlated initial states, thus producing highly correlated
outcomes. Random actually works best when you create a single instance
per program and keep drawing from it, unless you really really know
what you're doing and have specific reasons for using correlation
induction strategies.
TL;DR
What you did is almost the right way to do it. Just adding the equal sign after < (to make it <=) is all that's needed, and it doesn't hurt much to leave out the equal sign either!
Yes, that is correct (from a pure probability perspective). new Random().nextFloat() will generate a number between 0.0 (inclusive) and 1.0 (exclusive). So as long as your probability is a float in the range 0.0 to 1.0, this is the correct way of doing it.
You can read more of the exact nextFloat() documentation here.

Random but most likely 1 float

I want to randomize a float so that:
There is 95% chance to be about 1
There is 0.01% chance to be < 0.1 or > 1.9
It never becomes 0 or 2
Is this possible by using Random.nextFloat() several times for example?
A visual illustration of the probability:
You need to find a function f such that:
f is continuous and increasing on [0, 1]
f(0) > 0 and f(1) < 2
f(0.01) >= 0.1 and f(0.99) <= 1.9
f(x) is "about 1" for 0.025 <= x <= 0.975
And then just take f(Random.nextDouble())
For example, Math.tan(3*(x-0.5))/14.11 + 1 fits this, so for your expression I'd use:
Math.tan(3*(Random.nextDouble()-0.5))/14.11 + 1
The probability is distributed as:
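A minimal Java sketch of that expression (the class and method names are mine) could be:

import java.util.Random;

public class MostlyOneFloat {
    private static final Random RANDOM = new Random();

    // Direct translation of the expression above: values cluster tightly around 1
    // and stay strictly inside (0, 2) because tan(±1.5)/14.11 is just under ±1.
    static double nextMostlyOne() {
        double x = RANDOM.nextDouble();                   // uniform in [0, 1)
        return Math.tan(3.0 * (x - 0.5)) / 14.11 + 1.0;
    }
}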
I do not code in Java, but anyway, if I wanted to use the built-in pseudo-random generator (I usually use different approaches for this), I would do it like this:
Definitions
Let's say we have a pseudo-random generator Random.nextFloat() returning values in the range <0,1> with uniform distribution.
Create a mapping from the uniform <0,1> to your (0,2)
It would be something like:
THE 0.001 SHOULD BE 0.0001 !!! I thought it was 0.1% instead of 0.01% while drawing ...
Let's call it f(x). It can be a table (piecewise interpolation; a rough sketch of this option is given at the end of this answer), or you can construct some polynomial that matches the properties you need (Bézier, interpolation polynomials, ...).
As you can see the x axis is the probability and the y axis is the pseudo-random value (in your range). As built-in pseudo-random generators are uniform, they will generate uniformly distributed numbers between <0,1> which can be directly used as x.
To avoid the 0.0 and 2.0, either throw them away or use the interval <0.0+ulp, 2.0-ulp>, where ulp is the unit in the last place.
The graph is drawn in Paint and consists of 2x cubic BEZIER (4 control points per cubic) and a single Line.
Now just convert the ranges
So your pseudo-random value will be:
value=f(Random.nextFloat());
[Notes]
This would work better with fixed-point numbers; otherwise you need to make the curvature's order insanely high to have any effect, or use a very large amount of data to match the desired probability output.
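As a rough illustration of the "table (piecewise interpolation)" option, here is a minimal Java sketch; the control points below are made-up placeholders shaped like the curve described above, not the answer's actual values, and would need tuning to hit the exact 95% / 0.01% figures:

import java.util.Random;

public class TableMappedRandom {
    private static final Random RANDOM = new Random();

    // Hypothetical control points (x = cumulative probability, y = output value):
    // a long flat part near 1.0 and steep ends just inside 0.0 and 2.0.
    private static final double[] X = {0.0,    0.0001, 0.025, 0.5, 0.975, 0.9999, 1.0};
    private static final double[] Y = {0.0001, 0.1,    0.9,   1.0, 1.1,   1.9,    1.9999};

    static double nextMapped() {
        double x = RANDOM.nextFloat();                 // uniform in [0, 1)
        int i = 1;
        while (i < X.length - 1 && x > X[i]) i++;      // find the segment containing x
        double t = (x - X[i - 1]) / (X[i] - X[i - 1]); // linear interpolation inside the segment
        return Y[i - 1] + t * (Y[i] - Y[i - 1]);       // result stays strictly inside (0, 2)
    }
}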

Benford's Law in Java - how to make a math function into Java

I have a quick question. I am trying to make a fraud detection app in Java; the app will be primarily based on Benford's law. Benford's law is super cool: it can basically be interpreted to say that in a real financial transaction the first digit is commonly a 1, 2, or 3 and very rarely an 8 or 9. I haven't been able to translate the Benford formula into code that can run in Java.
http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what the Benford law is and how it can be used.
I know that I will have to use the java math class to be able to use a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.
Thanks so much!!
@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.
What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are used for comparing data to a known probability distribution, to determine whether the dataset is likely/unlikely to have that probability distribution.
Chi-squared is for discrete distributions, and K-S is for continuous.
For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N=1,2,... 9, iterate over the dataset to check the first digit to count # of samples for each of the 9 non-zero digits (or first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected count E[N].
For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:
E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] = 9.6910
E[5] = 7.9181
E[6] = 6.6946
E[7] = 5.7992
E[8] = 5.1152
E[9] = 4.5757
Then compute Χ² = sum((H[k]-E[k])^2/E[k]), and compare to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s=0 and p = s+1 = 1, and the # of bins n is 9, so the # of degrees of freedom = n-p = 8*.) Then you go to your handy-dandy chi-squared table and see if the numbers look ok. For 8 degrees of freedom the confidence levels look like this:
Χ² > 13.362: 10% chance the dataset still matches Benford's Law
Χ² > 15.507: 5% chance the dataset still matches Benford's Law
Χ² > 17.535: 2.5% chance the dataset still matches Benford's Law
Χ² > 20.090: 1% chance the dataset still matches Benford's Law
Χ² > 26.125: 0.1% chance the dataset still matches Benford's Law
Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ² = 0.5585. That's very close to the expected distribution. (maybe even too close!)
Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ² = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.
Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.
Looks like Apache Commons Math has a Java implementation of a chi-squared test:
ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)
*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.
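For illustration, the Χ² statistic for the example histogram above can also be computed directly in plain Java (a minimal sketch; the Apache Commons class mentioned above additionally gives you a p-value):

public class BenfordChiSquared {
    public static void main(String[] args) {
        long[] observed = {29, 17, 12, 10, 8, 7, 6, 5, 6};   // the example histogram H above
        long total = 0;
        for (long h : observed) total += h;                  // 100 samples

        double chiSquared = 0.0;
        for (int d = 1; d <= 9; d++) {
            double expected = total * Math.log10(1.0 + 1.0 / d); // E[d] from Benford's Law
            double diff = observed[d - 1] - expected;
            chiSquared += diff * diff / expected;
        }
        // ~0.56 for this histogram; compare against the chi-squared table for 8 degrees of freedom
        System.out.println("Chi-squared = " + chiSquared);
    }
}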
Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:
You compute the cumulative distribution function (CDF) for your probability distribution.
You compute an empirical cumulative distribution function (ECDF), which is easily obtained by putting your dataset in sorted order.
You find D = (approximately) the maximum vertical distance between the two curves.
Let's handle these more in depth for Benford's Law.
CDF for Benford's Law: this is just C = log10(x), where x is in the interval [1,10), i.e. including 1 but excluding 10. This can be easily seen if you look at the generalized form of Benford's Law, and instead of writing it log(1+1/n), writing it as log(n+1) - log(n) -- in other words, to get the probability of each bin, they're subtracting successive differences of log(n), so log(n) must be the CDF.
ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that seems to not lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF is the number of datapoints <= x for any valid x.
Calculate the maximum difference D = max(d[k]) over all k, where d[k] = max(CDF(y[k]) - (k-1)/N, k/N - CDF(y[k])).
Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then ECDF is represented by the sorted array [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:
D = max(
log10(1.99) - 0/5, 1/5 - log10(1.99),
log10(2.83) - 1/5, 2/5 - log10(2.83),
log10(3.02) - 2/5, 3/5 - log10(3.02),
log10(4.70) - 3/5, 4/5 - log10(4.70),
log10(6.10) - 4/5, 5/5 - log10(6.10)
)
which = 0.2988 (=log10(1.99) - 0).
Finally you have to use the D statistic -- I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1-cdf(D) which tells you the probability that D would be greater than or equal to the value you calculate: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.
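A minimal plain-Java sketch of steps 1-3 for the example dataset above (names are mine; zeros would need to be filtered out first, as noted):

import java.util.Arrays;

public class BenfordKolmogorovSmirnov {
    public static void main(String[] args) {
        double[] data = {3.02, 1.99, 28.3, 47, 0.61};    // the example dataset above
        int n = data.length;

        // Step 2: reduce each value to its mantissa in [1, 10) (sign dropped, exponent set to 0)
        double[] mantissa = new double[n];
        for (int i = 0; i < n; i++) {
            double m = Math.abs(data[i]);
            while (m >= 10) m /= 10;
            while (m < 1)  m *= 10;
            mantissa[i] = m;
        }
        Arrays.sort(mantissa);                           // the sorted values define the ECDF

        // Step 3: D = max distance between the ECDF steps and the Benford CDF log10(x)
        double d = 0.0;
        for (int k = 1; k <= n; k++) {
            double cdf = Math.log10(mantissa[k - 1]);
            d = Math.max(d, Math.max(cdf - (k - 1.0) / n, (double) k / n - cdf));
        }
        System.out.println("D = " + d);                  // ~0.2988 for this dataset
    }
}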
If I understand correctly, you want the Benford formula in Java syntax?
public static double probability(int i) {
return Math.log(1+(1/(double) i))/Math.log(10);
}
No explicit import is needed here: java.lang.Math is available automatically, since java.lang is imported implicitly.
I find it suspicious no one answered this yet.... >_>
I think what you are looking for is something like this:
// Sums the generalized Benford probabilities for a given digit at a given position:
// log10(1 + 1/(10*i + digit)) over the possible leading prefixes i.
double answer = 0;
for (int i = (int) Math.pow(10, position - 1); i <= (int) Math.pow(10, position) - 1; i++) {
    answer += Math.log(1 + (1 / (i * 10 + (double) digit)));
}
answer *= 1 / Math.log(10);

Compute weighted averages for large numbers

I'm trying to get the weighted average of a few numbers. Basically I have:
Price - 134.42
Quantity - 15236545
There can be as few as one or two or as many as fifty or sixty pairs of prices and quantities. I need to figure out the weighted average of the price. Basically, the weighted average should give very little weight to pairs like
Price - 100000000.00
Quantity - 3
and more to the pair above.
The formula I currently have is:
((price)(quantity) + (price)(quantity) + ...)/totalQuantity
So far I have this done:
double optimalPrice = 0;
int totalQuantity = 0;
double rolling = 0;
System.out.println(rolling);
Iterator it = orders.entrySet().iterator();
while (it.hasNext()) {
    System.out.println("inside");
    Map.Entry order = (Map.Entry) it.next();
    double price = (Double) order.getKey();
    int quantity = (Integer) order.getValue();
    System.out.println(price + " " + quantity);
    rolling += price * quantity;
    totalQuantity += quantity;
    System.out.println(rolling);
}
System.out.println(rolling);
return rolling/totalQuantity;
The problem is I very quickly max out the "rolling" variable.
How can I actually get my weighted average?
A double can hold a pretty large number (about 1.7 x 10^308, according to the docs), but you probably shouldn't use it for values where exact precision is required (such as monetary values).
Check out the BigDecimal class instead. This question on SO talks about it in more detail.
One solution is to use java.math.BigInteger for both rolling and totalQuantity, and only divide them at the end. This has better numeric stability, as you only have a single floating-point division at the end and everything else is integer operations.
BigInteger is basically unbounded so you shouldn't run into any overflows.
EDIT: Sorry, only upon re-reading did I notice that your price is a double anyway. Maybe it's worth circumventing this by multiplying it by 100 and converting to BigInteger - since I see in your example it has precisely 2 digits right of the decimal point - and then dividing by 100 at the end, although it's a bit of a hack.
For maximum flexibility, use BigDecimal for rolling, and BigInteger for totalQuantity. After dividing (note, you have it backwards; it should be rolling / totalQuantity), you can either return a BigDecimal, or use doubleValue at a loss of precision.
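A minimal sketch of that suggestion, assuming the same Map<Double, Integer> order structure as in the question:

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.util.Map;

public class WeightedAverage {
    // Accumulate in arbitrary precision, divide once at the very end.
    static double weightedAverage(Map<Double, Integer> orders) {
        BigDecimal rolling = BigDecimal.ZERO;
        BigInteger totalQuantity = BigInteger.ZERO;

        for (Map.Entry<Double, Integer> order : orders.entrySet()) {
            BigDecimal price = BigDecimal.valueOf(order.getKey());
            BigInteger quantity = BigInteger.valueOf(order.getValue());
            rolling = rolling.add(price.multiply(new BigDecimal(quantity)));
            totalQuantity = totalQuantity.add(quantity);
        }
        // rolling / totalQuantity, rounded back down to double precision at the end
        return rolling.divide(new BigDecimal(totalQuantity), MathContext.DECIMAL64).doubleValue();
    }
}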
At any given point, you have recorded both the total value ax + by + cz + ... = pq and the total weight a + b + c + ... = p. Knowing both then gives you the average value pq/p = q. The problem is that pq and p are large sums that overflow, even though you just want the moderately sized q.
The next step adds, for example, a weight of r and a value s. You want to find the new sum (pq + rs) / (p + r) by using only the value of q, which can only happen if p and pq somehow "annihilate" by being in the numerator and denominator of the same fraction. That's impossible, as I'll show.
The value that you need to add in this iteration is, naturally,
(pq + rs) / (p + r) - q
Which can't be simplified to a point where p*q and p disappear. You can also find
(pq + rs) / q(p + r)
the factor by which you'd multiply q in order to get the next average; but again, pq and p remain. So there's no clever solution.
Others have mentioned arbitrary-precision variables, and that's a good solution here. The size of p and pq grows linearly with the number of entries, and the memory usage and calculation speed of integers/floats grows logarithmically with the size of the values. So performance is O(log(n)), unlike the disaster it would be if p were somehow the product of many numbers.
First, I don't see how you could be "maxing out" the rolling variable. As @Ash points out, it can represent values up to about 1.7 x 10^308. The only possibility I can think of is that you have some bad values in your input. (Perhaps the real problem is that you are losing precision ...)
Second, your use of a Map as to represent orders is strange and probably broken. The way you are currently using it, you cannot represent orders involving two or more items with the same price.
Your final result is just a weighted average of prices, so presumably you don't need to follow the rules used when calculating account balances, etc. If I am correct about the above, then you don't need to use BigDecimal; double will suffice.
The problem of overflow can be solved by storing a "running average" and updating it with each new entry. Namely, let
a_n = (sum_{i=1}^n x_i * w_i) / (sum_{i=1}^n w_i)
for n = 1, ..., N. You start with a_1 = x_1 and then add
d_n := a_{n+1} - a_n
to it. The formula for d_n is
d_n = w_{n+1} * (x_{n+1} - a_n) / W_{n+1}
where W_n := sum_{i=1}^n w_i. You need to keep track of W_n, but this problem can be solved by storing it as a double (it will be OK, as we're only interested in the average). You can also normalize the weights: if you know that all your weights are multiples of 1000, just divide them by 1000.
To get additional accuracy, you can use compensated summation.
Preemptive explanation: it is OK to use floating point arithmetic here. double has relative precision of 2E-16. The OP is averaging positive numbers, so there will be no cancellation error. What the proponents of arbitrary precision arithmetic don't tell you is that, leaving aside rounding rules, in the cases when it does give you lots of additional precision over IEEE754 floating point arithmetic, this will come at significant memory and performance cost. Floating point arithmetic was designed by very smart people (Prof. Kahan, among others), and if there was a way of cheaply increasing arithmetic precision over what is offered by floating point, they'd do it.
Disclaimer: if your weights are completely crazy (one is 1, another is 10000000), then I am not 100% sure if you will get satisfying accuracy, but you can test it on some example when you know what the answer should be.
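A minimal Java sketch of this running update (names are mine); starting the average at 0 makes the first call reduce to a_1 = x_1:

public class RunningWeightedAverage {
    private double average = 0.0;      // a_n, the running weighted average
    private double totalWeight = 0.0;  // W_n, the running sum of weights

    // a_{n+1} = a_n + w_{n+1} * (x_{n+1} - a_n) / W_{n+1}
    void add(double value, double weight) {
        totalWeight += weight;
        average += weight * (value - average) / totalWeight;
    }

    double get() {
        return average;
    }
}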
Do two loops: compute totalQuantity first in the first loop. Then in the second loop accumulate price * (quantity / totalQuantity).
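A minimal sketch of the two-pass version (names are mine; note the cast to double so the quantity ratio isn't truncated by integer division):

public class TwoPassWeightedAverage {
    // Two passes: totalQuantity first, then accumulate already-normalized contributions,
    // so the running sum never grows beyond the magnitude of the average itself.
    static double weightedAverage(double[] prices, int[] quantities) {
        long totalQuantity = 0;
        for (int q : quantities) totalQuantity += q;

        double average = 0.0;
        for (int i = 0; i < prices.length; i++) {
            average += prices[i] * ((double) quantities[i] / totalQuantity);
        }
        return average;
    }
}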
