I'm using Java 6's java.util.Random (64-bit Linux) to randomly decide between serving one version of a page or a second one (normal A/B testing). Technically, I initialize the class once with the default empty constructor, and it is injected into a Spring bean as a property.
Most of the time the two copies of the page are within 8% (+-) of each other, but from time to time I see deviations of up to 20 percent, e.g.:
I now have two copies that split 680 / 570. Is that considered normal?
Is there a better/faster alternative to Java's Random?
Thanks
A deviation of 20% does seem rather large, but you would need to talk to a trained statistician to find out if it is statistically anomalous.
UPDATE - and the answer is that it is not necessarily anomalous. The statistics predict that you would get an outlier like this roughly 0.3% of the time.
It is certainly plausible for a result like this to be caused by the random number generator. The Random class uses a simple "linear congruential" algorithm, and this class of algorithms is strongly auto-correlated. Depending on how you use the random numbers, this could lead to anomalies at the application level.
If this is the cause of your problem, then you could try replacing it with a crypto-strength random number generator. See the javadocs for SecureRandom. SecureRandom is more expensive than Random, but it is unlikely that this will make any difference in your use-case.
On the other hand, if these outliers are actually happening at roughly the rate predicted by the theory, changing the random number generator shouldn't make any difference.
If these outliers are really troublesome, then you need to take a different approach. Instead of generating N random choices, generate a list of false / true with exactly the required ratio, and then shuffle the list; e.g. using Collections.shuffle.
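A minimal sketch of that approach (the class name, block size, and method name are illustrative, not from any library):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Serves A and B in an exact 50/50 split per block of 1000 requests. */
public class BalancedAbChooser {
    private final List<Boolean> block = new ArrayList<>();
    private int index = 0;

    private void refill() {
        block.clear();
        for (int i = 0; i < 500; i++) block.add(Boolean.TRUE);   // version A
        for (int i = 0; i < 500; i++) block.add(Boolean.FALSE);  // version B
        Collections.shuffle(block);  // random order, but exact counts
        index = 0;
    }

    public synchronized boolean serveVersionA() {
        if (index >= block.size()) refill();  // also handles the first call
        return block.get(index++);
    }
}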
I believe this is fairly normal, since the class is meant to generate random sequences. If you want the pattern to repeat after a certain interval, you may want to pass a specific seed value to the constructor and reset the generator with the same seed after that interval.
For example, after every 100/500/n calls to Random.next..., reset the seed to the old value using the Random.setSeed(long seed) method, as sketched below.
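A minimal sketch (the seed value 42 is arbitrary):

import java.util.Random;

public class ReseedDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42L);
        for (int i = 0; i < 5; i++) System.out.print(rnd.nextBoolean() + " ");
        System.out.println();
        rnd.setSeed(42L);  // reset: the next five values repeat the first five exactly
        for (int i = 0; i < 5; i++) System.out.print(rnd.nextBoolean() + " ");
    }
}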
Repeated calls to java.util.Random.nextBoolean() produce a standard binomial distribution, which has standard deviation sqrt(n*p*(1-p)), with p = 0.5.
So if you do 900 iterations, the standard deviation is sqrt(900*.5*.5) = 15, so most of the time the count would fall within one standard deviation of the mean of 450, i.e. in the range 435 - 465.
However, it is pseudo-random, and has a limited cycle of numbers it will go through before starting over. So if you have enough iterations, the actual deviation will be much smaller than the theoretical one. Java uses the formula seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1). You could write a different formula with smaller numbers to purposely obtain a smaller deviation, which would make it a worse random number generator, but better fitted for your purpose.
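To see those numbers empirically, here is a quick sketch (assuming plain java.util.Random) that repeats the 900-flip experiment and prints how far each run lands from the mean of 450:

import java.util.Random;

public class BinomialSpreadDemo {
    public static void main(String[] args) {
        Random rnd = new Random();
        for (int run = 0; run < 10; run++) {
            int heads = 0;
            for (int i = 0; i < 900; i++) {
                if (rnd.nextBoolean()) heads++;
            }
            // Expected: mean 450, standard deviation sqrt(900 * 0.5 * 0.5) = 15
            System.out.println("heads=" + heads + "  deviation=" + (heads - 450));
        }
    }
}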
You could, for example, create a list with 5 trues and 5 falses in it, and use Collections.shuffle to randomize the list. Then iterate over it sequentially. After 10 iterations, re-shuffle the list and start from the beginning. That way you'll never deviate by more than 5.
See http://en.wikipedia.org/wiki/Linear_congruential_generator for the mathematics.
I'm experimenting with AI for an extremely simple simulation game.
The game will contain people (instances of objects with random properties) who have a set amount of money to spend.
I'd like the distribution of "wealth" to be statistically valid.
How can I generate a random number (money) that adheres to a given mean and standard deviation (e.g. mean: 50, standard deviation: 10), whereby a value closer to the mean is more likely to be generated?
I think you're focusing on the wrong end of the problem. The first thing you need to do is identify the distribution you want to use to model wealth. A normal distribution with a mean of 50 and standard deviation of 10 nominally meets your needs, but so does a uniform distribution in the range [32.67949, 67.32051]. There are lots of statistical distributions that can have the same mean and standard deviation but which have completely different shapes, and it is the shape that will determine the validity of your distribution.
Income and wealth turn out to have very skewed distributions - they are bounded below by zero, while a few people have such large amounts compared to the rest of us that they drag the mean upward by quite a noticeable amount. Consequently, you don't want a naive distribution choice such as uniform or Gaussian, or anything else that is symmetric or can dip into negative territory. Using an exponential would be far more realistic, but may still not be sufficiently extreme to capture the actual wealth distribution we see in the real world.
Once you've picked a distribution, there are many software libraries or sources of info that will help you generate values from that distribution.
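If you do settle on an exponential, here is a minimal sketch of inverse-transform sampling (the mean of 50 is purely illustrative):

import java.util.Random;

public class WealthSampler {
    public static void main(String[] args) {
        Random rnd = new Random();
        double meanWealth = 50.0;  // illustrative, not a recommendation
        for (int i = 0; i < 5; i++) {
            // If U ~ Uniform(0,1), then -mean * ln(1 - U) is exponentially
            // distributed with the given mean (inverse-transform sampling).
            double wealth = -meanWealth * Math.log(1.0 - rnd.nextDouble());
            System.out.printf("%.2f%n", wealth);
        }
    }
}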
Generating random numbers is a vast topic. But since you said it's a simple simulation, here's a simple approach to get going:
Generate several (say n) random numbers uniformly distributed on (0, 1). The built-in function Math.random can supply those numbers.
Add up those numbers. The sum has a distribution which is approximately normal, with mean = n/2 and standard deviation = sqrt(n)/sqrt(12). So if you subtract n/2, and then divide by sqrt(n)/sqrt(12), you'll have something which is approximately normal with mean 0 and standard deviation 1. Obviously if you pick n = 12 all you have to do is subtract 6 from the sum and you're done.
Now to get any other mean and standard deviation, just multiply by the standard deviation you want, and add the mean you want.
There are many other ways to go about it, but this is perhaps the simplest. I assume that's OK given your description of the problem.
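A sketch of the n = 12 shortcut described above, including the scaling step (note that java.util.Random.nextGaussian() is the more direct built-in route):

import java.util.Random;

public class ApproxGaussian {
    /** Sum of 12 uniforms minus 6: approximately N(0, 1). */
    static double approxStandardNormal(Random rnd) {
        double sum = 0.0;
        for (int i = 0; i < 12; i++) sum += rnd.nextDouble();
        return sum - 6.0;  // mean 12 * 0.5 = 6, sd sqrt(12 / 12) = 1
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        double mean = 50.0, sd = 10.0;
        // Multiply by the desired sd, then add the desired mean.
        System.out.println(mean + sd * approxStandardNormal(rnd));
        // Built-in alternative: mean + sd * rnd.nextGaussian()
    }
}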
Let's assume I have a reliable source of truly random numbers, but it is very slow: it only gives me a few hundred numbers every couple of hours.
Since I need far more than that, I was thinking of using those few precious truly random numbers (TRNs) as seeds for java.util.Random (or scala.util.Random), and I would always pick a new one to seed the generation of the next random numbers.
So I guess my questions are:
Can the numbers I generate from those Random instances in Java be considered truly random, since the seed is truly random?
Is there still a condition that is not met for true randomness?
If I keep on adding levels at what point will randomness be lost?
Or (as I thought when I came up with this) is the output truly random as long as the stream of seeds is?
I am assuming that nobody has intercepted the stream of seeds, but I do not plan to use those numbers for security purposes.
For a pseudo-random generator like java.util.Random, the next generated number in the sequence becomes predictable given only a few numbers from the sequence, so you will lose your "true randomness" very fast. Better to use one of the generators provided by java.security.SecureRandom - these are all strong generators with a VERY long sequence length, which should be pretty hard to predict.
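For example, a minimal usage sketch (the no-arg constructor self-seeds from the platform's entropy source):

import java.security.SecureRandom;

public class StrongRandomDemo {
    public static void main(String[] args) {
        SecureRandom sr = new SecureRandom();  // self-seeding, strong generator
        byte[] buf = new byte[16];
        sr.nextBytes(buf);           // 16 cryptographically strong random bytes
        long value = sr.nextLong();  // SecureRandom extends java.util.Random
        System.out.println(value);
    }
}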
Our Java Random gives uniformly spread pseudo-random numbers. That is not true randomness, which may well yield the same number five times in a row.
Furthermore, for every specific seed the same sequence is generated (intentionally). With 2^64 possible seeds this is generally irrelevant. (Note that hackers could store the first ten numbers of every sequence, thereby rapidly catching up.)
So if you use a truly random number as the seed at large intervals, you will get a uniform distribution within each interval. In effect this is not very different from not using the true randomizer at all.
Now, combining random sequences might reduce the randomness. Maybe translating the true random number to bytes and XOR-ing every new pseudo-random number with another of those bytes might give a wilder variance.
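A rough, purely illustrative sketch of that XOR-mixing idea (here SecureRandom stands in for the slow true source, and the seed constant is a placeholder):

import java.security.SecureRandom;
import java.util.Random;

public class XorMixer {
    public static void main(String[] args) {
        Random prng = new Random(123456789L);  // pretend this seed was truly random
        byte[] trueRandomBytes = new byte[16];
        new SecureRandom().nextBytes(trueRandomBytes);  // stand-in for the true source

        for (int i = 0; i < 16; i++) {
            byte prngByte = (byte) prng.nextInt(256);
            // XOR each pseudo-random byte with a byte from the true source
            System.out.printf("%02x ", (prngByte ^ trueRandomBytes[i]) & 0xff);
        }
    }
}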
Please do not take my word only - I cannot guarantee the mathematical correctness of the above. A math/algorithmic forum might give more info.
When you take out more bits than you have put in, they are certainly no longer truly random. The break point may occur even earlier if the random number generator is bad. This can be seen by considering the entropy of the sequences: the seed value determines the sequence completely, so there are at most as many sequences as seed values. If they are all distinct, the entropy of the sequences is the same as that of the seeds (which is essentially the number of seed bits, assuming the seed is truly random). For java.util.Random, the internal state is 48 bits, so at most 48 bits of entropy come out no matter how many numbers you draw.
However, if different seeds lead to the same pseudo-random sequence, the entropy of the sequences will be lower than that of the seeds. If we cut the sequences off after n bits, the entropy may be even lower.
But why care if you don't use it for security purposes? Are you sure the pseudo random numbers are not good enough for your application?
I need/want to get random (well, not entirely) numbers to use for password generation.
What I do: Currently I am generating them with SecureRandom.
I am obtaining the object with
SecureRandom sec = SecureRandom.getInstance("SHA1PRNG", "SUN");
and then seeding it like this
sec.setSeed(seed);
Target: A (preferably fast) way to create random numbers, which are cryptographically at least a safe as the SHA1PRNG SecureRandom implementation. These need to be the same on different versions of the JRE and Android.
EDIT: The seed is generated from user input.
Problem: With SecureRandom.getInstance("SHA1PRNG", "SUN"); it fails like this:
java.security.NoSuchProviderException: SUN. Omitting the , "SUN" argument produces random numbers, but those are different from the default (JRE 7) numbers.
Question: How can I achieve my Target?
"You don't want it to be predictable": I do want that, because I need the predictability so that the same preconditions result in the same output. If they are not the same, it is hard to do what the user expects from the application.
EDIT: By predictable I mean that, when knowing a single byte (or a hundred) you should not be able to predict the next, but when you know the seed, you should be able to predict the first (and all others). Maybe another word is reproducible.
If anyone knows of a more intuitive way, please tell me!
I ended up isolating the Sha1Prng from the Sun sources, which guarantees reproducibility on all versions of Java and Android. I needed to drop some important methods to ensure compatibility with Android, as Android does not have access to the nio classes...
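For comparison, here is a minimal sketch of one reproducible construction: a hash-based counter generator built on SHA-256 via MessageDigest (available on both the JRE and Android). This is an illustration of the idea, not the SHA1PRNG implementation:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Deterministic byte stream: each block is hash(seed || counter). */
public class ReproducibleGenerator {
    private final byte[] seed;
    private final MessageDigest digest;
    private long counter = 0;

    public ReproducibleGenerator(byte[] seed) throws NoSuchAlgorithmException {
        this.seed = seed.clone();
        this.digest = MessageDigest.getInstance("SHA-256");
    }

    public byte[] nextBlock() {
        digest.reset();
        digest.update(seed);
        for (int i = 7; i >= 0; i--) {
            digest.update((byte) (counter >>> (8 * i)));  // big-endian counter
        }
        counter++;
        return digest.digest();  // 32 bytes, identical for identical seeds
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] seed = "user input".getBytes(StandardCharsets.UTF_8);
        System.out.println(new ReproducibleGenerator(seed).nextBlock().length);
    }
}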
I recommend using UUID.randomUUID(), then splitting it into longs using getLeastSignificantBits() and getMostSignificantBits()
If you want predictable, they aren't random. That breaks your "Target" requirement of being "safe" and devolves into a simple shared secret between two servers.
You can get something that looks sort of random but is predictable by using the arithmetic of prime integers: build a set of integers by starting with I (some specific integer), adding the first prime number, and taking the result modulo the second prime number. (In truth the first and second numbers only have to be relatively prime, meaning they have no common prime factors other than 1, in case you call that a factor.)
If you repeat the process of adding and taking the modulo, you will get a set of numbers that you can reproduce repeatably, and it is ordered in the sense that taking any member of the set, adding the first prime, and taking the modulo by the second prime will always give the same result.
Finally, if the 1st prime number is large relative to the second one, the sequence is not easily predictable by humans and seems sort of random.
For example, 1st prime = 7, 2nd prime = 5 (Note that this shows how it works but is not useful in real life)
Start with 2. Add 7 to get 9. Modulo 5 to get 4.
4 plus 7 = 11. Modulo 5 = 1.
Sequence is 2, 4, 1, 3, 0 and then it repeats.
Now for real life generation of numbers that seem random. The relatively prime numbers are 91193 and 65536. (I chose the 2nd one because it is a power of 2 so all modulo-ed values can fit in 16 bits.)
int first = 91193;                      // large prime, relatively prime to 65536
int modByLogicalAnd = 0xFFFF;           // AND with 0xFFFF is the same as modulo 65536
int nonRandomNumber = 2345;             // starting point - use something else
for (int i = 0; i < 1000; ++i) {
    nonRandomNumber += first;           // add the prime ...
    nonRandomNumber &= modByLogicalAnd; // ... then reduce modulo 2^16
    // print it here
}
Each iteration generates 2 bytes of sort of random numbers. You can pack several of them into a buffer if you need larger random "strings".
And they are repeatable. Your user can pick the starting point and you can use any prime you want (or, in fact, any number without 2 as a factor).
BTW - Using a power of 2 as the 2nd number makes it more predictable.
Ignoring RNGs that use some physical input (random clock bits, electrical noise, etc.), all software RNGs are predictable, given the same starting conditions. They are, after all, (hopefully) deterministic computer programs.
There are some algorithms that intentionally include the physical input (by, eg, sampling the computer clock occasionally) in attempt to prevent predictability, but those are (to my knowledge) the exception.
So any "conventional" RNG, given the same seed and implemented to the same specification, should produce the same sequence of "random" numbers. (This is why a computer RNG is more properly called a "pseudo-random number generator".)
The fact that an RNG can be seeded with a previously-used seed and reproduce a "known" sequence of numbers does not make the RNG any less secure than one where you are somehow prevented from seeding it (though it may be less secure than the fancy algorithms that reseed themselves at intervals). And the ability to do this - to reproduce the same sequence again and again - is not only extraordinarily useful in testing, it has some "real life" applications in encryption and other security applications. (In fact, an encryption algorithm is, in essence, simply a reproducible random number generator.)
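A small illustration of that reproducibility:

import java.util.Random;

public class SameSeedDemo {
    public static void main(String[] args) {
        Random a = new Random(2023L);
        Random b = new Random(2023L);
        // Same seed, same specification: the two sequences are identical.
        for (int i = 0; i < 5; i++) {
            System.out.println(a.nextInt() == b.nextInt());  // always true
        }
    }
}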
I discovered something strange with the generation of random numbers using Java's Random class.
Basically, if you create multiple Random objects using close seeds (for example between 1 and 1000), the first value generated by each generator will be almost the same, but the subsequent values look fine (I didn't investigate further).
Here are the first two doubles generated with seeds from 0 to 9:
0 0.730967787376657 0.24053641567148587
1 0.7308781907032909 0.41008081149220166
2 0.7311469360199058 0.9014476240300544
3 0.731057369148862 0.07099203475193139
4 0.7306094602878371 0.9187140138555101
5 0.730519863614471 0.08825840967622589
6 0.7307886238322471 0.5796252073129174
7 0.7306990420600421 0.7491696031336331
8 0.7302511331990172 0.5968915822372118
9 0.7301615514268123 0.7664359929590888
And from 991 to 1000 :
991 0.7142160704801332 0.9453385235522973
992 0.7109015598097105 0.21848118381994108
993 0.7108119780375055 0.38802559454181795
994 0.7110807233541204 0.8793923921785096
995 0.7109911564830766 0.048936787999225295
996 0.7105432327208906 0.896658767102804
997 0.7104536509486856 0.0662031629235198
998 0.7107223962653005 0.5575699754613725
999 0.7106328293942568 0.7271143712820883
1000 0.7101849056320707 0.574836350385667
And here is a figure showing the first value generated with seeds from 0 to 100,000:
[Figure: first random double generated, as a function of the seed.]
I searched for information about this, but I didn't see anything referring to this precise problem. I know that there are many issues with LCG algorithms, but I didn't know about this one, and I was wondering if it is a known issue.
And also, do you know if this problem affects only the first value (or the first few values), or if it is more general and using close seeds should be avoided?
Thanks.
You'd be best served by downloading and reading the Random source, as well as some papers on pseudo-random generators, but here are some of the relevant parts of the source. To begin with, there are three constant parameters that control the algorithm:
private final static long multiplier = 0x5DEECE66DL;
private final static long addend = 0xBL;
private final static long mask = (1L << 48) - 1;
The multiplier works out to approximately 2^34 and change, the mask to 2^48 - 1, and the addend is close enough to 0 to ignore for this analysis.
When you create a Random with a seed, the constructor calls setSeed:
synchronized public void setSeed(long seed) {
seed = (seed ^ multiplier) & mask;
this.seed.set(seed);
haveNextNextGaussian = false;
}
You're providing a seed pretty close to zero, so the initial seed value that gets set is dominated by the multiplier when the two are XORed together. In all your test cases with seeds close to zero, the seed that is used internally is roughly 2^34; but it's easy to see that even if you provided very large seed numbers, similar user-provided seeds will yield similar internal seeds.
The final piece is the next(int) method, which actually generates a random integer of the requested length based on the current seed, and then updates the seed:
protected int next(int bits) {
long oldseed, nextseed;
AtomicLong seed = this.seed;
do {
oldseed = seed.get();
nextseed = (oldseed * multiplier + addend) & mask;
} while (!seed.compareAndSet(oldseed, nextseed));
return (int)(nextseed >>> (48 - bits));
}
This is called a 'linear congruential' pseudo-random generator, meaning that it generates each successive seed by multiplying the current seed by a constant multiplier and then adding a constant addend (and then masking to keep the lower 48 bits, in this case). The quality of the generator is determined by the choice of multiplier and addend, but the output from all such generators can be easily predicted based on the current input, and they have a set period before they repeat themselves (hence the recommendation not to use them in sensitive applications).
The reason you're seeing similar initial output from nextDouble given similar seeds is that, because the computation of the next integer only involves a multiplication and addition, the magnitude of the next integer is not much affected by differences in the lower bits. Calculation of the next double involves computing a large integer based on the seed and dividing it by another (constant) large integer, and the magnitude of the result is mostly affected by the magnitude of the integer.
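For reference, nextDouble is assembled from two such calls to next. In recent OpenJDK sources the division is expressed as a multiplication by 2^-53 (older versions divide by (double)(1L << 53) instead):

public double nextDouble() {
    // 26 high bits and 27 low bits form a 53-bit integer,
    // scaled into [0, 1) by multiplying by 2^-53
    return (((long) next(26) << 27) + next(27)) * 0x1.0p-53;
}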
Repeated calculations of the next seed will magnify the differences in the lower bits of the seed because of the repeated multiplication by the constant multiplier, and because the 48-bit mask throws out the highest bits each time, until eventually you see what looks like an even spread.
I wouldn't have called this an "issue".
And also, do you know if this problem affects only the first value (or the first few values), or if it is more general and using close seeds should be avoided?
Correlation patterns between successive numbers is a common problem with non-crypto PRNGs, and this is just one manifestation. The correlation (strictly auto-correlation) is inherent in the mathematics underlying the algorithm(s). If you want to understand that, you should probably start by reading the relevant part of Knuth's Art of Computer Programming Chapter 3.
If you need non-predictability you should use a (true) random seed for Random ... or let the system pick a "pretty random" one for you; e.g. using the no-args constructor. Or better still, use a real random number source or a crypto-quality PRNG instead of Random.
For the record:
The javadoc (Java 7) does not specify how Random() seeds itself.
The implementation of Random() in Java 7 on Linux is seeded from the nanosecond clock, XORed with a 'uniquifier' sequence. The 'uniquifier' sequence is an LCG that uses a different multiplier, and whose state is static. This is intended to avoid auto-correlation of the seeds ...
This is a fairly typical behaviour for pseudo-random seeds - they aren't required to provide completely different random sequences, they only provide a guarantee that you can get the same sequence again if you use the same seed.
The behaviour happens because of the mathematical form of the PRNG - the Java one uses a linear congruential generator, so you are just seeing the results running the seed through one round of the linear congruential generator. This isn't enough to completely mix up all the bit patterns, hence you see similar results for similar seeds.
Your best strategy is probably just to use very different seeds - one option would be to obtain these by hashing the seed values that you are currently using.
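A sketch of that hashing step, here using the 64-bit finalizer from MurmurHash3 as the mix function (any good integer hash would do):

import java.util.Random;

public class MixedSeedRandom {
    /** MurmurHash3 fmix64: spreads nearby inputs across all 64 bits. */
    static long mix64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    public static void main(String[] args) {
        for (long seed = 0; seed < 5; seed++) {
            Random rnd = new Random(mix64(seed));  // decorrelated seeds
            System.out.println(seed + " -> " + rnd.nextDouble());
        }
    }
}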
By making the seeds themselves more random (for instance, by applying some mathematical functions to System.currentTimeMillis() or System.nanoTime() for seed generation) you can get better random results. You can also look here for more information.
I know that randomized UUIDs have a very, very, very low probability for collision in theory, but I am wondering, in practice, how good Java's randomUUID() is in terms of not having collision? Does anybody have any experience to share?
UUID uses java.security.SecureRandom, which is supposed to be "cryptographically strong". While the actual implementation is not specified and can vary between JVMs (meaning that any concrete statements made are valid only for one specific JVM), it does mandate that the output must pass a statistical random number generator test.
It's always possible for an implementation to contain subtle bugs that ruin all this (see the OpenSSH key generation bug), but I don't think there's any concrete reason to worry about the randomness of Java's UUIDs.
Wikipedia has a very good answer
http://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
...
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes.
...
Thus, for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
Does anybody have any experience to share?
There are 2^122 possible values for a type-4 UUID. (The spec says that you lose 2 bits for the type, and a further 4 bits for a version number.)
Assuming that you were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in your lifetime would be vanishingly small. And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated¹!
The chances that anyone has experienced (i.e. actually noticed) a duplicate in real life are even smaller than vanishingly small ... because of the practical difficulty of looking for collisions.
Now of course, you will typically be using a pseudo-random number generator, not a source of truly random numbers. But I think we can be confident that if you are using a creditable provider for your cryptographic strength random numbers, then it will be cryptographic strength, and the probability of repeats will be the same as for an ideal (non-biased) random number generator.
However, if you were to use a JVM with a "broken" crypto- random number generator, all bets are off. (And that might include some of the workarounds for "shortage of entropy" problems on some systems. Or the possibility that someone has tinkered with your JRE, either on your system or upstream.)
¹ Assuming that you used "some kind of binary btree" as proposed by an anonymous commenter, you are going to need O(N log N) bits of RAM to represent N distinct UUIDs, assuming low density and random distribution of the bits. Now multiply that by 1,000,000 and the number of seconds that you are going to run the experiment for. I don't think that is practical for the length of time needed to test for collisions of a high-quality RNG. Not even with (hypothetical) clever representations.
I'm not an expert, but I'd assume that enough smart people have looked at Java's random number generator over the years. Hence, I'd also assume that random UUIDs are good. So you should really see the theoretical collision probability (which is about 1 in 3 × 10^38 across all possible UUIDs - does anybody know how this changes for random UUIDs only? Is it 1/(16*4) of the above?).
From my practical experience, I've never seen any collisions so far. I'll probably have grown an astonishingly long beard the day I get my first one ;)
At a former employer we had a unique column that contained a random uuid. We got a collision the first week after it was deployed. Sure, the odds are low but they aren't zero. That is why Log4j 2 contains UuidUtil.getTimeBasedUuid. It will generate a UUID that is unique for 8,925 years so long as you don't generate more than 10,000 UUIDs/millisecond on a single server.
The original generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. By representing a single point in space (the computer) and time (the number of intervals), the chance of a collision in values is effectively nil.
Many of the answers discuss how many UUIDs would have to be generated to reach a 50% chance of a collision. But a 50%, 25%, or even 1% chance of collision is worthless for an application where collision must be (virtually) impossible.
Do programmers routinely dismiss as "impossible" other events that can and do occur?
When we write data to a disk or memory and read it back again, we take for granted that the data are correct. We rely on the device's error correction to detect any corruption. But the chance of undetected errors is actually around 2^-50.
Wouldn't it make sense to apply a similar standard to random UUIDs? If you do, you will find that an "impossible" collision is possible in a collection of around 100 billion random UUIDs (2^36.5).
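For the record, that figure follows from the standard birthday approximation, with N = 2^122 possible type-4 UUIDs:

p(n) ≈ n^2 / (2 * N) = n^2 / 2^123

Setting p = 2^-50 gives n = sqrt(2^123 * 2^-50) = 2^36.5 ≈ 9.7 × 10^10, i.e. roughly 100 billion.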
This is an astronomical number, but applications like itemized billing in a national healthcare system, or logging high frequency sensor data on a large array of devices could definitely bump into these limits. If you are writing the next Hitchhiker's Guide to the Galaxy, don't try to assign UUIDs to each article!
I played the lottery last year, and I never won ....
but it seems that the lottery has winners ...
doc : https://www.rfc-editor.org/rfc/rfc4122
Type 1: not implemented. Collisions are possible if UUIDs are generated at the same moment; implementations can be artificially de-synchronized to work around this problem.
Type 2: I have never seen an implementation.
Type 3: MD5 hash: collisions possible (128 bits, minus 2 technical bytes).
Type 4: random: collisions possible (like the lottery). Note that the JDK 6 implementation doesn't use a "true" secure random, because the PRNG algorithm is not chosen by the developer and you can force the system to use a "poor" PRNG algorithm. So your UUID can be predictable.
Type 5: SHA-1 hash: not implemented: collisions possible (160 bits, minus 2 technical bytes).
We have been using Java's random UUIDs in our application very extensively for more than a year, and we have never come across a collision.