How good is Java's UUID.randomUUID?

I know that randomized UUIDs have a very, very, very low probability for collision in theory, but I am wondering, in practice, how good Java's randomUUID() is in terms of not having collision? Does anybody have any experience to share?

UUID uses java.security.SecureRandom, which is supposed to be "cryptographically strong". While the actual implementation is not specified and can vary between JVMs (meaning that any concrete statement is valid only for one specific JVM), the spec does mandate that the output must pass a statistical random number generator test.
It's always possible for an implementation to contain subtle bugs that ruin all this (see the OpenSSH key generation bug), but I don't think there's any concrete reason to worry about the randomness of Java's UUIDs.
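For illustration, here is a minimal sketch (not the JDK's actual code) of what randomUUID() does conceptually: draw 16 bytes from SecureRandom, then force the six version/variant bits, which is why only 122 bits are random.

import java.security.SecureRandom;
import java.util.UUID;

public class RandomUuidSketch {
    public static void main(String[] args) {
        // The standard factory, backed internally by a SecureRandom instance.
        UUID id = UUID.randomUUID();
        System.out.println(id + " version=" + id.version() + " variant=" + id.variant());

        // Conceptual equivalent: 16 random bytes, with 4 bits forced to
        // version 4 and 2 bits to the IETF variant (hence 122 random bits).
        byte[] b = new byte[16];
        new SecureRandom().nextBytes(b);
        b[6] = (byte) ((b[6] & 0x0F) | 0x40); // version 4
        b[8] = (byte) ((b[8] & 0x3F) | 0x80); // IETF variant
        long msb = 0, lsb = 0;
        for (int i = 0; i < 8; i++)  msb = (msb << 8) | (b[i] & 0xFF);
        for (int i = 8; i < 16; i++) lsb = (lsb << 8) | (b[i] & 0xFF);
        System.out.println(new UUID(msb, lsb));
    }
}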

Wikipedia has a very good answer
http://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
...
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes.
...
Thus, for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
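Both figures are easy to reproduce from the birthday approximation n(p) ≈ sqrt(2d · ln(1/(1−p))) with d = 2^122 equally likely values; a small sketch:

public class BirthdayBound {
    public static void main(String[] args) {
        // n(p) ≈ sqrt(2 * d * ln(1/(1-p))) for d equally likely values.
        double d = Math.pow(2, 122);                              // random bits in a v4 UUID
        double n50 = Math.sqrt(2 * d * Math.log(2));              // p = 0.5
        double nBillion = Math.sqrt(2 * d * -Math.log1p(-1e-9));  // p = 1e-9
        System.out.printf("50%% collision chance: ~%.2e UUIDs%n", n50);      // ~2.71e18
        System.out.printf("1-in-a-billion chance: ~%.2e UUIDs%n", nBillion); // ~1.03e14
    }
}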

Does anybody have any experience to share?
There are 2^122 possible values for a type-4 UUID. (The spec says that you lose 2 bits for the type, and a further 4 bits for a version number.)
Assuming that you were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in your lifetime would be vanishingly small. And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated [1]!
The chances that anyone has experienced (i.e. actually noticed) a duplicate in real life are even smaller than vanishingly small ... because of the practical difficulty of looking for collisions.
Now of course, you will typically be using a pseudo-random number generator, not a source of truly random numbers. But I think we can be confident that if you are using a creditable provider for your cryptographic strength random numbers, then it will be cryptographic strength, and the probability of repeats will be the same as for an ideal (non-biased) random number generator.
However, if you were to use a JVM with a "broken" crypto- random number generator, all bets are off. (And that might include some of the workarounds for "shortage of entropy" problems on some systems. Or the possibility that someone has tinkered with your JRE, either on your system or upstream.)
[1] Assuming that you used "some kind of binary btree" as proposed by an anonymous commenter, you are going to need O(N log N) bits of RAM to represent N distinct UUIDs, assuming low density and random distribution of the bits. Now multiply that by 1,000,000 and the number of seconds that you are going to run the experiment for. I don't think that is practical for the length of time needed to test for collisions of a high quality RNG, not even with (hypothetical) clever representations.
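To illustrate the footnote's point that memory, not probability, is the binding constraint, here is a naive collision hunt (a sketch; with a decent RNG it never finds a duplicate, and the heap gives out long before one becomes plausible):

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class CollisionHunt {
    public static void main(String[] args) {
        Set<UUID> seen = new HashSet<>();
        for (long i = 0; i < 1_000_000L; i++) {
            // add() returns false if the set already contains the UUID.
            if (!seen.add(UUID.randomUUID())) {
                System.out.println("Duplicate after " + i + " UUIDs");
                return;
            }
        }
        // Storing each UUID costs far more than its 16 payload bytes,
        // so scaling this up hits RAM limits long before probability does.
        System.out.println("No duplicate in 1 million UUIDs (as expected)");
    }
}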

I'm not an expert, but I'd assume that enough smart people have looked at Java's random number generator over the years. Hence, I'd also assume that random UUIDs are good. So you should really see the theoretical collision probability (which is about 1 in 3.4 × 10^38 for all possible UUIDs. Does anybody know how this changes for random UUIDs only? Is it 1/(16×4) = 1/64 of the above?)
From my practical experience, I've never seen any collisions so far. I'll probably have grown an astonishingly long beard the day I get my first one ;)

At a former employer we had a unique column that contained a random UUID. We got a collision the first week after it was deployed. Sure, the odds are low, but they aren't zero. That is why Log4j 2 contains UuidUtil.getTimeBasedUuid(). It will generate a UUID that is unique for 8,925 years so long as you don't generate more than 10,000 UUIDs per millisecond on a single server.
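For reference, a minimal sketch of the Log4j 2 utility mentioned above (assuming the log4j-core artifact is on the classpath):

import java.util.UUID;
import org.apache.logging.log4j.core.util.UuidUtil;

public class TimeBasedUuidExample {
    public static void main(String[] args) {
        // Time-based (version 1) UUID: timestamp + node, not randomness.
        UUID id = UuidUtil.getTimeBasedUuid();
        System.out.println(id + " version=" + id.version()); // version=1
    }
}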

The original generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. By representing a single point in space (the computer) and time (the number of intervals), the chance of a collision in values is effectively nil.

Many of the answers discuss how many UUIDs would have to be generated to reach a 50% chance of a collision. But a 50%, 25%, or even 1% chance of collision is worthless for an application where collision must be (virtually) impossible.
Do programmers routinely dismiss as "impossible" other events that can and do occur?
When we write data to a disk or memory and read it back again, we take for granted that the data are correct. We rely on the device's error correction to detect any corruption. But the chance of undetected errors is actually around 2^-50.
Wouldn't it make sense to apply a similar standard to random UUIDs? If you do, you will find that an "impossible" collision is possible in a collection of around 100 billion random UUIDs (2^36.5).
This is an astronomical number, but applications like itemized billing in a national healthcare system, or logging high frequency sensor data on a large array of devices could definitely bump into these limits. If you are writing the next Hitchhiker's Guide to the Galaxy, don't try to assign UUIDs to each article!
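The 2^36.5 figure follows directly from solving the birthday approximation p ≈ n²/2^123 for p = 2^-50; a quick check:

public class DiskErrorStandard {
    public static void main(String[] args) {
        // n = sqrt(2^123 * 2^-50) = 2^36.5, the collection size at which a
        // UUID collision is as likely as an undetected disk error.
        double n = Math.sqrt(Math.pow(2, 123 - 50));
        System.out.printf("~%.2e UUIDs (about 100 billion)%n", n); // ~9.7e10
    }
}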

I played the lottery last year, and I never won...
But it seems that the lottery does have winners...
doc: https://www.rfc-editor.org/rfc/rfc4122
Type 1: not implemented in the JDK. Collisions are possible if two UUIDs are generated at the same moment; implementations can artificially desynchronize generation to work around this problem.
Type 2: I have never seen an implementation.
Type 3: MD5 hash: collisions possible (128 bits minus the version/variant bits).
Type 4: random: collisions possible (like the lottery). Note that the JDK 6 implementation doesn't use a "true" secure random, because the PRNG algorithm is not chosen by the developer and you can force the system to use a "poor" PRNG algorithm. So your UUIDs could be predictable.
Type 5: SHA-1 hash: not implemented in the JDK: collisions possible (a 160-bit hash truncated to 128 bits, minus the version/variant bits).

We have been using Java's random UUIDs in our application very extensively for more than a year, and we have never come across a collision.

Related

Compact alternatives to UUID as a correlationId

I am adding correlationIDs to our public and private APIs. This is to be able to trace a request progress through logs.
UUIDs, being long strings, take up a lot of space. I need a compact alternative to UUID as a correlation ID.
It is OK if a correlationId repeats after a fixed period (say 2 months), since API requests older than that won't need to be traced.
I have considered using java.util.Random nextLong(). But it does not guarantee that it won't repeat.
Also, my understanding is that SecureRandom can pose some performance issues, and I don't need the correlationIDs to be secure.
It would be good to have other options considered.
If you can accept IDs up to 8 characters long, the number of possible IDs depends on the character set of those IDs.
For hexadecimal characters (16-character set), the number of IDs is 4294967296 or 2^32.
For A-Z, 0-9 (36-character set), the number of IDs is 2821109907456 or about 2^41.
For base64 or base64url (64-character set), the number of IDs is 281474976710656 or 2^48.
You should decide which character set to use, as well as ask yourself whether your application can check IDs for randomness and whether you can tolerate the risk of duplicate IDs (so that you won't have a problem generating them randomly using SecureRandom). Also note the following:
The risk of duplicates depends on the number of possible IDs. Roughly speaking, after your application generates the square root of all possible IDs at random, the risk of duplicates becomes non-negligible (which in the case of hexadecimal IDs, will occur after just 65536 IDs). See "Birthday problem" for a more precise statement and formulas.
If your application is distributed across multiple computers, you might choose to assign each computer a unique value to include in the ID.
If you don't care about whether your correlation IDs are secure, you might choose to create unique IDs by doing a reversible operation on sequential IDs, such as a reversible mixing function or a linear congruential generator.
You say that SecureRandom "can pose some performance issues". You won't know if it will unless you try it and measure your application's performance. Generating IDs with SecureRandom may or may not be too slow for your purposes.
For further considerations and advice, see what I write on "Unique Random Identifiers".
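As a concrete illustration of the above (a hypothetical sketch, not a recommendation for any particular character set), an 8-character A-Z/0-9 correlation ID drawn from SecureRandom gives about 2^41 possible values:

import java.security.SecureRandom;

public final class CorrelationIds {
    private static final char[] ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
    private static final SecureRandom RNG = new SecureRandom();

    // 8 characters from a 36-character set: 36^8 ≈ 2.8e12 ≈ 2^41 IDs, so by
    // the birthday rule duplicates become non-negligible after ~2^20.5 IDs.
    public static String next() {
        StringBuilder sb = new StringBuilder(8);
        for (int i = 0; i < 8; i++) {
            sb.append(ALPHABET[RNG.nextInt(ALPHABET.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(next()); // e.g. "K3TZ0Q7B"
    }
}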

Does randomUUID give a unique id?

I am trying to create session tokens for my REST API. Each time the user logs in I am creating a new token by
UUID token = UUID.randomUUID();
user.setSessionId(token.toString());
Sessions.INSTANCE.sessions.put(user.getName(), user.getSessionId());
However, I am not sure how to protect against duplicate sessionTokens.
For example: Can there be a scenario when user1 signs in and gets a token 87955dc9-d2ca-4f79-b7c8-b0223a32532a and user2 signs in and also gets a token 87955dc9-d2ca-4f79-b7c8-b0223a32532a.
Is there a better way of doing this?
If you get a UUID collision, go play the lottery next.
From Wikipedia:
Randomly generated UUIDs have 122 random bits. Out of a total of 128
bits, four bits are used for the version ('Randomly generated UUID'),
and two bits for the variant ('Leach-Salz').
With random UUIDs, the
chance of two having the same value can be calculated using
probability theory (Birthday paradox). Using the approximation
p(n) ≈ 1 − e^(−n² / (2x))
these are the probabilities of an accidental clash after calculating n UUIDs, with x = 2^122:
n = 68,719,476,736 (2^36): probability 0.0000000000000004 (4 × 10^−16)
n = 2,199,023,255,552 (2^41): probability 0.0000000000004 (4 × 10^−13)
n = 70,368,744,177,664 (2^46): probability 0.0000000004 (4 × 10^−10)
To put these numbers into perspective,
the annual risk of someone being hit by a meteorite is estimated to be
one chance in 17 billion, which means the probability is about
0.00000000006 (6 × 10^−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In
other words, only after generating 1 billion UUIDs every second for
the next 100 years, the probability of creating just one duplicate
would be about 50%. The probability of one duplicate would be about
50% if every person on earth owns 600 million UUIDs.
Since a UUID has a finite size, there is no way for it to be unique across all of space and time.
If you need a UUID that is guaranteed to be unique within any reasonable use case, you can use Log4j 2's UuidUtil.getTimeBasedUuid(). It is guaranteed to be unique for about 8,900 years so long as you generate less than 10,000 UUIDs per millisecond.
Oracle UUID document. http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html
They use this algorithm from the Internet Engineering Task Force. http://www.ietf.org/rfc/rfc4122.txt
A quote from the abstract.
A UUID is 128 bits long, and can guarantee uniqueness across
space and time.
While the abstract claims a guarantee, there are only 3.4 × 10^38 combinations. — CodeChimp
From UUID.randomUUID() Javadoc:
Static factory to retrieve a type 4 (pseudo randomly generated) UUID. The UUID is generated using a cryptographically strong pseudo random number generator.
It's random, and therefore collisions can definitely occur, as confirmed by others in comments above/below who detected collisions very early. Instead of version 4 (random-based) I would advise you to use version 1 (time-based).
Possible solutions:
1) UUID utility from Log4j
You can use the 3rd party implementation from Log4j, UuidUtil.getTimeBasedUuid(), which is based on the current timestamp, measured in units of 100 nanoseconds since October 15, 1582, concatenated with the MAC address of the device where the UUID is created. Please see package org.apache.logging.log4j.core.util in artifact log4j-core.
2) UUID utility from FasterXML
There is also a 3rd party implementation from FasterXML, Generators.timeBasedGenerator().generate(), that is based on time and MAC address, too. Please see package com.fasterxml.uuid in artifact java-uuid-generator.
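A minimal sketch of this option (assuming the java-uuid-generator artifact is on the classpath):

import java.util.UUID;
import com.fasterxml.uuid.Generators;

public class FasterXmlExample {
    public static void main(String[] args) {
        // Time- and MAC-based (version 1) UUID from FasterXML's generator.
        UUID id = Generators.timeBasedGenerator().generate();
        System.out.println(id + " version=" + id.version()); // version=1
    }
}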
3) Do it on your own
Or you can implement your own using the constructor new UUID(long mostSigBits, long leastSigBits) from core Java. Please see the following very nice explanation, Baeldung - Guide to UUID in Java, where October 15, 1582 (actually, a very famous day) is used in the implementation.
If you want to be absolutely 100% dead certain that there will be NO duplicates, just make a TokenHandler. All it needs is a synchronized method that generates a random UUID and checks it against every one created so far (not very time efficient naively, but if the token is to be used as a session ID, a good data structure is all that is needed to make this fast); if the token is unique, the handler saves it before returning it.
This is seriously overkill, though. It would be easier to just follow the suggestion of having your tokens be a combination of UUID and timestamp. If you use System.nanoTime() in addition to the UUID, I don't see there being a collision at any time in eternity.
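For completeness, a hedged sketch of the TokenHandler idea described above, using a HashSet as the "good data structure" (note the unbounded memory growth):

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public final class TokenHandler {
    private final Set<String> issued = new HashSet<>();

    // Synchronized so concurrent logins cannot race; duplicates are
    // impossible by construction because every issued token is remembered.
    public synchronized String newToken() {
        String token;
        do {
            token = UUID.randomUUID().toString();
        } while (!issued.add(token)); // add() returns false on a duplicate
        return token;
    }
}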

At what point is True randomness lost? True random number as a java.util.Random seed?

Let's assume I have a reliable source of truly random numbers, but it is very slow. It only gives me a few hundred numbers every couple of hours.
Since I need way more than that, I was thinking of using those few precious TRNs I can get as seeds for java.util.Random (or scala.util.Random), always picking a new seed to generate the next random numbers.
So I guess my questions are:
Can the numbers I generate from those Random instances in Java be considered truly random, since the seed is truly random?
Is there still a condition that is not met for true randomness?
If I keep on adding levels at what point will randomness be lost?
Or (as I thought when I came up with this) is it truly random as long as the stream of seeds is?
I am assuming that nobody has intercepted the stream of seeds, but I do not plan to use those numbers for security purposes.
For a pseudo-random generator like java.util.Random, the next generated number in the sequence becomes predictable given only a few numbers from the sequence, so you will lose your "true randomness" very fast. Better to use one of the generators provided by java.security.SecureRandom - these are all strong random generators with a VERY long period, which should be pretty hard to predict.
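A sketch of that advice (the true-random source is a hypothetical stand-in): SecureRandom.setSeed() supplements the existing internal state rather than replacing it, so folding in true entropy cannot make the output worse.

import java.security.SecureRandom;

public class ReseedSketch {
    // Hypothetical stand-in for the slow, truly random source in the question.
    static byte[] fetchTrueRandomBytes() {
        return new byte[] { 42, 7, 99, -3 }; // placeholder, not real entropy
    }

    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();
        // Mix the true entropy into the generator's internal state.
        rng.setSeed(fetchTrueRandomBytes());
        System.out.println(rng.nextInt()); // fast pseudo-random output
    }
}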
Java's Random gives uniformly spread random numbers, but that is not true randomness, which may well yield the same number five times in a row.
Furthermore, for every specific seed the same sequence is generated (intentionally). With 2^48 possible internal seed states (a long seed is reduced to 48 bits), this is in general irrelevant. (Note that hackers could store the first ten numbers of every sequence, thereby rapidly catching up.)
So if you use a truly random number as seed at large intervals, you will get a uniform distribution during that interval. In effect, not very different from not using the true randomizer.
Now, combining random sequences might reduce the randomness. Maybe translating the true random number to bytes and xor-ing every new random number with another byte might give a wilder variance.
Please do not take my word only - I cannot guarantee the mathematical correctness of the above. A math/algorithms forum might give more info.
When you take out more bits than you have put in, they are for sure no longer truly random. The break point may occur even earlier if the random number generator is bad. This can be seen by considering the entropy of the sequences: the seed value determines the sequence completely, so there are at most as many sequences as seed values. If they are all distinct, the entropy is the same as that of the seeds (which is essentially the number of seed bits, assuming the seed is truly random).
However, if different seeds lead to the same pseudo-random sequence, the entropy of the sequences will be lower than that of the seeds. If we cut off the sequences after n bits, the entropy may be even lower.
But why care if you don't use it for security purposes? Are you sure the pseudo random numbers are not good enough for your application?

How do the "random" generators in different languages (i.e. Java and C++) compare?

Despite the weird title, I wish to ask a legitimate question: which method's generated numbers are more random: Java's Random class or Math.random(), or C++'s rand()?
I've heard that PHP's rand() is quite bad, i.e. if you map its results you can clearly see a pattern; sadly, I don't know how to draw a map in C++ or Java.
Also, just out of interest, what about C#?
Both Java and C++ generate pseudo-random numbers which are either:
adequate to the task for anyone who isn't a statistician or cryptographer (a); or
woefully inadequate to those two classes of people.
In all honesty, unless you are in one of those classes, pseudo-random number generators are fine.
Java also has SecureRandom which purports to provide crypto-class non-determinism (I can't comment on the veracity of that argument) and C++ now has a much wider variety of random number generation capability than just rand() - see <random> for details.
Specific operating systems may provide sources of entropy for random number generators, such as CryptGenRandom under Windows or reading /dev/random under Linux. Alternatively, you could add entropy by using random events such as user input timing.
(a) May actually contain traces of other job types that aren't statistician or cryptographer :-)
java.util.Random (which is used internally by Math.random()) uses a linear congruential generator, which is a rather weak RNG, but enough for simple things. For important applications, one should use java.security.SecureRandom instead.
I don't think the C or C++ language specifications prescribe the algorithm to use for rand(), but most implementations use an LCG as well. C++11 has added new APIs that yield higher-quality randomness.
There is a very good document that can be found on the web, written by one of the world's leading experts in random number generators.
Here is the document
The first part of the document is a description of the tests, which you might skip unless you're really interested. From page 27, there are the results of the different tests for many generators, including Java, C++, Matlab, Mathematica, Excel, Boost, ... (they are described in the text).
It seems that Java's generator is a bit better, but neither is among the best in the world. The MT19937 from C++11 is already much better.
PHP uses a seed. If the seed is the same at two different times, the rand() function will ALWAYS output the same thing. (Which can be quite bad for tokens for example).
I don't know about C++ and Java, but there's no true randomness, which makes quality difficult to evaluate. Security mustn't rely on such functions.
I'm not aware of any language where random numbers are truly random - I'm sure such a thing exists, but generally, it's "You stick a seed in, and you get the sequence that seed gives". Which is fine if you want to make a simple 'shootem-up' game, basic poker-game, roulette simulator for home use, etc. But if you have money relying on the game being truly random (e.g., you are giving out money based on the results of certain sequences) or your secret files are relying on your random numbers, then you will definitely need some other mechanism for finding random numbers.
And there are some "true" random number generators around. They do not take a seed, so predictability based on what number(s) you got last time is low. I'm not saying it's zero, because I'm not sure you can get that even from sampling radio waves at an unused radio frequency, radioactive decay, or whatever the latest method of generating true random numbers is.

What would be considered a standard deviation boundary for Java's Random?

I'm using Java 6's Random (java.util.Random, Linux 64) to randomly decide between serving one version of a page or a second one (normal A/B testing). Technically, I initialize the class once with the default empty constructor, and it's injected into a bean (Spring) as a property.
Most of the time the copies of the pages are within 8% (±) of each other, but from time to time I see deviations of up to 20 percent, e.g.:
I now have two copies that split 680 / 570. Is that considered normal?
Is there a better/faster alternative to Java's Random?
Thanks
A deviation of 20% does seem rather large, but you would need to talk to a trained statistician to find out if it is statistically anomalous.
UPDATE - and the answer is that it is not necessarily anomalous. The statistics predict that you would get an outlier like this roughly 0.3% of the time.
It is certainly plausible for a result like this to be caused by the random number generator. The Random class uses a simple "linear congruential" algorithm, and this class of algorithms is strongly auto-correlated. Depending on how you use the random numbers, this could lead to anomalies at the application level.
If this is the cause of your problem, then you could try replacing it with a crypto-strength random number generator. See the javadocs for SecureRandom. SecureRandom is more expensive than Random, but it is unlikely that this will make any difference in your use-case.
On the other hand, if these outliers are actually happening at roughly the rate predicted by the theory, changing the random number generator shouldn't make any difference.
If these outliers are really troublesome, then you need to take a different approach. Instead of generating N random choices, generate a list of false / true with exactly the required ratio, and then shuffle the list; e.g. using Collections.shuffle.
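A sketch of that approach: each block of 1250 decisions contains exactly 625 of each, so a 680/570 style split within a block becomes impossible.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExactSplit {
    public static void main(String[] args) {
        // Exactly 625 trues and 625 falses, in random order.
        List<Boolean> block = new ArrayList<>();
        for (int i = 0; i < 625; i++) { block.add(true); block.add(false); }
        Collections.shuffle(block);
        // Serve version A on true, version B on false; reshuffle when empty.
        System.out.println(block.get(0) ? "A" : "B");
    }
}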
I believe this is fairly normal, as it is meant to generate random sequences. If you want repeated patterns after a certain interval, you may want to use a specific seed value in the constructor and reset the generator with the same seed after that interval.
e.g. after every 100/500/n calls to Random.next..., reset the seed to the old value using the Random.setSeed(long seed) method.
java.util.Random.nextBoolean() samples a binomial distribution, which has a standard deviation of sqrt(n*p*(1-p)), with p = 0.5.
So if you do 900 iterations, the standard deviation is sqrt(900*.5*.5) = 15, so about two-thirds of the time the split would be in the range 435 - 465.
However, it is pseudo-random, and has a limited cycle of numbers it will go through before starting over. So if you have enough iterations, the actual deviation will be much smaller than the theoretical one. Java uses the formula seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1). You could write a different formula with smaller numbers to purposely obtain a smaller deviation, which would make it a worse random number generator, but better fitted for your purpose.
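That formula is easy to verify: re-implementing one LCG step reproduces java.util.Random exactly (the constructor scrambles the seed with the same multiplier, and nextInt() returns the top 32 of the 48 state bits).

import java.util.Random;

public class LcgDemo {
    public static void main(String[] args) {
        long seed = 12345L;
        long state = (seed ^ 0x5DEECE66DL) & ((1L << 48) - 1);    // scrambled seed
        state = (state * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1); // one LCG step
        int mine = (int) (state >>> 16);                          // top 32 of 48 bits
        System.out.println(mine == new Random(seed).nextInt());   // true
    }
}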
You could for example create a list of 5 trues and 5 falses in it, and use Collections.shuffle to randomize the list. Then you iterate over them sequentially. After 10 iterations you re-shuffle the list and start from the beginning. That way you'll never deviate more than 5.
See http://en.wikipedia.org/wiki/Linear_congruential_generator for the mathematics.
