Does randomUUID give a unique id? - java

I am trying to create session tokens for my REST API. Each time the user logs in I am creating a new token by
UUID token = UUID.randomUUID();
user.setSessionId(token.toString());
Sessions.INSTANCE.sessions.put(user.getName(), user.getSessionId());
However, I am not sure how to protect against duplicate sessionTokens.
For example: Can there be a scenario when user1 signs in and gets a token 87955dc9-d2ca-4f79-b7c8-b0223a32532a and user2 signs in and also gets a token 87955dc9-d2ca-4f79-b7c8-b0223a32532a.
Is there a better way of doing this?

If you get a UUID collision, go play the lottery next.
From Wikipedia:
Randomly generated UUIDs have 122 random bits. Out of a total of 128
bits, four bits are used for the version ('Randomly generated UUID'),
and two bits for the variant ('Leach-Salz').
With random UUIDs, the chance of two having the same value can be calculated using probability theory (birthday paradox). Using the approximation

p(n) \approx 1 - e^{-\frac{n^2}{2x}}

these are the probabilities of an accidental clash after calculating n UUIDs, with x = 2^122:

n                            probability
2^36 = 68,719,476,736        0.0000000000000004 (4 × 10^−16)
2^41 = 2,199,023,255,552     0.0000000000004    (4 × 10^−13)
2^46 = 70,368,744,177,664    0.0000000004       (4 × 10^−10)
To put these numbers into perspective,
the annual risk of someone being hit by a meteorite is estimated to be
one chance in 17 billion, which means the probability is about
0.00000000006 (6 × 10^−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In
other words, only after generating 1 billion UUIDs every second for
the next 100 years would the probability of creating just one duplicate
be about 50%. The probability of one duplicate would also be about
50% if every person on earth owned 600 million UUIDs.
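If you want to check the arithmetic behind that table yourself, here is a minimal sketch (not part of the quoted text) that evaluates the birthday approximation p(n) ≈ 1 − e^(−n²/2x) with x = 2^122 for the three values of n above:

// Minimal sketch: evaluate p(n) ≈ 1 - e^(-n^2 / (2x)) for version 4 UUIDs (x = 2^122).
public class UuidCollisionOdds {
    public static void main(String[] args) {
        double x = Math.pow(2, 122);                    // number of distinct random UUIDs
        long[] counts = {1L << 36, 1L << 41, 1L << 46};
        for (long n : counts) {
            double exponent = ((double) n * n) / (2 * x);
            double p = -Math.expm1(-exponent);          // numerically stable 1 - e^(-exponent)
            System.out.printf("n = %,d -> p ≈ %.1e%n", n, p);
        }
    }
}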

Since a UUID has a finite size there is no way for it to be unique across all of space and time.
If you need a UUID that is guaranteed to be unique within any reasonable use case, you can use Log4j 2's UuidUtil.getTimeBasedUuid(). It is guaranteed to be unique for about 8,900 years so long as you generate fewer than 10,000 UUIDs per millisecond.

The Oracle UUID documentation: http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html
It uses the algorithm defined by the Internet Engineering Task Force in RFC 4122: http://www.ietf.org/rfc/rfc4122.txt
A quote from the abstract.
A UUID is 128 bits long, and can guarantee uniqueness across
space and time.
While the abstract claims a guarantee, there are only about 3.4 × 10^38 combinations. (CodeChimp, in a comment)

From UUID.randomUUID() Javadoc:
Static factory to retrieve a type 4 (pseudo randomly generated) UUID. The UUID is generated using a cryptographically strong pseudo random number generator.
It's random, and therefore a collision will eventually occur, as confirmed by others in the comments above/below who detected collisions very early. Instead of version 4 (random-based), I would advise you to use version 1 (time-based).
Possible solutions:
1) UUID utility from Log4j
You can use the 3rd-party implementation from Log4j, UuidUtil.getTimeBasedUuid(), which is based on the current timestamp, measured in units of 100 nanoseconds since October 15, 1582, concatenated with the MAC address of the device where the UUID is created. See package org.apache.logging.log4j.core.util from the log4j-core artifact (a combined sketch of options 1 and 2 follows this list).
2) UUID utility from FasterXML
There is also a 3rd-party implementation from FasterXML, Generators.timeBasedGenerator().generate(), which is likewise based on time and the MAC address. See package com.fasterxml.uuid from the java-uuid-generator artifact.
3) Do it on your own
Or you can implement your own using the constructor new UUID(long mostSigBits, long leastSigBits) from core Java. See the very nice explanation in Baeldung's Guide to UUID in Java, where October 15, 1582 (actually, a very famous day) is used in the implementation.
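A minimal sketch of options 1 and 2, assuming the log4j-core and java-uuid-generator artifacts are on the classpath:

import java.util.UUID;

import org.apache.logging.log4j.core.util.UuidUtil;   // option 1: Log4j 2
import com.fasterxml.uuid.Generators;                  // option 2: FasterXML java-uuid-generator

public class TimeBasedUuidExamples {
    public static void main(String[] args) {
        // Option 1: Log4j 2's time-based (version 1) UUID
        UUID fromLog4j = UuidUtil.getTimeBasedUuid();

        // Option 2: FasterXML's generator, also based on time + MAC address
        UUID fromJug = Generators.timeBasedGenerator().generate();

        System.out.println(fromLog4j + " (version " + fromLog4j.version() + ")");
        System.out.println(fromJug + " (version " + fromJug.version() + ")");
    }
}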

If you want to be absolutely 100% dead certain that there will be NO duplicates, just make a TokenHandler. All it needs is a synchronized method that generates a random UUID, loops over every single one that has been created (not very time efficient, sure, but if the token is to be used as a session ID then a good data structure is all that is needed to make this very fast still), and if the token is unique then the handler saves it before returning it.
This is seriously overkill, though. It would be easier to just follow the suggestion of having your tokens be a combination of a UUID and a timestamp. If you use System.nanoTime() in addition to the UUID, I don't see there being a collision at any time in eternity.
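For what it's worth, here is a minimal sketch of the TokenHandler idea described above (the HashSet stands in for whatever faster data structure you would actually keep issued tokens in):

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public final class TokenHandler {
    private final Set<String> issued = new HashSet<>();

    // Synchronized so two concurrent logins cannot receive the same token.
    public synchronized String newToken() {
        String token;
        do {
            token = UUID.randomUUID().toString();
        } while (!issued.add(token));   // add() returns false if the token was already issued
        return token;
    }
}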

Related

Compact alternatives to UUID as a correlationId

I am adding correlationIDs to our public and private APIs, to be able to trace a request's progress through the logs.
UUIDs, being long strings, take up a lot of space. I need a compact alternative to a UUID as a correlation ID.
It will be OK if a correlationId repeats after a fixed period (say 2 months), since API requests older than that won't need to be traced.
I have considered using java.util.Random's nextLong(), but it does not guarantee that it won't repeat.
Also, from what I understand, SecureRandom can pose some performance issues, and I don't need the correlationIDs to be secure.
It would be good to have other options considered.
If you can accept IDs up to 8 characters long, the number of possible IDs depends on the character set of those IDs.
For hexadecimal characters (16-character set), the number of IDs is 4294967296 or 2^32.
For A-Z, 0-9 (36-character set), the number of IDs is 2821109907456 or about 2^41.
For base64 or base64url (64-character set), the number of IDs is 281474976710656 or 2^48.
You should decide which character set to use, as well as ask yourself whether your application can check IDs for randomness and whether you can tolerate the risk of duplicate IDs (so that you won't have a problem generating them randomly using SecureRandom). Also note the following:
The risk of duplicates depends on the number of possible IDs. Roughly speaking, after your application generates the square root of all possible IDs at random, the risk of duplicates becomes non-negligible (which in the case of hexadecimal IDs, will occur after just 65536 IDs). See "Birthday problem" for a more precise statement and formulas.
If your application is distributed across multiple computers, you might choose to assign each computer a unique value to include in the ID.
If you don't care about whether your correlation IDs are secure, you might choose to create unique IDs by doing a reversible operation on sequential IDs, such as a reversible mixing function or a linear congruential generator.
You say that SecureRandom "can pose some performance issues". You won't know if it will unless you try it and measure your application's performance. Generating IDs with SecureRandom may or may not be too slow for your purposes.
For further considerations and advice, see what I write on "Unique Random Identifiers".
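For illustration only, here is one way to generate an 8-character ID over the A-Z, 0-9 (36-character) set with SecureRandom; the class and method names are mine, not from the answer above:

import java.security.SecureRandom;

// Hypothetical helper: 8 characters from a 36-character alphabet gives
// about 2^41 possible IDs, as discussed above.
public final class CorrelationIds {
    private static final char[] ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
    private static final SecureRandom RANDOM = new SecureRandom();

    public static String next(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET[RANDOM.nextInt(ALPHABET.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(next(8));   // e.g. "K3F9Q0ZB"
    }
}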

Which digits of a UUID are least likely to collide if the generator (e.g. Java version of UUID) is unknown?

Suppose we have an existing set of UUIDs (say, millions, though it doesn't matter) that may have been generated by different clients, so that we don't know the algorithm that generated any UUID. But we can assume that they are popular implementations.
Is there a set of 8 or more digits (not necessarily contiguous, though ideally so) that is more or less likely to collide?
For example, I've seen the uuid() function in MySQL, when used twice in the same statement, generate 2 UUIDs exactly the same except the 5th through 8th digits:
0dec7a69-ded8-11e8-813e-42010a80044f
0decc891-ded8-11e8-813e-42010a80044f
^^^^
What is the answer generally?
The application is to expose a more compact ID for customers to copy and paste or communicate over the phone. Unfortunately we're bound to using UUIDs in the backend, and understandably reluctant to create mappings between long and short versions of IDs, but we can live with using a truncated UUID that occasionally collides and returns more than 1 result.
Suggestion: first 8 digits
1c59f6a6-21e6-481d-80ee-af3c54ac400a
^^^^^^^^
All generator implementations are required to use the same algorithms for a given version, so worry about the UUID version rather than the implementation.
UUID version 1 & version 2 are generally arranged from most to least entropy for a given source. So, the first 8 digits are probably the least likely to collide.
UUID version 4 and version 3 & 5 are designed to have uniform entropy, aside from the reserved digits for version and variant. So the first 8 digits are as good as any others.
There is one method that will work, no matter the caveats of the UUID specification. Since a UUID is in itself intended to be globally unique, a secure hash made out of it using a proper algorithm with at least the same bit size will have the same properties.
Except that the secure hash will have its entropy spread throughout the hash value instead of concentrated at specific positions.
As an example, you could do:
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(uuid.toString().getBytes(StandardCharsets.UTF_8));
And then you take as many bits out of the hash as you need and convert them back to a String.
This is a one-way function though; to map it back to the UUID in a fast and efficient manner, you need to keep a mapping table. (You can of course check whether a UUID matches the shorter code by performing the one-way hash on the UUID again.)
However, if you were to take a non-contiguous portion out of the UUID, you would have the same issue.
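Putting that together, here is a small sketch that hashes the UUID and keeps only the first few bytes of the digest, rendered as hex; the 4-byte (8 hex digit) cut-off is an arbitrary choice for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class ShortUuidCode {
    // Derive a short code from a UUID by hashing it and keeping the first `bytes` bytes.
    static String shortCode(UUID uuid, int bytes) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(uuid.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes; i++) {
            sb.append(String.format("%02x", hash[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        UUID id = UUID.randomUUID();
        System.out.println(id + " -> " + shortCode(id, 4));   // 4 bytes = 8 hex digits
    }
}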

What would be considered a standard deviation boundary for java random?

I'm using Java 6's random (java.util.Random, Linux 64) to randomly decide between serving one version of a page or a second one (normal A/B testing). Technically, I initialize the class once with the default empty constructor, and it's injected into a bean (Spring) as a property.
Most of the time the counts for the two copies of the page are within 8% (+-) of each other, but from time to time I see deviations of up to 20 percent, e.g.:
I now have two copies that split 680 / 570. Is that considered normal?
Is there a better/faster alternative to use than java.util.Random?
Thanks
A deviation of 20% does seem rather large, but you would need to talk to a trained statistician to find out if it is statistically anomalous.
UPDATE - and the answer is that it is not necessarily anomalous. The statistics predict that you would get an outlier like this roughly 0.3% of the time.
It is certainly plausible for a result like this to be caused by the random number generator. The Random class uses a simple "linear congruential" algorithm, and this class of algorithms is strongly auto-correlated. Depending on how you use the random numbers, this could lead to anomalies at the application level.
If this is the cause of your problem, then you could try replacing it with a crypto-strength random number generator. See the javadocs for SecureRandom. SecureRandom is more expensive than Random, but it is unlikely that this will make any difference in your use-case.
On the other hand, if these outliers are actually happening at roughly the rate predicted by the theory, changing the random number generator shouldn't make any difference.
If these outliers are really troublesome, then you need to take a different approach. Instead of generating N random choices, generate a list of false / true with exactly the required ratio, and then shuffle the list; e.g. using Collections.shuffle.
I believe this is fairly normal, as it is meant to generate random sequences. If you want repeating patterns after a certain interval, you may want to use a specific seed value in the constructor and reset the generator with the same seed after that interval,
e.g. after every 100/500/n calls to Random.next..., reset the seed to the old value using the Random.setSeed(long seed) method.
Repeated calls to java.util.Random.nextBoolean() follow a binomial distribution, which has a standard deviation of sqrt(n*p*(1-p)), with p = 0.5.
So if you do 900 iterations, the standard deviation is sqrt(900*.5*.5) = 15, so most times the distribution would be in the range 435 - 465.
However, it is pseudo-random, and has a limited cycle of numbers it will go through before starting over. So if you have enough iterations, the actual deviation will be much smaller than the theoretical one. Java uses the formula seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1). You could write a different formula with smaller numbers to purposely obtain a smaller deviation, which would make it a worse random number generator, but better fitted for your purpose.
You could, for example, create a list with 5 trues and 5 falses in it, and use Collections.shuffle to randomize the list (sketched below). Then you iterate over it sequentially. After 10 iterations you re-shuffle the list and start from the beginning. That way you'll never deviate by more than 5.
See http://en.wikipedia.org/wiki/Linear_congruential_generator for the mathematics.
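A small sketch of that shuffle approach, assuming a 50/50 split drawn in batches of 10:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AbAssigner {
    private final List<Boolean> batch = new ArrayList<>();

    // Returns true for version A and false for version B, in an exactly
    // balanced but randomly ordered sequence of 10.
    public synchronized boolean nextIsVersionA() {
        if (batch.isEmpty()) {
            for (int i = 0; i < 5; i++) {
                batch.add(true);
                batch.add(false);
            }
            Collections.shuffle(batch);
        }
        return batch.remove(batch.size() - 1);
    }
}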

Is UUID.randomUUID() suitable for use as a one-time password?

As previously discussed, confirmation emails should have a unique, (practically) un-guessable code, essentially a one-time password, in the confirmation link.
The UUID.randomUUID() docs say:
The UUID is generated using a cryptographically strong pseudo random
number generator.
Does this imply that the UUID random generator in a properly implemented JVM is suitable for use as the unique, (practically) un-guessable OTP?
if you read the RFC that defines UUIDs, and which is linked to from the API docs, you'll see that not all bits of the UUID are actually random (the "variant" and the "version" are not random). so a type 4 UUID (the kind that you intend to use), if implemented correctly, should have 122 bits of (secure, for this implementation) random information, out of a total size of 128 bits.
so yes, it will work as well as a 122 bit random number from a "secure" generator. but a shorter value may contain a sufficient amount of randomness and might be easier for a user (maybe i am the only old-fashioned person who still reads email in a terminal, but confirmation URLs that wrap across lines are annoying....).
No. According to the UUID spec:
Do not assume that UUIDs are hard to guess; they should not be used as
security capabilities (identifiers whose mere possession grants
access), for example. A predictable random number source will exacerbate
the situation.
Also, UUIDs only have 16 possible characters (0 through F). You can generate a much more compact and explicitly secure random password using SecureRandom (thanks to #erickson).
import java.security.SecureRandom;
import java.math.BigInteger;

public final class PasswordGenerator {
    private SecureRandom random = new SecureRandom();

    public String nextPassword() {
        return new BigInteger(130, random).toString(32);
    }
}
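Note that toString(32) encodes roughly 5 bits per character, so a 130-bit BigInteger yields up to 26 characters (fewer if the leading bits happen to be zero). An illustrative call (the sample output is made up):

String otp = new PasswordGenerator().nextPassword();
System.out.println(otp);   // e.g. "b2qvnm5jcid2hkmq3rkuqvrlea"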
P.S.
I want to give a clear example of how using UUID as a security token may lead to issues:
In uuid-random we discovered an enormous speed-boost by internally re-using random bytes in a clever way, leading to predictable UUIDs. Though we did not release the change, the RFC allows it and such optimizations could sneak into your UUID library unnoticed.
Yes, using a java.util.UUID is fine; the randomUUID method generates from a cryptographically secure source. There's not much more that needs to be said.
Here's my suggestion:
Send the user a link with a huge password in it as the URL argument.
When user clicks the link, write your backend so that it will determine whether or not the argument is correct and that the user is logged in.
Invalidate the UUID 24 hours after it has been issued.
This will take some work, but it's necessary if you really care about writing a robust, secure system.
Password strength can be quantified based on the required entropy (the higher the better).
For a binary computer,
entropy = password length * log2(symbol space)
where the symbol space is the total number of unique symbols (characters) available for selection.
For a normal English-speaking user with a QWERTY keyboard, symbols are selected from 52 letters (26 * 2 for both cases) + 10 numbers + maybe 15 other characters like *, +, -, ..., so the general symbol space is around 75.
If we expect a minimum password length of 8:
entropy = 8 * log2(75) ~= 8 * 6.x ~= 50
To achieve an entropy of 50 for auto-generated one-time passwords with only hexadecimal characters (16-symbol space: 0-9, a-f):
password length = 50 / log2(16) = 50 / 4 ~= 12
If the application can be relaxed to consider the complete set of case-sensitive English letters and numbers, the symbol space will be 62 (26 * 2 + 10):
password length = 50 / log2(62) = 50 / 6 ~= 8
This reduces the number of characters the user has to type to 8 (from 12 with hexadecimal).
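A tiny sketch that reproduces the arithmetic above (entropy = length * log2(symbol space)) for a 50-bit target:

public class PasswordLengthMath {
    public static void main(String[] args) {
        double targetBits = 50;                      // desired entropy
        int[] symbolSpaces = {16, 62, 75};           // hex, alphanumeric, full keyboard
        for (int s : symbolSpaces) {
            double bitsPerChar = Math.log(s) / Math.log(2);   // log2(symbol space)
            System.out.printf("symbol space %2d: %.2f bits/char, length ≈ %.1f%n",
                    s, bitsPerChar, targetBits / bitsPerChar);
        }
    }
}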
With UUID.randomUUID(), the two main concerns are:
the user has to enter 32 characters (not user friendly);
implementations have to ensure the uniqueness criteria (close coupling with library versions and language dependencies).
I understand this is not a direct answer, and it's really up to the application owner to choose the best strategy considering the security and usability constraints.
Personally, I would not use UUID.randomUUID() as a one-time password.
The point of the random code for a confirmation link is that the attacker should not be able to guess or predict the value. To find the correct code for your confirmation link, a 128-bit UUID yields 2^128 different possible codes (strictly, 2^122 of them are reachable for a version 4 UUID, but the argument is the same), namely 340,282,366,920,938,463,463,374,607,431,768,211,456 possible codes to try. I think your confirmation link is not for launching a nuclear weapon, right? This is difficult enough for an attacker to guess. It's secure.
-- update --
If you don't trust the cryptographically strong random number generator provided, you can combine some more unpredictable parameters with the UUID and hash them. For example,
code = SHA1(UUID, Process PID, Thread ID, Local connection port number, CPU temperature)
This makes it even harder to predict.
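As a rough illustration of that idea (not the author's exact recipe: CPU temperature isn't available from the standard library, so this sketch mixes in the process ID, thread ID, and a nanosecond timestamp instead):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class MixedCode {
    // Hash the UUID together with a few extra process-local values.
    static byte[] mixedCode() throws NoSuchAlgorithmException {
        String material = UUID.randomUUID()
                + ":" + ProcessHandle.current().pid()      // process PID (Java 9+)
                + ":" + Thread.currentThread().getId()     // thread ID
                + ":" + System.nanoTime();                 // stand-in for other local state
        return MessageDigest.getInstance("SHA-1")
                .digest(material.getBytes(StandardCharsets.UTF_8));
    }
}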
I think this should be suitable, as it is generated randomly rather than from any specific input (i.e. you're not feeding it with a username or something like that), so multiple calls to this code will give different results. It states that it's a 128-bit key, so it's long enough to be impractical to break.
Are you then going to use this key to encrypt a value, or are you expecting to use this as the actual password? Regardless, you'll need to re-interpret the key into a format that can be entered by a keyboard. For example, do a Base64 or Hex conversion, or somehow map the values to alpha-numerics, otherwise the user will be trying to enter byte values that don't exist on the keyboard.
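For example, re-encoding random bytes into something keyboard-friendly might look like this (the 16-byte length is an arbitrary choice for illustration):

import java.security.SecureRandom;
import java.util.Base64;

public class TypableToken {
    public static void main(String[] args) {
        byte[] bytes = new byte[16];                  // 128 bits of randomness
        new SecureRandom().nextBytes(bytes);
        // URL-safe Base64 without padding keeps the token easy to type and paste.
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        System.out.println(token);
    }
}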
It is perfect as a one-time password; I have implemented the same for the application I am working on. Moreover, the link which you've shared says it all.
I think java.util.UUID should be fine. You can find more information from this article:

How good is Java's UUID.randomUUID?

I know that randomized UUIDs have a very, very, very low probability for collision in theory, but I am wondering, in practice, how good Java's randomUUID() is in terms of not having collision? Does anybody have any experience to share?
UUID uses java.security.SecureRandom, which is supposed to be "cryptographically strong". While the actual implementation is not specified and can vary between JVMs (meaning that any concrete statements made are valid only for one specific JVM), it does mandate that the output must pass a statistical random number generator test.
It's always possible for an implementation to contain subtle bugs that ruin all this (see OpenSSH key generation bug) but I don't think there's any concrete reason to worry about Java UUIDs's randomness.
Wikipedia has a very good answer
http://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
...
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes.
...
Thus, for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
Does anybody have any experience to share?
There are 2^122 possible values for a type-4 UUID. (The spec says that you lose 2 bits for the type, and a further 4 bits for a version number.)
Assuming that you were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in your lifetime would be vanishingly small. And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated! [1]
The chances that anyone has experienced (i.e. actually noticed) a duplicate in real life are even smaller than vanishingly small ... because of the practical difficulty of looking for collisions.
Now of course, you will typically be using a pseudo-random number generator, not a source of truly random numbers. But I think we can be confident that if you are using a creditable provider for your cryptographic strength random numbers, then it will be cryptographic strength, and the probability of repeats will be the same as for an ideal (non-biased) random number generator.
However, if you were to use a JVM with a "broken" crypto- random number generator, all bets are off. (And that might include some of the workarounds for "shortage of entropy" problems on some systems. Or the possibility that someone has tinkered with your JRE, either on your system or upstream.)
[1] Assuming that you used "some kind of binary btree" as proposed by an anonymous commenter, you are going to need O(N log N) bits of RAM to represent N distinct UUIDs, assuming low density and a random distribution of the bits. Now multiply that by 1,000,000 and the number of seconds that you are going to run the experiment for. I don't think that is practical for the length of time needed to test for collisions of a high-quality RNG, not even with (hypothetical) clever representations.
I'm not an expert, but I'd assume that enough smart people looked at Java's random number generator over the years. Hence, I'd also assume that random UUIDs are good. So you should really have the theoretical collision probability (which is about 1 : 3 × 10^38 for all possible UUIDs. Does anybody know how this changes for random UUIDs only? Is it 1/(16*4) of the above?)
From my practical experience, I've never seen any collisions so far. I'll probably have grown an astonishingly long beard the day I get my first one ;)
At a former employer we had a unique column that contained a random uuid. We got a collision the first week after it was deployed. Sure, the odds are low but they aren't zero. That is why Log4j 2 contains UuidUtil.getTimeBasedUuid. It will generate a UUID that is unique for 8,925 years so long as you don't generate more than 10,000 UUIDs/millisecond on a single server.
The original generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. By representing a single point in space (the computer) and time (the number of intervals), the chance of a collision in values is effectively nil.
Many of the answers discuss how many UUIDs would have to be generated to reach a 50% chance of a collision. But a 50%, 25%, or even 1% chance of collision is worthless for an application where collision must be (virtually) impossible.
Do programmers routinely dismiss as "impossible" other events that can and do occur?
When we write data to a disk or memory and read it back again, we take for granted that the data are correct. We rely on the device's error correction to detect any corruption. But the chance of undetected errors is actually around 2^−50.
Wouldn't it make sense to apply a similar standard to random UUIDs? If you do, you will find that an "impossible" collision is possible in a collection of around 100 billion random UUIDs (2^36.5).
This is an astronomical number, but applications like itemized billing in a national healthcare system, or logging high frequency sensor data on a large array of devices could definitely bump into these limits. If you are writing the next Hitchhiker's Guide to the Galaxy, don't try to assign UUIDs to each article!
I played the lottery last year, and I never won...
but it seems that the lottery does have winners...
Doc: https://www.rfc-editor.org/rfc/rfc4122
Type 1: not implemented in the JDK. Collisions are possible if UUIDs are generated at the same moment; an implementation can be artificially de-synchronized to work around this problem.
Type 2: never seen an implementation.
Type 3: MD5 hash: collisions are possible (128 bits minus the technical version/variant bits).
Type 4: random: collisions are possible (like the lottery). Note that the JDK 6 implementation doesn't use a "true" secure random, because the PRNG algorithm is not chosen by the developer and you can force the system to use a "poor" PRNG algorithm. So your UUID may be predictable.
Type 5: SHA-1 hash: not implemented in the JDK: collisions are possible (160 bits minus the technical version/variant bits).
We have been using Java's random UUIDs in our application very extensively for more than a year, and we have never come across a collision.
