Why is 4 the most recurring UUID character?

Why is 4 the most recurring UUID character? - java

I wrote a small program in Java that generates 5000 random UUIDs and finds the most recurring character in them overall and I always get as a result that the most recurring character after "-" ( always 20.000 occurencies obviously ) is "4" ( I ran the program several times always getting the same result ).
I was just curious about this fact and was wondering if someone had a technical explanation or if it's really just a coincidence.
Thanks!
This is the function I used to generate the 5000 random UUIDs.
UUID.randomUUID().toString();

Because UUIDs aren't entirely random. Check the Universally unique identifier on wikipedia which explains the various versions.
They look like:
xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
Where the M and the N are definitely not random (they indicate versions and variants), and the rest may not be random either depending on the mode you're using. The code you write gets you version 4, which means 'M' is always 4, and half of 'N' is also unchanging. You get 122 bits of randomness; not 128.
4 is the most common digit because the 13th 'digit' is always a 4, as per the UUID design.

Related

Mapping Unique 16-Digit numeric ID to Unique Alphanumeric ID

In a project I'm working on, I need to generate 16 character long unique IDs, consisting of 10 numbers plus 26 uppercase letters (only uppercase). They must be guaranteed to be universally unique, with zero chance of a repeat ever.
The IDs are not stored forever. An ID is thrown out of the database after a period of time and a new unique ID must be generated. The IDs can never repeat with the thrown out ones either.
So randomly generating 16 digits and checking against a list of previously generated IDs is not an option because there is no comprehensive list of previous IDs. Also, UUID will not work because the IDs must be 16 digits in length.
Right now I'm using 16-Digit Unique IDs, that are guaranteed to be universally unique every time they're generated (I'm using timestamps to generate them plus unique server ID). However, I need the IDs to be difficult to predict, and using timestamps makes them easy to predict.
So what I need to do is map the 16 digit numeric IDs that I have into the larger range of 10 digits + 26 letters without losing uniqueness. I need some sort of hashing function that maps from a smaller range to a larger range, guaranteeing a one-to-one mapping so that the unique IDs are guaranteed to stay unique after being mapped.
I have searched and so far have not found any hashing or mapping functions that are guaranteed to be collision-free, but one must exist if I'm mapping to a larger space. Any suggestions are appreciated.

Brandon Staggs wrote a good article on Implementing a Partial Serial Number Verification System. The examples are written in Delphi, but could be converted to other languages.

EDIT: This is an updated answer, as I misread the constraints on the final ID.
Here is a possible solution.
Let set:
UID16 = 16-digit unique ID
LUID = 16-symbol UID (using digits+letters)
SECRET = a secret string
HASH = some hash of SECRET+UID16
Now, you can compute:
LUID = BASE36(UID16) + SUBSTR(BASE36(HASH), 0, 5)
BASE36(UID16) will produce a 11-character string (because 16 / log10(36) ~= 10.28)
It is guaranteed to be unique because the original UID16 is fully included in the final ID. If you happen to get a hash collision with two different UID16, you'll still have two distinct LUID.
Yet, it is difficult to predict because the 5 other symbols are based on a non-predictable hash.
NB: you'll only get log2(36^5) ~= 26 bits of entropy on the hash part, which may or may not be enough depending on your security requirements. The less predictable the original UID16, the better.

One general solution to your problem is encryption. Because encryption is reversible it is always a one-to-one mapping from plaintext to cyphertext. If you encrypt the numbers [0, 1, 2, 3, ...] then you are guaranteed that the resulting cyphertexts are also unique, as long as you keep the same key, do not repeat a number or overflow the allowed size. You just need to keep track of the next number to encrypt, incrementing as needed, and check that it never overflows.
The problem then reduces to the size (in bits) of the encryption and how to present it as text. You say: "10 numbers plus 26 uppercase letters (only uppercase)." That is similar to Base32 encoding, which uses the digits 2, 3, 4, 5, 6, 7 and 26 letters. Not exactly what you require, but perhaps close enough and available off the shelf. 16 characters at 5 bits per Base32 character is 80 bits. You could use an 80 bit block cipher and convert the output to Base32. Either roll your own simple Feistel cipher or use Hasty Pudding cipher, which can be set for any block size. Do not roll your own if there is a major security requirement here. Your own Feistel cipher will give you uniqueness and obfuscation, not security. Hasty Pudding gives security as well.
If you really do need all 10 digits and 26 letters, then you are looking at a number in base 36. Work out the required bit size for 36^16 and proceed as before. Convert the cyphertext bits to a number expressed in base 36.
If you write your own cipher then it appears that you do not need the decryption function, which will save a little work.

You want to map from a space consisting of 1016 distinct values to one with 3616 values.
The ratio of the sizes of these two spaces is ~795,866,110.
Use BigDecimal and multiply each input value by the ratio to distribute the input keys equally over the output space. Then base-36 encode the resulting value.
Here's a sample of 16-digit values consisting of 11 digits "timestamp" and 5 digits server ID encoded using the above scheme.
Decimal ID Base-36 Encoding
---------------- ----------------
4156333000101044 -> EYNSC8L1QJD7MJDK
4156333000201044 -> EYNSC8LTY4Y8Y7A0
4156333000301044 -> EYNSC8MM5QJA9V6G
4156333000401044 -> EYNSC8NEDC4BLJ2W
4156333000501044 -> EYNSC8O6KXPCX6ZC
4156333000601044 -> EYNSC8OYSJAE8UVS
4156333000701044 -> EYNSC8PR04VFKIS8
4156333000801044 -> EYNSC8QJ7QGGW6OO
The first 11 digits form the "timestamp" and I calculated the result for a series incremented by 1; the last five digits are an arbitrary "server ID", in this case 01044.

Hashids unique space

I was wondering if I'm experiencing a bug, or have just run into a limitation of the Hashids algorithm.
I'm using a custom alphabet, which consists of all uppercase letters, minus "O" and "I" and digits 2 - 9.
After generating several million hashes, I noticed that duplicates started to appear. I'm confused by this, especially since Hashids claims that duplicates are not possible since the algorithm is simply a hex version of an integer. And so long as the integers remain unique (such as counting up forever), so will the hashes.
Does a custom alphabet make it more likely for duplicates to appear? Also, I was expecting the number of unique hashes for my alphabet to be: 32^7 = 34,359,738,368. Before my counter hit this number, the generated hashids grew from 7 characters long to 8.
Does anyone have any ideas as to why this is happening?
Edit: another rather strange anomaly: after generating 10647 hashes, the rest (2.9 million plus) either start with a K or an X. I'm beginning to think the custom alphabet plus the length of the salt affect how the letters get shuffled.

have same problem, just try this :
var hashids = new Hashids("BSomeoneNameN161179IBRB46", 5, "ABCDEFGHIJKLMNPQRSTUVWXYZ1234567890");
var id = hashids.encode(1234567);
var numbers = hashids.decode(id);
changing salt by deleting last 5 characters one by one, just showing same result.
making salt not more than 20 chars seems solve the problem.

I solved this issue by adding the letters and digits I,O,0,1 back into the alphabet being used. With the increase in length of the alphabet, the rotation calculated by Hashids was affected. I simply filtered out any output that included an I,O,0 or 1 using a regular expression.

Java program to generate a unique and random six alpha numeric code

I need to generate a reservation code of 6 alpha numeric characters, that is random and unique in java.
Tried using UUID.randomuuid().toString(), However the id is too long and the requirement demands that it should only be 6 characters.
What approaches are possible to achieve this?
Just to clarify, (Since this question is getting marked as duplicate).
The other solutions I've found are simply generating random characters, which is not enough in this case. I need to reasonably ensure that a random code is not generated again.

Consider using the hashids library to generate salted hashes of integers (your database ids or other random integers which is probably better).
http://hashids.org/java/
Hashids hashids = new Hashids("this is my salt",6);
String id = hashids.encode(1, 2, 3);
long[] numbers = hashids.decode(id);

You have 36 characters in the alphanumeric character set (0-9 digits + a-z letters). With 6 places you achieve 366 = 2.176.782.336 different options, that is slightly larger than 231.
Therefore you can use Unix time to create a unique ID. However, you must assure that no ID generated within the same second.
If you cannot guarantee that, you end up with a (synchronized) counter within your class. Also, if you want to survive a JVM restart, you should save the current value (e.g. to a database, file, etc. whatever options you have).

Despite its name, UUIDs are not unique. It's simply extremely unlikely to get a 128 bit collision. With 6 (less than 32 bit) it's very likely that you get a collision if you just hash stuff or generate a random string.
If the uniqueness constraint is necessary then you need to
generate a random 6 character string
Check if you generated that string before by querying your database
If you generated it before, go back to 1
Another way would be to use a pseadorandom permutation (PRP) of size 32 bit. Block ciphers are modeled as PRP functions, but there aren't many that support 32 bit block sizes. Some are Speck by the NSA and the Hasty Pudding Cipher.
With a PRP you could for example take an already unique value like your database primary key and encrypt it with the block cipher. If the input is not bigger than 32 bit then the output will still be unique.
Then you would run Base62 (or at least Base 41) over the output and remove the padding characters to get a 6 character output.

if you do a substring that value may not be unique
for more info please see following similar link
Generating 8-character only UUIDs

Lets say your corpus is the collection of alpha numberic letters. a-zA-Z0-9.
char[] corpus = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
We can use SecureRandom to generate a seed, which will ask the OS for entropy, depending on the os. The trick here is to keep a uniform distribution, each byte has 255 values, but we only need around 62 so I will propose rejection sampling.
int generated = 0;
int desired=6;
char[] result= new char[desired];
while(generated<desired){
byte[] ran = SecureRandom.getSeed(desired);
for(byte b: ran){
if(b>=0&&b<corpus.length){
result[generated] = corpus[b];
generated+=1;
if(generated==desired) break;
}
}
}
Improvements could include, smarter wrapping of generated values.
When can we expect a repeat? Lets stick with the corpus of 62 and assume that the distribution is completely random. In that case we have the birthday problem. That gives us N = 62^6 possiblities. We want to find n where the chance of a repeat around 10%.
p(r)= 1 - N!/(N^n (N-n)!)
And using the approximation given in the wikipedia page.
n = sqrt(-ln(0.9)2N)
Which gives us about 109000 numbers for 10% chance. For a 0.1% chance it woul take about 10000 numbers.

you can trying to make substring out of your generated UUID.
String uuid = UUID.randomUUID().toString();
System.out.println("uuid = " + uuid.substring(0,5);

Does randomUUID give a unique id?

I am trying to create session tokens for my REST API. Each time the user logs in I am creating a new token by
UUID token = UUID.randomUUID();
user.setSessionId(token.toString());
Sessions.INSTANCE.sessions.put(user.getName(), user.getSessionId());
However, I am not sure how to protect against duplicate sessionTokens.
For example: Can there be a scenario when user1 signs in and gets a token 87955dc9-d2ca-4f79-b7c8-b0223a32532a and user2 signs in and also gets a token 87955dc9-d2ca-4f79-b7c8-b0223a32532a.
Is there a better way of doing this?

If you get a UUID collision, go play the lottery next.
From Wikipedia:
Randomly generated UUIDs have 122 random bits. Out of a total of 128
bits, four bits are used for the version ('Randomly generated UUID'),
and two bits for the variant ('Leach-Salz').
With random UUIDs, the
chance of two having the same value can be calculated using
probability theory (Birthday paradox). Using the approximation
p(n)\approx 1-e^{-\tfrac{n^2}{{2x}}}
these are the probabilities of an
accidental clash after calculating n UUIDs, with x=2122:
n probability
68,719,476,736 = 236 0.0000000000000004 (4 × 10−16)
2,199,023,255,552 = 241 0.0000000000004 (4 × 10−13)
70,368,744,177,664 = 246 0.0000000004 (4 × 10−10)
To put these numbers into perspective,
the annual risk of someone being hit by a meteorite is estimated to be
one chance in 17 billion, which means the probability is about
0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of > UUIDs in a year and having one duplicate. In
other words, only after generating 1 billion UUIDs every second for
the next 100 years, the probability of creating just one duplicate
would be about 50%. The probability of one duplicate would be about
50% if every person on earth owns 600 million UUIDs.

Since a UUID has a finite size there is no way for it to be unique across all of space and time.
If you need a UUID that is guaranteed to be unique within any reasonable use case you can use Log4j 2's Uuid.getTimeBasedUuid(). It is guaranteed to be unique for about 8,900 years so long as you generate less than 10,000 UUIDs per millisecond.

Oracle UUID document. http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html
They use this algorithm from the The Internet Engineering Task Force. http://www.ietf.org/rfc/rfc4122.txt
A quote from the abstract.
A UUID is 128 bits long, and can guarantee uniqueness across
space and time.
While the abstract claims a guarantee, there are only 3.4 x 10^38 combinations. CodeChimp

From UUID.randomUUID() Javadoc:
Static factory to retrieve a type 4 (pseudo randomly generated) UUID. The UUID is generated using a cryptographically strong pseudo random number generator.
It's random and therefore a collision will occur, definitely, as confirmed others in comments above/below that detected collisions very early. Instead of a version 4 (random based) I would advice You to use version 1 (time based).
Possible solutions:
1) UUID utility from Log4j
You can use 3rd party implementation from Log4j UuidUtil.getTimeBasedUuid() that is based on the current timestamp, measured in units of 100 nanoseconds from October 10, 1582, concatenated with the MAC address of the device where the UUID is created. Please see package org.apache.logging.log4j.core.util from artifact log4j-core.
2) UUID utility from FasterXML
There is also 3rd party implementation from FasterXML Generators.timeBasedGenerator().generate() that is based on time and MAC address, too. Please see package com.fasterxml.uuid from artifact java-uuid-generator.
3) Do it on your own
Or You can implement Your own using constructor new UUID(long mostSigBits, long leastSigBits) from core Java. Please see following very nice explanation Baeldung - Guide to UUID in Java where October 15, 1582 (actually, very famous day) is used in implementation.

If you want to be absolutely 100% dead certain that there will be NO duplicates, just make a TokenHandler. All it needs is a synchronized method that generates a random UUID, loops over every single one that has been created (not very time efficient, sure, but if the token is to be used as a session ID then a good data structure is all that is needed to make this very fast still), and if the token is unique then the handler saves it before returning it.
This is seriously overkill though. Would be easier to just follow the suggestion of having your tokens be a combination of UUID and timestamp. If you use System.nanotime in addition to UUID I don't see there being a collision at any time in eternity.

Consistent random numbers across versions and platforms

I need/want to get random (well, not entirely) numbers to use for password generation.
What I do: Currently I am generating them with SecureRandom.
I am obtaining the object with
SecureRandom sec = SecureRandom.getInstance("SHA1PRNG", "SUN");
and then seeding it like this
sec.setSeed(seed);
Target: A (preferably fast) way to create random numbers, which are cryptographically at least a safe as the SHA1PRNG SecureRandom implementation. These need to be the same on different versions of the JRE and Android.
EDIT: The seed is generated from user input.
Problem: With SecureRandom.getInstance("SHA1PRNG", "SUN"); it fails like this:
java.security.NoSuchProviderException: SUN. Omitting , "SUN" produces random numbers, but those are different than the default (JRE 7) numbers.
Question: How can I achieve my Target?
You don't want it to be predictable: I want, because I need the predictability so that the same preconditions result in the same output. If they are not the same, its impossible hard to do what the user expects from the application.
EDIT: By predictable I mean that, when knowing a single byte (or a hundred) you should not be able to predict the next, but when you know the seed, you should be able to predict the first (and all others). Maybe another word is reproducible.
If anyone knows of a more intuitive way, please tell me!

I ended up isolating the Sha1Prng from the sun sources which guarantees reproducibility on all versions of Java and android. I needed to drop some important methods to ensure compatibility with android, as android does not have access to nio classes...

I recommend using UUID.randomUUID(), then splitting it into longs using getLeastSignificantBits() and getMostSignificantBits()

If you want predictable, they aren't random. That breaks your "Target" requirement of being "safe" and devolves into a simple shared secret between two servers.
You can get something that looks sort of random but is predicatable by using the characteristics of prime integers where you build a set of integers by starting with I (some specific integer) and add the first prime number and then modulo by the 2nd prime number. (In truth the first and second numbers only have to be relatively prime--meaning they have no common prime factors--not counting 1, in case you call that a factor.
If you repeat the process of adding and doing the modulo, you will get a set of numbers that you can repeatably reproduce and they are ordered in the sense that taking any member of the set, adding the first prime and doing the modulo by the 2nd prime, you will always get the same result.
Finally, if the 1st prime number is large relative to the second one, the sequence is not easily predictable by humans and seems sort of random.
For example, 1st prime = 7, 2nd prime = 5 (Note that this shows how it works but is not useful in real life)
Start with 2. Add 7 to get 9. Modulo 5 to get 4.
4 plus 7 = 11. Modulo 5 = 1.
Sequence is 2, 4, 1, 3, 0 and then it repeats.
Now for real life generation of numbers that seem random. The relatively prime numbers are 91193 and 65536. (I chose the 2nd one because it is a power of 2 so all modulo-ed values can fit in 16 bits.)
int first = 91193;
int modByLogicalAnd = 0xFFFF;
int nonRandomNumber = 2345; // Use something else
for (int i = 0; i < 1000 ; ++i) {
nonRandomNumber += first;
nonRandomNumber &= modByLogicalAnd;
// print it here
}
Each iteration generates 2 bytes of sort of random numbers. You can pack several of them into a buffer if you need larger random "strings".
And they are repeatable. Your user can pick the starting point and you can use any prime you want (or, in fact, any number without 2 as a factor).
BTW - Using a power of 2 as the 2nd number makes it more predictable.

Ignoring RNGs that use some physical input (random clock bits, electrical noise, etc) all software RNGs are predicable, given the same starting conditions. They are, after all, (hopefully) deterministic computer programs.
There are some algorithms that intentionally include the physical input (by, eg, sampling the computer clock occasionally) in attempt to prevent predictability, but those are (to my knowledge) the exception.
So any "conventional" RNG, given the same seed and implemented to the same specification, should produce the same sequence of "random" numbers. (This is why a computer RNG is more properly called a "pseudo-random number generator".)
The fact that an RNG can be seeded with a previously-used seed and reproduce a "known" sequence of numbers does not make the RNG any less secure than one where your are somehow prevented from seeding it (though it may be less secure than the fancy algorithms that reseed themselves at intervals). And the ability to do this -- to reproduce the same sequence again and again is not only extraordinarily useful in testing, it has some "real life" applications in encryption and other security applications. (In fact, an encryption algorithm is, in essence, simply a reproducible random number generator.)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.