Compact alternatives to UUID as a correlationId - java

I am adding correlation IDs to our public and private APIs so that a request's progress can be traced through the logs.
UUIDs, being long strings, take up a lot of space. I need a compact alternative to a UUID as a correlation ID.
It is OK if a correlation ID repeats after a fixed period (say 2 months), since API requests older than that won't need to be traced.
I have considered java.util.Random's nextLong(), but it does not guarantee that values won't repeat.
Also, my understanding is that SecureRandom can pose some performance issues, and I don't need the correlation IDs to be secure anyway.
It would be good to have other options to consider.

If you can accept IDs up to 8 characters long, the number of possible IDs depends on the character set of those IDs.
For hexadecimal characters (16-character set), the number of IDs is 4294967296 or 2^32.
For A-Z, 0-9 (36-character set), the number of IDs is 2821109907456 or about 2^41.
For base64 or base64url (64-character set), the number of IDs is 281474976710656 or 2^48.
You should decide which character set to use, as well as ask yourself whether your application can check IDs for uniqueness and whether you can tolerate the risk of duplicate IDs (so that you won't have a problem generating them randomly using SecureRandom). Also note the following:
The risk of duplicates depends on the number of possible IDs. Roughly speaking, after your application generates the square root of all possible IDs at random, the risk of duplicates becomes non-negligible (which in the case of hexadecimal IDs, will occur after just 65536 IDs). See "Birthday problem" for a more precise statement and formulas.
If your application is distributed across multiple computers, you might choose to assign each computer a unique value to include in the ID.
If you don't care about whether your correlation IDs are secure, you might choose to create unique IDs by doing a reversible operation on sequential IDs, such as a reversible mixing function or a linear congruential generator.
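As a rough sketch of that last idea (the class and method names are mine, not from any library), a sequential counter can be scrambled with operations that are bijections on 64-bit values, so the original sequence number stays recoverable:

import java.util.concurrent.atomic.AtomicLong;

public final class ReversibleIds {
    // Any odd multiplier is invertible modulo 2^64; the XOR key is arbitrary.
    private static final long MULTIPLIER = 0x9E3779B97F4A7C15L;
    private static final long MULTIPLIER_INVERSE = modInverse(MULTIPLIER);
    private static final long XOR_KEY = 0xA5A5A5A5A5A5A5A5L;

    private final AtomicLong counter = new AtomicLong();

    public long nextId() {
        return (counter.getAndIncrement() * MULTIPLIER) ^ XOR_KEY;  // scramble
    }

    public static long toSequence(long id) {
        return (id ^ XOR_KEY) * MULTIPLIER_INVERSE;                 // unscramble
    }

    // Multiplicative inverse of an odd number modulo 2^64 (Newton iteration;
    // each step doubles the number of correct low bits).
    private static long modInverse(long a) {
        long x = a;
        for (int i = 0; i < 6; i++) {
            x *= 2 - a * x;
        }
        return x;
    }
}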
You say that SecureRandom "can pose some performance issues". You won't know if it will unless you try it and measure your application's performance. Generating IDs with SecureRandom may or may not be too slow for your purposes.
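For example, here is a minimal sketch of generating 8-character IDs from the 64-character base64url alphabet with SecureRandom (class and method names are illustrative only); timing a loop of a few million calls will tell you whether it is fast enough for your traffic:

import java.security.SecureRandom;

public final class CorrelationIds {
    private static final char[] ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_".toCharArray();
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns an 8-character ID; 64^8 = 2^48 possible values.
    public static String next() {
        char[] id = new char[8];
        for (int i = 0; i < id.length; i++) {
            id[i] = ALPHABET[RANDOM.nextInt(ALPHABET.length)];  // uniform pick per character
        }
        return new String(id);
    }
}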
For further considerations and advice, see what I write on "Unique Random Identifiers".

Related

Which digits of a UUID are least likely to collide if the generator (e.g. Java version of UUID) is unknown?

Suppose we have an existing set of UUIDs (say, millions, though it doesn't matter) that may have been generated by different clients, so that we don't know the algorithm that generated any UUID. But we can assume that they are popular implementations.
Is there a set of 8 or more digits (not necessarily contiguous, though ideally contiguous) that is less or more likely to collide than the rest?
For example, I've seen the uuid() function in MySQL, when used twice in the same statement, generate 2 UUIDs exactly the same except the 5th through 8th digits:
0dec7a69-ded8-11e8-813e-42010a80044f
0decc891-ded8-11e8-813e-42010a80044f
    ^^^^
What is the answer generally?
The application is to expose a more compact ID for customers to copy and paste or communicate over the phone. Unfortunately we're bound to using UUIDs in the backend, and understandably reluctant to create mappings between long and short versions of IDs, but we can live with using a truncated UUID that occasionally collides and returns more than 1 result.
Suggestion: first 8 digits
1c59f6a6-21e6-481d-80ee-af3c54ac400a
^^^^^^^^
All generator implementations are required to use the same algorithm for a given version, so the UUID version matters more than which implementation produced it.
UUID version 1 & version 2 are generally arranged from most to least entropy for a given source. So, the first 8 digits are probably the least likely to collide.
UUID version 4 and version 3 & 5 are designed to have uniform entropy, aside from the reserved digits for version and variant. So the first 8 digits are as good as any others.
There is one method that will work, no matter the caveats of the UUID specification. Since a UUID is in itself intended to be globally unique, a secure hash made out of it using a proper algorithm with at least the same bit size will have the same properties.
Except that the secure hash will have its entropy spread across the whole hash value instead of concentrated at specific positions.
As an example, you could do:
MessageDigest digest = MessageDigest.getInstance("SHA-256"); // throws NoSuchAlgorithmException if SHA-256 is unavailable
byte[] hash = digest.digest(uuid.toString().getBytes(StandardCharsets.UTF_8)); // hash the UUID's canonical string form
And then you take as many bits out of the hash as you need and convert them back to a String.
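For example, a sketch that keeps the first 6 bytes (48 bits) of the digest and renders them as 8 base64url characters; the 6-byte length is an arbitrary choice, to be sized against your collision tolerance:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;
import java.util.UUID;

public final class ShortUuidCodes {
    public static String shortCode(UUID uuid) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(uuid.toString().getBytes(StandardCharsets.UTF_8));
            byte[] prefix = Arrays.copyOf(hash, 6);  // keep 48 bits of the hash
            return Base64.getUrlEncoder().withoutPadding().encodeToString(prefix);  // 8 characters
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required to be present in every JRE", e);
        }
    }
}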
This is a one-way function though; to map it back to the UUID in a fast and efficient manner, you need to keep a mapping table. (You can of course check whether a UUID matches the shorter code by performing the one-way hash on the UUID again.)
However, if you were to take a non-contiguous portion out of the UUID, you would have the same issue.

Even distribution of long integer identifiers into buckets

I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that being the criterion, is there a difference between using the long integer as is and taking it modulo n, or is it better to generate a hashCode for the string version of the long integer (to improve the distribution) and then take that modulo n? Is the additional string conversion necessary to get a uniform spread via the hash code?
Since I got feedback that my question does not have enough background information, I am adding some more.
The identifiers are basically auto-incrementing numeric row identifiers that are autogenerated in a database, representing an item id. The reason for pockets of missing identifiers is deletes.
The identifiers themselves are long integers.
The identifiers (items) number in the tens to hundreds of millions in some cases and in the thousands in others.
Only in the case where the identifiers are in the order of millions do I want to really spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).
I was wondering whether, because items get deleted, I should resort to Long.toString(identifier).hashCode() to get the uniform spread instead of using the long numeric value directly. I had a feeling that toString().hashCode() is not going to buy me much, and I also did not like the fact that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation is documented and has been stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
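A minimal sketch of the bucketing itself (Math.floorMod keeps the result in [0, n) even for a negative id, which a plain % would not):

static int bucketFor(long id, int n) {
    return (int) Math.floorMod(id, (long) n);
}

For auto-incrementing, always-positive identifiers this is the same as id % n.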
Your question as it stands cannot be answered. #slim's answer is the best you will get, because crucial information is missing from your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly larger than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly larger than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets, otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get items [0-199], [400-599] and [800-999] while the other 200 buckets get items [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
If they are not uniformly distributed, the output distribution is not uniform either, since the modulo operator doesn't change the distribution except by wrapping it around.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution: the point of the hash function is to break the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming the inputs are not too large). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
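A sketch of that multiplicative method applied to the bucketing question (the method name is mine):

static int knuthBucket(long id, int n) {
    long hash = (id * 2654435761L) & 0xFFFFFFFFL;  // input * 2654435761 mod 2^32
    return (int) (hash % n);
}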
I found that this gist has an interesting compilation of diverse hashing functions and their characteristics; you might pick the one that best matches the characteristics of your dataset.

digest function whose output size <= input size

I am dealing with fixed length records, with fixed length fields. Some of these fields are sensitive ... think account number. Let's say the account number in my record is defined as a maximum of 19 bytes. I would like to find (or create) a hash of the account number, the result of which is itself no more than 19 bytes. This way, I can still correlate records by this field, the original value is not recoverable, and importantly my fixed record and field sizes are not changed. Basically, for any field a, f(a) = a' where sizeof(a) == sizeof(a'). Is this possible, even if not cryptographically secure?
If you want to limit the size of the hash to 19 bytes, you could simply truncate a standard hash. Obviously, this increases the chance of hash collisions (two account numbers hashing to the same value).
See also this question which discusses truncation.
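As a minimal sketch under those assumptions (SHA-256 truncated to the 19-byte field width; the class and method names are mine):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public final class AccountTokens {
    public static byte[] tokenize(String accountNumber) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(accountNumber.getBytes(StandardCharsets.UTF_8));
        return Arrays.copyOf(digest, 19);  // keep only the first 19 of 32 bytes
    }
}

Using a keyed hash (HMAC with a secret key, e.g. javax.crypto.Mac with "HmacSHA256") instead of a bare digest also blunts the brute-force enumeration discussed next.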
However, the original values may be recoverable by brute force. The number of account numbers will probably not be huge, so someone can enumerate them all, run them through the same hashing algorithm, and determine the original account number of a given record. This is a real vulnerability which has been exploited in practice to de-anonymise data.
I can't answer your question directly, but what you are looking for is referred to as "tokenization," I believe. One reason to use tokenization instead of a simple digest or hashing scheme is to avoid problems related to collisions. Some providers even perform this exact type of tokenization for things like replacing a credit card number with a valid (as far as the format is concerned) token that can be processed like a regular credit card number without exposing any sensitive information.

What's the importance of using Random.setSeed?

When writing a Java program, we sometimes use setSeed in the Random class. Why would we use this method?
Can't we just use Random without using setSeed? What is the main purpose of using setSeed?
One use of this is that it enables you to reproduce the results of your program in future.
As an example, I wanted to compute a random variable for each row in a database. I wanted the program to be reproducible, but I wanted randomness between rows. To do this, I set the random number seed to the primary key of each row. That way, when I ran the program again, I got the same results, but between rows, the random variable was pseudo random.
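A minimal sketch of that approach (the method name is mine):

import java.util.Random;

// Same primary key => same value on every run; different rows => different values.
static double randomVariableFor(long primaryKey) {
    return new Random(primaryKey).nextDouble();
}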
The seed is used to initialize the random number generator: it sets the starting point for the sequence of numbers the generator will produce. A given seed always yields the same sequence, and a different seed yields a different sequence.
This might be of help.
A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers that approximates the properties of random numbers. The sequence is not truly random in that it is completely determined by a relatively small set of initial values, called the PRNG's state, which includes a truly random seed.
I can see two reasons for doing this:
You can create a reproducible random stream. For a given seed, the same results will be returned from consecutive calls to (the same) nextX methods.
If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers
You feel, for some reason, that your seed is of a higher quality than the default source (which I'm guessing is derived from the current time on your PC).
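The first point is easy to demonstrate (a throwaway example, not from any library):

import java.util.Random;

public class SameSeedDemo {
    public static void main(String[] args) {
        Random a = new Random(42L);
        Random b = new Random(42L);
        for (int i = 0; i < 5; i++) {
            // Both halves of each line print the same number on every run.
            System.out.println(a.nextInt(100) + " == " + b.nextInt(100));
        }
    }
}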
A specific seed will always give the same sequence of "pseudo-random" numbers. So there are only 2^48 different sequences in Random, because setSeed only uses 48 bits of the seed parameter! Besides setSeed, one may also pass a seed to the constructor (e.g. new Random(seed)).
When setSeed(seed) or new Random(seed) are not used, the Random() constructor sets the seed of the random number generator to a value very likely to be distinct from any other invocation of this constructor.
Java reference for the above information: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Random.html
In the ordinary case, don't use a seed. Just use the empty constructor Random() and don't call setSeed. This way you'll likely get different pseudo-random numbers each time the class is constructed and invoked.
For data dependent debugging, where you want to repeat the same pseudo-random numbers, use a specific seed. In this case, use Random(seed) or setSeed(seed).
For non-security critical uses, there's no need to worry whether the specific seed/sequence might be recognized and subsequent numbers predicted, because of the large range of seeds. However, "Instances of java.util.Random are not cryptographically secure. Consider instead using SecureRandom to get a cryptographically secure pseudo-random number generator for use by security-sensitive applications." source
Several others have mentioned reproducibility. Reproducibility is at the heart of debugging, you need to be able to reproduce the circumstances in which the bug occurred.
Another important use of reproducibility is that you can play some statistical games to reduce the variability of some estimates. See Wikipedia's Variance Reduction article for more details, but the intuition is as follows. Suppose you're considering two different layouts for a bank or a grocery store. You can't build them both and see which works better, so you use simulation. You know from queueing theory that the size of lines and delays customers experience are partly due to the layout, but also partly due to the variation in arrival times, demand loads, etc, so you use randomness in your two models. If you run the models completely independently, and find that the lines are bigger in layout 1 than in layout 2, it might be because of the layout or it might be because layout 1 just happened to get more customers or a more demanding mix of transactions due to the luck of the draw. However, if both systems use the exact same set of customers arriving at the same times and having the same transaction demands, it's a "fairer" comparison. The differences you observe are more likely to be because of the layout. You can accomplish this by reproducing the randomness in both systems - use the same seeds, and synchronize so that the same random numbers are used for the same purpose in both systems.
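A toy sketch of that "common random numbers" idea (the workload model and numbers are made up purely for illustration): both layouts are fed exactly the same pseudo-random customer demands because they share a seed, so any difference in the totals comes from the layout parameter rather than from the luck of the draw.

import java.util.Random;

public class CommonRandomNumbers {
    public static void main(String[] args) {
        long workloadSeed = 12345L;
        double layout1 = totalServiceTime(new Random(workloadSeed), 2.0);  // hypothetical slower layout
        double layout2 = totalServiceTime(new Random(workloadSeed), 1.5);  // hypothetical faster layout
        System.out.println("layout1=" + layout1 + ", layout2=" + layout2);
    }

    static double totalServiceTime(Random customerDemand, double meanServiceTime) {
        double total = 0;
        for (int customer = 0; customer < 1_000; customer++) {
            total += customerDemand.nextDouble() * meanServiceTime;  // identical draw per customer in both runs
        }
        return total;
    }
}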

How good is Java's UUID.randomUUID?

I know that randomized UUIDs have a very, very, very low probability for collision in theory, but I am wondering, in practice, how good Java's randomUUID() is in terms of not having collision? Does anybody have any experience to share?
UUID uses java.security.SecureRandom, which is supposed to be "cryptographically strong". While the actual implementation is not specified and can vary between JVMs (meaning that any concrete statements made are valid only for one specific JVM), it does mandate that the output must pass a statistical random number generator test.
It's always possible for an implementation to contain subtle bugs that ruin all this (see OpenSSH key generation bug) but I don't think there's any concrete reason to worry about Java UUIDs's randomness.
Wikipedia has a very good answer
http://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:
...
This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes.
...
Thus, for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.
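Those figures can be reproduced with the birthday approximation p ≈ 1 - exp(-n^2 / (2N)), where N = 2^122 is the number of possible random version-4 UUIDs; a quick sanity check:

public class UuidCollisionOdds {
    public static void main(String[] args) {
        double N = Math.pow(2, 122);
        double fiftyPercent = Math.sqrt(2 * N * Math.log(2));  // ~2.7e18, the "2.71 quintillion"
        double oneInABillion = Math.sqrt(2 * N * 1e-9);        // ~1.0e14, the "103 trillion"
        System.out.printf("50%% collision chance after ~%.2e UUIDs%n", fiftyPercent);
        System.out.printf("1-in-a-billion chance after ~%.2e UUIDs%n", oneInABillion);
    }
}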
Does anybody have any experience to share?
There are 2^122 possible values for a type-4 UUID. (The spec says that you lose 2 bits for the type, and a further 4 bits for a version number.)
Assuming that you were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in your lifetime would be vanishingly small. And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated [1]!
The chances that anyone has experienced (i.e. actually noticed) a duplicate in real life are even smaller than vanishingly small ... because of the practical difficulty of looking for collisions.
Now of course, you will typically be using a pseudo-random number generator, not a source of truly random numbers. But I think we can be confident that if you are using a creditable provider for your cryptographic-strength random numbers, then the output will be of cryptographic strength, and the probability of repeats will be the same as for an ideal (non-biased) random number generator.
However, if you were to use a JVM with a "broken" crypto random number generator, all bets are off. (And that might include some of the workarounds for "shortage of entropy" problems on some systems. Or the possibility that someone has tinkered with your JRE, either on your system or upstream.)
[1] Assuming that you used "some kind of binary btree" as proposed by an anonymous commenter, the data structure is going to need O(N log N) bits of RAM to represent N distinct UUIDs, assuming low density and random distribution of the bits. Now multiply that by 1,000,000 and the number of seconds that you are going to run the experiment for. I don't think that is practical for the length of time needed to test for collisions of a high quality RNG. Not even with (hypothetical) clever representations.
I'm not an expert, but I'd assume that enough smart people looked at Java's random number generator over the years. Hence, I'd also assume that random UUIDs are good. So you should really have the theoretical collision probability (which is about 1 : 3 × 10^38 for all possible UUIDs. Does anybody know how this changes for random UUIDs only? Is it 1/(16*4) of the above?)
From my practical experience, I've never seen any collisions so far. I'll probably have grown an astonishingly long beard the day I get my first one ;)
At a former employer we had a unique column that contained a random uuid. We got a collision the first week after it was deployed. Sure, the odds are low but they aren't zero. That is why Log4j 2 contains UuidUtil.getTimeBasedUuid. It will generate a UUID that is unique for 8,925 years so long as you don't generate more than 10,000 UUIDs/millisecond on a single server.
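A minimal usage sketch (assuming the log4j-core artifact is on the classpath; in the versions I have seen, the class lives in org.apache.logging.log4j.core.util):

import org.apache.logging.log4j.core.util.UuidUtil;
import java.util.UUID;

public class TimeBasedIdExample {
    public static void main(String[] args) {
        UUID id = UuidUtil.getTimeBasedUuid();  // time- and node-based, version-1 style
        System.out.println(id);
    }
}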
The original generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. By representing a single point in space (the computer) and time (the number of intervals), the chance of a collision in values is effectively nil.
Many of the answers discuss how many UUIDs would have to be generated to reach a 50% chance of a collision. But a 50%, 25%, or even 1% chance of collision is worthless for an application where collision must be (virtually) impossible.
Do programmers routinely dismiss as "impossible" other events that can and do occur?
When we write data to a disk or memory and read it back again, we take for granted that the data are correct. We rely on the device's error correction to detect any corruption. But the chance of undetected errors is actually around 2^-50.
Wouldn't it make sense to apply a similar standard to random UUIDs? If you do, you will find that an "impossible" collision is possible in a collection of around 100 billion random UUIDs (2^36.5).
This is an astronomical number, but applications like itemized billing in a national healthcare system, or logging high frequency sensor data on a large array of devices could definitely bump into these limits. If you are writing the next Hitchhiker's Guide to the Galaxy, don't try to assign UUIDs to each article!
I played the lottery last year, and I never won ....
but it seems that the lottery does have winners ...
doc : https://www.rfc-editor.org/rfc/rfc4122
Type 1 : not implemented. Collisions are possible if the UUIDs are generated at the same moment; an implementation can be artificially de-synchronized in order to bypass this problem.
Type 2 : I have never seen an implementation.
Type 3 : MD5 hash : collisions possible (128 bits, minus the version and variant bits).
Type 4 : random : collisions possible (as in the lottery). Note that the JDK 6 implementation doesn't necessarily use a "true" secure random, because the PRNG algorithm is not chosen by the developer and the system can be forced to use a "poor" PRNG algorithm. So your UUIDs could be predictable.
Type 5 : SHA-1 hash : not implemented : collisions possible (160-bit hash truncated to the 128-bit UUID, minus the version and variant bits).
We have been using Java's random UUIDs in our application for more than a year, and very extensively at that. But we have never come across a collision.
