Generating a globally unique identifier in Java

Summary: I'm developing a persistent Java web application, and I need to make sure that all resources I persist have globally unique identifiers to prevent duplicates.
The Fine Print:
I'm not using an RDBMS, so I don't have any fancy sequence generators (such as the one provided by Oracle)
I'd like it to be fast, preferably all in memory - I'd rather not have to open up a file and increment some value
It needs to be thread safe (I'm anticipating that only one JVM at a time will need to generate IDs)
There needs to be consistency across instantiations of the JVM. If the server shuts down and starts up, the ID generator shouldn't re-generate the same IDs it generated in previous instantiations (or at least the chance has to be really, really slim - I anticipate many millions of persisted resources)
I have seen the examples in the EJB unique ID pattern article. They won't work for me (I'd rather not rely solely on System.currentTimeMillis() because we'll be persisting multiple resources per millisecond).
I have looked at the answers proposed in this question. My concern about them is, what is the chance that I will get a duplicate ID over time? I'm intrigued by the suggestion to use java.util.UUID for a UUID, but again, the chances of a duplicate need to be infinitesimally small.
I'm using JDK6

Pretty sure UUIDs are "good enough". There are 340,282,366,920,938,463,463,374,607,431,770,000,000 UUIDs available.
http://www.wilybeagle.com/guid_store/guid_explain.htm
"To put these numbers into perspective, one's annual risk of being hit by a meteorite is estimated to be one chance in 17 billion, that means the probability is about 0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs"
http://en.wikipedia.org/wiki/Universally_Unique_Identifier
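
The quoted odds can be sanity-checked with the usual birthday-bound approximation, p ≈ 1 - exp(-n² / (2N)). The sketch below is only an estimate under the assumption that the IDs are version 4 (random) UUIDs, which carry 122 random bits; the class and method names are made up for illustration.

```java
// Birthday-bound estimate: p is roughly 1 - exp(-n^2 / (2N)) for n random draws from N values.
final class CollisionOdds {
    // Assumption: version 4 UUIDs, where 6 of the 128 bits are fixed version/variant bits.
    static final double RANDOM_UUID_SPACE = Math.pow(2, 122);

    /** Approximate probability of at least one duplicate among n random IDs drawn from `space` values. */
    static double collisionProbability(double n, double space) {
        // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
        return -Math.expm1(-(n * n) / (2.0 * space));
    }

    public static void main(String[] args) {
        System.out.printf("1e9 UUIDs:    %.3e%n", collisionProbability(1e9, RANDOM_UUID_SPACE));
        System.out.printf("1e12 UUIDs:   %.3e%n", collisionProbability(1e12, RANDOM_UUID_SPACE));
        System.out.printf("2.7e18 UUIDs: %.3f%n", collisionProbability(2.7e18, RANDOM_UUID_SPACE));
    }
}
```

With "many millions" of persisted resources, n is far below the point where the estimate becomes noticeable.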

public class UniqueID {
private static long startTime = System.currentTimeMillis();
private static long id;
public static synchronized String getUniqueID() {
return "id." + startTime + "." + id++;
}
}

If it needs to be unique per PC: you could probably use (System.currentTimeMillis() << 4) | (staticCounter++ & 15) or something like that.
That would allow you to generate 16 per ms. If you need more, shift by 5 and AND it with 31...
If it needs to be unique across multiple PCs, you should also combine in your primary network card's MAC address.
edit: to clarify
private static int staticCounter = 0;
private static final int nBits = 4;
public static synchronized long getUnique() {
    // (1 << nBits) - 1 is the low-bit mask; in Java, ^ is XOR, not exponentiation
    return (System.currentTimeMillis() << nBits) | (staticCounter++ & ((1 << nBits) - 1));
}
and change nBits to the base-2 logarithm (rounded up) of the largest number you should need to generate per ms.
It will eventually roll over, but with a 64-bit long and nBits at 4 that is millions of years away.

From memory, the RMI remote packages contain a UUID generator. I don't know whether that's worth looking into.
When I've had to generate them I typically use a MD5 hashsum of the current date time, the user name and the IP address of the computer. Basically the idea is to take everything that you can find out about the computer/person and then generate a MD5 hash of this information.
It works really well and is incredibly fast (once you've initialised the MessageDigest for the first time).
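
A minimal sketch of the hashing idea described above, assuming the inputs are already available as strings (the user name and IP address in main are hypothetical placeholders). It uses StandardCharsets, which needs Java 7 or later. Note that identical inputs within the same millisecond produce the same hash, so this alone does not guarantee uniqueness.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Id {
    /** Hashes the concatenated inputs with MD5 and returns 32 lowercase hex characters. */
    static String md5Hex(String... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String part : parts) {
                md.update(part.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so ("ab","c") and ("a","bc") hash differently
            }
            StringBuilder hex = new StringBuilder(32);
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform is required to provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical inputs; a real caller would look up the user name and host address.
        String id = md5Hex(String.valueOf(System.currentTimeMillis()), "alice", "192.0.2.10");
        System.out.println(id);
    }
}
```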

why not do like this
String id = Long.toString(System.currentTimeMillis()) +
(new Random()).nextInt(1000) +
(new Random()).nextInt(1000);

If you want to use a shorter and faster implementation than Java's UUID, take a look at:
https://code.google.com/p/spf4j/source/browse/trunk/spf4j-core/src/main/java/org/spf4j/concurrent/UIDGenerator.java
see the implementation choices and limitations in the javadoc.
here is a unit test on how to use:
https://code.google.com/p/spf4j/source/browse/trunk/spf4j-core/src/test/java/org/spf4j/concurrent/UIDGeneratorTest.java

Related

How to generate fixed length random number without conflict?

I'm working on an application where I have to generate codes like Google Classroom does. When a user creates a class, I generate a code using the following function:
private String codeGenerator(){
StringBuilder stringBuilder=new StringBuilder();
String chars="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
int characterLength=chars.length();
for(int i=0;i<5;i++){
stringBuilder.append(chars.charAt((int)Math.floor(Math.random()*characterLength)));
}
return stringBuilder.toString();
}
As I have 62 different characters, I can generate 62^5 (about 916 million) codes in total, which is quite large. I can generate this code on the server or on the user's device. So my question is: which is the better approach? How likely is it that a generated code will conflict with another code?

From a comment, it seems that you are generating group codes for your own application.
For the purposes and scale of your app, 5-character codes may be appropriate. But there are several points you should know:
Random number generators are not designed to generate unique numbers. You can generate a random code as you're doing now, but you should check that code for uniqueness (e.g., check it against a table that stores group codes already generated) before you treat that code as unique.
If users are expected to type in a group code, you should include a way to check whether a group code is valid, to avoid users accidentally joining a different group than intended. This is often done by adding a so-called "checksum digit" to the end of the group code. See also this answer.
It seems that you're trying to generate codes that should be hard to guess. In that case, Math.random() is far from suitable (as is java.util.Random) — especially because the group codes are so short. Use a secure random generator instead, such as java.security.SecureRandom (fortunately for you, its security issues were addressed in Android 4.4, which, as I can tell from a comment of yours, is the minimum Android version your application supports; see also this question). Also, if possible, make group codes longer, such as 8 or 12 characters long.
For more information, see Unique Random Identifiers.
Also, there is another concern. There is a serious security issue if the 5-character group code is the only thing that grants access to that group. Ideally, there should be other forms of authorization, such as allowing only logged-in users or certain logged-in users—
to access the group via that group code, or
to accept invitations to join the group via that group code (e.g., in Google Classroom, the PERMISSION_DENIED error code can be raised when a user tries to accept an invitation to join a class).
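
The first and third points above (check for uniqueness, use a secure generator) can be sketched together as follows. This is an illustration under stated assumptions, not the asker's code: the class name is invented, and the in-memory set merely stands in for whatever table stores the codes already issued.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

final class GroupCodes {
    private static final String CHARS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private final SecureRandom random = new SecureRandom();
    // Stand-in for the table of codes already issued; a real app would query its database.
    private final Set<String> issued = new HashSet<String>();

    /** Returns a random code of the given length that has not been issued before. */
    synchronized String nextCode(int length) {
        while (true) {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append(CHARS.charAt(random.nextInt(CHARS.length())));
            }
            String code = sb.toString();
            if (issued.add(code)) { // add() returns false when the code already exists
                return code;
            }
            // Duplicate drawn: loop and try again.
        }
    }
}
```

The retry loop only terminates quickly while most of the code space is still free, which is another argument for longer codes.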

The only way to avoid duplicates in your scheme is to keep a copy of the ones that you have already generated, and avoid "generating" anything that would result in a duplicate. Since there are only 62^5 (about 9.2E8) possible codes, you could simply store the issued ones in a table if using a database; or in a HashSet if everything is in-memory and there is only one instance of the application (remember to save the list of generated IDs to disk every time you create a new one, and to re-read it at startup).
The chances of a collision are not negligible: by the birthday paradox, you would need only around sqrt(62^5) = 62^2.5 ~= 3.0E4 really-random identifiers for a collision to become about as likely as not, so the space needed to store and check the identifiers for duplicates is well spent. Such is the price of security.

Que: A sack contains a blue ball and a red ball. I draw one ball from the sack. What are the chances it is a red ball?
Ans: 1/2
Que: I have a collection of 62^5 unique codes. I choose one code from the collection. What are the chances that it is "ABCDE"?
Ans: 1/(62^5)
NOTE: Random number generators are not actually random.

Well, in case you need a unique generator, what about the following? It is definitely not random, but it is definitely unique within one instance.
import java.util.function.Supplier;
public final class UniqueCodeGenerator implements Supplier<String> {
    private int code;
    @Override
    public synchronized String get() {
        // zero-padded to at least five digits; values above 99999 produce longer codes
        return String.format("%05d", code++);
    }
    public static void main(String... args) {
        Supplier<String> generator = new UniqueCodeGenerator();
        for (int i = 0; i < 10; i++)
            System.out.println(generator.get());
    }
}

Generate unique ID in Java, to label groups of related entries in a log

There are several posts on SO on this topic. Each of those talks about a specific approach, so I wanted to get a comparison in one question.
Using new Date() as unique identifier
Generating a globally unique identifier in Java
I am trying to implement a feature where we are able to identify certain events in the log file. These events need to be associated with a unique id.
I am trying to come up with a strategy for this unique ID generation.
The ID has to have 2 parts :
some static information + some dynamic information
The logs can be searched for the pattern when debugging of events is needed.
I have three ways :
static info + Joda Date time("abc"+2014-01-30T12:36:12.703)
static info + Atomic Integer
static info + UUID
For the scope of this question, multiple JVMs is not a consideration.
I need to generate unique IDs in an efficient manner on one JVM. Also, I will not be able to use a database dependent solution.
Which of the 3 above mentioned strategies works best ?
If not one from the above, any other strategy ?
Is the Joda time based strategy robust ? The JVM is single but there will be concurrent users so there can be concurrent events.
In conjunction with one of the above/other strategies, Do I need to make my method thread-safe / synchronized ?

I have had the same need as you, distinguishing a thread of related entries interleaved with other unrelated entries in a log. I have tried all three of your suggested approaches. My experience was in 4D not Java, but similar.
Date-Time
In my case, I was using a date-time value resolved to whole seconds. That is simply too large a granularity. I easily had collisions where multiple events started within the same second. Damn those speedy computers!
In your case with either the bundled java.util.Date or Joda-Time (highly recommended for other purposes), both resolve to milliseconds. A millisecond is a long time in modern computers, so I don't recommend this.
In Java 8, the new java.time.* package (inspired by Joda-Time, defined by JSR 310) resolves to nanoseconds. This might seem to be a better identifier, but no. For one thing, your computer's physical time-keeping clock may not support such a fine resolution. Another is that computers keep getting faster. Lastly, a computer's clock can be reset; indeed, it is reset often, as computer clocks drift quite a bit. Modern OSes reset their clocks by frequently checking with a time server, either locally or over the Internet.
Also, logs already have a timestamp, so we are not getting any extra benefit by using a date-time as our identifier. Indeed, having a second date-time in the log entry may actually cause confusion.
Serial Number
By "Atomic Integer", I assume you mean a serial number incrementing to increasing numbers.
This seems overkill for your purpose.
You don't care about the sequence, it has no meaning for this purpose of grouping log entries. You don't really care if one group came nth number before or after another group.
Maintaining a sequence is a pain, a point of potential failure. I've always eventually run into administrative problems with maintaining a sequence.
So this approach adds risk without adding any special benefit.
UUID
Bingo! Just what you need.
A UUID is easily generated, using either the bundled java.util.UUID class' ability to generate Version 3 or 4 UUIDs, or using a third-party library, or accessing the command-line's uuidgen tool.
For a very high volume, a Version 1 UUID (MAC + date-time + random number) would be best. For logging, a Version 4 UUID (entirely random) is absolutely acceptable.
Having a collision is not a realistic concern. Especially for the limited number of values you would be generating for logs. I'm amazed by people who, failing to comprehend the numbers, say they would never replace a sequence with a UUID. Yet when pressed, every single programmer and sysadmin I know has experienced failures with at least one sequence.
No concerns about thread-safety. No concerns about contention (see my test results on another answer of mine).
Another benefit of a UUID is that its usual hexadecimal representation, such as:
6536ca53-bcad-4552-977f-16945fee13e2
…is easily recognizable. When recognized, the reader immediately knows that string is meant to be a unique identifier. So its presence in your log is self-documenting.
I've found UUIDs to be the Duct Tape of computing. I keep finding new uses for them.
So, at the start of the code in question, generate a UUID and then embed that into every one of the related log entries.
While the hex string representation of a UUID is hard to read and write, in practice you need only scan a few of the digits at the beginning or end. Or use copy-paste with search and filter features in our modern console tools.
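
As a rough sketch of that advice, assuming the "static info + UUID" format from the question: one tag is created at the start of the work and prefixed to every related message. The class name and the "order-import" label are invented for illustration, and the output goes to System.out rather than a real logging framework.

```java
import java.util.UUID;

final class LogGroup {
    private final String tag;

    /** staticInfo is the fixed, searchable part of the ID, e.g. a hypothetical "order-import". */
    LogGroup(String staticInfo) {
        this.tag = staticInfo + "-" + UUID.randomUUID();
    }

    String tag() {
        return tag;
    }

    /** Prefixes a message with the group's tag so related entries can be found with one search. */
    String format(String message) {
        return "[" + tag + "] " + message;
    }

    public static void main(String[] args) {
        LogGroup group = new LogGroup("order-import");
        System.out.println(group.format("started"));
        System.out.println(group.format("finished"));
    }
}
```

Because each LogGroup instance owns its own immutable tag, no synchronization is needed here.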
A few factoids
A UUID is known in the Microsoft world as a GUID.
A UUID is not a string, but a 128-bit value. Bits, just bits in memory, "on"/"off" values. Some databases, such as Postgres, know how to handle and store UUID as such 128-bit values. If we wish to show those bits to humans, we could use a series of 128 digits of "1" & "0". But humans do not do well trying to read or write 128 digits of ones and zeros. So we use the hexadecimal representation. But even 32 hex digits is too much for humans, so we break the string into groups separated with hyphens as shown above, for a total of 36 characters.
The spec for a UUID is quite clear that a hexadecimal representation should be lowercase. The spec says that when creating a UUID from a string input, uppercase should be tolerated. But when generating a hex string, it should be lowercase. Many implementations of UUIDs ignore this requirement. I suggest sticking to the spec and converting your UUID hex strings to lowercase.
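
The two points above (a UUID is 128 bits, and the canonical text form is lowercase) can be observed with java.util.UUID directly. A small sketch with invented helper names; it relies only on the standard constructor, getMostSignificantBits, getLeastSignificantBits, fromString and toString.

```java
import java.util.UUID;

final class UuidBits {
    /** Rebuilds a UUID from its two 64-bit halves, showing the value is just 128 bits. */
    static UUID roundTrip(UUID id) {
        return new UUID(id.getMostSignificantBits(), id.getLeastSignificantBits());
    }

    /** Parses a hex string (either case is accepted) and returns the lowercase form. */
    static String canonical(String text) {
        return UUID.fromString(text).toString();
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println(id + " version=" + id.version());
        System.out.println(roundTrip(id).equals(id));
        System.out.println(canonical("6536CA53-BCAD-4552-977F-16945FEE13E2"));
    }
}
```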
MDC – Mapped Diagnostic Context
I have not yet used MDC, but want to point it out…
Some logging frameworks are adding support for this idea of tagging related log entries. Such support is called Mapped Diagnostic Context (MDC). The MDC manages contextual information on a per thread basis.
A quick introductory article is Log4j MDC (Mapped Diagnostic Context) : What and Why .
The best logging façade, SLF4J, offers such an MDC feature. The best implementation of that façade, Logback, has a chapter documenting its MDC feature.

Computers are fast, using time to attempt to create a unique value is going to fail.
Instead use a UUID.
From the Java SE 6 UUID API page:
"[UUID is] A class that represents an immutable universally unique identifier (UUID)."
Here is some code:
import java.util.UUID;
private String id;
id = UUID.randomUUID().toString();

I have written a simple service which can generate semi-unique, non-sequential 64-bit long numbers. It can be deployed on multiple machines for redundancy and scalability. It uses ZeroMQ for messaging. For more information on how it works, look at the GitHub page: zUID

Generate ID fast and with high probability of uniqueness

I want to generate an ID for each event that occurs in my application.
The event frequency depends on the user load, so events might occur hundreds to thousands of times per second.
I can't afford to use UUID.randomUUID() because it might be problematic performance-wise - take a look at this.
I thought of generating ID as follows:
System.currentTimeMillis() + ";" + Long.toString(_random.nextLong())
where _random is a static java.util.Random held by my class.
My questions are:
Do you think the distribution of this combination will be good enough for my needs?
Is Java's Random implementation related to the current time, so that combining the two is dangerous?

I would use the following.
final AtomicLong counter = new AtomicLong(System.currentTimeMillis() * 1000);
and
long l = counter.getAndIncrement(); // takes less than 10 nano-seconds most of the time.
This will be unique within your system and across restarts provided you average less than one million per second.
Even at this rate, the number will not overflow for some time.
class Main {
public static void main(String[] args) {
System.out.println(new java.util.Date(Long.MAX_VALUE/1000));
}
}
prints
Sun Jan 10 04:00:54 GMT 294247
EDIT: In the last 8 years I have switched to using nanosecond wall clock and memory-mapped files to ensure uniqueness across processes on the same machine. The code is available here. https://github.com/OpenHFT/Chronicle-Bytes/blob/ea/src/main/java/net/openhft/chronicle/bytes/MappedUniqueTimeProvider.java
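
Wrapped into a small class, the counter idea above might look like this. It is a sketch of the same AtomicLong approach, not the Chronicle code linked in the edit; the class and method names are invented.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CounterIds {
    // Seeded with the start time scaled by 1000, so a restart begins above every earlier
    // value as long as fewer than 1000 IDs per millisecond were issued on average.
    private static final AtomicLong COUNTER =
            new AtomicLong(System.currentTimeMillis() * 1000);

    /** Returns the next ID; getAndIncrement is atomic, so no synchronized block is needed. */
    static long next() {
        return COUNTER.getAndIncrement();
    }

    public static void main(String[] args) {
        System.out.println(next());
        System.out.println(next());
    }
}
```

Uniqueness across restarts also assumes the wall clock is not set back between runs.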

To prevent possible collisions, I would suggest that you somehow integrate the users' unique IDs into the generated ID. You can do this either by adding the user ID directly to the generated ID
System.currentTimeMillis() + ";" + Long.toString(_random.nextLong()) + userId
or you can use separate _random for each user that uses the user's id as its seed.

UUID uuid = UUID.randomUUID(); is less than 8 times slower, after warming up, 0.015 ms versus 0.0021 ms on my PC. That would be a positive argument for UUID - for me.
One could shift the random long a bit to the right, so time is more normative, sequencing.
No, there is a pseudo random distribution involved.

I can't afford using UUID.randomUUID() because it might be problematic
And it might not. Currently, you're solving a problem that might not exist. I suggest using an interface so you can easily swap out the ID generator, but stick to this generator, on which many smart people have spent a lot of time to make it right.
Your own solution might work in many cases but the corner cases are important and you will only see those after a few years of experience.
That said, combining the current time + Random should give pretty unique IDs. But they are easy to guess and insecure.
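
One possible shape for the interface suggested above, with both schemes behind it so they can be swapped after measuring. All type names here are hypothetical; only UUID.randomUUID() and the time-plus-random expression come from the thread.

```java
import java.util.Random;
import java.util.UUID;

/** Hypothetical abstraction; callers depend on this, not on a concrete generator. */
interface IdGenerator {
    String nextId();
}

/** Default choice: delegate to the JDK's random UUIDs. */
final class UuidIdGenerator implements IdGenerator {
    public String nextId() {
        return UUID.randomUUID().toString();
    }
}

/** The time-plus-random scheme from the question, kept behind the same interface. */
final class TimeRandomIdGenerator implements IdGenerator {
    // java.util.Random is thread-safe, but its output is predictable.
    private final Random random = new Random();

    public String nextId() {
        return System.currentTimeMillis() + ";" + random.nextLong();
    }
}
```

Calling code would hold an IdGenerator field and never mention either implementation by name.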

I would use a library to avoid reinventing the wheel.
For example, JUG (https://github.com/cowtowncoder/java-uuid-generator) can generate 5 million time-based UUIDs per second (https://github.com/cowtowncoder/java-uuid-generator/blob/master/release-notes/FAQ):
<dependency>
<groupId>com.fasterxml.uuid</groupId>
<artifactId>java-uuid-generator</artifactId>
<version>4.0.1</version>
</dependency>
UUID uuid = Generators.timeBasedGenerator().generate();

How to generate incremental identifier in java

I have a requirement in which I continuously receive messages that need to be written to files. Every time a new message is received, it needs to be written to a separate file. What I want is to generate a unique identifier to be used as the file name. I also want to preserve the order of the messages. By this I mean that the identifiers generated as file names should always be increasing.
I was using UUID.randomUUID() to generate file names, but the problem with this approach is that a UUID only assures randomness of the identifier and is not incremental. As a result, I am losing the ordering of the files (I want the file generated first to appear first in the list).
Approaches known
1. I can use System.currentTimeMillis(), but I can receive multiple messages with the same timestamp.
2. Another approach could be to keep a static long value, increment it whenever a file is to be created, and use the long value as the file name. But I am not sure about this approach. Also, it doesn't seem to be a proper solution to my problem. I think there could be far better solutions than this one.
If someone could suggest a better solution to this problem, it would be highly appreciated.

If you want your id value to uniformly rise even between server restarts, then you must either base it on the system time or have some elaborately robust logic that persists the last ID used. Note that achieving robustness on its own is not hard, but achieving it in a performant and scalable way is.
If you additionally need the id to be unique across multiple nodes in a redundant server cluster, then you need even more elaborate logic, which definitely involves a persistent store to which all the boxes synchronize access. Making this performant is, of course, even harder.
The best option I can see is to have a quite long ID so there's room for these parts:
System.currentTimeMillis for long-term uniqueness (across restarts);
System.nanoTime for finer granularity;
a unique id of each server node (determined in a platform-specific way).
The method will still have to remember the last value generated and retry in case of a duplicate. It won't have to retry too many times, though, just until the next nanoTime clock tick—it could even busy-wait for it.
Sketch of code without point 3 (single-node implementation):
private static long lastNanos;
public static synchronized String uniqueId() {
for (;/*ever*/;) {
final long n = System.nanoTime();
if (n == lastNanos) continue;
lastNanos = n;
return "" + System.currentTimeMillis() + n;
}
}

Ok, my hands up. My last answer was fairly flaky and I've deleted it.
Keeping with the spirit of the site, I thought I'd try a different tack.
If you say you are keeping these messages in a single file, then you could try something like creating a unique ID out of the size of the file.
Before you write the message to the file, its ID could be the current size of the file.
You could use the filename + size as the ID if these messages need to be unique across a number of files.
I'll leave the hot potato of synchronization to another day. But you could wrap all of this up in a synchronized object that keeps track of things.
Also, I am assuming that any messages written to the file will not be removed in the future.
ADDITIONAL NOTE:
You could create a message processing object that opens the file on construction (or via a create method).
This object will get the initial size of the file and this will be used as the unique id.
As each message is added (in a synchronized manner), the id is incremented by the size of the message.
This would address the performance issues. Will not work if more than one JVM/Node accesses the same file.
Skeletal Idea:
public class MessageSink {
private long id = 0;
public MessageSink(String filename) {
id = ... get file size ..
}
public synchronized void addMessage(Message msg) {
msg.setId(id);
.. write to file + flush ..
.. or add to stack of messages that need to be written to file
.. at a later stage.
id = id + msg.getSize();
}
public void flushMessages() {
.. open file
.. for each message in stack write ...
.. flush and close file
}
}

I had the same requirement and found a suitable solution. Twitter Snowflake uses a simple algorithm to generate sortable 64-bit (long) IDs. Snowflake is written in Scala, but the approach is simple and could easily be used in Java code.
id is composed of:
timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years);
machine id - 10 bits (MAC address could be used as a hardware id);
sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
Formula looks like: ((timestamp - customEpoch) << timestampShift) | (machineId << machineIdShift) | sequenceNumber;
The shift for each component depends on its bit position in the ID.
Detailed description and source code could be found at github:
Twitter Snowflake
Basic Java implementation of the Snowflake algorithm
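
A compact Java sketch of the layout described above (41-bit timestamp, 10-bit machine ID, 12-bit sequence). It is an illustration of the scheme, not Twitter's code: the custom epoch is an arbitrary assumption, the machine ID is passed in by the caller rather than derived from a MAC address, and a clock that moves backwards is simply rejected.

```java
final class SnowflakeIds {
    private static final long CUSTOM_EPOCH = 1577836800000L; // 2020-01-01T00:00:00Z, arbitrary choice
    private static final int MACHINE_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_MACHINE_ID = (1L << MACHINE_BITS) - 1; // 1023
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;  // 4095

    private final long machineId;
    private long lastMillis = -1L;
    private long sequence = 0L;

    SnowflakeIds(long machineId) {
        if (machineId < 0 || machineId > MAX_MACHINE_ID) {
            throw new IllegalArgumentException("machineId must be in 0.." + MAX_MACHINE_ID);
        }
        this.machineId = machineId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastMillis) {
            // This sketch refuses to issue IDs if the wall clock moved backwards.
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastMillis) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // 4096 IDs already issued in this millisecond: busy-wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastMillis) {
                    // spin
                }
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return ((now - CUSTOM_EPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }
}
```

IDs from one generator are strictly increasing, which gives the file-name ordering the question asks for as long as names are compared numerically or zero-padded.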

Why does my UUID use too much time?

String s = UUID.randomUUID().toString();
return s.substring(0,8) + s.substring(9,13) + s.substring(14,18) +
s.substring(19,23) + s.substring(24);
I use JDK 1.5's UUID, but it takes too much time when I connect to or disconnect from the network.
I think the UUID generation may want to access the network.
Can anybody help me?

UUID generation is done locally and doesn't require a live network connection.

Quoting the API doc:
public static UUID randomUUID()
Static factory to retrieve a type 4 (pseudo randomly generated) UUID. The UUID is generated using a cryptographically strong pseudo random number generator.
Your delay is probably being caused by the initialization of the cryptographically strong RNG - those take some time, and might even depend on the presence of a network connection as a source of entropy. However, this should happen only once during the runtime of the JVM. I don't see a way around this problem, though.

The javadoc for UUID http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html has some good information on how the UUID is generated. It uses the time and clock frequency to generate the UUID. Like sharptooth says, no network interface is required. Is there possibly some other concurrent process running that could possibly be causing this problem?

What's the purpose of those s.substring calls? It looks like you're just returning the original string with the hyphens removed.
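
For reference, the substring calls in the question skip indices 8, 13, 18 and 23, which are exactly the hyphen positions in the 36-character UUID string, so the result is the 32 hex digits. A small sketch (class and method names invented) comparing that with a single replace call:

```java
final class CompactUuid {
    /** The question's approach: copy everything except the characters at indices 8, 13, 18 and 23. */
    static String viaSubstrings(String s) {
        return s.substring(0, 8) + s.substring(9, 13) + s.substring(14, 18)
                + s.substring(19, 23) + s.substring(24);
    }

    /** Same result with a single call. */
    static String viaReplace(String s) {
        return s.replace("-", "");
    }

    public static void main(String[] args) {
        String s = java.util.UUID.randomUUID().toString();
        System.out.println(viaSubstrings(s).equals(viaReplace(s)));
    }
}
```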

If you're appending 5 Strings together, over a large set of data, that could be the issue. Try to use StringBuffer. It's amazing the difference that can make when concatenating more than 1-2 Strings together, especially for larger datasets

For older versions of Java (6 and earlier maybe?), there's a bug in the seed generation behind SecureRandom that causes it to iterate over the entire temp directory. We've seen seed generation take 10 minutes on some egregiously bad build machines at NVIDIA. You might want to check the size of your temp dir.
Compare: http://www.docjar.com/html/api/sun/security/provider/SeedGenerator.java.html
To: http://www.java2s.com/Open-Source/Java-Document/6.0-JDK-Modules/j2me/sun/security/provider/SeedGenerator.java.htm
