I came across a question recently that was for "Generating primary key in a clustered environment of 5 App-Servers - [OAS Version 10] without using database".
Usually we generate the PK from a DB sequence, or by storing the values in a database table and then using a stored procedure to generate the new PK value. However, the current requirement is to generate primary keys for my application without referencing the database, using JDK 1.4.
I need expert help to arrive at better ways to handle this.
Thanks,
Use a UUID as your primary key and generate it client-side.
Edit:
Since your comment I felt I should expand on why this is a good way to do things.
Although sequential primary keys are the most common in databases, using a randomly generated primary key is frequently the best choice for distributed databases or (particularly) databases that support a "disconnected" user interface, i.e. a UI where the user is not continuously connected to the database at all times.
UUIDs are the best form of randomly generated key since they are, for all practical purposes, unique; the likelihood of the same UUID being generated twice is so extremely low as to be almost completely impossible. UUIDs are also ubiquitous; nearly every platform has built-in support for generating them, and for those that don't there's almost always a third-party library to take up the slack.
The biggest benefit to using a randomly generated primary key is that you can build many complex data relationships (with primary and foreign keys) on the client side and (when you're ready to save, for example) simply dump everything to the database in a single bulk insert without having to rely on post-insert steps to obtain the key for later relationship inserts.
On the con side, UUIDs are 16 bytes rather than a standard 4-byte int -- 4 times the space. Is that really an issue these days? I'd say not, but I know some who would argue otherwise. The only real performance concern when it comes to UUIDs is indexing, specifically clustered indexing. I'm going to wander into the SQL Server world, since I don't develop against Oracle all that often and that's my current comfort zone, and talk about the fact that SQL Server will by default create a clustered index across all fields on the primary key of a table. This works fairly well in the auto-increment int world, and provides for some good performance for key-based lookups. Any DBA worth his salt, however, will cluster differently, but folks who don't pay attention to that clustering and who also use UUIDs (GUIDs in the Microsoft world) tend to get some nasty slowdowns on insert-heavy databases, because the clustered index has to be recomputed every insert and if it's clustered against a UUID, which could put the new key in the middle of the clustered sequence, a lot of data could potentially need to be rearranged to maintain the clustered index. This may or may not be an issue in the Oracle world -- I just don't know if Oracle PKs are clustered by default like they are in SQL Server.
If that run-on sentence was too hard to follow, just remember this: if you use a UUID as your primary key, do not cluster on that key!
You may find it helpful to look up UUID generation.
In the simple case, one program running one thread on each machine, you can do something such as
MAC address + time in nanoseconds since 1970.
If you cannot use a database at all, GUID/UUID is the only reliable way to go. However, if you can use the database occasionally, try the HiLo algorithm.
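As a rough illustration of the HiLo idea mentioned above, here is a minimal sketch. The class name, the `nextHi` supplier, and the block size are assumptions made for the example; in a real setup `nextHi` would be the occasional call to the database that hands out the next "hi" value. It also uses a newer Java than the JDK 1.4 in the question.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of HiLo: the "hi" part comes from a shared source
// (normally the database, contacted only once per block), the "lo" part
// is a local counter.
final class HiLoGenerator {
    private final LongSupplier nextHi; // stand-in for the occasional database call
    private final int blockSize;       // how many IDs one "hi" value covers
    private long hi;
    private int lo;

    HiLoGenerator(LongSupplier nextHi, int blockSize) {
        this.nextHi = nextHi;
        this.blockSize = blockSize;
        this.lo = blockSize; // forces a "hi" fetch on first use
    }

    synchronized long nextId() {
        if (lo >= blockSize) {
            hi = nextHi.getAsLong(); // the only step that needs the shared source
            lo = 0;
        }
        return hi * blockSize + lo++;
    }
}
```

As long as every instance receives distinct "hi" values, the ID ranges cannot overlap, and the shared source is only contacted once per block.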
You should consider using IDs in the form of UUIDs. Java 5 has a class for representing them, java.util.UUID, which also provides factory methods to generate them. With this class as a reference, you can backport the code to your antiquated Java 1.4 in order to have the identifiers you require.
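For reference, this is roughly what client-side UUID generation looks like on Java 5 or later. The wrapper class and method name are made up for the example; on JDK 1.4 you would need a backport or a third-party library instead of java.util.UUID.

```java
import java.util.UUID;

// Hypothetical helper: generates a primary key on the application server
// without touching the database.
final class ClientSideKeys {
    /** Returns a new random (version 4) UUID as a 36-character string. */
    static String newKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(newKey());
    }
}
```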
Take a look at these strategies used by Hibernate (section 5.1.5 in the link). You will surely find it useful.
It explains several methods and their pros and cons, and also states whether they are safe in a clustered environment.
Best of all, there is available code that already implements it for you :)
If it fits your application, you can use a larger string key coupled with a UUID() function or SHA1(of random data).
For sequential ints, I'll leave that to another poster.
You can generate a key based on a combination of the following three things:
The IP address or MAC address of machine
Current time
An incremental counter on each instance (to ensure same key does not get generated twice on one machine as time may appear same in two immediate key creations because of underlying time precision)
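A minimal sketch of the three-part key described above. The class name, the key layout, and the way the node identifier is passed in are all assumptions for illustration; the node identifier is assumed to be unique per instance (for example an IP or MAC address, plus a port if several instances share a machine).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: key = node identifier + current time + per-instance counter.
final class CompositeKeyGenerator {
    private final String nodeId;                        // assumed unique per instance
    private final AtomicLong counter = new AtomicLong(); // disambiguates keys created in the same millisecond

    CompositeKeyGenerator(String nodeId) {
        this.nodeId = nodeId;
    }

    String nextKey() {
        long now = System.currentTimeMillis();
        long seq = counter.getAndIncrement();
        return nodeId + "-" + Long.toHexString(now) + "-" + Long.toHexString(seq);
    }
}
```

Because the counter never repeats within one running instance, two keys from the same instance differ even when the clock reports the same time; the timestamp in turn separates keys created before and after a restart, provided the clock does not go backwards.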
By using a Statement object, you can call the statement.getGeneratedKeys() method to retrieve the auto-generated key(s) produced by the execution of that Statement.
Java doc
Here is how it's done in MongoDB: http://www.mongodb.org/display/DOCS/Object+IDs
They include a timestamp.
But you can also install Oracle Express and select sequences, you can select in bulk:
SQL> select mysequence.nextval from dual connect by level <= 20;
NEXTVAL
1
2
3
4
5
..
20
Why are you not allowed to use the database? Money (Oracle express is free) or single point of failure? Or do you want to support other databases than Oracle in the future?
It's shipped out of the box in many Spring-based applications like Hybris:
The typeCode is the name of your table like, User, Address, etc.
private PK generatePkForCode(final String typeCode)
{
    final TypeInfoMap persistenceInfo = Registry.getCurrentTenant().getPersistenceManager().getPersistenceInfo(typeCode);
    return PK.createCounterPK(persistenceInfo.getItemTypeCode());
}
This is from Hibernate official tutorial:
There is an alternative <composite-id> declaration that allows access to legacy data with composite keys. Its use is strongly discouraged for anything else.
Why are composite keys discouraged? I am considering using a 3-column table where all of the columns are foreign keys and together form a primary key that is a meaningful relationship in my model. I don't see why this is a bad idea, especially since I will be using an index on them.
What's the alternative? Create an additional automatically generated column and use it as a primary key? I still need to query my 3 columns anyways!?
In short, why is this statement true? and what's the better alternative?
They discourage them for several reasons:
they're cumbersome to use. Each time you need to reference an object (or row), for example in your web application, you need to pass 3 parameters instead of just one.
they're inefficient. Instead of simply hashing an integer, the database needs to hash a composite of 3 columns.
they lead to bugs: developers inevitably implement the equals and hashCode methods of the primary key class incorrectly. Or they make it mutable, and modify its value once stored in a HashSet or HashMap.
they pollute the schema. If another table needs to reference this 3-column table, it will need 3 columns instead of just one as a foreign key. Now suppose you follow the same design and make this 3-column foreign key part of the primary key of this new table: you'll quickly have a 4-column primary key, and then a 5-column PK in the next table, etc., leading to duplication of data and a dirty schema.
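To make the "they lead to bugs" point concrete: a composite key class only behaves correctly in a HashSet or HashMap if it is immutable and if equals and hashCode cover exactly the key columns. A minimal sketch, with a class and column names that are purely hypothetical:

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical 3-column key. All fields are final, so the key cannot change
// after it has been stored in a hash-based collection.
final class EnrollmentId implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long studentId;
    private final long courseId;
    private final int termId;

    EnrollmentId(long studentId, long courseId, int termId) {
        this.studentId = studentId;
        this.courseId = courseId;
        this.termId = termId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EnrollmentId)) return false;
        EnrollmentId other = (EnrollmentId) o;
        return studentId == other.studentId
                && courseId == other.courseId
                && termId == other.termId;
    }

    @Override
    public int hashCode() {
        // Must use the same fields as equals().
        return Objects.hash(studentId, courseId, termId);
    }
}
```

With a single-column surrogate key, none of this boilerplate is needed, which is part of the argument being made here.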
The alternative is to have a single-column, auto-generated primary key, in addition to the other three columns. If you want to make the tuple of three columns unique, then use a unique constraint.
Even if it is, maybe, too late to answer your question, I want to give another (more moderate, I hope) point of view on Hibernate's need (is it really advice?) to use surrogate keys.
First of all, I want to be clear on the fact that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. I am not trying to say that one key type is better than the other. I am trying to say that depending on your requirements, natural keys might be a better choice than surrogate ones and vice versa.
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the database engine used:
Oracle
MySQL
Natural keys don't exist in real-life. Sorry but they do exist! In aviation industry, for example, the following tuple will be always unique regarding a given scheduled flight (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns, which means:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo alike algorithms but still those methods have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company at least), then you might encounter Id collision problems. And yes, I know that you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size then it can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so, as a developer, you have to pay attention to the following facts:
You must cycle your sequence (when the max-value is reached it goes back to 1,2,...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (the row with Id 1 might be newer than the row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen as it is supposed to be an internal Id) supports the 32-bit/64-bit integers that you use to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Why Hibernate prefers/needs surrogate keys ?
As stated in Java Persistence with Hibernate reference:
More experienced Hibernate users use saveOrUpdate() exclusively; it’s
much easier to let Hibernate decide what is new and what is old,
especially in a more complex network of objects with mixed state. The
only (not really serious) disadvantage of exclusive saveOrUpdate() is
that it sometimes can’t guess whether an instance is old or new
without firing a SELECT at the database—for example, when a class is
mapped with a natural composite key and no version or timestamp
property.
Some manifestations of the limitation (This is how, I think, we should call it) can be found here.
Conclusion
Please don't be too rigid in your opinions. Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
I would consider the problem from a design point of view. It's not just if Hibernate considers them good or bad. The real question is: are natural keys good candidates to be good identifiers for my data?
In your business model, it may be convenient today to identify a record by some of its data, but business models evolve over time. When that happens, you'll find that your natural key no longer uniquely identifies your data. And with referential integrity in other tables, this will make things MUCH harder to change.
Having a surrogate PK is convenient because it doesn't tie how your data is identified in your storage to your business model structure.
Natural keys cannot be generated from a sequence, and the case of data which cannot be identified by its own content is much more frequent. This is evidence that natural keys differ from storage keys, and that they cannot be taken as a general (and good) approach.
Using surrogate keys simplifies the design of the application and database. They are easier to use, are more performant, and do a perfect job.
Natural keys bring only disadvantages: I cannot think of a single advantage for using natural keys.
That said, I think hibernate has no real issues with natural (composed) keys. But you'll probably find some problems (or bugs) sometimes, and issues with the documentation or trying to get help, because the hibernate community widely acknowledges the benefits of surrogate keys. So, prepare a good answer for why you did choose a composite key.
If the Hibernate documentation is properly understood:
"There is an alternative <composite-id> declaration that allows access to legacy data with composite keys. Its use is strongly discouraged for anything else."
This appears in section 5.1.4, on the <id> XML tag, which defines the primary key mapping. We can therefore conclude that the Hibernate documentation discourages using <composite-id> in place of the <id> tag for mapping a composite primary key, and does NOT say anything negative about using composite primary keys themselves.
Applications that are developed with the database as a tool definitely benefit from keeping their workflow on surrogate keys, using clustered indices for query optimization.
Special care does need to be made for Data Warehousing and OLAP style systems however, that utilize a massive Fact Table to tie surrogate keys of dimensions together. In this case the data dictates the dashboard/application that can be used to maintain records.
So rather than one method being preferable to the other, perhaps each approach is simply advantageous for a different kind of key construction: you won't easily develop a Hibernate app that harnesses direct access to an SSAS system instance.
I develop using both kinds of keys, and feel that, to implement a solid star or snowflake pattern, a surrogate with a clustered index is typically my first choice.
So, for the OP and others passing by: if you want to stay database-independent in your development (which Hibernate specializes in), use the surrogate method. When data reads tend to slow, or you notice that certain queries drain performance, go back to your specific database and add composite, clustered indices that optimize query order.
Do not confuse a primary key with a unique index. If you use natural keys, you link your key to your business and to business data, and that's not so good. So even if a set of data could be used to define a composite key, it is not recommended.
From my point of view, composite keys are mainly useful when you have an existing schema.
Summary
Really globally unique IDs in flash and/or javascript clients. Can I do this with an RNG available in current browsers/flash or must I build a composite ID with server-side randomness?
Details
I need to generate globally unique identifiers for objects. I have multiple server-side "systems" written in java that need to be able to exchange ids; each of these systems also has a set of flex/javascript clients that actually generate the IDs for new objects. I need to guarantee global uniqueness across the set of unrelated systems; for instance I need to be able to merge/sync the databases of two independent systems. I must guarantee that there is never a collision between these ids and that I never need to change the id of an object once created. I need to be able to generate ids in flash and javascript clients without contacting the server for every id. A solution that relies on some server provided seed or system id is fine as long as the server isn't contacted too often. A solution that works completely disconnected is preferable. Similarly a solution that requires no upfront registration of systems is preferable to one that relies on a central authority (like the OUI in a MAC address).
I know the obvious solution is "use a UUID generator," such as UIDUtil in flash. This function specifically disclaims global uniqueness. In general, I'm concerned about relying on a PRNG to guarantee global uniqueness.
Proposed solutions
Rely entirely on a secure random number generator in the client.
Flash 11+ has flash.crypto.generateRandomBytes; Javascript has window.crypto but it's pretty new and not supported in IE. There are solutions like sjcl that use the mouse to add entropy.
I understand that given a perfect RNG the possibility of collision for a 2^122 random UID is meteorite tiny, but I'm worried that I won't actually get this degree of randomness in a javascript or flash client. I'm further concerned that the typical use case for even a cryptographic RNG is different from mine: for session keys, etc, collisions are acceptable as long as they are unpredictable by an attacker. In my case, collisions are completely unacceptable. Should I really rely on the raw output of a secure RNG for a unique ID?
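To put a number on "meteorite tiny", the usual birthday-bound approximation can be evaluated directly. This sketch assumes a perfect RNG, which is exactly the assumption in doubt here, so it gives a best case rather than a guarantee. The class and method names are made up for the example.

```java
// Hypothetical helper: approximate probability of at least one collision among
// n IDs drawn uniformly at random from a space of 2^randomBits values,
// using p ~= 1 - exp(-n(n-1) / 2^(randomBits+1)).
final class CollisionOdds {
    static double collisionProbability(double n, int randomBits) {
        double pairs = n * (n - 1) / 2.0;          // number of ID pairs that could collide
        double space = Math.pow(2.0, randomBits);  // size of the ID space
        return -Math.expm1(-pairs / space);        // expm1 keeps precision for tiny values
    }

    public static void main(String[] args) {
        // One billion IDs with 122 random bits each.
        System.out.println(collisionProbability(1e9, 122));
    }
}
```

Even for a billion IDs the result is far below one in a billion billion; a weak or badly seeded client-side generator invalidates the calculation, not the size of the ID space.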
Generate a composite ID that includes system, session and object IDs.
An obvious implementation would be to create a system UUID at server install time, keep a per-client-login session id (eg in a database), and then send the system and session ids to the client which would keep a per-session counter. The uid would be the triple: system ID, session ID, client counter.
I could imagine directly concatenating these or hashing them with a cryptographic hash. I'm concerned that the hashing itself may potentially introduce collisions, particularly if the input to the hash is about the same size as the output. But the hash would obscure the system id and counters which could leak information.
Instead of generating the system ID at install time, another solution would be to have a central registry that hands out unique system IDs, kind of like what DOI does. This requires more coordination, but I guess it is the only way to really guarantee global uniqueness.
Key questions
Random or composite based?
Include system ID?
If system id: generate a random system ID or use a central registry?
Include timestamp or some other nonce?
To hash or not to hash?
The simplest answer is to use a server assigned client ID which is incremented for each client, and a value on each client which is incremented for each fragment on that client. The pair of client ID and fragment ID becomes the globally unique ID for that piece of content.
Another simple approach is to generate a set of unique IDs (say 2k at a time) on the server and send them in a batch to each client. When the client runs out of IDs it contacts the server for more.
Client IDs should be stored in a central repository accessible to all the servers.
It may help to look at methods for distributed hashing, which is used to uniquely identify and locate fragments within a peer-to-peer environment. This may be overkill considering you have a server which can intervene to assert uniqueness.
To answer your questions you need to determine the benefit that the added complexity of a system ID, nonce or hash would bring.
System ID:
A system ID would typically be used to uniquely identify the system within a domain. So if you don't care who the user is, or how many sessions are open, but only want to make sure you know who the device is, then use a system ID. This is usually less useful in a user-centric environment such as JavaScript or Flash, where the user or session may be relevant.
Nonce:
A nonce/salt/random seed would be used to obfuscate or otherwise scramble the ID. This is important when you don't want others to be able to guess the original value of the ID. If this is necessary then it may be better to encrypt the ID with a private encryption key, and pass a public decryption key to each consumer who needs to read the ID.
Timestamp: Considering the variability of the client's clock (ie you cannot guarantee it adheres to any time or time zone), a timestamp would need to be treated as a pseudo-random value for this application.
Hash: While hashes are often (ab)used to create unique keys, their real purpose is to map a large (possibly infinite) domain to a smaller, more manageable one. For example, MD5 is typically used to generate a unique ID from a timestamp, random number, and/or nonce data. What is actually happening is that the MD5 function is mapping an infinite range of data into a space of 2^128 possibilities. While this is a massive space, it is not infinite, so logic tells you that there will be (even if only in theory) the same hash assigned to two different fragments. On the other hand perfect hashing attempts to assign a unique identifier to each piece of data, however this is entirely unnecessary if you just assign a unique identifier to each client fragment to start with.
Something quick and dirty and also may not work out for your use case --
Using Java's UUID and coupling that with something like, say, clientName.
This should solve the multiple client and multiple server issue.
The rationale behind this is that the possibility of getting 2 calls in the same nanosecond is low; refer to the links provided below. Now, by coupling the clientName with the UUID you are ensuring unique IDs across clients, and that should leave only the use case of the same client calling twice within the same nanosecond to handle.
You could write a java module to generate the IDs and then get Flash to talk to this module.
For your reference, you could refer to --
Is unique id generation using UUID really unique?
Getting java and flash to talk to each other
A middle ground builds on #ping's answer:
Use client name, high-resolution time, and optionally some other pseudo-random seed
Hash the data to produce the UID (or, just go directly to using UUIDs)
Log the resulting UID to a central server for entry into a database
Treat any collision as a prominently flagged bug, rather than as a situation that deserves special code.
With a UUID or reasonably long hash, the chances of a duplicate are nil. So either:
A) You'll get no duplicates for the life of the application, life is good.
B) You'll see a duplicate, or maybe two (freaky!), over a few decades. Intervene manually to deal with those cases; if you're running servers with your client, you can afford it.
C) If you get a third collision, then there is something fundamentally wrong with the code, and this can be investigated and measures taken to avoid a repetition.
This way, the ID is generated at the client, contacts to the server are one-way and operationally non-critical, the seeds don't have to be random, the hashing obscures the origins of the ID and so avoids constructed collisions, and you can be confident that there have been no collisions. (If you test that collision detection code!) Even UUIDs could be plenty adequate in this scenario.
The only way hashing increases the likelihood of collisions is if your information content in the original seed information approaches the size of the hash. That's extremely unlikely, but if true and you're still thinking about micrometeorites, just increase the size of the hashed value.
My two cents: each server locks a DB table, gets an id from it, and increments it. This will be the unique server id.
Each client connecting will get this id, coupled with a unique identifier issued by the server. This unique key has to be unique for this server, but another server might issue the same id to a different client.
Finally, each client will generate a unique id for each request.
Coupling all three will guarantee a true unique global id over the entire system, the final id will look something like:
[server id][client id][request id]
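One possible way to realize that layout is to pack the three parts into a single 64-bit value. The class name and the bit widths below (15 + 24 + 24 bits) are arbitrary assumptions for illustration; pick widths that match how many servers, clients per server, and requests per client you expect.

```java
// Hypothetical sketch: [server id][client id][request id] packed into one long.
final class GlobalId {
    static final int SERVER_BITS = 15;  // assumed width, keeps the result non-negative
    static final int CLIENT_BITS = 24;  // assumed width
    static final int REQUEST_BITS = 24; // assumed width

    static long pack(int serverId, int clientId, int requestId) {
        check(serverId, SERVER_BITS, "serverId");
        check(clientId, CLIENT_BITS, "clientId");
        check(requestId, REQUEST_BITS, "requestId");
        return ((long) serverId << (CLIENT_BITS + REQUEST_BITS))
                | ((long) clientId << REQUEST_BITS)
                | requestId;
    }

    // Rejects values that would overflow into a neighbouring field.
    private static void check(int value, int bits, String name) {
        if (value < 0 || value >= (1 << bits)) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
    }
}
```

The range checks matter: if any part overflows its field, two different triples can map to the same packed value, which silently breaks the uniqueness argument.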
Can I put a MAX value for the database table primary key, either via JPA or at the database level? If it is not possible, then I was thinking about
Create a random key between 0-9999999999 (9999999999 is my MAX)
Do a SELECT on the database with the newly created key; if the returned object is null, then INSERT; if not, go back to step 1
So if I do the above, two questions. Please keep in mind that the environment is high concurrent:
Q1: Is the overhead of checking with a SELECT and then, if the key is not there, doing an INSERT significant? What I really mean is: is this process normal, since usually I let the DB create a unique PK for me?
Q2: If Q1 does not cause significant performance degradation, can I run into concurrency issues? For example, P1 with Id1 checks the table, Id1 is not there, and it is ready to insert, but P2 sneaks in and inserts Id1 before P1 can. So when P1 inserts Id1, it fails. I don't want the process to fail here; I want it to go back up the loop, find a new id, and repeat the process. How do I do that?
My environment is SQL and MYSQL db. I use JPA with Eclipselink implementation
NOTE: Some people questioned my decision to implement it this way; the answer is exactly what TravisJ suggests below. I have a highly concurrent environment. When a process kicks off, I need to create a request to another process, passing that process a unique 10-character id. Since the environment is highly concurrent, I want to leverage the unique, not-null properties of a PK. The request contains a lot of information, so I created a Request table with the request Id as my PK. Since all DBs index their PKs, I know that querying by PK is fast. If there is a better way, please let me know.
You can implement a Check Constraint in your table definition:
CREATE TABLE P
(
    P_Id bigint PRIMARY KEY NOT NULL,
    ...
    CONSTRAINT chk_P_Id CHECK (P_Id >= 0 AND P_Id <= 9999999999)
)
EDIT: As stated in the comments, MySQL does not honor CHECK constraints. This is a 6-year-old defect in the bug log and the MySQL team has yet to fix it. As MySQL is now overseen by Oracle Corp, it may never be fixed (simply considered a "documented limitation", and people who don't like it can upgrade to the paid DBMS). However, this syntax, and the check constraint feature itself, DO work in Oracle, MS SQL Server (including SQLExpress/MSDE), MS Access, PostgreSQL and SQLite.
Why not start at 1 and use auto-increment? This will be much more efficient because you will not get collisions, which you must cycle through. If you run out of numbers, you will be in the same boat either way, but at least going sequentially, you won't have to deal with collisions.
Imagine trying to find an unused key when you have used up 90% of your available numbers. That will take some time, and there is always a possibility that it never (in your lifetime) finds an unused key if you are generating them randomly.
Also, using auto-increment, it's easy to tell if you're close to the limit (SELECT MAX(col)). You could script an alert to let you know when you need to reset. For the random method, what would that query look like?
If you're using InnoDB, then you still might not want to make the random value the primary key. Inserting random records into a clustered index is a performance hit since the actual table data must be reordered. Instead, put a unique key on it (alongside an auto-increment primary key).
Using a unique index on the column in question, simply generate a random number in the range and attempt to insert it. If the insertion fails, then generate a new number and try again. If the insert succeeds, then proceed. This accounts for the concurrency issue.
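The insert-and-retry loop described above can be sketched as follows. To keep the example self-contained, a concurrent set stands in for the table with its unique index: `add` returning false plays the role of the database rejecting a duplicate key. The class name and the attempt limit are assumptions; with a real database the failed attempt would surface as a constraint-violation exception instead.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: try to insert a random key, retry on a uniqueness conflict.
final class RandomKeyInserter {
    static final long MAX = 9_999_999_999L; // upper bound taken from the question

    // Stand-in for the table's unique index; add() is atomic, like the INSERT.
    private final Set<Long> uniqueIndex = ConcurrentHashMap.newKeySet();

    long insertWithRandomKey(int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long candidate = ThreadLocalRandom.current().nextLong(MAX + 1); // 0..MAX inclusive
            if (uniqueIndex.add(candidate)) {
                return candidate; // "insert" succeeded, the key is ours
            }
            // Another writer got there first: loop and pick a new key.
        }
        throw new IllegalStateException("no free key found after " + maxAttempts + " attempts");
    }
}
```

Note that there is no separate SELECT step: the insert itself is the check, which is what removes the race between P1 and P2 described in the question.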
Still, the sequential auto-increment key is going to yield better performance.
See,
http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing
and,
http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing#Advanced_Sequencing
JPA already has good Id generation support; it does not make sense to implement your own.
If you are concerned about concurrency and performance, and using MySQL, I would recommend using TABLE generator with a large preallocation size (on other databases I would recommend SEQUENCE generator). If you have a lot of data, ensure you use a long for your id.
If you really think you need more than this, then consider UUID id generation. EclipseLink 2.4 will provide a @UuidGenerator.
I am building a distributed application on top of Java and Cassandra. To generate unique sequential 32bit & 64 bit IDs, is an approach like using Flickr's ticket servers to generate primary IDs, a good one? I am particularly excited about this as it can help me reduce the size of the IDs to 32 bits or 64 bits as required, which otherwise may go up to 128 bits with UUIDs. I do not want these IDs to be perfectly sequential, but increasing at least!
Using a single database server may however introduce a single point of failure that was eliminated by Cassandra. However this may be OK for the initial stage of our application. Later we may introduce two servers to alleviate those problems.
Does this sound like a good strategy? In short, we are mixing MYSQL and Cassandra in one application. I know, if mySQL is down for some reason then we can't go ahead with Cassandra alone.
We have looked at other solutions like snowflake; however, they did not perfectly match our requirements.
EDIT : I am seeking advice on whether using MySQL for generating unique primary IDs to key the data/ entities stored in Cassandra database is a good approach. What are the downsides, if any, of an approach like Flickr's Ticket servers?
I'm not a big fan of trying to attach meaning to surrogate keys (which you're trying to do if you want them to increase over time). As you're seeing, it makes your problem of generating keys more complicated. Assuming that you want the keys to increase over time simply so that you can sort data, why not include a timestamp of when the object was created and store that in your data store? This simplifies the key generation significantly and allows you to do pretty much everything you could do with keys that increase over time, with the added bonus of the fact it will be crystal clear to whoever has to maintain your code how objects should be sorted.
In general, you can't have both "always increasing" and "no SPOF and no complex synchronization".
If you want to have several ID-generators which do not have to ask each other on every new ID, each of them really needs a separate ID-pool.
A really simple example is mentioned in the article linked by you, where one server creates odd ones while the other one creates even ones. (You can expand this to more servers trivially.) Of course, then you can't be sure that one server doesn't run ahead of the other, which can lead to a non-increasing sequence like 111, 120, 113, 122, 115, 124 ...
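The odd/even scheme generalizes to any number of servers by giving each one its own offset and using the server count as the step. A minimal sketch, with a class name and constructor parameters that are assumptions for the example:

```java
// Hypothetical sketch: server i of n hands out i, i + n, i + 2n, ...
// so the ID pools of different servers never overlap.
final class StridedIdGenerator {
    private final int stride; // total number of servers
    private long next;        // next ID this server will hand out

    StridedIdGenerator(int serverIndex, int serverCount) {
        if (serverIndex < 0 || serverIndex >= serverCount) {
            throw new IllegalArgumentException("serverIndex must be in [0, serverCount)");
        }
        this.next = serverIndex;
        this.stride = serverCount;
    }

    synchronized long nextId() {
        long id = next;
        next += stride;
        return id;
    }
}
```

Each generator is strictly increasing on its own, but nothing keeps the servers in step with each other, which is where the non-increasing combined sequence comes from.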
If you only want "roughly increasing", you can implement a scheme where each server in some intervals (like each minute or each 10000 IDs) tells the other one(s) his current ID, and the other one then jumps its own ID (only forward) if he hangs back too far. This should be done in a way which does not interrupt the ID-generation, for robustness if the other server is down.
Ah, for the "free bits at the end", simply multiply your ID by some number (the same one each time, and a power of two if you really want "free bits" and not only "space for data"), then add your data (which should be less than number). But of course then you'll run out of ID space quite a bit earlier (by factor number).
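With a power of two as the multiplier, "multiply and add" becomes a shift and an OR, and the data can be read back with a mask. A small sketch; the class and method names are made up, and the number of data bits is whatever you choose to reserve:

```java
// Hypothetical sketch: reserve the lowest dataBits bits of an ID for extra data.
final class TaggedId {
    static long encode(long id, int dataBits, long data) {
        if (dataBits < 1 || dataBits > 62) {
            throw new IllegalArgumentException("dataBits must be between 1 and 62");
        }
        if (data < 0 || data >= (1L << dataBits)) {
            throw new IllegalArgumentException("data does not fit into " + dataBits + " bits");
        }
        if (id < 0 || id > (Long.MAX_VALUE >> dataBits)) {
            throw new IllegalArgumentException("id too large once shifted");
        }
        return (id << dataBits) | data; // same as id * 2^dataBits + data
    }

    static long idOf(long encoded, int dataBits) {
        return encoded >>> dataBits;
    }

    static long dataOf(long encoded, int dataBits) {
        return encoded & ((1L << dataBits) - 1);
    }
}
```

The third check in `encode` is the "running out of ID space earlier" effect: every bit reserved for data halves the number of IDs that still fit.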
I am working on a web application related to Discussion forums using Java and Cassandra database.
I need to construct 'keys' for the rows storing the users' details and for another set of rows storing the content posted by the users.
One option is to use the randomly generated UUIDs provided by Java, but these are 16 bytes long. Since a NoSQL database involves heavy denormalization, I am concerned that I would be wasting lots of disk space, RAM and other resources if the keys could instead be generated in smaller sizes.
I need to generate two types of keys, one for the Users & other for Content Posted by Users.
For the Content posted by users, would timestamp+userId be a good key, where timestamp is the server time at which the content was posted and userId refers to the key of the user row?
Any suggestions, comments appreciated ..
Thanks
Marcos
Is this a distributed application?
If not, you could use a simple synchronized counter and initialize it on startup with the next available id.
On the other hand a database should be able to handle the UUID hashes as created by java.
This is a standard for creating things like sessionIds, that need to be unique.
Your problem is somewhat similar since a session in your context would represent a set of user input.