This is from the official Hibernate tutorial:
There is an alternative <composite-id> declaration that allows access to legacy data with composite keys. Its use is strongly discouraged for anything else.
Why are composite keys discouraged? I am considering using a 3-column table where all of the columns are foreign keys and together form a primary key that is a meaningful relationship in my model. I don't see why this is a bad idea, especially since I will be using an index on them.
What's the alternative? Create an additional automatically generated column and use it as a primary key? I still need to query my 3 columns anyways!?
In short, why is this statement true? and what's the better alternative?
They discourage them for several reasons:
they're cumbersome to use. Each time you need to reference an object (or row), for example in your web application, you need to pass 3 parameters instead of just one.
they're inefficient. Instead of simply hashing an integer, the database needs to hash a composite of 3 columns.
they lead to bugs: developers inevitably implement the equals and hashCode methods of the primary key class incorrectly. Or they make it mutable, and modify their value once stored in a HashSet or HashMap
they pollute the schema. If another table needs to reference this 3-column table, it will need to have 3 columns instead of just one as a foreign key. Now suppose you follow the same design and make this 3-column foreign key part of the primary key of this new table: you'll quickly have a 4-column primary key, and then a 5-column PK in the next table, etc., leading to duplication of data and a dirty schema.
The alternative is to have a single-column, auto-generated primary key, in addition to the other three columns. If you want to make the tuple of three columns unique, then use a unique constraint.
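If a composite key is kept anyway (for a legacy schema, say), the equals/hashCode pitfall listed above is usually avoided by making the key class immutable and basing both methods on all key columns. The following is only a minimal sketch: the class name EnrollmentKey and its three fields are hypothetical and not taken from the question. Hibernate also expects composite identifier classes to be Serializable, which is why the interface is implemented here.

```java
import java.io.Serializable;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical three-column key; class and field names are illustrative only.
final class EnrollmentKey implements Serializable {
    private static final long serialVersionUID = 1L;

    // All fields are final, so the key cannot change after it lands in a HashSet/HashMap.
    private final long studentId;
    private final long courseId;
    private final long termId;

    EnrollmentKey(long studentId, long courseId, long termId) {
        this.studentId = studentId;
        this.courseId = courseId;
        this.termId = termId;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof EnrollmentKey)) {
            return false;
        }
        EnrollmentKey that = (EnrollmentKey) other;
        return studentId == that.studentId
                && courseId == that.courseId
                && termId == that.termId;
    }

    @Override
    public int hashCode() {
        // Uses exactly the same fields as equals().
        return Objects.hash(studentId, courseId, termId);
    }
}

class EnrollmentKeyDemo {
    public static void main(String[] args) {
        Set<EnrollmentKey> seen = new HashSet<>();
        seen.add(new EnrollmentKey(1L, 2L, 3L));
        // A distinct but equal instance is found because equals/hashCode agree.
        System.out.println(seen.contains(new EnrollmentKey(1L, 2L, 3L))); // prints true
        System.out.println(seen.contains(new EnrollmentKey(1L, 2L, 4L))); // prints false
    }
}
```

With the surrogate-key alternative none of this is needed for identity, since the generated id alone identifies the row and the three columns are only covered by a unique constraint.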
Even if it is, maybe, too late to answer your question, I want to give another (more moderate, I hope) point of view on Hibernate's need (is it really advice?) for surrogate keys.
First of all, I want to be clear on the fact that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. I am not trying to say that one key type is better than the other. I am trying to say that depending on your requirements, natural keys might be a better choice than surrogate ones and vice versa.
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the used database engine:
Oracle
MySQL
Natural keys don't exist in real life. Sorry, but they do exist! In the aviation industry, for example, the following tuple will always be unique for a given scheduled flight: (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard, then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns, which means:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo-like algorithms, but those methods still have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company at least) then you might encounter Id collision problems. And yes, I know that you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size then it can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so, as a developer, you have to pay attention to the following facts:
You must cycle your sequence (when the max-value is reached it goes back to 1, 2, ...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (the row with Id 1 might be newer than the row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen as it is supposed to be an internal Id) supports the 32-bit/64-bit integers that you use to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Why Hibernate prefers/needs surrogate keys ?
As stated in Java Persistence with Hibernate reference:
More experienced Hibernate users use saveOrUpdate() exclusively; it’s
much easier to let Hibernate decide what is new and what is old,
especially in a more complex network of objects with mixed state. The
only (not really serious) disadvantage of exclusive saveOrUpdate() is
that it sometimes can’t guess whether an instance is old or new
without firing a SELECT at the database—for example, when a class is
mapped with a natural composite key and no version or timestamp
property.
Some manifestations of the limitation (This is how, I think, we should call it) can be found here.
Conclusion
Please don't be too rigid in your opinions. Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
I would consider the problem from a design point of view. It's not just if Hibernate considers them good or bad. The real question is: are natural keys good candidates to be good identifiers for my data?
In your business model, today it can be convenient to identify a record by some of its data, but business models evolve over time. And when this happens, you'll find that your natural key no longer fits to uniquely identify your data. And with referential integrity in other tables, this will make things MUCH harder to change.
Having a surrogate PK is convenient because it doesn't tie how your data is identified in your storage to your business model structure.
Natural keys cannot be generated from a sequence, and the case of data which cannot be identified by its own content is much more frequent. This is evidence that natural keys differ from storage keys, and they cannot be taken as a general (and good) approach.
Using surrogate keys simplifies the design of the application and database. They are easier to use, are more performant, and do a perfect job.
Natural keys bring only disadvantages: I cannot think of a single advantage for using natural keys.
That said, I think Hibernate has no real issues with natural (composite) keys. But you'll probably find some problems (or bugs) sometimes, and issues with the documentation or when trying to get help, because the Hibernate community widely acknowledges the benefits of surrogate keys. So, prepare a good answer for why you chose a composite key.
If Hibernate documentation is properly understood:
"There is an alternative <composite-id> declaration that Allows access to legacy data with composite keys. Its use is strongly discouraged for anything else."
This appears in section 5.1.4, on the <id> XML tag that defines the primary key mapping. From it we can conclude that the Hibernate documentation discourages using <composite-id> instead of the <id> XML tag for composite primary key mapping; it does NOT say anything negative about using composite primary keys themselves.
Applications that treat the database as a tool definitely benefit from keeping their workflow on surrogate keys, using clustered indices for query optimization.
Special care does need to be taken for data warehousing and OLAP-style systems, however, which use a massive fact table to tie the surrogate keys of dimensions together. In this case the data dictates the dashboard/application that can be used to maintain records.
So instead of one method being preferable to the other, perhaps each approach to key construction is advantageous in its own setting: you won't easily develop a Hibernate app that directly accesses an SSAS instance.
I develop using both kinds of keys, and I feel that to implement a solid star or snowflake pattern, a surrogate key with a clustered index is typically my first choice.
So, for the OP and others passing by: if you want to stay database-independent in your development (which Hibernate specializes in), use the surrogate method, and when data reads tend to slow, or you notice certain queries drain performance, go back to your specific database and add composite, clustered indices that optimize query order.
Do not confuse a primary key with a unique index. If you use natural keys, you link your key to your business, to business data, and that's not so good. So, even if a set of data could be used to define a composite key, it is not recommended.
From my point of view, composite keys are mainly useful when you have an existing schema.
Related
Forgive me for what is probably a stupid or obvious question - I'm new to databases.
I'm planning to store file path links to on-disk media files in a Derby database from java but I'm curious about the best way to set up the tables.
Just to clarify I do not intend to store the actual media in the database, only file paths.
The table will contain in the order of 10k-100k rows.
I believe that the file path should be the primary key as it uniquely identifies each media file.
What are the best options for setting up a table with file paths and to be able to efficiently search (mostly for a substring in the filename, but also for media attributes)?
I am planning to use VARCHAR(4096) as maximum linux path length is 4096 characters.
Are there any pros or cons in creating a table in this way, with an index on what could be quite a long VARCHAR column? How do you suggest I should design the tables?
Thanks!
Disclaimer: This is a very personal opinion and probably many people will disagree.
You are considering using a "natural key", and I'm against using them. A natural key is an existing property of an object that identifies it uniquely... until it doesn't.
It's like my full name, or my identity number in my country. Those properties seem to be unique, but the problem is that they are not stable. They are existing, known properties that are visible; this visibility makes them vulnerable to change. This means they will change over time. Will I be the same person if I change my name?
Also a key is usually used to be somewhat linked to other tables. A big PK is not great for that. But this is more of a practical issue.
I would recommend you to use a simple INT or BIGINT as the primary key and add a UNIQUE constraint to the path property. This way your model would be more flexible. If the media is moved to another path, you just will need to update a single value in the table; if the path were the PK then you would need to update all foreign keys related to it.
Do not use a long character string as a primary key.
Use a synthetic primary key.
Here are some reasons:
One important purpose of primary keys is to support foreign keys. You don't want to have 4k strings lurking all over your database, when you could just have a 4 byte integer.
Another important reason for primary keys is to uniquely find each row. Most people I know don't want to have to type 4k characters to identify a row. I type fast and that would take time for me. And I'm sure I would make a typo somewhere along the way.
Two strings might only differ in, say, the 2017th character. I wouldn't want to have to figure out they are different, especially if the character is a 1 versus l or O versus 0.
Define an auto-incrementing/identity/serial primary key. You can always declare the path as unique so it is not duplicated (although some databases may not allow such a long key in an index).
In one of the components of my application I have to deal with data which comes from a set of csv files (where content is added and updated over time) and store it in the db, in a very plain table, with Hibernate. The data itself lacks a unique value suitable for use as a primary key, but uniqueness could be achieved with a composite key instead. I'd like to get rid of the composite key, which was used as a temporary solution. Is there an idiomatic way to do this while keeping the entity objects behaving the same way?
The first thing which came to my mind is some hash based solution, but I'm not sure about it.
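One way to picture the hash-based idea is to derive a single deterministic key from the fields that currently make up the composite key. This is only a sketch under assumptions: the class CsvRowKeys, the method keyFor, and the sample field values are hypothetical, and the fields are assumed to be non-null strings. It relies on java.util.UUID.nameUUIDFromBytes, which produces a type 3 (MD5-based) UUID.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper; not part of Hibernate or of the original question.
final class CsvRowKeys {
    private CsvRowKeys() {
    }

    // Derives a deterministic key from the fields that formed the composite key.
    // Each field is length-prefixed so ("ab", "c") and ("a", "bc") map to
    // different canonical strings. Fields are assumed to be non-null.
    static UUID keyFor(String... fields) {
        StringBuilder canonical = new StringBuilder();
        for (String field : fields) {
            canonical.append(field.length()).append(':').append(field);
        }
        byte[] bytes = canonical.toString().getBytes(StandardCharsets.UTF_8);
        // Same input bytes always give the same UUID.
        return UUID.nameUUIDFromBytes(bytes);
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(keyFor("2024-01-15", "sensor-7", "42"));
    }
}
```

Note that such a key inherits the main weakness of the natural key it is derived from: if one of the underlying fields is updated, the derived key changes too. A hash collision is extremely unlikely but not impossible, so a unique constraint on the original columns is still worth keeping.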
What is the recommended way in Java to add prefixes to String keys to be stored in database of a web application?
I have an EntityId per Entity but I want to store different kinds of data for an Entity, in different rows distinguished by prefixed EntityId keys like this format:
EntityId | PrefixForThisDataCategory
In general, databases resist that kind of thing. They're happy to store "prefix" characters in a separate column, though. If that column needs to be part of a composite key, they're happy to do that, too.
But if you want to store different kinds of data in different rows of the same table, I hope I can discourage you. Databases--SQL databases, that is--are designed to keep different kinds of data in different tables. People in one table, addresses in another table . . . not addresses in some rows of a table of people.
Of course, you might be aiming at something completely different.
If I understand your question correctly you basically want to store data representing different sets of information in a common table and use one or more fields to differentiate what type of data has been stored.
THIS IS A REALLY BAD IDEA ! - had to say it :-)
I can tell you from experience on many projects that storing data in this way always leads to problems and really messy code. My strong recommendation would be to store the data in separate tables.
There is one variation I can think of however, that is similar to your request, that is derived class mapping in hibernate. There hibernate maps sets of data into tables based on the class that is being stored. This is only for mapping hierarchies of classes and is controlled by hibernate so that you don't have to worry about it.
I came across a question recently that was for "Generating primary key in a clustered environment of 5 App-Servers - [OAS Version 10] without using database".
Usually we generate PK by a DB sequence, or storing the values in a database table and then using a SP to generate the new PK value...However current requirement is to generate primary key for my application without referencing the database using JDK 1.4.
Need expert's help to arrive on better ways to handle this.
Thanks,
Use a UUID as your primary key and generate it client-side.
Edit:
Since your comment I felt I should expand on why this is a good way to do things.
Although sequential primary keys are the most common in databases, using a randomly generated primary key is frequently the best choice for distributed databases or (particularly) databases that support a "disconnected" user interface, i.e. a UI where the user is not continuously connected to the database at all times.
UUIDs are the best form of randomly generated key since they are, for all practical purposes, unique; the likelihood of the same UUID being generated twice is so extremely low as to be almost completely impossible. UUIDs are also ubiquitous; nearly every platform has support for the generation of them built in, and for those that don't there's almost always a third-party library to take up the slack.
The biggest benefit to using a randomly generated primary key is that you can build many complex data relationships (with primary and foreign keys) on the client side and (when you're ready to save, for example) simply dump everything to the database in a single bulk insert without having to rely on post-insert steps to obtain the key for later relationship inserts.
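A small sketch of that benefit, assuming Java 5 or later (java.util.UUID does not exist in JDK 1.4, so the original poster would need a backport or a third-party library). The OrderRow and OrderLineRow classes are hypothetical and no database access is shown; the point is only that the foreign key is known before anything is inserted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical parent row; names are illustrative only.
final class OrderRow {
    final UUID id = UUID.randomUUID(); // primary key assigned on the client
    final List<OrderLineRow> lines = new ArrayList<>();

    OrderLineRow addLine(String product) {
        // The child can reference the parent immediately, with no round-trip.
        OrderLineRow line = new OrderLineRow(id, product);
        lines.add(line);
        return line;
    }
}

// Hypothetical child row.
final class OrderLineRow {
    final UUID id = UUID.randomUUID(); // primary key assigned on the client
    final UUID orderId;                // foreign key, known before any insert
    final String product;

    OrderLineRow(UUID orderId, String product) {
        this.orderId = orderId;
        this.product = product;
    }
}

class ClientSideKeysDemo {
    public static void main(String[] args) {
        OrderRow order = new OrderRow();
        order.addLine("widget");
        order.addLine("gadget");
        // At this point the whole graph could be written in one batch.
        for (OrderLineRow line : order.lines) {
            System.out.println(line.id + " -> " + line.orderId);
        }
    }
}
```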
On the con side, UUIDs are 16 bytes rather than a standard 4-byte int -- 4 times the space. Is that really an issue these days? I'd say not, but I know some who would argue otherwise. The only real performance concern when it comes to UUIDs is indexing, specifically clustered indexing. I'm going to wander into the SQL Server world, since I don't develop against Oracle all that often and that's my current comfort zone, and talk about the fact that SQL Server will by default create a clustered index across all fields on the primary key of a table. This works fairly well in the auto-increment int world, and provides for some good performance for key-based lookups. Any DBA worth his salt, however, will cluster differently, but folks who don't pay attention to that clustering and who also use UUIDs (GUIDs in the Microsoft world) tend to get some nasty slowdowns on insert-heavy databases, because the clustered index has to be recomputed every insert and if it's clustered against a UUID, which could put the new key in the middle of the clustered sequence, a lot of data could potentially need to be rearranged to maintain the clustered index. This may or may not be an issue in the Oracle world -- I just don't know if Oracle PKs are clustered by default like they are in SQL Server.
If that run-on sentence was too hard to follow, just remember this: if you use a UUID as your primary key, do not cluster on that key!
You may find it helpful to look up UUID generation.
In the simple case, one program running one thread on each machine, you can do something such as
MAC address + time in nanoseconds since 1970.
If you cannot use database at all, GUID/UUID is the only reliable way to go. However, if you can use database occasionally, try HiLo algorithm.
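The hi/lo idea can be sketched as follows. This is an illustrative sketch, not Hibernate's implementation: the class HiLoGenerator is hypothetical, the LongSupplier stands in for the occasional database call that hands out the next "hi" value, and it uses Java 8 APIs, so it would need adapting for JDK 1.4.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Minimal hi/lo sketch. Each generator reserves a block of ids with a single
// call to nextHi and then hands them out locally.
final class HiLoGenerator {
    private final LongSupplier nextHi; // assumption: backed by a shared counter or sequence
    private final int blockSize;
    private long hi;
    private int lo;

    HiLoGenerator(LongSupplier nextHi, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.nextHi = nextHi;
        this.blockSize = blockSize;
        this.lo = blockSize; // forces a fetch on first use
    }

    synchronized long nextId() {
        if (lo >= blockSize) {
            hi = nextHi.getAsLong(); // one round-trip per block of ids
            lo = 0;
        }
        return hi * blockSize + lo++;
    }
}

class HiLoDemo {
    public static void main(String[] args) {
        // An in-memory counter plays the role of the shared database sequence.
        AtomicLong fakeDatabase = new AtomicLong();
        HiLoGenerator serverA = new HiLoGenerator(fakeDatabase::getAndIncrement, 3);
        HiLoGenerator serverB = new HiLoGenerator(fakeDatabase::getAndIncrement, 3);
        System.out.println(serverA.nextId()); // prints 0
        System.out.println(serverB.nextId()); // prints 3
        System.out.println(serverA.nextId()); // prints 1
    }
}
```

Each application server only touches the shared counter once per block, so the ids stay unique across servers as long as every server uses the same block size and the same source of hi values.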
You should consider using ids in the form of UUIDs. Java 5 has a class for representing them, java.util.UUID, whose static randomUUID() method generates them. You can backport that code to your antiquated Java 1.4 in order to have the identifiers you require.
Take a look at these strategies used by Hibernate (section 5.1.5 in the link). You will surely find it useful.
It explains several methods and their pros and cons, also stating whether they are safe in a clustered environment.
Best of all, there is available code that already implements it for you :)
If it fits your application, you can use a larger string key coupled with a UUID() function or SHA1(of random data).
For sequential int's, I'll leave that to another poster.
You can generate a key based on the combination of below three things
The IP address or MAC address of machine
Current time
An incremental counter on each instance (to ensure same key does not get generated twice on one machine as time may appear same in two immediate key creations because of underlying time precision)
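The three ingredients above could be combined roughly like this. It is a sketch under assumptions: the class NodeKeyGenerator is hypothetical, the node identifier is passed in by the caller rather than read from the network interface, and java.util.concurrent is Java 5+, so on JDK 1.4 a synchronized counter would be needed instead.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical generator combining node identity, current time and a counter.
final class NodeKeyGenerator {
    private final String nodeId; // e.g. the machine's IP or MAC address, supplied by the caller
    private final AtomicLong counter = new AtomicLong();

    NodeKeyGenerator(String nodeId) {
        this.nodeId = nodeId;
    }

    String nextKey() {
        long now = System.currentTimeMillis();
        // The counter separates keys created within the same millisecond.
        long sequence = counter.getAndIncrement();
        return nodeId + "-" + now + "-" + sequence;
    }
}

class NodeKeyDemo {
    public static void main(String[] args) {
        // "10.0.0.5" is a made-up node identifier.
        NodeKeyGenerator generator = new NodeKeyGenerator("10.0.0.5");
        System.out.println(generator.nextKey());
        System.out.println(generator.nextKey());
    }
}
```

The counter restarts at zero when the process restarts, so this scheme relies on the clock not repeating an earlier value, and two JVMs on the same machine would need distinct node identifiers.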
By using a Statement object, you can call the statement.getGeneratedKeys() method to retrieve the auto-generated key(s) produced by executing that Statement.
Java doc
Here is how it's done in MongoDB: http://www.mongodb.org/display/DOCS/Object+IDs
They include a timestamp.
But you can also install Oracle Express and select sequences, you can select in bulk:
SQL> select mysequence.nextval from dual connect by level <= 20;
NEXTVAL
1
2
3
4
5
..
20
Why are you not allowed to use the database? Money (Oracle express is free) or single point of failure? Or do you want to support other databases than Oracle in the future?
It's shipped out of the box in many Spring-based applications like Hybris:
The typeCode is the name of your table, like User, Address, etc.
private PK generatePkForCode(final String typeCode)
{
    final TypeInfoMap persistenceInfo = Registry.getCurrentTenant().getPersistenceManager().getPersistenceInfo(typeCode);
    return PK.createCounterPK(persistenceInfo.getItemTypeCode());
}
I have a certain object type that is stored in a database. This type now gets additional information associated with it which will differ in structure among instances. Although for groups of instances the information will be identically structured, the structure will only be known at runtime and will change over time.
I decided to just add a blob field to the table and store the key/value pairs there in some serialized format. From your experience, what format is most advisable?
In the context of my application, the storage space for this is secondary. There's one particular operation that I want to be fast, which is looking up the correct instance for a given set of key / value pairs (so it's a kind of variable-field composite key). I guess that means, is there a format that plays particularly well with typical database indexing?
Additionally, I might be interested in looking for a set of instances that share the same set of keys (an adhoc "class", if you wish).
I'm writing this in Java and I'm storing in various types of SQL databases. I've got JSON, GPB and native Java serialization on my radar, favouring the cross-language formats. I can think of two basic strategies:
store the set of values in the table and add a foreign key to a separate table that contains the set of keys
store the key/value pairs in the table
If your goal is to take advantage of database indexes, storing the unstructured data in a BLOB is not going to be effective. BLOBs are essentially opaque from the RDBMS's perspective.
I gather from your description that the unstructured part of the data takes the form of an arbitrary set of key-value pairs associated with the object, right? Well, if the types of all keys are the same (e.g. they're all strings), I'd recommend simply creating a child table with (at least) three columns: the key, the value, and a foreign key to the parent object's row in its table. Since the keys will then be stored in the database as a regular column, they can be indexed effectively. The index should also include the foreign key to the parent table.
A completely different approach would be to look at a "schemaless" database engine like CouchDB, which is specifically designed to deal with unstructured data. I have zero experience with such systems and I don't know how well the rest of your application would lend itself to this alternative storage strategy, but it might be worth looking into.
Not really an answer to your question, but have you considered looking at the Java Edition of BerkeleyDB? Duplicate keys and serialized values can be stored with this (fast) engine.