For some reason I need to obtain the UUID before storing the row in the database. Class java.util.UUID can be used for that. But is it safe to use these generated ids as primary keys in the database, or should the UUID be generated by the DB only?
Note
MySQL is the actual database in use, but I do not think that affects the answers.
It really shouldn't make any difference where the UUIDs are generated, as long as they are unique. There isn't anything special about MySQL's built-in UUID() function.
The problem, however, is with UUIDs in general. In InnoDB (which is what you should be using), the primary key is the clustered index... which means rows are physically stored in primary key order... which means you have a performance penalty to consider any time rows are not inserted into a table in primary key order. You will have a significant number of page splits and a significant amount of fragmentation in your tables.
And, clearly, if you generate several UUIDs in succession, it should be readily apparent that they are not lexically sequential.
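For illustration, here is a quick check with java.util.UUID (random, version-4 UUIDs); running it makes the non-sequential ordering obvious:

import java.util.UUID;

public class UuidOrderDemo {
    public static void main(String[] args) {
        // Version-4 UUIDs are random, so consecutive values share no ordering
        // relationship and land all over the key space.
        for (int i = 0; i < 5; i++) {
            System.out.println(UUID.randomUUID());
        }
    }
}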
Additionally, particularly if you store a UUID as a 36-character CHAR or VARCHAR, your joins will be on 36-byte values, which brings its own potential performance issues -- in contrast with an INT, which is only 4 bytes, or a BIGINT, which is 8. Foreign key constraint checking will also have to use the larger values.
An AUTO_INCREMENT primary key solves both issues, because rows are, by definition, inserted in primary key order, and the keys are comprised of fewer bytes, which should mean better join performance.
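If you go the AUTO_INCREMENT route and still need the generated key back in Java, JDBC can hand it to you after the insert. A minimal sketch (connection URL, table, and column names are placeholders):

import java.sql.*;

public class AutoIncrementDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical table: CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100))
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO users (name) VALUES (?)",
                     Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, "alice");
            ps.executeUpdate();
            // The key is assigned by the database in insertion order; read it back if the app needs it.
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (keys.next()) {
                    System.out.println("generated id = " + keys.getLong(1));
                }
            }
        }
    }
}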
Will the performance be horrible? No, but it won't be optimal.
However, to answer the question, it should not matter how or where the UUIDs are generated. One of the motivations, in fact, for UUIDs, is the very fact that they should be unique, regardless of source.
Related
I have specified the sequence generation strategy as IDENTITY on my entity class for the primary key of a table on an AS400 system.
@Id
@GeneratedValue(strategy=GenerationType.IDENTITY)
@Column(name = "SEQNO")
private Integer seqNo;
The table's primary key column is defined as GENERATED ALWAYS AS IDENTITY in database.
SEQNO BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY(START WITH 1, INCREMENT BY 1)
My understanding of IDENTITY strategy is that it will leave the primary key generation responsibility to the table itself.
The problem I am facing is that somehow, in one environment, inserting a record into the table gives me [SQL0803] Duplicate Key value specified.
Now there are a couple of questions in my mind:
Is my understanding of @GeneratedValue(strategy=GenerationType.IDENTITY) correct?
In which scenario will the table generate a duplicate key?
I figured out that there are sequence values missing from the table, i.e. after 4 the sequence up to 20 is missing, and I do not know whether someone deleted those rows manually, but could this be related to the duplicate key generation?
YES. IDENTITY means use in-datastore features like "AUTO_INCREMENT", "SERIAL", "IDENTITY". So any INSERT should omit the IDENTITY column, and will pull the value back (into memory, for that object) after the INSERT is executed.
Should never get a duplicate key. Check the INSERT statement being used.
Some external process using the same table? Use the logs to see SQL and work it out.
I don't use JPA, but what you have seems reasonable to me.
As far as the DB2 for i side...
Are you sure you're getting the duplicate key error on the identity column? Are there no other columns defined as unique?
It is possible to have a duplicate key error on an identity column.
What you need to realize is that the next identity value is stored in the table object; not calculated on the fly. When I started using Identities, I got bit by a CMS package that routinely used CPYF to move data between newly created versions of a table. The new version of the table would have a next identity value of 1, even though there might be 100K records in it. (the package has since gotten smarter :) But the point remains that CPYF for instance, doesn't play nice with identity columns.
Additionally, it is possible to override the GENERATED ALWAYS via the OVERRIDING SYSTEM VALUE or OVERRIDING USER VALUE clauses of the INSERT statement. But inserting with an override has no effect on the stored next identity value. I suppose one could consider CPYF as using OVERRIDING SYSTEM VALUE.
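For reference, a sketch of what such an override looks like from JDBC on DB2 for i (the connection URL, table, and columns here are made up); note that the stored next-identity value is not advanced by this kind of insert:

import java.sql.*;

public class OverrideIdentityDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical table MYTABLE where SEQNO is GENERATED ALWAYS AS IDENTITY.
        try (Connection con = DriverManager.getConnection("jdbc:as400://mysystem/MYLIB", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     // OVERRIDING SYSTEM VALUE lets the INSERT supply its own SEQNO,
                     // but the table's stored next identity value is unaffected.
                     "INSERT INTO MYTABLE (SEQNO, NAME) OVERRIDING SYSTEM VALUE VALUES (?, ?)")) {
            ps.setLong(1, 42L);
            ps.setString(2, "copied row");
            ps.executeUpdate();
        }
    }
}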
Now, as far as your missing identities...
1. Data was deleted
2. Data was copied in with overridden identities
3. Somebody did ALTER TABLE <...> ALTER COLUMN <...> RESTART WITH
4. You lost the use of some values
Let me explain #4. For performance reasons, DB2 for i by default will cache 20 identity values for a process to use. So if you have two processes adding records, one will get values 1-20 and the other 21-40. This allows both processes to insert concurrently. However, if process 1 only inserts 10 records, then identity values 11-20 will be lost. If you absolutely must have continuous identity values, then specify NO CACHE during the creation of the identity.
create table test (
  myid int generated always as identity
    (start with 1, increment by 1, no cache)
)
Finally, with respect to the caching of identity values. While confirming a few things for this answer, I noticed that the use of ALTER TABLE to add a new column seemed to cause a loss of the cached values. I inserted 3 rows, did the alter table and the next row got an identity value of 21.
In my PostgreSQL database, I have a bigint column and its base36 conversion.
I have URLs containing the short_id in base36, as well as the decimal version, but I will query according to the URL, so by the base36 value.
Can the base36 short_id be my primary key, for better performance?
At least in PostgreSQL, the PRIMARY KEY isn't about performance. It's about correctness and data structure. The query planner doesn't care about the PRIMARY KEY, only about the UNIQUE NOT NULL index that's created by defining the PRIMARY KEY constraint.
You can define whatever indexes you like. Want another unique index? Just create one.
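As a small, hedged example (the links table and short_id column are made-up names), adding such an index from Java is just a DDL statement:

import java.sql.*;

public class ExtraIndexDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical table "links" with columns id (bigint primary key) and short_id (base36 text).
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/test", "user", "pass");
             Statement st = con.createStatement()) {
            // A plain UNIQUE index on short_id gives the planner the same lookup power
            // for short_id queries as a primary key would, without changing the table's PK.
            st.execute("CREATE UNIQUE INDEX links_short_id_idx ON links (short_id)");
        }
    }
}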
If the base36 column is guaranteed unique then yes, it is a candidate for a primary key. Whether it's the best choice, and whether it's actually any faster than whatever you're currently doing, is somewhat app dependent.
Note that extra indexes aren't free - they do incur a cost for inserts and updates. So don't go crazy creating multiple indexes on every column for write-heavy tables.
BTW, some other database systems do have stronger performance implications for the PRIMARY KEY. In particular, on DB systems that use index-organized tables (where the main table is in a b-tree structure) the choice of the clustering key - usually also the primary key - is a big thing for performance.
In PostgreSQL every table is just a heap, so it's not relevant.
I am modelling a Cassandra schema to get a bit more familiar with the subject, and was wondering what the best practice is regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or I can create an entirely new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect creating a new table to be faster, because groupid then becomes the partition key. But I'm not sure what "magic" happens when creating an index, and whether this magic has a downside.
In my opinion, your first approach is correct.
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is more or less unique, a good candidate for the primary key, and 2) multiple emails can belong to the same group, a good candidate for a secondary index. Please refer to this post - Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example, if you have a use case like:
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like:
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case all the email groups which the user joined will be clustered on disk (stored together on disk).
It depends on the cardinality of groupid. From the Cassandra docs:
When not to use an index
Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Naturally, there is no support for counter columns, in which every value is distinct.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check that out if you get a chance.
I'm looking to create a table for users and tracking their objectives. The objectives themselves would be on the order of 100s, if not 1000s, and would be maintained in their own table, but it wouldn't know who completed them - it would only define what objectives are available.
Objective:
ID | Name | Notes |
----+---------+---------+
| | |
Now, in the Java environment, the users will have a java.util.BitSet for the objectives. So I can go
/* in class User */
boolean hasCompletedObjective(int objectiveNum) {
if (objectiveNum < 0 || objectiveNum > objectives.length())
    throw new IllegalArgumentException("Objective " + objectiveNum + " is invalid. Use a constant from class Objective.");
return objectives.get(objectiveNum);
}
I know internally, the BitSet uses a long[] to do its storage. What would be the best way to represent this in my Derby database? I'd prefer to keep it in columns on the AppUser table if at all possible, because they really are elements of the user.
Derby does not support arrays (to my knowledge), and while I'm not sure what the column limit is, something seems wrong with having 1000 columns, especially since I know I will not be querying the database with things like
SELECT *
FROM AppUser
WHERE AppUser.ObjectiveXYZ
What are my options, both for storing it, and marshaling it into the BitSet?
Are there viable alternatives to java.util.BitSet?
Is there a flaw in the general approach? I'm open to ideas!
Thanks!
EDIT: If at all possible, I would like the ability to add more objectives with only a data modification, not a table modification. But again, I'm open to ideas!
[puts on fake moustache]
Store the bitset as a BLOB. Start by simply serializing it; then, if you want more space-efficiency, try pushing the results through a DeflaterOutputStream on their way to the database. For better space- and time-efficiency, try the bitmap compression method used in FastBit, which breaks the bitset into 31-bit chunks, run-length encodes all-zero chunks, and packs the literal and run chunks into 32-bit words along with a discriminator bit.
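A minimal sketch of that first step (plain Java serialization pushed through a Deflater; the class and method names are just for illustration):

import java.io.*;
import java.util.BitSet;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BitSetBlobCodec {

    // Serialize the BitSet and deflate it; the resulting bytes can be bound
    // to a BLOB column with PreparedStatement.setBytes(...).
    static byte[] toBlobBytes(BitSet objectives) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new DeflaterOutputStream(buf))) {
            out.writeObject(objectives);
        }
        return buf.toByteArray();
    }

    // Reverse the process when reading the column back out of the ResultSet.
    static BitSet fromBlobBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(new ByteArrayInputStream(bytes)))) {
            return (BitSet) in.readObject();
        }
    }
}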
If you know you'll only look at the objective bitset while the ResultSet that brought it from the database is still open, write a new bitset class that wraps the Blob interface and implements get on top of getBytes. This avoids having to read the whole BLOB into memory to check a few specific bits, and at least avoids having to allocate a separate buffer for the bitset if you do want to look at all the values. Note that making this work with a compressed bitset will take substantial ingenuity.
Be aware that this approach gives you no referential integrity, and no ability to query on the user-objective relationship, little flexibility for different uses of the data in future, and is exactly the kind of thing that Don Knuth warned you about.
The orthodox way to do this does not involve bitsets at all. You have a table for users, a table for objectives, and a join table, indicating which objectives a user has. Something like:
create table users (
id integer primary key,
name varchar(100) not null
);
create table objectives (
id integer primary key,
name varchar(100) not null
);
create table user_objective (
user_id integer not null references users,
objective_id integer not null references objectives,
primary key (user_id, objective_id)
);
Whenever a user has an objective, you put a row in the join table indicating the fact.
If you want to get the results into a bitset for a user, do an outer join of the user onto the objectives table via the join table, such that you get a row back for every objective, which has a single column with, say, a 1 for each joined objective, or 0 if there was no join.
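A sketch of that query and the BitSet it feeds, assuming the schema above (the CASE expression and alias are one way to do it, not the only way):

import java.sql.*;
import java.util.BitSet;

public class ObjectiveBitSetLoader {

    // Load one user's objectives into a BitSet, one bit per objective id.
    static BitSet loadObjectives(Connection con, int userId) throws SQLException {
        String sql =
            "SELECT o.id, " +
            "       CASE WHEN uo.user_id IS NULL THEN 0 ELSE 1 END AS has_it " +
            "FROM objectives o " +
            "LEFT OUTER JOIN user_objective uo " +
            "  ON uo.objective_id = o.id AND uo.user_id = ? " +
            "ORDER BY o.id";
        BitSet objectives = new BitSet();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    if (rs.getInt("has_it") == 1) {
                        objectives.set(rs.getInt("id"));
                    }
                }
            }
        }
        return objectives;
    }
}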
The orthodox approach would also be to use a Set on the Java side, rather than a bitset. That maps very nicely onto the join table. Have you considered doing it this way?
If you're worried about memory consumption, a set will use about one pointer per objective a user actually has; the bitset will use a bit per possible objective. Most JVMs have 32-bit pointers (only old or huge-heaped 64-bit JVMs have 64-bit pointers), so if each user has on average less than 1/32nd of the possible objectives, the set will use less memory. There are some groovy data structures which will be able to store this information more compactly than either of those structures, but let's leave that to another question.
I wish to store UUIDs created using java.util.UUID in a HSQLDB database.
The obvious option is to simply store them as strings (in the code they will probably just be treated as such), i.e. varchar(36).
What other options should I consider, with regard to issues such as database size and query speed? (Neither is a huge concern given the volume of data involved, but I would like to at least consider them.)
HSQLDB has a built-in UUID type. Use that:
CREATE TABLE t (
id UUID PRIMARY KEY
);
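From JDBC, recent HSQLDB versions map the SQL UUID type to java.util.UUID, so (assuming such a version) you can bind and read the value directly; a rough sketch:

import java.sql.*;
import java.util.UUID;

public class HsqldbUuidDemo {
    public static void main(String[] args) throws SQLException {
        UUID id = UUID.randomUUID();
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:test", "SA", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE t (id UUID PRIMARY KEY)");
            try (PreparedStatement ps = con.prepareStatement("INSERT INTO t (id) VALUES (?)")) {
                ps.setObject(1, id);   // assumes the driver accepts java.util.UUID for UUID columns
                ps.executeUpdate();
            }
            try (PreparedStatement ps = con.prepareStatement("SELECT id FROM t WHERE id = ?")) {
                ps.setObject(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getObject(1));
                    }
                }
            }
        }
    }
}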
You have a few options:
Store it as a VARCHAR(36), as you already have suggested. This will take 36 bytes (288 bits) of storage per UUID, not counting overhead.
Store each UUID in two BIGINT columns, one for the least-significant bits and one for the most-significant bits; use UUID#getLeastSignificantBits() and UUID#getMostSignificantBits() to grab each part and store it appropriately. This will take 128 bits of storage per UUID, not counting any overhead. (See the sketch after this list.)
Store each UUID as an OBJECT; this stores it as the binary serialized version of the UUID class. I have no idea how much space this takes up; I'd have to run a test to see what the default serialized form of a Java UUID is.
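To illustrate option 2, the round trip between a UUID and the two BIGINT halves is just a couple of method calls (the class and method names here are only for illustration):

import java.util.UUID;

public class UuidBigintCodec {

    // Split a UUID into the two long values you would store in the two BIGINT columns.
    static long[] toLongs(UUID uuid) {
        return new long[] { uuid.getMostSignificantBits(), uuid.getLeastSignificantBits() };
    }

    // Rebuild the UUID from the two stored columns.
    static UUID fromLongs(long msb, long lsb) {
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        long[] halves = toLongs(original);
        System.out.println(original.equals(fromLongs(halves[0], halves[1]))); // true
    }
}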
The upsides and downsides of each approach are based on how you're passing the UUIDs around your app -- if you're passing them around as their string equivalents, then the downside of requiring double the storage capacity for the VARCHAR(36) approach is probably outweighed by not having to convert them each time you do a DB query or update. If you're passing them around as native UUIDs, then the BIGINT method probably is pretty low-overhead.
Oh, and it's nice that you're looking to consider speed and storage space issues, but as many better than me have said, it's also good that you recognize that these might not be critically important given the amount of data your app will be storing and maintaining. As always, micro-optimization for the sake of performance is only important if not doing so leads to unacceptable cost or performance. Otherwise, these two issues -- the storage space of the UUIDs, and the time it takes to maintain and query them in the DB -- are reasonably low-importance given the cheap cost of storage and the ability of DB indices to make your life much easier. :)
I would recommend char(36) instead of varchar(36). Not sure about HSQLDB, but in many DBMSs char is a little faster.
For lookups, if the DBMS is smart, then you can use an integer value to "get closer" to your UUID.
For example, add an int column to your table as well as the char(36). When you insert into your table, insert the uuid.hashCode() into the int column. Then your searches can look like this:
WHERE intCol = ? and uuid = ?
As I said, if HSQLDB is smart like MySQL or SQL Server, it will narrow the search by intCol and then only compare at most a few values by the uuid. We use this trick to search through million+ record tables by string, and it is essentially as fast as an integer lookup.
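A hedged sketch of the trick (the items table and its columns are invented for the example; the int column should be indexed, or be the leading column of a composite index with the uuid column):

import java.sql.*;
import java.util.UUID;

public class UuidHashLookupDemo {

    // Hypothetical table: CREATE TABLE items (intCol INT, uuid CHAR(36))
    static void insert(Connection con, UUID id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO items (intCol, uuid) VALUES (?, ?)")) {
            ps.setInt(1, id.hashCode());      // narrow integer "bucket" used to narrow the search
            ps.setString(2, id.toString());   // full value for the exact match
            ps.executeUpdate();
        }
    }

    static boolean exists(Connection con, UUID id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT 1 FROM items WHERE intCol = ? AND uuid = ?")) {
            ps.setInt(1, id.hashCode());
            ps.setString(2, id.toString());
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}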
Using BINARY(16) is another possibility. Less storage space than character types. Use CREATE TYPE UUID .. or CREATE DOMAIN UUID .. as suggested above.
I think the easiest thing to do would be to create your own domain thus creating your own UUID "type" (not really a type, but almost).
You also should consider the answer to this question (especially if you plan to use it instead of a "normal" primary key)
INT, BIGINT or UUID/GUID in HSQLDB? (deleted by community ...)
HSQLDB: Domain Creation and Manipulation