If we have a sequence to generate unique ID fields for a table, which of the 2 approaches is more efficient:
1. Create a trigger on insert that populates the ID field by fetching nextval from the sequence.
2. Call nextval on the sequence in the application layer before inserting the object (or tuple) into the DB.
EDIT: The application performs a mass upload, so assume thousands or a few million rows are inserted each time the app runs. Would the trigger from #1 be more efficient than calling the sequence within the app as mentioned in #2?
Since you are inserting a large number of rows, the most efficient approach would be to include the sequence.nextval as part of the SQL statement itself, i.e.
INSERT INTO table_name( table_id, <<other columns>> )
VALUES( sequence_name.nextval, <<bind variables>> )
or
INSERT INTO table_name( table_id, <<other columns>> )
SELECT sequence_name.nextval, <<other values>>
FROM some_other_table
If you use a trigger, you will force a context switch from the SQL engine to the PL/SQL engine (and back again) for every row you insert. If you get the nextval separately, you'll force an additional round-trip to the database server for every row. Neither of these is particularly costly if you do them once or twice. If you do them millions of times, though, the milliseconds add up to real time.
If you're only concerned about performance, on Oracle it'll generally be a bit faster to populate the ID with a sequence in your INSERT statement, rather than use a trigger, as triggers add a bit of overhead.
However (as Justin Cave says), the performance difference will probably be insignificant unless you're inserting millions of rows at a time. Test it to see.
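For illustration, a minimal JDBC sketch of the inline-sequence approach for a mass upload, batched to cut round-trips. The connection details, table name, sequence name, and batch size are placeholders, not from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BulkInsertWithSequence {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oracle connection details; replace with your own.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger")) {
            con.setAutoCommit(false);
            String sql = "INSERT INTO table_name (table_id, col1) "
                       + "VALUES (sequence_name.nextval, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 0; i < 10_000; i++) {
                    ps.setString(1, "row " + i);
                    ps.addBatch();
                    if ((i + 1) % 1000 == 0) {
                        ps.executeBatch();   // send 1000 rows per round-trip
                    }
                }
                ps.executeBatch();           // flush the remainder
                con.commit();
            }
        }
    }
}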
What is a key? One or more fields that uniquely identify a record; it should be final and never change over the life of an application.
I distinguish between technical and business keys. Technical keys are defined on the database and are generated (sequence, UUID, etc.); business keys are defined by your domain model.
That's why I suggest:
always generate technical PKs with a sequence/trigger on the database
never use this PK field in your application (tip: mark getId()/setId() @Deprecated)
define business fields which uniquely identify your entity and use these in equals/hashCode methods (sketched below)
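For example, a minimal sketch of an entity whose equals/hashCode are based on a business key rather than the generated technical PK; the class and field names here are invented for illustration.

import java.util.Objects;

public class Employee {
    private Long id;               // technical PK, generated by the database
    private String employeeNumber; // business key, unique in the domain

    @Deprecated
    public Long getId() { return id; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Employee)) return false;
        return Objects.equals(employeeNumber, ((Employee) o).employeeNumber);
    }

    @Override
    public int hashCode() {
        return Objects.hash(employeeNumber);
    }
}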
I'd say if you already use Hibernate, then let it control how the ids are created with @SequenceGenerator and @GeneratedValue. It will be more transparent, and Hibernate can reserve ids for itself, so it might be more efficient than doing it by hand or from a trigger.
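As a rough sketch of that mapping: the entity, sequence name, and allocationSize below are assumptions, and allocationSize should match the sequence's INCREMENT BY.

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class PurchaseOrder {
    @Id
    @SequenceGenerator(name = "order_seq_gen", sequenceName = "ORDER_SEQ",
                       allocationSize = 50) // Hibernate hands out 50 ids per DB hit
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "order_seq_gen")
    private Long id;

    public Long getId() { return id; }
}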
Related
I have a table like (id INTEGER, sometext VARCHAR(255), ....) with id as the primary key and a UNIQUE constraint on sometext. It gets used in a web server, where a request needs to find the id corresponding to a given sometext if it exists; otherwise a new row gets inserted.
This is the only operation on this table. There are no updates and no other operations on it. Its sole purpose is to persistently number the encountered values of sometext. This means that I can't drop the id and use sometext as the PK.
I do the following:
First, I consult my own cache in order to avoid any DB access. Nearly always, this works and I'm done.
Otherwise, I use Hibernate Criteria to find the row by sometext. Usually, this works and again, I'm done.
Otherwise, I need to insert a new row.
This works fine, except when there are two overlapping requests with the same sometext. Then a ConstraintViolationException results. I'd need something like INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE (MySQL syntax) or MERGE (Firebird syntax).
I wonder what are the options?
AFAIK Hibernate merge works on the PK only, so it's inappropriate. I guess a native query might or might not help, as it may or may not be committed when the second INSERT takes place.
Just let the database handle the concurrency. Start a secondary transaction purely for inserting the new row. If it fails with a ConstraintViolationException, just roll that transaction back and read the new row.
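A rough sketch of that pattern with plain Hibernate, assuming a mapped entity for the table; TextRow, its getId(), and the DAO wiring are hypothetical names, not from the question.

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.exception.ConstraintViolationException;

public class TextDao {
    private final SessionFactory sessionFactory;

    public TextDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Returns the id for someText, inserting a new row only when necessary.
    public Integer findOrCreate(String someText) {
        Integer id = findId(someText);                   // usual case: row already exists
        if (id != null) {
            return id;
        }
        Session session = sessionFactory.openSession();  // secondary, short-lived transaction
        Transaction tx = session.beginTransaction();
        try {
            TextRow row = new TextRow(someText);         // TextRow: the mapped entity (hypothetical)
            session.save(row);
            tx.commit();
            return row.getId();
        } catch (ConstraintViolationException e) {
            tx.rollback();                               // a concurrent request won the race
            return findId(someText);                     // read the row the other request inserted
        } finally {
            session.close();
        }
    }

    private Integer findId(String someText) {
        Session session = sessionFactory.openSession();
        try {
            return (Integer) session
                    .createQuery("select t.id from TextRow t where t.someText = :txt")
                    .setParameter("txt", someText)
                    .uniqueResult();
        } finally {
            session.close();
        }
    }
}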
I'm not sure this scales well if the likelihood of a duplicate is high; it's a lot of extra work if some percentage (depending on the database) of transactions have to fail the insert and then re-select.
A secondary transaction minimizes the length of time that the transaction adding the new text takes. Assuming the database supports it correctly, it might be possible for the thread 1 transaction to cause the thread 2 select/insert to hang until the thread 1 transaction is committed or rolled back. The overall database design might also affect transaction throughput.
I don't necessarily question why sometext can't be a PK; I'm wondering why you need to break it out at all. Of course, with large volumes you might save substantial space if the sometext values are large; it almost seems like you're trying to emulate a Lucene index to give you a complete list of text values.
Can I put a MAX value on the database table primary key, either via JPA or at the database level? If that is not possible, then I was thinking about the following:
1. Create a random key between 0 and 9999999999 (9999999999 is my MAX).
2. Do a SELECT on the database with the newly created key; if the returned object is null, INSERT, otherwise go back to step 1.
So if I do the above, two questions. Please keep in mind that the environment is highly concurrent:
Q1: Is the overhead of checking with a SELECT and then INSERTing if the key is not there significant? What I really mean is: is this process normal, since usually I let the DB create a unique PK for me?
Q2: If Q1 does not create significant performance degradation, can I run into concurrency issues? For example, P1 checks the table for Id1, Id1 is not there, and it is ready to insert, but P2 sneaks in and inserts Id1 before P1 can. So when P1 inserts Id1, it fails. I don't want the process to fail here; I want it to go back up the loop, find a new id, and repeat the process. How do I do that?
My environment is SQL and a MySQL DB. I use JPA with the EclipseLink implementation.
NOTE: Some people question my decision to implement it this way; the answer is exactly what TravisJ suggests below. I have a highly concurrent environment. When a process kicks off, I need to create a request to another process, passing to that process a unique 10-character-long id. Since the environment is highly concurrent, I want to leverage the unique, not-null feature of the PK. The request contains a lot of information, so I created a Request table with the request id as my PK. I know that since all DBs index their PK, querying by the PK is fast. If there is a better way, please let me know.
You can implement a Check Constraint in your table definition:
CREATE TABLE P
(
    P_Id BIGINT PRIMARY KEY NOT NULL,  -- BIGINT, since 9999999999 overflows a 32-bit INT
    ...
    CONSTRAINT chk_P_Id CHECK (P_Id > 0 AND P_Id < 9999999999)
)
EDIT: As stated in the comments, MySQL does not honor CHECK constraints. This is a six-year-old defect in the bug log and the MySQL team has yet to fix it. As MySQL is now overseen by Oracle Corp, it may never be fixed (simply considered a "documented limitation", and people who don't like it can upgrade to the paid DBMS). However, this syntax, and the check constraint feature itself, DO work in Oracle, MS SQL Server (including SQLExpress/MSDE), MS Access, PostgreSQL and SQLite.
Why not start at 1 and use auto-increment? This will be much more efficient because you will not get collisions, which you must cycle through. If you run out of numbers, you will be in the same boat either way, but at least going sequentially, you won't have to deal with collisions.
Imagine trying to find an unused key when you have used up 90% of your available numbers. That will take some time, and there is always a possibility that it never (in your lifetime) finds an unused key if you are generating them randomly.
Also, using auto-increment, it's easy to tell if you're close to the limit (SELECT MAX(col)). You could script an alert to let you know when you need to reset. For the random method, what would that query look like?
If you're using InnoDB, then you still might not want to use the random value as the primary key. Inserting random values into a clustered index is a performance hit, since the actual table data must be reordered. Instead, put a unique key on that column (and keep an auto-increment primary key).
Using a unique index on the column in question, simply generate a random number in the range and attempt to insert it. If the insertion fails, then generate a new number and try again. If the insert succeeds, then proceed. This accounts for the concurrency issue.
Still, the sequential auto-increment key is going to yield better performance.
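As a sketch of that generate-and-retry approach in plain JDBC: the table and column names are invented, and the exact exception subclass thrown on a duplicate can vary by driver.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.concurrent.ThreadLocalRandom;

public class RandomKeyInserter {
    private static final long MAX_ID = 9_999_999_999L;

    // Tries random ids until the unique index accepts one.
    public static long insertRequest(Connection con, String payload) throws Exception {
        while (true) {
            long candidate = ThreadLocalRandom.current().nextLong(MAX_ID + 1);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO request (request_id, payload) VALUES (?, ?)")) {
                ps.setLong(1, candidate);
                ps.setString(2, payload);
                ps.executeUpdate();
                return candidate;   // the unique index accepted the id
            } catch (SQLIntegrityConstraintViolationException e) {
                // collision with an existing id (or a concurrent insert): try again
            }
        }
    }
}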
See
http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing
and
http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing#Advanced_Sequencing
JPA already has good id generation support; it does not make sense to implement your own.
If you are concerned about concurrency and performance, and using MySQL, I would recommend using the TABLE generator with a large preallocation size (on other databases I would recommend the SEQUENCE generator). If you have a lot of data, ensure you use a long for your id.
If you really think you need more than this, then consider UUID id generation. EclipseLink 2.4 will provide a @UUIDGenerator.
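A minimal sketch of such a TABLE generator with a large preallocation size; the entity, generator table, and column names below are assumptions.

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.TableGenerator;

@Entity
public class Request {
    @Id
    @TableGenerator(name = "request_gen", table = "ID_GEN",
                    pkColumnName = "GEN_NAME", valueColumnName = "GEN_VALUE",
                    pkColumnValue = "REQUEST_ID",
                    allocationSize = 1000)   // one DB hit per 1000 ids
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "request_gen")
    private long id;
}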
Hello and happy new year to everyone.
I need to insert a record at the end of a table (the table does not have auto-increment set) using JPA.
I know I could get the last id (an integer) and apply it to the entity before inserting, but how could that be done? Which way would be most effective?
There is no such thing as "the end of the table". Rows in a relational table are not sorted.
Simply insert your new row. If you need any particular order, you need to apply an ORDER BY when selecting the rows from the table.
If you are talking about generating a new ID, then use an Oracle sequence. It guarantees uniqueness.
I would not recommend using a "counter table".
That solution is either not scalable (if it's correctly implemented) or not safe (if it's scalable).
That's what sequences were created for. I don't know JPA, but if you can't get the ID from a sequence then I suggest you find a better ORM.
Well, while I do not know where the end of a table really is, JPA has a lot of options for plugging in ID generators.
One common option is to use a table of its own, holding a counter for each entity you need an ID for (from http://download.oracle.com/docs/cd/B32110_01/web.1013/b28221/cmp30cfg001.htm):
@Id
@GeneratedValue(strategy = GenerationType.TABLE, generator = "ADDRESS_TABLE_GENERATOR")
@TableGenerator(
    name = "ADDRESS_TABLE_GENERATOR",
    table = "EMPLOYEE_GENERATOR_TABLE",
    pkColumnValue = "ADDRESS_SEQ"
)
@Column(name = "ADDRESS_ID")
public Integer getId() {
    return id;
}
...other "Generator" strategies to be googled...
EDIT
I dare to reference @a_horse_with_no_name, as he says he does not know about JPA. If you want to use native mechanisms like sequences (which are not available in every DB), you can declare such a generator in JPA, too.
I do not know what issues he encountered with the table approach; I know of large installations running it successfully. But anyway, this depends on a lot of factors besides scalability, for example whether you want this to be portable, etc. Just look up the different strategies and select the appropriate one.
So I have a database into which a lot of data is being inserted from a Java application. Usually I insert into table1, get the last id, then insert into table2 and get the last id from there, and finally insert into table3 and get that id as well and work with it within the application. And I insert around 1000-2000 rows of data every 10-15 minutes.
And using a lot of small inserts and selects on a production web server is not really good, because it sometimes bogs down the server.
My question is: is there a way to insert multiple rows into table1, table2 and table3 without using such a huge number of selects and inserts? Is there a SQL-fu technique I'm missing?
Since you're probably relying on auto_increment primary keys, you have to do the inserts one at a time, at least for table1 and table2. Because MySQL won't give you more than the very last key generated.
You should never have to select. You can get the last inserted id from the Statement using the getGeneratedKeys() method. See an example showing this in the MySQL manual for the Connector/J:
http://dev.mysql.com/doc/refman/5.1/en/connector-j-usagenotes-basic.html#connector-j-examples-autoincrement-getgeneratedkeys
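For example, a minimal JDBC sketch of chaining the inserts with getGeneratedKeys() instead of a separate SELECT; the table and column names are placeholders.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InsertChain {
    // Inserts into table1, reads back the auto_increment id, and uses it for table2.
    public static long insertParentAndChild(Connection con, String name, String detail)
            throws Exception {
        long table1Id;
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO table1 (name) VALUES (?)",
                Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (!keys.next()) {
                    throw new IllegalStateException("no generated key returned");
                }
                table1Id = keys.getLong(1);   // no extra SELECT needed
            }
        }
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO table2 (table1_id, detail) VALUES (?, ?)")) {
            ps.setLong(1, table1Id);
            ps.setString(2, detail);
            ps.executeUpdate();
        }
        return table1Id;
    }
}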
Other recommendations:
Use multi-row INSERT syntax for table3.
Use ALTER TABLE DISABLE KEYS while you're importing, and re-enable them when you're finished.
Use explicit transactions. I.e. begin a transaction before your data-loading routine, and commit at the end. I'd probably also commit after every 1000 rows of table1.
Use prepared statements.
Unfortunately, you can't use the fastest method for bulk load of data, LOAD DATA INFILE, because that doesn't allow you to get the generated id values per row.
There's a lot to talk about here:
It's likely that network latency is killing you if each of those INSERTs is another network roundtrip. Try batching your requests so they only require a single roundtrip for the entire transaction.
Speaking of transactions, you don't mention them. If all three of those INSERTs need to be a single unit of work you'd better be handling transactions properly. If you don't know how, better research them.
Try caching requests if they're reused a lot. The fastest roundtrip is the one you don't make.
You could redesign your database such that the primary key is not a database-generated, auto-incremented value, but rather a client-generated UUID. Then you could generate all the keys for every record up front and batch the inserts however you like.
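A rough sketch of that idea with JDBC batching; the table and column names are invented.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;

public class UuidBatchInsert {
    // Generates keys client-side so all the inserts can be batched up front.
    public static void insertAll(Connection con, String[] names) throws Exception {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO table1 (id, name) VALUES (?, ?)")) {
            for (String name : names) {
                String id = UUID.randomUUID().toString(); // key is known before the insert
                ps.setString(1, id);
                ps.setString(2, name);
                ps.addBatch();
                // the same id can be reused for the table2/table3 batches here
            }
            ps.executeBatch();                            // one round-trip for the whole batch
        }
        con.commit();
    }
}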
I have a web service in Java that receives a list of information to be inserted or updated in a database. I don't know which entries are inserts and which are updates.
Which is the best approach to obtain better performance results:
Iterate over the list (an object list, with the table PK in it) and try to insert each entry into the database. If the insert fails, run an update.
Try to load the entry from the database; if it is found, update it, if not, insert it.
Another option? Tell me about it :)
On the first calls, I believe that most of the entries will be new DB entries, but there will be a saturation point after which most of the entries will be updates.
I'm talking about a DB table that could reach over 100 million entries in a mature form.
What will be your approach? Performance is my most important goal.
If your database supports MERGE, I would have thought that was most efficient (and treats all the data as a single set).
See:
http://www.oracle.com/technology/products/oracle9i/daily/Aug24.html
https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=194
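For illustration, one way this could look from Java with a batched Oracle-style MERGE; the table and column names are made up, and the exact MERGE syntax differs between databases.

import java.sql.Connection;
import java.sql.PreparedStatement;

public class MergeUpsert {
    // Oracle-style MERGE; adjust the syntax for your database.
    private static final String MERGE_SQL =
        "MERGE INTO item t "
      + "USING (SELECT ? AS id, ? AS name FROM dual) s "
      + "ON (t.id = s.id) "
      + "WHEN MATCHED THEN UPDATE SET t.name = s.name "
      + "WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)";

    public static void upsertAll(Connection con, long[] ids, String[] names) throws Exception {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(MERGE_SQL)) {
            for (int i = 0; i < ids.length; i++) {
                ps.setLong(1, ids[i]);
                ps.setString(2, names[i]);
                ps.addBatch();
            }
            ps.executeBatch();   // the database resolves insert-vs-update per row
        }
        con.commit();
    }
}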
If performance is your goal, then first get rid of the word "iterate" from your vocabulary! Learn to do things in sets.
If you need to update or insert, always do the update first; otherwise it is easy to find yourself accidentally updating the record you just inserted. If you are doing this, it helps to have an identifier you can look at to see if the record exists: if the identifier exists, do the update, otherwise do the insert.
The important thing is to understand the balance, or ratio, between the number of inserts and the number of updates in the list you receive. IMHO you should implement an abstract strategy that says "persist this in the database". Then create concrete strategies that (for example):
check for the primary key; if zero records are found, do the insert, else update
do the update and, if it fails, do the insert
others
And then pull the strategy to use (the fully qualified class name, for example) from a configuration file. This way you can switch from one strategy to another easily. If it is feasible (it could be, depending on your domain), you can add a heuristic that selects the best strategy based on the input entities in the set.
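A rough sketch of that strategy idea: Entry, EntryDao, and the property name are hypothetical, and the DAO is assumed to wrap the actual SQL or JPA calls.

import java.util.Properties;

// Entry and EntryDao are hypothetical domain/DAO types wrapping the real persistence calls.
interface PersistStrategy {
    void persist(EntryDao dao, Entry entry);   // "persist this in the database"
}

class UpdateFirstStrategy implements PersistStrategy {
    @Override
    public void persist(EntryDao dao, Entry entry) {
        int updated = dao.update(entry);       // rows affected by the UPDATE
        if (updated == 0) {
            dao.insert(entry);                 // nothing matched, so the entry is new
        }
    }
}

class PersistStrategyFactory {
    // Pulls the fully qualified class name of the strategy from a configuration file.
    static PersistStrategy fromConfig(Properties config) throws Exception {
        String className = config.getProperty("persist.strategy",
                UpdateFirstStrategy.class.getName());
        return (PersistStrategy) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
    }
}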
MySQL supports this:
INSERT INTO foo
SET bar='baz', howmanybars=1
ON DUPLICATE KEY UPDATE howmanybars=howmanybars+1
Option 2 is not going to be the most efficient. The database will already be making this check for you when you do the actual insert or update in order to enforce the primary key. By making this check yourself you are incurring the overhead of a table lookup twice as well as an extra round trip from your Java code. Choose which case is the most likely and code optimistically.
Expanding on option 1, you can use a stored procedure to handle the insert/update. This example with PostgreSQL syntax assumes the insert is the normal case.
CREATE FUNCTION insert_or_update(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
    INSERT INTO
        my_table (id, col1)
    SELECT
        _id, _col1;
EXCEPTION WHEN unique_violation THEN
    UPDATE
        my_table
    SET
        col1 = _col1
    WHERE
        id = _id;
END;
$$
LANGUAGE plpgsql;
You could also make the update the normal case and then check the number of rows affected by the update statement to determine if the row is actually new and you need to do an insert.
As alluded to in some other answers, the most efficient way to handle this operation is in one batch:
Take all of the rows passed to the web service and bulk insert them into a temporary table
Update rows in the master table from the temp table
Insert new rows in the master table from the temp table
Dispose of the temp table
The type of temporary table to use and most efficient way to manage it will depend on the database you are using.
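As a sketch of those four steps with JDBC and MySQL-style SQL: the staging and master table names and their columns are invented for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class StagedUpsert {
    public static void upsert(Connection con, long[] ids, String[] names) throws Exception {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TEMPORARY TABLE staging (id BIGINT PRIMARY KEY, name VARCHAR(255))");
        }
        // 1. Bulk insert everything passed to the web service into the temp table.
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO staging (id, name) VALUES (?, ?)")) {
            for (int i = 0; i < ids.length; i++) {
                ps.setLong(1, ids[i]);
                ps.setString(2, names[i]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        try (Statement st = con.createStatement()) {
            // 2. Update existing rows in the master table from the temp table.
            st.executeUpdate("UPDATE master m JOIN staging s ON m.id = s.id SET m.name = s.name");
            // 3. Insert rows that are not in the master table yet.
            st.executeUpdate("INSERT INTO master (id, name) "
                           + "SELECT s.id, s.name FROM staging s "
                           + "LEFT JOIN master m ON m.id = s.id WHERE m.id IS NULL");
            // 4. Dispose of the temp table.
            st.execute("DROP TEMPORARY TABLE staging");
        }
        con.commit();
    }
}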