Hibernate concurrency creating a duplicate record on saveOrUpdate - java

I'm trying to implement a counter with Java, Spring, Hibernate and Oracle SQL. Each record represents a count, by a given timestamp. Let's say each record is uniquely identified by the minute, and each record holds a count column. The service should expect to receive a ton of concurrent requests and my update a counter column for possibly the same record.
In my table, if the record does not exist, just insert the record in and set its count to 1. Otherwise, find the record by timestamp and increase its existing counter column by 1.
In order to ensure that we're maintain data consistency and integrity, I'm using pessimistic locking. For example, if 20 counts come in at the same time, and not necessarily by the same user, it's possible that we may override the record from a stale read of that record before updating. With locking, I'm ensuring that if 20 counts come in, the net effect on the database should represent the 20 count.
So locking is fine, but the problem is that if the record never did exist in the first place, and we have two or more concurrent requests coming in trying to update the not-yet-existant record, I've observed that the a duplicate record gets inserted because we cannot lock on a record that doesn't exist yet. How can we ensure that no duplicates get created in the table? Should it be controlled via Oracle? Or can I manage this via my app and Hibernate?
Thank you.

One was to avoid this sort of problem altogether would be to just generate the count at the time you actually query the data. Oracle has an analytic function ROW_NUMBER() which can assign a row number to each record in the result set of a query. As a rough example, consider the following query:
SELECT
ts,
ROW_NUMBER() OVER (ORDER BY ts) rn
FROM yourTable
The count you want would be in the rn column, representing the number of records appearing since the first entry in the table. Of course, you could further restrict the query.
This approach is robust to removing records, as the count would always start with 1. One drawback is that row number functionality is not supported by Hibernate. You would have to run this either as a native query or a stored proc.

Related

Get multiple Oracle sequences in one roundtrip

We have a "audit" table that we create lots of rows in. Our persistence layer queries the audit table sequence to create a new row in the audit table. With millions of rows being created daily the select statement to get the next value from the sequence is one of our top ten most executed queries. We would like to reduce the number of database roundtrips just to get the sequence next value (primary key) before inserting a new row in the audit table. We know you can't batch select statements from JDBC. Are there any common techniques for reducing database roundtrips to get a sequence next value?
Get a couple (e.g. 1000) of sequence values in advance by a single select:
select your_sequence.nextval
from dual
connect by level < 1000
cache the obtained sequences and use it for the next 1000 audit inserts.
Repeat this when you have run out of cached sequence values.
Skip the select statement for the sequence and generate the sequence value in the insert statement itself.
insert (ID,..) values (my_sequence.nextval,..)
No need for an extra select. If you need the sequence value get it by adding a returning clause.
insert (ID,..) values (my_sequence.nextval,..) returning ID into ..
Save some extra time by specifying a cache value for the sequence.
I suggest you change the "INCREMENT BY" option of the sequence and set it to a number like 100 (you have to decide what step size must be taken by your sequence, 100 is an example.)
then implement a class called SequenceGenerator, in this class you have a property that contains the nextValue, and every 100 times, calls the sequence.nextVal in order to keep the db sequence up to date.
in this way you will go to db every 100 inserts for the sequence nextVal
every time the application starts, you have to initialize the SequenceGenerator class with the sequence.nextVal.
the only downside of this approach is that if your application stops for any reason, you will loose some of the sequences values and there will be gaps in your ids. but it should not be a logical problem if you don't have anu business logic on the id values.

Java Multithreaded delete on same sets of table

I had to cleanup database ( few tables with given condition , where columns for conditions are always same ) e.g.
delete from table1 where date < given_date1 and id = given_id
delete from table2 where date < given_date2 and id = given_id
Where given_id and givendate relation varies on both table by table and id by id.
The actual delete condition is not always where date < givendate , I just wrote for example, so say one id has got 300 days of data, and other of 500 days of data, the where condition is allowed to delete oldes 10 days of data where 10 is a variable, based on user input, so at one iteration all nodes are processed with deleting oldest 10 days of data and thus query changes for each id, but the fact is that it would be on same sets of table
earlier that script was written in as sql script and doing its operation but was taking time, Now I have implemented a multithreaded java application where the new code looks like
for(i=0; i < idcount ; i++)
{
//launch new thread and against that thread call
delete(date,currentid);
}
function delete(date,id)
{
delete from table1 where date < given_date and id = given_id
delete from table2 where date < given_date and id = given_id
}
after implementing this I found deadlock on sql table, which was solved by indexing the tables, but still its not fast as it is supposed to be, If I have 500 threads they are all launched one after other, and obviously running on same sets of table. and sql is not actually executing in parallel on each table ?
When I monitor my java.exe and sqlserver.exe, its not busy at all ? I hope it is supposed to be.
Could anyone tell me what could be best approach to implement multithreaded delete on same sets of table, so that I can bump up the thread and do deletion in parallel and consume available resources
If all the actions are delete on a given id the I would just do a delete on each table doing all the ids at once.
e.g.
delete from table1 where date < given_date and id in (given_id1, given_id2 ..... )
If there are lots of given_ids the first insert them into a temporary table then execute each delete by joining the table to have deletions with the temporary table
Also if trying to use multiple threads then the improvement is really only expected if you act on a table in a thread so there will not be contention in the database.
Ignoring the problem you created...
Why not use the IN statement?
delete from table1 where date < given_date and id IN (id1, id2, id3, ...)
Update based on clarification:
Based on the explanation in the comments, my guess is that you don't have good indexes and every delete statement is resulting in a table scan. Each table scan locks the table and thus the database can only process one statement at a time. Index the date and id columns along with any other column used in the where clause of your delete statement.
In my personal experience, I make a class to manage my queries and the communication with the database. I use a thread pool to manage my threads and simply have the threads make calls to my static database manager. The manager should have a synchronized method in it that acquires a lock() on to the database connection. The threads will then be able to access the database and their actions won't conflict with each other.
If you dont care about making all command in one transaction unit so put the delete in its own transaction (small one).

Fetching records one by one from PostgreSql DB

There's a DB that contains approximately 300-400 records. I can make a simple query for fetching 30 records like:
SELECT * FROM table
WHERE isValidated = false
LIMIT 30
Some more words about content of DB table. There's a column named isValidated, that can (as you correctly guessed) take one of two values: true or false. After a query some of the records should be made validated (isValidated=true). It is approximately 5-6 records from each bunch of 30 records. Correspondingly after each query, I will fetch the records (isValidated=false) from previous query. In fact, I'll never get to the end of the table with such approach.
The validation process is made with Java + Hibernate. I'm new to Hibernate, so I use Criterion for making this simple query.
Is there any best practices for such task? The variant with adding a flag-field (that marks records which were fetched already) is inappropriate (over-engineering for this DB).
Maybe there's an opportunity to create some virtual table where records that were already processed will be stored or something like this. BTW, after all the records are processed, it is planned to start processing them again (it is possible, that some of them need to be validated).
Thank you for your help in advance.
I can imagine several solutions:
store everything in memory. You only have 400 records, and it could be a perfectly fine solution given this small number
use an order by clause (which you should do anyway) on a unique column (the PK, for example), store the ID of the last loaded record, and make sure the next query uses where ID > :lastId

Insert a lot of data into database in very small inserts

So i have a database where there is a lot of data being inserted from a java application. Usualy i insert into table1 get the last id, then again insert into table2 and get the last id from there and finally insert into table3 and get that id as well and work with it within the application. And i insert around 1000-2000 rows of data every 10-15 minutes.
And using a lot of small inserts and selects on a production webserver is not really good, because it sometimes bogs down the server.
My question is: is there a way how to insert multiple data into table1, table2, table3 without using such a huge amount of selects and inserts? Is there a sql-fu technique i'm missing?
Since you're probably relying on auto_increment primary keys, you have to do the inserts one at a time, at least for table1 and table2. Because MySQL won't give you more than the very last key generated.
You should never have to select. You can get the last inserted id from the Statement using the getGeneratedKeys() method. See an example showing this in the MySQL manual for the Connector/J:
http://dev.mysql.com/doc/refman/5.1/en/connector-j-usagenotes-basic.html#connector-j-examples-autoincrement-getgeneratedkeys
Other recommendations:
Use multi-row INSERT syntax for table3.
Use ALTER TABLE DISABLE KEYS while you're importing, and re-enable them when you're finished.
Use explicit transactions. I.e. begin a transaction before your data-loading routine, and commit at the end. I'd probably also commit after every 1000 rows of table1.
Use prepared statements.
Unfortunately, you can't use the fastest method for bulk load of data, LOAD DATA INFILE, because that doesn't allow you to get the generated id values per row.
There's a lot to talk about here:
It's likely that network latency is killing you if each of those INSERTs is another network roundtrip. Try batching your requests so they only require a single roundtrip for the entire transaction.
Speaking of transactions, you don't mention them. If all three of those INSERTs need to be a single unit of work you'd better be handling transactions properly. If you don't know how, better research them.
Try caching requests if they're reused a lot. The fastest roundtrip is the one you don't make.
You could redesign your database such that the primary key was not a database-generated, auto-incremented value, but rather a client generated UUID. Then you could generated all the keys for every record upfront and batch the inserts however you like.

Insert fail then update OR Load and then decide if insert or update

I have a webservice in java that receives a list of information to be inserted or updated in a database. I don't know which one is to insert or update.
Which one is the best approach to abtain better performance results:
Iterate over the list(a object list, with the table pk on it), try to insert the entry on Database. If the insert failed, run a update
Try to load the entry from database. if the results retrieved update, if not insert the entry.
another option? tell me about it :)
In first calls, i believe that most of the entries will be new bd entries, but there will be a saturation point that most of the entries will be to update.
I'm talking about a DB table that could reach over 100 million entries in a mature form.
What will be your approach? Performance is my most important goal.
If your database supports MERGE, I would have thought that was most efficient (and treats all the data as a single set).
See:
http://www.oracle.com/technology/products/oracle9i/daily/Aug24.html
https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=194
If performance is your goal then first get rid of the word iterate from your vocabulary! learn to do things in sets.
If you need to update or insert, always do the update first. Otherwise it is easy to find yourself updating the record you just inserted by accident. If you are doing this it helps to have an identifier you can look at to see if the record exists. If the identifier exists, then do the update otherwise do the insert.
The important thing is to understand the balance or ratio between the number of inserts versus the number of updates on the list you receive. IMHO you should implement an abstract strategy that says "persists this on database". Then create concrete strategies that (for example):
checks for primary key, if zero records are found does the insert, else updates
Does the update and, if fails, does the insert.
others
And then pull the strategy to use (the class fully qualified name for example) from a configuration file. This way you can switch from one strategy to another easily. If it is feasible, could be depending on your domain, you can put an heuristic that selects the best strategy based on the input entities on the set.
MySQL supports this:
INSERT INTO foo
SET bar='baz', howmanybars=1
ON DUPLICATE KEY UPDATE howmanybars=howmanybars+1
Option 2 is not going to be the most efficient. The database will already be making this check for you when you do the actual insert or update in order to enforce the primary key. By making this check yourself you are incurring the overhead of a table lookup twice as well as an extra round trip from your Java code. Choose which case is the most likely and code optimistically.
Expanding on option 1, you can use a stored procedure to handle the insert/update. This example with PostgreSQL syntax assumes the insert is the normal case.
CREATE FUNCTION insert_or_update(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
INSERT INTO
my_table (id, col1)
SELECT
_id, _col1;
EXCEPTION WHEN unique_violation THEN
UPDATE
my_table
SET
col1 = _col1
WHERE
id = _id;
END;
END;
$$
LANGUAGE plpgsql;
You could also make the update the normal case and then check the number of rows affected by the update statement to determine if the row is actually new and you need to do an insert.
As alluded to in some other answers, the most efficient way to handle this operation is in one batch:
Take all of the rows passed to the web service and bulk insert them into a temporary table
Update rows in the mater table from the temp table
Insert new rows in the master table from the temp table
Dispose of the temp table
The type of temporary table to use and most efficient way to manage it will depend on the database you are using.

Categories