How to avoid duplicate insert in DB through Java?

I need to insert an employee into the Employee table. What I want is to avoid duplicate inserts, i.e. if two threads try to insert the same employee at the same time then the last transaction should fail. For example, if first_name and hire_date are the same for two employees (the same employee coming from two threads) then fail the last transaction.
Approach 1: The first approach I can think of is to put a constraint at the column level (like a combined unique constraint on first_name and hire_date), or to check in the query whether the employee exists and throw an error if it does (I believe that would be possible through PL/SQL).
Approach 2: Can it be done at the Java level too, e.g. create a method which first checks whether the employee exists and throws an error if it does? In that case I would need to make the method synchronized (or use a synchronized block), but that will impact performance because it unnecessarily holds up other transactions as well. Is there a way to take a lock (ReentrantLock) or synchronize based on name/hire_date, so that only the transactions with the same name and hire date are put on hold?
public void save(Employee emp){
//hibernate api to save
}
I believe Approach 1 should be preferred as it is simpler and easier to implement. Right? Even if yes, I would like to know whether it can be handled efficiently at the Java level.

What I want is to avoid duplicate inserts
and
but that will impact performance because it unnecessarily holds up other transactions as well
So, you want highly concurrent inserts that guarantee no duplicates.
Whether you do this in Java or in the database, the only way to avoid duplicate inserts is to serialize (or, Java-speak, synchronize). That is, have one transaction wait for another.
The Oracle database will do this automatically for you if you create a PRIMARY KEY or UNIQUE constraint on your key values. Simultaneous inserts that are not duplicates will not interfere or wait for one another. However, if two sessions simultaneously attempt duplicate inserts, the second will wait until the first completes. If the first session completed via COMMIT, then the second transaction will fail with a duplicate key on index violation. If the first session completed via ROLLBACK, the second transaction will complete successfully.
You can do something similar in Java as well, but the problem is you need a locking mechanism that is accessible to all sessions. synchronize and similar alternatives work only if all sessions are running in the same JVM.
Also, in Java, a key to maximizing concurrency and minimizing waits is to wait only for actual duplicates. You can get close to that by hashing the incoming key values and then synchronizing only on that hash. For example, put 65,536 objects into a list. When an insert wants to happen, hash the incoming key values to a number between 0 and 65,535, get that object from the list, and synchronize on it. Of course, you can also synchronize on the actual key values, but a hash is usually just as good and can be easier to work with, especially if the incoming key values are unwieldy or sensitive.
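A minimal sketch of that striping idea (the class and helper names StripedEmployeeSaver, findByNameAndHireDate and hibernateSave are made up, and it assumes the Employee entity from the question; as noted above, this only serializes saves running inside a single JVM):

import java.util.concurrent.locks.ReentrantLock;

// Sketch only: Employee, the lookup and the Hibernate save are assumed to exist elsewhere.
public abstract class StripedEmployeeSaver {

    private static final int BUCKETS = 65536;
    private static final ReentrantLock[] LOCKS = new ReentrantLock[BUCKETS];
    static {
        for (int i = 0; i < BUCKETS; i++) {
            LOCKS[i] = new ReentrantLock();
        }
    }

    public void save(Employee emp) {
        // Hash the business key so that only saves with the same name/hire date contend.
        int bucket = Math.floorMod((emp.getFirstName() + "|" + emp.getHireDate()).hashCode(), BUCKETS);
        ReentrantLock lock = LOCKS[bucket];
        lock.lock();
        try {
            if (findByNameAndHireDate(emp) == null) {
                hibernateSave(emp);
            }
        } finally {
            lock.unlock();
        }
    }

    protected abstract Employee findByNameAndHireDate(Employee emp);
    protected abstract void hibernateSave(Employee emp);
}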
That all said, this should absolutely all be done in the database using a simple PRIMARY KEY constraint on your table and appropriate error handling.

One of the main reasons for using databases is that they give you consistency.
You are volunteering to put some of that responsibility back into your application. That very much sounds like the wrong approach. Instead, you should study exactly which capabilities your database offers and try to make as much use of them as possible.
In that sense, you are trying to fix the problem at the wrong level.

Pseudo code:
void save(Employee emp) {
    if (!isEmployeeExist(emp)) {
        // Hibernate API to save
    }
}

boolean isEmployeeExist(Employee emp) {
    // build and run a query to find the employee
    return true; // if the employee exists, else return false
}
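Filled in with Hibernate, that pseudo code might look roughly like this (a sketch only: it assumes an Employee entity mapped with firstName and hireDate properties, and, as other answers here point out, without a database constraint two threads can still pass this check at the same time):

import org.hibernate.Session;

public void save(Session session, Employee emp) {
    if (!isEmployeeExist(session, emp)) {
        session.save(emp);   // Hibernate API to save
    }
}

boolean isEmployeeExist(Session session, Employee emp) {
    Long count = (Long) session.createQuery(
            "select count(e) from Employee e where e.firstName = :fn and e.hireDate = :hd")
        .setParameter("fn", emp.getFirstName())
        .setParameter("hd", emp.getHireDate())
        .uniqueResult();
    return count != null && count > 0;
}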

Good question. I would strongly suggest using MERGE (INSERT and UPDATE in a single DML statement) in this case. Let Oracle handle the transaction and the locks; that is best in your case.
You should create a primary key or unique constraint (approach 1) regardless of any other solution, to preserve data integrity.
-- Sample statement
MERGE INTO employees e
USING (SELECT * FROM hr_records) h
ON (e.id = h.emp_id)
WHEN MATCHED THEN
UPDATE SET e.address = h.address
WHEN NOT MATCHED THEN
INSERT (id, address)
VALUES (h.emp_id, h.address);
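From Java the same kind of statement can be run through a PreparedStatement; a rough sketch, adapting the sample above to bind one employee's values instead of selecting from hr_records:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

static void upsertAddress(Connection con, long empId, String address) throws SQLException {
    String merge =
        "MERGE INTO employees e "
      + "USING (SELECT ? AS emp_id, ? AS address FROM dual) h "
      + "ON (e.id = h.emp_id) "
      + "WHEN MATCHED THEN UPDATE SET e.address = h.address "
      + "WHEN NOT MATCHED THEN INSERT (id, address) VALUES (h.emp_id, h.address)";
    try (PreparedStatement ps = con.prepareStatement(merge)) {
        ps.setLong(1, empId);
        ps.setString(2, address);
        ps.executeUpdate();
    }
}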

Since the row is not inserted yet, isolation levels such as READ_COMMITTED/REPEATABLE_READ will not apply to it.
The best option is to apply a DB constraint (unique). If that is not possible, then in a multi-node setup you can't achieve this through Java locks either, because a request can go to any node.
So in that case we need some kind of distributed-lock functionality.
We can create a lock table in which we define, for each table, that only one insert (or one collection of inserts) is possible at a time across nodes.
Ex:
Table_Name, Lock_Acquired
emp, 'N'
Now any code can read this row (READ_COMMITTED) and try to update Lock_Acquired to 'Y'.
Any other code in another thread or on another node then cannot proceed further; the lock is granted only once the previous lock has been released.
This makes for a highly concurrent system that avoids duplication, but it will suffer from scalability issues, so decide accordingly what you want to achieve.
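One way to implement that lock row from Java is to hold it with SELECT ... FOR UPDATE for the duration of the inserting transaction; a rough JDBC sketch (the table and column names follow the example above, the Employee entity comes from the question, and insertIfAbsent stands for whatever duplicate check and insert you actually need):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TableLockDao {

    /** Serializes all inserts for the 'emp' table across threads and nodes. */
    public void saveUnderTableLock(Connection con, Employee emp) throws SQLException {
        con.setAutoCommit(false);
        try {
            // Every node blocks here until the current holder commits or rolls back.
            try (PreparedStatement lock = con.prepareStatement(
                    "SELECT lock_acquired FROM table_lock WHERE table_name = 'emp' FOR UPDATE");
                 ResultSet rs = lock.executeQuery()) {
                rs.next();
            }
            insertIfAbsent(con, emp);   // hypothetical duplicate check + insert
            con.commit();               // releases the row lock
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }

    private void insertIfAbsent(Connection con, Employee emp) throws SQLException {
        // check first_name/hire_date and insert only when no row exists (details omitted)
    }
}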

Related

Prevent violating of UNIQUE constraint with Hibernate

I have a table like (id INTEGER, sometext VARCHAR(255), ....) with id as the primary key and a UNIQUE constraint on sometext. It gets used in a web server, where a request needs to find the id corresponding to a given sometext if it exists, otherwise a new row gets inserted.
This is the only operation on this table. There are no updates and no other operations on this table. Its sole purpose is to persistently number the encountered values of sometext. This means that I can't drop the id and use sometext as the PK.
I do the following:
First, I consult my own cache in order to avoid any DB access. Nearly always, this works and I'm done.
Otherwise, I use Hibernate Criteria to find the row by sometext. Usually, this works and again, I'm done.
Otherwise, I need to insert a new row.
This works fine, except when there are two overlapping requests with the same sometext. Then a ConstraintViolationException results. I'd need something like INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE (MySQL syntax) or MERGE (Firebird syntax).
I wonder what are the options?
AFAIK Hibernate merge works on PK only, so it's inappropriate. I guess, a native query might help or not, as it may or may not be committed when the second INSERT takes place.
Just let the database handle the concurrency. Start a secondary transaction purely for inserting the new row. If it fails with a ConstraintViolationException, just roll that transaction back and read the new row.
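A sketch of that pattern with plain Hibernate 5.x (assuming a SessionFactory and a hypothetical SomeText entity with an id and a unique sometext column; the exact exception wrapping differs between Hibernate versions, so the catch may need adjusting):

import javax.persistence.PersistenceException;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.exception.ConstraintViolationException;

public class SomeTextDao {

    private final SessionFactory sessionFactory;

    public SomeTextDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    /** Returns the id for sometext, inserting it in a short separate transaction if needed. */
    public long findOrCreate(String sometext) {
        try (Session session = sessionFactory.openSession()) {
            Transaction tx = session.beginTransaction();
            try {
                SomeText row = new SomeText(sometext);   // hypothetical entity
                session.persist(row);
                session.flush();                         // forces the INSERT so a duplicate fails here
                tx.commit();
                return row.getId();
            } catch (PersistenceException e) {
                tx.rollback();
                boolean duplicate = e instanceof ConstraintViolationException
                        || e.getCause() instanceof ConstraintViolationException;
                if (!duplicate) {
                    throw e;
                }
                // another request inserted the same sometext first; fall through and read it
            }
        }
        try (Session session = sessionFactory.openSession()) {
            return session.createQuery(
                        "select t.id from SomeText t where t.sometext = :v", Long.class)
                    .setParameter("v", sometext)
                    .uniqueResult();
        }
    }
}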
Not sure this scales well if the likelihood of a duplicate is high; it's a lot of extra work if some percentage (depending on the database) of transactions have to fail the insert and then reselect.
A secondary transaction minimizes the length of time the transaction that adds the new text takes. Assuming the database supports it correctly, it may still be possible for the thread 1 transaction to cause the thread 2 select/insert to hang until the thread 1 transaction is committed or rolled back. Overall database design might also affect transaction throughput.
I don't necessarily question why sometext can't be the PK so much as wonder why you need to break it out at all. Of course, with large volumes you might save substantial space if the sometext records are large; it almost seems like you're trying to emulate a Lucene index to give you a complete list of text values.

Can I put a MAX value for the database table primary key?

Can I put a MAX value for the database table primary key, either via JPA or at the database level? If it is not possible, then I was thinking about
Create a random key between 0-9999999999 (9999999999 is my MAX)
Do a SELECT on the database with the newly created key; if the returned object is null, then INSERT, otherwise go back to step 1
So if I do the above, I have two questions. Please keep in mind that the environment is highly concurrent:
Q1: Is the overhead of checking with a SELECT and INSERTing only when nothing is found significant? What I really mean is: is this process normal, since usually I let the DB create a unique PK for me?
Q2: If Q1 does not create significant performance degradation, can I run into concurrency issues? For example, P1 with Id1 checks the table, Id1 is not there, and P1 is ready to insert; P2 sneaks in and inserts Id1 before P1 can. So when P1 inserts Id1, it fails. I don't want the process to fail there; I want it to go back to the top of the loop, find a new id, and repeat the process. How do I do that?
My environment is SQL and a MySQL DB. I use JPA with the EclipseLink implementation.
NOTE: Some people question my decision to implement it this way; the answer is exactly what TravisJ suggests below. I have a very highly concurrent environment. When a process kicks off, I need to create a request to another process, passing to that process a unique, 10-character-long id. Since the environment is highly concurrent, I want to leverage the unique, not-null feature of a PK. The request contains a lot of information, so I create a Request table, with the request id as my PK. I know that since all DBs index their PK, querying by the PK is fast. If there is a better way, please let me know.
You can implement a Check Constraint in your table definition:
CREATE TABLE P
(
P_Id int PRIMARY KEY NOT NULL,
...
CONSTRAINT chk_P_Id CHECK (P_Id>0 and P_Id<9999999999)
)
EDIT: As stated in the comments, MySQL does not honor CHECK constraints. This is a six-year-old defect in the bug log and the MySQL team has yet to fix it. As MySQL is now overseen by Oracle Corp, it may never be fixed (simply considered a "documented limitation", and people who don't like it can upgrade to the paid DBMS). However, this syntax, and the check constraint feature itself, DO work in Oracle, MS SQL Server (including SQLExpress/MSDE), MS Access, PostgreSQL and SQLite.
Why not start at 1 and use auto-increment? This will be much more efficient because you will not get collisions, which you must cycle through. If you run out of numbers, you will be in the same boat either way, but at least going sequentially, you won't have to deal with collisions.
Imagine trying to find an unused key when you have used up 90% of your available numbers. That will take some time, and there is always a possibility that it never (in your lifetime) finds an unused key if you are generating them randomly.
Also, using auto-increment, it's easy to tell if you're close to the limit (SELECT MAX(col)). You could script an alert to let you know when you need to reset. For the random method, what would that query look like?
If you're using InnoDB, then you still might not want to use the random value as the primary key. Inserting random values into a clustered index is a performance hit, since the actual table data must be reordered. Instead, use a unique key (with an auto-increment primary key).
Using a unique index on the column in question, simply generate a random number in the range and attempt to insert it. If the insertion fails, then generate a new number and try again. If the insert succeeds, then proceed. This accounts for the concurrency issue.
Still, the sequential auto-increment key is going to yield better performance.
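A rough JDBC sketch of that generate-and-retry loop (table and column names are made up; it assumes a unique index on the key column, and that the driver reports duplicates as SQLIntegrityConstraintViolationException, which some drivers only signal as a plain SQLException with a vendor error code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.concurrent.ThreadLocalRandom;

public class RandomKeyInserter {

    private static final long MAX_KEY = 9_999_999_999L;

    /** Tries random keys until an insert succeeds, then returns the key that was used. */
    public static long insertWithRandomKey(Connection con, String payload) throws SQLException {
        while (true) {
            long candidate = ThreadLocalRandom.current().nextLong(1, MAX_KEY + 1);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO request (request_id, payload) VALUES (?, ?)")) {
                ps.setLong(1, candidate);
                ps.setString(2, payload);
                ps.executeUpdate();
                return candidate;   // no exception: the key was free and is now ours
            } catch (SQLIntegrityConstraintViolationException e) {
                // key already taken by a concurrent insert: loop and try another one
            }
        }
    }
}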
See,
http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing
and,
http://en.wikibooks.org/wiki/Java_Persistence/Identity_and_Sequencing#Advanced_Sequencing
JPA already has good id generation support; it does not make sense to implement your own.
If you are concerned about concurrency and performance, and you are using MySQL, I would recommend using a TABLE generator with a large preallocation size (on other databases I would recommend a SEQUENCE generator). If you have a lot of data, ensure you use a long for your id.
If you really think you need more than this, then consider UUID id generation. EclipseLink 2.4 will provide a @UUIDGenerator.
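For reference, a TABLE generator with a large preallocation size looks roughly like this in JPA (the entity, generator and table names here are made up):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.TableGenerator;

@Entity
public class Request {

    @Id
    @TableGenerator(name = "request_gen", table = "id_gen",
            pkColumnName = "gen_name", valueColumnName = "gen_value",
            allocationSize = 1000)   // each node grabs 1000 ids per database round trip
    @GeneratedValue(strategy = GenerationType.TABLE, generator = "request_gen")
    private long id;

    // other fields omitted
}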

Avoiding for loop and try to utilize collection APIs instead (performance)

I have a piece of code from an old project.
The logic (at a high level) is as follows:
The user sends a series of {id,Xi} where id is the primary key of the object in the database.
The aim is that the database is updated but the series of Xi values is always unique.
I.e. if the user sends {1,X1} and in the database we have {1,X2},{2,X1}, the input should be rejected; otherwise we end up with duplicates, i.e. {1,X1},{2,X1}, where X1 appears twice in different rows.
In lower level the user sends a series of custom objects that encapsulate this information.
Currently the implementation uses "brute force", i.e. repeated for-loops over the input and the JDBC result set to ensure uniqueness.
I do not like this approach; moreover, the actual implementation has subtle bugs, but that is another story.
I am searching for a better approach, both in terms of coding and performance.
What I was thinking is the following:
Create a Set from the user's input list. If the Set has a different size than the list, then the user's input has duplicates. Stop there.
Load data from jdbc.
Create a HashMap<Long,String> with the user's input. The key is the primary key.
Loop over the result set. If the HashMap does not contain a key equal to the ResultSet row's id, then add that row to the HashMap.
In the end, get the HashMap's values as a List. If it contains duplicates, reject the input.
This is the algorithm I came up with.
Is there a better approach than this? (I assume that I am not mistaken about the algorithm itself.)
Purely from a performance point of view, why not let the database figure out that there are duplicates (like {1,X1},{2,X1})? Have a unique constraint in place on the table, and when the update statement fails by throwing an exception, catch it and deal with it however you want under these input conditions. You may also want to run this as a single transaction in case you need to roll back any partial updates. Of course, this assumes that you don't have any other business rules driving the updates that you haven't mentioned here.
With your algorithm, you are spending too much time iterating over HashMaps and Lists to remove duplicates, IMHO.
Since you can't change the database, as stated in the comments, I would probably extend your Set idea. Create a HashMap<Long, String> and put all of the items from the database in it, then also create a HashSet<String> with all of the values from your database in it.
Then, as you go through the user input, check each key against the HashMap and see whether the values are the same; if they are, great, you don't have to do anything because that exact input is already in your database.
If they aren't the same then check the value against the HashSet to see if it already exists. If it does then you have a duplicate.
Should perform much better than a loop.
Edit:
For multiple updates, perform all of the updates on the HashMap created from your database, then once again check whether the Map's value set has a different size from its key set.
There might be a better way to do this, but this is the best I got.
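A small sketch of that check (type and method names are made up; dbRows stands for the id-to-value pairs loaded from the database):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Returns false if the user input would produce a duplicate value in the table. */
static boolean isInputValid(Map<Long, String> dbRows, Map<Long, String> userInput) {
    Set<String> existingValues = new HashSet<>(dbRows.values());   // all values already in the DB
    for (Map.Entry<Long, String> entry : userInput.entrySet()) {
        String current = dbRows.get(entry.getKey());
        if (entry.getValue().equals(current)) {
            continue;                                   // exact row already stored, nothing to do
        }
        if (existingValues.contains(entry.getValue())) {
            return false;                               // value already used by some other id
        }
    }
    return true;
}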
I'd opt for a database-side solution. Assuming a table with the columns id and value, you should make a list with all the "values", and use the following SQL:
select count(*) from tbl where value in (:values);
binding the :values parameter to the list of values however is appropriate for your environment. (Trivial when using Spring JDBC and a database that supports the in operator, less so for lesser setups. As a last resort you can generate the SQL dynamically.) You will get a result set with one row and one column of a numeric type. If it's 0, you can then insert the new data; if it's 1, report a constraint violation. (If it's anything else you have a whole new problem.)
If you need to check for every item in the user input, change the query to:
select value from tbl where value in (:values)
store the result in a set (called e.g. duplicates), and then loop over the user input items and check whether the value of the current item is in duplicates.
This should perform better than snarfing the entire dataset into memory.
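With Spring JDBC, binding the list to the :values parameter is straightforward; a sketch using NamedParameterJdbcTemplate (table and column names follow the query above):

import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

static Set<String> findDuplicates(NamedParameterJdbcTemplate jdbc, List<String> values) {
    List<String> hits = jdbc.queryForList(
            "select value from tbl where value in (:values)",
            Collections.singletonMap("values", values),
            String.class);
    return new HashSet<>(hits);   // the "duplicates" set to check each input item against
}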

Can I do an atomic MERGE in Oracle?

I have a couple instances of a J2EE app running in a single WebLogic cluster.
At some point, these apps do a MERGE to insert or update a record into the back-end Oracle database. The MERGE checks to see if a row with a specified primary key is there or not. If it's there, update. If not, insert.
Now suppose two app instances want to insert or update a row with primary key = 100. Suppose the row doesn't exist. During the "check" stage of the merge, they both see that the row's not there, so both of them attempt to insert. Then I get a unique key constraint violation.
My question is this: Is there an atomic MERGE in Oracle? I'm looking for something that has a similar effect to INSERT ... FOR UPDATE in PL/SQL except that I can only execute SQL from my apps.
EDIT: I was unclear. I AM using the MERGE statement while this error still occurs. The thing is, only the "modifying" part is atomic, not the whole merge.
This is not a problem with MERGE as such. Rather the issue lies in your application. Consider this stored procedure:
create or replace procedure upsert_t23
    ( p_id   in t23.id%type
    , p_name in t23.name%type )
is
    cursor c is
        select null
        from   t23
        where  id = p_id;
    dummy varchar2(1);
begin
    open c;
    fetch c into dummy;
    if c%notfound then
        insert into t23
        values (p_id, p_name);
    else
        update t23
        set    name = p_name
        where  id = p_id;
    end if;
end;
So, this is the PL/SQL equivalent of a MERGE on T23. What happens if two sessions call it simultaneously?
SSN1> exec upsert_t23(100, 'FOX IN SOCKS')
SSN2> exec upsert_t23(100, 'MR KNOX')
SSN1 gets there first, finds no matching record and inserts a record. SSN2 gets there second but before SSN1 commits, finds no record, inserts a record and hangs because SSN1 has a lock on the unique index node for 100. When SSN1 commits SSN2 will hurl a DUP_VAL_ON_INDEX violation.
The MERGE statement works in exactly the same way. Both sessions will check on (t23.id = 100), not find it and go down the INSERT branch. The first session will succeed and the second will hurl ORA-00001.
One way to handle this is to introduce pessimistic locking. At the start of the UPSERT_T23 procedure we lock the table:
...
lock table t23 in row share mode nowait;
open c;
...
Now, SSN1 arrives, grabs the lock and proceeds as before. When SSN2 arrives it can't get the lock, so it fails immediately. Which is frustrating for the second user but at least they are not hanging, plus they know someone else is working on the same record.
There is no syntax for INSERT which is equivalent to SELECT ... FOR UPDATE, because there is nothing to select. And so there is no such syntax for MERGE either. What you need to do is include the LOCK TABLE statement in the program unit which issues the MERGE. Whether this is possible for you depends on the framework you're using.
The MERGE statement in the second session can not "see" the insert that the first session did until that session commits. If you reduce the size of the transactions the probability that this will occur will be reduced.
Or can you sort or partition your data so that all records with a given primary key will be handled by the same session? A simple function like "primary key mod N" should distribute them evenly across N sessions.
btw, if two records have the same primary key, the second will overwrite the first. Sounds a little odd.
Yes, and it's called.... MERGE
EDIT: The only way to get this watertight is to insert, catch the dup_val_on_index exception and handle it appropriately (update, or perhaps insert another record). This can easily be done with PL/SQL, but you can't use that.
You're also looking for workarounds. Can you catch the dup_val_on_index in Java and issue an extra UPDATE again?
In pseudo-code:
try {
    // MERGE
}
catch (dup_val_on_index) {
    // UPDATE
}
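A rough JDBC version of that pseudo-code (runMerge is a hypothetical helper that executes the MERGE statement; it assumes the Oracle driver reports ORA-00001 as vendor error code 1, and reuses the t23 columns from the answer above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

static void mergeWithFallback(Connection con, long id, String name) throws SQLException {
    try {
        runMerge(con, id, name);       // hypothetical helper that executes the MERGE statement
    } catch (SQLException e) {
        if (e.getErrorCode() != 1) {   // 1 = ORA-00001, unique constraint violated
            throw e;
        }
        // The other session committed its row between our check and insert: just update it.
        try (PreparedStatement update = con.prepareStatement(
                "UPDATE t23 SET name = ? WHERE id = ?")) {
            update.setString(1, name);
            update.setLong(2, id);
            update.executeUpdate();
        }
    }
}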
I am surprised that MERGE would behave the way you describe, but I haven't used it sufficiently to say whether it should or not.
In any case, you might have the transactions that wish to execute the merge set their isolation level to SERIALIZABLE. I think that may solve your issue.

Insert fail then update OR Load and then decide if insert or update

I have a webservice in Java that receives a list of information to be inserted or updated in a database. I don't know which entries are inserts and which are updates.
Which approach gives the best performance:
Iterate over the list (an object list, with the table PK in it) and try to insert each entry into the database. If the insert fails, run an update.
Try to load the entry from the database; if a result is retrieved, update it, otherwise insert the entry.
Another option? Tell me about it :)
In the first calls, I believe that most of the entries will be new DB entries, but there will be a saturation point after which most of the entries will be updates.
I'm talking about a DB table that could reach over 100 million entries in its mature form.
What will be your approach? Performance is my most important goal.
If your database supports MERGE, I would have thought that was most efficient (and treats all the data as a single set).
See:
http://www.oracle.com/technology/products/oracle9i/daily/Aug24.html
https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=194
If performance is your goal then first get rid of the word iterate from your vocabulary! Learn to do things in sets.
If you need to update or insert, always do the update first. Otherwise it is easy to find yourself accidentally updating the record you just inserted. When doing this it helps to have an identifier you can look at to see whether the record exists: if the identifier exists, do the update, otherwise do the insert.
The important thing is to understand the balance, or ratio, between the number of inserts and the number of updates in the list you receive. IMHO you should implement an abstract strategy that says "persist this in the database", and then create concrete strategies that (for example):
check the primary key; if zero records are found, insert, otherwise update;
do the update and, if it fails, do the insert;
others.
Then pull the strategy to use (the fully qualified class name, for example) from a configuration file. This way you can switch from one strategy to another easily. If it is feasible (that may depend on your domain), you can add a heuristic that selects the best strategy based on the input entities in the set.
MySQL supports this:
INSERT INTO foo
SET bar='baz', howmanybars=1
ON DUPLICATE KEY UPDATE howmanybars=howmanybars+1
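Issued from Java, that might look like this (a sketch; the table and column names follow the statement above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

static void addBar(Connection con, String bar) throws SQLException {
    String sql = "INSERT INTO foo SET bar = ?, howmanybars = 1 "
               + "ON DUPLICATE KEY UPDATE howmanybars = howmanybars + 1";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setString(1, bar);
        ps.executeUpdate();   // inserts the row, or bumps the counter if bar already exists
    }
}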
Option 2 is not going to be the most efficient. The database will already be making this check for you when you do the actual insert or update in order to enforce the primary key. By making this check yourself you are incurring the overhead of a table lookup twice as well as an extra round trip from your Java code. Choose which case is the most likely and code optimistically.
Expanding on option 1, you can use a stored procedure to handle the insert/update. This example with PostgreSQL syntax assumes the insert is the normal case.
CREATE FUNCTION insert_or_update(_id INTEGER, _col1 INTEGER) RETURNS void
AS $$
BEGIN
    INSERT INTO my_table (id, col1)
    SELECT _id, _col1;
EXCEPTION WHEN unique_violation THEN
    UPDATE my_table
    SET col1 = _col1
    WHERE id = _id;
END;
$$
LANGUAGE plpgsql;
You could also make the update the normal case and then check the number of rows affected by the update statement to determine if the row is actually new and you need to do an insert.
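In JDBC terms that simply means checking the return value of executeUpdate; a sketch reusing the my_table example from above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

static void updateOrInsert(Connection con, int id, int col1) throws SQLException {
    try (PreparedStatement update = con.prepareStatement(
            "UPDATE my_table SET col1 = ? WHERE id = ?")) {
        update.setInt(1, col1);
        update.setInt(2, id);
        if (update.executeUpdate() > 0) {
            return;   // the row existed and has been updated
        }
    }
    try (PreparedStatement insert = con.prepareStatement(
            "INSERT INTO my_table (id, col1) VALUES (?, ?)")) {
        insert.setInt(1, id);
        insert.setInt(2, col1);
        insert.executeUpdate();
    }
}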
As alluded to in some other answers, the most efficient way to handle this operation is in one batch:
Take all of the rows passed to the web service and bulk insert them into a temporary table
Update rows in the master table from the temp table
Insert new rows in the master table from the temp table
Dispose of the temp table
The type of temporary table to use and most efficient way to manage it will depend on the database you are using.
