I use executeBatch() with JDBC to insert multiple rows, and I want to get the ids of the inserted rows for another insert. I use this code for that purpose:
insertInternalStatement = dbConnection.prepareStatement(INSERT_RECORD, generatedColumns);
for (Foo foo: foosHashSet) {
insertInternalStatement.setInt(1, foo.getMe());
insertInternalStatement.setInt(2, foo.getMe2());
// ..
insertInternalStatement.addBatch();
}
insertInternalStatement.executeBatch();
// now get inserted ids
try (ResultSet generatedKeys = insertInternalStatement.getGeneratedKeys()) {
Iterator<Foo> fooIterator = foosHashSet.iterator();
while (generatedKeys.next() && fooIterator.hasNext()) {
fooIterator.next().setId(generatedKeys.getLong(1));
}
}
It works fine and the ids are returned. My questions are:
If I iterate over getGeneratedKeys() and foosHashSet, will the ids be returned in the same order, so that each id returned from the database belongs to the corresponding Foo instance?
What about when I use multiple threads and the above code runs in several threads simultaneously?
Is there any other solution for this? I have two tables, foo1 and foo2, and I want to first insert the foo1 records and then use their primary ids as foo2 foreign keys.
Given that support for getGeneratedKeys with batch execution is not defined in the JDBC specification, the behavior will depend on the driver used. I would expect any driver that supports generated keys for batch execution to return the ids in the order they were added to the batch.
However, the fact that you are using a Set is problematic. The iteration order of most sets is not defined and could change between iterations (usually only after modification, but in theory you can't assume anything about the order). You need to use something with a guaranteed order, e.g. a List or maybe a LinkedHashSet.
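For example, a minimal sketch reusing the names from the question (and assuming your driver returns generated keys for batches at all) that fixes the iteration order once by copying the set into a List before building the batch:
List<Foo> foos = new ArrayList<>(foosHashSet); // freeze one iteration order up front
try (PreparedStatement ps = dbConnection.prepareStatement(INSERT_RECORD, generatedColumns)) {
    for (Foo foo : foos) {
        ps.setInt(1, foo.getMe());
        ps.setInt(2, foo.getMe2());
        ps.addBatch();
    }
    ps.executeBatch();
    // keys are expected (driver permitting) in the order the rows were added to the batch
    try (ResultSet generatedKeys = ps.getGeneratedKeys()) {
        int i = 0;
        while (generatedKeys.next() && i < foos.size()) {
            foos.get(i++).setId(generatedKeys.getLong(1));
        }
    }
}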
Applying multi-threading here would probably be a bad idea: you should only use a JDBC connection from a single thread at a time. Accounting for multi-threading would either require correct locking, or require you to split up the workload so it can use separate connections. Whether that would improve or worsen performance is hard to say.
You should be able to iterate through multiple generated keys without problems; they are returned in the order the rows were inserted.
I don't think there should be any problem adding threads here. The only thing I'm fairly sure of is that you would not be able to control the order in which the ids are inserted into the two tables without some extra code complexity.
You could store all the ids from the first insert in a Collection and, after all threads/iterations have finished, insert them into the second table.
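A rough sketch of that two-phase idea, staying single-threaded; the foo1/foo2 SQL and column names are made up for illustration, and foos is assumed to be an ordered List of the Foo instances:
String insertFoo1 = "INSERT INTO foo1 (me, me2) VALUES (?, ?)";          // hypothetical SQL
String insertFoo2 = "INSERT INTO foo2 (foo1_id, payload) VALUES (?, ?)"; // hypothetical SQL
List<Long> foo1Ids = new ArrayList<>();
try (PreparedStatement ps1 = dbConnection.prepareStatement(insertFoo1, Statement.RETURN_GENERATED_KEYS)) {
    for (Foo foo : foos) {
        ps1.setInt(1, foo.getMe());
        ps1.setInt(2, foo.getMe2());
        ps1.addBatch();
    }
    ps1.executeBatch();
    try (ResultSet keys = ps1.getGeneratedKeys()) {
        while (keys.next()) {
            foo1Ids.add(keys.getLong(1)); // collect the foo1 primary keys in batch order
        }
    }
}
try (PreparedStatement ps2 = dbConnection.prepareStatement(insertFoo2)) {
    for (int i = 0; i < foos.size() && i < foo1Ids.size(); i++) {
        ps2.setLong(1, foo1Ids.get(i));   // use the collected id as the foreign key
        ps2.setString(2, "some payload"); // placeholder column value
        ps2.addBatch();
    }
    ps2.executeBatch();
}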
The iteration order is the same as long as foosHashSet is not altered.
One could consider a LinkedHashSet, which yields the items in insertion order; that is especially nice when nothing is removed or overwritten.
Concurrent access would be problematic.
Use a LinkedHashSet without removal, only adding new items, and additionally wrap it in Collections.synchronizedSet. For set alterations one would need a Semaphore or similar, as synchronizing such a large code block is a no-go.
An even better performing solution might be to make a local copy:
List<Foo> fooList = foosHashSet.stream()
        .collect(Collectors.toList());
However, this is still a somewhat unsatisfying solution:
a batch for multiple inserts, and then several other updates/inserts per insert.
Transitioning to JPA instead of JDBC would somewhat alleviate the situation.
From experience, however, I would ask whether a database is still the correct tool (hammer) at this point. If the data is a graph or a hierarchical data structure, then storing the entire structure as XML with JAXB in a single database table could be the better solution: faster, easier development, verifiable data.
Use the database for the main data, and the XML for an edited/processed document.
Yes. As per the documentation of executeBatch(), it says:
createFcCouponStatement.executeBatch()
Submits a batch of commands to the database for execution and if all commands execute successfully, returns an array of update counts. The int elements of the array that is returned are ordered to correspond to the commands in the batch, which are ordered according to the order in which they were added to the batch. The elements in the array returned by the method executeBatch may be one of the following:
A number greater than or equal to zero -- indicates that the command was processed successfully and is an update count giving the number of rows in the database that were affected by the command's execution
A value of SUCCESS_NO_INFO -- indicates that the command was processed successfully but that the number of rows affected is unknown
If one of the commands in a batch update fails to execute properly, this method throws a BatchUpdateException, and a JDBC driver may or may not continue to process the remaining commands in the batch. However, the driver's behavior must be consistent with a particular DBMS, either always continuing to process commands or never continuing to process commands. If the driver continues processing after a failure, the array returned by the method BatchUpdateException.getUpdateCounts will contain as many elements as there are commands in the batch, and at least one of the elements will be the following:
A value of EXECUTE_FAILED -- indicates that the command failed to execute successfully and occurs only if a driver continues to process commands after a command fails
The possible implementations and return values have been modified in the Java 2 SDK, Standard Edition, version 1.3 to accommodate the option of continuing to process commands in a batch update after a BatchUpdateException object has been thrown.
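A short sketch of what inspecting that array can look like; the statement name comes from the snippet above, everything else is illustrative:
try {
    int[] counts = createFcCouponStatement.executeBatch();
    for (int i = 0; i < counts.length; i++) {
        if (counts[i] == Statement.SUCCESS_NO_INFO) {
            // command i succeeded, but the affected row count is unknown
        } else {
            // counts[i] is the number of rows affected by command i
        }
    }
} catch (BatchUpdateException e) {
    for (int count : e.getUpdateCounts()) {
        if (count == Statement.EXECUTE_FAILED) {
            // this command failed; the driver may or may not have continued
        }
    }
}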
Related
I have a MySQL database where I need to do 1k or so updates, and I am contemplating whether it would be more appropriate to use executeBatch or executeUpdate. The PreparedStatement is built from an ArrayList of 1k or more ids (which are PKs of the table to be updated). For each update I need to check whether the row was actually updated (it's possible that the id is not in the table). If the id doesn't exist, I need to add that id to a separate ArrayList which will be used to do batch inserts.
Given the above, is it more appropriate to do:
Various separate executeUpdate() calls, storing the id if it was not updated, or
Simply create a batch and use executeBatch(), which will return an array with either a 0 or a 1 for each separate statement/id.
In case two, the overhead is an additional array holding all the 0/1 return values. In case one, the overhead comes from executing each UPDATE separately.
Definitely executeBatch(), and make sure that you add rewriteBatchedStatements=true to your JDBC connection string.
The increase in throughput is hard to exaggerate: your 1K updates will likely take barely longer than a single update, assuming you have proper indexes and a WHERE clause that makes use of them.
Without that extra setting on the connection string, the time to do the batch update will be about the same as doing each update individually.
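As a rough sketch (URL, table and credentials are placeholders), the setting goes on the connection URL, and a zero update count then tells you which ids were missing:
String url = "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true"; // placeholder host/schema
List<Long> missingIds = new ArrayList<>();
try (Connection con = DriverManager.getConnection(url, "user", "password");
     PreparedStatement ps = con.prepareStatement("UPDATE my_table SET flag = 1 WHERE id = ?")) {
    for (long id : ids) { // ids: the ArrayList of primary keys from the question
        ps.setLong(1, id);
        ps.addBatch();
    }
    int[] counts = ps.executeBatch();
    for (int i = 0; i < counts.length; i++) {
        if (counts[i] == 0) { // nothing updated: this id is not in the table
            missingIds.add(ids.get(i));
        }
    }
}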
I'd go with the batch, since network latency is something to consider, unless you are somehow running it on the same box.
I need to insert employees into an Employee table. What I want is to avoid duplicate inserts, i.e. if two threads try to insert the same employee at the same time, then the last transaction should fail. For example, if first_name and hire_date are the same for two employees (the same employee coming from two threads), then fail the last transaction.
Approach 1: The first approach I can think of is to put a constraint at the column level (like a combined unique constraint on first_name and hire_date), or to check in the query whether the employee exists and throw an error (I believe that is possible through PL/SQL).
Approach 2: Can it be done at the Java level too, e.g. create a method which first checks whether the employee exists and then throws an error? In that case I need to make the method synchronized (or use a synchronized block), but that will impact performance: it will unnecessarily hold up other transactions as well. Is there a way I can take a lock (ReentrantLock) or synchronize based on name/hire date, so that only those specific transactions are put on hold which have the same name and hire date?
public void save(Employee emp){
//hibernate api to save
}
I believe Approach 1 should be preferred as it's simpler and easier to implement. Right? Even so, I would like to know whether it can be handled efficiently at the Java level.
What I want is to avoid duplicate inserts
and
but it will impact performance: it will unnecessarily hold up other transactions as well
So, you want highly concurrent inserts that guarantee no duplicates.
Whether you do this in Java or in the database, the only way to avoid duplicate inserts is to serialize (or, in Java-speak, synchronize); that is, have one transaction wait for another.
The Oracle database will do this automatically for you if you create a PRIMARY KEY or UNIQUE constraint on your key values. Simultaneous inserts that are not duplicates will not interfere or wait for one another. However, if two sessions simultaneously attempt duplicate inserts, the second will wait until the first completes. If the first session completed via COMMIT, then the second transaction will fail with a duplicate key on index violation. If the first session completed via ROLLBACK, the second transaction will complete successfully.
You can do something similar in Java as well, but the problem is you need a locking mechanism that is accessible to all sessions. synchronized and similar alternatives work only if all sessions are running in the same JVM.
Also, in Java, a key to maximizing concurrency and minimizing waits is to wait only for actual duplicates. You can get close to that by hashing the incoming key values and then synchronizing only on that hash. That is, for example, put 65,536 lock objects into a list; when an insert wants to happen, hash the incoming key values to a number between 0 and 65,535, get that object from the list, and synchronize on it. Of course, you can also synchronize on the actual key values, but a hash is usually as good and can be easier to work with, especially if the incoming key values are unwieldy or sensitive.
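A sketch of that striping idea within a single JVM; the stripe count, the Employee getters and the existence check are placeholders:
private static final int STRIPES = 65536;
private static final Object[] LOCKS = new Object[STRIPES];
static {
    for (int i = 0; i < STRIPES; i++) {
        LOCKS[i] = new Object();
    }
}

public void save(Employee emp) {
    // hash the business key to a stripe, so only true duplicates contend for the same lock
    int stripe = Math.floorMod(Objects.hash(emp.getFirstName(), emp.getHireDate()), STRIPES);
    synchronized (LOCKS[stripe]) {
        if (!isEmployeeExist(emp)) {
            // hibernate api to save
        }
    }
}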
That all said, this should absolutely all be done in the database using a simple PRIMARY KEY constraint on your table and appropriate error handling.
One of the main reasons of using databases is that they give you consistency.
You are volunteering to take some of that responsibility back into your application. That very much sounds like the wrong approach. Instead, you should study exactly which capabilities your database offers, and try to make as much use of them as possible.
In that sense you are trying to fix the problem at the wrong level.
Pseudo code:
void save(Employee emp) {
    if (!isEmployeeExist(emp)) {
        // Hibernate api to save
    }
}
boolean isEmployeeExist(Employee emp) {
    // build and run query for finding the employee
    return true; // if employee exists, else return false
}
Good question. I would strongly suggest using MERGE (INSERT and UPDATE in a single DML statement) in this case. Let Oracle handle the transactions and locks; it's the best fit for your case.
You should create the primary key or unique constraint (approach 1) regardless of any other solution, to preserve data integrity.
-- Sample statement
MERGE INTO employees e
USING (SELECT * FROM hr_records) h
ON (e.id = h.emp_id)
WHEN MATCHED THEN
UPDATE SET e.address = h.address
WHEN NOT MATCHED THEN
INSERT (id, address)
VALUES (h.emp_id, h.address);
Since the row is not inserted yet, isolation levels such as READ_COMMITTED/REPEATABLE_READ are not applicable to it.
The best option is to apply a DB constraint (unique). If that is not possible, then in a multi-node setup you can't achieve this through Java locks either, as a request can go to any node.
So in that case you need some kind of distributed-lock functionality.
You could create a lock table which defines, for each table, that only one insert (or collection of inserts) is possible at a time from a node.
Ex:
Table_Name, Lock_Acquired
emp, 'N'
Now any code can read this row (READ_COMMITTED) and try to update Lock_Acquired to 'Y';
any other code in another thread or on another node won't be able to proceed further, and the lock will be granted only when the previous lock has been released.
This gives you a highly concurrent system that avoids duplication, but it will suffer from scalability issues. So decide accordingly what you want to achieve.
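A rough JDBC sketch of that lock-table idea; the Table_Lock table name and the surrounding transaction handling are assumptions. The insert takes a row lock on the table's entry first, so only one thread/node inserts into emp at a time:
con.setAutoCommit(false);
try (PreparedStatement lock = con.prepareStatement(
        "SELECT Lock_Acquired FROM Table_Lock WHERE Table_Name = ? FOR UPDATE")) {
    lock.setString(1, "emp");
    lock.executeQuery(); // blocks here until the previous holder commits or rolls back
    // ... check for the duplicate employee and insert it ...
    con.commit();        // releases the row lock
} catch (SQLException e) {
    con.rollback();
    throw e;
}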
The problem: Every day we get lots of parts that we want to add to our stock. We get messages over a queue that we read from (using 4 different servers). The queue always contains elements, so the servers read as fast as they can. We want the servers to simply update the article if it exists, and insert it if it doesn't.
Our first, naive solution was simply to select to see whether the article existed, and if it didn't, to insert it. However, since there was no row for us to lock, we got problems with two servers doing the select at the same time, finding nothing, and then both trying to insert. Of course one of them gave us a duplicate key exception.
So instead we looked at the MERGE statement. We made a MERGE statement that looks like this (simplified for clarity):
MERGE INTO articles sr
USING (
VALUES (:PARAM_ARTICLE_NUMBER))
AS v(ARTICLE_NUMBER)
ON sr.ARTICLE_NUMBER = v.ARTICLE_NUMBER
WHEN MATCHED THEN
UPDATE SET
QUANTITY = QUANTITY + :PARAM_QUANTITY,
ARRIVED_DATE = CASE WHEN ARRIVED_DATE IS NULL
THEN :PARAM_ARRIVED_DATE
ELSE ARRIVED_DATE END
WHEN NOT MATCHED THEN
INSERT (QUANTITY, ARRIVED_DATE)
VALUES (:PARAM_QUANTITY, CURRENT_TIMESTAMP);
However, for some reason we are still getting duplicate key problems. My belief is that even if the MERGE statement is atomic, two MERGE statements can run concurrently and select at the same time.
Is there any way, short of locking the whole table, to make sure we only get one insert?
In a similar situation running the MERGE with the Repeatable Read isolation level solved our problem. RS was insufficient, because it still allowed phantom rows, which is exactly the issue you are experiencing. You can simply add WITH RR at the end of the statement and try it out.
Our test suite runs with up to 1000 simultaneous connections and we don't see concurrency much affected by the RR isolation used for that particular statement only.
Do the insert first, catch the duplicate key exception if thrown; then update instead.
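A sketch of that approach against the articles table from the question; the exact exception type and SQL are assumptions, as some drivers only report a duplicate key as a generic SQLException with SQLState 23505:
try (PreparedStatement ins = con.prepareStatement(
        "INSERT INTO articles (article_number, quantity, arrived_date) VALUES (?, ?, CURRENT_TIMESTAMP)")) {
    ins.setString(1, articleNumber);
    ins.setInt(2, quantity);
    ins.executeUpdate();
} catch (SQLIntegrityConstraintViolationException dup) {
    // another server won the race: fall back to the update
    try (PreparedStatement upd = con.prepareStatement(
            "UPDATE articles SET quantity = quantity + ? WHERE article_number = ?")) {
        upd.setInt(1, quantity);
        upd.setString(2, articleNumber);
        upd.executeUpdate();
    }
}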
I would like your help to improve the running time of the following procedure in Java.
The procedure steps are the following:
I have a table with more than a million records (primary key is auto increment).
I select the min and max primary key value from this table.
I create some initial 'fromRange' and 'toRange' variables based on the min and max values
Then I create a loop where I process 20000 records each time:
I fetch the records between 'fromRange' up to 'toRange'
For each record returned, I write (append each time) to an XML object (using JAXB)
Then I write the created XML object to a file on disk
I increase 'fromRange' and 'toRange' to continue with the next records
The procedure ends after all records have been processed.
This execution takes more than 12 hours on a normal PC to finish. I was wondering how I can improve this code
to export the files faster. Maybe by using threading?
Thanks
I fetch the records between 'fromRange' up to 'toRange'
and
Then I write the created XML object to a file on disk
are I/O steps that block computation. Multithreading is a solution to ensure your machine's resources are used optimally.
Of course you should profile this yourself and check whether the thread is blocked a lot of the time. If so, multithreading is a valid approach.
Comments:
I have a table with more than a million records (primary key is auto increment).
That is fine; as it is the primary key, it automatically has an index in most DBMSs.
I select the min and max primary key value from this table.
You might do this via the first-row/last-row functions of your DBMS. That is very selective and should not take long.
I create some initial 'fromRange' and 'toRange' variables based on the min and max values
Most modern DBMSs can store their indices as a B* tree. This means you have a tree structure which is very fast at finding a value, and the leaves are linked via a linked list, which makes it fast to find a range. So this should also be selective and not take too much time.
Then I create a loop where I process 20000 records each time
I would try to build up the Java objects and do the serialization via JAXB only at the very end.
In general you need to do some tracing to see which step consumes most of the time.
Your question is not complete: no total count, no database type, no information about record size. But in general:
Do not use max/min ranges; just select all records and iterate over them
Pay attention to the fetch size parameter in JDBC. That is the place where you should set 20000
Use JAXB in streaming mode (see JAXB_FRAGMENT), as sketched after this list
Do not forget about OutputStream buffering
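A sketch combining those points; Record is a hypothetical JAXB-annotated class mapping one row, dataSource and export.xml are placeholders, and exception handling is omitted:
try (Connection con = dataSource.getConnection();
     Statement st = con.createStatement();
     OutputStream out = new BufferedOutputStream(new FileOutputStream("export.xml"))) {
    st.setFetchSize(20000); // stream rows from the driver instead of min/max ranges
    XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out, "UTF-8");
    Marshaller m = JAXBContext.newInstance(Record.class).createMarshaller();
    m.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE); // marshal each record as a fragment
    xml.writeStartDocument("UTF-8", "1.0");
    xml.writeStartElement("records");
    try (ResultSet rs = st.executeQuery("SELECT * FROM my_table")) {
        while (rs.next()) {
            m.marshal(new Record(rs), xml); // map the current row and append it to the stream
        }
    }
    xml.writeEndElement();
    xml.writeEndDocument();
    xml.close();
}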
It would definitely be better to do the work in parallel. Keep the main thread reading all the records from the database, i.e. select * from MyTable order by myId.
Then create an ExecutorService by calling one of the methods from the Executors factory, like newCachedThreadPool.
Then, in the main thread, keep looping over the records and for each of them call executor.submit(() -> doYourWork(record));. Note that the record must be a copy, as it will be accessed from a different thread!
At the end call executor.shutdown() and executor.awaitTermination(). You can check for potential errors by calling get() on the Futures returned by the submit method.
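Putting that together, a sketch under those assumptions (Record, copyOf and doYourWork stand in for your own row copy and processing code):
ExecutorService executor = Executors.newCachedThreadPool();
List<Future<?>> futures = new ArrayList<>();
try (Statement st = con.createStatement();
     ResultSet rs = st.executeQuery("SELECT * FROM MyTable ORDER BY myId")) {
    while (rs.next()) {
        final Record copy = copyOf(rs); // detach the row before handing it to a worker
        futures.add(executor.submit(() -> doYourWork(copy)));
    }
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.HOURS);
for (Future<?> f : futures) {
    f.get(); // surfaces any exception thrown in a worker
}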
Alternatively, if you want a more advanced solution, you may consider using Apache Camel for this, specifically the SQL example.
I need to insert or update records based on whether the record already exists.
I am using JdbcBatchItemWriter to write the records, but if a record with the primary key already exists, I should update it instead...
So one solution is:
Make two separate lists, one for inserts and one for updates (note: I have to check in my processor every time whether the record already exists and add it to one of the lists), and have two different JdbcBatchItemWriter instances in my writer, for example:
JdbcBatchItemWriter<insertList> insertWriter;
JdbcBatchItemWriter<updateList> updateWriter;
Is there any other way to switch between the queries in the writer, based on whether the record already exists at the time of the batch update, i.e. have
just one
JdbcBatchItemWriter<mylist> allWriter...and
allWriter.write(mylistallitems);
I am thinking of using a MERGE query, but are there any performance issues with that?
Having two different lists may be the better option, since if you switch to a different persistence mechanism in the future you won't need to redesign your app. You may want to run a single query to get all existing primary keys from the DB, store them in a Collection holder, and refer to it in the processor. For example:
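A rough sketch of that processor idea (MyItem, getId() and the 'existing' flag are placeholders for your own types):
public class ClassifyingProcessor implements ItemProcessor<MyItem, MyItem> {

    private final Set<Long> existingIds; // loaded once with a single query before the step runs

    public ClassifyingProcessor(Set<Long> existingIds) {
        this.existingIds = existingIds;
    }

    @Override
    public MyItem process(MyItem item) {
        // tag the item so the writer side can route it to insertWriter or updateWriter
        item.setExisting(existingIds.contains(item.getId()));
        return item;
    }
}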
A brief search on SO for 'Oracle Merge Performance' shows multiple reports of performance issues caused by different factors; it may be slower than hand-crafted insert/update SQL.
Also, if you are receiving the complete data again (for the updates), you may want to consider the truncate-and-insert approach (delete before the insertions by adding a listener).