Bulk insert in Java using prepared statements batch update

I am trying to fill a ResultSet in Java with about 50,000 rows of 10 columns
and then insert them into another table using the executeBatch method of PreparedStatement.
To make the process faster I did some research and found that the fetchSize plays an important role while reading data into the ResultSet.
Having a very low fetchSize can result in too many trips to the server, and a very high fetchSize can block the network resources, so I experimented a bit and settled on an optimum size that suits my infrastructure.
I am reading this ResultSet and creating insert statements to insert the rows into another table in a different database.
Something like this (just a sample, not real code):
for (int i = 0; i < 50000; i++) {
    statement.setString(1, "a#a.com");
    statement.setLong(2, 1);
    statement.addBatch();
}
statement.executeBatch();
Will the executeBatch method try to send all the data at once?
Is there a way to define the batch size?
Is there any better way to speed up the process of bulk insertion?
While updating in bulk (50,000 rows, 10 cols), is it better to use an updatable ResultSet or a PreparedStatement with batch execution?

I'll address your questions in turn.
Will the executeBatch method try to send all the data at once?
This can vary with each JDBC driver, but the few I've studied will iterate over each batch entry and send the arguments together with the prepared statement handle each time to the database for execution. That is, in your example above, there would be 50,000 executions of the prepared statement with 50,000 pairs of arguments, but these 50,000 steps can be done in a lower-level "inner loop," which is where the time savings come in. As a rather stretched analogy, it's like dropping out of "user mode" down into "kernel mode" and running the entire execution loop there. You save the cost of diving in and out of that lower-level mode for each batch entry.
Is there a way to define the batch size?
You've defined it implicitly here by pushing 50,000 argument sets in before executing the batch via Statement#executeBatch(). A batch size of one is just as valid.
Is there any better way to speed up the process of bulk insertion?
Consider opening a transaction explicitly before the batch insertion, and commit it afterward. Don't let either the database or the JDBC driver impose a transaction boundary around each insertion step in the batch. You can control the JDBC layer with the Connection#setAutoCommit(boolean) method. Take the connection out of auto-commit mode first, then populate your batches, start a transaction, execute the batch, then commit the transaction via Connection#commit().
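To make that concrete, here is a minimal sketch of that sequence, assuming an open java.sql.Connection named conn; the table and column names are placeholders, not taken from the question:
conn.setAutoCommit(false);                       // no implicit transaction per insert
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO target_table (email, user_id) VALUES (?, ?)")) {
    for (int i = 0; i < 50000; i++) {
        ps.setString(1, "a#a.com");
        ps.setLong(2, 1);
        ps.addBatch();
    }
    ps.executeBatch();                           // one round of execution for the whole batch
    conn.commit();                               // one commit instead of 50,000
} catch (SQLException e) {
    conn.rollback();                             // undo the partial work on failure
    throw e;
}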
This advice assumes that your insertions won't be contending with concurrent writers, and assumes that these transaction boundaries will give you sufficiently consistent values read from your source tables for use in the insertions. If that's not the case, favor correctness over speed.
Is it better to use an updatable ResultSet or a PreparedStatement with batch execution?
Nothing beats testing with your JDBC driver of choice, but I expect the latter (PreparedStatement with Statement#executeBatch()) will win out here. The statement handle may maintain an associated list or array of "batch arguments," with each entry being an argument set added via Statement#addBatch() since the last call to Statement#executeBatch() (or Statement#clearBatch()). The list grows with each call to addBatch(), and is not flushed until you call executeBatch(). Hence, the Statement instance is really acting as an argument buffer; you're trading memory for convenience (using the Statement instance in lieu of your own external argument set buffer).
Again, you should consider these answers general and speculative so long as we're not discussing a specific JDBC driver. Each driver varies in sophistication, and each will vary in which optimizations it pursues.

The batch will be executed "all at once" - that's what you've asked it to do.
50,000 seems a bit large to be attempting in one call. I would break it up into smaller chunks of 1,000, like this:
final int BATCH_SIZE = 1000;
final int DATA_SIZE = 50000;               // total number of rows to insert
for (int i = 0; i < DATA_SIZE; i++) {
    statement.setString(1, "a#a.com");
    statement.setLong(2, 1);
    statement.addBatch();
    if (i % BATCH_SIZE == BATCH_SIZE - 1)
        statement.executeBatch();
}
if (DATA_SIZE % BATCH_SIZE != 0)
    statement.executeBatch();
50,000 rows shouldn't take more than a few seconds.

If the data is just coming from one or more tables in the same DB and needs no intervention (no alterations to the result set), then call statement.executeUpdate(SQL) with an INSERT-SELECT statement. This is quicker since there is no overhead: no data leaves the database, and the entire operation runs on the DB rather than in the application.
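As a rough illustration of that approach (table and column names are invented here; connection is an open java.sql.Connection to the database):
try (Statement stmt = connection.createStatement()) {
    // The copy runs entirely inside the database; no rows travel to the JVM.
    int inserted = stmt.executeUpdate(
            "INSERT INTO target_table (email, user_id) " +
            "SELECT email, user_id FROM source_table");
    System.out.println(inserted + " rows copied");
}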

Bulk unlogged update will not give you the improved performance you want the way you are going about it. See this

Related

Java multi-threaded handling of a rowset

I'm pulling down a table full of data, and for every row I need to do a bit of formatting and then push it out to a REST API.
I use a PostgreSQL database and a Java implementation; the idea is to pull all the data down, get the row count, and spin up threads to handle a chunk at a time.
I've got the connection set up and am pulling the table into a cached row set, using last(), getRow() and beforeFirst() to get the row count.
I'm trying to find a way to split out a chunk of a rowset and hand it off to be handled, but I can't seem to see anything to do this.
There's LIMIT x and the like, but I want to avoid numerous database calls with data of this size.
Any ideas would be greatly appreciated.
Here's the kind of thing I'm looking at
RowSet rst = RowSetProvider.newFactory().createCachedRowSet();
rst.setUrl(url);
rst.setUsername(username);
rst.setPassword(password);
String cmd = "select * from event_log";
rst.setCommand(cmd);
rst.execute();
ResultSetMetaData rsmd = rst.getMetaData();
int columnsNumber = rsmd.getColumnCount();
rst.last();
int size = rst.getRow();
int maxPerThread = 1000;
rst.beforeFirst();
int threadsToCreate = size / maxPerThread;
for (int loopCount = 0; loopCount < threadsToCreate; loopCount++) {
    // Create chunk
    // Create thread
    // Pass chunk into thread and start it
    // Once chunk is finished, the thread and chunk are destroyed
}
This is the proper way to think about JDBC interactions:
All queries are like an ad-hoc view: SELECT foo, bar BETWEEN a AND b AS baz FROM foo INNER JOIN whatever; - this effectively creates a new temporary table.
A ResultSet is a live, interactive concept: a ResultSet is not a dump of the returned data. It is like the relationship between a FileInputStream and a file on disk: the ResultSet has methods that give you data, and that data is probably obtained by chatting to the database, 'live', to obtain this information. The ResultSet itself holds only a few handles, not the actual data, though it may do some caching; you have no idea.
As a consequence:
ResultSet is utterly non-parallelizable. If you share a ResultSet object with more than one thread, you wrote a bug, and you can't recover from there.
In many DBs, 'ask for the length' is tantamount to running the entire query soup to nuts, and is therefore quite slow. You probably don't want to do that, and there is no real reason to do that from the perspective of 'I want to concurrently process the information I received'. You've picked the wrong approach.
ResultSets can (and generally, for performance reasons, should be!) configured as 'forward only', meaning: You can advance by one row by calling .next(), and once you did that, you can't go back. This significantly reduces the load on the DB server, as it doesn't have to be prepared to properly respond to the request to hop back to the start.
Here's what I suggest you do:
You have a single 'controller' thread which has the ResultSet and runs the query.
Once the query returns, you have no idea how many records you do have. But you do know how much you want to parallelize - how many threads you want to be concurrently churning away at processing this data.
Thus, the answer: Spin up that many threads in the form of an ExecutorService. Then, have your controller pull rows (call resultSet.next() and pull all data into Java types by invoking the various .getFoo(idxOrColName) methods), marshalling it all into a single Java object. I suggest you write a POJO that represents one row's worth of data and create one for each row.
Then, your controller thread will take this object and consider it 'a job'.
You've now reduced the problem to a basic fork/join-style strategy: you have one thread that produces jobs, and you have some code that will take a single job and complete it. I've just described what ExecutorService and friends are designed to do.
It is crucial that the ResultSet object is not accessible to your processor threads. There is no point in pulling rows from the DB in parallel, because the DB isn't parallel and wouldn't be able to give you this information any faster than a single thread. The only parallelising win you can score here is to do the processing of the data in a concurrent fashion, which is why the above model cannot be improved upon without much more drastic changes.
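A rough sketch of that controller-plus-workers shape, assuming a recent JDK; EventRow, the column names, and process() are placeholders for your own row POJO, your table, and your REST call:
import java.sql.*;
import java.util.concurrent.*;

class RowPump {
    // Hypothetical one-row POJO; replace the fields with your actual columns.
    record EventRow(long id, String payload) {}

    private final ExecutorService pool = Executors.newFixedThreadPool(8); // tune to cores / API limits

    void pumpRows(Connection connection) throws Exception {
        try (Statement st = connection.createStatement();
             ResultSet rs = st.executeQuery("select * from event_log")) {
            while (rs.next()) {
                // Only this controller thread ever touches the ResultSet:
                EventRow row = new EventRow(rs.getLong("id"), rs.getString("payload"));
                pool.submit(() -> process(row));      // workers only see the finished POJO
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    void process(EventRow row) {
        // format the row and push it to the REST API (left out here)
    }
}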
If you're looking for drastic redesigns, you need to 'pre-chunk'. Let's say, for example, that you already know you have a database with a million rows, and each row has the property that it has a completely random ID. You also know you have X processor threads, where X is a dynamic number that depends on many factors, such as how many CPU cores the hardware you run on has.
Then:
You fire up X threads. You tell each thread its index (so, if you have 7 threads, one has 'index 0', another has 'index 1', all the way up to 'index 6'), and how many total threads there are.
Then, each thread runs the following query:
SELECT * FROM jobs WHERE unid % 7 = 5;
That's the query the 6th job thread would run.
This guarantees that each thread is running about an equal number of jobs, give or take.
Generally this is less efficient than the previous model, given that this most likely means the DB is just doing more work (running the same query 7-fold, instead of only once), and any given worker thread may start idling whilst others are still running, vs. the controller-that-pulls-and-hands-jobs-out model where you won't run into the situation that one thread is done whilst others still have lots of jobs left.
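If you do go that route, each worker thread would run its own slice of the query through its own Connection; a minimal sketch (unid and jobs come from the example above, threadCount and threadIndex are whatever you assigned at spin-up):
// threadCount = X, threadIndex = 0 .. X-1, handed to the thread when it was created.
// Each thread must use its own Connection.
try (PreparedStatement ps = connection.prepareStatement(
        "SELECT * FROM jobs WHERE unid % ? = ?")) {
    ps.setInt(1, threadCount);
    ps.setInt(2, threadIndex);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // process this thread's share of the rows
        }
    }
}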
NB: RowSet and ResultSet work effectively the exact same way. In fact, the DB version of RowSet (JdbcRowSet) is implemented as a light wrapper around ResultSet.

JDBC executeBatch() hangs without error in PostgreSQL

I am trying to load 50,000 rows into a table with 200 columns. I call executeBatch() every 1,000 rows, and I get a lock on this table. The same code works for MS SQL and Oracle, but with PostgreSQL I get this issue. When I decrease the executeBatch interval from 1,000 to 75, everything works correctly.
Is there any parameter in the config file which controls the batch buffer size?
Same issue http://www.postgresql.org/message-id/c44a1bc0-dcb1-4b57-8106-e50f9303b7d1#79g2000hsk.googlegroups.com
When I execute insert statements in batch for tables with a large number of columns, the call to statement.executeBatch() hangs.
This is specific to the PostgreSQL JDBC driver.
To avoid this issue we should increase the socket buffer size parameters (SO_SNDBUF, SO_RCVBUF).
On Windows we have to create these parameters in the registry:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Afd\Parameters]
DefaultReceiveWindow (type=DWORD) = 1640960 (decimal)
DefaultSendWindow (type=DWORD) = 1640960 (decimal)
I got this number (1640960) from the internet as a general recommendation!
And it works for me.
Generally you need to look for the following things.
Are you able to actually get a lock for the table?
Do you have other locks in your Java code that you are waiting on?
In general the first place to check is the pg_stat_activity system view, which will show you the query being executed and whether it is active, idle, waiting, etc. Then, if it is waiting (i.e. waiting is t), you will want to check the pg_locks view and see what else may have a lock on anything in the relation.
If waiting is not true, then you are better off checking your Java code for client-side locks, but since this works for MS SQL and Oracle, I assume this is less of an issue.
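For example, a quick way to peek at that view from another connection (this sketch assumes a pre-9.6 PostgreSQL, where pg_stat_activity still has a boolean waiting column; newer releases expose wait_event/wait_event_type instead, and adminConnection is just a second, separate connection):
try (Statement st = adminConnection.createStatement();
     ResultSet rs = st.executeQuery(
             "SELECT pid, state, waiting, query FROM pg_stat_activity")) {
    while (rs.next()) {
        System.out.printf("pid=%d state=%s waiting=%s query=%s%n",
                rs.getInt("pid"), rs.getString("state"),
                rs.getBoolean("waiting"), rs.getString("query"));
    }
}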
There is one other thing to be aware of. I am not sure about MS SQL and Oracle here, but PostgreSQL limits you to one query at a time per connection. You might have some locking there?

MySQL and Java: Insert efficiently as data comes in via events with high frequency

When an external event occurs (incoming measurement data), an event handler in my Java code is called. The data should be written to a MySQL database. Due to the high frequency of these calls (>1000 per sec) I'd like to handle the inserts efficiently. Unfortunately I'm not a professional developer and am an idiot with databases.
Neglecting the efficiency aspect my code would look roughly like this:
public class X {
    public void eventHandler(String data) throws SQLException {
        Connection connection = DriverManager.getConnection(…);   // connection details elided
        PreparedStatement statement = connection.prepareStatement("insert …");
        statement.setString(1, data);
        statement.executeUpdate();
        statement.close();
        connection.close();
    }
}
My understanding is that by calling addBatch() and executeBatch() on statement I could limit the physical disk access to let's say every 1000th insert. However as you can see in my code sketch above the statement object is newly instantiated with every call of eventHandler(). Therefore my impression is that the batch mechanism won't be helpful in this context. Same for turning off auto-commit and then calling commit() on the connection object since that one is closed after every insert.
I could turn connection and statement from local variables into class members and reuse them during the whole runtime of the program. But wouldn't it be bad style to keep the database connection open at all time?
A solution would be to buffer the data manually and then write to the database only after collecting a proper batch. But so far I still hope that you smart guys will tell me how to let the database do the buffering for me.
I could turn connection and statement from local variables into class members and reuse them during the whole runtime of the program. But wouldn't it be bad style to keep the database connection open at all time?
Considering that most (database-)connection pools are usually configured to keep at least one or more connections open at all times, I wouldn't call it "bad style". This is to avoid the overhead of starting a new connection on each database operation (unless necessary, if all the already opened connections are in use and the pool allows for more).
I'd probably go with some form of batching in this case (but of course I don't know all your requirements/environment etc). If the data doesn't need to be immediately available somewhere else, you could build some form of job queue for writing the data, push the incoming data there, and let other thread(s) take care of writing it to the database in batches of N. Take a look at what classes are available in the java.util.concurrent package.
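One hedged sketch of that idea, using a BlockingQueue and a single writer thread that drains it in batches (the table name, column, and sizes are invented; a real version would also need shutdown handling and an error policy):
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

class MeasurementWriter {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called from the event handler: enqueue only, no JDBC work on the hot path.
    void eventHandler(String data) {
        queue.offer(data);
    }

    // One background thread drains the queue and writes in batches.
    void startWriter(String url, String user, String pass) {
        Thread writer = new Thread(() -> {
            try (Connection con = DriverManager.getConnection(url, user, pass);
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO measurements (value) VALUES (?)")) {  // placeholder table/column
                con.setAutoCommit(false);
                List<String> chunk = new ArrayList<>(1000);
                while (true) {
                    chunk.clear();
                    chunk.add(queue.take());          // block until at least one item arrives
                    queue.drainTo(chunk, 999);        // then grab whatever else is already queued
                    for (String value : chunk) {
                        ps.setString(1, value);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                    con.commit();                     // one commit per drained chunk
                }
            } catch (InterruptedException | SQLException e) {
                Thread.currentThread().interrupt();
            }
        }, "db-writer");
        writer.setDaemon(true);
        writer.start();
    }
}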
I would suggest you use a LinkedList<> to buffer the data (like a queue), then store the data into the DBMS as and when required from a separate thread that runs at regular intervals (maybe every 2 seconds?).
See how to construct a queue using LinkedList in Java.

Java JDBC design pattern: handle many inserts

I would like to ask for some advice concerning my problem.
I have a batch that does some computation (multi-threading environment) and does some inserts into a table.
I would like to do something like a batch insert, meaning that once I get a query, I wait until I have 1000 queries, for instance, and then execute the batch insert (not doing it one by one).
I was wondering if there is any design pattern on this.
I have a solution in mind, but it's a bit complicated:
build a method that will receive the queries
add them to a list (the string and/or the statements)
do not execute until the list has 1000 items
The problem: how do I handle the end?
What I mean is, when do I execute the last 999 queries, since I'll never get to 1000?
What should I do ?
I'm thinking of a thread that wakes up every 5 minutes and checks the number of items in the list. If it wakes up twice and the number is the same, execute the existing queries.
Does anyone have a better idea?
Your database driver needs to support batch inserting. See this.
Have you established that your system is choking on network traffic because there is too much communication between the service and the database? If not, I wouldn't worry about batching until you are sure you need it.
You mention that in your plan you want to check every 5 minutes. That's an eternity. If you are going to get 1000 items in 5 minutes, you shouldn't need batching. That's ~ 3 a second.
Assuming you do want to batch, have a process wake up every 2 seconds and commit whatever is queued up. Don't wait five minutes. It might commit 0 rows, it might commit 10...who cares...With this approach, you don't need to worry that your arbitrary threshold hasn't been met.
I'm assuming that the inserts come in one at a time. If your incoming data comes in n at once, I would just commit every incoming request, no matter how many inserts happen. If your messages are coming in as some sort of messaging system, it's asynchronous anyway, so you shouldn't need to worry about batching. Under high load, the incoming messages just wait till there is capacity to handle them.
Add a commit kind of method to that API that will be called to confirm all items have been added. Also, the optimum batch size is somewhere in the range 20-50. After that the potential gain is outweighed by the bookkeeping necessary for a growing number of statements. You don't mention it explicitly, but of course you must use the dedicated batch API in JDBC.
If you need to keep track of many writers, each in its own thread, then you'll also need a begin kind of method and you can count how many times it was called, compared to how many times commit was called. Something like reference-counting. When you reach zero, you know you can flush your statement buffer.
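A rough sketch of that bookkeeping, with invented names, just to show the shape:
import java.sql.*;
import java.util.concurrent.atomic.AtomicInteger;

class BatchCollector {
    private static final int FLUSH_THRESHOLD = 50;      // the 20-50 range suggested above
    private final PreparedStatement statement;
    private final AtomicInteger openWriters = new AtomicInteger();
    private int pending = 0;

    BatchCollector(PreparedStatement statement) {
        this.statement = statement;
    }

    void begin() {                                      // each writer calls this before adding
        openWriters.incrementAndGet();
    }

    synchronized void add(String value) throws SQLException {
        statement.setString(1, value);
        statement.addBatch();
        if (++pending >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    synchronized void commit() throws SQLException {    // writer is done; the last one out flushes
        if (openWriters.decrementAndGet() == 0) {
            flush();
        }
    }

    private void flush() throws SQLException {          // caller holds the monitor
        if (pending > 0) {
            statement.executeBatch();
            pending = 0;
        }
    }
}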
This is a situation I have faced many times. According to your problem, you are creating a batch that has 1000 or more insert queries, all inserting into the same table repeatedly.
To avoid this type of situation you can build the insert query like this:
INSERT INTO table1 VALUES('4','India'),('5','Odisha'),('6','Bhubaneswar')
It executes only once, with multiple values. So you can keep all the values inside a collection (ArrayList, List, etc.), build a single query like the one above, and insert it once.
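For instance, a parameterized version of that multi-row insert, built from a list (the column names id and city are made up to match the sample values above):
// Build one INSERT with a (?, ?) group per element, then bind every value.
List<String[]> rows = List.of(
        new String[] {"4", "India"},
        new String[] {"5", "Odisha"},
        new String[] {"6", "Bhubaneswar"});
StringBuilder sql = new StringBuilder("INSERT INTO table1 (id, city) VALUES ");
for (int i = 0; i < rows.size(); i++) {
    sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
}
try (PreparedStatement ps = connection.prepareStatement(sql.toString())) {
    int idx = 1;
    for (String[] row : rows) {
        ps.setString(idx++, row[0]);
        ps.setString(idx++, row[1]);
    }
    ps.executeUpdate();                              // one statement, all rows at once
}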
You can also use the JDBC transaction API (commit(), rollback(), setAutoCommit(), etc.).
Hope it will help you.
All the best.

Batch insertion in JDBC

For bulk insertion we normally prefer a BATCH operation. How exactly is it optimized for faster insertion in JDBC?
Normally, by reducing network round-trips. If you are going to execute the same statement 100 times with 100 different sets of bind variables, for example, it would be much more efficient to send all 100 sets of bind variables to the database at once and get back all 100 results using a single network round-trip than it would be to incur 100 separate network round-trips in order to execute each query sequentially. If you tell the JDBC driver that you want to create a batch, the driver can minimize the number of times it needs to communicate with the database.
