For bulk insertion we normally prefer a BATCH operation. How exactly does it make insertion faster in JDBC?
Normally, by reducing network round-trips. If you are going to execute the same statement 100 times with 100 different sets of bind variables, for example, it would be much more efficient to send all 100 sets of bind variables to the database at once and get back all 100 results using a single network round-trip than it would be to incur 100 separate network round-trips to execute each query sequentially. If you tell the JDBC driver that you want to create a batch, the driver can minimize the number of times it needs to communicate with the database.
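A minimal sketch of that pattern (the helper name, table, and column are just placeholders, and the connection is assumed to be open already):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    // Hypothetical helper: all bind-variable sets are queued client-side and
    // shipped together when executeBatch() runs, instead of one round-trip per row.
    static void insertEmails(Connection conn, List<String> emails) throws SQLException {
        String sql = "INSERT INTO users (email) VALUES (?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String email : emails) {
                ps.setString(1, email);
                ps.addBatch();          // queued locally by the driver
            }
            ps.executeBatch();          // all parameter sets sent to the database together
        }
    }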
Related
Multiple instances of my multi-threaded (approx 10 threads) application are running on different machines (approx 10 machines), so overall 100 threads of this application are active simultaneously.
Each of these threads produces 4 output sets, each set containing 1k-5k rows. Each set is pushed to a single MySQL machine, same DB, same table (insert or update operation). So there are 4 tables, each consuming one of the 4 sets produced by every thread.
I am using MyBatis as the ORM. These threads may spend more time writing output to the DB than processing the requests.
How can I optimize the database writes in this case?
1. Use batch processing in MyBatis?
2. Write data to files which will be picked up by a single consumer thread & written into the DB?
3. Write each data set to a different file & use 4 consumer threads, each picking up data bound for the same table, so locking is minimized?
Please suggest other, better ways if possible.
Databases are made to handle concurrency. I'm not sure what exactly MyBatis brings into the picture (not a huge fan of ORMs in general), but if using it makes you start thinking about hacks like intermediate files and single-threaded updates, you are probably much better off ripping it out and writing to the DB with plain JDBC, which should have no problem handling your use case, provided you batch your updates adequately.
I have a typical scenario & need to understand the best possible way to handle it, so here it goes -
I'm developing a solution that will retrieve data from a remote SOAP-based web service & will then push this data to an Oracle database on the network.
Also, this will be a scheduled task that will execute every 15 minutes.
I have event queues on the remote service that contain the INSERT/UPDATE/DELETE operations done since the last retrieval, & once I retrieve the events for the last 15 minutes, it again adds events for the next retrieval.
Now, it's just pushing data to Oracle, so all my interactions are INSERT & UPDATE statements.
There are around 60 tables on Oracle, with some of them having 100+ columns. Moreover, for every 15-minute cycle there would be around 60-70 Inserts, 100+ Updates & 10-20 Deletes.
This will be an executable jar file that will terminate after the operation & will start again on the next 15-minute cycle.
So, I need to understand how I should handle WRITE operations (best practices) to improve performance for this application as a whole.
Current Test Code (on every cycle) -
Connects to remote service to get events.
Creates a connection with DB (single connection object).
Identifies the type of operation (INSERT/UPDATE/DELETE) & table on which it is done.
After above, calls the respective method based on type of operation & table.
Uses PreparedStatement with positional parameters, & retrieves each column value from the remote service & assigns it to the statement parameters.
Commits the statement & returns to the event-retrieval class to process the next event.
The above is repeated till all the retrieved events are processed, after which the program closes & then starts on the next cycle & everything repeats again.
Thanks for help !
If you are inserting or updating one row at a time, you can consider executing a batch insert or a batch update instead. Once you go past a certain number of rows, batching gives much better performance than issuing the statements one by one.
The number of DB operations you are talking about (200 every 15 minutes) is tiny and will be easy to finish in less than 15 minutes. Some concrete suggestions:
You should profile your application to understand where it is spending its time. If you don't do this, then you don't know what to optimize next and you don't know if something you did helped or hurt.
If possible, try to get all of the events in one round-trip to the remote server.
You should reuse the connection to the remote service (probably by using a library that supports connection persistence and reuse).
You should reuse the DB connections by using a connection pooling library rather than creating a new connection for each insert/update/delete. Believe it or not, creating the connection probably takes 100+ times as long as doing your DB operation once you have the connection in hand.
You should consider doing multiple (or all) of the database operations in the same transaction rather than creating a new transaction for each row that is changed. However, you should carefully consider your failure modes such that you don't lose any events (if that is an important consideration).
You should consider utilizing prepared statement caching. This may help, but maybe not if Oracle is configured properly.
You should consider trying to analyze your operations to find any that can be batched together. This can be a lot faster if you have some "hot" operations that get done often.
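A minimal sketch combining the pooling, single-transaction, and batching suggestions above, assuming HikariCP as the pooling library (the JDBC URL, credentials, table, and column are placeholders):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    class EventWriter {
        // Created once per process and reused for every write in that run,
        // instead of opening a raw connection for each insert/update/delete.
        private static final HikariDataSource POOL = createPool();

        private static HikariDataSource createPool() {
            HikariConfig cfg = new HikariConfig();
            cfg.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE");  // placeholder URL
            cfg.setUsername("app");
            cfg.setPassword("secret");
            return new HikariDataSource(cfg);
        }

        // Hypothetical helper: one cycle's inserts, batched and committed in a single transaction.
        static void writeCycle(List<String> payloads) throws SQLException {
            try (Connection conn = POOL.getConnection()) {
                conn.setAutoCommit(false);
                try (PreparedStatement ps =
                         conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")) {
                    for (String p : payloads) {
                        ps.setString(1, p);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
                conn.commit();   // one commit per cycle rather than one per row
            }
        }
    }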
"I've a typical scenario"
No you haven't. You have a bespoke architecture, with a unique data model, unique data and unique business requirements. That's not a bad thing, it's the state of pretty much every computer system that's not been bought off-the-shelf (and even some of them).
So, it's an experiment and you must approach it as such. There is no "best practice". Try various things and see what works best.
"need to understand best possible way to handle this"
You will improve your chances of success enormously by hiring somebody who understands Oracle databases.
I am using a single standalone MongoDB server with no special topology like replication or sharding. Currently I have an issue where MongoDB does not support more than 500 parallel requests. Note that I am using only one instance of MongoClient; the remaining threads are used for inserts. I am using the Java executor framework to create the threads, and these threads are used to insert data into a collection [all inserts go into the same collection].
You should queue the requests before you issue them towards the database. There is no point in requesting 500 things from your database in parallel. Remember, a single request comes with costs, memory-wise, locking-wise, and so on. You are actually wasting resources by asking your database for too much at once - I mean this request-wise, not data-wise.
So use a queue (or more) and pool up the requests. From that pool you feed your worker threads (let's say 5 or 10 are enough) and that's it.
Take a look at the Future interface in java.util.concurrent. Asynchronous processing here looks like the approach with the highest throughput and the lowest resource impact.
But check the MongoDB driver first. I would not be surprised if they have implemented it this way already. If that is the case, you just have to limit yourself by using a queue so that only, let's say, 10 or 100 requests at once are being handled by the database driver. Do some performance checks, tweaking the number of actual requests sent to the database.
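A rough sketch of the queue-plus-worker-pool idea, assuming the current MongoDB Java sync driver (the database, collection, queue size, and thread count are all illustrative):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class BoundedMongoWriter {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017"); // one client, as in the question
            MongoCollection<Document> coll = client.getDatabase("mydb").getCollection("mycoll");

            BlockingQueue<Document> queue = new ArrayBlockingQueue<>(10_000);  // producers block when it is full
            ExecutorService workers = Executors.newFixedThreadPool(10);        // only 10 requests hit the DB at once

            for (int i = 0; i < 10; i++) {
                workers.submit(() -> {
                    try {
                        while (true) {
                            coll.insertOne(queue.take());   // drain the queue one bounded request at a time
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Producer threads call queue.put(doc) instead of talking to the driver directly.
        }
    }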
I would like to ask for some advices concerning my problem.
I have a batch job that does some computation (multi-threading environment) and does some inserts into a table.
I would like to do something like a batch insert, meaning that once I get a query, I wait until I have 1000 queries, for instance, and then execute the batch insert (not doing it one by one).
I was wondering if there is any design pattern on this.
I have a solution in mind, but it's a bit complicated:
build a method that will receive the queries
add them to a list (the string and/or the statements)
do not execute until the list has 1000 items
The problem: how do I handle the end?
What I mean is, for the last 999 queries, when do I execute them, since I'll never get to 1000?
What should I do?
I'm thinking of a thread that wakes up every 5 minutes and checks the number of items in the list. If it wakes up twice and the number is the same, execute the existing queries.
Does anyone have a better idea?
Your database driver needs to support batch inserting. See this.
Have you established your system is choking on network traffic because there is too much communication between the service and the database? If not, I wouldn't worry about batching, until you are sure you need it.
You mention that in your plan you want to check every 5 minutes. That's an eternity. If you are going to get 1000 items in 5 minutes, you shouldn't need batching. That's ~ 3 a second.
Assuming you do want to batch, have a process wake up every 2 seconds and commit whatever is queued up. Don't wait five minutes. It might commit 0 rows, it might commit 10...who cares...With this approach, you don't need to worry that your arbitrary threshold hasn't been met.
I'm assuming that the inserts come in one at a time. If your incoming data comes in n at once, I would just commit every incoming request, no matter how many inserts happen. If your messages are coming in as some sort of messaging system, it's asynchronous anyway, so you shouldn't need to worry about batching. Under high load, the incoming messages just wait till there is capacity to handle them.
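A small sketch of that "commit whatever is queued every couple of seconds" idea; the queue, table, and column are illustrative, and the connection is assumed to be managed elsewhere:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class PeriodicFlusher {
        private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
        private final Connection conn;

        PeriodicFlusher(Connection conn) {
            this.conn = conn;
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(this::flush, 2, 2, TimeUnit.SECONDS);  // wake up every 2 seconds
        }

        void add(String value) {             // called by the computation threads
            pending.add(value);
        }

        private void flush() {
            List<String> drained = new ArrayList<>();
            for (String v; (v = pending.poll()) != null; ) {
                drained.add(v);
            }
            if (drained.isEmpty()) return;   // nothing arrived since the last tick; that's fine
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")) {
                for (String v : drained) {
                    ps.setString(1, v);
                    ps.addBatch();
                }
                ps.executeBatch();           // no arbitrary 1000-row threshold, no leftover tail
            } catch (Exception e) {
                e.printStackTrace();         // real code would log and retry
            }
        }
    }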
Add a commit kind of method to that API that will be called to confirm all items have been added. Also, the optimum batch size is somewhere in the range 20-50. After that the potential gain is outweighed by the bookkeeping necessary for a growing number of statements. You don't mention it explicitly, but of course you must use the dedicated batch API in JDBC.
If you need to keep track of many writers, each in its own thread, then you'll also need a begin kind of method and you can count how many times it was called, compared to how many times commit was called. Something like reference-counting. When you reach zero, you know you can flush your statement buffer.
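A compact sketch of that reference-counted begin/commit idea (all names hypothetical; the actual flush would use the JDBC batch API mentioned above):

    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    class BatchGate {
        private int openWriters = 0;
        private final List<String> buffer = new ArrayList<>();

        synchronized void begin()          { openWriters++; }
        synchronized void add(String row)  { buffer.add(row); }

        synchronized void commit() throws SQLException {
            if (--openWriters == 0) {      // last writer finished: flush everything buffered
                flushToDatabase(buffer);
                buffer.clear();
            }
        }

        private void flushToDatabase(List<String> rows) throws SQLException {
            // addBatch()/executeBatch() over a PreparedStatement would go here
        }
    }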
This is a common situation; I have faced it many times. According to your problem, you are creating a batch of 1000 or more insert queries, all inserting into the same table over and over.
To avoid this kind of situation you can write the insert query like this:
INSERT INTO table1 VALUES('4','India'),('5','Odisha'),('6','Bhubaneswar')
It executes only once with multiple value groups. So you can keep all the values inside a collection (ArrayList, List, etc.) and finally build a query like the one above and run it once.
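A rough sketch of building such a multi-row insert from a collection; the table and columns are placeholders, and a PreparedStatement keeps the values parameterized rather than concatenated into the SQL:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    // Hypothetical helper: one INSERT statement with a (?, ?) group per row.
    static void insertAll(Connection conn, List<String[]> rows) throws SQLException {
        if (rows.isEmpty()) return;
        StringBuilder sql = new StringBuilder("INSERT INTO table1 (id, city) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            int p = 1;
            for (String[] row : rows) {
                ps.setString(p++, row[0]);
                ps.setString(p++, row[1]);
            }
            ps.executeUpdate();   // a single statement carrying every value group
        }
    }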
Also, you can use the JDBC transaction API (commit(), rollback(), setAutoCommit(), etc.).
Hope it will help you.
All the best.
I am trying to fill a ResultSet in Java with about 50,000 rows of 10 columns
and then insert them into another table using the executeBatch method of PreparedStatement.
To make the process faster I did some research and found that, while reading data into the ResultSet, the fetchSize plays an important role.
Having a very low fetchSize can result into too many trips to the server and a very high fetchSize can block the network resources, so I experimented a little bit and set up an optimum size that suits my infrastructure.
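Setting it looks roughly like this (the query and the size of 500 are just placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    static ResultSet openSource(Connection conn) throws SQLException {
        PreparedStatement select = conn.prepareStatement("SELECT email, score FROM source_table");
        select.setFetchSize(500);   // rows pulled from the server per round-trip (a driver hint)
        return select.executeQuery();
    }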
I am reading this resultSet and creating insert statements to insert into another table of a different database.
Something like this (just a sample, not real code):
for (int i = 0; i < 50000; i++) {
    statement.setString(1, "a#a.com");
    statement.setLong(2, 1);
    statement.addBatch();
}
statement.executeBatch();
Will the executeBatch method try to send all the data at once?
Is there a way to define the batch size?
Is there any better way to speed up the process of bulk insertion?
While updating in bulk (50,000 rows, 10 cols), is it better to use an updatable ResultSet or a PreparedStatement with batch execution?
I'll address your questions in turn.
Will the executeBatch method try to send all the data at once?
This can vary with each JDBC driver, but the few I've studied will iterate over each batch entry and send the arguments together with the prepared statement handle each time to the database for execution. That is, in your example above, there would be 50,000 executions of the prepared statement with 50,000 pairs of arguments, but these 50,000 steps can be done in a lower-level "inner loop," which is where the time savings come in. As a rather stretched analogy, it's like dropping out of "user mode" down into "kernel mode" and running the entire execution loop there. You save the cost of diving in and out of that lower-level mode for each batch entry.
Is there a way to define the batch size?
You've defined it implicitly here by pushing 50,000 argument sets in before executing the batch via Statement#executeBatch(). A batch size of one is just as valid.
Is there any better way to speed up the process of bulk insertion?
Consider opening a transaction explicitly before the batch insertion, and commit it afterward. Don't let either the database or the JDBC driver impose a transaction boundary around each insertion step in the batch. You can control the JDBC layer with the Connection#setAutoCommit(boolean) method. Take the connection out of auto-commit mode first, then populate your batches, start a transaction, execute the batch, then commit the transaction via Connection#commit().
This advice assumes that your insertions won't be contending with concurrent writers, and assumes that these transaction boundaries will give you sufficiently consistent values read from your source tables for use in the insertions. If that's not the case, favor correctness over speed.
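A minimal sketch of that transaction-around-the-batch pattern; the helper name, table, and columns are placeholders, and the connection is assumed to be open:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    static void copyRows(Connection conn, List<String> emails) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);                        // no per-row transaction boundaries
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO target_table (email, score) VALUES (?, ?)")) {
            for (String email : emails) {
                ps.setString(1, email);
                ps.setLong(2, 1);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();                                // one commit for the whole batch
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }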
Is it better to use an updatable ResultSet or PreparedStatement with batch execution?
Nothing beats testing with your JDBC driver of choice, but I expect that the latter, PreparedStatement with Statement#executeBatch(), will win out here. The statement handle may have an associated list or array of "batch arguments," with each entry being the argument set captured by a call to Statement#addBatch() and cleared by Statement#executeBatch() (or Statement#clearBatch()). The list will grow with each call to addBatch(), and not be flushed until you call executeBatch(). Hence, the Statement instance is really acting as an argument buffer; you're trading memory for convenience (using the Statement instance in lieu of your own external argument set buffer).
Again, you should consider these answers general and speculative so long as we're not discussing a specific JDBC driver. Each driver varies in sophistication, and each will vary in which optimizations it pursues.
The batch will be sent "all at once" - that's what you've asked it to do.
50,000 seems a bit large to be attempting in one call. I would break it up into smaller chunks of 1,000, like this:
final int BATCH_SIZE = 1000;
for (int i = 0; i < DATA_SIZE; i++) {
    statement.setString(1, "a#a.com");
    statement.setLong(2, 1);
    statement.addBatch();
    if (i % BATCH_SIZE == BATCH_SIZE - 1)
        statement.executeBatch();
}
if (DATA_SIZE % BATCH_SIZE != 0)
    statement.executeBatch();
50,000 rows shouldn't take more than a few seconds.
If it's just data from one or more tables in the DB to be inserted into this table, with no intervention (no alterations to the result set), then call statement.executeUpdate(SQL) to perform an INSERT-SELECT statement; this is quicker since there is no overhead. No data goes outside of the DB and the entire operation runs on the DB, not in the application.
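A minimal illustration of that approach (the tables and columns are placeholders; the rows never leave the database server):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    static int copyInsideDb(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            return stmt.executeUpdate(
                "INSERT INTO target_table (email, score) " +
                "SELECT email, score FROM source_table");
        }
    }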
Bulk unlogged update will not give you the improved performance you want the way you are going about it. See this