I have a file with 1 trillion records. The batch size is 1000, after which the batch is executed.
Should I commit after each batch, or commit just once after all 1 trillion records have been executed in batches of 1000?
{
// Loop over 1 trillion records
statement.addBatch();
if (++count % 1000 == 0)
{
statement.executeBatch();
// SHOULD I COMMIT HERE AFTER EACH BATCH ???
}
} // End loop
// SHOULD I COMMIT HERE ONCE ONLY ????
A commit marks the end of a successful transaction. So the commit should theoretically happen after all rows have been executed successfully.
If the execution statements are completely independent, then each one should have its own commit (in theory).
But the database system may impose limitations that require splitting the rows into several batches, each with its own commit. Since a database has to reserve space to be able to roll back changes until they are committed, the "cost" of a huge transaction can be very high.
So the answer is: it depends on your requirements, your database, and your environment.
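As a rough JDBC sketch of the commit-after-each-batch option (the table `t`, its column, and the row source are invented for illustration, not taken from the question):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class CommitPerBatch {
    static final int BATCH_SIZE = 1000;

    // Commit after each executed batch, so the database only ever has
    // to keep rollback information for at most BATCH_SIZE rows.
    static void insertAll(Connection conn, List<String> rows) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO t (col) VALUES (?)")) {
            int count = 0;
            for (String row : rows) {
                ps.setString(1, row);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    conn.commit();   // end one small transaction per batch
                }
            }
            ps.executeBatch();       // flush the final partial batch
            conn.commit();
        }
    }

    // Number of commits the loop above performs for a given row count:
    // one per full batch, plus the trailing commit after the loop.
    static long commitCount(long rows, int batchSize) {
        return rows / batchSize + 1;
    }
}
```

The single-commit variant is identical except that the in-loop `conn.commit()` is removed, leaving only the final one.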
Mostly it depends on what you want to achieve; usually you need to compromise on something to achieve something else. For example, I delete 3 million records that are no longer being accessed by my users, using a stored procedure.
If I execute the delete all at once, the lock gets escalated to a table lock and my other users start getting timeout issues in our applications, because SQL Server (I know the question is not specific to SQL Server, but this may help debug the problem) locks the table to give the deletion process better performance. If you have such a case, you should never go for a batch larger than 5000 rows (see Lock Escalation Threshold).
With my current plan, I am deleting 3000 rows per batch, and only key locks occur, which is good; I commit after every half-million rows processed.
So, if you do not mind simultaneous users hitting the table, you can delete the huge number of records in one go, provided your database server has enough log space and processing speed; but 1 trillion records are a mess, so you'd better proceed with batch-wise deletion. And if 1 trillion is the total number of records in the table and you want to delete all of them, I'd suggest going for a TRUNCATE TABLE instead.
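The delete-in-chunks loop described above can be sketched independently of any particular driver (the names here are mine, not from the answer): `deleteChunk` stands in for a `PreparedStatement.executeUpdate()` running something like `DELETE TOP (3000) FROM t WHERE ...`, and `commit` stands in for `Connection.commit()`.

```java
import java.util.function.IntSupplier;

public class ChunkedDelete {
    // Repeatedly delete one small chunk (staying under the ~5000-row
    // lock escalation threshold) until nothing is left, committing
    // every commitEvery rows so the transaction log stays bounded.
    static long purge(IntSupplier deleteChunk, Runnable commit, long commitEvery) {
        long total = 0;
        long sinceCommit = 0;
        int deleted;
        while ((deleted = deleteChunk.getAsInt()) > 0) {
            total += deleted;
            sinceCommit += deleted;
            if (sinceCommit >= commitEvery) {
                commit.run();
                sinceCommit = 0;
            }
        }
        commit.run(); // final commit for the trailing partial interval
        return total;
    }
}
```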
Related
I have a service that updates multiple tables through multiple services within a transaction boundary; if one of them fails, everything needs to be rolled back. One of these SQL updates has around 2k+ records to update, done in two batches of 1000 records at a time. The problem is that this update is sometimes taking too long, around 2 minutes, and the transaction is timing out. Is there a way this SQL update can be performed spanning multiple threads, each thread updating 100 records? Thanks in advance.
I create multiple connections and make batch inserts into myTable simultaneously (multithreading):
insertString = "INSERT INTO ... + values + ") ";
insertTable.addBatch(insertString);
insertTable.executeBatch();
insertTable.clearBatch();
Sometimes it works fine, but in other cases it hangs. I understand this is because I am inserting into the same table, so it gets locked.
How can I write an INSERT statement such that it does not lock the table?
Are there any special transaction start commands that can prevent the table from being locked? In addition, I am curious why it works fine some of the time.
PS: the maximum number of connections I used was 1024 (it worked perfectly sometimes).
Thanks
1024 sessions is totally insane; your DBA should block your user for that.
You probably get blocked sessions when you load multiple rows with the same PI (primary index).
Single-row INSERTs are the slowest way to load data. A single session with a batch size of a few thousand will outperform dozens of single-row sessions (but then use only one session).
Or switch to JDBC FastLoad if the target table is empty.
Check
http://developer.teradata.com/connectivity/articles/speed-up-your-jdbcodbc-applications
Is this a staging table?
I have a Java application that runs database purge queries on startup. Depending on the user, these queries could wind up removing hundreds of thousands of records. I've broken up the queries so that they're limited to 5000 records each, with some breathing room between each query running.
The table uses InnoDB.
An example query:
DELETE FROM table WHERE epoch <= '1388094517' LIMIT 5000;
However, certain users are seeing various errors with lock problems:
java.sql.SQLException: The total number of locks exceeds the lock table size
java.sql.BatchUpdateException: Lock wait timeout exceeded; try restarting transaction
Advising MySQL config changes is pretty much not possible because this is a distributed application. What steps can I take to make sure the delete queries do not cause locking errors?
The application begins logging data on startup and needs to be able to write to the database while the current delete query is running.
I would like to ask for some advice concerning my problem.
I have a batch job that does some computation (in a multithreaded environment) and does some inserts into a table.
I would like to do something like a batch insert: once I get a query, wait until I have, say, 1000 queries, and then execute the batch insert (rather than executing them one by one).
I was wondering if there is any design pattern on this.
I have a solution in mind, but it's a bit complicated:
build a method that will receive the queries
add them to a list (the strings and/or the statements)
do not execute until the list has 1000 items
The problem: how do I handle the end?
What I mean is, when do I execute the last 999 queries, since I'll never get to 1000?
What should I do ?
I'm thinking of a thread that wakes up every 5 minutes and checks the number of items in the list. If it wakes up twice and the number is the same, it executes the existing queries.
Does anyone have a better idea?
Your database driver needs to support batch inserting. See this.
Have you established that your system is choking on network traffic because there is too much communication between the service and the database? If not, I wouldn't worry about batching until you are sure you need it.
You mention that in your plan you want to check every 5 minutes. That's an eternity. If you are going to get 1000 items in 5 minutes, you shouldn't need batching. That's ~ 3 a second.
Assuming you do want to batch, have a process wake up every 2 seconds and commit whatever is queued up. Don't wait five minutes. It might commit 0 rows, it might commit 10...who cares...With this approach, you don't need to worry that your arbitrary threshold hasn't been met.
I'm assuming that the inserts come in one at a time. If your incoming data comes in n at once, I would just commit every incoming request, no matter how many inserts happen. If your messages are coming in as some sort of messaging system, it's asynchronous anyway, so you shouldn't need to worry about batching. Under high load, the incoming messages just wait till there is capacity to handle them.
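A minimal sketch of that wake-up-and-commit-whatever-is-there approach (the names are mine; `sink` stands in for the JDBC addBatch/executeBatch/commit sequence, and in real use a `ScheduledExecutorService` would call `flush()` every couple of seconds):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Collects rows from many producer threads; each flush() drains and
// commits whatever has queued up -- 0 rows, 10 rows, who cares -- so
// no arbitrary "wait for 1000" threshold is ever needed.
public class PeriodicBatcher {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    private final Consumer<List<String>> sink;

    PeriodicBatcher(Consumer<List<String>> sink) {
        this.sink = sink;
    }

    // Called by worker threads whenever they produce a row.
    void submit(String row) {
        queue.add(row);
    }

    // Drain everything currently queued and hand it to the sink in one batch.
    int flush() {
        List<String> batch = new ArrayList<>();
        String row;
        while ((row = queue.poll()) != null) {
            batch.add(row);
        }
        if (!batch.isEmpty()) {
            sink.accept(batch);
        }
        return batch.size();
    }
}
```

The scheduling itself would be something like `Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(batcher::flush, 2, 2, TimeUnit.SECONDS)`.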
Add a commit kind of method to that API that will be called to confirm all items have been added. Also, the optimum batch size is somewhere in the range 20-50. After that the potential gain is outweighed by the bookkeeping necessary for a growing number of statements. You don't mention it explicitly, but of course you must use the dedicated batch API in JDBC.
If you need to keep track of many writers, each in its own thread, then you'll also need a begin kind of method and you can count how many times it was called, compared to how many times commit was called. Something like reference-counting. When you reach zero, you know you can flush your statement buffer.
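The begin/commit counting idea might look like this (a sketch with invented names; `flushSink` stands in for executing the buffered statements as one JDBC batch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Each writer calls begin() before queuing statements and commit()
// when done; when the count of open writers drops to zero, the
// buffered statements are flushed in one batch -- essentially
// reference counting over the statement buffer.
public class RefCountedBuffer {
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> flushSink;
    private int openWriters = 0;

    RefCountedBuffer(Consumer<List<String>> flushSink) {
        this.flushSink = flushSink;
    }

    synchronized void begin() {
        openWriters++;
    }

    synchronized void add(String stmt) {
        buffer.add(stmt);
    }

    synchronized void commit() {
        if (--openWriters == 0 && !buffer.isEmpty()) {
            flushSink.accept(new ArrayList<>(buffer)); // flush everything buffered
            buffer.clear();
        }
    }
}
```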
This is an interesting problem that I have faced many times. According to your description, you are creating a batch of 1000 or more insert queries, repeatedly inserting into the same table.
To avoid this type of situation, you can build the insert query like this:
INSERT INTO table1 VALUES('4','India'),('5','Odisha'),('6','Bhubaneswar')
It executes only once with multiple values. So it's better to keep all the values in a collection (an ArrayList, for example) and finally build one query like the above and insert it once.
You can also use the JDBC transaction API (commit(), rollback(), setAutoCommit()).
Hope it will help you.
All the best.
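One way to build such a multi-row statement programmatically (a sketch; it uses `?` placeholders rather than concatenating literal values, which would invite SQL injection):

```java
// Builds one multi-row INSERT with parameter placeholders:
// rowCount groups of columnCount markers each.
// columnCount must be >= 1.
public class MultiRowInsert {
    static String build(String table, String columns, int columnCount, int rowCount) {
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table)
                .append(" (").append(columns).append(") VALUES ");
        String group = "(" + "?, ".repeat(columnCount - 1) + "?)";
        for (int i = 0; i < rowCount; i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append(group);
        }
        return sql.toString();
    }
}
```

Each row's values are then bound with setString/setObject at consecutive parameter indexes before a single executeUpdate.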
I am trying to fill a ResultSet in Java with about 50,000 rows of 10 columns,
and then insert them into another table using the executeBatch method of PreparedStatement.
To make the process faster I did some research and found that, while reading data into a ResultSet, the fetchSize plays an important role.
Having a very low fetchSize can result in too many trips to the server, and a very high fetchSize can tie up network resources, so I experimented a little and set an optimum size that suits my infrastructure.
I am reading this resultSet and creating insert statements to insert into another table of a different database.
Something like this (just a sample, not real code):
for (int i = 0; i < 50000; i++) {
statement.setString(1, "a@a.com");
statement.setLong(2, 1);
statement.addBatch();
}
statement.executeBatch();
Will the executeBatch method try to send all the data at once ?
Is there a way to define the batch size?
Is there any better way to speed up the process of bulk insertion?
While updating in bulk (50,000 rows, 10 columns), is it better to use an updatable ResultSet or a PreparedStatement with batch execution?
I'll address your questions in turn.
Will the executeBatch method try to send all the data at once?
This can vary with each JDBC driver, but the few I've studied will iterate over each batch entry and send the arguments together with the prepared statement handle each time to the database for execution. That is, in your example above, there would be 50,000 executions of the prepared statement with 50,000 pairs of arguments, but these 50,000 steps can be done in a lower-level "inner loop," which is where the time savings come in. As a rather stretched analogy, it's like dropping out of "user mode" down into "kernel mode" and running the entire execution loop there. You save the cost of diving in and out of that lower-level mode for each batch entry.
Is there a way to define the batch size?
You've defined it implicitly here by pushing 50,000 argument sets in before executing the batch via Statement#executeBatch(). A batch size of one is just as valid.
Is there any better way to speed up the process of bulk insertion?
Consider opening a transaction explicitly before the batch insertion, and commit it afterward. Don't let either the database or the JDBC driver impose a transaction boundary around each insertion step in the batch. You can control the JDBC layer with the Connection#setAutoCommit(boolean) method. Take the connection out of auto-commit mode first, then populate your batches, start a transaction, execute the batch, then commit the transaction via Connection#commit().
This advice assumes that your insertions won't be contending with concurrent writers, and assumes that these transaction boundaries will give you sufficiently consistent values read from your source tables for use in the insertions. If that's not the case, favor correctness over speed.
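A sketch of that single-transaction pattern (the `SqlWork` wrapper and names are mine; `work` stands in for populating and executing the batches):

```java
import java.sql.Connection;
import java.sql.SQLException;

// One explicit transaction wrapped around the whole batch insertion:
// autocommit off, run the batch work, then a single commit, with a
// rollback on failure so a partial batch never persists.
public class SingleTxn {
    interface SqlWork {
        void run(Connection conn) throws SQLException;
    }

    static void runInTransaction(Connection conn, SqlWork work) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);           // take JDBC out of auto-commit mode
        try {
            work.run(conn);                  // e.g. ps.addBatch() / ps.executeBatch()
            conn.commit();                   // one commit for the whole batch
        } catch (SQLException e) {
            conn.rollback();                 // undo everything on failure
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```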
Is it better to use an updatable ResultSet or a PreparedStatement with batch execution?
Nothing beats testing with your JDBC driver of choice, but I expect the latter—PreparedStatement and Statement#executeBatch() will win out here. The statement handle may have an associated list or array of "batch arguments," with each entry being the argument set provided in between calls to Statement#executeBatch() and Statement#addBatch() (or Statement#clearBatch()). The list will grow with each call to addBatch(), and not be flushed until you call executeBatch(). Hence, the Statement instance is really acting as an argument buffer; you're trading memory for convenience (using the Statement instance in lieu of your own external argument set buffer).
Again, you should consider these answers general and speculative so long as we're not discussing a specific JDBC driver. Each driver varies in sophistication, and each will vary in which optimizations it pursues.
The batch will be done "all at once" - that's what you've asked it to do.
50,000 seems a bit large to be attempting in one call. I would break it up into smaller chunks of 1,000, like this:
final int BATCH_SIZE = 1000;
for (int i = 0; i < DATA_SIZE; i++) {
statement.setString(1, "a@a.com");
statement.setLong(2, 1);
statement.addBatch();
if (i % BATCH_SIZE == BATCH_SIZE - 1)
statement.executeBatch();
}
if (DATA_SIZE % BATCH_SIZE != 0)
statement.executeBatch();
50,000 rows shouldn't take more than a few seconds.
If it's just data from one or more tables in the DB being inserted into this table, with no intervention (no alterations to the result set), then call statement.executeUpdate(SQL) to perform an INSERT-SELECT statement; this is quicker since there is no overhead. No data goes outside the DB, and the entire operation runs on the DB, not in the application.
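For example (table and column names invented for illustration), a single set-based statement of that kind:

```sql
INSERT INTO target_table (id, email)
SELECT id, email
FROM source_table
WHERE epoch <= 1388094517;
```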
Bulk unlogged update will not give you the improved performance you want the way you are going about it. See this