JDBC Batch Insert Optimization - Java

I'm trying to insert data in batches into a PostgreSQL DB. With a batch size of 30 it takes ~17 seconds to insert each batch, which seems incredibly slow; persisting 10,000 records would take well over an hour. I need help speeding this up. I have already added ?reWriteBatchedInserts=true to the end of my DB connection URL. I have used plain prepared statements, which were faster but very clunky and manual, and I've also tried Hibernate, which was just as slow. I would like a more Spring-based approach to batch inserting: auto-generated SQL statements, with a BeanMap that maps the bean onto the insert statement, rather than having to set every field manually with statement.setString(1, position.getCurrency()) etc. The reason I don't want the fully manual prepared-statement setup (even though it is faster) is that some of my tables have hundreds of fields, which would become a pain to maintain if changes occur.
My PostgreSQL database is version 11.16, and my PostgreSQL JDBC driver dependency in Gradle is version 42.5.0.
Any thoughts on why this is taking so long to insert into the DB? I am using NamedParameterJdbcTemplate. If you need any more information please let me know.
CODE:
String positionSql = createInsert("INSERT INTO %s(%s) VALUES(%s)", positionTableMapping, positionColumnMappings);
List<Map<String, Object>> posBuffer = new ArrayList<>();
int count = 0;
for (BPSPositionTable position : bpsPositions) {
    posBuffer.add(BeanMap.create(position));
    count++;
    // flush the buffer every batchSize rows, and once more for the final partial batch
    if (count % batchSize == 0 || count == bpsPositions.size()) {
        jdbcTemplate.batchUpdate(positionSql, SqlParameterSourceUtils.createBatch(posBuffer));
        posBuffer.clear();
    }
}
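A Spring-flavoured alternative worth trying (a sketch, not a confirmed fix): BeanPropertySqlParameterSource, from org.springframework.jdbc.core.namedparam, maps each bean's getters onto the :named parameters without any manual setString calls, and chunking the list keeps each batchUpdate call at batchSize rows. Only positionSql, jdbcTemplate, bpsPositions and batchSize are taken from the code above; the rest is illustrative.
// Sketch: one JDBC batch per chunk of batchSize beans.
for (int from = 0; from < bpsPositions.size(); from += batchSize) {
    int to = Math.min(from + batchSize, bpsPositions.size());
    SqlParameterSource[] chunk = bpsPositions.subList(from, to).stream()
            .map(BeanPropertySqlParameterSource::new)
            .toArray(SqlParameterSource[]::new);
    jdbcTemplate.batchUpdate(positionSql, chunk);
}
With reWriteBatchedInserts=true the PostgreSQL driver should then rewrite each batch into multi-row INSERTs; a batch size of 30 is also quite small, and sizes in the hundreds, executed inside a single transaction, are more common starting points.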

Related

How to insert records into RDS MySQL more efficiently and quickly using the JDBC API

I'm trying to insert nearly 200,000 records, read from a CSV file, into RDS (MySQL) using a Lambda function. The complete insert takes nearly 10 minutes, which is very concerning. I would like to know how to speed up the insertion.
Techniques I tried:
Using a PreparedStatement for batch insertion, as in the code below:
BufferedReader lineReader =
        new BufferedReader(new InputStreamReader(inputStream, Charset.defaultCharset())); // inputStream is the CSV file data
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) { // connection is a JDBC Connection instance
    LOGGER.debug("Processing Insert");
    Stream<String> lineStream = lineReader.lines().skip(1);
    List<String> collect = lineStream.collect(Collectors.toList());
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // remaining code for setting the other columns
        batchStatement.addBatch();
        batchStatement.executeBatch(); // note: executes the batch on every row
        batchStatement.clearBatch();
    }
    batchStatement.executeBatch();
    connection.commit();
} catch (Exception e) {
    // exception handling code
} finally {
    lineReader.close();
    connection.close();
}
Implemented rewriteBatchedStatements=true in the connection URL.
Please suggest if anything is feasible in this case for faster inserting data into RDS (MySQL).
Only execute the batch in chunks, such as 100 at a time, not one at a time as you have it now:
int rows = 0; // outside the loop
...
if ((++rows % 100) == 0) {
    batchStatement.executeBatch();
}
// Don't reset the batch as this will wipe the 99 previous rows:
// batchStatement.clearBatch();
Also: disabling auto-commit mode will improve bulk updates; remember to set it back afterwards if you are not using addBatch or if connections are re-used:
connection.setAutoCommit(false);
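Putting both suggestions together, the question's loop might look roughly like this (a sketch reusing INSERT_QUERY, connection and collect from the code above; the column-setting code stays elided as in the original):
connection.setAutoCommit(false);
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) {
    int rows = 0;
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // remaining code for setting the other columns
        batchStatement.addBatch();
        if ((++rows % 100) == 0) {
            batchStatement.executeBatch(); // flush every 100 rows
        }
    }
    batchStatement.executeBatch(); // flush the remaining rows
    connection.commit();
}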
1. LOAD DATA INFILE into a separate table, t1.
2. Cleanse the data. That is, fix anything that needs modification, perform normalization, etc.
3. INSERT INTO the real table (...) SELECT ... FROM t1.
If you need further discussion, please provide, in SQL, the table schema and any transforms needed by my step 2. Also, a few rows of sample data may help.
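For reference, a rough sketch of issuing step 1 from JDBC (an assumption, not part of the answer above): LOAD DATA LOCAL INFILE can be run through a plain Statement, provided the server allows local_infile and the driver permits it (allowLoadLocalInfile=true for MySQL Connector/J). The path and table name are placeholders.
try (Statement st = connection.createStatement()) {
    st.execute(
        "LOAD DATA LOCAL INFILE '/tmp/upload.csv' " +
        "INTO TABLE t1 " +
        "FIELDS TERMINATED BY ',' " +
        "IGNORE 1 LINES");  // skip the CSV header row
}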

jOOQ batch insertion inconsistency

While working with batch insertion in jOOQ (v3.14.4) I noticed some inconsistency when looking at the PostgreSQL (v12.6) logs.
When doing context.batch(<query>).bind(<1st record>).bind(<2nd record>)...bind(<nth record>).execute(), the logs show that the records are actually inserted one by one instead of all in one go.
Doing context.insert(<fields>).values(<1st record>).values(<2nd record>)...values(<nth record>), on the other hand, inserts everything in one go, judging by the Postgres logs.
Is it a bug in jOOQ itself, or was I using the batch(...) functionality incorrectly?
Here are 2 code snippets that are supposed to do the same but in reality, the first one inserts records one by one while the second one actually does the batch insertion.
public void batchInsertEdges(List<EdgesRecord> edges) {
    Query batchQuery = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES,
            Edges.EDGES.METADATA)
            .values((Long) null, (Long) null, (CallSiteRecord[]) null, (JSONB) null)
            .onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
            .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
            .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class));
    var batchBind = context.batch(batchQuery);
    for (var edge : edges) {
        batchBind = batchBind.bind(edge.getSourceId(), edge.getTargetId(),
                edge.getCallSites(), edge.getMetadata());
    }
    batchBind.execute();
}

public void batchInsertEdges(List<EdgesRecord> edges) {
    var insert = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES, Edges.EDGES.METADATA);
    for (var edge : edges) {
        insert = insert.values(edge.getSourceId(), edge.getTargetId(), edge.getCallSites(), edge.getMetadata());
    }
    insert.onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
            .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
            .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class))
            .execute();
}
I would appreciate some help figuring out why the first code snippet does not work as intended while the second one does. Thank you!
There's a difference between "batch processing" (as in JDBC batch) and "bulk processing" (as in what many RDBMS call "bulk updates").
This page of the manual about data import explains the difference.
Bulk size: The number of rows that are sent to the server in one SQL statement.
Batch size: The number of statements that are sent to the server in one JDBC statement batch.
These are fundamentally different things. Both help improve performance. Bulk data processing does so by helping the RDBMS optimise resource allocation algorithms as it knows it is about to insert 10 records. Batch data processing does so by reducing the number of round trips between client and server. Whether either approach has a big impact on any given RDBMS is obviously vendor specific.
In other words, both of your approaches work as intended.
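For illustration, jOOQ's import (Loader) API exposes both knobs explicitly. A minimal sketch, assuming the Edges records from the question (the sizes are arbitrary):
// bulkAfter: rows per INSERT statement; batchAfter: statements per JDBC batch;
// commitAfter: batches per commit.
context.loadInto(Edges.EDGES)
       .bulkAfter(50)
       .batchAfter(10)
       .commitAfter(10)
       .loadRecords(edges)
       .fields(Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID,
               Edges.EDGES.CALL_SITES, Edges.EDGES.METADATA)
       .execute();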

Efficient Bulk INSERT/COPY into table from HashMap

Task:
Given this HashMap structure: Map<String, Map<String, String>> mainMap = new HashMap<>()
I want to INSERT or COPY each value of the inner Map into its own cell in my database.
The size() of mainMap is 50,000.
The size() of the inner Map is 50.
The table to be inserted into has 50 columns.
Each column's header is the key for the inner Map.
EDIT: Originally the user uploads a large spreadsheet with 35 of the 50 columns. I then "cleanse" that data with various formatting, and I add my own 15 new pairs into the innerMap for each mainMap entry. I can't directly COPY from the user's source file to my database without cleansing/formatting/adding.
Once I'm done iterating the spreadsheet and building mainMap, that's when I need to insert into my database table efficiently.
Research:
I've read that COPY is the best approach to initially bulk populate a table, however I'm stuck on whether my requirements warrant that command.
This post states that Postgres has a Prepared Statement parameter limit of 34464 for a query.
I'm assuming I need 50 x 50,000 = 2,500,000 parameters in total.
This equals out to ~ 73 individual queries!
Question:
Is COPY the proper approach here instead of all these parameters?
If so, do I convert the HashMap values into a .sql file, save it on disk on my web app server, and then reference that in my COPY command, and then delete the temp file? Or can I directly pass a concatenated String into it, without risking SQL injection?
This command will be happening often, hence the need to be optimized.
I can't find any examples of converting Java objects into compatible Postgres text file formats, so any feedback helps.
How would you approach this problem?
Additional Info:
My table is pre-existing and can't be deleted since it's the back-end for my webapp and multiple users are connected at any given time.
I understand temporarily removing indexes prior to using COPY can increase performance, but I'm only requiring max 50,000 rows to be inserted or copied at a time, not millions.
StackExchange told me to ask here.
While Java is certainly not the best option for this kind of ETL, it is certainly possible, and with rather little overhead, using standard INSERT statements and prepared queries:
conn.setAutoCommit(false);
PreparedStatement stmt = conn.prepareStatement(
        "INSERT INTO my_table (col_a, col_b, ...)"
        + " VALUES (?, ?, ...)");
int batchSize = 1000;
int rows = 0;
for (Map<String, String> values : mainMap.values()) {
    int i = 0;
    stmt.setString(++i, values.get("col_a"));
    stmt.setString(++i, values.get("col_b"));
    // ...
    stmt.addBatch(); // add the row to the batch
    if (++rows % batchSize == 0) {
        // batch-sizing: execute...
        stmt.executeBatch();
    }
}
if (rows % batchSize != 0) {
    // a last execution if necessary...
    stmt.executeBatch();
}
conn.commit(); // atomic action - if any record fails, the whole import will fail
Alternatively, you could write out the Map into a file and use the CopyManager, but I seriously doubt this would be any faster than with the batched inserts (would be different for millions of rows, though).
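For comparison, a rough sketch of that CopyManager route (an assumption, not a recommendation): the PostgreSQL JDBC driver exposes COPY ... FROM STDIN through org.postgresql.copy.CopyManager, so no temporary file is needed. Column names mirror the placeholders above, exception handling is omitted, and real code must escape delimiters, quotes and newlines in the values.
// Build the CSV in memory and stream it to the server with COPY.
StringBuilder csv = new StringBuilder();
for (Map<String, String> values : mainMap.values()) {
    csv.append(values.get("col_a")).append(',')
       .append(values.get("col_b")).append('\n'); // remaining columns elided
}
CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
copyManager.copyIn(
        "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)",
        new StringReader(csv.toString()));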
COPY may indeed be the recommended way for initial bulk uploads, but there are limitations considering your initial data is stored in memory in a Java Map:
Firstly, it expects to load from a file (local to the server, and readable by its user), or a program (again, executed locally on the server), or via STDIN. None of these options are particularly friendly to a JDBC connection.
Secondly, even if you could prepare the data in that format (assuming you're on the same machine to prepare such a file, for example), you'd still need to convert the data held in memory in Java into the format COPY expects. That processing probably wouldn't make it worth using COPY.
I would instead create a PreparedStatement to insert your 50 columns, and then iterate through to execute that prepared statement for each Map in mainMap.values() (i.e. 50 columns each time).
You can gain speed by using executeBatch(). That said, I wouldn't execute all 50,000 in one batch, but in sub-batches.
I'd do something like this:
int BATCH_SIZE = 100;
List<String> keyNames = new ArrayList<>();
int i = 0;
try (PreparedStatement ps = conn
        .prepareStatement("INSERT INTO xyz (col1, col2, ...) VALUES (?, ?, ...)")) {
    for (Map<String, String> rowMap : mainMap.values()) {
        int j = 1;
        // You need the key names to be in the same order as the columns
        // they match.
        for (String key : keyNames) {
            ps.setString(j, rowMap.get(key));
            j++;
        }
        ps.addBatch();
        if (i > 0 && i % BATCH_SIZE == 0) {
            ps.executeBatch();
        }
        i++;
    }
    if (i % BATCH_SIZE != 1) {
        // More batches to execute since the last time it was done.
        ps.executeBatch();
    }
}

Optimising MySQL INSERT with many VALUES (),(),();

I am trying to improve my Java app's performance, and at this point I'm focusing on one endpoint that has to insert a large amount of data into MySQL.
I'm using plain JDBC with the MariaDB Java client driver:
try (PreparedStatement stmt = connection.prepareStatement(
        "INSERT INTO data (" +
        "fId, valueDate, value, modifiedDate" +
        ") VALUES (?,?,?,?)")) {
    for (DataPoint dp : datapoints) {
        stmt.setLong(1, fId);
        stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
        stmt.setDouble(3, dp.getValue());
        stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
        stmt.addBatch();
    }
    int[] results = stmt.executeBatch();
}
From populating the new DB from dumped files, I know that max_allowed_packet is important and I've got that set to 536,870,912 bytes.
In https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html it states that:
If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster. See Section 5.1.7, “Server System Variables”.
On my DBs, this is set to 8MB
I've also read about key_buffer_size (currently set to 16MB).
I'm concerned that these last two might not be enough. I can do some rough calculations on the JSON input to this algorithm, because it looks something like this:
[{"actualizationDate":null,"data":[{"date":"1999-12-31","value":0},
{"date":"2000-01-07","value":0},{"date":"2000-01-14","value":3144},
{"date":"2000-01-21","value":358},{"date":"2000-01-28","value":1049},
{"date":"2000-02-04","value":-231},{"date":"2000-02-11","value":-2367},
{"date":"2000-02-18","value":-2651},{"date":"2000-02-25","value":-
393},{"date":"2000-03-03","value":1725},{"date":"2000-03-10","value":-
896},{"date":"2000-03-17","value":2210},{"date":"2000-03-24","value":1782},
and it looks like the 8MB configured for bulk_insert_buffer_size could easily be exceeded, if not key_buffer_size as well.
But the MySQL docs only make mention of MyISAM engine tables, and I'm currently using InnoDB tables.
I can set up some tests but it would be good to know how this will break or degrade, if at all.
[EDIT] I have rewriteBatchedStatements=true. In fact, here's my connection string:
jdbc:p6spy:mysql://myhost.com:3306/mydb\
?verifyServerCertificate=true\
&useSSL=true\
&requireSSL=true\
&cachePrepStmts=true\
&cacheResultSetMetadata=true\
&cacheServerConfiguration=true\
&elideSetAutoCommits=true\
&maintainTimeStats=false\
&prepStmtCacheSize=250\
&prepStmtCacheSqlLimit=2048\
&rewriteBatchedStatements=true\
&useLocalSessionState=true\
&useLocalTransactionState=true\
&useServerPrepStmts=true
(from https://github.com/brettwooldridge/HikariCP/wiki/MySQL-Configuration )
An alternative is to execute the batch from time to time. This allows you to reduce the size of the batches and lets you focus on more important problems.
int batchSize = 0;
for (DataPoint dp : datapoints) {
    stmt.setLong(1, fId);
    stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
    stmt.setDouble(3, dp.getValue());
    stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
    stmt.addBatch();
    // When the limit is reached, execute and reset the counter
    if (batchSize++ >= BATCH_LIMIT) {
        stmt.executeBatch();
        batchSize = 0;
    }
}
// Execute the remaining items
if (batchSize > 0) {
    stmt.executeBatch();
}
I generally use a constant or a parameter based on the DAO implementation to be more dynamic, but a batch of 10_000 rows is a good start.
private static final int BATCH_LIMIT = 10_000;
Note that it is not necessary to clear the batch after an execution. Even though this is not specified in the Statement.executeBatch documentation, it is in the JDBC 4.3 specification:
14 Batch Updates
14.1 Description of Batch Updates
14.1.2 Successful Execution
Calling the method executeBatch closes the calling Statement object’s current result set if one is open.
The statement’s batch is reset to empty once executeBatch returns.
Managing the results is a bit more complicated, but you can still concatenate them if you need them; they can be analyzed at any time, since the ResultSet is no longer needed.
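As an illustration of that last point (a sketch, not part of the original answer): the per-row update counts returned by each chunked executeBatch() call can be appended to one list and inspected after the loop, e.g. with a small helper called wherever stmt.executeBatch() appears above.
// Drain one chunk of the batch and keep its update counts for later analysis.
static void executeAndCollect(PreparedStatement stmt, List<Integer> updateCounts)
        throws SQLException {
    for (int count : stmt.executeBatch()) {
        updateCounts.add(count);
    }
}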

Performance and limitation issues between update() and batchUpdate() methods of NamedParameterJdbcTemplate

I would like to know when to use the update() and batchUpdate() methods of Spring's NamedParameterJdbcTemplate class.
Is there any row limitation for update()? How many rows can update() handle without performance issues or hanging my DB? From how many rows does batchUpdate() start to give good performance?
Thanks.
Below is my viewpoint:
When to use update() or batchUpdate() from Spring's NamedParameterJdbcTemplate?
You should use batchUpdate() whenever you need to execute multiple SQL statements together.
Is there any row limitation for update()?
This depends on the DB you use, but I haven't come across a row limitation for updates. Of course, updating a few rows is faster than updating many rows (e.g., UPDATE ... WHERE id = 1 vs UPDATE ... WHERE id > 1).
How many rows can update() handle without performance issues or hanging my DB?
There is no fixed answer; it depends on the DB you are using, machine performance, etc. If you want exact numbers, look at the DB vendor's benchmarks or measure it with your own tests.
From how many rows does batchUpdate() start to give good performance?
In fact, batchUpdate() is commonly used when you do batch INSERT, UPDATE, or DELETE operations, and it can improve performance considerably. For example:
BATCH INSERT:
SqlParameterSource[] batch = SqlParameterSourceUtils.createBatch(employees.toArray());
int[] updateCounts = namedParameterJdbcTemplate.batchUpdate("INSERT INTO EMPLOYEE VALUES (:id, :firstName, :lastName, :address)", batch);
return updateCounts;
BATCH UPDATE:
List<Object[]> batch = new ArrayList<Object[]>();
for (Actor actor : actors) {
    Object[] values = new Object[] {
            actor.getFirstName(),
            actor.getLastName(),
            actor.getId()};
    batch.add(values);
}
int[] updateCounts = jdbcTemplate.batchUpdate(
        "update t_actor set first_name = ?, last_name = ? where id = ?",
        batch);
return updateCounts;
Internally, batchUpdate() uses PreparedStatement.addBatch(); see any Spring JDBC tutorial. Batch operations are sent to the database in one "batch", rather than sending the updates one by one.
Sending a batch of updates to the database in one go is faster than sending them one by one and waiting for each to finish: there is less network traffic (only one round trip), and the database might be able to execute some of the updates in parallel. Note that the JDBC driver must support batch operations for batchUpdate() to help, and batchUpdate() does not run in a single transaction by default.
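If you do need the whole batch to be atomic, a minimal sketch (assuming Spring transaction management is configured; the Employee type and method name are illustrative) is to wrap the call in @Transactional:
// A failure on any row then rolls back the entire batch.
@Transactional
public int[] insertEmployees(List<Employee> employees) {
    SqlParameterSource[] batch = SqlParameterSourceUtils.createBatch(employees.toArray());
    return namedParameterJdbcTemplate.batchUpdate(
            "INSERT INTO EMPLOYEE VALUES (:id, :firstName, :lastName, :address)", batch);
}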
More details you can view:
https://docs.spring.io/spring/docs/current/spring-framework-reference/html/jdbc.html#jdbc-advanced-jdbc
http://tutorials.jenkov.com/jdbc/batchupdate.html#batch-updates-and-transactions
Hope this helps.
