Task:
Given this HashMap structure: Map<String, Map<String, String>> mainMap = new HashMap<>()
I want to INSERT or COPY each value of the inner Map into its own cell in my database.
The size() of mainMap is 50,000.
The size() of the inner Map is 50.
The table to be inserted into has 50 columns.
Each column's header is the key for the inner Map.
EDIT: Originally the user uploads a large spreadsheet with 35 of the 50 columns. I then "cleanse" that data with various formatting, and I add my own 15 new pairs into the innerMap for each mainMap entry. I can't directly COPY from the user's source file to my database without cleansing/formatting/adding.
Once I'm done iterating the spreadsheet and building mainMap, that's when I need to insert into my database table efficiently.
Research:
I've read that COPY is the best approach to initially bulk populate a table, however I'm stuck on whether my requirements warrant that command.
This post states that Postgres has a Prepared Statement parameter limit of 34464 for a query.
I'm assuming I need 50 x 50,000 = 2,500,000 parameters in total.
That works out to ~73 individual queries!
Question:
Is COPY the proper approach here instead of all these parameters?
If so, do I convert the HashMap values into a .sql file, save it on disk on my web app server, and then reference that in my COPY command, and then delete the temp file? Or can I directly pass a concatenated String into it, without risking SQL injection?
This command will be happening often, hence the need to be optimized.
I can't find any examples of converting Java objects into compatible Postgres text file formats, so any feedback helps.
How would you approach this problem?
Additional Info:
My table is pre-existing and can't be deleted since it's the back-end for my webapp and multiple users are connected at any given time.
I understand temporarily removing indexes prior to using COPY can increase performance, but I'm only requiring max 50,000 rows to be inserted or copied at a time, not millions.
StackExchange told me to ask here.
While Java is certainly not the best option for this kind of ETL, it is possible, and with rather little overhead, using standard INSERT statements and prepared queries:
conn.setAutoCommit(false);
PreparedStatement stmt = conn.prepareStatement(
        "INSERT INTO my_table (col_a, col_b, ...)"
        + " VALUES (?, ?, ...)");
int batchSize = 1000;
int rows = 0;
for (Map<String, String> values : mainMap.values()) {
    int i = 0;
    stmt.setString(++i, values.get("col_a"));
    stmt.setString(++i, values.get("col_b"));
    // ... set the remaining columns the same way ...
    stmt.addBatch(); // add the row to the batch
    if (++rows % batchSize == 0) {
        // batch-sizing: execute a full batch...
        stmt.executeBatch();
    }
}
if (rows % batchSize != 0) {
    // a last execution for the final partial batch...
    stmt.executeBatch();
}
conn.commit(); // atomic action - if any record fails, the whole import will fail
Alternatively, you could write out the Map into a file and use the CopyManager, but I seriously doubt this would be any faster than with the batched inserts (would be different for millions of rows, though).
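For reference, a minimal sketch of what that CopyManager route could look like with the pgjdbc driver, streaming CSV straight from memory so no temp file or server-side file access is needed (my_table, col_a and col_b are placeholders standing in for your real table and 50 columns):

import java.io.StringReader;
import java.sql.Connection;
import java.util.Map;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

static void copyRows(Connection conn, Map<String, Map<String, String>> mainMap) throws Exception {
    StringBuilder csv = new StringBuilder();
    for (Map<String, String> row : mainMap.values()) {
        // Quote each value so embedded commas/newlines/quotes can't break the CSV.
        csv.append(quote(row.get("col_a"))).append(',')
           .append(quote(row.get("col_b"))).append('\n');
    }
    CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
    copyManager.copyIn(
            "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)",
            new StringReader(csv.toString()));
}

static String quote(String s) {
    // An unquoted empty field is read as NULL in CSV COPY format.
    return s == null ? "" : '"' + s.replace("\"", "\"\"") + '"';
}

Whether this beats batched inserts for 50,000 rows is something you would have to measure; the main win is avoiding a temp file and avoiding hand-built SQL strings entirely.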
COPY may indeed be the recommended way for initial bulk uploads, but there are limitations considering your initial data is stored in memory in a Java Map:
Firstly, it expects to load from a file (local to the server, and readable by its user), or a program (again, executed locally on the server), or via STDIN. None of these options are particularly friendly to a JDBC connection.
Secondly, even if you could prepare the data in that format (assuming you're on the same machine to prepare such a file, for example), you'd still need to convert the data held in memory in Java into the format COPY expects. That processing probably wouldn't make it worth using COPY.
I would instead create a PreparedStatement to insert your 50 columns, and then iterate through to execute that prepared statement for each Map in mainMap.values() (i.e. 50 columns each time).
You can gain speed by using executeBatch(). That said, I wouldn't execute all 50,000 rows in one batch, but in sub-batches.
I'd do something like this:
int BATCH_SIZE = 100;
// keyNames must be populated with the 50 column names, in the same order as
// the columns listed in the INSERT statement below.
List<String> keyNames = new ArrayList<>();
int i = 0;
try (PreparedStatement ps = conn
        .prepareStatement("INSERT INTO xyz (col1, col2, ...) VALUES (?, ?, ...)")) {
    for (Map<String, String> rowMap : mainMap.values()) {
        int j = 1;
        for (String key : keyNames) {
            ps.setString(j, rowMap.get(key));
            j++;
        }
        ps.addBatch();
        if (++i % BATCH_SIZE == 0) {
            ps.executeBatch();
        }
    }
    if (i % BATCH_SIZE != 0) {
        // Execute the rows added since the last full batch.
        ps.executeBatch();
    }
}
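One way (not part of the original answer) to keep the column list and keyNames from drifting out of sync is to build both from a single ordered list of headers; xyz and the column names below are placeholders for your real table and 50 headers, and the fragment assumes java.util.List and java.util.Collections are imported (List.of needs Java 9+):

// Derive both the INSERT column list and the placeholders from one ordered list,
// so keyNames always matches the SQL. Only do this with a fixed, trusted set of
// column names, since they are concatenated into the SQL text.
List<String> keyNames = List.of("col1", "col2", "col3" /* ... all 50 headers ... */);
String sql = "INSERT INTO xyz (" + String.join(", ", keyNames) + ") VALUES ("
        + String.join(", ", Collections.nCopies(keyNames.size(), "?")) + ")";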
Related
I'm trying to insert data in batches into a PostgreSQL db. With a batch size of 30 it takes ~17 seconds to insert each batch, which seems incredibly slow. If I persist 10,000 records it will be well over an hour. I need help speeding this up; I have added ?reWriteBatchedInserts=true onto the end of my db connection URL.
I have used prepared statements, which was faster, but very clunky and manually involved. I've also tried using Hibernate, and that was just as slow. I would like a more Spring-based approach for batch inserting, hence the auto-generated SQL statements along with the BeanMap that maps to the insert statement as we would expect, without having to manually set all fields using statement.setString(1, position.getCurrency()) etc...
The reason I don't want to use those prepared statements with all the manual setup (even though it is faster) is because I have some tables with 100's of rows, which will become a pain to maintain if changes occur.
Here is my DB Structure:
My PG Database is version 11.16, and my postgres dependency in gradle is 42.5.0.
Any thoughts on why this is taking so long to insert into the DB? I am using NamedParameterJdbcTemplate. If you need any more information please let me know.
CODE:
String positionSql = createInsert("INSERT INTO %s(%s) VALUES(%s)", positionTableMapping, positionColumnMappings);
List<Map<String, Object>> posBuffer = new ArrayList<>();
for (BPSPositionTable position : bpsPositions) {
    posBuffer.add(BeanMap.create(position));
    if ((count + 1) % batchSize == 0 || (count + 1) == bpsPositions.size()) {
        jdbcTemplate.batchUpdate(positionSql, SqlParameterSourceUtils.createBatch(posBuffer));
        posBuffer.clear();
        count = 0;
    }
    count++;
}
I'm trying to insert nearly 200,000 records, read from a CSV file, into RDS (MySQL) using a Lambda function. The full insert takes nearly 10 minutes, which is very concerning. I would like to know how to increase the insertion speed.
Techniques I tried:
Using a PreparedStatement for batch insertion, like the code below:
BufferedReader lineReader =
        new BufferedReader(new InputStreamReader(inputStream, Charset.defaultCharset())); // inputStream is data from the csv file
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) { // connection is the JDBC connection instance
    LOGGER.debug("Processing Insert");
    Stream<String> lineStream = lineReader.lines().skip(1);
    List<String> collect = lineStream.collect(Collectors.toList());
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // remaining code of setting data
        batchStatement.addBatch();
        batchStatement.executeBatch();
        batchStatement.clearBatch();
    }
    batchStatement.executeBatch();
    connection.commit();
} catch (Exception e) {
    // throw exception code
} finally {
    lineReader.close();
    connection.close();
}
Implemented rewriteBatchedStatements=true in the connection URL.
Please suggest if anything is feasible in this case for faster inserting data into RDS (MySQL).
Only execute the batch in chunks, such as 100 rows at a time, not one at a time as you have it now:
int rows = 0; // outside the loop
...
if ((++rows % 100) == 0) {
    batchStatement.executeBatch();
}
// Don't reset the batch, as this would wipe the 99 previous rows:
//batchStatement.clearBatch();
Also: disabling auto-commit will improve bulk updates; remember to commit at the end and to reset auto-commit afterwards if the connection is re-used:
connection.setAutoCommit(false);
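Putting the two suggestions together, a rough sketch could look like this (it assumes the same INSERT_QUERY and CSV parsing as in the question; the batch is flushed every 100 rows and auto-commit is restored in the finally block in case the connection is pooled):

connection.setAutoCommit(false);
try (PreparedStatement batchStatement = connection.prepareStatement(INSERT_QUERY)) {
    int rows = 0;
    for (String line : collect) {
        String[] data = line.split(",", -1);
        batchStatement.setString(1, data[0]);
        // ... set the remaining columns ...
        batchStatement.addBatch();
        if (++rows % 100 == 0) {
            batchStatement.executeBatch(); // flush a chunk of 100 rows
        }
    }
    batchStatement.executeBatch(); // flush the final partial chunk
    connection.commit();
} finally {
    connection.setAutoCommit(true);
}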
1. LOAD DATA INFILE into a separate table, t1.
2. Cleanse the data. That is, fix anything that needs modification, perform normalization, etc.
3. INSERT INTO the real table (...) SELECT ... FROM t1.
If you need further discussion, please provide, in SQL, the table schema and any transforms needed by my step 2. Also, a few rows of sample data may help.
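For illustration only, a minimal JDBC sketch of those three steps might look like the following; it assumes a MySQL connection opened with allowLoadLocalInfile=true, and the staging table t1, the file path, the column names and the cleansing UPDATE are all placeholders for your actual schema and transforms:

try (Statement st = connection.createStatement()) {
    // 1. Bulk-load the raw CSV into the staging table.
    st.execute("LOAD DATA LOCAL INFILE '/tmp/upload.csv' INTO TABLE t1 "
            + "FIELDS TERMINATED BY ',' IGNORE 1 LINES");
    // 2. Cleanse in place (example transform only).
    st.executeUpdate("UPDATE t1 SET col_a = TRIM(col_a)");
    // 3. Copy the cleansed rows into the real table.
    st.executeUpdate("INSERT INTO real_table (col_a, col_b) "
            + "SELECT col_a, col_b FROM t1");
}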
I am trying to improve my Java app's performance and I'm focusing at this point on one end point which has to insert a large amount of data into mysql.
I'm using plain JDBC with the MariaDB Java client driver:
try (PreparedStatement stmt = connection.prepareStatement(
        "INSERT INTO data (" +
        "fId, valueDate, value, modifiedDate" +
        ") VALUES (?,?,?,?)")) {
    for (DataPoint dp : datapoints) {
        stmt.setLong(1, fId);
        stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
        stmt.setDouble(3, dp.getValue());
        stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
        stmt.addBatch();
    }
    int[] results = stmt.executeBatch();
}
From populating the new DB from dumped files, I know that max_allowed_packet is important and I've got that set to 536,870,912 bytes.
In https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html it states that:
If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster. See Section 5.1.7, “Server System Variables”.
On my DBs, this is set to 8MB
I've also read about key_buffer_size (currently set to 16MB).
I'm concerned that these last 2 might not be enough. I can do some rough calculations on the JSON input to this algorithm because it looks something like this:
[{"actualizationDate":null,"data":[{"date":"1999-12-31","value":0},
{"date":"2000-01-07","value":0},{"date":"2000-01-14","value":3144},
{"date":"2000-01-21","value":358},{"date":"2000-01-28","value":1049},
{"date":"2000-02-04","value":-231},{"date":"2000-02-11","value":-2367},
{"date":"2000-02-18","value":-2651},{"date":"2000-02-25","value":-
393},{"date":"2000-03-03","value":1725},{"date":"2000-03-10","value":-
896},{"date":"2000-03-17","value":2210},{"date":"2000-03-24","value":1782},
and it looks like the 8MB configured for bulk_insert_buffer_size could easily be exceeded, if not key_buffer_size as well.
But the MySQL docs only make mention of MyISAM engine tables, and I'm currently using InnoDB tables.
I can set up some tests but it would be good to know how this will break or degrade, if at all.
[EDIT] I have --rewriteBatchedStatements=true. In fact here's my connection string:
jdbc:p6spy:mysql://myhost.com:3306/mydb\
?verifyServerCertificate=true\
&useSSL=true\
&requireSSL=true\
&cachePrepStmts=true\
&cacheResultSetMetadata=true\
&cacheServerConfiguration=true\
&elideSetAutoCommits=true\
&maintainTimeStats=false\
&prepStmtCacheSize=250\
&prepStmtCacheSqlLimit=2048\
&rewriteBatchedStatements=true\
&useLocalSessionState=true\
&useLocalTransactionState=true\
&useServerPrepStmts=true
(from https://github.com/brettwooldridge/HikariCP/wiki/MySQL-Configuration )
An alternative is to execute the batch periodically, in chunks. This lets you keep the batches small and focus on more important problems.
int batchSize = 0;
for (DataPoint dp : datapoints) {
    stmt.setLong(1, fId);
    stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
    stmt.setDouble(3, dp.getValue());
    stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
    stmt.addBatch();
    // When the limit is reached, execute and reset the counter
    if (++batchSize >= BATCH_LIMIT) {
        stmt.executeBatch();
        batchSize = 0;
    }
}
// Execute the remaining items
if (batchSize > 0) {
    stmt.executeBatch();
}
I generally use a constant, or a parameter based on the DAO implementation to make it configurable, but a batch of 10_000 rows is a good start.
private static final int BATCH_LIMIT = 10_000;
Note that it is not necessary to clear the batch after an execution. Although this is not spelled out in the Statement.executeBatch documentation, it is stated in the JDBC 4.3 specification:
14 Batch Updates
14.1 Description of Batch Updates
14.1.2 Successful Execution
Calling the method executeBatch closes the calling Statement object’s current result set if one is open.
The statement’s batch is reset to empty once executeBatch returns.
Handling the results is a bit more involved, but you can still accumulate the int[] update counts returned by each executeBatch() call if you need them, and analyze them at any point afterwards.
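For example, a small sketch (not part of the original answer) of collecting the counts across chunks, reusing the stmt, BATCH_LIMIT and loop shown above:

List<int[]> allCounts = new ArrayList<>(); // one entry per executeBatch() call

// ... inside the loop, instead of calling executeBatch() directly:
if (++batchSize >= BATCH_LIMIT) {
    allCounts.add(stmt.executeBatch());
    batchSize = 0;
}

// ... after the loop, the total number of rows reported across all chunks:
int totalRows = allCounts.stream().mapToInt(counts -> counts.length).sum();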
I need to insert a couple hundred million records into the MySQL db. I'm batch inserting 1 million at a time. Please see my code below. It seems to be slow. Is there any way to optimize it?
try {
    // Disable auto-commit
    connection.setAutoCommit(false);

    // Create a prepared statement
    String sql = "INSERT INTO mytable (xxx) VALUES(?)";
    PreparedStatement pstmt = connection.prepareStatement(sql);

    Object[] vals = set.toArray();
    for (int i = 0; i < vals.length; i++) {
        pstmt.setString(1, vals[i].toString());
        pstmt.addBatch();
    }

    // Execute the batch
    int[] updateCounts = pstmt.executeBatch();
    System.out.append("inserted " + updateCounts.length);
I had a similar performance issue with mysql and solved it by setting the useServerPrepStmts and the rewriteBatchedStatements properties in the connection url.
Connection c = DriverManager.getConnection("jdbc:mysql://host:3306/db?useServerPrepStmts=false&rewriteBatchedStatements=true", "username", "password");
I'd like to expand on Bertil's answer, as I've been experimenting with the connection URL parameters.
rewriteBatchedStatements=true is the important parameter. useServerPrepStmts is already false by default, and even changing it to true doesn't make much difference in terms of batch insert performance.
Now I think it is time to explain how rewriteBatchedStatements=true improves the performance so dramatically. It does so by rewriting the prepared INSERT statements into multi-value inserts when executeBatch() is called (Source). That means that instead of sending the following n INSERT statements to the mysql server each time executeBatch() is called:
INSERT INTO X VALUES (A1,B1,C1)
INSERT INTO X VALUES (A2,B2,C2)
...
INSERT INTO X VALUES (An,Bn,Cn)
It would send a single INSERT statement :
INSERT INTO X VALUES (A1,B1,C1),(A2,B2,C2),...,(An,Bn,Cn)
You can observe this by turning on the MySQL general log (SET GLOBAL general_log = 1), which logs every statement sent to the mysql server to a file.
You can insert multiple rows with one INSERT statement; doing a few thousand at a time can greatly speed things up. That is, instead of doing e.g. 3 inserts of the form INSERT INTO tbl_name (a,b,c) VALUES(1,2,3);, you do INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(1,2,3),(1,2,3);. (It might be that JDBC .addBatch() does a similar optimization now - though the mysql addBatch used to be entirely un-optimized and just issued individual queries anyhow - I don't know if that's still the case with recent drivers.)
If you really need speed, load your data from a comma-separated file with LOAD DATA INFILE; we get around a 7-8x speedup doing that vs. doing tens of millions of individual inserts.
If:
It's a new table, or the amount to be inserted is greater than the data already in the table
There are indexes on the table
You do not need other access to the table during the insert
Then ALTER TABLE tbl_name DISABLE KEYS can greatly improve the speed of your inserts. When you're done, run ALTER TABLE tbl_name ENABLE KEYS to start building the indexes, which can take a while, but not nearly as long as doing it for every insert.
You may try using a DDBulkLoad object.
// Get a DDBulkLoad object
DDBulkLoad bulkLoad = DDBulkLoadFactory.getInstance(connection);
bulkLoad.setTableName("mytable");
bulkLoad.load("data.csv");
try {
    // Disable auto-commit
    connection.setAutoCommit(false);
    int maxInsertBatch = 10000;

    // Create a prepared statement
    String sql = "INSERT INTO mytable (xxx) VALUES(?)";
    PreparedStatement pstmt = connection.prepareStatement(sql);

    Object[] vals = set.toArray();
    int count = 0;
    for (int i = 0; i < vals.length; i++) {
        pstmt.setString(1, vals[i].toString());
        pstmt.addBatch();
        if (++count % maxInsertBatch == 0) {
            pstmt.executeBatch();
        }
    }

    // Execute the remaining batch
    pstmt.executeBatch();
    System.out.append("inserted " + count);