Performance and limitation issues between update() and batchUpdate() methods of NamedParameterJdbcTemplate - java

I would like to know when to use the update() or batchUpdate() method of Spring's NamedParameterJdbcTemplate class.
Is there a row limit for update()? How many rows can update() handle without causing performance problems or hanging my database? And from how many rows does batchUpdate() start to pay off?
Thanks.

Below is my viewpoint:
when to use update() or batchUpdate() method from NamedParameterJdbcTemplate class of Spring framework
Use batchUpdate() whenever you need to execute the same SQL statement many times with different parameter sets.
Is there any row limitation for update()?
That depends on the database you use, but I have never hit a row limit when updating. Of course, updating a few rows is faster than updating many rows (for example, UPDATE ... WHERE id = 1 versus UPDATE ... WHERE id > 1).
How many rows can handle update() without having performance issues or hanging my db?
There is no fixed answer. It depends on the database you are using, the machine's performance, and so on. If you want exact figures, look at the database vendor's benchmarks or measure it with your own tests.
Starting from how many rows batchUpdate() is getting good performance?
batchUpdate() is commonly used for batch INSERT, UPDATE or DELETE operations, where it can improve performance considerably. For example:
BATCH INSERT:
SqlParameterSource[] batch = SqlParameterSourceUtils.createBatch(employees.toArray());
int[] updateCounts = namedParameterJdbcTemplate.batchUpdate("INSERT INTO EMPLOYEE VALUES (:id, :firstName, :lastName, :address)", batch);
return updateCounts;
BATCH UPDATE:
List<Object[]> batch = new ArrayList<Object[]>();
for (Actor actor : actors) {
Object[] values = new Object[] {
actor.getFirstName(),
actor.getLastName(),
actor.getId()};
batch.add(values);
}
int[] updateCounts = jdbcTemplate.batchUpdate(
"update t_actor set first_name = ?, last_name = ? where id = ?",
batch);
return updateCounts;
Internally, batchUpdate() uses PreparedStatement.addBatch(); see any Spring JDBC tutorial for details. The operations are sent to the database in one batch rather than one by one.
Sending a batch of updates to the database in one go is faster than sending them one at a time and waiting for each to finish. A batch involves less network traffic (ideally a single round trip), and the database might be able to execute some of the updates in parallel. Note also that the JDBC driver must support batch operations for batchUpdate() to help, and that batchUpdate() does not run in a single transaction by default.
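If the list of rows is very large, one option is to split it into fixed-size chunks and call batchUpdate() once per chunk, so that no single batch grows without bound. The following is a minimal sketch of such a splitter using only the JDK; BatchChunks and partition are hypothetical names, not part of Spring, and the right chunk size is something you would have to measure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spring: splits a list into consecutive chunks.
final class BatchChunks {
    private BatchChunks() {}

    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            // Copy the sub-list view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }
}
```

Each chunk could then be turned into a parameter batch the same way as in the INSERT example above, for instance with SqlParameterSourceUtils.createBatch(chunk.toArray()), and passed to batchUpdate().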
For more details, see:
https://docs.spring.io/spring/docs/current/spring-framework-reference/html/jdbc.html#jdbc-advanced-jdbc
http://tutorials.jenkov.com/jdbc/batchupdate.html#batch-updates-and-transactions
Hope this helps.

Related

How to use fetch size in JdbcTemplate to process 20+ millions rows?

I have a table with 20+ million rows and I can't select all of them with a single query because of an OutOfMemoryError. I read about the fetchSize attribute, and it looks like it might resolve my issue because it is common advice.
But I have a question about how to apply it.
I have the following code:
private final JdbcTemplate jdbcTemplate;
...
jdbcTemplate.setFetchSize(1000);
List<MyTable> myList= this.jdbcTemplate.query(
"SELECT * FROM my_table",
new Object[]{},
MyTableMapper.INSTANCE
);
myList.forEach(obj -> processAndSave(obj));
It looks like the JDBC driver will fetch 1000 rows per request. But what should I do to process all 20+ million rows?
Should I invoke jdbcTemplate.query several times?
You can retrieve the results using a cursor.
If you are using PostgreSQL 11,
the requirements are:
a fetch size > 0 (setFetchSize)
autocommit turned off
For your example, with the standard configuration properties:
spring.datasource.hikari.auto-commit=false
spring.jdbc.template.fetch-size=50
Also, if you want to process the result outside of a repository class,
you can combine this with queryForStream:
Stream<MyTable> stream = jdbcTemplate.queryForStream(SQL, MyTableMapper.INSTANCE);
PostgreSQL Documentation:
Issuing a Query and Processing the Result
https://jdbc.postgresql.org/documentation/head/query.html
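One thing to keep in mind with queryForStream is that the returned Stream should be closed once you are done, since it holds on to the underlying JDBC resources until then. The sketch below shows the consume-and-close pattern with try-with-resources. It uses only the JDK: fakeQueryForStream is a hypothetical stand-in for the real jdbcTemplate.queryForStream(...) call, and the 'released' flag stands in for the connection being handed back.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class StreamConsumption {
    private StreamConsumption() {}

    // Stand-in for jdbcTemplate.queryForStream(...): emits rowCount fake rows and
    // flips 'released' on close, where a real stream would release its resources.
    static Stream<Integer> fakeQueryForStream(int rowCount, AtomicBoolean released) {
        return IntStream.range(0, rowCount).boxed().onClose(() -> released.set(true));
    }

    // Processes every row without collecting them into a list; returns the rows seen.
    static long processAll(int rowCount, AtomicBoolean released) {
        AtomicLong processed = new AtomicLong();
        // try-with-resources guarantees close() runs even if processing throws.
        try (Stream<Integer> rows = fakeQueryForStream(rowCount, released)) {
            rows.forEach(row -> processed.incrementAndGet());
        }
        return processed.get();
    }
}
```

In the real code, the forEach body would be the processAndSave call from the question, and rows are never accumulated in a List.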

Optimising MySQL INSERT with many VALUES (),(),();

I am trying to improve my Java app's performance and I'm focusing at this point on one end point which has to insert a large amount of data into mysql.
I'm using plain JDBC with the MariaDB Java client driver:
try (PreparedStatement stmt = connection.prepareStatement(
"INSERT INTO data (" +
"fId, valueDate, value, modifiedDate" +
") VALUES (?,?,?,?)")) {
for (DataPoint dp : datapoints) {
stmt.setLong(1, fId);
stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
stmt.setDouble(3, dp.getValue());
stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
stmt.addBatch();
}
int[] results = stmt.executeBatch();
}
From populating the new DB from dumped files, I know that max_allowed_packet is important and I've got that set to 536,870,912 bytes.
In https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html it states that:
If you are inserting many rows from the same client at the same time,
use INSERT statements with multiple VALUES lists to insert several
rows at a time. This is considerably faster (many times faster in some
cases) than using separate single-row INSERT statements. If you are
adding data to a nonempty table, you can tune the
bulk_insert_buffer_size variable to make data insertion even faster.
See Section 5.1.7, “Server System Variables”.
On my DBs, this is set to 8MB
I've also read about key_buffer_size (currently set to 16MB).
I'm concerned that these last two might not be enough. I can do some rough calculations on the JSON input to this algorithm because it looks something like this:
[{"actualizationDate":null,"data":[{"date":"1999-12-31","value":0},
{"date":"2000-01-07","value":0},{"date":"2000-01-14","value":3144},
{"date":"2000-01-21","value":358},{"date":"2000-01-28","value":1049},
{"date":"2000-02-04","value":-231},{"date":"2000-02-11","value":-2367},
{"date":"2000-02-18","value":-2651},{"date":"2000-02-25","value":-393},
{"date":"2000-03-03","value":1725},{"date":"2000-03-10","value":-896},
{"date":"2000-03-17","value":2210},{"date":"2000-03-24","value":1782},
and it looks like the 8MB configured for bulk_insert_buffer_size could easily be exceeded, if not key_buffer_size as well.
But the MySQL docs only make mention of MyISAM engine tables, and I'm currently using InnoDB tables.
I can set up some tests but it would be good to know how this will break or degrade, if at all.
[EDIT] I have --rewriteBatchedStatements=true. In fact here's my connection string:
jdbc:p6spy:mysql://myhost.com:3306/mydb\
?verifyServerCertificate=true\
&useSSL=true\
&requireSSL=true\
&cachePrepStmts=true\
&cacheResultSetMetadata=true\
&cacheServerConfiguration=true\
&elideSetAutoCommits=true\
&maintainTimeStats=false\
&prepStmtCacheSize=250\
&prepStmtCacheSqlLimit=2048\
&rewriteBatchedStatements=true\
&useLocalSessionState=true\
&useLocalTransactionState=true\
&useServerPrepStmts=true
(from https://github.com/brettwooldridge/HikariCP/wiki/MySQL-Configuration )
An alternative is to execute the batch periodically. This reduces the size of each batch and lets you focus on more important problems.
int batchSize = 0;
for (DataPoint dp : datapoints) {
stmt.setLong(1, fId);
stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
stmt.setDouble(3, dp.getValue());
stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
stmt.addBatch();
// When the limit is reached, execute the batch and reset the counter
if (++batchSize >= BATCH_LIMIT) {
stmt.executeBatch();
batchSize = 0;
}
}
// To execute the remaining items
if(batchSize > 0){
stmt.executeBatch();
}
I generally use a constant or a parameter, depending on the DAO implementation, to be more flexible, but a batch of 10_000 rows is a good start.
private static final int BATCH_LIMIT = 10_000;
Note that it is not necessary to clear the batch after an execution. Although this is not stated in the Statement.executeBatch documentation, it is in the JDBC 4.3 specification:
14 Batch Updates
14.1 Description of Batch Updates
14.1.2 Successful Execution
Calling the method executeBatch closes the calling Statement object’s current result set if one is open.
The statement’s batch is reset to empty once executeBatch returns.
Managing the results is a bit more complicated, but you can still concatenate them if you need them. They can be analyzed at any time, since the ResultSet is no longer needed.
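As a rough illustration of concatenating the results, here is a small JDK-only sketch that appends the int[] returned by each executeBatch() call to a running array and then summarises it. UpdateCounts and its method names are hypothetical. The only JDBC fact it relies on is that an element of the array is either an update count (zero or more) or Statement.SUCCESS_NO_INFO when the driver cannot report a count.

```java
import java.sql.Statement;
import java.util.Arrays;

// Hypothetical helper for collecting executeBatch() results across several flushes.
final class UpdateCounts {
    private UpdateCounts() {}

    // Returns a new array holding 'soFar' followed by 'next'.
    static int[] append(int[] soFar, int[] next) {
        int[] merged = Arrays.copyOf(soFar, soFar.length + next.length);
        System.arraycopy(next, 0, merged, soFar.length, next.length);
        return merged;
    }

    // Sums the update counts the driver actually reported (values >= 0).
    static long knownRowsAffected(int[] counts) {
        return Arrays.stream(counts).filter(c -> c >= 0).asLongStream().sum();
    }

    // Counts entries where the driver reported success without a row count.
    static long unknownCount(int[] counts) {
        return Arrays.stream(counts).filter(c -> c == Statement.SUCCESS_NO_INFO).count();
    }
}
```

In the loop above you would start with an empty int[] and replace it with append(all, stmt.executeBatch()) at each flush, including the final one for the remaining items.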

How to select all the table records in batches and process each batch.?

There are more than 10,00,000 records in the table I am working on. I need to perform an asynchronous operation (a push to a queue) for each record. Getting all the records at once and processing each record in a loop feels like a bad idea. Instead, I want to fetch the records in batches and loop over each batch. I read somewhere on the internet about querying in batches using setFetchSize(int n), and my DAO looks like this:
public List<UserPreferenceDTO> getUserPreferences() {
String sqlQueryString = "select us.id as userId, pf.id as preferenceId from users us, preferences pf where us.id = pf.user_id;";
SQLQuery sqlQuery = (SQLQuery) session.createSQLQuery(sqlQueryString).setFetchSize(200);
return sqlQuery.addScalar("userId").addScalar("preferenceId").setResultTransformer(new AliasToBeanResultTransformer(UserPreferenceDTO.class)).list();
}
My Service class looks like:
List<UserPreferenceDTO> userPreferenceDTOs = userDeviceDao.getUserPreferences();
for(UserPreferenceDTO userPreferenceDTO: userPreferenceDTOs ){
pushToRabbitMQ(userPreferenceDTO);
}
I need to get "N" records from the DB, push them to the queue for processing, then get another "N" records, push them to the queue, and so on until all the records have been pushed.
A reasonable setFetchSize() is a must in any batch-load scenario, because otherwise the database may have to send each row separately. Even if a round trip to the database takes just 10 ms, that is still 10 ms × 10 million rows = 100,000 s ≈ 28 h to fetch all the rows one at a time. The improvement usually plateaus somewhere around a fetch size of 1000, but this depends on your environment, so you need to test it.
It might be enough to replace .list() with .scroll(), which returns a ScrollableResults that lets you read one record at a time. This does depend on the database, however: some drivers, such as MySQL's, fake the scrolling and load the entire result set.
If that's the case, you need to use ORDER BY in your query together with setFirstResult() and setMaxResults(). This executes a new query for each batch. It's the safest approach, but ORDER BY can be expensive.
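The paging approach boils down to a loop that asks for one page at a time until a short or empty page comes back. Here is a minimal JDK-only sketch of that loop. PagedReader and PageQuery are hypothetical names: PageQuery stands in for "run the ORDER BY query with setFirstResult(offset) and setMaxResults(limit) and return the rows".

```java
import java.util.List;
import java.util.function.Consumer;

final class PagedReader {
    private PagedReader() {}

    // Hypothetical abstraction over one paged query execution.
    @FunctionalInterface
    interface PageQuery<T> {
        List<T> fetch(int offset, int limit);
    }

    // Fetches pages until the data runs out, handing each page to 'handler'.
    // Returns the number of non-empty pages processed.
    static <T> int forEachPage(PageQuery<T> query, int pageSize, Consumer<List<T>> handler) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int pages = 0;
        int offset = 0;
        while (true) {
            List<T> page = query.fetch(offset, pageSize);
            if (page.isEmpty()) {
                break; // nothing left to read
            }
            handler.accept(page);
            pages++;
            if (page.size() < pageSize) {
                break; // a short page means this was the last one
            }
            offset += pageSize;
        }
        return pages;
    }
}
```

In the service from the question, the handler would push each DTO of the page to RabbitMQ before the next page is requested, so only one page is in memory at a time.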

How to improve performance of retrieving a REF CURSOR into Java using Spring?

I am performing a call to a function which is part of a DB package. This package is deployed in two locations. One local and another remote (across the Atlantic).
I am retrieving the data via the Spring JDBC template.
There is one function which returns approximately 1000 rows (not all that much) and this is taking about 1.5 seconds when getting the data locally but it's taking in the region of 12 seconds when getting the data remotely.
In all sample code, names have been changed and code has been simplified a little.
Please see an example of the current Java code:
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(getDataSource())
.withSchemaName(MY_SCHEMA_NAME)
.withCatalogName("REFCURSOR_PKG")
.withFunctionName("GET_DATA")
.returningResultSet("RESULT_SET", new DataEntryMapper());
SqlParameterSource params = new MapSqlParameterSource()
.addValue("the_name", name)
.addValue("the_rev", rev);
Map resultSet = simpleJdbcCall.execute(params);
ArrayList list = (ArrayList) resultSet.get("RESULT_SET");
The RowMapper class looks something like this:
class RouteDataEntryMapper implements RowMapper {
public RouteDataEntry mapRow(ResultSet resultSet, int rowNum) throws SQLException {
return new DataEntry(resultSet.getString("name"),
Integer.parseInt(resultSet.getString("rev")));
}
}
SQL package spec snippet:
TYPE REF_CURSOR IS REF CURSOR;
SQL function:
FUNCTION GET_ROUTE_DATA(the_name VARCHAR2, the_rev VARCHAR2) RETURN REF_CURSOR AS
RESULT_SET REF_CURSOR;
BEGIN
OPEN RESULT_SET FOR
select *
from table_name tn
where tn.name = the_name
and tn.rev = the_rev;
RETURN RESULT_SET;
CLOSE RESULT_SET;
EXCEPTION WHEN OTHERS THEN
RAISE;
END GET_ROUTE_DATA;
I have tried using regular boiler plate JDBC also (create connection, prepare statement, execute statement, retrieve data from RESULT_SET, etc) and I found that the vast majority of time was spent looping over the RESULT_SET and extracting the data out of it and into some pojos. In the case of the Spring code above, most of the time was spent during the execute() method but this is probably because it creates the objects using the RowMapper at that time.
So, the common area between them is the performing of actions such as:
rs.getString("name")
and I'm guessing that this is where the problem lies but I could be wrong.
As I said, locally the delay is fine but remotely it's taking way too long. Is this because it's going to the DB on every rs.get... ? Is there a better way to do this?
Thanks in advance.
rs.getString("name")
ResultSet.get*(String columnName) can be replaced with ResultSet.get*(int columnNumber), which is slightly faster, but I doubt that is the main problem here.
Is this because it's going to the DB on every rs.get... ?
While it really depends on the driver, I suspect it doesn't. For a cached result set it might go to the server as you scroll through the cursor, but it would still fetch a batch of rows on every round trip.
Two more suggestions I have are:
Use a network sniffing utility to see the data being transferred
Check your driver for any option to pre-fetch and such like.
Add this line:
.withoutProcedureColumnMetaDataAccess()
to the following code:
SimpleJdbcCall simpleJdbcCall = new SimpleJdbcCall(getDataSource())
.withSchemaName(MY_SCHEMA_NAME)
.withCatalogName("REFCURSOR_PKG")
.withFunctionName("GET_DATA")
.withoutProcedureColumnMetaDataAccess() // to avoid fetching metadata from the database

What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?

I have this really big table with some millions of records every day and in the end of every day I am extracting all the records of the previous day. I am doing this like:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
Statement.executeQuery(SQL);
The problem is that this program takes like 2GB of memory because it takes all the results in memory then it processes it.
I tried setting the Statement.setFetchSize(10) but it takes exactly the same memory from OS it does not make any difference. I am using Microsoft SQL Server 2005 JDBC Driver for this.
Is there any way to read the results in small chunks like the Oracle database driver does when the query is executed to show only a few rows and as you scroll down more results are shown?
In JDBC, the setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.
Inherently if setFetchSize(10) is being called and the driver is ignoring it, there are probably only two options:
Try a different JDBC driver that will honor the fetch-size hint.
Look at driver-specific properties on the Connection (URL and/or property map when creating the Connection instance).
The RESULT-SET is the full set of rows marshalled on the DB in response to the query.
The ROW-SET is the chunk of rows that are fetched out of the RESULT-SET per call from the JVM to the DB.
The number of these calls and resulting RAM required for processing is dependent on the fetch-size setting.
So if the RESULT-SET has 100 rows and the fetch-size is 10,
there will be 10 network calls to retrieve all of the data, using roughly 10*{row-content-size} RAM at any given time.
The default fetch size is driver-specific; Oracle's default, for example, is 10, which is rather small.
In the case posted, it would appear the driver is ignoring the fetch-size setting, retrieving all data in one call (large RAM requirement, optimum minimal network calls).
What happens underneath ResultSet.next() is that it doesn't actually fetch one row at a time from the RESULT-SET. It fetches that from the (local) ROW-SET and fetches the next ROW-SET (invisibly) from the server as it becomes exhausted on the local client.
All of this depends on the driver as the setting is just a 'hint' but in practice I have found this is how it works for many drivers and databases (verified in many versions of Oracle, DB2 and MySQL).
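To make the RESULT-SET / ROW-SET distinction concrete, here is a toy model in plain Java. It is not a real JDBC class and no driver is involved: RowSetCursor is a hypothetical name, the in-memory list stands in for the RESULT-SET on the server, and each refill of the local buffer counts as one network call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a driver-side cursor that honours the fetch size.
final class RowSetCursor<T> {
    private final List<T> serverRows;                // stands in for the RESULT-SET on the DB
    private final int fetchSize;
    private List<T> rowSet = Collections.emptyList(); // current local ROW-SET
    private int rowSetIndex = 0;
    private int serverPosition = 0;
    private int roundTrips = 0;
    private T current;

    RowSetCursor(List<T> serverRows, int fetchSize) {
        if (fetchSize <= 0) {
            throw new IllegalArgumentException("fetchSize must be positive");
        }
        this.serverRows = serverRows;
        this.fetchSize = fetchSize;
    }

    // Like ResultSet.next(): serves from the local ROW-SET, refilling it when exhausted.
    boolean next() {
        if (rowSetIndex >= rowSet.size()) {
            if (serverPosition >= serverRows.size()) {
                return false; // RESULT-SET fully consumed
            }
            int end = Math.min(serverPosition + fetchSize, serverRows.size());
            rowSet = new ArrayList<>(serverRows.subList(serverPosition, end));
            rowSetIndex = 0;
            serverPosition = end;
            roundTrips++; // one simulated network call per ROW-SET
        }
        current = rowSet.get(rowSetIndex++);
        return true;
    }

    T get() {
        return current;
    }

    int roundTrips() {
        return roundTrips;
    }
}
```

With 100 rows and a fetch size of 10, iterating to the end makes 10 simulated calls and never holds more than 10 rows locally, which matches the arithmetic above. A driver that ignores the hint behaves like a fetch size equal to the full row count.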
The fetchSize parameter is a hint to the JDBC driver as to how many rows to fetch in one go from the database. But the driver is free to ignore this and do what it sees fit. Some drivers, like the Oracle one, fetch rows in chunks, so you can read very large result sets without needing lots of memory. Other drivers just read in the whole result set in one go, and I'm guessing that's what your driver is doing.
You can try upgrading your driver to the SQL Server 2008 version (which might be better), or the open-source jTDS driver.
You need to ensure that auto-commit on the Connection is turned off, or setFetchSize will have no effect.
dbConnection.setAutoCommit(false);
Edit: Remembered that when I used this fix it was Postgres-specific, but hopefully it will still work for SQL Server.
Statement interface Doc
SUMMARY: void setFetchSize(int rows)
Gives the JDBC driver a hint as to the
number of rows that should be fetched
from the database when more rows are
needed.
Read this ebook J2EE and beyond By Art Taylor
Sounds like mssql jdbc is buffering the entire resultset for you. You can add a connect string parameter saying selectMode=cursor or responseBuffering=adaptive. If you are on version 2.0+ of the 2005 mssql jdbc driver then response buffering should default to adaptive.
http://msdn.microsoft.com/en-us/library/bb879937.aspx
It sounds to me that you really want to limit the rows being returned in your query and page through the results. If so, you can do something like:
select * from (select rownum myrow, a.* from TEST1 a )
where myrow between 5 and 10 ;
You just have to determine your boundaries.
Try this:
String SQL = "select col1, col2, coln from mytable where timecol = yesterday";
connection.setAutoCommit(false);
PreparedStatement stmt = connection.prepareStatement(SQL, SQLServerResultSet.TYPE_SS_SERVER_CURSOR_FORWARD_ONLY, SQLServerResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(2000);
stmt.set....
stmt.execute();
ResultSet rset = stmt.getResultSet();
while (rset.next()) {
// ......
I had the exact same problem in a project. The issue is that even though the fetch size might be small enough, JdbcTemplate reads the whole result of your query and maps it into one huge list, which might blow your memory. I ended up extending NamedParameterJdbcTemplate with a function that returns a Stream of objects. That Stream is based on the ResultSet normally returned by JDBC, but it pulls data from the ResultSet only as the Stream requires it. This works as long as you don't keep a reference to every object the Stream emits. I drew heavily on the implementation of org.springframework.jdbc.core.JdbcTemplate#execute(org.springframework.jdbc.core.ConnectionCallback). The only real difference is what is done with the ResultSet. I ended up writing this function to wrap the ResultSet:
private <T> Stream<T> wrapIntoStream(ResultSet rs, RowMapper<T> mapper) {
CustomSpliterator<T> spliterator = new CustomSpliterator<T>(rs, mapper, Long.MAX_VALUE, Spliterator.NONNULL | Spliterator.IMMUTABLE | Spliterator.ORDERED);
Stream<T> stream = StreamSupport.stream(spliterator, false);
return stream;
}
private static class CustomSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
// won't put code for constructor or properties here
// the idea is to pull for the ResultSet and set into the Stream
@Override
public boolean tryAdvance(Consumer<? super T> action) {
try {
// you can add some logic to close the stream/Resultset automatically
if(rs.next()) {
T mapped = mapper.mapRow(rs, rowNumber++);
action.accept(mapped);
return true;
} else {
return false;
}
} catch (SQLException e) {
// wrap the checked exception, since tryAdvance cannot declare SQLException
throw new IllegalStateException(e);
}
}
}
You can add some logic to make that Stream close its ResultSet automatically; otherwise, don't forget to close it when you are done.
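For reference, here is one way the whole wrapper could look as a self-contained sketch, including an onClose hook that closes the ResultSet. This is not the code from my project: ResultSetStreams and RowFn are hypothetical names, RowFn is a stand-in for Spring's RowMapper so the sketch needs nothing beyond the JDK, and NONNULL is left out of the characteristics on the assumption that a mapper might return null.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class ResultSetStreams {
    private ResultSetStreams() {}

    // Hypothetical stand-in for Spring's RowMapper<T>.
    @FunctionalInterface
    interface RowFn<T> {
        T mapRow(ResultSet rs, int rowNum) throws SQLException;
    }

    // Wraps a ResultSet in a lazy, sequential Stream; closing the Stream closes the ResultSet.
    static <T> Stream<T> stream(ResultSet rs, RowFn<T> mapper) {
        Spliterator<T> spliterator = new Spliterators.AbstractSpliterator<T>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.IMMUTABLE) {
            private int rowNum = 0;

            @Override
            public boolean tryAdvance(Consumer<? super T> action) {
                try {
                    if (!rs.next()) {
                        return false; // no more rows
                    }
                    // A row is only read and mapped when the Stream asks for it.
                    action.accept(mapper.mapRow(rs, rowNum++));
                    return true;
                } catch (SQLException e) {
                    throw new IllegalStateException("Failed to read next row", e);
                }
            }
        };
        return StreamSupport.stream(spliterator, false).onClose(() -> {
            try {
                rs.close();
            } catch (SQLException e) {
                throw new IllegalStateException("Failed to close ResultSet", e);
            }
        });
    }
}
```

Used inside try-with-resources, the ResultSet is closed as soon as the block exits. The Statement and Connection still have to be released by whoever opened them.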
