I have a classic Java EE system, Web tier with JSF, EJB 3 for the BL, and Hibernate 3 doing the data access to a DB2 database. I am struggling with the following scenario: A user will initiate a process which involves retrieving a large data set from the database. The retrieval process takes some time and so the user does not receive an immediate response, gets impatient and opens a new browser and initiates the retrieval again, sometimes multiple times. The EJB container is obviously unaware of the fact that the first retrievals are no longer relevant, and when the database returns a result set, Hibernate starts populating a set of POJOs which take up vast amounts of memory, eventually causing an OutOfMemoryError.
A potential solution that I thought of was to use the Hibernate Session's cancelQuery method. However, the cancelQuery method only works before the database returns a result set. Once the database returns a result set and Hibernate begins populating the POJOs, the cancelQuery method no longer has an effect. In this case, the database queries themselves return rather quickly, and the bulk of the performance overhead seems to reside in populating the POJOs, at which point we can no longer call the cancelQuery method.
The solution implemented ended up looking like this:
The general idea was to maintain a map from the HttpSession of each user to the Hibernate sessions currently running queries on that user's behalf, so that when the user closes the browser we are able to kill the running queries.
There were two main challenges to overcome here. One was propagating the HTTP session-id from the web tier to the EJB tier without interfering with all the method calls along the way - i.e. not tampering with existing code in the system. The second challenge was to figure out how to cancel the queries once the database had already started returning results and Hibernate was populating objects with the results.
The first problem was solved once we realized that every method along the call stack is executed by the same thread. This makes sense, as our application lives entirely within one container and makes no remote calls. We therefore created a Servlet Filter that intercepts every request to the application and stores the current HTTP session-id in a ThreadLocal variable. This way the HTTP session-id is available to every method call further down the stack.
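A minimal sketch of the ThreadLocal part of this idea follows. The class and method names (SessionIdHolder, callWith) are hypothetical and not from our code base, and the Servlet API is deliberately left out so the block stays self-contained; in a real Servlet Filter the body of doFilter() would play the role of callWith(), with chain.doFilter(request, response) as the work.

```java
import java.util.concurrent.Callable;

/** Hypothetical holder for the HTTP session id of the thread serving the current request. */
final class SessionIdHolder {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private SessionIdHolder() {}

    /** Returns the session id bound to the calling thread, or null if none is bound. */
    static String get() {
        return CURRENT.get();
    }

    /** Runs the work with the session id bound to the calling thread, then unbinds it. */
    static <T> T callWith(String sessionId, Callable<T> work) {
        CURRENT.set(sessionId);
        try {
            return work.call();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // Always clear: container threads are pooled and reused across requests.
            CURRENT.remove();
        }
    }
}
```

Any code running on the same thread, however deep in the stack, can then call SessionIdHolder.get() without the id being passed through method signatures.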
The second challenge was a little more sticky. We discovered that the Hibernate method responsible for running the queries and subsequently populating the POJOs was called doQuery and located in the org.hibernate.loader.Loader.java class. (We happen to be using Hibernate 3.5.3, but the same holds true for newer versions of Hibernate.):
private List doQuery(
final SessionImplementor session,
final QueryParameters queryParameters,
final boolean returnProxies) throws SQLException, HibernateException {
final RowSelection selection = queryParameters.getRowSelection();
final int maxRows = hasMaxRows( selection ) ?
selection.getMaxRows().intValue() :
Integer.MAX_VALUE;
final int entitySpan = getEntityPersisters().length;
final ArrayList hydratedObjects = entitySpan == 0 ? null : new ArrayList( entitySpan * 10 );
final PreparedStatement st = prepareQueryStatement( queryParameters, false, session );
final ResultSet rs = getResultSet( st, queryParameters.hasAutoDiscoverScalarTypes(), queryParameters.isCallable(), selection, session );
final EntityKey optionalObjectKey = getOptionalObjectKey( queryParameters, session );
final LockMode[] lockModesArray = getLockModes( queryParameters.getLockOptions() );
final boolean createSubselects = isSubselectLoadingEnabled();
final List subselectResultKeys = createSubselects ? new ArrayList() : null;
final List results = new ArrayList();
try {
handleEmptyCollections( queryParameters.getCollectionKeys(), rs, session );
EntityKey[] keys = new EntityKey[entitySpan]; //we can reuse it for each row
if ( log.isTraceEnabled() ) log.trace( "processing result set" );
int count;
for ( count = 0; count < maxRows && rs.next(); count++ ) {
if ( log.isTraceEnabled() ) log.debug("result set row: " + count);
Object result = getRowFromResultSet(
rs,
session,
queryParameters,
lockModesArray,
optionalObjectKey,
hydratedObjects,
keys,
returnProxies
);
results.add( result );
if ( createSubselects ) {
subselectResultKeys.add(keys);
keys = new EntityKey[entitySpan]; //can't reuse in this case
}
}
if ( log.isTraceEnabled() ) {
log.trace( "done processing result set (" + count + " rows)" );
}
}
finally {
session.getBatcher().closeQueryStatement( st, rs );
}
initializeEntitiesAndCollections( hydratedObjects, rs, session, queryParameters.isReadOnly( session ) );
if ( createSubselects ) createSubselects( subselectResultKeys, queryParameters, session );
return results; //getResultList(results);
}
In this method you can see that the results are first fetched from the database as a good old-fashioned java.sql.ResultSet, after which the method loops over each row and creates an object from it. Some additional initialization is performed in the initializeEntitiesAndCollections() method, which is called after the loop. After debugging a little, we discovered that the bulk of the performance overhead was in these sections of the method, and not in the part that gets the java.sql.ResultSet from the database, yet the cancelQuery method is only effective on that first part. The solution therefore was to add a condition to the for loop that checks whether the thread has been interrupted, like this:
for ( count = 0; count < maxRows && rs.next() && !Thread.currentThread().isInterrupted(); count++ ) {
// ...
}
as well as to perform the same check before calling the initializeEntitiesAndCollections() method:
if (!Thread.interrupted()) {
initializeEntitiesAndCollections(hydratedObjects, rs, session,
queryParameters.isReadOnly(session));
if (createSubselects) {
createSubselects(subselectResultKeys, queryParameters, session);
}
}
Additionally, by calling the Thread.interrupted() on the second check, the flag is cleared and does not affect the further functioning of the program. Now when a query is to be canceled, the canceling method accesses the Hibernate session and thread stored in a map with the HTTP session-id as the key, calls the cancelQuery method on the session and calls the interrupt method of the thread.
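The cancellation side can be sketched roughly as follows. This is an illustration under assumptions, not our production code: RunningQueryRegistry and its methods are hypothetical names, and the Hibernate Session (on which cancelQuery would also be called) is left out so that the block only needs the JDK. processRows() stands in for the patched loop in doQuery.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry: HTTP session id -> thread currently loading data for it. */
final class RunningQueryRegistry {
    private static final Map<String, Thread> RUNNING = new ConcurrentHashMap<>();

    private RunningQueryRegistry() {}

    /** Called by the worker thread just before it starts the query. */
    static void register(String httpSessionId) {
        RUNNING.put(httpSessionId, Thread.currentThread());
    }

    /** Called by the worker thread when the query has finished normally. */
    static void unregister(String httpSessionId) {
        RUNNING.remove(httpSessionId);
    }

    /** Called when the user's HTTP session ends; returns true if a thread was interrupted. */
    static boolean cancel(String httpSessionId) {
        Thread worker = RUNNING.remove(httpSessionId);
        if (worker == null) {
            return false;
        }
        worker.interrupt();
        return true;
    }

    /** Stand-in for the patched row loop: stops early once the thread is interrupted. */
    static int processRows(int totalRows) {
        int count = 0;
        while (count < totalRows && !Thread.currentThread().isInterrupted()) {
            count++; // a real loop would hydrate one row here
        }
        Thread.interrupted(); // clears the flag, like the second check in the patched method
        return count;
    }
}
```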
I had a similar problem in a totally different environment. I did the following: before adding a new job to my queue, I first checked whether the 'same job' was already enqueued by that user. If so, I did not accept the second job and informed the user about that.
This doesn't answer your question on how to protect the user from an OutOfMemoryError if the data is too big to fit in the available RAM. But it's a good trick to protect your server from doing useless work.
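A rough sketch of that check, assuming a job can be identified by a string key (for example the query name plus its parameters). JobGate and its method names are made up for this illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gate that rejects a job if the same user already has the same job pending. */
final class JobGate {
    private final Set<String> active = ConcurrentHashMap.newKeySet();

    /** Returns true if the job was accepted, false if it is a duplicate. */
    boolean tryAccept(String userId, String jobKey) {
        // Set.add is atomic here and returns false when the key is already present.
        return active.add(userId + "|" + jobKey);
    }

    /** Must be called when the job ends (successfully or not) so it can be submitted again. */
    void finished(String userId, String jobKey) {
        active.remove(userId + "|" + jobKey);
    }
}
```

The caller would invoke tryAccept() before enqueueing and finished() in a finally block around the job itself.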
Too complicated for me :-) I would create a separate service for "heavy" queries and store in it information about the query parameters, and maybe the results, which would be valid for a limited time. If query execution takes too long, the user receives a message that the task will take considerable time and that they may wait or cancel it. Such a scenario works fine for analytic queries. This variant gives you simple access to the task running on the server, so you can kill it.
But if you have a problem with Hibernate, then I suppose the problem is not in analytic queries but in ordinary business queries. If their execution takes too long, can you try to use the L2 cache (a cold start may be very long, but hot data would be returned instantly)? Or optimize the Hibernate/JDBC parameters?
According to the official Drools documentation, it is possible to obtain results from a stateless session using a Query.
// Set up a list of commands
List cmds = new ArrayList();
cmds.add( CommandFactory.newSetGlobal( "list1", new ArrayList(), true ) );
cmds.add( CommandFactory.newInsert( new Person( "jon", 102 ), "person" ) );
cmds.add( CommandFactory.newQuery( "Get People", "getPeople" ) );
// Execute the list
ExecutionResults results =
ksession.execute( CommandFactory.newBatchExecution( cmds ) );
// Retrieve the ArrayList
results.getValue( "list1" );
// Retrieve the inserted Person fact
results.getValue( "person" );
// Retrieve the query as a QueryResults instance.
results.getValue( "Get People" );
In the sample above, Get People is a Drools Query which basically returns an object or a list of objects from a stateless (!) session.
In my project I need to obtain an object created in stateless Kie session, so I've created a Query:
query "getCustomerProfileResponse"
$result: CustomerProfileResponse()
end
The CustomerProfileResponse object is constructed and inserted in the RHS of a rule:
insert(customerProfileResponse);
I wrote the following code to execute commands in batch mode and query the resulted CustomerProfileResponse:
// Creating a batch list
List<Command<?>> commands = new ArrayList<Command<?>>(10);
commands.add(CommandFactory.newInsert(customerProfile));
commands.add(CommandFactory.newQuery(GET_CUSTOMER_PROFILE_RESPONSE,
GET_CUSTOMER_PROFILE_RESPONSE));
// GO!
ExecutionResults results = kSession.execute(CommandFactory.newBatchExecution(commands));
FlatQueryResults queryResults = (FlatQueryResults) results.getValue(GET_CUSTOMER_PROFILE_RESPONSE); // size() is 0!
But queryResults returns an empty list.
Searching Stack Overflow for similar questions, I found out that it is not possible to run queries against stateless sessions in Drools using batch mode, since the session closes immediately after the execute() method is called, and that the solution is to inject an empty CustomerProfileResponse object along with CustomerProfile in the request.
Can anybody shed some light on the issue?
Adding CommandFactory.newFireAllRules() after newInsert and before newQuery should solve the problem. See http://drools-moved.46999.n3.nabble.com/rules-users-Query-in-stateless-knowledge-session-returns-no-results-td3210735.html
Your rules will not fire until all the commands have been executed, i.e. the implicit fireAllRules() happens once all commands have been executed. This means the query is invoked before your rule fires to insert the object.
Instead, you need to add the FireAllRules command before executing the query.
Using Spring Boot with Spanner in the Google Cloud environment, we are now struggling with performance issues.
To demonstrate this, I set up a small demo case that baselines our different approaches to retrieving data from Spanner.
The first approach
uses "native" drivers from Google to instantiate a dbClient and retrieves data like so.
@Repository
public class SpannerNativeDAO implements CustomerDAO {
private final DatabaseClient dbClient;
private final String SQL = "select * from customer where customer_id = ";
public SpannerNativeDAO(
@Value("${spring.cloud.gcp.spanner.instanceId}") String instanceId,
@Value("${spring.cloud.gcp.spanner.database}") String dbId,
@Value("${spring.cloud.gcp.spanner.project-id}") String projectId,
@Value("${google.application.credentials}") String pathToCredentials)
throws IOException {
try (FileInputStream google_application_credentials = new FileInputStream(pathToCredentials)) {
final SpannerOptions spannerOptions =
SpannerOptions.newBuilder().setProjectId(projectId)
.setCredentials(ServiceAccountCredentials.fromStream(google_application_credentials)).build();
final Spanner spanner = spannerOptions.getService();
final DatabaseId databaseId1 = DatabaseId.of(projectId, instanceId, dbId);
dbClient = spanner.getDatabaseClient(databaseId1);
// give it a first shot to speed up consecutive calls
dbClient.singleUse().executeQuery(Statement.of("select 1 from customer"));
}
}
private Customer readCustomerFromSpanner(Long customerId) {
try {
Statement statement = Statement.of(SQL + customerId);
ResultSet resultSet = dbClient.singleUse().executeQuery(statement);
while (resultSet.next()) {
return Customer.builder()
.customerId(resultSet.getLong("customer_id"))
.customerStatus(CustomerStatus.valueOf(resultSet.getString("status")))
.updateTimestamp(Timestamp.from(Instant.now())).build();
}
} catch (Exception ex) {
//log
}
return null;
}
....
}
The second approach
uses the Spring Boot Data Starter (https://github.com/spring-cloud/spring-cloud-gcp/tree/master/spring-cloud-gcp-starters/spring-cloud-gcp-starter-data-spanner)
and simply goes like this
@Repository
public interface SpannerCustomerRepository extends SpannerRepository<Customer, Long> {
@Query("SELECT customer.customer_id, customer.status, customer.status_info, customer.update_timestamp "
+ "FROM customer customer WHERE customer.customer_id = @arg1")
List<Customer> findByCustomerId(@Param("arg1") Long customerId);
}
Now if I take the first approach, establishing the initial gRPC connection to Spanner takes > 5 seconds and all consecutive calls take around 1 second. The second approach takes only approx. 400 ms for each call after the initial call.
To test the differences, I wired up both solutions in one Spring Boot project and compared them to an in-memory solution (~100 ms).
All given timings refer to local tests on dev machines, but they stem from investigating performance problems within the cloud environment.
I tested several different SpannerOptions (SessionOptions) with no results and ran a profiler on the project.
It seems like 96% of the response time comes from establishing a gRPC channel to Spanner, whereas the database itself processes and responds within 5 ms.
We really don't understand the behaviour. We only work with very little test data and a couple of small tables.
The DatabaseClient is supposed to manage the connection pool and is itself wired into a singleton-scoped repository bean. So sessions should be reused, right?
Why does the first approach take much longer than the second one? The Spring framework itself simply uses the DatabaseClient as a member within SpannerOperations / SpannerTemplate.
How can we generally reduce latency? More than 200 ms for a plain response on each DB call seems four times more than we would have expected. (I am aware that local timing benchmarks need to be treated with care.)
Tracing gives us good visibility into the client; hopefully it can help you diagnose the latencies.
Running TracingSample, I get from stackdriver. There are different backends you can use, or print it out as logs.
The sample above also exports http://localhost:8080/rpcz and http://localhost:8080/tracez you can poke around to check latencies and traces.
A tutorial on setting it up: Cloud Spanner, instrumented by OpenCensus and exported to Stackdriver
The problem here is not related to Spring or DAO's, but that you are not closing the ResultSet that is returned by the query. This causes the Spanner library to think that the session that is used to execute your query is still in use, and causes the library to create a new session every time you execute a query. This session creation, handling and pooling is all taken care of for you by the client library, but it does require you to close resources when they are no longer being used.
I tested this with a very simple example, and I can reproduce the exact same behavior as what you are seeing by not closing the ResultSet.
Consider the following example:
/**
* This method will execute the query quickly, as the ResultSet
* is closed automatically by the try-with-resources block.
*/
private Long executeQueryFast() {
Statement statement = Statement.of("SELECT * FROM T WHERE ID=1");
try (ResultSet resultSet = dbClient.singleUse().executeQuery(statement)) {
while (resultSet.next()) {
return resultSet.getLong("ID");
}
} catch (Exception ex) {
// log
}
return null;
}
/**
* This method will execute the query slowly, as the ResultSet is
* not closed and the Spanner library thinks that the session is
* still in use. Executing this method repeatedly will cause
* the library to create a new session for each method call.
* Closing the ResultSet will cause the session that was used
* to be returned to the session pool, and the sessions will be
* re-used.
*/
private Long executeQuerySlow() {
Statement statement = Statement.of("SELECT * FROM T WHERE ID=1");
try {
ResultSet resultSet = dbClient.singleUse().executeQuery(statement);
while (resultSet.next()) {
return resultSet.getLong("ID");
}
} catch (Exception ex) {
// log
}
return null;
}
You should always place ResultSets (and all other AutoCloseables) in a try-with-resources block whenever possible.
Note that if you consume a ResultSet that is returned by Spanner completely, i.e. you call ResultSet#next() until it returns false, the ResultSet is also implicitly closed and the session is returned to the pool. I would however recommend not to rely solely on that, but to always wrap a ResultSet in a try-with-resources.
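To make the mechanism more tangible, here is a toy model of a session pool. It is NOT the Spanner client API and greatly simplifies what the real library does; ToySessionPool and ToyResultSet are invented names. It only illustrates why an unclosed result keeps its session checked out, so that the next query has to create a new one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model only; the real Spanner session pool is considerably more involved. */
final class ToySessionPool {
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int created = 0;

    /** Number of (expensive) sessions this pool has had to create so far. */
    int sessionsCreated() {
        return created;
    }

    /** Hands out an idle session if there is one, otherwise creates a new one. */
    ToyResultSet executeQuery() {
        Integer session = idle.poll();
        if (session == null) {
            session = ++created;
        }
        return new ToyResultSet(session);
    }

    /** The session goes back to the pool only when the result is closed. */
    final class ToyResultSet implements AutoCloseable {
        private final int session;
        private boolean closed;

        ToyResultSet(int session) {
            this.session = session;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                idle.push(session);
            }
        }
    }
}
```

In this model, five queries wrapped in try-with-resources reuse a single session, while five queries whose results are never closed create five sessions.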
Can you confirm that the performance doesn't change if the SQL strings are made the same between the two methods? (* vs spelling them out individually).
Also, since you're expecting a single customer in the first method, I'm inferring that the customer ID is a key column? If so, you can use the read-by-key methods from SpannerRepository, and that might be faster than a SQL query.
I am trying to improve my Java app's performance and I'm focusing at this point on one end point which has to insert a large amount of data into mysql.
I'm using plain JDBC with the MariaDB Java client driver:
try (PreparedStatement stmt = connection.prepareStatement(
"INSERT INTO data (" +
"fId, valueDate, value, modifiedDate" +
") VALUES (?,?,?,?)")) {
for (DataPoint dp : datapoints) {
stmt.setLong(1, fId);
stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
stmt.setDouble(3, dp.getValue());
stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
stmt.addBatch();
}
int[] results = stmt.executeBatch();
}
From populating the new DB from dumped files, I know that max_allowed_packet is important and I've got that set to 536,870,912 bytes.
In https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html it states that:
If you are inserting many rows from the same client at the same time,
use INSERT statements with multiple VALUES lists to insert several
rows at a time. This is considerably faster (many times faster in some
cases) than using separate single-row INSERT statements. If you are
adding data to a nonempty table, you can tune the
bulk_insert_buffer_size variable to make data insertion even faster.
See Section 5.1.7, “Server System Variables”.
On my DBs, this is set to 8MB
I've also read about key_buffer_size (currently set to 16MB).
I'm concerned that these last 2 might not be enough. I can do some rough calculations on the JSON input to this algorithm because it looks something like this:
[{"actualizationDate":null,"data":[{"date":"1999-12-31","value":0},
{"date":"2000-01-07","value":0},{"date":"2000-01-14","value":3144},
{"date":"2000-01-21","value":358},{"date":"2000-01-28","value":1049},
{"date":"2000-02-04","value":-231},{"date":"2000-02-11","value":-2367},
{"date":"2000-02-18","value":-2651},{"date":"2000-02-25","value":-393},
{"date":"2000-03-03","value":1725},{"date":"2000-03-10","value":-896},
{"date":"2000-03-17","value":2210},{"date":"2000-03-24","value":1782},
and it looks like the 8MB configured for bulk_insert_buffer_size could easily be exceeded, if not key_buffer_size as well.
But the MySQL docs only make mention of MyISAM engine tables, and I'm currently using InnoDB tables.
I can set up some tests but it would be good to know how this will break or degrade, if at all.
[EDIT] I have --rewriteBatchedStatements=true. In fact here's my connection string:
jdbc:p6spy:mysql://myhost.com:3306/mydb\
?verifyServerCertificate=true\
&useSSL=true\
&requireSSL=true\
&cachePrepStmts=true\
&cacheResultSetMetadata=true\
&cacheServerConfiguration=true\
&elideSetAutoCommits=true\
&maintainTimeStats=false\
&prepStmtCacheSize=250\
&prepStmtCacheSqlLimit=2048\
&rewriteBatchedStatements=true\
&useLocalSessionState=true\
&useLocalTransactionState=true\
&useServerPrepStmts=true
(from https://github.com/brettwooldridge/HikariCP/wiki/MySQL-Configuration )
An alternative is to execute the batch from time to time. This keeps each batch small and lets you focus on more important problems.
int batchSize = 0;
for (DataPoint dp : datapoints) {
stmt.setLong(1, fId);
stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
stmt.setDouble(3, dp.getValue());
stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
stmt.addBatch();
//When the limit is reached, execute and reset the counter
if (++batchSize >= BATCH_LIMIT) {
stmt.executeBatch();
batchSize = 0;
}
}
// To execute the remaining items
if (batchSize > 0) {
stmt.executeBatch();
}
I generally use a constant, or a parameter based on the DAO implementation to be more dynamic, but a batch of 10_000 rows is a good start.
private static final int BATCH_LIMIT = 10_000;
Note that it is not necessary to clear the batch after an execution. Even though this is not stated in the Statement.executeBatch documentation, it is in the JDBC 4.3 specification:
14 Batch Updates
14.1 Description of Batch Updates
14.1.2 Successful Execution
Calling the method executeBatch closes the calling Statement object’s current result set if one is open.
The statement’s batch is reset to empty once executeBatch returns.
Managing the results is a bit more complicated, but you can still concatenate the results of the individual executions if you need them. They can be analyzed at any time, since the ResultSet is no longer needed.
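One possible way to collect the update counts of the individual executeBatch() calls is sketched below. BatchResults is a hypothetical helper, not part of JDBC. The failures() method assumes the driver reports a failed statement with Statement.EXECUTE_FAILED, which typically only shows up in BatchUpdateException.getUpdateCounts() for drivers that continue processing after an error.

```java
import java.sql.Statement;
import java.util.Arrays;

/** Hypothetical accumulator for the int[] arrays returned by several executeBatch() calls. */
final class BatchResults {
    private int[] all = new int[0];

    /** Appends the update counts returned by one executeBatch() call. */
    void append(int[] chunk) {
        int oldLength = all.length;
        all = Arrays.copyOf(all, oldLength + chunk.length);
        System.arraycopy(chunk, 0, all, oldLength, chunk.length);
    }

    /** Returns a copy of all update counts collected so far, in execution order. */
    int[] toArray() {
        return all.clone();
    }

    /** Counts the entries that the driver marked as failed. */
    long failures() {
        return Arrays.stream(all).filter(c -> c == Statement.EXECUTE_FAILED).count();
    }
}
```

In the loop above, each stmt.executeBatch() call would become results.append(stmt.executeBatch()), with results being one BatchResults instance created before the loop.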
I have built an importer for MongoDB and Cassandra. Basically all operations of the importer are the same, except for the last part where data gets formed to match the needed cassandra table schema and wanted mongodb document structure. The write performance of Cassandra is really bad compared to MongoDB and I think I'm doing something wrong.
Basically, my abstract importer class loads the data, reads out all data and passes it to the extending MongoDBImporter or CassandraImporter class to send data to the databases. One database is targeted at a time - no "dual" inserts to both C* and MongoDB at the same time. The importer is run on the same machine against the same number of nodes (6).
The Problem:
The MongoDB import finished after 57 minutes. I ingested 10.000.000 documents and I expect about the same number of rows for Cassandra. My Cassandra importer has now been running for 2.5 hours and is only at 5.000.000 inserted rows. I will wait for the importer to finish and edit the actual finish time in here.
How I import with Cassandra:
I prepare two statements once before ingesting data. Both statements are UPDATE queries because sometimes I have to append data to an existing list. My table is cleared completely before starting the import. The prepared statements get used over and over again.
PreparedStatement statementA = session.prepare(queryA);
PreparedStatement statementB = session.prepare(queryB);
For every row, I create a BoundStatement and pass that statement to my "custom" batching method:
BoundStatement bs = new BoundStatement(preparedStatement); //either statementA or B
bs = bs.bind();
//add data... with several bs.setXXX(..) calls
cassandraConnection.executeBatch(bs);
With MongoDB, I can insert 1000 documents (that's the maximum) at a time without problems. For Cassandra, the importer crashes at some point with com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large, for just 10 of my statements. I'm using the code below to build the batches. Btw, I started with batch sizes of 1000, 500, 300, 200, 100, 50 and 20, but obviously they did not work either. I then set it down to 10 and it threw the exception again. Now I'm out of ideas why it's breaking.
private static final int MAX_BATCH_SIZE = 10;
private Session session;
private BatchStatement currentBatch;
...
@Override
public ResultSet executeBatch(Statement statement) {
if (session == null) {
throw new IllegalStateException(CONNECTION_STATE_EXCEPTION);
}
if (currentBatch == null) {
currentBatch = new BatchStatement(Type.UNLOGGED);
}
currentBatch.add(statement);
if (currentBatch.size() == MAX_BATCH_SIZE) {
ResultSet result = session.execute(currentBatch);
currentBatch = new BatchStatement(Type.UNLOGGED);
return result;
}
return null;
}
My C* schema looks like this
CREATE TYPE stream.event (
data_dbl frozen<map<text, double>>,
data_str frozen<map<text, text>>,
data_bool frozen<map<text, boolean>>,
);
CREATE TABLE stream.data (
log_creator text,
date text, //date of the timestamp
ts timestamp,
log_id text, //some id
hour int, //just the hour of the timestmap
x double,
y double,
events list<frozen<event>>,
PRIMARY KEY ((log_creator, date, hour), ts, log_id)
) WITH CLUSTERING ORDER BY (ts ASC, log_id ASC)
I sometimes need to add further new events to an existing row. That's why I need a List of UDTs. My UDT contains three maps because the event creators produce different data (key/value pairs of type string/double/boolean). I am aware of the fact that the UDTs are frozen and I can not touch the maps of already ingested events. That's fine for me, I just need to add new events that have the same timestamp sometimes. I partition on the creator of the logs (some sensor name) as well as the date of the record (ie. "22-09-2016") and the hour of the timestamp (to distribute data more while keeping related data close together in a partition).
I'm using Cassandra 3.0.8 with the Datastax Java Driver, version 3.1.0 in my pom.
According to "What is the batch limit in Cassandra?", I should not increase the batch size by adjusting batch_size_fail_threshold_in_kb in my cassandra.yaml. So... what should I do, or what's wrong with my import?
UPDATE
So I have adjusted my code to run async queries and store the currently running inserts in a list. Whenever an async insert finishes, it is removed from the list. When the list size exceeds a threshold and an error occurred in an earlier insert, the method waits 500 ms until the number of inserts is below the threshold. My code now automatically increases the threshold when no insert has failed.
But after streaming 3.300.000 rows, there were 280.000 inserts being processed and no error had happened. This number of currently processed inserts looks too high. The 6 Cassandra nodes are running on commodity hardware, which is 2 years old.
Is this high number (280.000 for 6 nodes) of concurrent inserts a problem? Should I add a variable like MAX_CONCURRENT_INSERT_LIMIT?
private List<ResultSetFuture> runningInsertList;
private static int concurrentInsertLimit = 1000;
private static int concurrentInsertSleepTime = 500;
...
@Override
public void executeBatch(Statement statement) throws InterruptedException {
if (this.runningInsertList == null) {
this.runningInsertList = new ArrayList<>();
}
//Sleep while the currently processing number of inserts is too high
while (concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) {
Thread.sleep(concurrentInsertSleepTime);
}
ResultSetFuture future = this.executeAsync(statement);
this.runningInsertList.add(future);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
@Override
public void onSuccess(ResultSet result) {
runningInsertList.remove(future);
}
@Override
public void onFailure(Throwable t) {
concurrentInsertErrorOccured = true;
}
}, MoreExecutors.sameThreadExecutor());
if (!concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) {
concurrentInsertLimit += 2000;
LOGGER.info(String.format("New concurrent insert limit is %d", concurrentInsertLimit));
}
return;
}
After using C* for a bit, I'm convinced you should really use batches only for keeping multiple tables in sync. If you don't need that feature, then don't use batches at all, because you will incur performance penalties.
The correct way to load data into C* is with async writes, with optional backpressure if your cluster can't keep up with the ingestion rate. You should replace your "custom" batching method with something that:
performs async writes
keeps the number of in-flight writes under control
performs some retries when a write times out.
To perform async writes, use the .executeAsync method, which returns a ResultSetFuture object.
To keep the number of in-flight queries under control, just collect the ResultSetFuture objects returned by .executeAsync in a list, and if the list reaches (ballpark value here) say 1k elements, wait for all of them to finish before issuing more writes. Or you can wait for the first one to finish before issuing one more write, just to keep the list full.
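As a sketch of the "keep the number of in-flight writes under control" idea, the block below uses a Semaphore instead of a list. It is written against the JDK's CompletableFuture so that it runs without the driver; BoundedAsyncWriter is a made-up name. With the DataStax 3.x driver, executeAsync returns a Guava-based ResultSetFuture, so the release would be done from a callback registered with Futures.addCallback instead of whenComplete.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical throttle that allows at most maxInFlight asynchronous writes at a time. */
final class BoundedAsyncWriter {
    private final Semaphore permits;
    private final int maxInFlight;

    BoundedAsyncWriter(int maxInFlight) {
        this.maxInFlight = maxInFlight;
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks while maxInFlight writes are pending, then starts one more. */
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> asyncWrite) {
        permits.acquireUninterruptibly();
        try {
            // Release the permit on success AND on failure, so failed writes do not leak slots.
            return asyncWrite.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the write could not even be started
            throw e;
        }
    }

    /** Number of writes that have been started but not yet completed. */
    int inFlight() {
        return maxInFlight - permits.availablePermits();
    }
}
```

The producer thread simply calls submit() for every row; it slows down automatically whenever the cluster falls behind.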
And finally, you can check for write failures when you're waiting on an operation to complete. In that case, you could:
1. write again with the same timeout value
2. write again with an increased timeout value
3. wait some amount of time, and then write again with the same timeout value
4. wait some amount of time, and then write again with an increased timeout value
From 1 to 4 the backpressure gets stronger. Pick the one that best fits your case.
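A minimal sketch of the "wait some amount of time, and then write again" variants, with the wait doubling after each failed attempt. The write itself is abstracted as a BooleanSupplier that returns true on success; RetryingWriter and its parameters are illustrative assumptions, and timeouts are not modelled here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical retry helper with a growing pause between attempts. */
final class RetryingWriter {
    private RetryingWriter() {}

    /**
     * Tries the write up to maxAttempts times, sleeping between attempts and
     * doubling the wait each time. Returns the number of attempts used,
     * or -1 if every attempt failed or the thread was interrupted.
     */
    static int writeWithRetry(BooleanSupplier write, int maxAttempts, long initialWaitMillis) {
        long wait = initialWaitMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (write.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(wait);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return -1;
                }
                wait *= 2;
            }
        }
        return -1;
    }
}
```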
EDIT after question update
Your insert logic seems a bit broken to me:
1. I don't see any retry logic
2. You don't remove the item from the list if it fails
3. Your while (concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) is wrong, because you will sleep only when the number of issued queries is > concurrentInsertLimit, and because of 2. your thread will just park there.
4. You never set concurrentInsertErrorOccured back to false
I usually keep a list of (failed) queries for the purpose of retrying them at a later time. That gives me powerful control over the queries, and when the failed queries start to accumulate I sleep for a few moments and then keep on retrying them (up to X times, then hard fail...).
This list should be very dynamic, e.g. you add items to it when queries fail and remove items when you perform a retry. Now you can understand the limits of your cluster and tune your concurrentInsertLimit based on e.g. the average number of failed queries in the last second, or stick with the simpler approach "pause if we have an item in the retry list" etc...
EDIT 2 after comments
Since you don't want any retry logic, I would change your code this way:
private List<ResultSetFuture> runningInsertList;
private static int concurrentInsertLimit = 1000;
private static int concurrentInsertSleepTime = 500;
...
@Override
public void executeBatch(Statement statement) throws InterruptedException {
if (this.runningInsertList == null) {
this.runningInsertList = new ArrayList<>();
}
ResultSetFuture future = this.executeAsync(statement);
this.runningInsertList.add(future);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
@Override
public void onSuccess(ResultSet result) {
runningInsertList.remove(future);
}
@Override
public void onFailure(Throwable t) {
runningInsertList.remove(future);
concurrentInsertErrorOccured = true;
}
}, MoreExecutors.sameThreadExecutor());
//Sleep while the currently processing number of inserts is too high
while (runningInsertList.size() >= concurrentInsertLimit) {
Thread.sleep(concurrentInsertSleepTime);
}
if (!concurrentInsertErrorOccured) {
// Increase your ingestion rate if no query failed so far
concurrentInsertLimit += 10;
} else {
// Decrease your ingestion rate because at least one query failed
concurrentInsertErrorOccured = false;
concurrentInsertLimit = Math.max(1, concurrentInsertLimit - 50);
while (runningInsertList.size() >= concurrentInsertLimit) {
Thread.sleep(concurrentInsertSleepTime);
}
}
return;
}
You could also optimize the procedure a bit by replacing your List<ResultSetFuture> with a counter.
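Such a counter could look roughly like this; InFlightCounter is a hypothetical name. An AtomicInteger is used because the callbacks may run on driver threads while the producer thread reads the value.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical replacement for the list: only the number of pending writes is tracked. */
final class InFlightCounter {
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Call right after executeAsync has been issued. */
    void started() {
        inFlight.incrementAndGet();
    }

    /** Call from both onSuccess and onFailure. */
    void finished() {
        inFlight.decrementAndGet();
    }

    /** True when the producer should pause before issuing more writes. */
    boolean atOrAbove(int limit) {
        return inFlight.get() >= limit;
    }

    int current() {
        return inFlight.get();
    }
}
```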
Hope that helps.
When you run a batch in Cassandra, it chooses a single node to act as the coordinator. This node then becomes responsible for seeing to it that the batched writes find their appropriate nodes. So (for example) by batching 10000 writes together, you have now tasked one node with the job of coordinating 10000 writes, most of which will be for different nodes. It's very easy to tip over a node, or kill latency for an entire cluster by doing this. Hence, the reason for the limit on batch sizes.
The problem is that Cassandra CQL BATCH is a misnomer, and it doesn't do what you or anyone else thinks that it does. It is not to be used for performance gains. Parallel, asynchronous writes will always be faster than running the same number of statements BATCHed together.
I know that I could easily batch 10.000 rows together because they will go to the same partition. ... Would you still use single row inserts (async) rather than batches?
That depends on whether or not write performance is your true goal. If so, then I'd still stick with parallel, async writes.
For some more good info on this, check out these two blog posts by DataStax's Ryan Svihla:
Cassandra: Batch loading without the Batch keyword
Cassandra: Batch Loading Without the Batch — The Nuanced Edition
In my database I have 3 rows: two have defaultFlag set to 0 and one has it set to 1. In my processing I am changing the defaultProperty of one object from 0 to 1, but I have not saved this object yet.
Before saving, I need to query the database to find out whether any row already has defaultFlag set; there should only ever be 1 default.
So before doing the update I run a query to find the current default, and I get 2 results back. If I check in the database there is only 1 row with the default set, but the query returns two results because this object's default property has changed from 0 to 1, even though the object has not yet been saved to the database.
I am really confused as to why the Hibernate query returns 2 rows when the database has only one row with the default set, plus one object whose default property has changed but has not been saved.
Any thoughts would be helpful. I can provide the query if need be.
Update
Following the suggestions, I added session.clear() before running the query.
session.clear();
String sql = "SELECT * FROM BANKACCOUNTS WHERE PARTYID = :partyId AND CURRENCYID = :currencySymbol AND ISDEFAULTBANKACCOUNT= :defaultbankAccount";
SQLQuery q = session.createSQLQuery(sql);
q.addEntity(BankAccount.class);
q.setParameter("partyId", partyId);
q.setParameter("currencySymbol", currencySymbol);
q.setParameter("defaultbankAccount", 1);
return q.uniqueResult();
It now returns 1 row as expected, but I am getting this exception:
nested exception is org.hibernate.NonUniqueObjectException: a different object with the same identifier value was already associated with the session
Either query which row has the "default flag" set before you start changing it, or query for a list of rows with default flag set & clear all except the one you're trying to set.
Very easy: stop mucking about with your current "brittle" approach, which will break in the face of concurrency or if the data is ever in an inconsistent state. Use a reliable approach instead, one that always sets the data to a valid state.
@SuppressWarnings("unchecked")
protected void makeAccountDefault (BankAccount acc) {
    // session, partyId and currencySymbol are assumed to be in scope, as in the question's code.
    // find & clear any existing 'Default Accounts', other than specified.
    //
    String sql = "SELECT * FROM BANKACCOUNTS WHERE PARTYID = :partyId AND CURRENCYID = :currencySymbol AND ISDEFAULTBANKACCOUNT= :defaultbankAccount";
    SQLQuery q = session.createSQLQuery(sql);
    q.addEntity(BankAccount.class);
    q.setParameter("partyId", partyId);
    q.setParameter("currencySymbol", currencySymbol);
    q.setParameter("defaultbankAccount", 1);
    //
    List<BankAccount> existingDefaults = q.list();
    for (BankAccount existing : existingDefaults) {
        if (! existing.equals( acc))
            existing.setDefaultBankAccount( false);
    }
    // set the specified Account as Default.
    acc.setDefaultBankAccount( true);
    // done.
}
This is how you write proper code: keep it simple & reliable. Never make or depend on weak assumptions about the reliability of data or internal state. Always read & process the "beforehand" state before you do the operation. Just implement your code clean & right and it will serve you well.
I think that your second query won't be executed at all because the entity is already in the first-level cache.
As your transaction is not yet committed, you don't see the changes in the underlying database.
(this is only a guess)
That's only a guess because you're not giving many details, but I suppose that you call myObject.setMyDefaultProperty(1) while your session is open.
In that case, be aware that you don't actually need to call session.update(myObject) to save the change. This is the normal case: Hibernate tracks changes to objects attached to an open session and writes them to the database transparently when the session is flushed.
So, in fact, I think that your change is saved... (but not committed, of course, and thus not seen when you check in the db)
To verify this, you should enable the hibernate.show_sql option. You will then see whether an UPDATE statement is triggered (I advise always enabling this option during development anyway).
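For reference, one way to switch on this logging from code is to collect the settings in a java.util.Properties object. This is only a sketch: the helper class HibernateSqlLogging is hypothetical, while hibernate.show_sql and hibernate.format_sql are the standard Hibernate property names.

```java
import java.util.Properties;

/** Hypothetical helper that collects Hibernate SQL-logging settings for development. */
class HibernateSqlLogging {
    static Properties devLoggingProperties() {
        Properties props = new Properties();
        // Print every SQL statement Hibernate issues to the console.
        props.setProperty("hibernate.show_sql", "true");
        // Pretty-print those statements so they are easier to read.
        props.setProperty("hibernate.format_sql", "true");
        return props;
    }
}
```

These properties can be passed to Hibernate's Configuration (for example via addProperties) before the SessionFactory is built, or the same keys can be set in hibernate.cfg.xml or hibernate.properties.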