I have a table of items with a flag of "null" or "done" , I need to fetch the null flagged items ,process them, set flag to done.
thing is , I want to use pagination , where I fetch 500 by 500 item(null flagged)
my design goes as follows
I fetch 500 item // the producer
put them in a queue
some thread takes these 500 item // the consumer
operate on them and updates flag to "done"
the problem am facing is the consumer is pretty slow, so the producer fetches the same 500 part again , so I went for indexing but seems not to work properly
public List<Parts> getNParts(int listSize) {
try {
criteria = session.createCriteria(Parts.class);
criteria.setFirstResult(DBIndexGuard.getNextIndex()); //index+=500;
criteria.add(Restrictions.isNull("Status"));
criteria.setMaxResults(listSize); //list size is 500;
newPartList = criteria.list();
} catch (Exception e) {
e.printStackTrace();
} finally {
}
return newPartList;
}
how can I implement pagination in order to fetch 500 by 500 different items with the criteria that these items are null flagged ?
create a synchronized method for producer - consumer type of problem, this tutorial can help you.
You may try one of the following implementation.
1. Eliminate duplicate processing on consumer side. Set done only if null.
2. Eliminate duplicate on producer by maintaining additional status 'in processs" whenever put in queue. Exclude them in your producer query.
3. While paginating, sort your records by primary key of your table , and for subsequent pages , query only on those records greater than the primary key of last record in previous page.
This problem can easily be solved with inclusion of one more status, say 'processing'. So your producer will mark the records picked as - 'processing' and consumer can then work on them and set their status to 'done'.
In this case producer will not pick the already picked records.
I solved it as follows
if (newPartList.isEmpty() || newPartList.size()<DBIndexGuard.getAllowedListSize()) { // AllowedListSize=500
System.out.println("DataFetcher Sleeping");
inputQueue.offer(newPartList);
DBIndexGuard.resetIndex();
session.clear();
TimeUnit.MINUTES.sleep(10);
}
Related
I need to delete items from two databases - one internal managed by my team, and another managed by some other team (they hold different, but related data). The constraint is that if one of these deletes from database fail, then the entire operation should be cancelled and rolled back.
Now, I can control and access my own database easily, but not the database managed by the other team. My line of thought is as follows:
delete from my database first (if it fails, abort everything straightaway)
assuming step 1 succeeds, now I call the API from the other team to delete the data on their side as well
if step 2 succeeds, all is good... if it fails, I'll roll back the delete on my database in step 1
In order to achieve step 3, I think I will have to save the data in step 1 in some variables within the function. Roughly speaking...
public void deleteData (String id) {
Optional<var> entityToBeDeleted = getEntity(id);
try{
deleteFromMyDB(id);
} catch (Exception e){
throw e;
}
try{
deleteFromOtherDB(id);
} catch (Exception e){
persistInMyDB(entityToBeDeleted);
throw e;
}
}
Now I am aware that the above code looks horrible. Any guru can give me some advice on how to do this better?
What does it mean if the remote deletion fails? That the deletion should not happen at all?
Can the local deletion fail for a non-transient reason?
A possible solution is:
Create a "pending deletions" table in your database which will contain the keys of records you want to delete.
When you need to delete record, insert a row in this table.
Then delete the record from the remote system.
If this succeeds, delete the "pending deletion" record and the local record, preferably in a single transaction.
Whenever you start your system, check the "pending deletion" table, and delete any records mention from the local and remote systems (I assume that both these operations are idempotent). Then delete the "pending deletion" record.
I have a collection from which I am getting max id and while inserting using max id + 1. The id column is unique in this collection.
When multiple instances of this service is invoked the concurrent application reads the same collection and gets the max id. But since the same collection is accessed the same max id is returned to multiple instances, can I get an explicit lock on the collection while reading the data from this collection and release the lock after writing in Mongo DB?
Using mongoDB method collections.findAndModify() you can create your own "get-and-increment" query.
For example:
db.collection_name.findAndModify({
query: { document_identifier: "doc_id_1" },
update: { $inc: { max_id: 1 } },
new: true //return the document AFTER it's updated
})
https://docs.mongodb.com/manual/reference/method/db.collection.findAndModify/
Take a look at this page for more help:
https://www.tutorialspoint.com/mongodb/mongodb_autoincrement_sequence.htm
Try this approach
Instead of getting the max id in read of the collection and increment it as max id + 1.
While read for multiple instances just give the document/collection, and while updating follow the below logic
Let us have the below part in a synchronized block, so that no two threads gets the same max id
synchronize() {
getMaxId from collection
increase it by 1
insert the new document
}
Please refer:
https://docs.mongodb.com/v3.0/tutorial/create-an-auto-incrementing-field/
https://www.tutorialspoint.com/mongodb/mongodb_autoincrement_sequence.htm
Hope it Helps!
I have built an importer for MongoDB and Cassandra. Basically all operations of the importer are the same, except for the last part where data gets formed to match the needed cassandra table schema and wanted mongodb document structure. The write performance of Cassandra is really bad compared to MongoDB and I think I'm doing something wrong.
Basically, my abstract importer class loads the data, reads out all data and passes it to the extending MongoDBImporter or CassandraImporter class to send data to the databases. One database is targeted at a time - no "dual" inserts to both C* and MongoDB at the same time. The importer is run on the same machine against the same number of nodes (6).
The Problem:
MongoDB import finished after 57 minutes. I ingested 10.000.000 documents and I expect about the same amount of rows for Cassandra. My Cassandra importer is now running since 2,5 hours and is only at 5.000.000 inserted rows. I will wait for the importer to finish and edit the actual finish time in here.
How I import with Cassandra:
I prepare two statements once before ingesting data. Both statements are UPDATE queries because sometimes I have to append data to an existing list. My table is cleared completely before starting the import. The prepared statements get used over and over again.
PreparedStatement statementA = session.prepare(queryA);
PreparedStatement statementB = session.prepare(queryB);
For every row, I create a BoundStatement and pass that statement to my "custom" batching method:
BoundStatement bs = new BoundStatement(preparedStatement); //either statementA or B
bs = bs.bind();
//add data... with several bs.setXXX(..) calls
cassandraConnection.executeBatch(bs);
With MongoDB, I can insert 1000 Documents (thats the maximum) at a time without problems. For Cassandra, the importer crashes with com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large for just 10 of my statements at some point. I'm using this code to build the batches. Btw, I began with 1000, 500, 300, 200, 100, 50, 20 batch size before but obviously they do not work too. I then set it down to 10 and it threw the exception again. Now I'm out of ideas why it's breaking.
private static final int MAX_BATCH_SIZE = 10;
private Session session;
private BatchStatement currentBatch;
...
#Override
public ResultSet executeBatch(Statement statement) {
if (session == null) {
throw new IllegalStateException(CONNECTION_STATE_EXCEPTION);
}
if (currentBatch == null) {
currentBatch = new BatchStatement(Type.UNLOGGED);
}
currentBatch.add(statement);
if (currentBatch.size() == MAX_BATCH_SIZE) {
ResultSet result = session.execute(currentBatch);
currentBatch = new BatchStatement(Type.UNLOGGED);
return result;
}
return null;
}
My C* schema looks like this
CREATE TYPE stream.event (
data_dbl frozen<map<text, double>>,
data_str frozen<map<text, text>>,
data_bool frozen<map<text, boolean>>,
);
CREATE TABLE stream.data (
log_creator text,
date text, //date of the timestamp
ts timestamp,
log_id text, //some id
hour int, //just the hour of the timestmap
x double,
y double,
events list<frozen<event>>,
PRIMARY KEY ((log_creator, date, hour), ts, log_id)
) WITH CLUSTERING ORDER BY (ts ASC, log_id ASC)
I sometimes need to add further new events to an existing row. That's why I need a List of UDTs. My UDT contains three maps because the event creators produce different data (key/value pairs of type string/double/boolean). I am aware of the fact that the UDTs are frozen and I can not touch the maps of already ingested events. That's fine for me, I just need to add new events that have the same timestamp sometimes. I partition on the creator of the logs (some sensor name) as well as the date of the record (ie. "22-09-2016") and the hour of the timestamp (to distribute data more while keeping related data close together in a partition).
I'm using Cassandra 3.0.8 with the Datastax Java Driver, version 3.1.0 in my pom.
According to What is the batch limit in Cassandra?, I should not increase the batch size by adjusting batch_size_fail_threshold_in_kb in my cassandra.yaml. So... what do or what's wrong with my import?
UPDATE
So I have adjusted my code to run async queries and store the currently running inserts in a list. Whenever an async insert finishes, it will be removed from the list. When the list size exceeds a threshold and an error occured in an insert before, the method will wait 500ms until the inserts are below the threshold. My code is now automatically increasing the threshold when no insert failed.
But after streaming 3.300.000 rows, there were 280.000 inserts being processed but no error happened. This seems number of currently processed inserts looks too high. The 6 cassandra nodes are running on commodity hardware, which is 2 years old.
Is this the high number (280.000 for 6 nodes) of concurrent inserts a problem? Should I add a variable like MAX_CONCURRENT_INSERT_LIMIT?
private List<ResultSetFuture> runningInsertList;
private static int concurrentInsertLimit = 1000;
private static int concurrentInsertSleepTime = 500;
...
#Override
public void executeBatch(Statement statement) throws InterruptedException {
if (this.runningInsertList == null) {
this.runningInsertList = new ArrayList<>();
}
//Sleep while the currently processing number of inserts is too high
while (concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) {
Thread.sleep(concurrentInsertSleepTime);
}
ResultSetFuture future = this.executeAsync(statement);
this.runningInsertList.add(future);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
#Override
public void onSuccess(ResultSet result) {
runningInsertList.remove(future);
}
#Override
public void onFailure(Throwable t) {
concurrentInsertErrorOccured = true;
}
}, MoreExecutors.sameThreadExecutor());
if (!concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) {
concurrentInsertLimit += 2000;
LOGGER.info(String.format("New concurrent insert limit is %d", concurrentInsertLimit));
}
return;
}
After using C* for a bit, I'm convinced you should really use batches only for keeping multiple tables in sync. If you don't need that feature, then don't use batches at all because you will incur in performance penalties.
The correct way to load data into C* is with async writes, with optional backpressure if your cluster can't keep up with the ingestion rate. You should replace your "custom" batching method with something that:
performs async writes
keep under control how many inflight writes you have
perform some retry when a write timeouts.
To perform async writes, use the .executeAsync method, that will return you a ResultSetFuture object.
To keep under control how many inflight queries just collect the ResultSetFuture object retrieved from the .executeAsync method in a list, and if the list gets (ballpark values here) say 1k elements then wait for all of them to finish before issuing more writes. Or you can wait for the first to finish before issuing one more write, just to keep the list full.
And finally, you can check for write failures when you're waiting on an operation to complete. In that case, you could:
write again with the same timeout value
write again with an increased timeout value
wait some amount of time, and then write again with the same timeout value
wait some amount of time, and then write again with an increased timeout value
From 1 to 4 you have an increased backpressure strength. Pick the one that best fit your case.
EDIT after question update
Your insert logic seems a bit broken to me:
I don't see any retry logic
You don't remove the item in the list if it fails
Your while (concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) is wrong, because you will sleep only when the number of issued queries is > concurrentInsertLimit, and because of 2. your thread will just park there.
You never set to false concurrentInsertErrorOccured
I usually keep a list of (failed) queries for the purpose of retrying them at later time. That gives me powerful control on the queries, and when the failed queries starts to accumulate I sleep for a few moments, and then keep on retrying them (up to X times, then hard fail...).
This list should be very dynamic, eg you add items there when queries fail, and remove items when you perform a retry. Now you can understand the limits of your cluster, and tune your concurrentInsertLimit based on eg the avg number of failed queries in the last second, or stick with the simpler approach "pause if we have an item in the retry list" etc...
EDIT 2 after comments
Since you don't want any retry logic, I would change your code this way:
private List<ResultSetFuture> runningInsertList;
private static int concurrentInsertLimit = 1000;
private static int concurrentInsertSleepTime = 500;
...
#Override
public void executeBatch(Statement statement) throws InterruptedException {
if (this.runningInsertList == null) {
this.runningInsertList = new ArrayList<>();
}
ResultSetFuture future = this.executeAsync(statement);
this.runningInsertList.add(future);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
#Override
public void onSuccess(ResultSet result) {
runningInsertList.remove(future);
}
#Override
public void onFailure(Throwable t) {
runningInsertList.remove(future);
concurrentInsertErrorOccured = true;
}
}, MoreExecutors.sameThreadExecutor());
//Sleep while the currently processing number of inserts is too high
while (runningInsertList.size() >= concurrentInsertLimit) {
Thread.sleep(concurrentInsertSleepTime);
}
if (!concurrentInsertErrorOccured) {
// Increase your ingestion rate if no query failed so far
concurrentInsertLimit += 10;
} else {
// Decrease your ingestion rate because at least one query failed
concurrentInsertErrorOccured = false;
concurrentInsertLimit = Max(1, concurrentInsertLimit - 50);
while (runningInsertList.size() >= concurrentInsertLimit) {
Thread.sleep(concurrentInsertSleepTime);
}
}
return;
}
You could also optimize a bit the procedure by replacing your List<ResultSetFuture> with a counter.
Hope that helps.
When you run a batch in Cassandra, it chooses a single node to act as the coordinator. This node then becomes responsible for seeing to it that the batched writes find their appropriate nodes. So (for example) by batching 10000 writes together, you have now tasked one node with the job of coordinating 10000 writes, most of which will be for different nodes. It's very easy to tip over a node, or kill latency for an entire cluster by doing this. Hence, the reason for the limit on batch sizes.
The problem is that Cassandra CQL BATCH is a misnomer, and it doesn't do what you or anyone else thinks that it does. It is not to be used for performance gains. Parallel, asynchronous writes will always be faster than running the same number of statements BATCHed together.
I know that I could easily batch 10.000 rows together because they will go to the same partition. ... Would you still use single row inserts (async) rather than batches?
That depends on whether or not write performance is your true goal. If so, then I'd still stick with parallel, async writes.
For some more good info on this, check out these two blog posts by DataStax's Ryan Svihla:
Cassandra: Batch loading without the Batch keyword
Cassandra: Batch Loading Without the Batch — The Nuanced Edition
The scenario is simple.
I have a somehow large MySQL db containing two tables:
-- Table 1
id (primary key) | some other columns without constraints
-----------------+--------------------------------------
1 | foo
2 | bar
3 | foobar
... | ...
-- Table 2
id_src | id_trg | some other columns without constraints
-------+--------+---------------------------------------
1 | 2 | ...
1 | 3 | ...
2 | 1 | ...
2 | 3 | ...
2 | 5 | ...
...
On table1 only id is a primary key. This table contains about 12M entries.
On table2 id_src and id_trg are both primary keys and both have foreign key constraints on table1's id and they also have the option DELETE ON CASCADE enabled. This table contains about 110M entries.
Ok, now what I'm doing is only to create a list of ids that I want to remove from table 1 and then I'm executing a simple DELETE FROM table1 WHERE id IN (<the list of ids>);
The latter process is as you may have guessed would delete the corresponding id from table2 as well. So far so good, but the problem is that when I run this on a multi-threaded env and I get many Deadlocks!
A few notes:
There is no other process running at the same time nor will be (for the time being)
I want this to be fast! I have about 24 threads (if this does make any difference in the answer)
I have already tried almost all of transaction isolation levels (except the TRANSACTION_NONE) Java sql connection transaction isolation
Ordering/sorting the id's I think would not help!
I have already tried SELECT ... FOR UPDATE, but a simple DELETE would take up to 30secs! (so there is no use of using it) :
DELETE FROM table1
WHERE id IN (
SELECT id FROM (
SELECT * FROM table1
WHERE id='some_id'
FOR UPDATE) AS x);
How can I fix this?
I would appreciate any help and thanks in advance :)
Edit:
Using InnoDB engine
On a single thread this process would take a dozen hours even maybe a whole day, but I'm aiming for a few hours!
I'm already using a connection pool manager: java.util.concurrent
For explanation on double nested SELECTs please refer to MySQL can’t specify target table for update in FROM clause
The list that is to be deleted from DB, may contain a couple of million entries in total which is divided into chunks of 200
The FOR UPDATE clause is that I've heard that it locks a single row instead of locking the whole table
The app uses Spring's batchUpdate(String sqlQuery) method, thus the transactions are managed automatically
All ids have index enabled and the ids are unique 50 chars max!
DELETE ON CASCADE on id_src and id_trg (each separately) would mean that every delete on table1 id=x would lead to deletes on table2 id_src=x and id_trg=x
Some code as requested:
public void write(List data){
try{
Arraylist idsToDelete = getIdsToDelete();
String query = "DELETE FROM table1 WHERE id IN ("+ idsToDelete + " )";
mysqlJdbcTemplate.getJdbcTemplate().batchUpdate(query);
} catch (Exception e) {
LOG.error(e);
}
}
and myJdbcTemplate is just an abstract class that extends JdbcDaoSupport.
First of all your first simple delete query in which you are passing ids, should not create problem if you are passing ids till a limit like 1000 (total no of rows in child table also should be near about but not to many like 10,000 etc.), but if you are passing like 50,000 or more then it can create locking issue.
To avoid deadlock, you can follow below approach to take care this issue (assuming bulk deletion will not be part of production system)-
Step1: Fetch all ids by select query and keep in cursor.
Step2: now delete these ids stored in cursor in a stored procedure one by one.
Note: To check why deletion is acquiring locks we have to check several things like how many ids you are passing, what is transaction level set at DB level, what is your Mysql configuration setting in my.cnf etc...
It may be dangereous to delete many (> 10000) parent records each having child records deleted by cascade, because the most records you delete in a single time, the most chances of lock conflict leading to deadlock or rollback.
If it is acceptable (meaning you can make a direct JDBC connection to the database) you should (no threading involved here) :
compute the list of ids to delete
delete them by batches (between 10 and 100 a priori) committing every 100 or 1000 records
As the heavier job should be on database part, I hardly doubt that threading will help here. If you want to try it, I would recommend :
one single thread (with a dedicated database connection) computing the list of ids to delete and alimenting a synchronized queue with them
a small number of threads (4 maybe 8), each with its own database connection that :
use a prepared DELETE FROM table1 WHERE id = ? in batches
take ids from the queue and prepare the batches
send a batch to the database every 10 or 100 records
do a commit every 10 or 100 batches
I cannot imagine that the whole process could take more than several minutes.
After some other readings, it looks like I was used to old systems and that my numbers are really conservative.
Ok here's what I did, it might not actually avoid having Deadlocks but was my only option at time being.
This solution is actually a way of handling MySQL Deadlocks using Spring.
Catch and retry Deadlocks:
public void write(List data){
try{
Arraylist idsToDelete = getIdsToDelete();
String query = "DELETE FROM table1 WHERE id IN ("+ idsToDelete + " )";
try {
mysqlJdbcTemplate.getJdbcTemplate().batchUpdate(query);
} catch (org.springframework.dao.DeadlockLoserDataAccessException e) {
LOG.info("Caught DEADLOCK : " + e);
retryDeadlock(query); // Retry them!
}
} catch (Exception e) {
LOG.error(e);
}
}
public void retryDeadlock(final String[] sqlQuery) {
RetryTemplate template = new RetryTemplate();
TimeoutRetryPolicy policy = new TimeoutRetryPolicy();
policy.setTimeout(30000L);
template.setRetryPolicy(policy);
try {
template.execute(new RetryCallback<int[]>() {
public int[] doWithRetry(RetryContext context) {
LOG.info("Retrying DEADLOCK " + context);
return mysqlJdbcTemplate.getJdbcTemplate().batchUpdate(sqlQuery);
}
});
} catch (Exception e1) {
e1.printStackTrace();
}
}
Another solution could be to use Spring's multiple step mechanism.
So that the DELETE queries are split into 3 and thus by starting the first step by deleting the blocking column and other steps delete the two other columns respectively.
Step1: Delete id_trg from child table;
Step2: Delete id_src from child table;
Step3: Delete id from parent table;
Of course the last two steps could be merged into 1, but in that case two distinct ItemsWriters would be needed!
I am trying to create a junit test. Scenario:
setUp: I'm adding two json documents to database
Test: I'm getting those documents using view
tearDown: I'm removing both objects
My view:
function (doc, meta) {
if (doc.type && doc.type == "UserConnection") {
emit([doc.providerId, doc.providerUserId], doc.userId);
}
}
This is how I add those documents to database and make sure that "add" is synchronous:
public boolean add(String key, Object element) {
String json = gson.toJson(element);
OperationFuture<Boolean> result = couchbaseClient.add(key, 0, json);
return result.get();
}
JSON Documents that I'm adding are:
{"userId":"1","providerId":"test_pId","providerUserId":"test_pUId","type":"UserConnection"}
{"userId":"2","providerId":"test_pId","providerUserId":"test_pUId","type":"UserConnection"}
This is how I call the view:
View view = couchbaseClient.getView(DESIGN_DOCUMENT_NAME, VIEW_NAME);
Query query = new Query();
query.setKey(ComplexKey.of("test_pId", "test_pUId"));
ViewResponse viewResponse = couchbaseClient.query(view, query);
Problem:
Test fails due to invalid number of elements fetched from view.
My observations:
Sometimes tests are passing
Number of elements that are fetched from view is not consistent(from 0 to 2)
When I've added those documents to database instead of calling setUp the test passed every time
Acording to this http://www.couchbase.com/docs/couchbase-sdk-java-1.1/create-update-docs.html documentation I'm adding those json documents synchronously by calling get() on returned Future object.
My question:
Is there something wrong with how I've approached to fetching data from view just after this data was inserted to DB? Is there any good practise for solving this problem? And can someone explain it to me please what I've did wrong?
Thanks,
Dariusz
In Couchbase 2.0 documents are required to be written to disk before they will show up in a view. There are three ways you can do an operation with the Java SDK. The first is asynchronous which means that you just send the data and at a later time check to make sure that the data was received correctly. If you do an asynchronous operation and then immediately call .get() as you did above then you have created a synchronous operation. When an operation returns success in these two cases above you are only guaranteed that the item has been written into memory. Your test passed sometimes only because you were lucky enough that both items were written to disk before did your query.
The third way to do an operation is with durability requirements and this is the one you want to do for your tests. Durability requirements allow you to say that you want an item to be written to disk or replicated before success is returned to the client. Take a look at the following function.
https://github.com/couchbase/couchbase-java-client/blob/1.1.0/src/main/java/com/couchbase/client/CouchbaseClient.java#L1293
You will want to use this function and set the PersistedTo parameter to MASTER.