Async writes seem to be broken in Cassandra

I have had issues with the spark-cassandra-connector (1.0.4, 1.1.0) when writing batches of 9 million rows to a 12-node Cassandra (2.1.2) cluster. I was writing with consistency ALL and reading with consistency ONE, but the number of rows read differed from 9 million every time (8,865,753, 8,753,213, etc.).
I checked the code of the connector and found no issues. I then decided to write my own application, independent of Spark and the connector, to investigate the problem (the only dependency is the DataStax Java driver, cassandra-driver-core version 2.1.3).
The full code, the startup scripts and the configuration files can now be found on github.
In pseudo-code, I wrote two different versions of the application. The sync one:
    try (Session session = cluster.connect()) {
        String cql = "insert into <<a table with 9 normal fields and 2 collections>>";
        PreparedStatement pstm = session.prepare(cql);
        for (String partitionKey : keySource) {
            // keySource is an Iterable<String> of partition keys
            BoundStatement bound = pstm.bind(partitionKey /*, << plus the other parameters >> */);
            bound.setConsistencyLevel(ConsistencyLevel.ALL);
            session.execute(bound);
        }
    }
And the async one:
    try (Session session = cluster.connect()) {
        List<ResultSetFuture> futures = new LinkedList<ResultSetFuture>();
        String cql = "insert into <<a table with 9 normal fields and 2 collections>>";
        PreparedStatement pstm = session.prepare(cql);
        for (String partitionKey : keySource) {
            // keySource is an Iterable<String> of partition keys
            while (futures.size() >= 10 /* Max 10 concurrent writes */) {
                // Wait for the first issued write to terminate
                ResultSetFuture future = futures.get(0);
                future.get();
                futures.remove(0);
            }
            BoundStatement bound = pstm.bind(partitionKey /*, << plus the other parameters >> */);
            bound.setConsistencyLevel(ConsistencyLevel.ALL);
            futures.add(session.executeAsync(bound));
        }
        while (futures.size() > 0) {
            // Wait for the other write requests to terminate
            ResultSetFuture future = futures.get(0);
            future.get();
            futures.remove(0);
        }
    }
The async version is similar to what the connector does in its no-batch configuration.
The two versions of the application behave the same in all circumstances, except when the load is high.
For instance, when running the sync version with 5 threads on 9 machines (45 threads) writing 9 million rows to the cluster, I find all the rows in the subsequent read (with spark-cassandra-connector).
If I run the async version with 1 thread per machine (9 threads), the execution is much faster but I cannot find all the rows in the subsequent read (the same problem that arose with the spark-cassandra-connector).
No exception was thrown by the code during the executions.
What could be the cause of the issue?
I am adding some further results (thanks for the comments):
Async version with 9 threads on 9 machines, with 5 concurrent writers per thread (45 concurrent writers): no issues
Sync version with 90 threads on 9 machines (10 threads per JVM instance): no issues
Issues seemed to start arising with async writes and a number of concurrent writers > 45 and <= 90, so I ran further tests to make sure the finding was right:
Replaced the "get" method of ResultSetFuture with "getUninterruptibly": same issues.
Async version with 18 threads on 9 machines, with 5 concurrent writers per thread (90 concurrent writers): no issues.
The last finding shows that the high number of concurrent writers (90) is not the issue, contrary to what the first tests suggested. The problem is the high number of async writes using the same session.
With 5 concurrent async writes on the same session the issue is not present. If I increase the number of concurrent writes to 10, some operations get lost without notification.
It seems that async writes are broken in Cassandra 2.1.2 (or in the Cassandra Java driver) if you issue multiple (>5) writes concurrently on the same session.

Nicola and I communicated over email this weekend, and I thought I'd provide an update here with my current theory. I took a look at the github project Nicola shared and experimented with an 8 node cluster on EC2.
I was able to reproduce the issue with 2.1.2, but did observe that after a period of time I could re-execute the spark job and all 9 million rows were returned.
What I seemed to notice was that while nodes were under compaction I did not get all 9 million rows. On a whim I took a look at the change log for 2.1 and observed an issue CASSANDRA-8429 - "Some keys unreadable during compaction" that may explain this problem.
Seeing that the issue has been fixed and is targeted for 2.1.3, I reran the test against the cassandra-2.1 branch and ran the count job while compaction activity was happening and got 9 million rows back.
I'd like to experiment with this some more since my testing with the cassandra-2.1 branch was rather limited and the compaction activity may have been purely coincidental, but I'm hoping this may explain these issues.

A few possibilities:
Your async example is issuing 10 writes at a time with 9 threads, so 90 at a time, while your sync example is only doing 45 writes at a time, so I would try cutting the async down to the same rate so it's an apples-to-apples comparison.
You don't say how you're checking for exceptions with the async approach. I see you are using future.get(), but it is recommended to use getUninterruptibly() as noted in the documentation:
Waits for the query to return and return its result. This method is usually more convenient than Future.get() because it:
- Waits for the result uninterruptibly, and so doesn't throw InterruptedException.
- Returns meaningful exceptions, instead of having to deal with ExecutionException.
As such, it is the preferred way to get the future result.
So perhaps you're not seeing write exceptions that are occurring with your async example.
Another unlikely possibility is that your keySource is for some reason returning duplicate partition keys, so when you do the writes, some of them end up overwriting a previously inserted row and don't increase the row count. But that should impact the sync version too, so that's why I say it's unlikely.
I would try writing smaller sets than 9 million and at a slower rate and see if the problem only starts to happen at a certain number of inserts or a certain rate of inserts. If the number of inserts has an impact, then I'd suspect something is wrong with the row keys in the data. If the rate of inserts has an impact, then I'd suspect hot spots causing write timeout errors.
One other thing to check would be the Cassandra log file, to see if there are any exceptions being reported there.
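The first two points above (matching the write rate and making sure no failure goes unnoticed) can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's CompletableFuture as a stand-in for the driver's future type, and `writeAsync` is a hypothetical placeholder for a call such as session.executeAsync(bound). It is not the connector's or the driver's actual code.

```java
import java.util.Arrays;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class BoundedAsyncWrites {

    /**
     * Issues one async write per key while keeping at most maxInFlight writes
     * unfinished, and returns every failure reported by a completed write.
     * writeAsync is a hypothetical stand-in for the real async write call.
     */
    static Queue<Throwable> writeAll(Iterable<String> keys, int maxInFlight,
                                     Function<String, CompletableFuture<Void>> writeAsync) {
        Semaphore permits = new Semaphore(maxInFlight);
        Queue<Throwable> failures = new ConcurrentLinkedQueue<>();
        for (String key : keys) {
            // Blocks while maxInFlight writes are still pending.
            permits.acquireUninterruptibly();
            writeAsync.apply(key).whenComplete((ignored, error) -> {
                if (error != null) {
                    failures.add(error);   // record the failure instead of dropping it
                }
                permits.release();
            });
        }
        // Taking every permit back means all pending writes have completed.
        permits.acquireUninterruptibly(maxInFlight);
        return failures;
    }

    public static void main(String[] args) {
        Queue<Throwable> failures = writeAll(Arrays.asList("k1", "k2", "bad", "k3"), 2, key ->
                CompletableFuture.runAsync(() -> {
                    // Simulated write: one key fails on purpose.
                    if (key.equals("bad")) {
                        throw new IllegalStateException("write failed: " + key);
                    }
                }));
        System.out.println("failed writes: " + failures.size());
    }
}
```

Setting maxInFlight to 5 per thread would reproduce the 45-writer sync load from the question, which makes the two runs directly comparable.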
Addendum: 12/30/14
I tried to reproduce the symptom using your sample code with Cassandra 2.1.2 and driver 2.1.3. I used a single table with a key of an incrementing number so that I could see gaps in the data. I did a lot of async inserts (30 at a time per thread in 10 threads all using one global session). Then I did a "select count (*)" of the table, and indeed it reported fewer rows in the table than expected. Then I did a "select *" and dumped the rows to a file and checked for missing keys. They seemed to be randomly distributed, but when I queried for those missing individual rows, it turned out they were actually present in the table. Then I noticed every time I did a "select count (*)", it came back with a different number, so it seems to be giving an approximation of the number of rows in the table rather than the actual number.
So I revised the test program to do a read back phase after all the writes, since I know all the key values. When I did that, all the async writes were present in the table.
So my question is, how are you checking the number of rows that are in your table after you finish writing? Are you querying for each individual key value or using some kind of operation like "select *"? If the latter, that seems to give most of the rows, but not all of them, so perhaps your data is actually present. Since no exceptions are being thrown, it seems to suggest that the writes are all successful. The other question would be: are you sure your key values are unique for all 9 million rows?
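A read-back phase of the kind described above, plus the uniqueness check on the keys, could look roughly like this. It is a minimal sketch: `rowExists` is a hypothetical stand-in for a per-key SELECT against the table, and the key values in `main` are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class ReadBackCheck {

    /** Returns the expected keys for which rowExists (a hypothetical per-key lookup) finds no row. */
    static List<String> missingKeys(Iterable<String> expectedKeys, Predicate<String> rowExists) {
        List<String> missing = new ArrayList<>();
        for (String key : expectedKeys) {
            if (!rowExists.test(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    /** Returns the keys that occur more than once; duplicates overwrite each other and lower the row count. */
    static Set<String> duplicateKeys(Iterable<String> keys) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String key : keys) {
            if (!seen.add(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> written = Arrays.asList("a", "b", "c", "b");
        Set<String> stored = new HashSet<>(Arrays.asList("a", "b"));   // pretend table contents
        System.out.println("missing: " + missingKeys(written, stored::contains));
        System.out.println("duplicates: " + duplicateKeys(written));
    }
}
```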

Related

Oracle 11g - System performance impacted as the data grows

Our system has quite a high level of concurrency, with multiple Java threads each picking up one record at a time from a given Oracle 11g table that normally holds about two million records.
There are always many records ready to be picked up for processing. The records ready to be processed are selected based on a relatively complex SQL statement, but once selected the processing order is based on a FIFO algorithm (ID order).
It is crucial that the same record is not picked up by two distinct threads. Because of this we have a locking mechanism in place.
From a high-level view, it currently works like this: a Java thread invokes a stored procedure, which opens a RECORD_READY_KEYS cursor, iterates through that cursor, and tries to acquire a lock on the record with that key in a locking table. The locking attempt is done with SELECT FOR UPDATE SKIP LOCKED. If the lock succeeds, the record to process is returned to the Java thread for processing.
Everything works fine as long as there are not too many records ready to process. However, when this number grows over a limit (from our observations, over 15K), the SQL statement used to get the RECORD_READY_KEYS cursor starts slowing down. Despite being fully optimised, it starts taking close to 0.2 seconds to run, which means a Java thread can process at most five records per second. In reality, the time taken to acquire the lock, travel over the network, do the actual processing, commit the transaction, etc. makes processing even slower.
Increasing the number of Java threads is an option; however, we cannot go over a certain limit, as they will start putting pressure on the database, application server, system resources, etc.
The real problem is that we run an SQL statement to get the RECORD_READY_KEYS containing fifteen thousand keys out of a total of two million, then pick up the first available record from the top and discard the rest by closing the cursor.
My idea would be to have a KEYS_CACHE nested table defined at package level and store the result of the RECORD_READY_KEYS selection in that nested table. Once a key is locked, it is deleted from the KEYS_CACHE and returned to the Java thread. The process can go on that way until the whole KEYS_CACHE is consumed, at which point it is populated again.
Now my questions will be:
Q1. Can you see any weak point with this approach?
I can see multiple threads trying to lock the same record at the same time and thus wasting a bit of time. On the Java side we can make the stored procedure invocation synchronized only to a certain extent, as the invocation will happen from multiple JVMs. However, I cannot see this being a major issue.
Another issue would arise in the unlikely event of a rollback, as there will be no easy way to put back the deleted key. The next RECORD_READY_KEYS selection will pick it up again, though, and a delay of a few minutes will not really matter.
Q2. As the nested table holds fewer and fewer records it will become very sparse. Can you see this becoming a problem? If so, should I limit the initial size to, say, 5000 keys, or does it not really matter?
Q3. Can you see a problem with that package-level KEYS_CACHE nested table being accessed concurrently by so many threads (we have between 25 and 100 of them)?
Q4. Can you see an alternative approach that would not require a whole system redesign?
Thank you in advance
I think I was not very clear when explaining my situation. We do not lock the records to process in the two-million-record table; instead we lock the keys, which are also saved in a separate locking table.
Say I have this 2 million records table called messages:
If only messages with Key-A, Key-B, and Key-C are ready to be processed, a possible content of the key locking table may be:
Note that Key-X is in there even though no messages are ready to be processed for that key, because messages with that key have just finished processing and the clean-up thread has not kicked off yet. That is OK and even desirable: if more new messages with Key-X enter the system shortly, it saves a new insert.
So our select (fully optimised) will obtain a list with Key-A, Key-C, and Key-B, in this order (Key-C comes before Key-B because it has a message with Id = 2, which is smaller than the Id = 6 of the first Key-B message).
Very simplified, what we do here is in fact:
    SELECT key FROM messages WHERE ready = 'Y' GROUP BY key ORDER BY min(id)
Once we get that select in a cursor, we fetch the keys one by one and try to lock each of them in the key_locckings table. Once a lock succeeds, the key gets assigned to a thread (there is a threads table for this) and stays with that thread, which processes all messages that are ready for that key. As I mentioned in my first post, it is crucial that messages with the same key be processed by the same thread, as the key is how we link related messages that must be processed in sequence.
The SELECT above is instant when the number of keys selected is up to a few thousand. It still performs OK at around 10000 keys. Once the number of retrieved keys goes over 15000, the performance starts degrading. The time to run the SELECT is still OK (about 0.2 seconds) and we do have indexes on all fields involved in this selection. It is just that applying the WHERE, GROUP BY, and ORDER BY to select 15000 keys out of two million records takes time.
So the problem for us is that every single thread runs the same SELECT and gets 15000 records just to pick up one of them. What I was considering was, rather than closing the cursor and throwing the hard work away as we do at the moment, storing those keys in a package-level nested table and deleting keys from there as we allocate them to threads. My first three questions were meant to capture others' opinions about this approach, while the last one was about finding alternative ideas (e.g. someone might suggest advanced queues, etc.).

I have an example to hand that is (I think) very similar.
You have a table, with lots of rows, and you have multiple consumer processes (the Java threads), that all want to work on the contents of that table.
First off, I recommend avoiding SKIP LOCKED. If you think you need it, consider if you've set your INITRANS high enough on the table. Consider that SKIP LOCKED means that Oracle will skip locked resources, not just locked rows. If a block's ITL is full, SKIP LOCKED will skip it, even if there are unlocked rows in the block!
For a more detailed discussion of this, see here:
https://markjbobak.wordpress.com/2010/04/06/unintended-consequences/
So, on to my suggestion. For each concurrent Java thread, define a thread number. So, suppose you have 10 concurrent threads: assign each of them a thread number or thread id, 0-9. Now, if you have the flexibility to modify the table, you could add a column, THREAD_ID, and then use that in the select statement when selecting from the table. Each concurrent Java thread will select only those rows that match its thread id. In this way, you can guarantee that you'll avoid collisions. If you don't have the ability to add a column to the table, then you hopefully have a nice, numeric, sequence-driven primary key? If so, you can get the same effect by querying MOD(PRIMARY_KEY_COLUMN, 10) = :client_thread_id.
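To illustrate why the MOD() predicate avoids collisions, here is the same arithmetic on the Java side. It is only a sketch: the class and method names are made up, and it assumes non-negative, sequence-driven keys (for negative values Java's floorMod and Oracle's MOD give different results).

```java
final class ThreadPartition {

    /**
     * Mirrors MOD(PRIMARY_KEY_COLUMN, threadCount) for non-negative keys:
     * every key maps to exactly one thread id in the range 0..threadCount-1.
     */
    static int ownerOf(long primaryKey, int threadCount) {
        return (int) Math.floorMod(primaryKey, (long) threadCount);
    }

    public static void main(String[] args) {
        int threadCount = 10;
        for (long id = 1; id <= 5; id++) {
            System.out.println("row " + id + " -> thread " + ownerOf(id, threadCount));
        }
    }
}
```

Because each row has exactly one owner, two threads never compete for the same row, which is what makes SKIP LOCKED unnecessary in this scheme.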
Additionally, do you have columns that specify a status of some sort, or something like that, which you'll use to determine which rows from the table are eligible to be processed by the Java thread? If so, and particularly if that criteria significantly improves selectivity, creating a virtual column that is only populated for the values which you're interested in, could be quite useful, if that column is then added to the index. (THREAD_ID, STATUS), for example.
Finally, you mentioned processing in a specific order. If THREAD_ID, STATUS is your selection criteria, then perhaps a PRIORITY or STATUS_DATE column may be your ordering requirement. In that case, it may be useful to continue to build out the index, to add in the column(s) specifying the required order, and top it off with the primary key of the table.
With a carefully constructed index, and using the THREAD_ID idea, it should be possible to construct an index that will allow you to:
- avoid collisions (use THREAD_ID or MOD() on the primary key)
- minimize the size of the index (virtual columns)
- avoid any ORDER BY operation (add the ORDER BY columns to the index)
- avoid any TABLE ACCESS BY ROWID operation (add the primary key column to the end of the index)
I made a few assumptions that may or may not apply.

Slowness in reading the large ResultSet

I'm having problems generating a report whose result exceeds 500,000 rows. Believe me, this result is already filtered.
The query (DB2) runs almost instantly, but iterating over the ResultSet is absurdly slow.
I'm doing several tests to try to improve this process, but so far without success.
- At first I converted the data directly into the bean (used for report generation), but that is very slow and the database times out.
- I tried a simpler process for testing (ResultSet to HashMap), without success.
- I set the statement's fetch size (setFetchSize) to 2000.
- I looked into processing it with multiple threads, but ResultSet is not thread-safe.
I have already raised the database timeout to allow more processing time, but that did not solve my problem.
Anyway, I have already tried several possibilities. Does anyone have any tips or a solution to my problem?

First of all, let me be clear:
Reporting and report generation should never be done on the application DB.
Application and transactional DBs are designed for fast transactions that don't involve heavy result fetching or processing. Those tasks should be handled on a DW server or on standby replicas.
Second,
Reporting application logic should be run during less crowded hours (when the system is not used by users, i.e. nights).
If possible, put your processing logic (the maths part) on the DB side in the form of procedures with efficient queries, to improve performance in terms of both processing and data transfer.
Try to collect reports periodically using triggers, scheduled jobs, etc., and when creating reports use those intermediate reports instead of the DB. (As you said, your query execution is not a problem, but this will save iterating over a large set.) Using values from the intermediate reports means you iterate over far fewer rows.

Improving performance for WRITE operation on Oracle DB in Java

I've a typical scenario & need to understand best possible way to handle this, so here it goes -
I'm developing a solution that will retrieve data from a remote SOAP based web service & will then push this data to an Oracle database on network.
Also, this will be a scheduled task that will execute every 15 minutes.
I have event queues on the remote service that contain the INSERT/UPDATE/DELETE operations done since the last retrieval; once I retrieve the events for the last 15 minutes, it starts adding events again for the next retrieval.
Now, it's just pushing data to Oracle, so all my interactions are INSERT & UPDATE statements.
There are around 60 tables on Oracle, some of them with 100+ columns. Moreover, for every 15-minute cycle there would be around 60-70 Inserts, 100+ Updates & 10-20 Deletes.
This will be an executable jar file that terminates after the operation & starts again on the next 15-minute cycle.
So, I need to understand how I should handle WRITE operations (best practices) to improve performance for this application as a whole?
Current Test Code (on every cycle) -
Connects to remote service to get events.
Creates a connection with DB (single connection object).
Identifies the type of operation (INSERT/UPDATE/DELETE) & table on which it is done.
After above, calls the respective method based on type of operation & table.
Uses PreparedStatement with positional parameters, & retrieves each column value from the remote service & assigns it to the statement parameters.
Commits the statement & returns to get event class to process next event.
Above is repeated till all the retrieved events are processed after which program closes & then starts on next cycle & everything repeats again.
Thanks for help !

If you are inserting or updating one row at a time, you can consider executing a batch insert or a batch update. Once you are inserting or updating more than a certain number of rows, batching generally gives much better performance.

The number of DB operations you are talking about (200 every 15 minutes) is tiny and will be easy to finish in less than 15 minutes. Some concrete suggestions:
You should profile your application to understand where it is spending its time. If you don't do this, then you don't know what to optimize next and you don't know if something you did helped or hurt.
If possible, try to get all of the events in one round-trip to the remote server.
You should reuse the connection to the remote service (probably by using a library that supports connection persistence and reuse).
You should reuse the DB connections by using a connection pooling library rather than creating a new connection for each insert/update/delete. Believe it or not, creating the connection probably takes 100+ times as long as doing your DB operation once you have the connection in hand.
You should consider doing multiple (or all) of the database operations in the same transaction rather than creating a new transaction for each row that is changed. However, you should carefully consider your failure modes such that you don't lose any events (if that is an important consideration).
You should consider utilizing prepared statement caching. This may help, but maybe not if Oracle is configured properly.
You should consider trying to analyze your operations to find any that can be batched together. This can be a lot faster if you have some "hot" operations that get done often.
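Two of the suggestions above (one transaction for many rows, and batching similar statements) can be combined with the standard JDBC batch calls. The sketch below is only an illustration: the `events` table, its `id` and `payload` columns, and the `Event` class are hypothetical, and the caller is assumed to supply an open Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchedWrites {

    /** Hypothetical event row; the real tables have many more columns. */
    static final class Event {
        final int id;
        final String payload;

        Event(int id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    /**
     * Inserts all events with one prepared statement, one JDBC batch and one
     * commit, and rolls back if anything fails. Returns the number of rows sent.
     */
    static int insertAll(Connection conn, List<Event> events) throws SQLException {
        String sql = "INSERT INTO events (id, payload) VALUES (?, ?)";   // hypothetical table
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);                 // group all rows into one transaction
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Event e : events) {
                ps.setInt(1, e.id);
                ps.setString(2, e.payload);
                ps.addBatch();                     // queue the row instead of executing it
            }
            int[] counts = ps.executeBatch();      // send the queued rows together
            conn.commit();
            return counts.length;
        } catch (SQLException ex) {
            conn.rollback();                       // keep the cycle all-or-nothing
            throw ex;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Whether a failed cycle should roll back everything or keep the rows that succeeded depends on the failure modes mentioned above; the all-or-nothing choice here is just one option.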

"I've a typical scenario"
No you haven't. You have a bespoke architecture, with a unique data model, unique data and unique business requirements. That's not a bad thing, it's the state of pretty much every computer system that's not been bought off-the-shelf (and even some of them).
So, it's an experiment and you must approach it as such. There is no "best practice". Try various things and see what works best.
"need to understand best possible way to handle this"
You will improve your chances of success enormously by hiring somebody who understands Oracle databases.

java jdbc design pattern : handle many inserts

I would like to ask for some advice concerning my problem.
I have a batch job that does some computation (in a multi-threaded environment) and performs some inserts into a table.
I would like to do something like a batch insert: once I get a query, wait until I have, say, 1000 queries, and then execute them as a batch insert (rather than one by one).
I was wondering if there is any design pattern on this.
I have a solution in mind, but it's a bit complicated:
build a method that will receive the queries
add them to a list (the string and/or the statements)
do not execute until the list has 1000 items
The problem: how do I handle the end?
What I mean is, when do I execute the last 999 queries, since I'll never get to 1000?
What should I do?
I'm thinking of a thread that wakes up every 5 minutes and checks the number of items in the list. If it wakes up twice and the number is the same, it executes the existing queries.
Does anyone have a better idea?

Your database driver needs to support batch inserting. See this.
Have you established your system is choking on network traffic because there is too much communication between the service and the database? If not, I wouldn't worry about batching, until you are sure you need it.
You mention that in your plan you want to check every 5 minutes. That's an eternity. If you are going to get 1000 items in 5 minutes, you shouldn't need batching. That's ~ 3 a second.
Assuming you do want to batch, have a process wake up every 2 seconds and commit whatever is queued up. Don't wait five minutes. It might commit 0 rows, it might commit 10...who cares...With this approach, you don't need to worry that your arbitrary threshold hasn't been met.
I'm assuming that the inserts come in one at a time. If your incoming data comes in n at once, I would just commit every incoming request, no matter how many inserts happen. If your messages are coming in as some sort of messaging system, it's asynchronous anyway, so you shouldn't need to worry about batching. Under high load, the incoming messages just wait till there is capacity to handle them.

Add a commit kind of method to that API that will be called to confirm all items have been added. Also, the optimum batch size is somewhere in the range 20-50. After that the potential gain is outweighed by the bookkeeping necessary for a growing number of statements. You don't mention it explicitly, but of course you must use the dedicated batch API in JDBC.
If you need to keep track of many writers, each in its own thread, then you'll also need a begin kind of method and you can count how many times it was called, compared to how many times commit was called. Something like reference-counting. When you reach zero, you know you can flush your statement buffer.
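The "buffer until a threshold, then flush the remainder at the end" idea from this thread can be sketched with a small thread-safe buffer. This is an illustration only: `BatchBuffer` and its `sink` callback are hypothetical names, and the sink is where a real program would run the JDBC batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchBuffer<T> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<T>> sink;      // hypothetical: would execute the JDBC batch
    private List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Called by the worker threads; flushes as soon as the threshold is reached. */
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Hands whatever is queued to the sink; does nothing when the buffer is empty. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }

    /** The "commit kind of method": flushes the final partial batch. */
    @Override
    public void close() {
        flush();
    }

    public static void main(String[] args) {
        try (BatchBuffer<String> buffer =
                     new BatchBuffer<>(3, batch -> System.out.println("flushing " + batch))) {
            for (int i = 1; i <= 7; i++) {
                buffer.add("row" + i);
            }
        }
    }
}
```

With a threshold of 3 and 7 items, the sink receives batches of 3, 3 and 1. A ScheduledExecutorService could additionally call flush() every couple of seconds, as suggested earlier in the thread, so that a slow trickle of inserts never waits long.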

This is an interesting problem, and one I have faced many times. As I understand it, you are creating a batch of 1000 or more INSERT queries, all inserting into the same table.
In that situation you can write the INSERT query like this:
INSERT INTO table1 VALUES('4','India'),('5','Odisha'),('6','Bhubaneswar')
It executes only once, with multiple values. So you can keep all the values in a collection (ArrayList, List, etc.), build a query like the one above at the end, and execute it once.
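A statement of that shape can be generated with "?" placeholders rather than concatenated values, so that it can be used with a PreparedStatement. This is a sketch under assumptions: the helper name is made up, the table name must come from trusted code, and the target database must accept multi-row VALUES lists (not all of them do).

```java
import java.util.Collections;

final class MultiRowInsert {

    /**
     * Builds "INSERT INTO <table> VALUES (?, ?), (?, ?), ..." with one
     * parenthesised group of placeholders per row.
     */
    static String build(String table, int columns, int rows) {
        if (columns < 1 || rows < 1) {
            throw new IllegalArgumentException("columns and rows must be positive");
        }
        String row = "(" + String.join(", ", Collections.nCopies(columns, "?")) + ")";
        return "INSERT INTO " + table + " VALUES "
                + String.join(", ", Collections.nCopies(rows, row));
    }

    public static void main(String[] args) {
        // Three rows of two columns, matching the shape of the example above.
        System.out.println(build("table1", 2, 3));
    }
}
```

The values are then bound in row order with the usual setXxx calls, (rows * columns) parameters in total.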
You can also use the JDBC transaction API (commit(), rollback(), etc.).
Hope it helps.
All the best.

Querying over 1,000,000 records using salesforce Java API and looking for best approach

I am developing a Java application which will query tables which may hold over 1,000,000 records. I have tried everything I could to be as efficient as possible but I am only able to achieve on avg. about 5,000 records a minute and a maximum of 10,000 at one point. I have tried reverse engineering the data loader and my code seems to be very similar but still no luck.
Is threading a viable solution here? I have tried this but with very minimal results.
I have been reading and have applied everything possible, it seems (compressing requests/responses, threads, etc.), but I cannot achieve Data Loader-like speeds.
To note, the queryMore method seems to be the bottleneck.
Does anyone have any code samples or experiences they can share to steer me in the right direction?
Thanks

An approach I've used in the past is to query just for the IDs that you want (which makes the queries significantly faster). You can then parallelize the retrieve() calls across several threads.
That looks something like this:
[query thread] -> BlockingQueue -> [thread pool doing retrieve()] -> BlockingQueue
The first thread does query() and queryMore() as fast as it can, writing all ids it gets into the BlockingQueue. queryMore() isn't something you should call concurrently, as far as I know, so there's no way to parallelize this step. All ids are written into a BlockingQueue. You may wish to package them up into bundles of a few hundred to reduce lock contention if that becomes an issue. A thread pool can then do concurrent retrieve() calls on the ids to get all the fields for the SObjects and put them in a queue for the rest of your app to deal with.
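The pipeline described above could be wired up roughly as follows. This is a sketch using only JDK classes: `idPages` stands in for the pages returned by query()/queryMore(), `retrieve` is a hypothetical stand-in for the per-id retrieve() call, and error handling for failing retrieve calls is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

final class IdPipeline {
    // Unique sentinel object, compared by identity, that tells a worker to stop.
    private static final String DONE = new String("DONE");

    static <R> List<R> run(Iterable<List<String>> idPages, Function<String, R> retrieve,
                           int workers) throws InterruptedException {
        BlockingQueue<String> ids = new LinkedBlockingQueue<>(1000);   // bounded for back-pressure
        BlockingQueue<R> results = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    // Each worker pulls ids until it sees the sentinel.
                    for (String id = ids.take(); id != DONE; id = ids.take()) {
                        results.put(retrieve.apply(id));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Single producer: the paging step is not parallelized.
        for (List<String> page : idPages) {
            for (String id : page) {
                ids.put(id);
            }
        }
        for (int i = 0; i < workers; i++) {
            ids.put(DONE);                                             // one sentinel per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return new ArrayList<>(results);
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> pages = Arrays.asList(Arrays.asList("001", "002"), Arrays.asList("003"));
        List<String> records = run(pages, id -> "record-" + id, 4);
        System.out.println(records.size() + " records retrieved");
    }
}
```

Results arrive in whatever order the workers finish, so a consumer that needs the original order would have to sort or tag them.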
I wrote a Java library for using the SF API that may be useful. http://blog.teamlazerbeez.com/2011/03/03/a-new-java-salesforce-api-library/

With the Salesforce API, the batch size limit is what can really slow you down. When you use the query/queryMore methods, the maximum batch size is 2000. However, even though you may specify 2000 as the batch size in your SOAP header, Salesforce may be sending smaller batches in response. Their batch size decision is based on server activity as well as the output of your original query.
I have noticed that if I submit a query that includes any "text" fields, the batch size is limited to 50.
My suggestion would be to make sure your queries are only pulling the data that you need. I know a lot of Salesforce tables end up with a lot of custom fields that may not be needed for every integration.
Salesforce documentation on this subject

We have about 14000 records in our Accounts object and it takes quite some time to get all the records. I perform a query which takes about a minute, but SF only returns batches of no more than 500 even though I set the batch size to 2000. Each queryMore operation also takes from 45 seconds to a minute. This limitation is quite frustrating when you need to get bulk data.

Use the Bulk API to query any number of records from Java. I use it and it performs very well; you get the result within seconds. The String returned is comma-separated. You can also keep batches at or below 10k records and read them either as CSV (using OpenCSV) or directly as a String.
Let me know if you need help with the code.

Latency is going to be a killer for this type of situation - and the solution will be either multi-thread, or asynchronous operations (using NIO). I would start by running 10 worker threads in parallel and see what difference it makes (assuming that the back-end supports simultaneous gets).
I don't have any concrete code or anything I can provide here, sorry - just painful experience with API calls going over high latency networks.
