cassandra high volume writes sometimes silently fail - java

I am recording real-time trade data with the DataStax Cassandra Java driver. I have configured Cassandra with a single node, a replication factor of 1, and consistency level ALL.
I frequently have writes that are not recorded yet do not fail. The Java client does not throw any errors, and the async execute success callback is invoked. The trace doesn't seem to show anything unusual:
[CassandraClient] - Adding to trades memtable on /10.0.0.118[SharedPool-Worker-1] at Mon Dec 22 22:54:04 UTC 2015
[CassandraClient] - Appending to commitlog on /10.0.0.118[SharedPool-Worker-1] at Mon Dec 22 22:54:04 UTC 2015
[CassandraClient] - Coordinator used /10.0.0.118
But when I look at the data in the Cassandra shell, notice the skipped IDs (ignoring the bad dates):
cqlsh:keyspace> select * from trades where [...] order by date desc limit 10;
 date                     | id     | price  | volume
--------------------------+--------+--------+------------
1970-01-17 19:00:19+0000 | 729286 | 435.96 | 3.4410000
1970-01-17 19:00:19+0000 | 729284 | 436.00 | 17.4000000
1970-01-17 19:00:19+0000 | 729283 | 436.00 | 0.1300000
1970-01-17 19:00:19+0000 | 729277 | 436.45 | 5.6972000
1970-01-17 19:00:19+0000 | 729276 | 436.44 | 1.0000000
1970-01-17 19:00:19+0000 | 729275 | 436.44 | 0.9728478
1970-01-17 19:00:19+0000 | 729274 | 436.43 | 0.0700070
1970-01-17 19:00:19+0000 | 729273 | 436.45 | 0.0369260
1970-01-17 19:00:19+0000 | 729272 | 436.43 | 1.0000000
1970-01-17 19:00:19+0000 | 729271 | 436.43 | 1.0000000
Why do some inserts silently fail? Indications point to a timestamp issue, but I don't detect a pattern.
similar question: Cassandra - Write doesn't fail, but values aren't inserted
might be related to: Cassandra update fails silently with several nodes

The fact that the writes succeed while some records end up missing is a symptom of C* overwriting those rows. The reason you may see such behavior is misuse of bound statements.
Usually people prepare the statements with:
PreparedStatement ps = ...;
BoundStatement bs = ps.bind();
then they issue something like:
for (int i = 0; i < myHugeNumberOfRowsToInsert; i++) {
    session.executeAsync(bs.bind(xx)); // re-binds the SAME BoundStatement instance
}
This actually produces the weird behavior, because the same BoundStatement instance is shared across the executeAsync calls: if the loop is fast enough to enqueue (say) 6 queries before the driver has sent the first one, all of the queued queries end up with the same bound data. A simple fix is to issue a distinct BoundStatement for each insert:
for (int i = 0; i < myHugeNumberOfRowsToInsert; i++) {
    session.executeAsync(new BoundStatement(ps).bind(xx)); // fresh BoundStatement per insert
}
This guarantees that each execution carries its own bound values, so the inserts can no longer overwrite one another.
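If it helps, here is a slightly fuller sketch of the per-row binding pattern, assuming the DataStax Java driver 2.x; the trades table, column names and the Trade class are purely illustrative, not taken from the question:
// Sketch only; table, column, and Trade accessor names are hypothetical.
PreparedStatement ps = session.prepare(
        "INSERT INTO trades (date, id, price, volume) VALUES (?, ?, ?, ?)");

for (Trade t : trades) {
    // PreparedStatement.bind(...) returns a fresh BoundStatement on every call,
    // so concurrently queued async inserts can never share bound values.
    BoundStatement bs = ps.bind(t.getDate(), t.getId(), t.getPrice(), t.getVolume());
    session.executeAsync(bs);
}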

Related

Can I run an "explain analyze" on a query using JOOQ?

Can I run explain analyze on a query in JOOQ? like:
explain analyse select some, columns from some_table
but do it using JOOQ on a PostgreSQL database?
I have found the interface org.jooq.Explain and the method DSLContext.explain(Query query) - but it seems to just run a plain EXPLAIN on the query:
@Support({AURORA_MYSQL, AURORA_POSTGRES, H2, HSQLDB, MARIADB, MEMSQL, MYSQL, ORACLE, POSTGRES, SQLITE})
Explain explain(Query query)
Run an EXPLAIN statement in the database to estimate the cardinality of the query.
Is there any sensible way to run an EXPLAIN ANALYZE on the database from the code side?
Yes, you can run EXPLAIN. Example:
SelectWhereStep<ModuldefRecord> where = dsl.selectFrom(MODULDEF);
Explain explain = dsl.explain(where);
System.out.println(explain);
The output looks like this (for Oracle):
+------------------------------------------------------------------------------+
|PLAN_TABLE_OUTPUT |
+------------------------------------------------------------------------------+
|Plan hash value: 3871168833 |
| |
|------------------------------------------------------------------------------|
|| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time ||
|------------------------------------------------------------------------------|
|| 0 | SELECT STATEMENT | | 61303 | 30M| 1305 (1)| 00:00:01 ||
|| 1 | TABLE ACCESS FULL| MODULDEF | 61303 | 30M| 1305 (1)| 00:00:01 ||
|------------------------------------------------------------------------------|
+------------------------------------------------------------------------------+
The Explain result also exposes rows() and cost():
/**
 * The number of rows (cardinality) that is estimated to be returned by the query.
 * <p>
 * This returns {@link Double#NaN} if rows could not be estimated.
 */
double rows();

/**
 * The cost the database associated with the execution of the query.
 * <p>
 * This returns {@link Double#NaN} if cost could not be retrieved.
 */
double cost();
It's not supported yet: https://github.com/jOOQ/jOOQ/issues/10424. Use plain SQL templating instead:
ctx.fetch("explain analyze {0}", select);
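A slightly fuller sketch of that approach, assuming jOOQ 3.x on PostgreSQL, a DSLContext named ctx, and an already-built Select named select:
// Sketch: EXPLAIN ANALYZE via plain SQL templating; {0} is replaced by the
// rendered query together with its bind values.
Result<Record> plan = ctx.fetch("explain analyze {0}", select);

// PostgreSQL returns the plan as rows of text in a single column.
for (Record row : plan) {
    System.out.println(row.get(0, String.class));
}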
For MariaDB I needed to do:
SelectConditionStep<TableNameRecord> select =
        context.selectFrom(Tables.TABLE_NAME)
               .where(filter);

System.out.println(context.fetch("analyze " + select.getSQL(ParamType.INLINED)));
which produced the output:
+----+-----------+----------+-----+-----------------+-----------------+-------+------+----+-------+--------+----------+------------------------+
| id|select_type|table |type |possible_keys |key |key_len|ref |rows|r_rows |filtered|r_filtered|Extra |
+----+-----------+----------+-----+-----------------+-----------------+-------+------+----+-------+--------+----------+------------------------+
| 1|SIMPLE |table_name|range|table_column_name|table_column_name|20 |{null}|1000|1000.00| 100.0| 100.0|Using where; Using index|
+----+-----------+----------+-----+-----------------+-----------------+-------+------+----+-------+--------+----------+------------------------+
If you use context.explain(select) as proposed by another answer, you lose a few columns:
+----+-----------+----------+-----+-----------------+-----------------+-------+------+----+------------------------+
| id|select_type|table |type |possible_keys |key |key_len|ref |rows|Extra |
+----+-----------+----------+-----+-----------------+-----------------+-------+------+----+------------------------+
| 1|SIMPLE |table_name|range|table_column_name|table_column_name|20 |{null}|1000|Using where; Using index|
+----+-----------+----------+-----+-----------------+-----------------+-------+------+----+------------------------+

Java Collections Optimization

I'm working on a Java application that connects to a database to fetch some records, processes each record, and updates the record back in the db table.
Following is my db schema (with sample data):
Table A: Requests
| REQUESTID | STATUS  |
-----------------------
| 1         | PENDING |
| 2         | PENDING |
Table B: RequestDetails
| DETAILID | REQUESTID | STATUS  | USERID |
-------------------------------------------
| 1        | 1         | PENDING | RA1234 |
| 2        | 1         | PENDING | YA7266 |
| 3        | 2         | PENDING | KAJ373 |
Following are my requirements:
1) Fetch the requests with pending status, along with their detail records, from both tables.
I'm using the query below for this:
SELECT Requests.REQUESTID AS "RequestID",
       RequestDetails.USERID AS "UserID",
       RequestDetails.DETAILID AS "DetailID"
FROM Requests Requests
JOIN RequestDetails RequestDetails
  ON (Requests.REQUESTID = RequestDetails.REQUESTID
      AND Requests.STATUS = 'PENDING'
      AND RequestDetails.STATUS = 'PENDING')
2) I'm using a HashMap<String, List<HashMap<String,String>>> to store all the values
3) Iterate over each request, get its details as a List<HashMap<String,String>>, perform the action for each detail record, and update that detail's status
4) After all detail records for a request are processed, update the status of the request in the Requests table
The end state should be something like this:
Table A: Requests
| REQUESTID | STATUS  |
-----------------------
| 1         | PENDING |
| 2         | PENDING |
Table B: RequestDetails
| DETAILID | REQUESTID | STATUS  | USERID |
-------------------------------------------
| 1        | 1         | PENDING | RA1234 |
| 2        | 1         | PENDING | YA7266 |
| 3        | 2         | PENDING | KAJ373 |
My question is: the collection I'm using is quite complex (HashMap<String, List<HashMap<String,String>>>). Is there a more efficient way to do this?
Thank you,
Sash
I think you should use a class, something like:
class RequestDetails {
    int detailId;
    int statusId;
    String status;
    String userId;
}
Instead of the HashMap<String, List<HashMap<String,String>>> map you should use a HashMap<String, List<RequestDetails>>. That has advantages like code simplicity, and when you are working with huge data and need to modify values, it is better not to push everything through the String data type, since String is immutable and repeated modification hurts performance.
Hope this helps.
On top of all that and what Darshan suggested, you should also override hashCode and equals; that is part of the basic contract when working with HashMap, and it can improve performance too.
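A minimal sketch of the suggested restructuring; the field names below follow the table columns shown in the question (DETAILID, REQUESTID, STATUS, USERID) rather than the exact fields in the answer above, and the processing loop is purely illustrative:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Typed record for one row of RequestDetails (Table B).
class RequestDetails {
    final int detailId;
    final int requestId;
    String status;          // e.g. "PENDING"
    final String userId;

    RequestDetails(int detailId, int requestId, String status, String userId) {
        this.detailId = detailId;
        this.requestId = requestId;
        this.status = status;
        this.userId = userId;
    }

    // Identity is the detail id; relevant if instances are used as map keys or in Sets.
    @Override
    public boolean equals(Object o) {
        return o instanceof RequestDetails && ((RequestDetails) o).detailId == detailId;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(detailId);
    }
}

class RequestProcessor {
    public static void main(String[] args) {
        // One list of typed detail records per request id, instead of nested string maps.
        Map<Integer, List<RequestDetails>> detailsByRequest = new HashMap<>();
        detailsByRequest
                .computeIfAbsent(1, k -> new ArrayList<>())
                .add(new RequestDetails(1, 1, "PENDING", "RA1234"));

        detailsByRequest.forEach((requestId, details) -> {
            details.forEach(d -> d.status = "PROCESSED"); // process each detail record
            // ...then update the corresponding row in the Requests table
        });
    }
}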

How to specify uberization of a Hive query in Hadoop2?

There is a new feature in Hadoop 2 called uberization. For example, this reference says:
Uberization is the possibility to run all tasks of a MapReduce job in
the ApplicationMaster's JVM if the job is small enough. This way, you
avoid the overhead of requesting containers from the ResourceManager
and asking the NodeManagers to start (supposedly small) tasks.
What I can't tell is whether this just happens magically behind the scenes, or whether one needs to do something to make it happen. For example, when running a Hive query, is there a setting (or hint) to get this to happen? Can you specify the threshold for what is "small enough"?
Also, I'm having trouble finding much about this concept - does it go by another name?
I found details in the YARN Book by Arun Murthy about "uber jobs":
An Uber Job occurs when multiple mapper and reducers are combined to use a single
container. There are four core settings around the configuration of Uber Jobs found in
the mapred-site.xml options presented in Table 9.3.
Here is table 9.3:
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.enable | Whether to enable the small-jobs "ubertask" optimization, |
| | which runs "sufficiently small" jobs sequentially within a |
| | single JVM. "Small" is defined by the maxmaps, maxreduces, |
| | and maxbytes settings. Users may override this value. |
| | Default = false. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxmaps | Threshold for the number of maps beyond which the job is |
| | considered too big for the ubertasking optimization. |
| | Users may override this value, but only downward. |
| | Default = 9. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxreduces | Threshold for the number of reduces beyond which |
| | the job is considered too big for the ubertasking |
| | optimization. Currently the code cannot support more |
| | than one reduce and will ignore larger values. (Zero is |
| | a valid maximum, however.) Users may override this |
| | value, but only downward. |
| | Default = 1. |
|-----------------------------------+------------------------------------------------------------|
| mapreduce.job.ubertask.maxbytes | Threshold for the number of input bytes beyond |
| | which the job is considered too big for the uber- |
| | tasking optimization. If no value is specified, |
| | `dfs.block.size` is used as a default. Be sure to |
| | specify a default value in `mapred-site.xml` if the |
| | underlying file system is not HDFS. Users may override |
| | this value, but only downward. |
| | Default = HDFS block size. |
|-----------------------------------+------------------------------------------------------------|
I don't know yet if there is a Hive-specific way to set this or if you just use the above with Hive.
An uber job occurs when the mappers and reducers of a job are combined and executed inside the ApplicationMaster itself. So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the ResourceManager (RM) creates an ApplicationMaster and the job runs entirely within the ApplicationMaster, using its own JVM. To enable it (e.g., from the Hive CLI):
SET mapreduce.job.ubertask.enable=TRUE;
So the advantage of an uberized job is that the round-trip overhead of the ApplicationMaster requesting containers for the job from the ResourceManager (RM), and of the RM allocating those containers back to the ApplicationMaster, is eliminated.
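For plain MapReduce jobs (as opposed to going through Hive), the same switches can also be set programmatically on the job configuration. A minimal sketch, assuming the standard org.apache.hadoop.mapreduce API; the threshold values here are only illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberJobExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable the small-jobs "ubertask" optimization (runs the whole job in the AM's JVM).
        conf.setBoolean("mapreduce.job.ubertask.enable", true);
        // Optionally lower the "small enough" thresholds (they may only be overridden downward).
        conf.setInt("mapreduce.job.ubertask.maxmaps", 4);
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
        conf.setLong("mapreduce.job.ubertask.maxbytes", 64L * 1024 * 1024);

        Job job = Job.getInstance(conf, "small-uber-job");
        // ...configure mapper, reducer, input and output paths as usual...
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}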

How to process columns of an SQLite table in Java android?

I have an SQLite table like:
+---+-------------+----------------+
|_id| lap_time_ms |formatted_elapse|
+---+-------------+----------------+
| 1 |        5600 |        00:05.6 |
| 2 |        4612 |        00:04.6 |
| 3 |        4123 |        00:04.1 |
| 4 |       15033 |        00:15.0 |
| 5 |        4523 |        00:04.5 |
| 6 |        6246 |        00:06.2 |
+---+-------------+----------------+
Where lap_time_ms is of type long and represents the lap time in milliseconds, while formatted_elapse is a String that represents the formatted (displayable) form of the first column.
My question is that if I want to add (say) 5 seconds (5000) to each lap_time_ms then I use a statement like:
DB.execSQL("update table_name set KEY_ELAPSE=KEY_ELAPSE+5000");
Which works fine; however, the problem is that formatted_elapse still retains its outdated value!
So, what is the best way to update the values in the formatted_elapse column if I have a function like:
public static String getFormattedTime(long milliseconds) {
    // custom code that processes the long
    return processedString;
}
It may be a long shot (metaphorically speaking of course ;) but is it possible to have SQLite link the two columns such that if I update a lap_time_ms row, the formatted_elapse will automatically update appropriately.
Thanks in advance!
In theory, it would be possible to create a trigger to update that column, but only if the formatting can be done with some built-in SQLite function (Android does not allow user-defined functions):
CREATE TRIGGER update_formatted_elapse
AFTER UPDATE OF lap_time_ms ON MyTable
FOR EACH ROW
BEGIN
    UPDATE MyTable
    -- lap_time_ms is in milliseconds; 'unixepoch' expects seconds, hence / 1000.0
    SET formatted_elapse = strftime('%M:%f', NEW.lap_time_ms / 1000.0, 'unixepoch')
    WHERE _id = NEW._id;
END;
However, it would be bad design to store the formatted string in the database; it would be duplicated information that is in danger of becoming inconsistent.
Drop the formatted_elapse column and just call getFormattedTime in your code whenever you need it.
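A minimal sketch of that approach on Android, assuming a SQLiteDatabase named db, the MyTable/lap_time_ms names used above, and the question's getFormattedTime helper:
// Sketch: derive the display string at read time instead of storing it.
Cursor c = db.rawQuery("SELECT _id, lap_time_ms FROM MyTable ORDER BY _id", null);
try {
    while (c.moveToNext()) {
        long lapTimeMs = c.getLong(c.getColumnIndexOrThrow("lap_time_ms"));
        String display = getFormattedTime(lapTimeMs); // never stale
        // bind "display" to the list row / view here
    }
} finally {
    c.close();
}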

Lucene 4.x performance issues

Over the last few weeks I've been working on upgrading an application from Lucene 3.x to Lucene 4.x in hopes of improving performance. Unfortunately, after going through the full migration process and playing with all sorts of tweaks I found online and in the documentation, Lucene 4 is running significantly slower than Lucene 3 (~50%). I'm pretty much out of ideas at this point, and was wondering if anyone else had any suggestions on how to bring it up to speed. I'm not even looking for a big improvement over 3.x anymore; I'd be happy to just match it and stay on a current release of Lucene.
<Edit>
In order to confirm that none of the standard migration changes had a negative effect on performance, I ported my Lucene 4.x version back to Lucene 3.6.2 and kept the newer API rather than using the custom ParallelMultiSearcher and other deprecated methods/classes.
Performance in 3.6.2 is even faster than before:
Old application (Lucene 3.6.0) - ~5700 requests/min
Updated application with new API and some minor optimizations (Lucene 4.4.0) - ~2900 requests/min
New version of the application ported back, but retaining optimizations and newer IndexSearcher/etc API (Lucene 3.6.2) - ~6200 requests/min
Since the optimizations and use of the newer Lucene API actually improved performance on 3.6.2, it doesn't make sense for this to be a problem with anything but Lucene. I just don't know what else I can change in my program to fix it.
</Edit>
Application Information
We have one index that is broken into 20 shards - this provided the best performance in both Lucene 3.x and Lucene 4.x
The index currently contains ~150 million documents, all of which are fairly simple and heavily normalized so there are a lot of duplicate tokens. Only one field (an ID) is stored - the others are not retrievable.
We have a fixed set of relatively simple queries that are populated with user input and executed - they are comprised of multiple BooleanQueries, TermQueries and TermRangeQueries. Some of them are nested, but only a single level right now.
We're not doing anything advanced with results - we just fetch the scores and the stored ID fields
We're using MMapDirectories pointing to index files in a tmpfs. We played with the useUnmap "hack" since we don't open new Directories very often and got a nice boost from that
We're using a single IndexSearcher for all queries
Our test machines have 94GB of RAM and 64 logical cores
General Processing
1) Request received by socket listener
2) Up to 4 Query objects are generated and populated with normalized user input (all of the required input for a query must be present or it won't be executed)
3) Queries are executed in parallel using the Fork/Join framework
Subqueries to each shard are executed in parallel using the IndexSearcher w/ExecutorService
4) Aggregation and other simple post-processing
Other Relevant Info
Indexes were recreated for the 4.x system, but the data is the same. We tried the normal Lucene42 codec as well as an extended one that didn't use compression (per a suggestion on the web)
In 3.x we used a modified version of the ParallelMultiSearcher; in 4.x we're using the IndexSearcher with an ExecutorService and combining all of our readers in a MultiReader (a minimal sketch of this setup follows this list)
In 3.x we used a ThreadPoolExecutor instead of Fork/Join (Fork/Join performed better in my tests)
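As referenced in the list above, here is a minimal sketch of the 4.x searcher setup being described (Lucene 4.4 API; shardDirectories, the pool size, and query are illustrative assumptions):
// Sketch only: one MultiReader over all shard readers, searched in parallel.
ExecutorService pool = Executors.newFixedThreadPool(64);

List<IndexReader> shardReaders = new ArrayList<>();
for (Directory dir : shardDirectories) {       // one Directory per shard (assumed)
    shardReaders.add(DirectoryReader.open(dir));
}
MultiReader multiReader = new MultiReader(shardReaders.toArray(new IndexReader[0]));

// With an ExecutorService, IndexSearcher submits one task per *segment*,
// not one per shard - exactly the behavior discussed in the answer at the end of this post.
IndexSearcher searcher = new IndexSearcher(multiReader, pool);
TopDocs results = searcher.search(query, 100);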
4.x Hot Spots
Method | Self Time (%) | Self Time (ms) | Self Time (CPU in ms)
java.util.concurrent.CountDownLatch.await() | 11.29% | 140887.219 | 0.0 <- this is just from TCP threads waiting for the real work to finish - you can ignore it
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.<init>() | 9.74% | 121594.03 | 121594
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.<init>() | 9.59% | 119680.956 | 119680
org.apache.lucene.codecs.lucene41.ForUtil.readBlock() | 6.91% | 86208.621 | 86208
org.apache.lucene.search.DisjunctionScorer.heapAdjust() | 6.68% | 83332.525 | 83332
java.util.concurrent.ExecutorCompletionService.take() | 5.29% | 66081.499 | 6153
org.apache.lucene.search.DisjunctionSumScorer.afterNext() | 4.93% | 61560.872 | 61560
org.apache.lucene.search.TermScorer.advance() | 4.53% | 56530.752 | 56530
java.nio.DirectByteBuffer.get() | 3.96% | 49470.349 | 49470
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.<init>() | 2.97% | 37051.644 | 37051
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getFrame() | 2.77% | 34576.54 | 34576
org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo() | 2.47% | 30767.711 | 30767
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.newTermState() | 2.23% | 27782.522 | 27782
java.net.ServerSocket.accept() | 2.19% | 27380.696 | 0.0
org.apache.lucene.search.DisjunctionSumScorer.advance() | 1.82% | 22775.325 | 22775
org.apache.lucene.search.HitQueue.getSentinelObject() | 1.59% | 19869.871 | 19869
org.apache.lucene.store.ByteBufferIndexInput.buildSlice() | 1.43% | 17861.148 | 17861
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.getArc() | 1.35% | 16813.927 | 16813
org.apache.lucene.search.DisjunctionSumScorer.countMatches() | 1.25% | 15603.283 | 15603
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs() | 1.12% | 13929.646 | 13929
java.util.concurrent.locks.ReentrantLock.lock() | 1.05% | 13145.631 | 8618
org.apache.lucene.util.PriorityQueue.downHeap() | 1.00% | 12513.406 | 12513
java.util.TreeMap.get() | 0.89% | 11070.192 | 11070
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.docs() | 0.80% | 10026.117 | 10026
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.decodeMetaData() | 0.62% | 7746.05 | 7746
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.iterator() | 0.60% | 7482.395 | 7482
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact() | 0.55% | 6863.069 | 6863
org.apache.lucene.store.DataInput.clone() | 0.54% | 6721.357 | 6721
java.nio.DirectByteBufferR.duplicate() | 0.48% | 5930.226 | 5930
org.apache.lucene.util.fst.ByteSequenceOutputs.read() | 0.46% | 5708.354 | 5708
org.apache.lucene.util.fst.FST.findTargetArc() | 0.45% | 5601.63 | 5601
org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readTermsBlock() | 0.45% | 5567.914 | 5567
org.apache.lucene.store.ByteBufferIndexInput.toString() | 0.39% | 4889.302 | 4889
org.apache.lucene.codecs.lucene41.Lucene41SkipReader.<init>() | 0.33% | 4147.285 | 4147
org.apache.lucene.search.TermQuery$TermWeight.scorer() | 0.32% | 4045.912 | 4045
org.apache.lucene.codecs.MultiLevelSkipListReader.<init>() | 0.31% | 3890.399 | 3890
org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock() | 0.31% | 3886.194 | 3886
If there's any other information you could use that might help, please let me know.
For anyone who cares or is trying to do something similar (controlled parallelism within a query), the problem I had was that the IndexSearcher was creating a task per segment per shard rather than a task per shard - I misread the javadoc.
I resolved the problem by using forceMerge(1) on my shards to limit the number of extra threads. In my use case this isn't a big deal since I don't currently use NRT search, but it still adds unnecessary complexity to the update + slave synchronization process, so I'm looking into ways to avoid the forceMerge.
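For reference, a minimal sketch of that forceMerge step on one shard (Lucene 4.4 API; shardDirectory and analyzer are assumed to be set up elsewhere):
// Sketch: collapse a shard to a single segment so that the
// IndexSearcher + ExecutorService combination spawns one task per shard.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44, analyzer);
IndexWriter writer = new IndexWriter(shardDirectory, iwc);
try {
    writer.forceMerge(1); // expensive, and awkward alongside frequent updates / NRT
} finally {
    writer.close();
}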
As a quick fix, I'll probably just extend the IndexSearcher and have it spawn a thread per reader instead of a thread per segment, but the idea of a "virtual segment" was brought up in the Lucene mailing list. That would be a much better long-term fix.
If you want to see more info, you can follow the lucene mailing list thread here:
http://www.mail-archive.com/java-user@lucene.apache.org/msg42961.html
