How to selectively log execution details of MongoDB operations? - java

Using the latest Java sync driver, I would like to log operation execution details only for operations that take longer than expected. For example, I have a find operation for which I would like to collect execution details whenever it takes longer than 10 ms.
Document doc = collection.find(query).explain();
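explain() forces an extra round trip and returns the query plan rather than the time your call actually took. If the goal is first to find the slow operations, one option is a command listener on the client. The sketch below is a minimal, hedged example (it assumes the 4.x sync driver; the class name, connection string and the 10 ms threshold are illustrative): it logs only the commands whose driver-measured round-trip time exceeds the threshold, so explain() can then be run selectively on the offenders.

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.event.CommandListener;
import com.mongodb.event.CommandSucceededEvent;
import java.util.concurrent.TimeUnit;

public class SlowOperationLogger implements CommandListener {
    private static final long THRESHOLD_MS = 10;   // only log operations slower than 10 ms

    // commandStarted and commandFailed have default no-op implementations in the 4.x driver;
    // override them as well if you are on an older driver version.
    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        long elapsedMs = event.getElapsedTime(TimeUnit.MILLISECONDS);
        if (elapsedMs > THRESHOLD_MS) {
            // getCommandName() is e.g. "find", "aggregate", "update"
            System.out.printf("slow %s command took %d ms%n", event.getCommandName(), elapsedMs);
        }
    }

    // Register the listener once, when the client is built:
    public static MongoClient buildClient(String uri) {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString(uri))
                .addCommandListener(new SlowOperationLogger())
                .build();
        return MongoClients.create(settings);
    }
}

The listener only sees the per-command server round trip, not client-side time such as cursor iteration, so it is a coarse filter rather than a full profile.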

Related

Why is there so much time between took_millis and the timeout in an Elasticsearch slow query?

I am seeing some slow queries in the production environment, so I configured the slow log to capture their details, like this (a query_string query with a 500 ms timeout):
[2021-06-21T10:43:33,930][DEBUG][index.search.slowlog.query.xxxx] [xxx][g][2] took[1s], took_millis[1043], total_hits[424690], types[top], stats[], search_type[QUERY_THEN_FETCH], total_shards[6], source[{query_string}]
In this case the query timeout is 500 ms, but the took_millis in the response is 1043 ms.
As far as I know, the timeout only applies to the query phase, and the took value represents the execution time inside Elasticsearch, excluding some external phases (see Query timing: ‘took’ value and what I’m measuring). I have two questions:
Firstly, why is there a 543 ms gap (1043 - 500 = 543) between the timeout and took_millis?
Secondly, how can I find out in detail where the time between the timeout and took_millis was spent?
Thanks a lot!
Setting a timeout on a query doesn't ensure that the query is actually cancelled when its execution time surpasses that timeout. The Elasticsearch documentation states:
"By default, a running search only checks if it is cancelled or not on
segment boundaries, therefore the cancellation can be delayed by large
segments."
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search.html#global-search-timeout
Check issues 3627, 4586 and 2929
This can explain the 543 ms between the timeout and took_millis: your query simply takes that long, and it is not cancelled in time.
To analyze the query execution and see what might be causing these long delays, you can rerun the query with the profile API. Note that if the slow execution of your query cannot be reproduced, this won't help you solve the issue. If your query runs fine most of the time, try to correlate these slow-running queries with external factors such as server load.
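For reference, a minimal sketch of re-running the query with profiling enabled from Java (this assumes the High Level REST Client classes SearchSourceBuilder, SearchRequest and RestHighLevelClient; the index name, query string and the client variable are placeholders, and profiling adds its own overhead):

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.queryStringQuery("<the slow query_string here>"))
        .profile(true);                                  // ask Elasticsearch for a per-phase timing breakdown
SearchRequest request = new SearchRequest("my-index").source(source);
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// One ProfileShardResult per shard, with timings for each query and collector stage
Map<String, ProfileShardResult> profile = response.getProfileResults();

As noted above, this only helps if the slowness reproduces; profiled timings from a fast run will not explain the occasional 1 s execution.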

Should I care about DynamoDB stream shards if I process stream events with Lambda?

The DynamoDB documentation says that a stream consists of shards, that the shards have to be iterated first, and that the records are then retrieved per shard.
The documentation also says:
(If you use the DynamoDB Streams Kinesis Adapter, this is handled for you: Your application will process the shards and stream records in the correct order, and automatically handle new or expired shards, as well as shards that split while the application is running. For more information, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.)
OK, but I use Lambda, not Kinesis (or are the two related to each other?). If a Lambda function is attached to a DynamoDB stream, should I care about shards or not? Or should I just write the Lambda code and expect the AWS environment to pass batches of records to that Lambda?
When using Lambda to consume a DynamoDB stream, the work of polling the API and keeping track of shards is all handled for you automatically. If your table has multiple shards, multiple Lambda invocations will run in parallel. From your perspective as a developer, you just have to write the code for your Lambda function and the rest is taken care of for you.
In-order processing is still guaranteed by DynamoDB Streams, so with a single shard only one instance of your Lambda function will be invoked at a time. However, with multiple shards you may see multiple instances of your Lambda function running at the same time. This fan-out is transparent and may cause issues or lead to surprising behavior if you are not aware of it while coding your Lambda function.
For a deeper explanation of how this works I'd recommend the YouTube video AWS re:Invent 2016: Real-time Data Processing Using AWS Lambda (SVR301). While the focus is mostly on Kinesis Streams the same concepts for consuming DynamoDB Streams apply as the technology is nearly identical.
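To make that concrete, here is a minimal, hedged handler sketch (it assumes the aws-lambda-java-core and aws-lambda-java-events libraries; the class name is illustrative). Notice that the code only ever sees batches of records; shard iteration and checkpointing never appear in it:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

public class StreamHandler implements RequestHandler<DynamodbEvent, Void> {
    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        // The Lambda service polls the stream shards and hands over one batch per invocation
        for (DynamodbStreamRecord record : event.getRecords()) {
            // getEventName() is INSERT, MODIFY or REMOVE; getDynamodb() carries the keys
            // and whatever old/new images you configured on the stream
            context.getLogger().log(record.getEventName() + ": " + record.getDynamodb().getKeys());
        }
        return null;   // throwing instead would cause the batch to be retried
    }
}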
We use DynamoDB to process close to a billion records every day, auto-expire those records and send them to streams.
Everything is taken care of by AWS and we don't need to do anything, except configuring the stream (which type of image you want) and adding triggers.
The only fine-tuning we did:
As the data volume grew, we simply increased the batch size so records are processed faster and the number of calls to Lambda is reduced (a sketch of doing this through the SDK follows below).
If you are using any external process to iterate over the stream, you might need to do the same.
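The batch size lives on the event source mapping, not in the function code. A hedged sketch with the AWS SDK for Java v1 (classes from com.amazonaws.services.lambda; the mapping UUID and the value 500 are placeholders, and the console or the aws lambda update-event-source-mapping CLI command does the same thing):

AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();
lambda.updateEventSourceMapping(new UpdateEventSourceMappingRequest()
        .withUUID("<event-source-mapping-uuid>")   // the mapping between the stream and the function
        .withBatchSize(500));                      // more records per invocation, fewer invocations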
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Hope it helps.

Getting desirable JMeter reports from java code

Currently I'm struggling to get the JMeter reports I want from Java code.
My goal is to have latency and throughput logged to a file for each transaction, and then a summary per scenario with average, max and min values for latency and throughput.
Currently I have this code for reports:
ResultCollector csvlogger = new ResultCollector(summer);
csvlogger.setFilename(csvLogFile);
testPlanTree.add(testPlanTree.getArray()[0], csvlogger);
But this way it only logs information for a single transaction, there is no throughput, and the reported latency is simply 0.
It looks like this:
timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success,failureMessage,bytes,sentBytes,grpThreads,allThreads,Latency,IdleTime,Connect
2017/06/28 08:53:49.276,1014,Jedis Sampler,200,OK,Jedis Thread Group 1-1,text,true,,0,0,1,1,0,0,0
Does anyone know how I can tune this?
Thanks!
Only one transaction: the .jtl log file contains the execution of a single sampler; try adding more threads and/or loops at the Thread Group level and you should see more results.
Latency is always zero for scripting-based samplers; you need to explicitly call the SampleResult.setLatency() method and set the desired value.
Throughput is not recorded, it is calculated. You need to open the .jtl results file with, for example, an Aggregate Report or Summary Report listener to see the generated value. Take a look at the org.apache.jmeter.util.Calculator class source for the details if you prefer a programmatic, non-GUI approach (a sketch follows below).
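For the non-GUI route, a hedged sketch (class and variable names are illustrative; check the Calculator method signatures against your JMeter version): it extends the ResultCollector so it keeps writing the CSV log but also feeds every sample into a Calculator and prints per-run min/max/mean elapsed times and the calculated throughput when the run ends.

import org.apache.jmeter.reporters.ResultCollector;
import org.apache.jmeter.reporters.Summariser;
import org.apache.jmeter.samplers.SampleEvent;
import org.apache.jmeter.util.Calculator;

public class SummarisingCollector extends ResultCollector {
    private final Calculator calc = new Calculator("scenario");

    public SummarisingCollector(Summariser summer) {
        super(summer);
    }

    @Override
    public void sampleOccurred(SampleEvent event) {
        super.sampleOccurred(event);          // keep writing the per-transaction CSV line
        calc.addSample(event.getResult());    // accumulate for the end-of-run summary
    }

    @Override
    public void testEnded(String host) {
        super.testEnded(host);
        System.out.println("samples=" + calc.getCount()
                + " mean=" + calc.getMean() + " ms"
                + " min=" + calc.getMin() + " ms"
                + " max=" + calc.getMax() + " ms"
                + " throughput=" + calc.getRate() + "/s");
    }
}

It drops into the existing code in place of the plain ResultCollector:
SummarisingCollector csvlogger = new SummarisingCollector(summer);
csvlogger.setFilename(csvLogFile);
testPlanTree.add(testPlanTree.getArray()[0], csvlogger);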

Delta between query execution time and Java query call to finish

Context
Our container cluster is located in us-east1-c
We are using the following Java library: google-cloud-bigquery, 0.9.2-beta
Our dataset has around 26M rows and represents ~10G
All of our queries return less than 100 rows as we are always grouping on a specific column
Question
We analyzed the last 100 queries executed in BigQuery; these were all executed in about 2-3 seconds (we measured this by calling bq --format=prettyjson show -j JOBID and computing end time - creation time).
In our Java logs, though, most of the calls to bigquery.query block for 5-6 seconds (and 10 seconds is not out of the ordinary). What could explain the systematic gap between the query finishing in the BigQuery cluster and the results being available in Java? I know 5-6 seconds isn't astronomical, but I am curious whether this is normal behaviour when using the google-cloud-bigquery Java library.
I didn't dig to the point where I analyzed the outbound call using Wireshark. All our tests were executed in our container cluster (Kubernetes).
Code
QueryRequest request = QueryRequest.newBuilder(sql)
        .setMaxWaitTime(30000L)
        .setUseLegacySql(false)
        .setUseQueryCache(false)
        .build();
QueryResponse response = bigquery.query(request);
Thank you
Just looking at the code briefly here:
https://github.com/GoogleCloudPlatform/google-cloud-java/blob/master/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/BigQueryImpl.java
It appears that there are multiple potential sources of delay:
Getting query results
Restarting (there are some automatic restarts in there that can explain the delay spikes)
The frequency of checking for new results
It sounds like looking at Wireshark would give you a precise answer of what is happening.
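One way to narrow it down without Wireshark is to compare the client-side wall time against the job timings the server reports. A hedged sketch (this uses the current google-cloud-bigquery API rather than the 0.9.2-beta one in the question; variable names are illustrative):

QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
        .setUseLegacySql(false)
        .setUseQueryCache(false)
        .build();

long clientStart = System.currentTimeMillis();
Job job = bigquery.create(JobInfo.of(config));
job = job.waitFor();                               // polls the job until it is done
TableResult result = job.getQueryResults();        // fetches the (small) result set
long clientElapsed = System.currentTimeMillis() - clientStart;

JobStatistics.QueryStatistics stats = job.getStatistics();
long serverElapsed = stats.getEndTime() - stats.getCreationTime();   // what bq show -j reports
System.out.println("client wall time: " + clientElapsed + " ms, server job time: " + serverElapsed + " ms");

If serverElapsed stays at 2-3 s while clientElapsed sits at 5-6 s, the gap is in job polling, restarts and result fetching on the client side rather than in query execution.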

Async writes seem to be broken in Cassandra

I have had issues with the spark-cassandra-connector (1.0.4, 1.1.0) when writing batches of 9 million rows to a 12-node Cassandra (2.1.2) cluster. I was writing with consistency ALL and reading with consistency ONE, but the number of rows read was different from 9 million every time (8,865,753, 8,753,213, etc.).
I've checked the code of the connector and found no issues. Then I decided to write my own application, independent of Spark and the connector, to investigate the problem (the only dependency is datastax-driver-core version 2.1.3).
The full code, the startup scripts and the configuration files can now be found on github.
In pseudo-code, I wrote two different versions of the application. The sync one:
try (Session session = cluster.connect()) {
    String cql = "insert into <<a table with 9 normal fields and 2 collections>>";
    PreparedStatement pstm = session.prepare(cql);
    for (String partitionKey : keySource) {
        // keySource is an Iterable<String> of partition keys
        BoundStatement bound = pstm.bind(partitionKey /*, << plus the other parameters >> */);
        bound.setConsistencyLevel(ConsistencyLevel.ALL);
        session.execute(bound);
    }
}
And the async one:
try (Session session = cluster.connect()) {
    List<ResultSetFuture> futures = new LinkedList<ResultSetFuture>();
    String cql = "insert into <<a table with 9 normal fields and 2 collections>>";
    PreparedStatement pstm = session.prepare(cql);
    for (String partitionKey : keySource) {
        // keySource is an Iterable<String> of partition keys
        while (futures.size() >= 10 /* Max 10 concurrent writes */) {
            // Wait for the first issued write to terminate
            ResultSetFuture future = futures.get(0);
            future.get();
            futures.remove(0);
        }
        BoundStatement bound = pstm.bind(partitionKey /*, << plus the other parameters >> */);
        bound.setConsistencyLevel(ConsistencyLevel.ALL);
        futures.add(session.executeAsync(bound));
    }
    while (futures.size() > 0) {
        // Wait for the other write requests to terminate
        ResultSetFuture future = futures.get(0);
        future.get();
        futures.remove(0);
    }
}
The latter is similar to the logic used by the connector in the no-batch configuration.
The two versions of the application work the same in all circumstances, except when the load is high.
For instance, when running the sync version with 5 threads on 9 machines (45 threads) writing 9 million rows to the cluster, I find all the rows in the subsequent read (with the spark-cassandra-connector).
If I run the async version with 1 thread per machine (9 threads), the execution is much faster, but I cannot find all the rows in the subsequent read (the same problem that arose with the spark-cassandra-connector).
No exception was thrown by the code during the executions.
What could be the cause of the issue ?
Here are some additional results (thanks for the comments):
Async version with 9 threads on 9 machines, with 5 concurrent writers per thread (45 concurrent writers): no issues
Sync version with 90 threads on 9 machines (10 threads per JVM instance): no issues
Issues seemed to start arising with async writes and a number of concurrent writers > 45 and <= 90, so I ran other tests to make sure the finding was right:
Replaced the "get" method of ResultSetFuture with "getUninterruptibly": same issues.
Async version with 18 threads on 9 machines, with 5 concurrent writers per thread (90 concurrent writers): no issues.
The last finding shows that the high number of concurrent writers (90) is not the issue, as the first tests had led me to expect. The problem is the high number of async writes using the same session.
With 5 concurrent async writes on the same session the issue is not present. If I increase to 10 the number of concurrent writes, some operations get lost without notification.
It seems that the async writes are broken in Cassandra 2.1.2 (or the Cassandra Java driver) if you issue multiple (>5) writes concurrently on the same session.
Nicola and I communicated over email this weekend and thought I'd provide an update here with my current theory. I took a look at the github project Nicola shared and experimented with an 8 node cluster on EC2.
I was able to reproduce the issue with 2.1.2, but did observe that after a period of time I could re-execute the spark job and all 9 million rows were returned.
What I seemed to notice was that while nodes were under compaction I did not get all 9 million rows. On a whim I took a look at the change log for 2.1 and observed an issue CASSANDRA-8429 - "Some keys unreadable during compaction" that may explain this problem.
Seeing that the issue has been fixed and is targeted for 2.1.3, I reran the test against the cassandra-2.1 branch, ran the count job while compaction activity was happening, and got 9 million rows back.
I'd like to experiment with this some more since my testing with the cassandra-2.1 branch was rather limited and the compaction activity may have been purely coincidental, but I'm hoping this may explain these issues.
A few possibilities:
Your async example is issuing 10 writes at a time with 9 threads, so 90 at a time, while your sync example is only doing 45 writes at a time, so I would try cutting the async version down to the same rate so it's an apples-to-apples comparison (a throttled version is sketched below, after these points).
You don't say how you're checking for exceptions with the async approach. I see you are using future.get(), but it is recommended to use getUninterruptibly() as noted in the documentation:
Waits for the query to return and return its result. This method is
usually more convenient than Future.get() because it: Waits for the
result uninterruptibly, and so doesn't throw InterruptedException.
Returns meaningful exceptions, instead of having to deal with
ExecutionException. As such, it is the preferred way to get the future
result.
So perhaps you're not seeing write exceptions that are occurring with your async example.
Another unlikely possibility is that your keySource is for some reason returning duplicate partition keys, so when you do the writes, some of them end up overwriting a previously inserted row and don't increase the row count. But that should impact the sync version too, so that's why I say it's unlikely.
I would try writing smaller sets than 9 million and at a slow rate and see if the problem only starts to happen at a certain number of inserts or certain rate of inserts. If the number of inserts has an impact, then I'd suspect something is wrong with the row keys in the data. If the rate of inserts has an impact, then I'd suspect hot spots causing write timeout errors.
One other thing to check would be the Cassandra log file, to see if there are any exceptions being reported there.
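Putting the first two points together, here is a hedged sketch of the async loop that both caps in-flight writes per session and surfaces failed writes. It keeps the placeholders from the question; java.util.concurrent.Semaphore and Guava's Futures/FutureCallback are the only additions, and the two-argument Futures.addCallback used here is the one available in the Guava version bundled with the 2.1-era driver.

final Semaphore inFlight = new Semaphore(5);          // max concurrent async writes on this session
try (Session session = cluster.connect()) {
    String cql = "insert into <<a table with 9 normal fields and 2 collections>>";
    PreparedStatement pstm = session.prepare(cql);
    for (String partitionKey : keySource) {
        inFlight.acquire();                           // blocks while 5 writes are already pending
        BoundStatement bound = pstm.bind(partitionKey /*, << plus the other parameters >> */);
        bound.setConsistencyLevel(ConsistencyLevel.ALL);
        Futures.addCallback(session.executeAsync(bound), new FutureCallback<ResultSet>() {
            public void onSuccess(ResultSet rs) { inFlight.release(); }
            public void onFailure(Throwable t) {
                inFlight.release();
                System.err.println("write failed: " + t);   // a failed write is no longer silent
            }
        });
    }
    inFlight.acquire(5);                              // wait for the last in-flight writes to finish
}

acquire() throws InterruptedException, which is left unhandled here just as get() was in the original pseudo-code.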
Addendum: 12/30/14
I tried to reproduce the symptom using your sample code with Cassandra 2.1.2 and driver 2.1.3. I used a single table with a key of an incrementing number so that I could see gaps in the data. I did a lot of async inserts (30 at a time per thread in 10 threads all using one global session).

Then I did a "select count (*)" of the table, and indeed it reported fewer rows in the table than expected. Then I did a "select *" and dumped the rows to a file and checked for missing keys. They seemed to be randomly distributed, but when I queried for those missing individual rows, it turned out they were actually present in the table. Then I noticed every time I did a "select count (*)", it came back with a different number, so it seems to be giving an approximation of the number of rows in the table rather than the actual number.
So I revised the test program to do a read back phase after all the writes, since I know all the key values. When I did that, all the async writes were present in the table.
So my question is, how are you checking the number of rows that are in your table after you finish writing? Are you querying for each individual key value or using some kind of operation like "select *"? If the latter, that seems to give most of the rows, but not all of them, so perhaps your data is actually present. Since no exceptions are being thrown, it seems to suggest that the writes are all successful. The other question would be, are you sure your key values are unique for all 9 million rows.
