I'm trying to diagnose an issue where our ElasticSearch search queue seemingly randomly fills up.
The behavior we observe in our monitoring is that on one node of the cluster the search queue grows (just that one node), and once the search thread pool is used up we of course start getting timeouts. There seems to be one query that is blocking the whole thing. The only way for us to resolve the problem at the moment is to restart the node.
You can see the relevant behavior in the charts below: first the queue size, then the pending cluster tasks (to show that no other operations, e.g. index operations, are blocking or queuing up), and finally the active threads for the search thread pool. The spike at 11 o'clock is the restart of the node.
The log files on all nodes show no entries for an hour before or after the issue, until we restarted the node. There are only garbage collection events of around 200-600 ms, and only one of those on the relevant node, around 20 minutes before the event.
My questions:
- how can I debug this as there is no information logged anywhere on a failing or timing out query?
- what are possible reasons for this? We don't have dynamic queries or anything similar
- can I set a query timeout or clear / reset active searches when this happens to prevent a node restart?
Some more details, based on the questions asked so far (these rule a few things out):
exactly same hardware (16 cores, 60GB mem)
same config, no special nodes
no swap enabled
nothing noticeable on other metrics like IO or CPU
not a master node
no special shards, three shards per node; pretty standard queries; all queries sent to ES in the 10 minutes before the event are queries that typically finish within 5-10 ms, and the ones we get timeouts on are the same kind; no increase in query rate or anything else
we have 5 nodes for this deployment, all accessed round robin
we have a slow log of 2 seconds on info level, no entries
The hot threads after 1 minute of queue build-up are at https://gist.github.com/elm-/5ed398054ea6b46522c0, several snapshots of the dump taken a few moments apart.
That's a very open-ended investigation, as there can be many things at fault. A rogue query is the most obvious suspect, but then the question is why the other nodes are not affected. The most relevant clue, in my opinion, is why that node is so special.
Things to look at:
compare hardware specs between nodes
compare configuration settings. See if this node stands out with something different.
look at swapping on all nodes, if swapping is enabled. Check mlockall to see if it's set to true.
in your monitoring tool correlate the queue size increasing with other things: memory usage, CPU usage, disk IOPS, GCs, indexing rate, searching rate
is this node the master node when that queue fill-up is happening?
look at the shards distribution: is there any "special" shard(s) on this node that stands out? Correlate that with the queries you usually run. Maybe routing is in play here.
are you sending the queries to the same node, or do you round-robin query execution across all nodes?
try to enable slowlogs and decrease the threshold to catch the allegedly problematic query (if there is one); see the sketch below
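For reference, lowering the search slowlog thresholds is just a dynamic index settings update. Here is a minimal sketch using plain Java against the REST API (the host, the index name my_index, and the threshold values are assumptions, not taken from the setup above):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LowerSlowlogThresholds {
    public static void main(String[] args) throws Exception {
        // Dynamic index settings update: lower the search slowlog thresholds so
        // that even moderately slow queries show up while hunting the bad one.
        String body = "{"
            + "\"index.search.slowlog.threshold.query.warn\": \"500ms\","
            + "\"index.search.slowlog.threshold.query.info\": \"100ms\","
            + "\"index.search.slowlog.threshold.fetch.warn\": \"500ms\""
            + "}";

        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://localhost:9200/my_index/_settings").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}

Remember to raise the thresholds again once the culprit is found, otherwise the slowlog itself becomes noisy.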
Andrei Stefan's answer isn't wrong, but I'd start by looking at the hot_threads from the clogged-up node rather than trying to figure out what might be special about the node.
I don't know of a way for you to look inside the queue. Slowlogs, like Andrei says, are a great idea though.
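If you want hot_threads snapshots captured automatically while the queue is growing, polling the nodes API in a small loop is enough. A rough sketch (the node name node-1, host, and sampling interval are assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HotThreadsSampler {
    public static void main(String[] args) throws Exception {
        // Take several hot_threads snapshots of the suspect node, a few seconds
        // apart, so there is something to compare once the queue starts filling.
        String endpoint = "http://localhost:9200/_nodes/node-1/hot_threads?threads=10";
        for (int i = 0; i < 5; i++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                System.out.println("=== snapshot " + i + " ===");
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            Thread.sleep(5000);
        }
    }
}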
I'm using IgniteDataStreamer with allowOverwrite to load continuous data.
Question 1.
From javadoc:
Note that streamer will stream data concurrently by multiple internal threads, so the data may get to remote nodes in different order from which it was added to the streamer.
Reordering is not acceptable in my case. Will setting perNodeParallelOperations to 1 guarantee that the order of addData calls is kept? There are a number of caches being loaded simultaneously with IgniteDataStreamer, so the Ignite server node threads will all be utilized anyway.
Question 2.
My streaming application can hang for a couple of seconds due to GC pauses. I want to avoid pausing cache loading at those moments and keep the average cache write speed high. Is it possible to configure IgniteDataStreamer to keep a (bounded) queue of incoming batches on the server node, which would be consumed while the streaming (client) app hangs? See question 1: the queue should be consumed sequentially. It's OK to use some heap for it.
Question 3.
perNodeBufferSize javadoc:
This setting controls the size of internal per-node buffer before buffered data is sent to remote node
According to the javadoc, data transfer is triggered by tryFlush / flush / autoFlush, so how does that relate to the perNodeBufferSize limit? Would flush be ignored if there are fewer than perNodeBufferSize messages buffered (I hope not)?
I don't recommend trying to avoid reordering in DataStreamer, but if you absolutely need to do that, you will also need to set the data streamer pool size to 1 on server nodes. If it's larger, the data is split into stripes and not sent sequentially.
DataStreamer is designed for throughput, not latency. So there's not much you can do here. Increasing perThreadBufferSize, perhaps?
Data transfer is automatically started when perThreadBufferSize is reached for any stripe.
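To make this concrete, here is a minimal sketch of a streamer configured for ordered loading. It runs everything in a single node for brevity, assumes a cache named myCache already exists, and the data streamer pool-size setter may be named differently depending on the Ignite version:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class OrderedStreamerSketch {
    public static void main(String[] args) {
        // Server-side: a single data streamer thread so batches are not split
        // into stripes (this trades throughput away for ordering).
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStreamerThreadPoolSize(1);

        try (Ignite ignite = Ignition.start(cfg)) {
            try (IgniteDataStreamer<Long, Long> streamer = ignite.dataStreamer("myCache")) {
                streamer.allowOverwrite(true);
                // At most one in-flight batch per node, so batches cannot overtake each other.
                streamer.perNodeParallelOperations(1);
                // Size at which a buffer is automatically sent; flush() sends whatever
                // is buffered even if this limit has not been reached.
                streamer.perNodeBufferSize(512);

                for (long key = 0; key < 10_000; key++) {
                    streamer.addData(key, key * 2);
                }
                streamer.flush();
            }
        }
    }
}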
Below I include text explaining the issue I face in Storm. Anyway, I know it is a long post (just a heads up) and any comment/indication is more than welcome. Here goes the description:
I have installed Storm 0.9.4 and ZooKeeper 3.4.6 on a single server (2 sockets with Intel Xeon 8-core chips, 96 GB ram running CentOS) and I have setup a pseudo-distributed, single node Storm runtime. My configuration consists of 1 zookeeper server, 1 nimbus process, 1 supervisor process, and 1 worker process (when topologies are submitted), all running on the same machine. The purpose of my experiment is to see Storm's behavior on a single node setting, when input load is dynamically distributed among executor threads.
For the purpose of my experiment I have input tuples that consist of 1 long and 1 integer value. The input data come from two spouts that read tuples from disk files and I control the input rates to follow the pattern:
200 tuples/second for the first 24 seconds (time 0 - 24 seconds)
800 tuples/second for the next 12 seconds (24 - 36 seconds)
200 tuples/sec for 6 more seconds (time 36 - 42 seconds)
Turning to my topology, I have two types of bolts: (a) a Dispatcher bolt that receives input from the two spouts, and (b) a Consumer bolt that performs an operation on the tuples and maintains some tuples as state. The parallelism hint for the Dispatcher is one (1 executor/thread), since I have verified that it never reaches even 10% of its capacity. For the Consumer bolt I have a parallelism hint of two (2 executors/threads for that bolt). The input rates I previously mentioned are picked so that I observe end-to-end latency of less than 10 msecs with the appropriate number of executors on the Consumer bolt. In detail, I have run the same topology with one Consumer executor and it can handle an input rate of 200 tuples/sec with end-to-end latency < 10 msec. Similarly, if I add one more Consumer executor (2 executors in total), the topology can consume 800 tuples/sec with < 10 msec end-to-end latency. At this point, I have to say that if I use 1 Consumer executor for 800 tuples/sec, the end-to-end latency reaches up to 2 seconds. By the way, I should mention that I measure end-to-end latency using the ack() function of my bolts, measuring how much time passes between sending a tuple into the topology and its tuple tree being fully acknowledged.
As you realize by now, the goal is to see if I can maintain end-to-end latency < 10 msec for the input spike by simulating the addition of another Consumer executor. In order to simulate the addition of processing resources for the input spike, I use direct grouping, and before the spike I send tuples only to one of the two Consumer executors. When the spike is detected by the Dispatcher, it starts sending tuples to the other Consumer as well, so that the input load is balanced between two threads. Hence, I expect that when I start sending tuples to the additional Consumer thread, the end-to-end latency will drop back to its acceptable value. However, this does not happen.
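For clarity, a minimal sketch of what the direct-grouping wiring described above might look like (component, field, and class names are placeholders, and the spike detection itself is elided):

import java.util.List;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DispatcherBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<Integer> consumerTasks;   // task ids of the Consumer executors
    private boolean spikeDetected = false; // flipped by the (elided) spike detection logic
    private int next = 0;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.consumerTasks = context.getComponentTasks("consumer");
    }

    @Override
    public void execute(Tuple input) {
        // Before the spike everything goes to the first Consumer task;
        // after the spike, tuples are round-robined across both tasks.
        int target = spikeDetected
            ? consumerTasks.get(next++ % consumerTasks.size())
            : consumerTasks.get(0);
        collector.emitDirect(target, input, new Values(input.getLong(0), input.getInteger(1)));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Direct stream: the receiving task is chosen explicitly via emitDirect.
        declarer.declare(true, new Fields("ts", "value"));
    }
}

The Consumer bolt would subscribe to this stream with .directGrouping("dispatcher") in the TopologyBuilder.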
In order to verify my hypothesis that two Consumer executors are able to maintain < 10 msec latency during a spike, I execute the same experiment, but this time, I send tuples to both executors (threads) for the whole lifetime of the experiment. In this case, the end-to-end latency remains stable and in acceptable levels. So, I do not know what really happens in my simulation. I can not really figure out what causes the deterioration of the end-to-end latency in the case where input load is re-directed to the additional Consumer executor.
In order to figure out more about the mechanics of Storm, I did the same setup on a smaller machine and did some profiling. I saw that most of the time is spent in the BlockingWaitStrategy of the LMAX Disruptor and it dominates the CPU. My actual processing function (in the Consumer bolt) takes only a fraction of the time spent in the BlockingWaitStrategy. Hence, I think that it is an I/O issue between queues and not something that has to do with the processing of tuples in the Consumer.
Any idea what goes wrong to give me this radical/baffling behavior?
Thank you.
First, thanks for the detailed and well formulated question! There are multiple comments from my side (not sure if this is already an answer...):
your experiment is rather short (time ranges below 1 minute) which I think might not reveal reliable numbers.
How do you detect the spike?
Are you aware of the internal buffer mechanisms in Storm? (Have a look here: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/)
How many ackers did you configure?
I assume that during your spike period, before you detect the spike, the buffers fill up and it takes some time to empty them. Thus the latency does not drop immediately (maybe extending your last period would resolve this).
Using the ack mechanism for this is done by many people; however, it is rather imprecise. First, it shows an average value (a quantile or max would be much better to use). Furthermore, the measured value is not what should be considered the latency in the first place. For example, if you hold a tuple in internal state for some time and do not ack it until the tuple is removed from the state, Storm's "latency" value increases, which does not make sense for a latency measurement. The usual definition of latency is to take the output timestamp of a result tuple and subtract the emit timestamp of the source tuple (if there are multiple source tuples, you use the youngest, i.e. maximum, timestamp over all source tuples). The tricky part is to figure out the corresponding source tuples for each output tuple... As an alternative, some people inject dummy tuples that carry their emit timestamp as data. These dummy tuples are forwarded by each operator immediately, and the sink operator can easily compute a latency value because it has access to the emit timestamp carried around. This is a quite good approximation of the actual latency as described before.
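A sketch of the sink side of that dummy-tuple approach (the field name emitTs is an assumption; the spout would attach System.currentTimeMillis() as that field when emitting the marker tuple):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class LatencySinkBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // The spout stamped the marker tuple with its emit time; since spout and
        // sink run on the same machine here, there is no clock skew to worry about.
        long emitTs = input.getLongByField("emitTs");
        long latencyMs = System.currentTimeMillis() - emitTs;
        System.out.println("end-to-end latency: " + latencyMs + " ms");
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Sink bolt: no output stream.
    }
}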
Hope this helps. If you have more questions and/or information I can refine my answer later on.
I have a long task to run under my App Engine application with a lot of datastore entities to process. It worked well with a small amount of data, but since yesterday I'm suddenly getting more than a million datastore entries to process per day. After running the task for a while (around 2 minutes), it fails with a 202 exit code (HTTP error 500). I really cannot deal with this issue; it is pretty much undocumented. The only information I was able to find is that it probably means my app is running out of memory.
The task is simple. Each entry in the datastore contains a non-unique string identifier and a long number. The task sums the numbers and stores the identifiers into a set.
My budget is really low since my app is entirely free and without ads. I would like to prevent the app's cost from soaring and find a cheap and simple solution to this issue.
Edit:
I read the Objectify documentation thoroughly tonight, and I found that the session cache (which ensures consistency of entity references) can consume a lot of memory and should be cleared regularly when performing a lot of requests (which is my case). Unfortunately, this didn't help.
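For reference, clearing the Objectify session cache periodically inside such a loop looks roughly like this (a sketch only: Entry, getIdentifier(), and getNumber() are hypothetical placeholders for the real entity, and the 1,000-entity interval is arbitrary):

import static com.googlecode.objectify.ObjectifyService.ofy;

import java.util.HashSet;
import java.util.Set;

public class SumTask {
    public long run() {
        long sum = 0;
        Set<String> identifiers = new HashSet<>();
        int processed = 0;

        for (Entry e : ofy().load().type(Entry.class)) {
            sum += e.getNumber();
            identifiers.add(e.getIdentifier());

            // Drop the session cache every 1,000 entities so references to
            // already-processed entities do not pile up on the heap.
            if (++processed % 1000 == 0) {
                ofy().clear();
            }
        }
        return sum;
    }
}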
It's possible to stay within the free quota but it will require a little extra work.
In your case you should split this operation into smaller batches (e.g. process 1,000 entities per batch) and queue those smaller tasks to run sequentially during off hours. That should save you from the memory issue and allow you to scale beyond your current entity count.
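A rough sketch of that pattern using the task queue and a datastore cursor (the worker URL /tasks/sum, the entity kind Entry, its number property, and the batch size are all placeholders):

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class SumWorkerServlet extends HttpServlet {
    private static final int BATCH_SIZE = 1000;

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        FetchOptions fetch = FetchOptions.Builder.withLimit(BATCH_SIZE);
        String cursorParam = req.getParameter("cursor");
        if (cursorParam != null) {
            fetch.startCursor(Cursor.fromWebSafeString(cursorParam));
        }

        QueryResultList<Entity> batch = ds.prepare(new Query("Entry")).asQueryResultList(fetch);

        long partialSum = 0;
        for (Entity e : batch) {
            partialSum += (Long) e.getProperty("number");
        }
        // Persist the partial sum and identifier set somewhere cheap (e.g. a summary entity); elided here.

        if (batch.size() == BATCH_SIZE) {
            // More entities left: chain the next batch instead of holding everything in one request.
            Queue queue = QueueFactory.getDefaultQueue();
            queue.add(TaskOptions.Builder
                .withUrl("/tasks/sum")
                .param("cursor", batch.getCursor().toWebSafeString()));
        }
    }
}

Each task then stays well under the memory and deadline limits, and the chain can be started from a cron entry during off hours.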
I have been running a test for a large data migration to Dynamo that we intend to do in our prod account this summer. I ran a test to batch write about 3.2 billion documents to our dynamo table, which has hash and range keys and two partial indexes. Each document is small, less than 1k. While we succeeded in getting the items written in about 3 days, we were disappointed with the Dynamo performance we experienced and are looking for suggestions on how we might improve things.
In order to do this migration, we are using 2 EC2 instances (c4.8xlarges). Each runs up to 10 processes of our migration program; we've split the work among the processes by some internal parameters and know that some processes will run longer than others. Each process queries our RDS database for 100,000 records. We then split these into partitions of 25 each and use a threadpool of 10 threads to call the DynamoDB Java SDK's batchSave() method. Each call to batchSave() sends only 25 documents that are less than 1k each, so we expect each to make only a single HTTP call out to AWS. This means that at any given time we can have as many as 100 threads on a server, each making calls to batchSave with 25 records. Our RDS instance handled the load of queries just fine during this time, and our 2 EC2 instances did as well. On the EC2 side, we did not max out our CPU, memory, or network in or network out. Our writes are not grouped by hash key, as we know that can slow down dynamo writes. In general, in a group of 100,000 records, they are split across 88,000 different hash keys. I created the dynamo table initially with 30,000 write throughput, but raised it to 40,000 write throughput at one point during the test, so our understanding is that there are at least 40 partitions on the dynamo side to handle this.
We saw very variable response times in our calls to batchSave() throughout this period. For one 20-minute span while I was running 100 threads per EC2 instance, the average time was 0.636 seconds but the median was only 0.374, so we have a lot of calls taking more than a second. I'd expect to see much more consistency in the time it takes to make these calls from an EC2 instance to dynamo. Our dynamo table seems to have plenty of throughput configured, the EC2 instance is below 10% CPU, and the network in and out look healthy but are not close to being maxed out. The CloudWatch graphs in the console (which are fairly terrible...) didn't show any throttling of write requests.
After I took these sample times, some of our processes finished their work, so we were running fewer threads on our EC2 instances. When that happened, we saw dramatically improved response times in our calls to dynamo. E.g. when we were running 40 threads instead of 100 on the EC2 instance, each making calls to batchSave, the response times improved more than 5x. However, we did NOT see improved write throughput even with the better response times. It seems that no matter what we configured our write throughput to be, we never really saw the actual throughput exceed 15,000.
We'd like some advice on how best to achieve better performance on a Dynamo migration like this. Our production migration this summer will be time-sensitive, of course, and by then, we'll be looking to migrate about 4 billion records. Does anyone have any advice on how we can achieve an overall higher throughput rate? If we're willing to pay for 30,000 units of write throughput for our main index during the migration, how can we actually achieve performance close to that?
One component of BatchWrite latency is the Put request that takes the longest in the Batch. Considering that you have to loop over the List of DynamoDBMapper.FailedBatch until it is empty, you might not be making progress fast enough. Consider running multiple parallel DynamoDBMapper.save() calls instead of batchSave so that you can make progress independently for each item you write.
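A sketch of the parallel-save idea with the v1 Java SDK (MyDocument stands in for the annotated mapper class, and the pool size is arbitrary):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;

public class ParallelSaver {
    public static void saveAll(List<MyDocument> documents) throws InterruptedException {
        final DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient());
        ExecutorService pool = Executors.newFixedThreadPool(10);

        for (final MyDocument doc : documents) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    // Each item progresses on its own; one slow Put no longer
                    // holds back the other 24 items of a batch.
                    mapper.save(doc);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}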
Again, Cloudwatch metrics are 1 minute metrics so you may have peaks of consumption and throttling that are masked by the 1 minute window. This is compounded by the fact that the SDK, by default, will retry throttled calls 10 times before exposing the ProvisionedThroughputExceededException to the client, making it difficult to pinpoint when and where the actual throttling is happening. To improve your understanding, try reducing the number of SDK retries, request ConsumedCapacity=TOTAL, self-throttle your writes using Guava RateLimiter as is described in the rate-limited scan blog post, and log throttled primary keys to see if any patterns emerge.
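Self-throttling with Guava's RateLimiter can be as simple as the sketch below. It assumes each document stays under 1 KB, so a batch of 25 puts consumes roughly 25 write units, and the 25,000 permits/second figure is a placeholder chosen to stay below the 30,000 provisioned WCU:

import java.util.List;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.google.common.util.concurrent.RateLimiter;

public class RateLimitedWriter {
    // Stay below the table's provisioned write capacity, leaving headroom for retries.
    private final RateLimiter writeLimiter = RateLimiter.create(25000.0);
    private final DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient());

    public void writeBatch(List<MyDocument> batchOf25) {
        // Each sub-1KB item costs about 1 WCU, so block until enough permits
        // are available before handing the batch to the SDK.
        writeLimiter.acquire(batchOf25.size());
        // FailedBatch handling (retrying unprocessed items) is elided here.
        mapper.batchSave(batchOf25);
    }
}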
Finally, the number of partitions of a table is not only driven by the amount of read and write capacity units you provision on your table. It is also driven by the amount of data you store in your table. Generally, a partition stores up to 10GB of data and then will split. So, if you just write to your table without deleting old entries, the number of partitions in your table will grow without bound. This causes IOPS starvation - even if you provision 40000 WCU/s, if you already have 80 partitions due to the amount of data, the 40k WCU will be distributed among 80 partitions for an average of 500 WCU per partition. To control the amount of stale data in your table, you can have a rate-limited cleanup process that scans and removes old entries, or use rolling time-series tables (slides 84-95) and delete/migrate entire tables of data as they become less relevant. Rolling time-series tables is less expensive than rate-limited cleanup as you do not consume WCU with a DeleteTable operation, while you consume at least 1 WCU for each DeleteItem call.
Through Appstats, I can see that my datastore queries are taking about 125 ms (API and CPU combined), but there are often long latencies (e.g. up to 12,000 ms) before the queries are executed.
I can see that my latency from the datastore is not related to my query (e.g. the same query/data has vastly different latencies), so I'm assuming that it's a scheduling issue with app engine.
Are other people seeing this same problem ?
Is there someway to reduce the latency (e.g. admin console setting) ?
Here's a screen shot from appstats. This servlet has very little cpu processing. It does a getObjectByID and then does a datastore query. The query has an OR operator so it's being converted into 3 queries by app engine.
As you can see, it takes 6000ms before the first getObjectByID is even executed. There is no processing before the get operation (other than getting pm). I thought this 6000ms latency might be due to an instance warm-up, so I had increased my idle instances to 2 to prevent any warm-ups.
Then there's a second latency of around 1,000 ms between the getObjectByID and the query. There are zero lines of code between the get and the query; the code simply takes the result of the getObjectByID and uses the data as part of the query.
The grand total is 8,097 ms, yet my datastore operations (and 99.99% of the servlet) account for only 514 ms (45 ms API), though the numbers change every time I run the servlet. Here is another Appstats screenshot from a run of the same servlet against the same data.
Here is the basics of my java code. I had to remove some of the details for security purposes.
user = pm.getObjectById(User.class, userKey);
// build the filter string: queryBuilder.append(...
final Query query = pm.newQuery(UserAccount.class, queryBuilder.toString());
query.setOrdering("rating descending");
query.executeWithArray(args);
Edited:
Using Pingdom, I can see that GAE latency varies from 450 ms to 7,399 ms, a 1,644% difference! This is with two idle instances and no users on the site.
I observed very similar latencies (in the 7000-10000ms range) in some of my apps. I don't think the bulk of the issue (those 6000ms) lies in your code.
In my observations, the issue is related to App Engine spinning up a new instance. Setting min idle instances may help mitigate it but will not solve it (I tried up to 2 idle instances), because basically even if you have N idle instances, App Engine will prefer spinning up dynamic ones even when a single request comes in, and will "save" the idle ones in case of crazy traffic spikes. This is highly counter-intuitive, because you'd expect it to use the instances that are already around and spin up dynamic ones for future requests.
Anyway, in my experience this issue (10,000 ms latency) very rarely happens under any non-zero amount of load, and many people have had to resort to some kind of pinging (possibly via cron jobs) every couple of minutes (it used to work with 5 minutes, but lately instances are dying faster, so it's more like a ping every 2 minutes) to keep dynamic instances around to serve users who hit the site when no one else is on. This pinging is not ideal because it eats away at your free quota (pinging every 5 minutes will eat more than half of it), but I really haven't found a better alternative so far.
To recap, in general I have found that App Engine is awesome under load, but not outstanding when you have very few (1-3) users on the site.
Appstats only helps diagnose performance issues when you make GAE API/RPC calls.
In the case of your diagram, the "blank" time is spent running your code on your instance. It's not going to be scheduling time.
Your guess that the initial delay may be because of instance warm-up is highly likely. It may be framework code that is executing.
I can't guess at the delay between the Get and the Query. It may be that there are 0 lines of code, but you called some function while building the Query that takes time to process.
Without knowledge of the language, framework or the actual code, no one will be able to help you.
You'll need to add some sort of performance tracing on your own in order to diagnose this. The simplest (but not highly accurate) way to do this is to add timers and log timer values as your code executes.
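For example, a crude version of that timing around the snippet above (logger name and labels are arbitrary):

import java.util.logging.Logger;

public class TimedHandler {
    private static final Logger log = Logger.getLogger(TimedHandler.class.getName());

    public void handle() {
        long t0 = System.currentTimeMillis();
        // user = pm.getObjectById(User.class, userKey);
        long t1 = System.currentTimeMillis();
        log.info("getObjectById took " + (t1 - t0) + " ms");

        // query.executeWithArray(args);
        long t2 = System.currentTimeMillis();
        log.info("query took " + (t2 - t1) + " ms");
    }
}

If the logged durations are small while Appstats still shows a large total, the gap is more likely instance startup or framework overhead than your own code.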