Jedis as message queue performance - java

I am using the Java library Jedis on top of a Redis queue, which I am using as a producer/consumer queue. It was easy to set up and is working nicely.
Consumer code below
// BLPOP blocks until an item is available and returns [queueName, value]
List<String> messages = jedis.blpop(0, redisQueueName);
String message = messages.get(1); // element 0 is the key, element 1 is the payload
// do some stuff
I'm looking to see if I can speed up performance, as I have a large number of items sitting in the Redis queue waiting to be picked up. I've timed my custom processing code and it does not take long (about 20,000 nanoseconds, i.e. 20 µs).
Would best practice be to pull multiple items from Redis at once and process them in a batch? Or am I better looking at tuning the Redis server for better performance?

Yes, pulling items in batches is indeed best practice: you avoid paying a network round trip per message.
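For illustration, here is a minimal sketch of a batched pull using a Jedis pipeline (assuming a Jedis 2.x/3.x-style API; the helper name popBatch is just illustrative). All the LPOPs travel in a single round trip, and nulls are skipped if the queue runs dry:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;
import java.util.ArrayList;
import java.util.List;

// Pull up to batchSize items with one network round trip by pipelining LPOPs.
List<String> popBatch(Jedis jedis, String queue, int batchSize) {
    Pipeline p = jedis.pipelined();
    List<Response<String>> responses = new ArrayList<>(batchSize);
    for (int i = 0; i < batchSize; i++) {
        responses.add(p.lpop(queue));
    }
    p.sync(); // send all commands and read all replies in one round trip

    List<String> messages = new ArrayList<>(batchSize);
    for (Response<String> r : responses) {
        String value = r.get();
        if (value != null) { // queue may drain before batchSize is reached
            messages.add(value);
        }
    }
    return messages;
}

You can still use a single blocking BLPOP to wait for the first item and then drain the rest of a batch with a sketch like this.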
One more thing: trim the queue if it grows beyond a certain range. The queue can grow rapidly, and you want control over its size (and therefore memory use). Sometimes you may not need to process each and every entry in the queue; instead you may skip a few as the queue grows large.
If you want to retain only the elements entered first, i.e. to keep just the first 100 elements:
LTRIM queue 0 99
To retain the last 100 elements you can do:
LTRIM queue -100 -1
Hope this helps

Understanding IgniteDataStreamer: ordering and buffering

I'm using IgniteDataStreamer with allowOverwrite enabled to load continuous data.
Question 1.
From javadoc:
Note that streamer will stream data concurrently by multiple internal threads, so the data may get to remote nodes in different order from which it was added to the streamer.
Reordering is not acceptable in my case. Will setting perNodeParallelOperations to 1 guarantee that the order of addData calls is preserved? There are a number of caches being loaded simultaneously with IgniteDataStreamer, so the Ignite server node threads will all be utilized anyway.
Question 2.
My streaming application could hang for a couple of seconds due to a GC pause. I want to avoid a pause in cache loading at those moments and keep a high average cache writing speed. Is it possible to configure IgniteDataStreamer to keep a (bounded) queue of incoming batches on the server node that would be consumed while the streaming (client) app hangs? See question 1, the queue should be consumed sequentially. It's OK to use some heap for it.
Question 3.
perNodeBufferSize javadoc:
This setting controls the size of internal per-node buffer before buffered data is sent to remote node
According to the javadoc, data transfer is triggered by tryFlush / flush / autoFlush, so how does that correlate with the perNodeBufferSize limitation? Would flush be ignored if there are fewer than perNodeBufferSize messages (I hope not)?
I don't recommend trying to avoid reordering in DataStreamer, but if you absolutely need to do that, you will also need to set the data streamer pool size to 1 on server nodes. If it's larger, the data is split into stripes and not sent sequentially.
DataStreamer is designed for throughput, not latency. So there's not much you can do here. Increasing perThreadBufferSize, perhaps?
Data transfer is automatically started when perThreadBufferSize is reached for any stripe.
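For reference, a rough, hedged sketch of the knobs discussed above (assuming Ignite 2.x; the cache name "myCache" and the buffer size are placeholders):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class OrderedStreamerSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();    // starts a node with default config
        ignite.getOrCreateCache("myCache");  // "myCache" is a placeholder name

        // try-with-resources flushes and closes the streamer at the end
        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("myCache")) {
            streamer.allowOverwrite(true);
            streamer.perNodeParallelOperations(1); // at most one batch in flight per node
            streamer.perNodeBufferSize(512);       // entries buffered before a batch is sent

            for (long i = 0; i < 10_000; i++) {
                streamer.addData(i, "value-" + i);
            }
            streamer.flush(); // push any partially filled buffers now
        }
    }
}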

How to Diagnose ElasticSearch Search Queue Growth

I'm trying to diagnose an issue where our ElasticSearch search queue seemingly randomly fills up.
The behavior we observe in our monitoring is that on one node of our cluster the search queue grows (just that one node), and after the search thread pool is used up we start getting timeouts, of course. There seems to be one query that is blocking the whole thing. The only way for us to resolve the problem at the moment is to restart the node.
You can see below the relevant behavior in charts: first the queue size, then the pending cluster tasks (to show that no other operations are blocking or queuing up, e.g. index operations), and finally the active threads for the search thread pool. The spike at 11 o'clock is the restart of the node.
The log files on all nodes show no entries in the hour before or after the issue, until we restarted the node. There are only garbage collection events of around 200-600 ms, and only one on the relevant node, but that one is around 20 minutes before the event.
My questions:
- how can I debug this as there is no information logged anywhere on a failing or timing out query?
- what are possible reasons for this? We don't have dynamic queries or anything similar
- can I set a query timeout or clear / reset active searches when this happens to prevent a node restart?
Some more details, based on the questions so far (these possibilities don't apply):
exactly the same hardware (16 cores, 60 GB memory)
same config, no special nodes
no swap enabled
nothing noticeable on other metrics like IO or CPU
not a master node
no special shards, three shards on each node, pretty standard queries; all queries sent to ES in the 10 minutes before the event are queries that typically finish within 5-10 ms, all the ones we get a timeout on are of the same kind, and there is no increase in query rate or anything else
we have 5 nodes for this deployment, all accessed round robin
we have a slow log of 2 seconds on info level, no entries
The hot threads after 1 minute of queue build up are at https://gist.github.com/elm-/5ed398054ea6b46522c0, several snapshots of some dumps over a few moments.
That's a very open-ended investigation, as many things could be at fault. A rogue query is the most obvious suspect, but the question is why the other nodes are not affected. The most relevant clue, in my opinion, is why that node is so special.
Things to look at:
compare hardware specs between nodes
compare configuration settings. See if this node stands out with something different.
look at swapping on all nodes, if swapping is enabled. Check mlockall to see if it's set to true.
in your monitoring tool correlate the queue size increasing with other things: memory usage, CPU usage, disk IOPS, GCs, indexing rate, searching rate
is this node the master node when that queue fill up is happening?
look at the shards distribution: is there any "special" shard(s) on this node that stands out? Correlate that with the queries you usually run. Maybe routing is in play here.
are you sending the queries to the same node, or do you round-robin query execution across all nodes?
try to enable slowlogs and decrease the threshold and try to catch that allegedly problematic query (if there is one)
Andrei Stefan's answer isn't wrong but I'd start by looking at the hot_threads from the clogged up node rather than trying to figure out what might be special about the node.
I don't know of a way for you to look inside the queue. Slowlogs, like Andrei says, are a great idea though.
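If it helps, here is a hedged sketch in plain Java that snapshots the hot_threads endpoint on the suspect node a few times while the queue is building up (host, port, node name, and polling interval are placeholders for your cluster):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HotThreadsPoller {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:9200/_nodes/node-1/hot_threads?threads=10";
        for (int i = 0; i < 5; i++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // dump each snapshot so they can be compared later
                }
            }
            Thread.sleep(10_000); // one snapshot every 10 seconds
        }
    }
}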

Apache Storm issue with Dynamic redirection of tuples (baffling impact on end-to-end latency)

Below I include text explaining the issue I face in Storm. Anyway, I know it is a long post (just a heads up) and any comment/indication is more than welcome. Here is the description:
I have installed Storm 0.9.4 and ZooKeeper 3.4.6 on a single server (2 sockets with Intel Xeon 8-core chips, 96 GB RAM, running CentOS) and I have set up a pseudo-distributed, single-node Storm runtime. My configuration consists of 1 ZooKeeper server, 1 nimbus process, 1 supervisor process, and 1 worker process (when topologies are submitted), all running on the same machine. The purpose of my experiment is to see Storm's behavior on a single-node setup when the input load is dynamically distributed among executor threads.
For the purpose of my experiment I have input tuples that consist of 1 long and 1 integer value. The input data come from two spouts that read tuples from disk files and I control the input rates to follow the pattern:
200 tuples/second for the first 24 seconds (time 0 - 24 seconds)
800 tuples/second for the next 12 seconds (24 - 36 seconds)
200 tuples/sec for 6 more seconds (time 36 - 42 seconds)
Turning to my topology, I have two types of bolts: (a) a Dispatcher bolt that receives input from the two spouts, and (b) a Consumer bolt that performs an operation on the tuples and maintains some tuples as state. The parallelism hint for the Dispatcher is one (1 executor/thread), since I have verified that it never reaches even 10% of its capacity. For the Consumer bolt I have a parallelism hint of two (2 executors/threads). The input rates mentioned above are picked so that I observe end-to-end latency of less than 10 msec with the appropriate number of executors on the Consumer bolt. In detail, I have run the same topology with one Consumer executor and it can handle an input rate of 200 tuples/sec with end-to-end latency < 10 msec. Similarly, if I add one more Consumer executor (2 executors in total), the topology can consume 800 tuples/sec with < 10 msec end-to-end latency. I should point out that with 1 Consumer executor at 800 tuples/sec, the end-to-end latency climbs to 2 seconds. By the way, I measure end-to-end latency using the ack() function of my bolts, timing how long it takes from emitting a tuple into the topology until its tuple tree is fully acknowledged.
As you realize by now, the goal is to see if I can maintain end-to-end latency < 10 msec during the input spike by simulating the addition of another Consumer executor. In order to simulate the addition of processing resources for the input spike, I use direct grouping, and before the spike I send tuples to only one of the two Consumer executors. When the spike is detected by the Dispatcher, it starts sending tuples to the other Consumer as well, so that the input load is balanced between the two threads. Hence, I expect that when I start sending tuples to the additional Consumer thread, the end-to-end latency will drop back to its acceptable value. However, this does not happen.
In order to verify my hypothesis that two Consumer executors are able to maintain < 10 msec latency during a spike, I execute the same experiment, but this time, I send tuples to both executors (threads) for the whole lifetime of the experiment. In this case, the end-to-end latency remains stable and in acceptable levels. So, I do not know what really happens in my simulation. I can not really figure out what causes the deterioration of the end-to-end latency in the case where input load is re-directed to the additional Consumer executor.
In order to figure out more about the mechanics of Storm, I did the same setup on a smaller machine and did some profiling. I saw that most of the time is spent in the BlockingWaitStrategy of the LMAX disruptor and that it dominates the CPU. My actual processing function (in the Consumer bolt) takes only a fraction of the time spent in the LMAX BlockingWaitStrategy. Hence, I think that it is an I/O issue between queues and not something that has to do with the processing of tuples in the Consumer.
Any idea what goes wrong and why I get this radical/baffling behavior?
Thank you.
First, thanks for the detailed and well formulated question! There are multiple comments from my side (not sure if this is already an answer...):
your experiment is rather short (time ranges below 1 minute) which I think might not reveal reliable numbers.
How do you detect the spike?
Are you aware of the internal buffer mechanisms in Storm? (Have a look here: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/)
How many ackers did you configure?
I assume that during your spike period, before you detect the spike, the buffers are filled up and it takes some time to empty them. Thus the latency does not drop immediately (maybe extending your last period resolves this).
Using the ack mechanism for this is done by many people; however, it is rather imprecise. First, it shows an average value (a quantile or a max would be much better to use than a mean). Furthermore, the measured value is not what should be considered the latency in the first place. For example, if you hold a tuple in internal state for some time and do not ack it until the tuple is removed from the state, Storm's "latency" value increases, which does not make sense for a latency measurement. The usual definition of latency is to take the output timestamp of a result tuple and subtract the emit timestamp of the source tuple (if there are multiple source tuples, you use the youngest, i.e. maximum, timestamp over all source tuples). The tricky part is to figure out the corresponding source tuples for each output tuple... As an alternative, some people inject dummy tuples that carry their emit timestamp as data. These dummy tuples are forwarded by each operator immediately, and the sink operator can easily compute a latency value as it has access to the emit timestamp that is carried along. This is a quite good approximation of the actual latency as described before.
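As a hedged illustration of the dummy-timestamp idea (the field name "emitTs" is an assumption, and the package names are the Storm 1.x+ ones; Storm 0.9.x uses the backtype.storm packages instead), a sink bolt could look roughly like this:

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import java.util.Map;

// The spout emits a marker tuple with field "emitTs" = System.currentTimeMillis(),
// every intermediate bolt forwards the marker immediately, and this sink computes
// the latency on arrival.
public class LatencySinkBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        long emitTs = tuple.getLongByField("emitTs");
        long latencyMs = System.currentTimeMillis() - emitTs; // approximate end-to-end latency
        System.out.println("latency-ms=" + latencyMs);
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // sink bolt: nothing to declare
    }
}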
Hope this helps. If you have more questions and/or information I can refine my answer later on.

Design : A Java Application with high throughput

I have a scenario, in which
A HUGE input file with a specific format, delimited with \n, has to be read; it has almost 20 million records.
Each record has to be read and processed by sending it to a server in a specific format.
=====================
I am thinking on how to design it.
- Read the file (NIO)
- The thread that reads the file can put those chunks into a JMS queue.
- Create n threads representing the n servers (to which the data is to be sent); those n threads running in parallel can pick up one chunk at a time and process that chunk by sending requests to the server.
Can you tell me if the above is fine, or whether you see any flaw(s)? :) Also it would be great if you could suggest a better way / better technologies to do this.
Thank you!
Update: I wrote a program to read that file with 20M records using Apache Commons IO (file iterator); I read the file in chunks (10 lines at a time) and it read the whole file in 1.2 seconds. How good is this? Should I think of going to NIO? (When I put in a log statement to print the chunks, it took almost 26 seconds!)
20 million records isn't actually that many, so first off I would try just processing it normally; you may find performance is fine.
After that you will need to measure things.
You need to read from the disk sequentially for good speed there, so that part must be single-threaded.
You don't want the disk read waiting for the networking or the networking waiting for the disk reads so dropping the data read into a queue is a good idea. You probably will want a chunk size larger than one line though for optimum performance. Measure the performance at different chunk sizes to see.
You may find that network sending is faster than disk reading already. If so then you are done, if not then at that point you can spin up more threads reading from the queue and test with them.
So your tuning factors are:
chunk size
number of threads.
Make sure you measure performance over a decent sized amount of data for various combinations to find the one that works best for your circumstances.
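To make the single-reader / queue / multi-sender layout concrete, here is a hedged sketch (the file name, chunk size, thread count, and sendToServer method are all placeholders you would tune or replace):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.stream.Stream;

public class ChunkedSender {
    private static final List<String> POISON = new ArrayList<>(); // end-of-input marker

    public static void main(String[] args) throws Exception {
        int chunkSize = 1_000;
        int senderThreads = 4;
        // bounded queue: the reader blocks if the senders fall behind (flow control)
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(64);

        Thread[] senders = new Thread[senderThreads];
        for (int i = 0; i < senderThreads; i++) {
            senders[i] = new Thread(() -> {
                try {
                    List<String> chunk;
                    while ((chunk = queue.take()) != POISON) {
                        sendToServer(chunk);
                    }
                    queue.put(POISON); // pass the marker on so the other senders stop too
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            senders[i].start();
        }

        // Single reader thread: sequential disk access, chunked into the queue.
        try (Stream<String> lines = Files.lines(Paths.get("records.txt"))) {
            List<String> chunk = new ArrayList<>(chunkSize);
            for (String line : (Iterable<String>) lines::iterator) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    queue.put(chunk);
                    chunk = new ArrayList<>(chunkSize);
                }
            }
            if (!chunk.isEmpty()) {
                queue.put(chunk);
            }
        }
        queue.put(POISON);

        for (Thread t : senders) {
            t.join();
        }
    }

    private static void sendToServer(List<String> chunk) {
        // placeholder for the real network call
    }
}

The two tuning factors mentioned above map directly onto chunkSize and senderThreads.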
I believe you could batch the records instead of sending them one at a time. You could avoid unnecessary network hops given the volume of data that needs to be processed by the server.

Non blocking buffer in java

In a high volume multi-threaded java project I need to implement a non-blocking buffer.
In my scenario I have a web layer that receives ~20,000 requests per second. I need to accumulate some of those requests in some data structure (aka the desired buffer) and when it is full (let's assume it is full when it contains 1000 objects) those objects should be serialized to a file that will be sent to another server for further processing.
The implementation should be non-blocking.
I examined ConcurrentLinkedQueue but I'm not sure it can fit the job.
I think I need to use 2 queues in a way that once the first gets filled it is replaced by a new one, and the full queue ("the first") gets delivered for further processing. This is the basic idea I'm thinking of at the moment, and still I don't know if it is feasible since I'm not sure I can switch pointers in java (in order to switch the full queue).
Any advice?
Thanks
What I usually do with requirements like this is create a pool of buffers at app startup and store the references in a BlockingQueue. The producer thread pops buffers, fills them, and then pushes the refs to another queue upon which the consumers are waiting. When the consumer(s) are done (data written to file, in your case), the refs get pushed back onto the pool queue for re-use. This provides lots of buffer storage, no need for expensive bulk copying inside locks, eliminates GC actions, provides flow control (if the pool empties, the producer is forced to wait until some buffers are returned), and prevents memory runaway, all in one design.
More: I've used such designs for many years in various other languages too (C++, Delphi), and they work well. I have an 'ObjectPool' class that contains the BlockingQueue and a 'PooledObject' class to derive the buffers from. PooledObject has an internal private reference to its pool (it gets initialized on pool creation), allowing a parameterless release() method. This means that, in complex designs with more than one pool, a buffer always gets released to the correct pool, reducing cockup-potential.
Most of my apps have a GUI, so I usually dump the pool level to a status bar on a timer, every second, say. I can then see roughly how much loading there is, if any buffers are leaking, (number consistently goes down and then app eventually deadlocks on empty pool), or I am double-releasing, (number consistently goes up and app eventually crashes).
It's also fairly easy to change the number of buffers at runtime, by either creating more and pushing them into the pool, or by waiting on the pool, removing buffers and letting GC destroy them.
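As a hedged sketch of this pooled-buffer design (class and method names, pool size, and buffer capacity are illustrative, not from an existing library):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A fixed set of buffers circulates between a "free" pool and a "filled" queue.
public class BufferPool {
    private final BlockingQueue<List<String>> free;
    private final BlockingQueue<List<String>> filled;

    public BufferPool(int poolSize, int bufferCapacity) {
        free = new ArrayBlockingQueue<>(poolSize);
        filled = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            free.add(new ArrayList<>(bufferCapacity));
        }
    }

    // Producer side: blocks if the pool is empty (flow control, no memory runaway).
    public List<String> acquire() throws InterruptedException {
        return free.take();
    }

    public void submitFilled(List<String> buffer) throws InterruptedException {
        filled.put(buffer);
    }

    // Consumer side: take a filled buffer, write it out, then release it for reuse.
    public List<String> takeFilled() throws InterruptedException {
        return filled.take();
    }

    public void release(List<String> buffer) throws InterruptedException {
        buffer.clear();
        free.put(buffer);
    }

    // Rough load indicator, e.g. for dumping to a status bar on a timer.
    public int freeBuffers() {
        return free.size();
    }
}

The producer calls acquire(), fills the buffer, and calls submitFilled(); the consumer calls takeFilled(), writes the data to file, and calls release().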
I think you have a very good point with your solution. You would need two queues, the processingQueue would be the buffer size you want (in your example that would be 1000) while the waitingQueue would be a lot bigger. Every time the processingQueue is full it will put its contents in the specified file and then grab the first 1000 from the waitingQueue (or less if the waiting queue has fewer than 1000).
My only concern about this is that you mention 20000 per second and a buffer of 1000. I know the 1000 was an example, but if you don't make it bigger it might just be that you are moving the problem to the waitingQueue rather than solving it, as your waitingQueue will receive 1000 new ones faster than the processingQueue can process them, giving you a buffer overflow in the waitingQueue.
Instead of putting each request object in a queue, allocate an array of size 1000, and when it is filled, put that array in the queue to the sender thread which serializes and sends the whole array. Then allocate another array.
How are you going to handle the situation when the sender cannot work fast enough and its queue overflows? To avoid an out-of-memory error, use a queue of limited size.
I might be getting something wrong, but you may use an ArrayList for this, as you don't need to poll per element from your queue. You just flush (create a copy and clear) your array in a synchronized section when its size reaches the limit and you need to send it. Adding to this list should also be synchronized with that flush operation.
Swapping your arrays might not be safe - if your sending is slower than your generation, buffers may soon start overwriting each other. And allocating a 20,000-element array per second is almost nothing for the GC.
final Object lock = new Object();
final List<Request> list = new ArrayList<>(); // Request is a placeholder for your request type
synchronized (lock) {
    list.add(request);
}
...
// this check outside the sync block is a quick, dirty performance filter;
// it is not valid outside the sync block, but it costs less than a nanosecond
// and will filter out 99.9% of the synchronized(lock) sections
if (list.size() > 1000) {
    synchronized (lock) { // this should take less than a microsecond
        if (list.size() > 1000) { // this check is the valid one
            // make sure this is async (i.e. saved in a separate thread) or < 1 ms;
            // the new array allocation must be the slowest part here
            sendAsyncInASeparateThread(new ArrayList<>(list));
            list.clear();
        }
    }
}
UPDATE
Considering that sending is async, the slowest part here is new ArrayList<>(list), which should take around 1 microsecond for 1000 elements, i.e. about 20 microseconds of copying per second at your rates. I didn't measure that; I extrapolated it from the fact that ~1 million elements can be copied in ~1 ms.
If you still require a super-fast synchronized queue, you might want to have a look at the MentaQueue
What do you mean by "switch pointers"? There are no pointers in Java (unless you're talking about references).
Anyway, as you probably saw from the Javadoc, ConcurrentLinkedQueue has a "problem" with the size() method. Still, you could use your original idea of 2 (or more) buffers that would get switched. There are probably going to be some bottlenecks with the disk I/O. Maybe the non-constant time of size() won't be a problem here either.
Of course if you want it to be non-blocking, you better have a lot of memory and a fast disk (and large / bigger buffers).
