Message latencies with CPU under-utilization - java

We've got a Java app where we basically use three dispatcher pools to handle processing tasks:
1. Convert incoming messages (from RabbitMQ queues) into another format
2. Serialize the messages
3. Push the serialized messages to another RabbitMQ server
The problem, and we don't know where to start fixing it, is that we see latencies at the first step. In other words, when we measure the time between the "tell" and the start of the conversion in an actor, there is (not always, but too often) a delay of up to 500 ms. Especially strange is that the CPUs are heavily under-utilized (10-15%) and the mailboxes are pretty much empty all of the time; there is no huge backlog of messages waiting to be processed. Our understanding was that Akka would typically utilize the CPUs much better than that.
The conversion is non-blocking and does not require I/O. There are approx. 200 actors running on that dispatcher, which is configured with throughput 2 and has 8 threads.
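For reference, the dispatcher described above would be defined roughly like this (a sketch; the dispatcher name "conversion-dispatcher" and the use of Akka's classic Java API with Typesafe Config are assumptions):

```java
import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class DispatcherConfigSketch {
    public static void main(String[] args) {
        // A dispatcher pinned to 8 threads with throughput 2, as described in the question.
        Config config = ConfigFactory.parseString(
            "conversion-dispatcher {\n" +
            "  type = Dispatcher\n" +
            "  executor = \"fork-join-executor\"\n" +
            "  fork-join-executor {\n" +
            "    parallelism-min = 8\n" +
            "    parallelism-max = 8\n" +
            "  }\n" +
            "  throughput = 2\n" +
            "}");

        ActorSystem system = ActorSystem.create("pipeline", config.withFallback(ConfigFactory.load()));
        // The ~200 conversion actors would then be created with
        // Props.create(...).withDispatcher("conversion-dispatcher").
    }
}
```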
The system itself has 16 CPUs with around 400+ threads running, most of them passive, of course.
Interestingly enough, the other steps do not see such delays, but that can probably be explained by the fact that the first step already "spreads" the messages so that the other steps/pools can easily digest them.
Does anyone have an idea what could cause such latencies and CPU under-utilization, and how you would normally go about improving things here?

In general, is it better to configure a Tomcat REST service with a rational limit of max threads, or set it to an effectively infinite value?

There isn't a single answer for this, but I don't know where else to ask this question.
I work on a large enterprise system that uses Tomcat to run REST services in containers managed by Kubernetes.
Tomcat, or really any request processor, has a "max threads" property: if enough requests come in to create many threads and the number of created threads reaches that limit, additional requests are put into a queue (bounded by the value of another property), and once that queue is full, further requests may be rejected.
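Concretely, these are the connector attributes in question; below is a sketch using embedded Tomcat (the maxThreads and acceptCount attribute names are Tomcat's, but the values are only illustrative, not recommendations):

```java
import org.apache.catalina.LifecycleException;
import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class BoundedTomcat {
    public static void main(String[] args) throws LifecycleException {
        Tomcat tomcat = new Tomcat();
        tomcat.setPort(8080);

        Connector connector = tomcat.getConnector();
        connector.setProperty("maxThreads", "200");   // request-processing threads before queueing starts
        connector.setProperty("acceptCount", "100");  // queued connections before new ones are refused

        tomcat.start();
        tomcat.getServer().await();
    }
}
```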
It's reasonable to consider whether this property should be set to a value that could possibly be reached, or whether it should be set to effective infinity.
There are many scenarios to consider, although the only interesting ones are when traffic is much higher than normal, either from real customer traffic or from malicious DDoS traffic.
In managed container environments, and other similar cases, this also raises the question of how many instances, pods, or containers should be running copies of the service. I would assume you want as few of these as possible, to reduce the duplication of resources for each pod. That would increase the average number of threads in each container, but I would assume that's better than spreading them thinly across a set of containers.
Some members of my team think it's better to set the "max threads" property to effective infinity.
What are some reasonable thoughts about this?
As a general rule, I'd suggest trying to scale by running more pods (which can easily be scheduled on multiple hosts) rather than by running more threads. It's also easier for the cluster to schedule 16 1-core pods than to schedule 1 16-core pod.
In terms of thread count, it depends a little bit on how much work your process is doing. A typical Web application spends most of its time talking to the database, and does a little bit of local computation, so you could often set it to run 50 or 100 threads but still with a limit of 1.0 CPU, and be effectively using resources. If it's very computation-heavy (it's doing real image-processing or machine-learning work, say) you might be limited to 1 thread per CPU. The bad case is where your process is allocating 16 threads, but the system only actually has 4 cores available, in which case your process will get throttled but you really want it to scale up.
The other important bad state to be aware of is the thread pool filling up. If it does, requests will get queued up, as you note, but if some of those requests are Kubernetes health-check probes, that can result in the cluster recording your service as unhealthy. This can actually lead to a bad spiral where an overloaded replica gets killed off (because it's not answering health checks promptly), so its load gets sent to other replicas, which also become overloaded and stop answering health checks. You can escape this by running more pods, or more threads. (...or by rewriting your application in a runtime which doesn't have a fixed upper capacity like this.)
It's also worth reading about the horizontal pod autoscaler. If you can connect some metric (CPU utilization, thread pool count) to say "I need more pods", then Kubernetes can automatically create more for you, and scale them down when they're not needed.

Implementing event-driven lightweight threads

Inspired by libraries like Akka and Quasar, I started wondering how these actually work "under the hood". I'm aware that it is most likely very complex and that they all work quite differently from each other.
I would still like to learn how I would go about implementing a very basic version (at most) of my own "event-driven lightweight threads" using Java 8.
I'm quite familiar with Akka as a library, and I have an intermediate understanding about concurrency on the JVM.
Could anyone point me to some literature covering this, or try to describe the concepts involved?
In Akka it works like this:
An actor is a class that bundles a mailbox with the behavior to handle messages
When some code calls ActorRef.tell(msg), the msg is put into the mailbox of the referenced actor (though, this wouldn't be enough to run anything)
A task is queued on the dispatcher (a thread pool basically) to handle messages in the mailbox
When another message comes in and the mailbox is already queued, it doesn't need to be scheduled again
When the dispatcher is executing the task to handle the mailbox, the actor is called to handle one message after the other
Messages in this mailbox, up to the count specified in akka.actor.throughput, are handled by this one task in one go. If the mailbox still has messages afterwards, another task is scheduled on the dispatcher to handle the remaining messages; afterwards the task exits. This ensures fairness, i.e. that the thread this mailbox runs on isn't blocked indefinitely by one actor.
So, there are basically two work queues:
The mailbox of an actor. These messages need to be processed sequentially to ensure the contract of actors.
The queue of the dispatcher. All of the tasks in here can be processed concurrently.
The hardest part of writing this efficiently is the thread pool. In the thread pool, a bunch of worker threads need to access their task queue in an efficient way. By default, Akka uses the JDK's ForkJoinPool under the hood, which is a very sophisticated work-stealing thread pool implementation.
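To make that concrete, here is a minimal sketch of the scheme described above in plain Java (not Akka's actual code): each actor is a queue plus a "scheduled" flag, and a shared executor drains at most throughput messages per run, rescheduling itself if work remains.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class MiniActor<T> {
    private final ConcurrentLinkedQueue<T> mailbox = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final ExecutorService dispatcher;
    private final Consumer<T> behavior;
    private final int throughput;

    MiniActor(ExecutorService dispatcher, Consumer<T> behavior, int throughput) {
        this.dispatcher = dispatcher;
        this.behavior = behavior;
        this.throughput = throughput;
    }

    void tell(T msg) {
        mailbox.offer(msg);
        trySchedule();
    }

    // At most one task per mailbox is in flight, which preserves per-actor message ordering.
    private void trySchedule() {
        if (scheduled.compareAndSet(false, true)) {
            dispatcher.execute(this::processMailbox);
        }
    }

    private void processMailbox() {
        try {
            // Handle at most `throughput` messages in one go, for fairness across actors.
            for (int i = 0; i < throughput; i++) {
                T msg = mailbox.poll();
                if (msg == null) {
                    break;
                }
                behavior.accept(msg);
            }
        } finally {
            scheduled.set(false);
            // If messages remain (or arrived while running), put another task on the dispatcher.
            if (!mailbox.isEmpty()) {
                trySchedule();
            }
        }
    }
}

class MiniActorDemo {
    public static void main(String[] args) throws InterruptedException {
        // The "dispatcher": a work-stealing pool, analogous to Akka's ForkJoinPool default.
        ExecutorService dispatcher = Executors.newWorkStealingPool(8);
        MiniActor<String> printer = new MiniActor<>(dispatcher, System.out::println, 5);
        for (int i = 0; i < 20; i++) {
            printer.tell("message " + i);
        }
        Thread.sleep(500); // give the daemon worker threads time to drain the mailbox
    }
}
```

Real implementations add supervision, error handling and smarter work-stealing queues, but the two-queue structure (per-actor mailbox plus dispatcher task queue) is the same.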
Could anyone point me to some literature covering this,
I am the architect of Chronicle Queue, and you can read how it is used and how it works on my blog: https://vanilla-java.github.io/tag/Microservices/
try to describe the concepts involved?
You have to:
Above all, make your threads faster and lightweight by doing less work.
Try to deal with each event as quickly as possible to keep latency low.
Batch when necessary, but keep it to a minimum. Batching adds latency but can help improve maximum throughput.
Identify the critical path. Keep it as short as possible, moving anything blocking or long-running to asynchronous threads/processes.
Keep hops to a minimum, whether between threads, processes or machines.
Keep allocation rates down to improve throughput between GCs and to reduce the impact of GCs (see the sketch after this list).
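A tiny illustration of the allocation point (a hypothetical handler, not Chronicle Queue code): reuse a buffer on the hot path instead of allocating per message.

```java
import java.nio.ByteBuffer;

public class ReusedBufferHandler {
    // One scratch buffer per handler/thread; nothing is allocated per message, so the GC stays quiet.
    private final ByteBuffer scratch = ByteBuffer.allocateDirect(64 * 1024);

    public void onMessage(ByteBuffer incoming) {
        // Assumes messages fit in 64 KB; copy into the reused buffer instead of a new byte[] each time.
        scratch.clear();
        scratch.put(incoming);
        scratch.flip();
        process(scratch);
    }

    private void process(ByteBuffer message) {
        // Placeholder for the real work on the critical path.
    }
}
```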
For some of the systems I work on, you can achieve latencies of 30 microseconds in Java (network packet in to network packet out).
In Akka:
1. The actor system allocates threads from a thread pool to actors that have messages to process.
2. When an actor has no messages to process, the thread is released and allocated to other actors that have messages to process.
This way, asynchronous actor systems can handle many more concurrent requests with the same amount of resources, since the limited number of threads (the thread pool) never sits idle while waiting for I/O operations to complete.
For more information you can download & check this e-book https://info.lightbend.com/COLL-20XX-Designing-Reactive-Systems_RES-LP.html?lst=BL&_ga=1.214533079.1169348714.1482593952

Why does Java App take less overall CPU when running multiple instances of app instead of one instance?

I've got a Java App running on Ubuntu, the app listens on a socket for incoming connections, and creates a new thread to process each connection. The app receives incoming data on each connection processes the data, and sends the processed data back to the client. Simple enough.
With only one instance of the application running and up to 70 simultaneous threads, the app will push CPU usage to over 150% and have trouble keeping up with processing the incoming data. This is running on a Dell 24-core system.
Now if I create 3 instances of my application and split the incoming data across the 3 instances on the same machine, the maximum overall CPU usage on that machine may only reach 25%.
The question is: why would one instance of the application use six times the CPU that three instances on the same machine use, each processing one third of the data?
I'm not a Linux guy, but can anyone recommend a tool to monitor system resources to try to figure out where the bottleneck is occurring? Any clues as to why 3 instances processing the same amount of data as 1 instance would use so much less overall system CPU?
In general this should not be the case. Maybe you are reading the CPU usage wrong. Try the top, htop, ps and vmstat commands to see what's going on.
I could imagine one reason for such behaviour: resource contention. If you have some sort of lock or busy loop that manifests itself only with one instance (max connections, or max threads), then your system might not parallelize processing optimally and will wait on resources. I suggest connecting something like jconsole to your Java processes and seeing what's happening.
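A contrived sketch of that kind of single-instance contention (hypothetical code, not the poster's app): 70 threads all funnel through one lock, so most of them sit blocked and the process cannot make use of 24 cores.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContentionSketch {
    private static final Object SHARED_LOCK = new Object();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(70);
        for (int i = 0; i < 70; i++) {
            pool.execute(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    synchronized (SHARED_LOCK) {
                        // All "processing" happens under one lock, so threads effectively run one at a time.
                        doWork();
                    }
                }
            });
        }
        pool.shutdown();
        // jconsole, jstack or a profiler will show most threads BLOCKED on this monitor.
    }

    private static void doWork() {
        Math.sqrt(System.nanoTime()); // placeholder for per-connection processing
    }
}
```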
As a general recommendation, check how many threads are available per JVM and whether you are using them correctly. Maybe you don't have enough memory allocated to the JVM, so it's garbage collecting too often. If you do database operations, check for bottlenecks there too. Profile, find the place where the application spends most of its time, and compare 1 vs 3 instances in terms of the percentage of time spent in that function.

Java: How to build a scalable Job processing mechanism

I need to build a job processing module where the incoming rate of jobs is on the order of millions. I have a multiprocessor machine to run these jobs on. In my current Java solution, I use Java's ThreadPoolExecutor framework with a job queue (a LinkedBlockingQueue) and a number of threads equal to the available processors on the system. This design cannot sustain the incoming rate: the job queue keeps growing, and within seconds it overflows even though CPU utilization is nowhere near maxed out. CPU utilization remains somewhere in the range of 30-40 percent.
This suggests that most of the time is being lost to thread contention while the other CPUs remain idle. Is there a better way of processing the jobs so that the CPUs are utilized better and the job queue does not overflow?
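For reference, the kind of setup described above looks roughly like this (a sketch with guessed values): a pool with one thread per core and a bounded LinkedBlockingQueue that rejects work once it fills.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class JobProcessorSketch {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                cores, cores,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(100_000),        // bounded queue: this is what "overflows"
                new ThreadPoolExecutor.AbortPolicy());     // overflow -> RejectedExecutionException

        // If jobs arrive faster than `cores` threads can drain them (e.g. because they block or
        // contend on locks), the queue fills and submissions are rejected even at 30-40% CPU.
        executor.execute(() -> { /* job body */ });
        executor.shutdown();
    }
}
```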
I suggest you look at the Disruptor first. It provides a high-performance in-memory ring buffer. It works best if you can slow the producer(s) when the consumers cannot keep up.
If you need a persisted or unbounded queue, I suggest using Chronicle (which I wrote). This has the advantage that the producer is not slowed by the consumer (and the queue is entirely off-heap).
Both of these are designed to handle millions of messages per second.
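A minimal sketch of the Disruptor suggestion (assuming the LMAX Disruptor 3.x DSL; the event and handler here are made up for illustration):

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    // Pre-allocated event object that lives in the ring buffer (no per-message allocation).
    static final class JobEvent {
        long payload;
    }

    public static void main(String[] args) {
        // Ring buffer size must be a power of two.
        Disruptor<JobEvent> disruptor =
                new Disruptor<>(JobEvent::new, 1 << 16, DaemonThreadFactory.INSTANCE);

        // Consumer: runs on its own thread and sees events in order.
        disruptor.handleEventsWith((event, sequence, endOfBatch) -> process(event.payload));

        RingBuffer<JobEvent> ring = disruptor.start();

        // Producer: claims a slot, fills it, publishes. When consumers fall behind, the
        // producer waits on the ring buffer instead of growing an unbounded queue.
        for (long i = 0; i < 1_000_000; i++) {
            ring.publishEvent((event, sequence, value) -> event.payload = value, i);
        }

        disruptor.shutdown();
    }

    private static void process(long payload) {
        // Placeholder for the real job.
    }
}
```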
You could use a queueing system like RabbitMQ to hold messages for processing. If you combine this with Spring AMQP, you get easy (one line of configuration) multi-threading, and the messages are stored on disk until they are ready to be processed by your application.
Your analysis is probably wrong. If the CPU was busy switching jobs, then the CPU utilization would be 100% - anything that the CPU does for a process counts.
My guess is that your jobs do I/O, in which case you could run more of them in parallel. Try running 4 or 8 times as many threads as you have CPU cores.
If that turns out to be too slow, use a framework like Akka which can process 10 million messages in 23 seconds without any special tuning.
If that's not enough, then look at Disruptor.
Magical libraries are very tempting, but they often lead you in the wrong direction and make your solution more complex day by day. The Disruptor people at LMAX say this too. I think you should take a step back and understand the root cause of the job queue depth. In your case it looks like the consumers are all of the same type, so I don't think the Disruptor is going to help.
You mentioned thread contention.
I would suggest first trying to see whether you can reduce that contention. I'm not sure whether all your jobs are related, but if not, you may be able to use some partitioning scheme for the queue and reduce contention between unrelated jobs (sketched below). Then you need to understand why your consumers are slow: can you improve your locking strategy by using read-write locks or non-blocking collections in the consumers?
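A sketch of that partitioning idea (hypothetical: it assumes jobs carry a key and that jobs with different keys are unrelated): each partition gets its own single-threaded executor, so related jobs stay ordered while unrelated jobs never contend on a shared lock or queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedJobQueue {
    private final ExecutorService[] partitions;

    public PartitionedJobQueue(int partitionCount) {
        partitions = new ExecutorService[partitionCount];
        for (int i = 0; i < partitionCount; i++) {
            partitions[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void submit(Object jobKey, Runnable job) {
        // Same key -> same partition -> sequential; different keys run in parallel without shared locks.
        int index = Math.floorMod(jobKey.hashCode(), partitions.length);
        partitions[index].execute(job);
    }

    public void shutdown() {
        for (ExecutorService partition : partitions) {
            partition.shutdown();
        }
    }
}
```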

Scaling Software/Hardware for a Large # of External API Requests?

We have a system that, given a batch of requests, makes an equivalent number of calls to an external third-party API. Given that this is an I/O-bound task, we currently use a cached thread pool of size 20 to service these requests. Other than the above, is the solution to:
Use fewer machines with more cores (less context-switching, capable of supporting more concurrent threads)
or
Use more machines by leveraging commodity/cheap hardware (pizza boxes)
The number of requests we receive a day is on the order of millions.
We're using Java, so the threads here are kernel, not "green".
Other Points/Thoughts:
Hadoop is commonly used for problems of this nature, but this needs to be real-time vs. the stereotypical offline data mining.
The API requests take anywhere from 200ms to 2 seconds on average
There is no shared state between requests
The 3rd Party in question is capable of servicing more requests than we can possibly fire (payments vendor).
It's not obvious to me that you need more resources at all (larger machines or more machines). If you're talking about at most 10 million requests in a day taking at most 2 seconds each, that means:
~110 requests per second. That's not so fast. Are the requests particularly large? Or are there big bursts? Are you doing heavy processing besides dispatching to the third-party API? You haven't given me any information so far that leads me to believe it's not possible to run your whole service on a single core. (Call it three of the smallest possible machines if you want to have n+2 redundancy.)
on average, ~220 active requests. Again, that seems like no problem for a single machine, even with a (pooled) thread-per-request model. Why don't you just expand your pool size and call it a day? Are these really bursty? (And do you have really tight latency/reliability requirements?) Do they need a huge amount of RAM while active?
Could you give some more information on why you think you have to make this choice?
Rather than using a large number of threads, you might fare better with event-driven I/O using node.js, with the caveats that it may mean a large rewrite and that node.js is fairly young.
This SO article may be of interest.
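The same event-driven idea is also available on the JVM; a minimal sketch using java.net.http's asynchronous client (assumes Java 11+, and the URLs are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncApiCalls {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();

        // Fire many third-party calls without dedicating a thread to each in-flight request.
        List<CompletableFuture<HttpResponse<String>>> inFlight =
                List.of("https://api.example.com/charge/1", "https://api.example.com/charge/2")
                    .stream()
                    .map(url -> HttpRequest.newBuilder(URI.create(url)).GET().build())
                    .map(request -> client.sendAsync(request, HttpResponse.BodyHandlers.ofString()))
                    .collect(Collectors.toList());

        // Block here only for the demo; a real service would compose these futures instead.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}
```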
