Java: How to build a scalable job processing mechanism

I need to build a job processing module where the incoming rate of jobs is on the order of millions. I have a multiprocessor machine to run these jobs on. In my current Java solution I use the ThreadPoolExecutor framework with a LinkedBlockingQueue as the job queue and a number of threads equal to the number of available processors on the system. This design cannot sustain the incoming rate: the job queue keeps growing and within seconds it overflows, even though CPU utilization is nowhere near its maximum. CPU utilization stays somewhere in the range of 30-40 percent.
This suggests that most of the time is being lost to thread contention while other CPUs sit idle. Is there a better way of processing the jobs so that the CPUs are utilized more fully and the job queue does not overflow?

I suggest you look at Disruptor first. This provides a high-performance, in-memory ring buffer. It works best if you can slow the producer(s) when the consumers cannot keep up.
If you need a persisted or unbounded queue, I suggest using Chronicle (which I wrote). This has the advantage that the producer is not slowed by the consumer (and the queue is entirely off-heap).
Both of these are designed to handle millions of messages per second.
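As a rough illustration of the ring-buffer approach, here is a minimal sketch assuming the LMAX Disruptor 3.x DSL; the JobEvent holder and the payloads are placeholders, not anything from the original answer. The bounded ring buffer is what applies backpressure to producers when consumers cannot keep up.

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.Executors;

public class DisruptorSketch {
    // Placeholder event holder reused by the ring buffer (avoids per-job allocation).
    static class JobEvent {
        String payload;
    }

    public static void main(String[] args) {
        // Ring buffer size must be a power of two; when it is full, publishers block,
        // which is the backpressure behaviour mentioned above.
        Disruptor<JobEvent> disruptor = new Disruptor<>(
                JobEvent::new, 1 << 16, Executors.defaultThreadFactory());

        // Consumer: handles jobs one after another on a dedicated thread.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("processing " + event.payload));

        RingBuffer<JobEvent> ringBuffer = disruptor.start();

        // Producer side: claim a slot, fill it, publish.
        for (int i = 0; i < 10; i++) {
            final String job = "job-" + i;
            ringBuffer.publishEvent((event, seq, payload) -> event.payload = payload, job);
        }
        disruptor.shutdown();
    }
}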

You could use a queuing system like RabbitMQ to hold messages for processing. If you combine this with Spring AMQP, you get easy (one line of config) multi-threading, and the messages are stored on disk until your application is ready to process them.
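As a minimal sketch of that combination (assuming Spring AMQP 2.x; the queue name "jobs" and the listener class are made up for illustration), the multi-threading boils down to one attribute on the listener:

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class JobListener {
    // concurrency = "8" asks Spring AMQP to run up to 8 consumer threads on this queue;
    // unacknowledged messages remain in RabbitMQ (and on disk, if the queue is durable).
    @RabbitListener(queues = "jobs", concurrency = "8")
    public void handle(String jobPayload) {
        // process the job here
    }
}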

Your analysis is probably wrong. If the CPU were busy switching between jobs, CPU utilization would be at 100%: anything the CPU does on behalf of a process counts.
My guess is that there is I/O involved, which leaves room to run more jobs. Try running 4 or 8 times as many threads as you have CPU cores.
If that turns out to be too slow, use a framework like Akka, which can process 10 million messages in 23 seconds without any special tuning.
If that's not enough, then look at Disruptor.
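A minimal sketch of that suggestion, using only the JDK (the 4x multiplier is simply the figure suggested above, not a universal rule):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OversubscribedPool {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // 4x the core count, on the assumption that jobs spend much of their time in I/O.
        ExecutorService pool = Executors.newFixedThreadPool(cores * 4);
        pool.submit(() -> System.out.println("job running on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}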

Magical libraries are very tempting, but they often lead you in the wrong direction and make your solution more complex day by day... the Disruptor people at LMAX say this too :) I think you should take a step back and understand the root cause of the job queue depth. In your case it looks to me like the consumers are all of the same type, so I don't think Disruptor is going to help.
You mentioned thread contention.
I would suggest first trying to see whether you can reduce the contention. I am not sure if all your jobs are related, but if they are not, you could use some partitioning technique on the queue to reduce contention between unrelated jobs. Then you need to find out why your consumers are slow. Can you improve your locking strategy by using read-write locks or non-blocking collections in the consumers?
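A possible sketch of such a partitioning scheme, using only the JDK (the key-based routing and the method names are illustrative assumptions, not something from the question): each partition gets its own single-threaded executor, so related jobs stay ordered and unrelated jobs never contend on one shared queue.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedExecutor {
    private final ExecutorService[] shards;

    public PartitionedExecutor(int partitions) {
        shards = new ExecutorService[partitions];
        for (int i = 0; i < partitions; i++) {
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Jobs with the same key always land on the same shard, so related jobs stay ordered
    // and unrelated jobs do not contend with each other.
    public void submit(String key, Runnable job) {
        int shard = Math.floorMod(key.hashCode(), shards.length);
        shards[shard].submit(job);
    }

    public static void main(String[] args) {
        PartitionedExecutor exec = new PartitionedExecutor(Runtime.getRuntime().availableProcessors());
        exec.submit("order-42", () -> System.out.println("processing order-42"));
    }
}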

Related

Message latencies with CPU under-utilization

We've got a Java app where we basically use three dispatcher pools to handle processing tasks:
Convert incoming messages (from RabbitMQ queues) into another format
Serialize messages
Push serialized messages to another RabbitMQ server
The thing we don't know how to start fixing is that we see latencies at the first step. In other words, when we measure the time between the "tell" and the start of the conversion in an actor, there is (not always, but too often) a delay of up to 500 ms. Especially strange is that the CPUs are heavily under-utilized (10-15%) and the mailboxes are pretty much empty all the time; there is no huge backlog of messages waiting to be processed. Our understanding was that Akka would typically utilize the CPUs much better than that?
The conversion is non-blocking and does not require I/O. There are approx. 200 actors running on that dispatcher, which is configured with a throughput of 2 and has 8 threads.
The system itself has 16 CPUs with around 400+ threads running, most of them passive, of course.
Interestingly enough, the other steps do not see such delays, but that can probably be explained by the fact that the first step already "spreads out" the messages so that the other steps/pools can easily digest them.
Does anyone have an idea what could cause such latencies and CPU under-utilization, and how you would normally go about improving things here?

Implementing event-driven lightweight threads

Inspired by libraries like Akka and Quasar, I started wondering how these actually work "under the hood". I'm aware that it is most likely very complex and that they all work quite differently from each other.
I would still like to learn how I would go about implementing an (at most) very basic version of my own "event-driven lightweight threads" using Java 8.
I'm quite familiar with Akka as a library, and I have an intermediate understanding about concurrency on the JVM.
Could anyone point me to some literature covering this, or try to describe the concepts involved?
In Akka it works like this:
An actor is a class that bundles a mailbox with the behavior to handle messages
When some code calls ActorRef.tell(msg), the msg is put into the mailbox of the referenced actor (though, this wouldn't be enough to run anything)
A task is queued on the dispatcher (a thread pool basically) to handle messages in the mailbox
When another message comes in and the mailbox is already queued, it doesn't need to be scheduled again
When the dispatcher is executing the task to handle the mailbox, the actor is called to handle one message after the other
Messages in this mailbox, up to the count specified in akka.actor.throughput, are handled by this one task in one go. If the mailbox still has messages afterwards, another task is scheduled on the dispatcher to handle the remaining messages. Afterwards the task exits. This ensures fairness, i.e. that the thread this mailbox is run on isn't blocked indefinitely by one actor.
So, there are basically two work queues:
The mailbox of an actor. These messages need to be processed sequentially to ensure the contract of actors.
The queue of the dispatcher. All of the tasks in here can be processed concurrently.
The hardest part of writing this efficiently is the thread pool. In the thread pool, a bunch of worker threads need to access their task queue in an efficient way. By default, Akka uses the JDK's ForkJoinPool under the hood, which is a very sophisticated work-stealing thread pool implementation.
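To make the two-queue idea concrete, here is a heavily simplified sketch (not Akka's actual implementation): a mailbox backed by a ConcurrentLinkedQueue, an atomic flag that records whether a mailbox-processing task is already queued on the dispatcher, and a bounded batch per task in the spirit of akka.actor.throughput.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

class MiniActor<T> {
    private static final int THROUGHPUT = 5;                 // in the spirit of akka.actor.throughput
    private final Queue<T> mailbox = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean();
    private final Executor dispatcher;
    private final Consumer<T> behavior;

    MiniActor(Executor dispatcher, Consumer<T> behavior) {
        this.dispatcher = dispatcher;
        this.behavior = behavior;
    }

    void tell(T msg) {
        mailbox.add(msg);
        trySchedule();                                        // no-op if a mailbox task is already queued
    }

    private void trySchedule() {
        if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
            dispatcher.execute(this::processMailbox);
        }
    }

    private void processMailbox() {
        try {
            for (int i = 0; i < THROUGHPUT; i++) {            // bounded batch per task, for fairness
                T msg = mailbox.poll();
                if (msg == null) break;
                behavior.accept(msg);                         // one actor handles one message at a time
            }
        } finally {
            scheduled.set(false);
            trySchedule();                                    // reschedule if messages remain or arrived meanwhile
        }
    }
}

For example, new MiniActor<String>(ForkJoinPool.commonPool(), System.out::println).tell("hello") enqueues the message and schedules exactly one mailbox task on the shared pool.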
Could anyone point me to some literature covering this,
I am the architect of Chronicle Queue, and you can read about how it is used and how it works on my blog: https://vanilla-java.github.io/tag/Microservices/
try to describe the concepts involved?
You have:
Above all, make your threads faster and more lightweight by doing less work.
Try to deal with each event as quickly as possible to keep latency low.
Batch when necessary, but keep it to a minimum. Batching adds latency but can help improve maximum throughput.
Identify the critical path. Keep it as short as possible, and move anything blocking or long-running to asynchronous threads/processes.
Keep hops to a minimum, whether between threads, processes, or machines.
Keep allocation rates down to improve throughput between GCs and to reduce the impact of GCs.
For some of the systems I work on you can achieve latencies of 30 microseconds in Java (network packet in to network packet out).
In Akka:
1. The actor system allocates threads from a thread pool to actors that have messages to process.
2. When an actor has no messages to process, the thread is released and allocated to another actor that has messages to process.
This way, asynchronous actor systems can handle many more concurrent requests with the same amount of resources, since the limited number of threads (the thread pool) never sits idle while waiting for I/O operations to complete.
For more information you can download & check this e-book https://info.lightbend.com/COLL-20XX-Designing-Reactive-Systems_RES-LP.html?lst=BL&_ga=1.214533079.1169348714.1482593952

Scalability of Redis Cluster using Jedis 2.8.0 to benchmark throughput

I have an instance of JedisCluster shared between N threads that perform set operations.
When I run with 64 threads, the throughput of set operations is only slightly increased (compared to running using 8 threads).
How to configure the JedisCluster instance using the GenericObjectPoolConfig so that I can maximize throughput as I increase the thread count?
I have tried
GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxTotal(64);
jedisCluster = new JedisCluster(jedisClusterNodes, poolConfig);
believing this could increase the number of JedisCluster connections to the cluster and so boost throughput.
However, I observed a minimal effect.
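For reference, a self-contained sketch of that experiment (the node address, key names, and iteration counts are placeholders; it relies on the same JedisCluster(Set<HostAndPort>, GenericObjectPoolConfig) constructor shown above), which makes it easy to rerun the measurement at different thread counts:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class JedisClusterBenchmark {
    public static void main(String[] args) throws InterruptedException {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("127.0.0.1", 7000));        // placeholder cluster node

        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(64);                           // per-node connection pool size
        JedisCluster cluster = new JedisCluster(nodes, poolConfig);

        int threads = 8;                                      // rerun with 8, 16, 32, 64 ...
        int opsPerThread = 100_000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    cluster.set("key:" + id + ":" + i, "value");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        long elapsedNanos = System.nanoTime() - start;
        long totalOps = (long) threads * opsPerThread;
        System.out.println("ops/sec ~= " + (totalOps * 1_000_000_000L / elapsedNanos));
    }
}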
When talking about performance, we need to dig into details a bit before I can actually answer your question.
A naive approach suggests: The more Threads (concurrency), the higher the throughput.
That statement is not wrong, but it is also not the whole truth. Concurrency and the resulting performance are not (always) linear, because there is so much going on behind the scenes. Turning something from sequential to concurrent processing might give you something that gets through twice as much work as the sequential version. This example assumes that you run on a multi-core machine that is not occupied by anything else and that has enough bandwidth for the required work (I/O, network, memory). If you scale this example from two threads to eight, but your machine has only four physical cores, weird things might happen.
First of all, each core now has to schedule two threads, so each thread probably behaves as if it were running sequentially, except that the process, the OS, and the processor have the extra overhead of handling twice as many threads as there are cores. Orchestrating all of this comes at a cost that has to be paid at least in memory allocation and CPU time. If the workload requires heavy I/O, then the work might be limited by your I/O bandwidth, and running things concurrently can increase throughput because the CPU is mostly waiting for the I/O to come back with data to process. In that scenario, 4 threads might be blocked on I/O while the other 4 threads are doing some work. Similar reasoning applies to memory and other resources used by your application. Actually, there is much more to it (context switching, branch prediction, L1/L2/L3 caching, locking, and much more), enough to fill a 500-page book. Let's stay at a basic level.
Resource sharing and certain limitations lead to different scalability profiles. Some are linear up to a certain concurrency level, some hit a ceiling where adding more concurrency yields the same throughput, and some have a knee where adding concurrency actually makes things slower, because of $reasons.
Now, we can analyze how Redis, Redis Cluster, and concurrency are related.
Redis is a network service and requires network I/O. Networking might be obvious, but we have to add this fact to our considerations: a Redis server shares its network connection with other things running on the same host and with everything that uses the switches, routers, hubs, etc. along the way. The same applies to the client, even if you told everybody else not to run anything while you're testing.
The next thing is that Redis uses a single-threaded processing model for user tasks (I don't want to dig into background I/O, lazy memory freeing, and asynchronous replication here). So you could assume that Redis uses one CPU core for its work, although in fact it can use more than that. If multiple clients send commands at the same time, Redis processes the commands sequentially, in order of arrival (except for blocking operations, but let's leave those out of this post). If you run N Redis instances on one machine, where N is also the number of CPU cores, you can easily run into a sharing scenario again; that is something you might want to avoid.
You have one or many clients that talk to your Redis server(s). Depending on the number of clients involved in your test, this has an effect. Running 64 threads on an 8-core machine might not be the best idea, since only 8 cores can execute work at a time (let's leave hyper-threading and all that out of here; I don't want to confuse you too much). Requesting more than 8 threads causes time-sharing effects. Running a few more threads than CPU cores for Redis and other networked services isn't too bad an idea, since there is always some overhead/lag coming from the I/O (network). You need to send packets from Java (through the JVM, the OS, the network adapter, routers) to Redis (routers, network, yadda yadda yadda), and Redis has to process the commands and send the response back. This usually takes some time.
The client itself (assuming concurrency on one JVM) locks certain resources for synchronization. In particular, requesting new connections (reusing existing connections or creating new ones) is a locking scenario. You already found the link to the pool config. While one thread holds the lock on a resource, no other thread can access that resource.
Knowing the basics, we can dig into how to measure throughput using jedis and Redis Cluster:
Congestion on the Redis Cluster can be an issue. If all client threads talk to the same cluster node, the other cluster nodes sit idle and you have effectively measured how one node behaves, not the cluster. Solution: create an even workload (Level: Hard!)
Congestion on the client: running 64 threads on an 8-core machine (that is just my assumption here, so please don't beat me up if I'm wrong) is not the best idea. Raising the number of client threads a bit above the number of cluster nodes (assuming an even workload for each cluster node) and a bit above the number of CPU cores can improve performance, but having 8x as many threads as CPU cores is overkill, because it adds scheduling overhead at every level. In general, performance engineering is about finding the best ratio between work, overhead, bandwidth limitations, and concurrency, and finding the best number of threads is a field of computer science in its own right.
Running a test from multiple systems that together run the total number of threads would be closer to a production environment than running the test from one system. Distributed performance testing is a master class (Level: Very hard!). The trick is to monitor all the resources used by your test, making sure nothing is overloaded, or to find the tipping point at which you identify the limit of a particular resource. Monitoring the client and the server are just the easy parts.
Since I do not know your setup (number of Redis Cluster nodes, distribution of cluster nodes among different servers, load on the Redis servers, the client, and the network during the test caused by things other than your test), it is impossible to say what the cause is.

Tracking down thread conflicts in java

Using YourKit, I metered an application, and identified the main CPU sink. I structured the computation to parallelize this via an ExecutorService with a fixed number of threads.
On a 24-core machine, the benefit of adding threads trails off very fast above 4. So, thought I, there must be some contention or locking going on around here, or IO latency, or something.
OK, I turned on the 'Monitor Usage' feature of YourKit, and the amount of blocked time shown in the worker threads is trivial. Eyeballing the thread state chart, the worker threads are nearly all 'green' (running) as opposed to yellow (waiting) or red (blocked).
CPU profiling still shows 96% of the time in a call tree that is inside the worker threads.
So something is using up real time. Could it be scheduling overhead?
In pseudo-code, you might model this as:
loop over blobs:
submit tasks for a blob via invokeAll of executor
do some single-threaded processing on the results
end loop over blobs
In a test run, there are ~680 blobs, and ~13 tasks/blob. So each thread (assuming four) dispatches about 3 times per blob.
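A runnable reduction of that structure (the squaring task and the blob/task counts are stand-ins for the real work): the key point is that invokeAll is a barrier, so the pool is only busy while one blob's ~13 tasks are in flight.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlobLoopSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int blob = 0; blob < 680; blob++) {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int t = 0; t < 13; t++) {
                final long seed = blob * 13L + t;
                tasks.add(() -> seed * seed);                 // stand-in for the real per-blob work
            }
            // invokeAll blocks until every task of this blob has completed ...
            List<Future<Long>> results = pool.invokeAll(tasks);
            // ... and the single-threaded post-processing of the results happens here.
        }
        pool.shutdown();
    }
}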
Hardware: I've run tests on a small scale on my MacBook Pro, and then on a big fat Dell. hwinfo on Linux there reports 24 different items for --cpu, composed of
Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
Intel's website tells me that each has 6 cores and 12 threads; I suspect I have 4 of them.
Assuming you have 4 cores with 8 logical threads each, this means you have 4 real processing units which can be shared across 32 threads. It also means that when you have 2-8 active threads on the same core, they have to compete for resources such as the CPU pipeline and the instruction and data caches.
This works best when you have many threads which have to wait for external resources like disk or network IO. If you have CPU intensive processes, you may find that one thread per core will use all the CPU power you have.
I have written a library which supports allocation of threads to cores for Linux and Windows. If you have Solaris it may be easy to port, as it supports JNI POSIX calls and JNA calls.
https://github.com/peter-lawrey/Java-Thread-Affinity
It's most likely not contention, though it's hard to say without more details. Profiling results can be misleading because Java reports threads as RUNNABLE when they're blocked on disk or network I/O, and YourKit still counts that as CPU time.
Your best bet is to turn on CPU profiling and drill into what's taking the time in the worker threads. If it ends up mostly in java.io classes, you've still got disk or network latency.
You have not completely parallelized the processing. You may not be submitting the next blob until the results of the previous blob have been processed, hence no parallel processing across blobs.
If you can, try it this way (a sketch using CompletableFuture follows the pseudocode below):
for each blob {
    create a runnable for the blob processing; name it blobProcessor;
    create a runnable for the blob results; name it resultsProcessor;
    submit blobProcessor;
    before blobProcessor finishes, submit resultsProcessor;
}
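One way to express that idea with plain JDK classes is the following sketch (processBlob and handleResult are placeholder stand-ins, not the poster's real code): each blob's processing and its result handling are chained asynchronously, so the submitting loop never blocks on a single blob.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PipelinedBlobs {
    static long processBlob(int id) { return (long) id * id; }          // stand-in blobProcessor
    static void handleResult(long result) { /* consume the result */ }  // stand-in resultsProcessor

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int blob = 0; blob < 680; blob++) {
            final int id = blob;
            pending.add(CompletableFuture
                    .supplyAsync(() -> processBlob(id), pool)               // blob processing
                    .thenAcceptAsync(PipelinedBlobs::handleResult, pool));  // chained result processing
            // the loop moves straight on to the next blob without blocking
        }
        // wait for everything only once, at the very end
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
    }
}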
Also, please take a look at JetLang, which provides threadless concurrency using fibers.

What number of threads should be created in a thread pool

Currently I am in the process of developing an application which can work in multi-threaded mode. As part of testing on my local machine (Intel Core i5) I tested with 4 threads. But now I want to release the code for intense (regression) testing, so is there any hard rule by which we can decide the number of threads to create for processing?
I am not using any web or app server; instead I have written my own logic to receive requests and then process them. During processing I receive the request on the main thread and then submit the call to an ExecutorService, where I need to decide the number of threads; each thread then processes the request and is capable of returning a response.
I need to configure an optimum number of threads. I am trying to deploy my application on a 16-core Linux machine with 40 GB of memory.
Thanks
The maximum number of threads for an application cannot be derived from some well-defined formula; it depends on the nature of your tasks and on your target environment.
If your tasks are CPU-intensive and you spawn too many threads, performance will degrade because most of the time will be spent on context switching.
For compute-intensive tasks a general formula is Ncpus + 1. You can determine the number of CPUs using Runtime.getRuntime().availableProcessors().
If your tasks are I/O-intensive then you can usually use a much larger number of threads: because the threads spend so much of their time blocked, extra threads are needed to keep the CPUs busy.
Taking these two cases into account, you should estimate the compute time versus the waiting time via a profiler or a similar tool.
You can then try benchmarks with various pool sizes until you find the optimum for your case.
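A small sketch of those rules of thumb (the 0.5 wait/compute ratio is purely an example value to be replaced by a measured one):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();

        // Compute-bound work: roughly Ncpus + 1 threads.
        int cpuBoundThreads = cpus + 1;

        // I/O-heavy work: scale up by the measured wait-time / compute-time ratio.
        double waitToComputeRatio = 0.5;                       // example value, measure with a profiler
        int ioAwareThreads = (int) (cpus * (1 + waitToComputeRatio));

        ExecutorService pool = Executors.newFixedThreadPool(cpuBoundThreads);
        System.out.println("cpu-bound pool: " + cpuBoundThreads + ", io-aware estimate: " + ioAwareThreads);
        pool.shutdown();
    }
}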
In theory, the optimal number of threads is equal to the number of cores in your machine.
In practice, many operations are waiting for memory, IO, network or disk.
Try executing only a single thread first. If the CPU core load is 25%, you can try creating (4 x the number of cores in your machine) threads.
Note that increasing the number of threads will affect the time each thread waits for network/disk/memory/IO, so it is somewhat more complex than that.
The best thing you can do is benchmark: measure how long it takes to complete 1,000,000 simulated requests with different numbers of threads.
It depends on how CPU-intensive your tasks are. Still, you can assign one task to one core, so at the very least you can create as many threads as there are cores. That said, things may slow down depending on:
Your code doing lots of I/O.
Lots of network I/O
Other CPU-intensive tasks
If you create too many threads, a lot of time will be wasted on context switching. Unless you can arrive at a benchmark based on your own tests, go with threads = number of cores.
