Server Environment
Linux/RedHat
6 cores
Java 7/8
About the application:
We are developing a low-latency (7-8 ms), high-speed trading platform in Java
There are 2 modules A & B each running on its own JVM
B gets data from A
Architecture:
We use memory-mapped files & Unsafe. Module A writes into a memory-mapped file & Module B reads from the same file (both hold the address of the mapped region)
Module B runs an endless while-loop that keeps reading until the desired value appears in the memory-mapped file
Problem
CPU utilization shoots up to 100% & stays there for the entire life of the process
Question:
Is there a more sophisticated way to poll for a value in the memory-mapped file with minimal overhead, minimal delay & minimal CPU utilization? Note that every microsecond of delay degrades performance
Code Snippet
The code snippet for Module B (the endless while-loop that polls & reads from the memory-mapped file) is below
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

// Map the file once and keep the raw address of the mapped region.
FileChannel fc_pointer = new RandomAccessFile(file, "rw").getChannel();
MappedByteBuffer mem_file_pointer = fc_pointer.map(FileChannel.MapMode.READ_ONLY, 0, bufferSize);
long address_file_pointer = ((DirectBuffer) mem_file_pointer).address();

// Busy-spin: re-read the int at the mapped address until it changes.
int last_value = unsafe.getInt(address_file_pointer);
while (true)
{
    int value_from_memory_mapped_file = unsafe.getInt(address_file_pointer);
    if (value_from_memory_mapped_file != last_value)
    {
        //do some operation....
        //exit the routine;
        break;
    }
    // else keep spinning
}//end of while
Highly loaded CPU is the real cost of the lowest possible latency. In a practical architecture that uses lock-free signaling, you should run no more than a couple of Producer-Consumer thread pairs per CPU socket. One pair eats one or two cores almost completely (one core per thread if not pinned to a single Intel CPU core with Hyper-Threading enabled); that's why in most cases you have to think about horizontal scalability when you build an ultra-low-latency server system for many clients. BTW, don't forget to use "taskset" to pin each process to a specific core before performance tests and to disable power management.
There is a well-known trick of parking the Consumer after a long period of spinning with no result. You have to spend some time to park and then unpark the thread, so there is a moment of sporadic latency increase, of course, but the CPU core is free while the thread is parked. See, for example: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (8.4.4 Synchronization for Longer Periods)
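A minimal sketch of such a backoff strategy, reusing unsafe and address_file_pointer from the snippet above (the spin and park thresholds below are made-up tuning knobs, not values from the original code):

import java.util.concurrent.locks.LockSupport;

// Spin for a while, then yield, then park briefly: trades a little worst-case
// latency for a mostly idle core when no new value arrives.
int spins = 0;
int last_value = unsafe.getInt(address_file_pointer);
while (true)
{
    int value = unsafe.getIntVolatile(null, address_file_pointer);
    if (value != last_value)
    {
        // handle the new value and exit the routine
        break;
    }
    spins++;
    if (spins < 10_000)
    {
        // pure busy-spin: lowest latency, burns the core
        // (on Java 9+ you could also call Thread.onSpinWait() here)
    }
    else if (spins < 20_000)
    {
        Thread.yield();                // give the scheduler a chance
    }
    else
    {
        LockSupport.parkNanos(50_000); // ~50 µs nap, frees the core
    }
}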
Also, a nice illustration for different waiting strategies for java can be found here:
https://github.com/LMAX-Exchange/disruptor/wiki/Getting-Started (Alternative Wait Strategies)
If you are talking about milliseconds (ms), not microseconds (µs), you can just try TCP socket communication over loopback. It adds about 10 µs to pass a small amount of data from Producer to Consumer, and it is a blocking technique. Named pipes have better latency characteristics than sockets, but they are really non-blocking and you have to build something like a spin loop again. Memory-mapped files + the intrinsic Unsafe.getXXX (a single x86 MOV) is still the best IPC technique in terms of both latency and throughput, since it requires no system calls while reading and writing.
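For reference, a blocking hand-off over loopback in Java looks roughly like this (the port number and payload are arbitrary placeholders):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class LoopbackDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9999);   // bind before the producer connects

        // Consumer: blocking read on the loopback interface.
        Thread consumer = new Thread(() -> {
            try (Socket s = server.accept();
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                System.out.println("received: " + in.readInt());   // blocks until data arrives
            } catch (Exception e) { e.printStackTrace(); }
        });
        consumer.start();

        // Producer: connect over loopback and push a small value.
        try (Socket s = new Socket("127.0.0.1", 9999);
             DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
            out.writeInt(42);
        }
        consumer.join();
        server.close();
    }
}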
If you are still going to use lock-free signaling over Memory Mapped Files with direct access via Unsafe, don't forget about appropriate memory barriers for both Producer and Consumer. For example, use "unsafe.getIntVolatile" instead of the first "unsafe.getInt" if you are not sure your code always runs on later x86.
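A hedged illustration of what that looks like on both sides (payload_address, flag_address, newPayload, newSequence and last_seq are placeholders; the point is only that the Producer publishes with a volatile write and the Consumer reads the flag with a volatile read):

// Producer (Module A): write the payload first, then publish the flag/sequence
// with volatile semantics so the payload is visible before the flag changes.
unsafe.putInt(payload_address, newPayload);                 // plain write of the data
unsafe.putIntVolatile(null, flag_address, newSequence);     // release/publish barrier

// Consumer (Module B): read the flag with volatile semantics (acquire barrier),
// then it is safe to read the payload written before it.
int seq = unsafe.getIntVolatile(null, flag_address);
if (seq != last_seq)
{
    int payload = unsafe.getInt(payload_address);
    // process payload...
    last_seq = seq;
}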
If you see unexpected CPU utilization, which should be no more than 30-40% (2 utilized cores on a 6-core CPU) per Producer-Consumer pair, you have to use standard tools to check what is running on the other cores and the overall system performance. If you see intensive IO associated with your mapped file, make sure it is mapped to a tmpfs file system to prevent real disk IO. Check memory-bus load and L3 cache misses for the "fattest" processes because, as we know, CPU time = (CPU execution clock cycles + memory stall cycles) * clock cycle time.
And finally, a quite similar and interesting open-source project with a good example of how to use memory-mapped files: http://openhft.net/products/chronicle-queue/
Related
I am working on a Java application for solving a class of numerical optimization problems - large-scale linear programming problems to be more precise. A single problem can be split up into smaller subproblems that can be solved in parallel. Since there are more subproblems than CPU cores, I use an ExecutorService and define each subproblem as a Callable that gets submitted to the ExecutorService. Solving a subproblem requires calling a native library - a linear programming solver in this case.
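A minimal sketch of that setup, assuming a hypothetical solveNative() wrapper around the native LP solver (names and types here are illustrative, not the actual code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SubproblemRunner {
    // Placeholder for the JNI call into the LP solver library.
    static double[] solveNative(double[] subproblem) { return subproblem; }

    public static void main(String[] args) throws Exception {
        List<double[]> subproblems = new ArrayList<>();   // filled elsewhere
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<double[]>> results = new ArrayList<>();
        for (double[] sp : subproblems) {
            // one Callable per subproblem, each calling the native solver
            results.add(pool.submit((Callable<double[]>) () -> solveNative(sp)));
        }
        for (Future<double[]> f : results) {
            double[] solution = f.get();   // blocks until that subproblem is solved
            // merge solution into the master problem...
        }
        pool.shutdown();
    }
}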
Problem
I can run the application on Unix and on Windows systems with up to 44 physical cores and up to 256g memory, but computation times on Windows are an order of magnitude higher than on Linux for large problems. Windows not only requires substantially more memory, but CPU utilization over time drops from 25% in the beginning to 5% after a few hours. Here is a screenshot of the task manager in Windows:
Observations
Solution times for large instances of the overall problem range from hours to days, and solving them consumes up to 32g of memory (on Unix). Solution times for a subproblem are in the ms range.
I do not encounter this issue on small problems that take only a few minutes to solve.
Linux uses both sockets out-of-the-box, whereas Windows requires me to explicitly activate memory interleaving in the BIOS so that the application utilizes both sockets. Whether or not I do this has no effect on the deterioration of overall CPU utilization over time, though.
When I look at the threads in VisualVM, all pool threads are running; none are waiting or blocked.
According to VisualVM, 90% of CPU time is spent in a native function call (solving a small linear program)
Garbage Collection is not an issue since the application does not create and de-reference a lot of objects. Also, most memory seems to be allocated off-heap. 4g of heap are sufficient on Linux and 8g on Windows for the largest instance.
What I've tried
all sorts of JVM args: a high Xms, a large metaspace, the UseNUMA flag, other GCs.
different JVMs (Hotspot 8, 9, 10, 11).
different native libraries of different linear programming solvers (CLP, Xpress, Cplex, Gurobi).
Questions
What drives the performance difference between Linux and Windows of a large multi-threaded Java application that makes heavy use of native calls?
Is there anything I can change in the implementation that would help on Windows? For example, should I avoid using an ExecutorService that receives thousands of Callables, and if so, what should I do instead?
For Windows the number of threads per process is limited by the address space of the process (see also Mark Russinovich - Pushing the Limits of Windows: Processes and Threads). I think this causes side effects when it comes close to the limits (slowdown of context switches, fragmentation...). On Windows I would try to divide the workload across a set of processes. For a similar issue that I had years ago I implemented a Java library to do this more conveniently (Java 8); have a look if you like: Library to spawn tasks in an external process.
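As a rough illustration of the multi-process approach (not the linked library, just a plain ProcessBuilder sketch; "Worker" and its arguments are hypothetical placeholders for your own entry point):

import java.util.ArrayList;
import java.util.List;

public class MultiProcessLauncher {
    public static void main(String[] args) throws Exception {
        int workers = 4;                       // e.g. one JVM per NUMA node / core group
        List<Process> processes = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            // Each worker is a separate JVM with its own address space and thread pool.
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-Xmx8g", "-cp", System.getProperty("java.class.path"),
                    "Worker", "--slice", Integer.toString(i), "--of", Integer.toString(workers));
            pb.inheritIO();                    // forward worker output to this console
            processes.add(pb.start());
        }
        for (Process p : processes) {
            p.waitFor();                       // wait for all workers to finish
        }
    }
}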
Sounds like Windows is paging some of the memory out to the pagefile after it has been untouched for some time, and that's why the CPU is bottlenecked by disk speed.
You can verify this with Process Explorer and check how much memory is cached.
I think this performance difference is due to how the OS manages threads. The JVM hides all OS differences. There are many sites where you can read about it, like this one, for example. But that does not mean the differences disappear.
I suppose you are running on a Java 8+ JVM. Given that, I suggest you try streams and the functional programming features. Functional programming is very useful when you have many small independent problems and you want to switch easily from sequential to parallel execution. The good news is that you don't have to define a policy to determine how many threads you have to manage (as with an ExecutorService). Just for example (taken from here):
package com.mkyong.java8;

import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelExample4 {

    public static void main(String[] args) {

        long count = Stream.iterate(0, n -> n + 1)
                .limit(1_000_000)
                //.parallel() with this 23s, without this 1m 10s
                .filter(ParallelExample4::isPrime)
                .peek(x -> System.out.format("%s\t", x))
                .count();

        System.out.println("\nTotal: " + count);
    }

    public static boolean isPrime(int number) {
        if (number <= 1) return false;
        return !IntStream.rangeClosed(2, number / 2).anyMatch(i -> number % i == 0);
    }
}
Result:
For normal streams, it takes 1 minute 10 seconds. For parallel streams, it takes 23 seconds. P.S. Tested with an i7-7700, 16 GB RAM, Windows 10.
So, I suggest you read about functional programming, streams, and lambda functions in Java, and try to implement a few small tests with your code (adapted to work in this new context).
Would you please post the system statistics? Task Manager is good enough to provide some clues if that is the only tool available. It can easily tell whether your tasks are waiting for IO, which sounds like the culprit based on what you described. It may be due to a memory management issue, or the library may write some temporary data to disk, etc.
When you say 25% CPU utilization, do you mean only a few cores are busy at the same time? (It may be that all the cores work from time to time, but not simultaneously.) Would you check how many threads (or processes) are really created in the system? Is the number always bigger than the number of cores?
If there are enough threads, are many of them idle waiting for something? If true, you can try to interrupt (or attach a debugger) to see what they are waiting for.
We are seeing a behavior where the performance of the JVM decreases when the load is light. Specifically, over multiple runs in a test environment, we notice that latency worsens by around 100% when the rate of order messages pumped into the system is reduced. Some background on the issue is below; I would appreciate any help on this.
Simplistically the demo Java trading application being investigated can be thought to have 3 important threads:
order receiver thread,
processor thread,
exchange transmitter thread
The order receiver thread receives the order and puts it on the processor queue. The processor thread picks it up from the processor queue, does some basic processing, and puts it on the exchange queue. The exchange transmitter thread picks it up from the exchange queue and sends the order to the exchange.
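A stripped-down sketch of that three-stage pipeline, assuming plain ArrayBlockingQueue hand-offs (the real system may well use a different queue implementation):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OrderPipeline {
    public static void main(String[] args) {
        BlockingQueue<String> processorQ = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> exchangeQ  = new ArrayBlockingQueue<>(1024);

        // Stage 2: processor thread - basic processing, then hand off to the exchange queue.
        new Thread(() -> {
            try {
                while (true) {
                    String order = processorQ.take();
                    exchangeQ.put(order.toUpperCase());   // stand-in for "basic processing"
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "processor").start();

        // Stage 3: exchange transmitter thread - sends the order out.
        new Thread(() -> {
            try {
                while (true) {
                    String order = exchangeQ.take();
                    System.out.println("sent to exchange: " + order);  // stand-in for the real send
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "exchange-transmitter").start();

        // Stage 1: order receiver (here simply the main thread) feeds the processor queue.
        try {
            processorQ.put("order-1");
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}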
The latency from the order receipt to the order going out to the exchange worsens by 100% when the rate of orders pumped into the system is changed from a higher number to a low number.
Solutions tried:
Warming up the critical code path in the JVM by sending a high message rate and priming the system before reducing the message rate:
Does not solve the issue
Profiling the application:
A profiler shows hotspots in the code where a 10-15% improvement may be had by improving the implementation, but nothing in the range of the 100% improvement obtained simply by increasing the message rate.
Does anyone have any insights/suggestions on this? Could it have to do with scheduling jitter on the threads?
Could it be that under the low message rate the threads are being switched out from the core?
2 posts I think may be related are below. However our symptoms are a bit different:
is the jvm faster under load?
Why does the JVM require warmup?
Consistent latency for low/medium load requires specific tuning of Linux.
Below are a few points from my old checklist, relevant for components with millisecond latency requirements.
configure CPU cores to always run at maximum frequency (here are docs for RedHat)
configure dedicated CPU cores for your critical application threads
use isolcpus to exclude the dedicated cores from the scheduler
use taskset to bind each critical thread to its specific core
configure your service to run on a single NUMA node (with numactl)
The Linux scheduler and power saving are key contributors to high latency variance under low/medium load.
By default, a CPU core reduces its frequency when inactive; as a consequence, your next request is processed more slowly on a downclocked core.
The CPU cache is a key performance asset: if your critical thread is scheduled onto different cores, it loses its cached data. In addition, other threads scheduled onto the same core evict the cache, further increasing the latency of the critical code.
Under heavy load these factors are less important (frequency is maxed out and threads are ~100% busy, tending to stick to specific cores).
Under low/medium load, though, these factors negatively affect both average latency and the high percentiles (the 99th percentile may be an order of magnitude worse than in the heavy-load case).
For high-throughput applications (above 100k requests/sec), advanced inter-thread communication approaches (e.g. the LMAX Disruptor) are also useful.
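For illustration, a minimal single-producer Disruptor setup with a busy-spin wait strategy might look roughly like this (event and handler names are made up; check the Disruptor docs for the exact API of the version you use):

import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    // Mutable event carried through the ring buffer.
    static class OrderEvent { long value; }

    public static void main(String[] args) {
        Disruptor<OrderEvent> disruptor = new Disruptor<>(
                OrderEvent::new,             // event factory: pre-allocates the ring buffer
                1024,                        // ring buffer size (power of two)
                DaemonThreadFactory.INSTANCE,
                ProducerType.SINGLE,         // single producer
                new BusySpinWaitStrategy()); // lowest latency, burns a core

        // Consumer: called for every published event.
        disruptor.handleEventsWith(
                (event, sequence, endOfBatch) -> System.out.println("got " + event.value));
        disruptor.start();

        // Producer: claim a slot, fill the event, publish it.
        RingBuffer<OrderEvent> ring = disruptor.getRingBuffer();
        long seq = ring.next();
        try {
            ring.get(seq).value = 42L;
        } finally {
            ring.publish(seq);
        }
    }
}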
My application is a "thread-per-request" web server with a thread pool of M threads. All processing of a single request runs in the same thread.
Suppose I am running the application in a computer with N cores. I would like to configure M to limit the CPU usage: e.g. up to 50% of all CPUs.
If the processing were entirely CPU-bound then I would set M to N/2. However the processing does some IO.
I can run the application with different M and use top -H, ps -L, jstat, etc. to monitor it.
How would you suggest I estimate M?
A CPU usage of 50% does not necessarily mean that the number of threads needs to be N/2. When dealing with I/O, the CPU wastes many cycles waiting for data to arrive from devices.
So you need a tool to measure real CPU usage and through experiments, you could increase the number of threads until the real CPU usage goes to 50%.
perf for Linux is such a tool. This question addresses the problem. Also be sure to collect statistics system wide: perf record -a.
You want your CPU to issue and execute as many instructions per cycle (IPC) as possible. Modern servers can execute up to 4 IPC for intense compute-bound workloads. You want to get as close to that as possible for good CPU utilization, and that means increasing the thread count. Of course, if there is a lot of I/O, that won't be possible because of the many context switches, which bring penalties (cache flushing, kernel code, etc.).
So the final answer would be just increase the thread count until real CPU usage goes to 50%.
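As a starting point before measuring, the classic sizing heuristic from Java Concurrency in Practice can be adapted to a 50% target; a small sketch (the wait/compute ratio below is an assumption you have to measure for your own workload):

public class PoolSizeEstimate {
    public static void main(String[] args) {
        // Heuristic from Java Concurrency in Practice:
        //   threads = cores * targetUtilization * (1 + waitTime / computeTime)
        int n = Runtime.getRuntime().availableProcessors();   // N cores
        double targetUtilization = 0.5;    // aim for 50% of all CPUs
        double waitOverCompute = 4.0;      // ASSUMED ratio: 4 ms of I/O wait per 1 ms of CPU work

        int m = (int) Math.round(n * targetUtilization * (1 + waitOverCompute));
        // e.g. n = 8 gives m = 20; then refine M while watching real CPU usage
        // with top/perf until it settles around the 50% target.
        System.out.println("suggested M = " + m);
    }
}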
It depends entirely on your particular software and hardware.
Hardware is important: if a thread blocks writing to a slow disk it will take a long time before it wakes up (and consumes CPU) again, but if the disk is really fast the block will only switch CPU context and the thread will be running again almost immediately.
I think you can only experiment with different parameters, or have the app monitor its own CPU use and adjust the pool size dynamically.
Say I run a simple single-threaded process like the one below:
public class SirCountALot {
    public static void main(String[] args) {
        int count = 0;
        while (true) {
            count++;
        }
    }
}
(This is Java because that's what I'm familiar with, but I suspect it doesn't really matter)
I have an i7 processor (4 cores, or 8 counting hyperthreading), and I'm running Windows 7 64-bit so I fired up Sysinternals Process Explorer to look at the CPU usage, and as expected I see it is using around 20% of all available CPU.
But when I toggle the option to show 1 graph per CPU, I see that instead of 1 of the 4 "cores" being used, the CPU usage is spread all over the cores:
Instead what I would expect is 1 core maxed out, but this only happens when I set the affinity for the process to a single core.
Why is the workload split over the separate cores? Wouldn't splitting the workload over several cores mess with the caching or incur other performance penalties?
Is it for the simple reason of preventing overheating of one core? Or is there some deeper reason?
Edit: I'm aware that the operating system is responsible for the scheduling, but I want to know why it "bothers". Surely from a naive viewpoint, sticking a (mostly*) single-threaded process to 1 core is the simpler & more efficient way to go?
*I say mostly single-threaded because there are multiple threads here, but only 2 of them are doing anything:
The OS is responsible for scheduling. It is free to stop a thread and start it again on another CPU. It will do this even if there is nothing else the machine is doing.
The process is moved around the CPUs because the OS doesn't assume there is any reason to continue running the thread on the same CPU each time.
For this reason I have written a library to lock threads to a CPU so they won't move around and won't be interrupted by other threads. This reduces latency and improves throughput, but it does tie up a CPU for that thread. It works on Linux; perhaps you can adapt it for Windows. https://github.com/peter-lawrey/Java-Thread-Affinity/wiki/Getting-started
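Usage is roughly like this, assuming the OpenHFT Java-Thread-Affinity artifact is on the classpath (see the linked Getting-started page for the exact coordinates and API of the version you pick):

import net.openhft.affinity.AffinityLock;

public class PinnedWorker {
    public static void main(String[] args) {
        // Reserve a free core for the current thread; released when the lock is closed.
        try (AffinityLock lock = AffinityLock.acquireLock()) {
            System.out.println("pinned to CPU " + lock.cpuId());
            long count = 0;
            while (count < 1_000_000_000L) {   // busy work stays on the same core
                count++;
            }
        }
    }
}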
I would also expect this could well be done on purpose by the CPU and OS so as to try and spread the thermal load on the CPU die...
So it would rotate the (unique/single) thread from core to core.
And that could admittedly be an argument against trying to fight this too hard (especially as, in practice, you often will see better improvements by simply tuning / improving the app itself anyway)
In Java, is there a programmatic way to find out how many concurrent threads are supported by a CPU?
Update
To clarify, I'm not trying to hammer the CPU with threads, and I am aware of the Runtime.getRuntime().availableProcessors() method, which provides part of the information I'm looking for.
I want to find out if there's a way to automatically tune the size of thread pool so that:
if I'm running on a 1-year old server, I get 2 threads (1 thread per CPU x an arbitrary multiplier of 2)
if I switch to an Intel i7 quad core two years from now (which supports 2 threads per core), I get 16 threads (2 logical threads per CPU x 4 CPUs x the arbitrary multiplier of 2).
if, instead, I use a eight core Ultrasparc T2 server (which supports 8 threads per core), I get 128 threads (8 threads per CPU x 8 CPUs x the arbitrary multiplier of 2)
if I deploy the same software on a cluster of 30 different machines, potentially purchased at different years, I don't need to read the CPU specs and set configuration options for every single one of them.
Runtime.availableProcessors returns the number of logical processors (i.e. hardware threads), not physical cores. See CR 5048379.
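Given that, a minimal sketch of the auto-sized pool the question describes (the multiplier of 2 is the question's arbitrary factor, not a recommendation):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AutoSizedPool {
    public static void main(String[] args) {
        // availableProcessors() already counts hardware threads (hyper-threads,
        // UltraSPARC T2 strands, ...), so no per-CPU lookup table is needed.
        int logicalCpus = Runtime.getRuntime().availableProcessors();
        int multiplier = 2;                       // the question's arbitrary factor
        int poolSize = logicalCpus * multiplier;

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        System.out.println("logical CPUs = " + logicalCpus + ", pool size = " + poolSize);
        pool.shutdown();
    }
}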
A single non-hyperthreaded CPU core can run only one thread at a time. You can spawn lots of threads and the CPU will switch between them.
The best number depends on the task. If it is a task that needs lots of CPU power and no I/O (like calculating pi, prime numbers, etc.), then 1 thread per CPU will probably be best. If the task is more I/O-bound, like processing information from disk, then you will probably get better performance with more than one thread per CPU. In that case the disk access can take place while the CPU is processing information from a previous disk read.
I suggest you do some testing of how performance in your situation scales with number of threads per CPU core and decide based on that. Then, when your application runs, it can check availableProcessors() and decide how many threads it should spawn.
Hyperthreading will make the single core appear to the operating system and all applications, including availableProcessors(), as 2 CPUs, so if your application can use hyperthreading you will get the benefit. If not, performance will suffer slightly, but probably not enough to make the extra effort of catering for it worthwhile.
There is no standard way to get the number of supported threads per CPU core within Java. Your best bet is to get a Java CPUID utility that gives you the processor information, and then match it against a table you'll have to generate that gives you the threads per core that the processor manages without a "real" context switch.
Each processor, or processor core, can do exactly 1 thing at a time. With hyperthreading, things get a little different, but for the most part that still remains true, which is why my HT machine at work almost never goes above 50%, and even when it's at 100%, it's not processing twice as much at once.
You'll probably just have to do some testing on the common architectures you plan to deploy on to determine how many threads you want to run on each CPU. Using just 1 thread may be too slow if you're waiting for a lot of I/O. Running a lot of threads will slow things down, as the processor will have to switch threads more often, which can be quite costly. I'm not sure if there is any hard-coded limit to how many threads you can run, but I guarantee that your app would probably come to a crawl from too much thread switching before you reached any such limit. Ultimately, you should just leave it as an option in the configuration file, so that you can easily tune your app to whatever processor you're running it on.
A CPU does not normally pose a limit on the number of threads, and I don't think Java itself has a limit on the number of native (kernel) threads it will spawn.
There is a method availableProcessors() in the Runtime class. Is that what you're looking for?
Basics:
An application loaded into memory is a process. A process has at least 1 thread. If you want, you can create as many threads as you want within a process (theoretically). So the number of threads depends on you and the algorithms you use.
If you use thread pools, the thread pool manages the number of threads, because creating a thread consumes resources. Thread pools recycle threads; this means many logical tasks can run on one pooled thread, one after another.
You don't have to consider the number of threads, it's managed by the thread pool algorithms. Thread pools choose different algorithms for servers and desktop machines (OSes).
Edit1:
You can use explicit threads if you think the thread pool doesn't make use of the resources you have. In that case you manage the number of threads explicitly.
This is a function of the VM, not the CPU. It has to do with the amount of heap consumed per thread. When you run out of space on the heap, you're done. As with other posters, I suspect your app becomes unusable before this point if you exceed the heap space because of thread count.
See this discussion.