I am working on a Java application for solving a class of numerical optimization problems - large-scale linear programming problems, to be more precise. A single problem can be split up into smaller subproblems that can be solved in parallel. Since there are more subproblems than CPU cores, I use an ExecutorService and define each subproblem as a Callable that gets submitted to the ExecutorService. Solving a subproblem requires calling a native library - a linear programming solver in this case.
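For context, a minimal, self-contained sketch of the setup described above; SubproblemRunner and solveNative() are hypothetical names, and solveNative() is only a placeholder for the actual JNI call into the LP solver.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical outline: one Callable per subproblem, each invoking the LP solver.
public class SubproblemRunner {

    // Stand-in for the native solver call; the real code would invoke a JNI binding here.
    static double solveNative(double[] subproblemData) {
        return subproblemData[0] * 0.5; // placeholder computation
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Far more subproblems than cores, as in the question.
        List<Callable<Double>> subproblems = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            final double[] data = {i};
            subproblems.add(() -> solveNative(data));
        }

        // Submit everything and combine the per-subproblem results.
        double total = 0;
        for (Future<Double> f : pool.invokeAll(subproblems)) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("Combined objective: " + total);
    }
}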
Problem
I can run the application on Unix and on Windows systems with up to 44 physical cores and up to 256g memory, but computation times on Windows are an order of magnitude higher than on Linux for large problems. Windows not only requires substantially more memory, but CPU utilization over time drops from 25% at the beginning to 5% after a few hours. (A Task Manager screenshot from Windows illustrating this is omitted here.)
Observations
Solution times for large instances of the overall problem range from hours to days and consume up to 32g of memory (on Unix). Solution times for a subproblem are in the ms range.
I do not encounter this issue on small problems that take only a few minutes to solve.
Linux uses both sockets out-of-the-box, whereas Windows requires me to explicitly activate memory interleaving in the BIOS so that the application utilizes both sockets. Whether or not I do this has no effect on the deterioration of overall CPU utilization over time, though.
When I look at the threads in VisualVM, all pool threads are running; none are waiting or blocked.
According to VisualVM, 90% of CPU time is spent on a native function call (solving a small linear program).
Garbage Collection is not an issue since the application does not create and de-reference a lot of objects. Also, most memory seems to be allocated off-heap. 4g of heap are sufficient on Linux and 8g on Windows for the largest instance.
What I've tried
all sorts of JVM args: a high -Xms, a large metaspace, the -XX:+UseNUMA flag, other garbage collectors.
different JVMs (HotSpot 8, 9, 10, 11).
different native libraries of different linear programming solvers (CLP, Xpress, Cplex, Gurobi).
Questions
What drives the performance difference between Linux and Windows of a large multi-threaded Java application that makes heavy use of native calls?
Is there anything I can change in the implementation that would help on Windows? For example, should I avoid using an ExecutorService that receives thousands of Callables, and if so, what should I do instead?
On Windows the number of threads per process is limited by the address space of the process (see also Mark Russinovich - Pushing the Limits of Windows: Processes and Threads). I think this causes side effects when it comes close to those limits (slower context switches, fragmentation, ...). On Windows I would try to divide the workload across a set of processes, as sketched below. For a similar issue that I had years ago I implemented a Java library to do this more conveniently (Java 8); have a look if you like: Library to spawn tasks in an external process.
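A rough sketch of that multi-process idea (not the linked library itself), using plain ProcessBuilder; SolverWorker, the -Xmx value, and the --batch argument are hypothetical illustration names.

import java.util.ArrayList;
import java.util.List;

// Hypothetical driver: spawn one worker JVM per batch instead of one thread per Callable.
public class MultiProcessDriver {
    public static void main(String[] args) throws Exception {
        int workers = 4;                                  // e.g. a few processes per socket
        List<Process> processes = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-Xmx4g", "SolverWorker",     // SolverWorker is a hypothetical main class
                    "--batch", String.valueOf(i));        // tells the worker which subproblems to solve
            pb.inheritIO();                               // forward the worker's output to this console
            processes.add(pb.start());
        }
        for (Process p : processes) {
            p.waitFor();                                  // wait for all workers to finish
        }
    }
}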
Sounds like Windows is paging some memory out to the pagefile after it has been untouched for some time, and that's why the CPU is bottlenecked by disk speed.
You can verify this with Process Explorer and check how much memory is cached.
I think this performance difference is due to how the OS manages threads. The JVM hides all OS differences. There are many sites where you can read about it, like this one, for example. But that does not mean the differences disappear.
I suppose you are running on a Java 8+ JVM. If so, I suggest you try the stream and functional programming features. Functional programming is very useful when you have many small independent problems and you want to switch easily from sequential to parallel execution. The good news is that you don't have to define a policy to determine how many threads to manage (as with the ExecutorService). Just for example (taken from here):
package com.mkyong.java8;

import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelExample4 {

    public static void main(String[] args) {

        long count = Stream.iterate(0, n -> n + 1)
                .limit(1_000_000)
                //.parallel() // with this: 23 s, without: 1 m 10 s
                .filter(ParallelExample4::isPrime)
                .peek(x -> System.out.format("%s\t", x))
                .count();

        System.out.println("\nTotal: " + count);
    }

    // Naive primality test used as the per-element workload.
    public static boolean isPrime(int number) {
        if (number <= 1) return false;
        return !IntStream.rangeClosed(2, number / 2).anyMatch(i -> number % i == 0);
    }
}
Result:
For normal streams, it takes 1 minute 10 seconds. For parallel streams, it takes 23 seconds. P.S. Tested with i7-7700, 16G RAM, Windows 10.
So, I suggest you read about functional programming, streams, and lambda functions in Java, and then try to implement a small number of tests with your code (adapted to work in this new context).
Would you please post the system statistics? Task Manager is good enough to provide some clues, if that is the only tool available. It can easily tell whether your tasks are waiting for IO, which sounds like the culprit based on what you described. It may be due to a memory management issue, or the library may write temporary data to disk, etc.
When you say 25% CPU utilization, do you mean only a few cores are busy at the same time? (It could be that all the cores work from time to time, but not simultaneously.) Would you check how many threads (or processes) are actually created in the system? Is the number always bigger than the number of cores?
If there are enough threads, are many of them idle, waiting for something? If so, you can try to interrupt them (or attach a debugger) to see what they are waiting for.
Related
Server Environment
Linux/RedHat
6 cores
Java 7/8
About the application:
We are working on developing a low latency (7-8 ms) high speed trading platform using Java
There are 2 modules A & B each running on its own JVM
B gets data from A
Architecture:
We have made use of memory-mapped files & Unsafe. Module A writes into a memory-mapped file & Module B reads from the file (both hold the address location of the file)
We went ahead & used an endless while-loop to keep reading until the desired value is obtained from the memory-mapped file
Problem
CPU utilization shoots up to 100% & remains there for the whole life cycle of the process
Question:
Is there a more sophisticated way to keep polling for a value in the memory-mapped file that involves minimal overhead, minimal delay & minimal CPU utilization? Note that every microsecond of delay will deteriorate the performance
Code Snippet
The code snippet for Module B (the endless while-loop that polls & reads from the memory-mapped file) is below:
// Map the file and resolve the absolute address of the mapped region.
FileChannel fc_pointer = new RandomAccessFile(file, "rw").getChannel();
MappedByteBuffer mem_file_pointer = fc_pointer.map(FileChannel.MapMode.READ_ONLY, 0, bufferSize);
long address_file_pointer = ((DirectBuffer) mem_file_pointer).address();

// Busy-spin until the value at the mapped address changes.
int last_read_value = unsafe.getInt(address_file_pointer);
while (true)
{
    int value_from_memory_mapped_file = unsafe.getInt(address_file_pointer);
    if (value_from_memory_mapped_file != last_read_value)
    {
        //do some operation....
        //exit the routine;
        break;
    }
    // otherwise keep spinning
} //end of while
A highly loaded CPU is the real cost of the lowest latency possible. In a practical architecture that uses lock-free signaling, you should run no more than a couple of Consumer-Producer pairs of threads per CPU socket. One pair eats one or two cores almost completely (one core per thread, unless both are pinned to a single Intel core with Hyper-Threading enabled); that's why in most cases you have to think about horizontal scalability when you build an ultra-low-latency server system for many clients. BTW, don't forget to use "taskset" to pin each process to a specific core before performance tests, and disable power management.
There is a well-known trick where you park a Consumer after a long period of spinning with no result. You have to spend some time to park and then unpark the thread, which produces a moment of sporadic latency increase, of course, but the CPU core is free while the thread is parked. See, for example: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (8.4.4 Synchronization for Longer Periods). A rough sketch of such a spin-then-park strategy follows.
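A minimal sketch of such a spin-then-park loop, reusing the unsafe, address_file_pointer and last_read_value variables from the question's snippet; the spin threshold and park duration are arbitrary illustration values, not recommendations.

// Spin-then-park variant of the consumer loop from the question.
int spins = 0;
final int MAX_SPINS = 10_000;                 // arbitrary threshold, for illustration only
while (true)
{
    int value = unsafe.getInt(address_file_pointer);
    if (value != last_read_value)
    {
        // value changed: handle it and exit the routine
        break;
    }
    if (++spins > MAX_SPINS)
    {
        // nothing arrived for a long time: give the core away for ~1 microsecond
        java.util.concurrent.locks.LockSupport.parkNanos(1_000);
        spins = 0;
    }
}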
Also, a nice illustration for different waiting strategies for java can be found here:
https://github.com/LMAX-Exchange/disruptor/wiki/Getting-Started (Alternative Wait Strategies)
If you are talking about milliseconds (ms), not microseconds (µs), you can just try TCP socket communication over loopback. It adds about 10 µs to pass a small amount of data from Producer to Consumer, and it is a blocking technique. Named Pipes have better latency characteristics than sockets, but they are really non-blocking, so you have to build something like a spin loop again. Memory-mapped files + intrinsic Unsafe.getXXX (which is a single x86 MOV) is still the best IPC technique in terms of both latency and throughput, since it doesn't require system calls while reading and writing.
If you are still going to use lock-free signaling with memory-mapped files and direct access via Unsafe, don't forget about appropriate memory barriers for both Producer and Consumer. For example, use "unsafe.getIntVolatile" instead of the first "unsafe.getInt" if you are not sure your code will always run on later x86; a minimal example of the substitution is shown below.
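A minimal example of that substitution, assuming the same unsafe and address_file_pointer as in the question's snippet; the two-argument getIntVolatile overload treats a null base object as an absolute address.

// Plain read: may be hoisted out of the loop or reordered by the JIT on some platforms.
int plain = unsafe.getInt(address_file_pointer);

// Volatile (acquire) read of the same off-heap address; with a null base object
// the offset argument is treated as an absolute address.
int fresh = unsafe.getIntVolatile(null, address_file_pointer);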
If you see unexpected CPU utilization, which should be no more than 30-40% (2 utilized cores on a 6-core CPU) per Producer-Consumer pair, you have to use standard tools to check what is running on the other cores and the overall system performance. If you see intensive IO associated with your mapped file, make sure it is mapped to a tmpfs file system to prevent real disk IO. Check memory bus loading and L3 cache misses for the "fattest" processes because, as we know, CPU time = (CPU execution clock cycles + memory stall cycles) x clock cycle time.
And finally, a quite similar and interesting open source project with a good example how to use memory mapped files: http://openhft.net/products/chronicle-queue/
I wrote a parallel Java program. It works like this:
It takes a String input;
The input is then cut evenly into String inputs[numThreads];
Each inputs[i] is assigned to thread_i to process, and generates results[i];
After all the working threads finish, the main thread merges the results[i] into result (a sketch of this structure follows the list).
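A minimal sketch of that split / process / merge structure; SplitMergeSketch and its process() method are hypothetical stand-ins, not the asker's actual code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical outline of the split / process / merge pattern described above.
public class SplitMergeSketch {

    static String process(String chunk) { return chunk.toUpperCase(); } // stand-in work

    public static void main(String[] args) throws Exception {
        String input = "some large input string";
        int numThreads = Runtime.getRuntime().availableProcessors();

        // Cut the input into roughly equal chunks, one per thread.
        String[] inputs = new String[numThreads];
        int chunk = (input.length() + numThreads - 1) / numThreads;
        for (int i = 0; i < numThreads; i++) {
            int from = Math.min(i * chunk, input.length());
            int to = Math.min(from + chunk, input.length());
            inputs[i] = input.substring(from, to);
        }

        // Process each chunk on its own worker thread.
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<String>> futures = new ArrayList<>();
        for (String in : inputs) {
            futures.add(pool.submit(() -> process(in)));
        }

        // Merge the per-thread results on the main thread.
        StringBuilder result = new StringBuilder();
        for (Future<String> f : futures) {
            result.append(f.get());
        }
        pool.shutdown();
        System.out.println(result);
    }
}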
The performance data on a 10-core (physical cores) machine is as below.
Threads#    1 thread   2 threads   4 threads   8 threads   10 threads
Time (ms)   78         41          28          21          21
Note:
the JVM warm-up time has been eliminated (first 50 runs).
the time doesn't include thread starting/joining time.
It seems the memory bandwidth becomes the bottleneck when there are more than 8 threads.
In this case, how can I further improve the performance? Are there any design issues in my parallel Java program?
To examine the cause of this scalability issue, I inserted a (meaningless computation) loop into the process(inputs[i]) method. Here is the new data:
Threads#    1 thread   10 threads
Time (ms)   41000      4330
The new data shows good scalability for 10 threads, which in turn confirms that the original version (without the meaningless loop) has a memory issue that limits its scalability to 8 threads.
But is there any way to circumvent this issue, like pre-loading the data into each core's local cache, or loading it in batches?
I find it unlikely that you have a memory bandwidth issue here. It is more likely that your run times are so short that, as they approach 0, you are mostly just timing the thread startup/shutdown or the HotSpot compiler's optimization cycles. Gaining relevant timing information from a Java task that runs this briefly is close to worthless. The HotSpot compiler and other optimizations that run initially often dominate the CPU usage early on in a class's life. Our production applications stabilize only after minutes of live service operation.
If you can significantly increase your run times by adding more input data or by calculating the same result over and over you may get a better idea about what the optimal thread numbers are.
Edit:
Now that you have added timings for 1 and 10 threads over a longer period, it looks to me that you are not bound by anything since the timing seems to be fairly linear -- with some thread overhead. 41000/10 = 4100 versus 4330 for 10 threads.
Pretty good demonstration of what threading can do to a CPU bound application. :-)
How many logical cores do you have?
Consider - imagine you had one core and a hundred threads. The work to be done is the same, it cannot be distributed over multiple cores, but now you have a great deal of thread switching overhead.
Now imagine you have say four cores and four threads. Assuming no other bottlenecks, compute time is quartered.
Now imagine you have four cores and eight threads. Your compute time will still be approximately quartered, but you'll have added some thread-swapping overhead.
Be aware of hyperthreading and that it may help or hinder you, depending on the nature of the compute task.
I'd say your losses are down to switching threads. You have more threads than cores, and none of them need to block for slower processes, so they are getting switched in, doing a bit of work, and then getting switched out so another one can be switched in. Switching threads is an expensive process. Given the nature of what you appear to be doing, I would instinctively have restricted the number of threads to 8 (leaving two cores for the OS), and your performance numbers appear to bear me out.
Can you explain this nonsense to me?
I have a method that basically fills up an array with mathematical operations. There's no I/O involved or anything. Now, this method takes about 50 seconds to run, and the code is perfectly scalable (theoretically 100%), so I split it up into 4 threads, wait for them to complete, and reassemble the 4 arrays. Then I run the program on a quad-core processor, expecting it to take about 15 seconds, and it actually takes 58 seconds. That's right: it takes longer! I see the CPU working at 100%, and I know that each thread does 1/4 of the calculations, and creating the threads and reassembling the arrays take about 1-2 ms in total.
What's causing such a loss of performance? What the hell is the CPU doing all that time?
CODE: http://pastebin.com/cFUgiysw
Threads don't work that way.
Threads are still part of the same process (depending on the OS), so in terms of the operating system, CPU time will be scheduled the same for 4 threads in 1 process as it is for 1 thread in 1 process.
Also, with such a small number of values, you won't see the scalability in the midst of the overhead. Re-assembling the arrays in Java will be costly.
Check out things like "context switching overhead" - things like that always mess you up when you try to map theory to practice :P
I would stick to the single-threaded way :)
~ Dan
http://en.wikipedia.org/wiki/Context_switch
A lot depends on what you are doing and how you are dividing the work. There are many possible causes for this problem.
The most likely cause is that you are using all the bandwidth of the CPU-to-main-memory bus with one thread. This can happen if your data set is larger than your CPU cache, especially if you have some random-access behaviour. You could consider reusing the original array rather than taking multiple copies, to reduce cache churn; a sketch of that idea is below.
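A minimal sketch of that idea, under the assumption that each thread can be given a disjoint slice of one shared output array, so no per-thread copies or merge step are needed; the array size and computation are placeholders.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: each thread fills a disjoint slice of one shared array,
// so no per-thread copies are made and no reassembly step is needed afterwards.
public class DisjointSlices {
    public static void main(String[] args) throws InterruptedException {
        int n = 10_000_000;
        double[] result = new double[n];                 // single shared output array
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int slice = (n + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int from = t * slice;
            final int to = Math.min(from + slice, n);
            pool.execute(() -> {
                for (int i = from; i < to; i++) {
                    result[i] = Math.sqrt(i) * 0.5;      // stand-in computation
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}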
Your locking overhead is greater than the performance gain. I suspect you have used very coarse locking, so this shouldn't be an issue.
Starting and stopping threads takes too long. As your code runs for multiple seconds, I doubt this too.
There is a cost associated with starting new threads. I don't think it should add up to 8 seconds, but it depends on what threads you are using. Some threads need to create a copy of the data they are handling in order to be thread-safe, and that can take some time. This cost is commonly referred to as overhead. If the work you are doing is somewhere not parallelizable, for instance it reads the same file or needs access to a shared resource, the threads might need to wait on each other; this can take some time, and under sub-optimal conditions it can take more time than serial execution. My tip is to check for these non-parallelizable parts and remove them from the threaded section if possible. Also try using a lower number of threads; 4 threads for 4 CPUs is not always optimal.
Hope it helps.
Unless you are constantly creating and killing threads, the thread overhead shouldn't be a problem. Four threads running simultaneously is no big deal for the scheduler.
As Peter Lawrey suggested, the memory bandwidth could be the problem. Your 50-second code is running on a Java engine, and they both compete for the available memory bandwidth. The Java engine needs memory bandwidth to execute your code, and your code needs it to do its calculations.
You write "perfectly scalable", which would be the case if your code were compiled. Since it runs on a Java engine this is not the case. So the 16% increase in overall time could be seen as the difference between the smoothness of one thread versus the chaos of four colliding over memory accesses.
Say I run a simple single-threaded process like the one below:
public class SirCountALot {
    public static void main(String[] args) {
        int count = 0;
        // Busy loop: a single thread incrementing a counter forever.
        while (true) {
            count++;
        }
    }
}
(This is Java because that's what I'm familiar with, but I suspect it doesn't really matter)
I have an i7 processor (4 cores, or 8 counting hyperthreading), and I'm running Windows 7 64-bit so I fired up Sysinternals Process Explorer to look at the CPU usage, and as expected I see it is using around 20% of all available CPU.
But when I toggle the option to show 1 graph per CPU, I see that instead of 1 of the 4 "cores" being used, the CPU usage is spread all over the cores (per-core screenshot omitted).
Instead what I would expect is 1 core maxed out, but this only happens when I set the affinity for the process to a single core.
Why is the workload split over the separate cores? Wouldn't splitting the workload over several cores mess with the caching or incur other performance penalties?
Is it for the simple reason of preventing overheating of one core? Or is there some deeper reason?
Edit: I'm aware that the operating system is responsible for the scheduling, but I want to know why it "bothers". Surely from a naive viewpoint, sticking a (mostly*) single-threaded process to 1 core is the simpler & more efficient way to go?
*I say mostly single-threaded because there are multiple threads here, but only 2 of them are doing anything (thread list not shown).
The OS is responsible for scheduling. It is free to stop a thread and start it again on another CPU. It will do this even if there is nothing else the machine is doing.
The process is moved around the CPUs because the OS doesn't assume there is any reason to continue running the thread on the same CPU each time.
For this reason I have written a library to lock threads to a CPU so they won't move around and won't be interrupted by other threads. This reduces latency and improves throughput, but it does tie up a CPU for that thread. It works for Linux; perhaps you can adapt it for Windows: https://github.com/peter-lawrey/Java-Thread-Affinity/wiki/Getting-started (a rough usage sketch follows).
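A rough usage sketch, assuming the current OpenHFT packaging of that library (net.openhft.affinity and the AffinityLock class); the exact artifact name and API may differ from the version the answer refers to.

import net.openhft.affinity.AffinityLock;

public class PinnedWorker {
    public static void main(String[] args) {
        // Acquire a free CPU and pin the current thread to it; release it when done.
        AffinityLock lock = AffinityLock.acquireLock();
        try {
            long count = 0;
            while (count < 10_000_000_000L) {
                count++;            // the hot loop now stays on one core
            }
            System.out.println("Done: " + count);
        } finally {
            lock.release();
        }
    }
}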
I would also expect this could well be done on purpose by the CPU and OS so as to try and spread the thermal load on the CPU die...
So it would rotate the (unique/single) thread from core to core.
And that could admittedly be an argument against trying to fight this too hard (especially as, in practice, you often will see better improvements by simply tuning / improving the app itself anyway)
In Java, is there a programmatic way to find out how many concurrent threads are supported by a CPU?
Update
To clarify, I'm not trying to hammer the CPU with threads and I am aware of Runtime.getRuntime().availableProcessors() function, which provides me part of the information I'm looking for.
I want to find out if there's a way to automatically tune the size of the thread pool (a sketch follows the list) so that:
if I'm running on a 1-year-old server, I get 2 threads (1 thread per CPU x an arbitrary multiplier of 2);
if I switch to an Intel i7 quad-core two years from now (which supports 2 threads per core), I get 16 threads (2 logical threads per CPU x 4 CPUs x the arbitrary multiplier of 2);
if, instead, I use an eight-core UltraSPARC T2 server (which supports 8 threads per core), I get 128 threads (8 threads per CPU x 8 CPUs x the arbitrary multiplier of 2);
if I deploy the same software on a cluster of 30 different machines, potentially purchased in different years, I don't need to read the CPU specs and set configuration options for every single one of them.
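A minimal sketch of that kind of auto-tuning using only availableProcessors() and a configurable multiplier; note that it sizes the pool by logical CPUs (which already include the hardware threads per core), and the pool.multiplier property name is made up for illustration.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TunedPool {
    public static void main(String[] args) {
        // Logical processors as seen by the JVM (cores x hardware threads per core).
        int logicalCpus = Runtime.getRuntime().availableProcessors();

        // Arbitrary multiplier from the question; could come from a config file or system property.
        int multiplier = Integer.getInteger("pool.multiplier", 2);

        ExecutorService pool = Executors.newFixedThreadPool(logicalCpus * multiplier);
        System.out.println("Pool size: " + logicalCpus * multiplier);
        pool.shutdown();
    }
}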
Runtime.availableProcessors returns the number of logical processors (i.e. hardware threads) not physical cores. See CR 5048379.
A single non-hyperthreading CPU core can always run one thread. You can spawn lots of threads and the CPU will switch between them.
The best number depends on the task. If it is a task that will take lots of CPU power and not require any I/O (like calculating pi, prime numbers, etc.), then 1 thread per CPU will probably be best. If the task is more I/O-bound, like processing information from disk, then you will probably get better performance by having more than one thread per CPU. In this case the disk access can take place while the CPU is processing information from a previous disk read.
I suggest you do some testing of how performance in your situation scales with number of threads per CPU core and decide based on that. Then, when your application runs, it can check availableProcessors() and decide how many threads it should spawn.
Hyperthreading will make the single core appear to the operating system and all applications, including availableProcessors(), as 2 CPUs, so if your application can use hyperthreading you will get the benefit. If not, then performance will suffer slightly, but probably not enough to make the extra effort of catering for it worthwhile.
There is no standard way to get the number of supported threads per CPU core within Java. Your best bet is to get a Java CPUID utility that gives you the processor information, and then match it against a table you'll have to generate that gives you the threads per core that the processor manages without a "real" context switch.
Each processor, or processor core, can do exactly 1 thing at a time. With hyperthreading, things get a little different, but for the most part that still remains true, which is why my HT machine at work almost never goes above 50%, and even when it's at 100%, it's not processing twice as much at once.
You'll probably just have to do some testing on the common architectures you plan to deploy on to determine how many threads you want to run per CPU. Just using 1 thread may be too slow if you're waiting for a lot of I/O. Running a lot of threads will slow things down, as the processor will have to switch threads more often, which can be quite costly. I'm not sure if there is any hard-coded limit to how many threads you can run, but I guarantee that your app would probably come to a crawl from too much thread switching before you reached any such limit. Ultimately, you should just leave it as an option in the configuration file, so that you can easily tune your app to whatever processor you're running it on.
A CPU does not normally pose a limit on the number of threads, and I don't think Java itself has a limit on the number of native (kernel) threads it will spawn.
There is a method availableProcessors() in the Runtime class. Is that what you're looking for?
Basics:
An application loaded into memory is a process. A process has at least 1 thread. If you want, you can create as many threads as you want in a process (theoretically). So the number of threads depends on you and the algorithms you use.
If you use thread pools, the thread pool manages the number of threads, because creating a thread consumes resources. Thread pools recycle threads. This means many logical threads can run inside one physical thread, one after another.
You don't have to consider the number of threads; it's managed by the thread pool algorithms. Thread pools choose different algorithms for servers and desktop machines (OSes).
Edit1:
You can use explicit threads if you think the thread pool doesn't use the resources you have. In that case you can manage the number of threads explicitly.
This is a function of the VM, not the CPU. It has to do with the amount of heap consumed per thread. When you run out of space on the heap, you're done. As with the other posters, I suspect your app would become unusable from the sheer thread count before you actually hit that heap limit.
See this discussion.