I have a Java application that has a fixed thread pool of fifteen, the machine, Solaris 10 SPARC, has sixteen CPUs. Adding the pool has greatly increased performance, but I'm wondering if I have too many threads in the pool. Would performance be better with less threads or does Solaris do a good job of thread scheduling.
Say the pool is heavily using fifteen CPUs, then other application threads demand CPU for various reason, concurrent garbage collection is a good example. Now, five CPUs are shared between the pool and other application threads. Then CPUs one through seven become free, will Solaris move the threads sharing time on the busy CPUs to the free CPUs?
If not, would it be better to keep the pool size smaller, so that there are always free CPUs for other application threads? Compounding the issue, CPU usage is very sporadic in the application.
If you are doing only cpu intensive tasks (no IO) N+1 threads (where N is the number of cores) will give you the optimum processor utilization.
+1 because you can have a page fault, a therad can be paused to any reason or a small wait time during synchronization.
For Threads doing IO this is not really easy, you have to test the best size.
The book Java concurrency in practice suggests this algorithm as starting point:
N = number of CPUs
U = target CPU utilization (0 <= U <= 1)
W/C = ration of wait time to cpu time (measured through profiling)
threads = N * U * (1 + W/C)
IBM uses the same algorithm in their article Java theory and practice: Thread pools and work queues, with a fixed U=1.
The N+1 fact can be read too in the IBM article, to provide origins for both theses.
It's generally fine to have a few more threads than CPUs, this can actually help overall throughput.
The reason for this is that several threads may be blocked on IO or sleeping at any given time. So having a few more threads ready to execute never hurts.
Generally, if you can establish a 1 to 1 mapping between threads and CPUs, you will get optimal performance. That is assuming that user threads are mapped 1 to 1 with kernel threads. If I recall correctly, Solaris allows multiple kernel threads as well as user threads, so you should be ok in this respect. You will run into bottlenecks when more than one thread is using the same CPU.
As always, the best advice to your question about reducing thread pool size is: "Try and benchmark it." (but I suspect more will be better in this case.)
As for your question about other applications being run, Solaris will use a CPU to alternate between your thread pool and the other application. If a CPU becomes free though, Solaris will move some threads to the free CPU.
EDIT: Here is a link from Sun about how the Solaris and Java thread models interact.
Related
My application is a "thread-per-request" web server with a thread pool of M threads. All processing of a single request runs in the same thread.
Suppose I am running the application in a computer with N cores. I would like to configure M to limit the CPU usage: e.g. up to 50% of all CPUs.
If the processing were entirely CPU-bound then I would set M to N/2. However the processing does some IO.
I can run the application with different M and use top -H, ps -L, jstat, etc. to monitor it.
How would you suggest me estimate M ?
To have a CPU usage of 50% does not necessarily mean that the number of threads needs to be N_Cores / 2. When dealing with I/O the CPU wastes many cycles in waiting for the data to arrive from devices.
So you need a tool to measure real CPU usage and through experiments, you could increase the number of threads until the real CPU usage goes to 50%.
perf for Linux is such a tool. This question addresses the problem. Also be sure to collect statistics system wide: perf record -a.
You are interested in your CPU issuing and executing as many instructions / cycle as possible (IPC). Modern servers can execute up to 4 IPC for intense compute bound workloads. You want to go as close to that as possible to get good CPU utilization, and that means that you need to increase the thread count. Of course if there's many I/O that won't be possible due to many context switches that bring some penalties (cache flushing, kernel code, etc.)
So the final answer would be just increase the thread count until real CPU usage goes to 50%.
It depends enterely on your particular software and hardware.
Hardware is important since if a thread blocks writing to a slow disk it will take long to wake up again (and consume cpu again) but if the disk is really fast the blocking will only switch cpu context but the thread will be running again immediately.
I think you can only try and try with different parameters or the app may monitor it's CPU use itself and adjust the pool dynamically.
I'm new to java multi-threaded programming. The question that has came to my mind is that how many threads can I run according to the number of my CPU cores. and if I run threads more than CPU cores will it be an overhead for the machine to run the app. for example when we have a server machine which has a server software that run 2 threads(main thread + developer thread), will it be an overhead for the server when more simultaneous clients make socket connections to the server or not?
Thanks.
The number of threads a system can execute simultaneously is (of course) identical to the number of cores in the system.
The number of threads that can exist on the system is limited by the available memory (each thread requires a stack and a structure used by the OS to manage the thread), and possibly there is a limitation how many threads the OS allows (this depends on the OS architecture, some OS' may use a fixed size table and once its full no more threads can be created).
Commonly, todays computers can handle hundreds to thousands of threads.
The reason why more threads are used than cores exist in the system is: Most threads will inevitably spend much of their time waiting for some event (example: word processor waiting for user to type on keyboard). The OS manages it that threads that wait in such a manner do not consume CPU time.
Idea behind it is don't let your CPU sleep, neither load it too much that it waste most of time in thread switching.
Its helpful to check Tuning the pool size, In IBMs paper
Idea behind is, it depends on the nature of task, if its all in-memory computation tasks you can use N+1 threads (N numbers of cores (included hyper threading)).
Or
we need to do the application profiling and find out waiting time (WT) , service time (ST) for a typical request and approximately N*(1+WT/ST) number of optimal threads we can have, considering 100% utilization of CPU.
That depends on what the threads are doing. The CPU is only able to do X things at once, where X is the number of cores it has. That means X threads at most can be active at any one time - however the other threads can wait their turn and the CPU will process them at appropriate moments.
You should also consider that a lot of the time threads are waiting for a response, or waiting for data to load, or a network message to arrive, etc so are not actually trying to do anything. These idle/waiting threads have very little load on the system.
Don't worry about getting a higher number of threads than CPU cores; that is actually not in your hands, but in OS'.
Assuming the JVM maps your java threads over OS threads (which is fairly normal these days), it depends on the thread management your OS does. There you rely on how smart the kernel implementation is to get performance out of your cores.
What you must keep in mind is that your design must be sustainable. For example, application servers are built on a threadpool full of worker threads. Those threads are awaken in order to serve requests. Do you want a thread for each request? Then you will surely have a problem - requests can arrive in the thousands to the server, and that could be a problem for the kernel to manage. Actually the threadpool size should be limited (between 1 and X and easily changed even in real time), threads should get work from a concurrent queue (java gives you some excellent classes for that) and each one attend requests sequentially.
I hope that being of help
Having less threads than CPUs can mean you are not using all the CPUs in your system. Having more threads might improve throughput if CPU is your bottleneck.
Having more threads than CPU does introduce an overhead and if CPU is your bottleneck this can hurt performance. However, if network IO, is your bottleneck, this overhead is a price worth paying as it usually allows you to handle many more connections. e.g. You can have 1000 TCP connections with their own threads.
There doesn't have to be any relation. A computer can have any number of cores; a process can have any number of threads.
There are several different reasons that processes utilize threading, including:
Programming abstraction. Dividing up work and assigning each division to a unit of execution (a thread) is a natural approach to many problems. Programming patterns that utilize this approach include the reactor, thread-per-connection, and thread pool patterns. Some, however, view threads as an anti-pattern. The inimitable Alan Cox summed this up well with the quote, "threads are for people who can't program state machines."
Blocking I/O. Without threads, blocking I/O halts the whole process. This can be detrimental to both throughput and latency. In a multithreaded process, individual threads may block, waiting on I/O, while other threads make forward progress. Blocking I/O via threads is thus an alternative to asynchronous & non-blocking I/O.
Memory savings. Threads provide an efficient way to share memory yet utilize multiple units of execution. In this manner they are an alternative to multiple processes.
Parallelism. In machines with multiple processors, threads provide an efficient way to achieve true parallelism. As each thread receives its own virtualized processor and is an independently schedulable entity, multiple threads may run on multiple processors at the same time, improving a system's throughput. To the extent that threads are used to achieve parallelism—that is, there are no more threads than processors—the "threads are for people who can't program state machines" quote does not apply.
The first three bullets utilize threads with no relationship to cores. If you are using threads as a programming abstraction to handle UI elements, for example, you'll have one thread per UI element (or whatever) regardless of whether you have 1 core or 12. Similarly, if you were using threads to perform blocking I/O, you'd scale your thread count with your I/O capacity, not your processing power.
The fourth bullet, however, does relate threads to cores. If the goal of threading is parallelism, then the number of threads should scale linearly with the number of cores. For example, if you double the number of cores in a system, then you would double the number of threads in your application. This is true for cores in the logical sense—that is, including SMT.
When threading is used to achieve parallelism—and this is both a common and the best use of threading—you will often have, say, one or two threads per core. Oftentimes, applications are written so as to dynamically size thread pools off the number of available cores. A single thread per core is ideal, but applications often use a larger multiplier, such as two threads per core, due to bugs and inefficiencies in their code, such as operations that block when none should.
Best performance will be when number of cores(NOC) equals number of thread (NOT), because if NOT > NOC then processor should switch context or OS will try to do that work, which is expensive enough opperation. But you have to understand that it impossible to have NOC = NOT on Web Servers because you can't predict how much clients will be at the same time. Take a look on load balancing concept to solve this issue in best way.
I've been looking for the way to increase the speed of processing messages received from rabbitmq queue. The only way I've found is make more than one threads doing the same - receiving and processing. And this gave me some profit. After I created 4 threads the speed quadrupled. As I have 8-core processor I've decided to increase the number of threads to 8. But this gives no performance increasing. YourKit shows that only 50% of CPU is used. Somebody can say that my app is lightweight and so it is so, but I can say that it can't do more work than it does regardless I produce much more what to do. Why this doesn't work?
There are many different issues that can constrain the maximum speed of some application on a given system. For example, it can be limited by memory bandwidth, by Amdahl's Law effects (time needed for non-parallel code, including synchronized blocks), I/O bandwidth, and cache space.
If you want further improvement you need to do some measurements and profiling to find where the time is going, and then work on that.
The short (and not particularly helpful) answer is "overheads and bottlenecks".
For instance:
Creating threads in Java is relatively expensive. If the amount of work done by a thread isn't large, the overhead of creating the thread can out-weigh the benefits.
Context switching between threads is relatively expensive, especially when you take account of memory-related overheads such as cache misses, TLB misses. (These overheads actually hit when a native thread is assigned to a core. If the OS can somehow keep a native thread on a single core continuously (i.e. with no other threads on the same core), then it can use spinlocking ... and avoid the context switch. But the more Java threads you have, the less likely it is that the OS can do this.)
The threads may be spending a large proportion of their time waiting for I/O to complete. The I/O system's throughput or the speed / latency of some external service can be a bottleneck.
You may have contention over data structures; e.g. threads requiring exclusive access to safely read or update (say) a shared Map. If threads regularly need to wait for others to release locks, then you have a bottleneck.
Your computation may be dominated by the costs of "feeding" the threads. For example, if there is a single master thread that hands out "work" to worker threads, then the master thread's activities could be the bottleneck; i.e. it may not be able to provide enough work to keep the workers busy.
Since your tags imply that you are using a message queue, it is possible that that is the bottleneck, especially if the messages are big or the "work" done on each one is relatively small.
(Using a separate separate message queue service is liable to increase context switches, add I/O latency, add protocol overheads and so on. It's not an automatic route to performance improvement for small-scale systems.)
It is also possible that you have "hyperthreaded" cores not real cores, or that the operating system is stopping your JVM from using all cores.
If CPU or waiting for IO is your bottle neck, adding independent threads can make a big difference.
If you have a shared resource is a bottleneck, e.g. your L3 cache, your network adapter, your kernel, adding threads won't help because CPU is not the problem. In fact it can often make it worse by adding overhead.
my app is lightweight
In which case CPU is unlikely to be your issue and you are doing well to see a speed up with more than 1 CPU. Most likely you are speeding up CPU used by RabbitMQ. Ideally it should be more efficient and this shouldn't really help much. IMHO, more efficient messaging solutions don't gain much by multiple CPUs as they will not be bottlenecked on CPU.
One way or another, you're only using 4 cores. There's a lot that can stop stop you from doubling your performance by doubling your threads, but from your 4 thread success you've gotten past all that. I'm guessing there's a bug in your code to set off 8 threads and it's only firing up 4. (Even with hyperthreading, you're going to get some improvement. Even with every possible problem, you're going to get some improvement.) Otherwise, I'll go with T.J.Crowder and Stephen C: I don't think you really have 8 cores.
I'd try using different numbers of threads: 3, 5, 6. See what changes. I think you'll stumble on the problem soon enough.
To be fair to Java: if you write thread-safe code and avoid bottlenecks, it handles threads really well, as you've noticed going from one thread to 4. I've always found the overhead costs to be trivial.
Your application does not have linear speedup, and therefore, it does not have good scalability.
In order to keep increase the number of threads you need to ensure the data being handle is growing accordingly. For a fixed amount of data, increasing number of threads (and/or cores) will have a diminishing return at some point since the overhead of creating threads will outweight the thread's compute time.
Make sure to look up the following link:
Ahmdahl's law
Gustafson's law
Gustafson's law is a great counterpoint to Ahmdal's law so I highly recommend understanding that article.
I wrote a parallel java program. It works typically:
It takes a String input as input;
Then input is cut into String inputs[numThreads] evenly;
Each inputs[i] is assigned to thread_i to process, and generates results[i];
After all the working threads finish, the main thread merge the results[i] into result.
The performance data on a 10-core (physical cores) machine is as below.
Threads# 1 thread 2 threads 4 threads 8 threads 10 threads
Time(ms) 78 41 28 21 21
Note:
the JVM warm-up time have been eliminated (first 50 runs).
the time doesn't include threads starting/joining time.
It seems the memory bandwidth becomes the bottleneck when there are more than 8 threads.
In this case, how to further improve the performance? Is there any design issues in my parallel Java program?
To examine the cause of this scalability issue, I inserted a (meaningless computation) loop into the process(inputs[i]) method. Here is the new data:
Threads# 1 thread 10 threads
Time(ms) 41000 4330
The new data shows good scalability for 10 threads, which in return confirms the original (without meaningless loop) has memory issue, so that its scalability is limited to 8 threads.
But anyway to circumvent this issue, like pre-loading the data into each core's local cache, or loading in batch?
I find it unlikely that you have a memory bandwidth issue here. It is more likely that your run times are so short that as you approach 0 you are just mostly timing the thread startup/shutdown or the hotswap compiler optimization cycles. Gaining relevant timing information from a Java task that runs so short is close to worthless. The hotswap compiler and other optimizations that run initially often dominate the CPU usage early on in a class' life. Our production applications stabilize only after minutes of live service operation.
If you can significantly increase your run times by adding more input data or by calculating the same result over and over you may get a better idea about what the optimal thread numbers are.
Edit:
Now that you have added timings for 1 and 10 threads over a longer period, it looks to me that you are not bound by anything since the timing seems to be fairly linear -- with some thread overhead. 41000/10 = 4100 versus 4330 for 10 threads.
Pretty good demonstration of what threading can do to a CPU bound application. :-)
How many logical cores do you have?
Consider - imagine you had one core and a hundred threads. The work to be done is the same, it cannot be distributed over multiple cores, but now you have a great deal of thread switching overhead.
Now imagine you have say four cores and four threads. Assuming no other bottlenecks, compute time is quartered.
Now imagine you have four cores and eight threads. You compute time will be approximately quartered, but you'll have added some thread swapping overhead.
Be aware of hyperthreading and that it may help or hinder you, depending on the nature of the compute task.
I'd say your losses are down to switching threads. You have more threads than cores, and none need to block for slower processes, so they are getting switched in, doing a bit of work and then gettimg switched out to switch another one in. Switching threads is an expensive process, given the nature of what you appear to be doing I would have instinctively restricted the number of threads to 8 (leave two cores for the os) , and your performance numbers appear to bear me out.
In Java, is there a programmatic way to find out how many concurrent threads are supported by a CPU?
Update
To clarify, I'm not trying to hammer the CPU with threads and I am aware of Runtime.getRuntime().availableProcessors() function, which provides me part of the information I'm looking for.
I want to find out if there's a way to automatically tune the size of thread pool so that:
if I'm running on a 1-year old server, I get 2 threads (1 thread per CPU x an arbitrary multiplier of 2)
if I switch to an Intel i7 quad core two years from now (which supports 2 threads per core), I get 16 threads (2 logical threads per CPU x 4 CPUs x the arbitrary multiplier of 2).
if, instead, I use a eight core Ultrasparc T2 server (which supports 8 threads per core), I get 128 threads (8 threads per CPU x 8 CPUs x the arbitrary multiplier of 2)
if I deploy the same software on a cluster of 30 different machines, potentially purchased at different years, I don't need to read the CPU specs and set configuration options for every single one of them.
Runtime.availableProcessors returns the number of logical processors (i.e. hardware threads) not physical cores. See CR 5048379.
A single non-hyperthreading CPU core can always run one thread. You can spawn lots of threads and the CPU will switch between them.
The best number depends on the task. If it is a task that will take lots of CPU power and not require any I/O (like calculating pi, prime numbers, etc.) then 1 thread per CPU will probably be best. If the task is more I/O bound. like processing information from disk, then you will probably get better performance by having more than one thread per CPU. In this case the disk access can take place while the CPU is processing information from a previous disk read.
I suggest you do some testing of how performance in your situation scales with number of threads per CPU core and decide based on that. Then, when your application runs, it can check availableProcessors() and decide how many threads it should spawn.
Hyperthreading will make the single core appear to the operating system and all applications, including availableProcessors(), as 2 CPUs, so if your application can use hyperthreading you will get the benefit. If not, then performance will suffer slightly but probably not enough to make the extra effort in catering for it worth while.
There is no standard way to get the number of supported threads per CPU core within Java. Your best bet is to get a Java CPUID utility that gives you the processor information, and then match it against a table you'll have to generate that gives you the threads per core that the processor manages without a "real" context switch.
Each processor, or processor core, can do exactly 1 thing at a time. With hyperthreading, things get a little different, but for the most part that still remains true, which is why my HT machine at work almost never goes above 50%, and even when it's at 100%, it's not processing twice as much at once.
You'll probably just have to do some testing on common architectures you plan to deploy on to determine how many threads you want to run on each CPU. Just using 1 thread may be too slow if you're waiting for a lot of I/O. Running a lot of threads will slow things down as the processor will have to switch threads more often, which can be quite costly. I'm not sure if there is any hard-coded limit to how many threads you can run, but I gaurantee that your app would probably come to a crawl from too much thread switching before you reached any kind of hard limit. Ultimately, you should just leave it as an option in the configuration file, so that you can easily tune your app to whatever processor you're running it on.
A CPU does not normally pose a limit on the number of threads, and I don't think Java itself has a limit on the number of native (kernel) threads it will spawn.
There is a method availableProcessors() in the Runtime class. Is that what you're looking for?
Basics:
Application loaded into memory is a process. A process has at least 1 thread. If you want, you can create as many threads as you want in a process (theoretically). So number of threads depends upon you and the algorithms you use.
If you use thread pools, that means thread pool manages the number of threads because creating a thread consumes resources. Thread pools recycle threads. This means many logical threads can run inside one physical thread one after one.
You don't have to consider the number of threads, it's managed by the thread pool algorithms. Thread pools choose different algorithms for servers and desktop machines (OSes).
Edit1:
You can use explicit threads if you think thread pool doesn't use the resources you have. You can manage the number of threads explicitly in that case.
This is a function of the VM, not the CPU. It has to do with the amount of heap consumed per thread. When you run out of space on the heap, you're done. As with other posters, I suspect your app becomes unusable before this point if you exceed the heap space because of thread count.
See this discussion.