I have a program which runs (all day) tasks in parallel (no I/O in the task to be executed) so I have used Executors.newFixedThreadPool(poolSize) to implement it.
Initially I set the poolSize to Runtime.getRuntime().availableProcessors(), but I was a bit worried to use all the available cores since there are other processes running on the same PC (32 cores).
In particular I have ten other JVM running the same program (on different input data), so I'm a bit worried that there might be a lot of overhead in terms of threads switching amongst the available cores, which could slow down the overall calculations.
How shall I decide the size of the pool for each program / JVM?
Also, in my PC, there are other processes running all the time (Antivirus, Backup, etc.). Shall I take into account these as well?
Any advice is going to be dependent upon your particular circumstances. 10 JVMs on 32 cores would suggest 3 threads each (ignoring garbage collection threads, timer tasks etc...)
You also have other tasks running. The scheduler will ensure they're running, but do they have to be responsive ? More responsive than the JVM ? If you're running Linux/Unix then you can also make use of prioritisation (via nice) to ensure particular processes don't hog the CPU.
Finally you're running 10 JVMs. Will that cause paging ? If so, that will be slow and you may be better off running fewer JVMs in order to avoid consuming so much memory.
Just make sure that your key variables are exposed and configurable, and measure various scenarios in order to find the optimal one.
How shall I decide the size of the pool for each program / JVM?
You want the number of threads which will get you close to 99% utilisation and no more.
The simplest way to balance the work is to have the process running once, processing multiple files at concurrently and using just one thread pool. You can set up you process as a service if you need to start files via the command line.
If this is impossible for some reason, you will need to guesstimate how much the thread pools should be shrunk by. Try running one process and look at the utilisation. If one is say 40% then I suspect ten processes is over utilised by 400%. i.e then you might reduce the pool size by a factor of 4.
Unfortunately, this is a hard thing to know, as programs don't typically know what else is or might be going on on the same box.
the "easy" way out is to make the pool size configurable. this allows the user who controls the program/box to decide how many threads to allocate to your program (presumably using their knowledge of the general workload of the box).
a more complex solution would be to attempt to programmatically determine the current workload of the box and choose the pool size appropriately from that. the efficacy of this solution depends on how accurately you can determine the workload and potentially adapt as it changes over time.
Try grepping the processes, check top/task manager and performance monitors to verify if this implementation is actually affecting your machine.
This article seems to contain interesting info about what you are trying to implement:
http://www.ibm.com/developerworks/library/j-jtp0730/index.html
Related
Scenario : I have a sample application and I have 3 different system configuration -
- 2 core processor, 2 GB RAM, 60 GB HHD,
- 4 core processor, 4 GB RAM, 80 GB HHD,
- 8 core processor, 8 GB RAM, 120 GB HHD
In order to effectively exploit the H/W capabilities for my application, I wish to configure the no. of threads at the application level. However, I wish to do this only after a thorough understanding of system capabilities.
Could there be some way(system/modus/tool) to determine the system prowess with reference to the max and min no. of threads it could service optimally & without any loss in efficiency and performance. By this, I could configure only those values for my application that will do full justice and achieve best performance for the respective hardware configuration.
Edited1 :
Could any one please advise any read-up on how to set a baseline for a particular h/w config.
Edited2 :
To make it more direct - Wish to learn/know about any resource/write-up that I can read to gain some understanding on CPU management of Threads at a general/holistic level.
The optimal number of threads to use depends on several factors, but mostly the number of available processors and how cpu-intensive your tasks are. Java Concurrency in Practice proposes the following formal formula to estimate the optimal number of threads:
N_threads = N_cpu * U_cpu * (1 + W / C)
Where:
N_threads is the optimal number of threads
N_cpu is the number of prcessors, which you can obtain from Runtime.getRuntime().availableProcessors();
U_cpu is the target CPU utilization (1 if you want to use the full available resources)
W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)
So for example, in a CPU-bound scenario, you would have as many threads as CPU (some advocate to use that number + 1 but I have never seen that it made a significant difference).
For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.
Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).
This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.
Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.
My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:
Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speed (ghz) grows. Some recent Intel and AMD chips can range from 2.6ghz (all cores active) to 3.6ghz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6ghz - 2.0ghz throughput in the former design. There is currently no way to query this info at runtime.
If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all cpu resources may not please the user or server admin (depending on if the software is a user app or server app).
There is no robust way to know what's going on within the rest of the machine at run-time, without replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated and usefulness is limited to specific types of applications (of which yours may qualify), and usually benefit from or require elevated or privileged access levels.
Modern virus scanners now-days work by setting a special priority flag provided by modern operating systems, eg. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful to a cpu intensive task such as yours.
Distributed home computing apps (BOINC, Folding#Home, etc) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row then the app will suspend computation. Once the load goes low for some number of queries, it resumes. Multiple queries are required because the CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. if BOINC is run without Admin privileges then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
Regarding SMT (HyperThreading, Compute Modules):
Most SMTs will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half-a-thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation its doing, and the target architecture. Intel and AMD's SMT implementations behave almost opposite of each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel. AMD's is strong at running SIMD and memory ops in parallel.
Regarding Turbo Features:
Most CPUs these days have very effective built-in Turbo support that further lessens the value-gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7ghz on two threads. It dropped to 3.5ghz when a third thread is started, and to 3.4ghz when a fourth was started. Since it's an integrated GPU as well, it dropped all the way to approx 3.0ghz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios); but could still muster 3.6ghz with 2 threads and GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful, they served as GPU servicing threads -- able to wake up and respond quickly to push new data to the GPU, as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/Turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits local your purposes well enough, allow it to be overridden by config/switch, and move on.
You can get the number of processors available to the JVM like this:
Runtime.getRuntime().availableProcessors()
Calculating the optimal number of threads from the number of available processors is unfortunately not trivial however. This depends a lot on the characteristics of the application, for instance with a CPU-bound application having more threads than the number of processors make little sense, while if the application is mostly IO-bound you might want to use more threads. You also need to take into account if other resource intensive processes are running on the system.
I think the best strategy would be to decide the optimal number of threads empirically for each of the hardware configuration, and then use these numbers in your application.
I agree with the other answers here that recommend a best-guess approach, and providing configuration for overriding the defaults.
In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.
You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.
A common approach is to avoid CPU 0 (always used by the OS), and to set your application's cpu affinity to a group of CPUs that are in the same socket.
Keeping the app's threads away from cpu 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.
Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among cpus.
As with everything else, this is highly dependent on the architecture of the machine that you are running on, as well as what other applications are runnning.
Use VisualVm tool to monitor threads.First Create minimum threads in program and see its performance.Then increase the no of threads within the program ans again analyze its performance.May this help you.
I use this Python script here to determine the number of cores (and memory, etc.) to launch my Java application with optimum parameters and ergonomics. PlatformWise on Github
It works like this: Write a python script which calls the getNumberOfCPUCores() in the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that inform to your program via command line arguments. Your program can then use the appropriate number of threads based on the number of cores.
Creating a thread on application level is good and in a multicore processor separate threads are executed on cores to enhance performance.So to utilize the core processing power it is best practice to implement threading.
What i think:
At a time only 1 thread of a program will execute on 1 core.
Same application with 2 thread will execute on half time on 2 core.
Same application with 4 Threads will execute more faster on 4 core.
So the application you developing should have the threading level<= no of cores.
Thread execution time is managed by the operating system and is a highly unpredictable activity. CPU execution time is known as a time slice or a quantum. If we create more and more threads the operating system spends a fraction of this time slice in deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words each thread will do lesser work if there were a large number of threads queued up.
Read this to get how to actually utilize cpu core's.Fantastic content.
csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/
Calculating the optimal number of threads from the number of available processors is unfortunately not trivial however. This depends a lot on the characteristics of the application, for instance with a CPU-bound application having more threads than the number of processors make little sense, while if the application is mostly IO-bound you might want to use more threads. You also need to take into account if other resource intensive processes are running on the system.
I have a general question:
my program will just go on processing something which does not require user input or system resources (like printer etc..) meaning, my program will not wait for any resources except CPU time.
The same program (let us say job) may be initiated by multiple users.
in this case, is it worth full to run this in a thread (meaning each user will get a feeling that his job is executed without delay.
or is it better to run the jobs sequentially?
The issue with running as separate threads is that, too many threads running simultaneously forcing the CPU utilization go over 100%.
Please suggest. Assume that user donot see his job progress. User is not worried when his job is finished. But at the same time, I want to have the CPU busy running the jobs.
If you don't care how long a process takes, or the length of time it takes is acceptable, then using one thread is likely to be the simplest solution. For example, many GUI applications only use one event handling thread.
If you want to keep all your CPUs busy you can start a number of busy loops to max out all the CPUs.
What you usually want is to reduce latency, or improve threadput by using more CPUs. Unless this is a goal, using more CPUs won't help you.
If the thread is genuinely purely CPU-bound, then it doesn't make sense to create more threads than there are cores (or virtual cores) available to process them. So on a quad-core machine, create no more than four threads (and probably only three, as your process isn't the only thing going on on the machine). On a quad-core machine with hyper-threading (two virtual threads per core), you might create six or seven. Creating too many additional threads (say, hundreds) causes unnecessary context-switching, which can be expensive if you really overdo it.
The converse is that on a multi-core machine, a single thread can only run on one core. So on a quad-core machine, running the jobs sequentially on a single thread will only utilize 25% of the CPU capacity.
So: Run the jobs in parallel up to the number of available cores, and sequentially (on each core) beyond that.
Big caveat: Your mileage may vary. There are lots of inputs to this equation, including what else is going on on the machine, and particularly whether the jobs really are CPU-bound (as opposed to system-bound, e.g., CPU and I/O subsystem and such).
I guess your program needs memory access. Memory access may be slow, and you really want to run the processor at that time. A common solution to limit the number of threads running at the same time is to use a thread pool.
in this case, is it worth full to run this in a thread (meaning each user will get a feeling that his job is executed without delay. or is it better to run the jobs sequentially?
It depends highly on the job. If it is interactive then running it immediately would give a better interface to the user. If the speed of the response is not an issue then maybe you don't want to incur the complexity costs of writing a multi-threaded program.
The issue with running as separate threads is that, too many threads running simultaneously forcing the CPU utilization go over 100%.
I wouldn't worry about this. One of the reasons why we use multiple threads is that we can make use of multiple processors to actually get the job done faster. In this case, depending on the OS, you can actually see more than 100% load for the process if you are using more than a full CPU -- this is expected. Also, if the CPU goes over 100%, the operating system will handle it fine unless you are worried that your application will be taking cycles away from a more important application.
One of our servers is experiencing a very high CPU load with our application. We've looked at various stats and are having issues finding the source of the problem.
One of the current theories is that there are too many threads involved and that we should try to reduce the number of concurrently executing threads. There's just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE - Glassfish). At any given moment, there are about 620 separate network IO operations that need to be conducted in parallel (use of java.NIO is not an option either). Moreover, there are roughly 100 operations that have no IO involved and are also executed in parallel.
This structure is not efficient and we want to see if it is actually causing damage, or is simply bad practice. Reason being that any change is quite expensive in this system (in terms of man hours) so we need some proof of an issue.
So now we're wondering if context switching of threads is the cause, given there are far more threads than the required concurrent operations. Looking at the logs, we see that on average there are 14 different threads executed in a given second. If we take into account the existence of two CPUs (see below), then it is 7 threads per CPU. This doesn't sound like too much, but we wanted to verify this.
So - can we rule out context switching or too-many-threads as the problem?
General Details:
Java 1.5 (yes, it's old), running on CentOS 5, 64-bit, Linux kernel 2.6.18-128.el5
There is only one single Java process on the machine, nothing else.
Two CPUs, under VMware.
8GB RAM
We don't have the option of running a profiler on the machine.
We don't have the option of upgrading the Java, nor the OS.
UPDATE
As advised below, we've conducted captures of load average (using uptime) and CPU (using vmstat 1 120) on our test server with various loads. We've waited 15 minutes between each load change and its measurements to ensure that the system stabilized around the new load and that the load average numbers are updated:
50% of the production server's workload: http://pastebin.com/GE2kGLkk
34% of the production server's workload: http://pastebin.com/V2PWq8CG
25% of the production server's workload: http://pastebin.com/0pxxK0Fu
CPU usage appears to be reduced as the load reduces, but not on a very drastic level (change from 50% to 25% is not really a 50% reduction in CPU usage). Load average seems uncorrelated with the amount of workload.
There's also a question: given our test server is also a VM, could its CPU measurements be impacted by other VMs running on the same host (making the above measurements useless)?
UPDATE 2
Attaching the snapshot of the threads in three parts (pastebin limitations)
Part 1: http://pastebin.com/DvNzkB5z
Part 2: http://pastebin.com/72sC00rc
Part 3: http://pastebin.com/YTG9hgF5
Seems to me the problem is 100 CPU bound threads more than anything else. 3000 thread pool is basically a red herring, as idle threads don't consume much of anything. The I/O threads are likely sleeping "most" of the time, since I/O is measured on a geologic time scale in terms of computer operations.
You don't mention what the 100 CPU threads are doing, or how long they last, but if you want to slow down a computer, dedicating 100 threads of "run until time slice says stop" will most certainly do it. Because you have 100 "always ready to run", the machine will context switch as fast as the scheduler allows. There will be pretty much zero idle time. Context switching will have impact because you're doing it so often. Since the CPU threads are (likely) consuming most of the CPU time, your I/O "bound" threads are going to be waiting in the run queue longer than they're waiting for I/O. So, even more processes are waiting (the I/O processes just bail out more often as they hit an I/O barrier quickly which idles the process out for the next one).
No doubt there are tweaks here and there to improve efficiency, but 100 CPU threads are 100 CPU threads. Not much you can do there.
I think your constraints are unreasonable. Basically what you are saying is:
1.I can't change anything
2.I can't measure anything
Can you please speculate as to what my problem might be?
The real answer to this is that you need to hook a proper profiler to the application and you need to correlate what you see with CPU usage, Disk/Network I/O, and memory.
Remember the 80/20 rule of performance tuning. 80% will come from tuning your application. You might just have too much load for one VM instance and it could be time to consider solutions for scaling horizontally or vertically by giving more resources to the machine. It could be any one of the 3 billion JVM settings are not inline with your application's execution specifics.
I assume the 3000 thread pool came from the famous more threads = more concurrency = more performance theory. The real answer is a tuning change isn't worth anything unless you measure throughput and response time before/after the change and compared the results.
If you can't profile, I'd recommend taking a thread dump or two and seeing what your threads are doing. Your app doesn't have to stop to do it:
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/threads.html
http://java.net/projects/tda/
http://java.sys-con.com/node/1611555
So - can we rule out context switching or too-many-threads as the problem?
I think you concerns over thrashing are warranted. A thread pool with 3000 threads (700+ concurrent operations) on a 2 CPU VMware instance certainly seems like a problem that may be causing context switching overload and performance problems. Limiting the number of threads could give you a performance boost although determining the right number is going to be difficult and probably will use a lot of trial and error.
we need some proof of an issue.
I'm not sure the best way to answer but here are some ideas:
Watch the load average of the VM OS and the JVM. If you are seeing high load values (20+) then this is an indicator that there are too many things in the run queues.
Is there no way to simulate the load in a test environment so you can play with the thread pool numbers? If you run simulated load in a test environment with pool size of X and then run with X/2, you should be able to determine optimal values.
Can you compare high load times of day with lower load times of day? Can you graph number of responses to latency during these times to see if you can see a tipping point in terms of thrashing?
If you can simulate load then make sure you aren't just testing under the "drink from the fire hose" methodology. You need simulated load that you can dial up and down. Start at 10% and slowing increase simulated load while watching throughput and latency. You should be able to see the tipping points by watching for throughput flattening or otherwise deflecting.
Usually, context switching in threads is very cheap computationally, but when it involves this many threads... you just can't know. You say upgrading to Java 1.6 EE is out of the question, but what about some hardware upgrades ? It would probably provide a quick fix and shouldn't be that expensive...
e.g. run a profiler on a similar machine.
try a newer version of Java 6 or 7. (It may not make a difference, in which case don't bother upgrading production)
try Centos 6.x
try not using VMware.
try reducing the number of threads. You only have 8 cores.
You many find all or none of the above options make a difference, but you won't know until you have a system you can test on with a known/repeatable work load.
I have a program that starts up and creates an in-memory data model and then creates a (command-line-specified) number of threads to run several string checking algorithms against an input set and that data model. The work is divided amongst the threads along the input set of strings, and then each thread iterates the same in-memory data model instance (which is never updated again, so there are no synchronization issues).
I'm running this on a Windows 2003 64-bit server with 2 quadcore processors, and from looking at Windows task Manager they aren't being maxed-out, (nor are they looking like they are being particularly taxed) when I run with 10 threads. Is this normal behaviour?
It appears that 7 threads all complete a similar amount of work in a similar amount of time, so would you recommend running with 7 threads instead?
Should I run it with more threads?...Although I assume this could be detrimental as the JVM will do more context switching between the threads.
Alternatively, should I run it with fewer threads?
Alternatively, what would be the best tool I could use to measure this?...Would a profiling tool help me out here - indeed, is one of the several profilers better at detecting bottlenecks (assuming I have one here) than the rest?
Note, the server is also running SQL Server 2005 (this may or may not be relevant), but nothing much is happening on that database when I am running my program.
Note also, the threads are only doing string matching, they aren't doing any I/O or database work or anything else they may need to wait on.
My guess would be that your app is bottlenecked on memory access, i.e. your CPU cores spend most of the time waiting for data to be read from main memory. I'm not sure how well profilers can diagnose this kind of problem (the profiler itself could influence the behaviour considerably). You could verify the guess by having your code repeat the operations it does many times on a very small data set.
If this guess is correct, the only thing you can do (other than getting a server with more memory bandwidth) is to try and increase the locality of your memory access to make better use of caches; but depending on the details of the application that may not be possible. Using more threads may in fact lead to worse performance because of cores sharing cache memory.
Without seeing the actual code, it's hard to give proper advice. But do make sure that the threads aren't locking on shared resources, since that would naturally prevent them all from working as efficiently as possible. Also, when you say they aren't doing any io, are they not reading an input or writing an output either? this could also be a bottleneck.
With regards to cpu intensive threads, it is normally not beneficial to run more threads than you have actual cores, but in an uncontrolled environment like this with other big apps running at the same time, you are probably better off simply testing your way to the optimal number of threads.
I am working on a bittorrent client. While communicating with the peers the easiest way for me to communicate with them is to spawn a new thread for each one of them. But if the user wants to keep connections with large number of peers that my cause me to spawn a lot of threads.
Another solution i thought of is have one thread to iterate through peer objects and run them for e period.
I checked other libraries mostly in ruby( mine is in java ) and they spawn one thread for each new peer. Do you think spawning one thread will degrade performence if user sets the number of connections to a high number like 100 or 200?
It shouldn't be a problem unless you're running thousands of threads. I'd look into a compromise, using a threadpool. You can detect the number of CPUs at runtime and decide how many threads to spin up based on that, and then hand out work to the threadpool as it comes along.
You can avoid the problem altogether by using Non-blocking IO (java.nio.*).
I'd recommend using an Executor to keep the number of threads pooled.
Executors.newFixedThreadPool(numberOfThreads);
With this, you can basically add "tasks" to the pool and they will complete as soon as threads become available. This way, you're not exhausting all of the enduser's computer's threads and still getting a lot done at the same time. If you set it to like 16, you'd be pretty safe, though you could always allow the user to change this number if they wanted to.
No.....
Once I had this very same doubt and created a .net app (4 years ago) with 400 threads....
Provided they don't do a lot of work, with a decent machine you should be fine...
A few hundred threads is not a problem for most workstation-class machines, and is simpler to code.
However, if you are interested in pursuing your idea, you can use the non-blocking IO features provided by Java's NIO packages. Jean-Francois Arcand's blog contains a lot of good tips learned from creating the Grizzly connector for Glassfish.
Well in 32bit Windows for example there is actually a maximum number of native Threads you can create (2 Gigs / (number of Threads * ThreadStackSize (default is 2MB)) or something like that). So with too many connections you simply might run out of Virtual Memory address space.
I think a compromise might work: Use a Thread Pool with e.g. 10 Threads (depending on the machine) running and Distribute the connections evenly. Inside the Thread loop through the peers assigned to this Thread. And limit the maximum number of connections.
Use a thread pool and you should be safe with a fairly large pool size (100 or so). CPU will not be a problem since you are IO bound with this type of application.
You can easily make the pools size configurable and put in a reasonable maximum, just to prevent memory related issues with all the threads. Of course that should only occur if all the threads are actually being used.