High CPU, possibly due to context switching?


One of our servers is experiencing a very high CPU load with our application. We've looked at various stats and are having issues finding the source of the problem.
One of the current theories is that there are too many threads involved and that we should try to reduce the number of concurrently executing threads. There's just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE - Glassfish). At any given moment, there are about 620 separate network IO operations that need to be conducted in parallel (using java.nio is not an option either). Moreover, there are roughly 100 operations that have no IO involved and are also executed in parallel.
This structure is not efficient and we want to see if it is actually causing damage, or is simply bad practice. Reason being that any change is quite expensive in this system (in terms of man hours) so we need some proof of an issue.
So now we're wondering if context switching of threads is the cause, given there are far more threads than the required concurrent operations. Looking at the logs, we see that on average there are 14 different threads executed in a given second. If we take into account the existence of two CPUs (see below), then it is 7 threads per CPU. This doesn't sound like too much, but we wanted to verify this.
So - can we rule out context switching or too-many-threads as the problem?
General Details:
Java 1.5 (yes, it's old), running on CentOS 5, 64-bit, Linux kernel 2.6.18-128.el5
There is only one single Java process on the machine, nothing else.
Two CPUs, under VMware.
8GB RAM
We don't have the option of running a profiler on the machine.
We don't have the option of upgrading the Java, nor the OS.
UPDATE
As advised below, we've conducted captures of load average (using uptime) and CPU (using vmstat 1 120) on our test server with various loads. We've waited 15 minutes between each load change and its measurements to ensure that the system stabilized around the new load and that the load average numbers are updated:
50% of the production server's workload: http://pastebin.com/GE2kGLkk
34% of the production server's workload: http://pastebin.com/V2PWq8CG
25% of the production server's workload: http://pastebin.com/0pxxK0Fu
CPU usage appears to be reduced as the load reduces, but not on a very drastic level (change from 50% to 25% is not really a 50% reduction in CPU usage). Load average seems uncorrelated with the amount of workload.
There's also a question: given our test server is also a VM, could its CPU measurements be impacted by other VMs running on the same host (making the above measurements useless)?
UPDATE 2
Attaching the snapshot of the threads in three parts (pastebin limitations)
Part 1: http://pastebin.com/DvNzkB5z
Part 2: http://pastebin.com/72sC00rc
Part 3: http://pastebin.com/YTG9hgF5

Seems to me the problem is 100 CPU bound threads more than anything else. 3000 thread pool is basically a red herring, as idle threads don't consume much of anything. The I/O threads are likely sleeping "most" of the time, since I/O is measured on a geologic time scale in terms of computer operations.
You don't mention what the 100 CPU threads are doing, or how long they last, but if you want to slow down a computer, dedicating 100 threads of "run until time slice says stop" will most certainly do it. Because you have 100 "always ready to run", the machine will context switch as fast as the scheduler allows. There will be pretty much zero idle time. Context switching will have impact because you're doing it so often. Since the CPU threads are (likely) consuming most of the CPU time, your I/O "bound" threads are going to be waiting in the run queue longer than they're waiting for I/O. So, even more processes are waiting (the I/O processes just bail out more often as they hit an I/O barrier quickly which idles the process out for the next one).
No doubt there are tweaks here and there to improve efficiency, but 100 CPU threads are 100 CPU threads. Not much you can do there.

I think your constraints are unreasonable. Basically what you are saying is:
1. I can't change anything
2. I can't measure anything
Can you please speculate as to what my problem might be?
The real answer to this is that you need to hook a proper profiler to the application and you need to correlate what you see with CPU usage, Disk/Network I/O, and memory.
Remember the 80/20 rule of performance tuning: 80% of the gain will come from tuning your application. You might just have too much load for one VM instance, and it could be time to consider scaling horizontally, or vertically by giving the machine more resources. It could also be that any one of the 3 billion JVM settings is not in line with your application's execution specifics.
I assume the 3000 thread pool came from the famous "more threads = more concurrency = more performance" theory. The real answer is that a tuning change isn't worth anything unless you measure throughput and response time before and after the change and compare the results.

If you can't profile, I'd recommend taking a thread dump or two and seeing what your threads are doing. Your app doesn't have to stop to do it:
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/threads.html
http://java.net/projects/tda/
http://java.sys-con.com/node/1611555
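If attaching an external tool is not possible either, the JVM can report a coarse picture of its own threads. The following is a minimal sketch, assuming a small diagnostic class can be added to the application; the class name ThreadStateHistogram is hypothetical. It relies only on Thread.getAllStackTraces() and Thread.State, both available since Java 5. Keep in mind that a thread blocked in native socket I/O is typically reported as RUNNABLE, so a high RUNNABLE count is a hint to look closer, not proof of CPU contention.

```java
import java.util.EnumMap;
import java.util.Map;

class ThreadStateHistogram {

    /** Counts the live threads of this JVM, grouped by Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts =
                new EnumMap<Thread.State, Integer>(Thread.State.class);
        // getAllStackTraces() returns one entry per live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            Thread.State state = t.getState();
            Integer old = counts.get(state);
            counts.put(state, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one count per state that currently has at least one thread.
        System.out.println(countByState());
    }
}
```

Sampling this a few times under load and comparing the RUNNABLE count with the number of CPUs gives a rough idea of how many threads compete for the processors at once.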

So - can we rule out context switching or too-many-threads as the problem?
I think your concerns over thrashing are warranted. A thread pool with 3000 threads (700+ concurrent operations) on a 2 CPU VMware instance certainly seems like a problem that may be causing context switching overload and performance problems. Limiting the number of threads could give you a performance boost, although determining the right number is going to be difficult and will probably take a lot of trial and error.
we need some proof of an issue.
I'm not sure the best way to answer but here are some ideas:
Watch the load average of the VM OS and the JVM. If you are seeing high load values (20+) then this is an indicator that there are too many things in the run queues.
Is there no way to simulate the load in a test environment so you can play with the thread pool numbers? If you run simulated load in a test environment with pool size of X and then run with X/2, you should be able to determine optimal values.
Can you compare high load times of day with lower load times of day? Can you graph number of responses to latency during these times to see if you can see a tipping point in terms of thrashing?
If you can simulate load then make sure you aren't just testing under the "drink from the fire hose" methodology. You need simulated load that you can dial up and down. Start at 10% and slowly increase the simulated load while watching throughput and latency. You should be able to see the tipping points by watching for throughput flattening or otherwise deflecting.

Usually, context switching in threads is very cheap computationally, but when it involves this many threads... you just can't know. You say upgrading to Java 1.6 EE is out of the question, but what about some hardware upgrades ? It would probably provide a quick fix and shouldn't be that expensive...

e.g. run a profiler on a similar machine.
try a newer version of Java 6 or 7. (It may not make a difference, in which case don't bother upgrading production)
try Centos 6.x
try not using VMware.
try reducing the number of threads. You only have two CPUs.
You may find all or none of the above options make a difference, but you won't know until you have a system you can test on with a known/repeatable work load.

Related

Linux tuning for Java Concurrent Performance

I have a big question about tuning linux for java performance, so i start with my case.
I have an application running a number of threads that communicate with each other. My typical workflow is:
1) Some consumer thread sync on a common Object lock and calls wait() on it.
2) Some producer thread waits via Selector for data from network.
2.1) producer receives data and form an object with received timestamp (microseconds precision).
2.2) producer puts this packet in some exchange map and calls notifyAll on common lock.
3) Consumer thread wakes up and reads produced object.
3.1) consumer creates new object and writes in it time difference in microseconds between received timestamp and current timestamp. this way i can monitor reaction time.
And this reaction time is the whole issue.
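The workflow above can be sketched as follows. This is a minimal, simplified illustration and not the original application: the class and method names (HandoffLatency, publish, awaitAndMeasureMicros, measureOnceMicros) are hypothetical, the Selector and the exchange map are left out, and a short sleep stands in for waiting on the network. It only shows where the two timestamps are taken.

```java
import java.util.concurrent.TimeUnit;

class HandoffLatency {
    private final Object lock = new Object();
    private boolean ready = false;      // guarded by lock
    private long producedAtNanos;       // guarded by lock

    /** Producer side: record the receive timestamp and wake all waiters. */
    void publish() {
        synchronized (lock) {
            producedAtNanos = System.nanoTime();
            ready = true;
            lock.notifyAll();
        }
    }

    /** Consumer side: wait for data, then return the reaction time in microseconds. */
    long awaitAndMeasureMicros() throws InterruptedException {
        synchronized (lock) {
            while (!ready) {            // loop protects against spurious wakeups
                lock.wait();
            }
            ready = false;
            return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - producedAtNanos);
        }
    }

    /** Runs one producer/consumer handoff and returns the measured latency. */
    static long measureOnceMicros() {
        final HandoffLatency handoff = new HandoffLatency();
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(20);       // stand-in for waiting on the network
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handoff.publish();
        });
        producer.start();
        try {
            long micros = handoff.awaitAndMeasureMicros();
            producer.join();
            return micros;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reaction time: " + measureOnceMicros() + " us");
    }
}
```

Note that the measured interval includes the time the consumer needs to be scheduled and to reacquire the lock after notifyAll, which is exactly the part that varies between machines.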
When i test my application on my own machine i usually get about 200-400 microseconds reaction time, but when i monitor it on my production linux machine i get numbers from 2000 to 4000 microseconds!
Right now i'm running ubuntu 16.04 as my production OS and Oracle jdk 8-111. I have a physical server with 2 Xeon Processors. I run only usual OS daemons and my app on this server so there is plenty of resources compared to my dev notebook.
I run my java app as a jar file with flags:
sudo chrt -r 77 java -server -XX:+UseNUMA -d64 -Xmx1500m -XX:NewSize=1000m -XX:+UseG1GC -jar ...
I use sudo chrt to change priority since i thought it's the case, but it didn't help.
I tuned bios for maximum performance and turned off C-States.
What else i can tune for faster reaction times and low context switches?

No, there is no single echo 1 > /proc/sys/unlock_concurrent_magic option on Linux to globally improve concurrent performance. If such an option existed, it would simply be enabled by default.
In fact, despite Linux being tunable in general, you aren't going to find many tunables that have a big effect specifically on raw concurrency. The ones that might help are often only incidentally related to concurrency - i.e., something is enabled (let's say THP) which slows down your particular load, and since at least part of the slowness occurs under lock, the whole concurrent throughput is affected.
Java concurrency vs the OS
My experience, however, is that Java applications are very rarely affected directly by OS-level concurrency behavior. In fact, most of Java concurrency is implemented efficiently without using OS features, and will behave the same across OSes. The primary places where the JVM touches the OS as it relates to concurrency is for thread creation, destruction and waiting/blocking. Recent Linux kernels have very good implementations of all three so it would be unusual that you run into a bottleneck there (indeed, a well-tuned application should not be doing a ton of thread creation and should also seek to minimize blocking).
So I find it very likely the performance difference is due to other differences, in hardware, in application configuration, in the applied load, or something else.
Characterize your performance
Here's what I'd try first to characterize the performance discrepancy between your development host and the production system.
At the top level, the difference is either going to be because the production stack is actually slower for the load in question or because the local test isn't an accurate reflection of the production load.
One quick test you can do to distinguish the cases is to run whatever local test you are running to get 200-400us response times on an unloaded production server. If the server is still getting response times that are 10x worse, then you know your test is probably reasonable, and the difference is really in the production stack.
At that point, the problem could still be in OS, in the software configuration, in the hardware, etc. So you should try to bisect the differences between the production stack and your local host - set any tunable parameters to the same value, investigate any application-specific configuration differences, try to characterize any hardware differences.
One big gotcha is that often production servers are in multi-socket configurations, which may increase the cost of contention by an order of magnitude, since cross-socket communication (generally 100+ cycles) is required - whereas development boxes are generally multi-core but single-socket, so contention overhead is contained to the shared L3 (generally ~30 cycles).
On the other hand, you might find that your local test performs just fine on the production server as well, so the real issue is that your test doesn't represent the true production load. You should then make an effort to characterize the true production load so you can replicate it and then tune it locally. How to tune it locally could of course fill a book or two (or require a very highly paid contractor or two), so you'd have to come back with a narrower question to get useful help here.
"Big Iron" vs your laptop
It is a common fallacy that "big iron" is going to be faster at everything than your puny laptop. In fact, quite the opposite is true for many operations, especially when you measure operation latency (as opposed to total throughput).
For example, latency to memory on the server parts is often 50% slower versus client parts, even comparing single socket systems. John McCalpin reports a main-memory latency of 54.6 ns for a client Sandy Bridge part and 79 ns for the corresponding server part. It is well known the path to memory and memory controller design for servers trades off latency for throughput, reliability and the ability to support more cores and total DRAM [1].
In particular, you mention that your production server has "2 Xeon Processors", which I take to mean a dual-socket system. Once you introduce a second socket, you change the mechanics of synchronization entirely. On a single-socket system, when separate threads are under contention, at worst you are sending cache lines and coherency traffic through the shared L3, which has a latency of 30-40 cycles.
On a system with more than one socket, however, concurrency traffic generally has to flow over the QPI links between sockets, which has latency on the order of DRAM access, perhaps 80 ns (i.e., 240 cycles on a 3GHz box). So you can have nearly an order of magnitude slowdown from the hardware architecture alone.
Furthermore, notifyAll type scenarios as you describe your workflow often get much worse with more cores and more threads. E.g., with more cores, you are less likely to have two communicating processes running on the same hyperthread (which dramatically speeds up inter-thread coordination, but is otherwise undesirable) and the total contention and coherency traffic may scale up in proportion to the number of cores (e.g., because a cache line has to ping-pong around to every core when you wake up threads).
So it's often the case that a heavily contended (often badly designed) algorithm performs much worse on "big iron" than on a single-socket consumer system.
[1] E.g., through buffering, which adds latency, but increases the host's maximum RAM capacity.

How to prevent physical memory consuming when running parallel Java processes

I have a big list (up to 500,000) of functions.
My task is to generate a graph for each function (this can be done independently of the other functions) and dump the output to a file (or several files).
The process of generating graphs can be time-consuming.
I also have a server with 40 physical cores and 128GB of RAM.
I have tried to implement parallel processing using Java Threads/ExecutorPool, but it does not seem to use all of the processor resources.
On some inputs the program takes up to 25 hours to run and only 10-15 cores are working according to htop.
So the second thing I've tried is to create 40 distinct processes (using Runtime.exec) and split the list among them.
This method uses all processor resources (100% load on all 40 cores) and speeds things up by up to 5 times on the previous example (it takes only 5 hours, which is reasonable for my task).
But the problem with this method is that each Java process runs separately and consumes memory independently of the others. In some scenarios all 128GB of RAM is consumed after 5 minutes of parallel work. One solution that I am using now is to call System.gc() in each process if Runtime.totalMemory > 2GB. This slows down overall performance a bit (8 hours on the previous input) but keeps memory usage within reasonable bounds.
But this configuration works only for my server. If you run it on a server with 40 cores and 64GB of RAM, you need to tune the Runtime.totalMemory > 2GB condition.
So the question is: what is the best way to avoid such aggressive memory consumption?
Is it normal practice to run separate processes to do parallel jobs?
Is there any other parallel method in Java (maybe fork/join?) which uses 100% of the processor's physical resources?

You don't need to call System.gc() explicitly! The JVM will do it automatically when needed, and almost always does it better. You should, however, set the max heap size (-Xmx) to a number that works well.
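To check that a heap cap set with -Xmx is actually in effect inside each worker process, the process can print its own limits at startup. This is a minimal sketch; the class name HeapReport is hypothetical, and the concrete -Xmx value is something you would have to choose yourself (for instance, 128GB split across 40 processes leaves at most about 3GB each, before accounting for non-heap memory).

```java
class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Upper bound the JVM will try to use for the heap (reflects -Xmx), in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied by objects (live or not yet collected), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

With a cap in place the collector runs on its own as the heap approaches the limit, so the manual Runtime.totalMemory check and the explicit System.gc() call are no longer needed.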
If your program won't scale further you have some kind of congestion. You can either analyse your program and your java- and system settings and figure out why, or run it as multiple processes. If each process is multi-threaded, then you may get better performance using 5-10 processes instead of 40.
Note that you may get higher performance with more than one thread per core. Fiddle around with 1-8 threads per core and see if throughput increases.
From your description it sounds like you have 500,000 completely independent items of work and that each work item doesn't really need a lot of memory. If that is true, then memory consumption isn't really an issue. As long as each process has enough memory so it doesn't have to gc very often then gc isn't going to affect the total execution time by much. Just make sure you don't have any dangling references to objects you no longer need.
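For completely independent work items, a single process with one fixed-size pool is usually the simplest starting point. The sketch below assumes exactly that; ParallelGraphs, generateGraph and processAll are hypothetical names, and generateGraph is only a placeholder for the real graph generation and file output. Each task returns a small number instead of the graph itself, so no references to finished graphs are kept around.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelGraphs {

    /** Placeholder for the real work: build the graph, write it to a file, return bytes written. */
    static int generateGraph(int functionId) {
        return ("graph-" + functionId).length();
    }

    /** Processes itemCount independent items on a fixed pool and returns how many completed. */
    static int processAll(int itemCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < itemCount; i++) {
                final int id = i;
                Callable<Integer> task = () -> generateGraph(id);
                futures.add(pool.submit(task));
            }
            int completed = 0;
            for (Future<Integer> f : futures) {
                f.get();                // rethrows any failure from the task
                completed++;
            }
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(processAll(10_000, threads) + " items done on " + threads + " threads");
    }
}
```

If this shape still leaves cores idle, the tasks are probably contending on something shared (a lock, a synchronized collection, a single output stream), which is the congestion worth looking for.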

One of the problems here: it is still very hard to understand how many threads, cores, ... are actually available.
My personal suggestion: there are several articles on the java specialist newsletter which do a very deep dive into this subject.
For example this one: http://www.javaspecialists.eu/archive/Issue135.html
or a more recent one, on "the number of available processors": http://www.javaspecialists.eu/archive/Issue220.html

Mysql jconnector spends 50% time in com.myql.jdbc.utils.ReadAheadInputStream.fill()

I am profiling my application which uses Spring, Hibernate, and mysql-java-connector. The VisualVM shows that more than 50% of CPU time is spent in com.myql.jdbc.utils.ReadAheadInputStream.fill() method when there are 1000 parallel connections doing read.
Is there any optimization to make it faster?

VisualVM counts a thread as using CPU time whenever the JVM thinks it's runnable. This means that any thread not waiting on a lock is considered runnable, more or less, including threads waiting for I/O in the kernel! This is where the large amount of CPU usage in com.myql.jdbc.utils.ReadAheadInputStream.fill() is coming from. So instead of a CPU problem you have an I/O problem.
There are some things you can do on the JVM side, but not a lot of straightforward optimizing:
Tweak the connection pool size. 1,000 concurrent queries is a lot. Unless your MySQL instance is truly massive it's going to have trouble handling that level of load, and eat up a lot of time just switching between queries. Try dropping the pool size, to 250 or even 50, and benchmark there.
Do fewer or smaller queries. If your app is small it might be trivially obvious that every row from every query is necessary, but maybe your app is bigger than that. Are there different places that query the same data, or can two different queries be combined into one that will satisfy both?

On top of the other suggestions, consider also experimenting with a much lower amount of connections (i.e. 20).
It's very possible that the overhead of handling such a large amount of open connections is slightly fooling your profiling observations.
Not least, make sure you're using a recent version of Hibernate ORM.
We made version 5.0+ much smarter than previous versions, especially regarding performance improvements ;-) Improvements are applied daily, so keeping up to date or at least trying the latest might be an easy win.

It's hard to answer your question without additional information. Here are some questions that should be answered first:
Is your estimate of CPU time absolute or relative? If the fill() method uses half of the CPU time available to the system, that seems strange. But if this number was obtained using VisualVM, which reports usage relative to the time spent in the application, it may just be that the rest of your application is not doing significant work.
Have these profiling measurements been confirmed using system-level tools? You can use pidstat, mpstat and sar to cross-check if you are on Linux. I've seen VisualVM mark time spent in the SocketInputStream.socketRead0() method as CPU time, which was not confirmed by pidstat. I guess it's a consequence of some measurement approximations in VisualVM itself or of JVM behavior. So it's always a good idea to cross-check using OS tools.
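A cross-check is also possible from inside the JVM, since ThreadMXBean can report the CPU time a thread has actually consumed. The sketch below is an illustration under assumptions: the class and method names are hypothetical, a sleeping thread stands in for one that is waiting rather than computing, and per-thread CPU time may be unsupported or disabled on some JVMs (in which case -1 is returned).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicBoolean;

class ThreadCpuCheck {

    /** CPU nanoseconds consumed so far by a live thread, or -1 if unavailable. */
    static long cpuNanos(Thread t) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return -1;
        }
        return mx.getThreadCpuTime(t.getId());
    }

    /** Returns {cpu time of a waiting thread, cpu time of a spinning thread} after ~200 ms. */
    static long[] compareSleepVsSpin() {
        final AtomicBoolean stop = new AtomicBoolean(false);
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(10_000);   // waits, consumes almost no CPU
            } catch (InterruptedException ignored) {
                // interrupted on purpose once the sample has been taken
            }
        });
        Thread spinner = new Thread(() -> {
            long x = 0;
            while (!stop.get()) {       // burns CPU until told to stop
                x++;
            }
        });
        sleeper.setDaemon(true);
        spinner.setDaemon(true);
        sleeper.start();
        spinner.start();
        try {
            Thread.sleep(200);
            // Both threads are still alive here, so their CPU time can be read.
            return new long[] { cpuNanos(sleeper), cpuNanos(spinner) };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            stop.set(true);
            sleeper.interrupt();
        }
    }

    public static void main(String[] args) {
        long[] r = compareSleepVsSpin();
        System.out.println("waiting thread CPU ns:  " + r[0]);
        System.out.println("spinning thread CPU ns: " + r[1]);
    }
}
```

Applying cpuNanos to the threads that the profiler shows inside fill() would indicate whether they are really consuming CPU or merely waiting for data.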

Java Threadpool size and availableProcessors()

I have a program which runs (all day) tasks in parallel (no I/O in the task to be executed) so I have used Executors.newFixedThreadPool(poolSize) to implement it.
Initially I set the poolSize to Runtime.getRuntime().availableProcessors(), but I was a bit worried to use all the available cores since there are other processes running on the same PC (32 cores).
In particular I have ten other JVM running the same program (on different input data), so I'm a bit worried that there might be a lot of overhead in terms of threads switching amongst the available cores, which could slow down the overall calculations.
How shall I decide the size of the pool for each program / JVM?
Also, in my PC, there are other processes running all the time (Antivirus, Backup, etc.). Shall I take into account these as well?

Any advice is going to be dependent upon your particular circumstances. 10 JVMs on 32 cores would suggest 3 threads each (ignoring garbage collection threads, timer tasks etc...)
You also have other tasks running. The scheduler will ensure they're running, but do they have to be responsive ? More responsive than the JVM ? If you're running Linux/Unix then you can also make use of prioritisation (via nice) to ensure particular processes don't hog the CPU.
Finally you're running 10 JVMs. Will that cause paging ? If so, that will be slow and you may be better off running fewer JVMs in order to avoid consuming so much memory.
Just make sure that your key variables are exposed and configurable, and measure various scenarios in order to find the optimal one.

How shall I decide the size of the pool for each program / JVM?
You want the number of threads which will get you close to 99% utilisation and no more.
The simplest way to balance the work is to have the process running once, processing multiple files concurrently and using just one thread pool. You can set up your process as a service if you need to start files via the command line.
If this is impossible for some reason, you will need to guesstimate how much the thread pools should be shrunk by. Try running one process and look at the utilisation. If one process uses, say, 40%, then I suspect ten processes over-utilise the machine at 400%, i.e. you might reduce the pool size by a factor of 4.

Unfortunately, this is a hard thing to know, as programs don't typically know what else is or might be going on on the same box.
The "easy" way out is to make the pool size configurable. This allows the user who controls the program/box to decide how many threads to allocate to your program (presumably using their knowledge of the general workload of the box).
A more complex solution would be to attempt to programmatically determine the current workload of the box and choose the pool size appropriately from that. The efficacy of this solution depends on how accurately you can determine the workload and potentially adapt as it changes over time.
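The configurable approach could look like the sketch below. It is only an illustration: the class name PoolSizeConfig and the system property app.poolSize are made up for this example, and the fallback simply divides the available cores by the number of JVMs sharing the box, as in the 32 cores / 10 JVMs estimate discussed earlier.

```java
class PoolSizeConfig {

    /**
     * Uses the configured value when it is a positive integer; otherwise falls
     * back to an even share of the cores among the cooperating JVMs (at least 1).
     */
    static int resolvePoolSize(String configured, int availableCores, int cooperatingJvms) {
        if (configured != null) {
            try {
                int n = Integer.parseInt(configured.trim());
                if (n > 0) {
                    return n;
                }
            } catch (NumberFormatException ignored) {
                // fall through to the heuristic below
            }
        }
        return Math.max(1, availableCores / Math.max(1, cooperatingJvms));
    }

    public static void main(String[] args) {
        // "app.poolSize" is a hypothetical property name, e.g. java -Dapp.poolSize=3 ...
        String configured = System.getProperty("app.poolSize");
        int cores = Runtime.getRuntime().availableProcessors();
        int poolSize = resolvePoolSize(configured, cores, 10);
        System.out.println("pool size: " + poolSize);
    }
}
```

The resulting value would then be passed to Executors.newFixedThreadPool(poolSize), so the operator can adjust it without a rebuild.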

Try grepping the processes, check top/task manager and performance monitors to verify if this implementation is actually affecting your machine.
This article seems to contain interesting info about what you are trying to implement:
http://www.ibm.com/developerworks/library/j-jtp0730/index.html

Threads configuration based on no. of CPU-cores

Scenario : I have a sample application and I have 3 different system configuration -
- 2 core processor, 2 GB RAM, 60 GB HDD,
- 4 core processor, 4 GB RAM, 80 GB HDD,
- 8 core processor, 8 GB RAM, 120 GB HDD
In order to effectively exploit the H/W capabilities for my application, I wish to configure the no. of threads at the application level. However, I wish to do this only after a thorough understanding of system capabilities.
Could there be some way(system/modus/tool) to determine the system prowess with reference to the max and min no. of threads it could service optimally & without any loss in efficiency and performance. By this, I could configure only those values for my application that will do full justice and achieve best performance for the respective hardware configuration.
Edited1 :
Could any one please advise any read-up on how to set a baseline for a particular h/w config.
Edited2 :
To make it more direct - Wish to learn/know about any resource/write-up that I can read to gain some understanding on CPU management of Threads at a general/holistic level.

The optimal number of threads to use depends on several factors, but mostly the number of available processors and how cpu-intensive your tasks are. Java Concurrency in Practice proposes the following formal formula to estimate the optimal number of threads:
N_threads = N_cpu * U_cpu * (1 + W / C)
Where:
N_threads is the optimal number of threads
N_cpu is the number of processors, which you can obtain from Runtime.getRuntime().availableProcessors();
U_cpu is the target CPU utilization (1 if you want to use the full available resources)
W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)
So for example, in a CPU-bound scenario, you would have as many threads as CPU (some advocate to use that number + 1 but I have never seen that it made a significant difference).
For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case about 11 threads per processor would be useful (roughly 100 threads on a 9-processor machine).
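The formula is easy to turn into a helper. This is a minimal sketch; the class and method names are hypothetical, and the result is rounded and clamped to at least one thread, which is a choice made here and not part of the formula.

```java
class ThreadCountEstimate {

    /**
     * N_threads = N_cpu * U_cpu * (1 + W / C)
     *
     * @param nCpu          number of processors
     * @param uCpu          target CPU utilization, between 0 and 1
     * @param waitToCompute ratio W / C of wait time to compute time
     */
    static int optimalThreads(int nCpu, double uCpu, double waitToCompute) {
        double estimate = nCpu * uCpu * (1 + waitToCompute);
        return (int) Math.max(1, Math.round(estimate));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("CPU-bound:   " + optimalThreads(cores, 1.0, 0.0));
        System.out.println("I/O, W/C=10: " + optimalThreads(cores, 1.0, 10.0));
    }
}
```

For example, with 8 processors at full utilization, a CPU-bound task (W/C = 0) gives 8 threads, and W/C = 10 gives 8 * 11 = 88 threads.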
Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).
This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.
Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.

My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:
Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speed (GHz) grows. Some recent Intel and AMD chips can range from 2.6 GHz (all cores active) to 3.6 GHz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6 GHz - 2.0 GHz throughput in the former design. There is currently no way to query this info at runtime.
If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all cpu resources may not please the user or server admin (depending on if the software is a user app or server app).
There is no robust way to know what's going on within the rest of the machine at run-time, without replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated and usefulness is limited to specific types of applications (of which yours may qualify), and usually benefit from or require elevated or privileged access levels.
Modern virus scanners nowadays work by setting a special priority flag provided by modern operating systems, e.g. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful to a CPU-intensive task such as yours.
Distributed home computing apps (BOINC, Folding@Home, etc) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row then the app will suspend computation. Once the load goes low for some number of queries, it resumes. Multiple queries are required because the CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. If BOINC is run without Admin privileges then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
Regarding SMT (HyperThreading, Compute Modules):
Most SMTs will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half-a-thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation it's doing, and the target architecture. Intel and AMD's SMT implementations behave almost opposite of each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel. AMD's is strong at running SIMD and memory ops in parallel.
Regarding Turbo Features:
Most CPUs these days have very effective built-in Turbo support that further lessens the value gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7 GHz on two threads. It dropped to 3.5 GHz when a third thread was started, and to 3.4 GHz when a fourth was started. Since it's an integrated GPU as well, it dropped all the way to approx 3.0 GHz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios); but could still muster 3.6 GHz with 2 threads and GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful, they served as GPU servicing threads -- able to wake up and respond quickly to push new data to the GPU, as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/Turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Make a rough guess based on core counts that suits your local purposes well enough, allow it to be overridden by a config setting or command-line switch, and move on.
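As a rough illustration of the "guess from core count, allow an override" recommendation, here is a minimal sketch. The class name ThreadCountConfig and the system property name app.threads are hypothetical, chosen only for this example; the default of one thread per reported core is an assumption, not a tuned value.

```java
public final class ThreadCountConfig {
    private ThreadCountConfig() {}

    /**
     * Returns the override if it parses as a positive integer;
     * otherwise falls back to the given core count (never less than 1).
     */
    public static int resolve(String override, int cores) {
        int fallback = Math.max(1, cores);
        if (override == null) {
            return fallback;
        }
        try {
            int n = Integer.parseInt(override.trim());
            return n > 0 ? n : fallback;
        } catch (NumberFormatException e) {
            // Unparseable override: ignore it and use the core-based guess.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // "app.threads" is a hypothetical property name,
        // e.g. java -Dapp.threads=2 ThreadCountConfig
        String override = System.getProperty("app.threads");
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Using " + resolve(override, cores) + " worker threads");
    }
}
```

Keeping the parsing in a small pure function makes it easy to swap in a different default (for example, half the reported cores on an SMT machine) without touching the rest of the application.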

You can get the number of processors available to the JVM like this:
Runtime.getRuntime().availableProcessors()
Unfortunately, calculating the optimal number of threads from the number of available processors is not trivial. It depends a lot on the characteristics of the application: for a CPU-bound application, having more threads than processors makes little sense, while for a mostly IO-bound application you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.
I think the best strategy is to determine the optimal number of threads empirically for each hardware configuration, and then use these numbers in your application.
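One way to approach the empirical measurement is to time the same batch of work on fixed-size pools of different sizes and compare. The sketch below is only a starting point under several assumptions: the class name PoolSizeBenchmark is hypothetical, busyWork is a placeholder CPU-bound loop standing in for a slice of the real workload, and there is no JIT warm-up or repetition, so individual timings will be noisy. It avoids lambdas so that it also compiles on old JVMs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class PoolSizeBenchmark {
    private PoolSizeBenchmark() {}

    /** Placeholder CPU-bound task; replace with a slice of the real workload. */
    public static long busyWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (i * 31L) ^ (acc >>> 3);
        }
        return acc;
    }

    /** Runs taskCount tasks on a fixed pool of the given size; returns elapsed nanoseconds. */
    public static long timeRun(int threads, int taskCount, final int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
            for (int i = 0; i < taskCount; i++) {
                tasks.add(new Callable<Long>() {
                    public Long call() {
                        return busyWork(iterations);
                    }
                });
            }
            long start = System.nanoTime();
            for (Future<Long> f : pool.invokeAll(tasks)) {
                f.get(); // wait for completion and surface any task failure
            }
            return System.nanoTime() - start;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Try 1, 2, 4, ... up to four times the reported core count.
        for (int threads = 1; threads <= cores * 4; threads *= 2) {
            long nanos = timeRun(threads, 64, 2000000);
            System.out.println(threads + " threads: " + (nanos / 1000000) + " ms");
        }
    }
}
```

For an IO-bound application the placeholder task would need to block the way the real operations do; a pure CPU loop will understate the benefit of extra threads.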

I agree with the other answers here that recommend a best-guess approach, and providing configuration for overriding the defaults.
In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.
You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.
A common approach is to avoid CPU 0 (which the OS typically uses heavily), and to set your application's CPU affinity to a group of CPUs that are in the same socket.
Keeping the app's threads away from CPU 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.
Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among CPUs.
As with everything else, this is highly dependent on the architecture of the machine that you are running on, as well as on what other applications are running.

Use the VisualVM tool to monitor threads. First, create a minimal number of threads in the program and observe its performance. Then increase the number of threads and analyze the performance again. I hope this helps.

I use this Python script here to determine the number of cores (and memory, etc.) to launch my Java application with optimum parameters and ergonomics. PlatformWise on Github
It works like this: write a Python script that calls getNumberOfCPUCores() in the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that information to your program via command-line arguments. Your program can then use the appropriate number of threads based on the number of cores.

Creating threads at the application level is a good idea: on a multicore processor, separate threads are executed on separate cores, which improves performance. So to utilize the processing power of the cores, it is best practice to implement threading.
What I think:
At any given time, only 1 thread of a program will execute on 1 core.
The same application with 2 threads will, ideally, execute in half the time on 2 cores.
The same application with 4 threads will execute faster still on 4 cores.
So the application you are developing should have a thread count <= the number of cores.
Thread execution time is managed by the operating system and is highly unpredictable. The CPU time given to a thread is known as a time slice or a quantum. If we create more and more threads, the operating system spends a fraction of this time deciding which thread goes next, thus reducing the actual execution time each thread gets. In other words, each thread does less work when a large number of threads are queued up.
Read this to learn how to actually utilize CPU cores. Fantastic content.
csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/

