Scenario: I have a sample application and three different system configurations:
- 2-core processor, 2 GB RAM, 60 GB HDD
- 4-core processor, 4 GB RAM, 80 GB HDD
- 8-core processor, 8 GB RAM, 120 GB HDD
In order to effectively exploit the hardware capabilities for my application, I wish to configure the number of threads at the application level. However, I wish to do this only after a thorough understanding of system capabilities.
Could there be some way (a method or tool) to determine the system's capacity with reference to the maximum and minimum number of threads it could service optimally, without any loss in efficiency and performance? With that, I could configure only those values for my application that do full justice to, and achieve the best performance on, the respective hardware configuration.
Edit 1:
Could anyone please suggest any reading on how to establish a baseline for a particular hardware configuration?
Edit 2:
To make it more direct: I wish to learn about any resource or write-up I can read to gain some understanding of how CPUs manage threads at a general, holistic level.
The optimal number of threads to use depends on several factors, but mostly on the number of available processors and how CPU-intensive your tasks are. Java Concurrency in Practice proposes the following formula to estimate the optimal number of threads:
N_threads = N_cpu * U_cpu * (1 + W / C)
Where:
N_threads is the optimal number of threads
N_cpu is the number of processors, which you can obtain from Runtime.getRuntime().availableProcessors();
U_cpu is the target CPU utilization (1 if you want to use the full available resources)
W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)
So, for example, in a CPU-bound scenario you would have as many threads as CPUs (some advocate using that number + 1, but I have never seen that it made a significant difference).
For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.
Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).
This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.
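As a rough illustration, here is a minimal Java sketch of applying that formula; the utilization target and wait/compute ratio below are assumed values you would tune for your own workload, not measurements:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int nCpu = Runtime.getRuntime().availableProcessors();
        double targetUtilization = 1.0;  // U_cpu: fraction of the CPU you want to use
        double waitToCompute = 10.0;     // W/C: assumed ratio for a slow-I/O workload
        int nThreads = (int) Math.max(1, Math.round(nCpu * targetUtilization * (1 + waitToCompute)));
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        System.out.println("Using " + nThreads + " threads on " + nCpu + " processors");
        pool.shutdown();
    }
}
```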
Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.
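For reference, Amdahl's law can be written in the same style as the formula above:
S(N) = 1 / ((1 - P) + P / N)
where P is the fraction of the program that can be parallelised and N is the number of processors. For example, if 90% of the work parallelises (P = 0.9), then even with N = 8 processors the speed-up is at most 1 / (0.1 + 0.9/8) ≈ 4.7x.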
My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:
Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speed (GHz) grows. Some recent Intel and AMD chips can range from 2.6 GHz (all cores active) to 3.6 GHz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6 GHz - 2.0 GHz throughput in the former design. There is currently no way to query this info at runtime.
If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all cpu resources may not please the user or server admin (depending on if the software is a user app or server app).
There is no robust way to know what's going on within the rest of the machine at run-time, short of replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated, its usefulness is limited to specific types of applications (of which yours may qualify), and it usually benefits from or requires elevated or privileged access levels.
Modern virus scanners nowadays work by setting a special priority flag provided by modern operating systems, e.g. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful to a CPU-intensive task such as yours.
Distributed home computing apps (BOINC, Folding@home, etc.) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row, then the app will suspend computation. Once the load goes low for some number of queries, it resumes. Multiple queries are required because CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. If BOINC is run without admin privileges then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
Regarding SMT (HyperThreading, Compute Modules):
Most SMTs will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core on an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared-core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half a thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation it's doing, and the target architecture. Intel's and AMD's SMT implementations behave almost opposite of each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel; AMD's is strong at running SIMD and memory ops in parallel.
Regarding Turbo Features:
Most CPUs these days have very effective built-in turbo support that further lessens the value gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on the real temperature of the system as it is on CPU loads, so the cooling system of the tower itself affects the speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7 GHz on two threads. It dropped to 3.5 GHz when a third thread was started, and to 3.4 GHz when a fourth was started. Since it has an integrated GPU as well, it dropped all the way to approximately 3.0 GHz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios), but could still muster 3.6 GHz with two threads and the GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful: they served as GPU-servicing threads, able to wake up and respond quickly to push new data to the GPU as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits your local purposes well enough, allow it to be overridden by config/switch, and move on.
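As a minimal sketch of that recommendation (the property name and the fallback heuristic here are illustrative assumptions, not part of the original answer):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConfigurableThreads {
    public static void main(String[] args) {
        // Hypothetical override: -Dapp.threads=N on the command line wins;
        // otherwise fall back to a simple core-count heuristic.
        int defaultThreads = Math.max(1, Runtime.getRuntime().availableProcessors());
        int threads = Integer.getInteger("app.threads", defaultThreads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println("Worker pool size: " + threads);
        pool.shutdown();
    }
}
```

Running it as, say, java -Dapp.threads=4 ... would let an admin override the guess without touching the code.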
You can get the number of processors available to the JVM like this:
Runtime.getRuntime().availableProcessors()
Calculating the optimal number of threads from the number of available processors is unfortunately not trivial. It depends a lot on the characteristics of the application: for instance, with a CPU-bound application, having more threads than the number of processors makes little sense, while if the application is mostly I/O-bound you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.
I think the best strategy would be to decide the optimal number of threads empirically for each hardware configuration, and then use these numbers in your application.
I agree with the other answers here that recommend a best-guess approach combined with configuration for overriding the defaults.
In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.
You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.
A common approach is to avoid CPU 0 (always used by the OS), and to set your application's cpu affinity to a group of CPUs that are in the same socket.
Keeping the app's threads away from cpu 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.
Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among cpus.
As with everything else, this is highly dependent on the architecture of the machine you are running on, as well as what other applications are running.
Use the VisualVM tool to monitor threads. First create a minimum number of threads in the program and observe its performance. Then increase the number of threads within the program and analyze its performance again. Hope this helps.
I use the Python script here to determine the number of cores (and memory, etc.) so I can launch my Java application with optimum parameters and ergonomics: PlatformWise on GitHub.
It works like this: write a Python script which calls getNumberOfCPUCores() in the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that information to your program via command-line arguments. Your program can then use the appropriate number of threads based on the number of cores.
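A minimal sketch of the Java side, assuming the launcher passes the detected core count as the first command-line argument (that argument convention is an assumption for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LauncherAwareApp {
    public static void main(String[] args) {
        // Assumed convention: args[0] = core count detected by the external launcher script.
        int cores = args.length > 0 ? Integer.parseInt(args[0])
                                    : Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        System.out.println("Sized pool to " + cores + " threads");
        pool.shutdown();
    }
}
```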
Creating threads at the application level is good, and on a multicore processor separate threads are executed on separate cores to enhance performance. So to utilize the cores' processing power, it is best practice to implement threading.
What I think:
At any one time, only one thread of a program will execute on one core.
The same application with two threads will execute in roughly half the time on two cores.
The same application with four threads will execute faster still on four cores.
So the application you are developing should use a number of threads <= the number of cores.
Thread execution time is managed by the operating system and is a highly unpredictable activity. CPU execution time is granted in units known as a time slice or a quantum. If we create more and more threads, the operating system spends a fraction of this time slice deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words, each thread will do less work if there is a large number of threads queued up.
Read this to see how to actually utilize CPU cores. Fantastic content:
csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/
Related
I have a big question about tuning Linux for Java performance, so I'll start with my case.
I have an application running a number of threads that communicate with each other. My typical workflow is:
1) Some consumer thread synchronizes on a common Object lock and calls wait() on it.
2) Some producer thread waits via a Selector for data from the network.
2.1) The producer receives data and forms an object with the received timestamp (microsecond precision).
2.2) The producer puts this packet in an exchange map and calls notifyAll() on the common lock.
3) The consumer thread wakes up and reads the produced object.
3.1) The consumer creates a new object and writes into it the time difference in microseconds between the received timestamp and the current timestamp. This way I can monitor reaction time (a rough sketch of this pattern is shown after this list).
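A rough sketch of the wait/notifyAll exchange pattern described above, with hypothetical class and field names (the real code will of course differ):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Exchange {
    private final Object lock = new Object();
    private final Map<Long, Packet> exchangeMap = new ConcurrentHashMap<>();

    // Hypothetical packet carrying the receive timestamp in microseconds.
    static class Packet {
        final long receivedMicros;
        Packet(long receivedMicros) { this.receivedMicros = receivedMicros; }
    }

    // Producer side: publish a packet and wake up waiting consumers.
    void publish(long seq, Packet p) {
        exchangeMap.put(seq, p);
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    // Consumer side: wait for a packet and measure the reaction time.
    long awaitAndMeasure(long seq) throws InterruptedException {
        synchronized (lock) {
            while (!exchangeMap.containsKey(seq)) {
                lock.wait();
            }
        }
        Packet p = exchangeMap.remove(seq);
        long nowMicros = System.nanoTime() / 1_000;   // same clock assumed for both timestamps
        return nowMicros - p.receivedMicros;          // reaction time in microseconds
    }
}
```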
And this reaction time is the whole issue.
When I test my application on my own machine I usually get about 200-400 microseconds reaction time, but when I monitor it on my production Linux machine I get numbers from 2000 to 4000 microseconds!
Right now I'm running Ubuntu 16.04 as my production OS and Oracle JDK 8u111. I have a physical server with 2 Xeon processors. I run only the usual OS daemons and my app on this server, so there are plenty of resources compared to my dev notebook.
I run my Java app as a JAR file with these flags:
sudo chrt -r 77 java -server -XX:+UseNUMA -d64 -Xmx1500m -XX:NewSize=1000m -XX:+UseG1GC -jar ...
I use sudo chrt to change the priority since I thought scheduling was the issue, but it didn't help.
I tuned the BIOS for maximum performance and turned off C-states.
What else can I tune for faster reaction times and fewer context switches?
No, there is no single echo 1 > /proc/sys/unlock_concurrent_magic option on Linux to globally improve concurrent performance. If such an option existed, it would simply be enabled by default.
In fact, despite Linux being tunable in general, you aren't going to find many tunables that have a big effect specifically on raw concurrency. The ones that might help are often only incidentally related to concurrency - i.e., something is enabled (let's say THP) which slows down your particular load, and since at least part of the slowness occurs under a lock, the whole concurrent throughput is affected.
Java concurrency vs the OS
My experience, however, is that Java applications are very rarely affected directly by OS-level concurrency behavior. In fact, most of Java concurrency is implemented efficiently without using OS features, and will behave the same across OSes. The primary places where the JVM touches the OS as it relates to concurrency are thread creation, destruction, and waiting/blocking. Recent Linux kernels have very good implementations of all three, so it would be unusual for you to run into a bottleneck there (indeed, a well-tuned application should not be doing a ton of thread creation and should also seek to minimize blocking).
So I find it very likely the performance difference is due to other differences, in hardware, in application configuration, in the applied load, or something else.
Characterize your performance
Here's what I'd try first to characterize the performance discrepancy between your development host and the production system.
At the top level, the difference is going to be either because the production stack is actually slower for the load in question, or because the local test isn't an accurate reflection of the production load.
One quick test you can do to distinguish the cases is to run whatever local test you are running to get 200-400us response times on an unloaded production server. If the server is still getting response times that are 10x worse, then you know your test is probably reasonable, and the difference is really in the production stack.
At that point, the problem could still be in OS, in the software configuration, in the hardware, etc. So you should try to bisect the differences between the production stack and your local host - set any tunable parameters to the same value, investigate any application-specific configuration differences, try to characterize any hardware differences.
One big gotcha is that often production servers are in multi-socket configurations, which may increase the cost of contention by an order of magnitude, since cross-socket communication (generally 100+ cycles) is required - whereas development boxes are generally multi-core but single-socket, so contention overhead is contained to the shared L3 (generally ~30 cycles).
On the other hand, you might find that your local test performs just fine on the production server as well, so the real issue is that your test doesn't represent the true production load. You should then make an effort to characterize the true production load so you can replicate it and then tune it locally. How to tune it locally could of course fill a book or two (or require a very highly paid contractor or two), so you'd have to come back with a narrower question to get useful help here.
"Big Iron" vs your laptop
It is a common fallacy that "big iron" is going to be faster at everything than your puny laptop. In fact, quite the opposite is true for many operations, especially when you measure operation latency (as opposed to total throughput).
For example, latency to memory on server parts is often 50% worse than on client parts, even comparing single-socket systems. John McCalpin reports a main-memory latency of 54.6 ns for a client Sandy Bridge part and 79 ns for the corresponding server part. It is well known that the path to memory and the memory controller design for servers trade off latency for throughput, reliability, and the ability to support more cores and total DRAM.[1]
In particular, you mention that your production server has "2 Xeon Processors", which I take to mean a dual-socket system. Once you introduce a second socket, you change the mechanics of synchronization entirely. On a single-socket system, when separate threads are under contention, at worst you are sending cache lines and coherency traffic through the shared L3, which has a latency of 30-40 cycles.
On a system with more than one socket, however, concurrency traffic generally has to flow over the QPI links between sockets, which have a latency on the order of a DRAM access, perhaps 80 ns (i.e., 240 cycles on a 3 GHz box). So you can have nearly an order of magnitude slowdown from the hardware architecture alone.
Furthermore, notifyAll type scenarios as you describe your workflow often get much worse with more cores and more threads. E.g., with more cores, you are less likely to have two communicating processes running on the same hyperthread (which dramatically speeds up inter-thread coordination, but is otherwise undesirable) and the total contention and coherency traffic may scale up in proportion to the number of cores (e.g., because a cache line has to ping-pong around to every core when you wake up threads).
So it's often the case that a heavily contended (often badly designed) algorithm performs much worse on "big iron" than on a single-socket consumer system.
[1] E.g., through buffering, which adds latency but increases the host's maximum RAM capacity.
I'm confused about something.
What I know is that the maximum number of threads that can run concurrently on a normal CPU in a modern computer ranges from 8 to 16.
On the other hand, using GPUs thousands of threads can run concurrently without the scheduler interrupting any thread to schedule another one.
In several posts, such as:
Java virtual machine - maximum number of threads https://community.oracle.com/message/10312772
people state that they run thousands of Java threads concurrently on normal CPUs.
How can this be?
And how can I know the maximum number of threads that can run concurrently, so that my code adjusts itself dynamically according to the underlying architecture?
Threads aren't tied to or limited by the number of available processors/cores. The operating system scheduler can switch back and forth between any number of threads on a single CPU. This is the meaning of "preemptive multitasking."
Of course, if you have more threads than cores, not all threads will be executing simultaneously. Some will be on hold, waiting for a time slot.
In practice, the number of threads you can have is limited by the scheduler - but that number is usually very high (thousands or more). It will vary from OS to OS and with individual versions.
As far as how many threads are useful from a performance standpoint, as you said it depends on the number of available processors and on whether the task is IO or CPU bound. Experiment to find the optimal number and make it configurable if possible.
There is hardware and software concurrency. The 8 to 16 threads refers to the hardware you have - that is one or more CPUs with hardware to execute 8 to 16 threads parallel to each other. The thousands of threads refers to the number of software threads, the scheduler will have to swap them out so every software thread gets its time slice to run on the hardware.
To get the number of hardware threads you can try Runtime.getRuntime().availableProcessors().
At any given instant, a processor runs at most as many threads as it has cores. This means that on a uniprocessor system, only one thread (or no thread) is running at any given moment.
However, processors do not run each thread to completion one after another; rather, they switch between multiple threads rapidly to simulate concurrent execution. If this weren't the case then, never mind creating multiple threads, you wouldn't even be able to start multiple applications.
A Java thread (compared to processor instructions) is a very high-level abstraction of a set of instructions for the CPU to process. When it gets down to the processor level, there is no guarantee which threads will run on which core at any given time. But given that processors rapidly switch between these threads, it is theoretically possible to create an unbounded number of threads, albeit at the cost of performance.
If you think about it, a modern computer has thousands of threads running at the same time (combining all applications) while only having 1-16 cores in the typical case. Without this task switching, nothing would ever get done.
If you are optimizing your application, you should decide the number of threads you need based on the work at hand, not the underlying architecture. Performance gains from parallelism should be weighed against the increasing overheads of thread execution. Since every machine and every runtime environment is different, it is impractical to work out a golden thread count (though a ballpark estimate may be made by benchmarking and looking at the number of cores).
The other answers have explained how you can theoretically have thousands of threads in your application, at the cost of memory and the other overheads already well covered here. It is, however, worth noting that the default concurrencyLevel for the data structures provided in the java.util.concurrent package is 16.
You will come across contention issues if you don't account for this.
Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention.
Make sure you have set the appropriate concurrencyLevel in case you are running into issues related to concurrency with a higher number of threads.
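For example, one of the ConcurrentHashMap constructors lets you pass the concurrencyLevel explicitly (the sizes below are placeholder values):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ConcurrencyLevelExample {
    public static void main(String[] args) {
        // initialCapacity = 64, loadFactor = 0.75f, concurrencyLevel = 32 (default is 16),
        // sized for an application expecting roughly 32 concurrently writing threads.
        ConcurrentMap<String, Long> counters = new ConcurrentHashMap<>(64, 0.75f, 32);
        counters.put("requests", 0L);
        System.out.println(counters);
    }
}
```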
I have a program which runs (all day) tasks in parallel (no I/O in the task to be executed) so I have used Executors.newFixedThreadPool(poolSize) to implement it.
Initially I set the poolSize to Runtime.getRuntime().availableProcessors(), but I was a bit worried about using all the available cores since there are other processes running on the same PC (32 cores).
In particular I have ten other JVM running the same program (on different input data), so I'm a bit worried that there might be a lot of overhead in terms of threads switching amongst the available cores, which could slow down the overall calculations.
How shall I decide the size of the pool for each program / JVM?
Also, in my PC, there are other processes running all the time (Antivirus, Backup, etc.). Shall I take into account these as well?
Any advice is going to be dependent upon your particular circumstances. 10 JVMs on 32 cores would suggest 3 threads each (ignoring garbage collection threads, timer tasks etc...)
You also have other tasks running. The scheduler will ensure they're running, but do they have to be responsive? More responsive than the JVM? If you're running Linux/Unix then you can also make use of prioritisation (via nice) to ensure particular processes don't hog the CPU.
Finally, you're running 10 JVMs. Will that cause paging? If so, that will be slow and you may be better off running fewer JVMs in order to avoid consuming so much memory.
Just make sure that your key variables are exposed and configurable, and measure various scenarios in order to find the optimal one.
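A minimal sketch of that kind of configurability, dividing the cores among co-located JVMs as suggested above (the property names are illustrative assumptions):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedHostPool {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Hypothetical properties: how many JVMs share this box, plus an explicit override.
        int jvmsOnHost = Integer.getInteger("app.jvms.on.host", 1);
        int defaultSize = Math.max(1, cores / jvmsOnHost);   // e.g. 32 cores / 10 JVMs -> 3
        int poolSize = Integer.getInteger("app.pool.size", defaultSize);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        System.out.println("Pool size: " + poolSize + " of " + cores + " cores");
        pool.shutdown();
    }
}
```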
How shall I decide the size of the pool for each program / JVM?
You want the number of threads which will get you close to 99% utilisation and no more.
The simplest way to balance the work is to have the process run once, processing multiple files concurrently and using just one thread pool. You can set up your process as a service if you need to submit files via the command line.
If this is impossible for some reason, you will need to guesstimate how much the thread pools should be shrunk. Try running one process and look at its utilisation. If one process is at, say, 40%, then I suspect ten processes would be over-utilised at 400%, i.e. you might reduce the pool size by a factor of 4.
Unfortunately, this is a hard thing to know, as programs don't typically know what else is or might be going on on the same box.
the "easy" way out is to make the pool size configurable. this allows the user who controls the program/box to decide how many threads to allocate to your program (presumably using their knowledge of the general workload of the box).
a more complex solution would be to attempt to programmatically determine the current workload of the box and choose the pool size appropriately from that. the efficacy of this solution depends on how accurately you can determine the workload and potentially adapt as it changes over time.
Try grepping the processes, check top/task manager and performance monitors to verify if this implementation is actually affecting your machine.
This article seems to contain interesting info about what you are trying to implement:
http://www.ibm.com/developerworks/library/j-jtp0730/index.html
One of our servers is experiencing a very high CPU load with our application. We've looked at various stats and are having issues finding the source of the problem.
One of the current theories is that there are too many threads involved and that we should try to reduce the number of concurrently executing threads. There's just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE - GlassFish). At any given moment, there are about 620 separate network I/O operations that need to be conducted in parallel (use of java.nio is not an option either). Moreover, there are roughly 100 operations that have no I/O involved and are also executed in parallel.
This structure is not efficient and we want to see if it is actually causing damage, or is simply bad practice. Reason being that any change is quite expensive in this system (in terms of man hours) so we need some proof of an issue.
So now we're wondering if context switching of threads is the cause, given there are far more threads than the required concurrent operations. Looking at the logs, we see that on average there are 14 different threads executed in a given second. If we take into account the existence of two CPUs (see below), then it is 7 threads per CPU. This doesn't sound like too much, but we wanted to verify this.
So - can we rule out context switching or too-many-threads as the problem?
General Details:
Java 1.5 (yes, it's old), running on CentOS 5, 64-bit, Linux kernel 2.6.18-128.el5
There is only one single Java process on the machine, nothing else.
Two CPUs, under VMware.
8GB RAM
We don't have the option of running a profiler on the machine.
We don't have the option of upgrading the Java, nor the OS.
UPDATE
As advised below, we've conducted captures of load average (using uptime) and CPU (using vmstat 1 120) on our test server with various loads. We've waited 15 minutes between each load change and its measurements to ensure that the system stabilized around the new load and that the load average numbers are updated:
50% of the production server's workload: http://pastebin.com/GE2kGLkk
34% of the production server's workload: http://pastebin.com/V2PWq8CG
25% of the production server's workload: http://pastebin.com/0pxxK0Fu
CPU usage appears to be reduced as the load reduces, but not on a very drastic level (change from 50% to 25% is not really a 50% reduction in CPU usage). Load average seems uncorrelated with the amount of workload.
There's also a question: given our test server is also a VM, could its CPU measurements be impacted by other VMs running on the same host (making the above measurements useless)?
UPDATE 2
Attaching the snapshot of the threads in three parts (pastebin limitations)
Part 1: http://pastebin.com/DvNzkB5z
Part 2: http://pastebin.com/72sC00rc
Part 3: http://pastebin.com/YTG9hgF5
It seems to me the problem is the 100 CPU-bound threads more than anything else. The 3000-thread pool is basically a red herring, as idle threads don't consume much of anything. The I/O threads are likely sleeping most of the time, since I/O is measured on a geologic time scale in terms of computer operations.
You don't mention what the 100 CPU threads are doing, or how long they last, but if you want to slow down a computer, dedicating 100 threads of "run until time slice says stop" will most certainly do it. Because you have 100 "always ready to run", the machine will context switch as fast as the scheduler allows. There will be pretty much zero idle time. Context switching will have impact because you're doing it so often. Since the CPU threads are (likely) consuming most of the CPU time, your I/O "bound" threads are going to be waiting in the run queue longer than they're waiting for I/O. So, even more processes are waiting (the I/O processes just bail out more often as they hit an I/O barrier quickly which idles the process out for the next one).
No doubt there are tweaks here and there to improve efficiency, but 100 CPU threads are 100 CPU threads. Not much you can do there.
I think your constraints are unreasonable. Basically what you are saying is:
1. I can't change anything.
2. I can't measure anything.
Can you please speculate as to what my problem might be?
The real answer to this is that you need to hook a proper profiler to the application and you need to correlate what you see with CPU usage, Disk/Network I/O, and memory.
Remember the 80/20 rule of performance tuning: 80% will come from tuning your application. You might just have too much load for one VM instance, and it could be time to consider solutions for scaling horizontally or vertically by giving more resources to the machine. It could also be that one of the 3 billion JVM settings is not in line with your application's execution specifics.
I assume the 3000-thread pool came from the famous "more threads = more concurrency = more performance" theory. The real answer is that a tuning change isn't worth anything unless you measure throughput and response time before/after the change and compare the results.
If you can't profile, I'd recommend taking a thread dump or two and seeing what your threads are doing. Your app doesn't have to stop to do it:
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/threads.html
http://java.net/projects/tda/
http://java.sys-con.com/node/1611555
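If attaching external tools is difficult, a quick programmatic dump is also possible from within the application itself (Thread.getAllStackTraces() has been available since Java 1.5); a minimal sketch:

```java
import java.util.Map;

public class QuickThreadDump {
    public static void main(String[] args) {
        // Dump the name, state and stack of every live thread in this JVM to stdout.
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : all.entrySet()) {
            Thread t = e.getKey();
            System.out.println(t.getName() + " [" + t.getState() + "]");
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```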
So - can we rule out context switching or too-many-threads as the problem?
I think your concerns over thrashing are warranted. A thread pool with 3000 threads (700+ concurrent operations) on a 2-CPU VMware instance certainly seems like a problem that may be causing context-switching overload and performance problems. Limiting the number of threads could give you a performance boost, although determining the right number will be difficult and will probably require a lot of trial and error.
we need some proof of an issue.
I'm not sure the best way to answer but here are some ideas:
Watch the load average of the VM OS and the JVM. If you are seeing high load values (20+) then this is an indicator that there are too many things in the run queues.
Is there no way to simulate the load in a test environment so you can play with the thread pool numbers? If you run simulated load in a test environment with pool size of X and then run with X/2, you should be able to determine optimal values.
Can you compare high load times of day with lower load times of day? Can you graph number of responses to latency during these times to see if you can see a tipping point in terms of thrashing?
If you can simulate load, then make sure you aren't just testing under the "drink from the fire hose" methodology. You need simulated load that you can dial up and down. Start at 10% and slowly increase the simulated load while watching throughput and latency. You should be able to see the tipping points by watching for throughput flattening out or otherwise deflecting.
Usually, context switching between threads is computationally very cheap, but when it involves this many threads... you just can't know. You say upgrading to Java 1.6 EE is out of the question, but what about some hardware upgrades? They would probably provide a quick fix and shouldn't be that expensive...
e.g. run a profiler on a similar machine.
try a newer version of Java 6 or 7. (It may not make a difference, in which case don't bother upgrading production)
try Centos 6.x
try not using VMware.
try reducing the number of threads. You only have 8 cores.
You may find that all or none of the above options make a difference, but you won't know until you have a system you can test on with a known, repeatable workload.
Currently I am in the process of developing an application which can work in multi-threaded mode. As part of testing on my local machine (Intel Core i5) I tested with 4 threads. But now I want to release the code for intense (regression) testing, so is there any hard rule by which we can decide the number of threads to be created for processing?
I am not using any web or app server; instead I have written my own logic to receive a request and then process it. During processing, I receive the request on the main thread and then submit the call to an ExecutorService, where I need to decide the number of threads; each thread then processes the request and is capable of returning a response.
I need to configure an optimum number of threads. I am trying to deploy my application on a 16-core Linux machine with 40 GB of memory.
Thanks
The maximum number of threads for an application cannot be derived from some well-defined formula; it depends on the nature of your various tasks and your target environment.
If your tasks are CPU-intensive and you spawn too many threads, performance will degrade, as most of the time will be spent in context switching.
For compute-intensive tasks, a general formula is N_cpus + 1. You can determine the number of CPUs using Runtime.getRuntime().availableProcessors().
If your tasks are I/O-intensive, then most of the time you can use a much larger number of threads, since the threads spend so much of their time blocked that they do not all compete for the CPU at once.
Taking these two cases into account, you should estimate the compute time vs. wait time via a profiler or other similar tool.
You can try your benchmarks with various pool sizes until you find the optimum for your case.
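A rough benchmarking sketch along those lines (the task body and iteration counts are placeholders; substitute your real workload):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizeBenchmark {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int size : new int[] {1, cores / 2, cores, cores + 1, cores * 2}) {
            if (size < 1) continue;                      // skip invalid sizes on small machines
            ExecutorService pool = Executors.newFixedThreadPool(size);
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {
                tasks.add(() -> {
                    long sum = 0;                        // placeholder CPU-bound work
                    for (int j = 0; j < 100_000; j++) sum += j;
                    return sum;
                });
            }
            long start = System.nanoTime();
            pool.invokeAll(tasks);                       // run the whole batch and wait
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println(size + " threads -> " + elapsedMs + " ms");
            pool.shutdown();
        }
    }
}
```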
In theory, the optimal number of threads is equal to the number of cores in your machine.
In practice, many operations are waiting for memory, IO, network or disk.
Try executing only a single thread. If the CPU core load is 25%, you can try creating (4 x the number of cores in your machine) threads.
Note that increasing the number of threads will affect the time each thread waits for network/disk/memory/IO, so it is somewhat more complex.
The best thing you can do is benchmark: measure how much time it takes to complete 1,000,000 simulated requests, given different numbers of threads.
It depends on how CPU-intensive your tasks are. But you can still assign one task to one core, so at the least you can create as many threads as there are cores. That said, things may slow down depending on:
Your code doing lots of I/O
Lots of network I/O
Other CPU-intensive tasks
If you create too many threads, a lot of time will be wasted in context switching. Unless you can arrive at a benchmark based on your own tests, go with threads = number of cores.