Linux tuning for Java Concurrent Performance - java

I have a big question about tuning Linux for Java performance, so I'll start with my case.
I have an application running a number of threads that communicate with each other. My typical workflow is:
1) A consumer thread synchronizes on a common object lock and calls wait() on it.
2) A producer thread waits via a Selector for data from the network.
2.1) The producer receives the data and forms an object with the received timestamp (microsecond precision).
2.2) The producer puts this packet into an exchange map and calls notifyAll() on the common lock.
3) The consumer thread wakes up and reads the produced object.
3.1) The consumer creates a new object and writes into it the time difference in microseconds between the received timestamp and the current timestamp. This is how I monitor reaction time.
And this reaction time is the whole issue.
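For reference, here is a minimal sketch of the handoff being measured; the class and field names (Packet, exchange, LOCK) are illustrative assumptions, not taken from the real application:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the producer/consumer handoff described above.
public class HandoffSketch {
    static final Object LOCK = new Object();
    static final Map<Long, Packet> exchange = new ConcurrentHashMap<>();

    static class Packet {
        final long seq;
        final long receivedNanos;   // timestamp taken when the bytes arrived
        Packet(long seq, long receivedNanos) { this.seq = seq; this.receivedNanos = receivedNanos; }
    }

    // Producer side: called after the Selector delivers data.
    static void onDataReceived(long seq) {
        exchange.put(seq, new Packet(seq, System.nanoTime()));
        synchronized (LOCK) { LOCK.notifyAll(); }    // wake all waiting consumers
    }

    // Consumer side: waits on the shared lock, then measures reaction time.
    static void consumerLoop() throws InterruptedException {
        long nextSeq = 0;
        while (true) {
            Packet p;
            synchronized (LOCK) {
                while ((p = exchange.remove(nextSeq)) == null) {
                    LOCK.wait();                     // step 1 of the workflow above
                }
            }
            long reactionMicros = (System.nanoTime() - p.receivedNanos) / 1_000;
            System.out.println("reaction time: " + reactionMicros + " us");
            nextSeq++;
        }
    }
}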
When I test my application on my own machine I usually get a reaction time of about 200-400 microseconds, but when I monitor it on my production Linux machine I get numbers from 2000 to 4000 microseconds!
Right now I'm running Ubuntu 16.04 as my production OS and Oracle JDK 8u111. I have a physical server with two Xeon processors. I run only the usual OS daemons and my app on this server, so there are plenty of resources compared to my dev notebook.
I run my Java app as a jar file with these flags:
sudo chrt -r 77 java -server -XX:+UseNUMA -d64 -Xmx1500m -XX:NewSize=1000m -XX:+UseG1GC -jar ...
I use sudo chrt to change the priority since I thought that was the issue, but it didn't help.
I tuned the BIOS for maximum performance and turned off C-states.
What else can I tune for faster reaction times and fewer context switches?

No, there is no single echo 1 > /proc/sys/unlock_concurrent_magic option on Linux to globally improve concurrent performance. If such an option existed, it would simply be enabled by default.
In fact, although Linux is quite tunable in general, you aren't going to find many tunables that have a big effect specifically on raw concurrency. The ones that might help are often only incidentally related to concurrency - i.e., something is enabled (let's say THP) which slows down your particular load, and since at least part of the slowdown occurs under a lock, the overall concurrent throughput is affected.
Java concurrency vs the OS
My experience, however, is that Java applications are very rarely affected directly by OS-level concurrency behavior. In fact, most of Java concurrency is implemented efficiently without using OS features, and will behave the same across OSes. The primary places where the JVM touches the OS as far as concurrency is concerned are thread creation, destruction, and waiting/blocking. Recent Linux kernels have very good implementations of all three, so it would be unusual to run into a bottleneck there (indeed, a well-tuned application should not be doing a ton of thread creation and should also seek to minimize blocking).
So I find it very likely the performance difference is due to other differences: in hardware, in application configuration, in the applied load, or something else.
Characterize your performance
Here's what I'd try first to characterize the performance discrepancy between your development host and the production system.
At the top level, the difference is either going to be because the production stack is actually slower for the load in question or because the local test isn't an accurate reflection of the production load.
One quick test you can do to distinguish the cases is to run whatever local test you are running to get 200-400us response times on an unloaded production server. If the server is still getting response times that are 10x worse, then you know your test is probably reasonable, and the difference is really in the production stack.
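One portable way to run that comparison is a small wait()/notify() round-trip probe that reports percentiles; run the identical jar on the notebook and on the unloaded server. This is a rough sketch under assumed iteration counts, not the application itself:
import java.util.Arrays;

// Rough sketch: measures how long a thread blocked in wait() takes to observe
// a notification sent by another thread.
public class WakeupProbe {
    static final Object LOCK = new Object();
    static long stampNanos;
    static boolean ready;

    public static void main(String[] args) throws Exception {
        final int iterations = 100_000;
        final long[] samples = new long[iterations];

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                synchronized (LOCK) {
                    while (!ready) {
                        try { LOCK.wait(); } catch (InterruptedException e) { return; }
                    }
                    samples[i] = System.nanoTime() - stampNanos;  // wake-up latency
                    ready = false;
                    LOCK.notifyAll();                             // hand the turn back
                }
            }
        });
        consumer.start();

        for (int i = 0; i < iterations; i++) {
            synchronized (LOCK) {
                stampNanos = System.nanoTime();
                ready = true;
                LOCK.notifyAll();
                while (ready) {
                    LOCK.wait();                                  // wait until consumed
                }
            }
        }
        consumer.join();

        Arrays.sort(samples);
        System.out.printf("p50=%d ns  p99=%d ns  max=%d ns%n",
                samples[iterations / 2], samples[(int) (iterations * 0.99)], samples[iterations - 1]);
    }
}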
At that point, the problem could still be in the OS, in the software configuration, in the hardware, etc. So you should try to bisect the differences between the production stack and your local host - set any tunable parameters to the same values, investigate any application-specific configuration differences, and try to characterize any hardware differences.
One big gotcha is that often production servers are in multi-socket configurations, which may increase the cost of contention by an order of magnitude, since cross-socket communication (generally 100+ cycles) is required - whereas development boxes are generally multi-core but single-socket, so contention overhead is contained to the shared L3 (generally ~30 cycles).
On the other hand, you might find that your local test performs just fine on the production server as well, so the real issue is that your test doesn't represent the true production load. You should then make an effort to characterize the true production load so you can replicate it and then tune it locally. How to tune it locally could of course fill a book or two (or require a very highly paid contractor or two), so you'd have to come back with a narrower question to get useful help here.
"Big Iron" vs your laptop
It is a common fallacy that "big iron" is going to be faster at everything than your puny laptop. In fact, quite the opposite is true for many operations, especially when you measure operation latency (as opposed to total throughput).
For example, latency to memory on server parts is often 50% higher than on client parts, even comparing single-socket systems. John McCalpin reports a main-memory latency of 54.6 ns for a client Sandy Bridge part and 79 ns for the corresponding server part. It is well known that the path to memory and the memory controller design for servers trade off latency for throughput, reliability, and the ability to support more cores and more total DRAM [1].
In particular, you mention that your production server has "2 Xeon Processors", which I take to mean a dual-socket system. Once you introduce a second socket, you change the mechanics of synchronization entirely. On a single-socket system, when separate threads contend, at worst you are sending cache lines and coherency traffic through the shared L3, which has a latency of 30-40 cycles.
On a system with more than one socket, however, concurrency traffic generally has to flow over the QPI links between sockets, which have latency on the order of a DRAM access, perhaps 80 ns (i.e., 240 cycles on a 3 GHz box). So you can have nearly an order of magnitude slowdown from the hardware architecture alone.
Furthermore, notifyAll-type scenarios like the workflow you describe often get much worse with more cores and more threads. E.g., with more cores, you are less likely to have two communicating threads running on the same hyperthreaded core (which dramatically speeds up inter-thread coordination, but is otherwise undesirable), and the total contention and coherency traffic may scale up in proportion to the number of cores (e.g., because a cache line has to ping-pong around to every core when you wake up threads).
So it's often the case that a heavily contended (often badly designed) algorithm performs much worse on "big iron" than on a single-socket consumer system.
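If the workflow really is one producer handing each packet to a single consumer, a queue-based handoff wakes exactly one waiting thread instead of all of them, which tends to scale better on big multi-core boxes. A minimal sketch; the packet layout and queue capacity are assumptions:
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: replace the shared lock + notifyAll with a bounded queue,
// so each packet wakes exactly one waiting thread.
public class QueueHandoff {
    static final BlockingQueue<long[]> queue = new ArrayBlockingQueue<>(1024);

    // Producer: called when the Selector delivers data.
    static void onDataReceived(long seq) throws InterruptedException {
        queue.put(new long[] { seq, System.nanoTime() });   // {sequence, received timestamp}
    }

    // Consumer: blocks until a packet arrives, then measures reaction time.
    static void consumerLoop() throws InterruptedException {
        while (true) {
            long[] packet = queue.take();
            long reactionMicros = (System.nanoTime() - packet[1]) / 1_000;
            // ... process packet[0] and record reactionMicros ...
        }
    }
}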
[1] E.g., through buffering, which adds latency but increases the host's maximum RAM capacity.

Related

Scalability of Redis Cluster using Jedis 2.8.0 to benchmark throughput

I have an instance of JedisCluster shared between N threads that perform set operations.
When I run with 64 threads, the throughput of set operations is only slightly increased (compared to running using 8 threads).
How to configure the JedisCluster instance using the GenericObjectPoolConfig so that I can maximize throughput as I increase the thread count?
I have tried
GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxTotal(64);
jedisCluster = new JedisCluster(jedisClusterNodes, poolConfig);
believing this could increase the number of JedisCluster connections to the cluster and so boost throughput.
However, I observed a minimal effect.
When talking about performance, we need to dig into details a bit before I can actually answer your question.
A naive approach suggests: The more Threads (concurrency), the higher the throughput.
That statement is not wrong, but it is also not the whole truth. Concurrency and the resulting performance are not (always) linear, because there is so much going on behind the scenes. Turning sequential processing into concurrent processing might get you twice the work done compared to sequential execution. That assumes you run on a multi-core machine that is not occupied by anything else and that has enough bandwidth for the required work (I/O, network, memory). If you scale this example from two threads to eight, but your machine has only four physical cores, weird things start to happen.
First of all, the processor needs to schedule two threads per core, so each thread probably behaves as if it were running sequentially, except that the process, the OS, and the processor have increased overhead caused by twice as many threads as cores. Orchestrating these threads comes at a cost that needs to be paid at least in memory allocation and CPU time. If the workload requires heavy I/O, then the work processing might be limited by your I/O bandwidth, and running things concurrently may increase throughput because the CPU is mostly waiting until the I/O comes back with data to process. In that scenario, 4 threads might be blocked on I/O while the other 4 are doing some work. The same applies to memory and other resources utilized by your application. Actually, there's much more to it - context switching, branch prediction, L1/L2/L3 caching, locking and much more, enough to fill a 500-page book. Let's stay at a basic level.
Resource sharing and certain limitations lead to different scalability profiles. Some are linear up to a certain concurrency level, some hit a ceiling where adding more concurrency yields the same throughput, and some have a knee where adding concurrency actually makes things slower, because of $reasons.
Now, we can analyze how Redis, Redis Cluster, and concurrency are related.
Redis is a network service which requires network I/O. That might be obvious, but we need to add this fact to our considerations: a Redis server shares its network connection with other things running on the same host and with everything using the same switches, routers, hubs, etc. The same applies to the client, even if you told everybody else not to run anything while you're testing.
The next thing is, Redis uses a single-threaded processing model for user tasks (I don't want to dig into background I/O, lazy memory freeing and asynchronous replication here). So you could assume that Redis uses one CPU core for its work but, in fact, it can use more than that. If multiple clients send commands at a time, Redis processes the commands sequentially, in the order of arrival (except for blocking operations, but let's leave those out for this post). If you run N Redis instances on one machine where N is also the number of CPU cores, you can easily run into a resource-sharing scenario again - that is something you might want to avoid.
You have one or many clients that talk to your Redis server(s). Depending on the number of clients involved in your test, this has an effect. Running 64 threads on an 8-core machine might not be the best idea, since only 8 cores can execute work at a time (let's leave hyper-threading and all that out of it; I don't want to confuse you too much). Requesting more than 8 threads causes time-sharing effects. Running somewhat more threads than CPU cores for Redis and other networked services isn't too bad an idea, though, since there is always some overhead/lag coming from the I/O (network). You need to send packets from Java (through the JVM, the OS, the network adapter, routers) to Redis (routers, network, yadda yadda yadda), Redis has to process the commands, and the response has to travel back. That usually takes some time.
The client itself (assuming concurrency on one JVM) locks certain resources for synchronization. In particular, requesting connections from the pool (reusing existing connections or creating new ones) is a locking scenario. You already found the link to the pool config. While one thread holds the lock on a resource, no other thread can access it.
Knowing the basics, we can dig into how to measure throughput using jedis and Redis Cluster:
Congestion on the Redis Cluster can be an issue. If all client threads are talking to the same cluster node, then the other cluster nodes are idle, and you have effectively measured how one node behaves, not the cluster. Solution: create an even workload (level: hard!).
Congestion on the client: Running 64 threads on an 8-core machine (that is just my assumption here, so please don't beat me up if I'm wrong) is not the best idea. Raising the number of client threads a bit above the number of cluster nodes (assuming an even workload for each cluster node) and a bit above the number of CPU cores can improve performance. Having 8x as many threads as CPU cores is overkill, because it adds scheduling overhead at all levels. In general, performance engineering is about finding the best ratio between work, overhead, bandwidth limitations, and concurrency, so finding the best number of threads is a field of its own in computer science.
Running the test from multiple systems, which together run the total number of threads, is probably closer to a production environment than running it from one system. Distributed performance testing is a master class (level: very hard!). The trick here is to monitor all resources that are used by your test, making sure nothing is overloaded, or to find the tipping point where you identify the limit of a particular resource. Monitoring the client and the server are just the easy parts.
Since I do not know your setup (number of Redis Cluster nodes, distribution of cluster nodes amongst different servers, load on the Redis servers, the client, and the network during the test caused by things other than your test), it is impossible to say what the cause is.
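To illustrate what an empirical answer could look like, here is a rough throughput sweep: the same total number of SET operations run with different client thread counts, so you can see where the curve flattens. The node address, key distribution and operation count are assumptions to adapt to your setup:
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

// Rough sweep: same total work, different thread counts, ops/sec per run.
public class ClusterSetBenchmark {
    public static void main(String[] args) throws Exception {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("127.0.0.1", 7000));       // assumption: point at your cluster

        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(64);                           // pool size applies per cluster node
        final JedisCluster cluster = new JedisCluster(nodes, poolConfig);

        final int totalOps = 1_000_000;
        for (int threads : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            final AtomicLong next = new AtomicLong();
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    long i;
                    while ((i = next.getAndIncrement()) < totalOps) {
                        cluster.set("key:" + i, "value");     // keys spread across hash slots
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.MINUTES);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%2d threads: %.0f ops/sec%n", threads, totalOps / seconds);
        }
    }
}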

Threads configuration based on no. of CPU-cores

Scenario: I have a sample application and 3 different system configurations:
- 2-core processor, 2 GB RAM, 60 GB HDD,
- 4-core processor, 4 GB RAM, 80 GB HDD,
- 8-core processor, 8 GB RAM, 120 GB HDD
In order to effectively exploit the H/W capabilities for my application, I wish to configure the no. of threads at the application level. However, I wish to do this only after a thorough understanding of system capabilities.
Is there some way (system/method/tool) to determine the system's capability with reference to the maximum and minimum number of threads it could service optimally, without any loss in efficiency and performance? That way I could configure, for my application, only those values that do the hardware full justice and achieve the best performance for the respective configuration.
Edit 1:
Could anyone please recommend any reading on how to establish a baseline for a particular hardware configuration?
Edit 2:
To make it more direct: I wish to know about any resource/write-up that I can read to gain some understanding of how CPUs manage threads, at a general/holistic level.
The optimal number of threads to use depends on several factors, but mostly on the number of available processors and on how CPU-intensive your tasks are. Java Concurrency in Practice proposes the following formula to estimate the optimal number of threads:
N_threads = N_cpu * U_cpu * (1 + W / C)
Where:
N_threads is the optimal number of threads
N_cpu is the number of processors, which you can obtain from Runtime.getRuntime().availableProcessors();
U_cpu is the target CPU utilization (1 if you want to use the full available resources)
W / C is the ratio of wait time to compute time (0 for CPU-bound task, maybe 10 or 100 for slow I/O tasks)
So for example, in a CPU-bound scenario, you would have as many threads as CPUs (some advocate using that number + 1, but I have never seen that it made a significant difference).
For a slow I/O process, for example a web crawler, W/C could be 10 if downloading a page is 10 times slower than processing it, in which case using 100 threads would be useful.
Note however that there is an upper bound in practice (using 10,000 threads will generally not speed things up, and you would probably get an OutOfMemoryError before you can start them all anyway with normal memory settings).
This is probably the best estimate you can get if you don't know anything about the environment in which your application runs. Profiling your application in production might enable you to fine tune the settings.
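As a rough sketch, the formula translates directly into code; the wait-to-compute ratio below is an assumed example value that you would measure for your own tasks:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int nCpu = Runtime.getRuntime().availableProcessors();
        double targetUtilization = 1.0;    // U_cpu: use the whole machine
        double waitToComputeRatio = 10.0;  // W/C: assumed, measure it for your own tasks

        // N_threads = N_cpu * U_cpu * (1 + W/C)
        int nThreads = (int) Math.max(1, nCpu * targetUtilization * (1 + waitToComputeRatio));
        System.out.println("Sizing pool to " + nThreads + " threads for " + nCpu + " processors");

        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        // submit work here, then shut down when done
        pool.shutdown();
    }
}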
Although not strictly related, you might also be interested in Amdahl's law, which aims at measuring the maximum speed-up you can expect from parallelising a program.
My recommendation is to provide config and command-line switches for assigning the number of threads per-machine. Use a heuristic based on Runtime.getRuntime().availableProcessors() as indicated by other answers here, in cases where the user/admin hasn't explicitly configured the application differently. I strongly recommend against exclusive heuristic-based thread-to-core guessing, for several reasons:
Most modern hardware is moving toward increasingly ambiguous types of 'hardware threads': SMT models such as Intel's Hyperthreading and AMD's Compute Modules complicate formulas (details below), and querying this info at runtime can be difficult.
Most modern hardware has a turbo feature that scales speed based on active cores and ambient temperatures. As turbo tech improves, the range of speeds (GHz) grows. Some recent Intel and AMD chips can range from 2.6 GHz (all cores active) to 3.6 GHz (single/dual core active), which combined with SMT can mean each thread getting an effective 1.6-2.0 GHz throughput in the former case. There is currently no way to query this info at runtime.
If you do not have a strong guarantee that your application will be the only process running on the target systems, then blindly consuming all cpu resources may not please the user or server admin (depending on if the software is a user app or server app).
There is no robust way to know what's going on in the rest of the machine at run-time, short of replacing the entire operating system with your own home-rolled multitasking kernel. Your software can try to make educated guesses by querying processes and peeking at CPU loads and such, but doing so is complicated, its usefulness is limited to specific types of applications (which yours may qualify as), and it usually benefits from or requires elevated or privileged access levels.
Modern virus scanners nowadays work by setting a special priority flag provided by modern operating systems, e.g. they let the OS tell them when "the system is idle". The OS bases its decision on more than just CPU load: it also considers user input and multimedia flags that may have been set by movie players, etc. This is fine for mostly-idle tasks, but not useful for a CPU-intensive task such as yours.
Distributed home computing apps (BOINC, Folding@Home, etc.) work by querying running processes and system CPU load periodically -- once every second or half-second perhaps. If load is detected on processes not belonging to the app for multiple queries in a row, then the app will suspend computation. Once the load stays low for some number of queries, it resumes. Multiple queries are required because the CPU load readouts are notorious for brief spikes. There are still caveats: 1. Users are still encouraged to manually reconfigure BOINC to fit their machine's specs. 2. If BOINC is run without admin privileges then it won't be aware of processes started by other users (including some service processes), so it may unfairly compete with those for CPU resources.
Regarding SMT (HyperThreading, Compute Modules):
Most SMT siblings will report as hardware cores or threads these days, which is usually not good because few applications perform optimally when scaled across every core of an SMT system. To make matters worse, querying whether a core is shared (SMT) or dedicated often fails to yield the expected results. In some cases the OS itself simply doesn't know (Windows 7 being unaware of AMD Bulldozer's shared core design, for example). If you can get a reliable SMT count, then the rule of thumb is to count each SMT as half a thread for CPU-intensive tasks, and as a full thread for mostly-idle tasks. But in reality, the weight of the SMT depends on what sort of computation it's doing and on the target architecture. Intel's and AMD's SMT implementations behave almost opposite to each other, for example -- Intel's is strong at running tasks loaded with integer and branching ops in parallel, while AMD's is strong at running SIMD and memory ops in parallel.
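In code, all you reliably get from the JVM is the logical CPU count, so the half-a-thread rule of thumb is easiest to apply via configuration. A tiny sketch; the system property name is an assumption:
// Rough sketch: availableProcessors() counts logical CPUs (SMT siblings included).
// There is no reliable pure-Java way to detect SMT, so expose it as configuration.
public class CoreCount {
    public static int cpuBoundThreads(boolean smtEnabled) {
        int logical = Runtime.getRuntime().availableProcessors();
        // Rule of thumb from above: count each SMT sibling as half a core for CPU-heavy work.
        return smtEnabled ? Math.max(1, logical / 2) : logical;
    }

    public static void main(String[] args) {
        boolean smt = Boolean.parseBoolean(System.getProperty("app.smt", "true")); // assumed flag
        System.out.println("CPU-bound worker threads: " + cpuBoundThreads(smt));
    }
}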
Regarding Turbo Features:
Most CPUs these days have very effective built-in turbo support that further lessens the value gained from scaling across all cores of the system. Worse, the turbo feature is sometimes based as much on the real temperature of the system as on CPU load, so the cooling system of the tower itself affects speed as much as the CPU specs do. On a particular AMD A10 (Bulldozer), for example, I observed it running at 3.7 GHz on two threads. It dropped to 3.5 GHz when a third thread was started, and to 3.4 GHz when a fourth was started. Since it has an integrated GPU as well, it dropped all the way to approximately 3.0 GHz when four threads plus the GPU were working (the A10 CPU internally gives priority to the GPU in high-load scenarios), but it could still muster 3.6 GHz with 2 threads and the GPU active. Since my application used both CPU and GPU, this was a critical discovery. I was able to improve overall performance by limiting the process to two CPU-bound threads (the other two shared cores were still helpful: they served as GPU-servicing threads, able to wake up and respond quickly to push new data to the GPU as needed).
... but at the same time, my application at 4x threads may have performed much better on a system with a higher-quality cooling device installed. It's all so very complicated.
Conclusion: There is no good answer, and because the field of CPU SMT/turbo design keeps evolving, I doubt there will be a good answer anytime soon. Any decent heuristic you formulate today may very well not produce ideal results tomorrow. So my recommendation is: don't waste much time on it. Rough-guess something based on core counts that suits your local purposes well enough, allow it to be overridden by config/switch, and move on.
You can get the number of processors available to the JVM like this:
Runtime.getRuntime().availableProcessors()
Calculating the optimal number of threads from the number of available processors is unfortunately not trivial, however. It depends a lot on the characteristics of the application: for instance, with a CPU-bound application, having more threads than the number of processors makes little sense, while if the application is mostly IO-bound you might want to use more threads. You also need to take into account whether other resource-intensive processes are running on the system.
I think the best strategy would be to decide the optimal number of threads empirically for each of the hardware configurations, and then use those numbers in your application.
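A minimal harness for that empirical approach might look like the sketch below; the placeholder workload should be replaced with a representative slice of your real application:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: run the same fixed amount of work with different pool sizes and compare throughput.
public class EmpiricalThreadSweep {
    public static void main(String[] args) throws Exception {
        final int totalTasks = 50_000;
        int maxThreads = Runtime.getRuntime().availableProcessors() * 4;
        for (int threads = 1; threads <= maxThreads; threads *= 2) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            final AtomicLong remaining = new AtomicLong(totalTasks);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    while (remaining.getAndDecrement() > 0) {
                        doRepresentativeWork();          // placeholder for your real task
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%3d threads -> %.0f tasks/sec%n", threads, totalTasks / seconds);
        }
    }

    static volatile double sink;                         // keeps the JIT from removing the work

    static void doRepresentativeWork() {
        // Replace with a realistic mix of CPU and I/O for your application.
        sink = Math.sqrt(System.nanoTime());
    }
}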
I agree with the other answers here that recommend a best-guess approach, and providing configuration for overriding the defaults.
In addition, if your application is particularly CPU-intensive, you may want to look into "pinning" your application to particular processors.
You don't say what your primary operating system is, or whether you're supporting multiple operating systems, but most have some way of doing this. Linux, for instance, has taskset.
A common approach is to avoid CPU 0 (always used by the OS), and to set your application's cpu affinity to a group of CPUs that are in the same socket.
Keeping the app's threads away from cpu 0 (and, if possible, away from other applications) often improves performance by reducing the amount of task switching.
Keeping the application on one socket can further increase performance by reducing cache invalidation as your app's threads switch among cpus.
As with everything else, this is highly dependent on the architecture of the machine that you are running on, as well as on what other applications are running.
Use the VisualVM tool to monitor threads. First create a minimal number of threads in the program and look at its performance, then increase the number of threads within the program and analyze the performance again. Hope this helps.
I use this Python script here to determine the number of cores (and memory, etc.) to launch my Java application with optimum parameters and ergonomics. PlatformWise on Github
It works like this: write a Python script which calls getNumberOfCPUCores() in the above script to get the number of cores, and getSystemMemoryInMB() to get the RAM. You can pass that information to your program via command-line arguments. Your program can then use the appropriate number of threads based on the number of cores.
Creating threads at the application level is good, and on a multicore processor separate threads are executed on separate cores to enhance performance. So to utilize the cores' processing power, it is best practice to implement threading.
What I think:
At any given time, only 1 thread of a program will execute on 1 core.
The same application with 2 threads will execute in half the time on 2 cores.
The same application with 4 threads will execute even faster on 4 cores.
So the application you are developing should have a threading level <= the number of cores.
Thread execution time is managed by the operating system and is a highly unpredictable activity. CPU execution time is known as a time slice or a quantum. If we create more and more threads, the operating system spends a fraction of this time slice deciding which thread goes first, thus reducing the actual execution time each thread gets. In other words, each thread will do less work if there is a large number of threads queued up.
Read this to understand how to actually utilize CPU cores - fantastic content:
csharp-codesamples.com/2009/03/threading-on-multi-core-cpus/

High CPU, possibly due to context switching?

One of our servers is experiencing a very high CPU load with our application. We've looked at various stats and are having issues finding the source of the problem.
One of the current theories is that there are too many threads involved and that we should try to reduce the number of concurrently executing threads. There's just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE - GlassFish). At any given moment, there are about 620 separate network IO operations that need to be conducted in parallel (using java.nio is not an option either). Moreover, there are roughly 100 operations that have no IO involved and are also executed in parallel.
This structure is not efficient and we want to see if it is actually causing damage, or is simply bad practice. The reason is that any change is quite expensive in this system (in terms of man-hours), so we need some proof of an issue.
So now we're wondering if context switching of threads is the cause, given there are far more threads than the required concurrent operations. Looking at the logs, we see that on average there are 14 different threads executed in a given second. If we take into account the existence of two CPUs (see below), then it is 7 threads per CPU. This doesn't sound like too much, but we wanted to verify this.
So - can we rule out context switching or too-many-threads as the problem?
General Details:
Java 1.5 (yes, it's old), running on CentOS 5, 64-bit, Linux kernel 2.6.18-128.el5
There is only one single Java process on the machine, nothing else.
Two CPUs, under VMware.
8GB RAM
We don't have the option of running a profiler on the machine.
We don't have the option of upgrading the Java, nor the OS.
UPDATE
As advised below, we've conducted captures of load average (using uptime) and CPU (using vmstat 1 120) on our test server with various loads. We've waited 15 minutes between each load change and its measurements to ensure that the system stabilized around the new load and that the load average numbers are updated:
50% of the production server's workload: http://pastebin.com/GE2kGLkk
34% of the production server's workload: http://pastebin.com/V2PWq8CG
25% of the production server's workload: http://pastebin.com/0pxxK0Fu
CPU usage appears to be reduced as the load reduces, but not on a very drastic level (change from 50% to 25% is not really a 50% reduction in CPU usage). Load average seems uncorrelated with the amount of workload.
There's also a question: given our test server is also a VM, could its CPU measurements be impacted by other VMs running on the same host (making the above measurements useless)?
UPDATE 2
Attaching the snapshot of the threads in three parts (pastebin limitations)
Part 1: http://pastebin.com/DvNzkB5z
Part 2: http://pastebin.com/72sC00rc
Part 3: http://pastebin.com/YTG9hgF5
Seems to me the problem is the 100 CPU-bound threads more than anything else. The 3000-thread pool is basically a red herring, as idle threads don't consume much of anything. The I/O threads are likely sleeping "most" of the time, since I/O is measured on a geologic time scale in terms of computer operations.
You don't mention what the 100 CPU threads are doing, or how long they last, but if you want to slow down a computer, dedicating 100 threads of "run until the time slice says stop" will most certainly do it. Because you have 100 threads that are always ready to run, the machine will context switch as fast as the scheduler allows. There will be pretty much zero idle time. Context switching will have an impact because you're doing it so often. Since the CPU-bound threads are (likely) consuming most of the CPU time, your I/O "bound" threads are going to be waiting in the run queue longer than they're waiting for I/O. So even more processes are waiting (the I/O processes just bail out more often, as they hit an I/O barrier quickly, which idles the process out for the next one).
No doubt there are tweaks here and there to improve efficiency, but 100 CPU threads are 100 CPU threads. Not much you can do there.
I think your constraints are unreasonable. Basically what you are saying is:
1. I can't change anything.
2. I can't measure anything.
Can you please speculate as to what my problem might be?
The real answer to this is that you need to hook a proper profiler to the application and you need to correlate what you see with CPU usage, Disk/Network I/O, and memory.
Remember the 80/20 rule of performance tuning: 80% will come from tuning your application. You might just have too much load for one VM instance, and it could be time to consider solutions for scaling horizontally or vertically by giving more resources to the machine. It could be that any one of the 3 billion JVM settings is not in line with your application's execution profile.
I assume the 3000-thread pool came from the famous "more threads = more concurrency = more performance" theory. The real answer is that a tuning change isn't worth anything unless you measure throughput and response time before and after the change and compare the results.
If you can't profile, I'd recommend taking a thread dump or two and seeing what your threads are doing. Your app doesn't have to stop to do it:
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/threads.html
http://java.net/projects/tda/
http://java.sys-con.com/node/1611555
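If even attaching a tool is off the table, a snapshot can also be logged from inside the process itself using APIs that exist on Java 5; a rough sketch (the grouping by state is just one way to summarize it):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Sketch: in-process thread snapshot, usable when no profiler can be attached.
public class ThreadSnapshot {
    public static void dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + mx.getThreadCount()
                + ", peak: " + mx.getPeakThreadCount());

        Map<Thread.State, Integer> byState = new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread.State s = e.getKey().getState();
            Integer n = byState.get(s);
            byState.put(s, n == null ? 1 : n + 1);
        }
        System.out.println("threads by state: " + byState);
        // RUNNABLE threads roughly correspond to the ones competing for CPU;
        // a count far above the number of (virtual) CPUs supports the context-switching theory.
    }
}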
So - can we rule out context switching or too-many-threads as the problem?
I think your concerns over thrashing are warranted. A thread pool with 3000 threads (700+ concurrent operations) on a 2-CPU VMware instance certainly seems like a problem that may be causing context-switching overload and performance problems. Limiting the number of threads could give you a performance boost, although determining the right number is going to be difficult and will probably require a lot of trial and error.
we need some proof of an issue.
I'm not sure the best way to answer but here are some ideas:
Watch the load average of the VM OS and the JVM. If you are seeing high load values (20+) then this is an indicator that there are too many things in the run queues.
Is there no way to simulate the load in a test environment so you can play with the thread pool numbers? If you run simulated load in a test environment with pool size of X and then run with X/2, you should be able to determine optimal values.
Can you compare high load times of day with lower load times of day? Can you graph number of responses to latency during these times to see if you can see a tipping point in terms of thrashing?
If you can simulate load, then make sure you aren't just testing under the "drink from the fire hose" methodology. You need simulated load that you can dial up and down. Start at 10% and slowly increase the simulated load while watching throughput and latency. You should be able to see the tipping points by watching for throughput flattening or otherwise deflecting.
Usually, context switching between threads is computationally very cheap, but when it involves this many threads... you just can't know. You say upgrading to Java 1.6 EE is out of the question, but what about some hardware upgrades? It would probably provide a quick fix and shouldn't be that expensive...
e.g. run a profiler on a similar machine.
try a newer version of Java 6 or 7. (It may not make a difference, in which case don't bother upgrading production)
try Centos 6.x
try not using VMware.
try reducing the number of threads. You only have 8 cores.
You may find that all or none of the above options make a difference, but you won't know until you have a system you can test on with a known/repeatable workload.

Advantages to use Java on Solaris

On many forums I found that people use Solaris for their Java applications.
I'm interested in what the main advantages of such a combination are.
My first assumption is that Solaris is very fast.
I also found out that on Solaris it is possible to match Java threads one-to-one with kernel threads - as I understand it, this again results in very fast thread creation.
Please correct me if I'm wrong and are there any other main points?
What Solaris gives you (as software, not hardware) over Linux or Windows is greater system manageability and low-level tracing such as DTrace.
What you appear to be asking about is having more threads running concurrently, which is a feature of the hardware. If you run Solaris x86, Linux, or Windows on the same hardware, you will have the same number of logical threads. However, Solaris also runs on SPARC processors which support many logical threads (32 or more) running concurrently, which reduces overhead if you actually need that many threads.
The http://en.wikipedia.org/wiki/SPARC_T3 processor supports up to 512 logical threads across 16 cores. This can really improve performance where you have a need for so many threads, e.g. when using many blocking IO connections.
However, if you need only one to six critical threads (and a bunch of non-critical threads), a plain x64 processor will be much faster and cheaper, as it is designed to handle fewer threads faster and is mass-produced on a larger scale.
We use Solaris for java applications at my workplace. I do not know about any exact performance advantage, but the reasons we decided to use Solaris were:
Solaris Service Management Facility (http://www.oreillynet.com/pub/a/sysadmin/2006/04/13/using-solaris-smf.html)
Ability to copy the entire zone backup to another box in case of HW failure.
We run application servers such as WebLogic, and it helps that SMF starts them back up if they crash for any reason. Also, we back up our zones at regular intervals, and from what I hear, a zone can be moved to another machine in case of hardware failure, bringing the application back to normal.

Java TCP/IP Socket Latency - stuck at 50 μs (microseconds)? (used for Java IPC)

We have been profiling and profiling our application to reduce latency as much as possible. Our application consists of 3 separate Java processes, all running on the same server, which pass messages to each other over TCP/IP sockets.
We have reduced the processing time in the first component to 25 μs, but we see that the TCP/IP socket write (on localhost) to the next component invariably takes about 50 μs. We also see one other anomalous behavior, in that the component which is accepting the connection can write faster (i.e. < 50 μs). Right now, all the components are running at < 100 μs with the exception of the socket communications.
Not being a TCP/IP expert, I don't know what could be done to speed this up. Would Unix domain sockets be faster? MemoryMappedFiles? What other mechanisms could possibly be a faster way to pass the data from one Java process to another?
UPDATE 6/21/2011
We created 2 benchmark applications, one in Java and one in C++, to benchmark TCP/IP more tightly and to compare. The Java app used NIO (blocking mode), and the C++ app used the Boost ASIO TCP library. The results were more or less equivalent, with the C++ app about 4 μs faster than Java (but in one of the tests Java beat C++). Also, both versions showed a lot of variability in the time per message.
I think we are agreeing with the basic conclusion that a shared memory implementation is going to be the fastest. (Although we would also like to evaluate the Informatica product, provided it fits the budget.)
If using native libraries via JNI is an option, I'd consider implementing IPC as usual (search for IPC, mmap, shm_open, etc.).
There's a lot of overhead associated with using JNI, but at least it's a little less than the full system calls needed to do anything with sockets or pipes. You'll likely be able to get down to about 3 microseconds one-way latency using a polling shared memory IPC implementation via JNI. (Make sure to use the -Xcomp JVM option or adjust the compilation threshold, too; otherwise your first 10,000 samples will be terrible. It makes a big difference.)
I'm a little surprised that a TCP socket write is taking 50 microseconds - most operating systems optimize TCP loopback to some extent. Solaris actually does a pretty good job of it with something called TCP Fusion. And if there has been any optimization for loopback communication at all, it's usually been for TCP. UDP tends to get neglected - so I wouldn't bother with it in this case. I also wouldn't bother with pipes (stdin/stdout or your own named pipes, etc.), because they're going to be even slower.
And generally, a lot of the latency you're seeing is likely coming from signaling - either waiting on an IO selector like select() in the case of sockets, or waiting on a semaphore, or waiting on something. If you want the lowest latency possible, you'll have to burn a core sitting in a tight loop polling for new data.
Of course, there's always the commercial off-the-shelf route - which I happen to know for a certainty would solve your problem in a hurry - but of course it does cost money. And in the interest of full disclosure: I do work for Informatica on their low-latency messaging software. (And my honest opinion, as an engineer, is that it's pretty fantastic software - certainly worth checking out for this project.)
"The O'Reilly book on NIO (Java NIO, page 84), seems to be vague about
whether the memory mapping stays in memory. Maybe it is just saying
that like other memory, if you run out of physical, this gets swapped
back to disk, but otherwise not?"
On Linux, the mmap() call allocates pages in the OS page cache area (which periodically get flushed to disk and can be evicted based on Clock-PRO, an approximation of the LRU algorithm). So the answer to your question is yes: a memory-mapped buffer can (in theory) be evicted from memory unless it is locked with mlock(). In practice, I think that is hardly possible if your system is not swapping; in that case, the first victims would be page buffers.
See my answer to fastest (low latency) method for Inter Process Communication between Java and C/C++ - with memory mapped files (shared memory) java-to-java latency can be reduced to 0.3 microsecond
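For reference, the core of such a shared-memory channel can even be sketched in pure Java with a MappedByteBuffer and busy polling, no JNI required. The file path, 16-byte layout and single-writer/single-reader assumption are illustrative, and plain putLong/getLong give no cross-process memory-ordering guarantees, so a production version would need properly ordered accesses (e.g. via JNI or sun.misc.Unsafe):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: single-writer/single-reader channel over a memory-mapped file.
// Layout: bytes [0..7] = sequence number, [8..15] = payload (a timestamp here).
public class MmapChannel {
    static final int SEQ = 0, PAYLOAD = 8, SIZE = 16;

    static MappedByteBuffer map(String path) throws Exception {
        RandomAccessFile file = new RandomAccessFile(path, "rw");
        return file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, SIZE);
    }

    // Writer process: publish the payload first, then bump the sequence number.
    static void write(MappedByteBuffer buf, long seq, long payload) {
        buf.putLong(PAYLOAD, payload);
        buf.putLong(SEQ, seq);
    }

    // Reader process: burn a core polling for the next sequence number.
    static long pollRead(MappedByteBuffer buf, long expectedSeq) {
        while (buf.getLong(SEQ) != expectedSeq) {
            // busy spin - lowest latency, costs a full core
        }
        return buf.getLong(PAYLOAD);
    }
}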
MemoryMappedFiles are not a viable solution for low-latency IPC at all - if a mapped segment of memory gets updated, it will eventually be synced to disk, introducing an unpredictable delay measured in milliseconds at least. For low latency, one can try combinations of either shared memory + message queues (notifications), or shared memory + semaphores. This works on all Unixes, especially the System V version (not POSIX), but if you run the application on Linux you are pretty safe with POSIX IPC (most features are available in the 2.6 kernel). Yes, you will need JNI to get this done.
Update: I forgot that this is JVM-to-JVM IPC, and we already have GC pauses which we cannot fully control, so an additional pause of several milliseconds due to OS file buffers being flushed to disk may be acceptable.
Check out https://github.com/pcdv/jocket
It's a low-latency replacement for local Java sockets that uses shared memory.
RTT latency between 2 processes is well below 1us on a modern CPU.
