Measure context switching in Java

I have a Java system doing a lot of I/O operations.
I understand that non-CPU-bound tasks can benefit from having more threads than there are CPUs.
Because the time taken by I/O operations is non-deterministic, I don't know how many threads I should initialize in the pool. I want to measure the context switching caused by the number of threads I initialize in my Java program.
Based on that context-switching overhead, I then want to tune the size of the thread pool.

There are two options here:
You engage in "real" profiling. Meaning: you learn about tools that help you monitor such aspects of your application. See here for starters, or there for a broader variety of tools.
You simply experiment. Set your pool size to 100, 500, 1000, and see what happens. When you have good insight into the clients using your system, it might be sufficient to just tune that parameter and see how/if it affects your users (a rough sketch of such an experiment follows below).
Obviously, option 1 results in better understanding - but it also requires more work.
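A minimal sketch of the experimental approach, assuming a hypothetical ioTask() method standing in for one of your real I/O operations; the pool sizes and task count are only illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PoolSizeExperiment {
        public static void main(String[] args) throws Exception {
            int taskCount = 10_000;                                        // illustrative workload size
            for (int poolSize : new int[]{100, 500, 1000}) {
                ExecutorService pool = Executors.newFixedThreadPool(poolSize);
                long start = System.nanoTime();
                List<Future<?>> futures = new ArrayList<>();
                for (int i = 0; i < taskCount; i++) {
                    futures.add(pool.submit(PoolSizeExperiment::ioTask));  // submit the hypothetical I/O task
                }
                for (Future<?> f : futures) {
                    f.get();                                               // wait for all tasks to finish
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("poolSize=" + poolSize + " took " + elapsedMs + " ms");
                pool.shutdown();
            }
        }

        // Placeholder for one of your real I/O operations.
        private static void ioTask() {
            try {
                Thread.sleep(10); // simulate I/O latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

The pool size at which total time stops improving (or starts getting worse) is a reasonable first candidate for production, which you can then confirm with a real profiler.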

You can use a good profiling tool like AppDynamics to measure how much time your program spends in I/O and CPU wait, get lots of interesting insights about your program, and optimize your code accordingly.
Once you have those insights, you can gradually test with different thread pool sizes, see the effect in AppDynamics, and choose the size that gives you the best performance.
Note: AppDynamics provides a free trial version and comes as a SaaS. I've used it multiple times and like it a lot.

Related

How do I go about monitoring the performance of a Java process?

I need to monitor the performance of a Java process and take reports automatically. The reports should contain data on memory utilization, thread usage, process usage etc. But I'm unsure how to accomplish this. Any suggestions?
I need to monitor the performance of a Java process and take reports automatically.
You need to determine what measures are important to the users of the application, like latency and throughput. These are often impacted even if everything looks fine system-wise. For example, an 8-CPU system which is only 6% busy over 5 minutes might sound fine, except it could be that there is one request every 5 minutes which takes more than 2 minutes.
The reports should contain data on memory utilization, thread usage,
A key feature of threads is that they share objects by default. This means the thread-local memory usage is almost always trivial and not worth measuring in general.
process usage etc.
This can be useful for capacity planning over a long period of time, but not useful for finding application-specific problems (see above).
But I'm unsure how to accomplish this. Any suggestions?
Work out what metrics will help you find problems which impact the users of the application.
You may use the JMX API for this purpose if you want to get the data programmatically. Here is Oracle's tutorial on this topic.
If you just want to monitor the process, there are tools like VisualVM.
VisualVM is a nice tool to monitor memory utilization and other things.
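For the programmatic JMX route mentioned above, here is a minimal sketch using the standard java.lang.management MXBeans; what you log, how often, and where you write the report are up to you:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;
    import java.lang.management.ThreadMXBean;

    public class JvmStatsReporter {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                System.out.printf("heap used=%d MB, committed=%d MB, threads=%d, peak threads=%d%n",
                        heap.getUsed() / (1024 * 1024),
                        heap.getCommitted() / (1024 * 1024),
                        threads.getThreadCount(),
                        threads.getPeakThreadCount());
                Thread.sleep(5_000); // sample every 5 seconds; write to a file instead for automated reports
            }
        }
    }

These are the same MXBeans VisualVM reads, so the numbers should line up with what you see in its GUI.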

How do I begin to optimise my program [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have a web server program written in Java and my boss wants it to run faster.
I've always been happy if it ran without error, so efficiency is new to me.
I tried a profiler, but it crashed my computer and turned out to be a dead open-source project.
I have no idea what I am doing apart from reading a few questions on here. I see that refactoring code is the best option, but I'm not sure how to go about that, and that I need a profiler to see what code to refactor.
So does anyone know of a free profiler that I can use? I'm using Java and Eclipse. If possible, some instructions or a link to easy instructions would be great.
But what I really want, if anyone can give it, is a basic introduction to the subject so I can understand enough to go do in-depth research on the subject and get the best results.
I am a complete beginner when it comes to optimising code, and the subject seems very complex from what I have seen so far; any help with how to get started would be greatly appreciated.
I'm new to Java as well, so saying things like "check garbage collection" would mean nothing to me; I'd need a more detailed explanation.
EDIT: the program uses Tomcat for the networking. It connects to an SQL database. The main function is a polling loop which checks all attached devices on the network, reads events from them, writes the events to the database and then performs the event functions.
I am trying to improve the polling loop. The program is heavily multithreaded and uses a lot of interfaces and proxies, so it is hard to see where code goes the further you get from the polling loop.
I hope this information helps you offer solutions. Also, I did not build it; I inherited the code.
First of all, detect the bottlenecks. There is no point in optimizing a method from 500ms to 400ms when there is another method running for 5 seconds that should run in 100ms.
You can try using VisualVM as a profiler, which is built into the JDK.
If you want a free profiler, use VisualVM, which comes with Java. It is likely to be enough.
You should ask your boss exactly what he would like to go faster. There is no point optimising random pieces of code he/she might not care about. (It's easily done.)
You can also log key points in your task/request to determine where it spends the most time.
EDIT: the program uses Tomcat for the networking. It connects to an SQL database. The main function is a polling loop which checks all attached devices on the network, reads events from them, writes the events to the database and then performs the event functions.
I am trying to improve the polling loop. The program is heavily multithreaded and uses a lot of interfaces and proxies, so it is hard to see where code goes the further you get from the polling loop.
This sounds like you have a heavily I/O bound application. There really isn't much that you can do about that because I/O bound applications aren't inefficiently using the CPU--they're stuck waiting for I/O operations on other devices to complete.
FWIW, this scenario is actually why a lot of big companies are contemplating moving toward cheap, ARM-based solutions. They're wasting a lot of power and resources on powerful x86 CPUs that get underutilized while their code sits there waiting for a remote MySQL or Oracle server to finish doing its thing. With such an application, why throw more CPU than you need?
If you're new to Java, then optimization sounds like a bad idea. It's very easy to get wrong, and it's not trivial to rewrite code and keep all the outputs the same while changing the inner workings.
Possibly have a look at your stored procedures and replace any IN statements with an INNER JOIN. That's a fairly low-risk and high-reward way of speeding things up.
Start by identifying the time taken by the various steps in your application (use logging to identify them). Notice if there is anything unusual.
Step into each of these steps to see if there are any bottlenecks. Identify whether something can be cached to save a DB call. Identify whether there is scope for parallelism by breaking your tasks down into independent units.
Hopefully you have some unit/integration tests to ensure you don't accidentally break anything.
Measure (with a profiler - as others suggested, VisualVM is good) and locate the spots where your program spends most of its time.
Analyze the hot spots and try to improve their performance.
Measure again to verify that your changes had the expected effect.
If needed, repeat from step 1.
Start very simple.
Make a list of what's slow from a user perspective.
Try to do high-level profiling yourself. Maybe an interceptor that prints the run time for your actions.
Then profile only those actions with start time = System.currentTime...
This easy way could be a starting point into more advanced profiling, and if you're lucky it may fix your problems.
Before you start optimizing, you have to understand your workload, and you have to be able to recreate that workload. One easy way to do that is to log all requests, in production, with enough detail that you can recreate the requests in a development environment.
At the same time that you log your load, you can also log the performance of those requests: the time from the start of the request to the end. One way to do that (and, incidentally, to capture the data needed to log the request) is to add a servlet filter into your stack.
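A minimal sketch of such a timing filter, assuming the javax.servlet API; the class name and log format are illustrative:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    public class RequestTimingFilter implements Filter {

        public void init(FilterConfig config) { }

        public void destroy() { }

        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            long start = System.nanoTime();
            try {
                chain.doFilter(request, response);                  // let the request run as usual
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                String uri = (request instanceof HttpServletRequest)
                        ? ((HttpServletRequest) request).getRequestURI() : "?";
                System.out.println(uri + " took " + elapsedMs + " ms"); // swap in your logging framework
            }
        }
    }

Map the filter to /* in web.xml (or with @WebFilter) and you capture both the request log and its timing in one place.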
Then you can start to think about optimization.
1. Establish performance goals. Simply saying "make it faster" is pointless. Instead, you need to establish goals such as "all pages should respond within 1.5 seconds, as long as there are fewer than 100 concurrent users."
2. Identify the requests that fail your performance goals. Focus on the biggest failure first.
3. Identify why the request takes so long.
To do #3, you need to be able to recreate load in a development environment. Then you can either use a profiler, or simply add trace-level logging into your application to find out how long each step of the process takes.
There is also a whole field of holistic optimization, of which garbage collection tuning is probably the most important. But again, you need to establish and replicate your workload, otherwise you'll be flailing.
When starting to optimize an application, the main risk is trying to optimize every step, which often does not improve the program's efficiency as expected and results in unmaintainable code.
It is likely that 80% of the execution time of your program is caused by a single step, which is itself only 20% of the code base.
The first thing to do is to identify this bottleneck. For example, you can log timestamps (using System.nanoTime and/or System.currentTimeMillis and your favorite logging framework) to do this.
Once the step has been identified, try to write a test class which runs this step, and run it with a profiler. I have good experience with both HPROF (http://java.sun.com/developer/technicalArticles/Programming/HPROF.html), although it might require some time to get familiar with, and the Eclipse Test and Performance Tools Platform (http://www.eclipse.org/tptp/). If you have never used a profiler, I recommend you start with Eclipse TPTP.
The execution profile will help you find out in what methods your program spends time. Once you know them, look at the source code, and try to understand why it is slow. It might be because (this list is not exhaustive) :
unnecessary costly operations are performed,
a sub-optimal algorithm is used,
the algorithm generates lots of objects, thus giving a lot of work to the garbage collector (especially true for objects which have a medium to long life expectancy).
If there is no visible defect in the code, then you might consider :
making the algorithm more parallel in order to leverage all your CPUs
buying faster hardware.
Regarding JVM options, the two most important ones for performance are:
-server, in order to use the server VM (enabled by default depending on the hardware) which provides better performance at the price of a slower startup (http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client),
-Xms and -Xmx, which define the heap size available at startup and the maximum amount of memory that the JVM can use. If the JVM is not given enough memory, garbage collection will use a lot of your CPU resources, slowing down your program; however, if the JVM already has enough memory, increasing the heap size will not improve performance, and might even cause longer GC pauses. (http://stackoverflow.com/questions/1043817/speed-tradeoff-of-javas-xms-and-xmx-options)
Other parameters usually have a lower impact; you can consult them at http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html.
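For example, a command line combining these options might look like the following; the heap sizes and jar name are only illustrative:

    java -server -Xms1g -Xmx1g -jar myapp.jar

Setting -Xms and -Xmx to the same value avoids heap resizing at runtime, which is a common choice for server processes.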

Java Memory Usage / Thread Pool Performance Problem

These things obviously require close inspection and availability of code to thoroughly analyze and give good suggestions. Nevertheless, that is not always possible and I hope it may be possible to provide me with good tips based on the information I provide below.
I have a server application that uses a listener thread to listen for incoming data. The incoming data is interpreted into application specific messages and these messages then give rise to events.
Up to that point I don't really have any control over how things are done.
Because this is a legacy application, these events were previously taken care of by that same listener thread (largely a single-threaded application). The events are sent to a blackbox and out comes a result that should be written to disk.
To improve throughput, I wanted to employ a threadpool to take care of the events. The idea being that the listener thread could just spawn new tasks every time an event is created and the threads would take care of the blackbox invocation. Finally, I have a background thread performing the writing to disk.
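A minimal sketch of that design, with hypothetical callBlackBox and result-queue names standing in for the real application code:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    public class EventPipeline {
        private final ExecutorService workers = Executors.newFixedThreadPool(8);       // pool size to be tuned
        private final BlockingQueue<String> writeQueue = new LinkedBlockingQueue<>();  // drained by the writer thread

        // Called by the listener thread for every event it produces.
        public void onEvent(final String event) {
            workers.submit(new Runnable() {
                public void run() {
                    String result = callBlackBox(event);   // hypothetical blackbox invocation
                    writeQueue.offer(result);              // hand the result to the background writer
                }
            });
        }

        private String callBlackBox(String event) {
            return event;                                  // placeholder for the real blackbox call
        }
    }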
With just the previous setup and the background writer, everything works OK and the throughput is ~1.6 times more than previously.
When I add the thread pool, however, performance degrades. At the start, everything seems to run smoothly, but then after a while everything is very slow and finally I get OutOfMemoryExceptions. The weird thing is that when I print the number of active threads each time a task is added to the pool (along with info on how many tasks are queued and so on), it looks as if the thread pool has no problem keeping up with the producer (the listener thread).
Using top -H to check for CPU usage, it's quite evenly spread out at the outset, but at the end the worker threads are barely ever active and only the listener thread is active. Yet it doesn't seem to be submitting more tasks...
Can anyone hypothesize a reason for these symptoms? Do you think it's more likely that there's something in the legacy code (that I have no control over) that just goes bad when multiple threads are added? The out of memory issue should be because some queue somewhere grows too large but since the threadpool almost never contains queued tasks it can't be that.
Any ideas are welcome. Especially ideas of how to more efficiently diagnose a situation like this. How can I get a better profile on what my threads are doing etc.
Thanks.
Slowing down then out of memory implies a memory leak.
So I would start by using some Java memory analyzer tools to identify if there is a leak and what is being leaked. Sometimes you get lucky and the leaked object is well-known and it becomes pretty clear who is hanging on to things that they should not.
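If you need a heap dump to feed into such a tool, two common ways to get one on a HotSpot JVM are shown below; the file and jar names are illustrative. The resulting .hprof file can then be opened in a memory analyzer such as VisualVM or Eclipse MAT.

    jmap -dump:live,format=b,file=heap.hprof <pid>
    java -XX:+HeapDumpOnOutOfMemoryError -jar server.jar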
Thank you for the answers. I read up on Java VisualVM and used that as a tool. The results and conclusions are detailed below. Hopefully the pictures will work long enough.
I first ran the program and created some heap dumps thinking I could just analyze the dumps and see what was taking up all the memory. This would probably have worked except the dump file got so large and my workstation was of limited use in trying to access it. After waiting two hours for one operation, I realized I couldn't do this.
So my next option was something I, stupidly enough, hadn't thought about. I could just reduce the number of messages sent to the application, and the trend of increasing memory usage should still be there. Also, the dump file will be smaller and faster to analyze.
It turns out that when sending messages at a slower rate, no out of memory issue occurred! A graph of the memory usage can be seen below.
The peaks are results of cumulative memory allocations and the troughs that follow are after the garbage collector has run. Although the amount of memory usage certainly is quite alarming and there are probably issues there, no long term trend of memory leakage can be observed.
I started to incrementally increase the rate of messages sent per second to see where the application hits the wall. The image below shows a very different scenario than the previous one...
Because this happens when the rate of messages sent are increased, my guess is that my freeing up the listener thread results in it being able to accept a lot of messages very quickly and this causes more and more allocations. The garbage collector doesn't run and the memory usage hits a wall.
There's of course more to this issue but given what I have found out today I have a fairly good idea of where to go from here. Of course, any additional suggestions/comments are welcome.
This question should probably be recategorized as dealing with memory usage rather than threadpools... The threadpool wasn't the problem at all.
I agree with #djna.
The thread pool in Java's concurrency package works: it does not create threads if it does not need them, and you see that the number of threads is as expected. This means that probably something in your legacy code is not ready for multithreading. For example, some code fragment is not synchronized; as a result, some element is not removed from a collection, or some additional elements are stored in a collection, so the memory usage grows (see the sketch after this answer).
BTW, I did not understand exactly which part of the application uses the thread pool now. Did you have one thread that processes events, and now you have several threads that do this? Have you perhaps changed the inter-thread communication mechanism? Added queues? This may be yet another direction for your investigation.
Good luck!
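To illustrate the kind of defect described above (a made-up example, not the actual legacy code): a collection shared by several threads without synchronization can corrupt its internal state or retain entries that should have been removed, so memory keeps growing.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class PendingEvents {
        // BROKEN under concurrency: HashMap is not thread-safe, so concurrent put/remove
        // calls can corrupt the map or leave entries behind, and memory grows over time.
        private final Map<Long, String> pendingUnsafe = new HashMap<>();

        // One possible fix: a thread-safe map (consistently synchronizing every access would also work).
        private final Map<Long, String> pendingSafe = new ConcurrentHashMap<>();

        public void add(long id, String event) {
            pendingSafe.put(id, event);
        }

        public void complete(long id) {
            pendingSafe.remove(id);   // with the unsafe map, this removal could be lost under races
        }
    }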
As mentioned by djna, it's likely some type of memory leak. My guess would be that you're keeping a reference to the request around somewhere:
In the dispatcher thread that's queuing the requests
In the threads that deal with the requests
In the black box that's handling the requests
In the writer thread that writes to disk.
Since you said everything works fine before you add the thread pool into the mix, my guess would be that the threads in the pool are keeping a reference to the request somewhere. The idea being that, without the threadpool, you aren't reusing threads, so the information goes away.
As recommended by djna, you can use a Java memory analyzer to help figure out where the data is stacking up.

Why is my multithreaded Java program not maxing out all my cores on my machine?

I have a program that starts up and creates an in-memory data model and then creates a (command-line-specified) number of threads to run several string checking algorithms against an input set and that data model. The work is divided amongst the threads along the input set of strings, and then each thread iterates the same in-memory data model instance (which is never updated again, so there are no synchronization issues).
I'm running this on a Windows 2003 64-bit server with 2 quad-core processors, and from looking at Windows Task Manager they aren't being maxed out (nor do they look like they are being particularly taxed) when I run with 10 threads. Is this normal behaviour?
It appears that 7 threads all complete a similar amount of work in a similar amount of time, so would you recommend running with 7 threads instead?
Should I run it with more threads?...Although I assume this could be detrimental as the JVM will do more context switching between the threads.
Alternatively, should I run it with fewer threads?
Alternatively, what would be the best tool I could use to measure this?...Would a profiling tool help me out here - indeed, is one of the several profilers better at detecting bottlenecks (assuming I have one here) than the rest?
Note, the server is also running SQL Server 2005 (this may or may not be relevant), but nothing much is happening on that database when I am running my program.
Note also, the threads are only doing string matching, they aren't doing any I/O or database work or anything else they may need to wait on.
My guess would be that your app is bottlenecked on memory access, i.e. your CPU cores spend most of the time waiting for data to be read from main memory. I'm not sure how well profilers can diagnose this kind of problem (the profiler itself could influence the behaviour considerably). You could verify the guess by having your code repeat the operations it does many times on a very small data set.
If this guess is correct, the only thing you can do (other than getting a server with more memory bandwidth) is to try and increase the locality of your memory access to make better use of caches; but depending on the details of the application that may not be possible. Using more threads may in fact lead to worse performance because of cores sharing cache memory.
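A rough sketch of that check, with a trivial placeholder standing in for the real string-matching code: hammer a tiny, cache-resident input with one thread per core and watch CPU usage. If utilization now approaches 100% per core, memory access is a plausible bottleneck in the full run.

    import java.util.Arrays;
    import java.util.List;

    public class CacheResidencyCheck {
        // Hypothetical stand-in for the real string-checking algorithm.
        private static boolean check(String s) {
            return s.contains("a");
        }

        public static void main(String[] args) throws InterruptedException {
            // A tiny input set that should stay resident in the CPU caches.
            final List<String> smallInput = Arrays.asList("alpha", "beta", "gamma", "delta");
            int cores = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[cores];
            for (int i = 0; i < cores; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        long matches = 0;
                        for (int n = 0; n < 50_000_000; n++) {      // repeat the same work many times
                            for (String s : smallInput) {
                                if (check(s)) matches++;
                            }
                        }
                        System.out.println(Thread.currentThread().getName() + ": " + matches);
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }
    }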
Without seeing the actual code, it's hard to give proper advice. But do make sure that the threads aren't locking on shared resources, since that would naturally prevent them all from working as efficiently as possible. Also, when you say they aren't doing any I/O, are they not reading an input or writing an output either? This could also be a bottleneck.
With regards to CPU-intensive threads, it is normally not beneficial to run more threads than you have actual cores, but in an uncontrolled environment like this, with other big apps running at the same time, you are probably better off simply testing your way to the optimal number of threads.

Parallelization: What causes Java threads to block other than synchronization & I/O?

Short version is in the title.
Long version:
I am working on a program for scientific optimization using Java. The workload of the program can be divided into parallel and serial phases -- parallel phases meaning that highly parallelizable work is being performed. To speed up the program (it runs for hours/days) I create a number of threads equal to the number of CPU cores on the machine I'm using -- typically 4 or 8 -- and divide the work between them. I then start these threads and join() them before proceeding to a serial phase.
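For illustration, a minimal sketch of this kind of divide-start-join structure, with placeholder names standing in for the real work:

    public class ParallelPhase {
        private static final int WORK_ITEMS = 1_000_000;                 // illustrative amount of parallel work

        public static void main(String[] args) throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();      // typically 4 or 8 here
            Thread[] workers = new Thread[cores];
            for (int t = 0; t < cores; t++) {
                final int id = t;
                final int stride = cores;
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        // Each thread processes its own slice of the parallelizable work.
                        for (int i = id; i < WORK_ITEMS; i += stride) {
                            doParallelWork(i);                           // placeholder for the real computation
                        }
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();                                                // wait before entering the serial phase
            }
            doSerialPhase();                                             // placeholder for the serial phase
        }

        private static void doParallelWork(int i) { /* ... */ }
        private static void doSerialPhase() { /* ... */ }
    }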
So far so good. What's bothering me is that the CPU utilization and speedup of the parallel phases is nowhere near the "theoretical maximum" -- e.g. if I have 4 cores, I expect to see somewhere between 350-400% "utilization" (as reported by top) but instead it bounces around between 180 and about 310. Using only a single thread, I get 100% CPU utilization.
The only reasons I know of for threads not to run at full speed are:
-blocking due to I/O
-blocking due to synchronization
No I/O whatsoever is going on in my parallel threads, nor any synchronization -- the only data structures shared by the threads are read-only, and are either basic types or (non-concurrent) collections. So I'm looking for other explanations. One possibility would be that several threads are repeatedly blocking for garbage collection, but that would only seem to make sense in a situation with memory pressure, and I am allocating well above the required maximum heap space.
Any suggestions would be appreciated.
Update: Just in case anyone is curious, after some more investigation I tweaked the code for general performance and am seeing better utilization, even though nothing I changed has to do with synchronization. However, some of the changes should have resulted in fewer new heap allocations; in particular, I got rid of some use of iterators and temporary boxed numbers. (The CERN "Colt" library for high-performance Java computing was useful here: it provides collections like IntArrayList, DoubleArrayList etc. for basic types.) So I think garbage collection was probably the culprit.
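To illustrate the kind of change mentioned in the update (plain primitives shown here; Colt's IntArrayList/DoubleArrayList serve the same purpose with a growable API), replacing boxed temporaries cuts out a lot of short-lived heap allocations:

    import java.util.ArrayList;
    import java.util.List;

    public class BoxingExample {
        // Allocates an Integer object per element (outside the small-value cache)
        // plus list-resizing overhead, all of which becomes garbage-collector work.
        static long sumBoxed(int n) {
            List<Integer> values = new ArrayList<Integer>();
            for (int i = 0; i < n; i++) {
                values.add(i);             // autoboxing creates Integer objects
            }
            long sum = 0;
            for (Integer v : values) {
                sum += v;                  // unboxing on every access
            }
            return sum;
        }

        // The same computation with a primitive array: no per-element allocations.
        static long sumPrimitive(int n) {
            int[] values = new int[n];
            for (int i = 0; i < n; i++) {
                values[i] = i;
            }
            long sum = 0;
            for (int v : values) {
                sum += v;
            }
            return sum;
        }
    }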
All graphics operations run on a single thread in Swing. If they are rendering to the screen, they will effectively be contending for access to this thread.
If you are running on Windows, all graphics operations run on a single thread no matter what. Other operating systems have similar limitations.
It's actually fairly difficult to get the proper granularity of threaded workers sometimes, and it's easy to make them too big or too small, which will typically give you less than 100% usage of all cores.
If you're not rendering much GUI, the most likely culprit is that you're contending more than you think for some shared resource. This is easily seen with profiler tools like JProfiler. Some VMs, like BEA's JRockit, can even tell you this straight out of the box.
This is one of those places where you don't want to act on guesswork. Get a profiler!
First of all, GC will not happen only "in situation with memory pressure", but at any time the JVM sees fit (unpredictable, as far as I know).
Second, if your threads allocate memory in the heap (you mention they use Collections so I guess they do assign memory in the heap), you can never be sure if this memory is currently in RAM or on a Virtual Memory page (the OS decides), and thus access to "memory" may generate blocking I/O access!
Finally, as suggested in a prior answer, you may find it useful to check what happens by using a profiler (or even JMX monitoring might give some hints there).
I believe it will be difficult to get further hints on your problem unless you provide more concrete (code) information.
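One additional low-overhead check, not suggested in the answer above but commonly used: enable GC logging and see whether the utilization dips line up with collections. On a HotSpot JVM the flags look roughly like this (newer JVMs, 9 and later, use -Xlog:gc instead):

    java -verbose:gc -XX:+PrintGCDetails -jar app.jar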
Firstly, I assume you're not doing any other significant work on the box. If you are, that's clearly going to mess with things.
It does sound very odd if you're really not sharing anything. Can you give us more idea of what the code is really doing?
What happens if you run n copies of the program as different Java processes, with each only using a single thread? If that uses each CPU completely, then at least we know that it can't be a problem with the OS. Speaking of the OS, which one is this running on, and which JVM? If you can try different JVMs and different OSes, the results might give you a hint as to what's wrong.
Also an important point: which hardware do you use?
E.g. 4-8 cores could mean you are working on one of Sun's Niagara CPUs. And despite having 4-8 cores, they have fewer FPUs. When computing scientific stuff, it could happen that the FPU is the bottleneck.
You try to use the full CPU capability for your calculations but the OS itself uses resources as well. So be aware that the OS will block some of your execution in order to satisfy its needs.
You are doing synchronization at some level.
Perhaps only in the memory allocation system, including garbage collection. While the JVM vendor has worked to keep blocking in these areas to a minimum, they can't reduce it to zero. Perhaps something about your application is pushing at a weak point in this area.
The accepted wisdom is "don't build your own memory reclaiming pool, let the GC work for you". This is true most of the time but not in at least one piece of code I maintain (proven with profiling). Perhaps you need to rework your Object allocation in some major way.
Try the latency analyzer that comes with JRockit Mission Control. It will show you what the CPU is doing when it's not doing anything, if the application is waiting for file I/O, TLA-fetches, object allocations, thread suspension, JVM-locks, gc-pauses etc. You can also see transitions, e.g. when one thread wakes up another. The overhead is negligible, 1% or so.
See this blog for more info. The tool is free to use for development, and you can download it here.
