Measure execution time of Java methods with JVMTI

For the profiler I am implementing using JVMTI, I would like to start measuring the execution time of all Java methods. JVMTI offers the events:
MethodEntry
MethodExit
So this would be quite easy to implement; however, I came across this note in the API documentation:
Enabling method entry or exit events will significantly degrade performance on many platforms and is thus not advised for performance critical usage (such as profiling). Bytecode instrumentation should be used in these cases.
But my profiling agent works headless, which means the collected data is serialized and sent via a socket to a server application that displays the results. How would I realize this using bytecode instrumentation? I am somewhat confused about how to go on from here. Could someone explain whether I have to switch strategies, or how I can approach this problem?

I don't know about the Sun JVM, but the IBM JVM goes into what we call FullSpeedDebug (FSD) mode when you request the MethodEntry/MethodExit events, and FSD slows down execution quite a bit.
As you say, you can use BCI, as my profiler does, but unless you are selective about which methods you instrument you will also see a slowdown. For example, my profiler inserts an if (profiling) callProfilerHook() check at every method entry, at every possible exit of a method, at every object creation, and in some other areas as well. These additional checks can slow down execution by over 50%.
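As a rough illustration (the Profiler class and hook names here are invented, not taken from any real profiler), the instrumented bytecode behaves roughly as if the Java source had been written like this:

    // Hypothetical source-level equivalent of an instrumented method; a real
    // BCI agent injects the same checks at every entry, return and throw site.
    public class InstrumentedExample {

        static final class Profiler {
            static volatile boolean profiling = false;
            static void callProfilerHook(String method, boolean entry) {
                // record a timestamped entry/exit event, e.g. into a ring buffer
            }
        }

        public static int length(String input) {
            if (Profiler.profiling) Profiler.callProfilerHook("length", true);
            try {
                return input.length();                      // original method body
            } finally {
                // a finally block is the source-level analogue of hooking every exit
                if (Profiler.profiling) Profiler.callProfilerHook("length", false);
            }
        }
    }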
As for how to do the BCI: I wrote my own C library for it. It's technically not hard (hint: just delete the StackMapTable), but it may take you a while. Alternatively, you can use ASM et al.
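If you go the ASM route, a minimal sketch could look like the following. This is only an outline: ProfilerHook is assumed to be a class you ship with your agent (its enter/exit methods would record the timestamps you later send over the socket), and in a real agent you would call this from a ClassFileTransformer and be selective about which classes you rewrite.

    import org.objectweb.asm.ClassReader;
    import org.objectweb.asm.ClassVisitor;
    import org.objectweb.asm.ClassWriter;
    import org.objectweb.asm.MethodVisitor;
    import org.objectweb.asm.Opcodes;
    import org.objectweb.asm.commons.AdviceAdapter;

    // Rewrites every method of a class so that it calls ProfilerHook.enter/exit.
    public final class TimingInstrumenter {

        public static byte[] instrument(byte[] classBytes) {
            ClassReader reader = new ClassReader(classBytes);
            ClassWriter writer = new ClassWriter(reader, ClassWriter.COMPUTE_FRAMES);
            reader.accept(new ClassVisitor(Opcodes.ASM9, writer) {
                @Override
                public MethodVisitor visitMethod(int access, String methodName, String descriptor,
                                                 String signature, String[] exceptions) {
                    MethodVisitor mv = super.visitMethod(access, methodName, descriptor,
                                                         signature, exceptions);
                    return new AdviceAdapter(Opcodes.ASM9, mv, access, methodName, descriptor) {
                        @Override
                        protected void onMethodEnter() {
                            visitLdcInsn(methodName);
                            visitMethodInsn(Opcodes.INVOKESTATIC, "ProfilerHook", "enter",
                                    "(Ljava/lang/String;)V", false);
                        }
                        @Override
                        protected void onMethodExit(int opcode) {
                            // called before every RETURN/ATHROW; exceptions thrown by
                            // callees and not caught here are not intercepted
                            visitLdcInsn(methodName);
                            visitMethodInsn(Opcodes.INVOKESTATIC, "ProfilerHook", "exit",
                                    "(Ljava/lang/String;)V", false);
                        }
                    };
                }
            }, ClassReader.EXPAND_FRAMES);
            return writer.toByteArray();
        }
    }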
Finally, your callback hook will add overhead and, on small methods, render the reported CPU/clock time meaningless unless you perform some sophisticated overhead compensation. Even if you do, your callback code changes the shape of the processor's L1 caches, and the Java code becomes less efficient because it has less room in them.
My profiler basically ignores the reported times, as I visualize the execution in a different way: I'm looking to understand the flow of all of the code, and in most cases simply what code is running at all (most Java projects have no idea of the millions of lines of third-party code running in their app).

Related

Why Java application is not working when VM size is increased? [duplicate]

Protocol Buffer first usage high latency

In one of our Java applications we have quite a few protocol buffer classes, and the jar essentially exposes one interface with one method that is used by another application. We have noticed that the first time this method is called the invocation time is quite high (>500ms), while subsequent calls are much faster (<10ms). At first we assumed this had something to do with our code, but after profiling we could not confirm this. Through a process of elimination it became obvious that it has something to do with protocol buffers.
This was further confirmed when a different application, which works completely differently but also uses protocol buffers, showed the same behavior. Additionally, we tried creating a dummy instance (XY.newBuilder().build()) of all the protocol buffer classes at startup, and with each one we added we could see the overhead of the first invocation drop.
For .NET there is another question that describes a similar problem (Why is ProtoBuf so slow on the 1st call but very fast inside loops?); however, the solution there seems to be specific to C#, namely precompiling the serializers. I couldn't find the same issue reported for Java so far. Are there workarounds like the one shown in that question that apply to Java?
The JVM ships with a just-in-time (JIT) compiler which applies a lot of optimizations to your code. You can dig into the JVM internals if you want to understand it further: there is class loading and unloading, performance profiling, code compilation and deoptimization, biased locking, etc.
To give you an example of how complex this can get: as per this article, in OpenJDK there are two compilers (C1 and C2) and five possible tiers of code compilation:
Tiered compilation has five tiers of optimization. It starts in tier-0, the interpreter tier, where instrumentation provides information on the performance critical methods. Soon enough the tier 1 level, the simple C1 (client) compiler, optimizes the code. At tier 1, there is no profiling information. Next comes tier 2, where only a few methods are compiled (again by the client compiler). At tier 2, for those few methods, profiling information is gathered for entry-counters and loop-back branches. Tier 3 would then see all the methods getting compiled by the client compiler with full profiling information, and finally tier 4 would avail itself of C2, the server compiler.
The takeaway here is that if you require predictable performance you should always warm up your code by running some dummy requests after each deployment.
You did the right thing with the dummy code that creates all of the protobuf objects you use, but you should take it a step further and warm up the actual method you are hitting.
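A minimal warmup sketch under that assumption: the helper below just calls the real code path repeatedly at startup; the protobuf-specific call in the comment (MyService/MyRequest) is a placeholder for your actual interface and generated message class.

    import java.util.function.Supplier;

    // Generic warmup helper: exercise the real code path at startup so that class
    // loading, static initializers and JIT compilation happen before the first
    // real request arrives. The iteration count is a guess; measure what you need.
    public final class Warmup {

        public static <T> void warm(Supplier<T> realCallPath, int iterations) {
            for (int i = 0; i < iterations; i++) {
                realCallPath.get();   // result ignored; we only want the side effects
            }
        }

        public static void main(String[] args) {
            // Stand-in workload; replace with something like
            //   () -> myService.handle(MyRequest.newBuilder().build())
            warm(() -> Integer.toHexString((int) System.nanoTime()), 10_000);
        }
    }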

Java application profiling

I am looking for a Java code profiler that I can use to profile my application (it is a backend service) in production (which means low overhead; it must not slow down my application). Primarily I want call-tree profiling, that is, if a() calls b() and then b() calls c(), how much time a(), b() and c() took, both inclusively and exclusively.
I have seen jvisualvm and JProfiler, but they are not what I am looking for, because I cannot tie my production application to them, as that would cause a major performance issue.
Also, I did go through Metrics (https://github.com/dropwizard/metrics), but it does not give me the functionality to profile the call tree.
A Callgrind-type library (http://valgrind.org/docs/manual/cl-manual.html) is what I need, as it gives call-tree profiling functionality and advanced options like avoiding call cycles (recursion). But I am not sure Callgrind can be used in production, as it dumps its data only when the program terminates.
Can anyone suggest a good call-tree profiler for Java that can be used in production without compromising performance?
Take a look at Java Mission Control in conjunction with Flight Recorder. Starting with the release of Oracle JDK 7 Update 40 (7u40), Java Mission Control is bundled with the Oracle JDK, so it is highly integrated with the HotSpot JVM and purports to have a small effect on run-time performance. I have only just started looking at it, and I do see some call-tree functionality.
In general I would not recommend profilers that instrument your application; instrumentation always means uncontrollable overhead in production.
What you can use instead is a sampling profiler. A sampling profiler takes snapshots of the stack traces at a controllable interval. What you don't get is call counts, but after some runtime you get a good overview of where your hotspots are. Since you can adjust the sampling interval, you can control the profiler's overhead.
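For illustration only, a very crude sampler can be built in plain Java on top of Thread.getAllStackTraces(); a real sampling profiler is far more careful, but this shows the idea of hotspots emerging from repeated snapshots.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Toy sampling profiler: every N milliseconds, record the top stack frame of
    // each live thread. Hot methods accumulate the most samples over time.
    public final class StackSampler {

        private final Map<String, LongAdder> samples = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void start(long intervalMillis) {
            scheduler.scheduleAtFixedRate(this::sampleOnce, 0, intervalMillis,
                    TimeUnit.MILLISECONDS);
        }

        private void sampleOnce() {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) continue;   // thread currently has no Java frames
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                samples.computeIfAbsent(top, k -> new LongAdder()).increment();
            }
        }

        public void stopAndPrint() {
            scheduler.shutdownNow();
            samples.forEach((frame, count) ->
                    System.out.println(count.sum() + "\t" + frame));
        }
    }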
A usable sampling profiler, hprof, is shipped with the JDK; see the hprof page in the Java 7 documentation.
There used to be some graphical analysis tools for the hprof CPU traces (not the heap traces), but they are gone now. However, you can work with the generated text file directly.
I took a quick look at the Java Mission Control tooling mentioned above. I think it is quite powerful and will satisfy a lot of needs; the white paper says it has only 2% overhead. However, it is not exactly what I personally need or want. For my applications it is better to have "light" profiling enabled all the time.
Intel VTune Amplifier XE (http://software.intel.com/en-us/intel-vtune-amplifier-xe) has low, barely noticeable overhead. It uses stack sampling to minimize the impact, and it can attach to and detach from continuously running processes in production. You do not even need the sources during profiling; you can dive into them later while browsing the performance results offline.
I don't know of any tool which can do profiling without an impact on performance.
You could add logging to the methods that you're interested in. Make sure you include the time stamp in the log; then you can do the timing. You should also configure the logging framework to log asynchronously to reduce the performance loss.
A load-time weaver like AspectJ can add these calls when classes are loaded, which would allow you to easily select the methods you want to monitor without changing the source code all the time.
Using an around aspect, you can even add timing logging, so you don't have to parse the logs and try to find matching log entries. See this blog post for details.
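As a sketch of that around-aspect idea (the com.example.service pointcut is a placeholder; adjust it to the methods you actually want timed, and run with the AspectJ load-time weaver, i.e. -javaagent:aspectjweaver.jar plus an aop.xml that registers the aspect):

    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;

    // Logs the wall-clock time of every public method in the chosen package.
    @Aspect
    public class TimingAspect {

        @Around("execution(public * com.example.service..*(..))")   // placeholder pointcut
        public Object time(ProceedingJoinPoint pjp) throws Throwable {
            long start = System.nanoTime();
            try {
                return pjp.proceed();                 // run the intercepted method
            } finally {
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println(pjp.getSignature().toShortString()
                        + " took " + micros + " us");
            }
        }
    }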
Have a look at perfspy (tutorial), it might already do out of the box what you need.
Related:
How to use AOP with AspectJ for logging?
http://mathewjhall.wordpress.com/2011/03/31/tracing-java-method-execution-with-aspectj/
http://www.jroller.com/holy/entry/injecting_timing_aspect_into_junit

How to make full use of multiple processors?

I am doing web crawling on a server with 32 virtual processors using Java. How can I make full use of these processors? I've seen some suggestions about multi-threaded programming, but I wonder how that could ensure all processors are taken advantage of, since we can do multi-threaded programming on a single-processor machine as well.
There is no simple answer to this ... except that the way to ensure all processors are used is to use multi-threading the right way. (Note: that is a circular answer!)
Basically, the way to get effective use of multiple processors is to:
ensure that there is work that can be done in parallel, and
reduce / eliminate contention points that force one thread to wait while another thread does something.
This is difficult enough when you are doing simple computation. For a web crawler, you've got the additional problems that the threads will be competing for network and (possibly) remote server bandwidth, and they will typically be attempting to put their results into a shared data structure or database.
That's about all that can be said at this level of generality ...
And as @veer correctly points out, you can't "ensure" it.
... but using a load of threads will surely be quicker wall-time-wise because all the miserable network latency will happen in parallel ...
Actually, if you go overboard, a load of threads can reduce throughput because of contention. Just throwing lots of threads at the problem is rarely a good idea.
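In Java terms, a minimal sketch of this is a bounded pool of worker threads, sized from the number of available processors, each handling independent pieces of work; fetch() below is just a stand-in for your real download-and-parse code.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Spread the fetch work across all available processors. Because crawler
    // tasks mostly wait on the network, a pool somewhat larger than the CPU
    // count is often reasonable too; measure rather than guess.
    public final class CrawlerPool {

        public static void crawl(List<String> urls) throws InterruptedException {
            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (String url : urls) {
                pool.submit(() -> fetch(url));   // each URL becomes an independent task
            }
            pool.shutdown();                     // accept no new tasks, finish submitted ones
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void fetch(String url) {
            // placeholder for the real HTTP download and parsing
            System.out.println("fetched " + url);
        }
    }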
A computer or a program is only as fast as the slowest link in its processing chain. Just increasing the CPU capacity is not going to guarantee a drastic performance improvement. Leaving aside other issues like cache size, RAM, etc., there are two basic kinds of approach to your question of how to take advantage of all your processors:
[1] Using a JIT (just-in-time) compiler/interpreter technology such as Java/.NET. I don't know much about Java, but the .NET jitter is definitely designed to take advantage of all the available processors on the machine. In fact, this very feature makes a jitter stand out against static-language compilers like C/C++: because the jitter "knows" that it is sitting on 32 processors, it is in a much better position to take advantage of them than a program statically compiled on some other machine (provided you have written robust multi-threaded code for it!).
[2] Programming in C/C++. This is the classic approach. If you compile your code on the same machine with 32 CPUs, and take proper care in your program with things such as memory management and pointer handling, the C/C++ program will be the most optimized and will perform better than its CLR/JVM counterpart (as it runs without the extra overhead of a garbage collector or a VM).
But keep in mind that writing robust code is much easier in .NET/Java than in C/C++. So, if you are not a "hard-core" programmer, I would suggest going with the former approach. Also remember to handle your threads with care, for example by locking variables when multiple threads try to change the same variable. However, excessive or careless locking can make your code hang.
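For the common case of several threads updating one shared counter, the java.util.concurrent atomics avoid both the race and the pitfalls of hand-rolled locking; a small sketch:

    import java.util.concurrent.atomic.AtomicLong;

    // Many threads bumping one counter: AtomicLong gives a correct total without
    // an explicit lock, so there is nothing to hold too long or forget to release.
    public final class SharedCounter {

        private final AtomicLong pagesCrawled = new AtomicLong();

        public void onPageCrawled() {
            pagesCrawled.incrementAndGet();   // atomic read-modify-write
        }

        public long total() {
            return pagesCrawled.get();
        }
    }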
Processor management is handled natively by the virtual machine you are using, i.e., the JVM. You can have a look at the Java HotSpot VM Options page to tune your setup if you are using the Java HotSpot VM. If you are using a third-party VM, your vendor may help you tune it for your requirements.
Application performance, in the end, practically depends on your design.
If you would like to monitor your threads and memory usage to optimize your application, you can use any VM monitoring tools available to date. The Java virtual machine (JVM) has built-in instrumentation that enables you to monitor and manage it using JMX.
For details you can check Platform Monitoring and Management Using JMX. For third-party VMs you will have to contact the vendor, I guess.
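As a small illustration of that built-in instrumentation, the standard MXBeans can be read directly from within the application (the same data is what JConsole or VisualVM show when they attach over JMX):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;

    // Prints a few of the JVM's built-in management metrics.
    public final class JvmStats {

        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

            System.out.println("live threads: " + threads.getThreadCount());
            System.out.println("peak threads: " + threads.getPeakThreadCount());
            System.out.println("heap used:    " + memory.getHeapMemoryUsage().getUsed() + " bytes");
            System.out.println("heap max:     " + memory.getHeapMemoryUsage().getMax() + " bytes");
        }
    }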

Java profiling - How reliable are the values it gives?

I am working on a simple text-markup Java library which should be, amongst other requirements, fast.
For that purpose I did some profiling, but the results give me worse numbers than those measured when running in non-profiling mode.
So my question is: how reliable is the profiling? Does it just give an informational ratio of the time spent in methods? Does it take the JIT compiler into account, or is the profiled code only interpreted? I use the NetBeans Profiler and Sun JDK 1.6.
Thanks.
When profiling, you'll always incur a performance penalty, as something has to measure the start/stop times of the methods and keep track of the objects on the heap (for memory profiling), so there is management overhead.
However, it will give you clear pointers to find out where bottlenecks are. I tend to look for the methods where the most cumulative time is spent and check whether optimisations can be made. It is also useful to determine whether methods are called unnecessarily.
With methods that are very small, take the profile results with a pinch of salt: sometimes the process of measuring can take more time than the method call itself and skew the results (it might appear that a small, often-called method has more of a performance impact than it really does).
Hope that helps.
Because of the instrumentation, profiled code will on average run slower than non-profiled code. Measuring absolute speed is not the purpose of profiling, however.
The profiling output will point you to bottlenecks, places that threads spend more time in, code that behaves worse than expected, or possible memory leaks.
You can use those hints to improve said methods and profile again until you are happy with the results.
A profiler will not, however, be a solution to a coding style that is x% slower than optimal; you still need to spend time fine-tuning those parts of your code that are used more often than others.
I'm not surprised that you get worse results when profiling your application, as instrumenting Java code will typically slow down its execution. This is nicely captured by the Wikipedia page on profiling, which mentions that instrumentation can cause changes in the performance of a program, potentially causing inaccurate results and heisenbugs (due to the observer effect: observers affect what they are observing by the mere act of observing it).
That said, if you want to measure speed, I think you're not using the right tool. Profilers are used to find bottlenecks in an application (and for that, you don't really care about the overall impact). But if you want to benchmark your library, you should use a performance testing tool (for example, something like JMeter) that can give you an average execution time per call. You will get much better and more reliable results with the right tool.
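If all you need is a rough average time per call, even a small hand-rolled harness (warm up first so the JIT has compiled the code, then time a batch of calls) can give a usable number; the class below is only a sketch of that idea, not a substitute for a proper benchmarking tool.

    import java.util.function.Supplier;

    // Minimal benchmark harness: warm the code path up, then time a batch of calls
    // and report the average. Results remain sensitive to GC and JIT activity, so
    // treat them as rough numbers.
    public final class MiniBench {

        private static volatile long sink;   // consumes results so the JIT cannot discard the work

        public static <T> double averageMicros(Supplier<T> workload, int warmup, int measured) {
            for (int i = 0; i < warmup; i++) {
                sink += workload.get().hashCode();
            }
            long start = System.nanoTime();
            for (int i = 0; i < measured; i++) {
                sink += workload.get().hashCode();
            }
            long elapsed = System.nanoTime() - start;
            return elapsed / 1_000.0 / measured;
        }

        public static void main(String[] args) {
            // Stand-in workload; replace with a call into the markup library under test.
            double avg = averageMicros(
                    () -> Integer.toBinaryString((int) System.nanoTime()),
                    100_000, 1_000_000);
            System.out.println("average: " + avg + " us/call");
        }
    }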
Profiling should have no influence on the JIT compiler. The code inserted to profile your application, however, will slow the methods down quite a bit.
Profilers work on several different models: either they insert code to see how long and how often methods run, or they only take samples by repeatedly polling which code is currently executing.
The first will slow down your code quite a bit, while the second is not 100% accurate, as it may miss some method calls.
Profiled code is bound to run slower, as mentioned in most of the previous comments. I would say: use profiling to measure only the relative performance of the various parts of your code (say, methods). Do not use the measurements from a profiler as an indicator of how the code will perform overall (unless you want a worst-case measure, in which case what you have is an overestimate).
I have found that I get different results depending on which profiler I use. However, the results are often still valid, just a different perspective on the problem. Something I often do when profiling CPU usage is to enable memory allocation profiling as well. This often gives me different CPU results (due to the increased overhead caused by the memory profiling) and can point me to some useful places to optimise.
