Profiling native methods in Java - strange results


I have been using YourKit 8.0 to profile a mathematically intensive application running under Mac OS X (10.5.7, Apple JDK 1.6.0_06-b06-57), and have noticed some strange behavior in the CPU profiling results.
For instance, I did a profiling run using sampling, which reported that 40% of the application's 10-minute runtime was spent in the StrictMath.atan method. I found this puzzling, but I took it at its word and spent a bit of time replacing atan with an extremely simple polynomial fit.
When I ran the application again, it took almost exactly the same time as before (10 minutes) - but my atan replacement showed up nowhere in the profiling results. Instead, the runtime percentages of the other major hotspots simply increased to make up for it.
To summarize:
RESULTS WITH StrictMath.atan (native method)
Total runtime: 10 minutes
Method 1: 20%
Method 2: 20%
Method 3: 20%
StrictMath.atan: 40%
RESULTS WITH simplified, pure Java atan
Total runtime: 10 minutes
Method 1: 33%
Method 2: 33%
Method 3: 33%
(Methods 1,2,3 do not perform any atan calls)
Any idea what is up with this behavior? I got the same results using EJ-Technologies' JProfiler. It seems like the JDK profiling API reports inaccurate results for native methods, at least under OS X.

This can happen because of inconsistencies in when samples are taken. For example, if a method accounts for a fair amount of total time but each individual call is very short, it is possible for the sampling to miss it. Also, I think garbage collection never happens during a sample, so code that causes a lot of garbage collection can contribute greatly to a slowdown without appearing in the samples.
In similar situations I've found it very helpful to run twice, once with tracing and once with sampling. If a method appears in both it is probably using a lot of CPU; otherwise it could well just be an artifact of the sampling process.

Since you're using a Mac, you might try Apple's Shark profiler (free download from ADC). It has Java support, and Apple's performance group has put a fair amount of time into the tool.
As Nick pointed out, sampling can be misleading if the sample interval is close enough to the function's execution time and the profiler rarely checks when the function is actually executing. I don't know whether YourKit supports this but in Shark you can change the sampling interval to something other than the default 10ms and see if the results are substantially different.
There's also a separate call-tracing mode which records every function enter/return. This completely avoids the possibility of aliasing errors, but it collects a ton of data and has higher overhead, which could matter if your app is doing any sort of real-time processing.
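The aliasing effect described above can be illustrated with a small deterministic simulation. This is only a sketch under made-up assumptions: the workload (a "hot" method that runs for the first 4 ms of every 10 ms period), the class name and the method names are all hypothetical, and no real profiler is involved.

```java
/** Deterministic simulation of sampling aliasing; it does not talk to any real profiler. */
final class SamplingAliasDemo {
    // Hypothetical workload: in every 10 ms period, "hot" runs for the first 4 ms
    // and "other" runs for the remaining 6 ms, so the true share of "hot" is 40%.
    static final int PERIOD_MS = 10;
    static final int HOT_MS = 4;

    /** Name of the method that is running at time tMs. */
    static String methodAt(long tMs) {
        return (tMs % PERIOD_MS) < HOT_MS ? "hot" : "other";
    }

    /** Fraction of samples that land in "hot" when sampling every intervalMs, starting at offsetMs. */
    static double sampledHotFraction(long totalMs, long intervalMs, long offsetMs) {
        long samples = 0;
        long hot = 0;
        for (long t = offsetMs; t < totalMs; t += intervalMs) {
            samples++;
            if (methodAt(t).equals("hot")) {
                hot++;
            }
        }
        return (double) hot / samples;
    }

    public static void main(String[] args) {
        long total = 10_000; // simulate 10 seconds
        System.out.printf("true share of hot:      %.2f%n", (double) HOT_MS / PERIOD_MS);
        System.out.printf("10 ms interval, +5 ms:  %.2f%n", sampledHotFraction(total, 10, 5));
        System.out.printf("10 ms interval, +1 ms:  %.2f%n", sampledHotFraction(total, 10, 1));
        System.out.printf("7 ms interval,  +5 ms:  %.2f%n", sampledHotFraction(total, 7, 5));
    }
}
```

When the sampling interval equals the workload's period, every sample hits the same phase, so the simulated sampler sees "hot" either never or always depending on the offset. With an interval that does not divide the period evenly, the estimate comes out close to the true 40%. Real workloads are rarely this regular, but this is the reason changing the sampling interval and comparing results is a useful check.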

You may want to look at the parameters that are being passed into the three methods. It may be that the time is being spent generating return values or in methods that are creating a lot of temporary objects.

I find YourKit greatly exaggerates the cost of calling sub-methods (due to its logging method, I assume). If you only follow the advice that the profile gives you, you'll end up just merging functions with no real gain, because HotSpot usually handles this (inlining small methods) very well on its own.
Therefore, I'd strongly advise testing batches completely outside the profiler too, to get a better idea of whether changes are really beneficial (it may seem obvious, but this cost me some development time).
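A minimal sketch of such an out-of-profiler check, using only System.nanoTime(). The fastAtan formula here is a hypothetical stand-in for the "simple polynomial fit" from the question (a well-known low-order approximation that is only reasonable for |x| <= 1), and the array size and repetition counts are arbitrary. For serious measurements a benchmark harness such as JMH is the safer choice.

```java
import java.util.function.DoubleUnaryOperator;

/** Times StrictMath.atan against a cheap approximation without any profiler attached. */
final class AtanTimingDemo {
    /** Holds the last result so the JIT cannot discard the timed loop as dead code. */
    static double sink;

    /** Hypothetical stand-in for the question's polynomial fit; only intended for |x| <= 1. */
    static double fastAtan(double x) {
        return x * (Math.PI / 4 + 0.273 * (1 - Math.abs(x)));
    }

    /** Applies f to every input and returns the elapsed wall-clock time in nanoseconds. */
    static long timeNanos(DoubleUnaryOperator f, double[] inputs) {
        long start = System.nanoTime();
        double sum = 0;
        for (double x : inputs) {
            sum += f.applyAsDouble(x);
        }
        long elapsed = System.nanoTime() - start;
        sink = sum;
        return elapsed;
    }

    public static void main(String[] args) {
        // Evenly spaced inputs in [-1, 1].
        double[] inputs = new double[1_000_000];
        for (int i = 0; i < inputs.length; i++) {
            inputs[i] = -1 + 2.0 * i / (inputs.length - 1);
        }
        // Warm-up rounds so that both code paths are JIT-compiled before measuring.
        for (int i = 0; i < 20; i++) {
            timeNanos(StrictMath::atan, inputs);
            timeNanos(AtanTimingDemo::fastAtan, inputs);
        }
        // Keep the best of several runs to reduce noise from GC and the scheduler.
        long bestStrict = Long.MAX_VALUE;
        long bestFast = Long.MAX_VALUE;
        for (int i = 0; i < 10; i++) {
            bestStrict = Math.min(bestStrict, timeNanos(StrictMath::atan, inputs));
            bestFast = Math.min(bestFast, timeNanos(AtanTimingDemo::fastAtan, inputs));
        }
        System.out.printf("StrictMath.atan: %.2f ms%n", bestStrict / 1e6);
        System.out.printf("fastAtan:        %.2f ms%n", bestFast / 1e6);
    }
}
```

If the two numbers differ a lot here but the whole application runs no faster, the function was probably not the bottleneck the profiler claimed it was.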

It is worth noting that Java methods can be inlined if they are small enough, whereas native methods are inlined under different rules. If a method is inlined, it doesn't appear in the profiler (certainly not in YourKit, anyway).

Profilers can be like that.
This is the method I use.
Works every time.
And this is why.

Related

Java Used Heap RAM usage peaks, how can I avoid them?

First of all, I can't really show code, I am sorry; this software belongs to the company I work for, not to me. I will try to explain my problem the best I can.
I am developing a little application based on JavaFX that shows values in LineCharts. These are refreshed every 800-1000 ms (0.8-1 seconds), and the application calls System.gc() on every refresh.
I am having RAM usage peaks every 10-20 seconds:
In this specific example, this doesn't look like a problem, but in some cases it goes up to 700-750 MB (Making the Heap Size go up to 1.2-1.3 GB, and taking a long time to release it back to the OS).
I know about (and currently use, without noticing any huge improvement) heap tuning parameters, but I don't think these can fix the problem here. They help at specific points and slightly reduce the memory consumption, but they do not solve the problem.
Any ideas on how I can design my code so it does not have these RAM peaks? I don't have a process that uses memory and releases it every 10-20 seconds, so I assume there is something else allocating and releasing that amount of RAM (maybe JavaFX?). JVisualVM only reports int[], byte[] and char[], and I am not even using Integer values in my code (I work with Double values in this software).
Thank you all.

Sorry, but the only reasonable answer here: you have to do profiling in order to understand where those peaks are coming from. You have to identify the root cause of this problem; and that is nothing that we can help with.
This program runs in your setup, with your data, and shows behavior that needs to be analyzed over time.
My guess would be that your program is creating large numbers of objects that are thrown away shortly afterwards (I guess you have those calls to System.gc() in there for a reason). And guess what: creating garbage at a high rate is a bad idea, because it keeps your GC constantly spinning, and it (obviously?!) contributes to high memory load.
So, as said: you have to identify the root cause and fix that. In that sense: you have to study the tooling you are using. An alternative to profiling might be to have the GC log its activities; and analyze that output. See here for some information on that.

I found the solution:
Both MrSmith42 and GhostCat pointed out that calling System.gc() doesn't really help me here. They were right, in fact, that was the problem.
Removing System.gc() solved the problem for me
Thank you, MrSmith42 and GhostCat.

System.gc() does not trigger a garbage collection directly; it is more like a hint to the VM that you think performing a garbage collection would be a good idea. What your VM actually does is its own decision, based on the implementation.
Only when the VM runs out of memory will it definitely perform a garbage collection, and it does that without you calling System.gc().
A quite long discussion about this topic can be found here:
When does System.gc() do anything

Why aren't all methods displayed in VisualVM profiler?

I am using VisualVM to see where my application is slow. But it does not show all methods, and probably does not show all of the methods that delay the application.
I have a realtime application (sound processing) and am short on time by a few hundred microseconds.
Is it possible that VisualVM hides methods which are fast themselves?
UPDATE 1
I found the slow method by using the sampler and some guessing. It was a toString() method called from debug logging that was turned off but was still consuming time.
The settings helped, and now I know how to see it: it depended on the "Start profiling from" option.
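The cost described in the update comes from building the log message before the logger checks whether the level is enabled. A minimal sketch of the effect and two common ways around it, assuming java.util.logging (the question does not say which logging framework was used; the Frame class and the call counter are hypothetical and only there for illustration):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Shows that string concatenation for a disabled log level still calls toString(). */
final class LazyLoggingDemo {
    static final Logger LOG = Logger.getLogger("LazyLoggingDemo");
    static int toStringCalls = 0;

    /** Hypothetical object whose toString() stands in for an expensive one. */
    static final class Frame {
        @Override
        public String toString() {
            toStringCalls++;
            return "Frame[...]";
        }
    }

    /** The message is concatenated (and toString() called) before fine() checks the level. */
    static void eager(Frame f) {
        LOG.fine("processing " + f);
    }

    /** The supplier only runs if FINE is enabled. */
    static void lazy(Frame f) {
        LOG.fine(() -> "processing " + f);
    }

    /** Explicit level check; same effect as the supplier form. */
    static void guarded(Frame f) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing " + f);
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE (debug-style) output is turned off
        Frame f = new Frame();
        eager(f);
        System.out.println("toString() calls after eager:   " + toStringCalls);
        lazy(f);
        guarded(f);
        System.out.println("toString() calls after lazy/guarded: " + toStringCalls);
    }
}
```

Other logging frameworks offer comparable mechanisms, such as parameterized messages or level checks; the details differ per framework.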

Other than the filters mentioned by Ryan Stewart, here are a couple of additional reasons why methods may not appear in the profiler:
Sampling profiles are inherently stochastic: a sample of the current stack of all threads is taken every N ms. Some methods which actually executed but which aren't caught in any sample during your run just won't appear. This is generally not too problematic since the very fact they didn't appear in any sample, means that with very high probability these are methods which aren't taking up a large part of your runtime.
When using instrumentation-based profiling in VisualVM (called "CPU profiling"), you need to define the entry point for profiled methods (the "Start profiling from" option). I have found this fails for methods in the default package, and it also won't pick up time in methods which are currently running when the profiler is attached (for the duration of the current invocation; it will get later invocations). This is probably because the instrumented method won't be swapped in until the current invocation finishes.
Sampling is subject to a potentially serious issue with stack-trace-based profiling, which is that samples are only taken at safepoints in the code. When a trace is requested, each thread is forced to a safepoint, then the stack is taken. In some cases you may have a hot spot in your code which does no safepoint polling (common for simple loops that the JIT can guarantee terminate after a fixed number of iterations), interleaved with a bit of code that does have a safepoint poll. Your stacks will always show your process in the code with the safepoint poll, never in the safepoint-free code, even though the latter may be taking the majority of CPU time.

I don't have it in front of me at the moment, but before you start profiling, there's a settings pane that's hidden by default and lets you enter regexes for filtering out methods. By default, it filters out a lot of the core JDK stuff.

I had the same problem with my pet project. I added a package name and the problem is solved. I don't understand why. VisualVM 1.4.1, jdk1.8.0_181 and jdk-10.0.2, Windows 10

Debugging a slow Java method

VisualVM is showing me that a particular method is taking a long time to execute.
Are there any widely used strategies for looking at the performance (in regards to time) of a Java method?
My gut feeling is that the sluggish response time will come from a method that is somewhere down the call hierarchy from the one VisualVM is reporting but I think getting some hard numbers is better than fishing around in the code based on an assumption when it comes to performance.

VisualVM should be showing you the methods which use the most CPU. If the biggest user is your method, it means the time is not going to a method you are calling, unless you are calling many methods which individually look small but add up to more in total.
I suggest you take the difference between your method's total and the time of the methods it calls. That is how much your method itself adds while being profiled. Note: how much it adds when not profiled could be less, as the profiler has an overhead.

You need to use tools like JProfiler, YourKit etc. You can profile your code in depth and see exactly which method is taking the most time. You can go as deep into the call hierarchy as you want with these tools.

java debug performance issues - best practices

Just curious to know if there is a list of steps that I can use as guidelines to debug performance issues and pinpoint what is taking the most time. There are a myriad of tools, starting with logging, timing methods, load test tools, timing database queries and so on.
Considering that there are so many different things, is there a list of things that should be checked first?
if so please let me

Check the machine physically has enough RAM. Paging will kill applications.
Check what proportion of the application's time is spent in garbage collection. A high proportion means you'll benefit from heap or GC tuning.
Run the app in a profiler and put it through its paces, monitoring CPU usage. Look for those methods where the CPU spends all its time. A good profiler will let you roll up the time spent in third party code where you have no control, allowing you to identify the hot spots in your own code.
For the top hot spots in your application, work out where the time is being spent. Is it all I/O? Calculations? Multi-thread synchronisation? Object allocation/deallocation? A poor algorithm with n-squared or worse complexity? Something else?
Deal with the hot spots one at a time, and each time you change something, measure and work out whether you've actually improved anything. Roll back the ineffective changes and work out where the problem has moved to if the change worked.
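For the garbage collection check in the list above, a rough first number can be read from inside the JVM through the standard management beans. This is a sketch only: the class name and the garbage-producing loop are made up for illustration, and getCollectionTime() is an approximate accumulated figure that, for concurrent collectors, is not the same thing as pause time. GC logs give a more detailed picture.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Estimates what share of JVM uptime the collectors report as collection time. */
final class GcShareDemo {
    /** Sum of the approximate accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Reported collection time divided by JVM uptime. */
    static double gcShare() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMs;
    }

    public static void main(String[] args) {
        // Produce some short-lived garbage so the collectors have something to report.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            checksum += junk.length;
        }
        System.out.printf("checksum=%d, GC time=%d ms, share of uptime=%.3f%n",
                checksum, totalGcMillis(), gcShare());
    }
}
```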

There is nothing really specific to Java about something like this, with any language/framework/tool you should follow the same pattern:
1. Measure the performance before you change a single thing
2. Hypothesize about possible causes/fixes
3. Implement the change
4. Measure performance after the change to compare with #1
5. Repeat until happy

Measure
MEASURE!!!!!
Compare apples to apples. Don't run your tests on a busy subnet (especially don't try to justify this ludicrous practice by saying "I want the circumstances to be realistic").
Measure. Capture time stamps at each discrete step.
Note that although there is a relationship, throughput and response time are not the same thing.
After you make a change... MEASURE!!!!! Never say to yourself "it seems better". You know how you know it's better? Compare measurement 1 to measurement 2.
Test one thing at a time. Don't create one uber performance suite that attempts to simulate realistic conditions. It's too much and you are setting yourself up to be overwhelmed. Test for message size. Test for concurrency. Test in isolation.
Once you start to isolate the bottlenecks, the next steps will start to feel more natural and fine-tuning your tests will become easier. At that point you may choose to hook up a profiler to investigate GC/CPU performance and memory consumption (VisualVM is good and free). The point is to treat performance issues like a binary search: start by measuring everything and keep subdividing the problem until it reveals itself.

The first and most important step in any kind of performance tuning is to identify what is slow, and measure just how slow it is. In most cases (particularly if the performance problem is easy to reproduce), a profiler is the most effective tool for that, as it will give you detailed statistics on execution time, breaking it down to single methods, without having to manually instrument your program.

Check DB queries
Check for statements executed in loops. Plain statements inside loops make an application slow; use prepared statements or callable statements instead.
Capture time stamps at each discrete step.
Identify the hot spots where time is being spent, such as I/O, calculations, multithreaded synchronization and garbage collection, and look for poor algorithms.

Precise time measurement in Java

Java gives access to two methods to get the current time: System.nanoTime() and System.currentTimeMillis(). The first one gives a result in nanoseconds, but the actual accuracy is much worse than that (many microseconds).
Is the JVM already providing the best possible value for each particular machine?
Otherwise, is there some Java library that can give finer measurement, possibly by being tied to a particular system?

The problem with getting super precise time measurements is that some processors can't/don't provide such tiny increments.
As far as I know, System.currentTimeMillis() and System.nanoTime() are the best measurements you will be able to find.
Note that both return a long value.

It's a bit pointless in Java measuring time down to the nanosecond scale; an occasional GC hit will easily wipe out any kind of accuracy this may have given. In any case, the documentation states that whilst it gives nanosecond precision, it's not the same thing as nanosecond accuracy; and there are operating systems which don't report nanoseconds in any case (which is why you'll find answers quantized to 1000 when accessing them; it's not luck, it's limitation).
Not only that, but depending on how the feature is actually implemented by the OS, you might find quantized results coming through anyway (e.g. answers that always end in 64 or 128 instead of intermediate values).
It's also worth noting that the purpose of the method is to find the time difference between some (nearby) start time and now; if you take System.nanoTime() at the start of a long-running application and then take System.nanoTime() a long time later, it may have drifted quite far from real time. So you should only really use it for periods of less than 1s; if you need a longer running time than that, milliseconds should be enough. (And if it's not, then make up the last few numbers; you'll probably impress clients and the result will be just as valid.)
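To see how coarse the timer actually is on a given machine, one can spin on System.nanoTime() until the value changes and record the smallest step observed. This is a rough sketch (the class and method names are made up for illustration); the result depends on the OS, the hardware and the JVM, and the observed step also includes the cost of the call itself.

```java
/** Probes the effective granularity of System.nanoTime() on the current machine. */
final class NanoTimeGranularity {
    /** Smallest non-zero step seen between consecutive System.nanoTime() readings. */
    static long smallestObservedStep(int trials) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < trials; i++) {
            long t0 = System.nanoTime();
            long t1;
            // Spin until the clock reports a different value.
            do {
                t1 = System.nanoTime();
            } while (t1 == t0);
            min = Math.min(min, t1 - t0);
        }
        return min;
    }

    public static void main(String[] args) {
        smallestObservedStep(100_000); // warm-up so the loop is JIT-compiled
        System.out.println("smallest observed step: " + smallestObservedStep(100_000) + " ns");
    }
}
```

If the printed step is always a multiple of some fixed amount, that amount is the quantization mentioned above.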

Unfortunately, I don't think java RTS is mature enough at this moment.
Java does try to provide the best value (it delegates to native code that reads the kernel time). However, the JVM spec carries this coarse-time-measurement disclaimer mainly because of things like GC activity and the limits of the underlying system.
1. Certain GC activities will block all threads even if you are running a concurrent GC.
2. The default Linux clock tick precision is only 10ms. Java cannot do any better if the Linux kernel does not support it.
I haven't figured out how to address #1 unless your app does not need to do GC. A decent, medium-sized application probably spends tens of milliseconds on GC pauses now and then. You are probably out of luck if your precision requirement is below 10ms.
As for #2, you can tune the Linux kernel to give more precision. However, you also get less out of your box, because the kernel now context-switches more often.
Perhaps we should look at it from a different angle. Is there a reason that Ops needs a precision of 10ms or lower? Is it okay to tell Ops that precision is at 10ms AND also look at the GC log at that time, so they know the time is +-10ms accurate without GC activity around that time?

If you are looking to record some type of phenomenon on the order of nanoseconds, what you really need is a real-time operating system. The accuracy of the timer will greatly depend on the operating system's implementation of its high resolution timer and the underlying hardware.
However, you can still stay with Java since there are RTOS versions available.

JNI:
Create a simple function to access the Intel RDTSC instruction or the PMCCNTR register of co-processor p15 in ARM.
Pure Java:
You can possibly get better values if you are willing to delay until a clock tick. You can spin checking System.nanoTime() until the value changes. If you know for instance that the value of System.nanoTime() changes every 10000 loop iterations on your platform by amount DELTA then the actual event time was finalNanoTime-DELTA*ITERATIONS/10000. You will need to "warm-up" the code before taking actual measurements.
Hack (for profiling, etc, only):
If garbage collection is throwing you off you could always measure the time using a high-priority thread running in a second jvm which doesn't create objects. Have it spin incrementing a long in shared memory which you use as a clock.
