I'm doing CPU profiling in VisualVM and looking at the results in the call tree.
I have some method, taking a total time X, which is spent in the method itself (Self time), and in subroutines called from the method.
When I add up the times spent in the subroutines, plus the Self time, why doesn't the result equal the total time spent in the method? Note that I'm not talking about milliseconds, but more like 50% or several minutes missing in the balance.
It is very difficult to use "self time" to learn anything meaningful except in tiny programs with very shallow call trees.
CPU-only time is also not very useful in any kind of complex program, which can easily spend a large fraction of time in hidden I/O.
It's better to look at:
inclusive time, not self time
wall-clock time, not CPU time
percent, not absolute seconds or milliseconds
It's even better to get line-level resolution, not just function or method.
Here's the method I use to find out why time is being spent and how to improve it, and here's an example of what has been done with it.
Here's a more extensive discussion of the issues.
You need both the total (inclusive) time and the inherent (self) time. You should also avoid call trees and instead look at clock timing from a namespace hierarchy perspective (packages, classes, methods, even dynamic tags and marks).
Here is a series of articles detailing what a thorough performance investigation looks like, especially when dealing with huge stack depths (beyond 2,000 frames) and billions of method invocations in a very short period:
http://www.jinspired.com/solutions/case-studies/scala-compiler
Note that each method is used to validate the findings of the others, and more importantly, there is no single right performance model: there are many right performance models, depending on what is being asked and what can be changed. The only bad performance model I know of in this context is a sample-based one, though when you have nothing else available and are in a hurry, any little piece of information helps.
I want to do some research to improve my programming skills by seeing how long a method takes to finish.
But I don't want to write any code in the Java files I am testing, because I have a lot of methods to test (and for some of the code I do not have permission to edit it), so if possible I just want to "watch" the methods.
For example:
public void methodZero(){
methodOne();
methodTwo();
}
Should ideally print something like:
Time of methodZero = x. Time of methodOne = y. Time of methodTwo = z.
Question: Can I measure timing like this? Is there some additional info that would be important to me, like memory use?
Timing a method in isolation is usually completely meaningless; for a start, you need statistical samples. If you want to get run times for a method you have to take a lot of things into consideration and give as much context to the target of your timing as possible.
For example, suppose your method has a return value, but you don't use it in any way in your benchmark - you are possibly going to encounter "dead-code elimination" by the JVM, which makes your timings meaningless. The JVM does many clever things like this (openjdk.net: Performance techniques used in the Hotspot JVM), all of which confuse and complicate taking meaningful timings.
A good library to use to help you do this (that also has good documentation and tutorials) is JMH.
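For illustration, here is a minimal JMH benchmark sketch; the class, field, and method names are made up for this example, and it assumes JMH is on the classpath. Returning the computed value, or handing it to a Blackhole, is what keeps the JIT from treating the work as dead code.

import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class SumBenchmark {
    private int[] data;

    @Setup
    public void setup() {
        data = ThreadLocalRandom.current().ints(10_000).toArray();
    }

    // Returning the result lets JMH consume it, so the JIT cannot
    // eliminate the loop as dead code.
    @Benchmark
    public long sumByReturning() {
        long sum = 0;
        for (int x : data) {
            sum += x;
        }
        return sum;
    }

    // Alternatively, hand intermediate values to a Blackhole.
    @Benchmark
    public void sumIntoBlackhole(Blackhole bh) {
        long sum = 0;
        for (int x : data) {
            sum += x;
        }
        bh.consume(sum);
    }
}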
Just as important as actual timings, but often related to taking meaningful timings, is measuring algorithm space and time complexity. This allows you to understand how your algorithm will grow as the input dataset changes in size. For example MethodA might be faster on a small input array and impractically slow on an input array 100x bigger, whereas MethodB may take the same time regardless of array size.
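As a rough illustration of that point (the methods and sizes here are hypothetical, not taken from the question): a linear scan over a List slows down as the list grows, while a HashSet lookup stays roughly constant regardless of size.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ComplexityDemo {
    // O(n): scans every element, so time grows with the size of the list.
    static boolean containsLinear(List<Integer> list, int target) {
        for (int value : list) {
            if (value == target) {
                return true;
            }
        }
        return false;
    }

    // O(1) on average: a hash lookup takes about the same time whether
    // the set holds a thousand elements or a hundred million.
    static boolean containsHashed(Set<Integer> set, int target) {
        return set.contains(target);
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
            set.add(i);
        }
        System.out.println(containsLinear(list, -1)); // false, after a full scan
        System.out.println(containsHashed(set, -1));  // false, after one hash lookup
    }
}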
If you want to find out where in a program you should start looking to improve performance, you can use a profiler (e.g. eclipse.org: An introduction to profiling Java applications). This will help you identify things such as: high memory usage, high GC usage, and total method time (e.g. low method call count but high execution time, or high call count and small but significant execution time). However, profilers will have an impact on your program's execution time as well.
TL;DR: profiling and performance testing is hard. Often you aren't really measuring what you think you are measuring.
// A naive single measurement: elapsed wall-clock time around one call.
long startTime = System.nanoTime();
methodOne();
long endTime = System.nanoTime();
long duration = endTime - startTime; // elapsed time in nanoseconds
As a programmer, you should be more interested in the complexity of your algorithm (usually time complexity, but sometimes also space complexity) than in the actual time taken to execute the program.
Have a read here about analysing algorithm complexity
I am trying to compare the accuracy of timing methods with C++ and Java.
With C++ I usually use CLOCKS_PER_SEC: I run the block of code I want to time for a certain amount of time, then calculate how long it took based on how many times the block was executed.
With Java I usually use System.nanoTime().
Which one is more accurate, the one I use for C++ or the one I use for Java? Is there any other way to time in C++ so I don't have to repeat the piece of code to get a proper measurement? Basically, is there a System.nanoTime() method for C++?
I am aware that both use system calls which cause considerable latencies. How does this distort the real value of the timing? Is there any way to prevent this?
Every method has errors. Before you spend a great deal of time on this question, you have to ask yourself "how accurate do I need my answer to be"? Usually the solution is to run a loop / piece of code a number of times, and keep track of the mean / standard deviation of the measurement. This is a good way to get a handle on the repeatability of your measurement. After that, assume that latency is "comparable" between the "start time" and "stop time" calls (regardless of what function you used), and you have a framework to understand the issues.
Bottom line: the clock() function typically gives microsecond accuracy.
See https://stackoverflow.com/a/20497193/1967396 for an example of how to go about this in C (in that instance, using a usec precision clock). There's the ability to use ns timing - see for example the answer to clock_gettime() still not monotonic - alternatives? which uses clock_gettime(CLOCK_MONOTONIC_RAW, &tSpec);
Note that you have to extract seconds and nanoseconds separately from that structure.
Be careful using System.nanoTime() as it is still limited by the resolution that the machine you are running on can give you.
Also there are complications timing Java, as the first few times through a function will be a lot slower until it gets optimized for your system.
Virtually all modern systems use pre-emptive multithreading and multiple cores, so all timings will vary from run to run. (For example, control may get switched away from your thread while it is in the method.)
To get reliable timings you need to:
Warm up the system by running the thing you are timing a few hundred times before starting.
Run the code a good number of times and average the results.
The reliability issues are the same for any language, so they apply just as much to C as to Java. C may not need the warm-up loop, but you will still need to take a lot of samples and average them.
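A minimal sketch of that warm-up-then-average approach; the doWork() method and the iteration counts are made up for the example.

public class TimingHarness {
    // Hypothetical stand-in for whatever you are timing.
    private static double doWork() {
        double sum = 0;
        for (int i = 0; i < 1_000; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        double sink = 0; // keep results "used" so the JIT cannot discard the work

        // Warm-up: let the JIT compile and optimize the code path first.
        for (int i = 0; i < 10_000; i++) {
            sink += doWork();
        }

        // Measurement: take many samples and average them.
        int samples = 1_000;
        long totalNanos = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            sink += doWork();
            totalNanos += System.nanoTime() - start;
        }

        System.out.println("average ns per call: " + (totalNanos / samples));
        System.out.println("(checksum: " + sink + ")");
    }
}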
To debug our Android code we have put in System.out.println(string) calls, which let us know how many times a function has been called. The other approach would have been to keep a counter, increment it after every function call, and then print its final value with System.out.println(...) at the end. (In practice, the function will be called thousands of times in my application.)
My question is: in terms of CPU resources and clock cycles, which one is lighter: the increment operation or System.out.println?
Incrementing is going to be much, much more efficient - especially if you've actually got anywhere for that output to go. Think of all the operations required by System.out.println vs incrementing a variable. Of course, whether the impact will actually be significant is a different matter - and if your method is already doing a lot of work, then a System.out.println call may not actually make much difference. But if you just want to know how many times it was called, then keeping a counter makes more sense than looking through the logs anyway, IMO.
I would recommend using AtomicLong or AtomicInteger instead of just having a primitive variable, as that way you get simple thread-safety.
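For illustration, a minimal counter along those lines; the class and method names are hypothetical.

import java.util.concurrent.atomic.AtomicLong;

public class CallCounter {
    private static final AtomicLong CALLS = new AtomicLong();

    // The method under investigation; the increment is a handful of
    // machine instructions, so it adds almost nothing per call.
    static void methodUnderTest() {
        CALLS.incrementAndGet();
        // ... real work ...
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            methodUnderTest();
        }
        // Print once at the end instead of logging on every call.
        System.out.println("methodUnderTest was called " + CALLS.get() + " times");
    }
}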
Incrementing will be a lot faster in terms of clock cycles. Assuming the increment is fairly close to a hardware increment it would only take a couple of clock cycles. That means you can do millions every second.
On the other hand, System.out.println has to call out to the OS, use stdout, convert characters, and so on. Each of these steps takes many, many clock cycles.
Coming back to your original question, if you're looking at how many times a function gets called, you could try running a profiler - there are various desktop and Android solutions available. That way you wouldn't need to pollute your code with counting/printing, and you can keep your production code lean.
Again, thinking a little further, why would you like to know the exact number of times a function is called? If you're concerned about a defect, consider writing some unit tests that prove exactly how many times a function gets called. If you're concerned about performance, perhaps look at load-testing techniques in combination with your profiler.
After watching Joshua Bloch's presentation "Performance Anxiety", I read the paper he suggested in it, "Evaluating the Accuracy of Java Profilers". Quoting the conclusion:
Our results are disturbing because they indicate that profiler incorrectness is pervasive—occurring in most of our seven benchmarks and in two production JVMs—and significant—all four of the state-of-the-art profilers produce incorrect profiles. Incorrect profiles can easily cause a performance analyst to spend time optimizing cold methods that will have minimal effect on performance. We show that a proof-of-concept profiler that does not use yield points for sampling does not suffer from the above problems.
The conclusion of the paper is that we cannot really believe the results of profilers. But then, what is the alternative to using profilers? Should we go back and just use our gut feeling to do optimization?
UPDATE: A point that seems to be missed in the discussion is the observer effect. Can we build a profiler that is really 'observer effect'-free?
Oh, man, where to begin?
First, I'm amazed that this is news. Second, the problem is not that profilers are bad, it is that some profilers are bad.
The authors built one that, they feel, is good, just by avoiding some of the mistakes they found in the ones they evaluated.
Mistakes are common because of some persistent myths about performance profiling.
But let's be positive.
If one wants to find opportunities for speedup, it is really very simple:
Sampling should be uncorrelated with the state of the program.
That means happening at a truly random time, regardless of whether the program is in I/O (except for user input), or in GC, or in a tight CPU loop, or whatever.
Sampling should read the function call stack,
so as to determine which statements were "active" at the time of the sample.
The reason is that every call site (point at which a function is called) has a percentage cost equal to the fraction of time it is on the stack.
(Note: the paper is concerned entirely with self-time, ignoring the massive impact of avoidable function calls in large software. In fact, the reason behind the original gprof was to help find those calls.)
Reporting should show percent by line (not by function).
If a "hot" function is identified, one still has to hunt inside it for the "hot" lines of code accounting for the time. That information is in the samples! Why hide it?
An almost universal mistake (that the paper shares) is to be concerned too much with accuracy of measurement, and not enough with accuracy of location.
For example, here is a case of performance tuning in which a series of performance problems were identified and fixed, resulting in a compounded speedup of 43 times.
It was not essential to know precisely the size of each problem before fixing it, but to know its location.
A phenomenon of performance tuning is that fixing one problem, by reducing the time, magnifies the percentages of remaining problems, so they are easier to find.
As long as any problem is found and fixed, progress is made toward the goal of finding and fixing all the problems.
It is not essential to fix them in decreasing size order, but it is essential to pinpoint them.
On the subject of statistical accuracy of measurement, if a call point is on the stack some percent of time F (like 20%), and N (like 100) random-time samples are taken, then the number of samples that show the call point is a binomial distribution, with mean = NF = 20, standard deviation = sqrt(NF(1-F)) = sqrt(16) = 4. So the percent of samples that show it will be 20% +/- 4%.
So is that accurate? Not really, but has the problem been found? Precisely.
In fact, the larger a problem is, in terms of percent, the fewer samples are needed to locate it. For example, if 3 samples are taken, and a call point shows up on 2 of them, it is highly likely to be very costly.
(Specifically, it follows a beta distribution. If you generate 4 uniform 0,1 random numbers, and sort them, the distribution of the 3rd one is the distribution of cost for that call point.
Its mean is (2+1)/(3+2) = 0.6, so that is the expected savings, given those samples.)
INSERTED: And the speedup factor you get is governed by another distribution, BetaPrime, and its average is 4. So if you take 3 samples, see a problem on 2 of them, and eliminate that problem, on average you will make the program four times faster.
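If you want to convince yourself of the binomial argument above, here is a small simulation sketch; the 20% fraction and 100 samples are just the numbers from the example.

import java.util.Random;

public class SamplingSimulation {
    public static void main(String[] args) {
        double f = 0.20;   // fraction of time the call point is on the stack
        int n = 100;       // random-time samples per profiling run
        int trials = 100_000;

        Random rng = new Random();
        double sum = 0;
        double sumSq = 0;
        for (int t = 0; t < trials; t++) {
            int hits = 0;
            for (int i = 0; i < n; i++) {
                if (rng.nextDouble() < f) {
                    hits++; // sample landed while the call point was on the stack
                }
            }
            sum += hits;
            sumSq += (double) hits * hits;
        }

        double mean = sum / trials;
        double stdDev = Math.sqrt(sumSq / trials - mean * mean);
        // Expect roughly mean = N*F = 20 and std dev = sqrt(N*F*(1-F)) = 4.
        System.out.printf("mean hits = %.2f, std dev = %.2f%n", mean, stdDev);
    }
}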
It's high time we programmers blew the cobwebs out of our heads on the subject of profiling.
Disclaimer - the paper failed to reference my article: Dunlavey, “Performance tuning with instruction-level cost derived from call-stack sampling”, ACM SIGPLAN Notices 42, 8 (August, 2007), pp. 4-8.
If I read it correctly, the paper only talks about sample-based profiling. Many profilers also do instrumentation-based profiling. It's much slower and has some other problems, but it should not suffer from the biases the paper talks about.
"The conclusion of the paper is that we cannot really believe the results of profilers. But then, what is the alternative to using profilers?"
No. The conclusion of the paper is that current profilers' measuring methods have specific defects. They propose a fix. The paper is quite recent. I'd expect profilers to implement this fix eventually. Until then, even a defective profiler is still much better than "feeling".
Unless you are building bleeding-edge applications that need every CPU cycle, I have found that profilers are a good way to find the slowest 10% of your code. As a developer, I would argue that should be all you really care about in nearly all cases.
I have experience with http://www.dynatrace.com/en/ and I can tell you it is very good at finding the low hanging fruit.
Profilers are like any other tool and they have their quirks but I would trust them over a human any day to find the hot spots in your app to look at.
If you don't trust profilers, then you can go into paranoia mode by using aspect-oriented programming, wrapping every method in your application and using a logger to log every method invocation.
Your application will really slow down, but at least you'll have a precise count of how many times each method is invoked. If you also want to see how long each method takes to execute, wrap every method with perf4j.
After dumping all these statistics to text files, use some tools to extract all necessary information and then visualize it. I'd guess this will give you a pretty good overview of how slow your application is in certain places.
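As a rough sketch of that wrapping idea using AspectJ-style annotations (the com.example pointcut and the println are placeholders; a real setup would weave the aspect in and hand the timings to a logger or to perf4j):

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class MethodTimingAspect {

    // Wrap every method in the (hypothetical) com.example application packages.
    @Around("execution(* com.example..*(..))")
    public Object timeMethod(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        try {
            return pjp.proceed(); // run the original method
        } finally {
            long elapsed = System.nanoTime() - start;
            // A real setup would send this to a logger or perf4j instead.
            System.out.println(pjp.getSignature() + " took " + elapsed + " ns");
        }
    }
}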
Actually, you are better off profiling at the database level. Most enterprise databases come with the ability to show the top queries over a period of time. Start working on those queries until the top ones are down to 300 ms or less, and you will have made great progress. Profilers are useful for showing behavior of the heap and for identifying blocked threads, but I personally have never gotten much traction with the development teams on identifying hot methods or large objects.
I want to capture the time taken to go from statement A to statement B in a Java class. In between these statements there are many web service calls made. I wanted to know if there is some stopwatch-like functionality in Java that I could use to capture the exact time?
Kaddy
This will give you the number of nanoseconds between the two nanoTime() calls.
long start = System.nanoTime();
// Java statements
long diff = System.nanoTime() - start;
For more sophisticated approaches there are several duplicate questions that address Stopwatch classes:
Java performance timing library
Stopwatch class for Java
@Ben S's answer is spot on.
However, it should be noted that the approach of inserting time measurement statements into your code does not scale:
It makes your code look a mess.
It makes your application run slower. Those calls to System.nanoTime() don't come for free!
It introduces the possibility of bugs.
If your real aim is to try and work out why your application is running slowly so that you can decide what to optimize, then a better solution is to use a Java profiler. This has the advantage that you need to make ZERO changes to your source code. (Of course, profiling doesn't give you the exact times spent in particular sections. Rather, it gives you time proportions ... which is far more useful for deciding where to optimize.)
System.currentTimeMillis will get it in milliseconds and System.nanoTime in nanoseconds.
If you're trying to compare the performance of different techniques, note that the JVM environment is complex, so simply taking one time is not meaningful. I always write a loop where I execute method 1 a few thousand times, then do a System.gc(), then execute method 2 a few thousand times, then do another System.gc(), then loop back and do the whole thing again at least five or six times. This helps to average out time for garbage collection, just-in-time compiles, and other magic things happening in the JVM.
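Here is a minimal sketch of that alternating loop; the two methods being compared and the iteration counts are made up for the example.

import java.util.function.IntSupplier;

public class CompareTechniques {
    public static void main(String[] args) {
        for (int round = 0; round < 6; round++) {
            System.out.println("round " + round
                    + ": method1 = " + timeNanos(CompareTechniques::method1)
                    + " ns, method2 = " + timeNanos(CompareTechniques::method2) + " ns");
        }
    }

    // Run the task a few thousand times, then request a GC so that garbage
    // created by this task is less likely to be charged to the next one.
    static long timeNanos(IntSupplier task) {
        int sink = 0; // accumulate results so the work is not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < 5_000; i++) {
            sink += task.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        System.gc();
        if (sink == 42) { // practically never true; just keeps 'sink' observable
            System.out.println(sink);
        }
        return elapsed;
    }

    // Hypothetical techniques being compared.
    static int method1() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append(i);
        }
        return sb.length();
    }

    static int method2() {
        String s = "";
        for (int i = 0; i < 100; i++) {
            s += i;
        }
        return s.length();
    }
}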