When conducting performance testing of Java code, you want to test the JIT-compiled code rather than raw bytecode. To get the bytecode compiled, you must trigger compilation by executing the code a number of times, and also allow enough yield time for the background compiler thread to complete the compilation.
What is the minimum number of "warm up" executions of a code path required to be "very confident" that the code will be JIT compiled?
What is the minimum sleep time of the main thread to be "very confident" that the compilation has completed (assuming a smallish code block)?
I'm looking for a threshold that would safely apply in any modern OS, say Mac OS or Windows for the development environment and Linux for CI/production.
Since OP's intent is not actually figuring out whether the block is JIT-compiled, but rather making sure to measure the optimized code, I think OP needs to watch some of these benchmarking talks.
TL;DR version: There is no reliable way to figure out whether you hit the "steady state":
You can only measure for a long time to get a ball-park estimate of the usual time your concrete system takes to reach some state you can claim "steady".
Observing -XX:+PrintCompilation is not reliable, because you may be in the phase when counters are still in flux, and the JIT is standing by to compile the next batch of now-hot methods. You can easily have a number of warm-up plateaus because of that. The method can even be recompiled multiple times, depending on how many tiered compilers are standing by.
While one could argue about the invocation thresholds, these things are not reliable either, since tiered compilation may be involved, the method might get inlined sooner into its caller, the probabilistic counters can miss updates, etc. That is, the common wisdom about -XX:CompileThreshold=# is unreliable.
JIT compilation is not the only warmup effect you are after. Automatic GC heuristics, scheduler heuristics, etc. also require warmup.
Get a microbenchmark harness which will make the task easier for you!
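For example, a minimal JMH skeleton might look like the sketch below. The benchmark class, the compute() method, and the iteration counts are illustrative assumptions; the point is that the harness handles forking, warm-up iterations, and measurement iterations for you, so you don't have to guess at invocation counts or sleep times yourself.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class MyBenchmark {

    int x = 42;

    int compute() {            // stand-in for the code you actually want to measure
        return x * x + 7;
    }

    @Benchmark
    public void measureCompute(Blackhole bh) {
        bh.consume(compute()); // Blackhole prevents dead-code elimination
    }
}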
To begin with, the results will most likely differ for a JVM run in client mode versus server mode. Second, this number depends highly on the complexity of your code, and I am afraid you will have to estimate a number exploratively for each test case. In general, the more complex your byte code is, the more optimization can be applied to it, and therefore your code must get relatively hotter in order to make the JVM reach deep into its tool box. A JVM might recompile a segment of code a dozen times.
Furthermore, a "real world" compilation depends on the context in which your byte code is run. A compilation might, for example, occur when a monomorphic call site is promoted to a megamorphic one, such that an observed compilation actually represents a de-optimization. Therefore, be careful when assuming that your micro benchmark reflects the code's actual performance.
Instead of the suggested flag, I suggest you use CompilationMXBean, which lets you check how much time the JVM is still spending on compilation. If that time is too high, rerun your test until the value stays stable long enough. (Be patient!) Frameworks can help you with creating good benchmarks. Personally, I like Caliper. However, never trust your benchmark.
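As a rough sketch of the CompilationMXBean idea (the workload method and the polling interval are assumptions, not recommendations): poll the bean between warm-up rounds and keep going until the cumulative compilation time stops growing.

import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class CompilationWatcher {
    public static void main(String[] args) throws InterruptedException {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            System.out.println("Compilation time monitoring is not supported on this JVM");
            return;
        }
        long previous = -1;
        while (true) {
            runWorkloadOnce();                          // hypothetical warm-up round
            long total = jit.getTotalCompilationTime(); // cumulative milliseconds spent compiling
            if (total == previous) break;               // nothing new was compiled this round
            previous = total;
            Thread.sleep(100);
        }
        System.out.println("Compilation looks quiet; start measuring now");
    }

    static void runWorkloadOnce() {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) acc += i;
        if (acc == 42) System.out.println(acc); // keep the loop from being eliminated
    }
}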
From my experience, custom byte code performs best when you imitate the idioms of javac. To mention the one anecdote I can tell on this matter, I once wrote custom byte code for the Java source code equivalent of:
int[] array = {1, 2, 3};
javac creates the array and uses dup for assigning each value, whereas I stored the array reference in a local variable and loaded it back onto the operand stack for each assignment. The real array was bigger than in this example, and there was a noticeable performance difference.
Finally, I recommend this article before writing a benchmark.
Not sure about numbers, but when doing speed tests what I do is:
Run with the -XX:+PrintCompilation flag
Warm up the JVM until there are no more compilation debug messages generated and, if possible, timings become consistent
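A bare-bones version of that routine might look like the sketch below (the work() method and all the counts are made up for illustration); run it with -XX:+PrintCompilation and keep increasing the warm-up rounds until the compilation messages stop and the reported times settle down.

public class WarmupLoop {
    static long work() {                  // stand-in for the code under test
        long acc = 0;
        for (int i = 0; i < 10_000; i++) acc += i * 31L;
        return acc;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int round = 0; round < 20; round++) {
            long start = System.nanoTime();
            for (int i = 0; i < 100_000; i++) sink += work();
            System.out.printf("round %d: %d ms%n", round, (System.nanoTime() - start) / 1_000_000);
        }
        System.out.println(sink);         // keep the result alive so the loops aren't dead code
    }
}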
Related
I'm aware that, for good practice, StringBuilder should be initialised with a capacity value of the expected content. Otherwise, increasing the size afterwards is going to be an expensive operation.
My question is, if we don't know the expected size, how should one go about it? Is there a standard value/way to avoid expensive operations under the hood?
If not, is there potentially a way of alarming/logging in the code if the capacity is bigger than the value given upon initialisation?
I'm aware that, for good practice, StringBuilder should be initialised with a capacity value of the expected content. Otherwise, increasing the size afterwards is going to be an expensive operation.
This is a wildly incorrect statement. It is very bad practice to do this. Even if you know exactly how large it'll be.
If I see this code:
StringBuilder sb = new StringBuilder(in1.length() + in2.length() * 3 + (loaded ? suffixLen : 0));
Then this is an additional thing to worry about, test, and keep up to date. I would assume if all this is present that for whatever reason somebody did some performance testing and actually figured out that this saves a worthwhile chunk of cycles, and somehow, in a fit of idiocy, neglected to write an enlightening comment and link to the JMH or profiler result analysis to verify this conclusion.
So, I'd either painstakingly attempt to analyse by hand whether the calculation is still correct after an update to this code, or I'd fix the problem and add the tests (and then be utterly befuddled when, of course, the profile review shows this code is utterly inconsequential), or I'd go through the considerable trouble of writing an assert-based test case that runs the entire operation and then verifies at the end that the size calculation done at the top is, in fact, correct.
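That last option could be as simple as the sketch below, which reuses the hypothetical names from the snippet above (in1, in2, loaded, suffixLen); the assert documents and re-checks the capacity estimate whenever the code runs with assertions enabled.

static String render(String in1, String in2, boolean loaded, int suffixLen) {
    int estimated = in1.length() + in2.length() * 3 + (loaded ? suffixLen : 0);
    StringBuilder sb = new StringBuilder(estimated);
    // ... all the appends the real code performs ...
    assert sb.length() <= estimated
            : "capacity estimate is stale: needed " + sb.length() + ", estimated " + estimated;
    return sb.toString();
}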
I don't think you fully grasp why the hyperbolic "premature optimization is the root of all evil" statement is so popular.
Here's the problem. 99% of the system's resources are spent on 1% of the code. That's not an exaggeration; in fact, that is likely understating the issue.
Developer time is not infinite, and even if it were, the programmer's ability to comprehend code and focus on the relevant parts is limited because, in the end, they are humans. Writing additional code that needs to be parsed and understood by human eyeballs and brains is therefore bad if that code does something irrelevant. We're literally talking about the same order of magnitude as you throwing a glass of water into the ocean down by the seashore in Europe and then watching the water levels rise in Manhattan. Beyond any and all ability to measure, and incomprehensibly small. Bold does not do sufficient justice to how little it matters. Even if this code runs 100 million times a day for 15 years, it amounts to perhaps 5 cents in IaaS deployment costs total over that entire decade and a half, and that's if there is even a performance impact, which often there isn't, because modern VMs, GCs, OSes, and CPU architectures get up to some crazy shenanigans.
Furthermore, the system optimizes. Optimizers, such as JVM hotspot engines, are in the end pattern matching machines. They find commonly used patterns and recognize how to run them as efficiently as possible. If you write code in ways that nobody else does, it is highly unlikely that the code is actually going to outperform the common (idiomatic) case. Most likely because it just doesn't matter, and even if it does, because the idiomatic case gets optimized much more readily.
Here is a trivial example:
List<String> someList = new ArrayList<String>();
for (int i = 0; i < 10000; i++) someList.add(someRandomString);
String[] arrayForm = someList.toArray(new String[0]);
Here you may go: Huh, well, we can optimize this code a little bit and pass new String[10000] instead; this saves the system from having to allocate an admittedly small object (a 0-size string array).
You would be wrong. The above code, with new String[0], is in fact faster. How can that be? Optimizers, with pattern matching: they recognize this pattern and realize that the system can create a new array of the requisite size, skip zeroing it out, and then run the code that fills it. The optimization patterns do not cover the new String[reqSize] variant, where the system could in theory also allocate the array and omit the zeroing (which the JVM spec guarantees; that merely means the spec guarantees you can never observe that it wasn't zeroed out, not that the JVM must actually zero it out, and that's where the optimization of skipping it comes from). However, it doesn't do that for this shape - not common enough, and somewhat more complicated.
I'm not saying that new StringBuilder() is necessarily faster than new StringBuilder(knownSize). I'm saying it:
99.9% of the time literally does not make one iota of difference. Not a single nanosecond - the speedup is entirely theoretical: No performance test of any stripe can detect the difference. If a tree falls in the forest, and all that.
You have no idea when that 0.1% of the time even is, or whether it's not actually straight up never - 0%. Between a CPU that caches and pipelines (did you know modern CPUs cannot access memory? At all? I bet you didn't. The basic von Neumann model of how CPUs work totally misleads you if you use it to performance-analyse machine code), VMs, and garbage collectors (did you know that garbage is free but live objects are expensive? Re-using an object is in fact more expensive than creating a ton of fast garbage... depending on many factors, of course; this too is an oversimplification. That's the real point: this is an intractable thing; you cannot just look at code and jump to conclusions about performance) - you stand no chance of knowing what's 'faster'.
The only right move is to write code as simple and as clean as you can ('clean' defined as: when you look at it, you jump to conclusions, and those conclusions are correct; it is easy to adjust in the face of changing requirements, and flexible in how it connects to the rest of the codebase). IF (big if!) real life situations result in a performance issue, you first run a profiler so you know the 1% of the code that is in any way relevant, and then you go ham on that, with JMH benchmarks and all sorts of performance experiments to optimize the heck out of it. If your code is clean, that's great, because almost always this requires adjusting the code that calls into the 'hot path' or the code the flow exits into - and the cleaner your code, the easier that will be.
Needless performance optimization almost invariably reduces flexibility, and makes code harder to understand.
Hence, objectively, micro-optimizing like this just makes your code slower and buggier for literally no benefit. Not even a tiny, almost immeasurable one.
Hence, the advice is silly. The only correct call is new StringBuilder() - no pre-configured size. The one and only excuse you have to write new StringBuilder(presetCapacity) is if there's a lengthy comment that immediately precedes it that lays out in a lot of detail, or links to a ticket, the exact performance study done to indicate this indeed fixes a real performance issue and how to recreate that study, and on what schedule it should be revisited.
I'm writing an algorithm where efficiency is very important.
Sometimes I also want to track the algorithm behaviour by calling some "callback" functions.
Let's say my code looks like this:
public float myAlgorithm(AlgorithmTracker tracker) {
    while (something) { // millions of iterations
        doStuff();
        if (tracker != null) tracker.incrementIterationCount(); // <--- How to run the if only once?
        doOtherStuff();
    }
}
How do I prevent the if statement from executing a million times? Does the compiler see that tracker is never reassigned? If it's null on the first check, it will always be. If it's not, it never will be.
Ideally, I would like to tell the compiler to build my code in such a way that, if tracker is null (at runtime), it runs with the same performance as
while (something) { // millions of iterations
    doStuff();
    doOtherStuff();
}
I thought of two solutions:
I could write two versions of myAlgorithm, one with the calls and one without them, but that would lead to lots of code duplication.
I could extract AlgorithmTracker to an interface and create a fake empty tracker with empty functions. Still, I don't know if the compiler would optimize the calls away.
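For what it's worth, option 2 can be sketched like this (the interface and method names are taken from the question; the no-op constant is an assumption). With only one or two concrete implementations ever reaching the call site, the JIT will typically inline the empty method, so the loop body pays essentially nothing:

interface AlgorithmTracker {
    void incrementIterationCount();

    // Shared do-nothing instance so callers never pass null.
    AlgorithmTracker NO_OP = () -> { /* intentionally empty */ };
}

// At the call site:
// float result = myAlgorithm(tracker != null ? tracker : AlgorithmTracker.NO_OP);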
For most CPU architectures you don't have to worry about the optimization you want to apply, because that particular optimization is built into most contemporary CPUs. It is called branch prediction, and current CPUs are very good at it.
On average, every sixth instruction executed by a CPU is a branch, and if the CPU had to wait and evaluate the branch condition for every branch, execution would be a lot slower.
Branch prediction and Speculative execution
So when faced with a branch, the CPU, without evaluating the branch condition, starts executing (speculative execution) the path it thinks is highly likely to be correct; at a later stage, when the result of the branch condition becomes available, the CPU checks whether it is executing the correct path.
If the path picked by the CPU is consistent with the result of the branch condition, the CPU knows it's executing the correct path and keeps going at 100% speed; otherwise it has to flush all the instructions it executed speculatively and start over on the correct path.
But how does the CPU know which path to pick?
Enter the branch predictor subsystem of the CPU. In its most basic form, it stores information about the past behavior of a branch; for example, if a branch has not been taken for some time, it is likely not going to be taken now.
This is a simple explanation and a real branch predictor is going to be quite complex.
So how effective are these branch predictors?
Given that at their core branch predictors are just pattern matching machines, if your branch shows a predictable pattern you can rest assured that the branch predictor will get it right.
But if your branch shows no pattern at all, the branch predictor is not going to help you; worse yet, it will hamper your code's execution because of all the wrong predictions.
How is your code going to work out with the branch predictor?
In your case, the value of the branch control variable never changes, so the branch is either taken on every iteration of the loop or never taken.
This clearly shows a pattern that even the most basic branch predictor can discern, which means your code will practically execute as if the condition were not there: after the first few iterations, the branch predictor will pick the path with 100% accuracy.
To learn more, read this great SO thread.
Fun fact: this particular optimization is the reason for CPU vulnerabilities such as Spectre and Meltdown.
lots of code duplication
Lots of code duplication means there's lots of code. So how could one simple null check influence the performance?
Hoisting null checks out of loops is a very trivial optimization. There's no guarantee that it gets done, but when the JIT can't do it, the CPU can still perfectly predict the branch outcome.(*) So the branch will cost something like ¼ of a cycle, as current CPUs are capable of executing, say, 4 instructions per cycle.
As already said, there's some branching anyway as the JIT has to do the null check. So most probably the net win from this premature optimization is zero.
(*) The prediction can get spoiled by tons of other branches evicting your branch from the predictor (a sort of cache). But then your poor branch was just one of many and you needn't care.
"I could write two versions of myAlgorithm... but that would lead to lots of code duplication"
Yes, this may be a way to optimize performance, and it is one of the rare cases where DRY doesn't apply. Another example of such a "repeat yourself" technique is loop unrolling (if your compiler didn't do it for you :)). Here, the code duplication is the cost you pay for better performance.
But as for your particular IF: the condition doesn't change inside the loop, so CPU branch prediction should work quite well. Write good, correct performance tests (with JMH, for example) and something tells me you will not see any difference from such a pico- (not even micro-) optimization; the result may even be worse, since there are much, much more important things that can affect overall performance. Just a few of them:
the most efficient compiler optimization is inlining (https://www.baeldung.com/jvm-method-inlining). If your code transformation breaks inlining, think twice and measure the resulting performance properly
memory allocation and, therefore, GC pauses in the main/critical path of the application may also be an important thing. Reuse mutable objects if required (pooling).
cache misses. Make sure you access memory sequentially as much as possible. A canonical example: replace a LinkedList with an ArrayList for iteration and your performance becomes much better (see the sketch right after this list)
etc. etc.
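Here is the kind of thing the cache-miss bullet means, as a small illustrative sketch (the list sizes are arbitrary assumptions): iterating an ArrayList walks one contiguous backing array, while iterating a LinkedList chases node pointers scattered across the heap.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class IterationSketch {
    public static void main(String[] args) {
        List<Integer> arrayBacked = new ArrayList<>();
        List<Integer> nodeBacked = new LinkedList<>();
        for (int i = 0; i < 1_000_000; i++) {
            arrayBacked.add(i);
            nodeBacked.add(i);
        }

        long sum = 0;
        for (int v : arrayBacked) sum += v; // sequential memory access
        for (int v : nodeBacked) sum += v;  // pointer chasing on every step
        System.out.println(sum);            // keep the loops from being dead code
    }
}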
So, don't worry about this particular IF at all.
Performance optimization is a very large and very interesting area. Take care of the RIGHT things and write correct perf tests... And always think about appropriate algorithms/collections, and remember classical big O.
I've been fiddling around with some sorting algorithms and timing them to see just how effective they are. To that end, I've created a static class containing a number of sorting algorithms for integers, and another class for timing them and exporting data to csv.
I've started looking at said data, and I've noticed an interesting trend. My program creates 5 different random arrays for testing, and it takes the average of 10 different trials on each array for every sorting algorithm. The strange thing is that the first array seems to have a significantly longer average time for a few algorithms, but not for the others. Here's some example data to back it up:
Dataset 1, Dataset 2, and Dataset 3 (times are in nanoseconds).
I'm not sure whether it has to do with certain algorithms, the way I implemented the algorithms, the JVM, or some other combination of factors. Does anybody know how this type of data happens?
Also, source code for all of this is available here. Look under the 'src' folder.
This looks like the effect of just-in-time compilation.
To reduce start-up time, when code is first run, it is directly interpreted from the byte code. Only after a particular piece of code has been run several times does the JIT decide it is a critical section worth recompiling to native code. It then replaces the method with its compiled form.
The effect of just-in-time compilation is that start-up time is greatly reduced (the entire application does not need to be compiled before it's run) while the critical sections still run as fast as they can (because they eventually get compiled). It looks like the critical section got compiled somewhere within the first evaluation.
I am not sure why some algorithms do not exhibit the speedup from compilation. Most likely it's because they share their critical section with some earlier algorithm. For example, insertion sort relies heavily on the swap method, which had already been compiled during the selection sort evaluation. Another possible reason is that mergeSort, which runs consistently, does not rely heavily on function calls, so it does not benefit from inlining. Heap sort, on the other hand, relies heavily on the siftDown method, which might be difficult for the compiler to inline.
If you mark your swap method as final, it makes it easier for the compiler to inline calls to it. Since this method is small and called often, inlining it (which is part of just-in-time compilation) will help performance significantly.
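For reference, the swap helper being discussed presumably looks something like the sketch below (the class name is hypothetical). On an instance method, final rules out overriding, which makes the call an easy inlining candidate; static and private methods are non-overridable anyway and get the same benefit.

class IntSorts {
    final void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}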
To provide a consistent test environment, you could disable just-in-time compilation while running the tests (for example with the -Xint flag).
Recently I bumped into a situation where a static code analysis tool (PMD) complained about a switch statement that had too few branches. It suggested turning it into an if statement, which I did not want to do because I knew that more cases would be added soon. But I wondered whether javac performs such an optimization or not. I decompiled the code using JAD, but it still showed a switch. Is it possible that this is optimized at runtime by the JIT?
Update: Please do not be misled by the context of my question. I'm not asking about PMD, I'm not asking about the need for micro-optimisation, etc. The question is clearly only this: does the current (Oracle 1.6.x) JVM implementation contain a JIT that deals with switches with too few branches or not?
The way to determine how the JIT compiler is optimizing switch statements is either:
read the JIT compiler source code (OpenJDK 6 and 7 are open source), or
run the JVM with the switch that tells it to dump the JIT compiled code for the classes of interest to a file.
Note that like all questions related to performance and optimization, the answer depends on the hardware platform and the JVM vendor and version.
Reference: Disassemble Java JIT compiled native bytecode
If this Question is "mere idle curiosity", so be it.
However, it should also be pointed out that rewriting your code to use switch or if for performance reasons is probably a bad idea and/or a waste of time.
It is probably a waste of time because the chances are that the difference in time (if any) between the original and hand optimized versions will be insignificant.
It is a bad idea because your optimization may only be helpful for specific hardware and JVM combinations. On others, it may have no effect ... or even be an anti-optimization.
In short, even if you know how the JIT optimizer handles this, you probably shouldn't be taking it into account in your programming.
(The exception of course is when you have a real measurable performance problem, and profiling points to (say) a 3-branch switch as being one of the bottlenecks.)
If you compiled it in debug mode, it is normal that when you decompile it, you still get the switch. Otherwise, any debugging attempt would miss some information, such as line numbers and the original instruction flow.
You could thus try to compile in production mode and see what the decompilation result would be.
However, a switch statement, especially one that is expected to grow, is generally considered a code smell and should be evaluated as a good candidate for refactoring.
As for your clarification of what the question is:
Since this depends so strongly on the hardware and the JVM (JVMs using the Java trademark may be developed by companies other than Oracle as long as they adhere to the JVM specification), I'd say the only valid method would be to run speed tests.
Cut out a chunk of code, lock it in a loop for a considerable number of repetitions, and check the time before and after executing the loop. Repeat for both solutions (switch and if).
This may seem simplistic and silly, but it actually works, and is a lot faster than decompiling, reading through bytecode and memory dumps etc.
You have to remember that Java actually uses a virtual machine and bytecode. I'm pretty sure this is all handled and optimized. We use high-level languages to AVOID the kind of micromanagement and optimization you're asking about.
On a more general note, I think you are trying to optimize a bit too early. If you know there are going to be more cases in that switch, why bother at all? Did you run a profiler? If not, there's no use optimizing. "Premature optimization is the root of all evil." You might be optimizing a part of the code that isn't actually the bottleneck, increasing code complexity and wasting your own time writing code that does not contribute in any way.
I don't know what type of app you are making, but a rule of thumb says that clarity is king, and you should usually choose the simpler, more elegant, self-documenting solution.
javac performs almost no optimisations. All the optimisations are performed at runtime by the JIT. Unless you know you have a performance problem, I would assume you don't.
What PMD is complaining about is clarity, e.g.
if (a == 5) {
    // something
} else {
    // something else
}
is clearer than
switch (a) {
    case 5:
        // something
        break;
    default:
        // something else
        break;
}
I am wondering if there is any performance differences between
String s = someObject.toString();
System.out.println(s);
and
System.out.println(someObject.toString());
Looking at the generated bytecode, there seem to be differences. Is the JVM able to optimize this bytecode at runtime so that both solutions provide the same performance?
In this simple case, of course, solution 2 seems more appropriate, but sometimes I would prefer solution 1 for readability purposes, and I just want to be sure not to introduce performance "decreases" in critical code sections.
The creation of a temporary variable (especially something as small as a String) is inconsequential to the speed of your code, so you should stop worrying about this.
Try measuring the actual time spent in this part of your code and I bet you'll find there's no performance difference at all. The time it takes to call toString() and print out the result takes far longer than the time it takes to store a temporary value, and I don't think you'll find a measurable difference here at all.
Even if the bytecode looks different here, it's because javac is naive and your JIT Compiler does the heavy lifting for you. If this code really matters for speed, then it will be executed many, many times, and your JIT will select it for compilation to native code. It is highly likely that both of these compile to the same native code.
Finally, why are you calling System.out.println() in performance-critical code? If anything here is going to kill your performance, that will.
If you have critical code sections that demand performance, avoid using System.out.println(). There is more overhead incurred by going to standard output than there ever will be with a variable assignment.
Do solution 1.
Edit: or solution 2
There is no* code critical enough that the difference between your two samples makes any difference at all. I encourage you to test this; run both a few million times, and record the time taken.
Pick the more readable and maintainable form.
* Exaggerating for effect. If you have code critical enough, you've studied it to learn this.
The generated bytecode is not a good measure of the performance of a given piece of code, since this bytecode will get analysed, optimised and (in the case of the server compiler) re-analysed and re-optimised if it is deemed to be a performance bottleneck.
When in doubt, use a profiler.
Compared to output to the console, I doubt that any difference in performance between the two is going to be measurable. Don't optimize before you have measured and confirmed that you have a problem.