What are the causes of these timing issues in the JVM? - java

I've been fiddling around with some sorting algorithms and timing them to see just how effective they are. To that end, I've created a static class containing a number of sorting algorithms for integers, and another class for timing them and exporting data to csv.
I've started looking at said data, and I've noticed an interesting trend. My program creates 5 different random arrays for testing, and it takes the average of 10 trials on each array for every sorting algorithm. The strange thing is that the first array seems to have a significantly longer average time for a few algorithms, but not for the others. Here's some example data to back it up:
Dataset 1, Dataset 2, and Dataset 3 (times are in nanoseconds).
I'm not sure whether it has to do with certain algorithms, the way I implemented them, the JVM, or some combination of factors. Does anybody know why this kind of pattern occurs?
Also, source code for all of this is available here. Look under the 'src' folder.

This looks like the effect of just-in-time compilation.
To reduce start-up time, code is interpreted directly from the bytecode when it is first run. Only after a particular piece of code has been run several times does the JIT decide it is a critical section worth recompiling to native code. It then replaces the method with its compiled form.
The effect of just-in-time compilation is that start-up time is greatly reduced (the entire application does not need to be compiled before it runs) while the critical sections still end up running as fast as they can (because they eventually get compiled). It looks like your critical section got compiled somewhere within the first evaluation.
I am not sure why some algorithms do not exhibit the speedup from compilation. Most likely they share their critical section with some earlier algorithm. For example, insertion sort relies heavily on the swap method, which got compiled during the selection sort evaluation. Another possible reason is that mergeSort, which runs consistently, does not rely heavily on function calls, so it does not benefit from inlining. Heap sort, on the other hand, relies heavily on the siftDown method, which might be difficult for the compiler to inline.
Marking your swap method as final makes it easier for the compiler to inline calls to it (modern HotSpot JVMs can often inline non-final methods as well, once they have profiling data). Since this method is small and called often, inlining it (which is part of just-in-time compilation) will help performance significantly.
To provide a consistent test environment, you could disable just-in-time compilation (for example with the -Xint flag) while running the tests.
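To see the effect directly, one common pattern is to time the same workload repeatedly and compare the first run against later ones, after the JIT has had a chance to compile the hot method. A minimal sketch, using Arrays.sort as a stand-in for the sorting algorithm under test (the array size and run count are arbitrary choices):

```java
import java.util.Arrays;
import java.util.Random;

public class WarmupDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        long first = 0, last = 0;
        for (int run = 0; run < 20; run++) {
            int[] data = rnd.ints(50_000).toArray();
            long t0 = System.nanoTime();
            Arrays.sort(data); // stand-in for the sorting algorithm under test
            long elapsed = System.nanoTime() - t0;
            if (run == 0) first = elapsed;
            last = elapsed;
        }
        // Early runs are usually slower: they execute interpreted bytecode
        // until the JIT decides the method is hot and compiles it.
        System.out.println("first run: " + first + " ns, last run: " + last + " ns");
    }
}
```

Running this typically shows the first iteration taking noticeably longer than the later ones, which matches the pattern in the question's data.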

Related

Which manner of processing two repetitive operations will be completed faster?

Generally speaking, is there a significant difference in the processing speeds of these two example segments of code, and if so, which should complete faster? Assume that processA(int) and processB(int) are void methods common to both examples.
for (int x = 0; x < 1000; x++) {
    processA(x);
    processB(x);
}
or
for (int x = 0; x < 1000; x++) {
    processA(x);
}
for (int x = 0; x < 1000; x++) {
    processB(x);
}
I'm looking to see if I can speed up one of my programs; it involves cycling through blocks of data several times and processing them in different ways. Currently, it runs a separate cycle for each processing method, meaning a lot of cycles get run in total, but each cycle does very little work. I was thinking about rewriting my code so that each cycle incorporates every processing method; in other words, far fewer cycles, but each cycle has a heavier workload.
This would be a very intensive rewrite of my program structure, so unless it would give me a significant performance boost, it won't be worth the trouble.
The first case will be slightly faster than the second because a for loop in and of itself has a cost. However, the main question you should ask yourself is this: will the effect be significant in my program? If not, you should opt for clear and readable code.
One thing to remember in such a case is that the JVM (Java Virtual Machine) does a whole lot of optimisation, and in your case the JVM can even get rid of the for-loop and rewrite the code into 1000 successive calls to processA() and processB(). So even if you have two for loops, the JVM can get rid of both, making your program more optimal than even your first case.
To get a basic understanding of method calls, cost, and the JVM, you can read this short article:
https://plumbr.eu/blog/how-expensive-is-a-method-call-in-java
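For completeness, here is a small sketch of the two loop shapes under discussion, using hypothetical accumulating stand-ins for processA and processB. Either form produces the same result; only the loop-control overhead differs:

```java
public class LoopFusion {
    static long sumA, sumB;

    static void processA(int x) { sumA += x; }            // hypothetical workload
    static void processB(int x) { sumB += (long) x * x; } // hypothetical workload

    // One fused loop: loop-control instructions execute once per iteration.
    static void fused() {
        for (int x = 0; x < 1000; x++) {
            processA(x);
            processB(x);
        }
    }

    // Two separate loops: loop control runs twice as often,
    // but each body is simpler.
    static void split() {
        for (int x = 0; x < 1000; x++) processA(x);
        for (int x = 0; x < 1000; x++) processB(x);
    }

    public static void main(String[] args) {
        fused();
        long a = sumA, b = sumB;
        sumA = sumB = 0;
        split();
        System.out.println(a == sumA && b == sumB); // both forms agree
    }
}
```

Timing either variant with System.nanoTime() over many iterations is the only reliable way to see whether the difference matters for a given workload; for bodies that do any real work, it usually does not.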
Absolutely, fusing the 2 loops will be faster. (Some compilers do this automatically as an optimization.) How much faster? That depends on how many iterations the loops are running. Unless the number of iterations is very high, you can expect that the improvement will be minimal.
The single-loop case will contain fewer instructions, so it will run faster.
But unless processA and processB are very quick functions, such a substantial refactoring would give you negligible performance gain.
If this is production code, you should also take care, as there may be side effects. You should make alterations in the context of a unit testing framework covering the code in question. (In C++, for example, x may be passed by reference and could be modified by the functions! Java has no such hazard, but there may be other reasons why all the processA calls have to run before all the processB calls, and the program comments may not make this clear.)
The first piece of code is faster, because it only has one for loop.
If, however, you need to do something AFTER processA() has been executed n times and before processB()'s loop starts, then the second option would be ideal.

How to be reasonably sure a code block has been JIT compiled?

When conducting performance testing of Java code, you want to test JIT-compiled code rather than raw bytecode. To cause the bytecode to be compiled, you must trigger compilation by executing the code a number of times, and also allow enough yield time for the background thread to complete the compilation.
What is the minimum number of "warm up" executions of a code path required to be "very confident" that the code will be JIT compiled?
What is the minimum sleep time of the main thread to be "very confident" that the compilation has completed (assuming a smallish code block)?
I'm looking for a threshold that would safely apply in any modern OS, say Mac OS or Windows for the development environment and Linux for CI/production.
Since the OP's intent is not actually to figure out whether the block is JIT-compiled, but rather to make sure the optimized code is measured, I think the OP needs to watch some of these benchmarking talks.
TL;DR version: There is no reliable way to figure out whether you hit the "steady state":
You can only measure for a long time to get a ball-park estimate of the usual time your concrete system takes to reach some state you can call "steady".
Observing -XX:+PrintCompilation is not reliable, because you may be in the phase when counters are still in flux, and JIT is standing by to compile the next batch of now-hot methods. You can easily have a number of warmup plateaus because of that. The method can even recompile multiple times, depending on how many tiered compilers are standing by.
While one could argue about the invocation thresholds, these things are not reliable either, since tiered compilation may be involved, method might get inlined sooner in the caller, the probabilistic counters can miss the updates, etc. That is, common wisdom about -XX:CompileThreshold=# is unreliable.
JIT compilation is not the only warmup effect you are after. Automatic GC heuristics, scheduler heuristics, etc. also require warmup.
Get a microbenchmark harness which will make the task easier for you!
To begin with, the results will most likely differ for a JVM run in client mode versus server mode. Second, this number depends highly on the complexity of your code, and I am afraid you will have to estimate a number exploratively for each test case. In general, the more complex your bytecode is, the more optimization can be applied to it, and therefore your code must get relatively hotter in order to make the JVM reach deep into its toolbox. A JVM might recompile a segment of code a dozen times.
Furthermore, a "real world" compilation depends on the context in which your byte code is run. A compilation might for example occur when a monomorphic call site is promoted to a megamorphic one such that an observed compilation actually represents a de-optimization. Therefore, be careful when assuming that your micro benachmark reflects the code's actual performance.
Instead of the suggested flag, I suggest using CompilationMXBean, which allows you to check how much time the JVM is still spending on compilation. If that time is too high, rerun your test until the value stays stable long enough. (Be patient!) Frameworks can help you with creating good benchmarks. Personally, I like caliper. However, never trust your benchmark.
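A minimal sketch of this approach: poll CompilationMXBean until total JIT compilation time stops increasing. The 500 ms stability window is an arbitrary choice, not a recommended constant:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class CompilationWatch {
    public static void main(String[] args) throws InterruptedException {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            System.out.println("Compilation time monitoring not supported");
            return;
        }
        long previous = -1;
        // Wait until the cumulative compilation time stops growing
        // across two consecutive 500 ms samples.
        while (true) {
            long total = jit.getTotalCompilationTime(); // ms spent compiling so far
            if (total == previous) break;
            previous = total;
            Thread.sleep(500);
        }
        System.out.println("Compilation quiet; total time: " + previous + " ms");
    }
}
```

Note that a quiet interval does not guarantee a steady state; as discussed above, the JIT may still be waiting for counters to trip before compiling the next batch of hot methods.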
From my experience, custom bytecode performs best when you imitate the idioms of javac. To mention the one anecdote I can tell on this matter, I once wrote custom bytecode for the Java source code equivalent of:
int[] array = {1, 2, 3};
javac creates the array and uses dup for assigning each value, but I stored the array reference in a local variable and loaded it back onto the operand stack for each assignment. The array was bigger than that, and there was a noticeable performance difference.
Finally, I recommend this article before writing a benchmark.
Not sure about numbers, but when doing speed tests what I do is:
Run with the -XX:+PrintCompilation flag
Warm up the JVM until there are no more compilation debug messages generated and, if possible, timings become consistent

Should I inline long code in a loop, or move it in a separate method?

Assume I have a loop (any while or for) like this:
loop{
A long code.
}
From the point of view of performance, should I divide this code into parts, write a function outside the loop, and call that function repeatedly?
I read something about functions long ago, that calling a function repeatedly takes more time or memory or something like that; I don't remember it exactly. Can you also provide some good references about things like this (time complexity, coding style)?
Can you also provide some reference book or tutorial about heap memory, overheads, etc. which affect the performance of a program?
The performance difference is probably very minimal in this case. I would concentrate on clarity rather than performance until you identify this portion of your code to be a serious bottleneck.
It really does depend on what kind of code you're running in the loop, however. If you're just doing a tiny mathematical operation that isn't going to take any CPU time, but you're doing it a few hundred thousand times, then inlining the calculation might make sense. Anything more expensive than that, though, and performance shouldn't be an issue.
There is an overhead to calling a function.
So if the "long code" is fast compared to this overhead (and your application cares about performance), then you should definitely avoid the overhead.
However, if the performance is not noticeably worse, it's better to make the code more readable by using a function (or better, multiple functions).
Rule one of performance optimisation: Measure it.
Personally, I go for readable code first and then optimise it IF NECESSARY. Usually, it isn't necessary :-)
See the first line in CHAPTER 3 - Measurement Is Everything
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." - Donald Knuth
In this case, the difference in performance will probably be minimal between the two solutions, so writing clearer code is the way to do it.
There really isn't a simple "tutorial" on performance; it is a very complex subject, and one that even seasoned veterans often don't fully understand. Anyway, to give you more of an idea of what the overhead of "calling" a function is: basically what you are doing is "freezing" the state of your function (in Java there are no "functions" per se, they are all called methods), calling the method, then "unfreezing" to where your method was before.
The "freezing" essentially consists of pushing state information (where you were in the method, the values of the variables, etc.) onto the stack; "unfreezing" consists of popping the saved state off the stack and restoring the control structures to where they were before you called the function. Naturally, memory operations are far from free, but the VM is pretty good at keeping the performance impact to an absolute minimum.
Now keep in mind Java is almost entirely heap-based; the only things that really have to get pushed onto the stack are the values of pointers (small), your place in the program (again small), whatever primitives you have local to your method, and a tiny bit of control information, nothing else. Furthermore, although you cannot explicitly inline in Java (though I'm sure there are bytecode editors out there that essentially let you do that), most VMs, including the most popular HotSpot VM, will do this automatically for you. http://java.sun.com/developer/technicalArticles/Networking/HotSpot/inlining.html
So the bottom line is pretty much zero performance impact. If you want to verify this for yourself, you can always run benchmarking and profiling tools; they should be able to confirm it for you.
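As an illustration of the point above, here are two variants with a made-up workload: one with the computation written inline in the loop, one with it extracted into a small method. Once the loop is hot, HotSpot will typically inline the small method, so both variants tend to compile to essentially the same native code:

```java
public class InlineDemo {
    // Loop body written inline.
    static long inlined(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    // Same body extracted into a small method. HotSpot will usually
    // inline square() once the loop becomes hot.
    static long square(int i) { return (long) i * i; }

    static long extracted(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += square(i);
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(inlined(1000) == extracted(1000)); // identical results
    }
}
```

The readable, factored-out version costs you nothing in the long run, which is why keeping methods small is good advice for both readability and JIT friendliness.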
From an execution-speed point of view it shouldn't matter, and if you still believe this is a bottleneck, it is easy to measure.
From a development performance perspective, it is a good idea to keep the code short. I would vote for turning the loop contents into one (or more) properly named methods.
Forget it! You can't gain any performance by doing the JIT's job for it. Let the JIT inline it for you. Keep the methods short for readability and also for performance, as the JIT works better with short methods.
There are micro-optimizations which may help you gain some performance, but don't even think about them yet. I suggest the following rules:
Write clean code using appropriate objects and algorithms for readability and for performance.
In case the program is too slow, profile and identify the critical parts.
Think about improving them using better objects and algorithms.
As a last resort, you may also consider micro-optimizations.

What is the effect of recursion on a server?

I've heard that running recursive code on a server can impact performance. How true is this statement, and should recursive methods be used only as a last resort?
Recursion can potentially consume more memory than an equivalent iterative solution, because the latter can be optimized to take up only the memory it strictly needs, while recursion saves all local variables on the stack, thus taking up a bit more than strictly necessary. This is only a problem in a memory-limited environment (which is not a rare situation) and for potentially very deep recursion: a few dozen recursive legs taking up a few hundred bytes each will not measurably impact the memory footprint of the server. So "last resort" is an overstatement.
But when profiling shows you that the footprint impact is large, one optimization-refactoring you can definitely perform is recursion removal -- a popular topic since a few decades ago in the academic literature, but typically not hard to do by hand (especially if you keep all your methods, recursive or otherwise, reasonably small, as you should;-).
I've heard that running recursive code on a server can impact performance. How true is this statement?
It is true that it impacts performance, in the same way that creating variables, looping, or executing pretty much anything else does.
If the recursive code is poorly written or uncontrolled, it will consume your system's resources the same way an uncontrolled while loop would.
and should recursive method be used as a last resort?
No. It may be used as a first resort; many times it is easier to code a recursive function. That depends on your skills. But to be clear, there is nothing particularly evil about recursion.
To discuss performance you have to talk about very specific scenarios. Used appropriately, recursion is fine. Used inappropriately, it could blow the stack, or just use too much stack. This is especially true if you somehow get a recursive tail call without it ever terminating (typically a bug, such as an attempt to walk a cyclic graph), as it won't even blow the stack (it'll just run forever, chomping CPU cycles).
But get it right (and limit the depth to sane amounts) and it is fine.
A badly programmed recursion that does not terminate has a negative impact on the machine, consuming an ever-growing amount of resources and threatening the stability of the whole system in the worst case.
Otherwise, recursions are a perfectly legitimate tool like loops and other constructs. They have no negative effect on performance per se.
Tail recursion is also an alternative. It boils down to this: pass the result being built up as a parameter of the recursive method, so that the recursive call is the last thing the method does. On platforms that optimize tail calls, the stack then won't blow up. More at Wikipedia and this site.
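A sketch of the accumulator style using factorial as the example. Note that HotSpot does not perform tail-call elimination, so in Java the tail-recursive form is mainly a mechanical stepping stone toward an equivalent loop rather than a stack-safety guarantee by itself:

```java
public class TailFactorial {
    // Plain recursion: the multiplication happens after
    // the recursive call returns, so the frame must be kept.
    static long factorial(int n) {
        return n <= 1 ? 1 : n * factorial(n - 1);
    }

    // Tail-recursive form: the result is threaded through an
    // accumulator, so the recursive call is the final action.
    static long factorialAcc(int n, long acc) {
        return n <= 1 ? acc : factorialAcc(n - 1, n * acc);
    }

    // The tail-recursive form translates mechanically into a loop,
    // which is what a tail-call-optimizing compiler would produce.
    static long factorialLoop(int n) {
        long acc = 1;
        while (n > 1) acc *= n--;
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(factorial(10));       // 3628800
        System.out.println(factorialAcc(10, 1)); // 3628800
        System.out.println(factorialLoop(10));   // 3628800
    }
}
```

All three compute the same value; only the loop version runs in constant stack space on the JVM.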
Recursion is a tool of choice when you have to write algorithms. It's also much easier than iteration when you have to deal with recursive data structures like trees or graphs. It's usually harmless if (as a rule of thumb) you can bound the recursion depth to something not too large, provided that you do not forget the end condition...
Most modern compilers are able to optimize some kinds of recursive calls (replacing them internally with non-recursive equivalents). It's especially easy with tail recursion, that is, when the recursive call is the last instruction before returning the result.
However, there are some issues specific to Java. The underlying JVM does not provide any kind of goto instruction, which limits what the compiler can do. If it's a tail recursion internal to one function, it can be replaced by a simple loop inside the function; but if the terminal call goes through another function, or if several functions recursively call one another, it becomes quite difficult to do when targeting JVM bytecode. The Sun JVM does not support tail-call optimization, though there are plans to change that; the IBM JVM does support it.
With some languages (functional languages like Lisp or Haskell) recursion is also the only (or the most natural) way to write programs. In JVM-based functional languages like Clojure or Scala, the lack of tail-call optimization is a problem that leads to workarounds like trampolines in Scala.
Running any code on a server can impact performance. Server performance is usually going to be impacted by storage I/O before anything else, so at the level of "server performance" it's odd to see the question of general algorithm strategy talked about.
Deep recursion can cause a stack overflow, which is nasty. Be careful, as it's hard to recover once it happens. Small, manageable pieces of work are easier to handle and parallelize.

Java Optimizations

I am wondering if there is any performance differences between
String s = someObject.toString();
System.out.println(s);
and
System.out.println(someObject.toString());
Looking at the generated bytecode, there seem to be differences. Is the JVM able to optimize this bytecode at runtime so that both solutions provide the same performance?
In this simple case, solution 2 of course seems more appropriate, but sometimes I would prefer solution 1 for readability purposes, and I just want to be sure not to introduce performance decreases in critical code sections.
The creation of a temporary variable (especially something as small as a String) is inconsequential to the speed of your code, so you should stop worrying about this.
Try measuring the actual time spent in this part of your code and I bet you'll find there's no performance difference at all. The time it takes to call toString() and print out the result takes far longer than the time it takes to store a temporary value, and I don't think you'll find a measurable difference here at all.
Even if the bytecode looks different here, that's because javac is naive and your JIT compiler does the heavy lifting for you. If this code really matters for speed, it will be executed many, many times, and the JIT will select it for compilation to native code. It is highly likely that both of these compile to the same native code.
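For reference, here are the two forms side by side, with the output of each captured so they can be compared. The someObject value is an arbitrary placeholder; the temporary String is just a slot the optimizer is free to eliminate:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class TempVarDemo {
    // Runs r with System.out redirected to a buffer and returns what was printed.
    static String capture(Runnable r) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream old = System.out;
        System.setOut(new PrintStream(buf));
        try { r.run(); } finally { System.setOut(old); }
        return buf.toString();
    }

    public static void main(String[] args) {
        Object someObject = java.util.Arrays.asList(1, 2, 3); // placeholder object

        // Solution 1: temporary variable, often chosen for readability.
        String viaTemp = capture(() -> {
            String s = someObject.toString();
            System.out.println(s);
        });

        // Solution 2: inline call.
        String inline = capture(() -> System.out.println(someObject.toString()));

        System.out.println(viaTemp.equals(inline)); // behaviorally identical
    }
}
```

Both forms print the same thing; any timing difference between them is dwarfed by the cost of the I/O itself.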
Finally, why are you calling System.out.println() in performance-critical code? If anything here is going to kill your performance, that will.
If you have critical code sections that demand performance, avoid using System.out.println(). There is more overhead incurred by going to standard output than there ever will be with a variable assignment.
Do solution 1.
Edit: or solution 2
There is no* code critical enough that the difference between your two samples makes any difference at all. I encourage you to test this; run both a few million times, and record the time taken.
Pick the more readable and maintainable form.
* Exaggerating for effect. If you have code critical enough, you've studied it to learn this.
The generated bytecode is not a good measure of the performance of a given piece of code, since the bytecode will get analysed, optimised and (in the case of the server compiler) re-analysed and re-optimised if it is deemed to be a performance bottleneck.
When in doubt, use a profiler.
Compared to output to the console, I doubt that any difference in performance between the two is going to be measurable. Don't optimize before you have measured and confirmed that you have a problem.
