Smart JVM and JIT Micro-Optimizations

Smart JVM and JIT Micro-Optimizations - java

Over time, Sun's JVM and JIT have gotten pretty smart. Things that used to be common knowledge as being a necessary micro-optimization are no longer needed, because it gets taken care of for you.
For example, it used to be the case that you should mark all possible classes as final, so the JVM inlines as much code as possible. However now, the JIT knows whether your class is final based on what classes get loaded in at runtime, and if you load a class to make the original one non-final-able, it un-inlines the methods and un-marks it as final.
What other smart micro-optimizations does the JVM or JIT do for you?
EDIT: I made this a community wiki; I'd like to collect these over time.

It's beyond impressive. All of these are things you can't do in C++ (certainly to the same extent Java does). Keep in mind that early versions of Java started the "slow" reputation by not having these things, and we keep improving significantly over time. This is still a big research area.
Efficient interface dispatch.
Inlining and direct dispatch of virtual method calls.
Very fast object allocation with bump pointers (slide 19 or so) and escape analysis.

Oracle has a wiki on Performance techniques used in the Hotspot JVM.

Java is smarter at inlining as it can
inline code only available at runtime
or even dynamically generated.
inline virtual methods (up to two at once)
perform escape analysis on inlined methods and the methods they were inlined to. (Much harder to do
in C++)

Related

Does dynamically generated java byte-code need any optimization?

I did some java byte code generation using ASM.
By walking through some sort of AST of some kind of small DSL in visitor pattern.
And I'm worrying about the generated byte code is too 'straightforward', that is, without any 'compile-time optimization'.
Although in my case, that could be ok if the generated byte code is not optimized, still I can't help to ask: is there a need for those projects which generate byte code at runtime to do bytecode optimization?
I know the fact that for jvm, most of the 'optimization' work is done while the program is running, by jit compilation. So the bytecode optimization at compile time may effect little.
But, really? Is it absolutely meaningless to do bytecode optimization for the bytecode generated on the fly? Is there any one to share some experience about the difference, mainly in runtime performance, between bytecodes with and without any form of optimization?

I know at least one JVM based language, which share remain nameless, is slow as hell. It could have used some compile time optimization.
Javac and JVM are analyzing roughly the same programming model, therefore any optimization techniques that Javac can employ can be employed by JVM too. Then there is not much point for Javac to duplicate the work. Actually it's probably preferred that Javac leaves as much structure of the source code as possible so JVM can better reason about the code.
That doesn't apply if the source language is not a Java-ish language.
Think about this, CPU does a lot of wonderful optimization too, so why does JVM need to do any optimization? Why not leave it all to CPU. Because CPU and JVM are analyzing very different code. CPU is analyzing an arbitrary sequence of machine instructions(though it can make assumptions based on common behaviors of high level languages). JVM is analyzing a very specific, much higher level language, JVM can reason and transform the code based on knowledges that are almost impossible for CPU to discover from the machine instructions.
Back to your case, it is possible that, you (as the compiler) knows a lot better about your even-higher-level source language, you can perform transformations that are impossible by JVM.

No it is not necessary.
If you look at Javac's output, it does virtually no compile time optimization at all. And thanks to Hotspot's JIT, it is difficult to tell what impact changing the bytecode will have on optimization anyway. It's best not to worry about such things unless you can prove there's a real bottleneck and have the time to investigate it.

.NET runtime vs. Java Hotspot: Is .NET one generation behind?

According to the information I could gather on .NET and Java execution environment, the current state of affairs is follows:
Modern Java VM are capable of performing continuous recompilation, which combined with profiling can yield great performance improvements. Older JVMs employed JIT.
More information in this article:
http://www.ibm.com/developerworks/library/j-jtp12214/ and especially: Java theory and practice: Dynamic compilation and performance measurement
.NET uses JIT or NGEN to generate native code, but once the native code is generated, no further (runtime) optimizations are performed.
Benchmarks aside and with no intention to escalate holy wars, does this mean that Java Hotspot VM is one generation ahead of .Net. Will these technologies employed at Java VM eventually find its way into .NET runtime?

They follow two different strategies. I do not think one is better than the other.
.NET does not interpret bytecode, so it has to JIT everything as is gets executed and therefore cannot optimise heavily due to time constraints. If you need heavy optimizations in some part of the code, you can always NGEN it manually, or do a fast but unsafe implementation. Furthermore, calling native code is easy. The approach here seems to be getting the runtime good enough and manually optimise bottlenecks.
Modern JVMs will usually interpret most of the code, and then do an optimized compilation of the bottlenecks. This usually gets better results than straight JIT'ing, but if you need more, you don't have unsafe in Java, and calling native code is not nice. So the approach here is to do as much automatic optimising as possible, because the other options are not that good.
In reality Java applications tend to perform slightly better in time and worse in space when compared to .NET.

I've never benchmarked the two to compare, and I'm more familiar with the Sun JVM, I can only speak in general terms about JITs.
There are always tradeoffs with optimizations, and not all optimizations work all the time. However, here are some modern JIT techniques. I think this can be the beginning of a good conversation if we stick to the technical stuff:
escape analysis
intrinsics
http://bugs.sun.com/view_bug.do?bug_id=6823354
http://weblog.ikvm.net/CommentView.aspx?guid=0404dd8a-88a8-4d62-9bcb-98324d57a2a9
tail-call optimization
on-stack replacement
lock coarsening
lock elision
multi-threaded garbage collection
low-pause garbage collection
polymorphic method call removal
fast heap allocation
There's also features that are helpful as far as good implementations of a VM go:
being able to pick between GC
implementations customization of each GC
heap allocation parameters (such as growth)
page locking
Based on these features and many more, we can compare VMs, and not just "Java" versus ".NET" but, say, Sun's JVM versus IBM's JVM versus .NET versus Mono.
For example, Sun's JVM doesn't do tail-call optimization, IIRC, but IBM's does.

Apparently someone was working on something similar for Rotor. I don't have access to IEEE so I can't read the abstract.
Dynamic recompilation and profile-guided optimisations for a .NET JIT compiler
Quote from Summary...
An evaluation of the framework using a
set of test programs shows that
performance can improve by a maximum
of 42.3% and by 9% on average. Our
results also show that the overheads
of collecting accurate profile
information through instrumentation to
an extent outweigh the benefits of
profile-guided optimisations in our
implementation, suggesting the need
for implementing techniques that can
reduce such overheads.

You may be interested in SPUR which is a Tracing JIT compiler. The focus is on javascript but it operates on CIL not the language itself. It is a research project based on Bartok not the standard .NET VM. The paper has some performance benchmarks showing 'it consistently performs faster than SPUR-CLR' which is the standard 3.5 CLR. There haven't been any announcements about it's future relating to the current VM however. Traces can cross method boundaries which is not something HotSpot does AFAIK, JVM tracing JITs are mentioned here.
I'd be hesitant to say the .NET VM is a generation behind especially when considering all the sub-systems, in particular generics. How the GC and DLR vs invokedynamic compare I'm unsure but there are lots of details about them at places like channel9.

Why doesn't the JVM cache JIT compiled code?

The canonical JVM implementation from Sun applies some pretty sophisticated optimization to bytecode to obtain near-native execution speeds after the code has been run a few times.
The question is, why isn't this compiled code cached to disk for use during subsequent uses of the same function/class?
As it stands, every time a program is executed, the JIT compiler kicks in afresh, rather than using a pre-compiled version of the code. Wouldn't adding this feature add a significant boost to the initial run time of the program, when the bytecode is essentially being interpreted?

Without resorting to cut'n'paste of the link that #MYYN posted, I suspect this is because the optimisations that the JVM performs are not static, but rather dynamic, based on the data patterns as well as code patterns. It's likely that these data patterns will change during the application's lifetime, rendering the cached optimisations less than optimal.
So you'd need a mechanism to establish whether than saved optimisations were still optimal, at which point you might as well just re-optimise on the fly.

Oracle's JVM is indeed documented to do so -- quoting Oracle,
the compiler can take advantage of
Oracle JVM's class resolution model to
optionally persist compiled Java
methods across database calls,
sessions, or instances. Such
persistence avoids the overhead of
unnecessary recompilations across
sessions or instances, when it is
known that semantically the Java code
has not changed.
I don't know why all sophisticated VM implementations don't offer similar options.

An updated to the existing answers - Java 8 has a JEP dedicated to solving this:
=> JEP 145: Cache Compiled Code. New link.
At a very high level, its stated goal is:
Save and reuse compiled native code from previous runs in order to
improve the startup time of large Java applications.
Hope this helps.

Excelsior JET has a caching JIT compiler since version 2.0, released back in 2001. Moreover, its AOT compiler may recompile the cache into a single DLL/shared object using all optimizations.

I do not know the actual reasons, not being in any way involved in the JVM implementation, but I can think of some plausible ones:
The idea of Java is to be a write-once-run-anywhere language, and putting precompiled stuff into the class file is kind of violating that (only "kind of" because of course the actual byte code would still be there)
It would increase the class file sizes because you would have the same code there multiple times, especially if you happen to run the same program under multiple different JVMs (which is not really uncommon, when you consider different versions to be different JVMs, which you really have to do)
The class files themselves might not be writable (though it would be pretty easy to check for that)
The JVM optimizations are partially based on run-time information and on other runs they might not be as applicable (though they should still provide some benefit)
But I really am guessing, and as you can see, I don't really think any of my reasons are actual show-stoppers. I figure Sun just don't consider this support as a priority, and maybe my first reason is close to the truth, as doing this habitually might also lead people into thinking that Java class files really need a separate version for each VM instead of being cross-platform.
My preferred way would actually be to have a separate bytecode-to-native translator that you could use to do something like this explicitly beforehand, creating class files that are explicitly built for a specific VM, with possibly the original bytecode in them so that you can run with different VMs too. But that probably comes from my experience: I've been mostly doing Java ME, where it really hurts that the Java compiler isn't smarter about compilation.

Is it possible to write a decent java optimizer if information is lost in the translation to bytecode?

It occurred to me that when you write a C program, the compiler knows the source and destination platform (for lack of a better term) and can optimize to the machine it is building code for.
But in java the best the compiler can do is optimize to the bytecode, which may be great, but there's still a layer in the jvm that has to interpret the bytecode, and the farther the bytecode is away translation-wise from the final machine architecture, the more work has to be done to make it go.
It seems to me that a bytecode optimizer wouldn't be nearly as good because it has lost all the semantic information available from the original source code (which may already have been butchered by the java compiler's optimizer.)
So is it even possible to ever approach the efficiency of C with a java compiler?

Actually, a byte-code JIT compiler can exceed the performance of statically compiled languages in many instances because it can evaluate the byte code in real time and in the actual execution context. So the apps performance increases as it continues to run.

What Kevin said. Also, the bytecode optimizer (JIT) can also take advantage of runtime information to perform better optimizations. For instance, it knows what code is executing more (Hot-spots) so it doesn't spend time optimizing code that rarely executes. It can do most of the stuff that profile-guided optimization gives you (branch prediction, etc), but on-the-fly for whatever the target procesor is. This is why the JVM usually needs to "warm up" before it reaches best performance.

In theory both optimizers should behave 'identically' as it is standard practice for c/c++ compilers to perform the optimization on the generated assembly and not the source code so you've already lost any semantic information.

If you read the byte code, you may see that the compiler doesn't optimise the code very well. However the JIT can optimise the code so this really doesn't matter.
Say you compile the code on an x86 machine and new architecture comes along, lets call it x64, the same Java binary can take advantage of the new features of that architecture even though it might not have existed when the code was compiled. It means you can take old distributions of libraries and take advantage of the latest hardware specific optimisations. You cannot do this with C/C++.
Java can optimise inline calls for virtual methods. Say you have a virtual method with many different possible implementations. However, say one or two implementations are called most of the time in reality. The JIT can detect this and inline up to two method implementations but still behave correctly if you happen to call another implementation. You cannot do this with C/C++
Java 7 supports escape analysis for locked/synchronised objects, it can detect that an object is only used in a local context and drop synchronization for that object.
In the current versions of Java, it can detect if two consecutive methods lock the same object and keep the lock between them (rather than release and re-acquire the lock)
You cannot do this with C/C++ because there isn't a language level understanding of locking.

Virtual Machine Optimization

I am messing around with a toy interpreter in Java and I was considering trying to write a simple compiler that can generate bytecode for the Java Virtual Machine. Which got me thinking, how much optimization needs to be done by compilers that target virtual machines such as JVM and CLI?
Do Just In Time (JIT) compilers do constant folding, peephole optimizations etc?

I'm just gonna add two links which explain Java's bytecode pretty well and some of the various optimization of the JVM during runtime.

Optimisation is what makes JVMs viable as environments for long running applications, you can bet that SUN, IBM and friends are doing their best to ensure they can optimise your bytecode and JIT-compiled code in an efficient a manner as possible.
With that being said, if you think you can pre-optimise your bytecode then it probably won't do much harm.
It is worth being aware, however, that JVMs can tend towards performing better (and not crashing) when presented with just the sort of bytecode the Java compiler tends to construct. It is not unknown for optimisations to be missed or even for the JVM to crash when permutations of bytecode occur that are correct but unlike what would be produced by javac. Hopefully that sort of thing is more in the past now, but may be something to be aware of.

Optimising bytecode is probably an oxymoron in most cases
I don't think that's true. Optimizations like hoisting loop invariants and propagating constants can never hurt, even if the JVM is smart enough to do them on its own, by simple virtue of making the code do less work.

Obfuscators such as ProGuard will perform many static optimisations on your bytecode for you.

The HotSpot compiler will optimize your code at runtime better than is possible at compile-time - it has more information to work with, after all. The only time you should be optimizing the bytecode instead of just your algorithm is when you are targeting mobile devices, such as the Blackberry, where the JVM for that platform is not powerful enough to optimize code at runtime and just executes the bytecode.

Optimising bytecode is probably an oxymoron in most cases. Unless you control the VM, you have no idea what it does to speed up code execution, if anything. The compiler would need to know the details of the VM in order to generate optimised code.

Note to Aseraphim:
It can also be useful to optimise bytecode for non-embedded applications in some limited cases:
When delivering code over the wire, eg for WebStart apps, to minimise deliverable/cache size and because you don't necessarily know the capability/speed of the client.
For code that you know is performance critical and used at start-up before (say) HotSpot has had time to gather any stats.
Again, the transformations that a good optimiser/obfuscator performs can be very helpful.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Smart JVM and JIT Micro-Optimizations - java

Oracle has a wiki on Performance techniques used in the Hotspot JVM.

Java is smarter at inlining as it can inline code only available at runtime or even dynamically generated. inline virtual methods (up to two at once) perform escape analysis on inlined methods and the methods they were inlined to. (Much harder to do in C++)

Related

Does dynamically generated java byte-code need any optimization?

.NET runtime vs. Java Hotspot: Is .NET one generation behind?

Why doesn't the JVM cache JIT compiled code?

Is it possible to write a decent java optimizer if information is lost in the translation to bytecode?

Virtual Machine Optimization

Categories

Resources