.NET runtime vs. Java Hotspot: Is .NET one generation behind?

According to the information I could gather on the .NET and Java execution environments, the current state of affairs is as follows:
Modern Java VMs are capable of performing continuous recompilation, which, combined with profiling, can yield great performance improvements. Older JVMs employed plain JIT compilation.
More information in this article:
http://www.ibm.com/developerworks/library/j-jtp12214/ and especially: Java theory and practice: Dynamic compilation and performance measurement
.NET uses JIT or NGEN to generate native code, but once the native code is generated, no further (runtime) optimizations are performed.
Benchmarks aside, and with no intention to escalate holy wars, does this mean that the Java HotSpot VM is one generation ahead of .NET? Will the techniques employed in the Java VM eventually find their way into the .NET runtime?

They follow two different strategies. I do not think one is better than the other.
.NET does not interpret bytecode, so it has to JIT everything as it gets executed and therefore cannot optimise heavily due to time constraints. If you need heavy optimizations in some part of the code, you can always NGEN it manually, or do a fast but unsafe implementation. Furthermore, calling native code is easy. The approach here seems to be getting the runtime good enough and manually optimising bottlenecks.
Modern JVMs will usually interpret most of the code, and then do an optimized compilation of the bottlenecks. This usually gets better results than straight JIT'ing, but if you need more, you don't have unsafe in Java, and calling native code is not nice. So the approach here is to do as much automatic optimising as possible, because the other options are not that good.
In reality Java applications tend to perform slightly better in time and worse in space when compared to .NET.

I've never benchmarked the two to compare them, and I'm more familiar with the Sun JVM, so I can only speak in general terms about JITs.
There are always tradeoffs with optimizations, and not all optimizations work all the time. However, here are some modern JIT techniques. I think this can be the beginning of a good conversation if we stick to the technical stuff:
escape analysis
intrinsics
http://bugs.sun.com/view_bug.do?bug_id=6823354
http://weblog.ikvm.net/CommentView.aspx?guid=0404dd8a-88a8-4d62-9bcb-98324d57a2a9
tail-call optimization
on-stack replacement
lock coarsening
lock elision
multi-threaded garbage collection
low-pause garbage collection
polymorphic method call removal
fast heap allocation
There are also features that are helpful as far as good implementations of a VM go:
being able to pick between GC implementations
customization of each GC
heap allocation parameters (such as growth)
page locking
Based on these features and many more, we can compare VMs, and not just "Java" versus ".NET" but, say, Sun's JVM versus IBM's JVM versus .NET versus Mono.
For example, Sun's JVM doesn't do tail-call optimization, IIRC, but IBM's does.
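To make the tail-call point concrete, here is a minimal sketch (the class and method names are hypothetical). On a VM that does not perform tail-call optimization, the recursive form still consumes one stack frame per call, so for a large enough `n` it can fail with a `StackOverflowError`; the loop is roughly what a tail-call-optimizing compiler would turn it into.

```java
public class TailCallSketch {
    // Tail-recursive form: the recursive call is the very last action.
    // Without tail-call optimization, every call still needs its own stack frame.
    static long sumToRecursive(long n, long acc) {
        if (n == 0) {
            return acc;
        }
        return sumToRecursive(n - 1, acc + n);
    }

    // Equivalent loop: constant stack usage regardless of n.
    static long sumToLoop(long n) {
        long acc = 0;
        while (n > 0) {
            acc += n;
            n--;
        }
        return acc;
    }

    public static void main(String[] args) {
        // Small n for the recursive version so it stays within the default stack.
        System.out.println(sumToRecursive(1_000, 0));
        // The loop version is safe for much larger inputs.
        System.out.println(sumToLoop(10_000_000L));
    }
}
```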

Apparently someone was working on something similar for Rotor. I don't have access to IEEE so I can't read the abstract.
Dynamic recompilation and profile-guided optimisations for a .NET JIT compiler
Quote from the summary:
"An evaluation of the framework using a set of test programs shows that performance can improve by a maximum of 42.3% and by 9% on average. Our results also show that the overheads of collecting accurate profile information through instrumentation to an extent outweigh the benefits of profile-guided optimisations in our implementation, suggesting the need for implementing techniques that can reduce such overheads."

You may be interested in SPUR, which is a tracing JIT compiler. The focus is on JavaScript, but it operates on CIL, not on the language itself. It is a research project based on Bartok, not the standard .NET VM. The paper has some performance benchmarks showing 'it consistently performs faster than SPUR-CLR', which is the standard 3.5 CLR. There haven't been any announcements about its future relating to the current VM, however. Traces can cross method boundaries, which is not something HotSpot does AFAIK; JVM tracing JITs are mentioned here.
I'd be hesitant to say the .NET VM is a generation behind especially when considering all the sub-systems, in particular generics. How the GC and DLR vs invokedynamic compare I'm unsure but there are lots of details about them at places like channel9.

Related

Why is memory management so visible in Java VM?

I'm playing around with writing some simple Spring-based web apps and deploying them to Tomcat. Almost immediately, I run into the need to customize the Tomcat's JVM settings with -XX:MaxPermSize (and -Xmx and -Xms); without this, the server easily runs out of PermGen space.
Why is this such an issue for Java VMs compared to other garbage collected languages? Comparing counts of "tune X memory usage" for X in Java, Ruby, Perl and Python, shows that Java has easily an order of magnitude more hits in Google than the other languages combined.
I'd also be interested in references to technical papers/blog-posts/etc explaining design choices behind JVM GC implementations, across different JVMs or compared to other interpreted language VMs (e.g. comparing Sun or IBM JVM to Parrot). Are there technical reasons why JVM users still have to deal with non-auto-tuning heap/permgen sizes?
The title of your question is misleading (not on purpose, I know): PermSize issues (and there are a lot of them; I was one of the first to diagnose a Tomcat/Sun PermGen issue years ago, when there wasn't any knowledge of the issue yet) are not specific to Java but specific to the Sun VM.
If you use a VM that doesn't use a permanent generation (like, say, an IBM VM, if I'm not mistaken) you cannot have PermGen issues.
So it is not a "Java" problem, but a Sun VM implementation problem.
Java gives you a bit more control about memory -- strike one for people wanting to apply that control there, vs Ruby, Perl, and Python, which give you less control on that. Java's typical implementation is also very memory hungry (because it has a more advanced garbage collection approach) wrt the typical implementations of the dynamic languages... but if you look at JRuby or Jython you'll find it's not a language issue (when these different languages use the same underlying VM, memory issues are pretty much equalized). I don't know of a widespread "Perl on JVM" implementation, but if there's one I'm willing to bet it wouldn't be measurably different in terms of footprint from JRuby or Jython!
Python/Perl/Ruby allocate their memory with malloc() or an optimization thereof. The limit to the heap space is determined by the operating system rather than the VM, so there's no need for options like -Xmxn. Also, the garbage collection is simpler, based mostly on reference counting. So there's a lot less to fine-tune.
Furthermore, dynamic languages tend to be implemented with bytecode interpreters rather than JIT compilers, so they aren't used for performance-critical code anyway.
The essence of @WizardOfOdds' and @Alex-Martelli's answers appears to be correct: Java has an advanced set of GC options, and sometimes you need to tune them. However, I'm still not entirely clear on why you might design a JVM with or without a permanent generation. I have found a bunch of useful links about garbage collection in Java, though not necessarily in comparison to other languages with GC. Briefly:
The Sun GC evolves very slowly due to the fact that it is deployed everywhere and people may rely on quirks in its implementation.
Sun has detailed white papers on GC design and options, such as Tuning Garbage Collection with the 5.0 Java[tm] Virtual Machine.
There is a new GC in the wings, called the G1 GC. Alex Miller has a good summary of relevant blog posts and a link to the technical paper. But it still has a permanent generation (and doesn't necessarily do a great job with it).
Jon Masamitsu has (had?) an interesting blog at Sun covering various details of garbage collection.
Happy to update this answer with more details if anyone has them.
This is because Tomcat is running in the Java Virtual Machine, while other languages are either compiled or interpreted and run against your actual machine. When you set -Xmx and -Xms you are saying that you want the JVM to run like a computer with an amount of RAM somewhere in the set range.
I think the reason so many people run in to this is that the default values are relatively low and people end up hitting the default ceiling pretty quickly (instead of waiting until you run out of actual ram as you would with other languages).
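To see which limits a particular JVM is actually running with, something like the following sketch can help (the class and method names are hypothetical; the `Runtime` and `java.lang.management` calls are standard library APIs). Which memory pools are reported depends on the VM and the collector in use, so the output will differ between, say, a Sun VM with a permanent generation and a VM without one.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.ArrayList;
import java.util.List;

public class HeapSettingsSketch {
    // Upper bound the JVM will attempt to use for the heap, in bytes.
    // This is the value that -Xmx influences.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Names of the memory pools this particular VM exposes.
    static List<String> memoryPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            names.add(pool.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("max heap (MB): " + maxHeapBytes() / mb);
        System.out.println("committed heap (MB): " + Runtime.getRuntime().totalMemory() / mb);
        System.out.println("memory pools: " + memoryPoolNames());
    }
}
```

Running it as, for example, `java -Xms64m -Xmx256m HeapSettingsSketch` and then again without the flags shows how low the defaults can be compared to the RAM actually installed.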

Why isn't more Java software compiled natively?

I realize the benefits of bytecode vs. native code (portability).
But say you always know that your code will run on a x86 architecture, why not then compile for x86 and get the performance benefit?
Note that I am assuming there is a performance gain to native code compilation. Some folks have answered that there could in fact be no gain, which is news to me.
Because the performance gain (if any) is not worth the trouble.
Also, garbage collection is very important for performance. Chances are that the GC of the JVM is better than the one embedded in the compiled executable, say with GCJ.
And just-in-time compilation can even result in better performance, because the JIT has more information available at run time to optimize the compilation than a static compiler has at compile time. See the Wikipedia page on JIT.
"Solaris" is an operating system, not a CPU architecture. The JVM installed on the actual machine will compile to the native CPU instructions. Solaris could be SPARC, x86, or x86-64 architecture.
Also, the JIT compiler can make processor-specific optimisations depending on which actual CPU family you have. For example, different instruction sequences are faster on Intel CPUs than on AMD CPUs, and a JIT compiler for your exact platform can take advantage of this information to produce highly optimised code.
The bytecode runs in a Java Virtual Machine that is compiled for (example) Solaris. It will be optimised like heck for that operating system.
In real-world cases, you often see equal or better performance from Java code at runtime, by virtue of building on the virtual machine's code for things like memory management - that code will have been evolving and maturing for years.
There are more benefits to building for the JVM than just portability - for example, every time a new JVM is released your compiled bytecode gets any optimisations, algorithmic improvements etc. that come from the best in the business. On the other hand, once you've compiled your C code, that's it.
Because with Just-In-Time compilation, there is trivial performance benefit.
Actually, many things JIT can actually do faster.
It will already be compiled by the JIT into Solaris native code once it runs. You won't get any additional benefit from compiling it before uploading it to the target site.
You may or may not get a performance benefit. But more likely you would get a performance penalty: JIT optimization is not possible with static compilation, so the performance would be only as good as the compiler can make it "blindfolded" (without actually profiling the program and optimizing it accordingly, which is what JIT compilers such as HotSpot do).
It's intuitively quite surprising how cheap (resource-wise) compiling is, and how much can be automatically optimized by just observing the running program. Black magic, but good for us :-)
All this talk of JITs is about seven years out of date BTW. The technology concerned now is called HotSpot and it isn't just a JIT.
"why not then compile for x86"
Because then you can't take advantage of the specific features of the particular cpu it gets run on. In particular, if we are to read "compile for x86" as "produce native code that can run on a 386 and its descendants", then the resulting code can't rely on even something as old as the mmx instructions.
As such, the end result is that you need to compile for every exact architecture it'll run on (what about those that do not exist yet?), and have the installer select which executable to put into place. Or, I hear the Intel C++ compiler will produce several versions of the same function, differing only in the CPU features used, and pick the right one at run time based on what the CPU reports as available.
On the other hand, you can view bytecode as a "half-compiled" source, similar to an intermediate format a native compiler will (unless asked) not actually write to disk. The runtime environment can then do the final compilation, knowing exactly what architecture will be used. This is the given reason why some C#/.net code could slightly outperform c++ code on some cpu-intensive tasks in some benchmarks a while ago.
The "final compilation" of bytecode can also make additional optimization assumptions that are (from a static compilation perspective) distinctly unsafe, and just recompile if those assumptions are later found to be wrong.
I guess because JIT (just in time) compilation is very advanced.

Smart JVM and JIT Micro-Optimizations

Over time, Sun's JVM and JIT have gotten pretty smart. Things that used to be common knowledge as being a necessary micro-optimization are no longer needed, because it gets taken care of for you.
For example, it used to be the case that you should mark all possible classes as final, so the JVM inlines as much code as possible. However now, the JIT knows whether your class is final based on what classes get loaded in at runtime, and if you load a class to make the original one non-final-able, it un-inlines the methods and un-marks it as final.
What other smart micro-optimizations does the JVM or JIT do for you?
EDIT: I made this a community wiki; I'd like to collect these over time.
It's beyond impressive. All of these are things you can't do in C++ (certainly to the same extent Java does). Keep in mind that early versions of Java started the "slow" reputation by not having these things, and we keep improving significantly over time. This is still a big research area.
Efficient interface dispatch.
Inlining and direct dispatch of virtual method calls.
Very fast object allocation with bump pointers (slide 19 or so) and escape analysis.
Oracle has a wiki on Performance techniques used in the Hotspot JVM.
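To illustrate why bump-pointer allocation is so cheap, here is a toy model (all names are hypothetical, and a byte array stands in for a chunk of heap). It is not how HotSpot is implemented internally; it only shows that allocating amounts to a bounds check plus an addition.

```java
public class BumpAllocatorSketch {
    private final byte[] region; // stands in for a chunk of heap owned by one thread
    private int top = 0;         // the "bump pointer": offset of the next free byte

    BumpAllocatorSketch(int sizeBytes) {
        region = new byte[sizeBytes];
    }

    // Returns the start offset of the newly allocated block, or -1 if the
    // region is exhausted (a real VM would collect garbage or fetch a new
    // region at that point).
    int allocate(int sizeBytes) {
        if (sizeBytes < 0 || sizeBytes > region.length - top) {
            return -1;
        }
        int start = top;
        top += sizeBytes; // the whole allocation is a check and an addition
        return start;
    }

    public static void main(String[] args) {
        BumpAllocatorSketch heap = new BumpAllocatorSketch(64);
        System.out.println(heap.allocate(16));
        System.out.println(heap.allocate(32));
        System.out.println(heap.allocate(32)); // does not fit in the remaining 16 bytes
    }
}
```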
Java is smarter at inlining as it can
inline code only available at runtime or even dynamically generated.
inline virtual methods (up to two at once)
perform escape analysis on inlined methods and the methods they were inlined to. (Much harder to do in C++)
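A small sketch of the kind of code where inlining and escape analysis work together (class and method names are hypothetical). The temporary `Vec` objects never leave `distance`, so once the small methods are inlined a JIT with escape analysis may avoid heap-allocating them; whether it actually does depends on the VM and its settings.

```java
public class EscapeAnalysisSketch {
    static final class Vec {
        final double x;
        final double y;

        Vec(double x, double y) {
            this.x = x;
            this.y = y;
        }

        Vec minus(Vec other) {
            return new Vec(x - other.x, y - other.y);
        }

        double length() {
            return Math.sqrt(x * x + y * y);
        }
    }

    // Three short-lived Vec objects are created here and none of them
    // is stored anywhere or returned, so none of them "escapes".
    static double distance(double x1, double y1, double x2, double y2) {
        Vec a = new Vec(x1, y1);
        Vec b = new Vec(x2, y2);
        return b.minus(a).length();
    }

    public static void main(String[] args) {
        double total = 0;
        // Call often enough for the method to become hot.
        for (int i = 0; i < 1_000_000; i++) {
            total += distance(0, 0, 3, 4);
        }
        System.out.println(total);
    }
}
```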

Is it possible to write a decent java optimizer if information is lost in the translation to bytecode?

It occurred to me that when you write a C program, the compiler knows the source and destination platform (for lack of a better term) and can optimize to the machine it is building code for.
But in java the best the compiler can do is optimize to the bytecode, which may be great, but there's still a layer in the jvm that has to interpret the bytecode, and the farther the bytecode is away translation-wise from the final machine architecture, the more work has to be done to make it go.
It seems to me that a bytecode optimizer wouldn't be nearly as good because it has lost all the semantic information available from the original source code (which may already have been butchered by the java compiler's optimizer.)
So is it even possible to ever approach the efficiency of C with a java compiler?
Actually, a byte-code JIT compiler can exceed the performance of statically compiled languages in many instances because it can evaluate the byte code in real time and in the actual execution context. So the apps performance increases as it continues to run.
What Kevin said. Also, the bytecode optimizer (JIT) can also take advantage of runtime information to perform better optimizations. For instance, it knows what code is executing more (Hot-spots) so it doesn't spend time optimizing code that rarely executes. It can do most of the stuff that profile-guided optimization gives you (branch prediction, etc), but on-the-fly for whatever the target processor is. This is why the JVM usually needs to "warm up" before it reaches best performance.
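The warm-up effect can be observed with a rough sketch like the one below (names are hypothetical, and this is not a rigorous benchmark). On a typical HotSpot VM the first rounds tend to be slower than later ones, but the numbers vary by machine and VM, so no particular output should be expected. Running with the HotSpot flag `-XX:+PrintCompilation` shows when methods get compiled.

```java
public class WarmUpSketch {
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Runs one batch of calls and returns the elapsed time in nanoseconds.
    static long timeBatch(int calls, int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int c = 0; c < calls; c++) {
            sink += sumOfSquares(n);
        }
        long elapsed = System.nanoTime() - start;
        // Use the result so the work cannot be discarded as dead code.
        if (sink == 42) {
            System.out.println("unlikely");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        for (int round = 1; round <= 5; round++) {
            long micros = timeBatch(20_000, 1_000) / 1_000;
            System.out.println("round " + round + ": " + micros + " us");
        }
    }
}
```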
In theory both optimizers should behave 'identically' as it is standard practice for c/c++ compilers to perform the optimization on the generated assembly and not the source code so you've already lost any semantic information.
If you read the byte code, you may see that the compiler doesn't optimise the code very well. However the JIT can optimise the code so this really doesn't matter.
Say you compile the code on an x86 machine and a new architecture comes along, let's call it x64. The same Java binary can take advantage of the new features of that architecture even though it might not have existed when the code was compiled. It means you can take old distributions of libraries and take advantage of the latest hardware-specific optimisations. You cannot do this with C/C++.
Java can optimise inline calls for virtual methods. Say you have a virtual method with many different possible implementations. However, say one or two implementations are called most of the time in reality. The JIT can detect this and inline up to two method implementations but still behave correctly if you happen to call another implementation. You cannot do this with C/C++
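As an illustration of such a call site, consider this sketch (all names are hypothetical). The call `s.area()` mostly sees two implementations, which is the situation where a JIT could inline both behind a type check, while the rarely used third implementation must still produce the right answer.

```java
public class BimorphicCallSketch {
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Rect implements Shape {
        final double w;
        final double h;
        Rect(double w, double h) { this.w = w; this.h = h; }
        public double area() { return w * h; }
    }

    // Rarely used third implementation: the program must stay correct
    // even if the JIT specialised the call site for the other two.
    static final class Triangle implements Shape {
        final double base;
        final double height;
        Triangle(double base, double height) { this.base = base; this.height = height; }
        public double area() { return 0.5 * base * height; }
    }

    // s.area() is a virtual call; which method runs depends on the runtime type.
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] common = { new Square(2), new Rect(2, 3), new Square(1) };
        double sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += totalArea(common); // only two receiver types seen here
        }
        System.out.println(sum);
        // A third type shows up later and still works.
        System.out.println(totalArea(new Shape[] { new Triangle(4, 5) }));
    }
}
```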
Java 7 supports escape analysis for locked/synchronised objects: it can detect that an object is only used in a local context and drop synchronization for that object.
In the current versions of Java, it can detect if two consecutive methods lock the same object and keep the lock between them (rather than release and re-acquire the lock)
You cannot do this with C/C++ because there isn't a language level understanding of locking.
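The two locking cases above might look like this in source form (a sketch with hypothetical names; `StringBuffer` is used because its methods are synchronized). Whether a given VM actually elides or merges the locks is up to the JIT; the code behaves the same either way.

```java
public class LockSketch {
    // The buffer is created here and never leaves the method, so no other
    // thread can ever see it. A JIT with escape analysis may therefore drop
    // the locking done inside each append call.
    static String localBuffer(String a, String b) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    // Here the buffer comes from outside and may be shared. The two
    // consecutive synchronized calls lock the same object, so a JIT may
    // keep the lock held across both instead of releasing and re-acquiring it.
    static String appendBoth(StringBuffer shared, String a, String b) {
        shared.append(a);
        shared.append(b);
        return shared.toString();
    }

    public static void main(String[] args) {
        System.out.println(localBuffer("foo", "bar"));
        System.out.println(appendBoth(new StringBuffer("x"), "y", "z"));
    }
}
```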

Virtual Machine Optimization

I am messing around with a toy interpreter in Java and I was considering trying to write a simple compiler that can generate bytecode for the Java Virtual Machine. Which got me thinking, how much optimization needs to be done by compilers that target virtual machines such as JVM and CLI?
Do Just In Time (JIT) compilers do constant folding, peephole optimizations etc?
I'm just gonna add two links which explain Java's bytecode pretty well and some of the various optimization of the JVM during runtime.
Optimisation is what makes JVMs viable as environments for long-running applications; you can bet that Sun, IBM and friends are doing their best to ensure they can optimise your bytecode and JIT-compiled code in as efficient a manner as possible.
With that being said, if you think you can pre-optimise your bytecode then it probably won't do much harm.
It is worth being aware, however, that JVMs can tend towards performing better (and not crashing) when presented with just the sort of bytecode the Java compiler tends to construct. It is not unknown for optimisations to be missed or even for the JVM to crash when permutations of bytecode occur that are correct but unlike what would be produced by javac. Hopefully that sort of thing is more in the past now, but may be something to be aware of.
Optimising bytecode is probably an oxymoron in most cases
I don't think that's true. Optimizations like hoisting loop invariants and propagating constants can never hurt, even if the JVM is smart enough to do them on its own, by simple virtue of making the code do less work.
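For instance, hoisting a loop invariant by hand looks like this at the source level (a sketch with hypothetical names; a compiler targeting the JVM could emit bytecode corresponding to the second form). Both versions perform the same floating-point operations in the same order, so they return the same value.

```java
public class HoistingSketch {
    // Naive form: the factor does not depend on i, yet it is
    // recomputed on every iteration.
    static double scaleNaive(double[] values, double a, double b) {
        double sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * Math.sqrt(a * a + b * b);
        }
        return sum;
    }

    // Hoisted form: the loop-invariant factor is computed once, up front.
    static double scaleHoisted(double[] values, double a, double b) {
        double factor = Math.sqrt(a * a + b * b);
        double sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * factor;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] values = { 1, 2, 3 };
        System.out.println(scaleNaive(values, 3, 4));
        System.out.println(scaleHoisted(values, 3, 4));
    }
}
```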
Obfuscators such as ProGuard will perform many static optimisations on your bytecode for you.
The HotSpot compiler will optimize your code at runtime better than is possible at compile-time - it has more information to work with, after all. The only time you should be optimizing the bytecode instead of just your algorithm is when you are targeting mobile devices, such as the Blackberry, where the JVM for that platform is not powerful enough to optimize code at runtime and just executes the bytecode.
Optimising bytecode is probably an oxymoron in most cases. Unless you control the VM, you have no idea what it does to speed up code execution, if anything. The compiler would need to know the details of the VM in order to generate optimised code.
Note to Aseraphim:
It can also be useful to optimise bytecode for non-embedded applications in some limited cases:
When delivering code over the wire, eg for WebStart apps, to minimise deliverable/cache size and because you don't necessarily know the capability/speed of the client.
For code that you know is performance critical and used at start-up before (say) HotSpot has had time to gather any stats.
Again, the transformations that a good optimiser/obfuscator performs can be very helpful.
