Is there any performance to be gained these days from compiling Java to native code, or do modern HotSpot compilers end up doing this over time anyway?
There was a similar discussion here recently, for the question What are advantages of bytecode over native code?. You can find interesting answers in that thread.
Some more anecdotal evidence. I've worked on a few performance-critical real-time trading financial applications. I agree with Frank: nearly every time the problem is not the lack of ahead-of-time compilation, it is your algorithm or data structure. Modern HotSpot compilers are very good with the right code; for example, the CERN Colt library is within 90% of compiled, optimised Fortran for numerical work.
If you are worried about speed, I'd really recommend using a good profiler to get evidence of where your bottlenecks are - I use YourKit and have been very pleased.
We have only resorted to native compiled code for speed in one instance in the last few years, and that was so we could use CUDA and get some serious GPU performance.
Your question is a little broad; the answer varies a lot depending on:
Whether you are using Just-In-Time (JIT) compilation or not
Whether your process runs for a long time or not
All recent JVMs use a JIT, but on old JVMs Java code was several times slower than native code.
If you have a server that runs for a long period of time, or a batch job that executes the same code again and again, the difference ends up being very small.
We wrote the same batch job in both C++ and Java and ran it against different datasets; the results differed by about 3 seconds, on datasets that took from 5 minutes to several hours to process.
But be careful: there are special cases where there will be an important difference, for example batch jobs that need a lot of memory.
Memory performance or CPU performance? Or are they the same these days?
My only evidence is anecdotal and on a different platform: after porting a bunch of CPU-hungry apps to C# (.NET 2.0), I did not notice substantial loss in performance (I do not consider 10% substantial). Well written code seems to perform well on a variety of architectures.
Most apps spend/waste time with:
IO operations that will not benefit from static (compile-time) analysis.
Bad Algorithms that will not benefit from static analysis.
Bad Memory layouts in critical CPU inner loops. While it is technically possible that compilers help us here, I have yet to see a real compiler do anything interesting.
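To make the memory-layout point concrete, here is a small sketch using nothing beyond the JDK (the LayoutDemo class and its field names are purely illustrative): a primitive double[] keeps the hot data contiguous and cache-friendly, while an array of small objects adds a pointer dereference per element, and neither a static Java compiler nor HotSpot will restructure that for you.

    // Illustrative only: compares a cache-friendly flat layout against a
    // pointer-chasing object layout in a hot inner loop.
    public class LayoutDemo {
        static final class Point {
            double x;
            Point(double x) { this.x = x; }
        }

        static double sumBoxed(Point[] points) {
            double total = 0;
            for (Point p : points) {
                total += p.x;          // one pointer dereference per element
            }
            return total;
        }

        static double sumFlat(double[] xs) {
            double total = 0;
            for (double x : xs) {
                total += x;            // sequential, prefetch-friendly access
            }
            return total;
        }
    }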
So based upon my experience, unless you are writing a video codec, there is no benefit to compiling Java apps vs. just relying upon the hotspot compilers.
Tried Hello-World with six different implementations just to check the startup overhead, and the difference was staggering. Java was off the charts while the compiled languages did equally well. I could provide all the evidence (in a reproducible form) if needed.
Related
Is there a way to achieve JIT performance while removing JIT overhead? Preferably by compiling a class file to a native image.
I have investigated GCJ, but even for a simple program, GCJ output's performance is much worse than Java JIT.
You could try Excelsior.
http://www.excelsior-usa.com/jet.html
I've had good experiences with this in the past (but it was a long time ago)
There have been in the past a number of "static" compilers for Java, but I don't know that any are currently available. To the best of my knowledge the last one in use was the "Java Transformer" for the IBM iSeries "Classic JVM", but that JVM was deprecated in favor of the J9 JVM.
The "Java Transformer" did quite well, but, as others have noted, it could not take advantage of all of the info that a JITC has available at runtime (though it did manage to take advantage of some of the runtime info).
(And it should be noted that "JITC overhead" is really minimal. Compilation occurs pretty quickly and efficiently in most cases. The problem is that compilation doesn't even start until the interpreter has run long enough to collect statistics and trigger the JITC.)
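If you want to watch this behaviour for yourself, a minimal sketch (assuming only a standard HotSpot JVM; the class and method names are made up) is to run a hot loop with -XX:+PrintCompilation and note that the compilation message for the hot method appears only after many interpreted invocations:

    // Run with: java -XX:+PrintCompilation JitTriggerDemo
    // The line for hotMethod shows up in the compilation log only after the
    // interpreter has executed it enough times to trigger the JIT.
    public class JitTriggerDemo {
        static long hotMethod(long x) {
            return x * 31 + 7;
        }

        public static void main(String[] args) {
            long acc = 0;
            for (int i = 0; i < 1_000_000; i++) {
                acc = hotMethod(acc + i);   // hot loop: eventually JIT-compiled
            }
            System.out.println(acc);        // keep the result live
        }
    }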
The simplest solution is often to warmup your code on startup. If you have a server based application, the cost of startup isn't as important as the cost when the service is used. In this situation you can warmup all the critical code by calling it 10K - 20K times which triggers all that code to compile.
This can take less than a second in simple cases so has very little impact on startup and means you are using compiled code when the service is used.
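A minimal sketch of such a warmup, where PriceCalculator is a hypothetical stand-in for your own critical code path and 20,000 iterations comfortably exceed HotSpot's default compilation threshold:

    public class Warmup {
        interface PriceCalculator {                    // stand-in for your real service code
            double calculate(int instrument, double quantity);
        }

        static void warmUp(PriceCalculator calculator) {
            double sink = 0;
            for (int i = 0; i < 20_000; i++) {
                sink += calculator.calculate(i % 100, i * 0.5);  // exercise the hot path
            }
            // Use the result so the JIT cannot treat the loop as dead code.
            if (sink == Double.MIN_VALUE) {
                System.out.println(sink);
            }
        }
    }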
If you have a client based application you usually have a lot of processing power for just one user in which case the cost of the background JIT is less important.
The moral of the story is: try to check that you have a problem to solve before diving into a solution. Very often questions on Stack Overflow are about problems which either a) have already been solved or b) are not a significant problem in the first place.
Measuring the extent of your problem or performance is the best guide as to what matters and what doesn't. If you don't measure, you are just guessing. (Even if you have ten-plus years' experience performance tuning Java systems.)
I have just found my answer here:
Why is Java faster when using a JIT vs. compiling to machine code?
Quote from top answer:
This means that you cannot write an AOT compiler which covers ALL Java programs, as there is information available only at runtime about the characteristics of the program.
I'd recommend you to find the root cause of inferior performance of your Java code before trying out AOT compilation or rewriting any portions in C++.
Head over to http://www.javaperformancetuning.com/ for tons of information and links.
To see if I can really get any benefit from native code (written in C) by using JNI (instead of writing the complete application in Java), I want to measure the overhead of calling through JNI. What is the best way to measure this overhead?
I wouldn't use a profiler to do quantitative performance testing. Profiling tends to introduce distortions into the actual timing numbers.
I'd create a benchmark that performed one of the actual calculations that you are considering doing in C and compare the C + JNI + Java version against a pure Java version. Be sure that you are comparing apples and apples; i.e. profile and optimize both versions before you benchmark them.
To do the actual benchmarking, I'd construct a loop that performed the calculation a large number of times, record the timings over a large number of iterations and compare. Make sure that you take account of JVM warmup effects; e.g. class loading, JIT compilation and heap warmup.
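A rough hand-rolled sketch of such a loop is below; a dedicated harness (Caliper, or later JMH) is more rigorous, and the two Runnables are placeholders for the pure-Java and C+JNI versions of the calculation being compared.

    public class NativeVsJavaBench {
        static long averageNanos(Runnable work, int warmupIters, int measuredIters) {
            for (int i = 0; i < warmupIters; i++) {
                work.run();                            // trigger class loading and JIT compilation
            }
            long start = System.nanoTime();
            for (int i = 0; i < measuredIters; i++) {
                work.run();
            }
            return (System.nanoTime() - start) / measuredIters;  // average nanoseconds per call
        }

        public static void main(String[] args) {
            Runnable javaVersion = () -> { /* pure Java calculation goes here */ };
            Runnable jniVersion  = () -> { /* JNI-backed calculation goes here */ };
            System.out.println("Java: " + averageNanos(javaVersion, 100_000, 1_000_000) + " ns/call");
            System.out.println("JNI:  " + averageNanos(jniVersion, 100_000, 1_000_000) + " ns/call");
        }
    }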
Like Thihara, I doubt that using C + JNI will help much. And even if it does, you need to take account of the downsides of JNI; e.g. C code portability, platform specific build issues ... and possible JVM hard crashes if your native code has bugs.
Measuring the overhead alone may give you strange results. I'd code a small part of the performance-critical code in both Java and C++ and measure the program performance, e.g. using Caliper (microbenchmarking is quite a complicated thing and hardly anybody gets it right).
I would not use any profiler, especially not a C++ profiler, since the performance of the C++ part alone doesn't matter and since profilers may distort the results.
Use a C++ profiler and a Java profiler. They are available in IDEs for Java; I can only assume the same is true for C++. And whatever test you design, please run it through a substantial number of loops to minimize environmental errors.
Oh, and do post the results back, since I'm also curious to see whether there are any improvements in using native code over modern JVMs. Chances are, though, you won't see a huge performance improvement in native code.
For university, I perform bytecode modifications and analyze their influence on the performance of Java programs. Therefore, I need Java programs (ideally ones used in production) and appropriate benchmarks. For instance, I already have HyperSQL and measure its performance with the benchmark program PolePosition. The Java programs will run on a JVM without a JIT compiler. Thanks for your help!
P.S.: I cannot use programs to benchmark the performance of the JVM or of the Java language itself (such as Wide Finder).
Brent Boyer wrote a nice article series for IBM developerWorks, Robust Java benchmarking, which is accompanied by a micro-benchmarking framework based on a sound statistical approach. See the article and the Resources page.
Since you are doing this for university, you might be interested in Andy Georges, Dries Buytaert, Lieven Eeckhout: Statistically rigorous Java performance evaluation, OOPSLA 2007.
Caliper is a tool provided by Google for micro-benchmarking. It will provide you with graphs and everything. The folks who put this tool together are very familiar with the principle of "Premature Optimization is the root of all evil," (to jwnting's point) and are very careful in explaining the role of benchmarking.
Any experienced programmer will tell you that premature optimisation is worse than no optimisation.
It's a waste of resources at best, and a source of infinite future (and current) problems at worst.
Without context, any application, even with benchmark logs, will tell you nothing.
I may have a loop in there that takes 10 hours to complete; the benchmark will show it taking almost forever, but I don't care because it's not performance critical.
Another loop takes only a millisecond but that may be too long because it causes me to fail to catch incoming data packets arriving at 100 microsecond intervals.
Two extremes, but both can happen (even in the same application), and you'd never know unless you knew that application, how it is used, what it does, under which conditions and requirements.
If a user interface takes 1/2 second to render it may be too long or no problem, what's the context? What are the user expectations?
I realize the benefits of bytecode vs. native code (portability).
But say you always know that your code will run on a x86 architecture, why not then compile for x86 and get the performance benefit?
Note that I am assuming there is a performance gain to native code compilation. Some folks have answered that there could in fact be no gain, which is news to me.
Because the performance gain (if any) is not worth the trouble.
Also, garbage collection is very important for performance. Chances are that the GC of the JVM is better than the one embedded in the compiled executable, say with GCJ.
And just-in-time compilation can even result in better performance, because the JIT has more information available at run-time to optimize the compilation than the compiler has at compile-time. See the Wikipedia page on JIT.
"Solaris" is an operating system, not a CPU architecture. The JVM installed on the actual machine will compile to the native CPU instructions. Solaris could be SPARC, x86, or x86-64 architecture.
Also, the JIT compiler can make processor-specific optimisations depending on which actual CPU family you have. For example, different instruction sequences are faster on Intel CPUs than on AMD CPUs, and a JIT compiler for your exact platform can take advantage of this information to produce highly optimised code.
The bytecode runs in a Java Virtual Machine that is compiled for (example) Solaris. It will be optimised like heck for that operating system.
In real-world cases, you often see equal or better performance from Java code at runtime, by virtue of building on the virtual machine's code for things like memory management - that code will have been evolving and maturing for years.
There's more benefits to building for the JVM than just portability - for example, every time a new JVM is released your compiled bytecode gets any optimisations, algorithmic improvements etc. that come from the best in the business. On the other hand, once you've compiled your C code, that's it.
Because with Just-In-Time compilation, any remaining performance benefit is trivial.
Actually, there are many things a JIT can do faster.
It will already be compiled by the JIT into native Solaris code as it runs. You won't gain any additional benefit by compiling it before deploying to the target site.
You may or may not get a performance benefit. But more likely you would get a performance penalty: JIT optimization is not possible with static compilation, so the performance would only be as good as the compiler can make it "blindfolded" (without actually profiling the program and optimizing it accordingly, which is what JIT compilers such as HotSpot do).
It's intuitively quite surprising how cheap (resource-wise) compiling is, and how much can be automatically optimized by just observing the running program. Black magic, but good for us :-)
All this talk of JITs is about seven years out of date BTW. The technology concerned now is called HotSpot and it isn't just a JIT.
"why not then compile for x86"
Because then you can't take advantage of the specific features of the particular CPU it gets run on. In particular, if we are to read "compile for x86" as "produce native code that can run on a 386 and its descendants", then the resulting code can't rely on even something as old as the MMX instructions.
As such, the end result is that you need to compile for every exact architecture it'll run on (and what about those that don't exist yet?), and have the installer select which executable to put into place. Or, I hear the Intel C++ compiler will produce several versions of the same function, differing only in the CPU features used, and pick the right one at run-time based on what the CPU reports as available.
On the other hand, you can view bytecode as a "half-compiled" source, similar to an intermediate format a native compiler will (unless asked) not actually write to disk. The runtime environment can then do the final compilation, knowing exactly what architecture will be used. This is the given reason why some C#/.NET code could slightly outperform C++ code on some CPU-intensive tasks in some benchmarks a while ago.
The "final compilation" of bytecode can also make additional optimization assumptions that are (from a static compilation perspective) distinctly unsafe, and simply recompile if those assumptions are later found to be wrong.
I guess because JIT (just in time) compilation is very advanced.
My professor did an informal benchmark on a little program and the Java times were: 1.7 seconds for the first run, and 0.8 seconds for the runs thereafter.
Is this due entirely to the loading of the runtime environment into the operating environment?
OR
Is it influenced by Java's optimizing the code and storing the results of those optimizations (sorry, I don't know the technical term for that)?
Okay, I found where I read that. This is all from "Learning Java" (O'Reilly 2005):
The problem with a traditional JIT compilation is that optimizing code takes time. So a JIT compiler can produce decent results but may suffer a significant latency when the application starts up. This is generally not a problem for long-running server-side applications but is a serious problem for client-side software and applications run on smaller devices with limited capabilities. To address this, Sun's compiler technology, called HotSpot, uses a trick called adaptive compilation. If you look at what programs actually spend their time doing, it turns out that they spend almost all their time executing a relatively small part of the code again and again. The chunk of code that is executed repeatedly may be only a small fraction of the total program, but its behavior determines the program's overall performance. Adaptive compilation also allows the Java runtime to take advantage of new kinds of optimizations that simply can't be done in a statically compiled language, hence the claim that Java code can run faster than C/C++ in some cases.
To take advantage of this fact, HotSpot starts out as a normal Java bytecode interpreter, but with a difference: it measures (profiles) the code as it is executing to see what parts are being executed repeatedly. Once it knows which parts of the code are crucial to performance, HotSpot compiles those sections into optimal native machine code. Since it compiles only a small portion of the program into machine code, it can afford to take the time necessary to optimize those portions. The rest of the program may not need to be compiled at all—just interpreted—saving memory and time. In fact, Sun's default Java VM can run in one of two modes: client and server, which tell it whether to emphasize quick startup time and memory conservation or flat out performance.
A natural question to ask at this point is, Why throw away all this good profiling information each time an application shuts down? Well, Sun has partially broached this topic with the release of Java 5.0 through the use of shared, read-only classes that are stored persistently in an optimized form. This significantly reduces both the startup time and overhead of running many Java applications on a given machine. The technology for doing this is complex, but the idea is simple: optimize the parts of the program that need to go fast, and don't worry about the rest.
I'm kind of wondering how far Sun has gotten with it since Java 5.0.
I'm not aware of any virtual machine in widespread use that saves statistical usage data between program invocations -- but it certainly is an interesting possibility for future research.
What you're seeing is almost certainly due to disk caching.
I agree that it's likely the result of disk caching.
FYI, the IBM Java 6 VM does contain an ahead-of-time compiler (AOT). The code isn't quite as optimized as what the JIT would produce, but it is stored across VMs, I believe in some sort of persistent shared memory. Its primary benefit is to improve startup performance. The IBM VM by default JITs a method after it's been called 1000 times. If it knows that a method is going to be called 1000 times just during the VM startup (think a commonly-used method like java.lang.String.equals(...) ), then it's beneficial for it to store that in the AOT cache so that it never has to waste time compiling at runtime.
I agree that the performance difference seen by the poster is most likely caused by disk latency bringing the JRE into memory. The Just In Time compiler (JIT) would not have an impact on performance of a little application.
Java 1.6u10 (http://download.java.net/jdk6/) touches the runtime JARs in a background process (even if Java isn't running) to keep the data in the disk cache. This significantly decreases startup times (Which is a huge benefit to desktop apps, but probably of marginal value to server side apps).
On large, long running applications, the JIT makes a big difference over time - but the amount of time required for the JIT to accumulate sufficient statistics to kick in and optimize (5-10 seconds) is very, very short compared to the overall life of the application (most run for months and months). While storing and restoring the JIT results is an interesting academic exercise, the practical improvement is not very large (Which is why the JIT team has been more focused on things like GC strategies for minimizing memory cache misses, etc...).
The pre-compilation of the runtime classes does help desktop applications quite a bit (as does the aforementioned 6u10 disk cache pre-loading).
You should describe how your Benchmark was done. Especially at which point you start to measure the time.
If you include the JVM startup time (which is useful for benchmarking the user experience but not so useful for optimizing Java code), then it might be a filesystem caching effect, or it might be caused by a feature called "Java Class Data Sharing":
For Sun:
http://java.sun.com/j2se/1.5.0/docs/guide/vm/class-data-sharing.html
This is an option where the JVM saves a prepared image of the runtime classes to a file, to allow quicker loading (and sharing) of those classes at the next start. You can control this with -Xshare:on or -Xshare:off with a Sun JVM. The default is -Xshare:auto, which will load the shared classes image if present, and if not present it will write it at first startup if the directory is writable.
With IBM Java 5 this is BTW even more powerful:
http://www.ibm.com/developerworks/java/library/j-ibmjava4/
I don't know of any mainstream JVM which is saving JIT statistics.
The JVM (the exact behavior may vary between JVM implementations) initially interprets the bytecode. Once it detects that a piece of code is being executed enough times, it JIT-compiles it to native machine code so it runs faster.