Is there a way to achieve JIT performance without JIT overhead?

Is there a way to achieve JIT performance without JIT overhead? - java

Is there a way to achieve JIT performance while removing JIT overhead? Preferably by compiling a class file to an native image.
I have investigated GCJ, but even for a simple program, GCJ output's performance is much worse than Java JIT.

You could try Excelsior.
http://www.excelsior-usa.com/jet.html
I've had good experiences with this in the past (but it was a long time ago)

There have been in the past a number of "static" compilers for Java, but I don't know that any are currently available. To the best of my knowledge the last one in use was the "Java Transformer" for the IBM iSeries "Classic JVM", but that JVM was deprecated in favor of the J9 JVM.
The "Java Transformer" did quite well, but, as others have noted, it could not take advantage of all of the info that a JITC has available at runtime (though it did manage to take advantage of some of the runtime info).
(And it should be noted that "JITC overhead" is really minimal. Compilation occurs pretty quickly and efficiently in most cases. The problem is that compilation doesn't even start until the interpreter has run long enough to collect statistics and trigger the JITC.)

The simplest solution is often to warmup your code on startup. If you have a server based application, the cost of startup isn't as important as the cost when the service is used. In this situation you can warmup all the critical code by calling it 10K - 20K times which triggers all that code to compile.
This can take less than a second in simple cases so has very little impact on startup and means you are using compiled code when the service is used.
If you have a client based application you usually have a lot of processing power for just one user in which case the cost of the background JIT is less important.
The moral of the story is; try to check you have a problem to solve before diving into a solution. Very often questions on stack over flow are about problems which have either a) already been solved or b) are not a significant problem in the first place.
Measuring the extent of your problem or performance is the best guide as to what matters and what doesn't. If you don't measure, you are just guessing. (Even if you have ten+ year experience performance tuning Java systems)

I have just found my answer here:
Why is Java faster when using a JIT vs. compiling to machine code?
Quote from top answer:
This means that you cannot write a AOT compiler which covers ALL Java
programs as there is information available only at runtime about the
characteristics of the program.

I'd recommend you to find the root cause of inferior performance of your Java code before trying out AOT compilation or rewriting any portions in C++.
Head over to http://www.javaperformancetuning.com/ for tons of information and links.

Related

Why would I use AOT compilation in Java?

I recently started reading about Java compilers. So far my understanding is, that optimization comes from techniques like tired compilation or code profiling. Now I read that Java 9 respectively Java 10 (Windows) provides the option of AOT compilation. Now I wonder: what use case would justify the use of AOT compilation?

To have better startup performance, like simple desktop app, it would be annoying for user to wait for it to load and then it still will be pretty slow until JIT kicks in. So then you can use AOT to already provide optimized code - it might not be as good as JIT, but will be much faster at startup.
Also some applications are used only for few seconds or even less - JIT will never have a chance to kick in. Like simple command line app that just sends single request and closes. Each function will be probably only executed once - so there is no reason to use JIT at all.
Also it might help to decrease binary size or allow for creation of very simple and small standalone binary. Same for memory usage - as JIT needs some memory to work.

Benchmarking Java programs

For university, I perform bytecode modifications and analyze their influence on performance of Java programs. Therefore, I need Java programs---in best case used in production---and appropriate benchmarks. For instance, I already got HyperSQL and measure its performance by the benchmark program PolePosition. The Java programs running on a JVM without JIT compiler. Thanks for your help!
P.S.: I cannot use programs to benchmark the performance of the JVM or of the Java language itself (such as Wide Finder).

Brent Boyer, wrote a nice article series for IBM developer works: Robust Java benchmarking, which is accompanied by a micro-benchmarking framework which is based on a sound statistical approach. Article and the Resources Page.
Since, you do that for university, you might be interested in Andy Georges, Dries Buytaert, Lieven Eeckhout: Statistically rigorous java performance evaluation in OOPSLA 2007.

Caliper is a tool provided by Google for micro-benchmarking. It will provide you with graphs and everything. The folks who put this tool together are very familiar with the principle of "Premature Optimization is the root of all evil," (to jwnting's point) and are very careful in explaining the role of benchmarking.

Any experienced programmer will tell you that premature optimisation is worse than no optimisation.
It's a waste of resources at best, and a source of infinite future (and current) problems at worst.
Without context, any application, even with benchmark logs, will tell you nothing.
I may have a loop in there that takes 10 hours to complete, the benchmark will show it taking almost forever, but I don't care because it's not performance critical.
Another loop takes only a millisecond but that may be too long because it causes me to fail to catch incoming data packets arriving at 100 microsecond intervals.
Two extremes, but both can happen (even in the same application), and you'd never know unless you knew that application, how it is used, what it does, under which conditions and requirements.
If a user interface takes 1/2 second to render it may be too long or no problem, what's the context? What are the user expectations?

Does Java save its runtime optimizations?

My professor did an informal benchmark on a little program and the Java times were: 1.7 seconds for the first run, and 0.8 seconds for the runs thereafter.
Is this due entirely to the loading of the runtime environment into the operating environment ?
OR
Is it influenced by Java's optimizing the code and storing the results of those optimizations (sorry, I don't know the technical term for that)?

Okay, I found where I read that. This is all from "Learning Java" (O'Reilly 2005):
The problem with a traditional JIT compilation is that optimizing code takes time. So a JIT compiler can produce decent results but may suffer a significant latency when the application starts up. This is generally not a problem for long-running server-side applications but is a serious problem for client-side software and applications run on smaller devices with limited capabilities. To address this, Sun's compiler technology, called HotSpot, uses a trick called adaptive compilation. If you look at what programs actually spend their time doing, it turns out that they spend almost all their time executing a relatively small part of the code again and again. The chunk of code that is executed repeatedly may be only a small fraction of the total program, but its behavior determines the program's overall performance. Adaptive compilation also allows the Java runtime to take advantage of new kinds of optimizations that simply can't be done in a statically compiled language, hence the claim that Java code can run faster than C/C++ in some cases.
To take advantage of this fact, HotSpot starts out as a normal Java bytecode interpreter, but with a difference: it measures (profiles) the code as it is executing to see what parts are being executed repeatedly. Once it knows which parts of the code are crucial to performance, HotSpot compiles those sections into optimal native machine code. Since it compiles only a small portion of the program into machine code, it can afford to take the time necessary to optimize those portions. The rest of the program may not need to be compiled at all—just interpreted—saving memory and time. In fact, Sun's default Java VM can run in one of two modes: client and server, which tell it whether to emphasize quick startup time and memory conservation or flat out performance.
A natural question to ask at this point is, Why throw away all this good profiling information each time an application shuts down? Well, Sun has partially broached this topic with the release of Java 5.0 through the use of shared, read-only classes that are stored persistently in an optimized form. This significantly reduces both the startup time and overhead of running many Java applications on a given machine. The technology for doing this is complex, but the idea is simple: optimize the parts of the program that need to go fast, and don't worry about the rest.
I'm kind of wondering how far Sun has gotten with it since Java 5.0.

I'm not aware of any virtual machine in widespread use that saves statistical usage data between program invocations -- but it certainly is an interesting possibility for future research.
What you're seeing is almost certainly due to disk caching.

I agree that it's likely the result of disk caching.
FYI, the IBM Java 6 VM does contain an ahead-of-time compiler (AOT). The code isn't quite as optimized as what the JIT would produce, but it is stored across VMs, I believe in some sort of persistent shared memory. Its primary benefit is to improve startup performance. The IBM VM by default JITs a method after it's been called 1000 times. If it knows that a method is going to be called 1000 times just during the VM startup (think a commonly-used method like java.lang.String.equals(...) ), then it's beneficial for it to store that in the AOT cache so that it never has to waste time compiling at runtime.

I agree that the performance difference seen by the poster is most likely caused by disk latency bringing the JRE into memory. The Just In Time compiler (JIT) would not have an impact on performance of a little application.
Java 1.6u10 (http://download.java.net/jdk6/) touches the runtime JARs in a background process (even if Java isn't running) to keep the data in the disk cache. This significantly decreases startup times (Which is a huge benefit to desktop apps, but probably of marginal value to server side apps).
On large, long running applications, the JIT makes a big difference over time - but the amount of time required for the JIT to accumulate sufficient statistics to kick in and optimize (5-10 seconds) is very, very short compared to the overall life of the application (most run for months and months). While storing and restoring the JIT results is an interesting academic exercise, the practical improvement is not very large (Which is why the JIT team has been more focused on things like GC strategies for minimizing memory cache misses, etc...).
The pre-compilation of the runtime classes does help desktop applications quite a bit (as does the aforementioned 6u10 disk cache pre-loading).

You should describe how your Benchmark was done. Especially at which point you start to measure the time.
If you include the JVM startup time (which is useful for Benchmarking the User experience but not so useful to optimize Java code) then it might be a filesystem caching effect or it can be caused by a feature called "Java Class Data Sharing":
For Sun:
http://java.sun.com/j2se/1.5.0/docs/guide/vm/class-data-sharing.html
This is an option where the JVM saves a prepared image of the runtime classes to a file, to allow quicker loading (and sharing) of those at the next start. You can control this with -Xshare:on or -Xshare:off with a Sun JVM. The default is -Xshare:auto which will load the shared classes image if present, and if not present it will write it at first startup if the directory is write able.
With IBM Java 5 this is BTW even more powerful:
http://www.ibm.com/developerworks/java/library/j-ibmjava4/
I don't know of any mainstream JVM which is saving JIT statistics.

Java JVM (actually might change from different implementations of the JVM) when first started out will interpret the byte code. Once it detects that the code will be running enough number of times JITs it to native machine language so it runs faster.

Performance gain in compiling java to native code?

Is there any performance to be gained these days from compiling java to native code, or do modern hotspot compilers end up doing this over time anyway?

There was a similar discussion here recently, for the question What are advantages of bytecode over native code?. You can find interesting answers in that thread.

Some more anecdotal evidence. I've worked on a few performance critical real-time trading financial applications. I agree with Frank, nearly every time your problem is not the lack of being compiled, it is your algorithm or data structure. Modern hot-spot compilers are very good with the right code, for example the CERN Colt library is within 90% of compiled, optimised Fortran for numerical work.
If you are worried about speed I'd really recommend a good profiler and get evidence as to where your bottlenecks are - I use YourKit and have been very pleased.
We have only resorted to native compiled code for speed in one instance in the last few years, and that was so we could use CUDA and get some serious GPU performance.

Your question is a little large, the answer vary a lot
If you are using Just In Time compilation (JIT) or not
When you are using,, if your process is executed for a long time or not
All recent JVM use JIT, but on old JVM the java code is several time slower that native code.
If you have a server that run for a long period of time or batch that execute the same code again and again, the difference and up being very low.
We wrote the same batch both in C++ and in Java and run it with different dataset, the result differ for about 3 second, with dataset taking from 5 minutes to several hours.
But be careful, they are special case that there will be an important difference, for example the batch that need a lot memory.

Memory performance or CPU performance? Or are they the same these days?
My only evidence is anecdotal and on a different platform: after porting a bunch of CPU-hungry apps to C# (.NET 2.0), I did not notice substantial loss in performance (I do not consider 10% substantial). Well written code seems to perform well on a variety of architectures.
Most apps spend/waste time with:
IO operations that will not benefit from static (compile-time) analysis.
Bad Algorithms that will not benefit from static analysis.
Bad Memory layouts in critical CPU inner loops. While it is technically possible that compilers help us here, I have yet to see a real compiler do anything interesting.
So based upon my experience, unless you are writing a video codec, there is no benefit to compiling Java apps vs. just relying upon the hotspot compilers.

Tried Hello-World in with six different implementations just to check the overhead
and the difference was staggering. Java was off the charts while the compiled languages did equally well. I could proved all the evidence (in a reproducible) if needed.

Virtual Machine Optimization

I am messing around with a toy interpreter in Java and I was considering trying to write a simple compiler that can generate bytecode for the Java Virtual Machine. Which got me thinking, how much optimization needs to be done by compilers that target virtual machines such as JVM and CLI?
Do Just In Time (JIT) compilers do constant folding, peephole optimizations etc?

I'm just gonna add two links which explain Java's bytecode pretty well and some of the various optimization of the JVM during runtime.

Optimisation is what makes JVMs viable as environments for long running applications, you can bet that SUN, IBM and friends are doing their best to ensure they can optimise your bytecode and JIT-compiled code in an efficient a manner as possible.
With that being said, if you think you can pre-optimise your bytecode then it probably won't do much harm.
It is worth being aware, however, that JVMs can tend towards performing better (and not crashing) when presented with just the sort of bytecode the Java compiler tends to construct. It is not unknown for optimisations to be missed or even for the JVM to crash when permutations of bytecode occur that are correct but unlike what would be produced by javac. Hopefully that sort of thing is more in the past now, but may be something to be aware of.

Optimising bytecode is probably an oxymoron in most cases
I don't think that's true. Optimizations like hoisting loop invariants and propagating constants can never hurt, even if the JVM is smart enough to do them on its own, by simple virtue of making the code do less work.

Obfuscators such as ProGuard will perform many static optimisations on your bytecode for you.

The HotSpot compiler will optimize your code at runtime better than is possible at compile-time - it has more information to work with, after all. The only time you should be optimizing the bytecode instead of just your algorithm is when you are targeting mobile devices, such as the Blackberry, where the JVM for that platform is not powerful enough to optimize code at runtime and just executes the bytecode.

Optimising bytecode is probably an oxymoron in most cases. Unless you control the VM, you have no idea what it does to speed up code execution, if anything. The compiler would need to know the details of the VM in order to generate optimised code.

Note to Aseraphim:
It can also be useful to optimise bytecode for non-embedded applications in some limited cases:
When delivering code over the wire, eg for WebStart apps, to minimise deliverable/cache size and because you don't necessarily know the capability/speed of the client.
For code that you know is performance critical and used at start-up before (say) HotSpot has had time to gather any stats.
Again, the transformations that a good optimiser/obfuscator performs can be very helpful.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.