Does Java save its runtime optimizations? - java

My professor did an informal benchmark on a little program and the Java times were: 1.7 seconds for the first run, and 0.8 seconds for the runs thereafter.
Is this due entirely to the loading of the runtime environment into the operating environment ?
OR
Is it influenced by Java's optimizing the code and storing the results of those optimizations (sorry, I don't know the technical term for that)?

Okay, I found where I read that. This is all from "Learning Java" (O'Reilly 2005):
The problem with a traditional JIT compilation is that optimizing code takes time. So a JIT compiler can produce decent results but may suffer a significant latency when the application starts up. This is generally not a problem for long-running server-side applications but is a serious problem for client-side software and applications run on smaller devices with limited capabilities. To address this, Sun's compiler technology, called HotSpot, uses a trick called adaptive compilation. If you look at what programs actually spend their time doing, it turns out that they spend almost all their time executing a relatively small part of the code again and again. The chunk of code that is executed repeatedly may be only a small fraction of the total program, but its behavior determines the program's overall performance. Adaptive compilation also allows the Java runtime to take advantage of new kinds of optimizations that simply can't be done in a statically compiled language, hence the claim that Java code can run faster than C/C++ in some cases.
To take advantage of this fact, HotSpot starts out as a normal Java bytecode interpreter, but with a difference: it measures (profiles) the code as it is executing to see what parts are being executed repeatedly. Once it knows which parts of the code are crucial to performance, HotSpot compiles those sections into optimal native machine code. Since it compiles only a small portion of the program into machine code, it can afford to take the time necessary to optimize those portions. The rest of the program may not need to be compiled at all—just interpreted—saving memory and time. In fact, Sun's default Java VM can run in one of two modes: client and server, which tell it whether to emphasize quick startup time and memory conservation or flat out performance.
A natural question to ask at this point is, Why throw away all this good profiling information each time an application shuts down? Well, Sun has partially broached this topic with the release of Java 5.0 through the use of shared, read-only classes that are stored persistently in an optimized form. This significantly reduces both the startup time and overhead of running many Java applications on a given machine. The technology for doing this is complex, but the idea is simple: optimize the parts of the program that need to go fast, and don't worry about the rest.
I'm kind of wondering how far Sun has gotten with it since Java 5.0.

I'm not aware of any virtual machine in widespread use that saves statistical usage data between program invocations -- but it certainly is an interesting possibility for future research.
What you're seeing is almost certainly due to disk caching.

I agree that it's likely the result of disk caching.
FYI, the IBM Java 6 VM does contain an ahead-of-time compiler (AOT). The code isn't quite as optimized as what the JIT would produce, but it is stored across VMs, I believe in some sort of persistent shared memory. Its primary benefit is to improve startup performance. The IBM VM by default JITs a method after it's been called 1000 times. If it knows that a method is going to be called 1000 times just during the VM startup (think a commonly-used method like java.lang.String.equals(...) ), then it's beneficial for it to store that in the AOT cache so that it never has to waste time compiling at runtime.

I agree that the performance difference seen by the poster is most likely caused by disk latency bringing the JRE into memory. The Just In Time compiler (JIT) would not have an impact on performance of a little application.
Java 1.6u10 (http://download.java.net/jdk6/) touches the runtime JARs in a background process (even if Java isn't running) to keep the data in the disk cache. This significantly decreases startup times (Which is a huge benefit to desktop apps, but probably of marginal value to server side apps).
On large, long running applications, the JIT makes a big difference over time - but the amount of time required for the JIT to accumulate sufficient statistics to kick in and optimize (5-10 seconds) is very, very short compared to the overall life of the application (most run for months and months). While storing and restoring the JIT results is an interesting academic exercise, the practical improvement is not very large (Which is why the JIT team has been more focused on things like GC strategies for minimizing memory cache misses, etc...).
The pre-compilation of the runtime classes does help desktop applications quite a bit (as does the aforementioned 6u10 disk cache pre-loading).

You should describe how your Benchmark was done. Especially at which point you start to measure the time.
If you include the JVM startup time (which is useful for Benchmarking the User experience but not so useful to optimize Java code) then it might be a filesystem caching effect or it can be caused by a feature called "Java Class Data Sharing":
For Sun:
http://java.sun.com/j2se/1.5.0/docs/guide/vm/class-data-sharing.html
This is an option where the JVM saves a prepared image of the runtime classes to a file, to allow quicker loading (and sharing) of those at the next start. You can control this with -Xshare:on or -Xshare:off with a Sun JVM. The default is -Xshare:auto which will load the shared classes image if present, and if not present it will write it at first startup if the directory is write able.
With IBM Java 5 this is BTW even more powerful:
http://www.ibm.com/developerworks/java/library/j-ibmjava4/
I don't know of any mainstream JVM which is saving JIT statistics.

Java JVM (actually might change from different implementations of the JVM) when first started out will interpret the byte code. Once it detects that the code will be running enough number of times JITs it to native machine language so it runs faster.

Related

Performance of short-running Java CLI application

I'm building a java CLI utility application that processes some data from a file.
Apart from reading from a file, all the operations are done in-memory. The in-memory processing part is taking a surprisingly long time so I tried profiling it but could not pinpoint any specific function that performed particularly bad.
I was afraid that JIT was not able to optimize the program during a single run, so I benchmarked how the runtime changes between the consecutive executions of the function with all the program logic (including reading the input file) and sure enough, the runtime for the in-memory processing part goes down for several executions and becomes almost 10 times smaller already on the 5th run.
I tried shuffling the input data before every execution, but it doesn't have any visible effect on this. I'm not sure if some caching may be responsible for this improvement or the JIT optimizations done during the program run, but since usually the program is ran once at time, it always shows the worst performance.
Would it be possible to somehow get a good performance during the first run? Is there a generic way to optimize performance for a short-running java applications?
You probably cannot optimize startup time and performance by changing your application1, 2. And especially for a small application3. And I certainly don't think there are "generic" ways to do it; i.e. optimizations that will work for all cases.
However, there are a couple of JVM features that should improve performance for a short-lived JVM.
Class Data Sharing (CDS) is a feature that allows JIT compiled classes to be cached in the file system (as a CDS archive) and which is then reused by later of runs of your application. This feature has been available since Java 5 (though with limitations in earlier Java releases).
The CDS feature is controlled using the -Xshare JVM option.
-Xshare:dump generates a CDS archive during the run
-Xshare:off -Xshare:on and -Xshare:auto control whether an existing CDS archive will be used.
The other way to improve startup times for a HotSpot JVM is (was) to use Ahead Of Time (AOT) compilation. Basically, you compile your application to a native code binary using the jaotc command, and then run the executable it produces rather than the java command. The jaotc command is experimental and was introduced in Java 9.
It appears that jaotc was not included in the Java 16 builds published by Oracle, and is scheduled for removal in Java 17. (See JEP 410: Remove the Experimental AOT and JIT Compiler).
The current recommended way to get AOT compilation for Java is to use the GraalVM AOT Java compiler.
1 - You could convert into a client-server application where the server "up" all of the time. However, that has other problems, and doesn't eliminate the startup time issue for the client ... assuming that is coded in Java.
2 - According to #apangin, there are some other application tweaks that may could make you code more JIT friendly, though it will depend on what you code is currently doing.
3 - It is conceivable that the startup time for a large (long running) monolithic application could be improved by refactoring it so that subsystems of the application can be loaded and initialized only when they are needed. However, it doesn't sound like this would work for your use-case.
You could have the small processing run as a service: when you need to run it, "just" make a network call to that service (easier if it's HTTP because there are easy way to do it in Java). That way, the processing itself stays in the same JVM and will eventually get faster when JIT kicks in.
Of course, because it could require significant development, that is only valid if the processing itself:
is called often
has arguments that are easy to pass to the service (usually serialized as strings)
has arguments that don't require too much data to pass to the service (e.g. several MB binary content)

Why would I use AOT compilation in Java?

I recently started reading about Java compilers. So far my understanding is, that optimization comes from techniques like tired compilation or code profiling. Now I read that Java 9 respectively Java 10 (Windows) provides the option of AOT compilation. Now I wonder: what use case would justify the use of AOT compilation?
To have better startup performance, like simple desktop app, it would be annoying for user to wait for it to load and then it still will be pretty slow until JIT kicks in. So then you can use AOT to already provide optimized code - it might not be as good as JIT, but will be much faster at startup.
Also some applications are used only for few seconds or even less - JIT will never have a chance to kick in. Like simple command line app that just sends single request and closes. Each function will be probably only executed once - so there is no reason to use JIT at all.
Also it might help to decrease binary size or allow for creation of very simple and small standalone binary. Same for memory usage - as JIT needs some memory to work.

Is there a way to achieve JIT performance without JIT overhead?

Is there a way to achieve JIT performance while removing JIT overhead? Preferably by compiling a class file to an native image.
I have investigated GCJ, but even for a simple program, GCJ output's performance is much worse than Java JIT.
You could try Excelsior.
http://www.excelsior-usa.com/jet.html
I've had good experiences with this in the past (but it was a long time ago)
There have been in the past a number of "static" compilers for Java, but I don't know that any are currently available. To the best of my knowledge the last one in use was the "Java Transformer" for the IBM iSeries "Classic JVM", but that JVM was deprecated in favor of the J9 JVM.
The "Java Transformer" did quite well, but, as others have noted, it could not take advantage of all of the info that a JITC has available at runtime (though it did manage to take advantage of some of the runtime info).
(And it should be noted that "JITC overhead" is really minimal. Compilation occurs pretty quickly and efficiently in most cases. The problem is that compilation doesn't even start until the interpreter has run long enough to collect statistics and trigger the JITC.)
The simplest solution is often to warmup your code on startup. If you have a server based application, the cost of startup isn't as important as the cost when the service is used. In this situation you can warmup all the critical code by calling it 10K - 20K times which triggers all that code to compile.
This can take less than a second in simple cases so has very little impact on startup and means you are using compiled code when the service is used.
If you have a client based application you usually have a lot of processing power for just one user in which case the cost of the background JIT is less important.
The moral of the story is; try to check you have a problem to solve before diving into a solution. Very often questions on stack over flow are about problems which have either a) already been solved or b) are not a significant problem in the first place.
Measuring the extent of your problem or performance is the best guide as to what matters and what doesn't. If you don't measure, you are just guessing. (Even if you have ten+ year experience performance tuning Java systems)
I have just found my answer here:
Why is Java faster when using a JIT vs. compiling to machine code?
Quote from top answer:
This means that you cannot write a AOT compiler which covers ALL Java
programs as there is information available only at runtime about the
characteristics of the program.
I'd recommend you to find the root cause of inferior performance of your Java code before trying out AOT compilation or rewriting any portions in C++.
Head over to http://www.javaperformancetuning.com/ for tons of information and links.

Why is the JVM slow to start?

What exactly makes the JVM (in particular, Sun's implementation) slow to get running compared to other runtimes like CPython? My impression was that it mainly has to do with a boatload of libraries getting loaded whether they're needed or not, but that seems like something that shouldn't take 10 years to fix.
Come to think of it, how does the JVM start time compare to the CLR on Windows? How about Mono's CLR?
UPDATE: I'm particularly concerned with the use case of small utilities chained together as is common in Unix. Is Java now suitable for this style? Whatever startup overhead Java incurs, does it add up for every Java process, or does the overhead only really manifest for the first process?
Here is what Wikipedia has to say on the issue (with some references).
It appears that most of the time is taken just loading data (classes) from disk (i.e. startup time is I/O bound).
Just to note some solutions:
There are two mechanisms that allow to faster startup JVM.
The first one, is the class data sharing mechanism, that is supported since Java 6 Update 21 (only with the HotSpot Client VM, and only with the serial garbage collector as far as I know)
To activate it you need to set -Xshare (on some implementations: -Xshareclasses ) JVM options.
To read more about the feature you may visit:
Class data sharing
The second mechanism is a Java Quick Starter. It allows to preload classes during OS startup, see:
Java Quick Starter for more details.
Running a trivial Java app with the 1.6 (Java 6) client JVM seems instantaneous on my machine. Sun has attempted to tune the client JVM for faster startup (and the client JVM is the default), so if you don't need lots of extra jar files, then startup should be speedy.
If you are using Sun's HotSpot for x86_64 (64bit compiled), note that the current implementation only works in server mode, that is, it precompiles every class it loads with full optimization, whereas the 32bit version also supports client mode, which generally postpones optimization and optimizes the most CPU-intensive parts only, but has faster start-up times.
See for instance:
http://en.wikipedia.org/wiki/64-bit#32_vs_64_bit
http://java.sun.com/docs/hotspot/HotSpotFAQ.html#64bit_compilers
That being said, at least on my machine (Linux x86_64 with 64bit kernel), the 32bit HotSpot version supports both client and server mode (via the -client and -server flags), but defaults to server mode, while the 64bit version only supports server mode.
It really depends on what you are doing during the start up. If you run Hello World application it takes 0.15 seconds on my machine.
However, Java is better suited to running as a client or a server/service which means the startup time isn't as important as the connection time (about 0.025 ms) or the round trip time response time (<< 0.001 ms).
There are a number of reasons:
lots of jars to load
verification (making sure code doesn't do evil things)
JIT (just in time compilation) overhead
I'm not sure about the CLR, but I think it is often faster because it caches a native version of assemblies for next time (so it doesn't need to JIT). CPython starts faster because it is an interpreter, and IIRC, doesn't do JIT.
In addition to things already mentioned (loading classes, esp. from compressed JARs); running in interpreted mode before HotSpot compiles commonly-used bytecode; and HotSpot compilation overhead, there is also quite a bit of one-time initialization done by JDK classes themselves.
Many optimizations are done in favor of longer-running systems where startup speed is less of a concern.
And as to unix style pipelining: you certainly do NOT want to start and re-start JVM multiple times. That is not going to be efficient. Rather chaining of tools should happen within JVM. This can not be easily intermixed with non-Java Unix tools, except by starting such tools from within JVM.
All VMs with a rich type system such as Java or CLR will not be instanteous when compared to less rich systems such as those found in C or C++. This is largely because a lot is happening in the VM, a lot of classes get initialized and are required by a running system. Snapshots of an initialized system do help but it still costs to load that image back into memory etc.
A simple hello world styled one liner class with a main still requires a lot to be loaded and initialized. Verifying the class requires a lot of dependency checking and validation all which cost time and many CPU instructions to be executed. On the other hand a C program will not do any of these and will amount of a few instructions and then invoke the printer function.

Performance gain in compiling java to native code?

Is there any performance to be gained these days from compiling java to native code, or do modern hotspot compilers end up doing this over time anyway?
There was a similar discussion here recently, for the question What are advantages of bytecode over native code?. You can find interesting answers in that thread.
Some more anecdotal evidence. I've worked on a few performance critical real-time trading financial applications. I agree with Frank, nearly every time your problem is not the lack of being compiled, it is your algorithm or data structure. Modern hot-spot compilers are very good with the right code, for example the CERN Colt library is within 90% of compiled, optimised Fortran for numerical work.
If you are worried about speed I'd really recommend a good profiler and get evidence as to where your bottlenecks are - I use YourKit and have been very pleased.
We have only resorted to native compiled code for speed in one instance in the last few years, and that was so we could use CUDA and get some serious GPU performance.
Your question is a little large, the answer vary a lot
If you are using Just In Time compilation (JIT) or not
When you are using,, if your process is executed for a long time or not
All recent JVM use JIT, but on old JVM the java code is several time slower that native code.
If you have a server that run for a long period of time or batch that execute the same code again and again, the difference and up being very low.
We wrote the same batch both in C++ and in Java and run it with different dataset, the result differ for about 3 second, with dataset taking from 5 minutes to several hours.
But be careful, they are special case that there will be an important difference, for example the batch that need a lot memory.
Memory performance or CPU performance? Or are they the same these days?
My only evidence is anecdotal and on a different platform: after porting a bunch of CPU-hungry apps to C# (.NET 2.0), I did not notice substantial loss in performance (I do not consider 10% substantial). Well written code seems to perform well on a variety of architectures.
Most apps spend/waste time with:
IO operations that will not benefit from static (compile-time) analysis.
Bad Algorithms that will not benefit from static analysis.
Bad Memory layouts in critical CPU inner loops. While it is technically possible that compilers help us here, I have yet to see a real compiler do anything interesting.
So based upon my experience, unless you are writing a video codec, there is no benefit to compiling Java apps vs. just relying upon the hotspot compilers.
Tried Hello-World in with six different implementations just to check the overhead
and the difference was staggering. Java was off the charts while the compiled languages did equally well. I could proved all the evidence (in a reproducible) if needed.

Categories