Measuring overhead of calling through JNI

Measuring overhead of calling through JNI - java

To see if I can really take any benefit of native code (written C) by using JNI (instead of writing complete java application), I want to measure overhead of calling through JNI. What is the best way to measure this overhead?

I wouldn't use a profiler to do quantitative performance testing. Profiling tends to introduce distortions into the actual timing numbers.
I'd create a benchmark that performed one of the actual calculations that you are considering doing in C and compare the C + JNI + Java version against a pure Java version. Be sure that you are comparing apples and apples; i.e. profile and optimize both versions before you benchmark them.
To do the actual benchmarking, I'd construct a loop that performed the calculation a large number of times, record the timings over a large number of iterations and compare. Make sure that you take account of JVM warmup effects; e.g. class loading, JIT compilation and heap warmup.
Like Thihara, I doubt that using C + JNI will help much. And even if it does, you need to take account of the downsides of JNI; e.g. C code portability, platform specific build issues ... and possible JVM hard crashes if your native code has bugs.

Measuring the overhead alone may give you strange results. I'd code a small part of the performance-critical code in both Java and C++ and measure the program performance, e.g., using caliper (microbenchmarking is quite a complicated thing and hardly anybody gets it right).
I would not use any profiler, especially C++ profiler, since the performance of the C++ part alone doesn't matter and since profilers may distort the results.

Use a C++ profiler and a Java profiler. They are available in IDEs for Java. I can only assume in the case of C++ though. And whatever test you design please run through a substantial number of loops to minimize environmental errors.
Oh and do post the results back since I'm also curious to see if there are any improvement in using native code over modern JVMs. Chances are though you won't see a huge performance improvement in native code.

Related

Is there a way to achieve JIT performance without JIT overhead?

Is there a way to achieve JIT performance while removing JIT overhead? Preferably by compiling a class file to an native image.
I have investigated GCJ, but even for a simple program, GCJ output's performance is much worse than Java JIT.

You could try Excelsior.
http://www.excelsior-usa.com/jet.html
I've had good experiences with this in the past (but it was a long time ago)

There have been in the past a number of "static" compilers for Java, but I don't know that any are currently available. To the best of my knowledge the last one in use was the "Java Transformer" for the IBM iSeries "Classic JVM", but that JVM was deprecated in favor of the J9 JVM.
The "Java Transformer" did quite well, but, as others have noted, it could not take advantage of all of the info that a JITC has available at runtime (though it did manage to take advantage of some of the runtime info).
(And it should be noted that "JITC overhead" is really minimal. Compilation occurs pretty quickly and efficiently in most cases. The problem is that compilation doesn't even start until the interpreter has run long enough to collect statistics and trigger the JITC.)

The simplest solution is often to warmup your code on startup. If you have a server based application, the cost of startup isn't as important as the cost when the service is used. In this situation you can warmup all the critical code by calling it 10K - 20K times which triggers all that code to compile.
This can take less than a second in simple cases so has very little impact on startup and means you are using compiled code when the service is used.
If you have a client based application you usually have a lot of processing power for just one user in which case the cost of the background JIT is less important.
The moral of the story is; try to check you have a problem to solve before diving into a solution. Very often questions on stack over flow are about problems which have either a) already been solved or b) are not a significant problem in the first place.
Measuring the extent of your problem or performance is the best guide as to what matters and what doesn't. If you don't measure, you are just guessing. (Even if you have ten+ year experience performance tuning Java systems)

I have just found my answer here:
Why is Java faster when using a JIT vs. compiling to machine code?
Quote from top answer:
This means that you cannot write a AOT compiler which covers ALL Java
programs as there is information available only at runtime about the
characteristics of the program.

I'd recommend you to find the root cause of inferior performance of your Java code before trying out AOT compilation or rewriting any portions in C++.
Head over to http://www.javaperformancetuning.com/ for tons of information and links.

Java, JIT and Garbage Collector efficiency

I want to know about the efficiency of Java and the advantages and disadvantages of Java Virtual Machine and Android.
Efficiency is the low use of memory, low use of the processor and fast execution.
Mobile devices are simpler than PC, then the apps need to be more efficient. Servers receive many connections and they need to be very efficient. Many mobile devices use Android and Java apps, and many servers use PHP.
Can Java and interpreted languages, such as Java Script, Python and PHP, be more efficient than C and C++?
JIT (just in time) advantages:
It can optimize better, because it knows the value of some variables and where it is used or changed.
It knows the processor and can optimize with processor specific instructions.
It is easier to transform functions into inline function.
It can remove known conditional tests and remove blocks that will not be run.
Java disadvantages:
When the app run for the first time, the app will be very slow, because the bytecodes will be interpreted and JIT compiler will do many analysis to find good optimizations. The apps cannot use the maximum of the hardware power. If an app is a game or a real-time app, if it be run for the first successfully and with no delay, but it uses the maximum of the hardware power, then the next time the app be run, it will not use the maximum of the hardware power due to optimizations. The problem is the app cannot be designed to use the maximum of the hardware power after the optimization, because it will be too slow on the first run, and will not continue to run.
Java checks if the array index is not out of bounds, and it checks if the pointers are not null. It will add several internal "if"s to generated code.
All objects use garbage collector, including objects that are very easy to manually delete.
All instances of objects are created with dynamic memory allocation, including objects that can easily use the stack. If a loop iteration begins creating an instance of a class and ends deleting the created object, dynamic memory allocation will be inefficient.
The garbage collector needs to stop the app while it cleans the memory and it is very undesired for games, GUI apps and real-time apps. Reference counting is slow and it cannot handle circular references. Multi-threaded garbage collector is slower and it needs more use of the CPU.

Can Java and interpreted languages, such as Java Script, Python and PHP, be more efficient than C and C++?
It's very difficult to get more efficient than the best C and C++ programs. There's a lot of C and C++ programs that are nowhere near as efficient as that though, and beating them with (modern) Java code is quite practical if you're any good.
I've also heard good things about the current best-of-breed Javascript engines, but I've never studied them in detail.
With Python and PHP (and many other languages besides) it's a bit different. Those languages are written in C, so it's obvious they cannot be more efficient than C (follows by construction). Yet it's much easier to write efficient code in them (i.e., that uses what is in-effect a very well-written C library) than it is to start from scratch. In particular, it reduces the number of defects per program. That's a very important metric in practice; anyone can produce fast code if it's allowed to be wrong.
In general, I advise not worrying about getting maximal efficiency. You run up against the law of diminishing returns. Instead, use sensible overall algorithms (or, as a friend of mine once said to me, “look after the big O()s and let the constant factors look after themselves”) and focus on the question of whether the program is good enough in practice. Once it is, stop fiddling around and ship it!

Let's pick apart your claimed disadvantages:
When the app run for the first time, the app will be very slow, because the bytecodes will be interpreted and JIT compiler will do many analysis to find good optimizations. The apps cannot use the maximum of the hardware power.
JIT compilation is an implementation issue. Not all platforms do it. Indeed, the Android platform could be modified to 1) do ahead of time compilation, or 2) cache the native code produced by the JIT to give faster startup next time you run the app.
It is interesting that various Java vendors have tried these strategies at various times, and yet the empirical evidence is that plain JIT is the best strategy.
Java checks if the array index is not out of bounds, and it checks if the pointers are not null. It will add several internal "if"s to generated code.
The JIT compiler can optimize away many of these tests. For the rest, the overheads tend to be relatively small; e.g. a few percent difference ... not a factor of 2.
Note that the alternative to checking is the risk that typical application bugs will crash the android platform. Certainly, garbage collection becomes problematic if applications can trash memory.
All objects use garbage collector, including objects that are very easy to manually delete.
The flip-side is that it is easy to forget to delete objects, delete objects twice, use them after they have been deleted and so on. These mistakes all lead to bugs that tend to be hard to track down.
All instances of objects are created with dynamic memory allocation, including objects that can easily use the stack. If a loop iteration begins creating an instance of a class and ends deleting the created object, dynamic memory allocation will be inefficient.
Java dynamic memory allocation and object creation is FAST. Faster than in C++ for example.
The garbage collector needs to stop the app while it cleans the memory and it is very undesired for games, GUI apps and real-time apps.
Use a concurrent / low-pause garbage collector then. Another approach is to implement your app to not generate lots of garbage ... and seldom trigger garbage collection.
Reference counting is slow and it cannot handle circular references.
No decent Java GC uses reference counting. (On the other hand, a lot of C / C++ manual memory management schemes do. For instance, so-called smart pointer schemes in C++.)
Multi-threaded garbage collector is slower and it needs more use of the CPU.
You actually mean concurrent collection I think. Yes it does, but that's the penalty you pay for the extra responsiveness that you demand for interactive games / realtime apps.

What you describe as 'efficient' I would describe as 'ideal'. An application that requires little memory, little CPU time and runs quickly, put another way, is one that is good, fast, and cheap all at the same time. Never mind if it does anything useful or interesting.
The only comparison I'd view as reasonable, if all three goals are required, is among applications that produce a common result. In that case, it is unlikely, given a competing group of evenly-capable programmers, that any one implementation would excel on all three counts over the others.
That said, your question leaves out a key criterion to the mobile market: rate of application development. Mobile applications also profit far more from positive user experience than back-end optimization. Without that constraint, the question of efficiency as you put it, seems to me more of an ponderous consideration than a practical one.
But to the actual question: can a language like Java produce more efficient code than one that compiles statically to the instruction set of the target machine? Probably not. Can it be as efficient, or efficient enough? Absolutely. If we considered an execution platform with fixed, severely constrained resources that changes infrequently, it would be a different matter.

In any language, the way to get fast execution is to do the job with as little execution as possible, and as little garbage collection as possible.
That sounds like a vacuous generality, but what it means in practice, regardless of language, is
For the data structure design, keep it as simple as possible. Stay away from the fancy collection classes full of bells and whistles. Especially stay away from notifications as a way of keeping data consistent. If your data is normalized, it can never be inconsistent. If you can't normalize it, it's better to tolerate temporary inconsistency, than to try to keep it tight with notifications.
Performance problems creep in, even into the best code. You should try not to make them, but you will still make them. Most important is knowing how to find them, once made, and remove them. Here's a blow-by-blow example. If in doing this, you find you need a better big-O algorithm, then put it in. Putting one in without being sure it's needed is a recipe for slowness.
No language can rescue a program from non-removed performance problems. The language and its compiler, JITter, etc. are like a race horse. It's fine to want a good horse, but it's a waste if the jockey isn't as slim as possible.
Your program is the jockey, and it's your job to take it on a weight-loss program.

I will paste an interesting answer given by the James Gosling himself in the Book Masterminds of Programming.
Well, I’ve heard it said that
effectively you have two compilers in
the Java world. You have the compiler
to Java bytecode, and then you have
your JIT, which basically recompiles
everything specifically again. All of
your scary optimizations are in the
JIT.
James: Exactly. These days we’re
beating the really good C and C++
compilers pretty much always. When you
go to the dynamic compiler, you get
two advantages when the compiler’s
running right at the last moment. One
is you know exactly what chipset
you’re running on. So many times when
people are compiling a piece of C
code, they have to compile it to run
on kind of the generic x86
architecture. Almost none of the
binaries you get are particularly well
tuned for any of them. You download
the latest copy of Mozilla,and it’ll
run on pretty much any Intel
architecture CPU. There’s pretty much
one Linux binary. It’s pretty generic,
and it’s compiled with GCC, which is
not a very good C compiler.
When HotSpot runs, it knows exactly
what chipset you’re running on. It
knows exactly how the cache works. It
knows exactly how the memory hierarchy
works. It knows exactly how all the
pipeline interlocks work in the CPU.
It knows what instruction set
extensions this chip has got. It
optimizes for precisely what machine
you’re on. Then the other half of it
is that it actually sees the
application as it’s running. It’s able
to have statistics that know which
things are important. It’s able to
inline things that a C compiler could
never do. The kind of stuff that gets
inlined in the Java world is pretty
amazing. Then you tack onto that the
way the storage management works with
the modern garbage collectors. With a
modern garbage collector, storage
allocation is extremely fast.

Java profiling - How reliable are the values it gives?

I am working on a simple text markup Java Library which should be, amongst other requirements, fast.
For that purpose, I did some profiling, but the results give me worse numbers that are then measured when running in non-profile mode.
So my question is - how much reliable is the profiling? Does that give just an informational ratio of the time spent in methods? Does that take JIT compiler into account, or is the profiling mode only interpreted? I use NetBeans Profiler and Sun JDK 1.6.
Thanks.

When running profiling, you'll always incur a performance penalty as something has to measure the start/stop time of the methods, keep track of the objects of the heap (for memory profiling), so there is a management overhead.
However, it will give you clear pointers to find out where bottlenecks are. I tend to look for the methods where the most cumulative time is spent and check whether optimisations can be made. It is also useful to determine whether methods are called unnecessarily.
With methods that are very small, take the profile results with a pinch of salt, sometimes the process of measuring can take more time than the method call itself and skew the results (it might appear that a small, often-called method has more of a performance impact).
Hope that helps.

Because of instrumentation profiled code on average will run slower than non-profiled code. Measuring speed is not the purpose of profiling however.
The profiling output will point you to bottlenecks, places that threads spend more time in, code that behaves worse than expected, or possible memory leaks.
You can use those hints to improve said methods and profile again until you are happy with the results.
A profiler will not be a solution to a coding style that is x% slower than optimal however, you still need to spend time fine-tuning those parts of your code that are used more often than others.

I'm not surprised by the fact that you get worst results when profiling your application as instrumenting java code will typically always slow its execution. This is actually nicely captured by the Wikipedia page on Profiling which mentions that instrumentation can causes changes in the performance of a program, potentially causing inaccurate inaccurate results and heisenbugs (due to the observer effect: observers affect what they are observing, by the mere act of observing it alone).
Having that said, if you want to measure speed, I think that you're not using the right tool. Profilers are used to find bottlenecks in an application (and for that, you don't really care of the overall impact). But if you want to benchmark your library, you should use a performance testing tool (for example, something like JMeter) that will be able to give you an average execution time per call. You will get much better and more reliable results with the right tool.

The Profiling should have no influence on the JIT compiler. The code inserted to profile your application however will slow the methods down quite a bit.
Profillers work on severall different models, either they insert code to see how long and how often methods are running or they only take samples by repeatedly polling what code is currently executed.
The first will slow down your code quite a bit, while the second is not 100% accurate, as it may miss some method calls.

Profiled code is bound to run slower as mentioned in most of the previous comments. I would say, use profiling to measure only the relative performance of various parts of your code (say methods). Do not use the measures from a profiler as an indicator of how the code would perform overall (unless you want a worst-case measure, in which case what you have would be an overestimated value.)

I have found I get different result depending on which profiler I use. However the results are often valid, but a different perspective on the problem. Something I often do when profiling CPU usage is to enable memory allocation profiling. This often gives me different CPU results (due to the increased overhead caused by the memory profiling) and can give me some useful places to optimise.

Performance gain in compiling java to native code?

Is there any performance to be gained these days from compiling java to native code, or do modern hotspot compilers end up doing this over time anyway?

There was a similar discussion here recently, for the question What are advantages of bytecode over native code?. You can find interesting answers in that thread.

Some more anecdotal evidence. I've worked on a few performance critical real-time trading financial applications. I agree with Frank, nearly every time your problem is not the lack of being compiled, it is your algorithm or data structure. Modern hot-spot compilers are very good with the right code, for example the CERN Colt library is within 90% of compiled, optimised Fortran for numerical work.
If you are worried about speed I'd really recommend a good profiler and get evidence as to where your bottlenecks are - I use YourKit and have been very pleased.
We have only resorted to native compiled code for speed in one instance in the last few years, and that was so we could use CUDA and get some serious GPU performance.

Your question is a little large, the answer vary a lot
If you are using Just In Time compilation (JIT) or not
When you are using,, if your process is executed for a long time or not
All recent JVM use JIT, but on old JVM the java code is several time slower that native code.
If you have a server that run for a long period of time or batch that execute the same code again and again, the difference and up being very low.
We wrote the same batch both in C++ and in Java and run it with different dataset, the result differ for about 3 second, with dataset taking from 5 minutes to several hours.
But be careful, they are special case that there will be an important difference, for example the batch that need a lot memory.

Memory performance or CPU performance? Or are they the same these days?
My only evidence is anecdotal and on a different platform: after porting a bunch of CPU-hungry apps to C# (.NET 2.0), I did not notice substantial loss in performance (I do not consider 10% substantial). Well written code seems to perform well on a variety of architectures.
Most apps spend/waste time with:
IO operations that will not benefit from static (compile-time) analysis.
Bad Algorithms that will not benefit from static analysis.
Bad Memory layouts in critical CPU inner loops. While it is technically possible that compilers help us here, I have yet to see a real compiler do anything interesting.
So based upon my experience, unless you are writing a video codec, there is no benefit to compiling Java apps vs. just relying upon the hotspot compilers.

Tried Hello-World in with six different implementations just to check the overhead
and the difference was staggering. Java was off the charts while the compiled languages did equally well. I could proved all the evidence (in a reproducible) if needed.

Virtual Machine Optimization

I am messing around with a toy interpreter in Java and I was considering trying to write a simple compiler that can generate bytecode for the Java Virtual Machine. Which got me thinking, how much optimization needs to be done by compilers that target virtual machines such as JVM and CLI?
Do Just In Time (JIT) compilers do constant folding, peephole optimizations etc?

I'm just gonna add two links which explain Java's bytecode pretty well and some of the various optimization of the JVM during runtime.

Optimisation is what makes JVMs viable as environments for long running applications, you can bet that SUN, IBM and friends are doing their best to ensure they can optimise your bytecode and JIT-compiled code in an efficient a manner as possible.
With that being said, if you think you can pre-optimise your bytecode then it probably won't do much harm.
It is worth being aware, however, that JVMs can tend towards performing better (and not crashing) when presented with just the sort of bytecode the Java compiler tends to construct. It is not unknown for optimisations to be missed or even for the JVM to crash when permutations of bytecode occur that are correct but unlike what would be produced by javac. Hopefully that sort of thing is more in the past now, but may be something to be aware of.

Optimising bytecode is probably an oxymoron in most cases
I don't think that's true. Optimizations like hoisting loop invariants and propagating constants can never hurt, even if the JVM is smart enough to do them on its own, by simple virtue of making the code do less work.

Obfuscators such as ProGuard will perform many static optimisations on your bytecode for you.

The HotSpot compiler will optimize your code at runtime better than is possible at compile-time - it has more information to work with, after all. The only time you should be optimizing the bytecode instead of just your algorithm is when you are targeting mobile devices, such as the Blackberry, where the JVM for that platform is not powerful enough to optimize code at runtime and just executes the bytecode.

Optimising bytecode is probably an oxymoron in most cases. Unless you control the VM, you have no idea what it does to speed up code execution, if anything. The compiler would need to know the details of the VM in order to generate optimised code.

Note to Aseraphim:
It can also be useful to optimise bytecode for non-embedded applications in some limited cases:
When delivering code over the wire, eg for WebStart apps, to minimise deliverable/cache size and because you don't necessarily know the capability/speed of the client.
For code that you know is performance critical and used at start-up before (say) HotSpot has had time to gather any stats.
Again, the transformations that a good optimiser/obfuscator performs can be very helpful.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.