How is Java jitting inefficient code to run faster than efficient code? - java

In the following code snippet, Foo1 is a class that increments a counter every time the method bar() is called. Foo2 does the same thing but with one additional level of indirection.
I would expect Foo1 to be faster than Foo2, however in practice, Foo2 is consistently 40% faster than Foo1. How is the JVM optimizing the code such that Foo2 runs faster than Foo1?
Some Details
The test was executed with java -server CompositionTest.
Running the test with java -client CompositionTest produces the expected results, that Foo2 is slower than Foo1.
Switching the order of the loops does not make a difference.
The results were verified with Java 6 on both Sun's and OpenJDK's JVMs.
The Code
public class CompositionTest {
private static interface DoesBar {
public void bar();
public int count();
public void count(int c);
}
private static final class Foo1 implements DoesBar {
private int count = 0;
public final void bar() { ++count; }
public int count() { return count; }
public void count(int c) { count = c; }
}
private static final class Foo2 implements DoesBar {
private DoesBar bar;
public Foo2(DoesBar bar) { this.bar = bar; }
public final void bar() { bar.bar(); }
public int count() { return bar.count(); }
public void count(int c) { bar.count(c); }
}
public static void main(String[] args) {
long time = 0;
DoesBar bar = null;
int reps = 100000000;
for (int loop = 0; loop < 10; loop++) {
bar = new Foo1();
bar.count(0);
int i = reps;
time = System.nanoTime();
while (i-- > 0) bar.bar();
time = System.nanoTime() - time;
if (reps != bar.count())
throw new Error("reps != bar.count()");
}
System.out.println("Foo1 time: " + time);
for (int loop = 0; loop < 10; loop++) {
bar = new Foo2(new Foo1());
bar.count(0);
int i = reps;
time = System.nanoTime();
while (i-- > 0) bar.bar();
time = System.nanoTime() - time;
if (reps != bar.count())
throw new Error("reps != bar.count()");
}
System.out.println("Foo2 time: " + time);
}
}

Your microbenchmark is meaningless. On my computer the code runs in about 8ms for each loop... To produce any meaningful number, a benchmark should probably run for at least a second.
When run both for around a second (hint, you need more than Integer.MAX_VALUE repetitions) I find that the run times of both are identical.
The likely explanation for this is that the JIT compiler has noticed that your indirection is meaningless and optimised it out (or at least inlined the method calls) such that the code executed in both loops is identical.
It can do this because it knows bar in Foo2 is effectively final, and it also knows that the argument to the Foo2 constructor is always going to be a Foo1 (at least in our little test). That way it knows the exact code path when Foo2.bar is called. It also knows that this loop is going to run a lot of times (in fact it knows exactly how many times the loop will execute), so it seems like a good idea to inline the code.
I have no idea if that is precisely what it does, but these are all logical observations that the JIT could be making about the code. Perhaps in the future some JIT compilers might even optimise away the entire while loop and simply set count to reps, but that seems somewhat unlikely.
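A rough sketch of a steadier measurement, assuming nothing beyond the JDK: run each variant several times in the same JVM, discard the early rounds as warm-up, and report the best of the later rounds. The names Counter, Direct, Wrapped and BenchSketch are made up for illustration and stand in for DoesBar, Foo1 and Foo2; a dedicated harness such as JMH would be the more robust choice for real measurements.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for DoesBar, Foo1 and Foo2 from the question.
interface Counter {
    void bar();
    long count();
}

final class Direct implements Counter {
    private long count;
    public void bar() { ++count; }
    public long count() { return count; }
}

final class Wrapped implements Counter {
    private final Counter inner;
    Wrapped(Counter inner) { this.inner = inner; }
    public void bar() { inner.bar(); }
    public long count() { return inner.count(); }
}

class BenchSketch {
    // Calls bar() reps times on a fresh counter and returns the elapsed nanoseconds.
    static long timeOnce(Counter c, long reps) {
        long start = System.nanoTime();
        for (long i = 0; i < reps; i++) c.bar();
        long elapsed = System.nanoTime() - start;
        // Reading the count keeps the loop's result observable.
        if (c.count() != reps) throw new AssertionError("lost increments");
        return elapsed;
    }

    // Runs and discards `warmup` rounds, then returns the best of `measured` rounds.
    static long best(Supplier<Counter> factory, long reps, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) timeOnce(factory.get(), reps);
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measured; i++) {
            best = Math.min(best, timeOnce(factory.get(), reps));
        }
        return best;
    }

    public static void main(String[] args) {
        long reps = 200_000_000L; // raise this until one round is long enough to time reliably
        System.out.println("direct  ns: " + best(Direct::new, reps, 5, 5));
        System.out.println("wrapped ns: " + best(() -> new Wrapped(new Direct()), reps, 5, 5));
    }
}
```

The loop counter is a long so that reps can go past Integer.MAX_VALUE, as the hint above suggests.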

Trying to predict performance on modern languages is not very productive.
The JVM is constantly modified to increase performance of common, readable structures which, in contrast, makes uncommon, awkward code slower.
Just write your code as clearly as you can. If you then find a spot that is measurably too slow to meet written specifications, you may have to hand-tweak some areas, but this will probably involve large, simple ideas like object caches, tweaking JVM options and eliminating truly stupid/wrong code. (Wrong data structures can be HUGE: I once changed an ArrayList to a LinkedList and reduced an operation from 10 minutes to 5 seconds, and multi-threading a ping operation that discovered a class-B network took an operation from 8+ hours to minutes.)

Related

Multithreading with a variable number of tasks

I have a class that needs to compute n tasks as quickly as possible (up to 625). Therefore, I want to utilize multithreading so that these computations are run in parallel. After some research, I found the fork/join framework but have not been able to figure out how to implement this.
For example, let there be some class Foo (which will be used as an object elsewhere) with some methods and variables:
public class Foo {
int n;
int[][] fooArray;
public Foo(int x) {
n = x;
fooArray = new int[n][];
}
public void fooFunction(int x, int y) {
//Assume (n > x >= 0).
fooArray[x] = new int[y];
}
//Implement multithreading here.
}
I read a basic tutorial on the Java documentation that uses ForkJoinPool to split a task into 2 parts and use recursion to pass them into the invokeAll method. Ideally, I want to do something similar except implement it as a subclass of Foo and split the task (in this case, running fooFunction) into n parts. How should I accomplish this?
After days of extensive trial-and-error, I finally figured out how to do this myself:
Let there be some class foo that needs many similar (if not identical) tasks to be done in parallel. Let n be the number of times that this task should be run, where n is more than zero and less than the maximum number of threads that you can create.
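import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class foo {
//do normal class stuff.
public void fooFunction(int n) throws InterruptedException {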
//do normal function things.
executeThreads(n);
}
public void executeThreads(int n) throws InterruptedException {
ExecutorService exec = Executors.newFixedThreadPool(n);
List<Callable<Object>> tasks = new ArrayList<Callable<Object>>();
for(int i = 0; i < n; i++)
tasks.add(Executors.callable(new Task(i)));
exec.invokeAll(tasks);
exec.shutdown();
}
public class Task implements Runnable {
int taskNumber;
public Task(int i) {
taskNumber = i;
}
public void run() {
try {
//this gets run in a thread
System.out.println("Thread number " + taskNumber);
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
This is almost certainly not the most efficient method: because the pool is sized to n, it creates a thread for EVERY task that needs to be done. In other words, the pool never gets to reuse a thread, so it does NOT behave like a real thread pool. Make sure that you do not create too many threads and that the tasks are large enough to justify running them in parallel. If there are better alternatives, please post an answer.
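One possible alternative, sketched under the assumption that the tasks are CPU-bound: size the pool to the number of available cores and let it work through all n tasks. PoolSketch and runTasks are hypothetical names, and the squaring is only a placeholder for the real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSketch {
    // Runs n placeholder tasks on a pool with one thread per available core
    // and returns their results in task order.
    static List<Integer> runTasks(int n) {
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors());
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int taskNumber = i;
                // Placeholder work: a real task would do its computation here.
                tasks.add(() -> taskNumber * taskNumber);
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until every task has completed.
            for (Future<Integer> f : exec.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runTasks(625).size() + " tasks finished");
    }
}
```

With 625 tasks and, say, 8 cores, this keeps 8 threads busy instead of creating 625 of them.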

Unexpected result regarding lambdas and concurrency

I have been looking into the volatile keyword, and how it can be used to manipulate the way memory is stored and accessed from the CPU cache. I am using a simple test program to explore how storing 2 variables, each being accessed by a concurrently executing thread, on different cache lines improves read/write speed. Obviously, this method would never be used in the real world; however, I found the following program very useful in demonstrating and understanding how data is stored in the CPU cache, albeit very crudely:
public class Volatile {
private volatile int a = 0;
private long dummy1 = 0;
private long dummy2 = 0;
private long dummy3 = 0;
private long dummy4 = 0;
private volatile int b = 0;
private static long lastA;
private static long lastB;
public static void main(String[] args) {
final Volatile instance = new Volatile();
new Thread(new Runnable(){
@Override
public void run() {
lastA = System.nanoTime();
while(true){
instance.a++;
if(instance.a % 100_000_000 == 0){
System.out.println("A: " + (System.nanoTime() - lastA) / 1000000 + "ms");
lastA = System.nanoTime();
instance.a = 0;
}
}
}
}).start();
new Thread(new Runnable(){
@Override
public void run() {
lastB = System.nanoTime();
while(true){
instance.b++;
if(instance.b % 100_000_000 == 0){
System.out.println("B: " + (System.nanoTime() - lastB) / 1000000 + "ms");
lastB = System.nanoTime();
instance.b = 0;
}
}
}
}).start();
}
}
Here, the code is padding dummy variables between a and b such that they will be stored on separate cache lines, and the 2 threads accessing them will not clash. The results produced by this program are as expected, and the time taken to increment each variable to 100_000_000 is approximately 600-700 ms for my CPU. Removing the dummy variables increases this time to approximately 3000-4000 ms.
This is where I encounter some behavior I do not understand.
I replicated the code exactly in a separate class, however, I replaced the anonymous inner class passed into the thread creation with a lambda expression:
i.e.
final VolatileLambda instance = new VolatileLambda();
new Thread(() -> {
lastA = System.nanoTime();
while(true){
and
new Thread(() -> {
lastB = System.nanoTime();
while(true){
When I ran this second program with lambdas, I encountered different results to the first program in that the padding variables were no longer sufficient to separate a and b on to separate cache lines, causing the threads to clash and again producing an output of 3000-4000 ms. This was solved by declaring an extra single dummy byte variable after the dummy longs:
private long dummy4 = 0;
private byte dummy5 = 0;
private volatile int b = 0;
The output after declaring this extra byte was then, once again, approximately 600-700 ms.
I have replicated this comparison numerous times on different systems and, strangely, this produces no consistent outcome. Sometimes, using lambdas over anonymous inner classes has no effect on the output, sometimes it does. Even attempting the same comparison on the same system at different times did not always produce the same results.
I'm at a loss trying to explain this behavior, and would greatly appreciate any help. Feel free to ask for clarification on anything, as I probably did not explain this very well.
Thanks!
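One thing that may be relevant here: the JVM does not promise to lay out fields in declaration order, so dummy fields are not a dependable way to pad. A hedged sketch of a more predictable layout keeps the two counters far apart in a single array instead. It assumes array elements are stored contiguously (true of HotSpot as far as I know) and that 128 bytes is enough separation on typical hardware; PaddedCounters, STRIDE and runBoth are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLongArray;

class PaddedCounters {
    // Assumption: 16 longs (128 bytes) between slots is enough to keep them
    // on different cache lines on typical hardware.
    static final int STRIDE = 16;

    // Two counters stored STRIDE elements apart in a single array.
    private final AtomicLongArray slots = new AtomicLongArray(2 * STRIDE);

    void increment(int counter) { slots.incrementAndGet(counter * STRIDE); }

    long get(int counter) { return slots.get(counter * STRIDE); }

    // Increments each counter from its own thread and waits for both to finish.
    static PaddedCounters runBoth(final long reps) {
        final PaddedCounters c = new PaddedCounters();
        Thread[] threads = new Thread[2];
        for (int t = 0; t < threads.length; t++) {
            final int counter = t;
            threads[t] = new Thread(() -> {
                for (long i = 0; i < reps; i++) c.increment(counter);
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        PaddedCounters c = runBoth(100_000_000L);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ms + " ms, a=" + c.get(0) + " b=" + c.get(1));
    }
}
```

Changing STRIDE to 1 puts the two counters next to each other, which gives the unpadded case for comparison. The atomic increments cost more than a plain volatile ++, so the absolute timings will differ from the program above.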

Static block vs static method - initializing static fields

Out of curiosity, I measured the performance between static block and static method initializer. First, I implemented the above mentioned methods in two separate java classes, like so:
First:
class Dummy {
static java.util.List<Integer> lista = new java.util.ArrayList<Integer>();
static {
for(int i=0; i < 1000000; ++i) {
lista.add(new Integer(i));
}
}
}
public class First {
public static void main(String[] args) {
long st = System.currentTimeMillis();
Dummy d = new Dummy();
long end = System.currentTimeMillis() - st;
System.out.println(end);
}
}
Second:
class Muddy {
static java.util.List<Integer> lista = new java.util.ArrayList<Integer>();
public static void initList() {
for(int i=0; i < 1000000; ++i) {
lista.add(new Integer(i));
}
}
}
public class Second {
public static void main(String[] args) {
long st = System.currentTimeMillis();
Muddy.initList();
Muddy m = new Muddy();
long end = System.currentTimeMillis() - st;
System.out.println(end);
}
}
Then I executed this little batch script to measure it 100 times and put the values in a file. batchFile.bat First Second dum.res.txt
After that, I wrote this piece of code to calculate mean value and standard deviation of Dummy's and Muddy's measured values.
This is the result that I've got:
First size: 100 Second size: 100
First Sum: 132 Std. deviation: 13
Second Sum: 112 Std. deviation: 9
And it is similar on my other machines...every time I test it.
Now I'm wondering, why is it so? I checked the bytecode and Second.class has one instruction more (call to static initList()) between calls to System.currentTimeMillis().
They both do the same thing, but why is the First one slower? I can't really reason it out just by looking at the bytecode, since this was my first time touching javap; I don't understand bytecode yet.
I think that the reason why the static block version is slower than the static method version could be due to the different JIT optimization that they get ...
See this interesting article for more interesting information : Java Secret: Are static blocks interpreted?
Here's my guess as to the reason for this:
The initialization you are doing is creating enough objects that it is causing one or more garbage collections.
When the initialization is called from the static block, it is done during the class initialization rather than during simple method execution. During class initialization, the garbage collector may have a little more work to do (because the execution stack is longer, for example) than during simple method execution, even though the contents of the heap are almost the same.
To test this, you could try adding -Xms200m or something to your java commands; this should eliminate the need to garbage collect during the initialization you are doing.
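One detail worth keeping in mind when reading the First measurement: the static block runs as part of class initialization, which the first use of the class triggers, so the timed region also covers initializing the class. A small sketch of that ordering follows; InitOrder, Lazy and LOG are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class InitOrder {
    // Records events in the order they happen.
    static final List<String> LOG = new ArrayList<>();

    static class Lazy {
        static {
            // Runs once, when Lazy is first initialized.
            LOG.add("static block");
        }
        static void touch() { LOG.add("static method"); }
    }

    static List<String> run() {
        LOG.clear();
        LOG.add("before first use");
        Lazy.touch(); // the first use in the JVM triggers the static block, then runs touch()
        Lazy.touch(); // already initialized: only touch() runs
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Run once in a fresh JVM, this prints [before first use, static block, static method, static method].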

The code example which can prove "volatile" declare should be used

Currently I can't understand when we should declare a variable volatile.
I have studied and searched materials about it for a long time, and I know that when a field is declared volatile, the compiler and runtime are put on notice that this variable is shared and that operations on it should not be reordered with other memory operations.
However, I still can't understand in what scenario we should use it. Can someone provide example code which proves that using "volatile" brings a benefit or solves a problem compared to not using it?
Here is an example of why volatile is necessary. If you remove the keyword volatile, thread 1 may never terminate. (When I tested on Java 1.6 Hotspot on Linux, this was indeed the case - your results may vary as the JVM is not obliged to do any caching of variables not marked volatile.)
public class ThreadTest {
volatile boolean running = true;
public void test() {
new Thread(new Runnable() {
public void run() {
int counter = 0;
while (running) {
counter++;
}
System.out.println("Thread 1 finished. Counted up to " + counter);
}
}).start();
new Thread(new Runnable() {
public void run() {
// Sleep for a bit so that thread 1 has a chance to start
try {
Thread.sleep(100);
} catch (InterruptedException ignored) {
// catch block
}
System.out.println("Thread 2 finishing");
running = false;
}
}).start();
}
public static void main(String[] args) {
new ThreadTest().test();
}
}
The following is a canonical example of the necessity of volatile (in this case for the str variable). Without it, HotSpot lifts the access outside the loop (while (str == null)) and run() never terminates. This will happen on most -server JVMs.
public class DelayWrite implements Runnable {
private String str;
void setStr(String str) {this.str = str;}
public void run() {
  while (str == null);
  System.out.println(str);
}
public static void main(String[] args) throws InterruptedException {
  DelayWrite delay = new DelayWrite();
  new Thread(delay).start();
  Thread.sleep(1000);
  delay.setStr("Hello world!!");
}
}
Eric, I have read your comments and one in particular strikes me:
"In fact, I can understand the usage of volatile on the concept level. But for practice, I can't think up the code which has concurrency problems without using volatile"
The obvious problems you can have are compiler reorderings, for example the famous hoisting mentioned by Simon Nickerson. But let's assume that there will be no reorderings, so that the comment could be a valid one.
Another issue that volatile resolves concerns 64-bit variables (long, double). The JLS allows a write to a non-volatile long or double to be treated as two separate 32-bit writes. With concurrent writes, the high 32 bits can then come from one thread while the low 32 bits come from another, leaving you with a long that is neither one value nor the other.
Also, if you look at the memory section of the JLS you will observe it to be a relaxed memory model.
That means writes may not become visible (can be sitting in a store buffer) for a while. This can lead to stale reads. Now you may say that seems unlikely, and it is, but your program is incorrect and has potential to fail.
If you have an int that you are incrementing for the lifetime of an application and you know (or at least think) the int won't overflow, then you don't upgrade it to a long, but it is still possible that it can. In the case of a memory visibility issue, if you think it shouldn't affect you, you should know that it still can, and it can cause errors in your concurrent application that are extremely difficult to identify. Correctness is the reason to use volatile.
The volatile keyword is pretty complex and you need to understand what it does and does not do well before you use it. I recommend reading this language specification section which explains it very well.
They highlight this example:
class Test {
static volatile int i = 0, j = 0;
static void one() { i++; j++; }
static void two() {
System.out.println("i=" + i + " j=" + j);
}
}
What this means is that during one() j is never greater than i. However, another Thread running two() might print out a value of j that is much larger than i because let's say two() is running and fetches the value of i. Then one() runs 1000 times. Then the Thread running two finally gets scheduled again and picks up j which is now much larger than the value of i. I think this example perfectly demonstrates the difference between volatile and synchronized - the updates to i and j are volatile which means that the order that they happen in is consistent with the source code. However the two updates happen separately and not atomically so callers may see values that look (to that caller) to be inconsistent.
In a nutshell: Be very careful with volatile!
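To make the "volatile is not atomic" point concrete, here is a minimal sketch that compares a volatile counter with an AtomicInteger under concurrent increments. LostUpdates and its fields are hypothetical names; how many updates the volatile counter loses, if any, depends on the machine and the run.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LostUpdates {
    static volatile int volatileCount = 0;
    static final AtomicInteger atomicCount = new AtomicInteger();

    // Starts the given number of threads, each doing perThread increments
    // on both counters, and waits for all of them to finish.
    static void run(int threads, final int perThread) {
        volatileCount = 0;
        atomicCount.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    volatileCount++;               // read, add, write: separate steps
                    atomicCount.incrementAndGet(); // one atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        run(4, 1_000_000);
        System.out.println("volatile: " + volatileCount + " (may be below 4000000)");
        System.out.println("atomic:   " + atomicCount.get());
    }
}
```

The atomic counter always ends at threads * perThread, while the volatile counter can end lower because two threads may read the same value before either writes it back.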
A minimalist example in Java 8: if you remove the volatile keyword, it may never end.
import java.util.concurrent.TimeUnit;
public class VolatileExample {
private static volatile boolean BOOL = true;
public static void main(String[] args) throws InterruptedException {
new Thread(() -> { while (BOOL) { } }).start();
TimeUnit.MILLISECONDS.sleep(500);
BOOL = false;
}
}
To expand on the answer from @jed-wesley-smith: if you drop this into a new project, take out the volatile keyword from iterationCount, and run it, it will never stop. Adding the volatile keyword to either str or iterationCount causes the code to end successfully. I've also noticed that the sleep can't be smaller than 5, using Java 8, but your mileage may vary with other JVMs / Java versions.
public static class DelayWrite implements Runnable
{
private String str;
public volatile int iterationCount = 0;
void setStr(String str)
{
this.str = str;
}
public void run()
{
while (str == null)
{
iterationCount++;
}
System.out.println(str + " after " + iterationCount + " iterations.");
}
}
public static void main(String[] args) throws InterruptedException
{
System.out.println("This should print 'Hello world!' and exit if str or iterationCount is volatile.");
DelayWrite delay = new DelayWrite();
new Thread(delay).start();
Thread.sleep(5);
System.out.println("Thread sleep gave the thread " + delay.iterationCount + " iterations.");
delay.setStr("Hello world!!");
}

Java 6 Threading output is not Asynchronous?

This code should produce even and uneven output because there is no synchronized on any methods. Yet the output on my JVM is always even. I am really confused as this example comes straight out of Doug Lea.
public class TestMethod implements Runnable {
private int index = 0;
public void testThisMethod() {
index++;
index++;
System.out.println(Thread.currentThread().toString() + " "
+ index );
}
public void run() {
while(true) {
this.testThisMethod();
}
}
public static void main(String args[]) {
int i = 0;
TestMethod method = new TestMethod();
while(i < 20) {
new Thread(method).start();
i++;
}
}
}
Output
Thread[Thread-8,5,main] 135134
Thread[Thread-8,5,main] 135136
Thread[Thread-8,5,main] 135138
Thread[Thread-8,5,main] 135140
Thread[Thread-8,5,main] 135142
Thread[Thread-8,5,main] 135144
I tried with volatile and got the following (with an if to print only if odd):
Thread[Thread-12,5,main] 122229779
Thread[Thread-12,5,main] 122229781
Thread[Thread-12,5,main] 122229783
Thread[Thread-12,5,main] 122229785
Thread[Thread-12,5,main] 122229787
Answer to comments:
The index is in fact shared, because we have one TestMethod instance but many Threads that call testThisMethod() on the one TestMethod that we have.
Code (no changes besides the mentioned above):
public class TestMethod implements Runnable {
volatile private int index = 0;
public void testThisMethod() {
index++;
index++;
if(index % 2 != 0){
System.out.println(Thread.currentThread().toString() + " "
+ index );
}
}
public void run() {
while(true) {
this.testThisMethod();
}
}
public static void main(String args[]) {
int i = 0;
TestMethod method = new TestMethod();
while(i < 20) {
new Thread(method).start();
i++;
}
}
}
First of all: as others have noted, there's no guarantee at all that your threads get interrupted between the two increment operations.
Note that printing to System.out pretty likely forces some kind of synchronization on your threads, so your threads are pretty likely to have just started a time slice when they return from that, so they will probably complete the two incrementation operations and then wait for the shared resource for System.out.
Try replacing the System.out.println() with something like this:
int snapshot = index;
if (snapshot % 2 != 0) {
System.out.println("Oh noes! " + snapshot);
}
You don't know that. The point of automatic scheduling is that it makes no guarantees. It might treat two threads that run the same code completely different. Or completely the same. Or completely the same for an hour and then suddenly different...
The point is, even if you fix the problems mentioned in the other answers, you still cannot rely on things coming out a particular way; you must always be prepared for any possible interleaving that the Java memory and threading model allows, and that includes the possibility that the println always happens after an even number of increments, even if that seems unlikely to you on the face of it.
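If the goal is output that is guaranteed to be even rather than merely likely to be, one option is to make the two increments and the read a single synchronized step. This is a sketch of that idea, not something the answers here spell out; EvenOnly, bumpTwice and countOdd are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;

class EvenOnly {
    private int index = 0;

    // Both increments and the read happen under the same lock, so no other
    // thread can observe the intermediate odd value through this method.
    synchronized int bumpTwice() {
        index++;
        index++;
        return index;
    }

    // Runs several threads against one shared instance and counts how many
    // odd values bumpTwice() returned.
    static int countOdd(int threads, final int perThread) {
        final EvenOnly shared = new EvenOnly();
        final AtomicInteger odd = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    if (shared.bumpTwice() % 2 != 0) odd.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return odd.get();
    }

    public static void main(String[] args) {
        System.out.println("odd values seen: " + countOdd(20, 100_000));
    }
}
```

Here the count of odd values is 0 on every run, because the lock rules out the interleaving instead of relying on the scheduler.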
The result is exactly as I would expect. index is being incremented twice between outputs, and there is no interaction between threads.
To turn the question around - why would you expect odd outputs?
EDIT: Whoops. I wrongly assumed a new runnable was being created per Thread, and therefore there was a distinct index per thread, rather than shared. Disturbing how such a flawed answer got 3 upvotes though...
You have not marked index as volatile. This means that the compiler is allowed to optimize accesses to it, and it probably merges your 2 increments to one addition.
You get the output of the very first thread you start, because this thread loops and gives no chance to other threads to run.
So you should Thread.sleep() or (not recommended) Thread.yield() in the loop.
