Should I use volatile if I use synchronized

Should I use volatile if I use synchronized - java

Should I use volatile every time if I use synchronized dealing with some mutable state.
synchronized makes me (state/threads) safe
volatile makes threads updated about shared mutable state.
Then I should put volatile everywhere If I care about threads to be updated?
EDIT: There are two use cases:
1.
1000 threads read and write to this object (they hope they are updated about state a):
class A {
private int a;
public synchronized int getA() {...}
public void setA(int a) {...}
}
2.
There are 1000 threads of ThreadA. They hope they are updated about state a
class ThreadA extends Thread {
private int a;
public void run() { synchronized(a) { ... } }
}

Like many performance questions, the real issue is simplicity and clarity. I suggest using synchronized or volatile as using both is likely to be confusing. Using both is redundant and thus slightly inefficient, but unlikely to be enough to matter. I would worry more about making the code as easy to understand as possible, and do no more than you need to do.
In your first case, only volatile makes sense (or using synchronized consistently)
class A {
private volatile int a;
public int getA() {...}
public void setA(int a) {...}
}
In your second case synchronizing on a local object makes no sense, you can remove it. I wouldn't extends Thread either, this is bad practice.
While you might have 1000 threads you are only likely to have 8-16 CPUs Have so many CPU bound threads is a bad idea. Reduce the number of threads and you are likely to improve performance by reducing the overhead.
You should design them to be as independent as possible because if you can't it is likely that a single thread will be faster as it won't have the cache consistency over head.
IMHO Using a enum is simpler than using Guava MemorizeSupplier but which is faster
public class GuavaMain {
interface AAA {
int hashCode();
}
enum Singleton implements AAA {
INSTANCE
}
public static void main(String... ignored) {
Supplier<AAA> memoize = Suppliers.memoize(new Supplier<AAA>() {
#Override
public AAA get() {
return new AAA() {
};
}
});
for (int j = 0; j < 10; j++) {
int runs = 5000;
long time1 = System.nanoTime();
for (int i = 0; i < runs; i++) {
// call a method on out lazy instance
Singleton.INSTANCE.hashCode();
}
long time2 = System.nanoTime();
for (int i = 0; i < runs; i++) {
// call a method on out lazy instance
memoize.get().hashCode();
}
long time3 = System.nanoTime();
System.out.printf("enum took %,d ns and memorize took %,d ns avg%n",
(time2 - time1) / runs, (time3 - time2) / runs);
}
}
}
prints
enum took 179 ns and memorize took 301 ns avg
enum took 74 ns and memorize took 97 ns avg
enum took 62 ns and memorize took 175 ns avg
enum took 58 ns and memorize took 146 ns avg
enum took 58 ns and memorize took 147 ns avg
enum took 56 ns and memorize took 111 ns avg
enum took 36 ns and memorize took 86 ns avg
enum took 36 ns and memorize took 84 ns avg
enum took 36 ns and memorize took 82 ns avg
enum took 36 ns and memorize took 82 ns avg

If you use volatile everywhere, you will be essentially introducing a lot of synchronization points in your code, which decrements efficiency.
When coding for concurrency, your best choice is to isolate shared mutability into small sections, or even eliminate shared state as much as possible.

Related

Multi threaded matrix multiplication performance issue

I am using java for multi threaded multiplication. I am practicing multi threaded programming. Following is the code that I took from another post of stackoverflow.
public class MatMulConcur {
private final static int NUM_OF_THREAD =1 ;
private static Mat matC;
public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}
private static Mat mul(Mat matA,Mat matB) {
int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;
Worker[] myWorker = new Worker[NUM_OF_THREAD];
for (int j = 0; j < NUM_OF_THREAD; j++) {
if (j<NUM_OF_THREAD-1){
numRowForThread = (numRowA / NUM_OF_THREAD);
} else {
numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
}
myWorker[j] = new Worker(startRow, startRow+numRowForThread,matA,matB);
myWorker[j].start();
startRow += numRowForThread;
}
for (Worker worker : myWorker) {
try {
worker.join();
} catch (InterruptedException e) {
}
}
return matC;
}
private static class Worker extends Thread {
private int startRow, stopRow;
private Mat matA, matB;
public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
super();
this.startRow = startRow;
this.stopRow = stopRow;
this.matA = matA;
this.matB = matB;
}
#Override
public void run() {
for (int i = startRow; i < stopRow; i++) {
for (int j = 0; j < matB.getNColumns(); j++) {
double sum = 0;
for (int k = 0; k < matA.getNColumns(); k++) {
sum += matA.get(i, k) * matB.get(k, j);
}
matC.set(i, j, sum);
}
}
}
}
I ran this program for 1,10,20,...,100 threads but performance is decreasing instead. Following is the time table
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any Idea?

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so1.
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1 2 and 5 ... depending on how many cores are used, how big the CPUs memory caches are, how big the matrix is, and other factors.
1 - Indeed, if this was true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.

Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will decrease performance. It's like starting more and more applications, you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

Java - Why does this basic ticking class use up so much cpu?

Details:
For a lot the programs that I develop I use this code (or some slight variant) to "tick" a method every so often, set to the varaible tps (if set to 32 it calls the method tick 32 times every second). Its very essential so I can't remove it from my code as animations and various other parts will break.
Unfortunately it seems to use a sizable amount of cpu usage for a reason I can't figure out. A while back I was thinking about using thread.sleep() to fix this issue but according to this post; it's rather innacurate which makes it unfeasible as this requires reasonably accurate timing.
It doesn't use that much cpu, around 6-11% cpu for a ryzen 1700 in my admittedly short testing, but it's still quite a lot considering how little it's doing. Is there a less cpu intensive method of completing this? Or will the timing be to innacurate for regular usage.
public class ThreadTest {
public ThreadTest() {
int tps = 32;
boolean threadShouldRun = true;
long lastTime = System.nanoTime();
double ns = 1000000000 / tps;
double delta = 0;
long now;
while (threadShouldRun) {
now = System.nanoTime();
delta += (now - lastTime) / ns;
lastTime = now;
while ((delta >= 1) && (threadShouldRun)) {
tick();
delta--;
}
}
}
public void tick() {
}
public static void main(String[] args) {
new ThreadTest();
}
}
Basic summary: The code above uses 6-11% cpu with a ryzen 1700, is there a way in java to accomplish the same code with less cpu usage and keeping reasonable timing when executing code a certain amount of times per second.

One easy alternative that shouldn't use as much CPU is to use a ScheduledExecutorService. For example:
public static void main(String[] args) {
ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
executor.scheduleAtFixedRate(() -> {
}, 0, 31250, TimeUnit.MICROSECONDS);
}
Note that 31250 represents the value of 1/32 seconds converted to microseconds, as that parameter accepts a long.

Performance of synchronize section in Java

I had a small dispute over performance of synchronized block in Java. This is a theoretical question, which does not affect real life application.
Consider single-thread application, which uses locks and synchronize sections. Does this code work slower than the same code without synchronize sections? If so, why? We do not discuss concurrency, since it’s only single thread application
Update
Found interesting benchmark testing it. But it's from 2001. Things could have changed dramatically in the latest version of JDK

Single-threaded code will still run slower when using synchronized blocks. Obviously you will not have other threads stalled while waiting for other threads to finish, however you will have to deal with the other effects of synchronization, namely cache coherency.
Synchronized blocks are not only used for concurrency, but also visibility. Every synchronized block is a memory barrier: the JVM is free to work on variables in registers, instead of main memory, on the assumption that multiple threads will not access that variable. Without synchronization blocks, this data could be stored in a CPU's cache and different threads on different CPUs would not see the same data. By using a synchronization block, you force the JVM to write this data to main memory for visibility to other threads.
So even though you're free from lock contention, the JVM will still have to do housekeeping in flushing data to main memory.
In addition, this has optimization constraints. The JVM is free to reorder instructions in order to provide optimization: consider a simple example:
foo++;
bar++;
versus:
foo++;
synchronized(obj)
{
bar++;
}
In the first example, the compiler is free to load foo and bar at the same time, then increment them both, then save them both. In the second example, the compiler must perform the load/add/save on foo, then perform the load/add/save on bar. Thus, synchronization may impact the ability of the JRE to optimize instructions.
(An excellent book on the Java Memory Model is Brian Goetz's Java Concurrency In Practice.)

There are 3 type of locking in HotSpot
Fat: JVM relies on OS mutexes to acquire lock.
Thin: JVM is using CAS algorithm.
Biased: CAS is rather expensive operation on some of the architecture. Biased locking - is special type of locking optimized for scenario when only one thread is working on object.
By default JVM uses thin locking. Later if JVM determines that there is no contention thin locking is converted to biased locking. Operation that changes type of the lock is rather expensive, hence JVM does not apply this optimization immediately. There is special JVM option - XX:BiasedLockingStartupDelay=delay which tells JVM when this kind of optimization should be applied.
Once biased, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions.
Answer to the question: it depends. But if biased, the single threaded code with locking and without locking has average same performance.
Biased Locking in HotSpot - Dave Dice's Weblog
Synchronization and Object Locking - Thomas Kotzmann and Christian Wimmer

There is some overhead in acquiring a non-contested lock, but on modern JVMs it is very small.
A key run-time optimization that's relevant to this case is called "Biased Locking" and is explained in the Java SE 6 Performance White Paper.
If you wanted to have some performance numbers that are relevant to your JVM and hardware, you could construct a micro-benchmark to try and measure this overhead.

Using locks when you don't need to will slow down your application. It could be too small to measure or it could be surprisingly high.
IMHO Often the best approach is to use lock free code in a single threaded program to make it clear this code is not intended to be shared across thread. This could be more important for maintenance than any performance issues.
public static void main(String... args) throws IOException {
for (int i = 0; i < 3; i++) {
perfTest(new Vector<Integer>());
perfTest(new ArrayList<Integer>());
}
}
private static void perfTest(List<Integer> objects) {
long start = System.nanoTime();
final int runs = 100000000;
for (int i = 0; i < runs; i += 20) {
// add items.
for (int j = 0; j < 20; j+=2)
objects.add(i);
// remove from the end.
while (!objects.isEmpty())
objects.remove(objects.size() - 1);
}
long time = System.nanoTime() - start;
System.out.printf("%s each add/remove took an average of %.1f ns%n", objects.getClass().getSimpleName(), (double) time/runs);
}
prints
Vector each add/remove took an average of 38.9 ns
ArrayList each add/remove took an average of 6.4 ns
Vector each add/remove took an average of 10.5 ns
ArrayList each add/remove took an average of 6.2 ns
Vector each add/remove took an average of 10.4 ns
ArrayList each add/remove took an average of 5.7 ns
From a performance point of view, if 4 ns is important to you, you have to use the non-synchronized version.
For 99% of use cases, the clarity of the code is more important than performance. Clear, simple code often performs reasonably good as well.
BTW: I am using a 4.6 GHz i7 2600 with Oracle Java 7u1.
For comparison if I do the following where perfTest1,2,3 are identical.
perfTest1(new ArrayList<Integer>());
perfTest2(new Vector<Integer>());
perfTest3(Collections.synchronizedList(new ArrayList<Integer>()));
I get
ArrayList each add/remove took an average of 2.6 ns
Vector each add/remove took an average of 7.5 ns
SynchronizedRandomAccessList each add/remove took an average of 8.9 ns
If I use a common perfTest method it cannot inline the code as optimally and they are all slower
ArrayList each add/remove took an average of 9.3 ns
Vector each add/remove took an average of 12.4 ns
SynchronizedRandomAccessList each add/remove took an average of 13.9 ns
Swapping the order of tests
ArrayList each add/remove took an average of 3.0 ns
Vector each add/remove took an average of 39.7 ns
ArrayList each add/remove took an average of 2.0 ns
Vector each add/remove took an average of 4.6 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.5 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.4 ns
ArrayList each add/remove took an average of 2.4 ns
Vector each add/remove took an average of 4.6 ns
one at a time
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 2.3 ns
ArrayList each add/remove took an average of 2.2 ns
ArrayList each add/remove took an average of 2.4 ns
and
Vector each add/remove took an average of 28.4 ns
Vector each add/remove took an average of 37.4 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns

Assuming you're using the HotSpot VM, I believe the JVM is able to recognize that there is no contention for any resources within the synchronized block and treat it as "normal" code.

This sample code (with 100 threads making 1,000,000 iterations each one) demonstrates the performance difference between avoiding and not avoiding a synchronized block.
Output:
Total time(Avoid Sync Block): 630ms
Total time(NOT Avoid Sync Block): 6360ms
Total time(Avoid Sync Block): 427ms
Total time(NOT Avoid Sync Block): 6636ms
Total time(Avoid Sync Block): 481ms
Total time(NOT Avoid Sync Block): 5882ms
Code:
import org.apache.commons.lang.time.StopWatch;
public class App {
public static int countTheads = 100;
public static int loopsPerThead = 1000000;
public static int sleepOfFirst = 10;
public static int runningCount = 0;
public static Boolean flagSync = null;
public static void main( String[] args )
{
for (int j = 0; j < 3; j++) {
App.startAll(new App.AvoidSyncBlockRunner(), "(Avoid Sync Block)");
App.startAll(new App.NotAvoidSyncBlockRunner(), "(NOT Avoid Sync Block)");
}
}
public static void startAll(Runnable runnable, String description) {
App.runningCount = 0;
App.flagSync = null;
Thread[] threads = new Thread[App.countTheads];
StopWatch sw = new StopWatch();
sw.start();
for (int i = 0; i < threads.length; i++) {
threads[i] = new Thread(runnable);
}
for (int i = 0; i < threads.length; i++) {
threads[i].start();
}
do {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
e.printStackTrace();
}
} while (runningCount != 0);
System.out.println("Total time"+description+": " + (sw.getTime() - App.sleepOfFirst) + "ms");
}
public static void commonBlock() {
String a = "foo";
a += "Baa";
}
public static synchronized void incrementCountRunning(int inc) {
runningCount = runningCount + inc;
}
public static class NotAvoidSyncBlockRunner implements Runnable {
public void run() {
App.incrementCountRunning(1);
for (int i = 0; i < App.loopsPerThead; i++) {
synchronized (App.class) {
if (App.flagSync == null) {
try {
Thread.sleep(App.sleepOfFirst);
} catch (InterruptedException e) {
e.printStackTrace();
}
App.flagSync = true;
}
}
App.commonBlock();
}
App.incrementCountRunning(-1);
}
}
public static class AvoidSyncBlockRunner implements Runnable {
public void run() {
App.incrementCountRunning(1);
for (int i = 0; i < App.loopsPerThead; i++) {
// THIS "IF" MAY SEEM POINTLESS, BUT IT AVOIDS THE NEXT
//ITERATION OF ENTERING INTO THE SYNCHRONIZED BLOCK
if (App.flagSync == null) {
synchronized (App.class) {
if (App.flagSync == null) {
try {
Thread.sleep(App.sleepOfFirst);
} catch (InterruptedException e) {
e.printStackTrace();
}
App.flagSync = true;
}
}
}
App.commonBlock();
}
App.incrementCountRunning(-1);
}
}
}

Performance costs of casting a concrete collection to its interface

When I write some API, it sometimes will use Collection<Model> to be the parameter. Of course, you can use ArrayList if you know ArrayList is already enough to handle all the use case.
My question is is there any considerable performance cost when for example cast the ArrayList<Model> to Collection<Model> when passing parameter.
Will the collection size also impact the performance of casting? Any advice?
Thanks for Peter's answer.
I think the answer is pretty enough to stop me to waste time on changing it.
EDIT
As said in accepted answer, the cost is actually paid in the calling of interface methods.
it's not free to keep this kind of flexibity. But the cost is not so considerable.

Like most performance questions the answer is; write cleare and simple code and the application usually performs okay as well.
A cast to an interface can take around 10 ns (less than a method call) Depending on how the code is optimised, it might be too small to measure.
A cast between generic types is a compiler time check, nothing actually happens at runtime.
When you cast, it is the reference type which changes, all references are the same size. The size of what they point to doesn't matter.
BTW: All ArrayList objects are the same size, All LinkedList objects are the same size all HashMap objects are the same size etc. They can reference an array which can be different sizes in different collection.
You can see a difference in code which hasn't been JITed.
public static void main(String... args) throws Throwable {
ArrayList<Integer> ints = new ArrayList<>();
for(int i=0;i<100;i++) ints.add(i);
sumSize(ints, 5000);
castSumSize(ints, 5000);
sumSize(ints, 5000);
castSumSize(ints, 5000);
}
public static long sumSize(ArrayList<Integer> ints, int runs) {
long sum = 0;
long start = System.nanoTime();
for(int i=0;i<runs;i++)
sum += ints.size();
long time = System.nanoTime() - start;
System.out.printf("sumSize: Took an average of %,d ns%n", time/runs);
return sum;
}
public static long castSumSize(ArrayList<Integer> ints, int runs) {
long sum = 0;
long start = System.nanoTime();
for(int i=0;i<runs;i++)
sum += ((Collection) ints).size();
long time = System.nanoTime() - start;
System.out.printf("castSumSize: Took an average of %,d ns%n", time/runs);
return sum;
}
prints
sumSize: Took an average of 31 ns
castSumSize: Took an average of 37 ns
sumSize: Took an average of 28 ns
castSumSize: Took an average of 34 ns
however the difference is likely to be due to the method calls being more expensive. The only bytecode difference is
invokevirtual #9; //Method java/util/ArrayList.size:()I
and
invokeinterface #15, 1; //InterfaceMethod java/util/Collection.size:()I
Once the JIT has optimised the code there isn't much difference. Run long enough, the time drops to 0 ns for the -server JVM because it detects the loop doesn't do anything. ;)

Compared to doing anything with any object: absolutely none.
And even if it did, be sure any program involves things taking millions more time!

Collection is an interface. You always have to provide a concrete implementation such as ArrayList.
Commonly it would be this
Collection<Model> myCollection = new ArrayList<Model>();
Designing to interfaces is actually good practice, so use Collection as your method parameter.

Java Reflection Performance Issue

I know there's a lot of topics talking about Reflection performance.
Even official Java docs says that Reflection is slower, but I have this code:
public class ReflectionTest {
public static void main(String[] args) throws Exception {
Object object = new Object();
Class<Object> c = Object.class;
int loops = 100000;
long start = System.currentTimeMillis();
Object s;
for (int i = 0; i < loops; i++) {
s = object.toString();
System.out.println(s);
}
long regularCalls = System.currentTimeMillis() - start;
java.lang.reflect.Method method = c.getMethod("toString");
start = System.currentTimeMillis();
for (int i = 0; i < loops; i++) {
s = method.invoke(object);
System.out.println(s);
}
long reflectiveCalls = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for (int i = 0; i < loops; i++) {
method = c.getMethod("toString");
s = method.invoke(object);
System.out.println(s);
}
long reflectiveLookup = System.currentTimeMillis() - start;
System.out.println(loops + " regular method calls:" + regularCalls
+ " milliseconds.");
System.out.println(loops + " reflective method calls without lookup:"
+ reflectiveCalls+ " milliseconds.");
System.out.println(loops + " reflective method calls with lookup:"
+ reflectiveLookup + " milliseconds.");
}
}
That I don't think is a valid benchmark, but at least should show some difference.
I executed it waiting to see the reflection normal calls being a bit slower than regular ones.
But this prints this:
100000 regular method calls:1129 milliseconds.
100000 reflective method calls without lookup:910 milliseconds.
100000 reflective method calls with lookup:994 milliseconds.
Just for note, first I executed it without that bunch of sysouts, and then I realized that some JVM optimization are just making it goes faster, so I added these printls to see if reflection was still faster.
The result without sysouts are:
100000 regular method calls:68 milliseconds.
100000 reflective method calls without lookup:48 milliseconds.
100000 reflective method calls with lookup:168 milliseconds.
I saw over internet that the same test executed on old JVMs make the reflective without lookup are two times slower than regular calls, and that speed falls over new updates.
If anyone can execute it and say me I'm wrong, or at least show me if there's something different than the past that make it faster.
Following instructions, I ran every loop separated and the result are (without sysouts)
100000 regular method calls:70 milliseconds.
100000 reflective method calls without lookup:120 milliseconds.
100000 reflective method calls with lookup:129 milliseconds.

Never performance test different bits of code in the same "run". The JVM has various optimisations that mean it though the end result is the same, how the internals are performed may differ. In more concrete terms, during your test the JVM may have noticed you are calling Object.toString a lot and have started to inline the method calls to Object.toString. It may have started to perform loop unfolding. Or there could have been a garbage collection in the first loop but not the second or third loops.
To get a more meaningful, but still not totally accurate picture you should separate your test into three separate programs.
The results on my computer (with no printing and 1,000,000 runs each)
All three loops run in same program
1000000 regular method calls: 490 milliseconds.
1000000 reflective method calls without lookup: 393 milliseconds.
1000000 reflective method calls with loopup: 978 milliseconds.
Loops run in separate programs
1000000 regular method calls: 475 milliseconds.
1000000 reflective method calls without lookup: 555 milliseconds.
1000000 reflective method calls with loopup: 1160 milliseconds.

There's an article by Brian Goetz on microbenchmarks that's worth reading. It looks like you're not doing anything to warm up the JVM (meaning give it a chance to do whatever inlining or other optimizations it's going to do) before doing your measurements, so it's likely the non-reflective test is still not warmed-up yet, and that could skew your numbers.

When you have multiple long running loops, the first loop can trigger the method to compile resulting in the later loops being optimised from the start. However the optimisation can be sub-optimal as it has no runtime information for those loops. The toString is relatively expensive and couple be taking longer than the reflections calls.
You don't need separate programs to avoid loop being optimised due to an earlier loop. You can run them in different methods.
The results I get are
Average regular method calls:2 ns.
Average reflective method calls without lookup:10 ns.
Average reflective method calls with lookup:240 ns.
The code
import java.lang.reflect.Method;
public class ReflectionTest {
public static void main(String[] args) throws Exception {
int loops = 1000 * 1000;
Object object = new Object();
long start = System.nanoTime();
Object s;
testMethodCall(object, loops);
long regularCalls = System.nanoTime() - start;
java.lang.reflect.Method method = Object.class.getMethod("getClass");
method.setAccessible(true);
start = System.nanoTime();
testInvoke(object, loops, method);
long reflectiveCalls = System.nanoTime() - start;
start = System.nanoTime();
testGetMethodInvoke(object, loops);
long reflectiveLookup = System.nanoTime() - start;
System.out.println("Average regular method calls:"
+ regularCalls / loops + " ns.");
System.out.println("Average reflective method calls without lookup:"
+ reflectiveCalls / loops + " ns.");
System.out.println("Average reflective method calls with lookup:"
+ reflectiveLookup / loops + " ns.");
}
private static Object testMethodCall(Object object, int loops) {
Object s = null;
for (int i = 0; i < loops; i++) {
s = object.getClass();
}
return s;
}
private static Object testInvoke(Object object, int loops, Method method) throws Exception {
Object s = null;
for (int i = 0; i < loops; i++) {
s = method.invoke(object);
}
return s;
}
private static Object testGetMethodInvoke(Object object, int loops) throws Exception {
Method method;
Object s = null;
for (int i = 0; i < loops; i++) {
method = Object.class.getMethod("getClass");
s = method.invoke(object);
}
return s;
}
}

Micro-benchmarks like this are never going to be accurate at all - as the VM "warms up" it'll inline bits of code and optimise bits of code as it goes along, so the same thing executed 2 minutes into a program could vastly outperform it right at the start.
In terms of what's happening here, my guess is that the first "normal" method call block warms it up, so the reflective blocks (and indeed all subsequent calls) would be faster. The only overhead added through reflectively calling a method that I can see is looking up the pointer to that method, which is a nanosecond-scale operation anyway and would be easily cached by the JVM. The rest would be on how the VM is warmed up, which it is by the time you reach the reflective calls.

There is no inherent reason why reflective call should be slower than a normal call. JVM can optimize them into the same thing.
Practically, human resources are limited, and they had to optimize normal calls first. As time passes by they can work on optimizing reflective calls; especially when reflection becomes more and more popular.

I have been writing my own micro-benchmark, without loops, and with System.nanoTime():
public static void main(String[] args) throws NoSuchMethodException, IllegalArgumentException, IllegalAccessException, InvocationTargetException
{
Object obj = new Object();
Class<Object> objClass = Object.class;
String s;
long start = System.nanoTime();
s = obj.toString();
long directInvokeEnd = System.nanoTime();
System.out.println(s);
long methodLookupStart = System.nanoTime();
java.lang.reflect.Method method = objClass.getMethod("toString");
long methodLookupEnd = System.nanoTime();
s = (String) (method.invoke(obj));
long reflectInvokeEnd = System.nanoTime();
System.out.println(s);
System.out.println(directInvokeEnd - start);
System.out.println(methodLookupEnd - methodLookupStart);
System.out.println(reflectInvokeEnd - methodLookupEnd);
}
I have been executing that in Eclipse on my machine a dozen times, and the results vary quite a bit, but here is what I typically get:
the direct method invocation clocks at 40-50 microseconds
method lookup clocks at 150-200 microseconds
reflective invocation with the method variable clocks at 250-310 microseconds.
Now, do not forget the caveats on microbenchmarks described in Nathan's reply - there are certainly a lot of flaws in that micro benchmark - and trust the documentation if they say that reflection is a LOT slower than direct invocation.

It strikes me that you have placed a "System.out.println(s)" call inside your inner benchmark loop.
Since performing IO is bound to be slow, it actually "swallows up" your benchmark and the overhead of the invoke becomes negligible.
Try removing the "println()" call and running code like this, I'm sure you'd be surprised by the result (some of the silly calculations are needed to avoid the compiler optimizing away the calls altogether):
public class Experius
{
public static void main(String[] args) throws Exception
{
Experius a = new Experius();
int count = 10000000;
int v = 0;
long tm = System.currentTimeMillis();
for ( int i = 0; i < count; ++i )
{
v = a.something(i + v);
++v;
}
tm = System.currentTimeMillis() - tm;
System.out.println("Time: " + tm);
tm = System.currentTimeMillis();
Method method = Experius.class.getMethod("something", Integer.TYPE);
for ( int i = 0; i < count; ++i )
{
Object o = method.invoke(a, i + v);
++v;
}
tm = System.currentTimeMillis() - tm;
System.out.println("Time: " + tm);
}
public int something(int n)
{
return n + 5;
}
}
-- TR

Even if you look up the method in both cases (i.e. before 2nd and 3rd loop),
the first lookup takes way less time than the second lookup, which should have been the other way around and less than a regular method call on my machine.
Neverthless, if you use the 2nd loop with method lookup, and System.out.println statement, I get this:
regular call : 740 ms
look up(2nd loop) : 640 ms
look up ( 3rd loop) : 800 ms
Without System.out.println statement, I get:
regular call : 78 ms
look up (2nd) : 37 ms
look up (3rd ) : 112 ms

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.