Based on performance alone, approximately how many "simple" lines of Java cost as much as the overhead of making a JNI call?
Or to try to express the question in a more concrete way, if a simple java operation such as
someIntVar1 = someIntVar2 + someIntVar3;
was given a "CPU work" index of 1, what would be the typical (ballpark) "CPU work" index of the overhead of making the JNI call?
This question ignores the time taken waiting for the native code to execute. In telephonic parlance, it is strictly about the "flag fall" part of the call, not the "call rate".
The reason for asking this question is to have a "rule of thumb" for deciding when it is worth coding a JNI call, given that you know the native cost (from direct testing) and the Java cost of a given operation. It could help you quickly avoid the hassle of coding the JNI call only to find that the callout overhead consumed any benefit of using native code.
Edit:
Some folks are getting hung up on variations in CPU, RAM etc. These are all virtually irrelevant to the question - I'm asking for the relative cost to lines of java code. If CPU and RAM are poor, they are poor for both java and JNI so environmental considerations should balance out. The JVM version falls into the "irrelevant" category too.
This question isn't asking for an absolute timing in nanoseconds, but rather a ball park "work effort" in units of "lines of simple java code".
Quick profiler test yields:
Java class:
public class Main {
private static native int zero();
private static int testNative() {
return Main.zero();
}
private static int test() {
return 0;
}
public static void main(String[] args) {
testNative();
test();
}
static {
System.loadLibrary("foo");
}
}
C library:
#include <jni.h>
#include "Main.h"
JNIEXPORT jint JNICALL
Java_Main_zero(JNIEnv *env, jclass cls)
{
return 0;
}
Results:
System details:
java version "1.7.0_09"
OpenJDK Runtime Environment (IcedTea7 2.3.3) (7u9-2.3.3-1)
OpenJDK Server VM (build 23.2-b09, mixed mode)
Linux visor 3.2.0-4-686-pae #1 SMP Debian 3.2.32-1 i686 GNU/Linux
Update: Caliper micro-benchmarks for x86 (32/64 bit) and ARMv6 are as follows:
Java class:
public class Main extends SimpleBenchmark {
private static native int zero();
private Random random;
private int[] primes;
public int timeJniCall(int reps) {
int r = 0;
for (int i = 0; i < reps; i++) r += Main.zero();
return r;
}
public int timeAddIntOperation(int reps) {
int p = primes[random.nextInt(1) + 54]; // >= 257
for (int i = 0; i < reps; i++) p += i;
return p;
}
public long timeAddLongOperation(int reps) {
long p = primes[random.nextInt(3) + 54]; // >= 257
long inc = primes[random.nextInt(3) + 4]; // >= 11
for (int i = 0; i < reps; i++) p += inc;
return p;
}
@Override
protected void setUp() throws Exception {
random = new Random();
primes = getPrimes(1000);
}
public static void main(String[] args) {
Runner.main(Main.class, args);
}
public static int[] getPrimes(int limit) {
// returns array of primes under $limit, off-topic here
}
static {
System.loadLibrary("foo");
}
}
Results (x86/i7500/Hotspot/Linux):
Scenario{benchmark=JniCall} 11.34 ns; σ=0.02 ns # 3 trials
Scenario{benchmark=AddIntOperation} 0.47 ns; σ=0.02 ns # 10 trials
Scenario{benchmark=AddLongOperation} 0.92 ns; σ=0.02 ns # 10 trials
benchmark ns linear runtime
JniCall 11.335 ==============================
AddIntOperation 0.466 =
AddLongOperation 0.921 ==
Results (amd64/phenom 960T/Hotspot/Linux):
Scenario{benchmark=JniCall} 6.66 ns; σ=0.22 ns # 10 trials
Scenario{benchmark=AddIntOperation} 0.29 ns; σ=0.00 ns # 3 trials
Scenario{benchmark=AddLongOperation} 0.26 ns; σ=0.00 ns # 3 trials
benchmark ns linear runtime
JniCall 6.657 ==============================
AddIntOperation 0.291 =
AddLongOperation 0.259 =
Results (armv6/BCM2708/Zero/Linux):
Scenario{benchmark=JniCall} 678.59 ns; σ=1.44 ns # 3 trials
Scenario{benchmark=AddIntOperation} 183.46 ns; σ=0.54 ns # 3 trials
Scenario{benchmark=AddLongOperation} 199.36 ns; σ=0.65 ns # 3 trials
benchmark ns linear runtime
JniCall 679 ==============================
AddIntOperation 183 ========
AddLongOperation 199 ========
To summarize things a bit, it seems that a JNI call is roughly equivalent to 10-25 Java ops on typical (x86) hardware and the HotSpot VM. Not surprisingly, under the much less optimized Zero VM, the results are quite different (3-4 ops).
Thanks go to @Giovanni Azua and @Marko Topolnik for participation and hints.
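To make the summary reproducible, here is a minimal sketch that turns the Caliper figures above into "ops per JNI call" ratios and applies them as the rule of thumb the question asks for. The class and method names are my own, and the break-even check assumes the only extra cost of going native is the fixed call overhead.

```java
public class JniOverheadRatio {

    /** Overhead of one JNI call, in units of one simple Java operation. */
    static double opsPerCall(double jniCallNs, double javaOpNs) {
        return jniCallNs / javaOpNs;
    }

    /** True if the native version is still faster once the call overhead is added. */
    static boolean worthGoingNative(double javaCostNs, double nativeCostNs, double jniCallNs) {
        return nativeCostNs + jniCallNs < javaCostNs;
    }

    public static void main(String[] args) {
        // Timings (ns) copied from the benchmark results above.
        System.out.printf("i7, int add:      %.1f%n", opsPerCall(11.34, 0.47));
        System.out.printf("i7, long add:     %.1f%n", opsPerCall(11.34, 0.92));
        System.out.printf("Phenom, int add:  %.1f%n", opsPerCall(6.66, 0.29));
        System.out.printf("Phenom, long add: %.1f%n", opsPerCall(6.66, 0.26));
        System.out.printf("ARMv6, int add:   %.1f%n", opsPerCall(678.59, 183.46));
        System.out.printf("ARMv6, long add:  %.1f%n", opsPerCall(678.59, 199.36));

        // Hypothetical operation: 100 ns in Java, 50 ns natively, on the i7.
        System.out.println(worthGoingNative(100.0, 50.0, 11.34));
    }
}
```

The x86 ratios come out at roughly 12 to 26 and the ARM ones at roughly 3.4 to 3.7, which is where the ranges in the summary come from.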
So I just tested the "latency" for a JNI call to C on Windows 8.1, 64-bit, using the Eclipse Mars IDE, JDK 1.8.0_74, and the VisualVM profiler 1.3.8 with the Profile Startup add-on.
Setup: (two methods)
SOMETHING() passes arguments, does stuff, and returns arguments
NOTHING() passes in the same arguments, does nothing with them, and returns same arguments.
(each gets called 270 times)
Total run time for SOMETHING(): 6523ms
Total run time for NOTHING(): 0.102ms
Thus in my case the JNI calls are quite negligible.
You should actually test the "latency" yourself. In engineering, latency is defined as the time it takes to send a message of zero length. In this context, that means writing the smallest Java program that invokes an empty do_nothing C++ function and computing the mean and standard deviation of the elapsed time over 30 measurements (after a couple of extra warm-up calls). You might be surprised by how much the average differs across JDK versions and platforms.
Only doing so will give you the final answer of whether using JNI makes sense for your target environment.
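A minimal sketch of such a measurement harness follows. Everything in it is my own naming, and the no-op Runnable is only a stand-in: to measure real JNI latency you would replace it with a call to your empty native method.

```java
public class CallLatency {

    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double stddev(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / (xs.length - 1));
    }

    /** Runs the task `warmups` times untimed, then times `samples` single runs in ns. */
    static double[] measure(Runnable task, int warmups, int samples) {
        for (int i = 0; i < warmups; i++) task.run();
        double[] ns = new double[samples];
        for (int i = 0; i < samples; i++) {
            long t0 = System.nanoTime();
            task.run();
            ns[i] = System.nanoTime() - t0;
        }
        return ns;
    }

    public static void main(String[] args) {
        // Stand-in for the empty native function (assumption, no JNI here).
        Runnable doNothing = () -> { };
        double[] ns = measure(doNothing, 5, 30);
        System.out.printf("mean = %.1f ns, stddev = %.1f ns%n", mean(ns), stddev(ns));
    }
}
```

A single call is close to the resolution of System.nanoTime(), so in practice it is probably more reliable to time a batch of calls per sample and divide by the batch size.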
I am trying to test the performance of Aparapi.
I have seen some blogs where the results show that Aparapi does improve the performance while doing data parallel operations.
But I am not able to see that in my tests. Here is what I did, I wrote two programs, one using Aparapi, the other one using normal loops.
Program 1: In Aparapi
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;
public class App
{
public static void main( String[] args )
{
final int size = 50000000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
Kernel kernel = new Kernel(){
@Override public void run() {
int gid = getGlobalId();
sum[gid] = a[gid] + b[gid];
}
};
long t1 = System.currentTimeMillis();
kernel.execute(Range.create(size));
long t2 = System.currentTimeMillis();
System.out.println("Execution mode = "+kernel.getExecutionMode());
kernel.dispose();
System.out.println(t2-t1);
}
}
Program 2: using loops
public class App2 {
public static void main(String[] args) {
final int size = 50000000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
long t1 = System.currentTimeMillis();
for(int i=0;i<size;i++) {
sum[i]=a[i]+b[i];
}
long t2 = System.currentTimeMillis();
System.out.println(t2-t1);
}
}
Program 1 takes around 330ms whereas Program 2 takes only around 55ms.
Am I doing something wrong here? I did print out the execution mode in the Aparapi program, and it reports that the mode of execution is GPU.
You did not do anything wrong - except for the benchmark itself.
Benchmarking is always tricky, and particularly for the cases where a JIT is involved (as for Java), and for libraries where many nitty-gritty details are hidden from the user (as for Aparapi). And in both cases, you should at least execute the code section that you want to benchmark multiple times.
For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not really be noticeable, but in more realistic or complex scenarios, this will make a difference. Anyhow: when repeating the loop 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
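Repeating the plain Java loop could look roughly like the sketch below. The class and method names are mine, and main uses a smaller array than the question (5 million instead of 50 million elements) only to keep the heap requirement modest.

```java
public class RepeatedLoopTiming {

    /** Runs the vector addition `repetitions` times; returns each run's time in ms. */
    static long[] timeRuns(int size, int repetitions) {
        final float[] a = new float[size];
        final float[] b = new float[size];
        final float[] sum = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }
        long[] millis = new long[repetitions];
        for (int r = 0; r < repetitions; r++) {
            long t1 = System.nanoTime();
            for (int i = 0; i < size; i++) {
                sum[i] = a[i] + b[i];
            }
            millis[r] = (System.nanoTime() - t1) / 1_000_000;
        }
        // Read a result so the JIT cannot treat the loop as dead code.
        if (size > 0 && sum[size - 1] < 0) {
            throw new IllegalStateException("unexpected negative sum");
        }
        return millis;
    }

    public static void main(String[] args) {
        for (long ms : timeRuns(5_000_000, 10)) {
            System.out.println(ms);
        }
    }
}
```

Printing every run, instead of only an average, makes it easy to see whether the first iterations are slower than the later ones.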
For the Aparapi version, the point of possible GPU initialization was already mentioned in the comments. And here, this is indeed the main problem: When running the kernel 10 times, the timings on my machine are
1248
72
72
72
73
71
72
73
72
72
You see that the initial call causes all the overhead. The reason for this is that, during the first call to Kernel#execute(), it has to do all the initializations (basically converting the bytecode to OpenCL, compile the OpenCL code etc.). This is also mentioned in the documentation of the KernelRunner class:
The KernelRunner is created lazily as a result of calling Kernel.execute().
The effect of this - namely, a comparatively large delay for the first execution - has led to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like
kernel.execute(Range.create(1));
without a real workload, only to trigger the whole setup, so that the subsequent calls are fast. (This also works for your example).
You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason for that is that the task of a simple vector addition like this is memory bound - for details, you may refer to this answer, which explains this term and some issues with GPU programming in general.
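As a back-of-the-envelope illustration of "memory bound", the sketch below counts how many bytes have to be moved per addition in the original test. It only counts the two input arrays and the output array, ignores caches and any host-to-GPU copies, and all names are my own.

```java
public class ArithmeticIntensity {

    /** Bytes touched for the vector addition: two float inputs and one float output per element. */
    static long bytesMoved(long elements) {
        return elements * 3L * Float.BYTES;
    }

    /** Bytes of memory traffic per arithmetic operation. */
    static double bytesPerOperation(long elements, long opsPerElement) {
        return (double) bytesMoved(elements) / (elements * opsPerElement);
    }

    public static void main(String[] args) {
        long size = 50_000_000L; // array size used in the question
        System.out.println("bytes moved:  " + bytesMoved(size));
        System.out.println("bytes per op: " + bytesPerOperation(size, 1));
    }
}
```

That is 600,000,000 bytes of traffic for 50,000,000 additions, or 12 bytes per addition, so the time goes into moving data while the arithmetic itself is almost free.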
As an overly suggestive example for a case where you might benefit from the GPU, you might want to modify your test, in order to create an artificial compute bound task: When you change the kernel to involve some expensive trigonometric functions, like this
Kernel kernel = new Kernel() {
@Override
public void run() {
int gid = getGlobalId();
sum[gid] = (float)(Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
}
};
and the plain Java loop version accordingly, like this
for (int i = 0; i < size; i++) {
sum[i] = (float)(Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));
}
then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of nearly 90 through Aparapi!
Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.
Just add the following before the kernel execution:
kernel.setExplicit(true);
kernel.put(a);
kernel.put(b);
and
kernel.get(sum);
after it.
Although Aparapi does analyze the byte code of the Kernel.run()
method (and any method reachable from Kernel.run()) Aparapi has no
visibility to the call site. In the above code there is no way for
Aparapi to detect that that hugeArray is not modified within the for
loop body. Unfortunately, Aparapi must default to being ‘safe’ and
copy the contents of hugeArray backwards and forwards to the GPU
device.
https://github.com/aparapi/aparapi/blob/master/doc/ExplicitBufferHandling.md
I've run into a really strange bug, and I'm hoping someone here can shed some light as it's way out of my area of expertise.
First, relevant background information: I am running OS X 10.9.4 on a Late 2013 Macbook Pro Retina with a 2.4GHz Haswell CPU. I'm using JDK SE 8u5 for OS X from Oracle, and I'm running my code on the latest version of IntelliJ IDEA. This bug also seems to be specific only to OS X, as I posted on Reddit about this bug already and other users with OS X were able to recreate it while users on Windows and Linux, including myself, had the program run as expected with the println() version running half a second slower than the version without println().
Now for the bug: In my code, I have a println() statement that when included, the program runs at ~2.5 seconds. If I remove the println() statement either by deleting it or commenting it out, the program counterintuitively takes longer to run at ~9 seconds. It's extremely strange as I/O should theoretically slow the program down, not make it faster.
For my actual code, it's my implementation of Project Euler Problem 14. Please keep in mind I'm still a student so it's not the best implementation:
public class ProjectEuler14
{
public static void main(String[] args)
{
final double TIME_START = System.currentTimeMillis();
Collatz c = new Collatz();
int highestNumOfTerms = 0;
int currentNumOfTerms = 0;
int highestValue = 0; //Value which produces most number of Collatz terms
for (double i = 1.; i <= 1000000.; i++)
{
currentNumOfTerms = c.startCollatz(i);
if (currentNumOfTerms > highestNumOfTerms)
{
highestNumOfTerms = currentNumOfTerms;
highestValue = (int)(i);
System.out.println("New term: " + highestValue); //THIS IS THE OFFENDING LINE OF CODE
}
}
final double TIME_STOP = System.currentTimeMillis();
System.out.println("Highest term: " + highestValue + " with " + highestNumOfTerms + " number of terms");
System.out.println("Completed in " + ((TIME_STOP - TIME_START)/1000) + " s");
}
}
public class Collatz
{
private static int numOfTerms = 0;
private boolean isFirstRun = false;
public int startCollatz(double n)
{
isFirstRun = true;
runCollatz(n);
return numOfTerms;
}
private void runCollatz(double n)
{
if (isFirstRun)
{
numOfTerms = 0;
isFirstRun = false;
}
if (n == 1)
{
//Reached last term, does nothing and causes program to return to startCollatz()
}
else if (n % 2 == 0)
{
//Divides n by 2 following Collatz rule, running recursion
numOfTerms = numOfTerms + 1;
runCollatz(n / 2);
}
else if (n % 2 == 1)
{
//Multiples n by 3 and adds one, following Collatz rule, running recursion
numOfTerms = numOfTerms + 1;
runCollatz((3 * n) + 1);
}
}
}
The line of code in question has been commented in with all caps, as it doesn't look like SO does line numbers. If you can't find it, it's within the nested if() statement in my for() loop in my main method.
I've run my code multiple times with and without that line, and I consistently get the above stated ~2.5sec times with println() and ~9sec without println(). I've also rebooted my laptop multiple times to make sure it wasn't my current OS run and the times stay consistent.
Since other OS X 10.9.4 users were able to replicate the behaviour, I suspect it's due to a low-level bug in the compiler, JVM, or OS itself. In any case, this is way outside my knowledge. It's not a critical bug, but I definitely am interested in why this is happening and would appreciate any insight.
I did some research, and some more with @ekabanov, and here are the findings.
The effect you are seeing only happens with Java 8 and not with Java 7.
The extra line triggers a different JIT compilation/optimisation
The assembly code of the faster version is ~3 times larger and quick glance shows it did loop unrolling
The JIT compilation log shows that the slower version successfully inlined the runCollatz while the faster didn't stating that the callee is too large (probably because of the unrolling).
There is a great tool that helps you analyse such situations, called JITWatch. If you need to go down to the assembly level, then you also need the HotSpot Disassembler.
I'll post also my log files. You can feed the hotspot log files to the jitwatch and the assembly extraction is something that you diff to spot the differences.
Fast version's hotspot log file
Fast version's assembly log file
Slow version's hotspot log file
Slow version's assembly log file
I have some code that profiles Runtime.freeMemory. Here is my code:
package misc;
import java.util.ArrayList;
import java.util.Random;
public class FreeMemoryTest {
private final ArrayList<Double> l;
private final Random r;
public FreeMemoryTest(){
this.r = new Random();
this.l = new ArrayList<Double>();
}
public static boolean memoryCheck() {
double freeMem = Runtime.getRuntime().freeMemory();
double totalMem = Runtime.getRuntime().totalMemory();
double fptm = totalMem * 0.05;
boolean toReturn = fptm > freeMem;
return toReturn;
}
public void freeMemWorkout(int max){
for(int i = 0; i < max; i++){
memoryCheck();
l.add(r.nextDouble());
}
}
public void workout(int max){
for(int i = 0; i < max; i++){
l.add(r.nextDouble());
}
}
public static void main(String[] args){
FreeMemoryTest f = new FreeMemoryTest();
int count = Integer.parseInt(args[1]);
long startTime = System.currentTimeMillis();
if(args[0].equals("f")){
f.freeMemWorkout(count);
} else {
f.workout(count);
}
long endTime = System.currentTimeMillis();
System.out.println(endTime - startTime);
}
}
When I run the profiler using -Xrunhprof:cpu=samples, the vast majority of the calls are to the Runtime.freeMemory(), like this:
CPU SAMPLES BEGIN (total = 531) Fri Dec 7 00:17:20 2012
rank self accum count trace method
1 83.62% 83.62% 444 300274 java.lang.Runtime.freeMemory
2 9.04% 92.66% 48 300276 java.lang.Runtime.totalMemory
When I run the profiler using -Xrunhprof:cpu=time, I don't see any of the calls to Runtime.freeMemory at all, and the top five calls are as follows:
CPU TIME (ms) BEGIN (total = 10042) Fri Dec 7 00:29:51 2012
rank self accum count trace method
1 13.39% 13.39% 200000 307547 java.util.Random.next
2 9.69% 23.08% 1 307852 misc.FreeMemoryTest.freeMemWorkout
3 7.41% 30.49% 100000 307544 misc.FreeMemoryTest.memoryCheck
4 7.39% 37.88% 100000 307548 java.util.Random.nextDouble
5 4.35% 42.23% 100000 307561 java.util.ArrayList.add
These two profiles are very different from one another. I thought that samples was supposed to at least roughly approximate the results from times, but here we see a radical difference: something that consumes more than 80% of the samples doesn't even appear in the times profile. This does not make any sense to me. Does anyone know why this is happening?
More on this:
$ java -Xmx1000m -Xms1000m -jar memtest.jar a 20000000
5524
//does not have the calls to Runtime.freeMemory()
$ java -Xmx1000m -Xms1000m -jar memtest.jar f 20000000
9442
//has the calls to Runtime.freeMemory()
Running with freemem requires approximately twice the amount of time as running without it. If 80% of the CPU time is spent in java.lang.Runtime.freeMemory(), and I remove that call, I would expect the program to speed up by a factor of approximately 5. As we can see above, the program speeds up by a factor of approximately 2.
A slowdown of a factor of 5 is way worse than a slowdown of a factor of 2 that was observed empirically, so what I do not understand is how the sampling profiler is so far off from reality.
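To spell out the arithmetic behind that expectation: removing a part that accounts for a fraction f of the run time should give a speedup of 1 / (1 - f). The sketch below applies this to the numbers quoted above; the class and method names are only illustrative.

```java
public class ExpectedSpeedup {

    /** Speedup obtained by removing a part that takes `fraction` of the total run time. */
    static double speedupIfRemoved(double fraction) {
        return 1.0 / (1.0 - fraction);
    }

    /** Fraction of the run time attributable to the removed part, from two wall-clock times. */
    static double fractionFromTimes(double withCallMs, double withoutCallMs) {
        return (withCallMs - withoutCallMs) / withCallMs;
    }

    public static void main(String[] args) {
        // Roughly 80% of samples, as assumed in the text above.
        System.out.printf("expected at 80%%:     %.2f%n", speedupIfRemoved(0.80));
        // 83.62% is the figure the sampling profiler reported for freeMemory.
        System.out.printf("expected at 83.62%%:  %.2f%n", speedupIfRemoved(0.8362));
        // Observed wall-clock times from the two runs above (ms).
        System.out.printf("observed speedup:    %.2f%n", 9442.0 / 5524.0);
        System.out.printf("observed fraction:   %.3f%n", fractionFromTimes(9442, 5524));
    }
}
```

The wall-clock times imply that the memory check accounts for about 41% of the run, roughly half of what the sampling profile suggests.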
The Runtime freeMemory() and totalMemory() are native calls.
See http://www.docjar.com/html/api/java/lang/Runtime.java.html
The timer cannot time them, but the sampler can.
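If you want to confirm which methods are native on your own JDK without reading the source, reflection can tell you. This is a small sketch with my own class name; I would expect it to report true for the two Runtime methods on OpenJDK-based JDKs, but other runtimes may differ.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class NativeMethodCheck {

    /** Returns true if the public no-argument method `methodName` of `type` is declared native. */
    static boolean isNative(Class<?> type, String methodName) {
        try {
            Method m = type.getMethod(methodName);
            return Modifier.isNative(m.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Runtime.freeMemory  native? " + isNative(Runtime.class, "freeMemory"));
        System.out.println("Runtime.totalMemory native? " + isNative(Runtime.class, "totalMemory"));
        // An ordinary Java method, for comparison.
        System.out.println("ArrayList.size      native? " + isNative(java.util.ArrayList.class, "size"));
    }
}
```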
I experience a (for me) strange runtime behaviour in the following code:
public class Main{
private final static long ROUNDS = 1000000;
private final static double INITIAL_NUMBER = 0.45781929d;
private final static double DIFFERENCE = 0.1250120303d;
public static void main(String[] args){
doSomething();
doSomething();
doSomething();
}
private static void doSomething(){
long begin, end;
double numberToConvert, difference;
numberToConvert = INITIAL_NUMBER;
difference = DIFFERENCE;
begin = System.currentTimeMillis();
for(long i=0; i<ROUNDS; i++){
String s = "" + numberToConvert;
if(i % 2 == 0){
numberToConvert += difference;
}
else{
numberToConvert -= difference;
}
}
end = System.currentTimeMillis();
System.out.println("String appending conversion took " + (end - begin) + "ms.");
}
}
I would expect the program to print out similar runtimes each time. However, the output I get is always like this:
String appending conversion took 473ms.
String appending conversion took 362ms.
String appending conversion took 341ms.
The first call is about 30% slower than the calls afterwards. Most of the time, the second call is also slightly slower than the third call.
java/javac versions:
javac 1.7.0_09
java version "1.7.0_09"
OpenJDK Runtime Environment (IcedTea7 2.3.3) (7u9-2.3.3-0ubuntu1~12.04.1)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
So, my question: Why does this happen?
The Just-in-time (JIT) compiler is profiling your code on the fly and optimizing execution. The more often a piece of code is executed the better it's optimized.
See, for example, this question for more info: (How) does the Java JIT compiler optimize my code?
It is possible that other apps you have running are affecting the amount of memory available to the JVM on your machine. Try setting the same minimum and maximum heap size for the JVM when running the java command:
java -Xms512M -Xmx512M ...
I got fairly consistent intervals when trying to run it:
String appending conversion took 1153ms.
String appending conversion took 1095ms.
String appending conversion took 1081ms.
I had a small dispute over performance of synchronized block in Java. This is a theoretical question, which does not affect real life application.
Consider a single-threaded application which uses locks and synchronized sections. Does this code run slower than the same code without synchronized sections? If so, why? We are not discussing concurrency, since it's only a single-threaded application.
Update
Found an interesting benchmark that tests this, but it's from 2001. Things could have changed dramatically in the latest versions of the JDK.
Single-threaded code will still run slower when using synchronized blocks. Obviously you will not have other threads stalled while waiting for other threads to finish, however you will have to deal with the other effects of synchronization, namely cache coherency.
Synchronized blocks are not only used for concurrency, but also visibility. Every synchronized block is a memory barrier: the JVM is free to work on variables in registers, instead of main memory, on the assumption that multiple threads will not access that variable. Without synchronization blocks, this data could be stored in a CPU's cache and different threads on different CPUs would not see the same data. By using a synchronization block, you force the JVM to write this data to main memory for visibility to other threads.
So even though you're free from lock contention, the JVM will still have to do housekeeping in flushing data to main memory.
In addition, this has optimization constraints. The JVM is free to reorder instructions in order to provide optimization: consider a simple example:
foo++;
bar++;
versus:
foo++;
synchronized(obj)
{
bar++;
}
In the first example, the compiler is free to load foo and bar at the same time, then increment them both, then save them both. In the second example, the compiler must perform the load/add/save on foo, then perform the load/add/save on bar. Thus, synchronization may impact the ability of the JRE to optimize instructions.
(An excellent book on the Java Memory Model is Brian Goetz's Java Concurrency In Practice.)
There are 3 types of locking in HotSpot:
Fat: the JVM relies on OS mutexes to acquire the lock.
Thin: the JVM uses a CAS algorithm.
Biased: CAS is a rather expensive operation on some architectures. Biased locking is a special type of locking optimized for the scenario where only one thread is working on an object.
By default the JVM uses thin locking. Later, if the JVM determines that there is no contention, thin locking is converted to biased locking. The operation that changes the type of the lock is rather expensive, hence the JVM does not apply this optimization immediately. There is a special JVM option, -XX:BiasedLockingStartupDelay=delay, which tells the JVM when this kind of optimization should be applied.
Once biased, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions.
Answer to the question: it depends. But if the lock is biased, the single-threaded code with locking and without locking has, on average, the same performance.
Biased Locking in HotSpot - Dave Dice's Weblog
Synchronization and Object Locking - Thomas Kotzmann and Christian Wimmer
There is some overhead in acquiring a non-contested lock, but on modern JVMs it is very small.
A key run-time optimization that's relevant to this case is called "Biased Locking" and is explained in the Java SE 6 Performance White Paper.
If you wanted to have some performance numbers that are relevant to your JVM and hardware, you could construct a micro-benchmark to try and measure this overhead.
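A very rough version of such a micro-benchmark might look like the sketch below: one thread increments a counter with and without a synchronized block and reports the time per operation. The class name and iteration counts are my own choices, and the JIT may coarsen or remove the uncontended lock, so treat the output as indicative only. A harness such as JMH would be the more trustworthy route.

```java
public class UncontendedLockCost {
    private final Object lock = new Object();
    private long plain;
    private long locked;

    /** Increments a counter n times without any locking. */
    long runPlain(int n) {
        for (int i = 0; i < n; i++) {
            plain++;
        }
        return plain;
    }

    /** Increments a counter n times, taking an uncontended lock each time. */
    long runLocked(int n) {
        for (int i = 0; i < n; i++) {
            synchronized (lock) {
                locked++;
            }
        }
        return locked;
    }

    public static void main(String[] args) {
        final int n = 50_000_000;
        UncontendedLockCost bench = new UncontendedLockCost();
        // Several rounds, so that later rounds run JIT-compiled code.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            bench.runPlain(n);
            long t1 = System.nanoTime();
            bench.runLocked(n);
            long t2 = System.nanoTime();
            System.out.printf("plain: %.2f ns/op, synchronized: %.2f ns/op%n",
                    (double) (t1 - t0) / n, (double) (t2 - t1) / n);
        }
    }
}
```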
Using locks when you don't need to will slow down your application. It could be too small to measure or it could be surprisingly high.
IMHO often the best approach is to use lock-free code in a single-threaded program to make it clear this code is not intended to be shared across threads. This could be more important for maintenance than any performance issues.
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class VectorVsArrayList {
    public static void main(String... args) {
        // Alternate the two list types so each is measured before and after JIT warm-up.
        for (int i = 0; i < 3; i++) {
            perfTest(new Vector<Integer>());
            perfTest(new ArrayList<Integer>());
        }
    }

    private static void perfTest(List<Integer> objects) {
        long start = System.nanoTime();
        final int runs = 100000000;
        for (int i = 0; i < runs; i += 20) {
            // add items.
            for (int j = 0; j < 20; j += 2)
                objects.add(i);
            // remove from the end.
            while (!objects.isEmpty())
                objects.remove(objects.size() - 1);
        }
        long time = System.nanoTime() - start;
        System.out.printf("%s each add/remove took an average of %.1f ns%n", objects.getClass().getSimpleName(), (double) time / runs);
    }
}
prints
Vector each add/remove took an average of 38.9 ns
ArrayList each add/remove took an average of 6.4 ns
Vector each add/remove took an average of 10.5 ns
ArrayList each add/remove took an average of 6.2 ns
Vector each add/remove took an average of 10.4 ns
ArrayList each add/remove took an average of 5.7 ns
From a performance point of view, if 4 ns is important to you, you have to use the non-synchronized version.
For 99% of use cases, the clarity of the code is more important than performance. Clear, simple code often performs reasonably well too.
BTW: I am using a 4.6 GHz i7 2600 with Oracle Java 7u1.
For comparison if I do the following where perfTest1,2,3 are identical.
perfTest1(new ArrayList<Integer>());
perfTest2(new Vector<Integer>());
perfTest3(Collections.synchronizedList(new ArrayList<Integer>()));
I get
ArrayList each add/remove took an average of 2.6 ns
Vector each add/remove took an average of 7.5 ns
SynchronizedRandomAccessList each add/remove took an average of 8.9 ns
If I use a common perfTest method it cannot inline the code as optimally and they are all slower
ArrayList each add/remove took an average of 9.3 ns
Vector each add/remove took an average of 12.4 ns
SynchronizedRandomAccessList each add/remove took an average of 13.9 ns
Swapping the order of tests
ArrayList each add/remove took an average of 3.0 ns
Vector each add/remove took an average of 39.7 ns
ArrayList each add/remove took an average of 2.0 ns
Vector each add/remove took an average of 4.6 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.5 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.4 ns
ArrayList each add/remove took an average of 2.4 ns
Vector each add/remove took an average of 4.6 ns
one at a time
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 2.3 ns
ArrayList each add/remove took an average of 2.2 ns
ArrayList each add/remove took an average of 2.4 ns
and
Vector each add/remove took an average of 28.4 ns
Vector each add/remove took an average of 37.4 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Assuming you're using the HotSpot VM, I believe the JVM is able to recognize that there is no contention for any resources within the synchronized block and treat it as "normal" code.
This sample code (with 100 threads making 1,000,000 iterations each one) demonstrates the performance difference between avoiding and not avoiding a synchronized block.
Output:
Total time(Avoid Sync Block): 630ms
Total time(NOT Avoid Sync Block): 6360ms
Total time(Avoid Sync Block): 427ms
Total time(NOT Avoid Sync Block): 6636ms
Total time(Avoid Sync Block): 481ms
Total time(NOT Avoid Sync Block): 5882ms
Code:
import org.apache.commons.lang.time.StopWatch;
public class App {
public static int countTheads = 100;
public static int loopsPerThead = 1000000;
public static int sleepOfFirst = 10;
public static int runningCount = 0;
public static Boolean flagSync = null;
public static void main( String[] args )
{
for (int j = 0; j < 3; j++) {
App.startAll(new App.AvoidSyncBlockRunner(), "(Avoid Sync Block)");
App.startAll(new App.NotAvoidSyncBlockRunner(), "(NOT Avoid Sync Block)");
}
}
public static void startAll(Runnable runnable, String description) {
App.runningCount = 0;
App.flagSync = null;
Thread[] threads = new Thread[App.countTheads];
StopWatch sw = new StopWatch();
sw.start();
for (int i = 0; i < threads.length; i++) {
threads[i] = new Thread(runnable);
}
for (int i = 0; i < threads.length; i++) {
threads[i].start();
}
do {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
e.printStackTrace();
}
} while (runningCount != 0);
System.out.println("Total time"+description+": " + (sw.getTime() - App.sleepOfFirst) + "ms");
}
public static void commonBlock() {
String a = "foo";
a += "Baa";
}
public static synchronized void incrementCountRunning(int inc) {
runningCount = runningCount + inc;
}
public static class NotAvoidSyncBlockRunner implements Runnable {
public void run() {
App.incrementCountRunning(1);
for (int i = 0; i < App.loopsPerThead; i++) {
synchronized (App.class) {
if (App.flagSync == null) {
try {
Thread.sleep(App.sleepOfFirst);
} catch (InterruptedException e) {
e.printStackTrace();
}
App.flagSync = true;
}
}
App.commonBlock();
}
App.incrementCountRunning(-1);
}
}
public static class AvoidSyncBlockRunner implements Runnable {
public void run() {
App.incrementCountRunning(1);
for (int i = 0; i < App.loopsPerThead; i++) {
// THIS "IF" MAY SEEM POINTLESS, BUT IT AVOIDS THE NEXT
//ITERATION OF ENTERING INTO THE SYNCHRONIZED BLOCK
if (App.flagSync == null) {
synchronized (App.class) {
if (App.flagSync == null) {
try {
Thread.sleep(App.sleepOfFirst);
} catch (InterruptedException e) {
e.printStackTrace();
}
App.flagSync = true;
}
}
}
App.commonBlock();
}
App.incrementCountRunning(-1);
}
}
}
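One caveat about the "avoid" variant above: it reads flagSync outside the synchronized block, and under the Java Memory Model that unsynchronized read is only guaranteed to see the write if the field is volatile. Below is a minimal sketch of the same check-then-lock idea with a volatile flag. The names are mine, it uses Thread.join() instead of the polling loop, and it counts initializations instead of timing anything.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CheckThenLock {
    // volatile: the unsynchronized read below must see the write made under the lock.
    private static volatile Boolean flagSync = null;
    static final AtomicInteger initializations = new AtomicInteger();

    static void initOnce() {
        if (flagSync == null) {                   // cheap check, no lock taken
            synchronized (CheckThenLock.class) {
                if (flagSync == null) {           // re-check while holding the lock
                    initializations.incrementAndGet();
                    flagSync = true;
                }
            }
        }
    }

    /** Starts the threads, waits for all of them, and returns how often the init body ran. */
    static int runThreads(int threadCount, int loopsPerThread) {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < loopsPerThread; j++) {
                    initOnce();
                }
            });
        }
        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return initializations.get();
    }

    public static void main(String[] args) {
        System.out.println("initializations = " + runThreads(100, 1_000_000));
    }
}
```

After the first initialization, every later call only pays for one volatile read, which is the effect the timings above are showing.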