Aparapi GPU execution slower than CPU

Aparapi GPU execution slower than CPU - java

I am trying to test the performance of Aparapi.
I have seen some blogs where the results show that Aparapi does improve the performance while doing data parallel operations.
But I am not able to see that in my tests. Here is what I did, I wrote two programs, one using Aparapi, the other one using normal loops.
Program 1: In Aparapi
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;
public class App
{
public static void main( String[] args )
{
final int size = 50000000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
Kernel kernel = new Kernel(){
#Override public void run() {
int gid = getGlobalId();
sum[gid] = a[gid] + b[gid];
}
};
long t1 = System.currentTimeMillis();
kernel.execute(Range.create(size));
long t2 = System.currentTimeMillis();
System.out.println("Execution mode = "+kernel.getExecutionMode());
kernel.dispose();
System.out.println(t2-t1);
}
}
Program 2: using loops
public class App2 {
public static void main(String[] args) {
final int size = 50000000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
long t1 = System.currentTimeMillis();
for(int i=0;i<size;i++) {
sum[i]=a[i]+b[i];
}
long t2 = System.currentTimeMillis();
System.out.println(t2-t1);
}
}
Program 1 takes around 330ms whereas Program 2 takes only around 55ms.
Am I doing something wrong here? I did printout the execution mode in Aparpai program and it prints that the mode of execution is GPU

You did not do anything wrong - execpt for the benchmark itself.
Benchmarking is always tricky, and particularly for the cases where a JIT is involved (as for Java), and for libraries where many nitty-gritty details are hidden from the user (as for Aparapi). And in both cases, you should at least execute the code section that you want to benchmark multiple times.
For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop itself it is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not really be noticable, but in more realistic or complex scenarios, this will make a difference. Anyhow: When repeating the loop for 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
For the Aparapi version, the point of possible GPU initialization was already mentioned in the comments. And here, this is indeed the main problem: When running the kernel 10 times, the timings on my machine are
1248
72
72
72
73
71
72
73
72
72
You see that the initial call causes all the overhead. The reason for this is that, during the first call to Kernel#execute(), it has to do all the initializations (basically converting the bytecode to OpenCL, compile the OpenCL code etc.). This is also mentioned in the documentation of the KernelRunner class:
The KernelRunner is created lazily as a result of calling Kernel.execute().
The effect of this - namely, a comparatively large delay for the first execution - has lead to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like
kernel.execute(Range.create(1));
without a real workload, only to trigger the whole setup, so that the subsequent calls are fast. (This also works for your example).
You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason for that is that the task of a simple vector addition like this is memory bound - for details, you may refer to this answer, which explains this term and some issues with GPU programming in general.
As an overly suggestive example for a case where you might benefit from the GPU, you might want to modify your test, in order to create an artificial compute bound task: When you change the kernel to involve some expensive trigonometric functions, like this
Kernel kernel = new Kernel() {
#Override
public void run() {
int gid = getGlobalId();
sum[gid] = (float)(Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
}
};
and the plain Java loop version accordingly, like this
for (int i = 0; i < size; i++) {
sum[i] = (float)(Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));;
}
then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of nearly 90 through Aparapi!
Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.

Just add before main loop kernel execution
kernel.setExplicit(true);
kernel.put(a);
kernel.put(b);
and
kernel.get(sum);
after it.
Although Aparapi does analyze the byte code of the Kernel.run()
method (and any method reachable from Kernel.run()) Aparapi has no
visibility to the call site. In the above code there is no way for
Aparapi to detect that that hugeArray is not modified within the for
loop body. Unfortunately, Aparapi must default to being ‘safe’ and
copy the contents of hugeArray backwards and forwards to the GPU
device.
https://github.com/aparapi/aparapi/blob/master/doc/ExplicitBufferHandling.md

Related

Multithreaded vs Asynchronous programming in a single core

If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Lets say for example we have to count from 1 to IntegerMax. In the following program for my multicore machine, the two thread final count count is almost half of the single thread count. What if we ran this in a single core machine? And is there any way we could achieve the same result there?
class Demonstration {
public static void main( String args[] ) throws InterruptedException {
SumUpExample.runTest();
}
}
class SumUpExample {
long startRange;
long endRange;
long counter = 0;
static long MAX_NUM = Integer.MAX_VALUE;
public SumUpExample(long startRange, long endRange) {
this.startRange = startRange;
this.endRange = endRange;
}
public void add() {
for (long i = startRange; i <= endRange; i++) {
counter += i;
}
}
static public void twoThreads() throws InterruptedException {
long start = System.currentTimeMillis();
SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
Thread t1 = new Thread(() -> {
s1.add();
});
Thread t2 = new Thread(() -> {
s2.add();
});
t1.start();
t2.start();
t1.join();
t2.join();
long finalCount = s1.counter + s2.counter;
long end = System.currentTimeMillis();
System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
}
static public void oneThread() {
long start = System.currentTimeMillis();
SumUpExample s = new SumUpExample(1, MAX_NUM );
s.add();
long end = System.currentTimeMillis();
System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
}
public static void runTest() throws InterruptedException {
oneThread();
twoThreads();
}
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540

For a purely CPU-bound operation you are correct. Most (99.9999%) of programs need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading benefits primarily when 2 conditions are met:
CPU-intensive operations;
That can be efficiently divided into independent subsets
Or you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.

In the following program for my multicore machine, the two thread final count count is almost half of the single thread count.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
You benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to corrected the flaws above.
You are running on a system where hardware hyper-threading1 is disabled2.
Then ... I would expect it to take two threads to take more than twice as long as the one thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But are if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyper-threads. Each hyperthread
has its own sets of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?

Multi threaded matrix multiplication performance issue

I am using java for multi threaded multiplication. I am practicing multi threaded programming. Following is the code that I took from another post of stackoverflow.
public class MatMulConcur {
private final static int NUM_OF_THREAD =1 ;
private static Mat matC;
public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}
private static Mat mul(Mat matA,Mat matB) {
int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;
Worker[] myWorker = new Worker[NUM_OF_THREAD];
for (int j = 0; j < NUM_OF_THREAD; j++) {
if (j<NUM_OF_THREAD-1){
numRowForThread = (numRowA / NUM_OF_THREAD);
} else {
numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
}
myWorker[j] = new Worker(startRow, startRow+numRowForThread,matA,matB);
myWorker[j].start();
startRow += numRowForThread;
}
for (Worker worker : myWorker) {
try {
worker.join();
} catch (InterruptedException e) {
}
}
return matC;
}
private static class Worker extends Thread {
private int startRow, stopRow;
private Mat matA, matB;
public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
super();
this.startRow = startRow;
this.stopRow = stopRow;
this.matA = matA;
this.matB = matB;
}
#Override
public void run() {
for (int i = startRow; i < stopRow; i++) {
for (int j = 0; j < matB.getNColumns(); j++) {
double sum = 0;
for (int k = 0; k < matA.getNColumns(); k++) {
sum += matA.get(i, k) * matB.get(k, j);
}
matC.set(i, j, sum);
}
}
}
}
I ran this program for 1,10,20,...,100 threads but performance is decreasing instead. Following is the time table
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any Idea?

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so1.
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1 2 and 5 ... depending on how many cores are used, how big the CPUs memory caches are, how big the matrix is, and other factors.
1 - Indeed, if this was true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.

Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will decrease performance. It's like starting more and more applications, you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

Performance of synchronize section in Java

I had a small dispute over performance of synchronized block in Java. This is a theoretical question, which does not affect real life application.
Consider single-thread application, which uses locks and synchronize sections. Does this code work slower than the same code without synchronize sections? If so, why? We do not discuss concurrency, since it’s only single thread application
Update
Found interesting benchmark testing it. But it's from 2001. Things could have changed dramatically in the latest version of JDK

Single-threaded code will still run slower when using synchronized blocks. Obviously you will not have other threads stalled while waiting for other threads to finish, however you will have to deal with the other effects of synchronization, namely cache coherency.
Synchronized blocks are not only used for concurrency, but also visibility. Every synchronized block is a memory barrier: the JVM is free to work on variables in registers, instead of main memory, on the assumption that multiple threads will not access that variable. Without synchronization blocks, this data could be stored in a CPU's cache and different threads on different CPUs would not see the same data. By using a synchronization block, you force the JVM to write this data to main memory for visibility to other threads.
So even though you're free from lock contention, the JVM will still have to do housekeeping in flushing data to main memory.
In addition, this has optimization constraints. The JVM is free to reorder instructions in order to provide optimization: consider a simple example:
foo++;
bar++;
versus:
foo++;
synchronized(obj)
{
bar++;
}
In the first example, the compiler is free to load foo and bar at the same time, then increment them both, then save them both. In the second example, the compiler must perform the load/add/save on foo, then perform the load/add/save on bar. Thus, synchronization may impact the ability of the JRE to optimize instructions.
(An excellent book on the Java Memory Model is Brian Goetz's Java Concurrency In Practice.)

There are 3 type of locking in HotSpot
Fat: JVM relies on OS mutexes to acquire lock.
Thin: JVM is using CAS algorithm.
Biased: CAS is rather expensive operation on some of the architecture. Biased locking - is special type of locking optimized for scenario when only one thread is working on object.
By default JVM uses thin locking. Later if JVM determines that there is no contention thin locking is converted to biased locking. Operation that changes type of the lock is rather expensive, hence JVM does not apply this optimization immediately. There is special JVM option - XX:BiasedLockingStartupDelay=delay which tells JVM when this kind of optimization should be applied.
Once biased, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions.
Answer to the question: it depends. But if biased, the single threaded code with locking and without locking has average same performance.
Biased Locking in HotSpot - Dave Dice's Weblog
Synchronization and Object Locking - Thomas Kotzmann and Christian Wimmer

There is some overhead in acquiring a non-contested lock, but on modern JVMs it is very small.
A key run-time optimization that's relevant to this case is called "Biased Locking" and is explained in the Java SE 6 Performance White Paper.
If you wanted to have some performance numbers that are relevant to your JVM and hardware, you could construct a micro-benchmark to try and measure this overhead.

Using locks when you don't need to will slow down your application. It could be too small to measure or it could be surprisingly high.
IMHO Often the best approach is to use lock free code in a single threaded program to make it clear this code is not intended to be shared across thread. This could be more important for maintenance than any performance issues.
public static void main(String... args) throws IOException {
for (int i = 0; i < 3; i++) {
perfTest(new Vector<Integer>());
perfTest(new ArrayList<Integer>());
}
}
private static void perfTest(List<Integer> objects) {
long start = System.nanoTime();
final int runs = 100000000;
for (int i = 0; i < runs; i += 20) {
// add items.
for (int j = 0; j < 20; j+=2)
objects.add(i);
// remove from the end.
while (!objects.isEmpty())
objects.remove(objects.size() - 1);
}
long time = System.nanoTime() - start;
System.out.printf("%s each add/remove took an average of %.1f ns%n", objects.getClass().getSimpleName(), (double) time/runs);
}
prints
Vector each add/remove took an average of 38.9 ns
ArrayList each add/remove took an average of 6.4 ns
Vector each add/remove took an average of 10.5 ns
ArrayList each add/remove took an average of 6.2 ns
Vector each add/remove took an average of 10.4 ns
ArrayList each add/remove took an average of 5.7 ns
From a performance point of view, if 4 ns is important to you, you have to use the non-synchronized version.
For 99% of use cases, the clarity of the code is more important than performance. Clear, simple code often performs reasonably good as well.
BTW: I am using a 4.6 GHz i7 2600 with Oracle Java 7u1.
For comparison if I do the following where perfTest1,2,3 are identical.
perfTest1(new ArrayList<Integer>());
perfTest2(new Vector<Integer>());
perfTest3(Collections.synchronizedList(new ArrayList<Integer>()));
I get
ArrayList each add/remove took an average of 2.6 ns
Vector each add/remove took an average of 7.5 ns
SynchronizedRandomAccessList each add/remove took an average of 8.9 ns
If I use a common perfTest method it cannot inline the code as optimally and they are all slower
ArrayList each add/remove took an average of 9.3 ns
Vector each add/remove took an average of 12.4 ns
SynchronizedRandomAccessList each add/remove took an average of 13.9 ns
Swapping the order of tests
ArrayList each add/remove took an average of 3.0 ns
Vector each add/remove took an average of 39.7 ns
ArrayList each add/remove took an average of 2.0 ns
Vector each add/remove took an average of 4.6 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.5 ns
ArrayList each add/remove took an average of 2.3 ns
Vector each add/remove took an average of 4.4 ns
ArrayList each add/remove took an average of 2.4 ns
Vector each add/remove took an average of 4.6 ns
one at a time
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 3.0 ns
ArrayList each add/remove took an average of 2.3 ns
ArrayList each add/remove took an average of 2.2 ns
ArrayList each add/remove took an average of 2.4 ns
and
Vector each add/remove took an average of 28.4 ns
Vector each add/remove took an average of 37.4 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns
Vector each add/remove took an average of 7.6 ns

Assuming you're using the HotSpot VM, I believe the JVM is able to recognize that there is no contention for any resources within the synchronized block and treat it as "normal" code.

This sample code (with 100 threads making 1,000,000 iterations each one) demonstrates the performance difference between avoiding and not avoiding a synchronized block.
Output:
Total time(Avoid Sync Block): 630ms
Total time(NOT Avoid Sync Block): 6360ms
Total time(Avoid Sync Block): 427ms
Total time(NOT Avoid Sync Block): 6636ms
Total time(Avoid Sync Block): 481ms
Total time(NOT Avoid Sync Block): 5882ms
Code:
import org.apache.commons.lang.time.StopWatch;
public class App {
public static int countTheads = 100;
public static int loopsPerThead = 1000000;
public static int sleepOfFirst = 10;
public static int runningCount = 0;
public static Boolean flagSync = null;
public static void main( String[] args )
{
for (int j = 0; j < 3; j++) {
App.startAll(new App.AvoidSyncBlockRunner(), "(Avoid Sync Block)");
App.startAll(new App.NotAvoidSyncBlockRunner(), "(NOT Avoid Sync Block)");
}
}
public static void startAll(Runnable runnable, String description) {
App.runningCount = 0;
App.flagSync = null;
Thread[] threads = new Thread[App.countTheads];
StopWatch sw = new StopWatch();
sw.start();
for (int i = 0; i < threads.length; i++) {
threads[i] = new Thread(runnable);
}
for (int i = 0; i < threads.length; i++) {
threads[i].start();
}
do {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
e.printStackTrace();
}
} while (runningCount != 0);
System.out.println("Total time"+description+": " + (sw.getTime() - App.sleepOfFirst) + "ms");
}
public static void commonBlock() {
String a = "foo";
a += "Baa";
}
public static synchronized void incrementCountRunning(int inc) {
runningCount = runningCount + inc;
}
public static class NotAvoidSyncBlockRunner implements Runnable {
public void run() {
App.incrementCountRunning(1);
for (int i = 0; i < App.loopsPerThead; i++) {
synchronized (App.class) {
if (App.flagSync == null) {
try {
Thread.sleep(App.sleepOfFirst);
} catch (InterruptedException e) {
e.printStackTrace();
}
App.flagSync = true;
}
}
App.commonBlock();
}
App.incrementCountRunning(-1);
}
}
public static class AvoidSyncBlockRunner implements Runnable {
public void run() {
App.incrementCountRunning(1);
for (int i = 0; i < App.loopsPerThead; i++) {
// THIS "IF" MAY SEEM POINTLESS, BUT IT AVOIDS THE NEXT
//ITERATION OF ENTERING INTO THE SYNCHRONIZED BLOCK
if (App.flagSync == null) {
synchronized (App.class) {
if (App.flagSync == null) {
try {
Thread.sleep(App.sleepOfFirst);
} catch (InterruptedException e) {
e.printStackTrace();
}
App.flagSync = true;
}
}
}
App.commonBlock();
}
App.incrementCountRunning(-1);
}
}
}

Android: How much overhead is generated by running an empty method?

I have created a class to handle my debug outputs so that I don't need to strip out all my log outputs before release.
public class Debug {
public static void debug( String module, String message) {
if( Release.DEBUG )
Log.d(module, message);
}
}
After reading another question, I have learned that the contents of the if statement are not compiled if the constant Release.DEBUG is false.
What I want to know is how much overhead is generated by running this empty method? (Once the if clause is removed there is no code left in the method) Is it going to have any impact on my application? Obviously performance is a big issue when writing for mobile handsets =P
Thanks
Gary

Measurements done on Nexus S with Android 2.3.2:
10^6 iterations of 1000 calls to an empty static void function: 21s <==> 21ns/call
10^6 iterations of 1000 calls to an empty non-static void function: 65s <==> 65ns/call
10^6 iterations of 500 calls to an empty static void function: 3.5s <==> 7ns/call
10^6 iterations of 500 calls to an empty non-static void function: 28s <==> 56ns/call
10^6 iterations of 100 calls to an empty static void function: 2.4s <==> 24ns/call
10^6 iterations of 100 calls to an empty non-static void function: 2.9s <==> 29ns/call
control:
10^6 iterations of an empty loop: 41ms <==> 41ns/iteration
10^7 iterations of an empty loop: 560ms <==> 56ns/iteration
10^9 iterations of an empty loop: 9300ms <==> 9.3ns/iteration
I've repeated the measurements several times. No significant deviations were found.
You can see that the per-call cost can vary greatly depending on workload (possibly due to JIT compiling),
but 3 conclusions can be drawn:
dalvik/java sucks at optimizing dead code
static function calls can be optimized much better than non-static
(non-static functions are virtual and need to be looked up in a virtual table)
the cost on nexus s is not greater than 70ns/call (thats ~70 cpu cycles)
and is comparable with the cost of one empty for loop iteration (i.e. one increment and one condition check on a local variable)
Observe that in your case the string argument will always be evaluated. If you do string concatenation, this will involve creating intermediate strings. This will be very costly and involve a lot of gc. For example executing a function:
void empty(String string){
}
called with arguments such as
empty("Hello " + 42 + " this is a string " + count );
10^4 iterations of 100 such calls takes 10s. That is 10us/call, i.e. ~1000 times slower than just an empty call. It also produces huge amount of GC activity. The only way to avoid this is to manually inline the function, i.e. use the >>if<< statement instead of the debug function call. It's ugly but the only way to make it work.

Unless you call this from within a deeply nested loop, I wouldn't worry about it.

A good compiler removes the entire empty method, resulting in no overhead at all. I'm not sure if the Dalvik compiler already does this, but I suspect it's likely, at least since the arrival of the Just-in-time compiler with Froyo.
See also: Inline expansion

In terms of performance the overhead of generating the messages which get passed into the debug function are going to be a lot more serious since its likely they do memory allocations eg
Debug.debug(mymodule, "My error message" + myerrorcode);
Which will still occur even through the message is binned.
Unfortunately you really need the "if( Release.DEBUG ) " around the calls to this function rather than inside the function itself if your goal is performance, and you will see this in a lot of android code.

This is an interesting question and I like #misiu_mp analysis, so I thought I would update it with a 2016 test on a Nexus 7 running Android 6.0.1. Here is the test code:
public void runSpeedTest() {
long startTime;
long[] times = new long[100000];
long[] staticTimes = new long[100000];
for (int i = 0; i < times.length; i++) {
startTime = System.nanoTime();
for (int j = 0; j < 1000; j++) {
emptyMethod();
}
times[i] = (System.nanoTime() - startTime) / 1000;
startTime = System.nanoTime();
for (int j = 0; j < 1000; j++) {
emptyStaticMethod();
}
staticTimes[i] = (System.nanoTime() - startTime) / 1000;
}
int timesSum = 0;
for (int i = 0; i < times.length; i++) { timesSum += times[i]; Log.d("status", "time," + times[i]); sleep(); }
int timesStaticSum = 0;
for (int i = 0; i < times.length; i++) { timesStaticSum += staticTimes[i]; Log.d("status", "statictime," + staticTimes[i]); sleep(); }
sleep();
Log.d("status", "final speed = " + (timesSum / times.length));
Log.d("status", "final static speed = " + (timesStaticSum / times.length));
}
private void sleep() {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void emptyMethod() { }
private static void emptyStaticMethod() { }
The sleep() was added to prevent overflowing the Log.d buffer.
I played around with it many times and the results were pretty consistent with #misiu_mp:
10^5 iterations of 1000 calls to an empty static void function: 29ns/call
10^5 iterations of 1000 calls to an empty non-static void function: 34ns/call
The static method call was always slightly faster than the non-static method call, but it would appear that a) the gap has closed significantly since Android 2.3.2 and b) there's still a cost to making calls to an empty method, static or not.
Looking at a histogram of times reveals something interesting, however. The majority of call, whether static or not, take between 30-40ns, and looking closely at the data they are virtually all 30ns exactly.
Running the same code with empty loops (commenting out the method calls) produces an average speed of 8ns, however, about 3/4 of the measured times are 0ns while the remainder are exactly 30ns.
I'm not sure how to account for this data, but I'm not sure that #misiu_mp's conclusions still hold. The difference between empty static and non-static methods is negligible, and the preponderance of measurements are exactly 30ns. That being said, it would appear that there is still some non-zero cost to running empty methods.

Code inside thread slower than outside thread..?

I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the runnable to a ChachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what it could be or what, in that aspect is the difference with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out-of-context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this a part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allcocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.

All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?

I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.

When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks) you make get a performance improvement by your parallelization (the code on the thread will not necessarily run faster, but you are doing two thing at once).
Just a detail: this decision of how big small can a task be so parallelizing it is still worth is a known topic in parallel computation :)

You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking when measuring performance it's easy to get mislead when measuring small pieces of work. I would be looking to get a run of at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one different between the "No Thread" and "Threaded" cases is actually that you have gone from having one Thread (as has been pointed out you always have a thread) and two threads so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.