Java String = "" vs. new String("") performance change - java

I have made the same test as was done in this post:
new String() vs literal string performance
I wanted to test which one has better performance, and as I expected the assignment by literal was faster. I don't know why, but when I did the test with some more assignments I noticed something strange: when I let the program run the loop more than 10,000 times, the assignment by literal is relatively not as much faster as it is below 10,000 assignments, and at 1,000,000 repetitions it is even slower than creating new objects.
Here is my code:
double tx = System.nanoTime();
for (int i = 0; i < 1; i++) {   // loop bound varied: 1, 10, 100, ..., 1,000,000
    String s = "test";
}
double ty = System.nanoTime();

double ta = System.nanoTime();
for (int i = 0; i < 1; i++) {   // same bound as above
    String s = new String("test");
}
double tb = System.nanoTime();

System.out.println(ty - tx);
System.out.println(tb - ta);
I let this run exactly as written above. I'm just learning Java; my boss asked me to do the test, and after I presented the outcome he asked me to find an answer as to why this happens. I cannot find anything on Google or on Stack Overflow, so I hope someone can help me out here.
factor at 1 repetition: 3.811565221
factor at 10 repetitions: 4.393570401
factor at 100 repetitions: 5.234779103
factor at 1,000 repetitions: 7.909884116
factor at 10,000 repetitions: 9.395538811
factor at 100,000 repetitions: 2.355514697
factor at 1,000,000 repetitions: 0.734826755
Thank you!

First you'll have to learn a lot about the internals of HotSpot, in particular the fact that your code is first interpreted, then at a certain point compiled into native code.
A lot of optimizations happen while compiling, based on results of both static and dynamic analysis of your code.
Specifically, in your code,
String s = "test";
is a clear no-op. The compiler will emit no code whatsoever for this line. All that remains is the loop itself, and the whole loop may be eliminated if HotSpot proves it has no observable outside effects.
Second, even the code
String s = new String("test");
may result in almost the same thing as above because it is very easy to prove that your new String is an instance which cannot escape from the method where it is created.
With your code, the measurements are mixing up the performance of interpreted bytecode, the delay it takes to compile the code and swap it in by On-Stack Replacement, and then the performance of the native code.
Basically, the measurements you are making are measuring everything but the effect you have set out to measure.
To make the arguments more solid, I have repeated the test with jmh:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@Threads(1)
@Fork(2)
public class Strings
{
    static final int ITERS = 1_000;

    @GenerateMicroBenchmark
    public void literal() {
        for (int i = 0; i < ITERS; i++) { String s = "test"; }
    }

    @GenerateMicroBenchmark
    public void newString() {
        for (int i = 0; i < ITERS; i++) { String s = new String("test"); }
    }
}
and these are the results:
Benchmark   Mode  Samples    Mean  Mean error  Units
literal     avgt        6   0.625       0.023  ns/op
newString   avgt        6  43.778       3.283  ns/op
You can see that the whole method body is eliminated in the case of the string literal, while with new String the loop remains, but nothing happens inside it: 43.778 ns for 1,000 iterations is only about 0.04 ns per loop iteration. Definitely no String instances are being allocated.
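If the goal is to measure the allocation itself rather than the optimizer's ability to remove it, the usual JMH idiom is to consume the value through a Blackhole so that escape analysis can no longer discard it. A minimal sketch, assuming a recent JMH where @Benchmark replaced @GenerateMicroBenchmark:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class StringsConsumed {
    @Benchmark
    public void newStringConsumed(Blackhole bh) {
        // consuming the instance makes it an observable result,
        // so the allocation cannot be treated as dead code
        bh.consume(new String("test"));
    }
}

With the result consumed, both variants report a real per-operation cost instead of the cost of an empty loop.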

Related

Performance difference between Java direct array index access vs. for loop access

I was experimenting with predicates. I tried to implement a predicate for serialization issues in distributed systems. I wrote a simple example where the test function just returns true. I was measuring the overhead, and I stumbled upon this interesting problem: accessing the array in a for loop is 10 times slower than accessing it directly.
class Test {
    public boolean test(Object o) { return true; }
}

long count = 1000000000L;
Test[] test = new Test[3];
test[0] = new Test();
test[1] = new Test();
test[2] = new Test();

long milliseconds = System.currentTimeMillis();
for (int i = 0; i < count; i++) {
    boolean result = true;
    Object object = new Object();
    for (int j = 0; j < test.length; j++) {
        result = result && test[j].test(object);
    }
}
System.out.println(System.currentTimeMillis() - milliseconds);
However, the following code is almost 10 times faster. What can be the reason?
milliseconds = System.currentTimeMillis();
for (int i = 0; i < count; i++) {
    Object object = new Object();
    boolean result = test[0].test(object) && test[1].test(object) && test[2].test(object);
}
System.out.println(System.currentTimeMillis() - milliseconds);
Benchmark results on my i5:
4567 ms for for-loop access
 297 ms for direct access
Due to the predictable result of test(Object o), the compiler is able to optimize the second piece of code quite effectively. The nested loop in the first piece of code makes this optimization impossible.
Compare the result with the following Test class:
static class Test {
    public boolean test(Object o) {
        return Math.random() > 0.5;
    }
}
... and the loops:
long count = 100000000L;
Test[] test = new Test[3];
test[0] = new Test();
test[1] = new Test();
test[2] = new Test();

long milliseconds = System.currentTimeMillis();
for (int i = 0; i < count; i++) {
    boolean result = true;
    Object object = new Object();
    for (int j = 0; j < test.length; j++) {
        result = result && test[j].test(object);
    }
}
System.out.println(System.currentTimeMillis() - milliseconds);

milliseconds = System.currentTimeMillis();
for (int i = 0; i < count; i++) {
    Object object = new Object();
    boolean result = test[0].test(object) && test[1].test(object) && test[2].test(object);
}
System.out.println(System.currentTimeMillis() - milliseconds);
Now both loops require almost the same time:
run:
3759
3368
BUILD SUCCESSFUL (total time: 7 seconds)
p.s.: check out this article for more about JIT compiler optimizations.
You are committing almost every basic mistake you can make with a microbenchmark.
You don't ensure the code cannot be optimized away, e.g. by actually using the result of the calculation.
Your two code paths have subtly but decidedly different logic (as pointed out, variant two will always short-circuit). The second case is easier for the JIT to optimize because test() returns a constant.
You did not warm up the code, so JIT compilation time is included somewhere in the measured execution time.
Your testing code does not account for the execution order of the test cases influencing the results. It is not fair to run case 1 and then case 2 with the same data and objects: by the time case 2 runs, the JIT will have optimized the test method and collected runtime statistics about its behavior (at the expense of case 1's execution time). A corrected sketch follows below.
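Here is a minimal sketch of the first variant with those mistakes addressed - it reuses the Test class from the question, consumes the result, warms up before timing, and leaves the other variant for a separate JVM run; all iteration counts here are arbitrary choices:

static long timedLoopVariant(Test[] test, long count) {
    long trueCount = 0;                        // consuming the result prevents dead-code elimination
    for (long i = 0; i < count; i++) {
        boolean result = true;
        Object object = new Object();
        for (int j = 0; j < test.length; j++) {
            result = result && test[j].test(object);
        }
        if (result) trueCount++;
    }
    return trueCount;
}

public static void main(String[] args) {
    Test[] test = { new Test(), new Test(), new Test() };
    for (int warm = 0; warm < 10; warm++) {    // warm-up: let the JIT compile before measuring
        timedLoopVariant(test, 1_000_000L);
    }
    long start = System.currentTimeMillis();
    long sink = timedLoopVariant(test, 100_000_000L);
    System.out.println((System.currentTimeMillis() - start) + " ms, sink = " + sink);
    // measure the direct-access variant in a separate JVM run, not after this one
}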
If a loop header takes one unit of time to execute, then in the first solution the loop-header evaluations take 3N units of time, while in the direct-access version they take N.
Besides the loop-header overhead, the first solution evaluates three && conditions per iteration, while the second evaluates only two.
And last but not least, there is boolean short-circuit evaluation, which causes your second, faster example to stop testing the conditions "prematurely": the entire result evaluates to false as soon as the first && condition is false.
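To see short-circuit evaluation in isolation, here is a tiny self-contained demo (all names are made up for illustration):

public class ShortCircuitDemo {
    static boolean touched;

    static boolean sideEffect() {
        touched = true;
        return true;
    }

    public static void main(String[] args) {
        boolean left = false;
        boolean r = left && sideEffect();  // right side is never evaluated
        System.out.println("r=" + r + ", sideEffect called: " + touched);  // prints: r=false, sideEffect called: false
    }
}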

Aparapi GPU execution slower than CPU

I am trying to test the performance of Aparapi.
I have seen some blogs where the results show that Aparapi does improve performance for data-parallel operations.
But I am not able to see that in my tests. Here is what I did: I wrote two programs, one using Aparapi, the other one using normal loops.
Program 1: In Aparapi
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class App
{
    public static void main(String[] args)
    {
        final int size = 50000000;
        final float[] a = new float[size];
        final float[] b = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }
        final float[] sum = new float[size];

        Kernel kernel = new Kernel(){
            @Override public void run() {
                int gid = getGlobalId();
                sum[gid] = a[gid] + b[gid];
            }
        };

        long t1 = System.currentTimeMillis();
        kernel.execute(Range.create(size));
        long t2 = System.currentTimeMillis();
        System.out.println("Execution mode = " + kernel.getExecutionMode());
        kernel.dispose();
        System.out.println(t2 - t1);
    }
}
Program 2: using loops
public class App2 {
    public static void main(String[] args) {
        final int size = 50000000;
        final float[] a = new float[size];
        final float[] b = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }
        final float[] sum = new float[size];

        long t1 = System.currentTimeMillis();
        for (int i = 0; i < size; i++) {
            sum[i] = a[i] + b[i];
        }
        long t2 = System.currentTimeMillis();
        System.out.println(t2 - t1);
    }
}
Program 1 takes around 330 ms whereas Program 2 takes only around 55 ms.
Am I doing something wrong here? I did print out the execution mode in the Aparapi program and it reports that the execution mode is GPU.
You did not do anything wrong - except for the benchmark itself.
Benchmarking is always tricky, particularly for cases where a JIT is involved (as for Java) and for libraries where many nitty-gritty details are hidden from the user (as for Aparapi). In both cases, you should at least execute the code section that you want to benchmark multiple times.
For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop itself is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not really be noticeable, but in more realistic or complex scenarios it will make a difference. Anyhow: when repeating the loop 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
For the Aparapi version, the point of possible GPU initialization was already mentioned in the comments. And here, this is indeed the main problem: When running the kernel 10 times, the timings on my machine are
1248
72
72
72
73
71
72
73
72
72
You see that the initial call causes all the overhead. The reason for this is that, during the first call to Kernel#execute(), it has to do all the initializations (basically converting the bytecode to OpenCL, compiling the OpenCL code, etc.). This is also mentioned in the documentation of the KernelRunner class:
The KernelRunner is created lazily as a result of calling Kernel.execute().
The effect of this - namely, a comparatively large delay for the first execution - has led to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like
kernel.execute(Range.create(1));
without a real workload, only to trigger the whole setup, so that the subsequent calls are fast. (This also works for your example).
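Putting this together, a sketch of how the per-call timings above can be reproduced, with the dummy call triggering the setup first (kernel and size as defined in the question):

kernel.execute(Range.create(1));      // dummy call: triggers bytecode-to-OpenCL conversion and compilation
for (int run = 0; run < 10; run++) {
    long t1 = System.currentTimeMillis();
    kernel.execute(Range.create(size));
    long t2 = System.currentTimeMillis();
    System.out.println(t2 - t1);      // now even the first timed call avoids the setup overhead
}
kernel.dispose();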
You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason for that is that the task of a simple vector addition like this is memory bound - for details, you may refer to this answer, which explains this term and some issues with GPU programming in general.
As an overly suggestive example for a case where you might benefit from the GPU, you might want to modify your test, in order to create an artificial compute bound task: When you change the kernel to involve some expensive trigonometric functions, like this
Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int gid = getGlobalId();
        sum[gid] = (float)(Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
    }
};
and the plain Java loop version accordingly, like this
for (int i = 0; i < size; i++) {
    sum[i] = (float)(Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));
}
then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of nearly 90 through Aparapi!
Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.
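If you want to compare that mode directly, the old com.amd.aparapi API lets you request the thread-pool mode explicitly. A minimal sketch, noting that this setter is deprecated in newer Aparapi releases:

kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP); // JTP = Java Thread Pool, i.e. CPU-parallel execution
kernel.execute(Range.create(size));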
Just add, before the main loop of kernel executions:

kernel.setExplicit(true);
kernel.put(a);
kernel.put(b);

and after it:

kernel.get(sum);
Although Aparapi does analyze the byte code of the Kernel.run() method (and any method reachable from Kernel.run()), Aparapi has no visibility to the call site. In the above code there is no way for Aparapi to detect that hugeArray is not modified within the for-loop body. Unfortunately, Aparapi must default to being 'safe' and copy the contents of hugeArray backwards and forwards to the GPU device.
https://github.com/aparapi/aparapi/blob/master/doc/ExplicitBufferHandling.md
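Putting the explicit-buffer calls together for a kernel that is executed in a loop, a minimal sketch (the pass count is illustrative):

kernel.setExplicit(true);                // we take over responsibility for host<->device transfers
kernel.put(a);                           // copy the inputs to the device once
kernel.put(b);
for (int pass = 0; pass < 10; pass++) {
    kernel.execute(Range.create(size));  // no implicit copies between passes
}
kernel.get(sum);                         // copy the result back once at the end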

Java stream performance for finding the maximum element of a list

I wrote a simple program to compare the performance of streams for finding the maximum element of a list of integers. Surprisingly, I found that the performance of the 'stream way' is 1/10 of the 'usual way'. Am I doing something wrong? Is there any condition under which the stream way will not be efficient? Could anyone offer a nice explanation for this behavior?
The "stream way" took 80 milliseconds; the "usual way" took 15 milliseconds.
Please find the code below
import java.util.ArrayList;
import java.util.Optional;
import java.util.Random;

public class Performance {
    public static void main(String[] args) {
        ArrayList<Integer> a = new ArrayList<Integer>();
        Random randomGenerator = new Random();
        for (int i = 0; i < 40000; i++) {
            a.add(randomGenerator.nextInt(40000));
        }

        long start_s = System.currentTimeMillis();
        Optional<Integer> m1 = a.stream().max(Integer::compare);
        long diff_s = System.currentTimeMillis() - start_s;
        System.out.println(diff_s);

        int e = a.size();
        Integer m = Integer.MIN_VALUE;
        long start = System.currentTimeMillis();
        for (int i = 0; i < e; i++)
            if (a.get(i) > m) m = a.get(i);
        long diff = System.currentTimeMillis() - start;
        System.out.println(diff);
    }
}
Yes, streams are slower for such simple operations. But your numbers are completely unrelated. If you think that 15 milliseconds is a satisfactory time for your task, then there is good news: after warm-up, the stream code can solve this problem in around 0.1-0.2 milliseconds, which is 70-150 times faster.
Here's a quick-and-dirty benchmark:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.annotations.*;
@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
    // Stream API is very nice to get random data for tests!
    List<Integer> a = new Random().ints(40000, 0, 40000).boxed()
                                  .collect(Collectors.toList());

    @Benchmark
    public Integer streamList() {
        return a.stream().max(Integer::compare).orElse(Integer.MIN_VALUE);
    }

    @Benchmark
    public Integer simpleList() {
        int e = a.size();
        Integer m = Integer.MIN_VALUE;
        for (int i = 0; i < e; i++)
            if (a.get(i) > m) m = a.get(i);
        return m;
    }
}
The results are:
Benchmark              Mode  Cnt    Score    Error  Units
StreamTest.simpleList  avgt   30   38.241 ±  0.434  us/op
StreamTest.streamList  avgt   30  215.425 ± 32.871  us/op
These are microseconds. So the stream version is actually much faster than in your test. Nevertheless, the simple version is faster still. So if you were fine with 15 ms, you can use whichever of these two versions you like: both will perform much faster.
If you want to get the best possible performance no matter what, you should get rid of boxed Integer objects and work with primitive array:
int[] b = new Random().ints(40000, 0, 40000).toArray();

@Benchmark
public int streamArray() {
    return Arrays.stream(b).max().orElse(Integer.MIN_VALUE);
}

@Benchmark
public int simpleArray() {
    int e = b.length;
    int m = Integer.MIN_VALUE;
    for (int i = 0; i < e; i++)
        if (b[i] > m) m = b[i];
    return m;
}
Both versions are faster now:
Benchmark               Mode  Cnt    Score   Error  Units
StreamTest.simpleArray  avgt   30   10.132 ± 0.193  us/op
StreamTest.streamArray  avgt   30  167.435 ± 1.155  us/op
Actually the stream version's result may vary greatly, as it involves many intermediate methods which are JIT-compiled at different times, so the speed may change in either direction after some iterations.
By the way, your original problem can be solved by the good old Collections.max method, without the Stream API, like this:
Integer max = Collections.max(a);
In general you should avoid testing artificial code that does not solve real problems. With artificial code you will get artificial results, which generally say nothing about the API's performance in real conditions.
The immediate difference that I see is that the stream way uses Integer::compare, which might require more autoboxing etc. versus an operator in the loop. Perhaps you can call Integer::compare in the loop to see if this is the reason?
EDIT: following the advice from Nicholas Robinson, I wrote a new version of the test. It uses a 400K-element list (the original size yielded zero-diff results), it uses Integer.compare in both cases, and it runs only one of them per invocation (I alternate between the two methods):
static List<Integer> a = new ArrayList<Integer>();

public static void main(String[] args)
{
    Random randomGenerator = new Random();
    for (int i = 0; i < 400000; i++) {
        a.add(randomGenerator.nextInt(400000));
    }
    long start = System.currentTimeMillis();
    //Integer max = checkLoop();
    Integer max = checkStream();
    long diff = System.currentTimeMillis() - start;
    System.out.println("max " + max + " diff " + diff);
}

static Integer checkStream()
{
    Optional<Integer> max = a.stream().max(Integer::compare);
    return max.get();
}

static Integer checkLoop()
{
    int e = a.size();
    Integer max = Integer.MIN_VALUE;
    for (int i = 0; i < e; i++) {
        if (Integer.compare(a.get(i), max) > 0) max = a.get(i);
    }
    return max;
}
The results for loop: max 399999 diff 10
The results for stream: max 399999 diff 40 (and sometimes I got 50)
In Java 8 they have put a lot of effort into making use of concurrent processing with the new lambdas. You will find the stream to be so much faster because the list can be processed concurrently in the most efficient way possible, whereas the usual way runs through the list sequentially.
Because the lambdas are static this makes threading easier; however, when you are accessing something like your hard drive (reading in a file line by line) you will probably find the stream won't be as efficient, because the hard drive can only access information sequentially.
[UPDATE]
The reason your stream took so much longer than the normal way is that you ran it first. The JRE is constantly trying to optimize performance, so by the time the usual way runs there will be a cache set up. If you run the usual way before the stream way you should get opposing results. I would recommend running the tests in separate main methods for the best results.

How to disable compiler and JVM optimizations?

I have this code that is testing Calendar.getInstance().getTimeInMillis() vs System.currentTimeMillis():
long before = getTimeInMilli();
for (int i = 0; i < TIMES_TO_ITERATE; i++)
{
    long before1 = getTimeInMilli();
    doSomeReallyHardWork();
    long after1 = getTimeInMilli();
}
long after = getTimeInMilli();
System.out.println(getClass().getSimpleName() + " total is " + (after - before));
I want to make sure no JVM or compiler optimization happens, so the test will be valid and will actually show the difference.
How to be sure?
EDIT: I changed the code example so it will be more clear. What I am checking here is how much time it takes to call getTimeInMilli() in different implementations - Calendar vs System.
I think you need to disable the JIT. Add the following option to your run command:
-Djava.compiler=NONE
You want optimization to happen, because it will in real life - the test wouldn't be valid if the JVM didn't optimize in the same way that it would in the real situation you're interested in.
However, if you want to make sure that the JVM doesn't remove calls that it could potentially consider no-ops otherwise, one option is to use the result - so if you're calling System.currentTimeMillis() repeatedly, you might sum all the return values and then display the sum at the end.
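For the code in the question, that could look like the following sketch; the sum is meaningless by itself, its only job is to make the calls observable (TIMES_TO_ITERATE as in the question):

long sum = 0;
for (int i = 0; i < TIMES_TO_ITERATE; i++) {
    sum += System.currentTimeMillis();    // using the return value prevents dead-code elimination
}
System.out.println("sum = " + sum);       // printing makes the result an observable side effect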
Note that you may still have some bias though - for example, there may be some optimization if the JVM can cheaply determine that only a tiny amount of time has passed since the last call to System.currentTimeMillis(), so it can use a cached value. I'm not saying that's actually the case here, but it's the kind of thing you need to think about. Ultimately, benchmarks can only really test the loads you give them.
One other thing to consider: assuming you want to model a real world situation where the code is run a lot, you should run the code a lot before taking any timing - because the Hotspot JVM will optimize progressively harder, and presumably you care about the heavily-optimized version and don't want to measure the time for JITting and the "slow" versions of the code.
As Stephen mentioned, you should almost certainly take the timing outside the loop... and don't forget to actually use the results...
What you are doing looks like benchmarking; you can read Robust Java benchmarking to get some good background about how to do it right. In short, you don't need to turn optimization off, because that is not what happens on a production server. Instead, you need an estimate that is as close as possible to the 'real' time / performance. Before the optimization kicks in, you need to 'warm up' your code; it looks like:
// warm up
for (int j = 0; j < 1000; j++) {
    for (int i = 0; i < TIMES_TO_ITERATE; i++)
    {
        long before1 = getTimeInMilli();
        doSomeReallyHardWork();
        long after1 = getTimeInMilli();
    }
}

// measure time
long before = getTimeInMilli();
for (int j = 0; j < 1000; j++) {
    for (int i = 0; i < TIMES_TO_ITERATE; i++)
    {
        long before1 = getTimeInMilli();
        doSomeReallyHardWork();
        long after1 = getTimeInMilli();
    }
}
long after = getTimeInMilli();
System.out.println("What to expect? " + (after - before) / 1000); // average time
When we measure the performance of our code we use this approach; it gives us a more or less realistic estimate of the time our code needs. It is even better to measure the code in separate methods:
public void doIt() {
    for (int i = 0; i < TIMES_TO_ITERATE; i++)
    {
        long before1 = getTimeInMilli();
        doSomeReallyHardWork();
        long after1 = getTimeInMilli();
    }
}

// warm up
for (int j = 0; j < 1000; j++) {
    doIt();
}

// measure time
long before = getTimeInMilli();
for (int j = 0; j < 1000; j++) {
    doIt();
}
long after = getTimeInMilli();
System.out.println("What to expect? " + (after - before) / 1000); // average time
The second approach is more precise, but it also depends on the VM. E.g. HotSpot can perform "on-stack replacement": if some part of a method is executed very often, it will be optimized by the VM, and the old version of the code will be exchanged for the optimized one while the method is executing. Of course that takes extra actions on the VM's side. JRockit does not do this; the optimized version of the code will be used only when the method is invoked again (so no 'runtime' optimization... I mean, in my first code sample the old code would be executed the whole time... except for the internals of doSomeReallyHardWork - they do not belong to this method, so the optimization will work well there).
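If you want to observe this behavior yourself, HotSpot can print its compilation decisions; in the output of the following command, on-stack-replacement compilations are marked with a '%' (YourMainClass is a placeholder):

java -XX:+PrintCompilation YourMainClass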
UPDATED: code in question was edited while I was answering ;)
Sorry, but what you are trying to do makes little sense.
If you turn off JIT compilation, then you are only going to measure how long it takes to call that method with JIT compilation turned off. This is not useful information ... because it tells you little if anything about what will happen when JIT compilation is turned on. [1]
The times between JIT on and off can be different by a huge factor. You are unlikely to want to run anything in production with JIT turned off.
A better approach would be to do this:
long before1 = getTimeInMilli();
for (int i = 0; i < TIMES_TO_ITERATE; i++) {
    doSomeReallyHardWork();
}
long after1 = getTimeInMilli();
... and / or use the nanosecond clock.
If you are trying to measure the time taken to call the two versions of getTimeInMillis(), then I don't understand the point of your call to doSomeReallyHardWork(). A more sensible benchmark would be this:
public long test() {
    long before1 = getTimeInMilli();
    long sum = 0;
    for (int i = 0; i < TIMES_TO_ITERATE; i++) {
        sum += getTimeInMilli();
    }
    long after1 = getTimeInMilli();
    System.out.println("Took " + (after1 - before1) + " milliseconds");
    return sum;
}
... and call that a number of times, until the times printed stabilize.
Either way, my main point still stands: turning off JIT compilation and / or optimization would mean that you were measuring something that is not useful to know, and not what you are really trying to find out. (Unless, that is, you are intending to run your application in production with JIT turned off ... which I find hard to believe ...)
[1] I note that someone has commented that turning off JIT compilation allowed them to easily demonstrate the difference between O(1), O(N) and O(N^2) algorithms for a class. But I would counter that it is better to learn how to write a correct micro-benchmark. And for serious purposes, you need to learn how to derive the complexity of the algorithms ... mathematically. Even with a perfect benchmark, you can get the wrong answer by trying to "deduce" complexity from performance measurements. (Take the behavior of HashMap for example.)

Code inside thread slower than outside thread..?

I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
    public void run()
    {
        //doSomething
    }
};
Then I submit the Runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what it could be or what, in that aspect is the difference with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out of context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this is part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer

// Initialize mode table used for basin of attraction
char[] modeTable = new char[L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char) 0);
int[] pointList = new int[L];

// Allocate memory for yk (current vector)
double[] yk = new double[lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double[lN];

int idxs2 = 0; int idxd2 = 0;

for (int i = 0; i < L; i++) {
    // if a mode was already assigned to this data point
    // then skip this point, otherwise proceed to
    // find its mode by applying mean shift...
    if (modeTable[i] == 1) {
        continue;
    }

    // initialize point list...
    int pointCount = 0;

    // Assign window center (window centers are
    // initialized by createLattice to be the point
    // data[i])
    idxs2 = i * lN;
    for (int j = 0; j < lN; j++)
        yk[j] = sdata[idxs2 + j]; // (sdata is an earlier defined final float[] of about 100,000 items)

    // Calculate the mean shift vector using the lattice
    /*****************************************************/

    // Initialize mean shift vector
    for (int j = 0; j < lN; j++) {
        Mh[j] = 0;
    }
    double wsuml = 0;
    double weight;

    // find bucket of yk
    int cBucket1 = (int) yk[0] + 1;
    int cBucket2 = (int) yk[1] + 1;
    int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
    int cBucket = cBucket1 + nBuck1 * (cBucket2 + nBuck2 * cBucket3);
    for (int j = 0; j < 27; j++) {
        idxd2 = buckets[cBucket + bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
        // list parse, crt point is cHeadList
        while (idxd2 >= 0) {
            idxs2 = lN * idxd2;
            // determine if inside search window
            double el = sdata[idxs2 + 0] - yk[0];
            double diff = el * el;
            el = sdata[idxs2 + 1] - yk[1];
            diff += el * el;
            //...
            idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
        }
    }
    //...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
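As an illustration of splitting the outer loop rather than wrapping the whole computation in one Runnable, here is a minimal sketch that partitions the pixel range across a fixed pool. The chunk count and pool size are arbitrary, it assumes the iterations are independent (which the modeTable in the real algorithm may violate), and error handling is omitted:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

int L = 100_000;                                // pixel count, as in the question
ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
int chunk = (L + 7) / 8;                        // 8 tasks: large enough to amortize scheduling overhead
List<Callable<Void>> tasks = new ArrayList<>();
for (int start = 0; start < L; start += chunk) {
    final int from = start;
    final int to = Math.min(start + chunk, L);
    tasks.add(() -> {
        for (int i = from; i < to; i++) {
            // body of the outer per-pixel loop goes here
        }
        return null;
    });
}
pool.invokeAll(tasks);                          // blocks until every chunk has finished (throws InterruptedException)
pool.shutdown();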
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal, and by default all threads are created with medium priority (5). So the code should show the same performance in the main application thread as in any other thread that you open.
Are you sure you are measuring the time of "doSomething" and not the overall time that your program runs? I believe you are measuring the time of the operation together with the time required to create and start the thread.
When you create a new thread you always have some overhead. If you have a small piece of code, you may experience a performance loss.
Once you have more code (bigger tasks), you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: the decision of how small a task can be while parallelizing it is still worthwhile is a well-known topic in parallel computation :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking, when measuring performance it's easy to get misled when measuring small pieces of work. I would be looking to get a run at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "no thread" and "threaded" cases is that you have gone from one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.
