Today I was faced with the method constructServiceUrl() of the org.jasig.cas.client.util.CommonUtils class, and I thought it was very strange:
final StringBuffer buffer = new StringBuffer();
synchronized (buffer)
{
if (!serverName.startsWith("https://") && !serverName.startsWith("http://"))
{
buffer.append(request.isSecure() ? "https://" : "http://");
}
buffer.append(serverName);
buffer.append(request.getRequestURI());
if (CommonUtils.isNotBlank(request.getQueryString()))
{
final int location = request.getQueryString().indexOf(
artifactParameterName + "=");
if (location == 0)
{
final String returnValue = encode ? response.encodeURL(buffer.toString()) : buffer.toString();
if (LOG.isDebugEnabled())
{
LOG.debug("serviceUrl generated: " + returnValue);
}
return returnValue;
}
buffer.append("?");
if (location == -1)
{
buffer.append(request.getQueryString());
}
else if (location > 0)
{
final int actualLocation = request.getQueryString()
.indexOf("&" + artifactParameterName + "=");
if (actualLocation == -1)
{
buffer.append(request.getQueryString());
}
else if (actualLocation > 0)
{
buffer.append(request.getQueryString().substring(0, actualLocation));
}
}
}
}
Why did the author synchronize on a local variable?
This is an example of manual "lock coarsening" and may have been done to get a performance boost.
Consider these two snippets:
StringBuffer b = new StringBuffer();
for(int i = 0 ; i < 100; i++){
b.append(i);
}
versus:
StringBuffer b = new StringBuffer();
synchronized(b){
for(int i = 0 ; i < 100; i++){
b.append(i);
}
}
In the first case, the StringBuffer must acquire and release a lock 100 times (because append is a synchronized method), whereas in the second case, the lock is acquired and released only once. This can give you a performance boost and is probably why the author did it. In some cases, the compiler can perform this lock coarsening for you (but not around looping constructs because you could end up holding a lock for long periods of time).
By the way, the compiler can detect that an object is not "escaping" from a method and so remove acquiring and releasing locks on the object altogether (lock elision) since no other thread can access the object anyway. A lot of work has been done on this in JDK7.
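For illustration, here is a minimal sketch (not the CAS code) of the kind of method where this applies: the buffer never leaves the method, so a JVM that performs lock elision can strip the locking done inside StringBuffer's synchronized methods.
public class LockElisionSketch {
    // The StringBuffer is created, used and discarded entirely inside this method,
    // so no other thread can ever observe it. A JVM with escape analysis enabled
    // may elide the lock acquire/release inside each append() call.
    public static String joinLocally(String[] parts) {
        StringBuffer buffer = new StringBuffer(); // never stored in a field, never passed out
        for (String part : parts) {
            buffer.append(part);                  // internally synchronized, but the lock can be elided
        }
        return buffer.toString();                 // only the resulting String escapes
    }
}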
Update:
I carried out two quick tests:
1) WITHOUT WARM-UP:
In this test, I did not run the methods a few times to "warm-up" the JVM. This means that the Java Hotspot Server Compiler did not get a chance to optimize code e.g. by eliminating locks for escaping objects.
JDK 1.4.2_19 1.5.0_21 1.6.0_21 1.7.0_06
WITH-SYNC (ms) 3172 1108 3822 2786
WITHOUT-SYNC (ms) 3660 801 509 763
STRINGBUILDER (ms) N/A 450 434 475
With JDK 1.4, the code with the external synchronized block is faster. However, with JDK 5 and above the code without external synchronization wins.
2) WITH WARM-UP:
In this test, the methods were run a few times before the timings were calculated. This was done so that the JVM could optimize code by performing escape analysis.
JDK 1.4.2_19 1.5.0_21 1.6.0_21 1.7.0_06
WITH-SYNC (ms) 3190 614 565 587
WITHOUT-SYNC (ms) 3593 779 563 610
STRINGBUILDER (ms) N/A 450 434 475
Once again, with JDK 1.4, the code with the external synchronized block is faster. However, with JDK 5 and above, both methods perform equally well.
Here is my test class (feel free to improve):
public class StringBufferTest {
public static void unsync() {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < 9999999; i++) {
buffer.append(i);
buffer.delete(0, buffer.length() - 1);
}
}
public static void sync() {
StringBuffer buffer = new StringBuffer();
synchronized (buffer) {
for (int i = 0; i < 9999999; i++) {
buffer.append(i);
buffer.delete(0, buffer.length() - 1);
}
}
}
public static void sb() {
StringBuilder buffer = new StringBuilder();
synchronized (buffer) {
for (int i = 0; i < 9999999; i++) {
buffer.append(i);
buffer.delete(0, buffer.length() - 1);
}
}
}
public static void main(String[] args) {
System.out.println(System.getProperty("java.version"));
// warm up
for(int i = 0 ; i < 10 ; i++){
unsync();
sync();
sb();
}
long start = System.currentTimeMillis();
unsync();
long end = System.currentTimeMillis();
long duration = end - start;
System.out.println("Unsync: " + duration);
start = System.currentTimeMillis();
sync();
end = System.currentTimeMillis();
duration = end - start;
System.out.println("sync: " + duration);
start = System.currentTimeMillis();
sb();
end = System.currentTimeMillis();
duration = end - start;
System.out.println("sb: " + duration);
}
}
Inexperience, incompetence, or more likely dead yet benign code that remains after refactoring.
You're right to question the worth of this - modern compilers will use escape analysis to determine that the object in question cannot be referenced by another thread, and so will elide (remove) the synchronization altogether.
(In a broader sense, it is sometimes useful to synchronize on a local variable - they are still objects after all, and another thread can still have a reference to them (so long as they have been somehow "published" after their creation). Still, this is seldom a good idea, as it's often unclear and very difficult to get right - a more explicit locking mechanism shared with the other threads is likely to prove better overall in these cases.)
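For example, here is a contrived sketch (not from the code above) of the rare case where locking on a local variable is meaningful: the lock object is local to the caller but has been published to a worker thread, so both sides really do exclude each other.
import java.util.ArrayList;
import java.util.List;

public class LocalLockSketch {
    public static void main(String[] args) throws InterruptedException {
        // 'lock' is a local variable, but it is published to the worker thread
        // via the lambda, so synchronizing on it actually coordinates two threads.
        final Object lock = new Object();
        final List<String> results = new ArrayList<>();

        Thread worker = new Thread(() -> {
            synchronized (lock) {          // same object as in main, so the blocks exclude each other
                results.add("from worker");
            }
        });
        worker.start();

        synchronized (lock) {              // had 'lock' never been published, this would protect nothing
            results.add("from caller");
        }
        worker.join();
        System.out.println(results);
    }
}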
I don't think the synchronization can have any effect, since buffer is never passed to another method or stored in a field before it goes out of scope, so no other thread can possibly have access to it.
The reason it is there could be political - I've been in a similar situation: a "pointy-haired boss" insisted that I clone a string in a setter method instead of just storing the reference, for fear of having the contents changed. He didn't deny that strings are immutable but insisted on cloning it "just in case." Since it was harmless (just like this synchronization), I did not argue.
That's a bit insane...it doesn't do anything except add overhead. Not to mention that calls to StringBuffer are already synchronized, which is why StringBuilder is preferred for cases where you won't have multiple threads accessing the same instance.
IMO, there is no need to synchronize on that local variable. It would only make sense if the buffer were exposed to others, e.g. by passing it to a method that stores it and potentially uses it in another thread.
But as this is not the case, I see no use for it.
Related
Today I was confused by the results I got from VisualVM profiling.
I have the following simple Java method:
public class Encoder {
...
private BitString encode(InputStream in, Map<Character, BitString> table)
throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
BitString result = new BitString();
int i;
while ((i = reader.read()) != -1) {
char ch = (char) i;
BitString string = table.get(ch);
result = result.append(string);
}
return result;
}
}
This method reads characters from a stream, one at a time. For each character it looks up its bit-string representation and joins these bit-strings to represent the whole stream.
A BitString is a custom data structure that represents a sequence of bits using an underlying byte array.
The method performs very poorly. The problem lies with BitString#append - that method creates a new byte array, copies the bits from both input BitStrings and returns it as a new BitString instance.
public BitString append(BitString other) {
BitString result = new BitString(size + other.size);
int pos = 0;
for (byte b : this) {
result.set(pos, b);
pos++;
}
for (byte b : other) {
result.set(pos, b);
pos++;
}
return result;
}
However, when I tried to use VisualVM to verify what's happening, here's what I got:
I have very little experience with VisualVM and profiling in general. To my understanding, this looks as if the problem lay somewhere in the encode method itself, not in append.
To be sure, I surrounded the whole encode method and the append call with custom time measuring, like this:
public class Encoder {
private BitString encode(InputStream in, Map<Character, BitString> table)
throws IOException {
>> long startTime = System.currentTimeMillis();
>> long appendDuration = 0;
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
BitString result = new BitString();
int i;
>> long count = 0;
while ((i = reader.read()) != -1) {
char ch = (char) i;
BitString string = table.get(ch);
>> long appendStartTime = System.currentTimeMillis();
result = result.append(string);
>> long appendEndTime = System.currentTimeMillis();
>> appendDuration += appendEndTime - appendStartTime;
>> count++;
>> if (count % 1000 == 0) {
>> log.info(">>> CHARACTERS PROCESSED: " + count);
>> long endTime = System.currentTimeMillis();
>> log.info(">>> TOTAL ENCODE DURATION: " + (endTime - startTime) + " ms");
>> log.info(">>> TOTAL APPEND DURATION: " + appendDuration + " ms");
>> }
}
return result;
}
}
And I got the following results:
CHARACTERS PROCESSED: 102000
TOTAL ENCODE DURATION: 188276 ms
APPEND CALL DURATION: 188179 ms
This seems in contradiction with the results from Visual VM.
What am I missing?
You're seeing this behavior because VisualVM can only sample the call stack at safepoints, and the JVM is optimizing the safepoints out of your code. This causes the samples to be lumped together under "Self time" instead, which makes it artificially inflated and misleading. There are two possible fixes:
To make VisualVM work better, add JVM options to keep more safepoints, like -XX:-Inline and -XX:+UseCountedLoopSafepoints. These will slow your code down some, but will make the profiling results much more accurate. This solution is easy, and it's usually good enough. Just remember to remove these options when you're not profiling!
If you don't mind switching profilers, you can use Java Mission Control or honest-profiler. These have the ability to take samples at places other than safepoints, using a special API of the JVM. This solution is a little more accurate, since you're profiling the fully optimized code, but in my opinion those profilers are harder to use than VisualVM.
In your specific case, the method call to BitString.append() is probably being inlined by the JVM for performance. This causes the safepoint that's normally at the end of a method call to be removed, which means that method won't show up in the profiler anymore.
There's an excellent blog post here with more details about what safepoints are and how they work, and another one here that talks in more detail about the interaction between safepoints and sampling profilers.
[THIS ANSWER IS INVALID, but I am keeping it until the OP gets help from another member, because this post contains two comments which help others to understand the problem.]
In this case VisualVM measured the actual CPU time, but the value you measured yourself is elapsed (wall-clock) time.
If the executing threads had to wait for I/O or the network, that time is not counted as CPU time.
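If you want to see the difference between the two measurements yourself, here is a minimal sketch comparing wall-clock time with the CPU time the JVM reports for the current thread (the sleep just stands in for waiting on I/O):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsElapsedTime {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime(); // needs thread CPU time support

        Thread.sleep(500);                  // stand-in for blocking on I/O: wall-clock time, almost no CPU
        long dummy = 0;
        for (int i = 0; i < 50_000_000; i++) {
            dummy += i;                     // stand-in for real work: burns CPU
        }

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        System.out.println("elapsed: " + wallMs + " ms, cpu: " + cpuMs + " ms (dummy=" + dummy + ")");
    }
}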
I know there's a lot of topics talking about Reflection performance.
Even the official Java docs say that Reflection is slower, but I have this code:
public class ReflectionTest {
public static void main(String[] args) throws Exception {
Object object = new Object();
Class<Object> c = Object.class;
int loops = 100000;
long start = System.currentTimeMillis();
Object s;
for (int i = 0; i < loops; i++) {
s = object.toString();
System.out.println(s);
}
long regularCalls = System.currentTimeMillis() - start;
java.lang.reflect.Method method = c.getMethod("toString");
start = System.currentTimeMillis();
for (int i = 0; i < loops; i++) {
s = method.invoke(object);
System.out.println(s);
}
long reflectiveCalls = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for (int i = 0; i < loops; i++) {
method = c.getMethod("toString");
s = method.invoke(object);
System.out.println(s);
}
long reflectiveLookup = System.currentTimeMillis() - start;
System.out.println(loops + " regular method calls:" + regularCalls
+ " milliseconds.");
System.out.println(loops + " reflective method calls without lookup:"
+ reflectiveCalls+ " milliseconds.");
System.out.println(loops + " reflective method calls with lookup:"
+ reflectiveLookup + " milliseconds.");
}
}
I don't think this is a valid benchmark, but it should at least show some difference.
I executed it expecting to see the reflective calls be a bit slower than the regular ones.
But this prints this:
100000 regular method calls:1129 milliseconds.
100000 reflective method calls without lookup:910 milliseconds.
100000 reflective method calls with lookup:994 milliseconds.
Just as a note: first I executed it without that bunch of sysouts, and then I realized that some JVM optimizations were just making it go faster, so I added these printlns to see whether reflection was still faster.
The result without sysouts are:
100000 regular method calls:68 milliseconds.
100000 reflective method calls without lookup:48 milliseconds.
100000 reflective method calls with lookup:168 milliseconds.
I have seen on the internet that the same test executed on older JVMs makes the reflective calls without lookup about two times slower than regular calls, and that the gap has been shrinking with newer releases.
If anyone can execute it and tell me I'm wrong, or at least show me whether there is something different from the past that makes it faster, I'd appreciate it.
Following the instructions in the answers, I ran each loop separately, and the results are (without sysouts):
100000 regular method calls:70 milliseconds.
100000 reflective method calls without lookup:120 milliseconds.
100000 reflective method calls with lookup:129 milliseconds.
Never performance-test different bits of code in the same "run". The JVM has various optimisations which mean that, though the end result is the same, how the internals are performed may differ. In more concrete terms, during your test the JVM may have noticed you are calling Object.toString a lot and may have started to inline the method calls to Object.toString. It may have started to perform loop unrolling. Or there could have been a garbage collection in the first loop but not the second or third loops.
To get a more meaningful, but still not totally accurate picture you should separate your test into three separate programs.
The results on my computer (with no printing and 1,000,000 runs each)
All three loops run in same program
1000000 regular method calls: 490 milliseconds.
1000000 reflective method calls without lookup: 393 milliseconds.
1000000 reflective method calls with lookup: 978 milliseconds.
Loops run in separate programs
1000000 regular method calls: 475 milliseconds.
1000000 reflective method calls without lookup: 555 milliseconds.
1000000 reflective method calls with lookup: 1160 milliseconds.
There's an article by Brian Goetz on microbenchmarks that's worth reading. It looks like you're not doing anything to warm up the JVM (meaning give it a chance to do whatever inlining or other optimizations it's going to do) before doing your measurements, so it's likely the non-reflective test is still not warmed-up yet, and that could skew your numbers.
When you have multiple long-running loops, the first loop can trigger the method to compile, resulting in the later loops being optimised from the start. However, the optimisation can be sub-optimal as it has no runtime information for those loops. toString is relatively expensive and could be taking longer than the reflection calls.
You don't need separate programs to avoid loop being optimised due to an earlier loop. You can run them in different methods.
The results I get are
Average regular method calls:2 ns.
Average reflective method calls without lookup:10 ns.
Average reflective method calls with lookup:240 ns.
The code
import java.lang.reflect.Method;
public class ReflectionTest {
public static void main(String[] args) throws Exception {
int loops = 1000 * 1000;
Object object = new Object();
long start = System.nanoTime();
Object s;
testMethodCall(object, loops);
long regularCalls = System.nanoTime() - start;
java.lang.reflect.Method method = Object.class.getMethod("getClass");
method.setAccessible(true);
start = System.nanoTime();
testInvoke(object, loops, method);
long reflectiveCalls = System.nanoTime() - start;
start = System.nanoTime();
testGetMethodInvoke(object, loops);
long reflectiveLookup = System.nanoTime() - start;
System.out.println("Average regular method calls:"
+ regularCalls / loops + " ns.");
System.out.println("Average reflective method calls without lookup:"
+ reflectiveCalls / loops + " ns.");
System.out.println("Average reflective method calls with lookup:"
+ reflectiveLookup / loops + " ns.");
}
private static Object testMethodCall(Object object, int loops) {
Object s = null;
for (int i = 0; i < loops; i++) {
s = object.getClass();
}
return s;
}
private static Object testInvoke(Object object, int loops, Method method) throws Exception {
Object s = null;
for (int i = 0; i < loops; i++) {
s = method.invoke(object);
}
return s;
}
private static Object testGetMethodInvoke(Object object, int loops) throws Exception {
Method method;
Object s = null;
for (int i = 0; i < loops; i++) {
method = Object.class.getMethod("getClass");
s = method.invoke(object);
}
return s;
}
}
Micro-benchmarks like this are never going to be accurate at all - as the VM "warms up" it'll inline bits of code and optimise bits of code as it goes along, so the same thing executed 2 minutes into a program could vastly outperform it right at the start.
In terms of what's happening here, my guess is that the first "normal" method call block warms it up, so the reflective blocks (and indeed all subsequent calls) would be faster. The only overhead added through reflectively calling a method that I can see is looking up the pointer to that method, which is a nanosecond-scale operation anyway and would be easily cached by the JVM. The rest would be on how the VM is warmed up, which it is by the time you reach the reflective calls.
There is no inherent reason why a reflective call should be slower than a normal call. The JVM can optimize them into the same thing.
Practically, human resources are limited, and they had to optimize normal calls first. As time passes they can work on optimizing reflective calls, especially as reflection becomes more and more popular.
I have been writing my own micro-benchmark, without loops, and with System.nanoTime():
public static void main(String[] args) throws NoSuchMethodException, IllegalArgumentException, IllegalAccessException, InvocationTargetException
{
Object obj = new Object();
Class<Object> objClass = Object.class;
String s;
long start = System.nanoTime();
s = obj.toString();
long directInvokeEnd = System.nanoTime();
System.out.println(s);
long methodLookupStart = System.nanoTime();
java.lang.reflect.Method method = objClass.getMethod("toString");
long methodLookupEnd = System.nanoTime();
s = (String) (method.invoke(obj));
long reflectInvokeEnd = System.nanoTime();
System.out.println(s);
System.out.println(directInvokeEnd - start);
System.out.println(methodLookupEnd - methodLookupStart);
System.out.println(reflectInvokeEnd - methodLookupEnd);
}
I have been executing that in Eclipse on my machine a dozen times, and the results vary quite a bit, but here is what I typically get:
the direct method invocation clocks at 40-50 microseconds
method lookup clocks at 150-200 microseconds
reflective invocation with the method variable clocks at 250-310 microseconds.
Now, do not forget the caveats on microbenchmarks described in Nathan's reply - there are certainly a lot of flaws in this microbenchmark - and trust the documentation when it says that reflection is a LOT slower than direct invocation.
It strikes me that you have placed a "System.out.println(s)" call inside your inner benchmark loop.
Since performing IO is bound to be slow, it actually "swallows up" your benchmark and the overhead of the invoke becomes negligible.
Try removing the "println()" call and running code like this, I'm sure you'd be surprised by the result (some of the silly calculations are needed to avoid the compiler optimizing away the calls altogether):
public class Experius
{
public static void main(String[] args) throws Exception
{
Experius a = new Experius();
int count = 10000000;
int v = 0;
long tm = System.currentTimeMillis();
for ( int i = 0; i < count; ++i )
{
v = a.something(i + v);
++v;
}
tm = System.currentTimeMillis() - tm;
System.out.println("Time: " + tm);
tm = System.currentTimeMillis();
Method method = Experius.class.getMethod("something", Integer.TYPE);
for ( int i = 0; i < count; ++i )
{
Object o = method.invoke(a, i + v);
++v;
}
tm = System.currentTimeMillis() - tm;
System.out.println("Time: " + tm);
}
public int something(int n)
{
return n + 5;
}
}
-- TR
Even if you look up the method in both cases (i.e. before both the 2nd and 3rd loops), the first lookup takes far less time than the second lookup, which should have been the other way around, and is less than a regular method call on my machine.
Nevertheless, if you use the 2nd loop with method lookup and the System.out.println statement, I get this:
regular call : 740 ms
look up(2nd loop) : 640 ms
look up ( 3rd loop) : 800 ms
Without System.out.println statement, I get:
regular call : 78 ms
look up (2nd) : 37 ms
look up (3rd ) : 112 ms
Suppose our application has only one thread, and we are using StringBuffer. What is the problem?
I mean, if StringBuffer can handle multiple threads through synchronization, what is the problem with using it from a single thread?
Why use StringBuilder instead?
StringBuffers are thread-safe, meaning that they have synchronized methods to control access so that only one thread can access a StringBuffer object's synchronized code at a time. Thus, StringBuffer objects are generally safe to use in a multi-threaded environment where multiple threads may be trying to access the same StringBuffer object at the same time.
StringBuilder's access is not synchronized so that it is not thread-safe. By not being synchronized, the performance of StringBuilder can be better than StringBuffer. Thus, if you are working in a single-threaded environment, using StringBuilder instead of StringBuffer may result in increased performance. This is also true of other situations such as a StringBuilder local variable (ie, a variable within a method) where only one thread will be accessing a StringBuilder object.
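For a method-local buffer the switch is usually just a type change; a small sketch:
// Before: every append acquires and releases a lock, even though only this thread ever sees the buffer.
public static String joinWithBuffer(String[] parts) {
    StringBuffer sb = new StringBuffer();
    for (String part : parts) {
        sb.append(part).append(',');
    }
    return sb.toString();
}

// After: identical calls, no synchronization overhead.
public static String joinWithBuilder(String[] parts) {
    StringBuilder sb = new StringBuilder();
    for (String part : parts) {
        sb.append(part).append(',');
    }
    return sb.toString();
}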
So, prefer StringBuilder because,
Small performance gain.
StringBuilder is a 1:1 drop-in replacement for the StringBuffer class.
StringBuilder is not thread synchronized and therefore performs better on most implementations of Java
Check this out :
Don't Use StringBuffer!
StringBuffer vs. StringBuilder performance comparison
StringBuilder is supposed to be a (tiny) bit faster because it isn't synchronised (thread safe).
You can notice the difference in really heavy applications.
The StringBuilder class should generally be used in preference to this one, as it supports all of the same operations but it is faster, as it performs no synchronization.
http://download.oracle.com/javase/6/docs/api/java/lang/StringBuffer.html
Using StringBuffer in multiple threads is next to useless and in reality almost never happens.
Consider the following
Thread1: sb.append(key1).append("=").append(value1);
Thread2: sb.append(key2).append("=").append(value2);
Each append is synchronized, but a thread can stop at any point, so you can end up with any of the following interleavings and more:
key1=value1key2=value2
key1key2==value2value1
key2key1=value1=value2
key2=key1=value2value1
This can be avoided by synchronizing the whole line at a time, but this defeats the point of using StringBuffer instead of StringBuilder.
Even if you have a correctly synchronized view, it is more complicated than just creating a thread-local copy of the whole line, e.g. in a StringBuilder, and logging whole lines at a time to a class like a Writer.
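A sketch of that approach (the LineLogger/PrintWriter names here are just stand-ins): each thread builds its whole line in a local StringBuilder, and only the hand-off of the finished line is synchronized.
import java.io.PrintWriter;

public class LineLogger {
    private final PrintWriter out;

    public LineLogger(PrintWriter out) {
        this.out = out;
    }

    // Each thread builds its complete line locally, so there is nothing to interleave.
    public void log(String key, String value) {
        StringBuilder line = new StringBuilder();   // local, unshared, unsynchronized
        line.append(key).append('=').append(value);
        write(line.toString());
    }

    // Only the hand-off of the finished line needs synchronization.
    private synchronized void write(String line) {
        out.println(line);
    }
}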
StringBuffer is not wrong in a single-threaded application. It will work just as well as StringBuilder.
The only difference is the tiny overhead added by having all synchronized methods, which brings no advantage in a single-threaded application.
My opinion is that the main reason StringBuilder was introduced is that the compiler used StringBuffer (and now uses StringBuilder) when compiling code that contains String concatenation: in those cases synchronization is never necessary, and replacing all of those places with an unsynchronized StringBuilder can provide a small performance improvement.
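Roughly speaking (the exact bytecode varies by compiler version, and JDK 9+ uses invokedynamic instead), a concatenation is compiled as if the StringBuilder chain had been written by hand:
// Placeholder values just for this sketch.
String name = "alice";
int attempts = 3;

// What the source says:
String message = "user=" + name + ", attempts=" + attempts;

// Roughly what older javac versions generate for it: an unsynchronized StringBuilder,
// where pre-JDK 5 compilers used StringBuffer instead.
String message2 = new StringBuilder()
        .append("user=")
        .append(name)
        .append(", attempts=")
        .append(attempts)
        .toString();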
StringBuilder has better performance because its methods are not synchronized.
So if you do not need to build a String concurrently (which is a rather untypical scenario anyway), then there's no need to "pay" for the unnecessary synchronization overhead.
This will help you guys.
To be straight: StringBuilder is faster than StringBuffer.
public class ConcatPerf {
private static final int ITERATIONS = 100000;
private static final int BUFFSIZE = 16;
private void concatStrAdd() {
System.out.print("concatStrAdd -> ");
long startTime = System.currentTimeMillis();
String concat = new String("");
for (int i = 0; i < ITERATIONS; i++) {
concat += i % 10;
}
//System.out.println("Content: " + concat);
long endTime = System.currentTimeMillis();
System.out.print("length: " + concat.length());
System.out.println(" time: " + (endTime - startTime));
}
private void concatStrBuff() {
System.out.print("concatStrBuff -> ");
long startTime = System.currentTimeMillis();
StringBuffer concat = new StringBuffer(BUFFSIZE);
for (int i = 0; i < ITERATIONS; i++) {
concat.append(i % 10);
}
long endTime = System.currentTimeMillis();
//System.out.println("Content: " + concat);
System.out.print("length: " + concat.length());
System.out.println(" time: " + (endTime - startTime));
}
private void concatStrBuild() {
System.out.print("concatStrBuild -> ");
long startTime = System.currentTimeMillis();
StringBuilder concat = new StringBuilder(BUFFSIZE);
for (int i = 0; i < ITERATIONS; i++) {
concat.append(i % 10);
}
long endTime = System.currentTimeMillis();
// System.out.println("Content: " + concat);
System.out.print("length: " + concat.length());
System.out.println(" time: " + (endTime - startTime));
}
public static void main(String[] args) {
ConcatPerf st = new ConcatPerf();
System.out.println("Iterations: " + ITERATIONS);
System.out.println("Buffer : " + BUFFSIZE);
st.concatStrBuff();
st.concatStrBuild();
st.concatStrAdd();
}
}
Output
run:
Iterations: 100000
Buffer : 16
concatStrBuff -> length: 100000 time: 11
concatStrBuild -> length: 100000 time: 4
concatStrAdd ->
Manish, although there is just one thread operating on your StringBuffer instance, there is some overhead in acquiring and releasing the monitor lock on the StringBuffer instance whenever any of its methods are invoked. Hence StringBuilder is a preferable choice in single thread environment.
There is a huge cost to synchronizing objects. Don't see a program as a standalone entity; it's not a problem when you are reading the concepts and applying them to small programs like the one you mentioned in your question details. The problems arise when we want to scale the system. In that case your single-threaded program might be dependent on several other methods/programs/entities, so synchronized objects can cause serious complexity in terms of performance. So if you are sure that there is no need to synchronize an object, you should use StringBuilder, as it is good programming practice. In the end we want to learn programming to make scalable, high-performance systems, and that is what we should do!
I have created a class to handle my debug outputs so that I don't need to strip out all my log outputs before release.
public class Debug {
public static void debug( String module, String message) {
if( Release.DEBUG )
Log.d(module, message);
}
}
After reading another question, I have learned that the contents of the if statement are not compiled if the constant Release.DEBUG is false.
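For context, Release here boils down to something like the following sketch (the real class isn't shown in the question); DEBUG has to be a compile-time constant for the stripping to happen:
public final class Release {
    // A static final boolean initialized with a literal is a compile-time constant,
    // so javac drops the body of "if (Release.DEBUG) { ... }" entirely when it is false.
    public static final boolean DEBUG = false;  // flip to true for debug builds

    private Release() { }
}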
What I want to know is how much overhead is generated by running this empty method? (Once the if clause is removed there is no code left in the method) Is it going to have any impact on my application? Obviously performance is a big issue when writing for mobile handsets =P
Thanks
Gary
Measurements done on Nexus S with Android 2.3.2:
10^6 iterations of 1000 calls to an empty static void function: 21s <==> 21ns/call
10^6 iterations of 1000 calls to an empty non-static void function: 65s <==> 65ns/call
10^6 iterations of 500 calls to an empty static void function: 3.5s <==> 7ns/call
10^6 iterations of 500 calls to an empty non-static void function: 28s <==> 56ns/call
10^6 iterations of 100 calls to an empty static void function: 2.4s <==> 24ns/call
10^6 iterations of 100 calls to an empty non-static void function: 2.9s <==> 29ns/call
control:
10^6 iterations of an empty loop: 41ms <==> 41ns/iteration
10^7 iterations of an empty loop: 560ms <==> 56ns/iteration
10^9 iterations of an empty loop: 9300ms <==> 9.3ns/iteration
I've repeated the measurements several times. No significant deviations were found.
You can see that the per-call cost can vary greatly depending on workload (possibly due to JIT compiling), but three conclusions can be drawn:
dalvik/java sucks at optimizing dead code;
static function calls can be optimized much better than non-static ones (non-static functions are virtual and need to be looked up in a virtual table);
the cost on the Nexus S is not greater than 70ns/call (that's ~70 CPU cycles), which is comparable with the cost of one empty for-loop iteration (i.e. one increment and one condition check on a local variable).
Observe that in your case the string argument will always be evaluated. If you do string concatenation, this will involve creating intermediate strings. This will be very costly and involve a lot of gc. For example executing a function:
void empty(String string){
}
called with arguments such as
empty("Hello " + 42 + " this is a string " + count );
10^4 iterations of 100 such calls take 10s. That is 10us/call, i.e. ~1000 times slower than just an empty call. It also produces a huge amount of GC activity. The only way to avoid this is to manually inline the guard, i.e. use the if statement instead of the debug function call. It's ugly, but it's the only way to make it work.
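A sketch of that manual guard, reusing the Debug and Release classes from the question (the module name is made up):
// Without the guard: the concatenation runs and allocates on every call,
// even though Debug.debug() immediately discards the message.
Debug.debug("MyModule", "Hello " + 42 + " this is a string " + count);

// With the guard: when Release.DEBUG is a compile-time false constant, javac drops
// the whole block, so neither the concatenation nor the call ever happens.
if (Release.DEBUG) {
    Debug.debug("MyModule", "Hello " + 42 + " this is a string " + count);
}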
Unless you call this from within a deeply nested loop, I wouldn't worry about it.
A good compiler removes the entire empty method, resulting in no overhead at all. I'm not sure if the Dalvik compiler already does this, but I suspect it's likely, at least since the arrival of the Just-in-time compiler with Froyo.
See also: Inline expansion
In terms of performance, the overhead of generating the messages which get passed into the debug function is going to be a lot more serious, since they are likely to do memory allocations, e.g.
Debug.debug(mymodule, "My error message" + myerrorcode);
which will still occur even though the message is binned.
Unfortunately you really need the "if( Release.DEBUG ) " around the calls to this function rather than inside the function itself if your goal is performance, and you will see this in a lot of android code.
This is an interesting question and I like #misiu_mp's analysis, so I thought I would update it with a 2016 test on a Nexus 7 running Android 6.0.1. Here is the test code:
public void runSpeedTest() {
long startTime;
long[] times = new long[100000];
long[] staticTimes = new long[100000];
for (int i = 0; i < times.length; i++) {
startTime = System.nanoTime();
for (int j = 0; j < 1000; j++) {
emptyMethod();
}
times[i] = (System.nanoTime() - startTime) / 1000;
startTime = System.nanoTime();
for (int j = 0; j < 1000; j++) {
emptyStaticMethod();
}
staticTimes[i] = (System.nanoTime() - startTime) / 1000;
}
int timesSum = 0;
for (int i = 0; i < times.length; i++) { timesSum += times[i]; Log.d("status", "time," + times[i]); sleep(); }
int timesStaticSum = 0;
for (int i = 0; i < times.length; i++) { timesStaticSum += staticTimes[i]; Log.d("status", "statictime," + staticTimes[i]); sleep(); }
sleep();
Log.d("status", "final speed = " + (timesSum / times.length));
Log.d("status", "final static speed = " + (timesStaticSum / times.length));
}
private void sleep() {
try {
Thread.sleep(10);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void emptyMethod() { }
private static void emptyStaticMethod() { }
The sleep() was added to prevent overflowing the Log.d buffer.
I played around with it many times and the results were pretty consistent with #misiu_mp:
10^5 iterations of 1000 calls to an empty static void function: 29ns/call
10^5 iterations of 1000 calls to an empty non-static void function: 34ns/call
The static method call was always slightly faster than the non-static method call, but it would appear that a) the gap has closed significantly since Android 2.3.2 and b) there's still a cost to making calls to an empty method, static or not.
Looking at a histogram of the times reveals something interesting, however. The majority of calls, whether static or not, take between 30 and 40ns, and looking closely at the data they are virtually all exactly 30ns.
Running the same code with empty loops (commenting out the method calls) produces an average speed of 8ns, however, about 3/4 of the measured times are 0ns while the remainder are exactly 30ns.
I'm not sure how to account for this data, but I'm not sure that #misiu_mp's conclusions still hold. The difference between empty static and non-static methods is negligible, and the preponderance of measurements are exactly 30ns. That being said, it would appear that there is still some non-zero cost to running empty methods.
I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the Runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when run in a Thread, but I don't know what they could be, or what the difference is, in that respect, with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out of context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this is the part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
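As a sketch of what that could look like (hedged: it just chunks the outer loop by pixel index; shared state like the modeTable from the question would still need care):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedOuterLoop {
    // Hypothetical stand-in for the per-pixel body of the outer loop.
    static void processPixel(int i) { /* ... mean shift for pixel i ... */ }

    public static void main(String[] args) throws Exception {
        final int L = 100_000;                        // number of pixels, as in the question
        int threads = Runtime.getRuntime().availableProcessors();
        int chunk = (L + threads - 1) / threads;      // split the outer loop into contiguous ranges

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int start = 0; start < L; start += chunk) {
            final int from = start;
            final int to = Math.min(start + chunk, L);
            tasks.add(() -> {
                for (int i = from; i < to; i++) {
                    processPixel(i);                  // each task handles one range of pixels
                }
                return null;
            });
        }
        pool.invokeAll(tasks);                        // blocks until every chunk has finished
        pool.shutdown();
    }
}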
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks), you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while still being worth parallelizing is a well-known topic in parallel computation :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking when measuring performance it's easy to get mislead when measuring small pieces of work. I would be looking to get a run of at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "No Thread" and "Threaded" cases is that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between them. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.