Performance costs of casting a concrete collection to its interface - java

When I write an API, it sometimes uses Collection<Model> as a parameter type. Of course, you could use ArrayList directly if you know ArrayList is enough to handle all your use cases.
My question is: is there any considerable performance cost when, for example, an ArrayList<Model> is cast to Collection<Model> while being passed as a parameter?
Will the collection size also impact the performance of the cast? Any advice?
Thanks for Peter's answer. It is convincing enough to stop me from wasting time on changing this.
EDIT
As the accepted answer explains, the cost is actually paid when calling interface methods; this kind of flexibility is not free. But the cost is not significant.

Like most performance questions, the answer is: write clear and simple code, and the application will usually perform well too.
A cast to an interface can take around 10 ns (less than a method call). Depending on how the code is optimised, it might be too small to measure.
A cast between generic types is a compile-time check; nothing actually happens at runtime.
When you cast, it is the reference type which changes, and all references are the same size. The size of what they point to doesn't matter.
BTW: all ArrayList objects are the same size, all LinkedList objects are the same size, all HashMap objects are the same size, and so on. Each may reference a backing array, which can be a different size in different collections.
You can see a difference in code which hasn't been JITed.
public static void main(String... args) throws Throwable {
    ArrayList<Integer> ints = new ArrayList<>();
    for (int i = 0; i < 100; i++) ints.add(i);
    sumSize(ints, 5000);
    castSumSize(ints, 5000);
    sumSize(ints, 5000);
    castSumSize(ints, 5000);
}

public static long sumSize(ArrayList<Integer> ints, int runs) {
    long sum = 0;
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++)
        sum += ints.size();
    long time = System.nanoTime() - start;
    System.out.printf("sumSize: Took an average of %,d ns%n", time / runs);
    return sum;
}

public static long castSumSize(ArrayList<Integer> ints, int runs) {
    long sum = 0;
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++)
        sum += ((Collection) ints).size();
    long time = System.nanoTime() - start;
    System.out.printf("castSumSize: Took an average of %,d ns%n", time / runs);
    return sum;
}
prints
sumSize: Took an average of 31 ns
castSumSize: Took an average of 37 ns
sumSize: Took an average of 28 ns
castSumSize: Took an average of 34 ns
However, the difference is likely to be due to the method calls being more expensive. The only bytecode difference is
invokevirtual #9; //Method java/util/ArrayList.size:()I
and
invokeinterface #15, 1; //InterfaceMethod java/util/Collection.size:()I
Once the JIT has optimised the code, there isn't much difference. Run it long enough and the time drops to 0 ns on the -server JVM, because it detects that the loop doesn't do anything. ;)
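A common way to defeat that dead-code elimination (a sketch, not part of the original answer) is to consume the returned sums so the result stays observable:

// Sketch: using the return values keeps the timed loops from being optimised away.
long sum = sumSize(ints, 5000);
sum += castSumSize(ints, 5000);
System.out.println("checksum: " + sum);   // observable side effect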

Compared to the cost of doing anything else with an object: absolutely none.
And even if there were a cost, rest assured that any real program involves operations taking millions of times longer!

Collection is an interface. You always have to provide a concrete implementation such as ArrayList.
Commonly it would look like this:
Collection<Model> myCollection = new ArrayList<Model>();
Designing to interfaces is actually good practice, so use Collection as your method parameter.
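For example, a minimal sketch of an API method written against the interface (countNonNull and Model are hypothetical names):

// Accepting the Collection interface lets callers pass an ArrayList,
// LinkedList, HashSet, or any other implementation, with no cast needed.
public static int countNonNull(Collection<Model> models) {
    int count = 0;
    for (Model m : models) {
        if (m != null) count++;
    }
    return count;
}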

Related

Array / Vector as method argument

I had always read that we should use Vector everywhere in Java and that there are no performance issues, which I took to be true. I'm writing a method to calculate the MSE (Mean Squared Error) and noticed that it was very slow - I was basically passing a Vector of values. When I switched to an array, it was 10 times faster, but I don't understand why.
I have written a simple test:
public static void main(String[] args) throws IOException {
    Vector<Integer> testV = new Vector<Integer>();
    Integer[] testA = new Integer[1000000];
    for (int i = 0; i < 1000000; i++) {
        testV.add(i);
        testA[i] = i;
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 500; i++) {
        double testVal = testArray(testA, 0, 1000000);
    }
    System.out.println(String.format("Array total time %s ", System.currentTimeMillis() - startTime));
    startTime = System.currentTimeMillis();
    for (int i = 0; i < 500; i++) {
        double testVal = testVector(testV, 0, 1000000);
    }
    System.out.println(String.format("Vector total time %s ", System.currentTimeMillis() - startTime));
}
Which calls the following methods:
public static double testVector(Vector<Integer> data, int start, int stop) {
    double toto = 0.0;
    for (int i = start; i < stop; i++) {
        toto += data.get(i);
    }
    return toto / data.size();
}

public static double testArray(Integer[] data, int start, int stop) {
    double toto = 0.0;
    for (int i = start; i < stop; i++) {
        toto += data[i];
    }
    return toto / data.length;
}
The array one is indeed 10 times faster. Here is the output:
Array total time 854
Vector total time 9840
Can somebody explain to me why? I have searched for quite a while, but cannot figure it out. The Vector method appears to be making a local copy of the vector, but I always thought that objects were passed by reference in Java.
"I have always read that we should use Vector everywhere in Java and that there are no performance issues" - wrong. A Vector is thread-safe, and thus it needs additional logic (code) to handle access and modification by multiple threads, so it is slow. An array, on the other hand, doesn't need additional logic to handle multiple threads. You should try ArrayList instead of Vector to increase the speed.
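For instance, a minimal sketch of the same traversal written against ArrayList (testList is a hypothetical name; the logic mirrors the testVector method from the question):

public static double testList(ArrayList<Integer> data, int start, int stop) {
    double toto = 0.0;
    for (int i = start; i < stop; i++) {
        toto += data.get(i);   // no monitor acquisition, unlike Vector.get()
    }
    return toto / data.size();
}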
Note (based on your comment that you run each method 500 times): this is not the right way to measure performance or speed in Java. You should at least give it a warm-up run to nullify the effect of the JIT.
Yes, that's the eternal problem of poor microbenchmarking. The Vector itself is not SO slow.
Here is a trick:
add -XX:BiasedLockingStartupDelay=0 and now testVector "magically" runs 5 times faster than before!
Next, wrap the loop in testVector in synchronized (data) - and now it is almost as fast as testArray.
You are basically measuring the performance of object monitors in HotSpot, not the data structures.
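A sketch of that wrapping (testVectorSynced is a hypothetical variant of the question's method; holding the monitor once for the whole loop avoids re-acquiring it on every get()):

public static double testVectorSynced(Vector<Integer> data, int start, int stop) {
    double toto = 0.0;
    synchronized (data) {          // one lock acquisition for the whole loop
        for (int i = start; i < stop; i++) {
            toto += data.get(i);   // re-entrant: the held monitor makes these cheap
        }
    }
    return toto / data.size();
}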
Simple: Vector is thread-safe, so it needs synchronization on every add and access. Use ArrayList, which is also backed by an array, but is not thread-safe and is therefore faster.
Note:
Give the ArrayList an initial capacity if you know the number of elements in advance. A normal ArrayList created without an initial capacity will resize internally, which uses an array copy each time.
The performance gap between a plain array and an ArrayList created without an initial capacity also grows drastically as the number of elements gets larger.
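For example, a small sketch of pre-sizing (assuming the element count is known up front):

// Pre-sizing avoids the repeated internal array copies that growth would cause.
List<Integer> list = new ArrayList<>(1000000);
for (int i = 0; i < 1000000; i++) {
    list.add(i);
}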
Poor code; instead of list.get(i), rather use an iterator over the list, as sketched below. The array will still be faster, though.
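A sketch of the iterator-based traversal (testWithIterator is a hypothetical name; the for-each loop drives the list's Iterator under the hood):

public static double testWithIterator(List<Integer> data) {
    double toto = 0.0;
    for (int value : data) {   // iterator-driven; no index-based get() calls
        toto += value;
    }
    return toto / data.size();
}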

Java: Calculate how long sorting an array takes

I have some code that generates 1000 numbers in an array and then sorts them:
import java.util.Arrays;
import java.util.Random;

public class OppgA {
    public static void main(String[] args) {
        int[] anArray = new int[1000];
        Random generator = new Random();
        for (int i = 0; i < 1000; i++) {
            anArray[i] = generator.nextInt(1000) + 1;
        }
        Arrays.sort(anArray);
        System.out.println(Arrays.toString(anArray));
    }
}
Now I'm asked to calculate and print the time it took to sort the array. Any clues how I can do this? I really couldn't find much by searching that could help me out in my case.
Thanks!
You can call (and store the result of) System.nanoTime() before and after the call to Arrays.sort(); the difference is the time spent, in nanoseconds. That method is preferred over System.currentTimeMillis() for calculating durations.
long start = System.nanoTime();
Arrays.sort(anArray);
long end = System.nanoTime();
long timeInMillis = TimeUnit.MILLISECONDS.convert(end - start, TimeUnit.NANOSECONDS);
System.out.println("Time spend in ms: " + timeInMillis);
But note that the result of your measurement will probably vary widely if you run the program several times. To get a more precise calculation would be more involved - see for example: How do I write a correct micro-benchmark in Java?.
Before sorting, declare a long which corresponds to the time before you start the sorting:
long timeStarted = System.currentTimeMillis();
//your sorting here.
//after sorting
System.out.println("Sorting last for:" + (System.currentTimeMillis() - timeStarted));
This prints the elapsed time of your sort in milliseconds.
As assylias commented you can also use System.nanoTime() if you prefer precise measurements of elapsed time.
Proper microbenchmarking is done using a ready-made tool built for that purpose, like Google Caliper or Oracle's jmh (see the sketch after this list). However, if you want a poor man's edition, follow at least these points:
measure with System.nanoTime() (as explained elsewhere). Do not trust small numbers: if you get timings such as 10 microseconds, you are measuring too short a timespan. Enlarge the array to get at least into the milliseconds;
repeat the sorting process many times (10, 100 perhaps) and display the timing of each attempt. You are expected to see a marked drop in the time after the first few runs, but after that the timings should stabilize. If you still observe wild variation, you know something's amiss;
to avoid garbage collection issues, reuse the same array, but re-fill it with new random data each time.
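As a sketch of the ready-made-tool route (assuming the org.openjdk.jmh annotations dependency is on the classpath and the benchmark is run through JMH's runner; SortBenchmark is a hypothetical name):

import java.util.Arrays;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SortBenchmark {
    int[] anArray;

    @Setup(Level.Invocation)   // re-fill with fresh random data before each call
    public void fill() {
        Random generator = new Random();
        anArray = new int[1000];
        for (int i = 0; i < anArray.length; i++) {
            anArray[i] = generator.nextInt(1000) + 1;
        }
    }

    @Benchmark
    public int[] sort() {
        Arrays.sort(anArray);
        return anArray;        // returning the result prevents dead-code elimination
    }
}

JMH then handles the warm-up iterations and steady-state measurement for you.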
long beforeTime = System.currentTimeMillis();
// Your Code
long afterTime = System.currentTimeMillis();
long diffInMilliSeconds = afterTime - beforeTime;
Before starting the sort (i.e. right after generating the array) you can use System#currentTimeMillis() to get the current time; do the same right after the sort completes, then take the difference.
Do it this way:
long start = System.currentTimeMillis();
...
your sorting code
...
long end = System.currentTimeMillis();
long timeInMillis = end - start;
Hope that helps.
import java.util.Arrays;
import java.util.Date;   // needed for the Date-based timing below
import java.util.Random;

public class OppgA {
    public static void main(String[] args) {
        int[] anArray = new int[1000];
        Random generator = new Random();
        for (int i = 0; i < 1000; i++) {
            anArray[i] = generator.nextInt(1000) + 1;
        }
        Date before = new Date();
        Arrays.sort(anArray);
        Date after = new Date();
        System.out.println(after.getTime() - before.getTime());
        System.out.println(Arrays.toString(anArray));
    }
}
This is not an ideal way, but it will work:
long startingTime=System.currentTimeMillis();
Arrays.sort(anArray);
long endTime=System.currentTimeMillis();
System.out.println("Sorting time: "+(endTime-startingTime)+"ms");
The following, using System.nanoTime(), can be a better way:
long startingTime=System.nanoTime();
Arrays.sort(anArray);
long endTime=System.nanoTime();
System.out.println("Sorting time: "+(endTime-startingTime)+"ns");
In short, you can either extract your code into a method and calculate the difference between timestamps taken at the start and end of that method, or you can just run it in a profiler or an IDE, which will print the execution time.
Ideally, you should not mix your business logic (array sorting in this case) with 'metrics' code. If you do need to measure execution time within the app, you can try using AOP for that.
Please refer to this post, which describes possible solutions in great detail.

Does splitting logic into multiple methods in Java slow down execution? If yes, should we avoid refactoring code?

Results screenshot: http://s24.postimg.org/9y073weid/refactor_vs_non_refactor.png
The image above shows the execution times in nanoseconds for refactored and non-refactored code performing a simple addition operation; columns 1 to 5 are consecutive runs of the code.
My intent was simply to find out whether splitting logic into multiple methods slows execution down, and the result suggests that yes, considerable time goes into just pushing the methods onto the stack.
I invite people who have researched this before, or who want to investigate this area, to correct me if I am doing something wrong, and to draw some conclusive results from this.
In my opinion, refactoring does help make code more structured and understandable, but in time-critical systems such as real-time game engines I would prefer not to refactor.
Here is the simple code I used:
package com.sim;

public class NonThreadedMethodCallBenchMark {
    public static void main(String args[]) {
        NonThreadedMethodCallBenchMark testObject = new NonThreadedMethodCallBenchMark();
        System.out.println("************Starting***************");
        long startTime = System.nanoTime();
        for (int i = 0; i < 900000; i++) {
            //testObject.method(1, 2);   // uncomment this line and comment the line below to test refactored time
            //testObject.method5(1, 2);  // uncomment this line and comment the line above to test non-refactored time
        }
        long endTime = System.nanoTime();
        System.out.println("Total :" + (endTime - startTime) + " nanoseconds");
    }

    public int method(int a, int b) {
        return method1(a, b);
    }

    public int method1(int a, int b) {
        return method2(a, b);
    }

    public int method2(int a, int b) {
        return method3(a, b);
    }

    public int method3(int a, int b) {
        return method4(a, b);
    }

    public int method4(int a, int b) {
        return method5(a, b);
    }

    public int method5(int a, int b) {
        return a + b;
    }

    public void run() {
        int x = method(1, 2);
    }
}
You should consider that code which doesn't do anything useful can be optimised away. If you are not careful, you can end up timing how long it takes the JVM to detect that the code doesn't do anything useful, rather than timing the code itself. If you use multiple methods, it can take longer to detect useless code and thus give different results. I would always look at the steady-state performance, after the code has warmed up.
For the most expensive parts of your code, small methods will be inlined, so they won't add any performance cost. What can happen is:
smaller methods can be optimised better, as complex methods can defeat optimisation tricks;
smaller methods can be eliminated entirely once they are inlined.
If you never warm up the code, it is likely to be slower. However, if code is called rarely, it is unlikely to matter (except in low-latency systems, in which case I suggest you warm up your code on startup).
If you run
System.out.println("************Starting***************");
for (int j = 0; j < 10; j++) {
long startTime = System.nanoTime();
for (int i = 0; i < 1000000; i++) {
testObject.method(1, 2);
//testObject.method5(1,2); // uncomment this line and comment the above line to test non refactor time
}
long endTime = System.nanoTime();
System.out.println("Total :" + (endTime - startTime) + " nanoseconds");
}
prints
************Starting***************
Total :8644835 nanoseconds
Total :3363047 nanoseconds
Total :52 nanoseconds
Total :30 nanoseconds
Total :30 nanoseconds
Note: the 30 nanoseconds is the time it takes to perform a System.nanoTime() call. The inner loop and the method calls have been eliminated.
Yes, extra method calls cost extra time (except when the compiler/JIT actually does inlining, which I think some of them will do sometimes, but under circumstances that you will find hard to control).
You shouldn't worry about this except for the most expensive part of your code, because you won't be able to see the difference in most cases. In those cases where performance doesn't matter, you should refactor to maximize code clarity.
I suggest you refactor at will, occasionally measuring performance with a profiler to find the expensive parts. Only when the profiler shows that some specific code is using a significant fraction of the time should you worry about function calls (or other speed vs. clarity tradeoffs) in that specific code. You will discover that the performance bottlenecks are often in places you wouldn't have guessed.

Comparing methods speed performance using nanotime in Java

I would like to compare the speed (if there is any difference) of the two readDataMethod() variants illustrated below.
private void readDataMethod1(List<Integer> numbers) {
    final long startTime = System.nanoTime();
    for (int i = 0; i < numbers.size(); i++) {
        numbers.get(i);
    }
    final long endTime = System.nanoTime();
    System.out.println("method 1 : " + (endTime - startTime));
}

private void readDataMethod2(List<Integer> numbers) {
    final long startTime = System.nanoTime();
    int i = numbers.size();
    while (i-- > 0) {
        numbers.get(i);
    }
    final long endTime = System.nanoTime();
    System.out.println("method 2 : " + (endTime - startTime));
}
Most of the time, the results I get show that method 2 has a lower value.
Run    readDataMethod1    readDataMethod2
1      636331             468876
2      638256             479269
3      637485             515455
4      716786             420756
Does this test prove that the readDataMethod2 is faster than the earlier one ?
Does this test prove that the readDataMethod2 is faster than the earlier one ?
You are on the right track in that you're measuring comparative performance, rather than making assumptions.
However, there are lots of potential issues to be aware of when writing micro-benchmarks in Java. I would recommend that you read
How do I write a correct micro-benchmark in Java?
In the first one, you are calling numbers.size() for each iteration.
Try storing it in a variable, and check again.
The second version runs faster because the first one calls numbers.size() on each iteration; storing the size in a variable first makes the two methods perform almost the same.
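A sketch of that change (readDataMethod1b is a hypothetical name; only the loop condition differs from the question's code):

private void readDataMethod1b(List<Integer> numbers) {
    final long startTime = System.nanoTime();
    final int size = numbers.size();   // evaluated once, not per iteration
    for (int i = 0; i < size; i++) {
        numbers.get(i);
    }
    final long endTime = System.nanoTime();
    System.out.println("method 1b : " + (endTime - startTime));
}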
Does this test prove that the readDataMethod2 is faster than the earlier one ?
As @aix says, you are on the right track. However, there are a couple of specific issues with your methodology:
It doesn't look like you are "warming up" the JVM. Therefore it is conceivable that your figures could be distorted by startup effects (JIT compilation) or that none of the code has been JIT compiled.
I'd also argue that your runs are doing too little work. 500,000 nanoseconds is 0.0005 seconds, and that's not much work. The risk is that "other things" external to your application could be introducing noise into the measurements. I'd have more confidence in runs that take tens of seconds.
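A minimal warm-up sketch (assuming the same numbers list from the question; the iteration count is an arbitrary choice):

// Run both methods many times first so the JIT compiles the hot paths;
// in practice you would suppress the printing during these warm-up runs.
for (int i = 0; i < 10000; i++) {
    readDataMethod1(numbers);
    readDataMethod2(numbers);
}
// ...then take the real, timed measurements.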

can array access be optimized?

Maybe I'm being misled by my profiler (Netbeans), but I'm seeing some odd behavior, hoping maybe someone here can help me understand it.
I am working on an application, which makes heavy use of rather large hash tables (keys are longs, values are objects). The performance with the built in java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.
The code is very basic using a double hashing strategy. This works fine and good and shows the best performance of all the other options I've tried thus far.
The catch is, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times, and/or do a lot more logic.
What really confuses me is the lookups are called only by one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result of the lookup, but is about 100x faster.
Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes, according to profiling, are virtually free). I don't understand how this bit of code can be so slow since it is just array access, and I don't see any way of making it faster.
Note that the code simply returns the bucket matching the key, the caller is expected to process the bucket. 'size' is the hash.length/2, hash1 does lookups in the first half of the hash table, hash2 does lookups in the second half. key_index is a final int field on the hash table passed into the constructor, and the values array on the Entry objects is a small array of longs usually of length 10 or less.
Any thoughts people have on this are much appreciated.
Thanks.
public final Entry get(final long theKey) {
    Entry aEntry = hash[hash1(theKey, size)];
    if (aEntry != null && aEntry.values[key_index] != theKey) {
        aEntry = hash[hash2(theKey, size)];
        if (aEntry != null && aEntry.values[key_index] != theKey) {
            return null;
        }
    }
    return aEntry;
}
Edit: the code for hash1 & hash2:
private static int hash1(final long key, final int hashTableSize) {
    return (int) (key & (hashTableSize - 1));
}

private static int hash2(final long key, final int hashTableSize) {
    return (int) (hashTableSize + ((key ^ (key >> 3)) & (hashTableSize - 1)));
}
Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.
The only thing that I would expect might make some difference is to move the key out of the values array of Entry.
Instead of having this:
class Entry {
    long[] values;
}
//...
if (entry.values[key_index] == key) { //...
Try this:
class Entry {
    long key;
    long[] values;
}
//...
if (entry.key == key) { //...
Instead of incurring the cost of accessing a member, plus doing bounds checking, then getting a value out of the array, you should incur only the cost of accessing the member.
Is there a random-access data type faster than an array?
I was interested in the answer to this question, so I set up a test environment. This is my Array interface:
interface Array {
    long get(int i);
    void set(int i, long v);
}
This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:
class NormalArray implements Array {
    private long[] data;

    public NormalArray(int size) {
        data = new long[size];
    }

    @Override
    public long get(int i) {
        return data[i];
    }

    @Override
    public void set(int i, long v) {
        data[i] = v;
    }
}
And then a control:
class NoOpArray implements Array {
    @Override
    public long get(int i) {
        return 0;
    }

    @Override
    public void set(int i, long v) {
    }
}
Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:
class TenArray implements Array {
    private long v0;
    private long v1;
    private long v2;
    private long v3;
    private long v4;
    private long v5;
    private long v6;
    private long v7;
    private long v8;
    private long v9;
    private long[] extras;

    public TenArray(int size) {
        if (size > 10) {
            extras = new long[size - 10];
        }
    }

    @Override
    public long get(final int i) {
        switch (i) {
            case 0: return v0;
            case 1: return v1;
            case 2: return v2;
            case 3: return v3;
            case 4: return v4;
            case 5: return v5;
            case 6: return v6;
            case 7: return v7;
            case 8: return v8;
            case 9: return v9;
            default: return extras[i - 10];
        }
    }

    @Override
    public void set(final int i, final long v) {
        switch (i) {
            case 0: v0 = v; break;
            case 1: v1 = v; break;
            case 2: v2 = v; break;
            case 3: v3 = v; break;
            case 4: v4 = v; break;
            case 5: v5 = v; break;
            case 6: v6 = v; break;
            case 7: v7 = v; break;
            case 8: v8 = v; break;
            case 9: v9 = v; break;
            default: extras[i - 10] = v;
        }
    }
}
I tested it with this harness:
import java.util.Random;

public class ArrayOptimization {
    public static void main(String[] args) {
        int size = 10;
        long[] data = new long[size];
        Random r = new Random();
        for (int i = 0; i < data.length; i++) {
            data[i] = r.nextLong();
        }
        Array[] a = new Array[] {
            new NoOpArray(),
            new NormalArray(size),
            new TenArray(size)
        };
        for (;;) {
            for (int i = 0; i < a.length; i++) {
                testSet(a[i], data, 10000000);
                testGet(a[i], data, 10000000);
            }
        }
    }

    private static void testGet(Array a, long[] data, int iterations) {
        long nanos = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < data.length; j++) {
                data[j] = a.get(j);
            }
        }
        long stop = System.nanoTime();
        System.out.printf("%s/get took %fms%n", a.getClass().getName(),
                (stop - nanos) / 1000000.0);
    }

    private static void testSet(Array a, long[] data, int iterations) {
        long nanos = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < data.length; j++) {
                a.set(j, data[j]);
            }
        }
        long stop = System.nanoTime();
        System.out.printf("%s/set took %fms%n", a.getClass().getName(),
                (stop - nanos) / 1000000.0);
    }
}
The results were somewhat surprising. The TenArray performs non-trivially faster than a NormalArray does (for sizes <= 10). Subtracting the overhead (using the NoOpArray average) you get TenArray as taking ~65% of the time of the normal array. So if you know the likely max size of your array, I suppose it is possible to exceed the speed of an array. I would imagine switch uses either less bounds checking or more efficient bounds checking than does an array.
NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms
Now whether you can in practice get speeds faster than an array I'm not sure; obviously this way you incur any overhead associated with the interface/class/methods.
Most likely you are being partially misled in your interpretation of the profiler's results. Profilers are notorious for overinflating the performance impact of small, frequently called methods. In your case, the profiling overhead for the get() method is probably larger than the actual processing time spent in the method itself. The situation is worsened further because the instrumentation also interferes with the JIT's ability to inline methods.
As a rule of thumb for this situation: if the total processing time for a piece of work of known length increases more than two- to threefold when running under the profiler, the profiling overhead will give you skewed results.
To verify that your changes actually have an impact, always measure performance improvements without the profiler, too. The profiler can hint at bottlenecks, but it can also deceive you into looking at places where nothing is wrong.
Array bounds checking can have a surprisingly large impact on performance (if you do comparatively little else), but it can also be hard to separate clearly from general memory-access penalties. In some trivial cases, the JIT might be able to eliminate the checks (there have been efforts towards bounds-check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for(x=0; x<array.length; x++).
Under some circumstances you may be able to replace array access with simple member access, completely avoiding the bounds checks, but it's limited to the rare cases where you access your array exclusively by constant indices. I see no way to apply it to your problem.
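For illustration, a minimal sketch of the loop shape that qualifies for that elimination:

// The comparison against array.length proves every index is in range,
// so the JIT can drop the per-access bounds checks inside the loop.
static long sum(long[] array) {
    long total = 0;
    for (int x = 0; x < array.length; x++) {
        total += array[x];
    }
    return total;
}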
The change suggested by Mark Peters is most likely not solely faster because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache friendly way.
Many profilers tell you very confusing things, partly because of how they work, and partly because people have funny ideas about performance to begin with.
For example, you're wondering about how many times functions are called, and you're looking at code and thinking it looks like a lot of logic, therefore slow.
There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.
First of all, think in terms of the percent of time a routine or statement is active, rather than the number of times it is called or the average length of time it takes. The reason for that is that it is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply the number of calls by the average execution time and divide by the total time just to see if it is big enough to even care about. Also, percent tells you, bottom line, how much fixing it could potentially reduce the overall execution time.
Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)
Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.
There are profilers that can do this. I'm not sure about Java.
If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? and only things that have a large enough percent to be worth the trouble, like 10% or more? Such a line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them. If 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percent is off a little bit, as long as you find it? That's another one of those happy myths of profilers - that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will.
So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.
Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so now their percentage is 5/4 of what it was, because the denominator got smaller. Percentage-wise they got bigger, and easier to find. This effect snowballs, allowing you to really squeeze the code.
You could try using a memoizing or caching strategy to reduce the number of actual calls. Another thing you could try, if you're very desperate, is a native array, since indexing those is unbelievably fast, and JNI shouldn't add too much overhead if you're using parameters like longs that don't require marshalling.
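A sketch of the memoizing idea (getCached and the cache field are hypothetical additions to the hash table class; only worthwhile if the same keys repeat often, and note that boxing the long keys has its own cost):

private final java.util.HashMap<Long, Entry> cache = new java.util.HashMap<>();

public Entry getCached(final long theKey) {
    Entry e = cache.get(theKey);   // theKey is autoboxed to Long
    if (e == null) {
        e = get(theKey);           // fall back to the double-hash lookup
        if (e != null) {
            cache.put(theKey, e);
        }
    }
    return e;
}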
