I have tried searching the web/Stack Overflow for this but cannot find an answer. I am not sure if that is because it is so obvious, or because the difference is so negligible.
My basic question is... which performs better, and by how much?
String attacker = pairs.getKey();
mobInstanceMap.replace(attacker, 0);
vs:
mobInstanceMap.replace(pairs.getKey(), 0);
I like the first for readability, but I have been developing an MMORPG, so I need to be quite careful with performance. I tried the experiment below, but figured I would ask the gurus since it was hard for me to extrapolate. I flipped which one runs first, because I found that when running both (one after the other) the second always ran much better (probably due to JVM optimization).
Sorry if this has been asked and answered, or is obvious, but I searched and did not find anything (maybe I don't know what to search for). I am not only wondering about 'saving off' strings, but other variables too.
public static void main(String[] args){
long hardcodeTotal = 0;
long saveTotal = 0;
// for (int i = 0; i < 10000; i++){
// saveTotal += checkRunTimeSaveString();
// }
for (int i = 0; i < 10000; i++){
hardcodeTotal += checkRunTimeHardcodeString();
}
System.out.println("hardcodeTotal: " + hardcodeTotal/100.0 + "\t" + "saveTotal: " + saveTotal/100.0);
}
private static long checkRunTimeSaveString(){
long StartTime = System.nanoTime();
Iterator<Entry<String, Integer>> it = mobInstanceMap.entrySet().iterator();
while (it.hasNext()) { //to update the AttackingMap when entity(attackee) moves
Entry<String, Integer> pairs = it.next();
String attacker = pairs.getKey();
mobInstanceMap.replace(attacker, 0);
}
return(System.nanoTime() - StartTime);
}
private static long checkRunTimeHardcodeString(){
long StartTime = System.nanoTime();
Iterator<Entry<String, Integer>> it = mobInstanceMap.entrySet().iterator();
while (it.hasNext()) { //to update the AttackingMap when entity(attackee) moves
Entry<String, Integer> pairs = it.next();
mobInstanceMap.replace(pairs.getKey(), 0);
}
return(System.nanoTime() - StartTime);
}
In all likelihood, the intermediate assignment will be optimized away by the JVM, and the two will have exactly the same performance.
Either way, the possible performance hit from something like this is hardly ever noticeable. You shouldn't worry about performance except in extreme circumstances. Always opt for readability.
It should also be added that the type of micro-benchmarking you have in your code above is incredibly hard to get reliable measurements from. Too many factors out of your control (internal JVM optimizations, JVM warmup etc etc) are involved to get an accurate measurement of the exact thing you want to test.
You could profile both implementations with a tool like JVisualVM to see the exact difference. If you want to test at high volume, you could use a tool like ContiPerf (http://databene.org/contiperf) and then share the results with all of us :)
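If you do want hard numbers, a harness such as JMH takes care of most of the warm-up and dead-code pitfalls mentioned above. Below is a minimal sketch; the map contents are made up for the benchmark, and in your code this would be the real mobInstanceMap:
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class ReplaceBenchmark {

    // Stand-in for the question's mobInstanceMap; contents are invented for the benchmark.
    Map<String, Integer> mobInstanceMap = new HashMap<>();

    @Setup
    public void setup() {
        for (int i = 0; i < 1_000; i++) {
            mobInstanceMap.put("mob" + i, i);
        }
    }

    @Benchmark
    public void withLocalVariable() {
        for (Map.Entry<String, Integer> pairs : mobInstanceMap.entrySet()) {
            String attacker = pairs.getKey();   // save the key in a local first
            mobInstanceMap.replace(attacker, 0);
        }
    }

    @Benchmark
    public void withInlineCall() {
        for (Map.Entry<String, Integer> pairs : mobInstanceMap.entrySet()) {
            mobInstanceMap.replace(pairs.getKey(), 0); // pass the key directly
        }
    }
}
Run both benchmarks and compare the averages; on a typical JVM the two should come out indistinguishable, which is exactly the point made above.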
I agree with Keppil that the performance is probably exactly the same. If there is a slight difference, and if you care about such a small improvement in performance, then you probably shouldn't be using Java in the first place.
I have a stream of Point3Ds in a JavaFX 8 program. I would like, for the sake of creating a Mesh from them, to be able to produce a list of their (x, y, z) coordinates instead.
This is a simple enough task through traditional Java looping. (Almost trivial, actually.) However, in the future, I'll likely be dealing with tens of thousands of points; and I would very much like to be able to use the Java Stream API and accomplish this with a parallel stream.
I suppose what I'm looking for is the rough equivalent of this pseudocode:
List<Double> coordinates = stream.parallel().map(s -> (s.getX(), s.getY(), s.getZ())).collect(Collectors.asList());
As of yet, I've found no such feature though. Could someone kindly give me a push in the right direction?
You can use flatMap:
List<Double> coordinates =
stream.parallel()
.flatMap(s -> Stream.of(s.getX(), s.getY(), s.getZ()))
.collect(Collectors.toList());
Why? Even with "tens of thousands of points", the code will complete in very little time, and you won't really gain anything "with a parallel stream".
This sounds like a perfect example of premature optimization, where you potentially complicate the code for something that isn't (yet) a problem, and is unlikely to ever be one, in this case at least.
To prove my point, I created the test code below.
To minimize the effect of GC runs, I ran this code with -Xms10g -Xmx10g, and added the explicit gc() calls, so test runs were running with a "clean slate".
As always, performance testing is subject to JIT optimizations and other factors, so a warm-up loop was provided.
public static void main(String[] args) {
Random rnd = new Random();
List<Point3D> input = new ArrayList<>();
for (int i = 0; i < 10_000; i++)
input.add(new Point3D(rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble()));
for (int i = 0; i < 100; i++) {
test1(input);
test2(input);
}
for (int i = 0; i < 10; i++) {
long start1 = System.nanoTime();
test1(input);
long end1 = System.nanoTime();
System.gc();
long start2 = System.nanoTime();
test2(input);
long end2 = System.nanoTime();
System.gc();
System.out.printf("%.6f %.6f%n", (end1 - start1) / 1_000_000d, (end2 - start2) / 1_000_000d);
}
}
private static List<Double> test1(List<Point3D> input) {
List<Double> list = new ArrayList<>();
for (Point3D point : input) {
list.add(point.getX());
list.add(point.getY());
list.add(point.getZ());
}
return list;
}
private static List<Double> test2(List<Point3D> input) {
return input.stream().parallel()
.flatMap(s -> Stream.of(s.getX(), s.getY(), s.getZ()))
.collect(Collectors.toList());
}
RESULT
0.355267 0.392904
0.205576 0.260035
0.193601 0.232378
0.194740 0.290544
0.193601 0.238365
0.243497 0.276286
0.200728 0.243212
0.197022 0.240646
0.192175 0.239790
0.198162 0.279708
No major difference, although the parallel stream seems slightly slower.
Also notice that it completes in less than 0.3 ms, for 10,000 points.
It's nothing!
Let's try to increase the count from 10,000 to 10,000,000 (skipping warmup):
433.716847 972.100743
260.662700 693.263850
250.699271 736.744653
250.486281 813.615375
249.722716 714.296997
254.704145 796.566859
254.713840 829.755767
253.368331 959.365322
255.016928 973.306254
256.072177 1047.562090
Now there's a definite degradation of the parallel stream. It is 3 times slower. This is likely caused by extra GC runs.
CONCLUSION: Premature optimization is bad!!!!
In your case, you actually made it worse.
I would like to compare the speed (if there is any difference) of the two readDataMethod() implementations I illustrate below.
private void readDataMethod1(List<Integer> numbers) {
final long startTime = System.nanoTime();
for (int i = 0; i < numbers.size(); i++) {
numbers.get(i);
}
final long endTime = System.nanoTime();
System.out.println("method 1 : " + (endTime - startTime));
}
private void readDataMethod2(List<Integer> numbers) {
final long startTime = System.nanoTime();
int i = numbers.size();
while (i-- > 0) {
numbers.get(i);
}
final long endTime = System.nanoTime();
System.out.println("method 2 : " + (endTime - startTime));
}
Most of the time, the results I get show that method 2 has a lower value.
Run readDataMethod1 readDataMethod2
1 636331 468876
2 638256 479269
3 637485 515455
4 716786 420756
Does this test prove that readDataMethod2 is faster than the earlier one?
Does this test prove that readDataMethod2 is faster than the earlier one?
You are on the right track in that you're measuring comparative performance, rather than making assumptions.
However, there are lots of potential issues to be aware of when writing micro-benchmarks in Java. I would recommend that you read
How do I write a correct micro-benchmark in Java?
In the first one, you are calling numbers.size() for each iteration.
Try storing it in a variable, and check again.
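In other words, something like this (a sketch of that suggestion, adapted from the question's readDataMethod1):
private void readDataMethod1(List<Integer> numbers) {
    final long startTime = System.nanoTime();
    final int size = numbers.size();   // hoisted: size() is no longer called on every iteration
    for (int i = 0; i < size; i++) {
        numbers.get(i);
    }
    final long endTime = System.nanoTime();
    System.out.println("method 1 : " + (endTime - startTime));
}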
The reason the second version runs faster is that the first one calls numbers.size() on every iteration. Storing the size in a variable first would make the first version perform almost the same as the second.
Does this test prove that readDataMethod2 is faster than the earlier one?
As @aix says, you are on the right track. However, there are a couple of specific issues with your methodology:
It doesn't look like you are "warming up" the JVM. Therefore it is conceivable that your figures could be distorted by startup effects (JIT compilation) or that none of the code has been JIT compiled.
I'd also argue that your runs are doing too little work. 500,000 nanoseconds is 0.0005 seconds, and that's not much work. The risk is that "other things" external to your application could be introducing noise into the measurements. I'd have more confidence in runs that take tens of seconds.
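For example, an untimed warm-up pass followed by a timed region that repeats the reads many times would address both points. The method name and repetition counts below are made up, just to sketch the shape of it:
private void readDataMethod1Longer(List<Integer> numbers) {
    // warm-up: let the JIT compile the loop before we start measuring
    for (int warm = 0; warm < 10_000; warm++) {
        for (int i = 0; i < numbers.size(); i++) {
            numbers.get(i);
        }
    }
    final long startTime = System.nanoTime();
    // enough repetitions that the timed run lasts seconds rather than microseconds
    for (int rep = 0; rep < 1_000_000; rep++) {
        for (int i = 0; i < numbers.size(); i++) {
            numbers.get(i);
        }
    }
    System.out.println("method 1 (long run): " + (System.nanoTime() - startTime));
}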
I want to test the time of adding and getting items in a raw and a generic HashMap:
public void TestHashGeneric(){
Map hashsimple = new HashMap();
long startTime = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
hashsimple.put("key"+i, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" );
}
for (int i = 0; i < 100000; i++) {
String ret =(String)hashsimple.get("key"+i);
}
long endTime =System.currentTimeMillis();
System.out.println("Hash Time " + (endTime - startTime) + " millisec");
Map<String,String> hm = new HashMap<String,String>();
startTime = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
hm.put("key"+i, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" );
}
for (int i = 0; i < 100000; i++) {
String ret = hm.get("key"+i);
}
endTime = System.currentTimeMillis();
System.out.println("Hash generic Time " + (endTime - startTime) + " millisec");
}
The problem is that I get different times if I swap the two HashMap code sections!
If I put the loops (with the time print, of course) for the generic map below the simple one, I get a better time for the generic one; and if I put the simple one below the generic one, I get a better time for the simple one!
The same happens if I use separate methods for this.
The JIT will compile and optimise your program while it is running, so the 2nd run will always be faster.
You should make the following modifications:
Run both tests untimed first, then re-run them timed so that you don't get affected by the JIT.
You should use System.nanoTime() as this is more accurate for timing (you should never get a diff of 0).
You should also test against some empty methods, as you are also timing the string concatenation operation in each loop.
Also note that in Java generic types are erased, so there should be no runtime difference at all.
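To see the erasure point concretely: the type parameters exist only at compile time, and at runtime both maps are plain java.util.HashMap instances. A quick check (not part of the original test) shows this:
Map rawMap = new HashMap();
Map<String, String> genericMap = new HashMap<String, String>();
// Generics are erased during compilation, so both objects have the same runtime class.
System.out.println(rawMap.getClass() == genericMap.getClass()); // prints: true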
The Java Runtime is pretty sophisticated - it does some learning/optimisation while it's running. You might be able to get the test you want by "warming up" the JVM first. Try calling TestHashGeneric() twice and see what the second set of results gives you.
You've also got twice as much stuff in memory on the second run. There are all sorts of variables that can affect this test.
This is not the correct way to perform micro-benchmarks, as doing it properly is very involved (Why?). I suggest you use the Caliper framework for this.
When you have a single loop which reaches the compile threshold (about 10K iterations), the whole method is compiled. This can make either the first loop appear faster (as it is optimised correctly, while the second loop has no counter information yet) or the second loop appear faster (as it is compiled before it even starts).
The simplest way to fix this is to place each test in its own method, so they will be compiled and optimised independently. (The order can still matter, but it is less important.) I would still suggest running the tests a number of times to see how the results vary.
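A sketch of that restructuring applied to the test in the question (the class and method names are made up):
import java.util.HashMap;
import java.util.Map;

public class MapTimingTest {

    // Each timed section lives in its own method so HotSpot compiles and optimises it independently.
    private static long timeSimpleMap() {
        Map map = new HashMap();
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            map.put("key" + i, "xxxxxxxxxx");
        }
        for (int i = 0; i < 100000; i++) {
            map.get("key" + i);
        }
        return System.nanoTime() - start;
    }

    private static long timeGenericMap() {
        Map<String, String> map = new HashMap<String, String>();
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            map.put("key" + i, "xxxxxxxxxx");
        }
        for (int i = 0; i < 100000; i++) {
            map.get("key" + i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Run each test several times and look at how the numbers settle, not just at the first run.
        for (int run = 0; run < 5; run++) {
            System.out.println("simple: " + timeSimpleMap() + " ns, generic: " + timeGenericMap() + " ns");
        }
    }
}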
I have this code that is testing Calendar.getInstance().getTimeInMillis() vs System.currentTimeMillis():
long before = getTimeInMilli();
for (int i = 0; i < TIMES_TO_ITERATE; i++)
{
long before1 = getTimeInMilli();
doSomeReallyHardWork();
long after1 = getTimeInMilli();
}
long after = getTimeInMilli();
System.out.println(getClass().getSimpleName() + " total is " + (after - before));
I want to make sure no JVM or compiler optimization happens, so the test will be valid and will actually show the difference.
How to be sure?
EDIT: I changed the code example so it will be more clear. What I am checking here is how much time it takes to call getTimeInMilli() in different implementations - Calendar vs System.
I think you need to disable the JIT. Add the following option to your run command:
-Djava.compiler=NONE
You want optimization to happen, because it will in real life - the test wouldn't be valid if the JVM didn't optimize in the same way that it would in the real situation you're interested in.
However, if you want to make sure that the JVM doesn't remove calls that it could potentially consider no-ops otherwise, one option is to use the result - so if you're calling System.currentTimeMillis() repeatedly, you might sum all the return values and then display the sum at the end.
Note that you may still have some bias though - for example, there may be some optimization if the JVM can cheaply determine that only a tiny amount of time has passed since the last call to System.currentTimeMillis(), so it can use a cached value. I'm not saying that's actually the case here, but it's the kind of thing you need to think about. Ultimately, benchmarks can only really test the loads you give them.
One other thing to consider: assuming you want to model a real world situation where the code is run a lot, you should run the code a lot before taking any timing - because the Hotspot JVM will optimize progressively harder, and presumably you care about the heavily-optimized version and don't want to measure the time for JITting and the "slow" versions of the code.
As Stephen mentioned, you should almost certainly take the timing outside the loop... and don't forget to actually use the results...
What you are doing looks like benchmarking; you can read Robust Java benchmarking to get some good background on how to do it right. In short, you don't need to turn optimization off, because that is not what happens on a production server; instead you need to get as close as possible to the 'real' time / performance. Before measuring you need to 'warm up' your code, like this:
// warm up
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < TIMES_TO_ITERATE; i++)
{
long before1 = getTimeInMilli();
doSomeReallyHardWork();
long after1 = getTimeInMilli();
}
}
// measure time
long before = getTimeInMilli();
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < TIMES_TO_ITERATE; i++)
{
long before1 = getTimeInMilli();
doSomeReallyHardWork();
long after1 = getTimeInMilli();
}
}
long after = getTimeInMilli();
System.out.println( "What to expect? " + (after - before)/1000 ); // average time
When we measure the performance of our code we use this approach; it gives us a more or less realistic estimate of the time our code needs. It is even better to measure the code in separate methods:
public void doIt() {
for (int i = 0; i < TIMES_TO_ITERATE; i++)
{
long before1 = getTimeInMilli();
doSomeReallyHardWork();
long after1 = getTimeInMilli();
}
}
// warm up
for (int j = 0; j < 1000; j++) {
doIt();
}
// measure time
long before = getTimeInMilli();
for (int j = 0; j < 1000; j++) {
doIt();
}
long after = getTimeInMilli();
System.out.println( "What to expect? " + (after - before)/1000 ); // average time
The second approach is more precise, but it also depends on the VM. E.g. HotSpot can perform "on-stack replacement", which means that if some part of a method is executed very often, it will be optimized by the VM and the old version of the code will be exchanged with the optimized one while the method is still executing. Of course this takes extra work on the VM's side. JRockit does not do this; the optimized version of the code will only be used the next time the method is called (so no 'runtime' optimization... I mean, in my first code sample the old code would be executed the whole time... except for the doSomeReallyHardWork internals - they do not belong to this method, so optimization will work well for them).
UPDATED: code in question was edited while I was answering ;)
Sorry, but what you are trying to do makes little sense.
If you turn off JIT compilation, then you are only going to measure how long it takes to call that method with JIT compilation turned off. This is not useful information ... because it tells you little if anything about what will happen when JIT compilation is turned on. [1]
The times between JIT on and off can be different by a huge factor. You are unlikely to want to run anything in production with JIT turned off.
A better approach would be to do this:
long before1 = getTimeInMilli();
for (int i = 0; i < TIMES_TO_ITERATE; i++) {
doSomeReallyHardWork();
}
long after1 = getTimeInMilli();
... and / or use the nanosecond clock.
If you are trying to measure the time taken to call the two versions of getTimeInMillis(), then I don't understand the point of your call to doSomeReallyHardWork(). A more sensible benchmark would be this:
public long test() {
long before1 = getTimeInMilli();
long sum = 0;
for (int i = 0; i < TIMES_TO_ITERATE; i++) {
sum += getTimeInMilli();
}
long after1 = getTimeInMilli();
System.out.println("Took " + (after - before) + " milliseconds");
return sum;
}
... and call that a number of times, until the times printed stabilize.
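For instance, a throwaway driver like this (entirely made up) keeps re-running it and uses the returned sum so the JIT cannot treat the work as dead code:
long checksum = 0;
for (int run = 0; run < 20; run++) {
    checksum += test();   // test() prints its own timing; keep going until the printed numbers settle
}
System.out.println("checksum (only printed so the result is used): " + checksum);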
Either way, my main point still stands: turning off JIT compilation and / or optimization would mean that you were measuring something that is not useful to know, and not what you are really trying to find out. (Unless, that is, you are intending to run your application in production with JIT turned off ... which I find hard to believe ...)
[1] I note that someone has commented that turning off JIT compilation allowed them to easily demonstrate the difference between O(1), O(N) and O(N^2) algorithms for a class. But I would counter that it is better to learn how to write a correct micro-benchmark. And for serious purposes, you need to learn how to derive the complexity of the algorithms ... mathematically. Even with a perfect benchmark, you can get the wrong answer by trying to "deduce" complexity from performance measurements. (Take the behavior of HashMap for example.)
Maybe I'm being misled by my profiler (NetBeans), but I'm seeing some odd behavior, and I'm hoping someone here can help me understand it.
I am working on an application which makes heavy use of rather large hash tables (keys are longs, values are objects). The performance with the built-in Java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.
The code is very basic, using a double hashing strategy. This works well and shows the best performance of all the options I've tried thus far.
The catch is, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times, and/or do a lot more logic.
What really confuses me is the lookups are called only by one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result of the lookup, but is about 100x faster.
Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes, according to profiling, are virtually free). I don't understand how this bit of code can be so slow since it is just array access, and I don't see any way of making it faster.
Note that the code simply returns the bucket matching the key, the caller is expected to process the bucket. 'size' is the hash.length/2, hash1 does lookups in the first half of the hash table, hash2 does lookups in the second half. key_index is a final int field on the hash table passed into the constructor, and the values array on the Entry objects is a small array of longs usually of length 10 or less.
Any thoughts people have on this are much appreciated.
Thanks.
public final Entry get(final long theKey) {
Entry aEntry = hash[hash1(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
aEntry = hash[hash2(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
return null;
}
}
return aEntry;
}
Edit: the code for hash1 & hash2
private static int hash1(final long key, final int hashTableSize) {
return (int)(key&(hashTableSize-1));
}
private static int hash2(final long key, final int hashTableSize) {
return (int)(hashTableSize+((key^(key>>3))&(hashTableSize-1)));
}
Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.
The only thing that I would expect might make some difference is to move the key out of the values array of Entry.
Instead of having this:
class Entry {
long[] values;
}
//...
if ( entry.values[key_index] == key ) { //...
Try this:
class Entry {
long key;
long values[];
}
//...
if ( entry.key == key ) { //...
Instead of incurring the cost of accessing a member, plus doing bounds checking, then getting a value out of the array, you should just incur the cost of accessing the member.
Is there a random-access data type faster than an array?
I was interested in the answer to this question, so I set up a test environment. This is my Array interface:
interface Array {
long get(int i);
void set(int i, long v);
}
This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:
class NormalArray implements Array {
private long[] data;
public NormalArray(int size) {
data = new long[size];
}
@Override
public long get(int i) {
return data[i];
}
@Override
public void set(int i, long v) {
data[i] = v;
}
}
And then a control:
class NoOpArray implements Array {
@Override
public long get(int i) {
return 0;
}
@Override
public void set(int i, long v) {
}
}
Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:
class TenArray implements Array {
private long v0;
private long v1;
private long v2;
private long v3;
private long v4;
private long v5;
private long v6;
private long v7;
private long v8;
private long v9;
private long[] extras;
public TenArray(int size) {
if (size > 10) {
extras = new long[size - 10];
}
}
@Override
public long get(final int i) {
switch (i) {
case 0:
return v0;
case 1:
return v1;
case 2:
return v2;
case 3:
return v3;
case 4:
return v4;
case 5:
return v5;
case 6:
return v6;
case 7:
return v7;
case 8:
return v8;
case 9:
return v9;
default:
return extras[i - 10];
}
}
@Override
public void set(final int i, final long v) {
switch (i) {
case 0:
v0 = v; break;
case 1:
v1 = v; break;
case 2:
v2 = v; break;
case 3:
v3 = v; break;
case 4:
v4 = v; break;
case 5:
v5 = v; break;
case 6:
v6 = v; break;
case 7:
v7 = v; break;
case 8:
v8 = v; break;
case 9:
v9 = v; break;
default:
extras[i - 10] = v;
}
}
}
I tested it with this harness:
import java.util.Random;
public class ArrayOptimization {
public static void main(String[] args) {
int size = 10;
long[] data = new long[size];
Random r = new Random();
for ( int i = 0; i < data.length; i++ ) {
data[i] = r.nextLong();
}
Array[] a = new Array[] {
new NoOpArray(),
new NormalArray(size),
new TenArray(size)
};
for (;;) {
for ( int i = 0; i < a.length; i++ ) {
testSet(a[i], data, 10000000);
testGet(a[i], data, 10000000);
}
}
}
private static void testGet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
data[j] = a.get(j);
}
}
long stop = System.nanoTime();
System.out.printf("%s/get took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
private static void testSet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
a.set(j, data[j]);
}
}
long stop = System.nanoTime();
System.out.printf("%s/set took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
}
The results were somewhat surprising. The TenArray performs non-trivially faster than a NormalArray does (for sizes <= 10). Subtracting the overhead (using the NoOpArray average) you get TenArray as taking ~65% of the time of the normal array. So if you know the likely max size of your array, I suppose it is possible to exceed the speed of an array. I would imagine switch uses either less bounds checking or more efficient bounds checking than does an array.
NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms
Now whether you can in practice get speeds faster than an array I'm not sure; obviously this way you incur any overhead associated with the interface/class/methods.
Most likely you are partially misled in your interpretation of the profiler's results. Profilers are notorious for overinflating the performance impact of small, frequently called methods. In your case, the profiling overhead for the get() method is probably larger than the actual processing spent in the method itself. The situation is worsened further since the instrumentation also interferes with the JIT's ability to inline methods.
As a rule of thumb for this situation - if the total processing time for a piece of work of known length increases more than two- to threefold when running under the profiler, the profiling overhead will give you skewed results.
To verify your changes actually do have an impact, always measure performance improvements without the profiler, too. The profiler can hint at bottlenecks, but it can also deceive you into looking at places where nothing is wrong.
Array bounds checking can have a surprisingly large impact on performance (if you do comparably little else), but it can also be hard to clearly separate from general memory access penalties. In some trivial cases, the JIT might be able to eliminate them (there have been efforts towards bounds check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for(x=0; x<array.length; x++).
Under some circumstances you may be able to replace array access with simple member access, completely avoiding the bounds checks, but it's limited to the rare cases where you access your array exclusively by constant indices. I see no way to apply it to your problem.
The change suggested by Mark Peters is most likely not solely faster because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache friendly way.
Many profilers tell you very confusing things, partly because of how they work, and partly because people have funny ideas about performance to begin with.
For example, you're wondering about how many times functions are called, and you're looking at code and thinking it looks like a lot of logic, therefore slow.
There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.
First of all, think in terms of the percent of time a routine or statement is active, rather than the number of times it is called or the average length of time it takes. The reason for that is it is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply the number of calls by the average execution time and divide by the total time just to see if it is big enough to even care about. Also, percent tells you, bottom line, how much fixing it could potentially reduce the overall execution time.
Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)
Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.
There are profilers that can do this. I'm not sure about Java.
If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? and only things that have a large enough percent to be worth the trouble, like 10% or more? Such a line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them. If 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percent is off a little bit, as long as you find it? That's another one of those happy myths of profilers - that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will.
So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.
Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so now their percentage is 5/4 of what it was, because the denominator got smaller. Percentage-wise they got bigger, and easier to find. This effect snowballs, allowing you to really squeeze the code.
You could try using a memoizing or caching strategy to reduce the number of actual calls. Another thing you could try, if you're very desperate, is a native array, since indexing those is unbelievably fast, and JNI shouldn't add too much overhead if you're using parameters like longs that don't require marshalling.
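A sketch of the memoizing idea, layered on top of the get() from the question (the field and method names here are made up); it only pays off if the same key tends to be looked up repeatedly:
// Remember the last successful lookup so repeated queries for the same key
// skip both hash probes entirely.
private long lastKey;
private Entry lastHit;

public final Entry getCached(final long theKey) {
    final Entry cached = lastHit;
    if (cached != null && lastKey == theKey) {
        return cached;                      // cache hit: no array accesses at all
    }
    final Entry found = get(theKey);        // fall back to the normal double-hash lookup
    if (found != null) {
        lastKey = theKey;
        lastHit = found;
    }
    return found;
}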