In the process of creating a voxel game, I'm doing some performance tests, for the basic chunk system.
A chunk is made of 16 tiles on the y axis. A tile is a Hashmap of material ids.
The key is a byte, and the material id is a short.
According to my calculations, a chunk should be 12KB plus a little bit (let's just say 16KB): 16*16*16*3 bytes, where 3 is one byte for the key plus two bytes for the short.
What I basically don't understand is that my application uses much more memory than expected: around 256KB for each chunk. It uses around 2GB running 8192 chunks. Note that this is a chunk storage performance test, and therefore not a whole game.
Another strange thing is that the memory usage varies from 1.9GB to 2.2GB each time I run it. There's no randomizer in the code, so it should always be the same number of variables, arrays, elements, etc.
Here's my code:
public class ChunkTest {
public static void main(String[] args) {
List <Chunk> chunks = new ArrayList <Chunk>();
long time = System.currentTimeMillis();
for(int i = 0; i<8192; i++) {
chunks.add(new Chunk());
}
long time2 = System.currentTimeMillis();
System.out.println(time2-time);
System.out.println("Done");
//System.out.println(chunk.getBlock((byte)0, (byte)0, (byte)0));
while(1==1) {
//Just to keep it running to view memory usage
}
}
}
And the other class
public class Chunk {
int x;
int y;
int z;
boolean solidUp;
boolean solidDown;
boolean solidNorth;
boolean solidSouth;
boolean solidWest;
boolean solidEast;
private HashMap<Byte, HashMap<Byte, Short>> tiles = new HashMap<Byte, HashMap<Byte, Short>>();
public Chunk() {
HashMap<Byte, Short> tile;
//Create 16 tiles
for(byte i = 0; i<16;i++) {
//System.out.println(i);
tile = new HashMap<Byte, Short>();
//Create 16 by 16 blocks (1 is the default id)
for(short e = 0; e<256;e++) {
//System.out.println(e);
tile.put((byte) e, (short) 1);
}
tiles.put(i, tile);
}
}
public short getBlock(byte x, byte y, byte z) {
HashMap<Byte, Short> tile = tiles.get(y);
short block = tile.get((byte)(x+(z*16)));
return block;
}
}
I'm using Windows Task Manager to monitor the memory usage.
Is that a very inaccurate tool for this? Does it more or less guess, which would explain why the figure varies from run to run?
And what is making each chunk 20 times heavier than it should be?
A little bonus question, if you know: if I know the index of what I'm trying to find, is a HashMap or an ArrayList faster?
A chunk is made of 16 tiles on the y axis. A tile is a Hashmap of material ids. The key is a byte, and the material id is a short.
According to my calculations a chunk should be 12KB + a little bit (Let's just say 16KB). 16*16*16*3. 3 is for a byte and a short(3 bytes).
That's bad. Though you're keeping the size of the HashMap secret, I can see you're too optimistic.
A Map.Entry is an object. Add 4 or 8 bytes for its header.
Its key is an object, never a primitive. Count 8 bytes.
The same for the value.
A HashMap.Entry stores the hash (int, 4 bytes) and a reference to Entry next (4 or 8 bytes). The HashMap maintains an array of references to its entries (4 or 8 bytes per element), which is by default kept at most 75% full.
So we have much more than what you expected. The exact values depend on your JVM, and some of my figures above may be wrong. Anyway, you're off by a factor of maybe 10 or more.
I'd suggest you post your code to CR with all the details needed for the size estimation. Consider using some primitive map or maybe just an array...
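To illustrate the "maybe just an array" suggestion, here is a minimal sketch that stores a whole 16x16x16 chunk in one flat short[]. The class name ArrayChunk, the index layout, and the payloadBytes() helper are assumptions made for this example; they are not from the question.

```java
import java.util.Arrays;

// Hypothetical replacement for the HashMap-based Chunk: one flat primitive array.
final class ArrayChunk {
    static final int SIZE = 16;

    // 16 * 16 * 16 shorts = 8192 bytes of payload, plus one small array header.
    private final short[] blocks = new short[SIZE * SIZE * SIZE];

    ArrayChunk() {
        // 1 is the default material id, as in the question.
        Arrays.fill(blocks, (short) 1);
    }

    // Maps (x, y, z) to a position in the flat array; y selects the 16x16 layer.
    private static int index(int x, int y, int z) {
        return (y * SIZE + z) * SIZE + x;
    }

    short getBlock(int x, int y, int z) {
        return blocks[index(x, y, z)];
    }

    void setBlock(int x, int y, int z, short id) {
        blocks[index(x, y, z)] = id;
    }

    // Payload size only; the object and array headers come on top of this.
    static long payloadBytes() {
        return (long) SIZE * SIZE * SIZE * Short.BYTES;
    }
}
```

With this layout there are no entry objects and no boxed keys, and a lookup by known index is a single array access, which also bears on the bonus question.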
I really hope that my title is correct, but let me explain exactly what I mean by the title.
I need to write a program that creates a byte array with a size of 1MB and adds it to an ArrayList<byte[]>.
I need to keep doing that until I have added a total of 3GB. Also, after every allocation I should call Thread.sleep(20).
I think I understand what the implementation should look like, but I'm having difficulties with calculating that exact 1MB and with the variables.
Here is how my code looks like:
import java.util.ArrayList;
public class MemLeak
{
public static void main(String[] args)
{
int maxSize = 3221225472; // 3GB in bytes
int startSize = 0;
byte[] byteArray = new byte[1048576]; // 1MB in bytes
ArrayList<byte[]> list = new ArrayList<byte[]>();
while (startSize != maxSize) {
list.add(byteArray);
Thread.sleep(20);
startSize = startSize + 1048576;
}
}
}
I had assumed that the number of elements in a byte array equals its size in bytes, but that doesn't seem to be the case.
The variable int maxSize = 3221225472 is also out of int range.
Could someone give me some pointers?
Thanks, much appreciated!
If the byte array declaration is moved inside the loop then each iteration allocates 1MB. Do this 3072 times to allocate 3GB of total space. No need to keep track of maxSize.
If the goal is for the ArrayList + its data to consume exactly 3GB of memory, that will be more difficult, because the ArrayList class consumes its own space outside of your data, and it also does its own internal memory management, so it might go from less than to more than 3GB when it needs to create more internal space. Have a look at the source code of ArrayList.add() to see what I mean.
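A minimal sketch of the first suggestion, with the allocation moved inside the loop. The class name MemFill and the method allocate are made up for this example, and the chunk count is a parameter so the same code can be tried with a small value first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: allocates `megabytes` separate 1 MB arrays and keeps them reachable.
final class MemFill {
    static final int ONE_MB = 1024 * 1024; // 1,048,576 bytes

    static List<byte[]> allocate(int megabytes, long sleepMillis) {
        List<byte[]> list = new ArrayList<>(megabytes);
        for (int i = 0; i < megabytes; i++) {
            // A new array per iteration, so every entry really occupies its own 1 MB.
            list.add(new byte[ONE_MB]);
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop allocating.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return list;
    }
}
```

Calling MemFill.allocate(3072, 20) corresponds to the 3GB case; it only succeeds if the JVM heap is large enough (for example via -Xmx), otherwise it ends in an OutOfMemoryError. If the total is needed as a number, it has to be a long, since 3072L * 1048576 = 3221225472 does not fit in an int.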
I have a simple program:
public class Test {
public static void main(String[] args) {
for (int i = 0; i < 1_000_000; i++) {
System.out.print(1);
}
}
}
I ran it under a profiler. Here are the results:
I assume that memory grows because of this method call:
public void print(int i) {
write(String.valueOf(i));
}
Is there a way to print int values to the console without this memory growth?
On my local machine I tried adding if (i % 10000 == 0) System.gc(); to the loop, and memory consumption evened out. But the system that checks the solution still does not accept it. I tried changing the step value, but it still fails either on memory (it should use less than 20 MB) or on time (< 1 sec).
EDIT: I tried this
String str = "hz";
for (int i = 0; i < 1_000_0000; i++) {
System.out.print(str);
}
But I get the same result:
EDIT 2: if I write this code
public class Test {
public static void main(String[] args) {
byte[] bytes = "hz".getBytes();
for (int i = 0; i < 1_000_0000; i++) {
System.out.write(bytes, 0, bytes.length);
}
}
}
I get this:
Therefore, I do not believe this is just background noise from the JVM. It would show up in both cases.
You need to convert the int into characters without generating a new String each time you do it. This could be done in a couple of ways:
Write a custom "int to characters" method that converts to ASCII bytes in a byte[] (see @AndyTurner's example code). Then write the byte[]. And repeat.
Use ByteBuffer, fill it directly using a custom "int to characters" converter method, and use a Channel to output the bytes when the buffer is full. And repeat.
If done correctly, you should be able to output the numbers without generating any garbage ... other than your once-off buffers.
Note that System.out is a PrintStream wrapping a BufferedOutputStream wrapping a FileOutputStream. And when you output a String directly or indirectly using one of the print methods, it actually goes through a BufferedWriter that is internal to the PrintStream. It is complicated ... and apparently the print(String) method generates garbage somewhere in that complexity.
Concerning your EDIT 1: when you repeatedly print out a constant string, you are still apparently generating garbage. I was surprised by this, but I guess it is happening in the BufferedWriter.
Concerning your EDIT 2: when you repeatedly write from a byte[], the garbage generation all but disappears. This confirms that at least one of my suggestions will work.
However, since you are monitoring the JVM with an external profiler, your JVM is also running an agent that periodically sends updates to your profiler. That agent will most likely be generating a small amount of garbage. And there could be other sources of garbage in the JVM, e.g. if you have JVM GC logging enabled.
Since you have discovered that printing a byte[] keeps memory allocation within the required bounds, you can use this fact:
Allocate a byte array the length of the ASCII representation of Integer.MIN_VALUE (11 - the longest an int can be). Then you can fill the array backwards to convert a number i:
int p = buffer.length;
if (i == Integer.MIN_VALUE) {
buffer[--p] = (byte) ('0' - i % 10);
i /= 10;
}
boolean neg = i < 0;
if (neg) i = -i;
do {
buffer[--p] = (byte) ('0' + i % 10);
i /= 10;
} while (i != 0);
if (neg) buffer[--p] = '-';
Then write this to your stream:
out.write(buffer, p, buffer.length - p);
You can reuse the same buffer to write as many numbers as you wish.
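Put together as a self-contained class, the fragments above could look like the sketch below. The class name IntWriter and the methods format, write, and asString are names chosen for this example; asString allocates a String and is only there to make the result easy to check.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the buffer-filling code: one reusable 11-byte buffer.
final class IntWriter {
    // "-2147483648" is 11 characters, the longest decimal form of an int.
    private final byte[] buffer = new byte[11];

    // Fills the buffer from the end and returns the index of the first used byte.
    private int format(int i) {
        int p = buffer.length;
        if (i == Integer.MIN_VALUE) {
            // -MIN_VALUE overflows, so peel off the last digit while still negative.
            buffer[--p] = (byte) ('0' - i % 10);
            i /= 10;
        }
        boolean neg = i < 0;
        if (neg) i = -i;
        do {
            buffer[--p] = (byte) ('0' + i % 10);
            i /= 10;
        } while (i != 0);
        if (neg) buffer[--p] = '-';
        return p;
    }

    // Writes the decimal form of i without creating a String.
    void write(OutputStream out, int i) throws IOException {
        int p = format(i);
        out.write(buffer, p, buffer.length - p);
    }

    // Convenience for checking the output; this one does allocate.
    String asString(int i) {
        int p = format(i);
        return new String(buffer, p, buffer.length - p, StandardCharsets.US_ASCII);
    }
}
```

In the loop from the question, a single IntWriter would be created once and write(System.out, i) called for each value.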
The pattern of memory usage is typical for Java; your code is irrelevant. To control Java memory usage you need to use some -X parameters. For example, "-Xms512m -Xmx512m" sets both the minimum and maximum heap size to 512 MB. BTW, to minimize the sawtooth-like memory graph, it is recommended to set the min and max size to the same value. These parameters can be given to java on the command line when you run your program, for example:
java -Xms512m -Xmx512m myProgram
There are other ways as well. Here is one link where you can read more about it: Oracle docs. There are other parameters that control the stack size and some other things. Code written without memory usage in mind can influence memory usage as well, but in your case the code is too trivial to matter. Most memory concerns are addressed by configuring the JVM memory parameters.
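To check from inside the program which heap limits are actually in effect, the standard Runtime API can be queried. This is only a small sketch; the class name HeapInfo and its method names are made up for this example.

```java
// Hypothetical helper that reports the heap figures the JVM is running with.
final class HeapInfo {
    // Upper bound the heap may grow to (influenced by -Xmx).
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Heap currently in use: what is reserved minus what is free inside it.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static String summary() {
        long mb = 1024L * 1024L;
        return String.format("used %d MB, max %d MB",
                usedHeapBytes() / mb, maxHeapBytes() / mb);
    }
}
```

Printing HeapInfo.summary() at startup is a quick way to see whether the -Xms/-Xmx values given on the command line took effect.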
I have always read that we should use Vector everywhere in Java and that there are no performance issues, which is certainly true. I'm writing a method to calculate the MSE (Mean Squared Error) and noticed that it was very slow - I basically was passing the Vector of values. When I switched to Array, it was 10 times faster but I don't understand why.
I have written a simple test:
public static void main(String[] args) throws IOException {
Vector <Integer> testV = new Vector<Integer>();
Integer[] testA = new Integer[1000000];
for(int i=0;i<1000000;i++){
testV.add(i);
testA[i]=i;
}
Long startTime = System.currentTimeMillis();
for(int i=0;i<500;i++){
double testVal = testArray(testA, 0, 1000000);
}
System.out.println(String.format("Array total time %s ",System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
for(int i=0;i<500;i++){
double testVal = testVector(testV, 0, 1000000);
}
System.out.println(String.format("Vector total time %s ",System.currentTimeMillis() - startTime));
}
Which calls the following methods:
public static double testVector(Vector<Integer> data, int start, int stop){
double toto = 0.0;
for(int i=start ; i<stop ; i++){
toto += data.get(i);
}
return toto / data.size();
}
public static double testArray(Integer[] data, int start, int stop){
double toto = 0.0;
for(int i=start ; i<stop ; i++){
toto += data[i];
}
return toto / data.length;
}
The array one is indeed 10 times faster. Here is the output:
Array total time 854
Vector total time 9840
Can somebody explain to me why? I have searched for quite a while, but cannot figure it out. The vector method appears to be making a local copy of the vector, but I always thought that objects were passed by reference in Java.
"I have always read that we should use Vector everywhere in Java and that there are no performance issues" - Wrong. A Vector is thread-safe: every access is synchronized, so it needs additional logic to handle access and modification by multiple threads, which makes it slower. An array, on the other hand, needs no such logic. Try ArrayList instead of Vector to increase the speed.
Note (based on your comment): I'm running the method 500 times each
This is not the right way to measure performance or speed in Java. You should at least do a warm-up run first, so that JIT compilation does not distort the measurement.
Yes, that's the eternal problem of poor microbenchmarking. The Vector itself is not SO slow.
Here is a trick:
add -XX:BiasedLockingStartupDelay=0 and now testVector "magically" runs 5 times faster than before!
Next, wrap testVector into synchronized (data) - and now it is almost as fast as testArray.
You are basically measuring the performance of object monitors in HotSpot, not the data structures.
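The "wrap testVector into synchronized (data)" step can be sketched as follows. The class name VectorSum and both method names are made up for this example. How much it helps depends on the JVM version; biased locking, which the flag above tunes, was deprecated in JDK 15.

```java
import java.util.Vector;

// Hypothetical comparison: the same loop with and without holding the Vector's monitor.
final class VectorSum {
    // As in the question: every get() acquires and releases the Vector's lock.
    static double meanUnlocked(Vector<Integer> data, int start, int stop) {
        double total = 0.0;
        for (int i = start; i < stop; i++) {
            total += data.get(i);
        }
        return total / data.size();
    }

    // The monitor is taken once up front; each get() then re-enters a lock
    // this thread already holds, which is a much cheaper operation.
    static double meanLocked(Vector<Integer> data, int start, int stop) {
        synchronized (data) {
            double total = 0.0;
            for (int i = start; i < stop; i++) {
                total += data.get(i);
            }
            return total / data.size();
        }
    }
}
```

Both methods return the same value; only the locking pattern differs, which is the point of the experiment.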
Simple thing. Vector is thread-safe, so it needs synchronization to add and access elements. Use ArrayList, which is also backed by an array but is not thread-safe and is therefore faster.
Note:
If you know the number of elements in advance, pass it to the ArrayList constructor as the initial capacity. Without an initial capacity, the list resizes internally as it grows, which copies the backing array.
The performance of a plain array and of an ArrayList without an initial capacity also differs considerably when the number of elements is large.
Poor code: instead of list.get(), use an iterator on the list. The array will still be faster, though.
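A short sketch combining the last two suggestions: an ArrayList created with a known capacity, and a traversal that uses the list's iterator (via the enhanced for loop) instead of get(i). The class name ListSum and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing a pre-sized ArrayList and iterator-based traversal.
final class ListSum {
    // Creates a list holding 0..n-1; the capacity is given up front so that
    // the backing array never has to be grown and copied while filling.
    static List<Integer> filled(int n) {
        List<Integer> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    // The enhanced for loop uses the list's Iterator under the hood.
    static double mean(List<Integer> data) {
        if (data.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (int value : data) {
            total += value;
        }
        return total / data.size();
    }
}
```

Whether the iterator form is measurably faster than get(i) on an ArrayList depends on the JIT; a warmed-up measurement is the only way to know for a given workload.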
I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what it could be or what, in that aspect is the difference with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out-of-context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this a part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.
All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?
I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.
When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks), you may get a performance improvement from parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while still being worth parallelizing is a well-known topic in parallel computing :)
You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking, when measuring performance it's easy to get misled when measuring small pieces of work. I would aim for a run at least 1,000 times longer, by putting the whole thing in a loop or similar.
Here the one difference between the "No Thread" and "Threaded" cases is that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.
Maybe I'm being misled by my profiler (Netbeans), but I'm seeing some odd behavior, hoping maybe someone here can help me understand it.
I am working on an application, which makes heavy use of rather large hash tables (keys are longs, values are objects). The performance with the built in java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.
The code is very basic using a double hashing strategy. This works fine and good and shows the best performance of all the other options I've tried thus far.
The catch is, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times, and/or do a lot more logic.
What really confuses me is the lookups are called only by one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result of the lookup, but is about 100x faster.
Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes, according to profiling, are virtually free). I don't understand how this bit of code can be so slow since it is just array access, and I don't see any way of making it faster.
Note that the code simply returns the bucket matching the key, the caller is expected to process the bucket. 'size' is the hash.length/2, hash1 does lookups in the first half of the hash table, hash2 does lookups in the second half. key_index is a final int field on the hash table passed into the constructor, and the values array on the Entry objects is a small array of longs usually of length 10 or less.
Any thoughts people have on this are much appreciated.
Thanks.
public final Entry get(final long theKey) {
Entry aEntry = hash[hash1(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
aEntry = hash[hash2(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
return null;
}
}
return aEntry;
}
Edit, the code for hash1 & hash2
private static int hash1(final long key, final int hashTableSize) {
return (int)(key&(hashTableSize-1));
}
private static int hash2(final long key, final int hashTableSize) {
return (int)(hashTableSize+((key^(key>>3))&(hashTableSize-1)));
}
Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.
The only thing that I would expect might make some difference is to move the key out of the values array of Entry.
Instead of having this:
class Entry {
long[] values;
}
//...
if ( entry.values[key_index] == key ) { //...
Try this:
class Entry {
long key;
long values[];
}
//...
if ( entry.key == key ) { //...
Instead of incurring the cost of accessing a member, plus doing bounds checking, then getting a value of the array, you should just incur the cost of accessing the member.
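A minimal sketch of that change applied to a table laid out the way the question describes (first half probed with hash1, second half with hash2). The get, hash1, and hash2 bodies follow the question; the class name LongTable and the put method are assumptions added only so the example can be exercised.

```java
// Hypothetical two-half table with the key stored directly on the entry.
final class LongTable {
    static final class Entry {
        final long key;      // compared directly, no array access or bounds check
        final long[] values;

        Entry(long key, long[] values) {
            this.key = key;
            this.values = values;
        }
    }

    private final Entry[] hash;
    private final int size; // half of hash.length, must be a power of two

    LongTable(int halfSize) {
        if (halfSize <= 0 || Integer.bitCount(halfSize) != 1) {
            throw new IllegalArgumentException("halfSize must be a power of two");
        }
        this.size = halfSize;
        this.hash = new Entry[2 * halfSize];
    }

    private static int hash1(long key, int n) {
        return (int) (key & (n - 1));
    }

    private static int hash2(long key, int n) {
        return (int) (n + ((key ^ (key >> 3)) & (n - 1)));
    }

    // Assumed insertion policy: first slot, else second slot, else give up.
    boolean put(Entry e) {
        int i = hash1(e.key, size);
        if (hash[i] == null || hash[i].key == e.key) { hash[i] = e; return true; }
        int j = hash2(e.key, size);
        if (hash[j] == null || hash[j].key == e.key) { hash[j] = e; return true; }
        return false; // both slots taken; a real table would grow or rehash here
    }

    Entry get(long theKey) {
        Entry e = hash[hash1(theKey, size)];
        if (e != null && e.key != theKey) {
            e = hash[hash2(theKey, size)];
            if (e != null && e.key != theKey) {
                return null;
            }
        }
        return e;
    }
}
```

The lookup now touches only the Entry object when comparing keys; the values array is dereferenced by the caller only after a hit.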
Is there a random-access data type faster than an array?
I was interested in the answer to this question, so I set up a test environment. This is my Array interface:
interface Array {
long get(int i);
void set(int i, long v);
}
This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:
class NormalArray implements Array {
private long[] data;
public NormalArray(int size) {
data = new long[size];
}
@Override
public long get(int i) {
return data[i];
}
@Override
public void set(int i, long v) {
data[i] = v;
}
}
And then a control:
class NoOpArray implements Array {
@Override
public long get(int i) {
return 0;
}
@Override
public void set(int i, long v) {
}
}
Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:
class TenArray implements Array {
private long v0;
private long v1;
private long v2;
private long v3;
private long v4;
private long v5;
private long v6;
private long v7;
private long v8;
private long v9;
private long[] extras;
public TenArray(int size) {
if (size > 10) {
extras = new long[size - 10];
}
}
@Override
public long get(final int i) {
switch (i) {
case 0:
return v0;
case 1:
return v1;
case 2:
return v2;
case 3:
return v3;
case 4:
return v4;
case 5:
return v5;
case 6:
return v6;
case 7:
return v7;
case 8:
return v8;
case 9:
return v9;
default:
return extras[i - 10];
}
}
@Override
public void set(final int i, final long v) {
switch (i) {
case 0:
v0 = v; break;
case 1:
v1 = v; break;
case 2:
v2 = v; break;
case 3:
v3 = v; break;
case 4:
v4 = v; break;
case 5:
v5 = v; break;
case 6:
v6 = v; break;
case 7:
v7 = v; break;
case 8:
v8 = v; break;
case 9:
v9 = v; break;
default:
extras[i - 10] = v;
}
}
}
I tested it with this harness:
import java.util.Random;
public class ArrayOptimization {
public static void main(String[] args) {
int size = 10;
long[] data = new long[size];
Random r = new Random();
for ( int i = 0; i < data.length; i++ ) {
data[i] = r.nextLong();
}
Array[] a = new Array[] {
new NoOpArray(),
new NormalArray(size),
new TenArray(size)
};
for (;;) {
for ( int i = 0; i < a.length; i++ ) {
testSet(a[i], data, 10000000);
testGet(a[i], data, 10000000);
}
}
}
private static void testGet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
data[j] = a.get(j);
}
}
long stop = System.nanoTime();
System.out.printf("%s/get took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
private static void testSet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
a.set(j, data[j]);
}
}
long stop = System.nanoTime();
System.out.printf("%s/set took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
}
The results were somewhat surprising. The TenArray performs non-trivially faster than a NormalArray does (for sizes <= 10). Subtracting the overhead (using the NoOpArray average) you get TenArray as taking ~65% of the time of the normal array. So if you know the likely max size of your array, I suppose it is possible to exceed the speed of an array. I would imagine switch uses either less bounds checking or more efficient bounds checking than does an array.
NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms
Now whether you can in practice get speeds faster than an array I'm not sure; obviously this way you incur any overhead associated with the interface/class/methods.
Most likely you are partially misled in your interpretation of the profiler's results. Profilers notoriously overinflate the performance impact of small, frequently called methods. In your case, the profiling overhead for the get() method is probably larger than the actual processing spent in the method itself. The situation is worsened further because the instrumentation also interferes with the JIT's ability to inline methods.
As a rule of thumb for this situation: if the total processing time for a piece of work of known length increases more than two- to threefold when running under the profiler, the profiling overhead will give you skewed results.
To verify that your changes actually have an impact, always measure performance improvements without the profiler, too. The profiler can hint at bottlenecks, but it can also mislead you into looking at places where nothing is wrong.
Array bounds checking can have a surprisingly large impact on performance (if you do comparably little else), but it can also be hard to clearly separate from general memory access penalties. In some trivial cases, the JIT might be able to eliminate them (there have been efforts towards bounds check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for(x=0; x<array.length; x++).
Under some circumstances you may be able to replace array access by simple member access, completely avoiding the bounds checks, but it's limited to the rare cases where you access your array exclusively by constant indices. I see no way to apply it to your problem.
The change suggested by Mark Peters is most likely not solely faster because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache friendly way.
Many profilers tell you very confusing things, partly because of how they work, and partly because people have funny ideas about performance to begin with.
For example, you're wondering about how many times functions are called, and you're looking at code and thinking it looks like a lot of logic, therefore slow.
There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.
First of all, think in terms of the percent of time a routine or statement is active, rather than the number of times it is called or the average length of time it takes. The reason for that is it is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply the number of calls by the average execution time and divide by the total time just to see if it is a big enough to even care about. Also, percent tells you, bottom line, how much fixing it could potentially reduce the overall execution time.
Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)
Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.
There are profilers that can do this. I'm not sure about Java.
If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? and only things that have a large enough percent to be worth the trouble, like 10% or more? Such a line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them. If 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percent is off a little bit, as long as you find it? That's another one of those happy myths of profilers - that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will.
So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.
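On the JVM, one way to take such samples by hand is Thread.getStackTrace() (running jstack against the process a few times works too). The sketch below counts, per line, how many samples contained it. The class name StackSampler and the method sample are made up for this example, and it is only a rough illustration of the idea, not a profiler.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical manual sampler: counts in how many stack samples each line appears.
final class StackSampler {
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int s = 0; s < samples; s++) {
            // Count a line at most once per sample, even if recursion repeats it.
            Set<String> seen = new HashSet<>();
            for (StackTraceElement frame : target.getStackTrace()) {
                String where = frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber();
                if (seen.add(where)) {
                    hits.merge(where, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }
}
```

A line that shows up in, say, 8 of 20 samples is on the stack roughly 40% of the time, which is the "percent active" figure described above; with so few samples the number is coarse, but it is enough to point at the line.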
Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so now their percentage is 5/4 of what it was, because the denominator got smaller. Percentage-wise they got bigger, and easier to find. This effect snowballs, allowing you to really squeeze the code.
You could try using a memoizing or caching strategy to reduce the number of actual calls. Another thing you could try, if you're very desperate, is a native array, since indexing those is unbelievably fast, and JNI shouldn't incur too much overhead if you're using parameters like longs that don't require marshalling.