OK, I miscalculated things in my microbenchmark. Please don't read this unless you have time to spare.
Instead of
double[] my_array=new array[1000000];double blabla=0;
for(int i=0;i<1000000;i++)
{
my_array[i]=Math.sqrt(i);//init
}
for(int i=0;i<1000000;i++)
{
blabla+=my_array[i];//array access time is 3.7ms per 1M operation
}
I used
public final static class my_class
{
public static double element=0;
my_class(double elementz)
{
element=elementz;
}
}
my_class[] class_z=new my_class[1000000];
for(int i=0;i<1000000;i++)
{
class_z[i]=new my_class(Math.sqrt(i)); //instantiating array elements for later use(random-access)
}
double blabla=0;
for(int i=0;i<1000000;i++)
{
blabla+=class_z[i].element; // array access time 2.7 ms per 1M operations.
}
}
Looping overhead is nearly 0.5 ms per 1M loop iterations (I subtracted this offset).
Element access time for the array of class instances is 25% lower than for the primitive array.
Question: Do you know any other way to lower random-access time even further?
Intel 2 GHz single core, Java, Eclipse
Looking at your code again, I can see that in the first loop you are adding 1m different elements. In the second example, you are adding the same static element 1m times.
A common problem with micro-benchmarks is that the order in which you perform the tests affects the results.
For example, if you have two loops, the first loop is initially not compiled to native code. However, after some time the whole method will be compiled and the loop will run faster.
Then you run the second loop and find it is either
much faster because it is optimised from the start (for simple loops), or
much slower because it is optimised without any runtime metrics (for complex loops).
You need to place each loop in a separate method and run the tests alternately a number of times to get reproducible results.
In your first case, the loop is not optimised until after it has run for a while. In the second case, your loop is likely to already be compiled when it starts.
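The advice about separate methods and alternating runs could be sketched roughly as follows. The class name, the Holder type, and the round count of 20 are made up for illustration; this is only a hand-rolled harness, and a dedicated tool such as JMH deals with warm-up and dead-code elimination far more rigorously.

```java
final class AlternatingBenchmark {
    /** Hypothetical wrapper type with an instance (non-static) field. */
    static final class Holder {
        final double element;
        Holder(double element) { this.element = element; }
    }

    // Each loop lives in its own method so each can be JIT-compiled on its own.
    static double sumPrimitive(double[] data) {
        double sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    static double sumObjects(Holder[] data) {
        double sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i].element;
        }
        return sum;
    }

    public static void main(String[] args) {
        final int n = 1_000_000;
        double[] primitives = new double[n];
        Holder[] holders = new Holder[n];
        for (int i = 0; i < n; i++) {
            double v = Math.sqrt(i);
            primitives[i] = v;
            holders[i] = new Holder(v);
        }

        double checksum = 0; // printed at the end so the sums are actually used
        for (int round = 0; round < 20; round++) { // alternate the two tests
            long t0 = System.nanoTime();
            checksum += sumPrimitive(primitives);
            long t1 = System.nanoTime();
            checksum += sumObjects(holders);
            long t2 = System.nanoTime();
            System.out.printf("round %2d: primitive %.3f ms, objects %.3f ms%n",
                    round, (t1 - t0) / 1e6, (t2 - t1) / 1e6);
        }
        System.out.println("checksum = " + checksum);
    }
}
```

The early rounds include interpretation and compilation time, so only the later rounds are worth comparing.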
The difference is easily explained:
The primitive array has a memory footprint of 1M * 8 bytes = 8MB.
The class array has a memory footprint of 1M * 4 bytes = 4MB, all pointing to the same instance (assuming 32bit VM or compressed refs 64bit VM).
Put different objects into your class array and you will see the primitive array perform better. You are comparing oranges to apples at the moment.
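To make the static-field problem concrete, here is a small sketch (all names are hypothetical) that contrasts a holder with a static field against one with an instance field. With the static field, every constructor call overwrites the same slot, so the summing loop just adds the last value n times.

```java
final class StaticVsInstance {
    static final class StaticHolder {
        static double element; // one slot shared by every instance
        StaticHolder(double value) { element = value; }
    }

    static final class InstanceHolder {
        final double element; // one slot per object
        InstanceHolder(double value) { this.element = value; }
    }

    static double sumStatic(int n) {
        StaticHolder[] items = new StaticHolder[n];
        for (int i = 0; i < n; i++) {
            items[i] = new StaticHolder(Math.sqrt(i)); // overwrites the shared field
        }
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += StaticHolder.element; // same field that items[i].element would read
        }
        return sum;
    }

    static double sumInstance(int n) {
        InstanceHolder[] items = new InstanceHolder[n];
        for (int i = 0; i < n; i++) {
            items[i] = new InstanceHolder(Math.sqrt(i));
        }
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += items[i].element; // a different object on every iteration
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("static field sum:   " + sumStatic(n));
        System.out.println("instance field sum: " + sumInstance(n));
    }
}
```

The two sums differ, which shows that the two loops in the question were not computing the same thing.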
There are several problems with your benchmarks and your assessment above. First, your code doesn't compile as shown. Second, your benchmark times (i.e., a few milliseconds) are far too short to be of any statistical worth with today's high-speed processors. Third, you're comparing apples to oranges (as mentioned above). That is, you're timing two completely different use cases: a single static and a million variables.
I fixed your code and ran it several times on an i7-2620m for 10,000 x 1,000,000 repetitions. All results were within +/- 1%, which is good enough for this discussion. Then, I took the fastest of all of those runs in order to compare their performance.
Above, you claimed that the second use case was "25% lower" than the first. That is wildly inaccurate.
In order to do a "static" versus "variable" performance comparison, I changed the first benchmark to add the 999,999th square-root just like the second one is doing. The difference was only about 4.63% in favor of the second use case.
In order to do an array access performance comparison, I changed the second use case to a "non-static" variable. The difference was about 68.2% in favor of the first use case (primitive array access), meaning that the first way was much faster than the second.
(Feel free to ask me more about micro-benchmarking since I've been doing performance measurement and assessment for over 25 years.)
I have the following Java method:
static Board board;
static int[][] POSSIBLE_PLAYS; // [262143][0 - 81]
public static void playSingleBoard() {
int subBoard = board.subBoards[board.boardIndex];
int randomMoveId = generateRandomInt(POSSIBLE_PLAYS[subBoard].length);
board.play(board.boardIndex, POSSIBLE_PLAYS[subBoard][randomMoveId]);
}
Accessed arrays do not change at runtime. The method is always called by the same thread. board.boardIndex may change from 0 to 8, there is a total of 9 subBoards.
In VisualVM I end up with the method being executed 2 228 212 times, with (Total Time CPU):
Self Time 27.9%
Board.play(int, int) 24.6%
MainClass.generateRandomInt(int) 8.7%
What I am wondering is where those 27.9% of self time come from (999ms / 2189ms).
I first thought that allocating two ints could slow down the method, so I tried the following:
public static void playSingleBoard() {
board.play(
board.boardIndex,
POSSIBLE_PLAYS[board.subBoards[board.boardIndex]]
[generateRandomInt(POSSIBLE_PLAYS[board.subBoards[board.boardIndex]].length)]
);
}
But I end up with similar results, and I have no clue what this self time can be. Is it GC time? Memory access?
I have tried both with and without the JVM options mentioned in "VisualVM - strange self time".
First, VisualVM (like many other safepoint-based profilers) is inherently misleading. Try using a profiler that does not suffer from safepoint bias. E.g. async-profiler can show not only methods, but also the particular lines/bytecodes where the most CPU time is spent.
Second, in your example, playSingleBoard may indeed take relatively long. Even without a profiler, I can tell that the most expensive operations here are the numerous array accesses.
RAM is the new disk. Memory access is not free, especially random access, and especially when the dataset is too big to fit into the CPU cache. Furthermore, an array access in Java needs to be bounds-checked. Also, there are no "true" two-dimensional arrays in Java; they are rather arrays of arrays.
This means an expression like POSSIBLE_PLAYS[subBoard][randomMoveId] will result in at least 5 memory reads and 2 bounds checks. And every time there is an L3 cache miss (which is likely for large arrays like in your case), this will result in ~50 ns latency, enough time to execute a hundred arithmetic operations otherwise.
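One way to cut down the number of dependent memory reads is to flatten the array of arrays into a single int[] plus an offsets table. The sketch below is only an illustration of that idea, not something taken from the question: FlatPlays and its methods are hypothetical names, and whether it actually helps for this workload would have to be measured.

```java
final class FlatPlays {
    private final int[] offsets; // row r occupies values[offsets[r] .. offsets[r + 1])
    private final int[] values;  // all rows stored back to back

    FlatPlays(int[][] jagged) {
        offsets = new int[jagged.length + 1];
        int total = 0;
        for (int r = 0; r < jagged.length; r++) {
            offsets[r] = total;
            total += jagged[r].length;
        }
        offsets[jagged.length] = total;

        values = new int[total];
        for (int r = 0; r < jagged.length; r++) {
            System.arraycopy(jagged[r], 0, values, offsets[r], jagged[r].length);
        }
    }

    /** Number of entries in the given row. */
    int rowLength(int row) {
        return offsets[row + 1] - offsets[row];
    }

    /**
     * Entry at (row, col). Note: col is not checked against rowLength(row),
     * so an out-of-range col silently reads from a following row.
     */
    int get(int row, int col) {
        return values[offsets[row] + col];
    }

    public static void main(String[] args) {
        FlatPlays plays = new FlatPlays(new int[][] {{1, 2, 3}, {}, {4, 5}});
        System.out.println(plays.rowLength(0) + " " + plays.get(2, 1));
    }
}
```

A lookup then touches two arrays instead of chasing a reference to a separate row object, at the cost of losing the per-row bounds check.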
I'm writing a Java application that transforms numbers (long) into a small set of result objects. This mapping process is very critical to the app's performance as it is needed very often.
public static Object computeResult(long input) {
Object result;
// ... calculate
return result;
}
There are about 150,000,000 different key objects, and about 3,000 distinct values.
The transformation from the input number (long) to the output (immutable object) can be computed by my algorithm with a speed of 4,000,000 transformations per second. (using 4 threads)
I would like to cache the mapping of the 150M different possible inputs to make the translation even faster, but I found some difficulties creating such a cache:
public class Cache {
private static long[] sortedInputs; // 150M length
private static Object[] results; // 150M length
public static Object lookupCachedResult(long input) {
int index = Arrays.binarySearch(sortedInputs, input);
return results[index];
}
}
I tried to create two arrays with a length of 150M. The first array holds all possible input longs, and it is sorted numerically. The second array holds a reference to one of the 3000 distinct, precalculated result objects at the index corresponding to the first array's input.
To get to the cached result, I do a binary search for the input number on the first array. The cached result is then looked up in the second array at the same index.
Sadly, this cache method is not faster than computing the results. Not even half as fast: only about 1.5M lookups per second (also using 4 threads).
Can anyone think of a faster way to cache results in such a scenario?
I doubt there is a database engine that is able to answer more than 4,000,000 queries per second on, let's say an average workstation.
Hashing is the way to go here, but I would avoid using HashMap, as it only works with objects, i.e. it must build a Long each time you insert a long, which can slow it down. Maybe this performance issue is not significant due to JIT, but I would recommend at least trying the following and measuring its performance against the HashMap variant:
Save your longs in a long array of some length n > 3000 and do the hashing by hand via a very simple (and thus efficient) hash function like
index = key % n. Since you know your 3000 possible values beforehand, you can empirically find an array length n such that this trivial hash function won't cause collisions. So you circumvent rehashing etc. and have true O(1) performance.
Secondly I would recommend you to look at Java-numerical libraries like
https://github.com/mikiobraun/jblas
https://github.com/fommil/matrix-toolkits-java
Both are backed by native Lapack and BLAS implementations that are usually highly optimized by very smart people. Maybe you can formulate your algorithm in terms of matrix/vector-algebra such that it computes the whole long-array at one time (or chunk-wise).
There are about 150,000,000 different key objects, and about 3,000 distinct values.
With the few values, you should ensure that they get re-used (unless they're pretty small objects). For this an Interner is perfect (though you can run your own).
i tried hashmap and treemap, both attempts ended in an outOfMemoryError.
There's a huge memory overhead for both of them. And there isn't much point in using a TreeMap, as it uses a sort of binary search, which you've already tried.
There are at least three implementations of a long-to-object map available; google for "primitive collections". This should use slightly more memory than your two arrays. With hashing being usually O(1) (let's ignore the worst case as there's no reason for it to happen, is there?) and much better memory locality, it'll beat(*) your binary search by a factor of 20. Your binary search needs log2(150e6), i.e., about 27 steps, and hashing may need maybe two on average. This depends on how tightly you pack the hash table; this is usually a parameter given when it gets created.
In case you run your own (which you most probably shouldn't), I'd suggest to use an array of size 1 << 28, i.e., 268435456 entries, so that you can use bitwise operations for indexing.
(*) Such predictions are hard, but I'm sure it's worth trying.
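For illustration only, a hand-rolled long-to-object table with a power-of-two capacity and bitwise indexing might look like the sketch below. LongObjectTable, its multiplicative hash constant, and the linear-probing scheme are my own assumptions rather than any particular library's design; an existing primitive-collections library is still the safer choice.

```java
final class LongObjectTable<V> {
    private final long[] keys;
    private final Object[] values; // null marks an empty slot
    private final int mask;        // capacity - 1, valid because capacity is a power of two
    private int size;

    LongObjectTable(int capacity) {
        if (capacity <= 1 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two > 1");
        }
        keys = new long[capacity];
        values = new Object[capacity];
        mask = capacity - 1;
    }

    // Scramble the key, then use a bitwise AND instead of a modulo.
    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32)) & mask;
    }

    void put(long key, V value) {
        if (value == null) {
            throw new NullPointerException("null values are not supported");
        }
        int i = slot(key);
        while (values[i] != null && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        if (values[i] == null) {
            if (size == mask) { // always keep one slot empty so probing terminates
                throw new IllegalStateException("table full");
            }
            size++;
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(long key) {
        int i = slot(key);
        while (values[i] != null) {
            if (keys[i] == key) {
                return (V) values[i];
            }
            i = (i + 1) & mask;
        }
        return null; // reached an empty slot: key is absent
    }
}
```

For the 150M keys in the question, a capacity of 1 << 28 as suggested above would leave the table a little over half full, which keeps probe sequences short.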
I've got a little program that is a fairly pointless exercise in simple number crunching that has thrown me for a loop.
The program spawns a bunch of worker threads that do simple mathematical operations. Recently I changed the inner loop of one variant of worker from:
do
{
int3 = int1 + int2;
int3 = int1 * int2;
int1++;
int2++;
i++;
}
while (i < 128);
to something akin to:
int3 = tempint4[0] + tempint5[0];
int3 = tempint4[0] * tempint5[0];
int3 = tempint4[1] + tempint5[1];
int3 = tempint4[1] * tempint5[1];
int3 = tempint4[2] + tempint5[2];
int3 = tempint4[2] * tempint5[2];
int3 = tempint4[3] + tempint5[3];
int3 = tempint4[3] * tempint5[3];
...
int3 = tempint4[127] + tempint5[127];
int3 = tempint4[127] * tempint5[127];
The arrays are populated by random integers no higher than 1025 in value, and the array values do not change.
The end result was that the program ran much faster, though closer examination seems to indicate that the CPU isn't actually doing anything when running the newer version of the code. It seems that the JVM has figured out that it can safely ignore the code that replaced the inner loop after one iteration of the outer loop since it is only redoing the same calculations on the same set of data over and over again.
To illustrate my point, the old code took maybe ~27000 ms to run and noticeably increased the operating temperature of the CPU (it also showed 100% utilization for all cores). The new code takes maybe 5 ms to run (sometimes less) and causes nary a spike in CPU utilization or temperature. Increasing the number of outer loop iterations does nothing to change the behavior of the new code, even when the number of iterations increases by a hundred times or more.
I have another version of the worker that is identical to the one above except that it has a division operation along with the addition and multiplication operations. In its new unrolled form, the division-enabled version is also much faster than its previous form, but it actually takes a little while (~300 ms on the first run and ~200 ms on subsequent runs, despite warmup, which is a little odd) and produces a profound spike in CPU temperature for its brief run. Increasing the number of outer loop iterations seems to cause the temperature phenomenon to mostly cease after a certain amount of time has passed while running the program, though utilization still shows 100% for all cores. My guess is the JVM is taking much longer to figure out which operations it can safely ignore when handling division operations, and that it is not ignoring all of them.
Short of adding division operations to all my code (which isn't really a fix anyway beyond a certain number of outer loop iterations), is there any way I can get the JVM to stop reducing my code to apparent NOOPs? I've tried several solutions to the problem, such as generating new random values per iteration of the outer loop, going back to simple integer variables with incrementation, and some other nonsense, but none of those solutions have produced desirable results. Either it continues to ignore the series of instructions, or the performance hit from modifications is bad enough that my division-heavy variant actually performs better than the code without division operations.
edit: to provide some context:
i: this variable is an integer that is used for a loop counter in a do/while loop. It is defined in the class file containing the worker code. Its initial value is 0. It is no longer used in the newer version of the worker.
int1/int2: These are integers defined in the class file containing the worker code. Their initial values are both 0. They were used in the old version of the code to provide changing values for each iteration of the internal loop. All I had to do was increment them upward by one per loop iteration, and the JVM would be forced to carry out every operation faithfully. Unfortunately, this loop apparently prevented the use of SIMD. Each time the outer loop iterated, int1 and int2 had their values reset to prevent overflow of int1, int2, or int3 (I have discovered that integer overflow can slow down the code unnecessarily, as can allowing a float to reach Infinity).
tempint4/tempint5: These are references to a pair of integer arrays defined in the main class file for the program (Mathtester. Yes, unimaginative, I know). When the program first starts, there is a short do/while loop that fills each array with random integers ranging from 1-1025. The arrays are 128 integers in size. Each array is static, though the reference variables are not. In truth there is no particular reason for me to use the reference variables. They are leftovers from when I was trying to do an array reference swap so that, after each iteration of the outer loop, tempint4 and tempint5 would be referred to the opposite array. It was my hope that the JVM would stop ignoring my code block. For the division-enabled version of the code, this seems to have worked (sort of), since it fundamentally changes the values to be calculated. Swapping tempint4 for tempint5 and vice versa does not change the results of the addition and multiplication operations, so the JVM can still ignore those.
edit: Making tempint4 and tempint5 (since they are only reference variables, I am actually referring to the main arrays, Mathtester.int4 and Mathtester.int5) volatile worked without notably reducing the amount of CPU activity or level of CPU temperature. It did slow down the code a bit, but that is a probable indicator that the JVM was NOOPing more than I knew.
Is there any way I can get the JVM to stop reducing my code to apparent NOOPs?
Yes, by making int3 volatile.
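A minimal sketch of that idea, with hypothetical names: the result of each pass is stored into a volatile field, and one input element is changed per outer iteration so the result is not loop-invariant. This should keep the arithmetic from being discarded, but it is an assumption about JIT behaviour rather than a guarantee, and the volatile store has a cost of its own.

```java
import java.util.Random;

final class VolatileSink {
    static volatile int sink; // a volatile store cannot simply be dropped by the JIT

    // Add and multiply every pair, accumulating so that each result is used.
    static int work(int[] a, int[] b) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] + b[i];
            acc += a[i] * b[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] a = new int[128];
        int[] b = new int[128];
        for (int i = 0; i < a.length; i++) {
            a[i] = 1 + random.nextInt(1025); // values in 1..1025, as in the question
            b[i] = 1 + random.nextInt(1025);
        }

        long start = System.nanoTime();
        for (int round = 0; round < 1_000_000; round++) {
            a[round & 127] = 1 + (round % 1025); // change one input per round
            sink = work(a, b);                   // publish the result
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("%.1f ms, last result %d%n", elapsed / 1e6, sink);
    }
}
```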
One of the first things when dealing with Java performance that you have to learn by heart is this:
"A single line of Java code means nothing at all in isolation".
Modern JVMs are very complex beasts, and do all kinds of optimization. If you try to measure some small piece of code, the chances are that you will not be measuring what you think you are - it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing.
In this case, yes, it's entirely likely that the JVM is optimizing away the loop. There's no simple way to prevent it from doing this, and almost all techniques are fragile and JVM-version specific (because new & cleverer optimizations are developed & added to the JVM all the time).
So, let me turn the question around: "What are you really trying to achieve here? Why do you want to prevent the JVM from optimizing?"
This code is from the dot-product method of a vector class of mine. The method computes the inner product against a target array of vectors (1000 vectors).
When the vector length is an odd number (262145), compute time is 4.37 seconds. When the vector length (N) is 262144 (a multiple of 8), compute time is 1.93 seconds.
time1=System.nanotime();
int count=0;
for(int j=0;j<1000;i++)
{
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
if(((N/2)*2)!=N)
{
for(int i=0;i<N;i++)
{
t1+=elements[i]*b.elements[i];
}
}
else if(((N/8)*8)==N)
{
float []vek=new float[8];
for(int i=0;i<(N/8);i++)
{
vek[0]=elements[i]*b.elements[i];
vek[1]=elements[i+1]*b.elements[i+1];
vek[2]=elements[i+2]*b.elements[i+2];
vek[3]=elements[i+3]*b.elements[i+3];
vek[4]=elements[i+4]*b.elements[i+4];
vek[5]=elements[i+5]*b.elements[i+5];
vek[6]=elements[i+6]*b.elements[i+6];
vek[7]=elements[i+7]*b.elements[i+7];
t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
//t1 is total sum of all dot products.
}
}
}
time2=System.nanotime();
time3=(time2-time1)/1000000000.0; //seconds
Question: Could the reduction of time from 4.37s to 1.93s (2x as fast) be JIT's wise decision of using SIMD instructions or just my loop-unrolling's positive effect?
If the JIT cannot do SIMD optimization automatically, then in this example there is also no unrolling optimization done automatically by the JIT. Is this true?
For 1M iterations (vectors) and a vector size of 64, the speedup multiplier goes to 3.5x (cache advantage?).
Thanks.
Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?
Your first loop does this, indented more conventionally:
for(int j=0;j<1000;i++) {
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
}
Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.
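For comparison, here is a sketch of an 8-way unrolled dot product that does cover the whole vector: the index advances by 8 per iteration, each lane has its own accumulator, and a tail loop handles lengths that are not a multiple of 8. The class and method names are made up, both arrays are assumed to have the same length, and because float addition is not associative the unrolled result can differ slightly from the simple loop.

```java
final class DotProduct {
    // Straightforward version: one long chain of dependent additions.
    static float dotSimple(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Unrolled version: eight independent accumulators, index stepping by 8.
    static float dotUnrolled8(float[] a, float[] b) {
        int n = a.length;
        int limit = n - (n % 8); // largest multiple of 8 that is <= n
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f, s4 = 0f, s5 = 0f, s6 = 0f, s7 = 0f;
        int i = 0;
        for (; i < limit; i += 8) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }
        float sum = ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
        for (; i < n; i++) { // remaining 0..7 elements
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = new float[262145];
        float[] b = new float[262145];
        java.util.Arrays.fill(a, 0.5f);
        java.util.Arrays.fill(b, 2.0f);
        System.out.println(dotSimple(a, b) + " vs " + dotUnrolled8(a, b));
    }
}
```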
I have done a little research on this (and am drawing from knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt as I am by no means an expert on this topic.
As for your first question, I think the speedup is coming from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, JVM supports SSE since 1.4, but to actually control whether your code is using vectorization (and to know for sure), you'll need to use JNI.
See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
Just as a side note: It might be a better idea to measure performance in flops rather than seconds, just because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.
I am writing some performance critical Java Code and I am really no Java expert, to say this in advance.
I work with a model in which nearly all information can be calculated from the positions of non-zero entries in a changing array of about 1000 integers (most of them zero). To reduce these calculations I am working on algorithms that update the information in constant time when the array changes, instead of recalculating it. This could lead to a lot of code like
...
info1[x][y][a] = ...
info1[x][x%2+y][b] = ...
if( info3[x][y][c]!=0 )
info2[x][y] = ...
if( some condition involving ~10 array entries) {
/** some expensive algorithm that is hopefully called rarely **/
}
info3[x][y] = ...
...
So I expect maybe 10 such consecutive and mainly independent array writes with minimal calculations, which will make up a very large portion of the lines the program has to run through. Should I expect the number of such simple consecutive operations to be relevant, or does Java have means to execute 20 consecutive simple array writes about as fast as it can execute 10 or 2?
Not entirely clear what your question is, so I'll cover several points:
An array write is only a little slower than an array read. The operations of interest are (a) the array bounds check, (b) the array index calculation (basically multiplying by 4 - trivial), and (c) the actual load/store.
Having multiple (number "N") short array assignment statements in a single straight-line code block is not a problem. At worst it's N times the work of the single statement.
Having "deep" multiple-dimension arrays is not a big deal either -- if you have a 3-dimension array, eg, a store to that looks like 2 array reads and one write.
For all of these scenarios the JITC will "love" simple array-intense code. There is lots of opportunity to "common" array bounds checks, array index calculations, etc, and even mediocre JITCs will do a pretty good job of this.
The one thing you need to watch out for, with very large arrays and long-running code, is the memory footprint, especially if you're multi-threading. Doing "sparse" updates to very large (multi-megabyte) arrays "dirties" a lot of cache lines and virtual memory pages, and, for the multi-threaded case, cache synchronization (if you somehow manage to arrange it) can become a bottleneck. But for only a few thousand array entries this is nowhere near being an issue.
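As a small illustration of the point about multi-dimension stores: a write to info[x][y][k] first reads info[x], then reads the row info[x][y], and only then stores. When several writes hit the same row, that row reference can be fetched once by hand, which is roughly the kind of "commoning" a JIT may do on its own. The names below are hypothetical, and the row is assumed to have at least three entries.

```java
final class RowHoisting {
    // Every statement re-reads info[x] and info[x][y] before storing.
    static void updateNaive(int[][][] info, int x, int y) {
        info[x][y][0] = 1;
        info[x][y][1] = 2;
        info[x][y][2] = 3;
    }

    // The two array reads are done once; each statement is then a single store.
    static void updateHoisted(int[][][] info, int x, int y) {
        int[] row = info[x][y];
        row[0] = 1;
        row[1] = 2;
        row[2] = 3;
    }

    public static void main(String[] args) {
        int[][][] info = new int[4][4][3];
        updateHoisted(info, 1, 2);
        System.out.println(java.util.Arrays.toString(info[1][2]));
    }
}
```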
The actual process of reading and writing 20 elements of an array should not be much of a performance issue, although it depends on how performance-critical the code is.
I would try the implementation you have and see whether it meets your requirements. I believe the array data structure should be fast enough for most performance-optimised tasks that are done in Java.
Source: recently working on re-implementing ArrayList for university work.