I've noticed that
int i=10000000;
boolean isPrime= false;
while(!isPrime){
i++;
System.out.println(i); //this kills performance
isPrime = checkIfPrime(i);
}
}
Printing the current value of a variable kills performance. I want to print it once in a while, but keep the cost of this operation low.
How to compare the cost of printing to screen to computation? Are there any tricks to minimize this cost [Should I print one out of 10 records, or will this cost just as much because of conditional check]?
Why do I need this? Well, I am doing fun stuff with Java (such as "find a counterexample for Euler's conjecture... 27^5 + 84^5 + 110^5 + 133^5 = 144^5 (Lander & Parkin, 1966)"). I want to write a program that is both correct and fast (this counterexample was discovered in the 60s, so I should be able to do it in reasonable time). While debugging I want to have as much info as possible, and I want to find the counterexample ASAP. What is my best way to proceed? Print each case? - Too slow. Let it run overnight? What if I missed some i++?
How to compare the cost of printing to screen to computation?
It is not possible. The cost (i.e. elapsed time) of printing depends on where the "printed" characters go. I can trivially construct an example where the cost tends to infinity.
$ java YourClass | ( sleep 10000000000 )
After a few lines of output, the pipeline buffers will fill, and the print calls in your application will block.
Are there any tricks to minimize this cost [Should I print one out of 10 records, or will this cost just as much because of conditional check]?
There is nothing that won't introduce another overhead; e.g. the overhead of testing whether or not to print.
The only way to entirely eliminate the print overhead is to not print at all while you are trying to measure performance.
What is my best way to proceed? Print each case? - Too slow. Let it run overnight? What if I missed some i++?
First run the program with the print statements to check that you are getting the right answers.
Then remove the print statements and run again to get your performance measures.
However:
Beware of the various traps in writing Java micro-benchmarks.
Trawling through pages and pages of trace prints is not a good way to check for (possible) faults in your program.
Yes, printing is expensive. A processor can do millions of operations in the time it takes to print to the terminal or IDE. Whether you are using Eclipse or a terminal, it is very time-consuming. If you are using a terminal, redirect the output to a file using >> or >, or write it to a file using the nio or io library. Print only when it is unavoidable; if performance is an issue, I feel you should not print at all.
Following is the fastest that you can do to compute the next prime and print as well all those numbers that you tested in the process (provided the next prime does not cause overflow of int):
int i = 10000000;
boolean isPrime = false;
while (!isPrime) {
i++;
// System.out.println(i); //this kills performance
isPrime = checkIfPrime(i);
}
for (int j = 10000001; j <= i; j++) System.out.println(j);
If you need to benchmark your code's performance, you can't have print statements. Print for a few iterations, do your debugging, and remove the print statements once you know that your code is working correctly. Then measure the time your code takes.
Otherwise, if you want to keep print statements in your code, it's up to you to decide how much delay you can accept. For example, a Xeon processor can give you 28-35 Gflops/IOPS (billions of operations per second), which means the processor can do about 35*10^9 increment operations per second (it can do i++ 35*10^9 times/sec). As per this (https://stackoverflow.com/a/20683422/3409405) answer, System.out.println() takes around 1 ms. So a single print costs about as much as 35*10^6 increments, and if you print once for every 35*10^6 i++ your running time will be doubled.
How to compare the cost of printing to screen to computation?
By measuring it: Implement both approaches (print every line, print every x lines) and see which one is faster, and keep tuning x for a reasonable trade off between frequent status updates and throughput.
It is important to note that the cost of printing is strongly affected by what you are printing to. Is the stream buffered or does it flush every number? Does it write to memory, to an SSD, an ordinary hard disk, or some drive attached to a slow USB 1 port? That can change write performance by a factor of 1000, which is why you should be measuring your particular use case.
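A minimal sketch of such a measurement, assuming a trivial summation as a stand-in for the real computation. The names PrintCostSketch, timeLoop and sink are made up for illustration, and JIT warm-up is not handled, so the timings are only indicative:

```java
import java.io.PrintStream;

public class PrintCostSketch {
    // Written after each run so the JIT cannot treat the summation as dead code.
    static volatile long sink;

    // Runs `iterations` steps, printing the counter once every `every` steps.
    // Returns the elapsed time in nanoseconds.
    static long timeLoop(PrintStream out, int iterations, int every) {
        long start = System.nanoTime();
        long checksum = 0;
        for (int i = 1; i <= iterations; i++) {
            checksum += i;              // stand-in for the real work
            if (i % every == 0) {
                out.println(i);
            }
        }
        out.flush();
        sink = checksum;
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int iterations = 100_000;
        // Progress goes to stdout, timings to stderr so they stay readable.
        for (int every : new int[] {1, 10, 1_000}) {
            long nanos = timeLoop(System.out, iterations, every);
            System.err.printf("print every %d: %.2f ms%n", every, nanos / 1e6);
        }
    }
}
```

Running it once with stdout on the terminal and once redirected to a file gives a feel for how much the destination matters.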
One approach to this could be the following:
Perform your task in a thread, which updates a common buffer (string? Instance of an information class?) with stuff you want to output, but don't perform the actual output in this thread. Mind locking that buffer so you can access this information safely from different threads.
Then, let a timer/other thread access this common buffer to print out that information. That way you decouple the calculation from the output. The drawback is that you won't see every output, but while the output is generated, the computation continues.
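A rough sketch of this decoupling, under a few assumptions: the shared "buffer" is reduced to a single AtomicLong (so no explicit lock is needed), the reporter is a ScheduledExecutorService that prints once per second, and checkIfPrime is a simple trial-division stand-in for the method from the question. All class and method names here are hypothetical:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ProgressReporterSketch {
    // Latest candidate examined by the worker; read by the reporter thread.
    static final AtomicLong current = new AtomicLong();

    // Stand-in for the question's checkIfPrime (plain trial division).
    static boolean checkIfPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Searches upward from `start` (exclusive), publishing progress without printing.
    static long nextPrimeAfter(long start) {
        long i = start;
        do {
            i++;
            current.set(i);
        } while (!checkIfPrime(i));
        return i;
    }

    public static void main(String[] args) {
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        // The reporter samples the shared value once per second; the worker never blocks on I/O.
        reporter.scheduleAtFixedRate(
                () -> System.out.println("currently testing " + current.get()),
                1, 1, TimeUnit.SECONDS);
        try {
            System.out.println("next prime: " + nextPrimeAfter(10_000_000L));
        } finally {
            reporter.shutdownNow();
        }
    }
}
```

The worker still pays for one atomic write per iteration, which is small next to a println but not free; whether that is acceptable is something to measure.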
The short answer is: it really depends. Printing text is costly. A hundred "print i" calls are much more expensive than building a string with a StringBuilder and calling "print" once.
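As an illustration of the batching idea, here is a small sketch that collects the values in a StringBuilder and prints once. The class and method names are invented for the example, and how much it saves depends on how the output stream is buffered:

```java
public class BatchedPrintSketch {
    // Builds one string holding every value in [from, to], one per line.
    static String buildReport(int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i <= to; i++) {
            sb.append(i).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One print call instead of a hundred.
        System.out.print(buildReport(1, 100));
    }
}
```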
Related
I'm trying to find the 100th element of the Fibonacci sequence. I initially tried to store the numbers in an int, but it overflowed and switched to negative values, and so did long. Then I came across BigInteger. I got the solution using a simple for loop and an array to store the results and access the previous elements. Now I'm trying to solve the same problem using recursion, but the program does not seem to terminate. Am I missing something here? Or is BigInteger not recommended for use with recursion? Here is the code:
import java.math.BigInteger;
class Test {
public static void main(String[] args) {
BigInteger n = BigInteger.valueOf(100);
System.out.println(fib(n));
}
public static BigInteger fib(BigInteger n) {
if (n.compareTo(BigInteger.valueOf(1)) == 0 || n.compareTo(BigInteger.valueOf(1)) == -1)
return n;
return fib(n.subtract(BigInteger.valueOf(1))).add(fib(n.subtract(BigInteger.valueOf(2))));
}
}
In the comments, you mentioned that your assumption that the program doesn't terminate is based on the fact that it ran for over 5 minutes. That is not how you prove non-termination.
If you observe the program terminating within a certain amount of time, then you can conclude that it does, indeed, terminate. However, if you don't observe it terminating within a certain amount of time, then you can say precisely nothing about whether it terminates or not. It may terminate if you wait a little bit longer, it may terminate if you wait a lot longer, it may even theoretically terminate but take longer than the heat death of the universe.
In your specific case, the algorithm is perfectly correct, and it always terminates. It is simply not a very efficient algorithm: for computing fib(n), fib gets called fib(n) times, because you compute the same numbers over and over and over and over again.
If we assume that you can execute fib once per clock cycle (which is an optimistic assumption since a single call to fib performs one condition, two subtractions, one addition, and two calls to fib in most cases, and a single addition may already take multiple clock cycles depending on the CPU), and we further assume that you have a 100 core CPU and your code is actually executed in parallel, and you have 100 CPUs, and each CPU is clocked at 100 GHz, and you have a cluster of 100 computers, then it will still take you about an hour.
Under some more realistic assumptions, the time it takes your program to finish is more on the order of tens of thousands of years.
Since your code is not parallelized, in order for your code to finish in 5 minutes on a more realistic 4 GHz CPU, it would need to execute fib almost 300 million times per clock cycle.
It often helps to do some very rough guesstimates of the expected performance of your code. As you can see, you don't need to be an expert in Java or JVM or compilers or optimization or computer organization or CPU design or performance engineering. You don't need to know what, exactly, your code gets compiled down to. You don't need to know how many clock cycles an integer ADD takes. Because even when you make some totally over-the-top ridiculous assumptions, you can still easily see that your code cannot possibly finish in minutes or even hours.
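The guesstimate above can be reproduced in a few lines. This is only a sketch: it takes fib(100) as the rough number of calls (as the answer does), computes it with a simple iterative loop, and divides by an assumed call rate. The class name, method names, and the one-call-per-cycle rates are illustrative assumptions:

```java
import java.math.BigInteger;

public class FibEstimateSketch {
    // Iterative Fibonacci: n additions instead of exponentially many calls.
    static BigInteger fib(int n) {
        BigInteger a = BigInteger.ZERO; // fib(0)
        BigInteger b = BigInteger.ONE;  // fib(1)
        for (int i = 0; i < n; i++) {
            BigInteger next = a.add(b);
            a = b;
            b = next;
        }
        return a;
    }

    // Whole seconds needed to make `calls` calls at `callsPerSecond`.
    static BigInteger secondsFor(BigInteger calls, long callsPerSecond) {
        return calls.divide(BigInteger.valueOf(callsPerSecond));
    }

    public static void main(String[] args) {
        BigInteger calls = fib(100); // rough count of recursive calls
        System.out.println("fib(100) = " + calls);

        // Over-the-top assumption: 100 machines x 100 CPUs x 100 cores x 100 GHz, one call per cycle.
        long fantasyRate = 100L * 100 * 100 * 100_000_000_000L;
        System.out.println("fantasy cluster: " + secondsFor(calls, fantasyRate) + " s");

        // Still optimistic: one call per cycle on a single 4 GHz core.
        BigInteger years = secondsFor(calls, 4_000_000_000L)
                .divide(BigInteger.valueOf(31_536_000L)); // seconds in a 365-day year
        System.out.println("single 4 GHz core: " + years + " years");
    }
}
```

With these assumptions the fantasy cluster needs roughly an hour and the single core needs a few thousand years; once each call costs ten or more cycles, that becomes tens of thousands of years.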
I've got a little program that is a fairly pointless exercise in simple number crunching that has thrown me for a loop.
The program spawns a bunch of worker threads that do simple mathematical operations. Recently I changed the inner loop of one variant of worker from:
do
{
int3 = int1 + int2;
int3 = int1 * int2;
int1++;
int2++;
i++;
}
while (i < 128);
to something akin to:
int3 = tempint4[0] + tempint5[0];
int3 = tempint4[0] * tempint5[0];
int3 = tempint4[1] + tempint5[1];
int3 = tempint4[1] * tempint5[1];
int3 = tempint4[2] + tempint5[2];
int3 = tempint4[2] * tempint5[2];
int3 = tempint4[3] + tempint5[3];
int3 = tempint4[3] * tempint5[3];
...
int3 = tempint4[127] + tempint5[127];
int3 = tempint4[127] * tempint5[127];
The arrays are populated by random integers no higher than 1025 in value, and the array values do not change.
The end result was that the program ran much faster, though closer examination seems to indicate that the CPU isn't actually doing anything when running the newer version of the code. It seems that the JVM has figured out that it can safely ignore the code that replaced the inner loop after one iteration of the outer loop since it is only redoing the same calculations on the same set of data over and over again.
To illustrate my point, the old code took maybe ~27000 ms to run and noticeably increased the operating temperature of the CPU (it also showed 100% utilization for all cores). The new code takes maybe 5 ms to run (sometimes less) and causes nary a spike in CPU utilization or temperature. Increasing the number of outer loop iterations does nothing to change the behavior of the new code, even when the number of iterations increases by a hundred times or more.
I have another version of the worker that is identical to the one above except that it has a division operation along with the addition and multiplication operations. In its new unrolled form, the division-enabled version is also much faster than its previous form, but it actually takes a little while (~300 ms on the first run and ~200 ms on subsequent runs, despite warmup, which is a little odd) and produces a profound spike in CPU temperature for its brief run. Increasing the number of outer loop iterations seems to cause the temperature phenomenon to mostly cease after a certain amount of time has passed while running the program, though utilization still shows 100% for all cores. My guess is the JVM is taking much longer to figure out which operations it can safely ignore when handling division operations, and that it is not ignoring all of them.
Short of adding division operations to all my code (which isn't really a fix anyway beyond a certain number of outer loop iterations), is there any way I can get the JVM to stop reducing my code to apparent NOOPs? I've tried several solutions to the problem, such as generating new random values per iteration of the outer loop, going back to simple integer variables with incrementation, and some other nonsense, but none of those solutions have produced desirable results. Either it continues to ignore the series of instructions, or the performance hit from modifications is bad enough that my division-heavy variant actually performs better than the code without division operations.
edit: to provide some context:
i: this variable is an integer that is used for a loop counter in a do/while loop. It is defined in the class file containing the worker code. Its initial value is 0. It is no longer used in the newer version of the worker.
int1/int2: These are integers defined in the class file containing the worker code. Their initial values are both 0. They were used in the old version of the code to provide changing values for each iteration of the internal loop. All I had to do was increment them upward by one per loop iteration, and the JVM would be forced to carry out every operation faithfully. Unfortunately, this loop apparently prevented the use of SIMD. Each time the outer loop iterated, int1 and int2 had their values reset to prevent overflow of int1, int2, or int3 (I have discovered that integer overflow can slow down the code unnecessarily, as can allowing a float to reach Infinity).
tempint4/tempint5: These are references to a pair of integer arrays defined in the main class file for the program (Mathtester. Yes, unimaginative, I know). When the program first starts, there is a short do/while loop that fills each array with random integers ranging from 1-1025. The arrays are 128 integers in size. Each array is static, though the reference variables are not. In truth there is no particular reason for me to use the reference variables. They are leftovers from when I was trying to do an array reference swap so that, after each iteration of the outer loop, tempint4 and tempint5 would refer to the opposite array. It was my hope that the JVM would stop ignoring my code block. For the division-enabled version of the code, this seems to have worked (sort of), since it fundamentally changes the values to be calculated. Swapping tempint4 for tempint5 and vice versa does not change the results of the addition and multiplication operations, so the JVM can still ignore those.
edit: Making tempint4 and tempint5 (since they are only reference variables, I am actually referring to the main arrays, Mathtester.int4 and Mathtester.int5) volatile worked without notably reducing the amount of CPU activity or level of CPU temperature. It did slow down the code a bit, but that is a probable indicator that the JVM was NOOPing more than I knew.
Is there any way I can get the JVM to stop reducing my code to apparent NOOPs?
Yes, by making int3 volatile.
One of the first things when dealing with Java performance that you have to learn by heart is this:
"A single line of Java code means nothing at all in isolation".
Modern JVMs are very complex beasts, and do all kinds of optimization. If you try to measure some small piece of code, the chances are that you will not be measuring what you think you are - it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing.
In this case, yes, it's entirely likely that the JVM is optimizing away the loop. There's no simple way to prevent it from doing this, and almost all techniques are fragile and JVM-version specific (because new & cleverer optimizations are developed & added to the JVM all the time).
So, let me turn the question around: "What are you really trying to achieve here? Why do you want to prevent the JVM from optimizing?"
I've seen that JITC uses unsigned comparison for checking array bounds (the test 0 <= x < LIMIT is equivalent to x ≺ LIMIT where ≺ treats the numbers as unsigned quantities). So I was curious if it works for arbitrary comparisons of the form 0 <= x < LIMIT as well.
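To make the equivalence concrete, here is a small sketch comparing the two forms. It assumes LIMIT is non-negative (otherwise the two tests differ) and uses Integer.compareUnsigned, which needs Java 8 or later; the class and method names are invented for illustration:

```java
public class UnsignedRangeCheckSketch {
    // The usual two-sided test.
    static boolean inRangeSigned(int x, int limit) {
        return 0 <= x && x < limit;
    }

    // Single comparison: a negative x looks like a huge unsigned value,
    // so it fails the test against any non-negative limit.
    static boolean inRangeUnsigned(int x, int limit) {
        return Integer.compareUnsigned(x, limit) < 0;
    }

    public static void main(String[] args) {
        int limit = 100;
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, 99, 100, 101, Integer.MAX_VALUE};
        for (int x : samples) {
            System.out.println(x + ": signed=" + inRangeSigned(x, limit)
                    + " unsigned=" + inRangeUnsigned(x, limit));
        }
    }
}
```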
The results of my benchmark are pretty confusing. I've created three experiments of the form
for (int i=0; i<LENGTH; ++i) {
int x = data[i];
if (condition) result += x;
}
with different conditions
0 <= x called above
x < LIMIT called below
0 <= x && x < LIMIT called inRange
0 <= x & x < LIMIT called inRange2
and prepared the data so that the probabilities of the condition being true are the same.
The results should be fairly similar, just above might be slightly faster as it compares against zero. Even if the JITC couldn't use the unsigned comparison for the test, the results for above and below should still be similar.
Can anyone explain what's going on here? It's quite possible that I did something wrong...
Update
I'm using Java build 1.7.0_51-b13 on Ubuntu 2.6.32-54-generic with i5-2400 CPU @ 3.10GHz, in case anybody cares. As the results for inRange and inRange2 near 0.00 are especially confusing, I re-ran the benchmark with more steps in this area.
The variation in the benchmark results likely has to do with CPU caching at different levels.
Since primitive int(s) are being used, there is no JVM specific caching going on, as will happen with auto-boxed Integer to primitive values.
Thus all that remains, given minimal memory consumption of the data[] array, is CPU-caching of low level values/operations. Since as described the distributions of values are based on random values with statistical 'probabilities' of the conditions being true across the tests, the likely cause is that, depending on the values, more or less (random) caching is going on for each test, leading to more randomness.
Further, depending on the isolation of the computer (background services/processes), the test cases may not be running in complete isolation. Ensure that everything is shut down for these tests except the core OS functions and the JVM. Set the JVM minimum and maximum memory to the same value, and shut down any networking processes, updates, etc.
Are your test results the average of a number of runs, or did you only test each function once?
One thing I have found is that the first time you run a for loop the JVM will interpret it, and then each time it is run the JVM will optimize it. Therefore the first few runs may get horrible performance, but after a few runs it will be near native performance.
I also figured out that a loop will not be optimized while it is running. I have not tested whether this applies to just the loop or to the whole function. If it only applies to the loop, you may get much more performance if you nest it in an inner and outer loop and work with your data one block at a time. If it is the whole function, you will have to place the inner loop in its own function.
Also run the test more than once; if you compare the code you will notice how the JIT optimizes the code in stages.
For most code this gives Java optimal performance. It allows it to skip costly optimization on code that runs rarely and makes code that runs often a lot faster. However, if you have a code block that runs once but for a long time, it will become horribly slow.
Say you need to track the number of times a method is called and print something when it has been called n times. What would be the most efficient:
Use a long variable _counter and increase it each time the method is called. Each call you test for the equality "_counter % n == 0"
Use an int variable _counter and increase it each time the method is called. When _counter = n, print the message and reset the variable _counter to 0.
Some would say the difference is negligible and you are probably right. I am just curious of what method is most commonly used
In this particular case, since you need to have an if-statement ANYWAY, I would say that you should just set it to zero when it reaches the count.
However, for a case where you use the value every time, and just want to "wrap round to zero when we reach a certain value", then the case is less obvious.
If you can adjust n to be a power of 2 (2, 4, 8, 16, 32 ...), then you can use the trick of counter % n is the same as counter & (n-1) - which makes the operation REALLY quick.
If n is not a power of two, then chances are that you end up doing a real divide, which is a bad idea - divide is very expensive, compared to regular instructions, and a compare and reset is highly likely faster than the divide option.
Of course, as others have mentioned, if your counter ever reaches the MAX limit for the type, you could end up with all manner of fun and games.
Edit: And of course, if you are printing something, that probably takes 100 times longer than the divide, so it really is micro-optimization, unless n is quite large.
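The three variants discussed above can be sketched side by side. This is an illustration only: the class and method names are hypothetical, each method returns true on every n-th call, and the mask variant is valid only when n is a power of two:

```java
public class CallCounterSketch {
    private long modCounter;
    private int resetCounter;
    private long maskCounter;

    // Variant 1: ever-growing counter tested with modulo (a real divide for general n).
    public boolean tickModulo(int n) {
        return ++modCounter % n == 0;
    }

    // Variant 2: compare and reset; the counter never exceeds n.
    public boolean tickReset(int n) {
        if (++resetCounter == n) {
            resetCounter = 0;
            return true;
        }
        return false;
    }

    // Variant 3: n must be a power of two, so counter % n equals counter & (n - 1).
    public boolean tickMask(int n) {
        return (++maskCounter & (n - 1)) == 0;
    }

    public static void main(String[] args) {
        CallCounterSketch c = new CallCounterSketch();
        for (int call = 1; call <= 16; call++) {
            boolean mod = c.tickModulo(8);
            boolean reset = c.tickReset(8);
            boolean mask = c.tickMask(8);
            System.out.println(call + ": mod=" + mod + " reset=" + reset + " mask=" + mask);
        }
    }
}
```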
It depends on the value of n... but I bet resetting and a simple equality check is faster.
Additionally resetting the counter is safer, you will never reach the representation limit for your number.
Edit: also consider readability, doing micro optimizations may obscure your code.
Why not do both.
If it becomes a problem then look to see if it is worth optimizing.
But there is no point even looking at it until it is a problem (there will be much bigger problems in your algorithms).
count = (count+1) % countMax;
I believe that it is always better to reset the counter for the following reasons:
The code is clearer to an unfamiliar programmer (for example, the maintenance programmer).
There is less chance of an arithmetic (perhaps bad spelling) overflow when you reset the counter.
Inspection of Guava's RateLimiter will give you some idea of a similar utility implementation http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/RateLimiter.html
Here are performance times for 100000000 iterations, in ms
modTime = 1258
counterTime = 449
po2Time = 108
As we can see, the power-of-2 version outperforms the other methods by far, but it only works for powers of 2. Our plain counter is also about 2.8 times faster than modulus. So why would we want to use modulus increments at all? Well, in my opinion they make for clean code, and if used properly they are a great tool to know.
original post
A simple question about java performance. If I write a loop
for(int i=0;i<n;++i) buffer[(k++)%buffer.length]=something;
in which something is a non-trivial digital filter. With this code I have a modulo operation at every write. This feels a bit silly because the Java VM will check the array bounds anyway. Therefore I would assume that a construct using an ArrayIndexOutOfBoundsException would be faster (the buffer contains 1'000'000 numbers, so we won't have that overflow too often)
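int i = 0; // initialized here so it is definitely assigned inside the catch block
try
{
for(i=0;i<n;++i,++k) buffer[k]=something;
}
catch (ArrayIndexOutOfBoundsException e)
{
k=0; // wrap around and continue where the first loop stopped
for(;i<n;++i,++k) buffer[k]=something;
}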
A third solution could be to calculate in advance at what point we would overflow and then split the loop manually in two. The code to determine how far the loop can go is executed every 768 samples, so from that perspective it might be slower than the catch method.
The problem here, aside from the silly duplication of code, which I will gladly sacrifice on the altar of performance, is that we have more code. And there it often appears that java doesn't optimize as well as with smaller routines.
So my question is: which strategy is the most performant? Does anybody have experience with this type of construct? Also, can anybody shed light on the performance of both constructs on Android devices?
Your answer depends on your target platform. You've added the Android tag, so I'm going to answer in terms of Dalvik and (let's say) a Nexus 4.
First, the ARMv7-A architecture doesn't provide integer division instructions. Your modulus will be computed in software every time through the loop, which is going to slow you down a bit. (This is why it's best to use power-of-2 sizes for hash tables -- you can use a bit mask rather than a mod.)
Second, throwing an exception is expensive. The VM has to create the exception object, and initialize it with a snapshot of the current stack. In addition to the immediate overhead, you're creating X number of objects that have to be cleaned up later, and increasing the possibility that the VM will have to stop you mid-computation and collect garbage.
Third, generally speaking, any computation you can pull out of the inner loop represents a win, so manually testing for array overrun on every loop iteration is unsatisfying. You don't want to add a test for k vs. length to the loop header or body if you can avoid it. (A JIT compiler may do something like this -- if it can tell that the array index never walks off the end of the array, it doesn't have to do a per-element bounds check.)
Based on the (still slightly vague) sense of what you're doing and how many times you're doing it, I'd say the best option is to compute the "break" position ahead of the loop, and iterate the necessary number of times.
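One way the "compute the break position ahead of the loop" idea might look is sketched below. The assumptions are that the buffer is a double[], that the write index k starts inside the buffer, and that a DoubleSupplier stands in for the filter output the question calls something; the class and method names are made up:

```java
import java.util.function.DoubleSupplier;

public class RingBufferFillSketch {
    // Writes n samples into the ring buffer starting at index k
    // (0 <= k < buffer.length) and returns the next write index.
    static int write(double[] buffer, int k, int n, DoubleSupplier source) {
        int remaining = n;
        while (remaining > 0) {
            // Longest run that fits before the end of the buffer.
            int run = Math.min(remaining, buffer.length - k);
            // Inner loop has no modulo and no wrap test.
            for (int end = k + run; k < end; k++) {
                buffer[k] = source.getAsDouble();
            }
            if (k == buffer.length) {
                k = 0; // wrap once per run, not once per sample
            }
            remaining -= run;
        }
        return k;
    }

    public static void main(String[] args) {
        double[] buffer = new double[5];
        int next = write(buffer, 3, 4, () -> 1.0);
        System.out.println("next write index: " + next);
        System.out.println(java.util.Arrays.toString(buffer));
    }
}
```

The outer loop runs at most a couple of times when n is smaller than the buffer, so the wrap bookkeeping stays out of the hot path.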
I'm curious to know how this turns out in practice. :-)