Factorial method - recursive or iterative? (Java)

I was making my way through project Euler, and I came across a combination problem. Combination logic means working out factorials. So, I decided to create a factorial method. And then I hit upon a problem - since I could quite easily use both iteration and recursion to do this, which one should I go for? I quickly wrote 2 methods - iterative:
public static long factorial(int num) {
long result = 1;
if(num == 0) {
return 1;
}
else {
for(int i = 2; i <= num; i++) {
result *= i;
}
return result;
}
}
and recursive:
public static long factorial(int num) {
if(num == 0) {
return 1;
}
else {
return num * factorial(num - 1);
}
}
If I am (obviously) talking about speed and functionality here, which one should I use? And, in general, is one of the techniques generally better than the other (so if I come across this choice later, what should I go for)?

Both are hopelessly naive. No serious application of factorial would use either one. I think both are inefficient for large n, and neither int nor long will suffice when the argument is large.
A better way would be to use a good gamma function implementation and memoization.
Here's an implementation from Robert Sedgewick.
Large values will require logarithms.
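To make the range limits concrete: 20! is the largest factorial that fits in a long, so a small precomputed table can answer every argument for which a long result is exact, and BigInteger can cover the rest. The following is only a minimal sketch of that memoization idea; the names Factorials, exact and big are hypothetical, it is not the Sedgewick implementation mentioned above, and it does not use a gamma function or logarithms.

```java
import java.math.BigInteger;

// Hypothetical helper class, for illustration only.
final class Factorials {
    // 20! is the largest factorial that fits in a signed 64-bit long.
    private static final long[] TABLE = new long[21];
    static {
        TABLE[0] = 1L;
        for (int i = 1; i < TABLE.length; i++) {
            TABLE[i] = TABLE[i - 1] * i;
        }
    }

    /** Exact factorial for 0 <= n <= 20; rejects arguments whose result would overflow. */
    static long exact(int n) {
        if (n < 0 || n >= TABLE.length) {
            throw new ArithmeticException("n! does not fit in a long for n = " + n);
        }
        return TABLE[n];
    }

    /** Arbitrary-precision factorial for larger arguments. */
    static BigInteger big(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be >= 0");
        }
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(exact(20));
        System.out.println(big(25));
    }
}
```

With the table in place, the recursive-versus-iterative question disappears for the long case, since every valid call is a single array read.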

Whenever you get the option to choose between recursion and iteration, always go for iteration, because:
1. Recursion involves creating and destroying stack frames, which has high costs.
2. Your stack can overflow if you are using significantly large values.
So go for recursion only if you have some really compelling reason.

I was actually analyzing this problem by time factor.
I've done 2 simple implementations:
Iterative:
private static BigInteger bigIterativeFactorial(int x) {
BigInteger result = BigInteger.ONE;
for (int i = x; i > 0; i--)
result = result.multiply(BigInteger.valueOf(i));
return result;
}
And Recursive:
public static BigInteger bigRecursiveFactorial(int x) {
if (x == 0)
return BigInteger.ONE;
else
return bigRecursiveFactorial(x - 1).multiply(BigInteger.valueOf(x));
}
Both tests ran on a single thread.
It turns out that the iterative version is slightly faster only for small arguments. When I made n bigger than 100, the recursive solution was faster.
My conclusion? You can never say that an iterative solution is faster than a recursive one on the JVM. (Still talking only about time.)
If you're interested, the whole way I reached this conclusion is HERE
If you're interested in a deeper understanding of the difference between these 2 approaches, I found a really nice description on knowledge-cess.com

There's no "this is better, that is worse" for this question. Because modern computers are so strong, in Java it tends to be a personal preference as to which you use. You are doing many more checks and computations in the iterative version, however you are piling more methods onto the stack in the recursive version. Pros and cons to each, so you have to take it case by case.
Personally, I stick with iterative algorithms to avoid the logic of recursion.

Related

High cost of polymorphism in Java Hotspot server

When I run my timing test program in Java Hotspot client, I get consistent behavior.
However, when I run it in Hotspot server, I get unexpected result.
Essentially, the cost of polymorphism is unacceptably high in certain situations that I've tried to reproduce below.
Is this a known issue/bug with Hotspot server, or am I doing something wrong?
The test program and timings are given below:
Intel i7, Windows 8
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
Mine2: 0.387028831 <--- polymorphic call with expected timing
Trivial: 1.545411765 <--- some more polymorphic calls
Mine: 0.727726371 <--- polymorphic call with unexpected timing. Should be about 0.38
Mine: 0.383132698 <--- direct call with expected timing
The situation gets worse as I add additional tests.
Timing of the tests near the end of the list are completely off.
interface canDoIsSquare {
boolean isSquare(long x);
}
final class Trivial implements canDoIsSquare {
@Override final public boolean isSquare(long x) {
if (x > 0) {
long t = (long) Math.sqrt(x);
return t * t == x;
}
return x == 0;
}
@Override public String toString() {return "Trivial";}
}
final class Mine implements canDoIsSquare {
@Override final public boolean isSquare(long x) {
if (x > 0) {
while ((x & 3) == 0)
x >>= 2;
if ((x & 2) != 0 || (x & 7) == 5)
return false;
final long t = (long) Math.sqrt(x);
return (t * t == x);
}
return x == 0;
}
@Override public String toString() {return "Mine";}
}
final class Mine2 implements canDoIsSquare {
@Override final public boolean isSquare(long x) {
// just duplicated code for this test
if (x > 0) {
while ((x & 3) == 0)
x >>= 2;
if ((x & 2) != 0 || (x & 7) == 5)
return false;
final long t = (long) Math.sqrt(x);
return (t * t == x);
}
return x == 0;
}
@Override final public String toString() {return "Mine2";}
}
public class IsSquared {
static final long init = (long) (Integer.MAX_VALUE / 8)
* (Integer.MAX_VALUE / 2) + 1L;
static long test1(final canDoIsSquare fun) {
long r = init;
long startTimeNano = System.nanoTime();
while (!fun.isSquare(r))
++r;
long taskTimeNano = System.nanoTime() - startTimeNano;
System.out.println(fun + ": " + taskTimeNano / 1e9);
return r;
}
static public void main(String[] args) {
Mine mine = new Mine();
Trivial trivial = new Trivial();
Mine2 mine2 = new Mine2();
test1(mine2);
test1(trivial);
test1(mine);
long r = init;
long startTimeNano = System.nanoTime();
while (!mine.isSquare(r))
++r;
long taskTimeNano = System.nanoTime() - startTimeNano;
System.out.println(mine + ": " + taskTimeNano / 1e9);
System.out.println(r);
}
}
The cost is high, indeed, but your benchmark doesn't measure anything really relevant. The JIT can optimize away most of the overhead, but you didn't give it any chance. See e.g. here.
In any case, there's no benchmark warmup and there's On Stack Replacement.
The explanation is probably that the Server Hotspot optimizes better but slower. It assumes that it has enough time and collects the necessary stats longer. So while the Client Hotspot optimized your program, the Server Hotspot was preparing itself to produce better code.
The reason for the worsening with additional tests is that the initially monomorphic call site became bimorphic and then megamorphic.
In reality it's possible that only one of the methods gets called. If you want to benchmark this, you have to run each test in its own JVM. This is a real pain, but existing benchmarking frameworks do it for you.
Or you may want to measure the polymorphic case, but then you need to warm up the code with all cases first. This way you can find out which method is faster even in a single JVM (though each will be slowed down by the megamorphic call overhead).
Update
The explanation seems to be the change from monomorphic to megamorphic. When the first test was run, the JVM knew all the classes (as the instances were already created), but was optimistically assuming that only Mine2 occurs at the call site. So it did a quick check (translated as a conditional branch, which was always correctly predicted and thus very fast), and called the proper method. As it later saw the other two instances being used there, it had to create a branch table for them (the branch prediction still works, but the overhead is higher).
Question
What's unclear: The JVM can move this test out of the loop and thus reduce its cost to nearly nothing. I can't tell why it doesn't happen.
In short, the JIT can optimise a single method call, and two method calls, in ways it cannot with more multi-polymorphic calls. The number of possible methods which might be called on any given line is what matters, and the JIT builds up this picture over time. When a method is inlined, further optimisations are possible, but in your case the line in question increases the number of possible method calls from test1 over the life of the run, and so it gets slower.
The way I get around this is to duplicate the short test code so each class is tested equally (assuming this is realistic). If your program will be multi-polymorphic when it is running, this is what you should test to be realistic, as you can see it can change the results.
When you run the method from a fresh loop you see the benefit of only calling one method from that line of code.
Here is a table of different costs you might see depending on the number of possible methods any individual line can call. http://vanillajava.blogspot.co.uk/2012/12/performance-of-inlined-virtual-method.html
Polymorphism is not designed to improve performance and for me it is entirely reasonable that as the complexity of the polymorphism increases it should be slower.
BTW making methods final doesn't improve the performance any more. The JIT works out if you have called a sub-class on a line by line basis (as discussed)
EDIT As you can see, the client JVM doesn't optimise the code as much, as it is designed for relatively lightweight startup times. This means the client JVM is more consistent, but consistently slower. If you want the best performance you need to consider a number of optimisation strategies, which leads to multiple possible outcomes depending on whether the optimisation is applied or not.
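The warm-up advice from the answers above can be sketched as follows: every implementation is pushed through the shared call site before any timing starts, so all of them are measured under the same (megamorphic) conditions instead of the first one benefiting from a monomorphic call site. This is only a minimal sketch; SquareCheck, SqrtCheck, FilteredCheck and WarmedBenchmark are hypothetical names, the round count of 20 is an arbitrary assumption, and a benchmarking framework that forks a JVM per test is likely to give more trustworthy numbers.

```java
// Hypothetical stand-ins for the question's interface and classes.
interface SquareCheck {
    boolean isSquare(long x);
}

final class SqrtCheck implements SquareCheck {
    @Override public boolean isSquare(long x) {
        if (x > 0) {
            long t = (long) Math.sqrt(x);
            return t * t == x;
        }
        return x == 0;
    }
}

final class FilteredCheck implements SquareCheck {
    @Override public boolean isSquare(long x) {
        if (x > 0) {
            while ((x & 3) == 0) x >>= 2;                    // strip factors of 4
            if ((x & 2) != 0 || (x & 7) == 5) return false;  // cheap bit filter
            long t = (long) Math.sqrt(x);
            return t * t == x;
        }
        return x == 0;
    }
}

final class WarmedBenchmark {
    /** Finds the first perfect square at or above start, using the given check. */
    static long firstSquareFrom(SquareCheck check, long start) {
        long r = start;
        while (!check.isSquare(r)) ++r;
        return r;
    }

    public static void main(String[] args) {
        SquareCheck[] checks = { new SqrtCheck(), new FilteredCheck() };
        long start = 1_000_000L * 1_000_000L + 1;

        // Warm-up: run every implementation through the same call site first,
        // so the JIT has seen all receiver types before anything is timed.
        for (int round = 0; round < 20; round++) {
            for (SquareCheck c : checks) firstSquareFrom(c, start);
        }

        // Timed runs, all under the same call-site profile.
        for (SquareCheck c : checks) {
            long t0 = System.nanoTime();
            long r = firstSquareFrom(c, start);
            long t1 = System.nanoTime();
            System.out.println(c.getClass().getSimpleName() + ": "
                    + (t1 - t0) / 1e9 + " s, result " + r);
        }
    }
}
```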

multithreading to calculate primes java

I am trying to parallelize a prime number counter as an exercise.
I have refactored the original code and separated the long loops from others so that I can parallelize them.
Now I have the following code and multithreading is looking difficult as I need to keep track of the primes found (in order) and count the number of primes found.
nthPrime(long n) takes the number of primes to search for and returns the nth prime.
count is an ArrayList
public static long nthPrime(long n) {
count.add((long) 1);
if (n < 2) {
count.add((long) 3);
return getCount();
}
count.add((long) 3);
if (n == 2) {
return getCount();
}
step = 4;
candidate = 5;
checker(n, step, candidate);
return getCount();
}
private static long checker(long n, int step, long candidate) {
while (count.size() < n) {
if (Checker.isPrime(candidate)) {
// checks the number for possible prime
count.add(candidate);
}
step = 6 - step;
candidate += step;
}
return getCount();
}
Any ideas on using util.concurrent or threading to parallelize this?
Thanks
Two pieces of advice:
Before you start, you need to turn your existing non-parallel code into something that 1) works and 2) is readable. Attempting to parallelize it in its current form is going to lead to failure.
I can see bugs (I think)
I can't see evidence that you understand the basic sieving algorithm ... which is what it appears you are trying to implement here.
Anyone can take an algorithm split it into bits and use multiple Java threads to execute the bits. The difficulty is coming up with a scheme which works, and where your efforts actually give a worthwhile speedup. That requires:
a good understanding of the problem and the applicable algorithms,
identifying the part of the problem / algorithm that are amenable to parallelization,
understanding of the overheads and potential bottlenecks in Java multithreading, and
understanding of the correctness issues; e.g. where and how to synchronize.
I can't give you a lesson on these things in the space of a StackOverflow Answer. You need a text book.
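As one illustration of "identifying the part of the problem that is amenable to parallelization": testing separate candidates for primality shares no state, so counting or collecting primes in a fixed range splits cleanly, whereas "stop at the nth prime" depends on a running count and does not. The sketch below uses parallel streams rather than explicit threads. ParallelPrimes and its methods are hypothetical names, and the trial-division isPrime is only a stand-in for the Checker.isPrime that the question does not show.

```java
import java.util.stream.LongStream;

// Hypothetical class, for illustration only.
final class ParallelPrimes {
    /** Trial division; a stand-in for the Checker.isPrime not shown in the question. */
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n < 4) return true;                       // 2 and 3
        if (n % 2 == 0 || n % 3 == 0) return false;
        // Remaining candidates have the form 6k - 1 or 6k + 1.
        for (long d = 5; d * d <= n; d += 6) {
            if (n % d == 0 || n % (d + 2) == 0) return false;
        }
        return true;
    }

    /** Counts the primes in [2, limit]; each candidate is tested independently. */
    static long countPrimesUpTo(long limit) {
        return LongStream.rangeClosed(2, limit)
                .parallel()
                .filter(ParallelPrimes::isPrime)
                .count();
    }

    /** Returns the primes in [2, limit]; the ordered stream keeps them ascending. */
    static long[] primesUpTo(long limit) {
        return LongStream.rangeClosed(2, limit)
                .parallel()
                .filter(ParallelPrimes::isPrime)
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(countPrimesUpTo(1_000_000));
    }
}
```

To get the nth prime with this approach you would pick a limit known to contain at least n primes and index into the array. Whether that is actually faster than the sequential loop would have to be measured.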

does splitting logic into multiple methods in java slows down execution? if yes then shall we avoid re-factoring of code

http://s24.postimg.org/9y073weid/refactor_vs_non_refactor.png
Well here is the result for the execution time in nanoseconds for re-factored and non re-factored code for a simple addition operation. 1 to 5 are the consecutive runs of the code.
My intent was just to find out whether splitting up logic into multiple methods makes execution slow or not and here is the result which shows that yes there is considerable time that goes into just putting the methods on stack.
I invite people who have done some research on it before or want to investigate on this area to correct me if I am doing something wrong and draw some conclusive results out of this.
In my opinion yes code re-factoring does help in making code more structured and understandable but in time critical systems like real time game engines I would prefer not to re-factor.
Following was the simple code which I used:
package com.sim;
public class NonThreadedMethodCallBenchMark{
public static void main(String args[]){
NonThreadedMethodCallBenchMark testObject = new NonThreadedMethodCallBenchMark();
System.out.println("************Starting***************");
long startTime =System.nanoTime();
for(int i=0;i<900000;i++){
//testObject.method(1, 2); // Uncomment this line and comment the line below to test refactor time
//testObject.method5(1,2); // uncomment this line and comment the above line to test non refactor time
}
long endTime =System.nanoTime();
System.out.println("Total :" +(endTime-startTime)+" nanoseconds");
}
public int method(int a , int b){
return method1(a,b);
}
public int method1(int a, int b){
return method2(a,b);
}
public int method2(int a, int b){
return method3(a,b);
}
public int method3(int a, int b){
return method4(a,b);
}
public int method4(int a, int b){
return method5(a,b);
}
public int method5(int a, int b){
return a+b;
}
public void run() {
int x=method(1,2);
}
}
You should consider that code which doesn't do anything useful can be optimised away. If you are not careful you can be timing how long it takes to detect the code doesn't do anything useful, rather than run the code. If you use multiple methods, it can take longer to detect useless code and thus give different results. I would always look at the steady state performance after the code has warmed up.
For the most expensive parts of your code, small methods will be inlined so they won't make any difference to performance cost. What can happen is
smaller methods can be optimised better as complex methods can defeat optimisation tricks.
smaller methods can be eliminated as they are inlined.
If you don't ever warm up the code, it is likely to be slower. However, if code is called rarely, it is unlikely to matter (except in low latency system, in which case I suggest you warm up your code on startup)
If you run
System.out.println("************Starting***************");
for (int j = 0; j < 10; j++) {
long startTime = System.nanoTime();
for (int i = 0; i < 1000000; i++) {
testObject.method(1, 2);
//testObject.method5(1,2); // uncomment this line and comment the above line to test non refactor time
}
long endTime = System.nanoTime();
System.out.println("Total :" + (endTime - startTime) + " nanoseconds");
}
prints
************Starting***************
Total :8644835 nanoseconds
Total :3363047 nanoseconds
Total :52 nanoseconds
Total :30 nanoseconds
Total :30 nanoseconds
Note: the 30 nano-seconds is the time it takes to perform a System.nanoTime() call. The inner loop and the methods calls have been eliminated.
Yes, extra method calls cost extra time (except in the circumstance where the compiler/jitter actually does inlining, which I think some of them will do sometimes but under circumstances that you will find hard to control).
You shouldn't worry about this except for the most expensive part of your code, because you won't be able to see the difference in most cases. In those cases where performance doesn't matter, you should refactor to maximize code clarity.
I suggest you refactor at will, and occasionally measure performance using a profiler to find the expensive parts. Only when the profiler shows that some specific code is using a significant fraction of the time should you worry about function calls (or other speed vs. clarity tradeoffs) in that specific code. You will discover that the performance bottlenecks are often in places that you wouldn't have guessed.
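The dead-code problem described in the first answer can be reduced by making the loop produce a value that is actually used, and by repeating the measurement so that later rounds show steady-state behaviour. This is a minimal sketch under those assumptions. CallChainBenchmark and its methods are hypothetical names, and the JIT may still simplify such a trivial loop, so treat the timings as indicative only.

```java
// Hypothetical benchmark class, for illustration only.
final class CallChainBenchmark {
    static int direct(int a, int b) { return a + b; }

    // The same addition reached through a chain of small delegating methods.
    static int chained(int a, int b) { return hop1(a, b); }
    private static int hop1(int a, int b) { return hop2(a, b); }
    private static int hop2(int a, int b) { return direct(a, b); }

    /** Sums the results so every call contributes to a value that is returned. */
    static long runDirect(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) sum += direct(i, 1);
        return sum;
    }

    static long runChained(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) sum += chained(i, 1);
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0; // printed at the end so the work cannot be discarded
        for (int round = 0; round < 10; round++) {
            long t0 = System.nanoTime();
            sink += runDirect(1_000_000);
            long t1 = System.nanoTime();
            sink += runChained(1_000_000);
            long t2 = System.nanoTime();
            System.out.println("direct " + (t1 - t0) + " ns, chained " + (t2 - t1) + " ns");
        }
        System.out.println("sink = " + sink);
    }
}
```

The early rounds include interpretation and compilation time. Only the later rounds say anything about how the two variants compare once the code is warm.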

Java atomic classes in compound operations

Will the following code cause race condition issue if several threads invoke the "incrementCount" method?
public class sample {
private AtomicInteger counter = new AtomicInteger(0);
public int getCurrentCount() {
int current = counter.getAndIncrement();
if (counter.compareAndSet(8, 0)) current = 0;
return current;
}
}
If it causes race condition, what are the possible solution other than using synchronized keyword?
You probably don't want to let the counter exceed 8 and this won't work. There are race conditions.
It looks like you want a mod 8 counter. The easiest way is to leave the AtomicInteger alone and use something like
int current = counter.getAndIncrement() & 7;
(which is a fixed and optimized version of % 8). For computations mod 8 or any other power of two it works perfectly; for other numbers you'd need % N and would run into problems when the int overflows to negative numbers.
The direct solution goes as follows
public int getCurrentCount() {
while (true) {
int current = counter.get();
int next = (current+1) % 8;
if (counter.compareAndSet(current, next)) return next;
}
}
This is about how getAndIncrement() itself works, just slightly modified.
Yes, it probably does not do what you want (there is a kind of race condition).
One thread may call getAndIncrement() and receive a 8
A second thread may call getAndIncrement() and receive a 9
The first thread tries compareAndSet but the value is not 8
The second thread tries compareAndSet but the value is not 8
If there's no risk of overflowing, you could do something like
return counter.getAndIncrement() % 8;
Relying on something not overflowing seems like a poor idea to me though, and I would probably do roughly what you do, but make the method synchronized.
Related question: Modular increment with Java's Atomic classes
What are you trying to achieve? Even if you use the fixes proposed by ajoobe or maartinus you can end up with different threads getting the same answer - consider 20 threads running simultaneously. I don't see any interesting significance of this "counter" as you present it here - you may as well just pick a random number between 0 and 8.
Based on the code for getAndIncrement()
public int getCurrentCount() {
for(;;) {
int current = counter.get();
int next = current + 1;
if (next >= 8) next = 0;
if (counter.compareAndSet(current, next))
return current;
}
}
However a simpler implementation in your case is to do
public int getCurrentCount() {
return counter.getAndIncrement() & 0x7;
}
I assume that what you want is a counter from 0 to 7.
If that is the case, then a race condition can happen and the value of counter can become 9.
Unless you are OK with the % solution suggested by others, you might have to use synchronized.
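On Java 8 and later, the compare-and-set loop shown in the answers above can also be written with AtomicInteger.getAndUpdate, which retries the update function internally. Because the stored value always stays in the range 0 to modulus - 1, there is no overflow to worry about and the modulus need not be a power of two. A minimal sketch, with ModCounter and next as hypothetical names:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class, for illustration only.
final class ModCounter {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final int modulus;

    ModCounter(int modulus) {
        if (modulus <= 0) {
            throw new IllegalArgumentException("modulus must be positive");
        }
        this.modulus = modulus;
    }

    /** Returns the current value and atomically advances it, wrapping at the modulus. */
    int next() {
        // getAndUpdate re-applies the function if another thread changed the value
        // in between, so the read-modify-write behaves as one atomic step.
        return counter.getAndUpdate(v -> (v + 1) % modulus);
    }

    public static void main(String[] args) {
        ModCounter c = new ModCounter(8);
        for (int i = 0; i < 10; i++) {
            System.out.print(c.next() + " ");
        }
        System.out.println();
    }
}
```

Each call receives the next value in the cycle 0, 1, ..., 7, 0, ... exactly once. Two threads can of course still receive the same number if their calls are a full cycle apart, which is inherent in any wrapping counter.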

can array access be optimized?

Maybe I'm being misled by my profiler (Netbeans), but I'm seeing some odd behavior, hoping maybe someone here can help me understand it.
I am working on an application, which makes heavy use of rather large hash tables (keys are longs, values are objects). The performance with the built in java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.
The code is very basic using a double hashing strategy. This works fine and good and shows the best performance of all the other options I've tried thus far.
The catch is, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times, and/or do a lot more logic.
What really confuses me is the lookups are called only by one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result of the lookup, but is about 100x faster.
Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes, according to profiling, are virtually free). I don't understand how this bit of code can be so slow since it is just array access, and I don't see any way of making it faster.
Note that the code simply returns the bucket matching the key, the caller is expected to process the bucket. 'size' is the hash.length/2, hash1 does lookups in the first half of the hash table, hash2 does lookups in the second half. key_index is a final int field on the hash table passed into the constructor, and the values array on the Entry objects is a small array of longs usually of length 10 or less.
Any thoughts people have on this are much appreciated.
Thanks.
public final Entry get(final long theKey) {
Entry aEntry = hash[hash1(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
aEntry = hash[hash2(theKey, size)];
if (aEntry != null && aEntry.values[key_index] != theKey) {
return null;
}
}
return aEntry;
}
Edit, the code for hash1 & hash2
private static int hash1(final long key, final int hashTableSize) {
return (int)(key&(hashTableSize-1));
}
private static int hash2(final long key, final int hashTableSize) {
return (int)(hashTableSize+((key^(key>>3))&(hashTableSize-1)));
}
Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.
The only thing that I would expect might make some difference is to move the key out of the values array of Entry.
Instead of having this:
class Entry {
long[] values;
}
//...
if ( entry.values[key_index] == key ) { //...
Try this:
class Entry {
long key;
long values[];
}
//...
if ( entry.key == key ) { //...
Instead of incurring the cost of accessing a member, plus doing bounds checking, then getting a value of the array, you should just incur the cost of accessing the member.
Is there a random-access data type faster than an array?
I was interested in the answer to this question, so I set up a test environment. This is my Array interface:
interface Array {
long get(int i);
void set(int i, long v);
}
This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:
class NormalArray implements Array {
private long[] data;
public NormalArray(int size) {
data = new long[size];
}
@Override
public long get(int i) {
return data[i];
}
@Override
public void set(int i, long v) {
data[i] = v;
}
}
And then a control:
class NoOpArray implements Array {
@Override
public long get(int i) {
return 0;
}
@Override
public void set(int i, long v) {
}
}
Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:
class TenArray implements Array {
private long v0;
private long v1;
private long v2;
private long v3;
private long v4;
private long v5;
private long v6;
private long v7;
private long v8;
private long v9;
private long[] extras;
public TenArray(int size) {
if (size > 10) {
extras = new long[size - 10];
}
}
@Override
public long get(final int i) {
switch (i) {
case 0:
return v0;
case 1:
return v1;
case 2:
return v2;
case 3:
return v3;
case 4:
return v4;
case 5:
return v5;
case 6:
return v6;
case 7:
return v7;
case 8:
return v8;
case 9:
return v9;
default:
return extras[i - 10];
}
}
@Override
public void set(final int i, final long v) {
switch (i) {
case 0:
v0 = v; break;
case 1:
v1 = v; break;
case 2:
v2 = v; break;
case 3:
v3 = v; break;
case 4:
v4 = v; break;
case 5:
v5 = v; break;
case 6:
v6 = v; break;
case 7:
v7 = v; break;
case 8:
v8 = v; break;
case 9:
v9 = v; break;
default:
extras[i - 10] = v;
}
}
}
I tested it with this harness:
import java.util.Random;
public class ArrayOptimization {
public static void main(String[] args) {
int size = 10;
long[] data = new long[size];
Random r = new Random();
for ( int i = 0; i < data.length; i++ ) {
data[i] = r.nextLong();
}
Array[] a = new Array[] {
new NoOpArray(),
new NormalArray(size),
new TenArray(size)
};
for (;;) {
for ( int i = 0; i < a.length; i++ ) {
testSet(a[i], data, 10000000);
testGet(a[i], data, 10000000);
}
}
}
private static void testGet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
data[j] = a.get(j);
}
}
long stop = System.nanoTime();
System.out.printf("%s/get took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
private static void testSet(Array a, long[] data, int iterations) {
long nanos = System.nanoTime();
for ( int i = 0; i < iterations; i++ ) {
for ( int j = 0; j < data.length; j++ ) {
a.set(j, data[j]);
}
}
long stop = System.nanoTime();
System.out.printf("%s/set took %fms%n", a.getClass().getName(),
(stop - nanos) / 1000000.0);
}
}
The results were somewhat surprising. The TenArray performs non-trivially faster than a NormalArray does (for sizes <= 10). Subtracting the overhead (using the NoOpArray average) you get TenArray as taking ~65% of the time of the normal array. So if you know the likely max size of your array, I suppose it is possible to exceed the speed of an array. I would imagine switch uses either less bounds checking or more efficient bounds checking than does an array.
NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms
Now whether you can in practice get speeds faster than an array I'm not sure; obviously this way you incur any overhead associated with the interface/class/methods.
Most likely you are partially misled in your interpretation of the profiler's results. Profilers notoriously overinflate the performance impact of small, frequently called methods. In your case, the profiling overhead for the get() method is probably larger than the actual processing spent in the method itself. The situation is worsened further, since the instrumentation also interferes with the JIT's capability to inline methods.
As a rule of thumb for this situation: if the total processing time for a piece of work of known length increases more than two- to threefold when running under the profiler, the profiling overhead will give you skewed results.
To verify your changes actually do have impact, always measure performance improvements without the profiler, too. The profiler can hint you about bottlenecks, but it can also deceive you to look at places where nothing is wrong.
Array bounds checking can have a surprisingly large impact on performance (if you do comparably little else), but it can also be hard to clearly separate from general memory access penalties. In some trivial cases, the JIT might be able to eliminate them (there have been efforts towards bounds check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for(x=0; x<array.length; x++).
Under some circumstances you may be able to replace array access by simple member access, completely avoiding the bounds checks, but it's limited to the rare cases where you access your array exclusively by constant indices. I see no way to apply it to your problem.
The change suggested by Mark Peters is most likely not solely faster because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache friendly way.
Many profilers tell you very confusing things, partly because of how they work, and partly because people have funny ideas about performance to begin with.
For example, you're wondering about how many times functions are called, and you're looking at code and thinking it looks like a lot of logic, therefore slow.
There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.
First of all, think in terms of the percent of time a routine or statement is active, rather than the number of times it is called or the average length of time it takes. The reason for that is it is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply the number of calls by the average execution time and divide by the total time just to see if it is big enough to even care about. Also, percent tells you, bottom line, how much fixing it could potentially reduce the overall execution time.
Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)
Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.
There are profilers that can do this. I'm not sure about Java.
If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? and only things that have a large enough percent to be worth the trouble, like 10% or more? Such a line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them. If 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percent is off a little bit, as long as you find it? That's another one of those happy myths of profilers - that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will.
So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.
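The "percent of stack samples containing a line" idea can be approximated inside a Java process with Thread.getStackTrace(), which is standard API. The sketch below is not a real profiler and is not what the answer's author uses. StackSampler, sample and busyWork are hypothetical names, and the sample count and interval are arbitrary assumptions. Taking a few thread dumps by hand, for example with the JDK's jstack tool, follows the same idea.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-process stack sampler, for illustration only.
final class StackSampler {
    /**
     * Takes `samples` snapshots of the target thread's stack, `intervalMillis` apart,
     * and returns for each "Class.method:line" frame the number of snapshots it was on.
     */
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis)
            throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            Set<String> seen = new HashSet<>(); // count a frame at most once per snapshot
            for (StackTraceElement f : target.getStackTrace()) {
                seen.add(f.getClassName() + "." + f.getMethodName() + ":" + f.getLineNumber());
            }
            for (String frame : seen) {
                hits.merge(frame, 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);
        }
        return hits;
    }

    /** Placeholder workload that keeps a thread busy until it is interrupted. */
    static void busyWork() {
        double x = 0;
        while (!Thread.currentThread().isInterrupted()) {
            x += Math.sqrt(x + 1);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(StackSampler::busyWork, "worker");
        worker.setDaemon(true);
        worker.start();

        int samples = 20;
        Map<String, Integer> hits = sample(worker, samples, 10);
        worker.interrupt();

        // A frame present in 2 of 20 snapshots is on the stack roughly 10% of the time.
        hits.forEach((frame, n) -> System.out.println((100 * n / samples) + "%  " + frame));
    }
}
```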
Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so now their percentage is 5/4 of what it was, because the denominator got smaller. Percentage-wise they got bigger, and easier to find. This effect snowballs, allowing you to really squeeze the code.
You could try using a memoizing or caching strategy to reduce the number of actual calls. Another thing you could try if you're very desperate is a native array, since indexing those is unbelievably fast, and JNI shouldn't incur too much overhead if you're using parameters like longs that don't require marshalling.
