I just came across this article: Compute the minimum or maximum of two integers without branching
It starts with "[o]n some rare machines where branching is expensive...".
I used to think that branching is always expensive as it often forces the processor to clear and restart its execution pipeline (e.g. see Why is it faster to process a sorted array than an unsorted array?).
This leaves me with a couple of questions:
Did the writer of the article get that part wrong? Or was this article maybe written in a time before branching was an issue (I can't find a date on it).
Do modern processors have a way to complete minimal branches like the one in (x < y) ? x : y without performance degradation?
Or do all modern compilers simply implement this hack automatically? Specifically, what does Java do? Especially since its Math.min(...) function is just that ternary statement...
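(For context, one common variant of the branch-free trick the article describes looks roughly like this in Java. This is only a sketch: the subtraction-based version can overflow for extreme inputs, so it is not a drop-in replacement for Math.min.)

// Branch-free min (sketch). Relies on (x - y) not overflowing,
// so it is NOT safe for arbitrary int values.
static int branchlessMin(int x, int y) {
    int d = x - y;
    return y + (d & (d >> 31)); // d >> 31 is -1 (all bits set) if x < y, else 0
}

// The straightforward version the question refers to.
static int ternaryMin(int x, int y) {
    return (x < y) ? x : y;
}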
Did the writer of the article get that part wrong? Or was this article maybe written in a time before branching was an issue (I can't find a date on it).
The oldest comment is 5 years old, so it's no hot news. However, unpredictable branching has always been expensive, and it was 5 years ago too. If anything, it has gotten worse since then: modern CPUs can do much more per cycle, so a mispredicted branch costs more wasted work.
But in a sense, the writer is right. The majority of CPUs are not found in our PCs and servers, but in embedded devices, where the situation differs.
Do modern processors have a way to complete minimal branches like the one in (x < y) ? x : y without performance degradation?
Yes and no. AFAIK, Math.max always gets translated into a conditional move, which means no branching. Your own max may or may not get the same treatment, depending on the statistics the JVM has collected.
There's no silver bullet. With predictable outcomes, branching is faster. Finding out exactly what patterns the CPU recognizes is hard. The JVM simply looks at how often a branch gets taken and uses a magic threshold of about 18%. See my own question and answer for details.
Or do all modern compilers simply implement this hack automatically? Specifically, what does Java do? Especially since its Math.min(...) function is just that ternary statement...
It's actually a compiler intrinsic. Whenever the JIT compiler sees this very method being called, it handles it specially. If you copy the method into your own code, it gets no special treatment.
In this case, the intrinsic is not very useful, as it's something that gets heavily optimized anyway. For methods like Long#numberOfLeadingZeros, the intrinsic is essential: the plain-Java code is rather long and slow, while modern CPUs can do it in a single cycle.
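To make the contrast concrete, here is a rough sketch of a plain-Java fallback next to the library call; the loop version is illustrative only, not the actual JDK source.

public class LeadingZerosDemo {
    // Illustrative only: a naive bit-by-bit loop, not the JDK's implementation.
    static int naiveNumberOfLeadingZeros(long value) {
        if (value == 0) return 64;
        int count = 0;
        while ((value & 0x8000000000000000L) == 0) { // top bit not yet set
            count++;
            value <<= 1;
        }
        return count;
    }

    public static void main(String[] args) {
        long someValue = 123456789L; // arbitrary example value
        // The intrinsic: the JIT replaces this call with a single
        // count-leading-zeros style instruction on CPUs that have one.
        System.out.println(Long.numberOfLeadingZeros(someValue));
        System.out.println(naiveNumberOfLeadingZeros(someValue));
    }
}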
Related
I'm aware that, for good practice, StringBuilder should be initialised with the capacity needed for the expected content. Otherwise, growing it later at runtime is going to be an expensive operation.
My question is, if we don't know the expected size, how should one go about it? Is there a standard value/way to avoid expensive operations under the hood?
If not, is there potentially a way of alarming/logging in the code if the capacity is bigger than the value given upon initialisation?
I'm aware that, for good practice, StringBuilder should be initialised with the capacity needed for the expected content. Otherwise, growing it later at runtime is going to be an expensive operation.
This is a wildly incorrect statement. It is very bad practice to do this. Even if you know exactly how large it'll be.
If I see this code:
StringBuilder sb = new StringBuilder(in1.length() + in2.length() * 3 + (loaded ? suffixLen : 0));
Then this is an additional thing to worry about, test, and keep up to date. I would assume if all this is present that for whatever reason somebody did some performance testing and actually figured out that this saves a worthwhile chunk of cycles, and somehow, in a fit of idiocy, neglected to write an enlightening comment and link to the JMH or profiler result analysis to verify this conclusion.
So I'd either painstakingly attempt to manually analyse whether the calculation is still correct after an update to this code, or I'd fix the problem and add the tests (and then be utterly befuddled when, of course, the profile review shows this code is utterly inconsequential), or I'd go through the considerable trouble of writing an assert-based test case that runs the entire operation and then verifies at the end that the size calculation done at the top is, in fact, correct.
I don't think you fully grasp why the hyperbolic "premature optimization is the root of all evil" statement is so popular.
Here's the problem. 99% of the system's resources are spent on 1% of the code. That's not an exaggeration; in fact, that is likely understating the issue.
Developer time is not infinite, and even if it were, the programmer's ability to comprehend code and focus on the relevant parts is limited because, in the end, programmers are human. Adding extra code that needs to be parsed and understood by human eyeballs and brains is therefore bad if that code does something irrelevant. We're literally talking about the same order of magnitude as throwing a glass of water into the ocean at a seashore in Europe and then watching the water level rise in Manhattan. Beyond any and all ability to measure, and incomprehensibly small. Bold does not do sufficient justice to how little it matters. Even if this code runs 100 million times a day for 15 years, it amounts to perhaps 5 cents in IaaS deployment costs total over that entire decade and a half, and that's if there is even a performance impact, which often there isn't, because modern VMs, GCs, OSes, and CPU architectures get up to some crazy shenanigans.
Furthermore, the system optimizes. Optimizers, such as the JVM's HotSpot engine, are in the end pattern-matching machines. They find commonly used patterns and recognize how to run them as efficiently as possible. If you write code in ways that nobody else does, that code is highly unlikely to actually outperform the common (idiomatic) case: most likely because it just doesn't matter, and even where it does, because the idiomatic case gets optimized much more readily.
Here is a trivial example:
List<String> someList = new ArrayList<String>();
for (int i = 0; i < 10000; i++) someList.add(someRandomString);
String[] arrayForm = someList.toArray(new String[0]);
Here you may go: Huh, well, we can optimize this code a little bit and pass new String[10000] instead; this saves the system from having to allocate an admittedly small object (a 0-size string array).
You would be wrong. The above code, with the new String[0], is in fact faster. How can that be? Optimizers, with pattern matching. They recognize the pattern and realize that the system can create a new array of the requisite size, not zero it out, and then run the code that fills it. The optimization patterns do not include the new String[reqSize] variant, where the system could in theory do the same thing: allocate the array and skip zeroing it out. (The JVM spec's zeroing guarantee merely means you can never observe that the array wasn't zeroed; it doesn't mean the JVM must actually zero it, which is where the optimization of not doing so comes from.) But it doesn't do that for the pre-sized variant - not common enough, and somewhat more complicated.
I'm not saying that new StringBuilder() is necessarily faster than new StringBuilder(knownSize). I'm saying it:
99.9% of the time literally does not make one iota of difference. Not a single nanosecond - the speedup is entirely theoretical: No performance test of any stripe can detect the difference. If a tree falls in the forest, and all that.
You have no idea when that 0.1% of the time even is, or whether it isn't actually straight up never - 0%. Between a CPU that caches and pipelines (did you know modern CPUs cannot access memory? At all? I bet you didn't. The basic von Neumann model of how CPUs work? It totally misleads you if you rely on it to performance-analyse machine code) - VMs, garbage collectors (did you know that garbage is free but live objects are expensive? Re-using an object is in fact more expensive than creating a ton of fast garbage... depending on many factors, of course; this too is an oversimplification. That's the real point: this is an intractable thing; you cannot just look at code and jump to conclusions about performance) - you stand no chance of knowing what's 'faster'.
The only right move is to write code as simply and as cleanly as you can ('clean' defined as: when you look at it, you jump to conclusions, and those conclusions are correct; it is easy to adjust in the face of changing requirements, and flexible in how it connects to the rest of the codebase). IF (big if!) real-life situations result in a performance issue, you first run a profiler so you know the 1% of the code that is in any way relevant, and then you go ham on that, with JMH benchmarks and all sorts of performance experiments to optimize the heck out of it. If your code is clean, that's great, because almost always this requires adjusting the code that calls into the 'hot path', or the code the 'hot path' flows out into - and the cleaner your code, the easier that will be.
Needless performance optimization almost invariably reduces flexibility, and makes code harder to understand.
Hence, objectively, micro-optimizing like this just makes your code slower and buggier for literally no benefit. Not even a tiny, almost immeasurable one.
Hence, the advice is silly. The only correct call is new StringBuilder() - no pre-configured size. The one and only excuse you have to write new StringBuilder(presetCapacity) is if there's a lengthy comment that immediately precedes it that lays out in a lot of detail, or links to a ticket, the exact performance study done to indicate this indeed fixes a real performance issue and how to recreate that study, and on what schedule it should be revisited.
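If you ever do run such a study, a minimal JMH sketch of it could look something like this (class name and string contents are made up for illustration; only the shape of the benchmark matters):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Hypothetical benchmark comparing the default and pre-sized builders.
@State(Scope.Thread)
public class StringBuilderCapacityBenchmark {

    String part1 = "some prefix ";
    String part2 = "some suffix";

    @Benchmark
    public String defaultCapacity() {
        return new StringBuilder().append(part1).append(part2).toString();
    }

    @Benchmark
    public String presetCapacity() {
        return new StringBuilder(part1.length() + part2.length())
                .append(part1).append(part2).toString();
    }
}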
I'm writing an algorithm where efficiency is very important.
Sometimes I also want to track the algorithm behaviour by calling some "callback" functions.
Let's say my code looks like this:
public float myAlgorithm(AlgorithmTracker tracker) {
    while (something) { // millions of iterations
        doStuff();
        if (tracker != null) tracker.incrementIterationCount(); // <--- How to run the if only once?
        doOtherStuff();
    }
}
How do I prevent the if statement from being executed a million times? Does the compiler see that tracker is never reassigned? If it's null on the first check, it will always be. If it's not, it will never be.
Ideally I would like to tell the compiler to build my code in such a way so that if tracker is null (at runtime) it would run with the same performance as
while (something) { // millions of iterations
    doStuff();
    doOtherStuff();
}
I thought of two solutions:
I could write two versions of myAlgorithm, one with the calls and one without them, but that would lead to lots of code duplication.
I could extract AlgorithmTracker to an interface and create a fake empty tracker with empty functions. Still, I don't know if the compiler would optimize the calls away.
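Roughly, that second option would look like this (the no-op class name is just a placeholder):

interface AlgorithmTracker {
    void incrementIterationCount();
}

final class NoOpTracker implements AlgorithmTracker {
    @Override
    public void incrementIterationCount() {
        // intentionally empty; ideally the JIT inlines this and drops the call
    }
}

// myAlgorithm would then always call tracker.incrementIterationCount()
// without any null check, passing a NoOpTracker when tracking is off.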
For most CPU architectures you don't have to worry about applying this optimization yourself, because contemporary CPUs already handle it in hardware. It is called branch prediction, and current CPUs are very good at it.
On average, every 6th instruction executed by a CPU is a branch, and if the CPU had to stop and evaluate the branch condition for every branch, execution would be a lot slower.
Branch prediction and Speculative execution
So when faced with a branch, without evaluating the branch condition, the CPU starts executing (speculative execution) the path it thinks is highly likely to be correct, and at a later stage, when the result of the branch condition becomes available, the CPU checks whether it is executing the correct path.
If the path picked by the CPU is consistent with the result of the branch condition, the CPU knows it's executing the correct path and keeps going at 100% speed; otherwise it has to flush all the instructions it executed speculatively and start over on the correct path.
But how does the CPU know which path to pick?
Enter the branch predictor subsystem of the CPU. In its most basic form, it stores information about the past behavior of a branch; for example, if a branch has not been taken for some time, it is likely not to be taken now either.
This is a simple explanation and a real branch predictor is going to be quite complex.
So how effective are these branch predictors?
Given that at their core branch predictors are just pattern-matching machines, if your branch shows a predictable pattern, you can rest assured that the branch predictor will get it right.
But if your branch shows no pattern at all, the branch predictor is not going to help you; worse yet, it will hamper your code's execution because of all the wrong predictions.
How is your code going to work out with the branch predictor?
In your case, the value of the branch control variable never changes, so the branch is either taken on every iteration of the loop or never taken at all.
This shows a clear pattern which even the most basic branch predictor can discern, which means your code will execute practically as if the condition were not there: after the first few iterations, the branch predictor will pick the path with 100% accuracy.
To learn more, read this great SO thread.
Fun fact: this particular optimization is the reason for CPU vulnerabilities such as Spectre and Meltdown.
lots of code duplication
Lots of code duplication means there's lots of code. So how could one simple null check influence the performance?
Hoisting null checks out of loops is a very trivial optimization. There's no guarantee that it gets done, but even when the JIT can't do it, the CPU can still perfectly predict the branch outcome (*). So the branch will cost something like ¼ of a cycle, as current CPUs are capable of executing, say, 4 instructions per cycle.
As already said, there's some branching anyway as the JIT has to do the null check. So most probably the net win from this premature optimization is zero.
(*) The prediction can get spoiled by tons of other branches evicting your branch from the predictor (a sort of cache). But then your poor branch was just one of many and you needn't care.
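For illustration, the hoisted form (whether produced by the JIT or written by hand) looks roughly like this, reusing the pseudocode names from the question:

if (tracker != null) {
    while (something) {              // hot loop with tracking
        doStuff();
        tracker.incrementIterationCount();
        doOtherStuff();
    }
} else {
    while (something) {              // hot loop without tracking
        doStuff();
        doOtherStuff();
    }
}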
I could write two versions of myAlgorithm ... but that would lead to lots of code duplication
Yes, this may be a way to optimize performance, and it is one of the rare cases where DRY doesn't apply. Another example of such a repeat-yourself technique is loop unrolling (if your compiler didn't already do it :)). Here, code duplication is the cost you pay for better performance.
But as for your particular if: the condition doesn't change inside the loop, so CPU branch prediction should work quite well. Write good, correct performance tests (with JMH, for example), and something tells me you will not see any difference from such a pico- (not even micro-) optimization; the result may even be worse, since there are far more important things that affect the overall performance. Just a few of them:
the most efficient compiler optimization is inlining (https://www.baeldung.com/jvm-method-inlining). If your code transformation breaks inlining, think twice and measure the resulting performance properly
memory allocation and, therefore, GC pauses in the main/critical path of the application may also be an important thing. Reuse mutable objects if required (pooling).
cache misses. Make sure you access memory sequentially as much as possible. A canonical example: replace a LinkedList with an ArrayList for iteration and your performance becomes much better
etc. etc.
So, don't worry about this particular IF at all.
Performance optimization is a very large and very interesting area. Take care of the RIGHT things and write correct perf tests... And always think about appropriate algorithms/collections, and remember classical big-O.
Assume I have a loop (any while or for) like this:
loop{
A long code.
}
From the point of view of time complexity, should I divide this code into parts, write a function outside the loop, and call that function repeatedly?
I read something about functions a long time ago: that calling a function repeatedly takes more time or memory or something like that, but I don't remember exactly. Can you also provide some good references about things like this (time complexity, coding style)?
Can you also provide some reference book or tutorial about heap memory, overheads, etc. which affect the performance of a program?
The performance difference is probably very minimal in this case. I would concentrate on clarity rather than performance until you identify this portion of your code to be a serious bottleneck.
It really does depend on what kind of code you're running in the loop, however. If you're just doing a tiny mathematical operation that isn't going to take any CPU time, but you're doing it a few hundred thousand times, then inlining the calculation might make sense. Anything more expensive than that, though, and performance shouldn't be an issue.
There is an overhead of calling a function.
So if the "long code" is fast compared to this overhead (and your application cares about performance), then you should definitely avoid the overhead.
However, if the performance is not noticeably worse, it's better to make the code more readable by using a function (or better, multiple functions).
Rule one of performance optimisation: Measure it.
Personally, I go for readable code first and then optimise it IF NECESSARY. Usually, it isn't necessary :-)
See the first line in CHAPTER 3 - Measurement Is Everything
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." - Donald Knuth
In this case, the difference in performance will probably be minimal between the two solutions, so writing clearer code is the way to do it.
There really isn't a simple "tutorial" on performance; it is a very complex subject and one that even seasoned veterans often don't fully understand. Anyway, to give you more of an idea of what the overhead of "calling" a function is: basically what you are doing is "freezing" the state of your method (in Java there are no "functions" per se, they are all called methods), calling the other method, then "unfreezing" back to where your method was before.
The "freezing" essentially consists of pushing state information (where you were in the method, what the values of the variables were, etc.) onto the stack; "unfreezing" consists of popping the saved state off the stack and updating the control structures to where they were before you called the function. Naturally memory operations are far from free, but the VM is pretty good at keeping the performance impact to an absolute minimum.
Now keep in mind Java is almost entirely heap based; the only things that really have to get pushed onto the stack are pointer values (small), your place in the program (again small), whatever primitives you have local to your method, and a tiny bit of control information, nothing else. Furthermore, although you cannot explicitly inline in Java (though I'm sure there are bytecode editors out there that essentially let you do that), most VMs, including the most popular HotSpot VM, will do this automatically for you. http://java.sun.com/developer/technicalArticles/Networking/HotSpot/inlining.html
So the bottom line is pretty much zero performance impact. If you want to verify this for yourself, you can always run benchmarking and profiling tools; they should be able to confirm it for you.
From an execution speed point of view it shouldn't matter, and if you still believe this is a bottleneck, it is easy to measure.
From a development performance perspective, it is a good idea to keep the code short. I would vote for turning the loop contents into one (or more) properly named methods.
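In other words, something along these lines (names invented for illustration; the JIT will usually inline the small method anyway):

import java.util.List;

class ReportPrinter {
    void processAll(List<String> lines) {
        for (String line : lines) {
            processLine(line); // the loop body becomes one well-named call
        }
    }

    private void processLine(String line) {
        // the former "long code" from the question goes here
        System.out.println(line.trim().toUpperCase());
    }
}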
Forget it! You can't gain any performance by doing the JIT's job for it. Let the JIT inline it for you. Keep the methods short for readability and also for performance, as the JIT works better with short methods.
There are micro-optimizations which may help you gain some performance, but don't even think about them. I suggest the following rules:
Write clean code using appropriate objects and algorithms for readability and for performance.
In case the program is too slow, profile and identify the critical parts.
Think about improving them using better objects and algorithms.
As a last resort, you may also consider microoptimizations.
What is your opinion of a project that would try to take code and split it into threads automatically (maybe at compile time, probably at runtime)?
Take a look at the code below:
for (int i = 0; i < 100; i++)
    sum1 += rand(100);
for (int j = 0; j < 100; j++)
    sum2 += rand(100) / 2;
This kind of code can automatically get split to 2 different threads that run in parallel.
Do you think it's even possible?
I have a feeling that theoretically it's impossible (it reminds me the halting problem) but I can't justify this thought.
Do you think it's a useful project? is there anything like it?
This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...
It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indices (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() call on a separate thread.
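Hand-written, that even/odd split might look roughly like this. It is only a sketch: it sidesteps the shared-state problem with rand() (discussed next) by using the thread-safe ThreadLocalRandom instead, and it uses an AtomicLong so the two partial sums can be combined safely.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class EvenOddSplit {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong sum1 = new AtomicLong();

        Thread even = new Thread(() -> {
            for (int i = 0; i < 100; i += 2) sum1.addAndGet(rand(100));
        });
        Thread odd = new Thread(() -> {
            for (int i = 1; i < 100; i += 2) sum1.addAndGet(rand(100));
        });

        even.start();
        odd.start();
        even.join();
        odd.join();
        System.out.println(sum1.get());
    }

    // stand-in for the question's rand(); ThreadLocalRandom is safe to call
    // concurrently from both threads
    static int rand(int bound) {
        return ThreadLocalRandom.current().nextInt(bound);
    }
}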
As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.
If you're truly interested in this topic and don't mind sifting through research papers:
Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.
This is practically not possible.
The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.
While it would be possible to parallelize very simple loops, even then, there's a risk involved. For example, your above code could only be parallelized if rand() is thread-safe - and many random number generation routines are not. (Java's Math.random() is synchronized for you - however.)
Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.
It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for Java code (this is equivalent to the halting problem).
Instead we need to relax our goal from the best partition to some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem; one is to forget about general code and start looking at specific types of programs. If you have simple control flow (constant-bounded for-loops, limited branching...) then you can make much more headway.
Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together then you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX / SSE style code). Getting to that stage has taken decades but if you look at compilers like Intel's then performance is starting to get pretty good.
If you move from vector instructions inside a single thread to multiple threads within a process then you have a huge increase in the latency of moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targeted tools available. If you can write one that works it would be very interesting to many people.
For your specific example, if you assume that rand() is a parallel version that you can call independently from different threads, then it's quite easy to see that the code can be split into two. A compiler would just need dependency analysis to see that neither loop uses data from, or affects, the other. So the ordering between them in the user-level code is a false dependency that could be split (i.e. by putting each loop in a separate thread).
But this isn't really how you would want to parallelise the code. It looks as if each loop iteration is dependent on the previous one, as sum1 += rand(100) is the same as sum1 = sum1 + rand(100), where the sum1 on the right-hand side is the value from the previous iteration. However, the only operation involved is addition, which is associative, so we can rewrite the sum in many different ways.
sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...
The advantage of the second is that each single addition in brackets can be computed in parallel with all of the others. Once you have 50 results, they can be combined into a further 25 additions, and so on... You do more work this way (50+25+13+7+4+2+1 = 102 additions versus 100 in the original), but there are only 7 sequential steps, so apart from the parallel forking/joining and communication overhead it runs about 14 times quicker. This tree of additions is called a gather operation in parallel architectures, and it tends to be the expensive part of a computation.
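In Java, that kind of associative, tree-shaped reduction is essentially what a parallel stream sum does for you; a small sketch, again substituting the thread-safe ThreadLocalRandom for the question's rand():

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class ParallelSum {
    public static void main(String[] args) {
        // The 100 draws are generated independently and combined with an
        // associative operation (+), which the framework is free to evaluate
        // as a tree across worker threads.
        long sum1 = IntStream.range(0, 100)
                .parallel()
                .map(i -> ThreadLocalRandom.current().nextInt(100))
                .sum();
        System.out.println(sum1);
    }
}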
On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.
In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.
Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.
That does not mean it would be useful. Consider the following:
First of all, to do it at compile time, you have to inspect all code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simple computations.
Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies the same state into several threads, for example. This is probably a very difficult task, and in many cases you will end up not being sure - two variables might in fact reference the same object.
Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why his code was not parallelizable and how it should be changed.
I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.
As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.
There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.
I've learnt that as of JDK 1.8 (Java 8), you can leverage multiple CPU cores when working with streams by using parallelStream().
However, before going to production with parallelStream() it is always better to compare sequential() with parallel() by benchmarking the performance, and then decide which would be ideal.
Why? Because there are scenarios where the parallel stream will perform dramatically worse than the sequential one, for example when the operation needs to do auto-(un)boxing. For those scenarios it's advisable to use the Java 8 primitive streams such as IntStream, LongStream, and DoubleStream.
Reference: Modern Java in Action: Manning Publications 2019
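A toy illustration of the difference (numbers chosen arbitrarily; real conclusions should come from a proper benchmark such as JMH):

import java.util.stream.IntStream;
import java.util.stream.LongStream;

public class BoxedVsPrimitiveSum {
    public static void main(String[] args) {
        int n = 10_000_000;

        // Boxed parallel reduction: every element becomes a Long object,
        // so much of the time goes into allocation and (un)boxing.
        long boxedSum = IntStream.rangeClosed(1, n)
                .parallel()
                .mapToObj(i -> (long) i)
                .reduce(0L, Long::sum);

        // Primitive specialization: no boxing, and the range splits cleanly
        // across cores.
        long primitiveSum = LongStream.rangeClosed(1, n).parallel().sum();

        System.out.println(boxedSum + " " + primitiveSum);
    }
}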
The programming language is Java, and Java runs on a virtual machine. So shouldn't one be able to execute the code at runtime on different threads owned by the VM? Since all the memory etc. is managed by the VM, it would not cause any corruption. You could see the code as a stack of instructions, estimate the execution time of each, and then distribute them over an array of threads that each get an execution stack of roughly the same total time. It might be dangerous, though: some graphics code, like OpenGL immediate mode, needs to maintain ordering and mostly should not be threaded at all.
I've heard that running recursive code on a server can impact performance. How true is this statement, and should recursive method be used as a last resort?
Recursion can potentially consume more memory than an equivalent iterative solution, because the latter can be optimized to take up only the memory it strictly needs, whereas recursion saves all local variables on the stack, thus taking up a bit more than strictly needed. This is only a problem in a memory-limited environment (which is not a rare situation) and for potentially very deep recursion (a few dozen recursive calls taking up a few hundred bytes each at most will not measurably impact the memory footprint of the server), so "last resort" is an overstatement.
But when profiling shows you that the footprint impact is large, one optimization-refactoring you can definitely perform is recursion removal - a popular topic in the academic literature for decades, but typically not hard to do by hand (especially if you keep all your methods, recursive or otherwise, reasonably small, as you should ;-)).
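As a toy illustration of such a removal (the example itself is arbitrary, not from the question):

// Recursive version: one stack frame per step.
static long factorialRecursive(int n) {
    if (n <= 1) return 1;
    return n * factorialRecursive(n - 1);
}

// Iterative version after "recursion removal": constant stack usage.
static long factorialIterative(int n) {
    long result = 1;
    for (int i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}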
I've heard that running recursive code on a server can impact performance. How true is this statement?
It is true: it impacts performance, in the same way that creating variables, running loops, or executing pretty much anything else does.
If the recursive code is poorly written or uncontrolled, it will consume your system's resources the same way an uncontrolled while loop would.
and should recursive method be used as a last resort?
No. It may be used as a first resort; many times it is easier to code a recursive function. It depends on your skills. But to be clear, there is nothing particularly evil about recursion.
To discuss performance you have to talk about very specific scenarios. Used appropriately, recursion is fine. If you use it inappropriately, you could blow the stack, or just use too much of it. This is especially true if you somehow get a non-terminating recursive tail call in an environment that eliminates tail calls (typically a bug, such as an attempt to walk a cyclic graph), as it won't even blow the stack (it'll just run forever, chomping CPU cycles).
But get it right (and limit the depth to sane amounts) and it is fine.
A badly programmed recursion that does not end has a negative impact on the machine, consuming an ever-growing amount of resources and threatening the stability of the whole system in the worst case.
Otherwise, recursions are a perfectly legitimate tool like loops and other constructs. They have no negative effect on performance per se.
Tail recursion is also an alternative. It boils down to this: pass the result accumulated so far as a parameter of the recursive method, so that the recursive call is the last thing the method does. In environments that optimize tail calls, the stack then won't blow up. More at Wikipedia and this site.
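A minimal sketch of that accumulator-passing shape and its loop equivalent (a made-up example; note that the JVM itself does not eliminate the tail call, as other answers here point out, but the shape translates mechanically into the loop):

// Accumulator-passing ("tail") form: the recursive call is the last action.
static long sumTail(long n, long acc) {
    if (n == 0) return acc;
    return sumTail(n - 1, acc + n);
}

// The equivalent loop, which is what tail-call optimization would produce.
static long sumLoop(long n) {
    long acc = 0;
    while (n > 0) {
        acc += n;
        n--;
    }
    return acc;
}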
Recursion is a tool of choice when you have to write algorithms. It's also much easier than iteration when you have to deal with recursive data structures like trees or graphs. It's usually harmless if (as a rule of thumb) you can bound the recursion depth to something not too large, provided that you do not forget the end condition...
Most modern compilers are able to optimize some kinds of recursive calls (replacing them internally with non-recursive equivalents). It's especially easy with tail recursion, that is, when the recursive call is the last instruction before returning the result.
However, there are some issues specific to Java. The underlying JVM does not provide any kind of goto instruction, which limits what the compiler can do. A tail recursion internal to one function can be replaced by a simple loop inside that function, but if the terminal call goes through another function, or if several functions recursively call one another, it becomes quite difficult to do when targeting JVM bytecode. The Sun JVM does not support tail-call optimization, though there are plans to change that; the IBM JVM does support it.
With some languages (functional languages like Lisp or Haskell) recursion is also the only (or the most natural) way to write programs. In JVM-based functional languages like Clojure or Scala, the lack of tail-call optimization is a problem that leads to workarounds such as trampolines in Scala.
Running any code on a server can impact performance. Server performance is usually going to be impacted by storage I/O before anything else, so at the level of "server performance" it's odd to see the question of general algorithm strategy talked about.
Deep recursion can cause a stack overflow, which is nasty. Be careful, as it's hard to recover if that happens. Small, manageable pieces of work are easier to handle and parallelize.