How to produce the cpu cache effect in C and java?

How to produce the cpu cache effect in C and java? - java

In Ulrich Drepper's paper What every programmer should know about memory, the 3rd part: CPU Caches, he shows a graph that shows the relationship between "working set" size and the cpu cycle consuming per operation (in this case, sequential reading). And there are two jumps in the graph which indicate the size of L1 cache and L2 cache. I wrote my own program to reproduce the effect in c. It just simply read a int[] array sequentially from head to tail, and I've tried different size of the array(from 1KB to 1MB). I plot the data into a graph and there is no jump, the graph is a straight line.
My questions are:
Is there something wrong with my method? What is the right way to produce the cpu cache effect(to see the jumps).
I was thinking, if it is sequential read, then it should operate like this:
When read the first element, it's a cache miss, and within the cache line size(64K), there will be hits. With the help of the prefetching, the latency of reading the next cache line will be hidden. It will contiguously read data into the L1 cache, even when the working set size is over the L1 cache size, it will evict the least recently used ones, and continue prefetch. So most of the cache misses will be hidden, the time consumed by fetch data from L2 will be hidden behind the reading activity, meaning they are operating at the same time. the assosiativity (8 way in my case) will hide the latency of reading data from L2. So, phenomenon of my program should be right, am I missing something?
Is it possible to get the same effect in java?
By the way, I am doing this in linux.
Edit 1
Thanks for Stephen C's suggestion, here are some additional Information:
This is my code:
int *arrayInt;
void initInt(long len) {
int i;
arrayInt = (int *)malloc(len * sizeof(int));
memset(arrayInt, 0, len * sizeof(int));
}
long sreadInt(long len) {
int sum = 0;
struct timespec tsStart, tsEnd;
initInt(len);
clock_gettime(CLOCK_REALTIME, &tsStart);
for(i = 0; i < len; i++) {
sum += arrayInt[i];
}
clock_gettime(CLOCK_REALTIME, &tsEnd);
free(arrayInt);
return (tsEnd.tv_nsec - tsStart.tv_nsec) / len;
}
In main() function, I've tried from 1KB to 100MB of the array size, still the same, average time consuming per element is 2 nanoseconds. I think the time is the access time of L1d.
My cache size:
L1d == 32k
L2 == 256k
L3 == 6144k
EDIT 2
I've changed my code to use a linked list.
// element type
struct l {
struct l *n;
long int pad[NPAD]; // the NPAD could be changed, in my case I set it to 1
};
struct l *array;
long globalSum;
// for init the array
void init(long len) {
long i, j;
struct l *ptr;
array = (struct l*)malloc(sizeof(struct l));
ptr = array;
for(j = 0; j < NPAD; j++) {
ptr->pad[j] = j;
}
ptr->n = NULL;
for(i = 1; i < len; i++) {
ptr->n = (struct l*)malloc(sizeof(struct l));
ptr = ptr->n;
for(j = 0; j < NPAD; j++) {
ptr->pad[j] = i + j;
}
ptr->n = NULL;
}
}
// for free the array when operation is done
void release() {
struct l *ptr = array;
struct l *tmp = NULL;
while(ptr) {
tmp = ptr;
ptr = ptr->n;
free(tmp);
}
}
double sread(long len) {
int i;
long sum = 0;
struct l *ptr;
struct timespec tsStart, tsEnd;
init(len);
ptr = array;
clock_gettime(CLOCK_REALTIME, &tsStart);
while(ptr) {
for(i = 0; i < NPAD; i++) {
sum += ptr->pad[i];
}
ptr = ptr->n;
}
clock_gettime(CLOCK_REALTIME, &tsEnd);
release();
globalSum += sum;
return (double)(tsEnd.tv_nsec - tsStart.tv_nsec) / (double)len;
}
At last, I will printf out the globalSum in order to avoid the compiler optimization. As you can see, it is still a sequential read, I've even tried up to 500MB of the array size, the average time per element is approximately 4 nanoseconds (maybe because it has to access the data 'pad' and the pointer 'n', two accesses), the same as 1KB of the array size. So, I think it is because the cache optimization like prefetch hide the latency very well, am I right? I will try a random access, and put the result on later.
EDIT 3
I've tried a random access to the linked list, this is the result:
the first red line is my L1 cache size, the second is L2. So we can see a little jump there. And some times the latency still be hidden well.

This answer isn't an answer, but more of a set of notes.
First, the CPU tends to operate on cache lines, not on individual bytes/words/dwords. This means that if you sequentially read/write an array of integers then the first access to a cache line may cause a cache miss but subsequent accesses to different integers in that same cache line won't. For 64-byte cache lines and 4-byte integers this means that you'd only get a cache miss once for every 16 accesses; which will dilute the results.
Second, the CPU has a "hardware pre-fetcher." If it detects that cache lines are being read sequentially, the hardware pre-fetcher will automatically pre-fetch cache lines it predicts will be needed next (in an attempt to fetch them into cache before they're needed).
Third, the CPU does other things (like "out of order execution") to hide fetch costs. The time difference (between cache hit and cache miss) that you can measure is the time that the CPU couldn't hide and not the total cost of the fetch.
These 3 things combined mean that; for sequentially reading an array of integers, it's likely that the CPU pre-fetches the next cache line while you're doing 16 reads from the previous cache line; and any cache miss costs won't be noticeable and may be entirely hidden. To prevent this; you'd want to "randomly" access each cache line once, to maximise the performance difference measured between "working set fits in cache/s" and "working set doesn't fit in cache/s."
Finally, there are other factors that may influence measurements. For example, for an OS that uses paging (e.g. Linux and almost all other modern OSs) there's a whole layer of caching above all this (TLBs/Translation Look-aside Buffers), and TLB misses once the working set gets beyond a certain size; which should be visible as a fourth "step" in the graph. There's also interference from the kernel (IRQs, page faults, task switches, multiple CPUs, etc); which might be visible as random static/error in the graph (unless tests are repeated often and outliers discarded). There are also artifacts of the cache design (cache associativity) that can reduce the effectiveness of the cache in ways that depend on the physical address/es allocated by the kernel; which might be seen as the "steps" in the graph shifting to different places.

Is there something wrong with my method?
Possibly, but without seeing your actual code that cannot be answered.
Your description of what your code is doing does not say whether you are reading the array once or many times.
The array may not be big enough ... depending on your hardware. (Don't some modern chips have a 3rd level cache of a few megabytes?)
In the Java case in particular you have to do lots of things the right way to implement a meaningful micro-benchmark.
In the C case:
You might try adjusting the C compiler's optimization switches.
Since your code is accessing the array serially, the compiler might be able to order the instructions so that the CPU can keep up, or the CPU might be optimistically prefetching or doing wide fetches. You could try reading the array elements in a less predictable order.
It is even possible that the compiler has entirely optimized the loop away because result of the loop calculation is not used for anything.
(According to this Q&A - How much time does it take to fetch one word from memory?, a fetch from L2 cache is ~7 nanoseconds and a fetch from main memory is ~100 nanoseconds. But you are getting ~2 nanoseconds. Something clever has to be going on here to make it run as fast as you are observing.)

With gcc-4.7 and compilation with gcc -std=c99 -O2 -S -D_GNU_SOURCE -fverbose-asm tcache.c you can see that the compiler is optimizing enough to remove the for loop (because sum is not used).
I had to improve your source code; some #include-s are missing, and i is not declared in the second function, so your example don't even compile as it is.
Make sum a global variable, or pass it somehow to the caller (perhaps with a global int globalsum; and putting globalsum=sum; after the loop).
And I am not sure you are right to clear the array with a memset. I could imagine a clever-enough compiler understanding that you are summing all zeros.
At last your code has extremely regular behavior with good locality: once in a while, a cache miss happens, the entire cache line is loaded and data is good enough for many iterations. Some clever optimizations (e.g. -O3 or better) might generate the good prefetch instructions. This is optimal for caches, because for a 32 words L1 cache line the cache miss happens every 32 loops so is well amortized.
Making a linked list of data will make cache behavior be worse. Conversely, in some real programs carefully adding a __builtin_prefetch at few well chosen places may improve performance by more than 10% (but adding too many of them will decrease performance).
In real life, the processor is spending the majority of the time to wait for some cache (and it is difficult to measure that; this waiting is CPU time, not idle time). Remember that during an L3 cache miss, the time needed to load data from your RAM module is the time needed to execute hundreds of machine instructions!

I can't say for certain about 1 and 2, but it would be more challenging to successfully run such a test in Java. In particular, I might be concerned that managed language features like automatic garbage collection might happen during the middle of your testing and throw off your results.

As you can see from graph 3.26 the Intel Core 2 shows hardly any jumps while reading (red line at the top of the graph). It is writing/copying where the jumps are clearly visible. Better to do a write test.

Related

Why is this memoization faster with an array than with a map?

I was solving combination sum IV on leetcode (#377), which reads:
"Given an integer array with all positive numbers and no duplicates, find the number of possible combinations that add up to a positive integer target."
I solved it in Java using a top down recursive approach with a memoization array:
public int combinationSum4(int[] nums, int target){
int[] memo = new int[target+1];
for(int i = 1; i < target+1; i++) {
memo[i] = -1;
}
memo[0] = 1;
return topDownCalc(nums, target, memo);
}
public static int topDownCalc(int[] nums, int target, int[] memo) {
if (memo[target] >= 0) {
return memo[target];
}
int tot = 0;
for(int num : nums) {
if(target - num >= 0) {
tot += topDownCalc(nums, target - num, memo);
}
}
memo[target] = tot;
return tot;
}
Then I figured I was wasting time by initializing the entire memo array and could just use a Map instead (which would also save space / memory). So I rewrote the code as follows:
public int combinationSum4(int[] nums, int target) {
Map<Integer, Integer> memo = new HashMap<Integer, Integer>();
memo.put(0, 1);
return topDownMapCalc(nums, target, memo);
}
public static int topDownMapCalc(int[] nums, int target, Map<Integer, Integer> memo) {
if (memo.containsKey(target)) {
return memo.get(target);
}
int tot = 0;
for(int num : nums) {
if(target - num >= 0) {
tot += topDownMapCalc(nums, target - num, memo);
}
}
memo.put(target, tot);
return tot;
}
I am confused though, because after submitting the second version of my code Leetcode said it was slower and used more space than the first code. How does the HashMap use more space and run slower than an array whos values all had to be initialized and whos length is greater than the HashMaps size?

Then I figured I was wasting time by initializing the entire memo array
You could have stored 'answer + 1' instead, so that the default value (0) can now be a placeholder for 'not calculated yet', and save that initialization. Not that it is expensive. Let's dig into cache pages.
Cache pages
CPUs are complex beasts. They don't operate on memory directly; not anymore. They literally cannot do it; the chip's calculating parts are simply not hooked up. at all. Instead, the CPU has caches, which come in set sizes (for example, 64k - you can't have a single cache node hold more or less than precisely 64k, and that entire 64k is then considered to be a cached copy of some 64k segment of main memory). One such node is called a cache page.
The CPU can only operate on cache pages.
In java, int[] leads to a contiguous, straight sized chunk of memory representing the data. In other words, an int[] x = new int[1000] would declare a single chunk of memory of 1000*4 = 4000 bytes (because ints are 4 bytes, and you reserved room for 1000 of em). That fits inside a single page. So, when you write your loop to initialize the values to -1, that's asking the CPU to loop through a single cache page and write some data to it. CPUs have pipelines and other speedup factors; this costs maybe 250 cycles.
Contrast to the cost of fetching a cache page: The CPU will be twiddling its thumbs (which is good; it can cool down some, and on modern hardware, often the CPU is limited not by its raw speed capabilities, but by the ability of the system to wick away the thermal impact of having it run! - it can also spend time on other threads/processes) whilst it farms out the job of fetching some chunk of memory into a cache page to the memory controller. Nevertheless, that thumb twiddling takes on the order of magnitude of 500 cycles or more. It's nice the CPU gets to cool down or focus on other things during it, but it's still the case that writing 4000 contiguous bytes in a tight loop is faster than a single cache miss.
Thus, 'fill a 1000-large int array with -1s' is an extremely cheap operation.
Wrapper objects
maps operate on objects, not ints, which is why you had to write Integer and not int. an Integer, in memory at least, is a much, much larger load on memory. It's an entire object, containing an int field. Then, your variable (or your map) holds a pointer to it.
So, an int[] x = new int[1000] takes 4000 bytes, plus some change for the object headers (maybe add 12 bytes to it all), and 1 reference (depends on VM, but let's say 64 bit), for a grand total of 4020 bytes.
In contrast,
Integer[] x = new Integer[1000];
for (int i = 0; i < 1000; i++) x[i] = i;`
is much, much larger. It's 1000 pointers (can be as large as 8 bytes per pointer, or as small as 4. So, 4000 to 8000 bytes), to 1000 separate integer objects. Each integer object gets the object overhead (~12 bytes or more), + 1 integer field, generally word-aligned (so, 64-bits, even though it's only 32-bit, assuming a 64-bit VM running on 64-bit hardware, which is going to be the case on anything modern), for another 20000 bytes. A grand total of something closer to 30000 bytes.
That is about 8x more memory required.
Then consider that the 'key' in your memoized array is inherent (it's the index into the array) whereas in the map the key needs separate storage and it gets worse still: Each k/v pair in your map occupies at least 12+12+8+8+8+8 bytes (2 object overheads and 2 int fields for your key and value Integer objects, and 2 pointers for the map to point at these), 56 bytes. In contrast to your int[] which does it in 4.
That gives you a rate of 56/4 = 14.
If your map contains only 1 in 14 numbers, then the map should be about as large as your array, because the map can do a thing your array can't: The array is as large as it has to be from the get-go, the map only needs to store required nodes.
Still, one would assume for most 'interesting' inputs, the coverage factor of that map is going to be far north of 7.14%, thus resulting in the map being larger.
The map also has its objects smeared out over memory which risks them being in more than one cache page: Large memory load + fragmentation = an easy road to having the CPU wait for multiple cache page fetches vs. being able to do all the work in one go, never having to wait for cache misses.
Can it be faster?
Yeah, probably - but with map occupancy rates at 10% or higher, the concept of using a map to save space is dubious. If you want to try, you'd need a map specifically designed to hold ints and nothing else. These do exist, such as eclipse collections' IntIntMap.
But I bet in this case the simple array memoization strategy is just the winner, even if you use IntIntMap.

These things came to my mind first:
HashMap is what the name implies, a hash-based map. Sowhenever you put something into it or get something out of it, it has to hash the key, then find the target based on that hash.
put() operation isn't just a walk in the park, either - you can check here to get an idea what it does. Definitely more than array assignment.
in java it doesn't work with primitives, so for each value you have to convert ints to Integers and vice versa. (as noted by others, there are int-specialized map alternatives available, but not in standard lib)
aaand since you're not initializing it, it might need to resize internally several times during your run - default size for a hashmap is just 16 - which is definitely more expensive then one-shot initialization you did with array. here's what each resizing does.
it also works with Entry objects that it needs for each internal entry it's got, and all those objects also take some space, plenty more than just having an array of integers
So I wouldn't think a hashmap would save you neither space or time. Why would it?

Which code runs faster?

I have two piece of code, and I want to know which is faster when they run and why it's faster. I learn less about JVM and CPU, but I'm hard working on them. Every tip will help.
int[] a=new int[1000];
int[] b=new int[10000000];
long start = System.currentTimeMillis();
//method 1
for(int i=0;i<1000;i++){
for(int j=0;j<10000000;j++){
a[i]++;
}
}
long end = System.currentTimeMillis();
System.out.println(end-start);
start=System.currentTimeMillis();
//method 2
for(int i=0 ;i<10000000;i++){
for(int j=0;j<1000;j++){
b[i]++;
}
}
end = System.currentTimeMillis();
System.out.println(end-start);

I'll throw my answer in there, in theory they will be exactly the same but in practice there will be a small, but negligible, difference. Too small to really matter, actually.
The basic idea is how array b is stored in memory. Because it is a lot larger, depending on your platform/implementation it might be stored in chunks, aka non-contiguously. That is likely since an array of 10 million ints is 40 million bytes = 40 MB!
EDIT: I get 572 and 593, respectively.

Complexity
In terms of asymptotic complexity (e.g. big-O notation), they have the same running time.
Data localization
Ignoring any optimization for the moment...
b is larger and is thus more likely to be split across multiple (or more) pages. Because of this, the first is likely faster.
The difference here is likely to be rather small, unless not all of these pages fit into RAM and need to be written to disk (which is unlikely here since b is only 10000000*4 = 40000000 bytes = 38 MB).
Optimization
The first method involves "execute a[i]++ 10000000 times" (for a fixed i), which can theoretically rather easily be converted to a[i] += 10000000 by the optimizer.
A similar optimization can occur for b, but only to b[i] += 1000, which still has to run 10000000 times.
The optimizer is free to do this or not do this. As far as I know, the Java language specification doesn't say much about what should and shouldn't be optimized, as long as it doesn't change the end result.
As an extreme result, the optimizer could, in theory, see that you're not doing anything with a or b after the loops and thus get rid of both loops.

The first loop runs faster on my system (median: 333 ms vs. 596 ms)
(Edit: I made a wrong assumption on number of array accesses in my first response, see comments)
Subsequent incremental (index++) accesses to the same array seem to be faster than random accesses or decremental (index--) accesses. I assume the Java Hotspot compiler can optimize the array bound checks if it recognizes that the array will be incrementally traversed.
When reversing the loops, it actually runs slower:
//incremental array index
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 10000000; j++) {
a[i]++;
}
}
//decremental array index
for (int i = 1000 - 1; i >= 0; i--) {
for (int j = 10000000 - 1; j >= 0; j--) {
a[i]++;
}
}
Incremental: 349ms, decremental: 485ms.
Without bounds checks, decremental loops usually are faster, especially on old processors (comparing to zero).
If my assumption is right, this makes 1000 optimized bounds checks versus 10000000 checks, so the first method is faster.
By the way, when benchmarking:
Do multiple rounds and compare the averages/mediums instead of the first sample
In Java: give your benchmark a warmup-phase (execute it a few times before measuring). On the first run, classes have to be loaded, and code might be interpreted before the HotSpot VM feature kicks in and does a native compilation
Measure time deltas with System.nanoTime(). This gives more accurate timestamps. System.currentTimeMillis() is not that precise (depends on the VM), and usually 'hops' in junks of a dozen or more milliseconds, rendering your result times more volatile than they actually are. Btw: 1 milliseconds = 1'000'000 nano second.

My guess would be that the they are both pretty much the same. One of them has a smaller array to handle, but that wouldn't make much difference except for the initial allocation of memory, which is outside your measuring anyway.
The time to execute each iteration should be the same (writing a value into an array). incrementing larger numbers shouldn't take the JVM longer than incrementing smaller numbers, nor should addressing a smaller or larger array index.
But why the question if you already know how to measure yourself?

Check out big-oh notation
A nested for loop is O(n^2) - They will run the same in theory.
The numbers 1000 or 100000 are a constant k O(n^2 + k)
They won't be exactly identical in practise because of various other things at play but it will be close.

Time should be equal, the result will obviously be different since a will contain 1000 entries with value 10000 and b will contain 10000000 entries with value 1000. I don't really get your question. What's the result for end-start?
It might be that the JVM will optimize the forloops, if it understands what the end results will be in the array than the smallest array will be much easier to calculate, since it requires only 1000 assignments, while the other one needs 10 000 times more.

First will be fuster. Because of initialization of first a and i cell much more less times.

Modern architectures are complex, so answering this kind of question is never easy.
The run time could be same or the first could be faster.
Things to consider in this case is mostly memory access and optimization.
A good optimizer will realize that the values are never read, so the loops can be skipped entirely, which gives a run time of 0 in both cases. Such optimization can take place at compile time or at run-time.
If it isn't optimized away, then there is memory access to consider. a[] is much smaller than b[], so it will more readily fit in faster cache memory resulting in fewer cache misses.
Another thing to consider is memory interleaving.

Why is processing a sorted array faster than processing an unsorted array?

Here is a piece of C++ code that shows some very peculiar behavior.
For some reason, sorting the data (before the timed region) miraculously makes the primary loop almost six times faster:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i)
{
for (unsigned c = 0; c < arraySize; ++c)
{ // Primary loop.
if (data[c] >= 128)
sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << '\n';
std::cout << "sum = " << sum << '\n';
}
Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
With the sorted data, the code runs in 1.93 seconds.
(Sorting itself takes more time than this one pass over the array, so it's not actually worth doing if we needed to calculate this for an unknown array.)
Initially, I thought this might be just a language or compiler anomaly, so I tried Java:
import java.util.Arrays;
import java.util.Random;
public class Main
{
public static void main(String[] args)
{
// Generate data
int arraySize = 32768;
int data[] = new int[arraySize];
Random rnd = new Random(0);
for (int c = 0; c < arraySize; ++c)
data[c] = rnd.nextInt() % 256;
// !!! With this, the next loop runs faster
Arrays.sort(data);
// Test
long start = System.nanoTime();
long sum = 0;
for (int i = 0; i < 100000; ++i)
{
for (int c = 0; c < arraySize; ++c)
{ // Primary loop.
if (data[c] >= 128)
sum += data[c];
}
}
System.out.println((System.nanoTime() - start) / 1000000000.0);
System.out.println("sum = " + sum);
}
}
With a similar but less extreme result.
My first thought was that sorting brings the data into the cache, but that's silly because the array was just generated.
What is going on?
Why is processing a sorted array faster than processing an unsorted array?
The code is summing up some independent terms, so the order should not matter.
Related / follow-up Q&As about the same effect with different/later compilers and options:
Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?
gcc optimization flag -O3 makes code slower than -O2

You are a victim of branch prediction fail.
What is Branch Prediction?
Consider a railroad junction:
Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.
Now for the sake of argument, suppose this is back in the 1800s - before long-distance or radio communication.
You are a blind operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.
Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.
Is there a better way? You guess which direction the train will go!
If you guessed right, it continues on.
If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.
If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.
Consider an if-statement: At the processor level, it is a branch instruction:
You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.
Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".
Is there a better way? You guess which direction the branch will go!
If you guessed right, you continue executing.
If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.
If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.
This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.
How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...
In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.
Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.
Further reading: "Branch predictor" article on Wikipedia.
As hinted from above, the culprit is this if-statement:
if (data[c] >= 128)
sum += data[c];
Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.
This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.
Quick visualization:
T = branch taken
N = branch not taken
data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N N N N N ... N N T T T ... T T T ...
= NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT (easy to predict)
However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).
data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118, 14, 150, 177, 182, ...
branch = T, T, N, T, T, T, T, N, T, N, N, T, T, T ...
= TTNTTTTNTNNTTT ... (completely random - impossible to predict)
What can be done?
If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.
Replace:
if (data[c] >= 128)
sum += data[c];
with:
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
This eliminates the branch and replaces it with some bitwise operations.
(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)
Benchmarks: Core i7 920 # 3.5 GHz
C++ - Visual Studio 2010 - x64 Release
Scenario
Time (seconds)
Branching - Random data
11.777
Branching - Sorted data
2.352
Branchless - Random data
2.564
Branchless - Sorted data
2.587
Java - NetBeans 7.1.1 JDK 7 - x64
Scenario
Time (seconds)
Branching - Random data
10.93293813
Branching - Sorted data
5.643797077
Branchless - Random data
3.113581453
Branchless - Sorted data
3.186068823
Observations:
With the Branch: There is a huge difference between the sorted and unsorted data.
With the Hack: There is no difference between sorted and unsorted data.
In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.
A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).
Update:
GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast.
(Or somewhat fast: for the already-sorted case, cmov can be slower especially if GCC puts it on the critical path instead of just add, especially on Intel before Broadwell where cmov has 2 cycle latency: gcc optimization flag -O3 makes code slower than -O2)
VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.
Intel C++ Compiler (ICC) 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to the mispredictions, it's also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...
If you give the Intel compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).
This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...

Branch prediction.
With a sorted array, the condition data[c] >= 128 is first false for a streak of values, then becomes true for all later values. That's easy to predict. With an unsorted array, you pay for the branching cost.

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.
Now, if we look at the code
if (data[c] >= 128)
sum += data[c];
we can find that the meaning of this particular if... else... branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction: cmovl, in an x86 system. The branch and thus the potential branch prediction penalty is removed.
In C, thus C++, the statement, which would compile directly (without any optimization) into the conditional move instruction in x86, is the ternary operator ... ? ... : .... So we rewrite the above statement into an equivalent one:
sum += data[c] >=128 ? data[c] : 0;
While maintaining readability, we can check the speedup factor.
On an Intel Core i7-2600K # 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is:
x86
Scenario
Time (seconds)
Branching - Random data
8.885
Branching - Sorted data
1.528
Branchless - Random data
3.716
Branchless - Sorted data
3.71
x64
Scenario
Time (seconds)
Branching - Random data
11.302
Branching - Sorted data
1.830
Branchless - Random data
2.736
Branchless - Sorted data
2.737
The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.
Now let's look more closely by investigating the x86 assembly they generate. For simplicity, we use two functions max1 and max2.
max1 uses the conditional branch if... else ...:
int max1(int a, int b) {
if (a > b)
return a;
else
return b;
}
max2 uses the ternary operator ... ? ... : ...:
int max2(int a, int b) {
return a > b ? a : b;
}
On an x86-64 machine, GCC -S generates the assembly below.
:max1
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
cmpl -8(%rbp), %eax
jle .L2
movl -4(%rbp), %eax
movl %eax, -12(%rbp)
jmp .L4
.L2:
movl -8(%rbp), %eax
movl %eax, -12(%rbp)
.L4:
movl -12(%rbp), %eax
leave
ret
:max2
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
cmpl %eax, -8(%rbp)
cmovge -8(%rbp), %eax
leave
ret
max2 uses much less code due to the usage of instruction cmovge. But the real gain is that max2 does not involve branch jumps, jmp, which would have a significant performance penalty if the predicted result is not right.
So why does a conditional move perform better?
In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.
In a branch case, the following instruction is determined by the preceding one, so we cannot do pipelining. We have to either wait or predict.
In a conditional move case, the execution of conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode do not depend on the result of the previous instruction; only the latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when the prediction is easy.
The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for special treatment for Branch Prediction and Misprediction Penalties.
Sometimes, some modern compilers can optimize our code to assembly with better performance, and sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

If you are curious about even more optimizations that can be done to this code, consider this:
Starting with the original loop:
for (unsigned i = 0; i < 100000; ++i)
{
for (unsigned j = 0; j < arraySize; ++j)
{
if (data[j] >= 128)
sum += data[j];
}
}
With loop interchange, we can safely change this loop to:
for (unsigned j = 0; j < arraySize; ++j)
{
for (unsigned i = 0; i < 100000; ++i)
{
if (data[j] >= 128)
sum += data[j];
}
}
Then, you can see that the if conditional is constant throughout the execution of the i loop, so you can hoist the if out:
for (unsigned j = 0; j < arraySize; ++j)
{
if (data[j] >= 128)
{
for (unsigned i = 0; i < 100000; ++i)
{
sum += data[j];
}
}
}
Then, you see that the inner loop can be collapsed into one single expression, assuming the floating point model allows it (/fp:fast is thrown, for example)
for (unsigned j = 0; j < arraySize; ++j)
{
if (data[j] >= 128)
{
sum += data[j] * 100000;
}
}
That one is 100,000 times faster than before.

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:
Sorted:
==32551== Branches: 656,645,130 ( 656,609,208 cond + 35,922 ind)
==32551== Mispredicts: 169,556 ( 169,095 cond + 461 ind)
==32551== Mispred rate: 0.0% ( 0.0% + 1.2% )
Unsorted:
==32555== Branches: 655,996,082 ( 655,960,160 cond + 35,922 ind)
==32555== Mispredicts: 164,073,152 ( 164,072,692 cond + 460 ind)
==32555== Mispred rate: 25.0% ( 25.0% + 1.2% )
Drilling down into the line-by-line output produced by cg_annotate we see for the loop in question:
Sorted:
Bc Bcm Bi Bim
10,001 4 0 0 for (unsigned i = 0; i < 10000; ++i)
. . . . {
. . . . // primary loop
327,690,000 10,016 0 0 for (unsigned c = 0; c < arraySize; ++c)
. . . . {
327,680,000 10,006 0 0 if (data[c] >= 128)
0 0 0 0 sum += data[c];
. . . . }
. . . . }
Unsorted:
Bc Bcm Bi Bim
10,001 4 0 0 for (unsigned i = 0; i < 10000; ++i)
. . . . {
. . . . // primary loop
327,690,000 10,038 0 0 for (unsigned c = 0; c < arraySize; ++c)
. . . . {
327,680,000 164,050,007 0 0 if (data[c] >= 128)
0 0 0 0 sum += data[c];
. . . . }
. . . . }
This lets you easily identify the problematic line - in the unsorted version the if (data[c] >= 128) line is causing 164,050,007 mispredicted conditional branches (Bcm) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version.
Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.
perf stat ./sumtest_sorted
Sorted:
Performance counter stats for './sumtest_sorted':
11808.095776 task-clock # 0.998 CPUs utilized
1,062 context-switches # 0.090 K/sec
14 CPU-migrations # 0.001 K/sec
337 page-faults # 0.029 K/sec
26,487,882,764 cycles # 2.243 GHz
41,025,654,322 instructions # 1.55 insns per cycle
6,558,871,379 branches # 555.455 M/sec
567,204 branch-misses # 0.01% of all branches
11.827228330 seconds time elapsed
Unsorted:
Performance counter stats for './sumtest_unsorted':
28877.954344 task-clock # 0.998 CPUs utilized
2,584 context-switches # 0.089 K/sec
18 CPU-migrations # 0.001 K/sec
335 page-faults # 0.012 K/sec
65,076,127,595 cycles # 2.253 GHz
41,032,528,741 instructions # 0.63 insns per cycle
6,560,579,013 branches # 227.183 M/sec
1,646,394,749 branch-misses # 25.10% of all branches
28.935500947 seconds time elapsed
It can also do source code annotation with dissassembly.
perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
Percent | Source code & Disassembly of sumtest_unsorted
------------------------------------------------
...
: sum += data[c];
0.00 : 400a1a: mov -0x14(%rbp),%eax
39.97 : 400a1d: mov %eax,%eax
5.31 : 400a1f: mov -0x20040(%rbp,%rax,4),%eax
4.60 : 400a26: cltq
0.00 : 400a28: add %rax,-0x30(%rbp)
...
See the performance tutorial for more details.

I just read up on this question and its answers, and I feel an answer is missing.
A common way to eliminate branch prediction that I've found to work particularly good in managed languages is a table lookup instead of using a branch (although I haven't tested it in this case).
This approach works in general if:
it's a small table and is likely to be cached in the processor, and
you are running things in a quite tight loop and/or the processor can preload the data.
Background and why
From a processor perspective, your memory is slow. To compensate for the difference in speed, a couple of caches are built into your processor (L1/L2 cache). So imagine that you're doing your nice calculations and figure out that you need a piece of memory. The processor will get its 'load' operation and loads the piece of memory into cache -- and then uses the cache to do the rest of the calculations. Because memory is relatively slow, this 'load' will slow down your program.
Like branch prediction, this was optimized in the Pentium processors: the processor predicts that it needs to load a piece of data and attempts to load that into the cache before the operation actually hits the cache. As we've already seen, branch prediction sometimes goes horribly wrong -- in the worst case scenario you need to go back and actually wait for a memory load, which will take forever (in other words: failing branch prediction is bad, a memory load after a branch prediction fail is just horrible!).
Fortunately for us, if the memory access pattern is predictable, the processor will load it in its fast cache and all is well.
The first thing we need to know is what is small? While smaller is generally better, a rule of thumb is to stick to lookup tables that are <= 4096 bytes in size. As an upper limit: if your lookup table is larger than 64K it's probably worth reconsidering.
Constructing a table
So we've figured out that we can create a small table. Next thing to do is get a lookup function in place. Lookup functions are usually small functions that use a couple of basic integer operations (and, or, xor, shift, add, remove and perhaps multiply). You want to have your input translated by the lookup function to some kind of 'unique key' in your table, which then simply gives you the answer of all the work you wanted it to do.
In this case: >= 128 means we can keep the value, < 128 means we get rid of it. The easiest way to do that is by using an 'AND': if we keep it, we AND it with 7FFFFFFF; if we want to get rid of it, we AND it with 0. Notice also that 128 is a power of 2 -- so we can go ahead and make a table of 32768/128 integers and fill it with one zero and a lot of 7FFFFFFFF's.
Managed languages
You might wonder why this works well in managed languages. After all, managed languages check the boundaries of the arrays with a branch to ensure you don't mess up...
Well, not exactly... :-)
There has been quite some work on eliminating this branch for managed languages. For example:
for (int i = 0; i < array.Length; ++i)
{
// Use array[i]
}
In this case, it's obvious to the compiler that the boundary condition will never be hit. At least the Microsoft JIT compiler (but I expect Java does similar things) will notice this and remove the check altogether. WOW, that means no branch. Similarly, it will deal with other obvious cases.
If you run into trouble with lookups in managed languages -- the key is to add a & 0x[something]FFF to your lookup function to make the boundary check predictable -- and watch it going faster.
The result of this case
// Generate data
int arraySize = 32768;
int[] data = new int[arraySize];
Random random = new Random(0);
for (int c = 0; c < arraySize; ++c)
{
data[c] = random.Next(256);
}
/*To keep the spirit of the code intact, I'll make a separate lookup table
(I assume we cannot modify 'data' or the number of loops)*/
int[] lookup = new int[256];
for (int c = 0; c < 256; ++c)
{
lookup[c] = (c >= 128) ? c : 0;
}
// Test
DateTime startTime = System.DateTime.Now;
long sum = 0;
for (int i = 0; i < 100000; ++i)
{
// Primary loop
for (int j = 0; j < arraySize; ++j)
{
/* Here you basically want to use simple operations - so no
random branches, but things like &, |, *, -, +, etc. are fine. */
sum += lookup[data[j]];
}
}
DateTime endTime = System.DateTime.Now;
Console.WriteLine(endTime - startTime);
Console.WriteLine("sum = " + sum);
Console.ReadLine();

As data is distributed between 0 and 255 when the array is sorted, around the first half of the iterations will not enter the if-statement (the if statement is shared below).
if (data[c] >= 128)
sum += data[c];
The question is: What makes the above statement not execute in certain cases as in case of sorted data? Here comes the "branch predictor". A branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance!
Let's do some bench marking to understand it better
The performance of an if-statement depends on whether its condition has a predictable pattern. If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive.
Let’s measure the performance of this loop with different conditions:
for (int i = 0; i < max; i++)
if (condition)
sum++;
Here are the timings of the loop with different true-false patterns:
Condition Pattern Time (ms)
-------------------------------------------------------
(i & 0×80000000) == 0 T repeated 322
(i & 0xffffffff) == 0 F repeated 276
(i & 1) == 0 TF alternating 760
(i & 3) == 0 TFFFTFFF… 513
(i & 2) == 0 TTFFTTFF… 1675
(i & 4) == 0 TTTTFFFFTTTTFFFF… 1275
(i & 8) == 0 8T 8F 8T 8F … 752
(i & 16) == 0 16T 16F 16T 16F … 490
A “bad” true-false pattern can make an if-statement up to six times slower than a “good” pattern! Of course, which pattern is good and which is bad depends on the exact instructions generated by the compiler and on the specific processor.
So there is no doubt about the impact of branch prediction on performance!

One way to avoid branch prediction errors is to build a lookup table, and index it using the data. Stefan de Bruijn discussed that in his answer.
But in this case, we know values are in the range [0, 255] and we only care about values >= 128. That means we can easily extract a single bit that will tell us whether we want a value or not: by shifting the data to the right 7 bits, we are left with a 0 bit or a 1 bit, and we only want to add the value when we have a 1 bit. Let's call this bit the "decision bit".
By using the 0/1 value of the decision bit as an index into an array, we can make code that will be equally fast whether the data is sorted or not sorted. Our code will always add a value, but when the decision bit is 0, we will add the value somewhere we don't care about. Here's the code:
// Test
clock_t start = clock();
long long a[] = {0, 0};
long long sum;
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
int j = (data[c] >> 7);
a[j] += data[c];
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
sum = a[1];
This code wastes half of the adds but never has a branch prediction failure. It's tremendously faster on random data than the version with an actual if statement.
But in my testing, an explicit lookup table was slightly faster than this, probably because indexing into a lookup table was slightly faster than bit shifting. This shows how my code sets up and uses the lookup table (unimaginatively called lut for "LookUp Table" in the code). Here's the C++ code:
// Declare and then fill in the lookup table
int lut[256];
for (unsigned c = 0; c < 256; ++c)
lut[c] = (c >= 128) ? c : 0;
// Use the lookup table after it is built
for (unsigned i = 0; i < 100000; ++i)
{
// Primary loop
for (unsigned c = 0; c < arraySize; ++c)
{
sum += lut[data[c]];
}
}
In this case, the lookup table was only 256 bytes, so it fits nicely in a cache and all was fast. This technique wouldn't work well if the data was 24-bit values and we only wanted half of them... the lookup table would be far too big to be practical. On the other hand, we can combine the two techniques shown above: first shift the bits over, then index a lookup table. For a 24-bit value that we only want the top half value, we could potentially shift the data right by 12 bits, and be left with a 12-bit value for a table index. A 12-bit table index implies a table of 4096 values, which might be practical.
The technique of indexing into an array, instead of using an if statement, can be used for deciding which pointer to use. I saw a library that implemented binary trees, and instead of having two named pointers (pLeft and pRight or whatever) had a length-2 array of pointers and used the "decision bit" technique to decide which one to follow. For example, instead of:
if (x < node->value)
node = node->pLeft;
else
node = node->pRight;
this library would do something like:
i = (x < node->value);
node = node->link[i];
Here's a link to this code: Red Black Trees, Eternally Confuzzled

In the sorted case, you can do better than relying on successful branch prediction or any branchless comparison trick: completely remove the branch.
Indeed, the array is partitioned in a contiguous zone with data < 128 and another with data >= 128. So you should find the partition point with a dichotomic search (using Lg(arraySize) = 15 comparisons), then do a straight accumulation from that point.
Something like (unchecked)
int i= 0, j, k= arraySize;
while (i < k)
{
j= (i + k) >> 1;
if (data[j] >= 128)
k= j;
else
i= j;
}
sum= 0;
for (; i < arraySize; i++)
sum+= data[i];
or, slightly more obfuscated
int i, k, j= (i + k) >> 1;
for (i= 0, k= arraySize; i < k; (data[j] >= 128 ? k : i)= j)
j= (i + k) >> 1;
for (sum= 0; i < arraySize; i++)
sum+= data[i];
A yet faster approach, that gives an approximate solution for both sorted or unsorted is: sum= 3137536; (assuming a truly uniform distribution, 16384 samples with expected value 191.5) :-)

The above behavior is happening because of Branch prediction.
To understand branch prediction one must first understand an Instruction Pipeline.
The the steps of running an instruction can be overlapped with the sequence of steps of running the previous and next instruction, so that different steps can be executed concurrently in parallel. This technique is known as instruction pipelining and is used to increase throughput in modern processors. To understand this better please see this example on Wikipedia.
Generally, modern processors have quite long (and wide) pipelines, so many instruction can be in flight. See Modern Microprocessors
A 90-Minute Guide! which starts by introducing basic in-order pipelining and goes from there.
But for ease let's consider a simple in-order pipeline with these 4 steps only.
(Like a classic 5-stage RISC, but omitting a separate MEM stage.)
IF -- Fetch the instruction from memory
ID -- Decode the instruction
EX -- Execute the instruction
WB -- Write back to CPU register
4-stage pipeline in general for 2 instructions.
Moving back to the above question let's consider the following instructions:
A) if (data[c] >= 128)
/\
/ \
/ \
true / \ false
/ \
/ \
/ \
/ \
B) sum += data[c]; C) for loop or print().
Without branch prediction, the following would occur:
To execute instruction B or instruction C the processor will have to wait (stall) till the instruction A leaves the EX stage in the pipeline, as the decision to go to instruction B or instruction C depends on the result of instruction A. (i.e. where to fetch from next.) So the pipeline will look like this:
Without prediction: when if condition is true:
Without prediction: When if condition is false:
As a result of waiting for the result of instruction A, the total CPU cycles spent in the above case (without branch prediction; for both true and false) is 7.
So what is branch prediction?
Branch predictor will try to guess which way a branch (an if-then-else structure) will go before this is known for sure. It will not wait for the instruction A to reach the EX stage of the pipeline, but it will guess the decision and go to that instruction (B or C in case of our example).
In case of a correct guess, the pipeline looks something like this:
If it is later detected that the guess was wrong then the partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the greater the need for a good branch predictor.
In the OP's code, the first time when the conditional, the branch predictor does not have any information to base up prediction, so the first time it will randomly choose the next instruction. (Or fall back to static prediction, typically forward not-taken, backward taken). Later in the for loop, it can base the prediction on the history.
For an array sorted in ascending order, there are three possibilities:
All the elements are less than 128
All the elements are greater than 128
Some starting new elements are less than 128 and later it become greater than 128
Let us assume that the predictor will always assume the true branch on the first run.
So in the first case, it will always take the true branch since historically all its predictions are correct.
In the 2nd case, initially it will predict wrong, but after a few iterations, it will predict correctly.
In the 3rd case, it will initially predict correctly till the elements are less than 128. After which it will fail for some time and the correct itself when it sees branch prediction failure in history.
In all these cases the failure will be too less in number and as a result, only a few times it will need to discard the partially executed instructions and start over with the correct branch, resulting in fewer CPU cycles.
But in case of a random unsorted array, the prediction will need to discard the partially executed instructions and start over with the correct branch most of the time and result in more CPU cycles compared to the sorted array.
Further reading:
Modern Microprocessors
A 90-Minute Guide!
Dan Luu's article on branch prediction (which covers older branch predictors, not modern IT-TAGE or Perceptron)
https://en.wikipedia.org/wiki/Branch_predictor
Branch Prediction and the Performance of Interpreters -
Don’t Trust Folklore - 2015 paper showing how well Intel's Haswell does at predicting the indirect branch of a Python interpreter's main loop (historically problematic due to a non-simple pattern), vs. earlier CPUs which didn't use IT-TAGE. (They don't help with this fully random case, though. Still 50% mispredict rate for the if inside the loop on a Skylake CPU when the source is compiled to branch asm.)
Static branch prediction on newer Intel processors - what CPUs actually do when running a branch instruction that doesn't have a dynamic prediction available. Historically, forward not-taken (like an if or break), backward taken (like a loop) has been used because it's better than nothing. Laying out code so the fast path / common case minimizes taken branches is good for I-cache density as well as static prediction, so compilers already do that. (That's the real effect of likely / unlikely hints in C source, not actually hinting the hardware branch prediction in most CPU, except maybe via static prediction.)

An official answer would be from
Intel - Avoiding the Cost of Branch Misprediction
Intel - Branch and Loop Reorganization to Prevent Mispredicts
Scientific papers - branch prediction computer architecture
Books: J.L. Hennessy, D.A. Patterson: Computer architecture: a quantitative approach
Articles in scientific publications: T.Y. Yeh, Y.N. Patt made a lot of these on branch predictions.
You can also see from this lovely diagram why the branch predictor gets confused.
Each element in the original code is a random value
data[c] = std::rand() % 256;
so the predictor will change sides as the std::rand() blow.
On the other hand, once it's sorted, the predictor will first move into a state of strongly not taken and when the values change to the high value the predictor will in three runs through change all the way from strongly not taken to strongly taken.

In the same line (I think this was not highlighted by any answer) it's good to mention that sometimes (specially in software where the performance matters—like in the Linux kernel) you can find some if statements like the following:
if (likely( everything_is_ok ))
{
/* Do something */
}
or similarly:
if (unlikely(very_improbable_condition))
{
/* Do something */
}
Both likely() and unlikely() are in fact macros that are defined by using something like the GCC's __builtin_expect to help the compiler insert prediction code to favour the condition taking into account the information provided by the user. GCC supports other builtins that could change the behavior of the running program or emit low level instructions like clearing the cache, etc. See this documentation that goes through the available GCC's builtins.
Normally this kind of optimizations are mainly found in hard-real time applications or embedded systems where execution time matters and it's critical. For example, if you are checking for some error condition that only happens 1/10000000 times, then why not inform the compiler about this? This way, by default, the branch prediction would assume that the condition is false.

Frequently used Boolean operations in C++ produce many branches in the compiled program. If these branches are inside loops and are hard to predict they can slow down execution significantly. Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true.
Boolean variables are overdetermined in the sense that all operators that have Boolean variables as input check if the inputs have any other value than 0 or 1, but operators that have Booleans as output can produce no other value than 0 or 1. This makes operations with Boolean variables as input less efficient than necessary.
Consider example:
bool a, b, c, d;
c = a && b;
d = a || b;
This is typically implemented by the compiler in the following way:
bool a, b, c, d;
if (a != 0) {
if (b != 0) {
c = 1;
}
else {
goto CFALSE;
}
}
else {
CFALSE:
c = 0;
}
if (a == 0) {
if (b == 0) {
d = 0;
}
else {
goto DTRUE;
}
}
else {
DTRUE:
d = 1;
}
This code is far from optimal. The branches may take a long time in case of mispredictions. The Boolean operations can be made much more efficient if it is known with certainty that the operands have no other values than 0 and 1. The reason why the compiler does not make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources. The above code can be optimized if a and b has been initialized to valid values or if they come from operators that produce Boolean output. The optimized code looks like this:
char a = 0, b = 1, c, d;
c = a & b;
d = a | b;
char is used instead of bool in order to make it possible to use the bitwise operators (& and |) instead of the Boolean operators (&& and ||). The bitwise operators are single instructions that take only one clock cycle. The OR operator (|) works even if a and b have other values than 0 or 1. The AND operator (&) and the EXCLUSIVE OR operator (^) may give inconsistent results if the operands have other values than 0 and 1.
~ can not be used for NOT. Instead, you can make a Boolean NOT on a variable which is known to be 0 or 1 by XOR'ing it with 1:
bool a, b;
b = !a;
can be optimized to:
char a = 0, b;
b = a ^ 1;
a && b cannot be replaced with a & b if b is an expression that should not be evaluated if a is false ( && will not evaluate b, & will). Likewise, a || b can not be replaced with a | b if b is an expression that should not be evaluated if a is true.
Using bitwise operators is more advantageous if the operands are variables than if the operands are comparisons:
bool a; double x, y, z;
a = x > y && z < 5.0;
is optimal in most cases (unless you expect the && expression to generate many branch mispredictions).

That's for sure!...
Branch prediction makes the logic run slower, because of the switching which happens in your code! It's like you are going a straight street or a street with a lot of turnings, for sure the straight one is going to be done quicker!...
If the array is sorted, your condition is false at the first step: data[c] >= 128, then becomes a true value for the whole way to the end of the street. That's how you get to the end of the logic faster. On the other hand, using an unsorted array, you need a lot of turning and processing which make your code run slower for sure...
Look at the image I created for you below. Which street is going to be finished faster?
So programmatically, branch prediction causes the process to be slower...
Also at the end, it's good to know we have two kinds of branch predictions that each is going to affect your code differently:
1. Static
2. Dynamic
Static branch prediction is used by the microprocessor the first time
a conditional branch is encountered, and dynamic branch prediction is
used for succeeding executions of the conditional branch code.
In order to effectively write your code to take advantage of these
rules, when writing if-else or switch statements, check the most
common cases first and work progressively down to the least common.
Loops do not necessarily require any special ordering of code for
static branch prediction, as only the condition of the loop iterator
is normally used.

This question has already been answered excellently many times over. Still I'd like to draw the group's attention to yet another interesting analysis.
Recently this example (modified very slightly) was also used as a way to demonstrate how a piece of code can be profiled within the program itself on Windows. Along the way, the author also shows how to use the results to determine where the code is spending most of its time in both the sorted & unsorted case. Finally the piece also shows how to use a little known feature of the HAL (Hardware Abstraction Layer) to determine just how much branch misprediction is happening in the unsorted case.
The link is here:
A Demonstration of Self-Profiling

As what has already been mentioned by others, what behind the mystery is Branch Predictor.
I'm not trying to add something but explaining the concept in another way.
There is a concise introduction on the wiki which contains text and diagram.
I do like the explanation below which uses a diagram to elaborate the Branch Predictor intuitively.
In computer architecture, a branch predictor is a
digital circuit that tries to guess which way a branch (e.g. an
if-then-else structure) will go before this is known for sure. The
purpose of the branch predictor is to improve the flow in the
instruction pipeline. Branch predictors play a critical role in
achieving high effective performance in many modern pipelined
microprocessor architectures such as x86.
Two-way branching is usually implemented with a conditional jump
instruction. A conditional jump can either be "not taken" and continue
execution with the first branch of code which follows immediately
after the conditional jump, or it can be "taken" and jump to a
different place in program memory where the second branch of code is
stored. It is not known for certain whether a conditional jump will be
taken or not taken until the condition has been calculated and the
conditional jump has passed the execution stage in the instruction
pipeline (see fig. 1).
Based on the described scenario, I have written an animation demo to show how instructions are executed in a pipeline in different situations.
Without the Branch Predictor.
Without branch prediction, the processor would have to wait until the
conditional jump instruction has passed the execute stage before the
next instruction can enter the fetch stage in the pipeline.
The example contains three instructions and the first one is a conditional jump instruction. The latter two instructions can go into the pipeline until the conditional jump instruction is executed.
It will take 9 clock cycles for 3 instructions to be completed.
Use Branch Predictor and don't take a conditional jump. Let's assume that the predict is not taking the conditional jump.
It will take 7 clock cycles for 3 instructions to be completed.
Use Branch Predictor and take a conditional jump. Let's assume that the predict is not taking the conditional jump.
It will take 9 clock cycles for 3 instructions to be completed.
The time that is wasted in case of a branch misprediction is equal to
the number of stages in the pipeline from the fetch stage to the
execute stage. Modern microprocessors tend to have quite long
pipelines so that the misprediction delay is between 10 and 20 clock
cycles. As a result, making a pipeline longer increases the need for a
more advanced branch predictor.
As you can see, it seems we don't have a reason not to use Branch Predictor.
It's quite a simple demo that clarifies the very basic part of Branch Predictor. If those gifs are annoying, please feel free to remove them from the answer and visitors can also get the live demo source code from BranchPredictorDemo

Branch-prediction gain!
It is important to understand that branch misprediction doesn't slow down programs. The cost of a missed prediction is just as if branch prediction didn't exist and you waited for the evaluation of the expression to decide what code to run (further explanation in the next paragraph).
if (expression)
{
// Run 1
} else {
// Run 2
}
Whenever there's an if-else \ switch statement, the expression has to be evaluated to determine which block should be executed. In the assembly code generated by the compiler, conditional branch instructions are inserted.
A branch instruction can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order (i.e. if the expression is false, the program skips the code of the if block) depending on some condition, which is the expression evaluation in our case.
That being said, the compiler tries to predict the outcome prior to it being actually evaluated. It will fetch instructions from the if block, and if the expression turns out to be true, then wonderful! We gained the time it took to evaluate it and made progress in the code; if not then we are running the wrong code, the pipeline is flushed, and the correct block is run.
Visualization:
Let's say you need to pick route 1 or route 2. Waiting for your partner to check the map, you have stopped at ## and waited, or you could just pick route1 and if you were lucky (route 1 is the correct route), then great you didn't have to wait for your partner to check the map (you saved the time it would have taken him to check the map), otherwise you will just turn back.
While flushing pipelines is super fast, nowadays taking this gamble is worth it. Predicting sorted data or a data that changes slowly is always easier and better than predicting fast changes.
O Route 1 /-------------------------------
/|\ /
| ---------##/
/ \ \
\
Route 2 \--------------------------------

On ARM, there is no branch needed, because every instruction has a 4-bit condition field, which tests (at zero cost) any of 16 different different conditions that may arise in the Processor Status Register, and if the condition on an instruction is false, the instruction is skipped. This eliminates the need for short branches, and there would be no branch prediction hit for this algorithm. Therefore, the sorted version of this algorithm would run slower than the unsorted version on ARM, because of the extra overhead of sorting.
The inner loop for this algorithm would look something like the following in ARM assembly language:
MOV R0, #0 // R0 = sum = 0
MOV R1, #0 // R1 = c = 0
ADR R2, data // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop // Inner loop branch label
LDRB R3, [R2, R1] // R3 = data[c]
CMP R3, #128 // compare R3 to 128
ADDGE R0, R0, R3 // if R3 >= 128, then sum += data[c] -- no branch needed!
ADD R1, R1, #1 // c++
CMP R1, #arraySize // compare c to arraySize
BLT inner_loop // Branch to inner_loop if c < arraySize
But this is actually part of a bigger picture:
CMP opcodes always update the status bits in the Processor Status Register (PSR), because that is their purpose, but most other instructions do not touch the PSR unless you add an optional S suffix to the instruction, specifying that the PSR should be updated based on the result of the instruction. Just like the 4-bit condition suffix, being able to execute instructions without affecting the PSR is a mechanism that reduces the need for branches on ARM, and also facilitates out of order dispatch at the hardware level, because after performing some operation X that updates the status bits, subsequently (or in parallel) you can do a bunch of other work that explicitly should not affect (or be affected by) the status bits, then you can test the state of the status bits set earlier by X.
The condition testing field and the optional "set status bit" field can be combined, for example:
ADD R1, R2, R3 performs R1 = R2 + R3 without updating any status bits.
ADDGE R1, R2, R3 performs the same operation only if a previous instruction that affected the status bits resulted in a Greater than or Equal condition.
ADDS R1, R2, R3 performs the addition and then updates the N, Z, C and V flags in the Processor Status Register based on whether the result was Negative, Zero, Carried (for unsigned addition), or oVerflowed (for signed addition).
ADDSGE R1, R2, R3 performs the addition only if the GE test is true, and then subsequently updates the status bits based on the result of the addition.
Most processor architectures do not have this ability to specify whether or not the status bits should be updated for a given operation, which can necessitate writing additional code to save and later restore status bits, or may require additional branches, or may limit the processor's out of order execution efficiency: one of the side effects of most CPU instruction set architectures forcibly updating status bits after most instructions is that it is much harder to tease apart which instructions can be run in parallel without interfering with each other. Updating status bits has side effects, therefore has a linearizing effect on code. ARM's ability to mix and match branch-free condition testing on any instruction with the option to either update or not update the status bits after any instruction is extremely powerful, for both assembly language programmers and compilers, and produces very efficient code.
When you don't have to branch, you can avoid the time cost of flushing the pipeline for what would otherwise be short branches, and you can avoid the design complexity of many forms of speculative evalution. The performance impact of the initial naive imlementations of the mitigations for many recently discovered processor vulnerabilities (Spectre etc.) shows you just how much the performance of modern processors depends upon complex speculative evaluation logic. With a short pipeline and the dramatically reduced need for branching, ARM just doesn't need to rely on speculative evaluation as much as CISC processors. (Of course high-end ARM implementations do include speculative evaluation, but it's a smaller part of the performance story.)
If you have ever wondered why ARM has been so phenomenally successful, the brilliant effectiveness and interplay of these two mechanisms (combined with another mechanism that lets you "barrel shift" left or right one of the two arguments of any arithmetic operator or offset memory access operator at zero additional cost) are a big part of the story, because they are some of the greatest sources of the ARM architecture's efficiency. The brilliance of the original designers of the ARM ISA back in 1983, Steve Furber and Roger (now Sophie) Wilson, cannot be overstated.

It's about branch prediction. What is it?
A branch predictor is one of the ancient performance-improving techniques which still finds relevance in modern architectures. While the simple prediction techniques provide fast lookup and power efficiency they suffer from a high misprediction rate.
On the other hand, complex branch predictions –either neural-based or variants of two-level branch prediction –provide better prediction accuracy, but they consume more power and complexity increases exponentially.
In addition to this, in complex prediction techniques, the time taken to predict the branches is itself very high –ranging from 2 to 5 cycles –which is comparable to the execution time of actual branches.
Branch prediction is essentially an optimization (minimization) problem where the emphasis is on to achieve lowest possible miss rate, low power consumption, and low complexity with minimum resources.
There really are three different kinds of branches:
Forward conditional branches - based on a run-time condition, the PC (program counter) is changed to point to an address forward in the instruction stream.
Backward conditional branches - the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop states the loop should be executed again.
Unconditional branches - this includes jumps, procedure calls, and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language as simply "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a conditional jump that might be coded as "jmpne" would redirect the instruction stream only if the result of a comparison of two values in a previous "compare" instructions shows the values to not be equal. (The segmented addressing scheme used by the x86 architecture adds extra complexity since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)
Static/dynamic Branch Prediction: Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.
References:
Branch predictor
A Demonstration of Self-Profiling
Branch Prediction Review
Branch Prediction (Using wayback machine)

Besides the fact that the branch prediction may slow you down, a sorted array has another advantage:
You can have a stop condition instead of just checking the value, this way you only loop over the relevant data, and ignore the rest.
The branch prediction will miss only once.
// sort backwards (higher values first), may be in some other part of the code
std::sort(data, data + arraySize, std::greater<int>());
for (unsigned c = 0; c < arraySize; ++c) {
if (data[c] < 128) {
break;
}
sum += data[c];
}

Sorted arrays are processed faster than an unsorted array, due to a phenomena called branch prediction.
The branch predictor is a digital circuit (in computer architecture) trying to predict which way a branch will go, improving the flow in the instruction pipeline. The circuit/computer predicts the next step and executes it.
Making a wrong prediction leads to going back to the previous step, and executing with another prediction. Assuming the prediction is correct, the code will continue to the next step. A wrong prediction results in repeating the same step, until a correct prediction occurs.
The answer to your question is very simple.
In an unsorted array, the computer makes multiple predictions, leading to an increased chance of errors.
Whereas, in a sorted array, the computer makes fewer predictions, reducing the chance of errors.
Making more predictions requires more time.
Sorted Array: Straight Road
____________________________________________________________________________________
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Unsorted Array: Curved Road
______ ________
| |__|
Branch prediction: Guessing/predicting which road is straight and following it without checking
___________________________________________ Straight road
|_________________________________________|Longer road
Although both the roads reach the same destination, the straight road is shorter, and the other is longer. If then you choose the other by mistake, there is no turning back, and so you will waste some extra time if you choose the longer road. This is similar to what happens in the computer, and I hope this helped you understand better.
Also I want to cite #Simon_Weaver from the comments:
It doesn’t make fewer predictions - it makes fewer incorrect predictions. It still has to predict for each time through the loop...

I tried the same code with MATLAB 2011b with my MacBook Pro (Intel i7, 64 bit, 2.4 GHz) for the following MATLAB code:
% Processing time with Sorted data vs unsorted data
%==========================================================================
% Generate data
arraySize = 32768
sum = 0;
% Generate random integer data from range 0 to 255
data = randi(256, arraySize, 1);
%Sort the data
data1= sort(data); % data1= data when no sorting done
%Start a stopwatch timer to measure the execution time
tic;
for i=1:100000
for j=1:arraySize
if data1(j)>=128
sum=sum + data1(j);
end
end
end
toc;
ExeTimeWithSorting = toc - tic;
The results for the above MATLAB code are as follows:
a: Elapsed time (without sorting) = 3479.880861 seconds.
b: Elapsed time (with sorting ) = 2377.873098 seconds.
The results of the C code as in #GManNickG I get:
a: Elapsed time (without sorting) = 19.8761 sec.
b: Elapsed time (with sorting ) = 7.37778 sec.
Based on this, it looks MATLAB is almost 175 times slower than the C implementation without sorting and 350 times slower with sorting. In other words, the effect (of branch prediction) is 1.46x for MATLAB implementation and 2.7x for the C implementation.

The assumption by other answers that one needs to sort the data is not correct.
The following code does not sort the entire array, but only 200-element segments of it, and thereby runs the fastest.
Sorting only k-element sections completes the pre-processing in linear time, O(n), rather than the O(n.log(n)) time needed to sort the entire array.
#include <algorithm>
#include <ctime>
#include <iostream>
int main() {
int data[32768]; const int l = sizeof data / sizeof data[0];
for (unsigned c = 0; c < l; ++c)
data[c] = std::rand() % 256;
// sort 200-element segments, not the whole array
for (unsigned c = 0; c + 200 <= l; c += 200)
std::sort(&data[c], &data[c + 200]);
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i) {
for (unsigned c = 0; c < sizeof data / sizeof(int); ++c) {
if (data[c] >= 128)
sum += data[c];
}
}
std::cout << static_cast<double>(clock() - start) / CLOCKS_PER_SEC << std::endl;
std::cout << "sum = " << sum << std::endl;
}
This also "proves" that it has nothing to do with any algorithmic issue such as sort order, and it is indeed branch prediction.

Bjarne Stroustrup's Answer to this question:
That sounds like an interview question. Is it true? How would you know? It is a bad idea to answer questions about efficiency without first doing some measurements, so it is important to know how to measure.
So, I tried with a vector of a million integers and got:
Already sorted 32995 milliseconds
Shuffled 125944 milliseconds
Already sorted 18610 milliseconds
Shuffled 133304 milliseconds
Already sorted 17942 milliseconds
Shuffled 107858 milliseconds
I ran that a few times to be sure. Yes, the phenomenon is real. My key code was:
void run(vector<int>& v, const string& label)
{
auto t0 = system_clock::now();
sort(v.begin(), v.end());
auto t1 = system_clock::now();
cout << label
<< duration_cast<microseconds>(t1 — t0).count()
<< " milliseconds\n";
}
void tst()
{
vector<int> v(1'000'000);
iota(v.begin(), v.end(), 0);
run(v, "already sorted ");
std::shuffle(v.begin(), v.end(), std::mt19937{ std::random_device{}() });
run(v, "shuffled ");
}
At least the phenomenon is real with this compiler, standard library, and optimizer settings. Different implementations can and do give different answers. In fact, someone did do a more systematic study (a quick web search will find it) and most implementations show that effect.
One reason is branch prediction: the key operation in the sort algorithm is “if(v[i] < pivot]) …” or equivalent. For a sorted sequence that test is always true whereas, for a random sequence, the branch chosen varies randomly.
Another reason is that when the vector is already sorted, we never need to move elements to their correct position. The effect of these little details is the factor of five or six that we saw.
Quicksort (and sorting in general) is a complex study that has attracted some of the greatest minds of computer science. A good sort function is a result of both choosing a good algorithm and paying attention to hardware performance in its implementation.
If you want to write efficient code, you need to know a bit about machine architecture.

This question is rooted in branch prediction models on CPUs. I'd recommend reading this paper:
Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache (But real CPUs these days still don't make multiple taken branch-predictions per clock cycle, except for Haswell and later effectively unrolling tiny loops in its loop buffer. Modern CPUs can predict multiple branches not-taken to make use of their fetches in large contiguous blocks.)
When you have sorted elements, branch prediction easily predicts correctly except right at the boundary, letting instructions flow through the CPU pipeline efficiently, without having to rewind and take the correct path on mispredictions.

An answer for quick and simple understanding (read the others for more details)
This concept is called branch prediction
Branch prediction is an optimization technique that predicts the path the code will take before it is known with certainty. This is important because during the code execution, the machine prefetches several code statements and stores them in the pipeline.
The problem arises in conditional branching, where there are two possible paths or parts of the code that can be executed.
When the prediction was true, the optimization technique worked out.
When the prediction was false, to explain it in a simple way, the code statement stored in the pipeline gets proved wrong and the actual code has to be completely reloaded, which takes up a lot of time.
As common sense suggests, predictions of something sorted are way more accurate than predictions of something unsorted.
branch prediction visualisation:
sorted
unsorted

At what point is it worth reusing arrays in Java?

How big does a buffer need to be in Java before it's worth reusing?
Or, put another way: I can repeatedly allocate, use, and discard byte[] objects OR run a pool to keep and reuse them. I might allocate a lot of small buffers that get discarded often, or a few big ones that's don't. At what size is is cheaper to pool them than to reallocate, and how do small allocations compare to big ones?
EDIT:
Ok, specific parameters. Say an Intel Core 2 Duo CPU, latest VM version for OS of choice. This questions isn't as vague as it sounds... a little code and a graph could answer it.
EDIT2:
You've posted a lot of good general rules and discussions, but the question really asks for numbers. Post 'em (and code too)! Theory is great, but the proof is the numbers. It doesn't matter if results vary some from system to system, I'm just looking for a rough estimate (order of magnitude). Nobody seems to know if the performance difference will be a factor of 1.1, 2, 10, or 100+, and this is something that matters. It is important for any Java code working with big arrays -- networking, bioinformatics, etc.
Suggestions to get a good benchmark:
Warm up code before running it in the benchmark. Methods should all be called at least 1000 10000 times to get full JIT optimization.
Make sure benchmarked methods run for at least 1 10 seconds and use System.nanotime if possible, to get accurate timings.
Run benchmark on a system that is only running minimal applications
Run benchmark 3-5 times and report all times, so we see how consistent it is.
I know this is a vague and somewhat demanding question. I will check this question regularly, and answers will get comments and rated up consistently. Lazy answers will not (see below for criteria). If I don't have any answers that are thorough, I'll attach a bounty. I might anyway, to reward a really good answer with a little extra.
What I know (and don't need repeated):
Java memory allocation and GC are fast and getting faster.
Object pooling used to be a good optimization, but now it hurts performance most of the time.
Object pooling is "not usually a good idea unless objects are expensive to create." Yadda yadda.
What I DON'T know:
How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
How does allocation size effect allocation rate?
What's the break-even point for number/size of allocations vs. re-use in a pool?
Routes to an ACCEPTED answer (the more the better):
A recent whitepaper showing figures for allocation & GC on modern CPUs (recent as in last year or so, JVM 1.6 or later)
Code for a concise and correct micro-benchmark I can run
Explanation of how and why the allocations impact performance
Real-world examples/anecdotes from testing this kind of optimization
The Context:
I'm working on a library adding LZF compression support to Java. This library extends the H2 DBMS LZF classes, by adding additional compression levels (more compression) and compatibility with the byte streams from the C LZF library. One of the things I'm thinking about is whether or not it's worth trying to reuse the fixed-size buffers used to compress/decompress streams. The buffers may be ~8 kB, or ~32 kB, and in the original version they're ~128 kB. Buffers may be allocated one or more times per stream. I'm trying to figure out how I want to handle buffers to get the best performance, with an eye toward potentially multithreading in the future.
Yes, the library WILL be released as open source if anyone is interested in using this.

If you want a simple answer, it is that there is no simple answer. No amount of calling answers (and by implication people) "lazy" is going to help.
How fast should I expect memory allocations to run (MB/s) on a standard modern CPU?
At the speed at which the JVM can zero memory, assuming that the allocation does not trigger a garbage collection. If it does trigger garbage collection, it is impossible to predict without knowing what GC algorithm is used, the heap size and other parameters, and an analysis of the application's working set of non-garbage objects over the lifetime of the app.
How does allocation size effect allocation rate?
See above.
What's the break-even point for number/size of allocations vs. re-use in a pool?
If you want a simple answer, it is that there is no simple answer.
The golden rule is, the bigger your heap is (up to the amount of physical memory available), the smaller the amortized cost of GC'ing a garbage object. With a fast copying garbage collector, the amortized cost of freeing a garbage object approaches zero as the heap gets larger. The cost of the GC is actually determined by (in simplistic terms) the number and size of non-garbage objects that the GC has to deal with.
Under the assumption that your heap is large, the lifecycle cost of allocating and GC'ing a large object (in one GC cycle) approaches the cost of zeroing the memory when the object is allocated.
EDIT: If all you want is some simple numbers, write a simple application that allocates and discards large buffers and run it on your machine with various GC and heap parameters and see what happens. But beware that this is not going to give you a realistic answer because real GC costs depend on an application's non-garbage objects.
I'm not going to write a benchmark for you because I know that it would give you bogus answers.
EDIT 2: In response to the OP's comments.
So, I should expect allocations to run about as fast as System.arraycopy, or a fully JITed array initialization loop (about 1GB/s on my last bench, but I'm dubious of the result)?
Theoretically yes. In practice, it is difficult to measure in a way that separates the allocation costs from the GC costs.
By heap size, are you saying allocating a larger amount of memory for JVM use will actually reduce performance?
No, I'm saying it is likely to increase performance. Significantly. (Provided that you don't run into OS-level virtual memory effects.)
Allocations are just for arrays, and almost everything else in my code runs on the stack. It should simplify measuring and predicting performance.
Maybe. Frankly, I think that you are not going to get much improvement by recycling buffers.
But if you are intent on going down this path, create a buffer pool interface with two implementations. The first is a real thread-safe buffer pool that recycles buffers. The second is dummy pool which simply allocates a new buffer each time alloc is called, and treats dispose as a no-op. Finally, allow the application developer to choose between the pool implementations via a setBufferPool method and/or constructor parameters and/or runtime configuration properties. The application should also be able to supply a buffer pool class / instance of its own making.

When it is larger than young space.
If your array is larger than the thread-local young space, it is directly allocated in the old space. Garbage collection on the old space is way slower than on the young space. So if your array is larger than the young space, it might make sense to reuse it.
On my machine, 32kb exceeds the young space. So it would make sense to reuse it.

You've neglected to mention anything about thread safety. If it's going to be reused by multiple threads you'll have to worry about synchronization.

An answer from a completely different direction: let the user of your library decide.
Ultimately, however optimized you make your library, it will only be a component of a larger application. And if that larger application makes infrequent use of your library, there's no reason that it should pay to maintain a pool of buffers -- even if that pool is only a few hundred kilobytes.
So create your pooling mechanism as an interface, and based on some configuration parameter select the implementation that's used by your library. Set the default to be whatever your benchmark tests determine to be the best solution.1 And yes, if you use an interface you'll have to rely on the JVM being smart enough to inline calls.2
(1) By "benchmark," I mean a long-running program that exercises your library outside of a profiler, passing it a variety of inputs. Profilers are extremely useful, but so is measuring the total throughput after an hour of wall-clock time. On several different computers with differing heap sizes, and several different JVMs, running in single and multi-threaded modes.
(2) This can get you into another line of debate about the relative performance of the various invoke opcodes.

Short answer: Don't buffer.
Reasons are follow:
Don't optimize it, yet until it become a bottleneck
If you recycle it, the overhead of the pool management will be another bottleneck
Try to trust the JIT. In the latest JVM, your array may allocated in STACK rather then HEAP.
Trust me, the JRE usually do handle them faster and better then you DIY.
Keep it simple, for easier to read and debug
When you should recycle a object:
only if is it heavy. The size of memory won't make it heavy, but native resources and CPU cycle do, which cost addition finalize and CPU cycle.
You may want to recycle them if they are "ByteBuffer" rather then byte[]

Keep in mind that cache effects will probably be more of an issue than the cost of "new int[size]" and its corresponding collection. Reusing buffers is therefore a good idea if you have good temporal locality. Reallocating the buffer instead of reusing it means you might get a different chunk of memory each time. As others mentioned, this is especially true when your buffers don't fit in the young generation.
If you allocate but then don't use the whole buffer, it also pays to reuse as you don't waste time zeroing out memory you never use.

I forgot that this is a managed-memory system.
Actually, you probably have the wrong mindset. The appropriate way to determine when it is useful is dependent on the application, system it is running on, and user usage pattern.
In other words - just profile the system, determine how much time is being spent in garbage collection as a percentage of total application time in a typical session, and see if it is worthwhile to optimize that.
You will probably find out that gc isn't even being called at all. So writing code to optimize this would be a complete waste of time.
with today's large memory space I suspect 90% of the time it isn't worth doing at all. You can't really determine this based on parameters - it is too complex. Just profile - easy and accurate.

Looking at a micro benchmark (code below) there is no appreciable difference in time on my machine regardless of the size and the times the array is used (I am not posting the times, you can easily run it on your machine :-). I suspect that this is because the garbage is alive for so short a time there is not much to do for cleanup. Array allocation should probably a call to calloc or malloc/memset. Depending on the CPU this will be a very fast operation. If the arrays survived for a longer time to make it past the initial GC area (the nursery) then the time for the one that allocated several arrays might take a bit longer.
code:
import java.util.Random;
public class Main
{
public static void main(String[] args)
{
final int size;
final int times;
size = 1024 * 128;
times = 100;
// uncomment only one of the ones below for each run
test(new NewTester(size), times);
// test(new ReuseTester(size), times);
}
private static void test(final Tester tester, final int times)
{
final long total;
// warmup
testIt(tester, 1000);
total = testIt(tester, times);
System.out.println("took: " + total);
}
private static long testIt(final Tester tester, final int times)
{
long total;
total = 0;
for(int i = 0; i < times; i++)
{
final long start;
final long end;
final int value;
start = System.nanoTime();
value = tester.run();
end = System.nanoTime();
total += (end - start);
// make sure the value is used so the VM cannot optimize too much
System.out.println(value);
}
return (total);
}
}
interface Tester
{
int run();
}
abstract class AbstractTester
implements Tester
{
protected final Random random;
{
random = new Random(0);
}
public final int run()
{
int value;
value = 0;
// make sure the random number generater always has the same work to do
random.setSeed(0);
// make sure that we have something to return so the VM cannot optimize the code out of existence.
value += doRun();
return (value);
}
protected abstract int doRun();
}
class ReuseTester
extends AbstractTester
{
private final int[] array;
ReuseTester(final int size)
{
array = new int[size];
}
public int doRun()
{
final int size;
// make sure the lookup of the array.length happens once
size = array.length;
for(int i = 0; i < size; i++)
{
array[i] = random.nextInt();
}
return (array[size - 1]);
}
}
class NewTester
extends AbstractTester
{
private int[] array;
private final int length;
NewTester(final int size)
{
length = size;
}
public int doRun()
{
final int size;
// make sure the lookup of the length happens once
size = length;
array = new int[size];
for(int i = 0; i < size; i++)
{
array[i] = random.nextInt();
}
return (array[size - 1]);
}
}

I came across this thread and, since I was implementing a Floyd-Warshall all pairs connectivity algorithm on a graph with one thousand vertices, I tried to implement it in both ways (re-using matrices or creating new ones) and check the elapsed time.
For the computation I need 1000 different matrices of size 1000 x 1000, so it seems a decent test.
My system is Ubuntu Linux with the following virtual machine.
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
Re-using matrices was about 10% slower (average running time over 5 executions 17354ms vs 15708ms. I don't know if it would still be faster in case the matrix was much bigger.
Here is the relevant code:
private void computeSolutionCreatingNewMatrices() {
computeBaseCase();
smallest = Integer.MAX_VALUE;
for (int k = 1; k <= nVertices; k++) {
current = new int[nVertices + 1][nVertices + 1];
for (int i = 1; i <= nVertices; i++) {
for (int j = 1; j <= nVertices; j++) {
if (previous[i][k] != Integer.MAX_VALUE && previous[k][j] != Integer.MAX_VALUE) {
current[i][j] = Math.min(previous[i][j], previous[i][k] + previous[k][j]);
} else {
current[i][j] = previous[i][j];
}
smallest = Math.min(smallest, current[i][j]);
}
}
previous = current;
}
}
private void computeSolutionReusingMatrices() {
computeBaseCase();
current = new int[nVertices + 1][nVertices + 1];
smallest = Integer.MAX_VALUE;
for (int k = 1; k <= nVertices; k++) {
for (int i = 1; i <= nVertices; i++) {
for (int j = 1; j <= nVertices; j++) {
if (previous[i][k] != Integer.MAX_VALUE && previous[k][j] != Integer.MAX_VALUE) {
current[i][j] = Math.min(previous[i][j], previous[i][k] + previous[k][j]);
} else {
current[i][j] = previous[i][j];
}
smallest = Math.min(smallest, current[i][j]);
}
}
matrixCopy(current, previous);
}
}
private void matrixCopy(int[][] source, int[][] destination) {
assert source.length == destination.length : "matrix sizes must be the same";
for (int i = 0; i < source.length; i++) {
assert source[i].length == destination[i].length : "matrix sizes must be the same";
System.arraycopy(source[i], 0, destination[i], 0, source[i].length);
}
}

More important than buffer size is number of allocated objects, and total memory allocated.
Is memory usage an issue at all? If it is a small app may not be worth worrying about.
The real advantage from pooling is to avoid memory fragmentation. The overhead for allocating/freeing memory is small, but the disadvantage is that if you repeatedly allocated many objects of many different sizes memory becomes more fragmented. Using a pool prevents fragmentation.

I think the answer you need is related with the 'order' (measuring space, not time!) of the algorithm.
Copy file example
By example, if you want to copy a file you need to read from an inputstream and write to an outputstream. The TIME order is O(n) because the time will be proportional to the size of the file. But the SPACE order will be O(1) because the program you'll need to do it will ocuppy a fixed ammount of memory (you'll need only one fixed buffer). In this case it's clear that it's convenient to reuse that very buffer you instantiated at the beginning of the program.
Relate the buffer policy with your algorithm execution structure
Of course, if your algoritm needs and endless supply of buffers and each buffer is a different size probably you cannot reuse them. But it gives you some clues:
try to fix the size of buffers (even
sacrifying a little bit of memory).
Try to see what's the structure of
the execution: by example, if you're
algorithm traverse some kind of tree
and you're buffers are related to
each node, maybe you only need O(log
n) buffers... so you can make an
educated guess of the space required.
Also if you need diferent buffers but
you can arrange things to share
diferent segments of the same
array... maybe it's a better
solution.
When you release a buffer you can
add it to a pool of buffers. That
pool can be a heap ordered by the
"fitting" criteria (buffers that
fit the most should be first).
What I'm trying to say is: there's no fixed answer. If you instantiated something that you can reuse... probably it's better to reuse it. The tricky part is to find how you can do it without incurring in buffer managing overhead. Here's when the algorithm analysis come in handy.
Hope it helps... :)

Representing a 100K X 100K matrix in Java

How can I store a 100K X 100K matrix in Java?
I can't do that with a normal array declaration as it is throwing a java.lang.OutofMemoryError.

The Colt library has a sparse matrix implementation for Java.
You could alternatively use Berkeley DB as your storage engine.
Now if your machine has enough actual RAM (at least 9 gigabytes free), you can increase the heap size in the Java command-line.

If the vast majority of entries in your matrix will be zero (or even some other constant value) a sparse matrix will be suitable. Otherwise it might be possible to rewrite your algorithm so that the whole matrix doesn't exist simultaneously. You could produce and consume one row at a time, for example.

Sounds like you need a sparse matrix. Others have already suggested good 3rd party implementations that may suite your needs...
Depending on your applications, you could get away without a third-party matrix library by just using a Map as a backing-store for your matrix data. Kind of...
public class SparseMatrix<T> {
private T defaultValue;
private int m;
private int n;
private Map<Integer, T> data = new TreeMap<Integer, T>();
/// create a new matrix with m rows and n columns
public SparseMatrix(int m, int n, T defaultValue) {
this.m = m;
this.n = n;
this.defaultValue = defaultValue;
}
/// set value at [i,j] (row, col)
public void setValueAt(int i, int j, T value) {
if (i >= m || j >= n || i < 0 || j < 0)
throw new IllegalArgumentException(
"index (" + i + ", " +j +") out of bounds");
data.put(i * n + j, value);
}
/// retrieve value at [i,j] (row, col)
public T getValueAt(int i, int j) {
if (i >= m || j >= n || i < 0 || j < 0)
throw new IllegalArgumentException(
"index (" + i + ", " +j +") out of bounds");
T value = data.get(i * n + j);
return value != null ? value : defaultValue;
}
}
A simple test-case illustrating the SparseMatrix' use would be:
public class SparseMatrixTest extends TestCase {
public void testMatrix() {
SparseMatrix<Float> matrix =
new SparseMatrix<Float>(100000, 100000, 0.0F);
matrix.setValueAt(1000, 1001, 42.0F);
assertTrue(matrix.getValueAt(1000,1001) == 42.0);
assertTrue(matrix.getValueAt(1001,1000) == 0.0);
}
}
This is not the most efficient way of doing it because every non-default entry in the matrix is stored as an Object. Depending on the number of actual values you are expecting, the simplicity of this approach might trump integrating a 3rd-party solution (and possibly dealing with its License - again, depending on your situation).
Adding matrix-operations like multiplication to the above SparseMatrix implementation should be straight-forward (and is left as an exercise for the reader ;-)

100,000 x 100,000 = 10,000,000,000 (10 billion) entries. Even if you're storing single byte entries, that's still in the vicinity of 10 GB - does your machine even have that much physical memory, let alone have a will to allocate that much to a single process?
Chances are you're going to need to look into some kind of a way to only keep part of the matrix in memory at any given time, and the rest buffered on disk.

There are a number possible solutions depending on how much memory you have, how sparse the array actually is, and what the access patterns are going to be.
If the calculation of 100K * 100K * 8 is less than the amount of physical memory on your machine for use by the JVM, a simple non-sparse array is viable solution.
If the array is sparse, with (say) 75% or more of the elements being zero, then you can save space by using a sparse array library. Various alternatives have been suggested, but in all cases, you still need to work out if this is going to give you enough savings. Figure out how many non-zero elements there are going to be, multiply that by 8 (to give you doubles) and (say) 4 to account for the overheads of the sparse array. If that is less than the amount of physical memory that you can make available to the JVM, then sparse arrays are a viable solution.
If sparse and non-sparse arrays (in memory) won't work, things will get more complicated, and the viability of any solution will depend on the access patterns for the array data.
One approach is to represent the array as a file that is mapped into memory in the form of a MappedByteBuffer. Assuming that you don't have enough physical memory to store the entire file in memory, you are going to be hitting the virtual memory system hard. So it is best if your algorithm only needs to operate on contiguous sections of the array at any time. Otherwise, you'll probably die from swapping.
A second approach is a variation of the first. Map the array/file a section at a time, and when you are done, unmap and move to the next section. This only works if the algorithm works on the array in sections.
A third approach is to represent the array using a light-weight database like BDB. This will be slower than any in-memory solution because reading array elements will translate into disc accesses. But if you get it wrong it won't kill the system like the memory mapped approach will. (And if you do this on Linux/Unix, the system's disc block cache may speed things up, depending on your algorithm's array access patterns)
A fourth approach is to use a distributed memory cache. This replaces disc i/o with network i/o, and it is hard to say whether this is a good or bad thing.
A fifth approach is to analyze your algorithm and see if it is amenable to implementing as a distributed algorithm; e.g. with sections of the array and corresponding parts of the algorithm on different machines.

You can upgrade to this machine:
http://www.azulsystems.com/products/compute_appliance.htm
864 processor cores and 768 GB of memory, only costs a single family house somewhere.

Well, I'd suggest that you increase the memory in your jvm but you've going to need a lot of memory, as you're talking about 10 billion items. It's (barely) possible with lots of memory or a clustered jvm, but that's probably the wrong answer.
You're getting the outOfmemory because if you declare int[1000], the memory is allocated immediately (additionally doubles take up more space than ints-an int representation will also save you space). Maybe you can substitute a more efficient implementation of your array (if you have many empty entries lookup "sparse matrix" representations).
You could store pieces in an outside system, like memcached or memory-mapped buffers.
There are lots of good suggestions here, maybe if you posted a more detailed description of the problem you're trying to solve people could be more specific.

You should try an "external" package to handle matrices, I never did that though, maybe something like jama.

Unless you have 100K x 100K x 8 ~ 80GB of memory, you cannot create this matrix in memory. You can create this matrix on disk and access it using memory mapping. However, using this approach will be very slow.
What are you trying to do? You may find that representing your data in a different way will be much more efficient.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.