It seems common practice to extract a constant empty array return value into a static constant. Like here:
public class NoopParser implements Parser {
    private static final String[] EMPTY_ARRAY = new String[0];

    @Override public String[] supportedSchemas() {
        return EMPTY_ARRAY;
    }
    // ...
}
Presumably this is done for performance reasons, since returning new String[0] directly would create a new array object every time the method is called – but would it really?
I've been wondering if there really is a measurable performance benefit in doing this or if it's just outdated folk wisdom. An empty array is immutable. Is the VM not able to roll all empty String arrays into one? Can the VM not make new String[0] basically free of cost?
Contrast this practice with returning an empty String: we're usually perfectly happy to write return "";, not return EMPTY_STRING;.
I benchmarked it using JMH:
private static final String[] EMPTY_STRING_ARRAY = new String[0];

@Benchmark
public void testStatic(Blackhole blackhole) {
    blackhole.consume(EMPTY_STRING_ARRAY);
}

@Benchmark
@Fork(jvmArgs = "-XX:-EliminateAllocations")
public void testStaticEliminate(Blackhole blackhole) {
    blackhole.consume(EMPTY_STRING_ARRAY);
}

@Benchmark
public void testNew(Blackhole blackhole) {
    blackhole.consume(new String[0]);
}

@Benchmark
@Fork(jvmArgs = "-XX:-EliminateAllocations")
public void testNewEliminate(Blackhole blackhole) {
    blackhole.consume(new String[0]);
}

@Benchmark
public void noop(Blackhole blackhole) {
}
Full source code.
Environment (seen after java -jar target/benchmarks.jar -f 1):
# JMH 1.11.2 (released 51 days ago)
# VM version: JDK 1.7.0_75, VM 24.75-b04
# VM invoker: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
EliminateAllocations was on by default (seen after java -XX:+PrintFlagsFinal -version | grep EliminateAllocations).
Results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.testNewEliminate thrpt 20 95912464.879 ± 3260948.335 ops/s
MyBenchmark.testNew thrpt 20 103980230.952 ± 3772243.160 ops/s
MyBenchmark.testStaticEliminate thrpt 20 206849985.523 ± 4920788.341 ops/s
MyBenchmark.testStatic thrpt 20 219735906.550 ± 6162025.973 ops/s
MyBenchmark.noop thrpt 20 1126421653.717 ± 8938999.666 ops/s
Using a constant was almost two times faster.
Turning off EliminateAllocations slowed things down a tiny bit.
I'm most interested in the actual performance difference between these two idioms in practical, real-world situations. I have no experience in micro-benchmarking (and it is probably not the right tool for such a question) but I gave it a try anyway.
This benchmark models a somewhat more typical, "realistic" setting. The returned array is just looked at and then discarded. No references hanging around, no requirement for reference equality.
One interface, two implementations:
public interface Parser {
    String[] supportedSchemas();
    void parse(String s);
}

public class NoopParserStaticArray implements Parser {
    private static final String[] EMPTY_STRING_ARRAY = new String[0];

    @Override public String[] supportedSchemas() {
        return EMPTY_STRING_ARRAY;
    }

    @Override public void parse(String s) {
        s.codePoints().count();
    }
}

public class NoopParserNewArray implements Parser {
    @Override public String[] supportedSchemas() {
        return new String[0];
    }

    @Override public void parse(String s) {
        s.codePoints().count();
    }
}
And the JMH benchmark:
import org.openjdk.jmh.annotations.Benchmark;

public class EmptyArrayBenchmark {
    private static final Parser NOOP_PARSER_STATIC_ARRAY = new NoopParserStaticArray();
    private static final Parser NOOP_PARSER_NEW_ARRAY = new NoopParserNewArray();

    @Benchmark
    public void staticEmptyArray() {
        Parser parser = NOOP_PARSER_STATIC_ARRAY;
        for (String schema : parser.supportedSchemas()) {
            parser.parse(schema);
        }
    }

    @Benchmark
    public void newEmptyArray() {
        Parser parser = NOOP_PARSER_NEW_ARRAY;
        for (String schema : parser.supportedSchemas()) {
            parser.parse(schema);
        }
    }

    // Baseline referenced in the results below.
    @Benchmark
    public void noop() {
    }
}
The result on my machine, Java 1.8.0_51 (HotSpot VM):
Benchmark Mode Cnt Score Error Units
EmptyArrayBenchmark.staticEmptyArray thrpt 60 3024653836.077 ± 37006870.221 ops/s
EmptyArrayBenchmark.newEmptyArray thrpt 60 3018798922.045 ± 33953991.627 ops/s
EmptyArrayBenchmark.noop thrpt 60 3046726348.436 ± 5802337.322 ops/s
There is no significant difference between the two approaches in this case. In fact, they are indistinguishable from the no-op case: apparently the JIT compiler recognises that the returned array is always empty and optimises the loop away entirely!
Piping parser.supportedSchemas() into the black hole instead of looping over it gives the static array instance approach a ~30% advantage, but they're definitely of the same order of magnitude:
Benchmark Mode Cnt Score Error Units
EmptyArrayBenchmark.staticEmptyArray thrpt 60 338971639.355 ± 738069.217 ops/s
EmptyArrayBenchmark.newEmptyArray thrpt 60 266936194.767 ± 411298.714 ops/s
EmptyArrayBenchmark.noop thrpt 60 3055609298.602 ± 5694730.452 ops/s
Perhaps in the end the answer is the usual "it depends". I have a hunch that in many practical scenarios, the performance benefit in factoring out the array creation is not significant.
I think it is fair to say that
if the method contract gives you the freedom to return a new empty array instance every time, and
unless you need to guard against problematic or pathological usage patterns and/or aim for theoretical max performance,
then returning new String[0] directly is fine.
Personally, I like the expressiveness and concision of return new String[0]; and not having to introduce an extra static field.
By some strange coincidence, a month after I wrote this a real performance engineer investigated the problem: see this section in Alexey Shipilёv's blog post 'Arrays of Wisdom of the Ancients':
As expected, the only effect whatsoever can be observed on a very small collection sizes, and this is only a marginal improvement over new Foo[0]. This improvement does not seem to justify caching the array in the grand scheme of things. As a teeny tiny micro-optimization, it might make sense in some tight code, but I wouldn’t care otherwise.
That settles it. I'll take the tick mark and dedicate it to Alexey.
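For context, the idiom Shipilёv's post actually examines is Collection.toArray, where the same cached-versus-fresh trade-off arises. A sketch of the two variants (class and field names are illustrative, not from the post):

import java.util.ArrayList;
import java.util.List;

public class SchemaRegistry {
    private static final String[] EMPTY = new String[0];
    private final List<String> schemas = new ArrayList<>();

    // The variant the post ends up recommending: pass a fresh zero-length
    // array and let the JDK allocate a correctly sized result.
    public String[] schemasFresh() {
        return schemas.toArray(new String[0]);
    }

    // The cached variant: saves one tiny allocation when the list is empty,
    // the "teeny tiny micro-optimization" from the quote above.
    public String[] schemasCached() {
        return schemas.toArray(EMPTY);
    }
}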
Is the VM not able to roll all empty String arrays into one?
It can't do that, because distinct empty arrays need to compare unequal with ==. Only the programmer can make this optimization.
Contrast this practice with returning an empty String: we're usually perfectly happy writing return "";.
With strings, there is no requirement that distinct string literals produce distinct strings. In every case I know of, two instances of "" will produce the same string object, but maybe there's some weird case with classloaders where that won't happen.
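A small sketch of the identity semantics in play (class name illustrative):

public class IdentityDemo {
    public static void main(String[] args) {
        // Every array creation expression yields a distinct object, so the
        // VM may not silently merge empty arrays behind the scenes.
        System.out.println(new String[0] == new String[0]); // false

        // String literals are interned, so both refer to one pooled object.
        System.out.println("" == "");                       // true
    }
}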
I will go out on a limb and say that the performance benefit, even though using a constant is much faster, is not actually relevant: the software will likely spend far more time doing other things than returning empty arrays. If the total run-time is hours, a few extra seconds spent creating arrays do not mean much. By the same logic, memory consumption is not relevant either.
The only reason I can think of for doing this is readability.
Related
I'm trying to measure whether the order of if/else branches affects performance.
For example, if
if (condition == more likely condition) {}
else /** condition == rare condition **/ {}
is faster than
if (condition == rare condition) {}
else /** condition == more likely condition **/ {}
I think maybe the JIT should be able to optimise it no matter which order I put the branches in? I couldn't find any documentation on this, though.
I tried to test it myself with the following benchmark. Based on it, I don't see strong evidence that the order matters: if it did, I would expect the throughput at bias=0.9 (where if (zeroOrOne == 1) is true with probability 0.9) to be higher than at bias=0.1 (where the else branch has probability 0.9).
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class BranchBench {
    @Param({ "0.02", "0.1", "0.9", "0.98", })
    private double bias;

    @Param("10000")
    private int count;

    private final List<Byte> randomZeroOnes = new ArrayList<>(count);

    @Setup
    public void setup() {
        Random r = new Random(12345);
        for (int c = 0; c < count; c++) {
            byte zeroOrOne = (byte) (c < (bias * count) ? 1 : 0);
            randomZeroOnes.add(zeroOrOne);
        }
        Collections.shuffle(randomZeroOnes, r);
    }

    @Benchmark
    public int static_ID_ifElse() {
        int i = 0;
        for (final Byte zeroOrOne : randomZeroOnes) {
            if (zeroOrOne == 1) {
                i++;
            } else {
                i--;
            }
        }
        return i;
    }
}
Benchmark (bias) (count) Mode Cnt Score Error Units
BranchBench.static_ID_ifElse 0.02 10000 thrpt 15 137.409 ± 1.376 ops/ms
BranchBench.static_ID_ifElse 0.1 10000 thrpt 15 129.277 ± 1.552 ops/ms
BranchBench.static_ID_ifElse 0.9 10000 thrpt 15 125.640 ± 5.858 ops/ms
BranchBench.static_ID_ifElse 0.98 10000 thrpt 15 137.427 ± 2.396 ops/ms
On modern processors I don't think the order of your conditionals really matters that much any more. As part of the instruction pipeline, processors perform what is called branch prediction: the CPU guesses which way a condition will go and pre-loads those instructions into the pipeline.
These days, processors guess correctly more than 90% of the time, so hand-written tweaking of conditional order matters less.
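The textbook way to make the predictor visible is to run the same branch over shuffled and then sorted data. A rough sketch with plain timing, so treat the numbers as indicative only; note that some JITs compile this particular branch to a conditional move, which hides the effect entirely:

import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000, 0, 256).toArray();

        // Shuffled data: the branch direction is unpredictable.
        System.out.println("shuffled: " + time(data) + " ms");

        // Sorted data: the same branch becomes almost perfectly predictable.
        Arrays.sort(data);
        System.out.println("sorted:   " + time(data) + " ms");
    }

    static long time(int[] data) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 100; i++) {
            for (int v : data) {
                if (v >= 128) sum += v; // the branch under test
            }
        }
        // Use the result so the loops cannot be optimised away.
        if (sum == 42) System.out.println("unlikely");
        return (System.nanoTime() - start) / 1_000_000;
    }
}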
There is quite a lot of literature on branch prediction:
https://dzone.com/articles/branch-prediction-in-java
https://www.baeldung.com/java-branch-prediction
I've read this post: Performance difference between Java 8 lambdas and anonymous inner classes, and the article linked there,
which said:
Lambda invocation behaves exactly as anonymous class invocation
"OK," I said, and decided to write my own benchmark. I used JMH; here it is below (I've also added a benchmark for method references).
import org.openjdk.jmh.annotations.Benchmark;

public class MyBenchmark {
    public static final int TESTS_COUNT = 100_000_000;

    @Benchmark
    public void testMethod_lambda() {
        X x = i -> test(i);
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    @Benchmark
    public void testMethod_methodRefernce() {
        X x = this::test;
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    @Benchmark
    public void testMethod_anonymous() {
        X x = new X() {
            @Override
            public void x(Long i) {
                test(i);
            }
        };
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    interface X {
        void x(Long i);
    }

    public void test(Long i) {
        if (i == null) System.out.println("never");
    }
}
And the results (on Intel Core i7 4770k) are:
Benchmark Mode Samples Score Score error Units
t.j.MyBenchmark.testMethod_anonymous thrpt 200 16,160 0,044 ops/s
t.j.MyBenchmark.testMethod_lambda thrpt 200 4,102 0,029 ops/s
t.j.MyBenchmark.testMethod_methodRefernce thrpt 200 4,149 0,022 ops/s
So, as you can see, there is a 4x difference between lambda and anonymous class invocation, with the lambda being 4x slower.
The question is: what am I doing wrong, or do I have a misunderstanding of how lambdas should perform?
EDIT:
# VM invoker: C:\Program Files\Java\jre1.8.0_31\bin\java.exe
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
The problem is in your benchmark: you are the victim of dead code elimination.
The JIT compiler is sometimes smart enough to see that the result of autoboxing is never null, so for the anonymous class it simply removed your check, which in turn made the loop body almost empty. Replace it with something less obvious (to the JIT), like this:
public void test(Long i) {
    if (i == Long.MAX_VALUE) System.out.println("never");
}
And you will observe the same performance (the anonymous class becomes slower, while lambda and method reference perform at the same level).
For the lambda/method reference it did not make the same optimization for some reason. But you should not worry: it's unlikely that you will have such a method in real code which can be optimized out completely.
In general @apangin is right: use Blackhole instead.
In addition to the issues raised by @TagirValeev, the benchmark approach you are taking is fundamentally flawed, because you are measuring a composite metric (despite your attempts not to).
The significant costs you want to measure independently are linkage, capture, and invocation. But all your tests smear together some amount of each, poisoning your results. My advice would be to focus only on invocation cost -- this is the most relevant to overall application throughput, and also the easiest to measure (because it is less influenced by caching at multiple levels).
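To illustrate the capture cost specifically: a non-capturing lambda can be linked once and reused, while a capturing lambda may need a fresh instance per evaluation. A sketch; whether the allocation really happens is an implementation detail of the current HotSpot LambdaMetafactory strategy, not a spec guarantee:

import java.util.function.Supplier;

public class CaptureDemo {
    static Supplier<String> nonCapturing() {
        return () -> "constant";            // captures nothing
    }

    static Supplier<String> capturing(String s) {
        return () -> s;                     // captures s
    }

    public static void main(String[] args) {
        // Current HotSpot caches the instance for a non-capturing lambda,
        // so repeated evaluation of the same expression returns one object.
        System.out.println(nonCapturing() == nonCapturing()); // true today

        // A capturing lambda carries state, so each evaluation (currently)
        // allocates a new instance.
        System.out.println(capturing("x") == capturing("x")); // false today
    }
}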
Bottom line: measuring performance in dynamically compiled environments is really, really hard. Even with JMH.
My question is another example of how you shouldn't do benchmarking. I've recreated my test according to the advice in the other answers here.
I hope it is now close to correct, because it shows that there isn't any significant difference between lambda and anonymous class method invocation performance. See below:
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class MyBenchmark {
    @Param({"1", "100000", "500000"})
    public int arg;

    @Benchmark
    public void testMethod_lambda(Blackhole bh) {
        X x = (i, bh2) -> test(i, bh2);
        x.x(arg, bh);
    }

    @Benchmark
    public void testMethod_methodRefernce(Blackhole bh) {
        X x = this::test;
        x.x(arg, bh);
    }

    @Benchmark
    public void testMethod_anonymous(Blackhole bh) {
        X x = new X() {
            @Override
            public void x(Integer i, Blackhole bh) {
                test(i, bh);
            }
        };
        x.x(arg, bh);
    }

    interface X {
        void x(Integer i, Blackhole bh);
    }

    public void test(Integer i, Blackhole bh) {
        bh.consume(i);
    }
}
Benchmark (arg) Mode Samples Score Score error Units
t.j.MyBenchmark.testMethod_anonymous 1 thrpt 200 415893575,928 1353627,574 ops/s
t.j.MyBenchmark.testMethod_anonymous 100000 thrpt 200 394989882,972 1429490,555 ops/s
t.j.MyBenchmark.testMethod_anonymous 500000 thrpt 200 395707755,557 1325623,340 ops/s
t.j.MyBenchmark.testMethod_lambda 1 thrpt 200 418597958,944 1098137,844 ops/s
t.j.MyBenchmark.testMethod_lambda 100000 thrpt 200 394672254,859 1593253,378 ops/s
t.j.MyBenchmark.testMethod_lambda 500000 thrpt 200 394407399,819 1373366,572 ops/s
t.j.MyBenchmark.testMethod_methodRefernce 1 thrpt 200 417249323,668 1140804,969 ops/s
t.j.MyBenchmark.testMethod_methodRefernce 100000 thrpt 200 396783159,253 1458935,363 ops/s
t.j.MyBenchmark.testMethod_methodRefernce 500000 thrpt 200 395098696,491 1682126,737 ops/s
import java.util.ArrayList;
import java.util.List;

public class IterationBenchmark {
    public static void main(String args[]) {
        List<String> persons = new ArrayList<String>();
        persons.add("AAA");
        persons.add("BBB");
        persons.add("CCC");
        persons.add("DDD");

        long timeMillis = System.currentTimeMillis();
        for (String person : persons)
            System.out.println(person);
        System.out.println("Time taken for legacy for loop : " +
                (System.currentTimeMillis() - timeMillis));

        timeMillis = System.currentTimeMillis();
        persons.stream().forEach(System.out::println);
        System.out.println("Time taken for sequence stream : " +
                (System.currentTimeMillis() - timeMillis));

        timeMillis = System.currentTimeMillis();
        persons.parallelStream().forEach(System.out::println);
        System.out.println("Time taken for parallel stream : " +
                (System.currentTimeMillis() - timeMillis));
    }
}
Output:
AAA
BBB
CCC
DDD
Time taken for legacy for loop : 0
AAA
BBB
CCC
DDD
Time taken for sequence stream : 49
CCC
DDD
AAA
BBB
Time taken for parallel stream : 3
Why is the Java 8 Stream API's performance so low compared to the legacy for loop?
The very first call to the Stream API in your program is always quite slow, because many auxiliary classes need to be loaded, many anonymous classes are generated for the lambdas, and many methods are JIT-compiled. Thus the very first Stream operation usually takes several dozen milliseconds. Consecutive calls are much faster and may fall below 1 µs depending on the exact stream operation. If you swap the parallel-stream test and the sequential-stream test, the sequential stream will be much faster: all the hard work is done by whichever comes first.
Let's write a JMH benchmark to properly warm-up your code and test all the cases independently:
import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;
import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
    List<String> persons;

    @Setup
    public void setup() {
        persons = new ArrayList<String>();
        persons.add("AAA");
        persons.add("BBB");
        persons.add("CCC");
        persons.add("DDD");
    }

    @Benchmark
    public void loop() {
        for (String person : persons)
            System.err.println(person);
    }

    @Benchmark
    public void stream() {
        persons.stream().forEach(System.err::println);
    }

    @Benchmark
    public void parallelStream() {
        persons.parallelStream().forEach(System.err::println);
    }
}
Here we have three tests: loop, stream and parallelStream. Note that I changed System.out to System.err. That's because System.out is normally used to output the JMH results. I will redirect the output of System.err to nul, so the result should depend less on my filesystem or console subsystem (which is especially slow on Windows).
So the results are (Core i7-4702MQ CPU @ 2.2GHz, 4 cores HT, Win7, Oracle JDK 1.8.0_40):
Benchmark Mode Cnt Score Error Units
StreamTest.loop avgt 30 42.410 ± 1.833 us/op
StreamTest.parallelStream avgt 30 76.440 ± 2.073 us/op
StreamTest.stream avgt 30 42.820 ± 1.389 us/op
What we see is that stream and loop produce practically the same result; the difference is statistically insignificant. Actually the Stream API is somewhat slower than the loop, but the slowest part here is the PrintStream: even with output redirected to nul, the IO subsystem is very slow compared to the other operations. So we measured not the Stream API or loop speed, but println speed.
Also note that the unit is microseconds, so the stream version actually runs about 1000 times faster than in your test.
Why is parallelStream much slower? Because you cannot parallelize writes to the same PrintStream: it is internally synchronized. So parallelStream did all the hard work of splitting the 4-element list into 4 sub-tasks, scheduling the jobs on different threads and synchronizing them properly, but it's all futile, as the slowest operation (println) cannot run in parallel: while one thread is working, the others are waiting. In general it's useless to parallelize code which synchronizes on the same mutex (which is your case).
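If the synchronized sink really is the bottleneck, the usual remedy is to keep the parallel part free of shared state and touch the sink only once, for example by collecting first. A sketch of that shape; for a 4-element list it is still pointless and only pays off when the per-element work is genuinely expensive:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSinkDemo {
    public static void main(String[] args) {
        List<String> persons = Arrays.asList("AAA", "BBB", "CCC", "DDD");

        // Contended: every println takes the PrintStream's internal lock,
        // so the worker threads mostly wait on each other.
        persons.parallelStream().forEach(System.out::println);

        // Better shape: do the (stand-in) per-element work in parallel,
        // then perform the synchronized write exactly once.
        String joined = persons.parallelStream()
                .map(String::toLowerCase)   // placeholder for real work
                .collect(Collectors.joining(System.lineSeparator()));
        System.out.println(joined);
    }
}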
I am writing a micro-benchmark to compare String concatenation using the + operator vs StringBuilder. To this end, I created a JMH benchmark class based on an OpenJDK example that uses the batchSize parameter:
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@Measurement(batchSize = 10000, iterations = 10)
@Warmup(batchSize = 10000, iterations = 10)
@Fork(1)
public class StringConcatenationBenchmark {
    private String string;
    private StringBuilder stringBuilder;

    @Setup(Level.Iteration)
    public void setup() {
        string = "";
        stringBuilder = new StringBuilder();
    }

    @Benchmark
    public void stringConcatenation() {
        string += "some more data";
    }

    @Benchmark
    public void stringBuilderConcatenation() {
        stringBuilder.append("some more data");
    }
}
When I run the benchmark I get the following error for stringBuilderConcatenation method:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at link.pellegrino.string_concatenation.StringConcatenationBenchmark.stringBuilderConcatenation(StringConcatenationBenchmark.java:29)
at link.pellegrino.string_concatenation.generated.StringConcatenationBenchmark_stringBuilderConcatenation.stringBuilderConcatenation_avgt_jmhStub(StringConcatenationBenchmark_stringBuilderConcatenation.java:165)
at link.pellegrino.string_concatenation.generated.StringConcatenationBenchmark_stringBuilderConcatenation.stringBuilderConcatenation_AverageTime(StringConcatenationBenchmark_stringBuilderConcatenation.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:430)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:412)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I was thinking that the default JVM heap size has to be increased, so I tried to allow up to 10GB using -Xmx10G value with -jvmArgs option provided by JMH. Unfortunately, I still get the error.
Consequently, I tried to reduce the value for batchSize parameter to 1 but I still get an OutOfMemoryError.
The only workaround I have found is to set the benchmark mode to Mode.SingleShotTime. Since this mode treats a batch as a single shot (even if s/op is displayed in the Units column), I get the metric I want: the average time to perform the set of batch operations. However, I still don't understand why it is not working with Mode.AverageTime.
Please also note that the benchmarks for method stringConcatenation work as expected whatever the benchmark mode is used. The issue only occurs with stringBuilderConcatenation method that makes use of StringBuilder.
Any help to understand why the previous example is not working with Benchmark mode set to Mode.AverageTime is welcome.
JMH version I used is 1.10.4.
You're right that Mode.SingleShotTime is what you need: it measures the time of a single batch. With Mode.AverageTime, your iteration still runs until the iteration time elapses (1 second by default). It measures the time per single batch (only batches which fully finish during the execution time are counted), so the final results differ, but the execution time is the same.
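For reference, a sketch of the SingleShotTime configuration that yields the intended metric (time per batch of 10000 appends); since each iteration is now exactly one batch, the Level.Iteration setup resets the state before every batch:

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.SingleShotTime)              // one iteration == one batch
@Measurement(batchSize = 10000, iterations = 10)
@Warmup(batchSize = 10000, iterations = 10)
@Fork(1)
public class StringBuilderSingleShot {
    private StringBuilder stringBuilder;

    @Setup(Level.Iteration)                      // now runs before each batch
    public void setup() {
        stringBuilder = new StringBuilder();
    }

    @Benchmark
    public void stringBuilderConcatenation() {
        stringBuilder.append("some more data");
    }
}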
Another problem is that @Setup(Level.Iteration) forces the setup to be executed before every iteration, but not before every batch. Thus your strings are not actually limited by the batch size. The String version does not cause the OutOfMemoryError only because it's much slower than StringBuilder, so within the 1 second it manages to build a much shorter string.
A not very beautiful way to fix your benchmark (while still using average time mode and the batchSize parameter) is to reset the string/stringBuilder manually:
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(batchSize = 10000, iterations = 10)
@Warmup(batchSize = 10000, iterations = 10)
@Fork(1)
public class StringConcatenationBenchmark {
    private static final String S = "some more data";
    private static final int maxLen = S.length() * 10000;

    private String string;
    private StringBuilder stringBuilder;

    @Setup(Level.Iteration)
    public void setup() {
        string = "";
        stringBuilder = new StringBuilder();
    }

    @Benchmark
    public void stringConcatenation() {
        if (string.length() >= maxLen) string = "";
        string += S;
    }

    @Benchmark
    public void stringBuilderConcatenation() {
        if (stringBuilder.length() >= maxLen) stringBuilder = new StringBuilder();
        stringBuilder.append(S);
    }
}
Here are the results on my box (i5 3340, 4GB RAM, 64-bit Win7, JDK 1.8.0_45):
Benchmark Mode Cnt Score Error Units
stringBuilderConcatenation avgt 10 145.997 ± 2.301 us/op
stringConcatenation avgt 10 324878.341 ± 39824.738 us/op
So you can see that only about 3 batches fit into one second for stringConcatenation (1e6/324878), while for stringBuilderConcatenation thousands of batches can execute, resulting in an enormous string and, ultimately, the OutOfMemoryError.
I don't know why adding more memory doesn't work for you; for me, -Xmx4G is enough to run the stringBuilder test of your original benchmark. Probably your box is faster, so the resulting string is even longer. Note that for a very big string you can hit the array size limit (about 2 billion elements) even if you have enough memory. Check the exception stacktrace after adding the memory: is it the same? If you hit the array size limit, it will still be an OutOfMemoryError, but the stacktrace will be slightly different. Anyway, even with enough memory the results of your benchmark would be incorrect (both for String and StringBuilder).
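For reference, the array-size-limit flavour of the failure can be provoked directly; on HotSpot the message typically differs from plain heap exhaustion (a sketch, the exact wording varies by VM):

public class ArrayLimitDemo {
    public static void main(String[] args) {
        // On HotSpot this usually fails with
        // "java.lang.OutOfMemoryError: Requested array size exceeds VM limit"
        // no matter how much heap is configured.
        char[] c = new char[Integer.MAX_VALUE];
        System.out.println(c.length);
    }
}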
I'm getting weird results I can't explain from a BufferedReader when I vary the size of the buffer.
I had strongly expected that performance would gradually increase as I increased the size of the buffer, with diminishing returns setting in fairly quickly, and that thereafter performance would be more or less flat. But it seems that, after only a very modest buffer size, increasing the size of the buffer makes it slower.
Here's a minimal example. All it does is run through a text file, and calculate the sum of the lengths of the lines.
public int traverseFile(int bufSize) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader("words16"), bufSize * 1024);
    String line;
    int total = 0;
    while ((line = reader.readLine()) != null)
        total += line.length();
    reader.close();
    return total;
}
I tried benchmarking this with various buffer sizes, and the results were rather odd. Up to about 256KB, performance increases; after that point, it gets worse. I wondered whether it was just the time taken to allocate the buffer, so I tried adding something in to make it always allocate the same total amount of memory (see second line below):
public int traverseFile(int bufSize) throws IOException {
    byte[] pad = new byte[(65536 - bufSize) * 1024];
    BufferedReader reader = new BufferedReader(new FileReader("words16"), bufSize * 1024);
    String line;
    int total = 0;
    while ((line = reader.readLine()) != null)
        total += line.length();
    reader.close();
    return total;
}
This makes no odds. I am still getting the same results, on two different machines. Here are the full results:
Benchmark Mode Samples Score Error Units
j.t.BufferSizeBenchmark.traverse_test1_4K avgt 100 363.987 ± 1.901 ms/op
j.t.BufferSizeBenchmark.traverse_test2_16K avgt 100 356.551 ± 0.330 ms/op
j.t.BufferSizeBenchmark.traverse_test3_64K avgt 100 353.462 ± 0.557 ms/op
j.t.BufferSizeBenchmark.traverse_test4_256K avgt 100 350.822 ± 0.562 ms/op
j.t.BufferSizeBenchmark.traverse_test5_1024K avgt 100 356.949 ± 0.338 ms/op
j.t.BufferSizeBenchmark.traverse_test6_4096K avgt 100 358.377 ± 0.388 ms/op
j.t.BufferSizeBenchmark.traverse_test7_16384K avgt 100 367.890 ± 0.393 ms/op
j.t.BufferSizeBenchmark.traverse_test8_65536K avgt 100 363.271 ± 0.228 ms/op
As you can see, the sweet spot is at about 256KB. The difference isn't huge, but it is certainly measurable.
All I can think is that this might be something to do with the memory cache. Is it because the RAM that's being written to is further away from the RAM that's being read? But if it's a cyclic buffer, I'm not even sure that's true: what's being written will be just behind what's being read.
The words16 file is 80MB so I can't post it here, but it's Fedora's standard /usr/share/dict/words file, sixteen times over. I can find a way to post a link if necessary.
Here's the benchmarking code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(1)
@Warmup(iterations = 30, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 100, time = 10000, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class BufferSizeBenchmark {

    public int traverseFile(int bufSize) throws IOException {
        byte[] pad = new byte[(65536 - bufSize) * 1024];
        BufferedReader reader = new BufferedReader(new FileReader("words16"), bufSize * 1024);
        String line;
        int total = 0;
        while ((line = reader.readLine()) != null)
            total += line.length();
        reader.close();
        return total;
    }

    @Benchmark
    public int traverse_test1_4K() throws IOException {
        return traverseFile(4);
    }

    @Benchmark
    public int traverse_test2_16K() throws IOException {
        return traverseFile(16);
    }

    @Benchmark
    public int traverse_test3_64K() throws IOException {
        return traverseFile(64);
    }

    @Benchmark
    public int traverse_test4_256K() throws IOException {
        return traverseFile(256);
    }

    @Benchmark
    public int traverse_test5_1024K() throws IOException {
        return traverseFile(1024);
    }

    @Benchmark
    public int traverse_test6_4096K() throws IOException {
        return traverseFile(4096);
    }

    @Benchmark
    public int traverse_test7_16384K() throws IOException {
        return traverseFile(16384);
    }

    @Benchmark
    public int traverse_test8_65536K() throws IOException {
        return traverseFile(65536);
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + BufferSizeBenchmark.class.getSimpleName() + ".*")
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}
Why am I getting worse performance when I increase the size of the buffer?
This is most likely an effect of the CPU cache size. Since the cache uses an LRU eviction policy, using too large a buffer causes what was written to the beginning of the buffer to be evicted before you get a chance to read it.
256k is a typical CPU cache size! What type of CPU did you test on?
So what happens is: if you read chunks of 256k or smaller, the content that was written to the buffer is still in the CPU cache when the read accesses it. If you have chunks greater than 256k, then only the last 256k that were read are in the CPU cache, so when the read starts from the beginning the content must be retrieved from main memory.
The second problem is the buffer allocation. The trick with the padding buffer is clever, but does not really average out the allocation cost. The reason is that the real cost of allocation is not reserving the memory but clearing it. Furthermore, the OS may postpone mapping in the real memory until it is first accessed. But you never access the padding buffer.
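To see the cache-size effect in isolation, one can sweep the working-set size with a simple strided read loop and watch the cost per access step up as each cache level is exceeded. A rough sketch with plain timing, indicative only:

public class CacheSweepDemo {
    public static void main(String[] args) {
        // Working sets from 4 KB to 64 MB; the per-access cost typically
        // jumps as the set outgrows L1, then L2, then L3.
        for (int kb = 4; kb <= 64 * 1024; kb *= 4) {
            int[] data = new int[kb * 1024 / 4];   // length is a power of two
            int mask = data.length - 1;
            long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < 100_000_000; i++) {
                sum += data[(i * 16) & mask];      // stride of 64 bytes
            }
            long ns = System.nanoTime() - start;
            System.out.printf("%6d KB: %.2f ns/access (sum=%d)%n",
                    kb, ns / 100_000_000.0, sum);
        }
    }
}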