I am writing a micro-benchmark to compare String concatenation using the + operator vs. StringBuilder. To this end, I created a JMH benchmark class, based on the OpenJDK example, that uses the batchSize parameter:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@Measurement(batchSize = 10000, iterations = 10)
@Warmup(batchSize = 10000, iterations = 10)
@Fork(1)
public class StringConcatenationBenchmark {

    private String string;
    private StringBuilder stringBuilder;

    @Setup(Level.Iteration)
    public void setup() {
        string = "";
        stringBuilder = new StringBuilder();
    }

    @Benchmark
    public void stringConcatenation() {
        string += "some more data";
    }

    @Benchmark
    public void stringBuilderConcatenation() {
        stringBuilder.append("some more data");
    }
}
When I run the benchmark I get the following error for the stringBuilderConcatenation method:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at link.pellegrino.string_concatenation.StringConcatenationBenchmark.stringBuilderConcatenation(StringConcatenationBenchmark.java:29)
at link.pellegrino.string_concatenation.generated.StringConcatenationBenchmark_stringBuilderConcatenation.stringBuilderConcatenation_avgt_jmhStub(StringConcatenationBenchmark_stringBuilderConcatenation.java:165)
at link.pellegrino.string_concatenation.generated.StringConcatenationBenchmark_stringBuilderConcatenation.stringBuilderConcatenation_AverageTime(StringConcatenationBenchmark_stringBuilderConcatenation.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:430)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:412)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I was thinking that the default JVM heap size had to be increased, so I tried allowing up to 10 GB using -Xmx10G via the -jvmArgs option provided by JMH. Unfortunately, I still get the error.
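For reference, the same setting can also be passed programmatically through JMH's OptionsBuilder API (a sketch; the launcher class name is mine):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkLauncher {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(StringConcatenationBenchmark.class.getSimpleName())
                .jvmArgs("-Xmx10G")   // same effect as -jvmArgs on the command line
                .build();
        new Runner(opt).run();
    }
}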
Consequently, I tried reducing the batchSize parameter to 1, but I still get an OutOfMemoryError.
The only workaround I have found is to set the benchmark mode to Mode.SingleShotTime. Since this mode seems to treat a batch as a single shot (even though s/op is displayed in the Units column), I seem to get the metric I want: the average time to perform the batch of operations. However, I still don't understand why it does not work with Mode.AverageTime.
Please also note that the benchmarks for the stringConcatenation method work as expected whatever benchmark mode is used. The issue only occurs with the stringBuilderConcatenation method, which makes use of StringBuilder.
Any help understanding why the example above does not work with the benchmark mode set to Mode.AverageTime is welcome.
The JMH version I used is 1.10.4.
You're right that Mode.SingleShotTime is what you need: it measures the time of a single batch. With Mode.AverageTime, each iteration keeps running until the iteration time elapses (1 second by default); it reports the time per single batch execution (only batches that finish completely within the iteration are counted), so the final results differ, but the total execution time is the same.
Another problem is that @Setup(Level.Iteration) forces setup to be executed before every iteration, not before every batch (JMH's setup levels are Trial, Iteration and Invocation; there is no per-batch level). Thus your strings are not actually limited by the batch size. The String version does not cause the OutOfMemoryError only because it's much slower than StringBuilder, so within the one-second iteration it can only build a much shorter string.
A not-very-pretty way to fix your benchmark (while still using average-time mode and the batchSize parameter) is to reset the string/stringBuilder manually:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Measurement(batchSize = 10000, iterations = 10)
@Warmup(batchSize = 10000, iterations = 10)
@Fork(1)
public class StringConcatenationBenchmark {

    private static final String S = "some more data";
    private static final int maxLen = S.length() * 10000;

    private String string;
    private StringBuilder stringBuilder;

    @Setup(Level.Iteration)
    public void setup() {
        string = "";
        stringBuilder = new StringBuilder();
    }

    @Benchmark
    public void stringConcatenation() {
        if (string.length() >= maxLen) string = "";
        string += S;
    }

    @Benchmark
    public void stringBuilderConcatenation() {
        if (stringBuilder.length() >= maxLen) stringBuilder = new StringBuilder();
        stringBuilder.append(S);
    }
}
Here are the results on my box (i5 3340, 4 GB RAM, 64-bit Win7, JDK 1.8.0_45):
Benchmark Mode Cnt Score Error Units
stringBuilderConcatenation avgt 10 145.997 ± 2.301 us/op
stringConcatenation avgt 10 324878.341 ± 39824.738 us/op
So you can see that only about 3 batches fit into one second for stringConcatenation (1e6/324878), while for stringBuilderConcatenation thousands of batches can be executed, resulting in an enormous string and, finally, the OutOfMemoryError.
I don't know why adding more memory doesn't work for you; for me, -Xmx4G is enough to run the stringBuilder test of your original benchmark. Probably your box is faster, so the resulting string is even longer. Note that for a very big string you can hit the array size limit (about 2 billion elements) even if you have enough memory. Check the exception stack trace after adding the memory: is it the same? If you hit the array size limit, it will still be an OutOfMemoryError, but the stack trace will be slightly different (typically "Requested array size exceeds VM limit"). Anyway, even with enough memory the results of your benchmark will be incorrect (both for String and StringBuilder).
Related
If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Let's say, for example, we have to count from 1 to Integer.MAX_VALUE. In the following program on my multicore machine, the two-thread version completes in almost half the time of the single-thread version. What if we ran this on a single-core machine? And is there any way we could achieve the same result there?
class Demonstration {
    public static void main(String[] args) throws InterruptedException {
        SumUpExample.runTest();
    }
}

class SumUpExample {
    long startRange;
    long endRange;
    long counter = 0;
    static long MAX_NUM = Integer.MAX_VALUE;

    public SumUpExample(long startRange, long endRange) {
        this.startRange = startRange;
        this.endRange = endRange;
    }

    public void add() {
        for (long i = startRange; i <= endRange; i++) {
            counter += i;
        }
    }

    static public void twoThreads() throws InterruptedException {
        long start = System.currentTimeMillis();
        SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
        SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);

        Thread t1 = new Thread(() -> {
            s1.add();
        });
        Thread t2 = new Thread(() -> {
            s2.add();
        });

        t1.start();
        t2.start();
        t1.join();
        t2.join();

        long finalCount = s1.counter + s2.counter;
        long end = System.currentTimeMillis();
        System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
    }

    static public void oneThread() {
        long start = System.currentTimeMillis();
        SumUpExample s = new SumUpExample(1, MAX_NUM);
        s.add();
        long end = System.currentTimeMillis();
        System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
    }

    public static void runTest() throws InterruptedException {
        oneThread();
        twoThreads();
    }
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540
For a purely CPU-bound operation you are correct. Most (99.9999%) of programs need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading primarily benefits you when two conditions are met:
the operations are CPU-intensive, and
they can be efficiently divided into independent subsets;
or when you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.
In the following program on my multicore machine, the two-thread version completes in almost half the time of the single-thread version.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
Your benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
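For illustration, a JMH version of your experiment might look roughly like this (a sketch; it assumes the SumUpExample class from your code is on the classpath in the same package, and returning the counters stops the JIT from discarding the loops):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
public class SumBenchmark {
    static final long MAX_NUM = Integer.MAX_VALUE;

    @Benchmark
    public long oneThread() {
        SumUpExample s = new SumUpExample(1, MAX_NUM);
        s.add();
        return s.counter;   // returned values are consumed by JMH
    }

    @Benchmark
    public long twoThreads() throws InterruptedException {
        SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
        SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
        Thread t1 = new Thread(s1::add);
        Thread t2 = new Thread(s2::add);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return s1.counter + s2.counter;
    }
}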
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to correct the flaws above.
You are running on a system where hardware hyper-threading1 is disabled2.
Then ... I would expect the two-thread version to take somewhat more time than the one-thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
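A crude way to get a feel for that overhead on your own machine (subject to all the benchmarking caveats linked above; just a sketch):

public class ThreadOverhead {
    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        Thread t = new Thread(() -> { });   // empty body: measures pure start/join cost
        t.start();
        t.join();
        long t1 = System.nanoTime();
        System.out.println("start+join took " + ((t1 - t0) / 1000) + " us");
    }
}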
And is there any way we could achieve the same result there?
Not sure what you are asking here. But if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyperthreads. Each hyperthread has its own set of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?
I need to read a file one character at a time and I'm using the read() method from BufferedReader. *
I found that read() is about 10x slower than readLine(). Is this expected? Or am I doing something wrong?
Here's a benchmark with Java 7. The input test file has about 5 million lines and 254 million characters (~242 MB) **:
The read() method takes about 7000 ms to read all the characters:
@Test
public void testRead() throws IOException, UnindexableFastaFileException {
    BufferedReader fa = new BufferedReader(new FileReader(new File("chr1.fa")));

    long t0 = System.currentTimeMillis();
    int c;
    while ((c = fa.read()) != -1) {
        // empty body
    }
    long t1 = System.currentTimeMillis();
    System.err.println(t1 - t0); // ~ 7000 ms
}
The readLine() method takes only ~700 ms:
@Test
public void testReadLine() throws IOException {
    BufferedReader fa = new BufferedReader(new FileReader(new File("chr1.fa")));

    String line;
    long t0 = System.currentTimeMillis();
    while ((line = fa.readLine()) != null) {
        // empty body
    }
    long t1 = System.currentTimeMillis();
    System.err.println(t1 - t0); // ~ 700 ms
}
* Practical purpose: I need to know the length of each line, including the newline characters (\n or \r\n), AND the line length after stripping them. I also need to know whether a line starts with the > character. For a given file this is done only once, at the start of the program. Since EOL chars are not returned by BufferedReader.readLine(), I'm resorting to the read() method. If there are better ways of doing this, please say.
** The gzipped file is here http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz. For those who may be wondering, I'm writing a class to index fasta files.
The important thing when analyzing performance is to have a valid benchmark before you start. So let's start with a simple JMH benchmark that shows what our expected performance after warmup would be.
One thing we have to consider is that modern operating systems like to cache file data that is accessed regularly, so we need some way to clear the caches between tests. On Windows there's a small utility that does just this; on Linux you should be able to do it by writing to a pseudo-file (see below).
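On Linux, a hypothetical drop-in for the clearFileCaches() helper used in the benchmark below could shell out to the drop_caches pseudo-file (requires root; a sketch only):

private void clearFileCaches() throws IOException, InterruptedException {
    // sync flushes dirty pages; "echo 3" drops the page cache, dentries and inodes
    ProcessBuilder pb = new ProcessBuilder(
            "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches");
    pb.inheritIO();
    pb.start().waitFor();
}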
The code then looks as follows:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
public class IoPerformanceBenchmark {
    private static final String FILE_PATH = "test.fa";

    @Benchmark
    public int readTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            int value;
            while ((value = reader.read()) != -1) {
                result += value;
            }
        }
        return result;
    }

    @Benchmark
    public int readLineTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result += line.chars().sum();
            }
        }
        return result;
    }

    private void clearFileCaches() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("EmptyStandbyList.exe", "standbylist");
        pb.inheritIO();
        pb.start().waitFor();
    }
}
and if we run it with
chcp 65001 # set codepage to utf-8
mvn clean install; java "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar
we get the following results (about 2 seconds are needed to clear the caches for me, and I'm running this on an HDD, so that's why it's a good deal slower than for you):
Benchmark Mode Cnt Score Error Units
IoPerformanceBenchmark.readLineTest avgt 20 3.749 ± 0.039 s/op
IoPerformanceBenchmark.readTest avgt 20 3.745 ± 0.023 s/op
Surprise! As expected, there's no performance difference here at all after the JVM has settled into a stable state. But there is one outlier in the readTest method:
# Warmup Iteration 1: 6.186 s/op
# Warmup Iteration 2: 3.744 s/op
which is exactly the problem you're seeing. The most likely reason I can think of is that OSR isn't doing a good job here, or that the JIT kicks in too late to make a difference on the first iteration.
Depending on your use case this might be a big problem or negligible (if you're reading a thousand files it won't matter, if you're only reading one this is a problem).
Solving such a problem is not easy and there is no general solution, although there are ways to handle it. One easy test to see whether we're on the right track is to run the code with the -Xcomp option, which forces HotSpot to compile every method on its first invocation. And indeed, doing so makes the large delay at the first invocation disappear:
# Warmup Iteration 1: 3.965 s/op
# Warmup Iteration 2: 3.753 s/op
Possible solution
Now that we have a good idea what the actual problem is (my guess is still all those locks neither being coalesced nor using the efficient biased-locking implementation), the solution is rather straightforward: reduce the number of function calls. (Yes, we could have arrived at this solution without everything above, but it's always nice to have a good grip on the problem, and there might have been a solution that didn't involve changing much code.)
The following code runs consistently faster than either of the other two - you can play with the array size, but it's surprisingly unimportant (presumably because, unlike the other methods, read(char[]) does not have to acquire a lock for every single character, so the cost per character is lower to begin with).
private static final int BUFFER_SIZE = 256;
private char[] arr = new char[BUFFER_SIZE];

@Benchmark
public int readArrayTest() throws IOException, InterruptedException {
    clearFileCaches();
    int result = 0;
    try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
        int charsRead;
        while ((charsRead = reader.read(arr)) != -1) {
            for (int i = 0; i < charsRead; i++) {
                result += arr[i];
            }
        }
    }
    return result;
}
This is most likely good enough performance-wise, but if you wanted to improve performance even further, using a file mapping might help (I wouldn't count on too large an improvement in a case such as this, but if you know that your text is always ASCII, you could make some further optimizations).
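If you want to try the file-mapping route, a minimal sketch might look like this (inside a method that throws IOException; it assumes the file fits in a single mapping, i.e. is smaller than 2 GB):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;

try (FileChannel ch = FileChannel.open(Paths.get(FILE_PATH))) {
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    int result = 0;
    while (buf.hasRemaining()) {
        result += buf.get();   // byte-level access, which is fine for ASCII text
    }
    System.out.println(result);
}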
So this is the practical answer to my own question: don't use BufferedReader.read(), use FileChannel instead. (Obviously I'm not answering the WHY I put in the title.) Here's the quick and dirty benchmark; hopefully others will find it useful:
@Test
public void testFileChannel() throws IOException {
    FileChannel fileChannel = FileChannel.open(Paths.get("chr1.fa"));
    long n = 0;
    int noOfBytesRead = 0;

    long t0 = System.nanoTime();
    while (noOfBytesRead != -1) {
        ByteBuffer buffer = ByteBuffer.allocate(10000);
        noOfBytesRead = fileChannel.read(buffer);
        buffer.flip();
        while (buffer.hasRemaining()) {
            char x = (char) buffer.get();
            n++;
        }
    }
    long t1 = System.nanoTime();
    System.err.println((float) (t1 - t0) / 1e6); // ~ 250 ms
    System.err.println("nchars: " + n); // 254235640 chars read
}
With ~250 ms to read the whole file char by char, this strategy is considerably faster than BufferedReader.readLine() (~700 ms), let alone read(). Adding if statements in the loop to check for x == '\n' and x == '>' makes little difference. Also, using a StringBuilder to reconstruct lines doesn't affect the timing much. So this is plenty good for me (at least for now).
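For the indexing use case stated above (line length with and without EOL characters, and whether a line starts with >), the bookkeeping might look roughly like this inside the while (buffer.hasRemaining()) loop (a sketch assuming \n or \r\n line endings; variable names are mine):

long lenWithEol = 0;           // counts EOL characters too
long lenNoEol = 0;             // stops counting at \r or \n
boolean startsWithGt = false;  // does the current line start with '>'?
boolean atLineStart = true;

// per character, inside the inner loop:
char x = (char) buffer.get();
lenWithEol++;
if (atLineStart) {
    startsWithGt = (x == '>');
    atLineStart = false;
}
if (x == '\n') {
    // record lenWithEol, lenNoEol and startsWithGt for this line here
    lenWithEol = 0;
    lenNoEol = 0;
    atLineStart = true;
} else if (x != '\r') {
    lenNoEol++;
}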
Thanks to @Marco13 for mentioning FileChannel.
Java JIT optimizes away empty loop bodies, so your loops actually look like this:
while((c = fa.read()) != -1);
and
while((line = fa.readLine()) != null);
I suggest you read up on benchmarking here and on the optimization of loops here.
As to why the time taken differs:
Reason one (this only applies if the bodies of the loops contain code): in the first example, you're doing one operation per character; in the second, one per line. This adds up the more characters/lines you have.
while ((c = fa.read()) != -1) {
    // one operation per character
}

while ((line = fa.readLine()) != null) {
    // one operation per line
}
Reason two: in the class BufferedReader, the method readLine() doesn't use read() behind the scenes - it uses its own code. readLine() performs fewer operations per character to read a line than it would take to read the same line with repeated read() calls - this is why readLine() is faster at reading an entire file.
Reason three: it takes many more iterations to read a file character by character than line by line (unless each line contains just one character); read() is called far more often than readLine().
Thanks @Voo for the correction. What I mentioned below is correct from the FileReader#read() vs. BufferedReader#readLine() point of view, BUT not correct from the BufferedReader#read() vs. BufferedReader#readLine() point of view, so I have struck out the answer.
Using the read() method on BufferedReader is not a good idea; it won't cause you any harm, but it certainly wastes the purpose of the class.
The whole purpose in life of BufferedReader is to reduce I/O by buffering the content, as you can read in the Java tutorials. You may also notice that the read() method in BufferedReader is actually inherited from Reader, while readLine() is BufferedReader's own method.
If you want to use the read() method, then I would say you'd better use FileReader, which is meant for that purpose, as also described in the Java tutorials.
So, I think the answer to your question is very simple (without going into benchmarking and all those explanations):
Each read() is handled by the underlying OS and triggers disk access, network activity, or some other operation that is relatively expensive.
When you use readLine(), you save all these overheads, so readLine() will always be faster than read() - perhaps not substantially for small data, but faster.
It is not surprising to see this difference if you think about it: one test iterates over the lines of a text file, while the other iterates over its characters.
Unless each line contains exactly one character, readLine() is expected to be way faster than the read() method (although, as pointed out by the comments above, this is arguable, since a BufferedReader buffers the input, and the physical file reading might not be the only performance-relevant operation).
If you really want to test the difference between the two, I would suggest a setup where you iterate over each character in both tests. E.g. something like:
void readTest(BufferedReader r) throws IOException
{
    int c;
    StringBuilder b = new StringBuilder();
    while ((c = r.read()) != -1)
        b.append((char) c);
}

void readLineTest(BufferedReader r) throws IOException
{
    String line;
    StringBuilder b = new StringBuilder();
    while ((line = r.readLine()) != null)
        for (int i = 0; i < line.length(); i++)
            b.append(line.charAt(i));
}
Besides the above, please use a Java performance diagnostic tool to benchmark your code, and read up on how to micro-benchmark Java code.
According to the documentation:
Every read() method call makes an expensive system call.
Every readLine() method call still makes an expensive system call, however, for more bytes at once, so there are fewer calls.
A similar situation arises when we issue a database update command for each record we want to update, versus a batch update, where we make one call for all the records.
It seems to be common practice to extract a constant empty-array return value into a static constant, like here:
public class NoopParser implements Parser {
    private static final String[] EMPTY_ARRAY = new String[0];

    @Override public String[] supportedSchemas() {
        return EMPTY_ARRAY;
    }

    // ...
}
Presumably this is done for performance reasons, since returning new String[0] directly would create a new array object every time the method is called – but would it really?
I've been wondering if there really is a measurable performance benefit in doing this or if it's just outdated folk wisdom. An empty array is immutable. Is the VM not able to roll all empty String arrays into one? Can the VM not make new String[0] basically free of cost?
Contrast this practice with returning an empty String: we're usually perfectly happy to write return "";, not return EMPTY_STRING;.
I benchmarked it using JMH:
private static final String[] EMPTY_STRING_ARRAY = new String[0];

@Benchmark
public void testStatic(Blackhole blackhole) {
    blackhole.consume(EMPTY_STRING_ARRAY);
}

@Benchmark
@Fork(jvmArgs = "-XX:-EliminateAllocations")
public void testStaticEliminate(Blackhole blackhole) {
    blackhole.consume(EMPTY_STRING_ARRAY);
}

@Benchmark
public void testNew(Blackhole blackhole) {
    blackhole.consume(new String[0]);
}

@Benchmark
@Fork(jvmArgs = "-XX:-EliminateAllocations")
public void testNewEliminate(Blackhole blackhole) {
    blackhole.consume(new String[0]);
}

@Benchmark
public void noop(Blackhole blackhole) {
}
Full source code.
Environment (seen after java -jar target/benchmarks.jar -f 1):
# JMH 1.11.2 (released 51 days ago)
# VM version: JDK 1.7.0_75, VM 24.75-b04
# VM invoker: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
EliminateAllocations was on by default (seen after java -XX:+PrintFlagsFinal -version | grep EliminateAllocations).
Results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.testNewEliminate thrpt 20 95912464.879 ± 3260948.335 ops/s
MyBenchmark.testNew thrpt 20 103980230.952 ± 3772243.160 ops/s
MyBenchmark.testStaticEliminate thrpt 20 206849985.523 ± 4920788.341 ops/s
MyBenchmark.testStatic thrpt 20 219735906.550 ± 6162025.973 ops/s
MyBenchmark.noop thrpt 20 1126421653.717 ± 8938999.666 ops/s
Using a constant was almost two times faster.
Turning off EliminateAllocations slowed things down a tiny bit.
I'm most interested in the actual performance difference between these two idioms in practical, real-world situations. I have no experience in micro-benchmarking (and it is probably not the right tool for such a question) but I gave it a try anyway.
This benchmark models a somewhat more typical, "realistic" setting. The returned array is just looked at and then discarded. No references hanging around, no requirement for reference equality.
One interface, two implementations:
public interface Parser {
    String[] supportedSchemas();
    void parse(String s);
}

public class NoopParserStaticArray implements Parser {
    private static final String[] EMPTY_STRING_ARRAY = new String[0];

    @Override public String[] supportedSchemas() {
        return EMPTY_STRING_ARRAY;
    }

    @Override public void parse(String s) {
        s.codePoints().count();
    }
}

public class NoopParserNewArray implements Parser {
    @Override public String[] supportedSchemas() {
        return new String[0];
    }

    @Override public void parse(String s) {
        s.codePoints().count();
    }
}
And the JMH benchmark:
import org.openjdk.jmh.annotations.Benchmark;

public class EmptyArrayBenchmark {
    private static final Parser NOOP_PARSER_STATIC_ARRAY = new NoopParserStaticArray();
    private static final Parser NOOP_PARSER_NEW_ARRAY = new NoopParserNewArray();

    @Benchmark
    public void staticEmptyArray() {
        Parser parser = NOOP_PARSER_STATIC_ARRAY;
        for (String schema : parser.supportedSchemas()) {
            parser.parse(schema);
        }
    }

    @Benchmark
    public void newEmptyArray() {
        Parser parser = NOOP_PARSER_NEW_ARRAY;
        for (String schema : parser.supportedSchemas()) {
            parser.parse(schema);
        }
    }
}
The result on my machine, Java 1.8.0_51 (HotSpot VM):
Benchmark Mode Cnt Score Error Units
EmptyArrayBenchmark.staticEmptyArray thrpt 60 3024653836.077 ± 37006870.221 ops/s
EmptyArrayBenchmark.newEmptyArray thrpt 60 3018798922.045 ± 33953991.627 ops/s
EmptyArrayBenchmark.noop thrpt 60 3046726348.436 ± 5802337.322 ops/s
There is no significant difference between the two approaches in this case. In fact, they are indistinguishable from the no-op case: apparently the JIT compiler recognises that the returned array is always empty and optimises the loop away entirely!
Piping parser.supportedSchemas() into the black hole instead of looping over it gives the static array instance approach a ~30% advantage. But the two are definitely of the same magnitude:
Benchmark Mode Cnt Score Error Units
EmptyArrayBenchmark.staticEmptyArray thrpt 60 338971639.355 ± 738069.217 ops/s
EmptyArrayBenchmark.newEmptyArray thrpt 60 266936194.767 ± 411298.714 ops/s
EmptyArrayBenchmark.noop thrpt 60 3055609298.602 ± 5694730.452 ops/s
Perhaps in the end the answer is the usual "it depends". I have a hunch that in many practical scenarios, the performance benefit in factoring out the array creation is not significant.
I think it is fair to say that
if the method contract gives you the freedom to return a new empty array instance every time, and
unless you need to guard against problematic or pathological usage patterns and/or aim for theoretical max performance,
then returning new String[0] directly is fine.
Personally, I like the expressiveness and concision of return new String[0]; and not having to introduce an extra static field.
By some strange coincidence, a month after I wrote this a real performance engineer investigated the problem: see this section in Alexey Shipilёv's blog post 'Arrays of Wisdom of the Ancients':
As expected, the only effect whatsoever can be observed on a very small collection sizes, and this is only a marginal improvement over new Foo[0]. This improvement does not seem to justify caching the array in the grand scheme of things. As a teeny tiny micro-optimization, it might make sense in some tight code, but I wouldn’t care otherwise.
That settles it. I'll take the tick mark and dedicate it to Alexey.
Is the VM not able to roll all empty String arrays into one?
It can't do that, because distinct empty arrays need to compare unequal with ==. Only the programmer can make this optimization.
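A quick illustration of why the VM cannot merge the instances behind your back (identity comparisons must keep working):

static final String[] EMPTY_ARRAY = new String[0];   // as in NoopParser above

public static void main(String[] args) {
    String[] a = new String[0];
    String[] b = new String[0];
    System.out.println(a == b);                      // false: a fresh object every time
    System.out.println(EMPTY_ARRAY == EMPTY_ARRAY);  // true: always the same instance
}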
Contrast this practice with returning an empty String: we're usually perfectly happy writing return "";.
With strings, there is no requirement that distinct string literals produce distinct strings. In every case I know of, two instances of "" will produce the same string object, but maybe there's some weird case with classloaders where that won't happen.
I will go out on a limb and say that the performance benefit, even though using a constant is much faster, is not actually relevant, because the software will likely spend far more time doing other things than returning empty arrays. If the total run time is hours, the few extra seconds spent creating arrays do not mean much. By the same logic, memory consumption is not relevant either.
The only reason I can think of for doing this is readability.
I am trying to test the performance of Aparapi.
I have seen some blogs where the results show that Aparapi does improve performance for data-parallel operations, but I am not able to reproduce that in my tests. Here is what I did: I wrote two programs, one using Aparapi, the other using normal loops.
Program 1: using Aparapi
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class App {
    public static void main(String[] args) {
        final int size = 50000000;

        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }

        final float[] sum = new float[size];

        Kernel kernel = new Kernel() {
            @Override public void run() {
                int gid = getGlobalId();
                sum[gid] = a[gid] + b[gid];
            }
        };

        long t1 = System.currentTimeMillis();
        kernel.execute(Range.create(size));
        long t2 = System.currentTimeMillis();
        System.out.println("Execution mode = " + kernel.getExecutionMode());
        kernel.dispose();
        System.out.println(t2 - t1);
    }
}
Program 2: using loops
public class App2 {
    public static void main(String[] args) {
        final int size = 50000000;

        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }

        final float[] sum = new float[size];

        long t1 = System.currentTimeMillis();
        for (int i = 0; i < size; i++) {
            sum[i] = a[i] + b[i];
        }
        long t2 = System.currentTimeMillis();
        System.out.println(t2 - t1);
    }
}
Program 1 takes around 330 ms, whereas Program 2 takes only around 55 ms.
Am I doing something wrong here? I did print out the execution mode in the Aparapi program, and it reports that the execution mode is GPU.
You did not do anything wrong - except for the benchmark itself.
Benchmarking is always tricky, particularly when a JIT is involved (as for Java) and for libraries where many nitty-gritty details are hidden from the user (as for Aparapi). In both cases, you should at least execute the code section that you want to benchmark multiple times.
For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop itself is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not really be noticeable, but in more realistic or complex scenarios, it will make a difference. Anyhow: when repeating the loop 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
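A sketch of such a repeated measurement, wrapping the plain Java loop from the question (the run count of 10 is arbitrary):

for (int run = 0; run < 10; run++) {
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < size; i++) {
        sum[i] = a[i] + b[i];
    }
    long t2 = System.currentTimeMillis();
    System.out.println("Run " + run + ": " + (t2 - t1) + " ms");
}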
For the Aparapi version, the possible GPU initialization was already mentioned in the comments. And here this is indeed the main problem: when running the kernel 10 times, the timings on my machine are
1248
72
72
72
73
71
72
73
72
72
You see that the initial call causes all the overhead. The reason is that, during the first call to Kernel#execute(), it has to do all the initialization (basically converting the bytecode to OpenCL, compiling the OpenCL code, etc.). This is also mentioned in the documentation of the KernelRunner class:
The KernelRunner is created lazily as a result of calling Kernel.execute().
The effect of this - namely, a comparatively large delay for the first execution - has led to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like
kernel.execute(Range.create(1));
without a real workload, only to trigger the whole setup, so that subsequent calls are fast. (This also works for your example.)
You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason is that a simple vector addition like this one is memory bound - for details, you may refer to this answer, which explains the term and some issues of GPU programming in general.
As an admittedly contrived example of a case where the GPU pays off, you can modify your test to create an artificial compute-bound task: when you change the kernel to involve some expensive trigonometric functions, like this
Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int gid = getGlobalId();
        sum[gid] = (float) (Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
    }
};
and the plain Java loop version accordingly, like this
for (int i = 0; i < size; i++) {
    sum[i] = (float) (Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));
}
then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of nearly 90x through Aparapi!
Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.
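For such comparisons it can be handy to pin the execution mode explicitly. In the older com.amd.aparapi API imported above, this can be done before the kernel is executed (a sketch; the mode constants include GPU, CPU, JTP and SEQ):

// Force the Java thread pool (JTP) backend instead of the GPU:
kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);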
Just add the following before the main kernel execution:
kernel.setExplicit(true);
kernel.put(a);
kernel.put(b);
and
kernel.get(sum);
after it.
Although Aparapi does analyze the byte code of the Kernel.run()
method (and any method reachable from Kernel.run()) Aparapi has no
visibility to the call site. In the above code there is no way for
Aparapi to detect that hugeArray is not modified within the for
loop body. Unfortunately, Aparapi must default to being ‘safe’ and
copy the contents of hugeArray backwards and forwards to the GPU
device.
https://github.com/aparapi/aparapi/blob/master/doc/ExplicitBufferHandling.md
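Put together with the kernel from the question, the explicit-buffer version of a repeated measurement might look like this (a sketch; the run count is arbitrary):

kernel.setExplicit(true);
kernel.put(a);   // transfer the inputs to the device once
kernel.put(b);
for (int run = 0; run < 10; run++) {
    kernel.execute(Range.create(size));   // no implicit array transfers anymore
}
kernel.get(sum); // transfer the result back once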
I have done the same test as in this post:
new String() vs literal string performance
Meaning, I wanted to test which one performs better. As I expected, the result was that assignment by literal is faster. I don't know why, but I did the test with some more assignments and noticed something strange: when I let the program do the loops more than 10,000 times, assignment by literals is relatively not as much faster than at fewer than 10,000 assignments, and at 1,000,000 repetitions it is even slower than creating new objects.
Here is my code:
double tx = System.nanoTime();
for (int i = 0; i < 1; i++) {
    String s = "test";
}
double ty = System.nanoTime();

double ta = System.nanoTime();
for (int i = 0; i < 1; i++) {
    String s = new String("test");
}
double tb = System.nanoTime();

System.out.println((ty - tx));
System.out.println((tb - ta));
I let this run like it is written above. I'm just learning Java; my boss asked me to do the test, and after I presented the outcome he asked me to find an answer as to why this happens. I cannot find anything on Google or Stack Overflow, so I hope someone can help me out here.
factor at 1 repetition 3.811565221
factor at 10 repetitions 4.393570401
factor at 100 repetitions 5.234779103
factor at 1,000 repetitions 7.909884116
factor at 10,000 repetitions 9.395538811
factor at 100,000 repetitions 2.355514697
factor at 1,000,000 repetitions 0.734826755
Thank you!
First you'll have to learn a lot about the internals of HotSpot, in particular the fact that your code is first interpreted, then at a certain point compiled into native code.
A lot of optimizations happen while compiling, based on results of both static and dynamic analysis of your code.
Specifically, in your code,
String s = "test";
is a clear no-op. The compiler will emit no code whatsoever for this line. All that remains is the loop itself, and the whole loop may be eliminated if HotSpot proves it has no observable outside effects.
Second, even the code
String s = new String("test");
may result in almost the same thing as above because it is very easy to prove that your new String is an instance which cannot escape from the method where it is created.
With your code, the measurements are mixing up the performance of interpreted bytecode, the delay it takes to compile the code and swap it in by On-Stack Replacement, and then the performance of the native code.
Basically, the measurements you are making are measuring everything but the effect you have set out to measure.
To make the argument more solid, I have repeated the test with JMH:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@Threads(1)
@Fork(2)
public class Strings {
    static final int ITERS = 1_000;

    @GenerateMicroBenchmark
    public void literal() {
        for (int i = 0; i < ITERS; i++) { String s = "test"; }
    }

    @GenerateMicroBenchmark
    public void newString() {
        for (int i = 0; i < ITERS; i++) { String s = new String("test"); }
    }
}
and these are the results:
Benchmark Mode Samples Mean Mean error Units
literal avgt 6 0.625 0.023 ns/op
newString avgt 6 43.778 3.283 ns/op
You can see that the whole method body is eliminated in the case of the string literal, while with new String the loop remains, but there is nothing in it, because the time per loop iteration is just about 0.04 nanoseconds. Definitely no String instances are allocated.