Understanding the output of -XX:+PrintCompilation

I am running some micro-benchmarks on Java list iteration code. I have used the -XX:+PrintCompilation and -verbose:gc flags to ensure that nothing is happening in the background while the timing is being measured. However, I see something in the output which I cannot understand.
Here's the code I am running the benchmark on:
import java.util.ArrayList;
import java.util.List;

public class PerformantIteration {
    private static int theSum = 0;

    public static void main(String[] args) {
        System.out.println("Starting microbenchmark on iterating over collections with a call to size() in each iteration");
        List<Integer> nums = new ArrayList<Integer>();
        for (int i = 0; i < 50000; i++) {
            nums.add(i);
        }
        System.out.println("Warming up ...");
        // warmup... make sure all JIT compiling is done before the actual benchmarking starts
        for (int i = 0; i < 10; i++) {
            iterateWithConstantSize(nums);
            iterateWithDynamicSize(nums);
        }
        // actual
        System.out.println("Starting the actual test");
        long constantSizeBenchmark = iterateWithConstantSize(nums);
        long dynamicSizeBenchmark = iterateWithDynamicSize(nums);
        System.out.println("Test completed... printing results");
        System.out.println("constantSizeBenchmark : " + constantSizeBenchmark);
        System.out.println("dynamicSizeBenchmark : " + dynamicSizeBenchmark);
        System.out.println("dynamicSizeBenchmark/constantSizeBenchmark : " + ((double) dynamicSizeBenchmark / (double) constantSizeBenchmark));
    }

    private static long iterateWithDynamicSize(List<Integer> nums) {
        int sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < nums.size(); i++) {
            // appear to do something useful
            sum += nums.get(i);
        }
        long end = System.nanoTime();
        setSum(sum);
        return end - start;
    }

    private static long iterateWithConstantSize(List<Integer> nums) {
        int count = nums.size();
        int sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            // appear to do something useful
            sum += nums.get(i);
        }
        long end = System.nanoTime();
        setSum(sum);
        return end - start;
    }

    // invocations of this method exist only to fool the VM into thinking that we are doing something useful in the loop
    private static void setSum(int sum) {
        theSum = sum;
    }
}
Here's the output.
152 1 java.lang.String::charAt (33 bytes)
160 2 java.lang.String::indexOf (151 bytes)
165 3Starting microbenchmark on iterating over collections with a call to size() in each iteration java.lang.String::hashCode (60 bytes)
171 4 sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (490 bytes)
183 5
java.lang.String::lastIndexOf (156 bytes)
197 6 java.io.UnixFileSystem::normalize (75 bytes)
200 7 java.lang.Object::<init> (1 bytes)
205 8 java.lang.Number::<init> (5 bytes)
206 9 java.lang.Integer::<init> (10 bytes)
211 10 java.util.ArrayList::add (29 bytes)
211 11 java.util.ArrayList::ensureCapacity (58 bytes)
217 12 java.lang.Integer::valueOf (35 bytes)
221 1% performance.api.PerformantIteration::main @ 21 (173 bytes)
Warming up ...
252 13 java.util.ArrayList::get (11 bytes)
252 14 java.util.ArrayList::rangeCheck (22 bytes)
253 15 java.util.ArrayList::elementData (7 bytes)
260 2% performance.api.PerformantIteration::iterateWithConstantSize @ 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize @ 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
Starting the actual test
Test completed... printing results
constantSizeBenchmark : 301688
dynamicSizeBenchmark : 782602
dynamicSizeBenchmark/constantSizeBenchmark : 2.5940773249184588
I don't understand these four lines from the output.
260 2% performance.api.PerformantIteration::iterateWithConstantSize @ 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize @ 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
Why are both these methods being compiled twice?
How do I read this output... what do the various numbers mean?

I am going to attempt answering my own question with the help of this link posted by Thomas Jungblut.
260 2% performance.api.PerformantIteration::iterateWithConstantSize @ 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize @ 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
First column
The first column, '260', is the timestamp: the number of milliseconds since the JVM started.
Second column
The second column holds the compilation_id and method_attributes. When a HotSpot compilation is triggered, every compilation unit gets a compilation id. The number in the second column is that compilation id. Standard JIT compilation and OSR compilation have two separate sequences of compilation ids, so 1% and 1 are different compilation units. The % in the first two rows indicates that these are OSR compilations. An OSR compilation was triggered because the code was spinning in a large loop and the VM determined that this code is hot. The OSR compilation enables the VM to perform an on-stack replacement and move over to the optimized code while the loop is still running, once that code is ready.
Third column
The third column performance.api.PerformantIteration::iterateWithConstantSize is the method name.
Fourth column
The fourth column again differs between OSR compilations and regular ones. Let's look at the common part first: the (59 bytes) at the end of the fourth column refers to the size of the compilation unit in bytecode (not the size of the compiled code). The @ 19 part of an OSR compilation line refers to the osr_bci. I am going to quote from the link mentioned above -
A "place" in a Java method is defined by its bytecode index (BCI), and
the place that triggered an OSR compilation is called the "osr_bci".
An OSR-compiled nmethod can only be entered from its osr_bci; there
can be multiple OSR-compiled versions of the same method at the same
time, as long as their osr_bci differ.
Putting it all together for the first line above: at 260 ms after VM start, OSR compilation id 2 compiled the 59 bytes of bytecode of iterateWithConstantSize, producing code that is entered at bytecode index 19.
Finally, why was the method compiled twice?
The first one is an OSR compilation, which presumably happened while the loop was running due to the warmup code (in the example), and the second compilation is a standard JIT compilation, presumably to further optimize the compiled code?
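As a minimal illustration (my own sketch, not from the original question): run the class below with java -XX:+PrintCompilation. Since main() is invoked only once, its invocation counter never gets hot; only the loop's backedge counter does, so the method should show up in the log as an OSR ('%') compilation entered mid-loop.

public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        // a single long-running loop: only the backedge counter can trigger
        // compilation here, which forces an OSR compilation of main
        for (int i = 0; i < 1_000_000_000; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}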

I think the OSR compilation happened first; then the invocation counter reached its threshold and triggered the regular method compilation.

Related

Java 8: Class.getName() slows down String concatenation chain

Recently I've run into an issue regarding String concatenation. This benchmark summarizes it:
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class BrokenConcatenationBenchmark {
    @Benchmark
    public String slow(Data data) {
        final Class<? extends Data> clazz = data.clazz;
        return "class " + clazz.getName();
    }

    @Benchmark
    public String fast(Data data) {
        final Class<? extends Data> clazz = data.clazz;
        final String clazzName = clazz.getName();
        return "class " + clazzName;
    }

    @State(Scope.Thread)
    public static class Data {
        final Class<? extends Data> clazz = getClass();

        @Setup
        public void setup() {
            // explicitly load name via native method Class.getName0()
            clazz.getName();
        }
    }
}
On JDK 1.8.0_222 (OpenJDK 64-Bit Server VM, 25.222-b10) I've got the following results:
Benchmark Mode Cnt Score Error Units
BrokenConcatenationBenchmark.fast avgt 25 22,253 ± 0,962 ns/op
BrokenConcatenationBenchmark.fast:·gc.alloc.rate avgt 25 9824,603 ± 400,088 MB/sec
BrokenConcatenationBenchmark.fast:·gc.alloc.rate.norm avgt 25 240,000 ± 0,001 B/op
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Eden_Space avgt 25 9824,162 ± 397,745 MB/sec
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Eden_Space.norm avgt 25 239,994 ± 0,522 B/op
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Survivor_Space avgt 25 0,040 ± 0,011 MB/sec
BrokenConcatenationBenchmark.fast:·gc.churn.PS_Survivor_Space.norm avgt 25 0,001 ± 0,001 B/op
BrokenConcatenationBenchmark.fast:·gc.count avgt 25 3798,000 counts
BrokenConcatenationBenchmark.fast:·gc.time avgt 25 2241,000 ms
BrokenConcatenationBenchmark.slow avgt 25 54,316 ± 1,340 ns/op
BrokenConcatenationBenchmark.slow:·gc.alloc.rate avgt 25 8435,703 ± 198,587 MB/sec
BrokenConcatenationBenchmark.slow:·gc.alloc.rate.norm avgt 25 504,000 ± 0,001 B/op
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Eden_Space avgt 25 8434,983 ± 198,966 MB/sec
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Eden_Space.norm avgt 25 503,958 ± 1,000 B/op
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Survivor_Space avgt 25 0,127 ± 0,011 MB/sec
BrokenConcatenationBenchmark.slow:·gc.churn.PS_Survivor_Space.norm avgt 25 0,008 ± 0,001 B/op
BrokenConcatenationBenchmark.slow:·gc.count avgt 25 3789,000 counts
BrokenConcatenationBenchmark.slow:·gc.time avgt 25 2245,000 ms
This looks like an issue similar to JDK-8043677, where an expression with a side effect
breaks optimization of a new StringBuilder.append().append().toString() chain.
But the code of Class.getName() itself does not seem to have any side effects:
private transient String name;

public String getName() {
    String name = this.name;
    if (name == null) {
        this.name = name = this.getName0();
    }
    return name;
}

private native String getName0();
The only suspicious thing here is a call to a native method, which in fact happens
only once; its result is cached in a field of the class.
In my benchmark I've explicitly cached it in the setup method.
I expected the branch predictor to figure out that at each benchmark invocation
the actual value of this.name is never null and to optimize the whole expression.
However, while for the BrokenConcatenationBenchmark.fast() I have this:
@ 19 tsypanov.strings.benchmark.concatenation.BrokenConcatenationBenchmark::fast (30 bytes) force inline by CompileCommand
@ 6 java.lang.Class::getName (18 bytes) inline (hot)
@ 14 java.lang.Class::initClassName (0 bytes) native method
@ 14 java.lang.StringBuilder::<init> (7 bytes) inline (hot)
@ 19 java.lang.StringBuilder::append (8 bytes) inline (hot)
@ 23 java.lang.StringBuilder::append (8 bytes) inline (hot)
@ 26 java.lang.StringBuilder::toString (35 bytes) inline (hot)
i.e. the compiler is able to inline everything, while for BrokenConcatenationBenchmark.slow() it is different:
@ 19 tsypanov.strings.benchmark.concatenation.BrokenConcatenationBenchmark::slow (28 bytes) force inline by CompilerOracle
@ 9 java.lang.StringBuilder::<init> (7 bytes) inline (hot)
@ 3 java.lang.AbstractStringBuilder::<init> (12 bytes) inline (hot)
@ 1 java.lang.Object::<init> (1 bytes) inline (hot)
@ 14 java.lang.StringBuilder::append (8 bytes) inline (hot)
@ 2 java.lang.AbstractStringBuilder::append (50 bytes) inline (hot)
@ 10 java.lang.String::length (6 bytes) inline (hot)
@ 21 java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes) inline (hot)
@ 17 java.lang.AbstractStringBuilder::newCapacity (39 bytes) inline (hot)
@ 20 java.util.Arrays::copyOf (19 bytes) inline (hot)
@ 11 java.lang.Math::min (11 bytes) (intrinsic)
@ 14 java.lang.System::arraycopy (0 bytes) (intrinsic)
@ 35 java.lang.String::getChars (62 bytes) inline (hot)
@ 58 java.lang.System::arraycopy (0 bytes) (intrinsic)
@ 18 java.lang.Class::getName (21 bytes) inline (hot)
@ 11 java.lang.Class::getName0 (0 bytes) native method
@ 21 java.lang.StringBuilder::append (8 bytes) inline (hot)
@ 2 java.lang.AbstractStringBuilder::append (50 bytes) inline (hot)
@ 10 java.lang.String::length (6 bytes) inline (hot)
@ 21 java.lang.AbstractStringBuilder::ensureCapacityInternal (27 bytes) inline (hot)
@ 17 java.lang.AbstractStringBuilder::newCapacity (39 bytes) inline (hot)
@ 20 java.util.Arrays::copyOf (19 bytes) inline (hot)
@ 11 java.lang.Math::min (11 bytes) (intrinsic)
@ 14 java.lang.System::arraycopy (0 bytes) (intrinsic)
@ 35 java.lang.String::getChars (62 bytes) inline (hot)
@ 58 java.lang.System::arraycopy (0 bytes) (intrinsic)
@ 24 java.lang.StringBuilder::toString (17 bytes) inline (hot)
So the question is whether this is appropriate behaviour of the JVM or a compiler bug?
I'm asking because some projects are still using Java 8, and if this won't be fixed in any of its release updates, then to me it seems reasonable to manually hoist calls to Class.getName() out of hot spots.
P.S. On the latest JDKs (11, 13, 14-eap) the issue is not reproduced.
HotSpot JVM collects execution statistics per bytecode. If the same code is run in different contexts, the result profile will aggregate statistics from all contexts. This effect is known as profile pollution.
Class.getName() is obviously called not only from your benchmark code. Before JIT starts compiling the benchmark, it already knows that the following condition in Class.getName() was met multiple times:
if (name == null)
    this.name = name = getName0();
At least enough times to treat this branch as statistically important. So, the JIT did not exclude this branch from compilation, and thus could not optimize the string concatenation due to the possible side effect.
This does not even need to be a native method call: just a regular field assignment is also considered a side effect.
Here is an example how profile pollution can harm further optimizations.
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class StringConcat {
    private final MyClass clazz = new MyClass();

    static class MyClass {
        private String name;

        public String getName() {
            if (name == null) name = "ZZZ";
            return name;
        }
    }

    @Param({"1", "100", "400", "1000"})
    private int pollutionCalls;

    @Setup
    public void setup() {
        for (int i = 0; i < pollutionCalls; i++) {
            new MyClass().getName();
        }
    }

    @Benchmark
    public String fast() {
        String clazzName = clazz.getName();
        return "str " + clazzName;
    }

    @Benchmark
    public String slow() {
        return "str " + clazz.getName();
    }
}
This is basically a modified version of your benchmark that simulates pollution of the getName() profile. Depending on the number of preliminary getName() calls on a fresh object, the further performance of string concatenation may differ dramatically:
Benchmark (pollutionCalls) Mode Cnt Score Error Units
StringConcat.fast 1 avgt 15 11,458 ± 0,076 ns/op
StringConcat.fast 100 avgt 15 11,690 ± 0,222 ns/op
StringConcat.fast 400 avgt 15 12,131 ± 0,105 ns/op
StringConcat.fast 1000 avgt 15 12,194 ± 0,069 ns/op
StringConcat.slow 1 avgt 15 11,771 ± 0,105 ns/op
StringConcat.slow 100 avgt 15 11,963 ± 0,212 ns/op
StringConcat.slow 400 avgt 15 26,104 ± 0,202 ns/op << !
StringConcat.slow 1000 avgt 15 26,108 ± 0,436 ns/op << !
More examples of profile pollution »
I can't call it either a bug or an "appropriate behaviour". This is just how dynamic adaptive compilation is implemented in HotSpot.
Slightly unrelated, but since Java 9 and JEP 280: Indify String Concatenation, string concatenation is done with invokedynamic rather than StringBuilder. This article shows the differences in the bytecode between Java 8 and Java 9.
If the benchmark re-run on a newer Java version doesn't show the problem, there is most likely no bug in javac, because the compiler now uses the new mechanism. I'm not sure diving into the Java 8 behaviour is worthwhile given such a substantial change in the newer versions.
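A quick way to see the JEP 280 difference yourself (a sketch; the class and method are mine): compile the snippet below with javac from JDK 8 and from JDK 9+, then compare the javap -c output. Under JDK 8 the bytecode allocates a StringBuilder and chains append() calls; under JDK 9+ it is a single invokedynamic bootstrapped by StringConcatFactory.makeConcatWithConstants.

public class Concat {
    static String greet(String name) {
        // javap -c Concat shows a StringBuilder append chain on JDK 8
        // and one invokedynamic instruction on JDK 9+
        return "class " + name;
    }
}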

Unclear intrinsic behaviour in Java9

Suppose I have this code (it really does not matter I think, but just in case here it is):
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicJDK9 {
    static AtomicInteger ai = new AtomicInteger(0);

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 30_000; ++i) {
            sum += atomicIncrement();
        }
        System.out.println(sum);
    }

    public static int atomicIncrement() {
        ai.getAndAdd(12);
        return ai.get();
    }
}
And here is how I am invoking it (using Java 9):
java -XX:+UnlockDiagnosticVMOptions
-XX:-TieredCompilation
-XX:+PrintIntrinsics
AtomicJDK9
What I am trying to find out is what methods were replaced by intrinsic code. The first one that is hit (inside Unsafe):
@HotSpotIntrinsicCandidate
public final int getAndAddInt(Object o, long offset, int delta) {
    int v;
    do {
        v = getIntVolatile(o, offset);
    } while (!weakCompareAndSwapIntVolatile(o, offset, v, v + delta));
    return v;
}
And this method is indeed present in the output of the above invocation:
@ 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
But the entire output is weird (for me, that is):
@ 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
@ 3 jdk.internal.misc.Unsafe::getIntVolatile (0 bytes) (intrinsic)
@ 18 jdk.internal.misc.Unsafe::weakCompareAndSwapIntVolatile (11 bytes) (intrinsic)
@ 7 jdk.internal.misc.Unsafe::compareAndSwapInt (0 bytes) (intrinsic)
@ 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
Why is getAndAddInt present twice in the output?
Also, if getAndAddInt is indeed replaced by an intrinsic call, why is there a need to replace all the other intrinsic methods down the call stack? They will not be used anymore. I assume it's simply that the stack of method calls is traversed from the bottom.
To illustrate the compiler logic, I ran the JVM with the following arguments.
-XX:-TieredCompilation -XX:CICompilerCount=1
-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining
And that's what it prints.
337 29 java.util.concurrent.atomic.AtomicInteger::getAndAdd (12 bytes)
@ 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
337 30 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes)
@ 3 jdk.internal.misc.Unsafe::getIntVolatile (0 bytes) (intrinsic)
@ 18 jdk.internal.misc.Unsafe::weakCompareAndSwapIntVolatile (11 bytes) (intrinsic)
338 32 jdk.internal.misc.Unsafe::weakCompareAndSwapIntVolatile (11 bytes)
@ 7 jdk.internal.misc.Unsafe::compareAndSwapInt (0 bytes) (intrinsic)
339 33 AtomicJDK9::atomicIncrement (16 bytes)
@ 5 java.util.concurrent.atomic.AtomicInteger::getAndAdd (12 bytes) inline (hot)
@ 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
@ 12 java.util.concurrent.atomic.AtomicInteger::get (5 bytes) accessor
Methods are intrinsified only by the compiler, not by the interpreter.
Every method starts out in the interpreter until it is considered hot.
AtomicInteger.getAndAdd is called not only from your code, but from common JDK code, too.
That is, AtomicInteger.getAndAdd reaches the invocation threshold a little earlier than your AtomicJDK9.atomicIncrement. Then getAndAdd is submitted to the compilation queue, and that's where the first intrinsic printout comes from.
The HotSpot JVM compiles methods in the background; while a method is being compiled, execution continues in the interpreter.
While AtomicInteger.getAndAdd is interpreted, the Unsafe.getAndAddInt and Unsafe.weakCompareAndSwapIntVolatile methods also reach the invocation threshold and begin compilation. The next three intrinsics are printed while compiling these Unsafe methods.
Finally, AtomicJDK9.atomicIncrement also reaches the invocation threshold and begins compilation. The last intrinsic printout corresponds to your method.
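One way to check this explanation (a sketch; -Xbatch is the standard HotSpot alias for -XX:-BackgroundCompilation) is to disable background compilation, so execution blocks while each method is compiled instead of continuing in the interpreter, and the interleaving of the printouts changes accordingly:
java -XX:-TieredCompilation -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining AtomicJDK9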

Java 8 stream objects significant memory usage

In looking at some profiling results, I noticed that using streams within a tight loop (used instead of another nested loop) incurred a significant memory overhead of objects of types java.util.stream.ReferencePipeline and java.util.ArrayList$ArrayListSpliterator. I converted the offending streams to foreach loops, and the memory consumption decreased significantly.
I know that streams make no promises about performing any better than ordinary loops, but I was under the impression that the difference would be negligible. In this case it seemed like it was a 40% increase.
Here is the test class I wrote to isolate the problem. I monitored memory consumption and object allocation with JFR:
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;

public class StreamMemoryTest {
    private static boolean blackHole = false;

    public static List<Integer> getRandListOfSize(int size) {
        ArrayList<Integer> randList = new ArrayList<>(size);
        Random rnGen = new Random();
        for (int i = 0; i < size; i++) {
            randList.add(rnGen.nextInt(100));
        }
        return randList;
    }

    public static boolean getIndexOfNothingManualImpl(List<Integer> nums, Predicate<Integer> predicate) {
        for (Integer num : nums) {
            // Impossible condition
            if (predicate.test(num)) {
                return true;
            }
        }
        return false;
    }

    public static boolean getIndexOfNothingStreamImpl(List<Integer> nums, Predicate<Integer> predicate) {
        Optional<Integer> first = nums.stream().filter(predicate).findFirst();
        return first.isPresent();
    }

    public static void consume(boolean value) {
        blackHole = blackHole && value;
    }

    public static boolean result() {
        return blackHole;
    }

    public static void main(String[] args) {
        // 100 million trials
        int numTrials = 100000000;
        System.out.println("Beginning test");
        for (int i = 0; i < numTrials; i++) {
            List<Integer> randomNums = StreamMemoryTest.getRandListOfSize(100);
            consume(StreamMemoryTest.getIndexOfNothingStreamImpl(randomNums, x -> x < 0));
            // or ...
            // consume(StreamMemoryTest.getIndexOfNothingManualImpl(randomNums, x -> x < 0));
            if (randomNums == null) {
                break;
            }
        }
        System.out.print(StreamMemoryTest.result());
    }
}
Stream implementation:
Memory Allocated for TLABs 64.62 GB
Class Average Object Size(bytes) Total Object Size(bytes) TLABs Average TLAB Size(bytes) Total TLAB Size(bytes) Pressure(%)
java.lang.Object[] 415.974 6,226,712 14,969 2,999,696.432 44,902,455,888 64.711
java.util.stream.ReferencePipeline$2 64 131,264 2,051 2,902,510.795 5,953,049,640 8.579
java.util.stream.ReferencePipeline$Head 56 72,744 1,299 3,070,768.043 3,988,927,688 5.749
java.util.stream.ReferencePipeline$2$1 24 25,128 1,047 3,195,726.449 3,345,925,592 4.822
java.util.Random 32 30,976 968 3,041,212.372 2,943,893,576 4.243
java.util.ArrayList 24 24,576 1,024 2,720,615.594 2,785,910,368 4.015
java.util.stream.FindOps$FindSink$OfRef 24 18,864 786 3,369,412.295 2,648,358,064 3.817
java.util.ArrayList$ArrayListSpliterator 32 14,720 460 3,080,696.209 1,417,120,256 2.042
Manual implementation:
Memory Allocated for TLABs 46.06 GB
Class Average Object Size(bytes) Total Object Size(bytes) TLABs Average TLAB Size(bytes) Total TLAB Size(bytes) Pressure(%)
java.lang.Object[] 415.961 4,190,392 10,074 4,042,267.769 40,721,805,504 82.33
java.util.Random 32 32,064 1,002 4,367,131.521 4,375,865,784 8.847
java.util.ArrayList 24 14,976 624 3,530,601.038 2,203,095,048 4.454
Has anyone else encountered issues with the stream objects themselves consuming memory? Is this a known issue?
Using the Stream API you do indeed allocate more memory, though your experimental setup is somewhat questionable. I've never used JFR, but my findings using JOL are quite similar to yours.
Note that you measure not only the heap allocated during the ArrayList querying, but also during its creation and population. The allocation during the creation and population of a single ArrayList should look like this (64 bits, compressed OOPs, via JOL):
COUNT AVG SUM DESCRIPTION
1 416 416 [Ljava.lang.Object;
1 24 24 java.util.ArrayList
1 32 32 java.util.Random
1 24 24 java.util.concurrent.atomic.AtomicLong
4 496 (total)
So most of the memory is allocated for the Object[] array used inside ArrayList to store the data. The AtomicLong is part of the Random class implementation. If you perform this 100_000_000 times, then you should have at least 496*10^8/2^30 = 46.2 GB allocated in both tests. Nevertheless this part can be skipped, as it should be identical for both tests.
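For reference, here is a minimal sketch of how such footprint tables can be produced with JOL (assuming the org.openjdk.jol:jol-core dependency; the class name FootprintDemo is mine):

import java.util.ArrayList;
import java.util.List;
import org.openjdk.jol.info.GraphLayout;

public class FootprintDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(100);
        for (int i = 0; i < 100; i++) {
            list.add(i);
        }
        // walk the object graph reachable from 'list' and print
        // a COUNT / AVG / SUM table per class, like the ones above
        System.out.println(GraphLayout.parseInstance(list).toFootprint());
    }
}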
Another interesting thing here is inlining. The JIT is smart enough to inline the whole of getIndexOfNothingManualImpl (via java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining StreamMemoryTest):
StreamMemoryTest::main @ 13 (59 bytes)
...
@ 30 StreamMemoryTest::getIndexOfNothingManualImpl (43 bytes) inline (hot)
@ 1 java.util.ArrayList::iterator (10 bytes) inline (hot)
\-> TypeProfile (2132/2132 counts) = java/util/ArrayList
@ 6 java.util.ArrayList$Itr::<init> (6 bytes) inline (hot)
@ 2 java.util.ArrayList$Itr::<init> (26 bytes) inline (hot)
@ 6 java.lang.Object::<init> (1 bytes) inline (hot)
@ 8 java.util.ArrayList$Itr::hasNext (20 bytes) inline (hot)
\-> TypeProfile (215332/215332 counts) = java/util/ArrayList$Itr
@ 8 java.util.ArrayList::access$100 (5 bytes) accessor
@ 17 java.util.ArrayList$Itr::next (66 bytes) inline (hot)
@ 1 java.util.ArrayList$Itr::checkForComodification (23 bytes) inline (hot)
@ 14 java.util.ArrayList::access$100 (5 bytes) accessor
@ 28 StreamMemoryTest$$Lambda$1/791452441::test (8 bytes) inline (hot)
\-> TypeProfile (213200/213200 counts) = StreamMemoryTest$$Lambda$1
@ 4 StreamMemoryTest::lambda$main$0 (13 bytes) inline (hot)
@ 1 java.lang.Integer::intValue (5 bytes) accessor
@ 8 java.util.ArrayList$Itr::hasNext (20 bytes) inline (hot)
@ 8 java.util.ArrayList::access$100 (5 bytes) accessor
@ 33 StreamMemoryTest::consume (19 bytes) inline (hot)
Disassembly actually shows that no allocation of the iterator is performed after warm-up: because escape analysis successfully tells the JIT that the iterator object does not escape, it is simply scalarized. Were the Iterator actually allocated, it would take an additional 32 bytes:
COUNT AVG SUM DESCRIPTION
1 32 32 java.util.ArrayList$Itr
1 32 (total)
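To check the escape-analysis claim yourself, one option (a sketch; -XX:-DoEscapeAnalysis is a standard HotSpot product flag) is to rerun the manual version with escape analysis disabled and watch the java.util.ArrayList$Itr allocations reappear in the allocation profile:
java -XX:-DoEscapeAnalysis StreamMemoryTest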
Note that the JIT could also remove the iteration altogether. Your blackhole is false by default, so doing blackhole = blackhole && value does not change it regardless of value, and the whole value calculation could be eliminated, as it has no side effects. I'm not sure whether it actually did this (reading disassembly is quite hard for me), but it's possible.
However, while getIndexOfNothingStreamImpl also seems to have everything inlined, escape analysis fails because there are too many interdependent objects inside the stream API, so actual allocations occur. Thus it really adds five additional objects (the table is composed manually from JOL outputs):
COUNT AVG SUM DESCRIPTION
1 32 32 java.util.ArrayList$ArrayListSpliterator
1 24 24 java.util.stream.FindOps$FindSink$OfRef
1 64 64 java.util.stream.ReferencePipeline$2
1 24 24 java.util.stream.ReferencePipeline$2$1
1 56 56 java.util.stream.ReferencePipeline$Head
5 200 (total)
So every invocation of this particular stream actually allocates 200 additional bytes. As you perform 100_000_000 iterations, in total the Stream version should allocate 10^8*200/2^30 = 18.62 GB more than the manual version, which is close to your result. I think the AtomicLong inside Random is scalarized as well, but both the Iterator and the AtomicLong are present during the warmup iterations (until the JIT actually creates the most optimized version), which would explain the minor discrepancies in the numbers.
This additional 200-byte allocation does not depend on the stream size, but it does depend on the number of intermediate stream operations (in particular, every additional filter step would add 64+24 = 88 bytes more). However, note that these objects are usually short-lived, are allocated quickly, and can be collected by minor GC. In most real-life applications you probably do not have to worry about this.
It's not only more memory due to the infrastructure that is needed to build the Stream API; it might also be slower in terms of speed (at least for these small inputs).
There is a presentation from one of the developers from Oracle (it is in Russian, but that is not the point) that shows a trivial example (not much more complicated than yours) where the speed of execution is 30% worse with Streams vs loops. He says that's pretty normal.
One thing that I've noticed that not a lot of people realize is that using Streams (lambdas and method references, to be more precise) will also create (potentially) a lot of classes that you will not know about.
Try to run your example with:
-Djdk.internal.lambda.dumpProxyClasses=/Some/Path/Of/Yours
and see how many additional classes are created by your code and the code that Streams need (via ASM).

Memory error in java

I am trying to write a Fibonacci algorithm in Java, and I am running into problems with the sizes of int and long. My simple code is the following, which works up to a limit for n:
public static long f(int n) {
    if (n == 1 || n == 2)
        return 1;
    else {
        long value = f(n-2) + f(n-1);
        return value;
    }
}
If in my main I pass 50, for example, my code gets "crushed", I am guessing due to the size of the outcome. I also have another approach, which I struggle to understand; it is the following:
private static long[] cache = new long[60];

public static long f(int n) {
    if (n == 1 || n == 2)
        return 1;
    else if (cache[n] > 0)
        return cache[n];
    else {
        long value = f(n-2) + f(n-1);
        cache[n] = value;
        return value;
    }
}
With this approach everything works fine whatever n is; my issue is that I cannot see what the difference is.
By "crushed" you mean that the computation runs very long. The reason is that the same call is made many times. If you add this to your method:
static long count;
public static long f(int n) {
count++;
...
you'll see how many times the method is executed. For f(50), it is actually calling the method 25,172,538,049 times, which runs in 41 seconds in my machine.
When you cache the results of previous invocations, called memoization, you eliminate all the redundant calls: e.g. f(40) = f(39) + f(38), but f(39) = f(38) + f(37), so f(38) is reached twice. Remembering the result of f(38) means that subsequent invocations have the answer immediately, without having to redo the recursion. (In fact, the call count satisfies count(n) = count(n-1) + count(n-2) + 1, which works out to count(n) = 2*f(n) - 1, so the number of calls itself grows like the Fibonacci sequence, as the table below shows.)
Without memoization, I get this:
n f(n) count time(ns)
== ============== =============== ==============
1 1 1 6,273
2 1 1 571
3 2 3 855
4 3 5 1,141
5 5 9 1,425
6 8 15 1,140
7 13 25 1,996
8 21 41 2,851
9 34 67 7,413
10 55 109 16,536
11 89 177 8,839
12 144 287 19,103
13 233 465 21,098
14 377 753 11,405
15 610 1,219 5,703
16 987 1,973 9,979
17 1,597 3,193 21,099
18 2,584 5,167 32,788
19 4,181 8,361 35,639
20 6,765 13,529 57,307
21 10,946 21,891 91,521
22 17,711 35,421 147,687
23 28,657 57,313 237,496
24 46,368 92,735 283,970
25 75,025 150,049 331,583
26 121,393 242,785 401,720
27 196,418 392,835 650,052
28 317,811 635,621 1,053,483
29 514,229 1,028,457 1,702,679
30 832,040 1,664,079 2,750,745
31 1,346,269 2,692,537 4,455,137
32 2,178,309 4,356,617 12,706,520
33 3,524,578 7,049,155 11,714,051
34 5,702,887 11,405,773 19,571,980
35 9,227,465 18,454,929 30,605,757
36 14,930,352 29,860,703 51,298,507
37 24,157,817 48,315,633 84,473,965
38 39,088,169 78,176,337 127,818,746
39 63,245,986 126,491,971 208,727,118
40 102,334,155 204,668,309 336,785,071
41 165,580,141 331,160,281 543,006,638
42 267,914,296 535,828,591 875,782,771
43 433,494,437 866,988,873 1,429,555,753
44 701,408,733 1,402,817,465 2,301,577,345
45 1,134,903,170 2,269,806,339 3,724,691,882
46 1,836,311,903 3,672,623,805 6,010,675,962
47 2,971,215,073 5,942,430,145 9,706,561,705
48 4,807,526,976 9,615,053,951 15,715,064,841
49 7,778,742,049 15,557,484,097 25,427,015,418
50 12,586,269,025 25,172,538,049 41,126,559,697
If you get a StackOverflowError, it is due to the recursive nature of your algorithm. The second algorithm stores known results in an array to prevent calls from piling up when it asks for an already computed result.
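For completeness, here is a sketch of an iterative variant (mine, not from the original answers) that avoids both the redundant calls and the deep recursion entirely:

public static long f(int n) {
    long a = 1, b = 1;  // f(1) and f(2)
    for (int i = 3; i <= n; i++) {
        long next = a + b;  // f(i) = f(i-2) + f(i-1)
        a = b;
        b = next;
    }
    return b;
}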
One of the problems can be that the result is a big number, larger than the int limit:
fibonacci for 50 =
12,586,269,025
2,147,483,647 - int max
and the other can be the recursion StackOverflowError, as @Xalamadrax pointed out.

Performance of variable argument methods in Java

I have this question about the performance of a method in Java with a variable number of parameters.
Say I have the following 2 alternatives:
public static final boolean isIn(int i, int v1, int v2) {
    return (v1 == i) || (v2 == i);
}

public static final boolean isIn(int i, int... values) {
    for (int v : values) {
        if (i == v) {
            return true;
        }
    }
    return false;
}
Now the main problem comes when I have versions of the first method that go up to 20, 30 or even 50 parameters. That just hurts the eyes. OK, this is legacy code and I'd like to replace all of it with the single variable-arguments method.
Any idea what the impact on performance would be? Any chance the compiler does some optimization for the second method so that it more or less resembles the first form?
EDIT: Ok, maybe I was not clear enough. I don't have performance problems with the methods with 50 arguments. It's just about readability as Peter Lawrey said.
I was wondering about performance problems if I switch to the new method with variable number of arguments.
In other words: what would be the best way to do it if you care about performance? Methods with 50 arguments or the only one method with variable arguments?
I had the same question, and turned to experimentation.
public class ArgTest {
    int summation(int a, int b, int c, int d, int e, int f) {
        return a + b + c + d + e + f;
    }

    int summationVArgs(int... args) {
        int sum = 0;
        for (int arg : args) {
            sum += arg;
        }
        return sum;
    }

    final static public int META_ITERATIONS = 200;
    final static public int ITERATIONS = 1000000;

    static public void main(String[] args) {
        final ArgTest at = new ArgTest();
        for (int loop = 0; loop < META_ITERATIONS; loop++) {
            int sum = 0;
            final long fixedStart = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++) {
                sum += at.summation(2312, 45569, -9816, 19122, 4991, 901776);
            }
            final long fixedEnd = System.currentTimeMillis();
            final long vargStart = fixedEnd;
            for (int i = 0; i < ITERATIONS; i++) {
                sum += at.summationVArgs(2312, 45569, -9816, 19122, 4991, 901776);
            }
            final long vargEnd = System.currentTimeMillis();
            System.out.printf("%03d:%d Fixed-Args: %d ms\n", loop+1, ITERATIONS, fixedEnd - fixedStart);
            System.out.printf("%03d:%d Vargs-Args: %d ms\n", loop+1, ITERATIONS, vargEnd - vargStart);
        }
        System.exit(0);
    }
}
If you run this code on a modern JVM (here 1.8.0_20), you will see that the variable number of arguments causes overhead in performance and possibly in memory consumption as well.
I'll only post the first 25 runs:
001:1000000 Fixed-Args: 16 ms
001:1000000 Vargs-Args: 45 ms
002:1000000 Fixed-Args: 13 ms
002:1000000 Vargs-Args: 32 ms
003:1000000 Fixed-Args: 0 ms
003:1000000 Vargs-Args: 27 ms
004:1000000 Fixed-Args: 0 ms
004:1000000 Vargs-Args: 22 ms
005:1000000 Fixed-Args: 0 ms
005:1000000 Vargs-Args: 38 ms
006:1000000 Fixed-Args: 0 ms
006:1000000 Vargs-Args: 11 ms
007:1000000 Fixed-Args: 0 ms
007:1000000 Vargs-Args: 17 ms
008:1000000 Fixed-Args: 0 ms
008:1000000 Vargs-Args: 40 ms
009:1000000 Fixed-Args: 0 ms
009:1000000 Vargs-Args: 89 ms
010:1000000 Fixed-Args: 0 ms
010:1000000 Vargs-Args: 21 ms
011:1000000 Fixed-Args: 0 ms
011:1000000 Vargs-Args: 16 ms
012:1000000 Fixed-Args: 0 ms
012:1000000 Vargs-Args: 26 ms
013:1000000 Fixed-Args: 0 ms
013:1000000 Vargs-Args: 7 ms
014:1000000 Fixed-Args: 0 ms
014:1000000 Vargs-Args: 7 ms
015:1000000 Fixed-Args: 0 ms
015:1000000 Vargs-Args: 6 ms
016:1000000 Fixed-Args: 0 ms
016:1000000 Vargs-Args: 141 ms
017:1000000 Fixed-Args: 0 ms
017:1000000 Vargs-Args: 139 ms
018:1000000 Fixed-Args: 0 ms
018:1000000 Vargs-Args: 106 ms
019:1000000 Fixed-Args: 0 ms
019:1000000 Vargs-Args: 70 ms
020:1000000 Fixed-Args: 0 ms
020:1000000 Vargs-Args: 6 ms
021:1000000 Fixed-Args: 0 ms
021:1000000 Vargs-Args: 5 ms
022:1000000 Fixed-Args: 0 ms
022:1000000 Vargs-Args: 6 ms
023:1000000 Fixed-Args: 0 ms
023:1000000 Vargs-Args: 12 ms
024:1000000 Fixed-Args: 0 ms
024:1000000 Vargs-Args: 37 ms
025:1000000 Fixed-Args: 0 ms
025:1000000 Vargs-Args: 12 ms
...
Even at the best of times, the Vargs-Args never dropped to 0ms.
The compiler does next to no optimisation. The JVM can optimise code, but the two methods won't perform anything like each other. If you have lines of code like isIn(i, 1,2,3,4,5,6,7,8,9 /* plus 40 more */), you have more than performance issues to worry about, IMHO; I would worry about readability first.
If you are worried about performance, pass the arguments as an int[] which is reused.
BTW: the most efficient way to look up a large set of int values is to use a Set implementation like TIntHashSet.
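A minimal sketch of the reused-int[] suggestion above (all names are hypothetical):

public class IsInDemo {
    // allocated once, instead of a fresh int[] on every varargs call
    private static final int[] VALID_CODES = {2, 3, 5, 7, 11};

    static boolean isIn(int i, int[] values) {
        for (int v : values) {
            if (i == v) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isIn(7, VALID_CODES)); // true, with no per-call allocation
    }
}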
To @Canonical Chris: I don't think the problem in your test comes from variable arguments. The function summationVArgs takes more time to complete because of the for loop.
I created this function and added it to the benchmark:
int summationVArgs2(int... args) {
    return args[0] + args[1] + args[2] + args[3] + args[4] + args[5];
}
and this is what I see:
028:1000000 Fixed-Args: 0 ms
028:1000000 Vargs-Args: 12 ms
028:1000000 Vargs2-Args2: 0 ms
The for loop in summationVArgs compiles to more operations than the plain additions: an add to increment the loop counter, a compare to check the loop condition, and a branch to continue or exit the loop, all of which execute on every iteration except the final exit branch.
Come back when you have profiler output that says this is a problem. Until then, it's premature optimization.
It will be the same as if you declared
isIn(int i, int[] values)
However, there will be some small overhead in packaging the variables up into an array when calling your method.
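To make that packaging overhead concrete, this is what javac does with a varargs call, in effect (the literal values are arbitrary):

// what you write:
isIn(i, 1, 2, 3);
// what javac emits, in effect - a fresh array allocated on every call:
isIn(i, new int[]{1, 2, 3});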
Heard about the two rules of optimisation?
Don't optimize.
(For experts only!) Don't optimize yet.
In other words, this is nothing you should care about from a performance point of view.
