While looking at some profiling results, I noticed that using streams within a tight loop (in place of another nested loop) incurred significant memory overhead in the form of objects of types java.util.stream.ReferencePipeline and java.util.ArrayList$ArrayListSpliterator. I converted the offending streams to foreach loops, and memory consumption decreased significantly.
I know that streams make no promises about performing any better than ordinary loops, but I was under the impression that the difference would be negligible. In this case it was roughly a 40% increase.
Here is the test class I wrote to isolate the problem. I monitored memory consumption and object allocation with JFR:
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;
public class StreamMemoryTest {
private static boolean blackHole = false;
public static List<Integer> getRandListOfSize(int size) {
ArrayList<Integer> randList = new ArrayList<>(size);
Random rnGen = new Random();
for (int i = 0; i < size; i++) {
randList.add(rnGen.nextInt(100));
}
return randList;
}
public static boolean getIndexOfNothingManualImpl(List<Integer> nums, Predicate<Integer> predicate) {
for (Integer num : nums) {
// Impossible condition
if (predicate.test(num)) {
return true;
}
}
return false;
}
public static boolean getIndexOfNothingStreamImpl(List<Integer> nums, Predicate<Integer> predicate) {
Optional<Integer> first = nums.stream().filter(predicate).findFirst();
return first.isPresent();
}
public static void consume(boolean value) {
blackHole = blackHole && value;
}
public static boolean result() {
return blackHole;
}
public static void main(String[] args) {
// 100 million trials
int numTrials = 100000000;
System.out.println("Beginning test");
for (int i = 0; i < numTrials; i++) {
List<Integer> randomNums = StreamMemoryTest.getRandListOfSize(100);
consume(StreamMemoryTest.getIndexOfNothingStreamImpl(randomNums, x -> x < 0));
// or ...
// consume(StreamMemoryTest.getIndexOfNothingManualImpl(randomNums, x -> x < 0));
if (randomNums == null) {
break;
}
}
System.out.print(StreamMemoryTest.result());
}
}
Stream implementation:
Memory Allocated for TLABs 64.62 GB
Class Average Object Size(bytes) Total Object Size(bytes) TLABs Average TLAB Size(bytes) Total TLAB Size(bytes) Pressure(%)
java.lang.Object[] 415.974 6,226,712 14,969 2,999,696.432 44,902,455,888 64.711
java.util.stream.ReferencePipeline$2 64 131,264 2,051 2,902,510.795 5,953,049,640 8.579
java.util.stream.ReferencePipeline$Head 56 72,744 1,299 3,070,768.043 3,988,927,688 5.749
java.util.stream.ReferencePipeline$2$1 24 25,128 1,047 3,195,726.449 3,345,925,592 4.822
java.util.Random 32 30,976 968 3,041,212.372 2,943,893,576 4.243
java.util.ArrayList 24 24,576 1,024 2,720,615.594 2,785,910,368 4.015
java.util.stream.FindOps$FindSink$OfRef 24 18,864 786 3,369,412.295 2,648,358,064 3.817
java.util.ArrayList$ArrayListSpliterator 32 14,720 460 3,080,696.209 1,417,120,256 2.042
Manual implementation:
Memory Allocated for TLABs 46.06 GB
Class Average Object Size(bytes) Total Object Size(bytes) TLABs Average TLAB Size(bytes) Total TLAB Size(bytes) Pressure(%)
java.lang.Object[] 415.961 4,190,392 10,074 4,042,267.769 40,721,805,504 82.33
java.util.Random 32 32,064 1,002 4,367,131.521 4,375,865,784 8.847
java.util.ArrayList 24 14,976 624 3,530,601.038 2,203,095,048 4.454
Has anyone else encountered issues with the stream objects themselves consuming memory? / Is this a known issue?
Using the Stream API you do indeed allocate more memory, though your experimental setup is somewhat questionable. I've never used JFR, but my findings using JOL are quite similar to yours.
Note that you measure not only the heap allocated while querying the ArrayList, but also the heap allocated while creating and populating it. The allocations during the creation and population of a single ArrayList should look like this (64-bit, compressed OOPs, via JOL):
COUNT AVG SUM DESCRIPTION
1 416 416 [Ljava.lang.Object;
1 24 24 java.util.ArrayList
1 32 32 java.util.Random
1 24 24 java.util.concurrent.atomic.AtomicLong
4 496 (total)
So most of the allocated memory is the Object[] array used inside ArrayList to store the data. The AtomicLong is part of the Random class implementation. If you perform this 100_000_000 times, then you should have at least 496*10^8/2^30 = 46.2 GB allocated in both tests. Nevertheless, this part can be skipped, as it is identical for both tests.
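For reference, here is a minimal JOL harness (a sketch; the class name is mine, and jol-core must be on the classpath) that prints a footprint table like the one above:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.openjdk.jol.info.GraphLayout;

public class FootprintDemo {
    public static void main(String[] args) {
        Random rnGen = new Random();
        List<Integer> randList = new ArrayList<>(100);
        for (int i = 0; i < 100; i++) {
            randList.add(rnGen.nextInt(100));
        }
        // Everything reachable from the list and the Random; the Random holds
        // the AtomicLong seed shown above. Cached Integers appear in the output too.
        System.out.println(GraphLayout.parseInstance(randList, rnGen).toFootprint());
    }
}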
Another interesting thing here is inlining. The JIT is smart enough to inline the whole getIndexOfNothingManualImpl (run with java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining StreamMemoryTest):
StreamMemoryTest::main # 13 (59 bytes)
...
# 30 StreamMemoryTest::getIndexOfNothingManualImpl (43 bytes) inline (hot)
# 1 java.util.ArrayList::iterator (10 bytes) inline (hot)
\-> TypeProfile (2132/2132 counts) = java/util/ArrayList
# 6 java.util.ArrayList$Itr::<init> (6 bytes) inline (hot)
# 2 java.util.ArrayList$Itr::<init> (26 bytes) inline (hot)
# 6 java.lang.Object::<init> (1 bytes) inline (hot)
# 8 java.util.ArrayList$Itr::hasNext (20 bytes) inline (hot)
\-> TypeProfile (215332/215332 counts) = java/util/ArrayList$Itr
# 8 java.util.ArrayList::access$100 (5 bytes) accessor
# 17 java.util.ArrayList$Itr::next (66 bytes) inline (hot)
# 1 java.util.ArrayList$Itr::checkForComodification (23 bytes) inline (hot)
# 14 java.util.ArrayList::access$100 (5 bytes) accessor
# 28 StreamMemoryTest$$Lambda$1/791452441::test (8 bytes) inline (hot)
\-> TypeProfile (213200/213200 counts) = StreamMemoryTest$$Lambda$1
# 4 StreamMemoryTest::lambda$main$0 (13 bytes) inline (hot)
# 1 java.lang.Integer::intValue (5 bytes) accessor
# 8 java.util.ArrayList$Itr::hasNext (20 bytes) inline (hot)
# 8 java.util.ArrayList::access$100 (5 bytes) accessor
# 33 StreamMemoryTest::consume (19 bytes) inline (hot)
The disassembly actually shows that no iterator allocation is performed after warm-up: because escape analysis successfully tells the JIT that the iterator object does not escape, it is simply scalarized. Were the iterator actually allocated, it would take an additional 32 bytes:
COUNT AVG SUM DESCRIPTION
1 32 32 java.util.ArrayList$Itr
1 32 (total)
Note that the JIT could also remove the iteration altogether. Your blackhole is false by default, so blackhole = blackhole && value never changes it regardless of value, and the computation of value could be eliminated entirely, as it has no side effects. I'm not sure whether the JIT actually did this (reading disassembly is quite hard for me), but it's possible.
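If you want to rule that out, one option (a sketch of my own, not from the question) is to make the field volatile and fold the value in with XOR, so the computation can never be proven dead:

private static volatile boolean blackHole = false;

public static void consume(boolean value) {
    // XOR depends on 'value' regardless of the current field contents,
    // and the volatile write is an observable side effect the JIT must keep.
    blackHole = blackHole ^ value;
}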
However, while getIndexOfNothingStreamImpl also seems to have everything inlined, escape analysis fails because there are too many interdependent objects inside the Stream API, so actual allocations occur. Thus the stream version really adds five additional objects per call (the table is composed manually from JOL output):
COUNT AVG SUM DESCRIPTION
1 32 32 java.util.ArrayList$ArrayListSpliterator
1 24 24 java.util.stream.FindOps$FindSink$OfRef
1 64 64 java.util.stream.ReferencePipeline$2
1 24 24 java.util.stream.ReferencePipeline$2$1
1 56 56 java.util.stream.ReferencePipeline$Head
5 200 (total)
So every invocation of this particular stream allocates 200 additional bytes. As you perform 100_000_000 iterations, in total the stream version should allocate 10^8 * 200 / 2^30 = 18.62 GB more than the manual version, which is close to your result. I think the AtomicLong inside Random is scalarized as well, but both the Iterator and the AtomicLong are still allocated during the warm-up iterations (until the JIT produces the most optimized version), which would explain the minor discrepancies in the numbers.
This additional 200-byte allocation does not depend on the stream size, but it does depend on the number of intermediate stream operations (in particular, every additional filter step adds 64 + 24 = 88 bytes, as sketched below). Note, however, that these objects are short-lived and cheap to allocate, and can be collected by minor GC; in most real-life applications you shouldn't have to worry about this.
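To illustrate with the per-object sizes from the table above (the pipeline shapes are mine, mirroring the question's stream):

// One intermediate op: Head + ReferencePipeline$2 + sink + spliterator + FindSink (~200 B per call)
nums.stream().filter(x -> x < 0).findFirst();

// Each extra filter adds one ReferencePipeline$2 (64 B) and one chained sink (24 B) per call
nums.stream().filter(x -> x < 0).filter(x -> x > 100).findFirst();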
It's not only more memory, due to the infrastructure needed to build the stream pipeline; it can also be slower in terms of speed (at least for small inputs like these).
There is a presentation by one of the Oracle developers (it is in Russian, but that is not the point) that shows a trivial example (not much more complicated than yours) where execution is 30% slower with streams than with loops. He says that's pretty normal.
One thing I've noticed that not a lot of people realize is that using streams (lambdas and method references, to be more precise) will also create, potentially, a lot of classes that you will not know about.
Try running your example with:
-Djdk.internal.lambda.dumpProxyClasses=/Some/Path/Of/Yours
and see how many additional classes are generated (via ASM) for your code and for the stream machinery itself.
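For example (the dump path here is just an illustration):

java -Djdk.internal.lambda.dumpProxyClasses=/tmp/lambda-dump StreamMemoryTest

Afterwards, the dump directory contains the generated class files (with names like StreamMemoryTest$$Lambda$1.class), one per lambda call site, in addition to the classes you wrote yourself.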
Suppose I have this code (I don't think the specifics matter much, but just in case, here it is):
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicJDK9 {
static AtomicInteger ai = new AtomicInteger(0);
public static void main(String[] args) {
int sum = 0;
for (int i = 0; i < 30_000; ++i) {
sum += atomicIncrement();
}
System.out.println(sum);
}
public static int atomicIncrement() {
ai.getAndAdd(12);
return ai.get();
}
}
And here is how I am invoking it (using Java 9):
java -XX:+UnlockDiagnosticVMOptions
-XX:-TieredCompilation
-XX:+PrintIntrinsics
AtomicJDK9
What I am trying to find out is what methods were replaced by intrinsic code. The first one that is hit (inside Unsafe):
@HotSpotIntrinsicCandidate
public final int getAndAddInt(Object o, long offset, int delta) {
int v;
do {
v = getIntVolatile(o, offset);
} while (!weakCompareAndSwapIntVolatile(o, offset, v, v + delta));
return v;
}
And this method is indeed present in the output of the above invocation:
# 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
But, the entire output is weird (for me that is):
# 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
# 3 jdk.internal.misc.Unsafe::getIntVolatile (0 bytes) (intrinsic)
# 18 jdk.internal.misc.Unsafe::weakCompareAndSwapIntVolatile (11 bytes) (intrinsic)
# 7 jdk.internal.misc.Unsafe::compareAndSwapInt (0 bytes) (intrinsic)
# 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
Why is the getAndAddInt present twice in the output?
Also, if getAndAddInt is indeed replaced by an intrinsic call, why is there a need to intrinsify all the other methods further down the call stack? They will not be used anymore. I assume it's simply that the stack of method calls is traversed from the bottom.
To illustrate the compiler's logic, I ran the JVM with the following arguments:
-XX:-TieredCompilation -XX:CICompilerCount=1
-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining
And that's what it prints.
337 29 java.util.concurrent.atomic.AtomicInteger::getAndAdd (12 bytes)
# 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
337 30 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes)
# 3 jdk.internal.misc.Unsafe::getIntVolatile (0 bytes) (intrinsic)
# 18 jdk.internal.misc.Unsafe::weakCompareAndSwapIntVolatile (11 bytes) (intrinsic)
338 32 jdk.internal.misc.Unsafe::weakCompareAndSwapIntVolatile (11 bytes)
# 7 jdk.internal.misc.Unsafe::compareAndSwapInt (0 bytes) (intrinsic)
339 33 AtomicJDK9::atomicIncrement (16 bytes)
# 5 java.util.concurrent.atomic.AtomicInteger::getAndAdd (12 bytes) inline (hot)
# 8 jdk.internal.misc.Unsafe::getAndAddInt (27 bytes) (intrinsic)
# 12 java.util.concurrent.atomic.AtomicInteger::get (5 bytes) accessor
Methods are intrinsics only for the compiler, not for the interpreter.
Every method starts out in the interpreter until it is considered hot.
AtomicInteger.getAndAdd is called not only from your code, but from common JDK code, too.
That is, AtomicInteger.getAndAdd reaches the invocation threshold a little earlier than your AtomicJDK9.atomicIncrement. It is then submitted to the compilation queue, and that's where the first intrinsic printout comes from.
The HotSpot JVM compiles methods in the background; while a method is being compiled, execution continues in the interpreter.
While AtomicInteger.getAndAdd is interpreted, the Unsafe.getAndAddInt and Unsafe.weakCompareAndSwapIntVolatile methods also reach the invocation threshold and are compiled. The next three intrinsic lines are printed while compiling these Unsafe methods.
Finally, AtomicJDK9.atomicIncrement also reaches the invocation threshold and is compiled. The last intrinsic printout corresponds to your method.
I use a HashMap to store a Q-table for an implementation of a reinforcement learning algorithm. My HashMap needs to hold 15,000,000 entries. When I ran my algorithm, I saw that the memory used by the process was over 1,000,000 KB, while my calculation said it should use no more than 530,000 KB. I wrote a small example and observed the same high memory usage:
import java.util.HashMap;

public static void main(String[] args) {
HashMap<Integer, Integer> map = new HashMap<>(16_000_000, 1);
for(int i = 0; i < 15_000_000; i++){
map.put(i, i);
}
}
My memory calculation:
Each entry is 32 bytes
Capacity is 15,000,000
A HashMap instance uses: 32 * SIZE + 4 * CAPACITY
memory = (15,000,000 * 32 + 15,000,000 * 4) / 1024 = 527,343.75 KB
Where am I wrong in my memory calculations?
Well, in the best case, we assume a word size of 32 bits/4 bytes (with CompressedOops and CompressedClassPointers). Then, a map entry consists of two words of JVM overhead (mark word and klass pointer), key, value, hashcode and next pointer, making 6 words total, in other words, 24 bytes. So having 15,000,000 entry instances will consume 360 MB.
Additionally, there’s the array holding the entries. The HashMap uses capacities that are a power of two, so for 15,000,000 entries, the array size is at least 16,777,216, consuming 64 MiB.
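The sizing rule is easy to sketch (this helper is mine; the real HashMap.tableSizeFor uses bit-smearing, but computes the same result for these sizes):

// Next power of two >= n: the size HashMap picks for its internal table
static int tableSizeFor(int n) {
    return (n <= 1) ? 1 : Integer.highestOneBit(n - 1) << 1;
}

// tableSizeFor(15_000_000) == 16_777_216 (2^24), i.e. 16,777,216 * 4 B = 64 MiB of references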
Then, you have 30,000,000 Integer instances. The problem is that map.put(i, i) performs two boxing operations and while the JVM is encouraged to reuse objects when boxing, it is not required to do so and reusing won’t happen in your simple program that is likely to complete before the optimizer ever interferes.
To be precise, the first 128 Integer instances are reused, because for values in the -128 … +127 range sharing is mandatory. But the implementation does this by initializing the entire cache on first use, so although the first 128 iterations don't create two instances each, the cache itself consists of 256 instances (twice that number), and we end up with 30,000,000 Integer instances total again. An Integer instance consists of at least the two JVM-specific words and the actual int value, which would make 12 bytes, but due to the default alignment the actually consumed memory will be 16 bytes, divisible by eight.
So the 30,000,000 created Integer instances consume 480 MB.
This makes a total of 360 MB + 64 MiB + 480 MB, which is more than 900 MB, making a heap size of 1 GB entirely plausible.
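As a back-of-the-envelope check of the estimates above (plain Java arithmetic, mirroring the numbers in this answer):

long entries = 15_000_000L * 24;        // entry instances:   360,000,000 B, ~360 MB
long table   = 16_777_216L * 4;         // Node[] references:  67,108,864 B, 64 MiB
long boxes   = 30_000_000L * 16;        // Integer instances: 480,000,000 B, ~480 MB
long total   = entries + table + boxes; // ~907,000,000 B, so a 1 GB heap is plausible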
But that's what profiling tools are for. After running your program, I got the following (profiler screenshot not reproduced here).
Note that this tool only reports the used size of the objects, i.e. the 12 bytes for an Integer object without considering the padding that you will notice when looking at the total memory allocated by the JVM.
I had more or less the same requirement as you, so I decided to throw my thoughts in here.
1) There is a great tool for that: jol.
2) Arrays are objects too, and every object in Java has two additional headers: mark and klass, usually 8 and 4 bytes in size, respectively (with compressed class pointers; this can be tweaked, but let's not go into details).
3) It is important to note the map's load factor here (because it influences the resizing of the internal array). Here is an example:
HashMap<Integer, Integer> map = new HashMap<>(16, 1);
for (int i = 0; i < 13; ++i) {
map.put(i, i);
}
System.out.println(GraphLayout.parseInstance(map).toFootprint());
HashMap<Integer, Integer> map2 = new HashMap<>(16);
for (int i = 0; i < 13; ++i) {
map2.put(i, i);
}
System.out.println(GraphLayout.parseInstance(map2).toFootprint());
The output of the two differs (only the relevant lines shown):
1 80 80 [Ljava.util.HashMap$Node; // first case
1 144 144 [Ljava.util.HashMap$Node; // second case
See how the size is bigger in the second case, because the backing array is twice as big (32 slots). You can only put 12 entries into an array of capacity 16, because the default load factor is 0.75: 16 * 0.75 = 12.
Why 144? The math here is easy: an array is an object, so 12 bytes of headers plus 4 bytes for the length field; plus 32 * 4 = 128 bytes for the references, giving 144 bytes in total (already aligned to 8 bytes).
4) Entries are stored inside either a Node or a TreeNode (a Node is 32 bytes, a TreeNode 56 bytes). Since you use only Integers, you will have only Nodes, as there should be no hash collisions. Even if there were collisions, that would not yet mean a bucket gets converted to TreeNodes; there is a threshold for that. We can easily prove that there will be Nodes only:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public static void main(String[] args) {
Map<Integer, List<Integer>> map = IntStream.range(0, 15_000_000).boxed()
.collect(Collectors.groupingBy(WillThereBeTreeNodes::hash)); // WillThereBeTreeNodes - current class name
System.out.println(map.size());
}
private static int hash(Integer key) {
int h = 0;
return (h = key.hashCode()) ^ h >>> 16;
}
The result of this will be 15_000_000: there was no merging, thus no hash collisions.
5) When you create Integer objects, there is a pool for them (ranging from -128 to 127; the upper bound can be tweaked as well, but let's not, for simplicity).
6) An Integer is an object, thus it has a 12-byte header and 4 bytes for the actual int value.
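A quick demo of that pool:

// Values in -128..127 come from the shared cache; anything outside is freshly allocated
System.out.println(Integer.valueOf(127) == Integer.valueOf(127)); // true (same cached instance)
System.out.println(Integer.valueOf(128) == Integer.valueOf(128)); // false (two new instances)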
With this in mind, let's see the output for 15_000_000 entries (since you are using a load factor of one, there is no need for an internal capacity of 16_000_000). It will take a lot of time, so be patient. I also gave it
-Xmx12G and -Xms12G
HashMap<Integer, Integer> map = new HashMap<>(15_000_000, 1);
for (int i = 0; i < 15_000_000; ++i) {
map.put(i, i);
}
System.out.println(GraphLayout.parseInstance(map).toFootprint());
Here is what jol said:
java.util.HashMap#9629756d footprint:
COUNT AVG SUM DESCRIPTION
1 67108880 67108880 [Ljava.util.HashMap$Node;
29999872 16 479997952 java.lang.Integer
1 48 48 java.util.HashMap
15000000 32 480000000 java.util.HashMap$Node
44999874 1027106880 (total)
Let's start from bottom.
The total size of the HashMap footprint is 1,027,106,880 bytes, or roughly 1,027 MB.
Node is the wrapper class in which each entry resides. It has a size of 32 bytes; there are 15 million entries, hence the line:
15000000 32 480000000 java.util.HashMap$Node
Why 32 bytes? It stores the hash code (4 bytes), key reference (4 bytes), value reference (4 bytes), and next-Node reference (4 bytes), plus a 12-byte header and 4 bytes of padding, resulting in 32 bytes total.
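If you want to verify the Node layout itself, the same JOL trick shown below for HashMap works; since Node is package-private, one way (a sketch) is to look the class up by name:

// Requires jol-core; Class.forName throws ClassNotFoundException,
// so call this from a main that declares 'throws Exception'
System.out.println(org.openjdk.jol.info.ClassLayout
        .parseClass(Class.forName("java.util.HashMap$Node"))
        .toPrintable());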
1 48 48 java.util.HashMap
A single HashMap instance: 48 bytes for its internals.
If you really want to know why 48 bytes:
System.out.println(ClassLayout.parseClass(HashMap.class).toPrintable());
java.util.HashMap object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 12 (object header) N/A
12 4 Set AbstractMap.keySet N/A
16 4 Collection AbstractMap.values N/A
20 4 int HashMap.size N/A
24 4 int HashMap.modCount N/A
28 4 int HashMap.threshold N/A
32 4 float HashMap.loadFactor N/A
36 4 Node[] HashMap.table N/A
40 4 Set HashMap.entrySet N/A
44 4 (loss due to the next object alignment)
Instance size: 48 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
Next the Integer instances:
29999872 16 479997952 java.lang.Integer
30 million Integer objects (minus the 128 that are cached in the pool).
1 67108880 67108880 [Ljava.util.HashMap$Node;
We have 15,000,000 entries, but the internal array of a HashMap has a power-of-two size, so that's 16,777,216 references of 4 bytes each:
16,777,216 * 4 = 67,108,864 bytes, plus the 12-byte header and the 4-byte array length field = 67,108,880 bytes.
I have a file of ~50 million strings that I need to add to a symbol table of some sort on startup, then search several times with reasonable speed.
I tried using a DLB trie, since lookup would be relatively fast given that all strings are < 10 characters, but while populating the DLB I would get either a "GC overhead limit exceeded" or an "out of memory: heap space" error. The same errors occurred with HashMap. This is for an assignment that will be compiled and run by a grader, so I would rather not just allocate more heap space. Is there a different data structure that would use less memory while still offering reasonable lookup time?
If you expect low prefix sharing, then a trie may not be your best option.
Since you only load the lookup table once, at startup, and your goal is low memory footprint with "reasonable speed" for lookup, your best option is likely a sorted array and binary search for lookup.
First, you load the data into an array. Since you likely don't know the size up front, you load into an ArrayList. You then extract the final array from the list.
Assuming you load 50 million 10 character strings, memory will be:
10 character string:
String: 12 byte header + 4 byte 'hash' + 4 byte 'value' ref = 24 bytes (aligned)
char[]: 12 byte header + 4 byte 'length' + 10 * 2 byte 'char' = 40 bytes (aligned)
Total: 24 + 40 = 64 bytes
Array of 50 million 10 character strings:
String[]: 12 byte header + 4 byte 'length' + 50,000,000 * 4 byte 'String' ref = 200,000,016 bytes
Values: 50,000,000 * 64 bytes = 3,200,000,000 bytes
Total: 200,000,016 + 3,200,000,000 = 3,400,000,016 bytes = 3.2 GB
You will need another copy of the String[] when you convert the ArrayList<String> to String[]. The Arrays.sort() operation may need 50% array size (~100,000,000 bytes) for temporary storage, but if ArrayList is released for GC before you sort, that space can be reused.
So, total requirement is ~3.5 GB, just for the symbol table.
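If you want to sanity-check the per-string numbers, a one-liner with JOL will do (assuming jol-core on the classpath and a JDK 8 style char[]-backed String, as in the math above):

// Footprint of a single 10-character string: ~24 B for the String, ~40 B for its char[]
System.out.println(org.openjdk.jol.info.GraphLayout.parseInstance("abcdefghij").toFootprint());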
Now, if space is truly at a premium, you can squeeze that. As you can see, the String itself adds an overhead of 24 bytes, out of the 64 bytes. You can make the symbol table use char[] directly.
Also, if your strings are all US-ASCII or ISO-8859-1, you can convert the char[] to a byte[], saving half the bytes.
Combined, that reduces the value size from 64 bytes to 32 bytes, and the total symbol table size from 3.2 GB to 1.8 GB, or roughly 2 GB during loading.
UPDATE
Assuming the input list of strings is already sorted, below is an example of how to do this. As an MCVE it just uses a small static array as input, but you can easily read the strings from a file instead.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {
public static void main(String[] args) {
String[] wordsFromFile = { "appear", "attack", "cellar", "copper",
"erratic", "grotesque", "guitar", "guttural",
"kittens", "mean", "suit", "trick" };
List<byte[]> wordList = new ArrayList<>();
for (String word : wordsFromFile) // Simulating read from file
wordList.add(word.getBytes(StandardCharsets.US_ASCII));
byte[][] symbolTable = wordList.toArray(new byte[wordList.size()][]);
test(symbolTable, "abc");
test(symbolTable, "attack");
test(symbolTable, "car");
test(symbolTable, "kittens");
test(symbolTable, "xyz");
}
private static void test(byte[][] symbolTable, String word) {
int idx = Arrays.binarySearch(symbolTable,
word.getBytes(StandardCharsets.US_ASCII),
Test::compare);
if (idx < 0)
System.out.println("Not found: " + word);
else
System.out.println("Found : " + word);
}
private static int compare(byte[] w1, byte[] w2) {
for (int i = 0, cmp; i < w1.length && i < w2.length; i++)
if ((cmp = Byte.compare(w1[i], w2[i])) != 0)
return cmp;
return Integer.compare(w1.length, w2.length);
}
}
Output
Not found: abc
Found : attack
Not found: car
Found : kittens
Not found: xyz
Use a single char array to store all strings (sorted), and an array of integers for the offsets. String n occupies the chars from offset[n - 1] (inclusive) to offset[n] (exclusive), with offset[-1] taken to be zero (i.e., string 0 starts at the beginning of the array).
Memory usage will be 1 GB (50M * 10 * 2) for the char array and 200 MB (50M * 4) for the offset array. Very compact, even with two-byte chars.
You will have to build this array by merging smaller sorted string arrays in order not to exceed your heap space. But once you have it, it should be reasonably fast.
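A minimal sketch of that layout and its lookup (class and method names are mine; it assumes the packed strings are already sorted):

final class PackedStrings {
    private final char[] chars;   // all strings concatenated in sorted order
    private final int[] offsets;  // offsets[n] = end of string n (exclusive)

    PackedStrings(char[] chars, int[] offsets) {
        this.chars = chars;
        this.offsets = offsets;
    }

    // Binary search over the packed, sorted strings.
    boolean contains(String word) {
        int lo = 0, hi = offsets.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, word);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return true;
        }
        return false;
    }

    // Compare string n against 'word' without materializing a String object.
    private int compareAt(int n, String word) {
        int start = (n == 0) ? 0 : offsets[n - 1];
        int len = offsets[n] - start;
        for (int i = 0; i < len && i < word.length(); i++) {
            int cmp = Character.compare(chars[start + i], word.charAt(i));
            if (cmp != 0) return cmp;
        }
        return Integer.compare(len, word.length());
    }
}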
Alternatively, you could try a memory-optimized trie implementation such as https://github.com/rklaehn/radixtree . It uses not just prefix sharing but also structural sharing for common suffixes, so unless your strings are completely random it should be quite compact. See the space-usage benchmark. But it is Scala, not Java.
Is it better to pass the size of a Collection to its constructor if I know the size at that point? Is the saving from avoiding expansion and re-allocation noticeable?
What if I know the minimal size of the Collection but not the upper bound? Is it still worth creating it with at least the minimal size?
Different collections have different performance consequences here; for ArrayList the saving can be very noticeable.
import java.util.*;
public class Main{
public static void main(String[] args){
List<Integer> numbers = new ArrayList<Integer>(5);
int max = 1000000;
// Warmup
for (int i=0;i<max;i++) {
numbers.add(i);
}
long start = System.currentTimeMillis();
numbers = new ArrayList<Integer>(max);
for (int i=0;i<max;i++) {
numbers.add(i);
}
System.out.println("Preall: "+(System.currentTimeMillis()-start));
start = System.currentTimeMillis();
numbers = new ArrayList<Integer>(5);
for (int i=0;i<max;i++) {
numbers.add(i);
}
System.out.println("Resizing: "+(System.currentTimeMillis()-start));
}
}
Result:
Preall: 26
Resizing: 58
Running with max set to ten times that value (10,000,000) gives:
Preall: 510
Resizing: 935
So you can see that even at different sizes the ratio stays about the same.
This is pretty much a worst-case test, but filling a list one element at a time is very common, and you can see that there was roughly a 2x speed difference.
OK, here's my JMH code:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3, time = 1)
@Fork(3)
public class Comparison
{
    static final int size = 1_000;

    @GenerateMicroBenchmark
    public List<?> testSpecifiedSize() {
        final ArrayList<Integer> l = new ArrayList<>(size);
        for (int i = 0; i < size; i++) l.add(1);
        return l;
    }

    @GenerateMicroBenchmark
    public List<?> testDefaultSize() {
        final ArrayList<Integer> l = new ArrayList<>();
        for (int i = 0; i < size; i++) l.add(1);
        return l;
    }
}
My results for size = 10_000:
Benchmark Mode Thr Cnt Sec Mean Mean error Units
testDefaultSize avgt 1 9 1 80.770 2.095 usec/op
testSpecifiedSize avgt 1 9 1 50.060 1.078 usec/op
Results for size = 1_000:
Benchmark Mode Thr Cnt Sec Mean Mean error Units
testDefaultSize avgt 1 9 1 6.208 0.131 usec/op
testSpecifiedSize avgt 1 9 1 4.900 0.078 usec/op
My interpretation:
presizing has some edge on the default size;
the edge isn't that spectacular;
the absolute time spent on the task of adding to the list is quite insignificant.
My conclusion:
Add the initial size if that makes you feel warmer around the heart, but objectively speaking, your customer is highly unlikely to notice the difference.
All collections are auto-expanding, so not knowing the bounds will not affect their functionality (until you run into other issues, such as using all available memory); it may, however, affect their performance.
With some collections, most notably ArrayList, auto-expanding is expensive, because the whole underlying array is copied. An ArrayList defaults to a capacity of 10 and then grows by about half its current size each time it fills up (newCapacity = oldCapacity + oldCapacity/2). So, say you know your ArrayList will contain 110 objects but do not give it a size; roughly the following copies happen:
Copy 10 --> 15
Copy 15 --> 22
Copy 22 --> 33
Copy 33 --> 49
Copy 49 --> 73
Copy 73 --> 109
Copy 109 --> 163
By telling the ArrayList up front that it will contain 110 items, you skip these copies (see the sketch below).
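If you want to watch those copies happen, here is an illustrative sketch that peeks at the internal array via reflection (the field name elementData is a JDK implementation detail; on JDK 9+ you may need --add-opens java.base/java.util=ALL-UNNAMED):

import java.lang.reflect.Field;
import java.util.ArrayList;

public class GrowthDemo {
    public static void main(String[] args) throws Exception {
        ArrayList<Integer> list = new ArrayList<>();
        Field f = ArrayList.class.getDeclaredField("elementData");
        f.setAccessible(true);
        int lastCapacity = -1;
        for (int i = 0; i < 110; i++) {
            list.add(i);
            int capacity = ((Object[]) f.get(list)).length;
            if (capacity != lastCapacity) {           // print each growth step
                System.out.println("size=" + list.size() + " capacity=" + capacity);
                lastCapacity = capacity;
            }
        }
    }
}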
An educated guess is better than nothing.
Even if you're wrong, it doesn't matter much: the collection will still auto-expand, and you will still have avoided some copies. The only way you can hurt performance is if your guess is far too large, which leads to too much memory being allocated to the collection.
In the rare cases where the size is well known (for example, when filling a known number of elements into a new collection), it may be set for performance reasons.
Most often it's better to omit it and use the default constructor instead, leading to simpler, more understandable code.
For array-based collections, resizing is a fairly expensive operation, which is why passing the exact size to an ArrayList is a good idea.
If you set the size to a minimal size (MIN) and then add MIN + 1 elements to the collection, you get a resize. The default ArrayList() starts from a capacity of 10, so if MIN is big enough you already gain something; but the best approach is to create the ArrayList with the expected collection size.
You might instead prefer LinkedList, since adding elements never triggers a resize (although list.get(i) has O(i) cost).
I am running some micro benchmarks on Java list iteration code. I have used -XX:+PrintCompilation, and -verbose:gc flags to ensure that nothing is happening in the background when the timing is being run. However, I see something in the output which I cannot understand.
Here's the code, I am running the benchmark on:
import java.util.ArrayList;
import java.util.List;
public class PerformantIteration {
private static int theSum = 0;
public static void main(String[] args) {
System.out.println("Starting microbenchmark on iterating over collections with a call to size() in each iteration");
List<Integer> nums = new ArrayList<Integer>();
for(int i=0; i<50000; i++) {
nums.add(i);
}
System.out.println("Warming up ...");
//warmup... make sure all JIT compiling is done before the actual benchmarking starts
for(int i=0; i<10; i++) {
iterateWithConstantSize(nums);
iterateWithDynamicSize(nums);
}
//actual
System.out.println("Starting the actual test");
long constantSizeBenchmark = iterateWithConstantSize(nums);
long dynamicSizeBenchmark = iterateWithDynamicSize(nums);
System.out.println("Test completed... printing results");
System.out.println("constantSizeBenchmark : " + constantSizeBenchmark);
System.out.println("dynamicSizeBenchmark : " + dynamicSizeBenchmark);
System.out.println("dynamicSizeBenchmark/constantSizeBenchmark : " + ((double)dynamicSizeBenchmark/(double)constantSizeBenchmark));
}
private static long iterateWithDynamicSize(List<Integer> nums) {
int sum=0;
long start = System.nanoTime();
for(int i=0; i<nums.size(); i++) {
// appear to do something useful
sum += nums.get(i);
}
long end = System.nanoTime();
setSum(sum);
return end-start;
}
private static long iterateWithConstantSize(List<Integer> nums) {
int count = nums.size();
int sum=0;
long start = System.nanoTime();
for(int i=0; i<count; i++) {
// appear to do something useful
sum += nums.get(i);
}
long end = System.nanoTime();
setSum(sum);
return end-start;
}
// invocations to this method simply exist to fool the VM into thinking that we are doing something useful in the loop
private static void setSum(int sum) {
theSum = sum;
}
}
Here's the output.
152 1 java.lang.String::charAt (33 bytes)
160 2 java.lang.String::indexOf (151 bytes)
165 3Starting microbenchmark on iterating over collections with a call to size() in each iteration java.lang.String::hashCode (60 bytes)
171 4 sun.nio.cs.UTF_8$Encoder::encodeArrayLoop (490 bytes)
183 5
java.lang.String::lastIndexOf (156 bytes)
197 6 java.io.UnixFileSystem::normalize (75 bytes)
200 7 java.lang.Object::<init> (1 bytes)
205 8 java.lang.Number::<init> (5 bytes)
206 9 java.lang.Integer::<init> (10 bytes)
211 10 java.util.ArrayList::add (29 bytes)
211 11 java.util.ArrayList::ensureCapacity (58 bytes)
217 12 java.lang.Integer::valueOf (35 bytes)
221 1% performance.api.PerformantIteration::main # 21 (173 bytes)
Warming up ...
252 13 java.util.ArrayList::get (11 bytes)
252 14 java.util.ArrayList::rangeCheck (22 bytes)
253 15 java.util.ArrayList::elementData (7 bytes)
260 2% performance.api.PerformantIteration::iterateWithConstantSize # 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize # 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
Starting the actual test
Test completed... printing results
constantSizeBenchmark : 301688
dynamicSizeBenchmark : 782602
dynamicSizeBenchmark/constantSizeBenchmark : 2.5940773249184588
I don't understand these four lines from the output.
260 2% performance.api.PerformantIteration::iterateWithConstantSize # 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize # 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
Why are both these methods being compiled twice?
How do I read this output? What do the various numbers mean?
I am going to attempt answering my own question with the help of this link posted by Thomas Jungblut.
260 2% performance.api.PerformantIteration::iterateWithConstantSize # 19 (59 bytes)
268 3% performance.api.PerformantIteration::iterateWithDynamicSize # 12 (57 bytes)
272 16 performance.api.PerformantIteration::iterateWithConstantSize (59 bytes)
278 17 performance.api.PerformantIteration::iterateWithDynamicSize (57 bytes)
First column
The first column '260' is the timestamp.
Second column
The second column holds the compilation_id and method_attributes. When a HotSpot compilation is triggered, every compilation unit gets a compilation id. Standard JIT compilations and OSR compilations use two different sequences of compilation ids, so 1% and 1 are different compilation units. The % in the first two rows indicates an OSR compilation. An OSR compilation was triggered because the code was spinning in a large loop and the VM determined that it was hot. This enables the VM to perform an On-Stack Replacement and switch over to the optimized code once it is ready.
Third column
The third column performance.api.PerformantIteration::iterateWithConstantSize is the method name.
Fourth column
The fourth column again differs depending on whether OSR compilation happened or not. Let's look at the common part first: the (59 bytes) at the end of the fourth column refers to the size of the compilation unit in bytecode (not the size of the compiled code). The # 19 part in an OSR compilation refers to the osr_bci. I am going to quote from the link mentioned above:
A "place" in a Java method is defined by its bytecode index (BCI), and
the place that triggered an OSR compilation is called the "osr_bci".
An OSR-compiled nmethod can only be entered from its osr_bci; there
can be multiple OSR-compiled versions of the same method at the same
time, as long as their osr_bci differ.
Finally, why was each method compiled twice?
The first is an OSR compilation, which presumably happened while the loop was running during the warm-up code (in the example), and the second is a standard JIT compilation, presumably to further optimize the compiled code?
I think the OSR compilation happened first; then the invocation counter reached its threshold and triggered the regular method compilation.