Resize behaviour of Maps in Java

I need to know when a map in Java enlarges. For this I need a formula to calculate a good initial capacity.
In my project I need a large map which contains large objects. Therefore, I would like to prevent a resizing of the map by specifying a suitable initial capacity. By means of reflection I have looked at the behavior of maps.
package com.company;

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class Main {

    public static void main(String[] args) {
        Map m = new HashMap();
        int lastCapacity = 0, currentCapacity = 0;
        for (int i = 1; i <= 100_000; i++) {
            m.put(i, i);
            currentCapacity = getHashMapCapacity(m);
            if (currentCapacity > lastCapacity) {
                System.out.println(lastCapacity + " --> " + currentCapacity + " at " + i + " entries.");
                lastCapacity = currentCapacity;
            }
        }
    }

    public static int getHashMapCapacity(Map m) {
        int size = 0;
        Field tableField = null;
        try {
            tableField = HashMap.class.getDeclaredField("table");
            tableField.setAccessible(true);
            Object[] table = (Object[]) tableField.get(m);
            size = table == null ? 0 : table.length;
        } catch (NoSuchFieldException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
        return size;
    }
}
The output was:
0 --> 16 at 1 entries.
16 --> 32 at 13 entries.
32 --> 64 at 25 entries.
64 --> 128 at 49 entries.
128 --> 256 at 97 entries.
256 --> 512 at 193 entries.
512 --> 1024 at 385 entries.
1024 --> 2048 at 769 entries.
2048 --> 4096 at 1537 entries.
4096 --> 8192 at 3073 entries.
8192 --> 16384 at 6145 entries.
16384 --> 32768 at 12289 entries.
32768 --> 65536 at 24577 entries.
65536 --> 131072 at 49153 entries.
131072 --> 262144 at 98305 entries.
Can I assume that a map always behaves that way? Are there any differences between Java 7 and Java 8?

The easiest way to check out this sort of behaviour is to look at the OpenJDK source. It's all freely available and relatively easy to read.
In this case, checking HashMap, you will see there are some extensive implementation notes that explain how sizing works, what load factor is used as a threshold (which is driving the behaviour you are seeing), and even how the decision is made whether to use trees for a bin. Read through that and come back if it's not clear.
The code is pretty well optimised, with expansion being a very cheap operation. I suggest using a profiler to get some evidence that the performance issues are related to expansion before you do any tweaking.

As per the documentation:
The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
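Put concretely, here is a minimal sketch of a helper that derives an initial capacity from an expected entry count, assuming the default 0.75 load factor (the names capacityFor and createHashMap are only illustrative):
import java.util.HashMap;
import java.util.Map;

public class Presized {

    // Per the javadoc rule above: pick a capacity strictly greater than
    // expectedEntries / loadFactor so no rehash ever occurs.
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor) + 1;
    }

    static <K, V> Map<K, V> createHashMap(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries, 0.75f));
    }

    public static void main(String[] args) {
        Map<Integer, Integer> m = createHashMap(100_000);
        for (int i = 1; i <= 100_000; i++) {
            m.put(i, i); // with the capacity above, no resize should be observed
        }
    }
}
HashMap rounds the requested capacity up to the next power of two internally, so the table may be larger than requested, but it will not be resized while you fill it. (Guava's Maps.newHashMapWithExpectedSize does a similar computation.)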

Related

Java HashMap size allocated

Java Hash Map has a size() method,
which reflects how many elements are set in the Hash Map.
I am interested to know what is the actual size of the Hash Map.
I tried different methods but can't find the correct one.
I set the initial capacity to 16:
HashMap hm = new HashMap(16);
for (int i = 0; i < 100; ++i) {
    System.out.println(hm.size());
    UUID uuid = UUID.randomUUID();
    hm.put(uuid, null);
}
When I add values this size can increase; how can I check the size that is actually allocated?
what is the actual size of the Hash Map
I'm assuming you are asking about the capacity. The capacity is the length of the array holding the buckets of the HashMap. The initial capacity is 16 by default.
The capacity is not exposed through a public method, but you can calculate the current capacity based on the current size, the initial capacity and the load factor.
If you use the defaults (for example, when you create the HashMap with the parameter-less constructor), the initial capacity is 16, and the default load factor is 0.75. This means the capacity will be doubled to 32 once the size reaches 16 * 0.75 == 12. It will be doubled to 64 once the size reaches 32 * 0.75 == 24.
If you pass different initial capacity and/or load factor to the constructor, the calculation will be affected accordingly.
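That calculation can be sketched in a few lines, assuming the Java 8 doubling behaviour described above (predictCapacity is an illustrative name, not a JDK method):
// Predicts the table capacity after 'size' insertions, ignoring the lazy
// allocation of the table on the first put.
static int predictCapacity(int size, int initialCapacity, float loadFactor) {
    int capacity = initialCapacity;
    while (size > (int) (capacity * loadFactor)) {
        capacity *= 2; // HashMap doubles the table when the threshold is exceeded
    }
    return capacity;
}
For example, predictCapacity(13, 16, 0.75f) returns 32 and predictCapacity(100, 16, 0.75f) returns 256, matching the reflection output in the first question above.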
You can use Reflection to check actual allocated size (bucket size) of the map.
HashMap<String, Integer> m = new HashMap<>();
m.put("Abhi", 101);
m.put("John", 102);
System.out.println(m.size()); // This will print 2

// getDeclaredField and get throw checked exceptions (NoSuchFieldException,
// IllegalAccessException), so declare or catch them in the enclosing method.
Field tableField = HashMap.class.getDeclaredField("table");
tableField.setAccessible(true);
Object[] table = (Object[]) tableField.get(m);
System.out.println(table.length); // This will print 16

Java 8 hashmap high memory usage

I use a hashmap to store a QTable for an implementation of a reinforcement learning algorithm. My hashmap should store 15000000 entries. When I ran my algorithm I saw that the memory used by the process is over 1000000K. When I calculated the memory, I would expect it to use not more than 530000K. I tried to write an example and I got the same high memory usage:
public static void main(String[] args) {
    HashMap map = new HashMap<>(16_000_000, 1);
    for (int i = 0; i < 15_000_000; i++) {
        map.put(i, i);
    }
}
My memory calculation:
Each entry is 32 bytes
Capacity is 15000000
HashMap Instance uses: 32 * SIZE + 4 * CAPACITY
memory = (15000000 * 32 + 15000000 * 4) / 1024 = 527343.75K
Where I'm wrong in my memory calculations?
Well, in the best case, we assume a word size of 32 bits/4 bytes (with CompressedOops and CompressedClassPointers). Then, a map entry consists of two words of JVM overhead (mark word and klass pointer), key, value, hashcode and next pointer, making 6 words total, in other words, 24 bytes. So having 15,000,000 entry instances will consume 360 MB.
Additionally, there’s the array holding the entries. The HashMap uses capacities that are a power of two, so for 15,000,000 entries, the array size is at least 16,777,216, consuming 64 MiB.
Then, you have 30,000,000 Integer instances. The problem is that map.put(i, i) performs two boxing operations and while the JVM is encouraged to reuse objects when boxing, it is not required to do so and reusing won’t happen in your simple program that is likely to complete before the optimizer ever interferes.
To be precise, the first 128 Integer instances are reused, because for values in the -128 … +127 range, sharing is mandatory, but the implementation does this by initializing the entire cache on the first use, so for the first 128 iterations it doesn't create two instances, but the cache consists of 256 instances, which is twice that number, so we end up again with 30,000,000 Integer instances total. An Integer instance consists of at least the two JVM-specific words and the actual int value, which would make 12 bytes, but due to the default alignment, the actually consumed memory will be 16 bytes, divisible by eight.
So the 30,000,000 created Integer instances consume 480 MB.
This makes a total of 360 MB + 64 MiB + 480 MB, which is more than 900 MB, making a heap size of 1 GB entirely plausible.
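For reference (not part of the original answer), the boxing behaviour described above is easy to observe on a default HotSpot configuration:
Integer a = 100, b = 100;    // inside -128..127: Integer.valueOf returns cached instances
Integer c = 1000, d = 1000;  // outside the cache: each boxing creates a new Integer
System.out.println(a == b);  // true  - same cached object
System.out.println(c == d);  // false on a default JVM - two distinct objects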
But that's what profiling tools are for. After running your program, I got [profiler screenshot omitted].
Note that this tool only reports the used size of the objects, i.e. the 12 bytes for an Integer object without considering the padding that you will notice when looking at the total memory allocated by the JVM.
I sort of had the same requirement as you, so I decided to throw my thoughts in here.
1) There is a great tool for that: jol.
2) Arrays are objects too, and every object in Java has two additional headers: mark and klass, usually 4 and 8 bytes in size (this can be tweaked via compressed pointers, but I won't go into details).
3) It is important to note the map's load factor here (because it influences the resize of the internal array). Here is an example:
// GraphLayout comes from the jol library (org.openjdk.jol.info.GraphLayout).
HashMap<Integer, Integer> map = new HashMap<>(16, 1);
for (int i = 0; i < 13; ++i) {
    map.put(i, i);
}
System.out.println(GraphLayout.parseInstance(map).toFootprint());

HashMap<Integer, Integer> map2 = new HashMap<>(16);
for (int i = 0; i < 13; ++i) {
    map2.put(i, i);
}
System.out.println(GraphLayout.parseInstance(map2).toFootprint());
Output of this is different(only the relevant lines):
1 80 80 [Ljava.util.HashMap$Node; // first case
1 144 144 [Ljava.util.HashMap$Node; // second case
See how the size is bigger for the second case because the backing array is twice as big (32 entries). You can only put 12 entries in a 16 size array, because the default load factor is 0.75: 16 * 0.75 = 12.
Why 144? The math here is easy: an array is an object, thus it has a 12-byte header (mark word plus compressed klass pointer) and a 4-byte length field. Plus 32 * 4 bytes for the references, that is 16 + 128 = 144 bytes, already aligned to 8 bytes.
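That arithmetic can be captured in a small helper (a sketch assuming compressed oops, i.e. 4-byte references, and 8-byte object alignment):
// Approximate footprint of an Object[] with the given length:
// 12-byte header + 4-byte length field + 4 bytes per reference, rounded up to 8 bytes.
static long arrayFootprint(int length) {
    long raw = 12 + 4 + 4L * length;
    return (raw + 7) & ~7L; // align to 8 bytes
}
arrayFootprint(16) gives 80 and arrayFootprint(32) gives 144, matching the two jol lines above.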
4) entries are stored inside either a Node or a TreeNode inside the map (Node is 32 bytes and TreeNode is 56 bytes). As you use ONLY integers, you will have only Nodes, as there should be no hash collisions. There might be collisions, but this does not yet mean that a certain array entry will be converted to a TreeNode, there is a threshold for that. We can easily prove that there will be Nodes only:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class WillThereBeTreeNodes {

    public static void main(String[] args) {
        Map<Integer, List<Integer>> map = IntStream.range(0, 15_000_000).boxed()
                .collect(Collectors.groupingBy(WillThereBeTreeNodes::hash));
        System.out.println(map.size());
    }

    // Same spreading function HashMap applies to keys
    private static int hash(Integer key) {
        int h;
        return (h = key.hashCode()) ^ (h >>> 16);
    }
}
The result of this will be 15_000_000; there was no merging, thus no hash collisions.
5) When you create Integer objects there is a pool for them (ranging from -128 to 127 - this can be tweaked as well, but let's not for simplicity).
6) an Integer is an object, thus it has 12 bytes header and 4 bytes for the actual int value.
With this in mind, let's try and see the output for 15_000_000 entries (since you are using a load factor of one, there is no need to create the internal capacity of 16_000_000). It will take a lot of time, so be patient. I also gave it a
-Xmx12G and -Xms12G
HashMap<Integer, Integer> map = new HashMap<>(15_000_000, 1);
for (int i = 0; i < 15_000_000; ++i) {
    map.put(i, i);
}
System.out.println(GraphLayout.parseInstance(map).toFootprint());
Here is what jol said:
java.util.HashMap#9629756d footprint:
COUNT AVG SUM DESCRIPTION
1 67108880 67108880 [Ljava.util.HashMap$Node;
29999872 16 479997952 java.lang.Integer
1 48 48 java.util.HashMap
15000000 32 480000000 java.util.HashMap$Node
44999874 1027106880 (total)
Let's start from the bottom.
The total size of the hashmap footprint is 1027106880 bytes, or 1 027 MB.
Node is the wrapper class where each entry resides. It has a size of 32 bytes; there are 15 million entries, thus the line:
15000000 32 480000000 java.util.HashMap$Node
Why 32 bytes? It stores the hashcode(4 bytes), key reference (4 bytes), value reference (4 bytes), next Node reference (4 bytes), 12 bytes header, 4 bytes padding, resulting in 32 bytes total.
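If you want jol to print that breakdown itself, it can lay out the (package-private) HashMap.Node class too; a small sketch, assuming jol's ClassLayout is on the classpath:
// Class.forName throws a checked ClassNotFoundException; HashMap.Node is package-private,
// so it has to be looked up by name.
System.out.println(ClassLayout.parseClass(Class.forName("java.util.HashMap$Node")).toPrintable());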
1 48 48 java.util.HashMap
A single hashmap instance - 48 bytes for its internals.
If you really want to know why 48 bytes:
System.out.println(ClassLayout.parseClass(HashMap.class).toPrintable());
java.util.HashMap object internals:
OFFSET SIZE TYPE DESCRIPTION VALUE
0 12 (object header) N/A
12 4 Set AbstractMap.keySet N/A
16 4 Collection AbstractMap.values N/A
20 4 int HashMap.size N/A
24 4 int HashMap.modCount N/A
28 4 int HashMap.threshold N/A
32 4 float HashMap.loadFactor N/A
36 4 Node[] HashMap.table N/A
40 4 Set HashMap.entrySet N/A
44 4 (loss due to the next object alignment)
Instance size: 48 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total
Next the Integer instances:
29999872 16 479997952 java.lang.Integer
30 million integer objects (minus 128 that are cached in the pool)
1 67108880 67108880 [Ljava.util.HashMap$Node;
We have 15_000_000 entries, but the internal array of a HashMap is a power-of-two size; that's 16,777,216 references of 4 bytes each.
16_777_216 * 4 = 67_108_864 bytes for the references, plus the 12-byte header and the 4-byte array length field = 67108880

What symbol table can I use to store ~50 mil strings with fast lookup without running out of heap space?

I have a file of ~50 million strings that I need to add to a symbol table of some sort on startup, then search several times with reasonable speed.
I tried using a DLB trie since lookup would be relatively fast, as all strings are < 10 characters, but while populating the DLB I would get either a "GC overhead limit exceeded" or an "OutOfMemoryError: Java heap space" error. The same errors occurred with HashMap. This is for an assignment that will be compiled and run by a grader, so I would rather not just allocate more heap space. Is there a different data structure that would have less memory usage while still having reasonable lookup time?
If you expect low prefix sharing, then a trie may not be your best option.
Since you only load the lookup table once, at startup, and your goal is low memory footprint with "reasonable speed" for lookup, your best option is likely a sorted array and binary search for lookup.
First, you load the data into an array. Since you likely don't know the size up front, you load into an ArrayList. You then extract the final array from the list.
Assuming you load 50 million 10 character strings, memory will be:
10 character string:
String: 12 byte header + 4 byte 'hash' + 4 byte 'value' ref = 24 bytes (aligned)
char[]: 12 byte header + 4 byte 'length' + 10 * 2 byte 'char' = 40 bytes (aligned)
Total: 24 + 40 = 64 bytes
Array of 50 million 10 character strings:
String[]: 12 byte header + 4 byte 'length' + 50,000,000 * 4 byte 'String' ref = 200,000,016 bytes
Values: 50,000,000 * 64 bytes = 3,200,000,000 bytes
Total: 200,000,016 + 3,200,000,000 = 3,400,000,016 bytes = 3.2 GB
You will need another copy of the String[] when you convert the ArrayList<String> to String[]. The Arrays.sort() operation may need 50% array size (~100,000,000 bytes) for temporary storage, but if ArrayList is released for GC before you sort, that space can be reused.
So, total requirement is ~3.5 GB, just for the symbol table.
Now, if space is truly at a premium, you can squeeze that. As you can see, the String itself adds an overhead of 24 bytes, out of the 64 bytes. You can make the symbol table use char[] directly.
Also, if your strings are all US-ASCII or ISO-8859-1, you can convert the char[] to a byte[], saving half the bytes.
Combined, that reduces the value size from 64 bytes to 32 bytes, and the total symbol table size from 3.2 GB to 1.8 GB, or roughly 2 GB during loading.
UPDATE
Assuming the input list of strings is already sorted, below is an example of how to do this. As an MCVE, it just uses a small static array as input, but you can easily read the strings from a file instead.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {

    public static void main(String[] args) {
        String[] wordsFromFile = { "appear", "attack", "cellar", "copper",
                                   "erratic", "grotesque", "guitar", "guttural",
                                   "kittens", "mean", "suit", "trick" };
        List<byte[]> wordList = new ArrayList<>();
        for (String word : wordsFromFile) // Simulating read from file
            wordList.add(word.getBytes(StandardCharsets.US_ASCII));
        byte[][] symbolTable = wordList.toArray(new byte[wordList.size()][]);
        test(symbolTable, "abc");
        test(symbolTable, "attack");
        test(symbolTable, "car");
        test(symbolTable, "kittens");
        test(symbolTable, "xyz");
    }

    private static void test(byte[][] symbolTable, String word) {
        int idx = Arrays.binarySearch(symbolTable,
                                      word.getBytes(StandardCharsets.US_ASCII),
                                      Test::compare);
        if (idx < 0)
            System.out.println("Not found: " + word);
        else
            System.out.println("Found : " + word);
    }

    private static int compare(byte[] w1, byte[] w2) {
        for (int i = 0, cmp; i < w1.length && i < w2.length; i++)
            if ((cmp = Byte.compare(w1[i], w2[i])) != 0)
                return cmp;
        return Integer.compare(w1.length, w2.length);
    }
}
Output
Not found: abc
Found : attack
Not found: car
Found : kittens
Not found: xyz
Use a single char array to store all strings (sorted), and an array of integers for the offsets. String n is the chars from offset[n - 1] (inclusive) to offset[n] (exclusive). offset[-1] is zero.
Memory usage will be 1GB (50M*10*2) for the char array, and 200MB (50M * 4) for the offset array. Very compact even with two byte chars.
You will have to build this array by merging smaller sorted string arrays in order not to exceed your heap space. But once you have it, it should be reasonably fast.
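A minimal sketch of that layout (class and method names here are illustrative, not from the answer); it assumes the words are supplied already sorted:
import java.util.Arrays;
import java.util.List;

// All words concatenated into one char[]; offsets[n] is the end (exclusive) of word n,
// so word n spans [n == 0 ? 0 : offsets[n - 1], offsets[n]).
public class PackedStrings {

    private final char[] chars;
    private final int[] offsets;

    PackedStrings(List<String> sortedWords) {
        offsets = new int[sortedWords.size()];
        int total = 0, i = 0;
        for (String w : sortedWords) { total += w.length(); offsets[i++] = total; }
        chars = new char[total];
        int pos = 0;
        for (String w : sortedWords) { w.getChars(0, w.length(), chars, pos); pos += w.length(); }
    }

    private int compareAt(int n, String key) {
        int s = (n == 0 ? 0 : offsets[n - 1]), e = offsets[n];
        for (int i = 0; i < key.length() && s + i < e; i++) {
            int cmp = Character.compare(chars[s + i], key.charAt(i));
            if (cmp != 0) return cmp;
        }
        return Integer.compare(e - s, key.length());
    }

    boolean contains(String key) {
        int lo = 0, hi = offsets.length - 1;
        while (lo <= hi) {                 // plain binary search over the word indexes
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, key);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PackedStrings table = new PackedStrings(Arrays.asList("appear", "attack", "cellar", "mean"));
        System.out.println(table.contains("attack")); // true
        System.out.println(table.contains("car"));    // false
    }
}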
Alternatively, you could try a memory optimized trie implementation such as https://github.com/rklaehn/radixtree . This uses not just prefix sharing, but also structural sharing for common suffixes, so unless your strings are completely random, it should be quite compact. See the space usage benchmark. But it is scala, not java.

What is an efficient alternative for an extremely large HashMap?

I'm trying to break a symmetric encryption using a 'meet-in-the-middle' attack. For this I need to store 2**32 integer-integer pairs. I'm storing the mapping from a 4-byte cyphertext to a 4-byte key.
At first I tried using an array, but then I realized that you cannot have such a big array in java (the max size is bound by Integer.MAX_VALUE).
Now I'm using a HashMap, but this gets way too slow when the map gets large, even when increasing the max memory to 8GB with -Xmx8192M.
What is an efficient alternative for an extremely large HashMap?
This is the code I'm currently using to populate my hashmap:
HashMap<Integer, Integer> map = new HashMap<>(Integer.MAX_VALUE);
// Loop until integer overflow
for (int k = 1; k != 0; k++)
map.put(encrypt_left(c, k), k);
I haven't seen this code finish, even after letting it run for hours. Progress logging shows that the first 2**24 values are created in 22s, but then the performance quickly decreases.
I'm storing the mapping from a 4-byte cyphertext to a 4-byte key.
Conveniently, 4 bytes is an int. As you observed, array sizes are limited by Integer.MAX_VALUE. That suggests you can use an array, but there's a minor hangup: integers are signed, while array indices must be >= 0.
So you create two arrays: one for the positive cyphertexts, and one for the negative cyphertexts. Then you just need to make sure that you've given the JVM enough heap.
How much heap is that?
4 bytes * Integer.MAX_VALUE * 2 arrays
= 17179869176 bytes
= ~16.0 gigabytes.
When building a rainbow table, consider the size of the data you are going to produce. Consider also that this problem can be solved without vast amounts of RAM, by using files instead of keeping everything in memory. Typically you build files of a size that fits your file buffer, for example 4096 or 8192 bytes. Given a value, you divide it by the file buffer's size to pick the file, load that file, and look at the (value mod buffer size) position.
The tricky part is that you need the encrypted data to be laid out, not the key. So you start with dummy files and write the key data at the position of the encrypted data.
So let's say your key is 1026 and the encrypted data is 126. The file to write 1026 to is 0.rbt, because 126 * 4 bytes / 4096 = 0. The position within that file is 126 * 4 bytes.
And of course you need the nio classes for that.
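A simplified sketch of that idea, using one sparse random-access file instead of many small files, and plain java.io rather than NIO for brevity (the class name and file path are illustrative):
import java.io.IOException;
import java.io.RandomAccessFile;

// File-backed cyphertext -> key table: the key is written at offset (cyphertext * 4),
// treating the 4-byte cyphertext as an unsigned index. The file can grow to 16 GB,
// but stays sparse on most file systems until positions are actually written.
// Caveat: a stored key of 0 is indistinguishable from "no entry", and reading a
// position beyond the current file length throws EOFException.
public class FileBackedTable implements AutoCloseable {

    private final RandomAccessFile file;

    FileBackedTable(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    void put(int cyphertext, int key) throws IOException {
        file.seek((cyphertext & 0xFFFFFFFFL) * 4L);
        file.writeInt(key);
    }

    int get(int cyphertext) throws IOException {
        file.seek((cyphertext & 0xFFFFFFFFL) * 4L);
        return file.readInt();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}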
Following the advice of @MattBall, I implemented my own BigArray, which composes an array with a full 32-bit index range from 4 separate arrays.
Running this without the suggested JVM arguments will cause an OutOfMemoryError. Using this with the suggested JVM arguments but with too little RAM will probably cause your machine to crash.
/**
 * Array that holds 2^32 integers, implemented as four arrays of 2^30 ints each.
 * <p>
 * Requires 16 GB RAM solely for the array allocation.
 * <p>
 * Example JVM arguments: <code>-Xmx22000M -Xms17000M</code>
 * <p>
 * This sets the max memory to 22,000 MB and the initial memory to 17,000 MB.
 * <p>
 * WARNING: don't use these settings if your machine does not have this much RAM.
 *
 * @author popovitsj
 */
public class BigArray
{
    private int[] a_00 = new int[1 << 30];
    private int[] a_01 = new int[1 << 30];
    private int[] a_10 = new int[1 << 30];
    private int[] a_11 = new int[1 << 30];

    private static final int A_00 = 0;
    private static final int A_01 = 1 << 30;
    private static final int A_10 = 1 << 31;
    private static final int A_11 = 3 << 30;
    private static final int A_30 = A_01 - 1;

    public void set(int index, int value)
    {
        getArray(index)[index & A_30] = value;
    }

    public int get(int index)
    {
        return getArray(index)[index & A_30];
    }

    // Selects one of the four backing arrays based on the top two bits of the index.
    private int[] getArray(int index)
    {
        switch (index & A_11)
        {
            case A_00:
                return a_00;
            case A_01:
                return a_01;
            case A_10:
                return a_10;
            default:
                return a_11;
        }
    }
}
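For completeness, a rough sketch of how such a BigArray could replace the HashMap in the question's loop (encrypt_left and c are the asker's own function and variable; someCyphertext is a placeholder):
// Populate: the cyphertext is the index, the key is the value.
// Caveat: a stored key of 0 cannot be distinguished from "no entry".
BigArray table = new BigArray();
for (int k = 1; k != 0; k++)              // loop over all non-zero keys, as in the question
    table.set(encrypt_left(c, k), k);

// Lookup during the meet-in-the-middle step:
int candidateKey = table.get(someCyphertext);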
This is a big data problem; in this case, it is more of a big memory problem. The computation should be done in memory for performance. Use Hazelcast's distributed HashMap. It is very easy to use and very performant. You can use two or more machines for your problem.
Sample usage:
HazelcastInstance hzInstance = Hazelcast.newHazelcastInstance();
Map<Integer, Integer> map = hzInstance.getMap("map1");
map.put(x,y);
..

Is it better to set size of a Java Collection in constructor?

Is it better to pass the size of a Collection to the Collection constructor if I know the size at that point? Is the saving, in terms of avoiding expansion and re-allocation, noticeable?
What if I know the minimal size of the Collection but not the upper bound? Is it still worth creating it with at least the minimal size?
Different collections have different performance consequences for this; for ArrayList the saving can be very noticeable.
import java.util.*;

public class Main {

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<Integer>(5);
        int max = 1000000;

        // Warmup
        for (int i = 0; i < max; i++) {
            numbers.add(i);
        }

        long start = System.currentTimeMillis();
        numbers = new ArrayList<Integer>(max);
        for (int i = 0; i < max; i++) {
            numbers.add(i);
        }
        System.out.println("Preall: " + (System.currentTimeMillis() - start));

        start = System.currentTimeMillis();
        numbers = new ArrayList<Integer>(5);
        for (int i = 0; i < max; i++) {
            numbers.add(i);
        }
        System.out.println("Resizing: " + (System.currentTimeMillis() - start));
    }
}
Result:
Preall: 26
Resizing: 58
Running with max set to 10 times that value (10000000) gives:
Preall: 510
Resizing: 935
So you can see that even at different sizes the ratio stays around the same.
This is pretty much a worst-case test, but filling a list one element at a time is very common, and you can see that there was roughly a 2x speed difference.
OK, here's my jmh code:
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3, time = 1)
@Fork(3)
public class Comparison
{
    static final int size = 1_000;

    @GenerateMicroBenchmark
    public List<?> testSpecifiedSize() {
        final ArrayList<Integer> l = new ArrayList<>(size);
        for (int i = 0; i < size; i++) l.add(1);
        return l;
    }

    @GenerateMicroBenchmark
    public List<?> testDefaultSize() {
        final ArrayList<Integer> l = new ArrayList<>();
        for (int i = 0; i < size; i++) l.add(1);
        return l;
    }
}
My results for size = 10_000:
Benchmark Mode Thr Cnt Sec Mean Mean error Units
testDefaultSize avgt 1 9 1 80.770 2.095 usec/op
testSpecifiedSize avgt 1 9 1 50.060 1.078 usec/op
Results for size = 1_000:
Benchmark Mode Thr Cnt Sec Mean Mean error Units
testDefaultSize avgt 1 9 1 6.208 0.131 usec/op
testSpecifiedSize avgt 1 9 1 4.900 0.078 usec/op
My interpretation:
presizing has some edge on the default size;
the edge isn't that spectacular;
the absolute time spent on the task of adding to the list is quite insignificant.
My conclusion:
Add the initial size if that makes you feel warmer around the heart, but objectively speaking, your customer is highly unlikely to notice the difference.
All collections are auto-expanding. Not knowing the bounds will not affect their functionality (until you run into other issues such as using all available memory, etc.); it may, however, affect their performance.
With some collections, most notably ArrayList, auto-expanding is expensive, as the whole underlying array is copied. An ArrayList is sized at 10 by default and then grows by roughly 1.5x each time it runs out of room. So, say you know your ArrayList will contain 110 objects but do not give it a size; the following copies will happen:
Copy 10 --> 15
Copy 15 --> 22
Copy 22 --> 33
Copy 33 --> 49
Copy 49 --> 73
Copy 73 --> 109
Copy 109 --> 163
By telling the ArrayList up front that it will contain 110 items, you skip these copies.
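The growth sequence above can be reproduced with a few lines (assuming the current JDK rule of newCapacity = oldCapacity + oldCapacity / 2):
// Prints the capacity growth an un-sized ArrayList goes through to hold 110 elements.
int capacity = 10, needed = 110;
while (capacity < needed) {
    int next = capacity + (capacity >> 1);
    System.out.println("Copy " + capacity + " --> " + next);
    capacity = next;
}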
An educated guess is better than nothing
Even if you're wrong, it doesn't matter. The collection will still auto-expand and you will still avoid some copies. The only way you can decrease performance is if your guess is far too large, which will lead to too much memory being allocated to the collection.
In the rare cases when the size is well known (for example, when filling a known number of elements into a new collection), it may be set for performance reasons.
Most often it's better to omit it and use the default constructor instead, leading to simpler and more understandable code.
For array-based collections, re-sizing is quite an expensive operation. That's why passing the exact size to ArrayList is a good idea.
If you set the size to a minimal size (MIN) and then add MIN+1 elements to the collection, you do get a re-size. ArrayList() invokes ArrayList(10), so if MIN is big enough you still get some advantage. But the best way is to create the ArrayList with the expected collection size.
But possibly you prefer LinkedList, because it has no resizing cost when adding elements (although list.get(i) has O(i) cost).
