I'm having a contest with another student to make the fastest version of our homework assignment, and I'm not using an ArrayList for performance reasons (resizing the array myself cut the benchmark time from 56 seconds to 4), but I'm wondering how much I should grow the array by each time I need to resize. Specifically, the relevant parts of my code are these:
private Node[] list;
private int size; // The number of items in the list
private static final int N = 1000; // How much to grow the list by each time (placeholder value; this is what the question asks about)

public MyClass() {
    list = new Node[N];
}

public void add(Node newNode) {
    if (size == list.length) {
        list = Arrays.copyOf(list, size + N); // requires java.util.Arrays
    }
    list[size] = newNode;
    size++;
}
TL;DR: What should I make N?
It's recommended to double the size of the array when resizing. Doubling the size leads to amortized linear-time cost.
The naive idea is that there are two costs associated with the resize value:
Copying performance costs - the cost of copying the elements from the previous array to the new one, and
Memory overhead costs - the cost of the allocated memory that is not used.
If you were to re-size the array by adding one element at a time, the memory overhead would be zero, but the copying cost would become quadratic. If you were to allocate too many slots, the copying cost would be linear, but the memory overhead would be excessive.
Doubling leads to a linear amortized cost (i.e. over a long time, the cost of copying is linear with respect to the size of the array), and you are guaranteed not to waste more than half of the array.
UPDATE: By the way, apparently Java's ArrayList expands by 3/2. This makes it a bit more memory-conservative, but costs a bit more in terms of copying. Benchmarking for your use case wouldn't hurt.
MINOR Correction: Doubling would make the total resizing cost linear amortized, but would ensure amortized constant-time insertion. Check CMU's lecture on Amortized Analysis.
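For example, here is a minimal sketch of doubling in the asker's add() (reusing the Node[] list and int size fields from the question; java.util.Arrays is assumed to be imported):
public void add(Node newNode) {
    if (size == list.length) {
        // Geometric growth: the total copying work stays linear in the final
        // size, so each add is amortized O(1).
        list = Arrays.copyOf(list, Math.max(1, list.length * 2));
    }
    list[size++] = newNode;
}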
3/2 is likely chosen as "something that divides cleanly but is less than phi". There was an epic thread on comp.lang.c++.moderated back in November 2003 exploring how phi establishes an upper bound on reusing previously-allocated storage during reallocation for a first-fit allocator.
See post #7 from Andrew Koenig for the first mention of phi's application to this problem.
If you know roughly how many items there are going to be, then pre-size the array or the ArrayList to that number, and you'll never have to expand. Unbeatable performance!
Failing that, a reasonable way to achieve good amortized cost is to keep increasing the size by some percentage. 100% or 50% are common.
You should resize your lists as a multiple of the previous size, rather than adding a constant amount each time.
For example:
newSize = oldSize * 2;
not
newSize = oldSize + N;
Double the size each time you need to resize unless you know that more or less would be best.
If memory isn't an issue, just start off with a big array to begin with.
Your code seems to do pretty much what ArrayList does; if you know you will be using a large list, you can pass it an initial size when you create the list and avoid resizing altogether. This of course assumes that you're going for raw speed and memory consumption is not an issue.
From the comments of one of the answers:
The problem is that memory isn't an issue, but I'm reading an arbitrarily large file.
Try this:
new ArrayList<Node>((int)file.length());
You could do it with your array as well. Then there should be no resizing in either case, since the array will be the size of the file (assuming the file is not longer than an int can represent...).
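For the raw array, a hedged sketch of the same idea (assuming file is a java.io.File, and that its length in bytes is an upper bound on the number of nodes):
Node[] list = new Node[(int) Math.min(file.length(), Integer.MAX_VALUE)];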
For maximum performance, you're going to want to resize as rarely as possible. Set the initial size to be as large as you'll typically need, rather than starting with N elements. The value you choose for N will matter less in that case.
If you are going to create a large number of these list objects, of varying sizes, then you'll want to use a pool-based allocator, and not free the memory until you exit.
And to eliminate the copy operation altogether, you can use a list of arrays, as sketched below.
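Here is a minimal sketch of such a chunked list (the names ChunkedList and CHUNK are illustrative; Node is the type from the question). Growth allocates a fresh chunk instead of copying existing elements:
import java.util.ArrayList;
import java.util.List;

class ChunkedList {
    private static final int CHUNK = 1 << 16;   // 65,536 slots per chunk
    private final List<Node[]> chunks = new ArrayList<>();
    private int size;

    void add(Node n) {
        if (size % CHUNK == 0) {
            chunks.add(new Node[CHUNK]);        // grow without copying old elements
        }
        chunks.get(size / CHUNK)[size % CHUNK] = n;
        size++;
    }

    Node get(int i) {
        return chunks.get(i / CHUNK)[i % CHUNK];
    }
}
The trade-off is one extra indirection per access in exchange for never paying the resize copy.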
Here's an analogy for you: long, long ago, when I worked on a mainframe, we used a filing system called VSAM, which required you to specify the initial file size and the amount of freespace required.
Whenever the amount of freespace dropped below the required threshold, the additional freespace would be allocated in the background while the program continued to process.
It would be interesting to see if this could be done in Java, using a separate thread to allocate the additional space and 'attach' it to the end of the array while the main thread continues processing.
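A speculative sketch of that idea (all names here are illustrative; Node is the type from the first question): once the array is 3/4 full, a background thread starts copying into a double-sized array, and add() finishes the copy and swaps it in when the array actually fills:
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BackgroundGrowList {
    private final ExecutorService allocator = Executors.newSingleThreadExecutor();
    private Node[] list = new Node[1024];
    private int size;
    private int snapshotSize;     // number of elements already copied in the background
    private Future<Node[]> pending;

    void add(Node n) throws Exception {
        if (pending == null && size >= list.length * 3 / 4) {
            snapshotSize = size;
            final Node[] snapshot = list;
            pending = allocator.submit(() -> Arrays.copyOf(snapshot, snapshot.length * 2));
        }
        if (size == list.length) {
            Node[] grown = pending.get();   // blocks only if the background copy isn't done
            // Copy the elements that were appended after the snapshot was taken.
            System.arraycopy(list, snapshotSize, grown, snapshotSize, size - snapshotSize);
            list = grown;
            pending = null;
        }
        list[size++] = n;
    }
}
Whether this beats a plain synchronous copy would, of course, need benchmarking.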
Related
I have a use case where I need to store key-value pairs, approximately 500 million entries, in a single VM of size 8 GB. Key and value are both of type Long. Keys are auto-incremented starting from 1, 2, 3, and so on.
I build this Map[K-V] structure only once, at the start of the program, as an exclusive operation. Once it is built, it is used only for lookups; no updates or deletes are performed on it.
I have tried this with java.util.HashMap but, as expected, it consumes a lot of memory and the program fails with an OutOfMemoryError (heap usage exceeded).
I need some guidance on the following, which would help in reducing the memory footprint; I am OK with some degradation in access performance.
1. What other alternatives (from the Java collections or other libraries) can be tried here?
2. What is a recommended way to measure the memory footprint of this Map, for comparison purposes?
Just use a long[] or long[][].
500 million ascending keys is less than 2^31. And if you go over 2^31, use a long[][] where the first dimension is small and the second one is large.
(When the key type is an integer, you only need a complicated "map" data structure if the key space is sparse.)
The space wastage in a 1D array is insignificant. Every Java array has a 12-byte header, and the object size is rounded up to a multiple of 8 bytes. So a 500-million-entry long[] will take so close to 500 million x 8 bytes == 4 billion bytes that the difference doesn't matter.
However, a JVM typically cannot allocate a single object that takes up the entire available heap space. If virtual address space is at a premium, it would be advisable to use a 2-D array; e.g. new long[4][125_000_000]. This makes the lookups slightly more complicated, but you will most likely reduce the memory footprint by doing this.
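A minimal sketch of such a lookup, assuming dense 1-based keys and the new long[4][125_000_000] shape mentioned above:
static final int SHARD = 125_000_000;
static final long[][] values = new long[4][SHARD];

static long get(long key) {
    int i = (int) (key - 1);                 // keys are 1-based and dense
    return values[i / SHARD][i % SHARD];
}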
If you don't know beforehand the number of keys to expect, you could do the same thing with a combination of arrays and ArrayList objects. But an ArrayList has the problem that if you don't set an (accurate) capacity, the memory utilization is liable to be suboptimal. And if you populate an ArrayList by appending to it, the instantaneous memory demand for the append can be as much as 3 times the list's current space usage.
There is no reason for using a Map in your case.
If you just have a start index and further indices are just constant increments, just use a List:
// Capacity should ideally not be reached; if it is, the array behind the
// ArrayList must be reallocated, roughly doubling the allocated memory.
List<Long> data = new ArrayList<>(510_000_000);
data.add(1337L);               // insert as often as you want
long value = data.get(1 - 1);  // your index starts at 1, so subtract one
If you don't even add more elements and know the size from the start, an array will be even better:
// Capacity should surely not be reached; if it is, you will need to create
// a new array and copy all the data over.
long[] data = new long[510_000_000];
int currentIndex = 0;
data[currentIndex++] = 1337L;  // insert while currentIndex is smaller than the length
long value = data[1 - 1];      // your index starts at 1, so subtract one
Note that you should check the index (currentIndex) before inserting so that it is smaller than the array length.
When iterating, use currentIndex as the length instead of .length.
Create an array with the size you need and, whenever you need to access it, use arr[i-1] (-1 because your indices start with 1 instead of zero).
If you "just" have 500 million entries, you will not reach the integer limit and a simple array will be fine.
If you need more entries and you have sufficient memories, use an array of arrays.
The memory footprint of using an array this big is the memory footprint of the data and a bit more.
However, if you don't know the size, you should use a higher length/capacity than you might need. If you use an ArrayList, the memory footprint is doubled (temporarily tripled) whenever the capacity is reached, because it needs to allocate a bigger array.
A Map would need an object for each entry, plus an array of lists for all those objects, which would greatly increase the memory footprint. The growth of the memory footprint (using HashMap) is even worse than with ArrayLists, as the underlying array is reallocated even if the Map is not completely filled up.
But consider saving the data to the HDD/SSD if you need to store that much. In most cases, that works much better. You can use RandomAccessFile in order to access the data on the HDD/SSD at any point, as sketched below.
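A minimal sketch, assuming dense 1-based keys whose values are stored as consecutive 8-byte longs in a file ("values.bin" is a hypothetical name):
import java.io.IOException;
import java.io.RandomAccessFile;

static long lookup(long key) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile("values.bin", "r")) {
        raf.seek((key - 1) * Long.BYTES);    // each value occupies 8 bytes
        return raf.readLong();
    }
}
In practice you would keep the RandomAccessFile open across lookups rather than reopening it every time.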
Generally, they say that we have moved from arrays to ArrayList for the following reason:
Arrays are fixed size, whereas ArrayLists are not.
One of the disadvantages of ArrayList is:
When it reaches its capacity, an ArrayList grows to 3/2 of its current size. As a result, memory can be wasted if we do not utilize the space properly. In this scenario, arrays are preferred.
If we use ArrayList.trimToSize(), will that make ArrayList a unanimous choice, eliminating the only advantage (fixed size) arrays have over it?
One short answer would be: trimToSize doesn't solve everything, because shrinking an array after it has grown is not the same as preventing growth in the first place; the former has the cost of copying plus garbage collection.
The longer answer would be: int[] is low level; ArrayList is high level, which means it's more convenient but gives you less control over the details. Thus in business-oriented code (e.g. manipulating a short list of "Products") I'll prefer ArrayList, so that I can forget the technicalities and focus on the business. In mathematically-oriented code I'll probably go for int[].
There are additional subtle differences, but I'm not sure how relevant they are to you. E.g. concurrency: if you change the data of an ArrayList from several threads simultaneously, it will intentionally fail, because that's the intuitive requirement for most business code. An int[] will allow you to do whatever you want, leaving it up to you to make sure it makes sense. Again, this can all be summarized as "low level"...
If you are developing an extremely memory-critical application that needs resizability as well, and performance can be traded off, then trimming the array list is your best bet. This is the only time an array list with trimming is the unanimous choice.
In other situations, what you are actually doing is:
1. You have created an array list. The default capacity of the list is 10.
2. You added an element and applied the trim operation, so both size and capacity are now 1. How does trimToSize work? It basically creates a new array with the actual size of the list and copies the old array's data into it. The old array is left for garbage collection.
3. You again added a new element. Since the list is full, it is reallocated with 50% more space. Again, a procedure similar to step 2 is followed.
4. You call trimToSize again, and it follows the same procedure as step 2.
5. And so on...
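A small sketch of this cycle (the capacities in the comments follow the steps above; the exact growth may vary by JDK version):
import java.util.ArrayList;

ArrayList<Integer> list = new ArrayList<>();  // default capacity 10
list.add(1);
list.trimToSize();  // new array of capacity 1 + copy; old array left for GC
list.add(2);        // list is full: reallocated (nominally 50% more space) + copy
list.trimToSize();  // new array of capacity 2 + copy
// ...and so on: every add/trim pair costs a full array copy.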
So you see, we incur a lot of performance overhead just to keep the list's capacity and size the same. Fixed size offers nothing advantageous here except saving a few extra slots, which is hardly an issue on modern machines.
In a nutshell, if you want resizability without writing lots of boilerplate code, then array list is the unanimous choice. But if the size never changes and you don't need any dynamic functionality such as removal, then an array is the better choice. A few extra bytes are hardly an issue.
Say I have to read out data, which can be either 1 object (the majority of the time) or multiple objects (some of the time).
If I do:
List<Object> list = new ArrayList<>(1);
... loop over the loaded object(s) and add it/them to the list...
This will serve me well the majority of the time, when there is only 1 object loaded from the database.
But in the less common scenario where I have to expand my initial list, the expansion costs me extra operations.
I assume this won't really make much of an impact in the real world, but I wonder how I could calculate the following:
Assume X% of my data is 1 object, and Y% is a list of multiple objects. Is there a way I can calculate the ideal initial capacity for my list, minimizing the number of operations (list expansions) and the waste (allocated but unused slots in the list)?
You split your data into two groups: X (1 element) and Y (more than one). You optimized your code for the X group because it is the most common case.
It's a good idea to initialize your ArrayList with a capacity of 1 so that most of the time you won't waste any memory.
But if the members of the Y group have a high average size (and a small standard deviation), you can still optimize the worst case with ensureCapacity(int minCapacity). On the second iteration you can force the ArrayList's backing array to resize to the average size of the Y group.
For a member of the Y group with 100 elements, the list will create/copy arrays 12 times and the backing array will end up with length 141, versus 1 small array copy and no wasted memory if you implement the optimization.
Example of this optimization :
Iterator<Obj> it = // Get your iterator from your resource
ArrayList<Obj> result = new ArrayList<Obj>(1);
if (it.hasNext()) {
    result.add(it.next());
}
if (it.hasNext()) {
    result.ensureCapacity(100); // Avg size of the Y group
    while (it.hasNext()) {
        result.add(it.next());
    }
}
But unless it's a performance-critical feature, it's not worth the effort, because to make sure this trick optimizes speed and memory you would have to analyze the distribution of sizes in the Y group.
It's not directly related to your problem, but this question contains a lot of useful comments on ArrayList: When to use LinkedList over ArrayList?
I need to store a large amount of information, say for example 'names', in a Java List. The number of items can change (in short, I cannot predefine the size). I am of the opinion that, from a memory-allocation perspective, LinkedList would be a better option than ArrayList, since for an ArrayList, once the max size is reached, the memory allocation is automatically doubled, and hence there would always be a chance of more memory being allocated than is needed.
I understand from other posts here that individual elements stored in a LinkedList take more space than in an ArrayList, since a LinkedList also needs to store the node information, but I am still guessing that for the scenario I have defined, LinkedList might be a better option. Also, I do not want to get into the performance aspects (fetching, deleting, etc.), as much has already been discussed about them.
LinkedList might allocate fewer entries, but those entries are astronomically more expensive than they'd be for ArrayList -- enough that even the worst-case ArrayList is cheaper as far as memory is concerned.
(FYI, I think you've got it wrong; ArrayList grows by 1.5x when it's full, not 2x.)
See e.g. https://github.com/DimitrisAndreou/memory-measurer/blob/master/ElementCostInDataStructures.txt : LinkedList consumes 24 bytes per element, while ArrayList consumes in the best case 4 bytes per element, and in the worst case 6 bytes per element. (Results may vary depending on 32-bit versus 64-bit JVMs, and compressed object pointer options, but in those comparisons LinkedList costs at least 36 bytes/element, and ArrayList is at best 8 and at worst 12.)
UPDATE:
I understand from other posts here that individual elements stored in a LinkedList take more space than in an ArrayList, since a LinkedList also needs to store the node information, but I am still guessing that for the scenario I have defined, LinkedList might be a better option. Also, I do not want to get into the performance aspects (fetching, deleting, etc.), as much has already been discussed about them.
To be clear, even in the worst case, ArrayList is 4x smaller than a LinkedList with the same elements. The only possible way to make LinkedList win is to deliberately fix the comparison by calling ensureCapacity with a deliberately inflated value, or to remove lots of values from the ArrayList after they've been added.
In short, it's basically impossible to make LinkedList win the memory comparison, and if you care about space, then calling trimToSize() on the ArrayList will instantly make ArrayList win again by a huge margin. Seriously. ArrayList wins.
... but I am still guessing that for the scenario I have defined, LinkedList might be a better option
Your guess is incorrect.
Once you have got past the initial capacity of the array list, the size of the backing array will be between 1 and 2 references per entry. This is due to the strategy used to grow the backing array.
For a linked list, each entry occupies AT LEAST 3 references, because each node has a next and a prev reference as well as the entry reference. (And in fact it is more than 3 references' worth of space, because of the nodes' object headers and padding. Depending on the JVM and pointer size, it can be as much as 6 times.)
The only situation where a linked list will use less space than an array list is if you badly over-estimate the array list's initial capacity. (And for very small lists ...)
When you think about it, the only real advantage linked lists have over array lists is when you are inserting and removing elements. Even then, it depends on how you do it.
ArrayList uses one reference per object (or two when it's double the size it needs to be). This is typically 4 bytes.
LinkedList uses only the nodes it needs, but these can be 24 bytes each.
So even at its worst, ArrayList will be 3x smaller than LinkedList.
For fetching, ArrayList supports random access in O(1), but LinkedList is O(n). For deleting from the end, both are O(1); for deleting from somewhere in the middle, ArrayList is O(n).
Unless you have millions of entries, the size of the collection is unlikely to matter. What will matter first is the size of entries which is the same regardless of the collection used.
Back of the envelope worst-case:
500,000 names in an array sized to 1,000,000 = 500,000 used, 500,000 empty pointers in the unused portion of the allocated array.
500,000 entries in a linked list = 3 pointers per entry (the Node object holds current, prev, and next) = 1,500,000 pointers in memory. (Then you have to add the size of the Node objects themselves.)
ArrayList.trimToSize() may satisfy you.
Trims the capacity of this ArrayList instance to be the list's current size. An application can use this operation to minimize the storage of an ArrayList instance.
By the way, in Java 6's ArrayList it's not double the capacity; it grows by about 1.5 times when the max size is reached.
Say I instantiate 100,000 Vectors:
a[0..100k] = new Vector<Integer>();
If I do this:
a[0..100k] = new Vector<Integer>(1);
Will they take less memory? That is, ignoring whether they have anything in them, and ignoring the overhead of expanding them when there has to be more than 1 element.
According to the Javadoc, the default capacity for a vector is 10, so I would expect it to take more memory than a vector of capacity 1.
In practice, you should probably use an ArrayList unless you need to work with another API that requires vectors.
When you create a Vector, you either specify the size you want it to have at the start or leave some default value. But it should be noted that in any case, everything stored in a Vector is just a bunch of references, which take up very little space compared to the objects they actually point at.
So yes, you will save space initially, but only by an amount equal to the difference between the default size and the specified size, multiplied by the size of a reference. If you create a really large number of vectors, as in your case, the initial size does matter.
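A rough sketch of the arithmetic, assuming 4-byte references (compressed oops); the saving counted here is only the backing-array slack:
import java.util.Vector;

Vector<Integer>[] a = new Vector[100_000];   // unchecked warning, fine for a sketch
for (int i = 0; i < a.length; i++) {
    a[i] = new Vector<>(1);                  // backing array of length 1 instead of 10
}
// Slack avoided: (10 - 1) references x 4 bytes x 100,000 vectors ≈ 3.6 MB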
Well, sort of yes. IIRC Vector initializes 10 elements internally by default, which means that due to byte alignment and other things done by the underlying VM, you'll save a noticeable amount of memory initially.
What are you trying to accomplish, though?
Yes, they will. Putting in reasonable "initial sizes" for collections is one of the first things I do when confronted with a need to radically improve memory consumption of my program.
Yes, it will. By default, Vector allocates space for 10 elements.
Vector()
Constructs an empty vector so that its internal data array has size 10 and its standard capacity increment is zero.
Therefore, it reserves memory for 10 memory references.
That being said, in real-life situations this is rarely a concern. If you are truly generating 100,000 Vectors, you need to rethink your design.