Tradeoff between guessing ArrayList capacity wrongly vs. unused values?

Tradeoff between guessing ArrayList capacity wrongly vs. unused values? - java

Say I have to read out data, which can be either 1 object (majority of time) or multiple objects (some of the time).
If I do:
List list = new ArrayList<Object>(1);
... loop over the loaded object(s) and add it/them to the list...
This will serve me well the majority of times when there is only 1 object loaded from the database.
But assuming the less common scenario where I have to expand my initial list, this will cause me to lose operations.
I assume this won't really make much of an impact in the real world, but I wonder how I could calculate the following:
Assume X% of my data is 1 object, and Y% is a list of multiple objects. Is there a way I can calculate the ideal initial capacity for my list, going for the least operations (through list expansions, allocated but unused fields in the list)?

You dissociate your data into 2 groups X (1 element) and Y (more than one). You optimized your code for the X group because it is the most common case.
It's a good idea to initialize your ArrayList with one element so most of the time you won't waste any memory.
But if the members of the Y group have a high average size (and a small standard deviation) you can still optimize the worstcase with ensureCapacity(int cap). On the second iteration you can force to resize the ArrayList backing array to the average size of the Y group.
For a member of the Y group with 100 elements it will create/copy arrays 12 times and the backing array length will be 141 against 1 small array copy and no wasted memory if you implement the optimization.
Example of this optimization :
Iterator<Obj> it = // Get your iterator from your resource
ArrayList<Obj> result = new ArrayList<Obj>(1);
if(it.hasNext()) {
result.add(it.next());
}
if(it.hasNext()) {
result.ensureCapacity(100);// Avg size of the Y group
while(it.hasNext()) {
result.add(it.next());
}
}
But unless it's a performance critical feature It's not worth the effort. Because to make sure this trick will optimize speed and memory you have to analyse the distribution of the size in the Y group.
It's not drectly related to your problem but it contains a lot of useful comments on ArrayList : When to use LinkedList over ArrayList?

Related

Question about java.util.ArrayList realization [duplicate]

The usual constructor of ArrayList is:
ArrayList<?> list = new ArrayList<>();
But there is also an overloaded constructor with a parameter for its initial capacity:
ArrayList<?> list = new ArrayList<>(20);
Why is it useful to create an ArrayList with an initial capacity when we can append to it as we please?

If you know in advance what the size of the ArrayList is going to be, it is more efficient to specify the initial capacity. If you don't do this, the internal array will have to be repeatedly reallocated as the list grows.
The larger the final list, the more time you save by avoiding the reallocations.
That said, even without pre-allocation, inserting n elements at the back of an ArrayList is guaranteed to take total O(n) time. In other words, appending an element is an amortized constant-time operation. This is achieved by having each reallocation increase the size of the array exponentially, typically by a factor of 1.5. With this approach, the total number of operations can be shown to be O(n).

Because ArrayList is a dynamically resizing array data structure, which means it is implemented as an array with an initial (default) fixed size. When this gets filled up, the array will be extended to a double sized one. This operation is costly, so you want as few as possible.
So, if you know your upper bound is 20 items, then creating the array with initial length of 20 is better than using a default of, say, 15 and then resize it to 15*2 = 30 and use only 20 while wasting the cycles for the expansion.
P.S. - As AmitG says, the expansion factor is implementation specific (in this case (oldCapacity * 3)/2 + 1)

Default size of Arraylist is 10.
/**
* Constructs an empty list with an initial capacity of ten.
*/
public ArrayList() {
this(10);
}
So if you are going to add 100 or more records, you can see the overhead of memory reallocation.
ArrayList<?> list = new ArrayList<>();
// same as new ArrayList<>(10);
So if you have any idea about the number of elements which will be stored in Arraylist its better to create Arraylist with that size instead of starting with 10 and then going on increasing it.

I actually wrote a blog post on the topic 2 months ago. The article is for C#'s List<T> but Java's ArrayList has a very similar implementation. Since ArrayList is implemented using a dynamic array, it increases in size on demand. So the reason for the capacity constructor is for optimisation purposes.
When one of these resizings operation occurs, the ArrayList copies the contents of the array into a new array that is twice the capacity of the old one. This operation runs in O(n) time.
Example
Here is an example of how the ArrayList would increase in size:
10
16
25
38
58
... 17 resizes ...
198578
297868
446803
670205
1005308
So the list starts with a capacity of 10, when the 11th item is added it is increase by 50% + 1 to 16. On the 17th item the ArrayList is increased again to 25 and so on. Now consider the example where we're creating a list where the desired capacity is already known as 1000000. Creating the ArrayList without the size constructor will call ArrayList.add 1000000 times which takes O(1) normally or O(n) on resize.
1000000 + 16 + 25 + ... + 670205 + 1005308 = 4015851 operations
Compare this using the constructor and then calling ArrayList.add which is guaranteed to run in O(1).
1000000 + 1000000 = 2000000 operations
Java vs C#
Java is as above, starting at 10 and increasing each resize at 50% + 1. C# starts at 4 and increases much more aggressively, doubling at each resize. The 1000000 adds example from above for C# uses 3097084 operations.
References
My blog post on C#'s List<T>
Java's ArrayList source code

Setting the initial size of an ArrayList, e.g. to ArrayList<>(100), reduces the number of times the re-allocation of internal memory has to occur.
Example:
ArrayList example = new ArrayList<Integer>(3);
example.add(1); // size() == 1
example.add(2); // size() == 2,
example.add(2); // size() == 3, example has been 'filled'
example.add(3); // size() == 4, example has been 'expanded' so that the fourth element can be added.
As you see in the above example - an ArrayList can be expanded if needed to be. What this doesn't show you is that the size of the Arraylist usually doubles (although note that the new size depends on your implementation). The following is quoted from Oracle:
"Each ArrayList instance has a capacity. The capacity is the size of
the array used to store the elements in the list. It is always at
least as large as the list size. As elements are added to an
ArrayList, its capacity grows automatically. The details of the growth
policy are not specified beyond the fact that adding an element has
constant amortized time cost."
Obviously, if you have no idea as to what kind of range you will be holding, setting the size probably won't be a good idea - however, if you do have a specific range in mind, setting an initial capacity will increase memory efficiency.

ArrayList can contain many values and when doing large initial insertions you can tell ArrayList to allocate a larger storage to begin with as to not waste CPU cycles when it tries to allocate more space for the next item. Thus to allocate some space at the beginning is more effiecient.

This is to avoid possible efforts for reallocation for every single object.
int newCapacity = (oldCapacity * 3)/2 + 1;
internally new Object[] is created. JVM needs effort to create new Object[] when you add element in the arraylist. If you don't have above code(any algo you think) for reallocation then every time when you invoke arraylist.add() then new Object[] has to be created which is pointless and we are loosing time for increasing size by 1 for each and every objects to be added. So it is better to increase size of Object[] with following formula.
(JSL has used forcasting formula given below for dynamically growing arraylist instead of growing by 1 every time. Because to grow it takes effort by JVM)
int newCapacity = (oldCapacity * 3)/2 + 1;

I think each ArrayList is created with an init capacity value of "10". So anyway, if you create an ArrayList without setting capacity within constructor it will be created with a default value.

I'd say its an optimization. ArrayList without initial capacity will have ~10 empty rows and will expand when you are doing an add.
To have a list with exactly the number of items you need to call trimToSize()

As per my experience with ArrayList, giving an initial capacity is a nice way to avoid reallocation costs. But it bears a caveat. All suggestions mentioned above say that one should provide initial capacity only when a rough estimate of the number of elements is known. But when we try to give an initial capacity without any idea, the amount of memory reserved and unused will be a waste as it may never be required once the list is filled to required number of elements. What i am saying is, we can be pragmatic at the beginning while allocating capacity, and then find a smart way of knowing required minimal capacity at runtime. ArrayList provides a method called ensureCapacity(int minCapacity). But then, one has find a smart way...

I have tested ArrayList with and without initialCapacity and I got suprising result
When I set LOOP_NUMBER to 100,000 or less the result is that setting initialCapacity is efficient.
list1Sttop-list1Start = 14
list2Sttop-list2Start = 10
But when I set LOOP_NUMBER to 1,000,000 the result changes to:
list1Stop-list1Start = 40
list2Stop-list2Start = 66
Finally, I couldn't figure out how does it works?!
Sample code:
public static final int LOOP_NUMBER = 100000;
public static void main(String[] args) {
long list1Start = System.currentTimeMillis();
List<Integer> list1 = new ArrayList();
for (int i = 0; i < LOOP_NUMBER; i++) {
list1.add(i);
}
long list1Stop = System.currentTimeMillis();
System.out.println("list1Stop-list1Start = " + String.valueOf(list1Stop - list1Start));
long list2Start = System.currentTimeMillis();
List<Integer> list2 = new ArrayList(LOOP_NUMBER);
for (int i = 0; i < LOOP_NUMBER; i++) {
list2.add(i);
}
long list2Stop = System.currentTimeMillis();
System.out.println("list2Stop-list2Start = " + String.valueOf(list2Stop - list2Start));
}
I have tested on windows8.1 and jdk1.7.0_80

How to determine program complexity in Big-O when adding to a full ArrayList?

I am working through a practice exam for my Computer Science class. However, I am not sure about the question below.
Consider four different approaches to re-sizing an array-based list data-structure.
In each case, when we want to add() an item to the end of the array, but it is full, we
create a new array and then copy elements from the old one into the new one. For each of
the following choices about how big to make the new array, give the resulting complexity
of adding to the end of the list, in big-O terms:
(i) increase array size by 1 item.
(ii) Increase array size by 100 items.
(iii) Double the array size.
(iv) Triple the array size.
Since you call the same System.arraycopy() method reach time regardless, wouldn't the complexity be the same for each?

Since you call the same System.arraycopy() method reach time regardless, wouldn't the complexity be the same for each?
Yes, and no.
Yes - when you actually do the copy, the cost of the copy will be similar in all cases.
(They are not exactly the same if you include the cost of allocating and initializing the array. It takes more space and time to allocate and initialize an array of 2 * N elements than for N + 1 elements. But you will only be copying N elements into the array.)
No - the different strategies result in the array copies happening a different number of times. If you do a complete complexity analysis for a sequence of operations, you will find that options 3 and 4 have a different complexity to 1 and 2.
(And it is worth noting that 2 will be faster than 1, even though they have the same complexity.)
The typical analysis for this involves working out the total costs for something like this:
List<Object> list = new ArrayList<>();
for (int i = 0; i < N; i++) {
list.add(new Object());
}
(Hint: the analysis may be given as an example in your recommended "data structures and algorithms" textbook, or your lecture notes. If so, that is what you should be revising (before doing practice exams!) If not, Google for "complexity amortized arraylist" and you will find examples.)

Best data structure to hold large amounts of data?

Reading in a lot of data from a file. There may be 100 different data objects with necessary headings, but there can be well over 300,000 values stored in each of these data objects. The values need to be stored in the same order that they are read in. This is the constructor for the data object:
public Data(String heading, ArrayList<Float> values) {
this.heading = heading;
this.values = values;
}
What would be the quickest way to store and retrieve these values sequentially in RAM?

Although in your comments you mention "quickness", without specifying what operation needs to be "quick", your main concern seems to be heap memory consumption.
Let's assume 100 groups of 300,000 numbers (you've used words like "may be" and "well over" but this will do as an example).
That's 30,000,000 numbers to store, plus 100 headings and some structural overhead for grouping.
A primitive Java float is 32 bits, that is 4 bytes. So at an absolute minimum, you're going to need 30,000,000 * 4 bytes == 120MB.
An array of primitives - float[30000000] - is just all the values concatenated into a contiguous chunk of memory, so will consume this theoretical minumum of 120MB -- plus a few bytes of once-per-array overhead that I won't go into detail about here.
A java Float wrapper object is 12 bytes. When you store an object (rather than a primitive) in an array, the reference itself is 4 bytes. So an array of Float - Float[30000000] will consume 30,000,000 * (12 + 4) == 480MB.
So, you can cut your memory use by more than half by using primitives rather than wrappers.
An ArrayList is quite a light wrapper around an array of Object and so has about the same memory costs. The once-per-list overheads are too small to have an impact compared to the elements, at these list sizes. But there are some caveats:
ArrayList can only store Objects, not primitives, so if you choose a List you're stuck with the 12-bytes-per-element overhead of Float.
There are some third-party libraries that provide lists of primitives - see: Create a List of primitive int?
The capacity of an ArrayList is dynamic, and to achieve this, if you grow the list to be bigger than its backing array, it will:
create a new array, 50% bigger than the old array
copy the contents of the old array into the new array (this sounds expensive, but hardware is very fast at doing this)
discard the old array
This means that if the backing array happens to have 30 million elements, and is full, ArrayList.add() will replace the array with one of 45 million elements, even if your List only needs 30,000,001.
You can avoid this if you know the needed capacity in advance, by providing the capacity in the constructor.
You can use ArrayList.trimToSize() to drop unneeded capacity and claw some memory back after you've filled the ArrayList.
If I was striving to use as little heap memory as possible, I would aim to store my lists of numbers as arrays of primitives:
class Data {
String header;
float[] values;
}
... and I would just put these into an ArrayList<Data>.
With this structure, you have O(1) access to arbitrary values, and you can use Arrays.binarySearch() (if the values are sorted) to find by value within a group.
If at all possible, I would find out the size of each group before reading the values, and initialise the array to the right size. If you can, make your input file format facilitate this:
while(line = readLine()) {
if(isHeader(line)) {
ParsedHeader header = new ParsedHeader(line);
currentArray = new float[header.size()];
arrayIndex = 0;
currentGroup = new Group(header.name(), currentArray);
groups.add(currentGroup);
} else if (isValue(line)) {
currentArray[arrayIndex++] = parseValue(line);
}
}
If you can't change the input format, consider making two passes through the file - once to discover group lengths, once again to fill your arrays.
If you have to consume the file in one pass, and the file format can't provide group lengths before groups, then you'll have to do something that allows a "list" to grow arbitrarily. There are several options:
Consume each group into an ArrayList<Float> - when the group is complete, convert it into an array[float]:
float[] array = new float[list.size()];
int i = 0;
for (Float f : list) {
array[i] = f; // auto-unboxes Float to float
}
Use a third-party list-of-float library class
Copy the logic used by ArrayList to replace your array with a bigger one when needed -- http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
Any number of approaches discussed in Computer Science textbooks, for example a linked list of arrays.
However none of this considers your reasons for slurping all these numbers into memory in the first place, nor whether this store meets your needs when it comes to processing the numbers.
You should step back and consider what your actual data processing requirement is, and whether slurping into memory is the best approach.
See whether you can do your processing by storing only a slice of data at a time, rather than storing the whole thing in memory. For example, to calculate max/min/mean, you don't need every number to be in memory -- you just need to keep a running total.
Or, consider using a lightweight database library.

You could use a RedBlack BST, which will be an extremely efficient way to store/retrieve data. This relies on nodes that link to other nodes, so there's no limit to the size of the input, as long as you have enough memory for java.

ArrayList vs LinkedList from memory allocation perspective

I need to store a large amount of information, say for example 'names' in a java List. The number of items can change (or in short I cannot predefine the size). I am of the opinion that from a memory allocation perspective LinkedList would be a better option than ArrayList, as for an ArrayList once the max size is reached, automatically the memory allocation doubles and hence there would always be a chance of more memory being allocated than what is needed.
I understand from other posts here that individual elements stored in a LinkedList takes more space than an ArrayList as LinkedList also needs to store the node information, but I am still guessing for the scenario I have defined LinkedList might be a better option. Also, I do not want to get into the performance aspect (fetching, deleting etc) , as much has already been discussed on it.

LinkedList might allocate fewer entries, but those entries are astronomically more expensive than they'd be for ArrayList -- enough that even the worst-case ArrayList is cheaper as far as memory is concerned.
(FYI, I think you've got it wrong; ArrayList grows by 1.5x when it's full, not 2x.)
See e.g. https://github.com/DimitrisAndreou/memory-measurer/blob/master/ElementCostInDataStructures.txt : LinkedList consumes 24 bytes per element, while ArrayList consumes in the best case 4 bytes per element, and in the worst case 6 bytes per element. (Results may vary depending on 32-bit versus 64-bit JVMs, and compressed object pointer options, but in those comparisons LinkedList costs at least 36 bytes/element, and ArrayList is at best 8 and at worst 12.)
UPDATE:
I understand from other posts here that individual elements stored in a LinkedList takes more space than an ArrayList as LinkedList also needs to store the node information, but I am still guessing for the scenario I have defined LinkedList might be a better option. Also, I do not want to get into the performance aspect (fetching, deleting etc) , as much has already been discussed on it.
To be clear, even in the worst case, ArrayList is 4x smaller than a LinkedList with the same elements. The only possible way to make LinkedList win is to deliberately fix the comparison by calling ensureCapacity with a deliberately inflated value, or to remove lots of values from the ArrayList after they've been added.
In short, it's basically impossible to make LinkedList win the memory comparison, and if you care about space, then calling trimToSize() on the ArrayList will instantly make ArrayList win again by a huge margin. Seriously. ArrayList wins.

... but I am still guessing for the scenario I have defined LinkedList might be a better option
Your guess is incorrect.
Once you have got past the initial capacity of the array list, the size of the backing will be between 1 and 2 references times the number of entries. This is due to strategy used to grow the backing array.
For a linked list, each node occupies AT LEAST 3 times the number of entries, because each node has a next and prev reference as well as the entry reference. (And in fact, it is more than 3 times, because of the space used by the nodes' object headers and padding. Depending on the JVM and pointer size, it can be as much as 6 times.)
The only situation where a linked list will use less space than an array list is if you badly over-estimate the array list's initial capacity. (And for very small lists ...)
When you think about it, the only real advantage linked lists have over array lists is when you are inserting and removing elements. Even then, it depends on how you do it.

ArrayList use one reference per object (or two when its double the size it needs to be) This is typically 4 bytes.
LinkedList uses only the nodes its needs, but these can be 24 bytes each.
So even at it worst ArrayList will be 3x smaller than LinkedList.
For fetching ARrayList support random access O(1) but LinkedList is O(n). For deleting from the end, both are O(1), for deleting from somewhere in the middle ArrayList is O(n)
Unless you have millions of entries, the size of the collection is unlikely to matter. What will matter first is the size of entries which is the same regardless of the collection used.

Back of the envelope worst-case:
500,000 names in an array sized to 1,000,000 = 500,000 used, 500,000 empty pointers in the unused portion of the allocated array.
500,000 entries in a linked list = 3 pointers per entry (Node object holds current, prev, and next) = 1,5000,000 pointers in memory. (Then you have to add the size of the Node itself)

ArrayList.trimToSize() may satisfy you.
Trims the capacity of this ArrayList instance to be the list's current
size. An application can use this operation to minimize the storage of
an ArrayList instance.
By the way, in ArrayList Java6, it's not double capacity, it's about 1.5 times max size is reached.

How much should I add when resizing an array?

I'm having a contest with another student to make the fastest version of our homework assignment, and I'm not using an ArrayList for performance reasons (resizing the array myself cut the benchmark time from 56 seconds to 4), but I'm wondering how much I should resize the array every time I need to. Specifically the relevant parts of my code are this:
private Node[] list;
private int size; // The number of items in the list
private static final int N; // How much to resize the list by every time
public MyClass(){
list = new Node[N];
}
public void add(Node newNode){
if(size == list.length){
list = Arrays.copyOf(list, size + N);
}
list[size] = newNode;
size++;
}
TL;DR: What should I make N?

It's recommended to double the size of the array when resizing. Doubling the size leads to amortized linear-time cost.
The naive idea is that there are two costs associated with the resize value:
Copying performance costs - costs of copying the elements from previous array to new one, and
Memory overhead costs - cost of the allotted memory that is not used.
If you were to re-size the array by adding one element at a time, the memory overhead is zero, but the copying cost becomes quadratic. If you were to allocate too much slots, the copying cost will be linear, but the memory overhead is too much.
Doubling leads to a linear amortized cost (i.e. over a long time, the cost of copying is linear with respect to the size of the array), and you are guaranteed not to waste more than half of the array.
UPDATE: By the way, apparently Java's ArrayList expands by (3/2). This makes it a bit more memory conservative, but cost a bit more in terms of copying. Benchmarking for your use wouldn't hurt.
MINOR Correction: Doubling would make the cost resizing linear amortized, but would ensure that you have a amortized constant time insertion. Check CMU's lecture on Amortized Analysis.

3/2 is likely chosen as "something that divides cleanly but is less than phi". There was an epic thread on comp.lang.c++.moderated back in November 2003 exploring how phi establishes an upper bound on reusing previously-allocated storage during reallocation for a first-fit allocator.
See post #7 from Andrew Koenig for the first mention of phi's application to this problem.

If you know roughly how many items there are going to be, then pre-assign the array or the ArrayList to that size, and you'll never have to expand. Unbeatable performance!
Failing that, a reasonable way to achieve good amortized cost is to keep icreasing by some percentage. 100% or 50% are common.

You should resize your lists as a multiple of the previous size, rather than adding a constant amount each time.
for example:
newSize = oldSize * 2;
not
newSize = oldSize + N;

Double the size each time you need to resize unless you know that more or less would be best.
If memory isn't an issue, just start off with a big array to begin with.

Your code seems to do pretty much what ArrayList does - if you know you will be using a large list, you can pass it an initial size when you create the list and avoid resizing at all. This ofcourse assumes that you're going for raw speed and memory consumption is not an issue.

From the comments of one of the answers:
The problem is that memory isn't an
issue, but I'm reading an arbitrarily
large file.
Try this:
new ArrayList<Node>((int)file.length());
You could do it with your array as well. Then there should be no resizing in either case since the array will be the size of the file (assuming that the file is not longer then an int...).

For maximum performance, you're going to want to resize as rarely as possible. Set the initial size to be as large as you'll typically need, rather than starting with N elements. The value you choose for N will matter less in that case.
If you are going to create a large number of these list objects, of varying sizes, then you'll want to use a pool-based allocator, and not free the memory until you exit.
And to eliminate the copy operation altogether, you can use a list of arrays

Here's an analogy for you, long long ago when I used to work on a mainframe we used a filing system called VSAM, which would require you to specify the initial file size and the amount of freespace required.
Whenever the amount of freespace dropped below threshold required, then the amount of freespace required would be allocated in the background while the program continued to process.
It would be interesting to see if this could be done in java using a separate thread to allocate the additional space and 'attach' it to the end of the array while the main thread continues processing.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.