So I've been searching for how dynamic arrays actually work in general. What I found is two different approaches.
In C++
In C++, a dynamic array is generally implemented as a vector. A vector starts with a capacity of 0, increments the count to insert a new element, and then doubles the capacity when more room is needed.
vector.h
/*
* Implementation notes: Vector constructor and destructor
* -------------------------------------------------------
* The constructor allocates storage for the dynamic array and initializes
* the other fields of the object. The destructor frees the memory used
* for the array.
*/
template <typename ValueType>
Vector<ValueType>::Vector() {
    count = capacity = 0;
    elements = NULL;
}
For expanding the vector's capacity:
/*
* Implementation notes: expandCapacity
* ------------------------------------
* This function doubles the array capacity, copies the old elements into
* the new array, and then frees the old one.
*/
template <typename ValueType>
void Vector<ValueType>::expandCapacity() {
    capacity = max(1, capacity * 2);
    ValueType *array = new ValueType[capacity];
    for (int i = 0; i < count; i++) {
        array[i] = elements[i];
    }
    if (elements != NULL) delete[] elements;
    elements = array;
}
In Java
In Java, a dynamic array is implemented by ArrayList. It sets the initial capacity to 10 (implementation-dependent), and once the capacity is full it increases the capacity by some factor.
The reason for starting with a capacity of 10 is that you don't have to reallocate memory for every single insertion; only once the capacity is full does the backing array grow.
Curiosity
Why does the implementation in vector.h set the default capacity to 0? Setting the capacity to some small value (let's say 10) instead of 0 could save the overhead of allocating memory every time the user inserts an element.
As it is a dynamic array, setting a small initial capacity for the vector would do no harm, because the size of a dynamic array generally grows beyond 10 anyway.
Edit: My question is why the default is 0. It could be any small value by default, because the vector is going to expand to some specific size anyway; that's the purpose of using a vector in the first place.
Having a capacity of zero by default has the advantage that default-constructing a std::vector doesn't do any dynamic memory allocation at all (you don't pay for what you don't need).
If you know you need ~10 elements, you can set the capacity explicitly via calling std::vector::reserve:
std::vector<int> vec;
vec.reserve(10);
I can only speculate why Java does things differently, but AFAIK dynamic memory allocation is cheaper in Java than in C++, and the two languages also follow different philosophies when it comes to performance and low-level control versus simplicity.
Why default 0?
It's not 0 by default, actually. That is, the C++ language standard does not define (AFAIK) what the initial capacity of a default-constructed vector should be.
In practice, most/all implementations default to 0 capacity. The reason, I would say, lies in one of the design principles of the C++ language:
What you don't use, you don't pay for.
(see: B. Stroustrup: The Design and Evolution of C++. Addison Wesley, ISBN 0-201-54330-3. March 1994)
And this is not just a trivial tautology - it's a deliberate slant of the design considerations.
So, in C++, we would rather not pay anything for a vector which does not have any elements inserted into it, than save on potential size increases by making an initial "sacrifice".
As @yurikilocheck points out, however, the std::vector class does provide you with a reserve() method, so you can set an initial capacity yourself - and since the default constructor doesn't allocate anything, you will not have paid "extra" for two allocations, just for one. Again, you pay (essentially) the minimum possible.
Edit: On the other hand, std::vector partially breaks this principle in allocating more space than it actually needs when hitting its capacity. But at least it's not a superfluous allocation call.
I have used an implementation which reserves a default of 32 elements per vector. It was the native Sun C++ STL implementation. It was a disaster. I had a perfectly reasonable program which by its nature had to have hundreds of thousands of those vectors. It kept running out of memory, even though it should have been running fine, since only a small percentage of those hundreds of thousands of vectors were non-empty.
So from personal experience, 0 is the best default vector size.
Related
It's well known that the time complexity of array access by index is O(1).
The documentation of Java's ArrayList, which is backed by an array, says the same about its get operation:
The size, isEmpty, get, set, iterator, and listIterator operations run in constant time.
The lookup is done by getting the memory address of the element at a given index independently of the array's size (something like start_address + element_size * index). My understanding is that the array's elements have to be stored next to each other in the memory for this lookup mechanism to be possible.
However, from this question, I understand that arrays in Java are not guaranteed to have their elements stored contiguously in the memory. If that is the case, how could it always be O(1)?
Edit: I'm quite aware of how an ArrayList works. My point is, if contiguous storage for arrays is not guaranteed by the JVM specification, the elements could end up in different areas of memory. Although that situation is highly unlikely, it would render the lookup mechanism mentioned above impossible, and the JVM would need another way to do the lookup, which would no longer be O(1). At that point, it would go against both the common knowledge stated at the top of this question and the ArrayList documentation regarding its get operation.
Thanks everybody for your answers.
Edit: In the end, I think it's a JVM-specific thing, but most, if not all, JVMs stick to contiguous storage of an array's elements even though it's not guaranteed, so that the lookup mechanism above can be used. It's simple, efficient and cost-effective.
As far as I can see, it would be silly to store the elements all over the place and then have to take a different approach to doing the lookup.
As far as I'm aware, the spec gives no guarantee that arrays will be stored contiguously. I'd speculate that most JVM implementations will though. In the basic case it's simple enough to enforce: if you can't expand the array because other memory is occupying the space you need, move the whole thing somewhere else.
Your confusion stems from a misunderstanding of the meaning of O(1). O(1) does not mean that a function is performed in a single operation (e.g. start_address + element_size * index). It means that the operation is performed in a constant amount of time irrespective of the size of the input data - in this case, the array. This is perfectly achievable with data that is not stored contiguously. You could have a lookup table mapping indexes to memory locations, for example.
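To make that concrete, here is a hypothetical sketch (not how any real JVM stores arrays): even a two-level "chunked" layout, where elements live in fixed-size blocks that need not be adjacent in memory, still gives constant-time indexed access, because locating an element always takes the same two steps regardless of the array's size.

// Hypothetical chunked storage: the blocks can live anywhere on the heap,
// yet get/set remain O(1) because the index math is size-independent.
public class ChunkedArray {
    private static final int CHUNK_SIZE = 1024;
    private final int[][] chunks;

    public ChunkedArray(int capacity) {
        // One top-level slot per chunk; the chunks themselves need not be contiguous.
        chunks = new int[(capacity + CHUNK_SIZE - 1) / CHUNK_SIZE][CHUNK_SIZE];
    }

    public int get(int index) {
        // Two constant-time steps, independent of the total number of elements.
        return chunks[index / CHUNK_SIZE][index % CHUNK_SIZE];
    }

    public void set(int index, int value) {
        chunks[index / CHUNK_SIZE][index % CHUNK_SIZE] = value;
    }
}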
From the linked question you can see that even though it's not mandated by the JVM rules, it's highly likely that 1D arrays are contiguous in memory.
Given a contiguous array, the time complexities of ArrayList are as documented. However, it's not impossible that in a special case or a special JVM the complexities might differ slightly. It's impossible to state the time complexities if you have to consider every kind of VM allowed by the spec.
Every time an element is added, the capacity is checked:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b27/java/util/ArrayList.java#ArrayList.add%28java.lang.Object%29
public boolean add(E e) {
    ensureCapacity(size + 1);  // Increments modCount!!
    elementData[size++] = e;
    return true;
}
Here, ensureCapacity() does the trick to keep the array sequential. How?
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b27/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
public void ensureCapacity(int minCapacity) {
    modCount++;
    int oldCapacity = elementData.length;
    if (minCapacity > oldCapacity) {
        Object oldData[] = elementData;
        int newCapacity = (oldCapacity * 3)/2 + 1;
        if (newCapacity < minCapacity)
            newCapacity = minCapacity;
        // minCapacity is usually close to size, so this is a win:
        elementData = Arrays.copyOf(elementData, newCapacity);
    }
}
Thus, at every stage, it ensures that the array has enough capacity and remains contiguous, i.e. any index within range can be retrieved in O(1).
An ArrayList wraps a real array. (On add it might need to grow.) So it has the same complexity for get and set, O(1).
However, the ArrayList can (until some future version of Java) only contain Objects. For primitive types like int or char, wrapper classes are needed, which are inefficient and leave the boxed objects scattered over the allocated memory. Still O(1), but with a large constant factor.
So for primitive types you might use arrays and do the growing yourself:
elementData = Arrays.copyOf(elementData, newCapacity);
Or, if it fits your data, use a BitSet, where the indices that are true are the values.
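Here is a minimal sketch of that do-it-yourself growth for a primitive int array (the class name and the 1.5x growth factor are illustrative, mirroring what ArrayList does internally):

import java.util.Arrays;

// Minimal growable int array, avoiding Integer boxing entirely.
public class IntList {
    private int[] elementData = new int[10];
    private int size = 0;

    public void add(int value) {
        if (size == elementData.length) {
            // Same idiom as ArrayList: copy everything into a larger array.
            int newCapacity = (elementData.length * 3) / 2 + 1;
            elementData = Arrays.copyOf(elementData, newCapacity);
        }
        elementData[size++] = value;
    }

    public int get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index: " + index + ", size: " + size);
        }
        return elementData[index];
    }

    public int size() {
        return size;
    }
}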
What is the difference between the following declarations?
List list1 = new ArrayList();
List list2 = new ArrayList(10);
By default it allocates with a capacity of 10. But is there any difference?
Can I add an 11th element to list2 with list2.add("something")?
Here is the source code for the first example:
public ArrayList() {
    this(10);
}
So there is no difference. Since the initial capacity is 10, whether you pass 10 or not, it gets initialised with capacity 10.
Can I add an 11th element to list2 with list2.add("something")?
Of course. The initial capacity is not the final capacity. As you keep adding beyond 10 elements, the size of the list keeps increasing.
If you want to have a fixed size container, use Arrays.asList (or, for primitive arrays, the asList methods in Guava) and also consider java.util.Collections.unmodifiableList()
Worth reading about this change in Java 8: In Java 8, why is the default capacity of ArrayList now zero?
In short, providing an initial capacity won't really change anything in terms of size.
You can always add elements to a list. However, the underlying array used by the ArrayList is initialized with either the default capacity of 10 or the capacity you specify when constructing the ArrayList. This means that if you add, e.g., the 11th element, the array has to be enlarged, which is done by copying the contents into a new, bigger array instance. This of course takes time proportional to the size of the list/array. So if you already know that your list will hold thousands of elements, it is faster to initialize the list with that approximate size.
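For example (the element count here is made up for illustration):

import java.util.ArrayList;
import java.util.List;

public class PreSized {
    public static void main(String[] args) {
        // We expect roughly 100,000 elements, so size the list up front;
        // this skips all of the intermediate grow-and-copy cycles.
        List<Integer> values = new ArrayList<>(100_000);
        for (int i = 0; i < 100_000; i++) {
            values.add(i);
        }
        System.out.println(values.size()); // 100000
    }
}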
ArrayLists in Java are auto-growable, and will resize themselves if they need to in order to add additional elements. The size parameter in the constructor is just used for the initial size of the internal array, and is a sort of optimization for when you know exactly what you're going to use the array for.
Specifying this initial capacity is often a premature optimization, but if you really need an ArrayList of 10 elements, you should specify it explicitly, not assume that the default size is 10. Although this really used to be the default behavior (up to JDK 7, IIRC), you should not rely on it - JDK 8 (checked with java-1.8.0-openjdk-1.8.0.101-1.b14.fc24.x86_64 I have installed) creates empty ArrayLists by default.
The other answers have explained really well, but just to keep things relevant, in JDK 1.7.0_95:
/**
 * Constructs a new {@code ArrayList} instance with zero initial capacity.
 */
public ArrayList() {
    array = EmptyArray.OBJECT;
}

/**
 * Constructs a new instance of {@code ArrayList} with the specified
 * initial capacity.
 *
 * @param capacity
 *            the initial capacity of this {@code ArrayList}.
 */
public ArrayList(int capacity) {
    if (capacity < 0) {
        throw new IllegalArgumentException("capacity < 0: " + capacity);
    }
    array = (capacity == 0 ? EmptyArray.OBJECT : new Object[capacity]);
}
As the comment mentions, the constructor accepting no arguments initializes an ArrayList with zero initial capacity.
Even more interesting is a constant whose comment is quite informative on its own:
/**
* The minimum amount by which the capacity of an ArrayList will increase.
* This tuning parameter controls a time-space tradeoff. This value (12)
* gives empirically good results and is arguably consistent with the
* RI's specified default initial capacity of 10: instead of 10, we start
* with 0 (sans allocation) and jump to 12.
*/
private static final int MIN_CAPACITY_INCREMENT = 12;
You just picked the perfect example. Both actually do the same thing, as new ArrayList() calls this(10) ;) Internally, both define the backing array with length 10. The ArrayList#size method, on the other hand, just returns a size variable, which only changes when elements are added or removed. This variable is also the main reason for IndexOutOfBoundsExceptions: you can't access or insert at an index beyond the current size.
If you check the code of the ArrayList, for example, you'll notice that the method ArrayList#add calls ArrayList#rangeCheck. The range check only cares about the size variable, not the actual length of the array holding the List's data.
Due to this, you still won't be able to insert data at, for example, index 5. The internal length of the data array at this point will be 10, but since you didn't add anything to your List, the size variable will still be 0, and you'll get the expected IndexOutOfBoundsException when you try to do so.
Just try to call list.size() after initializing the List with any capacity, and you'll notice that the returned size is 0.
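A quick demonstration of the size-versus-capacity distinction (the class name is just for illustration):

import java.util.ArrayList;
import java.util.List;

public class SizeVsCapacity {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(10); // backing array length 10, size 0
        System.out.println(list.size());         // prints 0

        try {
            list.add(5, "x"); // fails: index 5 is beyond the current size of 0
        } catch (IndexOutOfBoundsException e) {
            // Thrown because the range check uses the size variable,
            // not the backing array's length.
            System.out.println("Caught: " + e);
        }
    }
}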
The initialization of ArrayList has been optimized since JDK 1.7 update 40, and there's a good explanation of the two different behaviours at this link:
java-optimization-empty-arraylist-and-Hashmap-cost-less-memory-jdk-17040-update.
So before Java 1.7u40 there was no difference, but from that version on there is a quite substantial one.
This difference is about performance optimization and doesn't change the contract of List.add(E e) or ArrayList(int initialCapacity).
As I recall, before Java 8, the default capacity of ArrayList was 10.
Surprisingly, the comment on the default (void) constructor still says: Constructs an empty list with an initial capacity of ten.
From ArrayList.java:
/**
* Shared empty array instance used for default sized empty instances. We
* distinguish this from EMPTY_ELEMENTDATA to know how much to inflate when
* first element is added.
*/
private static final Object[] DEFAULTCAPACITY_EMPTY_ELEMENTDATA = {};
...
/**
* Constructs an empty list with an initial capacity of ten.
*/
public ArrayList() {
    this.elementData = DEFAULTCAPACITY_EMPTY_ELEMENTDATA;
}
Technically, it's 10, not zero, if you allow for lazy initialisation of the backing array. See:
public boolean add(E e) {
    ensureCapacityInternal(size + 1);
    elementData[size++] = e;
    return true;
}

private void ensureCapacityInternal(int minCapacity) {
    if (elementData == DEFAULTCAPACITY_EMPTY_ELEMENTDATA) {
        minCapacity = Math.max(DEFAULT_CAPACITY, minCapacity);
    }
    ensureExplicitCapacity(minCapacity);
}
where
/**
* Default initial capacity.
*/
private static final int DEFAULT_CAPACITY = 10;
What you're referring to is just the zero-sized initial array object that is shared among all initially empty ArrayList objects. I.e. the capacity of 10 is guaranteed lazily, an optimisation that is present also in Java 7.
Admittedly, the constructor contract is not entirely accurate. Perhaps this is the source of confusion here.
Background
Here's an e-mail by Mike Duigou:
I have posted an updated version of the empty ArrayList and HashMap patch.
http://cr.openjdk.java.net/~mduigou/JDK-7143928/1/webrev/
This revised implementation introduces no new fields to either class. For ArrayList the lazy allocation of the backing array occurs only if the list is created at default size. According to our performance analysis team, approximately 85% of ArrayList instances are created at default size so this optimization will be valid for an overwhelming majority of cases.
For HashMap, creative use is made of the threshold field to track the requested initial size until the bucket array is needed. On the read side the empty map case is tested with isEmpty(). On the write side a comparison of (table == EMPTY_TABLE) is used to detect the need to inflate the bucket array. In readObject there's a little more work to try to choose an efficient initial capacity.
From: http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-April/015585.html
In Java 8, the default capacity of ArrayList is 0 until we add at least one object to the ArrayList (you can call it lazy initialization).
Now the question is: why was this change made in Java 8?
The answer is to save memory. Millions of ArrayList objects are created in real-world Java applications. A default size of 10 means allocating space for 10 pointers (40 or 80 bytes) for the underlying array at creation and filling it with nulls.
Empty arrays (filled with nulls) occupy a lot of memory.
Lazy initialization postpones this memory consumption until the moment you actually use the array list.
See the code below for illustration.
ArrayList al = new ArrayList(); //Size: 0, Capacity: 0
ArrayList al = new ArrayList(5); //Size: 0, Capacity: 5
ArrayList al = new ArrayList(new ArrayList(5)); //Size: 0, Capacity: 0
al.add( "shailesh" ); //Size: 1, Capacity: 10
public static void main(String[] args) throws Exception {
    ArrayList al = new ArrayList();
    getCapacity(al);
    al.add("shailesh");
    getCapacity(al);
}

static void getCapacity(ArrayList<?> l) throws Exception {
    Field dataField = ArrayList.class.getDeclaredField("elementData");
    dataField.setAccessible(true);
    System.out.format("Size: %2d, Capacity: %2d%n", l.size(), ((Object[]) dataField.get(l)).length);
}
Output:
Size: 0, Capacity: 0
Size: 1, Capacity: 10
Article Default capacity of ArrayList in Java 8 explains it in details.
If the very first operation done with an ArrayList is to pass addAll a collection with more than ten elements, then any effort put into creating an initial ten-element array to hold the ArrayList's contents would be thrown out the window. Whenever something is added to an ArrayList, it's necessary to test whether the size of the resulting list will exceed the size of the backing store. Allowing the initial backing store to have size zero rather than ten causes this test to fail one extra time in the lifetime of a list whose first operation is an add that requires creating the initial ten-item array, but that cost is less than the cost of creating a ten-item array that never ends up getting used.
That having been said, it might have been possible to improve performance further in some contexts if there were an overload of addAll which specified how many items (if any) would likely be added to the list after the present batch, and which could use that to influence its allocation behavior. In some cases, code which adds the last few items to a list will have a pretty good idea that the list will never need any space beyond that. There are many situations where a list is populated once and never modified afterward. If code knows that the ultimate size of a list will be 170 elements, and it has 150 elements and a backing store of size 160, then growing the backing store to size 320 is unhelpful, and leaving it at 320 or trimming it to 170 is less efficient than simply having the next allocation grow it straight to 170.
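Note that ArrayList does already expose a way to hand it that kind of hint when your own code knows the final size: the public ensureCapacity method. A sketch using the numbers from the paragraph above:

import java.util.ArrayList;

public class GrowHint {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            list.add(i); // list now holds 150 elements
        }

        // We happen to know the final size will be 170, so grow straight
        // to it instead of letting the next overflow jump to ~1.5x.
        list.ensureCapacity(170);
        for (int i = 150; i < 170; i++) {
            list.add(i);
        }
    }
}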
The question is 'why?'.
Memory profiling inspections (for example https://www.yourkit.com/docs/java/help/inspections_mem.jsp#sparse_arrays) show that empty arrays (filled with nulls) occupy tons of memory.
A default size of 10 means allocating 10 pointers (40 or 80 bytes) for the underlying array at creation and filling them with nulls, and real Java applications create millions of array lists.
The introduced modification postpones this memory consumption until the moment you actually use the array list.
After the above question, I went through the ArrayList documentation of Java 8. I found the default size is still 10.
The ArrayList default size in Java 8 is still 10. The only change made in Java 8 is that if a coder adds fewer than 10 elements, the remaining empty ArrayList slots are not shown as null. I say so because I have been through this situation myself, and Eclipse made me look into this change in Java 8.
You can see this change in the screenshot below. In it you can see that the ArrayList size is specified as 10 in Object[10], but only 7 elements are displayed; the null-value elements are not shown. In Java 7 the screenshot is the same with a single difference: the null-value elements are also displayed, so the coder needs code for handling null values when iterating the complete array list, while in Java 8 this burden is removed from the coder/developer.
Screen shot link.
Suppose I know the exact number of key-value pairs that will go into the HashMap, and I know that it is not a power of 2. In these cases, should I specify the initial capacity or not? I could get the nearest power of 2 and specify that, but still I'd like to know which would be the better thing to do in such cases (when I don't want to calculate the nearest power of 2).
Thanks!
If you look at the java.util.HashMap source code (Java 1.7) (you can find it in the src.zip file in the JDK's directory), you will see that HashMap's put method uses the inflateTable method to create the array that stores the HashMap's entries, and that method always increases the capacity of the HashMap to a power of two that is greater than (or equal to) the size you specified.
Here is the method:
private void inflateTable(int toSize) {
    // Find a power of 2 >= toSize
    int capacity = roundUpToPowerOf2(toSize);
    threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    table = new Entry[capacity];
    initHashSeedAsNeeded(capacity);
}
Hence it does not matter if the size you specified is a power of two.
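If you're curious what roundUpToPowerOf2 amounts to, here is one equivalent way to compute it (a sketch, not the JDK's exact code, which additionally clamps the result to MAXIMUM_CAPACITY):

public class PowerOfTwo {
    // Smallest power of two >= n, valid for 1 <= n <= 2^30.
    static int roundUpToPowerOf2(int n) {
        return (n <= 1) ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOf2(10)); // 16
        System.out.println(roundUpToPowerOf2(16)); // 16
        System.out.println(roundUpToPowerOf2(17)); // 32
    }
}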
You should consider the initial capacity a hint to the HashMap of approximately what data to expect. By providing a reasonable initial capacity, you minimize the number of times the map has to be rebuilt in order to scale up. Note, though, that the map resizes once its size exceeds capacity times the load factor (0.75 by default), so constructing a map with an initial capacity of 1,000,000 does not quite guarantee a million inserts without resizing; past that threshold, a map.put() call may trigger a large O(n) resize operation.
Thinking of this initial capacity as a hint, rather than an instruction you expect HashMap to follow, may help you see that the optimization you're describing is unnecessary. HashMap is designed to behave well in all normal circumstances, and so while providing an initial capacity can help marginally, it's generally not going to have huge impacts on your code unless you are building many new large maps all the time. In such a case specifying the capacity would avoid the intermediate table resizing, but that's all.
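If you do want to guarantee that n expected entries never trigger a resize, account for the load factor when choosing the initial capacity. The helper below is illustrative (Guava ships an equivalent as Maps.newHashMapWithExpectedSize):

import java.util.HashMap;
import java.util.Map;

public class PreSizedMap {
    // Choose a capacity large enough that 'expected' puts never cross
    // the resize threshold (capacity * 0.75 with the default load factor).
    static <K, V> Map<K, V> withExpectedSize(int expected) {
        int capacity = (int) (expected / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = withExpectedSize(1_000_000);
        for (int i = 0; i < 1_000_000; i++) {
            map.put("key" + i, i); // no intermediate rehashing
        }
    }
}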
As documented, you could introduce some unnecessary slowdowns if you specified too large of an initial capacity:
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance
However in practice the wasted memory of allocating such large maps would likely cause problems for you sooner than the slightly slower iteration speed.
Be sure you read Why does HashMap require that the initial capacity be a power of two? as well.
One thing you might consider is switching to Guava's ImmutableMap implementation; if you know in advance the contents of your map, and you don't expect to change them, immutable collections will prove easier to work with and use less memory than their mutable counterparts.
Here are some quick inspections I did using Scala's REPL (and some personal utility functions) to inspect what's going on inside HashMap (Java 1.7):
// Initialize with capacity=7
scala> new HashMap[String,String](7)
res0: java.util.HashMap[String,String] = {}
scala> getPrivate(res0, "table").length
res1: Int = 8
scala> ... put 7 values
// Still internally using the same array
scala> getPrivate(res0, "table").length
res9: Int = 8
// Specifying capacity 9 allocates a 16-length array
scala> getPrivate(new HashMap[String,String](9), "table").length
res10: Int = 16
// Copying our first map into a new map interestingly
// also allocates the default 16 slots, rather than 8
scala> getPrivate(new HashMap[String,String](res0), "table").length
res11: Int = 16
scala> ... put 10 more values in our map
scala> getPrivate(res0,"table").length
res22: Int = 32
// Copying again immediately jumps to 32 capacity
scala> getPrivate(new HashMap[String,String](res0),"table").length
res23: Int = 32
Say I instantiate 100,000 Vectors:
a[0..100k] = new Vector<Integer>();
If I do this:
a[0..100k] = new Vector<Integer>(1);
Will they take less memory? That is, ignoring what they contain and the overhead of expanding them when more than one element is needed.
According to the Javadoc, the default capacity for a vector is 10, so I would expect it to take more memory than a vector of capacity 1.
In practice, you should probably use an ArrayList unless you need to work with another API that requires vectors.
When you create a Vector, you either specify the capacity you want it to have at the start or leave some default value. But it should be noted that in any case everything stored in a Vector is just a bunch of references, which take up very little space compared to the objects they actually point at.
So yes, you will save space initially, but only by an amount equal to the difference between the default capacity and the specified capacity, multiplied by the size of a reference. If you create a really large number of vectors, like in your case, that initial capacity does matter.
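As rough arithmetic (assuming 4-byte compressed references, which is a JVM-dependent assumption): 100,000 vectors × (10 − 1) saved reference slots × 4 bytes is roughly 3.6 MB saved in backing arrays alone, before counting per-array object headers.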
Well, sort of, yes. IIRC Vector internally allocates 16 elements by default, which means that due to byte alignment and other things done by the underlying VM, you'd save a considerable amount of memory initially.
What are you trying to accomplish, though?
Yes, they will. Putting in reasonable "initial sizes" for collections is one of the first things I do when confronted with a need to radically improve memory consumption of my program.
Yes, it will. By default, Vector allocates space for 10 elements.
Vector()
Constructs an empty vector so that its internal data array has size 10 and its standard capacity increment is zero.
Therefore, it reserves memory for 10 memory references.
That being said, in real-life situations this is rarely a concern. If you are truly generating 100,000 Vectors, you need to rethink your design.