Array's lookup time complexity vs. how it is stored - java

It's well known that the time complexity of array access by index is O(1).
The documentation of Java's ArrayList, which is backed by an array, says the same about its get operation:
The size, isEmpty, get, set, iterator, and listIterator operations run in constant time.
The lookup is done by getting the memory address of the element at a given index independently of the array's size (something like start_address + element_size * index). My understanding is that the array's elements have to be stored next to each other in the memory for this lookup mechanism to be possible.
However, from this question, I understand that arrays in Java are not guaranteed to have their elements stored contiguously in the memory. If that is the case, how could it always be O(1)?
Edit: I'm quite aware of how an ArrayList works. My point is, if contiguous storage for arrays is not guaranteed by the JVM specification, its elements could be at different areas in the memory. Although that situation is highly unlikely, it would render the lookup mechanism mentioned above impossible, and the JVM would have another way to do the lookup, which shouldn't be O(1) anymore. At that point, it would be against both the common knowledge stated at the top of this question and the documentation of ArrayList regarding its get operation.
Thanks everybody for your answers.
Edit: In the end, I think it's a JVM-specific thing, but most, if not all, JVMs stick to contiguous storage of an array's elements even when there's no guarantee, so that the lookup mechanism above can be used. It's simple, efficient and cost-effective.
As far as I can see, it would be silly to store the elements all over the place and then have to take a different approach to doing the lookup.

As far as I'm aware, the spec gives no guarantee that arrays will be stored contiguously. I'd speculate that most JVM implementations will though. In the basic case it's simple enough to enforce: if you can't expand the array because other memory is occupying the space you need, move the whole thing somewhere else.
Your confusion stems from a misunderstanding of the meaning of O(1). O(1) does not mean that a function is performed in a single operation (e.g. start_address + element_size * index). It means that the operation is performed in a constant amount of time irrespective of the size of the input data - in this case, the array. This is perfectly achievable with data that is not stored contiguously. You could have a lookup table mapping indexes to memory locations, for example.
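To make that concrete, here is a minimal sketch (illustrative only; no JVM is claimed to work this way, and the class and method names are made up) contrasting the two constant-time lookup strategies: direct address arithmetic over one contiguous block, versus an extra indirection through a per-index location table:

import java.nio.ByteBuffer;

public class LookupStrategies {
    static final int ELEMENT_SIZE = 4; // bytes per 32-bit int

    // Contiguous storage: pure address arithmetic, one memory access.
    static int getContiguous(ByteBuffer block, int baseOffset, int index) {
        return block.getInt(baseOffset + ELEMENT_SIZE * index); // start + size * index
    }

    // Scattered storage: a lookup table mapping index -> location.
    static int getViaTable(ByteBuffer block, int[] locations, int index) {
        return block.getInt(locations[index]); // still O(1): one extra dereference
    }
}

Both run in time independent of the array's size; the second simply pays one extra dereference per access.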

From the linked question you can see that even though it's not mandated by the JVM spec, it's highly likely that 1D arrays are contiguous in memory.
Given a contiguous array, the time complexities of ArrayList are as documented. However, it's not impossible that in a special case or a special JVM the complexities might be slightly different. It's impossible to state the time complexities if you have to consider every kind of VM the spec allows.

Every time an element is added, the capacity is checked:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b27/java/util/ArrayList.java#ArrayList.add%28java.lang.Object%29
public boolean add(E e) {
    ensureCapacity(size + 1); // Increments modCount!!
    elementData[size++] = e;
    return true;
}
Here, ensureCapacity() does the trick to keep the array sequential. How?
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b27/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
public void ensureCapacity(int minCapacity) {
    modCount++;
    int oldCapacity = elementData.length;
    if (minCapacity > oldCapacity) {
        Object oldData[] = elementData;
        int newCapacity = (oldCapacity * 3)/2 + 1;
        if (newCapacity < minCapacity)
            newCapacity = minCapacity;
        // minCapacity is usually close to size, so this is a win:
        elementData = Arrays.copyOf(elementData, newCapacity);
    }
}
Thus, at every stage it ensures that the backing array has enough capacity and stays contiguous, i.e. any index within range can be retrieved in O(1).
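For completeness, get in that same OpenJDK 6 source is just a bounds check followed by a plain array index, which is where the O(1) comes from (RangeCheck is ArrayList's private bounds-checking helper):

public E get(int index) {
    RangeCheck(index);
    return (E) elementData[index];
}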

An ArrayList wraps a real array. (On add it might need to grow.) So it has the same complexity for get and set: O(1).
However, an ArrayList can (until some future version of Java) only contain Objects. For primitive types like int or char, wrapper classes are needed; these are inefficient, and the boxed objects end up scattered over the allocated memory. Lookups are still O(1), but with a large constant factor.
So for primitive types you might use arrays and do the growing yourself, as in the sketch below:
elementData = Arrays.copyOf(elementData, newCapacity);
Or, if it fits your use case, use a BitSet, where the indices set to true are the values.
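As a minimal sketch of that do-it-yourself growing (GrowableIntArray is a made-up name, and the growth factor just mirrors the ensureCapacity code above):

import java.util.Arrays;

class GrowableIntArray {
    private int[] elementData = new int[10];
    private int size = 0;

    void add(int value) {
        if (size == elementData.length) {
            // grow by roughly 1.5x, mirroring ArrayList's strategy
            elementData = Arrays.copyOf(elementData, (elementData.length * 3) / 2 + 1);
        }
        elementData[size++] = value;
    }

    int get(int index) {
        return elementData[index]; // plain O(1) array access, no boxing
    }
}

And the BitSet variant, if your values are small non-negative ints:

BitSet seen = new BitSet();
seen.set(42);                  // record the value 42
boolean hasIt = seen.get(42);  // membership test, no boxing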

Related

Comparison of these two algorithms?

So I'm presented with a problem that states: "Determine if a string contains all unique characters."
So I wrote up this solution that adds each character to a set; if the character already exists in the set, it returns false.
private static boolean allUniqueCharacters(String s) {
    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        char currentChar = s.charAt(i);
        if (!charSet.contains(currentChar)) {
            charSet.add(currentChar);
        } else {
            return false;
        }
    }
    return true;
}
According to the book I am reading, this is the "optimal solution":
public static boolean isUniqueChars2(String str) {
    if (str.length() > 128)
        return false;
    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}
My question is: is my implementation slower than the one presented? I assume it is, but if a hash lookup is O(1), wouldn't they be the same complexity?
Thank you.
As Amadan said in the comments, the two solutions have the same time complexity O(n) because you have a for loop looping through the string, and you do constant time operations in the for loop. This means that the time it takes to run your methods increases linearly with the length of the string.
Note that time complexity is all about how the time it takes changes when you change the size of the input. It's not about how fast it is with data of the same size.
For the same string, the "optimal" solution should be faster because sets have some overhead over arrays; handling arrays is faster than handling sets. However, to actually make the "optimal" solution work for any string, you would need an array of length 2^16 - that is how many different char values there are - and you would also need to remove the check for strings longer than 128.
This is one of the many examples of the tradeoff between space and time. If you want it to go faster, you need more space. If you want to save space, you have to go slower.
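For illustration, a sketch of that full-range variant might look like this (isUniqueCharsFullRange is a made-up name; 65536 = 2^16 covers every possible char value):

public static boolean isUniqueCharsFullRange(String str) {
    if (str.length() > 65536)
        return false;                     // pigeonhole: more chars than distinct values
    boolean[] seen = new boolean[65536];  // one flag per possible char value
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (seen[val]) {
            return false;
        }
        seen[val] = true;
    }
    return true;
}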
Both algorithms have time complexity of O(N). The difference is in their space complexity.
The book's solution will always require storage for 128 characters - O(1), while your solution's space requirement will vary linearly according to the input - O(N).
The book's space requirement is based on an assumed character set with 128 characters. But this may be rather problematic (and not scalable) given the likelihood of needing different character sets.
The hash set is in theory acceptable, but it is a waste.
A hash set is built over an array (so it is certainly more costly than a plain array), and collision resolution requires extra space (at least double the number of elements). In addition, any access requires computing the hash and possibly resolving collisions.
This adds a lot of overhead in terms of space and time compared to a straight array.
Also note that it is kind of folklore that a hash table has O(1) behavior. The worst case is much poorer: accesses can take up to O(N) time for a table of size N.
As a final remark, the time complexity of this algorithm is O(1), because you conclude false at worst when N > 128.
Your algorithm is also O(1). You can think of complexity as how the algorithm reacts to a change in the number of elements processed. Therefore O(n) and O(2n) are effectively equal.
People here are talking about O notation as a growth rate.
Your solution could indeed be slower than the book's solution. Firstly, a hash lookup ideally has constant-time lookup, but the retrieval of the object will not be constant if there are multiple hash collisions. Secondly, even if it is a constant-time lookup, there is usually significant overhead involved in executing the hash code function, compared to looking up an element in an array by index. That's why you may want to go with the array lookup. However, if you start dealing with non-ASCII Unicode characters, then you might not want the array approach due to the significant space overhead.
The bottleneck of your implementation is that a set has a lookup (and insert) complexity* of O(log k), while the array has a lookup complexity of O(1).
This sounds like your algorithm must be much worse. But in fact it is not, as k is bounded by 128 (otherwise the reference implementation would be wrong and produce an out-of-bounds error) and can be treated as a constant. This makes the set lookup O(1) as well, with somewhat bigger constants than the array lookup.
* assuming a sane implementation such as a tree or a hash map. A hash map's time complexity is in general not constant, as filling it up needs log(n) resize operations to avoid an increase in collisions that would lead to linear lookup time; see e.g. here and here for answers on Stack Overflow.
This article even explains that Java 8 by itself converts a hash map bucket to a binary tree (O(n log n) for the conversion, O(log n) for the lookup) before its lookup time degenerates to O(n) because of too many collisions.

Is HashTable/HashMap an array?

I am confused about hashing:
When we use a Hashtable/HashMap (key, value), I first understood that the internal data structure is an array (already allocated in memory).
Java's hashCode() method has an int return type, so I thought this hash value would be used directly as an index into the array; in that case we would need 2^32 entries in the array in RAM, which is not what actually happens.
So does Java create an index from hashCode() with a smaller range?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is backed by an array. The hashCode() is re-hashed (but is still an int), and the index into the array becomes h & (length - 1); so if the length of the array is 2^n, the index is the low n bits of the re-hashed value.
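Concretely, the two helpers involved in that OpenJDK 6 source look like this (quoted from memory, so treat the exact bit-mixing constants as approximate):

static int hash(int h) {
    // Supplemental hash: spreads higher-order bits downward so they
    // influence the mask below.
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    // length is always a power of two, so this mask keeps the low n bits,
    // equivalent to h % length but cheaper.
    return h & (length - 1);
}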
The structure for a Java HashMap is not just an array. It is an array, but not one of 2^31 entries (int is a signed type!); it has some smaller number of buckets, 16 initially by default. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
Each element of the array does not hold only one entry. Each element of the array holds a structure of entries (a linked list, converted to a red-black tree when a bucket grows large, since Java 8). Each entry of the structure has a hash code that maps internally to the same bucket position in the array.
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes collisions more likely, since many objects can have hash codes that reduce to the same index (it is possible anyway for multiple objects to have exactly the same hash code, but it is now much more likely). There are different strategies for dealing with this, generally either a linked-list-like structure per slot or a mechanism for picking another slot if the first matching slot is occupied by a non-equal key.
Since this adds cost, the more often such collisions happen, the slower things become, and in the worst case lookup would in fact be O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store will generally improve this, though, especially if the new size is not a multiple of the previous size (so the operation that reduces the hash code to an index doesn't map a bunch of items that collided on one index all onto the same index again). Some implementations increase the internal size before it is absolutely necessary, while there is still some empty space remaining (at a certain fill percentage, or after a certain number of collisions between objects that don't share the same full hash code, etc.).
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), the order of operation stays at O(1).
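As a hypothetical illustration of "very bad" hash codes (BadKey and BadKeyDemo are made-up names): a key type whose hashCode sends every instance to the same bucket forces every lookup to walk one long chain, so performance degrades toward O(n) (or O(log n) with Java 8's tree buckets):

import java.util.HashMap;
import java.util.Map;

class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }

    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    @Override public int hashCode() { return 42; } // every key collides
}

class BadKeyDemo {
    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) map.put(new BadKey(i), "v" + i);
        // Each get() now walks one huge bucket instead of hashing
        // straight to its own slot.
        System.out.println(map.get(new BadKey(99_999)));
    }
}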

Why do dynamic arrays in C++ and Java have different initial capacities?

So I've been searching for how dynamic arrays actually work in general. What I found is two different concepts.
In C++
In C++, a dynamic array is generally implemented as a vector. Vectors set the capacity to 0, increase the count to insert a new element, and then double the capacity for new insertions.
vector.h
/*
 * Implementation notes: Vector constructor and destructor
 * -------------------------------------------------------
 * The constructor allocates storage for the dynamic array and initializes
 * the other fields of the object. The destructor frees the memory used
 * for the array.
 */
template <typename ValueType>
Vector<ValueType>::Vector() {
    count = capacity = 0;
    elements = NULL;
}
For expanding the vector size
/*
 * Implementation notes: expandCapacity
 * ------------------------------------
 * This function doubles the array capacity, copies the old elements into
 * the new array, and then frees the old one.
 */
template <typename ValueType>
void Vector<ValueType>::expandCapacity() {
    capacity = max(1, capacity * 2);
    ValueType *array = new ValueType[capacity];
    for (int i = 0; i < count; i++) {
        array[i] = elements[i];
    }
    if (elements != NULL) delete[] elements;
    elements = array;
}
In Java
In Java, a dynamic array is implemented by ArrayList. It sets the initial capacity to 10 (JVM-dependent) and, once the capacity is full, increases it by some factor.
The reason for setting the capacity to 10 is so that you don't have to reallocate memory for every new insertion; only when the capacity is exhausted is a larger block allocated.
Curiosity
Why does the implementation in vector.h set the default capacity to 0? Setting the capacity to some small value (let's say 10) instead of 0 could save the overhead of allocating memory the first time a user inserts an element.
As it is a dynamic array, setting a small initial capacity for the vector would do no harm, because the size of a dynamic array generally grows beyond 10 anyway.
Edit: My question is, why 0 by default? It could be any small value, because the vector is going to expand to some specific size anyway; that's the purpose of using a vector in the first place.
Having a capacity of zero by default has the advantage that default-constructing a std::vector doesn't do any dynamic memory allocation at all (you don't pay for what you don't need).
If you know you need ~10 elements, you can set the capacity explicitly via calling std::vector::reserve:
std::vector<int> vec;
vec.reserve(10);
I can only speculate why Java does things differently, but AFAIK, dynamic memory allocation is cheaper in Java than in C++, and the two languages also follow different philosophies when it comes to performance/low-level control vs. simplicity.
Why default 0?
It's not 0 by default, actually. That is, the C++ language standard does not define (AFAIK) what the initial capacity of a default-constructed vector should be.
In practice, most/all implementations default to 0 capacity. The reason, I would say, lies in one of the design principles of the C++ language:
What you don't use, you don't pay for.
(see: B. Stroustrup: The Design and Evolution of C++. Addison Wesley, ISBN 0-201-54330-3. March 1994)
And this is not just a trivial tautology - it's a guiding design consideration.
So, in C++, we would rather pay nothing for a vector which has no elements inserted into it than save on potential size increases by making an initial "sacrifice".
As #yurikilocheck points out, however, the std::vector class does provide you with a reserve() method, so you can set an initial capacity yourself - and since the default constructor doesn't allocate anything, you will not have paid 'extra' with two allocations, just the one. Again, you pay (essentially) the minimum possible.
Edit: On the other hand, std::vector partially breaks this principle by allocating more space than it actually needs when it hits its capacity. But at least it's not a superfluous allocation call.
I have used an implementation which reserved a default of 32 elements per vector. It was the native Sun C++ STL implementation, and it was a disaster. I had a perfectly reasonable program which by its nature had to have somewhere around hundreds of thousands of those vectors, and it kept running out of memory. It should have run fine, since only a small percentage of those hundreds of thousands of vectors were non-empty.
So from personal experience, 0 is the best default vector capacity.

Java Set.contains(o) vs. List.get(index) Time Complexity

I'm creating a stock application where I save the history of indices when a certain stock was bought. Currently I'm using a HashSet<Integer> to save these values (range 0-270).
In the program, there are a lot of lookups to this history that use Set.contains(o), which is O(1).
I'm considering changing this history to an ArrayList<Boolean>, where a true at index 0 means there was a buy at index 0, false at index 1 means there was no buy at index 1, etc...
This way, I can do a List.get(index), which is also O(1), but I'm guessing it will be slightly faster because of the fundamental nature of a HashSet lookup.
But because of the small range of the indices, I'm not sure if my assumptions hold true.
So if I am not concerned about space complexity, which method would be faster?
Since your range is small, the fastest is to use an array directly:
boolean[] values = new boolean[271];
// get the value (equivalent to your hashset.contains(index)):
boolean contained = values[index];
It does not involve any hashCode / equals operations that a HashSet requires. This is roughly equivalent to using an ArrayList<Boolean>, minus the (very small) call stack.
Array lookup is definitely O(1) and a very fast operation.
You can also consider using a BitSet as suggested by yshavit.
As well as the boolean[] mentioned above, you might also consider a BitSet. It's designed pretty much exactly for these purposes.
BitSet bs = new BitSet(271);
bs.set(someIndex);
boolean isSet = bs.get(anotherIndex);
This is more compact than a boolean[], taking 34 bytes instead of 270 (not counting headers, which are roughly comparable). It also handles bounds more flexibly -- if you try to set a bit at an index above 270, it'll work instead of throwing an exception. Whether that's a good or bad thing is up to you.
It is obvious that array[index] is faster than set.contains(...) or list.get(index); and in any case, modern JITs will optimize such calls so that you won't be able to see the difference unless your app has very strict performance requirements.

ArrayList vs LinkedList from memory allocation perspective

I need to store a large amount of information, say for example 'names', in a Java List. The number of items can change (in short, I cannot predefine the size). I am of the opinion that, from a memory allocation perspective, LinkedList would be a better option than ArrayList, as for an ArrayList, once the max size is reached, the memory allocation automatically doubles, and hence there would always be a chance of more memory being allocated than what is needed.
I understand from other posts here that individual elements stored in a LinkedList take more space than in an ArrayList, as a LinkedList also needs to store the node information, but I am still guessing that for the scenario I have defined, LinkedList might be a better option. Also, I do not want to get into the performance aspects (fetching, deleting, etc.), as much has already been discussed on them.
LinkedList might allocate fewer entries, but those entries are astronomically more expensive than they'd be for ArrayList -- enough that even the worst-case ArrayList is cheaper as far as memory is concerned.
(FYI, I think you've got it wrong; ArrayList grows by 1.5x when it's full, not 2x.)
See e.g. https://github.com/DimitrisAndreou/memory-measurer/blob/master/ElementCostInDataStructures.txt : LinkedList consumes 24 bytes per element, while ArrayList consumes in the best case 4 bytes per element, and in the worst case 6 bytes per element. (Results may vary depending on 32-bit versus 64-bit JVMs, and compressed object pointer options, but in those comparisons LinkedList costs at least 36 bytes/element, and ArrayList is at best 8 and at worst 12.)
UPDATE:
I understand from other posts here that individual elements stored in a LinkedList take more space than in an ArrayList, as a LinkedList also needs to store the node information, but I am still guessing that for the scenario I have defined, LinkedList might be a better option. Also, I do not want to get into the performance aspects (fetching, deleting, etc.), as much has already been discussed on them.
To be clear, even in the worst case, ArrayList is 4x smaller than a LinkedList with the same elements. The only possible way to make LinkedList win is to deliberately fix the comparison by calling ensureCapacity with a deliberately inflated value, or to remove lots of values from the ArrayList after they've been added.
In short, it's basically impossible to make LinkedList win the memory comparison, and if you care about space, then calling trimToSize() on the ArrayList will instantly make ArrayList win again by a huge margin. Seriously. ArrayList wins.
... but I am still guessing that for the scenario I have defined, LinkedList might be a better option
Your guess is incorrect.
Once you have got past the initial capacity of the array list, the size of the backing array will be between 1 and 2 references times the number of entries. This is due to the strategy used to grow the backing array.
For a linked list, the nodes occupy AT LEAST 3 references times the number of entries, because each node has a next and a prev reference as well as the entry reference. (And in fact, it is more than 3 times, because of the space used by the nodes' object headers and padding. Depending on the JVM and pointer size, it can be as much as 6 times.)
The only situation where a linked list will use less space than an array list is if you badly over-estimate the array list's initial capacity. (And for very small lists ...)
When you think about it, the only real advantage linked lists have over array lists is when you are inserting and removing elements. Even then, it depends on how you do it.
ArrayList uses one reference per object (or two when it's double the size it needs to be). This is typically 4 bytes.
LinkedList uses only the nodes it needs, but these can be 24 bytes each.
So even at its worst, ArrayList will be 3x smaller than LinkedList.
For fetching, ArrayList supports random access in O(1), but LinkedList is O(n). For deleting from the end, both are O(1); for deleting from somewhere in the middle, ArrayList is O(n).
Unless you have millions of entries, the size of the collection is unlikely to matter. What will matter first is the size of entries which is the same regardless of the collection used.
Back of the envelope worst-case:
500,000 names in an array sized to 1,000,000 = 500,000 used, 500,000 empty pointers in the unused portion of the allocated array.
500,000 entries in a linked list = 3 pointers per entry (the Node object holds current, prev, and next) = 1,500,000 pointers in memory. (Then you have to add the size of the Node objects themselves.)
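Putting rough numbers on that (assuming 4-byte compressed references and ignoring object headers): the half-empty array costs 1,000,000 × 4 bytes ≈ 4 MB of references, while the linked list costs 1,500,000 × 4 bytes ≈ 6 MB in references alone, before adding each Node object's header and padding.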
ArrayList.trimToSize() may satisfy you.
Trims the capacity of this ArrayList instance to be the list's current size. An application can use this operation to minimize the storage of an ArrayList instance.
By the way, in Java 6's ArrayList it's not double capacity; the capacity grows by about 1.5x when the maximum size is reached.
