I've been programming for quite a bit and recently started learning more pure Computer Science topics (for a job interview).
I know the difference between an Array and a LinkedList data structure, but now that I have started using Java I'm seeing this ArrayList, which I'm having trouble conceptualizing.
Web searches have only really shown me HOW to use them and WHEN to use them (benefits of each), but nothing can answer my question of:
What is an ArrayList? My assumption is that it is a list that maintains memory references to each element, making it also able to act like an array.
I also have a feeling since Java is open, that I should be able to look at the Class definition, but haven't figured out how to do that yet either.
Thanks!
I like to think of it as a data structure that lets you enjoy both worlds: the quick access by index of an array and the dynamic growth of a list. Of course, there are always trade-offs.
ArrayList is actually a wrapper around an array. Every time the array fills up, a new, larger array is created (twice the size in the simplest scheme; OpenJDK grows it by about 50%) and all the data from the original array is copied to the new one.
From the java doc:
Resizable-array implementation of the List interface. Implements all
optional list operations, and permits all elements, including null. In
addition to implementing the List interface, this class provides
methods to manipulate the size of the array that is used internally to
store the list. (This class is roughly equivalent to Vector, except
that it is unsynchronized.) The size, isEmpty, get, set, iterator, and
listIterator operations run in constant time. The add operation runs
in amortized constant time, that is, adding n elements requires O(n)
time. All of the other operations run in linear time (roughly
speaking). The constant factor is low compared to that for the
LinkedList implementation.
Each ArrayList instance has a capacity. The capacity is the size of
the array used to store the elements in the list. It is always at
least as large as the list size. As elements are added to an
ArrayList, its capacity grows automatically. The details of the growth
policy are not specified beyond the fact that adding an element has
constant amortized time cost.
An application can increase the capacity of an ArrayList instance
before adding a large number of elements using the ensureCapacity
operation. This may reduce the amount of incremental reallocation.
This allows O(1) time for most operations, just as with a plain array. Once in a while, though, you pay for this with an insert operation that takes much longer.
This is called amortized complexity. Each operation takes only O(1), aside from those times when the backing array has to grow. In those cases you pay O(n), but if you average it over n operations, the average time taken per operation is only O(1) and not O(n).
Let's take an example:
We have an ArrayList backed by an array of size 100 (n=100). You make 100 add operations and each of them takes only O(1); all get-by-index operations also take O(1) (as this is an array). On the 101st insertion there is no more capacity in the array, so the ArrayList creates a new, larger array (say of size 200 if the capacity is doubled; OpenJDK actually grows by about 50%, to 150), copies all the values to it (O(n) operations) and then inserts the 101st item. Until the new array fills up, all of the operations again take O(1).
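The example above can be sketched in code. This is a minimal teaching sketch, not the real java.util.ArrayList: the class name GrowableArraySketch, the copies counter and the doubling policy are all assumptions made to match the numbers in the example.

```java
import java.util.Arrays;

// Hypothetical teaching class; java.util.ArrayList is more elaborate.
class GrowableArraySketch {
    private Object[] elements;
    private int size;
    private int copies; // element copies caused by growing the backing array

    GrowableArraySketch(int initialCapacity) {
        elements = new Object[Math.max(1, initialCapacity)];
    }

    void add(Object value) {
        if (size == elements.length) {
            // Backing array is full: allocate a bigger one and copy everything.
            // Doubling is an assumption of this sketch.
            copies += size;
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value; // the common O(1) path
    }

    Object get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return elements[index]; // plain array indexing, O(1)
    }

    int size() { return size; }
    int capacity() { return elements.length; }
    int copies() { return copies; }

    public static void main(String[] args) {
        GrowableArraySketch list = new GrowableArraySketch(100);
        for (int i = 0; i < 101; i++) {
            list.add(i);
        }
        System.out.println(list.capacity()); // 200
        System.out.println(list.copies());   // 100: only the 101st add had to copy
    }
}
```

Only one of the 101 add calls paid the O(n) copying cost; the other 100 were a single array write each, which is what "amortized O(1)" describes.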
An ArrayList is a list that is directly backed by an array. More specifically, it's backed by an array that is dynamically resized. You can read a bit more about it in its source code; there are some pretty good comments to it.
The reason that this is significant is due to how a LinkedList is implemented - as a traditional collection of nodes and references to other nodes. This has performance impacts in indexing and traversal, whereas with an ArrayList, since it's backed by an array, all one needs to do is index into the specific array to retrieve the value.
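To make the indexing difference concrete, here is a minimal sketch. LinkedIndexSketch and its hops counter are hypothetical names invented for illustration; the real java.util.LinkedList is doubly linked and more involved.

```java
// Hypothetical teaching class, not java.util.LinkedList.
class LinkedIndexSketch {
    private static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private Node head;
    private Node tail;
    private int size;
    private long hops; // links followed by all get() calls so far

    void add(int value) {
        Node node = new Node(value);
        if (head == null) {
            head = node;
        } else {
            tail.next = node;
        }
        tail = node;
        size++;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        Node current = head;
        for (int i = 0; i < index; i++) { // walk node by node: O(index)
            current = current.next;
            hops++;
        }
        return current.value;
    }

    long hops() { return hops; }

    public static void main(String[] args) {
        LinkedIndexSketch list = new LinkedIndexSketch();
        int[] array = new int[1000];
        for (int i = 0; i < 1000; i++) {
            list.add(i);
            array[i] = i;
        }
        System.out.println(array[500]);    // 500, a single indexed read
        System.out.println(list.get(500)); // 500, after following 500 links
        System.out.println(list.hops());   // 500
    }
}
```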
I have this code:
@Nullable
@Value.Default
default List<String> write() {
    return new LinkedList<>();
}
The DeepCode IntelliJ plugin indicates that LinkedList can lead to unnecessary performance overhead if the List is randomly accessed, and that ArrayList should be used instead.
What is this performance overhead that LinkedList has over ArrayList? Is it really as much of a difference as DeepCode suggests?
LinkedList and ArrayList have different performance characteristics as described in the JavaDoc. Inserting into a LinkedList is cheap, especially at the front and back. Traversing a LinkedList in sequence, e.g. with streams or foreach, is (relatively) cheap.
On the other hand, random access e.g. with get(n) is slow, as it takes O(n).
ArrayList, on the other hand, does random access in O(1). Appending, in turn, runs in amortized constant time:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation.
The main advantage of LinkedList is that it allows for fast inserting and deletion at the front/end and via the iterator. A typical usage scenario is if you use the LinkedList as a Queue or Deque (it actually implements those two interfaces as well).
So, it depends on what you are doing. If you have frequent random access, use ArrayList. If you have frequent sequential access (via the iterator) and adding/removing from front/back, e.g. because you use it as a Queue, use LinkedList.
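As a small illustration of the queue scenario, the sketch below uses a LinkedList through the Deque interface. The class and method names (QueueUsageSketch, drainInOrder) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;

class QueueUsageSketch {
    // Adds at both ends, then removes everything from the front.
    static List<String> drainInOrder() {
        Deque<String> queue = new LinkedList<>();
        queue.addLast("b");
        queue.addLast("c");
        queue.addFirst("a"); // cheap: only the head link changes

        List<String> drained = new ArrayList<>();
        while (!queue.isEmpty()) {
            drained.add(queue.pollFirst()); // cheap removal at the front
        }
        return drained;
    }

    public static void main(String[] args) {
        System.out.println(drainInOrder()); // [a, b, c]
    }
}
```

None of these calls ever uses get(n), which is the operation that is slow on a LinkedList.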
If you add at arbitrary positions, e.g. via add(int index, E element), a LinkedList has to traverse the list first, making insertion O(n), having no benefit over ArrayList (which has to shift the subsequent elements and occasionally resize the underlying array, which is O(n) as well).
In practice, I'd only choose LinkedList if there is a clear need for it, otherwise I'd use ArrayList as the default choice. Note that if you know the number of elements, you can size an ArrayList properly and thus avoid resizing, making the disadvantages of ArrayList even smaller.
https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html
https://docs.oracle.com/javase/8/docs/api/java/util/LinkedList.html
See also https://stuartmarks.wordpress.com/2015/12/18/some-java-list-benchmarks/ (thanks to @Leprechaun for providing this resource)
Is there any advantage to using an ArrayList over a SparseArray? A SparseArray manages memory better, as it does not put nulls in empty slots like an ArrayList will. But would you always recommend using a SparseArray over an ArrayList, which is used very commonly?
To be clear, I am not asking what a SparseArray is (I have already defined it above); I am asking when one would want to use an ArrayList over a SparseArray.
When the list is not sparse, an ArrayList requires less memory than SparseArray, and accesses by index in O(1) rather than O(log n).
From the SparseArray class documentation:
Note that this container keeps its mappings in an array data structure, using a binary search to find keys. The implementation is not intended to be appropriate for data structures that may contain large numbers of items. It is generally slower than a traditional HashMap, since lookups require a binary search and adds and removes require inserting and deleting entries in the array. For containers holding up to hundreds of items, the performance difference is not significant, less than 50%.
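The idea described in that note, sorted keys in an array plus binary search, can be sketched in plain Java. This is not Android's SparseArray source; SparseIntMapSketch and everything in it is a hypothetical illustration of the technique.

```java
import java.util.Arrays;

// Hypothetical sketch of an int-keyed map stored in two parallel arrays.
class SparseIntMapSketch {
    private int[] keys = new int[4];        // always kept sorted
    private Object[] values = new Object[4];
    private int size;

    Object get(int key) {
        int i = Arrays.binarySearch(keys, 0, size, key); // O(log n) lookup
        return i >= 0 ? values[i] : null;
    }

    void put(int key, Object value) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        if (i >= 0) {
            values[i] = value; // key already present: overwrite
            return;
        }
        int insertAt = -(i + 1); // binarySearch encodes the insertion point
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        // Shift the tail one slot to the right to make room: O(n).
        System.arraycopy(keys, insertAt, keys, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        keys[insertAt] = key;
        values[insertAt] = value;
        size++;
    }

    int size() { return size; }

    public static void main(String[] args) {
        SparseIntMapSketch map = new SparseIntMapSketch();
        map.put(70000, "z");
        map.put(5, "y");
        System.out.println(map.get(5));     // y
        System.out.println(map.get(6));     // null, and no slot is reserved for it
        System.out.println(map.size());     // 2
    }
}
```

Two entries occupy two slots no matter how far apart the keys are, whereas an ArrayList indexed up to 70000 would need that many slots; the price is the binary search on every lookup and the shifting on every insert.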
I have a List<String> toProcess which I want to process further with
toProcess.parallelStream().map(/*some function*/).collect(Collectors.toList());
Which is the best List type (like LinkedList, ArrayList etc.) for the initial list to gain the best speed from this multithreading?
Additional information: The expected element-count ranges in the size of 10^3-10^5, but the individual element can become quite big (10^5-10^6 chars).
Alternatively, I can use String[] all over the place, as the number of strings is guaranteed not to change (the results will contain as many elements as toProcess).
Either way I have to iterate over all elements in order at the end. At the moment I use a foreach-loop to assemble the final result. This can be easily changed to a regular for-loop.
If you are certain that the number of output elements equals the number of input elements, and you're satisfied with an array as the result, then definitely use toArray instead of a collector. If the pipeline has a fixed size throughout, the destination array will be preallocated with the right size, and the parallel operations deposit their results directly into the destination array at the right locations: no copying, reallocation, or merging.
If you want a List, you can always wrap the result using Arrays.asList, but of course you can't add or remove elements to the result.
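A minimal sketch of that approach follows. The mapping function (String::toUpperCase) is only a placeholder for the real per-element work, and the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

class ToArraySketch {
    static List<String> process(List<String> toProcess) {
        String[] results = toProcess.parallelStream()
                .map(String::toUpperCase)  // placeholder for the real function
                .toArray(String[]::new);   // results land in one array, in encounter order
        return Arrays.asList(results);     // fixed-size List view over the array
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList("a", "b", "c"))); // [A, B, C]
    }
}
```

The returned list supports get and set but throws UnsupportedOperationException on add or remove, since it is only a view of the array.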
Collectors
If one of the above conditions doesn't hold, then you need to deal with collectors, which have different tradeoffs.
Collectors work in parallel by operating on intermediate results in a thread-confined manner. The intermediate results are then merged into the final result. There are two operations to consider: 1) the accumulation of individual elements into the intermediate results, and 2) the merging (or combining) of the intermediate results into a final result.
Between LinkedList and ArrayList, it's likely that ArrayList is faster, but you should probably benchmark this to be sure. Note that Collectors.toList uses ArrayList by default, although this may change in a future release.
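The two steps, accumulation and merging, are visible in the three-argument form of Stream.collect, which takes a supplier, an accumulator and a combiner. The sketch below is only an illustration with made-up class and method names; it says nothing about which variant is faster, which needs a benchmark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class CollectSketch {
    static ArrayList<Integer> lengthsIntoArrayList(List<String> input) {
        ArrayList<Integer> result = input.parallelStream()
                .map(String::length)
                .collect(ArrayList::new,     // one intermediate list per chunk of work
                         ArrayList::add,     // accumulation: append one element
                         ArrayList::addAll); // merging: combine two intermediate lists
        return result;
    }

    static LinkedList<Integer> lengthsIntoLinkedList(List<String> input) {
        LinkedList<Integer> result = input.parallelStream()
                .map(String::length)
                .collect(LinkedList::new, LinkedList::add, LinkedList::addAll);
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc");
        System.out.println(lengthsIntoArrayList(words));  // [1, 2, 3]
        System.out.println(lengthsIntoLinkedList(words)); // [1, 2, 3]
    }
}
```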
LinkedList
Each element being accumulated (LinkedList.add) involves allocating a new list node and hooking it to the end of the list. Hooking the node to the list is quite fast, but this involves an allocation for every single stream element, which will probably incur minor garbage collections as accumulation proceeds.
Merging (LinkedList.addAll) is also quite expensive. The first step is to convert the source list to an array; this is done by looping over every node of the list and storing the element into a temporary array. Then, the code iterates over this temporary array and adds each element to the end of the destination list. This incurs allocation of a new node for each element, as noted above. Thus a merge operation is quite expensive, because it iterates over every element in the source list twice and incurs allocation for each element, which probably introduces garbage collection overhead.
ArrayList
Accumulation of each element usually involves appending it to the end of the array contained within the ArrayList. This is usually quite fast, but if the array is full, it must be reallocated and copied into a larger array. The growth policy for ArrayList is to allocate the new array to be 50% larger than the current one, so reallocations occur proportional to the log of the number of elements being added, which isn't too bad. All the elements have to be copied over, however, which means that the earlier elements might need to be copied multiple times.
Merging an ArrayList is probably much cheaper than LinkedList. Converting an ArrayList to an array involves a bulk copy (not one-at-a-time) of the elements from the source into a temporary array. The destination array is resized if necessary (which is likely in this case), requiring a bulk copy of all the elements. The source elements are then bulk-copied from the temporary array to the destination, which has been pre-sized to accommodate them.
Discussion
Given the above, it seems like ArrayList will be faster than LinkedList. However, even collection to an ArrayList requires some unnecessary reallocation and copying of many elements, probably several times. A potential future optimization would be for Collectors.toList to accumulate elements into a data structure that's optimized for fast-append access, preferably one that's been pre-sized to accommodate the expected number of elements. A data structure that supports fast merging is a possibility as well.
If all you need to do is iterate over the final result, it shouldn't be too difficult to roll your own data structure that has these properties. Significant simplification should be possible if it doesn't need to be a full-blown List. It could accumulate into pre-sized lists to avoid reallocations, and merging would simply gather these into a tree structure or list-of-lists. See the JDK's SpinedBuffer (a private implementation class) for ideas.
Given the cost of a context switch, and of multithreading in general, the performance gain from switching between list types is generally insignificant. Even if you use a suboptimal list, it won't matter much.
If you really care, then ArrayList because of cache locality would probably do a better job, but it depends.
Generally, ArrayList is much more friendly to parallelization compared to LinkedList because arrays are easy to split into pieces to hand to each thread.
However, since your terminal operation is to write the result to a file, parallelization may not help you at all since you will likely be limited by IO, not by CPU.
Could you please let me know why, performance-wise, an array is better than a Collection?
It is not. It will actually depend on the use you make of your container.
Some algorithms may run in O(n) on an array and in O(1) on another collection type (which implements the Collection interface).
Think about removing an item, for instance. In that case the array, even though it is a native type, would perform slower than the linked list and its method calls (which could be inlined anyway on some VMs): it runs in O(n), versus O(1) for a linked list once you hold a reference to the node (for example through an iterator).
Think about searching for an element. It runs in O(n) for an unsorted array versus O(log n) for a balanced tree.
Some Collection implementations use an array to store their elements (ArrayList, for example), so in that case performance will not be significantly different.
You should spend time on optimizing your algorithm (and make use of the various collection types available) instead of worrying about the pros and cons of an array versus a Collection.
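The removal and search cases mentioned above might look like this in code. It is a sketch with invented names (ContainerChoiceSketch and its methods), meant to show which operation each container is suited for rather than to measure anything.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.TreeSet;

class ContainerChoiceSketch {
    // Removal through the iterator unlinks the current node in O(1) each time.
    static List<Integer> removeOdd(List<Integer> values) {
        LinkedList<Integer> list = new LinkedList<>(values);
        for (Iterator<Integer> it = list.iterator(); it.hasNext();) {
            if (it.next() % 2 != 0) {
                it.remove();
            }
        }
        return list;
    }

    // Searching an unsorted array has to look at up to n elements.
    static boolean linearContains(int[] data, int target) {
        for (int value : data) {
            if (value == target) {
                return true;
            }
        }
        return false;
    }

    // TreeSet is a balanced tree, so contains() takes O(log n) comparisons.
    static boolean treeContains(TreeSet<Integer> tree, int target) {
        return tree.contains(target);
    }

    public static void main(String[] args) {
        System.out.println(removeOdd(Arrays.asList(1, 2, 3, 4, 5)));                  // [2, 4]
        System.out.println(linearContains(new int[] {3, 1, 2}, 2));                   // true
        System.out.println(treeContains(new TreeSet<>(Arrays.asList(3, 1, 2)), 7));   // false
    }
}
```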
Many collections are wrappers for arrays. This includes ArrayList, HashMap/Set, StringBuilder. For optimised code, the performance difference of the operations is minimal except when you come to operations which are better suited to that data structure e.g. lookup of a Map is much faster than the lookup in an array.
Using generics for collections which are basically primitives can be slower, not because the collection is slower but because of the extra object creation and cache usage (as the memory needed can be higher). This difference is usually too small to matter, but if you are concerned about this you can use the Trove4J libraries, which are wrappers for arrays of primitives instead of arrays of Objects.
Where collections are slower is when you use operations which they are not suitable for e.g. random access of a LinkedList, but sensible coding can avoid these situations.
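To show where the extra objects come from, here is a small sketch comparing an int[] with a List<Integer>. The names are invented for the example, and it deliberately contains no timing code, since whether the difference matters depends on the workload.

```java
import java.util.ArrayList;
import java.util.List;

class BoxingSketch {
    // Reads primitives straight out of one contiguous array.
    static long sumPrimitive(int[] values) {
        long total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Each element is an Integer object that gets unboxed on every read.
    static long sumBoxed(List<Integer> values) {
        long total = 0;
        for (Integer value : values) {
            total += value; // auto-unboxing via intValue()
        }
        return total;
    }

    public static void main(String[] args) {
        int[] primitives = new int[1000];
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            primitives[i] = i;
            boxed.add(i); // autoboxing: int -> Integer
        }
        System.out.println(sumPrimitive(primitives)); // 499500
        System.out.println(sumBoxed(boxed));          // 499500
    }
}
```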
Basically, because arrays are primitive data structures in Java. Accesses to them can be translated directly into native memory-access instructions rather than method calls.
That said, though, it's not entirely obvious that arrays will strictly outperform collections in all circumstances. If your code references collection variables where the runtime type can be monomorphically known at JIT-time, Hotspot will be able to inline the access methods, and where they are simple, can be just as fast since there's basically no overhead anyway.
Many of the collections' access methods are intrinsically more complex than array referencing, however. There is, for instance, no way that a HashMap will be as efficient as a simple array lookup, no matter how much Hotspot optimizes it.
You cannot compare the two. ArrayList is an implementation, Collection is an interface. There might be different implementations for the Collection interface.
In practice, choose the implementation which gives the simplest access to your data. Usually ArrayList if you need to loop through all elements, Hashtable if you need access by key.
Performance should be considered only after measurements are made. Then it is easy to change the implementation because the collection framework has common interfaces like the Collection interface.
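A short sketch of what coding against the common interface buys you: the method below only knows about Collection, so the caller can swap the implementation after measuring. Class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;

class InterfaceSwapSketch {
    // Depends only on the Collection interface, not on a concrete class.
    static int totalLength(Collection<String> words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ab", "cd", "ef");
        // Changing the implementation is a one-line change at the call site.
        System.out.println(totalLength(new ArrayList<>(words)));  // 6
        System.out.println(totalLength(new LinkedList<>(words))); // 6
        System.out.println(totalLength(new HashSet<>(words)));    // 6
    }
}
```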
The question is which one to use and when?
An array is basically a fixed-size collection of elements. The bad point about an array is that it is not resizable. But its constant size provides efficiency if you know your element count in advance. So arrays are better to use when you know the number of elements you will have.
Collection
ArrayList is another collection where the number of elements is resizable. So if you are not sure about the number of elements in the collection use an ArrayList. But there are certain facts to be considered while using ArrayLists.
ArrayList is not synchronized. So if there are multiple threads
accessing and modifying the list, then synchronization might be
required to be handled externally.
ArrayList is internally implemented as an array. When a new
element is added and the array is already full, a larger array is created
(about 50% larger in current JDKs), the n existing elements are copied
from the old array to the new one, and then the new element is inserted.
Adding n elements requires O(n) time.
The isEmpty, size, iterator, set, get and listIterator operations
require the same amount of time, independently of element you access.
Only Objects can be added to an ArrayList (primitives are autoboxed)
Permits null elements
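For the synchronization point in the list above, one common way to handle it externally is Collections.synchronizedList. The sketch below is a minimal illustration with invented names; other approaches (explicit locks, concurrent collections) exist as well.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SynchronizedListSketch {
    // Two threads append to the same list; the wrapper serializes each add().
    static int fillFromTwoThreads(int perThread) {
        List<Integer> shared = Collections.synchronizedList(new ArrayList<>());
        Runnable writer = () -> {
            for (int i = 0; i < perThread; i++) {
                shared.add(i);
            }
        };
        Thread first = new Thread(writer);
        Thread second = new Thread(writer);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(fillFromTwoThreads(10_000)); // 20000
    }
}
```

With a bare ArrayList instead of the wrapper, the same code could lose elements or throw, because concurrent add calls would race on the backing array. Note that iterating over a synchronized list still requires synchronizing on it manually.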
If you need to add a large number of elements to an ArrayList, you can use the ensureCapacity(int minCapacity) operation to ensure that the ArrayList has the required capacity. This ensures that the array is reallocated at most once while all the elements are added, which speeds up adding elements to an ArrayList. Also, inserting an element in the middle of, say, 1000 elements requires moving 500 elements and then adding the element in the middle.
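A minimal usage sketch of ensureCapacity follows; the class and method names are invented. ArrayList does not expose its capacity, so the effect is not directly observable from the outside other than through timing.

```java
import java.util.ArrayList;

class EnsureCapacitySketch {
    static ArrayList<Integer> fill(int n) {
        ArrayList<Integer> list = new ArrayList<>();
        list.ensureCapacity(n); // grow the backing array once, up front
        for (int i = 0; i < n; i++) {
            list.add(i);        // no further reallocation needed for these adds
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(fill(1_000_000).size()); // 1000000
    }
}
```

When the size is known at construction time, new ArrayList<>(n) achieves the same thing in one step.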
The benefit of using ArrayList is that accessing random elements is cheap and is not affected by the number of elements in the ArrayList. Appending at the tail is cheap as well (amortized constant time), but adding elements at the head or in the middle is costly because the following elements have to be shifted.
Vector is similar to ArrayList with the difference that it is synchronized. It also lets you specify a capacity increment in addition to an initial capacity. So if your vector has a capacity of 10 and a capacity increment of 10, then when you add the 11th element a new internal array with a capacity of 20 is created and the 10 existing elements are copied into it. Adding the 12th to 20th elements does not require creating a new array.
By default, when a vector needs to grow the size of its internal data structure to hold more elements, the size of internal data structure is doubled, whereas for ArrayList the size is increased by only 50%. So ArrayList is more conservative in terms of space.
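Unlike ArrayList, Vector exposes its capacity, so the growth behavior can be observed directly. The sketch below uses the numbers from the text (initial capacity 10, capacity increment 10); the class and helper names are made up.

```java
import java.util.Vector;

class VectorGrowthSketch {
    // Adds `count` elements and reports the resulting capacity.
    static int capacityAfter(Vector<Integer> vector, int count) {
        for (int i = 0; i < count; i++) {
            vector.add(i);
        }
        return vector.capacity();
    }

    public static void main(String[] args) {
        // Capacity increment of 10: grows 10 -> 20 -> 30.
        System.out.println(capacityAfter(new Vector<>(10, 10), 21)); // 30
        // No increment given: capacity doubles, 10 -> 20 -> 40.
        System.out.println(capacityAfter(new Vector<>(10), 21));     // 40
    }
}
```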
LinkedList is much more flexible and lets you insert, add and remove elements at both ends of your collection, so it can be used as a queue and even as a double-ended queue. Internally a LinkedList does not use arrays. A LinkedList is a sequence of nodes which are doubly linked. Each node contains the stored element and two links (pointers), one to the next and one to the previous node. A LinkedList looks like a chain of people holding each other's hands. You can insert people (nodes) into that chain or remove them. Given a reference to a position (for example via a ListIterator), linked lists permit inserting or removing a node at that point in constant time.
So inserting elements at the head or the tail of a linked list is cheap, and so is inserting in the middle once an iterator is already positioned there. Retrieving elements from the head or the tail is cheap as well. But randomly accessing elements by index is expensive: to reach the (n+1)th element you need to traverse the first n elements (Java's LinkedList starts from whichever end is closer).
Also linked list is not synchronized. So multiple threads modifying and reading the list would need to be synchronized externally.
So the choice of which class to use for creating lists depends on the requirements. ArrayList or Vector (if you need synchronization) could be used when you need to add elements at the end of the list and access elements randomly, with more access operations than add operations. A LinkedList, on the other hand, should be used when you do a lot of add/delete operations at the head or, through an iterator, in the middle of the list and your access operations are comparatively few.
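Inserting in the middle of a LinkedList is only cheap when you are already positioned there, which in Java means going through a ListIterator. A minimal sketch, with invented class and method names:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

class IteratorInsertSketch {
    // Inserts newValue right after the first occurrence of target.
    static List<Integer> insertAfter(List<Integer> source, int target, int newValue) {
        LinkedList<Integer> list = new LinkedList<>(source);
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {          // finding the position is still O(n)
            if (it.next() == target) {
                it.add(newValue);       // the insertion itself only relinks nodes
                break;
            }
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(insertAfter(Arrays.asList(1, 2, 3), 2, 99)); // [1, 2, 99, 3]
    }
}
```

Calling list.add(index, value) instead would walk to the index on every call, which is why index-based insertion gains nothing over ArrayList.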
I have a web application that uses ArrayLists extensively to store and operate on data. However, I recently realized that HashMaps may have been a better choice. Could anyone tell me what exactly the algorithmic cost (Big O) of adding, accessing and removing an element is for both, and whether it is wise to go into the code and change them for the sake of efficiency?
For ArrayList:
The size, isEmpty, get, set, iterator, and listIterator operations run
in constant time. The add operation runs in amortized constant time,
that is, adding n elements requires O(n) time. All of the other
operations run in linear time (roughly speaking). The constant factor
is low compared to that for the LinkedList implementation.
From the documentation:
http://docs.oracle.com/javase/6/docs/api/java/util/ArrayList.html
For HashMap:
This implementation provides constant-time performance for the basic
operations (get and put), assuming the hash function disperses the
elements properly among the buckets. Iteration over collection views
requires time proportional to the "capacity" of the HashMap instance
(the number of buckets) plus its size (the number of key-value
mappings). Thus, it's very important not to set the initial capacity
too high (or the load factor too low) if iteration performance is
important.
From the documentation:
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The documentation for ArrayList has a discussion of the performance of these operations. As for HashMap, it will be O(1) on average for all three. This does assume, however, that your hashCode method is well-implemented, and is also O(1).
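Side by side, the three operations look like this. The sketch uses invented names and String keys, whose hashCode is well-behaved; it illustrates the API, not a benchmark.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupSketch {
    static List<String> listOps() {
        List<String> list = new ArrayList<>();
        list.add("a");       // amortized O(1) append
        list.add("b");
        list.add("c");
        list.get(1);         // O(1) access by index
        list.contains("c");  // O(n): scans the elements
        list.remove("b");    // O(n): finds the element, then shifts the tail
        return list;
    }

    static Map<String, Integer> mapOps() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);     // expected O(1)
        map.put("b", 2);
        map.put("c", 3);
        map.get("c");        // expected O(1) access by key
        map.remove("b");     // expected O(1)
        return map;
    }

    public static void main(String[] args) {
        System.out.println(listOps());           // [a, c]
        System.out.println(mapOps().get("c"));   // 3
    }
}
```

Whether a rewrite is worthwhile depends on how the data is accessed: if elements are mostly looked up by a key, the map helps; if they are mostly iterated or accessed by position, the list is already the right fit.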
Please have a look at http://www.coderfriendly.com/wp-content/uploads/2009/05/java_collections_v2.pdf
O(1) to get/add
O(n) for contains
O(1) for next