Performance-wise, why is an array better than a Collection?
It is not. It actually depends on how you use your container.
Some algorithms may run in O(n) on an array but in O(1) on another collection type (one that implements the Collection interface).
Think about removing an item, for instance. In that case the array, even though it is a native type, performs worse than a linked list despite the list's method calls (which some VMs can inline anyway): removal runs in O(n) for an array versus O(1) for a linked list.
Think about searching for an element. It runs in O(n) for an array versus O(log n) for a tree.
Some Collection implementations, such as ArrayList, use an array to store their elements, so in those cases performance will not differ significantly.
You should spend your time optimizing your algorithms (and making use of the various collection types available) instead of worrying about the pros and cons of an array versus a Collection.
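The removal example above can be sketched as follows. This is a minimal illustration (class and method names are mine, not from any library): both lists get the same single-pass treatment through their iterators, but the per-removal cost differs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class RemovalDemo {
    // Removes even numbers in a single pass via the iterator. On a
    // LinkedList each remove() is O(1) once the iterator is positioned;
    // on an ArrayList each remove() shifts the remaining elements,
    // so the same pass degrades toward O(n^2) for many removals.
    static void removeEvens(List<Integer> list) {
        Iterator<Integer> it = list.iterator();
        while (it.hasNext()) {
            if (it.next() % 2 == 0) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6));
        List<Integer> linkedList = new LinkedList<>(List.of(1, 2, 3, 4, 5, 6));
        removeEvens(arrayList);
        removeEvens(linkedList);
        System.out.println(arrayList);  // [1, 3, 5]
        System.out.println(linkedList); // [1, 3, 5]
    }
}
```

Both calls produce the same result; only the hidden cost per removal differs.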
Many collections are wrappers around arrays. This includes ArrayList, HashMap/HashSet, and StringBuilder. For optimised code the performance difference between the operations is minimal, except for operations that are better suited to a particular data structure, e.g. lookup in a Map is much faster than lookup in an array.
Using generic collections for what are basically primitives can be slower, not because the collection itself is slow but because of the extra object creation and cache usage (the memory needed can be higher). This difference is usually too small to matter, but if you are concerned about it you can use the Trove4J library, whose collections wrap arrays of primitives instead of arrays of Objects.
Where collections are slower is when you use operations they are not suited for, e.g. random access on a LinkedList, but sensible coding can avoid these situations.
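The boxing overhead mentioned above can be made concrete with a small sketch (class and method names are mine, chosen for illustration): the results are identical, but the boxed version allocates an Integer object per element and pays for the indirection.

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    // Summing a primitive array: no per-element object allocation,
    // elements are packed contiguously in memory.
    static long sumPrimitive(int[] values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Summing a List<Integer>: each element is a boxed Integer object,
    // which costs extra memory and cache misses on large collections.
    static long sumBoxed(List<Integer> values) {
        long sum = 0;
        for (int v : values) sum += v; // auto-unboxing on every access
        return sum;
    }

    public static void main(String[] args) {
        int[] primitives = new int[1000];
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            primitives[i] = i;
            boxed.add(i); // auto-boxing allocates (or interns) an Integer
        }
        System.out.println(sumPrimitive(primitives)); // 499500
        System.out.println(sumBoxed(boxed));          // 499500
    }
}
```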
Basically, because arrays are primitive data structures in Java: accesses to them can be translated directly into native memory-access instructions rather than method calls.
That said, it's not entirely obvious that arrays will strictly outperform collections in all circumstances. If your code references collection variables whose runtime type is monomorphically known at JIT time, HotSpot will be able to inline the access methods, and where those are simple the collection can be just as fast, since there's basically no overhead anyway.
Many of the collections' access methods are intrinsically more complex than array indexing, however. There is, for instance, no way that a HashMap lookup will be as efficient as a simple array lookup, no matter how much HotSpot optimizes it.
You cannot compare the two: ArrayList is an implementation, while Collection is an interface. There can be many different implementations of the Collection interface.
In practice, choose the implementation that gives the simplest access to your data: usually ArrayList if you need to loop through all elements, or a Hashtable (or HashMap) if you need access by key.
Performance should be considered only after measurements have been made. Then it is easy to change the implementation, because the collections framework offers common interfaces such as Collection.
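Programming against the interface, as suggested above, might look like this (a minimal sketch; the class and method names are mine): the method works unchanged no matter which implementation the caller picks after measuring.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

public class SwapImplementations {
    // Depends only on the Collection interface, so the caller can pass
    // an ArrayList, LinkedList, HashSet, ... and swap implementations
    // later without touching this code.
    static int countLong(Collection<String> words, int minLength) {
        int count = 0;
        for (String w : words) {
            if (w.length() >= minLength) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Collection<String> asList = new ArrayList<>(List.of("a", "bb", "ccc"));
        Collection<String> asSet  = new HashSet<>(List.of("a", "bb", "ccc"));
        System.out.println(countLong(asList, 2)); // 2
        System.out.println(countLong(asSet, 2));  // 2
    }
}
```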
The question is which one to use and when?
An array is basically a fixed-size collection of elements. The drawback of an array is that it is not resizable, but its constant size makes it efficient when you are clear about your element count. So arrays are the better choice when you know the number of elements in advance.
Collection
ArrayList is another collection, one where the number of elements is resizable. So if you are not sure about the number of elements in the collection, use an ArrayList. But there are certain facts to be considered while using ArrayLists.
ArrayList is not synchronized. So if multiple threads access and modify the list, synchronization might need to be handled externally.
ArrayList is internally backed by an array. When the backing array runs out of capacity, a larger array is created, all the existing elements are copied from the old array to the new one, and then the new element is inserted. (The capacity grows in larger steps, so this copy does not happen on every single add.)
Adding n elements requires O(n) time overall.
The isEmpty, size, iterator, set, get and listIterator operations require constant time, independently of which element you access.
Only objects can be added to an ArrayList (primitives are autoboxed).
Null elements are permitted.
If you need to add a large number of elements to an ArrayList, you can use the ensureCapacity(int minCapacity) method to make sure the ArrayList has the required capacity up front. This ensures the backing array is copied at most once while all the elements are added, which improves the performance of bulk additions. Also, inserting an element into the middle of, say, 1000 elements requires you to shift about 500 elements on average to make room for it.
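The ensureCapacity idea can be sketched like this (class and method names are mine, for illustration only): both calls build the same list, but the pre-sized one does a single allocation instead of a series of grow-and-copy steps.

```java
import java.util.ArrayList;

public class EnsureCapacityDemo {
    static ArrayList<Integer> fill(int n, boolean presize) {
        ArrayList<Integer> list = new ArrayList<>();
        if (presize) {
            // One allocation up front instead of repeated
            // grow-and-copy steps as the list fills.
            list.ensureCapacity(n);
        }
        for (int i = 0; i < n; i++) list.add(i);
        return list;
    }

    public static void main(String[] args) {
        // Identical results; only the internal reallocation behaviour differs.
        System.out.println(fill(100_000, true).size());  // 100000
        System.out.println(fill(100_000, false).size()); // 100000
    }
}
```

Passing the expected size to the constructor, `new ArrayList<>(n)`, achieves the same effect at creation time.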
The benefit of using an ArrayList is that random access is cheap and is not affected by the number of elements in the ArrayList. Appending at the tail is cheap (amortized constant time), but adding elements at the head or in the middle is costly, since the later elements must be shifted.
Vector is similar to ArrayList, with the difference that it is synchronized. It offers some other features, such as an initial capacity and a capacity increment. So if your Vector has a capacity of 10 and a capacity increment of 10, then when you add the 11th element a new backing array of 20 elements is created, the existing 10 elements are copied into it, and the 11th is inserted. Adding the 12th through 20th elements then requires no new allocation.
By default, when a vector needs to grow the size of its internal data structure to hold more elements, the size of internal data structure is doubled, whereas for ArrayList the size is increased by only 50%. So ArrayList is more conservative in terms of space.
LinkedList is much more flexible and lets you insert, add and remove elements from both ends of the collection: it can be used as a queue and even as a double-ended queue (deque)! Internally, a LinkedList does not use an array. It is a sequence of doubly linked nodes; each node holds the stored object plus two references (pointers) to the next and the previous node. A LinkedList looks like a chain of people holding each other's hands: you can insert a person (node) into the chain or remove one. Linked lists permit node insertion and removal at any already-located point in the list in constant time.
So inserting elements into a linked list (whether at the head, at the tail or in the middle) is not expensive in itself. Retrieving elements from either end is also cheap, because java.util.LinkedList keeps references to both the first and the last node. But random access to elements in the middle is heavy: to access the (n+1)th element, you need to traverse the first n nodes to reach it.
Also, a linked list is not synchronized, so multiple threads modifying and reading the list would need to be synchronized externally.
So the choice of which class to use for creating lists depends on the requirements. ArrayList (or Vector, if you need synchronization) could be used when you need to add elements at the end of the list and access elements randomly, i.e. more access operations than add operations. A LinkedList should be used when you need to do a lot of add/delete operations at the head or in the middle of the list and your access operations are comparatively few.
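The double-ended behaviour described above can be shown in a few lines (the class name is mine, for illustration): LinkedList implements the Deque interface, so both ends support constant-time insertion and removal.

```java
import java.util.Deque;
import java.util.LinkedList;

public class DequeDemo {
    public static void main(String[] args) {
        Deque<String> deque = new LinkedList<>();
        deque.addLast("b");
        deque.addFirst("a"); // cheap: no shifting, just relinking nodes
        deque.addLast("c");
        System.out.println(deque);               // [a, b, c]
        System.out.println(deque.removeFirst()); // a
        System.out.println(deque.removeLast());  // c
    }
}
```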
Related
For example, if I have an ArrayList with elements (1, 4, 7, 2, 10, 100, 76) and a LinkedList with the same data, and I want to search for key=2, which one would take less time: .contains() on the ArrayList or .contains() on the LinkedList? I have heard that an ArrayList is better for random access, but does that mean it is also better for searching?
Your biggest largely unnecessary overhead is having to box the primitives instead of using int[].
ArrayList will generally perform faster for practically any real situation (use ArrayDeque for queues). In this case accessing references to the elements is guaranteed to be sequential through memory, which is cache friendly, and also not incur the overhead of nodes and reading the next node reference.
A better algorithm, for all but the smallest collections, would be a binary search in a sorted array. Even a HashSet (or TreeSet) would be better.
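The sorted-array approach can be sketched like this (class name is mine, data taken from the question): sorting costs O(n log n) once, after which each lookup is O(log n) instead of the O(n) linear scan that contains() performs.

```java
import java.util.Arrays;

public class SearchDemo {
    public static void main(String[] args) {
        int[] data = {1, 4, 7, 2, 10, 100, 76};

        // Arrays.binarySearch requires sorted input; a non-negative
        // return value means the key was found.
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        System.out.println(Arrays.binarySearch(sorted, 2) >= 0); // true
        System.out.println(Arrays.binarySearch(sorted, 3) >= 0); // false
    }
}
```

For repeated membership tests, a HashSet built once from the data gives expected O(1) lookups without any sorting.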
I want to have a list of objects that satisfies just a few basic requirements. It needs to support fast random access and it needs to be safe to use from multiple threads. Reads will dominate by far and ideally should be about as fast as normal ArrayList access, i.e. no locking. There is no need to insert elements in the middle, delete, or change the value at an index: the only mutation required is to be able to append a new element to the end of the list. More specifically a caller will specify an index at which an element should be placed, and the index is expected to be only a few more than the length of the list, i.e. the list is dense. There is also no need for iteration.
Is there anything that supports this in Java? It can be in a third party library.
If not I am thinking I will implement my own class. There'll be an internal array of arrays, each twice as big as the last. Lookups by index will do just a little more maths to figure out which array has the right element and what the index in that array is. Appends will be similar unless they go beyond the available space, in which case a new array is allocated. Only the creation of a new array will require a lock to be acquired.
Does this sound like a sensible approach?
This doesn't sound like a particularly novel data structure. Does it have a name?
Reads will dominate by far and ideally should be about as fast as normal ArrayList access, i.e. no locking.
CopyOnWriteArrayList generally works in this scenario because the cost of insertion will be amortized over the large number of cheap read accesses.
Under the condition that it is append-only, one could amortize it even further by pre-sizing the array, maintaining a separate length, and bumping that atomically after each insert.
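The pre-sized, publish-after-write idea could be sketched as follows. This is a minimal single-writer sketch, not a production implementation; the class name and the fixed-capacity restriction are mine.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Append-only, lock-free for readers. Assumes a single writer thread
// and a known upper bound on the number of elements.
public class AppendOnlyArray<E> {
    private final AtomicReferenceArray<E> elements;
    private final AtomicInteger size = new AtomicInteger();

    public AppendOnlyArray(int capacity) {
        elements = new AtomicReferenceArray<>(capacity);
    }

    // Store the element first, then publish the new length, so a
    // concurrent reader that observes size() == n also sees all n
    // elements (the atomic accesses provide the ordering).
    public int append(E e) {
        int index = size.get();
        elements.set(index, e);
        size.incrementAndGet();
        return index;
    }

    public E get(int index) {
        return elements.get(index); // lock-free volatile read
    }

    public int size() {
        return size.get();
    }
}
```

Supporting multiple writers, or growth beyond the initial capacity, would need more machinery (CAS on the slot index, or the array-of-arrays scheme from the question).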
Other approaches are only necessary if you're concerned about peak latency for inserts. But that's not one of the criteria you mentioned.
You also have to keep in mind that you're asking for a data structure tailored to your use-case (append-only, lock-free, O(1) access, etc. etc.) whereas the JDK provides general-purpose data structures which make some tradeoffs to cover more use-cases.
There are 3rd-party libraries which provide more specialized implementations for limited use-cases.
The type of data structure you describe is a spined buffer, used internally by the JDK in some places (e.g. in the form of java.util.stream.SpinedBuffer&lt;E&gt;), but that implementation is not thread-safe and not exposed, since it does not implement the collection APIs.
Its javadocs state:
One or more arrays are used to store elements. The use of multiple
arrays has better performance characteristics than a single array used
by ArrayList, as when the capacity of the list needs to be increased
no copying of elements is required. This is usually beneficial in the
case where the results will be traversed a small number of times.
I.e. it's mostly useful for write-once, read-a-few-times scenarios where the allocation costs will dominate.
In read-heavy data-structures the cost of indirection, extra math operations and non-sequential memory access might actually outstrip the cost of occasional copying/reallocation.
Java has a concurrent list implementation in java.util.concurrent: CopyOnWriteArrayList, a thread-safe variant of ArrayList in which all mutative operations (add, set, and so on) are implemented by making a fresh copy of the underlying array.
From doc:
This is ordinarily too costly, but may be more efficient than
alternatives when traversal operations vastly outnumber mutations, and
is useful when you cannot or don't want to synchronize traversals, yet
need to preclude interference among concurrent threads. The "snapshot"
style iterator method uses a reference to the state of the array at
the point that the iterator was created. This array never changes
during the lifetime of the iterator, so interference is impossible and
the iterator is guaranteed not to throw
ConcurrentModificationException. The iterator will not reflect
additions, removals, or changes to the list since the iterator was
created. Element-changing operations on iterators themselves (remove,
set, and add) are not supported. These methods throw
UnsupportedOperationException.
All elements are permitted, including null.
As per your requirement:
Reads will dominate by far and ideally should be about as fast as
normal ArrayList access, i.e. no locking. There is no need to insert
elements in the middle, delete, or change the value at an index: the
only mutation required is to be able to append a new element to the
end of the list.
Appending an element at the end will result in a fresh copy of the underlying array (O(n)) and may be too expensive. I believe using Collections.synchronizedList may be a good option, but that involves locking (blocking).
Any list wrapped using Collections.synchronizedList(...) satisfies the requirements as you have stated them.
However:
Insertion anywhere other than at the end of the list will be a concurrency bottleneck. The longer the list, the worse it will get.
There are caveats in the javadocs about iteration that you should read.
CopyOnWriteArrayList is an alternative, but all updates on a copy-on-write list are O(N), irrespective of where you insert the element. This is expensive, and would be a concurrency bottleneck if there are multiple writers. The argument that the cost of updates can be ignored only applies if the ratio of writes to reads decreases over time; if the ratio is constant, you need to take the O(N) update cost into account.
Note that a synchronized wrapper for an ArrayList gives O(1) lookup and amortized O(1) insertion at the end of the list. Admittedly, insertion into the middle of a list is O(N) ... but there is no list structure that I'm aware of that does better than O(log N) for insertion at an arbitrary position. (Look up "indexable skiplist".)
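The synchronized-wrapper approach could be sketched like this (class name is mine, for illustration): every method call on the wrapper takes the same lock, so concurrent appends are safe, at the cost of readers blocking while a writer holds the lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SynchronizedListDemo {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> list = Collections.synchronizedList(new ArrayList<>());

        // Two writers appending concurrently; no external locking needed
        // for the individual add() calls.
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) list.add(i); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) list.add(i); });
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(list.size()); // 2000

        // Iteration still requires manual locking, per the javadoc.
        synchronized (list) {
            long sum = 0;
            for (int v : list) sum += v;
            System.out.println(sum); // 999000
        }
    }
}
```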
UPDATE
You commented:
"I don't need random insertion, only appends, except that the position of the append can be beyond the end of the list. For example I might have a list [0,1,2] and want to insert a 4 at index 4 so my list will then be [0,1,2,null,4]."
If that is a correct characterization of your problem, then the data structure you are talking about is NOT a "list". Certainly, it is not compatible with the Java List API. In the List context, appending means adding an element immediately after the current last element of the list; i.e. at position == list.size().
Maybe you should be looking for a concurrent sparse array class. Here is one possibility:
http://software.clapper.org/javautil/api/org/clapper/util/misc/SparseArrayList.html
I've been programming for quite a bit and recently started learning more pure Computer Science topics (for a job interview).
I know the difference between an Array and a LinkedList data structure, but now that I have started using Java I'm seeing this ArrayList, which I'm having trouble conceptualizing.
Web searches have only really shown me HOW to use them and WHEN to use them (benefits of each), but nothing can answer my question of:
What is an ArrayList? My assumption is that it is a list that maintains memory references to each element, making it also able to act like an array.
I also have a feeling that, since Java is open source, I should be able to look at the class definition, but I haven't figured out how to do that yet either.
Thanks!
I like to think of it as a data structure that lets you enjoy both worlds: the quick index access of an array and the unbounded growth of a list. Of course, there are always trade-offs.
ArrayList is actually a wrapper around an array. Every time the backing array fills up, a new, larger array is created (the OpenJDK implementation grows it by about 50%) and all the data from the original array is copied to the new one.
From the java doc:
Resizable-array implementation of the List interface. Implements all
optional list operations, and permits all elements, including null. In
addition to implementing the List interface, this class provides
methods to manipulate the size of the array that is used internally to
store the list. (This class is roughly equivalent to Vector, except
that it is unsynchronized.) The size, isEmpty, get, set, iterator, and
listIterator operations run in constant time. The add operation runs
in amortized constant time, that is, adding n elements requires O(n)
time. All of the other operations run in linear time (roughly
speaking). The constant factor is low compared to that for the
LinkedList implementation.
Each ArrayList instance has a capacity. The capacity is the size of
the array used to store the elements in the list. It is always at
least as large as the list size. As elements are added to an
ArrayList, its capacity grows automatically. The details of the growth
policy are not specified beyond the fact that adding an element has
constant amortized time cost.
An application can increase the capacity of an ArrayList instance
before adding a large number of elements using the ensureCapacity
operation. This may reduce the amount of incremental reallocation.
This allows O(1) access for most operations, just as with an array. Once in a while, though, you pay for this with an insert operation that takes much longer.
This is called amortized complexity: each operation takes only O(1), except for the occasions when the backing array must grow. At those times you pay O(n), but averaged over n operations the cost per operation is still O(1), not O(n).
Let's take an example:
We have an array of size 100 (n = 100). You make 100 insert operations (to different indices) and each of them takes only O(1); of course, all get-by-index operations also take O(1) (as this is an array). On the 101st insertion there is no more capacity in the array, so the ArrayList creates a new array (say double the size, 200, to keep the analysis simple), copies all the values to it (an O(n) operation) and then inserts the 101st item. Until you fill the array up to 200 items, all operations again take only O(1).
An ArrayList is a list that is directly backed by an array. More specifically, it's backed by an array that is dynamically resized. You can read a bit more about it in its source code; there are some pretty good comments to it.
The reason that this is significant is due to how a LinkedList is implemented - as a traditional collection of nodes and references to other nodes. This has performance impacts in indexing and traversal, whereas with an ArrayList, since it's backed by an array, all one needs to do is index into the specific array to retrieve the value.
I have a List<String> toProcess which I want to process further with
toProcess.parallelStream().map(/*some function*/).collect(Collectors.toList());
Which is the best List type (LinkedList, ArrayList, etc.) for the initial list to gain the best speed from this multithreading?
Additional information: The expected element-count ranges in the size of 10^3-10^5, but the individual element can become quite big (10^5-10^6 chars).
Alternatively, I can use String[] all over the place, as the number of strings is guaranteed not to change (results will contain as many elements as toProcess).
Either way I have to iterate over all elements in order at the end. At the moment I use a foreach-loop to assemble the final result. This can be easily changed to a regular for-loop.
If you are certain that the number of output elements equals the number of input elements, and you're satisfied with an array as the result, then definitely use toArray instead of a collector. If the pipeline has a fixed size throughout, the destination array will be preallocated with the right size, and the parallel operations deposit their results directly into the destination array at the right locations: no copying, reallocation, or merging.
If you want a List, you can always wrap the result using Arrays.asList, but of course you can't add or remove elements to the result.
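Applied to the question's setup, that might look like this (a minimal sketch; the sample data and the uppercase mapping stand in for "some function"):

```java
import java.util.Arrays;
import java.util.List;

public class ToArrayDemo {
    public static void main(String[] args) {
        List<String> toProcess = List.of("a", "bb", "ccc");

        // Fixed-size pipeline: the destination array is pre-sized and
        // each parallel chunk deposits its results directly into place,
        // preserving encounter order.
        String[] results = toProcess.parallelStream()
                                    .map(String::toUpperCase)
                                    .toArray(String[]::new);

        System.out.println(Arrays.toString(results)); // [A, BB, CCC]

        // Fixed-size List view over the same array (no add/remove).
        List<String> asList = Arrays.asList(results);
        System.out.println(asList.get(2)); // CCC
    }
}
```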
Collectors
If one of the above conditions doesn't hold, then you need to deal with collectors, which have different tradeoffs.
Collectors work in parallel by operating on intermediate results in a thread-confined manner. The intermediate results are then merged into the final result. There are two operations to consider: 1) the accumulation of individual elements into the intermediate results, and 2) the merging (or combining) of the intermediate results into a final result.
Between LinkedList and ArrayList, it's likely that ArrayList is faster, but you should probably benchmark this to be sure. Note that Collectors.toList uses ArrayList by default, although this may change in a future release.
LinkedList
Each element being accumulated (LinkedList.add) involves allocating a new list node and hooking it to the end of the list. Hooking the node to the list is quite fast, but this involves an allocation for every single stream element, which will probably incur minor garbage collections as accumulation proceeds.
Merging (LinkedList.addAll) is also quite expensive. The first step is to convert the source list to an array; this is done by looping over every node of the list and storing the element into a temporary array. Then, the code iterates over this temporary array and adds each element to the end of the destination list. This incurs allocation of a new node for each element, as noted above. Thus a merge operation is quite expensive, because it iterates over every element in the source list twice and incurs allocation for each element, which probably introduces garbage collection overhead.
ArrayList
Accumulation of each element usually involves appending it to the end of the array contained within the ArrayList. This is usually quite fast, but if the array is full, it must be reallocated and copied into a larger array. The growth policy for ArrayList is to allocate the new array to be 50% larger than the current one, so reallocations occur proportional to the log of the number of elements being added, which isn't too bad. All the elements have to be copied over, however, which means that the earlier elements might need to be copied multiple times.
Merging an ArrayList is probably much cheaper than a LinkedList. Converting an ArrayList to an array involves a bulk copy (not one-at-a-time) of the elements from the source into a temporary array. The destination array is resized if necessary (which is likely in this case), requiring a bulk copy of all the elements. The source elements are then bulk-copied from the temporary array to the destination, which has been pre-sized to accommodate them.
Discussion
Given the above, it seems like ArrayList will be faster than LinkedList. However, even collection to an ArrayList requires some unnecessary reallocation and copying of many elements, probably several times. A potential future optimization would be for Collectors.toList to accumulate elements into a data structure that's optimized for fast-append access, preferably one that's been pre-sized to accommodate the expected number of elements. A data structure that supports fast merging is a possibility as well.
If all you need to do is iterate over the final result, it shouldn't be too difficult to roll your own data structure that has these properties. Significant simplification should be possible if it doesn't need to be a full-blown List. It could accumulate into pre-sized lists to avoid reallocations, and merging would simply gather these into a tree structure or list-of-lists. See the JDK's SpinedBuffer (a private implementation class) for ideas.
Given the cost of a context switch and of multithreading in general, the performance gain from switching between list types is usually insignificant. Even if you use a suboptimal list, it won't matter much.
If you really care, then ArrayList, because of its cache locality, would probably do a better job, but it depends.
Generally, ArrayList is much more friendly to parallelization compared to LinkedList because arrays are easy to split into pieces to hand to each thread.
However, since your terminal operation is to write the result to a file, parallelization may not help you at all since you will likely be limited by IO, not by CPU.
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are (and the number won't change), and was wondering whether Array would have a notable performance advantage over HashSet for storing/retrieving said elements.
On paper, Array guarantees constant time insertion (since I know the size ahead of time) and retrieval, but the code for HashSet looks much cleaner and adds some flexibility, so I'm wondering if I'm losing anything performance-wise using it, at least, theoretically.
Depends on your data;
HashSet gives you an O(1) contains() method but doesn't preserve order.
ArrayList's contains() is O(n), but you can control the order of the entries.
With an array, inserting anything in between is O(n) in the worst case, since you have to shift the existing data to make room for the insertion. With sets, if you need ordering you can use a SortedSet (such as TreeSet), whose operations are O(log n) and more flexible.
I believe a Set is more flexible.
The choice greatly depends on what do you want to do with it.
If it is what mentioned in your question:
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are
If this is all you need to do, then you need neither of them: Collection has a size() method which tells you how many elements the collection holds.
If what you mean by "collection of objects" is not literally a Collection, and you need to choose a collection type to store your objects for further processing, then you need to know that different kinds of collections have different capabilities and characteristics.
First, I believe that for a fair comparison you should consider using an ArrayList instead of an array, so that you don't have to deal with reallocation yourself.
Then it becomes the choice of ArrayList vs HashSet, which is quite straightforward:
Do you need a List or a Set? They serve different purposes: Lists give you indexed access, and iteration happens in index order, while Sets are mainly for keeping a distinct set of data and, by their nature, offer no indexed access.
After you have decided between List and Set, it becomes a choice of implementation: for Lists you normally choose between ArrayList and LinkedList, while for Sets you choose between HashSet and TreeSet.
All the choices depend on what you want to do with that collection of data, because they perform differently for different actions.
For example, indexed access in an ArrayList is O(1); in a HashSet (where an index is not really meaningful) it is O(n) (and, just for your interest, it is also O(n) in a LinkedList and in a TreeSet, both of which must be traversed element by element).
Adding a new element is an amortized O(1) operation for both ArrayList and HashSet. Inserting in the middle is O(n) for an ArrayList, while it doesn't make sense for a HashSet. Both suffer from reallocation, and both need O(n) for it (HashSet is normally slower to reallocate, because it has to recompute the hash bucket for each element).
To find out whether a certain element exists in the collection, ArrayList is O(n) and HashSet is O(1).
There are many more operations you can perform, so it is fairly meaningless to discuss performance without knowing what you want to do.
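The contains() difference above is easy to demonstrate (class name is mine; the element count is arbitrary): the same membership test scans element by element on the list but hashes straight to the bucket on the set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            list.add(i);
            set.add(i);
        }
        System.out.println(list.contains(99_999)); // true, after a full O(n) scan
        System.out.println(set.contains(99_999));  // true, one expected O(1) hash lookup
    }
}
```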
Theoretically, and as the SCJP6 study guide says :D
arrays are faster than collections, and, as said, most collections depend mainly on arrays (Maps are not considered Collections, but they are included in the collections framework).
If you can guarantee that the size of your data won't change, why get stuck with objects built on objects (collections built on arrays) when you can use the underlying structure directly (arrays)?
It looks like you will want a HashMap that maps IDs to counts. In particular:
HashMap&lt;Integer, Integer&gt; counts = new HashMap&lt;Integer, Integer&gt;();
counts.put(uniqueID, counts.getOrDefault(uniqueID, 0) + 1);
(Using getOrDefault avoids the NullPointerException that get() would cause the first time an ID is seen.)
This way, you get amortized O(1) adds, contains and retrievals. Essentially, an array indexed by unique IDs IS a HashMap. By using the HashMap, you get the added bonus of not having to manage the size of the array, not having to map the keys to array indices yourself, AND constant access time.