Complexity of processing a collection's values

Complexity of processing a collection's values - java

I need to store a growing large number of objects in a collection. While performing actions of each object of the collection, I regularly need to check whether an object is already stored. If an object is not stored yet I will add it to the end of the collection. I process each object iteratively while doing the checks.
Objects already processed should not be removed from the collection because I do not want put them back to processing when I stumble upon them again.
As a result I do not know what collection may fit best. HashSet has a constant time "contains" method but a List has faster methods to iterate over its elements, right ?
What would be the wiser choice ? Would it be relevant to keep two different structures at a time containing the same nodes, a HashSet for the checks and a LinkedList for the processing ?

As a result I do not know what collection may fit best. HashSet has a constant time "contains" method but a List has faster methods to iterate over its elements, right ?
How about a LinkedHashSet?
Hash table and linked list implementation of the Set interface, with predictable iteration order. This implementation differs from HashSet in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is the order in which elements were inserted into the set (insertion-order)

1) Use ArrayList, not LinkedList. LinkedLists consume a lot of memory, and it's slower on iteration than ArrayList.
2) I'd suggest to use two data structures. E.g. for the sake of you being unable to add to a collection wile iterating through it (ConcurrentModificationException)

Well, it seems you are interested in two views on your collection.
A queue like view, adding things to the end and inspecting them at the front.
A contains check
All those operations are well supported in different kinds of heaps, e.g. java.util.PriorityQueue

Related

Java HashSet vs Array Performance

I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are (and the number won't change), and was wondering whether Array would have a notable performance advantage over HashSet for storing/retrieving said elements.
On paper, Array guarantees constant time insertion (since I know the size ahead of time) and retrieval, but the code for HashSet looks much cleaner and adds some flexibility, so I'm wondering if I'm losing anything performance-wise using it, at least, theoretically.

Depends on your data;
HashSet gives you an O(1) contains() method but doesn't preserve order.
ArrayList contains() is O(n) but you can control the order of the entries.
Array if you need to insert anything in between, worst case can be O(n), since you will have to move the data down and make room for the insertion. In Set, you can directly use SortedSet which too has O(n) too but with flexible operations.
I believe Set is more flexible.

The choice greatly depends on what do you want to do with it.
If it is what mentioned in your question:
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are
If this is what you need to do, the you need neither of them. There is a size() method in Collection for which you can get the size of it, which mean how many of them there are in the collection.
If what you mean for "collection of object" is not really a collection, and you need to choose a type of collection to store your objects for further processing, then you need to know, for different kind of collections, there are different capabilities and characteristic.
First, I believe to have a fair comparison, you should consider using ArrayList instead Array, for which you don't need to deal with the reallocation.
Then it become the choice of ArrayList vs HashSet, which is quite straight-forward:
Do you need a List or Set? They are for different purpose: Lists provide you indexed access, and iteration is in order of index. While Sets are mainly for you to keep a distinct set of data, and given its nature, you won't have indexed access.
After you made your decision of List or Set to use, then it is a choice of List/Set implementation, normally for Lists, you choose from ArrayList and LinkedList, while for Sets, you choose between HashSet and TreeSet.
All the choice depends on what you would want to do with that collection of data. They performs differently on different action.
For example, an indexed access in ArrayList is O(1), in HashSet (though not meaningful) is O(n), (just for your interest, in LinkedList is O(n), in TreeSet is O(nlogn) )
For adding new element, both ArrayList and HashSet is O(1) operation. Inserting in the middle is O(n) for ArrayList, while it doesn't make sense in HashSet. Both will suffer from reallocation, and both of them need O(n) for the reallocation (HashSet is normally slower in reallocation, because it involve calculation of hash for each element again).
To find if certain element exists in the collection, ArrayList is O(n) and HashSet is O(1).
There are still lots of operations you can do, so it is quite meaningless to discuss for performance without knowing what you want to do.

theoretically, and as SCJP6 Study guide says :D
arrays are faster than collections, and as said, most of the collections depend mainly on arrays (Maps are not considered Collection, but they are included in the Collections framework)
if you guarantee that the size of your elements wont change, why get stuck in Objects built on Objects (Collections built on Arrays) while you can use the root objects directly (arrays)

It looks like you will want an HashMap that maps id's to counts. Particularly,
HashMap<Integer,Integer> counts=new HashMap<Integer,Integer>();
counts.put(uniqueID,counts.get(uniqueID)+1);
This way, you get amortized O(1) adds, contains and retrievals. Essentially, an array with unique id's associated with each object IS a HashMap. By using the HashMap, you get the added bonus of not having to manage the size of the array, not having to map the keys to an array index yourself AND constant access time.

Collection Vs Array Performance wise

Could you please let me know Performance wise why Array is better than Collection?

It is not. It will actually depend on the use you make of your container.
Some algorithms may run in O(n) on an array and in O(1) on another collection type (which implements the Collection interface).
Think about removal of an item for instance. In that case, the array, even if a native type, would perform slower than the linked list and its method calls (which could be inlined anyway on some VMs): it runs in O(n) VS O(1) for a linked list
Think about searching an element. It runs in 0(n) for an array VS O(log n) for a tree.
Some Collection implementations use an array to store their elements (ArrayList I think) so in that case performance will not be significantly different.
You should spend time on optimizing your algorithm (and make use of the various collection types available) instead of worrying of the pros/cons of an array VS Collection.

Many collections are wrappers for arrays. This includes ArrayList, HashMap/Set, StringBuilder. For optimised code, the performance difference of the operations is minimal except when you come to operations which are better suited to that data structure e.g. lookup of a Map is much faster than the lookup in an array.
Using generics for collections which are basically primitives can be slower, not because the collection is slower but the extra object creation and cache usage (as the memory needed can be higher) This difference is usually too small to matter but if you are concerned about this you can use the Trove4J libraries which are wrappers for arrays of primitives instead of arrays of Objects.
Where collections are slower is when you use operations which they are not suitable for e.g. random access of a LinkedList, but sensible coding can avoid these situations.

Basically, because arrays are primitive data structures in Java. Accesses to them can be translated directly into native memory-access instructions rather than method calls.
That said, though, it's not entirely obvious that arrays will strictly outperform collections in all circumstances. If your code references collection variables where the runtime type can be monomorphically known at JIT-time, Hotspot will be able to inline the access methods, and where they are simple, can be just as fast since there's basically no overhead anyway.
Many of the collections' access methods are intrinsically more complex than array referencing, however. There is, for instance, no way that a HashMap will be as efficient as a simple array lookup, no matter how much Hotspot optimizes it.

You cannot compare the two. ArrayList is an implementation, Collection is an interface. There might be different implementations for the Collection interface.
In practice the implementation is chosen which as the simple access to your data. Usually ArrayList if you need to loop through all elements. Hashtable if you need access by key.
Performance should be considered only after measurements are made. Then it is easy to change the implementation because the collection framework has common interfaces like the Collection interface.

The question is which one to use and when?
An array is basically a fixed size collection of elements. The bad point about an array is that it is not resizable. But its constant size provides efficiency if you are clear with your element size. So arrays are better to use when you know the number of elements available with you.
Collection
ArrayList is another collection where the number of elements is resizable. So if you are not sure about the number of elements in the collection use an ArrayList. But there are certain facts to be considered while using ArrayLists.
ArrayLists is not synchronized. So if there are multiple threads
accessing and modifying the list, then synchronization might be
required to be handled externally.
ArrayList is internally implemented as an array. So whenever a new
element is added an array of n+1 elements is created and then all the
n elements are copied from the old array to the new array and then
the new element is inserted in the new array.
Adding n elements requires on time.
The isEmpty, size, iterator, set, get and listIterator operations
require the same amount of time, independently of element you access.
Only Objects can be added to an ArrayList
Permits null elements
If you need to add a large number of elements to an ArrayList, you can use the ensureCapacity(int minCapacity) operation to ensure that the ArrayList has that required capacity. This will ensure that the Array is copied only once when all the elements are added and increase the performance of addition of elements to an ArrayList. Also inserting an element in the middle of say 1000 elements would require you to move 500 elements up or down and then add the element in the middle.
The benefit of using ArrayList is that accessing random elements is cheap and is not affected by the number of elemets in the ArrayList. But addition of elements to the head of tail or in the middle is costly.
Vector is similar to ArrayList with the difference that it is synchronized. It offers some other benefits like it has an initial capacity and an incremental capacity. So if your vector has a capacity of 10 and incremental capacity of 10, then when you are adding the 11th element a new Vector would be created with 20 elements and the 11 elements would be copied to the new Vector. So addition of 12th to 20th elements would not require creation of new vector.
By default, when a vector needs to grow the size of its internal data structure to hold more elements, the size of internal data structure is doubled, whereas for ArrayList the size is increased by only 50%. So ArrayList is more conservative in terms of space.
LinkedList is much more flexible and lets you insert, add and remove elements from both sides of your collection - it can be used as queue and even double-ended queue! Internally a LinkedList does not use arrays. LinkedList is a sequence of nodes, which are double linked. Each node contains header, where actually objects are stored, and two links or pointers to next or previous node. A LinkedList looks like a chain, consisting of people who hold each other's hand. You can insert people or node into that chain or remove. Linked lists permit node insert/remove operation at any point in the list in constant time.
So inserting elements in linked list (whether at head or at tail or in the middle) is not expensive. Also when you retrieve elements from the head it is cheap. But when you want to randomly access the elements of the linked list or access the elements at the tail of the list then the operations are heavy. Cause, for accessing the n+1 th element, you will need to parse through the first n elements to reach the n+1th element.
Also linked list is not synchronized. So multiple threads modifying and reading the list would need to be synchronized externally.
So the choice of which class to use for creating lists depends on the requirements. ArrayList or Vector( if you need synchronization ) could be used when you need to add elements at the end of the list and access elements randomly - more access operations than add operations. Whereas a LinkedList should be used when you need to do a lot of add/delete (elements) operations from the head or the middle of the list and your access operations are comparatively less.

Efficient java data structure to delete and retrieve information?

I have a situation where I have need a data structure that I can add strings to. This data structure is very large.
The specific qualities I need it have are:
get(index)
delete a certain number of entries that were added initially when the limit exceeds.(LIFO)
I've tried using an ArrayList but the delete operation is o(n) and for a linkedList the traverse or get() operation will be o(n).
What other options do I have?

circular buffer - one thats implemented with an array under the hood.

LinkedHashSet might be of interest. It is effectively a HashSet but it also maintains a LinkedList to allow a predictable iteration order - and therefore can also be used as a FIFO queue, with the nice added benefit that it can't contain duplicate entries.
Because it is a HashSet too, searches (as opposed to scans) can be O(1) if they can match on equals()
You can have a look at this question and this too.

Randomly getting elements in a HashMap or HashSet without looping

I have roughly 420,000 elements that I need to store easily in a Set or List of some kind. The restrictions though is that I need to be able to pick a random element and that it needs to be fast.
Initially I used an ArrayList and a LinkedList, however with that many elements it was very slow. When I profiled it, I saw that the equals() method in the object I was storing was called roughly 21 million times in a very short period of time.
Next I tried a HashSet. What I gain in performance I loose in functionality: I can't pick a random element. HashSet is backed by a HashMap which is backed by an array of HashMap.Entry objects. However when I attempted to expose them I was hindered by the crazy private and package-private visibility of the entire Java Collections Framework (even copying and pasting the class didn't work, the JCF is very "Use what we have or roll your own").
What is the best way to randomly select an element stored in a HashSet or HashMap? Due to the size of the collection I would prefer not to use looping.
IMPORTANT EDIT: I forgot a really important detail: exactly how I use the Collection. I populate the entire Collection at the begging of the table. During the program I pick and remove a random element, then pick and remove a few more known elements, then repeat. The constant lookup and changing is what causes the slowness

There's no reason why an ArrayList or a LinkedList would need to call equals()... although you don't want a LinkedList here as you want quick random access by index.
An ArrayList should be ideal - create it with an appropriate capacity, add all the items to it, and then you can just repeatedly pick a random number in the appropriate range, and call get(index) to get the relevant value.
HashMap and HashSet simply aren't suitable for this.

If ALL you need to do is get a large collection of values and pick a random one, then ArrayList is (literally) perfect for your needs. You won't get significantly faster (unless you went directly to primitive array, where you lose benefits of abstraction.)
If this is too slow for you, it's because you're using other operations as well. If you update your question with ALL the operations the collection must service, you'll get a better answer.

If you don't call contains() (which will call equals() many times), you can use ArrayList.get(randomNumber) and that will be O(1)
You can't do it with a HashMap - it stores the objects internally in an array, where the index = hashcode for the object. Even if you had that table, you'd need to guess which buckets contain objects. So a HashMap is not an option for random access.

Assuming that equals() calls are because you sort out duplicates with contains(), you may want to keep both a HashSet (for quick if-already-present lookup) and an ArrayList (for quick random access). Or, if operations don't interleave, build a HashSet first, then extract its data with toArray() or transform it into ArrayList with constructor of the latter.
If your problems are due to remove() call on ArrayList, don't use it and instead:
if you remove not the last element, just replace (with set()) the removed element with the last;
shrink the list size by 1.
This will of course screw up element order, but apparently you don't need it, judging by description. Or did you omit another important detail?

Most Lightweight Java Collection

If I am going to create a Java Collection, and only want to fill it with elements, and then iterate through it (without knowing the necessary size beforehand), i.e. all I need is Collection<E>.add(E) and Collection<E>.iterator(), which concrete class should I choose? Is there any advantage to using a Set rather than a List, for example? Which one would have the least overhead?

which concrete class should I choose?
I would probably just go with an ArrayList or a LinkedList. Both support the add and iterator methods, and neighter of them have any considerable overhead.
Is there any advantage to using a Set rather than a List, for example?
No, I wouldn't say so. (Unless you rely on the order of the elements, in which case you must use a List, or want to disallow duplicates, in which case you should use a Set.)
(I don't see how any Set implementation could beat a list implementation for add / iterator methods, so I'd probably go with a List even if I don't care about order.)
Which one would have the least overhead?
Sounds like micro benchmarking here, but if I'd be forced to guess, I'd say ArrayList (or perhaps LinkedList in coner cases where ArrayLists need to reallocate memory often :-)

Do not go with a Set. Sets and Lists differ according to their purpose, that you should always consider when choosing the right Collection
a List is there for maintaining elements in the order you added them; and if you insert the same element twice it will be kept twice
a Set is there for holding one specific element exactly once (uniqueness); order is only relevant for specific implementations (like TreeSet), but still elements that are 'the same' would not be added twice

Set is only meaningful if you want to sort your objects and to make sure no duplicate element is 'registered'. Else, an ArrayList is just fine.
However, if you want to add elements while iterating too, an ArrayBlockingQueue is better.

Here are some key points which can help you to choose your collection according to your requirement -
List(ArrayList or LinkedList)
Allowed duplicate values.
Insertion order preserved.
Set
Not allowed duplicate values.
Insertion order is not preserved.
So according to your requirement List seems to be a suitable choice.
Now Between ArrayList and LinkedList -
ArrayList is a random access list. Use if your frequent operation is the retrieval of elements.
LinkedList is the best option if you want to add or remove elements from the list.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.