Efficient java data structure to delete and retrieve information?

I have a situation where I need a data structure that I can add strings to. This data structure is very large.
The specific qualities I need it to have are:
get(index)
delete a certain number of the entries that were added first when the size exceeds a limit (i.e. the oldest entries go first, FIFO)
I've tried using an ArrayList, but the delete operation is O(n), and for a LinkedList the traversal or get() operation is O(n).
What other options do I have?

A circular buffer: one that's implemented with an array under the hood.
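A minimal sketch of that idea, assuming a fixed capacity chosen up front. StringRingBuffer and its method names are made up for illustration and are not a JDK class. Both add() and get(index) are O(1), and dropping the oldest entries only costs as much as the number of entries dropped:

```java
// Hypothetical fixed-capacity ring buffer of strings (illustration only).
class StringRingBuffer {
    private final String[] items;
    private int head = 0; // array slot holding the oldest element
    private int size = 0; // number of elements currently stored

    StringRingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        items = new String[capacity];
    }

    // Appends a value in O(1); when the buffer is full, the oldest entry is overwritten.
    void add(String value) {
        if (size == items.length) {
            items[head] = value;
            head = (head + 1) % items.length;
        } else {
            items[(head + size) % items.length] = value;
            size++;
        }
    }

    // Returns the element at a logical index in O(1), where index 0 is the oldest entry.
    String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[(head + index) % items.length];
    }

    // Drops the `count` oldest entries; nothing else is shifted.
    void removeOldest(int count) {
        if (count < 0 || count > size) {
            throw new IllegalArgumentException("count " + count + ", size " + size);
        }
        for (int i = 0; i < count; i++) {
            items[head] = null; // release the reference for the garbage collector
            head = (head + 1) % items.length;
        }
        size -= count;
    }

    int size() {
        return size;
    }
}
```

If the limit is not known in advance, java.util.ArrayDeque gives the same O(1) behaviour at both ends, but it has no get(index), which is why a hand-rolled buffer is sketched here.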

LinkedHashSet might be of interest. It is effectively a HashSet but it also maintains a LinkedList to allow a predictable iteration order - and therefore can also be used as a FIFO queue, with the nice added benefit that it can't contain duplicate entries.
Because it is a HashSet too, searches (as opposed to scans) can be O(1) if they can match on equals()
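A rough sketch of using LinkedHashSet as a bounded FIFO queue without duplicates. BoundedUniqueQueue is a hypothetical wrapper, not a library class. Note that this gives O(1) add, contains and eviction of the oldest entry, but no get(index):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical wrapper (illustration only): a duplicate-free FIFO with a size limit.
class BoundedUniqueQueue {
    private final LinkedHashSet<String> entries = new LinkedHashSet<>();
    private final int limit;

    BoundedUniqueQueue(int limit) {
        this.limit = limit;
    }

    // Adds a value; returns false for a duplicate (its position is left unchanged).
    // When the limit is exceeded, the oldest entry is evicted.
    boolean add(String value) {
        boolean added = entries.add(value);
        if (entries.size() > limit) {
            Iterator<String> it = entries.iterator();
            it.next();   // the first element in iteration order is the oldest insertion
            it.remove();
        }
        return added;
    }

    boolean contains(String value) {
        return entries.contains(value);
    }

    // Copies the current contents, oldest first.
    List<String> snapshot() {
        return new ArrayList<>(entries);
    }
}
```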

Related

Java HashSet vs Array Performance

I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are (and the number won't change), and was wondering whether Array would have a notable performance advantage over HashSet for storing/retrieving said elements.
On paper, Array guarantees constant time insertion (since I know the size ahead of time) and retrieval, but the code for HashSet looks much cleaner and adds some flexibility, so I'm wondering if I'm losing anything performance-wise using it, at least, theoretically.

It depends on your data:
HashSet gives you an O(1) contains() method but doesn't preserve order.
ArrayList contains() is O(n), but you can control the order of the entries.
Array: if you need to insert anything in between, the worst case can be O(n), since you will have to move the data down to make room for the insertion. With a Set, you can use a SortedSet implementation such as TreeSet directly, which keeps its elements ordered with O(log n) insertion and offers more flexible operations.
I believe Set is more flexible.

The choice greatly depends on what you want to do with it.
If it is what you mentioned in your question:
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are
If this is what you need to do, then you need neither of them. Collection has a size() method that tells you how many elements are in the collection.
If what you mean by "collection of objects" is not really a collection yet, and you need to choose a type of collection to store your objects for further processing, then you need to know that different kinds of collections have different capabilities and characteristics.
First, I believe that for a fair comparison you should consider using ArrayList instead of an array, so that you don't need to deal with reallocation yourself.
Then it becomes a choice of ArrayList vs HashSet, which is quite straightforward:
Do you need a List or a Set? They serve different purposes: Lists provide indexed access, and iteration is in index order, while Sets are mainly for keeping a distinct set of data and, given their nature, don't offer indexed access.
After you have decided between List and Set, it is a choice of implementation: for Lists, you normally choose between ArrayList and LinkedList, while for Sets, you choose between HashSet and TreeSet.
All these choices depend on what you want to do with that collection of data. They perform differently for different operations.
For example, indexed access in an ArrayList is O(1); in a HashSet (though not meaningful) it is O(n), since you have to iterate (just for your interest, in a LinkedList it is O(n), and in a TreeSet it is also O(n) by iteration).
For adding a new element, both ArrayList and HashSet are amortized O(1). Inserting in the middle is O(n) for ArrayList, while it doesn't make sense for HashSet. Both will suffer from reallocation, and both need O(n) for it (HashSet is normally slower at reallocating, because every element has to be placed into a bucket of the new table again).
To find if certain element exists in the collection, ArrayList is O(n) and HashSet is O(1).
There are still lots of operations you can do, so it is quite meaningless to discuss for performance without knowing what you want to do.

Theoretically, and as the SCJP6 Study Guide says :D
arrays are faster than collections, and, as said, most collections are themselves built on arrays (Maps are not considered Collections, but they are included in the Collections framework).
If you can guarantee that the number of elements won't change, why get stuck with objects built on objects (collections built on arrays) when you can use the underlying objects (arrays) directly?

It looks like you want a HashMap that maps IDs to counts. In particular:
HashMap<Integer,Integer> counts = new HashMap<Integer,Integer>();
// merge() starts the count at 1 when the ID is absent; counts.get(uniqueID)+1 would throw a NullPointerException in that case
counts.merge(uniqueID, 1, Integer::sum);
This way, you get amortized O(1) adds, contains and retrievals. Essentially, an array with unique IDs associated with each object IS a HashMap. By using the HashMap, you get the added bonus of not having to manage the size of the array, not having to map the keys to an array index yourself, AND constant access time.
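For the lookup-by-ID case in the question itself, the two options can be sketched side by side. Item and IdLookupDemo are hypothetical names, and the array variant assumes the IDs are exactly 0..n-1, which the question does not state:

```java
import java.util.HashMap;
import java.util.Map;

class IdLookupDemo {
    // Hypothetical element type carrying a unique integer ID.
    static final class Item {
        final int id;
        final String name;

        Item(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Array variant: only valid under the assumption that IDs are dense (0..n-1).
    static Item[] indexByArray(Item[] items) {
        Item[] byId = new Item[items.length];
        for (Item item : items) {
            byId[item.id] = item;
        }
        return byId;
    }

    // Map variant: works for arbitrary IDs, at the cost of boxing and hashing.
    static Map<Integer, Item> indexByMap(Item[] items) {
        Map<Integer, Item> byId = new HashMap<>();
        for (Item item : items) {
            byId.put(item.id, item);
        }
        return byId;
    }
}
```

Both lookups are constant time; the array avoids boxing the key, while the map tolerates sparse or unknown ID ranges.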

Randomly getting elements in a HashMap or HashSet without looping

I have roughly 420,000 elements that I need to store easily in a Set or List of some kind. The restriction, though, is that I need to be able to pick a random element, and it needs to be fast.
Initially I used an ArrayList and a LinkedList, however with that many elements it was very slow. When I profiled it, I saw that the equals() method in the object I was storing was called roughly 21 million times in a very short period of time.
Next I tried a HashSet. What I gain in performance I lose in functionality: I can't pick a random element. HashSet is backed by a HashMap, which is backed by an array of HashMap.Entry objects. However, when I attempted to expose them I was hindered by the crazy private and package-private visibility of the entire Java Collections Framework (even copying and pasting the class didn't work; the JCF is very "Use what we have or roll your own").
What is the best way to randomly select an element stored in a HashSet or HashMap? Due to the size of the collection I would prefer not to use looping.
IMPORTANT EDIT: I forgot a really important detail: exactly how I use the Collection. I populate the entire Collection at the beginning of the table. During the program I pick and remove a random element, then pick and remove a few more known elements, then repeat. The constant lookup and changing is what causes the slowness.

There's no reason why an ArrayList or a LinkedList would need to call equals()... although you don't want a LinkedList here as you want quick random access by index.
An ArrayList should be ideal - create it with an appropriate capacity, add all the items to it, and then you can just repeatedly pick a random number in the appropriate range, and call get(index) to get the relevant value.
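Something along these lines, as a sketch (RandomPick and pickRandom are names invented for the example):

```java
import java.util.List;
import java.util.Random;

class RandomPick {
    // Returns a uniformly chosen element; O(1) for an ArrayList because get(index) is O(1).
    static <T> T pickRandom(List<T> list, Random rng) {
        if (list.isEmpty()) {
            throw new IllegalArgumentException("list is empty");
        }
        return list.get(rng.nextInt(list.size()));
    }
}
```

Creating the list with new ArrayList<>(420_000) up front avoids repeated internal reallocation while it is being filled.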
HashMap and HashSet simply aren't suitable for this.

If ALL you need to do is get a large collection of values and pick a random one, then ArrayList is (literally) perfect for your needs. You won't get significantly faster (unless you went directly to primitive array, where you lose benefits of abstraction.)
If this is too slow for you, it's because you're using other operations as well. If you update your question with ALL the operations the collection must service, you'll get a better answer.

If you don't call contains() (which will call equals() many times), you can use ArrayList.get(randomNumber), and that will be O(1).
You can't do it with a HashMap: it stores the entries internally in an array of buckets, where the index is derived from the object's hash code. Even if you had access to that table, you'd need to guess which buckets contain objects. So a HashMap is not an option for random access.

Assuming that equals() calls are because you sort out duplicates with contains(), you may want to keep both a HashSet (for quick if-already-present lookup) and an ArrayList (for quick random access). Or, if operations don't interleave, build a HashSet first, then extract its data with toArray() or transform it into ArrayList with constructor of the latter.
If your problems are due to calling remove() on elements in the middle of the ArrayList, don't do that and instead:
if the element to remove is not the last one, replace it (with set()) with the last element;
then remove the last element, which shrinks the list by 1 without shifting anything.
This will of course screw up element order, but apparently you don't need it, judging by description. Or did you omit another important detail?
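The swap-with-last trick can be sketched like this (SwapRemove and its method names are hypothetical). Removing the last element of an ArrayList shifts nothing, so the whole operation is O(1):

```java
import java.util.List;
import java.util.Random;

class SwapRemove {
    // Removes and returns the element at `index` in O(1); element order is not preserved.
    static <T> T removeAt(List<T> list, int index) {
        int last = list.size() - 1;
        T removed = list.get(index); // throws IndexOutOfBoundsException for a bad index
        if (index != last) {
            list.set(index, list.get(last)); // move the last element into the gap
        }
        list.remove(last); // int argument, so this is remove(int index) on the tail
        return removed;
    }

    // Picks a random position and removes it with the trick above.
    static <T> T removeRandom(List<T> list, Random rng) {
        return removeAt(list, rng.nextInt(list.size()));
    }
}
```

Removing a known element (rather than a known index) would additionally need a way to find its index quickly, for example a HashMap from element to index that is updated on every swap.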

Most Lightweight Java Collection

If I am going to create a Java Collection, and only want to fill it with elements, and then iterate through it (without knowing the necessary size beforehand), i.e. all I need is Collection<E>.add(E) and Collection<E>.iterator(), which concrete class should I choose? Is there any advantage to using a Set rather than a List, for example? Which one would have the least overhead?

which concrete class should I choose?
I would probably just go with an ArrayList or a LinkedList. Both support the add and iterator methods, and neither of them has any considerable overhead.
Is there any advantage to using a Set rather than a List, for example?
No, I wouldn't say so. (Unless you rely on the order of the elements, in which case you must use a List, or want to disallow duplicates, in which case you should use a Set.)
(I don't see how any Set implementation could beat a list implementation for add / iterator methods, so I'd probably go with a List even if I don't care about order.)
Which one would have the least overhead?
Sounds like micro-benchmarking here, but if I were forced to guess, I'd say ArrayList (or perhaps LinkedList in corner cases where the ArrayList needs to reallocate memory often :-)

Do not go with a Set. Sets and Lists differ according to their purpose, that you should always consider when choosing the right Collection
a List is there for maintaining elements in the order you added them; and if you insert the same element twice it will be kept twice
a Set is there for holding one specific element exactly once (uniqueness); order is only relevant for specific implementations (like TreeSet), but still elements that are 'the same' would not be added twice

Set is only meaningful if you want to sort your objects and to make sure no duplicate element is 'registered'. Else, an ArrayList is just fine.
However, if you want to add elements while iterating too, an ArrayBlockingQueue is better.

Here are some key points which can help you choose a collection according to your requirements:
List (ArrayList or LinkedList)
Duplicate values are allowed.
Insertion order is preserved.
Set
Duplicate values are not allowed.
Insertion order is not preserved in general (HashSet does not keep it; LinkedHashSet does).
So according to your requirements, a List seems to be a suitable choice.
Now, between ArrayList and LinkedList:
ArrayList is a random-access list. Use it if your most frequent operation is retrieving elements.
LinkedList is an option if you mostly add or remove elements at the ends of the list or through an iterator.

TreeMap or HashMap? [duplicate]

When should I use HashMaps or TreeMaps?
I know that I can use a TreeMap to iterate over the elements when I need them to be sorted.
But is that all? Is there no optimization when I just want to look things up in the maps, or are there specific uses where one is optimal?

TreeMap provides guaranteed O(log n) lookup time (and insertion etc), whereas HashMap provides O(1) lookup time if the hash code disperses keys appropriately.
Unless you need the entries to be sorted, I'd stick with HashMap. Or there's ConcurrentHashMap of course. I can't remember the details of the differences between all of them, but HashMap is a perfectly reasonable "default" option :)
For completeness, I should point out that there was a discussion on Stack Overflow a month or so ago about the internals of various maps. See the comments in this question, which I will copy into this answer if bestsss is happy for me to do so.

Hashtables (usually) perform search operations (lookups) in between O(1) in the best case and O(n) in the worst case, with an average-case complexity of O(1 + n/k), where k is the number of buckets; binary search trees (BSTs), however, perform lookups in between O(log_2(n)) and O(n) (the worst case being an unbalanced tree), with an average-case complexity of O(log_2(n)). The implementation of each (and every) data structure should be known (by you), to understand the advantages, drawbacks, time complexity of operations, and code complexity.
For example, a hashtable often has some fixed number of buckets (some of which may not be filled at all) with lists of collisions. Trees, on the other hand, usually have two pointers (references) per node, but this can be more if the implementation allows more than two child nodes per node. This allows the tree to grow as nodes are added, but it may not allow duplicates. (The default implementation of a Java TreeMap does not allow duplicate keys.)
There are special cases to consider as well. For example, what if the number of elements in a particular data structure increases without bound or approaches the limit of an underlying part of the data structure? What about amortized operations that perform some rebalancing or cleanup operation?
For example, in a hashtable, when the number of elements in the table becomes sufficiently large, an arbitrary number of collisions can occur. On the other hand, trees usually require some rebalancing procedure after an insertion (or deletion).
So, if you have something like a cache (e.g. the number of elements is bounded, or the size is known), then a hashtable is probably your best bet; however, if you have something more like a dictionary (e.g. populated once and looked up many times), then I'd use a tree.
This is only the general case, however (no information was given). You have to understand which processes happen and how they happen to make the right choice of data structure.
When I need a multi-map (ranged lookup) or sorted flattening of a collection, then it can't be a hashtable.
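As an illustration of the ranged-lookup point, here is a small sketch using TreeMap's NavigableMap methods (subMap and floorKey are real JDK methods; RangeLookupDemo and the helper names are invented for the example). A HashMap has no equivalent without scanning every key:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

class RangeLookupDemo {
    // Returns a view of all entries with from <= key <= to, in sorted key order.
    static NavigableMap<Integer, String> between(TreeMap<Integer, String> map, int from, int to) {
        return map.subMap(from, true, to, true);
    }

    // Returns the greatest key <= target, or null if there is none; O(log n).
    static Integer nearestAtOrBelow(TreeMap<Integer, String> map, int target) {
        return map.floorKey(target);
    }
}
```

For instance, with keys 10, 20, 30 and 40, the range 15 to 30 contains the keys 20 and 30, and the nearest key at or below 25 is 20.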

The largest difference between the two is the underlying structure used in the implementation.
HashMaps use an array and a hashing function to store elements. When you try to insert or delete an item, the hashing function converts the key into an index in the array where the object is or should be stored (ignoring collisions). While HashMaps are generally very fast because they don't need to iterate over large amounts of data, they slow down when they fill up, because they then need to copy all the keys and values into a new, larger array.
TreeMaps store the data in a sorted tree structure. While this means that they'll never have to allocate more space and copy everything over to it, operations require that part of the data already stored be traversed, and rebalancing sometimes changes parts of the structure.
Out of the two, HashMaps will generally have better performance when you don't need sorting.

Inserting new elements into a HashMap will, on average, be a good deal faster than inserting elements into a TreeMap. Unless you need your elements sorted, I'd go with the HashMap.

Don't forget there is also LinkedHashMap which is nearly as fast as HashMap for add/contains/remove operations but also maintains the insertion order.
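A quick sketch of the difference in iteration order (OrderDemo and keysInOrder are made-up names; the map classes are the standard JDK ones):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class OrderDemo {
    // Copies the keys in whatever order the given map iterates them.
    static List<String> keysInOrder(Map<String, Integer> map) {
        return new ArrayList<>(map.keySet());
    }
}
```

Filling a LinkedHashMap with the keys "zebra", "apple", "mango" (in that order) yields them back in insertion order, a TreeMap yields "apple", "mango", "zebra", and a HashMap makes no promise about the order at all.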

Complexity of processing a collection's values

I need to store a large, growing number of objects in a collection. While performing actions on each object of the collection, I regularly need to check whether an object is already stored. If an object is not stored yet, I will add it to the end of the collection. I process each object iteratively while doing the checks.
Objects already processed should not be removed from the collection, because I do not want to put them back into processing when I stumble upon them again.
As a result I do not know what collection may fit best. HashSet has a constant time "contains" method but a List has faster methods to iterate over its elements, right ?
What would be the wiser choice ? Would it be relevant to keep two different structures at a time containing the same nodes, a HashSet for the checks and a LinkedList for the processing ?

As a result I do not know what collection may fit best. HashSet has a constant time "contains" method but a List has faster methods to iterate over its elements, right ?
How about a LinkedHashSet?
Hash table and linked list implementation of the Set interface, with predictable iteration order. This implementation differs from HashSet in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is the order in which elements were inserted into the set (insertion-order)

1) Use ArrayList, not LinkedList. LinkedLists consume a lot of memory, and they are slower to iterate than ArrayLists.
2) I'd suggest using two data structures, if only because you can't add to a collection while iterating through it with an iterator (ConcurrentModificationException).
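A sketch of the two-structure approach, assuming the objects implement equals() and hashCode() consistently. Worklist, processAll and the expand callback are hypothetical names; expand stands in for whatever processing discovers further objects:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class Worklist {
    // Processes every reachable item exactly once and returns the items in processing order.
    static <T> List<T> processAll(Collection<T> seeds, Function<T, Collection<T>> expand) {
        Set<T> seen = new HashSet<>();     // O(1) "already stored?" checks
        List<T> order = new ArrayList<>(); // items in the order they were first stored
        for (T seed : seeds) {
            if (seen.add(seed)) {
                order.add(seed);
            }
        }
        // Index-based loop: appending to the list while looping is safe here,
        // whereas an Iterator would throw ConcurrentModificationException.
        for (int i = 0; i < order.size(); i++) {
            for (T next : expand.apply(order.get(i))) {
                if (seen.add(next)) { // add() returns false if the item was already stored
                    order.add(next);
                }
            }
        }
        return order;
    }
}
```

Nothing is ever removed, so an item that turns up again later is simply skipped by the seen check.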

Well, it seems you are interested in two views on your collection.
A queue like view, adding things to the end and inspecting them at the front.
A contains check
The queue operations are well supported by different kinds of heaps, e.g. java.util.PriorityQueue, though note that PriorityQueue.contains() runs in linear time.
