Most Efficient but Thread-Safe List/Set

Most Efficient but Thread-Safe List/Set - java

Java has tons of different Collections designed for concurrency and thread safety, and I'm at a loss as to which one to choose for my situation.
Multiple threads may be calling .add() and .remove(), and I will be copying this list frequently with something like List<T> newList = new ArrayList<T>(concurrentList). I will never be looping over the concurrent list.
I thought about something like CopyOnWriteArrayList, but I've read that it can be very inefficient because it copies itself every time it's modified. I'm hoping to find a good compromise between safety and efficiency.
What is the best list (or set) for this situation?

As #SpiderPig said, the best case scenario with a List would be an immutable, singly-linked list.
However, looking at what's being done here, a List is unnecessary (#bhspencer's comment). A ConcurrentSkipListSet will work most efficiently (#augray).
This Related Thread's accepted answer offers more insight on the pros and cons of different concurrent collections.

You might want to look into whether a ctrie would be appropriate for your use case - it has thread-safe add and remove operations, and "copying" (in actuality, taking a snapshot of) the data structure runs in O(1). I'm aware of two JVM implementations of the data structure: implementation one, implementation two.

Collections.newSetFromMap(new ConcurrentHashMap<...>())
This is typically how a normal Set is done (HashSet is really a modified wrapper over HashMap). It offers both the advantages of performance/concurrecy from ConcurrentHashMap, and does not have extra features like ConcurrentSkipListSet (ordering), COW lists (copying every modification), or concurrent queues (FIFO/LIFO ordering).
Edit: I didn't see #bhspencer's comment on the original post, apologies for stealing the spotlight.

Hashset being hashing based would be better than List.
Add last and remove first will be good with LinkedList.
Search will be fast in arraylist being array index based.
Thanks,

Related

Why doesn't LinkedHashSet have addFirst method?

As the documentation of LinkedHashSet states, it is
Hash table and linked list implementation of the Set interface, with
predictable iteration order. This implementation differs from HashSet
in that it maintains a doubly-linked list running through all of its
entries.
So it's essentially a HashSet with FIFO queue of keys implemented by a linked list. Considering that LinkedList is Deque and permits, in particular, insertion at the beginning, I wonder why doesn't LinkedHashSet have the addFirst(E e) method in addition to the methods present in the Set interface. It seems not hard to implement this.

As Eliott Frisch said, the answer is in the next sentence of the paragraph you quoted:
… This linked list defines the iteration ordering, which is the order
in which elements were inserted into the set (insertion-order). …
An addFirst method would break the insertion order and thereby the design idea of LinkedHashSet.
If I may add a bit of guesswork too, other possible reasons might include:
It’s not so simple to implement as it appears since a LinkedHashSet is really implemented as a LinkedHasMap where the values mapped to are not used. At least you would have to change that class too (which in turn would also break its insertion order and thereby its design idea).
As that other guy may have intended in a comment, they didn’t find it useful.
That said, you are asking the question the wrong way around. They designed a class with a functionality for which they saw a need. They moved on to implement it using a hash table and a linked list. You are starting out from the implementation and using it as a basis for a design discussion. While that may occasionally add something useful, generally it’s not the way to good designs.
While I can in theory follow your point that there might be a situation where you want a double-ended queue with set property (duplicates are ignored/eliminated), I have a hard time imagining when a Deque would not fulfil your needs in this case (Eliott Frisch mentioned the under-used ArrayDeque). You need pretty large amounts of data and/or pretty strict performance requirements before the linear complexity of contains and remove would be prohibitive. And in that case you may already be better off custom designing your own data structure.

Vector vs SynchronizedList performance

While reading Oracle tutorial on collections implementations, i found the following sentence :
If you need synchronization, a Vector will be slightly faster than an ArrayList synchronized with Collections.synchronizedList
source : List Implementations
but when searching for difference between them, many people discourage the using of Vector and should be replaced by SynchronizedList when the synchronization is needed.
So which side has right to be followed ?

When you use Collections.synchronizedList(new ArrayList<>()) you are separating the two implementation details. It’s clear, how you could change the underlying storage model from “array based” to, e.g. “linked nodes”, by simply replacing new ArrayList with new LinkedList without changing the synchronized decoration.
The statement that Vector “will be slightly faster” seems to be based on the fact that its use does not bear a delegation between the wrapper and the underlying storage, but deriving statements about the performance from that was even questionable by the time, when ArrayList and the synchronizedList wrapper were introduced.
It should be noted that when you are really concerned about the performance of a list accessed by multiple threads, you will use neither of these two alternatives. The idea of making a storage thread safe by making all access methods synchronized is flawed right from the start. Every operation that involves multiple access to the list, e.g. simple constructs like if(!list.contains(o)) list.add(o); or iterating over the list or even a simple Collections.swap(list, i, j); require additional manual synchronization to work correctly in a multi-threaded setup.
If you think it over, you will realize that most operations of a real life application consist of multiple access and therefore will require careful manual locking and the fact that every low level access method synchronizes additionally can not only take away performance, it’s also a disguise pretending a safety that isn’t there.

Vector is an old API, of course. No questions there.
For speed though, it might be only because the synchronized list involves extra method calls to reach and return the data since it is a "wrapper" on top of a list after all. That's all there is to it, imho.

ArrayDeque and LinkedBlockingDeque

Just wondering why they made a LinkedBlockingDeque while the same non concurrent counterpart is an ArrayDeque which is backed on a resizable array.
LinkedBlockingQueue use a set of nodes like a LinkedList (even though not implementing List).
I am aware of the possibility to use an ArrayBlockingQueue but what if one wanted to use an ArrayBlockingDeque?
Why is there not such an option?
Thanks in advance.

This May not be a proper question w.r.t stackoverflow.
But i would Like to say something about these implementations.
-> First thing We need to answer why we give Different implementations for a certain Interface.
Suppose we have a interface A and There are two implementations of that Let say B and C.
Now suppose these implementations offer same functionality through their implementation.
But B has a better performance than C.
Then we should remove the Implementation except for two reason
1. Backward Compatibility - marking as deprecated.
2. There is a specific scenario where we cannot use B implementation.
For example:
HashMap and LinkedHashMap ->
If i need ordered Keys i will use LinkedHashMap otherwise I will Use HashMap (for a little performance gain).
ArrayBlockingQueue vs LinkedBlockingQueue
if i need Bounded Queue i will use ArrayBlockingQueue otherwise i will use LinkedBlockingQueue.
Now Your question Why there is no ArrayBlockingDeque while LinkedBlockingDeque exist.
First Let see why ArrayDeque exist.
From Java docs of ArrayDeque.
This class is likely to be faster than
{#link Stack} when used as a stack, and faster than {#link LinkedList}
when used as a queue.
Also note ArrayDeque doesn’t have Capacity Restriction.
Also it has Two pointers for head and tail as use use in LinkedList implementation.
Therefore if there would have been ArrayBlockingDeque.
1. There would have been no Capacity Restriction, which we normally
get from ArrayBlockingQueue.
2. There would have guards to access to tail and head pointers
same as in LinkedBlockingDeque and therefore no significant performance
gain over LinkedBlockingDeque.
So to conclude there is no ArrayBlockingDeque implementation because there in nothing
extra this implementation could provide over LinkedBlockingDeque. If you can prove there is
something then yes, implementation need to be there :)

Why is Java Vector (and Stack) class considered obsolete or deprecated?

Why is Java Vector considered a legacy class, obsolete or deprecated?
Isn't its use valid when working with concurrency?
And if I don't want to manually synchronize objects and just want to use a thread-safe collection without needing to make fresh copies of the underlying array (as CopyOnWriteArrayList does), then is it fine to use Vector?
What about Stack, which is a subclass of Vector, what should I use instead of it?

Vector synchronizes on each individual operation. That's almost never what you want to do.
Generally you want to synchronize a whole sequence of operations. Synchronizing individual operations is both less safe (if you iterate over a Vector, for instance, you still need to take out a lock to avoid anyone else changing the collection at the same time, which would cause a ConcurrentModificationException in the iterating thread) but also slower (why take out a lock repeatedly when once will be enough)?
Of course, it also has the overhead of locking even when you don't need to.
Basically, it's a very flawed approach to synchronization in most situations. As Mr Brian Henk pointed out, you can decorate a collection using the calls such as Collections.synchronizedList - the fact that Vector combines both the "resized array" collection implementation with the "synchronize every operation" bit is another example of poor design; the decoration approach gives cleaner separation of concerns.
As for a Stack equivalent - I'd look at Deque/ArrayDeque to start with.

Vector was part of 1.0 -- the original implementation had two drawbacks:
1. Naming: vectors are really just lists which can be accessed as arrays, so it should have been called ArrayList (which is the Java 1.2 Collections replacement for Vector).
2. Concurrency: All of the get(), set() methods are synchronized, so you can't have fine grained control over synchronization.
There is not much difference between ArrayList and Vector, but you should use ArrayList.
From the API doc.
As of the Java 2 platform v1.2, this
class was retrofitted to implement the
List interface, making it a member of
the Java Collections Framework. Unlike
the new collection implementations,
Vector is synchronized.

Besides the already stated answers about using Vector, Vector also has a bunch of methods around enumeration and element retrieval which are different than the List interface, and developers (especially those who learned Java before 1.2) can tend to use them if they are in the code. Although Enumerations are faster, they don't check if the collection was modified during iteration, which can cause issues, and given that Vector might be chosen for its syncronization - with the attendant access from multiple threads, this makes it a particularly pernicious problem. Usage of these methods also couples a lot of code to Vector, such that it won't be easy to replace it with a different List implementation.

You can use the synchronizedCollection/List method in java.util.Collection to get a thread-safe collection from a non-thread-safe one.

java.util.Stack inherits the synchronization overhead of java.util.Vector, which is usually not justified.
It inherits a lot more than that, though. The fact that java.util.Stack extends java.util.Vector is a mistake in object-oriented design. Purists will note that it also offers a lot of methods beyond the operations traditionally associated with a stack (namely: push, pop, peek, size). It's also possible to do search, elementAt, setElementAt, remove, and many other random-access operations. It's basically up to the user to refrain from using the non-stack operations of Stack.
For these performance and OOP design reasons, the JavaDoc for java.util.Stack recommends ArrayDeque as the natural replacement. (A deque is more than a stack, but at least it's restricted to manipulating the two ends, rather than offering random access to everything.)

Why not always use ArrayLists in Java, instead of plain ol' arrays?

Quick question here: why not ALWAYS use ArrayLists in Java? They apparently have equal access speed as arrays, in addition to extra useful functionality. I understand the limitation in that it cannot hold primitives, but this is easily mitigated by use of wrappers.

Plenty of projects do just use ArrayList or HashMap or whatever to handle all their collection needs. However, let me put one caveat on that. Whenever you are creating classes and using them throughout your code, if possible refer to the interfaces they implement rather than the concrete classes you are using to implement them.
For example, rather than this:
ArrayList insuranceClaims = new ArrayList();
do this:
List insuranceClaims = new ArrayList();
or even:
Collection insuranceClaims = new ArrayList();
If the rest of your code only knows it by the interface it implements (List or Collection) then swapping it out for another implementation becomes much easier down the road if you find you need a different one. I saw this happen just a month ago when I needed to swap out a regular HashMap for an implementation that would return the items to me in the same order I put them in when it came time to iterate over all of them. Fortunately just such a thing was available in the Jakarta Commons Collections and I just swapped out A for B with only a one line code change because both implemented Map.

If you need a collection of primitives, then an array may well be the best tool for the job. Boxing is a comparatively expensive operation. For a collection (not including maps) of primitives that will be used as primitives, I almost always use an array to avoid repeated boxing and unboxing.
I rarely worry about the performance difference between an array and an ArrayList, however. If a List will provide better, cleaner, more maintainable code, then I will always use a List (or Collection or Set, etc, as appropriate, but your question was about ArrayList) unless there is some compelling reason not to. Performance is rarely that compelling reason.
Using Collections almost always results in better code, in part because arrays don't play nice with generics, as Johannes Weiß already pointed out in a comment, but also because of so many other reasons:
Collections have a very rich API and a large variety of implementations that can (in most cases) be trivially swapped in and out for each other
A Collection can be trivially converted to an array, if occasional use of an array version is useful
Many Collections grow more gracefully than an array grows, which can be a performance concern
Collections work very well with generics, arrays fairly badly
As TofuBeer pointed out, array covariance is strange and can act in unexected ways that no object will act in. Collections handle covariance in expected ways.
arrays need to be manually sized to their task, and if an array is not full you need to keep track of that yourself. If an array needs to be resized, you have to do that yourself.
All of this together, I rarely use arrays and only a little more often use an ArrayList. However, I do use Lists very often (or just Collection or Set). My most frequent use of arrays is when the item being stored is a primitive and will be inserted and accessed and used as a primitive. If boxing and unboxing every become so fast that it becomes a trivial consideration, I may revisit this decision, but it is more convenient to work with something, to store it, in the form in which it is always referenced. (That is, 'int' instead of 'Integer'.)

This is a case of premature unoptimization :-). You should never do something because you think it will be better/faster/make you happier.
ArrayList has extra overhead, if you have no need of the extra features of ArrayList then it is wasteful to use an ArrayList.
Also for some of the things you can do with a List there is the Arrays class, which means that the ArrayList provided more functionality than Arrays is less true. Now using those might be slower than using an ArrayList, but it would have to be profiled to be sure.
You should never try to make something faster without being sure that it is slow to begin with... which would imply that you should go ahead and use ArrayList until you find out that they are a problem and slow the program down. However there should be common sense involved too - ArrayList has overhead, the overhead will be small but cumulative. It will not be easy to spot in a profiler, as all it is is a little overhead here, and a little overhead there. So common sense would say, unless you need the features of ArrayList you should not make use of it, unless you want to die by a thousands cuts (performance wise).
For internal code, if you find that you do need to change from arrays to ArrayList the chance is pretty straight forward in most cases ([i] becomes get(i), that will be 99% of the changes).
If you are using the for-each look (for( value : items) { }) then there is no code to change for that as well.
Also, going with what you said:
1) equal access speed, depending on your environment. For instance the Android VM doesn't inline methods (it is just a straight interpreter as far as I know) so the access on that will be much slower. There are other operations on an ArrayList that can cause slowdowns, depends on what you are doing, regardless of the VM (which could be faster with a stright array, again you would have to profile or examine the source to be sure).
2) Wrappers increase the amount of memory being used.
You should not worry about speed/memory before you profile something, on the other hand you shouldn't choose what you know to be a slower option unless you have a good reason to.

Performance should not be your primary concern.
Use List interface where possible, choose concrete implementation based on actual requirements (ArrayList for random access, LinkedList for structural modifications, ...).
You should be concerned about performance.
Use arrays, System.arraycopy, java.util.Arrays and other low-level stuff to squeeze out every last drop of performance.

Well don't always blindly use something that is not right for the job. Always start off using Lists, choose ArrayList as your implementation. This is a more OO approach. If you don't know that you specifically need an array, you'll find that not tying yourself to a particular implementation of List will be much better for you in the long run. Get it working first, optimize later.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.