Best Java Data Structure for Fast, Concurrent Insertions - java

My use case is as follows: I have 10 threads simultaneously writing to one data structure. The order of the elements in the data structure does not matter. All the elements are unique. I will read from this data structure only once, at the very end.
What would be the fastest native Java data structure to suit this purpose? From my reading, it seems Collections.synchronizedList might be the go to option?

I have 10 threads simultaneously writing to one data structure.
I think it would be best to use a separate data structure per thread. That way no synchronisation is needed between the threads, and it would be much more CPU cache friendly too.
At the end they could be joined.
As for the underlying structure: if the elements are of fixed size, an array/vector would be best. Joining them would then only require copying the block of memory they occupy, depending on the implementation, whereas lists would always be slower.
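A minimal sketch of the one-structure-per-thread idea, assuming integer elements and an ExecutorService for the threads (the class name PerThreadLists and the method collect are hypothetical, not from the question). Each task fills a private ArrayList, and the lists are merged only after every task has finished:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadLists {

    // Hypothetical helper: every task writes to its own list, so the write path needs no locking.
    static List<Integer> collect(int threads, int perThread)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<Integer>>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int base = t * perThread;
                tasks.add(() -> {
                    List<Integer> local = new ArrayList<>(perThread);
                    for (int i = 0; i < perThread; i++) {
                        local.add(base + i); // stand-in for the real element
                    }
                    return local;
                });
            }
            // invokeAll waits for all tasks; Future.get() makes each list safely visible here.
            List<Integer> merged = new ArrayList<>(threads * perThread);
            for (Future<List<Integer>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect(10, 1_000).size()); // prints 10000
    }
}
```

The merge is a single-threaded addAll per list, which for ArrayList is essentially an array copy.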

There is no need to synchronize on a list: each thread can work on its own local list, and at the end the results from all threads can be joined into one final list.
If I were on JDK 7 or above, I would use the fork/join framework: create a plain List in each forked task and merge them into the main list in the join phase.
If I were on JDK 6, I could use a CountDownLatch with a count of 10. Each thread writes to its individual list (passed to it by the main controller thread) and then counts down the latch; once all threads are done, the main controller combines all the results into one.
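The CountDownLatch variant could look roughly like the sketch below. The names LatchJoin and run are made up for illustration, the elements are placeholder strings, and the lambda syntax assumes Java 8 or later (on JDK 6 the same logic would use anonymous Runnable classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class LatchJoin {

    // Hypothetical controller: hands each worker its own list, then waits on the latch.
    static List<String> run(int workers, int perWorker) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(workers);
        List<List<String>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            List<String> mine = new ArrayList<>(); // written by exactly one worker
            partials.add(mine);
            final int id = w;
            new Thread(() -> {
                try {
                    for (int i = 0; i < perWorker; i++) {
                        mine.add("w" + id + "-" + i); // placeholder element
                    }
                } finally {
                    done.countDown(); // always signal, even if the work throws
                }
            }).start();
        }
        done.await(); // countDown() happens-before await() returns, so the lists are visible
        List<String> all = new ArrayList<>(workers * perWorker);
        for (List<String> partial : partials) {
            all.addAll(partial);
        }
        return all;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10, 100).size()); // prints 1000
    }
}
```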

Related

Concurrent Queue Data Structure with ArrayList as Element

Problems in Details
Will it cause any issue when using data structure e.g ArrayBlockingQueue<ArrayList<MyClass>>
with multiple threads?
Background
At a high level, I have one producer that produces a giant list. To speed up processing, I decided to use multiple consumers (threads) to consume the giant list produced by the producer.
My Proposal Solution
I will split the giant list into multiple relatively small lists and, to keep this thread-safe, enqueue these smaller lists into a concurrent data structure. In the multi-threaded scenario, each thread then just polls the concurrent queue to get one list and works on it.
Problem Statement
In a multi-threaded scenario, I understand we have to use a concurrent data structure to avoid thread interference and to establish a happens-before relation.
But is it safe to use a non-thread-safe data structure as the element of a thread-safe data structure?
Will it cause any issue when using data structure e.g ArrayBlockingQueue<ArrayList<MyClass>>
with multiple threads?
Will it be any impact to the performance?
There shouldn't be an obvious problem with this approach.
will it be safe that using non-thread-safe data structure as element of thread-safe data structure?
This is safe as long as you properly coordinate (or avoid) concurrent access to the non-thread-safe inner data structure. The ArrayBlockingQueue ensures the happens-before relation is established when you access its elements via one of the peek, poll or related methods.
Will it cause any issue when using data structure e.g ArrayBlockingQueue<ArrayList<MyClass>> with multiple threads?
No, this is what BlockingQueue is intended for as long as you coordinate access to the inner lists (see above).
Will it be any impact to the performance?
In general the approach where the single producer partitions the list into sub-lists might not be optimal. The producer does not / should not know about the number of consumers and their bandwidth and thus in general does not know what partition sizes work well. A better approach might be to use an ArrayBlockingQueue<MyClass> and from the consumer side always consume multiple elements in one go by calling drainTo for a suitable number maxElements of elements.
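As a rough illustration of the drainTo suggestion, the sketch below lets a consumer block for one element and then take up to maxElements - 1 more without blocking. The class and method names (BatchConsumer, nextBatch) are hypothetical, and Integer stands in for MyClass:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BatchConsumer {

    // Waits for at least one element, then drains whatever else is ready, up to maxElements total.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxElements)
            throws InterruptedException {
        List<T> batch = new ArrayList<>(maxElements);
        batch.add(queue.take());               // blocks until the producer has offered something
        queue.drainTo(batch, maxElements - 1); // non-blocking bulk removal
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(100);
        for (int i = 0; i < 10; i++) {
            queue.put(i);
        }
        System.out.println(nextBatch(queue, 4)); // prints [0, 1, 2, 3]
        System.out.println(queue.size());        // prints 6
    }
}
```

With this shape the batch size is chosen by each consumer, so the producer does not need to know how many consumers exist.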
Thanks to michid and Thilo for their answers.
Final Resolution
I ended up using LinkedBlockingQueue<List<MyObjClass>> with multiple child threads polling from the queue. Each child thread takes a list of MyObjClass to work on.
This solution did not slow down performance.
For why I chose LinkedBlockingQueue over ArrayBlockingQueue, see Link

Java Parallel Network Requests and List Updates

I am limited to a 1-core machine on AWS, but after measuring the time to complete all of my http requests and check their results, two of them together require as much time combined to fetch data as the remaining fifty requests (roughly 2 minutes).
I don't want to bloat my code more than I have to, but I know parallelism and asynchrony can seriously cut down the execution time for this task. I want to launch the two big requests on their own threads so they can go out while the others are running, but I store the results of these http requests in a list currently.
Can you access different (guaranteed) elements of a list at the same time as long as the data is initialized beforehand? I've seen the concurrent list and parallel list, but the one isn't parallel, and the other reallocates the entire list on every modification, so neither is a particularly sane option.
What can I do in this situation?
There is no such thing as a concurrent list in Java. I'm assuming that you are referring to a concurrent hash set (using newSetFromMap) and your "parallel list" refers to a CopyOnWriteArrayList.
You most definitely can use the former option to store update data.
A better way to solve your problem of updating data asynchronously is to just simply use a non-thread-safe collection for your worker thread and then push them all at once when you're done to a thread-safe collection that you use to aggregate all your requests.
So something like:
Set<Response> aggregate = Collections.newSetFromMap(...);
executor.execute(...);
...
// Workers
Set<Response> local = new HashSet<>();
populate(local);
aggregate.addAll(local);
You might want to use various synchronizers if you want your response data to be ordered in a specific way, such as having all your responses from Request 1 to be together. If you only need to move one request from each worker, use a thread safe transfer or a singleton collection.
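A more complete version of the outline above might look like the following sketch. It is only an illustration: String stands in for Response, the loop body is a placeholder for the real HTTP call, and the names AggregateResponses and fetchAll are invented here.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class AggregateResponses {

    static Set<String> fetchAll(int workers, int perWorker) throws InterruptedException {
        // Thread-safe set backed by a ConcurrentHashMap; shared by all workers.
        Set<String> aggregate = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
        ExecutorService executor = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final int id = w;
            executor.execute(() -> {
                Set<String> local = new HashSet<>(); // private to this worker, no locking
                for (int i = 0; i < perWorker; i++) {
                    local.add("response-" + id + "-" + i); // placeholder for a fetched result
                }
                aggregate.addAll(local); // touch the shared set once, after the work is done
            });
        }
        executor.shutdown();
        if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return aggregate;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fetchAll(2, 25).size()); // prints 50
    }
}
```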

Efficient multithreaded array building in Java

I have many threads adding result-like objects to an array, and would like to improve the performance of this area by removing synchronization.
To do this, I would like each thread to instead post its results to a ThreadLocal array; once processing is complete, I can combine the arrays for the following phase. Unfortunately, for this purpose ThreadLocal has a glaring issue: I cannot combine the collections at the end, as no thread has access to the collection of another.
I can work around this by additionally adding each ThreadLocal array to a list next to the ThreadLocal as they are created, so I have all the lists available later on (this will require synchronization but only needs to happen once for each thread), however in order to avoid a memory leak I will have to somehow get all the threads to return at the end to clean up their ThreadLocal cache... I would much rather the simple process of adding a result be transparent, and not require any follow up work beyond simply adding the result.
Is there a programming pattern or existing ThreadLocal-like object which can solve this issue?
You're right, ThreadLocal objects are designed to be only accessible to the current thread. If you want to communicate across threads you cannot use ThreadLocal and should use a thread-safe data structure instead, such as ConcurrentHashMap or ConcurrentLinkedQueue.
For the use case you're describing it would be easy enough to share a ConcurrentLinkedQueue between your threads and have them all write to the queue as needed. Once they're all done (Thread.join() will wait for them to finish) you can read the queue into whatever other data structure you need.
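A small sketch of that approach, assuming integer results and plain Thread objects (SharedQueueResults and gather are hypothetical names): all workers add to one ConcurrentLinkedQueue, and the caller copies it out after joining every thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class SharedQueueResults {

    static List<Integer> gather(int threads, int perThread) throws InterruptedException {
        Queue<Integer> results = new ConcurrentLinkedQueue<>(); // lock-free, safe for concurrent add
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    results.add(base + i); // adding a result needs no follow-up work
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join(); // wait until every worker has finished
        }
        return new ArrayList<>(results); // copy into whatever structure the next phase needs
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(gather(8, 500).size()); // prints 4000
    }
}
```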

Is there a Java data structure that is thread-safe for parallel threads writing to different parts of an array of fixed size?

This is what I'm trying to implement:
A (singleton) array of fixed size (say 1000 elements)
A pool of threads writing smaller (<=100) element blocks to that array in parallel
We are guaranteed that total writes by all threads in the pool will write <1000 elements, so we never have to grow the array.
The order of writes doesn't matter but they have to be contiguous, e.g. Thread 1 populates array indexes 0-49, Thread 3 indexes 50-149, Thread 2 indexes 150-200
Is there a thread-safe data structure to achieve this?
Clearly, I would need to synchronize the "index manager" which allocates where in the array indexes a given thread needs to write. But is there a Java data structure for the array itself that can be used for this, without worrying about thread safety?
You should be able to use an AtomicReferenceArray. You can safely update indexes or atomically update with compareAndSet (though it appears you won't need that).
Editing to address akhil_mittal's question.
Let's switch the train of thought from updating an array to updating individual fields. If you were to update a field in a class the write will occur without word tearing, it won't be the case that the write will be some bits from one thread and some bits from another thread. The same is true for array indexes.
However, if you were to update a field in a class by multiple threads, the write from one thread may not be immediately visible to another thread. That is because the write may be buffered on a processor cache and eventually flushed to the other processors. The same is true for an array write to a particular index. It will be eventually visible but does not guarantee a happens-before ordering.
do we still need to concern about thread safety
You would need to worry about thread-safety the same way you would need to worry about thread-safety for a non-volatile field. It turns out that DVK may not need to worry about the writes being immediately visible.
The point of this answer is to explain that array writes are not necessarily thread-safe and using an AtomicReferenceArray can protect you from delayed writes.
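One possible shape for this, sketched under the assumption that an AtomicInteger can serve as the "index manager" from the question (the class BlockWriter and its methods are hypothetical): getAndAdd reserves a contiguous range, and AtomicReferenceArray.set publishes each element with volatile semantics.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

final class BlockWriter {
    private final AtomicReferenceArray<String> slots;
    private final AtomicInteger next = new AtomicInteger(); // next free index

    BlockWriter(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    // Reserves block.length contiguous slots and fills them; returns the start index.
    int write(String[] block) {
        int start = next.getAndAdd(block.length); // atomic reservation, no lock needed
        if (start + block.length > slots.length()) {
            // The question guarantees this never happens; the counter is not rolled back here.
            throw new IllegalStateException("array is full");
        }
        for (int i = 0; i < block.length; i++) {
            slots.set(start + i, block[i]); // volatile write, visible to later readers
        }
        return start;
    }

    String get(int index) {
        return slots.get(index);
    }

    int used() {
        return next.get();
    }

    public static void main(String[] args) {
        BlockWriter writer = new BlockWriter(1000);
        System.out.println(writer.write(new String[] {"a", "b", "c"})); // prints 0
        System.out.println(writer.write(new String[] {"d", "e"}));      // prints 3
    }
}
```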
Your question has been answered already by others so I'll just add examples:
Adding to an array by different threads is the way parallel sort works.
Creating arrays with the Fork/Join framework does so by the work-threads writing to different parts of the array.
Go ahead and do it, you're fine.
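For comparison, here is a sketch of the plain-array version this answer has in mind, assuming the reader only looks at the array after joining every writer thread (DisjointRanges and fill are illustrative names). Thread.join() provides the happens-before edge, so no atomic wrapper is used:

```java
final class DisjointRanges {

    // Each thread writes only to its own [from, to) slice of the shared array.
    static int[] fill(int size, int threads) throws InterruptedException {
        int[] data = new int[size];
        Thread[] workers = new Thread[threads];
        int chunk = (size + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(size, t * chunk);
            final int to = Math.min(size, from + chunk);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    data[i] = i * 2; // placeholder computation
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join(); // after join, this thread sees all of that worker's writes
        }
        return data;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = fill(1000, 4);
        System.out.println(result[999]); // prints 1998
    }
}
```

If other threads need to read elements while writers are still running, the visibility concern raised in the earlier answer applies and join alone is not enough.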

Java Concurrency: should I synchronize all List and Maps?

So I have a SomeTask class which extends Thread, and it has Map and List fields. What would the behavior be when you don't use Collections.synchronizedXXX and you have multiple threads of SomeTask running?
Once a Map is loaded from the database (I am using an object database to store POJOs directly), would I need to synchronize the Map object returned from this database as well?
Map SomeTasksOwnMap = Collections.synchronizedMap(MapReturnedFromDatabase);
Collections.synchronizedXXX is required when 2 or more Threads are accessing the same Map/List.
If your task doesn't access other tasks Map/List, then there is no need to synchronize them.
Example.
Task 1 builds a list of numbers divisible exactly by 2.
Task 2 builds a list of numbers divisible exactly by 3.
These two tasks have individual lists that do not require synchronization.
Example that requires synchronization.
Task 1 and 2 both calculate numbers and store them in a shared list.
To answer the questions: "What would be the behavior when you don't", you could lose one of the writes if it was timed that both threads wanted to write to index 'x'.
You may also have a null value in the list as the size of the array was increased before the write to the location was done.
Basically you would have an inconsistent view.
No. There is nothing in your question that suggests synchronization is required, because as far as I can tell each thread reads only data within itself: You only need synchronization when threads access data in other threads.
As an aside, having SomeTask extend Thread is a poor design; it should implement Runnable, and you would then use new Thread(new SomeTask()).start().
... should I synchronize all List and Maps?
No you shouldn't. Synchronizing things that don't need it is a waste of resources. And for things that do need synchronization, you need to do it the right way. (And the synchronizedXxx wrappers are not always the right way.)
First, you need to identify the data structures that are going to be visible to multiple threads. Data structures that are provably thread confined don't need synchronizing at all.
Second, you need to examine the way that the data structures are used to see if a synchronizedXxx wrapper is sufficient. For instance, these wrappers don't synchronize iteration, and you can get into trouble if one thread changes a collection while another one is iterating it.
Finally, you need to think about whether the synchronized data structures are heavily used by different threads. The synchronizedXxx wrappers can result in a performance bottleneck if the data structure is heavily used. If this is the case, you should consider using one of the ConcurrentYyyy classes instead.
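To make the iteration caveat concrete, here is a small sketch (SynchronizedIteration and sum are hypothetical names). The Collections.synchronizedList documentation requires the caller to hold the wrapper's lock manually while iterating:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SynchronizedIteration {

    // Expects a list returned by Collections.synchronizedList.
    static int sum(List<Integer> syncList) {
        int total = 0;
        synchronized (syncList) { // individual calls are locked by the wrapper, iteration is not
            for (int value : syncList) {
                total += value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> shared = Collections.synchronizedList(new ArrayList<>());
        shared.add(1); // single operations like add are already thread-safe
        shared.add(2);
        shared.add(3);
        System.out.println(sum(shared)); // prints 6
    }
}
```

Without the synchronized block, another thread adding to the list during the loop could cause a ConcurrentModificationException.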
