Threading a data set?

I have a bunch of objects. They don't need to be sorted or ordered. They have one method that needs to be called: myObject.update(). Eventually they will need to be removed from the container.
Right now it's single threaded and the update() method is CPU bound (no I/O). We have a nice server with 16 "cores" (cores + HT).
What I would like to do is have one container object responsible for "dishing out" objects, and then 15 threads that ask the container for a new object when they need one. Is this a good way to go about it?
What is a thread safe data structure to hold the objects? Or should I just make the container object responsible for not sending out the same object twice?

In Java, good candidates for your problem are LinkedBlockingQueue and ArrayBlockingQueue.
They provide first-in-first-out functionality with an optional bound on the number of elements they hold at one time.
Alternatively, a good approach is to use an ExecutorService, which holds a thread pool and an internal queue for serving the threads on-demand.
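As a rough sketch of the ExecutorService route: the pool's internal queue plays the role of the "dishing out" container, so no object is handed to two threads. Updatable and UpdateRunner are hypothetical names standing in for the question's objects, and the one-minute timeout is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the question's object with a CPU-bound update().
final class Updatable {
    private final AtomicInteger updates = new AtomicInteger();

    void update() {
        updates.incrementAndGet(); // the real CPU-bound work would go here
    }

    int updateCount() {
        return updates.get();
    }
}

final class UpdateRunner {
    // Submits one task per object; each task is run by exactly one pool thread.
    static void updateAll(List<Updatable> objects, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (Updatable o : objects) {
                pool.execute(o::update);
            }
        } finally {
            pool.shutdown(); // no new tasks; already queued tasks still run
        }
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("updates did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<Updatable> objects = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            objects.add(new Updatable());
        }
        updateAll(objects, 15); // 15 workers, as in the question
        System.out.println(objects.get(0).updateCount());
    }
}
```

Removing an object from the container then just means not submitting it in the next round.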

Related

Should I use ThreadLocal in this high traffic multi-threaded scenario?

Scenario
We are developing an API that will handle around 2-3 million hits per hour in a multi-threaded environment. The server is Apache Tomcat 7.0.64.
We have a custom object with a lot of data; let's call it XYZDataContext. When a new request comes in, we associate an XYZDataContext object with the request context, one XYZDataContext object per request. We will be spawning various threads in parallel to serve that request and to collect/process data from/into the XYZDataContext object. The threads that process things in parallel need access to this XYZDataContext object, and to avoid passing it around everywhere in the application, to various objects/methods/threads, we are thinking of making it a ThreadLocal. Threads will use data from the XYZDataContext object and will also update data in it.
When a thread finishes, we are planning to merge the data from the updated XYZDataContext object in the spawned child thread into the main thread's XYZDataContext object.
My questions:
Is this a good approach?
Thread pool risks: the Tomcat server will maintain a thread pool, and I read that using ThreadLocal with thread pools is a disaster because the thread is not GCed per se and is reused, so the references to the ThreadLocal objects will not get GCed. That will result in storing huge objects in memory that we don't need anymore, eventually resulting in OutOfMemory issues... UNLESS they are referenced as weak references so that they get GCed immediately.
We're using Java 1.7 OpenJDK. I saw the source code for ThreadLocal, and although ThreadLocalMap.Entry is a WeakReference, it's not associated with a ReferenceQueue, and the comment for the Entry constructor says "since reference queues are not used, stale entries are guaranteed to be removed only when the table starts running out of space."
I guess this works great in case of caches but is not the best thing in our case. I would like that the threadlocal XYZDataContext object be GCed immediately. Will the ThreadLocal.remove() method be effective here?
Is there any way to enforce emptying the space in the next GC run?
Is this the right scenario for ThreadLocal objects? Or are we abusing the ThreadLocal concept and using it where it shouldn't be used?

My gut feeling tells me you're on the wrong path. Since you already have a central context object (one for all threads) and you want to access it from multiple threads at the same time I would go with a Singleton hosting the context object and providing threadsafe methods to access it.
Instead of manipulating multiple properties of your context object one by one, I would strongly suggest applying all manipulations at the same time. Best would be to pass only one object containing all the properties you want to change in your context object.
e.g.
Singleton.getInstance().adjustContext(ContextAdjuster contextAdjuster)
You might also want to consider using a threadsafe queue, filling it up with ContextAdjuster objects from your threads and finally processing it in the Context's thread.
Google for things like Concurrent, Blocking and Nonblocking Queue in Java. I am sure you'll find tons of example code.
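A minimal sketch of the queue idea, assuming a shape for ContextAdjuster (the answer only names it): each worker posts one adjuster to a ConcurrentLinkedQueue, and the thread that owns the context drains the queue and applies them. The Map-based context and the class AdjusterQueueDemo are hypothetical, for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

// Assumed shape: one object that carries a whole batch of changes.
interface ContextAdjuster {
    void applyTo(Map<String, String> contextData);
}

final class AdjusterQueueDemo {
    static Map<String, String> collect(int workers) {
        ConcurrentLinkedQueue<ContextAdjuster> pending = new ConcurrentLinkedQueue<>();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            String key = "worker-" + i;
            threads[i] = new Thread(() -> {
                // Workers never touch the context; they only describe the change.
                ContextAdjuster adjuster = data -> data.put(key, "done");
                pending.add(adjuster);
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Only the owning thread mutates the context, so a plain HashMap is enough.
        Map<String, String> context = new HashMap<>();
        ContextAdjuster next;
        while ((next = pending.poll()) != null) {
            next.applyTo(context);
        }
        return context;
    }

    public static void main(String[] args) {
        System.out.println(collect(4).keySet());
    }
}
```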

Using Java Concurrent Collections in Scala [duplicate]

I have an Actor that - in its very essence - maintains a list of objects. It has three basic operations, an add, update and a remove (where sometimes the remove is called from the add method, but that aside), and works with a single collection. Obviously, that backing list is accessed concurrently, with add and remove calls interleaving each other constantly.
My first version used a ListBuffer, but I read somewhere it's not meant for concurrent access. I haven't gotten concurrent access exceptions, but I did note that finding & removing objects from it does not always work, possibly due to concurrency.
I was halfway rewriting it to use a var List, but removing items from Scala's default immutable List is a bit of a pain - and I doubt it's suitable for concurrent access.
So, basic question: What collection type should I use in a concurrent access situation, and how is it used?
(Perhaps secondary: Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread?)
(Tertiary: In Scala, what collection type is best for inserts and random access (delete / update)?)
Edit: To the kind responders: Excuse my late reply, I'm making a nasty habit out of dumping a question on SO or mailing lists, then moving on to the next problem, forgetting the original one for the moment.

Take a look at the scala.collection.mutable.Synchronized* traits/classes.
The idea is that you mixin the Synchronized traits into regular mutable collections to get synchronized versions of them.
For example:
import scala.collection.mutable._
val syncSet = new HashSet[Int] with SynchronizedSet[Int]
val syncArray = new ArrayBuffer[Int] with SynchronizedBuffer[Int]

You don't need to synchronize the state of an actor. The aim of actors is to avoid tricky, error-prone and hard-to-debug concurrent programming.
The actor model ensures that an actor consumes its messages one by one, and that you will never have two threads consuming messages for the same actor at the same time.

Scala's immutable collections are suitable for concurrent usage.
As for actors, a couple of things are guaranteed, as explained in the Akka documentation:
the actor send rule: where the send of the message to an actor happens before the receive of the same actor.
the actor subsequent processing rule: where processing of one message happens before processing of the next message by the same actor.
You are not guaranteed that the same thread processes the next message, but you are guaranteed that the current message will finish processing before the next one starts, and also that at any given time, only one thread is executing the receive method.
So that takes care of a given Actor's persistent state. With regard to shared data, the best approach as I understand it is to use immutable data structures and lean on the Actor model as much as possible. That is, "do not communicate by sharing memory; share memory by communicating."

What collection type should I use in a concurrent access situation, and how is it used?
See @hbatista's answer.
Is an Actor actually a multithreaded entity, or is that just my wrong conception and does it process messages one at a time in a single thread
The second (though the thread on which messages are processed may change, so don't store anything in thread-local data). That's how the actor can maintain invariants on its state.

Efficient multithreaded array building in Java

I have many threads adding result-like objects to an array, and would like to improve the performance of this area by removing synchronization.
To do this, I would like each thread to instead post its results to a ThreadLocal array; then, once processing is complete, I can combine the arrays for the following phase. Unfortunately, for this purpose ThreadLocal has a glaring issue: I cannot combine the collections at the end, as no thread has access to the collection of another.
I can work around this by additionally adding each ThreadLocal array to a list next to the ThreadLocal as they are created, so I have all the lists available later on (this will require synchronization but only needs to happen once for each thread), however in order to avoid a memory leak I will have to somehow get all the threads to return at the end to clean up their ThreadLocal cache... I would much rather the simple process of adding a result be transparent, and not require any follow up work beyond simply adding the result.
Is there a programming pattern or existing ThreadLocal-like object which can solve this issue?

You're right, ThreadLocal objects are designed to be only accessible to the current thread. If you want to communicate across threads you cannot use ThreadLocal and should use a thread-safe data structure instead, such as ConcurrentHashMap or ConcurrentLinkedQueue.
For the use case you're describing it would be easy enough to share a ConcurrentLinkedQueue between your threads and have them all write to the queue as needed. Once they're all done (Thread.join() will wait for them to finish) you can read the queue into whatever other data structure you need.
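Something along these lines, as a sketch only: the worker threads add to one shared ConcurrentLinkedQueue, the caller joins them, and then copies the queue into an ArrayList for the next phase. ResultCollector and the Integer "results" are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class ResultCollector {
    static List<Integer> gather(int threadCount, int resultsPerThread) {
        Queue<Integer> results = new ConcurrentLinkedQueue<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            int base = t * resultsPerThread;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < resultsPerThread; i++) {
                    results.add(base + i); // lock-free add, no synchronized block needed
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait until every producer is done
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // All writers have finished, so a plain copy is safe here.
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        System.out.println(gather(4, 250).size());
    }
}
```

Nothing has to be cleaned up per thread afterwards, which was the concern with the ThreadLocal approach.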

Java Concurrency: should I synchronize all List and Maps?

So I have a SomeTask class which extends Thread, and it has Map and List fields. What would be the behavior when you don't use Collections.synchronizedXXX and you have multiple threads of SomeTask running?
Once a Map is retrieved from the database (I am using an object database to directly store POJOs), would I need to synchronize the Map object returned from this database as well?
Map SomeTasksOwnMap = Collections.synchronizedMap(MapReturnedFromDatabase);

Collections.synchronizedXXX is required when 2 or more Threads are accessing the same Map/List.
If your task doesn't access other tasks Map/List, then there is no need to synchronize them.
Example.
Task 1 builds a list of numbers divisible exactly by 2.
Task 2 builds a list of numbers divisible exactly by 3.
These two tasks have individual lists that do not require synchronization.
Example that requires synchronization.
Task 1 and 2 both calculate numbers and store them in a shared list.
To answer the question "What would be the behavior when you don't": you could lose one of the writes if both threads happened to write to index 'x' at the same time.
You may also see a null value in the list, because the size of the backing array was increased before the write to that location was done.
Basically you would have an inconsistent view.
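To make the second example concrete, here is a small sketch where both tasks store into one shared list wrapped with Collections.synchronizedList, so no add is lost. SharedListDemo and its method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SharedListDemo {
    // Adds every multiple of 'factor' in 1..limit to the target list.
    private static void addMultiples(List<Integer> target, int factor, int limit) {
        for (int n = factor; n <= limit; n += factor) {
            target.add(n);
        }
    }

    static List<Integer> collectMultiples(int limit) {
        // Shared between both tasks, so it is wrapped; each add() takes the wrapper's lock.
        List<Integer> shared = Collections.synchronizedList(new ArrayList<Integer>());
        Thread task1 = new Thread(() -> addMultiples(shared, 2, limit));
        Thread task2 = new Thread(() -> addMultiples(shared, 3, limit));
        task1.start();
        task2.start();
        try {
            task1.join();
            task2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(collectMultiples(600).size());
    }
}
```

With a bare ArrayList in place of the wrapper, the final size could come out smaller than expected, or an add could fail outright.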

No. There is nothing in your question that suggests synchronization is required, because as far as I can tell each thread reads only data within itself: You only need synchronization when threads access data in other threads.
As an aside, having SomeTask extend Thread is a poor design; it should implement Runnable, and then you use new Thread(new SomeTask()).start().

... should I synchronize all List and Maps?
No you shouldn't. Synchronizing things that don't need it is a waste of resources. And for things that do need synchronization, you need to do it the right way. (And the synchronizedXxx wrappers are not always the right way.)
First, you need to identify the data structures that are going to be visible to multiple threads. Data structures that are provably thread confined don't need synchronizing at all.
Second, you need to examine the way that the data structures are used to see if a synchronizedXxx wrapper is sufficient. For instance, these wrappers don't synchronize iteration, and you can get into trouble if one thread changes a collection while another one is iterating it.
Finally, you need to think about whether the synchronized data structures are heavily used by different threads. The synchronizedXxx wrappers can result in a performance bottleneck if the data structure is heavily used. If this is the case, you should consider using one of the ConcurrentYyyy classes instead.
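A short sketch of the second and third points, with hypothetical names (WrapperVersusConcurrent, sum, countHits): iterating a synchronizedXxx wrapper needs a manual lock on the wrapper, while a ConcurrentHashMap offers atomic per-key updates without one lock around the whole map.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

final class WrapperVersusConcurrent {
    // The wrapper locks each individual call, but not a whole iteration,
    // so the caller holds the wrapper's own lock while looping.
    static int sum(List<Integer> synchronizedList) {
        synchronized (synchronizedList) {
            int total = 0;
            for (int value : synchronizedList) {
                total += value;
            }
            return total;
        }
    }

    // Many threads bump the same counter; merge() is atomic for that key.
    static int countHits(int threads, int hitsPerThread) {
        ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < hitsPerThread; n++) {
                    hits.merge("hits", 1, Integer::sum);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return hits.getOrDefault("hits", 0);
    }

    public static void main(String[] args) {
        System.out.println(countHits(4, 1000));
    }
}
```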

Where should you use BlockingQueue Implementations instead of Simple Queue Implementations?

I think I shall reframe my question from
Where should you use BlockingQueue Implementations instead of Simple Queue Implementations ?
to
What are the advantages/disadvantages of BlockingQueue over Queue implementations, taking into consideration aspects like speed, concurrency, or other properties that vary, e.g. time to access the last element?
I have used both kinds of queues. I know that a BlockingQueue is normally used in concurrent applications. I was writing a simple ByteBuffer pool where I needed some placeholder for ByteBuffer objects. I needed the fastest thread-safe queue implementation. There are also List implementations like ArrayList, which has constant access time for elements.
Can anyone discuss the pros and cons of BlockingQueue vs Queue vs List implementations?
Currently I have used ArrayList to hold these ByteBuffer objects.
Which data structure shall I use to hold these objects?

A limited-capacity BlockingQueue is also helpful if you want to throttle some sort of request. With an unbounded queue, producers can get far ahead of the consumers. The tasks will eventually be performed (unless there are so many that they cause an OutOfMemoryError), but the producer may long since have given up, so the effort is wasted.
In situations like these, it may be better to signal a would-be producer that the queue is full, and to give up quickly with a failure. For example, the producer might be a web request, with a user that doesn't want to wait too long, and even though it won't consume many CPU cycles while waiting, it is using up limited resources like a socket and some memory. Giving up will give the tasks that have been queued already a better chance to finish in a timely manner.
Regarding the amended question, which I'm interpreting as, "What is a good collection for holding objects in a pool?"
An unbounded LinkedBlockingQueue is a good choice for many pools. However, depending on your pool management strategy, a ConcurrentLinkedQueue may work too.
In a pooling application, a blocking "put" is not appropriate. Controlling the maximum size of the queue is the job of the pool manager—it decides when to create or destroy resources for the pool. Clients of the pool borrow and return resources from the pool. Adding a new object, or returning a previously borrowed object to the pool should be fast, non-blocking operations. So, a bounded capacity queue is not a good choice for pools.
On the other hand, when retrieving an object from the pool, most applications want to wait until a resource is available. A "take" operation that blocks, at least temporarily, is much more efficient than a "busy wait"—repeatedly polling until a resource is available. The LinkedBlockingQueue is a good choice in this case. A borrower can block indefinitely with take, or limit the time it is willing to block with poll.
A less common case is when a client is not willing to block at all, but has the ability to create a resource for itself if the pool is empty. In that case, a ConcurrentLinkedQueue is a good choice. This is sort of a gray area where it would be nice to share a resource (e.g., memory) as much as possible, but speed is even more important. In the worst case, this degenerates to every thread having its own instance of the resource; then it would have been more efficient not to bother trying to share among threads.
Both of these collections give good performance and ease of use in a concurrent application. For non-concurrent applications, an ArrayList is hard to beat. Even for collections that grow dynamically, the per-element overhead of a LinkedList allows an ArrayList with some empty slots to stay competitive memory-wise.
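For the ByteBuffer pool from the question, the blocking-borrow variant could look roughly like this. BufferPool and its method names are hypothetical; the point is only that returning a buffer is a non-blocking offer on an unbounded LinkedBlockingQueue, while borrowing waits with a timed poll.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class BufferPool {
    // Unbounded, so returning a buffer never blocks.
    private final LinkedBlockingQueue<ByteBuffer> idle = new LinkedBlockingQueue<>();

    // The pool manager decides how many buffers exist; here they are created up front.
    BufferPool(int count, int capacity) {
        for (int i = 0; i < count; i++) {
            idle.offer(ByteBuffer.allocate(capacity));
        }
    }

    // Waits up to timeoutMillis for a free buffer; returns null if none showed up.
    ByteBuffer borrow(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // Resets the buffer and hands it back without blocking.
    void release(ByteBuffer buffer) {
        buffer.clear();
        idle.offer(buffer);
    }

    public static void main(String[] args) {
        BufferPool pool = new BufferPool(2, 1024);
        ByteBuffer buffer = pool.borrow(100);
        System.out.println(buffer != null);
        pool.release(buffer);
    }
}
```

A caller that would rather allocate than wait could swap in a ConcurrentLinkedQueue and treat a null from poll() as "create a new buffer".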

You would see BlockingQueue in multi-threaded situations. For example, you need to pass a BlockingQueue as a parameter when you create a ThreadPoolExecutor through its constructor. Depending on the type of queue you pass in, the executor can behave differently.
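To illustrate how the queue changes the executor's behavior, here is a sketch with one worker thread and an ArrayBlockingQueue of capacity 2 (both numbers picked arbitrarily for the demo). While the worker is busy, two tasks can wait in the queue and further submissions are rejected by the default handler. BoundedExecutorDemo is a made-up name.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedExecutorDemo {
    // Submits tasks that all wait on a gate and counts how many were refused.
    static int countRejected(int submissions) {
        CountDownLatch gate = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1,                       // exactly one worker thread
                0L, TimeUnit.MILLISECONDS,  // keep-alive is irrelevant with core == max
                new ArrayBlockingQueue<Runnable>(2)); // at most two waiting tasks
        int rejected = 0;
        try {
            for (int i = 0; i < submissions; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            gate.await(); // keeps the worker busy until the gate opens
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // default policy: refuse when worker and queue are full
                }
            }
        } finally {
            gate.countDown(); // let the accepted tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(countRejected(5));
    }
}
```

With an unbounded LinkedBlockingQueue in the same constructor, nothing would be rejected; the tasks would simply pile up in the queue.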

It is a Queue implementation that additionally supports operations that wait for the queue to become non-empty when retrieving an element, and wait for space to become available in the queue when storing an element.
If you need that behavior from your queue, use a BlockingQueue.
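Both waiting operations can be seen in a tiny hand-off sketch (HandOff is a hypothetical name): the queue has room for a single element, so put() waits whenever the consumer is behind and take() waits whenever the producer is behind.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class HandOff {
    // Sends 1..items through a one-slot queue and returns the sum the consumer saw.
    static int transfer(int items) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= items; i++) {
                    queue.put(i); // waits while the single slot is occupied
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < items; i++) {
                sum += queue.take(); // waits while the queue is empty
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100));
    }
}
```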
