I have this scenario where I am receiving events from thousands of sources. Each source is sending information about its current status. While I do want to process all events, it is more important to first process the latest event of each source, so that the current view is up to date. So I was thinking of using a ConcurrentHashMap with the identifier of each source as the key, and a LIFO queue (stack) as the value. I would then iterate through the keys of the Map and just pop one item off the stack of each source.
My concern is that while I am iterating through the keys and taking items off each key's queue, the producer could post new events onto the queues, potentially creating concurrency issues. The producer could also add new keys to the map, and iteration over the Map's entrySet is only weakly consistent. That is not a huge issue, because a new item will simply be processed in a subsequent iteration. Ideally I could also use some parallel processing on the stream of the entrySet to speed up the process.
I am wondering if there is a cleaner approach to this. In principle I could have used a single LIFO BlockingDeque and processed the latest events first, but the problem with that approach is that one source could send more events than the others and thus get a disproportionate share of the processing.
Is there any other data structure that I could look into that provides this kind of behaviour? Essentially what I am looking for is a way to prioritise events from each source, while at the same time giving a fair chance to each source to be processed by the consumer.
I recommend building your own structure to manage this as it adds flexibility (and speed) for your use case in particular.
I'd go with a circular queue to store each LIFO queue (stack). A circular queue is one where you add elements at the tail and read (but don't remove) from the head; once the head catches up with the tail, you start over from the beginning.
You can build your own queue using a simple array. It's not too hard to manage the synchronization around operations such as adding more queues to the array and expanding it when needed, and I believe adding queues to the array is not something you'll do very often.
This is easy to manage, and you can extend your circular queue to track how often its entries are being accessed and throttle the frequency of access to them (by adding/removing consumer threads, or even making them wait a bit before consuming from the stack managed by an entry).
You can even avoid thread locking when reading elements from the circular queue with multiple threads, by making each thread call a "register" operation before consuming from a stack: each thread has its own ID, and when it "registers", the ID is stored at the given queue entry. Before registering and before popping from the stack, the thread performs a "read the registration ID" operation, and the ID returned must match its own. That means only the thread that "owns" a given queue entry can pop from that stack. If the registration/confirmation process fails, another thread is already consuming from that entry, so the current thread moves on to the next available entry.
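For illustration, here is a minimal sketch of that registration idea, assuming each entry of the circular array pairs a stack with an owner field; Slot, tryConsume and the -1 "unowned" marker are made-up names, not from any library:

    import java.util.Deque;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.atomic.AtomicLong;

    final class Slot<E> {
        final Deque<E> stack = new ConcurrentLinkedDeque<>();
        final AtomicLong owner = new AtomicLong(-1); // -1 means "unowned"

        E tryConsume(long threadId) {
            // "register": claim this entry only if no other thread currently owns it
            if (!owner.compareAndSet(-1, threadId)) {
                return null; // another thread is consuming from this entry; move to the next
            }
            try {
                return stack.pollFirst(); // pop the latest event, or null if the stack is empty
            } finally {
                owner.set(-1); // unregister so other threads can claim this entry later
            }
        }
    }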
I've used this sort of strategy in the past and it scaled like a charm. I hope this makes sense for you.
Did you think about a FIFO queue of LIFO queues? Each source adds to its LIFO queue and for processing you take the first LIFO queue from the FIFO queue, process one event and then put it back into the FIFO queue. This way you also should have no problem with new sources, as their LIFO queue will simply be added to the FIFO queue.
For adding events to the correct LIFO queue, you can maintain an additional HashMap that knows the queue per source and if a new source occurs that is not in the Map yet, you know you have to add its LIFO queue to the FIFO queue.
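A minimal sketch of the idea, assuming a placeholder Event type; FairDispatcher, publish and consumeOne are illustrative names:

    import java.util.Deque;
    import java.util.concurrent.*;

    class Event { }

    class FairDispatcher {
        private final BlockingQueue<Deque<Event>> fifo = new LinkedBlockingQueue<>();
        private final ConcurrentMap<String, Deque<Event>> bySource = new ConcurrentHashMap<>();

        void publish(String sourceId, Event e) {
            Deque<Event> stack = bySource.computeIfAbsent(sourceId, id -> {
                Deque<Event> s = new ConcurrentLinkedDeque<>();
                fifo.add(s); // a new source's stack joins the rotation automatically
                return s;
            });
            stack.addFirst(e); // newest event on top (LIFO)
        }

        void consumeOne() throws InterruptedException {
            Deque<Event> stack = fifo.take(); // next source in line
            Event e = stack.pollFirst();      // its most recent event, if any
            fifo.put(stack);                  // back to the end: each source gets a fair turn
            if (e != null) process(e);
        }

        void process(Event e) { /* handle the event */ }
    }

One subtlety: a source's stack can be empty when its turn comes (its events were already consumed), in which case the consumer simply puts it back and moves on; the stack stays in the rotation.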
I need to create a list to do the following operation:
I receive an object from an external queue/topic every microsecond.
After doing some operations on the object, I need to persist these objects into database.
I am doing the persist in batches of 100 or 1000. The only problem is, persist rate is lower than the incoming message rate. Now I don't want to keep this in a single thread since the persist will slow down the message consumption.
My idea is to keep accepting the message objects and adding them to a collection (like a linked list)
And keep removing from the other end of the collection in batches of 100 or 1000 and persist into database.
What is the right collection to use? How to synchronize this and avoid concurrent modification exceptions?
Below is the code I'm trying to implement with an ArrayList that clears out the list every few seconds while persisting.
import java.util.*;
import java.util.concurrent.*;

class MyClass {
    // synchronized wrapper so execute() and the scheduled task can safely share the list
    private final List<Object> persistList = Collections.synchronizedList(new ArrayList<>());
    private final ScheduledExecutorService persistExecutor = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> scheduledFuture;
    private long delay = 0, interval = 5; // initialize delay, interval

    void init() {
        scheduledFuture = persistExecutor.scheduleAtFixedRate(
                new PersistOperation(persistList), delay, interval, TimeUnit.SECONDS);
    }

    void execute(Object msg) {
        // process the message and add it to the persist list
        persistList.add(msg);
    }

    static class PersistOperation implements Runnable {
        private final List<Object> persistList;

        PersistOperation(List<Object> persistList) {
            this.persistList = persistList;
        }

        @Override
        public void run() {
            // copy persistList to a new ArrayList and clear persistList atomically
            List<Object> batch;
            synchronized (persistList) {
                batch = new ArrayList<>(persistList);
                persistList.clear();
            }
            // entity manager persist/update/merge the objects in batch
        }
    }
}
And keep removing from the other end of the collection in batches of 100 or 1000 and persist into database.
This is reasonable so long as multiple threads poll from the collection.
Below is the code I'm trying to implement with an ArrayList
An ArrayList is a bad choice here, as it is not thread-safe and, when removing an element at index 0, every element to the right of it must be shifted over (an O(n) operation).
The collection that you're looking for is called a Deque, otherwise known as a double-ended queue. However, because you need the collection to be thread-safe, I recommend using a ConcurrentLinkedDeque.
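As a rough sketch of how that might look, assuming a placeholder persistBatch method and a batch size of 1000:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;

    class BatchDrainer {
        private final ConcurrentLinkedDeque<Object> deque = new ConcurrentLinkedDeque<>();

        void accept(Object msg) {
            deque.addLast(msg); // producer adds at one end
        }

        void drainAndPersist() {
            List<Object> batch = new ArrayList<>(1000);
            Object m;
            // consume from the opposite end: no index shifting, no global lock
            while (batch.size() < 1000 && (m = deque.pollFirst()) != null) {
                batch.add(m);
            }
            if (!batch.isEmpty()) {
                persistBatch(batch); // e.g. entity manager persist/merge in one transaction
            }
        }

        void persistBatch(List<Object> batch) { /* ... */ }
    }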
I think that you will want to use the LMAX Disruptor framework here. I envision two RingBuffers. You would use the first to accept incoming messages, and your worker(s) would read from it. You would set the size of the RingBuffer to equal your persistence chunk size (e.g. 100 or 1000). After a worker takes an event from the RingBuffer and processes it, it places a reference to the persisted object into a Queue collection. Each time the first RingBuffer has been circled once, you allocate a new Queue and place the old Queue into the second RingBuffer. The worker(s) for the second RingBuffer take a Queue object from the RingBuffer, persist all the objects in the Queue, and then move on to the next Queue. You can tune the size of the second RingBuffer and the number of worker threads to accommodate the speed at which the database can persist your chunks.
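As a hedged sketch only, simplified to a single RingBuffer and assuming the Disruptor 3.x DSL; MsgEvent and the handler body are placeholders:

    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    class DisruptorSketch {
        static class MsgEvent { Object payload; }

        public static void main(String[] args) {
            // ring size = persistence chunk size, e.g. 1024 (must be a power of two)
            Disruptor<MsgEvent> disruptor = new Disruptor<>(
                    MsgEvent::new, 1024, DaemonThreadFactory.INSTANCE);
            disruptor.handleEventsWith((event, sequence, endOfBatch) -> {
                // process event.payload; endOfBatch could trigger the persistence step
            });
            RingBuffer<MsgEvent> ring = disruptor.start();
            ring.publishEvent((event, seq) -> event.payload = "incoming message");
        }
    }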
You risk losing messages with that approach: if you have 100 messages received but not yet saved and your application dies, can you afford to lose those messages?
The kind of topic/queue is important here: topics have the advantage of managing this backpressure, whereas queues are usually there because ordered processing is required.
If your queue/topic is Kafka and you pull messages, Kafka can pull batches, and you can probably save those batches to the database as well, only acking the messages to Kafka once they are saved.
If your processing needs to be ordered, you can probably use some kind of reactive approach and tune the DB. A queue system can usually control the flow.
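For instance, with Kafka's consumer API and auto-commit disabled, something along these lines might work; the topic name, the String types and persistBatch are placeholders:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    class SaveThenAck {
        void run(Properties props) { // props must set enable.auto.commit=false
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events")); // "events" is a placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    if (!records.isEmpty()) {
                        persistBatch(records);  // save the whole batch to the database first
                        consumer.commitSync();  // ack to Kafka only once saved
                    }
                }
            }
        }

        void persistBatch(ConsumerRecords<String, String> records) { /* ... */ }
    }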
I was going through the javadocs and source code for the drainTo method declared in the BlockingQueue interface and implemented in LinkedBlockingQueue. My understanding of this method, after looking at the source (JDK 7), is that the calling thread submits a Collection and then acquires the takeLock, which blocks other consumers. After that, up to the given maximum number of elements, the items of the queue's nodes are removed and put into the collection.
What I could appreciate is that it saves threads from acquiring locks again and again, but pardon my limited knowledge, I could not see the need for it in practice. Could someone please share some real-world examples where the drainTo behaviour is observable?
Well, I used it in real life code and it looked quite natural to me: a background database thread creates items and puts them into a queue in a loop until either the end of data is reached or a stop signal is detected. On the first item a UI updater is launched using EventQueue.invokeLater. Due to the asynchronous nature and some overhead in this invokeLater mechanism, it will take some time until the UI updater comes to the point where it queries the queue and most likely more than one item may be available.
So it will use drainTo to get all items that are available at this specific point and update a ListDataModel which produces a single event for the added interval. The next update can be triggered using another invokeLater or using a Timer. So drainTo has the semantic of “gimme all items arrived since the last call” here.
On the other hand, polling the queue for single items could lead to a situation where producer and consumer block each other for short periods, and every time the consumer asks for a new item, another one is available simply because the consumer was blocked just long enough for the producer to create and put a new item. You would then have to implement your own time limit to avoid blocking the UI thread for too long. Using drainTo once and releasing the event handling thread afterwards is much easier.
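A hedged sketch of that pattern, assuming a Swing DefaultListModel; note that addAll (which fires a single intervalAdded event) requires Java 11+:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import javax.swing.DefaultListModel;

    class UiDrain {
        // run this on the EDT, e.g. via EventQueue.invokeLater or a Swing Timer
        static void updateUi(BlockingQueue<String> queue, DefaultListModel<String> model) {
            List<String> batch = new ArrayList<>();
            queue.drainTo(batch); // "gimme all items arrived since the last call"
            if (!batch.isEmpty()) {
                model.addAll(batch); // one intervalAdded event for the whole batch (Java 11+)
            }
        }
    }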
I have a class that's a listener to a log server. The listener gets notified whenever a log/text is spewed out. I store this text in an ArrayList.
I need to process this text (remove duplicate words, store it in a trie, compare it against some patterns etc).
My question is: should I be doing this processing as and when the listener is notified, or should I be creating a separate thread that handles it?
What is the best way to handle this situation?
Sounds like you're trying to solve the producer-consumer problem, in which case: yes, you should be looking at threads.
If, however, you only need to do very basic operations that take less than a few milliseconds per entry, don't overly complicate things. If you use a TreeSet in conjunction with your ArrayList, it will automatically take care of keeping duplicates out. Simple atomic operations such as validating a log entry aren't such a big deal that they need a separate thread, unless new text is coming in at such a rapid rate that you need a thread busy full-time with processing new notifications.
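For the simple in-line case, a small illustration; LineDedup and uniqueWords are made-up names:

    import java.util.Set;
    import java.util.TreeSet;

    class LineDedup {
        // keep only the unique words of a log line; a TreeSet ignores duplicates (and sorts)
        static Set<String> uniqueWords(String line) {
            Set<String> words = new TreeSet<>();
            for (String w : line.split("\\s+")) {
                words.add(w); // add() returns false for a duplicate and leaves the set unchanged
            }
            return words;
        }
    }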
I always run processing that is not related to the UI in a separate thread, so it won't hang your app's screen. So, from my point of view, you should go with a separate thread.
Such a situation can be solved using Queues. The simplest solution would be to have an unbounded blocking queue (a LinkedTransferQueue is tailored for such a case) and a limited size pool of worker threads.
You would add()/offer() the log entry from the listener's thread and take() for processing with worker threads. take() will block a thread if no log entries are available for processing.
P.S. A LinkedTransferQueue is designed for concurrent usage; no external synchronization is necessary. It is based on weakly consistent iterators, just like the rest of the concurrent collections family.
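A minimal sketch of that setup, assuming a placeholder process method and a pool of four workers:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedTransferQueue;

    class LogPipeline {
        private final BlockingQueue<String> queue = new LinkedTransferQueue<>();
        private final ExecutorService workers = Executors.newFixedThreadPool(4);

        // called from the listener's thread
        void onLogLine(String line) {
            queue.offer(line); // never blocks: the queue is unbounded
        }

        void start() {
            for (int i = 0; i < 4; i++) {
                workers.submit(() -> {
                    try {
                        while (true) {
                            String entry = queue.take(); // blocks while no entries are available
                            process(entry); // dedupe words, update the trie, match patterns...
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // exit on shutdownNow()
                    }
                });
            }
        }

        void process(String entry) { /* ... */ }
    }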
I have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavyweight, hence the queue, to limit the number of objects in memory.
Multiple threads take objects from the queue, perform the work, and put the results into a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, e.g. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing references of the threads (Callables) to each other and checking thread.isDone() before a poll or offer, and then checking whether the element is null. I also check the size of the queue: as long as there are elements in it, they must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One of the ways to accomplish this would be to put a "dummy" or "poison" message as the last message on the queue when you are sure that no more tasks are going to arrive on it, for example after putting the message related to the last row of the DB query. So the producer puts a dummy message on the queue, and a consumer receiving this dummy message knows that no more meaningful work is expected in this batch.
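A hedged sketch of the poison-pill idea; END and the queue capacity are illustrative, and with multiple consumers each one re-inserts the pill so the others see it too:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class PoisonPill {
        static final Object END = new Object(); // sentinel: "no more work is coming"
        final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(16);

        void produce() throws InterruptedException {
            // ... put the real work items ...
            queue.put(END); // after the last row of the db query
        }

        void consume() throws InterruptedException {
            while (true) {
                Object item = queue.take(); // safe to block now: END guarantees wake-up
                if (item == END) {
                    queue.put(END); // re-insert so the other calculation threads also stop
                    return;
                }
                // calculate on item, offer the result to the second queue
            }
        }
    }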
Maybe you should take a look at CompletionService
It is designed to combine executor and queue functionality in one.
Tasks which have completed execution will be available from the completion service via
completionServiceInstance.take()
You can then use another executor for step 3, i.e. filling the DB with the results, feeding it with the results taken from the completionServiceInstance.
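A minimal sketch, assuming placeholder Row, Result, calculate and saveToDb names:

    import java.util.concurrent.*;

    class Pipeline {
        static class Row { }
        static class Result { }

        void run(Iterable<Row> rows) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4); // the calculation threads
            CompletionService<Result> cs = new ExecutorCompletionService<>(pool);
            int submitted = 0;
            for (Row row : rows) {               // 1. read data from the database
                cs.submit(() -> calculate(row)); // 2. do the work/"calculation"
                submitted++;
            }
            for (int i = 0; i < submitted; i++) {
                saveToDb(cs.take().get());       // 3. blocks until some result is ready
            }
            pool.shutdown();
        }

        Result calculate(Row row) { return new Result(); }
        void saveToDb(Result r) { /* write result to database */ }
    }

Note that with the executor's default unbounded work queue this no longer limits the number of in-flight objects; a bounded queue in a hand-built ThreadPoolExecutor would restore that property.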
I am completely new to Java, but I have an urgent requirement to create a queue and a thread, and I am confused about which queue class to use.
Here's the scenario:
I need a thread to handle user events from the application layer as well as callback events from the lower middleware layer.
For this purpose, it was decided that a queue will be maintained.
Events will be posted to this queue whenever a user event or callback event occurs.
The thread polls for events in the queue and takes appropriate action.
The same queue can be written into by different classes (i.e. the application layer and the lower layer). Hence, which queue would be safer, to ensure the same location is not written into simultaneously by different classes?
Also, what is the basic one-sentence difference between a Queue, a BlockingQueue and an ArrayBlockingQueue, and in what scenarios should each be selected?
Regards,
kiki
Of the three you listed, the only one which is actually a class is ArrayBlockingQueue. A blocking queue differs from a normal queue in that, if a thread attempts to remove the front item when none is available, it will block until an item becomes available to remove.
"BlockingQueue" and "Queue" are just a interfaces; you can't instantiate them. Types of BlockingQueue that you can instantiate are ArrayBlockingQueue, LinkedBlockingQueue, etc.
Personally, I would use a LinkedBlockingQueue for this application - the advantage of using a linked list is that there's no set max capacity, and the memory usage decreases as the queue shrinks.
Regarding the one-sentence difference: Queue and BlockingQueue are interfaces, whereas ArrayBlockingQueue is a class which implements the BlockingQueue interface.
You should choose mainly between ConcurrentLinkedQueue and ArrayBlockingQueue/LinkedBlockingQueue.
The former gives you an unbounded queue (not limited in size); the latter provide fixed-size queues which wait for space to become available in the queue when storing an element.
As an alternative to queues + threads, you can consider the Executor and Future interfaces from the concurrent package; they may be easier to use for implementing a client-server model.
For your scenario, what you need is a thread-safe queue such as ConcurrentLinkedQueue. Regarding your other question on Queue and BlockingQueue, there are basically the following kinds of queue implementations:
Blocking: blocks until the operation (put(), take(), etc.) is possible, with an optional timeout.
Non-blocking: the operation completes instantly.
Bounded: has an upper limit on the number of items in the queue.
Unbounded: no limit on the number of items in the queue.
As for ArrayBlockingQueue, it is backed by an array, while a LinkedBlockingQueue is backed by a linked list of nodes.
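A quick illustration of those differences; the String payloads are placeholders:

    import java.util.Queue;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    class QueueFlavors {
        void demo() throws InterruptedException {
            // bounded + blocking: put() waits while full, take() waits while empty
            BlockingQueue<String> bounded = new ArrayBlockingQueue<>(1024);
            bounded.put("event");
            String a = bounded.take();

            // unbounded + non-blocking: operations complete instantly
            Queue<String> unbounded = new ConcurrentLinkedQueue<>();
            unbounded.offer("event");    // always succeeds
            String b = unbounded.poll(); // returns null when empty
        }
    }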
Use the higher-level Executors.newSingleThreadExecutor()
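For instance, a hedged sketch of that design; EventPump and post are made-up names:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class EventPump {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        // safe to call from the application layer and the middleware callback alike:
        // submissions queue up internally and are handled one at a time, in order
        void post(Runnable event) {
            executor.submit(event);
        }
    }

This gives you the queue and the single handler thread in one object, without managing either yourself.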