I need a collection to support the following operation:
I receive an object from an external queue/topic every microsecond.
After doing some operations on each object, I need to persist these objects to a database.
I am persisting in batches of 100 or 1000. The only problem is that the persist rate is lower than the incoming message rate. I don't want to keep this in a single thread, since persisting would slow down message consumption.
My idea is to keep accepting the message objects and adding them to a collection (like a linked list)
And keep removing from the other end of the collection in batches of 100 or 1000 and persist into database.
What is the right collection to use? How do I synchronize it and avoid ConcurrentModificationExceptions?
Below is the code I'm trying to implement, with an ArrayList that is cleared every few seconds while persisting.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class MyClass {
    List<Message> persistList = Collections.synchronizedList(new ArrayList<>());
    ScheduledExecutorService persistExecutor;
    ScheduledFuture<?> scheduledFuture;

    // Initialize delay, interval
    void init() {
        persistExecutor = Executors.newSingleThreadScheduledExecutor();
        scheduledFuture = persistExecutor.scheduleAtFixedRate(
                new PersistOperation(persistList), delay, interval, TimeUnit.SECONDS);
    }

    void execute(Message msg) {
        // process the message and add to the persist list
    }

    class PersistOperation implements Runnable {
        final List<Message> persistList;

        PersistOperation(List<Message> persistList) {
            this.persistList = persistList; // parameterized constructor
        }

        public void run() {
            // Copy persistList to a new ArrayList and clear persistList
            // entity manager persist/update/merge
        }
    }
}
And keep removing from the other end of the collection in batches of 100 or 1000 and persist into database.
This is reasonable so long as multiple threads poll from the collection.
Below is the code I'm trying to implement with an ArrayList
An ArrayList is a bad choice here, as it is not thread-safe and, when removing an element at index 0, every element to the right of it must be shifted over (an O(n) operation).
The collection that you're looking for is called a Deque, otherwise known as a double-ended queue. However, because you need the collection to be thread-safe, I recommend using a ConcurrentLinkedDeque.
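As a sketch of that idea (the class and method names here are my own, illustrative choices, not a standard API), the message-consuming thread adds at the tail while the persist thread drains batches from the head:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative sketch: producers add at the tail, the persister drains
// batches of up to batchSize from the head. ConcurrentLinkedDeque is
// thread-safe, so no external synchronization is needed here.
class BatchDrainer<T> {
    private final ConcurrentLinkedDeque<T> deque = new ConcurrentLinkedDeque<>();
    private final int batchSize;

    BatchDrainer(int batchSize) { this.batchSize = batchSize; }

    // Called by the message-consuming thread.
    void accept(T msg) { deque.addLast(msg); }

    // Called periodically by the persist thread; returns up to batchSize items.
    List<T> nextBatch() {
        List<T> batch = new ArrayList<>(batchSize);
        T item;
        while (batch.size() < batchSize && (item = deque.pollFirst()) != null) {
            batch.add(item);
        }
        return batch; // persist this list in one transaction
    }
}
```

Because `pollFirst()` returns null rather than throwing when the deque is empty, the persist thread can simply stop when a batch comes back short or empty.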
I think that you will want to use the LMAX Disruptor framework here. I envision two RingBuffers.
You would use the first to accept incoming messages, and your worker(s) would read from it. You would set its size to equal your persistence chunk size (e.g. 100 or 1000). After a worker takes an event from the RingBuffer and processes it, it places a reference to the persisted object into a Queue collection. Each time the first RingBuffer has been circled once, you allocate a new Queue and place the old Queue into the second RingBuffer.
The worker(s) for the second RingBuffer take a Queue object from it, persist all the objects in that Queue, and then move on to the next Queue. You can tune the size of the second RingBuffer and the number of worker threads to match the speed at which the database can persist your chunks.
You risk losing messages with that approach: if you have 100 messages received but not yet saved and your application dies, can you afford to lose those messages?
The kind of topic/queue is important here: topics have the advantage of managing this backpressure, while queues are usually used because ordered processing is required.
If your queue/topic is Kafka and you pull messages, Kafka can pull batches; you can probably save those batches to the database as well, and only ack the messages to Kafka once they are saved.
If your processing needs to be ordered, you can probably use some kind of reactive approach and tune the DB. A queue system can usually control the flow.
I am working on a design of a program that will need to fetch results from a datastore and post those results to another system. The data that I am fetching is referenced by a UUID, and has other documents linked to it by UUIDs. I will be posting a lot of documents (>100K documents), so I would like to do this concurrently. I am thinking about the following design:
Get the list of documents from the datastore. Each document would have:
docId (UUID)
docData (json doc)
type1 (UUID)
type1Data (json)
type2 (UUID)
type2Data (json)
list<UUID> type3Ids
list of type3 data (json)
The only data that I get from my first call are the docIds. I was thinking of pushing these documents into a queue and having a set of workers (fetchers) make the relevant calls back to the datastore to retrieve the data.
retrieve the docData from datastore, fill in the type1, type2 and type3 UUIDS
do a batch get to retrieve all the type1, type2 and type3 docs
Push the results into another queue for posting to other system
The second set of workers (posters) would read each document from the second queue and post the results to the second system.
One question that I have, should I create 1 FixedThreadPool(size X) or two FixedThreadPool(size X/2)? Is there a danger of starvation if there are a lot of jobs in the first queue such that the second queue would not get started until the first queue was empty?
The fetchers will be making network calls to talk to the database, so they seem more IO bound than CPU bound. The posters will also make network calls, but they run in the cloud in the same VPC as my code, so they are fairly close together.
Blocking Queue
This is a pretty normal pattern.
If you have two distinct jobs to do, use two distinct thread pools and make their size configurable so you can size them as needed / test different values on the deployment server.
It is common to use a blocking queue (BlockingQueue built into Java 5 and later) with a bounded size (say, 1000 elements for an arbitrary example).
The blocking queue is thread-safe, so everything in the first thread pool writes to it as fast as they can, everything in the second thread pool reads as fast as it can. If the queue is full, the write just blocks, and if the queue is empty, the read just blocks - nice and easy.
You can tune the thread numbers and repeatedly run to narrow down the best configured size for each pool.
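A minimal sketch of that hand-off (class and method names are illustrative, and the pool sizes and queue bound are example values): one pool of "fetchers" puts into a bounded BlockingQueue and blocks when it is full, while a second pool of "posters" takes and blocks when it is empty.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of two pools joined by a bounded BlockingQueue.
class Pipeline {
    static int runDemo(int items) throws InterruptedException {
        BlockingQueue<String> handoff = new ArrayBlockingQueue<>(100);
        ExecutorService fetchers = Executors.newFixedThreadPool(4);
        ExecutorService posters = Executors.newFixedThreadPool(4);
        AtomicInteger posted = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(items);

        for (int i = 0; i < items; i++) {
            final int id = i;
            fetchers.submit(() -> {
                try { handoff.put("doc-" + id); }   // blocks when the queue is full
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            posters.submit(() -> {
                try {
                    handoff.take();                 // blocks when the queue is empty
                    posted.incrementAndGet();       // "post" the document downstream
                    done.countDown();
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        done.await(5, TimeUnit.SECONDS);
        fetchers.shutdownNow();
        posters.shutdownNow();
        return posted.get();
    }
}
```

The bound on the queue is what prevents starvation in the other direction too: fetchers cannot race ahead and fill memory, because `put` blocks once the queue holds 100 elements.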
I have this scenario where I am receiving events from thousands of sources. Each source is sending information about its current status. While I do want to process all events, it is more important to first process the latest event of each source, so that the current view is up to date. So I was thinking of using a ConcurrentHashMap with the identifier of each source as the key, and a LIFO queue (stack) as the value. I would then iterate through the keys of the Map and just pop one item off the stack of each source.
My concern is that while I am iterating through the keys and taking items off the queue of each key, the producer could post new events on the queues, potentially creating concurrency issues. The producer could also add new keys to the map, and iterating through the entrySet of the Map seems to be weakly consistent, which is not a huge issue because the new item will be processed in a subsequent iteration. Ideally I could also use some parallel processing on the stream of the entrySet to speed up the process.
I am wondering if there is a cleaner approach to this. In reality I could have used a LIFO BlockingDeque and processed the latest events first, but the problem with this approach is the risk that one source could send more events than the others and thus get more of its events processed.
Is there any other data structure that I could look into that provides this kind of behaviour? Essentially what I am looking for is a way to prioritise events from each source, while at the same time giving a fair chance to each source to be processed by the consumer.
I recommend building your own structure to manage this as it adds flexibility (and speed) for your use case in particular.
I'd go with a circular queue to store each LIFO queue (stack). A circular queue is one where you add elements at the tail and read (but don't remove) from the head. Once head = tail, you start over.
You can build your own queue using a simple array. It's not too hard to manage the synchronization around operations such as adding more queues to the array - and expanding it when needed. I believe adding queues to the array is not something you do very often.
This is easy to manage and you can expand your circular queue to calculate how often the entries are being accessed, and throttle the frequency of access to its entries (by adding/removing consumer threads, or even making them wait a bit before consuming from the stack managed by an entry).
You can even avoid thread locking when reading elements from the circular queue with multiple threads, by making each thread call a "register" operation before consuming from the stack: each thread has its own ID, and when it registers, that ID is stored on the given queue entry. Before registering and again before popping from the stack, the thread reads the registration ID back, and the ID returned must match its own. That means only the thread that "owns" a given queue entry can pop from that stack. If the registration or its confirmation fails, another thread is consuming from that entry, so the current thread moves on to the next available entry.
I've used this sort of strategy in the past and it scaled like a charm. I hope this makes sense for you.
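The registration trick described above can be sketched with a compare-and-set on an owner field per entry (this is my own minimal interpretation; `ClaimableEntry` and its method names are invented for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of "register before popping": each circular-queue entry
// carries an owner field; a thread claims the entry with CAS and only pops
// from the entry's stack while it holds the claim.
class ClaimableEntry {
    private static final long FREE = -1L;
    private final AtomicLong owner = new AtomicLong(FREE);

    // Returns true if threadId now owns this entry; false means another
    // thread holds it and the caller should move to the next entry.
    boolean register(long threadId) {
        return owner.compareAndSet(FREE, threadId);
    }

    // Confirm ownership before each pop.
    boolean owns(long threadId) {
        return owner.get() == threadId;
    }

    void release(long threadId) {
        owner.compareAndSet(threadId, FREE);
    }
}
```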
Did you think about a FIFO queue of LIFO queues? Each source adds to its LIFO queue and for processing you take the first LIFO queue from the FIFO queue, process one event and then put it back into the FIFO queue. This way you also should have no problem with new sources, as their LIFO queue will simply be added to the FIFO queue.
For adding events to the correct LIFO queue, you can maintain an additional HashMap that knows the queue per source and if a new source occurs that is not in the Map yet, you know you have to add its LIFO queue to the FIFO queue.
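A sketch of this FIFO-of-LIFOs idea (illustrative names; not thread-safe as shown, so a real version would guard these structures with locks or concurrent collections):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Each source has a LIFO stack; the stacks themselves rotate through a
// FIFO queue, so every source gets a fair turn while the newest event
// of each source is processed first.
class FairLifoScheduler<K, V> {
    private final Queue<Deque<V>> fifo = new LinkedList<>();
    private final Map<K, Deque<V>> bySource = new HashMap<>();

    void publish(K source, V event) {
        Deque<V> stack = bySource.get(source);
        if (stack == null) {
            stack = new ArrayDeque<>();
            bySource.put(source, stack);
            fifo.add(stack);      // new source: its stack joins the FIFO
        }
        stack.push(event);        // newest event on top (LIFO)
    }

    // Take one event from the next non-empty source in round-robin order.
    V next() {
        int n = fifo.size();
        for (int i = 0; i < n; i++) {
            Deque<V> stack = fifo.poll();
            fifo.add(stack);      // rotate the stack back to the tail
            if (!stack.isEmpty()) return stack.pop();
        }
        return null;              // all stacks are currently empty
    }
}
```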
Would a LinkedBlockingQueue be suitable for the following:
1. insert strings (maximum 1024 bytes) into the queue at a very high rate
2. every x inserts, or based on a timed interval, flush items into MySQL
During the flush, I was looking at the API: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/LinkedBlockingQueue.html
I was wondering if drainTo would be a good choice, since I have to aggregate before flushing.
So I would drainTo the items in the queue, then iterate, aggregate and write to MySQL.
Will this be suitable for up to 10K writes per second?
Do I need to consider any locking/synchronization issues or is that taken care of already?
I will store this LinkedBlockingQueue as the value in a ConcurrentHashMap.
Items will never be removed from the HashMap, only inserted if not present; if present, I will append to the queue.
It depends a bit if the inserter is per queue or for all queues. If I am understanding your spec, I would think something like the following would work.
Writer adds an item to the one of the LinkedBlockingQueue collections in your map. If the size of the queue is more than X (if you want it per queue) then it signals the MySQL inserter thread. Something like this should work:
queue.add(newItem);
// race conditions here may cause multiple signals but that's ok
if (queue.size() > 1000) {
    // this will work if there is 1 inserter per queue
    synchronized (queue) {
        queue.notify();
    }
}
...
Then the inserter waits on the queue in something like the following loop:
List<Item> insertList = new ArrayList<>();
while (!done) {
    synchronized (queue) {
        // typically this would be a while loop, but whether we are
        // notified or we time out, we insert
        if (queue.size() < 1000) {
            queue.wait(MILLIS_TIME_INTERVAL);
        }
    }
    queue.drainTo(insertList);
    // insert them into the db
    insertList.clear();
}
It gets a bit more complicated if there is one thread doing the inserts across all queues. I guess the question is then why you have the ConcurrentHashMap at all. If you do have one inserter which, for example, is inserting into multiple tables, then you will need a mechanism to inform the inserter which queue(s) need to be drained. It could just run through all of the queues in the map, but that might be expensive. You would synchronize on some global lock object, or maybe the map object instead of the queue.
Oh, and as @Peter Lawrey mentioned, you will quickly run out of memory if your database is slower than the writers, so make sure the queues have a proper capacity set so they limit the writers and keep the working memory down.
Hope this helps.
For every queue you need a thread and a connection, so I wouldn't create too many queues. You can perform over 10K writes per second provided your MySQL server can handle it (you will only know when you test it). LinkedBlockingQueue is thread-safe, and provided you create all your queues before you start, you don't need any extra locking/synchronization.
If you are inserting long Strings of up to 1024 characters at 10K per second, you are likely to run out of memory pretty fast (up to 36 GB per hour). Instead, I would have the database only insert new strings.
I have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavyweight, hence the queue to limit the number of objects in memory.
Multiple threads take objects from the queue, perform the work and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, e.g. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing references of the threads (Callables) to each other and checking thread.isDone() before a poll or offer, and then whether the element is null. I also check the size of the queue; as long as there are elements in it, they must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One way to accomplish this is to put a "dummy" or "poison" message on the queue when you are sure that no more tasks are going to arrive, for example after putting the message for the last row of the db query. The producer puts the dummy message on the queue; the consumer, on receiving this dummy message, knows that no more meaningful work is expected in this batch.
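A minimal poison-pill sketch (the class name and the sentinel value are my own, illustrative choices): the consumer can then use the blocking `take()` without any isDone() checks, because the pill is guaranteed to arrive last.

```java
import java.util.concurrent.BlockingQueue;

// The producer puts POISON after the last real item; the consumer stops
// when it takes it. With N consumers, the producer must put N pills so
// every consumer gets one.
class PoisonPillDemo {
    static final String POISON = "__END__"; // sentinel, never a real message

    static int consumeAll(BlockingQueue<String> queue) throws InterruptedException {
        int processed = 0;
        while (true) {
            String item = queue.take();     // blocks, but never deadlocks:
            if (POISON.equals(item)) break; // the pill always arrives last
            processed++;                    // do the real work here
        }
        return processed;
    }
}
```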
Maybe you should take a look at CompletionService.
It is designed to combine executor and queue functionality in one.
Tasks which have completed execution will be available from the completion service via
completionServiceInstance.take()
You can then use another executor for step 3, i.e. filling the DB with the results, which you feed with the results taken from the completionServiceInstance.
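A small sketch of that pattern using the standard ExecutorCompletionService (the method name and workload here are invented for illustration): calculations are submitted to a pool, and results are taken in completion order, ready to feed to a DB-writing stage.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CompletionDemo {
    static int sumOfSquares(int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
        for (int i = 1; i <= n; i++) {
            final int x = i;
            cs.submit(() -> x * x);   // the "calculation" step
        }
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += cs.take().get();   // blocks until the next result is ready
        }
        pool.shutdown();
        return sum;                   // e.g. hand results to the DB writer here
    }
}
```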
Could you advise me how to implement a queue in a multithreaded application?
I have to register events in my application in a queue (events from all users), which I flush and save to the database every 100 events to improve performance. I don't want to write a database log entry for every single user commit. I suppose committing e.g. every 100 events to the database will be faster than 100 single commits. I have three ideas:
use ThreadLocal
use a queue per user
use a synchronized LinkedList and flush it from time to time or after a number of events
Do you have any other ideas? I don't use log4j because I have to save the log to a database, not to a file. Which method will be the best?
You could try a BlockingQueue implementation like ArrayBlockingQueue. Any thread can safely add an event to the queue, and one (or more) threads can safely remove elements from the queue. Have a background thread wait for events on the queue (BlockingQueue.take()). Once you collect 100 elements, do your stuff.
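A sketch of that suggestion (class and method names are illustrative; the capacity and batch size are example values): any thread records events with `put`, and one background thread blocks for the first event, drains up to 99 more with `drainTo`, and commits the whole batch at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class EventLog {
    private final BlockingQueue<String> events = new ArrayBlockingQueue<>(10_000);

    // Safe to call from any thread; blocks if the queue is full.
    void record(String event) throws InterruptedException {
        events.put(event);
    }

    // Run by the background thread: block up to timeoutMs for the first
    // event, then drain whatever else is waiting, up to batchSize total.
    List<String> collectBatch(int batchSize, long timeoutMs) throws InterruptedException {
        List<String> batch = new ArrayList<>(batchSize);
        String first = events.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (first == null) return batch;          // nothing arrived in time
        batch.add(first);
        events.drainTo(batch, batchSize - 1);
        return batch;                             // commit these in one transaction
    }
}
```

The timeout on `poll` gives you the "flush from time to time" behavior even when fewer than 100 events have accumulated.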