Should I really need the "InProgress" flag while polling my database? - java

I'm implementing an event listener, querying new items to process, by creationTime in ascending order.
I deal with multithreading.
My current workflow is:
Querying a batch of items (let's say 50) containing the "New" flag.
Looping through those items, and for each item, updating its status to "InProgress".
For each item, still within the loop, start the corresponding process, detached in a thread (using Akka Actors in my case).
As soon as a process is fully completed, update the item's flag to "Consumed".
I set a polling frequency of 3 seconds, that obviously may involve query of new items BEFORE the current retrieved items are being fully processed (due to multithreading), with the flag "Consumed" set.
Only the querying is single-threaded, otherwise it would lead to retrieve duplicates.
I wonder if the step 2 is essential: updating each item with "InProgress" flag.
Indeed, it would slow down the whole.
I thought about skipping this step but to ensure that futures queries don't retrieve items that are currently being processed (let's imagine a very long computation), I would NOT start the next retrieval query as soon as the whole batch is processed.
Basically, my query step would wait for workers to finish their current jobs.
Obviously, this would make sense if the kind of jobs are similar in computation time.
What is a good practice of polling database while dealing with multithreaded computation?

Related

Polling items from DynamoDB

AWS newbie here.
I have a DynamoDB table and 2+ nodes of Java apps reading/writing from/to it. My use case is as follow: the app should fetch N numbers of items every X seconds based on a timestamp, process them, then remove them from the DB. Because the app may scale, other nodes might be reading from the DB in the same time and I want to avoid processing the same items multiple times.
The questions is: is there any way to implement something like a poll() method that fetches the item and immediately removes it (atomic operation) as if the table was a queue. As far as I checked, delete item methods that DynamoDBMapper offers do not return removed items data.
Consistency is a weak spot of DDB, but that's the price to pay for its scalability.
You said it yourself, you're looking for a queue, so why not use one?
I suggest:
Create a lambda that:
Reads the items
Publishes them to an SQS FIFO queue with message deduplication
Deletes the items from the DB
Create an EventBridge schedule to run the Lambda every n minutes
Have your nodes poll that queue instead of DDB
For this to work you have to consider a few things regarding timings:
DDB will typically be consistent in under a second, but this isn't guaranteed.
SQS deduplication only works for 5 minutes.
EventBridge only supports minute level granularity, not seconds.
So you can run your Lambda as frequently as once a minute, but you can run your nodes as frequently (or infrequently) as you like.
If you run your Lambda less frequently than every 5 minutes then there is technically a chance of processing an item twice, but this is very unlikely to ever happen (technically this could still happen anyway if DDB took >10 minutes to be consistent, but again, extremely unlikely to ever happen).
My understanding is that you want to read and delete an item in an atomic manner, however, we are aware that is not possible with DynamoDB.
However, what is possible is deleting the item and being returned the value, which is more likened to a delete then read. As you correctly pointed out, the Mapper client does not support ReturnValues however the low level clients do.
Key keyToDelete = new Key().withHashKeyElement(new AttributeValue("214141"));
DeleteItemRequest dir = new DeleteItemRequest()
.withTableName("ABC")
.withKey(keyToDelete)
.withReturnValues("ALL_OLD");
More info here DeleteItemRequest

use database as a queue of tasks

In one of our java applications (based on postgresql db), we have a database table that maintains a list of tasks to be executed.
Each row has a json blob for the details of a task as well as scheduled time value.
We have a few java workers/threads whose jobs are to search for tasks that are ready for execution (based on its schedule value), execute and delete them from the table. Execution of a task may take a few seconds.
The problem is, more than one worker may grab the same row, causing duplicate execution of a task, which is something we want to avoid.
One approach is, when doing select to grab a row, do it with FOR UPDATE to lock the row, supposedly preventing other worker from grabbing the same row that's locked.
My concern with this approach is, the row is only locked when the select transaction is being executed in the db (according to this), while the java code is actually executing the row/task that's selected, the locking has gone, another worker can grab it again.
Can some shed some light on whether the above approach is going to work for sure? Thanks!
Treat the DB calls as atomic instructions and design lock free algos around your table, using updates to change a boolean column "in-progress" from false to true. Could also just be a state int (0=avail, 1=inprogress, N=resultcode).
Make sure you have a partial index on state 0 (and possibly 1 to recover from crashes to find tasks in progress), so that the ...where state=0 remains selective and fast (on top of the scheduled time index of course).
Hope this helps.
When one thread has successfully locked the row on a given connection, another one attempting to obtain a lock on the row on a different connection should fail. You should issue the select-for-update with some kind of no-wait clause to request immediate failure if the row is locked.
Now, this doesn't solve the query vs lock race, as a failed lock may interrupt a thread's execution. You can solve that by (in each execution):
Select all records with new tasks (regardless of whether they're being processed or not)
For each new task returned in [1], run a matching select-for-update, then continue with processing the task if the lock fails.
If any lock attempt fails, skip the task without failing the entire process.

ExecutorService idle tasks

I use the java.util.ExecutorService to handle tasks, sometimes with only one worker. Not I'd like to add something like idle tasks, to preload data from the database and similar stuff while nothing is happening and the user has selected some item.
My first idea was to just add it as a task when the user selects something, because when the user starts an interaction with the selection, the data are needed and have to be loaded either way.
The problem with this approach is that when the user selects another item without doing something with the first selection, then there is this huge task in the Executor which only makes everything slower.
Any simple ideas how I could start something like that? I really don't want to build a huge management class to handle it and classify tasks or stuff like that.
So what about using PriorityBlockingQueue? Keep your tasks in that queue, give idle tasks low priority, so that they are always on the end of the queue. Implement your pool's runnables so that they simply take the highest priority task from the queue and execute it.
To be sure that executing idle tasks will be replaced by more important ones, you can implement them to be executed in short chunks and placed back in the queue after each chunk is finished. If something more important was placed in the queue in the meantime, it will be taken next, if not, idle task will be fetched once again from the queue.

Multiple thread selecting row from database optimisation

I have an java application where 15 threads select a row from table with 11,000 records, through a synchronized method called getNext(), the threads are getting slow at selection a row, thereby taking a huge amount of time. Each of the thread follows the following process:
Thread checks if a row with resume column value set to 1 exist.
A. If it exist the thread takes the id of that row and uses that id to select another row with id greater than that of the taking id.
B. Otherwise it select's a row with id greater than 0.
The last row received based on the outcome of steps described in 1 above is marked with the resume column set to 1.
The threads takes the row data and works on it.
Question:
How can multiple thread access thesame table selecting rows that another thread has not selected and be fast?
How can threads be made to resume in case of a crash at the last row that was selected by any of the threads?
1.:
It seems the multiple database operations in getNext() art the bottleneck. If the data isn't change by an outside source you could read "id" and "resume" of all rows and cache it. Than you would only have one query and than operate just in memory for reads. This would safe lot of expensive DB calls in getNext():
2.:
Basically you need some sort of transactions or at least add an other column that gets updated when a thread has finished processing that row. Basically the processing and the update need to happen in a single transaction. When something happens while the transaction is not finished, you can rollback to the state in which the row wasn't processed.
If the threads are all on the same machine they could use a shared data structure to avoid working on the same thing instead of synchronization. But the following assumes the threads are on on different machines ( maybe different members of an application server cluster ) and can only communicate via the database.
Remove synchronization on getNext() method. When setting the resume flag to 1 (step 2), do so atomically. update table set resume=1 where resume = 0, commit. Only one thread will succeed at this, the thread that does gets that unit of work. At the same time, set a resume time-- if the resume time is greater than some max assume the thread working on that unit of work hash crashed, set resume flag back to 0. After the work is finished set the resume time to null, or otherwise mark the work as done.
Well, would think of different issues here:
Are you keeping status in your DB? I would look for some approach where you call a select for update where you filter by inactive status (be sure just to get one row in the select) and immediately update to active (in same transaction). It would be nice to know what DB you're using, not sure if "select for update" is always an option.
Process and when you're finished, update to finished status.
Be sure to keep a timestamp in the table to identifiy when you changed status for the last time. Make yourself a rule to decide when an active thread will be treated as lost.
Define other possible error scenarios (what happens if the process fails).
You would also need to analyze the scenario. How many rows does your table have? How many threads call it concurrently? How many inserts occur in a given time? Depending on this you will have to see how DB performance is running.
I'm assuming you'r getNext() is synchronized, with what I wrote on point 1 you might get around this...

producer-consumer: how to know inform that prodcution completed

i have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavy weight hence the queue to limit amount of objects in memory.
A multiple threads take objects from the Queue, performs work and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, eg. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing a references of the threads (callable) to each other and checking thread.isDone() before a poll or offer and then if the element is null. I also check the size of the queue, as long as there are elements in it, the must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One of the ways to accomplish would be to put a "dummy" or "poison" message as the last message on the queue when you are sure that no more tasks are going to arrive on the queue.. for example after putting the message related to the last row of the db query. So the producer puts a dummy message on the queue, the consumer on receiving this dummy message knows that no more meaningful work is expected in this batch.
Maybe you should take a look at CompletionService
It is designed to combine executor and a queue functionality in one.
Tasks which completed execution will be available from the completions service via
completionServiceInstance.take()
You can then again use another executor for 3. i.e. fill DB with the results, which you will feed with the results taken from the completionServiceInstance.

Categories