Multiple threads selecting rows from a database, optimisation - Java

I have a Java application where 15 threads select rows from a table with 11,000 records through a synchronized method called getNext(). The threads have become slow at selecting a row and so take a huge amount of time. Each thread follows this process:
1. The thread checks whether a row with the resume column set to 1 exists.
A. If one exists, the thread takes the id of that row and uses it to select another row with an id greater than that id.
B. Otherwise it selects a row with an id greater than 0.
2. The row obtained through step 1 is marked by setting its resume column to 1.
3. The thread takes the row data and works on it.
Questions:
How can multiple threads access the same table, selecting rows that no other thread has selected, and be fast?
How can threads be made to resume, in case of a crash, at the last row that was selected by any of the threads?

1.:
It seems the multiple database operations in getNext() are the bottleneck. If the data isn't changed by an outside source, you could read the "id" and "resume" values of all rows once and cache them. Then you would only have one query, and reads would operate purely in memory afterwards. This would save a lot of expensive DB calls in getNext().
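As a rough sketch of that caching idea (the work_items table and its column names are assumptions; the in-memory step mirrors the A/B selection logic from the question):

    import java.sql.*;
    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Load id/resume for every row once; afterwards getNext() never touches the DB.
    class RowCache {
        private final ConcurrentSkipListMap<Long, Integer> resumeById = new ConcurrentSkipListMap<>();

        void load(Connection con) throws SQLException {
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, resume FROM work_items")) {
                while (rs.next()) {
                    resumeById.put(rs.getLong("id"), rs.getInt("resume"));
                }
            }
        }

        // Highest id marked resume=1, else 0; then the next id above it.
        // With no resumed row this returns the lowest id, matching step B.
        synchronized Long getNext() {
            long last = resumeById.entrySet().stream()
                    .filter(e -> e.getValue() == 1)
                    .map(Map.Entry::getKey)
                    .max(Long::compare)
                    .orElse(0L);
            return resumeById.higherKey(last);
        }
    }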
2.:
Basically you need some sort of transaction, or at least another column that gets updated when a thread has finished processing a row. The processing and the update need to happen in a single transaction; if something goes wrong while the transaction is not yet finished, you can roll back to the state in which the row wasn't processed.
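A minimal JDBC sketch of that single-transaction idea, assuming a processed column and a work_items table (both hypothetical), with doWork() standing in for the real processing:

    import java.sql.*;

    // Run the processing and the status update in one transaction: a crash
    // before commit rolls both back, leaving the row unprocessed.
    static void processRow(Connection con, long id) throws SQLException {
        con.setAutoCommit(false);
        try {
            doWork(con, id);  // the actual processing (stub below)
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE work_items SET processed = 1 WHERE id = ?")) {
                ps.setLong(1, id);
                ps.executeUpdate();
            }
            con.commit();     // work and status change become visible together
        } catch (SQLException | RuntimeException e) {
            con.rollback();   // row stays in its unprocessed state
            throw e;
        }
    }

    static void doWork(Connection con, long id) { /* hypothetical processing */ }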

If the threads are all on the same machine, they could use a shared data structure to avoid working on the same thing, instead of synchronization. But the following assumes the threads are on different machines (maybe different members of an application server cluster) and can only communicate via the database.
Remove the synchronization on the getNext() method. When setting the resume flag to 1 (step 2), do so atomically: update table set resume = 1 where resume = 0, then commit. Only one thread will succeed at this, and the thread that does gets that unit of work. At the same time, set a resume time; if the resume time is older than some maximum, assume the thread working on that unit of work has crashed and set the resume flag back to 0. After the work is finished, set the resume time to NULL, or otherwise mark the work as done.
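A hedged sketch of that atomic hand-off and the crash-recovery sweep (table and column names, including resume_time, are assumptions; auto-commit is assumed on, so each UPDATE commits by itself):

    import java.sql.*;

    // Atomic claim: exactly one thread's UPDATE will report 1 affected row.
    static boolean claim(Connection con, long id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE work_items SET resume = 1, resume_time = CURRENT_TIMESTAMP"
              + " WHERE id = ? AND resume = 0")) {
            ps.setLong(1, id);
            return ps.executeUpdate() == 1;  // 1 row updated => this thread won
        }
    }

    // Recovery sweep: treat claims older than maxMinutes as crashed workers.
    static int releaseStale(Connection con, int maxMinutes) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE work_items SET resume = 0, resume_time = NULL"
              + " WHERE resume = 1 AND resume_time < ?")) {
            ps.setTimestamp(1, Timestamp.from(
                    java.time.Instant.now().minusSeconds(maxMinutes * 60L)));
            return ps.executeUpdate();  // number of units handed back
        }
    }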

Well, I would think of different issues here:
Are you keeping status in your DB? I would look for an approach where you issue a select for update filtered by inactive status (be sure to get just one row in the select) and immediately update it to active (in the same transaction). It would be nice to know which DB you're using; I'm not sure "select for update" is always an option.
Process and when you're finished, update to finished status.
Be sure to keep a timestamp in the table to identify when you last changed status. Make yourself a rule to decide when an active thread should be treated as lost.
Define other possible error scenarios (what happens if the process fails).
You would also need to analyse the scenario: how many rows does your table have? How many threads call it concurrently? How many inserts occur in a given time? Depending on this you will have to see how DB performance holds up.
I'm assuming your getNext() is synchronized; with what I wrote in the first point you might get around this...
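For the first point, a minimal JDBC sketch of claiming one inactive row and flipping it to active in the same transaction (a hypothetical tasks table, with SQL assuming a database that accepts LIMIT with FOR UPDATE, e.g. PostgreSQL):

    import java.sql.*;

    // Lock one inactive row and mark it active before anyone else can see it.
    static Long claimOne(Connection con) throws SQLException {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT id FROM tasks WHERE status = 'INACTIVE'"
               + " ORDER BY id LIMIT 1 FOR UPDATE")) {
            if (!rs.next()) { con.rollback(); return null; }  // nothing to do
            long id = rs.getLong(1);
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE tasks SET status = 'ACTIVE',"
                  + " status_ts = CURRENT_TIMESTAMP WHERE id = ?")) {
                ps.setLong(1, id);
                ps.executeUpdate();
            }
            con.commit();  // lock released; the row is already marked ACTIVE
            return id;
        }
    }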

Related

Use database as a queue of tasks

In one of our Java applications (based on a PostgreSQL db), we have a database table that maintains a list of tasks to be executed.
Each row has a JSON blob with the details of a task, as well as a scheduled-time value.
We have a few Java workers/threads whose job is to search for tasks that are ready for execution (based on the schedule value), execute them, and delete them from the table. Execution of a task may take a few seconds.
The problem is that more than one worker may grab the same row, causing duplicate execution of a task, which is something we want to avoid.
One approach is, when doing the select to grab a row, to do it with FOR UPDATE to lock the row, supposedly preventing other workers from grabbing the same row.
My concern with this approach is that the row is only locked while the select transaction is executing in the db (according to this); while the Java code is actually executing the selected row/task, the lock is gone, and another worker can grab it again.
Can someone shed some light on whether the above approach is going to work for sure? Thanks!
Treat the DB calls as atomic instructions and design lock-free algorithms around your table, using updates to change a boolean column "in_progress" from false to true. It could also just be a state int (0 = available, 1 = in progress, N = result code).
Make sure you have a partial index on state 0 (and possibly on 1, to recover from crashes by finding tasks in progress), so that the ...WHERE state = 0 remains selective and fast (on top of the scheduled-time index, of course).
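A sketch of that lock-free claim with the state-int variant (PostgreSQL assumed; table and column names are invented, and the UPDATE acts as the compare-and-swap):

    import java.sql.*;

    // One-time setup (PostgreSQL partial index so the ready scan stays fast):
    //   CREATE INDEX tasks_ready_idx ON tasks (scheduled_at) WHERE state = 0;

    // Atomic claim: with auto-commit on, the UPDATE commits by itself, and
    // only one worker can move the row from 0 (available) to 1 (in progress).
    static boolean claimTask(Connection con, long taskId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE tasks SET state = 1 WHERE id = ? AND state = 0")) {
            ps.setLong(1, taskId);
            return ps.executeUpdate() == 1;  // 1 => this worker owns the task
        }
    }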
Hope this helps.
When one thread has successfully locked the row on a given connection, another one attempting to obtain a lock on the row on a different connection should fail. You should issue the select-for-update with some kind of no-wait clause to request immediate failure if the row is locked.
Now, this doesn't solve the query vs lock race, as a failed lock may interrupt a thread's execution. You can solve that by (in each execution):
Select all records with new tasks (regardless of whether they're being processed or not)
For each new task returned in [1], run a matching select-for-update, then continue with processing the task if the lock succeeds.
If any lock attempt fails, skip the task without failing the entire process.
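Roughly, that two-pass scan could look like this in JDBC (PostgreSQL assumed, where a failed FOR UPDATE NOWAIT raises SQLState 55P03; table and column names are invented):

    import java.sql.*;
    import java.util.*;

    // Pass 1: list candidate tasks. Pass 2: lock each candidate individually,
    // skipping any row another worker already holds.
    static void pollOnce(Connection con) throws SQLException {
        con.setAutoCommit(false);
        List<Long> ids = new ArrayList<>();
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id FROM tasks WHERE ready = true")) {
            while (rs.next()) ids.add(rs.getLong(1));
        }
        for (long id : ids) {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id FROM tasks WHERE id = ? AND ready = true FOR UPDATE NOWAIT")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        // ... execute the task and delete its row here ...
                        con.commit();  // releases the lock
                    }
                }
            } catch (SQLException e) {
                if (!"55P03".equals(e.getSQLState())) throw e;  // a real error
                con.rollback();  // lock conflict: clear the failed transaction, skip task
            }
        }
    }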

Strategy for locking rows for updates with multiple processing nodes

I have an application on Spring Boot with PostgreSQL.
The application performs updates on rows in a database. In the past it used SELECT FOR UPDATE SKIP LOCKED to fetch new data and do updates in one thread; this was done to prevent several nodes from updating the same row (and, as a consequence, to speed up the update process for documents).
Now, to speed up processing, selecting rows and performing the update requests (to an external service) run in separate threads (multiple workers with RestTemplate to smooth out I/O waiting time) that fill a queue with ready updates, while another worker thread performs post-processing by selecting from the queue and inserting into the database. So now the select and the update run in separate processes and work in different transactions.
What is a good way to keep the behaviour of SELECT FOR UPDATE SKIP LOCKED, now that processing is split across different threads, to prevent different nodes from updating the same rows?
I am thinking about adding a few fields to the table, like update_status, update_started, and node, and selecting with WHERE update_status != 'IN PROGRESS'; to avoid holding rows if the app crashes, I would add something like AND update_started < now() - INTERVAL '20 minutes'.
A second way would be to pass the connection from the pool, along with the document, to the other process; maybe that is the better solution. As far as I know, I can also see which node acquired a lock, so that is also good for monitoring.
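For what it's worth, the first idea could be folded into a single claiming statement, roughly like this (PostgreSQL assumed; the documents table is invented, and update_status/update_started/node are the columns proposed above):

    import java.sql.*;

    // Claim a batch for this node in one atomic statement; rows stuck
    // IN PROGRESS for over 20 minutes are treated as left by a crashed node.
    static int claimBatch(Connection con, String nodeId, int limit) throws SQLException {
        String sql =
            "UPDATE documents SET update_status = 'IN PROGRESS',"
          + "       update_started = now(), node = ?"
          + " WHERE id IN (SELECT id FROM documents"
          + "               WHERE update_status IS DISTINCT FROM 'IN PROGRESS'"
          + "                  OR update_started < now() - INTERVAL '20 minutes'"
          + "               LIMIT ? FOR UPDATE SKIP LOCKED)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, nodeId);
            ps.setInt(2, limit);
            return ps.executeUpdate();  // with auto-commit on, the claim is atomic
        }
    }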

Commit changes while keeping the lock

I have an application that works with a database table like:
Id, state, procdate, result
When some data needs processing, the app sets state to PROCESSING. After processing, the result is written to the result column and the state goes to STANDBY.
To do the first switch to PROCESSING I start a transaction, do a select for update, then update the state and procdate.
Then I do the work and, using select for update again, update the state and the result.
The processing may take up to 5 minutes. The state switching is needed to see how many rows are in progress. The problem is that another request for processing may occur, and it has to wait until the first processing ends.
So I want to keep the row locked. If I issue the select for update for locking just after I commit the PROCESSING state, a second request may intercept and lock the row.
So how can I both keep the lock and commit the changes?
You'll need to handle this with your design. Here is an idea.
Your records initially have a status, say 'READY', and a processing id, initially null.
When you start, update the status to 'PROCESSING' and set the id to a value for the job run; this can come from a sequence within Oracle, such that it is unique to your process run. Commit.
The process then runs with the same id, selecting rows with status 'PROCESSING' and its own processing id. Complete the processing, update the status to 'COMPLETE' (or 'STANDBY' as you have it). Commit.
This allows a second process to select new 'READY' records and set them for its own processing without interference with the already running process.
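A rough JDBC sketch of that pattern (Oracle-flavoured SQL, since the answer mentions an Oracle sequence; the job_seq sequence and the work table/columns are assumptions):

    import java.sql.*;

    // Stamp READY rows with this run's unique id, commit, then work only on
    // rows carrying that id: no other run can interfere after the commit.
    static void runJob(Connection con) throws SQLException {
        con.setAutoCommit(false);
        long runId;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT job_seq.NEXTVAL FROM dual")) {
            rs.next();
            runId = rs.getLong(1);
        }
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE work SET status = 'PROCESSING', proc_id = ?"
              + " WHERE status = 'READY'")) {
            ps.setLong(1, runId);
            ps.executeUpdate();
        }
        con.commit();  // the claim is visible and the row locks are released
        // ... now process rows WHERE status = 'PROCESSING' AND proc_id = runId,
        // then set status = 'COMPLETE' and commit again.
    }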
Here are two approaches I have taken. (I provide a third, but have never had to take that approach.)
1) Why not exit the transaction after committing the changes?
2) If option 1 is not viable, then you could simply:
COMMIT the changes
attempt to re-acquire the lock; if you fail, leave the screen, else just continue.
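A minimal sketch of option 2, assuming Oracle, where a failed FOR UPDATE NOWAIT raises ORA-00054 (JDBC error code 54); the table and column names are invented:

    import java.sql.*;

    // Commit, then immediately try to take the row lock back. Returns false
    // if another session won the race, so the caller can leave the screen.
    static boolean commitAndRelock(Connection con, long id) throws SQLException {
        con.commit();  // releases the current row lock
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id FROM records WHERE id = ? FOR UPDATE NOWAIT")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();  // lock re-acquired
            }
        } catch (SQLException e) {
            if (e.getErrorCode() == 54) return false;  // ORA-00054: row already locked
            throw e;
        }
    }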
3) If it is absolutely imperative that no one can ever acquire the lock in the middle of a commit... you could actually lock another object. I will admit, I have never had to take this approach, but it would be as follows:
Initial phase
LOCK GLOBALOBJECT
Attempt to Acquire record lock for table
UNLOCK GLOBALOBJECT
Test to see if record lock was attained
Phase for committing the change
LOCK GLOBALOBJECT
COMMIT change
Acquire record lock for table
UNLOCK GLOBALOBJECT
Test to see if record lock was attained (failure should never happen here...)
I have never needed this kind of logic, and I really do not like it, since it requires a GLOBAL locking object for this table. Again, it depends on your code, and on how critical it is that no one can acquire the lock while changes are being committed.
However, just make sure you are not gold-plating your code when simply exiting the transaction after committing a change would be fine for your stakeholders.

Oracle 11g - System performance impacted as the data grows

Our system has quite a high level of concurrency, with multiple Java threads picking up one record at a time from a given Oracle 11g table which normally holds about two million records.
There are always many records ready to be picked up for processing. The records ready to be processed are selected based on a relatively complex SQL statement, but once selected the processing order follows a FIFO algorithm (ID order).
It is crucial that the same record is not picked up by two distinct threads. Because of this we have a locking mechanism in place.
From a high-level view, the way it works at the moment is that a Java thread invokes a stored procedure which in turn opens a RECORD_READY_KEYS cursor, iterates through that cursor, and tries to acquire a lock on a record in a locking table with that key. The locking attempt is done with SELECT FOR UPDATE SKIP LOCKED. If the lock succeeds, the record to process is returned to the Java thread for processing.
Everything works fine as long as there are not too many records ready to process. However, when this number grows over a limit (from our observations, over 15K), the SQL statement used to open the RECORD_READY_KEYS cursor starts degrading in performance. Despite being fully optimised, it starts taking close to 0.2 seconds to run, which means each Java thread can process at most five records per second. In reality, considering the time taken to acquire the lock, travel over the network, actually do the processing, commit the transaction, etc., processing is even slower.
Increasing the number of Java threads is an option; however, we cannot go over a certain limit, as they would start putting pressure on the database/application server/system resources, etc.
The real problem is that we run an SQL statement to get the RECORD_READY_KEYS cursor containing fifteen thousand keys out of a total of two million, pick up the first available record from the top, and then discard the rest by closing the cursor.
My idea would be to have a KEYS_CACHE nested table defined at package level and to store the result of the RECORD_READY_KEYS selection in that nested table. Once a key is locked, it would be deleted from KEYS_CACHE and returned to the Java thread. The process can go on that way until the whole KEYS_CACHE is consumed, at which point it gets populated again.
Now my questions will be:
Q1. Can you see any weak point in this approach?
I can see multiple threads trying to lock the same record at the same time and thus wasting a bit of time. On the Java side we can make the stored procedure invocation synchronized only to a given extent, as the invocation will happen from multiple JVMs. However, I cannot see this as a major issue.
Another issue would be when an unlikely rollback happens, as there will be no easy way to put the deleted key back. The next RECORD_READY_KEYS selection will pick it up again, and a delay of a few minutes will not really matter.
Q2. As the nested table loses records it will become very sparse. Can you see this becoming a problem? If so, should I limit the initial size to, say, 5,000 keys, or does it not really matter?
Q3. Can you see a problem with that package-level KEYS_CACHE nested table being accessed concurrently by so many threads (we have between 25 and 100 of them)?
Q4. Can you see an alternative approach that would not require a whole system redesign?
Thank you in advance
I think I was not very clear when explaining my situation. We do not lock the records to process in the two-million-record table; we lock the keys instead, which are also saved in a different locking table.
Say I have this two-million-record table called messages:
And if there are only messages with Key-A, Key-B, and Key-C ready to be processed, a possible content of the key locking table may be:
Note that Key-X is in there even though no messages are ready to be processed for that key, because messages with that key just finished processing and the clean-up thread has not kicked off yet. That is OK, and even desirable: in case more new messages with Key-X enter the system in a short while, it will save a new insert.
So our select (fully optimised) will obtain a list with Key-A, Key-C, and Key-B in this order (Key-C comes before Key-B because it has a message with Id = 2, which is smaller than the first Key-B message with Id = 6).
Very simplified, what we do here in fact is:
SELECT key FROM messages WHERE ready = 'Y' GROUP BY key ORDER BY MIN(id)
Once we get that select into a cursor, we fetch the keys one by one and try to lock each in the key_lockings table. Once a lock succeeds, the key gets assigned to a thread (there is a threads table for this) and stays with that thread, which processes all messages that are ready for that key. As I mentioned in my first post, it is crucial that messages with the same key be processed by the same thread, as the key is how we link related messages, which must be processed in sequence.
The SELECT above is instant when the number of keys selected is up to a few thousand. It still performs OK when it gets to around 10,000 keys. Once the number of retrieved keys goes over 15,000, the performance starts degrading. The time to run the SELECT is still OK (about 0.2 seconds), and we do have indexes on all fields involved in the selection. It is simply applying the WHERE, GROUP BY, and ORDER BY to select 15,000 keys out of two million records that takes the time.
So the problem for us is that every single thread runs the same SELECT and gets 15,000 keys just to pick up one of them. The thing I was considering was, rather than closing the cursor and throwing the hard work away as we do at the moment, to try storing those keys in a package-level nested table and deleting the keys from there as we allocate them to threads. My first three questions just wanted to capture some other opinions about this approach, while the last one was about finding alternative ideas (e.g. someone might say use Advanced Queues, etc.).
I have an example to hand that is (I think) very similar.
You have a table, with lots of rows, and you have multiple consumer processes (the Java threads), that all want to work on the contents of that table.
First off, I recommend avoiding SKIP LOCKED. If you think you need it, consider if you've set your INITRANS high enough on the table. Consider that SKIP LOCKED means that Oracle will skip locked resources, not just locked rows. If a block's ITL is full, SKIP LOCKED will skip it, even if there are unlocked rows in the block!
For a more detailed discussion of this, see here:
https://markjbobak.wordpress.com/2010/04/06/unintended-consequences/
So, on to my suggestion. For each concurrent Java thread, define a thread number. Suppose you have 10 concurrent threads: assign each a thread number or thread id, 0-9. Now, if you have the flexibility to modify the table, you could add a column, THREAD_ID, and then use that in the select statement when selecting from the table. Each concurrent Java thread will select only those rows that match its thread id. In this way, you can guarantee that you'll avoid collisions. If you don't have the ability to add a column to the table, then you hopefully have a nice, numeric, sequence-driven primary key? If so, you can get the same effect by querying MOD(PRIMARY_KEY_COLUMN, 10) = :client_thread_id.
Additionally, do you have columns that specify a status of some sort, or something like that, which you'll use to determine which rows from the table are eligible to be processed by the Java thread? If so, and particularly if that criteria significantly improves selectivity, creating a virtual column that is only populated for the values which you're interested in, could be quite useful, if that column is then added to the index. (THREAD_ID, STATUS), for example.
Finally, you mentioned processing in a specific order. If THREAD_ID, STATUS is your selection criteria, then perhaps a PRIORITY or STATUS_DATE column may be your ordering requirement. In that case, it may be useful to continue to build out the index, to add in the column(s) specifying the required order, and top it off with the primary key of the table.
With a carefully constructed index, and using the THREAD_ID idea, it should be possible to construct an index that will allow you to:
avoid collisions (Use THREAD_ID or MOD() on primary key)
minimize the size of the index (virtual columns)
avoid any ORDER BY operation (add order by columns to index)
avoid any TABLE ACCESS BY ROWID operation (add primary key column to end of index)
I made a few assumptions that may or may not apply.
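As a rough illustration of the MOD() partitioning (a sketch only; the messages table and ready column come from the question, the partitioning here is by message id, and the keyed design described above would partition on a hash of the key instead):

    import java.sql.*;
    import java.util.*;

    // Each worker sees a disjoint slice of the id space, so two workers can
    // never collide on the same row.
    static List<Long> readyIdsFor(Connection con, int threadId, int threadCount)
            throws SQLException {
        String sql = "SELECT id FROM messages WHERE ready = 'Y'"
                   + " AND MOD(id, ?) = ? ORDER BY id";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, threadCount);
            ps.setInt(2, threadId);
            try (ResultSet rs = ps.executeQuery()) {
                List<Long> ids = new ArrayList<>();
                while (rs.next()) ids.add(rs.getLong(1));
                return ids;
            }
        }
    }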

Do I really need the "InProgress" flag while polling my database?

I'm implementing an event listener that queries new items to process, by creationTime in ascending order.
I deal with multithreading.
My current workflow is:
Querying a batch of items (let's say 50) containing the "New" flag.
Looping through those items, and for each item, updating its status to "InProgress".
For each item, still within the loop, start the corresponding process, detached in a thread (using Akka Actors in my case).
As soon as a process is fully completed, update the item's flag to "Consumed".
I set a polling frequency of 3 seconds, which obviously means new items may be queried BEFORE the currently retrieved items have been fully processed (due to multithreading) and flagged "Consumed".
Only the querying is single-threaded, otherwise it would lead to retrieve duplicates.
I wonder if step 2, updating each item with the "InProgress" flag, is essential.
Indeed, it slows down the whole process.
I thought about skipping this step, but to ensure that future queries don't retrieve items that are currently being processed (imagine a very long computation), I would NOT start the next retrieval query until the whole batch is processed.
Basically, my query step would wait for workers to finish their current jobs.
Obviously, this only makes sense if the jobs are similar in computation time.
What is a good practice of polling database while dealing with multithreaded computation?
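One way to keep step 2 cheap, rather than dropping it, is to fold the query and the flag update into a single statement, so the whole batch is claimed in one round-trip; a hedged sketch, assuming PostgreSQL (the items table and column names are invented):

    import java.sql.*;
    import java.util.*;

    // Fetch-and-mark in one statement: the returned ids are already flagged
    // InProgress, so the next poll cannot hand them out again. Safe here
    // because, as in the question, only one thread runs the query step.
    static List<Long> claimNewBatch(Connection con, int batchSize) throws SQLException {
        String sql =
            "UPDATE items SET status = 'InProgress'"
          + " WHERE id IN (SELECT id FROM items WHERE status = 'New'"
          + "               ORDER BY creation_time LIMIT ?)"
          + " RETURNING id";
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, batchSize);
            try (ResultSet rs = ps.executeQuery()) {  // PgJDBC: RETURNING via executeQuery
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        return ids;  // dispatch each id to a worker; workers set 'Consumed' when done
    }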
