I have written code that pulls database rows and processes them.
Right now I select 100 rows, set their status to ProcessInprogress, and after successful processing mark those 100 rows as Processed one by one.
This process is scheduled to run every 2 minutes under Quartz.
Question: what do I have to take care of so that this process can run successfully when my code is deployed on multiple nodes, so that duplicate data processing on another node is avoided?
Please suggest:)
Right now I am locking the records fetched by the polling process so that the same records are not fetched by the polling process of another instance or node. But this makes the other process wait for the lock to be released.
Is there anything I can do so that the other process moves on to fetch the next 100 records instead of waiting on the lock?
Please share any further suggestions for handling the behavior of multiple nodes for a database polling process running in Java.
Thanks
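For reference, this claim-and-skip pattern can be expressed as a single statement. The following is only a minimal sketch, assuming PostgreSQL 9.5+ (or another DB supporting SKIP LOCKED) and an illustrative records(id, status) table; with FOR UPDATE SKIP LOCKED a node claims its own batch of 100 rows, and other nodes skip past the rows it has locked instead of waiting:

```java
// Sketch: claim up to 100 unprocessed rows for this node in one atomic statement.
// Table/column/status names are illustrative; assumes autoCommit is disabled.
List<Long> claimBatch(Connection con) throws SQLException {
    String sql =
        "UPDATE records SET status = 'ProcessInprogress' " +
        "WHERE id IN (SELECT id FROM records WHERE status = 'New' " +
        "             ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED) " +
        "RETURNING id";
    List<Long> claimed = new ArrayList<>();
    try (PreparedStatement ps = con.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            claimed.add(rs.getLong(1));   // these rows now belong to this node only
        }
    }
    con.commit();                          // make the claim visible to other nodes
    return claimed;                        // process, then mark each row 'Processed'
}
```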
1. cron job starts
2. create Entity1 and save it to the DB
3. fetch transactionEntity rows from the DB
4. use those transactions as transactionIds:
for (Transaction id : transactionIds) {
a. create Entity2 and save it to the DB
b. fetch paymentEntity from the DB
c. response = POST request (REST API call)
d. update Entity2 with the response
}
5. update Entity1.
Problem statement - I am getting 5000+ transactions from the DB in transactionIds via the cron job, and they need to be processed as described above. With this approach, while the previous loop is still running, the next 5000+ transactions enter the loop because the cron job runs every 2 minutes.
I have looked at multiple solutions (.parallelStream() with ForkJoinPool, ListenableFuture), but I am unable to decide which is the best way to scale the above code. Can I use Spring Batch for this? If yes, how, and which of the steps above would go into the reader, processor, and writer?
One way to approach this problem is to use Kafka for consuming the messages. You can increase the number of pods (hopefully you are using microservices), and each pod can be part of a consumer group. This effectively removes the loop from your code, and consumers can be scaled on demand to handle any load.
Another advantage of a message-based approach is that you can choose between multiple delivery modes (at least once, at most once, etc.), and there are many open-source libraries available to view the stats of a topic (e.g. the lag between consumption and production of messages).
If this is not possible:
The REST call should not happen for every transaction; you need to post the transactions as a batch. API calls are always expensive, so fewer round trips will make a huge difference in the time taken to complete the loop.
Instead of updating the DB directly before and after each API call, change the loop to use
repository.saveAll(yourentitycollection) // Only one DB call after looping, can be batched
I suggest moving to a producer-consumer strategy in the near future.
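To illustrate the two suggestions above (batch the API call, save once after the loop), here is a rough sketch. It assumes Spring Data JPA repositories, a RestTemplate, and a REST endpoint that accepts a list of transactions; the repository, endpoint, and entity names are hypothetical:

```java
// Sketch: chunk the transactions, one REST round trip per chunk, one saveAll at the end.
public void processReadyTransactions() {
    List<TransactionEntity> transactions = transactionRepository.findAllReadyForProcessing();
    int batchSize = 500;                                  // tune to what the API can accept
    List<Entity2> results = new ArrayList<>();

    for (int i = 0; i < transactions.size(); i += batchSize) {
        List<TransactionEntity> chunk =
            transactions.subList(i, Math.min(i + batchSize, transactions.size()));

        // One HTTP round trip for the whole chunk instead of one per transaction.
        BatchResponse response =
            restTemplate.postForObject(apiUrl, chunk, BatchResponse.class);

        for (TransactionEntity tx : chunk) {
            results.add(buildEntity2(tx, response));      // build in memory, no DB call yet
        }
    }

    // Single (batchable) DB call after the loop, as suggested above.
    entity2Repository.saveAll(results);
}
```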
In one of our Java applications (based on a PostgreSQL DB), we have a database table that maintains a list of tasks to be executed.
Each row has a JSON blob with the details of a task as well as a scheduled-time value.
We have a few Java workers/threads whose job is to search for tasks that are ready for execution (based on their scheduled time), execute them, and delete them from the table. Execution of a task may take a few seconds.
The problem is that more than one worker may grab the same row, causing duplicate execution of a task, which is something we want to avoid.
One approach is, when doing the select to grab a row, to do it with FOR UPDATE to lock the row, supposedly preventing other workers from grabbing the same locked row.
My concern with this approach is that the row is only locked while the select transaction is being executed in the DB (according to this); by the time the Java code is actually executing the selected row/task, the lock is gone and another worker can grab it again.
Can someone shed some light on whether the above approach is going to work for sure? Thanks!
Treat the DB calls as atomic instructions and design lock-free algorithms around your table, using updates to change a boolean "in progress" column from false to true. It could also just be an int state column (0 = available, 1 = in progress, N = result code).
Make sure you have a partial index on state 0 (and possibly 1, to recover from crashes by finding tasks that were in progress), so that the ... WHERE state = 0 remains selective and fast (on top of the scheduled-time index, of course).
Hope this helps.
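As a concrete illustration, here is a minimal JDBC sketch of that state-column approach, assuming a tasks(id, payload, scheduled_at, state) table; the table, state values, and executeTask() are illustrative placeholders. The suggested partial index (PostgreSQL) would be created once, e.g. CREATE INDEX tasks_avail_idx ON tasks (scheduled_at) WHERE state = 0.

```java
// Sketch: claim a task by atomically flipping state 0 -> 1; losers simply move on.
void tryRunTask(DataSource dataSource, long taskId) throws SQLException {
    try (Connection con = dataSource.getConnection();
         PreparedStatement claim = con.prepareStatement(
             "UPDATE tasks SET state = 1 WHERE id = ? AND state = 0")) {
        claim.setLong(1, taskId);
        if (claim.executeUpdate() == 1) {              // atomic claim: only one worker wins
            executeTask(taskId);                       // your task logic, outside any row lock
            try (PreparedStatement done = con.prepareStatement(
                     "DELETE FROM tasks WHERE id = ?")) {  // or set a result-code state
                done.setLong(1, taskId);
                done.executeUpdate();
            }
        }
        // executeUpdate() == 0 means another worker already claimed it; just skip.
    }
}
```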
When one thread has successfully locked a row on a given connection, another thread attempting to obtain a lock on the same row on a different connection should fail. You should issue the select-for-update with some kind of no-wait clause to request immediate failure if the row is locked.
Now, this doesn't solve the query-vs-lock race, as a failed lock may interrupt a thread's execution. You can solve that by doing the following in each run (see the sketch after this list):
1. Select all records with new tasks (regardless of whether they are being processed or not).
2. For each new task returned in step 1, run a matching select-for-update, then continue with processing the task if the lock succeeds.
3. If any lock attempt fails, skip that task without failing the entire process.
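A rough sketch of that loop, assuming PostgreSQL, where FOR UPDATE NOWAIT fails immediately (SQLSTATE 55P03) if another worker already holds the row lock; the table and helper names (selectReadyTaskIds, executeTask) are illustrative:

```java
// Sketch: select candidates, then lock-or-skip each one individually.
void pollOnce(DataSource dataSource) throws SQLException {
    for (long taskId : selectReadyTaskIds()) {            // 1. plain select of candidate tasks
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);
            try {
                try (PreparedStatement lock = con.prepareStatement(
                        "SELECT id FROM tasks WHERE id = ? FOR UPDATE NOWAIT")) {
                    lock.setLong(1, taskId);
                    try (ResultSet rs = lock.executeQuery()) {
                        if (!rs.next()) continue;          // row already gone, nothing to do
                    }
                }
                executeTask(taskId);                       // 2. process while holding the lock
                try (PreparedStatement done = con.prepareStatement(
                        "DELETE FROM tasks WHERE id = ?")) {
                    done.setLong(1, taskId);
                    done.executeUpdate();
                }
                con.commit();
            } catch (SQLException lockFailed) {
                con.rollback();                            // 3. lock held elsewhere: skip this task
            }
        }
    }
}
```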
I have an application on Spring Boot with PostgreSQL.
The application performs updates on rows in a database. In the past it used SELECT FOR UPDATE SKIP LOCKED to fetch new data and do the updates in one thread; this was done to prevent several nodes from updating the same row (and, as a consequence, to speed up the update process for documents).
Now, to speed up processing time, selecting rows and performing the update requests (to an external service) happen in separate threads: multiple workers with RestTemplate smooth out the I/O waiting time and fill a queue with ready updates, and another worker thread does the post-processing by taking items from the queue and inserting them into the database. So the select and the update are now separate processes working in different transactions.
What is a good way to keep the behavior of SELECT FOR UPDATE SKIP LOCKED when processing is split across different threads, to prevent different nodes from updating the same rows?
I am thinking about adding a few fields to the table, such as update-status, update-started, and node, selecting with WHERE status != 'IN PROGRESS', and, to avoid rows being held if the app crashes, adding something like AND update_started < now() - interval '20 minutes' (sketched below).
A second option is to pass the connection from the pool along with the document to the other process; maybe that is the better solution. As far as I know, I can also see which node acquired a lock, so that is good for monitoring as well.
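For the first option, a rough sketch of the claim query, assuming PostgreSQL, Spring's JdbcTemplate, and an illustrative documents(id, status, update_started, node) table; combining the status check and the stale timeout with SKIP LOCKED keeps the old single-thread behavior even though processing happens elsewhere:

```java
// Sketch: one statement claims a batch for this node and also reclaims rows whose
// owner appears to have crashed (update_started older than 20 minutes).
List<Long> claimBatch(JdbcTemplate jdbc, String nodeId, int batchSize) {
    return jdbc.queryForList(
        "UPDATE documents SET status = 'IN PROGRESS', update_started = now(), node = ? " +
        "WHERE id IN (SELECT id FROM documents " +
        "             WHERE status != 'IN PROGRESS' " +
        "                OR update_started < now() - interval '20 minutes' " +
        "             LIMIT ? FOR UPDATE SKIP LOCKED) " +
        "RETURNING id",
        Long.class, nodeId, batchSize);
}
```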
Good day, guys!
We have a pretty straightforward application adapter: once every 30 seconds it reads records from the database of one system (which it can't write to), converts each of these records into an internal format, performs filtering, enrichment, ..., and finally transforms the resulting entities into XML and sends them via JMS to another system. Nothing new.
Let's add some spice here: the records in the database are sequential (that is, their identifiers are generated by a sequence), and when it is time to read a new bunch of records, we get the last-processed-sequence-number -- which is stored in our internal database and updated each time the next record is processed (sent to JMS) -- and start reading from that record (+1).
The problem is that our customers gave us an NFR: processing of a bunch of read records must not take longer than 30 seconds. Since there are a lot of steps in the workflow (some of them pretty long-running), since we may get a pretty big number of records, and since we process them one by one, it can take more than 30 seconds.
Because of all the above I want to ask two questions:
1) Is there an approach to parallel processing of sequential data, maybe with one or several intermediate stores, or the Disruptor pattern, or something CQRS-like, or notification-based, or ..., that would work in such a system?
2) A general one. I need to store the last-processed-number and send an entity to JMS. If I save the number to the database first and then some problem arises with JMS, then on an application restart my adapter will think that it successfully sent the entity, which is not true, and the entity will never be received. If I send the entity first and then try to save the number to the database and get an exception, then on an application restart a reprocessing will be performed, which will lead to duplicates in JMS. I'm not sure whether XA transactions will help here, or some kind of last-resource gambit...
Could somebody please share experience or ideas?
Thanks in advance!
1) 30 seconds is a long time, and you can do a lot in that time, especially with more than one CPU. Without specifics I can only say that you can likely make it faster if you profile it and use more CPUs.
2) You can update the database before you send, and listen to the JMS queue yourself to verify that the message was received by the broker.
Dimitry - I don't know the details of your problem, so I'm just going to make a set of assumptions. I hope it will at least trigger an idea that leads to the solution.
Here goes (a rough code sketch follows after these steps):
Grab your list of items to process.
Store the last id (and maybe the starting id).
Process each item on a different thread (I suggest using tasks).
Record any failed items in a local failed queue.
When you grab the next bunch, make sure you process the failed queue first.
Have a way of determining a maximum number of retries and a way of moving/marking an item as permanently failed.
Not sure if that was what you were after. NServiceBus has a retry process where the gap between retries gets longer up to a point, after which the message is marked as failed.
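A rough sketch of those steps, assuming plain java.util.concurrent; WorkItem, handle(), markPermanentlyFailed(), and recordLastProcessedId() are hypothetical placeholders for your own types and logic:

```java
// Sketch: drain the failed queue first, fan the items out to a thread pool,
// retry failures up to a limit, then record the last processed id.
class BatchProcessor {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Queue<WorkItem> failedQueue = new ConcurrentLinkedQueue<>();
    private static final int MAX_RETRIES = 3;

    void processBatch(List<WorkItem> batch) throws InterruptedException {
        List<WorkItem> toProcess = new ArrayList<>();
        for (WorkItem failed; (failed = failedQueue.poll()) != null; ) {
            toProcess.add(failed);                         // failed items from earlier runs go first
        }
        toProcess.addAll(batch);

        CountDownLatch done = new CountDownLatch(toProcess.size());
        for (WorkItem item : toProcess) {
            pool.submit(() -> {
                try {
                    handle(item);                          // per-item processing
                } catch (Exception e) {
                    if (item.incrementAttempts() < MAX_RETRIES) {
                        failedQueue.add(item);             // retry on a later run
                    } else {
                        markPermanentlyFailed(item);       // give up after MAX_RETRIES
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();                                      // finish before recording the last id
        recordLastProcessedId(batch);
    }
}
```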
Folks, we finally ended up with the following solution. We implemented a kind of Actor Model. The idea is the following.
There are two main (internal) database tables for our application; let's call them READ_DATA_INFO, which contains the last-read-record-number of the 'source' external system, and DUMPED_DATA, which stores metadata about each record read from the source system. This is how it all works: every n (a configurable property) seconds a service bus reads the last processed identifier of the source system and sends a request to the source system to get new records. If there are new records, they are wrapped in a DumpRecordBunchMessage message and sent to a DumpActor class. This class begins a transaction comprising two operations: update the last-read-record-number (the READ_DATA_INFO table) and save metadata about each record (the DUMPED_DATA table); each dumped record gets the 'NEW' status (when a record is successfully processed, it gets the 'COMPLETED' status, otherwise the 'FAILED' status). If the transaction commits successfully, each of those records is wrapped in a RecordMessage message class and sent to the next processing actor; otherwise the records are simply skipped -- they will be re-read after the next n seconds.
There are three interesting points:
application disaster recovery. What if our application is somehow stopped in the middle of processing? No problem: at application startup (a @PostConstruct-annotated method) we find all the records with the 'NEW' status in the DUMPED_DATA table and, with the help of the stored metadata, restore them from the source system.
parallel processing. Once all records are successfully dumped, they become independent, which means they can be processed in parallel. We introduced several mechanisms for parallelism and load balancing. The simplest one is a round-robin approach: each processing actor consists of a parent actor (load balancer) and a configurable set of child actors (workers). When a new message arrives in the parent actor's queue, the parent dispatches it to the next worker.
duplicate record prevention. This is the most interesting one. Let's assume we read data every 5 seconds. If there is an actor with a long-running operation, there may be several attempts to read from the source system's database starting from the same last-read-record number. Thus there could potentially be a lot of duplicate records dumped and processed. To prevent this we added a CAS-like check to the DumpActor's messages: if the last-read-record number from a message is equal to the one in the DUMPED_DATA table, the message is processed (no messages were processed before it); otherwise the message is rejected (a simplified sketch follows below). Rather simple, but powerful.
I hope this overview helps somebody. Have a good time!
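For anyone curious what the duplicate-prevention check from point 3 looks like in code, here is a simplified sketch, assuming an Akka-style DumpActor; the store, repository, message accessors, and runInTransaction helper are illustrative placeholders, not our exact classes:

```java
// Sketch: CAS-like guard in the DumpActor before dumping a bunch of records.
void onDumpRecordBunch(DumpRecordBunchMessage msg) {
    long lastReadInDb = lastRecordStore.getLastReadRecordNumber();  // value kept in the internal tables

    // Only accept the bunch if no other message has already advanced the
    // last-read-record number past the one this bunch was built from.
    if (msg.getLastReadRecordNumber() != lastReadInDb) {
        return;   // stale bunch: reject it, the records will be re-read on the next poll
    }

    // One transaction: advance the pointer and dump metadata with status NEW.
    runInTransaction(() -> {
        lastRecordStore.updateLastReadRecordNumber(msg.getNewLastReadRecordNumber());
        dumpedDataRepository.saveAllAsNew(msg.getRecords());
    });

    // Only after a successful commit, forward each record to the next processing actor.
    for (SourceRecord record : msg.getRecords()) {
        processingActor.tell(new RecordMessage(record), getSelf());
    }
}
```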
I have a Java application where 15 threads select a row from a table with 11,000 records through a synchronized method called getNext(). The threads are getting slow at selecting a row and are therefore taking a huge amount of time. Each thread follows this process:
1. The thread checks whether a row with the resume column set to 1 exists.
A. If it exists, the thread takes the id of that row and uses it to select another row with an id greater than the one it took.
B. Otherwise it selects a row with an id greater than 0.
2. The last row received, based on the outcome of the steps described in 1 above, is marked by setting its resume column to 1.
3. The thread takes the row data and works on it.
Question:
How can multiple threads access the same table, selecting rows that another thread has not selected, and do so quickly?
How can threads be made to resume, in case of a crash, from the last row that was selected by any of the threads?
1.
It seems the multiple database operations in getNext() are the bottleneck. If the data isn't changed by an outside source, you could read the "id" and "resume" values of all rows and cache them. Then you would only have one query and could operate purely in memory for reads. This would save a lot of expensive DB calls in getNext().
2.
Basically you need some sort of transaction, or at least another column that gets updated when a thread has finished processing a row. The processing and the update need to happen in a single transaction. If something goes wrong while the transaction is not finished, you can roll back to the state in which the row wasn't processed.
If the threads are all on the same machine, they could use a shared data structure to avoid working on the same thing, instead of relying on synchronization. But the following assumes the threads are on different machines (maybe different members of an application server cluster) and can only communicate via the database.
Remove the synchronization on the getNext() method. When setting the resume flag to 1 (step 2), do so atomically: update table set resume=1 where resume = 0, then commit. Only one thread will succeed at this, and the thread that does gets that unit of work. At the same time, set a resume time; if the resume time exceeds some maximum, assume the thread working on that unit of work has crashed and set the resume flag back to 0. After the work is finished, set the resume time to null, or otherwise mark the work as done.
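A rough JDBC sketch of that atomic claim; the table and column names (work, resume, resume_time) are illustrative, and autoCommit is assumed to be disabled:

```java
// Sketch: flip resume 0 -> 1 atomically; the thread that sees update count 1 owns the row.
boolean claimRow(Connection con, long rowId) throws SQLException {
    try (PreparedStatement claim = con.prepareStatement(
            "UPDATE work SET resume = 1, resume_time = CURRENT_TIMESTAMP " +
            "WHERE id = ? AND resume = 0")) {
        claim.setLong(1, rowId);
        boolean won = claim.executeUpdate() == 1;   // at most one thread gets 1 here
        con.commit();
        return won;                                 // false: another thread claimed it first
    }
}
```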
Well, I would think about several issues here:
Are you keeping status in your DB? I would look for an approach where you do a select for update filtered by inactive status (make sure you get just one row in the select) and immediately update it to active (in the same transaction). It would be nice to know which DB you're using; I'm not sure "select for update" is always an option.
Process, and when you're finished, update the row to a finished status.
Be sure to keep a timestamp in the table to identify when you last changed the status. Make yourself a rule for deciding when an active thread should be treated as lost.
Define other possible error scenarios (what happens if the process fails).
You also need to analyze the scenario: how many rows does your table have? How many threads call it concurrently? How many inserts occur in a given time? Depending on this, you will have to see how the DB performs.
I'm assuming your getNext() is synchronized; with what I wrote in point 1 you might get around that...