Good time guys!
We have a pretty straightforward application-adapter: once in 30 seconds it reads records from a database (can't write to it) of one system, converts each of these records into an internal format, performs filtering, encrichment, ..., and, finally, transforms the resulting, let's say, entities into an xml format and sends them via a JMS to other system. Nothing new.
Let's add some spice here: records in the database are sequential (that means that their identifies are generated by a sequence), and when it is time to read a new bunch of records, we get a last-processed-sequence-number -- which is stored in our internal databese and updated each time the next record is processed (sent to the JMS) -- and start reading from that record (+1).
The problem is our customers gave us an NFR: processing of a read record bunch must not last longer than 30 seconds. As far as there are a lot of steps in the workflow (with some pretty long running ones), and it is possible to get a pretty big amount of records, and as far as we process them one by one, it can take more than 30 seconds.
Because of all the above I want to ask 2 questions:
1) Is there an approach of a parallel processing of sequential data, maybe with one or several intermediate storages, or Disruptor patern, or cqrs-like, or a notification-based, or ... that provides a possibility of working in such a system?
2) A general one. I need to store a last-processed-number and send an entity to the JMS. If I save a number to a database and then some problem raises with the JMS, on an application's restart my adapter will think that it successfuly sended the entity, which is not true and it won't be ever received. If I send an entity and after that try so save a number to a database and get an exception, on an application's restart a reprocessing will be performed which will lead to duplications in the JMS. I'm not sure that xa transactions will help here or some kind of a last resorce gambit...
Could somebody, please, share experience or ideas?
Thanks in advance!
1) 30 seconds is a long time and you can do a lot in that time esp with more than one CPU. Without specifics I can only say it is likely you can make it faster if you profile it and use more CPUs.
2) You can update the database before you send and listen to the JMS queue yourself to see it was received by the broker.
Dimitry - I don't know the detail around your problem so I'm just going to make a set of assumptions. I hope it willtrigger an idea that will lead to the solution at least.
Here goes:
Grab you list of items to process.
Store the last id (and maybe the starting id)
Process each item on a different thread (suggest using Tasks).
Record any failed item in a local failed queue.
When you grab the next bunch, ensure you process the failed queue first.
Have a way of determining a max number of retries and a way of moving/marking it as permanently failed.
Not sure if that was what you were after. NServiceBus has a retry process where the gap between each retry gets longer up to a point, then it is marked as failed.
Folks, finally we ended up with the following solution. We implemented a kind of the Actor Model. The idea is the following.
There are two main (internal) database tables for our application, let's call them READ_DATA_INFO, which contains a last-read-record-number of the 'source' external system, and DUMPED_DATA, which stores a metadata about each read record of the source system. This is how it all works: each n (a configurable property) seconds a service bus reads the last processed identifier of the source system and sends a request to the source system to get new records from it. If there are several new records, they are being wrapped with a DumpRecordBunchMessage message and sent to a DumpActor class. This class begins a transaction which comprises two operations: update the last-read-record-number (the READ_DATA_INFO table) and save a metadata about each reacord (the DUMPED_DATA table) (each dumped record gets the 'NEW' status. When a record is successfully processed, it gets the 'COMPLETED' status; otherwise - the 'FAILED' status). In case of a successfull transaction commit each of those records is wrapped with a RecordMessage message class and send to next processing actor; otherwise those records are just skipped - they would be reread after next n seconds.
There are three interesting points:
an application's disaster recovery. What if our application will be stopped somehow at the middle of a processing. No problem, at an application's startup (#PostConstruct marked method) we find all the records with the 'NEW' statuses at the DUMPED_DATA table and with a help of a stored metadata rebuild restore them from the source system.
parallel processing. After all records are successfully dumped, they become independent, which means that they can be processed in parallel. We introduced several mechanisms of a parallelism and a loa balancing. The simplest one is a round robin approach. Each processing actor consists of a parant actor (load balancer) and a configurable set of it's child actors (worker). When a new message arrives to the parent actor's queue, it dispatches it to the next worker.
duplicate record prevention. This is the most interesting one. Let's assume that we read data each 5 seconds. If there is an actor with a long running operation, it is possible to have several tryings to read from the source system's database starting from the same last-read-record number. Thus there would potentially be a lot duplicate records dumped and processed. In order to prevent this we added a CAS-like check of DumpActor's messages: if the last-read-record from a message is equal to a one from the DUMPED_DATA table, this message should be processed (no messages were processed before it); otherwise this message is rejected. Rather simple, but powerfull.
I hope this overview will help somebody. Have a good time!
Related
In a microservice architecture, suppose there is a business scenario where a user purchases something that will expire after two years, and the system needs to notify the user a little bit in advance.
In this case, how should we handle the situation so that the users can be notified on time even if there are many users who need to be notified?
For example, using a delayed queue of message queue will cause the messages to pile up when there are many users; using a timed task, too many users will overload the server CPU.
Is there a good way to do this?
While "microservices" do not inherently mean "REST", they usually are. And in REST you shouldn't store in memory anything that needs to survive more than one request. Two years is an extreme case, but even if it is for just 10 minutes, it should probably go to the DB.
Building up a queue for two years will just be very impractical and likely to fail if the queue contents are not persisted somewhere. Since you mention purchases I am assuming you have some sort of data store to record them either in sql or no-sql.
You can simply add purchase date/time column(s) to the table to make life easier. If you volumes are low enough for daily purchases then I would start with date based lookup only. You will need a scheduled execution of some service method say at 6am everyday that looks up purchases close to expiry i.e 7 days before 2 years purchase_date = now - 723days and then send rest request somewhere or publishes an event or jms message with order number and purchase_date as content for each purchase order. This will then be picked up by event/message listener somewhere and processed accordingly i.e. send a notification to customer. To avoid sending duplicate notifications you should also persist the expiry notifications in a database and ensure you check that notification has been sent for purchase id before sending it again.
If you ever reach a situation where you are processing thousands of orders a day and don't want to publish large number of events in one go then extend the functionality to filter by purchase timestamp and process chunks of purchases multiple times a day by changing the lookup condition.
This is just general idea of such requirement and you will have to fine-grain a lot of implementation details such as what happens if your email server is down.
You can use quartz job and configure it to use persistent mode in database (JDBC JobStore) to not loose information and also it is suitable for clustering mode.
Quartz checks periodically the database for the nearest task (configurable parameter) if the time comes, it will process the notification.
You can configure the thread pool size in order to avoid overload.
We currently have a distributed setup where we are publishing events to SQS and we have an application which has multiple hosts that drains messages from the queue and does some transformation over it and transmits to interested parties. I have a use case where the receiving end point has scalability concerns with the message volume and hence we would like to batch these messages periodically (say every 15 mins) in the application before sending it.
The incoming message rate is around 200 messages per second and each message is no more than 10 KB. This system need not be real time, but would definitely be a good to have and also the order is not important (its okay if a batch containing older messages gets sent first).
One approach that I can think of is maintaining an embedded database within the application (each host) that batches the events and another thread that runs periodically and clears the data.
Another approach could be to create timestamped buckets in a a distributed key-value store (s3, dynamo etc.) where we write the message to the correct bucket based the messages time stamp and we periodically clear the buckets.
We can run into several issues here, since the messages would be out of order a bucket might have already been cleared (can be solved by having a default bucket though), would need to accurately decide when to clear a bucket etc.
The way I see it, at least two components would be required one which does the batching into a temporary storage and another that clears it.
Any feedback on the above approaches would help, also it looks like a common problem are they any existing solutions that I can leverage ?
Thanks
Can you help me in two problem :
A. We have a table on which read and write operation happens simultaneously. Write happens very vastly so read is very slow - sometimes my web application does not come up due to heavy write operation on this table. How could i handle such scenario. Write happens through different Java application while read happens through our web application, so web application become very slow. Any idea?
B. Write happens to this table happens through 200 threads, these thread take connection from connection pool and write into the table and this application run 24 by 7. is the thread priority is having issue and stopping read operation from web application.
C. Can we have master- master replication for that table only- so write happens in one table and write happens in other table and every two minute data migrates from one table to other table?
Please suggest me .
Thanks in advance.
Check connection pool size - maybe it's too small and your threads waste time waiting for connection from pool.
Check your database settings, if you just running it with out-of-the-box params there maybe a good space for improvements.
You probably need some kind of event-driven system - when vehicle sends data DB is not updated, but a message is added to some queue (e.g. JMS). Your app then caches data on startup, and updates both cache and database upon receiving this message. The key thing is that the only component that interacts with DB is your app, and data changed only when you receive event - so you don't need to query DB to read the data, plus you may do updates in the background using only few threads, etc. There are quite good open-source messaging systems (e.g. Apache Active MQ) and caching libraries (e.g. EH Cache), so you can built reasonably perfomant and fault-tolerant system with not too much effort.
I guess introducing messaging will be a serious reengineering, so to solve your immediate problem replication might be the best solution - merge data from the updateable table to another one every 2 minutes, and the tracker will read that another table; obviously works well if you only read the data in the web-app, and not update them, otherwise you need to put a lot of effort to keep 2 tables in sync. A variation of that is batching - data from vehicle are iserted into intermediate table, and then every 2 minutes transferred into main table from which reader queries them; intermediate table is cleaned after transfer.
The one true way to solve this is to use a queue of write events and to stop the writing periodically so that the reader has a chance.
Create a queue for incoming write updates
Create an atomicXXX (see java.util.concurrency) to use as a lock
Create a thread pool to read from the queue and execute the updates when the lock is unset
Use javax.swing.Timer to periodically set the lock and read the table data.
Before trying anything too complicated try this perhaps:
1) Don't use Thread priorities, they are rarely what you want.
2) Set up your own priority scheme, perhaps simply by having a (priority) queue for both reads and writes where reads are prioritized. That is: add read and write requests to a single queue and have them block or be notified of the result.
3) check your database features to optimize write heavy tables
I am using Informix DB. This question may not be tied to one specific database. But I want to know how I can in Java, continuously probe into a Database and check if a certain row has been added to a table in the DB. Basically, the flow is:
My Java application should use JDBC to check if a certain table is populated.
If no, it should wait until a row has been inserted.
My question how can I have Java be aware of a row insertion. I am not expecting to add any triggers or anything, but in pure Java be able to check that the row is added.
Some thoughts that come to my mind are continuously call DB for the row, or periodically (every half-hour or so) call DB and check if the row is available. But what I am looking for is something like a Listener which can do this.
There is no facility in the Informix DBMS to signal when a particular row arrives in a table.
Well, I say that, but there is the DB-Cron facility which can periodically execute tasks (inside the server), and you could conceivably schedule a task to poll for the data to see if it has arrived and to send a message (somehow) to indicate that it has. It would be non-trivial, especially the part the indicates that it has arrived.
The JDBC protocol (and SQL protocols generally) are essentially synchronous; the client sends a request and waits for an answer from the DBMS.
So, pragmatically, if your delay period is half an hour, you can either create an admin task to handle the processing (you could write a Java UDR to be executed in the server by the server if that's crucial to you), or you can arrange for the Java (client-side) program to poll periodically to find out whether the information you need is there. A half-hour delay is not going to stress anything, even with a moderate number of processes polling for separate values (or even the same value). On the other hand, you normally try to avoid polling when you can. You'll need to strike a balance between responsiveness to the special data arriving and general system responsiveness. On the whole, general system responsiveness is more important, so keep the polling interval as large as you can.
If your polling interval needed to be sub-second, then the balance would be different - the job would be a lot harder.
We have a JMS queue of job statuses, and two identical processes pulling from the queue to persist the statuses via JDBC. When a job status is pulled from the queue, the database is checked to see if there is already a row for the job. If so, the existing row is updated with new status. If not, a row is created for this initial status.
What we are seeing is that a small percentage of new jobs are being added to the database twice. We are pretty sure this is because the job's initial status is quickly followed by a status update - one process gets one, another process the other. Both processes check to see if the job is new, and since it has not been recorded yet, both create a record for it.
So, my question is, how would you go about preventing this in a vendor-neutral way? Can it be done without locking the entire table?
EDIT: For those saying the "architecture" is unsound - I agree, but am not at liberty to change it.
Create a unique constraint on JOB_ID, and retry to persist the status in the event of a constraint violation exception.
That being said, I think your architecture is unsound: If two processes are pulling messages from the queue, it is not guaranteed they will write them to the database in queue order: one consumer might be a bit slower, a packet might be dropped, ..., causing the other consumer to persist the later messages first, causing them to be overridden with the earlier state.
One way to guard against that is to include sequence numbers in the messages, update the row only if the sequence number is as expected, and delay the update otherwise (this is vulnerable to lost messages, though ...).
Of course, the easiest way would be to have only one consumer ...
JDBC connections are not thread safe, so there's nothing to be done about that.
"...two identical processes pulling from the queue to persist the statuses via JDBC..."
I don't understand this at all. Why two identical processes? Wouldn't it be better to have a pool of message queue listeners, each of which would handle messages landing on the queue? Each listener would have its own thread; each one would be its own transaction. A Java EE app server allows you to configure the size of the message listener pool to match the load.
I think a design that duplicates a process like this is asking for trouble.
You could also change the isolation level on the JDBC connection. If you make it SERIALIZABLE you'll ensure ACID at the price of slower performance.
Since it's an asynchronous process, performance will only be an issue if you find that the listeners can't keep up with the messages landing on the queue. If that's the case, you can try increasing the size of the listener pool until you have adequate capacity to process the incoming messages.