Batch Message Consumption for JMSItemReader - Spring Batch - java

I'm working on a POC where the requirement is to consume messages in a batch from a JMS queue and pass them through a processor and writer in a chunk-oriented Spring Batch step. Spring Batch provides JmsItemReader out of the box, but it appears to consume messages one by one until there are no messages left in the queue or the receiver timeout is reached. In a chunk-oriented step, items are read into chunks, which are then processed and written as a chunk within a transaction into another datastore. Here JmsItemReader doesn't read items in a batch. Is there any solution in the Spring Batch world to consume messages in a batch from a queue and improve the overall performance of the application?
I went through a lot of documentation but didn't find an appropriate solution for this use case. I'd appreciate your help. Thank you.

I believe the BatchMessageListenerContainer is what you are looking for. It allows you to read messages in batches.
Note that this is not part of the standard library of readers/writers, but you can use it as-is, or take inspiration from it and adapt it as needed.


Replacing a scheduled task with Spring Events

In my Spring Boot app, customers can submit files. Each customer's files are merged together by a scheduled task that runs every minute. The fact that the merging is performed by a scheduler has a number of drawbacks, e.g. it's difficult to write end-to-end tests, because in the test you have to wait for the scheduler to run before retrieving the result of the merge.
Because of this, I would like to use an event-based approach instead, i.e.
Customer submits a file
An event is published that contains this customer's ID
The merging service listens for these events and performs a merge operation for the customer in the event object
This would have the advantage of triggering the merge operation immediately after there is a file available to merge.
However, there are a number of problems with this approach which I would like some help with:
Concurrency
The merging is a reasonably expensive operation. It can take up to 20 seconds, depending on how many files are involved. Therefore the merging will have to happen asynchronously, i.e. not as part of the same thread which publishes the merge event. Also, I don't want to perform multiple merge operations for the same customer concurrently, in order to avoid the following scenario:
Customer1 saves file2 triggering a merge operation2 for file1 and file2
A very short time later, customer1 saves file3 triggering merge operation3 for file1, file2, and file3
Merge operation3 completes saving merge-file3
Merge operation2 completes overwriting merge-file3 with merge-file2
To avoid this, I plan to process merge operations for the same customer in sequence using locks in the event listener, e.g.
@Component
public class MergeEventListener implements ApplicationListener<MergeEvent> {

    private final ConcurrentMap<String, Lock> customerLocks = new ConcurrentHashMap<>();

    @Override
    public void onApplicationEvent(MergeEvent event) {
        var customerId = event.getCustomerId();
        var customerLock = customerLocks.computeIfAbsent(customerId, key -> new ReentrantLock());
        customerLock.lock();
        try {
            mergeFileForCustomer(customerId);
        } finally {
            customerLock.unlock();
        }
    }

    private void mergeFileForCustomer(String customerId) {
        // implementation omitted
    }
}
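For what it's worth, the same per-customer ordering can also be achieved without blocking the publishing thread, by giving each customer its own single-threaded executor. This is my own sketch, not part of the original post; the class and method names are made up for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical alternative to the lock-based listener: route each customer's
// merge operations to a dedicated single-threaded executor. Tasks for the same
// customer run strictly in submission order, different customers are merged in
// parallel, and the publishing thread never blocks on a merge.
public class PerCustomerSerializer {

    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    public Future<?> submitMerge(String customerId, Runnable mergeTask) {
        ExecutorService executor = executors.computeIfAbsent(
                customerId, id -> Executors.newSingleThreadExecutor());
        return executor.submit(mergeTask);
    }

    public void shutdown() {
        executors.values().forEach(ExecutorService::shutdown);
    }
}
```

One caveat: executors accumulate per customer, so a real implementation would need to evict idle ones.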
Fault-Tolerance
How do I recover if for example the application shuts down in the middle of a merge operation or an error occurs during a merge operation?
One of the advantages of the scheduled approach is that it contains an implicit retry mechanism, because every time it runs it looks for customers with unmerged files.
Summary
I suspect my proposed solution may be re-implementing (badly) an existing technology for this type of problem, e.g. JMS. Is my proposed solution advisable, or should I use something like JMS instead? The application is hosted on Azure, so I can use any services it offers.
If my solution is advisable, how should I deal with fault-tolerance?
Regarding the concurrency part, I think the approach with locks would work fine, provided the number of files submitted per customer (in a given timeframe) is small enough.
Over time you can monitor the number of threads waiting for the lock to see whether there is a lot of contention. If there is, you could accumulate merge events over a specific timeframe and then run a parallel merge operation, which in effect leads back to a solution similar to the one with the scheduler.
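To make that contention visible, `ReentrantLock` exposes `getQueueLength()`. A rough sketch (class and method names are my own; it assumes the locks map stores `ReentrantLock` rather than the `Lock` interface used in the listener above):

```java
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: sample how many threads are queued on the per-customer locks.
// A persistently high total suggests switching to batched merges instead.
public class LockContentionMonitor {

    private final Map<String, ReentrantLock> customerLocks;

    public LockContentionMonitor(Map<String, ReentrantLock> customerLocks) {
        this.customerLocks = customerLocks;
    }

    // Total number of threads currently waiting across all customer locks.
    public int totalWaiters() {
        return customerLocks.values().stream()
                .mapToInt(ReentrantLock::getQueueLength)
                .sum();
    }
}
```

Calling `totalWaiters()` periodically from a scheduled task and exporting it as a metric would give the trend over time.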
In terms of fault tolerance, an approach based on a message queue would work (I haven't worked with JMS, but I understand it's a message-queue API).
I would go with a cloud-based message queue (SQS, for example) for reliability. The approach would be:
Push merge events into the queue
The merging service reads one event at a time and starts the merge job
When the merge job is finished, the message is removed from the queue
That way, if something goes wrong during the merge process, the message stays in the queue and will be read again when the app is restarted.
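The delete-only-after-success semantics can be illustrated with a tiny in-memory stand-in (this is not the SQS API, just a toy model of the peek/delete behaviour):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Toy model of "remove the message only after the merge succeeds". Real queues
// (SQS visibility timeout, Service Bus peek-lock) implement this durably; here
// a failed merge simply leaves the message at the head for the next attempt.
public class DeleteAfterSuccessQueue {

    private final Deque<String> queue = new ArrayDeque<>();

    public void push(String mergeEvent) {
        queue.addLast(mergeEvent);
    }

    // Attempts one merge; returns true if a message was processed and removed.
    public boolean processNext(Predicate<String> merge) {
        String msg = queue.peekFirst();      // read without removing
        if (msg == null) {
            return false;
        }
        if (merge.test(msg)) {               // merge succeeded
            queue.pollFirst();               // now it is safe to delete
            return true;
        }
        return false;                        // failure: message stays queued
    }

    public int pending() {
        return queue.size();
    }
}
```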
My thoughts on this matter, after some consideration.
I restricted possible solutions to what's available as Azure managed services, per the OP's requirements.
Azure Blob Storage Function Trigger
Because this issue is about storing files, let's start by exploring Blob Storage with a trigger function that fires on file creation. According to the docs, Azure Functions can run for up to 230 seconds and have a default retry count of 5.
However, this solution requires that files from a single customer arrive in a manner that does not cause concurrency issues, so let's set it aside for now.
Azure Queue Storage
Does not guarantee first-in-first-out (FIFO) ordered delivery, hence it does not meet the requirements.
Storage queues and Service Bus queues - compared and contrasted: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted
Azure Service Bus
Azure Service Bus supports FIFO queues and seems to meet the requirements.
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted#compare-storage-queues-and-service-bus-queues
From the doc above, we see that large files are not suited as message payloads. To solve this, the files can be stored in Azure Blob Storage, and the message will contain the info needed to find the file.
With Azure Service Bus and Azure Blob Storage selected, let's discuss implementation caveats.
Queue Producer
On AWS, the solution for the producer side would have been like this:
Dedicated end-point provides pre-signed URL to customer app
Customer app uploads file to S3
Lambda triggered by S3 object creation inserts message to queue
Unfortunately, Azure doesn't have an exact pre-signed-URL equivalent yet (Shared Access Signatures are similar but not identical), so file uploads must go through an endpoint which in turn stores the file in Azure Blob Storage. Since a file-upload endpoint is required anyway, it seems appropriate to let that endpoint also be responsible for inserting messages into the queue.
Queue Consumer
Because file merging takes a significant amount of time (~20 seconds), it should be possible to scale out the consumer side. With multiple consumers, we have to make sure that a single customer is processed by no more than one consumer instance.
This can be solved by using message sessions: https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sessions
To achieve fault tolerance, the consumer should use peek-lock (as opposed to receive-and-delete) during the file merge and mark the message as completed once the merge has finished. When the message is marked as completed, the consumer can also be responsible for removing superfluous files from Blob Storage.
Possible problems with both existing solution and future solution
If customer A starts uploading a huge file #1 and immediately afterwards starts uploading a small file #2, the upload of file #2 may complete before file #1 and cause an out-of-order situation.
I assume the existing solution handles this with some kind of locking mechanism or file-name convention.
Spring Boot with Kafka can solve your fault-tolerance problem.
Kafka supports the producer-consumer model: publish the customer events to a Kafka topic.
Configure Kafka with replication so that no events are lost.
Use consumers that invoke the merging service for each event.
Once a consumer has read the event for a customerId and the merge is complete, commit the offset.
If a failure occurs mid-merge, the offset is not committed, so the event is read again when the application starts up again.
If the merging service can detect a duplicate event from the given data, reprocessing the same message should not cause any issue (Kafka's default delivery guarantee is at-least-once, so duplicates are possible). Duplicate-event detection is a safety check for an event that was fully processed but whose offset failed to commit.
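That duplicate-detection safety check might look roughly like this (an in-memory sketch of my own; in a real system the processed-id set would live in the database and be updated in the same transaction as the merge result, which also makes the check-then-act atomic):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the duplicate-event safety check: remember which event ids have
// already been merged and skip reprocessing them. If the merge throws, the id
// is not recorded, so a redelivery of the event will retry the merge.
public class IdempotentMergeConsumer {

    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    // Returns true if the event was processed now, false if it was a duplicate.
    public boolean handle(String eventId, Runnable merge) {
        if (processedEventIds.contains(eventId)) {
            return false;                    // already merged: redelivered event
        }
        merge.run();                         // may throw; id then stays unrecorded
        processedEventIds.add(eventId);
        return true;
    }
}
```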
First, an event-based approach is correct for this scenario. You should use an external broker for pub-sub event messages.
Note that, by default, Spring publishes events synchronously.
Suppose you have these components:
App Service
Merge Service
CDC Service (change data capture)
Broker (Kafka, RabbitMQ, ...)
The main flow is based on the "Outbox Pattern":
The App Service saves the event message to an Outbox message table
The CDC Service watches the outbox table and publishes event messages from it to the broker
The Merge Service subscribes to the broker and receives the event messages (which arrive in order)
The Merge Service performs the merge action
You can use the Eventuate library for this flow.
Furthermore, you can apply DDD to your architecture: use the Axon Framework for the CQRS pattern, and publish and process domain events.
Refer to:
Outbox pattern: https://microservices.io/patterns/data/transactional-outbox.html
It really sounds like you might be better served by a stream-processing or ETL tool for this job. When you are developing an app and have a prioritisation/queuing/batching requirement, it is easy to see how you can build a solution with a cron job + SQL database, with maybe a queue to decouple doing work from producing work.
This may very well be the easiest thing to build, as you get a lot of granularity and control with this approach. If you believe you can in fact meet your requirements this way fairly quickly and with low risk, you can do so.
There are software components more tailored to these tasks, but they have learning curves and depend on what PaaS or cloud you may be using. You'll get monitoring, scalability, availability, and resiliency out of the box, and an open-source product or cloud service will take the management burden off your hands.
What to use also depends on your priorities and requirements. If you go the ETL route, which is good at banking up jobs, you might use something like AWS Glue. If you want prioritisation functionality, you may want to use multiple queues; it really depends. Regardless of the approach, you'll also want a monitoring dashboard to see what wait time you should expect for your merges.

Flink Consumer with DataStream API for Batch Processing - How do we know when to stop & How to stop processing [ 2 fold ]

I am basically trying to use the same Flink pipeline of transformations (with different input parameters to distinguish between real-time and batch modes) in both batch mode and real-time mode. I want to use the DataStream API, as most of my transformations depend on it.
My producer is Kafka, and the real-time pipeline works just fine. Now I want to build a batch pipeline with the exact same code, using different topics for batch and real-time mode. How does my batch processor know when to stop processing?
One way I thought of was to add an extra parameter in the producer record to say this is the last record; however, with multi-partitioned topics, record delivery across multiple partitions does not guarantee order (delivery inside one partition is guaranteed, though).
What is the best practice to design this?
PS: I don't want to use DataSet API.
You can use the DataStream API for batch processing without any issue. Basically, Flink will inject a barrier that marks the end of the stream, so that your application works on finite streams instead of infinite ones.
I am not sure Kafka is the best solution for this problem, to be completely honest.
Generally, when implementing a KafkaDeserializationSchema you have the method isEndOfStream() that can mark that the stream has finished. Perhaps you could inject end markers for each partition and simply check whether all of the markers have been read, and then finish the stream. But this would require you to know the number of partitions beforehand.
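The end-marker bookkeeping could be as simple as this sketch (class and method names are mine; it assumes the partition count is known up front, as noted):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the per-partition end-marker idea: the producer writes one marker
// record to every partition, and the consumer side declares the stream
// finished once a marker has been seen from each partition.
public class EndMarkerTracker {

    private final int totalPartitions;
    private final Set<Integer> partitionsDone = ConcurrentHashMap.newKeySet();

    public EndMarkerTracker(int totalPartitions) {
        this.totalPartitions = totalPartitions;
    }

    // Call when an end-marker record arrives on the given partition.
    public void markerSeen(int partition) {
        partitionsDone.add(partition);
    }

    // True once every partition has delivered its end marker.
    public boolean isEndOfStream() {
        return partitionsDone.size() == totalPartitions;
    }
}
```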

Kafka: Bounded Batch Processing in Parallel

I would like to use Kafka to perform bounded batch processing, where the program will know when it is processing the last record.
Batch:
Reading a flat file
Send each line as message to Kafka
Kafka Listener:
Consumes message from Kafka
Insert record into database
If it is the last record, mark batch job as done in database.
One way would probably be to use a single Kafka partition, since FIFO (first in, first out) ordering is guaranteed within a partition, and have the batch program send an isLastRecord flag.
However, this restricts processing to a single thread (a single consumer).
Question
Is there any way to achieve this with parallel-processing by leveraging multiple Kafka partitions?
If you need in-order guarantees per file, you are restricted to a single partition.
If you have multiple files, you could use different partitions for different files though.
If each line in the file is an insert into a database, I am wondering, though, whether you need the in-order guarantee in the first place, or whether you can insert all records/lines in any order.
A more fundamental question is: why do you need to put the data into Kafka first? Why not read the file and do the inserts directly?
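If strict ordering is not needed, one way to keep parallelism and still detect completion (my own sketch, not from the answer above) is to have the producer record the total number of lines and let consumers, in any order, bump a shared processed counter; whoever processes the last record marks the batch as done:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: completion detection without ordering guarantees. The producer
// knows how many lines the file had; each consumer, after inserting a record,
// bumps the shared counter (in practice a database row updated in the same
// transaction as the insert). The final increment marks the batch as done.
public class BatchCompletionCounter {

    private final long expectedRecords;
    private final AtomicLong processed = new AtomicLong();

    public BatchCompletionCounter(long expectedRecords) {
        this.expectedRecords = expectedRecords;
    }

    // Returns true exactly once: for the consumer processing the last record.
    public boolean recordProcessed() {
        return processed.incrementAndGet() == expectedRecords;
    }
}
```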

How to design a JMS message containing large amounts of data

I am working on designing a system that uses an ETL tool to retrieve batches of data, i.e., inserts/updates/deletes for one or more tables, and puts them on a JMS topic to be processed later by multiple clients. Right now, each message on the topic represents a single record I/U/D, and we have a special message to delimit the end of the batch. It's important to process each batch in a single transaction, so having a bunch of messages delimited by a special one is not ideal: both the publishing and the receiving sessions must be designed for multiple messages; the batch-delimiter message is a messy solution (each time we receive a message we need to check whether it's the last one) and very error-prone; the system is difficult to debug and maintain; and the number of messages on the topic quickly becomes huge (up to millions).
Now, I think that the next natural step to improve the architecture is to pack all the records in a single JMS message so that when a message is received, it encompasses a single transaction, it's easy to detect failures, there are no "orphan" records on the topic, etc. I only see advantages in doing so! Now here are my questions:
What's the best way to create such a packed message? I think my choices are StreamMessage, BytesMessage, or ObjectMessage. I excluded text and map messages because the first would require text parsing, which would kill performance, and the second doesn't really seem to fit the scenario. I'm kinda leaning towards StreamMessage because it seems quite compact, although it will require a lot of custom serialization code (even more so for BytesMessage). I'm not sure about ObjectMessage; how does it perform? Is there an out-of-the-box solution for this?
What's the maximum size allowed per message? Could it be on the order of hundreds of KB, or even a few MB?
Thanks for the thoughts!
Giovanni
Instead of using one large message, you could use two (or more) queues, correlation ids and a message selector.
Queueing:
Post a notification message to "notification queue" to indicate that processing should start
Post command messages to the "command queue" with the correlation id set to the notification message's message id (you can use multiple command queues if queue depth gets too high)
Commit the transaction
Processing:
Receive the notification message from "notification queue" (e.g. with message driven bean)
Receive and process all the related messages from "command queue" using a message selector
Commit the transaction
Using bytes (e.g. a BytesMessage) is likely the least memory-intensive option.
If you manipulate Java objects, you can use a fast and byte-efficient serialization/deserialization library like Kryo.
We happily use Kryo in production on a messaging system, but you have plenty of alternatives, such as the popular Google Protocol Buffers.

Generic QoS Message batching and compression in Java

We have a custom messaging system written in Java, and I want to implement a basic batching/compression feature: under heavy load it should aggregate a bunch of push responses into a single push response.
Essentially:
if we detect that 3 messages were sent in the past second, start batching responses and schedule a timer to fire in 5 seconds
the timer will aggregate all the message responses received in those 5 seconds into a single message
I'm sure this has been implemented before; I'm just looking for the best example of it in Java. I'm not looking for a full-blown messaging layer, just the basic detect-messages-per-second-and-schedule-a-task part (obviously I can easily write this myself; I just want to compare it with existing implementations to make sure I'm not missing any edge cases and that I've simplified the problem as much as possible).
Are there any good open source examples of building a basic QoS batching/throttling/compression implementations?
We are using a very similar mechanism for high load. It works as you described:
* Aggregate messages over a given interval
* Send a List instead of a single message after that
* Start aggregating again
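The aggregate-then-send loop above can be sketched with a ScheduledExecutorService (class name and thresholds are illustrative; the "3 messages in the past second" trigger from the question is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of the described mechanism: send() only buffers the message; a timer
// periodically flushes the buffer downstream as a single List. Thread safety
// is handled by synchronizing on the buffer.
public class BatchingSender<T> {

    private final List<T> buffer = new ArrayList<>();
    private final Consumer<List<T>> downstream;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public BatchingSender(Consumer<List<T>> downstream, long flushMillis) {
        this.downstream = downstream;
        timer.scheduleAtFixedRate(this::flush, flushMillis, flushMillis,
                TimeUnit.MILLISECONDS);
    }

    // Called by producers: asynchronous, just appends to the buffer.
    public void send(T message) {
        synchronized (buffer) {
            buffer.add(message);
        }
    }

    // Drains the buffer and hands the whole batch downstream as one list.
    public void flush() {
        List<T> batch;
        synchronized (buffer) {
            if (buffer.isEmpty()) {
                return;
            }
            batch = new ArrayList<>(buffer);
            buffer.clear();
        }
        downstream.accept(batch);
    }

    public void shutdown() {
        timer.shutdown();
        flush();
    }
}
```

Note that the downstream consumer receives the whole batch on the timer thread, which is exactly the asynchrony pitfall mentioned below.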
You should watch out for the following pitfalls:
* If you are using a transacted messaging system like JMS, you can get into trouble because your implementation will not be able to send inside the JMS transaction, so it will keep aggregating. Depending on the size of the data structure holding the messages, this can run out of space. If you have very long transactions sending many messages, this can pose a problem.
* Sending a message this way happens asynchronously, because a different thread sends the message, and the thread calling the send() method only puts it into the data structure.
* Sticking with the JMS example, keep in mind that the way messages are consumed is also changed by this approach, because you will receive the list of messages from JMS as a single message. So once you commit this single JMS message, you have committed the entire list of messages. You should check whether this is a problem for your requirements.
