Reactive Java Queue With Lazy Fetching - java

I'm a bit stuck on how to implement a certain collection. The idea is the following:
It should work like a simple Queue (poll/offer).
Its API should be reactive (either RxJava or Reactor works well for me). This means I expect the poll method, for example, to look like: Mono<T> poll();
It should lazily fetch data with a provided loader if the queue is empty. The loader is also a function that returns the next batch of values in a reactive manner.
It should ideally be thread-safe. Not strictly required, but it would be nice if I could exclude the race condition where two threads both detect that the queue is empty and then fetch data.
Has anyone already come across such a collection? Or at least something similar, or something that could be used as a base for this one?
Or at least please give me an idea of how to implement such functionality.
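For illustration, here is a minimal sketch of one possible shape for such a queue using Reactor. The Supplier<Flux<T>> loader signature and all names are assumptions rather than an existing library API; the AtomicReference is there so that two threads which both see an empty queue share a single in-flight fetch instead of triggering two loads. Treat it as a starting point, not a hardened implementation.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class LazyFetchingQueue<T> {

    private final Queue<T> buffer = new ConcurrentLinkedQueue<>();
    private final Supplier<Flux<T>> loader;
    // Holds the currently running fetch so concurrent callers share it
    // instead of each triggering their own load.
    private final AtomicReference<Mono<Void>> inFlightFetch = new AtomicReference<>();

    public LazyFetchingQueue(Supplier<Flux<T>> loader) {
        this.loader = loader;
    }

    public void offer(T value) {
        buffer.offer(value);
    }

    public Mono<T> poll() {
        return Mono.defer(() -> {
            T next = buffer.poll();
            if (next != null) {
                return Mono.just(next);
            }
            // Queue is empty: join (or start) the shared fetch, then try again.
            // Caveat: if the loader can return an empty batch, add a termination
            // or backoff policy here to avoid refetching in a loop.
            return fetchOnce().then(Mono.defer(this::poll));
        });
    }

    private Mono<Void> fetchOnce() {
        return inFlightFetch.updateAndGet(existing -> existing != null
                ? existing
                : loader.get()
                        .doOnNext(buffer::offer)
                        .then()
                        .doFinally(signal -> inFlightFetch.set(null))
                        .cache());
    }
}

Usage would then be something like queue.poll().subscribe(...), with offer(...) feeding items in directly when they are already available.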

Related

Best Practice for Kafka rollback scenario in microservices [duplicate]

We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This seems like it might not be the most efficient, and will have problems in that the service which the user called cannot immediately see the database changes it should have just created.
Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
Has anyone got any thoughts or advice on the above, or is anyone able to correct any mistakes in my assumptions?
Thanks in advance!
I'd suggest using a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records would contain the aggregations you need. In the easiest case, you'd simply insert another entity, e.g. mapped via JPA, which contains a JSON property with the aggregate payload. Of course, this could be automated by some kind of transaction listener / framework component.
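For illustration, a minimal sketch of what such an event/outbox entity could look like with JPA. The entity, table and column names here are made up, not anything prescribed by Debezium:

import java.time.Instant;
import java.util.UUID;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.Table;

// Hypothetical outbox entity: one row per business-level event, persisted in the
// same transaction as the actual table writes.
@Entity
@Table(name = "outbox_event")
public class OutboxEvent {

    @Id
    private UUID id = UUID.randomUUID();

    private String aggregateType;      // e.g. "Order"
    private String aggregateId;        // id of the affected aggregate
    private String eventType;          // e.g. "OrderCreated"
    private Instant createdAt = Instant.now();

    @Lob
    private String payload;            // aggregated event data as JSON

    protected OutboxEvent() {          // required by JPA
    }

    public OutboxEvent(String aggregateType, String aggregateId,
                       String eventType, String payload) {
        this.aggregateType = aggregateType;
        this.aggregateId = aggregateId;
        this.eventType = eventType;
        this.payload = payload;
    }
}

In the service code you would then persist an OutboxEvent in the same transaction as the business entities, so both either commit or roll back together.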
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you have both: eventually consistent state in Kafka (the events in Kafka may trail behind or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business level event semantics you're after.
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
First of all, I have to say that I'm neither a Kafka nor a Spring expert, but I think this is more of a conceptual challenge when writing to independent resources, and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application, which is often underestimated when choosing such an option. Also, not every database can be used as a Debezium source.
To make sure that we are talking about the same goals, let's clarify the situation with a simplified airline example where customers can buy tickets. After a successful order, the customer will receive a message (mail, push notification, …) that is sent by an external messaging system (the system we have to talk to).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider, it would look like the following: the client sends the order to our app, where we start a transaction. The app stores the order in its database. Then the message is sent to JMS and you can commit the transaction. Both operations participate in the transaction even though they're talking to their own resources. As the XA transaction guarantees ACID, we're fine.
Let's bring Kafka (or any other resource that is not able to participate in the XA transaction) into the game. As there is no longer a coordinator that syncs both transactions, the main idea of what follows is to split processing into two parts with a persistent state.
When you store the order in your database, you can also store the message (with the aggregated data) that you want to send to Kafka afterwards in the same database (e.g. as JSON in a CLOB column). Same resource, so ACID is guaranteed; everything is fine so far. Now you need a mechanism that polls your "KafkaTasks" table for new tasks that should be sent to a Kafka topic (e.g. with a timer service; maybe the @Scheduled annotation can be used in Spring). After the message has been successfully sent to Kafka, you can delete the task entry. This ensures that the message to Kafka is only sent when the order is also successfully stored in the application database. Did we achieve the same guarantees as with an XA transaction? Unfortunately not, as there is still the chance that writing to Kafka works but the deletion of the task fails. In this case the retry mechanism (you would need one, as mentioned in your question) would reprocess the task and send the message twice. If your business case is happy with this "at-least-once" guarantee, you're done here with an, in my humble opinion, semi-complex solution that could easily be implemented as framework functionality so not everyone has to bother with the details.
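A rough sketch of such a poller in Spring. The KafkaTask entity, its repository and the schedule are assumptions; only @Scheduled, @Transactional and KafkaTemplate are standard Spring pieces:

import java.util.List;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical polling publisher: picks up pending rows from the "KafkaTasks"
// table, sends them to Kafka, and deletes each row only after the broker
// acknowledged the send.
@Component
public class KafkaTaskPublisher {

    private final KafkaTaskRepository tasks;            // assumed Spring Data repository
    private final KafkaTemplate<String, String> kafka;  // assumed String key/value payloads

    public KafkaTaskPublisher(KafkaTaskRepository tasks, KafkaTemplate<String, String> kafka) {
        this.tasks = tasks;
        this.kafka = kafka;
    }

    @Scheduled(fixedDelay = 5000)
    @Transactional
    public void publishPendingTasks() {
        List<KafkaTask> pending = tasks.findTop50ByOrderByIdAsc();
        for (KafkaTask task : pending) {
            try {
                // Block on the send so the row is only deleted once Kafka accepted it.
                kafka.send(task.getTopic(), task.getKey(), task.getPayload()).get();
                tasks.delete(task);
            } catch (Exception e) {
                // Leave the row in place; the next scheduled run retries it.
                // This is exactly where the at-least-once semantics come from.
                break;
            }
        }
    }
}

Note that the caveat above still applies: the send can succeed while the delete (or its commit) fails, so duplicates remain possible.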
If you need "exactly-once", then you cannot store your state in the application database (in this case "deletion of a task" is the "state"); instead, you must store it in Kafka (assuming that you have ACID guarantees between two Kafka topics). An example: let's say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to "your topic", all in the same Kafka transaction. In the next cycle you consume your topic (value is 10) and take this value to get the next 10 tasks (and delete the already processed tasks).
If there are easier (in-application) solutions with the same guarantees, I'm looking forward to hearing from you!
Sorry for the long answer but I hope it helps.
All the approaches described above are well-established ways to tackle the problem and are well-defined patterns. You can explore them via the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer but (as I've experienced) it can require some extra overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back-to-back instances where pods OOM-errored and didn't come back up, networking rule rollouts dropped some messages, WAL access to an AWS Aurora DB started behaving oddly... It seems that everything that could have gone wrong, did. Not saying Debezium is bad; it's fantastically stable, but often for devs running it becomes a networking skill rather than a coding skill.
A KISS solution using normal coding techniques that will work 99.99% of the time (and inform you of the 0.01%) would be:
Start transaction.
Sync save to the DB.
  -> If that fails, bail out.
Async send the message to Kafka.
Block until the topic reports that it has received the message.
  -> If it times out or fails, abort the transaction.
  -> If it succeeds, commit the transaction.
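A hedged sketch of that flow using Spring's TransactionTemplate and KafkaTemplate; the repository, topic name and timeout are assumptions:

import java.util.concurrent.TimeUnit;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.transaction.support.TransactionTemplate;

// Sketch only: save the record, then block on the Kafka send inside the same
// database transaction; a failed or timed-out send rolls the DB write back.
public class OrderService {

    private final TransactionTemplate tx;               // wraps the DB transaction
    private final OrderRepository orders;               // assumed repository
    private final KafkaTemplate<String, String> kafka;

    public OrderService(TransactionTemplate tx, OrderRepository orders,
                        KafkaTemplate<String, String> kafka) {
        this.tx = tx;
        this.orders = orders;
        this.kafka = kafka;
    }

    public void createOrder(Order order, String eventJson) {
        tx.executeWithoutResult(status -> {
            orders.save(order);                          // sync save; a failure simply propagates
            try {
                // Block until the broker acknowledges the message (or the timeout hits).
                kafka.send("orders", order.getId(), eventJson).get(10, TimeUnit.SECONDS);
            } catch (Exception e) {
                // Timeout or send failure: abort the surrounding DB transaction as well.
                throw new IllegalStateException("Kafka send failed, rolling back", e);
            }
        });
    }
}

The remaining gap is that the database commit itself can still fail after Kafka has accepted the message, which is the 0.01% mentioned above.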
I'd suggest using a new approach: the 2-phase message. With this approach much less code is needed, and you don't need Debezium anymore.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
When writing to your database, write an event record to an auxiliary table.
Submit a 2-phase message to DTM.
Write a service that queries whether an event is saved in the auxiliary table.
With the help of the DTM SDK, you can accomplish the above 3 steps with 8 lines of Go, much less code than other solutions:
// Compose the 2-phase message and submit it to the DTM server together with the
// local database update (both succeed or both are rolled back).
msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
    return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
})

// The query endpoint DTM calls back to check whether the local transaction committed.
app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
    return MustBarrierFromGin(c).QueryPrepared(db)
}))
Each of your original options has its disadvantages:
The user cannot immediately see the database changes they have just created.
Debezium will capture the database log, which may be much larger than the events you wanted. Also, the deployment and maintenance of Debezium is not an easy job.
The "built-in auto-retry functionality" is not cheap; it may require a lot of code or maintenance effort.

Axon - Easiest way to make projection at query time

I will usually have 5-6 events per aggregate and would prefer not to store projections in the DB. What would be the easiest way to always build the view projection at query time?
The short answer to this is that there is no easy/quick way to do it.
However, it most certainly is doable to implement a 'replay the given events at request time' setup.
What I would suggest you do consists of several steps (a rough sketch follows after the list):
Create the query model you would like to return, which can handle events (use @EventHandler annotated methods on the model).
Create a component which can handle the query and return the query model from step 1 (use a @QueryHandler annotated method for this).
The query-handling component should be able to retrieve a stream of events from the EventStore. If this is based on an aggregate identifier, use the EventStore#readEvents(String) method. If you need the entire event stream, you need to use the StreamableMessageSource#openStream(TrackingToken) method (note: the EventStore interface implements StreamableMessageSource).
Upon query handling, create an AnnotationEventHandlerAdapter, giving it a fresh instance of your Query Model.
For every event in the event stream you retrieved in point 3, call the AnnotationEventHandlerAdapter#handle(EventMessage) method. This method will call the @EventHandler annotated methods on your Query Model object.
Once the stream is depleted, you are ensured that all necessary events for your Query Model have been dealt with. Thus, you can now return the Query Model.
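Roughly, those steps could look like the following sketch, assuming Axon 4-style APIs; OrderSummary and OrderSummaryQuery are placeholder names for your own Query Model and query:

import org.axonframework.eventhandling.AnnotationEventHandlerAdapter;
import org.axonframework.eventsourcing.eventstore.DomainEventStream;
import org.axonframework.eventsourcing.eventstore.EventStore;
import org.axonframework.queryhandling.QueryHandler;

public class OrderSummaryProjector {

    private final EventStore eventStore;

    public OrderSummaryProjector(EventStore eventStore) {
        this.eventStore = eventStore;
    }

    @QueryHandler
    public OrderSummary handle(OrderSummaryQuery query) throws Exception {
        // Step 4: a fresh Query Model instance per request, wrapped in an adapter.
        OrderSummary model = new OrderSummary();
        AnnotationEventHandlerAdapter adapter = new AnnotationEventHandlerAdapter(model);

        // Steps 3 and 5: replay the aggregate's events through the model's @EventHandler methods.
        DomainEventStream events = eventStore.readEvents(query.getOrderId());
        while (events.hasNext()) {
            adapter.handle(events.next());
        }

        // Step 6: the stream is depleted, so the model is complete and can be returned.
        return model;
    }
}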
So, again, I don't think this is overly trivial, easy or quick to set up.
Additionally, step 3 has quite a caveat in there. Retrieving the stream of a given Aggregate based on the Aggregate Identifier is pretty fast/concise, as an Aggregate in general doesn't have a lot of events.
However, retrieving the Event Stream based on a TrackingToken, which you'd need if your Query Model spans several Aggregates, can mean pulling in the entire event store to instantiate your models on the fly. Granted, as you're dealing with a TrackingToken you can fine-tune the point in time from which the Event Stream should return events, but the chances are pretty high that this will be incomplete and relatively slow.
However, you stated you want to retrieve events for a given Aggregate Identifier.
I'd thus think this should be a workable solution in your scenario.
Hope this helps!

Atomic read and delete in mongo

I am fairly new to mongo, so what I'm trying to achieve here might not be possible. My research so far is inconclusive...
My scenario is the following: I have an application which may have multiple instances running. These instances are processing some data, and when that processing fails, they write the ID of the failed item in a mongo collection ("error").
From time to time I want to retry processing those items. So, at fixed intervals, the application reads all the IDs from the collection, after which it deletes all the records. Now, this is an obvious race condition. Two instances may read the very same data, which would double the work to be done. Some IDs may also be missed like this.
My question would be the following: is there any way I can read and delete those records in a distributed-atomic way? I was thinking about locking the collection, but so far I have found no support for this in the Java driver's documentation. I also tried to look for a findAndDrop()-like method, but no luck so far.
I am aware of techniques like leader election, which most probably would solve this problem, but I wanted to see if it can be done in an easier way.
You could use a BlockingQueue with a multiple-producer/single-consumer approach, since you have multiple producers producing the IDs and can delete them with a single consumer.
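For what it's worth, a minimal sketch of that suggestion; the names are made up, and note that this only coordinates threads within a single instance, not across separate application instances:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FailedIdRetrier {

    private final BlockingQueue<String> failedIds = new LinkedBlockingQueue<>();

    // Called by any producer thread when processing of an item failed.
    public void reportFailure(String id) {
        failedIds.offer(id);
    }

    // Run on a single consumer thread.
    public void retryLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String id = failedIds.take();   // blocks until an ID is available
            retryProcessing(id);            // hypothetical retry call
        }
    }

    private void retryProcessing(String id) {
        // actual retry logic goes here
    }
}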
In the end, I found no way to implement this with Mongo.
However, since this is a Heroku app, I stored the IDs in a Redis collection instead. A library I found implements a distributed Redis lock for Jedis, so this workaround solved my problem.

Spring Integration Custom Poller for Different Events

I need to poll a folder for changes, i.e. files added, modified and deleted.
If I want to distinguish between the different types of events listed above, would I need to implement a custom poller, i.e. implement AbstractPoller? I have already implemented a poller that does this for a different project, but I would like to use Spring Integration and Batch as I need to use other functionality.
What is the best way of doing this?
Thanks
Would you mind sharing your code? BTW, you can always utilize custom code with an <int:inbound-channel-adapter> using the ref and method attributes, where the underlying POJO returns some object which becomes the payload of the message.
As you know, the <int:inbound-channel-adapter> should be configured with a <poller> specifying how often you want to call that underlying POJO.
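If you prefer Java configuration over the XML namespace, a rough Java DSL equivalent could look like the sketch below; DirectoryScanner and the channel name are made-up placeholders for your existing polling POJO:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;

@Configuration
public class DirectoryPollingConfig {

    // The underlying POJO is polled every 5 seconds; whatever it returns becomes
    // the payload of a message sent to the "directoryChanges" channel.
    @Bean
    public IntegrationFlow directoryChangesFlow(DirectoryScanner scanner) {
        return IntegrationFlows
                .fromSupplier(scanner::nextChange,
                        c -> c.poller(Pollers.fixedDelay(5000)))
                .channel("directoryChanges")
                .get();
    }
}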

Hibernate Dirty Object usage

I have a Hibernate entity in my code. I would fetch it and, based on the value of one of its properties, say "isProcessed", go on and:
change the value of "isProcessed" to "Yes" (the property that I checked)
add some task to a DelayedExecutor.
In my performance test I have found that if I hammer the function, a classic dirty-read scenario happens and I add too many tasks to the Executor, all of which will be executed. I can't rely on checking equality of the objects in the queue; Java will simply execute everything that gets added.
How can I use Hibernate's dirty-object mechanism to check "isProcessed" before adding the task to the executor? Would it work?
I hope I have been expressive enough.
If you can do all of your queries to dispatch your tasks using the same Session, you can probably patch something together. The caveat is that you have to understand how hibernate's caching mechanisms (yes, that's plural) work. The first-level cache that is associated with the Session is going to be the key here. Also, it's important to know that executing a query and hydrating objects will not look into and return objects from the first-level cache...the right hand is not talking to the left hand.
So, to accomplish what you're trying to do (assuming you can keep using the same Session...if you can't do this, then I think you're out of luck) you can do the following:
execute your query
for each returned object, re-load it with Session's get method
check the isProcessed flag and dispatch if need be
By calling get, you'll be sure to get the object from the first-level cache...where all the dirty objects pending flush are held.
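A rough sketch of those three steps, assuming the same open Session throughout, a hypothetical Task entity with a boolean isProcessed flag (adapt if yours is a "Yes"/"No" string), and some executor to hand the work to:

import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.hibernate.Session;

public class TaskDispatcher {

    public void dispatchPending(Session session, ScheduledExecutorService executor) {
        // Step 1: execute the query (this does NOT consult the first-level cache).
        List<Task> candidates = session
                .createQuery("from Task t where t.isProcessed = false", Task.class)
                .list();

        for (Task candidate : candidates) {
            // Step 2: re-load via get(), which DOES consult the first-level cache,
            // so dirty, not-yet-flushed changes from this Session are visible here.
            Task cached = session.get(Task.class, candidate.getId());

            // Step 3: only dispatch if nothing in this Session has marked it yet.
            if (!cached.isProcessed()) {
                cached.setProcessed(true); // mark before dispatching so a later pass skips it
                executor.schedule(() -> process(cached), 5, TimeUnit.SECONDS);
            }
        }
    }

    private void process(Task task) {
        // actual task handling goes here
    }
}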
For background, this is an extremely well-written and helpful document about Hibernate caching.
