Is 'exactly once' only for streams (topic1 -> app -> topic2)? - java

I have an architecture where we have two separate applications. The original data source is a SQL database. App1 listens to CDC tables to track changes to tables in that database, then normalizes and serializes those changes. It sends those serialized messages to a Kafka topic. App2 listens to that topic, adapts the messages to different formats, and sends the adapted messages to their respective destinations via HTTP.
So our streaming architecture looks like:
SQL (CDC event) -> App1 (normalizes events) -> Kafka -> App2 (adapts events to endpoints) -> various endpoints
We're looking to add error handling in case of failure, and we cannot tolerate duplicate events, missing events, or reordering. Given the architecture above, all we really care about is that exactly-once applies to messages getting from App1 to App2 (our separate producer and consumer applications).
Everything I'm reading and every example I've found of the transactional api points to "streaming". It looks like the Kafka streaming api is meant for an individual application that takes an input from a Kafka topic, does its processing, and outputs it to another Kafka topic, which doesn't seem to apply to our use of Kafka. Here's an excerpt from Confluent's docs:
Now, stream processing is nothing but a read-process-write operation on a Kafka topic; a consumer reads messages from a Kafka topic, some processing logic transforms those messages or modifies state maintained by the processor, and a producer writes the resulting messages to another Kafka topic. Exactly once stream processing is simply the ability to execute a read-process-write operation exactly one time. In this case, “getting the right answer” means not missing any input messages or producing any duplicate output. This is the behavior users expect from an exactly once stream processor.
I'm struggling to wrap my head around how we can use exactly-once with our Kafka topic, or if Kafka's exactly-once is even built for non-"streaming" use cases. Will we have to build our own deduplication and fault tolerance?

If you are using Kafka's Streams API (or another tool that supports exactly-once processing with Kafka), then Kafka's exactly-once semantics (EOS) are covered across apps:
topic A --> App 1 --> topic B --> App 2 --> topic C
In your use case, one question is whether the initial CDC step supports EOS, too. In other words, you must ask the question: Which steps are involved, and are all steps covered by EOS?
In the following example, EOS is supported end-to-end if (and only if) the initial CDC step supports EOS as well, like the rest of the data flow.
SQL --CDC--> topic A --> App 1 --> topic B --> App 2 --> topic C
If you use Kafka Connect for the CDC step, then you must check whether the connector you use supports EOS or not.
Everything I'm reading and every example I've found of the transactional api points to "streaming".
The transactional API of the Kafka producer/consumer clients provides the primitives for EOS processing. Kafka Streams, which sits on top of the producer/consumer clients, uses this functionality to implement EOS in a way that developers can enable with a few lines of code (it also automatically takes care of state management when an application needs to do a stateful operation like an aggregation or join). Perhaps that relationship between the producer/consumer clients and Kafka Streams was the source of your confusion after reading the documentation?
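For illustration, here's a minimal Kafka Streams sketch with EOS turned on; the application ID, broker address, and topic names are placeholders:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class EosStreamsApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-demo");          // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // This single setting enables exactly-once processing
            // ("exactly_once_v2" requires brokers 2.5+; older clusters use "exactly_once").
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("input-topic").to("output-topic");    // read-process-write
            new KafkaStreams(builder.build(), props).start();
        }
    }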
Of course, you can also "build your own" by using the underlying Kafka producer and consumer clients (with the transactional APIs) when developing your applications, but that's more work.
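For reference, a rough sketch (not a definitive implementation) of that "build your own" read-process-write loop with the plain clients; topic names, IDs, and the transform() helper are placeholders, and error handling is trimmed to the minimum:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalCopy {
        public static void main(String[] args) {
            Properties pp = new Properties();
            pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "app-tx-1");         // stable and unique per instance
            pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            Properties cp = new Properties();
            cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
            cp.put(ConsumerConfig.GROUP_ID_CONFIG, "app2-group");               // placeholder
            cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");          // offsets go through the transaction
            cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");    // hide aborted transactions
            cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
            producer.initTransactions();
            consumer.subscribe(List.of("input-topic"));                         // placeholder

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("output-topic", r.key(), transform(r.value())));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Commit the consumed offsets and the produced records atomically.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // Input is redelivered, output is discarded. For fatal errors
                    // (e.g. ProducerFencedException) you must close the producer instead.
                    producer.abortTransaction();
                }
            }
        }

        private static String transform(String value) {
            return value; // hypothetical processing step
        }
    }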
I'm struggling to wrap my head around how we can use exactly-once with our Kafka topic, or if Kafka's exactly-once is even built for non-"streaming" use cases. Will we have to build our own deduplication and fault tolerance?
Not sure what you mean by "non-streaming" use cases. If you mean, "If we don't want to use Kafka Streams or KSQL (or another existing tool that can read from Kafka to process data), what would we need to do to achieve EOS in our applications?", then the answer is: "Yes, in this case you must use the Kafka producer/consumer clients directly and ensure that whatever you are doing with them properly implements EOS processing." (And because the latter is difficult, this EOS functionality was added to Kafka Streams.)
I hope that helps.

Related

I need to save data in mongodb and send this data to kafka all atomically

I have a Spring Boot application which persists data to MongoDB and sends this data to Kafka. I want these two processes to run atomically; that is, if data is persisted to Mongo, it should also be sent to Kafka. How can I do that?
With Kafka itself you can't.
Kafka offers transactions, but they are restricted to writing to multiple partitions in Kafka atomically. They are designed with stream processing in mind, i.e. consuming from one topic and producing to another in one go - but a Kafka transaction cannot know whether a write to Mongo succeeded.
The use case you have comes up regularly, though. Usually you would use the outbox pattern: modify only one of the two resources (the database or Apache Kafka) and drive the update of the second one from that, in an eventually consistent manner.
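For illustration, a minimal sketch of the outbox pattern with the MongoDB Java driver and a Kafka producer (database, collection, and field names are made up): the business write and the outbox entry share one Mongo transaction, and a separate relay publishes pending outbox entries to Kafka.

    import java.util.Date;
    import com.mongodb.client.ClientSession;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Updates;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.bson.Document;

    public class OutboxExample {
        public static void writeAtomically(MongoClient mongo, Document order) {
            MongoCollection<Document> orders = mongo.getDatabase("shop").getCollection("orders");
            MongoCollection<Document> outbox = mongo.getDatabase("shop").getCollection("outbox");
            try (ClientSession session = mongo.startSession()) {
                // Both inserts commit or roll back together (MongoDB 4.x transactions).
                session.withTransaction(() -> {
                    orders.insertOne(session, order);
                    outbox.insertOne(session, new Document("topic", "orders")
                            .append("payload", order.toJson())
                            .append("sentAt", null));
                    return null;
                });
            }
        }

        // Relay: runs separately, e.g. on a schedule. This leg is at-least-once,
        // so consumers should deduplicate (or replace the relay with a CDC
        // connector such as Debezium reading the outbox collection).
        public static void relay(MongoClient mongo, KafkaProducer<String, String> producer) throws Exception {
            MongoCollection<Document> outbox = mongo.getDatabase("shop").getCollection("outbox");
            for (Document msg : outbox.find(Filters.eq("sentAt", null))) {
                producer.send(new ProducerRecord<>(msg.getString("topic"), msg.getString("payload"))).get();
                outbox.updateOne(Filters.eq("_id", msg.get("_id")), Updates.set("sentAt", new Date()));
            }
        }
    }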
If you really need atomic writes, I believe it is possible to do so by relying on the ACID guarantees Mongo >= 4.2 gives you instead of Kafka's transactional guarantees. But this would mean you need to manage the Kafka offsets in Mongo.
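A rough sketch of that offsets-in-Mongo approach (collection and field names are hypothetical): the consumed record and its next offset are written in a single Mongo transaction, and on startup the consumer seeks to the stored offset instead of relying on Kafka's committed offsets.

    import com.mongodb.client.ClientSession;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.ReplaceOptions;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.bson.Document;

    public class MongoOffsetStore {
        public static void saveAtomically(MongoClient mongo, ConsumerRecord<String, String> rec) {
            MongoCollection<Document> data = mongo.getDatabase("app").getCollection("data");
            MongoCollection<Document> offsets = mongo.getDatabase("app").getCollection("offsets");
            String key = rec.topic() + "-" + rec.partition();
            try (ClientSession session = mongo.startSession()) {
                session.withTransaction(() -> {
                    data.insertOne(session, new Document("value", rec.value()));
                    // Next offset to read, stored in the same transaction as the data.
                    offsets.replaceOne(session, Filters.eq("_id", key),
                            new Document("_id", key).append("offset", rec.offset() + 1),
                            new ReplaceOptions().upsert(true));
                    return null;
                });
            }
        }
        // On startup: read the stored offset and call consumer.seek(partition, offset)
        // instead of letting the consumer group's committed offsets decide.
    }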
If you have "Kafka the definitive guide, 2nd edition", there is a small chapter with more details about what exactly Kafka transactions can and cannot do and possible workarounds.

Synchronize all consumers of a queue to process only one message at a time

I am running a Spring Cloud Stream application, consuming messages from RabbitMQ. I want to implement a behaviour where a given queue, having three consumer instances, delivers exactly one message to any one of them and waits to deliver the next until the current one is acked (some sort of synchronization between all consumers).
I think this can be done using https://www.rabbitmq.com/consumer-prefetch.html with the global option enabled, but I can't find a way of achieving this using Spring Cloud Stream. Any help will be appreciated.
Spring uses a separate channel for each consumer so global channel prefetch won't have any effect.
Bear in mind that, even if it was supported, it would only work if the consumers were all in the same instance.
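For what it's worth, the closest per-consumer knob is the RabbitMQ binder's prefetch consumer property, e.g.

    spring.cloud.stream.rabbit.bindings.<channel>.consumer.prefetch=1

but that only caps the unacked messages of each consumer individually; it does not synchronize the instances with each other.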

Can I use spring.cloud.stream.bindings.<channel>.group when using RabbitMQ to obtain exactly-once delivery?

so I was reading this tutorial to configure RabbitMQ and Spring Boot.
At a certain point it is said:
Most of the time, we need the message to be processed only once. Spring Cloud Stream implements this behavior via consumer groups.
So I started looking for more information, and in the Spring docs it is written that:
When doing so, different instances of an application are placed in a competing consumer relationship, where only one of the instances is expected to handle a given message. Spring Cloud Stream models this behavior through the concept of a consumer group. (Spring Cloud Stream consumer groups are similar to and inspired by Kafka consumer groups.)
So I set up two nodes with Spring Cloud Stream and RabbitMQ, using spring.cloud.stream.bindings.<channel>.group.
This to me still looks like at-least-once behavior. Am I wrong in assuming that? Should I still manage the possibility to process a message twice even using spring.cloud.stream.bindings.<channel>.group?
Thank you
It's at least once. The connection might close before the ack is sent. Rare, but possible.
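So yes, you should still handle the possibility of a duplicate. A minimal sketch of consumer-side deduplication (the message-ID argument and the bounded in-memory store are assumptions; with multiple instances you'd need a shared store such as a database table or Redis):

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DedupingHandler {
        private static final int MAX_SEEN = 10_000;

        // Bounded, access-ordered LRU map of recently seen message IDs.
        private final Map<String, Boolean> seen = Collections.synchronizedMap(
                new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > MAX_SEEN;
                    }
                });

        public void handle(String messageId, String payload) {
            // putIfAbsent returns a previous value if the ID was already seen.
            if (seen.putIfAbsent(messageId, Boolean.TRUE) != null) {
                return; // duplicate redelivery, skip it
            }
            process(payload);
        }

        private void process(String payload) {
            System.out.println("processing: " + payload);
        }
    }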

Generate "fake" stream data. Kafka - Flink

I am trying to generate stream data to simulate a situation where I receive two values of Integer type, in different time ranges, with timestamps, and with Kafka as the connector.
I am using the Flink environment as the consumer, but I don't know which is the best solution for the producer. (Java syntax preferred over Scala, if possible.)
Should I produce the data directly to Kafka? If yes, what is the best way to do it?
Or maybe it is better if I produce the data from Flink as a producer, send it to Kafka, and consume it at the end with Flink again? How can I do that from Flink?
Or perhaps there is another easy way to generate stream data and pass it to Kafka.
If yes, please put me on the track to achieve it.
As David also mentioned, you can create a dummy producer in plain Java using the KafkaProducer API to schedule and send messages to Kafka as you wish. Similarly, you can do that with Flink if you want multiple simultaneous producers, but with Flink you will need to write a separate job for the producer and the consumer. Kafka enables an asynchronous processing architecture rather than a work-queue mechanism, so it is better to keep the producer and consumer jobs separate anyway.
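A minimal sketch of such a dummy producer (topic name, key, and value range are placeholders):

    import java.util.Properties;
    import java.util.Random;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.IntegerSerializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DummyProducer {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class);

            Random random = new Random();
            try (KafkaProducer<String, Integer> producer = new KafkaProducer<>(props)) {
                while (true) {
                    int value = random.nextInt(100);
                    long eventTime = System.currentTimeMillis();
                    // Explicit record timestamp so Flink can work with event time.
                    producer.send(new ProducerRecord<>("test-topic", null, eventTime, "sensor-1", value));
                    Thread.sleep(random.nextInt(1000)); // irregular intervals
                }
            }
        }
    }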
But think a little bit more about the intention of this test:
Are you trying to test Kafka's streaming durability, replication, and offset-management capabilities? In this case, you need simultaneous producers for the same topic, with null or non-null keys in the messages.
Or are you trying to test the Flink-Kafka connector's capabilities? In this case you need only one producer; a useful internal scenario is a backpressure test, where the producer pushes more messages than the consumer can handle.
Or are you trying to test topic partitioning and Flink streaming parallelism? In this case you need one or more producers, but the message keys should be non-null; you can then test how the Flink executors connect to the individual partitions and observe their behavior.
There are more ideas you may want to test, and each of these will need something specific to be done (or not done) in the producer.
You can check out https://github.com/abhisheknegi/twitStream for pulling tweets using Java APIs in case needed.

A Scalable Architecture for Reconstructing events

I have been tasked to develop the architecture for a data transformation pipeline.
Essentially, data comes in at one end and is routed through various internal systems acquiring different forms before ending up in its destination.
The main objectives are -
Fault tolerance - the message should be recoverable if one of the intermediate systems is down.
Replay/resequence - the message can be replayed from any stage, and it should be possible to recreate the events in an idempotent manner.
I have a few custom solutions in mind to address these:
Implement a checkpoint system where a message can be logged at both entry and exit points at each checkpoint so we know where failure happens.
Implement a recovery mechanism that can go to the logged storage ( database, log file etc.. ) and reconstruct events programmatically.
However, I have a feeling this is a fairly standard problem with well defined solutions.
So, I would welcome any thoughts on a suitable architecture to go with, any tools/packages/patterns to refer to etc..
Thanks
Akka is an obvious choice. The Scala version is more powerful, of course, but even with the Java bindings you can achieve a lot.
I think you can follow CQRS approach and use Akka Persistence module. In this case it's easy to replay any sequence of events, because you always have a persistent journal.
Generally Actor Model provides you fault-tolerance using supervision.
Akka Clustering will give you scalability you need.
A really nice example of using Akka Clustering with Akka Persistence and Cassandra: https://github.com/boldradius/akka-dddd-template (Scala only, unfortunately).
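To make the replay idea concrete, here is a minimal sketch of a persistent actor with Akka Persistence (classic Java API); the actor and message types are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import akka.persistence.AbstractPersistentActor;

    // Hypothetical pipeline stage: every message is journaled before the state
    // changes, so the full event sequence can be replayed after a failure.
    public class PipelineStageActor extends AbstractPersistentActor {
        private final List<String> state = new ArrayList<>();

        @Override
        public String persistenceId() {
            return "pipeline-stage-1"; // unique per persistent entity
        }

        @Override
        public Receive createReceiveRecover() {
            // Invoked with journaled events on (re)start - rebuilds state idempotently.
            return receiveBuilder().match(String.class, state::add).build();
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, msg ->
                            persist(msg, evt -> {
                                state.add(evt);
                                // Side effects (forward downstream, ack, ...) go here,
                                // only after the event is safely in the journal.
                            }))
                    .build();
        }
    }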
One common solution is JMS, where a central component (the JMS broker) keeps a transactional store of pending messages. Because it does nothing other than that, it can have a high uptime (uptime can be increased further with a failover cluster, in which case you'll likely want its persistence store to be a failover cluster, too).
Sending a JMS message can be made transactional, as can consuming a message. These transactions can be synchronized with database transactions through XA transactions, which do their utmost to get as close to exactly-once delivery as possible, but are rather heavy machinery.
In many cases (idempotent receiver), at-least-once delivery is sufficient. This can be accomplished by sending the message in a synchronous transaction (that is, the send only succeeds once the broker has acknowledged receipt of the message), and acknowledging a consumed message only after it has been processed.
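A minimal sketch of that recipe with plain JMS (connection setup omitted; the queue name is a placeholder): the send is transacted, and the consumer commits its transacted session only after processing, so a crash before the commit simply redelivers the message.

    import javax.jms.Connection;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    public class JmsAtLeastOnce {
        // Sender: the message becomes visible only when the transaction commits.
        static void send(Connection connection, String text) throws Exception {
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("pipeline.stage1"); // placeholder
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage(text));
            session.commit(); // broker has durably accepted the message
            session.close();
        }

        // Receiver: commit only after processing; a crash before commit redelivers.
        static void receiveLoop(Connection connection) throws Exception {
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("pipeline.stage1"); // placeholder
            MessageConsumer consumer = session.createConsumer(queue);
            while (true) {
                Message message = consumer.receive();
                try {
                    process(message);   // must be idempotent for at-least-once
                    session.commit();
                } catch (Exception e) {
                    session.rollback(); // message will be redelivered
                }
            }
        }

        private static void process(Message message) { /* hypothetical handler */ }
    }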
