Generate "fake" stream data. Kafka - Flink - java

I am trying to generate stream data to simulate a situation where I receive two Integer values, at irregular time intervals, with timestamps, using Kafka as the connector.
I am using a Flink environment as the consumer, but I don't know which is the best solution for the producer (Java syntax preferred over Scala if possible).
Should I produce the data directly with Kafka? If yes, what is the best way to do it?
Or is it better to produce the data from Flink as a producer, send it to Kafka, and consume it with Flink again at the end? How can I do that from Flink?
Or perhaps there is another easy way to generate stream data and pass it to Kafka.
If so, please put me on the right track to achieve it.

As David also mentioned, you can create a dummy producer in plain Java using the KafkaProducer API to schedule and send messages to Kafka as you wish. Similarly, you can do that with Flink if you want multiple simultaneous producers; with Flink you will need to write separate jobs for the producer and the consumer. Kafka basically enables an asynchronous processing architecture, so it does not have queue mechanisms; it is better to keep the producer and consumer jobs separate.
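For the plain-Java option, a minimal dummy producer could look roughly like this; the topic name "numbers", the keys, and the random values and intervals are only assumptions for illustration:

```java
import java.util.Properties;
import java.util.Random;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class FakeStreamProducer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class.getName());

        Random random = new Random();
        try (KafkaProducer<String, Integer> producer = new KafkaProducer<>(props)) {
            while (true) {
                long eventTime = System.currentTimeMillis();
                // Explicit record timestamps, so Flink can later use them as event time.
                producer.send(new ProducerRecord<>("numbers", null, eventTime, "source-1", random.nextInt(100)));
                producer.send(new ProducerRecord<>("numbers", null, eventTime, "source-2", random.nextInt(100)));
                Thread.sleep(random.nextInt(1000)); // irregular gap between events
            }
        }
    }
}
```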
But think a little bit more about the intention of this test:
Are you trying to test Kafka's streaming durability, replication, and offset-management capabilities? In that case, you need simultaneous producers for the same topic, with null or non-null keys in the messages.
Or are you trying to test the Flink-Kafka connector's capabilities? In that case, you need only one producer; a few interesting scenarios include a back-pressure test, where the producer pushes more messages than the consumer can handle.
Or are you trying to test topic partitioning and Flink streaming parallelism? In that case, use a single producer or multiple producers, but the message key should be non-null; you can then check how the Flink executors connect to the individual partitions and observe their behavior.
There are more ideas you may want to test, and each of them will need something specific to be done (or not done) in the producer.
You can check out https://github.com/abhisheknegi/twitStream for pulling tweets using Java APIs in case needed.
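On the Flink side, a minimal consumer job for the integers produced above could look roughly like the sketch below. It assumes the newer KafkaSource connector (Flink 1.14+; older versions use FlinkKafkaConsumer instead), and the topic name, group id, and watermark strategy are only assumptions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.common.serialization.IntegerDeserializer;

public class NumbersConsumerJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<Integer> source = KafkaSource.<Integer>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("numbers")                     // same topic the dummy producer writes to
                .setGroupId("flink-numbers")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(IntegerDeserializer.class))
                .build();

        // Kafka record timestamps are used as event time; adjust the watermark strategy to your data.
        DataStream<Integer> values = env.fromSource(
                source, WatermarkStrategy.forMonotonousTimestamps(), "kafka-numbers");

        values.print();

        env.execute("numbers-consumer");
    }
}
```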

Related

I need to save data in mongodb and send this data to kafka all atomically

I have a Spring Boot application which persists data to MongoDB and sends this data to Kafka. I want these two processes to run atomically: if data is persisted to Mongo, then it should be sent to Kafka. How can I do that?
With Kafka itself you can't.
Kafka offers transactions, but they are restricted to writing to multiple partitions in Kafka atomically. They are designed with stream processing in mind, so consuming from one topic and producing to another in one go - but a Kafka transaction cannot know whether a write to Mongo succeeded.
The use case you have comes up regularly, though. Usually you would use the outbox pattern: modify only one of the two resources (the database or Apache Kafka) and drive the update of the second one from that, in an eventually consistent manner.
If you really need atomic writes, I believe it is possible by relying on the ACID guarantees Mongo >= 4.2 gives you instead of Kafka's transactional guarantees, but this would mean you need to manage the Kafka offsets in Mongo.
If you have "Kafka: The Definitive Guide", 2nd edition, there is a small chapter with more details about what exactly Kafka transactions can and cannot do, and possible workarounds.

Synchronize all consumers of a queue to process only one message at a time

I am running a Spring Cloud Stream application, consuming messages from RabbitMQ. I want to implement a behaviour where a given queue, having three consumer instances, delivers exactly one message to any of them and waits until the current one is acked before the next is delivered (some sort of synchronization between all consumers).
I think this can be done using consumer prefetch (https://www.rabbitmq.com/consumer-prefetch.html) with the global option enabled, but I can't find a way of achieving this using Spring Cloud Stream. Any help will be appreciated.
Spring uses a separate channel for each consumer, so a global channel prefetch won't have any effect.
Bear in mind that, even if it were supported, it would only work if the consumers were all in the same instance.

Dataflow send PubSub message after BigQuery write completion

I have a Dataflow job that transforms data and writes out to BigQuery (batch job). Following the completion of the write operation I want to send a message to PubSub which will trigger further processing of the data in BigQuery. I have seen a few older questions/answers that hint at this being possible but only on streaming jobs:
Perform action after Dataflow pipeline has processed all data
Execute a process exactly after BigQueryIO.write() operation
How to notify when DataFlow Job is complete
I'm wondering whether this is supported in any way for batch write jobs now. Unfortunately I can't use Apache Airflow to orchestrate all this, so sending a Pub/Sub message seemed like the easiest way.
The design of Beam makes it impossible to do what you want directly. Indeed, you write a PCollection to BigQuery, and by definition a PCollection is a bounded or unbounded collection. How can you trigger something after an unbounded collection? When do you know that you have reached the end?
So, there are different ways to achieve this. In your code, you can wait for the pipeline to complete and then publish a Pub/Sub message.
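A hedged sketch of that first approach with Beam's Java SDK and the Pub/Sub client library is below. It assumes the launching process stays alive until the job finishes, and the project and topic names are placeholders:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class NotifyAfterBatchLoad {
    public static void main(String[] args) throws Exception {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // ... apply your transforms and the BigQueryIO.write() step here ...

        // Blocks the launching process until the batch job has finished.
        PipelineResult.State state = p.run().waitUntilFinish();

        if (state == PipelineResult.State.DONE) {
            Publisher publisher = Publisher.newBuilder(
                    TopicName.of("my-project", "my-topic")).build();   // placeholder project/topic
            publisher.publish(PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8("bigquery-write-complete"))
                    .build()).get();                                   // wait for the publish to succeed
            publisher.shutdown();
        }
    }
}
```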
Personally, I prefer to base this on the logs: when the Dataflow job finishes, I take the end-of-job log entry and sink it into Pub/Sub. That decouples the pipeline code from the next step.
You can also have a look at Workflows. It's not really mature yet, but very promising for simple workflows like yours.

Can I use spring.cloud.stream.bindings.<channel>.group when using RabbitMQ to obtain exactly-once delivery?

So I was reading this tutorial to configure RabbitMQ and Spring Boot.
At a certain point it says:
Most of the time, we need the message to be processed only once.
Spring Cloud Stream implements this behavior via consumer groups.
So I started looking for more information; in the Spring docs it is written that:
When doing so, different instances of an application are placed in a competing consumer relationship, where only one of the instances is expected to handle a given message.
Spring Cloud Stream models this behavior through the concept of a consumer group. (Spring Cloud Stream consumer groups are similar to and inspired by Kafka consumer groups.)
So I set up two nodes here with Spring Boot, Spring Cloud Stream, and RabbitMQ, using spring.cloud.stream.bindings.<channel>.group.
This still looks like at-least-once behavior to me. Am I wrong in assuming that? Should I still handle the possibility of processing a message twice even when using spring.cloud.stream.bindings.<channel>.group?
Thank you
It's at least once. The connection might close before the ack is sent. Rare, but possible.
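Because of that, a common safeguard is to make the handler idempotent. A minimal sketch using Spring Cloud Stream's functional style is below; the header name "businessKey", the in-memory store, and the binding setup (spring.cloud.function.definition and the group property on the corresponding binding) are assumptions, not anything the question or answer specifies:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class IdempotentHandlerConfig {

    // In-memory record of processed keys; a real deployment would use a shared store (DB, Redis, ...).
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    @Bean
    public Consumer<Message<String>> input() {
        return message -> {
            // "businessKey" is a hypothetical header carrying a stable identifier for the payload.
            Object key = message.getHeaders().get("businessKey");
            if (key != null && !processed.add(key.toString())) {
                return; // redelivered duplicate, skip it
            }
            handle(message.getPayload());
        };
    }

    private void handle(String payload) {
        // business logic goes here
    }
}
```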

Is 'exactly once' only for streams (topic1 -> app -> topic2)?

I have an architecture where we have two separate applications. The original source is a sql database. App1 listens to CDC tables to track changes to tables in that database, normalizes, and serializes those changes. It takes those serialized messages and sends them to a Kafka topic. App2 listens to that topic, adapts the messages to different formats, and sends those adapted messages to their respective destinations via HTTP.
So our streaming architecture looks like:
SQL (CDC event) -> App1 ( normalizes events) -> Kafka -> App2 (adapts events to endpoints) -> various endpoints
We're looking to add error handling in case of failure and cannot tolerate duplicate events, missing events, or changing of order. Given the architecture above, all we really care about is that exactly-once applies to messages getting from App1 to App2 (our separate producers and consumers).
Everything I'm reading and every example I've found of the transactional api points to "streaming". It looks like the Kafka streaming api is meant for an individual application that takes an input from a Kafka topic, does its processing, and outputs it to another Kafka topic, which doesn't seem to apply to our use of Kafka. Here's an excerpt from Confluent's docs:
Now, stream processing is nothing but a read-process-write operation on a Kafka topic; a consumer reads messages from a Kafka topic, some processing logic transforms those messages or modifies state maintained by the processor, and a producer writes the resulting messages to another Kafka topic. Exactly once stream processing is simply the ability to execute a read-process-write operation exactly one time. In this case, "getting the right answer" means not missing any input messages or producing any duplicate output. This is the behavior users expect from an exactly once stream processor.
I'm struggling to wrap my head around how we can use exactly-once with our Kafka topic, or if Kafka's exactly-once is even built for non-"streaming" use cases. Will we have to build our own deduplication and fault tolerance?
If you are using Kafka's Streams API (or another tool that supports exactly-once processing with Kafka), then Kafka's exactly-once semantics (EOS) are covered across apps:
topic A --> App 1 --> topic B --> App 2 --> topic C
In your use case, one question is whether the initial CDC step supports EOS, too. In other words, you must ask the question: Which steps are involved, and are all steps covered by EOS?
In the following example, EOS is supported end-to-end if (and only if) the initial CDC step supports EOS as well, like the rest of the data flow.
SQL --CDC--> topic A --> App 1 --> topic B --> App 2 --> topic C
If you use Kafka Connect for the CDC step, then you must check whether the connector you use supports EOS or not.
Everything I'm reading and every example I've found of the transactional api points to "streaming".
The transactional API of the Kafka producer/consumer clients provides the primitives for EOS processing. Kafka Streams, which sits on top of the producer/consumer clients, uses this functionality to implement EOS in a way that developers can use easily, with just a few lines of code (for example, automatically taking care of state management when an application needs to do a stateful operation like an aggregation or join). Perhaps that relation between the producer/consumer clients and Kafka Streams was the source of your confusion after reading the documentation?
Of course, you can also "build your own" by using the underlying Kafka producer and consumer clients (with the transactional APIs) when developing your applications, but that's more work.
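For a sense of what "building your own" involves, here is a bare-bones read-process-write loop using the transactional clients. It assumes broker and client versions that support EOS (roughly Kafka 2.5+ for the groupMetadata variant of sendOffsetsToTransaction); the topic names, group id, transactional id, and the adapt() step are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceBridge {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "app2");
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // only read committed transactions
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");       // offsets are committed via the transaction
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "app1-to-app2-tx"); // must be stable and unique per instance
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("topic-a"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                producer.beginTransaction();
                try {
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("topic-b", r.key(), adapt(r.value())));
                    }
                    // Commit the consumed offsets as part of the same transaction:
                    // output records and input offsets become visible atomically.
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (TopicPartition tp : records.partitions()) {
                        List<ConsumerRecord<String, String>> part = records.records(tp);
                        offsets.put(tp, new OffsetAndMetadata(part.get(part.size() - 1).offset() + 1));
                    }
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    // A production version would treat fatal errors (e.g. ProducerFencedException) differently.
                    producer.abortTransaction();
                }
            }
        }
    }

    private static String adapt(String value) {
        return value; // placeholder for the real adaptation logic
    }
}
```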
I'm struggling to wrap my head around how we can use exactly-once with our Kafka topic, or if Kafka's exactly-once is even built for non-"streaming" use cases. Will we have to build our own deduplication and fault tolerance?
Not sure what you mean by "non-streaming" use cases. If you mean, "If we don't want to use Kafka Streams or KSQL (or another existing tool that can read from Kafka to process data), what would we need to do to achieve EOS in our applications?", then the answer is: "Yes, in this case you must use the Kafka producer/consumer clients directly and ensure that whatever you are doing with them properly implements EOS processing." (And because the latter is difficult, this EOS functionality was added to Kafka Streams.)
I hope that helps.
