I am using Spring 5, specifically the Reactor project, to read information from a huge Mongo collection into a Kafka topic. Unfortunately, Kafka messages are produced much faster than the program that consumes them can handle. So, I need to implement some backpressure mechanism.
Suppose I want a throughput of 100 messages per second. Googling a little, I decided to combine the buffer(int maxSize) method with zipping the result against a Flux that emits a tick at a predefined interval.
// Create a clock that emits an event every second
final Flux<Long> clock = Flux.interval(Duration.ofMillis(1000L));
// Create a buffered producer
final Flux<ProducerRecord<String, Data>> outbound =
    repository.findAll()
        .map(this::buildData)
        .map(this::createKafkaMessage)
        .buffer(100)
        // Limiting the emission in time interval
        .zipWith(clock, (msgs, tick) -> msgs)
        .flatMap(Flux::fromIterable);
// Subscribe a Kafka sender
kafkaSender.createOutbound()
    .send(outbound)
    .then()
    .block();
Is there a smarter way to do this? I mean, it seems a little bit complex to me (the zip part, above all).
Yes, you can use the delayElements(Duration.ofSeconds(1)) operator directly, without needing to zipWith a clock. Reactor is a cool project under continuous improvement, so it's worth staying up to date with it :) Hope this was helpful!
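For illustration, here is a minimal sketch of that idea against the same pipeline as in the question (assuming the same repository, buildData, createKafkaMessage and kafkaSender): each buffer of 100 messages is simply delayed by one second instead of being zipped with a clock.
final Flux<ProducerRecord<String, Data>> outbound =
    repository.findAll()
        .map(this::buildData)
        .map(this::createKafkaMessage)
        .buffer(100)
        // Space the buffers one second apart, giving roughly 100 messages/second
        .delayElements(Duration.ofSeconds(1))
        .flatMap(Flux::fromIterable);
kafkaSender.createOutbound()
    .send(outbound)
    .then()
    .block();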
I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka consumer runs into a timeout while processing the records? I mean the timeout controlled by the property session.timeout.ms.
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long in a comment.
Kafka has a few ways that can be used to process messages
At most once;
At least once; and
Exactly once.
You are describing that you would like to use Kafka with exactly-once semantics (which, by the way, is the least common way of using Kafka). Also, producers need to play nicely, as by default Kafka can produce the same message more than once.
It's a lot more common to build services that use the at-least-once mechanism; this way you can receive (or process) the same message more than once, but you need a way to deduplicate them (it's the same idea behind idempotency in HTTP APIs). You'll need something in the message that is unique, and a register of which ids have already been processed. If the payload has nothing you can use to deduplicate, you can add a header to the message and use that (see the sketch below).
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
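As a rough illustration of the deduplication idea, here is a minimal at-least-once sketch in the style of the processing loop above. The processedIds store and the "message-id" header name are assumptions made up for the example; in practice the store would be a database table or cache updated together with the processing side effects.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // The unique id travels in a message header; skip records already seen
        String messageId = new String(record.headers().lastHeader("message-id").value());
        if (processedIds.contains(messageId)) {
            continue;
        }
        processMessage(record);
        // Ideally stored in the same transaction as the processing side effects
        processedIds.add(messageId);
    }
    // Commit only after the whole batch has been handled (assuming auto-commit is disabled)
    consumer.commitSync();
}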
I would suggest googling a bit for details on how to implement the above.
Here's a blog post from Confluent about developing exactly-once semantics, Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka, and the Kafka docs explaining the different semantics.
About the point of the ConsumerRebalanceListener: you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records but not yet committed them to Kafka.
A mini tip I give to everyone who is starting with Kafka: Kafka looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works, including having done a good amount of negative testing (unless you are OK with losing data).
In my application I am using Spring Boot KafkaTemplate for consuming the messages. I am new to Kafka with Spring Boot. I have added a consumer as below:
@KafkaListener(topics = "#{'${app.kafka.consumer.topic}'.split(',')}")
public void receivedMessage(ConsumerRecord<String, String> cr, @Payload String message) {
    log.info("Message received from topic {} ", cr.topic());
    //TODO
}
On one topic we will receive nearly 200K messages per second. Once I receive a message, it is sent to another method for processing, which filters the messages based on certain criteria and then publishes the filtered message to another topic.
My question is: will the above @KafkaListener method handle this load, or do I need some special treatment like threading or concurrency?
It depends entirely on your workload, the number of cores in your CPU, etc, etc. In general, increasing the concurrency will provide more throughput (as long as you have at least that many partitions on the topic).
But, your downstream code must be completely thread-safe.
Even then, there may be other bottlenecks in your code (DB etc).
If all you are doing is computation, transformation, and publishing to another topic, without doing any other I/O, increasing the concurrency should definitely help.
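For example, with Spring for Apache Kafka you can raise the concurrency directly on the listener. A minimal sketch (the value 6 and the process method are placeholders, not taken from the question):
// Runs up to 6 listener containers, each on its own thread,
// provided the topic has at least 6 partitions
@KafkaListener(topics = "#{'${app.kafka.consumer.topic}'.split(',')}", concurrency = "6")
public void receivedMessage(ConsumerRecord<String, String> cr) {
    // Everything called from here must be thread-safe
    process(cr.value());
}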
The only real solution is experimentation and, if you don't get the throughput you need, profile your application.
You can't learn such skills without actually doing it.
One of our system has a micro service architecture using Apache Kafka as a service bus.
Low latency is a very important factor but reliability and consistency (exactly once) are even more important.
When we performed some load tests we noticed significant performance degradation, and all investigations pointed to big increases in producer and consumer latencies on the Kafka topics. No matter how much configuration we changed or how many resources we added, we could not get rid of the symptoms.
At the moment our need is to process 10 transactions per second (TPS) and the load test exercises 20 TPS, but as the system evolves and adds more functionality we know we'll reach a stage where the need will be 500 TPS, so we started worrying whether we can achieve this with Kafka.
As a proof of concept, I tried switching one of our micro services to use a Chronicle Queue instead of a Kafka topic. It was easy to migrate by following the Avro example from the Chronicle-Queue-Demo GitHub repo.
public class MessageAppender {

    private static final String MESSAGES = "/tmp/messages";

    private final AvroHelper avroHelper;
    private final ExcerptAppender messageAppender;

    public MessageAppender() {
        avroHelper = new AvroHelper();
        messageAppender = SingleChronicleQueueBuilder.binary(MESSAGES).build().acquireAppender();
    }

    @SneakyThrows
    public long append(Message message) {
        try (var documentContext = messageAppender.writingDocument()) {
            var paymentRecord = avroHelper.getGenericRecord();
            paymentRecord.put("id", message.getId());
            paymentRecord.put("workflow", message.getWorkflow());
            paymentRecord.put("workflowStep", message.getWorkflowStep());
            paymentRecord.put("securityClaims", message.getSecurityClaims());
            paymentRecord.put("payload", message.getPayload());
            paymentRecord.put("headers", message.getHeaders());
            paymentRecord.put("status", message.getStatus());
            avroHelper.writeToOutputStream(paymentRecord, documentContext.wire().bytes().outputStream());
            return messageAppender.lastIndexAppended();
        }
    }
}
After configuring that appender we ran a loop to produce 100_000 messages to the chronicle queue. Every message has the same size, and the final size of the file was 621 MB. It took 22 minutes, 20 seconds and 613 milliseconds (~1340 seconds) to write all the messages, so an average of about 75 messages/second.
This was definitely not what we hoped for, and so far from the latencies advertised in the Chronicle documentation that it made me believe my approach was not the correct one. I admit that our messages are not small, at about 6.36 KB/message, but I have no doubt storing them in a database would be faster, so I still think I am not doing it right.
It is important that our messages are processed one by one.
Thank you in advance for your inputs and/or suggestions.
Hand building the Avro object each time seems a bit of a code smell to me.
Can you create a predefined message -> avro serializer and use that to feed the queue?
Or, just for testing, create one avro object outside the loop and feed that one object into the queue many times. That way you can see if it is the building or the queuing which is the bottleneck.
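For instance, a minimal benchmark sketch along those lines, reusing the appender and AvroHelper calls from the question (the queue path and the 100_000 count are just for the test): the record is built once outside the loop, so the timing isolates the queue-append cost from the record-building cost.
var appender = SingleChronicleQueueBuilder.binary("/tmp/messages-bench").build().acquireAppender();
var avroHelper = new AvroHelper();
var genericRecord = avroHelper.getGenericRecord();
// ... populate genericRecord once, outside the loop, exactly as in append() ...
long start = System.nanoTime();
for (int i = 0; i < 100_000; i++) {
    try (var documentContext = appender.writingDocument()) {
        avroHelper.writeToOutputStream(genericRecord, documentContext.wire().bytes().outputStream());
    }
}
System.out.printf("Appended 100_000 records in %d ms%n", (System.nanoTime() - start) / 1_000_000);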
More general advice:
Maybe attach a profiler and see if you are making an excessive number of object allocations, which is particularly bad if they are getting promoted to higher generations.
See if they are your objects or Chronicle Queue ones.
Is your code maxing out your ram or cpu (or network)?
We have messages which are dependent. For example, say we have 4 messages: M1, M2, M1_update1 (should be processed only after M1 is processed), and M3 (should be processed only after M1 and M2 are processed).
In this example, only M1 and M2 can be processed in parallel; the others have to be sequential. I know messages in one partition of a Kafka topic are processed sequentially. But how do I know that M1 and M2 are processed and now is the time to push the M1_update1 and M3 messages to the topic? Is Kafka the right choice for this kind of use-case? Any insight is appreciated!
Kafka is used as a pub-sub messaging system which is highly scalable and fault tolerant.
I believe using Kafka alone when your messages are interdependent could be a bad choice. The processing you require is condition-based; you probably need a routing engine such as Camel or Drools to achieve the end result.
You're basically describing a message queue that guarantees ordering. Kafka, by design, only guarantees ordering within a single partition, which is the case you mention where the topic has one partition. In that case, though, you're not taking full advantage of Kafka's ability to maximize throughput by parallelizing data across partitions.
As far as messages being dependent on each other, that would require a logic layer that core Kafka itself doesn't provide. If I understand it correctly, and the processing happens after the message is consumed from Kafka, you would need some sort of notification on the consumer end, which would receive and process M1 and M2 and somehow notify the producer on the other side it's now ok to send M1_update and M3. This is definitely outside the scope of what core Kafka provides. You could still use Kafka to build something like this, but there's probably other solutions that would work better for you.
[Attention] The question is Lagom framework specific!
In my current project we have observed messages being cut from the list sent from a Source to the Kafka topic publisher when the upstream is fast and the downstream cannot handle all messages in time. As far as we can tell, the cutting is related to the behavior of the PubSubRef.subscribe() method https://github.com/lagom/lagom/blob/master/pubsub/javadsl/src/main/scala/com/lightbend/lagom/javadsl/pubsub/PubSubRef.scala#L85
The full method definition:
def subscriber(): Source[T, NotUsed] = {
  scaladsl.Source.actorRef[T](bufferSize, OverflowStrategy.dropHead)
    .mapMaterializedValue { ref =>
      mediator ! Subscribe(topic.name, ref)
      NotUsed
    }.asJava
}
OverflowStrategy.dropHead is used here. Can it be changed to use a back-pressure strategy?
UPD#1:
The use case is pretty simple: when a query request is published to the command topic, we consume it, query objects from a DB table, and push the resulting list to a result Kafka topic. Code snippet:
objectsResultTopic = pubSub.refFor(TopicId.of(CustomObject.class, OBJECTS_RESULT_TOPIC));

objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending object {}", object)))
            .to(objectsResultTopic.publisher()),
        Source.repeat(Done.getInstance())
    )
);
When the stream of objects generated by the deserializeCommandAndQueryObjects function exceeds the default buffer-size = 1000, it starts cutting elements (in our case ~2.5k objects).
UPD#2:
The source of objects data is:
// returns CompletionStage<Source<CustomObject, ?>>
jdbcSession.withConnection(
    connection -> Source.from(runQuery(connection, rowConverter))
)
And there's a subscription to Kafka objectsResultTopic:
TopicProducer.singleStreamWithOffset(
    offset -> objectsResultTopic.subscriber().map(gm -> {
        JsonNode node = mapper.convertValue(gm, JsonNode.class);
        return Pair.create(node, offset);
    }));
It sounds like Lagom's distributed publish-subscribe feature may not be the best tool for the job you have.
Your question mentions Kafka, but this feature does not make use of Kafka. Instead, it works by directly broadcasting messages to all subscribers in the cluster. This is an "at most once" messaging transport that may indeed lose messages, and is intended for consumers who care more about keeping up with recent messages than processing every single one. The overflow strategy is not customizable, and you would not want to use back-pressure in these use cases, as it would mean that one slow consumer could slow down delivery to all of the other subscribers.
There are a few other options that you have:
If you do want to use Kafka, you should use Lagom's message broker API. This supports "at least once" delivery semantics, and can be used to ensure that each consumer processes every message (at the cost of possibly increasing latency).
In this case, Kafka acts as a giant durable buffer, so it's even better than back-pressure: the producer and consumer can proceed at different paces, and (when used with partitioning) you can add consumers in order to scale out and process messages more quickly when needed.
The message broker API can be used when producers and consumers are all in the same service, but it is particularly suitable for communication between services.
If the messages you are sending are persistent entity events, and the consumers are part of the same service, then a persistent read-side processor might be a good option.
This also provides "at least once" delivery, and if the only effects of processing messages are database updates, then the built-in support for Cassandra read-side databases and relational read-side databases provide "effectively once" semantics, where the database updates are run transactionally to ensure that failures that occur during event processing cannot result in partial updates.
If the messages you are sending are persistent entity events, the consumers are part of the same service, but you want to process the events as a stream, you can access a raw stream of events.
If your use case does not fit into one of the use cases that Lagom supports explicitly, you can use lower-level Akka APIs, including distributed publish-subscribe, to implement something more tailored to your needs.
The best choice will depend on the specifics of your use case: the source of the messages and the types of consumers you want. If you update your question with more details and add a comment to this answer, I can edit the answer with more specific suggestions.
If someone is interested, we finally solved the problem by using the Akka Streams Kafka Producer API, like:
ProducerSettings<String, CustomObject> producerSettings =
    ProducerSettings.create(system, new StringSerializer(), new CustomObjectSerializer());

objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending event {}", object)))
            .map(object -> new ProducerRecord<String, CustomObject>(OBJECTS_RESULT_TOPIC, object))
            .to(Producer.plainSink(producerSettings)),
        Source.repeat(Done.getInstance())));
It works without buffering; it just pushes directly into the Kafka topic.