[Attention] This question is specific to the Lagom framework!
In my current project, we have observed messages being dropped between a Source and the Kafka topic publisher when the upstream is fast and the downstream apparently cannot handle all messages in time. As it turned out, the dropping is related to the behavior of the PubSubRef.subscriber() method https://github.com/lagom/lagom/blob/master/pubsub/javadsl/src/main/scala/com/lightbend/lagom/javadsl/pubsub/PubSubRef.scala#L85
The full method definition:
def subscriber(): Source[T, NotUsed] = {
  scaladsl.Source.actorRef[T](bufferSize, OverflowStrategy.dropHead)
    .mapMaterializedValue { ref =>
      mediator ! Subscribe(topic.name, ref)
      NotUsed
    }.asJava
}
OverflowStrategy.dropHead is used there. Can it be changed to use a back-pressure strategy instead?
UPD#1:
The use case is pretty simple: when a query request is published to the command topic, we pick it up, query objects from a DB table, and push the resulting list into the result Kafka topic. Code snippet:
objectsResultTopic = pubSub.refFor(TopicId.of(CustomObject.class, OBJECTS_RESULT_TOPIC));

objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending object {}", object)))
            .to(objectsResultTopic.publisher()),
        Source.repeat(Done.getInstance())
    )
);
When the stream of objects produced by the deserializeCommandAndQueryObjects function contains more elements than the default buffer-size = 1000, it starts dropping elements (in our case, ~2.5k objects).
UPD#2:
The source of objects data is:
// returns CompletionStage<Source<CustomObject, ?>>
jdbcSession.withConnection(
    connection -> Source.from(runQuery(connection, rowConverter))
)
And there's a subscription to Kafka objectsResultTopic:
TopicProducer.singleStreamWithOffset(
    offset -> objectsResultTopic.subscriber().map(gm -> {
        JsonNode node = mapper.convertValue(gm, JsonNode.class);
        return Pair.create(node, offset);
    }));
It sounds like Lagom's distributed publish-subscribe feature may not be the best tool for the job you have.
Your question mentions Kafka, but this feature does not make use of Kafka. Instead, it works by directly broadcasting messages to all subscribers in the cluster. This is an "at most once" messaging transport that may indeed lose messages, and is intended for consumers who care more about keeping up with recent messages than processing every single one. The overflow strategy is not customizable, and you would not want to use back-pressure in these use cases, as it would mean that one slow consumer could slow down delivery to all of the other subscribers.
There are a few other options that you have:
If you do want to use Kafka, you should use Lagom's message broker API. This supports "at least once" delivery semantics, and can be used to ensure that each consumer processes every message (at the cost of possibly increasing latency).
In this case, Kafka acts as a giant durable buffer, so it's even better than back-pressure: the producer and consumer can proceed at different paces, and (when used with partitioning) you can add consumers in order to scale out and process messages more quickly when needed.
The message broker API can be used when producers and consumers are all in the same service, but it is particularly suitable for communication between services.
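For reference, a minimal sketch of what that looks like with Lagom's Java message broker API (the service and topic names here are illustrative, not taken from the project above): the producing service declares the topic in its descriptor and feeds it with TopicProducer, and a consuming service subscribes with at-least-once delivery.

import static com.lightbend.lagom.javadsl.api.Service.named;
import static com.lightbend.lagom.javadsl.api.Service.topic;

import com.lightbend.lagom.javadsl.api.Descriptor;
import com.lightbend.lagom.javadsl.api.Service;
import com.lightbend.lagom.javadsl.api.broker.Topic;

public interface ObjectsService extends Service {

    // Kafka-backed topic carrying the query results.
    Topic<CustomObject> objectsResultTopic();

    @Override
    default Descriptor descriptor() {
        return named("objects-service")
            .withTopics(topic("objects-result", this::objectsResultTopic));
    }
}

A consumer in another service would then process each message with at-least-once semantics, for example:

objectsService.objectsResultTopic()
    .subscribe()
    .atLeastOnce(Flow.fromFunction(message -> {
        // handle the message here
        return Done.getInstance();
    }));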
If the messages you are sending are persistent entity events, and the consumers are part of the same service, then a persistent read-side processor might be a good option.
This also provides "at least once" delivery, and if the only effects of processing messages are database updates, then the built-in support for Cassandra read-side databases and relational read-side databases provide "effectively once" semantics, where the database updates are run transactionally to ensure that failures that occur during event processing cannot result in partial updates.
If the messages you are sending are persistent entity events, the consumers are part of the same service, but you want to process the events as a stream, you can access a raw stream of events.
If your use case does not fit into one of the use cases that Lagom supports explicitly, you can use lower-level Akka APIs, including distributed publish-subscribe, to implement something more tailored to your needs.
The best choice will depend on the specifics of your use case: the source of the messages and the types of consumers you want. If you update your question with more details and add a comment to this answer, I can edit the answer with more specific suggestions.
In case someone is interested: we finally solved the problem by using the Akka Streams Kafka producer API (Producer.plainSink), like this:
ProducerSettings<String, CustomObject> producerSettings =
    ProducerSettings.create(system, new StringSerializer(), new CustomObjectSerializer());

objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending event {}", object)))
            .map(object -> new ProducerRecord<String, CustomObject>(OBJECTS_RESULT_TOPIC, object))
            .to(Producer.plainSink(producerSettings)),
        Source.repeat(Done.getInstance())));
It works without intermediate buffering and just pushes the records straight into the Kafka topic.
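One detail worth noting (not shown in the snippet above): ProducerSettings created from the ActorSystem picks up its defaults from the akka.kafka.producer configuration section, and the Kafka bootstrap servers can also be set programmatically if they are not configured there; the address below is just a placeholder.

ProducerSettings<String, CustomObject> producerSettings =
    ProducerSettings.create(system, new StringSerializer(), new CustomObjectSerializer())
        .withBootstrapServers("localhost:9092"); // placeholder, use your broker addresses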
I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka consumer gets into a timeout while processing the records? I mean the timeout controlled by the session.timeout.ms property.
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long in a comment.
Kafka has a few ways that can be used to process messages:
At most once;
At least once; and
Exactly once.
You are describing that you would like Kafka to give you exactly-once semantics (which, by the way, is the least common way of using Kafka). Producers also need to play nicely, as by default Kafka can produce the same message more than once.
It's a lot more common to build services that use the at-least-once mechanism: you can receive (or process) the same message more than once, but you need a way to deduplicate (it's the same idea behind idempotency in HTTP APIs). You'll need something in the message that is unique, and a register of the IDs that have already been processed. If the payload has nothing you can use to deduplicate, you can add a header to the message and use that.
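As a rough illustration of that idea (not a production implementation): the consumer reads a unique id from each record, skips records whose id it has already seen, and remembers the id once the side effect is done. The in-memory set below stands in for a durable store such as a database table, and the message-id header name is an assumption.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class DeduplicatingLoop {

    // Stand-in for a durable store (e.g. a DB table keyed by message id).
    private final Set<String> processedIds = new HashSet<>();

    public void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                Header idHeader = record.headers().lastHeader("message-id"); // assumed header name
                if (idHeader == null) {
                    continue; // or derive the unique id from the key/payload instead
                }
                String id = new String(idHeader.value(), StandardCharsets.UTF_8);
                if (processedIds.contains(id)) {
                    continue; // duplicate delivery, already handled
                }
                processMessage(record);  // the actual side effect
                processedIds.add(id);    // remember the id so redeliveries are ignored
            }
            consumer.commitSync();       // commit only after the whole batch has been handled
        }
    }

    private void processMessage(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}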
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
I would suggest googling a bit for details on how to implement the above.
Here's a blog post from Confluent about developing exactly-once semantics, "Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka", and the Kafka docs explaining the different semantics.
About the point of the ConsumerRebalanceListener, you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records, but not committed them yet to Kafka.
A mini tip I give to everyone who is starting with Kafka: Kafka looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works, including having done a good amount of negative testing (unless you are ok with losing data).
I am creating two Apache Camel (Blueprint XML) Kafka projects: one is kafka-producer, which accepts requests and stores them in the Kafka server, and the other is kafka-consumer, which picks up messages from the Kafka server and processes them.
This setup is working fine for a single topic and a single consumer. However, how do I create separate consumer groups within the same Kafka topic? How do I route consumer-specific messages within the same topic to different consumer groups? Any help is appreciated. Thank you.
Your question is quite general, as it's not very clear what problem you are trying to solve; therefore it's hard to tell whether there's a better way to implement the solution.
Anyway, let's start by saying that, as far as I can understand, you are looking for a Selective Consumer (EIP), which is something that's not supported out of the box by Kafka and its Consumer API. A Selective Consumer can choose which messages to pick from the queue or topic based on specific selector values that are set in advance by the producer. This feature must also be implemented in the message broker, but Kafka has no such capability.
Kafka does implement a hybrid solution between pure pub/sub and a queue. That being said, what you can do is subscribe to the topic with one or more consumer groups (more on that later) and filter out all messages you're not interested in by inspecting the messages themselves. In the messaging and EIP world, this pattern is known as an Array of Filters. As you can imagine, this happens after the message has been broadcast to all subscribers; therefore, if that solution does not fit your requirements or context, you can think of implementing a Content-Based Router, which is intended to dispatch the message to a subset of consumers only, under your centralized control (this would imply intermediate consumer-specific channels, which could be other Kafka topics or seda/VM queues, of course).
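As a rough sketch of both ideas with the Camel Java DSL (the recipient header and the endpoint URIs are assumptions, not something Kafka or Camel define for you): the first route filters out everything not addressed to it, while the second acts as a content-based router dispatching to consumer-specific channels.

import org.apache.camel.builder.RouteBuilder;

public class FilteringRoutes extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        // Array of Filters: this group sees every message but keeps only its own.
        from("kafka:myTopic?brokers={{kafkaBootstrapServers}}&groupId=serviceAGroup")
            .filter(header("recipient").isEqualTo("serviceA")) // header set by the producer, by convention
            .log("serviceA handling : ${body}");

        // Content-Based Router: one group dispatches to consumer-specific channels.
        from("kafka:myTopic?brokers={{kafkaBootstrapServers}}&groupId=routerGroup")
            .choice()
                .when(header("recipient").isEqualTo("serviceA")).to("seda:serviceA")
                .when(header("recipient").isEqualTo("serviceB")).to("seda:serviceB")
                .otherwise().to("seda:unrouted");
    }
}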
Moving to the second question, here is the official Kafka Component website: https://camel.apache.org/components/latest/kafka-component.html.
In order to create different consumer groups, you just have to define multiple routes, each of them having a dedicated groupId. By adding the groupId property, you inform the Consumer Group coordinators (which reside in the Kafka brokers) about the existence of multiple separate groups of consumers, and the brokers will use those in order to discriminate and treat them separately (by sending each group a copy of each log message stored in the topic)...
Here is an example:
public void configure() throws Exception {
    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
         "&groupId=myFirstConsumerGroup")
        .log("Message received by myFirstConsumerGroup : ${body}");

    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
         "&groupId=mySecondConsumerGroup")
        .log("Message received by mySecondConsumerGroup : ${body}");
}
As you can see, I created two routes in the same RouteBuilder, not to mention in the same Java process. That's a bad design decision in most of the use cases I can think of, because there is no single responsibility, concerns are not segregated, and it will not scale. But again, it depends on your requirements/context.
For completeness, please consider taking a look at all the other Kafka component properties, as there may be many other configurations of interest to you, such as the number of consumer threads per group.
I tried to stay high level, in order to initiate the discussion... I'll edit my answer in case of new updates from you. Hope I helped!
We have messages which are dependent. E.g., say we have 4 messages: M1, M2, M1_update1 (should be processed only after M1 is processed), and M3 (should be processed only after M1 and M2 are processed).
In this example, only M1 and M2 can be processed in parallel; the others have to be sequential. I know messages in one partition of a Kafka topic are processed sequentially. But how do I know that M1 and M2 have been processed, and that it is now time to push the M1_update1 and M3 messages to the topic? Is Kafka the right choice for this kind of use case? Any insights are appreciated!
Kafka is used as a pub-sub messaging system which is highly scalable and fault tolerant.
I believe using Kafka alone when your messages are interdependent could be a bad choice. The processing you require is condition-based; you probably need a routing engine such as Camel or Drools to achieve the end result.
You're basically describing a message queue that guarantees ordering. Kafka, by design, does not guarantee ordering, except in the case you mention, where the topic has a single partition. In that case, though, you're not taking full advantage of Kafka's ability to maximize throughput by parallelizing data in partitions.
As far as messages being dependent on each other, that would require a logic layer that core Kafka itself doesn't provide. If I understand it correctly, and the processing happens after the message is consumed from Kafka, you would need some sort of notification on the consumer end, which would receive and process M1 and M2 and somehow notify the producer on the other side it's now ok to send M1_update and M3. This is definitely outside the scope of what core Kafka provides. You could still use Kafka to build something like this, but there's probably other solutions that would work better for you.
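For illustration only (the status topic, keys, and payloads below are made up, and this is just one way such a notification could be wired with plain Kafka clients): the processing side publishes a small "processed" marker after it finishes a message, and the sending side waits for the markers it depends on before producing the follow-up messages.

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DependentMessageSender {

    // Sending side: wait until both M1 and M2 are marked processed, then send the dependents.
    public static void sendWhenReady(KafkaConsumer<String, String> statusConsumer,
                                     KafkaProducer<String, String> producer) {
        statusConsumer.subscribe(Collections.singletonList("processing-status")); // assumed topic
        Set<String> done = new HashSet<>();
        while (!done.containsAll(Set.of("M1", "M2"))) {
            for (ConsumerRecord<String, String> record : statusConsumer.poll(Duration.ofMillis(200))) {
                done.add(record.key()); // by convention, key = id of the processed message
            }
        }
        producer.send(new ProducerRecord<>("commands", "M1", "M1_update1 payload"));
        producer.send(new ProducerRecord<>("commands", "M3", "M3 payload"));
    }

    // Processing side: after handling a message, publish its id as a processed marker.
    public static void markProcessed(KafkaProducer<String, String> producer, String messageId) {
        producer.send(new ProducerRecord<>("processing-status", messageId, "done"));
    }
}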
Let me first describe my use-case.
I have topics [ T1 ... Tn ] to which the Kafka consumer(s) need to subscribe. For each topic, all the data passing through it is logically similar. Let's assume data in different topics don't have any correlation. But once consumed, all the data, irrespective of its topic, receives the same treatment, i.e. it is fed to Elasticsearch using the bulkprocessor API. ES is set up as a multi-node cluster.
The Kafka consumer javadoc mentions two different multithreading approaches. I'm leaning towards the first, i.e. the one-consumer-per-thread model. Assuming #partitions per topic = p, I'll have p consumer threads for each topic, so in total there will be n*p threads. If I have an independent bulkprocessor attached to each of these threads, then I can choose to control the committed position manually. That will save me from data loss in case a bulkprocessor fails. But the downside is that the number of bulkprocessors might become too high, and that might slow down Elasticsearch ingestion.
The other approach I'm thinking of is to have only one thread per topic, so each thread listens to p partitions and writes to one bulkprocessor. In that case I have to use auto-commit for offsets, and I might lose data on a bulkprocessor failure.
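To make the first approach concrete, here is a stripped-down sketch of one such consumer thread (indexBatch stands in for handing records to this thread's bulkprocessor, and the manual commit happens only after that hand-off; names are illustrative):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicConsumerThread implements Runnable {

    private final String topic;
    private final Properties consumerProps; // enable.auto.commit=false for manual commits

    public TopicConsumerThread(String topic, Properties consumerProps) {
        this.topic = topic;
        this.consumerProps = consumerProps;
    }

    @Override
    public void run() {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                if (!records.isEmpty()) {
                    indexBatch(records);   // hand the batch to this thread's bulkprocessor
                    consumer.commitSync(); // commit only after the batch has been handed off
                }
            }
        }
    }

    private void indexBatch(ConsumerRecords<String, String> records) {
        for (ConsumerRecord<String, String> record : records) {
            // feed record.value() to the Elasticsearch bulkprocessor here
        }
    }
}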
I'd like to know which approach is better, or is there a 3rd approach, better than both of these?
Kafka v0.9.0.x and ES v2.3.x
I am trying to figure out if I can switch from a blocking scenario to a more reactive pattern.
I have incoming update commands arriving in a queue, and I need to handle them in order, but only those regarding the same entity. In essence, I can create as many parallel streams of update events as I wish, as long as no two streams contain events regarding the same entity.
I was thinking that the consumer of the primary queue would possibly be able to leverage amqp's routing mechanisms, and temporary queues, by creating temporary queues for each entity id, and hooking a consumer to them. Once the subscriber is finished and no other events regarding the entity in question are currently in the queue, the queue could be disposed of.
Is this scenario something that is used regularly? Is there a better way to achieve this? In our current system we use a named lock based on the id to prevent concurrent updates.
There are at least 2 options:
1. A single queue for each entity, and n consumers on each entity queue.
2. One queue with messages for all entities, where each message contains data about which entity it is for. You could then split this up into several queues (one AMQP queue per type of entity) or use a BlockingQueue implementation.
Benefits of splitting up the entities into AMQP queues:
You could create an HA setup with RabbitMQ
You could route messages
You could maybe have more than one consumer of an entity queue if it is necessary someday (scalability)
Messages could be persistent and therefore recoverable after an application crash
Benefits of using an internal BlockingQueue implementation:
It is faster (no network I/O, obviously)
Everything has to happen in one JVM
Anyway it does depend on what you want since both ways could have their benefits.
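To make the per-entity-queue option a bit more concrete, here is a minimal sketch with the RabbitMQ Java client (the exchange name, queue naming scheme, and entity id are made up): a direct exchange routes each update to a per-entity queue keyed by the entity id, so updates for one entity stay ordered on one queue while different entities are consumed in parallel.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class PerEntityRouting {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location

        try (Connection connection = factory.newConnection()) {
            Channel channel = connection.createChannel();

            // One direct exchange; the routing key is the entity id.
            channel.exchangeDeclare("entity-updates", "direct", true);

            String entityId = "entity-42"; // hypothetical entity id

            // One durable queue per entity, bound by the entity id.
            channel.queueDeclare("updates." + entityId, true, false, false, null);
            channel.queueBind("updates." + entityId, "entity-updates", entityId);

            // Producer side: publish an update for this entity.
            channel.basicPublish("entity-updates", entityId, null,
                    "update payload".getBytes(StandardCharsets.UTF_8));

            // Consumer side: a single consumer per entity queue keeps updates in order.
            DeliverCallback handler = (consumerTag, delivery) ->
                    System.out.println(entityId + " <- "
                            + new String(delivery.getBody(), StandardCharsets.UTF_8));
            channel.basicConsume("updates." + entityId, true, handler, consumerTag -> { });

            Thread.sleep(2000); // keep the demo alive briefly so the consumer can receive
        }
    }
}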
UPDATE:
I am not sure if I understood you correctly, but let me give you some resources so you can try some things out.
There are special RabbitMQ extensions; maybe some of them can give you an idea. Take a look at alternate exchanges and exchange-to-exchange bindings.
Also, for basic testing (I am not sure if it covers all RabbitMQ features, or all AMQP features at all), this can sometimes be useful. Keep in mind that the routing key in this visualization is the producer name; you can also find some examples there, and import and export your configuration.