Scaling pattern-matched Kafka consumers - java

I have a scenario where there are multiple Kafka topics (single partition each) and a single consumer group to consume the records. I use a single pattern-matched consumer in the consumer group that matches all the topics and hence consumes all the records in all the topics.
I now want to scale this up and have multiple consumers (in the same consumer group) listening to all the topics. However, this does not seem to be working: all the records are consumed only by the first consumer in the group, rendering the other consumers in the group useless. Also, I am running the consumers as separate threads using an ExecutorService.
How can I achieve this?
Below is my code:
Pattern pattern = Pattern.compile(topicPattern);
consumer.subscribe(pattern);
The pattern sent in the code above matches the names of all the topics; e.g., if the topic names are sample_topic_1, sample_topic_2, etc., the pattern is sample_topic_*$.

The approach you describe should work with the code you posted. However, this might be a case where there's not enough data for more than one consumer, or the data arrives in "bursts" small enough to fit in a single batch.
Even though load in Kafka is theoretically distributed across all consumers of the same consumer group, in practice, if there is only enough data for one "batch", the first consumer can grab all of it and there will be nothing left for anybody else. This means either:
You're not sending enough data for it to be distributed across all consumers (try sending lots more data to validate this), or
You have a weird configuration where your configured batches are gigantic, and/or the linger.ms property is set very high, or
A combination of the two above.
I suggest sending more data first and seeing if that fixes the issue. If not, scale back to a single consumer and validate that it still works; then add just one more consumer to the group and see if the behavior changes.
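To see which consumer actually owns which topic-partitions at any given moment, you can pass a ConsumerRebalanceListener when subscribing and log every rebalance. A minimal sketch, assuming the same consumer and topicPattern as in your code:

import java.util.Collection;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

// Log the assignment on every rebalance so each consumer thread
// prints which topic-partitions it currently owns.
consumer.subscribe(Pattern.compile(topicPattern), new ConsumerRebalanceListener() {
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        System.out.println(Thread.currentThread().getName() + " assigned: " + partitions);
    }
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        System.out.println(Thread.currentThread().getName() + " revoked: " + partitions);
    }
});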

Advantage of "forcing" the partition

I have a topic theimportanttopic with three partitions.
Is there an advantage to "forcing" the partition assignment?
For instance, I have three consumers in one group. I want consumer1 to always and only consume from partition-0, consumer2 to always and only consume from partition-1 and consumer3 to always and only consume from partition-2.
A consumer should not touch any partition other than the one it was assigned, at any point in time.
A drawback I can think of is that when one of the consumers goes down, no one is consuming from its partition.
Let's suppose a fancy self-healing architecture is in place that can bring any lost consumer back very efficiently.
Would it be an advantage, knowing there won't be any partition reassignment cost to the healthy consumers? The healthy consumers can focus on their own partition, etc.
Are there any other pros and cons?
https://docs.spring.io/spring-kafka/reference/html/#tip-assign-all-parts
https://docs.spring.io/spring-kafka/reference/html/#manual-assignment
It seems the API allows forcing the partition assignment; I was wondering if this use case was one of the purposes of this design.
How do you know the "number" of each consumer? Based on your last questions, you're either using Kubernetes or setting concurrency in Spring Kafka. In either case, pods/threads of the same executable application rebalance across partitions... Therefore, you cannot scale them and pin them to specific partitions without extra external locking logic.
In my opinion, all executable consumers should be equally able to handle any partition.
Plus, as you pointed out, there's downtime if one stops.
But the use case here is to exactly mirror the producer: you've produced data with some custom partitioner logic, so you need specific consumers to read only that subset of the data.
Also, manual assignment doesn't use consumer groups, so while there would be no rebalancing, it becomes impossible to monitor lag using tools like Burrow or the consumer-groups CLI. Lag would need to be gathered directly from the consumer metrics themselves.
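For reference, this is what pinning a consumer to a single partition looks like with the plain consumer API; a minimal sketch using the topic from the question:

import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

// Pin this consumer to partition 0 only; Kafka will never reassign it,
// but nothing consumes the partition if this process dies.
consumer.assign(Collections.singletonList(new TopicPartition("theimportanttopic", 0)));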

Subscribe to every topic partition

What is the canonical way to subscribe multiple times to a given Kafka topic so that each KafkaConsumer receives every message from every partition?
What I am doing at the moment is generating a random UUID group.id so that each subscription is a new group, but given that these subscriptions are short-lived (and there are many of them), the overhead of Kafka storing metadata about them might be detrimental.
What is the correct way to achieve this?
I believe the answer to this question is to use the assign() method rather than subscribe().
Manual topic assignment through this method does not use the
consumer's group management functionality.
Reference: https://kafka.apache.org/26/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
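A sketch of that approach: look up all partitions of the topic and assign them directly, with no group coordination involved (the topic name is a placeholder):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.common.TopicPartition;

// Fetch the topic's partitions and assign all of them to this consumer;
// no group.id coordination or rebalancing takes place.
List<TopicPartition> partitions = consumer.partitionsFor("myTopic").stream()
        .map(info -> new TopicPartition(info.topic(), info.partition()))
        .collect(Collectors.toList());
consumer.assign(partitions);
consumer.seekToBeginning(partitions);  // optional: read every message from the start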
Well, having unique consumer groups is the way to ensure that the consumer(s) running inside a group subscribe to all partitions and receive all the messages. That is the purpose of multiple consumer groups subscribing to the same topic.
I agree that this requires you to create multiple consumer groups, which adds metadata overhead. But it all depends on your use case whether you want a single group or multiple ones.
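If you do stay with subscribe(), a throwaway group per subscription looks roughly like this (all property values are illustrative):

import java.util.Properties;
import java.util.UUID;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");      // placeholder
props.put("group.id", "reader-" + UUID.randomUUID());  // unique group per subscription
props.put("auto.offset.reset", "earliest");            // start from the beginning of each partition
props.put("enable.auto.commit", "false");              // no point committing offsets for a throwaway group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");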

Apache Camel - Kafka component - single producer multiple consumer

I am creating two Apache Camel (Blueprint XML) Kafka projects: one is kafka-producer, which accepts requests and stores them in the Kafka server, and the other is kafka-consumer, which picks up messages from the Kafka server and processes them.
This setup is working fine for a single topic and a single consumer. However, how do I create separate consumer groups within the same Kafka topic? And how do I route consumer-specific messages within the same topic to different consumer groups? Any help is appreciated. Thank you.
Your question is quite general, and it's not very clear what problem you are trying to solve, so it's hard to tell whether there's a better way to implement the solution.
Anyway, let's start by saying that, as far as I can understand, you are looking for a Selective Consumer (EIP), which is something not supported out of the box by Kafka and its Consumer API. A Selective Consumer can choose which messages to pick from a queue or topic based on specific selector values set in advance by the producer. This feature must be implemented in the message broker as well, but Kafka has no such capability.
Kafka implements a hybrid between pure pub/sub and a queue. That being said, what you can do is subscribe to the topic with one or more consumer groups (more on that later) and filter out all the messages you're not interested in by inspecting the messages themselves. In the messaging and EIP world, this pattern is known as an Array of Filters. As you can imagine, this happens after the message has been broadcast to all subscribers; if that solution does not fit your requirements or context, you can instead implement a Content-Based Router, which dispatches each message to a subset of consumers under your centralized control (this implies intermediate consumer-specific channels, which could be other Kafka topics or seda/VM queues, of course).
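For example, here is a minimal Camel sketch of the filtering approach; the "recipient" header and the target endpoint are hypothetical and would be set by your producer:

from("kafka:myTopic?brokers={{kafkaBootstrapServers}}&groupId=filteringConsumerGroup")
    // Drop everything that is not addressed to this consumer
    // (assumes the producer sets a "recipient" header on each message).
    .filter(header("recipient").isEqualTo("service-A"))
    .to("direct:processServiceAMessages");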
Moving to the second question, here is the official Kafka Component website: https://camel.apache.org/components/latest/kafka-component.html.
In order to create different consumer groups, you just have to define multiple routes, each with a dedicated groupId. By adding the groupId property, you inform the consumer group coordinators (which reside in the Kafka brokers) of the existence of multiple separate groups of consumers, and the brokers will treat them separately, sending each group a copy of every message stored in the topic...
Here is an example:
public void configure() throws Exception {
    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
         "&groupId=myFirstConsumerGroup")
        .log("Message received by myFirstConsumerGroup : ${body}");

    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
         "&groupId=mySecondConsumerGroup")
        .log("Message received by mySecondConsumerGroup : ${body}");
}
As you can see, I created two routes in the same RouteBuilder, not to say in the same Java process. That's a very bad design decision in most of the use cases I can think of, because there is no single responsibility, concerns are not segregated, and it will not scale. But again, it depends on your requirements/context.
For completeness, please consider taking a look at all the other Kafka component properties, as there may be other configurations of interest, such as the number of consumer threads per group.
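For instance, if I read the component page correctly, the consumersCount option spins up several consumer threads inside a single route, all in the same group:

from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
     "&groupId=myFirstConsumerGroup&consumersCount=3")  // 3 consumer threads in the same group
    .log("Message received by myFirstConsumerGroup : ${body}");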
I tried to stay high level, in order to initiate the discussion... I'll edit my answer in case of new updates from you. Hope I helped!

Apache Kafka Message broadcasting

I am studying Apache Kafka and have some confusion. Please help me understand the following scenario.
I have a topic with 5 partitions and 5 brokers in a Kafka cluster. I am maintaining my message order in partition 1 (say P1). I want to broadcast the messages of P1 to 10 consumers.
So my question is: how do these 10 consumers interact with topic partition P1?
This is probably not how you want to use Kafka.
Unless you're being explicit with how you set your keys, you can't really control which partition your messages end up in when producing to a topic. Partitions in Kafka are designed to be more like low-level plumbing, something that exists, but you don't usually have to interact with. On the consumer side, you will be assigned partitions based on how many consumers are active for a particular consumer group at any one time.
One way to get around this is to define a topic to have only a single partition, in which case, of course, all messages will go to that partition. This is not ideal, since Kafka won't be able to parallelize data ingestion or serving, but it is possible.
So, having said that, let's assume that you did manage to put all your messages in partition 1 of a specific topic. When you fire up a consumer of that topic with consumer group id of consumer1, it will be assigned all the partitions for that topic, since that consumer is the only active one for that particular group id. If there is only one partition for that topic, like explained above, then that consumer will get all the data. If you then fire up a second consumer with the same group id, Kafka will notice there's a second consumer for that specific group id, but since there's only one partition, it can't assign any partitions to it, so that consumer will never get any data.
On the other hand, if you fire up a third consumer with a different consumer group id, say consumer2, that consumer will now get all the data, and it won't interfere at all with consumer1 message consumption, since Kafka keeps track of their consuming offsets separately. Kafka keeps track of which offset each particular ConsumerGroupId is at on each partition, so it won't get confused if one of them starts consuming slowly or stops for a while and restarts consuming later that day.
Much more detailed information on how Kafka works is here: https://kafka.apache.org/documentation/#gettingStarted
And more information on how to use the Kafka consumer is at this link:
https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
@mjuarez's answer is absolutely correct; just for brevity I would reduce it to the following:
Don't try to read only from a single partition, because it's a low-level construct and it somewhat undermines the parallelism of Kafka. You're much better off just creating more topics if you need finer separation of data.
I would also add that most of the time a consumer needn't know which partition a message came from, in the same way that I don't eat a sandwich differently depending on which store it came from.
@mjuarez is actually not correct, and I am not sure why his answer is being confirmed by the OP. You can absolutely tell Kafka explicitly which partition a producer record pertains to, using the following constructor:
ProducerRecord(java.lang.String topic,
               java.lang.Integer partition,  // <--------- !!!
               java.lang.Long timestamp,
               K key,
               V value)
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html#ProducerRecord-java.lang.String-java.lang.Integer-java.lang.Long-K-V-
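For example, a minimal usage sketch (topic, key, and value are placeholders, and a KafkaProducer<String, String> named producer is assumed):

import org.apache.kafka.clients.producer.ProducerRecord;

// Send this record explicitly to partition 1 of the topic,
// bypassing the default key-based partitioner.
producer.send(new ProducerRecord<>("myTopic", 1, "some-key", "some-value"));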
So most of what was said after that becomes irrelevant.
Now, to address the OP's question directly: you want to accomplish a broadcast. To have a message sent once and read more than once, you need a different consumer group for each reader.
And that use case is an absolutely valid Kafka usage paradigm.
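A sketch of the broadcast setup: each of the 10 readers uses its own group.id, so Kafka tracks a separate offset per group and every reader independently sees every message (all config values are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Each reader runs this with its own distinct group.id ("reader-1" ... "reader-10").
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");  // placeholder
props.put("group.id", "reader-1");                 // unique per reader
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("myTopic"));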
You can accomplish that using RabbitMQ too:
https://www.rabbitmq.com/tutorials/tutorial-three-java.html
... but the way it is done is not ideal because multiple out-of-process queues are involved.

Order of messages from multiple topics Kafka

I'm developing software that uses Apache Kafka. I have one consumer that subscribes to multiple topics, and I'd like to know whether there is an order in which messages are received from those topics. I tried some combinations on my computer, but I need to be sure about this.
Example
Consumer sub to topic1 and topic2
Producer1 writes something on topic1
Producer2 writes something on topic2
Producer1 writes something on topic1
When the consumer polls, it receives a list of records containing first the messages from the first topic it subscribed to, and then the messages from the other topic.
I'd like to know if it is always like this, i.e. whether the messages arrive in the same order as the topics I subscribed to.
Thanks
[EDIT] I'd like to specify that I have two topics with one partition each, and only one producer and one consumer. I need to read all the messages from the first topic first, and then the messages from the other topic.
Kafka only guarantees message ordering inside a partition. This means that even with a single topic but more than one partition, you have no guarantee that messages are received in the same order they were sent.
Regarding your use case with two topics, there is no relation between the order in which you subscribe to the topics and message ordering, not least because if the cluster has more than one node, the topic partition leaders will be on different brokers and the client receives messages over different connections. By the way, even with a single broker hosting all the topics/partitions, you can't get the guarantee you are describing.
No. Message ordering is only preserved within partitions (not even within topics).
If you need stronger ordering guarantees, you have to re-arrange the messages in your application, for example using a timestamp (and a sufficiently large window buffer to catch all the ones that arrive out of order). Support for this has improved a bit with the recent addition of timestamps for all messages by Kafka itself, but the principle remains the same.
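A rough sketch of such a re-ordering buffer; the 5-second window and the println stand-in for real processing are assumptions:

import java.time.Duration;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Buffer records from all topics ordered by timestamp, and only release
// those older than the newest timestamp seen minus a re-ordering window.
PriorityQueue<ConsumerRecord<String, String>> buffer =
        new PriorityQueue<>(Comparator.comparingLong(ConsumerRecord::timestamp));
long windowMs = 5_000;  // assumed: larger than your worst out-of-order skew
long maxSeen = Long.MIN_VALUE;

while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
        buffer.add(rec);
        maxSeen = Math.max(maxSeen, rec.timestamp());
    }
    while (!buffer.isEmpty() && buffer.peek().timestamp() <= maxSeen - windowMs) {
        System.out.println(buffer.poll().value());  // stand-in for real processing
    }
}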
Why not first subscribe to the first topic and poll it until it is drained, and then subscribe to the other topic and poll again? Without this, I don't think there is any guarantee in which order you receive messages from the two topics.
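A sketch of that sequential approach; the drainTopic helper is illustrative and treats "finished" as "caught up to the end offsets snapshotted once the group assignment arrived":

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Consume a single topic until the consumer's position reaches the
// end offsets captured right after the assignment was received.
static void drainTopic(KafkaConsumer<String, String> consumer, String topic) {
    consumer.subscribe(Collections.singletonList(topic));
    Map<TopicPartition, Long> endOffsets = null;
    while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
            System.out.println(rec.topic() + " : " + rec.value());  // stand-in for real processing
        }
        if (endOffsets == null && !consumer.assignment().isEmpty()) {
            endOffsets = consumer.endOffsets(consumer.assignment());  // snapshot "the end" once
        }
        if (endOffsets != null && endOffsets.entrySet().stream()
                .allMatch(e -> consumer.position(e.getKey()) >= e.getValue())) {
            return;  // caught up; move on to the next topic
        }
    }
}

// drainTopic(consumer, "topic1");
// drainTopic(consumer, "topic2");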
