Kafka - Topics, Consumer Groups And Consumer Replicas - java

I'm new to Kafka and would like to know the best approach for configuring the topics, partitions, consumer groups and consumer app replicas.
Working on an existing setup, the configuration handed down is as follows:
10 topics
Each topic has its own group i.e. topic1-group1, topic2-group2 and so on.
Each topic has 5 partitions
The Java consumer app has 5 replicas (k8s pods), intentionally the same as the number of partitions, I'm told. They use Spring Kafka's @KafkaListener.
Q1. I'd like to know if this is the configuration that will offer the best performance (high throughput and low latency)?
The consumer app sends the messages to only ONE downstream system which makes me think that all the topics (and all their partitions + consumer app replicas) can share a single consumer group (let's call it main-group).
Q2. Would this offer better or worse performance than having dedicated group for each topic?
Sub question:
Q3. Can ONE @KafkaListener work with 10 topics, each with a dedicated consumer group, given that it has only 1 containerGroup and 1 containerFactory parameter?
Thanks

Related

Kafka Consumer RegExp: stale consumers

I subscribe to Kafka topics by RegExp. There are 5 topics matching the given pattern. Four of them have 10 partitions and the last topic has 12 partitions, so there are 52 partitions in total.
I have 5 running instances and each of them starts 10 consumers, so there are 50 consumers in total. I expect the load to be spread horizontally across all the consumers: each consumer reads 1 partition, except two of them that read 2, because the total number of consumers is less than the total partition count.
Though the reality is a bit different.
In short, here is what happens:
There are 50 consumers subscribing to multiple topics by RegExp. The matched topics have 52 partitions in total.
Each consumer tries to take 1 partition from each matched topic. Two more consumers get 1 partition each from the single topic that has 12 rather than 10 partitions.
12 consumers are working whilst 38 remain idle because there are no partitions left for them to read.
Is there any way to force a Kafka consumer to read at most 1 partition with a RegExp subscription? In that case, I could make all consumers start working. Or maybe there is a different approach that allows reading multiple topics while respecting the number of partitions and running consumers?
Well, it's about the partition.assignment.strategy Kafka property. The default value is RangeAssignor, which assigns partitions on a per-topic basis. With several topics, that spreads the load between consumers unevenly.
We set the property to RoundRobinAssignor and it helped.
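To see why the two strategies behave so differently here, the following is a simplified simulation, not Kafka's actual assignor code. It assumes every consumer subscribes to the same topics; the topic names (events-a to events-e) and the class name are made up for illustration, while the partition and consumer counts come from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AssignorSketch {

    // One (initially empty) partition list per consumer, indexed by consumer number.
    static List<List<String>> emptyAssignment(int consumers) {
        List<List<String>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        return result;
    }

    // Range-style: every topic is divided on its own, always starting from consumer 0.
    static List<List<String>> range(Map<String, Integer> topics, int consumers) {
        List<List<String>> result = emptyAssignment(consumers);
        for (Map.Entry<String, Integer> topic : topics.entrySet()) {
            int perConsumer = topic.getValue() / consumers;
            int extra = topic.getValue() % consumers;
            int nextPartition = 0;
            for (int c = 0; c < consumers; c++) {
                int count = perConsumer + (c < extra ? 1 : 0);
                for (int i = 0; i < count; i++) {
                    result.get(c).add(topic.getKey() + "-" + nextPartition++);
                }
            }
        }
        return result;
    }

    // Round-robin-style: all partitions of all topics are dealt out in one cycle.
    static List<List<String>> roundRobin(Map<String, Integer> topics, int consumers) {
        List<List<String>> result = emptyAssignment(consumers);
        int slot = 0;
        for (Map.Entry<String, Integer> topic : topics.entrySet()) {
            for (int p = 0; p < topic.getValue(); p++) {
                result.get(slot++ % consumers).add(topic.getKey() + "-" + p);
            }
        }
        return result;
    }

    // Number of consumers that received at least one partition.
    static long busyConsumers(List<List<String>> assignment) {
        return assignment.stream().filter(partitions -> !partitions.isEmpty()).count();
    }

    // Hypothetical topic names; 4 topics with 10 partitions and 1 with 12, as in the question.
    static Map<String, Integer> exampleTopics() {
        Map<String, Integer> topics = new LinkedHashMap<>();
        for (String name : List.of("events-a", "events-b", "events-c", "events-d")) {
            topics.put(name, 10);
        }
        topics.put("events-e", 12);
        return topics;
    }

    public static void main(String[] args) {
        System.out.println("range busy: " + busyConsumers(range(exampleTopics(), 50)));
        System.out.println("round-robin busy: " + busyConsumers(roundRobin(exampleTopics(), 50)));
    }
}
```

In this model the range-style split keeps only 12 of the 50 consumers busy, which matches the behaviour described in the question, while the round-robin split gives every consumer at least one partition.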
Be careful, though, when you deploy a new version with a different partition.assignment.strategy. Suppose you have 5 running consumers with the RangeAssignor strategy and you then redeploy one of them with the RoundRobinAssignor strategy. In this case, the replacement consumer fails with an exception, because all consumers in one consumer group must be configured with the same partition.assignment.strategy. The problem is described in this StackOverflow question.
So, if you want to change the partition.assignment.strategy, you have several options:
Stop all existing consumers and then deploy the new ones.
Specify new group.id for the new consumers.
Both of these ways have pros and cons.
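As a rough sketch of the second option, the consumer configuration could look like the following. The broker address, the group name and the class name are placeholders, not values from the question; the property keys and the assignor and deserializer class names are standard Kafka consumer settings. Only java.util.Properties is used here so the snippet runs without the Kafka client on the classpath; in a real application the resulting object would be passed to the consumer.

```java
import java.util.Properties;

class RoundRobinConsumerConfig {

    // Builds consumer settings that use the round-robin assignor under the given group id.
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // A new group id keeps the new consumers apart from the old RangeAssignor group.
        props.setProperty("group.id", groupId);
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RoundRobinAssignor");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" and "my-app-v2" are placeholder values.
        Properties props = build("localhost:9092", "my-app-v2");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Note that a new group.id starts without the committed offsets of the old group, which is one of the cons to weigh.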

Apache Kafka: 3 partitions, 3 consumers in the consumer group, each consumer should be multithreaded

Let's say there is a topic in Apache Kafka with 3 partitions. I need to run 3 consumers inside a consumer group and, according to the documentation, it means each consumer will read data from 1 partition.
Consumers are implemented using Spring Kafka. As we all know, by default all messages are received in a single thread, but using ConcurrentMessageListenerContainer should allow us to set up concurrency.
What do I want? I want to use server CPU resources efficiently and make each consumer receive and process messages in separate threads (3 threads in our case, which is equal to the number of partitions).
As a result - 3 consumers (3 servers) in the consumer group and each consumer receives messages from all 3 partitions.
Is it possible? If yes, will it be enough if I just use ConcurrentMessageListenerContainer and specify 3 listeners, one for each partition?
I was a little confused by your statement. Just to clarify, in Kafka only one consumer within a consumer group can read from a given partition. It is not possible for two consumers in the same consumer group to read from the same partition.
Within a consumer group,
if the number of consumers is greater than the number of partitions, then the extra consumer threads will be idle.
if the number of consumers is less than the number of partitions, then the same consumer thread will read from multiple partitions.
This code snippet will read from a topic named "mytopic" and use 3 threads to read from its 3 partitions: @KafkaListener(topics = "mytopic", concurrency = "3", groupId = "myconsumergroup")
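The two rules above can be turned into a small back-of-the-envelope calculation. This is only a counting sketch under the assumption that every listener thread on every server joins the same consumer group; the class and method names are made up for illustration and nothing here calls Spring Kafka.

```java
class ListenerThreadsSketch {

    // Every server starts `concurrency` consumer threads in the same group.
    static int totalThreads(int instances, int concurrency) {
        return instances * concurrency;
    }

    // At most one thread per partition can be active within a group.
    static int activeThreads(int partitions, int instances, int concurrency) {
        return Math.min(partitions, totalThreads(instances, concurrency));
    }

    // Threads left without any partition.
    static int idleThreads(int partitions, int instances, int concurrency) {
        return totalThreads(instances, concurrency) - activeThreads(partitions, instances, concurrency);
    }

    public static void main(String[] args) {
        // 3 partitions, 3 servers, concurrency = 3 on each server
        System.out.println("total=" + totalThreads(3, 3)
                + " active=" + activeThreads(3, 3, 3)
                + " idle=" + idleThreads(3, 3, 3));
        // 3 partitions, 3 servers, concurrency = 1 on each server
        System.out.println("total=" + totalThreads(3, 1)
                + " active=" + activeThreads(3, 3, 1)
                + " idle=" + idleThreads(3, 3, 1));
    }
}
```

With 3 partitions, 3 servers and concurrency 3, the sketch counts 9 threads of which 6 are idle, so the setup from the question would not make all threads busy. Which servers end up holding the active threads depends on the assignment and is not modelled here.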

More Kafka consumers than partitions

I've seen a lot of questions about this subject but I'm not very convinced. Is there a way to have more consumers, each with a different group.id value, than the number of partitions?
Is there a good workaround to achieve this in the Java code?
Consumer groups are Kafka's way of parallelising consumption. Multiple consumers can join a consumer group so that each individual consumer reads from different partitions of the Kafka topic.
In addition to the above, Kafka can track the active consumers of a particular group by using the group.id.
Therefore, having more consumers than partitions in one group is ineffective, because each partition is consumed by only one consumer in a consumer group, which preserves the order of messages within that partition. Kafka guarantees ordering only within a partition, not across all the partitions of a topic.
But, you can still have multiple consumer groups consuming same topic which is more of a Publish/Subscribe rather than Point-to-Point.
If the consumers use different group ids, they belong to different consumer groups, and each group gets its own independent assignment of the partitions.
Basically, if you have N partitions and M distinct group ids, you can have at most N * M consumer threads polling from that topic. Any more, and you've oversubscribed for a particular group.

Is there a limitation on the number of consumer to the same topic in Kafka?

Does Kafka have a limitation on the simultaneous connections (created with Consumer.createJavaConsumerConnector) for the same topic within the same group?
My scenario is that I need to consume a topic from different processes (not threads), so I need to create lots of high-level consumers.
The number of active consumers within the same consumer group is limited by the number of partitions of the topic. Extra consumers will act as backups and will only start consuming when one of the active consumers goes down and the group is re-balanced.
If you need to consume the same copy of the data within multiple processes, your consumers should be in different consumer groups. There is no limitation on the number of consumer groups you can have.
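The difference between consumers in one group and consumers in different groups can be illustrated with a toy model. This is a deliberately simplified sketch, not Kafka code: the group names are hypothetical, and partition ownership inside a group is approximated as partition number modulo the number of active consumers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupDeliverySketch {

    // For one record on the given partition, lists who receives it:
    // exactly one consumer in every group, whatever the size of the group.
    static List<String> receivers(int partition, int partitions, Map<String, Integer> consumersPerGroup) {
        List<String> result = new ArrayList<>();
        // TreeMap only makes the output order deterministic (sorted by group name).
        for (Map.Entry<String, Integer> group : new TreeMap<>(consumersPerGroup).entrySet()) {
            // Consumers beyond the partition count never own a partition.
            int active = Math.min(group.getValue(), partitions);
            int owner = partition % active;
            result.add(group.getKey() + "/consumer-" + owner);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical groups: "audit" with 3 consumers, "billing" with 1, on a 2-partition topic.
        Map<String, Integer> groups = Map.of("audit", 3, "billing", 1);
        System.out.println(receivers(0, 2, groups));
        System.out.println(receivers(1, 2, groups));
    }
}
```

In this model every group sees every record, while the third consumer of the hypothetical "audit" group never receives anything because the topic has only 2 partitions.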
The main limitation is the number of partitions the topic has: you can create more consumers than partitions in a group, but the extra ones won't consume anything.

Kafka Topic-per-Consumer Configuration

I understand that Kafka Consumer Group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumers is actually a consumer group itself (i.e. 1 consumer per group). This is mainly due to synchronisation between different databases so that the same data exists in each. All I am trying to do is run the same job on different environments as soon as the consumer gets a message from the producer (broadcast).
For me, I don't believe that the partitions/load balancing idea makes any difference. I am going with a topic that has 1 partition and a replication factor of n (n = total consumer groups, or consumers in my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
P.S. I am using the Producer/Consumer API only; my messaging framework needs to have a minimum change/impact on my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of active consumers in a consumer group (if there are more consumers than partitions, the extra ones will just be idle). There can be fewer consumers than partitions. I wouldn't call 1:1 necessarily ideal; it's the practical limit.
I am going with a topic that has 1 partition and n Replication-Factor (n = total consumer groups, or consumer for my case).
I don't see value having replication-factor equal to number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It doesn't have anything to do with the number of consumers, since each consumer will only ever be consuming from the leader broker for a given partition.
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
Sounds to me you intended to say you're not using Kafka Streams API, since it's Kafka Streams that offers KTable, KStream and aggregations thereon.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. With 5 brokers, a single partition and a replication factor of 3, only 3 brokers hold a replica and the other 2 are not used for that topic. With two partitions and a replication factor of 3, there are 2*3 = 6 partition replicas to divide over the 5 brokers.
Only now one broker holds two replicas while the others hold one, so the load is not spread evenly. It would be better to have more partitions to get a better spread.
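The arithmetic above can be checked with a short counting sketch. It places replicas on brokers in plain sequential order, which is an assumption made only to count them; Kafka's real replica placement is more involved, and the class and method names are made up for illustration.

```java
import java.util.Arrays;

class ReplicaSpreadSketch {

    // Counts replicas per broker when partitions * replicationFactor replicas
    // are laid out sequentially over the brokers.
    static int[] replicasPerBroker(int partitions, int replicationFactor, int brokers) {
        if (replicationFactor > brokers) {
            throw new IllegalArgumentException("replication factor cannot exceed the number of brokers");
        }
        int[] counts = new int[brokers];
        for (int p = 0; p < partitions; p++) {
            for (int r = 0; r < replicationFactor; r++) {
                // Consecutive slots keep the replicas of one partition on distinct brokers.
                counts[(p * replicationFactor + r) % brokers]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replicasPerBroker(1, 3, 5))); // two brokers hold nothing
        System.out.println(Arrays.toString(replicasPerBroker(2, 3, 5))); // one broker holds two replicas
        System.out.println(Arrays.toString(replicasPerBroker(5, 3, 5))); // even spread
    }
}
```

In this layout, 5 partitions with a replication factor of 3 give each of the 5 brokers exactly 3 replicas, which illustrates why a partition count that makes the total replica count a multiple of the broker count spreads more evenly.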
There are other reasons to pick a number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.
