I understand that a Kafka consumer group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumers is actually a consumer group itself (i.e. 1 consumer per group). This is mainly for synchronisation between different databases, so that the same data exists in each. All I am trying to do is run the same job in different environments as soon as the consumer gets a message from the producer (broadcast).
In my case, I don't believe the partitioning/load-balancing idea makes any difference. I am going with a topic that has 1 partition and n Replication-Factor (n = total consumer groups, or consumer for my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
P.S. I am using only the Producer/Consumer API, because my messaging framework needs to have minimal impact on my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of active consumers in a consumer group (if there are more consumers than partitions, the extra ones will just sit idle). There can be fewer consumers than partitions. I wouldn't call 1:1 necessarily ideal; it's the practical limit.
I am going with a topic that has 1 partition and n Replication-Factor (n = total consumer groups, or consumer for my case).
I don't see the value in having the replication factor equal to the number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It has nothing to do with the number of consumers, since by default each consumer only ever fetches from the leader broker for a given partition.
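For the broadcast setup described in the question, what matters is that every environment uses its own group.id; replication does not enter into it. A minimal sketch of that idea, using only java.util.Properties so that it runs without the Kafka client on the classpath. The class name, the "db-sync-" prefix, the environment names and the broker address are made up for illustration; the property keys are the standard consumer settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class BroadcastGroups {
    // Builds one consumer configuration per environment. Because each one has a
    // different group.id, each consumer receives every message of the topic.
    static List<Properties> consumerConfigs(String bootstrapServers, List<String> environments) {
        List<Properties> configs = new ArrayList<>();
        for (String env : environments) {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", bootstrapServers);
            // Hypothetical naming scheme: one group per environment.
            props.setProperty("group.id", "db-sync-" + env);
            props.setProperty("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.setProperty("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            configs.add(props);
        }
        return configs;
    }

    public static void main(String[] args) {
        // Assumed broker address and environment names.
        for (Properties p : consumerConfigs("localhost:9092", List.of("dev", "staging", "prod"))) {
            System.out.println(p.getProperty("group.id"));
        }
    }
}
```

Each Properties object would then be handed to its own KafkaConsumer instance in the corresponding environment.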
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
It sounds like you meant to say you're not using the Kafka Streams API, since it's Kafka Streams that offers KTable, KStream and aggregations on them.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. With 5 brokers, one partition and a replication factor of 3, only 3 brokers hold a replica and the other 2 brokers sit unused. With two partitions and a replication factor of 3, you have 2*3 = 6 partition replicas to divide over the 5 brokers.
Only now one broker holds two replicas while the others hold one each, so the load is not spread evenly. More partitions would give a better spread.
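The arithmetic above can be checked with a small simulation. This is only a sketch: it places replicas with a plain round-robin rule, which is a simplification chosen for illustration and not Kafka's actual replica placement algorithm. The class and method names are made up.

```java
import java.util.Arrays;

class ReplicaSpread {
    // Counts how many partition replicas land on each broker when replica r of
    // partition p is placed on broker (p * replicationFactor + r) % brokers.
    // Assumes replicationFactor <= brokers, so replicas of the same partition
    // always end up on different brokers.
    static int[] replicasPerBroker(int partitions, int replicationFactor, int brokers) {
        int[] counts = new int[brokers];
        for (int p = 0; p < partitions; p++) {
            for (int r = 0; r < replicationFactor; r++) {
                counts[(p * replicationFactor + r) % brokers]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // 1 partition, replication factor 3, 5 brokers: two brokers hold nothing.
        System.out.println(Arrays.toString(replicasPerBroker(1, 3, 5))); // [1, 1, 1, 0, 0]
        // 2 partitions: every broker is used, but one holds two replicas.
        System.out.println(Arrays.toString(replicasPerBroker(2, 3, 5))); // [2, 1, 1, 1, 1]
        // 5 partitions: 15 replicas, spread evenly.
        System.out.println(Arrays.toString(replicasPerBroker(5, 3, 5))); // [3, 3, 3, 3, 3]
    }
}
```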
There are other reasons to pick a number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.
Related
I subscribe to Kafka topics by RegExp. There are 5 topics matching the given pattern. Four of them have 10 partitions each, and the last topic has 12 partitions. So there are 52 partitions in total.
I have 5 running instances and each of them starts 10 consumers, so there are 50 consumers in total. I expect the load to be spread horizontally across all the consumers: each consumer reads 1 partition, except for two of them that read 2, because the total number of consumers is less than the total partition count.
Though the reality is a bit different.
In short, here is what happens
There are 50 consumers subscribing to multiple topics by RegExp. The matched topics have 52 partitions in total.
Each consumer tries to take 1 partition from each matched topic. Two more consumers get 1 partition each from the single topic that has 12 rather than 10 partitions.
12 consumers are working whilst 38 remain idle because there are no partitions left for them to read.
Is there any way to force a Kafka consumer to read at most 1 partition with a RegExp subscription? That way, I could put all consumers to work. Or is there a different approach to reading multiple topics that takes both the number of partitions and the number of running consumers into account?
It comes down to the partition.assignment.strategy Kafka consumer property. The default value is RangeAssignor, which assigns partitions on a per-topic basis. With several topics and many consumers, that spreads the load unfairly, because the same first few consumers receive a partition from every topic.
We set the property to RoundRobinAssignor and it helped.
However, you should be careful when you deploy a new version with a different partition.assignment.strategy. Suppose you have 5 running consumers with the RangeAssignor strategy and you then redeploy one of them with the RoundRobinAssignor strategy. In this case, you get an exception when the consumer is replaced, because all consumers in one consumer group must share a common partition.assignment.strategy. The problem is described in this StackOverflow question
So, if you want to change the partition.assignment.strategy, you have several options:
Stop all existing consumers and then deploy the new ones.
Specify new group.id for the new consumers.
Both of these ways have pros and cons.
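To see why the assignor makes such a difference for the numbers in the question, here is a rough simulation of the two strategies. It only counts partitions per consumer and is a simplified model, not the code of Kafka's RangeAssignor or RoundRobinAssignor; the class, method and topic names are invented for the example, and it assumes every consumer subscribes to the same topics.

```java
import java.util.Arrays;
import java.util.Map;

class RangeVsRoundRobin {
    // Range-style: each topic is divided among the consumers on its own, so the
    // leftover partitions of every topic go to the same first consumers.
    static int[] rangeAssign(Map<String, Integer> partitionsByTopic, int consumers) {
        int[] counts = new int[consumers];
        for (int partitions : partitionsByTopic.values()) {
            int perConsumer = partitions / consumers;
            int extra = partitions % consumers;
            for (int c = 0; c < consumers; c++) {
                counts[c] += perConsumer + (c < extra ? 1 : 0);
            }
        }
        return counts;
    }

    // Round-robin-style: all partitions of all topics are put in one list and
    // dealt out to the consumers one at a time.
    static int[] roundRobinAssign(Map<String, Integer> partitionsByTopic, int consumers) {
        int[] counts = new int[consumers];
        int next = 0;
        for (int partitions : partitionsByTopic.values()) {
            for (int p = 0; p < partitions; p++) {
                counts[next % consumers]++;
                next++;
            }
        }
        return counts;
    }

    // Number of consumers that received at least one partition.
    static long activeConsumers(int[] counts) {
        return Arrays.stream(counts).filter(n -> n > 0).count();
    }

    public static void main(String[] args) {
        // Four topics with 10 partitions and one with 12, as in the question.
        Map<String, Integer> topics = Map.of("t1", 10, "t2", 10, "t3", 10, "t4", 10, "t5", 12);
        System.out.println(activeConsumers(rangeAssign(topics, 50)));      // 12
        System.out.println(activeConsumers(roundRobinAssign(topics, 50))); // 50
    }
}
```

With the range-style split, consumers 0 to 9 end up with 5 partitions each and consumers 10 and 11 with 1 each, which matches the 12 working consumers observed in the question.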
I'm new to Kafka and would like to know the best approach for configuring the topics, partitions, consumer groups and consumer app replicas.
Working on an existing setup, the configuration handed down is as follows:
10 topics
Each topic has its own group i.e. topic1-group1, topic2-group2 and so on.
Each topic has 5 partitions
The Java consumer app has 5 (intentionally same as number of partitions, I'm told) replicas (k8s pods) which use Spring Kafka's @KafkaListener
Q1. I'd like to know if this is the configuration that will offer the best performance (high throughput and low latency)?
The consumer app sends the messages to only ONE downstream system which makes me think that all the topics (and all their partitions + consumer app replicas) can share a single consumer group (let's call it main-group).
Q2. Would this offer better or worse performance than having dedicated group for each topic?
Sub question:
Q3. Can ONE @KafkaListener work with 10 topics each with dedicated consumer groups given it has only 1 containerGroup and 1 containerFactory parameter?
Thanks
I have a setup where several KafkaConsumers each handle a number of partitions on a single topic. They are statically assigned the partitions, in a way that ensures that each consumer has an equal number of partitions to handle. The record key is also chosen so that we have equal distribution of messages over all partitions.
At times of heavy load, we often see a small number of partitions build up a considerable lag (thousands of messages/several minutes worth), while other partitions that are getting the same load and are consumed by the same consumer manage to keep the lag down to a few hundred messages / couple of seconds.
It looks like the consumer is fetching records as fast as it can, going around most of the partitions, but now and then there is one partition that gets left out for a long time. Ideally, I'd like to see the lag spread out more evenly across the partitions.
I've been reading about KafkaConsumer poll behaviour and configuration for a while now, and so far I think there are two options to work around this:
Build something custom that can monitor the lag per partition, and use KafkaConsumer.pause() and .resume() to essentially force the KafkaConsumer to read from the partitions with the most lag
Restrict our KafkaConsumer to only ever subscribe to one TopicPartition, and work with multiple instances of KafkaConsumer.
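The first option could be sketched roughly as follows. This only shows the selection step, written in plain Java so that it runs without the Kafka client. Partitions are represented as strings, and the class name, method name and the 0.5 ratio are all assumptions made for illustration. In a real consumer the lag would have to be computed first (for example from KafkaConsumer.endOffsets() and position()), and the result mapped back to TopicPartition objects for pause() and resume().

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class LagAwarePause {
    // Returns the partitions to pause: every partition whose lag is below
    // `ratio` times the worst lag. The partitions left unpaused are the
    // heavily lagging ones, so the next polls can only fetch from them.
    static Set<String> partitionsToPause(Map<String, Long> lagByPartition, double ratio) {
        long maxLag = lagByPartition.values().stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L);
        return lagByPartition.entrySet().stream()
                .filter(e -> e.getValue() < maxLag * ratio)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Hypothetical lag snapshot: one partition is far behind the others.
        Map<String, Long> lag = Map.of("topic-0", 200L, "topic-1", 5000L, "topic-2", 300L);
        // Pauses topic-0 and topic-2, leaving only topic-1 to be fetched.
        System.out.println(partitionsToPause(lag, 0.5));
    }
}
```

The paused set would need to be recomputed periodically and everything resumed once the lag evens out, otherwise the paused partitions start to fall behind in turn.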
Neither of these options seems like the proper way to handle this. Configuration also doesn't seem to have the answer:
max.partition.fetch.bytes only specifies the maximum fetch size for a single partition; it doesn't guarantee that the next fetch will be from another partition.
max.poll.interval.ms only works for consumer groups and not on a per-partition basis.
Am I missing a way to encourage the KafkaConsumer to switch partition more often? Or a way to implement a preference for the partitions with the highest lag?
I'm not sure whether this is still relevant to you, or whether it exactly meets your needs, but you could try a lag-aware assignor. Such an assignor distributes partitions so that the lag is spread as evenly as possible across the consumers in a group. Here is a well-written implementation of a lag-based assignor that I have used:
https://github.com/grantneale/kafka-lag-based-assignor
All you need to do is configure your consumer to use this assignor, with the statement below:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, LagBasedPartitionAssignor.class.getName());
I've seen a lot of questions on this subject, but I'm still not convinced. Is there a way to have more consumers, each with a different group.id value, than the number of partitions?
Is there a good workaround to achieve this in Java code?
Consumer groups are Kafka's way of parallelising the consumption of data. Multiple consumers can join a consumer group so that each individual consumer reads from different partitions of the Kafka topic.
In addition, Kafka tracks the active consumers of a particular group by its group.id.
Therefore, having more consumers than partitions in one group is ineffective, as each partition is consumed by only one consumer in a consumer group in order to preserve the order of the messages in that partition. Kafka guarantees ordering only within a partition, not across all the partitions of a topic.
However, you can still have multiple consumer groups consuming the same topic, which is closer to publish/subscribe than to point-to-point messaging.
If you use a different group id, the consumer belongs to a different consumer group, and each group gets its own independent assignment of all the partitions.
Basically, If you have N partitions and M distinct group ids, you can have at most N * M consumer threads polling from that topic. Any more, and you've oversubscribed for a particular group.
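As a quick sanity check of that arithmetic, here is a tiny sketch. The class and method names are made up for the example; it just restates the rule that within one group at most one consumer is active per partition.

```java
class GroupCapacity {
    // Within a single group, at most one consumer per partition can be active.
    static int activeConsumers(int partitions, int consumersInGroup) {
        return Math.min(partitions, consumersInGroup);
    }

    // Consumers beyond the partition count receive no partition and stay idle.
    static int idleConsumers(int partitions, int consumersInGroup) {
        return Math.max(0, consumersInGroup - partitions);
    }

    // Upper bound on active consumers over all groups: N partitions * M groups.
    static int maxActiveAcrossGroups(int partitions, int groups) {
        return partitions * groups;
    }

    public static void main(String[] args) {
        System.out.println(activeConsumers(3, 5));       // 3
        System.out.println(idleConsumers(3, 5));         // 2
        System.out.println(maxActiveAcrossGroups(3, 4)); // 12
    }
}
```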
I would like to know the difference between a simple topic and a partitioned topic. As I understand it, a topic is partitioned to balance the load. Each message has an offset, and the consumer acknowledges it to confirm that the previous messages have been consumed. If the number of partitions and the number of consumers don't match, does the rebalance done by Kafka handle this efficiently?
If multiple topics are created instead of partitions, does that affect operational efficiency?
From the Kafka documentation
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data
Having multiple partitions for any given topic allows Kafka to distribute it across the Kafka cluster. As a result the request for handling data from different partitions can be divided among multiple servers in the whole cluster. Also each partition can be replicated across multiple servers to minimize the data loss. Again from the doc page
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
So a topic with a single partition won't let you use this flexibility. Also note that in a real-life environment you can have different topics to hold different categories of messages (though it is also possible to have a single topic with multiple partitions, where each partition holds specific categories of messages, by setting the message key when producing).
I don't think creating multiple topics instead of partitions will have much impact on overall performance. But imagine you want to keep track of all the tweets made by users on your site. You can then have one topic named "User_tweet" with multiple partitions, so that Kafka can distribute the data across the partitions when messages are produced, and on the consumer end you only need one group of consumers pulling data from the same topic. Keeping "User_tweet_1", "User_tweet_2" and "User_tweet_3" instead will only make things more complex for you, both when producing and when consuming the messages.
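The idea of routing categories to partitions by message key can be illustrated with a small sketch. Note the assumption: Kafka's default partitioner hashes the serialized key with murmur2, whereas this example uses String.hashCode() as a stand-in, so the partition numbers it produces will not match what a real producer would choose. The class and method names are invented.

```java
class KeyPartitioning {
    // Maps a key to a partition index in [0, numPartitions).
    // Stand-in hash: a real producer uses murmur2 on the serialized key bytes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 6; // assumed partition count
        for (String key : new String[] {"orders", "payments", "orders"}) {
            // The same key always maps to the same partition, so all messages
            // of one category stay together and keep their relative order.
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

Because the mapping depends on the number of partitions, adding partitions later changes where a given key lands.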