I subscribe to Kafka topics by RegExp. There are 5 topics matching the given pattern. Four of them have 10 partitions each, and the last topic has 12 partitions. So there are 52 partitions in total.
I have 5 running instances and each of them starts 10 consumers, so there are 50 consumers in total. I expect the load to be spread horizontally across all the consumers: each consumer reads 1 partition, except two that read 2, because the total number of consumers is less than the total partition count.
Though the reality is a bit different.
In short, here is what happens:
There are 50 consumers subscribing to multiple topics by RegExp. The matched topics have 52 partitions in total.
Each consumer tries to take 1 partition from each matched topic. Two more consumers are assigned 1 partition each from the single topic that has 12 partitions instead of 10.
12 consumers are working whilst 38 remain idle because there are no partitions left for them to read.
Is there any way to force a Kafka consumer to read at most 1 partition with a RegExp subscription? That way, all consumers would get work. Or maybe there is a different approach that allows reading multiple topics while respecting the number of partitions and running consumers?
Well, it's about the partition.assignment.strategy Kafka consumer property. The default value is RangeAssignor, which assigns partitions on a per-topic basis. When each topic has fewer partitions than there are consumers, that spreads the load between consumers unevenly.
We set the property to RoundRobinAssignor and it helped.
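To see why the change helps, here is a minimal simulation of the two strategies using only the JDK. It is a simplified model, not Kafka's actual assignor code: it assumes every consumer subscribes to the same topics, and the topic names are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class AssignorSimulation {

    // Topic name -> partition count, as in the question (the names are hypothetical).
    static Map<String, Integer> exampleTopics() {
        Map<String, Integer> topics = new LinkedHashMap<>();
        for (int i = 1; i <= 4; i++) {
            topics.put("topic-" + i, 10);
        }
        topics.put("topic-5", 12);
        return topics;
    }

    // Range-style: every topic is split on its own, always starting from the first consumer.
    static int[] rangeAssign(Map<String, Integer> topics, int consumers) {
        int[] owned = new int[consumers];
        for (int partitions : topics.values()) {
            int perConsumer = partitions / consumers;
            int withExtra = partitions % consumers;
            for (int c = 0; c < consumers; c++) {
                owned[c] += perConsumer + (c < withExtra ? 1 : 0);
            }
        }
        return owned;
    }

    // Round-robin style: the partitions of all topics are dealt out one by one.
    static int[] roundRobinAssign(Map<String, Integer> topics, int consumers) {
        int[] owned = new int[consumers];
        int next = 0;
        for (int partitions : topics.values()) {
            for (int p = 0; p < partitions; p++) {
                owned[next % consumers]++;
                next++;
            }
        }
        return owned;
    }

    // Number of consumers that own at least one partition.
    static long activeConsumers(int[] owned) {
        return Arrays.stream(owned).filter(n -> n > 0).count();
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = exampleTopics();
        System.out.println("range:       " + activeConsumers(rangeAssign(topics, 50)) + " active consumers");
        System.out.println("round robin: " + activeConsumers(roundRobinAssign(topics, 50)) + " active consumers");
    }
}
```

With 50 consumers, the range-style split leaves only 12 consumers with work (the situation described in the question), while the round-robin split gives every consumer at least one partition.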
Though you should be careful when you deploy a new version with a different partition.assignment.strategy. Suppose you have 5 running consumers with the RangeAssignor strategy, and then you redeploy one of them with the RoundRobinAssignor strategy. In this case, you get an exception when the consumer is replaced, because all consumers in one consumer group must share a common partition.assignment.strategy. The problem is described in this StackOverflow question.
So, if you want to change the partition.assignment.strategy, you have several options:
Stop all consumers and then deploy the new ones.
Specify a new group.id for the new consumers.
Both of these ways have pros and cons.
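As a rough sketch of the second option, the consumer configuration could look like the following. The broker address and the group name `my-app-v2` are placeholders, not values from the original setup, and a real consumer would also need deserializer settings.

```java
import java.util.Properties;

class ConsumerSettings {

    // Builds consumer properties that use a new group id together with the round-robin assignor.
    static Properties roundRobinConsumer(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // A new group.id means the old and new consumers never meet in the same group.
        props.put("group.id", groupId);
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RoundRobinAssignor");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" and "my-app-v2" are hypothetical values for illustration.
        Properties props = roundRobinConsumer("localhost:9092", "my-app-v2");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keep in mind that a new group starts without the committed offsets of the old one, so where it begins reading depends on auto.offset.reset.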
Related
I'm new to Kafka and would like to know the best approach for configuring the topics, partitions, consumer groups and consumer app replicas.
Working on an existing setup, the configuration handed down is as follows:
10 topics
Each topic has its own group, i.e. topic1-group1, topic2-group2 and so on.
Each topic has 5 partitions
The Java consumer app has 5 replicas (k8s pods; intentionally the same as the number of partitions, I'm told), which use Spring Kafka's @KafkaListener
Q1. I'd like to know if this is the configuration that will offer the best performance (high throughput and low latency)?
The consumer app sends the messages to only ONE downstream system which makes me think that all the topics (and all their partitions + consumer app replicas) can share a single consumer group (let's call it main-group).
Q2. Would this offer better or worse performance than having a dedicated group for each topic?
Sub question:
Q3. Can ONE @KafkaListener work with 10 topics each with dedicated consumer groups given it has only 1 containerGroup and 1 containerFactory parameter?
Thanks
Let's say there is a topic in Apache Kafka with 3 partitions. I need to run 3 consumers inside a consumer group and, according to the documentation, it means each consumer will read data from 1 partition.
Consumers are implemented using Spring Kafka. As we all know, by default all messages are received in a single thread, but using ConcurrentMessageListenerContainer should allow us to set up concurrency.
What do I want? I want to use server CPU resources efficiently and have each consumer receive and process messages in separate threads (3 threads in our case, which is equal to the number of partitions).
As a result - 3 consumers (3 servers) in the consumer group and each consumer receives messages from all 3 partitions.
Is it possible? If yes, will it be enough if I just use ConcurrentMessageListenerContainer and specify 3 listeners for each partition?
I was a little confused by your statement. Just to clarify: in Kafka, only one consumer within a consumer group can read from a given partition. It is not possible for two consumers in the same consumer group to read from the same partition.
Within a consumer group,
if the number of consumers is greater than the number of partitions, then the extra consumer threads will be idle.
if the number of consumers is less than the number of partitions, then the same consumer thread will read from multiple partitions.
This code snippet reads from the topic named "mytopic" and uses 3 threads to read from its 3 partitions: @KafkaListener(topics = "mytopic", concurrency = "3", groupId = "myconsumergroup")
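The arithmetic behind the two rules above can be sketched in plain Java. This is only a back-of-the-envelope model that assumes a single topic and an evenly balancing assignor; the class and method names are invented for the example.

```java
class ListenerThreads {

    // Consumer threads in one group that cannot get a partition of a single topic.
    static int idleThreads(int partitions, int instances, int concurrency) {
        int threads = instances * concurrency;
        return Math.max(0, threads - partitions);
    }

    // Largest number of partitions a single consumer thread has to serve.
    static int maxPartitionsPerThread(int partitions, int instances, int concurrency) {
        int threads = instances * concurrency;
        return (partitions + threads - 1) / threads; // ceiling division
    }

    public static void main(String[] args) {
        // 3 partitions, 3 servers, concurrency = 3 on each server: 9 threads in total.
        System.out.println("idle threads: " + idleThreads(3, 3, 3));
        // 3 partitions, 1 server, concurrency = 3: one thread per partition.
        System.out.println("idle threads: " + idleThreads(3, 1, 3));
        // 3 partitions, 1 server, concurrency = 1: one thread reads everything.
        System.out.println("partitions per thread: " + maxPartitionsPerThread(3, 1, 1));
    }
}
```

So with 3 partitions, running 3 servers with concurrency 3 each would leave 6 of the 9 threads idle; the threads on one server cannot share a partition that another server's thread already owns.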
I have a setup where several KafkaConsumers each handle a number of partitions on a single topic. They are statically assigned the partitions, in a way that ensures that each consumer has an equal number of partitions to handle. The record key is also chosen so that we have equal distribution of messages over all partitions.
At times of heavy load, we often see a small number of partitions build up a considerable lag (thousands of messages/several minutes worth), while other partitions that are getting the same load and are consumed by the same consumer manage to keep the lag down to a few hundred messages / couple of seconds.
It looks like the consumer is fetching records as fast as it can, going around most of the partitions, but now and then there is one partition that gets left out for a long time. Ideally, I'd like to see the lag spread out more evenly across the partitions.
I've been reading about KafkaConsumer poll behaviour and configuration for a while now, and so far I think there are 2 options to work around this:
Build something custom that can monitor the lag per partition, and use KafkaConsumer.pause() and .resume() to essentially force the KafkaConsumer to read from the partitions with the most lag.
Restrict our KafkaConsumer to only ever subscribe to one TopicPartition, and work with multiple instances of KafkaConsumer.
Neither of these options seems like the proper way to handle this. Configuration also doesn't seem to have the answer:
max.partition.fetch.bytes only specifies the max fetch size for a single partition; it doesn't guarantee that the next fetch will be from another partition.
max.poll.interval.ms only works for consumer groups and not on a per-partition basis.
Am I missing a way to encourage the KafkaConsumer to switch partition more often? Or a way to implement a preference for the partitions with the highest lag?
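For what it's worth, the selection logic of the first workaround (pause everything except the partitions with the most lag) might be sketched like this. It deliberately leaves out the Kafka client: the lag map is a hypothetical input that would have to be computed from end offsets and consumer positions, and the returned partition numbers would be mapped to TopicPartition objects before calling KafkaConsumer.pause(), with resume() for the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LagPauseSelector {

    // Returns the partitions to pause so that only the `keepActive` laggiest ones stay readable.
    static Set<Integer> partitionsToPause(Map<Integer, Long> lagByPartition, int keepActive) {
        List<Map.Entry<Integer, Long>> byLag = new ArrayList<>(lagByPartition.entrySet());
        // Highest lag first.
        byLag.sort(Map.Entry.<Integer, Long>comparingByValue().reversed());
        Set<Integer> toPause = new TreeSet<>();
        for (int i = Math.max(0, keepActive); i < byLag.size(); i++) {
            toPause.add(byLag.get(i).getKey());
        }
        return toPause;
    }

    public static void main(String[] args) {
        // Hypothetical lag snapshot: partition number -> messages behind.
        Map<Integer, Long> lag = Map.of(0, 5000L, 1, 200L, 2, 150L, 3, 3000L);
        System.out.println("pause: " + partitionsToPause(lag, 2));
    }
}
```

Such a selector would need to be re-evaluated regularly (for example every few polls), otherwise the paused partitions simply become the new laggards.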
Not sure whether this is still relevant to you, or whether it exactly fits your needs, but you could try a lag-aware assignor. Such an assignor distributes partitions among the consumers of a group so that the total lag is spread as evenly as possible across them. Here is a well-written implementation of a lag-based assignor that I have used:
https://github.com/grantneale/kafka-lag-based-assignor
All you need is to configure your consumer to use this assignor, as in the statement below:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, LagBasedPartitionAssignor.class.getName());
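To illustrate the general idea of balancing by lag, here is a small greedy sketch in plain Java. It is not the code of the linked project and may differ from it in detail; it simply hands out partitions from highest to lowest lag, always to the consumer with the smallest total so far. The lag figures are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LagBalancedAssignment {

    // Returns, for each consumer index, the list of partition numbers assigned to it.
    static List<List<Integer>> assign(Map<Integer, Long> lagByPartition, int consumers) {
        List<Map.Entry<Integer, Long>> byLag = new ArrayList<>(lagByPartition.entrySet());
        // Highest lag first, so the big partitions are placed before the small ones.
        byLag.sort(Map.Entry.<Integer, Long>comparingByValue().reversed());

        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            assignment.add(new ArrayList<>());
        }
        long[] totalLag = new long[consumers];

        for (Map.Entry<Integer, Long> entry : byLag) {
            // Pick the consumer with the smallest accumulated lag (lowest index wins ties).
            int target = 0;
            for (int c = 1; c < consumers; c++) {
                if (totalLag[c] < totalLag[target]) {
                    target = c;
                }
            }
            assignment.get(target).add(entry.getKey());
            totalLag[target] += entry.getValue();
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical lag per partition.
        Map<Integer, Long> lag = Map.of(0, 100L, 1, 90L, 2, 10L, 3, 5L);
        System.out.println(assign(lag, 2));
    }
}
```

Note that an assignor only runs during a group rebalance, so this kind of balancing applies to consumers that use subscribe() in a consumer group, not to statically assigned partitions.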
Does Kafka have a limitation on the simultaneous connections (created with Consumer.createJavaConsumerConnector) for the same topic within the same group?
My scenario is that I need to consume a topic from different processes (not threads), so I need to create lots of high-level consumers.
The number of active consumers within the same consumer group is limited by the number of partitions of the topic. Extra consumers will act as backups and will only start consuming when one of the active consumers goes down and the group is re-balanced.
If you need to consume the same copy of the data within multiple processes, your consumers should be in different consumer groups. There is no limitation on the number of consumer groups you can have.
The main limitation is the number of partitions the topic has: you can create more consumers than partitions, but the extra ones won't consume anything.
I understand that Kafka Consumer Group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumers is actually a consumer group itself (i.e. 1 consumer per group). This is mainly due to synchronisation between different databases, so that the same data exists in each. All I am trying to do is run the same job on different environments as soon as the consumers get a message from the producer (broadcast).
For me, I don't believe that the partitions/load balancing idea makes any difference. I am going with a topic that has 1 partition and n Replication-Factor (n = total consumer groups, or consumers in my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
P.S. I am using the Producer/Consumer API only; my messaging framework needs to have a minimal change/impact on my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of active consumers in a consumer group (if there are more consumers than partitions, the extra ones will just be idle). There can be fewer consumers than partitions. I wouldn't call 1:1 necessarily ideal; it's the practical limit.
I am going with a topic that has 1 partitions and n Replication-Factor (n = total consumer groups, or consumer for my case).
I don't see value in having the replication factor equal to the number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It doesn't have anything to do with the number of consumers, since by default each consumer only ever consumes from the leader broker for a given partition.
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
Sounds to me like you intended to say you're not using the Kafka Streams API, since it's Kafka Streams that offers KTable, KStream and aggregations thereon.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. With 5 brokers, a single partition and a replication factor of 3, only 3 brokers hold a replica and the 2 additional brokers are not used. With two partitions and a replication factor of 3, you can spread 2*3 = 6 partition replicas over the 5 brokers.
Only now there is one broker with two replicas while the others have one, so the load is not spread evenly. It would be better to have more partitions to get a better spread.
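The replica arithmetic above can be checked with a small model. It places replicas on brokers in simple round-robin order, which is an assumption made for illustration; Kafka's real placement logic is more involved (it also considers things like a starting offset and racks), but the per-broker counts come out the same for these cases.

```java
import java.util.Arrays;

class ReplicaSpread {

    // Counts how many partition replicas each broker holds under round-robin placement.
    static int[] replicasPerBroker(int partitions, int replicationFactor, int brokers) {
        if (replicationFactor > brokers) {
            throw new IllegalArgumentException("replication factor cannot exceed broker count");
        }
        int[] counts = new int[brokers];
        for (int p = 0; p < partitions; p++) {
            for (int r = 0; r < replicationFactor; r++) {
                // Consecutive slots keep the replicas of one partition on distinct brokers.
                counts[(p * replicationFactor + r) % brokers]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("1 partition:  " + Arrays.toString(replicasPerBroker(1, 3, 5)));
        System.out.println("2 partitions: " + Arrays.toString(replicasPerBroker(2, 3, 5)));
        System.out.println("5 partitions: " + Arrays.toString(replicasPerBroker(5, 3, 5)));
    }
}
```

With 5 brokers and a replication factor of 3, one partition leaves two brokers empty, two partitions put two replicas on one broker, and five partitions give every broker exactly three replicas.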
There are other reasons to pick a number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.