I've seen a lot of questions on this subject but I'm still not convinced. Is there a way to have more consumers (each with a different group.id value) than the number of partitions?
Is there a good workaround to achieve this in Java code?
Consumer Groups in Kafka is one way of parallelism in consuming the data. Multiple consumers can join a consumer group so that every individual consumer can consume data from different partitions of the Kafka Topic.
In addition to the above, Kafka tracks the active consumers of a particular group using the group.id.
Therefore, having more consumers than partitions within a single consumer group is ineffective: each partition is consumed by only one consumer in the group, which is how ordering is maintained. Note that Kafka only guarantees ordering within a partition, not total ordering across all partitions of a topic.
But you can still have multiple consumer groups consuming the same topic, which is more of a Publish/Subscribe model than Point-to-Point.
If consumers use different group ids, each group gets its own full assignment of the topic's partitions.
Basically, If you have N partitions and M distinct group ids, you can have at most N * M consumer threads polling from that topic. Any more, and you've oversubscribed for a particular group.
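As a sketch of this in plain Java (the broker address and group names below are made up for illustration): two consumers configured with different group.id values each receive a full copy of the topic, while consumers sharing a group.id split its partitions between them.

```java
import java.util.Properties;

public class GroupConfigSketch {
    // Hypothetical helper: builds consumer properties for a given group.id.
    // Consumers created with DIFFERENT group.id values each get every record;
    // consumers created with the SAME group.id divide the partitions.
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        // Two independent groups -> each would consume the whole topic.
        Properties groupA = consumerProps("billing-group");
        Properties groupB = consumerProps("audit-group");
        System.out.println(groupA.getProperty("group.id"));
        System.out.println(groupB.getProperty("group.id"));
    }
}
```

Each of these `Properties` objects would be passed to its own `KafkaConsumer` instance; the group names here are assumptions for the example.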
Related
I'm new to Kafka and would like to know the best approach for configuring the topics, partitions, consumer groups and consumer app replicas.
Working on an existing setup, the configuration handed down is as follows:
10 topics
Each topic has its own group, e.g. topic1-group1, topic2-group2, and so on.
Each topic has 5 partitions
The Java consumer app has 5 replicas (k8s pods) — intentionally the same as the number of partitions, I'm told — which use Spring Kafka's @KafkaListener
Q1. I'd like to know if this is the configuration that will offer the best performance (high throughput and low latency)?
The consumer app sends the messages to only ONE downstream system which makes me think that all the topics (and all their partitions + consumer app replicas) can share a single consumer group (let's call it main-group).
Q2. Would this offer better or worse performance than having dedicated group for each topic?
Sub question:
Q3. Can ONE @KafkaListener work with 10 topics, each with a dedicated consumer group, given it has only 1 containerGroup and 1 containerFactory parameter?
Thanks
Let's say there is a topic in Apache Kafka with 3 partitions. I need to run 3 consumers inside a consumer group and, according to the documentation, it means each consumer will read data from 1 partition.
Consumers are implemented using Spring Kafka. As we all know, by default all messages are received in a single thread, but using ConcurrentMessageListenerContainer should allow us to set up concurrency.
What do I want? I want to use server CPU resources efficiently and make each consumer receive and process messages in separate threads (3 threads in our case, equal to the number of partitions).
As a result - 3 consumers (3 servers) in the consumer group and each consumer receives messages from all 3 partitions.
Is it possible? If yes, will it be enough if I just use ConcurrentMessageListenerContainer and specify 3 listeners, one for each partition?
I was a little confused by your statement. Just to clarify: in Kafka, only one consumer can read from one partition within a consumer group. It is not possible for two consumers in the same consumer group to read from the same partition.
Within a consumer group,
if the number of consumers is greater than the number of partitions, the extra consumer threads will be idle.
if the number of consumers is less than the number of partitions, the same consumer thread will read from multiple partitions.
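The two rules above can be sketched as simple arithmetic (the partition and consumer counts here are illustrative only):

```java
public class GroupSizingSketch {
    // Within one consumer group, a partition goes to at most one consumer,
    // so the number of active consumers is capped by the partition count.
    static int activeConsumers(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    // Consumers beyond the partition count sit idle until a rebalance.
    static int idleConsumers(int partitions, int consumers) {
        return Math.max(0, consumers - partitions);
    }

    public static void main(String[] args) {
        // 3 partitions, 5 consumers: 3 do the work, 2 are idle.
        System.out.println(activeConsumers(3, 5));
        System.out.println(idleConsumers(3, 5));
        // 6 partitions, 2 consumers: both active, each reads 3 partitions.
        System.out.println(activeConsumers(6, 2));
    }
}
```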
This snippet will read from the topic named "mytopic" and use 3 threads to read from the 3 partitions:

```java
@KafkaListener(topics = "mytopic", concurrency = "3", groupId = "myconsumergroup")
```
I need to run a Kafka consumer in publish/subscribe mode 1000 times. As far as I know, for Kafka to work in pub/sub mode I need to give each consumer a new groupId (props.put("group.id", String.valueOf(Instant.now().toEpochMilli()));). But with this approach, if two consumer threads create their consumer in the same millisecond they will end up in the same group. How should this problem be solved?
If you want to spread the messages across the consumers you need to use the same group.id. If you have 1000 messages and 1000 consumers, then each of the consumer will normally consume one message.
On the other hand, if you want each of the consumer to consume all the messages from the topics, you need to use a different group.id so that the messages in the topic are consumed by all consumers. If you have a huge number of consumers you can use UUID.randomUUID().toString() in order to produce a distinct group.id for each one.
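A minimal sketch of that suggestion, assuming a plain java.util.Properties consumer config (the "broadcast-" prefix is an arbitrary choice for readability):

```java
import java.util.Properties;
import java.util.UUID;

public class BroadcastGroupSketch {
    // A UUID-based group.id is unique even when many consumers start
    // within the same millisecond, unlike Instant.now().toEpochMilli().
    static String freshGroupId() {
        return "broadcast-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("group.id", freshGroupId());
        // Each consumer built from such a config is its own group,
        // so every one of them receives all messages in the topic.
        System.out.println(props.getProperty("group.id"));
    }
}
```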
According to the docs:
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
Does Kafka have a limitation on the number of simultaneous connections (created with Consumer.createJavaConsumerConnector) for the same topic within the same group?

My scenario is that I need to consume a topic from different processes (not threads), so I need to create lots of high-level consumers.
The number of active consumers within the same consumer group is limited by the number of partitions of the topic. Extra consumers will act as backups and will only start consuming when one of the active consumers goes down and the group is re-balanced.
If you need to consume the same copy of the data within multiple processes, your consumers should be in different consumer groups. There is no limitation on the number of consumer groups you can have.
The main limitation is the number of partitions in the topic: you can create more consumers than partitions, but the extra ones won't consume anything.
I understand that Kafka Consumer Group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumers is actually a consumer group itself (i.e. 1 consumer per group). This is mainly due to synchronisation between different databases, so that the same data exists in each. All I am trying to do is run the same job in different environments as soon as the consumer gets a message from the producer (broadcast).
For me, I don't believe the partitions/load-balancing idea makes any difference. I am going with a topic that has 1 partition and a replication factor of n (n = total consumer groups, or consumers in my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
P.S. I am using the Producer/Consumer API only; my messaging framework needs to have minimal change/impact on my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of consumers in a consumer group (if there are more consumers than partitions, they will just be idle). There can be fewer consumers than partitions. I wouldn't call 1:1 as necessarily ideal, it's the practical limit.
I am going with a topic that has 1 partitions and n Replication-Factor (n = total consumer groups, or consumer for my case).
I don't see value having replication-factor equal to number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It doesn't have anything to do with the number of consumers, since each consumer will only ever be consuming from the leader broker for a given partition.
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
Sounds like you intended to say you're not using the Kafka Streams API, since it's Kafka Streams that offers KTable, KStream, and aggregations on them.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. So when you have 5 brokers, a replication factor of 3, and a single partition, 2 of the brokers hold nothing. With two partitions and a replication factor of 3, you can divide 2 * 3 = 6 partition replicas over the 5 brokers.
Even then, one broker holds two replicas while the others hold one, so the load isn't spread evenly. It would be better to have more partitions to get a better spread.
There are other reasons to pick a number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.
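The spread described above can be sketched with the example's own numbers (5 brokers, replication factor 3):

```java
public class ReplicaSpreadSketch {
    // Total replicas to place on brokers = partitions * replicationFactor.
    static int totalReplicas(int partitions, int replicationFactor) {
        return partitions * replicationFactor;
    }

    // Minimum replicas every broker carries when spread round-robin.
    static int minPerBroker(int totalReplicas, int brokers) {
        return totalReplicas / brokers;
    }

    // Brokers that carry one extra replica when the total doesn't divide evenly.
    static int brokersWithExtra(int totalReplicas, int brokers) {
        return totalReplicas % brokers;
    }

    public static void main(String[] args) {
        int brokers = 5, rf = 3;
        // 2 partitions: 6 replicas over 5 brokers -> one broker holds 2.
        int total = totalReplicas(2, rf);
        System.out.println(total + " " + minPerBroker(total, brokers)
                + " " + brokersWithExtra(total, brokers));
        // 5 partitions: 15 replicas divide evenly, 3 per broker.
        System.out.println(minPerBroker(totalReplicas(5, rf), brokers));
    }
}
```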