I have a kafka cluster with 3 brokers and a topic with 8 partitions.
A producer written in java using spring boot and without custom rule for load balancing. It means it should do round robin.
The issue is that there are some partitions there are not receiving messages into it. I figured it out checking what the 4 consumers are receiving and even they are processing all messages there is a consumer idle all the time because it has received just one message.
What could be the issue?
Kafka version I'm using is 0.10.1.1
Additional note in this case I'm not using replicas for the partitions
It means it should do round robin.
It will only do round robin, if you have no keys in your Kafka messages. Otherwise, the messages are partitioned based on a hash value of the key:
hash(key) % number_of_partitions
It is not unusal, that this will cause some partitions to not receive any messages at all. Imagine a case, where you are using a key that can only have two different values. In that case, all your data will flow into only two partitions, independent of the number of partitions in your topic.
Related
I subscribe to Kafka Topics by RegExp. There are 5 topics matching the given pattern. Four of them has 10 partitions. And the last topic has 12 partitions. So, there are 52 partitions at all.
I have 5 running instances and each of them starts 10 consumers. So, there are 50 consumers at all. I expect that the load is spread horizontally across all the consumers. So, each consumer reads 1 partition except two of them because total number of consumers is less than the whole partitions count.
Though the reality is a bit different.
In short, here is what happens
There are 50 consumers subscribing to multiple topics by RegExp. All found topics has 52 partitions totally.
Each consumer tries to subscribe to 1 partition from each found topic. There are also two consumer subscribed to 1 partition from the single topic because the latter has 12 but not 10 partitions.
12 consumers are working whilst 38 remains stale due to unavailable partitions to read.
Is there any way to force Kafka consumer to read 1 partition maximum with RegExp subscription? In this case, I can make all consumers start work. Or maybe there is a different approach that allows to read multiple topics respecting the number of partitions and running consumers?
Well, it's about partition.assignment.strategy Kafka property. The default value is RangeAssignor which leads to assigning partitions on topic basis. That leads to spreading load between consumers unfairly.
We set the property to RoundRobinAssignor and it helped.
Though you should be careful when you deploy new version with different partition.assignment.strategy. Suppose you have 5 running consumers with RangeAssignor strategy. Then you redeploy one with RoundRobinAssignor strategy. In this case, you get an exception on consumer replacement. Because all consumers in one consumer group should provide the same partition.assignment.strategy. The problem is described in this StackOverflow question
So, if you want to change the partition.assignment.strategy, you have several options:
Redeploy all consumers and then deploy new ones.
Specify new group.id for the new consumers.
Both of these ways have pros and cons.
According to producer configs, there are: retries and max.in.flight.requests.per.connection. Suppose that retries > 0 and max.in.flight.requests.per.connection > 1.
Can messages arrive out of order within ONE partition of topic (e.g. if first message has retries, but second message delivered to broker with the first attempt)?
Or do out of order only happen across several partitions of topic, but within partition order is preserved?
If you set retries to more than 0 and max.in.flight.requests.per.connection to more than 1, then yes messages can arrive out of order on the broker even if they are for the same partition.
You can also have duplicates if for example a message is correctly added to the Kafka logs and an error happens when sending the response back to the client.
Since Kafka 0.11, you can use the Idempotent producer to solve these 2 issues. See http://kafka.apache.org/documentation/#semantics
As per latest update documentation, you can have maximum 5 max.in.flight.requests.per.connection and Kafka can maintain order for this.
I understand that Kafka Consumer Group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumer is actually a consumer-group itself (i.e. 1 consumer per group). This mainly due to synchronisation between different databases so that the same data exists. All I am trying to do is run the same job on different environments as soon as the consumer get a message from the producer (broadcast).
For me, I don't believe that partitions/load balancing idea makes any difference. I am going with a topic that has 1 partitions and n Replication-Factor (n = total consumer groups, or consumer for my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
P.S. I am using the Producer/Consumer API only my messaging framework needs to have a minimum change/impact to my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of consumers in a consumer group (if there are more consumers than partitions, they will just be idle). There can be fewer consumers than partitions. I wouldn't call 1:1 as necessarily ideal, it's the practical limit.
I am going with a topic that has 1 partitions and n Replication-Factor (n = total consumer groups, or consumer for my case).
I don't see value having replication-factor equal to number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It doesn't have anything to do with the number of consumers, since each consumer will only ever be consuming from the leader broker for a given partition.
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
Sounds to me you intended to say you're not using Kafka Streams API, since it's Kafka Streams that offers KTable, KStream and aggregations thereon.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. So when you have 5 brokers and a replication of 3 then the 2 additional brokers are not needed. When you have two partitions with a replication of 3 you can divide 2*3 = 6 partitions over 5 brokers.
Only now there is one broker with two partitions while the others have one. So it's not spread evenly. It would be better to have more partitions to get a better spread.
There are other reasons to pick a number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.
I have a 3 node Kafka cluster and I am creating a topic in one of the node with the below command:
bin/kafka-create-topic.sh --zookeeper host1.com:2181,host2.com:2181,host3.com:2181 --replica 1 --partition 1 --topic test
So,now when I push messages to the topic,one of my host is getting overloaded with the topic messages as Kafka stores the messages in disk space. I want to know if there is any configuration to set to distribute the storing process across the cluster.
Thanks,
As #om-nom-nom points out, you are creating a topic with a single partition. So that topic will only ever be on the node that you created it on. So even though you have a 3 node setup, the other two nodes will never be used.
Changing your topic to use multiple partitions is how you distribute a Kafka topic. The Kafka broker doesn't distribute messages to different nodes. It's the producers responsibility to determine which partition a message goes to. This is something you can you determine, or let the producer use a round-robin approach to distribute to partitions, as #om-nom-nom points out.
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one.
source
Topic can be sliced into multiple partitions (your config uses just 1), which by default will be distributed between brokers in round-robin fashion.
I would like to know what is the difference between simple topic & partition topic.As per my understanding to balance the load, topic has been partitioned, Each message will have offset & consumer will acknowledge to ensure previous messages have been consumed.In case no of partition & consumer mismatches the re balance done by kafka does it efficiently manages.
If multiple topics created instead partition does it affect the operational efficiency.
From the kafka documentation
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data
Having multiple partitions for any given topic allows Kafka to distribute it across the Kafka cluster. As a result the request for handling data from different partitions can be divided among multiple servers in the whole cluster. Also each partition can be replicated across multiple servers to minimize the data loss. Again from the doc page
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
So having a topic with a single partition won't allow you to use these flexibilities. Also note in a real life environment you can have different topics to hold different categories of messages (though it is also possible to have a single topic with multiple partitions where each partitions can have specific categories of messages using the messgae key while producing).
I don't think creating multiple topics instead of partitions will have much impact on the overall performace. But imagine you want to keep track of all the tweets made by users in your site. You can then have one topic named "User_tweet" with multiple partitons so that while producing messages Kafka can distribute the data across multiple partitions and on the consumer end you only need to have one group of consumer pulling data from the same topic. Instead keeping "User_tweet_1", "User_tweet_2", "User_tweet_3" will only make things complex for you while both producing and consuming the messages.