Where does Kafka store the topic in a multi-node cluster? - java

I have a 3-node Kafka cluster and I am creating a topic on one of the nodes with the below command:
bin/kafka-create-topic.sh --zookeeper host1.com:2181,host2.com:2181,host3.com:2181 --replica 1 --partition 1 --topic test
So now when I push messages to the topic, one of my hosts is getting overloaded with the topic messages, since Kafka stores the messages on disk. I want to know if there is any configuration to distribute the storage across the cluster.
Thanks,

As @om-nom-nom points out, you are creating a topic with a single partition (and a replication factor of 1), so that topic's data will only ever live on a single broker. Even though you have a 3-node setup, the other two nodes will never be used for this topic.
Changing your topic to use multiple partitions is how you distribute a Kafka topic. The Kafka broker doesn't distribute messages to different nodes; it's the producer's responsibility to determine which partition a message goes to. This is something you can determine yourself, or you can let the producer use a round-robin approach to distribute messages across partitions, as @om-nom-nom points out.
In Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and people can use customized partitioners also.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one.
source
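For illustration, here is a minimal producer sketch of the keyed approach described in the quote above (the broker address, topic name and key are placeholders, and it uses the modern Java producer API rather than the 0.7-era client from the question):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: your broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // With a key, the default partitioner hashes it to pick the partition,
            // so records with the same key always land in the same partition.
            producer.send(new ProducerRecord<>("test", "user-42", "some payload"));
            // Without a key, the producer spreads records over partitions itself
            // (round-robin, or sticky per batch on newer client versions).
            producer.send(new ProducerRecord<>("test", "another payload"));
        }
    }
}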

A topic can be sliced into multiple partitions (your config uses just 1), which by default will be distributed among the brokers in round-robin fashion.
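For example, re-using the command from the question with more partitions (the exact script name and flags depend on your Kafka version; on newer versions it is bin/kafka-topics.sh --create ... --partitions 3):
bin/kafka-create-topic.sh --zookeeper host1.com:2181,host2.com:2181,host3.com:2181 --replica 1 --partition 3 --topic test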

Related

Why is a Kafka topic partition not receiving messages?

I have a Kafka cluster with 3 brokers and a topic with 8 partitions.
A producer written in Java using Spring Boot, without a custom rule for load balancing. It means it should do round robin.
The issue is that some partitions are not receiving any messages. I figured this out by checking what the 4 consumers are receiving: even though they are processing all the messages, one consumer is idle all the time because it has received just one message.
What could be the issue?
The Kafka version I'm using is 0.10.1.1.
Additional note: in this case I'm not using replicas for the partitions.
It means it should do round robin.
It will only do round robin if you have no keys in your Kafka messages. Otherwise, the messages are partitioned based on a hash value of the key:
hash(key) % number_of_partitions
It is not unusual that this will cause some partitions to not receive any messages at all. Imagine a case where you are using a key that can only have two different values. In that case, all your data will flow into at most two partitions, independent of the number of partitions in your topic.
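A rough illustration of that effect (the hash below is just a stand-in; Kafka's default partitioner actually uses murmur2 on the serialized key bytes, but the conclusion is the same):
int numberOfPartitions = 8;
String[] keys = {"EUR", "USD", "EUR", "USD", "EUR"}; // only two distinct key values
for (String key : keys) {
    // simplified stand-in for hash(key) % number_of_partitions
    int partition = Math.floorMod(key.hashCode(), numberOfPartitions);
    System.out.println(key + " -> partition " + partition);
}
// Every record lands in at most two of the eight partitions; the other partitions stay empty.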

Apache Kafka Message broadcasting

I am studying Apache Kafka and have some confusion. Please help me to understand the following scenario.
I have a topic with 5 partitions and 5 brokers in a Kafka cluster. I am maintaining my message order in Partition 1 (say P1). I want to broadcast the messages of P1 to 10 consumers.
So my question is: how do these 10 consumers interact with topic partition P1?
This is probably not how you want to use Kafka.
Unless you're being explicit with how you set your keys, you can't really control which partition your messages end up in when producing to a topic. Partitions in Kafka are designed to be more like low-level plumbing, something that exists, but you don't usually have to interact with. On the consumer side, you will be assigned partitions based on how many consumers are active for a particular consumer group at any one time.
One way to get around this is to define a topic to have only a single partition, in which case, of course, all messages will go to that partition. This is not ideal, since Kafka won't be able to parallelize data ingestion or serving, but it is possible.
So, having said that, let's assume that you did manage to put all your messages in partition 1 of a specific topic. When you fire up a consumer of that topic with consumer group id of consumer1, it will be assigned all the partitions for that topic, since that consumer is the only active one for that particular group id. If there is only one partition for that topic, like explained above, then that consumer will get all the data. If you then fire up a second consumer with the same group id, Kafka will notice there's a second consumer for that specific group id, but since there's only one partition, it can't assign any partitions to it, so that consumer will never get any data.
On the other hand, if you fire up a third consumer with a different consumer group id, say consumer2, that consumer will now get all the data, and it won't interfere at all with consumer1 message consumption, since Kafka keeps track of their consuming offsets separately. Kafka keeps track of which offset each particular ConsumerGroupId is at on each partition, so it won't get confused if one of them starts consuming slowly or stops for a while and restarts consuming later that day.
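To make that concrete, here is a minimal consumer sketch (broker address and topic name are placeholders): two processes running this with group.id=consumer1 will split the topic's partitions between them, while a process using group.id=consumer2 independently reads all the data.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: your broker address
        props.put("group.id", "consumer1");                // use "consumer2" for an independent copy of the data
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // Kafka assigns partitions within the group
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}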
Much more detailed information on how Kafka works here: https://kafka.apache.org/documentation/#gettingStarted
And more information on how to use the Kafka consumer at this link:
https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
@mjuarez's answer is absolutely correct; just for brevity I would reduce it to the following:
Don't try to read only from a single partition, because it's a low-level construct and it somewhat undermines the parallelism of Kafka. You're much better off just creating more topics if you need finer separation of data.
I would also add that most of the time a consumer needn't know which partition a message came from, in the same way that I don't eat a sandwich differently depending on which store it came from.
@mjuarez is actually not correct, and I am not sure why his comment is being falsely confirmed by the OP. You can absolutely tell Kafka explicitly which partition a producer record goes to, using the following constructor:
ProducerRecord(
java.lang.String topic,
java.lang.Integer partition, // <--------- !!!
java.lang.Long timestamp,
K key,
V value)
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html#ProducerRecord-java.lang.String-java.lang.Integer-java.lang.Long-K-V-
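Usage is then straightforward, assuming an already-configured KafkaProducer<String, String> named producer (the topic name and partition index are placeholders; the partition just has to exist on the topic):
// The second argument pins the record to partition 0 of "my-topic"; the third is the timestamp.
producer.send(new ProducerRecord<>("my-topic", 0, System.currentTimeMillis(), "some-key", "some-value"));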
So most of what was said after that becomes irrelevant.
Now to address the OP's question directly: you want to accomplish a broadcast. To have a message sent once and read more than once, you would have to have a different consumer group for each reader.
And that use case is an absolutely valid Kafka usage paradigm.
You can accomplish that using RabbitMQ too:
https://www.rabbitmq.com/tutorials/tutorial-three-java.html
... but the way it is done is not ideal because multiple out-of-process queues are involved.

If a consumer group is subscribed to multiple topic partitions, how does Kafka decide which it will read first?

I am trying to build an application that can read through a Kafka topic, but I need it to have a "see previous" button. I know how to seek through a particular partition, but is it possible to go back through all the messages in a topic in the order that they were read in? I am using the Java KafkaConsumer.
Offsets are per partition and there is no ordering between messages across different partitions. You can't go back through all the messages in a topic in the order that they were read in, because Kafka doesn't know the order in which the consumer read messages from different partitions (different partitions may be read from different brokers across the cluster). However, what you could do is do ordered buffering in your app as you read messages in; that would allow you to go back as far as the buffer capacity.
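A rough sketch of that buffering idea, assuming an existing KafkaConsumer<String, String> named consumer (the capacity and the "see previous" handling are up to your application):
// Keep the last `capacity` records in the order this consumer read them,
// so a "see previous" button can step backwards through them.
Deque<ConsumerRecord<String, String>> history = new ArrayDeque<>();
int capacity = 1000; // assumption: how far back you want to be able to step

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : records) {
    if (history.size() == capacity) {
        history.removeFirst(); // drop the oldest entry once the buffer is full
    }
    history.addLast(record);
}
// "See previous" then walks the buffer from the newest record backwards:
Iterator<ConsumerRecord<String, String>> previous = history.descendingIterator();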

Kafka Topic-per-Consumer Configuration

I understand that a Kafka consumer group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumers is actually a consumer group itself (i.e. 1 consumer per group). This is mainly due to synchronisation between different databases, so that the same data exists in each. All I am trying to do is run the same job on different environments as soon as the consumers get a message from the producer (broadcast).
For my case, I don't believe the partitions/load-balancing idea makes any difference. I am going with a topic that has 1 partition and a replication factor of n (n = total consumer groups, or consumers in my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
P.S. I am using only the Producer/Consumer API, as my messaging framework needs to have minimal change/impact on my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of consumers in a consumer group (if there are more consumers than partitions, they will just be idle). There can be fewer consumers than partitions. I wouldn't call 1:1 necessarily ideal; it's the practical limit.
I am going with a topic that has 1 partition and a replication factor of n (n = total consumer groups, or consumers in my case).
I don't see value in having the replication factor equal to the number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It doesn't have anything to do with the number of consumers, since each consumer will only ever be consuming from the leader broker for a given partition.
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
Sounds to me like you intended to say you're not using the Kafka Streams API, since it's Kafka Streams that offers KTable, KStream and aggregations thereon.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. With 5 brokers and a replication factor of 3, a single partition leaves the 2 additional brokers unused. With two partitions and a replication factor of 3, you can spread the 2*3 = 6 partition replicas over the 5 brokers.
But then one broker holds two of those replicas while the others hold one, so the load is still not spread evenly. It would be better to have more partitions to get a better spread.
There are other reasons to pick a particular number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.

Kafka Topic vs Partition topic

I would like to know what the difference is between a simple topic and a partitioned topic. As per my understanding, a topic is partitioned to balance the load; each message has an offset, and the consumer acknowledges to ensure previous messages have been consumed. In case the number of partitions and the number of consumers mismatch, Kafka does a rebalance; does it manage this efficiently?
If multiple topics are created instead of partitions, does that affect operational efficiency?
From the Kafka documentation:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data
Having multiple partitions for any given topic allows Kafka to distribute it across the Kafka cluster. As a result, the requests for handling data from different partitions can be divided among multiple servers in the whole cluster. Also, each partition can be replicated across multiple servers to minimize data loss. Again from the doc page:
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
So having a topic with a single partition won't allow you to use these flexibilities. Also note that in a real-life environment you can have different topics to hold different categories of messages (though it is also possible to have a single topic with multiple partitions, where each partition can hold a specific category of messages by using the message key while producing).
I don't think creating multiple topics instead of partitions will have much impact on the overall performance. But imagine you want to keep track of all the tweets made by users on your site. You can then have one topic named "User_tweet" with multiple partitions, so that while producing messages Kafka can distribute the data across multiple partitions, and on the consumer end you only need one group of consumers pulling data from the same topic. Keeping "User_tweet_1", "User_tweet_2", "User_tweet_3" instead will only make things complex for you both while producing and while consuming the messages.
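For example, a small sketch of that keyed approach (assuming a configured KafkaProducer<String, String> named producer; the user id and tweet text are placeholders):
String userId = "user-42";        // hypothetical user id, used as the message key
String tweetText = "hello world"; // hypothetical tweet payload
// Same key -> same partition, so one user's tweets stay in order,
// while different users are spread across the topic's partitions.
producer.send(new ProducerRecord<>("User_tweet", userId, tweetText));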
