I would like to know the difference between a simple topic and a partitioned topic. As I understand it, a topic is partitioned to balance load; each message has an offset, and the consumer commits offsets to acknowledge that previous messages have been consumed. When the number of partitions and the number of consumers do not match, does the rebalance Kafka performs manage this efficiently?
If multiple topics are created instead of partitions, does that affect operational efficiency?
From the Kafka documentation:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data
Having multiple partitions for a given topic allows Kafka to distribute that topic across the Kafka cluster. As a result, requests for data from different partitions can be served by multiple servers across the cluster. Each partition can also be replicated across multiple servers to minimize data loss. Again, from the documentation:
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
So having a topic with a single partition won't allow you to use these flexibilities. Also note that in a real-life environment you can have different topics to hold different categories of messages (though it is also possible to have a single topic with multiple partitions, where each partition can hold a specific category of messages by using the message key while producing).
I don't think creating multiple topics instead of partitions will have much impact on the overall performance. But imagine you want to keep track of all the tweets made by users on your site. You can then have one topic named "User_tweet" with multiple partitions, so that while producing messages Kafka can distribute the data across the partitions, and on the consumer end you only need one group of consumers pulling data from that single topic. Keeping "User_tweet_1", "User_tweet_2", "User_tweet_3" instead will only make things more complex for you, both while producing and while consuming the messages.
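For illustration, here is a minimal sketch of producing keyed messages to a single multi-partition topic; the broker address, topic name, and user ids are assumptions made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TweetProducer {
    public static void main(String[] args) {
        // Hypothetical broker address; adjust for your cluster.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using the user id as the key: all tweets from the same user hash to
            // the same partition, while overall load is spread across partitions.
            producer.send(new ProducerRecord<>("User_tweet", "user-42", "my first tweet"));
            producer.send(new ProducerRecord<>("User_tweet", "user-7", "hello kafka"));
        }
    }
}
```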
Related
I'm trying to create a priority queue using the Kafka consumer. I read about the bucket pattern and how to use it within one topic, distributing partitions between the buckets.
But what I need to do here is use different topics (high, medium, low) and ensure I only consume from high as long as there are messages in that topic to be consumed.
The Kafka consumer provides a way to pass a list of topics to subscribe to, but it seems to do round-robin balancing between the topics.
Here is a code example I wrote to prove it:
https://github.com/politrons/reactive/blob/master/kafka/src/test/java/com/politrons/kafka/KafkaOrdering.java
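For readers who don't want to follow the link, here is a separate minimal sketch (not the linked code) of the multi-topic subscription being described; the topic names, group id, and broker address are illustrative assumptions:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PriorityConsumer {
    public static void main(String[] args) {
        // Hypothetical connection settings for illustration only.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "priority-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribing to all three priority topics at once; records from the
            // different topics come back interleaved in a single poll loop.
            consumer.subscribe(Arrays.asList("high", "medium", "low"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("topic=%s offset=%d value=%s%n",
                            record.topic(), record.offset(), record.value());
                }
            }
        }
    }
}
```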
What I would like to know is whether there is any mechanism in the Kafka consumer to tell it, when it is subscribed to several topics, to focus on one topic and not read anything from the others while that first topic is still receiving events.
Regards.
Does Kafka have a limitation on the number of simultaneous connections (created with Consumer.createJavaConsumerConnector) for the same topic within the same group?
My scenario is that I need to consume a topic from different processes (not threads), so I need to create lots of high-level consumers.
The number of active consumers within the same consumer group is limited by the number of partitions of the topic. Extra consumers will act as backups and will only start consuming when one of the active consumers goes down and the group is re-balanced.
If you need to consume the same copy of the data within multiple processes, your consumers should be in different consumer groups. There is no limitation on the number of consumer groups you can have.
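As an illustration of the "different consumer groups" point, here is a minimal sketch using the newer KafkaConsumer API rather than the old high-level consumer mentioned in the question; the group ids, topic name, and broker address are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemo {
    // Each process passes its own group id on the command line,
    // e.g. "analytics" or "billing" (hypothetical names).
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("group.id", args.length > 0 ? args[0] : "group-a"); // distinct per process
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                // Every group receives its own full copy of the topic's records.
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(r -> System.out.println(r.value()));
            }
        }
    }
}
```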
The main limitation is the number of partitions the topic has: you can create more consumers than partitions, but the extra ones won't consume anything.
I am trying to build an application that can read through a Kafka topic, but I need it to have a "see previous" button. I know how to seek through a particular partition, but is it possible to go back through all the messages in a topic in the order that they were read in? I am using the Java KafkaConsumer.
Offsets are per partition, and there is no ordering between messages across different partitions. You can't go back through all the messages in a topic in the order that they were read in, because Kafka doesn't know the order in which the consumer read messages from different partitions (different partitions may be read from different brokers across the cluster). However, what you could do is ordered buffering in your app as you read messages in; that would allow you to go back as far as the buffer capacity.
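For illustration, a minimal sketch of the in-app ordered buffering suggested above; the capacity and how the buffer is wired into the poll loop are assumptions:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Keeps the most recent records in the order the consumer read them,
// so a "see previous" button can page backwards up to the buffer capacity.
public class SeenBuffer<K, V> {
    private final int capacity;
    private final Deque<ConsumerRecord<K, V>> buffer = new ArrayDeque<>();

    public SeenBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Call this for every record as it is polled, in poll order.
    public void add(ConsumerRecord<K, V> record) {
        if (buffer.size() == capacity) {
            buffer.removeFirst(); // drop the oldest record once the buffer is full
        }
        buffer.addLast(record);
    }

    // Returns up to 'count' of the most recently read records, newest first.
    public List<ConsumerRecord<K, V>> previous(int count) {
        List<ConsumerRecord<K, V>> out = new ArrayList<>();
        Iterator<ConsumerRecord<K, V>> it = buffer.descendingIterator();
        while (it.hasNext() && out.size() < count) {
            out.add(it.next());
        }
        return out;
    }
}
```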
I understand that Kafka Consumer Group is load-balanced based on how many partitions exist for a specific topic. So the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group which subscribes to the topic.
I have a scenario where each of my consumers is actually a consumer group itself (i.e. 1 consumer per group). This is mainly due to synchronisation between different databases, so that the same data exists in each. All I am trying to do is run the same job in different environments as soon as the consumers get a message from the producer (broadcast).
For me, I don't believe the partitions/load-balancing idea makes any difference. I am going with a topic that has 1 partition and a replication factor of n (n = total consumer groups, or consumers in my case). Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why?
P.S. I am using the Producer/Consumer API only, as my messaging framework needs to have minimal change/impact on my existing application setup.
the ideal combination is 1:1 for the number of partitions for a topic and the number of consumers in a consumer group
To be more precise, the number of partitions limits the number of active consumers in a consumer group (if there are more consumers than partitions, the extra ones will just be idle). There can be fewer consumers than partitions. I wouldn't call 1:1 necessarily ideal; it's the practical limit.
I am going with a topic that has 1 partition and a replication factor of n (n = total consumer groups, or consumers in my case).
I don't see value having replication-factor equal to number of consumer groups. Replication is for resilience, i.e. to prevent data loss if a broker goes down. It doesn't have anything to do with the number of consumers, since each consumer will only ever be consuming from the leader broker for a given partition.
Does anyone think that I should still implement more than 1 partition for my case? If so, could you please mention why.
Partitioning data is for load distribution, both on the broker side and for parallelism on the consumer side. It's easier to set a higher number of partitions from the start, even if you don't think you need it, than to re-partition data later, if/when you discover you could benefit from it. On the other hand, there's no point setting them too high as they come with their own overheads (e.g. CPU load on the broker).
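As a sketch of that "set a higher number of partitions from the start" advice, here is one way a topic can be created with an explicit partition count using the AdminClient API; the topic name, partition count, replication factor, and broker address are assumptions chosen for the example:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions, replication factor 3: a generous partition count up front,
            // since adding partitions later does not redistribute existing data.
            NewTopic topic = new NewTopic("jobs", 12, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```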
P.S. I am not using the Producer/Consumer API since I am not doing Table/Stream related aggregation
Sounds to me like you intended to say you're not using the Kafka Streams API, since it's Kafka Streams that offers KTable, KStream, and aggregations thereon.
Multiple partitions are useful when you run Kafka in a cluster where the number of brokers is larger than the replication factor. So when you have 5 brokers and a replication factor of 3, 2 of the brokers would otherwise not be needed. When you have two partitions with a replication factor of 3, you can divide 2 * 3 = 6 partition replicas over the 5 brokers.
Only now one broker holds two replicas while the others hold one, so the load is not spread evenly. It would be better to have more partitions to get an even spread (for example, 5 partitions with a replication factor of 3 gives 5 * 3 = 15 replicas, exactly 3 per broker).
There are other reasons to pick a number of partitions, but there are a lot of articles about this. What I explained is a good rule of thumb to start with.
I have a 3-node Kafka cluster, and I am creating a topic on one of the nodes with the command below:
bin/kafka-create-topic.sh --zookeeper host1.com:2181,host2.com:2181,host3.com:2181 --replica 1 --partition 1 --topic test
So now, when I push messages to the topic, one of my hosts is getting overloaded with the topic's messages, since Kafka stores the messages on disk. I want to know if there is any configuration to distribute the storage across the cluster.
Thanks,
As @om-nom-nom points out, you are creating a topic with a single partition, so that topic's data will only ever live on a single broker. Even though you have a 3-node setup, the other two nodes will never be used for this topic.
Changing your topic to use multiple partitions is how you distribute a Kafka topic. The Kafka broker doesn't distribute messages to different nodes; it's the producer's responsibility to determine which partition a message goes to. This is something you can determine yourself, or you can let the producer use a round-robin approach to distribute messages across partitions, as @om-nom-nom points out.
In the Kafka producer, a partition key can be specified to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition id given the key, and custom partitioners can also be used.
To reduce # of open sockets, in 0.8.0 (https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning key is not specified or null, a producer will pick a random partition and stick to it for some time (default is 10 mins) before switching to another one.
source
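To illustrate the custom-partitioner option mentioned above, here is a minimal sketch implementing the producer's Partitioner interface; the "vip-" routing rule and class name are purely illustrative assumptions:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Example rule only: routes records whose key starts with "vip-" to partition 0
// and spreads everything else over the remaining partitions by key hash.
public class VipPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        String k = key == null ? "" : key.toString();
        if (numPartitions <= 1 || k.startsWith("vip-")) {
            return 0;
        }
        // Non-VIP keys hash over partitions 1..numPartitions-1.
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}
```

It would then be registered on the producer via the partitioner.class configuration property.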
A topic can be split into multiple partitions (your config uses just 1), which by default will be distributed among the brokers in round-robin fashion.