I was looking at building a Kafka consumer using the 0.9 API. Can you please explain what is meant by consumer rebalance? What is the difference between the consumer and the coordinator referred to here? Also, can you please explain the consumer split brain problem?
The following is a draft design that uses a highly available consumer
coordinator on the broker side to handle consumer rebalance. By
migrating the rebalance logic from the consumer to the coordinator we
can resolve the consumer split brain problem and make the consumer
client thinner.
Consumer rebalance means that the consumer group redistributes the topic's partitions among the consumers in that group; this happens when a consumer joins or leaves the group.
Each consumer group has a coordinator, a broker that manages the group's membership and partition assignments.
If you want to know more about the new consumer you can read this.
And split brain is a common issue in distributed systems that happens when there is a network partition: different parts of the system cannot communicate with each other and are unaware of it, so each part may carry on as if it were the only one. You can look it up here.
I'm new to Kafka and would like to know the best approach for configuring the topics, partitions, consumer groups and consumer app replicas.
Working on an existing setup, the configuration handed down is as follows:
10 topics
Each topic has its own group i.e. topic1-group1, topic2-group2 and so on.
Each topic has 5 partitions
The Java consumer app has 5 replicas (k8s pods), intentionally the same as the number of partitions, I'm told, which use Spring Kafka's @KafkaListener
Q1. I'd like to know if this is the configuration that will offer the best performance (high throughput and low latency)?
The consumer app sends the messages to only ONE downstream system which makes me think that all the topics (and all their partitions + consumer app replicas) can share a single consumer group (let's call it main-group).
Q2. Would this offer better or worse performance than having dedicated group for each topic?
Sub question:
Q3. Can ONE @KafkaListener work with 10 topics, each with a dedicated consumer group, given that it has only 1 containerGroup and 1 containerFactory parameter? (See the sketch after this question.)
Thanks
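For what it's worth, here is a minimal sketch of what Q3 describes, assuming Spring Kafka with a default listener container factory already configured; the topic names and group id are placeholders. A single @KafkaListener can subscribe to several topics, but it runs under exactly one group id:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class MultiTopicListener {

    // One listener container subscribed to several topics, all under one consumer group.
    // A single @KafkaListener cannot give each topic its own dedicated group.
    @KafkaListener(topics = {"topic1", "topic2", "topic3"}, groupId = "main-group")
    public void listen(String message) {
        // forward to the single downstream system
        System.out.println("Received: " + message);
    }
}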
I have a requirement to implement a health check, and as part of that I have to determine whether the producer will be able to publish messages and the consumer will be able to consume them. For this I have to check that the connection to the cluster is working, which can be done with the "connection_count" metric, but that doesn't give the true picture, especially for a consumer, which is tied to the specific brokers that host its partitions.
The situation with the producer is even trickier, as the producer might be publishing messages to any broker that holds a partition of the topic it is publishing to.
In a nutshell: how do I find the health of the relevant brokers on the producer/consumer side?
Ultimately, I divide the question into a few checks; a code sketch follows the list.
1. Can you reach the broker? AdminClient.describeCluster works for this.
2. Can you describe the topic(s) you are using? AdminClient.describeTopics can do that.
3. Is the ISR count for those topics' partitions at least min.insync.replicas? Extrapolate that from the data in (2).
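A minimal sketch of checks (1)-(3), assuming the standard Java AdminClient; the broker address, topic name, and min.insync.replicas value are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class BrokerHealthCheck {
    public static boolean isHealthy(String topic) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            // Check 1: can we reach the cluster at all?
            admin.describeCluster().nodes().get();
            // Check 2: can we describe the topic?
            TopicDescription description = admin.describeTopics(Collections.singleton(topic))
                    .all().get().get(topic);
            // Check 3: does every partition have enough in-sync replicas?
            int minInsyncReplicas = 2; // assumed broker-side setting
            return description.partitions().stream()
                    .allMatch(p -> p.isr().size() >= minInsyncReplicas);
        }
    }
}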
On the producer side, set at least acks=1 and attach a callback to each send: if the ack callback never fires, the send failed. Alternatively, you could expose JMX data around the buffer size; if the producer's buffer isn't periodically flushed, it is not healthy.
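For instance, a sketch of such an ack-callback probe, assuming the standard Java producer; the broker address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerHealthProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("acks", "1"); // require at least the partition leader's acknowledgement
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("health-topic", "ping"), (metadata, exception) -> {
                // a non-null exception means no ack came back: mark the producer unhealthy
                if (exception != null) {
                    System.err.println("Producer health check failed: " + exception.getMessage());
                }
            });
            producer.flush(); // push the buffered record out before the probe returns
        }
    }
}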
For the consumer, look at the conditions under which a rebalance will happen (such as long processing times between polls); then you can quickly identify what "unhealthy" means for your consumers. Attaching partition assignment and rebalance listeners can help here, as sketched below.
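A sketch of attaching a rebalance listener, assuming the standard Java consumer; the broker address, group id, and topic name are placeholders:

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "health-checked-group"); // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("my-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // fires when a rebalance starts; frequent revocations can flag an unhealthy consumer
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // fires when assignment completes; expose the assignment as a health metric
            }
        });
    }
}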
I've implemented some of these concepts in:
dropwizard-kafka (also has Producer and Consumer checks)
remora
I would like to think Spring has something similar
We have messages which are dependent. For example, say we have 4 messages: M1, M2, M1_update1 (should be processed only after M1 is processed), and M3 (should be processed only after M1 and M2 are processed).
In this example, only M1 and M2 can be processed in parallel; the others have to be sequential. I know messages in one partition of a Kafka topic are processed sequentially. But how do I know that M1 and M2 are processed and that now is the time to push the M1_update1 and M3 messages to the topic? Is Kafka the right choice for this kind of use case? Any insights are appreciated!
Kafka is used as a pub-sub messaging system that is highly scalable and fault tolerant.
I believe using Kafka alone when your messages are interdependent could be a bad choice. The processing you require is condition-based; you probably need a routing engine such as Camel or Drools to achieve the end result.
You're basically describing a message queue that guarantees ordering. Kafka, by design, guarantees ordering only within a single partition, as in the case you mention. In that case, though, you're not taking full advantage of Kafka's ability to maximize throughput by parallelizing data across partitions.
As far as messages being dependent on each other, that would require a logic layer that core Kafka itself doesn't provide. If I understand it correctly, and the processing happens after the message is consumed from Kafka, you would need some sort of notification on the consumer end, which would receive and process M1 and M2 and somehow notify the producer on the other side that it's now OK to send M1_update1 and M3. This is definitely outside the scope of what core Kafka provides. You could still use Kafka to build something like this, but there are probably other solutions that would work better for you.
I am studying Apache Kafka and have some confusion. Please help me understand the following scenario.
I have a topic with 5 partitions and 5 brokers in a Kafka cluster. I am maintaining my message order in partition 1 (say P1). I want to broadcast the messages of P1 to 10 consumers.
So my question is: how do these 10 consumers interact with topic partition P1?
This is probably not how you want to use Kafka.
Unless you're being explicit with how you set your keys, you can't really control which partition your messages end up in when producing to a topic. Partitions in Kafka are designed to be more like low-level plumbing, something that exists, but you don't usually have to interact with. On the consumer side, you will be assigned partitions based on how many consumers are active for a particular consumer group at any one time.
One way to get around this is to define a topic to have only a single partition, in which case, of course, all messages will go to that partition. This is not ideal, since Kafka won't be able to parallelize data ingestion or serving, but it is possible.
So, having said that, let's assume that you did manage to put all your messages in partition 1 of a specific topic. When you fire up a consumer of that topic with consumer group id of consumer1, it will be assigned all the partitions for that topic, since that consumer is the only active one for that particular group id. If there is only one partition for that topic, like explained above, then that consumer will get all the data. If you then fire up a second consumer with the same group id, Kafka will notice there's a second consumer for that specific group id, but since there's only one partition, it can't assign any partitions to it, so that consumer will never get any data.
On the other hand, if you fire up a third consumer with a different consumer group id, say consumer2, that consumer will now get all the data, and it won't interfere at all with consumer1 message consumption, since Kafka keeps track of their consuming offsets separately. Kafka keeps track of which offset each particular ConsumerGroupId is at on each partition, so it won't get confused if one of them starts consuming slowly or stops for a while and restarts consuming later that day.
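To illustrate the consumer1/consumer2 setup above, here is a minimal sketch assuming the standard Java client; the broker address and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BroadcastConsumers {
    // Each distinct group id independently receives every message of the topic,
    // because Kafka tracks consuming offsets per consumer group.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("my-topic"));
        return consumer;
    }

    public static void main(String[] args) {
        KafkaConsumer<String, String> c1 = consumerFor("consumer1");
        KafkaConsumer<String, String> c2 = consumerFor("consumer2"); // does not interfere with c1
    }
}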
Much more detailed information on how Kafka works here: https://kafka.apache.org/documentation/#gettingStarted
And more information on how to use the Kafka consumer at this link:
https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
@mjuarez's answer is absolutely correct; just for brevity I would reduce it to the following:
Don't try to read only from a single partition, because it's a low-level construct and it somewhat undermines the parallelism of Kafka. You're much better off just creating more topics if you need finer separation of data.
I would also add that most of the time a consumer needn't know which partition a message came from, in the same way that I don't eat a sandwich differently depending on which store it came from.
@mjuarez is actually not correct, and I am not sure why his answer is being falsely confirmed by the OP. You can absolutely tell Kafka explicitly which partition a producer record pertains to, using the following constructor:
ProducerRecord(
    java.lang.String topic,
    java.lang.Integer partition,   // <--------- !!!
    java.lang.Long timestamp,
    K key,
    V value)
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html#ProducerRecord-java.lang.String-java.lang.Integer-java.lang.Long-K-V-
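To illustrate, a minimal producer sketch that pins every record to partition 1, using the shorter (topic, partition, key, value) form of that constructor; the broker address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExplicitPartitionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the partition (1) is set explicitly; the key may be null when a partition is given
            producer.send(new ProducerRecord<>("my-topic", 1, null, "ordered message"));
        }
    }
}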
So most of what was said after that becomes irrelevant.
Now to address the OP's question directly: you want to accomplish a broadcast. To have a message sent once and read more than once, you would have to have a different consumer group for each reader.
And that use case is an absolutely valid Kafka usage paradigm.
You can accomplish that using RabbitMQ too:
https://www.rabbitmq.com/tutorials/tutorial-three-java.html
... but the way it is done is not ideal because multiple out-of-process queues are involved.
New to Kafka.
I'm really confused by Kafka's API:
Version 0.9 is completely different from 0.8.
Then there are the SimpleConsumer, the high-level consumer, and the consumer group.
When I instantiate a SimpleConsumer, is it associated with a consumer group? Or is the consumer group an abstraction used only by the high-level consumer?
If I don't care about ordering of messages or duplicates, can I instantiate 2 SimpleConsumers that read from the same partition?
Is there a way to use a SimpleConsumer to read from a topic without specifying partitions?
With Kafka 0.9 there is a new consumer API, as you noted, and the two older consumer APIs still exist but will likely be decommissioned in a future release in favour of the new API.
The consumer group concept relates only to the high-level consumer. It is a helper that coordinates consumer instances reading from the same set of topics, to avoid duplicated messages and to allow parallelism with automatic fail-over in case a consumer instance crashes. When using the simple consumer API, you have to take care of this coordination yourself; therefore you also need to specify which partitions to read from, and nothing prevents you from having multiple consumers reading from the same partition.
I don't know of a good use case where you would need multiple consumers reading from the same partition, though. If you want to consume the same data for different purposes, you can just use the high-level API with multiple consumer group IDs, and they will work independently of each other.
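For reference, a minimal poll loop with the new 0.9 consumer API, where the group coordination described above is handled for you; the broker address, group id, and topic name are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewConsumerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "my-group"); // partitions are assigned automatically per group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("my-topic")); // group management, no manual partitions
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000); // timeout in ms (0.9 API)
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}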