I have requirement to implement healthcheck and as part of that I have to find if producer will be able to publish message and consumer will be able to consumer message, for this I have to check that connection to cluster is working which can be checked using "connection_count" metric but that doesn't give true picture especially for consumer which will be tied to certain brokers on which partition for this consumer is.
Situation with producer is even more tricky as Producer might be publishing the message to any broker which holds the partition for topic on which producer is publishing.
In nutshell, how do I find the health of relevant brokers on producer/consumer sude.
Ultimately, I divide the question into a few checks.
Can you reach the broker? AdminClient.describeCluster works for this
Can you descibe the Topic(s) you are using? AdminClient.describeTopic can do that
Is the ISR list for those topics higher than min.in.sync.replicas? Extrapolate data from (2)
On the producer side, if you set at least acks=1, and there is no ack callback, or you could expose JMX data around the buffer size and if the producer's buffer isn't periodically flushed, then it is not healthy.
For the consumer, look at the conditions under which a rebalance will happen (such as long processing times between polls), then you can quickly identify what it means to be "unhealthy" for them. Attaching partition assignment + rebalance listeners can help here.
Some of these concepts I've written between
dropwizard-kafka (also has Producer and Consumer checks)
remora
I would like to think Spring has something similar
Related
I'm working on a process that collect data from IBM MQ and process it to a kafka topic.
To make sure not loosing any message,I need to commit my JMS message only after making sure my message is being sent and received by kafka broker.
I don't want to use synchronous kafka producer (waiting on future.get()) because of the performance impact it may have,instead I want to commit my JMS message inside the callback I'm providing my kafka producer.
For this to work correctly, I need the garantee that ack will be received in the same order of my produced messages (first ack corresponds to the first message being sent..).
Is my assumption correct?
These are the producer configs you want for ordered Kafka producer messages
enable.idempotence=true
acks=all
max.in.flight.requests.per.connection=5
retries=2147483647
The last two are defaults, so you don't explicitly need to set them.
Details - https://developer.confluent.io/tutorials/message-ordering/kafka.html
However, there will be no guaranteed order of Kafka producer callbacks unless you use a synchronous producer
collect data from IBM MQ and process it to a kafka topic
You can use Kafka Connect for this rather than writing your own producer
We need to talk little about the messages processing guarantee here.
If we are after the exactly-once processing semantics and messaging order, then the consumers must be configured with isolation.level="read_committed" and producers have to be configured with retries=Integer.MAX_VALUE, enable.idempotence=true, and max.in.flight.requests.per.connection=1 per default.
Also, setting max.in.flight.requests.per.connection=1, will guarantee that messages will be written to the broker in the order in which they were sent, even when retries occur.
I am studying Apache-kafka and have some confusion. Please help me to understand the following scenario.
I have a topic with 5 partitions and 5 brokers in a Kafka cluster. I am maintaining my message order in Partition 1(say P1).I want to broadcast the messages of P1 to 10 consumers.
So my question is; how do these 10 consumers interact with topic partition p1.
This is probably not how you want to use Kafka.
Unless you're being explicit with how you set your keys, you can't really control which partition your messages end up in when producing to a topic. Partitions in Kafka are designed to be more like low-level plumbing, something that exists, but you don't usually have to interact with. On the consumer side, you will be assigned partitions based on how many consumers are active for a particular consumer group at any one time.
One way to get around this is to define a topic to have only a single partition, in which case, of course, all messages will go to that partition. This is not ideal, since Kafka won't be able to parallelize data ingestion or serving, but it is possible.
So, having said that, let's assume that you did manage to put all your messages in partition 1 of a specific topic. When you fire up a consumer of that topic with consumer group id of consumer1, it will be assigned all the partitions for that topic, since that consumer is the only active one for that particular group id. If there is only one partition for that topic, like explained above, then that consumer will get all the data. If you then fire up a second consumer with the same group id, Kafka will notice there's a second consumer for that specific group id, but since there's only one partition, it can't assign any partitions to it, so that consumer will never get any data.
On the other hand, if you fire up a third consumer with a different consumer group id, say consumer2, that consumer will now get all the data, and it won't interfere at all with consumer1 message consumption, since Kafka keeps track of their consuming offsets separately. Kafka keeps track of which offset each particular ConsumerGroupId is at on each partition, so it won't get confused if one of them starts consuming slowly or stops for a while and restarts consuming later that day.
Much more detailed information here on how Kafka works here: https://kafka.apache.org/documentation/#gettingStarted
And more information on how to use the Kafka consumer at this link:
https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
#mjuarez's answer is absolutely correct - just for brevity I would reduce it to the following;
Don't try and read only from a single partition because it's a low level construct and it somewhat undermines the parallelism of Kafka. You're much better off just creating more topics if you need finer separation of data.
I would also add that most of the time a consumer needn't know which partition a message came from, in the same way that I don't eat a sandwich differently depending on which store it came from.
#mjuarez is actually not correct and I am not sure why his comment is being falsely confirmed by the OP. You can absolutely explicitly tell Kafka which partition a producer record pertains to using the following:
ProducerRecord(
java.lang.String topic,
java.lang.Integer partition, // <--------- !!!
java.lang.Long timestamp,
K key,
V value)
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html#ProducerRecord-java.lang.String-java.lang.Integer-java.lang.Long-K-V-
So most of what was said after that becomes irrelevant.
Now to address the OP question directly: you want to accomplish a broadcast. To have a message sent once and read more than once you would have to have a different consumer group for each reader.
And that use case is an absolutely valid Kafka usage paradigm.
You can accomplish that using RabbitMQ too:
https://www.rabbitmq.com/tutorials/tutorial-three-java.html
... but the way it is done is not ideal because multiple out-of-process queues are involved.
I'm developing a software that uses Apache Kafka. I've got one consumer that subscribed to multiple topics, I'd like to know if there is an order for receiving messages from those topics. I tried some combination on my computer but I need to be sure about this.
Example
Consumer sub to topic1 and topic2
Producer1 write something on topic1
Producer2 write something on topic2
Producer1 write something on topic1
When the consumer polls, it receives a list of records containing first the messages from the first topic that he subscribed and then the messages from the other topic.
I'd like to know if it is always like this, i.e. the messages are in order like the topics that I subscribed.
Thanks
[EDIT] I'd like to specify that I have the two topics with one partition each, and only one producer and one consumer. I need to read first all the messages from the first topic and then the messages from the other topic
Kafka gives you only the guarantee of messages ordering inside a partition. It means that even with only one topic but more than one partitions you have no guarantee that messages are received in the same order they are sent.
Regarding your use case with two topics there is no relation between subscription order to the topics and messages ordering even because if the cluster has more than one node, the topic partition leader will be on different brokers and the client receives messages over different connections. Btw even with only one broker with all topics/partitions on that you can't have the guarantee you are describing.
No. Message ordering is only preserved within partitions (not even within topics).
If you need stronger ordering guarantees, you have to re-arrange messages in your application, for example using a timestamp (and a sufficiently large window buffer to catch all the ones that arrive out-of-order). Support for this has improved a bit with the recent addition of timestamps for all messages by Kafka itself, but the principle remains the same.
Why not first subscribe to the first topic and do a poll, and then subscribe to the other topic and do another poll? Without this, I don't think there is any guarantee in which order you receive messages from the two topics.
I have below configuration for rabbitmq
prefetchCount:1
ack-mode:auto.
I have one exchange and one queue is attached to that exchange and one consumer is attached to that queue. As per my understanding below steps will be happening if queue has multiple messages.
Queue write data on a channel.
As ack-mode is auto,as soon as queue writes message on channel,message is removed from queue.
Message comes to consumer,consumer start performing on that data.
As Queue has got acknowledgement for previous message.Queue writes next data on Channel.
Now,my doubt is,Suppose consumer is not finished with previous data yet.What will happen with that next data queue has written in channel?
Also,suppose prefetchCount is 10 and I have just once consumer attached to queue,where these 10 messages will reside?
The scenario you have described is one that is mentioned in the documentation for RabbitMQ, and elaborated in this blog post. Specifically, if you set a sufficiently large prefetch count, and have a relatively small publish rate, your RabbitMQ server turns into a fancy network switch. When acknowledgement mode is set to automatic, prefetch limiting is effectively disabled, as there are never unacknowledged messages. With automatic acknowledgement, the message is acknowledged as soon as it is delivered. This is the same as having an arbitrarily large prefetch count.
With prefetch >1, the messages are stored within a buffer in the client library. The exact data structure will depend upon the client library used, but to my knowledge, all implementations store the messages in RAM. Further, with automatic acknowledgements, you have no way of knowing when a specific consumer actually read and processed a message.
So, there are a few takeaways here:
Prefetch limit is irrelevant with automatic acknowledgements, as there are never any unacknowledged messages, thus
Automatic acknowledgements don't make much sense when using a consumer
Sufficiently-large prefetch when auto-ack is off, or any use of autoack = on will result in the message broker not doing any queuing, and instead doing routing only.
Now, here's a little bit of expert opinion. I find the whole notion of a message broker that "pushes" messages out to be a little backwards, and for this very reason- it's difficult to configure properly, and it is unclear what the benefit is. A queue system is a natural fit for a pull-based system. The processor can ask the broker for the next message when it is done processing the current message. This approach will ensure that load is balanced naturally and the messages don't get lost when processors disconnect or get knocked out.
Therefore, my recommendation is to drop the use of consumers altogether and switch over to using basic.get.
I was looking at building a Kafka consumer using 0.9 API version. Please can you explain what is meant by consumer rebalance? What is the difference between consumer and co-ordinator referred to here? Also please can you explain consumer split brain problem?
The following is a draft design that uses a high-available consumer
coordinator at the broker side to handle consumer rebalance. By
migrating the rebalance logic from the consumer to the coordinator we
can resolve the consumer split brain problem and help thinner the
consumer client.
Consumer rebalance means that the consumer groups is rebalancing the partitions between the consumers of that consumer groups, this happens when a new consumer enters or leaves the group.
Each consumer group has a co-ordinator that coordinates the group basically.
If you want to know more about the new consumer you can read this.
And split brain is a common issue with distribuited systems that happens when there is network partition and different parts of the system can not communicate between each other and don't realize about this. You can look it up here