Reading message offsets in Apache Kafka - Java

I am very much new to Kafka and we are using Kafka 0.8.1.
What I need to do is consume a message from a topic. For that, I will have to write a consumer in Java which consumes a message from the topic and then saves it to a database. After a message is saved, an acknowledgement will be sent to the Java consumer. If the acknowledgement is true, the next message should be consumed from the topic. If the acknowledgement is false (which means that, due to some error, the message read from the topic could not be saved to the database), that message should be read again.
I think I need to use the SimpleConsumer to have control over the message offset, and I have gone through the SimpleConsumer example given in this link: https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.
In this example, the offset is evaluated in the run method as 'readOffset'. Do I need to play with that? For example, I could use LatestTime() instead of EarliestTime(), and in case of a false acknowledgement, reset the offset to the previous one using offset - 1.
Is this how I should proceed?

I think you can get along with using the high-level consumer (http://kafka.apache.org/documentation.html#highlevelconsumerapi), which should be easier to use than the SimpleConsumer. I don't think the consumer needs to re-read messages from Kafka on database failure, as the consumer already has those messages and can resend them to the DB or do anything else it sees fit.
High-level consumers store the last offset read from a specific partition in ZooKeeper (based on the consumer group name), so that when a consumer process dies and is later restarted (potentially on another host), it can continue processing messages where it left off. It is possible to autosave this offset to ZooKeeper periodically (see the consumer properties auto.commit.enable and auto.commit.interval.ms), or have it saved by application logic by calling ConsumerConnector.commitOffsets. See also https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example.
I suggest you turn auto-commit off and commit your offsets yourself once you have received the DB acknowledgement. That way you can make sure unprocessed messages are re-read from Kafka in case of consumer failure, and all messages committed to Kafka will eventually reach the DB at least once (but not 'exactly once').
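To illustrate, here is a minimal sketch against the 0.8 high-level consumer API with auto-commit disabled; the ZooKeeper address, topic name, group id and the saveToDatabase call are placeholders, not something from the original question:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181"); // placeholder ZooKeeper address
props.put("group.id", "db-writer");               // consumer group name
props.put("auto.commit.enable", "false");         // commit manually after the DB acknowledgement

ConsumerConnector consumer =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
Map<String, List<KafkaStream<byte[], byte[]>>> streams =
        consumer.createMessageStreams(Collections.singletonMap("my-topic", 1));
ConsumerIterator<byte[], byte[]> it = streams.get("my-topic").get(0).iterator();

while (it.hasNext()) {
    byte[] message = it.next().message();
    while (!saveToDatabase(message)) {
        // hypothetical DB call returning the acknowledgement; retry (or back off) until it is true
    }
    consumer.commitOffsets(); // advance the stored offset only after the DB acknowledged the write
}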

Related

Should I ensure that Kafka messages are sent successfully before deleting the data?

I need to read data from the database, send it to Kafka, and then delete the data (which was successfully sent) from the database. I thought to do it straightforwardly:
public void syncData() {
    List<T> data = repository.findAll();
    data.forEach(value -> kafkaTemplate.send(topicName, value));
    repository.deleteAll(data);
}
But I have never worked with Kafka before, and I am confused by the kafkaTemplate.send operation. As the method returns a ListenableFuture, the iteration in data.forEach might finish before all the messages are actually sent to the broker. Thus, I might delete the data before it is really sent. What if, for some reason, some messages are not sent? Say I have 10 records, and starting from the 7th the broker goes down.
Will Kafka throw an exception if a message is not sent?
Should I introduce an extra logic to ensure that all messages are sent before going to the next stage of deleting the data?
P.S. I use Kafka with Spring Boot
You should implement a callback that will trigger when the producer either succeeds or fails to write the data to Kafka before deleting it from the DB.
https://docs.spring.io/spring-kafka/docs/1.0.0.M2/reference/html/_reference.html
On top of this, you can set required acks to ALL so that every broker acknowledges the messages before they are considered sent.
Also, a little tidbit worth knowing in this context: acks=all does not mean all assigned replicas; it means all in-sync replicas for the partition need to acknowledge the write. So it is important to have a sensible min.insync.replicas setting for this as well. If min.insync.replicas is 1, then in a very strict sense acks=all still only guarantees that one broker saw the write. If you then lose that one broker, you lose the write. That is obviously not going to be a common situation, but it is one that you should be aware of.
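As a rough sketch (not the only way to wire it up), the callback approach could look like the following, reusing the repository, topicName and kafkaTemplate from the question; props and log are placeholders, and recent Spring Kafka versions return a CompletableFuture from send() instead of a ListenableFuture, so the callback wiring differs slightly there:

// In the producer configuration: require all in-sync replicas to acknowledge a write.
props.put(ProducerConfig.ACKS_CONFIG, "all");

// Send each record and delete it only once Kafka has acknowledged the write.
public void syncData() {
    List<T> data = repository.findAll();
    data.forEach(value ->
            kafkaTemplate.send(topicName, value).addCallback(
                    result -> repository.delete(value),                // acknowledged: safe to delete
                    ex -> log.error("Failed to send {}", value, ex))); // keep the row and retry later
}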
Another option is the outbox pattern (as the safe way of doing this).
Also, some directions that might be helpful: investigate how the replication factor of a topic relates to the number of brokers, get familiar with the min.insync.replicas broker setting, and then read up on the acks setting for the client (producer) and what it means in terms of communication with the broker. For restarting at the correct data position when something bad happens to your application or database connection, you can get some inspiration from the Kafka Connect library (and maybe use it as a separately deployed db-polling service).
One of the strategies would be to keep the Future objects that are returned and monitor them (possibly in a separate thread). Once all of the tasks complete, you can either delete the records that were successfully sent or write the IDs that need to be deleted to the DB, and then have a scheduled task (once per hour, per day, or whatever period fits you) delete all of those IDs.
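A minimal sketch of that strategy, again assuming the repository, topicName, kafkaTemplate and a log field from the question's context (the 30-second timeout is an arbitrary placeholder); it blocks on each future and deletes only the records whose send succeeded:

public void syncData() {
    List<T> data = repository.findAll();
    List<T> sent = new ArrayList<>();
    for (T value : data) {
        try {
            kafkaTemplate.send(topicName, value).get(30, TimeUnit.SECONDS); // wait for the broker ack
            sent.add(value);
        } catch (Exception e) {
            log.error("Failed to send {}", value, e); // leave the record in the DB for a later retry
        }
    }
    repository.deleteAll(sent); // delete only what Kafka acknowledged
}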

When can a Flink job consume from Kafka?

We have a Flink job which has the following topology:
source -> filter -> map -> sink
We set a live (ready) status in the sink operator's open() override. After we get that status, we send events. Sometimes the job cannot consume the events that were sent early.
We want to know the exact time/step that we can send data which will not be missing.
It looks like you want to ensure that no message is missed for processing. Kafka will retain your messages, so there is no need to send messages only when the Flink consumer is ready. You can simplify your design by avoiding the status message.
Any Kafka consumer (not just the Flink connector) will have an offset associated with it on the Kafka server to track the last message that was consumed.
From the Kafka docs:
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5.
In your Flink Kafka Connector, specify the offset as the committed offset.
OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)
This will ensure that if your Flink Connector is restarted, it will consume from the last position that it left off, before the restart.
If for some reason, the offset is lost, this will read from the beginning (earliest message) in your Kafka topic. Note that this approach will cause you to reprocess the messages.
There are many more offset strategies you can explore to choose the right one for you.
Refer - https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#starting-offset
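For illustration, a minimal sketch of a KafkaSource built with that initializer; the bootstrap servers, topic, group id and the surrounding StreamExecutionEnvironment env are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("input-topic")
        .setGroupId("my-flink-job")
        // resume from the committed offset; fall back to the earliest record if none exists
        .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

DataStream<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");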

Reading all messages from a Kafka topic

I have the following use case:
I have two Kafka topics: one is meant to be used as a stream of incoming messages to be processed, and the other is meant as a store of records used to bootstrap the application to its initial state.
Is there a way to do the following:
Read all messages from the state topic when the application starts up and store every ConsumerRecord from it in memory, bootstrapping the application to its initial state
Only after all of those messages have been read, allow the ConsumerRecords from the stream topic to be processed
As there may be additional records on the state topic, incorporate them into the application's state while the application is running, without having to restart the application
Thanks!
Start your bootstrap consumer first.
Read the state topic until a particular offset is reached, or, if you want the end, read until a poll returns no records (this is not the best way; comparing the consumer's position against the end offsets is safer, as in the sketch below). If you want to start at a particular offset every time, you have to use seek. Also use a unique consumer group id for this, since you want all the records. You might want to handle the rebalance case appropriately.
Then close that consumer, start the stream consumer, and process the data.
Using KTables with Kafka Streams might be better, but I am not familiar with it.
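A hedged sketch of the bootstrap phase with the plain Java consumer; the topic name, servers, deserializers and the in-memory state map are placeholders, and manual assignment is used here, which sidesteps rebalances entirely:

import java.time.Duration;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "bootstrap-" + UUID.randomUUID()); // unique group id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

Map<String, String> state = new HashMap<>();
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    List<TopicPartition> partitions = consumer.partitionsFor("state-topic").stream()
            .map(p -> new TopicPartition(p.topic(), p.partition()))
            .collect(Collectors.toList());
    consumer.assign(partitions);          // manual assignment: no consumer-group rebalances
    consumer.seekToBeginning(partitions);
    Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

    // Poll until the consumer's position reaches the end offset captured above for every partition.
    while (partitions.stream().anyMatch(tp -> consumer.position(tp) < endOffsets.get(tp))) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            state.put(record.key(), record.value()); // build the in-memory bootstrap state
        }
    }
}
// Now start the stream-topic consumer and begin normal processing.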

Client Healthcheck: Check for consumer/producer if broker is down

I have a requirement to implement a healthcheck, and as part of that I have to find out whether the producer will be able to publish messages and the consumer will be able to consume them. For this I have to check that the connection to the cluster is working, which can be done using the "connection_count" metric, but that doesn't give a true picture, especially for the consumer, which is tied to the particular brokers holding the partitions assigned to it.
The situation with the producer is even trickier, as the producer might be publishing messages to any broker which holds a partition of the topic it is publishing to.
In a nutshell, how do I find the health of the relevant brokers on the producer/consumer side?
Ultimately, I divide the question into a few checks.
Can you reach the broker? AdminClient.describeCluster works for this
Can you describe the topic(s) you are using? AdminClient.describeTopics can do that
Is the ISR list for those topics higher than min.insync.replicas? Extrapolate that from (2). A sketch of these three checks follows below.
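A hedged sketch of checks (1)-(3) with the standard Kafka AdminClient; the broker address, topic name, timeout and the hard-coded minimum ISR value are placeholders:

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");

try (AdminClient admin = AdminClient.create(props)) {
    // (1) Can we reach the cluster at all?
    int brokerCount = admin.describeCluster().nodes().get().size();

    // (2) Can we describe the topic we depend on?
    TopicDescription description =
            admin.describeTopics(Collections.singleton("my-topic")).all().get().get("my-topic");

    // (3) Does every partition have enough in-sync replicas?
    int minIsr = 2; // in a real check, read min.insync.replicas from the topic/broker config
    boolean healthy = description.partitions().stream()
            .allMatch(p -> p.isr().size() >= minIsr);

    System.out.printf("brokers=%d, healthy=%s%n", brokerCount, healthy);
} catch (ExecutionException | InterruptedException e) {
    System.out.println("unhealthy: " + e.getMessage()); // timeouts/failures count as unhealthy
}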
On the producer side, if you set at least acks=1, then a send that never gets an ack callback (or, exposing JMX data around the buffer size, a producer buffer that isn't periodically flushed) means the producer is not healthy.
For the consumer, look at the conditions under which a rebalance will happen (such as long processing times between polls), then you can quickly identify what it means to be "unhealthy" for them. Attaching partition assignment + rebalance listeners can help here.
I have written up some of these concepts in:
dropwizard-kafka (also has Producer and Consumer checks)
remora
I would like to think Spring has something similar

When a consumer gets a message from a channel in RabbitMQ, where do prefetched messages reside?

I have the following configuration for RabbitMQ:
prefetchCount:1
ack-mode:auto.
I have one exchange, one queue attached to that exchange, and one consumer attached to that queue. As per my understanding, the following steps will happen if the queue has multiple messages.
The queue writes data on a channel.
As ack-mode is auto, as soon as the queue writes a message on the channel, the message is removed from the queue.
The message comes to the consumer, and the consumer starts working on that data.
As the queue has got the acknowledgement for the previous message, the queue writes the next data on the channel.
Now my doubt is: suppose the consumer is not finished with the previous data yet. What will happen with the next data the queue has written on the channel?
Also, suppose prefetchCount is 10 and I have just one consumer attached to the queue; where will these 10 messages reside?
The scenario you have described is one that is mentioned in the documentation for RabbitMQ, and elaborated in this blog post. Specifically, if you set a sufficiently large prefetch count, and have a relatively small publish rate, your RabbitMQ server turns into a fancy network switch. When acknowledgement mode is set to automatic, prefetch limiting is effectively disabled, as there are never unacknowledged messages. With automatic acknowledgement, the message is acknowledged as soon as it is delivered. This is the same as having an arbitrarily large prefetch count.
With prefetch >1, the messages are stored within a buffer in the client library. The exact data structure will depend upon the client library used, but to my knowledge, all implementations store the messages in RAM. Further, with automatic acknowledgements, you have no way of knowing when a specific consumer actually read and processed a message.
So, there are a few takeaways here:
Prefetch limit is irrelevant with automatic acknowledgements, as there are never any unacknowledged messages, thus
Automatic acknowledgements don't make much sense when using a consumer
Sufficiently large prefetch when auto-ack is off, or any use of auto-ack on, will result in the message broker not doing any queuing, and instead doing only routing.
Now, here's a little bit of expert opinion. I find the whole notion of a message broker that "pushes" messages out to be a little backwards, for this very reason: it's difficult to configure properly, and it is unclear what the benefit is. A queue system is a natural fit for a pull-based system. The processor can ask the broker for the next message when it is done processing the current message. This approach will ensure that load is balanced naturally and that messages don't get lost when processors disconnect or get knocked out.
Therefore, my recommendation is to drop the use of consumers altogether and switch over to using basic.get.
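A minimal sketch of that pull-based approach with the RabbitMQ Java client: fetch one message at a time with basic.get and acknowledge it only after processing. The queue name, host and the process() handler are placeholders:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

void drainQueue() throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost");

    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {
        while (true) {
            GetResponse response = channel.basicGet("work-queue", false); // autoAck = false
            if (response == null) {
                break; // queue is empty; poll again later
            }
            process(response.getBody());                                      // hypothetical handler
            channel.basicAck(response.getEnvelope().getDeliveryTag(), false); // ack only after processing
        }
    }
}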
