How to reset Kafka offsets to match tail position? - java

We're using Storm with Kafka and ZooKeeper. We had a situation where we had to delete some topics and recreate them with different names. Our Kafka spouts stayed the same, aside from now reading from the new topic names. However, the spouts are now using the offsets from the old topic partitions when trying to read from the new topics. So the tail position of my-topic-name partition 0 will be 500, but the offset will be something like 10000.
Is there a way to reset the offset position so it matches the tail of the topic?

There are multiple options (Storm's KafkaSpout does not provide any API to define the starting offset).
1) If you want to consume from the tail of the log, you can delete the old offsets. How to do that depends on your Kafka version:
a) (pre 0.9) you can manipulate ZK directly (which is a little tricky);
b) (0.9+) you can try to delete the offsets from the topic __consumer_offsets (which is also tricky and might delete other offsets you want to preserve, too).
Once no offsets are left, you can restart your spout with the auto offset reset policy "latest" (new consumer) or "largest" (old consumer), depending on your Kafka version.
2) As an alternative (which I would recommend), you can write a small client application that uses seek() to set the offsets the way you need them and commitSync() to commit them. This client must use the same group ID as your KafkaSpout and must subscribe to the same topic(s). Furthermore, you need to make sure that this client is the only running member of the consumer group, so it gets all partitions assigned. For this, you can either:
a) seek to the end of the log and commit, or
b) commit an invalid offset (like -1) and rely on the auto offset reset configuration "latest" or "largest" (depending on your Kafka version).
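The "seek to the end and commit" option could look roughly like the sketch below. It is only an illustration under several assumptions: the broker address, group ID and topic name are placeholders, kafka-clients 0.10.0 or newer is on the classpath, the spout is stopped while the tool runs, and the spout commits its offsets to Kafka under that group ID. Note that the sketch uses assign() instead of subscribe(), so it takes all partitions explicitly rather than waiting for a group rebalance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ResetOffsetsToTail {
    public static void main(String[] args) {
        // Placeholders (assumptions): adjust to your cluster, spout group and topic.
        String bootstrapServers = "localhost:9092";
        String groupId = "my-spout-group"; // must be the group ID the spout commits under
        String topic = "my-topic-name";

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Collect every partition of the topic and take them all explicitly.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor(topic)) {
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            consumer.assign(partitions);

            // Move to the tail of each partition; position() resolves the actual end offset.
            consumer.seekToEnd(partitions);
            Map<TopicPartition, OffsetAndMetadata> tailOffsets = new HashMap<>();
            for (TopicPartition tp : partitions) {
                tailOffsets.put(tp, new OffsetAndMetadata(consumer.position(tp)));
            }

            // Overwrite the group's committed offsets with the tail positions.
            consumer.commitSync(tailOffsets);
        }
    }
}
```

After this has run, restarting the spout with the same group ID should make it pick up the committed tail offsets instead of the stale ones.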
For Kafka Streams, there is an "Application Reset Tool" that manipulates committed offsets in a similar way. If you want some details, you can read this blog post: http://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
(disclaimer: I am the author of the post and it is about Kafka Streams -- nevertheless, the underlying offset manipulation ideas are the same)

Related

Is there any way to get the oldest available offset for a Kafka topic

I have a requirement where I store the offset up to which the system has read data. The next time the system starts reading from Kafka, I need to read the data between the older offset stored in the system and the newest offset. But the older offset might be invalid due to the Kafka retention policy. So if we specify the older offset in the Kafka consumer, what will the behavior be? Also, is there any way to get the oldest offset value for a particular topic/partition so that we can start reading from it?
Simply perform seekToBeginning during startup.
With Spring for Apache Kafka, implement ConsumerSeekAware or, preferably, extend AbstractConsumerSeekAware, and call seekToBeginning in onPartitionsAssigned.
This depends on how you configure your consumer, specifically the auto.offset.reset parameter. If you set it to earliest, your consumer will start from the earliest available offset whenever the offset you try to consume from is not valid.
This way you don't need to look up the oldest offset value yourself, since the consumer falls back to it automatically.
You can find more details here.
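The fallback described above can be written down as plain logic. The following is only a sketch of the decision, not consumer code: in a real application the consumer and broker apply this rule for you, and the two bounds would come from KafkaConsumer.beginningOffsets() and endOffsets(). The class and method names are hypothetical.

```java
final class StartOffsetResolver {

    /**
     * Mimics the "earliest" reset policy: keep the stored offset while it is still
     * inside [earliestOffset, endOffset], otherwise fall back to the earliest offset.
     */
    static long resolve(long storedOffset, long earliestOffset, long endOffset) {
        if (earliestOffset > endOffset) {
            throw new IllegalArgumentException("earliestOffset must not exceed endOffset");
        }
        boolean stillValid = storedOffset >= earliestOffset && storedOffset <= endOffset;
        return stillValid ? storedOffset : earliestOffset;
    }

    public static void main(String[] args) {
        System.out.println(resolve(120, 100, 500)); // prints 120: stored offset is still retained
        System.out.println(resolve(10, 100, 500));  // prints 100: offset 10 was removed by retention
    }
}
```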

Get last record offset in kafka partition

I want to get the last record offset in a topic partition. There is an endOffsets method in the consumer, and usually endOffsets - 1 works fine. But with a transactional producer, the topic may contain offsets without records, and endOffsets - 1 will then point to an offset without a record. So, how should I compute the last record offset in this case?
More interestingly, what if I have both a simple and a transactional producer for my topic? Is there any reliable way to get the last record offset while ignoring all this complexity?
I ended up realizing that there is no reliable and simple way to do that in the current version of the java consumer. I created a feature request for that in Kafka's issue tracker: https://issues.apache.org/jira/browse/KAFKA-10009

How does Kafka store offsets for each topic?

While polling Kafka, I have subscribed to multiple topics using the subscribe() function. Now, I want to set the offset from which I want to read from each topic, without resubscribing after every seek() and poll() from a topic. Will calling seek() iteratively over each of the topic names before polling for data achieve the result?
How exactly are the offsets stored in Kafka?
I have one partition per topic and just one consumer to read from all topics.
"How does Kafka store offsets for each topic?"
Kafka has moved the offset storage from ZooKeeper to the Kafka brokers. The reason, quoted from the design documentation, is:
"Zookeeper is not a good way to service a high-write load such as offset updates because zookeeper routes each write through every node and hence has no ability to partition or otherwise scale writes. We have always known this, but chose this implementation as a kind of "marriage of convenience" since we already depended on zk."
Kafka stores the offset commits in a topic. When a consumer commits an offset, Kafka publishes an offset commit message to a "commit-log" topic (the internal __consumer_offsets topic) and keeps an in-memory structure that maps group/topic/partition to the latest offset for fast retrieval. More design information can be found on the page about offset management.
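The "commit log plus in-memory map" design described above can be illustrated with a toy model. This is a simplified sketch with hypothetical names, not Kafka's actual broker code: a list stands in for the offsets topic and a HashMap plays the role of the in-memory lookup structure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.OptionalLong;

final class OffsetCommitLogSketch {

    /** Identifies one committed position: consumer group, topic and partition. */
    static final class Key {
        final String group;
        final String topic;
        final int partition;

        Key(String group, String topic, int partition) {
            this.group = group;
            this.topic = topic;
            this.partition = partition;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return partition == k.partition && group.equals(k.group) && topic.equals(k.topic);
        }

        @Override public int hashCode() {
            return Objects.hash(group, topic, partition);
        }
    }

    // Append-only history of commits (stands in for the offsets topic).
    private final List<String> commitLog = new ArrayList<>();
    // Latest committed offset per group/topic/partition, for fast retrieval.
    private final Map<Key, Long> latest = new HashMap<>();

    void commit(String group, String topic, int partition, long offset) {
        commitLog.add(group + "/" + topic + "/" + partition + " -> " + offset);
        latest.put(new Key(group, topic, partition), offset);
    }

    OptionalLong committed(String group, String topic, int partition) {
        Long offset = latest.get(new Key(group, topic, partition));
        return offset == null ? OptionalLong.empty() : OptionalLong.of(offset);
    }

    int logSize() {
        return commitLog.size();
    }
}
```

Every commit is appended to the log, but a lookup only touches the map, which is why a separate offset is kept for each topic and partition under each consumer group.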
"Now, I want to set the offset from which I want to read from each topic, without resubscribing after every seek() and poll() from a topic."
Since Kafka 0.11, the admin tools include an option to reset offsets. The consumer group must be inactive (no running members) while the reset is executed:
kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group your-consumer-group --reset-offsets --to-offset 1 --all-topics --execute
There are more options you can use.

Kafka offsetcommit request with high level consumer API

I would like to use Kafka high level consumer API, and at the same time I would like to disable auto commit of offsets. I tried to achieve this through the following steps.
1) auto.commit.enable = false
2) offsets.storage = kafka
3) dual.commit.enabled = false
I created an offset manager, which periodically creates an offset commit request to Kafka and commits the offset.
Still, I have the following questions:
1) Does the high level consumer API automatically fetch the offset from Kafka storage and initialize itself with that offset? Or should I use the simple consumer API to achieve this?
2) Is the Kafka-based storage for offsets replicated across all brokers? Or is it maintained on only one broker?
"I created an offset manager, which periodically creates an offset commit request to Kafka and commits the offset."
You don't need to do that if you are using the high level consumer, because it provides methods to commit the offsets manually. The javadoc (under Manual Offset Control) has examples of how to do that.
"1) Does the high level consumer API automatically fetch the offset from Kafka storage and initialize itself with that offset? Or should I use the simple consumer API to achieve this?"
The high level consumer will take care of fetching the last committed offset when you restart it, so you can resume consuming from where you left off.
"2) Is the Kafka-based storage for offsets replicated across all brokers? Or is it maintained on only one broker?"
Kafka stores the consumer offsets in an internal topic named __consumer_offsets. By default it has 50 partitions and a replication factor of 3, so each of its partitions is replicated across 3 brokers. You can find more info on its configuration in the broker config; the relevant properties start with offset or offsets (for example, offsets.topic.replication.factor).

Reading messages offset in Apache Kafka

I am very new to Kafka, and we are using Kafka 0.8.1.
What I need to do is consume a message from a topic. For that, I will have to write a consumer in Java which will consume a message from the topic and then save that message to a database. After a message is saved, an acknowledgement will be sent to the Java consumer. If the acknowledgement is true, the next message should be consumed from the topic. If the acknowledgement is false (which means that, due to some error, the message read from the topic couldn't be saved into the database), that message should be read again.
I think I need to use the Simple Consumer to have control over the message offset, and I have gone through the Simple Consumer example given at https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.
In this example, the offset is evaluated in the run method as 'readOffset'. Do I need to play with that? E.g., I could use LatestTime() instead of EarliestTime(), and in case of false, I would reset the offset to the previous one using offset - 1.
Is this how I should proceed?
I think you can get by with the high level consumer (http://kafka.apache.org/documentation.html#highlevelconsumerapi), which should be easier to use than the SimpleConsumer. I don't think the consumer needs to reread messages from Kafka on database failure, as the consumer already has those messages and can resend them to the DB or do anything else it sees fit.
High-level consumers store the last offset read from a specific partition in ZooKeeper (based on the consumer group name), so that when a consumer process dies and is later restarted (potentially on another host), it can continue processing messages where it left off. It's possible to autosave this offset to ZooKeeper periodically (see the consumer properties auto.commit.enable and auto.commit.interval.ms), or have it saved by application logic by calling ConsumerConnector.commitOffsets. See also https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example.
I suggest you turn auto-commit off and commit your offsets yourself once you have received the DB acknowledgement. That way, you can make sure unprocessed messages are reread from Kafka in case of consumer failure, and all messages committed to Kafka will eventually reach the DB at least once (but not 'exactly once').
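The "commit only after the DB acknowledges" rule can be sketched without any Kafka classes. In this simplified model (all names are hypothetical), a list stands in for a partition, the list index stands in for the offset, and a Predicate stands in for the database call that returns the acknowledgement.

```java
import java.util.List;
import java.util.function.Predicate;

final class AckThenCommitSketch {

    /**
     * Processes messages starting at committedOffset and returns the new committed offset.
     * The offset only advances after saveToDb acknowledges the message; on a failed
     * save, processing stops so the same message is read again on the next run.
     */
    static int processFrom(List<String> messages, int committedOffset, Predicate<String> saveToDb) {
        int next = committedOffset;
        while (next < messages.size()) {
            boolean acknowledged = saveToDb.test(messages.get(next));
            if (!acknowledged) {
                break; // do not commit: this message must be retried
            }
            next++; // "commit" the offset only after the acknowledgement
        }
        return next;
    }
}
```

If the process crashed after a successful save but before the offset advanced, the message would be saved twice on restart, which is the at-least-once behaviour described above.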
