I am doing something like the following pseudo code
var consumer = new KafkaConsumer();
consumer.assign(topicPartitions);
var beginOff = consumer.beginningOffsets(topicPartitions);
var endOff = consumer.endOffsets(topicPartitions);
var lastOffsets = Math.max(beginOff, endOff - 1));
lastOffsets.forEach(consumer::seek);
lastMessages = consumer.poll(1 sec);
// do something with the received messages
consumer.close();
In the simple test that I did, this works, but I wonder if there are cases, like producer crashes etc., where offsets are not monotonically increasing by one? In that case, would I have to seek() my way back in time, or can I get the message offset of the last already produced message from Kafka?
I am not using transactions, so we don't need to worry about read-committed vs. uncommitted messages.
Edit: An example where offsets are not consecutive is after log compaction. However, log compaction should always keep the last message, as it is - obviously - more recent than all preceding messages (same key or not). But the offset before that last message could theoretically have been compacted away.
In kafka.apache.org/10/javadoc/, it is clearly mentioned that, consumer.endOffsets
Get the last offset for the given partitions. The last offset of a partition is the offset of the upcoming message, i.e. the offset of the last available message + 1.
So when you get that endOff - 1, it is the last available Kafka record for that topic partition when you fetched that. So producer concerns are not impacted for this.
And one more thing, Offset is not decided by the producer. It is decided by the partition leader of that topic partition. So, it is always monotonically increasing by one.
Related
I want to get the last record offset in the topic partition. There is endOffsets method in the consumer. And usually endOffsets - 1 works fine. But in the case of transactional producer topic may contain offsets without a records. And endOffsets - 1 will point to the offset without record. So, how should I compute the last record offset in this case?
More interestingly, what if I will have both a simple and transactional producer for my topic? Is there any reliable way to get the last record offset ignoring all this complexity?
I ended up realizing that there is no reliable and simple way to do that in the current version of the java consumer. I created a feature request for that in Kafka's issue tracker: https://issues.apache.org/jira/browse/KAFKA-10009
I want to read particular messages from topic. For example there are 12000 messages in topic and I want to read from 2000 to 5000 only. Is there any provision in kafka ? or can I use java consumer code to read particular messages from a topic?
The Java consumer API provides you "seek" methods and more specifically the following one
seek(TopicPartition partition, long offset)
You can specify to read messages starting from the provided offset but you cannot provide an ending offset. The other thing is that specifying an offset is more partition related and for this reason, you have to provide the TopicPartition as the first parameter.
Consider that if the topic partition is compacted and/or some messages are deleted, the offsets are not sequential anymore so you can have some holes. So you should pay attention if you want to read from the message with offset 2000 to the one with offset 5000 or you want to read from the 2000th message to the 5000th message (in this case the ordinal position could be not equal to the offset, i.e. the 2000th message is at offset 2100 because 100 messages before it were deleted).
I have producer which I call and posts a record to Kafka, then I call a consumer which returns the record, but when I call the consumer again the consumer doesn't return any records. (I need to get the record which I had posted to Kafka again). How can I do this?(Any code would be appreciated)
Kafka doesn't delete the message after it has been consumed. But it keeps the offset of reading for any consumer. So after you read a message from it, the offset goes forward. The second read doesn't read anything because the offset point after your only message and there is nothing after that. You should try resetting the offset before you read again. See this post:
Reset consumer offset to the beginning from Kafka Streams
But if you don't want to reset locally or globally, you can create another consumer group and since every consumer group has its own offset, your second read by the new consumers can achieve what you want. See this link:
kafka-tutorial-kafka-consumer
Hope this would be helpful.
You can manually reset the offset to desired offset or if you need to consumer from the start offset ( whatever is available in kafka) , then you can set the consumer property "auto.offset.reset=earliest"
You can also provide every time a new group.id value for the consumer properties. Just generate a random string value. The property auto.offset.reset must be set to earliest.
We're using Storm with Kafka and ZooKeeper. We had a situation where we had to delete some topics and recreate them with different names. Our Kafka spouts stayed the same, aside from now reading from the new topic names. However now the spouts are using the offsets from the old topic partitions when trying to read from the new topics. So the tail position of my-topic-name partition 0 will be 500 but the offset will be something like 10000.
Is there a way to reset the offset position so it matches the tail of the topic?
There a multiple options (as Storm's KafkaSpout does not provide any API to define the starting offset).
If you want to consumer from the tail of the log you should delete old offsets
depending on you Kafka version
(pre 0.9) you can manipulate ZK (which is a little tricky)
(0.9+) or you try do delete the offset from the topic __consumer_offsets (which is also tricky and might delete other offset you want to preserve, too)
if no offsets are there, you can restart your spout with auto offset reset policy "latest" or "largest" (depending on you Kafka version)
as an alternative (which I would recommend), you can write a small client application that uses seek() to manipulate the offset in the way you need them and commit() the offsets. This client must use the same group ID as you KafkaSpout and must subscribe to the same topic(s). Furthermore, you need to make sure that this client application is running a single consumer group member so it get's all partitions assigned.
for this, you an either seek to the end of the log and commit
or you commit an invalid offset (like -1) and rely on auto offset reset configuration"latest" or "largest" (depending on you Kafka version)
For Kafka Streams, there is a "Application Reset Tool" that does a similar thing to manipulate committed offsets. If you want to get some details, you can read this blog post http://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
(disclaimer: I am the author of the post and it is about Kafka Streams -- nevertheless, the underlying offset manipulation ideas are the same)
I am very much new to Kafka and we are using Kafka 0.8.1.
What I need to do is to consume a message from topic. For that, I will have to write one consumer in Java which will consume a message from topic and then save that message to database. After a message is saved, some acknowledgement will be sent to Java consumer. If acknowledgement is true, then next message should be consumed from the topic. If acknowldgement is false(which means due to some error message,read from the topic, couldn't be saved into the database), then again that message should be read.
I think I need to use Simple Consumer,to have control over message offset and have gone through the Simple Consumer example as given in this link https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.
In this example, offset is evaluated in run method as 'readOffset'. Do I need to play with that? For e.g. I can use LatestTime() instead of EarliestTime() and in case of false, I will reset the offset to the one before using offset - 1.
Is this how I should proceed?
I think you can get along with using the high level consumer (http://kafka.apache.org/documentation.html#highlevelconsumerapi), that should be easier to use than the SimpleConsumer. I don't think the consumer needs to reread messages from Kafka on database failure, as the consumer already has those messages and can resend them to the DB or do anything else it sees fit.
High-level consumers store the last offset read from a specific partition in Zookeeper (based on the consumer group name), so that when a consumer process dies and is later restarted (potentially on an other host), it can continue processing messages where it left off. It's possible to autosave this offset to Zookeeper periodically (see the consumer properties auto.commit.enable and auto.commit.interval.ms), or have it saved by application logic by calling ConsumerConnector.commitOffsets . See also https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example .
I suggest you turn auto-commit off and commit your offsets yourselves once you received DB acknowledgement. Thus, you can make sure unprocessed messages are reread from Kafka in case of consumer failure and all messages commited to Kafka will eventually reach the DB at least once (but not 'exactly once').