I want to read particular messages from topic. For example there are 12000 messages in topic and I want to read from 2000 to 5000 only. Is there any provision in kafka ? or can I use java consumer code to read particular messages from a topic?
The Java consumer API provides you "seek" methods and more specifically the following one
seek(TopicPartition partition, long offset)
You can specify to read messages starting from the provided offset but you cannot provide an ending offset. The other thing is that specifying an offset is more partition related and for this reason, you have to provide the TopicPartition as the first parameter.
Consider that if the topic partition is compacted and/or some messages are deleted, the offsets are not sequential anymore so you can have some holes. So you should pay attention if you want to read from the message with offset 2000 to the one with offset 5000 or you want to read from the 2000th message to the 5000th message (in this case the ordinal position could be not equal to the offset, i.e. the 2000th message is at offset 2100 because 100 messages before it were deleted).
Related
We have a Flink job which has the following topology:
source -> filter -> map -> sink
We set a live(ready) status at the sink operator open-override function. After we get that status, we send events. Sometimes it can't consume the events sent early.
We want to know the exact time/step that we can send data which will not be missing.
It looks like you want to ensure that no message is missed for processing. Kafka will retain your messages, so there is no need to send messages only when the Flink consumer is ready. You can simplify your design by avoiding the status message.
Any Kafka Consumer (not just Flink Connector) will have an offset associated with it in Kafka Server to track the id of the last message that was consumed.
From kafka docs:
Kafka maintains a numerical offset for each record in a partition. This
offset acts as a unique identifier of a record within that partition,
and also denotes the position of the consumer in the partition. For
example, a consumer which is at position 5 has consumed records with
offsets 0 through 4 and will next receive the record with offset 5
In your Flink Kafka Connector, specify the offset as the committed offset.
OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)
This will ensure that if your Flink Connector is restarted, it will consume from the last position that it left off, before the restart.
If for some reason, the offset is lost, this will read from the beginning (earliest message) in your Kafka topic. Note that this approach will cause you to reprocess the messages.
There are many more offset strategies you can explore to choose the right one for you.
Refer - https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#starting-offset
I want to get the last record offset in the topic partition. There is endOffsets method in the consumer. And usually endOffsets - 1 works fine. But in the case of transactional producer topic may contain offsets without a records. And endOffsets - 1 will point to the offset without record. So, how should I compute the last record offset in this case?
More interestingly, what if I will have both a simple and transactional producer for my topic? Is there any reliable way to get the last record offset ignoring all this complexity?
I ended up realizing that there is no reliable and simple way to do that in the current version of the java consumer. I created a feature request for that in Kafka's issue tracker: https://issues.apache.org/jira/browse/KAFKA-10009
I have producer which I call and posts a record to Kafka, then I call a consumer which returns the record, but when I call the consumer again the consumer doesn't return any records. (I need to get the record which I had posted to Kafka again). How can I do this?(Any code would be appreciated)
Kafka doesn't delete the message after it has been consumed. But it keeps the offset of reading for any consumer. So after you read a message from it, the offset goes forward. The second read doesn't read anything because the offset point after your only message and there is nothing after that. You should try resetting the offset before you read again. See this post:
Reset consumer offset to the beginning from Kafka Streams
But if you don't want to reset locally or globally, you can create another consumer group and since every consumer group has its own offset, your second read by the new consumers can achieve what you want. See this link:
kafka-tutorial-kafka-consumer
Hope this would be helpful.
You can manually reset the offset to desired offset or if you need to consumer from the start offset ( whatever is available in kafka) , then you can set the consumer property "auto.offset.reset=earliest"
You can also provide every time a new group.id value for the consumer properties. Just generate a random string value. The property auto.offset.reset must be set to earliest.
While polling Kafka, I have subscribed to multiple topics using the subscribe() function. Now, I want to set the offset from which I want to read from each topic, without resubscribing after every seek() and poll() from a topic. Will calling seek() iteratively over each of the topic names, before polling for data achieve the result?
How are the offsets exactly stored in Kafka?
I have one partition per topic and just one consumer to read from all topics.
How does Kafka store offsets for each topic?
Kafka has moved the offset storage from zookeeper to kafka brokers. The reason is below:
Zookeeper is not a good way to service a high-write load such as offset updates because zookeeper routes each write though every node and hence has no ability to partition or otherwise scale writes. We have always known this, but chose this implementation as a kind of "marriage of convenience" since we already depended on zk.
Kafka store the offset commits in a topic, when consumer commit the offset, kafka publish an commit offset message to an "commit-log" topic and keep an in-memory structure that mapped group/topic/partition to the latest offset for fast retrieval. More design infomation could be found in this page about offset management.
Now, I want to set the offset from which I want to read from each topic, without resubscribing after every seek() and poll() from a topic.
There is a new feature about kafka admin tools to reset offset.
kafka-consumer-group.sh --bootstrap-server 127.0.0.1:9092 --group
your-consumer-group **--reset-offsets** --to-offset 1 --all-topics --execute
There are more options you can use.
We're using Storm with Kafka and ZooKeeper. We had a situation where we had to delete some topics and recreate them with different names. Our Kafka spouts stayed the same, aside from now reading from the new topic names. However now the spouts are using the offsets from the old topic partitions when trying to read from the new topics. So the tail position of my-topic-name partition 0 will be 500 but the offset will be something like 10000.
Is there a way to reset the offset position so it matches the tail of the topic?
There a multiple options (as Storm's KafkaSpout does not provide any API to define the starting offset).
If you want to consumer from the tail of the log you should delete old offsets
depending on you Kafka version
(pre 0.9) you can manipulate ZK (which is a little tricky)
(0.9+) or you try do delete the offset from the topic __consumer_offsets (which is also tricky and might delete other offset you want to preserve, too)
if no offsets are there, you can restart your spout with auto offset reset policy "latest" or "largest" (depending on you Kafka version)
as an alternative (which I would recommend), you can write a small client application that uses seek() to manipulate the offset in the way you need them and commit() the offsets. This client must use the same group ID as you KafkaSpout and must subscribe to the same topic(s). Furthermore, you need to make sure that this client application is running a single consumer group member so it get's all partitions assigned.
for this, you an either seek to the end of the log and commit
or you commit an invalid offset (like -1) and rely on auto offset reset configuration"latest" or "largest" (depending on you Kafka version)
For Kafka Streams, there is a "Application Reset Tool" that does a similar thing to manipulate committed offsets. If you want to get some details, you can read this blog post http://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
(disclaimer: I am the author of the post and it is about Kafka Streams -- nevertheless, the underlying offset manipulation ideas are the same)