I would like to use Kafka high level consumer API, and at the same time I would like to disable auto commit of offsets. I tried to achieve this through the following steps.
1) auto.commit.enable = false
2) offsets.storage = kafka
3) dual.commit.enabled = false
I created a offset manager, which periodically creates offsetcommit request to kafka and commits the offset.
Still I have the following questions
1) Does high level consumer API automatically fetches offset from kafka storage and initializes itself with that offset? Or should I use simple consumer API to achieve this?
2) Does kafka based storage for offsets is repicated across all brokers? Or it is maintained on only one broker?
I created a offset manager, which periodically creates offsetcommit request to kafka and commits the offset.
You need not do that if you are using the high level consumer which provides you with methods to commit the offsets manually, the javadoc (under Manual Offset Control) provides you with examples on how to do that.
1) Does high level consumer API automatically fetches offset from kafka storage and initializes itself with that offset? Or should I use simple consumer API to achieve this?
High level consumer will take care of fetching the last committed offset when you restart it, so you can resume consuming from where you left off.
2) Does kafka based storage for offsets is repicated across all brokers? Or it is maintained on only one broker?
Kafka stores the consumer offsets in an internal topic named __consumer_offsets and by default its replication factor is set to 3 with 50 partitions. So it is replicated across 3 brokers. You can find more info on its configuration in broker config, they start with offset or offsets.
Related
We have a Flink job which has the following topology:
source -> filter -> map -> sink
We set a live(ready) status at the sink operator open-override function. After we get that status, we send events. Sometimes it can't consume the events sent early.
We want to know the exact time/step that we can send data which will not be missing.
It looks like you want to ensure that no message is missed for processing. Kafka will retain your messages, so there is no need to send messages only when the Flink consumer is ready. You can simplify your design by avoiding the status message.
Any Kafka Consumer (not just Flink Connector) will have an offset associated with it in Kafka Server to track the id of the last message that was consumed.
From kafka docs:
Kafka maintains a numerical offset for each record in a partition. This
offset acts as a unique identifier of a record within that partition,
and also denotes the position of the consumer in the partition. For
example, a consumer which is at position 5 has consumed records with
offsets 0 through 4 and will next receive the record with offset 5
In your Flink Kafka Connector, specify the offset as the committed offset.
OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)
This will ensure that if your Flink Connector is restarted, it will consume from the last position that it left off, before the restart.
If for some reason, the offset is lost, this will read from the beginning (earliest message) in your Kafka topic. Note that this approach will cause you to reprocess the messages.
There are many more offset strategies you can explore to choose the right one for you.
Refer - https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#starting-offset
While polling Kafka, I have subscribed to multiple topics using the subscribe() function. Now, I want to set the offset from which I want to read from each topic, without resubscribing after every seek() and poll() from a topic. Will calling seek() iteratively over each of the topic names, before polling for data achieve the result?
How are the offsets exactly stored in Kafka?
I have one partition per topic and just one consumer to read from all topics.
How does Kafka store offsets for each topic?
Kafka has moved the offset storage from zookeeper to kafka brokers. The reason is below:
Zookeeper is not a good way to service a high-write load such as offset updates because zookeeper routes each write though every node and hence has no ability to partition or otherwise scale writes. We have always known this, but chose this implementation as a kind of "marriage of convenience" since we already depended on zk.
Kafka store the offset commits in a topic, when consumer commit the offset, kafka publish an commit offset message to an "commit-log" topic and keep an in-memory structure that mapped group/topic/partition to the latest offset for fast retrieval. More design infomation could be found in this page about offset management.
Now, I want to set the offset from which I want to read from each topic, without resubscribing after every seek() and poll() from a topic.
There is a new feature about kafka admin tools to reset offset.
kafka-consumer-group.sh --bootstrap-server 127.0.0.1:9092 --group
your-consumer-group **--reset-offsets** --to-offset 1 --all-topics --execute
There are more options you can use.
We're using Storm with Kafka and ZooKeeper. We had a situation where we had to delete some topics and recreate them with different names. Our Kafka spouts stayed the same, aside from now reading from the new topic names. However now the spouts are using the offsets from the old topic partitions when trying to read from the new topics. So the tail position of my-topic-name partition 0 will be 500 but the offset will be something like 10000.
Is there a way to reset the offset position so it matches the tail of the topic?
There a multiple options (as Storm's KafkaSpout does not provide any API to define the starting offset).
If you want to consumer from the tail of the log you should delete old offsets
depending on you Kafka version
(pre 0.9) you can manipulate ZK (which is a little tricky)
(0.9+) or you try do delete the offset from the topic __consumer_offsets (which is also tricky and might delete other offset you want to preserve, too)
if no offsets are there, you can restart your spout with auto offset reset policy "latest" or "largest" (depending on you Kafka version)
as an alternative (which I would recommend), you can write a small client application that uses seek() to manipulate the offset in the way you need them and commit() the offsets. This client must use the same group ID as you KafkaSpout and must subscribe to the same topic(s). Furthermore, you need to make sure that this client application is running a single consumer group member so it get's all partitions assigned.
for this, you an either seek to the end of the log and commit
or you commit an invalid offset (like -1) and rely on auto offset reset configuration"latest" or "largest" (depending on you Kafka version)
For Kafka Streams, there is a "Application Reset Tool" that does a similar thing to manipulate committed offsets. If you want to get some details, you can read this blog post http://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
(disclaimer: I am the author of the post and it is about Kafka Streams -- nevertheless, the underlying offset manipulation ideas are the same)
There is an application (not mine) that reads messages from Kafka, does some processing on them, and stores records in a database. I've put together a program in Java that writes messages into the queue at a given rate. Right now, it does a simple measure of performance by querying the database at the end of the test run to ensure that records in = records out. However, I'd like to expand it to periodically check the queue to see how many messages are pending that the application hasn't yet processed to see if it's getting backed up.
I figure that I can check offset of the application's group ID in Zookeeper. I looked at the Kafka documentation, but it only gives basic consumer examples and the API documentation is sparse at best, so I'm not sure how to go about finding this information.
What APIs to I need to call in order to find out where in the queue the application is currently at, and how many messages are in the queue behind that position?
I'm using Kafka 2.10-0.8.2.1 with a single Zookeeper instance and three Kafka instances, and the load tester is using the 0.8.2.1 Java API. The topic in question has three partitions (one on each Kafka server), however for the purpose of the test there is only a single consumer.
I would suggest looking at the already provided tools in Kafka (code is available in src if you need to call the API directly). In particular,
$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group consumer-group1 --zkconnect zkhost:zkport --topic topic1
Will show you the offset and lag:
consumer-group1,topic1,0-0 (Group,Topic,BrokerId-PartitionId)
Owner = consumer-group1-consumer1
Consumer offset = 70121994703
= 70,121,994,703 (65.31G)
Log size = 70122018287
= 70,122,018,287 (65.31G)
Consumer lag = 23584
= 23,584 (0.00G)
References:
Kafka FAQ
Kafka Tools
There are several offsets Kafka exposes (via JMX), that you can use to figure out how much lag consumers have (per topic, per partition). These are Latest Offset (basically where Kafka Broker is with its writes of new data) and Consumer Offset (where Consumers are with their reads). The delta between these two is known as Consumer Lag, and this tells you how far behind real-time Consumers are. Based on this info one can also derive Broker Write Rate and Consumer Read Rate, which are handy to know and see in your Kafka monitoring tool, too. See Kafka Consumer Lag Monitoring for more details. If you just want a tool that can chart a bunch of Kafka metrics, see SPM for Kafka monitoring. HTH.
I am very much new to Kafka and we are using Kafka 0.8.1.
What I need to do is to consume a message from topic. For that, I will have to write one consumer in Java which will consume a message from topic and then save that message to database. After a message is saved, some acknowledgement will be sent to Java consumer. If acknowledgement is true, then next message should be consumed from the topic. If acknowldgement is false(which means due to some error message,read from the topic, couldn't be saved into the database), then again that message should be read.
I think I need to use Simple Consumer,to have control over message offset and have gone through the Simple Consumer example as given in this link https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.
In this example, offset is evaluated in run method as 'readOffset'. Do I need to play with that? For e.g. I can use LatestTime() instead of EarliestTime() and in case of false, I will reset the offset to the one before using offset - 1.
Is this how I should proceed?
I think you can get along with using the high level consumer (http://kafka.apache.org/documentation.html#highlevelconsumerapi), that should be easier to use than the SimpleConsumer. I don't think the consumer needs to reread messages from Kafka on database failure, as the consumer already has those messages and can resend them to the DB or do anything else it sees fit.
High-level consumers store the last offset read from a specific partition in Zookeeper (based on the consumer group name), so that when a consumer process dies and is later restarted (potentially on an other host), it can continue processing messages where it left off. It's possible to autosave this offset to Zookeeper periodically (see the consumer properties auto.commit.enable and auto.commit.interval.ms), or have it saved by application logic by calling ConsumerConnector.commitOffsets . See also https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example .
I suggest you turn auto-commit off and commit your offsets yourselves once you received DB acknowledgement. Thus, you can make sure unprocessed messages are reread from Kafka in case of consumer failure and all messages commited to Kafka will eventually reach the DB at least once (but not 'exactly once').