We have a Flink job which has the following topology:
source -> filter -> map -> sink
We set a live (ready) status in the sink operator's open() override. Once we see that status, we start sending events, but sometimes events sent early are not consumed.
We want to know the exact point in time/startup step after which we can send data without any of it being missed.
It looks like you want to ensure that no message is missed for processing. Kafka will retain your messages, so there is no need to send messages only when the Flink consumer is ready. You can simplify your design by avoiding the status message.
Any Kafka consumer (not just the Flink connector) has an offset associated with it on the Kafka brokers that tracks the position of the last message it consumed.
From kafka docs:
Kafka maintains a numerical offset for each record in a partition. This
offset acts as a unique identifier of a record within that partition,
and also denotes the position of the consumer in the partition. For
example, a consumer which is at position 5 has consumed records with
offsets 0 through 4 and will next receive the record with offset 5
In your Flink Kafka connector, specify the starting offsets as the committed offsets:
OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)
This ensures that if your Flink connector is restarted, it will resume consuming from the position where it left off before the restart.
If for some reason the committed offset is lost, it will instead read from the beginning (earliest message) of your Kafka topic. Note that in that fallback case you will reprocess messages.
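Wired into a KafkaSource, the committed-offsets initializer looks roughly like this (a minimal sketch; the broker address, topic, and group id are placeholder values, and print() stands in for your filter/map/sink chain):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class CommittedOffsetsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // placeholder broker
                .setTopics("input-topic")                // placeholder topic
                .setGroupId("my-flink-group")            // placeholder group id
                // Resume from the group's committed offsets; if none exist yet,
                // fall back to the earliest record so nothing is skipped.
                .setStartingOffsets(
                        OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();  // stands in for filter -> map -> sink
        env.execute();
    }
}
```

With this setup there is no "ready" handshake to wait for: messages published before the job starts are simply consumed once it does.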
There are many more offset strategies you can explore to choose the right one for you.
Refer - https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#starting-offset
Related
I have the following use case:
I have two Kafka topics: one is meant to be used as a stream of incoming messages to be processed, the other as a store of records used to bootstrap the application to its initial state.
Is there a way to do the following:
Read all messages from the bootstrap topic when the application starts up and store all of its ConsumerRecords in memory, forming the application's initial state
Only after all those messages have been read, allow ConsumerRecords from the stream topic to be processed
When additional records arrive on the state topic while the application is running, incorporate them into the application's state without having to restart the application
Thanks!
Start your bootstrap consumer first.
Read the state topic until a particular offset is reached (or, if you want the end, keep reading until a poll returns no records, though this is not the best way!). If you want to start at a particular offset every time, you have to use seek(). Also use a unique consumer group id for this consumer, since you want all the records. You might also want to handle the rebalance case appropriately.
Then close that consumer and start the other stream consumer and process the data.
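The "read until the end" step can be made more deterministic than polling until empty by snapshotting endOffsets() at startup; this variant is my suggestion, not part of the answer above, and the broker address and topic name are placeholders:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BootstrapState {
    public static Map<String, String> loadInitialState() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Unique group id: we always want every record, never a shared position.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "bootstrap-" + UUID.randomUUID());
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> state = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> parts = new ArrayList<>();
            for (PartitionInfo pi : consumer.partitionsFor("state-topic")) {  // placeholder topic
                parts.add(new TopicPartition(pi.topic(), pi.partition()));
            }
            consumer.assign(parts);           // assign() avoids rebalance handling entirely
            consumer.seekToBeginning(parts);
            Map<TopicPartition, Long> end = consumer.endOffsets(parts); // snapshot "the end" now
            while (parts.stream().anyMatch(tp -> consumer.position(tp) < end.get(tp))) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(200))) {
                    state.put(r.key(), r.value());  // latest value per key wins
                }
            }
        }
        return state;  // only now start the stream-topic consumer
    }
}
```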
Using KTables with Kafka Streams might be better, but I am not familiar with it.
I want to read particular messages from a topic. For example, there are 12000 messages in the topic and I want to read only messages 2000 to 5000. Is there any provision in Kafka, or can I use Java consumer code to read particular messages from a topic?
The Java consumer API provides "seek" methods, most notably the following one:
seek(TopicPartition partition, long offset)
You can specify to read messages starting from the provided offset, but you cannot provide an ending offset. Also, an offset is a partition-level concept, which is why you have to provide a TopicPartition as the first parameter.
Consider that if the topic partition is compacted and/or some messages are deleted, the offsets are no longer sequential, so there can be holes. You should therefore decide whether you want to read from the message with offset 2000 to the one with offset 5000, or from the 2000th message to the 5000th message (in the latter case, the ordinal position may not equal the offset: the 2000th message could be at offset 2100 because 100 messages before it were deleted).
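A sketch of the first interpretation (records with offsets 2000 through 5000 on partition 0), using seek() plus a poll loop that enforces the upper bound client-side; the broker, group id, and topic name are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetRangeReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "range-reader");            // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("my-topic", 0);                // placeholder topic
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 2000L);            // start offset (the API has no end offset)
            boolean done = false;
            while (!done) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(200))) {
                    if (r.offset() > 5000L) {    // enforce the upper bound ourselves
                        done = true;
                        break;
                    }
                    System.out.printf("%d: %s%n", r.offset(), r.value());
                }
            }
        }
    }
}
```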
While polling Kafka, I have subscribed to multiple topics using the subscribe() function. Now I want to set the offset from which I read each topic, without resubscribing after every seek() and poll() of a topic. Will calling seek() iteratively over each of the topic names before polling achieve the result?
How are the offsets exactly stored in Kafka?
I have one partition per topic and just one consumer to read from all topics.
How does Kafka store offsets for each topic?
Kafka moved offset storage from ZooKeeper to the Kafka brokers. The reasoning given was:
Zookeeper is not a good way to service a high-write load such as offset updates because zookeeper routes each write through every node and hence has no ability to partition or otherwise scale writes. We have always known this, but chose this implementation as a kind of "marriage of convenience" since we already depended on zk.
Kafka stores offset commits in a topic: when a consumer commits an offset, Kafka publishes a commit message to an internal "commit log" topic and keeps an in-memory structure mapping group/topic/partition to the latest offset for fast retrieval. More design information can be found in the Kafka wiki page about offset management.
Now I want to set the offset from which I read each topic, without resubscribing after every seek() and poll() of a topic.
There is also a newer feature of the Kafka admin tools for resetting a consumer group's offsets:
kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group your-consumer-group --reset-offsets --to-offset 1 --all-topics --execute
There are more options you can use.
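As for the in-code question: calling seek() once per TopicPartition before polling does work, provided the partitions are actually assigned at that point (with subscribe() you would have to wait for the first assignment, e.g. via a ConsumerRebalanceListener, before seeking). A sketch using assign(), which sidesteps that; the broker, group id, topic names, and offsets are all placeholders:

```java
import java.time.Duration;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiTopicSeek {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "multi-topic-reader");      // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // One partition per topic, as in the question; starting offsets are made up.
        Map<TopicPartition, Long> startAt = Map.of(
                new TopicPartition("topic-a", 0), 100L,
                new TopicPartition("topic-b", 0), 2500L);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(startAt.keySet());   // assign() so positions are seekable immediately
            startAt.forEach(consumer::seek);     // one seek() per partition, set once
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("%s-%d@%d: %s%n",
                            r.topic(), r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```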
I would like to use Kafka high level consumer API, and at the same time I would like to disable auto commit of offsets. I tried to achieve this through the following steps.
1) auto.commit.enable = false
2) offsets.storage = kafka
3) dual.commit.enabled = false
I created an offset manager, which periodically creates an OffsetCommit request to Kafka and commits the offsets.
Still I have the following questions
1) Does the high-level consumer API automatically fetch the offset from Kafka storage and initialize itself with that offset? Or should I use the simple consumer API to achieve this?
2) Is the Kafka-based offset storage replicated across all brokers, or is it maintained on only one broker?
I created an offset manager, which periodically creates an OffsetCommit request to Kafka and commits the offsets.
You need not do that if you are using the high-level consumer, which provides you with methods to commit the offsets manually; the javadoc (under "Manual Offset Control") provides examples of how to do that.
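With the modern KafkaConsumer, that manual-commit pattern looks roughly like this (the question's 0.8-era property names differ, so treat this as a sketch; broker, group id, and topic are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "manual-commit-group");     // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // manual control
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : records) {
                    // process r ...
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();  // commit only after the batch was processed
                }
            }
        }
    }
}
```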
1) Does the high-level consumer API automatically fetch the offset from Kafka storage and initialize itself with that offset? Or should I use the simple consumer API to achieve this?
The high-level consumer will take care of fetching the last committed offset when you restart it, so you can resume consuming from where you left off.
2) Is the Kafka-based offset storage replicated across all brokers, or is it maintained on only one broker?
Kafka stores consumer offsets in an internal topic named __consumer_offsets; by default its replication factor is 3 and it has 50 partitions, so it is replicated across 3 brokers. You can find more about its configuration in the broker config; the relevant settings start with offsets.
I am very much new to Kafka and we are using Kafka 0.8.1.
What I need to do is to consume a message from a topic. For that, I will have to write a consumer in Java which will consume a message from the topic and then save that message to a database. After a message is saved, some acknowledgement will be sent to the Java consumer. If the acknowledgement is true, the next message should be consumed from the topic. If the acknowledgement is false (meaning that, due to some error, the message read from the topic could not be saved to the database), that message should be read again.
I think I need to use the SimpleConsumer, to have control over the message offset, and have gone through the SimpleConsumer example given in this link https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.
In this example, the offset is evaluated in the run method as 'readOffset'. Do I need to play with that? For example, I could use LatestTime() instead of EarliestTime(), and in the failure case reset the offset to the previous one using offset - 1.
Is this how I should proceed?
I think you can get along with using the high-level consumer (http://kafka.apache.org/documentation.html#highlevelconsumerapi), which should be easier to use than the SimpleConsumer. I don't think the consumer needs to re-read messages from Kafka on database failure, as the consumer already has those messages and can resend them to the DB or do anything else it sees fit.
High-level consumers store the last offset read from a specific partition in Zookeeper (based on the consumer group name), so that when a consumer process dies and is later restarted (potentially on another host), it can continue processing messages where it left off. It's possible to autosave this offset to Zookeeper periodically (see the consumer properties auto.commit.enable and auto.commit.interval.ms), or have it saved by application logic by calling ConsumerConnector.commitOffsets. See also https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example .
I suggest you turn auto-commit off and commit your offsets yourself once you have received the DB acknowledgement. That way you can make sure unprocessed messages are re-read from Kafka in case of consumer failure, and every message committed to Kafka will eventually reach the DB at least once (but not 'exactly once').
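In today's consumer API (the 0.8 high-level API is long gone) the commit-after-acknowledgement pattern looks roughly like this; saveToDb() is a hypothetical stand-in for your database write, and the broker, group id, and topic names are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DbAckConsumer {
    static boolean saveToDb(ConsumerRecord<String, String> r) {
        return true;  // hypothetical: returns false when the DB write fails
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "db-writer");               // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // auto-commit off
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(200))) {
                    TopicPartition tp = new TopicPartition(r.topic(), r.partition());
                    if (saveToDb(r)) {
                        // Commit the *next* offset only after the DB acknowledged the write.
                        consumer.commitSync(Map.of(tp, new OffsetAndMetadata(r.offset() + 1)));
                    } else {
                        consumer.seek(tp, r.offset());  // rewind so the record is redelivered
                        break;                          // re-poll starting from the failure
                    }
                }
            }
        }
    }
}
```

On failure nothing is committed and the consumer seeks back, so a crash at any point only ever causes redelivery, never loss: exactly the at-least-once behaviour described above.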