I need to consume messages from a Kafka topic within a specific time range. Determining the starting offset is easy enough with partition.seekToTimestamp(startTimeInMillis), as in the code below:
ReceiverOptions<ByteBuffer, ByteBuffer> options =
    receiverOptions
        .subscription(topicConfig.getTopics())
        .pollTimeout(Duration.ofMillis(topicConfig.getPollWaitTimeoutMs()))
        .addAssignListener(
            partitions -> {
              partitions.forEach(partition -> partition.seekToTimestamp(startTimeInMillis));
            });
But how do I stop consuming once the messages start exceeding the end timestamp?
I can use the code below:
KafkaReceiver.create(options)
    .receive()
    .takeWhile(record -> record.timestamp() < endTimeInMillis)
    .map(this::handleConsumerRecord);
The problem is that takeWhile cancels the whole flux as soon as one record fails the predicate. That record may simply be the last valid message in its partition, while other partitions of the topic may still have messages with timestamps below the end timestamp.
How do I ensure that I have consumed all the messages across partitions within the end timestamp?
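One approach I am considering (a sketch, not a verified solution) is to group the flux by partition and apply the timestamp cut-off per partition, so that one partition reaching the end timestamp does not cancel the others. This relies on ReceiverRecord.receiverOffset().topicPartition() and standard Reactor groupBy/flatMap; options, endTimeInMillis and handleConsumerRecord are the same names as above:

KafkaReceiver.create(options)
    .receive()
    // evaluate the cut-off independently for each partition
    .groupBy(record -> record.receiverOffset().topicPartition())
    .flatMap(partitionFlux ->
        // a partition's grouped flux completes once it passes the end timestamp
        partitionFlux.takeWhile(record -> record.timestamp() < endTimeInMillis))
    .map(this::handleConsumerRecord);

One caveat: a partition that never yields a record past endTimeInMillis keeps its group open, so an additional completion signal (for example a timeout) may still be needed.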
I have recently started using Kafka. I have a topic and I am using the following code to consume from it:
@KafkaListener(topics = "topic_name", groupId = "_id", id = "pro", containerFactory = "kafkaListenerContainerFactory")
public void consume(ConsumerRecord<String, String> record, Acknowledgment ack) {
    kafkaService.proccessorConsumer(record);
    ack.acknowledge();
}
Everything works fine, but I need to handle the situation where the service stops for any reason and is then restarted: I want to continue consuming from the last message that was processed. I understand that the acknowledgment helps with this, but for the sake of certainty I have also saved the last consumed offset elsewhere.
My question is: how can I use that stored offset to start consuming the topic from that position?
As @OneCricketeer indicates, what you are trying to achieve is the default behaviour of the Kafka consumer, as long as you haven't disabled automatic commit.
You can check this by describing your consumer group, using the group id, as follows; just verify that the committed offset of your consumer matches the one you have stored elsewhere.
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group-id
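If you do want to start from the offset you stored yourself, a minimal sketch with the plain KafkaConsumer looks roughly like this (the topic name and partition are placeholders, and loadOffsetFromStore() is a hypothetical helper that reads your saved offset):

// consumer is an already-configured KafkaConsumer<String, String>
TopicPartition partition = new TopicPartition("topic_name", 0);  // placeholder partition
long storedOffset = loadOffsetFromStore();  // hypothetical: reads the offset you saved

consumer.assign(Collections.singletonList(partition));
consumer.seek(partition, storedOffset);     // the next poll() starts at the stored offset
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

With Spring Kafka specifically, the usual hook for this kind of seek is the ConsumerSeekAware interface, which gives you a callback when partitions are assigned to the listener.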
I have a scenario where I load a subscription with around 1100 messages. I then start a Spark job which pulls messages from this subscription with these settings:
MaxOutstandingElementCount: 5
MaxAckExtensionPeriod: 60 min
AckDeadlineSeconds: 600
The first message to get processed starts a cache generation which takes about 30 minutes to complete. Any other messages arriving while this is going on are simply "returned" with no ack or nack. After that, a given message takes between 1 min and 30 mins to process. With an ack extension period of 60 min, I would never expect to see resending of messages.
The behaviour I am seeing is that while the initial cache is being generated, every 10 minutes 5 new messages are grabbed by the client and returned with no ack or nack by my code. This is unexpected. I would expect the deadline of the original 5 messages to be extended up to an hour.
Furthermore, after having processed and acked about 500 of the messages, I would expect around 600 left in the subscription, but I see almost the original 1100. These turn out to be resent duplicates, as I log these in my code. This is also very unexpected.
This is a screenshot from google console after around 500 messages have been processed and acked (ignore the first "hump", that was an aborted test run):
Am I missing something?
Here is the setup code:
val name = ProjectSubscriptionName.of(ConfigurationValues.ProjectId,
  ConfigurationValues.PubSubSubscription)
val topic = ProjectTopicName.of(ConfigurationValues.ProjectId,
  ConfigurationValues.PubSubSubscriptionTopic)
val pushConfig = PushConfig.newBuilder.build
val ackDeadlineSeconds = 600
subscriptionAdminClient.createSubscription(
  name,
  topic,
  pushConfig,
  ackDeadlineSeconds)

val flowControlSettings = FlowControlSettings.newBuilder()
  .setMaxOutstandingElementCount(5L)
  .build();

// create a subscriber bound to the asynchronous message receiver
val subscriber = Subscriber
  .newBuilder(subscriptionName, new EtlMessageReceiver(spark))
  .setFlowControlSettings(flowControlSettings)
  .setMaxAckExtensionPeriod(Duration.ofMinutes(60))
  .build
subscriber.startAsync.awaitRunning()
Here is the code in the receiver which runs when a message arrives while the cache is being generated:
if (!BIQConnector.cacheGenerationDone) {
  Utilities.logLine(
    s"PubSub message for work item $uniqueWorkItemId ignored as cache is still being generated.")
  return
}
And finally when a message has been processed:
consumer.ack()
Utilities.logLine(s"PubSub message ${message.getMessageId} for $tableName acknowledged.")
// Write back to ETL Manager
Utilities.logLine(
s"Writing result message back to topic ${etlResultTopic} for table $tableName, $tableDetailsForLog.")
sendPubSubResult(importTableName, validTableName, importTimestamp, 2, etlResultTopic, stageJobData,
tableDetailsForLog, "Success", isDeleted)
Is your Spark job using a Pub/Sub client library to pull messages? These libraries should indeed keep extending your message deadlines up to the MaxAckExtensionPeriod you specified.
If your job is using a Pub/Sub client library, this is unexpected behavior. You should contact Google Cloud support with your project name, subscription name, client library version, and a sample of the message IDs from the messages you are "returning" without acking. They will be able to investigate further into why you're receiving these resent messages.
We are using Prometheus and Grafana for monitoring our Kafka cluster.
In our application we use Kafka Streams, and there is a chance that a stream gets stopped due to an exception. We log that event via setUncaughtExceptionHandler, but we also need some kind of alerting when the stream stops.
What we currently have is jmx_exporter running as an agent, exposing Kafka metrics through an endpoint, and Prometheus scraping the metrics from that endpoint.
We don't see any metric that gives the count of active consumers per topic. Are we missing something? Any suggestions on how to get the number of active consumers and send alerts when a consumer stops?
We had similar needs and added Kafka consumer lag per partition to Grafana, along with alerts when the lag exceeds a specified threshold (the threshold should be different for each topic, depending on load; e.g. for some topics it could be 10, for highly loaded ones 100000). So if you have more than, say, 1000 unprocessed messages, you will get an alert.
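For illustration of what that lag number means (in our setup the actual metric comes from jmx_exporter, not from code), lag per partition is just the log end offset minus the group's committed offset. A rough sketch with the Java AdminClient, assuming a reasonably recent kafka-clients version and a placeholder group id:

// adminClient is an AdminClient; consumer is a KafkaConsumer used only to read end offsets
Map<TopicPartition, OffsetAndMetadata> committed =
    adminClient.listConsumerGroupOffsets("my-streams-app-id")  // placeholder group id
        .partitionsToOffsetAndMetadata()
        .get();
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

committed.forEach((tp, meta) -> {
    long lag = endOffsets.get(tp) - meta.offset();
    // alert (or expose as a gauge) when lag exceeds the per-topic threshold
    if (lag > 1000) {
        log.warn("Consumer lag for {} is {}", tp, lag);
    }
});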
You could also add a state listener to each Kafka stream and, if the stream enters an error state, log the error or send an email:
kafkaStream.setStateListener((newState, oldState) -> {
    log.info("Kafka stream state changed [{}] >>>>> [{}]", oldState, newState);
    if (newState == KafkaStreams.State.ERROR || newState == KafkaStreams.State.PENDING_SHUTDOWN) {
        log.error("Kafka Stream is in [{}] state. Application should be restarted", newState);
    }
});
You could also add a health check indicator (e.g. via a REST endpoint or a Spring Boot HealthIndicator) that reports whether the stream is running or not:
KafkaStreams.State streamState = kafkaStream.state();
streamState.isRunning();
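For the Spring Boot option, a minimal HealthIndicator sketch (assuming the KafkaStreams instance is available as a bean) could look like this:

@Component
public class KafkaStreamsHealthIndicator implements HealthIndicator {

    private final KafkaStreams kafkaStream;

    public KafkaStreamsHealthIndicator(KafkaStreams kafkaStream) {
        this.kafkaStream = kafkaStream;
    }

    @Override
    public Health health() {
        KafkaStreams.State state = kafkaStream.state();
        // RUNNING and REBALANCING count as healthy; ERROR, NOT_RUNNING etc. report DOWN
        return state.isRunning()
                ? Health.up().withDetail("state", state.name()).build()
                : Health.down().withDetail("state", state.name()).build();
    }
}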
I also haven't found any Kafka Streams metrics that provide info about active consumers or connected partitions, but it would be nice if Kafka Streams exposed such data (hopefully it will be available in a future release).
I'm following the example here. While working through the code, I came up with two questions.
Are the key and the offset the same thing?
According to Google:
Offset: A Kafka topic receives messages across a distributed set of partitions where they are stored. Each partition maintains the messages it has received in a sequential order where they are identified by an offset, also known as a position.
Both seem very similar to me, since the offset identifies a unique message within a partition and producers send records to a partition based on the record's key.
What is the best way to choose the key/offset for a producer?
For instance, in the example I referenced above, they have chosen the timestamp as both the key and the offset.
Is that always the best recommendation?
class IRCMessageListener extends IRCEventAdapter {

  @Override
  public void onPrivmsg(String channel, IRCUser u, String msg) {
    IRCMessage event = new IRCMessage(channel, u, msg);
    // FIXME kafka round robin default partitioner seems to always publish to partition 0 only (?)
    long ts = event.getInt64("timestamp");
    Map<String, ?> srcOffset = Collections.singletonMap(TIMESTAMP_FIELD, ts);
    Map<String, ?> srcPartition = Collections.singletonMap(CHANNEL_FIELD, channel);
    SourceRecord record = new SourceRecord(srcPartition, srcOffset, topic, KEY_SCHEMA, ts, IRCMessage.SCHEMA, event);
    queue.offer(record);
  }
}
I'm actually trying to create a custom Kafka connector to get data from a third-party WebSocket API. The API sends real-time data stream messages for a given key value, so I thought of using that key as my partition key as well as the offset. But I need to make sure my thinking is right.
The key is optional metadata that can be sent with a Kafka message, and by default it is used to route the message to a specific partition. E.g. if you're sending a message m with key k to a topic mytopic that has p partitions, then m goes to partition Hash(k) % p in mytopic. The key has no connection to the offset of a partition whatsoever. Offsets are used by consumers to keep track of the position of the last read message in a partition. In your case, if the timestamp is fairly randomly distributed then it's fine; otherwise you might cause partition imbalance by using it as the key.
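To make the Hash(k) % p part concrete, this is roughly what Kafka's default partitioner does for a non-null key (a simplified sketch; the real logic lives in DefaultPartitioner and uses the murmur2 hash from the kafka-clients Utils class):

byte[] keyBytes = "my-key".getBytes(StandardCharsets.UTF_8);
int numPartitions = 6;  // p: the number of partitions of the topic
int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
// every message with the same key lands on the same partition;
// the offset is then assigned by the broker within that partition, independently of the key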
These are some basic differences:
Offset: maintained by Kafka to keep track of the records consumed, in order to avoid losing or duplicating records while consuming.
Key: specific to the input events; if it is not provided it defaults to null. It is useful, for example, when writing records to HDFS with the default partitioner via Kafka Connect. A message can have its own key, or many messages can share the same key.
I have a unique problem which is happening 50-100 times a day, with a message volume of ~2 million per day on the topic. I am using the Kafka producer API 0.8.2.1 and I have 12 brokers (v0.8.2.2) running in production with a replication factor of 4. I have a topic with 60 partitions, and I calculate the partition for every message myself and provide the value in the ProducerRecord. Now, the issue:
The application creates a ProducerRecord using
new ProducerRecord<String, String>(topic, 30, null, message1);
providing the topic, the value message1, and partition 30. The application then calls the send method and a future is returned:
// null is for callback
Future<RecordMetadata> future = producer.send(producerRecord, null);
The app then prints the offset and partition values by calling get() on the Future and reading the RecordMetadata. This is what I get:
Kafka Response : partition 30, offset 3416092
Now the app produces the next message, message2, to the same partition:
new ProducerRecord<String, String>(topic, 30, null, message2);
and the Kafka response is:
Kafka Response : partition 30, offset 3416092
I receive the same offset again, and if I pull the message at that offset of partition 30 using a simple consumer, it turns out to be message2, which essentially means I lost message1.
Based on the KafkaProducer documentation, I am using a single producer instance (a shared static instance) among 10 threads:
The producer is thread safe and should generally be shared among all threads for best performance.
I am using all default properties for the producer (except max.request.size: 10000000); the message (String payload) size can range from a few KB to 500 KB. I am using an acks value of 1.
What am I doing wrong here? Is there something I can look into, or any producer or server property I can tweak, to make sure I don't lose any messages? I need some help soon, as I am losing critical messages in production, which is not good at all; because no exception is thrown, the loss is hard to even detect unless a downstream process reports it.
EDIT:
The servers and the client have now been updated to Kafka 0.8.2.2, and the 10 application threads each use their own Kafka producer instance. We are seeing better performance, but there is still message loss.
Producer Properties:
value.serializer: org.apache.kafka.common.serialization.StringSerializer
key.serializer: org.apache.kafka.common.serialization.StringSerializer
bootstrap.servers: {SERVER VIP ENDPOINT}
acks: 1
batch.size: 204800
linger.ms: 10
send.buffer.bytes: 1048576
max.request.size: 10000000
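For now, to make sure a failed send is at least visible, I am considering sending with a callback and stronger durability settings (acks=all and retries). Whether these help with the duplicate-offset behaviour is an assumption on my part, not something I have verified on 0.8.2; props here is the producer Properties object and log a logger:

props.put("acks", "all");   // wait for all in-sync replicas instead of only the leader
props.put("retries", "3");  // retry transient send failures

producer.send(producerRecord, (metadata, exception) -> {
    if (exception != null) {
        // without a callback (or a get() on the Future) a failed send stays invisible
        log.error("Send failed for partition 30", exception);
    } else {
        log.info("Acked at partition {} offset {}", metadata.partition(), metadata.offset());
    }
});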