I need to consume messages from a Kafka topic within a specific time range. It is easy enough to determine the starting offset with the help of partition.seekToTimestamp(startTimeInMillis) like the code below:
ReceiverOptions<ByteBuffer, ByteBuffer> options =
receiverOptions
.subscription(topicConfig.getTopics())
.pollTimeout(Duration.ofMillis(topicConfig.getPollWaitTimeoutMs()))
.addAssignListener(
partitions -> {
partitions.forEach(partition -> partition.seekToTimestamp(startTimeInMillis));
});
but how do I stop consuming once the messages start exceeding the end timestamp?
I can use the below code:
KafkaReceiver.create(options)
.receive()
.takeWhile(record -> record.timestamp() < endTimeInMillis)
.map(this::handleConsumerRecord);
But the problem is that when the condition is met for a message, that message might be the last valid mesage in its partition, but it might be possible that other partitions in the topic still have messages below the end timestamp.
How do I ensure that I have consumed all the messages across partitions within the end timestamp?
Sometimes(seems very random) Kafka sends old messages. I only want the latest messages so it overwrite messages with the same key. Currently it looks like I have multiple messages with the same key it doesn't get compacted.
I use this setting in the topic:
cleanup.policy=compact
I'm using Java/Kotlin and Apache Kafka 1.1.1 client.
Properties(8).apply {
val jaasTemplate = "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"%s\" password=\"%s\";"
val jaasCfg = String.format(jaasTemplate, Configuration.kafkaUsername, Configuration.kafkaPassword)
put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
BOOTSTRAP_SERVERS)
put(ConsumerConfig.GROUP_ID_CONFIG,
"ApiKafkaKotlinConsumer${Configuration.kafkaGroupId}")
put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer::class.java.name)
put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer::class.java.name)
put("security.protocol", "SASL_SSL")
put("sasl.mechanism", "SCRAM-SHA-256")
put("sasl.jaas.config", jaasCfg)
put("max.poll.records", 100)
put("receive.buffer.bytes", 1000000)
}
Have I missed some settings?
If you want have only one value for each key, you have to use KTable<K,V> abstraction: StreamsBuilder::table(final String topic) from Kafka Streams. Topic used here should have cleanup policy set to compact.
If you use KafkaConsumer you just pull data from brokers. It doesn't give you any mechanism that perform some kind of deduplication. Depending on if compaction was performed or not, you can get one to n messages for same key.
Regarding compaction
Compaction doesn't mean, that all old value for same key are removed immediately. When old message for same key will be removed, depends on several properties. The most important are:
log.cleaner.min.cleanable.ratio
The minimum ratio of dirty log to total log for a log to eligible for cleaning
log.cleaner.min.compaction.lag.ms
The minimum time a message will remain uncompacted in the log. Only applicable for logs that are being compacted.
log.cleaner.enable
Enable the log cleaner process to run on the server. Should be enabled if using any topics with a cleanup.policy=compact including the internal offsets topic. If disabled those topics will not be compacted and continually grow in size.
More detail about compaction you can find https://kafka.apache.org/documentation/#compaction
I'm following here.While following the code. I came up with two Questions
Is the Key and offset were the same?
According to Google,
Offset: A Kafka topic receives messages across a distributed set of
partitions where they are stored. Each partition maintains the
messages it has received in a sequential order where they are
identified by an offset, also known as a position.
Seems both are very similar for me. Since offset maintain a unique message in the partition: Producers send records to a partition based on the record’s key
What is the best way to choose the Key/Offset for a producer?
For an instance the example which I provided above, they have chosen the timestamp as the Key and offset.
Is this the always the best recommendation?
class IRCMessageListener extends IRCEventAdapter {
#Override
public void onPrivmsg(String channel, IRCUser u, String msg) {
IRCMessage event = new IRCMessage(channel, u, msg);
//FIXME kafka round robin default partitioner seems to always publish to partition 0 only (?)
long ts = event.getInt64("timestamp");
Map<String, ?> srcOffset = Collections.singletonMap(TIMESTAMP_FIELD, ts);
Map<String, ?> srcPartition = Collections.singletonMap(CHANNEL_FIELD, channel);
SourceRecord record = new SourceRecord(srcPartition, srcOffset, topic, KEY_SCHEMA, ts, IRCMessage.SCHEMA, event);
queue.offer(record);
}
Because I'm actually trying to create a custom Kafka connector to get the data from 3rd Party WebSocket API. The API sends real-time data stream messages for a given Key value. So I thought of using that Key for my PartitionKey as well as Offset. But need to make sure I'm right about my thought.
Key is an optional metadata, that can be sent with a Kafka message, and by default, it is used to route message to a specific partition. E.g. if you're sending a message m with key as k, to a topic mytopic that has p partitions, then m goes to the partition Hash(k) % p in mytopic. It has no connection to the offset of a partition whatsoever. Offsets are used by consumers to keep track of the position of last read message in a partition. In your case, if the timestamp is fairly randomly distributed, then it's fine, else you might be causing partition imbalance while using it as key.
These are some basic differences :
Offset : maintained by kafka to keep a track of the records consumed to avoid loss of records and duplicate records while consuming.
Key : it is specific to input events,if it is not available then by default it is mentioned as null,this is useful while writing records to HDFS with default partition-er using kafka connect.every message can have a single key or many messages can have similar key.
We have multiple kafka consumers on different hosts.
Consumer-1 (on server-1) consumes data from Topic-1.
Consumer-2 (on server-2) consumes data from Topic-2.
The group.id of Consumer-1 and Consumer-2 are the same.
It's expected that Consumer-1 and Consumer-2 can run separately to process message from Topic-1 and Topic-2.
However, we found that sometimes when we reboot Consumer-2 (on server-2) it will try to fetch metadata about Topic-1 and failed finally. If we reboot it again, it may fetch metadata about Topic-2 and works as expected.
I'm confused and find out this page , if it's correct, the key for topic re-balance is like (topic:consumer_group_id) , why it's not stable?
We are using kafka version 0.11.0.1.
Use Manual partition assignment Consumer.assign method instead of subscribe. With assign, you have more control over which topic-partition a consumer should read data from. You can also commit the offset in Kafka with the same group.id
NOTE: With assign mechanism, there won't be any rebalance mechanism happens within the group.
I send String-messages to Kafka V. 0.8 with the Java Producer API.
If the message size is about 15 MB I get a MessageSizeTooLargeException.
I have tried to set message.max.bytesto 40 MB, but I still get the exception. Small messages worked without problems.
(The exception appear in the producer, I don't have a consumer in this application.)
What can I do to get rid of this exception?
My example producer config
private ProducerConfig kafkaConfig() {
Properties props = new Properties();
props.put("metadata.broker.list", BROKERS);
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");
props.put("message.max.bytes", "" + 1024 * 1024 * 40);
return new ProducerConfig(props);
}
Error-Log:
4709 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 214 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
4869 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 217 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5035 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 220 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5198 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 223 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5305 [main] ERROR kafka.producer.async.DefaultEventHandler - Failed to send requests for topics datasift with correlation ids in [213,224]
kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
at kafka.producer.async.DefaultEventHandler.handle(Unknown Source)
at kafka.producer.Producer.send(Unknown Source)
at kafka.javaapi.producer.Producer.send(Unknown Source)
You need to adjust three (or four) properties:
Consumer side:fetch.message.max.bytes - this will determine the largest size of a message that can be fetched by the consumer.
Broker side: replica.fetch.max.bytes - this will allow for the replicas in the brokers to send messages within the cluster and make sure the messages are replicated correctly. If this is too small, then the message will never be replicated, and therefore, the consumer will never see the message because the message will never be committed (fully replicated).
Broker side: message.max.bytes - this is the largest size of the message that can be received by the broker from a producer.
Broker side (per topic): max.message.bytes - this is the largest size of the message the broker will allow to be appended to the topic. This size is validated pre-compression. (Defaults to broker's message.max.bytes.)
I found out the hard way about number 2 - you don't get ANY exceptions, messages, or warnings from Kafka, so be sure to consider this when you are sending large messages.
Minor changes required for Kafka 0.10 and the new consumer compared to laughing_man's answer:
Broker: No changes, you still need to increase properties message.max.bytes and replica.fetch.max.bytes. message.max.bytes has to be equal or smaller(*) than replica.fetch.max.bytes.
Producer: Increase max.request.size to send the larger message.
Consumer: Increase max.partition.fetch.bytes to receive larger messages.
(*) Read the comments to learn more about message.max.bytes<=replica.fetch.max.bytes
The answer from #laughing_man is quite accurate. But still, I wanted to give a recommendation which I learned from Kafka expert Stephane Maarek. We actively applied this solution in our live systems.
Kafka isn’t meant to handle large messages.
Your API should use cloud storage (for example, AWS S3) and simply push a reference to S3 to Kafka or any other message broker. You'll need to find a place to save your data, whether it can be a network drive or something else entirely, but it shouldn't be a message broker.
If you don't want to proceed with the recommended and reliable solution above,
The message max size is 1MB (the setting in your brokers is called message.max.bytes) Apache Kafka. If you really needed it badly, you could increase that size and make sure to increase the network buffers for your producers and consumers.
And if you really care about splitting your message, make sure each message split has the exact same key so that it gets pushed to the same partition, and your message content should report a “part id” so that your consumer can fully reconstruct the message.
If the message is text-based try to compress the data, which may reduce the data size, but not magically.
Again, you have to use an external system to store that data and just push an external reference to Kafka. That is a very common architecture and one you should go with and widely accepted.
Keep that in mind Kafka works best only if the messages are huge in amount but not in size.
Source: https://www.quora.com/How-do-I-send-Large-messages-80-MB-in-Kafka
The idea is to have equal size of message being sent from Kafka Producer to Kafka Broker and then received by Kafka Consumer i.e.
Kafka producer --> Kafka Broker --> Kafka Consumer
Suppose if the requirement is to send 15MB of message, then the Producer, the Broker and the Consumer, all three, needs to be in sync.
Kafka Producer sends 15 MB --> Kafka Broker Allows/Stores 15 MB --> Kafka Consumer receives 15 MB
The setting therefore should be:
a) on Broker:
message.max.bytes=15728640
replica.fetch.max.bytes=15728640
b) on Consumer:
fetch.message.max.bytes=15728640
You need to override the following properties:
Broker Configs($KAFKA_HOME/config/server.properties)
replica.fetch.max.bytes
message.max.bytes
Consumer Configs($KAFKA_HOME/config/consumer.properties)
This step didn't work for me. I add it to the consumer app and it was working fine
fetch.message.max.bytes
Restart the server.
look at this documentation for more info:
http://kafka.apache.org/08/configuration.html
I think, most of the answers here are kind of outdated or not entirely complete.
To refer on the answer of Sacha Vetter (with the update for Kafka 0.10), I'd like to provide some additional Information and links to the official documentation.
Producer Configuration:
max.request.size (Link) has to be increased for files bigger than 1 MB, otherwise they are rejected
Broker/Topic configuration:
message.max.bytes (Link) may be set, if one like to increase the message size on broker level. But, from the documentation: "This can be set per topic with the topic level max.message.bytes config."
max.message.bytes (Link) may be increased, if only one topic should be able to accept lager files. The broker configuration must not be changed.
I'd always prefer a topic-restricted configuration, due to the fact, that I can configure the topic by myself as a client for the Kafka cluster (e.g. with the admin client). I may not have any influence on the broker configuration itself.
In the answers from above, some more configurations are mentioned as necessary:
replica.fetch.max.bytes (Link) (Broker config)
From the documentation: "This is not an absolute maximum, if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that progress can be made."
max.partition.fetch.bytes (Link) (Consumer config)
From the documentation: "Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress."
fetch.max.bytes (Link) (Consumer config; not mentioned above, but same category)
From the documentation: "Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress."
Conclusion: The configurations regarding fetching messages are not necessary to change for processing messages, lager than the default values of these configuration (had this tested in a small setup). Probably, the consumer may always get batches of size 1. However, two of the configurations from the first block has to be set, as mentioned in the answers before.
This clarification should not tell anything about performance and should not be a recommendation to set or not to set these configuration. The best values has to be evaluated individually depending on the concrete planned throughput and data structure.
One key thing to remember that message.max.bytes attribute must be in sync with the consumer's fetch.message.max.bytes property. the fetch size must be at least as large as the maximum message size otherwise there could be situation where producers can send messages larger than the consumer can consume/fetch. It might worth taking a look at it.
Which version of Kafka you are using? Also provide some more details trace that you are getting. is there some thing like ... payload size of xxxx larger
than 1000000 coming up in the log?
For people using landoop kafka:
You can pass the config values in the environment variables like:
docker run -d --rm -p 2181:2181 -p 3030:3030 -p 8081-8083:8081-8083 -p 9581-9585:9581-9585 -p 9092:9092
-e KAFKA_TOPIC_MAX_MESSAGE_BYTES=15728640 -e KAFKA_REPLICA_FETCH_MAX_BYTES=15728640 landoop/fast-data-dev:latest `
This sets topic.max.message.bytes and replica.fetch.max.bytes on the broker.
And if you're using rdkafka then pass the message.max.bytes in the producer config like:
const producer = new Kafka.Producer({
'metadata.broker.list': 'localhost:9092',
'message.max.bytes': '15728640',
'dr_cb': true
});
Similarly, for the consumer,
const kafkaConf = {
"group.id": "librd-test",
"fetch.message.max.bytes":"15728640",
... .. }
Here is how I achieved successfully sending data up to 100mb using kafka-python==2.0.2:
Broker:
consumer = KafkaConsumer(
...
max_partition_fetch_bytes=max_bytes,
fetch_max_bytes=max_bytes,
)
Producer (See final solution at the end):
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
)
Then:
producer.send(topic, value=data).get()
After sending data like this, the following exception appeared:
MessageSizeTooLargeError: The message is n bytes when serialized which is larger than the total memory buffer you have configured with the buffer_memory configuration.
Finally I increased buffer_memory (default 32mb) to receive the message on the other end.
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
buffer_memory=KafkaSettings.MAX_BYTES * 3,
)