I have my consumer configured like so:
The problem is that when I poll data from my test topic (1 partition containing 1000 messages), I only get 500 messages per poll. Each message is roughly 90 bytes apiece. This config should easily be high enough to handle all the data. Any reason why this would be?
Consumer Configuration
public static KafkaConsumer<String, SpecificRecordBase> createConsumer(Arguments args) {
    Properties properties = new Properties();
    properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, args.bootstrapServers);
    properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, SpecificAvroDeserializer.class.getName());
    properties.setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
    properties.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, args.groupId);
    properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.setProperty(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "4500");
    // Data batching configuration: specify the number of bytes to read per fetch
    properties.setProperty(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "500000000");
    properties.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "500000000");
    properties.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "500000000");
    properties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
    properties.setProperty(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, args.schemaRegistryUrl);
    return new KafkaConsumer<>(properties);
}
Polling Piece
.....
while (true) {
    ConsumerRecords<String, SpecificRecordBase> records =
        myConsumer.poll(Duration.ofSeconds(CONSUMER_POLL_SECONDS));
    ....
Record count here is 500
Edit:
I read in the docs that the default poll count is 500. Which config do I need? I don't really care about the number of messages, I care about the amount of bytes I'm streaming.
properties.setProperty(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "500000000");
properties.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "500000000");
properties.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "500000000");
properties.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500000000");
There is a consumer config property, max.poll.records, that you left at its default value, which is 500.
If you are using the Java consumer, you can also adjust max.poll.records to tune the number of records that are handled on every loop iteration.
refer to: Confluent Kafka Consumer Properties
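For example, a minimal sketch would be to add the following to the consumer properties shown in the question (the value is illustrative, not a recommendation):

// max.poll.records defaults to 500; raise it so a single poll() can return more
// records (the byte limits still cap what each fetch brings back)
properties.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "2000");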
I remember having a similar issue, but in my case the problem was caused by one of the byte limitations.
To complete the picture:
It seems you want to control exactly the amount of bytes the broker will send to your consumer. Indeed, you need to play with the following parameters:
FETCH_MIN_BYTES_CONFIG
==> The minimum amount of data the server should return for a fetch request.
FETCH_MAX_BYTES_CONFIG
==> The maximum amount of data the broker should return for a fetch request. Keep in mind that, if the first record batch in the first non-empty partition is larger than this value, the broker will still return it (to let the consumer make progress). This is not an absolute maximum.
FETCH_MAX_WAIT_MS_CONFIG
==> The maximum amount of time the server will block before answering the fetch request if there isn't enough data to satisfy fetch.min.bytes immediately.
It should be less than or equal to the timeout used in poll(timeout).
Playing with this parameter can be effective if you want to control the size you're streaming, but it will add latency.
MAX_POLL_RECORDS_CONFIG
==> The maximum number of records returned in a single call to poll(). As already explained in the other answer, this parameter is important if you want to control the size of the broker's answers.
If S is the expected payload size and s the average expected size of your records, you should make sure that MAX_POLL_RECORDS_CONFIG > S/s.
Keep in mind that the more control you want over the size of the payload (records), the more latency you might incur (by increasing FETCH_MAX_WAIT_MS_CONFIG); see the sketch below for how these settings fit together.
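To make the interplay concrete, here is a minimal sketch assuming ~90-byte records and a target payload of roughly 50 MB per poll; all values are illustrative:

Properties props = new Properties();
// Wait until ~50 MB are available, but no longer than 500 ms
props.setProperty(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "52428800");
props.setProperty(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
// Allow fetch responses (and each partition's share) up to ~50 MB
props.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "52428800");
props.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "52428800");
// S / s = 52428800 / 90 ≈ 583000, so keep MAX_POLL_RECORDS_CONFIG above that
props.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "600000");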
It seems the configs on the consumer side are okay, but you should also consider broker configs. On the broker side there is another size limit, called message.max.bytes. You should increase it too.
From Kafka docs:
message.max.bytes: The largest record batch size allowed by Kafka. If this is increased and there are consumers older than 0.10.2, the consumers' fetch size must also be increased so that they can fetch record batches this large. In the latest message format version, records are always grouped into batches for efficiency. In previous message format versions, uncompressed records are not grouped into batches and this limit only applies to a single record in that case. This can be set per topic with the topic level max.message.bytes config. (default: 1000012)
You can also check this for more information.
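If you prefer to change the topic-level limit programmatically rather than through broker config files, a rough sketch using the Java AdminClient (broker 2.3+ for incrementalAlterConfigs) might look like the following; the topic name and the 50 MB value are assumptions, and exception handling is omitted since get() throws checked exceptions:

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient admin = AdminClient.create(adminProps)) {
    // Raise max.message.bytes for the hypothetical topic "test-topic"
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test-topic");
    AlterConfigOp op = new AlterConfigOp(
            new ConfigEntry("max.message.bytes", "52428800"), AlterConfigOp.OpType.SET);
    Map<ConfigResource, Collection<AlterConfigOp>> update = Map.of(topic, List.of(op));
    admin.incrementalAlterConfigs(update).all().get();
}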
Related
I'm a bit confused by some of the consumer API configuration properties. It seems as though they either conflict or cancel each other out. Can someone help me understand the difference between the following keys?
Definitions:
fetch.max.bytes: Maximum amount of data the server should return for a fetch request
max.partition.fetch.bytes: Max amount of data per-partition the server will return
max.poll.records: The maximum number of records returned in a single call to poll()
Example:
fetch.max.bytes: 30000 (30kb)
max.partition.fetch.bytes: 20000000 (20mb)
max.poll.records: 1000
To me it seems like the consumer definition above is saying it can accept up to 20 MB of data per partition, but then only specifies a maximum of 30 KB per fetch, which doesn't make sense. Max poll records also seems to limit data intake, since 1000 could be too low or too high depending on the size of each record.
fetch.max.bytes and max.partition.fetch.bytes are fields of Fetch requests sent to Kafka brokers. They respectively determine the maximum size of the Fetch response the broker will send and the maximum size of data per partition the broker can return. It's the broker that uses these values to compute a Fetch response.
On the other hand, max.poll.records is a client-side only configuration. It determines how many records a call to poll() can return.
The consumer will fetch records in the background and buffer them so records are ready when poll() is called.
These settings allow, for example, fetching records in large batches, which is more efficient, while still passing them to the consumer application in small chunks or even individually, depending on the processing it's doing.
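A hedged sketch of how these three settings might be combined (values are illustrative): fetch in large batches for efficiency, but hand records to the application in small chunks:

Properties props = new Properties();
// Broker-side limits applied when building the Fetch response
props.setProperty(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "20000000");          // ~20 MB per response
props.setProperty(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "5000000"); // ~5 MB per partition
// Client-side limit: how many buffered records a single poll() hands back
props.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");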
I am doing something like the following pseudo code
var consumer = new KafkaConsumer();
consumer.assign(topicPartitions);
var beginOff = consumer.beginningOffsets(topicPartitions);
var endOff = consumer.endOffsets(topicPartitions);
var lastOffsets = Math.max(beginOff, endOff - 1); // per partition
lastOffsets.forEach(consumer::seek);
lastMessages = consumer.poll(1 sec);
// do something with the received messages
consumer.close();
In the simple test that I did, this works, but I wonder if there are cases, like producer crashes etc., where offsets are not monotonically increasing by one? In that case, would I have to seek() my way back in time, or can I get the message offset of the last already produced message from Kafka?
I am not using transactions, so we don't need to worry about read-committed vs. uncommitted messages.
Edit: An example where offsets are not consecutive is after log compaction. However, log compaction should always keep the last message, as it is - obviously - more recent than all preceding messages (same key or not). But the offset before that last message could theoretically have been compacted away.
In kafka.apache.org/10/javadoc/, it is clearly mentioned for consumer.endOffsets:
Get the last offset for the given partitions. The last offset of a partition is the offset of the upcoming message, i.e. the offset of the last available message + 1.
So when you take endOff - 1, that is the last available Kafka record for that topic partition at the time you fetched it. Producer concerns therefore do not affect this.
And one more thing: the offset is not decided by the producer. It is decided by the partition leader of that topic partition, so it is always monotonically increasing by one.
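For reference, a runnable sketch of the approach above, assuming a single-partition topic named "my-topic", String deserializers, and a pre-built consumerProps:

List<TopicPartition> topicPartitions = List.of(new TopicPartition("my-topic", 0));
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
    consumer.assign(topicPartitions);
    Map<TopicPartition, Long> beginOff = consumer.beginningOffsets(topicPartitions);
    Map<TopicPartition, Long> endOff = consumer.endOffsets(topicPartitions);
    for (TopicPartition tp : topicPartitions) {
        // endOffsets() returns "last available offset + 1", so seek to endOff - 1,
        // but never before the beginning offset (covers the empty-partition case)
        consumer.seek(tp, Math.max(beginOff.get(tp), endOff.get(tp) - 1));
    }
    ConsumerRecords<String, String> lastMessages = consumer.poll(Duration.ofSeconds(1));
    lastMessages.forEach(r -> System.out.println(r.partition() + "@" + r.offset() + ": " + r.value()));
}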
I am opening a kafka producer with config properties -
KafkaProducer<String, MyValue> producer = new KafkaProducer<String, MyValue>(kafkaProperties);
then sending records synchronously using - (so as to avoid batching and also maintain the original message order)
// create myValue instance (omitted for simplicity)
// create myRecord instance using topic name and myValue
producer.send(myRecord).get();
producer.flush(); //send message as soon as record is available to producer
Now my issue is that I have several records to send, and between sends I might have to wait for a long time - a few minutes to hours (for whatever reason; at the very least to explore and understand Kafka better).
I want to know how long the producer's connection with the cluster/bootstrap server will stay alive. Is there any way I can configure it using the producer configurations?
(In-depth explanations will be greatly appreciated - even if it has to go down to the TCP connection level, you are welcome.)
(Kafka consumers have a heartbeat concept. Do producers have a similar concept? A Google search for "kafka producer heartbeat.interval.ms" returned only results for the consumer.)
The KafkaProducer.send method is asynchronous; by default it adds records to a memory buffer and sends them in batches. According to the docs, the producer establishes the connection while sending a batch to the cluster:
The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.
The producer maintains buffers of unsent records for each partition. These buffers are of a size specified by the batch.size config. Making this larger can result in more batching, but requires more memory (since we will generally have one of these buffers for each active partition).
By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0.
This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch. This is analogous to Nagle's algorithm in TCP.
For example, in the code snippet above, likely all 100 records would be sent in a single request since we set our linger time to 1 millisecond. However this setting would add 1 millisecond of latency to our request waiting for more records to arrive if we didn't fill up the buffer.
Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.
From the KafkaProducer.flush docs: invoking flush doesn't mean the producer sends each record to the cluster individually; invoking flush makes all buffered records immediately available to send:
Invoking this method makes all buffered records immediately available to send (even if linger.ms is greater than 0) and blocks on the completion of the requests associated with these records. The post-condition of flush() is that any previously sent record will have completed (e.g. Future.isDone() == true). A request is considered completed when it is successfully acknowledged according to the acks configuration you have specified or else it results in an error.
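Putting the quoted pieces together, here is a minimal sketch of the batching knobs discussed above; the topic name, serializers, and values are assumptions, and exception handling is omitted since get() throws checked exceptions:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384"); // bytes buffered per partition before a send
props.put(ProducerConfig.LINGER_MS_CONFIG, "1");      // wait up to 1 ms for more records to batch

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // .get() blocks until the record is acknowledged, deliberately defeating
    // batching, as in the question
    producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
}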
I have a consumer set up that manually commits offsets. Events are in the millions to low billions. I'm committing offsets if and only if processing was successful within the consumer batch being processed. However, we're noticing that even with commitSync being called successfully, we have hundreds of thousands of duplicates. We will commitSync and just repull the same exact data in the consumer on the next poll from the topic. Why would this happen?
@Ryan - kindly ensure that you have set the below property for the consumer:
props.setProperty("enable.auto.commit", "false");
Even if the above does not give you the desired result due to the huge load, kindly commit the current offsets with the overload below so that you will not get the same offsets in the next poll:
public void commitSync(java.util.Map<TopicPartition,OffsetAndMetadata> offsets)
Follow API at
commitSync
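For illustration, a hedged sketch of committing explicit offsets after successful processing; note that the committed offset should be the last processed offset plus one, which is a common source of re-reading the last record. The records/consumer variables and process() method are assumptions standing in for your poll loop:

Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
for (ConsumerRecord<String, String> record : records) {
    process(record); // hypothetical processing step
    // commit the *next* offset to consume, i.e. last processed offset + 1
    toCommit.put(new TopicPartition(record.topic(), record.partition()),
                 new OffsetAndMetadata(record.offset() + 1));
}
consumer.commitSync(toCommit);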
I have a setup where several KafkaConsumers each handle a number of partitions on a single topic. They are statically assigned the partitions, in a way that ensures that each consumer has an equal number of partitions to handle. The record key is also chosen so that we have equal distribution of messages over all partitions.
At times of heavy load, we often see a small number of partitions build up a considerable lag (thousands of messages/several minutes worth), while other partitions that are getting the same load and are consumed by the same consumer manage to keep the lag down to a few hundred messages / couple of seconds.
It looks like the consumer is fetching records as fast as it can, going around most of the partitions, but now and then there is one partition that gets left out for a long time. Ideally, I'd like to see the lag spread out more evenly across the partitions.
I've been reading about KafkaConsumer poll behaviour and configuration for a while now, and so far I think there are 2 options to work around this:
Build something custom that can monitor the lag per partition, and use KafkaConsumer.pause() and .resume() to essentially force the KafkaConsumer to read from the partitions with the most lag
Restrict our KafkaConsumer to only ever subscribe to one TopicPartition, and work with multiple instances of KafkaConsumer.
Neither of these options seem like the proper way to handle this. Configuration also doesn't seem to have the answer:
max.partition.fetch.bytes only specifies the max fetch size for a single partition, it doesn't guarantee that the next fetch will be from another partition.
max.poll.interval.ms only works for consumer groups and not on a per-partition basis.
Am I missing a way to encourage the KafkaConsumer to switch partition more often? Or a way to implement a preference for the partitions with the highest lag?
Not sure whether the answer is still relevant to you or whether it exactly matches your needs; however, you could try a lag-aware assignor. This assignor assigns partitions to consumers so that the lag is distributed uniformly/equally among consumers. Here is a well-written implementation of a lag-based assignor that I have used:
https://github.com/grantneale/kafka-lag-based-assignor
All you need is to configure your consumer to use this assignor, with the statement below:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, LagBasedPartitionAssignor.class.getName());
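A rough sketch of wiring it up end to end, assuming the LagBasedPartitionAssignor class from the linked repository is on the classpath; note that a partition assignment strategy only applies when consumers use subscribe() (group management), not the static assign() described in the question:

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          LagBasedPartitionAssignor.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("my-topic")); // the assignor runs during group rebalances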