I am opening a Kafka producer with config properties:
KafkaProducer<String, MyValue> producer = new KafkaProducer<String, MyValue>(kafkaProperties);
I then send records synchronously (to avoid batching and to preserve the original message order):
// create myValue instance (omitted for simplicity)
// create myRecord instance using the topic name and myValue
producer.send(myRecord).get();
producer.flush(); //send message as soon as record is available to producer
My issue is that I have several records to send, and between sends I might have to wait a long time, from a few minutes to hours (for whatever reason; at the very least I want to explore and understand Kafka better).
I want to know how long the producer's connection to the cluster/bootstrap server stays alive. Is there any way to configure this through the producer configuration?
(In-depth explanations are very welcome, even if they go down to the TCP connection level.)
(Kafka consumers have a heartbeat concept. Do producers have a similar concept? A Google search for "kafka producer heartbeat.interval.ms" returned results only for the consumer.)
KafkaProducer.send() is asynchronous: by default it adds records to an in-memory buffer and sends them in batches. So, according to the docs, the producer establishes the connection when it sends a batch to the cluster:
The send() method is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.
The producer maintains buffers of unsent records for each partition. These buffers are of a size specified by the batch.size config. Making this larger can result in more batching, but requires more memory (since we will generally have one of these buffers for each active partition).
By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0.
This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch. This is analogous to Nagle's algorithm in TCP.
For example, in the code snippet above, likely all 100 records would be sent in a single request since we set our linger time to 1 millisecond. However this setting would add 1 millisecond of latency to our request waiting for more records to arrive if we didn't fill up the buffer.
Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.
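As a rough illustration of the two settings quoted above, the sketch below builds a producer Properties object using the documented config names batch.size and linger.ms. The class name, the method name, and the chosen values are assumptions for illustration only; the resulting object is the kind of thing that would be passed to the KafkaProducer constructor.

```java
import java.util.Properties;

final class ProducerBatchingConfigSketch {
    // Builds the batching-related producer settings.
    // "batch.size" and "linger.ms" are Kafka's documented config names;
    // the values passed in are illustrative assumptions, not recommendations.
    static Properties batchingProperties(int batchSizeBytes, long lingerMs) {
        Properties props = new Properties();
        props.setProperty("batch.size", Integer.toString(batchSizeBytes));
        props.setProperty("linger.ms", Long.toString(lingerMs));
        return props;
    }

    public static void main(String[] args) {
        // 16 KiB per-partition buffer, wait up to 1 ms for more records.
        Properties props = batchingProperties(16384, 1);
        System.out.println("batch.size=" + props.getProperty("batch.size")
                + ", linger.ms=" + props.getProperty("linger.ms"));
    }
}
```

With linger.ms left at 0 and each send followed by get(), as in the question, each record is typically sent on its own, at the cost of one request per record.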
From the KafkaProducer.flush documentation: invoking flush() does not mean the producer sends each record to the cluster individually; it makes all buffered records immediately available to send:
Invoking this method makes all buffered records immediately available to send (even if linger.ms is greater than 0) and blocks on the completion of the requests associated with these records. The post-condition of flush() is that any previously sent record will have completed (e.g. Future.isDone() == true). A request is considered completed when it is successfully acknowledged according to the acks configuration you have specified or else it results in an error.
I'm a bit confused by some of the consumer API configuration properties. It seems as though they either conflict or cancel each other out. Can someone help me understand the difference between the following keys?
Definitions:
fetch.max.bytes: Maximum amount of data the server should return for a fetch request
max.partition.fetch.bytes: Max amount of data per-partition the server will return
max.poll.records: The maximum number of records returned in a single call to poll()
Example:
fetch.max.bytes: 30000 (30 KB)
max.partition.fetch.bytes: 20000000 (20 MB)
max.poll.records: 1000
To me it seems the consumer configuration above says it can accept up to 20 MB of data per partition, yet caps the whole fetch at 30 KB, which doesn't make sense. max.poll.records also seems to limit data intake, since 1000 may be too low or too high depending on the size of each record.
fetch.max.bytes and max.partition.fetch.bytes are fields of Fetch requests sent to Kafka brokers. They respectively determine the maximum size of the Fetch response the broker will send and the maximum size of data per partition the broker can return. It's the broker that uses these values to compute a Fetch response.
On the other hand, max.poll.records is a client-side only configuration. It determines how many records a call to poll() can return.
The consumer will fetch records in the background and buffer them so records are ready when poll() is called.
These settings make it possible, for example, to fetch records in batches, which is more efficient, but still pass them to the consumer application in small chunks, or even individually, depending on the processing it's doing.
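The split between the broker-side fetch limits and the client-side max.poll.records can be pictured with a small model. The sketch below is not the real KafkaConsumer implementation; PollBufferSketch, onFetchResponse, and buffered are hypothetical names that only mimic the idea of buffering one large fetch response and handing it to the application in smaller chunks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class PollBufferSketch {
    private final Deque<String> buffer = new ArrayDeque<>();
    private final int maxPollRecords;

    PollBufferSketch(int maxPollRecords) {
        if (maxPollRecords <= 0) {
            throw new IllegalArgumentException("maxPollRecords must be positive");
        }
        this.maxPollRecords = maxPollRecords;
    }

    // Stands in for a fetch response arriving from the broker; its size would be
    // bounded by fetch.max.bytes / max.partition.fetch.bytes in the real client.
    void onFetchResponse(List<String> records) {
        buffer.addAll(records);
    }

    // Hands out at most maxPollRecords already-buffered records, oldest first.
    List<String> poll() {
        List<String> out = new ArrayList<>();
        while (out.size() < maxPollRecords && !buffer.isEmpty()) {
            out.add(buffer.pollFirst());
        }
        return out;
    }

    // Number of records fetched but not yet returned by poll().
    int buffered() {
        return buffer.size();
    }

    public static void main(String[] args) {
        PollBufferSketch consumer = new PollBufferSketch(2);
        consumer.onFetchResponse(Arrays.asList("r0", "r1", "r2", "r3", "r4"));
        System.out.println(consumer.poll() + " returned, " + consumer.buffered() + " still buffered");
    }
}
```

In this model one fetch of five records is drained by three poll() calls, which mirrors how a large fetch can still reach the application in small pieces.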
I am new to Apache Kafka, and I am trying to configure it so that it accepts as many messages from the producer as possible but delivers only a configured number of messages to the consumer per time interval.
In other words: how do I configure Apache Kafka to send only, for example, 50 messages per 30 seconds to the consumer regardless of how many messages are available, then another 50 of the cached messages in the next 30 seconds, and so on?
If you have control over the consumer
You could use the max.poll.records property to limit the maximum number of records returned per poll() call. Then you only need to ensure that poll() is called once every 30 seconds.
In general, all available configuration properties are listed in the Kafka documentation.
If you cannot control the consumer
Then the only option for you is to write messages at the rate you need: write at most 50 messages in 30 seconds. There are no configuration options available for this. Only your application logic can achieve that.
Update: how to ensure poll() is called every 30 seconds
The simplest way is to:
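while (true) {
    // returns at most max.poll.records records (java.time.Duration overload, Kafka 2.0+)
    consumer.poll(Duration.ofMillis(100));
    // .. do your stuff
    Thread.sleep(30000);
}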
You can make this more precise by measuring the processing time (from just after the poll() call up to Thread.sleep()) and sleeping only for the remainder, so that you never wait more than 30 seconds in total.
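One way to sketch that refinement is to compute the sleep time from the time already spent polling and processing. remainingSleepMs and the class name are hypothetical helpers, and the demo loop uses a short interval and a placeholder comment instead of a real consumer.

```java
final class PollIntervalSketch {
    // Time left in the current interval after elapsedMs of work; never negative,
    // so slow processing simply means the next poll happens immediately.
    static long remainingSleepMs(long intervalMs, long elapsedMs) {
        return Math.max(0L, intervalMs - elapsedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 200; // assumption: shortened for the demo; the answer uses 30000
        for (int i = 0; i < 3; i++) {
            long startNanos = System.nanoTime();
            // consumer.poll(...) and record processing would go here
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
            Thread.sleep(remainingSleepMs(intervalMs, elapsedMs));
        }
        System.out.println("done");
    }
}
```

For example, if polling and processing took 1200 ms of a 30000 ms interval, the loop would sleep for the remaining 28800 ms.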
The problem is that the producer doesn't really send messages to the consumer. A persistent Kafka topic sits in between, where the producer places its messages, and the producer doesn't care whether there is any consumer on the other side. The same holds from the consumer's perspective: it just subscribes to data from the topic and doesn't care whether there is a producer on the other side. So thinking about back-pressure from the consumer down to the producer, when there is messaging middleware in between, is the wrong direction.
On the other hand, it is not clear how the consumed messages may impact your third-party service. The point is that a Kafka consumer is single-threaded per partition, so all messages from one partition are processed one by one in the same thread. This way you cannot send more than one message at a time to your service: the next one can be sent only after the previous one has been answered. So think about it: how could your consumer application even exceed the rate limit?
However, if you have enough partitions and high concurrency on the consumer side, you really may end up with several parallel requests to your service from different threads. For this case I would suggest looking into the Rate Limiter pattern. This library provides a good implementation: https://resilience4j.readme.io/docs/ratelimiter. It is much better to keep messages in the topic than to try to limit the producer somehow.
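To make the pattern concrete, here is a minimal hand-rolled fixed-window limiter. It is not resilience4j's implementation or API, just a simplified stand-in; the class name and the explicit nowMs parameter (used so the clock can be controlled from outside) are assumptions.

```java
final class FixedWindowRateLimiter {
    private final int permitsPerWindow;
    private final long windowMs;
    private long windowStartMs;
    private int used;

    FixedWindowRateLimiter(int permitsPerWindow, long windowMs, long nowMs) {
        if (permitsPerWindow <= 0 || windowMs <= 0) {
            throw new IllegalArgumentException("permitsPerWindow and windowMs must be positive");
        }
        this.permitsPerWindow = permitsPerWindow;
        this.windowMs = windowMs;
        this.windowStartMs = nowMs;
    }

    // Returns true if a call to the downstream service is allowed at time nowMs.
    // synchronized so that several consumer threads can share one limiter.
    synchronized boolean tryAcquire(long nowMs) {
        if (nowMs - windowStartMs >= windowMs) {
            // A new window begins: reset the counter.
            windowStartMs = nowMs;
            used = 0;
        }
        if (used < permitsPerWindow) {
            used++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(50, 30_000, now);
        System.out.println("first call allowed: " + limiter.tryAcquire(now));
    }
}
```

A consumer thread that gets false from tryAcquire would wait and retry rather than drop the record, so the backlog stays in the topic as suggested above.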
To conclude: even if the consumer side is not your project, it is better to discuss with that team how to improve their consumer. You did your part well: the producer sends messages to the Kafka topic. What else can you do here?
This is an interesting use case, and I'm not sure why you need it, but here are two possible solutions:
1. To protect the cluster, you could use quotas, which limit bandwidth throughput rather than the number of messages: https://kafka.apache.org/documentation/#design_quotas.
2. If you need an exact number of messages per time frame, you could put a buffering service (a rate limiter) in between that consumes, pauses, and publishes messages to the topic the end consumer reads. The rate limiter could consume the next 50 messages and then pause until the interval has passed. This will increase the space used on your cluster because messages are duplicated. You also need to be careful about how you pause the consumer: heartbeats need to be sent, or your consumer will rebalance continuously, i.e. you can't just sleep until the next interval. This applies, obviously, only if you can't control the end consumer.
I want to understand the relationship between the timeout passed to the poll() method and the fetch.max.wait.ms configuration. Let's say I have the following configuration:
fetch.min.bytes = 50
fetch.max.wait.ms = 400 ms
timeout on poll method = 200 ms
Suppose I call poll() with the timeout above. The consumer sends a fetch request to the Kafka broker that is the leader for that partition. The broker does not have enough bytes to satisfy fetch.min.bytes, so it will wait up to 400 milliseconds for enough data to arrive before responding. But I have configured the poll() timeout to be 200 ms. Does that mean that, under the hood, when the fetch request is sent, the consumer waits only 200 ms for the server to respond and then terminates the connection?
Is that how it will turn out? In this scenario, would it be safe to say that you should always configure your timeout as:
timeout >= network latency + fetch.max.wait.ms
Also, does the Kafka consumer fetch records proactively? By that I mean: is the consumer busy fetching records under the hood while the user code is processing the records from the last poll() call, so as to reduce latency the next time poll() is called? If yes, how does it maintain this internal buffer? Can we also configure the maximum size of this internal buffer?
Thank you in advance.
The timeout on poll() allows you to do asynchronous processing. After subscribing to a set of topics, the consumer automatically joins the group when poll(long) is invoked. The poll API is designed to ensure consumer availability.
As long as the consumer continues to call poll(), it will stay in the group and continue to receive messages from the partitions it was assigned.
Under the hood, the consumer sends periodic heartbeats to the server. If the consumer crashes or is unable to send heartbeats for a duration of session.timeout.ms, then the consumer will be considered dead and its partitions will be reassigned.
But we should be careful that the long value in poll(long) is not too large, because that makes the whole process synchronous. You can read the discussion at the link below.
https://github.com/confluentinc/confluent-kafka-dotnet/issues/142
fetch.max.wait.ms: the maximum amount of time the server will block a fetch request when there isn't sufficient data to immediately satisfy fetch.min.bytes.
Point 1: When a fetch request arrives and fewer than 50 bytes are available, the server blocks the request for up to 400 ms.
fetch.min.bytes= 50
fetch.max.wait.ms= 400 ms
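Point 2: With a 200 ms poll timeout, poll() returns at least every 200 ms, so your consumer keeps calling poll() (which, in clients without a background heartbeat thread, is also what sends heartbeats) and Kafka does not trigger a rebalance.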
timeout on poll method= 200 ms
When Point 1 happens your consumer is idle, but because of Point 2 it still signals liveness every 200 ms, so no rebalance occurs and you can perform other tasks asynchronously in the meantime.
So the poll() timeout only ensures that your consumer is not considered dead, while fetch.max.wait.ms tells the server how long it may wait when a fetch request arrives. What I mean is that there is no inherent dependency between the two parameters; the poll() timeout is more about doing things asynchronously in your code.
The timeout applies purely to poll().
I am attempting to use <KStream>.process() with a TimeWindows.of("name", 30000) to batch up some KTable values and send them on. It seems that 30 seconds exceeds the consumer timeout interval after which Kafka considers said consumer to be defunct and releases the partition.
I've tried upping the frequency of poll and commit interval to avoid this:
config.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "5000");
config.put(StreamsConfig.POLL_MS_CONFIG, "5000");
Unfortunately these errors are still occurring:
(lots of these)
ERROR o.a.k.s.p.internals.RecordCollector - Error sending record to topic kafka_test1-write_aggregate2-changelog
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for kafka_test1-write_aggregate2-changelog-0
Followed by these:
INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator 12.34.56.7:9092 (id: 2147483547 rack: null) dead for group kafka_test1
WARN o.a.k.s.p.internals.StreamThread - Failed to commit StreamTask #0_0 in thread [StreamThread-1]:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:578)
Clearly I need to be sending heartbeats back to the server more often. How?
My topology is:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, String> lines = kStreamBuilder.stream(TOPIC);
KTable<Windowed<String>, String> kt = lines.aggregateByKey(
new DBAggregateInit(),
new DBAggregate(),
TimeWindows.of("write_aggregate2", 30000));
DBProcessorSupplier dbProcessorSupplier = new DBProcessorSupplier();
kt.toStream().process(dbProcessorSupplier);
KafkaStreams kafkaStreams = new KafkaStreams(kStreamBuilder, streamsConfig);
kafkaStreams.start();
The KTable is grouping values by key every 30 seconds. In Processor.init() I call context.schedule(30000).
DBProcessorSupplier provides an instance of DBProcessor. This is an implementation of AbstractProcessor where all the overrides have been provided. All they do is LOG so I know when each is being hit.
It's a pretty simple topology but it's clear I'm missing a step somewhere.
Edit:
I get that I can adjust this on the server side, but I'm hoping there is a client-side solution. I like the notion of partitions being made available pretty quickly when a client exits or dies.
Edit:
In an attempt to simplify the problem I removed the aggregation step from the graph. It's now just consumer -> processor(). (If I send the consumer directly to .print() it works very quickly, so I know that part is OK. Similarly, if I output the aggregation (KTable) via .print() it seems OK too.)
What I found was that .process(), which should be calling .punctuate() every 30 seconds, is actually blocking for variable lengths of time and outputting somewhat randomly (if at all).
Main program
Debug output
Processor Supplier
Processor
Further:
I set the log level to 'debug' and reran. I'm seeing lots of messages like:
DEBUG o.a.k.s.p.internals.StreamTask - Start processing one record [ConsumerRecord <info>
but a breakpoint in the .punctuate() function isn't getting hit. So it's doing lots of work but not giving me a chance to use it.
A few clarifications:
StreamsConfig.COMMIT_INTERVAL_MS_CONFIG is a lower bound on the commit interval, i.e., after a commit, the next commit does not happen before this time has passed. Basically, Kafka Streams tries to commit as soon as possible after this time has passed, but there is no guarantee how long it will actually take until the next commit.
StreamsConfig.POLL_MS_CONFIG is used for the internal KafkaConsumer#poll() call, to specify the maximum blocking time of the poll() call.
Thus, neither value helps you heartbeat more often.
Kafka Streams follows a "depth-first" strategy when processing records. This means that after a poll(), all operators of the topology are executed for each record in turn. Assume you have three consecutive maps: then all three maps are called for the first record before the second record gets processed.
Thus, the next poll() call is made only after all records of the first poll() have been fully processed. If you want to heartbeat more often, you need to make sure that a single poll() call fetches fewer records, so that processing them takes less time and the next poll() is triggered earlier.
You can use configuration parameters for KafkaConsumer that you can specify via StreamsConfig to get this done (see https://kafka.apache.org/documentation.html#consumerconfigs):
streamsConfig.put(ConsumerConfig.XXX, VALUE);
max.poll.records: if you decrease this value, fewer records will be polled
session.timeout.ms: if you increase this value, there is more time for processing data (adding this for completeness because it is actually a client setting and not a server/broker side configuration -- even if you are aware of this solution and do not like it :))
EDIT
As of Kafka 0.10.1 it is possible (and recommended) to prefix consumer and producer configs within the streams config. This avoids parameter conflicts, as some parameter names are used by both the consumer and the producer and cannot be distinguished otherwise (they would be applied to both at the same time).
To prefix a parameter you can use StreamsConfig#consumerPrefix() or StreamsConfig#producerPrefix(), respectively. For example:
streamsConfig.put(StreamsConfig.consumerPrefix(ConsumerConfig.PARAMETER), VALUE);
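As a sketch of what the prefixed keys look like, the example below builds the property names by hand. It assumes that the prefix helpers prepend "consumer." and "producer." to the parameter name; the two helper methods here are local stand-ins, not the StreamsConfig methods themselves, and the values are arbitrary.

```java
import java.util.Properties;

final class StreamsPrefixSketch {
    // Local stand-ins (assumption) for StreamsConfig.consumerPrefix / producerPrefix.
    static String consumerPrefix(String name) {
        return "consumer." + name;
    }

    static String producerPrefix(String name) {
        return "producer." + name;
    }

    // Applies the two consumer settings discussed above to the consumer only.
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.setProperty(consumerPrefix("max.poll.records"), "50");      // arbitrary value
        props.setProperty(consumerPrefix("session.timeout.ms"), "60000"); // arbitrary value
        return props;
    }

    public static void main(String[] args) {
        streamsProperties().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Because the keys carry the consumer prefix, a producer setting with the same base name could be given a different value under the producer prefix without conflict.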
One more thing to add: the scenario described in this question is a known issue, and KIP-62 introduces a background thread for KafkaConsumer that sends heartbeats, thus decoupling heartbeats from poll() calls. Kafka Streams will leverage this new feature in upcoming releases.
We have a system that receives data from users and pushes it to Kafka; only when we are sure the data has been pushed do we send the user an "OK" response.
Since the new Kafka producer uses an asynchronous send(ProducerRecord, Callback), I wanted to know whether this send is crash resistant (fault-tolerant).
My guess is that it most probably is not, so how can I use it in sync mode? Or should I make the user wait until the callback is called?
According to Kafka's design documentation:
Asynchronous send
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer has an asynchronous mode that accumulates data in memory and sends out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 100 messages or 5 seconds). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. Since this buffering happens in the client it obviously reduces the durability as any data buffered in memory and not yet sent will be lost in the event of a producer crash.
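Given that risk, one option is to answer the user only after the send has been acknowledged. The sketch below models this with a CompletableFuture; AsyncSender is a hypothetical stand-in for an asynchronous producer send (with the real producer, blocking on the Future returned by send() plays the same role), and the "OK"/"ERROR" strings and timeout are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AckBeforeOkSketch {
    // Hypothetical stand-in for an asynchronous send; the future completes with
    // the record offset once the broker has acknowledged the write.
    interface AsyncSender {
        CompletableFuture<Long> send(String payload);
    }

    // Replies "OK" only after the acknowledgement has arrived within timeoutMs.
    static String handle(AsyncSender sender, String payload, long timeoutMs) {
        try {
            sender.send(payload).get(timeoutMs, TimeUnit.MILLISECONDS);
            return "OK";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "ERROR";
        } catch (ExecutionException | TimeoutException e) {
            // The send failed or was not acknowledged in time: do not claim success.
            return "ERROR";
        }
    }

    public static void main(String[] args) {
        AsyncSender acked = payload -> CompletableFuture.completedFuture(42L);
        System.out.println(handle(acked, "hello", 1000));
    }
}
```

The trade-off is latency: each user request now waits for a broker round trip, which is the price of not reporting success for data that is still only buffered in memory.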