Kafka KStreams - processing timeouts - java

I am attempting to use <KStream>.process() with a TimeWindows.of("name", 30000) to batch up some KTable values and send them on. It seems that 30 seconds exceeds the consumer timeout interval after which Kafka considers said consumer to be defunct and releases the partition.
I've tried upping the frequency of poll and commit interval to avoid this:
config.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "5000");
config.put(StreamsConfig.POLL_MS_CONFIG, "5000");
Unfortunately these errors are still occurring:
(lots of these)
ERROR o.a.k.s.p.internals.RecordCollector - Error sending record to topic kafka_test1-write_aggregate2-changelog
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for kafka_test1-write_aggregate2-changelog-0
Followed by these:
INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator 12.34.56.7:9092 (id: 2147483547 rack: null) dead for group kafka_test1
WARN o.a.k.s.p.internals.StreamThread - Failed to commit StreamTask #0_0 in thread [StreamThread-1]:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:578)
Clearly I need to be sending heartbeats back to the server more often. How?
My topology is:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, String> lines = kStreamBuilder.stream(TOPIC);
KTable<Windowed<String>, String> kt = lines.aggregateByKey(
new DBAggregateInit(),
new DBAggregate(),
TimeWindows.of("write_aggregate2", 30000));
DBProcessorSupplier dbProcessorSupplier = new DBProcessorSupplier();
kt.toStream().process(dbProcessorSupplier);
KafkaStreams kafkaStreams = new KafkaStreams(kStreamBuilder, streamsConfig);
kafkaStreams.start();
The KTable is grouping values by key every 30 seconds. In Processor.init() I call context.schedule(30000).
DBProcessorSupplier provides an instance of DBProcessor. This is an implementation of AbstractProcessor where all the overrides have been provided. All they do is LOG so I know when each is being hit.
It's a pretty simple topology but it's clear I'm missing a step somewhere.
Edit:
I get that I can adjust this on the server side, but I'm hoping there is a client-side solution. I like the notion of partitions being made available pretty quickly when a client exits or dies.
Edit:
In an attempt to simplify the problem I removed the aggregation step from the graph. It's now just consumer->processor(). (If I send the consumer directly to .print() it works very quickly, so I know it's OK. Similarly, if I output the aggregation (KTable) via .print() it seems OK too.)
What I found was that .process(), which should be calling .punctuate() every 30 seconds, is actually blocking for variable lengths of time and outputting somewhat randomly (if at all).
Main program
Debug output
Processor Supplier
Processor
Further:
I set the log level to 'debug' and reran. I'm seeing lots of messages like:
DEBUG o.a.k.s.p.internals.StreamTask - Start processing one record [ConsumerRecord <info>
but a breakpoint in the .punctuate() function isn't getting hit. So it's doing lots of work but not giving me a chance to use it.

A few clarifications:
StreamsConfig.COMMIT_INTERVAL_MS_CONFIG is a lower bound on the commit interval, i.e., after a commit, the next commit does not happen before this time has passed. Kafka Streams tries to commit as soon as possible after this time has passed, but there is no guarantee how long it will actually take until the next commit.
StreamsConfig.POLL_MS_CONFIG is used for the internal KafkaConsumer#poll() call, to specify the maximum blocking time of the poll() call.
Thus, neither value helps you heartbeat more often.
Kafka Streams follows a "depth-first" strategy when processing records. This means that after a poll(), all operators of the topology are executed for each record in turn. Assume you have three consecutive maps: all three maps will be called for the first record before the second record gets processed.
Thus, the next poll() call is made only after all records of the first poll() have been fully processed. If you want to heartbeat more often, you need to make sure that a single poll() call fetches fewer records, so that processing all records takes less time and the next poll() is triggered earlier.
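To make the processing order concrete, here is a minimal, library-free sketch. It does not use Kafka Streams at all: DepthFirstDemo, the three identity "map" stages and the sample records are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a topology of three consecutive map() operators.
class DepthFirstDemo {

    // Runs every stage on one record before moving on to the next record
    // and returns the order in which the stages were invoked.
    static List<String> run(List<String> polledRecords) {
        List<String> trace = new ArrayList<>();
        List<Function<String, String>> stages = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int stage = i;
            stages.add(value -> {
                trace.add("map" + stage + "(" + value + ")");
                return value; // identity: only the call order matters here
            });
        }
        // polledRecords plays the role of the batch returned by one poll().
        for (String record : polledRecords) {
            String value = record;
            for (Function<String, String> stage : stages) {
                value = stage.apply(value); // all stages run before the next record
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        // prints [map1(r1), map2(r1), map3(r1), map1(r2), map2(r2), map3(r2)]
        System.out.println(run(Arrays.asList("r1", "r2")));
    }
}
```

The outer loop only finishes, and the next poll() can only happen, once every record of the batch has passed through every stage. That is why a smaller batch shortens the gap between polls.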
You can use configuration parameters for KafkaConsumer that you can specify via StreamsConfig to get this done (see https://kafka.apache.org/documentation.html#consumerconfigs):
streamsConfig.put(ConsumerConfig.XXX, VALUE);
max.poll.records: if you decrease this value, fewer records will be polled
session.timeout.ms: if you increase this value, there is more time for processing data (adding this for completeness because it is actually a client setting and not a server/broker-side configuration, even if you are aware of this solution and do not like it :))
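As a rough sketch of the two settings above, using plain string keys so it compiles without the Kafka jars. The class name ConsumerTuning and both values are assumptions chosen for illustration, not recommendations.

```java
import java.util.Properties;

// Hypothetical helper that collects the two consumer settings discussed above.
class ConsumerTuning {

    static Properties settings() {
        Properties config = new Properties();
        // Fewer records per poll() means less work between two poll() calls.
        config.put("max.poll.records", "50");
        // More time allowed between poll() calls before the member is considered dead.
        // Assumed value: the broker must permit it via group.max.session.timeout.ms.
        config.put("session.timeout.ms", "60000");
        return config;
    }

    public static void main(String[] args) {
        System.out.println(settings());
    }
}
```

These entries would be merged into the same Properties object that is handed to StreamsConfig / KafkaStreams, for example with putAll().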
EDIT
As of Kafka 0.10.1 it is possible (and recommended) to prefix consumer and producer configs within the streams config. This avoids parameter conflicts, as some parameter names are used for both consumer and producer and cannot be distinguished otherwise (they would be applied to consumer and producer at the same time).
To prefix a parameter you can use StreamsConfig#consumerPrefix() or StreamsConfig#producerPrefix(), respectively. For example:
streamsConfig.put(StreamsConfig.consumerPrefix(ConsumerConfig.PARAMETER), VALUE);
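A small sketch of what the prefixing amounts to, again with plain strings so it runs without Kafka on the classpath. The two prefix literals are my assumption of what StreamsConfig.consumerPrefix() and StreamsConfig.producerPrefix() prepend; PrefixedConfigSketch and the buffer sizes are hypothetical.

```java
import java.util.Properties;

// Hypothetical illustration of prefixed client settings inside a streams config.
class PrefixedConfigSketch {

    // Assumption: these mirror the prefixes applied by the StreamsConfig helper methods.
    static final String CONSUMER_PREFIX = "consumer.";
    static final String PRODUCER_PREFIX = "producer.";

    static Properties settings() {
        Properties config = new Properties();
        // "receive.buffer.bytes" exists for both clients; the prefix says which one is meant.
        config.put(CONSUMER_PREFIX + "receive.buffer.bytes", "131072");
        config.put(PRODUCER_PREFIX + "receive.buffer.bytes", "65536");
        // Consumer-only setting, prefixed for clarity.
        config.put(CONSUMER_PREFIX + "max.poll.records", "50");
        return config;
    }

    public static void main(String[] args) {
        System.out.println(settings());
    }
}
```

Without the prefixes, a single "receive.buffer.bytes" entry would be applied to both the consumer and the producer.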
One more thing to add: the scenario described in this question is a known issue, and KIP-62 already introduces a background thread for KafkaConsumer that sends heartbeats, thus decoupling heartbeats from poll() calls. Kafka Streams will leverage this new feature in upcoming releases.

Related

Apache kafka - manual acknowledge(AbstractMessageListenerContainer.AckMode.MANUAL) not working and events replayed on library upgrade

Kafka events getting replayed to consumer repeatedly. I can see following exception -
5-Nov-2019 10:43:25 ERROR [org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run : 685] :: org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1 :: :: Container exception
org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
But in my case, it is just one message that takes more than 30 minutes to process, so we acknowledge it on receipt. So I don't think the number of records is the issue. I know it can be solved by increasing max.poll.interval.ms, but it used to work before the upgrade. I am trying to figure out the optimal workaround.
I tried AbstractMessageListenerContainer.AckMode.MANUAL_IMMEDIATE, which seems to commit the offset immediately and works, but I need to figure out why AbstractMessageListenerContainer.AckMode.MANUAL fails now.
Previous working jar versions:
spring-kafka-1.0.5.RELEASE.jar
kafka-clients-0.9.0.1.jar
Current versions (getting above exception):
spring-kafka-1.3.9.RELEASE.jar
kafka-clients-2.2.1.jar
Yes, you must increase max.poll.interval.ms; you can use MANUAL_IMMEDIATE instead to commit the offset immediately (with MANUAL, the commit is enqueued, the actual commit is not performed until the thread exits the listener).
However, this will still not prevent a rebalance because Kafka requires the consumer to call poll() within max.poll.interval.ms.
So I suggest you switch to MANUAL_IMMEDIATE and increase the interval beyond 30 minutes.
With the old version (before 1.3), there were two threads, one for the consumer and one for the listener, so the queued ack was processed earlier. But that was a very complicated threading model; it was much simplified in 1.3, thanks to KIP-62, and this side effect was the result.
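For the "increase the interval beyond 30 minutes" part, a minimal sketch of deriving the value from the worst-case processing time. LongProcessingConfig and the 50% headroom are assumptions of mine, and the resulting entry would go into whatever consumer properties your container factory is built from.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper: sizes max.poll.interval.ms from the slowest expected record.
class LongProcessingConfig {

    static Properties settings(Duration worstCaseProcessing) {
        Properties config = new Properties();
        // Leave headroom above the slowest record (50% is an arbitrary choice).
        long intervalMs = worstCaseProcessing.toMillis() * 3 / 2;
        config.put("max.poll.interval.ms", Long.toString(intervalMs));
        return config;
    }

    public static void main(String[] args) {
        // 30 minutes of processing -> 45 minutes between polls is tolerated
        System.out.println(settings(Duration.ofMinutes(30)));
    }
}
```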

Kafka enable.auto.commit false in combination with commitSync()

I have a scenario where enable.auto.commit is set to false. For every poll(), the records obtained are offloaded to a ThreadPoolExecutor, and commitSync() is called outside that context. But I doubt this is the right way to handle it, as my thread pool may still be processing some messages while I commit them.
while (true) {
    ConsumerRecords<String, NormalizedSyslogMessage> records = consumer.poll(100);
    Date startTime = Calendar.getInstance().getTime();
    for (ConsumerRecord<String, NormalizedSyslogMessage> record : records) {
        NormalizedSyslogMessage normalizedMessage = record.value();
        normalizedSyslogMessageList.add(normalizedMessage);
    }
    Date endTime = Calendar.getInstance().getTime();
    long durationInMilliSec = endTime.getTime() - startTime.getTime();
    // execute process thread on message size equal to 5000 or timeout > 4000
    if (normalizedSyslogMessageList.size() == 5000) {
        CorrelationProcessThread correlationProcessThread = applicationContext
                .getBean(CorrelationProcessThread.class);
        List<NormalizedSyslogMessage> clonedNormalizedSyslogMessages = deepCopy(normalizedSyslogMessageList);
        correlationProcessThread.setNormalizedMessage(clonedNormalizedSyslogMessages);
        taskExecutor.execute(correlationProcessThread);
        normalizedSyslogMessageList.clear();
    }
    consumer.commitSync();
}
I suppose there are a couple of things to address here.
First is the offsets being out of sync. This is probably caused by one of the following:
If the number of messages fetched by poll() does not take the size of normalizedSyslogMessageList to 5000, commitSync() will still run, regardless of whether the current batch of messages has been processed or not.
If, however, the size reaches 5000, the processing is done in a separate thread, so the main consumer thread will never know whether the processing has completed. commitSync() would run anyway, committing the offsets.
The second part (which I believe is your actual concern/question) is whether or not this is the best way to handle this. I would say no, because of point number 2 above: the correlationProcessThread is being invoked in a fire-and-forget manner, so you wouldn't know when the processing of those messages has completed and it is safe to commit the offsets.
Here's a statement from "Kafka: The Definitive Guide":
It is important to remember that commitSync() will commit the latest
offset returned by poll(), so make sure you call commitSync() after
you are done processing all the records in the collection, or you risk
missing messages.
Point number 2 especially will be hard to fix because:
Supplying the consumer reference to the threads in the pool will basically mean multiple threads trying to access one consumer instance (This post makes a mention of this approach and the issues - Mainly, Kafka Consumer NOT being Thread-Safe).
Even if you try to get the status of the processing thread before committing offsets, by using the submit() method instead of execute() on the ExecutorService, you would then need to make a blocking get() call on the Future returned for the correlationProcessThread. So you may not get a lot of benefit from processing in multiple threads.
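The submit()/get() trade-off from the second point can be sketched without any Kafka classes. Everything here is hypothetical: WaitBeforeCommitSketch is a made-up name, the per-message work is a placeholder, and the commit Runnable stands in for consumer.commitSync().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class WaitBeforeCommitSketch {

    // Processes one polled batch on the pool, blocks until every task is done,
    // and only then runs the commit action on the calling (consumer) thread.
    static int processThenCommit(List<String> batch, ExecutorService pool, Runnable commit) {
        AtomicInteger processed = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        for (String message : batch) {
            pending.add(pool.submit(() -> {
                // placeholder for the real per-message work
                processed.incrementAndGet();
            }));
        }
        try {
            for (Future<?> task : pending) {
                task.get(); // blocks until this task has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            // a worker failed: do not commit this batch
            throw new IllegalStateException("processing failed", e.getCause());
        }
        commit.run(); // stand-in for consumer.commitSync()
        return processed.get();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int done = processThenCommit(Arrays.asList("a", "b", "c"), pool,
                    () -> System.out.println("commit"));
            System.out.println(done + " messages processed before commit");
        } finally {
            pool.shutdown();
        }
    }
}
```

The commit is now safe, but the consumer thread sits idle in get() and cannot call poll() while the workers run. That is exactly the limited benefit described above.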
Options for fixing this
As I'm not aware of your context and the exact requirements, I can only suggest conceptual ideas, but it might be worth considering:
breaking the consumer instances as per the processing they need to do and carrying out the processing in the same thread or
you could explore the possibility of maintaining the offsets of the messages in a map (as and when they are processed) and then committing those specific offsets (this method)
I hope this helps.
I totally agree with what Lalit has mentioned. I'm currently going through exactly the same situation, where my processing happens in separate threads and the consumer and producer run in different threads. I've used a ConcurrentHashMap, shared between the producer and consumer threads, that records whether a particular offset has been processed or not.
ConcurrentHashMap<OffsetAndMetadata, Boolean>
On the consumer side, a local LinkedHashMap can be used to maintain the order in which the records are consumed from Topic/Partition and do manual commit in the consumer thread itself.
LinkedHashMap<OffsetAndMetadata, TopicPartition>
You can refer to the following link, if your processing thread is maintaining any consumed record order.
Transactions in Kafka
One point to mention about my approach: there is a chance that data will be duplicated in case of any failures.
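A rough, Kafka-free sketch of the bookkeeping idea, simplified to a single partition and keyed by plain long offsets rather than OffsetAndMetadata. ProcessedOffsetTracker and its method names are hypothetical. It assumes the consumer thread registers each offset before handing the record to a worker.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical helper tracking the records of ONE partition.
class ProcessedOffsetTracker {

    // offset -> true once a worker has finished that record; kept sorted by offset
    private final ConcurrentSkipListMap<Long, Boolean> state = new ConcurrentSkipListMap<>();

    // Consumer thread: register a record before dispatching it to a worker.
    void consumed(long offset) {
        state.putIfAbsent(offset, Boolean.FALSE);
    }

    // Worker thread: mark the record as done.
    void processed(long offset) {
        state.put(offset, Boolean.TRUE);
    }

    // Consumer thread: returns the offset that is safe to commit, i.e. the offset
    // after the last record of the contiguous processed run, or -1 if none.
    long nextOffsetToCommit() {
        long commit = -1L;
        Map.Entry<Long, Boolean> first;
        while ((first = state.firstEntry()) != null && first.getValue()) {
            state.remove(first.getKey());
            commit = first.getKey() + 1; // Kafka commits the NEXT offset to read
        }
        return commit;
    }

    public static void main(String[] args) {
        ProcessedOffsetTracker tracker = new ProcessedOffsetTracker();
        tracker.consumed(5);
        tracker.consumed(6);
        tracker.consumed(7);
        tracker.processed(7);
        System.out.println(tracker.nextOffsetToCommit()); // -1: offset 5 is still in flight
        tracker.processed(5);
        System.out.println(tracker.nextOffsetToCommit()); // 6
    }
}
```

Because only the contiguous prefix is committed, a crash re-delivers records that finished out of order, which is the duplication risk mentioned above.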

"Commit failed for offsets" while committing offset asynchronously

I have a kafka consumer from which I am consuming data from a particular topic and I am seeing below exception. I am using 0.10.0.0 kafka version.
LoggingCommitCallback.onComplete: Commit failed for offsets= {....}, eventType= some_type, time taken= 19ms, error= org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I added these two extra consumer properties, but it still didn't help:
session.timeout.ms=20000
max.poll.records=500
I am committing offsets in a different background thread as shown below:
kafkaConsumer.commitAsync(new LoggingCommitCallback(consumerType.name()));
What does that error mean and how can I resolve it? Do I need to add some other consumer properties?
Yes, lower max.poll.records. You'll get smaller batches of data, but the resulting more frequent calls to poll() will help keep the session alive.

Kafka new consumer: (re)set and commit offsets with using assign, not subscribe

Using the new Kafka Java consumer api, I run a single consumer to consume messages. When all available messages are consumed, I kill it with kill -15.
Now I would like to reset the offsets to the start. I would like to avoid simply using a different consumer group. What I tried is the following sequence of calls, using the same group as the consumer that had just finished reading the data.
assign(topicPartition);
OffsetAndMetadata om = new OffsetAndMetadata(0);
commitSync(Collections.singletonMap(topicPartition, om));
I thought I had got this working in a test, but now I always just get:
ERROR internals.ConsumerCoordinator: Error UNKNOWN_MEMBER_ID occurred while committing offsets for group queue
Exception in thread "main" org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
Is it in principle wrong to combine assign with commitSync, possibly because only subscribe and commitSync go together? The docs only say that assign does not go along with subscribe, but I thought this applies only in one consumer process. (In fact I was even hoping to run the offset-reset consumer while the other consumer is up, hoping that the other one might notice the offset change and start over again. But shutting it down first is fine too.)
Any ideas?
Found the problem. The approach described in my question works well, given we respect the following conditions:
There may be no other consumer running with the targeted group.id. Even if a consumer is subscribed only to other topics, this hinders committing topic offsets after calling assign() instead of subscribe().
After the last other consumer has stopped, it takes 30 seconds (I think it is group.max.session.timeout.ms) before the operation can succeed. The indicative log message from kafka is
Group X generation Y is dead and removed
Once this appears in the log, the sequence
assign(topicPartition);
OffsetAndMetadata om = new OffsetAndMetadata(0);
commitSync(Collections.singletonMap(topicPartition, om));
can succeed.
Why even commit offsets in the first place?
Set enable.auto.commit to false in Properties and don't commit it at all if you just re-read all messages on restart.
To reset offset you can use for example these methods:
public void seek(TopicPartition partition, long offset)
public void seekToBeginning(TopicPartition... partitions)

Kafka 0.9 new consumer api --- how to just watch consumer offsets

I am trying to monitor consumer offsets of a given group with the Java API. I create one additional consumer which does not subscribe to any topic, but just calls consumer.committed(topic) to get the offset information. This kind of works, but:
For testing I use only one real consumer (i.e. one which does subscribe to the topic). When I shut it down using close() and later restart one, it takes 27 seconds between subscribe and the first consumption of messages, despite the fact that I use poll(1000).
I am guessing this has to do with the rebalancing possibly being confused by the non-subscribing consumer. Could that be possible? Is there a better way to monitor offsets with the Java API (I know about the command line tool, but need to use the API).
There are different ways to inspect the offsets of a topic, depending on what you want them for. Besides "committed", which you described above, here are two more options:
1) If you want to know the offset from which the consumer will start fetching data from the broker the next time the thread(s) start(s), use "position":
long offsetPosition;
TopicPartition tPartition = new TopicPartition(topic,partitionToReview);
offsetPosition = kafkaConsumer.position(tPartition);
System.out.println("offset of the next record to fetch is : " + offsetPosition);
2) Call the "offset()" method on a ConsumerRecord object after performing a poll on kafkaConsumer:
Iterator<ConsumerRecord<byte[],byte[]>> it = kafkaConsumer.poll(1000).iterator();
while(it.hasNext()){
ConsumerRecord<byte[],byte[]> record = it.next();
System.out.println("offset : " + record.offset());
}
Found it: the monitoring consumer added to the confusion but was not the culprit. In the end it is easy to understand though slightly unexpected (for me at least):
The default for session.timeout.ms is 30 seconds. When a consumer disappears it takes up to 30 seconds before it is declared dead and the work is rebalanced. For testing, I had stopped the single consumer I had, waited three seconds and restarted a new one. This then took 27 seconds before it started, filling the 30 seconds time-out.
I would have expected that a single, lone consumer starting up does not wait for the time-out to expire, but starts to "rebalance", i.e. grab the work immediately. It seems though that the time-out has to expire before work is rebalanced, even if there is only one consumer.
For the testing to get through faster, I changed the configuration to use a lower session.timeout.ms for the consumer as well as group.min.session.timeout.ms for the broker.
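A sketch of the two knobs for a faster test setup, without Kafka on the classpath. FastFailoverConfig and the concrete numbers are assumptions. The check only mirrors the rule that the broker rejects a session timeout outside its group.min.session.timeout.ms / group.max.session.timeout.ms range.

```java
import java.util.Properties;

// Hypothetical helper for a test setup with quick failure detection.
class FastFailoverConfig {

    // Consumer-side settings; the caller picks the session timeout.
    static Properties consumerSettings(int sessionTimeoutMs) {
        Properties config = new Properties();
        config.put("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        // Heartbeats must fit well inside the session timeout; one third is a common rule of thumb.
        config.put("heartbeat.interval.ms", Integer.toString(sessionTimeoutMs / 3));
        return config;
    }

    // True if the broker-side bounds would admit the requested session timeout.
    static boolean acceptedByBroker(int sessionTimeoutMs, int groupMinSessionTimeoutMs,
                                    int groupMaxSessionTimeoutMs) {
        return sessionTimeoutMs >= groupMinSessionTimeoutMs
                && sessionTimeoutMs <= groupMaxSessionTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(consumerSettings(3000));
        // A 3 s timeout is refused while the broker minimum is 6 s ...
        System.out.println(acceptedByBroker(3000, 6000, 30000)); // false
        // ... and admitted once group.min.session.timeout.ms is lowered to 1 s.
        System.out.println(acceptedByBroker(3000, 1000, 30000)); // true
    }
}
```

This is why lowering only the consumer's session.timeout.ms is not enough: the broker minimum has to come down as well.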
To conclude: using a consumer that does not subscribe to any topic for monitoring the offsets works just fine and does not seem to interfere with the rebalancing process.
