Keeping consumer alive using Kafka - java

I'm looking for a "low-cost" method to keep a consumer alive when I'm not actively polling. That is, I'm still processing records from the last poll and I don't want the consumer connection to time out.
Some functions that look promising:
wakeup
commitAsync
In each case this would be non-standard usage of the API, so I'm not sure it would be a reasonable / rational approach.
RE: Setting the connection timeout higher - I want the consumer to time out if it gets wedged. My question pertains to one section where I've fetched a block of records and separate threads are working through them.

The documentation seems to suggest you should call pause() and then keep actively polling. If you call poll() while paused, nothing will be returned.
For use cases where message processing time varies unpredictably,
neither of these options may be sufficient. The recommended way to
handle these cases is to move message processing to another thread,
which allows the consumer to continue calling poll while the processor
is still working. Some care must be taken to ensure that committed
offsets do not get ahead of the actual position. Typically, you must
disable automatic commits and manually commit processed offsets for
records only after the thread has finished handling them (depending on
the delivery semantics you need). Note also that you will need to
pause the partition so that no new records are received from poll
until after thread has finished handling those previously returned.
The documentation for pause() confirms this:
Suspend fetching from the requested partitions. Future calls to
poll(long) will not return any records from these partitions until
they have been resumed using resume(Collection). Note that this method
does not affect partition subscription. In particular, it does not
cause a group rebalance when automatic assignment is used.
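A minimal sketch of the pause-then-keep-polling flow described above. To stay runnable without a broker it uses a hypothetical PausableSource interface and FakeSource class instead of KafkaConsumer; only the method names poll, pause and resume mirror the real client, and everything else (class names, the 50 ms sleep, the upper-casing "work") is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for the three consumer calls the pattern relies on. */
interface PausableSource {
    List<String> poll();   // returns nothing while paused
    void pause();
    void resume();
}

final class PauseWhileProcessing {

    /** In-memory source, present only to make the sketch runnable. */
    static final class FakeSource implements PausableSource {
        private final List<List<String>> batches;
        private int next = 0;
        private boolean paused = false;
        int pollsWhilePaused = 0;

        FakeSource(List<List<String>> batches) { this.batches = batches; }

        @Override public List<String> poll() {
            if (paused) { pollsWhilePaused++; return Collections.emptyList(); }
            return next < batches.size() ? batches.get(next++) : Collections.<String>emptyList();
        }
        @Override public void pause() { paused = true; }
        @Override public void resume() { paused = false; }
    }

    /** Hands each batch to a worker thread and keeps polling (paused) until it finishes. */
    static List<String> run(PausableSource source, int batchCount) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        List<String> processed = new ArrayList<>();
        try {
            for (int i = 0; i < batchCount; i++) {
                final List<String> batch = source.poll();
                source.pause(); // no new records until this batch is handled
                Future<List<String>> result = worker.submit(() -> {
                    Thread.sleep(50); // simulate slow processing
                    List<String> out = new ArrayList<>();
                    for (String r : batch) { out.add(r.toUpperCase()); }
                    return out;
                });
                while (!result.isDone()) {
                    source.poll();   // keeps the poll loop going; yields nothing while paused
                    Thread.sleep(5);
                }
                processed.addAll(result.get());
                // With a real consumer, the offsets of this batch would be committed here.
                source.resume();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
        return processed;
    }
}
```

With the real client the same loop would call pause and resume with the assigned partitions. A rebalance can change that assignment, so the pause may need to be re-applied afterwards.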

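Since Kafka 0.10.1, the consumer no longer relies on poll() calls to send heartbeats. It runs the heartbeat in a separate background thread (see KIP-62). So if that's your version, the session will not time out while you are processing, although the consumer must still call poll() within max.poll.interval.ms to stay in the group.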

Related

Is there any way we can pause kafka stream for certain period and resume later?

We have a requirement where we use Kafka Streams to read from a Kafka topic and then send the data over the network through a pool of sessions. However, network calls are sometimes a bit slow, and we need to pause the stream frequently to make sure we are not overloading the network. Currently, we capture data into a stream, load it into an executor service, and then send it over the network through the session pool.
If the backlog in the executor service grows too large, we need to pause the stream for some time and resume it once the backlog has cleared. To achieve this pause mechanism, we are currently closing the stream and starting it again once the backlog has cleared.
Is there any way we can pause the kafka stream?
If I understand you correctly, there is nothing special you need to do. You are talking about "back pressure" and Kafka Streams can handle it out of the box.
What you can do is put this data into a queue with some maximum size and use this queue to feed the executor service. Whenever the queue is full, there are two cases:
If your call to put data in the queue blocks with no time-out, there is nothing more you need to do. Just wait until the system is back online, your call returns, and processing will resume.
If your call to put data in the queue blocks with a time-out, check the size of the queue when the call times out and retry. Repeat this until the system is back online and your call succeeds.
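The two cases above can be sketched with a plain java.util.concurrent.ArrayBlockingQueue. The class and method names here (BoundedHandoff, putBlocking, putWithRetry) are hypothetical, and the sketch assumes the processing step calls one of the put methods while a separate sender thread drains the queue with take.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BoundedHandoff {
    private final BlockingQueue<String> queue;

    BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Case 1: block with no time-out until the sender has made room. */
    void putBlocking(String record) {
        try {
            queue.put(record);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Case 2: block with a time-out and retry; returns how many attempts timed out. */
    int putWithRetry(String record, long timeoutMs) {
        int timedOut = 0;
        try {
            while (!queue.offer(record, timeoutMs, TimeUnit.MILLISECONDS)) {
                timedOut++; // queue still full: a good place to log queue.size()
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return timedOut;
    }

    /** Called by the sender side to drain one record, blocking if the queue is empty. */
    String take() {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Either way, the producing thread is simply held back while the queue is full, which is what provides the back pressure.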
The only caveat is that as long as your Streams application blocks, the internally used Kafka consumer client will not send any heartbeats to Kafka and might time out. Thus, you need to set the time-out configuration parameter higher than the expected maximum downtime of your external system.
Another approach is to use the Processor API available in Kafka Streams, though it is not usually the recommended pattern.
Let me know if it helps!!

Apache kafka - manual acknowledge(AbstractMessageListenerContainer.AckMode.MANUAL) not working and events replayed on library upgrade

Kafka events getting replayed to consumer repeatedly. I can see following exception -
5-Nov-2019 10:43:25 ERROR [org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run : 685] :: org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1 :: :: Container exception
org.apache.kafka.clients.consumer.CommitFailedException:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing.
You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
But in my case, it's just one message that takes more than 30 minutes to process, so we acknowledge it on receipt. So I don't think the number of records is the issue. I know it can be solved by increasing max.poll.interval.ms, but it used to work before the upgrade. I'm trying to figure out the optimal workaround.
I tried AbstractMessageListenerContainer.AckMode.MANUAL_IMMEDIATE, which seems to commit the offset immediately and works, but I need to figure out why AbstractMessageListenerContainer.AckMode.MANUAL fails now.
Previous working jar versions:
spring-kafka-1.0.5.RELEASE.jar
kafka-clients-0.9.0.1.jar
Current versions (getting above exception):
spring-kafka-1.3.9.RELEASE.jar
kafka-clients-2.2.1.jar
Yes, you must increase max.poll.interval.ms; you can use MANUAL_IMMEDIATE instead to commit the offset immediately (with MANUAL, the commit is enqueued, the actual commit is not performed until the thread exits the listener).
However, this will still not prevent a rebalance because Kafka requires the consumer to call poll() within max.poll.interval.ms.
So I suggest you switch to MANUAL_IMMEDIATE and increase the interval beyond 30 minutes.
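As a rough illustration of the settings involved, the consumer properties might look like the sketch below. The property keys are the standard Kafka consumer configuration names; the class name and the concrete values (40 minutes, one record per poll) are assumptions chosen only to sit above the 30-minute processing time mentioned in the question.

```java
import java.util.Properties;

final class LongProcessingConsumerConfig {

    /** Builds only the properties relevant to long-running listeners. */
    static Properties build() {
        Properties props = new Properties();
        // Offsets are acknowledged manually by the listener.
        props.put("enable.auto.commit", "false");
        // One record per poll, so a single slow message does not delay others in the batch.
        props.put("max.poll.records", "1");
        // Assumed value: 40 minutes, comfortably above the 30-minute processing time.
        props.put("max.poll.interval.ms", String.valueOf(40 * 60 * 1000));
        return props;
    }
}
```

Connection settings such as bootstrap servers, group id and deserializers are left out here and would have to be added.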
With the old version (before 1.3), there were two threads - one for the consumer and one for the listener, so the queued ack was processed earlier. But it was a very complicated threading model which was much simplified in 1.3, thanks to KIP-62, but this side effect was the result.

Kafka enable.auto.commit false in combination with commitSync()

I have a scenario where enable.auto.commit is set to false. For every poll(), the records obtained are offloaded to a ThreadPoolExecutor, and commitSync() is called outside that context. But I doubt this is the right way to handle it, as my thread pool may still be processing some messages when I commit.
while (true) {
    ConsumerRecords<String, NormalizedSyslogMessage> records = consumer.poll(100);
    Date startTime = Calendar.getInstance().getTime();
    for (ConsumerRecord<String, NormalizedSyslogMessage> record : records) {
        NormalizedSyslogMessage normalizedMessage = record.value();
        normalizedSyslogMessageList.add(normalizedMessage);
    }
    Date endTime = Calendar.getInstance().getTime();
    long durationInMilliSec = endTime.getTime() - startTime.getTime();
    // execute process thread on message size equal to 5000 or timeout > 4000
    if (normalizedSyslogMessageList.size() == 5000) {
        CorrelationProcessThread correlationProcessThread = applicationContext
                .getBean(CorrelationProcessThread.class);
        List<NormalizedSyslogMessage> clonedNormalizedSyslogMessages = deepCopy(normalizedSyslogMessageList);
        correlationProcessThread.setNormalizedMessage(clonedNormalizedSyslogMessages);
        taskExecutor.execute(correlationProcessThread);
        normalizedSyslogMessageList.clear();
    }
    consumer.commitSync();
}
I suppose there are a couple of things to address here.
First is Offsets being out of sync - This is probably caused by either one of the following:
If the number of messages fetched by poll() does not take the size of the normalizedSyslogMessageList to 5000, the commitSync() will still run regardless of whether the current batch of messages has been processed or not.
If however, the size touches 5000 - because the processing is being done in a separate thread, the main consumer thread will never know whether the processing has been completed or not but... The commitSync() would run anyway committing the offsets.
The second part (which I believe is your actual concern/question) is whether or not this is the best way to handle this. I would say no, because of point number 2 above: the correlationProcessThread is invoked in a fire-and-forget manner, so you wouldn't know when the processing of those messages has completed and it is safe to commit the offsets.
Here's a statement from "Kafka: The Definitive Guide" -
It is important to remember that commitSync() will commit the latest
offset returned by poll(), so make sure you call commitSync() after
you are done processing all the records in the collection, or you risk
missing messages.
Point number 2 especially will be hard to fix because:
Supplying the consumer reference to the threads in the pool will basically mean multiple threads trying to access one consumer instance (This post makes a mention of this approach and the issues - Mainly, Kafka Consumer NOT being Thread-Safe).
Even if you try and get the status of the processing thread before committing offsets by using the submit() method instead of execute() in the ExecutorService, then you would need to make a blocking get() method call to the correlationProcessThread. So, you may not get a lot of benefit by processing in multiple threads.
Options for fixing this
As I'm not aware of your context and exact requirements, I can only suggest conceptual ideas, but these might be worth considering:
breaking the consumer instances as per the processing they need to do and carrying out the processing in the same thread or
you could explore the possibility of maintaining the offsets of the messages in a map (as and when they are processed) and then committing those specific offsets (this method)
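A minimal sketch of the second idea: worker threads report which offsets they have finished, and the consumer thread commits only up to the lowest offset that is still in flight. OffsetTracker and its method names are hypothetical, and the sketch covers a single partition; a real consumer would keep one tracker per TopicPartition and pass the result to the commitSync overload that takes a map of offsets.

```java
import java.util.TreeSet;

/** Tracks in-flight offsets of one partition so that commits never run ahead of processing. */
final class OffsetTracker {
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long highestDispatched = -1;

    /** Called by the consumer thread when a record is handed to a worker. */
    synchronized void dispatched(long offset) {
        inFlight.add(offset);
        highestDispatched = Math.max(highestDispatched, offset);
    }

    /** Called by a worker thread once the record has been fully processed. */
    synchronized void completed(long offset) {
        inFlight.remove(offset);
    }

    /**
     * Offset that is safe to commit, following the Kafka convention that the
     * committed offset is the next record to read. Returns -1 if nothing was dispatched.
     */
    synchronized long committable() {
        if (highestDispatched < 0) {
            return -1;
        }
        // Everything below the lowest in-flight offset is done; if nothing is
        // in flight, everything up to the highest dispatched offset is done.
        return inFlight.isEmpty() ? highestDispatched + 1 : inFlight.first();
    }
}
```

Because only the consumer thread reads committable() and performs the commit, the consumer itself is never shared between threads. After a failure, records above the committed offset that had already been processed will be delivered again, so processing should tolerate duplicates.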
I hope this helps.
Totally agree with what Lalit has mentioned. I'm currently in exactly the same situation, where my processing happens in separate threads and the consumer and producer run in different threads. I've used a ConcurrentHashMap, shared between the producer and consumer threads, that records whether a particular offset has been processed or not.
ConcurrentHashMap<OffsetAndMetadata, Boolean>
On the consumer side, a local LinkedHashMap can be used to maintain the order in which the records are consumed from Topic/Partition and do manual commit in the consumer thread itself.
LinkedHashMap<OffsetAndMetadata, TopicPartition>
You can refer to the following link, if your processing thread is maintaining any consumed record order.
Transactions in Kafka
One point to mention about my approach: there is a chance that data will be duplicated in case of a failure.

Kafka KStreams - processing timeouts

I am attempting to use <KStream>.process() with a TimeWindows.of("name", 30000) to batch up some KTable values and send them on. It seems that 30 seconds exceeds the consumer timeout interval after which Kafka considers said consumer to be defunct and releases the partition.
I've tried upping the frequency of poll and commit interval to avoid this:
config.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "5000");
config.put(StreamsConfig.POLL_MS_CONFIG, "5000");
Unfortunately these errors are still occurring:
(lots of these)
ERROR o.a.k.s.p.internals.RecordCollector - Error sending record to topic kafka_test1-write_aggregate2-changelog
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for kafka_test1-write_aggregate2-changelog-0
Followed by these:
INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator 12.34.56.7:9092 (id: 2147483547 rack: null) dead for group kafka_test1
WARN o.a.k.s.p.internals.StreamThread - Failed to commit StreamTask #0_0 in thread [StreamThread-1]:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:578)
Clearly I need to be sending heartbeats back to the server more often. How?
My topology is:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, String> lines = kStreamBuilder.stream(TOPIC);
KTable<Windowed<String>, String> kt = lines.aggregateByKey(
new DBAggregateInit(),
new DBAggregate(),
TimeWindows.of("write_aggregate2", 30000));
DBProcessorSupplier dbProcessorSupplier = new DBProcessorSupplier();
kt.toStream().process(dbProcessorSupplier);
KafkaStreams kafkaStreams = new KafkaStreams(kStreamBuilder, streamsConfig);
kafkaStreams.start();
The KTable is grouping values by key every 30 seconds. In Processor.init() I call context.schedule(30000).
DBProcessorSupplier provides an instance of DBProcessor. This is an implementation of AbstractProcessor where all the overrides have been provided. All they do is LOG so I know when each is being hit.
It's a pretty simple topology but it's clear I'm missing a step somewhere.
Edit:
I get that I can adjust this on the server side, but I'm hoping there is a client-side solution. I like the notion of partitions being made available pretty quickly when a client exits / dies.
Edit:
In an attempt to simplify the problem I removed the aggregation step from the graph. It's now just consumer->processor(). (If I send the consumer directly to .print() it works very quickly, so I know it's OK. Similarly, if I output the aggregation (KTable) via .print() it seems OK too.)
What I found was that .process(), which should be calling .punctuate() every 30 seconds, is actually blocking for variable lengths of time and outputting somewhat randomly (if at all).
Main program
Debug output
Processor Supplier
Processor
Further:
I set the log level to 'debug' and reran. I'm seeing lots of messages:
DEBUG o.a.k.s.p.internals.StreamTask - Start processing one record [ConsumerRecord <info>
but a breakpoint in the .punctuate() function isn't getting hit. So it's doing lots of work but not giving me a chance to use it.
A few clarifications:
StreamsConfig.COMMIT_INTERVAL_MS_CONFIG is a lower bound on the commit interval, i.e., after a commit, the next commit does not happen before this time has passed. Basically, Kafka Streams tries to commit as soon as possible after this time has passed, but there is no guarantee whatsoever how long it will actually take to do the next commit.
StreamsConfig.POLL_MS_CONFIG is used for the internal KafkaConsumer#poll() call, to specify the maximum blocking time of the poll() call.
Thus, neither value helps to heartbeat more often.
Kafka Streams follows a "depth-first" strategy when processing records. This means that after a poll(), all operators of the topology are executed for each record. Let's assume you have three consecutive maps: then all three maps will be called for the first record before the second record gets processed.
Thus, the next poll() call will be made only after all records of the first poll() have been fully processed. If you want to heartbeat more often, you need to make sure that a single poll() call fetches fewer records, so that processing all records takes less time and the next poll() is triggered earlier.
You can use configuration parameters for KafkaConsumer that you can specify via StreamsConfig to get this done (see https://kafka.apache.org/documentation.html#consumerconfigs):
streamConfig.put(ConsumerConfig.XXX, VALUE);
max.poll.records: if you decrease this value, fewer records will be polled
session.timeout.ms: if you increase this value, there is more time for processing data (adding this for completeness because it is actually a client setting and not a server/broker side configuration -- even if you are aware of this solution and do not like it :))
EDIT
As of Kafka 0.10.1 it is possible (and recommended) to prefix consumer and producer configs within the streams config. This avoids parameter conflicts, as some parameter names are used for both consumer and producer and cannot be distinguished otherwise (and would be applied to consumer and producer at the same time).
To prefix a parameter you can use StreamsConfig#consumerPrefix() or StreamsConfig#producerPrefix(), respectively. For example:
streamsConfig.put(StreamsConfig.consumerPrefix(ConsumerConfig.PARAMETER), VALUE);
One more thing to add: The scenario described in this question is a known issue, and there is already KIP-62, which introduces a background thread for KafkaConsumer that sends heartbeats, thus decoupling heartbeats from poll() calls. Kafka Streams will leverage this new feature in upcoming releases.

Multithread-safe JDBC Save or Update

We have a JMS queue of job statuses, and two identical processes pulling from the queue to persist the statuses via JDBC. When a job status is pulled from the queue, the database is checked to see if there is already a row for the job. If so, the existing row is updated with new status. If not, a row is created for this initial status.
What we are seeing is that a small percentage of new jobs are being added to the database twice. We are pretty sure this is because the job's initial status is quickly followed by a status update - one process gets one, another process the other. Both processes check to see if the job is new, and since it has not been recorded yet, both create a record for it.
So, my question is, how would you go about preventing this in a vendor-neutral way? Can it be done without locking the entire table?
EDIT: For those saying the "architecture" is unsound - I agree, but am not at liberty to change it.
Create a unique constraint on JOB_ID, and retry to persist the status in the event of a constraint violation exception.
That being said, I think your architecture is unsound: If two processes are pulling messages from the queue, it is not guaranteed they will write them to the database in queue order: one consumer might be a bit slower, a packet might be dropped, ..., causing the other consumer to persist the later messages first, causing them to be overridden with the earlier state.
One way to guard against that is to include sequence numbers in the messages, update the row only if the sequence number is as expected, and delay the update otherwise (this is vulnerable to lost messages, though ...).
Of course, the easiest way would be to have only one consumer ...
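The combination of a unique key and sequence numbers can be modeled in memory as below. This is only a sketch under stated assumptions: JobStatusStore is a hypothetical class, a ConcurrentHashMap stands in for the table with its unique constraint on the job id, and it uses a simpler rule than the one described above (apply an update only if its sequence number is newer, rather than delaying out-of-order updates).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for a job status table with a unique key on the job id. */
final class JobStatusStore {

    static final class Row {
        final long sequence;
        final String status;

        Row(long sequence, String status) {
            this.sequence = sequence;
            this.status = status;
        }
    }

    private final ConcurrentMap<String, Row> rows = new ConcurrentHashMap<>();

    /**
     * Atomically inserts the row if the job is new, or replaces it only when the
     * incoming sequence number is higher. Returns true if the update was applied.
     */
    boolean saveOrUpdate(String jobId, long sequence, String status) {
        Row candidate = new Row(sequence, status);
        Row winner = rows.merge(jobId, candidate,
                (existing, incoming) -> incoming.sequence > existing.sequence ? incoming : existing);
        return winner == candidate;
    }

    /** Returns the current status of the job, or null if it is unknown. */
    String statusOf(String jobId) {
        Row row = rows.get(jobId);
        return row == null ? null : row.status;
    }
}
```

In a database the same effect would need the unique constraint on JOB_ID plus an update statement that is conditional on the stored sequence number, with a retry when the insert hits the constraint.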
JDBC connections are not thread safe, so there's nothing to be done about that.
"...two identical processes pulling from the queue to persist the statuses via JDBC..."
I don't understand this at all. Why two identical processes? Wouldn't it be better to have a pool of message queue listeners, each of which would handle messages landing on the queue? Each listener would have its own thread; each one would be its own transaction. A Java EE app server allows you to configure the size of the message listener pool to match the load.
I think a design that duplicates a process like this is asking for trouble.
You could also change the isolation level on the JDBC connection. If you make it SERIALIZABLE you'll ensure ACID at the price of slower performance.
Since it's an asynchronous process, performance will only be an issue if you find that the listeners can't keep up with the messages landing on the queue. If that's the case, you can try increasing the size of the listener pool until you have adequate capacity to process the incoming messages.
