kafka: Commit offsets failed with retriable exception. You should retry committing offsets - java

[o.a.k.c.c.i.ConsumerCoordinator] [Auto offset commit failed for group
consumer-group: Commit offsets failed with retriable
exception. You should retry committing offsets.] []
Why does this error occur in the Kafka consumer, and what does it mean?
The consumer properties I am using are:
fetch.min.bytes:1
enable.auto.commit:true
auto.offset.reset:latest
auto.commit.interval.ms:5000
request.timeout.ms:300000
session.timeout.ms:20000
max.poll.interval.ms:600000
max.poll.records:500
max.partition.fetch.bytes:10485760
What causes this error? I am guessing the consumer is now doing duplicate work (polling the same message again) because of it.
I am not using either consumer.commitAsync() or consumer.commitSync().

The consumer reports this error when it catches an instance of RetriableException.
The reasons for it can vary:
if the coordinator is still loading the group metadata
if the group metadata topic has not been created yet
if network or disk corruption occurs, or a miscellaneous disk-related or network-related IOException is thrown while handling a request
if the server disconnected before a request could be completed
if the client's metadata is out of date
if there is no currently available leader for the given partition
if no brokers were available to complete a request
As you can see from the list above, all of these can be temporary issues, which is why it is suggested that you retry the request.
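If you want the retry in your own hands instead of relying on auto-commit, one option is to commit manually and handle the retriable case explicitly. A minimal sketch, not from the original post (broker address, topic, group id and processing are placeholders, and it assumes a kafka-clients version with poll(Duration)):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RetriableCommitFailedException;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumer-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually instead
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process(record) would go here
                }
                // Commit asynchronously; on a retriable failure (the case from the log above),
                // fall back to a blocking commitSync(), which retries internally until it
                // succeeds or hits a fatal error.
                consumer.commitAsync((offsets, exception) -> {
                    if (exception instanceof RetriableCommitFailedException) {
                        consumer.commitSync(offsets);
                    }
                });
            }
        }
    }
}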

Related

How to avoid losing messages with Kafka streams

We have a streams application that consumes messages from a source topic, does some processing, and forwards the results to a destination topic.
The structure of the messages is controlled by Avro schemas.
When it starts consuming messages, if a schema is not cached yet the application will try to retrieve it from the schema registry. If for whatever reason the schema registry is not available (say, a network glitch), then the message currently being processed is lost, because the default handler is LogAndContinueExceptionHandler.
o.a.k.s.e.LogAndContinueExceptionHandler : Exception caught during Deserialization, taskId: 1_5, topic: my.topic.v1, partition: 5, offset: 142768
org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 62
Caused by: java.net.SocketTimeoutException: connect timed out
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:na]
...
o.a.k.s.p.internals.RecordDeserializer : stream-thread [my-app-StreamThread-3] task [1_5] Skipping record due to deserialization error. topic=[my.topic.v1] partition=[5] offset=[142768]
...
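For context, this skip-and-continue behaviour is governed by the deserialization exception handler that Kafka Streams is configured with. A minimal sketch of how that default is set (only the config key and handler class are Kafka Streams API; everything else is a placeholder):

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

public class StreamsErrorHandlingConfig {

    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // This is the handler the log above refers to: on a deserialization error it
        // logs the record and continues with the next one, so the failing record is skipped.
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
                LogAndContinueExceptionHandler.class);
        return props;
    }
}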
So my question is: what would be the proper way of dealing with situations like the one described above, making sure you don't lose messages no matter what? Is there an out-of-the-box LogAndRollbackExceptionHandler, or a way of implementing your own?
Thank you in advance for your input.
I've not worked a lot with Kafka, but when I did, I remember having issues such as the one you are describing in our system.
Let me tell you how we took care of our scenarios; maybe it will help you out too:
Scenario 1: If your messages are being lost on the publishing side (publisher --> Kafka), you can configure Kafka's acknowledgement setting according to your needs. If you use Spring Cloud Stream with Kafka, the property is spring.cloud.stream.kafka.binder.required-acks (see the producer sketch after the list below).
Possible values:
At most once (Ack=0)
The publisher does not care whether Kafka acknowledges or not.
Send and forget.
Data loss is possible.
At least once (Ack=1)
If Kafka does not acknowledge, the publisher resends the message.
Duplication is possible.
The acknowledgment is sent before the message is copied to the replicas.
Exactly once (Ack=all)
If Kafka does not acknowledge, the publisher resends the message.
However, if a message gets sent more than once to Kafka, there is no duplication.
An internal sequence number is used to decide whether the message has already been written to the topic or not.
The min.insync.replicas property needs to be set to define the minimum number of replicas that must be in sync before Kafka acknowledges the write to the producer.
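For illustration, here is a minimal sketch of the "Ack=all" setup using the plain Kafka producer API rather than Spring Cloud Stream (broker address, topic and values are placeholders; enable.idempotence is what provides the internal sequence number mentioned above):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all" waits for the leader and all in-sync replicas (min.insync.replicas is set
        // on the broker/topic) before the write is acknowledged.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence adds a broker-side sequence number, so retries do not create duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // handle or retry as needed
                }
            });
        }
    }
}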
Scenario 2: If your data is being lost on the consumer side (Kafka --> consumer), you can change Kafka's auto-commit behaviour according to your usage. If you are using Spring Cloud Stream, the property is spring.cloud.stream.kafka.bindings.input.consumer.autoCommitOffset.
By default, autoCommitOffset is true, and every message that is delivered to the consumer is "committed" on Kafka's end, meaning it won't be sent again. However, if you change autoCommitOffset to false, you have the power to poll the message from Kafka in your code and, once you are done with your work, explicitly acknowledge it to let Kafka know you are done with the message (see the sketch below).
If a message is not committed, Kafka will keep redelivering it until it is.
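For illustration, a minimal sketch of that manual-acknowledgment flow with the annotation-based Spring Cloud Stream programming model (this assumes autoCommitOffset=false on the input binding; the binding name and processing logic are placeholders):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.Message;

@SpringBootApplication
@EnableBinding(Sink.class)
public class ManualAckConsumer {

    public static void main(String[] args) {
        SpringApplication.run(ManualAckConsumer.class, args);
    }

    @StreamListener(Sink.INPUT)
    public void handle(Message<String> message) {
        process(message.getPayload()); // do the work first; if it throws, the offset is not committed

        Acknowledgment ack = message.getHeaders()
                .get(KafkaHeaders.ACKNOWLEDGMENT, Acknowledgment.class);
        if (ack != null) {
            ack.acknowledge(); // commit the offset only after successful processing
        }
    }

    private void process(String payload) {
        // placeholder for the application's business logic
    }
}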
Hope this helps you out, or at least points you in the right direction.

BigQuery insert retry policy in Apache Beam

The Apache Beam API has the following BigQuery insert retry policies.
How does a Dataflow job behave if I specify retryTransientErrors?
shouldRetry provides an error from BigQuery and I can decide whether I should retry. Where can I find the expected errors from BigQuery?
BigQuery insert retry policies
https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.html
alwaysRetry - Always retry all failures.
neverRetry - Never retry any failures.
retryTransientErrors - Retry all failures except for known persistent errors.
shouldRetry - Return true if this failure should be retried.
Background
When my Cloud Dataflow job inserts a very old timestamp (more than one year in the past) into BigQuery, I get the following error.
jsonPayload: {
exception: "java.lang.RuntimeException: java.io.IOException: Insert failed:
[{"errors":[{"debugInfo":"","location":"","message":"Value 690000000 for field
timestamp_scanned of the destination table fr-prd-datalake:rfid_raw.store_epc_transactions_cr_uqjp is outside the allowed bounds.
You can only stream to date range within 365 days in the past and 183 days in
the future relative to the current date.","reason":"invalid"}],
After the first error, Dataflow tries to retry the insert, and it is always rejected by BigQuery with the same error.
It did not stop, so I added retryTransientErrors to the BigQueryIO.Write step, and then the retrying stopped.
How does a Dataflow job behave if I specify retryTransientErrors?
All errors are considered transient except those for which BigQuery says the error reason is one of "invalid", "invalidQuery", "notImplemented" (which is why the retrying stopped in your case: the reason was "invalid").
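For reference, attaching the policy to the write step looks roughly like this (a sketch; the table reference and the surrounding pipeline are placeholders):

import com.google.api.services.bigquery.model.TableRow;

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.values.PCollection;

public class WriteWithRetryPolicy {

    // "rows" is assumed to be produced earlier in the pipeline.
    static void writeRows(PCollection<TableRow> rows) {
        rows.apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                        .to("my-project:my_dataset.my_table") // placeholder table
                        // "invalid" errors such as the out-of-bounds timestamp above are
                        // treated as persistent and are not retried.
                        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));
    }
}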
shouldRetry provides an error from BigQuery and I can decide whether I should retry. Where can I find the expected errors from BigQuery?
You can't since the errors are not visible to the caller. I'm not sure if this was done on purpose or whether Apache Beam should expose the errors so users can write their own retry logic.

ActiveMQ - cannot rollback non-transacted session INDIVIDUAL_ACK

Is it possible to roll back an asynchronously processed message in ActiveMQ? I'm consuming the next message while the first one is still being processed, so when I try to roll back the first message on another (non-ActiveMQ-pool) thread, I get the above error. Should I eventually send the message to the DLQ manually?
Message error handling can work a couple of ways:
Broker-side 'redelivery policy', where the client invokes a rollback n times (the default is usually 6 retries) and the broker then automatically moves the message to a Dead Letter Queue (DLQ).
Client-side: the application consumes the message and then produces it to the DLQ.
Option #1 is good for planned/unplanned outages (database down, etc.) where you want automatic retry. The redelivery policy can also be configured when the client connects to the broker (see the sketch below).
Option #2 is good for 'bad data' scenarios where you know the message will never be able to be processed. This is ideal because you can move the message aside on the first consumption instead of rejecting it n times.
When you combine infinite retry in #1 with #2 in your application flow, you get a robust process flow: automatic retry, plus bad data moved out of the way quickly. Best of breed =)
ActiveMQ Redelivery policy
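For reference, a minimal sketch of configuring the redelivery policy on the client connection as described in option #1 (broker URL and values are placeholders; the DLQ behaviour assumes the broker's default dead-letter strategy):

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.RedeliveryPolicy;

public class RedeliveryPolicyConfig {

    public static ActiveMQConnectionFactory connectionFactory() {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("failover:(tcp://localhost:61616)"); // placeholder broker

        RedeliveryPolicy policy = factory.getRedeliveryPolicy();
        policy.setInitialRedeliveryDelay(1000L); // wait 1s before the first redelivery
        policy.setUseExponentialBackOff(true);   // back off between attempts
        policy.setMaximumRedeliveries(6);        // after 6 failed deliveries the broker moves the
                                                 // message to the DLQ (ActiveMQ.DLQ by default)
        return factory;
    }
}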

Kafka Streams error - Offset commit failed on partition, request timed out

We use Kafka Streams for consuming, processing and producing messages, and in our PROD environment we are facing errors on multiple topics:
ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=app-xxx-StreamThread-3-consumer, groupId=app]
Offset commit failed on partition xxx-1 at offset 13920:
The request timed out.[]
These errors occur rarely for topics with a small load, but for topics with a high load (and spikes) they occur dozens of times a day per topic. The topics have multiple partitions (e.g. 10). The issue does not seem to affect the processing of data (aside from performance): after the exception is thrown (there can even be multiple errors for the same offset), the consumer later re-reads the message and processes it successfully.
I see that this error message appeared in kafka-clients version 1.0.0 due to a PR, but in previous kafka-clients versions, for the same use case (Errors.REQUEST_TIMED_OUT on the consumer), a similar message (Offset commit for group {} failed: {}) was logged at debug level.
To me, it would be more logical to change the log level to warning for this case.
How can I fix this issue? What could be the root cause? Maybe changing consumer properties or the partition setup could help to get rid of it.
We use the following implementation for creating the Kafka Streams topology:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.<String, String>stream(topicName);
stream.foreach((key, value) -> processMessage(key, value));
Topology topology = builder.build();
StreamsConfig streamsConfig = new StreamsConfig(consumerSettings);
KafkaStreams streams = new KafkaStreams(topology, streamsConfig);
streams.start();
our Kafka consumer settings:
bootstrap.servers: xxx1:9092,xxx2:9092,...,xxx5:9092
application.id: app
state.dir: /tmp/kafka-streams/xxx
commit.interval.ms: 5000 # also I tried default value 30000
key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
timestamp.extractor: org.apache.kafka.streams.processor.WallclockTimestampExtractor
Kafka broker version: kafka_2.11-0.11.0.2.
The error occurs on both versions of Kafka Streams: 1.0.1 and 1.1.0.
It looks like you have an issue with the Kafka cluster, and the Kafka consumer is timing out while trying to commit offsets.
You can try to increase the connection-related configs for the Kafka consumer (a configuration sketch follows the descriptions below):
request.timeout.ms (by default 305000ms)
The configuration controls the maximum amount of time the client will
wait for the response of a request
connections.max.idle.ms (by default 540000ms)
Close idle connections after the number of milliseconds specified by
this config.
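For reference, a sketch of passing those two settings to the consumer embedded in Kafka Streams via the consumer prefix (application id and brokers are placeholders; the values shown are the defaults quoted above, so raise them as needed):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTimeoutConfig {

    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app");                    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "xxx1:9092,xxx2:9092"); // placeholder
        // Forward the raised timeouts to the embedded consumer via the consumer prefix.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), 305000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG), 540000);
        return props;
    }
}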

Messages are not getting consumed from solace queue

I am using the Spring Integration int-jms:message-driven-channel-adapter to consume messages from a Solace queue.
I see the below-mentioned error in the server logs:
org.springframework.jms.listener.DefaultMessageListenerContainer- Execution of JMS message listener failed, and no ErrorHandler has been set.
javax.jms.TransactionRolledBackException: Error comitting - transaction rolled back (Transaction '12427' unexpectedly rolled back during commit attempt. (((Client name: xxxx.yyyy.com/7034/#0002000a Local addr: 123123 Remote addr: aaa.bbb.com:12345) - ) com.solacesystems.jcsmp.JCSMPErrorResponseException: 503: Message Consume Failure [Subcode:48]))
The JMS configuration is as follows:
<int-jms:message-driven-channel-adapter
id="IdMessageDrivenChannelAdapter" send-timeout="5000"
max-messages-per-task="-1"
idle-task-execution-limit="100"
max-concurrent-consumers="2"
connection-factory="appCachedConnectionFactory" destination="appInQueue"
channel="reqChannel" error-channel="errorChannel"
acknowledge="transacted" />
Any pointers to solve this error will be really helpful.
The error indicates a failure to consume a message during a transaction. The cause could be a number of different issues, such as the message having been deleted or having expired, or the queue not being found or being shut down.
You can analyze the rest of the API logs or the event logs on the Solace router to find out why the message could not be consumed.
The subcode documentation that you linked in the comments refers to the Solace .NET API. For a list of JCSMP errors, their subcodes, and explanations, please see the documentation here:
http://docs.solace.com/API-Developer-Online-Ref-Documentation/java/constant-values.html
