Periods of prolonged inactivity and frequent MessageLockLostException in QueueClient - java

Background
We have a data transfer solution with Azure Service Bus as the message broker. We are transferring data from x datasets through x queues - with x dedicated QueueClients as senders. Some senders publish messages at the rate of one message every two seconds, while others publish one every 15 minutes.
The application on the data source side (where senders are) is working just fine, giving us the desired throughput.
On the other side, we have an application with one QueueClient receiver per queue with the following configuration:
maxConcurrentCalls = 1
autoComplete = true (if receive mode = RECEIVEANDDELETE) and false (if receive mode = PEEKLOCK) - we have some receivers where, if they shut-down unexpectedly, would want to preserve the messages in the Service Bus Queue.
maxAutoRenewDuration = 3 minutes (lock duraition on all queues = 30 seconds)
an Executor service with a single thread
The MessageHandler registered with each of these receivers does the following:
public CompletableFuture<Void> onMessageAsync(final IMessage message) {
// deserialize the message body
final CustomObject customObject = (CustomObject)SerializationUtils.deserialize((byte[])message.getMessageBody().getBinaryData().get(0));
// process processDB1() and processDB2() asynchronously
final List<CompletableFuture<Boolean>> processFutures = new ArrayList<CompletableFuture<Boolean>>();
processFutures.add(processDB1(customObject)); // processDB1() returns Boolean
processFutures.add(processDB2(customObject)); // processDB2() returns Boolean
// join both the completablefutures to get the result Booleans
List<Boolean> results = CompletableFuture.allOf(processFutures.toArray(new CompletableFuture[processFutures.size()])).thenApply(future -> processFutures.stream()
.map(CompletableFuture<Boolean>::join).collect(Collectors.toList())
if (results.contains(false)) {
// dead-letter the message if results contains false
return getQueueClient().deadLetterAsync(message.getLockToken());
} else {
// complete the message otherwise
getQueueClient().completeAsync(message.getLockToken());
}
}
We tested with the following scenarios:
Scenario 1 - receive mode = RECEIVEANDDELETE, message publish rate: 30/ minute
Expected Behavior
The messages should be received continuosuly with a constant throughput (which need not necessarily be the throughput at source, where messages are published).
Actual behavior
We observe random, long periods of inactivity from the QueueClient - ranging from minutes to hours - there is no Outgoing Messages from the Service Bus namespace (observed on the Metrics charts) and there are no consumption logs for the same time periods!
Scenario 2 - receive mode = PEEKLOCK, message publish rate: 30/ minute
Expected Behavior
The messages should be received continuosuly with a constant throughput (which need not necessarily be the throughput at source, where messages are published).
Actual behavior
We keep seeing MessageLockLostException constantly after 20-30 minutes into the run of the application.
We tried doing the following -
we reduced the prefetch count (from 20 * processing rate - as mentioned in the Best Practices guide) to a bare minimum (to even 0 in one test cycle), to reduce the no. of messages that are locked for the client
increased the maxAutoRenewDuration to 5 minutes - our processDB1() and processDB2() do not take more than a second or two for almost 90% of the cases - so, I think the lock duration of 30 seconds and maxAutoRenewDuration are not issues here.
removed the blocking CompletableFuture.get() and made the processing synchronous.
None of these tweaks helped us fix the issue. What we observed is that the COMPLETE or RENEWMESSAGELOCK are throwing the MessageLockLostException.
We need help with finding answers for the following:
why is there a long period of inactivity of the QueueClient in scenario 1?
how do we know the MessageLockLostExceptions are thrown, because the locks have indeed expired? we suspect the locks cannot expire too soon, as our processing happens in a second or two. disabling prefetch also did not solve this for us.
Versions and Service Bus details
Java - openjdk-11-jre
Azure Service Bus namespace tier: Standard
Java SDK version - 3.4.0

For Scenario 1 :
If you have the duplicate detection history enabled, there is a possibility of this behavior happening as per the below explained scenario :
I had enabled for 30 seconds. I constantly hit Service bus with duplicate messages ( im my case messages with the same messageid from the client - 30 /per minute). I would be seeing a no activity outgoing for the window. Though the messages are received at the servicebus from the sending client, I was not be able to see them in outgoing messages. You could probably check whether you re encountering the duplicate messages which are filtered - inturn resulting inactivity in outgoing.
Also Note : You can't enable/disable duplicate detection after the queue is created. You can only do so at the time of creating the queue.

The issue was not with the QueueClient object per se. It was with the processes that we were triggering from within the MessageHandler: processDB1(customObject) and processDB2(customObject). since these processes were not optimized, the message consumption dropped and the locks gor expired (in peek-lock mode), as the handler was spending more time (in relation to the rate at which messages were published to the queues) in completing these opertations.
After optimizing the processes, the consumption and completion (in peek-lock mode) were just fine.

Related

Google PubSub Java (Scala) Client Gets Excessive Amount of Resent Messages

I have a scenario where I load a subscription with around 1100 messages. I then start a Spark job which pulls messages from this subscription with these settings:
MaxOutstandingElementCount: 5
MaxAckExtensionPeriod: 60 min
AckDeadlineSeconds: 600
The first message to get processed starts a cache generation which takes about 30 minutes to complete. Any other messages arriving while this is going on are simply "returned" with no ack or nack. After that, a given message takes between 1 min and 30 mins to process. With an ack extension period of 60 min, I would never expect to see resending of messages.
The behaviour I am seeing is that while the initial cache is being generated, every 10 minutes 5 new messages are grabbed by the client and returned with no ack or nack by my code. This is unexpected. I would expect the deadline of the original 5 messages to be extended up to an hour.
Furthermore, after having processed and acked about 500 of the messages, I would expect around 600 left in the subscription, but I see almost the original 1100. These turn out to be resent duplicates, as I log these in my code. This is also very unexpected.
This is a screenshot from google console after around 500 messages have been processed and acked (ignore the first "hump", that was an aborted test run):
Am I missing something?
Here is the setup code:
val name = ProjectSubscriptionName.of(ConfigurationValues.ProjectId,
ConfigurationValues.PubSubSubscription)
val topic = ProjectTopicName.of(ConfigurationValues.ProjectId,
ConfigurationValues.PubSubSubscriptionTopic)
val pushConfig = PushConfig.newBuilder.build
val ackDeadlineSeconds = 600
subscriptionAdminClient.createSubscription(
name,
topic,
pushConfig,
ackDeadlineSeconds)
val flowControlSettings = FlowControlSettings.newBuilder()
.setMaxOutstandingElementCount(5L)
.build();
// create a subscriber bound to the asynchronous message receiver
val subscriber = Subscriber
.newBuilder(subscriptionName, new EtlMessageReceiver(spark))
.setFlowControlSettings(flowControlSettings)
.setMaxAckExtensionPeriod(Duration.ofMinutes(60))
.build
subscriber.startAsync.awaitRunning()
Here is the code in the receiver which runs when a message arrives while the cache is being generated:
if(!BIQConnector.cacheGenerationDone){
Utilities.logLine(
s"PubSub message for work item $uniqueWorkItemId ignored as cache is still being generated.")
return
}
And finally when a message has been processed:
consumer.ack()
Utilities.logLine(s"PubSub message ${message.getMessageId} for $tableName acknowledged.")
// Write back to ETL Manager
Utilities.logLine(
s"Writing result message back to topic ${etlResultTopic} for table $tableName, $tableDetailsForLog.")
sendPubSubResult(importTableName, validTableName, importTimestamp, 2, etlResultTopic, stageJobData,
tableDetailsForLog, "Success", isDeleted)
Is your Spark job using a Pub/Sub client library to pull messages? These libraries should indeed keep extending your message deadlines up to the MaxAckExtensionPeriod you specified.
If your job is using a Pub/Sub client library, this is unexpected behavior. You should contact Google Cloud support with your project name, subscription name, client library version, and a sample of the message IDs from the messages you are "returning" without acking. They will be able to investigate further into why you're receiving these resent messages.

Google Cloud Pub/Sub retries the message delivery if it does not get a response in 30 seconds. Intended?

I'm using a Java 8 servlet as a Cloud Pub/Sub push endpoint.
On my push endpoint I have a long-running blocking operation, that sometimes runs for over a minute.
After the operation is done, I return a 200 response, acking the message.
If I return a 500 server error, the message is retried, which is expected.
Note that I create my subscription with a maximum allowed deadline ack period of 600 seconds.
What I have noticed is that if my long-running operation runs for over 30 seconds, the message is also retried. Seems like the HTTP connection that is used for push delivery does not live for over 30 seconds or something.
Is this intended? Is it configurable somehow? Thanks in advance.
For push subscriptions, Cloud Pub/Sub does not send a negative acknowledgment (sometimes known as a nack). If your webhook does not return a success code within the acknowledgment deadline, Cloud Pub/Sub retries delivery until the message expires after the subscription's message retention period. You can configure a default acknowledgment deadline for push subscriptions when you create the push subscription (select push subscription and set Acknowledgement deadline).
Note that, unlike for pull subscriptions, the deadline cannot be extended for individual messages. The deadline is effectively the amount of time the endpoint has to respond to the push request.
Yes. It is expected.
Pubsub guarantee at-least one time delivery until you acknowledge the message.
You can add delay in Push subscription by adding setting in Subsccription.
Go To subscription -> Edit ->Acknowledgement deadline ->
set values from 10 Seconds to 600 Seconds(10 Minutes).

Azure eventhub Kafka org.apache.kafka.common.errors.TimeoutException for some of the records

Have a ArrayList containing 80 to 100 records trying to stream and send each individual record(POJO ,not entire list) to Kafka topic (event hub) . Scheduled a cron job like every hour to send these records(POJO) to event hub.
Able to see messages being sent to eventhub ,but after 3 to 4 successful run getting following exception (which includes several messages being sent and several failing with below exception)
Expiring 14 record(s) for eventhubname: 30125 ms has passed since batch creation plus linger time
Following is the config for Producer used,
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "1");
props.put(ProducerConfig.RETRIES_CONFIG, "3");
Message Retention period - 7
Partition - 6
using spring Kafka(2.2.3) to send the events
method marked as #Async where kafka send is written
#Async
protected void send() {
kafkatemplate.send(record);
}
Expected - No exception to be thrown from kafka
Actual - org.apache.kafka.common.errors.TimeoutException is been thrown
Prakash - we have seen a number of issues where spiky producer patterns see batch timeout.
The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in.
Set connections.max.idle.ms to < 4mins – this allows Kafka client’s network client layer to gracefully handle connection close for the producer’s message-sending TCP connection
Set metadata.max.age.ms to < 4mins – this is effectively a keep-alive for the producer metadata TCP connection
Feel free to reach out to the EH product team on Github, we are fairly good about responding to issues - https://github.com/Azure/azure-event-hubs-for-kafka
This exception indicates you are queueing records at a faster rate than they can be sent. Once a record is added a batch, there is a time limit for sending that batch to ensure it has been sent within a specified duration. This is controlled by the Producer configuration parameter, request.timeout.ms. If the batch has been queued longer than the timeout limit, the exception will be thrown. Records in that batch will be removed from the send queue.
Please check the below for similar issue, this might help better.
Kafka producer TimeoutException: Expiring 1 record(s)
you can also check this link
when-does-the-apache-kafka-client-throw-a-batch-expired-exception/34794261#34794261 for reason more details about batch expired exception.
Also implement proper retry policy.
Note this does not account any network issues scanner side. With network issues you will not be able to send to either hub.
Hope it helps.

How does setBackOffMultiplier(double backOffMultiplier) in ActiveMQ work

I am writing an application using activemq where I am using the redelivery policy to redeliver the messages. I am using the ActiveMQ's ExponentialBackOff concept.
My question is how does this ExponentialBackOff/setBackOffMultiplier work.
For example in my case I want to redeliver the message till the message expiration time, which is 15 minutes.I want to try to redeliver 10 times within 15 minutes.But ExponentialBackOff makes the message to redeliver beyond the 15 minutes expiry time of the message i.e. the message to be redelivered is still in the pending state even after the expiration time which is 15 minutes.
Why is this? I am kind of confused with this behavior. The redelivery policy I am using is as below.
RedeliveryPolicy queuePolicy = new RedeliveryPolicy();
queuePolicy.setInitialRedeliveryDelay(0);
queuePolicy.setBackOffMultiplier(3);
queuePolicy.setUseExponentialBackOff(true);
queuePolicy.setMaximumRedeliveries(10);
with this RedeliveryPolicy config, the RedeliveryPolicy will make attempts after each time waiting below :
after 1s
after 3s
9s
27s
81s
243s
729s
2187s
6561s
19683s
as you see like this attempts are executed after hours and in the meantime you see messages state is pending.
to prevent these long periods maybe you want to set the maximumRedeliveryDelay=300000L (5 minutes).
Note that
Once a message's redelivery attempts exceeds the maximumRedeliveries configured for the Redelivery Policy, a "Poison ack" is sent back to the broker letting him know that the message was considered a poison pill. The Broker then takes the message and sends it to a Dead Letter Queue so that it can be analyzed later on.
you need to adapt your RedeliveryPolicy because the message is pending as long as maximumRedeliveries is not exceeded.
http://activemq.apache.org/message-redelivery-and-dlq-handling.html

Kafka CommitFailedException consumer exception

After create multiple consumers (using Kafka 0.9 java API) and each thread started, I'm getting the following exception
Consumer has failed with exception: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
class com.messagehub.consumer.Consumer is shutting down.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:546)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:487)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:681)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:654)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:350)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:288)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:303)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:197)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:187)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:157)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:352)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:936)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:905)
and then start consuming message normally, I would like to know what is causing this exception in order to fix it.
Try also to tweak the following parameters:
heartbeat.interval.ms - This tells Kafka wait the specified amount of milliseconds before it consider the consumer will be considered "dead"
max.partition.fetch.bytes - This will limit the amount of messages (up to) the consumer will receive when polling.
I noticed that the rebalancing occurs if the consumer does not commit to Kafka before the heartbeat times out. If the commit occurs after the messages are processed, the amount of time to process them will determine these parameters. So, decreasing the number of messages and increasing the heartbeat time will help to avoid rebalancing.
Also consider to use more partitions, so there will be more threads processing your data, even with less messages per poll.
I wrote this small application to make tests. Hope it helps.
https://github.com/ajkret/kafka-sample
UPDATE
Kafka 0.10.x now offers a new parameter to control the number of messages received:
- max.poll.records - The maximum number of records returned in a single call to poll().
UPDATE
Kafka offers a way to pause the queue. While the queue is paused, you can process the messages in a separated Thread, allowing you to call KafkaConsumer.poll() to send heartbeats. Then call KafkaConsumer.resume() after the processing is done. This way you mitigate the problems of causing rebalances due to not sending heartbeats. Here is an outline of what you can do :
while(true) {
ConsumerRecords records = consumer.poll(Integer.MAX_VALUE);
consumer.commitSync();
consumer.pause();
for(ConsumerRecord record: records) {
Future<Boolean> future = workers.submit(() -> {
// Process
return true;
});
while (true) {
try {
if (future.get(1, TimeUnit.SECONDS) != null) {
break;
}
} catch (java.util.concurrent.TimeoutException e) {
getConsumer().poll(0);
}
}
}
consumer.resume();
}
Two possible reasons -->
If there are any network failures, consumers cannot reach out to the broker and will throw this exception. But there were no network failures when these exceptions occurred.
As mentioned in the error trace, if too much time is spent on processing the message, the ConsumerCoordinator will lose the connection and the commit will fail. This is because of polling.
The values given here are the default Kafka consumer configuration values.
request.timeout.ms=40000
heartbeat.interval.ms=3000
max.poll.interval.ms=300000
max.poll.records=500
session.timeout.ms=10000
Solution -->
Reduced the max.poll.records to 100 but still, the exception was occurring some times. So changed the configurations as below;
request.timeout.ms=300000
heartbeat.interval.ms=1000
max.poll.interval.ms=900000
max.poll.records=100
session.timeout.ms=600000
Reduced the heartbeat interval so that broker will be updated frequently that the Consumer is active. And also increased the session timeout configurations.

Categories