I'm publishing Pub/Sub messages from the App Engine flexible environment with the Java client library like this:
Publisher publisher = Publisher
.newBuilder(ProjectTopicName.of(Utils.getApplicationId(), "test-topic"))
.setBatchingSettings(
BatchingSettings.newBuilder()
.setIsEnabled(false)
.build())
.build();
publisher.publish(PubsubMessage.newBuilder()
.setData(ByteString.copyFromUtf8(message))
.putAttributes("timestamp", String.valueOf(System.currentTimeMillis()))
.build());
I'm subscribing to the topic in Dataflow and logging how long it takes for a message to reach Dataflow from App Engine flexible:
pipeline
.apply(PubsubIO.readMessagesWithAttributes().fromSubscription(Utils.buildPubsubSubscription(Constants.PROJECT_NAME, "test-topic")))
.apply(ParDo.of(new DoFn<PubsubMessage, PubsubMessage>() {
@ProcessElement
public void processElement(ProcessContext c) {
long timestamp = System.currentTimeMillis() - Long.parseLong(c.element().getAttribute("timestamp"));
System.out.println("Time: " + timestamp);
}
}));
pipeline.run();
When I publish messages at a rate of a few messages per second, the logs show that the time needed for a message to reach Dataflow is between 100 ms and 1.5 seconds.
But when the rate is about 100 messages per second, the time is consistently between 100 ms and 200 ms, which seems perfectly adequate.
Can someone explain this behavior? It seems as if turning off publisher batching does not work.
Pub/Sub is designed for high message throughput in both subscription cases.
Pull subscription works best when there's a large volume of messages; it's the kind of subscription you would use when message-processing throughput is your priority. In particular, note that synchronous pull doesn't handle messages as soon as they are published, and it can choose to pull and handle a fixed number of messages (more messages, more pulls). A better option would be to use asynchronous pull, which uses a long-running message listener and acknowledges one message at a time [1].
Push subscription, on the other hand, uses a slow-start algorithm: the number of messages sent is doubled with each successful delivery until it reaches its constraints (more messages, more and faster deliveries).
[1] https://cloud.google.com/pubsub/docs/pull#asynchronous-pull
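For reference, a minimal sketch of asynchronous pull with the Java client library; the project and subscription names are assumptions, not taken from the question:
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
public class AsyncPullExample {
    public static void main(String[] args) {
        // Hypothetical project and subscription names.
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "test-topic-subscription");
        // The receiver is invoked by a long-running listener as messages arrive.
        MessageReceiver receiver =
                (PubsubMessage message, AckReplyConsumer consumer) -> {
                    System.out.println("Received: " + message.getData().toStringUtf8());
                    consumer.ack(); // acknowledge one message at a time
                };
        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        // Keep the process alive while messages are pulled in the background.
        subscriber.awaitTerminated();
    }
}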
Related
I am using async pull to pull messages from a Pub/Sub topic, do some processing, and send messages to an ActiveMQ topic.
With the current configuration of Pub/Sub I have to ack() the messages upon receipt. This, however, does not suit my use case, as I need to ack() messages ONLY after they are successfully processed and sent to the other topic. This means (per my understanding) ack()ing the messages outside the MessageReceiver.
I tried to save each message and its AckReplyConsumer to be able to call it later and ack() the messages; this, however, does not work as expected, and not all messages are correctly ack()ed.
So I want to know if this is possible at all, and if yes, how.
My subscriber config:
public Subscriber getSubscriber(CompositeConfigurationElement compositeConfigurationElement, Queue<CustomPupSubMessage> messages) throws IOException {
ProjectSubscriptionName subscriptionName = ProjectSubscriptionName.of(compositeConfigurationElement.getPubsub().getProjectid(),
compositeConfigurationElement.getSubscriber().getSubscriptionId());
ExecutorProvider executorProvider =
InstantiatingExecutorProvider.newBuilder().setExecutorThreadCount(2).build();
// Instantiate an asynchronous message receiver.
MessageReceiver receiver =
(PubsubMessage message, AckReplyConsumer consumer) -> {
messages.add(CustomPupSubMessage.builder().message(message).consumer(consumer).build());
};
// The subscriber will pause the message stream and stop receiving more messages from the
// server if any one of the conditions is met.
FlowControlSettings flowControlSettings =
FlowControlSettings.newBuilder()
// 1,000 outstanding messages. Must be >0. It controls the maximum number of messages
// the subscriber receives before pausing the message stream.
.setMaxOutstandingElementCount(compositeConfigurationElement.getSubscriber().getOutstandingElementCount())
// 100 MiB. Must be >0. It controls the maximum size of messages the subscriber
// receives before pausing the message stream.
.setMaxOutstandingRequestBytes(100L * 1024L * 1024L)
.build();
//read credentials
InputStream input = new FileInputStream(compositeConfigurationElement.getPubsub().getSecret());
CredentialsProvider credentialsProvider = FixedCredentialsProvider.create(ServiceAccountCredentials.fromStream(input));
Subscriber subscriber = Subscriber.newBuilder(subscriptionName, receiver)
.setParallelPullCount(compositeConfigurationElement.getSubscriber().getSubscriptionParallelThreads())
.setFlowControlSettings(flowControlSettings)
.setCredentialsProvider(credentialsProvider)
.setExecutorProvider(executorProvider)
.build();
return subscriber;
}
My processing part:
jmsConnection.start();
for (int i = 0; i < patchSize; i++) {
var message = messages.poll();
if (message != null) {
byte[] payload = message.getMessage().getData().toByteArray();
jmsMessage = jmsSession.createBytesMessage();
jmsMessage.writeBytes(payload);
jmsMessage.setJMSMessageID(message.getMessage().getMessageId());
producer.send(jmsMessage);
list.add(message.getConsumer());
} else break;
}
jmsSession.commit();
jmsSession.close();
jmsConnection.close();
// if upload is successful then ack the messages
log.info("sent " + list.size() + " in direction " + dest);
list.forEach(consumer -> consumer.ack());
There is nothing that requires messages to be acked within the MessageReceiver callback and you should be able to acknowledge messages asynchronously. There are a few things to keep in mind and look for:
Check to ensure that you are calling ack before the ack deadline expires. By default, the Java client library does extend the ack deadline for up to 1 hour, so if you are taking less time than that to process, you should be okay.
If your subscriber is often flow controlled, consider reducing the value you pass into setParallelPullCount to 1. The flow control settings you pass in are applied to each stream, not divided among them, so if each stream is able to receive the full value passed in and your processing is slow enough, you could be exceeding the 1-hour deadline in the client library without having even received the message yet, causing the duplicate delivery. You really only need to set setParallelPullCount to a larger value if you are able to process messages much faster than a single stream can deliver them.
Ensure that your client library version is at least 1.109.0. There were some improvements made to the way flow control was done in that version.
Note that Pub/Sub has at-least-once delivery semantics, meaning messages can be redelivered even if ack is called properly. Also note that not acknowledging or nacking a single message could result in the redelivery of all messages that were published together in a single batch. See the "Message Redelivery & Duplication Rate" section of "Fine-tuning Pub/Sub performance with batch and flow control settings."
If all of that still doesn't fix the issue, then it would be best to try to create a small, self-contained example that reproduces the issue and open up a bug in the GitHub repo.
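As a rough sketch of the deferred-ack pattern, assuming a shared queue handed to the receiver as in the question (the queue, batch size, and forwarding helper are assumptions):
// Receiver only enqueues the message together with its AckReplyConsumer; no ack here.
MessageReceiver receiver =
        (PubsubMessage message, AckReplyConsumer consumer) ->
                messages.add(CustomPupSubMessage.builder().message(message).consumer(consumer).build());
// Later, in the processing loop: ack only after the message was forwarded successfully.
List<AckReplyConsumer> acksToSend = new ArrayList<>();
for (int i = 0; i < batchSize; i++) {
    CustomPupSubMessage item = messages.poll();
    if (item == null) break;
    try {
        forwardToActiveMq(item.getMessage()); // hypothetical helper that sends to the JMS topic
        acksToSend.add(item.getConsumer());
    } catch (Exception e) {
        item.getConsumer().nack(); // let Pub/Sub redeliver messages that failed to forward
    }
}
acksToSend.forEach(AckReplyConsumer::ack);
If processing can take longer than the default maximum extension, Subscriber.Builder.setMaxAckExtensionPeriod can raise the client-side limit so the deferred acks still land before the deadline.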
Background
We have a data transfer solution with Azure Service Bus as the message broker. We are transferring data from x datasets through x queues - with x dedicated QueueClients as senders. Some senders publish messages at the rate of one message every two seconds, while others publish one every 15 minutes.
The application on the data source side (where senders are) is working just fine, giving us the desired throughput.
On the other side, we have an application with one QueueClient receiver per queue with the following configuration:
maxConcurrentCalls = 1
autoComplete = true (if receive mode = RECEIVEANDDELETE) and false (if receive mode = PEEKLOCK) - for some receivers we want the messages to be preserved in the Service Bus queue if they shut down unexpectedly.
maxAutoRenewDuration = 3 minutes (lock duration on all queues = 30 seconds)
an ExecutorService with a single thread
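For context, a minimal sketch of how this receiver configuration might be wired up with the track-1 Java SDK (the connection string, queue name, and handler instance are assumptions):
import com.microsoft.azure.servicebus.*;
import com.microsoft.azure.servicebus.primitives.ConnectionStringBuilder;
import java.time.Duration;
import java.util.concurrent.Executors;
public class ReceiverSetup {
    public static QueueClient startReceiver(String connectionString, String queueName,
                                             IMessageHandler handler) throws Exception {
        // PEEKLOCK so unprocessed messages stay in the queue if the receiver shuts down unexpectedly.
        QueueClient client = new QueueClient(
                new ConnectionStringBuilder(connectionString, queueName), ReceiveMode.PEEKLOCK);
        client.registerMessageHandler(
                handler,
                new MessageHandlerOptions(
                        1,                      // maxConcurrentCalls
                        false,                  // autoComplete off in peek-lock mode
                        Duration.ofMinutes(3)), // maxAutoRenewDuration
                Executors.newSingleThreadExecutor());
        return client;
    }
}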
The MessageHandler registered with each of these receivers does the following:
public CompletableFuture<Void> onMessageAsync(final IMessage message) {
    // deserialize the message body
    final CustomObject customObject = (CustomObject) SerializationUtils.deserialize((byte[]) message.getMessageBody().getBinaryData().get(0));
    // process processDB1() and processDB2() asynchronously
    final List<CompletableFuture<Boolean>> processFutures = new ArrayList<CompletableFuture<Boolean>>();
    processFutures.add(processDB1(customObject)); // processDB1() returns Boolean
    processFutures.add(processDB2(customObject)); // processDB2() returns Boolean
    // join both CompletableFutures to get the result Booleans
    final List<Boolean> results = CompletableFuture.allOf(processFutures.toArray(new CompletableFuture[0]))
            .thenApply(ignored -> processFutures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList()))
            .join();
    if (results.contains(false)) {
        // dead-letter the message if results contains false
        return getQueueClient().deadLetterAsync(message.getLockToken());
    } else {
        // complete the message otherwise
        return getQueueClient().completeAsync(message.getLockToken());
    }
}
We tested with the following scenarios:
Scenario 1 - receive mode = RECEIVEANDDELETE, message publish rate: 30/minute
Expected Behavior
The messages should be received continuously with a constant throughput (which need not necessarily be the throughput at the source, where messages are published).
Actual behavior
We observe random, long periods of inactivity from the QueueClient, ranging from minutes to hours: there are no outgoing messages from the Service Bus namespace (observed on the metrics charts) and there are no consumption logs for the same time periods!
Scenario 2 - receive mode = PEEKLOCK, message publish rate: 30/minute
Expected Behavior
The messages should be received continuously with a constant throughput (which need not necessarily be the throughput at the source, where messages are published).
Actual behavior
We keep seeing MessageLockLostException constantly, starting 20-30 minutes into the run of the application.
We tried the following:
we reduced the prefetch count (from 20 * processing rate, as mentioned in the Best Practices guide) to a bare minimum (even 0 in one test cycle) to reduce the number of messages that are locked for the client
increased the maxAutoRenewDuration to 5 minutes - our processDB1() and processDB2() do not take more than a second or two in almost 90% of cases, so I think the lock duration of 30 seconds and maxAutoRenewDuration are not the issue here
removed the blocking CompletableFuture.get() and made the processing synchronous
None of these tweaks helped us fix the issue. What we observed is that the COMPLETE and RENEWMESSAGELOCK operations are throwing the MessageLockLostException.
We need help finding answers to the following:
Why is there a long period of inactivity of the QueueClient in scenario 1?
How do we know the MessageLockLostExceptions are thrown because the locks have indeed expired? We suspect the locks cannot expire so soon, as our processing finishes in a second or two. Disabling prefetch also did not solve this for us.
Versions and Service Bus details
Java - openjdk-11-jre
Azure Service Bus namespace tier: Standard
Java SDK version - 3.4.0
For Scenario 1:
If you have duplicate detection history enabled, there is a possibility of this behavior happening, as in the scenario explained below:
I had enabled it for 30 seconds. I constantly hit Service Bus with duplicate messages (in my case, messages with the same messageId from the client, 30 per minute). I would see no outgoing activity for that window. Though the messages were received by Service Bus from the sending client, I was not able to see them in the outgoing messages. You could check whether you are encountering duplicate messages that get filtered out, in turn resulting in inactivity on the outgoing side.
Also note: you can't enable/disable duplicate detection after the queue is created. You can only do so at the time of creating the queue.
The issue was not with the QueueClient object per se. It was with the processes that we were triggering from within the MessageHandler: processDB1(customObject) and processDB2(customObject). Since these processes were not optimized, message consumption dropped and the locks expired (in peek-lock mode), as the handler was spending more time (relative to the rate at which messages were published to the queues) completing these operations.
After optimizing the processes, the consumption and completion (in peek-lock mode) were just fine.
I have a scenario where I load a subscription with around 1100 messages. I then start a Spark job which pulls messages from this subscription with these settings:
MaxOutstandingElementCount: 5
MaxAckExtensionPeriod: 60 min
AckDeadlineSeconds: 600
The first message to get processed starts a cache generation which takes about 30 minutes to complete. Any other messages arriving while this is going on are simply "returned" with no ack or nack. After that, a given message takes between 1 min and 30 mins to process. With an ack extension period of 60 min, I would never expect to see resending of messages.
The behaviour I am seeing is that while the initial cache is being generated, every 10 minutes 5 new messages are grabbed by the client and returned with no ack or nack by my code. This is unexpected. I would expect the deadline of the original 5 messages to be extended up to an hour.
Furthermore, after having processed and acked about 500 of the messages, I would expect around 600 left in the subscription, but I see almost the original 1100. These turn out to be resent duplicates, as I log these in my code. This is also very unexpected.
This is a screenshot from the Google Cloud console after around 500 messages have been processed and acked (ignore the first "hump"; that was an aborted test run):
Am I missing something?
Here is the setup code:
val name = ProjectSubscriptionName.of(ConfigurationValues.ProjectId,
ConfigurationValues.PubSubSubscription)
val topic = ProjectTopicName.of(ConfigurationValues.ProjectId,
ConfigurationValues.PubSubSubscriptionTopic)
val pushConfig = PushConfig.newBuilder.build
val ackDeadlineSeconds = 600
subscriptionAdminClient.createSubscription(
name,
topic,
pushConfig,
ackDeadlineSeconds)
val flowControlSettings = FlowControlSettings.newBuilder()
.setMaxOutstandingElementCount(5L)
.build();
// create a subscriber bound to the asynchronous message receiver
val subscriber = Subscriber
.newBuilder(name, new EtlMessageReceiver(spark)) // the subscription name created above
.setFlowControlSettings(flowControlSettings)
.setMaxAckExtensionPeriod(Duration.ofMinutes(60))
.build
subscriber.startAsync.awaitRunning()
Here is the code in the receiver which runs when a message arrives while the cache is being generated:
if(!BIQConnector.cacheGenerationDone){
Utilities.logLine(
s"PubSub message for work item $uniqueWorkItemId ignored as cache is still being generated.")
return
}
And finally when a message has been processed:
consumer.ack()
Utilities.logLine(s"PubSub message ${message.getMessageId} for $tableName acknowledged.")
// Write back to ETL Manager
Utilities.logLine(
s"Writing result message back to topic ${etlResultTopic} for table $tableName, $tableDetailsForLog.")
sendPubSubResult(importTableName, validTableName, importTimestamp, 2, etlResultTopic, stageJobData,
tableDetailsForLog, "Success", isDeleted)
Is your Spark job using a Pub/Sub client library to pull messages? These libraries should indeed keep extending your message deadlines up to the MaxAckExtensionPeriod you specified.
If your job is using a Pub/Sub client library, this is unexpected behavior. You should contact Google Cloud support with your project name, subscription name, client library version, and a sample of the message IDs from the messages you are "returning" without acking. They will be able to investigate further into why you're receiving these resent messages.
We're using Pub/Sub in prod and seeing a problem: there are more VMs handling Pub/Sub messages than we would expect to need.
I've run simple tests using Pub/Sub overnight, and it appears that something does not go as smoothly as we expected with the rate-limiting mechanism.
Here is the test:
Publish some amount of messages into a topic with a pull subscription. In the experiment, there are about 2.7k messages (started at approximately 9 pm).
Configure one async client using the StreamingPull connection and flow control set to 2.
Simulate that handling every incoming message takes 5 seconds by moving the execution into a timer and acknowledging the message only when the timer finishes.
Expected results:
Messages from Pub/Sub are consumed at a steady pace, 2 messages at a time every 5 seconds. A small delay between acking a message and a new message being pulled, due to network and processing overhead, is expected.
Actual result: Pub/Sub starts throttling, or something like that, with a huge timeout. No messages arrive during that time. The length of the timeout depends on the amount of unacked messages in the subscription.
This behavior doesn't seem clear from the FlowControl docs.
Here is the code of consumer (client):
var concurrentFlowsNumber = config.getLong(CONFIG_NUMBER_OF_THREADS);
var flowSettings = FlowControlSettings.newBuilder()
.setMaxOutstandingElementCount(concurrentFlowsNumber)
.setLimitExceededBehavior(FlowController.LimitExceededBehavior.Block)
.build();
var subscriber = Subscriber.newBuilder(subscriptionName, receiver)
.setCredentialsProvider(() -> serviceAccountCredentials)
.setFlowControlSettings(flowSettings)
.build();
subscriber.addListener(
new Subscriber.Listener() {
@Override
public void failed(ApiService.State from, Throwable failure) {
logger.error(failure);
}
},
MoreExecutors.directExecutor());
var apiService = subscriber.startAsync();
apiService.addListener(new ApiService.Listener() {
@Override
public void running() {
logger.info("Pubsub started");
}
@Override
public void failed(ApiService.State from, Throwable failure) {
logger.error("Pubsub failed on step: {}", from);
}
}, Runnable::run);
And the message handler is:
private static void handlePubSubMessage(PubsubMessage message, AckReplyConsumer consumer) {
new Timer().schedule(new TimerTask() {
@Override
public void run() {
consumer.ack();
}
}, (long) 3000 + rand.nextInt(5000));
}
So, does anyone have any idea how to make the clients (many VMs) consume messages with concurrent handling limitations (up to 4 concurrent messages) without running into these timeouts?
P.s. These questions are similar, but not the same:
Google pubsub flow control
pubsub Dynamic rate limiting
Cloud pubsub slow poll rate
Since you have a backlog build up, you might be running into this issue: https://cloud.google.com/pubsub/docs/pull#streamingpull_dealing_with_large_backlogs_of_small_messages
Your undelivered messages will get buffered between the Pub/Sub service and the client library. Messages might get stuck in a single client's buffer, or get redelivered to the same client if the ackDeadline was exceeded.
You can experiment with using the synchronous pull as suggested.
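As a rough sketch of synchronous pull with the Java client (project and subscription IDs are assumptions), pulling a bounded batch and acking only what was actually handled could look like this:
import com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStubSettings;
import com.google.pubsub.v1.AcknowledgeRequest;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PullRequest;
import com.google.pubsub.v1.PullResponse;
import com.google.pubsub.v1.ReceivedMessage;
import java.util.ArrayList;
import java.util.List;
public class SyncPullExample {
    public static void main(String[] args) throws Exception {
        // Assumed project and subscription names.
        String subscription = ProjectSubscriptionName.format("my-project", "my-subscription");
        SubscriberStubSettings settings = SubscriberStubSettings.newBuilder().build();
        try (SubscriberStub stub = GrpcSubscriberStub.create(settings)) {
            // Ask for at most 4 messages, matching the desired concurrency limit.
            PullRequest pullRequest = PullRequest.newBuilder()
                    .setSubscription(subscription)
                    .setMaxMessages(4)
                    .build();
            PullResponse response = stub.pullCallable().call(pullRequest);
            List<String> ackIds = new ArrayList<>();
            for (ReceivedMessage received : response.getReceivedMessagesList()) {
                // ... handle received.getMessage() here (takes ~5 seconds each) ...
                ackIds.add(received.getAckId());
            }
            if (!ackIds.isEmpty()) {
                stub.acknowledgeCallable().call(AcknowledgeRequest.newBuilder()
                        .setSubscription(subscription)
                        .addAllAckIds(ackIds)
                        .build());
            }
        }
    }
}
With this approach the client only ever holds as many messages as it asks for, instead of buffering a backlog behind a StreamingPull connection.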
I'm using a Java 8 servlet as a Cloud Pub/Sub push endpoint.
On my push endpoint I have a long-running blocking operation, that sometimes runs for over a minute.
After the operation is done, I return a 200 response, acking the message.
If I return a 500 server error, the message is retried, which is expected.
Note that I create my subscription with a maximum allowed deadline ack period of 600 seconds.
What I have noticed is that if my long-running operation runs for over 30 seconds, the message is also retried. Seems like the HTTP connection that is used for push delivery does not live for over 30 seconds or something.
Is this intended? Is it configurable somehow? Thanks in advance.
For push subscriptions, Cloud Pub/Sub does not send a negative acknowledgment (sometimes known as a nack). If your webhook does not return a success code within the acknowledgment deadline, Cloud Pub/Sub retries delivery until the message expires after the subscription's message retention period. You can configure a default acknowledgment deadline for push subscriptions when you create the push subscription (select push subscription and set Acknowledgement deadline).
Note that, unlike for pull subscriptions, the deadline cannot be extended for individual messages. The deadline is effectively the amount of time the endpoint has to respond to the push request.
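For instance, a minimal sketch of creating a push subscription with a 600-second acknowledgment deadline using the Java admin client (the project, topic, subscription, and endpoint names are assumptions):
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.ProjectTopicName;
import com.google.pubsub.v1.PushConfig;
public class CreatePushSubscriptionExample {
    public static void main(String[] args) throws Exception {
        try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
            admin.createSubscription(
                    ProjectSubscriptionName.of("my-project", "my-push-subscription"), // assumed
                    ProjectTopicName.of("my-project", "my-topic"),                    // assumed
                    PushConfig.newBuilder()
                            .setPushEndpoint("https://my-app.appspot.com/push-handler") // assumed endpoint
                            .build(),
                    600); // acknowledgment deadline in seconds: how long the endpoint has to respond
        }
    }
}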
Yes, it is expected.
Pub/Sub guarantees at-least-once delivery until you acknowledge the message.
You can add a delay for a push subscription by changing a setting on the subscription:
Go to Subscription -> Edit -> Acknowledgement deadline -> set a value from 10 seconds to 600 seconds (10 minutes).