Slow message retrieval from SQS - java

Given a Java AWS Lambda with the following code:
private static final String QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/<ACCT_NUMBER>/<QUEUE_NAME>";
private static final AmazonSQS client = AmazonSQSClientBuilder.standard().build();
private static final int MAX_SQS_MESSAGES = 10;
And:
private List<Message> getMessages() {
return client.receiveMessage(new ReceiveMessageRequest().withQueueUrl(QUEUE_URL)
.withMaxNumberOfMessages(MAX_SQS_MESSAGES).withWaitTimeSeconds(1)).getMessages();
}
I experience rather "long" SQS retrieval times (considering the specified 1-second base for long polling), as evidenced by these sample log entries:
Got 3 SQS msgs: 1985ms
Got 8 SQS msgs: 1887ms
Got 9 SQS msgs: 2438ms
Got 5 SQS msgs: 1748ms
Are those times normal operation, or could I be doing something wrong or improve something?
Maven dependency:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-sqs</artifactId>
<version>1.11.488</version>
</dependency>

Those are indeed very long delays, and something is wrong. With a non-empty queue, you should be able to get typical reads in the 5-500ms range (lower when more messages are available). Even if your queue is empty, the request times should top out at around 1s, given your use of withWaitTimeSeconds in the request.
There are a number of steps you can take to narrow down the problem:
Make sure the queue and the lambda are in the same region - I mention this first as I've seen so many latency issues caused by cross-region calls in AWS.
Make sure you have accurate request metrics. I don't see how you are measuring the timing in your code, but I do see how you're constructing your client.
Create an implementation of RequestHandler2 that implements afterError and afterResponse, and examines the details of request.getAWSRequestMetrics().
Add that request handler to the client via clientBuilder.withRequestHandlers(RequestHandler2... handlers).
This will give you accurate details of where the request is spending its time, may reveal some obvious problems, and may also point to a problem outside of the call to SQS. A sketch of such a handler follows after this list.
Make sure you are reusing your client (and not creating a fresh one each time) - consider logging each time your client is created. There's a lot of setup under the hood of the client, and if a fresh client is created every time, a lot of time might be wasted there.
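A minimal sketch of that metrics handler, assuming the 1.11.x SDK from your dependency (the class name and log wording are illustrative only):

import com.amazonaws.Request;
import com.amazonaws.Response;
import com.amazonaws.handlers.RequestHandler2;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

// Logs the SDK's own timing breakdown after every SQS call.
class TimingLoggingHandler extends RequestHandler2 {
    @Override
    public void afterResponse(Request<?> request, Response<?> response) {
        System.out.println("SQS call metrics: " + request.getAWSRequestMetrics().getTimingInfo());
    }

    @Override
    public void afterError(Request<?> request, Response<?> response, Exception e) {
        System.out.println("SQS call failed: " + e + ", metrics: " + request.getAWSRequestMetrics().getTimingInfo());
    }
}

// Attach it wherever the single, reused client is built:
private static final AmazonSQS client = AmazonSQSClientBuilder.standard()
        .withRequestHandlers(new TimingLoggingHandler())
        .build();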

Related

Unexpected backlog size in Pulsar

I'm using Pulsar for communication between services and I'm experiencing flakiness in a quite simple test of producers and consumers.
In a JUnit 4 test, I spin up (my own wrappers around) a ZooKeeper server, a BookKeeper bookie, and a PulsarService; the configurations should be quite standard.
The test can be summarized in the following steps:
build a producer;
build a consumer (say, a reader of a Pulsar topic);
check the message backlog (using precise backlog);
this is done by getting the current subscription via PulsarAdmin#topics#getStats#subscriptions
I expect it to be 0, as nothing has been sent on the topic, but sometimes it is 1 - this seems to be another problem, though...
build a new producer and synchronously send a message onto the topic;
build a new consumer and read the messages on the topic;
I expect a backlog of one message, and I actually read one
build a new producer and synchronously send four messages;
fetch again the messages, using the messageID read at step 5 as start message ID;
I expect a backlog of four messages here, and most of the time this value is correct, but when running the test about ten times I consistently get a 2 or a 5
I tried debugging the test, but I cannot figure out where those values come from; did I misunderstand something?
Things you can try if not already done:
Ask for precise backlog measurement. By default, it's only estimated as getting the precise measurement is a costlier operation. Use admin.topics().getStats(topic, true) for this. (See https://github.com/apache/pulsar/blob/724523f3051def9577d6bd27697866c99f4a7b0e/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L862)
Deactivate batching on the producer side. The number returned in msgBacklog is the number of entries so multiple messages batched in a single entry will count as 1. See relevant issue : https://github.com/apache/pulsar/issues/7623. It can explain why you see a value of 2 for the msgBacklog if the 4 messages have been put in the same batch. Beware that deactivating batching can have a huge impact on performance.
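For both points above, a minimal sketch using the Pulsar admin and client builders. The topic, subscription name and service URLs are placeholders, and the stats accessors follow the interface-based admin API that the linked Topics.java belongs to (they may be plain public fields on older client versions):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.common.policies.data.TopicStats;

public class BacklogCheck {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://public/default/my-topic"; // placeholder

        // 1. Ask for the precise backlog (second argument) instead of the estimate.
        PulsarAdmin admin = PulsarAdmin.builder().serviceHttpUrl("http://localhost:8080").build();
        TopicStats stats = admin.topics().getStats(topic, true);
        System.out.println("msgBacklog: " + stats.getSubscriptions().get("my-subscription").getMsgBacklog());

        // 2. Disable batching so every message becomes its own entry (mind the throughput cost).
        PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
        Producer<byte[]> producer = client.newProducer()
                .topic(topic)
                .enableBatching(false)
                .create();

        producer.close();
        client.close();
        admin.close();
    }
}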

Chronicle queue POC returned unexpected latency

One of our systems has a microservice architecture using Apache Kafka as a service bus.
Low latency is a very important factor but reliability and consistency (exactly once) are even more important.
When we performed some load tests we noticed significant performance degradation, and all investigations pointed to big increases in producer and consumer latencies on the Kafka topics. No matter how much configuration we changed or how many more resources we added, we could not get rid of the symptoms.
At the moment we need to process 10 transactions per second (TPS) and the load test exercises 20 TPS, but as the system evolves and gains more functionality we know we'll reach a stage where the need will be 500 TPS, so we started to worry about whether we can achieve this with Kafka.
As a proof of concept I tried to switch one of our microservices to use a chronicle-queue instead of a Kafka topic. It was easy to migrate by following the Avro example from the Chronicle-Queue-Demo GitHub repo:
public class MessageAppender {

    private static final String MESSAGES = "/tmp/messages";

    private final AvroHelper avroHelper;
    private final ExcerptAppender messageAppender;

    public MessageAppender() {
        avroHelper = new AvroHelper();
        messageAppender = SingleChronicleQueueBuilder.binary(MESSAGES).build().acquireAppender();
    }

    @SneakyThrows
    public long append(Message message) {
        try (var documentContext = messageAppender.writingDocument()) {
            var paymentRecord = avroHelper.getGenericRecord();
            paymentRecord.put("id", message.getId());
            paymentRecord.put("workflow", message.getWorkflow());
            paymentRecord.put("workflowStep", message.getWorkflowStep());
            paymentRecord.put("securityClaims", message.getSecurityClaims());
            paymentRecord.put("payload", message.getPayload());
            paymentRecord.put("headers", message.getHeaders());
            paymentRecord.put("status", message.getStatus());
            avroHelper.writeToOutputStream(paymentRecord, documentContext.wire().bytes().outputStream());
            return messageAppender.lastIndexAppended();
        }
    }
}
After configuring that appender we ran a loop to produce 100_000 messages to a chronicle queue. Every message has the same size and the final size of the file was 621 MB. It took 22 minutes, 20 seconds and 613 milliseconds (~1341 seconds) to write all the messages, an average of about 75 messages/second.
This was definitely not what we hoped for, and it is so far from the latencies advertised in the Chronicle documentation that I believe my approach is not the correct one. I admit that our messages are not small, at about 6.36 KB per message, but I have no doubt storing them in a database would be faster, so I still think I am not doing it right.
It is important that our messages are processed one by one.
Thank you in advance for your inputs and or suggestions.
Hand building the Avro object each time seems a bit of a code smell to me.
Can you create a predefined message -> avro serializer and use that to feed the queue?
Or, just for testing, create one avro object outside the loop and feed that one object into the queue many times. That way you can see if it is the building or the queuing which is the bottleneck.
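A rough sketch of that second experiment, written as a hypothetical extra method on the MessageAppender above so it can reuse the avroHelper and messageAppender fields; build the record once, then time only the appends:

// Hypothetical test method on MessageAppender: one pre-built record, many appends.
@SneakyThrows
public void appendSameRecord(int count) {
    var paymentRecord = avroHelper.getGenericRecord();
    // ... populate the fields once, before the timing loop ...
    long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        try (var documentContext = messageAppender.writingDocument()) {
            avroHelper.writeToOutputStream(paymentRecord, documentContext.wire().bytes().outputStream());
        }
    }
    System.out.println("Append-only time for " + count + " messages: "
            + (System.nanoTime() - start) / 1_000_000 + " ms");
}

If this loop is fast, building the record is the bottleneck; if it still lands around 75 messages/second, the time is going into the write path (the Avro encoding to the output stream and the ~6 KB message size) rather than into building the objects.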
More general advice:
Maybe attach a profiler and see if you are making an excessive number of object allocations, which is particularly bad if they are getting promoted to higher generations.
See if they are your objects or Chronicle Queue ones.
Is your code maxing out your RAM or CPU (or network)?

Is it good to cache AmazonSQS object

I am using the below jar for SQS
aws-java-sdk-core-1.11.397.jar
aws-java-sdk-sqs-1.11.397.jar
In my scenario I will be using the same SQS queue multiple times, getting the AmazonSQS object using AmazonSQSClientBuilder each time. I am wondering if we can cache this object to help improve performance.
Would caching really help? What would be the best approach to do it, and for how long can the object be cached?
Current scenario
I will be getting a message to post to SQS on a particular event, whose frequency might vary from no events to around 10,000 per hour. This is the reason why I am thinking of caching it.
You should most definitely reuse the AmazonSQS instance that is returned from the AmazonSQSClientBuilder. If you're posting thousands of messages, you shouldn't need to make any other calls to SQS other than sendMessage.
You could also call sendMessageBatch if you have a lot of messages to send. However, SQS is extremely scalable, and 10,000 messages per hour will not even make it sweat, so I don't think you'll have anything to worry about.
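A minimal sketch of that reuse and of batching, assuming the same 1.11.x SDK (the queue URL and class name are placeholders):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageBatchRequest;
import com.amazonaws.services.sqs.model.SendMessageBatchRequestEntry;

import java.util.List;

public class SqsSender {

    // Built once and reused for the lifetime of the application - this is the "cache".
    private static final AmazonSQS SQS = AmazonSQSClientBuilder.defaultClient();
    private static final String QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/<ACCT_NUMBER>/<QUEUE_NAME>";

    public void send(String body) {
        SQS.sendMessage(QUEUE_URL, body);
    }

    public void sendBatch(List<SendMessageBatchRequestEntry> entries) {
        // Up to 10 entries per batch call.
        SQS.sendMessageBatch(new SendMessageBatchRequest()
                .withQueueUrl(QUEUE_URL)
                .withEntries(entries));
    }
}

The client is thread-safe, so a single instance can be kept for as long as the application runs; there is no need to expire it.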

AWS Lambda Performance issues

I use AWS API Gateway integrated with AWS Lambda (Java), but I'm seeing some serious problems in this approach. The concept of removing the server and having your app scale out of the box is really nice, but here are the problems I'm facing. My Lambda is doing 2 simple things: validate the payload received from the client and then send it to a Kinesis stream for further processing by another Lambda (you will ask why I don't send directly to the stream and use only 1 Lambda for all the operations; let's just say that I want to separate the logic, have a layer of abstraction, and also be able to tell the client that he's sending invalid data).
In the implementation of the Lambda I integrated Spring DI. So far so good. I started doing performance testing: I simulated 50 concurrent users making 4 requests each, with 5 seconds between the requests. So what happened: on the Lambda's cold start I initialize Spring's application context, but it seems that having so many simultaneous requests when the Lambda was not started does some strange things. Here's a screenshot of the times the context took to initialize.
What we can see from the screenshot is that the times for initializing the context vary widely. My assumption of what is happening is that when so many requests are received and there's no "active" Lambda, it initializes a Lambda container for every one of them, and at the same time it "blocks" some of them (the ones with the big times of 18s) until the others that already started are ready. So maybe it has some internal limit on the containers it can start at the same time. The problem is that if you don't have equally distributed traffic, this will happen from time to time and some of the requests will time out. We don't want that to happen.
So the next thing was to do some tests without the Spring container, as my thought was "OK, the initialization is heavy, let's just use plain old Java object initialization". And unfortunately the same thing happened (it maybe just shaved the 3s of container initialization off some of the requests). Here is a more detailed screenshot of the test data:
So I logged the whole Lambda execution time (from construction to the end), the Kinesis client initialization and the actual sending of the data to the stream, as these are the heaviest operations in the Lambda. We still have these big times of 18s or so, but the interesting thing is that the times are somehow proportional. So if the whole Lambda takes 18s, around 7-8s is the client initialization, 6-7s is sending the data to the stream, and 4-5 seconds are left for the other operations in the Lambda, which for the moment is only validation. On the other hand, if we take one of the small times (which means it reuses an already started Lambda), e.g. 820ms, it takes 100ms for the Kinesis client initialization, 340ms for the data sending and 400ms for the validation. So this pushes me again towards the thought that internally it sleeps somewhere because of some limits. The next screenshot shows what happens on the next round of requests when the Lambda is already started:
So we don't have these big times; yes, we still have a relatively big delta in some of the requests (which to me is also strange), but things look much better.
So I'm looking for clarification from someone who knows what is actually happening under the hood, because this is not good behavior for a serious application that is using the cloud because of its "unlimited" possibilities.
And another question is related to another Lambda limit - 200 concurrent invocations across all Lambdas within an account in a region. For me this is also a big limitation for a big application with lots of traffic. My business case at the moment (I don't know about the future) is more or less fire-and-forget the request, so I'm starting to think of changing the logic so that the gateway sends the data directly to the stream and the other Lambda takes care of the validation and the further processing. Yes, I'm losing the current abstraction (which I don't need at the moment), but I'm increasing the application's availability many times over. What do you think?
The lambda execution time spikes to 18s because AWS launches new containers w/ your code to handle the incoming requests. The bootstrap time is ~18s.
Assigning more RAM can significantly improve the performance of your lambda function, because you have more RAM, CPU and networking throughput!
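One practical consequence of that container model: anything created outside the handler method (the Spring context, the Kinesis client) is paid for only on a cold start and reused by every warm invocation in the same container. A minimal sketch, assuming the v1 Kinesis SDK and the plain RequestHandler interface - the stream name, class name and validation step are illustrative only:

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class IngestHandler implements RequestHandler<String, String> {

    // Created once per container: only cold starts pay this cost.
    private static final AmazonKinesis KINESIS = AmazonKinesisClientBuilder.defaultClient();

    @Override
    public String handleRequest(String payload, Context context) {
        // validate(payload); // hypothetical validation step
        KINESIS.putRecord(new PutRecordRequest()
                .withStreamName("my-stream")          // placeholder stream name
                .withPartitionKey("shard-key")
                .withData(ByteBuffer.wrap(payload.getBytes(StandardCharsets.UTF_8))));
        return "OK";
    }
}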
And another question is related to another limit of the lambda-200 concurrent invocations in all lambdas within an account in a region.
You can ask AWS Support to increase that limit. I asked for an increase to 10,000 invocations/second and AWS Support did it quickly!
You can proxy straight to the Kinesis stream via API Gateway. You would lose some control in terms of validation and transformation, but you won't have the cold start latency that you're seeing from Lambda.
You can use the API Gateway mapping template to transform the data and if validation is important, you could potentially do that at the processing Lambda on the other side of the stream.

Right design in Akka - message delivery

I have gone through some posts on how and why Akka does not guarantee message delivery. The documentation, this discussion, and the other discussions in the group explain it well.
I am pretty new to Akka and wish to know the appropriate design for a case. For example, say I have 3 different actors, all on different machines: one is responsible for cookbooks, another for history, and the last for technology books.
I have a main actor on another machine. Suppose there is a query to the main-actor to search if we have some book available. The main actor sends requests to the 3 remote actors, and expects the result. So I do this:
val scatter = system.actorOf(
  Props[SearchActor].withRouter(ScatterGatherFirstCompletedRouter(
    routees = someRoutees, within = 10 seconds)), "router")

implicit val timeout = Timeout(10 seconds)
val futureResult = scatter ? Text("Concurrency in Practice")
// What should I do here?
//val result = Await.result(futureResult, timeout.duration)   line(a)
In short, I have sent requests to all 3 remote actors and expect the result in 10 seconds.
What should be the action?
Say I do not get the result in 10 seconds, should I send a new request to all of them again?
What if the within time above is premature? I do not know beforehand how much time it might take.
What if the within time was sufficient but the message got dropped?
If I don't get a response in the within time, I resend the request again. Something like this, so it remains asynchronous:
futureResult onComplete {
  case Success(i) => println("Result " + i)
  case Failure(e) => //send again
}
But under too many queries, won't there be too many threads on these calls, making it bulky? If I uncomment line (a), it becomes synchronous, and under load that might perform badly.
Say I don't get a response in 10 seconds. If the within time was premature, then a heavy computation is uselessly run again. If the message got dropped, then 10 seconds of valuable time were wasted. If I somehow knew that the message got delivered, I would probably wait for a longer duration without being skeptical.
How do people solve such issues? ACKs? But then I have to store the state of all queries in the actor. It must be a common problem, and I am looking for the right design.
I'm going to try and answer some of these questions for you. I'm not going to have concrete answers for everything, but hopefully I can guide you in the right direction.
For starters, you will need to make a change in how you are communicating the request to the 3 actors that do book searches. Using a ScatterGatherFirstCompletedRouter is probably not the correct approach here. This router will only wait for an answer from one of the routees (the first one to respond), so your set of results will be incomplete, as it will not contain results from the other 2 routees. There is also a BroadcastRouter, but that will not fit your needs either, as it only handles tell (!) and not ask (?). To do what you want to do, one option is to send the request to each recipient, getting Futures for the responses, and then combine them into an aggregate Future using Future.sequence. A simplified example could look like this:
case class SearchBooks(title: String)
case class Book(id: Long, title: String)

class BookSearcher extends Actor {
  def receive = {
    case req: SearchBooks =>
      val routees: List[ActorRef] = ... //Lookup routees here
      implicit val timeout = Timeout(10 seconds)
      implicit val ec = context.system.dispatcher
      val futures = routees.map(routee => (routee ? req).mapTo[List[Book]])
      val fut = Future.sequence(futures)
      val caller = sender //Important to not close over sender
      fut onComplete {
        case Success(books) => caller ! books.flatten
        case Failure(ex) => caller ! Status.Failure(ex)
      }
  }
}
Now that's not going to be our final code, but it's an approximation of what your sample was attempting to do. In this example, if any one of the downstream routees fails/times out, we will hit our Failure block, and the caller will also get a failure. If they all succeed, the caller will get the aggregate List of Book objects instead.
Now onto your questions. First, you ask if you should send a request to all of the actors again if you do not get an answer from one of the routees within the timeout. The answer to this question is really up to you. Would you allow your user on the other end to see a partial result (i.e. the results from 2 of the 3 actors), or does it always have to be the full set of results every time? If the answer is yes, you could tweak the code that is sending to the routees to look like this:
val futures = routees.map(routee => (routee ? req).mapTo[List[Book]].recover {
  case ex =>
    //probably log something here
    List()
})
With this code, if any of the routees times out or fails for any reason, an empty list of Book will be substituted in for the response instead of the failure. Now, if you can't live with partial results, then you could resend the entire request again, but you have to remember that there is probably someone on the other end waiting for their book results and they don't want to wait forever.
For your second question, you ask what happens if your timeout is premature. The timeout value you select is going to be completely up to you, but it most likely should be based on two factors. The first factor will come from testing the call times of the searches. Find out on average how long it takes and select a value based on that with a little cushion just to be safe. The second factor is how long someone on the other end is willing to wait for their results. You could just be very conservative in your timeout, making it like 60 seconds just to be safe, but if there is indeed someone on the other end waiting for results, how long are they willing to wait? I'd rather get a failure response indicating that I should try again instead of waiting forever. So taking those two factors into account, you should select a value that will allow you to get responses a very high percentage of the time while still not making the caller on the other end wait too long.
For question 3, you ask what happens if the message gets dropped. In this case I'm guessing that the future for whoever was to receive that message will just time out, because the recipient actor will never receive a message to respond to. Akka is not JMS; it doesn't have acknowledgement modes where a message can be resent a number of times if the recipient does not receive and ack the message.
Also, as you can see from my example, I agree with not blocking on the aggregate Future by using Await. I prefer using the non-blocking callbacks. Blocking in a receive function is not ideal as that Actor instance will stop processing its mailbox until that blocking operation completes. By using a non-blocking callback, you free that instance up to go back to processing its mailbox and allow the handling of the result to be just another job that is executed in the ExecutionContext, decoupled from the actor processing its mailbox.
Now if you really want to not waste communications when the network is not reliable, you could look into the Reliable Proxy available in Akka 2.2. If you don't want to go this route, you could roll it yourself by sending ping-type messages to the routees periodically. If one does not respond in time, you mark it as down and do not send messages to it until you can get a reliable (in a very short amount of time) ping from it, sort of like an FSM per routee. Either of these can work, but you need to remember that these solutions add complexity and should only be employed if you absolutely need this behavior. If you're developing bank software and you absolutely need guaranteed delivery semantics, as bad financial implications will result otherwise, by all means go with this kind of approach. Just be judicious in deciding if you need something like this, because I bet 90% of the time you don't. In your model, the only person probably affected by waiting on something that you might have already known won't be successful is the caller on the other end. By using non-blocking callbacks in the actor, it's not being halted by the fact that something might take a long time; it's already moved on to its next message. You also need to be careful if you decide to resubmit on failure: you don't want to flood the receiving actors' mailboxes. If you decide to resend, cap it at a fixed number of times.
One other possible approach if you need these guaranteed kind of semantics might be to look into Akka's Clustering Model. If you clustered the downstream routees, and one of the servers was failing, then all traffic would be routed to the node that was still up until that other node recovered.
