Achieving consistent response times in GAE? - java

When running load tests against my app, I see very consistent response times. Once there is a constant level of load on GAE, the mean response times get smaller and smaller. But I want the same consistency on other apps that receive far fewer requests per second. Those never need to support more than ~3 requests/second.
Reading the docs makes me think that increasing the number of minimum idle instances should result in more consistent response times. But even then, clients will still see higher response times every time GAE's scheduler decides that more instances are required. I am looking for a setup where users do not see those initial slow requests.
When I increase the number of minimum idle instances to 1, I want GAE to use the one resident instance only. As load increases, it should bring up and warm up new (dynamic) instances. Only once they are warmed up, GAE should send requests to them. But judging from the response times it seems as if client requests arrive in dynamic instances as they are brought up. As a result, those requests take a long time (up to 30 seconds).
Could this happen if my warmup code is incomplete?
Could the first calls on the dynamic instances be so slow because they involve code paths that have not been warmed up yet?
I do not experience this problem during load tests or when enough people are using the app. But my testing environments are practically unusable by clients when nobody is using the app yet, e.g. in the morning.
Thanks!

Some generic thoughts:
A 30-second startup time for instances seems like a lot. We do a lot of initialization (including database hits), and we have around 5 seconds of overhead.
Warmup requests aren't guaranteed. If all instances are busy and the scheduler believes that the request will be answered faster by starting a new instance than by queuing it on a busy one, it will do so without spending time on a warmup request first.
I don't think this is an issue of a cold code path (though I don't know Java's HotSpot in detail); it's probably the (mem-)cache, which needs to fill first.
I don't know what you meant by "incomplete warmup code"; just check your logs for requests to /_ah/warmup. If there are any, warmup requests are enabled and working.
Increasing the number of idle instances beyond the 1-instance mark probably won't help here.
Sadly, there aren't any generic tricks to avoid that, but you could try to
defer initialization-code (doing only the absolute required minimum of instance-startup overhead)
start a backend keeping the (mem-) cache hot
If you don't mind the costs (and don't need automatic scaling for your low-volume application), you could even have all requests served by always-on backends
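The "defer initialization code" suggestion could look roughly like the sketch below. It is plain Java rather than anything App Engine specific, and the Lazy class is a hypothetical helper name, not part of any library. The idea is that the expensive work runs on first use instead of during instance startup, so a freshly started instance can begin answering requests sooner.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper: runs an expensive initializer on first use
 * instead of at instance startup. Thread-safe via double-checked locking.
 * Assumption: the initializer never returns null (null means "not yet initialized").
 */
final class Lazy<T> {
    private final Supplier<T> initializer;
    private volatile T value; // stays null until the first get()

    Lazy(Supplier<T> initializer) {
        this.initializer = initializer;
    }

    T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    // Only the first caller pays the initialization cost.
                    result = initializer.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

The trade-off is that the first request touching the lazy value pays the cost instead, so this only helps if most requests do not need everything that was previously initialized up front.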

Timing a remote call in a multithreaded java program

I am writing a stress test that will issue many calls to a remote server. I want to collect the following statistics after the test:
Latency (in milliseconds) of the remote call.
Number of operations per second that the remote server can handle.
I can successfully get (2), but I am having problems with (1). My current implementation is very similar to the one shown in this other SO question. And I have the same problem described in that question: latency reported by using System.currentTimeMillis() is longer than expected when the test is run with multiple threads.
I analyzed the problem and I am positive the problem comes from the thread interleaving (see my answer to the other question that I linked above for details), and that System.currentTimeMillis() is not the way to solve this problem.
It seems that I should be able to do it using java.lang.management, which has some interesting methods like:
ThreadMXBean.getCurrentThreadCpuTime()
ThreadMXBean.getCurrentThreadUserTime()
ThreadInfo.getWaitedTime()
ThreadInfo.getBlockedTime()
My problem is that even though I have read the API, it is still unclear to me which of these methods will give me what I want. In the context of the other SO question that I linked, this is what I need:
long start_time = rightMethodToCall();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (rightMethodToCall() - start_time);
So that the difference gives me a very good approximation of the time that the remote call took, even in a multi-threaded environment.
Restriction: I'd like to avoid protecting that block of code with a synchronized block because my program has other threads that I would like to allow to continue executing.
EDIT: Providing more info:
The issue is this: I want to time the remote call, and just the remote call. If I use System.currentTimeMillis or System.nanoTime(), AND if I have more threads than cores, then it is possible that I could have this thread interleaving:
Thread1: long start_time ...
Thread1: result = ...
Thread2: long start_time ...
Thread2: result = ...
Thread2: long difference ...
Thread1: long difference ...
If that happens, then the difference calculated by Thread2 is correct, but the one calculated by Thread1 is incorrect (it would be greater than it should be). In other words, for the measurement of the difference in Thread1, I would like to exclude the time of lines 4 and 5. Is this time that the thread was WAITING?
Summarizing the question in a different way in case it helps other people understand it better (this quote is how @jason-c put it in his comment):
[I am] attempting to time the remote call, but running the test with multiple threads just to increase testing volume.
Use System.nanoTime() (but see updates at end of this answer).
You definitely don't want to use the current thread's CPU or user time, as user-perceived latency is wall clock time, not thread CPU time. You also don't want to use the current thread's blocking or waiting time, as it measures per-thread contention times which also doesn't accurately represent what you are trying to measure.
System.nanoTime() reads a high-resolution clock with a fixed (but arbitrary) reference point and will measure exactly what you are trying to measure. Its results are relatively accurate: although its resolution is technically only guaranteed to be at least as good as that of currentTimeMillis(), in practice it tends to be much better, since it is generally implemented with hardware clocks or other performance timers, e.g. QueryPerformanceCounter on Windows or clock_gettime on Linux.
long start_time = System.nanoTime();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (System.nanoTime() - start_time);
long milliseconds = difference / 1000000;
System.nanoTime() does have its own set of issues, but be careful not to get whipped up in paranoia; for most applications it is more than adequate. You just wouldn't want to use it for, say, precise timing when sending audio samples to hardware or something (which you wouldn't do directly in Java anyways).
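If the same measurement is needed in several places, the nanoTime() pattern above can be wrapped in a small helper. The following is only a sketch; CallTimer and Timed are hypothetical names introduced here, and the helper takes a Supplier so it can wrap any call that does not throw checked exceptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper that measures the wall-clock duration of one call. */
final class CallTimer {

    /** The value returned by the call plus the elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        long elapsedMillis() {
            return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        }
    }

    /** Runs the call on the current thread and times it with System.nanoTime(). */
    static <T> Timed<T> time(Supplier<T> call) {
        long start = System.nanoTime();
        T value = call.get();
        long elapsed = System.nanoTime() - start;
        return new Timed<>(value, elapsed);
    }
}
```

With the code from the question, usage would be something like CallTimer.time(() -> restTemplate.getForObject("Some URL", String.class)), with each worker thread collecting its own Timed results.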
Update 1:
More importantly, how do you know the measured values are longer than expected? If your measurements are showing true wall clock time, and some threads are taking longer than others, that is still an accurate representation of user-perceived latency, as some users will experience those longer delay times.
Update 2 (based on clarification in comments):
Much of my above answer is still valid then; but for different reasons.
Using per-thread time does not give you an accurate representation because a thread could be idle/inactive while the remote request is still processing, and you would therefore exclude that time from your measurement even though it is part of perceived latency.
Further inaccuracies are introduced by the remote server taking longer to process the simultaneous requests you are making - this is an extra variable that you are adding (although it may be acceptable as representative of the remote server being busy).
Wall time is also not completely accurate because, as you have seen, variances in local thread overhead may add extra latency that isn't typically present in single-request client applications (although this still may be acceptable as representative of a client application that is multi-threaded, but it is a variable you cannot control).
Of those two, wall time will still get you closer to the actual result than per-thread time, which is why I left the previous answer above. You have a few options:
You could do your tests on a single thread, serially -- this is ultimately the most accurate way to achieve your stated requirements.
You could avoid creating more threads than cores, e.g. by using a fixed-size thread pool with each thread's affinity bound to a core (tricky: Java thread affinity) and running the measurements as tasks on that pool. This may reduce the risk of interleaving (especially if you set the affinities), but it still adds variables due to synchronization of underlying mechanisms that are beyond your control, and you still do not have full control over e.g. other threads the JVM is running or other unrelated processes on the system.
You could measure the request handling time on the remote server; of course this does not take network latency into account.
You could continue using your current approach and do some statistical analysis on the results to remove outliers.
You could skip measuring this entirely, simply do user tests, and wait for a comment on it before attempting to optimize it (i.e. measure it with people, who are what you're developing for anyways). If the only reason to optimize this is for UX, it could very well be the case that users have a pleasant experience and the wait time is totally acceptable.
Also, none of this makes any guarantees that other unrelated threads on the system won't be affecting your timings, but that is why it is important to both a) run your test multiple times and average (obviously) and b) set an acceptable requirement for timing errors that you are OK with (do you really need to know this to e.g. 0.1ms accuracy?).
Personally, I would either do the first, single-threaded approach and let it run overnight or over a weekend, or use your existing approach and remove outliers from the result and accept a margin of error in the timings. Your goal is to find a realistic estimate within a satisfactory margin of error. You will also want to consider what you are going to ultimately do with this information when deciding what is acceptable.
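For the "remove outliers" option, one simple possibility (an assumption on my part, not the only reasonable choice) is to report a percentile and a trimmed mean instead of a plain average. LatencyStats and its method names are hypothetical; the samples are assumed to be latencies in any consistent unit.

```java
import java.util.Arrays;

/** Hypothetical helper for summarizing latency samples while ignoring outliers. */
final class LatencyStats {

    /** Nearest-rank percentile of the samples, for p in (0, 100]. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** Mean after discarding trimFraction of the samples from each end (0 <= trimFraction < 0.5). */
    static double trimmedMean(long[] samples, double trimFraction) {
        if (samples.length == 0 || trimFraction < 0 || trimFraction >= 0.5) {
            throw new IllegalArgumentException("need samples and 0 <= trimFraction < 0.5");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int drop = (int) Math.floor(trimFraction * sorted.length);
        long sum = 0;
        for (int i = drop; i < sorted.length - drop; i++) {
            sum += sorted[i];
        }
        return (double) sum / (sorted.length - 2 * drop);
    }
}
```

Which trim fraction or percentile counts as acceptable depends on the margin of error you decided on beforehand.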

How can I reduce Google App Engine datastore latency?

Through appstats, I can see that my datastore queries are taking about 125ms (api and cpu combined), but often there are long latencies (e.g. up to 12000ms) before the queries are executed.
I can see that my latency from the datastore is not related to my query (e.g. the same query/data has vastly different latencies), so I'm assuming that it's a scheduling issue with app engine.
Are other people seeing this same problem?
Is there some way to reduce the latency (e.g. an admin console setting)?
Here's a screen shot from appstats. This servlet has very little cpu processing. It does a getObjectByID and then does a datastore query. The query has an OR operator so it's being converted into 3 queries by app engine.
As you can see, it takes 6000ms before the first getObjectByID is even executed. There is no processing before the get operation (other than getting pm). I thought this 6000ms latency might be due to an instance warm-up, so I had increased my idle instances to 2 to prevent any warm-ups.
Then there's a second latency of around 1000ms between the getObjectByID and the query. There are zero lines of code between the get and the query. The code simply takes the result of the getObjectByID and uses the data as part of the query.
The grand total is 8097ms, yet my datastore operations (and 99.99% of the servlet) are only 514ms (45ms api), though the numbers change every time I run the servlet. Here is another appstats screenshot that was run on the same servlet against the same data.
Here are the basics of my Java code. I had to remove some of the details for security purposes.
user = pm.getObjectById(User.class, userKey);
//build queryBuilder.append(...
final Query query = pm.newQuery(UserAccount.class,queryBuilder.toString());
query.setOrdering("rating descending");
query.executeWithArray(args);
Edited:
Using Pingdom, I can see that GAE latency varies from 450ms to 7,399ms, so the worst case is about 16 times the best case (1,644%)! This is with two idle instances and no users on the site.
I observed very similar latencies (in the 7000-10000ms range) in some of my apps. I don't think the bulk of the issue (those 6000ms) lies in your code.
In my observations, the issue is related to App Engine spinning up a new instance. Setting min idle instances may help mitigate it but will not solve it (I tried up to 2 idle instances), because even if you have N idle instances, App Engine will prefer spinning up dynamic ones even when a single request comes in, and will "save" the idle ones in case of crazy traffic spikes. This is highly counter-intuitive because you'd expect it to use the instances that are already around and spin up dynamic ones for future requests.
Anyway, in my experience this issue (10000ms latency) very rarely happens under any non-zero amount of load, and many people have had to resort to some kind of pinging (possibly cron jobs) every couple of minutes to keep dynamic instances around to serve users who hit the site when no one else is on it. (This used to work with a ping every 5 minutes, but lately instances are dying faster, so it's more like a ping every 2 minutes.) This pinging is not ideal because it eats away at your free quota (pinging every 5 minutes eats away more than half of it), but I really haven't found a better alternative so far.
To recap, in general I found that App Engine is awesome when under load, but not outstanding when you have very few (1-3) users on the site.
Appstats only helps diagnose performance issues when you make GAE API/RPC calls.
In the case of your diagram, the "blank" time is spent running your code on your instance. It's not going to be scheduling time.
Your guess that the initial delay may be because of instance warm-up is highly likely. It may be framework code that is executing.
I can't guess at the delay between the Get and Query. It may be that there's 0 lines of code, but you called some function in the Query that takes time to process.
Without knowledge of the language, framework or the actual code, no one will be able to help you.
You'll need to add some sort of performance tracing on your own in order to diagnose this. The simplest (but not highly accurate) way to do this is to add timers and log timer values as your code executes.
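The "add timers and log timer values" approach could be sketched like this. PhaseTimer is a hypothetical helper, and the phase names used in the usage note are only examples; it relies on System.nanoTime() and java.util.logging, both part of the standard library.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

/** Hypothetical helper: records how long each named phase of a request took. */
final class PhaseTimer {
    private static final Logger LOG = Logger.getLogger(PhaseTimer.class.getName());

    // LinkedHashMap keeps phases in the order they were first marked.
    private final Map<String, Long> phaseNanos = new LinkedHashMap<>();
    private long lastMark = System.nanoTime();

    /** Attributes the time since the previous mark (or construction) to the given phase. */
    void mark(String phase) {
        long now = System.nanoTime();
        phaseNanos.merge(phase, now - lastMark, Long::sum);
        lastMark = now;
    }

    /** Read-only view of the recorded durations in nanoseconds. */
    Map<String, Long> phases() {
        return Collections.unmodifiableMap(phaseNanos);
    }

    /** Writes one log line per phase, in milliseconds. */
    void log() {
        phaseNanos.forEach((phase, nanos) ->
                LOG.info(phase + " took " + nanos / 1_000_000 + " ms"));
    }
}
```

In the servlet from the question, one would create a PhaseTimer at the top of the request, call mark("getObjectById") after the get, mark("query") after the query, and log() at the end. The "blank" time in appstats should then show up under whichever phase contains it.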

App Engine app performance test

I have used JMeter to test my App Engine app's performance.
I have created a thread group of
500 users,
ramp up period: 0 seconds
and loop to 1
and ran the test.
It created 4 instances in App Engine. But the interesting thing is that more than 450 requests were processed by a single instance.
I ran the test again with these instances up, and still most of the requests (> 90%) went to the same instance.
Instance type: F1 Class
Max Idle Instances: ( Automatic )
Min Pending Latency: ( Automatic )
I'm getting much higher latency.
What's going wrong here?
Is generating the load from a single IP a problem?
Your problem is you are not using a realistic ramp up value. AppEngine, like most auto-scaling solutions, requires a reasonable amount of time to spin up new hardware. During this process while it is creating the new instances latency can increase if there was a large and sudden increase in traffic.
Choose a ramp up value that is representative of the sort of spikes / surges you realistically expect to see on Production and then run the test. Use the values from this test to decide how many appEngine instances you would like to be 'always on', the higher this value the lower any impact from a surge but obviously the higher your costs.
When you say "I'm getting much higher latency" what exactly are you getting? Do you consider it to be too slow?
If latency is an issue then you can reduce the max pending latency in the application settings. If you try this I imagine you will see your requests spread across the instances more.
My guess is simply that the 2-3 idle instances have spun up in anticipation of increased load but are actually not needed for your test.
It was totally App Engine's issue...
See this issue reported at App Engine's issue tracker.
Spread your requests into different thread groups, and the instances will be utilised. I'm not sure why this happens. I wasn't able to find any definitive information that explains this.
(I wonder if maybe App Engine sees the requests from a single thread group as requests originating from a common origin, so it places all of the utilised resources in the same instance, so that the output can be most efficiently passed back to the originator of the requests.)

Impact of multiple cores on a page that takes 5ms to process

My spring app's page takes 5 ms to render, and using ab I was getting around 200 requests per second.
I tested on a VM that was a single core.
Now this page simply takes an xml file and parses it and initializes an object, and then inserts the object into mysql.
Now assuming mysql isn't blocking (connection pool is not large enough, or table locks), adding another core should double my requests per second, correct?
If I get 200 requests per second with a single thread hitting tomcat, I should keep doubling my rps as I increase threads, correct? (Up to some point, obviously.)
What is usually the bottleneck? mysql seems to be able to handle 3-4K inserts per second on a very simple servlet app.
Changing from single to dual core should work as you expect. If more cores and higher load come into play, you will probably find that thread scheduling isn't fair to all threads.
Since more memory is allocated and released in the same amount of time, garbage collection could also become an issue.
As long as requests don't arrive faster than they can be processed, you're fine: 200 requests * 5ms is 1000ms of processing per second (you should have around 300ms of spare time).
But scheduling, and the switch from user to kernel mode on each system call (mostly I/O from the DB), adds to the time required to process a request, which could lead to bottlenecks.
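The arithmetic behind "200 * 5ms" can be written down as a back-of-the-envelope model. This sketch assumes every request needs a fixed amount of CPU time and that nothing else (database, locks, GC, scheduling) gets in the way, which is exactly what the caveats above warn about, so treat the results as upper bounds. Capacity and its method names are hypothetical.

```java
/** Hypothetical back-of-the-envelope model: CPU-bound requests, no other bottleneck. */
final class Capacity {

    /** Upper bound on requests per second if each request needs serviceMillis of CPU time. */
    static double maxRequestsPerSecond(int cores, double serviceMillis) {
        return cores * 1000.0 / serviceMillis;
    }

    /** Fraction of total CPU capacity used at the given arrival rate (1.0 means saturated). */
    static double utilization(double requestsPerSecond, double serviceMillis, int cores) {
        return requestsPerSecond * serviceMillis / (1000.0 * cores);
    }
}
```

Under these assumptions, a 5ms page tops out at 200 requests per second on one core and 400 on two, and 200 requests per second keep a single core fully busy while using half of a dual core.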

Thread.sleep and BufferedReader.readLine use the most cpu cycles in my java tcp server. Why?

Good evening,
I'm developing a java tcp server for communication between clients.
At this point I'm load testing the developed server.
This morning I got my hands on a profiler (YourKit) and started looking for problem spots in my server.
I now have 480 clients sending messages to the server every 500 msec. The server forwards every received message to 6 clients.
The server is now using about 8% of my cpu, when being on constant load.
My question is about the Java functions that use the most cpu cycles.
The Java function that uses the most cpu cycles is, strangely, "Thread.sleep", followed by "BufferedReader.readLine".
Both of these functions seem to block the current thread while waiting for something (sleep waits for a few msec, readLine waits for data).
Can somebody explain why these 2 functions take up that many cpu cycles? I was also wondering if there are alternative approaches that use fewer cpu cycles.
Kind regards,
T. Akhayo
sleep() and readLine() can use a lot of cpu as they both result in system calls which can context switch. It is also possible that the timing for these methods is not accurate for this reason (it may be an overestimate).
A way to reduce the overhead of context switches/sleep() is to have fewer threads and avoid needing to use sleep (e.g. use a ScheduledExecutorService). The readLine() overhead can be reduced by using NIO, but that is likely to add some complexity.
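As a rough illustration of the ScheduledExecutorService suggestion, the sketch below runs a periodic task on a single scheduler thread instead of a dedicated thread that loops over sleep(). PeriodicTasks and runPeriodically are hypothetical names, and the fixed number of runs exists only to keep the example finite; a server would keep the scheduler alive and could register many periodic tasks on the same small pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of periodic work without a hand-written sleep loop. */
final class PeriodicTasks {

    /**
     * Runs the task `runs` times at a fixed rate on one scheduler thread and
     * returns how many runs completed before the timeout expired.
     */
    static int runPeriodically(Runnable task, int runs, long periodMillis, long timeoutMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger completed = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(runs);
        try {
            scheduler.scheduleAtFixedRate(() -> {
                // Single scheduler thread, so runs never overlap.
                if (completed.get() < runs) {
                    task.run();
                    completed.incrementAndGet();
                    done.countDown();
                }
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            scheduler.shutdownNow();
        }
        return completed.get();
    }
}
```

The point of the design is that the waiting happens inside the executor's own queue, so the number of threads no longer grows with the number of periodic jobs.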
Sleeping shouldn't be an issue, unless you're having a bunch of threads sleep for short periods of time (100-150ms is 'short' when you have 480 threads running a loop that just sleeps and does something trivial).
The readLine call should be using next to nothing when it's not actually reading something, except when you first call it. But like you said, it blocks, and it shouldn't be using a noticeable amount of CPU unless you have small windows where it blocks. CPU usage isn't that much unless you're reading tons of data, or initially calling the method.
So, your loops are too tight, and you're receiving too many messages too quickly, which is ultimately causing 'tons' of context switching, and processing. I'd suggest using a NIO framework (like Netty) if you're not comfortable enough with NIO to use it on your own.
Also, 8% CPU isn't that much for 480 clients that send 2 messages per second.
Here is a program in which sleep uses almost 100% of the cpu cycles given to the application:
for (i = 0; i < bigNumber; i++){
sleep(someTime);
}
Why? Because it doesn't use very many actual cpu cycles at all,
and of the ones it does use, nearly all of them are spent entering and leaving sleep.
Does that mean it's a real problem? Of course not.
That's the problem with profilers that only look at CPU time.
You need a sampler that samples on wall-clock time, not CPU time.
It should sample the stack, not just the program counter.
It should show you by line of code (not by function) the fraction of stack samples containing that line.
The usual objection to sampling on wall-clock time is that the measurements will be inaccurate due to sharing the machine with other processes.
But that doesn't matter, because to find time drains does not require precision of measurement.
It requires precision of location.
What you are looking for is precise code locations, and call sites, that are on the stack a healthy fraction of actual time, as determined by stack sampling that's uncorrelated with the state of the program.
Competition with other processes does not change the fraction of time that call sites are on the stack by a large enough amount to result in missing the problems.
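A very small version of such a wall-clock stack sampler can be put together with Thread.getStackTrace(). The sketch below is only meant to show the idea: StackSampler, sample and idleLoop are hypothetical names, it watches a single thread, and it samples at a fixed interval for simplicity, whereas a real sampler would want sample times that are uncorrelated with the program's behavior.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical wall-clock sampler: counts how often each code location is on a thread's stack. */
final class StackSampler {

    /**
     * Takes `samples` snapshots of the target's stack, `intervalMillis` apart in
     * wall-clock time. Returns, for each "class.method:line", the number of
     * snapshots that contained it (a location is counted once per snapshot).
     */
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            Set<String> seen = new HashSet<>();
            for (StackTraceElement frame : target.getStackTrace()) {
                seen.add(frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber());
            }
            for (String location : seen) {
                hits.merge(location, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop sampling if the sampler itself is interrupted
            }
        }
        return hits;
    }

    /** Demo workload: does nothing but sleep until interrupted. */
    static void idleLoop() {
        try {
            while (true) {
                Thread.sleep(1);
            }
        } catch (InterruptedException e) {
            // interrupted: leave the loop and let the thread end
        }
    }
}
```

Dividing each count by the number of samples gives the fraction of wall-clock time that location was on the stack. For the idleLoop workload, the line calling sleep shows up in nearly every sample even though it burns almost no CPU.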
