I have used JMeter to test my App Engine app's performance.
I created a thread group with 500 users, a ramp-up period of 0 seconds, and a loop count of 1, then ran the test.
It created 4 instances in App Engine. The interesting thing is that more than 450 of the requests were processed by a single instance.
I ran the test again with these instances still up, and most of the requests (> 90%) still went to the same instance.
Instance type: F1 Class
Max Idle Instances: ( Automatic )
Min Pending Latency: ( Automatic )
I'm also seeing much higher latency.
What's going wrong here?
Is generating the load from a single IP a problem?
Your problem is that you are not using a realistic ramp-up value. App Engine, like most auto-scaling solutions, needs a reasonable amount of time to spin up new instances. While it is creating them, latency can increase if there was a large and sudden increase in traffic.
Choose a ramp-up value that is representative of the sort of spikes / surges you realistically expect to see in production and then run the test. Use the values from this test to decide how many App Engine instances you would like to be 'always on'; the higher this value, the lower the impact of a surge, but obviously the higher your costs.
When you say "I'm getting much higher latency" what exactly are you getting? Do you consider it to be too slow?
If latency is an issue then you can reduce the max pending latency in the application settings. If you try this I imagine you will see your requests spread across the instances more.
My guess is simply that the 2-3 idle instances have spun up in anticipation of increased load but are actually not needed for your test.
It was entirely an App Engine issue...
see this issue reported on App Engine's issue tracker
Spread your requests into different thread groups, and the instances will be utilised. I'm not sure why this happens; I wasn't able to find any definitive information that explains it.
(I wonder if maybe App Engine sees the requests from a single thread group as requests originating from a common origin, so it places all of the utilised resources in the same instance, so that the output can be most efficiently passed back to the originator of the requests.)
Related
This is my first post. I am a database administrator looking to use Play Framework version 2.4. I have read the Play 2 documentation and still have a few questions, since I am very new to it. I have a messaging system that will need to handle loads of up to 50,000 blocking threads per second. If I am correct, the maximum number of threads available in Play is:
parallelism-factor * availableProcessors
where the parallelism factor is the number of threads that can be used per core. I have seen that most examples set this number to 1.0; what is wrong with going for 100 or so? I currently have the parallelism factor set to 10.0 and I have 150 CPU cores, so that means I have a maximum of 1,500 threads available. If that is the case and I have to process up to 50,000 blocking requests per second, then the system would be very slow, right? So the only way to scale would be to add more cores, since all the requests are blocking?
'50,000 blocking requests per second' doesn't necessarily mean that you need 50,000 threads to handle them. You don't need a thread for each database call.
To do a very simple calculation: say each database call takes 0.1 seconds (an arbitrary number, since I have no clue how long your calls actually take) and each of those 50,000 requests leads to a single, blocking database call. Then your system has about 5,000 database calls in flight at any given moment (50,000 per second * 0.1 s each), so it needs roughly 5,000 threads. If you have 10 CPUs you'd need 500 threads per CPU to handle them, or if you have 250 CPUs you'd need 20 threads per CPU. But this is only under ideal circumstances where those requests don't actually do anything other than block and wait.
Play uses Akka for its thread management and concurrency. The advantage is that you don't have to deal with the hassles of concurrency in your application code any more.
Akka calculates the maximum number of threads as available CPUs * parallelism-factor. Additionally you can bound this with parallelism-min and parallelism-max. So if you have 250 CPUs and your parallelism factor is 20, you have at most 5,000 threads available at once, which may or may not be enough to handle your requests.
So to come back to your question: it's difficult to say. It depends on how long your database calls take and how heavily you use your CPUs for other calculations. I think there is no other way but to try it out and do some performance measuring. But in general it's better to have fewer threads, since creating a thread takes a lot of resources. I'd guess a parallelism factor of 20 is a good starting point in your case with 250 CPUs.
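To make that sizing concrete, here is a minimal, hedged Java sketch (not a Play-specific implementation; the parallelism factor of 20 and the blockingDbCall stand-in are assumptions). It dispatches blocking work onto a bounded pool sized as availableProcessors * factor, which is the same arithmetic Akka applies:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingPoolSizing {

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int parallelismFactor = 20;                 // assumed value, as discussed above
        int maxThreads = cores * parallelismFactor; // Akka-style: availableProcessors * factor

        // Bounded pool dedicated to blocking work (e.g. JDBC calls).
        ExecutorService blockingPool = Executors.newFixedThreadPool(maxThreads);

        // Submit blocking work without tying up the request threads.
        CompletableFuture<String> result =
                CompletableFuture.supplyAsync(BlockingPoolSizing::blockingDbCall, blockingPool);

        System.out.println(result.join());
        blockingPool.shutdown();
    }

    // Hypothetical stand-in for a blocking database call taking ~100 ms.
    private static String blockingDbCall() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "row";
    }
}

In Play itself you would express the same bound through the dispatcher configuration (parallelism-factor, parallelism-min, parallelism-max) rather than creating the pool by hand.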
I also found this akka-concurrency-test documentation, which itself has a good list of sources.
Play was created for asynchronous programming. That is why the default parallelism factor is 1.0; it is optimal when you run a lot of small, fast, non-blocking operations.
The question is what you mean by "50,000 blocking threads per second". Blocking can mean different things; the most widespread example is access to an RDBMS. I am pretty sure that in that case your system can handle 50,000 blocking DB accesses like a charm.
The example in the Play thread pools documentation says that it is fine to set the parallelism factor to 300 for an application that uses blocking database calls, so 150 * 300 = 45,000, which is almost your number.
I have a long-running task in my App Engine application with a lot of datastore entries to compute. It worked well with a small amount of data, but since yesterday I am suddenly getting more than a million datastore entries to compute per day. After running for a while (around 2 minutes), the task fails with a 202 exit code (HTTP error 500). I really cannot get past this issue, and it is pretty much undocumented. The only information I was able to find is that it probably means my app is running out of memory.
The task is simple. Each entry in the datastore contains a non-unique string identifier and a long number. The task sums the numbers and stores the identifiers into a set.
My budget is really low since my app is entirely free and without ads, so I would like to keep the app's cost from soaring. I am looking for a cheap and simple solution to this issue.
Edit:
I read the Objectify documentation thoroughly tonight and found that the session cache (which ensures entity reference consistency) can consume a lot of memory and should be cleared regularly when performing a lot of requests (which is my case). Unfortunately, this didn't help.
It's possible to stay within the free quota but it will require a little extra work.
In your case you should split this operation into smaller batches (e.g. process 1,000 entities per batch) and queue those smaller tasks to run sequentially during off hours, as in the sketch below. That should save you from the memory issue and allow you to scale beyond your current entity count.
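A minimal sketch of that batching approach, assuming Objectify is already set up; the Entry entity, its accessors, and the /tasks/sum URL are illustrative names, not taken from the question:

import static com.googlecode.objectify.ObjectifyService.ofy;

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.QueryResultIterator;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;
import com.googlecode.objectify.cmd.Query;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical task handler mapped to /tasks/sum: it processes ~1,000 entities
// per invocation and re-enqueues itself with a datastore cursor until done.
public class SumTaskServlet extends HttpServlet {

    private static final int BATCH_SIZE = 1000;

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        Query<Entry> query = ofy().load().type(Entry.class).limit(BATCH_SIZE);
        String cursorParam = req.getParameter("cursor");
        if (cursorParam != null) {
            query = query.startAt(Cursor.fromWebSafeString(cursorParam));
        }

        long partialSum = 0;
        Set<String> identifiers = new HashSet<>();
        QueryResultIterator<Entry> it = query.iterator();
        while (it.hasNext()) {
            Entry entry = it.next();
            partialSum += entry.getNumber();        // hypothetical accessor
            identifiers.add(entry.getIdentifier()); // hypothetical accessor
        }
        ofy().clear(); // drop the session cache so this batch can be garbage collected

        // Persist the partial sum / identifier set somewhere cheap (e.g. a summary entity); omitted here.

        if (!identifiers.isEmpty()) {
            // Re-enqueue the next batch; the queue's rate can be throttled to run during off hours.
            Queue queue = QueueFactory.getDefaultQueue();
            queue.add(TaskOptions.Builder
                    .withUrl("/tasks/sum")
                    .param("cursor", it.getCursor().toWebSafeString())
                    .method(TaskOptions.Method.POST));
        }
    }
}

Entry here is only a placeholder: the requirement is simply that it is a registered Objectify entity exposing the number and identifier fields the question describes.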
We need to load test our servers and our goal is to simulate 100K concurrent users.
I have created a JUnit script that receives a NUM_OF_USERS parameter and runs against our servers.
The problem is that we need a large number of users (100K), and a single PC running this test can probably handle only about 1,000 of them.
How can we perform this task? Are there any tools for it?
P.S. It would be really good if we could run this JUnit test from multiple PCs rather than using a tool that needs to be configured with the relevant parameters (we spent a lot of time creating this script and would like to avoid transitioning to a different tool).
As you can understand, opening 100K threads is not feasible. However, you do not really need 100K threads. Human users act relatively slowly; they perform at most one action every 10 seconds or so.
So you can create roughly 100 threads, but each of them should simulate 1,000 users. How do you simulate that? You can hold 1,000 objects that represent each user's state, walk the list either sequentially or randomly, take the next user's action, and perform it (see the sketch below).
You can implement this yourself or use an actor-model framework, e.g. Akka.
If you do not want to use Akka right now, you can improve the first solution using JMeter. You can implement a JMeter plugin that uses the same logic of simulating several users in one thread, but the thread pool will be managed by JMeter. As a benefit you get reports, time measurements and configurable load.
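A hedged sketch of the many-users-per-thread idea in plain Java (SimulatedUser, the pacing numbers, and the action itself are illustrative placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;

public class LoadSimulator {

    // Hypothetical per-user state; a real version would hold cookies, auth tokens, etc.
    static class SimulatedUser {
        final int id;
        int step = 0;
        SimulatedUser(int id) { this.id = id; }

        void performNextAction() {
            // Replace with a real HTTP call against the system under test.
            step++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int threads = 100;          // worker threads
        final int usersPerThread = 1000;  // 100 * 1,000 = 100K simulated users

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t * usersPerThread;
            pool.submit(() -> {
                List<SimulatedUser> users = new ArrayList<>();
                for (int u = 0; u < usersPerThread; u++) {
                    users.add(new SimulatedUser(offset + u));
                }
                while (!Thread.currentThread().isInterrupted()) {
                    // Pick the next user and perform one action on their behalf.
                    SimulatedUser user = users.get(ThreadLocalRandom.current().nextInt(users.size()));
                    user.performNextAction();
                    try {
                        // 1,000 users each acting every ~10 s means ~100 actions/s per thread,
                        // i.e. roughly one action every 10 ms.
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        Thread.sleep(60_000); // run the load for one minute, then stop
        pool.shutdownNow();
    }
}

The same loop can live inside a custom JMeter sampler (e.g. a Java Request sampler) so that JMeter owns the thread pool and the reporting, as suggested above.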
You do not need to simulate 100k users to have an approximate idea of what the performance would be for 100k users. As your simulation will not exactly mimic real users, you are already accepting some inaccuracy, so why not go further?
You could measure the performance with 100, 300 and 1000 simulated users (which you say your computer will manage), and see the trend. That is, you could create a performance model, and use that model to estimate the performance by extrapolation. The cost of a computation (in CPU time or memory, for example) can be approximated by a power law:
C = C0 N^p
where C is the cost, C0 is an unknown cost constant, N is the problem size (the number of users, for your case) and p is an unknown number (probably in the range 0 to 2).
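As an illustration of that extrapolation (the measured numbers here are invented), you can solve for p and C0 from two measurements and then project to 100K users:

public class PowerLawExtrapolation {
    public static void main(String[] args) {
        // Hypothetical measurements: mean cost (ms) at two simulated user counts.
        double n1 = 300, c1 = 120;
        double n2 = 1000, c2 = 310;

        // C = C0 N^p  =>  p = log(c2/c1) / log(n2/n1),  C0 = c1 / n1^p
        double p = Math.log(c2 / c1) / Math.log(n2 / n1);
        double c0 = c1 / Math.pow(n1, p);

        double projected = c0 * Math.pow(100_000, p);
        System.out.printf("p = %.3f, C0 = %.3f, projected cost at 100K users = %.1f ms%n",
                p, c0, projected);
    }
}

With three or more measurements you would fit the model instead (e.g. a least-squares fit on the log-log values), which also tells you how well a single power law actually describes your system.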
When running load tests against my app I am seeing very consistent response times. Once there is a constant level of load on GAE, the mean response times get smaller and smaller. But I want the same consistency on other apps that receive far fewer requests per second; for those I never need to support more than ~3 requests/second.
Reading the docs makes me think that increasing the number of minimum idle instances should result in more consistent response times. But even then clients will still see higher response times every time GAE's scheduler thinks more instances are required. I am looking for a setup where users do not see those initial slow requests.
When I increase the number of minimum idle instances to 1, I want GAE to use the one resident instance only. As load increases, it should bring up and warm up new (dynamic) instances, and only once they are warmed up should GAE send requests to them. But judging from the response times, it seems as if client requests arrive at dynamic instances while they are still being brought up. As a result, those requests take a long time (up to 30 seconds).
Could this happen if my warmup code is incomplete?
Could the first calls on the dynamic instances be so slow because they involve code paths that have not been warmed up yet?
I do not experience this problem during load tests or when enough people are using the app. But my testing environments are practically unusable by clients when nobody is using the app yet, e.g. in the morning.
Thanks!
Some generic thoughts:
A 30-second startup time for instances seems like a lot. We do a lot of initialization (including database hits) and we have around 5 seconds of overhead.
Warmup requests aren't guaranteed. If all instances are busy, and the scheduler believes that the request will be answered faster by starting a new instance instead of queuing it on a busy one, it will do so without wasting time on a warmup request.
I don't think this is an issue of a cold code path (though I don't know Java's HotSpot in detail); it's probably the (mem)cache which needs to fill first.
I don't know what you meant by "incomplete warmup code"; just check your logs for requests to /_ah/warmup. If there are any, warmup requests are enabled and working (a minimal warmup handler is sketched after this list).
Increasing the amount of idle instances beyond the 1-instance mark probably won't help here.
Sadly, there aren't any generic tricks to avoid that, but you could try to:
defer initialization code (doing only the absolute required minimum at instance startup)
start a backend that keeps the (mem)cache hot
If you don't mind the costs (and don't need automatic scaling for your low-volume application), you could even have all requests served by always-on backends.
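For completeness, a minimal warmup handler might look like the sketch below; the web.xml mapping to /_ah/warmup is not shown, and what counts as "expensive initialization" is of course app-specific:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of a handler mapped to /_ah/warmup (warmup requests must be enabled).
public class WarmupServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Do the expensive, once-per-instance work here instead of in the first
        // user-facing request: initialize the persistence framework, load config,
        // and prime any (mem)caches your handlers depend on.
        resp.setStatus(HttpServletResponse.SC_OK);
    }
}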
Through appstats, I can see that my datastore queries take about 125 ms (API and CPU combined), but there are often long latencies (e.g. up to 12,000 ms) before the queries are executed.
I can see that my latency from the datastore is not related to my query (e.g. the same query/data has vastly different latencies), so I'm assuming that it's a scheduling issue with app engine.
Are other people seeing this same problem ?
Is there someway to reduce the latency (e.g. admin console setting) ?
Here's a screen shot from appstats. This servlet does very little CPU processing. It does a getObjectById and then runs a datastore query. The query has an OR operator, so it's being converted into 3 queries by App Engine.
As you can see, it takes 6,000 ms before the first getObjectById is even executed. There is no processing before the get operation (other than getting pm). I thought this 6,000 ms latency might be due to instance warm-up, so I increased my idle instances to 2 to prevent any warm-ups.
Then there's a second latency of around 1,000 ms between the getObjectById and the query. There are zero lines of code between the get and the query; the code simply takes the result of the getObjectById and uses the data as part of the query.
The grand total is 8097ms, yet my datastore operations (and 99.99% of the servlet) are only 514ms (45ms api), though the numbers change every time I run the servlet. Here is another appstats screenshot that was run on the same servlet against the same data.
Here are the basics of my Java code. I had to remove some of the details for security purposes.
// Fetch the user by key first.
user = pm.getObjectById(User.class, userKey);
// build queryBuilder.append(...   (filter construction omitted for security)
final Query query = pm.newQuery(UserAccount.class, queryBuilder.toString());
query.setOrdering("rating descending");
// appstats shows the second gap between the get above and this execution.
query.executeWithArray(args);
Edited:
Using Pingdom, I can see that GAE latency varies from 450 ms to 7,399 ms, a 1,644% difference! This is with two idle instances and no users on the site.
I observed very similar latencies (in the 7,000-10,000 ms range) in some of my apps. I don't think the bulk of the issue (those 6,000 ms) lies in your code.
In my observations, the issue is related to App Engine spinning up a new instance. Setting min idle instances may help mitigate it, but it will not solve it (I tried up to 2 idle instances), because even if you have N idle instances, App Engine prefers spinning up dynamic ones when a single request comes in and will "save" the idle ones in case of crazy traffic spikes. This is highly counter-intuitive, because you'd expect it to use the instances that are already around and spin up dynamic ones for future requests.
Anyway, in my experience this issue (10,000 ms latency) very rarely happens under any non-zero amount of load, and many people have had to resort to some kind of pinging (possibly via cron jobs) every couple of minutes (it used to work with 5 minutes, but lately instances are dying faster, so it's more like a ping every 2 minutes) to keep dynamic instances around to serve users who hit the site when no one else is on. This pinging is not ideal because it eats away at your free quota (pinging every 5 minutes will consume more than half of it), but I really haven't found a better alternative so far; a minimal ping handler is sketched below.
To recap: in general I have found that App Engine is awesome when under load, but not outstanding when you have only very few (1-3) users on the site.
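The keep-alive ping itself is trivial; a hedged sketch follows (the /ping path and the 2-minute cron schedule are illustrative, and the cron.xml entry is not shown):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Cheap handler hit by cron every couple of minutes, e.g. mapped to /ping.
// Its only purpose is to keep a dynamic instance alive between real requests.
public class PingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().println("ok");
    }
}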
Appstats only helps diagnose performance issues when you make GAE API/RPC calls.
In the case of your diagram, the "blank" time is spent running your code on your instance. It's not going to be scheduling time.
Your guess that the initial delay is due to instance warm-up is highly likely to be right; it may be framework code that is executing.
I can't guess at the delay between the Get and the Query. It may be that there are zero lines of code, but you called some function while building the Query that takes time to process.
Without knowledge of the language, framework or the actual code, no one will be able to help you.
You'll need to add some sort of performance tracing on your own in order to diagnose this. The simplest (but not highly accurate) way to do this is to add timers and log timer values as your code executes.
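A hedged example of that kind of ad-hoc tracing (the two calls here are stand-ins for the getObjectById and query execution in the snippet above, not real datastore calls):

import java.util.logging.Logger;

public class TimingExample {
    private static final Logger log = Logger.getLogger(TimingExample.class.getName());

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        doGet();                      // stand-in for pm.getObjectById(...)
        long t1 = System.currentTimeMillis();
        doQuery();                    // stand-in for query.executeWithArray(args)
        long t2 = System.currentTimeMillis();

        // Log where the wall-clock time actually went, to compare with appstats.
        log.info("get took " + (t1 - t0) + " ms, query took " + (t2 - t1) + " ms");
    }

    // Placeholders that simply burn a little time.
    private static void doGet()   { sleep(50); }
    private static void doQuery() { sleep(80); }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}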