We need to load test our servers and our goal is to simulate 100K concurrent users.
I have created a JUnit test that receives a NUM_OF_USERS parameter and runs against our servers.
The problem is that we need a very large number of users (100K), and a single PC running this test can probably handle only about 1,000 users.
How can we perform this task? Are there any tools for it?
P.S. It would be really good if we could run this JUnit test from multiple PCs rather than switching to a tool that needs to be configured with the relevant parameters (we spent a lot of time creating this script and would like to avoid transitioning to a different tool).
As you can understand, opening 100K threads is not possible. However, you do not really need 100K threads. Human users act relatively slowly; they perform at most one action every 10 seconds or so.
So you can create, say, 100 threads, where each of them simulates 1,000 users. How to simulate? Hold 1,000 objects that represent each user's state, walk the list either sequentially or randomly, take the next user's action, and perform it.
You can implement this yourself or use an actor-model framework, e.g. Akka.
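For example, a rough sketch of that idea might look like the following; the class and method names are placeholders for your own per-user test logic:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: one worker thread drives many simulated users by iterating over
// their state objects and performing one action at a time.
public class UserSimulator implements Runnable {

    // Placeholder for whatever per-user state your JUnit script keeps.
    static class SimulatedUser {
        final int id;
        SimulatedUser(int id) { this.id = id; }
        void performNextAction() {
            // call your existing single-user test step here
        }
    }

    private final List<SimulatedUser> users = new ArrayList<>();
    private final Random random = new Random();

    public UserSimulator(int usersPerThread, int firstId) {
        for (int i = 0; i < usersPerThread; i++) {
            users.add(new SimulatedUser(firstId + i));
        }
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // pick a user (randomly here) and let it perform its next action
            SimulatedUser user = users.get(random.nextInt(users.size()));
            user.performNextAction();
        }
    }

    public static void main(String[] args) {
        int threads = 100, usersPerThread = 1000;   // 100 x 1,000 = 100K simulated users
        for (int t = 0; t < threads; t++) {
            new Thread(new UserSimulator(usersPerThread, t * usersPerThread)).start();
        }
    }
}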
If you do not want to use Akka right now, you can just improve the first solution using JMeter. You can implement a plugin for JMeter that uses the same logic to simulate several users in one thread, but the thread pool will be managed by JMeter. As a benefit you will get reports, time measurements and configurable load.
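One way to do that (a sketch, not a complete plugin) is to wrap the multi-user logic in a JMeter Java Request sampler; JMeter then manages the threads, parameters and reporting. The simulateUsers method below is a placeholder for your own logic:

import org.apache.jmeter.config.Arguments;
import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

// Sketch of a JMeter Java Request sampler that runs many simulated users per JMeter thread.
public class MultiUserSampler extends AbstractJavaSamplerClient {

    @Override
    public Arguments getDefaultParameters() {
        Arguments args = new Arguments();
        args.addArgument("usersPerThread", "1000"); // configurable from the JMeter test plan
        return args;
    }

    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        int usersPerThread = context.getIntParameter("usersPerThread");
        SampleResult result = new SampleResult();
        result.sampleStart();
        try {
            simulateUsers(usersPerThread);   // your existing per-user actions
            result.setSuccessful(true);
        } catch (Exception e) {
            result.setSuccessful(false);
        } finally {
            result.sampleEnd();
        }
        return result;
    }

    private void simulateUsers(int usersPerThread) {
        // iterate over user-state objects and perform one action each, as described above
    }
}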
You do not need to simulate 100k users to have an approximate idea of what the performance would be for 100k users. As your simulation will not exactly mimic real users, you are already accepting some inaccuracy, so why not go further?
You could measure the performance with 100, 300 and 1000 simulated users (which you say your computer will manage), and see the trend. That is, you could create a performance model, and use that model to estimate the performance by extrapolation. The cost of a computation (in CPU time or memory, for example) can be approximated by a power law:
C = C0 N^p
where C is the cost, C0 is an unknown cost constant, N is the problem size (the number of users, for your case) and p is an unknown number (probably in the range 0 to 2).
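As a quick illustration of the extrapolation (the measured values below are made up, not real data), you can fit C0 and p from two measurements and then evaluate the model at N = 100,000:

// Sketch: fit C = C0 * N^p from two measured points and extrapolate.
public class PowerLawExtrapolation {
    public static void main(String[] args) {
        double n1 = 300, c1 = 2.0;      // e.g. 2.0 s observed with 300 simulated users
        double n2 = 1000, c2 = 9.5;     // e.g. 9.5 s observed with 1000 simulated users

        // Taking logs turns C = C0 * N^p into a line: log C = log C0 + p * log N
        double p = Math.log(c2 / c1) / Math.log(n2 / n1);
        double c0 = c1 / Math.pow(n1, p);

        double estimateAt100k = c0 * Math.pow(100_000, p);
        System.out.printf("p = %.3f, C0 = %.6f, estimated cost at 100K users = %.1f%n",
                p, c0, estimateAt100k);
    }
}

With three or more measurements you would do a least-squares fit on the log-log data instead of an exact fit through two points.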
DynamoDB allows only 25 requests per batch. Is there any way to increase this in Java, as I have to process thousands of records per second? Is there a solution better than dividing the records into batches and processing them?
The limit of 25 items per BatchWriteItem is a hard DynamoDB limit, as documented here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
There is nothing preventing you from doing multiple BatchWrites in parallel. The thing that is going to gate how much you can write is the write-provisioned-throughput on the table.
BatchWrites in DynamoDB were introduced to reduce the number of round trips required to perform multiple write operations for languages that do not make it easy to perform the work in parallel threads, such as PHP.
While you will still get better performance because of the reduced round trips by using the batch API, there is still the possibility that individual writes can fail and your code will need to look for those. A robust way to perform massively parallel writes using Java would be to use the ExecutorService class. This provides a simple mechanism to use multiple threads to perform the inserts. However, just as individual items within a batch can fail, you will want to track the Future objects to ensure the writes are performed successfully.
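A rough sketch of that approach follows. The writeBatch helper is a placeholder for your actual BatchWriteItem call (it should retry any unprocessed items the API reports), and the Record type stands in for your own items:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: fan batches of up to 25 items out to a thread pool and check every Future.
public class ParallelBatchWriter {

    private static final int BATCH_SIZE = 25;   // hard DynamoDB limit per BatchWriteItem

    public void writeAll(List<Record> records, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < records.size(); i += BATCH_SIZE) {
            final List<Record> batch =
                    records.subList(i, Math.min(i + BATCH_SIZE, records.size()));
            futures.add(pool.submit(() -> writeBatch(batch)));
        }
        for (Future<?> f : futures) {
            f.get();   // surfaces any exception thrown while writing a batch
        }
        pool.shutdown();
    }

    private void writeBatch(List<Record> batch) {
        // call BatchWriteItem here and retry the UnprocessedItems it returns
    }

    static class Record { /* your item fields */ }
}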
Another way to improve throughput is to run your code on EC2. If you are calling DynamoDB from your laptop or a datacenter outside of AWS the round trip time will take longer and the requests will be slightly slower.
The bottom line is to use standard Java multi-threading techniques to get the performance you want. However, past a certain point you may need to fan out and use additional hardware to drive even higher write OPS.
Whenever you've got a large stream of real-time data that needs to end up in AWS, Kinesis Streams are probably the way to go. Particularly with AWS Kinesis Firehose, you can pipe your data to S3 at massive scale with no administrative overhead. You can then use DataPipeline to move it to Dynamo.
I have been running a test for a large data migration to Dynamo that we intend to do in our prod account this summer. I ran a test to batch write about 3.2 billion documents to our Dynamo table, which has hash and range keys and two partial indexes. Each document is small, less than 1k. While we succeeded in getting the items written in about 3 days, we were disappointed with the Dynamo performance we experienced and are looking for suggestions on how we might improve things.
In order to do this migration, we are using 2 EC2 instances (c4.8xlarges). Each runs up to 10 processes of our migration program; we've split the work among the processes by some internal parameters and know that some processes will run longer than others. Each process queries our RDS database for 100,000 records. We then split these into partitions of 25 each and use a thread pool of 10 threads to call the DynamoDB Java SDK's batchSave() method. Each call to batchSave() is sending only 25 documents that are less than 1k each, so we expect each to make only a single HTTP call out to AWS. This means that at any given time, we can have as many as 100 threads on a server, each making calls to batchSave with 25 records.
Our RDS instance handled the load of queries to it just fine during this time, and our 2 EC2 instances did as well. On the EC2 side, we did not max out our CPU, memory, network in, or network out. Our writes are not grouped by hash key, as we know that can slow down Dynamo writes. In general, in a group of 100,000 records, they are split across 88,000 different hash keys. I created the Dynamo table initially with 30,000 write throughput, but scaled it up to 40,000 write throughput at one point during the test, so our understanding is that there are at least 40 partitions on the Dynamo side to handle this.
We saw very variable response times in our calls to batchSave() to Dynamo throughout this period. For one span of 20 minutes while I was running 100 threads per EC2 instance, the average time was 0.636 seconds, but the median was only 0.374, so we've got a lot of calls taking more than a second. I'd expect to see much more consistency in the time it takes to make these calls from an EC2 instance to Dynamo. Our Dynamo table seems to have plenty of throughput configured, the EC2 instance is below 10% CPU, and the network in and out look healthy but are not close to being maxed out. The CloudWatch graphs in the console (which are fairly terrible...) didn't show any throttling of write requests.
After I took these sample times, some of our processes finished their work, so we were running fewer threads on our EC2 instances. When that happened, we saw dramatically improved response times in our calls to Dynamo. E.g., when we were running 40 threads instead of 100 on the EC2 instance, each making calls to batchSave, the response times improved more than 5x. However, we did NOT see improved write throughput even with the better response times. It seems that no matter what we configured our write throughput to be, we never really saw the actual throughput exceed 15,000.
We'd like some advice on how best to achieve better performance on a Dynamo migration like this. Our production migration this summer will be time-sensitive, of course, and by then, we'll be looking to migrate about 4 billion records. Does anyone have any advice on how we can achieve an overall higher throughput rate? If we're willing to pay for 30,000 units of write throughput for our main index during the migration, how can we actually achieve performance close to that?
One component of BatchWrite latency is the Put request that takes the longest in the Batch. Considering that you have to loop over the List of DynamoDBMapper.FailedBatch until it is empty, you might not be making progress fast enough. Consider running multiple parallel DynamoDBMapper.save() calls instead of batchSave so that you can make progress independently for each item you write.
Again, Cloudwatch metrics are 1 minute metrics so you may have peaks of consumption and throttling that are masked by the 1 minute window. This is compounded by the fact that the SDK, by default, will retry throttled calls 10 times before exposing the ProvisionedThroughputExceededException to the client, making it difficult to pinpoint when and where the actual throttling is happening. To improve your understanding, try reducing the number of SDK retries, request ConsumedCapacity=TOTAL, self-throttle your writes using Guava RateLimiter as is described in the rate-limited scan blog post, and log throttled primary keys to see if any patterns emerge.
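A sketch of self-throttled, parallel single-item writes along those lines follows. The thread count and the assumption that one permit roughly equals one WCU (true only for items under 1 KB) are illustrative, and MyDocument stands in for your own mapper-annotated class:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.google.common.util.concurrent.RateLimiter;

// Sketch: individual mapper.save() calls in a thread pool, self-throttled with
// Guava's RateLimiter so the client stays under the table's provisioned WCU.
public class ThrottledWriter {

    private final DynamoDBMapper mapper;
    private final RateLimiter rateLimiter;

    public ThrottledWriter(DynamoDBMapper mapper, double writeCapacityUnitsPerSecond) {
        this.mapper = mapper;
        this.rateLimiter = RateLimiter.create(writeCapacityUnitsPerSecond);
    }

    public void writeAll(List<MyDocument> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final MyDocument doc : documents) {
            pool.submit(() -> {
                rateLimiter.acquire();   // assumes ~1 WCU per item (items < 1 KB)
                mapper.save(doc);        // independent progress per item; failures surface per call
            });
        }
        pool.shutdown();
    }

    // Stand-in for your @DynamoDBTable-annotated document class.
    static class MyDocument { }
}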
Finally, the number of partitions of a table is not only driven by the amount of read and write capacity units you provision on your table. It is also driven by the amount of data you store in your table. Generally, a partition stores up to 10GB of data and then will split. So, if you just write to your table without deleting old entries, the number of partitions in your table will grow without bound. This causes IOPS starvation - even if you provision 40000 WCU/s, if you already have 80 partitions due to the amount of data, the 40k WCU will be distributed among 80 partitions for an average of 500 WCU per partition. To control the amount of stale data in your table, you can have a rate-limited cleanup process that scans and removes old entries, or use rolling time-series tables (slides 84-95) and delete/migrate entire tables of data as they become less relevant. Rolling time-series tables is less expensive than rate-limited cleanup as you do not consume WCU with a DeleteTable operation, while you consume at least 1 WCU for each DeleteItem call.
Through Appstats, I can see that my datastore queries are taking about 125ms (API and CPU combined), but often there are long latencies (e.g. up to 12000ms) before the queries are executed.
I can see that my latency from the datastore is not related to my query (e.g. the same query/data has vastly different latencies), so I'm assuming that it's a scheduling issue with App Engine.
Are other people seeing this same problem?
Is there some way to reduce the latency (e.g. an admin console setting)?
Here's a screen shot from appstats. This servlet has very little cpu processing. It does a getObjectByID and then does a datastore query. The query has an OR operator so it's being converted into 3 queries by app engine.
As you can see, it takes 6000ms before the first getObjectByID is even executed. There is no processing before the get operation (other than getting pm). I thought this 6000ms latency might be due to an instance warm-up, so I had increased my idle instances to 2 to prevent any warm-ups.
Then there's a second latency of around 1000ms between the getObjectByID and the query. There are zero lines of code between the get and the query. The code simply takes the result of the getObjectByID and uses the data as part of the query.
The grand total is 8097ms, yet my datastore operations (and 99.99% of the servlet) are only 514ms (45ms api), though the numbers change every time I run the servlet. Here is another appstats screenshot that was run on the same servlet against the same data.
Here are the basics of my Java code. I had to remove some of the details for security purposes.
user = pm.getObjectById(User.class, userKey);   // ~6000ms gap observed before this call
//build queryBuilder.append(...
final Query query = pm.newQuery(UserAccount.class, queryBuilder.toString());
query.setOrdering("rating descending");
query.executeWithArray(args);   // the OR in the filter is expanded into 3 datastore queries
Edited:
Using Pingdom, I can see that GAE latency varies from 450ms to 7,399ms, a 1,644% difference! This is with two idle instances and no users on the site.
I observed very similar latencies (in the 7000-10000ms range) in some of my apps. I don't think the bulk of the issue (those 6000ms) lies in your code.
In my observations, the issue is related to App Engine spinning up a new instance. Setting min idle instances may help mitigate it, but it will not solve it (I tried up to 2 idle instances), because basically even if you have N idle instances, App Engine will prefer spinning up dynamic ones even when a single request comes in, and will "save" the idle ones in case of crazy traffic spikes. This is highly counter-intuitive, because you'd expect it to use the instances that are already around and spin up dynamic ones for future requests.
Anyway, in my experience this issue (10000ms latency) very rarely happens under any non-zero amount of load, and many people have had to resort to some kind of pinging (possibly cron jobs) every couple of minutes (it used to work with 5 minutes, but lately instances are dying faster, so it's more like a ping every 2 minutes) to keep dynamic instances around to serve users who hit the site when no one else is on. This pinging is not ideal because it will eat away at your free quota (pinging every 5 minutes will eat away more than half of it), but I really haven't found a better alternative so far.
To recap, in general I found that App Engine is awesome when under load, but not outstanding when you just have very few (1-3) users on the site.
Appstats only helps diagnose performance issues when you make GAE API/RPC calls.
In the case of your diagram, the "blank" time is spent running your code on your instance. It's not going to be scheduling time.
Your guess that the initial delay is due to instance warm-up is very likely correct. It may be framework code that is executing.
I can't guess at the delay between the Get and the Query. It may be that there are 0 lines of code, but you may have called some function in building the Query that takes time to process.
Without knowledge of the language, framework or the actual code, no one will be able to help you.
You'll need to add some sort of performance tracing on your own in order to diagnose this. The simplest (but not highly accurate) way to do this is to add timers and log timer values as your code executes.
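For example, a crude way to do that (just an illustration; the label strings are arbitrary) is a tiny stopwatch helper logged around each suspect call:

import java.util.logging.Logger;

// Minimal stopwatch helper for logging elapsed wall-clock time around suspect calls.
public class SimpleTimer {
    private static final Logger LOG = Logger.getLogger(SimpleTimer.class.getName());
    private final String label;
    private final long start = System.currentTimeMillis();

    public SimpleTimer(String label) {
        this.label = label;
    }

    public void stop() {
        LOG.info(label + " took " + (System.currentTimeMillis() - start) + " ms");
    }
}

// Usage inside the servlet from the question:
//   SimpleTimer t = new SimpleTimer("getObjectById");
//   user = pm.getObjectById(User.class, userKey);
//   t.stop();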
I have used JMeter to test my App Engine app's performance.
I created a thread group of 500 users, with a ramp-up period of 0 seconds and a loop count of 1, and ran the test.
It created 4 instances in App Engine. But the interesting thing is that more than 450 requests were processed by a single instance.
I ran the test again with these instances up, and still most of the requests (> 90%) were going to the same instance.
Instance type: F1 Class
Max Idle Instances: ( Automatic )
Min Pending Latency: ( Automatic )
I'm getting much higher latency.
What's going wrong here?
Is generating the load from a single IP a problem?
Your problem is that you are not using a realistic ramp-up value. App Engine, like most auto-scaling solutions, requires a reasonable amount of time to spin up new instances. While it is creating the new instances, latency can increase if there was a large and sudden increase in traffic.
Choose a ramp-up value that is representative of the sort of spikes/surges you realistically expect to see in production, and then run the test. Use the values from this test to decide how many App Engine instances you would like to be 'always on'; the higher this value, the lower the impact of a surge, but obviously the higher your costs.
When you say "I'm getting much higher latency" what exactly are you getting? Do you consider it to be too slow?
If latency is an issue then you can reduce the max pending latency in the application settings. If you try this I imagine you will see your requests spread across the instances more.
My guess is simply that the 2-3 idle instances have spun up in anticipation of increased load but are actually not needed for your test.
It was totally App Engine's issue...
See this issue reported in App Engine's issue tracker.
Spread your requests across different thread groups, and the instances will be utilised. I'm not sure why this happens; I wasn't able to find any definitive information that explains it.
(I wonder if maybe App Engine sees the requests from a single thread group as requests originating from a common origin, so it places all of the utilised resources in the same instance, so that the output can be most efficiently passed back to the originator of the requests.)
We are going to implement software for various statistical analyses in Java. The main concept is to get an array of points on a graph, then iterate through it and find some results (like the longest rising sequence and various indicators).
Problem: lots of data
Problem 2: must also work on the client's PC, not only the server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: Is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here...
The main reason to use parallel processing is the presence of a large amount of data or large computations whose parts can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting it into the parts 1..1000, 1001..2000, 2001..3000, etc., processing each part, and then multiplying the partial results together. On the other hand, you cannot split the task of computing a big Fibonacci number, since later values depend on previous ones.
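A sketch of that factorial example (the chunk size and pool size here are arbitrary):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: compute 10000! by splitting the range into chunks, multiplying each
// chunk in its own task, and combining the partial products at the end.
public class ParallelFactorial {
    public static void main(String[] args) throws Exception {
        int n = 10_000, chunk = 1_000;
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<BigInteger>> parts = new ArrayList<>();

        for (int start = 1; start <= n; start += chunk) {
            final int lo = start, hi = Math.min(start + chunk - 1, n);
            parts.add(pool.submit(() -> {
                BigInteger product = BigInteger.ONE;
                for (int i = lo; i <= hi; i++) {
                    product = product.multiply(BigInteger.valueOf(i));
                }
                return product;
            }));
        }

        BigInteger result = BigInteger.ONE;
        for (Future<BigInteger> part : parts) {
            result = result.multiply(part.get());   // accumulate the partial products
        }
        pool.shutdown();
        System.out.println("10000! has " + result.toString().length() + " decimal digits");
    }
}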
The same applies to large amounts of data. If you have collected an array of points and want to find particular points (bigger than some constant, the maximum of all of them) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computation. If you need to collect "running" information (the longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the user's PC has cores (the same goes for computing clusters: do not split the task into more subtasks than there are computers in the cluster).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also search for more specific Java libraries that allow evaluation in parallel. For example, Parallel Colt implements high-performance concurrent algorithms for working with big matrices, and there are lots of such libraries for many data representations.
In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores with Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads using Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
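As a minimal sketch of that setup (the array of values here stands in for your data points, and the chunking is just one possible split):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: size the pool by the number of cores and sum an array of values
// (e.g. point coordinates) in parallel, one chunk per task.
public class ParallelSum {
    public static void main(String[] args) throws Exception {
        double[] values = new double[10_000_000];   // stand-in for your data points
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int chunk = (values.length + threads - 1) / threads;
        List<Future<Double>> partials = new ArrayList<>();
        for (int start = 0; start < values.length; start += chunk) {
            final int lo = start, hi = Math.min(start + chunk, values.length);
            partials.add(pool.submit(() -> {
                double sum = 0;
                for (int i = lo; i < hi; i++) {
                    sum += values[i];
                }
                return sum;
            }));
        }

        double total = 0;
        for (Future<Double> partial : partials) {
            total += partial.get();   // combine the per-chunk sums
        }
        pool.shutdown();
        System.out.println("sum = " + total);
    }
}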
If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and maybe we'll be able to give you more help.