DynamoDB allows only 25 requests per batch. Is there any way to increase this in Java, given that I have to process thousands of records per second? Is there a better solution than dividing the records into batches and processing them?
The limit of 25 items per BatchWriteItem is a hard DynamoDB limit, as documented here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
Nothing prevents you from issuing multiple BatchWriteItem calls in parallel. What gates how much you can write is the provisioned write throughput on the table.
Batch writes were introduced in DynamoDB to reduce the number of round trips required to perform multiple write operations, particularly for languages such as PHP that do not make it easy to perform the work in parallel threads.
While you will still get better performance because of the reduced round trips by using the batch API, there is still the possibility that individual writes can fail and your code will need to look for those. A robust way to perform massively parallel writes using Java would be to use the ExecutorService class. This provides a simple mechanism to use multiple threads to perform the inserts. However, just as individual items within a batch can fail, you will want to track the Future objects to ensure the writes are performed successfully.
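For example, here is a minimal sketch of that approach using the AWS SDK for Java v1; the class name, pool size, and pre-chunked input (25 items or fewer per chunk) are illustrative:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.BatchWriteItemRequest;
import com.amazonaws.services.dynamodbv2.model.BatchWriteItemResult;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBatchWriter {

    private final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    /** Submits each chunk of up to 25 items to its own thread and tracks the Futures. */
    public void writeAll(String tableName, List<List<WriteRequest>> chunks) throws Exception {
        List<Future<?>> futures = new ArrayList<>();
        for (List<WriteRequest> chunk : chunks) {
            futures.add(pool.submit(() -> writeChunk(tableName, chunk)));
        }
        for (Future<?> f : futures) {
            f.get(); // rethrows any failure from the worker so it isn't lost
        }
    }

    private void writeChunk(String tableName, List<WriteRequest> chunk) {
        Map<String, List<WriteRequest>> pending = Collections.singletonMap(tableName, chunk);
        // Under load, BatchWriteItem can return unprocessed items; resubmit
        // them until none remain (production code should back off between tries).
        while (!pending.isEmpty()) {
            BatchWriteItemResult result =
                    client.batchWriteItem(new BatchWriteItemRequest().withRequestItems(pending));
            pending = result.getUnprocessedItems();
        }
    }
}
```

Calling f.get() on each Future is what turns silent per-batch failures into exceptions your code can see and handle.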
Another way to improve throughput is to run your code on EC2. If you are calling DynamoDB from your laptop or from a datacenter outside of AWS, the round-trip time will be longer and requests will be slightly slower.
The bottom line is to use standard Java multi-threading techniques to get the performance you want. However, past a certain point you may need to fan out and use additional hardware to drive even higher write OPS.
Whenever you've got a large stream of real-time data that needs to end up in AWS, Kinesis Streams are probably the way to go. Particularly with AWS Kinesis Firehose, you can pipe your data to S3 at massive scale with no administrative overhead. You can then use DataPipeline to move it to Dynamo.
I have been running a test for a large data migration to DynamoDB that we intend to do in our prod account this summer. I ran a test to batch write about 3.2 billion documents to our DynamoDB table, which has hash and range keys and two partial indexes. Each document is small, less than 1K. While we succeeded in getting the items written in about 3 days, we were disappointed with the DynamoDB performance we experienced and are looking for suggestions on how we might improve things.
In order to do this migration, we are using 2 EC2 instances (c4.8xlarge). Each runs up to 10 processes of our migration program; we've split the work among the processes by some internal parameters and know that some processes will run longer than others. Each process queries our RDS database for 100,000 records. We then split these into partitions of 25 each and use a thread pool of 10 threads to call the DynamoDB Java SDK's batchSave() method. Each call to batchSave() sends only 25 documents of less than 1K each, so we expect each to make only a single HTTP call out to AWS. This means that at any given time, we can have as many as 100 threads on a server, each making calls to batchSave with 25 records.

Our RDS instance handled the query load just fine during this time, and so did our 2 EC2 instances. On the EC2 side, we did not max out CPU, memory, network in, or network out. Our writes are not grouped by hash key, since we know that can slow down DynamoDB writes; in general, a group of 100,000 records is spread across 88,000 different hash keys. I created the DynamoDB table initially with 30,000 write throughput, but raised it to 40,000 write throughput at one point during the test, so our understanding is that there are at least 40 partitions on the DynamoDB side to handle this.
We saw highly variable response times in our calls to batchSave() throughout this period. For one span of 20 minutes while I was running 100 threads per EC2 instance, the average time was 0.636 seconds, but the median was only 0.374, so a lot of calls are taking more than a second. I'd expect much more consistency in the time it takes to make these calls from an EC2 instance to DynamoDB. Our table seems to have plenty of throughput configured, the EC2 instances are below 10% CPU, and network in and out look healthy but are nowhere near maxed out. The CloudWatch graphs in the console (which are fairly terrible...) didn't show any throttling of write requests.
After I took these sample times, some of our processes finished their work, so we were running fewer threads on our EC2 instances. When that happened, we saw dramatically improved response times in our calls to DynamoDB. For example, when we were running 40 threads instead of 100 on an EC2 instance, each making calls to batchSave, response times improved more than 5x. However, we did NOT see improved write throughput even with the better response times. It seems that no matter what we configured the write throughput to be, we never really saw the actual throughput exceed 15,000.
We'd like some advice on how best to achieve better performance on a Dynamo migration like this. Our production migration this summer will be time-sensitive, of course, and by then, we'll be looking to migrate about 4 billion records. Does anyone have any advice on how we can achieve an overall higher throughput rate? If we're willing to pay for 30,000 units of write throughput for our main index during the migration, how can we actually achieve performance close to that?
One component of BatchWrite latency is the Put request that takes the longest in the batch. Considering that you have to loop over the List of DynamoDBMapper.FailedBatch until it is empty, you might not be making progress fast enough. Consider running multiple parallel DynamoDBMapper.save() calls instead of batchSave, so that you can make progress independently for each item you write.
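A sketch of that alternative, assuming an already-configured DynamoDBMapper (documented as thread-safe), a hypothetical MyDocument model class, and an enclosing method that declares InterruptedException:

```java
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each save() progresses independently, so one slow or throttled item no
// longer delays the other 24 items that would have shared its batch.
ExecutorService pool = Executors.newFixedThreadPool(100);
List<Future<?>> futures = new ArrayList<>();
for (MyDocument doc : documents) {          // 'documents' fetched from RDS
    futures.add(pool.submit(() -> mapper.save(doc)));
}
for (Future<?> f : futures) {
    try {
        f.get();                            // surfaces per-item failures
    } catch (ExecutionException e) {
        // log and retry this item's write
    }
}
pool.shutdown();
```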
Again, CloudWatch metrics are 1-minute metrics, so you may have peaks of consumption and throttling that are masked by the 1-minute window. This is compounded by the fact that the SDK, by default, retries throttled calls 10 times before exposing the ProvisionedThroughputExceededException to the client, making it difficult to pinpoint when and where the actual throttling is happening. To improve your understanding, try reducing the number of SDK retries, requesting ConsumedCapacity=TOTAL, self-throttling your writes using Guava's RateLimiter as described in the rate-limited scan blog post, and logging throttled primary keys to see whether any patterns emerge.
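A sketch of those suggestions using the AWS SDK for Java v1 and Guava; the table name, the item map, and the 1500-permits-per-worker figure are placeholders:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import com.amazonaws.services.dynamodbv2.model.PutItemResult;
import com.amazonaws.services.dynamodbv2.model.ReturnConsumedCapacity;
import com.google.common.util.concurrent.RateLimiter;

// Fewer SDK retries means throttling surfaces as an exception you can log,
// instead of being silently absorbed by up to 10 internal retries.
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
        .withClientConfiguration(new ClientConfiguration().withMaxErrorRetry(2))
        .build();

// Self-throttle below the provisioned capacity, e.g. 30000 WCU / 20 workers.
RateLimiter limiter = RateLimiter.create(1500.0);

PutItemResult result = client.putItem(new PutItemRequest()
        .withTableName("migration-table")              // placeholder
        .withItem(item)                                // item map built elsewhere
        .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL));

// Pay for the capacity this call actually consumed before the next write.
limiter.acquire((int) Math.ceil(result.getConsumedCapacity().getCapacityUnits()));
```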
Finally, the number of partitions of a table is not driven only by the amount of read and write capacity units you provision; it is also driven by the amount of data you store. Generally, a partition holds up to 10GB of data and then splits. So, if you just write to your table without deleting old entries, the number of partitions in your table will grow without bound. This causes IOPS starvation: even if you provision 40,000 WCU/s, if you already have 80 partitions due to the amount of data, the 40K WCU will be distributed among 80 partitions for an average of 500 WCU per partition. To control the amount of stale data in your table, you can have a rate-limited cleanup process that scans and removes old entries, or use rolling time-series tables (slides 84-95) and delete/migrate entire tables of data as they become less relevant. Rolling time-series tables are the less expensive option, since a DeleteTable operation consumes no WCU, while each DeleteItem call consumes at least 1 WCU.
According to this article, there are some serious flaws in Java's Fork/Join architecture. As I understand it, streams in Java 8 use the Fork/Join framework internally. We can easily make a stream parallel by using the parallel() method. But when we submit a long-running task to a parallel stream, it blocks all the threads in the pool (check this). This kind of behaviour is not acceptable for real-world applications.
My question is: what considerations should I take into account before using these constructs in high-performance applications (e.g. equity analysis, stock-market tickers, etc.)?
The considerations are similar to other uses of multiple threads.
Only use multiple threads if you know they help. The aim is not to use every core you have, but to have a program which performs to your requirements.
Don't forget multi-threading comes with an overhead, and this overhead can exceed the value you get.
Multi-threading can experience large outliers. When you test performance, you should look not only at throughput (which should improve) but also at the distribution of your latencies (which is often worse in the extreme cases).
For low latency, switch between threads as little as possible. If you can do everything in one thread that may be a good option.
For low latency, you don't want to play nice; instead, you want to minimise jitter by doing things such as pinning busy-waiting threads to isolated cores. The more cores you isolate, the fewer cores are left to run everything else, such as thread pools.
The streams API makes parallelism deceptively simple. As was stated before, whether using a parallel stream speeds up your application needs to be thoroughly analysed and tested in the actual runtime context. My own experience with parallel streams suggests the following (and I am sure this list is far from complete):
The cost of the operations performed on the elements of the stream, versus the cost of the parallelising machinery, determines the potential benefit of parallel streams. For example, finding the maximum in an array of doubles is so fast using a tight loop that the streams overhead is never worthwhile. As soon as the operations get more expensive, the balance starts to tip in favour of the parallel streams API, under ideal conditions (say, a multi-core machine dedicated to a single algorithm). I encourage you to experiment.
You need to have the time and stamina to learn the intricacies of the streams API. There are unexpected pitfalls. For example, a Spliterator can be constructed from a regular Iterator in a single statement. Under the hood, the elements produced by the iterator are first collected into arrays. Depending on the number of elements the Iterator produces, that approach can become very resource-hungry, or even prohibitively so.
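For example, here is a sketch of that one-statement construction and its hidden cost (fetchRecords() is a hypothetical element source):

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

Iterator<String> iterator = fetchRecords();   // hypothetical source
Spliterator<String> spliterator =
        Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);

// When this stream is split for parallel execution, the default spliterator
// copies batches of elements from the iterator into arrays behind the scenes.
// For a huge or expensive-to-produce source, this buffering can dominate memory.
Stream<String> stream = StreamSupport.stream(spliterator, /* parallel = */ true);
```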
While the cited article makes it seem that we are completely at the mercy of Oracle, that is not entirely true. You can write your own Spliterator that splits the input into chunks specific to your situation rather than relying on the default implementation. Or you could write your own ThreadFactory (see the method ForkJoinPool.makeCommonPool).
You need to be careful not to produce deadlocks. If the tasks executed on the elements of the stream themselves use the ForkJoinPool, a deadlock may occur. You need to learn the ForkJoinPool.ManagedBlocker API (which I find rather the opposite of easy to grasp). Technically, you are telling the ForkJoinPool that a thread is blocking, which may lead to the creation of additional threads to keep the degree of parallelism intact. The creation of extra threads is not free, of course.
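A sketch of the ManagedBlocker pattern, modelled on the example in the JDK javadoc, with a BlockingQueue standing in for any blocking operation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;

// Wrapping the blocking take() in a ManagedBlocker lets the pool add a
// compensating worker thread, so the configured parallelism is preserved
// while this thread waits.
final class QueueBlocker<T> implements ForkJoinPool.ManagedBlocker {
    private final BlockingQueue<T> queue;
    private volatile T item;

    QueueBlocker(BlockingQueue<T> queue) { this.queue = queue; }

    @Override
    public boolean block() throws InterruptedException {
        if (item == null) {
            item = queue.take();     // the actual blocking call
        }
        return true;                 // true = no further blocking needed
    }

    @Override
    public boolean isReleasable() {
        return item != null || (item = queue.poll()) != null;
    }

    T get() { return item; }
}

// Usage inside a task running in a ForkJoinPool:
// QueueBlocker<String> blocker = new QueueBlocker<>(queue);
// ForkJoinPool.managedBlock(blocker);  // may create an extra worker thread
// String value = blocker.get();
```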
Just my five cents...
The point of the articles (there are actually 17 points) is that the F/J framework is more of a research project than a general-purpose framework for commercial application development.
Criticize the object, not the man. That is most difficult to do when the main problem with the framework is that the architect is a professor/scientist, not an engineer/commercial developer. The consolidated PDF downloadable from the article goes further into the problem of using research standards rather than engineering standards.
Parallel streams work fine, until you try to scale them. The framework uses pull technology: a request goes into a submission queue, and a thread must pull the request out of the submission queue. A forked Task goes back into the forking thread's deque, and other threads must pull (steal) the Task out of that deque. This technique doesn't scale well. In a push technology, each Task is scattered to every thread in the system, which works much better in large-scale environments.
There are many other problems with scaling, as even Paul Sandoz from Oracle pointed out: "For instance if you have 32 cores and are doing Stream.of(s1, s2, s3, s4).flatMap(x -> x).reduce(...) then at most you will only use 4 cores." The article points out, with downloadable software, that scaling does not work well and that the "parquential" technique is necessary to avoid stack overflows and OOMEs.
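A small illustration of that limitation, with one possible workaround under the assumption that the data fits in memory (the inner streams are illustrative; measure before relying on this):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class FlatMapScaling {

    // Four inner streams of a million elements each (illustrative data).
    static Stream<Stream<Integer>> fourStreams() {
        return Stream.of(inner(), inner(), inner(), inner());
    }

    static Stream<Integer> inner() {
        return IntStream.range(0, 1_000_000).boxed();
    }

    public static void main(String[] args) {
        // The outer stream has only four elements, so splitting stops there:
        // at most four cores participate, however many the machine has.
        long viaFlatMap = fourStreams().parallel()
                .flatMap(Function.identity())
                .mapToLong(Integer::longValue)
                .sum();

        // Flattening into a sized list first gives a spliterator that can
        // split down to single elements and engage all available cores.
        List<Integer> flat = fourStreams()
                .flatMap(Function.identity())
                .collect(Collectors.toList());
        long viaList = flat.parallelStream()
                .mapToLong(Integer::longValue)
                .sum();

        System.out.println(viaFlatMap == viaList);
    }
}
```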
Use the parallel streams. But beware of the limitations.
We need to load test our servers and our goal is to simulate 100K concurrent users.
I have created a JUnit script that receives a NUM_OF_USERS parameter and runs the script against our servers.
The problem is that we need a large number of users (100K), and a single PC running this test can probably handle only about 1,000 users.
How can we perform this task? Are there any tools for that?
P.S. It would be really good if we could run this JUnit test from multiple PCs rather than using a tool that needs to be configured with the relevant parameters (we spent a lot of time creating this script and would like to avoid transitioning to a different tool).
As you can understand, opening 100K threads is not possible. However, you do not really need 100K threads. Human users act relatively slowly: they perform at most about one action every 10 seconds.
So, you can create, say, 100 threads, each of which simulates 1000 users. How to simulate? Each thread can hold 1000 objects that represent user state, walk the list either sequentially or randomly, take the next user's action, and perform it.
You can implement this yourself (see the sketch below) or use an actor-model framework, e.g. Akka.
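Here is a sketch of the do-it-yourself variant; SimulatedUser is a hypothetical stand-in for your scripted per-user state:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;

public class LoadDriver {

    // Hypothetical per-user state machine: remembers where "its" user is in
    // the scenario and performs the next scripted action when asked.
    static class SimulatedUser {
        private int step = 0;
        void performNextAction() {
            step++; // placeholder: issue the next HTTP call of the scenario
        }
    }

    public static void main(String[] args) {
        int threads = 100, usersPerThread = 1000;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                List<SimulatedUser> users = new ArrayList<>();
                for (int u = 0; u < usersPerThread; u++) {
                    users.add(new SimulatedUser());
                }
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        // Pick a user (randomly here) and advance them one action.
                        users.get(ThreadLocalRandom.current().nextInt(users.size()))
                             .performNextAction();
                        Thread.sleep(10); // ~100 actions/s = 1000 users at 1 per 10 s
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shut down cleanly
                }
            });
        }
    }
}
```

With 100 threads each performing roughly one action every 10 ms, this drives about 10,000 actions per second overall, which matches 100K users each acting once every 10 seconds.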
If you do not want to use Akka right now, you can improve the first solution using JMeter. You can implement a JMeter plugin that uses the same logic of simulating several users in one thread, but the thread pool will be managed by JMeter. As a benefit, you get reports, time measurements, and configurable load.
You do not need to simulate 100k users to have an approximate idea of what the performance would be for 100k users. As your simulation will not exactly mimic real users, you are already accepting some inaccuracy, so why not go further?
You could measure the performance with 100, 300 and 1000 simulated users (which you say your computer will manage), and see the trend. That is, you could create a performance model, and use that model to estimate the performance by extrapolation. The cost of a computation (in CPU time or memory, for example) can be approximated by a power law:
C = C0 N^p
where C is the cost, C0 is an unknown cost constant, N is the problem size (the number of users, for your case) and p is an unknown number (probably in the range 0 to 2).
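For example, here is a sketch with made-up measurements (with three or more points you would instead fit p by least squares on the log-log values):

```java
// Two hypothetical measurements: mean response cost at 100 and 1000 users.
double n1 = 100,  c1 = 0.8;
double n2 = 1000, c2 = 2.1;

// On a log-log plot, C = C0 * N^p is a straight line with slope p.
double p  = Math.log(c2 / c1) / Math.log(n2 / n1);
double c0 = c1 / Math.pow(n1, p);

// Extrapolate the fitted model to 100K users.
double predicted = c0 * Math.pow(100_000, p);
System.out.printf("p = %.2f, predicted cost at 100K users = %.2f%n", p, predicted);
```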
I am looking to get some ideas on how I can solve my failover problem in my Java service.
At a high level, my service receives 3 separate object streams of data from another service, performs some amalgamating logic and then writes to a datastore.
Each object in the stream will have a unique key. Data from the 3 streams can arrive simultaneously, there is no guaranteed ordering.
After the data arrives, it will be stored in some java.util.concurrent collection, such as a BlockingQueue or a ConcurrentHashMap.
The problem is that this service must support failover, and I am not sure how to resolve this if failover were to take place when data is stored in an in-memory object.
One simple idea I have is the following:
Write each object to a file/elsewhere when it is received, prior to adding it to the queue
When an object is finally processed and stored in the datastore, mark it as done in that file
When failover occurs, ensure that same file is copied across, so we know which objects we still need to process
Performance is a big factor in my service, and since I/O is expensive, this seems like a crude and quite simplistic approach.
Therefore, I am wondering if there are any libraries etc out there that can solve this problem easily?
I would use Java Chronicle partly because I wrote it but mostly because ...
it can write and read millions of entries per second to disk in a text or a binary format.
it can be shared between processes, e.g. for active-active clustering, with sub-microsecond latency.
it doesn't require a system call or flush to push out the data.
the producer is not slowed by the consumer, and can be GBs ahead of it (more than the total memory of the machine).
it can be used in a low-heap, GC-free, and lock-free manner.
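A minimal sketch of journaling for the failover case described above, assuming the modern Chronicle Queue 5.x API (the successor to the original Java Chronicle; exact class and method names vary by version):

```java
import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;

public class JournalSketch {
    public static void main(String[] args) {
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("journal-dir").build()) {
            // Journal each object as it arrives, before it enters the
            // in-memory collection.
            ExcerptAppender appender = queue.acquireAppender();
            appender.writeText("key=42,stream=A,payload=...");  // placeholder record

            // On failover, the standby replays the journal to rebuild the
            // set of objects that were received but not yet stored.
            ExcerptTailer tailer = queue.createTailer();
            String entry;
            while ((entry = tailer.readText()) != null) {
                System.out.println("replay: " + entry);
            }
        }
    }
}
```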
We are going to implement software for various statistical analyses, in Java. The main concept is to get an array of points on a graph, then iterate through it and compute some results (like the longest rising sequence and various indicators).
Problem: lots of data
Problem 2: it must also work on a client's PC, not only on a server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here...
The main prerequisite for parallel processing is the presence of a large amount of data, or large computations whose parts can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting the range into parts 1..1000, 1001..2000, 2001..3000, etc., processing each part, and then multiplying the partial results together. On the other hand, you cannot split the task of computing a big Fibonacci number this way, since each value depends on the previous ones.
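A sketch of that factorial example, splitting the range into independent parts and multiplying the partial products:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFactorial {
    public static void main(String[] args) throws Exception {
        int n = 10_000, parts = 4;
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Future<BigInteger>> partials = new ArrayList<>();

        // Each task multiplies an independent range, e.g. 1..2500, 2501..5000, ...
        int chunk = n / parts;
        for (int i = 0; i < parts; i++) {
            final int lo = i * chunk + 1;
            final int hi = (i == parts - 1) ? n : (i + 1) * chunk;
            partials.add(pool.submit(() -> {
                BigInteger product = BigInteger.ONE;
                for (int k = lo; k <= hi; k++) {
                    product = product.multiply(BigInteger.valueOf(k));
                }
                return product;
            }));
        }

        // Accumulate the partial results with *.
        BigInteger result = BigInteger.ONE;
        for (Future<BigInteger> f : partials) {
            result = result.multiply(f.get());
        }
        pool.shutdown();
        System.out.println("10000! has " + result.bitLength() + " bits");
    }
}
```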
The same goes for large amounts of data. If you have collected an array of points and want to find particular points (bigger than some constant, the maximum of all) or just gather statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "running" information (like the longest rising sequence)... well, that is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the user's PC has cores (likewise for computing clusters: do not split the task into more subtasks than there are computers in the cluster).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also search for more specific Java libraries that allow evaluating in parallel. For example, Parallel Colt implements high-performance concurrent algorithms for working with big matrices, and there are lots of such libraries for many data representations.
In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores with Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads via Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor, as sketched below. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
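For example (a sketch; 'tasks' is a placeholder list of work items):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Size the pool for the machine the code actually runs on, rather than
// assuming a server-class core count.
int numThreads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(numThreads);

for (Runnable task : tasks) {   // 'tasks' assumed to be a List<Runnable>
    executor.submit(task);
}
executor.shutdown();            // finish queued work, then stop the threads
```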
If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and maybe we'll be able to provide more help.