Microbenchmarking with Big Data - java

I'm currently working on my thesis project, designing a cache implementation to be used with a shortest-path graph algorithm. The algorithm's runtimes are rather inconsistent, so it's too troublesome to benchmark the entire algorithm; I must concentrate on benchmarking the cache only.
The caches I need to benchmark are about a dozen or so implementations of a Map interface. These caches are designed to work well with a given access pattern (the order in which keys are queried from the algorithm above). However, in a given run of a "small" problem, there are a few hundred billion queries. I need to run almost all of them to be confident about the results of the benchmark.
I'm having conceptual problems with loading the data into memory. It's possible to create a query log, which would just be an on-disk ordered list of all the keys (they're 10-character string identifiers) that were queried in one run of the algorithm. This file is huge. The other thought I had was to break the log up into chunks of 1-5 million queries and benchmark in the following manner:
Load 1-5 million keys
Set start time to current time
Query them in order
Record elapsed time (current time - start time)
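A minimal sketch of that chunked scheme, assuming a plain `Map` stands in for one of the cache implementations under test (the chunk contents here are made-up placeholder keys; the real benchmark would read each chunk from the on-disk query log):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedBenchmark {
    // Hypothetical stand-in for one of the dozen cache implementations.
    static Map<String, Integer> cache = new HashMap<>();

    // Times one pre-loaded chunk; only the queries sit inside the timed region,
    // so chunk loading never pollutes the measurement.
    static long timeChunk(List<String> keys) {
        long start = System.nanoTime();      // set start time
        for (String key : keys) {            // query them in order
            cache.get(key);
        }
        return System.nanoTime() - start;    // record elapsed time
    }

    public static void main(String[] args) {
        // Placeholder keys; a real chunk would hold 1-5 million log entries.
        List<String> chunk = List.of("key0000001", "key0000002", "key0000001");
        chunk.forEach(k -> cache.put(k, k.length()));
        System.out.println("elapsed ns: " + timeChunk(chunk));
    }
}
```

Summing the per-chunk elapsed times gives the overall figure, while keeping the file I/O for loading the next chunk outside every timed window.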
I'm unsure what effects this will have on caching. How could I perform a warm-up period? Loading the file will likely clear any data that was in the L1 or L2 caches for the last chunk. Also, what effect does maintaining a 1-5 million element string array have (and does even iterating over it skew results)?
Keep in mind that the access pattern is important! For example, there are some hash tables with move-to-front heuristics, which re-orders the internal structure of the table. It would be incorrect to run a single chunk multiple times, or run chunks out of order. This makes warming CPU caches and HotSpot a bit more difficult (I could also keep a secondary dummy cache that's used for warming but not timing).
What are good practices for microbenchmarks with a giant dataset?

If I understand the problem correctly, how about loading the query log on one machine (possibly in chunks, if you don't have enough memory) and streaming it to the machine running the benchmark across a dedicated network (a crossover cable, probably)? That way you'd have minimal interference between the system under test and the test code/data.
Whatever solution you use, you should try multiple runs so you can assess repeatability - if you don't get reasonable repeatability then you can at least detect that your solution is unsuitable!
Updated: re: batching and timing - in practice, you'll probably end up with some form of fine-grained batching, at least, to get the data over the network efficiently. If your data falls into natural large 'groups' or stages then I would time those individually to check for anomalies, but would rely most strongly on an overall timing. I don't see much benefit from timing small batches of thousands (given that you are running through many millions).
Even if you run everything on one machine with a lot of RAM, it might be worth loading the data in one JVM and the code under test on another, so that garbage collection on the cache JVM is not (directly) affected by the large heap required to hold the query log.
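A hedged sketch of the streaming idea: here both ends live in one process for brevity, but the loader side would run in its own JVM (or on its own machine) holding the query log, writing keys down the socket one per line, while the benchmark JVM reads and queries them:

```java
import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class StreamedQueries {
    // Streams keys over a local socket and counts them on the receiving side.
    static long countStreamedKeys(String[] keys) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // any free port
            // Loader side: in the real setup, a separate JVM with the big heap.
            Thread loader = new Thread(() -> {
                try (Socket s = new Socket("localhost", server.getLocalPort());
                     PrintWriter out = new PrintWriter(new OutputStreamWriter(
                             s.getOutputStream(), StandardCharsets.UTF_8))) {
                    for (String key : keys) out.println(key);  // one key per line
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            loader.start();
            long queries = 0;
            // Benchmark side: reads keys off the wire; cache.get(key) would go here.
            try (Socket s = server.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         s.getInputStream(), StandardCharsets.UTF_8))) {
                while (in.readLine() != null) {
                    queries++;
                }
            }
            loader.join();
            return queries;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countStreamedKeys(
                new String[] {"key0000001", "key0000002"}));
    }
}
```

This keeps the query-log heap (and its GC pressure) entirely out of the JVM hosting the cache under test; some coarse batching over the wire would likely be needed for throughput, as the update above notes.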

Related

Reaching memory limit launching a task

I have a long task to run in my App Engine application, with a lot of datastore entries to compute. It worked well with a small amount of data, but since yesterday I've suddenly been getting more than a million datastore entries to compute per day. After running for a while (around 2 minutes), the task fails with a 202 exit code (HTTP error 500). I really cannot deal with this issue; it is pretty much undocumented. The only information I was able to find is that it probably means my app is running out of memory.
The task is simple. Each entry in the datastore contains a non-unique string identifier and a long number. The task sums the numbers and stores the identifiers into a set.
My budget is really low, since my app is entirely free and without ads, and I would like to keep the app's cost from soaring. I would like to find a cheap and simple solution to this issue.
Edit:
I read Objectify documentation thoroughly tonight, and I found that the session cache (which ensures entities references consistency) can consume a lot of memory and should be cleared regularly when performing a lot of requests (which is my case). Unfortunately, this didn't help.
It's possible to stay within the free quota but it will require a little extra work.
In your case, you should split this operation into smaller batches (e.g. process 1,000 entities per batch) and queue those smaller tasks to run sequentially during off hours. That should save you from the memory issue and allow you to scale beyond your current entity count.
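A hedged sketch of the batching idea in plain Java. The datastore iteration is faked with an in-memory list here; on App Engine each batch would be its own queued task resuming from a stored cursor, and the task described in the question (sum the numbers, collect the identifiers) maps directly onto the per-batch loop:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchedSum {
    static final int BATCH_SIZE = 1000;

    // One datastore entry: a non-unique string identifier plus a long number,
    // as described in the question.
    record Entry(String id, long number) {}

    // Processes entries in fixed-size batches so per-batch memory stays bounded.
    static long sumInBatches(List<Entry> entries, Set<String> ids) {
        long total = 0;
        for (int from = 0; from < entries.size(); from += BATCH_SIZE) {
            int to = Math.min(from + BATCH_SIZE, entries.size());
            for (Entry e : entries.subList(from, to)) {  // one batch; on App Engine
                total += e.number();                     // this would be one queued
                ids.add(e.id());                         // task resuming from a cursor
            }
            // A real task would clear any session cache here and re-queue itself.
        }
        return total;
    }

    public static void main(String[] args) {
        List<Entry> data = List.of(new Entry("a", 1), new Entry("b", 2), new Entry("a", 3));
        Set<String> ids = new HashSet<>();
        System.out.println(sumInBatches(data, ids) + " " + ids.size()); // 6 2
    }
}
```

The running total and identifier set are the only state carried across batches, which is what keeps memory use flat regardless of how many entries the datastore holds.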

Key/Value store extremely slow on SSD

What I am sure of:
I am working with Java/Eclipse on Linux and trying to store a very large number of key/value pairs of 16/32 bytes respectively on disk. Keys are fully random, generated with SecureRandom.
The speed is constant at ~50000 inserts/sec until it reaches ~1 million entries.
Once this limit is reached, the java process oscillates every 1-2 seconds from 0% CPU to 100%, from 150MB of memory to 400MB, and from 10 inserts/sec to 100.
I tried with both Berkeley DB and Kyoto Cabinet and with both Btrees and Hashtables. Same results.
What might contribute :
It's writing on SSD.
For every insert there is, on average, 1.5 reads, alternating reads and writes constantly.
I suspect the nice 50000 rate is up until some cache/buffer limit is reached. Then the big slow down might be due to SSD not handling read/write mixed together, as suggested on this question : Low-latency Key-Value Store for SSD.
The question is:
Where might this extreme slowdown come from? It can't be entirely the SSD's fault. Lots of people happily use SSDs for high-speed database processing, and I'm sure they mix reads and writes a lot.
Thanks.
Edit: I've made sure to remove any memory limit, and the java process always has room to allocate more memory.
Edit: Removing the reads and doing inserts only does not change the problem.
Last edit: For the record, with hash tables it seems related to the initial number of buckets. In Kyoto Cabinet that number cannot be changed and defaults to ~1 million, so you'd better get the number right at creation time (1 to 4 times the maximum number of records to store). BDB is designed to grow the number of buckets progressively, but since that is resource-consuming, it's better to predefine the number in advance.
Your problem might be related to the strong durability guarantees of the databases you are using.
Basically, for any database that is ACID-compliant, at least one fsync() call per database commit will be necessary. This has to happen in order to guarantee durability (otherwise, updates could be lost in case of a system failure), but also to guarantee internal consistency of the database on disk. The database API will not return from the insert operation before the completion of the fsync() call.
fsync() can be a very heavy-weight operation on many operating systems and disk hardware, even on SSDs. (An exception to that would be battery- or capacitor-backed enterprise SSDs - they can treat a cache flush operation basically as a no-op to avoid exactly the delay you are probably experiencing.)
A solution would be to do all your stores inside of one big transaction. I don't know about Berkeley DB, but for sqlite, performance can be greatly improved that way.
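The cost being described is easy to see with plain file I/O, independent of any database: forcing every write to stable storage versus forcing once at the end, which is roughly what one big transaction buys you. This is a sketch, not BDB-specific; `FileChannel.force` plays the role of `fsync()` here:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncCost {
    // Writes n small records. If forceEach is true, flush to disk after every
    // write (per-commit durability); otherwise flush only once at the end
    // (the "one big transaction" pattern). Returns elapsed nanoseconds.
    static long write(Path file, int n, boolean forceEach) throws IOException {
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < n; i++) {
                ch.write(ByteBuffer.wrap(("record" + i + "\n").getBytes()));
                if (forceEach) ch.force(false);   // the per-commit fsync()
            }
            ch.force(false);                      // final flush either way
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("fsync", ".dat");
        long perWrite = write(f, 200, true);
        long once = write(f, 200, false);
        System.out.printf("per-write flush: %d ns, single flush: %d ns%n",
                perWrite, once);
        Files.delete(f);
    }
}
```

On most hardware the per-write variant is dramatically slower, which is exactly the gap that batching commits into one transaction closes.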
To figure out if that is your problem at all, you could try to watch your database writing process with strace and look for frequent fsync() calls (more than a few every second would be a pretty strong hint).
Update:
If you are absolutely sure that you don't require durability, you can try the answer from Optimizing Put Performance in Berkeley DB; if you do, you should look into the TDS (transactional data storage) feature of Berkeley DB.

What does "costly" mean in terms of software operations?

What is meant by "the operation is costly" or "the resource is costly" in terms of software? When I come across certain documents, they mention things like "opening a file every time is a costly operation". I have more examples like this (a database connection is a costly operation, a thread pool is a cheaper one, etc.). On what basis is it decided whether a task or operation is costly or cheap? What constraints should we consider when calculating this? Is it based on time as well?
Note: I already searched the net for this but didn't find any good explanation. If you have found one, kindly share it with me and I can close this.
Expensive or costly operations are those which cause a lot of resources to be used, such as the CPU, disk drive(s), or memory.
For example, creating an integer variable in code is not a costly or expensive operation.
By contrast, creating a connection to a remote server that hosts a relational database, querying several tables, and returning a large result set to iterate over while remaining connected to the data source would be (relatively) expensive or costly, compared with my first example with the integer.
In order to build scalable, fast applications you would generally want to minimize the frequency of performing these costly/expensive actions, applying techniques of optimisation, caching, parallelism (etc) where they are essential to the operation of the software.
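One of the techniques mentioned above, caching, can be sketched with `computeIfAbsent`: the expensive call runs once per key, and repeat requests are served from memory. The `expensiveLookup` method here is a hypothetical stand-in for something costly like a remote database query:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LookupCache {
    static final Map<String, String> cache = new ConcurrentHashMap<>();
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    // Hypothetical stand-in for a costly operation (e.g. a remote query).
    static String expensiveLookup(String key) {
        expensiveCalls.incrementAndGet();
        return key.toUpperCase();
    }

    // Only cache misses pay the full cost; hits are a cheap in-memory map read.
    static String get(String key) {
        return cache.computeIfAbsent(key, LookupCache::expensiveLookup);
    }

    public static void main(String[] args) {
        get("user-42");
        get("user-42");   // second call is served from the cache
        System.out.println(expensiveCalls.get()); // 1
    }
}
```

The design choice is the usual cost trade: spend a little memory to avoid repeating an operation whose dominant cost is elsewhere (network, disk, CPU).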
To get a degree of accuracy and some actual numbers on what is 'expensive' and what is 'cheap' in your application, you would employ some sort of profiling or analysis tool. For JavaScript, there is ySlow - for .NET applications, dotTrace - I'd be certain that whatever the platform, a similar solution exists. It's then down to someone to comprehend the output, which is probably the most important part!
Running time, memory use or bandwidth consumption are the most typical interpretations of "cost". Also consider that it may apply to cost in development time.
I'll try explain through some examples:
If you need to edit two fields in each row of a database and you do it one field at a time, that will take close to twice as long as doing both at the same time.
That extra time isn't just your own wasted time: it's also a connection held open longer than needed and memory occupied longer than needed, and at the end of the day your efficiency goes down the drain.
When you start scaling, a very small amount of wasted time grows into a very big waste of company resources.
It is almost certainly talking about a time penalty to perform that kind of input / output. Lots of memory shuffling (copying of objects created from classes with lots of members) is another time waster (pass by reference helps eliminate a lot of this).
Usually "costly" means, in a very simplified way, that it'll take much longer than an operation in memory.
For instance, accessing a file in your file system and reading each line takes much longer than simply iterating over a list of the same size in memory.
The same can be said about database operations: they take much longer than in-memory operations, so some caution should be used not to abuse them.
This is, I repeat, a very simplistic explanation. Exactly what costly means depends on your particular context, the number of operations you're performing, and the overall architecture of the system.

BerkeleyDB write performance problems

I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.
I'm trying the C BerkeleyDB (5.1.25) library from java and I'm seeing serious performance problems.
I get solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents the performance drops like a rock, then it recovers for a while, then drops again, etc. This happens more and more frequently, up to the point where most of the time I can't get more than 60 docs/s with a few isolated peaks of 12K docs/s after 10 million docs. My db type of choice is HASH but I also tried BTREE and it is the same.
I tried using a pool of 10 db's and hashing the docs among them to smooth out the performance drops; this increased the write throughput to 50K docs/s but didn't help with the performance drops: all 10 db's slowed to a crawl at the same time.
I presume that the files are being reorganized, and I tried to find a config parameter that affects when this reorganization takes place, so each of the pooled db's would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, reserving space using the setHashNumElements config option so it wouldn't spend time growing the file, but every tweak made it much worse.
I'm about to give berkeleydb up and try much more complex solutions like cassandra, but I want to make sure I'm not doing something wrong in berkeleydb before writing it off.
Anybody here with experience achieving sustained write performance with berkeleydb?
Edit 1:
I tried several things already:
Throttling the writes down to 500/s (less than the average I got after writing 30 million docs in 15 hours, which indicates the hardware is capable of writing 550 docs/s). It didn't work: once a certain number of docs has been written, performance drops regardless.
Write incoming items to a queue. This has two problems: A) It defeats the purpose of freeing up ram. B) The queue eventually blocks because the periods during which BerkeleyDB freezes get longer and more frequent.
In other words, even if I throttle the incoming data to stay below the hardware capability and use ram to hold items while BerkeleyDB takes some time to adapt to the growth, as this time gets increasingly longer, performance approaches 0.
This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...
Edit 2:
After giving it some more thought and with Peter's input, I now understand that as the file grows larger, a batch of writes will get spread farther apart and the likelihood of them falling into the same disk cylinder drops, until it eventually reaches the seeks/second limitation of the disk.
But BerkeleyDB's periodic file reorganizations are killing performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer periods of time while it shuffles stuff around. Using faster disks or spreading the database files among different disks does not help. I need to find a way around those throughput holes.
What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning performance up to that point), but once it fills, the application, and even the whole system, can slow dramatically, even stop.
Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion supported by clever caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.
I suggest you consider a disk controller cache. Its battery-backed memory would need to be about the size of your data.
Another option is to use SSD drives if the updates are bursty (they can do 10K+ writes per second, as they have no moving parts); with caching, this should give you more than you need, but SSDs have a limited number of writes.
BerkeleyDB does not perform file reorganizations, unless you're manually invoking the compaction utility. There are several causes of the slowdown:
Writes to keys in random access fashion, which causes much higher disk I/O load.
Writes are durable by default, which forces a lot of extra disk flushes.
Transactional environment is being used, in which case checkpoints cause a slowdown when flushing changes to disk.
When you say "documents", do you mean to say that you're using BDB for storing records larger than a few kbytes? BDB overflow pages have more overhead, and so you should consider using a larger page size.
This is an old question and the problem is probably gone, but I have recently had similar problems (speed of insert dropping dramatically after few hundred thousand records) and they were solved by giving more cache to the database (DB->set_cachesize). With 2GB of cache the insert speed was very good and more or less constant up to 10 million records (I didn't test further).
We have used BerkeleyDB (BDB) at work and have seen similar performance trends. BerkeleyDB uses a B-tree to store its key/value pairs. As the number of entries keeps increasing, the depth of the tree increases. BerkeleyDB's caching works by loading trees into RAM so that a tree traversal does not incur file I/O (reading from disk).
I need a disk-based key-value store that can sustain high write and read performance for large data sets.
Chronicle Map is a modern solution for this task. It's much faster than BerkeleyDB on both reads and writes, and is much more scalable in terms of concurrent access from multiple threads/processes.

Should I consider parallelism in statistical calculations?

We are going to implement software for various statistical analyses in Java. The main concept is to get an array of points on a graph, then iterate through it and find some results (like looking for the longest rising sequence and various indicators).
Problem: lots of data
Problem 2: it must also work on a client's PC, not only the server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: Is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here ...
The main point of using parallel processing is the presence of a large amount of data, or large computations, that can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting it into parts 1..1000, 1001..2000, 2001..3000, etc., processing each part, and then accumulating the results with *. On the other hand, you cannot split the task of computing a big Fibonacci number, since later ones depend on previous ones.
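A hedged sketch of that factorial example using parallel streams: the framework splits the range into independent sub-products on worker threads and combines them with multiply, which works precisely because multiplication is associative:

```java
import java.math.BigInteger;
import java.util.stream.LongStream;

public class ParallelFactorial {
    // Sequential reference implementation for comparison.
    static BigInteger sequential(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // Parallel version: independent sub-products are computed concurrently
    // and reduced with multiply - the associativity that makes the split valid.
    static BigInteger parallel(int n) {
        return LongStream.rangeClosed(2, n)
                .parallel()
                .mapToObj(BigInteger::valueOf)
                .reduce(BigInteger.ONE, BigInteger::multiply);
    }

    public static void main(String[] args) {
        System.out.println(parallel(10000).equals(sequential(10000))); // true
    }
}
```

The Fibonacci counter-example fails exactly this test: there is no associative combine step, since each term needs the previous two before it can start.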
The same goes for large amounts of data. If you have collected an array of points and want to find some concrete points (bigger than some constant, the max of all) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "ongoing" information (longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the number of cores on the user's PC (the same goes for computing clusters: do not split the task into more subtasks than the number of computers in the cluster).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also search for more specific Java libraries which allow evaluating in parallel. For example, Parallel Colt implements high performance concurrent algorithms for work with big matrices, and there're lots of such libraries for many data representations.
In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores like Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads like Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
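A hedged sketch of that recipe: a fixed pool sized by `Runtime.getRuntime().availableProcessors()`, with the point array split into equal slices whose partial sums are submitted as tasks (summing coordinates stands in for whatever per-point statistic is actually needed):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledSum {
    // Sums an array by splitting it into one slice per worker thread.
    static long parallelSum(long[] points) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int sliceSize = (points.length + threads - 1) / threads;
            List<Future<Long>> partials = new ArrayList<>();
            for (int from = 0; from < points.length; from += sliceSize) {
                final int lo = from, hi = Math.min(from + sliceSize, points.length);
                partials.add(pool.submit(() -> {   // one independent slice per task
                    long sum = 0;
                    for (int i = lo; i < hi; i++) sum += points[i];
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> f : partials) total += f.get();  // combine partials
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] points = new long[1000];
        for (int i = 0; i < points.length; i++) points[i] = i + 1;
        System.out.println(parallelSum(points)); // 500500
    }
}
```

This only pays off when each slice does enough work to amortize the thread-coordination overhead, which loops back to the earlier point about describing the concrete computation first.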
If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and, maybe, we'll be able to provide you more help.
