memcached and performance - java

I might be asking a very basic question, but I could not find a clear answer by googling, so I'm putting it here.
Memcached caches information in a separate process. Fetching cached information therefore requires inter-process communication, which in Java generally means serialization. In other words, to fetch a cached object we typically have to receive a serialized object and transport it over the network.
Serialization and network communication are both costly operations. If memcached needs both of them (generally speaking; there may be cases where network communication is not required), how is memcached fast? Isn't replication a better solution?
Or is this a tradeoff of distribution/platform independence/scalability vs. performance?

You are right that looking something up in a shared cache (like memcached) is slower than looking it up in a local cache (which is what I think you mean by "replication").
However, the advantage of a shared cache is that it is shared, which means each user of the cache has access to more cache than if the memory was used for a local cache.
Consider an application with a 50 GB database, with ten app servers, each dedicating 1 GB of memory to caching. If you used local caches, then each machine would have 1 GB of cache, equal to 2% of the total database size. If you used a shared cache, then you have 10 GB of cache, equal to 20% of the total database size. Cache hits would be somewhat faster with the local caches, but the cache hit rate would be much higher with the shared cache. Since cache misses are astronomically more expensive than either kind of cache hit, slightly slower hits are a price worth paying to reduce the number of misses.
Now, the exact tradeoff does depend on the exact ratio of the costs of a local hit, a shared hit, and a miss, and also on the distribution of accesses over the database. For example, if all the accesses were to a set of 'hot' records that were under 1 GB in size, then the local caches would give a 100% hit rate, and would be just as good as a shared cache. Less extreme distributions could still tilt the balance.
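To put rough numbers on that (this arithmetic is my illustration, not part of the original answer; the latencies are assumptions, and the hit rates come from the 2%/20% example above under uniformly distributed accesses):

```java
// Illustrative only: expected cost per lookup = hitRate * hitCost + (1 - hitRate) * missCost.
// All latencies below are assumptions for the sake of the example, not measurements.
public class CacheCostSketch {
    public static void main(String[] args) {
        double localHitUs = 1;       // assumed in-process cache hit (~1 microsecond)
        double sharedHitUs = 500;    // assumed memcached hit over the network (~0.5 ms)
        double missUs = 10_000;      // assumed cache miss that goes to the database (~10 ms)

        // Hit rates taken from the 2% vs. 20% example, assuming uniform access.
        double localExpected  = expectedCost(0.02, localHitUs,  missUs);
        double sharedExpected = expectedCost(0.20, sharedHitUs, missUs);

        System.out.printf("local cache : %.0f us per lookup%n", localExpected);
        System.out.printf("shared cache: %.0f us per lookup%n", sharedExpected);
        // With these assumptions the shared cache wins despite its slower hits,
        // because misses dominate the expected cost.
    }

    static double expectedCost(double hitRate, double hitCost, double missCost) {
        return hitRate * hitCost + (1 - hitRate) * missCost;
    }
}
```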
In practice, the optimum configuration will usually (IMHO!) be to have a small but very fast local cache for the hottest data, then a larger and slower cache for the long tail. You will probably recognise that as the shape of other cache hierarchies: consider the way that processors have small, fast L1 caches for each core, then slower L2/L3 caches shared between all the cores on a single die, then perhaps yet slower off-chip caches shared by all the dies in a system (do any current processors actually use off-chip caches?).
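A sketch of that kind of hierarchy in Java (my illustration; RemoteCache and Database are hypothetical placeholders for memcached and the backing store):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interfaces standing in for memcached and the database.
interface RemoteCache { Object get(String key); void put(String key, Object value); }
interface Database    { Object load(String key); }

// Two-tier look-aside cache: a small, fast local tier for hot keys,
// a larger shared tier for the long tail, and the database on a miss in both.
class TieredCache {
    private final Map<String, Object> local =
        new LinkedHashMap<String, Object>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Object> e) {
                return size() > 10_000;   // cap the local tier; LRU eviction
            }
        };
    private final RemoteCache shared;
    private final Database db;

    TieredCache(RemoteCache shared, Database db) { this.shared = shared; this.db = db; }

    synchronized Object get(String key) {
        Object v = local.get(key);                // tier 1: in-process
        if (v != null) return v;
        v = shared.get(key);                      // tier 2: shared cache over the network
        if (v == null) {
            v = db.load(key);                     // miss in both tiers: go to the database
            if (v != null) shared.put(key, v);    // populate the shared tier
        }
        if (v != null) local.put(key, v);         // promote into the local tier
        return v;
    }
}
```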

You are neglecting the cost of disk I/O in your consideration, which is generally going to be the slowest part of any process, and is the main driver IMO for utilizing in-memory caching like memcached.

Memory caches use RAM over the network. Replication uses both RAM and persistent disk storage to fetch data. Their purposes are very different.
If you're only thinking of using memcached to store easily obtainable data, such as a 1-1 mapping for table records, you're going to have a bad time.
On the other hand, if your data is the entire result set of a complex SQL query that may even overflow the SQL memory pool (and need to be temporarily written to disk before it can be fetched), you're going to see a big speed-up.
The previous example mentions writing data to disk for a read operation - yes, it happens if the result set is too big for memory (imagine a CROSS JOIN), which means that you both read from and write to that drive (thrashing comes to mind).
In a highly optimized application written in C, for example, you may have a total processing time of 1 microsecond, and you may need to wait for networking and/or serialization/deserialization (marshalling/unmarshalling) for much longer than the application's execution time itself. That's when you'll begin to feel the limitations of caching in memory over the network.
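To make the expensive-result-set example concrete, here is a minimal look-aside caching sketch using the spymemcached client. It's my addition, not part of the answer; it assumes a memcached instance on localhost:11211, a hypothetical runExpensiveQuery helper, and a Serializable value (spymemcached's default transcoder serializes objects).

```java
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.util.List;
import net.spy.memcached.MemcachedClient;

public class ReportCache {
    // Hypothetical result holder; must be Serializable for memcached's default transcoder.
    static class ReportRow implements Serializable { /* columns of the expensive query */ }

    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "report:monthly";                    // illustrative cache key
        @SuppressWarnings("unchecked")
        List<ReportRow> rows = (List<ReportRow>) client.get(key);

        if (rows == null) {
            rows = runExpensiveQuery();                   // hypothetical: the costly SQL query
            client.set(key, 600, rows);                   // cache the whole result set for 10 minutes
        }
        // ... use rows ...
        client.shutdown();
    }

    static List<ReportRow> runExpensiveQuery() {
        // Placeholder for the complex, result-set-heavy SQL query described above.
        return List.of();
    }
}
```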

Related

java fastest concurrent random file R/W method for SSDs without memory swap

I have a Linux box with 32GB of RAM and a set of 4 SSDs in a RAID 0 config that maxes out at about 1GB/s of throughput (random 4k reads), and I am trying to determine the best way of accessing files on them randomly and concurrently using Java. The two main ways I have seen so far are random access files and mapped direct byte buffers.
Here's where it gets tricky, though. I have my own memory cache for objects, so any call to the objects stored in a file should go through to disk and not paged memory (I have disabled the swap space on my Linux box to prevent this). Whilst mapped direct memory buffers are supposedly the fastest, they rely on swapping, which is not good because A) I am using all the free memory for the object cache, and using MappedByteBuffers instead would incur a massive serialization overhead, which is what the object cache is there to prevent (my program is already CPU limited); B) with MappedByteBuffers the OS handles the details of when data is written to disk, and I need to control this myself, i.e. when I write(byte[]) it should go straight out to disk instantly; this is to prevent data corruption in case of power failure, as I am not using ACID transactions.
On the other hand, I need massive concurrency, i.e. I need to read and write multiple locations in the same file at the same time (whilst using offset/range locks to prevent data corruption). I'm not sure how I can do this without MappedByteBuffers. I could always just queue the reads/writes, but I'm not sure how that would negatively affect my throughput.
Finally, I cannot have a situation where I am creating new byte[] objects for reads or writes; I perform almost 100,000 read/write operations per second, and allocating and garbage-collecting all those objects would kill my program, which is time sensitive and already CPU limited (reusing byte[] objects is fine, though).
Please do not suggest any DB software, as I have tried most of them and they add too much complexity and CPU overhead.
Anybody had this kind of dilemma?
Whilst mapped direct memory buffers are supposedly the fastest they rely on swapping
No, not if you have enough RAM. The mapping associates pages in memory with pages on disk. Unless the OS decides that it needs to recover RAM, the pages won't be swapped out. And if you are running short of RAM, all that disabling swap does is cause a fatal error rather than a performance degradation.
I am using all the free memory for the object cache
Unless your objects are extremely long-lived, this is a bad idea because the garbage collector will have to do a lot of work when it runs. You'll often find that a smaller cache results in higher overall throughput.
with MappedByteBuffers the OS handles the details of when data is written to disk, I need to control this myself, i.e. when I write(byte[]) it goes straight out to disk instantly
Actually, it doesn't, unless you've mounted your filesystem with the sync option. And then you still run the risk of data loss from a failed drive (especially in RAID 0).
I'm not sure how I can do this without MappedByteBuffers
A RandomAccessFile will do this. However, you'll be paying for at least a kernel context switch on every write (and if you have the filesystem mounted for synchronous writes, each of those writes will involve a disk round-trip).
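For illustration (my sketch, not the answerer's): a RandomAccessFile opened in "rwd" mode gives synchronous content writes, and its FileChannel supports positional reads and writes, so multiple threads can target different offsets of the same file without an explicit seek.

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class SyncRandomIo {
    public static void main(String[] args) throws Exception {
        // "rwd" requires each update to the file's content to reach the device
        // before the write call returns, at the cost of a disk round-trip per write.
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rwd")) {
            FileChannel ch = raf.getChannel();

            ByteBuffer record = ByteBuffer.allocate(64);   // reusable buffer, no per-op byte[]
            record.put("example payload".getBytes()).flip();

            long offset = 4096;                            // arbitrary record offset
            ch.write(record, offset);                      // positional write, no seek needed

            record.clear();
            ch.read(record, offset);                       // positional read at a given offset
        }
    }
}
```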
I am not using ACID transactions
Then I guess the data isn't really that valuable. So stop worrying about the possibility that someone will trip over a power cord.
Your objections to mapped byte buffers don't hold up. Your mapped files will be distinct from your object cache, and though they take address space they don't consume RAM. You can also sync your mapped byte buffers whenever you want (at the cost of some performance). Moreover, random access files end up using the same apparatus under the covers, so you can't save any performance there.
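For reference, a small sketch (my illustration) of mapping a file region and syncing it explicitly; MappedByteBuffer.force() flushes that region's changes to the storage device, at the cost of some performance.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRegionSync {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
             FileChannel ch = raf.getChannel()) {

            long regionSize = 1 << 20;                     // map a 1 MB region (illustrative)
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, regionSize);

            buf.put(128, (byte) 42);                       // absolute put at offset 128
            buf.force();                                   // flush this region's changes to disk
        }
    }
}
```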
If mapped byte buffers aren't getting you the performance you need, you might have to bypass the filesystem and write directly to raw partitions (which is what DBMSs do). To do that, you would probably need to write C++ code for your data handling and access it through JNI.

Key/Value store extremely slow on SSD

What I am sure of:
I am working with Java/Eclipse on Linux and trying to store a very large number of key/value pairs of 16/32 bytes respectively on disk. Keys are fully random, generated with SecureRandom.
The speed is constant at ~50000 inserts/sec until it reaches ~1 million entries.
Once this limit is reached, the java process oscillates every 1-2 seconds from 0% CPU to 100%, from 150MB of memory to 400MB, and from 10 inserts/sec to 100.
I tried with both Berkeley DB and Kyoto Cabinet and with both Btrees and Hashtables. Same results.
What might contribute:
It's writing to an SSD.
For every insert there are on average 1.5 reads - alternating reads and writes constantly.
I suspect the nice 50000 inserts/sec rate holds until some cache/buffer limit is reached. Then the big slowdown might be due to the SSD not handling mixed reads and writes well, as suggested in this question: Low-latency Key-Value Store for SSD.
Question is:
Where might this extreme slowdown come from? It can't be all the SSD's fault. Lots of people happily use SSDs for high-speed DB workloads, and I'm sure they mix reads and writes a lot.
Thanks.
Edit: I've made sure to remove any memory limit, and the java process always has room to allocate more memory.
Edit: Removing reads and doing inserts only does not change the problem.
Last Edit: For the record, for hash tables it seems related to the initial number of buckets. On Kyoto Cabinet that number cannot be changed and defaults to ~1 million, so it's better to get the number right at creation time (1 to 4 times the maximum number of records to store). BDB is designed to grow the number of buckets progressively, but as that is resource consuming, it's better to predefine the number in advance.
Your problem might be related to the strong durability guarantees of the databases you are using.
Basically, for any database that is ACID-compliant, at least one fsync() call per database commit will be necessary. This has to happen in order to guarantee durability (otherwise, updates could be lost in case of a system failure), but also to guarantee internal consistency of the database on disk. The database API will not return from the insert operation before the completion of the fsync() call.
fsync() can be a very heavy-weight operation on many operating systems and disk hardware, even on SSDs. (An exception to that would be battery- or capacitor-backed enterprise SSDs - they can treat a cache flush operation basically as a no-op to avoid exactly the delay you are probably experiencing.)
A solution would be to do all your stores inside of one big transaction. I don't know about Berkeley DB, but for sqlite, performance can be greatly improved that way.
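As an illustration of that batching idea (mine, not the answerer's), here is a rough sketch against the Berkeley DB Java Edition API; it assumes the environment and database were opened with transactions enabled, and the names differ slightly if you use the C library's Java binding.

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.Transaction;

public class BatchedInserts {
    // Commit once per batch instead of once per record, so the fsync() cost
    // is paid once for the whole batch rather than for every insert.
    static void insertBatch(Environment env, Database db, byte[][] keys, byte[][] values) {
        Transaction txn = env.beginTransaction(null, null);
        try {
            for (int i = 0; i < keys.length; i++) {
                db.put(txn, new DatabaseEntry(keys[i]), new DatabaseEntry(values[i]));
            }
            txn.commit();          // one durable commit (one sync) for the whole batch
        } catch (RuntimeException e) {
            txn.abort();
            throw e;
        }
    }
}
```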
To figure out if that is your problem at all, you could try to watch your database writing process with strace and look for frequent fsync() calls (more than a few every second would be a pretty strong hint).
Update:
If you are absolutely sure that you don't require durability, you can try the answer from Optimizing Put Performance in Berkeley DB; if you do, you should look into the TDS (Transactional Data Store) feature of Berkeley DB.

What does "costly" mean in terms of software operations?

What is meant by "the operation is costly" or "the resource is costly" in terms of software? When I come across some documents, they mention things like "opening a file every time is a costly operation". I can give more examples like this (a database connection is a costly operation, a thread pool is a cheaper one, etc.). On what basis is it decided whether a task or operation is costly or cheap? What constraints should be considered when deciding this? Is it based on time alone?
Note: I have already searched the net for this but didn't find a good explanation. If you have found one, kindly share it with me and I can close this.
Expensive or costly operations are those which cause a lot of resources to be used, such as the CPU, disk drive(s) or memory.
For example, creating an integer variable in code is not a costly or expensive operation.
By contrast, creating a connection to a remote server that hosts a relational database, querying several tables and returning a large results set before iterating over it while remaining connected to the data source would be (relatively) expensive or costly, as opposed to my first example with the Integer.
In order to build scalable, fast applications you would generally want to minimize the frequency of performing these costly/expensive actions, applying techniques of optimisation, caching, parallelism (etc) where they are essential to the operation of the software.
To get a degree of accuracy and some actual numbers on what is 'expensive' and what is 'cheap' in your application, you would employ some sort of profiling or analysis tool. For JavaScript, there is YSlow; for .NET applications, dotTrace - I'd be certain that whatever the platform, a similar solution exists. It's then down to someone to comprehend the output, which is probably the most important part!
Running time, memory use or bandwidth consumption are the most typical interpretations of "cost". Also consider that it may apply to cost in development time.
I'll try to explain through some examples:
If you need to edit two fields in each row of a database, doing it one field at a time will take close to twice as long as doing both at the same time.
This extra time is not only a waste of your time; it also means a connection held open longer than needed and memory occupied longer than needed, and at the end of the day your efficiency goes down the drain.
When you start scaling, very small amounts of wasted time grow into a very big waste of company resources.
It is almost certainly talking about a time penalty to perform that kind of input / output. Lots of memory shuffling (copying of objects created from classes with lots of members) is another time waster (pass by reference helps eliminate a lot of this).
Usually, costly means, in a very simplified way, that it will take much longer than an operation on memory.
For instance, accessing a file in your file system and reading each line takes much longer than simply iterating over a list of the same size in memory.
The same can be said about database operations: they take much longer than in-memory operations, so some caution should be used not to abuse them.
This is, I repeat, a very simplistic explanation. Exactly what costly means depends on your particular context, the number of operations you're performing, and the overall architecture of the system.
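As a rough illustration of the file-versus-memory comparison above (my addition; a crude timing, not a proper benchmark):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CheapVsCostly {
    public static void main(String[] args) throws Exception {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) lines.add("line " + i);

        // "Cheap": iterate over the list already held in memory.
        long t0 = System.nanoTime();
        long total = 0;
        for (String s : lines) total += s.length();
        long inMemoryNs = System.nanoTime() - t0;

        // "Costly": write the same data to disk and read it back line by line.
        Path tmp = Files.createTempFile("costly", ".txt");
        Files.write(tmp, lines);
        long t1 = System.nanoTime();
        total = 0;
        for (String s : Files.readAllLines(tmp)) total += s.length();
        long fromDiskNs = System.nanoTime() - t1;

        System.out.printf("in memory: %d us, from disk: %d us (checksum %d)%n",
                inMemoryNs / 1_000, fromDiskNs / 1_000, total);
        Files.delete(tmp);
    }
}
```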

BerkeleyDB write performance problems

I need a disk-based key-value store that can sustain high write and read performance for large data sets. Tall order, I know.
I'm trying the C BerkeleyDB (5.1.25) library from java and I'm seeing serious performance problems.
I get solid 14K docs/s for a short while, but as soon as I reach a few hundred thousand documents the performance drops like a rock, then it recovers for a while, then drops again, etc. This happens more and more frequently, up to the point where most of the time I can't get more than 60 docs/s with a few isolated peaks of 12K docs/s after 10 million docs. My db type of choice is HASH but I also tried BTREE and it is the same.
I tried using a pool of 10 db's and hashing the docs among them to smooth out the performance drops; this increased the write throughput to 50K docs/s but didn't help with the performance drops: all 10 db's slowed to a crawl at the same time.
I presume that the files are being reorganized, and I tried to find a config parameter that affects when this reorganization takes place, so each of the pooled db's would reorganize at a different time, but I couldn't find anything that worked. I tried different cache sizes, reserving space using the setHashNumElements config option so it wouldn't spend time growing the file, but every tweak made it much worse.
I'm about to give up on BerkeleyDB and try much more complex solutions like Cassandra, but I want to make sure I'm not doing something wrong in BerkeleyDB before writing it off.
Anybody here with experience achieving sustained write performance with BerkeleyDB?
Edit 1:
I tried several things already:
Throttling the writes down to 500/s (less than the average I got after writing 30 million docs in 15 hours, which indicates the hardware is capable of writing 550 docs/s). Didn't work: once a certain number of docs has been written, performance drops regardless.
Write incoming items to a queue. This has two problems: A) It defeats the purpose of freeing up ram. B) The queue eventually blocks because the periods during which BerkeleyDB freezes get longer and more frequent.
In other words, even if I throttle the incoming data to stay below the hardware capability and use ram to hold items while BerkeleyDB takes some time to adapt to the growth, as this time gets increasingly longer, performance approaches 0.
This surprises me because I've seen claims that it can handle terabytes of data, yet my tests show otherwise. I still hope I'm doing something wrong...
Edit 2:
After giving it some more thought and with Peter's input, I now understand that as the file grows larger, a batch of writes will get spread farther apart and the likelihood of them falling into the same disk cylinder drops, until it eventually reaches the seeks/second limitation of the disk.
But BerkeleyDB's periodic file reorganizations are killing performance much earlier than that, and in a much worse way: it simply stops responding for longer and longer periods of time while it shuffles stuff around. Using faster disks or spreading the database files among different disks does not help. I need to find a way around those throughput holes.
What I have seen with high rates of disk writes is that the system cache will fill up (giving lightning performance up to that point), but once it fills, the application, or even the whole system, can slow dramatically or even stop.
Your underlying physical disk should sustain at least 100 writes per second. Any more than that is an illusion supported by caching. ;) However, when the caching system is exhausted, you will see very bad behaviour.
I suggest you consider a disk controller cache. Its battery-backed memory would need to be about the size of your data.
Another option is to use SSD drives if the updates are bursty; they can do 10K+ writes per second as they have no moving parts. Combined with caching, this should give you more than you need, but SSDs have a limited number of writes.
BerkeleyDB does not perform file reorganizations, unless you're manually invoking the compaction utility. There are several causes of the slowdown:
Writes to keys in random access fashion, which causes much higher disk I/O load.
Writes are durable by default, which forces a lot of extra disk flushes.
Transactional environment is being used, in which case checkpoints cause a slowdown when flushing changes to disk.
When you say "documents", do you mean to say that you're using BDB for storing records larger than a few kbytes? BDB overflow pages have more overhead, and so you should consider using a larger page size.
This is an old question and the problem is probably gone, but I have recently had similar problems (insert speed dropping dramatically after a few hundred thousand records) and they were solved by giving more cache to the database (DB->set_cachesize). With 2GB of cache the insert speed was very good and more or less constant up to 10 million records (I didn't test further).
We have used BerkeleyDB (BDB) at work and have seen similar performance trends. BerkeleyDB uses a B-tree to store its key/value pairs. As the number of entries keeps increasing, the depth of the tree increases. BerkeleyDB's caching works by loading trees into RAM so that a tree traversal does not incur file I/O (reading from disk).
I need a disk-based key-value store that can sustain high write and read performance for large data sets.
Chronicle Map is a modern solution for this task. It's much faster than BerkeleyDB on both reads and writes, and is much more scalable in terms of concurrent access from multiple threads/processes.

Can I use Terracotta to scale a RAM-intensive application?

I'm evaluating Terracotta to help me scale up an application which is currently RAM-bound. It is a collaborative filter and stores about 2 kilobytes of data per user. I want to use Amazon's EC2, which means I'm limited to 14GB of RAM, which gives me an effective per-server upper bound of around 7 million users. I need to be able to scale beyond this.
Based on my reading so-far I gather that Terracotta can have a clustered heap larger than the available RAM on each server. Would it be viable to have an effective clustered heap of 30GB or more, where each of the servers only supports 14GB?
The per-user data (the bulk of which are arrays of floats) changes very frequently, potentially hundreds of thousands of times per minute. It isn't necessary for every single one of these changes to be synchronized to other nodes in the cluster the moment they occur. Is it possible to only synchronize some object fields periodically?
I'd say the answer is a qualified yes for this. Terracotta does allow you to work with clustered heaps larger than the size of a single JVM although that's not the most common use case.
You still need to keep in mind a) the working set size and b) the amount of data traffic. For a), there is some set of data that must be in memory to perform the work at any given time and if that working set size > heap size, performance will obviously suffer. For b), each piece of data added/updated in the clustered heap must be sent to the server. Terracotta is best when you are changing fine-grained fields in pojo graphs. Working with big arrays does not take the best advantage of the Terracotta capabilities (which is not to say that people don't use it that way sometimes).
If you are creating a lot of garbage, then the Terracotta memory managers and distributed garbage collector have to be able to keep up with that. It's hard to say without trying it whether your data volumes exceed the available bandwidth there.
Your application will benefit enormously if you run multiple servers and the data is partitioned by server or has some amount of locality of reference. In that case, you only need the data for one server's partition in heap, and the rest does not need to be faulted into memory. It will of course be faulted in if necessary for failover/availability if other servers go down. What this means is that in the case of partitioned data, you are not broadcasting to all nodes, only sending transactions to the server.
From a numbers point of view, it is possible to index 30GB of data, so that's not close to any hard limit.
