Java Memory aware cache

I am looking for some ideas, and maybe a concrete implementation if somebody knows of one, but I am willing to code the cache on my own.
I want a cache that holds only as many gigabytes as I configure. Compared to the rest of the app, the cache will use nearly 100% of the memory, so we can treat the memory used by the app as the cache size (plus garbage).
Are there methods for estimating how much memory is used? Or is it better to rely on soft references? Might soft references, with the JVM always running at the top of its memory limit, be very inefficient and burn lots of CPU cycles on memory cleaning? Can I do some analysis on existing objects, like a myObject.getMemoryUsage()?
A LinkedHashMap gives enough cache hits for my purpose, so I don't have to code some strategic caching monster, but I don't know how to solve this memory issue properly. Any ideas? I don't want OOMEs flying around.
What is best practice?

SoftReferences are not a great idea, as they tend to be cleared all at once. This means that when you take a performance hit from a GC, you also take a hit from having to rebuild your cache.
You can use Instrumentation.getObjectSize() to get the shallow size of an object and use reflection to obtain a deep size. However, doing this is relatively expensive and not something you want to be doing very often.
Why can't you limit the size to a number of objects? In fact, I would start with the simplest cache you can and only add what you really need.
LRU cache in Java.
EDIT: One way to track how much memory you are using is to serialize the value and store it as a byte[]. This can give you fairly precise control, but it can slow down your solution by up to 1000x. (Nothing comes for free ;)
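To illustrate the serialize-and-count idea, here is a minimal sketch, not a tuned implementation. The class name ByteBudgetCache and its methods are made up for this example. It assumes values are Serializable, that the length of the serialized byte[] is an acceptable proxy for memory use (key and map overhead are ignored), and that least-recently-used eviction is what you want.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache bounded by the total size of its serialized values. */
class ByteBudgetCache<K, V extends Serializable> {
    private final long maxBytes;
    private long usedBytes;
    // accessOrder = true: iteration starts at the least recently used entry
    private final LinkedHashMap<K, byte[]> entries = new LinkedHashMap<>(16, 0.75f, true);

    ByteBudgetCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    synchronized void put(K key, V value) {
        byte[] bytes = serialize(value);
        byte[] previous = entries.put(key, bytes);
        if (previous != null) {
            usedBytes -= previous.length;
        }
        usedBytes += bytes.length;
        // Evict least recently used entries until we are back under budget.
        // A single value larger than the budget ends up evicting itself.
        Iterator<Map.Entry<K, byte[]>> it = entries.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
    }

    @SuppressWarnings("unchecked")
    synchronized V get(K key) {
        byte[] bytes = entries.get(key); // also marks the entry as recently used
        if (bytes == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (V) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Cached value could not be deserialized", e);
        }
    }

    synchronized long usedBytes() {
        return usedBytes;
    }

    synchronized int size() {
        return entries.size();
    }

    private static byte[] serialize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Every get() pays for a deserialization, which is where the slowdown mentioned above comes from.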

I would recommend using the Java Caching System. If you want to roll your own, though, I'm not aware of any way to get an object's size in memory. Your best bet would be to extend AbstractMap and wrap the values in SoftReferences. Then you could set the Java heap size to the maximum size you want. Your implementation would also have to find and clean out stale entries, though. It's probably easier just to use JCS.
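A rough sketch of the SoftReference approach, under some assumptions: the class name SoftValueCache is made up, and for brevity it wraps a ConcurrentHashMap instead of extending AbstractMap. Cleared references are found through a ReferenceQueue and their entries removed on the next access.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache whose values may be reclaimed by the GC under memory pressure. */
class SoftValueCache<K, V> {
    /** Soft reference that remembers its key so the stale map entry can be found later. */
    private static final class KeyedRef<K, V> extends SoftReference<V> {
        final K key;

        KeyedRef(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final ConcurrentHashMap<K, KeyedRef<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    void put(K key, V value) {
        purgeStale();
        map.put(key, new KeyedRef<>(key, value, queue));
    }

    /** Returns the cached value, or null if it was never cached or has been collected. */
    V get(K key) {
        purgeStale();
        KeyedRef<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    int size() {
        purgeStale();
        return map.size();
    }

    /** Drops map entries whose values the garbage collector has already cleared. */
    @SuppressWarnings("unchecked")
    private void purgeStale() {
        KeyedRef<K, V> ref;
        while ((ref = (KeyedRef<K, V>) queue.poll()) != null) {
            // Two-argument remove: leave the entry alone if the key was re-put meanwhile.
            map.remove(ref.key, ref);
        }
    }
}
```

The effective size limit is then whatever -Xmx allows, and callers must always be prepared for get() to return null.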

The problem with SoftReferences is that they give more work to the garbage collector. Although it doesn't meet your requirements, HBase has a very interesting strategy to prevent the cache from contributing to garbage collection pauses: they store the cache in native memory:
https://issues.apache.org/jira/browse/HBASE-4027
https://issues.apache.org/jira/secure/attachment/12488272/HBase-4027+%281%29.pdf
A good start for your use case would be to store all your data on disk. It might seem naive, but thanks to the OS I/O cache, frequently accessed data will reside in memory. I highly recommend reading these architecture notes from the Varnish caching system:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes

The best practice I find is to delegate the caching functionality outside of Java if possible. Java may be good at managing memory, but a dedicated caching system should be used for anything more than a simple LRU cache.
There is a large cost with GC when it kicks in.
EHCache is one of the more popular ones I know of. Java Caching System from another answer is good as well.
However, I generally offload that work to an underlying function (usually the JPA persistence layer by the application server, I let it get handled there so I don't have to deal with it on the application tier).
If you are caching other data such as web requests, http://hc.apache.org/httpclient-3.x/ is also another good candidate.
However, just remember that you also have a file system. There's absolutely nothing wrong with writing data you have retrieved to the file system. I've used this technique several times to fix out-of-memory errors caused by improper use of ByteArrayOutputStreams.

Related

Best way to synchronize cache data between two servers [closed]

I want to synchronize cached data between two servers. Both servers share the same database, but for faster execution I cache the data in a HashMap at startup.
So I want to synchronize the cached data without restarting the servers. (Both servers start at the same time.)
Please suggest the best and most efficient way to do this.

Instead of trying to synchronize the cached data between two server instances, why not centralize the caching instead using something like memcached/couchbase or redis? Using distributed caching with something like ehcache is far more complicated and error prone IMO vs centralizing the cached data using a caching server like those mentioned.
As an addendum to my original answer, when deciding what caching approach to use (in memory, centralized), one thing to take into account is the volatility of the data that is being cached.
If the data is stored in the DB, but does not change after the servers load it, then you don't even need synchronization between the servers. Just let them each load this static data into memory from the source and then go about their merry ways doing whatever it is they do. The data won't be changing, so no need to introduce a complicated pattern for keeping the data in sync between the servers.
If there is indeed a level of volatility in the data (say you are caching looked-up entity data from the DB in order to save hits to the DB), then I still think centralized caching is a better approach than in-memory distributed and synchronized caching. You just need to make sure that you use an appropriate expiration on the cached data to allow a natural refresh of the data from time to time. Also, you might want to just drop the cached data from the centralized store in the update path for a particular entity and then let it be reloaded into the cache on the next request for that data. This is IMO better than trying to do a true write-through cache where you write to the underlying store as well as the cache. The DB itself might make tweaks to the data (by defaulting unsupplied values, for example), and your cached data in that case might not match what's in the DB.
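The expire-and-drop pattern described here could look roughly like the following sketch. Everything in it is a stand-in: CacheAsideStore is a made-up name, and both the "centralized cache" and the "database" are plain in-process maps so the example runs on its own. A real setup would call a Redis or memcached client and a DAO in their place.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical cache-aside wrapper: entries expire, and updates drop the cached copy. */
class CacheAsideStore<K, V> {
    /** A cached value together with the time at which it stops being valid. */
    private static final class Timed<V> {
        final V value;
        final long expiresAtMillis;

        Timed(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Timed<V>> cache = new ConcurrentHashMap<>(); // stand-in for the central cache
    private final Map<K, V> database;                                 // stand-in for the real DB
    private final long ttlMillis;
    private final AtomicInteger databaseReads = new AtomicInteger();  // only here to make the demo observable

    CacheAsideStore(Map<K, V> database, long ttlMillis) {
        this.database = database;
        this.ttlMillis = ttlMillis;
    }

    V read(K key) {
        long now = System.currentTimeMillis();
        Timed<V> hit = cache.get(key);
        if (hit != null && hit.expiresAtMillis > now) {
            return hit.value; // fresh cache hit
        }
        // Miss or expired: reload from the source of truth.
        V fresh = database.get(key);
        databaseReads.incrementAndGet();
        if (fresh != null) {
            cache.put(key, new Timed<>(fresh, now + ttlMillis));
        } else {
            cache.remove(key);
        }
        return fresh;
    }

    void update(K key, V value) {
        database.put(key, value); // write to the DB only
        cache.remove(key);        // drop the cached copy; the next read reloads it
    }

    int databaseReads() {
        return databaseReads.get();
    }
}
```

Because update() never writes the new value into the cache, whatever the DB does to the row (defaults, triggers) is picked up on the next read.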
EDIT:
A question was asked in the comments about the advantages of a centralized cache (I'm guessing against something like an in-memory distributed cache). I'll provide my opinion on that, but first a standard disclaimer. Centralized caching is not a cure-all. It aims to solve specific issues related to in-jvm-memory caching. Before evaluating whether or not to switch to it, you should understand what your problems are first and see if they fit with the benefits of centralized caching. Centralized caching is an architectural change and it can come with issues/caveats of its own. Don't switch to it simply because someone says it's better than what you are doing. Make sure the reason fits the problem.
Okay, now onto my opinion for what kinds of problems centralized caching can solve vs in-jvm-memory (and possibly distributed) caching. I'm going to list two things although I'm sure there are a few more. My two big ones are: Overall Memory Footprint and Data Synchronization Issues.
Let's start with Overall Memory Footprint. Say you are doing standard entity caching to protect your relational DB from undue stress. Let's also say that you have a lot of data to cache in order to really protect your DB; say in the range of many GBs. If you are doing in-jvm-memory caching, and you had, say, 10 app server boxes, you would need to buy that additional memory ($$$) 10 times over, once for each box doing the caching in JVM memory. In addition, you would then have to allocate a larger heap to your JVM in order to accommodate the cached data. I'm of the opinion that the JVM heap should be small and streamlined in order to ease the garbage collection burden. If you have large chunks of Old Gen that can't be collected, then you're going to stress your garbage collector when it goes into a full GC and tries to reap something back from that bloated Old Gen space. You want to avoid long GC pause times, and bloating your Old Gen is not going to help with that. Plus, if your memory requirement is above a certain threshold, and you happen to be running 32-bit machines for your app layer, you'll have to upgrade to 64-bit machines, and that can be another prohibitive cost.
Now if you decided to centralize the cached data instead (using something like Redis or Memcached), you could significantly reduce the overall memory footprint of the cached data because you could have it on a couple of boxes instead of all of the app server boxes in the app layer. You probably want to use a clustered approach (both technologies support it) and at least two servers to give you high availability and avoid a single point of failure in your caching layer (more on that in a sec). By only having a couple of machines to support the memory requirement for caching, you can save some considerable $$. Also, you can tune the app boxes and the cache boxes differently now, as they are serving distinct purposes. The app boxes can be tuned for high throughput and a small heap, and the cache boxes can be tuned for large memory. And having smaller heaps will definitely help with the overall throughput of the app layer boxes.
Now one quick point for centralized caching in general. You should set up your application in such a way that it can survive without the cache in case it goes completely down for a period of time. In traditional entity caching, this means that when the cache goes completely unavailable, you just are hitting your DB directly for every request. Not awesome, but also not the end of the world.
Okay, now for Data Synchronization Issues. With distributed in-jvm-memory caching, you need to keep the cache in sync. A change to cached data in one node needs to replicate to the other nodes and be sync'd into their cached data. This approach is a little scary in that if for some reason (a network failure, for example) one of the nodes falls out of sync, then when a request goes to that node, the data the user sees will not match what's currently in the DB. Even worse, if they make another request and that hits a different node, they will see different data, and that will be confusing to the user. By centralizing the data, you eliminate this issue. Now, one could then argue that the centralized cache needs concurrency control around updates to the same cached data key. If two concurrent updates come in for the same key, how do you make sure the two updates don't stomp on each other? My thought here is to not even worry about this; when an update happens, drop the item from the cache (and write through directly to the DB) and let it be reloaded on the next read. It's safer and easier this way. If you really want to update both the cache and the DB on updates, then you can use CAS (Check-And-Set) functionality for optimistic concurrency control instead.
So to summarize, you can save money and better tune your app layer machines if you centralize the data they cache. You also can get better accuracy of that data, as you have fewer data synchronization issues to deal with. I hope this helps.
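For the CAS option, here is a sketch of what an optimistic check-and-set retry loop looks like. It uses ConcurrentHashMap.replace(key, expected, updated) purely as an in-process stand-in for a cache server's CAS operation; the class and method names are made up, and a real memcached or Redis client exposes this differently.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical helper showing an optimistic check-and-set retry loop. */
class CasUpdate {
    /**
     * Applies {@code change} to the cached value and retries if another writer got in first.
     * Returns the new value, or null if the key is not cached.
     */
    static <K, V> V update(ConcurrentHashMap<K, V> cache, K key, UnaryOperator<V> change) {
        while (true) {
            V current = cache.get(key);
            if (current == null) {
                return null; // nothing cached: the next read will load from the DB
            }
            V next = change.apply(current);
            // Succeeds only if the cached value is still equal to what we read.
            if (cache.replace(key, current, next)) {
                return next;
            }
            // Someone else updated the key in between; read again and retry.
        }
    }
}
```

Compared with simply dropping the key, this keeps the cache warm but adds a retry path that has to be tested under contention.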

First, do try to forget about premature optimization. Do you really need the cache? Chances are 99% that you do not. In that case your solution is to remove the redundant code.
If, however, you do need it, stop reinventing the wheel. There are perfectly good ready-to-use libraries, for example Ehcache, which has a distributed mode.

Use Hazelcast. It allows data synchronization between servers using a multicast protocol. It's easy to use. It supports locking and other features.

Can Terracotta's BigMemory Go be used without EHCache?

For an upcoming project I will keep a large amount of data (up to 10GB) in RAM, but not as a cache. Is it possible to use BigMemory (in particular Go, i.e. the free edition) without Ehcache, simply as non-garbage-collected memory storage? I have not found a clear answer in the docs, which mostly talk about the typical integration with Ehcache.
Thank you.

Yes, EhCache is the API for BigMemory:
BigMemory Go currently uses Ehcache as its user-facing data access API.

Basically, BigMemory has been designed as another storage tier. You store things on the heap; when that is exceeded, you store things off-heap (which is BigMemory); and when that is exceeded, you store things on disk. It makes sense to do so because in the NoSQL paradigm, where we want to store big data, things work well if they are in key-value form. You can store any kind of value just by making it serializable.
As for your constraint of "not as a cache", it's very much possible to configure the cache so that values don't get evicted from memory. Anyway, if you use BigMemory Go, you get a limit of 32GB, so storing 10GB won't trigger any eviction algorithms even without any configuration.

Overhead of using coherence cache

I am considering caching key-value lists stored in a database. Right now, when rendering JSF pages, a lot of redundant queries are executed to find the names to be displayed for some keys (O/R mapper: EclipseLink).
The values are quasi-static, but can change, very seldom, through use of the application (nothing changes them in the database except the application in question).
A simple cache would suffice when only using one application server. However, load balancing with multiple servers should be possible, avoiding returning stale values if data is changed using one server and therefore not reflected by the other server.
One idea would be to use Oracle Coherence as a distributed cache. I'm not sure whether this is overkill, given that the data changes only very seldom and the cache itself does not need to be distributed; only the invalidation does.
What is the overhead of Coherence in terms of memory, execution time and network communication? Are there any alternatives that better suit my use case?
I am talking about 50,000 key-value pairs, mainly short strings.

If the invalidation is that rare, then you can use a local cache and something like a JMS Topic that everyone subscribes to in order to handle the invalidation.
There's also something like EHCache as an alternative, since it's OSS and free to use, unlike Coherence, if that's important. I like to use EHCache's pull-through ability.
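A sketch of the local-cache-plus-invalidation idea. To keep it runnable on its own, the JMS Topic is replaced by a tiny in-process publish/subscribe class; InvalidationTopic and LocalLookupCache are made-up names. In a real deployment each server would subscribe a MessageListener to the topic and publish the changed key as a message.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Function;

/** In-process stand-in for a JMS Topic: every subscriber receives every published key. */
class InvalidationTopic<K> {
    private final List<Consumer<K>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<K> subscriber) {
        subscribers.add(subscriber);
    }

    void publish(K key) {
        for (Consumer<K> subscriber : subscribers) {
            subscriber.accept(key);
        }
    }
}

/** Hypothetical per-server cache that only shares invalidations, never data. */
class LocalLookupCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a database lookup
    private final InvalidationTopic<K> topic;

    LocalLookupCache(Function<K, V> loader, InvalidationTopic<K> topic) {
        this.loader = loader;
        this.topic = topic;
        // On any invalidation message, forget the local copy of that key.
        topic.subscribe(key -> local.remove(key));
    }

    V get(K key) {
        return local.computeIfAbsent(key, loader);
    }

    /** Call after writing a change to the database. */
    void changed(K key) {
        topic.publish(key);
    }
}
```

Only keys travel over the wire, so the network cost is proportional to how often the data changes, not to how often it is read.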

Coherence has relatively low overhead, and can easily manage 50,000 (or 50,000,000) objects. However, if your use case is super simple, and you don't mind doing the invalidation work yourself, and don't need the various QoS that Coherence provides, then it probably is overkill.
Also, this simple use case can easily be done using the Coherence Standard Edition, which is far less expensive (licensed per server instead of per processor, and it's a much lower price).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

memcached tomcat mysql on 1GB RAM

I am new to memcached and caching in general. I have a java web application running on Ubuntu + Tomcat + MySQL on a VPS Server with 1GB of memory.
Does it make sense to add a memcached layer with about 256MB for caching? Will this be too much load on the server? Which is more appropriate caching rendered html pages or database objects?
Please advise.

If you're going to cache pages, don't use memcached, use Varnish. However, there's a good chance that's not a great use of memory. Cacheing pages trades memory for computation and database work, but it does cost quite a lot of memory per page, so it's best for cases where the computation and database work needed to produce a single page amounts to a lot (or the pages are very small!). Also, consider that page cacheing won't be effective, or even possible, if you want to use per-user customisation on your pages (eg showing the number of items in a shopping cart). At least not without getting into some truly hairy shenanigans (edge-side includes, anyone?).
If you're not going to cache pages, and your app is on a single machine, then there's no point using memcached or similar. The point of cache servers like that is to make the memory on one machine work as a cache for another - like how a file server shares a disk, they're essentially memory servers. On a single machine, you might as well give all the memory to Java and cache objects on the heap.
Are you using an object-relational mapper? If so, see if it has any support for a second-level cache. The big three implementations (Hibernate, OpenJPA, and EclipseLink) all support in-memory caches. They're likely to do a much better job than you would if you did the cacheing yourself.
But if you're not using a mapper, you have no choice but to do the cacheing yourself. There are extension points in LinkedHashMap for building LRU caches, and then of course there's the people's favourite, SoftReference, in combination with a HashMap. Plus, there are probably cache implementations out there you could download and use - I'd be shocked if there wasn't something in the Apache Commons libraries.
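The LinkedHashMap extension point in question is removeEldestEntry(), combined with the access-order constructor. A minimal sketch (the class name LruCache is made up, and the cache is bounded by entry count, not bytes):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache: evicts the least recently accessed entry once maxEntries is exceeded. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = order entries by access, not by insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }
}
```

LinkedHashMap is not thread-safe, and in access-order mode even get() modifies the map, so wrap it with Collections.synchronizedMap(...) if several threads share it.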

memcached won't add any noticeable load on your server, but it will be memory your app can't use. If you only plan to have a single app server for a while, you're better off using an in-JVM cache.
As for what to cache, the answer falls somewhere in the middle of the above. You don't want to cache exactly what's in your database and you certainly don't want to cache the final output. You have a data model representation in your application that isn't exactly what's in the DB (e.g. a User object might be made up of multiple queries from a few different tables). Cache that kind of thing, as it's the most reusable.
There's lots of info in the memcached site that should help you understand and get going with caching in general and memcached specifically.

It might make sense to do that. Why not try a smaller size like 64 MB and see how that goes? The more resources you give to memcached, the less there is for everything else. You should try it and see what gives you the best performance.

Automatically Sharding a Java Map across multiple nodes

I have a problem where I need to assemble a Map whose eventual size is in the GBs (going on past 64GB) and I cannot assume that a user of the program will have this kind of monster machine hanging around. A nice solution would be to distribute this map across a number of machines to make a far more modest memory footprint per instance.
Does anyone know of a library/suite of tools which can perform this sharding? I do not care about replication or transactions; just spreading this memory requirement around.

Terracotta might be useful; have a look here:
http://www.terracotta.org/
It's a clustered JVM. How well it performs will, I guess, depend on how often you update the map.

I suggest that you start with hazelcast:
http://www.hazelcast.com/
It is open-source, and in my opinion it is very easy to work with, so it is the best framework for rapid prototyping.
As far as I know, it performs faster than the commercial alternatives, so I wouldn't worry about performance either.
(I haven't formally benchmarked it myself)

Must it be open source? If not, Oracle Coherence can do it.

You may be able to solve your problem by using a database instead, something like http://hsqldb.org/ may provide the functionality you need with the ability to write the data to disk rather than keeping the whole thing in memory.
I would definitely take a step back and ask yourself if a map is the right data structure for GBs of data.

Gigaspaces Datagrid sounds like the kind of thing you are looking for. (Not free though)
