LRU LinkedHashMap that limits size based on available memory - java

I want to create a LinkedHashMap which will limit its size based on available memory (i.e. when freeMemory + (maxMemory - allocatedMemory) drops below a certain threshold). This will be used as a form of cache, probably using "least recently used" as the eviction strategy.
My concern though is that allocatedMemory also includes (I assume) un-garbage collected data, and thus will over-estimate the amount of used memory. I'm concerned about the unintended consequences this might have.
For example, the LinkedHashMap may keep deleting items because it thinks there isn't enough free memory, but the free memory doesn't increase because these deleted items aren't being garbage collected immediately.
Does anyone have any experience with this type of thing? Is my concern warranted? If so, can anyone suggest a good approach?
I should add that I also want to be able to "lock" the cache, basically saying "ok, from now on don't delete anything because of memory usage issues".

I know I'm biased, but I really have to strongly recommend our MapMaker for this. Use the softKeys() or softValues() feature, depending on whether it's GC collection of the key or of the value that more aptly describes when an entry can be cleaned up.
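For illustration, a minimal sketch of the MapMaker route, assuming Google Collections (later Guava) is on the classpath; the demo class name is mine:

    import com.google.common.collect.MapMaker;
    import java.util.concurrent.ConcurrentMap;

    public class SoftValueCacheDemo {
        public static void main(String[] args) {
            // Values are held via SoftReferences: the GC may reclaim them
            // under memory pressure, and the entry then vanishes from the map.
            ConcurrentMap<String, byte[]> cache = new MapMaker()
                    .softValues()
                    .makeMap();
            cache.put("key", new byte[1024]);
            System.out.println(cache.containsKey("key"));
        }
    }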

Caches tend to be problematic. IIRC, there's a SoftCache in Sun's JRE that has had many problems.
Anyway, the simplest thing is to use SoftReferences in the map. This should work fine so long as the overhead of SoftReference plus Map.Entry is significantly lower than the cached data.
Alternatively you can, like WeakHashMap, use a ReferenceQueue and either poll it or have a thread blocking on it (one thread per instance, unfortunately). Be careful with synchronisation issues.
"Locking" the map, you probably want to avoid if necessary. You'd need keep strong references to all the data (and evict if not null). That is going to be ugly.

I would strongly suggest using something like Ehcache instead of re-inventing a caching system. It's super simple to use, very configurable, and works great.

As matt b said, something like Ehcache or JBoss Cache is a good first step.
If you want something lightweight and in-process, look at Google Collections. For example, you can use MapMaker (http://google-collections.googlecode.com/svn/trunk/javadoc/index.html?com/google/common/collect/BiMap.html) to make a map with soft/weak keys and values, so it would cache only those items it has room for (though you wouldn't get LRU).

I had the same need in the past and this is how I implemented my cache:
there is a cache memory manager, which has a minimum and a maximum memory limit (in practice only the maximum limit matters)
every registered cache has the following (important) parameters: a maximum capacity (most of the time you have a higher limit; you don't want to hold more than X items) and the percentage of memory it is allowed to use
I use a LinkedHashMap and a ReentrantReadWriteLock to guard the cache.
every X puts I calculate the mean memory consumption per entry and trigger eviction (asynchronously) if the estimated memory usage exceeds the allowed limit.
of course the memory computation doesn't show the real memory consumption, but comparing the computed value with the real value (using a profiler) I found it close enough.
I was also planning to put an additional guard on the cache, to evict in case puts come in faster than memory-based evictions, but so far I haven't found the need to do it.
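A rough sketch of the scheme described above; the names, the fixed bytes-per-entry estimate, and the eviction being done inline (rather than asynchronously) are my simplifications:

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class MemoryBoundedCache<K, V> {
        // Access order = true makes iteration order least-recently-used first.
        private final LinkedHashMap<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true);
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final int maxEntries;
        private final long maxBytes;
        // Recomputing this mean every X puts is omitted here for brevity.
        private long estimatedBytesPerEntry = 256;

        MemoryBoundedCache(int maxEntries, long maxBytes) {
            this.maxEntries = maxEntries;
            this.maxBytes = maxBytes;
        }

        V get(K key) {
            // An access-order get is a structural modification,
            // so the write lock is needed even for reads.
            lock.writeLock().lock();
            try {
                return map.get(key);
            } finally {
                lock.writeLock().unlock();
            }
        }

        void put(K key, V value) {
            lock.writeLock().lock();
            try {
                map.put(key, value);
                evictIfNeeded();
            } finally {
                lock.writeLock().unlock();
            }
        }

        // Drop least-recently-used entries until both limits are respected.
        private void evictIfNeeded() {
            Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
            while (it.hasNext() && (map.size() > maxEntries
                    || (long) map.size() * estimatedBytesPerEntry > maxBytes)) {
                it.next();
                it.remove();
            }
        }
    }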


Java: Empty the cache before OutOfMemoryError

Is there a reliable approach to empty the cache before the memory is full?
Or even better limit the cache according to current available "actual" free memory (hard-referenced objects)?
A soft-referenced cache is not a good idea due to the high GC penalty: once you hit the memory limit, all cache entries may be cleared at once and need to be reloaded.
Also, the value of runtime.freeMemory() is not that reliable for my purpose, because even if it is too low there might be plenty of free space after the next GC cycle, so it's not a good indication of actual memory usage.
I tried to figure out how much memory each primitive type would consume so I would know the actual memory usage of the cache and could put a limit on it, but couldn't find a reliable way to figure out how much memory would be used to store a String of length n.
Have two or three collections. If you want service that degrades gracefully as memory availability drops, you can have:
a map of the most recent entries, e.g. a LinkedHashMap.
a map of soft references.
a map of weak references.
You can control how large each map should be with the knowledge that weak references can be cleared after a minor collection, soft references will be cleared if needed, and the strong references map has the core data which will always be retained.
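A minimal sketch of that layering, with names of my own choosing; demotion from the soft tier to the weak tier (e.g. by a periodic task) is omitted:

    import java.lang.ref.Reference;
    import java.lang.ref.SoftReference;
    import java.lang.ref.WeakReference;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class TieredCache<K, V> {
        private final Map<K, SoftReference<V>> soft = new HashMap<K, SoftReference<V>>();
        private final Map<K, WeakReference<V>> weak = new HashMap<K, WeakReference<V>>();
        private final Map<K, V> strong;

        TieredCache(final int strongCapacity) {
            // Tier 1: strongly referenced LRU map whose evictees are demoted.
            strong = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    if (size() > strongCapacity) {
                        soft.put(eldest.getKey(), new SoftReference<V>(eldest.getValue()));
                        return true;   // demote the LRU entry to the soft tier
                    }
                    return false;
                }
            };
        }

        void put(K key, V value) {
            strong.put(key, value);
        }

        V get(K key) {
            V v = strong.get(key);
            if (v == null) {
                v = deref(soft.remove(key));
                if (v == null) {
                    v = deref(weak.remove(key));
                }
                if (v != null) {
                    strong.put(key, v);   // hit in a lower tier: promote back
                }
            }
            return v;
        }

        private V deref(Reference<V> ref) {
            return ref == null ? null : ref.get();
        }
    }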
BTW: If you are hitting your memory limit often, you should consider buying more memory up to about 32 GB per JVM. You can buy 32 GB for less than $200.
Try one of the more recent Oracle 1.7 incarnations. They should offer a GarbageCollectorMXBean and GarbageCollectionNotificationInfo. Use that to monitor the amount of used/unused memory after each GC cycle. There is some sample code here.
You can then use a multi-level cache as suggested by Peter to clean out the outer level when memory is tight, but retain the smaller first-level cache.
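A minimal sketch of that monitoring, using the com.sun.management notification API (available since roughly JDK 7u4, if I recall correctly); the printing is a placeholder for whatever eviction decision you hook in:

    import com.sun.management.GarbageCollectionNotificationInfo;
    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;
    import javax.management.Notification;
    import javax.management.NotificationEmitter;
    import javax.management.NotificationListener;
    import javax.management.openmbean.CompositeData;

    public class GcMonitor {
        public static void install() {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                ((NotificationEmitter) gc).addNotificationListener(new NotificationListener() {
                    public void handleNotification(Notification n, Object handback) {
                        if (!GarbageCollectionNotificationInfo
                                .GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
                            return;
                        }
                        GarbageCollectionNotificationInfo info =
                                GarbageCollectionNotificationInfo.from(
                                        (CompositeData) n.getUserData());
                        // Sum the per-pool usage as it stood after this GC cycle.
                        long used = 0;
                        for (MemoryUsage u : info.getGcInfo().getMemoryUsageAfterGc().values()) {
                            used += u.getUsed();
                        }
                        System.out.println(info.getGcName() + ": " + used + " bytes used after GC");
                    }
                }, null, null);
            }
        }
    }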
I would suggest that the simplest solution would be to change your references to weak references.
This way the references can still be finalized and garbage collected once all strong references have gone out of scope.
See: http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/ref/WeakReference.html

I want to use a Java collection in order to speed up processing but at the same time avoid memory heap exceptions

I want to use a java collection (list, map, etc.) which will cache some data so that I can use this cache instead of directly checking the database. My only worry is the collection size, I want this cache to save let's say only 1000 entries, once this count is reached, I want to remove the oldest entry and put a new one. Is this possible?
You should have a look at LinkedHashMap. If you override removeEldestEntry you can control when the oldest entry in the map gets removed (when put or putAll is called).
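A minimal sketch of that override, assuming a cap of 1000 entries as in the question:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BoundedCache {
        private static final int MAX_ENTRIES = 1000;

        // Insertion order (the default) makes the eldest entry the oldest put;
        // pass true as the third constructor argument for LRU (access) order.
        private static final Map<String, Object> CACHE =
                new LinkedHashMap<String, Object>() {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                        return size() > MAX_ENTRIES;   // evict once over the cap
                    }
                };
    }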
You can use the caching utilities provided by Google Guava: http://code.google.com/p/guava-libraries/wiki/CachesExplained
Depending on the 'weight' of each cached object, there are multiple variants. Select whichever fits your use case best:
Fixed size cache (can be implemented using a Collection and tracking its size). This works well if the objects are reasonably small and the memory footprint can be well estimated ahead of time. The other answers basically illustrate ways to implement this type.
Dynamic cache with automatic eviction through the garbage collector. This works well if the objects to be cached are big (or of wildly varying size, e.g. files or images) and you want to use as much heap as available for the cache. The cache manages a collection of java.lang.SoftReference to keep the objects alive. The garbage collector will reclaim cached objects when it needs memory (by clearing the reference). A disadvantage of this approach is that you have no control over object eviction, the GC decides when and which objects are evicted.
Combination of both, (small) fixed size cache for most recent hits, dynamic GC'd one for second level.
None of these will ever cause an OutOfMemoryError when configured appropriately.
Java offers an interface called Queue and several implementations of this interface.
You can see which is the best choice for your problem. Take a look:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Queue.html
http://docs.oracle.com/javase/tutorial/collections/implementations/queue.html
Apache Commons has a circular FIFO buffer. I guess that is what you are looking for.
from its docs
CircularFifoBuffer is a first in first out buffer with a fixed size that replaces its oldest element if full.
or else
you can create your own class extending AbstractQueue from the Java class library.
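For the Commons option, a minimal sketch (assuming Commons Collections 3.x on the classpath; Commons Collections 4 renamed the class to CircularFifoQueue):

    import org.apache.commons.collections.buffer.CircularFifoBuffer;

    public class FifoCacheDemo {
        public static void main(String[] args) {
            CircularFifoBuffer buffer = new CircularFifoBuffer(1000); // fixed capacity
            for (int i = 0; i < 1500; i++) {
                buffer.add("entry-" + i); // past 1000 elements, the oldest is replaced
            }
            System.out.println(buffer.size()); // prints 1000
        }
    }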

Variable size LRU cache

I am trying to implement an LRU cache in Java which should be able to:
Change size dynamically. The idea is that I plan to hold values as SoftReferences registered with a ReferenceQueue, so depending on memory consumption the cache size will vary.
I plan to use ConcurrentHashMap where the value will be a soft reference and then periodically clear the queue to update the map.
But the problem with the above is that, how do I make it LRU?
I know that we have no control over GC, but can we manage the references to the values (in the cache) in such a manner that the objects in the cache become softly reachable (to the GC) depending upon usage (i.e. the time each was last accessed) and not in some random order?
Neither weak nor soft references are really well suited for this. WeakReferences tend to get cleared immediately as soon as the object has no stronger references anymore, and soft references get cleared only after the heap has grown to its maximum size, when an OutOfMemoryError would otherwise need to be thrown.
Typically it's more efficient to use some time-based approach with regular strong references, which are much cheaper for the VM than the Reference subclasses (faster for the program and the GC to handle, and no extra memory used for the reference itself): release all objects that have not been used for a certain time. You can check this with a periodic TimerTask, which you would need anyway to operate your reference queue. The idea is that if it takes e.g. 10ms to create the object and you keep it for at most 1s after it was last used, you will on average be only 1% slower than if you kept all objects forever. But since it will most likely use less memory, it will actually be faster.
Edit: One way to implement this would be to use 3 buckets internally. Objects placed into the cache are always inserted into bucket 0. When an object is requested, the cache looks for it in all 3 buckets in order and places it into bucket 0 if it was not already there. The TimerTask is invoked at fixed intervals: it drops bucket 2 and places a new empty bucket at the front of the bucket list, so that the new bucket 0 is empty, the former bucket 0 becomes bucket 1, and the former bucket 1 becomes bucket 2. This ensures that idle objects survive at least one and at most two timer intervals, and that objects accessed more than once per interval are very fast to retrieve. The total maintenance overhead of such a data structure is considerably smaller than anything based on reference objects and reference queues.
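A minimal sketch of that rotation, with names and coarse synchronization of my own choosing:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Map;
    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentHashMap;

    class BucketCache<K, V> {
        private final Deque<Map<K, V>> buckets = new ArrayDeque<Map<K, V>>();
        private final Timer timer = new Timer(true);   // daemon timer thread

        BucketCache(long intervalMillis) {
            for (int i = 0; i < 3; i++) {
                buckets.addLast(new ConcurrentHashMap<K, V>());
            }
            timer.scheduleAtFixedRate(new TimerTask() {
                @Override public void run() { rotate(); }
            }, intervalMillis, intervalMillis);
        }

        synchronized void put(K key, V value) {
            buckets.peekFirst().put(key, value);   // new entries enter bucket 0
        }

        synchronized V get(K key) {
            for (Map<K, V> bucket : buckets) {
                V v = bucket.get(key);
                if (v != null) {
                    bucket.remove(key);
                    buckets.peekFirst().put(key, v); // refresh into bucket 0
                    return v;
                }
            }
            return null;
        }

        // Drop the oldest bucket and push an empty one, so idle entries
        // survive at least one and at most two timer intervals.
        private synchronized void rotate() {
            buckets.removeLast();
            buckets.addFirst(new ConcurrentHashMap<K, V>());
        }
    }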
Your question doesn't really make sense unless you want several of these caches at the same time. If you have only a single cache, don't give it a size limit and always use WeakReference. That way, the cache will automatically use all available free memory.
Prepare for some hot discussions with your sysadmins, though, since they will come complaining that your app has a memory leak and "will crash any moment!" sigh
The other option is to use a mature cache library like EHCache since it already knows everything that there is to know about caches and they spent years getting them right - literally. Unless you want to spend years debugging your code to make it work with every corner case of the Java memory model, I suggest that you avoid reinventing the wheel this time.
I would use a LinkedHashMap, as it supports access order and so can be used as an LRU map. It can have a variable maximum size.
Switching between weak and soft references based on usage is very difficult to get right, because it's hard to determine a) how much memory your cache is using exclusively, b) how much is being used by the rest of the system, and c) how much would be used after a full GC.
You should note that weak and soft references are only cleared on a GC, and that discarding or changing them won't free memory until a GC runs.

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?

Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
Something along the lines of http://wwwasd.web.cern.ch/wwwasd/lhc++/Objectivity/V5.2/Java/guide/jgdStorage.fm.html and specifically non-garbage-collectible containers there (non-garbage-collectable?).
The problem is that I have lots of ordinary temporary objects, but I also have much bigger objects (several GB) that are stored for cache purposes. There is no reason for the Java GC to traverse all those cache gigabytes trying to find anything to collect, because the cached data has its own timeouts.
This way I could partition my data, in a custom way, into infinite-lived and normal-lived objects, and hopefully GC would be quite fast because normal objects don't live long and amount to a smaller volume.
There are some workarounds to this problem, such as Apache DirectMemory and the commercial Terracotta BigMemory (http://terracotta.org/products/bigmemory), but a Java-native solution would be nicer (I mean free, and probably more reliable?). Also, I want to avoid serialization overhead, which means the cache should live within the same JVM. To my understanding, DirectMemory and BigMemory operate mainly off-heap, which means the objects must be serialized/deserialized to/from memory outside the JVM. Simply marking non-GC regions within the JVM would seem a better solution. Using files for the cache is not an option either; it has the same unaffordable serialization/deserialization overhead. The use case is an HA server with lots of data used in random (human) order, where low latency is needed.
Any memory the JVM manages is also garbage-collected by the JVM. And any “live” objects which are directly available to Java methods without deserialization have to live in JVM memory. Therefore in my understanding you cannot have live objects which are immune to garbage collection.
On the other hand, the usage you describe should make the generational approach to garbage collection quite efficient. If your big objects stay around for a while, they will be checked for reclamation less often. So I doubt there is much to be gained from avoiding those checks.
Is it possible to mark java objects non-collectable from gc perspective to save on gc-sweep time?
No it is not possible.
You can prevent objects from being garbage collected by keeping them reachable, but the GC will still need to trace them to check reachability on each full GC (at least).
It is simply my assumption that when the JVM is starving it begins scanning all those unnecessary objects too.
Yes. That is correct. However, unless you've got LOTS of objects that you want to be treated this way, the overhead is likely to be insignificant. (And anyway, a better idea is to give the JVM more memory ... if that is possible.)
Quite simply, for you to be able to do this, the garbage collection algorithm would need to be aware of such a flag, and take it into account when doing its work.
I'm not aware of any of the standard GC algorithms having such a flag, so for this to work you would need to write your own GC algorithm (after deciding on some feasible way to communicate this information to it).
In principle, in fact, you've already started down this track - you're deciding how garbage collection should be done rather than being happy to leaving it to the JVM's GC algo. Is the situation you describe a measurable problem for you; something for which the existing garbage collection is insufficient, but your plan would work? Garbage collectors are extremely well-tuned, so I wouldn't be surprised if the "inefficient" default strategy is actually faster than your naively-optimal one.
(Doing manual memory management is tricky and error-prone at the best of times; managing some memory yourself while using a stock garbage collector to handle the rest seems even worse. I expect you'd run into a lot of edge cases where the GC assumes it "knows" what's happening with the whole heap, which would no longer be true. Steer clear if you can...)
The recommended approaches would be to use either a commercial RTSJ implementation to avoid GC, or to use off-heap memory. One could also look into soft references for caches (they do get collected).
This is not recommended:
If for some reason you do not believe these options are sufficient, you could look into direct memory access, which is UNSAFE (part of sun.misc.Unsafe). You can use the 'theUnsafe' field to get the 'Unsafe' instance. Unsafe allows you to allocate/deallocate memory via 'allocateMemory' and 'freeMemory'. This is not under GC control nor limited by the JVM heap size. The impact on GC/application, once you go down this route, is not guaranteed - which is why using byte buffers might be the way to go (if you're not using an RTSJ-like implementation).
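For completeness, a minimal sketch of that (discouraged) Unsafe route; the field and method names do exist in sun.misc.Unsafe, but the rest is illustrative:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class OffHeapDemo {
        public static void main(String[] args) throws Exception {
            // Unsafe.getUnsafe() rejects ordinary callers, hence reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long address = unsafe.allocateMemory(1024); // 1 KiB, outside the heap
            try {
                unsafe.putLong(address, 42L);                 // raw write
                System.out.println(unsafe.getLong(address));  // raw read: 42
            } finally {
                unsafe.freeMemory(address); // this memory leaks if you forget
            }
        }
    }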
Hope this helps.
Live Java objects will always be part of the GC life cycle. Or said another way, marking an object non-collectable would have the same order of overhead as having your object referenced by a root reference (a static final map, for instance).
But thinking a bit further, data put in a cache is most likely temporary and will eventually be evicted. At that point you will start to like the JVM and the GC again.
If you have hundreds of GBs of permanent data, you may want to rethink the architecture of your application, and try to shard and distribute your data (horizontal scalability).
Last but not least, lots of work has been done around serialization, and the overhead of serialization (I'm not speaking about the poor reputation of ObjectInputStream and ObjectOutputStream) is not that big.
More than that, if your data is mainly composed of primitive types (including byte arrays), there are efficient ways to readInt() or readBytes() from off-heap buffers (for instance netty.io's ChannelBuffer). This could be a way to go.

What are the "practical consequences" of using soft references?

Per the documentation for Guava's MapMaker.softValues():
Warning: in most circumstances it is better to set a per-cache maximum size instead of using soft references. You should only use this method if you are well familiar with the practical consequences of soft references.
I have an intermediate understanding of soft references - their behavior, uses, and their contract with garbage collection. However, I'm wondering what these practical consequences are which the doc alludes to. Why exactly is it better to use a maximum size rather than soft references? Don't the algorithms and behavior of soft references make their use more efficient than a hardcoded ceiling, in terms of implementing a cache?
I think all they are alluding to is that you should be prepared for maximum memory usage, and potentially more GC activity, if you use a soft-reference map, since references are only GC'd as memory needs to be freed up.
If you know you only need the last n values in the cache then using a LRU Cache is a leaner approach, with more predictable resource usage for a running application.
Furthermore, according to this, it seems there are subtle differences in behaviour between the -server and -client JVMs.
The Sun JRE does treat SoftReferences differently from WeakReferences. We attempt to hold on to an object referenced by a SoftReference if there isn't pressure on the available memory. One detail: the policies for the "-client" and "-server" JREs are different: the -client JRE tries to keep your footprint small by preferring to clear SoftReferences rather than expand the heap, whereas the -server JRE tries to keep your performance high by preferring to expand the heap (if possible) rather than clear SoftReferences. One size does not fit all.
One of the practical problems with using SoftReferences is that they tend to be discarded all at once. The reason you have a cache is to provide pretty good performance most of the time.
However, using SoftReferences for a cache can mean that after your application has stopped for a GC, it will run slowly until the cache is rebuilt - i.e. just at the time you need the application to catch up.
Note: You can use a LinkedHashMap as an LRU cache; it doesn't have to be complex.
