A cache which knows about reachability - java

I'd like a cache with some maximum retaining capacity of N. I'm allowing it to hold up to N objects which would otherwise be eligible for GC. Now, if my application itself currently holds N+1 strong references to objects which it's previously added to the cache, I want the cache to hold N+1 too. Why? Because the cache won't be keeping this N+1th object from being collected any longer than it would be otherwise, and I'm fine trading a bigger hash table for more cache hits.
Another way of putting it, I'd like an object cache which retains all objects added to it while they remain strongly reachable, and also retains enough non-strongly reachable objects to keep its size == N.
Example
We have a cache created with N=100. Size starts at 0. 150 objects are added, size is 150. 100 of those objects become non-strongly reachable (weakly, softly, whatever). Cache evicts 50 of those and keeps 50, size is 100. 49 more strongly reachable objects are added. Size is still 100 but now 99 of them are strongly reachable and only one is non-strongly reachable. What happened is 49 older, non-strongly reachable objects were replaced with the new 49 because the new ones were strongly reachable.
Motivation
I suspect it's actually an intuitive thing to want for a number of use cases. Typically a cache's capacity trades cache-hit probability against a guarantee on maximum memory usage. By knowing about the reachability of the objects it holds, a cache could deliver a higher cache-hit probability without weakening its maximum memory usage guarantee.
The Trouble
I'm worried it's not possible on the JVM. I'm hoping to be told otherwise, but if you know for a fact it's not possible I'll accept that answer too if there's rationale.

You can add the entries to a LinkedHashMap configured as an LRU or FIFO cache. You can have a WeakHashMap as well. If you add the key to both maps, the LHM will prevent cleanup even though it's in the WHM. Once the LHM discards the key, it may or may not still be in the WHM.
e.g.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative wrapper around the two maps.
public class ReachabilityCache<K, V> {
    private final int retainedSize;

    private final Map<K, V> lruMap = new LinkedHashMap<K, V>(16, 0.7f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > retainedSize;
        }
    };

    private final Map<K, V> weakMap = new WeakHashMap<K, V>();

    public ReachabilityCache(int retainedSize) {
        this.retainedSize = retainedSize;
    }

    public void put(K k, V v) {
        lruMap.put(k, v);
        weakMap.put(k, v);
    }

    public V get(K k) {
        V v = lruMap.get(k);
        return v == null ? weakMap.get(k) : v;
    }
}
One reason to do this is that a WeakHashMap is likely to be cleared all at once, so your hit rate can drop very sharply. This approach ensures that after you have been hit with a full GC, your performance won't drop too much as you try to catch up. ;)

Check out WeakHashMap. Stale references will be removed automatically. Before putting, you could check whether the size exceeds your threshold and skip putting in the new value.
Alternatively, you could override put and discard the value if the size is above your threshold.
This approach would work as you propose: since you do not need a cache eviction policy, you can simply skip putting in new elements while the size is greater than your threshold.
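A minimal sketch of that idea, using a hypothetical ThresholdWeakCache wrapper with a plain size check (no eviction policy, not thread-safe):

import java.util.Map;
import java.util.WeakHashMap;

public class ThresholdWeakCache<K, V> {
    private final int threshold; // assumed capacity limit
    private final Map<K, V> map = new WeakHashMap<>();

    public ThresholdWeakCache(int threshold) {
        this.threshold = threshold;
    }

    public void put(K k, V v) {
        // Skip new entries once the threshold is reached; stale entries are
        // expunged by the WeakHashMap itself as keys become unreachable.
        if (map.size() < threshold) {
            map.put(k, v);
        }
    }

    public V get(K k) {
        return map.get(k);
    }
}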

I think what you want makes sense, but maybe not that much. Let's assume that the values are quite big (some kilobytes); otherwise caching values that are already strongly held elsewhere may get expensive too. Ignoring this overhead, your cache indeed has constant memory costs. However, I'm not sure this goal is worth pursuing -- I'm more interested in how to use a roughly constant amount of memory for the whole program (I don't want to leave too much memory unused, and in no case do I want to start swapping).
Idea: The cache should use registered weak (or soft) references.1 You use another thread calling ReferenceQueue.remove() in a loop and checking some condition.2 Depending on it, you either remove the corresponding entry from the cache (as Guava does) or you resurrect the value via reference.get() and thus protect it temporarily from being garbage collected.3 This should work, but it costs some time during each GC run; a rough sketch follows the footnotes.
1 Overriding finalize() would do as well. Actually, it looks like this is the only way, as reference.get() always returns null once the reference has been enqueued, so it can't be used for resurrection.
2The condition should be sort of "do it 100 times per GC run".
3I'm not sure the GC really works this way, but I suppose it does. If not, then you could use a copy of the value instead. I'm also unsure what happens when the value loses strong reachability the next time, but again, this is surely solvable (e.g., create a new Reference).
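A rough sketch of the queue-draining idea, taking the "remove the entry" branch (resurrection is omitted, per footnote 1); the class and its key-aware reference type are illustrative:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class QueueDrainingCache<K, V> {
    // Weak reference that remembers which key it belongs to.
    private static class KeyedRef<K, V> extends WeakReference<V> {
        final K key;
        KeyedRef(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, KeyedRef<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    public QueueDrainingCache() {
        Thread cleaner = new Thread(() -> {
            try {
                while (true) {
                    // Blocks until the GC enqueues a cleared reference.
                    @SuppressWarnings("unchecked")
                    KeyedRef<K, V> ref = (KeyedRef<K, V>) queue.remove();
                    // This sketch simply evicts the stale entry, as Guava does.
                    map.remove(ref.key, ref);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        cleaner.setDaemon(true);
        cleaner.start();
    }

    public void put(K key, V value) {
        map.put(key, new KeyedRef<>(key, value, queue));
    }

    public V get(K key) {
        KeyedRef<K, V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }
}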

Related

Avoiding objects garbage collection

TLDR: How can I force the JVM not to garbage collect my objects, even if I don't want to use them in any meaningful way?
Longer story:
I have some Items which are loaded from a permanent storage and held as weak references in a cache. The weak reference usage means that, unless someone is actually using a specific Item, there are no strong references to them and unused ones are eventually garbage collected. This is all desired behaviour and works just fine. Additionally, sometimes it is necessary to propagate changes of an Item into the permanent storage. This is done asynchronously in a dedicated writer thread. And here comes the problem, because I obviously cannot allow the Item to be garbage collected before the update is finished. The solution I currently have is to include a strong reference to the Item inside the update object (the Item is never actually used during the update process, just held).
public class Item {
    public final String name;
    public String value;
}

public class PendingUpdate {
    public final Item strongRef; // not actually necessary, just to avoid GC
    public final String name;
    public final String newValue;
}
But after some thinking and digging I found this paragraph in JavaSE specs (12.6.1):
Optimizing transformations of a program can be designed that reduce the number of objects that are reachable to be less than those which would naively be considered reachable. For example, a Java compiler or code generator may choose to set a variable or parameter that will no longer be used to null to cause the storage for such an object to be potentially reclaimable sooner.
Which, if I understand it correctly, means that Java can just decide that the Item is garbage anyway. One solution would be to perform some unnecessary operation on the Item, like item.hashCode(), at the end of the storage update code. But I expect that a JVM might be smart enough to remove such unnecessary code anyway, and I cannot think of any approach that a sufficiently smart JVM couldn't optimize away, releasing the Item sooner than needed.
public void performStorageUpdate(PendingUpdate update) {
    final Transaction transaction = this.getDataManager().beginTransaction();
    try {
        // ... some permanent storage update code
    } catch (final Throwable t) {
        transaction.abort();
        return;
    }
    transaction.commit();
    // The Item should never be garbage collected before this point.
    update.strongRef.hashCode(); // trying to avoid GC of the Item, probably not enough
}
Has anyone encountered a similar problem with weak references? Are there any language guarantees that I can use to prevent GC of such objects? (Ideally causing as small a performance hit as possible.) Or am I overthinking it and does the specification paragraph mean something different?
Edit: Why I cannot allow the Item to be garbage collected before the storage update finishes:
Problematic event sequence:
Item is loaded into cache and is used (held as a strong reference)
An update to the item is enqueued
Strong reference to the Item is dropped and there are no other strong references to the item (besides the one in the PendingUpdate, but as I explained, I think that that one can be optimized away by JVM).
Item is garbage collected
Item is requested again and is loaded from the permanent storage and a new strong reference to it is created
Update to the storage is performed
Result state: there is now inconsistent data between the cache and the permanent storage. Therefore, I need to hold a strong reference to the Item until the storage update finishes, but I just need to hold it; I don't actually need to do anything with it (so the JVM is probably free to think that it is safe to get rid of it).
TL;DR How can I force the JVM not to garbage collect my objects, even if I don't want to use them in any meaningful way?
Make them strongly reachable; e.g. by adding them to a strongly reachable data structure. If objects are strongly reachable then the garbage collector won't break weak references to them.
When you have finished the processing for which the objects need to remain in the cache, you can clear the data structure to break the above strong references. The next GC run will then be able to break the weak references.
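A minimal sketch of that suggestion, reusing the question's classes and a strongly reachable in-flight set (the writer class and field names are illustrative):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UpdateWriter {
    // Strongly reachable for as long as this writer is; an Item parked here
    // cannot have its weak cache entry cleared while its update is pending.
    private final Set<Item> inFlight = ConcurrentHashMap.newKeySet();

    public void performStorageUpdate(PendingUpdate update) {
        inFlight.add(update.strongRef);
        try {
            // ... permanent storage update code ...
        } finally {
            // Drop the strong reference only after the update is done;
            // a later GC run may then clear the weak cache entry.
            inFlight.remove(update.strongRef);
        }
    }
}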
Which, if I understand it correctly, means that java can just decide that the Item is garbage anyway.
That's not what it means.
What it really means is that the infrastructure may be able to determine that an object is effectively unreachable, even though there is still a reference to it in a variable. For example:
public void example() {
    int[] big = new int[1000000];
    // long computation that doesn't use 'big'
}
If the compiler / runtime can determine that the object that big refers to cannot be used1 during the long computation, it is permitted to garbage collect it ... during the long computation.
But here's the thing. It can only do this if the object cannot be used. And if it cannot be used, there is no reason not to garbage collect it.
1 - ... without traversing a reference object.
For what it is worth, the definition of strongly reachable isn't just that there is a reference in a local variable. The definition (in the javadocs) is:
"An object is strongly reachable if it can be reached by some thread without traversing any reference objects. A newly-created object is strongly reachable by the thread that created it."
It doesn't specify how the object can be reached by the thread. Or how the runtime could / might deduce that no thread can reach it.
But the implication is clear that if threads can only access the object via a reference object, then it is not strongly reachable.
Ergo ... make the object strongly reachable.

WeakHashMap and Concurrent Modification

I'm reading the Java Doc about the WeakHashMap and I get the basic concept.
Because of the GC thread(s) acting in the background, you can get 'unusual behavior', such as a ConcurrentModificationException when iterating, etc.
The thing I don't get is: if the default implementation is not synchronized and does not use locks in any way, how come there is no possibility of getting an inconsistent state?
Say you have two threads: a GC thread deleting some key at a certain index, and at the same time, at the same index, a user thread inserting a key-value pair into the backing array.
To me, if there is no synchronization, there is a high risk of ending up with a hash map in an inconsistent state.
Even worse, doing something like this might actually be super dangerous because v might actually be null.
if (map.containsKey(k)) {
    V v = map.get(k);
}
Am I missing something?
The inconsistent state issues you mention do not arise because the GC does not actively restructure WeakHashMaps. When the garbage collector frees the referent of a weak reference, the corresponding entry is not physically removed from the map; the entry merely becomes stale, with no key. At some later point, the entry may be physically removed during some other operation on the map, but the GC won't take on that responsibility.
You can see one Java version's implementation of this design on grepcode.
What you're describing is what the documentation explicitly states:
Because the garbage collector may discard keys at any time, a WeakHashMap may behave as though an unknown thread is silently removing entries.
The only mistake you're making is the assumption that you can protect the state by synchronizing. That doesn't work because the synchronization would not be mutual on the part of the GC. To quote the documentation:
In particular, even if you synchronize on a WeakHashMap instance and invoke none of its mutator methods, it is possible for the size method to return smaller values over time, for the isEmpty method to return false and then true, for the containsKey method to return true and later false for a given key, for the get method to return a value for a given key but later return null, for the put method to return null and the remove method to return false for a key that previously appeared to be in the map, and for successive examinations of the key set, the value collection, and the entry set to yield successively smaller numbers of elements.
Referring to
even if you synchronize on a WeakHashMap [...] it is possible for the size method to return smaller values over time
the javadoc sufficiently explains to me that there is a possibility for an inconsistent state and that it is completely independent from synchronization.
A few examples later, the given example is referred to, too:
for the containsKey method to return true and later false for a given key
So basically, one should never rely on the state of a WeakHashMap, but use it as atomically as possible. The given example should therefore be rephrased to
V v = map.get(k);
if (null != v) {
}
or
Optional.ofNullable(map.get(k)).ifPresent(v -> { /* use v */ });
This class is intended primarily for use with key objects whose equals methods test for object identity using the == operator. Once such a key is discarded it can never be recreated, so it is impossible to do a lookup of that key in a WeakHashMap at some later time and be surprised that its entry has been removed.
So if one uses WeakHashMap for objects whose equals() is based on identity check, all is fine. The first case you mentioned ("A GC thread deleting some key at a certain index and at same time and at the same index, a user thread is inserting in the array a key value pair.") is impossible because as long as the user thread keeps a reference to the key object it cannot be discarded by GC.
And the same stands for the second example:
if (map.containsKey(k)) {
    V v = map.get(k);
}
You keep reference k so the corresponding object is reachable and cannot be discarded.
But:
"This class will work perfectly well with key objects whose equals methods are not based upon object identity, such as String instances. With such recreatable key objects, however, the automatic removal of WeakHashMap entries whose keys have been discarded may prove to be confusing."
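To make the reachability point concrete, here is a small demonstration sketch; note that System.gc() is only a hint, so the timing of entry removal is not guaranteed:

import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> map = new WeakHashMap<>();

        Object key = new Object(); // strong reference held by this thread
        map.put(key, "value");

        System.gc();
        Thread.sleep(100);
        System.out.println(map.containsKey(key)); // true: the key is strongly reachable

        key = null; // drop the only strong reference to the key
        System.gc();
        Thread.sleep(100);
        System.out.println(map.size()); // likely 0 once the stale entry is expunged
    }
}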

Detect low memory in Java?

In order to prevent an OutOfMemoryError I would like to create a code that cleans some caches in my program when there is a danger of outgrowing the available RAM.
How can I detect, from inside the code, when memory usage reaches a certain percentage of the maximum available, and react to it?
It turns out (to my surprise!) that there is a semi-reliable way to detect memory usage crossing preset thresholds.
The MemoryPoolMXBean class provides a way to set usage thresholds on a memory pool and to get a notification when a pool's usage exceeds a threshold.
You can get hold of the memory MXBean instances for a JVM by calling ManagementFactory.getMemoryPoolMXBeans().
There are two kinds of threshold, and you need to understand the distinction between them to use them correctly. It is complicated: refer to the javadoc.
It should also be noted that the spec:
says that not all kinds of space support threshold checking,
says there are no constraints on when threshold crossing is tested for:
"A Java virtual machine performs usage threshold crossing checking on a memory pool basis at its best appropriate time ..."
explains how to test if threshold checking is supported, and how to register for notifications.
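A rough sketch of how the pieces fit together; the class name, the 80% figure, and the heap-only filtering are assumptions, and not every pool supports usage thresholds:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.NotificationEmitter;

public class LowMemoryWatcher {
    public static void install(Runnable onLowMemory) {
        // The memory MXBean emits notifications when a pool crosses its usage threshold.
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED
                    .equals(notification.getType())) {
                onLowMemory.run(); // e.g. clear caches here
            }
        }, null, null);

        // Set a usage threshold on every heap pool that supports one.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) {
                    pool.setUsageThreshold((long) (max * 0.8)); // assumed 80% threshold
                }
            }
        }
    }
}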
Detecting the memory consumption in Java and reacting to it in your code is not a good idea. You never know how the garbage collector behaves, when it runs and how often, how much memory it can free up, and so on.
You could use a WeakReference (java.lang.ref) if you want to prevent a reference from keeping an object alive that the garbage collector could otherwise remove. But if you implement a cache with it, this could make the cache useless because your cached objects might be removed very quickly and too often.
I would propose using an LRU cache. Such a cache has a certain capacity; if this capacity is exceeded, the least recently used elements are kicked out of the cache. This is a simple way to prevent your cache from growing without bound.
You can find some simple implementations if you google for it:
import java.util.LinkedHashMap;
import java.util.Map;

public class LRUMap<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int capacity;

    public LRUMap(final int capacity) {
        super(16, 0.75f, true); // access order, so eviction is least-recently-used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
If you need more, I would check for existing cache implementations. They might support additional configuration capabilities, e.g. a maximum age for an entry of your cache.
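For example, Guava's CacheBuilder supports both a size bound and a maximum age per entry (a sketch; the limits shown are arbitrary):

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class BoundedCacheExample {
    private final Cache<String, byte[]> cache = CacheBuilder.newBuilder()
            .maximumSize(1_000)                      // capacity bound
            .expireAfterAccess(10, TimeUnit.MINUTES) // maximum age since last access
            .build();

    public void put(String key, byte[] value) {
        cache.put(key, value);
    }

    public byte[] get(String key) {
        return cache.getIfPresent(key);
    }
}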

Old gen heap space overflow

I have a very weird problem with GC in Java. I am running the following piece of code:
while (some condition) {
    // do a lot of work...
    logger.info("Generating resulting time series...");
    Collection<MetricTimeSeries> allSeries = manager.getTimeSeries();
    logger.info(String.format("Generated %,d time series! Storing in files now...", allSeries.size()));
    //for (MetricTimeSeries series : allSeries) {
    //    // just empty loop
    //}
}
When I look into JConsole at the start of each loop iteration, my old gen heap space, if I manually force a GC, settles at about 90 MB. If I uncomment the loop, like this:
while (some condition) {
    // do a lot of work...
    logger.info("Generating resulting time series...");
    Collection<MetricTimeSeries> allSeries = manager.getTimeSeries();
    logger.info(String.format("Generated %,d time series! Storing in files now...", allSeries.size()));
    for (MetricTimeSeries series : allSeries) {
        // just empty loop
    }
}
then even if I force a GC, it won't fall below 550 MB. According to the YourKit profiler, the TimeSeries objects are reachable via the main thread's local variable (the collection) just after the GC at the start of a new iteration... And the collection is huge (250K time series)... Why is this happening and how can I "fight" this (incorrect?) behaviour?
Yup, the garbage collector can be mysterious.. but it beats managing your own memory ;)
Collections and Maps have a way of hanging onto references longer than you might like and thus preventing garbage collection when you might expect it. As you noticed, setting the allSeries reference to null will earmark it for garbage collection, and thus its contents are up for grabs as well. Another way would be to call allSeries.clear(): this will unlink all its MetricTimeSeries objects and they will be free for garbage collection.
Why does removing the loop get around this problem as well? This is the more interesting question. I'm tempted to suggest the compiler is optimizing away the reference to allSeries... but you are still calling allSeries.size(), so it can't completely optimize out the reference.
To muddy the waters, different compilers (and settings) behave differently and use different garbage collectors, which themselves behave differently. It's tough to say exactly what's happening under the hood without more information.
Since you're building a (large) ArrayList of time series, it will occupy the heap as long as it's referenced, and will get promoted to old gen if it stays long enough (or if the young generation is too small to actually hold it). I'm not sure how you're associating the information you're seeing in JConsole or YourKit with a specific point in the program, but until the empty loop is optimized by several JIT passes, your while loop will take longer and keep the collection longer, which might explain the perceived difference while there's actually not a lot of one.
There's nothing incorrect about that behaviour. If you don't want to consume so much memory, you need to change your Collection so it's not an eagerly-filled ArrayList, but a lazy collection, more of a stream (if you've ever done XML processing, think DOM vs SAX) which gets evaluated as it's iterated. If you don't need the whole collection to be sorted, that's doable, especially since you seem to be saying that the collection is a concatenation of sub-collections returned by underlying objects.
If you can change your return type from Collection to Iterable, you could for example use Guava's FluentIterable.transformAndConcat() to transform the collection of underlying objects to a lazily-evaluated Iterable concatenation of their time series. Of course, the size of the collection is not directly available anymore (and if you try to get it independently of the iteration, you'll evaluate the lazy collection twice).
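A hedged sketch of that idea, assuming hypothetical Source objects that each expose their own series via getTimeSeries() (placeholder types stand in for the question's domain objects):

import com.google.common.collect.FluentIterable;

public class LazyTimeSeries {
    // Placeholder types standing in for the question's domain objects.
    interface MetricTimeSeries { }
    interface Source { Iterable<MetricTimeSeries> getTimeSeries(); }

    // Nothing is materialized here; each source's series are fetched only while
    // the caller iterates, so the full collection never sits on the heap at once.
    public static Iterable<MetricTimeSeries> allSeries(Iterable<Source> sources) {
        return FluentIterable.from(sources)
                .transformAndConcat(Source::getTimeSeries);
    }
}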

Using SoftReference for static data to prevent memory shortage in Java

I have a class with a static member like this:
class C
{
static Map m=new HashMap();
{
... initialize the map with some values ...
}
}
AFAIK, this would consume memory practically until the end of the program. I was wondering if I could solve it with soft references, like this:
class C {
    static volatile SoftReference<Map> m = null;

    static Map getM() {
        Map ret;
        if (m == null || (ret = m.get()) == null) {
            ret = new HashMap();
            // ... initialize the map ...
            m = new SoftReference(ret);
        }
        return ret;
    }
}
The question is
is this approach (and the implementation) right?
if it is, does it pay off in real situations?
First, the code above is not threadsafe.
Second, while it works in theory, I doubt there is a realistic scenario where it pays off. Think about it: In order for this to be useful, the map's contents would have to be:
Big enough so that their memory usage is relevant
Able to be recreated on the fly without unacceptable delays
Used only at times when other parts of the program require less memory - otherwise the maximum memory required would be the same, only the average would be less, and you probably wouldn't even see this outside the JVM since it gives back heap memory to the OS very reluctantly.
Here, 1. and 2. are sort of contradictory - large objects also take longer to create.
This is okay if your access to getM is single threaded and it only acts as a cache.
A better alternative is to have a fixed size cache as this provides a consistent benefit.
getM() should be synchronized, to avoid m being initialized at the same time by different threads.
How big is this map going to be? Is it worth the effort to handle it this way? Have you measured the memory consumption of this? (For what it's worth, I believe the above is generally OK, but my first question with optimisations is "what does it really save me?")
You're returning the reference to the map, so you need to ensure that your clients don't hold onto this reference (and prevent garbage collection). Perhaps your class can hold the reference, and provide a getKey() method to access the content of the map on behalf of clients ? That way you'll maintain control of the reference to the map in one place.
I would synchronise the above, in case the map gets garbage collected and two threads hit getM() at the same time. Otherwise you're going to create two maps simultaneously!
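A minimal thread-safe variant of the getter, sketched under the assumption that rebuilding the map while holding the lock is acceptable (generic types added for illustration):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

class C {
    private static SoftReference<Map<String, String>> cache = null;

    static synchronized Map<String, String> getM() {
        Map<String, String> ret = (cache == null) ? null : cache.get();
        if (ret == null) {
            ret = new HashMap<>();
            // ... initialize the map ...
            cache = new SoftReference<>(ret);
        }
        return ret;
    }
}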
Maybe you are looking for WeakHashMap? Then entries in the map can be garbage collected separately.
Though in my experience it didn't help much, so I instead built an LRU cache using LinkedHashMap. The advantage is that I can control the size so that it isn't too big and still useful.
I was wondering, if I could solve it with soft references
What is it that you are trying to solve? Are you running into memory problems, or are you prematurely optimizing?
In any case,
The implementation should be altered a bit if you were to use it. As has been noted, it isn't thread-safe: multiple threads could access the method at the same time, allowing multiple copies of your collection to be created. If these collections were then strongly referenced for the remainder of your program, you would end up with more memory consumption, not less.
A reason to use SoftReferences is to avoid running out of memory, as there is no contract other than that they will be cleared before the VM throws an OutOfMemoryError. Therefore there is no guaranteed benefit of this approach, other than not creating the cache until it is first used.
The first thing I notice about the code is that it mixes generic with raw types. That is just going to lead to a mess. javac in JDK7 has -Xlint:rawtypes to quickly spot that kind of mistake before trouble starts.
The code is not thread-safe, but uses statics so it is published across all threads. You probably don't want it to be synchronized, because that can cause problems if contended on multithreaded machines.
A problem with using a SoftReference for the entire cache is that you will cause spikes when the reference is cleared. In some circumstances it might work out better to have a ThreadLocal<SoftReference<Map<K,V>>>, which would spread the spikes and help thread safety, at the expense of not sharing between threads.
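A sketch of that per-thread variant; each thread rebuilds its own map lazily, trading duplication across threads for the absence of shared-state locking (the class name is illustrative):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class PerThreadSoftCache<K, V> {
    private final ThreadLocal<SoftReference<Map<K, V>>> local =
            ThreadLocal.withInitial(() -> new SoftReference<>(new HashMap<>()));

    public Map<K, V> get() {
        Map<K, V> map = local.get().get();
        if (map == null) {
            // The soft reference was cleared under memory pressure; rebuild.
            map = new HashMap<>();
            local.set(new SoftReference<>(map));
        }
        return map;
    }
}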
However, creating a smarter cache is more difficult. Often you end up with values referencing keys. There are ways around this, but it is a mess. I don't think ephemerons (essentially a pair of linked References) are going to make JDK7. You might find the Google Collections library worth looking at (although I haven't).
java.util.LinkedHashMap gives an easy way to limit the number of cached entries, but is not much use if you can't be sure how big the entries are, and can cause problems if it stops collection of large object systems such as ClassLoaders. Some people have said you shouldn't leave cache eviction up to the whims of the garbage collector, but then some people say you shouldn't use GC.
