I am new to Java caching and am trying to understand the difference between store by value and store by reference.
I found the paragraph below in the javax.cache (JCache) documentation:
"
The purpose of copying entries as they are stored in a Cache and again when they are returned from a Cache is to allow applications to continue mutating the state of the keys and values without causing side-effects to entries held by a Cache.
"
What are the "side-effects" mentioned above? And how do we choose between the two modes in practice?
This is a great question, since the answer isn't an easy one. The actual semantics vary slightly across cache implementations.
store by reference:
The cache stores and returns the identical object references.
Object key = ...
Object value = ...
cache.put(key, value);
assert cache.get(key) == value;
assert cache.iterator().next().getKey() == key;
If you mutate the key after storing a value under it, you get an ambiguous situation. The effect is the same as with a HashMap or ConcurrentHashMap.
Use store by reference to:
Maximize performance / minimize processing overhead
Keep data that fits into the Java heap
Mutate a value after storing it (see the sketch below). This can be useful for performance, but it isn't recommended practice, since you have to take care of the concurrency issues yourself and the usage relies on store-by-reference semantics.
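Here is a minimal sketch of what relying on store by reference looks like, assuming a JCache implementation is on the classpath; the cache name, the MutableValue class, and its counter field are illustrative only. setStoreByValue(false) requests store by reference, and since this is an optional JCache feature, an implementation may refuse to create the cache.
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class StoreByReferenceDemo {

    // Hypothetical mutable value type.
    static final class MutableValue {
        int counter;
    }

    public static void main(String[] args) {
        CacheManager manager = Caching.getCachingProvider().getCacheManager();
        MutableConfiguration<String, MutableValue> config =
                new MutableConfiguration<String, MutableValue>()
                        .setTypes(String.class, MutableValue.class)
                        .setStoreByValue(false); // request store by reference (optional feature)
        Cache<String, MutableValue> cache = manager.createCache("demo", config);

        MutableValue value = new MutableValue();
        cache.put("k", value);
        value.counter++;                     // the cached entry sees this mutation...
        assert cache.get("k").counter == 1;  // ...because cache and caller share the object
    }
}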
store by value:
Although it seems obvious, it is not so clear what store by value really means. According to the spec leads of JCache: Brian Oliver said it's protection against cache data corruption, and Greg Luck said it's everything that is not store by reference.
For that matter, I analyzed different compliant (meaning: passing the TCK) JCache implementations. Key and value objects are copied when passed to the cache, but you cannot rely on an object in the cache being copied when it is returned to the application.
So this assumption isn't true for all JCache implementations:
assert cache.get(key) != cache.get(key);
JCache implementations may vary even more when it gets into the details. An example:
Map map = cache.getAll(...);
assert map.get(key) != map.get(key);
Here is a contradiction in the expected semantics. We would expect the map contents to be stable; on the other hand, the cache would need to return a copy of the value on every access. The JCache spec doesn't enforce concrete semantics for this. The devil is in the details.
Since the key is copied upon storage by every cache implementation, you get the additional safety that the cache's internal data structures stay sane, but applications can still break because of shared value references.
My personal conclusion (I am open to discussion):
Since store by reference is an optional JCache feature, requesting it means limiting the number of cache implementations your application works with. Always use store by value unless you rely on store-by-reference semantics.
However, don't make your application depend on the semantics you think you might get with store by value. Never mutate any object after handing its reference to the cache or after retrieving its reference from the cache.
If there is still doubt, ask your cache vendor. IMHO it's good practice to document implementation details. A good example (since I put much thought into it...) is the JCache chapter in the cache2k user guide.
It is to prevent concurrent modification of mutable objects. The side effect is on other threads that are using the same object.
An example would be a bank program with multiple threads sharing a cache of mutable account objects. Suppose thread 1 retrieves an account from the cache and starts to perform an operation on it. While thread 1 is manipulating the object, thread 2 retrieves the same object and starts to manipulate it as well. Since they are simultaneously manipulating the same object in an uncoordinated way, the result is unpredictable. The object itself can even become corrupted.
Storing by value eliminates this common concurrency problem: the cache simply stores a copy of the object when it is saved to the cache and hands out a copy when the object is retrieved from the cache.
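To make that concrete, here is a minimal sketch of the store-by-value idea, assuming a hypothetical mutable Account class. The cache copies on put and on get, so no two threads ever share the instance the cache holds internally.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CopyingCache<K> {

    // Hypothetical mutable value type with a copy constructor.
    static final class Account {
        long balance;
        Account(long balance) { this.balance = balance; }
        Account(Account other) { this.balance = other.balance; }
    }

    private final Map<K, Account> store = new ConcurrentHashMap<K, Account>();

    void put(K key, Account value) {
        store.put(key, new Account(value)); // copy in: later mutations by the caller are invisible
    }

    Account get(K key) {
        Account cached = store.get(key);
        return cached == null ? null : new Account(cached); // copy out: each caller gets its own instance
    }
}
Real JCache implementations typically copy via serialization rather than copy constructors, but the isolation effect is the same.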
Related
Looking at the source of Java 6, HashSet<E> is actually implemented using HashMap<E,Object>, with a dummy object instance as the value of every entry of the Set.
I think that wastes 4 bytes (on 32-bit machines) per entry.
But why is it still used? Is there any reason to use it besides making the code easier to maintain?
Actually, it's not just HashSet. Nearly all implementations of the Set interface in Java 6 are based on an underlying Map. This is not a requirement; it's just the way the implementation is. You can see for yourself by checking out the documentation for the various implementations of Set.
Your main questions are:
But why is it still used? Is there any reason to use it besides making the code easier to maintain?
I assume that code maintenance is a big motivating factor. So is preventing duplication and bloat.
Set and Map are similar interfaces, in that duplicate elements (or keys) are not allowed. (I think the only Set not backed by a Map is CopyOnWriteArraySet, which is an unusual Collection because every mutation copies its backing array.)
Specifically:
From the documentation of Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)
The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).
And from Map:
An object that maps keys to values.
A map cannot contain duplicate keys; each key can map to at most one value.
If you can implement your Sets using existing code, any benefit (speed, for example) you can realize from existing code accrues to your Set as well.
If you choose to implement a Set without a Map backing, you have to duplicate code designed to prevent duplicate elements. Ah, the delicious irony.
That said, there's nothing preventing you from implementing your Sets differently.
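For illustration, here is a minimal sketch of HashSet's approach (not a complete java.util.Set implementation): delegate everything to a HashMap and use one shared dummy value for every key.
import java.util.HashMap;
import java.util.Map;

final class MapBackedSet<E> {

    // One shared dummy value for all entries, just like HashSet's PRESENT.
    private static final Object PRESENT = new Object();

    private final Map<E, Object> map = new HashMap<E, Object>();

    boolean add(E e)      { return map.put(e, PRESENT) == null; }
    boolean contains(E e) { return map.containsKey(e); }
    boolean remove(E e)   { return map.remove(e) == PRESENT; }
    int size()            { return map.size(); }
}
All the duplicate-prevention logic comes for free from the Map's key semantics.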
My guess is that HashSet was originally implemented in terms of HashMap in order to get it done quickly and easily. In terms of lines of code, HashSet is a fraction of HashMap.
I would guess that the reason it still hasn't been optimized is fear of change.
However, the waste is much worse than you think. On both 32-bit and 64-bit, HashSet is 4x larger than necessary, and HashMap is 2x larger than necessary. HashMap could be implemented with an array holding keys and values (plus chains for collisions). That means two pointers per entry, or 16 bytes on a 64-bit VM. In fact, HashMap contains an Entry object per entry, which adds 8 bytes for the pointer to the Entry and 8 bytes for the Entry object header. HashSet also uses 32 bytes per element, but its waste is 4x instead of 2x, since a set only really requires 8 bytes (one pointer) per element.
Yes, you are right, a small amount of waste is definitely there. Small because, for every entry, it uses the same PRESENT object (which is declared static final). Hence the only waste is the value slot of every entry in the HashMap.
Mostly, I think, they took this approach for maintainability and reusability. (The JCF developers would have thought: we have tested HashMap anyway, why not reuse it?)
But if you have huge collections and you are a memory freak, then you may opt for better alternatives like Trove or Google Collections.
I looked at your question and it took me a while to think about what you said. So here's my opinion regarding the HashSet implementation.
It is necessary to have the dummy instance to know whether a value is or is not present in the set.
Take a look at the add method:
public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
And now let's take a look at put's return value:
Returns: the previous value associated with key, or null if there was no mapping for key. (A null return can also indicate that the map previously associated null with key.)
So the PRESENT object is just used to represent that the set contains the value e. I think you asked why not use null instead of PRESENT. But then you would not be able to distinguish whether the entry was already in the map, because map.put(key, null) would return null both when the key was absent and when it was already present; you would have no way to know if the key existed.
That being said, you could argue that they could have used an implementation like this:
public boolean add(E e) {
    if (map.containsKey(e)) {
        return false;
    }
    map.put(e, null);
    return true;
}
I guess they waste 4 bytes to avoid computing the hashCode of the key twice (when the key is going to be added), as that could be expensive.
If your question referred to why they used a HashMap that wastes 8 bytes (because of the Map.Entry) instead of some other data structure using a similar entry of only 4, then yes, I would say they did it for the reasons you mentioned.
I am guessing that it has never turned up as a significant problem for real applications or important benchmarks. Why complicate the code for no real benefit?
Also note that object sizes are rounded up in many JVM implementations, so there may not actually be an increase in size (I don't know for this example). Also, the code for HashMap is likely to be compiled and already in cache. Other things being equal, more code => more cache misses => lower performance.
After searching through pages like this one wondering why the standard implementation is mildly inefficient, I found com.carrotsearch.hppc.IntOpenHashSet.
Your question:
I think that wastes 4 byte (on 32-bit machines) for the size of the entry itself.
Just one Object instance is created for the entire HashSet data structure, and reusing HashMap saves rewriting all of that code again:
private static final Object PRESENT = new Object();
All the keys map to the same value, i.e. the PRESENT object.
Say I have an object and obj.hashCode() returns 8973846,
Can I call a function with the hash code and get the object back?
No. hashCode() is not unique (different objects can have the same hashCode; even different objects of the same type can have the same hashCode), so it's not possible to implement such a method.
The best you could do would be, when you create your objects, to put them into a big HashMap<Integer,Object> that maps hash codes to instances. That way, you'd be able to retrieve them later.
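A minimal sketch of such a registry, with hypothetical names; note that it inherits both of the problems described below.
import java.util.HashMap;
import java.util.Map;

final class HashCodeRegistry {

    private final Map<Integer, Object> byHash = new HashMap<Integer, Object>();

    void register(Object o) {
        byHash.put(o.hashCode(), o); // a later object with the same hash code silently replaces this one
    }

    Object lookup(int hashCode) {
        return byHash.get(hashCode); // may be a different object than the one you expect
    }

    void unregister(Object o) {
        byHash.remove(o.hashCode()); // without this, the map pins the object in memory forever
    }
}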
Two major problems, though:
Because hash codes aren't guaranteed to be unique, you'll retrieve something with the right hash code, but not necessarily the thing you were expecting. You'd need to code everything so that hash codes were unique with high probability (which is going to be hard when there are only 32 bits to play with).
Your garbage collector is going to have a huge problem here unless you also remove objects from the hash map when you've finished with them. Normally, the garbage collector cleans up any instances that don't have any references left, but in your case, everything will maintain a reference inside the hash map. Welcome to Memory Leak City, Arizona.
You might try a WeakHashMap to alleviate the second problem, though that might cause more problems: when you try to retrieve an object later, it might have disappeared...
For my use case, I have to pass quite a lot of context information between different layers/components of the application. Since a few of the components are discrete, I am thinking of using a ThreadLocal to store such context information. I have an interceptor/filter in place to clean it up before the response is written back to the user. Now, my question is: is it a good idea to use a WeakHashMap inside the ThreadLocal (see the code snippet below)?
private static final ThreadLocal<Map<String, Object>> context = ThreadLocal.withInitial(WeakHashMap::new);
The doubt in my mind (with my limited knowledge of weak references in Java) is that the weak references can return null (because the GC collects them at its own discretion).
Please help me understand this. Should I use a strong reference like HashMap or ConcurrentHashMap, or is my implementation good to go?
The Javadoc for WeakHashMap states:
This class is intended primarily for use with key objects whose equals methods test for object identity using the == operator. Once such a key is discarded it can never be recreated, so it is impossible to do a lookup of that key in a WeakHashMap at some later time and be surprised that its entry has been removed. This class will work perfectly well with key objects whose equals methods are not based upon object identity, such as String instances. With such recreatable key objects, however, the automatic removal of WeakHashMap entries whose keys have been discarded may prove to be confusing.
So if you can't tolerate entries randomly disappearing, then you really shouldn't be using WeakHashMap with String keys.
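A small demo of that disappearing behavior, assuming the JVM actually collects on the System.gc() hint (it is free not to):
import java.util.Map;
import java.util.WeakHashMap;

public class WeakDemo {
    public static void main(String[] args) {
        Map<String, String> map = new WeakHashMap<String, String>();
        // new String(...) forces a non-interned key; a bare literal might be
        // held alive by the intern pool and never collected.
        String key = new String("context-key");
        map.put(key, "value");
        key = null;      // drop the only strong reference to the key
        System.gc();     // only a hint; collection is not guaranteed
        System.out.println(map.containsKey("context-key")); // may print false
    }
}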
I have a map that I use to store dynamic data that is discarded soon after it is created (i.e., the data is consumed quickly). It responds to user interaction in the sense that when the user clicks a button, the map is filled, the data is used to do some work, and then the map is no longer needed.
So my question is: what's the better approach for emptying the map? Should I set it to null each time, or should I call clear()? I know clear() is linear in time, but I don't know how to compare that cost with the cost of creating the map each time. The size of the map is not constant, though it may range from n to 3n elements between creations.
If the map is not referenced from other objects (where it may be hard to set a new one), simply nulling out the old map and starting from scratch is probably lighter-weight than calling clear(), because no linear-time cleanup needs to happen. With garbage collection costs being tiny on modern systems, there is a good chance you would save some CPU cycles this way. You can also avoid resizing the new map multiple times by specifying its initial capacity.
One situation where clear() is preferred is when the map object is shared among multiple objects in your system. For example, if you create a map, give it to several objects, and then keep some shared information in it, setting the map to a new one in all these objects may require keeping references to all the objects that hold the map. In situations like that, it's easier to keep calling clear() on the same shared map object (see the sketch below).
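A small sketch of that sharing situation, with illustrative names: clear() is visible through every reference to the map, whereas reassigning one field makes the holders diverge.
import java.util.HashMap;
import java.util.Map;

public class SharedMapDemo {

    static final class Holder {
        Map<String, String> map;
        Holder(Map<String, String> map) { this.map = map; }
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<String, String>();
        Holder a = new Holder(shared);
        Holder b = new Holder(shared);

        shared.put("k", "v");
        shared.clear();                        // both a.map and b.map are empty now

        a.map = new HashMap<String, String>(); // b.map still points at the old map
        System.out.println(a.map == b.map);    // false: the holders have diverged
    }
}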
Well, it depends on how much memory you can throw at it. If you have a lot, then it doesn't matter. However, setting the map itself to null means you have handed the work to the garbage collector: if only the map has references to the instances inside it, the garbage collector can collect not only the map but also everything it contains. clear() does empty the map, but it has to iterate over everything in the map to set each reference to null, and this happens during your execution time; the garbage collector essentially has to do this work anyway, so let it do its thing. Just note that setting the variable to null doesn't let you reuse the map. A typical pattern to reuse a map variable is:
Map<String, String> whatever = new HashMap<String, String>();
// ... do something with the map
whatever = new HashMap<String, String>();
This allows you to reuse the variable without setting it to null at all; you silently discard the reference to the old map. This would be atrocious practice in a non-memory-managed language, since dropping the last pointer to the old allocation would leak that memory, but in Java, since nothing references the old map anymore, the GC marks it as eligible for collection.
I feel nulling the existing map is cheaper than clear(), as object creation is very cheap in modern JVMs.
Short answer: use Collection.clear() unless it is too complicated to keep the collection around.
Detailed answer: in Java, the allocation of memory is almost instantaneous; it is little more than a pointer that gets moved inside the VM. However, the initialization of those objects might add up to something significant. Also, all objects that use an internal buffer are sensitive to resizing and copying of their content. Using clear() makes sure that the buffers eventually stabilize at some size, so that reallocation of memory and copying from the old buffer to a new one are never necessary.
Another important issue is that allocating and then releasing a lot of objects requires more frequent execution of the garbage collector, which might cause sudden lags.
If you always hold on to the map, it will be promoted to the old generation. If each user has one corresponding map, the number of maps in the old generation is proportional to the number of users. This may trigger full GCs more frequently as the number of users increases.
You can use both with similar results.
One prior answer says clear() is expected to take constant time in a mature map implementation. Without checking the source code of the likes of HashMap, TreeMap, and ConcurrentHashMap, I would rather expect their clear methods to take time proportional to the size of the internal table, plus amortized garbage collection costs.
Another poster notes that a shared map cannot be nulled out. Well, it can if you want it to be, but you do it by using a proxy object which encapsulates a proper map and nulls it out when needed. Of course, you'd have to implement the proxy map class yourself.
Map<Foo, Bar> myMap = new ProxyMap<Foo, Bar>();
// Internally, the above object holds a reference to a proper map,
// for example, a hash map. Furthermore, this delegates all calls
// to the underlying map. A true proxy.
myMap.clear();
// The clear method simply reinitializes the underlying map.
Unless you did something like the above, clear and nulling out are equivalent in the ways that matter, but I think it's more mature to assume your map, even if not currently shared, may become shared at a later time due to forces you can't foresee.
There is another reason to clear instead of nulling out, even if the map is not shared. Your map may be instantiated by an external client, like a factory, so if you clear your map by nulling it out, you might end up coupling yourself to the factory unnecessarily. Why should the object that clears the map have to know that you instantiate your maps using Guava's Maps.newHashMap() with God knows what parameters? Even if this is not a realistic concern in your project, it still pays off to align yourself to mature practices.
For the above reasons, and all else being equal, I would vote for clear.
HTH.
I have a HashMap which stores around 1 GB of data in terms of key-value pairs. This HashMap changes every 15 days. It will be loaded into memory and used from there.
When a new HashMap has to be loaded into memory, there will be several transactions already accessing the HashMap in memory. How can I replace the old HashMap with the new one without affecting the current transactions accessing the old one? Is there a way to hot swap the HashMap in memory?
Use an AtomicReference<Map<Foo, Bar>> rather than exposing a direct (hard) reference to the map. Consumers of the map will use #get(), and when you're ready to swap out the map, your "internal" code will use #set() or #getAndSet().
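A minimal sketch of that pattern, with illustrative names and String keys/values standing in for your real types:
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

final class HotSwappableLookup {

    private final AtomicReference<Map<String, String>> ref =
            new AtomicReference<Map<String, String>>(Collections.<String, String>emptyMap());

    String lookup(String key) {
        // In-flight transactions keep using whichever map they fetched.
        return ref.get().get(key);
    }

    void reload(Map<String, String> freshlyLoaded) {
        // Publish the fully populated replacement in one atomic step; the old
        // map becomes garbage once the last reader lets go of it.
        ref.set(Collections.unmodifiableMap(new HashMap<String, String>(freshlyLoaded)));
    }
}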
Provide a getter to the map
Mark the map private and volatile
When updating the map, create a new one, populate it and when it is ready, assign it to your private map variable.
Reference assignments are atomic in Java, and volatile ensures visibility (a minimal sketch follows the caveats below).
Caveats:
you will have two maps in memory at some stage
if some code keeps a reference to the old map, it will access stale data. If that is an issue, you can completely hide the map and provide a get(K key) method instead, so that users always access the latest map.
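A minimal sketch of these steps, with illustrative names:
import java.util.HashMap;
import java.util.Map;

final class VolatileLookup {

    // volatile: readers always see the most recently assigned map.
    private volatile Map<String, String> map = new HashMap<String, String>();

    Map<String, String> getMap() {
        return map;
    }

    void swapIn(Map<String, String> populated) {
        // Build and populate the new map first, then publish it in a single
        // atomic reference assignment.
        map = populated;
    }
}
Alternatively, hide the map entirely behind a get(K key) method, as the caveat above suggests.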
I suggest using a caching tool like memcached if the data size is large like yours. That way you can invalidate individual items or the entire cache as per your requirements.