What's the point of making the key and value the same? Will the JVM optimize memory and keep only one copy of the object in the heap?
Map<T, T> is often used to implement a Set<T> with the same properties as the backing map. E.g. if the map is thread-safe, the corresponding set will be thread-safe too. If the map is navigable, the set will also be navigable, etc.
Keeping an element in both the key and the value provides a way to get back the exact instance stored in the set. Here are some typical use cases for this pattern.
Obtaining a canonical object. Think of something like String.intern() but for arbitrary objects. Interning can be easily implemented with Map<T, T>:
T existing = map.putIfAbsent(obj, obj);
return existing != null ? existing : obj;
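For instance, a minimal thread-safe interner built on this idea might look like the following sketch (the Interner name is hypothetical):
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// A minimal thread-safe interner built on Map<T, T>.
class Interner<T> {
    private final ConcurrentMap<T, T> map = new ConcurrentHashMap<>();

    // Returns the canonical instance: the first equal object ever interned.
    T intern(T obj) {
        T existing = map.putIfAbsent(obj, obj);
        return existing != null ? existing : obj;
    }
}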
Storing mutable objects in a set. If you want to modify an existing object, a set backed by Map<T, T> will come to the rescue again:
T existing = map.get(key);
if (existing != null) {
    existing.mutate();
}
As far as I understand, a concurrent NavigableMap<Cell, Cell> is used in HBase to implement a concurrent navigable set of Cells with the above properties.
Note that the key and value in such a map are just two references to the same object. The object itself is not copied.
Related
I have a collection of objects, let's call them A, B, C, D,... and some are equal to others. If A and C are equal, then I want to replace every reference to C with a reference to A. This means (a) object C can be garbage collected, freeing up memory, and (b) I can later use "==" to compare objects in place of an expensive equals() operation. (These objects are large and the equals() operation is slow.)
My instinct was to use a java.util.Set. When I encounter C I can easily see if there is an entry in the Set equal to C. But if there is, there seems to be no easy way to find out what that entry is, and replace my reference to the existing entry. Am I mistaken? Iterating over all the entries to find the one that matches is obviously a non-starter.
Currently, instead of a Set, I'm using a Map in which the value is always the same as the key. Calling map.get(C) then finds A. This works, but it feels incredibly convoluted. Is there a more elegant way of doing it?
This problem is not simple de-duplication: it is a form of canonicalization.
The standard approach is to use a Map rather than a Set. Here's a sketch of how to do it:
public <T> List<T> canonicalizeList(List<T> input) {
    HashMap<T, T> map = new HashMap<>();
    List<T> output = new ArrayList<>();
    for (T element : input) {
        T canonical = map.get(element);
        if (canonical == null) {
            canonical = element;              // first occurrence becomes the canonical instance
            map.put(canonical, canonical);
        }
        output.add(canonical);
    }
    return output;
}
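A quick check of the method above (using strings for illustration):
List<String> input = Arrays.asList(new String("a"), new String("a"), "b");
List<String> result = canonicalizeList(input);
// Both "a" elements now refer to the same instance:
assert result.get(0) == result.get(1);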
Note that this is O(N). If you can safely assume that the percentage of duplicates in input is likely to be small, then you could set the initial capacity of map and output to the size of input.
Now you seem to be saying that you are doing it this way already (last paragraph), and you are asking if there is a better way. As far as I know, there isn't one. (The HashSet API would let you test whether a set contains a value equal to element, but it does not let you find out what that value is in O(1).)
For what it is worth, under the hood the HashSet<T> class is implemented as a HashMap<T, Object> whose values are all one shared dummy object. So you would not be saving time or space by using a HashSet directly ...
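You can see this in the OpenJDK source (trimmed excerpt):
// Trimmed excerpt from OpenJDK's java.util.HashSet:
public class HashSet<E> extends AbstractSet<E> implements Set<E> {
    private transient HashMap<E, Object> map;

    // Dummy value to associate with an Object in the backing Map.
    private static final Object PRESENT = new Object();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }
}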
The Java Tutorials have this to say on IdentityHashMap:
IdentityHashMap is an identity-based Map implementation based on a hash table. This class is useful for topology-preserving object graph transformations, such as serialization or deep-copying. To perform such transformations, you need to maintain an identity-based "node table" that keeps track of which objects have already been seen. Identity-based maps are also used to maintain object-to-meta-information mappings in dynamic debuggers and similar systems. Finally, identity-based maps are useful in thwarting "spoof attacks" that are a result of intentionally perverse equals methods because IdentityHashMap never invokes the equals method on its keys. An added benefit of this implementation is that it is fast.
Could someone please explain in Simple English what is meant by both
"identity-based Map" and
"topology-preserving object graph transformations"?
"Identity-based Map" means that keys are compared via == for identity, not with equals for equality.
"Topology-preserving object graph transformations" means that when you have some object structure and transform it to another object structure, you want to preserve topology i.e. relation between nodes in the original and the target graph. For this you need to map nodes via identity, not equality.
Consider the following example. You have a tree of Foo objects (the tree defined via a Foo parent field) which you want to transform into a tree of Bar objects (again, Bar has a Bar parent field). For each Foo you'll need to create a new Bar, but just once. To keep track of that mapping you'll create a Map<Foo, Bar>. You'll also use this map to find the parent Bars.
The problem, however, is that if two Foos are equal, you may get the wrong parent Bar when looking it up in the tracking map. This will break the topology of the Bar tree: you'll hang a node on the wrong parent.
To avoid this you need identity comparison, not equality. This is what IdentityHashMap does.
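A minimal sketch of such a node table, with Foo and Bar as the hypothetical classes from the example:
import java.util.IdentityHashMap;
import java.util.Map;

class Foo { Foo parent; }
class Bar { Bar parent; }

class TreeCopier {
    // Node table keyed by identity, so equal-but-distinct Foos map to distinct Bars.
    private final Map<Foo, Bar> nodeTable = new IdentityHashMap<>();

    Bar toBar(Foo foo) {
        if (foo == null) {
            return null;
        }
        Bar bar = nodeTable.get(foo);
        if (bar == null) {
            bar = new Bar();
            nodeTable.put(foo, bar);        // remember before recursing
            bar.parent = toBar(foo.parent); // preserves the tree topology
        }
        return bar;
    }
}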
Read the Javadoc - as you always should if you need to understand a class.
This class implements the Map interface with a hash table, using reference-equality in place of object-equality when comparing keys (and values). In other words, in an IdentityHashMap, two keys k1 and k2 are considered equal if and only if (k1==k2). (In normal Map implementations (like HashMap) two keys k1 and k2 are considered equal if and only if (k1==null ? k2==null : k1.equals(k2)).)
and
A typical use of this class is topology-preserving object graph transformations, such as serialization or deep-copying. To perform such a transformation, a program must maintain a "node table" that keeps track of all the object references that have already been processed. The node table must not equate distinct objects even if they happen to be equal.
Simple code is better than a thousand words; see below:
public static void main(String[] args) {
    // Two distinct Integer instances that are equal to each other
    // (deliberately using the constructor, not Integer.valueOf, to avoid caching).
    Integer key1 = new Integer("1");
    Integer key2 = new Integer("1");

    // A normal map
    Map<Integer, String> map = new HashMap<Integer, String>();
    map.put(key1, "Hello");
    map.put(key2, "World");
    System.out.println(map); // Output: {1=World}

    // An identity hash map
    Map<Integer, String> identityMap = new IdentityHashMap<Integer, String>();
    identityMap.put(key1, "Hello");
    identityMap.put(key2, "World");
    System.out.println(identityMap); // Output: {1=Hello, 1=World} (order may vary)
}
What you observed above:
In the first case, key1 is compared to key2 with the equals method.
In the second case, key1 is compared to key2 with the == operator.
So, in the case of IdentityHashMap, two keys are equal if and only if they refer to the same location in memory, which is identity equality; hence this map is a special implementation that only supports identity-based equality.
Objects have references to other objects, which may in turn have references to more objects; the result is an object graph.
If you want to transform the object graph and maintain the object relationships, you must do it based on reference identity, not on the actual values of the objects. If you use IdentityHashMap, it will preserve the original topology of object references, as it doesn't rely on the hashCode() and equals() methods of the objects.
Let's say I have a simple Java object, let's call it DefinedData. It will contain a number of final fields of varying types, such as strings, integers, enums, and even perhaps a set or two of strings. All in all, it's just a relatively simple data container. There will be potentially 1k to 2k of these, all static final objects. Most of these fields will be unique in that no other DefinedData object will have the same value for that field.
These will be placed into a Map of (DefinedData, Object). Now, you could easily get that Object out of the Map if you have the DefinedData object, but what if you only have one of the unique field values? You can't just pass that to the Map. You'd have to iterate over the keys and check, and that would mean wrapping the map with a lookup method for each field in DefinedData. Doable, but not the prettiest thing out there, especially if there are a lot of values in the Map and a lot of lookups, which is possible. Either that or there would need to be a lookup for DefinedData objects, which would again be a bunch of Maps...
This almost sounds like a job for a database (look up based on any column), but that's not a good solution for this particular problem. I'd also rather avoid having a dozen different Maps, each mapping a single field from DefinedData to the Object. The multikey maps I've seen wouldn't be applicable as they require all key values, not just one. Is there a Map, Collections, or other implementation that can handle this particular problem?
The only way to avoid having multiple maps is to iterate through all your DefinedData objects in some way. The reason is that you have no way of knowing how to divide or sort them until the request is made.
An example could be made if you had a bucket of apples. At any moment someone may come up and request a certain color, a certain kind, or a certain size. You have to choose to sort by one of those categories, and the other categories have to be searched through all the apples.
If only you could have three identical sets of apples, one for each category...
Having multiple maps would be the faster solution, though it takes up more memory, while iterating would be easier to implement and use less memory, but slower.
I hesitate to propose this, but you could encapsulate your lookups behind some sort of Indexer class that auto-generates a single map via reflection using the fields of supplied objects.
By single map, I mean just one single map for the whole indexer which creates a key based on both the field name and data (say concatenating the string representing the field name with a string representation of the data).
Lookups against the indexer would supply both a field name and data value, which would then be looked up in the single map encapsulated by the indexer.
I do not think this necessarily has any advantage over a similar solution where the indexer is instead backed by a map of maps (map of field name to map of data to object).
The indexer could also be designed to use annotations so that not all fields are indexed, only those suitably annotated (or vice-versa, with annotations to exclude fields).
Overall, a map-of-maps solution strikes me as easier, since it cuts out the step of complicated key assembly (which could be tricky for certain field data types). In either case, encapsulating it all in an Indexer that auto-generates its maps seems to be the way to go.
Update:
Made a quick non-generified proof of concept for an Indexer-type class (using the map-of-maps approach). This is in no way a finished work, but it illustrates the concept above. One major deficiency is the reliance on bean conventions, so fields without accessor methods, whether public or private, are invisible to this indexer.
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Indexer
{
    private Map<String, Map<Object, Set<Object>>> index = new HashMap<String, Map<Object, Set<Object>>>();

    // Add an object to the index; all bean properties are indexed.
    public void add(Object object) throws Exception
    {
        BeanInfo info = Introspector.getBeanInfo(object.getClass());
        PropertyDescriptor[] propertyDescriptors = info.getPropertyDescriptors();
        for (PropertyDescriptor descriptor : propertyDescriptors)
        {
            String fieldName = descriptor.getName();
            if ("class".equals(fieldName))
            {
                continue; // skip the built-in getClass() property
            }
            Map<Object, Set<Object>> map = index.get(fieldName);
            if (map == null)
            {
                map = new HashMap<Object, Set<Object>>();
                index.put(fieldName, map);
            }
            Method method = descriptor.getReadMethod();
            Object data = method.invoke(object);
            Set<Object> set = map.get(data);
            if (set == null)
            {
                set = new HashSet<Object>();
                map.put(data, set);
            }
            set.add(object);
        }
    }

    // Retrieve the set of all objects from the index whose property matches the supplied value.
    public Set<Object> get(String fieldName, Object value)
    {
        Map<Object, Set<Object>> map = index.get(fieldName);
        if (map != null)
        {
            Set<Object> set = map.get(value);
            if (set != null)
            {
                return Collections.unmodifiableSet(set);
            }
        }
        return null;
    }
}
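For illustration, usage might look like this (Widget is a hypothetical bean with standard getters):
public class Widget
{
    private final String name;
    private final int size;

    public Widget(String name, int size) { this.name = name; this.size = size; }
    public String getName() { return name; }
    public int getSize() { return size; }
}

// Index a couple of widgets, then look them up by a single field.
Indexer indexer = new Indexer();
indexer.add(new Widget("bolt", 5));
indexer.add(new Widget("nut", 5));

Set<Object> sizeFive = indexer.get("size", 5);   // both widgets
Set<Object> bolts = indexer.get("name", "bolt"); // just the bolt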
Is there a simple, efficient Map implementation that allows a limit on the memory to be used by the map?
My use case is that I want to allocate most of the available memory dynamically at the time of the map's creation, but I don't want an OutOfMemoryError at any time in the future. Basically, I want to use this map as a cache, but I want to avoid heavy cache implementations like EHCache. My need is simple (at most an LRU algorithm).
I should further clarify that the objects in my cache are char[] or similar primitive arrays that will not hold references to other objects.
I can put an upper limit on the maximum size of each entry.
You can use a LinkedHashMap to limit the number of entries in the Map:
removeEldestEntry(Map.Entry<K,V> eldest): Returns true if this map should remove its eldest entry. This method is invoked by put and putAll after inserting a new entry into the map. It provides the implementor with the opportunity to remove the eldest entry each time a new one is added. This is useful if the map represents a cache: it allows the map to reduce memory consumption by deleting stale entries.
Sample use: this override will allow the map to grow up to 100 entries and then delete the eldest entry each time a new entry is added, maintaining a steady state of 100 entries.
private static final int MAX_ENTRIES = 100;

protected boolean removeEldestEntry(Map.Entry eldest) {
    return size() > MAX_ENTRIES;
}
Related questions
How do I limit the number of entries in a java hashtable?
Easy, simple to use LRU cache in java
What is a data structure kind of like a hash table, but infrequently-used keys are deleted?
For caches, a SoftHashMap is much more appropriate than a WeakHashMap. A WeakHashMap is usually used when you want to maintain an association with an object for as long as that object is alive, but without preventing it from being reclaimed.
In contrast, a SoftReference is more closely involved with memory allocation. See No SoftHashMap? for details on the differences.
WeakHashMap is also not usually appropriate as it has the association around the wrong way for a cache - it uses weak keys and hard values. That is, the key and value are removed from the map when the key is cleared by the garbage collector. This is typically not what you want for a cache - where the keys are usually lightweight identifiers (e.g. strings, or some other simple value type) - caches usually operate such that the key/value is reclaimed when the value reference is cleared.
The Commons Collections has a ReferenceMap where you can plug in what types of references you wish to use for keys and values. For a memory-sensitive cache, you will probably use hard references for keys, and soft references for values.
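A sketch with Commons Collections 3.x (non-generic API; class and constants as in that release):
import java.util.Map;
import org.apache.commons.collections.map.ReferenceMap;

// Hard references for keys, soft references for values: the GC may clear
// values under memory pressure, after which the entries are purged.
Map cache = new ReferenceMap(ReferenceMap.HARD, ReferenceMap.SOFT);
cache.put("key", new byte[1024]);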
To obtain LRU semantics for a given number of references N, maintain a list of the last N entries fetched from the cache - as an entry is retrieved from the cache it is added to the head of the list (and the tail of the list removed.) To ensure this does not hold on to too much memory, you can create a soft reference and use that as a trigger to evict a percentage of the entries from the end of the list. (And create a new soft reference for the next trigger.)
Java Platform Solutions
If all you're looking for is a Map whose keys can be cleaned up to avoid OutOfMemoryErrors, you might want to look into WeakHashMap. It uses WeakReferences in order to allow the garbage collector to reap the map entries. It won't enforce any sort of LRU semantics, though, except those present in the generational garbage collection.
There's also LinkedHashMap, which has this in the documentation:
A special constructor is provided to create a linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order). This kind of map is well-suited to building LRU caches. Invoking the put or get method results in an access to the corresponding entry (assuming it exists after the invocation completes). The putAll method generates one entry access for each mapping in the specified map, in the order that key-value mappings are provided by the specified map's entry set iterator. No other methods generate entry accesses. In particular, operations on collection-views do not affect the order of iteration of the backing map.
So if you use this constructor to make a map whose iterator iterates in LRU order, it becomes pretty easy to prune the map. The one (fairly big) caveat is that LinkedHashMap is not synchronized at all, so you're on your own for concurrency. You can wrap it in a synchronized wrapper, but that may have throughput issues.
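For example (a sketch; the capacity and load factor are just the defaults written out):
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// accessOrder = true gives LRU iteration order; the wrapper adds coarse locking.
Map<String, Object> cache = Collections.synchronizedMap(
        new LinkedHashMap<String, Object>(16, 0.75f, true));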
Roll Your Own Solution
If I had to write my own data structure for this use-case, I'd probably create some sort of data structure with a map, queue, and ReadWriteLock along with a janitor thread to handle the cleanup when too many entries were in the map. It would be possible to go slightly over the desired max size, but in the steady-state you'd stay under it.
WeakHashMap won't necessarily achieve your purpose, since if enough strong references to the keys are held by your app, you WILL see an OOME.
Alternatively you could look into SoftReference, which nulls out the content once the heap is scarce. However, most of the comments I've seen indicate that it will not null out the reference until the heap is really, really low and a lot of GC kicks in with a severe performance hit (so I don't recommend using it for your purpose).
My recommendation is to use a simple LRU map, e.g. http://commons.apache.org/collections/apidocs/org/apache/commons/collections/LRUMap.html
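Usage is a one-liner with Commons Collections 3.x (non-generic API; the size is arbitrary):
import java.util.Map;
import org.apache.commons.collections.map.LRUMap;

// A fixed-size map that evicts the least-recently-used entry when full.
Map cache = new LRUMap(1000);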
Thanks for the replies, guys!
As jasonmp85 pointed out, LinkedHashMap has a constructor that allows access order. I missed that bit when I looked at the API docs. The implementation also looks quite efficient (see below). Combined with a max size cap for each entry, that should solve my problem.
I will also look closely at SoftReference. Just for the record, Google Collections seems to have pretty good API for SoftKeys and SoftValues and Maps in general.
Here is a snippet from Java's LinkedHashMap class that shows how it maintains LRU behavior.
/**
 * Removes this entry from the linked list.
 */
private void remove() {
    before.after = after;
    after.before = before;
}

/**
 * Inserts this entry before the specified existing entry in the list.
 */
private void addBefore(Entry<K,V> existingEntry) {
    after = existingEntry;
    before = existingEntry.before;
    before.after = this;
    after.before = this;
}

/**
 * This method is invoked by the superclass whenever the value
 * of a pre-existing entry is read by Map.get or modified by Map.set.
 * If the enclosing Map is access-ordered, it moves the entry
 * to the end of the list; otherwise, it does nothing.
 */
void recordAccess(HashMap<K,V> m) {
    LinkedHashMap<K,V> lm = (LinkedHashMap<K,V>)m;
    if (lm.accessOrder) {
        lm.modCount++;
        remove();
        addBefore(lm.header);
    }
}
I have a similar problem to the one discussed here, but with a stronger practical motivation.
For example, I have a Map<String, Integer> and a function which, given a key, puts null into the map if the mapped integer value is negative:
Map<String, Integer> map = new HashMap<String, Integer>();

public void nullifyIfNegative(String key) {
    Integer value = map.get(key);
    if (value != null && value.intValue() < 0) {
        map.put(key, null);
    }
}
In this case, the lookup (and hence the hashCode calculation for the key) is done twice: once for the lookup and once for the replacement. It would be nice to have another method (one already exists inside HashMap, but it is package-private) that would make this more efficient:
public void nullifyIfNegative(String key) {
    Map.Entry<String, Integer> entry = map.getEntry(key); // proposed method
    if (entry != null && entry.getValue() != null && entry.getValue().intValue() < 0) {
        entry.setValue(null);
    }
}
The same concern applies when you want to manipulate immutable objects that are map values:
Map<String, String>: I want to append something to the string value.
Map<String, int[]>: I want to insert a number into the array.
So the case is quite common. Solutions that might work, but not for me:
Reflection. It is good, but I cannot sacrifice performance just for this nice feature.
Use org.apache.commons.collections.map.AbstractHashedMap (it has at least a protected getEntry() method), but unfortunately commons-collections does not support generics.
Use the generics-enabled commons-collections fork, but this library (AFAIK) is out of date (not in sync with the latest library version from Apache) and, critically, is not available in the central Maven repository.
Use value wrappers, which means "making values mutable" (e.g. use mutable integers [such as org.apache.commons.lang.mutable.MutableInt], or collections instead of arrays). This solution leads to memory overhead, which I would like to avoid.
Try to extend java.util.HashMap with a custom class implementation (which would have to be in the java.util package) and put it in the endorsed folder (java.lang.ClassLoader will refuse to load it through Class<?> defineClass(String name, byte[] b, int off, int len); see the sources), but I don't want to patch the JDK, and it seems the list of packages that can be endorsed does not include java.util.
A similar question has already been raised on the sun.com bug tracker, but I would like to know the community's opinion and the best way out, bearing in mind maximum memory and performance effectiveness.
If you agree that this is nice and beneficial functionality, please vote for this bug!
As a logical matter, you're right that a single getEntry would save you a hash lookup. As a practical matter, unless you have a specific use case where you have reason to be concerned about the performance hit (which seems pretty unlikely: hash lookup is common, O(1), and well optimized), what you're worrying about is probably negligible.
Why don't you write a test? Create a hash map with a few tens of millions of objects, or whatever is an order of magnitude greater than what your application is likely to create, and average the time of a get() over a million or so iterations (hint: it's going to be a very small number).
A bigger issue with what you're doing is synchronization. You should be aware that if you're doing conditional alterations on a map, you could run into issues even with a synchronized map, as you'd have to hold the lock across the span of both the get() and put() operations.
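For what it's worth, Java 8 later added Map.compute and Map.merge, which cover much of this use case in a single call; ConcurrentHashMap.compute is even atomic, which also addresses the locking concern above. One semantic difference from the code in the question: returning null from the remapping function removes the entry instead of storing a null value. A sketch:
import java.util.HashMap;
import java.util.Map;

public class ComputeDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", -5);

        // One read-modify-write pass; caveat: returning null REMOVES the
        // mapping instead of storing a null value like map.put(key, null).
        map.compute("a", (k, v) -> (v != null && v < 0) ? null : v);

        // Appending to an immutable String value without a separate get():
        Map<String, String> strings = new HashMap<>();
        strings.put("greeting", "Hello");
        strings.merge("greeting", ", world", String::concat); // "Hello, world"
    }
}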
It's not pretty, but you could use a lightweight holder object to keep a reference to the actual value and avoid second lookups.
HashMap<String, String[]> map = ...;

// Append value to the current value of key.
String key = "key";
String value = "value";

// Use a one-element array to hold a reference - even uglier than the whole idea itself ;)
String[] ref = new String[1]; // lightweight object
String[] prev = map.put(key, ref);
ref[0] = (prev != null) ? prev[0] + value : value;
I wouldn't worry too much about hash lookup performance, though (Steve B's answer is pretty good at pointing out why). Especially with String keys, I wouldn't worry about hashCode() since its result is cached. You could worry about equals(), though, as it might be called more than once per lookup. But for short strings (which are often used as keys) this is negligible too.
There is no performance gain from this proposal, because the average-case performance of a Map lookup is already O(1). And exposing the raw Entry would raise another problem: it would become possible to change the key inside an entry (even if only via reflection) and thereby corrupt the internal hash table.