Consistent and efficient bi-directional data structure implementation (Java)

I needed an implementation of a bi-directional map in Java, so I tried BiMap from Guava and BidiMap from Commons Collections. However, the inverse view is not maintained after an element is modified. Here is an example with BiMap (same behavior with BidiMap):
BiMap<Set<String>, Set<String>> map = HashBiMap.create();
Set<String> foo = new HashSet<>();
foo.add("foo");
Set<String> bar = new HashSet<>();
bar.add("bar");
map.put(foo, bar);
map.get(foo); // returns [bar], ok
map.inverse().get(map.get(foo)); // returns [foo], ok
map.get(foo).add("someString");
map.get(foo); // returns [bar, someString], ok
map.inverse().get(map.get(foo)); // returns null, not ok <=
Of course this behavior is to be expected from an implementation backed by HashMaps, but it illustrates the problem.
So the question is, is there a bi-directional data structure which can handle this kind of situation, with elements of arbitrary types, and still have a better average time complexity than an array of pairs?
EDIT: I'm not trying to solve this problem or avoid it; this is more of an academic question. I just want to know whether such a data structure exists: one that allows bi-directional binding and mutable keys while keeping reasonable time complexity.

Your trouble is not with bidirectional maps, but with the assumption that you are allowed to modify a map key. Keys are in fact fundamentally required to be stable at least regarding the behavior of their equals and hashCode methods (in case of a hashtable-backed map) or their comparison method (in case of a binary tree-backed map).
Perhaps you can consider removing an element, changing it, then inserting it back; that's one way to meet the constraints of the implementation.
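A minimal sketch of that workaround, reusing the map and foo variables from the question above:
Set<String> value = map.remove(foo);  // take the entry out first
value.add("someString");              // now it is safe to mutate
map.put(foo, value);                  // re-insert so both directions re-hash
map.inverse().get(value);             // returns [foo] again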

Related

Should we use HashSet?

A HashSet is backed by a HashMap. From its JavaDoc:
This class implements the Set interface, backed by a hash table
(actually a HashMap instance)
When taking a look at the source we can also see how they relate to each other:
// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}
Therefore a HashSet<E> is backed by a HashMap<E,Object>. For all HashSets in our application we have one reference object PRESENT that we use in the HashMap for the value. While the memory needed to store PRESENT is negligible, we still store a reference to it for each value in the map.
Would it not be more efficient to use null instead of PRESENT? A further consideration is whether we should forgo the HashSet altogether and use a HashMap directly, given that the circumstances permit a Map instead of a Set.
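For illustration, a minimal sketch of that null-value idea (NullValuedSet is a made-up name for this example, not a real class):
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: membership via a HashMap with null values,
// avoiding the shared PRESENT object entirely.
public class NullValuedSet<E> {
    private final Map<E, Object> map = new HashMap<>();

    public boolean add(E e) {
        // put(e, null) returns null both for "absent" and for
        // "present with null value", so add needs two lookups.
        if (map.containsKey(e)) {
            return false;
        }
        map.put(e, null);
        return true;
    }

    public boolean contains(Object o) {
        return map.containsKey(o);
    }
}
Note that this does not shrink the entries themselves: each HashMap entry still holds a value reference whether it points to PRESENT or to null, so the saving is a single object per JVM, which HashSet already achieves by sharing one static PRESENT.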
My basic problem that triggered these thoughts is the following situation: I have a collection of objects with the following properties:
big collection of objects > 30'000
Insertion order is not relevant
Efficient check if an item is contained
Adding new items to the collection is not relevant
The chosen solution should perform optimally against the above criteria while minimizing memory consumption. On this basis, the data structures HashSet and HashMap spring to mind. When thinking about alternative approaches, the key question is:
How to check containment efficiently?
The only answer that comes to my mind is using the item's hash to calculate the storage location. I might be missing something here. Are there any other approaches?
I had a look at various related questions that shed some light on the issue, but did not quite answer my question:
Java : HashSet vs. HashMap
clarifying facts behind Java's implementation of HashSet/HashMap
Java HashSet vs HashMap
I am not looking for suggestions of alternative libraries or frameworks to address this; I want to understand whether there is another way to think about efficient containment checking of an element in a Collection.
In short, yes you should use HashSet. It might not be the most possibly efficient Set implementation, but that hardly ever matters, unless you are working with huge amounts of data.
In that case, I would suggest using specialized libraries: EnumMaps if you can use enums, primitive maps like Trove if your data is mostly primitives, any of a number of other data structures that are optimized for certain data types, or even an in-memory database.
Don't get me wrong, I'm someone who likes performance tuning too, but replacing the built-in data structures should only be done when it's really necessary. For most cases they work perfectly fine.
What you could do, in case you really want to save the last bit of memory and do not care about insertion cost, is use a fixed-size array, sort it, and do a binary search every time. But I doubt that it's more efficient than a HashSet.
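That alternative is simple enough to sketch (String elements chosen purely for illustration):
import java.util.Arrays;

// Fixed-size array: sort once, then answer containment queries by
// binary search in O(log n) with no per-entry overhead.
String[] items = { "qux", "foo", "bar" };
Arrays.sort(items);                                      // one-time O(n log n)
boolean found = Arrays.binarySearch(items, "foo") >= 0;  // true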
Hashtables and HashSets are used for entirely different purposes, so maybe the two shouldn't be compared as "which is more efficient". A HashSet is more suitable for the mathematical notion of a set (e.g. {1,2,3,4}): it contains no duplicates and allows at most one null element. A HashMap, by contrast, is a key/value system: it allows multiple null values as well as duplicate values, just not duplicate keys. I know this probably answers "difference between a Hashtable and a HashSet", but my point is they really can't be compared.

Accessing a HashSet using the HashCode directly? (Java)

Hi, I'm wondering if it is possible to access the contents of a HashSet directly if you have the hash code of the object you're looking for, sort of like using the hash code as a key in a HashMap.
I imagine it might work something like this:
MyObject object1 = new MyObject(1);
Set<MyObject> myHashSet = new HashSet<MyObject>();
myHashSet.add(object1);
int hash = object1.hashCode();
MyObject object2 = myHashSet[hash]; // ??? (not valid Java, but the idea)
Thanks!
edit: Thanks for the answers. Okay, I understand that I might be pushing the contract of HashSet a bit, but for this particular project equality is determined solely by the hash code, and I know for sure that there will be only one object per hash code/hash bucket. The reason I was reluctant to use a HashMap is that I would need to convert the primitive ints I'm mapping with into Integer objects, as a HashMap only takes objects as keys, and I'm worried that this might affect performance. Is there anything else I could do to implement something similar?
The common implementation of HashSet is backed (rather lazily) by a HashMap so your effort to avoid HashMap is probably defeated.
On the basis that premature optimization is the root of all evil, I suggest you use a HashMap initially and if the boxing/unboxing overhead of int to and from Integer really is a problem you'll have to implement (or find) a handcrafted HashSet using primitive ints for comparison.
The standard Java library really doesn't want to concern itself with boxing/unboxing costs.
The whole language sold that performance issue for a considerable gain in simplicity long ago.
Notice that these days (since 2004!) the language automatically boxes and unboxes, which reflects a "you don't need to be worrying about this" policy. In most cases it's right.
I don't know how 'richly' featured your HashKeyedSet needs to be, but a basic hash table is really not too hard.
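For example, a minimal sketch of such a hand-rolled table keyed by primitive int (IntKeyedTable is a made-up name; open addressing with linear probing, fixed power-of-two capacity, no resizing or removal, and it assumes the table never fills up):
// Minimal int-keyed hash table: no boxing of keys anywhere.
public class IntKeyedTable<V> {
    private final int[] keys;
    private final Object[] values;
    private final boolean[] used;

    public IntKeyedTable(int capacity) { // capacity must be a power of two
        keys = new int[capacity];
        values = new Object[capacity];
        used = new boolean[capacity];
    }

    private int indexOf(int key) {
        int i = key & (keys.length - 1);      // mask the key to a slot
        while (used[i] && keys[i] != key) {
            i = (i + 1) & (keys.length - 1);  // linear probe
        }
        return i;
    }

    public void put(int key, V value) {
        int i = indexOf(key);
        keys[i] = key;
        values[i] = value;
        used[i] = true;
    }

    @SuppressWarnings("unchecked")
    public V get(int key) {
        int i = indexOf(key);
        return used[i] ? (V) values[i] : null;
    }
}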
HashSet is internally backed by a HashMap which, unfortunately for this question, is unavailable through the public API. However, we can use reflection to gain access to the internal map and then find a key with an identical hashCode:
// requires: import java.lang.reflect.Field; import java.util.HashSet; import java.util.Map;
private static <E> E getFromHashCode(final int hashcode, HashSet<E> set) throws Exception {
    // grab the private "map" field via reflection
    Field field = set.getClass().getDeclaredField("map");
    field.setAccessible(true);
    // get the internal map backing the set
    @SuppressWarnings("unchecked")
    Map<E, Object> internalMap = (Map<E, Object>) field.get(set);
    // attempt to find a key with an identical hashcode
    for (E elem : internalMap.keySet()) {
        if (elem.hashCode() == hashcode) {
            return elem;
        }
    }
    return null;
}
Used in an example:
HashSet<String> set = new HashSet<>();
set.add("foo"); set.add("bar"); set.add("qux");
int hashcode = "qux".hashCode();
System.out.println(getFromHashCode(hashcode, set));
Output:
qux
This is not possible, as HashSet is an object and there is no public API for such access. Also, multiple objects can have the same hashcode while being different objects.
Finally, only arrays can be accessed using the myArray[<index>] syntax.
You can easily write code that will directly access the internal data structures of the HashSet implementation using reflection. Of course, your code will depend on the implementation details of the particular JVM you are coding to. You also will be subject to the constraints of the SecurityManager (if any).
A typical implementation of HashSet uses a HashMap as its internal data structure. The HashMap has an array, and the key's hashcode is mapped to an index into that array. The hashcode mapping function is available by calling non-public methods in the implementation; you will have to read the source code and figure it out. Once you get to the right bucket, you will just need to find (using equals) the right entry in the bucket.
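As a rough sketch of that mapping (this mirrors the spirit of OpenJDK's HashMap, but the exact bit-spreading function varies between versions):
// Simplified hashcode -> bucket index; tableLength is a power of two.
static int bucketIndex(Object key, int tableLength) {
    int h = key.hashCode();
    h ^= (h >>> 16);              // spread high bits into the low bits
    return h & (tableLength - 1); // mask down to a valid array index
}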

Is the order of HashMap elements reproducible?

First of all, I want to make it clear that I would never use a HashMap to do things that require some kind of order in the data structure and that this question is motivated by my curiosity about the inner details of Java HashMap implementation.
You can read in the java documentation on Object about the Object method hashCode.
I understand from there that the hashCode implementation for classes such as String and the primitive wrapper types (Integer, Long, ...) is predictable once the value contained by the object is given. For example, calls to hashCode on any String object containing the value "hello" should always return 99162322.
Suppose an algorithm always inserts the same values, in the same order, into an empty Java HashMap whose keys are Strings. Then the order of its elements at the end should always be the same, am I wrong?
Since the hash code for a concrete value is always the same, if there are not collisions the order should be the same.
On the other hand, if there are collisions, I think (I don't know the facts) that collision resolution should result in the same order for exactly the same input elements.
So, isn't it right that two HashMap objects with the same elements, inserted in the same order should be traversed (by an iterator) giving the same elements sequence?
As far as I know, the order (assuming we call "order" the order of elements as returned by the values() iterator) of the elements in a HashMap is kept until a rehash of the map is performed. We can influence the probability of that event by providing a capacity and/or loadFactor to the constructor.
Nevertheless, we should never rely on this, because the internal implementation of HashMap is not part of its public contract and is subject to change in the future.
I think you are asking "Is HashMap non-deterministic?". The answer is "probably not" (look at the source code of your favourite implementation to find out).
However, bear in mind that, because the Java standard does not guarantee a particular order, the implementation is free to change at any time (e.g. in newer JRE versions), giving a different (yet still deterministic) result.
Whether or not that is true is entirely dependent upon the implementation. What's more important is that it isn't guaranteed. If order is important to you, there are options: you could create your own implementation of Map that preserves order, you can use a SortedMap or LinkedHashMap, or you can use something like the Apache commons-collections OrderedMap: http://commons.apache.org/proper/commons-collections/javadocs/api-release/org/apache/commons/collections4/OrderedMap.html.
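For example, LinkedHashMap makes the order explicit by contract instead of an accident of hashing:
import java.util.LinkedHashMap;
import java.util.Map;

Map<String, Integer> map = new LinkedHashMap<>();
map.put("one", 1);
map.put("two", 2);
map.put("three", 3);
// Iteration follows insertion order by contract, on every JVM:
System.out.println(map.keySet()); // [one, two, three]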

Why Guava does not provide a way to transform map keys

This question is kind of already posted here:
How to convert Map<String, String> to Map<Long, String> using guava
I think the answer by CollinD is appropriate:
All of Guava's methods for transforming and filtering produce lazy results... the function/predicate is only applied when needed as the object is used. They don't create copies. Because of that, though, a transformation can easily break the requirements of a Set.

Let's say, for example, you have a Map<String, String> that contains both "1" and "01" as keys. They are both distinct Strings, and so the Map can legally contain both as keys. If you transform them using Long.valueOf(String), though, they both map to the value 1. They are no longer distinct keys. This isn't going to break anything if you create a copy of the map and add the entries, because any duplicate keys will overwrite the previous entry for that key. A lazily transformed Map, though, would have no way of enforcing unique keys and would therefore break the contract of a Map.
This is true, but actually I don't understand why it is not done because:
When the key transformation happens, if two keys are "merged", a runtime exception could be raised, or we could pass a flag telling Guava to take any one of the multiple possible values for the newly computed key (fail-fast/fail-safe possibilities); a fail-fast sketch follows below
We could have a Maps.transformKeys which produces a Multimap
Is there a drawback I don't see in doing such things?
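For instance, the fail-fast variant from the first point above could be sketched eagerly like this (transformKeysStrict is a made-up name, not an actual Guava method; assumes non-null values):
import com.google.common.base.Function;
import java.util.HashMap;
import java.util.Map;

// Hypothetical eager transform: copies the map and fails fast when
// two old keys collapse onto the same new key.
static <K, NewK, V> Map<NewK, V> transformKeysStrict(
        Map<K, V> map, Function<? super K, NewK> fn) {
    Map<NewK, V> result = new HashMap<>();
    for (Map.Entry<K, V> entry : map.entrySet()) {
        NewK newKey = fn.apply(entry.getKey());
        if (result.put(newKey, entry.getValue()) != null) {
            throw new IllegalArgumentException("keys merged under: " + newKey);
        }
    }
    return result;
}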
As @CollinD suggests, there's no way to do this lazily. To implement get, you would have to convert all the keys with your transformation function (to ensure any duplicates are discovered).
So applying Function<K,NewK> to Map<K,V> is out.
You could safely apply Function<NewK,K> to the map:
V value = innerMap.get(fn.apply(newK));
I don't see a Guava shorthand for that; it may just not be useful enough. You could get similar results with:
Function<NewK,V> newFn = Functions.compose(Functions.forMap(map), fn);
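Spelled out with concrete types (NewK = Long, K = String, V = Integer here; note that Functions.forMap throws IllegalArgumentException for keys absent from the map, unlike Map.get, which returns null):
import com.google.common.base.Function;
import com.google.common.base.Functions;
import com.google.common.collect.ImmutableMap;
import java.util.Map;

Map<String, Integer> map = ImmutableMap.of("1", 10, "2", 20);
Function<Long, String> fn = new Function<Long, String>() {
    @Override
    public String apply(Long key) {
        return String.valueOf(key);
    }
};
// Composed lookup: equivalent to map.get(fn.apply(...)), except that
// absent keys throw instead of yielding null.
Function<Long, Integer> lookup = Functions.compose(Functions.forMap(map), fn);
System.out.println(lookup.apply(1L)); // prints 10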

Why does Java's Map interface have a containsValue(Object) method, but no value->keys lookup?

There are questions here asking how to get a Map's keys associated with a given value, with answers pointing to Google Collections (for bidirectional maps) or essentially saying "loop over it".
I just recently noticed that the Map interface has a boolean containsValue(Object value) method that "will probably require time linear in the map size for most implementations of the Map interface" and the implementation in AbstractMap indeed iterates over the entrySet().
What could be the reason for the design decision to include containsValue in Map, but no Collection<V> getKeysForValue(Object)? I can see why one would omit both, or include both, but if there is one, why not the other?
One thing that came to my mind is that it would require any Map implementation to know about a Collection implementation for the return value, but that is not actually a good reason as the Collection<V> values() method also returns a collection (an anonymous new AbstractCollection<V>() in case of AbstractMap).
There are collections which support this, but they usually involve maintaining a reverse lookup map, which is more expensive than the relatively simple one-to-one mapping. As such, supporting this could make all Maps more than twice as expensive on update.
Another problem is generalisation: keys have to implement hashCode and equals (for hash maps) or Comparable (for sorted maps), while values don't have to implement anything. That makes constructing a generalised reverse lookup either impossible, or places extra requirements on values that are unlikely to be needed.
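To make that cost concrete, here is a hypothetical sketch (not any particular library) of the bookkeeping a getKeysForValue would force on every write:
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Every put maintains a second, reverse index, doubling the update cost.
public class ReverseIndexedMap<K, V> {
    private final Map<K, V> forward = new HashMap<>();
    private final Map<V, Set<K>> reverse = new HashMap<>();

    public void put(K key, V value) {
        V old = forward.put(key, value);
        if (old != null) {
            reverse.get(old).remove(key); // unlink the stale reverse entry
        }
        reverse.computeIfAbsent(value, v -> new HashSet<>()).add(key);
    }

    public Set<K> getKeysForValue(V value) {
        Set<K> keys = reverse.get(value);
        return keys == null ? new HashSet<>() : keys;
    }
}
Note that every put now touches two maps, and V must have usable hashCode/equals, which is exactly the extra requirement on values mentioned above.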
Maps have been able to return a Collection of their keys and values since 1.2, so it was trivial to look for a value: public boolean containsValue(Object v) { return values().contains(v); } This method natively uses the optimisations of values() and contains() for any implementation of Map, but is likely to be slow in most of them anyway...
The getKeysForValue(Object) you're looking for is NOT trivial. It requires a specific algorithm, and that algorithm cannot be made generic enough; it would have to be optimised for every implementation of Map.
That could be the reason, or it is simply that the Collection API is full of these kinds of little loopholes...
