Hi, I'm wondering if it is possible to access the contents of a HashSet directly if you have the hash code for the object you're looking for, sort of like using the hash code as a key in a HashMap.
I imagine it might work something sort of like this:
MyObject object1 = new MyObject(1);
Set<MyObject> myHashSet = new HashSet<MyObject>();
myHashSet.add(object1);
int hash = object1.hashCode();
MyObject object2 = myHashSet[hash]; // ??? the syntax I'm imagining - it doesn't exist
Thanks!
edit: Thanks for the answers. Okay, I understand that I might be pushing the contract of HashSet a bit, but for this particular project equality is solely determined by the hashcode, and I know for sure that there will be only one object per hashcode/hash bucket. The reason I was reluctant to use a HashMap is that I would need to convert the primitive ints I'm mapping with into Integer objects, since a HashMap only takes objects as keys, and I'm worried that this might affect performance. Is there anything else I could do to implement something similar?
The common implementation of HashSet is backed (rather lazily) by a HashMap so your effort to avoid HashMap is probably defeated.
On the basis that premature optimization is the root of all evil, I suggest you use a HashMap initially. If the boxing/unboxing overhead of int to and from Integer really does turn out to be a problem, you'll have to implement (or find) a handcrafted HashSet that works with primitive ints directly.
The standard Java library really doesn't want to concern itself with boxing/unboxing costs.
The language traded away that performance issue for a considerable gain in simplicity long ago.
Notice that these days (since 2004!) the language boxes and unboxes automatically, which reveals a "you don't need to be worrying about this" policy. In most cases it's right.
I don't know how 'richly' featured your HashKeyedSet needs to be, but a basic hash table is really not too hard.
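For illustration, here's a minimal sketch of such a primitive-int set, using open addressing with linear probing. The class name and sentinel choice are hypothetical; it does no resizing or removal, so the capacity must stay larger than the number of elements:

import java.util.Arrays;

public class IntHashSet {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel; assumed never stored as a value
    private final int[] table;

    public IntHashSet(int capacity) {
        table = new int[capacity];
        Arrays.fill(table, EMPTY);
    }

    public boolean add(int value) {
        int i = index(value);
        while (table[i] != EMPTY) {
            if (table[i] == value) return false; // already present
            i = (i + 1) % table.length;          // linear probing
        }
        table[i] = value;
        return true;
    }

    public boolean contains(int value) {
        int i = index(value);
        while (table[i] != EMPTY) {
            if (table[i] == value) return true;
            i = (i + 1) % table.length;
        }
        return false;
    }

    private int index(int value) {
        // an int is its own hash; clamp it into the table's range
        return (value & 0x7fffffff) % table.length;
    }
}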
HashSet is internally backed by a HashMap, which is, unfortunately for this question, unavailable through the public API. However, we can use reflection to gain access to the internal map and then find a key with an identical hashCode:
// requires: import java.lang.reflect.Field; import java.util.Map;
private static <E> E getFromHashCode(final int hashcode, HashSet<E> set) throws Exception {
    // reflection stuff: HashSet keeps its elements as the keys of a private "map" field
    Field field = set.getClass().getDeclaredField("map");
    field.setAccessible(true);
    // get the internal map
    @SuppressWarnings("unchecked")
    Map<E, Object> internalMap = (Map<E, Object>) field.get(set);
    // attempt to find a key with an identical hashcode
    for (E elem : internalMap.keySet()) {
        if (elem.hashCode() == hashcode) return elem;
    }
    return null;
}
Used in an example:
HashSet<String> set = new HashSet<>();
set.add("foo"); set.add("bar"); set.add("qux");
int hashcode = "qux".hashCode();
System.out.println(getFromHashCode(hashcode, set));
Output:
qux
This is not possible: HashSet is just an object and exposes no such public API. Also, multiple objects can have the same hashcode while still being different objects.
Finally, only arrays can be accessed using the myArray[<index>] syntax.
You can easily write code that will directly access the internal data structures of the HashSet implementation using reflection. Of course, your code will depend on the implementation details of the particular JVM you are coding to. You also will be subject to the constraints of the SecurityManager (if any).
A typical implementation of HashSet uses a HashMap as its internal data structure. The HashMap holds an array of buckets, and each key's hashcode is mapped to an index into that array. The mapping function is only available through non-public methods of the implementation - you will have to read the source code and figure it out. Once you get to the right bucket, you just need to find the right entry in the bucket (using equals).
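For reference, OpenJDK's HashMap (JDK 8 and later) derives the bucket index roughly like this - a sketch of the idea, not the actual source:

static int bucketIndex(Object key, int tableLength) {
    int h = key.hashCode();
    h ^= (h >>> 16);              // spread the high bits into the low bits
    return (tableLength - 1) & h; // table length is a power of two, so this masks to a valid index
}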
Related
I have the following scenario (modified from the actual business case).
I have a program which predicts how many calories a person will lose over the next 13 weeks, based on certain attributes.
I want to cache this result in the database so that I don't run the prediction again for the same combination.
I have a class Person:
class Person { int personId; String weekStartDate; }
I have a HashMap<List<Person>, Integer> - the key is 13 weeks of data for a person and the value is the prediction.
I will keep the hash value in the database for caching purposes.
Is there a better way to handle the above scenario? Is there a design pattern that supports such scenarios?
Depends: the implementation of hashCode() uses the elements of your list, so adding elements later on changes the result of that operation:
public int hashCode() {
    int hashCode = 1;
    for (E e : this)
        hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
}
Maps aren't built for keys that can change their hash values! And of course, it doesn't really make sense to implement that method differently.
So: it can work when your lists are all immutable, meaning that neither the list nor any of its members is modified after the list is used as a key. But there is a certain risk: if you forget about that contract later on and these lists see modifications, then you will run into interesting issues.
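A quick demonstration of that risk (the values are made up):

Map<List<String>, Integer> cache = new HashMap<>();
List<String> weeks = new ArrayList<>(Arrays.asList("2020-01-06", "2020-01-13"));
cache.put(weeks, 1500);

weeks.add("2020-01-20"); // the key's hashCode changes

System.out.println(cache.get(weeks));         // null: the stored hash no longer matches
System.out.println(cache.containsKey(weeks)); // false, although the key object is in the map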
This works because the hashcode of the standard List implementations is computed from the hashcodes of the contents. You need to make sure, however, to also implement hashCode and equals in the Person class, otherwise you will get the same problem this guy had. See also my answer on that question.
I would suggest you define a class (say Data) and use it as the key in your HashMap. Override equals/hashCode accordingly, with knowledge of the data over the weeks.
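A minimal sketch of what such a key class could look like; the fields are an assumption based on the question's Person class:

final class Data {
    private final int personId;
    private final String weekStartDate;

    Data(int personId, String weekStartDate) {
        this.personId = personId;
        this.weekStartDate = weekStartDate;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Data)) return false;
        Data other = (Data) o;
        return personId == other.personId
                && weekStartDate.equals(other.weekStartDate);
    }

    @Override
    public int hashCode() {
        return 31 * personId + weekStartDate.hashCode();
    }
}

Being immutable, a key like this also avoids the mutation risk described in the other answers.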
A HashSet is backed by a HashMap. From its JavaDoc:
This class implements the Set interface, backed by a hash table
(actually a HashMap instance)
When taking a look at the source we can also see how they relate to each other:
// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}
Therefore a HashSet<E> is backed by a HashMap<E,Object>. Across all HashSets in our application we have one shared reference object PRESENT that is used as the value in the backing HashMap. While the memory needed to store PRESENT itself is negligible, we still store a reference to it for each entry in the map.
Would it not be more efficient to use null instead of PRESENT? A further consideration is whether we should forgo the HashSet altogether and directly use a HashMap, given that the circumstances permit the use of a Map instead of a Set.
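The idea in code (a sketch):

Map<String, Object> pseudoSet = new HashMap<>();
pseudoSet.put("foo", null);
boolean present = pseudoSet.containsKey("foo"); // true

// One catch: HashSet.add() reports whether the element was new via
// map.put(e, PRESENT) == null. With null as the value, put() returns null
// whether or not the key was already there, so that signal would be lost:
boolean looksNew = pseudoSet.put("foo", null) == null; // true, although "foo" was present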
My basic problem that triggered these thoughts is the following situation: I have a collection of objects with the following properties:
big collection of objects > 30'000
Insertion order is not relevant
Efficient check if an item is contained
Adding new items to the collection is not relevant
The chosen solution should perform optimally against the above criteria as well as minimize memory consumption. On this basis, the data structures HashSet and HashMap spring to mind. When thinking about alternative approaches, the key question is:
How to check containment efficiently?
The only answer that comes to my mind is using the item's hash to calculate the storage location. I might be missing something here. Are there any other approaches?
I had a look at various questions that did shed some light on the issue, but didn't quite answer my question:
Java : HashSet vs. HashMap
clarifying facts behind Java's implementation of HashSet/HashMap
Java HashSet vs HashMap
I am not looking for suggestions of alternative libraries or frameworks to address this; I want to understand if there is another way to think about efficient containment checking of an element in a Collection.
In short, yes, you should use HashSet. It might not be the most efficient possible Set implementation, but that hardly ever matters unless you are working with huge amounts of data.
In that case, I would suggest using specialized libraries: EnumMaps if you can use enums, primitive maps like Trove if your data is mostly primitives, a bunch of other data structures that are optimized for certain data types, or even an in-memory database.
Don't get me wrong, I'm someone who likes performance tuning, too, but replacing the built-in data structures should only be done when it's really necessary. For most cases, they work perfectly fine.
What you could do, in case you really want to save the last bit of memory and do not care about insertion, is use a fixed-size array, sort it, and do a binary search every time. But I doubt that it's more efficient than a HashSet.
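A minimal sketch of that sorted-array approach, assuming the collection is built once and never modified afterwards (the class name is hypothetical):

import java.util.Arrays;

public class SortedArraySet {
    private final String[] sorted;

    // build once from the full collection; adding later is not supported
    public SortedArraySet(String[] elements) {
        this.sorted = elements.clone();
        Arrays.sort(this.sorted);
    }

    // O(log n) containment check via binary search
    public boolean contains(String candidate) {
        return Arrays.binarySearch(sorted, candidate) >= 0;
    }
}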
Hashtables and HashSets are meant to be used in entirely different ways, so maybe the two shouldn't be compared as "which is more efficient". A HashSet is more suitable for the mathematical "set" (e.g. {1,2,3,4}): it contains no duplicates and allows at most one null value. A HashMap is more of a key -> value pair system; it allows multiple null values as well as duplicate values, just not duplicate keys. I know this is probably answering "difference between a hashtable and hashset", but I think my point is that they really can't be compared.
Everywhere you can find answers about what the differences are:
Map stores key-value pairs; it is not synchronized (not thread safe), allows null values and only one null key, and is faster at getting a value because all values have a unique key, etc.
Set - not sorted, slower to get a value, stores only values, does not allow duplicates or null values, I guess.
BUT what does the word Hash mean (that is the part they have in common)? Is it something about hashing the values, or what? I hope you can answer me clearly.
Both use the hash value of the object to store it, which internally uses the hashCode() method of the Object class.
So if you are storing instances of your own class, you need to override the hashCode() method.
HashSet and HashMap have a number of things in common:
The start of their name - which is a clue to the real similarity.
They use Hash Codes (from the hashCode method built into all Java objects) to quickly process and organize Objects.
They are both unordered collections - but both provide ordered variants (LinkedHashX, to store objects in the order of addition)
There is also TreeSet/TreeMap, which sort all objects present in the collection and keep them sorted. A comparison of TreeSet to TreeMap will find very similar differences and similarities to the ones between HashSet and HashMap.
They are also both impacted by the strengths and limitations of Hash algorithms in general.
Hashing is only effective if the objects have well behaved hash functions.
Hashing breaks entirely if equals and hashCode do not follow the correct contract.
Key objects in maps and objects in set should be immutable (or at least their hashCode and equals return values should never change) as otherwise behavior becomes undefined.
If you look at the Map API you can also see a number of other interesting connections - such as the fact that keySet and entrySet both return a Set.
None of the Java Collections are thread safe. Some of the older classes were, but they have mostly been retired. For thread safety look at the java.util.concurrent package; for non-thread-safe variants look at the collections package.
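For example (newKeySet requires Java 8 or later):

Set<String> syncSet = Collections.synchronizedSet(new HashSet<>());   // wrapper, coarse locking
Set<String> concurrentSet = ConcurrentHashMap.newKeySet();            // a concurrent Set view
Map<String, Integer> concurrentMap = new ConcurrentHashMap<>();       // lock-striped map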
Just look into the HashSet source code and you will see that it uses a HashMap. So they have the same properties with regard to null handling, synchronization, etc.:
public class HashSet<E>
    ...
    private transient HashMap<E,Object> map;

    // Dummy value to associate with an Object in the backing Map
    private static final Object PRESENT = new Object();

    /**
     * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has
     * default initial capacity (16) and load factor (0.75).
     */
    public HashSet() {
        map = new HashMap<>();
    }
    ...
    public boolean contains(Object o) {
        return map.containsKey(o);
    }
    ...
    public boolean add(E e) {
        return map.put(e, PRESENT)==null;
    }
    ...
}
HashSet is like a HashMap where you don't care about the values but about the keys only.
So you care only whether a given key K is in the set, not about the value V to which it is mapped (you can think of it as if V were a constant, e.g. V=Boolean.TRUE, for all keys in the HashSet). So a HashSet has no values (no V set); this is the whole difference from a structural point of view. The hash part means that when putting elements into the structure, Java first calls the hashCode method. See also http://en.wikipedia.org/wiki/Open_addressing to understand in general what happens under the hood.
The hash value is used to check faster whether two objects are the same. If two objects have the same hash, they may or may not be equal (so they are then compared for equality with the equals method). But if they have different hashes, they are different for sure and the check for equality is not needed. This doesn't mean that two objects with the same hash value overwrite each other when they are stored in the HashSet or the HashMap.
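A concrete example: "Aa" and "BB" are a well-known hashCode collision for Java Strings, yet both can live in the same HashSet:

System.out.println("Aa".hashCode() == "BB".hashCode()); // true (both are 2112)
System.out.println("Aa".equals("BB"));                  // false

Set<String> set = new HashSet<>(Arrays.asList("Aa", "BB"));
System.out.println(set.size()); // 2 - same bucket, told apart by equals()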
Neither is thread safe, and both store values using hashCode(). Those are the common facts. Another one is that both are members of the Java Collections Framework. But there are lots of differences between the two.
Hash refers to the technique used to convert the key to an index. Back in data structures class we learned how to construct a hash table; to do that, you would take the strings that were inserted as values and convert them to a number for indexing an array used internally as the storing data structure.
One problem that was also much discussed was finding a hashing function that incurs minimal collisions, so that we don't end up with two different objects with different keys sharing the same position.
So, the hash is about how the keys are processed to be stored. If we think about it for a while, there isn't a (real) way to index memory with strings, only with numbers, so to have a 2D structure like a table that is indexed by a string (or an object, as you wish) you need to generate a number (a hash) for that string and store the value in an array at that index. However, if you ever need the key "name" back, you also need to store the key itself at the same index, alongside the value.
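A toy version of that conversion (real implementations spread the bits more carefully; the method name is made up):

static int indexFor(String key, int tableSize) {
    int hash = key.hashCode();              // turn the string into a number
    return (hash & 0x7fffffff) % tableSize; // clamp it into the array's range
}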
Cheers
The "HASH" word is common because both uses hashing mechanism. HashSet is actually implemented using HashMap, using dummy object instance on every entry of the Set. And thereby a wastage of 4 bytes for each entry.
There are questions here about how to get a Map's keys associated with a given value, with answers pointing to Google Collections (for bidirectional maps) or essentially saying "loop over it".
I just recently noticed that the Map interface has a boolean containsValue(Object value) method that "will probably require time linear in the map size for most implementations of the Map interface" and the implementation in AbstractMap indeed iterates over the entrySet().
What could be the reason for the design decision to include containsValue in Map, but no Collection<V> getKeysForValue(Object)? I can see why one would omit both, or include both, but if there is one, why not the other?
One thing that came to my mind is that it would require any Map implementation to know about a Collection implementation for the return value, but that is not actually a good reason, as the Collection<V> values() method also returns a collection (an anonymous new AbstractCollection<V>() in the case of AbstractMap).
There are collections which support this, but they usually involve maintaining a reverse lookup map, which is more expensive than the relatively simple one-to-one mapping. As such, supporting this could make all Maps more than twice as expensive on update.
Another problem is generalisation. Keys have to implement hashCode and equals (for hash maps) or Comparable (for sorted maps); values don't have to implement anything, which makes constructing a generalised reverse lookup either impossible, or places extra requirements on values that are unlikely to be needed.
Maps have been able to return a Collection of their keys and values since 1.2, so it was trivial to look for a value: public boolean containsValue(Object v) { return values().contains(v); } This method natively uses whatever optimisations values() and contains() have in any given implementation of Map, but it is likely to be slow anyway in most of them...
The getKeysForValue(Object) you're looking for is NOT trivial. It requires a specific algorithm, and this algorithm cannot be made generic enough; it would have to be optimised for every implementation of Map.
It could be the reason, or it is simply that the Collection API is full of this kind of little loophole...
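For illustration, a naive generic getKeysForValue can be written outside the Map interface in a few lines; like containsValue, it is linear in the map size (the method is hypothetical):

static <K, V> Set<K> getKeysForValue(Map<K, V> map, V value) {
    Set<K> keys = new HashSet<>();
    for (Map.Entry<K, V> entry : map.entrySet()) {
        if (Objects.equals(entry.getValue(), value)) { // null-safe comparison
            keys.add(entry.getKey());
        }
    }
    return keys;
}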
With a TreeMap it's trivial to provide a custom Comparator, thus overriding the semantics provided by Comparable objects added to the map. HashMaps, however, cannot be controlled in this manner; the functions providing hash values and equality checks cannot be "side-loaded".
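For comparison, the TreeMap side really is that simple:

Map<String, Integer> map = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
map.put("ABC", 1);
System.out.println(map.get("abc")); // 1 - the Comparator side-loads the key semantics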
I suspect it would be both easy and useful to design an interface and to retrofit this into HashMap (or a new class)? Something like this, except with better names:
interface Hasharator<T> {
    int alternativeHashCode(T t);
    boolean alternativeEquals(T t1, T t2);
}

class HasharatorMap<K, V> {
    HasharatorMap(Hasharator<? super K> hasharator) { ... }
}

class HasharatorSet<T> {
    HasharatorSet(Hasharator<? super T> hasharator) { ... }
}
The case insensitive Map problem gets a trivial solution:
new HasharatorMap(String.CASE_INSENSITIVE_EQUALITY);
Would this be doable, or can you see any fundamental problems with this approach?
Is the approach used in any existing (non-JRE) libs? (Tried google, no luck.)
EDIT: Nice workaround presented by hazzen, but I'm afraid this is the workaround I'm trying to avoid... ;)
EDIT: Changed title to no longer mention "Comparator"; I suspect this was a bit confusing.
EDIT: Accepted answer with relation to performance; would love a more specific answer!
EDIT: There is an implementation; see the accepted answer below.
EDIT: Rephrased the first sentence to indicate more clearly that it's the side-loading I'm after (and not ordering; ordering does not belong in HashMap).
.NET has this via IEqualityComparer (for a type which can compare two objects) and IEquatable (for a type which can compare itself to another instance).
In fact, I believe it was a mistake to define equality and hashcodes in java.lang.Object or System.Object at all. Equality in particular is hard to define in a way which makes sense with inheritance. I keep meaning to blog about this...
But yes, basically the idea is sound.
A bit late for you, but for future visitors it might be worth knowing that commons-collections has an AbstractHashedMap (in 3.2.2, and with generics in 4.0). You can override these protected methods to achieve your desired behaviour:
protected int hash(Object key) { ... }
protected boolean isEqualKey(Object key1, Object key2) { ... }
protected boolean isEqualValue(Object value1, Object value2) { ... }
protected HashEntry createEntry(
HashEntry next, int hashCode, Object key, Object value) { ... }
An example implementation of such an alternative HashedMap is commons-collections' own IdentityMap (only up to 3.2.2 as Java has its own since 1.4).
This is not as powerful as providing an external "Hasharator" to a Map instance. You have to implement a new map class for every hashing strategy (composition vs. inheritance striking back...). But it's still good to know.
HashingStrategy is the concept you're looking for. It's a strategy interface that allows you to define custom implementations of equals and hashcode.
public interface HashingStrategy<E>
{
    int computeHashCode(E object);
    boolean equals(E object1, E object2);
}
You can't use a HashingStrategy with the built in HashSet or HashMap. GS Collections includes a java.util.Set called UnifiedSetWithHashingStrategy and a java.util.Map called UnifiedMapWithHashingStrategy.
Let's look at an example.
public class Data
{
    private final int id;

    public Data(int id)
    {
        this.id = id;
    }

    public int getId()
    {
        return id;
    }

    // No equals or hashcode
}
Here's how you might set up a UnifiedSetWithHashingStrategy and use it.
java.util.Set<Data> set =
new UnifiedSetWithHashingStrategy<>(HashingStrategies.fromFunction(Data::getId));
Assert.assertTrue(set.add(new Data(1)));
// contains returns true even without hashcode and equals
Assert.assertTrue(set.contains(new Data(1)));
// Second call to add() doesn't do anything and returns false
Assert.assertFalse(set.add(new Data(1)));
Why not just use a Map? UnifiedSetWithHashingStrategy uses half the memory of a UnifiedMap, and one quarter the memory of a HashMap. And sometimes you don't have a convenient key and have to create a synthetic one, like a tuple. That can waste more memory.
How do we perform lookups? Remember that Sets have contains(), but not get(). UnifiedSetWithHashingStrategy implements Pool in addition to Set, so it also implements a form of get().
Here's a simple approach to handle case-insensitive Strings.
UnifiedSetWithHashingStrategy<String> set =
new UnifiedSetWithHashingStrategy<>(HashingStrategies.fromFunction(String::toLowerCase));
set.add("ABC");
Assert.assertTrue(set.contains("ABC"));
Assert.assertTrue(set.contains("abc"));
Assert.assertFalse(set.contains("def"));
Assert.assertEquals("ABC", set.get("aBc"));
This shows off the API, but it's not appropriate for production. The problem is that the HashingStrategy constantly delegates to String.toLowerCase() which creates a bunch of garbage Strings. Here's how you can create an efficient hashing strategy for case-insensitive Strings.
public static final HashingStrategy<String> CASE_INSENSITIVE =
        new HashingStrategy<String>()
        {
            @Override
            public int computeHashCode(String string)
            {
                int hashCode = 0;
                for (int i = 0; i < string.length(); i++)
                {
                    hashCode = 31 * hashCode + Character.toLowerCase(string.charAt(i));
                }
                return hashCode;
            }

            @Override
            public boolean equals(String string1, String string2)
            {
                return string1.equalsIgnoreCase(string2);
            }
        };
Note: I am a developer on GS Collections.
Trove4j has the feature I'm after; they call it hashing strategies.
Their map implementation has different limitations and thus different prerequisites, so this does not automatically mean that an implementation for Java's "native" HashMap would be feasible.
Note: As noted in all other answers, HashMaps don't have an explicit ordering. They only recognize "equality". Getting an order out of a hash-based data structure is meaningless, as each object is turned into a hash - essentially a random number.
You can always write a hash function for a class (and often must), as long as you do it carefully. This is a hard thing to do properly, because hash-based data structures rely on a random, uniform distribution of hash values. In Effective Java, a large amount of text is devoted to properly implementing a hash method with good behaviour.
With all that being said, if you just want your hashing to ignore the case of a String, you can write a wrapper class around String for this purpose and insert those in your data structure instead.
A simple implementation:
public class LowerStringWrapper {
    public LowerStringWrapper(String s) {
        this.s = s;
        this.lowerString = s.toLowerCase();
    }

    // getter methods omitted

    // Rely on the hashing of String, as we know it to be good.
    public int hashCode() { return lowerString.hashCode(); }

    // We overrode hashCode, so we MUST also override equals. It is required
    // that if a.equals(b), then a.hashCode() == b.hashCode(), so we must
    // restore that invariant.
    public boolean equals(Object obj) {
        if (obj instanceof LowerStringWrapper) {
            return lowerString.equals(((LowerStringWrapper) obj).lowerString);
        } else {
            // comparing equal to a raw String would break the symmetry of equals
            return false;
        }
    }

    private String s;
    private String lowerString;
}
Good question - ask Josh Bloch. I submitted that concept as an RFE in Java 7, but it was dropped; I believe the reason was something performance related. I agree, though, it should have been done.
I suspect this has not been done because it would prevent hashCode caching?
I attempted creating a generic Map solution where all keys are silently wrapped. It turned out that the wrapper would have to hold the wrapped object, the cached hashCode, and a reference to the callback interface responsible for equality checks. This is obviously not as efficient as using a wrapper class, where you'd only have to cache the original key plus one more object (see hazzen's answer).
(I also bumped into a problem related to generics: the get method accepts Object as input, so the callback interface responsible for hashing would have to perform an additional instanceof check. Either that, or the map class would have to know the Class of its keys.)
This is an interesting idea, but it's absolutely horrendous for performance. The reason for this is quite fundamental to the idea of a hashtable: the ordering cannot be relied upon. Hashtables are very fast (constant time) because of the way in which they index elements in the table: by computing a pseudo-unique integer hash for that element and accessing that location in an array. It's literally computing a location in memory and directly storing the element.
This contrasts with a balanced binary search tree (TreeMap) which must start at the root and work its way down to the desired node every time a lookup is required. Wikipedia has some more in-depth analysis. To summarize, the efficiency of a tree map is dependent upon a consistent ordering, thus the order of the elements is predictable and sane. However, because of the performance hit imposed by the "traverse to your destination" approach, BSTs are only able to provide O(log(n)) performance. For large maps, this can be a significant performance hit.
It is possible to impose a consistent ordering on a hashtable, but doing so involves techniques similar to LinkedHashMap, manually maintaining the ordering. Alternatively, two separate data structures can be maintained internally: a hashtable and a tree. The table can be used for lookups, while the tree can be used for iteration. The problem, of course, is that this uses more than double the required memory. Also, insertions are only as fast as the tree: O(log(n)). Concurrent tricks can bring this down a bit, but that isn't a reliable performance optimization.
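The LinkedHashMap technique in action - it threads a doubly-linked list through its entries to remember insertion order:

Map<String, Integer> ordered = new LinkedHashMap<>();
ordered.put("b", 2);
ordered.put("a", 1);
ordered.put("c", 3);
System.out.println(ordered.keySet()); // [b, a, c] - insertion order, not sorted order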
In short, your idea sounds really good, but if you actually tried to implement it, you would see that doing so imposes massive performance limitations. The final verdict is (and has been for decades): if you need performance, use a hashtable; if you need ordering and can live with degraded performance, use a balanced binary search tree. I'm afraid there's really no way of efficiently combining the two structures without losing some of the guarantees of one or the other.
There's such a feature in com.google.common.collect.CustomConcurrentHashMap; unfortunately, there's currently no public way to set the Equivalence (their Hasharator). Maybe they're not done with it yet, or maybe they don't consider the feature useful enough. Ask on the Guava mailing list.
I wonder why it hasn't happened yet, as it was mentioned in this talk over two years ago.