How to generate hash of a set to ensure integrity? - java

Maybe this has been asked before (where I didn't find it)...
I have a java.util.Set of approx. 50,000 Strings. I would like to generate some sort of hash to check whether it has been changed (by comparing the hashes of two versions of the Set).
If the Set changes, the hash has to be different.
How can that be achieved? Thanks!
EDIT:
Sorry for that misleading wording. I don't want to check if "it" has been changed (the same instance). Instead I want to check if two database queries, which are generating two - maybe identical - instances of a Set of Strings are equal.

I'd try using java.util.AbstractSet's hashCode method, as stated in the documentation:
Returns the hash code value for this set. The hash code of a set is
defined to be the sum of the hash codes of the elements in the set,
where the hash code of a null element is defined to be zero. This
ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode()
for any two sets s1 and s2, as required by the general contract of
Object.hashCode().
Of course, this only works if your Set implementation extends AbstractSet or otherwise follows the Set.hashCode() contract; I assume you use something like java.util.HashSet, which does. As always, there is a chance of a hash collision.
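As a minimal sketch of that idea (the class and method names here are made up for illustration), the hash can serve as a cheap first check, with a full equals comparison only when the hashes agree:

```java
import java.util.Set;

class SetCompare {
    /** Cheap hash check first, full equals only when the hashes match. */
    static boolean sameContents(Set<String> a, Set<String> b) {
        if (a.hashCode() != b.hashCode()) {
            // Sets that follow the Set.hashCode() contract cannot be equal here.
            return false;
        }
        // Equal hashes may still be a collision, so confirm with equals.
        return a.equals(b);
    }
}
```

Calling SetCompare.sameContents(firstQueryResult, secondQueryResult) would then answer whether the two query results hold the same strings, regardless of which compliant Set implementation each one uses.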
Alternatively, you could extend an existing Set implementation and override the state-changing methods. This may make sense if computing the hash of every element becomes too expensive. For example:
class ChangeSet<E> extends java.util.HashSet<E> {
    private boolean changed = false;
    @Override
    public boolean add(E e) {
        // super.add returns true only if the element was not already present
        boolean added = super.add(e);
        if (added) {
            changed = true;
        }
        return added;
    }
    public void commit() {
        changed = false;
    }
    public boolean isChanged() {
        return changed;
    }
    /* and all the other methods (addAll, remove, removeAll, etc.) */
}

Based on this statement:
If the Set changes, the hash has to be different
It really can't be achieved unless you have more constraints. In general, a hash is a value in some fixed space. For example, your hash may be a 32-bit integer, so there are 2^32 possible hash values. In general, b bits give you 2^b possible hash values. To achieve what you want, the number of distinct sets you could ever encounter (i.e. the size of the set of all possible sets!) would have to be less than or equal to 2^b; otherwise two different sets must share a hash value. But my guess is that you can have arbitrary strings, so this isn't possible. And even if it were possible, you'd have to come up with a way to map each set onto the hash space without overlap, which can be challenging.
However, with a good hash function, it's not very likely that changing the set will end up producing the same hash value. So you can use the hash to determine inequality, but if the hash is the same, you still need to check for equality. (This is the same idea behind a hash set or a hash map, where elements map to buckets based on a hashcode, but you have to check for equality).
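One way to get a "good hash function" with a much larger hash space than 32 bits is a cryptographic digest. The following is only a sketch under assumptions: the helper name SetDigest is invented here, the strings are assumed to be non-null, and SHA-256 is one reasonable choice among several. Collisions remain theoretically possible, just far less likely than with an int hash.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.TreeSet;

class SetDigest {
    /** Order-independent SHA-256 digest of a set of non-null strings (hypothetical helper). */
    static byte[] digest(Set<String> strings) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Sort first so the iteration order of the source set does not matter.
            for (String s : new TreeSet<>(strings)) {
                byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
                // A length prefix keeps {"ab", "c"} distinct from {"a", "bc"}.
                md.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                md.update(bytes);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two digests can then be compared with java.util.Arrays.equals(SetDigest.digest(a), SetDigest.digest(b)). Sorting 50,000 strings costs more than summing hash codes, so this trades speed for a lower collision probability.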
Similar to what Paul mentioned but different: you can instead make a set implementation that has version numbers and ensure that you always generate a new version number when the set is mutated. Then you can compare the version number? I'm not sure if you care about immutable sets or whether the mutable set changes back to a version you have seen (i.e. - if it should always get the same version).
Hope this helps.

If you need to improve the performance of hashCode (as it is rather expensive for a large Set), you can cache it and update it as you go.
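import java.util.LinkedHashSet;
import java.util.Objects;
class MyHashSet<E> extends LinkedHashSet<E> {
    // Running sum of element hash codes, matching the Set.hashCode() contract.
    private int hashCode = 0;
    @Override
    public boolean add(E e) {
        if (super.add(e)) {
            hashCode += Objects.hashCode(e); // Objects.hashCode(null) is 0
            return true;
        }
        return false;
    }
    @Override
    public boolean remove(Object o) {
        if (super.remove(o)) {
            hashCode -= Objects.hashCode(o);
            return true;
        }
        return false;
    }
    @Override
    public void clear() {
        super.clear();
        hashCode = 0;
    }
    // Note: removals through iterator().remove() bypass this bookkeeping.
    @Override
    public int hashCode() {
        return hashCode;
    }
}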

Sometimes simpler is better. I suggest writing your own Set implementation. In it, override the add and remove methods so they set a flag if the Set is modified. Add a getter for the flag, isModified, and you don't have to worry about hash overhead or collisions. Just call MyCustomSet.isModified.
Alternately you can call Collections.unmodifiableSet to get a wrapper around your Set that can't be modified. An exception will be thrown if code attempts to modify the set.

Related

Cache that only uses the reference of the key object in java (and not hashCode or equals)

I have a class that runs some expensive calculations multiple times. I'd like to add a cache for it like:
private Map<MyObj, Result> cache = new HashMap<>();
private Result getFoo(MyObj myObj) {
    Result r = cache.get(myObj);
    if (r == null) {
        r = expensiveCalculation(myObj);
        cache.put(myObj, r); // store the result so later calls hit the cache
    }
    return r;
}
Since I know the only way two MyObj-s can be equal is if they're identical (the reference), I don't want the cache to calculate the hashCode() and equals(). Is there a way to have a Map that uses only the reference for hashing?
Or a better way to cache like this?
Please note that this is not an optimization. There is no benefit, and your code gets more complicated.
Since I know the only way two MyObj-s can be equal is if they're identical (the reference), I don't want the cache to calculate the hashCode() and equals().
The hashCode is always calculated, because it determines where the entry is stored in the hash table.
Before calling the equals method on the key objects, hash table implementations commonly check for an identical reference first.
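If identity semantics are wanted anyway, the JDK does ship java.util.IdentityHashMap, which compares keys with == and hashes them with System.identityHashCode instead of calling the key's own hashCode() or equals(). Here is a minimal sketch; the IdentityCache class and its compute function are illustrative names, not part of the question's code.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

/** Cache keyed by object identity (==); the key's hashCode()/equals() are never called. */
class IdentityCache<K, V> {
    private final Map<K, V> cache = new IdentityHashMap<>();
    private final Function<K, V> compute; // stands in for the expensive calculation

    IdentityCache(Function<K, V> compute) {
        this.compute = compute;
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = compute.apply(key);
            cache.put(key, value); // a null result is not cached and will be recomputed
        }
        return value;
    }
}
```

An identity hash is still computed on every lookup, so this changes the semantics (reference equality) rather than removing the hashing step.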

Unique Id for java object for 100% collision free storage in a data structure

I have a method that checks if two objects are equal(by reference).
public boolean isUnique( T uniqueIdOfFirstObject, T uniqueIdOfSecondObject ) {
return (uniqueIdOfFirstObject == uniqueIdOfSecondObject);
}
(Use case) Assuming that I don't have any control over creation of the object.
I have a method
void currentNodeExistOrAddToHashSet(Object newObject, HashSet<T> objectHash) {
// will it be 100% precise? Assuming both object have the same field values.
if (!objectHash.contains(newObject)) {
objectHash.add(newObject);
}
}
or I could do something like this
void currentNodeExistOrAddToHashSet(Object newObject, HashSet<T> objectHash){
//as per my knowledge, there might be collision for different objects.
int uniqueId = System.identityHashCode(newObject);
if (!objectHash.contains(uniqueId)) {
objectHash.add(uniqueId);
}
}
Is it possible to get a 100% collision proof Id in java i.e different object having different IDs, the same object having same ids irrespective of the content of the object?
Since you put them into a HashSet, which uses hashCode/equals, and hashCode is only 32 bits long, there is a limit, so collisions will happen. This is especially true because a HashSet only looks at the last n bits of the hash code, using one more bit each time it resizes.
The real question is: why do you want a collision-free structure in the first place? If you define a reasonably well-distributed hashCode and a decent equals, these things should not matter to you at all. If you worry about search performance, it is O(1) for HashSet.
You could define hashCode and equality based on a UUID, say from UUID#randomUUID, but this still confines your hashCode to the same 32 bits, so collisions could still happen.
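If the goal is only to tell object instances apart by reference, a hash collision does not have to cause a wrong answer: the hash merely selects a bucket, and membership is decided by the comparison that follows. A sketch using JDK classes that exist for this purpose (the IdentitySets wrapper name is made up for illustration):

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class IdentitySets {
    /** Returns a set whose membership test is reference equality (==). */
    static <T> Set<T> newIdentitySet() {
        // Identity hash codes may collide, but IdentityHashMap resolves
        // collisions with ==, so two distinct objects are never confused.
        return Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
    }
}
```

With such a set, add(newObject) returns false only when that exact instance was added before, regardless of the object's field values.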

Extend HashMap to get all objects with the specified hashcode?

From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code. I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
The reasoning is that I'm trying to optimize a part of a code that, currently, is just a while loop that finds the first object with that hashcode and stores/removes it. This would be a lot faster if I could just return the full list in one go.
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
The keys are Chunk objects and the values are Strings. Here are the hashcode() and equals() functions for the Chunk class:
/**
* Returns a string representation of the ArrayList of words
* thereby storing chunks with the same words but with different
* locations and next words in the same hash bucket, triggering the
* use of equals() when searching and adding
*/
public int hashCode() {
return (Words.toString()).hashCode();
}
@Override
/**
* result depends on the value of location. A location of -1 is obviously
* not valid and therefore indicates that we are searching for a match rather
* than adding to the map. This allows multiples of keys with matching hashcodes
* to be considered unequal when adding to the hashmap but equal when searching
* it, which is integral to the MakeMap() and GetOptions() methods of the
* RandomTextGenerator class.
*
*/
public boolean equals(Object obj) {
Chunk tempChunk = (Chunk)obj;
if (LocationInText == -1 && Words.size() == tempChunk.GetText().size())
{
for (int i = 0; i < Words.size(); i++) {
if (!Words.get(i).equals(tempChunk.GetText().get(i))) {
return false;
}
}
return true;
}
else {
if (tempChunk.GetLocation() == LocationInText) {
return true;
}
return false;
}
}
Thanks!
HashMap does not expose any way to do this, but I think you're misunderstanding how HashMap works in the first place.
The first thing you need to know is that if every single object had exactly the same hash code, HashMap would still work. It would never "mix up" keys. If you call get(key), it will only return the value associated with key.
The reason this works is that HashMap only uses hashCode as a first grouping, but then it checks the object you passed to get against the keys stored in the map using the .equals method.
There is no way, from the outside, to tell that HashMap uses linked lists. (In fact, in more recent versions of Java, it doesn't always use linked lists.) The implementation doesn't provide any way to look at hash codes, to find out how hash codes are grouped, or anything along those lines.
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
This code does not "find the first object with that hashcode and store/remove it." It finds the one and only object equal to toSearch according to .equals, stores and removes it. (There can only be one such object in a Map.)
Your while loop doesn't really loop. If WorkingMap is a plain Java HashMap, it runs at most once. .get(key) returns the value last stored in the HashMap under key. If a key matches toSearch, the body runs once.
There are several open questions here, so I'm not sure this is what you need, but here is a sketch in case it helps.
What type is Possibles? An ArrayList?
// this should do the same as your while loop
if(WorkingMap.containsKey(toSearch)) {
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
// further: extend your Possibles to hold the LinkedList you want
public class possibilities {
// List<LinkedList<String>> container = new ArrayList<LinkedList<String>>();
public Map<Chunk, LinkedList<String>> container2 = new HashMap<Chunk, LinkedList<String>>();
public void put(Chunk key, String value) {
if(!this.container2.containsKey(key)) {
this.container2.put(key, new LinkedList<String>());
}
this.container2.get(key).add(value);
}
}
// this one works with updated Possibles
if(WorkingMap.containsKey(toSearch)) {
Possibles.put(toSearch, WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
//---
However, yes, it can work like that, but the keys should not be complex objects.
Note: those LinkedLists take memory. How big are the chunks? Check the memory usage.
Possibles.(get)container2.keySet();
Good luck
Sail
From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code.
Yes, but it's more complicated than that. It often needs to put objects in linked lists even when they have differing hash codes, since it only uses some bits of the hash codes to choose which bucket to store objects in; the number of bits it uses depends on the current size of the internal hash table, which approximately depends on the number of things in the map. And when a bucket needs to contain multiple objects it will also try to use binary trees like a TreeMap if possible (if objects are mutually Comparable), rather than linked lists.
Anyway.....
I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
No.
A HashMap compares keys for equality according to the equals method. Equality according to the equals method is the only valid way to set, replace, or retrieve values associated with a particular key.
Yes, it also uses hashCode as a way to arrange objects in a structure that allows for far faster location of potentially equal objects. Still, the contract for matching keys is defined in terms of equals, not hashCode.
Note that it is perfectly legal for every hashCode method to be implemented as return 0; and the map will still work just as correctly (but very slowly). So any idea that involves getting a list of objects sharing a hash code is either impossible or pointless or both.
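To make that point concrete, here is a small sketch with a hypothetical Key class whose hashCode always returns 0. The map still distinguishes keys purely through equals:

```java
import java.util.HashMap;
import java.util.Map;

class ConstantHashDemo {
    /** Hypothetical key type: every instance has hash code 0, equality is by name. */
    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return 0; // legal, but every key lands in the same bucket
        }
    }

    static Map<Key, String> sample() {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second");
        map.put(new Key("a"), "replaced"); // equal key: replaces the value, adds no entry
        return map;
    }
}
```

The map returned by sample() has two entries, and looking up new Key("a") yields "replaced". Asking for "everything with hash code 0" would simply return the whole map, which is why such a query is not meaningful.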
I'm not 100% sure what you're doing in your equals method with the LocationInText variable, but it looks dangerous, as it violates the contract of the equals method. It is required that the equals method be symmetric, transitive, and consistent:
Symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
Transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
Consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
And the hashCode method is required to always agree with equals about equal objects:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
The LocationInText variable is playing havoc with those rules, and may well break things. If not today, then some day. Get rid of it!
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
Something that jumps out at me is that you only need to do the key lookup once, instead of doing it three times, since Map.remove returns the removed value or null if the key is not present:
for (;;) {
String s = WorkingMap.remove(toSearch);
if (s == null) break;
Occurences++;
Possibles.add(s);
}
Either way, the loop is still faulty, since it is supposed to be impossible for a map to contain more than one key equal to toSearch. I can't overstate that the LocationInText variable as you're using it is not a good idea.
I agree with the other commenters that it looks like you're looking for a map-of-lists structure. Some Java libraries like Guava offer a Multimap for this, but you can do it manually pretty easily. I think the declaration you want is:
Map<Chunk,List<String>> map = new HashMap<>();
To add a new chunk-string pair to the map, do:
void add(Chunk chunk, String string) {
map.computeIfAbsent(chunk, k -> new ArrayList<>()).add(string);
}
That method puts a new ArrayList in the map if the chunk is new, or fetches the existing ArrayList if there is one for that chunk. Then it adds the string to the list that it fetched or created.
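Retrieving the list of all strings for a particular chunk value is as simple as map.get(chunkToSearch), which you can add to your Possibles list as Possibles.addAll(map.get(chunkToSearch));. If the chunk might be absent, use map.getOrDefault(chunkToSearch, Collections.emptyList()) instead, because map.get returns null for a missing key and addAll(null) throws a NullPointerException.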
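Put together, the add and lookup steps could be wrapped as follows. This is a sketch with an invented class name (SimpleMultimap) and generic key and value types rather than Chunk and String, so that it stands on its own:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal map-of-lists: each key maps to every value added under it. */
class SimpleMultimap<K, V> {
    private final Map<K, List<V>> map = new HashMap<>();

    void add(K key, V value) {
        // Create the list on first use, then append to it.
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Never returns null: unknown keys yield an empty list. */
    List<V> get(K key) {
        return map.getOrDefault(key, Collections.emptyList());
    }
}
```

With this shape, the loop from the question reduces to a single get call whose result can be passed to addAll, and the number of occurrences is the size of the returned list.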
Other potential optimizations I'd point out:
In your Chunk.hashCode method, consider caching the hash code instead of recomputing it every time the method is called. If Chunk is mutable (which is not a good idea for a map key, but vaguely allowed so long as you're careful) then recompute the hash code only after the Chunk's value has changed. Also, if Words is a List, which it seems to be, it would likely be faster to use its hash code than convert it to a string and use the string's hash code, but I'm not sure.
In your Chunk.equals method, you can return true immediately if the instances are the same (which they often will be). Also, if GetText returns a copy of the data, then don't call it; you can access the private Words list of the other Chunk since you are in the same class, and finally, you can just defer to the List.equals method:
@Override
public boolean equals(Object o) {
return (this == o) || (o instanceof Chunk && this.Words.equals(((Chunk)o).Words));
}
Simple! Fast!
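Combining the two suggestions (cache the hash code, and let equals defer to List.equals), an immutable key class might look like the sketch below. WordChunk is a hypothetical stand-in for Chunk that deliberately leaves out the location field, and it assumes the word list never changes after construction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class WordChunk {
    private final List<String> words;
    private final int hash; // computed once; safe because words never changes

    WordChunk(List<String> words) {
        // Defensive copy so later changes to the caller's list cannot affect this key.
        this.words = Collections.unmodifiableList(new ArrayList<>(words));
        this.hash = this.words.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return this == o || (o instanceof WordChunk && words.equals(((WordChunk) o).words));
    }

    @Override
    public int hashCode() {
        return hash; // consistent with equals: equal lists have equal hash codes
    }
}
```

Because equals and hashCode both depend only on the word list, the symmetric, transitive, and consistent requirements quoted above hold.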

Why we need to override hashCode and equals?

By default, hashCode and equals work fine.
I have used objects with hash tables like HashMap without overriding these methods, and it was fine. For example:
public class Main{
public static void main(String[] args) throws Exception{
Map map = new HashMap<>();
Object key = new Main();
map.put(key, "2");
Object key2 = new Main();
map.put(key2, "3");
System.out.println(map.get(key));
System.out.println(map.get(key2));
}
}
This code works fine. By default, hashCode returns the memory address of the object, and equals checks whether two objects are the same instance. So what is the problem with using the default implementation of these methods?
Note this example from an old pdf I have:
This code
public class Name {
private String first, last;
public Name(String first, String last) {
this.first = first;
this.last = last;
}
public boolean equals(Object o) {
if (!(o instanceof Name)) return false;
Name n = (Name)o;
return n.first.equals(first) && n.last.equals(last);
}
public static void main(String[] args) {
Set s = new HashSet();
s.add(new Name("Donald", "Duck"));
System.out.println(
s.contains(new Name("Donald", "Duck")));
}
}
...will not reliably print true because, as stated in the PDF:
Donald is in the set, but the set can’t find him. The Name class
violates the hashCode contract
Because equality here depends on the two strings that make up the object, the hash code should also be computed from those two fields.
To fix this code we should add a hashCode method:
public int hashCode() {
return 31 * first.hashCode() + last.hashCode();
}
This question in the PDF ends by saying that we should
override hashCode when overriding equals
In your example, whenever you want to retrieve something from your HashMap, you need to have key and key2, because their equals() is the same as object identity. This makes the HashMap completely useless, because you cannot retrieve anything from it without having these two keys. Passing the keys around doesn't make sense, because you could just as well pass the values around; it would be equally awkward.
Now try to imagine some use case, where a HashMap actually makes sense. For example, suppose that you get String-valued requests from the outside, and want to return, say, ip-addresses. The keys that come from the outside obviously cannot be the same as the keys you used to set up your map. Therefore you need some methods that compare requests from the outside to the keys you used during the initialization phase. This is exactly what equals is good for: it defines an equivalence relation on objects that are not identical in the sense of being represented by the same bits in physical memory. hashCode is a coarser version of equals, which is necessary to retrieve values from HashMaps quickly.
Your example is not very useful as it would be simpler to have simple variables. i.e. the only way to lookup the value in the map is to hold the original key. In which case, you may as well just hold the value and not have a Map in the first place.
If instead you want to be able to create a new key which is considered equivalent to a key used previously, you have to provide how equivalence is determined.
Given that most objects are never asked for their identity hash code, the system does not keep for most objects any information that would be sufficient to establish a permanent identity. Instead, Java uses two bits in the object header to distinguish three states:
The identity hashcode for the object has never been queried.
The identity hashcode has been queried, but the object has not been moved by the GC since then.
The identity hashcode has been queried, and the object has been moved since then.
For objects in the first state, asking for the identity hash code will change the object to the second state and process it as a second-state object.
For objects in the second state, including those which had moments before been in the first state, the identity hash code will be formed from the address.
When an object in the second state is moved by the GC, the GC will allocate an extra 32 bits to the object, which will be used to hold a hash-code derived from its original address. The object will then be assigned to the third state.
Subsequent requests for the hash code from a state-3 object will use that value that was stored when it was moved.
At times when the system knows that no objects within a certain address range are in state 2, it may change the formula used to compute hash codes from addresses in that range.
Although at any given time there may only be one object at any given address, it is entirely possible that an object might be asked for its identity hash code and later moved, and that another object might be placed at either the same address as the first one, or an address that would hash to the same value (the system might change the formula used to compute hash values to avoid duplication, but would be unable to eliminate it).

Set that only needs equals

I'm curious, is there any Set that only requires .equals() to determine the uniqueness?
When looking at Set classes from java.util, I can only find HashSet which needs .hashCode() and TreeSet (or generally SortedSet) which requires Comparator. I cannot find any class that use only .equals().
Does it make sense that if I have .equals() method, it is sufficient to use it to determine object uniqueness? Thus have a Set implementation that only need to use .equals()? Or did I miss something here that .equals() are not sufficient to determine object uniqueness in Set implementation?
Note that I am aware of Java practice that if we override .equals(), we should override .hashCode() as well to maintain contract defined in Object.
On its own, the equals method is perfectly sufficient to implement a set correctly, but not to implement it efficiently.
The point of a hash code or a comparator is that they provide ways to arrange objects in some ordered structure (a hash table or a tree) which allows for fast finding of objects. If you have only the equals method for comparing pairs of objects, you can't arrange the objects in any meaningful or clever order; you have only a loose jumble of objects.
For example, with only the equals method, ensuring that objects in a set are unique requires comparing each added object to every other object in the jumble. Adding n objects requires
n * (n - 1) / 2 comparisons. For 5 objects that's 10 comparisons, which is fine, but for 1,000 objects that's 499,500 comparisons. It scales terribly.
Because it would not give scalable performance, no such set implementation is in the standard library.
If you don't care about hash table performance, this is a minimal implementation of the hashCode method which works for any class:
@Override
public int hashCode() {
return 0; // or any other constant
}
Although it is required that equal objects have equal hash codes, it is never required for correctness that unequal objects have unequal hash codes, so returning a constant is legal. If you put these objects in a HashSet or use them as HashMap keys, they will end up in a jumble in a single hash table bucket. Performance will be bad, but it will work correctly.
Also, for what it's worth, a minimal working Set implementation which only ever uses the equals method would be:
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;
public class ArraySet<E> extends AbstractSet<E> {
    private final ArrayList<E> list = new ArrayList<>();
    @Override
    public boolean add(E e) {
        // list.contains compares e against each stored element using equals
        if (!list.contains(e)) {
            list.add(e);
            return true;
        }
        return false;
    }
    @Override
    public Iterator<E> iterator() {
        return list.iterator();
    }
    @Override
    public int size() {
        return list.size();
    }
}
The set stores objects in an ArrayList, and uses list.contains to call equals on objects. Inherited methods from AbstractSet and AbstractCollection provide the bulk of the functionality of the Set interface; for example its remove method gets implemented via the list iterator's remove method. Each operation to add or remove an object or test an object's membership does a comparison against every object in the set, so it scales terribly, but works correctly.
Is this useful? Maybe, in certain special cases. For sets that are known to be very tiny, the performance might be fine, and if you have millions of these sets, this could save memory compared to a HashSet.
In general, though, it is better to write meaningful hash code methods and comparators, so you can have sets and maps that scale efficiently.
You should always override hashCode() when you override equals(). The contract for Object clearly specifies that two equal objects have identical hash codes, and a surprising number of data structures and algorithms depend on this behavior. It's not difficult to add a hashCode(), and if you skip it now, you'll eventually get hard-to-diagnose bugs when your objects start getting put in hash-based structures.
It would mathematically make sense to have a set that requires nothing but .equals().
But such an implementation would be so slow (linear time for every operation) that the standard library instead expects you to supply a hint, either a hash code or an ordering.
Anyway, if there is really no way you can write a hashCode(), just make it always return 0 and you will have a structure that is as slow as the one you hoped for!
