Set that only needs equals() - Java

I'm curious: is there any Set that only requires .equals() to determine uniqueness?
When looking at the Set classes from java.util, I can only find HashSet, which needs .hashCode(), and TreeSet (or generally SortedSet), which requires a Comparator. I cannot find any class that uses only .equals().
Does it make sense that if I have an .equals() method, it is sufficient to determine object uniqueness, and thus to have a Set implementation that only needs .equals()? Or did I miss something, and .equals() alone is not sufficient to determine object uniqueness in a Set implementation?
Note that I am aware of the Java practice that if we override .equals(), we should override .hashCode() as well, to maintain the contract defined in Object.

On its own, the equals method is perfectly sufficient to implement a set correctly, but not to implement it efficiently.
The point of a hash code or a comparator is that they provide ways to arrange objects in some ordered structure (a hash table or a tree) which allows for fast finding of objects. If you have only the equals method for comparing pairs of objects, you can't arrange the objects in any meaningful or clever order; you have only a loose jumble of objects.
For example, with only the equals method, ensuring that objects in a set are unique requires comparing each added object to every other object in the jumble. Adding n objects requires n * (n - 1) / 2 comparisons. For 5 objects that's 10 comparisons, which is fine, but for 1,000 objects that's 499,500 comparisons. It scales terribly.
Because it would not give scalable performance, no such set implementation is in the standard library.
If you don't care about hash table performance, this is a minimal implementation of the hashCode method which works for any class:
@Override
public int hashCode() {
    return 0; // or any other constant
}
Although it is required that equal objects have equal hash codes, it is never required for correctness that inequal objects have inequal hash codes, so returning a constant is legal. If you put these objects in a HashSet or use them as HashMap keys, they will end up in a jumble in a single hash table bucket. Performance will be bad, but it will work correctly.
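To see this in action, here is a quick sketch (the Point class is hypothetical, written just for this demonstration):
import java.util.HashSet;
import java.util.Set;

public class ConstantHashDemo {
    // Hypothetical value class: field-based equals(), constant hashCode().
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
        }
        @Override public int hashCode() { return 0; } // every object lands in one bucket
    }

    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        set.add(new Point(1, 2)); // equal to the first element, so it is rejected
        System.out.println(set.size()); // prints 1: correct, just slow at scale
    }
}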
Also, for what it's worth, a minimal working Set implementation which only ever uses the equals method would be:
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;

public class ArraySet<E> extends AbstractSet<E> {
    private final ArrayList<E> list = new ArrayList<>();

    @Override
    public boolean add(E e) {
        if (!list.contains(e)) {
            list.add(e);
            return true;
        }
        return false;
    }

    @Override
    public Iterator<E> iterator() {
        return list.iterator();
    }

    @Override
    public int size() {
        return list.size();
    }
}
The set stores objects in an ArrayList, and uses list.contains to call equals on objects. Inherited methods from AbstractSet and AbstractCollection provide the bulk of the functionality of the Set interface; for example its remove method gets implemented via the list iterator's remove method. Each operation to add or remove an object or test an object's membership does a comparison against every object in the set, so it scales terribly, but works correctly.
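A quick usage sketch, assuming the ArraySet above:
public class ArraySetDemo {
    public static void main(String[] args) {
        ArraySet<String> set = new ArraySet<>();
        set.add("a");
        set.add("b");
        set.add("a");                          // duplicate; rejected via equals()
        System.out.println(set);               // [a, b]
        System.out.println(set.contains("b")); // true, found by a linear scan
        set.remove("a");                       // inherited from AbstractCollection,
                                               // implemented via the iterator's remove()
        System.out.println(set.size());        // 1
    }
}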
Is this useful? Maybe, in certain special cases. For sets that are known to be very tiny, the performance might be fine, and if you have millions of these sets, this could save memory compared to a HashSet.
In general, though, it is better to write meaningful hash code methods and comparators, so you can have sets and maps that scale efficiently.

You should always override hashCode() when you override equals(). The contract for Object clearly specifies that two equal objects have identical hash codes, and a surprising number of data structures and algorithms depend on this behavior. It's not difficult to add a hashCode(), and if you skip it now, you'll eventually get hard-to-diagnose bugs when your objects start getting put in hash-based structures.

It would mathematically make sense to have a set that requires nothing but .equals().
But such an implementation would be so slow (linear time for every operation) that the library designers decided you must always supply a hint: a hash code or a comparator.
Anyway, if there is really no way for you to write a meaningful hashCode(), just make it always return 0 and you will have a structure that is as slow as the one you hoped for!

Related

Unique Id for java object for 100% collision free storage in a data structure

I have a method that checks if two objects are equal (by reference).
public boolean isUnique(T uniqueIdOfFirstObject, T uniqueIdOfSecondObject) {
    return (uniqueIdOfFirstObject == uniqueIdOfSecondObject);
}
(Use case) Assuming that I don't have any control over creation of the object.
I have a method
void currentNodeExistOrAddToHashSet(Object newObject, HashSet<T> objectHash) {
    // will it be 100% precise? Assuming both objects have the same field values.
    if (!objectHash.contains(newObject)) {
        objectHash.add(newObject);
    }
}
or I could do something like this
void currentNodeExistOrAddToHashSet(Object newObject, HashSet<T> objectHash) {
    // as per my knowledge, there might be collisions for different objects.
    int uniqueId = System.identityHashCode(newObject);
    if (!objectHash.contains(uniqueId)) {
        objectHash.add(uniqueId);
    }
}
Is it possible to get a 100% collision-proof ID in Java, i.e. different objects having different IDs and the same object having the same ID, irrespective of the content of the object?
Since you put them into a HashSet that uses hashCode/equals, and hashCode is 32 bits long, there is a hard limit; thus collisions will happen. Especially since a HashSet actually only looks at the last n bits of the hash until it grows in size, at which point it looks at one more bit, and so on. You can read a lot more about this here, for example.
The question is different here: why do you want a collision-free structure in the first place? If you define a fairly well-distributed hashCode and a fairly decent equals, these things should not matter to you at all. If you worry about the performance of a search, it is O(1) for HashSet.
You could define hashCode and equality based on a UUID, say via UUID#randomUUID, but this still bounds your hashCode to the same 32 bits, so collisions could still happen.
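A sketch of what that would look like (the class is hypothetical; each instance is assigned a random UUID at construction):
import java.util.UUID;

// Hypothetical wrapper that defines identity via a random UUID.
final class UuidIdentified {
    private final UUID id = UUID.randomUUID();

    @Override public boolean equals(Object o) {
        return o instanceof UuidIdentified && id.equals(((UuidIdentified) o).id);
    }

    @Override public int hashCode() {
        return id.hashCode(); // still only 32 bits, so collisions remain possible
    }
}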

Should I override hashCode() of Collections?

Given that I have some class with various fields in it:
class MyClass {
    private String s;
    private MySecondClass c;
    private Collection<someInterface> coll;
    // ...

    @Override public int hashCode() {
        // ????
    }
}
I have various objects of that class which I'd like to store in a HashMap. For that, I need MyClass to have a hashCode().
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
I OPENED A SECOND QUESTION regarding the actual problem of uniquely IDing an object here: How do I generate an (almost) unique hash ID for objects?
Clarification:
is there anything more or less unique in your class? The String s? Then use only that as the hash code.
The hashCode() of two MyClass objects should definitely differ if any of the values in the coll of one of the objects is changed. hashCode() should only return the same value if all fields of two objects store the same values, recursively. Basically, there is a time-consuming calculation performed on a MyClass object. I want to save this time if the calculation has already been done with the exact same values some time ago. For this purpose, I'd like to look up in a HashMap whether the result is already available.
Would you be using MyClass in a HashMap as the key or as the value? If the key, you have to override both equals() and hashCode()
Thus, I'm using the hashCode OF MyClass as the key in a HashMap. The value (calculation result) will be something different, like an Integer (simplified).
What do you think equality should mean for multiple collections? Should it depend on element ordering? Should it only depend on the absolute elements that are present?
Wouldn't that depend on the kind of Collection that is stored in coll? Though I guess ordering not really important, no
The responses you get from this site are gorgeous. Thank you all!
@AlexWien that depends on whether that collection's items are part of the class's definition of equivalence or not.
Yes, yes they are.
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
That's correct. It's not as onerous as it sounds because the rule of thumb is that you only need to override hashCode() if you override equals(). You don't have to worry about classes that use the default equals(); the default hashCode() will suffice for them.
Also, for your class, you only need to hash the fields that you compare in your equals() method. If one of those fields is a unique identifier, for instance, you could get away with just checking that field in equals() and hashing it in hashCode().
All of this is predicated upon you also overriding equals(). If you haven't overridden that, don't bother with hashCode() either.
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
Yes, you can rely on any collection type in the Java standard library to implement hashCode() correctly. And yes, any List or Set will take into account its contents (it will mix together the items' hash codes).
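For example, the List contract computes hashCode = 1, then hashCode = 31 * hashCode + e.hashCode() for each element e (0 for null), so two lists with equal elements in the same order are guaranteed to have equal hash codes. A quick sketch:
import java.util.Arrays;
import java.util.List;

public class ListHashDemo {
    public static void main(String[] args) {
        List<String> a = Arrays.asList("x", "y");
        List<String> b = Arrays.asList("x", "y");
        // Equal contents in the same order => equal list hash codes.
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}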
So you want to do a calculation on the contents of your object that gives you a unique key, which you can then look up in a HashMap to see whether the "heavy" calculation that you don't want to do twice has already been done for a given deep combination of fields.
Using hashCode alone:
I believe hashCode is not the appropriate thing to use in the scenario you are describing.
hashCode should always be used in association with equals(). It's part of its contract, and it's an important part, because hashCode() returns an integer, and although one may try to make hashCode() as well-distributed as possible, it is not going to be unique for every possible object of the same class, except for very specific cases (It's easy for Integer, Byte and Character, for example...).
If you want to see for yourself, try generating strings of up to 4 letters (lower and upper case), and see how many of them have identical hash codes.
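A sketch of that experiment, shortened to two-letter strings, which is already enough to find collisions (the classic pair "Aa" and "BB" both hash to 2112):
import java.util.HashMap;
import java.util.Map;

public class StringCollisionDemo {
    public static void main(String[] args) {
        String letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        // Map each hash code to how many distinct two-letter strings produce it.
        Map<Integer, Integer> counts = new HashMap<>();
        for (char a : letters.toCharArray()) {
            for (char b : letters.toCharArray()) {
                counts.merge(("" + a + b).hashCode(), 1, Integer::sum);
            }
        }
        long shared = counts.values().stream().filter(n -> n > 1).count();
        System.out.println(letters.length() * letters.length() + " strings, "
                + counts.size() + " distinct hash codes, "
                + shared + " hash codes shared by more than one string");
    }
}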
HashMap therefore uses both the hashCode() and equals() methods when it looks things up in the hash table. There will be elements that have the same hashCode(), and you can only tell whether a candidate is the same element by testing it with equals().
Using hashCode and equals together
In this approach, you use the object itself as the key in the hash map, and give it an appropriate equals method.
To implement the equals method you need to go deeply into all your fields. All of their classes must have an equals() that matches what you think of as equal for the sake of your big calculation. Special care needs to be taken when your objects implement an interface. If the calculation is based on calls to that interface, and different objects that implement the interface return the same value in those calls, then they should implement equals in a way that reflects that.
And their hashCode is supposed to match the equals - when the values are equal, the hashCode must be equal.
You then build your equals and hashCode based on all those items. You may use Objects.equals(Object, Object) and Objects.hash(Object...) to save yourself a lot of boilerplate code.
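A minimal sketch of that boilerplate (the two stub types stand in for the question's MySecondClass and someInterface):
import java.util.Collection;
import java.util.Objects;

// Stubs standing in for the question's types.
class MySecondClass { }
interface SomeInterface { }

class MyClass {
    private String s;
    private MySecondClass c;    // assumed to have a suitable equals()/hashCode()
    private Collection<SomeInterface> coll;

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MyClass)) return false;
        MyClass other = (MyClass) o;
        return Objects.equals(s, other.s)
                && Objects.equals(c, other.c)
                && Objects.equals(coll, other.coll);
    }

    @Override public int hashCode() {
        return Objects.hash(s, c, coll); // must use the same fields as equals()
    }
}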
But is this a good approach?
While you can cache the result of hashCode() in the object and re-use it without calculation as long as you don't mutate it, you can't do that for equals. This means that calculation of equals is going to be lengthy.
So depending on how many times the equals() method is going to be called for each object, this is going to be exacerbated.
If, for example, you are going to have 30 objects in the hashMap, but 300,000 objects are going to come along and be compared to them only to realize that they are equal to them, you'll be making 300,000 heavy comparisons.
If you're only going to have very few instances in which an object is going to have the same hashCode or fall in the same bucket in the HashMap, requiring comparison, then going the equals() way may work well.
If you decide to go this way, you'll need to remember:
If the object is a key in a HashMap, it should not be mutated as long as it's there. If you need to mutate it, you may need to make a deep copy of it and keep the copy in the hash map. Deep copying again requires consideration of all the objects and interfaces inside to see if they are copyable at all.
Creating a unique key for each object
Back to your original idea, we have established that hashCode is not a good candidate for a key in a hash map. A better candidate for that would be a hash function such as md5 or sha1 (or more advanced hashes, like sha256, but you don't need cryptographic strength in your case), where collisions are a lot rarer than a mere int. You could take all the values in your class, transform them into a byte array, hash it with such a hash function, and take its hexadecimal string value as your map key.
Naturally, this is not a trivial calculation. So you need to think if it's really saving you much time over the calculation you are trying to avoid. It is probably going to be faster than repeatedly calling equals() to compare objects, as you do it only once per instance, with the values it had at the time of the "big calculation".
For a given instance, you could cache the result and not calculate it again unless you mutate the object. Or you could just calculate it again only just before doing the "big calculation".
However, you'll need the "cooperation" of all the objects you have inside your class. That is, they will all need to be reasonably convertible into a byte array in such a way that two equivalent objects produce the same bytes (including the same issue with the interface objects that I mentioned above).
You should also beware of situations in which you have, for example, two strings "AB" and "CD" which will give you the same result as "A" and "BCD", and then you'll end up with the same hash for two different objects.
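A sketch of that approach with java.security.MessageDigest (the field values are hypothetical and passed in as varargs; each value is length-prefixed, which avoids exactly the "AB"/"CD" versus "A"/"BCD" ambiguity just described; it also assumes every field has a value-based toString()):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DeepKey {
    // Builds a hex key from the string forms of the given values.
    static String keyOf(Object... values) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (Object v : values) {
            byte[] bytes = String.valueOf(v).getBytes(StandardCharsets.UTF_8);
            // Length prefix, so ("AB","CD") and ("A","BCD") digest differently.
            md.update((byte) (bytes.length >>> 24));
            md.update((byte) (bytes.length >>> 16));
            md.update((byte) (bytes.length >>> 8));
            md.update((byte) bytes.length);
            md.update(bytes);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(keyOf("AB", "CD").equals(keyOf("A", "BCD"))); // false
    }
}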
For future readers.
Yes, equals and hashCode go hand in hand.
Below is a typical implementation using a helper library, but it really shows the "hand in hand" nature. And the helper library from Apache keeps things simpler, IMHO:
@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (o == null || getClass() != o.getClass()) {
        return false;
    }
    MyCustomObject castInput = (MyCustomObject) o;
    boolean returnValue = new org.apache.commons.lang3.builder.EqualsBuilder()
            .append(this.getPropertyOne(), castInput.getPropertyOne())
            .append(this.getPropertyTwo(), castInput.getPropertyTwo())
            .append(this.getPropertyThree(), castInput.getPropertyThree())
            .append(this.getPropertyN(), castInput.getPropertyN())
            .isEquals();
    return returnValue;
}

@Override
public int hashCode() {
    return new org.apache.commons.lang3.builder.HashCodeBuilder(17, 37)
            .append(this.getPropertyOne())
            .append(this.getPropertyTwo())
            .append(this.getPropertyThree())
            .append(this.getPropertyN())
            .toHashCode();
}
17, 37... you can pick your own values there (any two non-zero odd numbers will do).
From your clarifications:
You want to store MyClass in a HashMap as the key.
This means the hashCode() is not allowed to change after the object has been added.
So if your collections may change after object instantiation, they should not be part of the hashCode().
From http://docs.oracle.com/javase/8/docs/api/java/util/Map.html
Note: great care must be exercised if mutable objects are used as map keys. The behavior of a map is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is a key in the map.
For 20-100 objects it is not worth risking an inconsistent hashCode() or equals() implementation.
There is no need to override hashCode() and equals() in your case.
If you don't override them, Java uses the unique object identity for equals() and hashCode() (and that works, especially because you stated that you don't need an equals() that considers the values of the object fields).
When using the default implementation, you are on the safe side.
Making an error here, e.g. using a custom hashCode() as a key in the HashMap when the hash code changes after insertion (because you used the hashCode() of the collections as part of your object's hash code), may result in an extremely hard-to-find bug.
If you need to find out whether the heavy calculation is finished, I would not abuse equals(). Just write your own method objectStateValue() and call hashCode() on the collection, too. This then does not interfere with the object's hashCode() and equals().
public int objectStateValue() {
    // TODO make sure the fields are not null
    return 31 * s.hashCode() + coll.hashCode();
}
Another, simpler possibility: the code that performs the time-consuming calculation can increment a calculationCounter by one as soon as the calculation is ready. You then just check whether or not the counter has changed. This is much cheaper and simpler.
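A minimal sketch of that counter idea (the names are hypothetical, and single-threaded access is assumed):
class CalculationCache {
    private long calculationCounter = 0; // bumped whenever a calculation completes
    private long lastSeenCounter = -1;   // value at the time of the last check

    void onCalculationFinished() {
        calculationCounter++;
    }

    boolean hasNewResultSinceLastCheck() {
        boolean changed = (calculationCounter != lastSeenCounter);
        lastSeenCounter = calculationCounter;
        return changed;
    }
}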

Java Overriding hashCode() method has any Performance issue?

If I override the hashCode() method, will it degrade the performance of the application? I am overriding this method in many places in my application.
Yes, you can degrade the performance of a hashed collection if the hashCode method is implemented in a bad way. The best implementation of a hashCode method generates distinct hash codes for distinct objects. Distinct hash codes avoid collisions, so an element can be stored and retrieved with O(1) complexity. But the hashCode method alone cannot achieve this; you need to override the equals method as well, to help the collection tell colliding objects apart.
If the hashCode method is not able to generate distinct hashes for distinct objects, then there is a chance that you will be holding more than one object in a bucket. This occurs when you have two elements with the same hash for which equals returns false. Each time this happens, the element is appended to the list at that hash bucket. This slows down both insertion and retrieval of elements: it leads to O(n) complexity for the get method, where n is the size of the list in a bucket.
Note: While you try to generate distinct hashes for distinct objects in your hashCode implementation, be sure to use a simple algorithm for doing so. If your algorithm for generating the hash is too heavy, you will surely see poor performance for operations on your hashed collection, as the hashCode method is called for most operations on it.
It would improve performance if the right data structure is used in the right place.
For example: a proper hashCode implementation in your objects can nearly convert O(n) to O(1) for HashMap lookups,
unless you are doing overly complicated work in your hashCode() method.
hashCode() is invoked every time a hash data structure has to deal with your object, so a heavy hashCode() method (which it shouldn't be) is paid for on every such operation.
It depends entirely on how you're implementing hashCode. If you're doing lots of expensive deep operations, then perhaps it might, and in that case, you should consider caching a copy of the hashCode (like String does). But a decent implementation, such as with HashCodeBuilder, won't be a big deal. Having a good hashCode value can make lookups in data structures like HashMaps and HashSets much, much faster, and if you override equals, you need to override hashCode.
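A sketch of that caching pattern (a hypothetical immutable class; the sentinel value 0 meaning "not yet computed" is the same trick java.lang.String uses, and such a hash is simply recomputed on each call):
import java.util.Objects;

final class ExpensiveKey {
    private final String a;
    private final String b;
    private int cachedHash; // 0 means "not computed yet", as in String

    ExpensiveKey(String a, String b) {
        this.a = a;
        this.b = b;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ExpensiveKey)) return false;
        ExpensiveKey other = (ExpensiveKey) o;
        return a.equals(other.a) && b.equals(other.b);
    }

    @Override public int hashCode() {
        int h = cachedHash;
        if (h == 0) {          // compute only on first use
            h = Objects.hash(a, b);
            cachedHash = h;
        }
        return h;
    }
}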
Java's hashCode() is a virtual function anyway, so there is no performance loss by the sheer fact that it is overridden and the overridden method is used.
The real difference may be the implementation of the method. By default, hashCode() works like this (source):
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)
So, whenever your implementation is as simple as this, there will be no performance loss. However, if you perform complex computing operations based on many fields, calling many other functions - you will notice a performance loss but only because your hashCode() does more things.
There is also the issue of inefficient hashCode() implementations. For example, if your hashCode() simply returns value 1 then the use of HashMap or HashSet will be significantly slower than with proper implementation. There is a great question which covers the topic of implementing hashCode() and equals() on SO: What issues should be considered when overriding equals and hashCode in Java?
One more note: remember, that whenever you implement hashCode() you should also implement equals(). Moreover, you should do it with care, because if you write an invalid hashCode() you may break equality checks for various collections.
Overriding hashCode() in a class does not in itself cause any performance issues. However, when an instance of such a class is inserted into a HashMap, HashSet, or an equivalent data structure, the hashCode() and (optionally) equals() methods are invoked to identify the right bucket to put the element in. The same applies to retrieval, search, and deletion.
As posted by others, performance depends entirely on how hashCode() is implemented.
However, if a particular class's equals method is not used at all, then it is not mandatory to override equals() and hashCode(); but if equals() is overridden, hashCode() must be overridden as well.
As all previous comments mentioned, the hash code is used for hashing in collections, and it can also serve as a cheap negative condition in equality checks. So yes, a bad one can slow your app down a lot. And obviously there are more use cases.
First of all, I would say that the approach (whether to override it at all) depends on the type of objects you are talking about.
The default implementation of the hash code is as fast as possible, because it is (as nearly as practical) unique for every object. That may be enough for many cases.
It is not enough when you want to use a HashSet and, let's say, do not want to store two "same" objects in the collection. Now, the point is in the word "same".
"Same" can mean "the same instance". "Same" can mean "an object with the same (database) identifier", when your object is an entity. Or "same" can mean "an object with all properties equal". This choice clearly can affect performance.
Moreover, one of the properties can itself be an object that evaluates its hashCode() on demand, in which case calling the hash code method on the root object evaluates the hash code of the whole object tree.
So, what would I recommend? Define and clarify what you need. Do you really have to distinguish different object instances, is an identifier crucial, or is it a value object?
It also depends on immutability. It is possible to calculate the hash code once, when the object is constructed from its constructor properties (which only have getters), and to reuse that value whenever hashCode() is called. Another option is to recalculate the hash code whenever any property changes. You need to decide whether the value is mostly read or mostly written.
The last thing I would say is: override hashCode() only when you know that you need it and you know what you are doing.
The main purpose of hashCode method is to allow an object to be a key in the hash map or a member of a hash set. In this case an object should also implement equals(Object) method, which is consistent with hashCode implementation:
If a.equals(b) then a.hashCode() == b.hashCode()
If hashCode() was called twice on the same object, it should return the same result provided that the object was not changed
hashCode from the performance point of view
From the performance point of view, the main objective for your hashCode method implementation is to minimize the number of objects sharing the same hash code.
All JDK hash based collections store their values in an array.
Hash code is used to calculate an initial lookup position in this array. After that equals is used to compare given value with values stored in the internal array. So, if all values have distinct hash codes, this will minimize the possibility of hash collisions.
On the other hand, if all values have the same hash code, the hash map (or set) degrades into a list, with the work of processing n entries having O(n²) complexity.
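For reference, a sketch of how OpenJDK's HashMap (Java 8 and later) turns a hash code into that initial lookup position: the hash code's high bits are folded into the low bits, and the result is masked by the power-of-two table length. This is paraphrased, not copied, from the JDK source:
public class BucketIndexSketch {
    // Paraphrased from java.util.HashMap.hash(Object) in OpenJDK 8+.
    static int spread(int h) {
        return h ^ (h >>> 16); // fold high bits into low bits
    }

    // tableLength is always a power of two, so (tableLength - 1) is a bit mask.
    static int bucketIndex(Object key, int tableLength) {
        return (tableLength - 1) & spread(key.hashCode());
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex("example", 16)); // index in a 16-bucket table
    }
}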
From Java 8 onwards, though, collisions do not impact performance as much as in earlier versions, because after a threshold the linked list in a bucket is replaced by a balanced tree, which gives you O(log n) worst-case performance compared to the O(n) of a linked list.
Never write a hashCode method which returns a constant.
String.hashCode's distribution of results is nearly perfect, so you can sometimes substitute Strings with their hash codes.
The next objective is to check how many identifiers with non-unique hash codes you still have. Improve your hashCode method, or increase the range of allowed hash code values, if you have too many non-unique hash codes. In the perfect case, all your identifiers will have unique hash codes.

compareTo involving non-comparable field: how to maintain transitivity?

Consider a class with a comparable (consistent with equals) and a non-comparable field (of a class about which I do not know whether it overrides Object#equals or not).
The class' instances shall be compared, where the resulting order shall be consistent with equals, i.e. 0 returned iff both fields are equal (as per Object#equals) and consistent with the order of the comparable field. I used System.identityHashCode to cover most of the cases not covered by these requirements (the order of instances with same comparable, but different other value is arbitrary), but am not sure whether this is the best approach.
public class MyClass implements Comparable<MyClass> {
    private Integer intField;
    private Object nonCompField;

    public int compareTo(MyClass other) {
        int intFieldComp = this.intField.compareTo(other.intField);
        if (intFieldComp != 0)
            return intFieldComp;
        if (this.nonCompField.equals(other.nonCompField))
            return 0;
        // ...and now? My current approach:
        if (System.identityHashCode(this.nonCompField) < System.identityHashCode(other.nonCompField))
            return -1;
        else
            return 1;
    }
}
Two problems I see here:
If System.identityHashCode is the same for two objects, each is greater than the other. (Can this happen at all?)
The order of instances with the same intField value and different nonCompField values need not be consistent between runs of the program, as far as I understand what System.identityHashCode does.
Is that correct? Are there more problems? Most importantly, is there a way around this?
The first problem, although highly unlikely, could happen (I think you would need an enormous amount of memory and very bad luck). But it's solved by Guava's Ordering.arbitrary(), which uses the identity hash code behind the scenes, but maintains a cache of comparison results for the cases where two different objects have the same identity hash code.
Regarding your second question, no, the identity hash codes are not preserved between runs.
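A sketch of that approach (it assumes Guava's com.google.common.collect.Ordering is on the classpath; the class and field names are taken from the question):
import com.google.common.collect.Ordering;
import java.util.Comparator;

public class MyClass implements Comparable<MyClass> {
    // Shared, so the internal cache of tie-break results is reused across instances.
    private static final Comparator<Object> ARBITRARY = Ordering.arbitrary();

    private Integer intField;
    private Object nonCompField;

    @Override
    public int compareTo(MyClass other) {
        int c = this.intField.compareTo(other.intField);
        if (c != 0) return c;
        if (this.nonCompField.equals(other.nonCompField)) return 0;
        // Arbitrary but consistent order for distinct, non-equal objects.
        return ARBITRARY.compare(this.nonCompField, other.nonCompField);
    }
}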
System.identityHashCode […] the same for two objects […] (Can this happen at all?)
Yes it can. Quoting from the Java API Documentation:
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects.
identityHashCode(Object x) returns the same hash code for the given object as would be returned by the default method hashCode(), whether or not the given object's class overrides hashCode().
So you may encounter hash collisions, and with memory ever growing but hash codes staying fixed at 32 bit, they will become increasingly more likely.
The order of instances with the same intField value and different nonCompField values need not be consistent between runs of the program, as far as I understand what System.identityHashCode does.
Right. It might even be different during a single invocation of the same program: You could have (1,foo) < (1,bar) < (1,baz) even though foo.equals(baz).
Most importantly, is there a way around this?
You can maintain a map which maps each distinct value of the non-comparable type to a sequence number which you increase for each distinct value you encounter.
Memory management will be tricky, though: You cannot use a WeakHashMap as the code might make your key object unreachable but still hold a reference to another object of the same value. So either you maintain a list of weak references to all the objects of a given value, or you simply use strong references and accept the fact that any uncomparable value ever encountered will never be garbage collected.
Note that this scheme will still not result in reproducible sequence numbers unless you create values reproducibly in just the same order.
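A sketch of that scheme with strong references (so, as noted, registered values are never garbage collected; it also assumes the non-comparable class has a usable equals()/hashCode(), so that equal values receive the same number):
import java.util.HashMap;
import java.util.Map;

final class ValueSequencer {
    // Strong references: values registered here are never garbage collected.
    private final Map<Object, Integer> sequence = new HashMap<>();

    // Returns a stable sequence number for each distinct value,
    // assigned in encounter order (0, 1, 2, ...).
    synchronized int sequenceOf(Object value) {
        return sequence.computeIfAbsent(value, v -> sequence.size());
    }
}
compareTo would then end with return Integer.compare(sequencer.sequenceOf(this.nonCompField), sequencer.sequenceOf(other.nonCompField)); instead of the identity-hash comparison.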
If the class of the nonCompField has implemented a reasonably good toString(), you might be able to use
return String.valueOf(this.nonCompField).compareTo(String.valueOf(other.nonCompField));
Unfortunately, the default Object.toString() uses the hashcode, which has potential issues as noted by others.

Does List.retainAll() use HashMap internally?

I am purposefully violating the hashCode contract that says that if we override equals() in our class, we must override hashCode() as well, and I am making sure that no hash-related data structures (like HashMap, HashSet, etc.) are using it. The problem is that I fear methods like removeAll() and containsAll() of Lists might use HashMaps internally, and in that case, since I am not overriding hashCode() in my classes, their functionality might break.
Can anyone please confirm whether my doubt is valid? The classes contain a lot of fields that are used for equality comparison, and I would have to come up with an efficient technique to compute a hashCode using all of them. I really don't need them in any hash-related operations, so I am trying to avoid implementing hashCode().
From AbstractCollection.retainAll():
/**
 * <p>This implementation iterates over this collection, checking each
 * element returned by the iterator in turn to see if it's contained
 * in the specified collection. If it's not so contained, it's removed
 * from this collection with the iterator's <tt>remove</tt> method.
 */
public boolean retainAll(Collection<?> c) {
    boolean modified = false;
    Iterator<E> e = iterator();
    while (e.hasNext()) {
        if (!c.contains(e.next())) {
            e.remove();
            modified = true;
        }
    }
    return modified;
}
Note that the contains call goes to the specified collection c, not to the list itself, so whether hashCode() ever gets used depends on what you pass in as the argument: a HashSet argument would use it, another List would not.
As for
I will have to come up with an efficient technique to get a hashCode using all of them
You don't need to use all of the fields used by equals in your hashCode implementation:
It is not required that if two objects are unequal according to the equals method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
Therefore, your hashCode implementation could be very simple and still obey the contract:
public int hashCode() {
    return 1;
}
This will ensure that hash-based data structures still work (albeit at degraded performance). If you add logging to your hashCode implementation, you can even check whether it is ever called.
I think a simple way to test whether hashCode() is being used anywhere is to override hashCode() in your class, make it print a statement to the console (or a file if you prefer) and then return some value (it won't matter, since you said you don't want to use any hash-based classes anyway).
However, I think the best option would be to just override it; I'm sure some IDEs can even do it for you (Eclipse can, for example). If you never expect it to get called, it can't hurt.
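A sketch of that diagnostic (note it deliberately avoids printing this, because the default toString() itself calls hashCode()):
@Override
public int hashCode() {
    // Diagnostic: if this ever prints, something *is* hashing these objects.
    // (Not printing `this`: the default toString() calls hashCode() and would recurse.)
    System.err.println("hashCode() called on a " + getClass().getSimpleName());
    return 1; // a constant is legal per the contract, just slow in hash structures
}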
