cache key generation - java

I'm using ehcache (via the Grails plugin). The method that adds objects to the cache requires the keys to be serializable, so a typical usage would be:
def key = 22
def someObject = new Object();
cacheService.cache(key, true, someObject)
(The boolean param indicates whether the object should be added to a distributed or local cache)
My question is how I should go about generating keys from value objects such as:
class Person implements Serializable {
String firstName
String lastName
Integer age
}
One approach would be to provide hashCode() and equals() methods and use the hashCode as the key. In this case, I wouldn't need to make the Person class implement Serializable.
Alternatively, I could simply use the Person object itself as the key. It seems like I would still need to provide the equals and hashCode methods but would also need to implement Serializable. However, there would seem to be a smaller chance of collisions using this approach because a Person can only be equal to another instance of Person.
I'm assuming that ehcache uses the equals() method of a key to determine whether that key already exists in the cache. Is this assumption correct?
Is either of the approaches outlined above intrinsically better than the other, or is there another approach that I haven't considered?
Thanks,
Don

Your hash-key question is mostly orthogonal to the Serializable question. For the hash key, I'd use the Apache Commons HashCodeBuilder; it does all of the heavy lifting for you. Similarly with equals, use the EqualsBuilder.
Remember though, hashcodes need to stay the same over the lifespan of the object, so only hash those internal elements that won't change.
I'd avoid using the Person object as the key, as that will call its equals() for key comparisons, which is likely slower than comparing an integer hash code.
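For illustration, here is a rough Java sketch of the Person class from the question using those builders (it assumes commons-lang3 is on the classpath; older Grails setups may ship the org.apache.commons.lang package instead, and the fields are made final here so the hash code stays stable, as noted above):

import java.io.Serializable;
import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;

public class Person implements Serializable {

    private final String firstName;
    private final String lastName;
    private final Integer age;

    public Person(String firstName, String lastName, Integer age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        return new EqualsBuilder()
                .append(firstName, other.firstName)
                .append(lastName, other.lastName)
                .append(age, other.age)
                .isEquals();
    }

    @Override
    public int hashCode() {
        // 17 and 37 are arbitrary non-zero odd numbers
        return new HashCodeBuilder(17, 37)
                .append(firstName)
                .append(lastName)
                .append(age)
                .toHashCode();
    }
}

With this in place, either the hashCode() value or the Person instance itself could serve as the cache key, as discussed in the question.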


Do you need to override hashCode() and equals() for records?

Assuming the following example:
public record SomeRecord(int foo, byte bar, long baz)
{ }
Do I need to override hashCode and equals if I were to add said object to a HashMap?
No, you do not need to define your own hashCode and equals. You may do so if you wish to override the default implementation.
See section 8.10.3 of the specification for details: https://docs.oracle.com/javase/specs/jls/se14/preview/specs/records-jls.html#jls-8.10
Note, specifically, the caveat on implementing your own version of these:
All the members inherited from java.lang.Record. Unless explicitly
overridden in the record body, R has implicitly declared methods that
override the equals, hashCode and toString methods from
java.lang.Record.
Should any of these methods from java.lang.Record be explicitly
declared in the record body, the implementations should satisfy the
expected semantics as specified in java.lang.Record.
In particular, a custom equals implementation must satisfy the expected semantic that a copy of a record must equal the record. This is not generally true for classes (e.g. two Car objects might be equal if their VIN values are the same even if their owner fields differ) but must be true for records. This restriction means that there is rarely any reason to override equals.
The answer to whether you need it or not would really be: it depends on the implementation of the entity you decide to create as a Record. There are no restrictions at compile time or runtime to constrain you from doing so either, and that has always been the case for classes extending Object anyway.
On the other hand, one of the primary motivations for the proposal has been the "low-value, repetitive, error-prone code: constructors, accessors, equals(), hashCode(), toString(), etc." that a data carrier implies quite often in today's Java programming. Hence the decision, as stated further in the proposal, was to prefer semantic goals:
...modeling data as data. (If the semantics are right, the boilerplate will take care of itself.) It should be easy, clear, and concise to declare shallowly-immutable, well-behaved nominal data aggregates.
So the boilerplate has been taken care of. But do note that you might still, for some reason, want one of your record components not to be treated as part of the comparison between two objects, and that is where you would override the default implementation of equals and hashCode (a sketch follows at the end of this answer). Also, a fancier toString is sometimes desired, and then you would override that as well.
The above mostly cannot be categorized as a compile-time or runtime failure, but the proposal itself spells out the risk that comes with it:
Any of the members that are automatically derived from the state
description can also be declared explicitly. However, carelessly
implementing accessors or equals/hashCode risks undermining the
semantic invariants of records.
(Note: the latter is mostly my opinion; consumers tend to want all sorts of flexibility so that they can use the latest features while things keep behaving the way the existing implementation did. Backward compatibility matters to a great extent during upgrades as well.)
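As a hedged sketch of the override scenario mentioned above (the record and its components are made up for illustration), a record can declare its own equals/hashCode that consider only some components, as long as the contract from java.lang.Record is respected:

// Hypothetical record: equality is deliberately based on the id only,
// ignoring the displayName component.
record UserRef(long id, String displayName) {

    @Override
    public boolean equals(Object o) {
        return o instanceof UserRef other && id == other.id;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(id);
    }

    @Override
    public String toString() {
        return "UserRef[" + id + ", " + displayName + "]";
    }
}

Note that this still satisfies "a copy of a record must equal the record", since a copy has the same id.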
What Is a Java Record? One of the most common complaints about Java is that you need to write a lot of code for a class to be
useful. Quite often you need to write the following:
toString()
hashCode()
equals()
Getter methods
A public constructor
For simple domain classes, these methods are usually boring,
repetitive, and the kind of thing that could easily be generated
mechanically (and IDEs often provide this capability), but as of now,
the language itself doesn’t provide any way to do this.
The goal of records is to extend the Java language syntax and create a
way to say that a class is “the fields, just the fields, and nothing
but the fields.” By you making that statement about a class, the
compiler can help by creating all the methods automatically and having
all the fields participate in methods such as hashCode().
Records come with a default implementation of hashCode(), equals() and toString() covering all attributes inside the record.
The default implementation of hashCode()
The record will use the hash codes of all attributes inside the record.
The default implementation of equals()
The record will use all attributes to decide whether two records are equal or not.
So any hash-based implementation, e.g. HashSet, LinkedHashSet, HashMap, LinkedHashMap,
etc., will use hashCode() and, in case of a collision, will use equals().
Default implementation or a custom one?
Do you need to override hashCode() and equals() for records? It's up to you to keep the default implementation or to select only some attributes for that. If you want to use all attributes in hashCode() and equals(), then don't override anything; but if you want custom behaviour, you can override them to decide which attributes determine equality and which attributes make up the hashCode.
How is the hashCode calculated in the default implementation?
For the Record(1, "a") in the example below, the result matches combining the hash codes of the int and the String like this:
int hashCode = 1 * 31;
hashCode = (hashCode + "a".hashCode()) & 0x7fffffff;
The code below uses the default implementations of hashCode(), equals() and toString(). The HashSet only keeps one of the two records, because the two records have the same hashCode and are equal:
import java.util.HashSet;
import java.util.Set;

public class RecordHashDemo {

    static record Record(int id, String name) {
    }

    public static void main(String[] args) {
        Record r1 = new Record(1, "a");
        Record r2 = new Record(1, "a");

        Set<Record> set = new HashSet<>();
        set.add(r1);
        set.add(r2); // not added: equal to r1 and same hashCode

        System.out.println(set);
        System.out.println("Hashcode for record1: " + r1.hashCode());
        System.out.println("Hashcode for record2: " + r2.hashCode());

        int hashCode = 1 * 31;
        hashCode = (hashCode + "a".hashCode()) & 0x7fffffff;
        System.out.println("The hashCode: " + hashCode);
    }
}
Output:
[Record[id=1, name=a]]
Hashcode for record1: 128
Hashcode for record2: 128
The hashCode: 128
Resources: Oracle blogs

Should I override hashCode() of Collections?

Given that I have some class with various fields in it:
class MyClass {
    private String s;
    private MySecondClass c;
    private Collection<someInterface> coll;
    // ...

    @Override
    public int hashCode() {
        // ????
    }
}
and I have various objects of that class which I'd like to store in a HashMap. For that, I need the hashCode() of MyClass.
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
I OPENED A SECOND QUESTION regarding the actual problem of uniquely IDing an object here: How do I generate an (almost) unique hash ID for objects?
Clarification:
is there anything more or less unique in your class? The String s? Then only use that as the hashcode.
MyClass hashCode() of two objects should definitely differ if any of the values in the coll of one of the objects is changed. hashCode should only return the same value if all fields of two objects store the same values, recursively. Basically, there is some time-consuming calculation going on on a MyClass object. I want to spare this time if the calculation has already been done with the exact same values some time ago. For this purpose, I'd like to look up in a HashMap whether the result is available already.
Would you be using MyClass in a HashMap as the key or as the value? If the key, you have to override both equals() and hashCode()
Thus, I'm using the hashCode OF MyClass as the key in a HashMap. The value (calculation result) will be something different, like an Integer (simplified).
What do you think equality should mean for multiple collections? Should it depend on element ordering? Should it only depend on the absolute elements that are present?
Wouldn't that depend on the kind of Collection that is stored in coll? Though I guess ordering not really important, no
The response you get from this site is gorgeous. Thank you all
@AlexWien that depends on whether that collection's items are part of the class's definition of equivalence or not.
Yes, yes they are.
I'll have to go into all fields and respective parent classes recursively to make sure they all implement hashCode() properly, because otherwise hashCode() of MyClass might not take into consideration some values. Is this right?
That's correct. It's not as onerous as it sounds because the rule of thumb is that you only need to override hashCode() if you override equals(). You don't have to worry about classes that use the default equals(); the default hashCode() will suffice for them.
Also, for your class, you only need to hash the fields that you compare in your equals() method. If one of those fields is a unique identifier, for instance, you could get away with just checking that field in equals() and hashing it in hashCode().
All of this is predicated upon you also overriding equals(). If you haven't overridden that, don't bother with hashCode() either.
What do I do with that Collection? Can I always rely on its hashCode() method? Will it take into consideration all child values that might exist in my someInterface object?
Yes, you can rely on any collection type in the Java standard library to implement hashCode() correctly. And yes, any List or Set will take into account its contents (it will mix together the items' hash codes).
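A throwaway snippet (not part of the question's code) that demonstrates the point:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectionHashDemo {
    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("x", "y"));
        List<String> b = Arrays.asList("x", "y");

        // List.hashCode() and List.equals() are defined by the List contract,
        // so different implementations with equal contents agree.
        System.out.println(a.hashCode() == b.hashCode()); // true
        System.out.println(a.equals(b));                  // true
    }
}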
So you want to do a calculation on the contents of your object that will give you a unique key you'll be able to check in a HashMap whether the "heavy" calculation that you don't want to do twice has already been done for a given deep combination of fields.
Using hashCode alone:
I believe hashCode is not the appropriate thing to use in the scenario you are describing.
hashCode should always be used in association with equals(). It's part of its contract, and it's an important part, because hashCode() returns an integer, and although one may try to make hashCode() as well-distributed as possible, it is not going to be unique for every possible object of the same class, except for very specific cases (It's easy for Integer, Byte and Character, for example...).
If you want to see for yourself, try generating strings of up to 4 letters (lower and upper case), and see how many of them have identical hash codes.
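For instance, the classic pair "Aa" and "BB" already collide, and a brute-force count over short strings (a throwaway sketch) shows many more:

import java.util.HashMap;
import java.util.Map;

public class StringCollisionDemo {
    public static void main(String[] args) {
        // Two different strings, one hash code.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        // Count hash values shared by more than one 2-letter string (a-z, A-Z).
        String letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        Map<Integer, Integer> counts = new HashMap<>();
        for (char c1 : letters.toCharArray()) {
            for (char c2 : letters.toCharArray()) {
                counts.merge(("" + c1 + c2).hashCode(), 1, Integer::sum);
            }
        }
        long shared = counts.values().stream().filter(n -> n > 1).count();
        System.out.println("hash values shared by more than one string: " + shared);
    }
}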
HashMap therefore uses both the hashCode() and equals() method when it looks for things in the hash table. There will be elements that have the same hashCode() and you can only tell if it's the same element or not by testing all of them using equals() against your class.
Using hashCode and equals together
In this approach, you use the object itself as the key in the hash map, and give it an appropriate equals method.
To implement the equals method you need to go deeply into all your fields. All of their classes must have an equals() that matches what you think of as equal for the sake of your big calculation. Special care needs to be taken when your objects implement an interface. If the calculation is based on calls to that interface, and different objects that implement the interface return the same value in those calls, then they should implement equals in a way that reflects that.
And their hashCode is supposed to match the equals - when the values are equal, the hashCode must be equal.
You then build your equals and hashCode based on all those items. You may use Objects.equals(Object, Object) and Objects.hash(Object...) to save yourself a lot of boilerplate code.
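A minimal sketch of what that could look like for the MyClass from the question (assuming MySecondClass and the concrete collection type implement equals()/hashCode() sensibly, as discussed above):

import java.util.Collection;
import java.util.Objects;

class MyClass {
    private String s;
    private MySecondClass c;
    private Collection<someInterface> coll;

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        MyClass other = (MyClass) o;
        // relies on List/Set equals(); a plain Collection type may not define equality by contents
        return Objects.equals(s, other.s)
                && Objects.equals(c, other.c)
                && Objects.equals(coll, other.coll);
    }

    @Override
    public int hashCode() {
        return Objects.hash(s, c, coll);
    }
}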
But is this a good approach?
While you can cache the result of hashCode() in the object and re-use it without calculation as long as you don't mutate it, you can't do that for equals. This means that calculation of equals is going to be lengthy.
So depending on how many times the equals() method is going to be called for each object, this is going to be exacerbated.
If, for example, you are going to have 30 objects in the hashMap, but 300,000 objects are going to come along and be compared to them only to realize that they are equal to them, you'll be making 300,000 heavy comparisons.
If you're only going to have very few instances in which an object is going to have the same hashCode or fall in the same bucket in the HashMap, requiring comparison, then going the equals() way may work well.
If you decide to go this way, you'll need to remember:
If the object is a key in a HashMap, it should not be mutated as long as it's there. If you need to mutate it, you may need to make a deep copy of it and keep the copy in the hash map. Deep copying again requires consideration of all the objects and interfaces inside to see if they are copyable at all.
Creating a unique key for each object
Back to your original idea, we have established that hashCode is not a good candidate for a key in a hash map. A better candidate for that would be a hash function such as md5 or sha1 (or more advanced hashes, like sha256, but you don't need cryptographic strength in your case), where collisions are a lot rarer than a mere int. You could take all the values in your class, transform them into a byte array, hash it with such a hash function, and take its hexadecimal string value as your map key.
Naturally, this is not a trivial calculation. So you need to think if it's really saving you much time over the calculation you are trying to avoid. It is probably going to be faster than repeatedly calling equals() to compare objects, as you do it only once per instance, with the values it had at the time of the "big calculation".
For a given instance, you could cache the result and not calculate it again unless you mutate the object. Or you could just calculate it again only just before doing the "big calculation".
However, you'll need the "cooperation" of all the objects you have inside your class. That is, they will all need to be reasonably convertible into a byte array in such a way that two equivalent objects produce the same bytes (including the same issue with the interface objects that I mentioned above).
You should also beware of situations in which you have, for example, two strings "AB" and "CD" which will give you the same result as "A" and "BCD", and then you'll end up with the same hash for two different objects.
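A rough sketch of that idea (the DeepKey name is made up; writing a length prefix before each field avoids the "AB"/"CD" ambiguity just mentioned, and turning each field into a deterministic string is the "cooperation" the objects have to provide):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class DeepKey {

    // Builds a hex SHA-256 key from an ordered list of field values.
    public static String of(Object... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Object field : fields) {
                String v = String.valueOf(field);
                // length prefix so ("AB", "CD") and ("A", "BCD") hash differently
                md.update((v.length() + ":").getBytes(StandardCharsets.UTF_8));
                md.update(v.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}

The result cache could then be a HashMap<String, Integer> keyed by something like DeepKey.of(s, c, coll), recomputed whenever the object's fields change.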
For future readers.
Yes, equals and hashCode go hand in hand.
Below is a typical implementation using a helper library, and it really shows the "hand in hand" nature. The helper library from Apache keeps things simpler, IMHO:
@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (o == null || getClass() != o.getClass()) {
        return false;
    }
    MyCustomObject castInput = (MyCustomObject) o;
    return new org.apache.commons.lang3.builder.EqualsBuilder()
            .append(this.getPropertyOne(), castInput.getPropertyOne())
            .append(this.getPropertyTwo(), castInput.getPropertyTwo())
            .append(this.getPropertyThree(), castInput.getPropertyThree())
            .append(this.getPropertyN(), castInput.getPropertyN())
            .isEquals();
}

@Override
public int hashCode() {
    return new org.apache.commons.lang3.builder.HashCodeBuilder(17, 37)
            .append(this.getPropertyOne())
            .append(this.getPropertyTwo())
            .append(this.getPropertyThree())
            .append(this.getPropertyN())
            .toHashCode();
}
17, 37: you can pick your own values there (they just need to be non-zero odd numbers).
From your clarifications:
You want to store MyClass in a HashMap as the key.
This means the hashCode() is not allowed to change after adding the object.
So if your collections may change after object instantiation, they should not be part of the hashcode().
From http://docs.oracle.com/javase/8/docs/api/java/util/Map.html
Note: great care must be exercised if mutable objects are used as map
keys. The behavior of a map is not specified if the value of an object
is changed in a manner that affects equals comparisons while the
object is a key in the map.
For 20-100 objects it is not worth taking the risk of an inconsistent hashCode() or equals() implementation.
There is no need to override hashCode() and equals() in your case.
If you don't override them, Java uses the unique object identity for equals() and hashCode() (and that works, especially because you stated that you don't need an equals() considering the values of the object fields).
When using the default implementation, you are on the safe side.
Making an error, like using a custom hashCode() as key in the HashMap when the hash code changes after insertion (because you used the hashCode() of the collections as part of your object hash code), may result in an extremely hard-to-find bug.
If you need to find out whether the heavy calculation is finished, I would not abuse equals(). Just write your own method objectStateValue() and call hashCode() on the collection, too. This then does not interfere with the object's hashCode and equals().
public int objectStateValue() {
    // TODO make sure the fields are not null
    return 31 * s.hashCode() + coll.hashCode();
}
Another, simpler possibility: the code that does the time-consuming calculation can increase a calculationCounter by one as soon as the calculation is ready. You then just check whether or not the counter has changed. This is much cheaper and simpler.
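A tiny sketch of that counter idea (all names here are made up, and a single calculating thread is assumed):

class HeavyCalculation {
    private volatile long calculationCounter;   // bumped by the calculating thread each time a run finishes

    void runHeavyCalculation() {
        // ... the time-consuming work on the object's current values ...
        calculationCounter++;                    // a fresh result is now available
    }

    long counter() {
        return calculationCounter;
    }
    // A caller remembers counter() and later compares it to see whether a new result has arrived.
}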

Best practices on what should be key in a hashtable

The best look-up structure is a HashTable. It provides constant access on average (linear in worst case).
This depends on the hash function. Ok.
My question is the following. Assuming a good implementation of a hash table, e.g. HashMap, is there a best practice concerning the keys passed into the map? I mean, it is recommended that the key must be an immutable object, but I was wondering if there are other recommendations.
Example the size of the key? For example in a good hashmap (in the way described above) if we used String as keys, won't the "bottleneck" be in the string comparison for equals (trying to find the key)? So should the keys be kept small? Or are there objects that should not be used as keys? E.g. a URL? In such cases how can you choose what to use as a key?
The best-performing key for a HashMap is probably an Integer, where hashCode() and equals() are implemented as:
public int hashCode() {
    return value;
}

public boolean equals(Object obj) {
    if (obj instanceof Integer) {
        return value == ((Integer) obj).intValue();
    }
    return false;
}
That said, the purpose of a HashMap is to map keys to values. The fact that a hash function is used to locate the entries is just what provides fast, constant-time access.
it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
The recommendation is to map objects to what you need: don't think about what is faster; think about what best lets your business logic address the objects you want to retrieve.
The important requirement is that the key object must be immutable, because if you change the key object after storing it in the Map it may not be possible to retrieve the associated value later.
The key word in HashMap is Map. Your object should just map. If you sacrifice the mapping task to optimize the key, you are defeating the purpose of the Map, probably without achieving any performance boost.
I 100% agree with the first two comments in your question:
the major constraint is that it has to be the thing that you want to base the lookup on ;)
– Oli Charlesworth
The general rule is to use as the key whatever you need to look up with.
– Louis Wasserman
Remember the two rules for optimization:
Don't.
(for experts only) don't yet.
The third rule is: profile before you optimize.
You should use whatever key you want to use to lookup things in the data structure, it's typically a domain-specific constraint. With that said, keep in mind that both hashCode() and equals() will be used in finding a key in the table.
hashCode() is used to find the position of the key, while equals() is used to determine if the key you are searching for is actually the key that we just found using hashCode().
For example, consider two keys a and b that have the same hash code in a table using separate chaining. Then a search for a would require testing if a.equals(key) for potentially both a and b in the table once we find the index of the list containing a and b from hashCode().
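To make that concrete, here is a small sketch with a deliberately bad key class whose instances all share one hash code; the map can still tell them apart, but only by calling equals() inside the bucket:

import java.util.HashMap;
import java.util.Map;

public class CollidingKeyDemo {

    // Deliberately terrible key: every instance lands in the same bucket.
    static final class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }

        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && name.equals(((BadKey) o).name);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        map.put(new BadKey("a"), "first");
        map.put(new BadKey("b"), "second");

        // Both keys hash to 42, so lookups must fall back on equals().
        System.out.println(map.get(new BadKey("a"))); // first
        System.out.println(map.get(new BadKey("b"))); // second
    }
}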
it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
The field of the value object that you use as the key should be final.
Most times a field of the object is used as key. If that field changes then the map cannot find it:
void foo(Employee e) {
    map.put(e.getId(), e);
    String newId = e.getId() + "new";
    e.setId(newId);
    Employee e2 = map.get(newId);
    // e2 == null: the map still stores e under the old id, so it cannot be found by its new id
}
So Employee should not have a setId() method at all, but that is difficult because when you are writing Employee you don't know what it will be keyed by.
I dug into the implementation. I had assumed that the effectiveness of the hashCode() method would be the key factor.
When I looked into the HashMap and Hashtable implementations, I found they are quite similar (with one exception). Both compute and store an internal hash code for every entry, which is a good sign that hashCode() does not influence performance too heavily.
Both have a number of buckets where the values are stored. The important thing is the balance between the number of buckets (say n) and the average number of keys within a bucket (say k). The bucket is found in O(1) time, the content of the bucket is iterated in O(k) time, but the more buckets we have, the more memory is allocated. Also, if many buckets are empty, it means that the hashCode() method of the key class does not spread the hash codes widely enough.
The algorithm works like this:
Take the `hashCode()` of the key (and apply a slight bijective transformation to it)
Find the appropriate bucket
Loop through the content of the bucket (which is some kind of linked list)
Compare the keys as follows:
1. Compare the hash codes (calculated in the first step and stored for the entry)
2. Check whether key `==` the stored key (still no method call; this step is missing from Hashtable)
3. Compare the keys by `key.equals(storedKey)`
(A simplified sketch of this comparison follows below.)
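A simplified sketch of that per-entry comparison (an illustrative paraphrase, not the actual OpenJDK source):

class BucketCheck {
    // Roughly the test applied to each entry in the bucket:
    static boolean matches(int queryHash, Object queryKey, int storedHash, Object storedKey) {
        return storedHash == queryHash                                       // 1. cheap int comparison first
                && (storedKey == queryKey                                    // 2. reference check, no method call
                    || (queryKey != null && queryKey.equals(storedKey)));    // 3. equals() only when needed
    }
}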
To summarize:
hashCode() is called once per call (this is a must, you cannot do
without it)
equals() is called if the hashCode is not so well spread, and two keys happen to have the same hashcode
The same algorithm applies to get() and put() (because in the put() case you may be setting the value for an existing key). So, the most important thing is how the hashCode() method is implemented; it is the most frequently called method.
Two strategies are: make it fast and make it effective (well spread). The JDK developers made efforts to achieve both, but it's not always possible to have both.
Numeric types are good
Object (and classes that don't override hashCode()) are good (hashCode() is native), except that you cannot specify your own equals()
String is not good: it iterates through the characters, but it caches the result after that (see my comment below)
Any class with a synchronized hashCode() is not good
Any class whose hashCode() iterates over its contents is not good
Classes that cache their hash code are a bit better (depends on the usage)
Comment on the String: To make it fast, in the first versions of JDK the String hash code calculation was made for the first 32 characters only. But the hashcode it produced was not well spread, so they decided to take all the characters into the hashcode.

java hashing objects

I'd like to be able to determine whether I've encountered an object before. I have a graph implementation and I want to see if I've created a cycle, probably by iterating through the Node objects with Floyd's tortoise/hare algorithm.
But I want to avoid a linear search through my list of "seen" nodes each time. This would be great if I had a hash table for just keys. Can I somehow hash an object? Aren't java objects just references to places in memory anyway? I wonder how much of a problem collisions would be if so..
The simple answer is to create a HashSet and add each node to the set the first time you encounter it.
The only case where this won't work is if you've overridden hashCode() and equals(Object) for the node class to implement equality based on node contents (or whatever). Then you'll need to either:
use the IdentityHashMap class, which uses == and System.identityHashCode rather than equals(Object) and hashCode(), or
build a hashtable yourself using your own flavour of object identity. (A sketch of the identity-set approach follows below.)
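A minimal sketch of the "seen set" idea combined with the identity-based option (the Node type and its next field are placeholders for your graph implementation; this version just follows a single chain of next pointers):

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class CycleCheck {

    // Placeholder node type; substitute your graph's node class.
    static class Node {
        Node next;
    }

    // Reports whether following next pointers ever revisits a node.
    static boolean hasCycle(Node start) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Node n = start; n != null; n = n.next) {
            if (!seen.add(n)) {   // add() returns false if this exact node was seen before
                return true;
            }
        }
        return false;
    }
}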
Aren't java objects just references to places in memory anyway?
Yes and no. Yes, the reference is represented by a memory address (on most JVMs). The problem is that 1) you can't get hold of the address, and 2) it can change when the GC relocates the object. This means that you can't use the object address as a hashcode.
The identityHashCode method deals with this by returning a value that is initially based on the memory address. If you then call identityHashCode again for the same object, you are guaranteed to get the same value as before ... even if the object has been relocated.
I wonder how much of a problem collisions would be if so..
The hash values produced by the identityHashCode method can collide. (That is, two distinct objects can have the same identity hashcode value.) Anything that uses these values has to deal with this. (The standard HashSet and IdentityHashMap classes take care of these collisions ... if you chose to use them.)
I'd like to be able to determine whether I've encountered an object
before
Use an IdentityHashMap. It is ideal for your job since it compares with == rather than equals().
Take a look at HashSet. Note that in order for objects to work with HashSet, they need to provide correct implementations of hashCode and equals methods of the java.lang.Object class.
You'll need to implement a hash function for your objects. This is done by overriding hashCode() defined in java.lang.Object. This method is used by HashMap, HashSet etc to store objects. In hashCode() it's up to you to calculate a hash for the object. Don't forget to also implement the equals()-method!
Take a look at Java collection framework (http://docs.oracle.com/javase/tutorial/collections/)

object reference set in java

I need to create a Set of objects. The concern is I do not want to base the hashing or the equality on the objects' hashCode and equals implementation. Instead, I want the hash code and equality to be based only on each object's reference identity (i.e.: the value of the reference pointer).
I'm not sure how to do this in Java.
The reasoning behind this is my objects do not reliably implement equals or hashCode, and in this case reference identity is good enough.
I guess that java.util.IdentityHashMap is what you're looking for (note, there's no IdentityHashSet). Look up the API documentation:
This class implements the Map interface with a hash table, using reference-equality in place of object-equality when comparing keys (and values). In other words, in an IdentityHashMap, two keys k1 and k2 are considered equal if and only if (k1==k2). (In normal Map implementations (like HashMap) two keys k1 and k2 are considered equal if and only if (k1==null ? k2==null : k1.equals(k2)).)
This class is not a general-purpose Map implementation! While this class implements the Map interface, it intentionally violates Map's general contract, which mandates the use of the equals method when comparing objects. This class is designed for use only in the rare cases wherein reference-equality semantics are required.
edit: See Joachim Sauer's comment below, it's really easy to make a Set based on a certain Map. You'd need to do something like this:
Set<E> mySet = Collections.newSetFromMap(new IdentityHashMap<E, Boolean>());
You could wrap your objects into a wrapper class which could then implement hashCode and equals based simply on the object's identity.
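A sketch of such a wrapper (the class name is made up):

import java.util.Objects;

final class IdentityWrapper<T> {
    private final T value;

    IdentityWrapper(T value) {
        this.value = Objects.requireNonNull(value);
    }

    T get() {
        return value;
    }

    @Override
    public boolean equals(Object o) {
        // equal only if the wrapped objects are the very same instance
        return o instanceof IdentityWrapper && ((IdentityWrapper<?>) o).value == value;
    }

    @Override
    public int hashCode() {
        return System.identityHashCode(value);
    }
}

A plain HashSet of these wrappers then behaves like an identity set.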
You can extend HashSet (or actually, AbstractSet), and back it with IdentityHashMap, which uses System.identityHashCode(object) instead of obj.hashCode().
You can simply google for IdentityHashSet, there are some implementations already. Or use Collections.newSetFromMap(..) as suggested by Joachim Sauer.
This of course should be done only if you are not in "possession" of your objects' classes. Otherwise just fix their hashCode()
