From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code. I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
The reasoning is that I'm trying to optimize a part of a code that, currently, is just a while loop that finds the first object with that hashcode and stores/removes it. This would be a lot faster if I could just return the full list in one go.
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
The keys are Chunk objects and the values are Strings. Here are the hashcode() and equals() functions for the Chunk class:
/**
* Returns the hash code of the string representation of the ArrayList
* of words, thereby storing chunks with the same words but with different
* locations and next words in the same hash bucket, triggering the
* use of equals() when searching and adding
*/
public int hashCode() {
return (Words.toString()).hashCode();
}
@Override
/**
* result depends on the value of location. A location of -1 is obviously
* not valid and therefore indicates that we are searching for a match rather
* than adding to the map. This allows multiples of keys with matching hashcodes
* to be considered unequal when adding to the hashmap but equal when searching
* it, which is integral to the MakeMap() and GetOptions() methods of the
* RandomTextGenerator class.
*
*/
public boolean equals(Object obj) {
Chunk tempChunk = (Chunk)obj;
if (LocationInText == -1 && Words.size() == tempChunk.GetText().size())
{
for (int i = 0; i < Words.size(); i++) {
if (!Words.get(i).equals(tempChunk.GetText().get(i))) {
return false;
}
}
return true;
}
else {
if (tempChunk.GetLocation() == LocationInText) {
return true;
}
return false;
}
}
Thanks!
HashMap does not expose any way to do this, but I think you're misunderstanding how HashMap works in the first place.
The first thing you need to know is that if every single object had exactly the same hash code, HashMap would still work. It would never "mix up" keys. If you call get(key), it will only return the value associated with key.
The reason this works is that HashMap only uses hashCode as a first grouping, but then it checks the object you passed to get against the keys stored in the map using the .equals method.
There is no way, from the outside, to tell that HashMap uses linked lists. (In fact, in more recent versions of Java, it doesn't always use linked lists.) The implementation doesn't provide any way to look at hash codes, to find out how hash codes are grouped, or anything along those lines.
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
This code does not "find the first object with that hashcode and store/remove it." It finds the one and only object equal to toSearch according to .equals, stores and removes it. (There can only be one such object in a Map.)
Your while loop isn't really looping. If WorkingMap is a plain Java HashMap, it runs at most once: .get(key) returns the single value most recently stored under that key, so once the entry matching toSearch has been removed there is nothing left to find.
There are a lot of open questions here, but if this is what you need and the rest of your code depends on it, read on.
What type is Possibles? An ArrayList?
// this should do the same as your while loop
if(WorkingMap.containsKey(toSearch)) {
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
// further: extend your Possibles so it holds the LinkedList you want.
public class possibilities {
// List<LinkedList<String>> container = new ArrayList<LinkedList<String>>();
public Map<Chunk, LinkedList<String>> container2 = new HashMap<Chunk, LinkedList<String>>();
public void put(Chunk key, String value) {
if(!this.container2.containsKey(key)) {
this.container2.put(key, new LinkedList<String>());
}
this.container2.get(key).add(value);
}
}
// this one works with updated Possibles
if(WorkingMap.containsKey(toSearch)) {
Possibles.put(toSearch, WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
//---
However, yes, it can work like that, but keys should not be complex objects.
Note: LinkedLists take memory, and how big are the chunks? Check your memory usage.
Possibles.(get)container2.keySet();
Good luck
Sail
From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code.
Yes, but it's more complicated than that. It often needs to put objects in the same linked list even when they have differing hash codes, since it only uses some bits of the hash codes to choose which bucket to store objects in; the number of bits it uses depends on the current size of the internal hash table, which approximately depends on the number of things in the map. And when a bucket holds many entries, recent Java versions (8 and later) switch it from a linked list to a balanced binary tree, similar to a TreeMap; this works best when the keys are mutually Comparable.
Anyway.....
I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
No.
A HashMap compares keys for equality according to the equals method. Equality according to the equals method is the only valid way to set, replace, or retrieve values associated with a particular key.
Yes, it also uses hashCode as a way to arrange objects in a structure that allows for far faster location of potentially equal objects. Still, the contract for matching keys is defined in terms of equals, not hashCode.
Note that it is perfectly legal for every hashCode method to be implemented as return 0; and the map will still work just as correctly (but very slowly). So any idea that involves getting a list of objects sharing a hash code is either impossible or pointless or both.
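To make the return 0; point concrete, here is a minimal sketch. The Key class and its name field are made up for illustration; only the HashMap calls come from the JDK. Every key lands in the same bucket, yet lookups still return the right value because equals decides the match.

```java
import java.util.HashMap;
import java.util.Map;

class ConstantHashDemo {
    // Hypothetical key type: every instance reports the same hash code.
    static final class Key {
        private final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return 0; // legal, but forces every key into one bucket
        }
    }

    static Map<Key, String> build() {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second");
        map.put(new Key("a"), "replaced"); // equal key: the value is replaced
        return map;
    }

    public static void main(String[] args) {
        Map<Key, String> map = build();
        System.out.println(map.size());            // 2
        System.out.println(map.get(new Key("a"))); // replaced
        System.out.println(map.get(new Key("b"))); // second
    }
}
```

The map never hands back "all objects with hash code 0"; it only ever answers questions phrased in terms of equals.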
I'm not 100% sure what you're doing in your equals method with the LocationInText variable, but it looks dangerous, as it violates the contract of the equals method. It is required that the equals method be symmetric, transitive, and consistent:
Symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
Transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
Consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
And the hashCode method is required to always agree with equals about equal objects:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
The LocationInText variable is playing havoc with those rules, and may well break things. If not today, then some day. Get rid of it!
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
Something that jumps out at me is that you only need to do the key lookup once, instead of doing it three times, since Map.remove returns the removed value or null if the key is not present:
for (;;) {
String s = WorkingMap.remove(toSearch);
if (s == null) break;
Occurences++;
Possibles.add(s);
}
Either way, the loop is still faulty, since it is supposed to be impossible for a map to contain more than one key equal to toSearch. I can't overstate that the LocationInText variable as you're using it is not a good idea.
I agree with the other commenters that it looks like you're looking for a map-of-list structure. Some Java libraries like Guava offer a Multimap for this, but you can do it manually pretty easily. I think the declaration you want is:
Map<Chunk,List<String>> map = new HashMap<>();
To add a new chunk-string pair to the map, do:
void add(Chunk chunk, String string) {
map.computeIfAbsent(chunk, k -> new ArrayList<>()).add(string);
}
That method puts a new ArrayList in the map if the chunk is new, or fetches the existing ArrayList if there is one for that chunk. Then it adds the string to the list that it fetched or created.
To retrieve the list of all strings for a particular chunk value is as simple as map.get(chunkToSearch), which you can add to your Possibles list as Possibles.addAll(map.get(chunkToSearch));. Check for null first, though, since get returns null when the chunk is not in the map.
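Putting the pieces together, a rough sketch of what could replace the original while loop might look like this. String keys stand in for Chunk, and the class and method names (ChunkIndexSketch, take) are invented for the example; it assumes the key type has an ordinary, contract-abiding equals and hashCode.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkIndexSketch {
    // String keys stand in for Chunk here; any key type with a consistent
    // equals/hashCode pair would work the same way.
    private final Map<String, List<String>> map = new HashMap<>();

    void add(String chunk, String next) {
        map.computeIfAbsent(chunk, k -> new ArrayList<>()).add(next);
    }

    // Removes and returns every value stored for the chunk in one lookup;
    // returns an empty list when the chunk is absent.
    List<String> take(String chunk) {
        List<String> values = map.remove(chunk);
        return values != null ? values : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        ChunkIndexSketch index = new ChunkIndexSketch();
        index.add("the cat", "sat");
        index.add("the cat", "ran");
        index.add("a dog", "barked");

        List<String> possibles = index.take("the cat");
        System.out.println(possibles);             // [sat, ran]
        System.out.println(possibles.size());      // 2 occurrences
        System.out.println(index.take("the cat")); // [] (already removed)
    }
}
```

The size of the returned list plays the role of the Occurences counter, so no loop is needed.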
Other potential optimizations I'd point out:
In your Chunk.hashCode method, consider caching the hash code instead of recomputing it every time the method is called. If Chunk is mutable (which is not a good idea for a map key, but vaguely allowed so long as you're careful) then recompute the hash code only after the Chunk's value has changed. Also, if Words is a List, which it seems to be, it would likely be faster to use its hash code than convert it to a string and use the string's hash code, but I'm not sure.
In your Chunk.equals method, you can return true immediately if the instances are the same (which they often will be). Also, if GetText returns a copy of the data, then don't call it; you can access the private Words list of the other Chunk since you are in the same class, and finally, you can just defer to the List.equals method:
@Override
public boolean equals(Object o) {
return (this == o) || (o instanceof Chunk && this.Words.equals(((Chunk)o).Words));
}
Simple! Fast!
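For the hash-caching suggestion, a minimal sketch could look like the following. It assumes the chunk is immutable, which is not how the original Chunk is written; the class name CachedHashChunk and its constructor are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CachedHashChunk {
    private final List<String> words;
    private final int hash; // computed once; safe because words never changes

    CachedHashChunk(List<String> words) {
        // Defensive copy so later changes to the caller's list cannot
        // invalidate the cached hash code.
        this.words = Collections.unmodifiableList(new ArrayList<>(words));
        this.hash = this.words.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return this == o
                || (o instanceof CachedHashChunk
                    && words.equals(((CachedHashChunk) o).words));
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

Using the List's own hashCode keeps the hash consistent with List.equals, so equal chunks always report equal hash codes.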
I recently ran across a problem on leetcode which I solved with a nested hashset. This is the problem, if you're interested: https://leetcode.com/problems/group-anagrams/.
My intuition was to add all of the letters of each word into a hashset, then put that hashset into another hashset. At each iteration, I would check if the hashset already existed, and if it did, add to the existing hashset.
Oddly enough, that seems to work. Why do 2 hashsets share the same hashcode if they are different objects? Would something like if(set1.hashCode() == set2.hashCode()) doStuff() be valid code?
This is expected. HashSet extends AbstractSet. The hashCode() method in AbstractSet says:
Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode.
This implementation iterates over the set, calling the hashCode method on each element in the set, and adding up the results.
Here's the code from AbstractSet:
public int hashCode() {
int h = 0;
Iterator<E> i = iterator();
while (i.hasNext()) {
E obj = i.next();
if (obj != null)
h += obj.hashCode();
}
return h;
}
Why do 2 hashsets share the same hashcode if they are different objects?
With HashSet, the hashCode is calculated using the contents of the set. Since it's just numeric addition, the order of addition doesn't matter – just add them all up. So it makes sense that you have two sets, each containing objects which are equivalent (and thus should have matching hashCode() values), and then the sum of hashCodes within each set is the same.
Would something like if(set1.hashCode() == set2.hashCode()) doStuff() be valid code?
Sure.
EDIT: The best way of comparing two sets for equality is to use equals(). In the case of AbstractSet, calling set1.equals(set2) would result in individual calls to equals() at the level of the objects within the set (as well as some other checks).
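A small sketch of both halves of this answer, with made-up class and method names: equal sets always share a hash code, but a shared hash code does not prove the sets are equal. The second part relies on the well-known fact that "Aa" and "BB" have the same String hash code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class SetHashDemo {
    // Recomputes the hash the way the AbstractSet documentation describes it.
    static int sumOfElementHashes(Set<?> set) {
        int h = 0;
        for (Object o : set) {
            if (o != null) {
                h += o.hashCode();
            }
        }
        return h;
    }

    public static void main(String[] args) {
        Set<Character> a = new HashSet<>(Arrays.asList('e', 'a', 't'));
        Set<Character> b = new HashSet<>(Arrays.asList('t', 'e', 'a'));
        System.out.println(a == b);                                // false
        System.out.println(a.equals(b));                           // true
        System.out.println(a.hashCode() == b.hashCode());          // true
        System.out.println(a.hashCode() == sumOfElementHashes(a)); // true

        // Same hash code, different contents: a collision.
        Set<String> x = Collections.singleton("Aa");
        Set<String> y = Collections.singleton("BB");
        System.out.println(x.hashCode() == y.hashCode());          // true
        System.out.println(x.equals(y));                           // false
    }
}
```

So if(set1.hashCode() == set2.hashCode()) compiles and runs, but only equals() tells you whether the two sets really hold the same elements.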
Why do two different HashSets with the same data have the same HashCode?
Actually this is needed to fulfill another need that is specified in Java.
Set overrides the equals method so that a.equals(b) returns true if:
a is of type Set and b is of type Set.
both a and b have exactly the same size.
a contains all elements of b.
b contains all elements of a.
Since the default equals (which compares only the memory reference to be the same) is overridden for Set, according to java guidelines the hashCode method has to be overridden as well. So, this custom implementation of hashCode is provided in order to match with the custom implementation of equals.
In order to see why it is necessary to override hashCode method when the equals method is overridden, you can take a look at this previous answer of mine.
Why do 2 hashsets share the same hashcode if they are different objects
Because as explained above this is needed so that Set can have the custom functionality for equals that it currently has.
If you want to just check if a and b are different instances of set you can still check this with operators == and !=.
a == b -> true means a and b point to the same instance of Set in memory
a != b -> true means a and b point to different instances of Set in memory
By default hashCode and equals work fine.
I have used objects with hash tables like HashMap, without overriding these methods, and it was fine. For example:
public class Main{
public static void main(String[] args) throws Exception{
Map map = new HashMap<>();
Object key = new Main();
map.put(key, "2");
Object key2 = new Main();
map.put(key2, "3");
System.out.println(map.get(key));
System.out.println(map.get(key2));
}
}
This code works fine. By default, hashCode returns the memory address of the object, and equals checks whether two references point to the same object. So what is the problem with using the default implementation of these methods?
Note this example from an old pdf I have:
This code
public class Name {
private String first, last;
public Name(String first, String last) { this.first = first; this.last = last;
}
public boolean equals(Object o) {
if (!(o instanceof Name)) return false;
Name n = (Name)o;
return n.first.equals(first) && n.last.equals(last);
}
public static void main(String[] args) {
Set s = new HashSet();
s.add(new Name("Donald", "Duck"));
System.out.println(
s.contains(new Name("Donald", "Duck")));
}
}
...will not always give the same result because as it is stated in the pdf
Donald is in the set, but the set can’t find him. The Name class
violates the hashCode contract
Because equals compares the two strings that make up the object, hashCode should be computed from those same two fields.
To fix this code we should add a hashCode method:
public int hashCode() {
return 31 * first.hashCode() + last.hashCode();
}
This question in the pdf ends saying that we should
override hashCode when overriding equals
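As a sketch of the corrected class, assuming nothing beyond what the example above shows, here is a version that overrides both methods. The name FixedName and the demo method are invented; Objects.hash is used in place of the hand-written 31 * ... formula, which is an equivalent design choice rather than a requirement.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class FixedName {
    private final String first;
    private final String last;

    FixedName(String first, String last) {
        this.first = first;
        this.last = last;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FixedName)) return false;
        FixedName n = (FixedName) o;
        return n.first.equals(first) && n.last.equals(last);
    }

    @Override
    public int hashCode() {
        // Built from the same fields that equals compares.
        return Objects.hash(first, last);
    }

    // Returns true: the set can now find an equal-but-distinct instance.
    static boolean demo() {
        Set<FixedName> s = new HashSet<>();
        s.add(new FixedName("Donald", "Duck"));
        return s.contains(new FixedName("Donald", "Duck"));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```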
In your example, whenever you want to retrieve something from your HashMap, you need to have key or key2 at hand, because their equals() is the same as object identity. This makes the HashMap completely useless, because you cannot retrieve anything from it without holding the original key objects. Passing the keys around doesn't make sense, because you could just as well pass the values around; it would be equally awkward.
Now try to imagine some use case, where a HashMap actually makes sense. For example, suppose that you get String-valued requests from the outside, and want to return, say, ip-addresses. The keys that come from the outside obviously cannot be the same as the keys you used to set up your map. Therefore you need some methods that compare requests from the outside to the keys you used during the initialization phase. This is exactly what equals is good for: it defines an equivalence relation on objects that are not identical in the sense of being represented by the same bits in physical memory. hashCode is a coarser version of equals, which is necessary to retrieve values from HashMaps quickly.
Your example is not very useful as it would be simpler to have simple variables. i.e. the only way to lookup the value in the map is to hold the original key. In which case, you may as well just hold the value and not have a Map in the first place.
If instead you want to be able to create a new key which is considered equivalent to a key used previously, you have to provide how equivalence is determined.
Given that most objects are never asked for their identity hash code, the system does not keep for most objects any information that would be sufficient to establish a permanent identity. Instead, Java uses two bits in the object header to distinguish three states:
The identity hashcode for the object has never been queried.
The identity hashcode has been queried, but the object has not been moved by the GC since then.
The identity hashcode has been queried, and the object has been moved since then.
For objects in the first state, asking for the identity hash code will change the object to the second state and process it as a second-state object.
For objects in the second state, including those which had moments before been in the first state, the identity hash code will be formed from the address.
When an object in the second state is moved by the GC, the GC will allocate an extra 32 bits to the object, which will be used to hold a hash-code derived from its original address. The object will then be assigned to the third state.
Subsequent requests for the hash code from a state-3 object will use that value that was stored when it was moved.
At times when the system knows that no objects within a certain address range are in state 2, it may change the formula used to compute hash codes from addresses in that range.
Although at any given time there may only be one object at any given address, it is entirely possible that an object might be asked for its identity hash code and later moved, and that another object might be placed at either the same address as the first one, or an address that would hash to the same value (the system might change the formula used to compute hash values to avoid duplication, but would be unable to eliminate it).
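From the Java side, the only visible part of this machinery is System.identityHashCode. The sketch below, with hypothetical class names, checks the one property the description above implies for callers: once queried, the identity hash of an object stays the same, whether or not the collector moves the object. Note that System.gc() is only a hint, so the example cannot force a move.

```java
class IdentityHashDemo {
    // Hypothetical value type that overrides hashCode; its identity hash
    // is still available through System.identityHashCode.
    static final class Point {
        final int x;

        Point(int x) {
            this.x = x;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x;
        }

        @Override
        public int hashCode() {
            return x;
        }
    }

    static boolean identityHashIsStable(Object o) {
        int before = System.identityHashCode(o);
        System.gc(); // only a hint; the object may or may not be moved
        return before == System.identityHashCode(o);
    }

    public static void main(String[] args) {
        Point p = new Point(7);
        System.out.println(p.hashCode());            // 7
        System.out.println(identityHashIsStable(p)); // true
    }
}
```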
I am curious that in the Java collections library, HashMap has a method that searches for the existence of a particular value, containsValue(Object value), returning a boolean, but no method exists to get the value object directly by value, the way you can by providing a key via the get(Object key) method. Now, I know that the purpose of HashMap is to access values via their keys, but in exceptional cases you may want to retrieve via the value object, so why is there no getValue(Object value) method? I ask this because the algorithm that containsValue() uses to search for the value is faster than my custom search (see below). Also, is there a better way to accomplish this search using HashMap in Java 7?
Code Snippet:
// Custom Search
MyCustomer findCust = new MyCustomer(50000, "Joe Bloggs", "London");
for (MyCustomer value : hashMap.values()) {
if (value.equals(findCust)) { // found
cust = value;
break;
}
}
The basic assumption of the collections framework is that if two objects are .equals, they are interchangeable in every way. Given that assumption, there's no reason to get out the value from a Map, because you already have one that is equals and interchangeable. As far as the Collections Framework is concerned, these two methods are fully equivalent:
for (V value : map.values()) {
if (value.equals(myValue)) {
return value;
}
}
and
if (map.containsValue(myValue)) {
return myValue;
}
This assumption is built into the Collections Framework in many places, and this is one of many examples.
hashMap.values().contains(findCust)
You will need equals and hashCode on Customer based on your "business rules" (for example, are two customers with the same "id" but different other values "equal"?). Evidently you already have that, since you are calling equals.
HashMap is designed to aid constant lookups using hashcode() and equals() of the key you use to put some value into map.
If you look at the internal structure of HashMap, it's nothing but an array. Each index is called a bucket, and the bucket for a key is obtained by reducing the key's hashcode to the current array's length. Once the bucket is found, the element is stored at that index. But if there's already some element stored at that index, the elements form a linked list, chaining all the entries that map to the same bucket but differ according to equals().
In Java 8, this linked list is even converted to a balanced tree (similar to a TreeMap) once the number of elements in it reaches some threshold (8), to improve performance.
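The bucket selection described above can be sketched as follows. This mirrors what OpenJDK 8's HashMap does internally as far as I know, but it is an implementation detail rather than documented API, and the class and method names here are made up.

```java
class BucketIndexSketch {
    // Mixes the high bits of the hash code into the low bits, so that
    // small tables still see some influence from the upper bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // tableLength is assumed to be a power of two, as in HashMap.
    static int bucketIndex(Object key, int tableLength) {
        return spread(key) & (tableLength - 1);
    }

    public static void main(String[] args) {
        // "a" (hash 97) and "A" (hash 65) differ in hash code but share
        // bucket 1 in a 16-slot table; "b" (hash 98) goes to bucket 2.
        System.out.println(bucketIndex("a", 16)); // 1
        System.out.println(bucketIndex("A", 16)); // 1
        System.out.println(bucketIndex("b", 16)); // 2
    }
}
```

This is why entries with different hash codes can end up chained in the same bucket.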
Coming to your question, containsValue() basically iterates over all the buckets in the array and again through all the elements in the linked list of each bucket
// iterate through buckets
for (int i = 0; i < table.length; ++i) {
// iterate through each element in linked list at each bucket
for (Node<K,V> e = table[i]; e != null; e = e.next) {
if ((v = e.value) == value ||
(value != null && value.equals(v)))
return true;
}
}
HashMap.values() returns a Collection with the iterator implemented to traverse each element in HashMap providing access to Value object in each iteration.
containsValue() is used when you want to do something if some value is already in the map, but you don't need that value itself to proceed. It is merely a convenience method: with values() you create a Collection object and an iterator object to iterate over the values, whereas containsValue() just runs two nested for loops. I think the reason for not having a getValue() is to encourage the purpose HashMap is intended for: near constant time lookups using the hashcode and equals of some key.
values() is used when you basically need to iterate over all the values. This is different from calling map.get(key) in a loop because you don't have to normalize the hashcode, find the bucket, then find the element in the linked list in each iteration, you just loop in the natural order, the way the elements are laid out in the array.
If you're doing this value lookup way too many times, you lose the advantage of constant lookups offered by HashMap. If you're only going to skim through the values searching for some value, I suggest you use an ArrayList. If there are too many elements in that list, and you need to search for some random value quite often, sort the list and use Binary Search.
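As a sketch of the sort-then-binary-search suggestion, using plain Strings rather than the MyCustomer class from the question (the class and method names are invented):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedLookupSketch {
    // Returns the stored element equal to wanted, or null if absent.
    // The list must already be sorted in natural order.
    static String find(List<String> sorted, String wanted) {
        int index = Collections.binarySearch(sorted, wanted);
        return index >= 0 ? sorted.get(index) : null;
    }

    public static void main(String[] args) {
        List<String> cities =
                new ArrayList<>(Arrays.asList("London", "Paris", "Berlin", "Madrid"));
        Collections.sort(cities); // sort once, search many times

        System.out.println(find(cities, "Paris")); // Paris
        System.out.println(find(cities, "Rome"));  // null
    }
}
```

For a custom type you would either implement Comparable or pass a Comparator to both Collections.sort and Collections.binarySearch.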
This may not be the real world scenario but just curious to know what happens, below is the code.
I am creating a set of object of class UsingSet.
According to hashing concept in Java, when I first add object which contains "a", it will create a bucket with hashcode 97 and put the object inside it.
Again when it encounters an object with "a", it will call the overridden hashcode method in the class UsingSet and it will get hashcode 97 so what is next?
As I have not overridden the equals method, the default implementation will return false. So where will the object with value "a" be kept: in the same bucket as the previous object with hashcode 97, or in a new bucket?
Does anybody know how it will be stored internally?
/* package whatever; // don't place package name! */
import java.util.*;
import java.lang.*;
import java.io.*;
class UsingSet {
String value;
public UsingSet(String value){
this.value = value;
}
public String toString() {
return value;
}
public int hashCode() {
int hash = value.hashCode();
System.out.println("hashcode called" + hash);
return hash;
}
public static void main(String args[]) {
java.util.Set s = new java.util.HashSet();
s.add(new UsingSet("A"));
s.add(new UsingSet("b"));
s.add(new UsingSet("a"));
s.add(new UsingSet("b"));
s.add(new UsingSet("a"));
s.add(new Integer(1));
s.add(new Integer(1));
System.out.println("s = " + s);
}
}
output is:
hashcode called65
hashcode called98
hashcode called97
hashcode called98
hashcode called97
s = [1, b, b, A, a, a]
HashCode & Equals methods
Only Override HashCode, Use the default Equals:
Only the references to the same object will return true. In other words, those objects you expected to be equal will not be equal by calling the equals method.
Only Override Equals, Use the default HashCode:
There might be duplicates in the HashMap or HashSet. We write the equals method and expect {"abc", "ABC"} to be equal. However, when using a HashMap, they might land in different buckets, so the contains() method will not detect that they are duplicates of each other.
James Large's answer is incorrect, or rather misleading (and partly incorrect as well). I will explain.
If two objects are equal according to their equals() method, they must also have the same hash code.
If two objects have the same hash code, they do NOT have to be equal too.
Here is the actual wording from the java.lang.Object documentation:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
It is true that if two objects don't have the same hash then they are not equal. However, hashing is not a way to check equality, so it is wildly incorrect to say that it is a faster way to check equality.
It is also incorrect to assume that the hashCode function is always cheap. This is all up to the implementation: String.hashCode, for example, performs a calculation based on each char of the String, so hashing a very large String key for the first time is expensive (String caches the result, so later calls on the same instance are cheap).
In a Map (HashSet uses a HashMap internally), there are buckets and in each bucket is a linked list. Java uses the hashCode() function to find out which bucket an object belongs in (it actually modifies the hash, depending on how many buckets exist). Since two objects may share the same hash, it then iterates through the linked list sequentially, checking the equals() method to see if the object is a duplicate. Per the java.util.Set documentation:
A collection that contains no duplicate elements.
So, if its hashCode() leads it to a bucket that already contains an Object for which .equals() evaluates to true, the new Object is treated as a duplicate: a HashSet leaves the existing element in place and add returns false, while a HashMap keeps the existing key and replaces only the value. You can probably view here for more information:
How does a Java HashMap handle different objects with the same hash code?
Generally speaking though, it is good practice to override equals whenever you override hashCode, and vice versa (overriding equals without hashCode is what breaks the contract).
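To see the difference both overrides make for the example in the question, here is a minimal sketch. The Word class is a stand-in for UsingSet with equals added; the names are invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class BothOverriddenDemo {
    // Like UsingSet from the question, but with equals overridden as well.
    static final class Word {
        final String value;

        Word(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Word && ((Word) o).value.equals(value);
        }

        @Override
        public int hashCode() {
            return value.hashCode();
        }

        @Override
        public String toString() {
            return value;
        }
    }

    static Set<Object> build() {
        Set<Object> s = new HashSet<>();
        s.add(new Word("A"));
        s.add(new Word("b"));
        s.add(new Word("a"));
        s.add(new Word("b")); // duplicate: rejected, add returns false
        s.add(new Word("a")); // duplicate: rejected, add returns false
        s.add(Integer.valueOf(1));
        s.add(Integer.valueOf(1)); // Integer already overrides both methods
        return s;
    }

    public static void main(String[] args) {
        Set<Object> s = build();
        System.out.println(s.size());                  // 4
        System.out.println(s.contains(new Word("a"))); // true
    }
}
```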
Put simply, you can think of the hashCode and equals methods as a 2D search,
where the hash code selects the row and the list of objects in that row is the column.
Consider the following class structure.
public class obj
{
int id;
String name;
public obj(String name,int id)
{
this.id=id;
this.name=name;
}
}
now if you create the objects like this:-
obj obj1=new obj("Hassu",1);
obj obj2=new obj("Hoor",2);
obj obj3=new obj("Heniel",3);
obj obj4=new obj("Hameed",4);
obj obj5=new obj("Hassu",1);
and you place this objects in map like this :-
HashMap hMap=new HashMap();
1. hMap.put(obj1,"value1");
2. hMap.put(obj2,"value2");
3. hMap.put(obj3,"value3");
4. hMap.put(obj4,"value4");
5. hMap.put(obj5,"value5");
Now, if you have not overridden hashCode and equals, then after putting all the objects up to line 5, obj5 gets a different hash code from obj1 under the default hashCode, so its row (bucket) will be different.
So in runtime memory it will be stored like this.
|hashcode | Objects
|-----------| ---------
|000562 | obj1
|000552 | obj2
|000588 | obj3
|000546 | obj4
|000501 | obj5
Now if you create the same object like:
obj obj6 = new obj("Hassu",1);
and search for this key in the map like
if(hMap.containsKey(obj6))
or
hMap.get(obj6);
then, although a key (obj1) with the same content is present, you will get false and null respectively.
Now if you override only the equals method
and perform the same search, you will still get null, because the hash code of obj6 is different and no matching key is found under that hash code.
Now if you override only the hashCode method,
you will get the same bucket (hash code row), but the content can't be compared, because the reference-based equals inherited from Object is used.
So if you search with hMap.get(obj6), you will land on the correct hash code (000562), but since obj1 and obj6 are different references you will get null.
A Set will not behave the way you expect.
Uniqueness won't be enforced, because uniqueness is achieved by the hashCode and equals methods together.
With both overridden, the output would be something like s = [A, a, b, 1] instead of the earlier one.
Apart from that, remove, contains, and the other lookup methods won't work either.
Without looking at your code...
The whole point of hash codes is to speed up the process of testing two objects for equality. It can be costly to test whether two large, complex objects are equal, but it is trivially easy to compare their hash codes, and hash codes can be pre-computed.
The rule is: If two objects don't have the same hash code, that means they are not equal. No need to do the expensive equality test.
So, the answer to the question in your title: If you define an equals() method that says object A is equal to object B, and you define a hashCode() method that says object A is not equal to object B (i.e., it says they have different hash codes), and then you hand those two objects to some library that cares whether they are equal or not (e.g., if you put them in a hash table), then the behavior of the library is going to be undefined (i.e., probably wrong).
Added information: Wow! I really missed seeing the forest for the trees here---thinking about the purpose of hashCode() without putting it in the context of HashMap. If m is a Map with N entries, and k is a key; what is the purpose of calling m.get(k)? The purpose, obviously, is to search the map for an entry whose key is equal to k.
What if hash codes and hash maps had not been invented? Well the best you could do, assuming that the keys have a natural, total order, is to search a TreeMap, comparing the given key for equality with O(log(N)) other keys. In the worst case, where the keys have no order, you would have to compare the given key for equality with every key in the map until you either find a match or tested them all. In other words, the complexity of m.get(k) would be O(N).
When m is a HashMap, the complexity of m.get(k) is O(1), whether the keys can be ordered or not.
So, I messed up by saying that the point of hash codes was to speed up the process of testing two objects for equality. It's really about testing an object for equality with a whole collection of other objects. That's where comparing hash codes doesn't just help a little; It helps by orders of magnitude...
...If the k.hashCode() and k.equals(o) methods obey the rule: j.hashCode()!=k.hashCode() implies !j.equals(k).
Maybe this has been asked before (where I didn't find it)...
I have a java.util.Set of aprox. 50000 Strings. I would like to generate some sort of hash to check if it has been changed (comparing hashes of two versions of the Set)?
If the Set changes, the hash has to be different.
How can that be achieved? Thanks!
EDIT:
Sorry for that misleading wording. I don't want to check if "it" has been changed (the same instance). Instead I want to check if two database queries, which are generating two - maybe identical - instances of a Set of Strings are equal.
I'd try using java.util.AbstractSet's hashCode method, as stated in the documentation:
Returns the hash code value for this set. The hash code of a set is
defined to be the sum of the hash codes of the elements in the set,
where the hash code of a null element is defined to be zero. This
ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode()
for any two sets s1 and s2, as required by the general contract of
Object.hashCode().
Of course, this relies on your Set implementation following that contract, which java.util.HashSet and the other AbstractSet subclasses do. As always, there is a chance of a hash collision.
Alternatively, you could extend an existing Set implementation and override the state changing methods, this may make sense if hash computation for each object becomes too expensive, like:
class ChangeSet<E> extends java.util.HashSet<E> {
private boolean changed = false;
@Override
public boolean add(E e) {
changed = true;
return super.add(e);
}
public void commit() {
changed = false;
}
public boolean isChanged() {
return changed;
}
/* and all the other methods (addAll, remove, removeAll, etc.) */
}
Based on this statement:
If the Set changes, the hash has to be different
It really can't be achieved, unless you have more constraints. In general, a hash is a value in some fixed space. For example, your hash may be a 32 bit integer, so there are 2^32 possible hash values. In general, b bits gets you 2^b possible hash values. In order to achieve what you want, you have to make sure that the number of possible sets (i.e. the size of the set of all sets!) is less than or equal to 2^b. But my guess is that you can have arbitrary strings, so this isn't possible. And even if it were possible, you'd have to come up with a way to map onto the hash space, which can be challenging.
However, with a good hash function, it's not very likely that changing the set will end up producing the same hash value. So you can use the hash to determine inequality, but if the hash is the same, you still need to check for equality. (This is the same idea behind a hash set or a hash map, where elements map to buckets based on a hashcode, but you have to check for equality).
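A minimal sketch of that idea, assuming you keep the hash code of the previous query result around (the class and method names are hypothetical): a differing hash proves the sets differ, while a matching hash still needs the full equals check.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class SetChangeCheck {
    // previousHash is the value of previous.hashCode() recorded earlier.
    static boolean sameContents(Set<String> previous, int previousHash,
                                Set<String> current) {
        if (current.hashCode() != previousHash) {
            return false; // different hashes: definitely different sets
        }
        return current.equals(previous); // same hash: must still compare
    }

    public static void main(String[] args) {
        Set<String> first = new HashSet<>(Arrays.asList("x", "y"));
        Set<String> second = new HashSet<>(Arrays.asList("y", "x"));
        System.out.println(sameContents(first, first.hashCode(), second)); // true

        // "Aa" and "BB" collide, so the hash alone would be fooled here.
        Set<String> a = Collections.singleton("Aa");
        Set<String> b = Collections.singleton("BB");
        System.out.println(sameContents(a, a.hashCode(), b));              // false
    }
}
```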
Similar to what Paul mentioned but different: you can instead make a set implementation that has version numbers and ensure that you always generate a new version number when the set is mutated. Then you can compare the version numbers. I'm not sure if you care about immutable sets or whether the mutable set changes back to a version you have seen (i.e. whether it should then get the same version again).
Hope this helps.
If you need to improve the performance of hashCode (as it is rather expensive for a large Set) you can cache it and update it as you go.
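class MyHashSet<E> extends LinkedHashSet<E> {
    // Running sum of element hash codes, matching the Set.hashCode() contract.
    // Note: removals made through an iterator (and bulk operations built on it)
    // bypass remove(Object) and would leave this cache stale.
    int hashCode = 0;
    @Override
    public boolean add(E e) {
        if (super.add(e)) {
            hashCode += (e == null ? 0 : e.hashCode());
            return true;
        }
        return false;
    }
    @Override
    public boolean remove(Object o) {
        if (super.remove(o)) {
            hashCode -= (o == null ? 0 : o.hashCode());
            return true;
        }
        return false;
    }
    @Override
    public void clear() {
        super.clear();
        hashCode = 0;
    }
    @Override
    public int hashCode() {
        return hashCode;
    }
}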
Sometimes simpler is better. I suggest writing your own Set implementation. In it, override the add and remove methods so they set a flag if the Set is modified. Add a getter for the flag, isModified, and you don't have to worry about hash overhead or collisions. Just call MyCustomSet.isModified.
Alternately you can call Collections.unmodifiableSet to get a wrapper around your Set that can't be modified. An exception will be thrown if code attempts to modify the set.
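A short sketch of the unmodifiable-wrapper approach; the class and method names are made up, while Collections.unmodifiableSet and UnsupportedOperationException are standard JDK.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class ReadOnlySetDemo {
    // Returns true if the read-only view refuses the modification.
    static boolean rejectsModification(Set<String> backing) {
        Set<String> view = Collections.unmodifiableSet(backing);
        try {
            view.add("new element");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Set<String> backing = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(rejectsModification(backing)); // true
        System.out.println(backing.size());               // 2
    }
}
```

Keep in mind that the wrapper is only a view: code that still holds a reference to the backing set can modify it, and the change shows through the wrapper.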