Duplicate elements in java.util.Set - java

java.util.Set implementations remove duplicate elements.
How are duplicate elements removed internally in a java.util.Set?

Actually, AFAIK from the sources, most Set implementations in Java don't even check whether the element is already contained.
They just always execute add() on the internal structure that holds the set elements and let that object handle the duplicate case.
E.g. HashSet calls put(K,V) on the internal HashMap, which inserts a new entry or, for a duplicate, keeps the existing key and only overwrites its value.

Reading a little into your question I'm guessing that you're seeing strange behaviour with a java.util.HashSet (typically what everyone uses by default).
Contrary to the contract of java.util.Set, it is possible to get the same object into a java.util.HashSet twice, like this:
import java.util.HashSet;
import java.util.Set;
public class SetTest
{
public static void main(String[] args)
{
MyClass myObject = new MyClass(1, "testing 1 2 3");
Set<MyClass> set = new HashSet<MyClass>();
set.add(myObject);
myObject.setHashCode(2);
set.add(myObject);
System.out.println(set.size()); // this will print 2.
}
private static class MyClass
{
private int hashCode;
private String otherField;
public MyClass(int hashCode, String otherField)
{
this.hashCode = hashCode;
this.otherField = otherField;
}
public void setHashCode(int hashCode)
{
this.hashCode = hashCode;
}
public boolean equals(Object obj)
{
return obj != null && obj.getClass().equals(getClass()) && ((MyClass)obj).otherField.equals(otherField);
}
public int hashCode()
{
return hashCode;
}
}
}
After the pointer from @jitter and a look at the source, you can see why this happens.
As @jitter says, java.util.HashSet uses a java.util.HashMap internally. When the hash changes between the first and second add, a different bucket is used in the java.util.HashMap and the object ends up in the set twice.
The code sample may look a little contrived, but I've seen this happen in the wild with domain classes where the hash is created from mutable fields and the equals method hasn't been kept in sync with those fields.

An easy way to find this out is to look in the source for the code you are interested in.
Each JDK has a src.zip included which contains the source code for the public classes so you can just locate the source for HashSet and have a look :) I often use Eclipse for this. Start it, create a new Java project, set the JVM to be an installed JDK (if not you are using the system default JRE which doesn't have src.zip), and Ctrl-Shift-T to go to HashSet.

Reading your question in more detail:
You can't add duplicates (or do you mean addAll?). From the Javadoc for Set.add():
Adds the specified element to this set if it is not already present (optional operation). More formally, adds the specified element e to this set if the set contains no element e2 such that (e==null ? e2==null : e.equals(e2)). If this set already contains the element, the call leaves the set unchanged and returns false. In combination with the restriction on constructors, this ensures that sets never contain duplicate elements.

Adds the specified element to the set if it is not already present.
If the set already contains the element, the call leaves the set unchanged and returns false. In combination with the restriction on constructors, this ensures that sets never contain duplicate elements.

First off, set doesn't "Delete" duplicates, it doesn't allow entering duplicates in the first place.
Let me walk you through the implementation of set.add(e) method.
set.add(e) returns boolean stating whether e has been added in the set or not.
Let's take this simple code for example:
We will get x as true and y as false.
Let us see what add() actually does:
So, HashSet basically uses a HashMap internally, and passes the element as the key (with a dummy object called PRESENT as the value).
This map.put(k,v) returns null if the key was not present, or the old value that the key had.
Therefore, when calling set.add(1) for the first time, we get null back from map.put(1,PRESENT), and that's why we get true.
When we call it the second time, we don't get null back from map.put(1,PRESENT), and hence set.add(1) returns false.
(You can dig deeper into the put method, which internally calls putVal and uses the hash to determine whether a key already exists; depending on that, it returns null or the old value.)
And since the internal HashMap uses hashCode() and equals() to decide whether a key is already present, we never end up with the same element twice in a HashSet (as long as the element's hash code does not change while it is in the set).
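The walkthrough above can be sketched as follows. This is a minimal illustration, not the JDK source: the class and method names are made up for the example, and PRESENT here is a local stand-in for HashSet's private marker object.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class AddReturnDemo {
    // Stand-in for HashSet's private PRESENT marker (illustrative only).
    private static final Object PRESENT = new Object();

    // Adds the same value twice to a HashSet and returns both results.
    static boolean[] addTwice(int value) {
        Set<Integer> set = new HashSet<>();
        boolean x = set.add(value); // element was absent
        boolean y = set.add(value); // element already present
        return new boolean[] { x, y };
    }

    // Mirrors the idea behind add(): put the element as a key and
    // report whether there was no previous mapping.
    static boolean addViaMap(Map<Integer, Object> map, int value) {
        return map.put(value, PRESENT) == null;
    }

    public static void main(String[] args) {
        boolean[] r = addTwice(1);
        System.out.println(r[0] + " " + r[1]); // prints "true false"

        Map<Integer, Object> map = new HashMap<>();
        System.out.println(addViaMap(map, 1)); // prints "true"
        System.out.println(addViaMap(map, 1)); // prints "false"
    }
}
```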

Related

Why don't Set implementations store just keys?

I know Set is implemented on top of a Map<K,V>, where the Set elements are the keys. But what happens with the values? They use private static final Object PRESENT = new Object() as a constant value for each key.
Cool. But why? That means for each key we store a value we will never use, just so we can reuse the implementation of Map? Why? Couldn't they just write a key-only implementation? And is that constant ever used, or does it just 'sit' there?
As the implementation shows, the add method of HashSet returns a boolean.
The add method calls put(key, PRESENT) on the internal HashMap. The remove method calls remove(key) on the internal HashMap, but it must return a boolean indicating whether the key was present. If null were stored as the value, then the HashSet would need to call containsKey first, then remove, to determine if the key was present -- additional overhead. Here, there is only the memory overhead of one Object, which is quite minimal.
public boolean add(E e)
{
return map.put(e, PRESENT)==null;
}
For each potential 'place' that can hold an entry in the HashSet, it must be possible to tell whether that place is occupied or not.
If you want to support a key value of null, as indeed HashSet does, then you can't use key=null as the "not used" indicator. Since you need to use key.equals() to find an object in the set, you can't have a special key object that might accidentally be equal to (not identical to) an actual key.
Thus you need a separate flag apart from the key to say whether the 'place' is occupied. This might as well be an object reference as anything else; then you can reuse the Map implementation.
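To make the role of the marker value concrete, here is a minimal sketch of a set backed by a map. TinySet and its methods are hypothetical names for illustration and not the JDK's HashSet; the point is only that a non-null marker lets add and remove each answer "was the key present?" with a single map call, even for a null element.

```java
import java.util.HashMap;
import java.util.Map;

class TinySet<E> {
    // Shared non-null marker stored as the value for every key.
    private static final Object PRESENT = new Object();
    private final Map<E, Object> map = new HashMap<>();

    // put returns null only when the key had no mapping before.
    boolean add(E e) { return map.put(e, PRESENT) == null; }

    // remove returns the old value, so PRESENT means the key existed.
    boolean remove(Object o) { return map.remove(o) == PRESENT; }

    boolean contains(Object o) { return map.containsKey(o); }

    int size() { return map.size(); }

    public static void main(String[] args) {
        TinySet<String> s = new TinySet<>();
        System.out.println(s.add(null));    // prints "true"
        System.out.println(s.remove(null)); // prints "true"
        System.out.println(s.remove(null)); // prints "false"
    }
}
```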

Why does Set.add not return an object?

Why does Set.add() not return the object you are trying to add, or an object that is equal to the one you are trying to add if it is already in the set?
I could have an object with two properties, but only one of these properties is considered in the hashcode()/equals() methods (and then use the property that is not considered for hashcode()/equals()).
A simple answer is that Set extends the Collection interface, but that is not satisfying; then the question is why Set does not have an additional method that accomplishes what I want it to do, e.g.: set.addObject().
I know I can use Map.getOrDefault, but then I need to specify the two parameters of the Map.
A simple answer is that Set extends the Collection interface, but that is not satisfying
Satisfying or not, that is the answer to why add returns a boolean rather than the Object already in the Set; Collection.add returns a boolean to indicate whether the underlying collection was changed.
In general, what would you return to indicate that the element wasn't added, bearing in mind that "already in the collection" isn't the only reason for not adding an element?
The question of why there is no addObject method is quite separate, and opinion-based: you might consider it useful, but the API designers either:
Didn't consider adding it, because they didn't see the need;
Considered it, but decided it didn't carry its weight, since you can do it by other means on the rare occasions when you want such functionality; or
Considered that it might be useful, but were then unable to modify the interface to add a new method, as that would break existing implementations of Set.
This is obviously possible now that interfaces can have default method implementations; but then we're back to the other two reasons above.
Actually, I take back what I said above, that "already in the collection" isn't the only reason for not adding an element.
According to the Javadoc of Collection.add:
Returns true if this collection changed as a result of the call. (Returns false if this collection does not permit duplicates and already contains the specified element.)
So, OK, it should be the only reason. But then you've got to wonder why it would be useful, say, for a List to return the element you've just added.
It indeed would be nice.
A simple use case would be a cache, so that no duplicate instances are used.
This must currently be implemented as follows:
class Shared<T> {
private final Map<T, T> map = new HashMap<>();
public T share(T obj) {
T old = map.get(obj);
if (old == null) {
map.put(obj, obj);
old = obj;
}
return old;
}
}
But if Set.addObject returned the old object, or the newly added object when there was no old one:
private final Set<T> set = new HashSet<>();
public T share(T obj) {
return set.addObject(obj);
}
In terms of the algebraic completeness of Set, with all operations available, that functionality is indeed missing.
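On Java 8 and later, the Map-based workaround above can be shortened with Map.computeIfAbsent, which returns the existing value for an equal key or stores and returns the new one. This is a sketch under that assumption; the Interner class name is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class Interner<T> {
    // Each canonical instance is stored as both key and value.
    private final Map<T, T> map = new HashMap<>();

    // Returns the instance already stored for an equal object,
    // or stores obj itself and returns it.
    T share(T obj) {
        return map.computeIfAbsent(obj, k -> k);
    }

    public static void main(String[] args) {
        Interner<String> interner = new Interner<>();
        String first = new String("abc");
        String second = new String("abc"); // equal, but a distinct instance
        System.out.println(interner.share(first) == first);  // prints "true"
        System.out.println(interner.share(second) == first); // prints "true"
    }
}
```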

Extend HashMap to get all objects with the specified hashcode?

From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code. I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
The reasoning is that I'm trying to optimize a part of a code that, currently, is just a while loop that finds the first object with that hashcode and stores/removes it. This would be a lot faster if I could just return the full list in one go.
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
The keys are Chunk objects and the values are Strings. Here are the hashcode() and equals() functions for the Chunk class:
/**
* Returns a string representation of the ArrayList of words
* thereby storing chunks with the same words but with different
* locations and next words in the same hash bucket, triggering the
* use of equals() when searching and adding
*/
public int hashCode() {
return (Words.toString()).hashCode();
}
@Override
/**
* result depends on the value of location. A location of -1 is obviously
* not valid and therefore indicates that we are searching for a match rather
* than adding to the map. This allows multiples of keys with matching hashcodes
* to be considered unequal when adding to the hashmap but equal when searching
* it, which is integral to the MakeMap() and GetOptions() methods of the
* RandomTextGenerator class.
*
*/
public boolean equals(Object obj) {
Chunk tempChunk = (Chunk)obj;
if (LocationInText == -1 && Words.size() == tempChunk.GetText().size())
{
for (int i = 0; i < Words.size(); i++) {
if (!Words.get(i).equals(tempChunk.GetText().get(i))) {
return false;
}
}
return true;
}
else {
if (tempChunk.GetLocation() == LocationInText) {
return true;
}
return false;
}
}
Thanks!
HashMap does not expose any way to do this, but I think you're misunderstanding how HashMap works in the first place.
The first thing you need to know is that if every single object had exactly the same hash code, HashMap would still work. It would never "mix up" keys. If you call get(key), it will only return the value associated with key.
The reason this works is that HashMap only uses hashCode as a first grouping, but then it checks the object you passed to get against the keys stored in the map using the .equals method.
There is no way, from the outside, to tell that HashMap uses linked lists. (In fact, in more recent versions of Java, it doesn't always use linked lists.) The implementation doesn't provide any way to look at hash codes, to find out how hash codes are grouped, or anything along those lines.
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
This code does not "find the first object with that hashcode and store/remove it." It finds the one and only object equal to toSearch according to .equals, stores and removes it. (There can only be one such object in a Map.)
Your while loop doesn't really loop. If WorkingMap is a plain Java HashMap, it makes at most one pass: .get(key) returns the object most recently saved under 'key', so if something matches toSearch, the body runs once.
There are many open questions here, so I'm not sure about this, but in case you need it and your further code relies on it:
What type is Possibles? An ArrayList?
// this should do the same as your while loop
if(WorkingMap.containsKey(toSearch)) {
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
// further: expand your Possibles to get the LinkedList that you want to have.
public class possibilities {
// List<LinkedList<String>> container = new ArrayList<LinkedList<String>>();
public Map<Chunk, LinkedList<String>> container2 = new HashMap<Chunk, LinkedList<String>>();
public void put(Chunk key, String value) {
if(!this.container2.containsKey(key)) {
this.container2.put(key, new LinkedList<String>());
}
this.container2.get(key).add(value);
}
}
// this one works with updated Possibles
if(WorkingMap.containsKey(toSearch)) {
Possibles.put(toSearch, WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
//---
So yes, it can work like that, but the keys should not be complex objects.
Note that the LinkedLists take memory, and how big are the chunks? Check the memory usage.
The stored keys are available via Possibles.container2.keySet();
Good luck
Sail
From what I understand, when two objects are put in a HashMap that have the same hashcode, they are put in a LinkedList (I think) of objects with the same hash code.
Yes, but it's more complicated than that. It often needs to put objects in linked lists even when they have differing hash codes, since it only uses some bits of the hash codes to choose which bucket to store objects in; the number of bits it uses depends on the current size of the internal hash table, which approximately depends on the number of things in the map. And when a bucket needs to contain multiple objects it will also try to use binary trees like a TreeMap if possible (if objects are mutually Comparable), rather than linked lists.
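As a rough sketch of how only some bits of a hash code select the bucket: the spreading step below mirrors what the OpenJDK 8+ HashMap source does, but that is an implementation detail and other versions or JVMs may differ. The class and method names are hypothetical, and tableSize is assumed to be a power of two.

```java
class BucketIndexDemo {
    // Mixes the high 16 bits into the low 16 bits (OpenJDK 8+ detail).
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Keeps only the low bits; tableSize must be a power of two.
    static int bucketIndex(int hashCode, int tableSize) {
        return spread(hashCode) & (tableSize - 1);
    }

    public static void main(String[] args) {
        // Different hash codes, same bucket in a 16-slot table.
        System.out.println(bucketIndex(1, 16));  // prints "1"
        System.out.println(bucketIndex(17, 16)); // prints "1"
        // After the table grows, one more bit is used and they separate.
        System.out.println(bucketIndex(17, 32)); // prints "17"
    }
}
```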
Anyway.....
I am wondering if there is a way to either extend HashMap or manipulate the existing methods to return a list or array of objects that share a hash code instead of going into equals to see if they are the same object.
No.
A HashMap compares keys for equality according to the equals method. Equality according to the equals method is the only valid way to set, replace, or retrieve values associated with a particular key.
Yes, it also uses hashCode as a way to arrange objects in a structure that allows for far faster location of potentially equal objects. Still, the contract for matching keys is defined in terms of equals, not hashCode.
Note that it is perfectly legal for every hashCode method to be implemented as return 0; and the map will still work just as correctly (but very slowly). So any idea that involves getting a list of objects sharing a hash code is either impossible or pointless or both.
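A small sketch of that point, using a hypothetical Key class whose hashCode always returns 0. Every key lands in the same bucket, yet lookups still resolve by equals.

```java
import java.util.HashMap;
import java.util.Map;

class ConstantHashDemo {
    // Hypothetical key type: equality by name, constant hash code.
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }

        @Override
        public int hashCode() { return 0; } // legal, but slow for large maps
    }

    static Map<Key, Integer> build() {
        Map<Key, Integer> map = new HashMap<>();
        map.put(new Key("a"), 1);
        map.put(new Key("b"), 2);
        map.put(new Key("a"), 3); // equal key: the value is replaced
        return map;
    }

    public static void main(String[] args) {
        Map<Key, Integer> map = build();
        System.out.println(map.size());            // prints "2"
        System.out.println(map.get(new Key("a"))); // prints "3"
        System.out.println(map.get(new Key("b"))); // prints "2"
    }
}
```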
I'm not 100% sure what you're doing in your equals method with the LocationInText variable, but it looks dangerous, as it violates the contract of the equals method. It is required that the equals method be symmetric, transitive, and consistent:
Symmetric: for any non-null reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
Transitive: for any non-null reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
Consistent: for any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
And the hashCode method is required to always agree with equals about equal objects:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
The LocationInText variable is playing havoc with those rules, and may well break things. If not today, then some day. Get rid of it!
Here's the bit of code I'd like to replace:
while (WorkingMap.containsKey(toSearch)) {
Occurences++;
Possibles.add(WorkingMap.get(toSearch));
WorkingMap.remove(toSearch);
}
Something that jumps out at me is that you only need to do the key lookup once, instead of doing it three times, since Map.remove returns the removed value or null if the key is not present:
for (;;) {
String s = WorkingMap.remove(toSearch);
if (s == null) break;
Occurences++;
Possibles.add(s);
}
Either way, the loop is still faulty, since it is supposed to be impossible for a map to contain more than one key equal to toSearch. I can't overstate that the LocationInText variable as you're using it is not a good idea.
I agree with the other commenters: it looks like you're looking for a map-of-list structure. Some Java libraries like Guava offer a Multimap for this, but you can do it manually pretty easily. I think the declaration you want is:
Map<Chunk,List<String>> map = new HashMap<>();
To add a new chunk-string pair to the map, do:
void add(Chunk chunk, String string) {
map.computeIfAbsent(chunk, k -> new ArrayList<>()).add(string);
}
That method puts a new ArrayList in the map if the chunk is new, or fetches the existing ArrayList if there is one for that chunk. Then it adds the string to the list that it fetched or created.
Retrieving the list of all strings for a particular chunk value is as simple as map.get(chunkToSearch), which returns null if the chunk is absent; when it is present, you can add the result to your Possibles list with Possibles.addAll(map.get(chunkToSearch));.
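Put together, the map-of-lists idea might look like the sketch below. It is only an illustration: a List<String> of words stands in for the Chunk key (assuming Chunk equality is based on its words alone), and the class and method names are invented. getOrDefault avoids a null result when a key is absent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiMapDemo {
    // Key: the words of a chunk (stand-in for Chunk). Value: all following strings.
    private final Map<List<String>, List<String>> map = new HashMap<>();

    // Creates the list on first use, then appends to it.
    void add(List<String> words, String next) {
        map.computeIfAbsent(words, k -> new ArrayList<>()).add(next);
    }

    // Returns every stored string for the key, or an empty list.
    List<String> options(List<String> words) {
        return map.getOrDefault(words, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultiMapDemo m = new MultiMapDemo();
        m.add(Arrays.asList("the", "cat"), "sat");
        m.add(Arrays.asList("the", "cat"), "ran");
        System.out.println(m.options(Arrays.asList("the", "cat"))); // prints "[sat, ran]"
        System.out.println(m.options(Arrays.asList("a", "dog")));   // prints "[]"
    }
}
```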
Other potential optimizations I'd point out:
In your Chunk.hashCode method, consider caching the hash code instead of recomputing it every time the method is called. If Chunk is mutable (which is not a good idea for a map key, but vaguely allowed so long as you're careful) then recompute the hash code only after the Chunk's value has changed. Also, if Words is a List, which it seems to be, it would likely be faster to use its hash code than convert it to a string and use the string's hash code, but I'm not sure.
In your Chunk.equals method, you can return true immediately if the instances are the same (which they often will be). Also, if GetText returns a copy of the data, then don't call it; you can access the private Words list of the other Chunk since you are in the same class, and finally, you can just defer to the List.equals method:
@Override
public boolean equals(Object o) {
return (this == o) || (o instanceof Chunk && this.Words.equals(((Chunk)o).Words));
}
Simple! Fast!

Why doesn't addAll() support the addition of a collection's copy to the collection?

In a method, I make two calls. The first call constructs and returns a hashset from another method. The second call adds this newly constructed set to an existing one, passed in as a parameter to this method.
public static void someMethod(java.util.HashSet<Coordinate> invalidPositions)
{
java.util.HashSet<Coordinate> newSet = SomeClass.getInvalidPositions(x, y);
invalidPositions.addAll(newSet);
}
Oftentimes the passed-in, pre-existing set will be asked to add another set whose contents are the same as its own! That is, setOne.equals(setTwo) == true
Instead of adding the other set, however, the JavaDocs say of addAll():
public boolean addAll(Collection c)
Adds all of the elements in the specified collection to this collection (optional operation).
The behavior of this operation is undefined if the specified collection is modified while the operation is in progress. (This implies that the behavior of this call is undefined if the specified collection is this collection, and this collection is nonempty.)
Have I understood this correctly? If two sets are equal, java will not support one's addition of the other? If this is true, is there any reason to design the language in this way?
a.equals(b) is different from a == b.
What the javadoc means is that the behavior is undefined if you do a.addAll(a). There is no problem in doing a.addAll(b) as long as they are different instances.
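A short sketch of the difference, with two equal but distinct HashSet instances (the class and helper names are just for the example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class AddAllDemo {
    // Builds a fresh set instance with the same contents each time.
    static Set<String> newSet() {
        return new HashSet<>(Arrays.asList("x", "y"));
    }

    public static void main(String[] args) {
        Set<String> a = newSet();
        Set<String> b = newSet();
        System.out.println(a.equals(b)); // prints "true": same contents
        System.out.println(a == b);      // prints "false": different instances
        System.out.println(a.addAll(b)); // prints "false": nothing new was added
        System.out.println(a.size());    // prints "2"
    }
}
```

Adding an equal set is well defined and simply leaves the target unchanged; only the a.addAll(a) case is the one the Javadoc warns about.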

Updating an object within a Set

Let's say I have this type in my application:
public class A {
public int id;
public B b;
public boolean equals(Object another) { return this.id == ((A)another).id; }
public int hashCode() { return 31 * id; } // nice prime number
}
and a Set<A> structure. Now, I have an object of type A and want to do the following:
If my A is within the set, update its field b to match my object.
Else, add it to the set.
So checking if it is in there is easy enough (contains), and adding to the set is easy too. My question is this: how do I get a handle to update the object within? Interface Set doesn't have a get method, and the best I could think of was to remove the object in the set and add mine. Another, even worse, alternative is to traverse the set with an iterator to try and locate the object.
I'll gladly take better suggestions... This includes the efficient use of other data structures.
Yuval =8-)
EDIT: Thank you all for answering... Unfortunately I can't 'accept' the best answers here, those that suggest using a Map, because changing the type of the collection radically for this purpose only would be a little extreme (this collection is already mapped through Hibernate...)
Since a Set can only contain one instance of an object (as defined by its equals and hashCode methods), just remove it and then add it. If there was one already, that other one will be removed from the Set and replaced by the one you want.
I have code that does something similar - I am caching objects so that everywhere a particular object appears in a bunch of different places on the GUI, it's always the same one. In that case, instead of using a Set I'm using a Map, and when I get an update, I retrieve the object from the Map and update it in place rather than creating a new instance.
You really want to use a Map<Integer,A>, not a Set<A>.
Then map the ID (even though it's also stored in A!) to the object. So storing a new object looks like this:
A a = ...;
Map<Integer,A> map = new HashMap<Integer,A>();
map.put( a.id, a );
Your complete update algorithm is:
public static void update( Map<Integer,A> map, A obj ) {
A existing = map.get( obj.id );
if ( existing == null )
map.put( obj.id, obj );
else
existing.b = obj.b;
}
However, it might be even simpler. I'm assuming you have more fields in A than the ones you gave. If that is not the case, a Map<Integer,B> is in fact what you want, and then it collapses to nothing:
Map<Integer,B> map = new HashMap<Integer,B>();
// The insert-or-update is just this:
map.put( id, b );
I don't think you can make it any easier than using remove/add if you are using a Set.
set.remove(a);
set.add(a);
If a matching A is found, it will be removed, and then you add the new one; you don't even need the if (set.contains(A)) conditional.
If you have an object with an ID and an updated field and you don't really care about any other aspects of that object, just throw it out and replace it.
If you need to do anything else to the A that matches that ID then you'll have to iterate through the Set to find it or use a different Container (like the Map as Jason suggested).
No one has mentioned this yet, but basing hashCode or equals on a mutable property is one of those really, really big things that you shouldn't do. Don't muck about with object identity after you leave the constructor - doing so greatly increases your chances of having really difficult-to-figure out bugs down the road. Even if you don't get hit with bugs, the accounting work to make sure that you always properly update any and all data structures that relies on equals and hashCode being consistent will far outweigh any perceived benefits of being able to just change the id of the object as you run.
Instead, I strongly recommend that you pass id in via the constructor, and if you need to change it, create a new instance of A. This will force users of your object (including yourself) to properly interact with the collection classes (and many others) that rely on immutable behavior in equals and hashCode.
What about Map<A,A>? I know it's redundant, but I believe it will get you the behavior you'd like. Really, I'd love to see Set have a get(Object o) method on it.
You might want to create a decorator called ASet and use an internal Map as the backing data structure:
class ASet {
private Map<Integer, A> map;
public ASet() {
map = new HashMap<Integer, A>();
}
public A updateOrAdd(Integer id, int delta) {
A a = map.get(id); // look up by id
if(a == null) {
a = new A(id);
map.put(id,a);
}
a.setX(a.getX() + delta);
return a; // the method is declared to return the updated A
}
}
You can also take a look at the Trove API. While that is better for performance and for the case where you are working with primitive variables, it exposes this feature very nicely (e.g. map.adjustOrPutValue(key, initialValue, deltaValue)).
It's a bit outside scope, but you forgot to re-implement hashCode(). When you override equals please override hashCode(), even in an example.
For example, contains() will very probably go wrong when you have a HashSet implementation of Set, as the HashSet then uses the hashCode of Object to locate the bucket (a number which has nothing to do with business logic), and only calls equals() on the elements within that bucket.
public class A {
public int id;
public B b;
public int hashCode() {return id;} // simple and efficient enough for small Sets
public boolean equals(Object another) {
if (another == null || !(another instanceof A)) {
return false;
}
return this.id == ((A)another).id;
}
}
public class Logic {
/**
* Replace the element in data with the same id as element, or add element
* to data when the id of element is not yet used by any A in data.
*/
public void update(Set<A> data, A element) {
data.remove(element); // Safe even if the element is not in the Set
data.add(element);
}
}
EDIT: Yuval indicated correctly that Set.add does not overwrite an existing element, but only adds if the element is not yet in the collection (with "is" implemented by equals).
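The difference between add alone and remove followed by add can be checked with a small sketch. Item is a hypothetical stand-in for A with an immutable id and a payload in place of b; the helper names are made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

class ReplaceDemo {
    // Stand-in for A: identity is the id only, payload plays the role of b.
    static final class Item {
        final int id;
        final String payload;
        Item(int id, String payload) { this.id = id; this.payload = payload; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Item && ((Item) o).id == id;
        }

        @Override
        public int hashCode() { return Integer.hashCode(id); }
    }

    // Replace the stored element that equals 'element', or add it if absent.
    static void update(Set<Item> data, Item element) {
        data.remove(element); // safe even if the element is not in the set
        data.add(element);
    }

    // Linear scan, used here only to inspect which instance is stored.
    static String payloadOf(Set<Item> data, int id) {
        for (Item i : data) {
            if (i.id == id) return i.payload;
        }
        return null;
    }

    public static void main(String[] args) {
        Set<Item> data = new HashSet<>();
        data.add(new Item(1, "old"));
        data.add(new Item(1, "new"));           // ignored: an equal element exists
        System.out.println(payloadOf(data, 1)); // prints "old"
        update(data, new Item(1, "new"));
        System.out.println(payloadOf(data, 1)); // prints "new"
        System.out.println(data.size());        // prints "1"
    }
}
```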
