According to the contract for a Set in Java, "it is not permissible for a set to contain itself as an element". However, this is possible in the case of a HashSet of Objects, as demonstrated here:
Set<Object> mySet = new HashSet<>();
mySet.add(mySet);
assertThat(mySet.size(), equalTo(1));
This assertion passes, but I would expect the behavior to be either that the resulting set has size 0 or that an Exception is thrown. I realize the underlying implementation of a HashSet is a HashMap, but it seems like there should be an equality check before adding an element to avoid violating that contract, no?
Others have already pointed out why it is questionable from a mathematical point of view, by referring to Russell's paradox.
This does not answer your question on a technical level, though.
So let's dissect this:
First, once more the relevant part from the JavaDoc of the Set interface:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Interestingly, the JavaDoc of the List interface makes a similar, although somewhat weaker, and at the same time more technical statement:
While it is permissible for lists to contain themselves as elements, extreme caution is advised: the equals and hashCode methods are no longer well defined on such a list.
And finally, the crux is in the JavaDoc of the Collection interface, which is the common ancestor of both the Set and the List interface:
Some collection operations which perform recursive traversal of the collection may fail with an exception for self-referential instances where the collection *directly or indirectly contains itself*. This includes the clone(), equals(), hashCode() and toString() methods. Implementations may optionally handle the self-referential scenario, however most current implementations do not do so.
(Emphasis by me)
The emphasized part is a hint at why the approach that you proposed in your question would not be sufficient:
it seems like there should be an equality check before adding an element to avoid violating that contract, no?
This would not help you here. The key point is that you'll always run into problems when the collection directly or indirectly contains itself. Imagine this scenario:
Set<Object> setA = new HashSet<Object>();
Set<Object> setB = new HashSet<Object>();
setA.add(setB);
setB.add(setA);
Obviously, neither of the sets contains itself directly. But each of them contains the other - and therefore, itself indirectly. This could not be avoided by a simple referential equality check (using == in the add method).
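Concretely, continuing the snippet above (a minimal sketch, not part of the original): the trouble surfaces as soon as a hash code is needed, because a HashSet computes its hash code from the hash codes of its elements:
// Both add calls above succeed, because the cycle is not yet closed when
// the element's hash code is computed at insertion time. But from now on:
setA.hashCode(); // setA -> setB -> setA -> ... -> StackOverflowError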
Avoiding such an "inconsistent state" is basically impossible in practice. Of course it is possible in theory, using reachability computations over all references. In fact, the garbage collector basically has to do exactly that!
But it becomes impossible in practice when custom classes are involved. Imagine a class like this:
class Container {
    Set<Object> set;

    @Override
    public int hashCode() {
        return set.hashCode();
    }
}
And messing around with this and its set:
Set<Object> set = new HashSet<Object>();
Container container = new Container();
container.set = set;
set.add(container);
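From this point on, any operation that needs the set's hash code recurses between the two objects (this extra call is my illustration, not part of the original snippet):
set.hashCode(); // set.hashCode() -> container.hashCode() -> set.hashCode() -> ... -> StackOverflowError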
The add method of the Set basically has no way of detecting whether the object that is added there has some (indirect) reference to the set itself.
Long story short:
You cannot prevent the programmer from messing things up.
Adding the collection into itself once causes the test to pass. Adding it twice causes the StackOverflowError which you were seeking.
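A minimal sketch of that (my reconstruction; the second add has to compute the set's own hash code for the containment check, and that computation now recurses):
Set<Object> set = new HashSet<>();
set.add(set); // passes: the set is still empty when its hash code is computed
set.add(set); // computes set.hashCode() -> recurses into itself -> StackOverflowError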
From a developer's standpoint, it doesn't make any sense to enforce a check in the underlying code to prevent this. The fact that you get a StackOverflowError in your code if you attempt to do this too many times, or as soon as you calculate the hashCode (which overflows instantly), should be enough to ensure that no sane developer would keep this kind of code in their code base.
You need to read the full doc and quote it fully:
The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
The actual restriction is in the first sentence. The behavior is unspecified if an element of a set is mutated.
Since adding a set to itself mutates it, and adding it again mutates it again, the result is unspecified.
Note that the restriction is that the behavior is unspecified, and that a special case of that restriction is adding the set to itself.
So the doc says, in other words, that adding a set to itself results in unspecified behavior, which is what you are seeing. It's up to the concrete implementation to deal with (or not).
I agree with you that, from a mathematical perspective, this behavior really doesn't make sense.
There are two interesting questions here: first, to what extent were the designers of the Set interface trying to implement a mathematical set? Secondly, even if they weren't, to what extent does that exempt them from the rules of set theory?
For the first question, I will point you to the documentation of the Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
It's worth mentioning here that current formulations of set theory don't permit sets to be members of themselves. (See the Axiom of regularity). This is due in part to Russell's Paradox, which exposed a contradiction in naive set theory (which permitted a set to be any collection of objects - there was no prohibition against sets including themselves). This is often illustrated by the Barber Paradox: suppose that, in a particular town, a barber shaves all of the men - and only the men - who do not shave themselves. Question: does the barber shave himself? If he does, it violates the second constraint; if he doesn't, it violates the first constraint. This is clearly logically impossible, but it's actually perfectly permissible under the rules of naive set theory (which is why the newer "standard" formulation of set theory explicitly bans sets from containing themselves).
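In symbols (standard notation, not from the original post), Russell's set and its contradiction are:
$$R = \{\, x \mid x \notin x \,\}, \qquad R \in R \iff R \notin R$$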
There's more discussion in this question on Math.SE about why sets cannot be an element of themselves.
With that said, this brings up the second question: even if the designers hadn't been explicitly trying to model a mathematical set, would this be completely "exempt" from the problems associated with naive set theory? I think not - I think that many of the problems that plagued naive set theory would plague any kind of a collection that was insufficiently constrained in ways that were analogous to naive set theory. Indeed, I may be reading too much into this, but the first part of the definition of a Set in the documentation sounds suspiciously like the intuitive concept of a set in naive set theory:
A collection that contains no duplicate elements.
Admittedly (and to their credit), they do place at least some constraints on this later (including stating that you really shouldn't try to have a Set contain itself), but you could question whether it's really "enough" to avoid the problems with naive set theory. This is why, for example, you have a "turtles all the way down" problem when trying to calculate the hash code of a HashSet that contains itself. This is not, as some others have suggested, merely a practical problem - it's an illustration of the fundamental theoretical problems with this type of formulation.
As a brief digression, I do recognize that there are, of course, some limitations on how closely any collection class can really model a mathematical set. For example, Java's documentation warns against the dangers of including mutable objects in a set. Some other languages, such as Python, at least attempt to ban many kinds of mutable objects entirely:
The set classes are implemented using dictionaries. Accordingly, the requirements for set elements are the same as those for dictionary keys; namely, that the element defines both __eq__() and __hash__(). As a result, sets cannot contain mutable elements such as lists or dictionaries. However, they can contain immutable collections such as tuples or instances of ImmutableSet. For convenience in implementing sets of sets, inner sets are automatically converted to immutable form, for example, Set([Set(['dog'])]) is transformed to Set([ImmutableSet(['dog'])]).
Two other major differences that others have pointed out are
Java sets are mutable
Java sets are finite. Obviously, this will be true of any collection class: apart from concerns about actual infinity, computers only have a finite amount of memory. (Some languages, like Haskell, have lazy infinite data structures; however, in my opinion, a lawlike choice sequence seems like a more natural way to model these than classical set theory, but that's just my opinion.)
TL;DR No, it really shouldn't be permitted (or, at least, you should never do that) because sets can't be members of themselves.
Related
Is there any reasons/arguments not to implement a Java collection that restricts its members based on a predicate/constraint?
Given that such functionality seems likely to be needed often, I was expecting it to be implemented already in collections frameworks like Apache Commons or Guava. But while Apache indeed has it, Guava deprecated its version and recommends against similar approaches.
The Collection interface contract states that a collection may place any restrictions on its elements as long as it is properly documented, so I'm unable to see why a guarded collection would be discouraged. What other option is there to, say, ensure an Integer collection never contains negative values without hiding the whole collection?
It is just a matter of preference - look at the thread about checking before vs. checking after - I think that is what it boils down to. Also, checking only on add() is good enough only for immutable objects.
There can hardly be one ("acceptable") answer, so I'll just add some thoughts:
As mentioned in the comments, the Collection#add(E) already allows for throwing an IllegalArgumentException, with the reason
if some property of the element prevents it from being added to this collection
So one could say that this case was explicitly considered in the design of the collection interface, and there is no obvious, profound, purely technical (interface-contract related) reason to not allow creating such a collection.
However, when thinking about possible application patterns, one quickly finds cases where the observed behavior of such a collection could be ... counterintuitive, to say the least.
One was already mentioned by dcsohl in the comments, and referred to cases where such a collection would only be a view on another collection:
List<Integer> listWithIntegers = new ArrayList<Integer>();
List<Integer> listWithPositiveIntegers =
createView(listWithIntegers, e -> e > 0);
//listWithPositiveIntegers.add(-1); // Would throw IllegalArgumentException
listWithIntegers.add(-1); // Fine
// This would be true:
assert(listWithPositiveIntegers.contains(-1));
However, one could argue that
Such a collection would not necessarily have to be only a view. Instead, one could enforce that only new collections with such constraints may be created.
The behavior is similar to that of Collections.unmodifiableCollection(Collection), which is widely accepted as it is. (Although it serves a far broader and omnipresent use-case, namely preventing the internal state of a class from being exposed by returning a modifiable version of a collection via an accessor method.)
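For comparison, a quick sketch of the write-through behavior of the standard unmodifiable view (plain JDK API, nothing hypothetical here):
List<Integer> base = new ArrayList<>(Arrays.asList(1, 2, 3));
List<Integer> view = Collections.unmodifiableList(base);
//view.add(4); // would throw UnsupportedOperationException
base.add(4); // fine - and the change is visible through the view
assert view.contains(4);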
But in this case, the potential for "inconsistencies" is much higher.
For example, consider a call to Collection#addAll(Collection). It also allows throwing an IllegalArgumentException "if some property of an element of the specified collection prevents it from being added to this collection". But there are no guarantees about things like atomicity. To phrase it differently: it is not specified what the state of the collection will be when such an exception is thrown. Imagine a case like this:
List<Integer> listWithPositiveIntegers = createList(e -> e > 0);
listWithPositiveIntegers.add(1); // Fine
listWithPositiveIntegers.add(2); // Fine
listWithPositiveIntegers.addAll(Arrays.asList(3, -4, 5)); // Throws
assert(listWithPositiveIntegers.contains(3)); // True or false?
assert(listWithPositiveIntegers.contains(5)); // True or false?
(It may be subtle, but it may be an issue).
All this might become even trickier when the condition changes after the collection has been created (regardless of whether it is only a view or not). For example, one could imagine a sequence of calls like this:
List<Integer> listWithPredicate = create(predicate);
listWithPredicate.add(-1); // Fine
someMethod();
listWithPredicate.add(-1); // Throws
Where in someMethod(), there is an innocent line like
predicate.setForbiddingNegatives(true);
One of the comments already mentioned possible performance issues. This is certainly true, but I think that this is not really a strong technical argument: there are no formal complexity guarantees for the runtime of any method of the Collection interface anyhow. You don't know how long a collection.add(e) call takes. For a LinkedList it is O(1), but for a TreeSet it is O(log n) (and who knows what n is at that point in time).
Maybe the performance issue and the possible inconsistencies can be considered as special cases of a more general statement:
Such a collection would basically allow arbitrary code to be executed during many operations - depending on the implementation of the predicate.
This may literally have arbitrary implications, and makes reasoning about algorithms, performance and the exact behavior (in terms of consistency) impossible.
The bottom line is: There are many possible reasons to not use such a collection. But I can't think of a strong and general technical reason. So there may be application cases for such a collection, but the caveats should be kept in mind, considering how exactly such a collection is intended to be used.
I would say that such a collection would have too many responsibilities and violate SRP.
The main issue I see here is the readability and maintainability of the code that uses the collection. Suppose you have a collection to which you allow adding only positive integers (Collection<Integer>) and you use it throughout the code. Then the requirements change and you are only allowed to add odd positive integers to it. Because there are no compile time checks, it would be much harder for you to find all the occurrences in the code where you add elements to that collection than it would be if you had a separate wrapper class which encapsulates the collection.
Although of course not even close to such an extreme, it bears some resemblance to using Object reference for all objects in the application.
The better approach is to utilize compile time checks and follow the well-established OOP principles like type safety and encapsulation. That means creating a separate wrapper class or creating a separate type for collection elements.
For example, if you really want to make quite sure that you only work with positive integers in a context, you could create a separate type PositiveInteger extends Number and then add them to a Collection<PositiveInteger>. This way you get compile time safety and converting PositiveInteger to OddPositiveInteger requires much less effort.
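A minimal sketch of such a type (the constructor validation and the equals/hashCode details are my assumptions, not part of the original answer):
public final class PositiveInteger extends Number {
    private final int value;

    public PositiveInteger(int value) {
        if (value <= 0) {
            throw new IllegalArgumentException("not positive: " + value);
        }
        this.value = value;
    }

    @Override public int intValue() { return value; }
    @Override public long longValue() { return value; }
    @Override public float floatValue() { return value; }
    @Override public double doubleValue() { return value; }

    @Override public boolean equals(Object o) {
        return o instanceof PositiveInteger
                && ((PositiveInteger) o).value == value;
    }

    @Override public int hashCode() { return Integer.hashCode(value); }
}
A Collection<PositiveInteger> can then never receive a non-positive value, and the compiler enforces this at every call site.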
Enums are an excellent example of preferring dedicated types vs runtime-constrained values (constant strings or integers).
The interface Set in java.util has the exact same structure as Collection in the same package. In the inheritance hierarchy, AbstractSet is a subtype of both Set and AbstractCollection, both of which are subtypes of Collection. The other immediate descendant of Set is SortedSet, and SortedSet extends only Set.
What I'm wondering is: what's the gain in having Set in java.util - why is it there? If I'm not missing anything, it's not adding anything to the current structure or hierarchy of the API. Everything would be the same if AbstractSet didn't implement Set but just extended AbstractCollection, and SortedSet directly extended Collection.
The only thing I can think of is that Set is there for documentation purposes. It shouldn't be there for further structuring/re-structuring the hierarchy - that would mean structural modifications of the descendants, which doesn't make sense.
I'm looking for verification, or counter-arguments if I'm missing something here.
EDIT: The question is: "Why is Set there - what is it adding to the structure of the APIs?" It is obvious how a set is mathematically particular among collections.
The methods in Set and Collection have the same signatures and return types, but they have different behavioural contracts ... deriving from the fact that a set cannot contain "the same" element more than once. THAT is why they are distinct interfaces.
It is not just documentation. Since Java doesn't do "duck typing", the distinction between Collection and Set is visible in both compile time and runtime type checking.
And the distinction is a useful one. If there was only Collection, then you would not be able to write methods that require a collection with no duplicates as an argument.
You write:
Set is a copy/paste of Collection apart from the comments.
I know that. The comments are the behavioural contract. They are critical. There is no other way to specify how something will behave in Java (see notes 1 and 2 below).
Reference:
Design by contract
1 - In one or two languages, you can specify the behavioural aspect of the "contract" in the language itself. Eiffel is the classical example ... that gave rise to the "design by contract" paradigm.
2 - In fact, the JML system adds formal preconditions, postconditions and invariants to Java, and checks them using an automated theorem prover. The problem is that it would be difficult to fully integrate this with the Java language's type system / static type checker. (How do you statically type check something when the theorem prover says "I don't know" ... because it is not smart enough to prove/disprove the JML assertions in the code?)
A set can't contain duplicate elements. A collection can.
what's the gain in Set in java.util - why is it there?
Separating the Sets from the other Collections lets you write code so that only a Set can be passed in. Here's an example where it's useful:
public void sendMessageTo(Collection<String> addresses) {
    addresses.add("admin@example.com"); // The admin might now be on the list twice, and gets two emails, oops :(
    //do something
}
I want to change the interface to take a Set:
public void sendMessageTo(Set<String> addresses) {
    addresses.add("admin@example.com"); // This will add the admin if they were not already on the list, otherwise it won't, because Sets don't allow duplicates
    //do something
}
A Set is a Collection that contains no duplicates. For more info from the page:
More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)
The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).
If Set did not exist, there would be no way to enforce uniqueness in a Collection. It does not matter that the code is the same as Collection's: Set exists to enforce behavioral restrictions - due to the defined behavior, any class that implements Set must adhere to its behavioral contract.
I have a question about those two interfaces in Java.
Set extends Collection, but doesn't add anything. They are exactly the same.
Am I missing something here ?
Set doesn't allow duplicates.
It's a semantic difference, not a syntactic one.
From the documentation of Collection:
A collection represents a group of objects, known as its elements. Some collections allow duplicate elements and others do not. Some are ordered and others unordered.
From the documentation of Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
That should clarify the difference between a Set and a (the more general interface) Collection.
Good question. I guess the main purpose of explicitly having an interface for the concept of a Set as compared to the concept of a Collection is to actually formally distinguish the concepts. Let's say you're writing a method
void x(Collection<?> c);
You won't have the same idea of what arguments you want to get, as if you were writing
void x(Set<?> s);
The second method expects Collections that contain every element at most once (i.e. Sets). That's a big semantic difference from the first method, which doesn't care whether it receives Sets, Lists or any other type of Collection.
If you look closely, the Javadoc of the Set methods is different as well, explicitly showing the different notions that come into play when talking about Collection or Set.
Collection is a more generic interface, which comprises Lists, Queues, Sets and many more.
Have a look at the 'All Known Subinterfaces' section here.
Everything is in the documentation:
Set - A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
and
Collection - The root interface in the collection hierarchy. A collection represents a group of objects, known as its elements. Some collections allow duplicate elements and others do not. Some are ordered and others unordered. The SDK does not provide any direct implementations of this interface: it provides implementations of more specific subinterfaces like Set and List. This interface is typically used to pass collections around and manipulate them where maximum generality is desired.
It is only to distinguish the implementation and future usage.
This comes from set theory and the dictionary:
Collection - something that is collected; a group of objects or an amount of material accumulated in one location, especially for some purpose or as a result of some process
Set - a collection of distinct objects
Additionally, the Set documentation defines a contract for .equals, which says "only other Sets may be equal to this Set". If we couldn't recognize the other Sets by their type (with instanceof), it would be impossible to implement this.
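Roughly how that plays out in an implementation (a simplified sketch in the spirit of AbstractSet, not the actual JDK source):
@Override
public boolean equals(Object o) {
    if (o == this) return true;
    if (!(o instanceof Set)) return false; // recognizing "other Sets" needs the Set type
    Set<?> other = (Set<?>) o;
    return other.size() == size() && containsAll(other);
}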
If it were only for equals(), it would be possible to have an allowsDuplicates() method on Collection. But there are often cases where APIs want to say "please don't give me duplicates" or "I guarantee that this does not contain duplicates", and in Java there is no way to say in a method declaration "please give only collections whose allowsDuplicates() method returns false". Thus the additional type.
The method hashCode() in class Enum is final and defined as super.hashCode(), which means it returns a number based on the address of the instance, which is a random number from the programmer's point of view.
Defining it e.g. as ordinal() ^ getClass().getName().hashCode() would be deterministic across different JVMs. It would even work a bit better, since the least significant bits would "change as much as possible", e.g., for an enum containing up to 16 elements and a HashMap of size 16, there'd be for sure no collisions (sure, using an EnumMap is better, but sometimes not possible, e.g. there's no ConcurrentEnumMap). With the current definition you have no such guarantee, have you?
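To make the difference concrete, a small sketch (Color is a made-up example enum):
enum Color { RED, GREEN, BLUE }

// Identity-based: typically prints a different value on each JVM run.
System.out.println(Color.RED.hashCode());

// The deterministic alternative proposed above: the same value on every run.
System.out.println(Color.RED.ordinal() ^ Color.class.getName().hashCode());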
Summary of the answers
Using Object.hashCode() compares to a nicer hashCode like the one above as follows:
PROS
simplicity
CONS
speed
more collisions (for any size of a HashMap)
non-determinism, which propagates to other objects, making them unusable for deterministic simulations, ETag computation, and hunting down bugs that depend e.g. on a HashSet iteration order
I'd personally prefer the nicer hashCode, but IMHO no reason weighs much, maybe except for the speed.
UPDATE
I was curious about the speed and wrote a benchmark with surprising results. For the price of a single field per class you can get a deterministic hash code which is nearly four times faster. Storing the hash code in a field of each instance would be even faster, although negligibly so.
The explanation why the standard hash code is not much faster is that it can't be the object's address, as objects get moved by the GC.
UPDATE 2
There are some strange things going on with hashCode performance in general. Even once I understand them, there's still the open question of why System.identityHashCode (reading from the object header) is way slower than accessing a normal object field.
The only reason for using Object's hashCode() and for making it final I can imagine, is to make me ask this question.
First of all, you should not rely on such mechanisms for sharing objects between JVMs. That's simply not a supported use case. When you serialize / deserialize you should rely on your own comparison mechanisms or only "compare" the results against objects within your own JVM.
The reason for letting an enum's hashCode be implemented as Object's hash code (based on identity) is that, within one JVM, there will only be one instance of each enum constant. This is enough to ensure that such an implementation makes sense and is correct.
You could argue like "Hey, String and the wrappers for the primitives (Long, Integer, ...) all have well-defined, deterministic specifications of hashCode! Why don't the enums have it?". Well, to begin with, you can have several distinct string references representing the same string, which means that using super.hashCode would be an error, so these classes necessarily need their own hashCode implementations. For these core classes it made sense to let them have well-defined, deterministic hashCodes.
Why did they choose to solve it like this?
Well, look at the requirements of the hashCode implementation. The main concern is to make sure that each object returns a distinct hash code (unless it is equal to another object). The identity-based approach is super efficient and guarantees this, while your suggestion does not. This requirement is apparently stronger than any "convenience bonus" about easing up on serialization etc.
I think that the reason they made it final is to avoid developers shooting themselves in the foot by rewriting a suboptimal (or even incorrect) hashCode.
Regarding the chosen implementation: it's not stable across JVMs, but it's very fast, avoids collisions, and doesn't need an additional field in the enum. Given the normally small number of instances of an enum class, and the speed of the equals method, I wouldn't be surprised if the HashMap lookup time were bigger with your algorithm than with the current one, due to its additional complexity.
I asked the same question, because I did not see this one: why does Enum's hashCode() refer to the Object hashCode() implementation, instead of the ordinal() function?
I encountered this as a problem when defining my own hash function for an object that relies on an enum's hashCode as one of its components. When checking values in a Set of such objects returned by the function, I expected to visit them in the same order each time, since I define the hashCode myself and so expect elements to fall on the same nodes of the tree. But since the hashCode returned by the enum changes from run to run, this assumption was wrong, and the test could fail once in a while.
So, when I figured out the problem, I started using ordinal instead. I am not sure everyone writing a hashCode for their object realizes this.
So basically, you can't define your own deterministic hashCode while relying on an enum's hashCode; you need to use ordinal instead.
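For example, a deterministic composite hash might look like this (the fields color and name are invented for illustration):
@Override
public int hashCode() {
    int h = 17;
    h = 31 * h + color.ordinal(); // deterministic across JVM runs
    h = 31 * h + name.hashCode(); // String.hashCode() is fully specified
    return h;
}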
P.S. This was too big for a comment :)
The JVM enforces that for an enum constant, only one object will exist in memory. There is no way that you could end up with two different instance objects of the same enum constant within a single VM, not with reflection, not across the network via serialization/deserialization.
That being said, since it is the only object representing this constant, it doesn't matter that its hash code is its address, since no other object can occupy the same address space at the same time. It is guaranteed to be unique and "deterministic" (in the sense that within the same VM the constant will always be the same reference, no matter what it is).
There is no requirement for hash codes to be deterministic between JVMs and no advantage gained if they were. If you are relying on this fact you are using them wrong.
As only one instance of each enum value exists, Object.hashCode() is effectively guaranteed never to collide, is good code reuse and is very fast.
If equality is defined by identity, then Object.hashCode() will always give the best performance.
The determinism of other hash codes is just a side effect of their implementation. As their equality is usually defined by field values, mixing in non-deterministic values would be a waste of time.
As long as we can't send an enum object [1] to a different JVM, I see no reason for putting such a requirement on enums (and objects in general).
[1] I thought it was clear enough - an object is an instance of a class. A serialized object is a sequence of bytes, usually stored in a byte array. I was talking about an object.
One more reason that it is implemented like this, I could imagine, is the requirement for hashCode() and equals() to be consistent, and the design goal that enums should be simple to use and compile-time constant (so they can be used as "case" constants). This also makes it legal to compare enum instances with "==", and you simply wouldn't want "equals" to behave differently from "==" for enums. This again ties hashCode to the default Object.hashCode() reference-based behavior.
As said before, I also don't expect equals() and hashCode() to consider two enum constants from different JVMs as being equal. When talking about serialization: for instance fields typed as enums, the default binary serializer in Java has a special behaviour that serializes only the name of the constant, and on deserialization the reference to the corresponding enum value in the deserializing JVM is re-created. JAXB and other XML-based serialization mechanisms work in a similar way. So: just don't worry.
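A sketch of that round trip (Color is a made-up enum; this relies on the documented special-casing of enum serialization):
ByteArrayOutputStream bos = new ByteArrayOutputStream();
new ObjectOutputStream(bos).writeObject(Color.RED); // writes only the constant's name
Object back = new ObjectInputStream(
        new ByteArrayInputStream(bos.toByteArray())).readObject();
assert back == Color.RED; // deserialization yields the same singleton instance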
Is there any practical difference between a Set and Collection in Java, besides the fact that a Collection can include the same element twice? They have the same methods.
(For example, does Set give me more options to use libraries which accept Sets but not Collections?)
edit: I can think of at least 5 different situations to judge this question. Can anyone else come up with more? I want to make sure I understand the subtleties here.
1. Designing a method which accepts an argument of Set or Collection. Collection is more general and accepts more possibilities of input. (If I'm designing a specific class or interface, I'm being nicer to my consumers and stricter on my subclassers/implementers if I use Collection.)
2. Designing a method which returns a Set or Collection. Set offers more guarantees than Collection (even if it's just the guarantee not to include one element twice). (If I'm designing a specific class or interface, I'm being nicer to my consumers and stricter on my subclassers/implementers if I use Set.)
3. Designing a class that implements the interface Set or Collection. Similar issues as #2. Users of my class/interface get more guarantees, subclassers/implementers have more responsibility.
4. Designing an interface that extends the interface Set or Collection. Very similar to #3.
5. Writing code that uses a Set or Collection. Here I might as well use Set; the only reasons for me to use Collection are if I get back a Collection from someone else's code, or if I have to handle a collection that contains duplicates.
Collection is also the supertype of List, Queue, Deque, and others, so it gives you more options. For example, I try to use Collection as a parameter to library methods that shouldn't explicitly depend on a certain type of collection.
Generally, you should use the right tool for the job. If you don't want duplicates, use Set (or SortedSet if you want ordering, or LinkedHashSet if you want to maintain insertion order). If you want to allow duplicates, use List, and so on.
I think you already have it figured out- use a Set when you want to specifically exclude duplicates. Collection is generally the lowest common denominator, and it's useful to specify APIs that accept/return this, which leaves you room to change details later on if needed. However if the details of your application require unique entries, use Set to enforce this.
Also worth considering is whether order is important to you; if it is, use List, or LinkedHashSet if you care about order and uniqueness.
See Java's Collection tutorial for a good walk-through of Collection usage. In particular, check out the class hierarchy.
As @mmyers states, Collection includes Set, as well as List.
When you declare something as a Set, rather than a Collection, you are saying that the object cannot be a List or any other non-Set kind of Collection. It will always be a Collection, though. So, any function that accepts a Collection will accept a Set, but a function that accepts a Set cannot take an arbitrary Collection (unless you cast it to a Set).
One other thing to consider... Sets have extra overhead in time, memory, and coding in order to guarantee that there are no duplicates. (Time and memory because sets are usually backed by a HashMap or a Tree, which adds overhead over a list or an array. Coding because you have to implement the hashCode() and equals() methods.)
I usually use sets when I need a fast implementation of contains() and use Collection or List otherwise, even if the collection shouldn't have duplicates.
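For example (a tiny sketch; allowedUsers and requestedUser are made-up names):
Set<String> allowedUsers = new HashSet<>(Arrays.asList("alice", "bob", "carol"));
// Expected O(1) membership test, versus O(n) with a List:
boolean ok = allowedUsers.contains(requestedUser);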
You should use a Set when that is what you want.
For example, when you want something like a List without any ordering or duplicates. Methods like contains are quite useful.
A collection is much more generic. I believe that what mmyers wrote on their usage says it all.
The practical difference is that Set enforces the set logic, i.e. no duplicates (and, in general, no ordering), while Collection does not. So if you need a Collection and you have no particular requirement for avoiding duplicates, then use a Collection. If you have the requirement for a Set, then use Set. Generally use the highest interface possible.
As Collection is a supertype of Set and SortedSet, these can be passed to a method which expects a Collection. Collection just means the collection may or may not be sorted or ordered, and may or may not allow duplicates.