Why is Set of java.util there in the API?

The interface Set in java.util has the exact same structure
as Collection of the same package.
In the inheritance hierarchy, AbstractSet is a subtype
of both Set and AbstractCollection, both
of which are subtypes of Collection.
The other immediate descendant of Set is SortedSet,
and SortedSet extends only Set.
What I'm wondering is: what's the gain in having Set in java.util -- why is it there?
If I'm not missing anything, it's not adding anything
to the current structure or hierarchy of the API.
Everything would be the same if AbstractSet didn't
implement Set but just extended AbstractCollection, and SortedSet
directly extended Collection.
The only thing I can think of is that Set is there for documentation purposes.
It shouldn't be for further structuring/re-structuring of the hierarchy -- that would mean
structural modifications of the descendants, which doesn't make sense.
I'm looking for verification or counter-arguments if I'm missing something here.
//===========================================
EDIT: The question is: "Why is Set there -- what is it adding to the structure of the API?"
It's obvious how a set is mathematically distinct among collections.

The methods in Set and Collection have the same signatures and return types, but they have different behavioural contracts ... deriving from the fact that a set cannot contain "the same" element more than once. THAT is why they are distinct interfaces.
It is not just documentation. Since Java doesn't do "duck typing", the distinction between Collection and Set is visible in both compile-time and runtime type checking.
And the distinction is a useful one. If there were only Collection, then you would not be able to write methods that require a collection with no duplicates as an argument.
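As a quick sketch of that compile-time visibility (the method and class names here are invented for illustration):
import java.util.List;
import java.util.Set;

class UniquenessDemo {
    // relies on the no-duplicates contract of Set
    static void registerVoters(Set<String> voterIds) {
        // each id is guaranteed to occur at most once
    }

    public static void main(String[] args) {
        registerVoters(Set.of("alice", "bob"));      // compiles
        // registerVoters(List.of("alice", "bob")); // compile-time error: a List is not a Set
    }
}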
You write:
Set is a copy/paste of Collection apart from the comments.
I know that. The comments are the behavioural contract. They are critical. There is no other way to specify how something will behave in Java (see notes 1 and 2 below).
Reference:
Design by contract
1 - In one or two languages, you can specify the behavioural aspect of the "contract" in the language itself. Eiffel is the classical example ... that gave rise to the "design by contract" paradigm.
2 - In fact, the JML system adds formal preconditions, postconditions and invariants to Java, and checks them using an automated theorem prover. The problem is that it would be difficult to fully integrate this with the Java language's type system / static type checker. (How do you statically type check something when the theorem prover says "I don't know" ... because it is not smart enough to prove/disprove the JML assertions in the code?)

A set can't contain duplicate elements. A collection can.
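A minimal sketch of that difference:
import java.util.*;

class DuplicateDemo {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("a");
        list.add("a");
        System.out.println(list.size()); // prints 2 -- the duplicate is kept

        Set<String> set = new HashSet<>();
        set.add("a");
        boolean added = set.add("a");    // false -- the duplicate is rejected
        System.out.println(set.size());  // prints 1
    }
}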

what's the gain in Set in java.util -- why is it there?
Separating the Sets from the other Collections lets you write code so that only a Set can be passed in. Here's an example where it's useful:
public void sendMessageTo(Collection<String> addresses) {
    addresses.add("admin@example.com"); // the admin might now be on the list twice, and gets two emails, oops :(
    // do something
}
I want to change the interface to take a Set:
public void sendMessageTo(Set<String> addresses) {
    addresses.add("admin@example.com"); // adds the admin only if they were not already present, because Sets don't allow duplicates
    // do something
}

A Set is a Collection that contains no duplicates. From the Set documentation:
More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
The Set interface places additional stipulations, beyond those inherited from the Collection interface, on the contracts of all constructors and on the contracts of the add, equals and hashCode methods. Declarations for other inherited methods are also included here for convenience. (The specifications accompanying these declarations have been tailored to the Set interface, but they do not contain any additional stipulations.)
The additional stipulation on constructors is, not surprisingly, that all constructors must create a set that contains no duplicate elements (as defined above).
If Set did not exist, there would be no way to require uniqueness in a Collection. It does not matter that the method declarations are the same as Collection's: Set exists to impose behavioral restrictions, meaning that, because of the defined behavior, any class that implements Set must adhere to its behavioral contract(s).

Related

Covariant return type best practices

I vaguely remember learning in university that the return type of a method should always be as narrow as possible, but my search for any references online came up empty, and SonarQube calls it a code smell. E.g. in the following example (note that TreeSet and Set are just examples):
public interface NumberGetter {
    Number get();
    Set<Number> getAll();
}

public class IntegerGetter implements NumberGetter {
    @Override
    public Integer get() {
        return 1;
    }

    @Override
    public TreeSet<Number> getAll() {
        return new TreeSet<>(Collections.singleton(1));
    }
}
SonarQube tells me
Declarations should use Java collection interfaces such as "List" rather than specific implementation classes such as "LinkedList". The purpose of the Java Collections API is to provide a well defined hierarchy of interfaces in order to hide implementation details. (squid:S1319)
I see the point about hiding implementation details, since I cannot easily change the return type of IntegerGetter::getAll() without breaking backwards-compatibility. But by doing this, I also provide the consumers with potentially valuable information, i.e. they could change their algorithm to be more appropriate for using a TreeSet. And if they don't care about this property, they can still just use IntegerGetter (however they obtained it), like this:
Set<Number> allNumbers = integerGetter.getAll();
So I have the following questions:
Is it appropriate for IntegerGetter::get() to return a narrower type?
Is it appropriate for IntegerGetter::getAll() to return a narrower type?
Are there any best practices regarding this topic or is the answer just "it depends"?
(N.B. SonarQube doesn't complain if I replace TreeSet with e.g. SortedSet. Is this code smell only about not using a Collection API interface? What if there is only a concrete class for my particular problem?)
The return type must strike a balance between the needs of the caller and the needs of the implementation: the more you tell a caller about the return type, the harder it is to change that type later.
Which is more important will depend on the specific circumstances. Do you ever see this type changing? How valuable is knowing the type for the caller?
In the case of IntegerGetter.get(), it would be very surprising if the return type ever changes, so telling the caller does no harm.
In the case of IntegerGetter.getAll(), it depends on what the caller uses the method for:
If the caller merely wants to iterate, an Iterable would be the right choice.
If they need more methods, such as size, a Collection might be.
If they rely on the numbers being unique, a Set.
If they additionally rely on the numbers being sorted, a SortedSet.
And if it actually needs to be the red-black tree from the JDK, so that the caller can manipulate its internal state directly for an ugly hack, a TreeSet might be the right choice.
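A sketch of how each of those needs could map to a signature (the interface and method names are invented for illustration):
import java.util.*;

interface NumberSource {
    Iterable<Number> forIteration();          // caller only loops
    Collection<Number> withSize();            // caller also needs size(), contains(), ...
    Set<Number> uniqueNumbers();              // caller relies on uniqueness
    SortedSet<Number> sortedUniqueNumbers();  // caller relies on uniqueness and ordering
}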
As a rule of thumb, I try to use in a method signature the most general type (class or interface, it doesn't matter) that supports the API I need: the more general the type, the more minimal the API.
So, if I need a parameter representing a family of objects of the same type, I start with Collection; if ordering matters in the specific scenario, I use List, avoiding publishing any info about the specific List implementation I use internally. My idea is to keep the ability to change the implementation (maybe for performance optimization, to support a different data structure, and so on) without breaking clients.
As you stated, publishing information like "I use a TreeSet" can leave room for client-side optimization -- but my idea is that it depends: case by case, you can evaluate whether the specific scenario justifies relaxing the general rule of exposing the most general interface you can.
So, coming to your questions:
yes, it is appropriate for the IntegerGetter implementation of the NumberGetter interface to return a narrower type. Java allows you to do it, and you are not breaking my "more generic is more beautiful" rule: the NumberGetter interface exposes the more general contract, using Number as the return type, but specific implementations can use a narrower return type. Clients referring to the more abstract interface are not affected by this choice, and clients referring to the specific subclass can take advantage of the more concrete type.
the same as the previous point, but I think it's less useful to clients than the previous case: a client may find it useful to refer to Integer rather than Number (if I explicitly use a NumberGetter, maybe I think in terms of Integers, not Numbers), but referring to TreeSet rather than Set is useful only if you need the API exposed by the subclass and not by the interface.
see the initial discussion.
It's a quasi-philosophic question - and so is my answer: I hope it can be useful to you!
This cannot have a single answer, as it depends on the use case.
And by that I mean that it depends on the degree of flexibility you want your implementation to have, plus the degree of flexibility you want to give the consumers of your API.
I'll give you a more general answer.
Do you want your consumer to loop only? Return a Collection<T>.
Do you want your consumer to be able to access elements by index? Return a List<T>.
Do you want your consumer to know, and be able to efficiently verify, whether an element is present? Return a Set<T>.
And so on.
The same approach is valid for input parameters. What's the point of accepting a List<T>, or even concrete classes such as ArrayList<T> or LinkedList<T>, if you only loop it? You are just giving your code less flexibility for future modifications.
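For instance (printAll is a made-up method), accepting the most general type keeps callers flexible:
import java.util.*;

class PrintDemo {
    // accepts any iterable source: List, Set, ArrayDeque, ...
    static void printAll(Iterable<String> items) {
        for (String s : items) {
            System.out.println(s);
        }
    }
}
// printAll(List.of("a", "b")) and printAll(Set.of("a")) both work;
// a parameter typed ArrayList<String> would reject everything else.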
What you're doing here with the IntegerGetter's inherited methods' return types is called type specialization. It's okay as long as you keep exposing interfaces to the outer world.
My rule of thumb is to be as generic as possible when dealing with the outer world, and as specific as possible when implementing critical (core) parts of my application, to restrict the possible use-cases, protect myself from abusing what I just coded, and for documentation purposes.
What you absolutely shouldn't be doing is using instanceof or class comparison to discover the actual type and take different routes. That ruins a codebase.

Should a HashSet be allowed to be added to itself in Java?

According to the contract for a Set in Java, "it is not permissible for a set to contain itself as an element" (source). However, this is possible in the case of a HashSet of Objects, as demonstrated here:
Set<Object> mySet = new HashSet<>();
mySet.add(mySet);
assertThat(mySet.size(), equalTo(1));
This assertion passes, but I would expect the behavior to be to either have the resulting set be 0 or to throw an Exception. I realize the underlying implementation of a HashSet is a HashMap, but it seems like there should be an equality check before adding an element to avoid violating that contract, no?
Others have already pointed out why it is questionable from a mathematical point of view, by referring to Russell's paradox.
This does not answer your question on a technical level, though.
So let's dissect this:
First, once more the relevant part from the JavaDoc of the Set interface:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Interestingly, the JavaDoc of the List interface makes a similar, although somewhat weaker, and at the same time more technical statement:
While it is permissible for lists to contain themselves as elements, extreme caution is advised: the equals and hashCode methods are no longer well defined on such a list.
And finally, the crux is in the JavaDoc of the Collection interface, which is the common ancestor of both the Set and the List interface:
Some collection operations which perform recursive traversal of the collection may fail with an exception for self-referential instances where the collection directly or indirectly contains itself. This includes the clone(), equals(), hashCode() and toString() methods. Implementations may optionally handle the self-referential scenario, however most current implementations do not do so.
(Emphasis by me)
The bold part is a hint at why the approach that you proposed in your question would not be sufficient:
it seems like there should be an equality check before adding an element to avoid violating that contract, no?
This would not help you here. The key point is that you'll always run into problems when the collection will directly or indirectly contain itself. Imagine this scenario:
Set<Object> setA = new HashSet<Object>();
Set<Object> setB = new HashSet<Object>();
setA.add(setB);
setB.add(setA);
Obviously, neither of the sets contains itself directly. But each of them contains the other - and therefore, itself indirectly. This could not be avoided by a simple referential equality check (using == in the add method).
Avoiding such an "inconsistent state" is basically impossible in practice. Of course, it is possible in theory, using reachability computations over object references. In fact, the garbage collector basically has to do exactly that!
But it becomes impossible in practice when custom classes are involved. Imagine a class like this:
class Container {
    Set<Object> set;

    @Override
    public int hashCode() {
        return set.hashCode();
    }
}
And messing around with this and its set:
Set<Object> set = new HashSet<Object>();
Container container = new Container();
container.set = set;
set.add(container);
The add method of the Set basically has no way of detecting whether the object that is added there has some (indirect) reference to the set itself.
Long story short:
You cannot prevent the programmer from messing things up.
Adding the collection into itself once causes the test to pass. Adding it twice causes the StackOverflowError which you were seeking.
From a personal developer standpoint, it doesn't make any sense to enforce a check in the underlying code to prevent this. The fact that you get a StackOverflowError in your code if you attempt to do this too many times, or calculate the hashCode - which would cause an instant overflow - should be enough to ensure that no sane developer would keep this kind of code in their code base.
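To make that concrete, a small sketch of the failure mode described above:
import java.util.HashSet;
import java.util.Set;

class SelfAddDemo {
    public static void main(String[] args) {
        Set<Object> s = new HashSet<>();
        s.add(s);                     // succeeds: the set is still empty when its hash is computed
        System.out.println(s.size()); // 1
        s.add(s);                     // StackOverflowError: hashing the set now recurses into itself
    }
}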
You need to read the full doc and quote it fully:
The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
The actual restriction is in the first sentence. The behavior is unspecified if an element of a set is mutated.
Since adding a set to itself mutates it, and adding it again mutates it again, the result is unspecified.
Note that the restriction is that the behavior is unspecified, and that a special case of that restriction is adding the set to itself.
So the doc says, in other words, that adding a set to itself results in unspecified behavior, which is what you are seeing. It's up to the concrete implementation to deal with (or not).
I agree with you that, from a mathematical perspective, this behavior really doesn't make sense.
There are two interesting questions here: first, to what extent were the designers of the Set interface trying to implement a mathematical set? Secondly, even if they weren't, to what extent does that exempt them from the rules of set theory?
For the first question, I will point you to the documentation of the Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
It's worth mentioning here that current formulations of set theory don't permit sets to be members of themselves. (See the Axiom of regularity). This is due in part to Russell's Paradox, which exposed a contradiction in naive set theory (which permitted a set to be any collection of objects - there was no prohibition against sets including themselves). This is often illustrated by the Barber Paradox: suppose that, in a particular town, a barber shaves all of the men - and only the men - who do not shave themselves. Question: does the barber shave himself? If he does, it violates the second constraint; if he doesn't, it violates the first constraint. This is clearly logically impossible, but it's actually perfectly permissible under the rules of naive set theory (which is why the newer "standard" formulation of set theory explicitly bans sets from containing themselves).
There's more discussion in this question on Math.SE about why sets cannot be an element of themselves.
With that said, this brings up the second question: even if the designers hadn't been explicitly trying to model a mathematical set, would this be completely "exempt" from the problems associated with naive set theory? I think not - I think that many of the problems that plagued naive set theory would plague any kind of a collection that was insufficiently constrained in ways that were analogous to naive set theory. Indeed, I may be reading too much into this, but the first part of the definition of a Set in the documentation sounds suspiciously like the intuitive concept of a set in naive set theory:
A collection that contains no duplicate elements.
Admittedly (and to their credit), they do place at least some constraints on this later (including stating that you really shouldn't try to have a Set contain itself), but you could question whether it's really "enough" to avoid the problems with naive set theory. This is why, for example, you have a "turtles all the way down" problem when trying to calculate the hash code of a HashSet that contains itself. This is not, as some others have suggested, merely a practical problem - it's an illustration of the fundamental theoretical problems with this type of formulation.
As a brief digression, I do recognize that there are, of course, some limitations on how closely any collection class can really model a mathematical set. For example, Java's documentation warns against the dangers of including mutable objects in a set. Some other languages, such as Python, at least attempt to ban many kinds of mutable objects entirely:
The set classes are implemented using dictionaries. Accordingly, the requirements for set elements are the same as those for dictionary keys; namely, that the element defines both __eq__() and __hash__(). As a result, sets cannot contain mutable elements such as lists or dictionaries. However, they can contain immutable collections such as tuples or instances of ImmutableSet. For convenience in implementing sets of sets, inner sets are automatically converted to immutable form, for example, Set([Set(['dog'])]) is transformed to Set([ImmutableSet(['dog'])]).
Two other major differences that others have pointed out are
Java sets are mutable
Java sets are finite. Obviously, this will be true of any collection class: apart from concerns about actual infinity, computers only have a finite amount of memory. (Some languages, like Haskell, have lazy infinite data structures; however, in my opinion, a lawlike choice sequence seems like a more natural way to model these than classical set theory, but that's just my opinion).
TL;DR No, it really shouldn't be permitted (or, at least, you should never do that) because sets can't be members of themselves.

How to set cardinalities on a Java field?

Let us suppose we have the following class:
public class MyClass {
    List<String> list;

    void method() {
    }
}
Each object of this class has a list of strings, but what if we want to set the cardinality? For example, I want to force this class to have at least 2 strings in that list.
Is there a general pattern to represent the cardinalities on fields?
You simply need to make sure that there are at least 2 elements in list, by some means. There is no standard or simple way of doing this.
This is known as an invariant of the class. It is your responsibility as the person who writes the class to ensure that this invariant is preserved at all times. In particular:
You need to document the invariant that there are at least 2 elements in the list.
You need to ensure that the list contains at least two elements by the time the constructor finishes. The instance should be "valid" (in the sense that its documented invariant are true) when the constructor finishes.
You need to ensure that all code within the class honors the invariant when it manipulates the list - you are allowed to temporarily make it invalid, but you must ensure that it is not possible to observe your instance in an invalid state.
In the single-threaded cases, this simply means that the invariant must be true once the public method returns (note that this includes if an exception occurs); but if your code is designed to be used in a multithreaded way, you must also ensure that no thread can ever see your instance in a state where the invariant is false (e.g. you may need synchronized blocks, or ensure that updates are atomic).
You need to ensure that subclasses of your class are unable to make the invariant false; or document clearly that it is the responsibility of people writing subclasses to ensure that the invariant remains true.
There is a great item in Effective Java 2nd Ed about this: "Design and document for inheritance or else prohibit it".
You need to ensure that nothing outside your class is able to access the reference to the list. For example, your code makes the list visible to other classes in the same package - any of these classes could call theInstance.list.clear(), and invalidate your invariant.
It's pretty hard to prevent this absolutely - for example, it could be possible for malicious actors to invalidate the invariant using reflection. You can prevent this, but it's a question of weighing the effort of identifying and blocking such methods vs the actual cost of the invariant becoming false (this strongly depends upon how this class is used).
By far the easiest way to enforce an invariant is on an immutable class. If it's not possible to change the observable state of an instance, it's not possible to invalidate the invariant. Then, all you need to worry about is a) making sure that the class is immutable; b) making sure that the invariant is true once the constructor returns. All of the other points above then simply fall away.
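A minimal sketch of that immutable approach, assuming an invariant of at least two strings (the class name is hypothetical):
import java.util.List;

public final class AtLeastTwoStrings {
    private final List<String> strings;

    public AtLeastTwoStrings(List<String> strings) {
        if (strings.size() < 2) {
            throw new IllegalArgumentException("at least 2 strings required");
        }
        this.strings = List.copyOf(strings); // defensive, unmodifiable copy
    }

    public List<String> strings() {
        return strings; // already unmodifiable, safe to expose
    }
}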
Is there a general pattern to represent the cardinalities on fields?
Obviously, you can represent the cardinalities using integer fields, either in the objects themselves, or at the meta level. But that's not much help if you cannot enforce them.
There is no general pattern for that. @Andy Turner's answer provides a good summary of the alternatives for enforcing cardinalities. I just want to add a couple of points:
Attempting to enforce the cardinality constraints via static typing is unlikely to work. The Java type system is not rich enough to do this in a pleasant way1.
Construction of objects that have fields that have minimum cardinalities can be tricky, especially if there are potentially circularities involving those fields.
One way to deal with construction is to separate the lifecycle of the objects into a "construction" phase and a "completed" phase. During the construction phase, the constraints are relaxed, to allow the construction to be performed in stages. At some point, a "completed" switch is "flipped". At that point 1) the cardinality constraints are checked, and 2) the behavior of mutating operations is changed to prevent changes that would violate cardinality.
This can be implemented using public constructors and a public method to "flip the switch". Alternatively, you can implement this using the builder pattern; e.g.
make the constructors private
use alternative private means to side-step the cardinalities while building
check the cardinalities and flip the switch (if one is needed) in the build() method.
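A sketch of that builder variant (the class and method names are hypothetical):
import java.util.ArrayList;
import java.util.List;

public final class Team {
    private final List<String> members;

    private Team(List<String> members) { // private constructor
        this.members = members;
    }

    public static final class Builder {
        private final List<String> members = new ArrayList<>();

        public Builder add(String member) { // cardinality side-stepped while building
            members.add(member);
            return this;
        }

        public Team build() { // check the cardinality, then "flip the switch"
            if (members.size() < 2) {
                throw new IllegalStateException("a Team needs at least 2 members");
            }
            return new Team(List.copyOf(members));
        }
    }
}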
Another approach is to allow fields to be below cardinality, but only allow items to be added to the fields when they are in that state. In other words, this is the "flip the switch" approach without an explicit switch. The downside is that a client needs to test if the cardinality constraint is in force yet; i.e. the constraint is weak.
1 - In theory, you could implement a ListOfTwoOrMore interface, etcetera, but that would bring a raft of new problems.
One way to do that is use a different type for the field. Now it has type List<String> which is a collection that can contain 0 or more elements. You can change it to a type which represents a list that contains 2 or more elements.
You could use varargs in the constructor, so that the compiler itself enforces a minimum of two strings:
public MyClass(String s1, String s2, String... rest) {
    list = new ArrayList<>(List.of(s1, s2)); // the signature guarantees at least two elements
    list.addAll(List.of(rest));
}

What is detailed explanation of argument "Subclass Only Where It Makes Sense"?

From the presentation How to Design a Good API and Why it Matters.
I'm stuck on page 25 of the presentation, which says:
Public classes should not subclass other public classes for ease of implementation
And it gave us an examples (Java syntax):
Bad: Properties extends Hashtable
Stack extends Vector
Good: Set extends Collection
But why are those examples bad and good?
Because a Properties is not a Hashtable, and they shouldn't be used interchangeably, i.e., you don't want users to use Properties where they only need Hashtable. Same for Stack vs Vector.
Good design should strive for simplicity of API. If you are designing a Stack, you should basically only provide the push and pop methods. Publicly inheriting from Vector leaks an implementation detail that the user does not need to know. Besides the confusion, this means you can never change the implementation! So if tomorrow Vector gets deprecated (which I believe it effectively is at this point), you are still stuck with a Stack that uses it, because your clients might expect it. Changing the implementation would violate backward compatibility, which is another design goal.
Note that the example above is not random. Both Vector and Hashtable are classes that are considered obsolete (see the last comments here and here). These classes have some design flaws and were replaced by ArrayList and HashMap or similar others. This makes classes that inherit from them obsolete as well. If instead of inheriting you used composition, you could easily swap Vector and Hashtable for their modern counterparts without affecting any user.
On the other hand, Set is a Collection. That is, if some code specifies that it needs some kind of Collection, the user is free to provide a Set (or a List or whatever). This gives more flexibility to the API if there are no specific requirements on what this collection should provide (no random access for example).
Inheriting from a class is typically thought of as implementing an "is-a" relationship.
Is a collection of properties a hashtable? No, not really. The hashtable is an implementation detail, not a fundamental characteristic of "Properties". This implies that you should be using aggregation, like so:
class Properties {
    private Hashtable mPropertyTable;
}
The same goes for Stack and Vector. A Stack isn't a special kind of Vector, so the Vector should be a member of the Stack used for implementation only.
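A sketch of that composition approach for a stack (a simplified, hypothetical class, not java.util.Stack):
import java.util.ArrayDeque;
import java.util.Deque;

public class SimpleStack<E> {
    private final Deque<E> elements = new ArrayDeque<>(); // implementation detail, not exposed

    public void push(E e) { elements.push(e); }
    public E pop() { return elements.pop(); }
    public boolean isEmpty() { return elements.isEmpty(); }
}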
Contrast this with Set deriving from Collection. Is a Set a type of Collection? Yes, it is, so in this case inheritance makes sense.
Bloch is distinguishing inheritance of interface from inheritance of implementation.
It is impossible to know from a slide deck what he said at this point in his presentation, but a typical argument for avoiding one public class inheriting from another is that it permanently ties the subclass to the superclass's implementation. For example, Properties can never implement its property storage in HashMap form, or any other form than a Hashtable.
Another argument is that it couples the classes too tightly together. Modifications that would be beneficial to the superclass can break the subclass, or render its behavior worse.
A third argument is that the subclass inherits methods from the superclass that may not make sense for it, or may do so in the future if the methods are added to the superclass. For example, because Stack extends Vector, it is possible to examine arbitrary elements below the top one, and even to add, remove, or modify internal elements. This one applies to some extent also to inheritance of interface, but it is not usually as much of a problem in that case.
The first two arguments remain applicable to some extent even when the super- and subclasses have an is-a relationship.
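The third argument is easy to demonstrate with java.util.Stack itself:
import java.util.Stack;

class StackLeakDemo {
    public static void main(String[] args) {
        Stack<String> stack = new Stack<>();
        stack.push("bottom");
        stack.push("top");
        stack.set(0, "oops");          // inherited from Vector: mutates an element below the top
        stack.insertElementAt("?", 1); // also inherited: breaks the stack discipline entirely
        System.out.println(stack);     // [oops, ?, top]
    }
}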

Collection<E> and Set<E> are the same?

I have a question about those two interfaces in Java.
Set extends Collection, but doesn't add anything. They are exactly the same.
Am I missing something here ?
Set doesn't allow duplicates.
It's a semantic difference, not a syntactic one.
From the documentation of Collection:
A collection represents a group of objects, known as its elements. Some collections allow duplicate elements and others do not. Some are ordered and others unordered.
From the documentation of Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
That should clarify the difference between a Set and a (the more general interface) Collection.
Good question. I guess the main purpose of explicitly having an interface for the concept of a Set as compared to the concept of a Collection is to actually formally distinguish the concepts. Let's say you're writing a method
void x(Collection<?> c);
You won't have the same idea of what arguments you want to get, as if you were writing
void x(Set<?> s);
The second method expects Collections that contain every element at most once (i.e. Sets). That's a big semantic difference from the first method, which doesn't care whether it receives Sets, Lists or any other type of Collection.
If you look closely, the Javadoc of Set's methods is different as well, explicitly showing the different notions that come into play when talking about a Collection or a Set.
Collection is a more generic interface, which comprises Lists, Queues, Sets and many more.
Have a look at the 'All Known Subinterfaces' section here.
Everything is in the documentation:
Set - A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
and
Collection - The root interface in the collection hierarchy. A collection represents a group of objects, known as its elements. Some collections allow duplicate elements and others do not. Some are ordered and others unordered. The SDK does not provide any direct implementations of this interface: it provides implementations of more specific subinterfaces like Set and List. This interface is typically used to pass collections around and manipulate them where maximum generality is desired.
It is only to distinguish the implementation and future usage.
This comes from set theory and the dictionary:
Collection - something that is collected; a group of objects or an amount of material accumulated in one location, especially for some purpose or as a result of some process.
Set - a collection of distinct objects.
Additionally, the Set documentation defines a contract for .equals, which says "only other Sets may be equal to this Set". If we couldn't recognize the other Sets by their type (with instanceof), it would be impossible to implement this.
If it were only for equals(), it would be possible to have an allowsDuplicates() method on Collection. But there are often cases where APIs want to say "please don't give me duplicates" or "I guarantee that this does not contain duplicates", and in Java there is no way to say in a method declaration "please give me only collections whose allowsDuplicates() method returns false". Hence the additional type.
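In other words (a sketch; process is a made-up method), the constraint must live in the type, because it cannot live anywhere else in the signature:
import java.util.Set;

class Api {
    // Not expressible in Java: void process(Collection<String> c /* where !c.allowsDuplicates() */)
    // The type system carries the constraint instead:
    static void process(Set<String> uniqueItems) {
        // safe to assume no duplicates here
    }
}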
