Is there any practical difference between a Set and Collection in Java, besides the fact that a Collection can include the same element twice? They have the same methods.
(For example, does Set give me more options to use libraries which accept Sets but not Collections?)
edit: I can think of at least 5 different situations to judge this question. Can anyone else come up with more? I want to make sure I understand the subtleties here.
1. designing a method which accepts an argument of Set or Collection. Collection is more general and accepts more possibilities of input. (if I'm designing a specific class or interface, I'm being nicer to my consumers and stricter on my subclassers/implementers if I use Collection.)
2. designing a method which returns a Set or Collection. Set offers more guarantees than Collection (even if it's just the guarantee not to include one element twice). (if I'm designing a specific class or interface, I'm being nicer to my consumers and stricter on my subclassers/implementers if I use Set.)
3. designing a class that implements the interface Set or Collection. Similar issues as #2. Users of my class/interface get more guarantees, subclassers/implementers have more responsibility.
4. designing an interface that extends the interface Set or Collection. Very similar to #3.
5. writing code that uses a Set or Collection. Here I might as well use Set; the only reason for me to use Collection is if I get back a Collection from someone else's code, or if I have to handle a collection that contains duplicates.
Collection is also the supertype of List, Queue, Deque, and others, so it gives you more options. For example, I try to use Collection as a parameter to library methods that shouldn't explicitly depend on a certain type of collection.
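To illustrate the "accept Collection, return Set" idea, here is a minimal sketch. The class name DistinctDemo and the method distinct() are made up for this example; they are not from any library.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DistinctDemo {
    // Hypothetical helper: accepts any Collection (List, Set, Queue, ...)
    // and promises the caller a Set, i.e. no duplicates.
    static <T> Set<T> distinct(Collection<? extends T> items) {
        // LinkedHashSet drops duplicates and keeps first-seen order.
        return new LinkedHashSet<>(items);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c");
        Set<String> unique = distinct(words);
        System.out.println(unique); // prints [a, b, c]
    }
}
```

Callers can pass a List, a Queue, or another Set without any conversion, while the return type tells them duplicates are gone.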
Generally, you should use the right tool for the job. If you don't want duplicates, use Set (or SortedSet if you want ordering, or LinkedHashSet if you want to maintain insertion order). If you want to allow duplicates, use List, and so on.
I think you already have it figured out- use a Set when you want to specifically exclude duplicates. Collection is generally the lowest common denominator, and it's useful to specify APIs that accept/return this, which leaves you room to change details later on if needed. However if the details of your application require unique entries, use Set to enforce this.
Also worth considering is whether order is important to you; if it is, use List, or LinkedHashSet if you care about order and uniqueness.
See Java's Collection tutorial for a good walk-through of Collection usage. In particular, check out the class hierarchy.
As mmyers states, Collection includes Set, as well as List.
When you declare something as a Set, rather than a Collection, you are saying that the variable cannot be a List or a Map. It will always be a Collection, though. So, any function that accepts a Collection will accept a Set, but a function that accepts a Set cannot take a Collection (unless you cast it to a Set).
One other thing to consider... Sets have extra overhead in time, memory, and coding in order to guarantee that there are no duplicates. (Time and memory because sets are usually backed by a HashMap or a Tree, which adds overhead over a list or an array. Coding because you have to implement the hashCode() and equals() methods.)
I usually use sets when I need a fast implementation of contains() and use Collection or List otherwise, even if the collection shouldn't have duplicates.
You should use a Set when that is what you want.
For example, when you want something like a List but without any ordering or duplicates. Methods like contains() are quite useful.
A collection is much more generic. I believe that what mmyers wrote on their usage says it all.
The practical difference is that Set enforces the set logic, i.e. no duplicates and no guaranteed ordering, while Collection does not. So if you need a Collection and you have no particular requirement for avoiding duplicates then use a Collection. If you have the requirement for Set then use Set. Generally use the most general interface possible.
As Collection is a supertype of Set and SortedSet, these can be passed to a method which expects a Collection. Collection just means it may or may not be sorted, ordered, or allow duplicates.
I want to understand why CopyOnWriteArraySet does not allow (that is, ignores) duplicate elements. I understand that since it is a Set, it is meant to avoid duplicates.
But according to the Oracle definition:
CopyOnWriteArraySet: A Set that uses an internal CopyOnWriteArrayList for all of its operations.
Oracle java docs
So practically it should allow duplicates. Is it the internal implementation of add() method that restricts the duplicate elements?
So practically it should allow duplicates.
No, it should not. It's a set. If it allowed duplicates, it shouldn't be called a set.
But according to the Oracle definition: CopyOnWriteArraySet: A Set that uses an internal CopyOnWriteArrayList for all of its operations.
This is just helpful information, so that if you're already familiar with CopyOnWriteArrayList, then you will understand the consequences, such as thread safety at the expense of slow writes.
In general, the implementation details shouldn't be your concern.
This class implements Set, so it should behave that way.
It's the job of the authors of this class to ensure that there will be no duplicates, despite using a data structure that's capable of containing duplicates. The authors should also avoid potential performance bottlenecks implied by the underlying data structure, such as linear lookups instead of something faster as usually expected from sets.
Lastly, keep in mind that just because you can do something doesn't mean that you should.
You are mixing up the CONTRACT and the IMPLEMENTATION of that class.
Oracle is free to change the underlying implementation at any point: the only thing that matters is the behavior of this Set-implementing class.
In other words: the fact that "uses list" shows up in the javadoc is only meant to give the user some helpful context knowledge about this set implementation.
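The contract is easy to check for yourself. A small sketch (the class CowSetDemo and the helper addTwice() are just names for this example): add() returns false and leaves the set unchanged when the element is already present, exactly as the Set interface requires.

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

final class CowSetDemo {
    // Hypothetical helper: adds the same value twice and reports
    // what each add() call returned.
    static boolean[] addTwice(Set<String> set, String value) {
        boolean first = set.add(value);
        boolean second = set.add(value);
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        Set<String> set = new CopyOnWriteArraySet<>();
        boolean[] results = addTwice(set, "x");
        // The second add is rejected, so the set still holds one element.
        System.out.println(results[0] + " " + results[1] + " size=" + set.size());
        // prints "true false size=1"
    }
}
```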
I've been looking at a lot of code recently (for my own benefit, as I'm still learning to program), and I've noticed a number of Java projects (from what appear to be well respected programmers) wherein they use some sort of immediate down-casting.
I actually have multiple examples, but here's one that I pulled straight from the code:
public Set<Coordinates> neighboringCoordinates() {
    HashSet<Coordinates> neighbors = new HashSet<Coordinates>();
    neighbors.add(getNorthWest());
    neighbors.add(getNorth());
    neighbors.add(getNorthEast());
    neighbors.add(getWest());
    neighbors.add(getEast());
    neighbors.add(getSouthEast());
    neighbors.add(getSouth());
    neighbors.add(getSouthWest());
    return neighbors;
}
And from the same project, here's another (perhaps more concise) example:
private Set<Coordinates> liveCellCoordinates = new HashSet<Coordinates>();
In the first example, you can see that the method has a return type of Set<Coordinates> - however, that specific method will always only return a HashSet - and no other type of Set.
In the second example, liveCellCoordinates is initially defined as a Set<Coordinates>, but is immediately turned into a HashSet.
And it's not just this single, specific project - I've found this to be the case in multiple projects.
I am curious as to what the logic behind this is. Are there code conventions that would consider this good practice? Does it make the program faster or more efficient somehow? What benefit would it have?
When you are designing a method signature, it is usually better to only pin down what needs to be pinned down. In the first example, by specifying only that the method returns a Set (instead of a HashSet specifically), the implementer is free to change the implementation if it turns out that a HashSet is not the right data structure. If the method had been declared to return a HashSet, then all code that depended on the object being specifically a HashSet instead of the more general Set type would also need to be revised.
A realistic example would be if it was decided that neighboringCoordinates() needed to return a thread-safe Set object. As written, this would be very simple to do—replace the last line of the method with:
return Collections.synchronizedSet(neighbors);
As it turns out, the Set object returned by synchronizedSet() is not assignment-compatible with HashSet. Good thing the method was declared to return a Set!
A similar consideration applies to the second case. Code in the class that uses liveCellCoordinates shouldn't need to know anything more than that it is a Set. (In fact, in the first example, I would have expected to see:
Set<Coordinates> neighbors = new HashSet<Coordinates>();
at the top of the method.)
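Here is a small self-contained sketch of that change. It uses String in place of the question's Coordinates class, and the names NeighborsDemo and neighbors() are invented for the example.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class NeighborsDemo {
    // Stand-in for neighboringCoordinates(); String replaces Coordinates
    // so the example compiles on its own.
    static Set<String> neighbors() {
        Set<String> result = new HashSet<>();
        result.add("N");
        result.add("S");
        // Swapped in later without touching the signature or any caller.
        return Collections.synchronizedSet(result);
    }

    public static void main(String[] args) {
        Set<String> s = neighbors();
        System.out.println(s.contains("N"));      // prints true
        System.out.println(s instanceof HashSet); // prints false: it is a wrapper
    }
}
```

Had the method been declared to return HashSet, the new return statement would not compile.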
Because now if they change the type in the future, any code depending on neighboringCoordinates does not have to be updated.
Let's say you had:
HashSet<Coordinates> c = neighboringCoordinates();
Now, let's say they change their code to use a different implementation of set. Guess what, you have to change your code too.
But, if you have:
Set<Coordinates> c = neighboringCoordinates();
As long as their collection still implements Set, they can change whatever they want internally without affecting your code.
Basically, it's just being the least specific possible (within reason) for the sake of hiding internal details. Your code only cares that it can access the collection as a set. It doesn't care what specific type of set it is, if that makes sense. Thus, why make your code be coupled to a HashSet?
In the first example, that the method will always only return a HashSet is an implementation detail that users of the class should not have to know. This frees the developer to use a different implementation if it is desirable.
The design principle in play here is "always prefer specifying abstract types".
Set is abstract; there is no such concrete class Set - it's an interface, which is by definition abstract. The method's contract is to return a Set - it's up to the developer to choose what kind of Set to return.
You should do this with fields as well, eg:
private List<String> names = new ArrayList<String>();
not
private ArrayList<String> names = new ArrayList<String>();
Later, you may want to change to using a LinkedList - specifying the abstract type allows you to do this with no code changes (except for the initialization of course).
The question is how you want to use the variable. e.g. is it in your context important that it is a HashSet? If not, you should say what you need, and this is just a Set.
Things would be different if you used, e.g., TreeSet here. Declaring it as a plain Set would lose the information that the set is sorted, and if your algorithm relies on this property, changing the implementation to HashSet would be a disaster. In this case the best solution would be to write SortedSet<Coordinates> set = new TreeSet<Coordinates>();. Or imagine you wrote List<String> list = new LinkedList<String>();: that's OK if you want to use list just as a list, but you wouldn't be able to use the LinkedList as a deque any longer, as methods like offerFirst or peekLast are not on the List interface.
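A short sketch of "as specific as needed": each method below declares the narrowest interface that still offers the operations it uses. The class and method names (SpecificEnoughDemo, smallest, pushFrontAndPeekLast) are made up for illustration.

```java
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedList;
import java.util.SortedSet;
import java.util.TreeSet;

final class SpecificEnoughDemo {
    // Needs sortedness, so it asks for SortedSet rather than Set.
    static int smallest(SortedSet<Integer> values) {
        return values.first();
    }

    // Needs both ends of the sequence, so it asks for Deque rather than List.
    static String pushFrontAndPeekLast(Deque<String> deque, String s) {
        deque.offerFirst(s);
        return deque.peekLast();
    }

    public static void main(String[] args) {
        SortedSet<Integer> sorted = new TreeSet<>(Arrays.asList(3, 1, 2));
        System.out.println(smallest(sorted)); // prints 1

        Deque<String> deque = new LinkedList<>(Arrays.asList("a", "b"));
        System.out.println(pushFrontAndPeekLast(deque, "z")); // prints b
    }
}
```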
So the general rule is: Be as general as possible, but as specific as needed. Ask yourself what you really need. Does a certain interface provide all functionality and promises you need? If yes, then use it. Else be more specific, use another interface or the class itself as type.
Here is another reason. It's because more general (abstract) types have fewer behaviors which is good because there is less room to mess up.
For example, let's say you wrote List<User> users = getUsers(); when in fact you could have used a more abstract type: Collection<User> users = getUsers();. Now Bob might wrongly assume that your method returns users in alphabetical order and create a bug. Had you used Collection, there wouldn't have been such confusion.
It's quite simple.
In your example, the method returns Set. From an API designer's point of view this has one significant advantage, compared to returning HashSet.
If at some point the programmer decides to use SuperPerformantSetForDirections, they can do it without changing the public API, as long as the new class implements Set.
The trick is "code to the interface".
The reason for this is that in 99.9% of the cases you just want the behavior from HashSet/TreeSet/WhateverSet that conforms to the Set-interface implemented by all of them. It keeps your code simpler, and the only reason you actually need to say HashSet is to specify what behavior the Set you need has.
As you may know, HashSet is relatively fast but returns items in seemingly random order. TreeSet is a bit slower, but returns items in sorted order (alphabetical for Strings). Your code does not care, as long as it behaves like a Set.
This gives simpler code, easier to work with.
Note that the typical choice for a Set is HashSet, for a Map HashMap, and for a List ArrayList. If you use a non-typical (for you) implementation, there should be a good reason for it (like needing the sorting) and that reason should be put in a comment next to the new statement. It makes life easier for future maintainers.
What is the most efficient way of maintaining a list that does not allow duplicates, but maintains insertion order and also allows the retrieval of the last inserted element in Java?
Try LinkedHashSet, which keeps the order of input.
Note that re-inserting an element that is already in the set does not change its position in the insertion order. If you want a re-added element to move to the end, remove it first and then add it again.
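A quick sketch of how LinkedHashSet behaves here. Its Javadoc states that insertion order is not affected when an element is re-inserted. The helper last() is a made-up name; it simply walks the set, which is O(n). (On Java 21 and later, LinkedHashSet also has getLast(), so the helper is only needed on older JDKs.)

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class InsertionOrderDemo {
    // Hypothetical helper: returns the last element in iteration order,
    // or null for an empty set. Linear time.
    static <T> T last(Set<T> set) {
        T last = null;
        for (T item : set) {
            last = item;
        }
        return last;
    }

    public static void main(String[] args) {
        Set<String> set = new LinkedHashSet<>();
        set.add("a");
        set.add("b");
        set.add("a");                  // already present: position unchanged
        System.out.println(set);       // prints [a, b]
        System.out.println(last(set)); // prints b

        set.remove("a");               // remove, then add, to move it to the end
        set.add("a");
        System.out.println(set);       // prints [b, a]
    }
}
```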
Edit:
You could also try the Apache Commons Collections class ListOrderedSet, which according to the JavaDoc (if I didn't misread anything again :) ) decorates a set in order to keep insertion order and provides a get(index) method.
Thus, it seems you can get what you want by using new ListOrderedSet(new HashSet());
Unfortunately this class doesn't provide a generic parameter, but it might get you started.
Edit 2:
Here's a project that seems to represent commons collections with generics, i.e. it has a ListOrderedSet<E> and thus you could for example call new ListOrderedSet<String>(new HashSet<String>());
I don't think there's anything in the JDK which does this.
However, LinkedHashMap, which is used as the basis for LinkedHashSet, comes close: it maintains a circular doubly-linked list of the entries in the map. It only tracks the head of the list not the tail, but because the list is circular, header.before is the tail (the most recently inserted element).
You could therefore implement what you need on top of this. LinkedHashMap has not been designed for extension, so this is somewhat awkward. You could copy the code into your own class and add a suitable last() method (be aware of licensing issues here), or you could extend the existing class, and add a method which uses reflection to get at the private header and before fields.
That would get you a Map, rather than a Set. However, HashSet is already a wrapper which makes a Map look like a Set. Again, it is not designed for general extension, but you could write a subclass whose constructor calls the super constructor, then uses more reflection to replace the superclass's value of map with an instance of your new map. From there on, the class should do exactly what you want.
As an aside, the library classes here were all written by Josh Bloch and Neal Gafter. Those guys are two of the giants of Java. And yet the code in there is largely horrible. Never meet your heroes.
Just use a TreeSet.
How does a TreeSet, HashSet or LinkedHashSet behave when the objects are mutable? I cannot imagine that they would work in any sense?
If I modify an object after I have added it; what is the behaviour of the list?
Is there a better option for dealing with a collection of mutable objects (which I need to sort/index/etc) other than a linked list or an array and simply iterating through them each time?
The Set interface addresses this issue directly: "Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element."
Addendum:
Is there a better option for dealing with a collection of mutable objects?
When trying to decide which collection implementation is most suitable, it may be worth looking over the core collection interfaces. For Set implementations in particular, as long as equals() and hashCode() are implemented correctly, any unrelated attributes may be mutable. By analogy with a database relation, any attribute may change, but the primary key must be inviolate.
Being mutable is only a problem for the collection if the object's hashCode() or the result of its comparison methods (equals(), compareTo()) changes after it has been inserted.
The way you could handle this is to remove the object from the collection before such a change and re-add it afterwards.
In essence this results in an immutable object from the collection's point of view.
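The remove, change, re-add pattern might look like this. Point and moveTo() are invented for the example; the important part is that the object is taken out of the set before its hashCode changes.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class MutableElementDemo {
    // Hypothetical mutable element whose equals/hashCode depend on its fields.
    static final class Point {
        int x;
        int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    // Takes the element out, changes it, then puts it back under its new hash.
    static void moveTo(Set<Point> set, Point p, int newX, int newY) {
        boolean wasPresent = set.remove(p);
        p.x = newX;
        p.y = newY;
        if (wasPresent) {
            set.add(p);
        }
    }

    public static void main(String[] args) {
        Set<Point> set = new HashSet<>();
        Point p = new Point(1, 2);
        set.add(p);
        moveTo(set, p, 5, 6);
        System.out.println(set.contains(p));               // prints true
        System.out.println(set.contains(new Point(1, 2))); // prints false
    }
}
```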
Another less performant way could be to keep a set containing all objects and creating a TreeSet/HashSet when you need the set to be sorted or indexed. This is no real solution for a situation where the objects change constantly and you need map access at the same time.
The "best" way to deal with this situation is to keep ancillary data structures for lookup, a bit like indexes in a database. Then all of your modifications need to make sure the indexes are updated. Good examples would be maps or multimaps - before an update, remove the entry from any indexes, and then after an update add them back in with the new values. Obviously this needs care with concurrency etc.
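A rough sketch of the index idea, ignoring concurrency entirely. UserDirectory, User, and rename() are hypothetical names; the point is only that every change to an indexed attribute goes through one method that keeps the index in step.

```java
import java.util.HashMap;
import java.util.Map;

final class UserDirectory {
    // Hypothetical mutable entity; name is the indexed attribute.
    static final class User {
        final int id;
        String name;

        User(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Ancillary lookup structure, like a database index on name.
    private final Map<String, User> byName = new HashMap<>();

    void add(User user) {
        byName.put(user.name, user);
    }

    User find(String name) {
        return byName.get(name);
    }

    // All renames go through here so the index never goes stale:
    // drop the old index entry, change the object, add the new entry.
    void rename(User user, String newName) {
        byName.remove(user.name);
        user.name = newName;
        byName.put(user.name, user);
    }
}
```

With several indexes you would repeat the remove/put pair for each one inside the same method.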
Let's say I want to put words in a data structure and I want to have constant time lookups to see if the word is in this data structure. All I want to do is to see if the word exists. Would I use a HashMap (containsKey()) for this? HashMaps use key->value pairings, but in my case I don't have a value. Of course I could use null for the value, but even null takes space. It seems like there ought to be a better data structure for this application.
The collection could potentially be used by multiple threads, but since the objects contained by the collection would not change, I do not think I have a synchronization/concurrency requirement.
Can anyone help me out?
Use HashSet instead. It's a hash implementation of Set, which is used primarily for exactly what you describe (an unordered set of items).
You'd generally use an implementation of Set, most usually HashSet. If you did need concurrent access, note that the JDK has no ConcurrentHashSet class; ConcurrentHashMap.newKeySet() (Java 8+) or Collections.newSetFromMap(new ConcurrentHashMap<>()) gives you a Set that provides safe, concurrent access, including safe iteration over the set.
I'd recommend in any case referring to it as simply a Set throughout your code, except in the one place where you construct it; that way, it's easier to drop in one implementation for the other if you later require it.
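For the word-lookup case that could be as simple as the sketch below (WordLookupDemo and wordSet() are names made up here). Only the construction site mentions HashSet; everything else sees a Set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WordLookupDemo {
    // The only place that names the concrete implementation.
    static Set<String> wordSet(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        Set<String> words = wordSet("apple", "pear", "plum");
        System.out.println(words.contains("pear"));  // prints true
        System.out.println(words.contains("grape")); // prints false
    }
}
```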
Even if the set is read-only, if it's used by a thread other than the one that creates it, you do need to think about safe publication (that is, making sure that any other thread sees the set in a consistent state: remember any memory writes, even in constructors, aren't guaranteed to be made available to other threads when or in the order you expect, unless you take steps to ensure this). This can be done by both of the following:
making sure the only reference(s) to the set are in final fields;
making sure that it really is true that no thread modifies the set.
You can help to ensure the latter by using the Collections.unmodifiableSet() wrapper. This gives you an unmodifiable view of the given set-- so provided no other "normal" reference to the set escapes, you're safe.
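Putting those two points together, a minimal sketch could look like this. Dictionary is a hypothetical class; it copies its input so no outside reference to the underlying set exists, wraps the copy as unmodifiable, and stores it in a final field.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class Dictionary {
    // final field: visible to other threads in a consistent state once the
    // constructor has finished (provided 'this' does not escape before then).
    private final Set<String> words;

    Dictionary(Collection<String> source) {
        // Defensive copy, so the caller's collection cannot change our set,
        // then an unmodifiable view so nobody can change it through us.
        this.words = Collections.unmodifiableSet(new HashSet<>(source));
    }

    boolean contains(String word) {
        return words.contains(word);
    }

    // Safe to hand out: mutating calls throw UnsupportedOperationException.
    Set<String> words() {
        return words;
    }
}
```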
You probably want to use a java.util.Set. Implementations include java.util.HashSet, which is the Set equivalent of HashMap.
Even if the objects contained in the collection do not change, you may need to do synchronization. Do new objects need to be added to the Set after the Set is passed to a different thread? If so, you can use Collections.synchronizedSet() to make the Set thread-safe.
If you have a Map with values, and you have some code that just wants to treat the Map as a Set, you can use Map.keySet() (though keep in mind that keySet() returns a Set view of the keys in the Map; if the Map is mutable, entries can be removed from the Map through the set returned by keySet()).
You want to use a Collection implementing the Set interface, probably HashSet to get the performance you stated. See http://java.sun.com/javase/6/docs/api/java/util/Set.html
Other than Sets, in some circumstances you might want to convert a Map into a Set with Collections.newSetFromMap(Map<E,Boolean>) (some Maps disallow null values, hence the Boolean).
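One common use of Collections.newSetFromMap is to get a thread-safe Set backed by a ConcurrentHashMap. A minimal sketch (ConcurrentSetDemo and newConcurrentSet() are names chosen for this example):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrentSetDemo {
    // The backing map must be empty when passed in and must not be
    // accessed directly afterwards.
    static <T> Set<T> newConcurrentSet() {
        return Collections.newSetFromMap(new ConcurrentHashMap<T, Boolean>());
    }

    public static void main(String[] args) {
        Set<String> words = newConcurrentSet();
        words.add("apple");
        words.add("apple"); // duplicate, ignored
        System.out.println(words.size());            // prints 1
        System.out.println(words.contains("apple")); // prints true
    }
}
```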
As everyone said, HashSet is probably the simplest solution, but lookup in a HashSet is only constant time on average, not guaranteed (because entries that land in the same bucket are chained), and you will store a dummy object (always the same one) for every entry...
For information, here is a list of data structures; maybe you'll find one that better fits your needs.