The question is framed for List, but it applies equally to other types in the Java Collections Framework.
For example, I would say it is certainly appropriate to make a new List sub-type to store something like a counter of additions since it is an integral part of the list's operation and doesn't alter that it "is a list". Something like this:
public class ChangeTrackingList<E> extends ArrayList<E> {
    private int changeCount;
    ...
    @Override public boolean add(E e) {
        changeCount++;
        return super.add(e);
    }
    // other methods likewise overridden as appropriate to track change counter
}
However, what about adding functionality that lies outside the list ADT, such as storing arbitrary data associated with a list element? Assume, of course, that the associated data is properly managed when elements are added and removed. Something like this:
public class PayloadList<E> extends ArrayList<E> {
    private Object[] payload;
    ...
    public void setData(int index, Object data) {
        ... // manage 'payload' array
        payload[index] = data;
    }
    public Object getData(int index) {
        ... // manage 'payload' array, error handling, etc.
        return payload[index];
    }
}
In this case I have altered that it is "just a list" by adding not only additional functionality (behavior) but also additional state. Certainly part of the purpose of type specification and inheritance, but is there an implicit restriction (taboo, deprecation, poor practice, etc.) on Java collections types to treat them specially?
For example, when referencing this PayloadList as a java.util.List, one will mutate and iterate as normal and ignore the payload. This is problematic when it comes to something like persistence or copying which does not expect a List to carry additional data to be maintained. I've seen many places that accept an object, check to see that it "is a list" and then simply treat it as java.util.List. Should they instead allow arbitrary application contributions to manage specific concrete sub-types?
Since this implementation would constantly produce issues in instance slicing (ignoring sub-type fields) is it a bad idea to extend a collection in this way and always use composition when there is additional data to be managed? Or is it instead the job of persistence or copying to account for all concrete sub-types including additional fields?
This is purely a matter of opinion, but personally I would advise against extending classes like ArrayList in almost all circumstances, and favour composition instead.
Even your ChangeTrackingList is rather dodgy. What does
list.addAll(Arrays.asList("foo", "bar"));
do? Does it increment changeCount twice, or not at all? It depends on whether ArrayList.addAll() calls add(), which is an implementation detail you should not have to worry about. You would also have to keep your methods in sync with the ArrayList methods. For example, AbstractCollection.addAll() is implemented on top of add(), whereas the OpenJDK ArrayList.addAll(Collection<? extends E> c) copies c.toArray() directly into its backing array with System.arraycopy and never calls add(). If you overrode addAll() to increment changeCount by c.size() and a future release then changed addAll() to go through add(), every added element would be counted twice.
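To see which way a given JDK goes, a small experiment is enough. The sketch below reuses the idea of ChangeTrackingList under the hypothetical name NaiveCountingList (the changeCount() accessor is an addition for the demo). No expected value is given for the counter on purpose, since that value is exactly the implementation detail under discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Same idea as the ChangeTrackingList in the question: count calls to add(E).
// The class name and the changeCount() accessor are hypothetical.
class NaiveCountingList<E> extends ArrayList<E> {
    private int changeCount;

    @Override
    public boolean add(E e) {
        changeCount++;
        return super.add(e);
    }

    int changeCount() {
        return changeCount;
    }
}

class NaiveCountingDemo {
    public static void main(String[] args) {
        NaiveCountingList<String> list = new NaiveCountingList<>();
        list.addAll(Arrays.asList("foo", "bar"));

        // Two elements are stored either way, but the counter only sees them
        // if ArrayList.addAll() happens to be routed through add(E).
        System.out.println("size = " + list.size());
        System.out.println("changeCount = " + list.changeCount());
    }
}
```

If the two printed numbers differ, the subclass is silently missing changes.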
Also, if a method is ever added to List (this happened with forEach() and stream(), for example, which List inherited in Java 8), it would cause problems if you were already using that method name to mean something else. This can happen when extending abstract classes too, but at least an abstract class such as AbstractList has almost no state of its own, so you are less likely to cause much damage by overriding methods.
I would still use the List interface, and ideally extend AbstractList. Something like this:
public final class PayloadList<E> extends AbstractList<E> implements RandomAccess {
    private final ArrayList<E> list;
    private final Object[] payload;
    // details missing
}
That way you have a class that implements List and makes use of ArrayList without you having to worry about implementation details.
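As a rough illustration of that approach applied to the change counter from the question, here is a minimal sketch. TrackedList and changeCount() are hypothetical names, not part of the original answer. It relies on the documented AbstractList/AbstractCollection behaviour that add(E) and addAll(...) are built on add(int, E), so every insertion funnels through one overridden method.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.RandomAccess;

// Hypothetical sketch: a List that counts structural and element changes,
// built by composition on top of a private ArrayList.
final class TrackedList<E> extends AbstractList<E> implements RandomAccess {
    private final List<E> delegate = new ArrayList<>();
    private int changeCount;

    @Override public E get(int index) { return delegate.get(index); }

    @Override public int size() { return delegate.size(); }

    // AbstractList.add(E) and the inherited addAll(...) both end up here.
    @Override public void add(int index, E element) {
        delegate.add(index, element);
        changeCount++;
        modCount++; // keeps the inherited iterators fail-fast
    }

    @Override public E set(int index, E element) {
        E old = delegate.set(index, element);
        changeCount++;
        return old;
    }

    @Override public E remove(int index) {
        E old = delegate.remove(index);
        changeCount++;
        modCount++;
        return old;
    }

    int changeCount() { return changeCount; }
}

class TrackedListDemo {
    public static void main(String[] args) {
        TrackedList<String> list = new TrackedList<>();
        list.add("a");
        list.addAll(Arrays.asList("foo", "bar"));
        System.out.println(list);               // [a, foo, bar]
        System.out.println(list.changeCount()); // 3
    }
}
```

The count no longer depends on how ArrayList implements addAll(), because the wrapper never inherits ArrayList's methods.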
(By the way, in my opinion, the class AbstractList is amazing. You only have to override get() and size() to have a functioning List implementation and methods like containsAll(), toString() and stream() all just work.)
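A minimal sketch of that claim, using a hypothetical read-only IntRangeList that implements nothing but get() and size():

```java
import java.util.AbstractList;
import java.util.Arrays;

// Hypothetical example: the integers from start (inclusive) to end (exclusive).
final class IntRangeList extends AbstractList<Integer> {
    private final int start;
    private final int end;

    IntRangeList(int start, int end) {
        if (end < start) {
            throw new IllegalArgumentException("end < start");
        }
        this.start = start;
        this.end = end;
    }

    @Override public Integer get(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        return start + index;
    }

    @Override public int size() { return end - start; }
}

class IntRangeListDemo {
    public static void main(String[] args) {
        IntRangeList range = new IntRangeList(1, 4);
        System.out.println(range);                                      // [1, 2, 3]
        System.out.println(range.containsAll(Arrays.asList(1, 3)));     // true
        System.out.println(range.stream().mapToInt(i -> i).sum());      // 6
        System.out.println(range.equals(Arrays.asList(1, 2, 3)));       // true
    }
}
```

Mutating methods such as add() are inherited too, but throw UnsupportedOperationException until you override them.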
One aspect you should consider is that all classes that inherit from AbstractList are value classes. That means that they have meaningful equals(Object) and hashCode() methods, therefore two lists are judged to be equal even if they are not the same instance of any class.
Furthermore, the equals() contract from AbstractList allows any list to be compared with the current list - not just a list with the same implementation.
Now, if you add a value item to a value class when you extend it, you need to include that value item in the equals() and hashCode() methods. Otherwise you will allow two PayloadList lists with different payloads to be considered "the same" when somebody uses them in a map, a set, or just a plain equals() comparison in any algorithm.
But in fact, it's impossible to extend a value class by adding a value item! You'll end up breaking the equals() contract, either by breaking symmetry (a plain ArrayList containing [1,2,3] will return true when compared with a PayloadList containing [1,2,3] with a payload of [a,b,c], but the reverse comparison won't return true) or by breaking transitivity.
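The symmetry problem is easy to reproduce. In this sketch the hypothetical TaggedList stands in for PayloadList, with a single String tag instead of a payload array to keep it short:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for PayloadList: an ArrayList with one extra value field.
class TaggedList<E> extends ArrayList<E> {
    private final String tag;

    TaggedList(String tag, List<E> elements) {
        super(elements);
        this.tag = tag;
    }

    // Includes the extra field, as a value class "should".
    @Override public boolean equals(Object o) {
        if (!(o instanceof TaggedList)) {
            return false;
        }
        return super.equals(o) && tag.equals(((TaggedList<?>) o).tag);
    }

    @Override public int hashCode() {
        return 31 * super.hashCode() + tag.hashCode();
    }
}

class SymmetryDemo {
    public static void main(String[] args) {
        List<Integer> plain = new ArrayList<>(Arrays.asList(1, 2, 3));
        List<Integer> tagged = new TaggedList<>("abc", Arrays.asList(1, 2, 3));

        System.out.println(plain.equals(tagged)); // true: ArrayList only compares elements
        System.out.println(tagged.equals(plain)); // false: plain has no tag
    }
}
```

Relaxing TaggedList.equals() to accept plain lists restores symmetry but then two TaggedLists with different tags both equal the same plain list, which breaks transitivity instead.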
This means that basically, the only proper way to extend a value class is by adding non-value behavior (e.g. a list that logs every add() and remove()). The only way to avoid breaking the contract is to use composition. And it has to be composition that does not implement List at all (because again, other lists will accept anything that implements List and gives the same values when iterating it).
This answer is based on item 8 of Effective Java, 2nd Edition, by Joshua Bloch
If the class is not final, you can always extend it. Everything else is subjective and a matter of opinion.
My opinion is to favor composition over inheritance, since in the long run, inheritance produces low cohesion and high coupling, which is the opposite of a good OO design. But this is just my opinion.
The following is all just opinion; the question invites opinionated answers (I think it's borderline not appropriate for SO).
While your approach is workable in some situations, I'd argue it's bad design and very brittle. It's also pretty complicated to cover all loopholes (maybe even impossible, for example when the list is sorted by means other than List.sort).
If you require extra data to be stored and managed it would be better integrated into the list items themselves, or the data could be associated using existing collection types like Map.
If you really need an association list type, consider making it not an instance of java.util.List, but a specialized type with specialized API. That way no accidents are possible by passing it where a plain list is expected.
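One way to read the "integrate it into the list items" suggestion is sketched below. The Annotated class and its accessors are hypothetical; the point is only that a plain List of such items can be sorted, copied or persisted by code that knows nothing about payloads, and the payload travels with its element.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical element wrapper: the payload lives in the item, not in the list.
final class Annotated<E> {
    private final E value;
    private final Object payload;

    Annotated(E value, Object payload) {
        this.value = value;
        this.payload = payload;
    }

    E value() { return value; }

    Object payload() { return payload; }

    @Override public String toString() { return value + "->" + payload; }
}

class AnnotatedDemo {
    public static void main(String[] args) {
        List<Annotated<String>> items = new ArrayList<>();
        items.add(new Annotated<>("pear", 3));
        items.add(new Annotated<>("apple", 1));

        // Any generic list operation keeps value and payload together.
        items.sort(Comparator.comparing((Annotated<String> a) -> a.value()));
        System.out.println(items); // [apple->1, pear->3]
    }
}
```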
I have often seen declarations like List<String> list = new ArrayList<>(); or Set<String> set = new HashSet<>(); for fields in classes. For me it makes perfect sense to use the interfaces for the variable types to provide flexibility in the implementation. The examples above still fix which kind of collection is used, and therefore which operations are allowed and how it behaves in some cases (according to the docs).
Now consider the case where actually only the functionality of the Collection (or even the Iterable) interface is required to use the field in the class and the kind of Collection doesn't actually matter or I don't want to overspecify it. So I choose for example HashSet as implementation and declare the field as Collection<String> collection = new HashSet<>();.
Should the field then actually be of type Set in this case? Is this kind of declaration bad practice, and if so, why? Or is it good practice to specify the actual type as loosely as possible (while still providing all required methods)? The reason I ask is that I have hardly ever seen such a declaration, and lately I find myself more and more often in situations where I only need the functionality of the Collection interface.
Example:
// Only need Collection features, but decided to use a LinkedList
private final Collection<Listener> registeredListeners = new LinkedList<>();

public void init() {
    ExampleListener listener = new ExampleListener();
    registerListenerSomewhere(listener);
    registeredListeners.add(listener);
    listener = new ExampleListener();
    registerListenerSomewhere(listener);
    registeredListeners.add(listener);
}

public void reset() {
    for (Listener listener : registeredListeners) {
        unregisterListenerSomewhere(listener);
    }
    registeredListeners.clear();
}
Since your example uses a private field it doesn't matter all that much about hiding the implementation type. You (or whoever is maintaining this class) can always just go look at the field's initializer to see what it is.
Depending on how it's used, though, it might be worth declaring a more specific interface for the field. Declaring it to be a List indicates that duplicates are allowed and that ordering is significant. Declaring it to be a Set indicates that duplicates aren't allowed and that ordering is not significant. You might even declare the field to have a particular implementation class if there's something about it that's significant. For example, declaring it to be LinkedHashSet indicates that duplicates aren't allowed but that ordering is significant.
The choice of whether to use an interface, and what interface to use, becomes much more significant if the type appears in the public API of the class, and on what the compatibility constraints on this class are. For example, suppose there were a method
public ??? getRegisteredListeners() {
    return ...
}
Now the choice of return type affects other classes. If you can change all the callers, maybe it's no big deal; you just have to edit other files. But suppose the caller is an application that you have no control over. Now the choice of interface is critical, as you can't change it without potentially breaking those applications. The rule here is usually to choose the most abstract interface that supports the operations you expect callers to want to perform.
Most of the Java SE APIs return Collection. This provides a fair degree of abstraction from the underlying implementation, but it also provides the caller a reasonable set of operations. The caller can iterate, get the size, do a contains check, or copy all the elements to another collection.
Some code bases use Iterable as the most-abstract interface to return. All it does is allow the caller to iterate. Sometimes this is all that's necessary, but it might be somewhat limiting compared to Collection.
Another alternative is to return a Stream. This is helpful if you think the caller might want to use stream operations (such as filter, map, findFirst, etc.) instead of iterating or using collection operations.
Note that if you choose to return Collection or Iterable, you need to make sure that you return an unmodifiable view or make a defensive copy. Otherwise, callers could modify your class's internal data, which would probably lead to bugs. (Yes, even an Iterable can permit modification! Consider getting an Iterator and then calling the remove() method.) If you return a Stream, you don't need to worry about that, since you can't use a Stream to modify the underlying source.
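A small sketch of those options follows. ListenerRegistry and its method names are made up for illustration, and listeners are plain Strings to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical registry showing four ways to expose an internal list.
class ListenerRegistry {
    private final List<String> listeners = new ArrayList<>();

    void register(String listener) { listeners.add(listener); }

    int size() { return listeners.size(); }

    // Leaky: the caller can still remove elements through the Iterator.
    Iterable<String> leakyView() { return listeners; }

    // Read-only view: reflects later changes, rejects modification.
    Collection<String> readOnlyView() {
        return Collections.unmodifiableCollection(listeners);
    }

    // Defensive copy: the caller owns the returned list.
    List<String> snapshot() { return new ArrayList<>(listeners); }

    // Stream: offers no way to modify the source.
    Stream<String> stream() { return listeners.stream(); }
}

class ListenerRegistryDemo {
    public static void main(String[] args) {
        ListenerRegistry registry = new ListenerRegistry();
        registry.register("a");
        registry.register("b");

        Iterator<String> it = registry.leakyView().iterator();
        it.next();
        it.remove(); // modifies the registry's internal list
        System.out.println(registry.size()); // 1

        try {
            registry.readOnlyView().clear();
        } catch (UnsupportedOperationException e) {
            System.out.println("read-only view rejected clear()");
        }

        registry.snapshot().clear(); // only the copy is emptied
        System.out.println(registry.stream().count()); // 1
    }
}
```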
Note that I turned your question about the declaration of a field into a question about the declaration of method return types. There is this idea of "program to the interface" that's quite prevalent in Java. In my opinion it doesn't matter very much for local variables (which is why it's usually fine to use var), and it matters little for private fields, since those (almost) by definition affect only the class in which they're declared. However, the "program to the interface" principle is very important for API signatures, so those cases are where you really need to think about interface types. Private fields, not so much.
(One final note: there is a case where you need to be concerned about the types of private fields, and that's when you're using a reflective framework that manipulates private fields directly. In that case, you need to think of those fields as being public -- just like method return types -- even though they're not declared public.)
As with all things, it's a question of tradeoffs. There are two opposing forces.
The more generic the type, the more freedom the implementation has. If you use Collection you're free to use an ArrayList, HashSet, or LinkedList without affecting the user/caller.
The more generic the return type, the fewer features are available to the user/caller. A List provides index-based lookup. A SortedSet makes it easy to get contiguous subsets via headSet, tailSet, and subSet. A NavigableSet provides efficient O(log n) binary search lookup methods. If you return Collection, none of these are available. Only the most generic access functions can be used.
Furthermore, the sub-types guarantee special properties that Collection does not: Sets hold unique items. SortedSets are sorted. Lists have an order; they're not unordered bags of items. If you use Collection then the user/caller can't necessarily assume that these properties hold. They may be forced to code defensively and, for instance, handle duplicate items even if you know there won't be duplicates.
A reasonable decision process might be:
If O(1) indexed access is guaranteed, use List.
If elements are sorted and unique, use SortedSet or NavigableSet.
If element uniqueness is guaranteed and order is not, use Set.
Otherwise, use Collection.
It really depends on what you want to do with the collection object.
Collection<String> cSet = new HashSet<>();
Collection<String> cList = new ArrayList<>();
In this case you can, if you want, write:
cSet = cList;
But if you declare:
Set<String> cSet = new HashSet<>();
the assignment above is not permitted, although you can still construct a new collection from an existing one using the copy constructor:
Set<String> set = new HashSet<>();
List<String> list = new ArrayList<>();
list = new ArrayList<>(set);
So basically depending on the usage you can use Collection or Set interface.
In the JDK, there's Collections.emptyList() and Collections.emptySet(), each useful in its own right. But sometimes all that is needed is an empty, immutable instance of Collection. To me, there's no reason to choose one over the other, as both implement all operations of Collection efficiently and with the same results. Yet each time I need such an empty collection I ponder which one to use for a second or two.
I do not expect to gain a deeper understanding of the collections framework from an answer to this question but maybe there's a subtle reason I could use to justify choosing one over the other without thinking about it ever again.
An answer should state at least one reason for preferring one of Collections.emptyList() and Collections.emptySet() over the other in a context where they're functionally equivalent. An answer is better if the stated reason is near the top of this list:
There's a case where the type system is happier with one over the other (e.g. type inference allows shorter code with one than the other).
There is a performance difference, maybe in some special case (e.g. if the empty collection is passed as an argument to some of the collection framework's static or instance methods like Collections.sort() or Collection.removeAll()).
Choosing one over the other "makes more sense" in the general case, if you think about it.
Examples where this question arises
To give some context, here are two examples where I am in need of an empty, unmodifiable collection.
This is an example of an API that allows creating some object by optionally specifying a collection of objects that are used in the creation. The second method just calls the first one with an empty collection:
static void createObjectWithTheseThings(Collection<Thing> things) {
    ...
}

static void createObjectWithoutAnyThings() {
    createObjectWithTheseThings(Collections.emptyXXX());
}
This is an example of an Entity with state represented by an immutable collection stored in a non-final field. On initialization the field should be set to an empty collection:
class Example {
    // Initialized to an empty collection.
    private Collection<T> containedThings = Collections.emptyXXX();
    ...
}
Unfortunately I don't have an answer that will make the top of your priority list but if I were you I'd settle on
Collections.emptySet
Type inference was your first priority, but I don't know whether the choice can or should influence that, given that you were really looking for an emptyCollection().
On the second priority, think about any API that takes in a collection and performs differently (accidentally or intentionally) based on the sub-interface of the concrete object passed in. Isn't it more likely to offer varied performance based on the concrete implementation (as with an ArrayList or LinkedList) instead? The empty set and list are not modeled on any real data structure anyway; they are dummy implementations, hence no real difference.
Based on Java's modelling of these interfaces (which admittedly is not ideal), a Collection is very similar to a Set. In fact I think the methods are almost exactly the same. Logically it looks OK too, with List being the more specific subtype that adds ordering concerns.
Now, Collection and Set looking so similar (Java-wise) brings up a question. If you are using a Collection type, it is clear it is not a list you want. So are you sure you don't mean a Set? If you don't, then are you using something like a Bag (surely there must be concrete instances which are not empty in the overall logic)? And if you are concerned with, say, a Bag, then shouldn't it be up to the Bag API to provide an emptyBag() method? Just wondering. By the way, I'd stick with emptySet() in the meantime :)
For the emptyXXX() methods it really doesn't matter at all: both are empty, and both are unmodifiable, so they always stay empty. They are equally suited to all operations Collection offers.
Take a look at what Collections really gives you there: Special implementations (the instances are shared across calls!). All relevant operations are dummy implementations that either return a constant result or immediately throw. Even iterator() is just a dummy with no state.
It won't make any notable difference at all.
Edit: You could say that, for the special case of emptyList()/emptySet(), they are semantically and complexity-wise the same at the Collection interface level. All operations available on Collection are implemented by both as O(1) operations. And since both follow the contract defined by Collection, they are semantically identical too.
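A quick check of that behaviour through the Collection type is sketched below (the rejectsAdd helper is a made-up name). One caveat worth knowing, which comes from the List and Set contracts rather than from anything in the answer above: equals() and hashCode() still differ between the two, which only matters if the empty collection is itself compared or used as a key.

```java
import java.util.Collection;
import java.util.Collections;

class EmptyCollectionsDemo {
    // Hypothetical helper: true if the collection refuses add().
    static boolean rejectsAdd(Collection<String> c) {
        try {
            c.add("x");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Collection<String> asList = Collections.emptyList();
        Collection<String> asSet = Collections.emptySet();

        // Identical as far as ordinary Collection use goes.
        System.out.println(asList.isEmpty() && asSet.isEmpty());        // true
        System.out.println(asList.size() + " " + asSet.size());         // 0 0
        System.out.println(asList.iterator().hasNext());                // false
        System.out.println(rejectsAdd(asList) && rejectsAdd(asSet));    // true

        // Different under the List and Set contracts for equals/hashCode.
        System.out.println(asList.equals(asSet));                       // false
        System.out.println(asList.hashCode() + " " + asSet.hashCode()); // 1 0
    }
}
```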
The only situation I can imagine this making a difference is if the code that will use your Collection does something like this:
Collection<T> collection = ...
List<T> asAList;
if (collection instanceof List) {
    asAList = (List<T>) collection;
} else {
    asAList = new ArrayList<T>(collection);
}
Obviously in a case like this you would want to use emptyList(), while if the secret target type was a Set, you'd want emptySet().
Otherwise, in terms of what "makes more sense", I agree with @ac3's logic that a generic Collection is like a Bag, and an empty immutable Set and empty immutable Bag are pretty much the same thing. However, a person very used to using immutable lists might find those easier to think of.
I don't understand difference between:
ArrayList<Integer> list = new ArrayList<Integer>();
Collection<Integer> list1 = new ArrayList<Integer>();
ArrayList extends a class that implements the Collection interface, so ArrayList itself implements Collection. Maybe list1 allows us to use static methods from the Collection interface?
An interface has no static methods [in Java 7]. list1 gives access only to the methods in Collection, whereas list gives access to all the methods in ArrayList.
It is preferable to declare a variable with its least specific possible type. So, for example, if you change ArrayList into LinkedList or HashSet for any reason, you don't have to refactor large portions of the code (for example, client classes).
Imagine you have something like this (just for illustrative purposes, not compilable):
class CustomerProvider {
    public LinkedList<Customer> getAllCustomersInCity(City city) {
        // retrieve and return all customers for that city
    }
}
and you later decide to implement it returning a HashSet. Maybe there is some client class that relies on the fact that you return a LinkedList, and calls methods that HashSet doesn't have (e.g. LinkedList.getFirst()).
That's why it is better to write it like this:
class CustomerProvider {
    public Collection<Customer> getAllCustomersInCity(City city) {
        // retrieve and return all customers for that city
    }
}
What we're dealing with here is the difference between interface and implementation.
An interface is a set of methods without any regard to how those methods are implemented. When we declare an object as having a type that is actually an interface, what we're saying is that it is an object that implements all of the methods in that interface... but the declaration doesn't provide us with access to any of the other methods in the class that actually provides those implementations.
When you instantiate an object with the type of an implementing class, then you have access to all of relevant methods of that class. Since that class is implementing an interface, you have access to the methods specified in the interface, plus any extras provided by the implementing class.
Why would you want to do this? Well, by restricting the type of your object to the interface, you can switch in new implementations without worrying about changing the rest of your code. This makes it a whole lot more flexible.
The difference, as others have said, is that you are limited to the methods defined by the Collection interface when you specify that as your variable type. But that doesn't answer the question of why you would want to do this.
The reason is that the choice of data type provides information to the people using the code. Especially when used as the parameter or return type from a function (where outside programmers may have no access to the internals).
In order of specificity, here is what different type choices might tell you:
Collection - a group of objects, with no further guarantees. The consumer of this object can iterate over the collection (with no guarantees as to iteration order), learn its size, and test membership, but cannot rely on ordering, uniqueness, or indexed access.
List - a group of objects that have a specific order. When you iterate over these objects, you will always get them in the same order. You can also retrieve specific items from the collection by index, but you cannot make any assumptions about the performance of such retrieval.
ArrayList - a group of objects that have a specific order, and may be accessed by index in constant time.
And although you didn't ask about them, here are some other collection classes:
Set - a group of objects that is guaranteed to contain no duplicates per the equals() method. There are no guarantees regarding the iteration order of these objects.
SortedSet - a group of objects that contains no duplicates, and will always iterate in a specific order (the interface does not say which order; that is determined by the elements' natural ordering or by a Comparator).
TreeSet - a group of ordered objects with no duplicates, that exhibits O(log N) insert and retrieval times.
HashSet - a group of objects with no duplicates, that does not have an inherent order, but provides (amortized) constant-time access.
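To make a few of these guarantees concrete, here is a small sketch that loads the same input into three of the types above. The input values are arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class CollectionGuaranteesDemo {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, 1, 3, 2);

        // List: keeps insertion order and duplicates, supports index access.
        List<Integer> list = new ArrayList<>(input);
        System.out.println(list);        // [3, 1, 3, 2]
        System.out.println(list.get(2)); // 3

        // HashSet: duplicates dropped; iteration order is not part of the contract.
        Set<Integer> hashSet = new HashSet<>(input);
        System.out.println(hashSet.size()); // 3

        // TreeSet: duplicates dropped, iterates in natural (sorted) order.
        SortedSet<Integer> treeSet = new TreeSet<>(input);
        System.out.println(treeSet);         // [1, 2, 3]
        System.out.println(treeSet.first()); // 1
    }
}
```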
The only difference is that you're providing access to list1 through the Collection interface, whereas you provide access to list through the ArrayList class. Sometimes, providing access through a restricted interface is useful, in that it promotes encapsulation and reduces dependence on implementation details.
When you perform operations on "list1", you'll only be able to access methods from the Collection interface (add, size, etc.). By declaring "list" as an ArrayList, you gain access to additional methods only defined in the ArrayList class (ensureCapacity and trimToSize, for example).
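A short sketch of what the compiler allows for each declaration from the question. The first() helper is a made-up name, added only to show how an element is read when all you have is a Collection.

```java
import java.util.ArrayList;
import java.util.Collection;

class DeclaredTypeDemo {
    // Hypothetical helper: Collection has no get(int), so go through the iterator.
    static <T> T first(Collection<T> c) {
        return c.iterator().next();
    }

    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<Integer>();
        Collection<Integer> list1 = new ArrayList<Integer>();

        list.add(10);
        list1.add(10);

        // Available because the declared type is ArrayList:
        list.ensureCapacity(100);
        Integer viaIndex = list.get(0);

        // list1.get(0);              // does not compile: get(int) is not in Collection
        // list1.ensureCapacity(100); // does not compile either
        Integer viaIterator = first(list1);

        System.out.println(viaIndex + " " + viaIterator); // 10 10
    }
}
```

Both variables refer to the same kind of object at runtime; only the declared type limits which methods the compiler lets you call.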
It's typically best practice to declare the variable as the least specific class you need. So, if you only need the methods from Collection, use it. Typically in this case, that would mean using List, which lets you know it's ordered and can handle duplicates.
Using the least specific class/interface allows you to freely change the implementation later. For example, if you later learn that a LinkedList would be a better implementation to use, you could change it without breaking all your code if you define the variable to be a List.
For example:
List<String> list = new ArrayList<String>();
vs
ArrayList<String> list = new ArrayList<String>();
What is the exact difference between these two?
When should we use the first one and when should we use the second?
Use the first form whenever possible (I would even say: use Collection if sufficient). This is especially important when accepting input from client code (method arguments). Sometimes, for the convenience of the client code/library user it is better to accept the most generic input you can (like Collection) and deal with it rather than forcing the user to convert arguments all the time (user has LinkedList but the API requires ArrayList - terrible).
Use the second form only when you need to invoke methods on the list variable that are defined in ArrayList but not in List (like ArrayList.trimToSize()). Also, when returning data to the user, consider (this is not a hard rule) returning more specific types. E.g. consider List over Collection so that client code can deal with the result more easily. However, returning overly specific types (e.g. ArrayList) will lock in your implementation for the future, so try to find a compromise.
This is a general rule - use the most general type you can. Even more general: use common sense.
List is not a superclass, it is an interface.
By using List rather than ArrayList, you make sure that users of your list will only use methods that are defined on List. Meaning that you can change the implementation to (for example) Vector, without breaking the existing code.
So, use the first form.
The first form is the most desirable one because you hide the implementation (ArrayList) from the rest of your code and ensure your code only works with the abstraction (List). The advantage of this is that your code will be more generic and therefore easier to adapt, for example when you change from using an ArrayList to a LinkedList, a Vector or your own List implementation. It also means local changes are less likely to cause changes in other parts of your code (the "ripple effect"), increasing your code's maintainability.
You need the second form when you want to do things with your variable that are not offered by the List interface, for example ensureCapacity or trimToSize.
EDIT: extra explanation of changing the implementation
Here is an example of declaring a variable as a Collection (an even more generic interface in java.util):
public class Example {
    private Collection<String> greetings = new ArrayList<String>();

    public void addGreeting(String greeting) {
        greetings.add(greeting);
    }
}
Now suppose you want to change the implementation in order to store unique greetings, and therefore switch from ArrayList to HashSet. Both are implementations of the Collection interface. This would be easy in this case because all the existing code treats the greetings field as a Collection:
public class Example {
    private Collection<String> greetings = new HashSet<String>();

    public void addGreeting(String greeting) {
        greetings.add(greeting);
    }
}
There is an exception. If there is code which casts the greetings field back to its implementation, this makes that code 'implementation-aware', violating the information-hiding you tried to achieve, for example:
ArrayList<String> greetingList = (ArrayList<String>) greetings;
greetingList.ensureCapacity(42);
Such code would cause a runtime error 'java.lang.ClassCastException: java.util.HashSet incompatible with java.util.ArrayList' if you change the implementation to HashSet, so this practice should be avoided if possible.
There are some advantages to using interfaces over concrete classes:
You are not tied to a concrete implementation (you can easily change it without modifying the code that uses it).
Your code is clearer, as no methods of the concrete class are available.
You need the concrete implementation only if you USE some features of it.
E.g. suppose we have a Matrix interface with two concrete implementations, SparseMatrix and FullMatrix. If you want to multiply them efficiently you CAN use some implementation details of SparseMatrix; otherwise performance MAY be too slow.
I have an object that stores some data in a list. The implementation could change later, and I don't want to expose the internal implementation to the end user. However, the user must have the ability to modify and access this collection of data. Currently I have something like this:
public List<SomeDataType> getData() {
    return this.data;
}

public void setData(List<SomeDataType> data) {
    this.data = data;
}
Does this mean that I have allowed the internal implementation details to leak out? Should I be doing this instead?
public Collection<SomeDataType> getData() {
    return this.data;
}

public void setData(Collection<SomeDataType> data) {
    this.data = new ArrayList<SomeDataType>(data);
}
It just depends: do you want your users to be able to index into the data? If yes, use List. Both are interfaces, so you're not really leaking implementation details; you just need to decide on the minimum functionality needed.
Returning a List is in line with programming to the Highest Suitable Interface.
Returning a Collection would cause ambiguity to the user, as a returned collection could be either: Set, List or Queue.
Independent of the ability to index into the list via List.get(int), do the users (or you) have an expectation that the elements of the collection are in a reliable and predictable order? Can the collection have multiples of the same item? Both of these are expectations of lists that are not common to more general collections. These are the tests I use when determining which abstraction to expose to the end user.
When returning an implementation of an interface or class that is in a tall hierarchy, the rule of thumb is that the declared return type should be the HIGHEST level that provides the minimum functionality that you are prepared to guarantee to the caller, and that the caller reasonably needs. For example, suppose what you really return is an ArrayList. ArrayList implements List and Collection (among other things). If you expect the caller to need to use the get(int x) function, then it won't work to return a Collection, you'll need to return a List or ArrayList. As long as you don't see any reason why you would ever change your implementation to use something other than a list -- say a Set -- then the right answer is to return a List. I'm not sure if there's any function in ArrayList that isn't in List, but if there is, the same reasoning would apply. On the other hand, once you do return a List instead of a Collection, you have now locked in your implementation to some extent. The less you put in your API, the less restriction you put on future improvements.
(In practice, I almost always return a List in such situations, and it has never burned me. But I probably really should return a Collection.)
Using the most general type, which is Collection, makes the most sense unless there is some explicit reason to use the more specific type - List. But whatever you do, if this is an API for public consumption be clear in the documentation what it does; if it returns a shallow copy of the collection say so.
Yes, your first alternative does leak implementation details if it's not part of your interface contract that the method will always return a List. Also, allowing user code to replace your collection instance is somewhat dangerous, because the implementation they pass in may not behave as you expect.
Of course, it's all a matter of how much you trust your users. If you take the Python philosophy that "we're all consenting adults here" then the first method is just fine. If you think that your library will be used by inexperienced developers and you need to do all you can to "babysit" them and make sure they don't do something wrong then it's preferable not to let them set the collection and not to even return the actual collection. Instead return a (shallow) copy of it.
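A minimal sketch of the "babysitting" variant, assuming a generic hypothetical DataHolder in place of the class from the question: the setter copies what it is given and the getter hands out a shallow copy, so neither side can touch the other's list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical holder: copies on the way in and on the way out.
class DataHolder<T> {
    private List<T> data = new ArrayList<>();

    // Shallow copy: the caller may modify the result freely.
    public Collection<T> getData() {
        return new ArrayList<>(data);
    }

    // The holder, not the caller, decides which List implementation is used.
    public void setData(Collection<? extends T> newData) {
        this.data = new ArrayList<>(newData);
    }
}

class DataHolderDemo {
    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList("a", "b"));
        DataHolder<String> holder = new DataHolder<>();
        holder.setData(source);

        source.add("c");          // does not reach the holder
        holder.getData().clear(); // only clears the returned copy

        System.out.println(holder.getData()); // [a, b]
    }
}
```

The copies are shallow, so mutable elements are still shared; that is worth stating in the API documentation.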
It depends on what guarantees you want to provide to the user. If the data is sequential, such that the order of the elements matters and you are allowing duplicates, then use a list. If the order of elements does not matter and duplicates may or may not be allowed, then use a collection. Since you are actually returning the underlying collection, you should not have both a get and a set function, only a get function, since the returned collection may be mutated anyway. Also, providing a set function allows the type of collection to be changed by the user, whereas you probably want the particular type to be controlled by you.
Were I concerned with obscuring internal representation of my data to an outside user, I would use either XML or JSON. Either way, they're fairly universal.