In an interview, I was asked the following question:
Your application requires you to store objects such that the order of
entries returned while iterating through the structure is
deterministic. In other words, if you iterate over the same
structure twice, the order of elements returned in both iterations
will be the same. Which of the following classes would you use?
Assume that the structure is not mutated. (Check ANY that apply)
HashMap
LinkedHashSet
Hashtable
LinkedHashMap
TreeSet
TreeMap
I suggested using a LinkedHashSet. Was this the correct answer? Why or why not?
A deterministic order just means that it's consistently reproducible: the same input will always produce the same iteration order. In that sense, the answer is "all of the above". Although the ordering of most Sets and Maps can't be relied upon, it is still deterministic, and will remain the same until the underlying implementation changes (e.g., if you change or upgrade JVMs).
A predictable order is something more, though: it means that the collection guarantees the order in which items are returned when iterating over it. Both "Linked" types listed above do that: the order in which items were inserted into the collection is the order in which they are returned when iterating over it. The "Tree" types also guarantee a deterministic iteration order: a sorted one.
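To make the difference concrete, here is a minimal sketch (names are mine) contrasting the two kinds of guaranteed ordering:

import java.util.*;

Map<String, Integer> linked = new LinkedHashMap<>();
Map<String, Integer> tree = new TreeMap<>();
for (String key : List.of("banana", "apple", "cherry")) {
    linked.put(key, key.length());
    tree.put(key, key.length());
}
System.out.println(linked.keySet()); // [banana, apple, cherry] -- insertion order
System.out.println(tree.keySet());   // [apple, banana, cherry] -- sorted order

A plain HashMap would also print the same (deterministic) order on every run under the same JVM, but that order is unspecified and may change between Java versions.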
As noted by @Elliott Frisch, they are all "deterministic" and will iterate in the same order if nothing has changed. That said, to paraphrase Animal Farm, some collections are more deterministic than others. :-)
Hash... collections have a deterministic iteration order which the JVM can "predict", but which is very challenging for a human to predict and not worth the effort. In practice, they are not "predictable". As @Mureinik points out, the order is officially "unspecified" and subject to change if you change JVMs. The API docs describe this as "generally chaotic ordering", and all sane programmers would agree.
Linked... collections have "predictable iteration order" in that they iterate in the order elements were inserted, with the important caveat that if you insert the same element twice it retains the original order. i.e.
Set<String> set = new LinkedHashSet<>();
set.add("Tom");
set.add("Fred");
set.add("Tom"); // duplicate: ignored, "Tom" keeps its original position
would iterate "Tom", "Fred", not "Fred", "Tom".
This is clearly "more predictable" than Hash..., but still a bit challenging if elements get inserted multiple times and ordering is crucial. For stuff like properties files, XML, or JSON, Linked... collections are generally a good choice as they maintain the original order for nicer human viewing and comparison.
Tree... collections iterate the "most predictably", using the ordering provided by a Comparator at construction time, or else the "natural ordering" if the elements are Comparable. Assuming you have a predictable comparison method, they are completely predictable. In the Tom/Fred example, it would always iterate as "Fred", "Tom", unless your Comparator is unusual.
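For example (a small sketch; the reverse-order comparator is only for illustration):

import java.util.*;

Set<String> natural = new TreeSet<>();
natural.addAll(List.of("Tom", "Fred"));
System.out.println(natural); // [Fred, Tom] -- natural (alphabetical) ordering

Set<String> reversed = new TreeSet<>(Comparator.reverseOrder());
reversed.addAll(List.of("Tom", "Fred"));
System.out.println(reversed); // [Tom, Fred] -- ordering defined by the Comparator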
When answering this type of question, I would strongly suggest doing so according to the Java API specification, not based on your assumptions about the implementations. So even though you could argue that all of those collections would have a deterministic iteration order provided they are not mutated between repeated iterations (because you think it would not make sense to implement them any other way), the only real answer among the options you listed, strictly adhering to the Java API specification, would be all except HashMap and Hashtable.
The reason is that, of all of them, the only classes whose specification gives no guarantee on iteration order are those two (HashMap and Hashtable). So in general, when you program in Java, you should never assume a specific implementation of the API, the JVM, or anything else that is governed by a specification; rely only on what the specification itself guarantees.
So, as an example, what could be the problem with assuming a particular implementation, say of Hashtable? The JDK used to run your program (which is not necessarily the same one you use for development) could implement Hashtable with certain optimizations, as long as they don't violate the Java API specification. Such an optimization could be, for example, that if a Hashtable is not used for 10 minutes, the rehash method is invoked automatically in order to reorganize and optimize access to its entries. And this is a possible scenario because, as noted at https://docs.oracle.com/javase/7/docs/api/java/util/Hashtable.html, "The exact details as to when and whether the rehash method is invoked are implementation-dependent".
As the documentation of LinkedHashSet states, it is
Hash table and linked list implementation of the Set interface, with
predictable iteration order. This implementation differs from HashSet
in that it maintains a doubly-linked list running through all of its
entries.
So it's essentially a HashSet with a FIFO queue of keys implemented by a linked list. Considering that LinkedList is a Deque and permits, in particular, insertion at the beginning, I wonder why LinkedHashSet doesn't have an addFirst(E e) method in addition to the methods present in the Set interface. It doesn't seem hard to implement.
As Elliott Frisch said, the answer is in the next sentence of the paragraph you quoted:
… This linked list defines the iteration ordering, which is the order
in which elements were inserted into the set (insertion-order). …
An addFirst method would break the insertion order and thereby the design idea of LinkedHashSet.
If I may add a bit of guesswork too, other possible reasons might include:
It's not as simple to implement as it appears, since a LinkedHashSet is really implemented as a LinkedHashMap in which the mapped values are not used. At the very least you would have to change that class too (which in turn would also break its insertion order and thereby its design idea).
As that other guy may have intended in a comment, they didn’t find it useful.
That said, you are asking the question the wrong way around. They designed a class with a functionality for which they saw a need. They moved on to implement it using a hash table and a linked list. You are starting out from the implementation and using it as a basis for a design discussion. While that may occasionally add something useful, generally it’s not the way to good designs.
While I can in theory follow your point that there might be a situation where you want a double-ended queue with set semantics (duplicates are ignored/eliminated), I have a hard time imagining a case where a plain Deque would not fulfil your needs (Elliott Frisch mentioned the under-used ArrayDeque). You need pretty large amounts of data and/or pretty strict performance requirements before the linear complexity of contains and remove becomes prohibitive. And in that case you may already be better off designing your own custom data structure.
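If you really did need addFirst plus uniqueness, a thin wrapper along these lines (my own sketch, not a JDK class) would usually suffice; note the linear contains check mentioned above:

import java.util.ArrayDeque;
import java.util.Deque;

// A minimal "deque without duplicates". contains() is O(n),
// which only matters for very large deques.
class UniqueDeque<E> {
    private final Deque<E> deque = new ArrayDeque<>();

    boolean addFirst(E e) {
        if (deque.contains(e)) return false; // duplicate: ignored
        deque.addFirst(e);
        return true;
    }

    boolean addLast(E e) {
        if (deque.contains(e)) return false;
        deque.addLast(e);
        return true;
    }
}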
According to the contract for a Set in Java, "it is not permissible for a set to contain itself as an element" (source). However, this is possible in the case of a HashSet of Objects, as demonstrated here:
Set<Object> mySet = new HashSet<>();
mySet.add(mySet); // the set now contains itself as its only element
assertThat(mySet.size(), equalTo(1)); // Hamcrest assertion; it passes
This assertion passes, but I would expect the behavior to be to either have the resulting set be 0 or to throw an Exception. I realize the underlying implementation of a HashSet is a HashMap, but it seems like there should be an equality check before adding an element to avoid violating that contract, no?
Others have already pointed out why it is questionable from a mathematical point of view, by referring to Russell's paradox.
This does not answer your question on a technical level, though.
So let's dissect this:
First, once more the relevant part from the JavaDoc of the Set interface:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
Interestingly, the JavaDoc of the List interface makes a similar, although somewhat weaker, and at the same time more technical statement:
While it is permissible for lists to contain themselves as elements, extreme caution is advised: the equals and hashCode methods are no longer well defined on such a list.
And finally, the crux is in the JavaDoc of the Collection interface, which is the common ancestor of both the Set and the List interface:
Some collection operations which perform recursive traversal of the collection may fail with an exception for self-referential instances where the collection directly or indirectly contains itself. This includes the clone(), equals(), hashCode() and toString() methods. Implementations may optionally handle the self-referential scenario, however most current implementations do not do so.
(Emphasis mine, on "directly or indirectly".)
That part is a hint at why the approach that you proposed in your question would not be sufficient:
it seems like there should be an equality check before adding an element to avoid violating that contract, no?
This would not help you here. The key point is that you'll always run into problems when the collection will directly or indirectly contain itself. Imagine this scenario:
Set<Object> setA = new HashSet<Object>();
Set<Object> setB = new HashSet<Object>();
setA.add(setB);
setB.add(setA);
Obviously, neither of the sets contains itself directly. But each of them contains the other - and therefore, itself indirectly. This could not be avoided by a simple referential equality check (using == in the add method).
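Once the two sets contain each other, any operation that recurses into the elements blows up. Continuing the snippet above:

setA.hashCode(); // StackOverflowError: the hash of setA needs the hash of setB, which needs the hash of setA, ...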
Avoiding such an "inconsistent state" is basically impossible in practice. Of course it is possible in theory, using reachability computations over the object graph. In fact, the garbage collector basically has to do exactly that!
But it becomes impossible in practice when custom classes are involved. Imagine a class like this:
class Container {
    Set<Object> set;

    @Override
    public int hashCode() {
        // delegates to the contained set's hashCode
        return set.hashCode();
    }
}
And messing around with this and its set:
Set<Object> set = new HashSet<>();
Container container = new Container();
container.set = set;
set.add(container); // the set now indirectly contains itself
The add method of the Set basically has no way of detecting whether the object that is added there has some (indirect) reference to the set itself.
Long story short:
You cannot prevent the programmer from messing things up.
Adding the collection into itself once causes the test to pass. Adding it twice causes the StackOverflowError which you were seeking.
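In code (a minimal sketch):

Set<Object> set = new HashSet<>();
set.add(set); // passes: the hash is computed while the set is still empty
set.add(set); // StackOverflowError: hashCode() now recurses into the set itself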
From a developer's standpoint, it doesn't make any sense to enforce a check in the underlying code to prevent this. The fact that you get a StackOverflowError if you attempt to do this too many times, or if you calculate the hashCode (which causes an instant overflow), should be enough to ensure that no sane developer would keep this kind of code in their code base.
You need to read the full doc and quote it fully:
The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
The actual restriction is in the first sentence. The behavior is unspecified if an element of a set is mutated.
Since adding a set to itself mutates it, and adding it again mutates it again, the result is unspecified.
Note that the restriction is that the behavior is unspecified, and that a special case of that restriction is adding the set to itself.
So the doc says, in other words, that adding a set to itself results in unspecified behavior, which is what you are seeing. It's up to the concrete implementation to deal with (or not).
I agree with you that, from a mathematical perspective, this behavior really doesn't make sense.
There are two interesting questions here: first, to what extent were the designers of the Set interface trying to implement a mathematical set? Secondly, even if they weren't, to what extent does that exempt them from the rules of set theory?
For the first question, I will point you to the documentation of the Set:
A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2), and at most one null element. As implied by its name, this interface models the mathematical set abstraction.
It's worth mentioning here that current formulations of set theory don't permit sets to be members of themselves. (See the Axiom of regularity). This is due in part to Russell's Paradox, which exposed a contradiction in naive set theory (which permitted a set to be any collection of objects - there was no prohibition against sets including themselves). This is often illustrated by the Barber Paradox: suppose that, in a particular town, a barber shaves all of the men - and only the men - who do not shave themselves. Question: does the barber shave himself? If he does, it violates the second constraint; if he doesn't, it violates the first constraint. This is clearly logically impossible, but it's actually perfectly permissible under the rules of naive set theory (which is why the newer "standard" formulation of set theory explicitly bans sets from containing themselves).
There's more discussion in this question on Math.SE about why sets cannot be an element of themselves.
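In symbols (naive set-builder notation), Russell's set and its contradiction are:

R = \{\, x \mid x \notin x \,\} \quad\Longrightarrow\quad (R \in R \iff R \notin R)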
With that said, this brings up the second question: even if the designers hadn't been explicitly trying to model a mathematical set, would this be completely "exempt" from the problems associated with naive set theory? I think not - I think that many of the problems that plagued naive set theory would plague any kind of a collection that was insufficiently constrained in ways that were analogous to naive set theory. Indeed, I may be reading too much into this, but the first part of the definition of a Set in the documentation sounds suspiciously like the intuitive concept of a set in naive set theory:
A collection that contains no duplicate elements.
Admittedly (and to their credit), they do place at least some constraints on this later (including stating that you really shouldn't try to have a Set contain itself), but you could question whether it's really "enough" to avoid the problems with naive set theory. This is why, for example, you have a "turtles all the way down" problem when trying to calculate the hash code of a HashSet that contains itself. This is not, as some others have suggested, merely a practical problem - it's an illustration of the fundamental theoretical problems with this type of formulation.
As a brief digression, I do recognize that there are, of course, some limitations on how closely any collection class can really model a mathematical set. For example, Java's documentation warns against the dangers of including mutable objects in a set. Some other languages, such as Python, at least attempt to ban many kinds of mutable objects entirely:
The set classes are implemented using dictionaries. Accordingly, the requirements for set elements are the same as those for dictionary keys; namely, that the element defines both __eq__() and __hash__(). As a result, sets cannot contain mutable elements such as lists or dictionaries. However, they can contain immutable collections such as tuples or instances of ImmutableSet. For convenience in implementing sets of sets, inner sets are automatically converted to immutable form, for example, Set([Set(['dog'])]) is transformed to Set([ImmutableSet(['dog'])]).
Two other major differences that others have pointed out are
Java sets are mutable
Java sets are finite. Obviously, this will be true of any collection class: apart from concerns about actual infinity, computers only have a finite amount of memory. (Some languages, like Haskell, have lazy infinite data structures; however, a lawlike choice sequence seems like a more natural way to model these than classical set theory, but that's just my opinion.)
TL;DR No, it really shouldn't be permitted (or, at least, you should never do that) because sets can't be members of themselves.
I want to understand why CopyOnWriteArraySet does not allow (i.e., ignores) duplicate elements. I understand that since it is a Set, it is meant to avoid duplicates.
But according to the Oracle definition:
CopyOnWriteArraySet: A Set that uses an internal CopyOnWriteArrayList for all of its operations.
Oracle java docs
So practically it should allow duplicates. Is it the internal implementation of add() method that restricts the duplicate elements?
So practically it should allow duplicates.
No, it should not. It's a set. If it allowed duplicates, it shouldn't be called a set.
But acccording to the oracle definition: CopyOnWriteArraySet: A Set that uses an internal CopyOnWriteArrayList for all of its operations.
This is just helpful information, so that if you're already familiar with CopyOnWriteArrayList, then you will understand the consequences, such as thread safety at the expense of slow writes.
In general, the implementation details shouldn't be your concern.
This class implements Set, so it should behave that way.
It's the job of the authors of this class to ensure that there will be no duplicates, despite using a data structure that's capable of containing duplicates. The authors should also avoid the potential performance bottlenecks implied by the underlying data structure, such as linear lookups instead of something faster, as is usually expected from sets.
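As a sketch of how that is done: add simply refuses to append when an equal element is already present. (The real class delegates to CopyOnWriteArrayList.addIfAbsent; the simplified, non-thread-safe version below just shows the idea.)

import java.util.ArrayList;
import java.util.List;

// Simplified and NOT thread-safe -- only illustrates how set
// semantics can be layered on top of a list.
class ListBackedSet<E> {
    private final List<E> list = new ArrayList<>();

    boolean add(E e) {
        if (list.contains(e)) return false; // duplicate: ignored
        return list.add(e);
    }
}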
Lastly, keep in mind that just because you can do something,
doesn't mean that you should.
You are mixing up the CONTRACT and the IMPLEMENTATION of that class.
Oracle is free to change the underlying implementation at any point: the only thing that matters is the behavior of this Set-implementing class.
In other words: the fact that "uses an internal CopyOnWriteArrayList" shows up in the javadoc is only meant to give the user some helpful context about this Set implementation.
I need a combination of Google Collections' ImmutableMap and LinkedHashMap: an immutable map with a defined iteration order. It seems that ImmutableMap itself actually has a defined iteration order; at least its documentation says:
An immutable, hash-based Map with reliable user-specified iteration order.
However, there are no more details. A quick test shows that this might be true, but I want to make sure.
My question is: can I rely on the iteration order of ImmutableMap? If I do ImmutableMap.copyOf(linkedHashMap), will it have the same iteration order as the original linked hash map? What about immutable maps created by a builder? A link to an authoritative answer would help, since Google didn't find anything useful. (And no, links to the sources don't count.)
To be more precise: the ImmutableMap factory methods and builder return instances that follow the iteration order of the inputs provided when the map is constructed. However, an ImmutableSortedMap, which is a subclass of ImmutableMap, sorts the keys.
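For example (assuming Guava is on the classpath):

import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSortedMap;

ImmutableMap<String, Integer> byInsertion = ImmutableMap.<String, Integer>builder()
        .put("banana", 1)
        .put("apple", 2)
        .build();
System.out.println(byInsertion.keySet()); // [banana, apple] -- builder order

ImmutableSortedMap<String, Integer> sorted = ImmutableSortedMap.copyOf(byInsertion);
System.out.println(sorted.keySet());      // [apple, banana] -- sorted by key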
I've actually found discussion about this, with answers from library authors:
Kevin Bourrillion: What we mean by "user-specified" is "it can be whatever order you want it to be"; in other words, whatever order you provide the entries to us in the first place, that's the order we use.
Jared Levy: You can also copy a TreeMap or LinkedHashMap that has the desired order.
Yes, I should have believed the javadoc, although I think that the javadoc could be better in this case. It seems I'm not the first who was confused by it. If nothing else, this Q/A will help Google next time someone searches for "ImmutableMap iteration" :-)
You should believe the javadoc. If it is not enough, read the source code or report the bug.
A quick look at the source code shows that the map is backed by an array, and iteration is done through an ImmutableSet that is also backed by an array. So I think the documentation is correct and the order of the elements will be kept as it is.
Is there any practical difference between a Set and Collection in Java, besides the fact that a Collection can include the same element twice? They have the same methods.
(For example, does Set give me more options to use libraries which accept Sets but not Collections?)
edit: I can think of at least 5 different situations in which to judge this question. Can anyone else come up with more? I want to make sure I understand the subtleties here.
1. Designing a method which accepts an argument of Set or Collection. Collection is more general and accepts more kinds of input. (If I'm designing a specific class or interface, I'm being nicer to my consumers and stricter on my subclassers/implementers if I use Collection.)
2. Designing a method which returns a Set or Collection. Set offers more guarantees than Collection (even if it's just the guarantee not to include one element twice). (If I'm designing a specific class or interface, I'm being nicer to my consumers and stricter on my subclassers/implementers if I use Set.) Situations 1 and 2 are illustrated in the sketch after this list.
3. Designing a class that implements the interface Set or Collection. Similar issues as #2. Users of my class/interface get more guarantees, subclassers/implementers have more responsibility.
4. Designing an interface that extends the interface Set or Collection. Very similar to #3.
5. Writing code that uses a Set or Collection. Here I might as well use Set; the only reasons for me to use Collection are if I get back a Collection from someone else's code, or if I have to handle a collection that contains duplicates.
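A sketch illustrating situations 1 and 2 (UserRegistry is a hypothetical class of mine):

import java.util.*;

class UserRegistry {
    private final Set<String> names = new LinkedHashSet<>();

    // Situation 1: accept the most general type that works;
    // callers may pass a List, Set, Deque, ...
    void addAll(Collection<String> newNames) {
        names.addAll(newNames);
    }

    // Situation 2: return the more specific type; callers get
    // the "no duplicates" guarantee for free.
    Set<String> getNames() {
        return Collections.unmodifiableSet(names);
    }
}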
Collection is also the supertype of List, Queue, Deque, and others, so it gives you more options. For example, I try to use Collection as a parameter to library methods that shouldn't explicitly depend on a certain type of collection.
Generally, you should use the right tool for the job. If you don't want duplicates, use Set (or SortedSet if you want ordering, or LinkedHashSet if you want to maintain insertion order). If you want to allow duplicates, use List, and so on.
I think you already have it figured out- use a Set when you want to specifically exclude duplicates. Collection is generally the lowest common denominator, and it's useful to specify APIs that accept/return this, which leaves you room to change details later on if needed. However if the details of your application require unique entries, use Set to enforce this.
Also worth considering is whether order is important to you; if it is, use List, or LinkedHashSet if you care about order and uniqueness.
See Java's Collection tutorial for a good walk-through of Collection usage. In particular, check out the class hierarchy.
As @mmyers states, Collection includes Set, as well as List.
When you declare something as a Set rather than a Collection, you are saying that the variable cannot be, for example, a List. It will always be a Collection, though. So any function that accepts a Collection will accept a Set, but a function that accepts a Set cannot take an arbitrary Collection (unless you cast it to a Set).
One other thing to consider... Sets have extra overhead in time, memory, and coding in order to guarantee that there are no duplicates. (Time and memory because sets are usually backed by a HashMap or a tree, which adds overhead compared to a list or an array; coding because you have to implement the hashCode() and equals() methods.)
I usually use sets when I need a fast implementation of contains() and use Collection or List otherwise, even if the collection shouldn't have duplicates.
You should use a Set when that is what you want.
For example, when you want the behavior of a List without any ordering guarantees or duplicates. Methods like contains are quite useful.
A collection is much more generic. I believe that what mmyers wrote on their usage says it all.
The practical difference is that Set enforces set logic, i.e. no duplicates (and, in the general case, no ordering guarantee), while Collection does not. So if you need a Collection and you have no particular requirement to avoid duplicates, use a Collection. If you have the requirements of a Set, use Set. Generally, use the highest interface possible.
As Collection is a supertype of Set and SortedSet, these can be passed to a method which expects a Collection. Collection alone just means the collection may or may not be sorted or ordered, and may or may not allow duplicates.