This is more a gotcha I wanted to share than a question: when printing with toString(), Java will detect direct cycles in a Collection (where the Collection refers to itself), but not indirect cycles (where a Collection refers to another Collection which refers to the first one - or with more steps).
import java.util.*;

public class ShonkyCycle {
    public static void main(String[] args) {
        List<Object> a = new LinkedList<>();
        a.add(a);              // direct cycle
        System.out.println(a); // works: [(this Collection)]

        List<Object> b = new LinkedList<>();
        a.add(b);
        b.add(a);              // indirect cycle
        System.out.println(a); // shonky: causes infinite loop!
    }
}
This was a real gotcha for me, because it occurred in debugging code to print out the Collection (I was surprised when it caught a direct cycle, so I assumed incorrectly that they had implemented the check in general). There is a question: why?
The explanation I can think of is that checking for a collection that refers to itself is very inexpensive: you only need the reference to the collection itself, which you already have. For longer cycles, though, you need to store every collection you encounter, starting from the root. Additionally, you might not be able to tell for sure what the root is, so you'd have to store every collection you visit - which you traverse anyway - and you'd also have to do a hash lookup on every collection element. That is expensive for the relatively rare case of cycles (in most programming). I think the only reason it checks for direct cycles is that the check is so cheap (one reference comparison).
OK... I've kinda answered my own question - but have I missed anything important? Anyone want to add anything?
Clarification: I now realize the problem I saw is specific to printing a Collection (i.e. the toString() method). There's no problem with cycles per se (I use them myself and need to have them); the problem is that Java can't print them. Edit: Andrzej Doyle points out it's not just collections, but any object whose toString() is called.
Given that it's constrained to this method, here's an algorithm to check for it:
the root is the object that the first toString() is invoked on (to determine this, you need to maintain state on whether a toString is currently in progress or not; so this is inconvenient).
as you traverse each object, you add it to an IdentityHashMap, along with a unique identifier (e.g. an incremented index).
but if this object is already in the Map, write out its identifier instead.
This approach also correctly renders multirefs (a node that is referred to more than once).
The memory cost is the IdentityHashMap (one reference and index per object); the complexity cost is a hash lookup for every node in the directed graph (i.e. each object that is printed).
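To make the idea concrete, here is a minimal sketch of that algorithm (the class and method names and the "@0"-style back-reference syntax are made up for illustration; this is not a JDK API):

import java.util.*;

class CycleAwarePrinter {

    // Entry point: the "root" is simply whatever we are first asked to print.
    static String printGraph(Object root) {
        return print(root, new IdentityHashMap<Object, Integer>());
    }

    private static String print(Object o, IdentityHashMap<Object, Integer> seen) {
        if (!(o instanceof Collection)) {
            return String.valueOf(o);   // leaf: fall back to the plain toString()
        }
        Integer id = seen.get(o);
        if (id != null) {
            return "@" + id;            // seen before: write out its identifier instead
        }
        seen.put(o, seen.size());       // record this node with an incremented index
        StringBuilder sb = new StringBuilder("[");
        for (Iterator<?> it = ((Collection<?>) o).iterator(); it.hasNext(); ) {
            sb.append(print(it.next(), seen));
            if (it.hasNext()) {
                sb.append(", ");
            }
        }
        return sb.append(']').toString();
    }
}

On the ShonkyCycle example above, CycleAwarePrinter.printGraph(a) prints [@0, [@0]] instead of recursing forever, and a node referenced more than once is rendered as its identifier on every visit after the first.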
I think fundamentally it's because while the language tries to stop you from shooting yourself in the foot, it shouldn't really do so in a way that's expensive. So while it's almost free to compare object pointers (e.g. does obj == this) anything beyond that involves invoking methods on the object you're passing in.
And at this point the library code doesn't know anything about the objects you're passing in. For one, the generics implementation doesn't know if they're instances of Collection (or Iterable) themselves, and while it could find this out via instanceof, who's to say whether it's a "collection-like" object that isn't actually a collection, but still contains a deferred circular reference? Secondly, even if it is a collection there's no telling what its actual implementation and thus behaviour are like. Theoretically one could have a collection containing all the Longs which is going to be used lazily; but since the library doesn't know this it would be hideously expensive to iterate over every entry. Or in fact one could even design a collection with an Iterator that never terminated (though this would be difficult to use in practice because so many constructs/library classes assume that hasNext will eventually return false).
So it basically comes down to an unknown, possibly infinite cost in order to stop you from doing something that might not actually be an issue anyway.
I'd just like to point out that this statement:
when printing with toString(), Java will detect direct cycles in a collection
is misleading.
Java (the JVM, the language itself, etc) is not detecting the self-reference. Rather this is a property of the toString() method/override of java.util.AbstractCollection.
If you were to create your own Collection implementation, the language/platform wouldn't automatically save you from a self-reference like this - unless you extend AbstractCollection, you have to make sure you cover this logic yourself.
I might be splitting hairs here but I think this is an important distinction to make. Just because one of the foundation classes in the JDK does something doesn't mean that "Java" as an overall umbrella does it.
Here is the relevant source code in AbstractCollection.toString(), with the key line commented:
public String toString() {
    Iterator<E> i = iterator();
    if (! i.hasNext())
        return "[]";

    StringBuilder sb = new StringBuilder();
    sb.append('[');
    for (;;) {
        E e = i.next();
        // self-reference check:
        sb.append(e == this ? "(this Collection)" : e);
        if (! i.hasNext())
            return sb.append(']').toString();
        sb.append(", ");
    }
}
The problem with the algorithm that you propose is that you need to pass the IdentityHashMap to all Collections involved. This is not possible using the published Collection APIs. The Collection interface does not define a toString(IdentityHashMap) method.
I imagine that whoever at Sun put the self reference check into the AbstractCollection.toString() method thought of all of this, and (in conjunction with his colleagues) decided that a "total solution" is over the top. I think that the current design / implementation is correct.
It is not a requirement that Object.toString implementations be bomb-proof.
You are right, you already answered your own question. Checking for longer cycles (especially really long ones like period length 1000) would be too much overhead and is not needed in most cases. If someone wants it, he has to check it himself.
The direct cycle case, however, is easy to check and will occur more often, so it's done by Java.
You can't really detect indirect cycles from within a single toString() call; in general you'd have to track every object visited across all the collections involved, and the Collection API gives you no way to carry that state along.
Related
Sometimes it would be convenient to have an easy way of doing the following:
Foo a = dosomething();
if (a != null) {
    if (a.isValid()) {
        ...
    }
}
My idea was to have some kind of static “default” methods for not initialized variables like this:
class Foo {
    public boolean isValid() {
        return true;
    }

    public static boolean isValid() {
        return false;
    }
}
And now I could do this…
Foo a = dosomething();
if (a.isValid()) {
    // In our example case -> variable is initialized and the "normal" method gets called
} else {
    // In our example case -> variable is null
}
So, if a == null, the static "default" method from our class gets called; otherwise the method of our object gets called.
Is there either some keyword I’m missing to do exactly this or is there a reason why this is not already implemented in programming languages like java/c#?
Note: this example is not very compelling on its own, but there are cases where this would be - indeed - very nice.
It's very slightly odd; ordinarily, x.foo() runs the foo() method as defined by the object that the x reference is pointing to. What you propose is a fallback mechanism: if x is null (is referencing nothing), then we don't look at the object x is pointing to (there is nothing it's pointing at, so that's impossible); instead we look at the type of x, the variable itself, and ask that type: hey, can you give me the default impl of foo()?
The core problem is that you're assigning a definition to null that it just doesn't have. Your idea requires a redefinition of what null means, which means the entire community needs to go back to school. I think the current definition of null in the java community is some nebulous, ill-defined cloud of confusion, so this is probably a good idea, but it is a huge commitment, and it is extremely easy for the OpenJDK team to dictate a direction and for the community to just ignore it. The OpenJDK team should be very hesitant in trying to 'solve' this problem by introducing a language feature, and they are.
Let's talk about the definitions of null that make sense, which definition of null your idea specifically is catering to (to the detriment of the other interpretations!), and how catering to that specific idea is already easy to do in current java. In other words: what you propose sounds outright daft to me, in that it's just unnecessary and forces an opinion of what null means down everybody's throats for no reason.
Not applicable / undefined / unset
This definition of null is exactly how SQL defines it, and it has the following properties:
There is no default implementation available. By definition! How can one define what the size is of, say, an unset list? You can't say 0. You have no idea what the list is supposed to be. The very point is that interaction with an unset/not-applicable/unknown value should immediately lead to a result that represents either [A] the programmer messed up - the fact that they think they can interact with this value means they programmed a bug; they made an assumption about the state of the system which does not hold - or [B] that the unset nature is infectious: the operation returns the notion 'unknown / unset / not applicable' as its result.
SQL chose the B route: Any interaction with NULL in SQL land is infectuous. For example, even NULL = NULL in SQL is NULL, not FALSE. It also means that all booleans in SQL are tri-state, but this actually 'works', in that one can honestly fathom this notion. If I ask you: Hey, are the lights on?, then there are 3 reasonable answers: Yes, No, and I can't tell you right now; I don't know.
In my opinion, java as a language is meant for this definition as well, but has mostly chosen the [A] route: throw an NPE to let everybody know there is a bug, and to let the programmer get to the relevant line extremely quickly. NPEs are easy to solve, which is why I don't get why everybody hates them. I love NPEs. So much better than some default behaviour that is usually, but not always, what I intended (objectively speaking, it is better to have 50 bugs that each take 3 minutes to solve than one bug that takes an entire working day, by a large margin!). This definition 'works' with the language:
Uninitialized fields, and uninitialized values in an array begin as null, and in the absence of further information, treating it as unset is correct.
They are, in fact, infectiously erroneous: virtually all attempts to interact with them result in an exception, except ==, but that is intentional, for the same reason that in SQL IS NULL will return TRUE or FALSE and not NULL: now we're actually talking about the pointer nature of the object itself ("foo" == "foo" can be false if the 2 strings aren't the same ref: clearly == in java between objects is about the references themselves and not about the objects referenced).
A key aspect to this is that null has absolutely no semantic meaning, at all. Its lack of semantic meaning is the point. In other words, null doesn't mean that a value is short or long or blank or indicative of anything in particular. The only thing it does mean is that it means nothing. You can't derive any information from it. Hence, foo.size() is not 0 when foo is unset/unknown - the question 'what is the size of the object foo is pointing at' is unanswerable, in this definition, and thus NPE is exactly right.
Your idea would hurt this interpretation - it would confound matters by giving answers to unanswerable questions.
Sentinel / 'empty'
null is sometimes used as a value that does have semantic meaning. Something specific. For example, if you ever wrote this, you're using this interpretation:
if (x == null || x.isEmpty()) return false;
Here you've assigned a semantic meaning to null - the same meaning you assigned to an empty string. This is common in java and presumably stems from some bass ackwards notion of performance. For example, in the eclipse ecj java parser system, all empty arrays are done with null pointers. For example, the definition of a method has a field Argument[] arguments (for the method parameters; using argument is the slightly wrong word, but it is used to store the param definitions); however, for methods with zero parameters, the semantically correct choice is obviously new Argument[0]. However, that is NOT what ecj fills the Abstract Syntax Tree with, and if you are hacking around on the ecj code and assign new Argument[0] to this, other code will mess up as it just wasn't written to deal with this.
This is in my opinion bad use of null, but is quite common. And, in ecj's defense, it is about 4 times faster than javac, so I don't think it's fair to cast aspersions at their seemingly deplorably outdated code practices. If it's stupid and it works it isn't stupid, right? ecj also has a better track record than javac (going mostly by personal experience; I've found 3 bugs in ecj over the years and 12 in javac).
This kind of null does get a lot better if we implement your idea.
The better solution
What ecj should have done to get the best of both worlds: make a public constant for it! new Argument[0], the object, is entirely immutable. You need to make a single instance, once, ever, for an entire JVM run. The JVM itself does this; try it: List.of() returns the 'singleton empty list'. So does Collections.emptyList() for the old timers in the crowd. All lists 'made' with Collections.emptyList() are actually just refs to the same singleton 'empty list' object. This works because the lists these methods make are entirely immutable.
The same can and generally should apply to you!
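For example (a sketch; the constant names and the holder class are mine, standing in for ecj's Argument case):

import java.util.List;

// Sketch: shared, immutable "empty" sentinels instead of null.
// One instance for the whole JVM run, exactly like List.of() / Collections.emptyList().
final class Empty {
    static final String[] NO_STRINGS = new String[0]; // length-0 arrays have nothing to mutate, so sharing is safe
    static final List<String> NO_NAMES = List.of();   // the JDK's own empty-list singleton

    private Empty() {}
}

Code that used to return null for "no values" returns Empty.NO_STRINGS instead, and callers can iterate without a null check.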
If you ever write this:
if (x == null || x.isEmpty())
then you messed up if we go by the first definition of null, and you're simply writing needlessly wordy, but correct, code if we go by the second
definition. You've come up with a solution to address this, but there's a much, much better one!
Find the place where x got its value, and address the boneheaded code that decided to return null instead of "". You should in fact emphatically NOT be adding null checks to your code, because it's far too easy to get into this mode where you almost always do it, and therefore you rarely actually have null refs, but it's just swiss cheese laid on top of each other: There may still be holes, and then you get NPEs. Better to never check so you get NPEs very quickly in the development process - somebody returned null where they should be returning "" instead.
Sometimes the code that made the bad null ref is out of your control. In that case, do the same thing you should always do when working with badly designed APIs: Fix it ASAP. Write a wrapper if you have to. But if you can commit a fix, do that instead. This may require making such an object.
Sentinels are awesome
Sometimes sentinel objects (objects that 'stand in' for this default / blank take, such as "" for strings or List.of() for lists) can be a bit more fancy than this. For example, one can imagine using LocalDate.of(1800, 1, 1) as a sentinel for a missing birthdate, but note that this particular instance is not a great idea: it does crazy stuff. If you write code to determine the age of a person, it starts giving completely wrong answers - it'll say someone is 212 years old all of a sudden. That is significantly worse than throwing an exception: with the exception you know you have a bug sooner, and you get a stacktrace that lets you find it in literally 500 milliseconds (just click the line, voila - that is the exact line you need to look at to fix the problem).
But you could make a LocalDate object that does some things (such as: It CAN print itself; sentinel.toString() doesn't throw NPE but prints something like 'unset date'), but for other things it will throw an exception. For example, .getYear() would throw.
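LocalDate itself is final, so you can't literally subclass it; a hedged sketch of what such a sentinel could look like as a small wrapper type (all names here are made up):

import java.time.LocalDate;
import java.util.Objects;

// Sketch: a date holder with an explicit "unset" sentinel.
// It can print itself, but refuses to answer questions that have no answer.
final class BirthDate {
    static final BirthDate UNSET = new BirthDate(null);

    private final LocalDate value; // null only inside the UNSET sentinel

    private BirthDate(LocalDate value) { this.value = value; }

    static BirthDate of(LocalDate value) {
        return new BirthDate(Objects.requireNonNull(value));
    }

    int year() {
        if (this == UNSET) throw new IllegalStateException("date is unset");
        return value.getYear();
    }

    @Override
    public String toString() {
        return this == UNSET ? "unset date" : value.toString();
    }
}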
You can also make more than one sentinel. If you want a sentinel that means 'far future', that's trivially made (LocalDate.of(9999, 12, 31) is pretty good already), and you can also have one as 'for as long as anyone remembers', e.g. 'distant past'. That's cool, and not something your proposal could ever do!
You will have to deal with the consequences though. In some small ways the java ecosystem's definitions don't mesh with this, and null would perhaps have been a better stand-in. For example, the equals contract clearly states that a.equals(a) must always hold, and yet, just like in SQL NULL = NULL isn't TRUE, you probably don't want missingDate.equals(missingDate) to be true; that's conflating the meta with the value: you can't actually tell me that 2 missing dates are equal. By definition: the dates are missing. You do not know if they are equal or not. It is not an answerable question. And yet we can't implement the equals method of missingDate as return false; (or, better yet - since you also can't really know they aren't equal - throw an exception), as that breaks the contract (equals methods must have the identity property and must not throw, as per their own javadoc, so we can't do either of those things).
Dealing with null better
There are a few things that make dealing with null a lot easier:
Annotations: APIs can and should be very clear in communicating when their methods can return null and what that means. Annotations that turn that documentation into compiler-checked documentation are awesome. Your IDE can start warning you, as you type, that null may occur and what that means, and will say so in auto-complete dialogs too. And it's all entirely backwards compatible in all senses of the word: no need to start considering giant swaths of the java ecosystem as 'obsolete' (unlike Optional, which mostly sucks).
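For example, with one of the common nullness-annotation libraries (here JetBrains' org.jetbrains.annotations; JSR-305, the Checker Framework and JSpecify offer equivalents - the directory interface itself is just an illustration):

import org.jetbrains.annotations.Nullable;

interface PhoneDirectory {
    // The annotation turns "may return null, meaning: no number on file"
    // into something the IDE and static analysis can check.
    @Nullable
    String lookup(String userId);
}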
Optional, except this is a non-solution. The type isn't orthogonal: you can't write a method that takes a List<MaybeOptionalOrNot<String>> and works on both List<String> and List<Optional<String>>, even though a method that merely checks the 'is it some or is it none?' state of all list members (and maybe shuffles things around) would work equally well on both - and yet you just can't write it. This is bad, and it means all usages of Optional must be 'unrolled' on the spot; Optional<X> should show up pretty much never as a parameter type or field type, only as a return type, and even that is dubious - I'd just stick to what Optional was made for: the return type of Stream terminal operations.
Adopting it also isn't backwards compatible. For example, hashMap.get(key) should, in all possible interpretations of what Optional is for, obviously return an Optional<V>, but it doesn't, and it never will, because java doesn't break backwards compatibility lightly and breaking that is obviously far too heavy an impact. The only real solution is to introduce java.util2 and a complete incompatible redesign of the collections API, which is splitting the java ecosystem in twain. Ask the python community (python2 vs. python3) how well that goes.
Use sentinels, use them heavily, make them available. If I were designing LocalDate, I'd have created LocalDate.FAR_FUTURE and LocalDate.DISTANT_PAST (but let it be clear that I think Stephen Colebourne, who designed JSR-310, is perhaps the best API designer out there. But nothing is so perfect that it can't be complained about, right?)
Use API calls that allow defaulting. Map has this.
Do NOT write this code:
String phoneNr = phoneNumbers.get(userId);
if (phoneNr == null) return "Unknown phone number";
return phoneNr;
But DO write this:
return phoneNumbers.getOrDefault(userId, "Unknown phone number");
Don't write:
Map<Course, List<Student>> participants;

void enrollStudent(Student student) {
    List<Student> participating = participants.get(econ101);
    if (participating == null) {
        participating = new ArrayList<Student>();
        participants.put(econ101, participating);
    }
    participating.add(student);
}
instead write:
Map<Course, List<Student>> participants;

void enrollStudent(Student student) {
    participants.computeIfAbsent(econ101,
            k -> new ArrayList<Student>())
        .add(student);
}
and, crucially, if you are writing APIs, ensure things like getOrDefault, computeIfAbsent, etc. are available so that the users of your API don't have to deal with null nearly as much.
You can write a static test() method like this:
static <T> boolean test(T object, Predicate<T> validation) {
    return object != null && validation.test(object);
}
and
static class Foo {
    public boolean isValid() {
        return true;
    }
}

static Foo dosomething() {
    return new Foo();
}
public static void main(String[] args) {
    Foo a = dosomething();
    if (test(a, Foo::isValid))
        System.out.println("OK");
    else
        System.out.println("NG");
}
output:
OK
If dosomething() returns null, it prints NG
Not exactly, but take a look at Optional:
Optional.ofNullable(dosomething())
.filter(Foo::isValid)
.ifPresent(a -> ...);
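On Java 9+ you can also cover both branches of the original if/else in one chain:

Optional.ofNullable(dosomething())
        .filter(Foo::isValid)
        .ifPresentOrElse(
                a -> System.out.println("OK"),   // non-null and valid
                () -> System.out.println("NG")); // null or invalid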
Let's say we are trying to build a document scanner class in java that takes one input argument, the log path (e.g. C:\document\text1.txt). Which of the following implementations would you prefer based on performance/memory/modularity?
ArrayList<String> fileListArray = new ArrayList<String>();
fileListArray.add("C:\\document\\text1.txt");
fileListArray.add("C:\\document\\text2.txt");
.
.
.

//Implementation A
for (int i = 0, j = fileListArray.size(); i < j; i++) {
    MyDocumentScanner ds = new MyDocumentScanner(fileListArray.get(i));
    ds.scanDocument();
    ds.resultOutput();
}

//Implementation B
MyDocumentScanner ds = new MyDocumentScanner();
for (int i = 0, j = fileListArray.size(); i < j; i++) {
    ds.setDocPath(fileListArray.get(i));
    ds.scanDocument();
    ds.resultOutput();
}
Personally I would prefer A due to its encapsulation, but it seems to use more memory due to the creation of multiple instances. I'm curious whether there is an answer to this, or whether it is another "it depends on the situation/circumstances" dilemma?
Although this is obviously opinion-based, I will try an answer to tell my opinion.
Your approach A is far better. Your document scanner obviously handles a file. That should be set at construction time and be saved in an instance field, so every method can refer to this field. Moreover, the constructor can do some checks on the file reference (null check, existence, ...).
Your approach B has two very serious disadvantages:
After constructing a document scanner, clients could easily call all of the methods. If no file was set before, you must handle that "illegal state" with maybe an IllegalStateException. Thus, this approach increases code and complexity of that class.
There seems to be a series of method calls that a client should or can perform. It's easy to call the file-setting method again in the middle of such a series with a completely different file, breaking the whole scan facility. To avoid this, your setter (for the file) would have to remember whether a file was already set. And that nearly automatically leads to approach A.
Regarding the creation of objects: Modern JVMs are really very fast at creating objects. Usually, there is no measurable performance overhead for that. The processing time (here: the scan) usually is much higher.
If you don't need multiple instances of DocumentScanner to co-exist, I see no point in creating a new instance in each iteration of the loop. It just creates work for the garbage collector, which has to free each of those instances.
If the length of the array is small, it doesn't make much difference which implementation you choose, but for large arrays, implementation B is more efficient, both in terms of memory (fewer instances alive that the GC hasn't freed yet) and CPU (less work for the GC).
Are you implementing DocumentScanner or using an existing class?
If the latter, and it was designed for being able to parse multiple documents in a row, you can just reuse the object as in variant B.
However, if you are designing DocumentScanner, I would recommend to design it such that it handles a single document and does not even have a setDocPath method. This leads to less mutable state in that class and thus makes its design much easier. Also using an instance of the class becomes less error-prone.
As for performance, there won't be a measurable difference unless instantiating a DocumentScanner is doing a lot of work (like instantiating many other objects, too). Instantiating and freeing objects in Java is pretty cheap if they are used only for a short time due to the generational garbage collector.
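A minimal sketch of that single-document design (MyDocumentScanner's real API isn't shown in the question, so the constructor checks and method bodies here are illustrative):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

// Sketch of approach A: the document is fixed and validated at construction
// time, so every method can rely on the docPath field being usable.
final class MyDocumentScanner {
    private final Path docPath;

    MyDocumentScanner(String docPath) {
        Path p = Paths.get(Objects.requireNonNull(docPath, "docPath"));
        if (!Files.isRegularFile(p)) {
            throw new IllegalArgumentException("Not a readable file: " + p);
        }
        this.docPath = p;
    }

    void scanDocument() {
        // read and scan this.docPath
    }

    void resultOutput() {
        // print the results for this.docPath
    }
}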
I have a bottleneck method which attempts to add points (as x-y pairs) to a HashSet. The common case is that the set already contains the point in which case nothing happens. Should I use a separate point for adding from the one I use for checking if the set already contains it? It seems this would allow the JVM to allocate the checking-point on stack. Thus in the common case, this will require no heap allocation.
Ex. I'm considering changing
HashSet<Point> set;

public void addPoint(int x, int y) {
    if (set.add(new Point(x, y))) {
        // Do some stuff
    }
}
to
HashSet<Point> set;

public void addPoint(int x, int y) {
    if (!set.contains(new Point(x, y))) {
        set.add(new Point(x, y));
        // Do some stuff
    }
}
Is there a profiler which will tell me whether objects are allocated on heap or stack?
EDIT: To clarify why I think the second might be faster, in the first case the object may or may not be added to the collection, so it's not non-escaping and cannot be optimized. In the second case, the first object allocated is clearly non-escaping so it can be optimized by the JVM and put on stack. The second allocation only occurs in the rare case where it's not already contained.
Marko Topolnik properly answered your question; the space allocated for the first new Point may or may not be immediately freed and it is probably foolish to bank on it happening. But I want to expand on why you're currently in a deep state of sin:
You're trying to optimise this the wrong way.
You've identified object creation to be the bottleneck here. I'm going to assume that you're right about this. You're hoping that, if you create fewer objects, the code will run faster. That might be true, but it will never run very fast as you've designed it.
Every object in Java has a pretty fat header (16 bytes; an 8-byte "mark word" full of bit fields and an 8-byte pointer to the class type) and, depending on what's happened in your program thus far, possibly another pretty fat trailer. Your HashSet isn't storing just the contents of your objects; it's storing pointers to those fat-headers-followed-by-contents. (Actually, it's storing pointers to Entry classes that themselves store pointers to Points. Two levels of indirection there.)
A HashSet lookup, then, figures out which bucket it needs to look at and then chases one pointer per thing in the bucket to do the comparison. (As one great big chain in series.) There probably aren't very many of these objects, but they almost certainly aren't stored close together, making your cache angry. Note that object allocation in Java is extremely cheap---you just increment a pointer---and that this is quite probably a bigger source of slowness.
Java doesn't provide any abstraction like C++'s templates, so the only real way to make this fast and still provide the Set abstraction is to copy HashSet's code, change all of the data structures to represent your objects inline, modify the methods to work with the new data structures, and, if you're still worried, make copies of the relevant methods that take a list of parameters corresponding to object contents (i.e. contains(int, int)) that do the right thing without constructing a new object.
This approach is error-prone and time-consuming, but it's necessary unfortunately often when working on Java projects where performance matters. Take a look at the Trove library Marko mentioned and see if you can use it instead; Trove did exactly this for the primitive types.
With that out of the way, a monomorphic call site is one where only one method is called. Hotspot aggressively inlines calls from monomorphic call sites. You'll notice that HashSet.contains punts to HashMap.containsKey. You'd better pray for HashMap.containsKey to be inlined since you need the hashCode call and equals calls inside to be monomorphic. You can verify that your code is being compiled nicely by using the -XX:+PrintAssembly option and poring over the output, but it's probably not---and even if it is, it's probably still slow because of what a HashSet is.
As soon as you have written new Point(x,y), you are creating a new object. It may happen not to be placed on the heap, but that's just a bet you can lose. For example, the contains call should be inlined for the escape analysis to work, or at least it should be a monomorphic call site. All this means that you are optimizing against a quite erratic performance model.
If you want to avoid allocation the solid way, you can use Trove library's TLongHashSet and have your (int,int) pairs encoded as single long values.
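The encoding itself is simple bit-packing; a sketch (only the packing is shown here - it's the part that removes the Point allocations; the Trove set type itself is not spelled out):

// Sketch: pack an (x, y) pair into one long so a primitive long set
// (such as Trove's TLongHashSet) can be used with no Point objects at all.
final class PointCodec {
    static long pack(int x, int y) {
        // mask the low word so a negative y doesn't smear into the x bits
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    static int x(long key) { return (int) (key >>> 32); }
    static int y(long key) { return (int) key; }

    private PointCodec() {}
}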
We all know when using Collections.synchronizedXXX (e.g. synchronizedSet()) we get a synchronized "view" of the underlying collection.
However, the documentation of these wrapper generation methods states that we have to synchronize explicitly on the collection when iterating over it using an iterator.
Which option do you choose to solve this problem?
I can only see the following approaches:
Do it as the documentation states: synchronize on the collection
Clone the collection before calling iterator()
Use a collection whose iterator is thread-safe (I am only aware of CopyOnWriteArrayList/Set)
And as a bonus question: when using a synchronized view - is the use of foreach/Iterable thread-safe?
You've already answered your bonus question really: no, using an enhanced for loop isn't safe - because it uses an iterator.
As for which is the most appropriate approach - it really depends on your context:
Are writes very infrequent? If so, CopyOnWriteArrayList may be most appropriate.
Is the collection reasonably small, and the iteration quick? (i.e. you're not doing much work in the loop) If so, synchronizing may well be fine - especially if this doesn't happen too often (i.e. you won't have much contention for the collection).
If you're doing a lot of work and don't want to block other threads working at the same time, the hit of cloning the collection may well be acceptable.
Depends on your access model. If you have low concurrency and frequent writes, 1 will have the best performance. If you have high concurrency and infrequent writes, 3 will have the best performance. Option 2 is going to perform badly in almost all cases.
foreach calls iterator(), so exactly the same things apply.
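For option 1, the pattern is the one the Collections.synchronizedList javadoc itself prescribes - hold the wrapper's own lock for the entire iteration:

import java.util.*;

class SynchronizedIteration {
    public static void main(String[] args) {
        List<String> words = Collections.synchronizedList(new ArrayList<>());
        Collections.addAll(words, "alpha", "beta", "gamma");

        // The lock must be the wrapper itself and must be held for the whole
        // loop; the enhanced for loop uses an Iterator internally.
        synchronized (words) {
            for (String word : words) {
                System.out.println(word);
            }
        }
    }
}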
You could use one of the newer collections added in Java 5.0 which support concurrent access while iterating. Another approach is to take a copy using toArray which is thread safe (during the copy).
Collection<String> words = ...
// enhanced for loop over an array.
for(String word: words.toArray(new String[0])) {
}
I might be totally off with your requirements, but if you are not aware of them, check out google-collections with "Favor immutability" in mind.
I suggest dropping Collections.synchronizedXXX and handle all locking uniformly in the client code. The basic collections don't support the sort of compound operations useful in threaded code, and even if you use java.util.concurrent.* the code is more difficult. I suggest keeping as much code as possible thread-agnostic. Keep difficult and error-prone thread-safe (if we are very lucky) code to a minimum.
All three of your options will work. Choosing the right one depends on your situation.
CopyOnWriteArrayList will work if you want a list implementation and you don't mind the underlying storage being copied every time you write. This is pretty good for performance as long as you don't have very big collections.
ConcurrentHashMap or "ConcurrentHashSet" (using Collections.newSetFromMap) will work if you need a Map or Set interface, obviously you don't get random access this way. One great! thing about these two is that they will work well with large data sets - when mutated they just copy little bits of the underlying data storage.
It does depend on the result one needs to achieve. Cloning/copying - via toArray(), new ArrayList(..) and the like - obtains a snapshot and does not lock the collection.
Using synchronized(collection) and iterating through it ensures that no modification happens until the iteration ends, i.e. it effectively locks the collection.
Side note: toArray() is usually preferred, with some exceptions when internally it needs to create a temporary ArrayList. Also note that anything other than toArray() should be wrapped in synchronized(collection) as well, provided you are using Collections.synchronizedXXX.
This question is rather old (sorry, I am a bit late..) but I still want to add my answer.
I would choose your second choice (i.e. Clone the collection before calling iterator()) but with a major twist.
Assuming you want to iterate using an iterator, you do not have to copy the Collection before calling .iterator() and thereby sort of negate (I am using the term "negate" loosely) the idea of the iterator pattern; instead, you could write a "ThreadSafeIterator".
It would work on the same premise, copying the Collection, but without letting the iterating class know that you did just that. Such an iterator might look like this:
class ThreadSafeIterator<T> implements Iterator<T> {
    private final Queue<T> clients;
    private T currentElement;
    private final Collection<T> source;

    ThreadSafeIterator(final Collection<T> collection) {
        this.clients = new LinkedList<>(collection);
        this.source = collection;
    }

    @Override
    public boolean hasNext() {
        return clients.peek() != null;
    }

    @Override
    public T next() {
        currentElement = clients.poll();
        return currentElement;
    }

    @Override
    public void remove() {
        synchronized (source) {
            source.remove(currentElement);
        }
    }
}
Taking this a step further, you might use the Semaphore class to ensure thread-safety or something. But take the remove method with a grain of salt.
The point is, by using such an iterator, neither the iterating class nor the iterated class (is that a real word?) has to worry about thread safety.
Quick background
I'm a Java developer who's been playing around with C++ in my free/bored time.
Preface
In C++, you often see pop taking an argument by reference:
void pop(Item& removed);
I understand that it is nice to "fill in" the parameter with what you removed. That totally makes sense to me. This way, the person who asked to remove the top item can have a look at what was removed.
However, if I were to do this in Java, I'd do something like this:
Item pop() throws StackException;
This way, after the pop we either return the Item, return null, or an exception is thrown.
My C++ text book shows me the example above, but I see plenty of stack implementations taking no arguments (stl stack for example).
The Question
How should one implement the pop function in C++?
The Bonus
Why?
To answer the question: you should not implement the pop function in C++, since it is already implemented by the STL. The std::stack container adapter provides the method top to get a reference to the top element on the stack, and the method pop to remove the top element. Note that the pop method alone cannot be used to perform both actions, as you asked about.
Why should it be done that way?
Exception safety: Herb Sutter gives a good explanation of the issue in GotW #82.
Single-responsibility principle: also mentioned in GotW #82. top takes care of one responsibility and pop takes care of the other.
Don't pay for what you don't need: For some code, it may suffice to examine the top element and then pop it, without ever making a (potentially expensive) copy of the element. (This is mentioned in the SGI STL documentation.)
Any code that wishes to obtain a copy of the element can do this at no additional expense:
Foo f(s.top());
s.pop();
Also, this discussion may be interesting.
If you were going to implement pop to return the value, it doesn't matter much whether you return by value or write it into an out parameter. Most compilers implement RVO, which will optimize the return-by-value method to be just as efficient as the copy-into-out-parameter method. Just keep in mind that either of these will likely be less efficient than examining the object using top() or front(), since in that case there is absolutely no copying done.
The problem with the Java approach is that its pop() method has at least two effects: removing an element, and returning an element. This violates the single-responsibility principle of software design, which in turn opens door for design complexities and other issues. It also implies a performance penalty.
In the STL way of things the idea is that sometimes when you pop() you're not interested in the item popped. You just want the effect of removing the top element. If the function returns the element and you ignore it then that's a wasted copy.
If you provide two overloads, one which takes a reference and another which doesn't, then you allow the user to choose whether he (or she) is interested in the returned element or not. The performance of the call will be optimal.
The STL doesn't overload the pop() functions but rather splits these into two functions: back() (or top() in the case of the std::stack adapter) and pop(). The back() function just returns the element, while the pop() function just removes it.
Using C++0x makes the whole thing hard again.
As
stack.pop(item); // move top data to item without copying
makes it possible to efficiently move the top element from the stack. Whereas
item = stack.top(); // make a copy of the top element
stack.pop(); // delete top element
doesn't allow such optimizations.
The only reason I can see for using this syntax in C++:
void pop(Item& removed);
is if you're worried about unnecessary copies taking place.
If you return the Item, it may require an additional copy of the object, which may be expensive.
In reality, C++ compilers are very good at copy elision, and almost always implement return value optimization (often even when you compile with optimizations disabled), which makes the point moot, and may even mean the simple "return by value" version becomes faster in some cases.
But if you're into premature optimization (if you're worried that the compiler might not optimize away the copy, even though in practice it will do it), you might argue for "returning" parameters by assigning to a reference parameter.
IMO, a good signature for the equivalent of Java's pop function in C++ would be something like:
boost::optional<Item> pop();
Using option types is the best way to return something that may or may not be available.