Java Best Practice regarding clearing a Linked List

I am coding a linked list data structure in Java (for the sake of learning, I am not using any of the standard Java libraries), and I want to clear the data structure by nulling out the references. Please suggest which approach is better:
1) Just null out the start reference of the list; that will suffice.
2) Apart from nulling out the start, also set the next pointers of all internal nodes to null. Does this help the garbage collector in any way?
My confusion is that approach 2 is what the JDK follows in its LinkedList implementation, but I don't see the same in TreeMap.
I am using JDK 8

This is an interesting question, and the answer has a long history with subtle tradeoffs.
The short answer is, clearing the references is not necessary for the correct operation of the data structure. Since you're learning data structures, I'd suggest that you not worry about this issue in your own implementation. My hunch (though I haven't benchmarked this) is that any benefit that might accrue from clearing all the link nodes will rarely be noticeable under typical conditions.
(In addition, it's likely that under typical conditions, LinkedList will be outperformed by ArrayList or ArrayDeque. There are benchmarks that illustrate this. It's not too difficult to come up with workloads where LinkedList outperforms the others, but it's rarer than people think.)
I was quite surprised to learn that the clear operation of LinkedList unlinks all the nodes from each other and from the contained element. Here's a link to the code in JDK 8. This change dates back to 2003, and the change appeared in JDK 5. This change was tracked by the bug JDK-4863813. That change (or a slightly earlier one) clears the next and previous references from individual nodes when they're unlinked from the list. There's also a test case in that bug report that's of some interest.
The problem seems to be that it is possible to make changes to the LinkedList, which creates garbage nodes, faster than the garbage collector can reclaim them. This eventually causes the JVM to run out of memory. The fact that the garbage nodes are all linked together also seems to have the effect of impeding the garbage collector, making it easier for the mutator threads to outstrip the collector threads. (It's not clear to me how important it is to have multiple threads mutating the list. In the test case they all synchronize on the list, so there's no actual parallelism.) The change to LinkedList to unlink the nodes from each other makes it easier for the collector to do its work, and so apparently makes the test no longer run out of memory.
Fast forward to 2009, when the LinkedList code was given a "facelift." This was tracked by bug JDK-6897553 and discussed in this email review thread. One of the original motivations for the "facelift" was to reduce the clear() operation from O(n) to O(1), as unlinking all the nodes seemed unnecessary to that bug's submitter. (It certainly seemed unnecessary to me!) But after some discussion, it was decided that the unlinking behavior provided enough benefit to the garbage collector to retain it and to document it.
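To make the tradeoff concrete, here is a minimal sketch of the two clearing strategies on a hand-rolled doubly-linked list. The class and field names are illustrative, modeled on the JDK's approach rather than copied from it:

// A minimal doubly-linked list sketch to contrast the two clearing strategies.
// Class and field names are illustrative, not the JDK's exact code.
public class SimpleLinkedList<E> {
    private static class Node<E> {
        E item;
        Node<E> next;
        Node<E> prev;
    }

    private Node<E> first;
    private Node<E> last;
    private int size;

    // Approach 1: just drop the end references; O(1).
    public void simpleClear() {
        first = last = null;
        size = 0;
    }

    // Approach 2: unlink every node (what LinkedList.clear() does); O(n),
    // but each node becomes collectable individually, even if an Iterator
    // still holds a reference to one of them.
    public void unlinkingClear() {
        for (Node<E> x = first; x != null; ) {
            Node<E> next = x.next;
            x.item = null;
            x.next = null;
            x.prev = null;
            x = next;
        }
        first = last = null;
        size = 0;
    }
}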
The comment also says that unlinking the nodes
is sure to free memory even if there is a reachable Iterator
This refers to a somewhat pathological case like the following:
// fields in some class
List<Object> list = createAndPopulateALinkedList();
Iterator<Object> iterator;

void someMethod() {
    iterator = list.iterator();
    // ...
    list.clear();
}
The iterator points to one of the linked list's nodes. Even though the list has been cleared, the iterator still keeps a node alive, and since that node has next and previous references, all of the nodes formerly in the list are still alive. Unlinking all the nodes in clear() lets these be collected. I think this is pretty pathological, though, since it's rare for an iterator to be stored in a field. Usually iterators are created, used, and discarded within a single method, most often within a single for loop.
Now, regarding TreeMap. I don't think there's a fundamental reason why LinkedList unlinks its nodes whereas TreeMap does not. One might like to think that the entire JDK code base is maintained consistently, so that if it's good practice for LinkedList to unlink its nodes, this also ought to have been done to TreeMap. Alas, this is not the case. Most likely what happened is that a customer ran into the pathological behavior with LinkedList and the change was made there, but nobody has ever observed similar behavior with TreeMap. Thus there was no impetus to update TreeMap.
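For comparison, TreeMap.clear() in JDK 8 essentially just drops the root reference and resets the bookkeeping fields; simplified, it amounts to:

// Simplified view of TreeMap.clear(): the tree nodes stay linked to one
// another until the whole tree becomes unreachable, unlike LinkedList.clear().
public void clear() {
    modCount++;
    size = 0;
    root = null;
}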

Related

Is there a way to opt for "unspecified behavior" rather than ConcurrentModificationException?

I know that code like
for (Object o : collection) {
    if (condition(o)) {
        collection.remove(o);
    }
}
will throw a ConcurrentModificationException, and I understand why: modifying the collection directly could interfere with the Iterator's ability to keep track of its place, for instance by leaving it with a reference to an element that's no longer part of the collection, or by causing it to skip over one that's just been added. For code like the above, that's a reasonable concern. However, I would like to write something like
for (Object o : set) { // set is an instance of java.util.LinkedHashSet
    if (condition(o)) {
        set.remove(other(o));
    }
}
Where other(o) is guaranteed to be "far" from o in the ordering of set. In my particular implementation it will never be less than 47 "steps" away from o. Additionally, if condition(o) is true, the loop in question is guaranteed to short-circuit well before it reaches the place where other(o) was. Thus the entire portion of the set accessed by the iterator is thoroughly decoupled from the portion that is modified. Furthermore, the particular strengths of LinkedHashSet (fast random-access insertion and removal, guaranteed iteration order) seem particularly well-suited to this exact sort of operation.
I suppose my question is twofold: First of all, is such an operation still dangerous given the above constraints? The only way that I can think that it might be is that the Iterator values are preloaded far in advance and cached, which I suppose would improve performance in many applications, but seems like it would also reduce it in many others, and therefore be a strange choice for a general-purpose class from java.util. But perhaps I'm wrong about that. When it comes to things like caching, my intuition about efficiency is often suspect. Secondly, assuming this sort of thing is, at least in theory, safe, is there a way, short of completely re-implementing LinkedHashSet, or sacrificing efficiency, to achieve this operation? Can I tell Collections to ignore the fact that I'm modifying a different part of the Set, and just go about its business as usual? My current work-around is to add elements to an intermediate collection first, then add them to the main set once the loop is complete, but this is inefficient, since it has to add the values twice.
The ConcurrentModificationException is thrown because your collection may not be able to handle the removal (or addition) at all times. For example, what if the removal you performed meant that your LinkedHashSet had to reduce/increase the space the underlying HashMap takes under the hood? It would have to make a lot of changes, which would possibly render the iterator useless.
You have two options:
Use an Iterator to iterate over the elements and remove them through it, e.g. calling Iterator iter = linkedHashSet.iterator() to get the iterator, advancing with iter.next(), and removing the element last returned by next() with iter.remove() (see the sketch below)
Use one of the concurrent collections available under the java.util.concurrent package, which are designed to allow concurrent modifications
This question contains nice details on using Iterator
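For illustration, a minimal sketch of option 1; the strings and the length check stand in for your own elements and condition(o). Note that iter.remove() can only remove the element last returned by next(), so it resolves the exception but does not directly cover removing other(o):

import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class IteratorRemovalExample {
    public static void main(String[] args) {
        Set<String> set = new LinkedHashSet<>();
        set.add("a");
        set.add("bb");
        set.add("ccc");

        // Remove elements through the iterator itself; this does not throw
        // ConcurrentModificationException because the iterator knows about
        // the removal.
        Iterator<String> iter = set.iterator();
        while (iter.hasNext()) {
            String s = iter.next();
            if (s.length() > 1) {    // stand-in for condition(o)
                iter.remove();       // removes the element last returned by next()
            }
        }
        System.out.println(set);     // prints [a]
    }
}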
UPDATE after comments:
You can use the following pattern in order to remove the elements you wish without causing a ConcurrentModificationException: gather the elements you wish to remove in a List while looping through the LinkedHashSet elements. Afterwards, loop through each toBeDeleted element in the list and remove it from the LinkedHashSet.
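A minimal sketch of that pattern, with condition(...) and other(...) as placeholders for your own logic:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DeferredRemovalExample {

    public static void main(String[] args) {
        Set<String> set = new LinkedHashSet<>();
        set.add("a");
        set.add("b");
        set.add("c");

        // First pass: only read the set, collecting what should go.
        List<String> toBeDeleted = new ArrayList<>();
        for (String o : set) {
            if (condition(o)) {
                toBeDeleted.add(other(o));
            }
        }

        // Second pass: mutate the set outside of any iteration over it,
        // so no ConcurrentModificationException can occur. A plain loop
        // over toBeDeleted calling set.remove(...) would work equally well.
        set.removeAll(toBeDeleted);
        System.out.println(set);     // prints [a, b]
    }

    // Placeholder methods standing in for the asker's condition(o) and other(o).
    private static boolean condition(String o) {
        return o.equals("a");
    }

    private static String other(String o) {
        return "c";
    }
}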

Why is clear an O(n) operation for linked list?

According to attachment 1, linked list's clear operation is O(n).
I have a question about why is it so.
Here is how we implemented the linked list in class (Java):
public class LinkedIntList {
    private ListNode front;
    ......
}
And if I were to write a clear method for this linked list class, this is how I would write it
public void clear() {
    front = null;
}
Given this implementation (I think this is how most people would write it), this would be a single operation that is independent of the size of the list (just setting front to null). Also, by setting the front pointer to null, wouldn't you essentially be asking the garbage collector to "reclaim the underlying memory and reuse it for future object allocation"? In this case, the underlying memory would be the front node and all the nodes that are consecutively attached to it. (http://javabook.compuware.com/content/memory/how-garbage-collection-works.aspx)
After stating all of that, how is clear an O(n) operation for linked list?
Attachment 1:
This is from a data structures class I am in
Remember that a Linked List has n entries that were allocated for it, and for clearing it, you actually need to free them.
Since Java has a built-in garbage collector (GC), you don't need to explicitly free those, but the GC will go over each and every one of them and free them when the time comes.
So even though your explicit method is O(1), invoking it requires O(n) work from the GC, which will make your program O(n).
I expect that your data structures class is not assuming that Java is the only system in the world.
In C, C++, Pascal, assembly, machine code, Objective-C, VB 6, etc., it takes a fixed amount of time to free each block of memory, as they do not have a garbage collector. Until very recently, most programs were written without the benefit of a garbage collector.
So in any of the above, every node will need to be passed to free(), and each call to free() takes roughly a fixed time.
In Java, the linked list would take O(1) time to clear for a simple implementation of a linked list.
However, as nodes may be pointed to from outside of the list, or a garbage collector may consider different parts of memory at different times, there can be real-life benefits to setting all the "next" and "prev" pointers to null. But in 99% of cases, it is best just to set the "front" pointer in the header to null, as your code shows.
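If you did want the O(n) variant for the LinkedIntList shown above, it would look roughly like this (assuming ListNode exposes a next field):

// O(n) clear: walk the chain and break every link so each node is
// individually unreachable, regardless of outside references to other nodes.
public void clear() {
    ListNode current = front;
    while (current != null) {
        ListNode next = current.next;
        current.next = null;
        current = next;
    }
    front = null;
}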
I think you should ask your lecturer about this, as I expect lots of the students in the class will have the same question. You need to learn C well before you can understand most general data structure books or classes.

Need an efficient Map or Set that does NOT produce any garbage when adding and removing

So because Javolution does not work (see here), I am in deep need of a Java Map implementation that is efficient and produces no garbage under simple usage. java.util.Map will produce garbage as you add and remove keys. I checked Trove and Guava, but it does not look like they have Set<E> implementations. Where can I find a simple and efficient alternative for java.util.Map?
Edit for EJP:
An entry object is allocated when you add an entry, and released to GC when you remove it. :(
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    if (size++ >= threshold)
        resize(2 * table.length);
}
Taken literally, I am not aware of any existing implementation of Map or Set that never produces any garbage on adding and removing a key.
In fact, the only way that it would even be technically possible (in Java, using the Map and Set APIs as defined) is if you were to place a strict upper bound on the number of entries. Practical Map and Set implementations need extra state proportional to the number of elements they hold. This state has to be stored somewhere, and when the current allocation is exceeded that storage needs to be expanded. In Java, that means that new nodes need to be allocated.
(OK, you could design a data structure class that held onto old useless nodes forever, and therefore never generated any collectable garbage ... but it would still be generating garbage.)
So what can you do about this in practice ... to reduce the amount of garbage generated? Let's take HashMap as an example:
Garbage is created when you remove an entry. This is unavoidable, unless you replace the hash chains with an implementation that never releases the nodes that represent the chain entries. (And that's a bad idea ... unless you can guarantee that the free node pool size will always be small. See below for why it is a bad idea.)
Garbage is created when the main hash array is resized. This can be avoided in a couple of ways:
You can give a 'capacity' argument in the HashMap constructor to set the size of the initial hash array large enough that you never need to resize it (see the sketch after this list). (But that potentially wastes space ... especially if you can't accurately predict how big the HashMap is going to grow.)
You can supply a ridiculous value for the 'load factor' argument to cause the HashMap to never resize itself. (But that results in a HashMap whose hash chains are unbounded, and you end up with O(N) behaviour for lookup, insertion, deletion, etc.)
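As a sketch of the first of these options (the sizes here are made up for illustration; you trade a larger up-front array for never generating discarded arrays on resize):

import java.util.HashMap;
import java.util.Map;

public class PresizedHashMapExample {
    public static void main(String[] args) {
        // If you know you will never hold more than ~10,000 entries, size the
        // table so that capacity * loadFactor stays above that bound; the
        // backing array is then allocated once and never replaced.
        int expectedEntries = 10_000;           // assumption for illustration
        float loadFactor = 0.75f;               // the default
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);

        Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put("key-" + i, i);             // no resize, so no discarded arrays
        }
        System.out.println(map.size());
    }
}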
In fact, creating garbage is not necessarily bad for performance. Indeed, hanging onto nodes so that the garbage collector doesn't collect them can actually be worse for performance.
The cost of a GC run (assuming a modern copying collector) is mostly in three areas:
Finding nodes that are not garbage.
Copying those non-garbage nodes to the "to-space".
Updating references in other non-garbage nodes to point to objects in "to-space".
(If you are using a low-pause collector there are other costs too ... generally proportional to the amount of non-garbage.)
The only part of the GC's work that actually depends on the amount of garbage, is zeroing the memory that the garbage objects once occupied to make it ready for reuse. And this can be done with a single bzero call for the entire "from-space" ... or using virtual memory tricks.
Suppose your application / data structure hangs onto nodes to avoid creating garbage. Now, when the GC runs, it has to do extra work to traverse all of those extra nodes, and copy them to "to-space", even though they contain no useful information. Furthermore, those nodes are using memory, which means that if the rest of the application generates garbage there will be less space to hold it, and the GC will need to run more often.
And if you've used weak/soft references to allow the GC to claw back nodes from your data structure, then that's even more work for the GC ... and space to represent those references.
Note: I'm not claiming that object pooling always makes performance worse, just that it often does, especially if the pool gets unexpectedly big.
And of course, that's why HashMap and similar general purpose data structure classes don't do any object pooling. If they did, they would perform significantly worse in situations where the programmer doesn't expect it ... and they would be genuinely broken, IMO.
Finally, there is an easy way to tune a HashMap so that an add immediately followed by a remove of the same key produces no garbage (guaranteed). Wrap it in a Map class that caches the last entry "added", and only does the put on the real HashMap when the next entry is added. Of course, this is NOT a general purpose solution, but it does address the use case of your earlier question.
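A rough sketch of that wrapper idea, not a full java.util.Map implementation; the class name and single-slot buffer are illustrative:

import java.util.HashMap;
import java.util.Map;

// Sketch of a wrapper that buffers the most recently added entry, so an
// add immediately followed by a remove of the same key never allocates an
// entry in the underlying HashMap. Null keys are not handled.
public class LastEntryCachingMap<K, V> {
    private final Map<K, V> backing = new HashMap<>();
    private K cachedKey;        // the buffered entry, not yet in 'backing'
    private V cachedValue;
    private boolean hasCached;

    public void put(K key, V value) {
        if (hasCached && cachedKey.equals(key)) {
            cachedValue = value;                    // overwrite the buffered entry
            return;
        }
        if (backing.containsKey(key)) {
            backing.put(key, value);                // existing node is reused, no new allocation
            return;
        }
        if (hasCached) {
            backing.put(cachedKey, cachedValue);    // flush the previously buffered entry
        }
        cachedKey = key;
        cachedValue = value;
        hasCached = true;
    }

    public V get(K key) {
        if (hasCached && cachedKey.equals(key)) {
            return cachedValue;
        }
        return backing.get(key);
    }

    public V remove(K key) {
        if (hasCached && cachedKey.equals(key)) {
            V old = cachedValue;
            cachedKey = null;
            cachedValue = null;
            hasCached = false;                      // the HashMap never saw this entry
            return old;
        }
        return backing.remove(key);
    }

    public int size() {
        return backing.size() + (hasCached ? 1 : 0);
    }
}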
I guess you need a version of HashMap that uses open addressing, and you'll want something better than linear probing. I don't know of a specific recommendation though.
http://sourceforge.net/projects/high-scale-lib/ has implementations of Set and Map which do not create garbage on add or remove of keys. The implementation uses a single array with alternating keys and values, so put(k,v) does not create an Entry object.
Now, there are some caveats:
Rehash creates garbage because it replaces the underlying array.
I think this map will rehash given enough interleaved put & delete operations, even if the overall size is stable (to harvest tombstone values).
This map will create Entry objects if you ask for the entry set (one at a time as you iterate).
The class is called NonBlockingHashMap.
One option is to try to fix the HashMap implementation to use a pool of entries. I have done that. :) There are also other optimizations for speed you can do there. I agree with you: that issue with Javolution FastMap is mind-boggling. :(

Java: what is the overhead of using ConcurrentSkipList* when no concurrency is needed?

I need a sorted list in a scenario dominated by iteration (compared to insert/remove; no random get at all). For this reason I thought about using a skip list rather than a tree (the iterator should be faster).
The problem is that Java 6 only has a concurrent implementation of a skip list, so I was wondering whether it makes sense to use it in a non-concurrent scenario or whether the overhead makes it the wrong decision.
From what I know, the ConcurrentSkipList* classes are basically lock-free implementations based on CAS, so they should not carry (much) overhead, but I wanted to hear somebody else's opinion.
EDIT:
Some micro-benchmarking (running iteration multiple times on different-sized TreeSet, LinkedList, ConcurrentSkipList and ArrayList instances) shows that there's quite an overhead. ConcurrentSkipList does store the elements in a linked list internally, so the only reason it would be slower on iteration than a LinkedList would be the aforementioned overhead.
If thread-safety is not required, I'd say to skip the java.util.concurrent package altogether.
What's interesting is that sometimes ConcurrentSkipList is slower than TreeSet on the same input, and I haven't figured out why yet.
I mean, have you seen the source code for ConcurrentSkipListMap? :-) I always have to smile when I look at it. It's 3000 lines of some of the most insane, scary, and at the same time beautiful code I've ever seen in Java. (Kudos to Doug Lea and co. for getting all the concurrency utils integrated so nicely with the collections framework!) Having said that, on modern CPUs the code and algorithmic complexity won't even matter so much. What usually makes more difference is having the data to iterate co-located in memory, so that the CPU cache can do its job better.
So in the end I'll wrap ArrayList with a new addSorted() method that does a sorted insert into the ArrayList.
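For example, such an addSorted helper could use Collections.binarySearch to find the insertion point (a sketch; it assumes the list is kept sorted by natural order):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedArrayList {
    public static <T extends Comparable<? super T>> void addSorted(List<T> list, T element) {
        // binarySearch returns (-(insertionPoint) - 1) when the element is absent.
        int index = Collections.binarySearch(list, element);
        if (index < 0) {
            index = -index - 1;
        }
        list.add(index, element);   // O(log n) search + O(n) shift inside the ArrayList
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        addSorted(list, 5);
        addSorted(list, 1);
        addSorted(list, 3);
        System.out.println(list);   // [1, 3, 5]
    }
}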
Sounds good. If you really need to squeeze every drop of performance out of iteration you could also try iterating a raw array directly. Repopulate it upon each change, e.g. by calling TreeSet.toArray() or generating it then sorting it in-place using Arrays.sort(T[], Comparator<? super T>). But the gain could be tiny (or even nothing if the JIT does its job well) so it might not be worth the inconvenience.
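A sketch of that raw-array variant (type and method names are illustrative; the array is rebuilt only when the set changes):

import java.util.TreeSet;

public class SnapshotIterationExample {
    private final TreeSet<String> set = new TreeSet<>();
    private String[] snapshot = new String[0];

    // Call this after every change to the set (or after a batch of changes).
    void rebuildSnapshot() {
        snapshot = set.toArray(new String[0]);
        // Already sorted because it came from a TreeSet; an explicit
        // Arrays.sort call would only be needed if the array were
        // populated from an unsorted source.
    }

    long iterate() {
        long totalLength = 0;
        for (String s : snapshot) {   // iterating a plain array: best cache locality
            totalLength += s.length();
        }
        return totalLength;
    }

    public static void main(String[] args) {
        SnapshotIterationExample example = new SnapshotIterationExample();
        example.set.add("b");
        example.set.add("a");
        example.rebuildSnapshot();
        System.out.println(example.iterate());   // 2
    }
}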
As measured using Open JDK 6 on typical production hardware my company uses, you can expect all add and query operations on a skip-list map to take roughly double the time as the same operation on a tree map.
examples:
63 usec vs 31 usec to create and add 200 entries.
145 ns vs 77 ns for get() on that 200-element map.
And the ratio doesn't change all that much for smaller and larger sizes.
(The code for this benchmark will eventually be shared so you can review it and run it yourself; sorry we're not ready to do that yet.)
Well, you can use a lot of other structures to do what a skip list does; it exists in the concurrent package because concurrent data structures are a lot more complicated, and because using a concurrent skip list would cost less than using other concurrent data structures to mimic a skip list.
A single-threaded world is different: you can use a sorted set, a binary tree, or a custom data structure that would perform better than a concurrent skip list.
The complexity of iterating a tree or a skip list will always be O(n), but if you instead use a linked list or an array list, you have a problem with insertion: inserting an item in the right position (to keep the list sorted) has complexity O(n) instead of the O(log n) of a binary tree or a skip list.
You can iterate over TreeMap.keySet() to obtain all inserted keys in order, and it will not be that slow.
There is also the TreeSet class, which is probably what you need, but since it is just a wrapper around TreeMap, direct use of TreeMap would be faster.
Without concurrency, it is usually more efficient to use a balanced binary search tree. In Java, this would be a TreeMap.
Skip lists are generally reserved for concurrent programming because of their ease of implementation and their speed in multithreaded applications.
You seem to have a good grasp of the trade-off here, so I doubt anyone can give you a definitive, principled answer. Fortunately, this is pretty straightforward to test.
I started by creating a simple Iterator<String> that loops indefinitely over a finite list of randomly generated strings. (That is: on initialization, it generates an array _strings of a random strings, each of length b, drawn from a pool of c distinct characters. The first call to next() returns _strings[0], the next call returns _strings[1] … the (a+1)th call returns _strings[0] again.) The strings returned by this iterator were what I used in all calls to SortedSet<String>.add(...) and SortedSet<String>.remove(...).
I then wrote a test method that accepts an empty SortedSet<String> and loops d times. On each iteration, it adds e elements, then removes f elements, then iterates over the entire set. (As a sanity check, it keeps track of the set's size by using the return values of add() and remove(), and when it iterates over the entire set, it makes sure it finds the expected number of elements. Mostly I did that just so there would be something in the body of the loop.)
I don't think I need to explain what my main(...) method does. :-)
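For reference, a sketch of such a harness (the parameter names a–f follow the description above; the values in main are placeholders, and real benchmarking would need warm-up and more care than this):

import java.util.Iterator;
import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

public class SortedSetBenchmarkSketch {

    // Loops forever over 'a' random strings of length 'b' built from 'c' distinct characters.
    static class LoopingStringIterator implements Iterator<String> {
        private final String[] strings;
        private int index = 0;

        LoopingStringIterator(int a, int b, int c, long seed) {
            Random random = new Random(seed);
            strings = new String[a];
            for (int i = 0; i < a; i++) {
                StringBuilder sb = new StringBuilder(b);
                for (int j = 0; j < b; j++) {
                    sb.append((char) ('a' + random.nextInt(c)));
                }
                strings[i] = sb.toString();
            }
        }

        @Override public boolean hasNext() { return true; }

        @Override public String next() {
            String s = strings[index];
            index = (index + 1) % strings.length;
            return s;
        }
    }

    // d iterations; each adds e elements, removes f elements, then walks the whole set.
    static long run(SortedSet<String> set, Iterator<String> source, int d, int e, int f) {
        long start = System.nanoTime();
        int expectedSize = 0;
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < e; j++) {
                if (set.add(source.next())) { expectedSize++; }
            }
            for (int j = 0; j < f; j++) {
                if (set.remove(source.next())) { expectedSize--; }
            }
            int seen = 0;
            for (String ignored : set) { seen++; }      // the iteration being measured
            if (seen != expectedSize) { throw new AssertionError("size mismatch"); }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int a = 5000, b = 5, c = 20, d = 10_000, e = 4, f = 2;   // illustrative values
        System.out.println("TreeSet: " +
                run(new TreeSet<>(), new LoopingStringIterator(a, b, c, 42), d, e, f) + " ns");
        System.out.println("ConcurrentSkipListSet: " +
                run(new ConcurrentSkipListSet<>(), new LoopingStringIterator(a, b, c, 42), d, e, f) + " ns");
    }
}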
I tried various values for the various parameters, and I found that sometimes ConcurrentSkipListSet<String> performed better, and sometimes TreeSet<String> did, but the difference was never much more than twofold. In general, ConcurrentSkipListSet<String> performed better when:
a, b, and/or c were relatively large. (I mean, within the ranges I tested. My a's ranged from 1000 to 10000, my b's from 3 to 10, my c's from 10 to 80. Overall, the resulting set-sizes ranged from around 450 to exactly 10000, with modes of 666 and 6666 because I usually used e = 2f.) This suggests that ConcurrentSkipListSet<String> copes somewhat better than TreeSet<String> with larger sets, and/or with more-expensive string-comparisons. Trying specific values designed to tease apart these two factors, I got the impression that ConcurrentSkipListSet<String> coped noticeably better than TreeSet<String> with larger sets, and slightly less well with more-expensive string-comparisons. (That's basically what you'd expect; TreeSet<String>'s binary-tree approach aims to do the absolute minimum possible number of comparisons.)
e and f were small; that is, when I called add(...) and remove(...) only a small number of times per iteration. (This is exactly what you predicted.) The exact turnover point depended on a, b, and c, but to a first approximation, ConcurrentSkipListSet<String> performed better when e + f was less than around 10, and TreeSet<String> performed better when e + f was more than around 20.
Of course, this was on a machine that may look nothing like yours, using a JDK that may look nothing like yours, and using very artificial data that might look nothing like yours. I'd recommend that you run your own tests. Since Tree* and ConcurrentSkipList* both implement Sorted*, you should have no difficulty trying your code both ways and seeing what you find.
From what I know, the ConcurrentSkipList* classes are basically lock-free implementations based on CAS, so they should not carry (much) overhead, […]
My understanding is that this will depend on the machine. On some systems a lock-free implementation may not be possible, in which case these classes will have to use locks. (But since you're not actually multi-threading, even locks may not be all that expensive. Synchronization has overhead, of course, but its main cost is lock contention and forced single-threading. That isn't an issue for you. Again, I think you'll just have to test and see how the two versions perform.)
As noted, the skip list has a lot of overhead compared to TreeMap, and the TreeMap iterator isn't well suited to your use case because it just repeatedly calls the successor() method, which turns out to be very slow.
So one alternative that will be significantly faster than the previous two is to write your own TreeMap iterator. Actually, I would dump TreeMap altogether, since 3000 lines of code is a bit bulkier than you probably need, and just write a clean AVL tree implementation with the methods you need. The basic AVL logic is just a few hundred lines of code in any language; then add the iterator that works best in your case.

Are LinkedLists an unintuitive solution since most of the time I don't need to know the physical location of an element in a Collection?

Recently a coworker showed me some code he had written with a LinkedList and I couldn't get my head around it.
a -> b -> c -> d -> e -> f
If I want to get d from the LinkedList, don't I have to traverse the list starting with a and iterating up to d or starting with f and iterating back to d?
Why would I care WHERE d is stored physically in the Collection?
Not every linked list is linked in both directions but, normally, yes. This type of collection features sequential access in the forward or forward and reverse directions.
The advantages are:
least amount of memory overhead except for a flat array
very fast insert and delete
memory can be allocated and released one element at a time
easy to implement (not so important with modern languages but it was important in C89 and C99)
LIFO or FIFO ordering is possible
I think that the right question is not WHERE, but HOW your collection stores its elements. Depending on that, the time for adding, searching, deleting, and keeping your collection consistent differs. So, when you choose your collection type, you should keep in mind what the most frequent operation will be and pick the best solution for your case.
Linked lists typically have better performance characteristics than arrays for adding and removing elements.
And yes, if you're operating on sorted data, you do normally care what order elements are stored in.
You probably don't care, regardless of whether you're using a LinkedList or an ArrayList. LinkedLists offer the advantage of being able to easily add elements to the beginning of a list, which you can't do efficiently with an ArrayList.
Lists are not about "physical locations" (whatever you mean by that), lists are a certain data structure that can grow and shrink and provide decent complexity across the various operations.
You don't have to explicitly traverse the linked list, as LinkedList offers indexOf(Object) and get(int). These will still traverse the list, but will do it implicitly.
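For example, using the a–f list from the question (both calls still traverse the list internally, in O(n)):

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class LinkedListLookupExample {
    public static void main(String[] args) {
        List<String> list = new LinkedList<>(Arrays.asList("a", "b", "c", "d", "e", "f"));

        // Both of these hide the traversal, but it still happens under the hood.
        System.out.println(list.indexOf("d"));   // 3
        System.out.println(list.get(3));         // d
    }
}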
You'll care about how a collection orders items because this affects the efficiency of operations on the collection, particularly insert, fetch & removal. Any ordering on the items in a collection also affects the timing of algorithms that use the data structure.
