Are !list.isEmpty() and list.size() > 0 equivalent? - java

I've seen code as below:
if (!substanceList.isEmpty() && (substanceList.size() > 0))
{
    substanceText = createAmountText(substanceList);
}
Would the following be a valid refactor?
if (!substanceList.isEmpty())
{
    substanceText = createAmountText(substanceList);
}
I would be grateful for an explanation of the above code, and whether the second version may cause errors.

If in doubt, read the Javadoc:
Collection.isEmpty():
Returns true if this collection contains no elements.
Collection.size():
Returns the number of elements in this collection
So, assuming the collection is implemented correctly:
collection.isEmpty() <=> collection.size() == 0
Or, conversely:
!collection.isEmpty() <=> collection.size() != 0
Since the number of elements should only be positive, this means that:
!collection.isEmpty() <=> collection.size() > 0
So yes, the two forms are equivalent.
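A quick sketch of that equivalence (the helper name checksAgree is mine, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyCheckDemo {
    // True when both ways of testing "non-empty" agree for the given list.
    static boolean checksAgree(List<?> list) {
        return !list.isEmpty() == (list.size() > 0);
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        System.out.println(checksAgree(list)); // true (empty list)
        list.add("substance");
        System.out.println(checksAgree(list)); // true (non-empty list)
    }
}
```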
Caveat: actually, they're only equivalent if your collection isn't being modified from another thread at the same time.
This:
!substanceList.isEmpty() && (substanceList.size() > 0)
is equivalent to, by the logic I present above:
!substanceList.isEmpty() && !substanceList.isEmpty()
You can only simplify this to
!substanceList.isEmpty()
if you can guarantee that its value doesn't change in between evaluations of substanceList.isEmpty().
Practically, it is unlikely that you need to care about the difference between these cases, at least at this point in the code. You might need to care about the list being changed in another thread, however, if it can become empty before (or while) executing createAmountText. But that's not something that was introduced by this refactoring.
TL;DR: using if (!substanceList.isEmpty()) { does practically the same thing, and is clearer to read.

The only difference between the first and the second approach is that the first approach performs a redundant check, nothing else.
Thus, you should avoid the redundant check and go with the second approach.

Actually, you can read the source code downloaded in the JDK:
/**
 * Returns <tt>true</tt> if this list contains no elements.
 *
 * @return <tt>true</tt> if this list contains no elements
 */
public boolean isEmpty() {
    return size == 0;
}
I think that this settles all the queries.

Implementation of isEmpty() in AbstractCollection is as follows:
public boolean isEmpty() {
    return size() == 0;
}
So you can safely assume that !list.isEmpty() is equivalent to list.size() > 0.
As for "what is better code", if you want to check if the list is empty or not, isEmpty() is definitely more expressive.
Subclasses might also override isEmpty() from AbstractCollection and implement it in a more efficient manner than size() == 0. So (purely theoretically) isEmpty() might be more efficient.
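There is a concrete case of this in the JDK: ConcurrentLinkedQueue.size() traverses the whole queue, while isEmpty() only inspects the head, so the two checks differ in cost even though they agree in result. A small sketch:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class IsEmptyEfficiency {
    public static void main(String[] args) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 5; i++) {
            queue.add(i);
        }
        // Both express "non-empty", but size() walks all nodes (O(n))
        // while isEmpty() inspects only the head (O(1)).
        System.out.println(!queue.isEmpty()); // true
        System.out.println(queue.size() > 0); // true
    }
}
```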

The javadocs for Collection.size() and Collection.isEmpty() say:
boolean isEmpty()
Returns true if this collection contains no elements.
int size()
Returns the number of elements in this collection
Since "contains no elements" implies that the number of elements in the collection is zero, it follows that list.isEmpty() and list.size() == 0 will evaluate to the same value.
"I want some explanation of the above code"
The second version is correct. The first version looks like it was written either by an automatic code generator, or a programmer who doesn't really understand Java. There is no good reason to write the code that way.
(Note: if some other thread could be concurrently modifying the list, then both versions are problematic unless there is proper synchronization. If the list operations are not synchronized, then there may be memory hazards. But in the first version, there is also the possibility of a race condition ... where the list appears to be empty and have a non-zero size!)
"and want to know whether the second way may cause some error."
It won't.
Incidentally list.isEmpty() is preferable to list.size() == 0 for a couple of reasons:
It is more concise (fewer characters).
It expresses the intent of your code more precisely.
It may be more efficient. Some collection implementations may need to count the elements in the collection to compute the size. That may be an O(N) operation, and could have other undesirable effects. For example, if a collection is a lazy list that only gets reified as you iterate the elements, then calling size() may result in excessive memory use.

Yes, it can be refactored as you did. The issue with both approaches is that you would do the check every time you want to call the method createAmountText on a List. This means you would be repeating the logic; a better way would be to apply the DRY (Don't Repeat Yourself) principle and move these checks into your method.
So your method's body should be encapsulated by this check.
It should look like:
<access modifier> String createAmountText(List substanceList) {
    if (substanceList != null && !substanceList.isEmpty()) {
        <the method's logic>
    }
    return null;
}
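A concrete sketch of that shape, assuming a hypothetical createAmountText that simply joins the entries (the real method's logic is unknown here):

```java
import java.util.List;

public class AmountTextDemo {
    // Hypothetical body: the null/empty guard lives inside the method,
    // so callers never repeat the check.
    static String createAmountText(List<String> substanceList) {
        if (substanceList != null && !substanceList.isEmpty()) {
            return String.join(", ", substanceList);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(createAmountText(List.of("water", "salt"))); // water, salt
        System.out.println(createAmountText(List.of()));                // null
        System.out.println(createAmountText(null));                     // null
    }
}
```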

Sure - the two methods can be used to express the same thing.
But worth adding here: going for size() > 0 is somehow a more direct violation of the Tell, don't ask principle: you access an "implementation detail", to then make a decision based on that.
In that sense, isEmpty() should be your preferred choice here!
Of course, you are still violating TDA when using isEmpty() - because you are again fetching status from some object to then make a decision on it.
So the really best choice would be to write code that doesn't need at all to make such a query to internal state of your collection to then drive decisions on it. Instead, simply make sure that createAmountText() properly deals with you passing in an empty list! Why should users of this list, or of that method need to care whether the list is empty or not?!
Long story short: maybe that is "over thinking" here - but again: not using these methods would lead you to write less code! And that is always an indication of a good idea.

Related

Cleanest way to shorten two similar methods

Down below you can see two example methods which are structured in the same way but have to work with completely different integers.
You can guess that if the code gets longer, it is pretty annoying to have a second long method doing the same thing.
Do you have any idea how I can combine those two methods without using "if" or "switch" statements at every spot?
Thanks for your help!
public List<> firstTestMethod() {
    if (blabla != null) {
        if (blabla.getChildren().size() > 1) {
            return blabla.getChildren().subList(2, blabla.getChildren().size());
        }
    }
    return null;
}
And:
public List<> secondTestMethod() {
    if (blabla != null) {
        if (blabla.getChildren().size() > 4) {
            return blabla.getChildren().subList(0, 2);
        }
    }
    return null;
}
Attempting to isolate common ground from 2 or more places into its own Helper method is not a good idea if you're just looking at what the code does without any context.
The right approach is first to define what you're actually isolating. It's not so much about the how (the fact that these methods look vaguely similar suggests that the how is the same, yes), but the why. What do these methods attempt to accomplish?
Usually, the why is also mostly the same. Rarely, the why is completely different, and the fact that the methods look similar is a pure coincidence.
Here's a key takeaway: If the why is completely different but the methods look somewhat similar, you do not want to turn them into a single method. DRY is a rule of thumb, not a commandment!
Thus, your question isn't directly answerable, because the 2 snippets are so abstractly named (blabla isn't all that informative), it's not possible to determine with the little context the question provides what the why might be.
Thus, answer the why question first, and usually the strategy on making a single method that can cater to both snippets here becomes trivial.
Here is an example answer: If list is 'valid', return the first, or last, X elements inside it. Validity is defined as follows: The list is not null, and contains at least Z entries. Otherwise, return null.
That's still pretty vague, and dangerously close to a 'how', but it sounds like it might describe what you have here.
An even better answer would be: blabla represents a family; determine the subset of children who are eligible for inheriting the property.
The reason you want this is twofold:
It makes it much easier to describe a method. A method that seems to do a few completely unrelated things and is incapable of describing the rhyme or reason of any of it cannot be understood without reading the whole thing through, which takes a long time and is error-prone. A large part of why you want methods in the first place is to let the programmer (the human) abstract ideas away. Instead of remembering what these 45 lines do, all you need to remember is 'fetch the eligible kids'.
Code changes over time. Bugs are found and need fixing. External influences change around you (APIs change, libraries change, standards change). Feature requests are a thing. Without the why part it is likely that one of the callers of this method grows needs that this method cannot provide, and then the 'easiest' (but not best!) solution is to just add the functionality to this method. The method will eventually grow into a 20 page monstrosity doing completely unrelated things, and having 50 parameters. To guard against this growth, define what the purpose of this method is in a way that is unlikely to spiral into 'read this book to understand what all this method is supposed to do'.
Thus, your question is not really answerable, as the 2 snippets do not make it obvious what the common thread might be, here.
Why do these methods abuse null? You seem to think null means empty list. It does not. Empty list means empty list. Shouldn't this be returning e.g. List.of instead of null? Once you fix that up, this method appears to simply be: "Give me a sublist consisting of everything except the first two elements. If the list is smaller than that or null, return an empty list", which is starting to move away from the 'how' and slowly towards a 'what' and 'why'. There are only 2 parameters to this generalized concept: The list, and the # of items from the start that need to be omitted.
The second snippet, on the other hand, makes no sense. Why return the first 2 elements, but only if the list has 5 or more items in it? What's the link between 2 and 5? If the answer is: "Nothing, it's a parameter", then this conundrum has far more parameters than the first snippet, and we see that whilst the code looks perhaps similar, once you start describing the why/what instead of the how, these two jobs aren't similar at all, and trying to shoehorn these 2 unrelated jobs into a single method is just going to lead to bad code now, and worse code later on as changes occur.
Let's say instead that this last snippet is trying to return all elements except the X elements at the end, returning an empty list if there are fewer than X. This matches much better with the first snippet (which does the same thing, except replace 'at the end' with 'at the start'). Then you could write:
// document somewhere that `blabla.getChildren()` is guaranteed to be sorted by age.

/** Returns the {@code numEldest} eldest children. */
public List<Child> getEldest(int numEldest) {
    if (numEldest < 0) throw new IllegalArgumentException();
    return getChildren(numEldest, true);
}

/** Returns all children except the {@code numEldest} eldest ones. */
public List<Child> getAllButEldest(int numEldest) {
    if (numEldest < 0) throw new IllegalArgumentException();
    return getChildren(numEldest, false);
}

private List<Child> getChildren(int numEldest, boolean include) {
    if (blabla == null) return List.of();
    List<Child> children = blabla.getChildren();
    if (numEldest >= children.size()) return include ? children : List.of();
    int startIdx = include ? 0 : numEldest;
    int endIdx = include ? numEldest : children.size();
    return children.subList(startIdx, endIdx);
}
Note a few stylistic tricks here:
boolean parameters are bad, because why would you know 'true' matches up with 'I want the eldest' and 'false' matches up with 'I want the youngest'? Names are good. This snippet has 2 methods that make very clear what they do, by using names.
That 'when extracting common ground, define the why, not the how' is a hierarchical idea - apply it all the way down, and as you get further away from the thousand-mile view, the what and how become more and more technical. That's okay. The more down to the details you get, the more private things should be.
By having defined what this all actually means, note that the behaviour is subtly different: if you ask for the 5 eldest children and there are only 4 children, this returns those 4 children instead of null. That shows off some of the power of defining the 'why': now it's a consistent idea. Returning all 4 when you ask for 'give me the 5 eldest' is no doubt what 90%+ of those who get near this code would assume happens.
Preconditions, such as what comprises sane inputs, should always be checked. Here, we check if the numEldest param is negative and just crash out, as that makes no sense. Checks should be as early as they can reasonably be made: That way the stack traces are more useful.
You can pass objects that encapsulate the desired behavior differences at various points in your method. Often you can use a predefined interface for behavior encapsulation (Runnable, Callable, Predicate, etc.) or you may need to define your own.
public List<> testMethod(Predicate<BlaBlaType> test,
                         Function<BlaBlaType, List<>> extractor) {
    if (blabla != null) {
        if (test.test(blabla)) {
            return extractor.apply(blabla);
        }
    }
    return null;
}
You could then call it with a couple of lambdas:
testMethod(
    blabla -> blabla.getChildren().size() > 1,
    blabla -> blabla.getChildren().subList(2, blabla.getChildren().size())
);

testMethod(
    blabla -> blabla.getChildren().size() > 4,
    blabla -> blabla.getChildren().subList(0, 2)
);
Here is one approach. Pass a named boolean to indicate which version you want. This also allows the list of children to be retrieved independent of the return. For lack of more meaningful names I choose START and END to indicate which parts of the list to return.
static boolean START = true;
static boolean END = false;

public List<Children> testMethod(boolean type) {
    if (blabla != null) {
        List<Children> list = blabla.getChildren();
        int size = list.size();
        return type ?
            (size > 4 ? list.subList(0, 2) : null) :
            (size > 1 ? list.subList(2, size) : null);
    }
    return null;
}

why does iterator not enforce hasnext call?

It seems like it would be quite a good approach if hasNext() and next() worked like this:
boolean hasNextCalled = false;

boolean hasNext() {
    hasNextCalled = true;
    ...
}

next() {
    assert hasNextCalled;
    ...
}
This way we would never end up in a case where we would get a NoSuchElementException.
Is there any practical reason why the hasNext() call is not enforced?
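For what it's worth, the proposal can be prototyped as a decorator around any existing Iterator, which also makes its costs concrete (this wrapper is my own sketch, not a JDK class):

```java
import java.util.Iterator;
import java.util.List;

public class CheckedIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private boolean hasNextCalled = false;

    CheckedIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        hasNextCalled = true;
        return delegate.hasNext();
    }

    @Override
    public T next() {
        // Enforce the hasNext()-before-next() protocol.
        if (!hasNextCalled) {
            throw new IllegalStateException("call hasNext() before next()");
        }
        hasNextCalled = false;
        return delegate.next();
    }

    public static void main(String[] args) {
        Iterator<String> it = new CheckedIterator<>(List.of("a", "b").iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Note that this breaks every legitimate caller who skips hasNext() because the element count is already known, which is exactly the objection the answers below raise.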
What would be the benefit? You're simply replacing a NoSuchElementException with an AssertionError, plus introducing a tiny bit of overhead. Also, since Iterator is an interface, you couldn't implement this once; it would have to go in every implementation of Iterator. Plus the documentation doesn't impose a requirement to call hasNext before calling next, so your proposal would break the current contract. Such a change would break any code that was written to rely on a NoSuchElementException. Finally, assertions can be turned off in production code, so you would still need the NoSuchElementException mechanism.
NoSuchElementException is a runtime exception, and reflects programmer error...exactly like your approach does. It's not obligatory to call hasNext() because maybe you don't need to -- you know the size of the collection in advance, for example, and know how many calls to next() you can make.
The point is that you're exchanging one way of reporting programmer error for...another way of reporting programmer error that can disable some useful approaches.
Maybe we already know that there are elements left. For example, maybe we're iterating over two equally-sized lists in lockstep, and we only need to call hasNext on one iterator to check for both. Also, asserting the hasNext call doesn't actually prevent anyone from calling next without hasNext, especially if assertions are off.
You may know there's a next(), for example if you always have pairs of elements, one call to hasNext() will allow two calls to next().
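The pairwise case can be sketched like this (joinPairs is my own illustrative helper):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PairIteration {
    // Consumes a flat list of key/value pairs: one hasNext() check is
    // enough to justify two next() calls.
    static List<String> joinPairs(List<String> flat) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = flat.iterator();
        while (it.hasNext()) {
            String key = it.next();
            String value = it.next(); // safe: elements always come in pairs
            out.add(key + "=" + value);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinPairs(List.of("a", "1", "b", "2"))); // [a=1, b=2]
    }
}
```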
My 2 cents. Here the API would be making an assumption about client usage and forcing it. Let's say I am always sure that I get back only one result; then it is better to bypass the hasNext() call and directly retrieve the element by calling just next().
Firstly, the assert you suggest would only run when assertions are enabled.
But the main issue is that you consider only one use-case. The library is designed to support programmers in their work with minimal restrictions, and each class and method has to fit into a cohesive and coherent whole.
Other posters give good reasons also (as I was typing), especially that Iterator is an Interface, and has many implementations.

Check if only one element exists using Guava

Recently I had a need to do a 'special case' scenario when only one element exists in a collection. Checking for ...size() == 1 and retrieving with ...iterator().next() looked ugly, so I created two methods in a home-brew Collections class:
public class Collections {

    public static <T> boolean isSingleValue(Collection<T> values) {
        return values.size() == 1;
    }

    public static <T> T singleValue(Collection<T> values) {
        Assert.isTrue(isSingleValue(values));
        return values.iterator().next();
    }
}
A few days ago I discovered that Guava has a method called Iterables.getOnlyElement. It covers my need and replaces singleValue, but I can't find a match for isSingleValue. Is that by design? Is it worth putting in a feature request for an Iterables.isOnlyElement method?
EDIT:
Since there were few upvotes I decided to open enhancement on guava - issue 957. Final resolution - 'WontFix'. Arguments are very similar to what Thomas/Xaerxess provided.
Well, you'd not gain much by replacing values.size() == 1 with a method, except you could check for null. However, there are methods in Apache Commons Collections (as well as in Guava, I assume) to do that.
I'd rather write if( values.size() == 1 ) or if( SomeHelper.size(values) == 1 ) than if( SomeHelper.isSingleValue(values) ) - the intent is much clearer in the first two approaches and it's as much code to write as with the third approach.
Just in addition to the other answers (I was going to write something like #daveb, who deleted his answer: if there is not exactly one element, then Iterables#getOnlyElement will throw an IllegalArgumentException or NoSuchElementException) - an answer to the question of why there isn't any Iterables.isSingleValue(Iterable) in Guava.
I think you're approaching this wrong. If:
the method invocation doesn't change state (unlike next() in an iterator, which is why hasNext() exists)
and you can clearly and explicitly say that the value returned isn't an exceptional case (unlike null returned from Map#get(Object) - it can be a null value, or it can mean that the key wasn't found in the map)
then there is no need for a method that checks whether a condition is true before doing some operation (with an assertion in it!) like in your sample code.
If you are absolutely sure that the iterable at this point cannot have a size other than 1, then the condition check is redundant (an exception is thrown in other cases).
If you only want to get the first element of a non-empty collection, collection.iterator().next() is perfectly OK (NoSuchElementException is thrown if the collection is empty).
If you don't know anything about the collection's size, then Iterables.getFirst(iterable, default) is for you.
P.S. If your Collections#isSingleValue is used only locally here (and hence could be private), that really means you don't need that check before calling Iterables#getOnlyElement.
P.P.S. Another answer to your question about Guava's design could be Item 57 of Joshua Bloch's Effective Java - there are a few different helper methods in Guava, mentioned above, that explicitly say what the exceptional case is for each one; a boolean check wasn't added in order to keep the API as small as possible.
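The contract described above can be sketched without Guava; this is a plain-Java approximation of what Iterables.getOnlyElement promises, not Guava's actual implementation:

```java
import java.util.Iterator;
import java.util.List;

public class OnlyElement {
    // Throws NoSuchElementException for an empty iterable and
    // IllegalArgumentException when there is more than one element.
    static <T> T getOnlyElement(Iterable<T> iterable) {
        Iterator<T> it = iterable.iterator();
        T first = it.next(); // NoSuchElementException if empty
        if (it.hasNext()) {
            throw new IllegalArgumentException("expected exactly one element");
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(getOnlyElement(List.of(42))); // 42
    }
}
```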
Right now I'm having the same problem.
I'll solve it with this code:
public static <T> void hasJustOne(T... values) {
    hasJustOne(Predicates.notNull(), values);
}

public static <T> void hasJustOne(Predicate<T> predicate, T... values) {
    Collection<T> filtered = Collections2.filter(Arrays.asList(values), predicate);
    Preconditions.checkArgument(filtered.size() == 1);
}

Is -1 a magic number? An anti-pattern? A code smell? Quotes and guidelines from authorities [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Constant abuse?
I've seen -1 used in various APIs, most commonly when searching into a "collection" with zero-based indices, usually to indicate the "not found" index. This "works" because -1 is never a legal index to begin with. It seems that any negative number should work, but I think -1 is almost always used, as some sort of (unwritten?) convention.
I would like to limit the scope to Java at least for now. My questions are:
What are the official words from Sun regarding using -1 as a "special" return value like this?
What quotes are there regarding this issue, from e.g. James Gosling, Josh Bloch, or even other authoritative figures outside of Java?
What were some of the notable discussions regarding this issue in the past?
This is a common idiom in languages where the types do not include range checks. An "out of bounds" value is used to indicate one of several conditions. Here, the return value indicates two things: 1) was the character found, and 2) where was it found.
The use of -1 for not found and a non-negative index for found succinctly encodes both of these into one value, and the fact that not-found does not need to return an index.
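String.indexOf is the canonical example of this encoding:

```java
public class IndexOfDemo {
    public static void main(String[] args) {
        String s = "hello";
        // One int carries both answers: "was it found?" and "where?".
        System.out.println(s.indexOf('e')); // 1  (found, at position 1)
        System.out.println(s.indexOf('z')); // -1 (the "not found" sentinel)
    }
}
```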
In a language with strict range checking, such as Ada or Pascal, the method might be implemented as (pseudo code)
bool indexOf(c:char, position:out Positive);
Positive is a subtype of int, but restricted to non-negative values.
This separates the found/not-found flag from the position. The position is provided as an out parameter - essentially another return value. It could also be an in-out parameter, to start the search from a given position. Use of -1 to indicate not-found would not be allowed here since it violates range checks on the Positive type.
The alternatives in java are:
throw an exception: this is not a good choice here, since not finding a character is not an exceptional condition.
split the result into several methods, e.g. boolean indexOf(char c); int lastFoundIndex();. This implies the object must hold on to state, which will not work in a concurrent program, unless the state is stored in thread-local storage, or synchronization is used - all considerable overheads.
return the position and found flag separately: such as boolean indexOf(char c, Position pos). Here, creating the position object may be seen as unnecessary overhead.
create a multi-value return type
such as
class FindIndex {
    boolean found;
    int position;
}

FindIndex indexOf(char c);
although it clearly separates the return values, it suffers object creation overhead. Some of that could be mitigated by passing the FindIndex as a parameter, e.g.
FindIndex indexOf(char c, FindIndex start);
Incidentally, multiple return values were going to be part of java (oak), but were axed prior to 1.0 to cut time to release. James Gosling says he wishes they had been included. It's still a wished-for feature.
My take is that use of magic values are a practical way of encoding a multi-valued results (a flag and a value) in a single return value, without requiring excessive object creation overhead.
However, if using magic values, it's much nicer to work with if they are consistent across related api calls. For example,
// get everything after the first c
int index = str.indexOf('c');
String afterC = str.substring(index);
Java falls short here, since the use of -1 in the call to substring will cause an IndexOutOfBoundsException. Instead, it might have been more consistent for substring to return "" when invoked with -1, if negative values are considered to start at the end of the string. Critics of magic values for error conditions say that the return value can be ignored (or assumed to be positive). A consistent api that handles these magic values in a useful way would reduce the need to check for -1 and allow for cleaner code.
Is -1 a magic number?
In this context, not really. There is nothing special about -1 ... apart from the fact that it is guaranteed to be an invalid index value by virtue of being negative.
An anti-pattern?
No. To qualify as an anti-pattern there would need to be something harmful about this idiom. I see nothing harmful in using -1 this way.
A code smell?
Ditto. (It is arguably better style to use a named constant rather than a bare -1 literal. But I don't think that is what you are asking about, and it wouldn't count as "code smell" anyway, IMO.)
Quotes and guidelines from authorities
Not that I'm aware of. However, I would observe that this "device" is used in various standard classes. For example, String.indexOf(...) returns -1 to say that the character or substring could not be found.
As far as I am concerned, this is simply an "algorithmic device" that is useful in some cases. I'm sure that if you looked back through the literature, you will see examples of using -1 (or 0 for languages with one-based arrays) this way going back to the 1960's and before.
The choice of -1 rather than some other negative number is simply a matter of personal taste, and (IMO) not worth analyzing in this context.
It may be a bad idea for a method to return -1 (or some other value) to indicate an error instead of throwing an exception. However, the problem here is not the value returned but the fact that the method is requiring the caller to explicitly test for errors.
The flip side is that if the "condition" represented by -1 (or whatever) is not an "error" / "exceptional condition", then returning the special value is both reasonable and proper.
Both Java and JavaScript use -1 when an index isn't found. Since the index is always 0-n it seems a pretty obvious choice.
//JavaScript
var url = 'example.com/foo?bar&admin=true';
if (url.indexOf('&admin') != -1) {
    alert('we likely have an insecure app!');
}
I find this approach (which I've used when extending Array-type elements to have a .indexOf() method) to be quite normal.
On the other hand, you can try the PHP approach e.g. strpos() but IMHO it gets confusing as there are multiple return types (it returns FALSE when not found)
-1 as a return value is slightly ugly but necessary. The alternatives to signal a "not found" condition are IMHO all much worse:
You could throw an Exception, but this isn't ideal because Exceptions are best used to signal unexpected conditions that require some form of recovery or propagated failure. Not finding an occurrence of a substring is actually pretty expected. Also, Exception throwing has a significant performance penalty.
You could use a compound result object with (found, index), but this requires an object allocation and more complex code on the part of the caller to inspect the result.
You could separate out two separate function calls for contains and indexOf - however this is again quite cumbersome for the caller, and also results in a performance hit as both calls would be O(n) and require a full traversal of the String.
Personally, I never like to refer to the -1 constant: my test for not-found is always something like:
int i = someString.indexOf("substring");
if (i >= 0) {
    // do stuff with found index
} else {
    // handle not found case
}
It is good practice to define a final class variable for all constant values in your code.
But it is generally accepted to use 0, 1, -1, and "" (the empty string) without an explicit declaration.
This is an inheritance from C, where only a single primitive value could be returned. In Java you can also return an object.
So for new code, return an object of a base type, with the subtype indicating the problem, to be used with instanceof - or throw a "not found" exception.
For existing special values, make -1 a constant in your code, named accordingly - NOT_FOUND - so the reader can tell the meaning without having to check the Javadocs.
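A sketch of that naming convention (the Find class and its indexOf helper are illustrative, not from any library):

```java
public class Find {
    // Naming the sentinel makes call sites self-documenting.
    static final int NOT_FOUND = -1;

    static int indexOf(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) {
                return i;
            }
        }
        return NOT_FOUND;
    }

    public static void main(String[] args) {
        int[] xs = {4, 8, 15};
        System.out.println(indexOf(xs, 8));               // 1
        System.out.println(indexOf(xs, 99) == NOT_FOUND); // true
    }
}
```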
The same practice as with null applies to -1. It's been discussed many times.
e.g. Java api design - NULL or Exception
It's used because it's the first invalid value you encounter in 0-based arrays. As you know, not all types can hold null or nothing, so you need "something" to signify nothing.
I would say it's not official; it has just become an (unwritten) convention because it's very sensible for the situation. Personally, I wouldn't call it an issue either. API design is also down to the author, but guidelines can be found online.
As far as I know, such values are called sentinel values, although most common definitions differ slightly from this scenario.
Languages such as Java chose not to support passing by reference (which I think is a good idea), so while the values of individual arguments are mutable, the variables passed to a function remain unaffected. As a consequence, you can only have one return value of only one type. So what you do is choose an otherwise invalid value of a valid type and return it to transport additional semantics, because the return value is not actually the return value of the operation but a special signal.
Now I guess the cleanest approach would be to have a contains and an indexOf method, the second of which would throw an exception if the element you're asking for is not in the collection. Why? Because one would expect the following to be true:
someCollection.objectAtIndex(someCollection.indexOf(someObject)) == someObject
What you're likely to get is an exception because -1 is out of bounds, while the actual reason this plausible relation fails is that someObject is not an element of someCollection - which is why the inner call should raise the exception.
Now, as clean and robust as this may be, it has two key flaws:
Both operations would usually cost you O(n) (unless you have an inverse map within the collection), so you're better off doing just one.
It is really quite verbose.
In the end, it's up to you to decide. This is a matter of philosophy. I'd call it a "semantic hack" to achieve both shortness & speed at the cost of robustness. Your call ;)

Why does Java toString() loop infinitely on indirect cycles?

This is more a gotcha I wanted to share than a question: when printing with toString(), Java will detect direct cycles in a Collection (where the Collection refers to itself), but not indirect cycles (where a Collection refers to another Collection which refers to the first one - or with more steps).
import java.util.*;

public class ShonkyCycle {
    static public void main(String[] args) {
        List a = new LinkedList();
        a.add(a);              // direct cycle
        System.out.println(a); // works: [(this Collection)]

        List b = new LinkedList();
        a.add(b);
        b.add(a);              // indirect cycle
        System.out.println(a); // shonky: causes infinite loop!
    }
}
This was a real gotcha for me, because it occurred in debugging code to print out the Collection (I was surprised when it caught a direct cycle, so I assumed incorrectly that they had implemented the check in general). There is a question: why?
The explanation I can think of is that it is very inexpensive to check for a collection that refers to itself, as you only need to store the collection (which you have already), but for longer cycles, you need to store all the collections you encounter, starting from the root. Additionally, you might not be able to tell for sure what the root is, and so you'd have to store every collection in the system - which you do anyway - but you'd also have to do a hash lookup on every collection element. It's very expensive for the relatively rare case of cycles (in most programming). (I think) the only reason it checks for direct cycles is because it is so cheap (one reference comparison).
OK... I've kinda answered my own question - but have I missed anything important? Anyone want to add anything?
Clarification: I now realize the problem I saw is specific to printing a Collection (i.e. the toString() method). There's no problem with cycles per se (I use them myself and need to have them); the problem is that Java can't print them. Edit Andrzej Doyle points out it's not just collections, but any object whose toString is called.
Given that it's constrained to this method, here's an algorithm to check for it:
1. The root is the object that the first toString() is invoked on (to determine this, you need to maintain state on whether a toString() is currently in progress or not; so this is inconvenient).
2. As you traverse each object, you add it to an IdentityHashMap, along with a unique identifier (e.g. an incremented index).
3. If an object is already in the Map, write out its identifier instead of recursing into it.
This approach also correctly renders multirefs (a node that is referred to more than once).
The memory cost is the IdentityHashMap (one reference and index per object); the complexity cost is a hash lookup for every node in the directed graph (i.e. each object that is printed).
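A rough sketch of that algorithm (my own illustration; cycleSafeToString and its recursion scheme are hypothetical helpers, not JDK code). Already-seen objects are assigned an index in an IdentityHashMap, and a back-reference like #0 is printed instead of recursing:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.IdentityHashMap;
import java.util.List;

public class CycleSafePrinter {
    // Hypothetical helper: not part of any JDK API.
    public static String cycleSafeToString(Object root) {
        return print(root, new IdentityHashMap<>());
    }

    private static String print(Object obj, IdentityHashMap<Object, Integer> seen) {
        if (!(obj instanceof Collection)) {
            return String.valueOf(obj);
        }
        Integer id = seen.get(obj);
        if (id != null) {
            // Already visited: emit a back-reference instead of recursing.
            return "#" + id;
        }
        seen.put(obj, seen.size()); // unique identifier = visit order
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (Object e : (Collection<?>) obj) {
            if (!first) sb.append(", ");
            first = false;
            sb.append(print(e, seen));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<Object> a = new ArrayList<>();
        List<Object> b = new ArrayList<>();
        a.add(a); // direct cycle
        a.add(b);
        b.add(a); // indirect cycle
        System.out.println(cycleSafeToString(a)); // prints [#0, [#0]]
    }
}
```

Because the map is keyed by identity rather than equals(), this also correctly renders the multirefs mentioned above.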
I think fundamentally it's because while the language tries to stop you from shooting yourself in the foot, it shouldn't really do so in a way that's expensive. So while it's almost free to compare object pointers (e.g. does obj == this) anything beyond that involves invoking methods on the object you're passing in.
And at this point the library code doesn't know anything about the objects you're passing in. For one, the generics implementation doesn't know if they're instances of Collection (or Iterable) themselves, and while it could find this out via instanceof, who's to say whether it's a "collection-like" object that isn't actually a collection, but still contains a deferred circular reference? Secondly, even if it is a collection, there's no telling what its actual implementation, and thus behaviour, is like. Theoretically one could have a collection containing all the Longs which is going to be used lazily; but since the library doesn't know this, it would be hideously expensive to iterate over every entry. Or in fact one could even design a collection with an Iterator that never terminates (though this would be difficult to use in practice, because so many constructs/library classes assume that hasNext will eventually return false).
So it basically comes down to an unknown, possibly infinite cost in order to stop you from doing something that might not actually be an issue anyway.
I'd just like to point out that this statement:
when printing with toString(), Java will detect direct cycles in a collection
is misleading.
Java (the JVM, the language itself, etc) is not detecting the self-reference. Rather this is a property of the toString() method/override of java.util.AbstractCollection.
If you were to create your own Collection implementation, the language/platform wouldn't automatically save you from a self-reference like this - unless you extend AbstractCollection, you have to cover this logic yourself.
I might be splitting hairs here but I think this is an important distinction to make. Just because one of the foundation classes in the JDK does something doesn't mean that "Java" as an overall umbrella does it.
Here is the relevant source code in AbstractCollection.toString(), with the key line commented:
public String toString() {
    Iterator<E> i = iterator();
    if (! i.hasNext())
        return "[]";

    StringBuilder sb = new StringBuilder();
    sb.append('[');
    for (;;) {
        E e = i.next();
        // self-reference check:
        sb.append(e == this ? "(this Collection)" : e);
        if (! i.hasNext())
            return sb.append(']').toString();
        sb.append(", ");
    }
}
The problem with the algorithm that you propose is that you need to pass the IdentityHashMap to all Collections involved. This is not possible using the published Collection APIs. The Collection interface does not define a toString(IdentityHashMap) method.
I imagine that whoever at Sun put the self reference check into the AbstractCollection.toString() method thought of all of this, and (in conjunction with his colleagues) decided that a "total solution" is over the top. I think that the current design / implementation is correct.
It is not a requirement that Object.toString implementations be bomb-proof.
You are right, you already answered your own question. Checking for longer cycles (especially really long ones, e.g. with a period of 1000) would be too much overhead and is not needed in most cases. If you want that check, you have to do it yourself.
The direct cycle case, however, is easy to check and will occur more often, so it's done by Java.
You can't really detect indirect cycles without tracking every object you visit, which is far too expensive to do on every toString() call.
