Streams with TreeMap return incoherent results - java

I am trying to solve the following exercise from "Core Java for the Impatient" by Cay Horstmann:
When an encoder of a Charset with partial Unicode coverage can’t encode a
character, it replaces it with a default—usually, but not always, the encoding of "?".
Find all replacements of all available character sets that support encoding. Use the
newEncoder method to get an encoder, and call its replacement method to get
the replacement. For each unique result, report the canonical names of the charsets
that use it.
For the sake of education, I have decided to tackle the exercise with a gargantuan one-liner using the streaming API, even though, in my opinion, a cleaner solution would divide the calculation into a number of steps with intermediate variables in between (it would certainly ease debugging). Without further ado, here is the monster of code I have created:
Charset.availableCharsets().values().stream()
    .filter(charset -> charset.canEncode())
    .collect(Collectors.groupingBy(
            charset -> charset.newEncoder().replacement(),
            () -> new TreeMap<>((arr1, arr2) -> Arrays.equals(arr1, arr2) == true
                    ? 0 : Integer.compare(arr1.hashCode(), arr2.hashCode())),
            Collectors.mapping(charset -> charset.name(), Collectors.toList())))
    .values().stream()
    .map(list -> list.stream().collect(Collectors.joining(", ")))
    .forEach(System.out::println);
Basically, we take into account only the charsets that canEncode; we create a Map with the replacement as key and a list of canonical names as value; and, because grouping arrays didn't work with the default implementation of groupingBy (which uses a HashMap), I decided to use a TreeMap. We then take the lists of canonical names, join each with commas, and print it.
Unfortunately, I have found that it gives inconsistent results: if I run the function twice in the same program, the first invocation returns 23 Strings, the second only 21. I suspect it has something to do with a poor implementation of the Comparator for the TreeMap, which was defined as follows:
((arr1, arr2) -> Arrays.equals(arr1, arr2) == true ? 0 : Integer.compare(arr1.hashCode(), arr2.hashCode()))
If that is the cause, what would a proper Comparator be in this case? Apart from that, can the one-liner be improved in any way?
I am also curious whether such convoluted constructs as the code I have written are encountered in professional programs. Or is it only me who finds it unreadable?

There is no guarantee that the hash codes of two distinct instances will be different. That would be the ideal situation, but it is never guaranteed. Only the opposite is true: if two objects are equal, they have the same hash code.
So if you create a comparator that considers objects to be the same whenever they have the same hash code, arbitrary objects may be considered the same. And since the byte[] arrays returned by replacement() are defensive copies (read: temporary objects), the result may vary in every run of this code.
Further, since the hash code of an array has nothing to do with its content, your comparator violates the transitivity rule: two arrays with equal content are supposed to be the same, but since they very likely have different hash codes, they can have different relations to a third array that does not have the same content: a == b, but a < c and b > c. This is why even equal arrays, which you compare with Arrays.equals, can end up in different groups: the TreeMap fails to find the existing key when comparing it against the other keys.
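To see the first point concretely, here is a minimal sketch (not from the original answer; arrays inherit Object's identity hash code, so the second line is almost certainly false, and its value can change between runs):
byte[] a = {63}; // the encoding of "?"
byte[] b = {63}; // equal content, but a distinct instance
System.out.println(Arrays.equals(a, b));           // true
System.out.println(a.hashCode() == b.hashCode());  // almost certainly false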
If you want the arrays to be compared by value, you can use:
Charset.availableCharsets().values().stream()
    .filter(Charset::canEncode)
    .collect(Collectors.groupingBy(
            charset -> charset.newEncoder().replacement(),
            () -> new TreeMap<>(Comparator.comparing(ByteBuffer::wrap)),
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);
ByteBuffers are Comparable and consistently compare the contents of the wrapped arrays.
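A quick illustration of that comparison (a sketch, not from the original answer):
ByteBuffer a = ByteBuffer.wrap(new byte[] { 63 });
ByteBuffer b = ByteBuffer.wrap(new byte[] { 63 });
System.out.println(a.compareTo(b));                                  // 0: equal content compares as equal
System.out.println(a.compareTo(ByteBuffer.wrap(new byte[] { 64 }))); // negative: compared by content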
I moved the Collectors.joining collector into the grouping collector to avoid the creation of the temporary List whose contents you are going to join afterwards anyway.
By the way, never use code like expression == true. There is no reason to append == true as expression is already sufficient.
Since you are only interested in the values, in other words, don’t need the keys to be of a certain type, you may wrap all arrays beforehand, simplifying the operation and even make it slightly more efficient:
Charset.availableCharsets().values().stream()
    .filter(Charset::canEncode)
    .collect(Collectors.groupingBy(
            charset -> ByteBuffer.wrap(charset.newEncoder().replacement()),
            TreeMap::new,
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);
This change even allows resorting to hashing, if no consistent iteration order is required:
Charset.availableCharsets().values().stream()
    .filter(Charset::canEncode)
    .collect(Collectors.groupingBy(
            charset -> ByteBuffer.wrap(charset.newEncoder().replacement()),
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);
This works, because ByteBuffer also implements equals and hashCode.
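As a side note (an alternative not part of the original answer): since Java 9, java.util.Arrays provides a lexicographic compare(byte[], byte[]) overload, so the TreeMap variant with byte[] keys can also be written with a method reference:
Charset.availableCharsets().values().stream()
    .filter(Charset::canEncode)
    .collect(Collectors.groupingBy(
            charset -> charset.newEncoder().replacement(),
            () -> new TreeMap<>(Arrays::compare), // Java 9+: lexicographic byte[] comparison
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);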


How to increase efficiency

I have the following homework question:
Suppose you are given two sequences S1 and S2 of n elements, possibly containing duplicates, on which a total order relation is defined. Describe an efficient algorithm for determining if S1 and S2 contain the same set of elements. Analyze the running time of this method
To solve this question I have compared the elements of the two sequences using retainAll and a HashSet.
Set1.retainAll(new HashSet<Integer>(Set2));
This would solve the problem in constant time.
Do I need to sort the two arrays before the retainAll step to increase efficiency?
I suspect from the code you've posted that you are missing the point of the assignment. The idea is not to use a Java library to check whether two collections are equal (for that you could use collection1.equals(collection2)). Rather, the point is to come up with an algorithm for comparing the collections. The Java API does not specify an algorithm: it's hidden away in the implementation.
Without providing an answer, let me give you an example of an algorithm that would work, but is not necessarily efficient:
for each element in coll1
    if element not in coll2
        return false
    remove element from coll2
return coll2 is empty
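A direct, deliberately naive Java translation of that pseudocode might look like this (a sketch; the method and parameter names are illustrative only):
// O(n^2): for each element of the first sequence, linearly search for and
// remove a matching element from a working copy of the second sequence.
static boolean sameElements(List<Integer> s1, List<Integer> s2) {
    List<Integer> remaining = new ArrayList<>(s2); // don't destroy the input
    for (Integer element : s1) {
        if (!remaining.remove(element)) // remove(Object): linear scan, false if absent
            return false;
    }
    return remaining.isEmpty();
}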
The problem specifies that the elements can be ordered (i.e. a total order relation is defined on them), which means you can do much better than the algorithm above.
In general if you are asked to demonstrate an algorithm it's best to stick with native data types and arrays - otherwise the implementation of a library class can significantly impact efficiency and hide the data you want to collect on the algorithm itself.

Stream.reduce always preserving order on parallel, unordered stream

I've gone through several previous questions like Encounter order preservation in java stream, this answer by Brian Goetz, as well as the javadoc for Stream.reduce(), and the java.util.stream package javadoc, and yet I still can't grasp the following:
Take this piece of code:
public static void main(String... args) {
    final String[] alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".split("");
    System.out.println("Alphabet: ".concat(Arrays.toString(alphabet)));
    System.out.println(new HashSet<>(Arrays.asList(alphabet))
            .parallelStream()
            .unordered()
            .peek(System.out::println)
            .reduce("", (a, b) -> a + b, (a, b) -> a + b));
}
Why is the reduction always* preserving the encounter order?
* So far, after several dozen runs, the output has been the same.
First of all, unordered does not imply an actual shuffling; all it does is set a flag for the stream pipeline, which could later be leveraged.
A shuffle of the source elements could potentially be much more expensive than the operations on the stream pipeline themselves, so the implementation might choose not to do it (as in this case).
At the moment (tested and checked against the sources of jdk-8 and jdk-9), reduce does not take that flag into account. Notice that this could very well change in a future build or release.
Also, when you say unordered, you actually mean that you don't care about the order, so the stream returning the same result every time is not a violation of that rule.
For example, notice this question/answer, which explains that findFirst (just another terminal operation) was changed to take unordered into consideration in java-9, as opposed to java-8.
To help explain this, I am going to reduce the scope of this string to ABCD.
The parallel stream will divide the string into two pieces: AB and CD. When we go to combine these later, the result of the AB side will be the first argument passed into the function, while the result of the CD side will be the second argument passed into the function. This is regardless of which of the two actually finishes first.
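You can make that association visible by parenthesizing in the accumulator and combiner (a small sketch, not from the original answer; the exact split depends on the spliterator, so the printed shape may vary):
String result = new HashSet<>(Arrays.asList("A", "B", "C", "D"))
        .parallelStream()
        .reduce("", (a, b) -> "(" + a + "+" + b + ")",
                (a, b) -> "(" + a + "+" + b + ")");
System.out.println(result); // e.g. (((+A)+B)+((+C)+D)): the left segment's result is always the first argument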
The unordered operator will affect some operations on a stream, such as limit; it does not affect a simple reduce.
TLDR: .reduce() does not always preserve order; its result is based on the stream's spliterator characteristics.
Spliterator
The encounter order of the stream depends on the stream's spliterator. (None of the answers mentioned that before.)
There are different spliterators depending on the stream's source. You can find the spliterator types in the source code of those collections.
HashSet -> HashMap#KeySpliterator = Not ordered
ArrayDeque = Ordered
ArrayList = Ordered
TreeSet -> TreeMap#Spliterator = Ordered and sorted
logicbig.com - Ordering
logicbig.com - Stateful vs Stateless
Additionally, you can apply the .unordered() intermediate stream operation, which specifies that subsequent operations in the stream should not rely on ordering.
Stream operations (mostly stateful ones) that are affected by the spliterator and by the use of the .unordered() method are:
.findFirst()
.limit()
.skip()
.distinct()
Those operations will give us different results based on the order property of the stream and its spliterator.
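For example, here is a sketch of how .unordered() can change what limit selects on a parallel stream (illustrative; the unordered result may vary per run):
// Ordered parallel stream: limit(3) must keep the first three elements.
System.out.println(IntStream.range(0, 1_000_000)
        .parallel().limit(3).boxed().collect(Collectors.toList())); // [0, 1, 2]
// Unordered: the constraint is relaxed, so any three elements may be kept.
System.out.println(IntStream.range(0, 1_000_000)
        .parallel().unordered().limit(3).boxed().collect(Collectors.toList()));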
The .peek() method does not take ordering into consideration; if the stream is executed in parallel, it can print/receive elements in an arbitrary order.
.reduce()
Now for the terminal .reduce() method. The intermediate operation .unordered() doesn't have any effect on the type of the spliterator (as @Eugene mentioned); it stays the same as in the source. If the source spliterator is ordered, the result of .reduce() will be ordered; if the source was unordered, the result of .reduce() will be unordered.
You are using new HashSet<>(Arrays.asList(alphabet)) to get the instance of the stream, and its spliterator is unordered. It was just a coincidence that you were getting an ordered result: because you used single alphabet Strings as the elements of the stream, the unordered result happened to be the same. Now if you mix in numbers, or mix lower case and upper case, this doesn't hold true anymore. For example, take the following inputs; the first one is a subset of the example you posted:
HashSet .reduce() - Unordered
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "a1Ab2Bc3C"
"Apple","Orange","Banana","Mango" -> "AppleMangoOrangeBanana"
TreeSet .reduce() - Ordered, Sorted
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "123ABCabc"
"Apple","Orange","Banana","Mango" -> "AppleBananaMangoOrange"
ArrayList .reduce() - Ordered
"A","B","C","D","E","F" -> "ABCDEF"
"a","b","c","1","2","3","A","B","C" -> "abc123ABC"
"Apple","Orange","Banana","Mango" -> "AppleOrangeBananaMango"
You see that testing .reduce() operation only with an alphabet source stream can lead to false conclusions.
The answer is: .reduce() does not always preserve order; its result is based on the stream's spliterator characteristics.

What are the performance differences with these two uses of the map stream function in java 8

Say I have a function mutateElement() which does x operations and a function mutateElement2() which does y operations. What is the difference in performance between these two pieces of code?
Piece1:
List<Object> result = array.stream()
    .map(elem -> {
        mutateElement(elem);
        mutateElement2(elem);
        return elem;
    })
    .collect(Collectors.toList());
Piece2:
List<Object> result = array.stream()
    .map(elem -> {
        mutateElement(elem);
        return elem;
    })
    .collect(Collectors.toList());
result = result.stream()
    .map(elem -> {
        mutateElement2(elem);
        return elem;
    })
    .collect(Collectors.toList());
Clearly the first implementation is better as it only uses one iterator, whereas the second uses two. But would the difference be noticeable if I had, say, a million elements in the array?
The first implementation is not better simply because it uses only one iterator; it is better because it only collects once.
Nobody can tell you whether the difference would be noticeable if you had a million elements. (And if someone did try to tell you, you should not believe them.) Benchmark it.
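If you do benchmark it, prefer a harness like JMH over hand-rolled timing loops. A minimal sketch (the mutateElement bodies below are hypothetical stand-ins for the question's methods):
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class MapFusionBenchmark {

    List<Object> array;

    @Setup
    public void setup() {
        array = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            array.add(new Object());
        }
    }

    @Benchmark
    public List<Object> singlePass() {
        return array.stream()
                .map(elem -> mutateElement2(mutateElement(elem)))
                .collect(Collectors.toList());
    }

    @Benchmark
    public List<Object> twoPasses() {
        List<Object> first = array.stream()
                .map(this::mutateElement)
                .collect(Collectors.toList());
        return first.stream()
                .map(this::mutateElement2)
                .collect(Collectors.toList());
    }

    // Hypothetical stand-ins for the question's methods.
    Object mutateElement(Object elem)  { return elem; }
    Object mutateElement2(Object elem) { return elem; }
}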
Whether you use a stream or an external loop, the problem is the same: one iteration over the List in the first code, two iterations in the second. So the second version logically takes more time to execute.
Besides, invoking the terminal operation .collect(Collectors.toList()) twice rather than once also has a cost.
But would the difference be noticeable if I had say a million elements
in the array.
It could be. A hard yes-or-no answer is difficult to give: it depends on other parameters such as the CPUs, the number of concurrent users and processes, and your definition of "noticeable".

Custom Java sort by name

I want to sort something like this:
Given an ArrayList of objects with name Strings, I am trying to write the compareTo function such that Special T is always first, Special R is always second, Special C is always third, and then everything else is just alphabetical:
Special T
Special R
Special C
Aaron
Alan
Bob
Dave
Ron
Tom
Is there a standard way of writing this kind of compare function without needing to iterate over all possible combinations of the special cases, and then invoking return getName().compareTo(otherObject.getName()); for the non-special case?
I would put the special cases in a HashMap<String, Integer> with the name as key and the position as value (see the sketch below). The advantages are:
lookup is O(1)
the HashMap may be populated from an external source
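For illustration, a minimal sketch of that idea (assuming a hypothetical Person class with a getName() accessor):
// Rank for the special names; everything else gets MAX_VALUE and
// falls through to the alphabetical tie-breaker.
Map<String, Integer> specialRank = new HashMap<>();
specialRank.put("Special T", 0);
specialRank.put("Special R", 1);
specialRank.put("Special C", 2);

Comparator<Person> byName = Comparator
        .comparingInt((Person p) -> specialRank.getOrDefault(p.getName(), Integer.MAX_VALUE))
        .thenComparing(Person::getName);

list.sort(byName); // assuming list is a List<Person>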

Efficient data structure that checks for existence of String

I am writing a program which will add a growing number of unique strings to a data structure. Once this is done, I will later need to constantly check for the existence of a string in it.
If I were to use an ArrayList, I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or the end is reached, returning false).
However, with a HashMap I know that I can use the string as the key and check for a non-null value in constant time, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions but doesn't require a value to be placed?
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for the presence or absence of other strings in constant time:
Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings
if (knownStrings.contains(myString)) {
    ...
}
It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number in advance, or at least have a basic idea?) and what you expect the hit/miss ratio to be.
A very efficient data structure to use here is a trie or a radix tree; they are basically made for this. For an explanation of how they work, see the Wikipedia entry (a follow-up to the radix tree definition is on this page). There are Java implementations (one of them is here; however, I had a fixed set of strings to inject, which is why I used a builder).
If your number of strings is really huge and you expect a non-trivial miss ratio, you might also consider using a Bloom filter; the catch is that it is probabilistic, but it gives very quick answers to "not there". Here also, there are implementations in Java (Guava has one, for instance).
Otherwise, well, a HashSet...
A HashSet is probably the right answer, but if you choose (for simplicity, say) to search a list, it's probably more efficient to concatenate your words into a single String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
boolean wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.
As others mentioned, a HashSet is the way to go. But if the size is going to be large and you are fine with false positives (e.g. checking whether a username exists), you can use a Bloom filter (a probabilistic data structure) as well.
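A sketch of the Guava option (a minimal example, assuming Guava's com.google.common.hash package is on the classpath; the sizing numbers are illustrative):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Sized for 1,000,000 expected insertions with ~1% false-positive probability.
BloomFilter<CharSequence> seen = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

seen.put("alice");
System.out.println(seen.mightContain("alice")); // true
System.out.println(seen.mightContain("bob"));   // false, or (rarely) a false positive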
