Performance of Java Parallel Stream vs ExecutorService

Performance of Java Parallel Stream vs ExecutorService - java

Suppose we have a list and want to pick all the elements satisfying a property(let say some functions f). There are 3 ways to parallel this process.
One :
listA.parallelStream.filter(element -> f(element)).collect(Collectors.toList());
Two:
listA.parallelStream.collect(Collectors.partitioningBy(element -> f(element))).get(true);
Three:
ExecutorService executorService = Executors.newFixedThreadPool(nThreads);
//separate the listA into several batches
for each batch {
Future<List<T>> result = executorService.submit(() -> {
// test the elements in this batch and return the validate element list
});
}
//merge the results from different threads.
Suppose the testing function is a CPU intensive task. I want to know which method is more efficient. Thanks a lot.

One and two use ForkJoinPool which is designed exactly for parallel processing of one task while ThreadPoolExecutor is used for concurrent processing of independent tasks. So One and Two are supposed to be faster.

When you use .filter(element -> f(element)).collect(Collectors.toList()), it will collect the matching elements into a List, whereas .collect(Collectors.partitioningBy(element -> f(element))) will collect all elements into either of two lists, followed by you dropping one of them and only retrieving the list of matches via .get(true).
It should be obvious that the second variant can only be on par with the first in the best case, i.e. if all elements match the predicate anyway or when the JVM’s optimizer is capable of removing the redundant work. In the worst lase, e.g. when no element matches, the second variant collects a list of all elements only to drop it afterwards, where the first variant would not collect any element.
The third variant is not comparable, as you didn’t show an actual implementation but just a sketch. There is no point in comparing a hypothetical implementation with an actual. The logic you describe, is the same as the logic of the parallel stream implementation. So you’re just reinventing the wheel. It may happen that you do something slightly better than the reference implementation or just better tailored to your specific task, but chances are much higher that you overlook things which the Stream API implementors did already consider during the development process which lasted several years.
So I wouldn’t place any bets on your third variant. If we add the time you will need to complete the third variant’s implementation, it can never be more efficient than just using either of the other variants.
So the first variant is the most efficient variant, especially as it is also the simplest, most readable, directly expressing your intent.

Related

What's the difference between Stream.map(...) and Collectors.mapping(...)?

I've noticed many functionalities exposed in Stream are apparently duplicated in Collectors, such as Stream.map(Foo::bar) versus Collectors.mapping(Foo::bar, ...), or Stream.count() versus Collectors.counting(). What's the difference between these approaches? Is there a performance difference? Are they implemented differently in some way that affects how well they can be parallelized?

The collectors that appear to duplicate functionality in Stream exist so they can be used as downstream collectors for collector combinators like groupingBy().
As a concrete example, suppose you want to compute "count of transactions by seller". You could do:
Map<Seller, Long> salesBySeller =
txns.stream()
.collect(groupingBy(Txn::getSeller, counting()));
Without collectors like counting() or mapping(), these kinds of queries would be much more difficult.

There's a big difference. The stream operations could be divided into two group:
Intermediate operations - Stream.map, Stream.flatMap, Stream.filter. Those produce instance of the Stream and are always lazy, e.g. no actual traversal of the Stream elements happens. Those operations are used to create transformation chain.
Terminal operations - Stream.collect, Stream.findFirst, Stream.reduce etc. Those do the actual work, e.g. perform the transformation chain operations on the stream, producing a terminal value. Which could be a List, count of element, first element etc.
Take a look at the Stream package summary javadoc for more information.

Most efficient collection for filtering a Java Stream?

I'm storing several Things in a Collection. The individual Things are unique, but their types aren't. The order in which they are stored also doesn't matter.
I want to use Java 8's Stream API to search it for a specific type with this code:
Collection<Thing> things = ...;
// ... populate things ...
Stream<Thing> filtered = things.stream.filter(thing -> thing.type.equals(searchType));
Is there a particular Collection that would make the filter() more efficient?
I'm inclined to think no, because the filter has to iterate through the entire collection.
On the other hand, if the collection is some sort of tree that is indexed by the Thing.type then the filter() might be able to take advantage of that fact. Is there any way to achieve this?

The stream operations like filter are not that specialized to take an advantage in special cases. For example, IntStream.range(0, 1_000_000_000).filter(x -> x > 999_999_000) will actually iterate all the input numbers, it cannot just "skip" the first 999_999_000. So your question is reduced to find the collection with the most efficient iteration.
The iteration is usually performed in Spliterator.forEachRemaining method (for non-short-circuiting stream) and in Spliterator.tryAdvance method (for short-circuiting stream), so you can take a look into the corresponding spliterator implementation and check how efficient it is. To my opinion the most efficient is an array (either bare or wrapped into list with Arrays.asList): it has minimal overhead. ArrayList is also quite fast, but for short-circuiting operation it will check the modCount (to detect concurrent modification) on every iteration which would add very slight overhead. Other types like HashSet or LinkedList are comparably slower, though in most of applications this difference is practically insignificant.
Note that parallel streams should be used with care. For example, the splitting of LinkedList is quite poor and you may experience worse performance than in sequential case.

The most important thing to understand, regarding this question, is that when you pass a lambda expression to a particular library like the Stream API, all the library receives is an implementation of a functional interface, e.g. an instance of Predicate. It has no knowledge about what that implementation will do and therefore has no way to exploit scenarios like filtering sorted data via comparison. The stream library simply doesn’t know that the Predicate is doing a comparison.
An implementation doing such an optimization would need an interaction of the JVM, which knows and understands the code, and the library, which knows the semantics. Such thing does not happen in current implementation and is currently far away, at least as I can see it.
If the source is a tree or sorted list and you want to benefit from that for filtering, you have to do it using APIs operating on the source, before creating the stream. E.g. suppose, we have a TreeSet and want to filter it to get items within a particular range, like
// our made-up source
TreeSet<Integer> tree=IntStream.range(0, 100).boxed()
.collect(Collectors.toCollection(TreeSet::new));
// the naive implementation
tree.stream().filter(i -> i>=65 && i<91).forEach(i->System.out.print((char)i.intValue()));
We can do instead:
tree.tailSet(65).headSet(91).stream().forEach(i->System.out.print((char)i.intValue()));
which will utilize the sorted/tree nature. When we have a sorted list instead, say
List<Integer> list=new ArrayList<>(tree);
utilizing the sorted nature is more complex as the collection itself doesn’t know that it’s sorted and doesn’t offer operations utilizing that directly:
int ix=Collections.binarySearch(list, 65);
if(ix<0) ix=~ix;
if(ix>0) list=list.subList(ix, list.size());
ix=Collections.binarySearch(list, 91);
if(ix<0) ix=~ix;
if(ix<list.size()) list=list.subList(0, ix);
list.stream().forEach(i->System.out.print((char)i.intValue()));
Of course, the stream operations here are only exemplary and you don’t need a stream at all, when all you do then is forEach…

As far as I am aware, there's no such differenciation for normal streaming.
However, you might be better off when you use parallel streaming when you use a collection which is easily devideable, like ArrayList over LinkedList or any type of Set.

Produce a Stream from a Stream and an element, Java 8

I'm working on getting my head around some of the Java 8 Stream features. I'm tolerably familiar with FP, having written some Lisp thirty years ago, and I think I might be trying to do something this new facility isn't really targeted at. Anyway, if the question is stupid, I'm happy to learn the error of my ways.
I'll give a specific problem, though it's really a general concept I'm trying to resolve.
Suppose I want to get a Stream from every third element of a Stream. In regular FP I would (approximately) create a recursive function that operates by concatenating the first element of the list with the (call-thyself) of the remainder of the list after dropping two elements. Easy enough. But to do this in a stream, I feel like I want one of two tools:
1) a means of having an operation extract multiple items from the stream for processing (then I'd just grab three, use the first, and dump the rest)
2) a means of making a Supplier that takes an item and a Stream and creates a Stream. Then it feels like I could create a downstream stream out of the first item and the shortened stream, though it's still unclear to me if this would do the necessary recursive magic to actually work.
BEGIN EDIT
So, there's some interesting and useful feedback; thanks everyone. In particular, the comments have helped me clarify what my head is trying to get around a bit better.
First, one can--conceptually, at least--having / needing knowledge of order in a sequence should not prevent one from permitting fully parallelizable operations. An example sprang to mind, and that's the convolution operations that graphics folks are inclined to do. Imagine blurring an image. Each pixel is modified by virtue of pixels near to it, but those pixels are only read, not modified, in themselves.
It's my understanding (very shaky at this stage, for sure!) that the streams mechanism is the primary entry point to the wonderful world of VM managed parallelism, and that iterators are still what they always were (yes? no?) If that's correct, then using the iterator to solve the problem domain that I'm waffling around doesn't seem great.
So, at this point, at least, the suggestion to create a chunking spliterator seems the most promising, but boy, does the code that supports that example seem like hard work! I think I'd rather do it with a ForkJoin mechanism, despite it being "old hat" now :)
Anyway, still interested in any more insights folks wish to offer.
END EDIT
Any thoughts? Am I trying to use these Streams to do something they're not intended for, or am I missing something obvious?
Cheers,
Toby.

One of the things to keep in mind is that Stream was primarily designed to be a way of taking advantage of parallel processing. An implication of this is that they have a number of conditions associated with them that are aimed at giving the VM a lot of freedom to process the elements in any convenient order. An example of this is insisting that reduction functions are associative. Another is that local variables manipulated are final. These types of conditions mean the stream items can be evaluated and collected in any order.
A natural consequence of this is that the best use cases for Stream involve no dependencies between the values of the stream. Things such as mapping a stream of integers to their cumulative values are trivial in languages like LISP but a pretty unnatural fit for Java streams (see this question).
There are clever ways of getting around some of these restrictions by using sequential to force the Stream to not be parallel but my experience has been that these are more trouble than they are worth. If your problem involves an essentially sequential series of items in which state is required to process the values then I recommend using traditional collections and iteration. The code will end up being clearer and will perform no worse given the stream cannot be parallelised anyway.
Having said all that, if you really want to do this then the most straightforward way is to have a collector that stores every third item then sends them out as a stream again:
class EveryThird {
private final List<Integer> list = new ArrayList<>();
private int count = 0;
public void accept(Integer i) {
if (count++ % 3 == 0)
list.add(i);
}
public EveryThird combine(EveryThird other) {
list.addAll(other.list);
count += other.count;
return this;
}
public Stream<Integer> stream() {
return list.stream();
}
}
This can then be used like:
IntStream.range(0, 10000)
.collect(EveryThird::new, EveryThird::accept, EveryThird::combine)
.stream()
But that's not really what collectors are designed for and this is pretty inefficient as it's unnecessarily collecting the stream. As stated above my recommendation is to use traditional iteration for this sort of situation.

My StreamEx library enhances standard Stream API. In particular it adds the headTail method which allows recursive definition of custom operations. It takes a function which receives stream head (first element) and tail (stream of the rest elements) and should return the resulting stream which will be used instead of the original one. For example, you can define every3 operation as follows:
public static <T> StreamEx<T> every3(StreamEx<T> input) {
return input.headTail(
(first, tail1) -> tail1.<T>headTail(
(second, tail2) -> tail2.headTail(
(third, tail3) -> every3(tail3))).prepend(first));
}
Here prepend is also used which just prepends the given element to the stream (this operation is just a best friend of headTail.
In general using headTail you can define almost any intermediate operation you want, including existing ones and new ones. You may find some samples here.
Note that I implemented some mechanism which optimizes tails in such recursive operation definition, so properly defined operation will not eat the whole stack when processing the long stream.

Java's streams are nothing like FP's (lazy) sequences. If you are familiar with Clojure, the difference is exactly like the difference between lazy seq and reducer. Whereas the lazy seq packages all processing with each element individually, and thus allows getting individually processed elements, a reducer collapses the complete sequence in a single atomic operation.
Specifically for the example you have described, consider relying on a stream partitioning transformation, as described in detail here. You would then easily do
partition(originalStream, 3).map(xs -> xs.get(0));
resulting in a stream having every third element of the original.
This would maintain efficiency, laziness, and parallelizability.

You can use (or peek at the source) of BatchingSpliterator
Then given aStream you can create a stream that consists of lists with size=3 (except maybe for the last one) and use the first element of that list
Stream<T> aStream = ...;
Stream<List<T>> batchedStream =
new BatchingSpliterator.Builder().wrap(aStream).batchSize(3).stream();
batchedStream.map(l -> l.get(0) ). ...
You can also "go parallel":
batchedStream.parallel().map(l -> l.get(0) ). ....

Perform arithmetic on number array without iterating

For example, I would like to do something like the following in java:
int[] numbers = {1,2,3,4,5};
int[] result = numbers*2;
//result now equals {2,4,6,8,10};
Is this possible to do without iterating through the array? Would I need to use a different data type, such as ArrayList? The current iterating step is taking up some time, and I'm hoping something like this would help.

No, you can't multiply each item in an array without iterating through the entire array. As pointed out in the comments, even if you could use the * operator in such a way the implementation would still have to touch each item in the array.
Further, a different data type would have to do the same thing.

I think a different answer from the obvious may be beneficial to others who have the same problem and don't mind a layer of complexity (or two).
In Haskell, there is something known as "Lazy Evaluation", where you could do something like multiply an infinitely large array by two, and Haskell would "do" that. When you accessed the array, it would try to evaluate everything as needed. In Java, we have no such luxury, but we can emulate this behavior in a controllable manner.
You will need to create or extend your own List class and add some new functions. You would need functions for each mathematical operation you wanted to support. I have examples below.
LazyList ll = new LazyList();
// Add a couple million objects
ll.multiplyList(2);
The internal implementation of this would be to create a Queue that stores all the primitive operations you need to perform, so that order of operations is preserved. Now, each time an element is read, you perform all operations in the Queue before returning the result. This means that reads are very slow (depending on the number of operations performed), but we at least get the desired result.
If you find yourself iterating through the whole array each time, it may be useful to de-queue at the end instead of preserving the original values.
If you find that you are making random accesses, I would preserve the original values and returned modified results when called.
If you need to update entries, you will need to decide what that means. Are you replacing a value there, or are you replacing a value after the operations were performed? Depending on your answer, you may need to run backwards through the queue to get a "pre-operations" value to replace an older value. The reasoning is that on the next read of that same object, the operations would be applied again and then the value would be restored to what you intended to replace in the list.
There may be other nuances with this solution, and again the way you implement it would be entirely different depending on your needs and how you access this (sequentially or randomly), but it should be a good start.

With the introduction of Java 8 this task can be done using streams.
private long tileSize(int[] sizes) {
return IntStream.of(sizes).reduce(1, (x, y) -> x * y);
}

No it isn't. If your collection is really big and you want to do it faster you can try to operates on elements in 2 or more threads, but you have to take care of synchronization(use synchronized collection) or divide your collection to 2(or more) collections and in each thread operate on one collection. I'm not sure wheather it will be faster than just iterating through the array - it depends on size of your collection and on what you want to do with each element. If you want to use this solution you will have wheather is it faster in your case - it might be slower and definitely it will be much more complicated.
Generally - if it's not critical part of code and time of execution isn't too long i would leave it as it is now.

Java: what is the overhead of using ConcurrentSkipList* when no concurrency is needed?

I need a sorted list in a scenario dominated by iteration (compared to insert/remove, not random get at all). For this reason I thought about using a skip list compared to a tree (the iterator should be faster).
The problem is that java6 only has a concurrent implementation of a skip list, so I was guessing whether it makes sense to use it in a non-concurrent scenario or if the overhead makes it a wrong decision.
For what I know ConcurrentSkipList* are basically lock-free implementations based on CAS, so they should not carry (much) overhead, but I wanted to hear somebody else's opinion.
EDIT:
After some micro-benchmarking (running iteration multiple times on different-sized TreeSet, LinkedList, ConcurrentSkipList and ArrayList) shows that there's quite an overhead. ConcurrentSkipList does store the elements in a linked list inside, so the only reason why it would be slower on iteration than a LinkedList would be due to the aforementioned overhead.

If thread-safety's not required I'd say to skip package java.util.concurrent altogether.
What's interesting is that sometimes ConcurrentSkipList is slower than TreeSet on the same input and I haven't sorted out yet why.
I mean, have you seen the source code for ConcurrentSkipListMap? :-) I always have to smile when I look at it. It's 3000 lines of some of the most insane, scary, and at the same time beautiful code I've ever seen in Java. (Kudos to Doug Lea and co. for getting all the concurrency utils integrated so nicely with the collections framework!) Having said that, on modern CPUs the code and algorithmic complexity won't even matter so much. What usually makes more difference is having the data to iterate co-located in memory, so that the CPU cache can do its job better.
So in the end I'll wrap ArrayList with a new addSorted() method that does a sorted insert into the ArrayList.
Sounds good. If you really need to squeeze every drop of performance out of iteration you could also try iterating a raw array directly. Repopulate it upon each change, e.g. by calling TreeSet.toArray() or generating it then sorting it in-place using Arrays.sort(T[], Comparator<? super T>). But the gain could be tiny (or even nothing if the JIT does its job well) so it might not be worth the inconvenience.

As measured using Open JDK 6 on typical production hardware my company uses, you can expect all add and query operations on a skip-list map to take roughly double the time as the same operation on a tree map.
examples:
63 usec vs 31 usec to create and add 200 entries.
145 ns vs 77 ns for get() on that 200-element map.
And the ratio doesn't change all that much for smaller and larger sizes.
(The code for this benchmark will eventually be shared so you can review it and run it yourself; sorry we're not ready to do that yet.)

Well you can use a lot of other structures to do the skip list, it exists in Concurrent package because concurrent data structures are a lot more complicated and because using a concurrent skip list would cost less than using other concurrent data structures to mimic a skip list.
In a single thread world is different: you can use a sorted set, a binary tree or your custom data structure that would perform better than concurrent skip list.
The complexity in iterating a tree list or a skip list will be always O(n), but if you instead use a linked list or an array list, you have the problem with insertion: to insert an item in the right position (sorted linked list) the complexity of insertion will be O(n) instead of O(log n) for a binary tree or for a skip list.
You can iterate in TreeMap.keySet() to obtain all inserted keys in order and it will not be so slow.
There is also the TreeSet class, that probably is what you need, but since it is just a wrapper to TreeMap, the direct use of TreeMap would be faster.

Without concurrency, it is usually more efficient to use a balanced binary search tree. In Java, this would be a TreeMap.
Skip lists are generally reserved for concurrent programming because of their ease in implementation the speed in multithreaded applications.

You seem to have a good grasp of the trade-off here, so I doubt anyone can give you a definitive, principled answer. Fortunately, this is pretty straightforward to test.
I started by creating a simple Iterator<String> that loops indefinitely over a finite list of randomly generated strings. (That is: on initialization, it generates an array _strings of a random strings of length b out of a pool of c distinct characters. The first call to next() returns _strings[0], the next call returns _strings[1] … the (n+1)th call returns _strings[0] again.) The strings returned by this iterator were what I used in all calls to SortedSet<String>.add(...) and SortedSet<String>remove(...).
I then wrote a test method that accepts an empty SortedSet<String> and loops d times. On each iteration, it adds e elements, then removes f elements, then iterates over the entire set. (As a sanity-check, it keeps track of the set's size by using the return values of add() and remove(), and when iterates over the entire set, it makes sure it finds the expected number of elements. Mostly I did that just so there would be something in the body of the loop.)
I don't think I need to explain what my main(...) method does. :-)
I tried various values for the various parameters, and I found that sometimes ConcurrentSkipListSet<String> performed better, and sometimes TreeSet<String> did, but the difference was never much more than twofold. In general, ConcurrentSkipListSet<String> performed better when:
a, b, and/or c were relatively large. (I mean, within the ranges I tested. My a's ranged from 1000 to 10000, my b's from 3 to 10, my c's from 10 to 80. Overall, the resulting set-sizes ranged from around 450 to exactly 10000, with modes of 666 and 6666 because I usually used e=2‎f.) This suggests that ConcurrentSkipListSet<String> copes somewhat better than TreeSet<String> with larger sets, and/or with more-expensive string-comparisons. Trying specific values designed to tease apart these two factors, I got the impression that ConcurrentSkipListSet<String> coped noticeably better than TreeSet<String> with larger sets, and slightly less well with more-expensive string-comparisons. (That's basically what you'd expect; TreeSet<String>'s binary-tree approach aims to do the absolute minimum possible number of comparisons.)
e and f were small; that is, when I called add(...)s and remove(...)s only a small number of times per iteration. (This is exactly what you predicted.) The exact turn-over point depended on a, b, and c, but to a first approximation, ConcurrentSkipListSet<String> performed better when e + f was less than around 10, and TreeSet<String> performed better when e + f was more than around 20.
Of course, this was on a machine that may look nothing like yours, using a JDK that may look nothing like yours, and using very artificial data that might look nothing like yours. I'd recommend that you run your own tests. Since Tree* and ConcurrentSkipList* both implement Sorted*, you should have no difficulty trying your code both ways and seeing what you find.
For what I know ConcurrentSkipList* are basically lock-free implementations based on CAS, so they should not carry (much) overhead, […]
My understanding is that this will depend on the machine. On some systems a lock-free implementation may not be possible, in which case these classes will have to use locks. (But since you're not actually multi-threading, even locks may not be all that expensive. Synchronization has overhead, of course, but its main cost is lock contention and forced single-threading. That isn't an issue for you. Again, I think you'll just have to test and see how the two versions perform.)

As noted SkipList has a lot of overhead compared to TreeMap and the TreeMap iterator isn't well suited to your use case because it just repeatedly calls the method successor() which turns out to be very slow.
So one alternative that will be significantly faster than the previous two is to write your own TreeMap iterator. Actually, I would dump TreeMap altogether since 3000 lines of code is a bit bulkier than you probably need and just write a clean AVL tree implementation with the methods you need. The basic AVL logic is just a few hundred lines of code in any language then add the iterator that works best in your case.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.