Is there a way to force parallelStream() to go parallel?

Is there a way to force parallelStream() to go parallel? - java

If the input size is too small the library automatically serializes the execution of the maps in the stream, but this automation doesn't and can't take in account how heavy is the map operation. Is there a way to force parallelStream() to actually parallelize CPU heavy maps?

There seems to be a fundamental misunderstanding. The linked Q&A discusses that the stream apparently doesn’t work in parallel, due to the OP not seeing the expected speedup. The conclusion is that there is no benefit in parallel processing if the workload is too small, not that there was an automatic fallback to sequential execution.
It’s actually the opposite. If you request parallel, you get parallel, even if it actually reduces the performance. The implementation does not switch to the potentially more efficient sequential execution in such cases.
So if you are confident that the per-element workload is high enough to justify the use of a parallel execution regardless of the small number of elements, you can simply request a parallel execution.
As can easily demonstrated:
Stream.of(1, 2).parallel()
.peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
.forEach(System.out::println);
On Ideone, it prints
processing 2 in Thread[main,5,main]
2
processing 1 in Thread[ForkJoinPool.commonPool-worker-1,5,main]
1
but the order of messages and details may vary. It may even be possible that in some environments, both task may happen to get executed by the same thread, if it can steel the second task before another thread gets started to pick it up. But of course, if the tasks are expensive enough, this won’t happen. The important point is that the overall workload has been split and enqueued to be potentially picked up by other worker threads.
If execution by a single thread happens in your environment for the simple example above, you may insert simulated workload like this:
Stream.of(1, 2).parallel()
.peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
.map(x -> {
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(3));
return x;
})
.forEach(System.out::println);
Then, you may also see that the overall execution time will be shorter than “number of elements”×“processing time per element” if the “processing time per element” is high enough.
Update: the misunderstanding might be cause by Brian Goetz’ misleading statement: “In your case, your input set is simply too small to be decomposed”.
It must be emphasized that this is not a general property of the Stream API, but the Map that has been used. A HashMap has a backing array and the entries are distributed within that array depending on their hash code. It might be the case that splitting the array into n ranges doesn’t lead to a balanced split of the contained element, especially, if there are only two. The implementors of the HashMap’s Spliterator considered searching the array for elements to get a perfectly balanced split to be too expensive, not that splitting two elements was not worth it.
Since the HashMap’s default capacity is 16 and the example had only two elements, we can say that the map was oversized. Simply fixing that would also fix the example:
long start = System.nanoTime();
Map<String, Supplier<String>> input = new HashMap<>(2);
input.put("1", () -> {
System.out.println(Thread.currentThread());
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
return "a";
});
input.put("2", () -> {
System.out.println(Thread.currentThread());
LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
return "b";
});
Map<String, String> results = input.keySet()
.parallelStream().collect(Collectors.toConcurrentMap(
key -> key,
key -> input.get(key).get()));
System.out.println("Time: " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime()- start));
on my machine, it prints
Thread[main,5,main]
Thread[ForkJoinPool.commonPool-worker-1,5,main]
Time: 2058
The conclusion is that the Stream implementation always tries to use parallel execution, if you request it, regardless of the input size. But it depends on the input’s structure how well the workload can be distributed to the worker threads. Things could be even worse, e.g. if you stream lines from a file.
If you think that the benefit of a balanced splitting is worth the cost of a copying step, you could also use new ArrayList<>(input.keySet()).parallelStream() instead of input.keySet().parallelStream(), as the distribution of elements within ArrayList always allows a perflectly balanced split.

Related

Is Java 8 stream laziness useless in practice?

I have read a lot about Java 8 streams lately, and several articles about lazy loading with Java 8 streams specifically: here and over here. I can't seem to shake the feeling that lazy loading is COMPLETELY useless (or at best, a minor syntactic convenience offering zero performance value).
Let's take this code as an example:
int[] myInts = new int[]{1,2,3,5,8,13,21};
IntStream myIntStream = IntStream.of(myInts);
int[] myChangedArray = myIntStream
.peek(n -> System.out.println("About to square: " + n))
.map(n -> (int)Math.pow(n, 2))
.peek(n -> System.out.println("Done squaring, result: " + n))
.toArray();
This will log in the console, because the terminal operation, in this case toArray(), is called, and our stream is lazy and executes only when the terminal operation is called. Of course I can also do this:
IntStream myChangedInts = myIntStream
.peek(n -> System.out.println("About to square: " + n))
.map(n -> (int)Math.pow(n, 2))
.peek(n -> System.out.println("Done squaring, result: " + n));
And nothing will be printed, because the map isn't happening, because I don't need the data. Until I call this:
int[] myChangedArray = myChangedInts.toArray();
And voila, I get my mapped data, and my console logs. Except I see zero benefit to it whatsoever. I realize I can define the filter code long before I call to toArray(), and I can pass around this "not-really-filtered stream around), but so what? Is this the only benefit?
The articles seem to imply there is a performance gain associated with laziness, for example:
In the Java 8 Streams API, the intermediate operations are lazy and their internal processing model is optimized to make it being capable of processing the large amount of data with high performance.
and
Java 8 Streams API optimizes stream processing with the help of short circuiting operations. Short Circuit methods ends the stream processing as soon as their conditions are satisfied. In normal words short circuit operations, once the condition is satisfied just breaks all of the intermediate operations, lying before in the pipeline. Some of the intermediate as well as terminal operations have this behavior.
It sounds literally like breaking out of a loop, and not associated with laziness at all.
Finally, there is this perplexing line in the second article:
Lazy operations achieve efficiency. It is a way not to work on stale data. Lazy operations might be useful in the situations where input data is consumed gradually rather than having whole complete set of elements beforehand. For example consider the situations where an infinite stream has been created using Stream#generate(Supplier<T>) and the provided Supplier function is gradually receiving data from a remote server. In those kind of the situations server call will only be made at a terminal operation when it's needed.
Not working on stale data? What? How does lazy loading keep someone from working on stale data?
TLDR: Is there any benefit to lazy loading besides being able to run the filter/map/reduce/whatever operation at a later time (which offers zero performance benefit)?
If so, what's a real-world use case?

Your terminal operation, toArray(), perhaps supports your argument given that it requires all elements of the stream.
Some terminal operations don't. And for these, it would be a waste if streams weren't lazily executed. Two examples:
//example 1: print first element of 1000 after transformations
IntStream.range(0, 1000)
.peek(System.out::println)
.mapToObj(String::valueOf)
.peek(System.out::println)
.findFirst()
.ifPresent(System.out::println);
//example 2: check if any value has an even key
boolean valid = records.
.map(this::heavyConversion)
.filter(this::checkWithWebService)
.mapToInt(Record::getKey)
.anyMatch(i -> i % 2 == 0)
The first stream will print:
0
0
0
That is, intermediate operations will be run just on one element. This is an important optimization. If it weren't lazy, then all the peek() calls would have to run on all elements (absolutely unnecessary as you're interested in just one element). Intermediate operations can be expensive (such as in the second example)
Short-circuiting terminal operation (of which toArray isn't) make this optimization possible.

Laziness can be very useful for the users of your API, especially when the final result of the Stream pipeline evaluation might be very large!
The simple example is the Files.lines method in the Java API itself. If you don't want to read the whole file into the memory and you only need the first N lines, then just write:
Stream<String> stream = Files.lines(path); // lazy operation
List<String> result = stream.limit(N).collect(Collectors.toList()); // read and collect

You're right that there won't be a benefit from map().reduce() or map().collect(), but there's a pretty obvious benefit with findAny() findFirst(), anyMatch(), allMatch(), etc. Basically, any operation that can be short-circuited.

Good question.
Assuming you write textbook perfect code, the difference in performance between a properly optimized for and a stream is not noticeable (streams tend to be slightly better class loading wise, but the difference should not be noticeable in most cases).
Consider the following example.
// Some lengthy computation
private static int doStuff(int i) {
try { Thread.sleep(1000); } catch (InterruptedException e) { }
return i;
}
public static OptionalInt findFirstGreaterThanStream(int value) {
return IntStream
.of(MY_INTS)
.map(Main::doStuff)
.filter(x -> x > value)
.findFirst();
}
public static OptionalInt findFirstGreaterThanFor(int value) {
for (int i = 0; i < MY_INTS.length; i++) {
int mapped = Main.doStuff(MY_INTS[i]);
if(mapped > value){
return OptionalInt.of(mapped);
}
}
return OptionalInt.empty();
}
Given the above methods, the next test should show they execute in about the same time.
public static void main(String[] args) {
long begin;
long end;
begin = System.currentTimeMillis();
System.out.println(findFirstGreaterThanStream(5));
end = System.currentTimeMillis();
System.out.println(end-begin);
begin = System.currentTimeMillis();
System.out.println(findFirstGreaterThanFor(5));
end = System.currentTimeMillis();
System.out.println(end-begin);
}
OptionalInt[8]
5119
OptionalInt[8]
5001
Anyway, we spend most of the time in the doStuff method. Let's say we want to add more threads to the mix.
Adjusting the stream method is trivial (considering your operations meets the preconditions of parallel streams).
public static OptionalInt findFirstGreaterThanParallelStream(int value) {
return IntStream
.of(MY_INTS)
.parallel()
.map(Main::doStuff)
.filter(x -> x > value)
.findFirst();
}
Achieving the same behavior without streams can be tricky.
public static OptionalInt findFirstGreaterThanParallelFor(int value, Executor executor) {
AtomicInteger counter = new AtomicInteger(0);
CompletableFuture<OptionalInt> cf = CompletableFuture.supplyAsync(() -> {
while(counter.get() != MY_INTS.length-1);
return OptionalInt.empty();
});
for (int i = 0; i < MY_INTS.length; i++) {
final int current = MY_INTS[i];
executor.execute(() -> {
int mapped = Main.doStuff(current);
if(mapped > value){
cf.complete(OptionalInt.of(mapped));
} else {
counter.incrementAndGet();
}
});
}
try {
return cf.get();
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
return OptionalInt.empty();
}
}
The tests execute in about the same time again.
public static void main(String[] args) {
long begin;
long end;
begin = System.currentTimeMillis();
System.out.println(findFirstGreaterThanParallelStream(5));
end = System.currentTimeMillis();
System.out.println(end-begin);
ExecutorService executor = Executors.newFixedThreadPool(10);
begin = System.currentTimeMillis();
System.out.println(findFirstGreaterThanParallelFor(5678, executor));
end = System.currentTimeMillis();
System.out.println(end-begin);
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);
executor.shutdownNow();
}
OptionalInt[8]
1004
OptionalInt[8]
1004
In conclusion, although we don't squeeze a big performance benefit out of streams (considering you write excellent multi-threaded code in your for alternative), the code itself tends to be more maintainable.
A (slightly off-topic) final note:
As with programming languages, higher level abstractions (streams relative to fors) make stuff easier to develop at the cost of performance. We did not move away from assembly to procedural languages to object-oriented languages because the later offered greater performance. We moved because it made us more productive (develop the same thing at a lower cost). If you are able to get the same performance out of a stream as you would do with a for and properly written multi-threaded code, I would say it's already a win.

I have a real example from our code base, since I'm going to simplify it, not entirely sure you might like it or fully grasp it...
We have a service that needs a List<CustomService>, I am suppose to call it. Now in order to call it, I am going to a database (much simpler than reality) and obtaining a List<DBObject>; in order to obtain a List<CustomService> from that, there are some heavy transformations that need to be done.
And here are my choices, transform in place and pass the list. Simple, yet, probably not that optimal. Second option, refactor the service, to accept a List<DBObject> and a Function<DBObject, CustomService>. And this sounds trivial, but it enables laziness (among other things). That service might sometimes need only a few elements from that List, or sometimes a max by some property, etc. - thus no need for me to do the heavy transformation for all elements, this is where Stream API pull based laziness is a winner.
Before Streams existed, we used to use guava. It had Lists.transform( list, function) that was lazy too.
It's not a fundamental feature of streams as such, it could have been done even without guava, but it's s lot simpler that way. The example here provided with findFirst is great and the simplest to understand; this is the entire point of laziness, elements are pulled only when needed, they are not passed from an intermediate operation to another in chunks, but pass from one stage to another one at a time.

One interesting use case that hasn't been mentioned is arbitrary composition of operations on streams, coming from different parts of the code base, responding to different sorts of business or technical requisites.
For example, say you have an application where certain users can see all the data but certain other users can only see part of it. The part of the code that checks user permissions can simply impose a filter on whatever stream is being handed about.
Without lazy streams, that same part of the code could be filtering the already realized full collection, but that may have been expensive to obtain, for no real gain.
Alternatively, that same part of the code might want to append its filter to a data source, but now it has to know whether the data comes from a database, so it can impose an additional WHERE clause, or some other source.
With lazy streams, it's a filter that can be implemented ever which way. Filters imposed on streams from the database can translate into the aforementioned WHERE clause, with obvious performance gains over filtering in-memory collections resulting from whole table reads.
So, a better abstraction, better performance, better code readability and maintainability, sounds like a win to me. :)

Non-lazy implementation would process all input and collect output to a new collection on each operation. Obviously, it's impossible for unlimited or large enough sources, memory-consuming otherwise, and unnecessarily memory-consuming in case of reducing and short-circuiting operations, so there are great benefits.

Check the following example
Stream.of("0","0","1","2","3","4")
.distinct()
.peek(a->System.out.println("after distinct: "+a))
.anyMatch("1"::equals);
If it was not behaving as lazy you would expect that all elements would pass through the distinct filtering first. But because of lazy execution it behaves differently. It will stream the minimum amount of elements needed to calculate the result.
The above example will print
after distinct: 0
after distinct: 1
How it works analytically:
First "0" goes until the terminal operation but does not satisfy it. Another element must be streamed.
Second "0" is filtered through .distinct() and never reaches terminal operation.
Since the terminal operation is not satisfied yet, next element is streamed.
"1" goes through terminal operation and satisfies it.
No more elements need to be streamed.

Difference between forEachOrdered() and sequential() methods of Java 8?

I am working on java 8 parallel stream and wanting to print the elements in parallel stream is some order (say insertion order, reverse order or sequential order).
For which i tried the following code:
System.out.println("With forEachOrdered:");
listOfIntegers
.parallelStream()
.forEachOrdered(e -> System.out.print(e + " "));
System.out.println("");
System.out.println("With Sequential:");
listOfIntegers.parallelStream()
.sequential()
.forEach(e -> System.out.print(e + " "));
And for both of these, i got the same output as follows:
With forEachOrdered:
1 2 3 4 5 6 7 8
With Sequential:
1 2 3 4 5 6 7 8
from the api documentation, i can see that:
forEachOrdered -> This is a terminal operation.
and
sequential -> This is an intermediate operation.
So my question is which one is more better to use?
and in which scenarios, one should be preferred over other?

listOfIntegers.parallelStream().sequential().forEach() creates a parallel Stream and then converts it to a sequential Stream, so you might as well use listOfIntegers.stream().forEach() instead, and get a sequential Stream in the first place.
listOfIntegers.parallelStream().forEachOrdered(e -> System.out.print(e + " ")) performs the operation on a parallel Stream, but guarantees the elements will be consumed in the encounter order of the Stream (if the Stream has a defined encounter order). However, it can be executed on multiple threads.
I don't see a reason of ever using listOfIntegers.parallelStream().sequential(). If you want a sequential Stream, why create a parallel Stream first?

You are asking somehow a misleading question, first you ask about:
.parallelStream()
.forEachOrdered(...)
This will create a parallel Stream, but elements will be consumed in order. If you add a map operation like this:
.map(...)
.parallelStream()
.forEachOrdered(...)
This will make the map very limited (from a parallel processing point of view) operations since threads have to wait for all other elements in encounter order to be processed (consumed by forEachOrdered). This regards stateless operations.
On the other hand if you have a stateful operation like:
.parallelStream()
.map()
.sorted()
.// other operations
Since sorted is stateful, the benefit of the stateless operations before it from a parallel processing will be bigger. And that happens because sorted has to gather all elements from the Stream, and Threads don't have to "wait" (at the forEachOrdered) for the elements in encounter order.
For the second example:
listOfIntegers.parallelStream()
.sequential()
.forEach(e -> System.out.print(e + " "))
you are basically saying turn parallel on and then turn it off. Streams are driven by the terminal operation, so even if you do:
.map...
.filter...
.parallel()
.map...
.sequential
This means that the entire pipeline will be executed sequentially, not that some part will be parallel and the other sequential. You are also relying on the fact that forEach preserves order and may be at the moment it does, but may be in a later release, sine you said you don't care about order (by using forEach in the first place), there will be an internal shuffling of the elements.

Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution. For example, Collection.stream() creates a sequential stream, and Collection.parallelStream() creates a parallel one. This choice of execution mode may be modified by the BaseStream.sequential() or BaseStream.parallel() methods.
So there is no need to use:
listOfIntegers.parallelStream().sequential()
You can only use:
listOfIntegers.stream()
If you are creating a parallel stream, it is possible for the elements of the stream to be processed by different threads. The difference between forEach and forEachOrdered is that forEach will allow any element of a parallel stream to be processed in any order, while forEachOrdered will always process the elements of a parallel stream in the order of their appearance in the original stream. When using parallelStream() and forEachOrdered is a very good example on how you can take advantage of multiple cores and still preserve the order of the output. Note that forEachOrdered forces the iteration of the elements of the stream in an ordered fashion. However, any operation that is chained before forEachOrdered will still happen in parallel because the stream is a parallel stream.
It is not documented by Oracle exactly what happens when you change the stream execution mode multiple times in a pipeline. It is not clear whether it is the last change that matters or whether operations invoked after calling parallel() can be executed in parallel and operations invoked after calling sequential() will be executed sequentially.

What are the performance differences with these two uses of the map stream function in java 8

Say I have the functions mutateElement() which does x operations and mutateElement2() which does y operations. What is the difference in performance between these two pieces of code.
Piece1:
List<Object> = array.stream().map(elem ->
mutateElement(elem);
mutateElement2(elem);
)
.collect(Collectors.toList());
Piece2:
List<Object> array = array.stream().map(elem ->
mutateElement(elem);
)
.collect(Collectors.toList());
array = array.stream().map(elem ->
mutateElement2(elem);
)
.collect(Collectors.toList());
Clearly The first implementation is better as it only uses one iterator, however the second uses two iterators. But would the difference be noticeable if I had say a million elements in the array.

The first implementation is not better simply because it uses only one iterator, the first implementation is better because it only collects once.
Nobody can tell you whether the difference would be noticeable if you had a million elements. (And if someone did try to tell you, you should not believe them.) Benchmark it.

Whatever you use stream or external loop, the problem is the same.
One iteration on the List in the first code and two iterations on the List in the second code.
The time of execution of the second code is so logically more important.
Besides invoking twice the terminal operation on the stream :
.collect(Collectors.toList());
rather than once, has also a cost.
But would the difference be noticeable if I had say a million elements
in the array.
It could be.
Now the question is hard to answer : yes or no.
It depends on other parameters such as cpus, number of concurrent users and processing and your definition of "noticeable".

stream parallel skip - does the order of the chained stream methods make any difference?

stream.parallel().skip(1)
vs
stream.skip(1).parallel()
This is about Java 8 streams.
Are both of these skipping the 1st line/entry?
The example is something like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicLong;
public class Test010 {
public static void main(String[] args) {
String message =
"a,b,c\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n";
try(BufferedReader br = new BufferedReader(new StringReader(message))){
AtomicLong cnt = new AtomicLong(1);
br.lines().parallel().skip(1).forEach(
s -> {
System.out.println(cnt.getAndIncrement() + "->" + s);
}
);
}catch (IOException e) {
e.printStackTrace();
}
}
}
Earlier today, I was sometimes getting the header line "a,b,c" in the lambda expression. This was a surprise since I was expecting to have skipped it already. Now I cannot get that example to work i.e. I cannot get the header line in the lambda expression. So I am pretty confused now, maybe something else was influencing that behavior. Of course this is just an example. In the real world the message is being read from a CSV file. The message is the full content of that CSV file.

You actually have two questions in one, the first being whether it makes a difference in writing stream.parallel().skip(1) or stream.skip(1).parallel(), the second being whether either or both will always skip the first element. See also “loaded question”.
The first answer is that it makes no difference, because specifying a .sequential() or .parallel() execution policy affects the entire Stream pipeline, regardless of where you place it in the call chain—of course, unless you specify multiple contradicting policies, in which case the last one wins.
So in either case you are requesting a parallel execution which might affect the outcome of the skip operation, which is subject of the second question.
The answer is not that simple. If the Stream has no defined encounter order in the first place, an arbitrary element might get skipped, which is a consequence of the fact that there is no “first” element, even if there might be an element you encounter first when iterating over the source.
If you have an ordered Stream, skip(1) should skip the first element, but this has been laid down only recently. As discussed in “Stream.skip behavior with unordered terminal operation”, chaining an unordered terminal operation had an effect on the skip operation in earlier implementations and there was some uncertainty of whether this could even be intentional, as visible in “Is this a bug in Files.lines(), or am I misunderstanding something about parallel streams?”, which happens to be close to your code; apparently skipping the first line is a common case.
The final word is that the behavior of earlier JREs is a bug and skip(1) on an ordered stream should skip the first element, even when the stream pipeline is executed in parallel and the terminal operation is unordered. The associated bug report names jdk1.8.0_60 as first fixed version, which I could verify. So if you are using on older implementation, you might experience the Stream skipping different elements when using .parallel() and the unordered .forEach(…) terminal operation. It’s not contradicting if the implementation occasionally skips the expected element, that’s the unpredictability of multi-threading.
So the answer still is that stream.parallel().skip(1) and stream.skip(1).parallel() have the same behavior, even when being used in earlier versions, as both are equally unpredictable when being used with an unordered terminal operation like forEach. They should always skip the first element with ordered Streams and when being used with 1.8.0_60 or newer, they do.

Yes, but skip(n) is slower as n is larger with a parallel stream.
Here's the API note from skip():
While skip() is generally a cheap operation on sequential stream pipelines, it can be quite expensive on ordered parallel pipelines, especially for large values of n, since skip(n) is constrained to skip not just any n elements, but the first n elements in the encounter order. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significant speedups of skip() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with skip() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.
So essentially, if you want better performance with skip(), don't use a parellel stream, or use an unordered stream.
As for it seeming to not work with parallel streams, perhaps you're actually seeing that the elements are no longer ordered? For example, an output of this code:
Stream.of("Hello", "How", "Are", "You?")
.parallel()
.skip(1)
.forEach(System.out::println);
Is
Are
You?
How
Ideone Demo
This is perfectly fine because forEach doesn't enforce the encounter order in a parallel stream. If you want it to enforce the encounter order, use a sequential stream (and perhaps use forEachOrdered so that your intent is obvious).
Stream.of("Hello", "How", "Are", "You?")
.skip(1)
.forEachOrdered(System.out::println);
How
Are
You?

How collectors are used when turning the stream in parallel

I actually tried to answer this question How to skip even lines of a Stream<String> obtained from the Files.lines. So I though this collector wouldn't work well in parallel:
private static Collector<String, ?, List<String>> oddLines() {
int[] counter = {1};
return Collector.of(ArrayList::new,
(l, line) -> {
if (counter[0] % 2 == 1) l.add(line);
counter[0]++;
},
(l1, l2) -> {
l1.addAll(l2);
return l1;
});
}
but it works.
EDIT: It didn't actually work; I got fooled by the fact that my input set was too small to trigger any parallelism; see discussion in comments.
I thought it wouldn't work because of the two following plans of executions comes to my mind.
1. The counter array is shared among all threads.
Thread t1 read the first element of the Stream, so the if condition is satisfied. It adds the first element to its list. Then the execution stops before he has the time to update the array value.
Thread t2, which says started at the 4th element of the stream add it to its list. So we end up with a non-wanted element.
Of course since this collector seems to works, I guess it doesn't work like that. And the updates are not atomic anyway.
2. Each Thread has its own copy of the array
In this case there is no more problems for the update, but nothing prevents me that the thread t2 will not start at the 4th element of the stream. So he doesn't work like that either.
So it seems that it doesn't work like that at all, which brings me to the question... how the collector is used in parallel?
Can someone explain me basically how it works and why my collector works when ran in parallel?
Thank you very much!

Passing a parallel() source stream into your collector is enough to break the logic because your shared state (counter) may be incremented from different tasks. You can verify that, because it is never returning the correct result for any finite stream input:
Stream<String> lines = IntStream.range(1, 20000).mapToObj(i -> i + "");
System.out.println(lines.isParallel());
lines = lines.parallel();
System.out.println(lines.isParallel());
List<String> collected = lines.collect(oddLines());
System.out.println(collected.size());
Note that for infinite streams (e.g. when reading from Files.lines()) you need to generate some significant amount of data in the stream, so it actually forks a task to run some chunks concurrently.
Output for me is:
false
true
12386
Which is clearly wrong.
As #Holger in the comments correctly pointed out, there is a different race that can happen when your collector is specifying CONCURRENT and UNORDERED, in which case they operate on a single shared collection across tasks (ArrayList::new called once per stream), where-as with only parallel() it will run the accumulator on a collection per task and then later combine the result using your defined combiner.
If you'd add the characteristics to the collector, you might run into the following result due to the shared state in a single collection:
false
true
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 73
at java.util.ArrayList.add(ArrayList.java:459)
at de.jungblut.stuff.StreamPallel.lambda$0(StreamPallel.java:18)
at de.jungblut.stuff.StreamPallel$$Lambda$3/1044036744.accept(Unknown Source)
at java.util.stream.ReferencePipeline.lambda$collect$207(ReferencePipeline.java:496)
at java.util.stream.ReferencePipeline$$Lambda$6/2003749087.accept(Unknown Source)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:496)
at de.jungblut.stuff.StreamPallel.main(StreamPallel.java:32)12386

Actually it's just a coincidence that this collector work. It doesn't work with custom data source. Consider this example:
List<String> list = IntStream.range(0, 10).parallel().mapToObj(String::valueOf)
.collect(oddLines());
System.out.println(list);
This produces always different result. The real cause is just because when BufferedReader.lines() stream is split by at least java.util.Spliterators.IteratorSpliterator.BATCH_UNIT number of lines which is 1024. If you have substantially bigger number of lines, it may fail even with BufferedReader:
String data = IntStream.range(0, 10000).mapToObj(String::valueOf)
.collect(Collectors.joining("\n"));
List<String> list = new BufferedReader(new StringReader(data)).lines().parallel()
.collect(oddLines());
list.stream().mapToInt(Integer::parseInt).filter(x -> x%2 != 0)
.forEach(System.out::println);
Were collector working normally this should not print anything. But sometimes it prints.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.