If the input size is too small, the library automatically serializes the execution of the maps in the stream, but this automation doesn't and can't take into account how heavy the map operation is. Is there a way to force parallelStream() to actually parallelize CPU-heavy maps?
There seems to be a fundamental misunderstanding. The linked Q&A discusses that the stream apparently doesn’t work in parallel, due to the OP not seeing the expected speedup. The conclusion is that there is no benefit in parallel processing if the workload is too small, not that there was an automatic fallback to sequential execution.
It’s actually the opposite. If you request parallel, you get parallel, even if it actually reduces the performance. The implementation does not switch to the potentially more efficient sequential execution in such cases.
So if you are confident that the per-element workload is high enough to justify the use of a parallel execution regardless of the small number of elements, you can simply request a parallel execution.
As can easily be demonstrated:
Stream.of(1, 2).parallel()
.peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
.forEach(System.out::println);
On Ideone, it prints
processing 2 in Thread[main,5,main]
2
processing 1 in Thread[ForkJoinPool.commonPool-worker-1,5,main]
1
but the order of messages and details may vary. It may even be possible that in some environments both tasks happen to get executed by the same thread, if it can steal the second task before another thread gets started to pick it up. But of course, if the tasks are expensive enough, this won’t happen. The important point is that the overall workload has been split and enqueued to be potentially picked up by other worker threads.
If execution by a single thread happens in your environment for the simple example above, you may insert simulated workload like this:
Stream.of(1, 2).parallel()
      .peek(x -> System.out.println("processing "+x+" in "+Thread.currentThread()))
      .map(x -> {
          LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(3));
          return x;
      })
      .forEach(System.out::println);
Then, you may also see that the overall execution time will be shorter than “number of elements” × “processing time per element” if the “processing time per element” is high enough.
Update: the misunderstanding might be caused by Brian Goetz’ misleading statement: “In your case, your input set is simply too small to be decomposed”.
It must be emphasized that this is not a general property of the Stream API, but of the Map that has been used. A HashMap has a backing array, and the entries are distributed within that array depending on their hash codes. It might be the case that splitting the array into n ranges doesn’t lead to a balanced split of the contained elements, especially if there are only two. The implementors of the HashMap’s Spliterator considered searching the array for elements to get a perfectly balanced split to be too expensive, not that splitting two elements was not worth it.
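A small sketch can make this visible (illustrative only; the concrete outcome depends on the JDK version and the keys’ hash codes):
import java.util.HashMap;
import java.util.Map;
import java.util.Spliterator;

public class HashMapSplitDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>(); // default capacity 16
        map.put("1", "a");
        map.put("2", "b");

        Spliterator<String> first = map.keySet().spliterator();
        // splits the backing array range in half (non-null here, as the range is large enough)
        Spliterator<String> second = first.trySplit();

        // The estimates are only a guess (roughly the total size halved); how many
        // of the two entries actually end up in each half depends on where their
        // hash codes placed them in the backing array.
        System.out.println("first  estimate: " + first.estimateSize());
        System.out.println("second estimate: " + second.estimateSize());
        first.forEachRemaining(k -> System.out.println("first  half contains " + k));
        second.forEachRemaining(k -> System.out.println("second half contains " + k));
    }
}
With these particular keys, both entries will typically end up in the same half, so one worker would get all the work and the other none.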
Since the HashMap’s default capacity is 16 and the example had only two elements, we can say that the map was oversized. Simply fixing that would also fix the example:
long start = System.nanoTime();
Map<String, Supplier<String>> input = new HashMap<>(2);
input.put("1", () -> {
    System.out.println(Thread.currentThread());
    LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
    return "a";
});
input.put("2", () -> {
    System.out.println(Thread.currentThread());
    LockSupport.parkNanos("simulated workload", TimeUnit.SECONDS.toNanos(2));
    return "b";
});
Map<String, String> results = input.keySet()
    .parallelStream().collect(Collectors.toConcurrentMap(
        key -> key,
        key -> input.get(key).get()));
System.out.println("Time: " + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
on my machine, it prints
Thread[main,5,main]
Thread[ForkJoinPool.commonPool-worker-1,5,main]
Time: 2058
The conclusion is that the Stream implementation always tries to use parallel execution if you request it, regardless of the input size. But how well the workload can be distributed to the worker threads depends on the structure of the input. Things could be even worse, e.g. if you stream lines from a file.
If you think that the benefit of a balanced splitting is worth the cost of a copying step, you could also use new ArrayList<>(input.keySet()).parallelStream() instead of input.keySet().parallelStream(), as the distribution of elements within an ArrayList always allows a perfectly balanced split.
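Applied to the example above, only the creation of the stream changes:
Map<String, String> results = new ArrayList<>(input.keySet())
    .parallelStream().collect(Collectors.toConcurrentMap(
        key -> key,
        key -> input.get(key).get()));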
stream.parallel().skip(1)
vs
stream.skip(1).parallel()
This is about Java 8 streams.
Are both of these skipping the 1st line/entry?
The example is something like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicLong;
public class Test010 {
public static void main(String[] args) {
String message =
"a,b,c\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n1,2,3\n4,5,6\n7,8,9\n";
try(BufferedReader br = new BufferedReader(new StringReader(message))){
AtomicLong cnt = new AtomicLong(1);
br.lines().parallel().skip(1).forEach(
s -> {
System.out.println(cnt.getAndIncrement() + "->" + s);
}
);
}catch (IOException e) {
e.printStackTrace();
}
}
}
Earlier today, I was sometimes getting the header line "a,b,c" in the lambda expression. This was a surprise since I was expecting to have skipped it already. Now I cannot reproduce that behavior, i.e. I cannot get the header line to show up in the lambda expression. So I am pretty confused now; maybe something else was influencing that behavior. Of course this is just an example. In the real world the message is being read from a CSV file; the message is the full content of that CSV file.
You actually have two questions in one, the first being whether it makes a difference in writing stream.parallel().skip(1) or stream.skip(1).parallel(), the second being whether either or both will always skip the first element. See also “loaded question”.
The first answer is that it makes no difference, because specifying a .sequential() or .parallel() execution policy affects the entire Stream pipeline, regardless of where you place it in the call chain—of course, unless you specify multiple contradicting policies, in which case the last one wins.
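You can verify this yourself, as isParallel() reports the pipeline’s current execution policy (a minimal sketch):
System.out.println(Stream.of(1, 2, 3).parallel().skip(1).isParallel());      // true
System.out.println(Stream.of(1, 2, 3).skip(1).parallel().isParallel());      // true
System.out.println(Stream.of(1, 2, 3).parallel().sequential().isParallel()); // false, the last policy wins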
So in either case you are requesting a parallel execution which might affect the outcome of the skip operation, which is subject of the second question.
The answer is not that simple. If the Stream has no defined encounter order in the first place, an arbitrary element might get skipped, which is a consequence of the fact that there is no “first” element, even if there might be an element you encounter first when iterating over the source.
If you have an ordered Stream, skip(1) should skip the first element, but this has been laid down only recently. As discussed in “Stream.skip behavior with unordered terminal operation”, chaining an unordered terminal operation had an effect on the skip operation in earlier implementations and there was some uncertainty of whether this could even be intentional, as visible in “Is this a bug in Files.lines(), or am I misunderstanding something about parallel streams?”, which happens to be close to your code; apparently skipping the first line is a common case.
The final word is that the behavior of earlier JREs is a bug and skip(1) on an ordered stream should skip the first element, even when the stream pipeline is executed in parallel and the terminal operation is unordered. The associated bug report names jdk1.8.0_60 as the first fixed version, which I could verify. So if you are using an older implementation, you might experience the Stream skipping different elements when using .parallel() and the unordered .forEach(…) terminal operation. It’s not a contradiction if the implementation occasionally skips the expected element; that’s the unpredictability of multi-threading.
So the answer still is that stream.parallel().skip(1) and stream.skip(1).parallel() have the same behavior, even when being used in earlier versions, as both are equally unpredictable when being used with an unordered terminal operation like forEach. They should always skip the first element with ordered Streams and when being used with 1.8.0_60 or newer, they do.
Yes, but skip(n) gets slower as n gets larger with a parallel stream.
Here's the API note from skip():
While skip() is generally a cheap operation on sequential stream pipelines, it can be quite expensive on ordered parallel pipelines, especially for large values of n, since skip(n) is constrained to skip not just any n elements, but the first n elements in the encounter order. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significant speedups of skip() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with skip() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.
So essentially, if you want better performance with skip(), don't use a parallel stream, or use an unordered stream.
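For example, if it doesn't matter which element gets dropped, something like this should be cheaper (a sketch; unordered() lifts the encounter-order constraint, so skip(1) is then free to drop an arbitrary element):
Stream.of("Hello", "How", "Are", "You?")
    .parallel()
    .unordered()
    .skip(1)
    .forEach(System.out::println);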
As for it seeming to not work with parallel streams, perhaps you're actually seeing that the elements are no longer ordered? For example, an output of this code:
Stream.of("Hello", "How", "Are", "You?")
.parallel()
.skip(1)
.forEach(System.out::println);
Is
Are
You?
How
Ideone Demo
This is perfectly fine because forEach doesn't enforce the encounter order in a parallel stream. If you want it to enforce the encounter order, use a sequential stream (and perhaps use forEachOrdered so that your intent is obvious).
Stream.of("Hello", "How", "Are", "You?")
.skip(1)
.forEachOrdered(System.out::println);
How
Are
You?
I'm processing a potentially infinite stream of data elements that follow the pattern:
E1 <start mark>
E2 foo
E3 bah
...
En-1 bar
En <end mark>
That is, a stream of <String>s, which must be accumulated in a buffer before I can map them to my object model.
Goal: aggregate a Stream<String> into a Stream<ObjectDefinedByStrings> without the overhead of collecting on an infinite stream.
In english, the code would be something like "Once you see a start marker, start buffering. Buffer until you see an end marker, then get ready to return the old buffer, and prepare a fresh buffer. Return the old buffer."
My current implementation has the form:
Data<String>.stream()
.map(functionReturningAnOptionalPresentOnlyIfObjectIsComplete)
.filter(Optional::isPresent)
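where functionReturningAnOptionalPresentOnlyIfObjectIsComplete is roughly a stateful function of this shape (a simplified sketch for illustration only; it ignores the start marker, assumes an ObjectDefinedByStrings constructor taking a List<String>, and only works sequentially because of the mutable buffer):
// illustrative sketch, not the actual code
Function<String, Optional<ObjectDefinedByStrings>> functionReturningAnOptionalPresentOnlyIfObjectIsComplete =
    new Function<String, Optional<ObjectDefinedByStrings>>() {
        private List<String> buffer = new ArrayList<>();

        @Override
        public Optional<ObjectDefinedByStrings> apply(String line) {
            buffer.add(line);
            if (line.contains("<end mark>")) {       // object complete
                List<String> complete = buffer;
                buffer = new ArrayList<>();          // prepare a fresh buffer
                return Optional.of(new ObjectDefinedByStrings(complete));
            }
            return Optional.empty();                 // still accumulating
        }
    };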
I have several questions:
What is this operation properly called? (i.e. what can I Google for more examples? Every discussion I find of .map() talks about 1:1 mapping. Every discussion of .reduce() talks about n:1 reduction. Every discussion of .collect() talks about accumulating as a terminal operation...)
This seems bad in many different ways. Is there a better way of implementing this? (A candidate of the form .collectUntilConditionThenApplyFinisher(Collector,Condition,Finisher)...?)
Thanks!
To avoid your kludge you could filter before mapping.
Data<String>.stream()
.filter(text -> canBeConvertedToObject(text))
.map(text -> convertToObject(text))
That works perfectly well on an infinite stream and only constructs objects that need to be constructed. It also avoids the overhead of creating unnecessary Optional objects.
Unfortunately there's no partial reduce operation in the Java 8 Stream API. However, such an operation is implemented in my StreamEx library which enhances standard Java 8 Streams. So your task can be solved like this:
Stream<ObjectDefinedByStrings> result =
StreamEx.of(strings)
.groupRuns((a, b) -> !b.contains("<start mark>"))
.map(stringList -> constructObjectDefinedByStrings(stringList));
Here strings is a normal Java 8 stream or another source like an array, Collection, Spliterator, etc. It works fine with infinite or parallel streams. The groupRuns method takes a BiPredicate which is applied to two adjacent stream elements and returns true if these elements must be grouped. Here we say that elements should be grouped unless the second one contains "<start mark>" (which is the start of a new element). After that you will get a stream of List<String> elements.
If collecting to the intermediate lists is not appropriate for you, you can use the collapse(BiPredicate, Collector) method and specify the custom Collector to perform the partial reduction. For example, you may want to join all the strings together:
Stream<ObjectDefinedByStrings> result =
StreamEx.of(strings)
.collapse((a, b) -> !b.contains("<start mark>"), Collectors.joining())
.map(joinedString -> constructObjectDefinedByStrings(joinedString));
I propose 2 more use cases for this partial reduction:
1. Parsing SQL and PL/SQL (Oracle procedural) statements
The standard delimiter for SQL statements is the semicolon (;). It separates normal SQL statements from each other. But if you have a PL/SQL statement, then semicolons also separate operators inside the statement from each other, not only statements as a whole.
One way of parsing a script file containing both normal SQL and PL/SQL statements is to first split it by semicolons and then, if a particular statement starts with specific keywords (DECLARE, BEGIN, etc.), join this statement with the following statements according to the rules of the PL/SQL grammar.
By the way, this cannot be done using StreamEx partial reduce operations, since they only test two adjacent elements: you need to know about the previous stream elements, starting from the initial PL/SQL keyword element, to determine whether the current element should be included in the partial reduction or whether the partial reduction should be finished. In this case a mutable partial reduction may be usable, with a collector holding information about the already collected elements and some Predicate testing only the collector itself (to decide whether the partial reduction should be finished) or a BiPredicate testing both the collector and the current stream element.
In theory, we're speaking about implementing an LR(0) or LR(1) parser (see https://en.wikipedia.org/wiki/LR_parser) using the Stream pipeline ideology. An LR parser can be used to parse the syntax of most programming languages.
A parser is a finite automaton with a stack. In the case of an LR(0) automaton, its transition depends on the stack only. In the case of an LR(1) automaton, it depends both on the stack and on the next element from the stream (theoretically there can be LR(2), LR(3), etc. automata peeking at the next 2, 3, etc. elements to determine the transition, but in practice all programming languages are syntactically LR(1) languages).
To implement the parser there should be a Collector containing the stack of the finite automaton and a predicate testing whether the final state of this automaton has been reached (so we can stop the reduction). In the case of LR(0) it should be a Predicate testing the Collector itself. And in the case of LR(1) it should be a BiPredicate testing both the Collector and the next element from the stream (since the transition depends on both the stack and the next symbol).
So to implement an LR(0) parser we would need something like the following (T is the stream element type, A is the accumulator holding both the finite automaton stack and the result, R is the result of each parser run, forming the output stream):
<R, A> Stream<R> Stream<T>.parse(
        Collector<T, A, R> automataCollector,
        Predicate<A> isFinalState)
(I removed complexity like ? super T instead of T for compactness; the resulting API should contain these)
To implement an LR(1) parser we would need something like the following:
<R, A> Stream<R> Stream<T>.parse(
        BiPredicate<A, T> isFinalState,
        Collector<T, A, R> automataCollector)
NOTE: In this case the BiPredicate should test the element before it is consumed by the accumulator. Remember that an LR(1) parser peeks at the next element to determine the transition. So there is a potential exception if an empty accumulator refuses to accept the next element (the BiPredicate returns true, signaling that the partial reduction is over, for an empty accumulator just created by the Supplier and the next stream element).
2. Conditional batching based on stream element type
When we're executing SQL statements we want to merge adjacent data-modification (DML) statements into a single batch (see the JDBC API) to improve overall performance. But we don't want to batch queries. So we need conditional batching (instead of unconditional batching like in "Java 8 Stream with batch processing").
For this specific case StreamEx partial reduce operations can be used, since if both adjacent elements tested by the BiPredicate are DML statements, they should be included in the same batch. So we don't need to know the previous history of the batch collection.
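For instance, something along these lines could work (a sketch; statements is a hypothetical Stream<String> of parsed statements and isDml is a hypothetical predicate classifying a statement as DML):
Stream<List<String>> batches =
    StreamEx.of(statements)
            // a run of adjacent DML statements forms one batch;
            // queries end up as single-element groups
            .groupRuns((a, b) -> isDml(a) && isDml(b));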
But we can increase the complexity of the task and say that batches should be limited in size, say, to no more than 100 DML statements per batch. In this case we cannot ignore the previous batch collection history, and using a BiPredicate to determine whether the batch collection should be continued or stopped is insufficient.
We could add flatMap after the StreamEx partial reduction to split long batches into parts. But this would delay the execution of a specific 100-element batch until all adjacent DML statements have been collected into an unlimited batch. Needless to say, this is against the pipeline ideology: we want to minimize buffering to maximize the speed between input and output. Moreover, unlimited batch collection may result in an OutOfMemoryError in case of a very long list of DML statements without any queries in between (say, a million INSERTs as the result of a database export), which is intolerable.
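For completeness, that flatMap workaround might look roughly like this (a sketch using the same hypothetical statements and isDml as above; note that a whole run is buffered before it is cut into 100-element chunks):
StreamEx.of(statements)
        .groupRuns((a, b) -> isDml(a) && isDml(b))
        .flatMap(run -> IntStream.range(0, (run.size() + 99) / 100)
                .mapToObj(i -> run.subList(100 * i, Math.min(run.size(), 100 * (i + 1)))))
        .forEach(batch -> { /* execute the batch */ });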
So in the case of this complex conditional batch collection with an upper limit, we also need something as powerful as the LR(0) parser described in the previous use case.
I have the following problem:
For example, I have 10 lists, and each one has links to some other lists. I would like to write code to search for elements in these lists. I have already implemented this algorithm, but sequentially: it starts the search in the first list, and if the search fails there, it sends messages to search the lists linked to the first one. At the end, the algorithm shows the results: the number of lists visited and whether the element was found or not.
Now, I want to transform it into a parallel algorithm, or at least a concurrent one using multiple threads:
to use threads for searching;
to start a search in the 10 lists at the same time;
As long as you don't change anything, you can consider your search read-only. In that case, you probably don't need synchronization. If you want a fast search, don't use threads directly; use Runnables and look for the appropriate classes. If you do work directly with threads, make sure that you don't exceed the number of processors.
Before going further, read into multi-threading. I would mention "Java Concurrency in Practice" as a (rather safe) recommendation. It's too easy to get wrong.
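For example, a plain-JDK way to start the search in all lists at once and take the first hit is ExecutorService.invokeAny. The following is a simplified sketch that ignores the links between the lists; lists, wanted and findInList (a hypothetical method that searches one list and throws if the element is not in it) are placeholders:
ExecutorService pool =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
try {
    List<Callable<String>> tasks = lists.stream()
        .map(list -> (Callable<String>) () -> findInList(list, wanted))
        .collect(Collectors.toList());
    // invokeAny returns the result of the first task that completes normally
    // and cancels the remaining ones
    String result = pool.invokeAny(tasks);
    System.out.println("found: " + result);
} catch (InterruptedException | ExecutionException e) {
    System.out.println("not found in any list (or interrupted)");
} finally {
    pool.shutdownNow();
}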
I am not sure from your problem statement if there is a sequential dependency between the searches in different lists, namely whether search results from the first list should take priority over those from the second list or not.
Assuming there's no such dependency, so that any search result from any list is fine, this can be expressed in a very concise way as a speculative search in Ateji PX:
Object parallelSearch(Collection<List> lists)
{
// start of parallel block
[
// create one parallel branch for each of the lists
|| (List list: lists) {
// start searching lists in parallel
Object result = search(list);
// the first branch that finds a result returns it,
// thereby stopping all remaining parallel branches
if(result != null) return result;
}
]
}
The term "speculative" means that you run a number of searches in parallel although you know that you will use the result of only one of them. Then as soon as you have found a result in one of the searches, you want to stop all the remaining searches.
If you run this code on a 4-core machine, the Ateji PX runtime will schedule 4 parallel branches at a time in order to make the best use of the available parallel hardware.