This question arose from the answer to another question where map and reduce were suggested to calculate a sum concurrently.
In that question there's a complexCalculation(e), but now I was wondering how to parallelise even further, by splitting the calculation in two parts, so that complexCalculation(e) = part1(e) * part2(e). I wonder whether it would be possible to calculate part1 and part2 on a collection concurrently (using map() again) and then zip the two resulting streams, so that the ith element of both streams is combined with the function * and the resulting stream equals the stream obtained by mapping complexCalculation(e) over that collection. In code this would look like:
Stream<Long> map1 = bigCollection.parallelStream().map(e -> part1(e));
Stream<Long> map2 = bigCollection.parallelStream().map(e -> part2(e));
// preferably map1 and map2 are computed concurrently...
Stream<Long> result = map1.zip(map2, (e1, e2) -> e1 * e2); // hypothetical zip method
result.equals(bigCollection.parallelStream().map(e -> complexCalculation(e))); // should be true
So my question is: does there exist some functionality like the zip function I tried to describe here?
parallelStream() makes no guarantee about the order in which elements are processed. This means you cannot assume that two parallel streams can be zipped together like this.
Your original bigCollection.parallelStream().map(e -> complexCalculation(e)) is likely to be faster, unless your collection is actually smaller than the number of CPUs you have.
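There is also no zip in the JDK. For sequential streams you can sketch one yourself via iterators; the helper below is my own hypothetical zip, not a standard API:
import java.util.Iterator;
import java.util.Spliterators;
import java.util.function.BiFunction;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Zips two streams element by element, stopping at the shorter one.
static <A, B, C> Stream<C> zip(Stream<A> a, Stream<B> b,
                               BiFunction<? super A, ? super B, ? extends C> combiner) {
    Iterator<A> itA = a.iterator();
    Iterator<B> itB = b.iterator();
    Iterator<C> zipped = new Iterator<C>() {
        public boolean hasNext() { return itA.hasNext() && itB.hasNext(); }
        public C next() { return combiner.apply(itA.next(), itB.next()); }
    };
    return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(zipped, 0), false);
}
Note that pulling both inputs through their iterators forces sequential evaluation, which is exactly why this approach cannot exploit two concurrently computed parallel streams.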
If you really want to parallelize part1 and part2 (for example, because your bigCollection has very few elements, fewer than there are CPU cores), you can use the following trick. Suppose you have two methods, part1 and part2, in the current class:
public long part1(Type t) { ... }
public long part2(Type t) { ... }
Create a stream of two functions created from these methods and process it in parallel like this:
bigCollection.parallelStream()
    .map(e -> Stream.<ToLongFunction<Type>>of(this::part1, this::part2)
                    .parallel()
                    .mapToLong(fn -> fn.applyAsLong(e))
                    .reduce(1, (a, b) -> a * b))
    // ... continue the outer stream operations
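For completeness, a sketch of how that fragment might be terminated (Type, part1, and part2 are the stand-ins from the question):
long[] products = bigCollection.parallelStream()
        .mapToLong(e -> Stream.<ToLongFunction<Type>>of(this::part1, this::part2)
                .parallel()
                .mapToLong(fn -> fn.applyAsLong(e))
                .reduce(1, (a, b) -> a * b)) // multiplies part1(e) by part2(e)
        .toArray();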
However, this is a very rare case. As @PeterLawrey noted, if your outer collection is big enough, there is no need to parallelize part1 and part2: the separate elements will already be handled in parallel.
I currently have this code:
AtomicInteger counter = new AtomicInteger(0);
return IntStream.range(0, costs.length)
.mapToObj(i -> new int[]{costs[i][0]-costs[i][1], i})
.sorted(Comparator.comparingInt(d -> d[0]))
.mapToInt(s ->
counter.getAndIncrement() < costs.length/2 ? costs[s[1]][0] : costs[s[1]][1]
)
.sum();
Here I compute the difference of two elements of an array, sort by it, and in the end process the two halves independently.
Is there any better way to do this than using an AtomicInteger as a counter? Is there some method like mapToIntWithIndex available in the JDK (not in external libraries)? Is there something like Python's zip() with which I could join indices to a stream? If not, are there any plans to add this in an upcoming Java release?
This is not a reliable way to do it. The Streams API makes it clear that the functions used in map operations should not be stateful.
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful.
If you use stateful functions, it may appear to work, but because you aren't using it according to the documentation, the behaviour is technically undefined, and could break in future versions of Java.
Collect to a list, and then process the two halves of the list:
List<int[]> list = /* your stream up to and including the sort */.collect(toList());
int half = costs.length / 2;
int sum = list.subList(0, half ).stream().mapToInt(s -> costs[s[1]][0]).sum()
        + list.subList(half, list.size()).stream().mapToInt(s -> costs[s[1]][1]).sum();
Actually, I'd be tempted to write it as for loops, as I just find it easier on the eye:
int sum = 0;
for (int[] s : list.subList(0, half)) sum += costs[s[1]][0];
for (int[] s : list.subList(half, list.size())) sum += costs[s[1]][1];
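If you prefer to stay in a stream pipeline after collecting, a sketch of the index-based variant (using list and half as defined above):
// The range index i replaces the AtomicInteger counter.
int sum = IntStream.range(0, list.size())
        .map(i -> i < half ? costs[list.get(i)[1]][0] : costs[list.get(i)[1]][1])
        .sum();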
There are multiple questions about streams, but I didn't find any for this use case in Java.
I have a huge stream of objects, Stream<A> (~1 million objects). The stream comes from a file.
class A { enum Status { Running, Queued, Completed }; String name; }
I want to split Stream<A> into three streams without using any collect calls, because collect loads everything into memory.
I am facing a StackOverflowError because I am calling Stream.concat multiple times here.
Stream.concat has the problem mentioned in the Java docs:
"Implementation Note:
Use caution when constructing streams from repeated concatenation. Accessing an element of a deeply concatenated stream can result in deep call chains, or even StackOverflowError."
Map<Status, Stream<String>> splitStream = new HashMap<>();
streamA.forEach(aObj -> {
    Stream<String> statusBasedStream =
            splitStream.getOrDefault(aObj.status, Stream.of());
    splitStream.put(aObj.status,
            Stream.concat(statusBasedStream, Stream.of(aObj.name)));
});
There are a few custom stream implementations on GitHub that achieve concatenation, but I wanted to solve this with the standard libraries.
If the data were smaller, I would have taken a list approach, as mentioned here (Split stream into substreams with N elements).
This is not an exact solution to the problem, but if you have information about the indexes, then a combination of Stream.skip() and Stream.limit() can help. Below is the dummy code that I tried (note that a stream cannot be consumed twice, so each slice has to start from a fresh stream):
int queuedNumbers = 100;
int runningNumbers = 200;
// A stream can only be traversed once, so obtain a fresh stream for each slice.
Supplier<Stream<Object>> all = () -> Stream.of(/* your source */);
Stream<Object> queued    = all.get().limit(queuedNumbers);
Stream<Object> running   = all.get().skip(queuedNumbers).limit(runningNumbers);
Stream<Object> completed = all.get().skip(queuedNumbers + runningNumbers);
Hope this is of some help.
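If the counts are not known up front, another memory-friendly option (my own sketch, assuming the Status enum and A class from the question, inside a method declared to throw IOException) is a single pass that routes each element straight to a per-status file, so nothing is buffered:
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.EnumMap;
import java.util.Map;

// One writer per status; each name is written out as it streams past.
Map<Status, BufferedWriter> writers = new EnumMap<>(Status.class);
for (Status s : Status.values()) {
    writers.put(s, Files.newBufferedWriter(Paths.get(s + ".txt")));
}
streamA.forEach(aObj -> {
    try {
        BufferedWriter w = writers.get(aObj.status);
        w.write(aObj.name);
        w.newLine();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});
for (BufferedWriter w : writers.values()) {
    w.close();
}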
I am having trouble understanding why a parallel stream and a sequential stream give different results for the exact same statement.
List<String> list = Arrays.asList("1", "2", "3");
String resultParallel = list.parallelStream().collect(StringBuilder::new,
(response, element) -> response.append(" ").append(element),
(response1, response2) -> response1.append(",").append(response2.toString()))
.toString();
System.out.println("ResultParallel: " + resultParallel);
String result = list.stream().collect(StringBuilder::new,
(response, element) -> response.append(" ").append(element),
(response1, response2) -> response1.append(",").append(response2.toString()))
.toString();
System.out.println("Result: " + result);
ResultParallel: 1, 2, 3
Result: 1 2 3
Can somebody explain why this is happening, and how I can get the non-parallel version to give the same result as the parallel version?
The Java 8 Stream.collect method has the following signature:
<R> R collect(Supplier<R> supplier,
BiConsumer<R, ? super T> accumulator,
BiConsumer<R, R> combiner);
The BiConsumer<R, R> combiner is called only for parallel streams (in order to combine the partial results into a single container), which is why the output of your first code snippet is:
ResultParallel: 1, 2, 3
In the sequential version the combiner doesn't get called (see this answer), therefore the following statement is ignored:
(response1, response2) -> response1.append(",").append(response2.toString())
and the result is different:
1 2 3
How to fix it? Check @Eugene's answer or this question and its answers.
To understand why this is going wrong, consider this from the javadoc:
accumulator - an associative, non-interfering, stateless function that must fold an element into a result container.
combiner - an associative, non-interfering, stateless function that accepts two partial result containers and merges them, which must be compatible with the accumulator function. The combiner function must fold the elements from the second result container into the first result container.
What this is saying is that it should not matter whether the elements are collected by "accumulating" or "combining" or some combination of the two. But in your code, the accumulator and the combiner concatenate using a different separator. They are not "compatible" in the sense required by the javadoc.
That leads to inconsistent results depending on whether sequential or parallel streams are used.
In the parallel case, the stream is split into substreams¹ to be handled by different threads. This leads to a separate collection for each substream. The collections are then combined.
In the sequential case, the stream is not split. Instead, the stream is simply accumulated into a single collection, and no combining needs to take place.
Observations:
In general, for a stream of this size performing a simple transformation, parallelStream() is liable to make things slower.
In this specific case, the bottleneck with the parallelStream() version will be the combining step. That is a serial step, and it performs the same amount of copying as the entire serial pipeline. So, in fact, parallelization is definitely going to make things slower.
In fact, the lambdas do not behave correctly. They add an extra space at the start, and double some spaces if the combiner is used. A more correct version would be:
String result = list.stream().collect(
    StringBuilder::new,
    (b, e) -> b.append(b.length() == 0 ? "" : " ").append(e),  // length() == 0: Java 8-compatible emptiness test
    (l, r) -> l.append(l.length() == 0 ? "" : " ").append(r)).toString();
A joining collector (Collectors.joining) is a far simpler and more efficient way to concatenate streams. (Credit: @Eugene)
¹ In this case, the substreams each have only one element. For a longer list, you would typically get as many substreams as there are worker threads, and the substreams would contain multiple elements.
As a side note, even if you replace "," with a space in the combiner, your results are still going to differ (I slightly altered the code to make it more readable):
String resultParallel = list.parallelStream().collect(
StringBuilder::new,
(builder, elem) -> builder.append(" ").append(elem),
(left, right) -> left.append(" ").append(right)).toString();
String result = list.stream().collect(
StringBuilder::new,
(builder, elem) -> builder.append(" ").append(elem),
(left, right) -> left.append(" ").append(right)).toString();
System.out.println("ResultParallel: ->" + resultParallel + "<-"); // -> 1 2 3 4<-
System.out.println("Result: ->" + result + "<-"); // -> 1 2 3 4<-
Notice how you end up with a few too many spaces (the parallel version even doubles them between elements).
The javadoc has the hint:
combiner... must be compatible with the accumulator function
If you want to join, there are simpler options like:
String.join(",", yourList)
yourList.stream().collect(Collectors.joining(","))
I have the following code:
ArrayList<String> entries = new ArrayList<>();
entries.add("0");
entries.add("1");
entries.add("2");
entries.add("3");
String firstNotHiddenItem = entries.stream()
.filter(e -> e.equals("2"))
.findFirst()
.get();
I need to know the index of that first returned element, since I need to edit it inside the entries ArrayList. As far as I know, get() returns the value of the element, not a reference. Should I just use
int indexOf(Object o)
instead?
You can get the index of an element using an IntStream like:
int index = IntStream.range(0, entries.size())
.filter(i -> "2".equals(entries.get(i)))
.findFirst().orElse(-1);
But you should rather use the List::indexOf method, which is the preferred way because it's more concise, more expressive, and computes the same result.
You can't in a straightforward way - streams process elements without context of where they are in the stream.
However, if you're prepared to take the gloves off...
int[] position = {-1};
String firstNotHiddenItem = entries.stream()
.peek(x -> position[0]++) // increment on every element encountered
.filter("2"::equals)
.findFirst()
.get();
System.out.println(position[0]); // 2
The use of an int[], instead of a simple int, is to circumvent the "effectively final" requirement; the reference to the array is constant, only its contents change.
Note also the use of a method reference, "2"::equals, instead of a lambda, e -> e.equals("2"), which not only avoids a possible NPE (if a stream element is null) but, more importantly, looks way cooler.
A more palatable (less hackalicious) version:
AtomicInteger position = new AtomicInteger(-1);
String firstNotHiddenItem = entries.stream()
.peek(x -> position.incrementAndGet()) // increment on every element encountered
.filter("2"::equals)
.findFirst()
.get();
position.get(); // 2
This will work using Eclipse Collections with Java 8:
int firstIndex = ListIterate.detectIndex(entries, "2"::equals);
If you use a MutableList, you can simplify the code as follows:
MutableList<String> entries = Lists.mutable.with("0", "1", "2", "3");
int firstIndex = entries.detectIndex("2"::equals);
There is also a method to find the last index.
int lastIndex = entries.detectLastIndex("2"::equals);
Note: I am a committer for Eclipse Collections
Yes, you should use indexOf("2") instead. As you might have noticed, any stream-based solution has a higher complexity, without providing any benefit.
In this specific situation, there is no significant difference in performance, but overusing streams can cause dramatic performance degradation, e.g. when using map.entrySet().stream().filter(e -> e.getKey().equals(object)).map(e -> e.getValue()) instead of a simple map.get(object).
The collection operations may utilize their known structure, while most stream operations imply a linear search, so genuine collection operations are preferable.
Of course, if there is no suitable collection operation, for example when your predicate is not a simple equality test, the Stream API may be the right tool. As shown in "Is there a concise way to iterate over a stream with indices in Java 8?", the solution for any task involving indices is to use the indices as the starting point, e.g. via IntStream.range, and access the list via List.get(int). If the source is not an array or a random-access List, there is no equally clean and efficient solution. Sometimes, a loop might turn out to be the simplest and most efficient solution.
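As a sketch of that index-based pattern with a non-equality predicate (the startsWith test is just an illustrative stand-in):
// Index of the first element matching an arbitrary predicate, or -1 if none.
int index = IntStream.range(0, entries.size())
        .filter(i -> entries.get(i).startsWith("2"))
        .findFirst()
        .orElse(-1);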
I am looking for a memory-efficient way in Java to find top n elements from a huge collection. For instance, I have a word, a distance() method, and a collection of "all" words.
I have implemented a class Pair that implements Comparable, with compareTo() defined so that pairs are ordered by their values.
Using streams, my naive solution looks like this:
double distance(String word1, String word2){
...
}
Collection<String> words = ...;
String word = "...";
List<Pair<String, Double>> result = words.stream()
    .map(w -> new Pair<String, Double>(w, distance(word, w)))
    .sorted()
    .limit(n)
    .collect(Collectors.toList()); // a terminal operation is needed to run the pipeline
To my understanding, this will process and intermediately store every element of words so that the stream can be sorted before limit() is applied. However, it would be more memory-efficient to have a collection that stores only n elements and, whenever a new element is added, evicts its greatest element (according to the elements' natural order), thus never growing larger than n (or n+1).
This is exactly what the Guava MinMaxPriorityQueue does. Thus, my current best solution to the above problem is this:
Queue<Pair<String, Double>> neighbours = MinMaxPriorityQueue.maximumSize(n).create();
words.stream()
     .forEach(w -> neighbours.add(new Pair<String, Double>(w, distance(word, w))));
The sorting of the top n elements remains to be done after converting the queue to a stream or list, but this is not an issue since n is relatively small.
My question is: is there a way to do the same using streams?
A heap-based structure will of course be more efficient than sorting the entire huge list. Luckily, the streams library is perfectly happy to let you use specialized collections when necessary:
MinMaxPriorityQueue<Pair<String, Double>> topN = words.stream()
.map(w -> new Pair<String, Double>(w, distance(word, w)))
.collect(toCollection(
() -> MinMaxPriorityQueue.maximumSize(n).create()
));
This is better than the .forEach solution because it's easy to parallelize and is more idiomatic Java 8.
Note that () -> MinMaxPriorityQueue.maximumSize(n).create() should be replaceable with MinMaxPriorityQueue.maximumSize(n)::create but, for some reason, that won't compile under some conditions (see the comments below).
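If you would rather avoid Guava altogether, a JDK-only variant is possible. This is my own sketch (assuming Pair is Comparable as in the question), using a PriorityQueue as a bounded max-heap:
import java.util.Comparator;
import java.util.PriorityQueue;

// Keep the n smallest pairs: evict the largest whenever the heap exceeds n.
PriorityQueue<Pair<String, Double>> topN = words.stream()
        .map(w -> new Pair<String, Double>(w, distance(word, w)))
        .collect(
            () -> new PriorityQueue<Pair<String, Double>>(Comparator.reverseOrder()),
            (heap, pair) -> {
                heap.offer(pair);
                if (heap.size() > n) heap.poll(); // drop the current largest
            },
            (left, right) -> {
                for (Pair<String, Double> p : right) {
                    left.offer(p);
                    if (left.size() > n) left.poll();
                }
            });
Iteration order of a PriorityQueue is unspecified, so the final sort of the n survivors still has to happen afterwards, just as with the MinMaxPriorityQueue.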