Making a Parallel IntStream more efficient/faster?

I've looked for a while for this answer, but couldn't find anything.
I'm trying to create an IntStream that can very quickly find primes (many, many primes, very fast -- millions in a few seconds).
I am currently using this parallel stream:
import java.util.stream.*;
import java.math.BigInteger;

public class Primes {
    public static IntStream stream() {
        return IntStream.iterate(3, i -> i + 2).parallel()
            .filter(i -> i % 3 != 0)
            .mapToObj(BigInteger::valueOf)
            .filter(i -> i.isProbablePrime(1))
            .flatMapToInt(i -> IntStream.of(i.intValue()));
    }
}
but it takes too long to generate the numbers (7546 ms to generate 1,000,000 primes).
Is there any obvious way to make this more efficient/faster?

There are two general obstacles to efficient parallel processing in your code. First, you are using iterate, which unavoidably requires the previous element to calculate the next one and is therefore a bad starting point for parallel processing. Second, you are using an infinite stream. Efficient workload splitting requires at least an estimate of the number of elements to process.
Since you are processing ascending integer numbers, there is an obvious limit when reaching Integer.MAX_VALUE, but the stream implementation doesn’t know that you are actually processing ascending numbers, hence it will treat your formally infinite stream as truly infinite.
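A small sketch illustrating the second point: the spliterator of an iterate-based stream reports an unknown size, whereas a range reports its exact size, which is what enables balanced splitting.
// Long.MAX_VALUE is the convention for "unknown/infinite"
System.out.println(IntStream.iterate(3, i -> i + 2).spliterator().estimateSize());
// the exact element count is known up front
System.out.println(IntStream.rangeClosed(1, 1_000_000).spliterator().estimateSize());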
A solution fixing both issues is:
public static IntStream stream() {
return IntStream.rangeClosed(1, Integer.MAX_VALUE/2).parallel()
.map(i -> i*2+1)
.filter(i -> i % 3 != 0).mapToObj(BigInteger::valueOf)
.filter(i -> i.isProbablePrime(1))
.mapToInt(BigInteger::intValue);
}
but it must be emphasized that in this form, the solution is only useful if you truly want to process all or most of the prime numbers in the full integer range. As soon as you apply skip or limit to the stream, the parallel performance will drop significantly, as noted in the documentation of these methods. And using filter with a predicate that only accepts values in a smaller numeric range implies a lot of unnecessary work that would be better not done at all than done in parallel.
You could solve this by adapting the method to receive a value range as a parameter, so the range of the source IntStream matches the values you actually need.
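One possible adaptation might look like this (a sketch; like the original stream, it never emits the primes 2 and 3):
public static IntStream stream(int from, int to) {
    // map index i to the odd number 2*i+1, covering the odd values in [from, to]
    return IntStream.rangeClosed(Math.max(1, from / 2), (to - 1) / 2).parallel()
        .map(i -> i * 2 + 1)
        .filter(i -> i % 3 != 0)
        .mapToObj(BigInteger::valueOf)
        .filter(i -> i.isProbablePrime(1))
        .mapToInt(BigInteger::intValue);
}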
This is the time to emphasize the importance of algorithms over parallel processing. Consider the Sieve of Eratosthenes. The following implementation
public static IntStream primes(int max) {
    // bit i of the BitSet represents the odd number 2*i+1
    BitSet prime = new BitSet(max >> 1);
    prime.set(1, max >> 1);
    for (int i = 3; i < max; i += 2)
        if (prime.get((i - 1) >> 1))
            // b > 0 guards against int overflow near Integer.MAX_VALUE
            for (int b = i * 3; b > 0 && b < max; b += i * 2)
                prime.clear((b - 1) >> 1);
    return IntStream.concat(IntStream.of(2), prime.stream().map(i -> i + i + 1));
}
turned out to be faster by an order of magnitude than the other approaches, despite not using parallel processing, even with Integer.MAX_VALUE as the upper bound. (Measured using a terminal operation of .reduce((a, b) -> b) instead of toArray or forEach(System.out::println), to ensure complete processing of all values without adding storage or printing costs.)
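As a quick sanity check of the method above (a sketch; requires java.util.Arrays):
// prints the 25 primes below 100: [2, 3, 5, 7, ..., 97]
System.out.println(Arrays.toString(primes(100).toArray()));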
The takeaway is that isProbablePrime is great when you have a particular candidate or want to process a small range of numbers (or when the numbers are way outside the int or even long range)¹, but for producing a large ascending sequence of prime numbers there are better approaches, and parallel processing is not the ultimate answer to performance questions.
¹ Consider, e.g.:
Stream.iterate(new BigInteger("1000000000000"), BigInteger::nextProbablePrime)
    .filter(b -> b.isProbablePrime(1))
    .limit(10)
    .forEach(System.out::println);

It seems that I can roughly halve the runtime of what you have in place with a few modifications. The key change is unordered(): releasing the parallel stream from maintaining encounter order reduces coordination overhead, and mapToInt replaces the flatMapToInt detour:
return IntStream.iterate(3, i -> i + 2)
    .parallel()
    .unordered()
    .filter(i -> i % 3 != 0)
    .mapToObj(BigInteger::valueOf)
    .filter(i -> i.isProbablePrime(1))
    .mapToInt(BigInteger::intValue);
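A rough way to compare such variants (a sketch; note the caveat above that limit constrains parallel performance, and prefer a harness like JMH for reliable numbers):
long start = System.nanoTime();
int count = Primes.stream().limit(1_000_000).toArray().length;
System.out.println(count + " primes in " + (System.nanoTime() - start) / 1_000_000 + " ms");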

Related

Are Java 8 stream filters on a large list more time-consuming than a for loop?

I was replacing some of my legacy code, moving from a simple for loop over a list to Java 8 streams and filters.
My for loop looks like this:
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
for (int i : numbers) {
    if ((i > 4) && (i % 2 == 0)) {
        System.out.println("First even number in list more than 4 : " + i);
        break;
    }
}
Here the loop performs 6 iterations; when 6 is reached, it is printed and the loop exits.
Now I am replacing it with the following:
numbers.stream()
.filter(e -> e > 4)
.filter(e -> e % 2 == 0)
.findFirst()
I am confused: since we apply filter twice on the list, is the time complexity higher than in the for loop case, or is there something I am missing?
The answer to your question is no. A stream with multiple filter, map, or other intermediate operations is not necessarily slower than an equivalent for loop.
The reason for this is that all intermediate Stream operations are lazy, meaning that they are not actually applied individually at the time the filter method is called, but are instead all evaluated at once during the terminal operation (findFirst() in this case). From the documentation:
Stream operations are divided into intermediate and terminal operations, and are combined to form stream pipelines. A stream pipeline consists of a source (such as a Collection, an array, a generator function, or an I/O channel); followed by zero or more intermediate operations such as Stream.filter or Stream.map; and a terminal operation such as Stream.forEach or Stream.reduce.
Intermediate operations return a new stream. They are always lazy; executing an intermediate operation such as filter() does not actually perform any filtering, but instead creates a new stream that, when traversed, contains the elements of the initial stream that match the given predicate. Traversal of the pipeline source does not begin until the terminal operation of the pipeline is executed.
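A minimal sketch that makes the laziness and the short-circuiting visible (the peek call is only for illustration):
List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
int first = numbers.stream()
    .peek(e -> System.out.println("inspecting " + e))
    .filter(e -> e > 4)
    .filter(e -> e % 2 == 0)
    .findFirst()
    .orElse(-1);
// prints "inspecting 1" through "inspecting 6" and then stops;
// elements 7..10 are never examined
System.out.println("result: " + first);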
In practice, since your two pieces of code do not compile to exactly the same bytecode, there will likely be some minor performance difference between them, but logically they do very similar things and will perform very similarly.

Will using a parallel stream on a single-core processor be slower than using a sequential stream?

I am applying an operation to every element in a very large LinkedList<LinkedList<Double>>:
list.stream()
    .map(l -> l.stream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
On my computer (quad-core), parallel streams seem to be faster than using sequential streams:
list.parallelStream()
    .map(l -> l.parallelStream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
However, not every computer is going to be multi-core. My question is, will using parallel streams on a single-processor computer be noticeably slower than using sequential streams?
This is highly implementation specific, but usually a parallel stream goes through a different code path for most operations, which implies additional work; at the same time, the thread pool is configured to match the number of CPU cores.
E.g., if you run the following program
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "1");
System.out.println("Parallelism: " + ForkJoinPool.getCommonPoolParallelism());
Set<Thread> threads = ConcurrentHashMap.newKeySet();
for (int run = 0; run < 2; run++) {
    IntStream stream = IntStream.range(0, 100);
    if (run == 1) {
        stream = stream.parallel();
        System.out.println("Parallel:");
    }
    // each chunk starts with a container holding the count 1;
    // the combiner sums the chunk counts
    int chunks = stream
        .mapToObj(i -> Thread.currentThread())
        .collect(() -> new int[]{1}, (a, t) -> threads.add(t), (a, b) -> a[0] += b[0])[0];
    System.out.println("processed " + chunks + " chunk(s) with " + threads.size() + " thread(s)");
}
it will print something like
Parallelism: 1
processed 1 chunk(s) with 1 thread(s)
Parallel:
processed 4 chunk(s) with 1 thread(s)
You can see the effect of splitting the workload (splitting into four times the configured parallelism is not a coincidence), but also that only one thread is involved, so there is no inter-thread communication happening here. Whether the JVM’s optimizer will detect the single-threaded nature of this operation and elide synchronization costs is, like everything else here, an implementation detail.
All in all, the overhead is not very big and doesn’t scale with the actual amount of work, so if the actual work is big enough to benefit from parallel processing on SMP machines, the fraction of the overhead will be negligible on single core machines.
But if you care about performance, you should also look at the other aspects of your code.
By repeating an operation like Collections.max(l) for every element of l, you combine two linear operations into one of quadratic time complexity. It’s easy to perform this operation only once instead:
List<List<Double>> result =
    list.parallelStream()
        .map(l -> {
            double limit = Collections.max(l) - 5;
            return l.parallelStream()
                    .filter(d -> limit < d)
                    .collect(Collectors.toCollection(LinkedList::new));
        })
        .collect(Collectors.toCollection(LinkedList::new));
Depending on the list sizes, the impact of this little change, turning a quadratic operation into a linear one, might be far bigger than dividing the processing time by the number of CPU cores, which is the best case of parallel processing.
The other consideration is whether you really need a LinkedList. For most practical purposes, a LinkedList performs worse than, e.g., an ArrayList, and if you don’t need mutability, you can just use the toList() collector and let the JRE return the best list it can offer…
List<List<Double>> result =
    list.parallelStream()
        .map(l -> {
            double limit = Collections.max(l) - 5;
            return l.parallelStream()
                    .filter(d -> limit < d)
                    .collect(Collectors.toList());
        })
        .collect(Collectors.toList());
Keep in mind that after changing the performance characteristics, it is recommended to recheck whether the parallelization still has any benefit. You should also check both stream operations individually: usually, if the outer stream parallelizes well, making the inner stream parallel too does not improve the overall performance.
Also, the benefit of parallel streams will be much higher if the source lists are random access lists instead of LinkedLists.
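For example, one might copy into random access lists up front (a sketch; assumes the one-time O(n) copy is amortized by the subsequent processing):
// afterwards, splitting for parallel streams is cheap
List<List<Double>> randomAccess = list.stream()
    .<List<Double>>map(ArrayList::new)
    .collect(Collectors.toList());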
Soon we won't have single-core CPUs any more, but if you are curious how threading works on a single, non-hyper-threaded core, see this answer:
Why is threading works on a single core CPU?
So, to answer your question: the runtime will most likely be better with sequential processing, since it avoids thread starting, scheduling, and synchronization.
I ran three benchmarks: one testing Holger's suggested optimizations, one comparing parallel and sequential streams on my quad-core computer (Asus FX550IU-WSFX) without the optimizations, and one comparing parallel and sequential streams on a single-core computer (Dell Optiplex 170L), also without the optimizations. The lists in every test contained 1.25 million elements.
Benchmarking code:
long average = 0;
for (int i = 0; i < 100; i++) {
    long start = System.nanoTime();
    // testing code...
    average += (System.nanoTime() - start);
}
System.out.println((average / 100) / 1000000 + "ms average");
Testing Optimizations (on a 4-core processor)
Un-optimized code:
List<List<Double>> result = list.parallelStream()
    .map(l -> l.parallelStream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
Optimized code:
List<List<Double>> result =
    list.parallelStream()
        .map(l -> {
            double limit = Collections.max(l) - 5;
            return l.parallelStream()
                    .filter(d -> limit < d)
                    .collect(Collectors.toList());
        })
        .collect(Collectors.toList());
Times:
Using the un-optimized code, the average execution time was 633ms, while using the optimized code, the average execution time was 25ms.
Testing Un-optimized Code on 4-core Processor
Sequential code:
List<List<Double>> result = list.stream()
    .map(l -> l.stream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
Parallel code:
List<List<Double>> result = list.parallelStream()
    .map(l -> l.parallelStream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
Times:
Using the sequential code, the average execution time was 879ms, while using the parallel code, it was 539ms.
Testing Un-optimized Code on 1-core Processor
Sequential code:
List<List<Double>> result = list.stream()
    .map(l -> l.stream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
Parallel code:
List<List<Double>> result = list.parallelStream()
    .map(l -> l.parallelStream()
                .filter(d -> (Collections.max(l) - d) < 5)
                .collect(Collectors.toCollection(LinkedList::new)))
    .collect(Collectors.toCollection(LinkedList::new));
Times:
Using the sequential code, the average execution time was 2398ms, while using the parallel code, it was 3942ms.
Conclusion
While parallel streams are indeed slower than sequential streams on a single-core processor (and faster on the four-core one), optimizing the code resulted in by far the fastest execution times.

Create collection of N identical elements from one element using Java 8 streams [duplicate]

In many other languages, e.g. Haskell, it is easy to repeat a value or function multiple times, e.g. to get a list of 8 copies of the value 1:
take 8 (repeat 1)
but I haven't found this yet in Java 8. Is there such a function in Java 8's JDK?
Or alternatively something equivalent to a range like
[1..8]
It would seem an obvious replacement for a verbose statement in Java like
for (int i = 1; i <= 8; i++) {
    System.out.println(i);
}
to have something like
Range.from(1, 8).forEach(i -> System.out.println(i))
though this particular example doesn't look much more concise actually... but hopefully it's more readable.
For this specific example, you could do:
IntStream.rangeClosed(1, 8)
.forEach(System.out::println);
If you need a step different from 1, you can use a mapping function, for example, for a step of 2:
IntStream.rangeClosed(1, 8)
.map(i -> 2 * i - 1)
.forEach(System.out::println);
Or build a custom iteration and limit the size of the iteration:
IntStream.iterate(1, i -> i + 2)
.limit(8)
.forEach(System.out::println);
Here's another technique I ran across the other day:
Collections.nCopies(8, 1)
.stream()
.forEach(i -> System.out.println(i));
The Collections.nCopies call creates a List containing n copies of whatever value you provide. In this case it's the boxed Integer value 1. Of course it doesn't actually create a list with n elements; it creates a "virtualized" list that contains only the value and the length, and any call to get within range just returns the value. The nCopies method has been around since the Collections Framework was introduced way back in JDK 1.2. Of course, the ability to create a stream from its result was added in Java SE 8.
Big deal, another way to do the same thing in about the same number of lines.
However, this technique is faster than the IntStream.generate and IntStream.iterate approaches, and surprisingly, it's also faster than the IntStream.range approach.
For iterate and generate the result is perhaps not too surprising. The streams framework (really, the Spliterators for these streams) is built on the assumption that the lambdas will potentially generate different values each time, and that they will generate an unbounded number of results. This makes parallel splitting particularly difficult. The iterate method is also problematic for this case because each call requires the result of the previous one. So the streams using generate and iterate don't do very well for generating repeated constants.
The relatively poor performance of range is surprising. This too is virtualized, so the elements don't actually all exist in memory, and the size is known up front. This should make for a fast and easily parallelizable spliterator. But it surprisingly didn't do very well. Perhaps the reason is that range has to compute a value for each element of the range and then call a function on it. But this function just ignores its input and returns a constant, so I'm surprised this isn't inlined and killed.
The Collections.nCopies technique has to do boxing/unboxing in order to handle the values, since there are no primitive specializations of List. Since the value is the same every time, it's basically boxed once and that box is shared by all n copies. I suspect boxing/unboxing is highly optimized, even intrinsified, and it can be inlined well.
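Before the benchmark, a small sketch of that sharing (nCopies stores the element once and returns the same reference from every get):
List<Integer> copies = Collections.nCopies(3, 1000);
// the same boxed Integer instance is returned for every index
System.out.println(copies.get(0) == copies.get(2)); // true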
Here's the code:
public static final int LIMIT = 500_000_000;
public static final long VALUE = 3L;

public long range() {
    return LongStream.range(0, LIMIT)
        .parallel()
        .map(i -> VALUE)
        .map(i -> i % 73 % 13)
        .sum();
}

public long ncopies() {
    return Collections.nCopies(LIMIT, VALUE)
        .parallelStream()
        .mapToLong(i -> i)
        .map(i -> i % 73 % 13)
        .sum();
}
And here are the JMH results: (2.8GHz Core2Duo)
Benchmark                  Mode    Samples   Mean    Mean error   Units
c.s.q.SO18532488.ncopies   thrpt         5   7.547        2.904   ops/s
c.s.q.SO18532488.range     thrpt         5   0.317        0.064   ops/s
There is a fair amount of variance in the ncopies version, but overall it seems comfortably 20x faster than the range version. (I'd be quite willing to believe that I've done something wrong, though.)
I'm surprised at how well the nCopies technique works. Internally it doesn't do very much special, with the stream of the virtualized list simply being implemented using IntStream.range! I had expected that it would be necessary to create a specialized spliterator to get this to go fast, but it already seems to be pretty good.
For completeness, and also because I couldn't help myself :)
Generating a limited sequence of constants is fairly close to what you would see in Haskell, only with Java level verboseness.
IntStream.generate(() -> 1)
.limit(8)
.forEach(System.out::println);
Once a repeat function is defined somewhere as
public static BiConsumer<Integer, Runnable> repeat = (n, f) -> {
    for (int i = 1; i <= n; i++)
        f.run();
};
you can use it this way, e.g.:
repeat.accept(8, () -> System.out.println("Yes"));
To get an equivalent of Haskell's
take 8 (repeat 1)
You could write
StringBuilder s = new StringBuilder();
repeat.accept(8, () -> s.append("1"));
Another alternative is to use the Stream.generate() method. For example, the snippet below will create a list with 5 instances of MyClass:
List<MyClass> instances = Stream
    .generate(MyClass::createInstance)
    .limit(5)
    .collect(Collectors.toList());
From the Javadoc:
generate(Supplier s)
Returns an infinite sequential unordered stream where each element is generated by the provided Supplier.
This is my solution to implementing a times function. I'm a junior, so I admit it may not be ideal; I'd be glad to hear if this is a bad idea for whatever reason.
public static <T, R extends Void> R times(int count, Function<T, R> f, T t) {
    while (count > 0) {
        f.apply(t);
        count--;
    }
    return null;
}
Here's some example usage:
Function<String, Void> greet = greeting -> {
    System.out.println(greeting);
    return null;
};
times(3, greet, "Hello World!");
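As a side note, a variant based on Consumer avoids the artificial Void return type entirely (a sketch, not from the original answer):
// requires java.util.function.Consumer
public static <T> void times(int count, Consumer<T> f, T t) {
    for (int i = 0; i < count; i++)
        f.accept(t);
}
// usage: times(3, System.out::println, "Hello World!");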

Find first index of matching character from two strings using parallel streams

I'm trying to figure out whether it is possible to find the first index, within one string, of a character that also occurs in another string. For example:
String test = "test";
String other = "123er";
int value = get(test, other);
// method would return 1, as 'e', the first matching character
// in 123er, is at index 1 of test
So I'm trying to accomplish this using parallel streams. I know I can check fairly simply whether there is any matching character at all, like this:
test.chars().parallel().anyMatch(c -> other.indexOf(c) >= 0);
How would I use this to find the exact index?
If you really care about performance, you should try to avoid the O(n × m) time complexity of iterating over one string for every character of the other. Instead, first iterate over one string to build a data structure supporting efficient (O(1)) lookup, then iterate over the other using it:
BitSet encountered = new BitSet();
test.chars().forEach(encountered::set);
int index = IntStream.range(0, other.length())
.filter(ix->encountered.get(other.charAt(ix)))
.findFirst().orElse(-1);
If the strings are sufficiently large, the O(n + m) time complexity of this solution will turn into much shorter execution times. For smaller strings, it’s irrelevant anyway.
If you really think the strings are large enough to benefit from parallel processing (which is very unlikely), you can perform both operations in parallel with small adaptations:
BitSet encountered = CharBuffer.wrap(test).chars().parallel()
.collect(BitSet::new, BitSet::set, BitSet::or);
int index = IntStream.range(0, other.length()).parallel()
.filter(ix -> encountered.get(other.charAt(ix)))
.findFirst().orElse(-1);
The first operation now uses the slightly more complicated, parallel-compatible collect, and it contains a not-so-obvious change to the Stream creation.
The problem is described in bug report JDK-8071477. Simply put, the stream returned by String.chars() has poor splitting capability, hence poor parallel performance. The code above wraps the string in a CharBuffer, whose chars() method returns a different implementation with the same semantics but good parallel performance. This work-around should become obsolete with Java 9.
Alternatively, you could use IntStream.range(0, test.length()).map(test::charAt) to create a stream with good parallel performance. The second operation already works that way.
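In code, that alternative for building the BitSet might look like this (a sketch with the same semantics as the CharBuffer variant above):
BitSet encountered = IntStream.range(0, test.length()).parallel()
    .map(test::charAt)
    .collect(BitSet::new, BitSet::set, BitSet::or);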
But, as said, for this specific task it’s rather unlikely that you ever encounter strings large enough to make parallel processing beneficial.
You can do it by relying on String#indexOf(int ch), keeping only the values >= 0 to drop characters that do not occur, then taking the first match.
// Get the index of each characters of test in other
// Keep only the positive values
// Then return the first match
// Or -1 if we have no match
int result = test.chars()
.parallel()
.map(other::indexOf)
.filter(i -> i >= 0)
.findFirst()
.orElse(-1);
System.out.println(result);
Output:
1
NB 1: The result is 1, not 2, because indexes start from 0, not 1.
NB 2: Unless you have very, very long Strings, using a parallel stream here should not help much in terms of performance, because the tasks are not complex and creating, starting, and synchronizing threads has a very high cost; you will probably get your result much more slowly than with a normal stream.
Building on Nicolas' answer: the min() method forces consumption of the whole stream. In such cases, it's better to use findFirst(), which stops processing as soon as the first matching element is found, instead of computing the minimum of all matches:
test.chars().parallel()
.map(other::indexOf)
.filter(i -> i >= 0)
.findFirst()
.ifPresent(System.out::println);

Advantages of parallelStream in Java SE8 [duplicate]

I am just reading through some Java 8 code, and from what I understand you can iterate over collections using stream() or parallelStream(). The latter has the advantage of using concurrency to split the task up across modern multicore processors and speed up the iteration (although it does not guarantee the order of the results).
The example code from the Oracle Java tutorial:
double average = roster
    .stream() /* non-parallel */
    .filter(p -> p.getGender() == Person.Sex.MALE)
    .mapToInt(Person::getAge)
    .average()
    .getAsDouble();

double average = roster
    .parallelStream()
    .filter(p -> p.getGender() == Person.Sex.MALE)
    .mapToInt(Person::getAge)
    .average()
    .getAsDouble();
If I had a collection and did not care about the order in which it was processed (i.e. the elements all have unique IDs, or are unordered anyway, or in a presorted state), would it make sense to always use parallelStream() when iterating over a collection?
Other than when my code runs on a single-core machine (in which case I assume the JVM would allocate all the work to the single core, thus not breaking my program), are there any drawbacks to using parallelStream() everywhere?
If you listen to people from Oracle talking about the design choices behind Java 8, you will often hear that parallelism was the main motivation. Parallelization was the main driving force behind lambdas, the stream API, and more.
Let's take a look at an example of stream API.
// assumes: import static java.util.stream.LongStream.*;
//          import static java.lang.Math.sqrt;
private long countPrimes(int max) {
    return range(1, max).parallel().filter(this::isPrime).count();
}

private boolean isPrime(long n) {
    return n > 1 && rangeClosed(2, (long) sqrt(n)).noneMatch(divisor -> n % divisor == 0);
}
Here we have a countPrimes method that counts the prime numbers between 1 and max. A stream of numbers is created by the range method. The stream is then switched to parallel mode; numbers that are not primes are filtered out, and the remaining numbers are counted.
You can see that the stream API allows us to describe the problem in a neat and compact way. Moreover, parallelization is just a matter of calling the parallel() method. When we do that, the stream is split into multiple chunks, each chunk is processed independently, and the results are summarized at the end. Since our implementation of the isPrime method is extremely inefficient and CPU-intensive, we can take advantage of parallelization and utilize all available CPU cores.
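For instance, a quick way to try it (a sketch; timings depend on the machine):
long t0 = System.nanoTime();
long count = countPrimes(10_000_000); // 664,579 primes below ten million
System.out.println(count + " primes in " + (System.nanoTime() - t0) / 1_000_000 + " ms");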
References:
http://java.dzone.com/articles/think-twice-using-java-8
