Why does stream parallel() not use all available threads?

Why does stream parallel() not use all available threads? - java

I tried to run 100 Sleep tasks in parallel using Java8(1.8.0_172) stream.parallel() submitted inside a custom ForkJoinPool with 100+ threads available. Each task would sleep for 1s. I expected the whole work would finish after ~1s, given the 100 sleeps could be done in parallel. However I observe a runtime of 7s.
#Test
public void testParallelStream() throws Exception {
final int REQUESTS = 100;
ForkJoinPool forkJoinPool = null;
try {
// new ForkJoinPool(256): same results for all tried values of REQUESTS
forkJoinPool = new ForkJoinPool(REQUESTS);
forkJoinPool.submit(() -> {
IntStream stream = IntStream.range(0, REQUESTS);
final List<String> result = stream.parallel().mapToObj(i -> {
try {
System.out.println("request " + i);
Thread.sleep(1000);
return Integer.toString(i);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}).collect(Collectors.toList());
// assertThat(result).hasSize(REQUESTS);
}).join();
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown();
}
}
}
With output indicating ~16 stream elements are executed before a pause of 1s, then another ~16 and so on. So it seems even though the forkjoinpool was created with 100 threads, only ~16 get used.
This pattern emerges as soon as I use more than 23 threads:
1-23 threads: ~1s
24-35 threads: ~2s
36-48 threads: ~3s
...
System.out.println(Runtime.getRuntime().availableProcessors());
// Output: 4

Since the Stream implementation’s use of the Fork/Join pool is an implementation detail, the trick to force it to use a different Fork/Join pool is undocumented as well and seems to work by accident, i.e. there’s a hardcoded constant determining the actual parallelism, depending on the default pool’s parallelism. So using a different pool was not foreseen, originally.
However, it has been recognized that using a different pool with an inappropriate target parallelism is a bug, even if this trick is not documented, see JDK-8190974.
It has been fixed in Java 10 and backported to Java 8, update 222.
So a simple solution world be updating the Java version.
You may also change the default pool’s parallelism, e.g.
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "100");
before doing any Fork/Join activity.
But this may have unintended effects on other parallel operations.

As you wrote it, you let the stream decide the parallelism of the executions.
There you have the effect that ArrayList.parallelStream tries to outsmart you by splitting the data up evenly, without taking the number of available threads into account. This is good for CPU-Bound operations, where it's not usefull to have more threads than CPU Cores, but is not made for processes that need to wait for IO.
Why not force-feed all your items sequentially to the ForkJoinPool, so it's forced to use all available threads?
IntStream stream = IntStream.range(0, REQUESTS);
List<ForkJoinTask<String>> results
= stream.mapToObj(i -> forkJoinPool.submit(() -> {
try {
System.out.println("request " + i);
Thread.sleep(1000);
return Integer.toString(i);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
})).collect(Collectors.toList());
results.forEach(ForkJoinTask::join);
This takes less than two seconds on my machine.

Related

How to check when all CompleteableFuture are done?

I have a Stream<Item> which I'm mapping to a CompleteableFuture<ItemResult>
What I'd like to do is to know when all the futures are completed.
One may suggest to:
collect all the futures to an array and use CompleteableFuture.allOf(). This is somewhat problematic since there could be hundreds of thousands of items
just continue with forEach(CompleteableFuture::join). This is problematic too as calling forEach with join will just block the stream and it will be essentially a serial processing and not concurrent
Inject a poisoned item in the end of the stream. This could work but it's not that elegant in my view
check if the executor queue is empty - This is quite limiting because I might use more than one executor in the future. Also, the queue can be momentarily empty
Monitor the database instead and check the number of new items
I feel like all the suggested solutions aren't good enough.
What is the appropriate way to monitor the futures?
Thanks
EDIT:
another (vague) idea I had in mind is to use a counter and wait for it to go down to zero. But again, need to check that it's not a momentarily 0..

Disclaimer: I'm not sure whether Phaser is the right tool here, and if yes, whether it's better to have one root with multiple children or to chain them like I'm proposing below, so feel free to correct me.
Here's one approach that uses Phaser.
A Phaser has a limited number of parties, so we need to create a new child Phaser if that limit is about to get reached:
private Phaser register(Phaser phaser) {
if (phaser.getRegisteredParties() < 65534) {
// warning: side-effect,
// conflicts with AtomicReference#updateAndGet recommendation,
// might not fit well if the Stream is parallel:
phaser.register();
return phaser;
} else {
return new Phaser(phaser, 1);
}
}
Register each CompletableFuture against that Phaser chain, and deregister once done:
private void register(CompletableFuture<?> future, AtomicReference<Phaser> phaser) {
Phaser registeredPhaser = phaser.updateAndGet(this::register);
future
.thenRun(registeredPhaser::arriveAndDeregister)
.exceptionally(e -> {
// log e?
registeredPhaser.arriveAndDeregister();
return null;
});
}
Wait for all futures to be finished:
private <T> void await(Stream<CompletableFuture<T>> futures) {
Phaser rootPhaser = new Phaser(1);
AtomicReference<Phaser> phaser = new AtomicReference<>(rootPhaser);
futures.forEach(future -> register(future, phaser));
rootPhaser.arriveAndAwaitAdvance();
rootPhaser.arriveAndDeregister();
}
Example:
ExecutorService executor = Executors.newFixedThreadPool(500);
// creating fake stream with 500,000 futures:
Stream<CompletableFuture<Integer>> stream = IntStream
.rangeClosed(1, 500_000)
.mapToObj(i -> CompletableFuture.supplyAsync(() -> {
try {
TimeUnit.MILLISECONDS.sleep(10);
if (i % 50_000 == 0) {
System.out.println(Thread.currentThread().getName() + ": " + i);
}
return i;
} catch (InterruptedException e) {
throw new IllegalStateException(e);
}
}, executor));
// usage:
await(stream);
System.out.println("Done");
Outputs:
pool-1-thread-348: 50000
pool-1-thread-395: 100000
pool-1-thread-333: 150000
pool-1-thread-30: 200000
pool-1-thread-120: 250000
pool-1-thread-10: 300000
pool-1-thread-241: 350000
pool-1-thread-340: 400000
pool-1-thread-283: 450000
pool-1-thread-176: 500000
Done

How to avoid context switching in Java ExecutorService

I use a software (AnyLogic) to export runnable jar files that themselves repeated re-run a set of simulations with different parameters (so-called parameter variation experiments). The simulations I'm running have very RAM intensive, so I have to limit the number of cores available to the jar file. In AnyLogic, the number of available cores is easily set, but from the Linux command line on the servers, the only way I know how to do this is by using the taskset command to just manually specify the available cores to use (using a CPU affinity "mask"). This has worked very well so far, but since you have to specify individual cores to use, I'm learning that there can be pretty substantial differences in performance depending on which cores you select. For example, you would want to maximize the use of CPU cache levels, so if you choose cores that share too much cache, you'll get much slower performance.
Since AnyLogic is written in Java, I can use Java code to specify the running of simulations. I'm looking at using the Java ExecutorService to build a pool of individual runs such that I can just specify the size of the pool to be whatever number of cores would match the RAM of the machine I'm using. I'm thinking that this would offer a number of benefits, most importantly perhaps the computer's scehduler can do a better job of selecting the cores to minimize runtime.
In my tests, I built a small AnyLogic model that take about 10 seconds to run (it just switches between 2 statechart states repeatedly). Then I created a custom experiment with this simple code.
ExecutorService service = Executors.newFixedThreadPool(2);
for (int i=0; i<10; i++)
{
Simulation experiment = new Simulation();
experiment.variable = i;
service.execute( () -> experiment.run() );
}
What I would hope to see is that only 2 Simulation objects start up at a time, since that's the size of the thread pool. But I see all 10 start up and running in parallel over the 2 threads. This makes me think that context switching is happening, which I assume is pretty inefficient.
When, instead of calling the AnyLogic Simulation, I just call a custom Java class (below) in the service.execute function, it seems to work fine, showing only 2 Tasks running at a time.
public class Task implements Runnable, Serializable {
public void run() {
traceln("Starting task on thread " + Thread.currentThread().getName());
try {
TimeUnit.SECONDS.sleep(5);
} catch (InterruptedException e) {
e.printStackTrace();
}
traceln("Ending task on thread " + Thread.currentThread().getName());
}
}
Does anyone know why the AnyLogic function seems to be setting up all the simulations at once?

I'm guessing Simulation extends from ExperimentParamVariation. The key to achieve what you want would be to determine when the experiment has ended.
The documentation shows some interesting methods like getProgress() and getState(), but you would have to poll those methods until the progress is 1 or the state is FINISHED or ERROR. There are also the methods onAfterExperiment() and onError() that should be called by the engine to indicate that the experiment has ended or there was an error. I think you could use these last two methods with a Semaphore to control how many experiments run at once:
import java.util.concurrent.Semaphore;
import com.anylogic.engine.ExperimentParamVariation;
public class Simulation extends ExperimentParamVariation</* Agent */> {
private final Semaphore semaphore;
public Simulation(Semaphore semaphore) {
this.semaphore = semaphore;
}
public void onAfterExperiment() {
this.semaphore.release();
super.onAfterExperiment();
}
public void onError(Throwable error) {
this.semaphore.release();
super.onError(error);
}
// run() cannot be overriden because it is final
// You could create another run method or acquire a permit from the semaphore elsewhere
public void runWithSemaphore() throws InterruptedException {
// This acquire() will block until a permit is available or the thread is interrupted
this.semaphore.acquire();
this.run();
}
}
Then you will have to configure a semaphore with the desired number of permits an pass it to the Simulation instances:
import java.util.concurrent.Semaphore;
// ...
Semaphore semaphore = new Semaphore(2);
for (int i = 0; i < 10; i++)
{
Simulation experiment = new Simulation(semaphore);
// ...
// Handle the InterruptedException thrown here
experiment.runWithSemaphore();
/* Alternative to runWithSemaphore(): acquire the permit and call run().
semaphore.acquire();
experiment.run();
*/
}

Firstly, this whole question has been nullified by what I think is a relatively new addition to AnyLogic's functionality. You can specify an ini file with a specified number of "parallel workers".
https://help.anylogic.com/index.jsp?topic=%2Fcom.anylogic.help%2Fhtml%2Frunning%2Fexport-java-application.html&cp=0_3_9&anchor=customize-settings
But I had managed to find a workable solution just before finding this (better) option. Hernan's answer was almost enough. I think it was hampered by some vagaries of AnyLogic's engine (as I detailed in a comment).
The best version I could muster myself was using ExecuterService. In a Custom Experiment, I put this code:
ExecutorService service = Executors.newFixedThreadPool(2);
List<Callable<Integer>> tasks = new ArrayList<>();
for (int i=0; i<10; i++)
{
int t = i;
tasks.add( () -> simulate(t) );
}
try{
traceln("starting setting up service");
List<Future<Integer>> futureResults = service.invokeAll(tasks);
traceln("finished setting up service");
List<Integer> res = futureResults.stream().parallel().map(
f -> {
try {
return f.get();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
return null;
}).collect(Collectors.toList());
System.out.println("----- Future Results are ready -------");
System.out.println("----- Finished -------");
} catch (InterruptedException e) {
e.printStackTrace();
}
service.shutdown();
The key here was using the Java Future. Also, to use the invokeAll function, I created a function in the Additional class code block:
public int simulate(int variable){
// Create Engine, initialize random number generator:
Engine engine = createEngine();
// Set stop time
engine.setStopTime( 100000 );
// Create new root object:
Main root = new Main( engine, null, null );
root.parameter = variable;
// Prepare Engine for simulation:
engine.start( root );
// Start simulation in fast mode:
//traceln("attempting to acquire 1 permit on run "+variable);
//s.acquireUninterruptibly(1);
traceln("starting run "+variable);
engine.runFast();
traceln("ending run "+variable);
//s.release();
// Destroy the model:
engine.stop();
traceln( "Finished, run "+variable);
return 1;
}
The only limitation I could see to this approach is that I don't have a waiting-while loop to output progress every few minutes. But instead of finding a solution to that, I must abandon this work for the much better settings file solution in the link up top.

Parallellize a for loop in Java using multi-threading

I am very new to java and I want to parallelize a nested for loop using executor service or using any other method in java. I want to create some fixed number of threads so that CPU is not completely acquired by threads.
for(SellerNames sellerNames : sellerDataList) {
for(String selleName : sellerNames) {
//getSellerAddress(sellerName)
//parallize this task
}
}
size of sellerDataList = 1000 and size of sellerNames = 5000.
Now I want to create 10 threads and assign equal chunk of task to each thread equally. That is for i'th sellerDataList, first thread should get address for 500 names, second thread should get address for next 500 names and so on.
What is the best way to do this job?

There are two ways to make it run parallelly: Streams and Executors.
Using streams
You can use parallel streams and leave the rest to the jvm. In this case you don't have too much control over what happens when. On the other hand your code will be easy to read and maintain:
sellerDataList.stream().forEach(sellerNames -> {
Stream<String> stream = StreamSupport.stream(sellerNames.spliterator(), true); // true means use parallel stream
stream.forEach(sellerName -> {
getSellerAddress(sellerName);
});
});
Using an ExecutorService
Suppose, you want 5 Threads and you want to be able to wait until task completion. Then you can use a fixed thread pool with 5 threads and use Future-s so you can wait until they are done.
final ExecutorService executor = Executors.newFixedThreadPool(5); // it's just an arbitrary number
final List<Future<?>> futures = new ArrayList<>();
for (SellerNames sellerNames : sellerDataList) {
for (final String sellerName : sellerNames) {
Future<?> future = executor.submit(() -> {
getSellerAddress(sellerName);
});
futures.add(future);
}
}
try {
for (Future<?> future : futures) {
future.get(); // do anything you need, e.g. isDone(), ...
}
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
}

If you are using a parallel stream you can still control the thread by creating your own ForkJoinPool.
List<Long> aList = LongStream.rangeClosed(firstNum, lastNum).boxed()
.collect(Collectors.toList());
ForkJoinPool customThreadPool = new ForkJoinPool(4);
long actualTotal = customThreadPool.submit(
() -> aList.parallelStream().reduce(0L, Long::sum)).get();
Here on this site, it is described very well.
https://www.baeldung.com/java-8-parallel-streams-custom-threadpool

Custom thread pool in Java 8 parallel stream

Is it possible to specify a custom thread pool for Java 8 parallel stream? I can not find it anywhere.
Imagine that I have a server application and I would like to use parallel streams. But the application is large and multi-threaded so I want to compartmentalize it. I do not want a slow running task in one module of the applicationblock tasks from another module.
If I can not use different thread pools for different modules, it means I can not safely use parallel streams in most of real world situations.
Try the following example. There are some CPU intensive tasks executed in separate threads.
The tasks leverage parallel streams. The first task is broken, so each step takes 1 second (simulated by thread sleep). The issue is that other threads get stuck and wait for the broken task to finish. This is contrived example, but imagine a servlet app and someone submitting a long running task to the shared fork join pool.
public class ParallelTest {
public static void main(String[] args) throws InterruptedException {
ExecutorService es = Executors.newCachedThreadPool();
es.execute(() -> runTask(1000)); //incorrect task
es.execute(() -> runTask(0));
es.execute(() -> runTask(0));
es.execute(() -> runTask(0));
es.execute(() -> runTask(0));
es.execute(() -> runTask(0));
es.shutdown();
es.awaitTermination(60, TimeUnit.SECONDS);
}
private static void runTask(int delay) {
range(1, 1_000_000).parallel().filter(ParallelTest::isPrime).peek(i -> Utils.sleep(delay)).max()
.ifPresent(max -> System.out.println(Thread.currentThread() + " " + max));
}
public static boolean isPrime(long n) {
return n > 1 && rangeClosed(2, (long) sqrt(n)).noneMatch(divisor -> n % divisor == 0);
}
}

There actually is a trick how to execute a parallel operation in a specific fork-join pool. If you execute it as a task in a fork-join pool, it stays there and does not use the common one.
final int parallelism = 4;
ForkJoinPool forkJoinPool = null;
try {
forkJoinPool = new ForkJoinPool(parallelism);
final List<Integer> primes = forkJoinPool.submit(() ->
// Parallel task here, for example
IntStream.range(1, 1_000_000).parallel()
.filter(PrimesPrint::isPrime)
.boxed().collect(Collectors.toList())
).get();
System.out.println(primes);
} catch (InterruptedException | ExecutionException e) {
throw new RuntimeException(e);
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown();
}
}
The trick is based on ForkJoinTask.fork which specifies: "Arranges to asynchronously execute this task in the pool the current task is running in, if applicable, or using the ForkJoinPool.commonPool() if not inForkJoinPool()"

The parallel streams use the default ForkJoinPool.commonPool which by default has one less threads as you have processors, as returned by Runtime.getRuntime().availableProcessors() (This means that parallel streams leave one processor for the calling thread).
For applications that require separate or custom pools, a ForkJoinPool may be constructed with a given target parallelism level; by default, equal to the number of available processors.
This also means if you have nested parallel streams or multiple parallel streams started concurrently, they will all share the same pool. Advantage: you will never use more than the default (number of available processors). Disadvantage: you may not get "all the processors" assigned to each parallel stream you initiate (if you happen to have more than one). (Apparently you can use a ManagedBlocker to circumvent that.)
To change the way parallel streams are executed, you can either
submit the parallel stream execution to your own ForkJoinPool: yourFJP.submit(() -> stream.parallel().forEach(soSomething)).get(); or
you can change the size of the common pool using system properties: System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "20") for a target parallelism of 20 threads.
Example of the latter on my machine which has 8 processors. If I run the following program:
long start = System.currentTimeMillis();
IntStream s = IntStream.range(0, 20);
//System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "20");
s.parallel().forEach(i -> {
try { Thread.sleep(100); } catch (Exception ignore) {}
System.out.print((System.currentTimeMillis() - start) + " ");
});
The output is:
215 216 216 216 216 216 216 216 315 316 316 316 316 316 316 316 415 416 416 416
So you can see that the parallel stream processes 8 items at a time, i.e. it uses 8 threads. However, if I uncomment the commented line, the output is:
215 215 215 215 215 216 216 216 216 216 216 216 216 216 216 216 216 216 216 216
This time, the parallel stream has used 20 threads and all 20 elements in the stream have been processed concurrently.

Alternatively to the trick of triggering the parallel computation inside your own forkJoinPool you can also pass that pool to the CompletableFuture.supplyAsync method like in:
ForkJoinPool forkJoinPool = new ForkJoinPool(2);
CompletableFuture<List<Integer>> primes = CompletableFuture.supplyAsync(() ->
//parallel task here, for example
range(1, 1_000_000).parallel().filter(PrimesPrint::isPrime).collect(toList()),
forkJoinPool
);

The original solution (setting the ForkJoinPool common parallelism property) no longer works. Looking at the links in the original answer, an update which breaks this has been back ported to Java 8. As mentioned in the linked threads, this solution was not guaranteed to work forever. Based on that, the solution is the forkjoinpool.submit with .get solution discussed in the accepted answer. I think the backport fixes the unreliability of this solution also.
ForkJoinPool fjpool = new ForkJoinPool(10);
System.out.println("stream.parallel");
IntStream range = IntStream.range(0, 20);
fjpool.submit(() -> range.parallel()
.forEach((int theInt) ->
{
try { Thread.sleep(100); } catch (Exception ignore) {}
System.out.println(Thread.currentThread().getName() + " -- " + theInt);
})).get();
System.out.println("list.parallelStream");
int [] array = IntStream.range(0, 20).toArray();
List<Integer> list = new ArrayList<>();
for (int theInt: array)
{
list.add(theInt);
}
fjpool.submit(() -> list.parallelStream()
.forEach((theInt) ->
{
try { Thread.sleep(100); } catch (Exception ignore) {}
System.out.println(Thread.currentThread().getName() + " -- " + theInt);
})).get();

We can change the default parallelism using the following property:
-Djava.util.concurrent.ForkJoinPool.common.parallelism=16
which can set up to use more parallelism.

To measure the actual number of used threads, you can check Thread.activeCount():
Runnable r = () -> IntStream
.range(-42, +42)
.parallel()
.map(i -> Thread.activeCount())
.max()
.ifPresent(System.out::println);
ForkJoinPool.commonPool().submit(r).join();
new ForkJoinPool(42).submit(r).join();
This can produce on a 4-core CPU an output like:
5 // common pool
23 // custom pool
Without .parallel() it gives:
3 // common pool
4 // custom pool

Until now, I used the solutions described in the answers of this question. Now, I came up with a little library called Parallel Stream Support for that:
ForkJoinPool pool = new ForkJoinPool(NR_OF_THREADS);
ParallelIntStreamSupport.range(1, 1_000_000, pool)
.filter(PrimesPrint::isPrime)
.collect(toList())
But as #PabloMatiasGomez pointed out in the comments, there are drawbacks regarding the splitting mechanism of parallel streams which depends heavily on the size of the common pool. See Parallel stream from a HashSet doesn't run in parallel .
I am using this solution only to have separate pools for different types of work but I can not set the size of the common pool to 1 even if I don't use it.

Note:
There appears to be a fix implemented in JDK 10 that ensures the Custom Thread Pool uses the expected number of threads.
Parallel stream execution within a custom ForkJoinPool should obey the parallelism
https://bugs.openjdk.java.net/browse/JDK-8190974

If you don't want to rely on implementation hacks, there's always a way to achieve the same by implementing custom collectors that will combine map and collect semantics... and you wouldn't be limited to ForkJoinPool:
list.stream()
.collect(parallel(i -> process(i), executor, 4))
.join()
Luckily, it's done already here and available on Maven Central:
http://github.com/pivovarit/parallel-collectors
Disclaimer: I wrote it and take responsibility for it.

Go to get abacus-common. Thread number can by specified for parallel stream. Here is the sample code:
LongStream.range(4, 1_000_000).parallel(threadNum)...
Disclosure： I'm the developer of abacus-common.

If you don't need a custom ThreadPool but you rather want to limit the number of concurrent tasks, you can use:
List<Path> paths = List.of("/path/file1.csv", "/path/file2.csv", "/path/file3.csv").stream().map(e -> Paths.get(e)).collect(toList());
List<List<Path>> partitions = Lists.partition(paths, 4); // Guava method
partitions.forEach(group -> group.parallelStream().forEach(csvFilePath -> {
// do your processing
}));
(Duplicate question asking for this is locked, so please bear me here)

Here is how I set the max thread count flag mentioned above programatically and a code sniped to verify that the parameter is honored
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "2");
Set<String> threadNames = Stream.iterate(0, n -> n + 1)
.parallel()
.limit(100000)
.map(i -> Thread.currentThread().getName())
.collect(Collectors.toSet());
System.out.println(threadNames);
// Output -> [ForkJoinPool.commonPool-worker-1, Test worker, ForkJoinPool.commonPool-worker-3]

If you don't mind using a third-party library, with cyclops-react you can mix sequential and parallel Streams within the same pipeline and provide custom ForkJoinPools. For example
ReactiveSeq.range(1, 1_000_000)
.foldParallel(new ForkJoinPool(10),
s->s.filter(i->true)
.peek(i->System.out.println("Thread " + Thread.currentThread().getId()))
.max(Comparator.naturalOrder()));
Or if we wished to continue processing within a sequential Stream
ReactiveSeq.range(1, 1_000_000)
.parallel(new ForkJoinPool(10),
s->s.filter(i->true)
.peek(i->System.out.println("Thread " + Thread.currentThread().getId())))
.map(this::processSequentially)
.forEach(System.out::println);
[Disclosure I am the lead developer of cyclops-react]

I tried the custom ForkJoinPool as follows to adjust the pool size:
private static Set<String> ThreadNameSet = new HashSet<>();
private static Callable<Long> getSum() {
List<Long> aList = LongStream.rangeClosed(0, 10_000_000).boxed().collect(Collectors.toList());
return () -> aList.parallelStream()
.peek((i) -> {
String threadName = Thread.currentThread().getName();
ThreadNameSet.add(threadName);
})
.reduce(0L, Long::sum);
}
private static void testForkJoinPool() {
final int parallelism = 10;
ForkJoinPool forkJoinPool = null;
Long result = 0L;
try {
forkJoinPool = new ForkJoinPool(parallelism);
result = forkJoinPool.submit(getSum()).get(); //this makes it an overall blocking call
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown(); //always remember to shutdown the pool
}
}
out.println(result);
out.println(ThreadNameSet);
}
Here is the output saying the pool is using more threads than the default 4.
50000005000000
[ForkJoinPool-1-worker-8, ForkJoinPool-1-worker-9, ForkJoinPool-1-worker-6, ForkJoinPool-1-worker-11, ForkJoinPool-1-worker-10, ForkJoinPool-1-worker-1, ForkJoinPool-1-worker-15, ForkJoinPool-1-worker-13, ForkJoinPool-1-worker-4, ForkJoinPool-1-worker-2]
But actually there is a weirdo, when I tried to achieve the same result using ThreadPoolExecutor as follows:
BlockingDeque blockingDeque = new LinkedBlockingDeque(1000);
ThreadPoolExecutor fixedSizePool = new ThreadPoolExecutor(10, 20, 60, TimeUnit.SECONDS, blockingDeque, new MyThreadFactory("my-thread"));
but I failed.
It will only start the parallelStream in a new thread and then everything else is just the same, which again proves that the parallelStream will use the ForkJoinPool to start its child threads.

I made utility method to run task in parallel with argument which defines max number of threads.
public static void runParallel(final int maxThreads, Runnable task) throws RuntimeException {
ForkJoinPool forkJoinPool = null;
try {
forkJoinPool = new ForkJoinPool(maxThreads);
forkJoinPool.submit(task).get();
} catch (InterruptedException | ExecutionException e) {
throw new RuntimeException(e);
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown();
}
}
}
It creates ForkJoinPool with max number of allowed threads and it shuts it down after the task completes (or fails).
Usage is following:
final int maxThreads = 4;
runParallel(maxThreads, () ->
IntStream.range(1, 1_000_000).parallel()
.filter(PrimesPrint::isPrime)
.boxed().collect(Collectors.toList()));

The (currently) accepted answer is partly wrong. It is not sufficient to just submit() the parallel stream to the dedicated fork-join-pool. In this case, the stream will use that pool's threads and additionally the common fork-join-pool and even the calling thread to handle the workload of the stream, it seems up to the size of the common fork-join pool. The behaviour is a bit weird but definitely not what is required.
To actually restrict the work completely to the dedicated pool, you must encapsulate it into a CompletableFuture:
final int parallelism = 4;
ForkJoinPool forkJoinPool = null;
try {
forkJoinPool = new ForkJoinPool(parallelism);
final List<Integer> primes = CompletableFuture.supplyAsync(() ->
// Parallel task here, for example
IntStream.range(1, 1_000_000).parallel()
.filter(PrimesPrint::isPrime)
.boxed().collect(Collectors.toList()),
forkJoinPool) // <- passes dedicated fork-join pool as executor
.join(); // <- Wait for result from forkJoinPool
System.out.println(primes);
} finally {
if (forkJoinPool != null) {
forkJoinPool.shutdown();
}
}
This code stays with all operations in forkJoinPool on both Java 8u352 and Java 17.0.1.

Java concurrency pattern to parallel parts of task

I read lines from file, in one thread of course. Lines was sorted by key.
Then I collect lines with same key (15-20 lines), make parsing, big calculation, etc, and push resulting object to statistic class.
I want to paralell my programm to read in one thread, make parsing and calc in many threads, and join results in one thread to write to stat class.
Is any ready pattern or solution in java7 framework for this problem?
I realize it with executor for multithreading, pushing to blockingQueue, and reading queue in another thread, but i think my code sucks and will produce bugs
Many thanks
upd:
I can't map all file in memory - it's very big

You already have the main classes of approaches in mind. CountDownLatch, Thread.join, Executors, Fork/Join. Another option is the Akka framework, which has message passing overheads measured in 1-2 microseconds and is open source. However let me share another approach that often out performs the above approaches and is simpler, this approach is born from working on batch file loads in Java for a number of companies.
Assuming that your goal of splitting the work up is performance, rather than learning. Performance as measured by how long it takes from start to finish. Then it is often difficult to make it faster than memory mapping the file, and processing in a single thread that has been pinned to a single core. It is also gives much simpler code too. A double win.
This may be counter intuitive, however the speed of processing files is nearly always limited by how efficient the file loading is. Not how parallel the processing is. Hence memory mapping the file is a huge win. Once memory mapped we want the algorithm to have low contention with the hardware as it performs the file load. Modern hardware tend to have the IO controller and the memory controller on the same socket as the CPU; which when combined with the prefetchers within the CPU itself lead to a hell of a lot of efficiency when processing the file in a orderly fashion from a single thread. This can be so extreme that going parallel may actually be a lot slower. Pinning a thread to a core usually speeds up memory bound algorithms by a factor of 5. Which is why the memory mapping part is so important.
If you have not already, give it a try.

Without facts and numbers it is hard to give you advices. So let's start from the beginning:
You must identify the bottleneck. Do you really need to perform the computation in parallel or is your job IO bound ? Avoid concurrency if possible, it could be faster.
If computations must be done in parallel you must decide how fine or coarse grained your tasks must be. You need to measure your computations and tasks to be able to size them. Avoid to create too many tasks
You should have a IO thread, several workers, and a "data gatherer" thread. No mutable data.
Be sure to not slow down the IO thread because of task submission. Otherwise you should use more coarse grained tasks or use a better task dispatcher (who said disruptor ?)
The "Data gatherer" thread should be the only one to mutate the final state
Avoid unnecessary data copy and object creation. Quite often, when iterating on large files the bottleneck is the GC. Last week, I achieved a 6x speedup replacing a standard scala object by a flyweight pattern. You should also try to pre-allocate everything and use large buffers (page sized).
Avoid disk seeks.
Having that said you should be one the right track. You can start with an Executor using properly sized tasks. Tasks write into a data structure, like your blocking queue, shared between workers and the "data gatherer" thread. This threading model is really simple, efficient and hard to get wrong. It is usually efficient enough. If you still require better performances then you must profile your application and understand the bottleneck. Then you can decide the way to go: refine your task size, use faster tools like the disruptor/Akka, improve IO, create fewer objects, tune your code, buy a bigger machine or faster disks, move to Hadoop etc. Pinning each thread to a core (require platform specific code) could also provide a significant boost.

You can use CountDownLatch
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/CountDownLatch.html
to synchronize the starting and joining of threads. This is better than looping on the set of threads and calling join() on each thread reference.

Here is what I would do if asked to split work as you are trying to:
public class App {
public static class Statistics {
}
public static class StatisticsCalculator implements Callable<Statistics> {
private final List<String> lines;
public StatisticsCalculator(List<String> lines) {
this.lines = lines;
}
#Override
public Statistics call() throws Exception {
//do stuff with lines
return new Statistics();
}
}
public static void main(String[] args) {
final File file = new File("path/to/my/file");
final List<List<String>> partitionedWork = partitionWork(readLines(file), 10);
final List<Callable<Statistics>> callables = new LinkedList<>();
for (final List<String> work : partitionedWork) {
callables.add(new StatisticsCalculator(work));
}
final ExecutorService executorService = Executors.newFixedThreadPool(Math.min(partitionedWork.size(), 10));
final List<Future<Statistics>> futures;
try {
futures = executorService.invokeAll(callables);
} catch (InterruptedException ex) {
throw new RuntimeException(ex);
}
try {
for (final Future<Statistics> future : futures) {
final Statistics statistics = future.get();
//do whatever to aggregate the individual
}
} catch (InterruptedException | ExecutionException ex) {
throw new RuntimeException(ex);
}
executorService.shutdown();
try {
executorService.awaitTermination(1, TimeUnit.DAYS);
} catch (InterruptedException ex) {
throw new RuntimeException(ex);
}
}
static List<String> readLines(final File file) {
//read lines
return new ArrayList<>();
}
static List<List<String>> partitionWork(final List<String> lines, final int blockSize) {
//divide up the incoming list into a number of chunks
final List<List<String>> partitionedWork = new LinkedList<>();
for (int i = lines.size(); i > 0; i -= blockSize) {
int start = i > blockSize ? i - blockSize : 0;
partitionedWork.add(lines.subList(start, i));
}
return partitionedWork;
}
}
I have create a Statistics object, this holds the result of the work done.
There is a StatisticsCalculator object which is a Callable<Statistics> - this does the calculation. It is given a List<String> and it processes the lines and creates the Statistics.
The readLines method I leave to you to implement.
The most important method in many ways is the partitionWork method, this divides the incoming List<String> which is all the lines in the file into a List<List<String>> using the blockSize. This essentially decides how much work each thread should have, tuning of the blockSize parameter is very important. As if each work is only one line then the overheads would probably outweight the advantages whereas if each work of ten thousand lines then you only have one working Thread.
Finally the meat of the opertation is the main method. This calls the read and then partition methods. It spawns an ExecutorService with a number of threads equal to the number of bits of work but up to a maximum of 10. You may way to make this equal to the number of cores you have.
The main method then submits a List of all the Callables, one for each chunk, to the executorService. The invokeAll method blocks until the work is done.
The method now loops over each returned List<Future> and gets the generated Statistics object for each; ready for aggregation.
Afterwards don't forget to shutdown the executorService as it will prevent your application form exiting.
EDIT
OP wants to read line by line so here is a revised main
public static void main(String[] args) throws IOException {
final File file = new File("path/to/my/file");
final ExecutorService executorService = Executors.newFixedThreadPool(10);
final List<Future<Statistics>> futures = new LinkedList<>();
try (final BufferedReader reader = new BufferedReader(new FileReader(file))) {
List<String> tmp = new LinkedList<>();
String line = null;
while ((line = reader.readLine()) != null) {
tmp.add(line);
if (tmp.size() == 100) {
futures.add(executorService.submit(new StatisticsCalculator(tmp)));
tmp = new LinkedList<>();
}
}
if (!tmp.isEmpty()) {
futures.add(executorService.submit(new StatisticsCalculator(tmp)));
}
}
try {
for (final Future<Statistics> future : futures) {
final Statistics statistics = future.get();
//do whatever to aggregate the individual
}
} catch (InterruptedException | ExecutionException ex) {
throw new RuntimeException(ex);
}
executorService.shutdown();
try {
executorService.awaitTermination(1, TimeUnit.DAYS);
} catch (InterruptedException ex) {
throw new RuntimeException(ex);
}
}
This streams the file line by line and, after a given number of lines fires a new task to process the lines to the executor.
You would need to call clear on the List<String> in the Callable when you are done with it as the Callable instances are references by the Futures they return. If you clear the Lists when you're done with them that should reduce the memory footprint considerably.
A further enhancement may well be to use the suggestion here for a ExecutorService that blocks until there is a spare thread - this will guranatee that there are never more than threads*blocksize lines in memory at a time if you clear the Lists when the Callables are done with them.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.