The new Stream API in Java 8 is really nice, especially for its parallel processing capabilities. However, I don't see how to apply parallel processing outside of Collection's parallelStream() method.
For example, if I am creating a Stream from a File, I use the following:
Stream<String> lines = Files.lines(Paths.get("test.csv"));
However, there is no counterpart parallelStream method like there is on Collection. It seems like one thread could grab the next line while several other threads parse and process the lines.
Could this be done with StreamSupport.stream()?
There's a much simpler answer: Any stream can be turned parallel by calling .parallel():
Stream<String> lines = Files.lines(Paths.get("test.csv"))
        .parallel();
The .parallelStream() method on Collection is just a convenience.
Note that, unless you're doing a lot of processing per line, the sequential nature of IO from the file will probably dominate and you may not get as much parallelism as you hope.
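For illustration, a minimal sketch showing the two forms side by side; the list contents and the filter are just placeholders:

import java.util.Arrays;
import java.util.List;

public class ParallelEquivalence {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "ccc");

        // Convenience method on Collection:
        long viaParallelStream = lines.parallelStream().filter(s -> s.length() > 1).count();

        // The same thing on any Stream, regardless of its source:
        long viaParallel = lines.stream().parallel().filter(s -> s.length() > 1).count();

        System.out.println(viaParallelStream + " " + viaParallel);
    }
}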
Yes - it turns out you can create a parallel stream from a sequential stream with StreamSupport.stream(). Following the pattern in my question, it would look like the following:
StreamSupport.stream(Files.lines(Paths.get("test.csv")).spliterator(), true);
The 'true' makes the stream parallel. In testing, this expanded usage from a single core to all cores on my machine. The lines were read in order; however, the processing of the lines did not complete in order, which is fine for my purposes.
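For completeness, a fuller sketch of that approach; the file name and the processLine method are just placeholders, and note that the stream returned by Files.lines should be closed when you're done:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ParallelLines {
    public static void main(String[] args) throws Exception {
        // Files.lines should be closed, so keep the original stream in try-with-resources.
        try (Stream<String> sequential = Files.lines(Paths.get("test.csv"))) {
            // 'true' requests a parallel stream backed by the same spliterator.
            Stream<String> parallel = StreamSupport.stream(sequential.spliterator(), true);
            parallel.forEach(ParallelLines::processLine);
        }
    }

    // Placeholder for the real per-line processing.
    private static void processLine(String line) {
        System.out.println(line.length());
    }
}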
We're having this issue: SLURM slow for array job
Is there some way that
collection.stream().someFunction1().someFunction2() etc.
or
Arrays.stream(values).someFunction1().someFunction2() etc.
causes some multithreading?
We don't have anything like "parallel" or "thread" in our code.
Thanks in advance
Martin
No.
From the documentation for Collection.stream:
Returns a sequential Stream with this collection as its source.
From the documentation for Arrays.stream:
Returns a sequential Stream with the specified array as its source.
A sequential stream is the opposite of a parallel stream. It is processed in the calling thread only.
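If you want to see this for yourself, here is a quick sketch that prints the executing thread's name for each element; the sequential pipeline prints only the calling thread, while the parallelStream() variant may also print ForkJoinPool.commonPool workers:

import java.util.Arrays;
import java.util.List;

public class WhichThread {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        // Sequential: every element is mapped on the calling (main) thread.
        values.stream()
              .map(v -> Thread.currentThread().getName())
              .forEach(System.out::println);

        // Parallel: elements may also be mapped on ForkJoinPool.commonPool worker threads.
        values.parallelStream()
              .map(v -> Thread.currentThread().getName())
              .forEach(System.out::println);
    }
}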
I am trying to process records in the processor step using multiple processor classes. These classes can work in parallel. Currently I have written a multi-threaded step where I:
1) Set the input and output row for a processor class
2) Submit it to an ExecutorService
3) Get all the Future objects and collect the final output
Now, as soon as I make my job parallel by adding a taskExecutor, I get issues: the input objects set in step 1 get overwritten in step 2, and the processors are called with the overwritten values. I searched for a way to write a composite processor that delegates the task to multiple steps in parallel, but they only work in a sequential manner.
Any input would be greatly appreciated. Thanks!
Welcome to concurrency. You can get yourself into a lot of trouble when you do not follow the path that keeps you in a safe, deterministic world. You can get rid of all your issues if you use pure functions: your functions should not have any side effects, and all your variables should be final. You'll find that you won't have any concurrency issues if you stick to this. In general, stay away from the low-level threading libraries that ship with Java; you should treat thread pools, executors, etc. as resources. You should probably do a bit of reading about concurrency, locks, and volatile variables, and why these lower-level constructs are hard to use, and then look at higher-order constructs such as promises, futures, and actors.
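To make the "no shared mutable state" advice concrete, here is a minimal, framework-free sketch (the String inputs and the process method are placeholders, not Spring Batch APIs): each task captures its own input and returns its result via a Future, so nothing can be overwritten by another thread.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IndependentTasks {
    public static void main(String[] args) throws Exception {
        List<String> rows = Arrays.asList("a", "b", "c");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String row : rows) {
                // Each task captures its own (effectively final) input;
                // nothing is written to a shared processor field, so nothing gets overwritten.
                Callable<String> task = () -> process(row);
                futures.add(pool.submit(task));
            }
            for (Future<String> future : futures) {
                System.out.println(future.get());   // collect results in submission order
            }
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for one processor's work: a pure function of its input, no side effects.
    private static String process(String row) {
        return row.toUpperCase();
    }
}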
I'm trying to create a flow that I can consume via something like an Iterator.
I'm implementing a library that exposes an iterator-like interface, so that would be the simplest thing for me to consume.
The graph I have designed so far is essentially a Source<Iterator<DataRow>>. One option I see is to flatten it to a Source<DataRow> and then use http://doc.akka.io/japi/akka/current/akka/stream/javadsl/StreamConverters.html#asJavaStream-- followed by https://docs.oracle.com/javase/8/docs/api/java/util/stream/BaseStream.html#iterator--
But given that there will potentially be many rows, I'm wondering whether it would make sense to avoid the flattening step (at least within the akka streams context; I'm assuming there's some minor per-element overhead when elements are passed between stages), or if there's a more direct way.
Also, I'm curious how backpressure works in the created stream, especially the child Iterator; does it only buffer one element?
Flattening Step
Flattening a Source<Iterator<DataRow>> to a Source<DataRow> does add some amount of overhead since you'll have to use flatMapConcat which does eventually create a new GraphStage.
However, if you have "many" rows then this separate stage may come in handy since it will provide concurrency for the flattening step.
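For reference, the flattening itself is a one-liner with flatMapConcat plus Source.fromIterator. A rough sketch, using plain Strings in place of your DataRow type and a classic ActorMaterializer-based setup (adjust for your Akka version):

import java.util.Arrays;
import java.util.Iterator;
import java.util.stream.Stream;

import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.Materializer;
import akka.stream.javadsl.Source;
import akka.stream.javadsl.StreamConverters;

public class FlattenSketch {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("flatten-sketch");
        Materializer materializer = ActorMaterializer.create(system);

        // Stand-in for your Source<Iterator<DataRow>, ?>; here the "rows" are Strings.
        Source<Iterator<String>, NotUsed> batches = Source.from(Arrays.asList(
                Arrays.asList("r1", "r2").iterator(),
                Arrays.asList("r3", "r4").iterator()));

        // Flatten each Iterator into a stream of individual rows.
        Source<String, NotUsed> rows =
                batches.flatMapConcat(it -> Source.fromIterator(() -> it));

        // Expose the flattened source as a java.util.stream.Stream, then as an Iterator.
        Stream<String> javaStream = rows.runWith(StreamConverters.asJavaStream(), materializer);
        Iterator<String> iterator = javaStream.iterator();
        iterator.forEachRemaining(System.out::println);

        system.terminate();
    }
}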
Backpressure
If you look at the code of StreamConverters.asJavaStream you'll see that there is a QueueSink that is spawning a Future to pull the next element from the akka stream and then doing an Await.result(nextElementFuture, Inf) to wait on the Future to complete so the next element can be forwarded to the java Stream.
Answering your question: yes the child Iterator only buffers one element, but the QueueSink has a Future which may also have the next DataRow. Therefore the javaStream & Iterator may have 2 elements buffered, on top of however much buffering is going on in your original akka Source.
Alternatively, you could implement the Iterator yourself, using prefixAndTail(1) under the hood to implement hasNext and next.
I saw this code somewhere using stream().map().reduce().
Does this map() function really work in parallel? If yes, then what is the maximum number of threads it can use for the map() function?
What if I use parallelStream() instead of just stream() for the particular use case below?
Can anyone give me a good example of where NOT to use parallelStream()?
The code below just extracts a tName from each tCode and returns a comma-separated String.
String ts = atList.stream().map(tCode -> {
    return CacheUtil.getTCache().getTInfo(tCode).getTName();
}).reduce((tName1, tName2) -> {
    return tName1 + ", " + tName2;
}).get();
This stream().map().reduce() is not parallel; a single thread acts on the stream.
You have to add parallel() (or, depending on the source, use parallelStream() instead - it's the same thing). With parallel(), by default you get the number of available processors minus one as worker threads, but the calling thread also does work in ForkJoinPool#commonPool, so there will usually be 2, 4, 8, etc. threads in total. To check how many you will get, use:
Runtime.getRuntime().availableProcessors()
You can use a custom pool and get as many threads as you want, as shown here.
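As a concrete illustration of the custom-pool idea: if you start the parallel pipeline from inside your own ForkJoinPool, its workers are used instead of the common pool. Note this relies on behavior of the Stream implementation that is widely relied upon but not formally specified; the pool size of 4 and the values below are arbitrary.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class CustomPoolExample {
    public static void main(String[] args) throws Exception {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        ForkJoinPool customPool = new ForkJoinPool(4);   // 4 is arbitrary
        try {
            Integer sum = customPool.submit(() ->
                // Because the pipeline is started from inside customPool,
                // its tasks run on customPool's workers instead of the common pool.
                values.parallelStream()
                      .mapToInt(Integer::intValue)
                      .sum()
            ).get();
            System.out.println(sum);
        } finally {
            customPool.shutdown();
        }
    }
}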
Also notice that the entire pipeline is run in parallel, not just the map operation.
There isn't a golden rule about when to use and when not to use parallel streams; the best way is to measure. But there are obvious cases, like a stream of 10 elements - that is far too few to gain any real benefit from parallelization.
All parallel streams use the common fork-join thread pool, and if you submit a long-running task, you effectively block all threads in the pool. Consequently, you block all other tasks that are using parallel streams.
There are only two ways to make sure that never happens. The first is to ensure that all tasks submitted to the common fork-join pool will not get stuck and will finish in a reasonable time. But that is easier said than done, especially in complex applications. The other option is to not use parallel streams and wait until Oracle allows us to specify the thread pool to be used for parallel streams.
Use case
Let's say you have a collection (a List) which gets loaded with values at the start of the application, and no new values are added to it at any later point. In that scenario you can use a parallel stream without any concerns.
Don't worry, the stream is efficient and safe.
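A tiny sketch of that scenario (the values are placeholders): the list is populated once, wrapped as unmodifiable, and only ever read afterwards, so streaming it in parallel is safe.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReadOnlyParallel {
    // Populated once at startup and never modified afterwards.
    private static final List<String> CODES =
        Collections.unmodifiableList(new ArrayList<>(Arrays.asList("A1", "B2", "C3")));

    public static void main(String[] args) {
        // Safe: the source is effectively immutable and the lambda has no side effects.
        String joined = CODES.parallelStream()
                             .map(String::toLowerCase)
                             .reduce((a, b) -> a + ", " + b)
                             .orElse("");
        System.out.println(joined);
    }
}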
I have a huge line-separated text file and I want to make some calculations on each line. I need to make a multithreaded program to process it because it is the processing of each line that takes the most time to complete rather than reading each line. (the bottleneck lies in the CPU processing, rather than the IO)
There are two options I came up with:
1) Open the file from the main thread, create a lock on the file handle, pass the file handle around to the worker threads, and then let each worker read-access the file directly
2) Create a producer / consumer setup where only the main thread has direct read-access to the file, and feeds lines to each worker thread using a shared queue
Things to know:
I am really interested in raw speed for this task
Each line is independent
I am working in C++, but I guess the issue here is somewhat language-independent
Which option would you choose and why?
I would suggest the second option, since it is cleaner design-wise and less complicated than the first. The first option is less scalable and requires additional communication among the threads in order to synchronize their progress through the file's lines. In the second option you have one dispatcher which deals with the IO and hands lines to the worker threads to start their computation, and each computational thread is completely independent of the others, which lets you scale. Moreover, the second option separates your logic more cleanly.
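Since the question notes the issue is largely language-independent, here is a rough sketch of that producer/consumer shape in Java using a bounded BlockingQueue (the file name, worker count, and per-line work are placeholders); in C++ the same structure maps to a queue guarded by a mutex and condition variable.

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumer {
    private static final String POISON_PILL = new String("EOF");   // sentinel object to stop workers
    private static final int WORKERS = 4;                          // arbitrary worker count

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Consumers: take lines off the shared queue and process them.
        Thread[] workers = new Thread[WORKERS];
        for (int i = 0; i < WORKERS; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String line;
                    // Reference comparison is intentional: the pill is a distinct object.
                    while ((line = queue.take()) != POISON_PILL) {
                        process(line);                              // CPU-heavy per-line work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // Producer: only the main thread touches the file.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("test.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line);                                    // blocks if workers fall behind
            }
        }
        for (int i = 0; i < WORKERS; i++) {
            queue.put(POISON_PILL);                                 // one pill per worker
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    // Placeholder for the real calculation on each line.
    private static void process(String line) {
        System.out.println(line.length());
    }
}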
If we are talking about a massively large file which needs to be processed by a large cluster, MapReduce is probably the best solution.
The framework gives you great scalability, and already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to receive files read from a file system (originally GFS) as input.
Note that there is an open-source implementation of MapReduce: Apache Hadoop.
If each line is really independent and processing is much slower than reading the file, what you can do is read all the data at once and store it in an array, such that each line is an element of the array.
Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread could perform the calculation on 50 lines. Moreover, since this method is embarrassingly parallel, you could easily use OpenMP for it.
I would suggest the second option because it is definitely better design-wise and would allow you to have better control over the work that the worker threads are doing.
Moreover, it should also perform well, since of the two options you described it requires the least inter-thread communication.
Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion between the threads.