Avoiding multithreading with Java streams - java

We're having this issue: SLURM slow for array job
Is there any way that
collection.stream().someFunction1().someFunction2() etc.
or
Arrays.stream(values).someFunction1().someFunction2() etc.
could cause multithreading?
We don't have anything like "parallel" or "thread" in our code.
Thanks in advance
Martin

No.
From the documentation for Collection.stream:
Returns a sequential Stream with this collection as its source.
From the documentation for Arrays.stream:
Returns a sequential Stream with the specified array as its source.
A sequential stream is the opposite of a parallel stream. It is processed in the calling thread only.
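You can see this for yourself by printing the current thread's name inside a stream operation; a minimal sketch (class name is my own):

    import java.util.Arrays;
    import java.util.stream.Collectors;

    public class SequentialStreamDemo {
        public static void main(String[] args) {
            Arrays.asList("a", "b", "c").stream()
                  .map(s -> {
                      // prints "main" for every element: all work happens on the calling thread
                      System.out.println(Thread.currentThread().getName());
                      return s.toUpperCase();
                  })
                  .collect(Collectors.toList());
        }
    }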

Related

How to get an iterator from an akka streams Source?

I'm trying to create a flow that I can consume via something like an Iterator.
I'm implementing a library that exposes an iterator-like interface, so that would be the simplest thing for me to consume.
My graph as designed so far is essentially a Source<Iterator<DataRow>>. One approach I see is to flatten it to Source<DataRow> and then use StreamConverters.asJavaStream (http://doc.akka.io/japi/akka/current/akka/stream/javadsl/StreamConverters.html#asJavaStream--) followed by BaseStream.iterator (https://docs.oracle.com/javase/8/docs/api/java/util/stream/BaseStream.html#iterator--).
But given that there will potentially be many rows, I'm wondering whether it would make sense to avoid the flattening step (at least within the akka streams context; I assume there's some minor per-element overhead when elements are passed between stages), or whether there's a more direct way.
Also, I'm curious how backpressure works in the created stream, especially the child Iterator; does it only buffer one element?
Flattening Step
Flattening a Source<Iterator<DataRow>> to a Source<DataRow> does add some overhead, since you'll have to use flatMapConcat, which eventually creates a new GraphStage.
However, if you have "many" rows then this separate stage may come in handy since it will provide concurrency for the flattening step.
Backpressure
If you look at the code of StreamConverters.asJavaStream, you'll see that there is a QueueSink that spawns a Future to pull the next element from the akka stream and then does an Await.result(nextElementFuture, Inf) to wait for the Future to complete before forwarding the element to the java Stream.
Answering your question: yes the child Iterator only buffers one element, but the QueueSink has a Future which may also have the next DataRow. Therefore the javaStream & Iterator may have 2 elements buffered, on top of however much buffering is going on in your original akka Source.
Alternatively, you could implement the Iterator yourself, using prefixAndTail(1) under the hood to implement hasNext and next.
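For reference, a minimal sketch of the flattening approach in the javadsl. This is a fragment, not a definitive implementation: DataRow is the type from the question, rows stands for your existing graph, and the second argument to runWith is your Materializer (or the ActorSystem on newer Akka versions):

    import akka.NotUsed;
    import akka.stream.javadsl.Source;
    import akka.stream.javadsl.StreamConverters;
    import java.util.Iterator;

    Source<Iterator<DataRow>, NotUsed> rows = ...; // your existing Source<Iterator<DataRow>>

    Iterator<DataRow> iterator =
        rows.flatMapConcat(it -> Source.fromIterator(() -> it))     // flatten to Source<DataRow>
            .runWith(StreamConverters.asJavaStream(), materializer) // materialize a java.util.stream.Stream
            .iterator();                                            // BaseStream.iterator()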

Java Bitset - What is the complete meaning of “Not Thread-Safe”

I know that java.util.BitSet operations are not thread-safe. Does merely reading from and writing to a BitSet in parallel threads cause a permanent (for the current run of the application) loss of information? Or does a write operation execute correctly, with only the concurrent read possibly returning wrong information, while later read operations return correct information? In other words: if I synchronize only the write operations, and allow write operations to run in parallel with read operations, will some information still be lost permanently?
The only thread-safe combination is read vs read: nothing is written to memory, so it can be accessed from any thread without any problem.
BUT when you have read vs write you can get surprises, e.g. reading while writing may give you half of the previous value and half of the new value, since the underlying bit field is not updated atomically.
In your question you accept that a concurrent read/write may return incorrect results for the read. In that case, how do you know whether the data returned by a read is correct? Read many times and take an average?
So you have to synchronize your read operations with your write operations too.
EDIT: if you really want to go down the "I don't care if data is corrupt when reading" road, I suggest you add a CRC to the emitted data so that you can reject data that turns out to be incorrect.
When you're talking about a standard Java library class, you should go by what the Javadoc says.
There are users out there running different JRE versions, from different vendors, on different operating system versions. You can't rely on the behavior of BitSet or any other library class being exactly the same in every environment, but you can rely on it doing whatever the Javadoc says it will do.
if I only synchronize the write operations and allow write operations to run in parallel with read operations, will some information still be lost permanently?
It's highly unlikely that overlapped read operations, or reads overlapped with a single write operation, could leave a BitSet (or any other object) in some invalid state. If you think that your application can cope with the incorrect results that a read might return, then that might be a reasonable risk to take,
BUT
Are you certain that synchronizing reads causes a performance problem? If you haven't actually measured the performance, and haven't found that the difference between unsynchronized and synchronized is the difference between acceptable and unacceptable performance, then why not just synchronize all access?
To do otherwise is called "premature optimization", and more often than not, it's a waste of your own time.
A BitSet is not safe for multithreaded use without external synchronization.
Please read this article for a basic understanding of the readers-writers problem.
https://dzone.com/articles/java-concurrency-read-write-lo
READ vs READ alone will not cause any concurrency issues.
READ/WRITE or WRITE/WRITE will cause inconsistency when the information is accessed concurrently.
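A minimal sketch of synchronizing both sides with a ReadWriteLock, following the readers-writers pattern from the linked article (class name is my own):

    import java.util.BitSet;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class SynchronizedBitSet {
        private final BitSet bits = new BitSet();
        private final ReadWriteLock lock = new ReentrantReadWriteLock();

        public void set(int index) {
            lock.writeLock().lock();  // writers get exclusive access
            try {
                bits.set(index);
            } finally {
                lock.writeLock().unlock();
            }
        }

        public boolean get(int index) {
            lock.readLock().lock();   // multiple readers may proceed in parallel
            try {
                return bits.get(index);
            } finally {
                lock.readLock().unlock();
            }
        }
    }

This lets concurrent reads proceed in parallel while writes are exclusive, which covers exactly the READ/WRITE and WRITE/WRITE cases above.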

ParallelStream for Files

The new Stream API in Java 8 is really nice, especially for its parallel processing capabilities. However, I don't see how to apply parallel processing outside of the Collection parallelStream() method.
For example, if I am creating a Stream from a File, I use the following:
Stream<String> lines = Files.lines(Paths.get("test.csv"));
However, there is no counterpart parallelStream method like there is on Collection. It seems like there could be one thread grabbing the next line while several threads parse and process the lines.
Could this be done with StreamSupport.stream()?
There's a much simpler answer: Any stream can be turned parallel by calling .parallel():
Stream<String> lines = Files.lines(Paths.get("test.csv"))
                            .parallel();
The .parallelStream() method on Collection is just a convenience.
Note that, unless you're doing a lot of processing per line, the sequential nature of IO from the file will probably dominate and you may not get as much parallelism as you hope.
Yes - turns out you can create a parallel stream from the sequential stream with StreamSupport.stream(). Following the pattern of my question, it would look like the following.
StreamSupport.stream(Files.lines(Paths.get("test.csv")).spliterator(), true);
The 'true' is what makes it parallel. In testing, it expanded the usage from a single core to all cores on my machine. It read the lines in order; however, the processing of the lines did not complete in order, which is fine for my purposes.
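Putting it together, a minimal sketch (file name from the question; note that the stream returned by Files.lines should be closed, e.g. with try-with-resources):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class ParallelLinesDemo {
        public static void main(String[] args) throws IOException {
            try (Stream<String> lines = Files.lines(Paths.get("test.csv"))) {
                long count = lines.parallel()
                                  .map(String::trim)       // per-line work may run on common ForkJoinPool workers
                                  .filter(l -> !l.isEmpty())
                                  .count();
                System.out.println(count + " non-empty lines");
            }
        }
    }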

Java input stream "blocking" and multithreading

I can't seem to find anything about input stream "blocking" that describes both what it is and when it occurs. Is this some kind of mechanism to prevent concurrent threads from accessing the same stream?
On that note, when two concurrent threads access the same stream at the same time, can this cause problems, or do both threads get their own stream pointers? Obviously, one would need to wait, but hopefully it wouldn't lead to an unchecked exception.
"Blocking" is when a read or write hangs, while waiting for either more information (for reads) or for more space in some internal buffer (for writes) before returning control to the calling thread.
And I'm pretty sure the stream object keeps track of its own read/write position, so each "pointer" is just a reference to the same stream object, which reads out of its own buffer. So, if you're reading via synchronized methods, each read will wait its turn and get cohesive (but not overlapping) data. If the methods aren't synchronized, then I'm pretty sure all hell will break loose.
In the context of input streams, "blocking" typically refers to the stream waiting for more data becoming available. The term would probably make more sense if you think about sockets rather than files.
If you have multiple threads concurrently reading from the same stream, you have to do your own synchronization. There are no thread-specific "stream pointers". Again, think about multiple threads reading from the same socket (rather than from a file).
Each stream has a stream pointer. It doesn't make much sense to have two threads reading the same stream.
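To make "blocking" concrete, here is a minimal runnable sketch (class name is my own) using a pipe between two threads; the read() call hangs until the writer thread produces a byte:

    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    public class BlockingReadDemo {
        public static void main(String[] args) throws Exception {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);

            Thread writer = new Thread(() -> {
                try {
                    Thread.sleep(1000); // simulate a slow producer (e.g. a remote socket)
                    out.write('x');     // this unblocks the read below
                    out.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            writer.start();

            int b = in.read(); // blocks for ~1 second until a byte is available
            System.out.println("read: " + (char) b);
        }
    }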

Multi-Threaded Application - Help with some pseudo code!

I am working on a multi-threaded application and need help with some pseudo-code. To make implementation simpler, I will explain it in simple terms as a test case.
Here is the scenario -
I have an array list of strings (say 100 strings)
I have a Reader Class that reads the strings and passes them to a Writer Class that prints the strings to the console. Right now this runs in a Single Thread Model.
I wanted to make this multi-threaded but with the following features -
Ability to set MAX_READERS
Ability to set MAX_WRITERS
Ability to set BATCH_SIZE
So basically the code should instantiate those many Readers and Writers and do the work in parallel.
Any pseudo code would be really helpful to keep me going!
This sounds like the classic consumer-producer problem. Have a look at Wikipedia's article about it. They have plenty of pseudo code there.
Aside from using the producer-consumer pattern that has been suggested, I would recommend that you use the CopyOnWriteArrayList so you can have lock-free read/write/iteration of your list. Since you're only working with a couple of hundred strings you will probably not have any performance issues with the CopyOnWriteArrayList.
If you're concerned about performance then I actually think it might be better if you use the BlockingQueue or a ConcurrentHashMap. They will allow you to maximize throughput with your multithreaded application.
The recommended option:
A BlockingQueue works very well with multiple producers and consumers, but of course it implies an order of data processing (FIFO). If you're OK with FIFO ordering, then you will probably find that the BlockingQueue is a faster and more robust option.
I think that the Wikipedia article has sufficient pseudo code for you to use, but you can also check out some of the following SO questions:
https://stackoverflow.com/search?q=java+producer+consumer
Java Producer-Consumer Designs:
Producer/Consumer threads using a Queue
design of a Producer/Consumer app
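As a concrete starting point, here is a minimal BlockingQueue-based sketch of the reader/writer setup described in the question. All names, the batch-claiming scheme, and the poison-pill shutdown are my own choices for illustration, not a definitive design:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ReaderWriterDemo {
        static final int MAX_READERS = 2;   // producer threads
        static final int MAX_WRITERS = 3;   // consumer threads
        static final int BATCH_SIZE  = 10;  // strings handed off per batch
        // sentinel that tells a writer to shut down
        static final List<String> POISON = new ArrayList<>();

        public static void main(String[] args) throws InterruptedException {
            List<String> source = new ArrayList<>();
            for (int i = 0; i < 100; i++) source.add("string-" + i);

            BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();
            AtomicInteger cursor = new AtomicInteger(0);

            ExecutorService readers = Executors.newFixedThreadPool(MAX_READERS);
            ExecutorService writers = Executors.newFixedThreadPool(MAX_WRITERS);

            for (int w = 0; w < MAX_WRITERS; w++) {
                writers.submit(() -> {
                    try {
                        while (true) {
                            List<String> batch = queue.take(); // blocks until a batch arrives
                            if (batch == POISON) break;        // shutdown signal
                            batch.forEach(System.out::println);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            for (int r = 0; r < MAX_READERS; r++) {
                readers.submit(() -> {
                    while (true) {
                        // claim the next batch of indices atomically, so no string is read twice
                        int start = cursor.getAndAdd(BATCH_SIZE);
                        if (start >= source.size()) break;
                        int end = Math.min(start + BATCH_SIZE, source.size());
                        queue.add(new ArrayList<>(source.subList(start, end)));
                    }
                });
            }

            readers.shutdown();
            readers.awaitTermination(1, TimeUnit.MINUTES);
            // one poison pill per writer so they all exit
            for (int w = 0; w < MAX_WRITERS; w++) queue.add(POISON);
            writers.shutdown();
            writers.awaitTermination(1, TimeUnit.MINUTES);
        }
    }

Each reader claims a disjoint batch of indices via the atomic cursor, and each writer exits only when it takes the shared POISON sentinel, so all queued batches are printed before shutdown.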
