Produce Lists with Parallel Stream - java

I have a list of JSON strings, each of which contains a list of movies. I need to collect those movies, process them, and store them on disk. I am thinking of using a parallel stream to collect the movies and test its performance. My approach is this:
The following method produces a List of Movies.
protected abstract List<T> parseJsonString(JsonIterator iter);
This method contains a parallel stream that collects a List of all the Lists (List<List<Movie>>) produced in the stream:
public CompletableFuture<List<List<T>>> parseJsonPages(List<CompletableFuture<String>> jsonPageList)
{
    return jsonPageList.parallelStream()
        .map( jsonPageStr -> CompletableFuture.supplyAsync( () -> {
            try {
                return parseJsonString( JsonIterator.parse( jsonPageStr.get() ) );
            }
            catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
                System.exit(-1);
            }
            return null;
        } ) )
        .collect( ParallelCollectors.toFuture( Collectors.toList() ) );
}
The problem with this approach is that the stream produces the lists of movies and then appends all the lists inside one outer list. Do you think this is an effective way of collecting all those movies? Should I merge the movies from all the lists into one list, instead of just appending the entire lists inside a list (even though this also costs some time)? If so, how do I perform such a task?
Thanks in advance.

Project Loom
In the future, when Project Loom arrives with its virtual threads, it will be much simpler, and likely much faster, to simply assign each task to a virtual thread.
Preliminary builds of Project Loom are available now, built on early-access Java 16. They are subject to change and not ready for production, but if this is a non-mission-critical personal project, you might consider using them now.
By the way, your Movie class might be suitable to define as a record, one of the features coming in Java 16.
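A minimal sketch of such a record, with hypothetical fields:
// Hypothetical fields; the compiler generates the constructor, accessors, equals/hashCode, and toString.
public record Movie( String title , int releaseYear ) {}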
List< String > inputListsOfMoviesAsJson = … ;  // Input.
// Output. Created up front and kept effectively final, so the lambda below may capture it.
final Set< Movie > movies = Collections.synchronizedSet( new HashSet<>() ) ;
try
(
    ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
)
{
    for ( String inputJson : inputListsOfMoviesAsJson )
    {
        Runnable task = () -> movies.addAll( this.parseJsonIntoSetOfMovies( inputJson ) ) ;
        executorService.submit( task ) ;
    }
}
// At this point, flow-of-control blocks until all tasks are done.
// Then the executor service is automatically shutdown as part of being closed, as an `AutoCloseable` in a try-with-resources.
… use your `Set` of `Movie` objects.
If you want to track success/failure, capture and collect the Future object returned by each call to executorService.submit( task ). The code above ignores that return value for simplicity of the demo.
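A sketch of that variation, reusing the loop above (exception handling kept minimal):
List< Future< ? > > futures = new ArrayList<>() ;
for ( String inputJson : inputListsOfMoviesAsJson )
{
    Runnable task = () -> movies.addAll( this.parseJsonIntoSetOfMovies( inputJson ) ) ;
    futures.add( executorService.submit( task ) ) ;  // Capture the Future rather than discarding it.
}
// After the executor has closed, ask each Future whether its task succeeded.
for ( Future< ? > future : futures )
{
    try { future.get() ; }  // Throws ExecutionException if that task threw an exception.
    catch ( InterruptedException | ExecutionException e ) { e.printStackTrace() ; }
}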
As to your question about accumulating a list of lists of resulting Movie objects versus merging them into one collection later, I do not think collecting those objects will be a bottleneck. My guess is that processing the JSON will be the bottleneck. Either way, using profiler tools to verify your actual bottlenecks will likely be easier with the simpler coding possible when using Project Loom.
In the code above, I use a Set made thread-safe by a call to Collections.synchronizedSet. You could try various implementations of Set or List. A list might be faster, but a set has the benefit of eliminating duplicates, if that is an issue in your data inputs.
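And if you stay with the original Question's CompletableFuture pipeline instead, here is a minimal sketch of merging the nested lists into one flat list, building on the parseJsonPages method from the Question (the method name here is hypothetical):
public CompletableFuture<List<T>> parseJsonPagesMerged(List<CompletableFuture<String>> jsonPageList)
{
    return parseJsonPages(jsonPageList)
        .thenApply(listOfLists -> listOfLists.stream()
            .flatMap(List::stream)          // merge every inner list into one stream of movies
            .collect(Collectors.toList()));
}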
Caveats
Memory
This approach assumes you have plenty of memory to handle all the JSON work. With virtual threads, all of those inputs might be getting processed at nearly the same time.
In Project Loom, a blocked virtual thread is “parked”, moved aside for another thread to run. So you can have many virtual threads running, even millions.
With conventional platform/kernel threads, a blocked thread does not make way for another thread to start working. So you have few threads running at one time.
So if memory is a constrained resource, you’ll need to take further measures to prevent too many virtual threads from starting the JSON processing.
CPU-bound tasks
Virtual threads (fibers) are appropriate for work that involves blocking code. For purely CPU-bound tasks such as video-encoding, conventional platform/kernel threads are best. If you are doing nothing but processing JSON text already loaded into memory, then virtual threads may not show a benefit if they turn out to be CPU-bound. But I’d give it a try, as a test run is so easy. If you are doing any I/O (logging, accessing files, hitting a database, making network calls) then you will definitely see dramatic performance improvements with virtual threads.
Related code must be thread-safe
Be sure your JSON processing library is built to be thread-safe.
And be sure your parseJsonIntoSetOfMovies method is thread-safe.
Recommended reading
Read the book, Java Concurrency In Practice by Brian Goetz et al.

Related

Question about potential race condition with ParallelStream of Lists

I came across this piece of code that uses Java streams, specifically parallelStream(), in order to collect some data from an Oracle database. See below, where in this case:
range = a list of input IDs
rangeLimit = 1000
rangeLimitedFunction = some function that queries a DB for some content
ForkJoinPool threadPool = new ForkJoinPool(Math.min(Runtime.getRuntime().availableProcessors(), parallelism));
try {
    Optional<C> res = threadPool.submit(new Callable<Optional<C>>() {
        @Override
        public Optional<C> call() throws Exception {
            return splitByLimit(range, rangeLimit).parallelStream()
                .map(rangeLimitedFunction::apply)
                .reduce((list, items) -> {
                    list.addAll(items);
                    return list;
                });
        }
    }).get();
From what I understand, this is how it works:
Split range into chunks of 1000 to feed into the function.
Process each chunk in a thread to return some results.
Aggregate the results into a list of POJOs.
My question is around a potential race condition imposed by trying to reduce into a single list. Is it not possible for many of these threads to be trying to add content to the resulting list and potentially corrupt data?
That depends largely on the implementation of List that's used in this case.
That said, this piece of code would be much better off using flatMap and a collector, to leverage the thread-safety of Java parallel streams and avoid potential pitfalls from non-thread-safe list implementations.
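A sketch of that rewrite, assuming rangeLimitedFunction returns a List of some hypothetical Item type (checked exceptions from get() omitted):
List<Item> res = threadPool.submit(() ->
    splitByLimit(range, rangeLimit).parallelStream()
        .map(rangeLimitedFunction::apply)
        .flatMap(List::stream)              // flatten each chunk's results into one stream
        .collect(Collectors.toList())       // the collector does the thread-safe accumulation
).get();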
That said, parallel streams don't offer much benefit for I/O operations. They target processor-heavy operations and usually only pay off if there are more than roughly 15,000 operations (IIRC; that is, stream iterations times CPU-heavy stream operations), which is kind of rare.

Parallelize a for loop in java

I have a for loop that iterates over a list of collections. Inside the loop, some select/update queries take place on each collection, exclusive of the other collections. Since each collection has a lot of data to process, I would like to parallelize it.
The code snippet looks something like this:
//Some variables that are used within the for loop logic
for (String collection : collections) {
    // Select queries on collection
    // Update queries on collection
}
How can I achieve this in Java?
You can use the parallelStream() method (since Java 8):
collections.parallelStream().forEach(collection -> {
    // Select queries on collection
    // Update queries on collection
});
More information about streams.
Another way to do it is by using Executors:
try
{
    final ExecutorService exec = Executors.newFixedThreadPool(collections.size());
    for (final String collection : collections)
    {
        exec.submit(() -> {
            // Select queries on collection
            // Update queries on collection
        });
    }
    // We want to wait until the jobs are done.
    exec.shutdown(); // Without this, awaitTermination() would always wait out the full timeout.
    final boolean terminated = exec.awaitTermination(500, TimeUnit.MILLISECONDS);
    if (!terminated)
    {
        exec.shutdownNow();
    }
}
catch (final InterruptedException e)
{
    e.printStackTrace();
}
This example is more powerful, since you can easily know when the jobs are done, force termination, and more.
final int numberOfThreads = 32;
final ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);

// List to store the 'handles' (Futures) for all tasks:
final List<Future<MyResult>> futures = new ArrayList<>();

// Schedule one (parallel) task per String from "collections":
for (final String str : collections) {
    futures.add(executor.submit(() -> doSomethingWith(str)));
}

// Wait until all tasks have completed:
for (Future<MyResult> f : futures) {
    MyResult aResult = f.get(); // Blocks until the result of the task is available.
                                // Note: get() throws InterruptedException and
                                // ExecutionException, which must be handled or declared.
    // Optionally do something with the result...
}

executor.shutdown(); // Release the threads held by the executor.
// At this point all tasks have ended and we can continue as if they were executed sequentially.
Adjust the numberOfThreads as needed to achieve the best throughput. More threads will tend to utilize the local CPU better, but may cause more overhead at the remote end. To get good local CPU utilization, you want to have (much) more threads than CPUs (/cores) so that, whenever one thread has to wait, e.g. for a response from the DB, another thread can be switched in to execute on the CPU.
There are a number of questions that you need to ask yourself to find the right answer:
If I have as many threads as the number of my CPU cores, would that be enough?
Using parallelStream() will give you roughly as many worker threads as your CPU has cores (it runs on the common ForkJoinPool).
Will parallelizing the loop give me a performance boost or is there a bottleneck on the DB?
You could spin up 100 threads, processing in parallel, but this doesn't mean that you will do things 100 times faster, if your DB or the network cannot handle the volume. DB locking can also be an issue here.
Do I need to process my data in a specific order?
If you have to process your data in a specific order, this may limit your choices. E.g. forEach() doesn't guarantee that the elements of your collection will be processed in a specific order, but forEachOrdered() does (with a performance cost).
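For example, with the collections list from the question:
// May process/print the elements in any order:
collections.parallelStream().forEach(System.out::println);

// Preserves the encounter order, at some cost to parallel speed-up:
collections.parallelStream().forEachOrdered(System.out::println);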
Is my datasource capable of fetching data reactively?
There are cases where your datasource can provide data in the form of a stream. In that case, you can process that stream using a technology such as RxJava or WebFlux, which would enable you to take a different approach to your problem.
Having said all the above, choose the approach (executors, RxJava, etc.) that fits your purpose best.

Can we improve performance on lists other than java 8 parallel streams

I have to dump data from somewhere by calling a REST API which returns a List.
First I have to get a List of objects from one REST API. Then, using a parallel stream, I go through each item with forEach.
For each element, I have to call some other API to get the data, which again returns a list, and save that list by calling yet another REST API.
This is taking around 1 hour for the 6000 records of step 1.
I tried something like this:
restApiMethodWhichReturns6000Records
    .parallelStream()
    .forEach(id -> anotherMethodWhichgetsSomeDataAndPostsToOtherRestCall(id));

public void anotherMethodWhichgetsSomeDataAndPostsToOtherRestCall(String id) {
    restApiToPostData(url, methodThatGetsListOfData(id));
}
parallelStream can sometimes cause unexpected behavior. It uses the common ForkJoinPool, so if you have parallel streams elsewhere in the code, long-running tasks may block them. Even within the same stream, if some tasks take a long time, all the worker threads can end up blocked.
There is a good discussion of this on Stack Overflow, where you can see some tricks for assigning a task-specific ForkJoinPool.
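A minimal sketch of that trick, reusing the names from the question (checked exceptions from get() omitted): a parallel stream started from inside a task running in your own ForkJoinPool is executed by that pool's workers rather than the common pool.
ForkJoinPool customPool = new ForkJoinPool(8); // dedicated pool, sized independently of the common pool
customPool.submit(() ->
    restApiMethodWhichReturns6000Records.parallelStream()
        .forEach(id -> anotherMethodWhichgetsSomeDataAndPostsToOtherRestCall(id))
).get(); // wait for the stream work to finish
customPool.shutdown();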
First of all, make sure your REST service is non-blocking.
One more thing you can do is to play with the pool size by supplying -Djava.util.concurrent.ForkJoinPool.common.parallelism=4 to the JVM.
If the API calls are blocking, then even when you run them in parallel, you will only be able to make a few calls at a time.
I would try out a solution using CompletableFuture.
The code would be something like this:
List<CompletableFuture<?>> apiCallsFutures = restApiMethodWhichReturns6000Records
    .stream()
    .map(id -> CompletableFuture.supplyAsync(() -> getListOfData(id))       // map the "get list of data" call to a CompletableFuture
        .thenApply(listOfData -> callAPItoPOSTData(url, listOfData)))       // when the get completes, perform the post call
    .collect(Collectors.toList());

CompletableFuture[] completableFutures = apiCallsFutures.toArray(new CompletableFuture[apiCallsFutures.size()]); // CompletableFuture.allOf accepts only arrays :(
CompletableFuture<Void> all = CompletableFuture.allOf(completableFutures); // Combine all the futures
all.get(); // Block until all calls have completed (throws checked exceptions)
For more details about CompletableFutures, have a look over: https://www.baeldung.com/java-completablefuture

Using Threads in Loop

I have a for loop which needs to execute 36000 times
for (int i = 0; i < 36000; i++)
{
}
Is it possible to use multiple threads in order to execute the loop's iterations at the same time and finish faster?
Please suggest how to do it.
If you want a more explicit method, you can use thread pools with Thread, Callable or Runnable. See my answers here for examples:
Java : a method to do multiple calculations on arrays quickly
Thread won't naturally exit at end of run()
I do not recommend using Java's Fork/Join framework directly, as it is not as great as it was hyped to be; performance is pretty bad. Instead, I would use Java 8's map and parallel streams if you want to make it easy. You have several options using this method.
IntStream.range(1, 4)
    .mapToObj(i -> "testing " + i)
    .forEach(System.out::println);
You would want to call map(lambda); Java 8 finally brings lambda functions. It is possible to feed the stream one huge list, but there will be a performance impact. IntStream.range will do what you want. Then you need to figure out which of the new functions you want to use, like filter, map, count, sum, reduce, etc. You may also have to tell the stream explicitly that you want it to be parallel. See these links:
https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/
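For instance, a minimal sketch of requesting a parallel stream for the loop in the question, where doWork stands in for your loop body:
IntStream.range(0, 36000)
    .parallel()               // ask for a parallel stream
    .forEach(i -> doWork(i)); // iterations may run on different worker threads, in no particular order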
The classic method, which still has the best performance, is to do it yourself using a thread pool:
Basically, you create a Runnable (does not return anything) or Callable (returns a result) object that will do some work on one of the threads in the pool. The pool handles scheduling, which is great for us. Java has several options for the pool you use. You can create a Runnable/Callable in a loop, then submit it into the pool. The pool immediately returns a Future object that represents the task. You can add that Future to an ArrayList if you have many of these. After adding all the futures to the list, loop through them and call future.get(), which will wait for the end of execution. See the linked example above, which does not use a list, but does everything else I said.
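A sketch of that pattern applied to the 36000-iteration loop, where process(i) is a stand-in for your loop body (checked exceptions from get() omitted):
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<?>> futures = new ArrayList<>();
for (int i = 0; i < 36000; i++) {
    final int index = i;                            // effectively final copy for the lambda
    futures.add(pool.submit(() -> process(index))); // one task per iteration
}
for (Future<?> f : futures) {
    f.get();                                        // block until that task completes
}
pool.shutdown();                                    // release the pool's threads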

Processing sub-streams of a stream in Java using executors

I have a program that processes a huge stream (not in the sense of java.util.stream, but rather InputStream) of data coming in through the network. The stream consists of objects, each having a sort of sub-stream identifier. Right now the whole processing is done in a single thread, but it takes a lot of CPU time and each sub-stream can easily be processed independently, so I'm thinking of multi-threading it.
However, each sub-stream requires keeping a lot of bulky state, including various buffers, hash maps and such. There is no particular reason to make it concurrent or synchronized, since sub-streams are independent of each other. Moreover, each sub-stream requires that its objects are processed in the order they arrive, which means there should probably be a single thread for each sub-stream (but possibly one thread processing multiple sub-streams).
I'm thinking of several approaches to this, but they are not quite elegant.
1. Create a single ThreadPoolExecutor for all tasks. Each task will contain the next object to process and a reference to a Processor instance which keeps all the state. That would ensure the necessary happens-before relationship, thus ensuring that the processing thread sees the up-to-date state for this sub-stream. This approach has no way to make sure that the next object of the same sub-stream will be processed in the same thread, as far as I can see. Moreover, it needs some guarantee that objects will be processed in the order they come in, which will require additional synchronization of the Processor objects, introducing unnecessary delays.
2. Create multiple single-thread executors manually, and a sort of hash map that maps sub-stream identifiers to executors. This approach requires manual management of the executors, creating or shutting them down as new sub-streams begin or end, and distributing the tasks between them accordingly.
3. Create a custom executor that processes a special subclass of tasks, each having a sub-stream ID. This executor would use the ID as a hint to use the same thread for executing this task as the previous one with the same ID. However, I don't see an easy way to implement such an executor. Unfortunately, it doesn't seem possible to extend any of the existing executor classes, and implementing an executor from scratch is kind of overkill.
4. Create a single ThreadPoolExecutor, but instead of creating a task for each incoming object, create a single long-running task for each sub-stream that blocks on a concurrent queue, waiting for the next object; then put objects into queues according to their sub-stream IDs (see the sketch after this list). This approach needs as many threads as there are sub-streams, because the tasks will be blocked. The expected number of sub-streams is about 30-60, so that may be acceptable.
5. Alternatively, proceed as in 4, but limit the number of threads, assigning multiple sub-streams to a single task. This is sort of a hybrid between 2 and 4. As far as I can see, this is the best of these approaches, but it still requires some sort of manual sub-stream distribution between tasks and some way to shut the extra tasks down as sub-streams end.
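A minimal sketch of approach 4, reusing the Item, Processor, and getProcessor names from the pseudo-code below; the process method on Processor is an assumption:
private final Map<Integer, BlockingQueue<Item>> queues = new ConcurrentHashMap<>();
private final ExecutorService pool = Executors.newCachedThreadPool();

void dispatch(Item next) throws InterruptedException {
    int id = next.getSubstreamID();
    BlockingQueue<Item> queue = queues.computeIfAbsent(id, key -> {
        BlockingQueue<Item> q = new LinkedBlockingQueue<>();
        pool.submit(() -> {                           // one long-running task per sub-stream
            Processor processor = getProcessor(key);  // bulky state stays confined to this task
            try {
                while (true) {
                    processor.process(q.take());      // take() blocks until the next object arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // exit when the pool is shut down
            }
        });
        return q;
    });
    queue.put(next);                                  // objects are enqueued in arrival order
}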
What would be the best way to ensure that each sub-stream is processed in its own thread without a lot of error-prone code? So that the following pseudo-code will work:
// loop {
    Item next = stream.read();
    int id = next.getSubstreamID();
    Processor processor = getProcessor(id);
    SubstreamTask task = new SubstreamTask(processor, next, id);
    executor.submit(task); // This makes sure that the task will
                           // be executed in the same thread as the
                           // previous task with the same ID.
// } // loop
I suggest having an array of single-threaded executors. If you can devise a consistent hashing strategy for sub-streams, you can map sub-streams to individual threads, e.g.
final ExecutorService[] es = ...

public void submit(int id, Runnable run) {
    es[(id & 0x7FFFFFFF) % es.length].submit(run);
}
The key could be a String or a long, but you need some way to identify the sub-stream. If you know a particular sub-stream is very expensive, you could assign it a dedicated thread.
The solution I finally chose looks like this:
private final Executor[] streamThreads
    = new Executor[Runtime.getRuntime().availableProcessors()];
{
    for (int i = 0; i < streamThreads.length; ++i) {
        streamThreads[i] = Executors.newSingleThreadExecutor();
    }
}

private final ConcurrentHashMap<SubstreamId, Integer>
    threadById = new ConcurrentHashMap<>();
This code determines which executor to use:
Message msg = in.readNext();
SubstreamId msgSubstream = msg.getSubstreamId();
int exe = threadById.computeIfAbsent(msgSubstream,
                                     id -> findBestExecutor());
streamThreads[exe].execute(() -> {
    // processing goes here
});
And the findBestExecutor() function is this:
private int findBestExecutor() {
    // Thread index -> substream count mapping:
    final int[] loads = new int[streamThreads.length];
    for (int thread : threadById.values()) {
        ++loads[thread];
    }
    // Return the index of the minimum load:
    return IntStream.range(0, streamThreads.length)
        .reduce((i, j) -> loads[i] <= loads[j] ? i : j)
        .orElse(0);
}
This is, of course, not very efficient, but note that this function is only called when a new sub-stream shows up (which happens several times every few hours, so it's not a big deal in my case). My real code looks a bit more complicated because I have a way to determine whether two sub-streams are likely to finish simultaneously, and if they are, I try to assign them to different threads in order to maintain even load after they do finish. But since I never mentioned this detail in the question, I guess it doesn't belong to the answer either.
