I'm not an expert in Spark, and I'm using Spark to do some calculations.
// [userId, lastPurchaseLevel]
JavaPairRDD<String, Integer> lastPurchaseLevels =
    levels.groupByKey()
          .join(purchases.groupByKey())
          .mapValues(t -> getLastPurchaseLevel(t));
And inside the getLastPurchaseLevel() function, I have code like this:
private static Integer getLastPurchaseLevel(Tuple2<Iterable<SourceLevelRecord>, Iterable<PurchaseRecord>> t) {
    ....
    final Comparator<PurchaseRecord> comp = (a, b) -> Long.compare(a.dateMsec, b.dateMsec);
    PurchaseRecord latestPurchase = purchaseList.stream().max(comp).get();
But my boss told me not to use stream(); he said:
We'd better do it the classic way, because there are no CPU cores remaining to do the streaming -- all CPUs are used by Spark workers already.
I know the classic way is to iterate through and find the max. So does stream() cause more CPU consumption or overhead than the classic way, or does that only matter in this kind of Spark context?
We'd better do it the classic way, because there are no CPU cores remaining to do the streaming -- all CPUs are used by Spark workers already.
Your boss's point: Spark already schedules tasks onto threads (or CPU cores), so there is no need to do things concurrently inside a single task.
... so does stream() cause more CPU consumption or overhead than the classic way, or does that only matter in this kind of Spark context?
A Java stream is single-threaded unless otherwise specified (by calling the Stream.parallel() method). So as long as you don't parallelize the stream, your boss has nothing to complain about: a sequential stream does its work on the calling thread, just like a hand-written loop.
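To make that concrete, here is a minimal sketch (reusing the question's PurchaseRecord and purchaseList) showing that both versions do the same work on the same single thread; the stream adds only a small constant pipeline overhead:

// Sequential stream -- runs entirely on the calling thread:
PurchaseRecord latestViaStream = purchaseList.stream()
        .max(Comparator.comparingLong(p -> p.dateMsec))
        .get();

// "Classic" loop -- same work, same single thread:
PurchaseRecord latestViaLoop = purchaseList.get(0);
for (PurchaseRecord p : purchaseList) {
    if (p.dateMsec > latestViaLoop.dateMsec) {
        latestViaLoop = p;
    }
}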
I have a list of JSON strings, each containing a list of movies. I need to collect those movies, process them, and store them on disk. I am thinking of using a parallel stream to collect the movies and test its performance. My approach is this:
The following method produces a List of Movies.
protected abstract List<T> parseJsonString(JsonIterator iter);
This method contains a parallel stream that collects a list of all the lists (List<List<Movie>>) produced in the stream:
public CompletableFuture<List<List<T>>> parseJsonPages(List<CompletableFuture<String>> jsonPageList)
{
    return jsonPageList.parallelStream()
            .map( jsonPageStr -> CompletableFuture.supplyAsync( () -> {
                try {
                    return parseJsonString( JsonIterator.parse( jsonPageStr.get() ) );
                }
                catch (InterruptedException | ExecutionException e) {
                    e.printStackTrace();
                    System.exit(-1);
                }
                return null;
            } ) )
            .collect( ParallelCollectors.toFuture( Collectors.toList() ) );
}
The problem with this approach is that the stream produces the lists of movies and then appends each entire list to an outer list. Do you think this is an effective way of collecting all those movies? Should I merge the movies from all lists into one list, instead of just appending the entire lists inside a list (even though that also costs some time)? If so, how do I perform such a task?
Thanks in advance.
Project Loom
In the future, when Project Loom arrives with its virtual threads, it will be much simpler, and likely faster, to simply assign each task to a virtual thread.
Preliminary builds of Project Loom are available now, built on early-access Java 16. Though subject to change, and not ready for production, if this is a non-mission-critical personal project, you might consider using it now.
By the way, your Movie class might be suitable to define as a record, one of the features coming in Java 16.
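For example, a minimal sketch with hypothetical fields (adjust to match your actual JSON):

record Movie( String title , int releaseYear ) {}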
List< String > inputListsOfMoviesAsJson = … ; // Input.
Set< Movie > movies = Set.of() ; // Output. Default to unmodifiable empty `Set`.
try
(
ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
)
{
movies = Collections.synchronizedSet( new HashSet< Movie >() ) ;
for( String inputJson : inputListsOfMoviesAsJson )
{
Runnable task = () -> movies.addAll( this.parseJsonIntoSetOfMovies( inputJson ) ) ;
executorService.submit( task ) ;
}
}
// At this point, flow-of-control blocks until all tasks are done.
// Then the executor service is automatically shutdown as part of being closed, as an `AutoCloseable` in a try-with-resources.
… use your `Set` of `Movie` objects.
If you want to track success/failure, then capture and collect the Future object returned by each call to executorService.submit( task ). The code above ignores that return value, for simplicity of the demo.
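A minimal sketch of that variation, still assuming the preliminary-Loom Executors.newVirtualThreadExecutor API used above; each returned Future is collected, then checked after the executor closes:

List< Future< Boolean > > futures = new ArrayList<>() ;
try
(
    ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
)
{
    for( String inputJson : inputListsOfMoviesAsJson )
    {
        futures.add( executorService.submit( ( ) -> movies.addAll( this.parseJsonIntoSetOfMovies( inputJson ) ) ) ) ;
    }
}
for( Future< Boolean > future : futures )
{
    try { future.get() ; }  // Throws ExecutionException if that task failed.
    catch ( InterruptedException | ExecutionException e ) { e.printStackTrace() ; }
}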
As to your question about accumulating a list of resulting Movie objects versus merging later, I do not think collecting those objects will be the bottleneck. My guess is that processing the JSON will be the bottleneck. Either way, using profiler tools to verify your actual bottlenecks will likely be easier with the simpler coding made possible by Project Loom.
In the code above, I use a Set made thread-safe by a call to Collections.synchronizedSet. You could try various implementations of Set or List. A list might be faster, but a set has the benefit of eliminating duplicates, if that is an issue in your data inputs.
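And if you stay with your original stream-based approach, merging the per-page lists into one flat list afterwards is a one-liner with flatMap; a minimal sketch:

List< List< Movie > > pages = … ;  // e.g. the result of your `parseJsonPages` future.
List< Movie > allMovies =
        pages.stream()
             .flatMap( List::stream )
             .collect( Collectors.toList() ) ;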
Caveats
Memory
This approach assumes you have plenty of memory to handle all the JSON work. With virtual threads, all of those inputs might be getting processed at nearly the same time.
In Project Loom, a blocked virtual thread is “parked”, moved aside for another thread to run. So you can have many virtual threads running, even millions.
With conventional platform/kernel threads, a blocked thread does not make way for another thread to start working. So you have few threads running at one time.
So if memory is a constrained resource, you’ll need to take further measures to prevent too many virtual threads from starting the JSON processing.
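One such measure (my suggestion, not something provided by Project Loom itself) is a plain java.util.concurrent.Semaphore that caps how many tasks do the heavy work at once; a sketch, where inputJson is the loop variable from the code above:

Semaphore permits = new Semaphore( 100 ) ;  // At most 100 parses in flight; tune to your heap.
Runnable task = ( ) -> {
    try
    {
        permits.acquire() ;  // Blocks (parking the virtual thread) when 100 are already running.
        try { movies.addAll( this.parseJsonIntoSetOfMovies( inputJson ) ) ; }
        finally { permits.release() ; }
    }
    catch ( InterruptedException e ) { Thread.currentThread().interrupt() ; }
} ;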
CPU-bound tasks
Virtual threads (fibers) are appropriate for work that involves blocking code. For purely CPU-bound tasks such as video-encoding, conventional platform/kernel threads are best. If you are doing nothing but processing JSON text already loaded into memory, then virtual threads may not show a benefit if they turn out to be CPU-bound. But I’d give it a try, as a test run is so easy. If you are doing any I/O (logging, accessing files, hitting a database, making network calls) then you will definitely see dramatic performance improvements with virtual threads.
Related code must be thread-safe
Be sure your JSON processing library is built to be thread-safe.
And be sure your parseJsonIntoSetOfMovies method is thread-safe.
Recommended reading
Read the book, Java Concurrency In Practice by Brian Goetz et al.
I have a for loop that iterates over a list of collections. Inside the loop, some select/update queries run on each collection, independently of the other collections. Since each collection has a lot of data to process, I would like to parallelize it.
The code snippet looks something like this:
//Some variables that are used within the for loop logic
for (String collection : collections) {
    //Select queries on collection
    //Update queries on collection
}
How can I achieve this in Java?
You can use the parallelStream() method (since Java 8):
collections.parallelStream().forEach((collection) -> {
    //Select queries on collection
    //Update queries on collection
});
More information about streams.
Another way to do it is using Executors:
try
{
    final ExecutorService exec = Executors.newFixedThreadPool(collections.size());
    for (final String collection : collections)
    {
        exec.submit(() -> {
            // Select queries on collection
            // Update queries on collection
        });
    }
    // Stop accepting new jobs, then wait for the submitted ones to finish.
    // (Without shutdown(), awaitTermination would always time out.)
    exec.shutdown();
    final boolean terminated = exec.awaitTermination(500, TimeUnit.MILLISECONDS);
    if (terminated == false)
    {
        exec.shutdownNow();
    }
}
catch (final InterruptedException e)
{
    e.printStackTrace();
}
This example is more powerful, since you can easily know when the jobs are done, force termination, and more.
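If you would rather block until every job is actually done instead of guessing a timeout, invokeAll is an alternative; a minimal sketch:

final ExecutorService exec = Executors.newFixedThreadPool(collections.size());
final List<Callable<Void>> jobs = new ArrayList<>();
for (final String collection : collections)
{
    jobs.add(() -> {
        // Select queries on collection
        // Update queries on collection
        return null;
    });
}
try
{
    exec.invokeAll(jobs); // Blocks until every job has completed (or thrown).
}
catch (final InterruptedException e)
{
    e.printStackTrace();
}
finally
{
    exec.shutdown();
}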
final int numberOfThreads = 32;
final ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
// List to store the 'handles' (Futures) for all tasks:
final List<Future<MyResult>> futures = new ArrayList<>();
// Schedule one (parallel) task per String from "collections":
for (final String str : collections) {
    futures.add(executor.submit(() -> { return doSomethingWith(str); }));
}
// Wait until all tasks have completed:
for (Future<MyResult> f : futures) {
    MyResult aResult = f.get(); // Will block until the result of the task is available.
    // Optionally do something with the result...
}
executor.shutdown(); // Release the threads held by the executor.
// At this point all tasks have ended and we can continue as if they were all executed sequentially.
Adjust the numberOfThreads as needed to achieve the best throughput. More threads will tend to utilize the local CPU better, but may cause more overhead at the remote end. To get good local CPU utilization, you want to have (much) more threads than CPUs (/cores) so that, whenever one thread has to wait, e.g. for a response from the DB, another thread can be switched in to execute on the CPU.
There are a number of questions that you need to ask yourself to find the right answer:
If I have as many threads as the number of my CPU cores, would that be enough?
Using parallelStream() will run your work on the common ForkJoinPool, which by default has about as many threads as your CPU cores.
Will parallelizing the loop give me a performance boost or is there a bottleneck on the DB?
You could spin up 100 threads, processing in parallel, but this doesn't mean that you will do things 100 times faster, if your DB or the network cannot handle the volume. DB locking can also be an issue here.
Do I need to process my data in a specific order?
If you have to process your data in a specific order, this may limit your choices. E.g. forEach() doesn't guarantee that the elements of your collection will be processed in a specific order, but forEachOrdered() does (with a performance cost).
Is my datasource capable of fetching data reactively?
There are cases when our datasource can provide data in the form of a stream. In that case, you can always process this stream using a technology such as RxJava or WebFlux. This would enable you to take a different approach on your problem.
Having said all the above, you can choose the approach (executors, RxJava, etc.) that best fits your purpose.
I wrote code using Java 8 streams and parallel streams for the same functionality with a custom collector to perform an aggregation function.
When I look at CPU usage with htop, it shows all CPU cores being used for both the 'streams' and 'parallel streams' versions. So it seems that when list.stream() is used, it also uses all CPUs. What is the precise difference between parallelStream() and stream() in terms of multi-core usage?
Consider the following program:
import java.util.ArrayList;
import java.util.List;

public class Foo {
    public static void main(String... args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
        list.stream().forEach(System.out::println);
    }
}
You will notice that this program outputs the numbers from 0 to 999 sequentially, in the order in which they appear in the list. If we change stream() to parallelStream(), this is no longer the case (at least on my computer): all numbers are written, but in a different order. So, apparently, parallelStream() indeed uses multiple threads.
The htop output is explained by the fact that even single-threaded applications are divided over multiple cores by most modern operating systems (parts of the same thread may run on several cores, though of course not at the same time). So if you see that a process uses more than one core, this does not necessarily mean that the program uses multiple threads.
Also, performance may not improve when using multiple threads. The cost of synchronization may nullify the gains of using multiple threads. For simple testing scenarios this is often the case. For example, in the program above, System.out is synchronized. So, effectively, only one number can be written at a time, even though multiple threads are used.
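One way to see the difference directly (a small test sketch, not from the question) is to record which threads actually execute the pipeline:

// Sequential: only "main" appears. Parallel: "main" plus ForkJoinPool.commonPool-worker-* threads.
Set<String> seq = IntStream.range(0, 1000).boxed()
        .map(i -> Thread.currentThread().getName())
        .collect(Collectors.toSet());
Set<String> par = IntStream.range(0, 1000).boxed().parallel()
        .map(i -> Thread.currentThread().getName())
        .collect(Collectors.toSet());
System.out.println(seq);  // [main]
System.out.println(par);  // e.g. [main, ForkJoinPool.commonPool-worker-1, ...]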
Adding to @Hoopje's answer:
Before using parallelStream(), read this:
It is multi-threaded. Just writing parallelStream() to get parallelism is almost always a bad idea in Java. There are some cases where it will work, but not always. There are other ways to achieve parallelism, and you almost always need to think carefully before adopting a multi-threaded solution.
It uses the JVM's default thread pool (the common ForkJoinPool). So if you are doing any blocking operation, such as a network call, the entire Java application can stall. That's the biggest problem there. There are other issues with task allocation as well. A simple ExecutorService with n threads often provides better performance than parallel streams, as sketched below.
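A sketch of that alternative, where the blocking work gets its own pool so the common ForkJoinPool stays free (urls and fetch are hypothetical placeholders for any blocking call):

ExecutorService pool = Executors.newFixedThreadPool(8); // Size for the workload, not just core count.
List<Future<String>> results = new ArrayList<>();
for (String url : urls) {
    results.add(pool.submit(() -> fetch(url))); // fetch(...) stands in for any blocking operation.
}
for (Future<String> f : results) {
    try {
        System.out.println(f.get()); // Blocks only this caller, not the common pool.
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }
}
pool.shutdown();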
You can also read:
Java Parallel Streams Are Bad for Your Health! | JRebel by Perforce
I have the following code,
import java.util.Arrays;

public class ParellelStream {
    public static void main(String args[]) {
        Double dbl[] = new Double[1000000];
        for (int i = 0; i < dbl.length; i++) {
            dbl[i] = Math.random();
        }
        long start = System.currentTimeMillis();
        Arrays.parallelSort(dbl);
        System.out.println("time taken :" + ((System.currentTimeMillis()) - start));
    }
}
When I run this code it takes approx. 700 to 800 ms, but when I replace the line Arrays.parallelSort with Arrays.sort it takes 500 to 600 ms. I have read about the Arrays.parallelSort and Arrays.sort methods, which say that Arrays.parallelSort gives poor performance when the dataset is small, but here I am using an array of 1000000 elements. What could be the reason for parallelSort's poor performance? I am using Java 8.
The parallelSort function will use a thread for each CPU core you have on your machine. Specifically, parallelSort runs its tasks on the common ForkJoinPool. If you only have one core, you will not see an improvement over the single-threaded sort.
If you have multiple cores, there is some upfront cost associated with creating the new threads, which means that for relatively small arrays you are not going to see linear performance gains.
The compare function for doubles is not expensive. I think that in this case 1000000 elements can safely be considered small, and the benefits of using multiple threads are outweighed by the upfront costs of creating those threads. Since the upfront costs are fixed, you should see a performance gain with larger arrays.
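One more thing worth checking: timing a single run with System.currentTimeMillis is heavily skewed by JIT warm-up. A still-crude but fairer sketch repeats the measurement (for serious numbers, use a harness such as JMH):

import java.util.Arrays;

public class SortBench {
    public static void main(String[] args) {
        for (int run = 0; run < 10; run++) { // Repeat so the JIT can warm up.
            Double[] dbl = new Double[1000000];
            for (int i = 0; i < dbl.length; i++) {
                dbl[i] = Math.random();
            }
            long start = System.nanoTime();
            Arrays.parallelSort(dbl); // Swap in Arrays.sort(dbl) to compare.
            System.out.println("run " + run + ": " + (System.nanoTime() - start) / 1000000 + " ms");
        }
    }
}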
I read about the Arrays.parallelSort and Arrays.sort methods, which say that Arrays.parallelSort gives poor performance when the dataset is small, but here I am using an array of 1000000 elements.
This is not the only thing to take into consideration. It depends a lot on your machine (how your CPU handles multi-threading, etc.).
Here is a quote from the Parallelism part of The Java Tutorials:
Note that parallelism is not automatically faster than performing operations serially, although it can be if you have enough data and processor cores [...] it is still your responsibility to determine if your application is suitable for parallelism.
You might also want to have a look at the code of java.util.ArraysParallelSortHelpers for a better understanding of the algorithm.
Note that the parallelSort method uses the ForkJoinPool introduced in Java 7 to take advantage of all the processors of your computer, as stated in the javadoc:
A ForkJoinPool is constructed with a given target parallelism level; by default, equal to the number of available processors.
Note that if the array has fewer than 1 << 13 (8192) elements, it will be sorted using the appropriate Arrays.sort method.
See also
Fork/Join
I used a ConcurrentHashMap to create a matrix. Its indices range up to 100k. I have created 40 threads. Each thread accesses elements of the matrix, modifies them, and writes them back, like this:
ConcurrentHashMap<Integer, ArrayList<Double>> matrix =
    new ConcurrentHashMap<Integer, ArrayList<Double>>(25);

for (Entry<Integer, ArrayList<Double>> entry : matrix.entrySet())
    upDateEntriesOfValue(entry.getValue());
I have not found this to be thread-safe. Values are frequently returned as null and my program crashes. Is there any other way to make it thread-safe? Or is this thread-safe and I have a bug somewhere else? One thing to note: my program does not crash in single-threaded mode.
The iterator is indeed thread-safe for the ConcurrentHashMap.
But what is not thread-safe in your code is the ArrayList<Double> you seem to update! Your problems might come from this data structure.
You may want to use a concurrent data structure adapted to your needs.
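For example, a sketch of two common options for the row lists, depending on whether writes or reads dominate:

// Declare the map values as List<Double> so either implementation can be stored:
ConcurrentHashMap<Integer, List<Double>> matrix = new ConcurrentHashMap<>();

// Option 1: a synchronized wrapper -- every access goes through one lock per row.
matrix.put(0, Collections.synchronizedList(new ArrayList<Double>()));

// Option 2: CopyOnWriteArrayList -- cheap reads, expensive writes; good when reads dominate.
matrix.put(1, new CopyOnWriteArrayList<Double>());

Note that even with a synchronized wrapper, a compound read-modify-write sequence still needs an explicit synchronized block around the whole sequence.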
Using a map for a matrix is really inefficient, and the way you have used it, it won't even support sparse arrays particularly well.
I suggest you use a double[][] where you lock each row (or column, if that is better). If the matrix is small enough, you may be better off using only one CPU, as this can save you quite a bit of overhead.
I would suggest you create no more threads than you have cores. For CPU-intensive tasks, using more threads can be slower, not faster.
The matrix is 100k x 50 at most.
EDIT: Depending on the operation you are performing, I would try to ensure you have the shorter dimension first so you can process each long dimension in a different thread efficiently.
e.g.
double[][] matrix = new double[50][100 * 1000];
for (int i = 0; i < matrix.length; i++) {
    final double[] line = matrix[i];
    executorService.submit(new Runnable() {
        public void run() {
            synchronized (line) {
                processOneLine(line);
            }
        }
    });
}
This allows all your threads to run concurrently because they don't share any data structures. They can also access each double efficiently because the values are contiguous in memory and stored as compactly as possible; i.e. 100K doubles use about 800 KB, whereas a List<Double> uses about 2800 KB, and each value can be placed randomly in memory, which means your cache has to work much harder.
Thanks, but in fact I have 80 cores in total.
To use 80 cores efficiently, you might want to break the longer lines in two or four so you can keep all the cores busy, or find a way to perform more than one operation at a time.
The ConcurrentHashMap will be thread-safe for accesses into the Map, but the Lists served out also need to be thread-safe if multiple threads can operate on the same List instances concurrently, so use a thread-safe list when modifying.
In your case, operations on the ConcurrentHashMap itself are thread-safe, but the ArrayList values are not synchronized, so multiple threads can access one simultaneously, which makes the code non-thread-safe. You can use a synchronized block wherever you modify a list, as sketched below.
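A minimal sketch of that synchronized-block approach, reusing the question's upDateEntriesOfValue; every thread that touches a row's list must synchronize on that same list instance:

for (Entry<Integer, ArrayList<Double>> entry : matrix.entrySet()) {
    ArrayList<Double> row = entry.getValue();
    synchronized (row) { // All readers and writers must lock the same list object.
        upDateEntriesOfValue(row);
    }
}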