I wrote code using Java 8 streams and parallel streams for the same functionality with a custom collector to perform an aggregation function.
When I see CPU usage using htop, it shows all CPU cores being used for both 'streams' and 'parallel streams' version. So, it seems when list.stream() is used, it also uses all CPUs. Here, what is the precise difference between parallelStream() and stream() in terms of usage of multi-core.
Consider the following program:
import java.util.ArrayList;
import java.util.List;
public class Foo {
public static void main(String... args) {
List<Integer> list = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
list.add(i);
}
list.stream().forEach(System.out::println);
}
}
You will notice that this program will output the numbers from 0 to 999 sequentially, in the order in which they are in the list. If we change stream() to parallelStream() this is not the case anymore (at least on my computer): all number are written, but in a different order. So, apparently, parallelStream() indeed uses multiple threads.
The htop is explained by the fact that even single-threaded applications are divided over mutliple cores by most modern operating systems (parts of the same thread may run on several cores, but of course not at the same time). So if you see that a process used more than one core, this does not mean necessarily that the program uses multiple threads.
Also the performance may not improve when using multiple threads. The cost of synchronization may nihilite the gains of using multiple threads. For simple testing scenarios this is often the case. For example, in the above example, System.out is synchronized. So, effectively, only number can be written at the same time, although multiple threads are used.
adding to #Hoopje 's answer:
Before using parallelStream (), Read this:
It is multi-threaded. Just writing parallelStream() to get parallelism is almost always bad idea in java. There are some cases where it will work, but not always. There are other ways to achieve parallelism and almost always, you need to think a lot before taking a multi-thread solution .
It uses the default JVM thread pool. So, if you are doing any blocking operation such as network call, the entire java application can get stuck. Thats the biggest problem there. There are other ones with task allocation as well. A simple ExecutionService with n threads provides better performance that parallel streams.
You can also read:
Java Parallel Streams Are Bad for Your Health! | JRebel by Perforce
Related
I'm working on an existing Java codebase which has an object that extends Thread, and also contains a number of other properties and methods that pertain to the operation of the thread itself. In former versions of the codebase, the thread was always an actual, heavyweight Thread that was started with Thread.start, waited for with Thread.join, and the like.
I'm currently refactoring the codebase, and in the present version, the object's Thread functionality is not always needed (but the object itself is, due to the other functionality contained in the object; in many cases, it's usable even when the thread itself is not running). So there are situations in which the application creates these objects (which extend Thread) and never calls .start() on them, purely using them for their other properties and methods.
In the future, the application may need to create many more of these objects than previously, to the point where I potentially need to worry about performance. Obviously, creating and starting a large number of actual threads would be a performance nightmare. Does the same thing apply to Thread objects that are never started? That is, are any operating system resources, or large Java resources, required purely to create a Thread? Or are the resources used only when the Thread is actually .started, making unstarted Thread objects safe to use in quantity? It would be possible to refactor the code to split the non-threading-related functionality into a separate function, but I don't want to do a large refactoring if it's entirely pointless to do so.
I've attempted to determine the answer to this with a few web searches, but it's hard to aim the query because search engines can't normally distinguish a Thread object from an actual Java thread.
You could implement Runnable instead of extending Thread.
public class MyRunnableClass implements Runnable {
// Your stuff...
#Override
public void run() {
// Thread-related stuff...
}
}
Whenever you need to run your Object to behave as a Thread, simply use:
Thread t = new Thread(new MyRunnableClass());
t.start();
As the others have pointed out: performance isn't a problem here.
I would focus much more on the "good design" approach. It simply doesn't make (much, any?) sense to extend Thread when you do not intend to ever invoke start(). And you see: you write code to communicate your intentions.
Extending Thread without using it as thread, that only communicates confusion. Every new future reader of your code will wonder "why is that"?
Therefore, focus on getting to a straight forward design. And I would go one step further: don't just turn to Runnable, and continuing to use threads. Instead: learn about ExecutorServices, and how to submit tasks, and Futures, and all that.
"Bare iron" Threads (and Runnables) are like 20 year old concepts. Java has better things to offer by now. So, if you are really serious about improving your code base: look into these new abstraction concepts to figure where they would make sense to be used.
You can create about 1.5 million of these objects per GB of memory.
import java.util.LinkedList;
import java.util.List;
class A {
public static void main(String[] args) {
int count = 0;
try {
List<Thread> threads = new LinkedList<>();
while (true) {
threads.add(new Thread());
if (++count % 10000 == 0)
System.out.println(count);
}
} catch (Error e) {
System.out.println("Got " + e + " after " + count + " threads");
}
}
}
using -Xms1g -Xmx1g for Oracle Java 8, the process grinds to halt at around
1 GB - 1780000
2 GB - 3560000
6 GB - 10690000
The object uses a bit more than you might expect from reading the source code, but it's still about 600 bytes each.
NOTE: Throwable also use more memory than you might expect by reading the Java source. It can be 500 - 2000 bytes more depending on the size of the stack at the time it was created.
I'm not an expert in Spark, and I'm using Spark to do some calculations.
// [userId, lastPurchaseLevel]
JavaPairRDD<String, Integer> lastPurchaseLevels =
levels.groupByKey()
.join(purchases.groupByKey())
.mapValues(t -> getLastPurchaseLevel(t));
And inside the getLastPurchaseLevel() function, I had such code:
private static Integer getLastPurchaseLevel(Tuple2<Iterable<SourceLevelRecord>, Iterable<PurchaseRecord>> t) {
....
final Comparator<PurchaseRecord> comp = (a, b) -> Long.compare(a.dateMsec, b.dateMsec);
PurchaseRecord latestPurchase = purchaseList.stream().max(comp).get();
But my boss told me not to use the stream(), he said:
We better do the classic way because there are no CPU core remains to do the streaming -- all CPUs are used by Spark workers already.
I know the classic way is to iterate through and find the max, so stream will cause more CPU consumption or overhead than the classic way? Or is it only in these kind of Spark context?
We better do the classic way because there are no CPU core remains to do the streaming -- all CPUs are used by Spark workers already.
Your boss's idea: Spark already schedules the tasks to threads ( or cpu cores ), no need to do things concurrently inside single task.
... so stream will cause more CPU consumption or overhead than the classic way? Or is it only in these kind of Spark context?
Java stream is single threaded unless otherwise specified ( by calling Stream.parallel() method ). So as long as you don't parallelize the stream, your boss won't complain.
I have the following code ,
import java.util.Arrays;
public class ParellelStream {
public static void main(String args[]){
Double dbl[] = new Double[1000000];
for(int i=0; i<dbl.length;i++){
dbl[i]=Math.random();
}
long start = System.currentTimeMillis();
Arrays.parallelSort(dbl);
System.out.println("time taken :"+((System.currentTimeMillis())-start));
}
}
When I run this code it takes time approx 700 to 800 ms, but when I replace the line Arrays.parallelSort to Arrays.sort it takes 500 to 600 ms. I read about the Arrays.parallelSort and Arrays.sort method which says that Arrays.parellelSort gives poor performance when dataset are small but here I am using array of 1000000 elements. what could be the reason for parallelSort poor performance ?? I am using java8.
The parallelSort function will use a thread for each cpu core you have on your machine. Specifically parallelSort runs tasks on the ForkJoin common thread pool. If you only have one core you would not see an improvement over single threaded sort.
If you only have multiple cores you are going to have some upfront cost associated with creating the new threads which will mean that for relatively small arrays you are not going to see linear performance gains.
The compare function for comparing doubles is not an expensive function. I think that in this case 1000000 elements can be safely considered small and the benefits of using multiple threads is outweighed by the upfront costs of creating those threads. Since the upfront costs will be fixed you should see a performance gain with larger arrays.
I read about the Arrays.parallelSort and Arrays.sort method which says
that Arrays.parellelSort gives poor performance when dataset are small
but here I am using array of 1000000 elements.
This is not the only thing to take in consideration. It depends a lot on your machine (how your CPU handle multi-threading etc).
Here a quote from the Parallelism part of The Java Tutorials
Note that parallelism is not automatically faster than performing
operations serially, although it can be if you have enough data and
processor cores [...] it is still your responsibility to determine if
your application is suitable for parallelism.
You might also want to have a look at the code of java.util.ArraysParallelSortHelpers for a better understanding of the algorithm.
Note that the parallelSort method use the ForkJoinPool introduced in Java 7 to take advantages of each processors of your computer as stated in the javadoc :
A ForkJoinPool is constructed with a given target parallelism level;
by default, equal to the number of available processors.
Note that if the length of the array is less then 1 << 13, the array will be sorted using the appropriate Arrays.sort method.
See also
Fork/Join
I am deciding what is the best way to achieve high performance gain while achieving thread safety (synchronization) for required point.
Consider the following case. There are two entry point in system and I want to make sure there is no two or more threads updates cashAccounts and itemStore at same time. So I created a Object call Lock and use it as follows.
public class ForwardPath {
public void fdWay(){
synchronized (Lock.class){
//here I am updating both cashAccount object and
//itemStore object
}
}
}
.
public class BackWardPath {
public void bwdWay(){
synchronized (Lock.class){
//here I am updating both cashAccount object and
//itemStore object
}
}
}
But this implementation will greatly decrease performance, If both ForwardPath and BackWardPath are triggered frequently.
But in this case it is some what difficult to lock only cashAccount and itemStore because both these objects get updates several times inside both paths.
Is there a good way to achieve both performance gain and thread safety in this scenario ?
The example is far too abstract, and the little you describe leaves no alternative to synchronization in the methods.
To obtain high scalability (thats not necessarily highest performance in all situations, mind you), work is usually subdivided into units of work that are completely independent of each other (these they can be processed without any synchronization).
Lets assume a simple example, summing up numbers (purely to demonstrate the principle):
The naive solution would be to have one accumulator for the sum, and walk the numbers adding to the accumulator. Obviously, if you wanted to use multiple threads, the accumulator would need to be synchronized and become the major point of contention).
To eliminate the contention, you can partition the numbers into multiple slices - separate units of work. Each unit of work can be summed independently (one thread per unit of work, for example). To get the final sum, add up the partial sums of each unit of work. The only point where synchronization is now needed is when combining the partial results. If you had for example 10 billion numbers, and divide them into 10 units of work, you need only synchronized 10 times - instead of 10 billion times in the naive solution.
The principle is always the same here: Make sure you can do as much work as possible without synchronization, then combine the partial results to obtain the final result. Thinking on the individual operation level is to fine a granularity to lend itself well to multi threading.
Performance-gain by using Threads is an architectural question, just adding some Threads and synchronized won't do the trick and usually just screws up your code while not working any faster than before. Therefore your code example is not enough to help you on the actual problem you seem to be facing, as each threaded solution is unique to your actual code.
Here I would focus on custom application where I got degradation (no need for general discussion about fastness of threads against processes).
I've got MPI application on Java which solve some problem using iteration method. The schematic view to application bellow lets call it MyProcess(n), where "n" is the number of processes:
double[] myArray = new double[M*K];
for(int iter = 0;iter<iterationCount;++iter)
{
//some communication between processes
//main loop
for(M)
for(K)
{
//linear sequence of arithmetical instructions
}
//some communication between processes
}
To improve performance I've decided to use Java threads (lets call it MyThreads(n)). The code is almost the same – myArray becomes matrix, where each row contains array for appropriate thread.
double[][] myArray = new double[threadNumber][M*K];
public void run()
{
for(int iter = 0;iter<iterationCount;++iter)
{
//some synchronization primitives
//main loop
for(M)
for(K)
{
//linear sequence of arithmetical instructions
counter++;
}
// some synchronization primitives
}
}
Threads created and started using Executors.newFixedThreadPool(threadNumber).
The problem is that while for MyProcess(n) we got adequate performance(n in [1,8]), in case of MyThreads(n) performance degrades essentially(on my system by factor of n).
Hardware: Intel(R) Xeon(R) CPU X5355(2 processors, 4 cores on each)
Java version: 1.5(using d32 option).
At first I thought that got different workloads on threads, but no, variable “counter” shows, that number of iterations between different run of MyThreads(n) (n in [1,8]) are identical.
And it isn’t synchronization fault, because I have temporary comment all synchronization primitives.
Any suggestions/ideas would be appreciated.
Thanks.
There are 2 issues I see in your piece of code.
Firstly caching problem. Since you try to do this in multi thread/process I'd assume your M * K results in a large number; then when you do
double[][] myArray = new double[threadNumber][M*K];
You are essentially creating an array of double pointer with size threadNumber; each pointing to a double array of size M*K. The interesting point here is that the threadNumber count of arrays are not necessarily allocated onto the same block of memory. They are just double pointers which can be allocated anywhere inside JVM heap. As a result, when multiple threads run, you might encounter a lot of cache miss and you end up reading memory many times, eventually slow down your program.
If the above is the root cause, you can try enlarge your JVM heap size, and then do
double[] myArray = new double[threadNumber * M * K];
And have the threads operating on different segment of the same array. You should be able to see performance better.
Secondly synchronization issue. Note that double (or any primitive) array is NOT volatile. Thus your result on 1 thread isn't guaranteed to be visible to other threads. If you are using synchronization block this resolves the issue, as a side effect of synchronization is make sure visibility across threads; If not, when you are reading and writing the array, please always make sure you use Unsafe.putXXXVolatile() and Unsafe.getXXXVolatile() so that you can do volatile operations on arrays.
To take this further, Unsafe can also be used to create a continuous segment of memory which you can used to hold your data structure and achieve better performance. In your case I think 1) already do the trick.