Arrays.sort and Arrays.parallelSort function behavior - java

I have the following code ,
import java.util.Arrays;
public class ParellelStream {
public static void main(String args[]){
Double dbl[] = new Double[1000000];
for(int i=0; i<dbl.length;i++){
dbl[i]=Math.random();
}
long start = System.currentTimeMillis();
Arrays.parallelSort(dbl);
System.out.println("time taken :"+((System.currentTimeMillis())-start));
}
}
When I run this code it takes time approx 700 to 800 ms, but when I replace the line Arrays.parallelSort to Arrays.sort it takes 500 to 600 ms. I read about the Arrays.parallelSort and Arrays.sort method which says that Arrays.parellelSort gives poor performance when dataset are small but here I am using array of 1000000 elements. what could be the reason for parallelSort poor performance ?? I am using java8.

The parallelSort function will use a thread for each cpu core you have on your machine. Specifically parallelSort runs tasks on the ForkJoin common thread pool. If you only have one core you would not see an improvement over single threaded sort.
If you only have multiple cores you are going to have some upfront cost associated with creating the new threads which will mean that for relatively small arrays you are not going to see linear performance gains.
The compare function for comparing doubles is not an expensive function. I think that in this case 1000000 elements can be safely considered small and the benefits of using multiple threads is outweighed by the upfront costs of creating those threads. Since the upfront costs will be fixed you should see a performance gain with larger arrays.

I read about the Arrays.parallelSort and Arrays.sort method which says
that Arrays.parellelSort gives poor performance when dataset are small
but here I am using array of 1000000 elements.
This is not the only thing to take in consideration. It depends a lot on your machine (how your CPU handle multi-threading etc).
Here a quote from the Parallelism part of The Java Tutorials
Note that parallelism is not automatically faster than performing
operations serially, although it can be if you have enough data and
processor cores [...] it is still your responsibility to determine if
your application is suitable for parallelism.
You might also want to have a look at the code of java.util.ArraysParallelSortHelpers for a better understanding of the algorithm.
Note that the parallelSort method use the ForkJoinPool introduced in Java 7 to take advantages of each processors of your computer as stated in the javadoc :
A ForkJoinPool is constructed with a given target parallelism level;
by default, equal to the number of available processors.
Note that if the length of the array is less then 1 << 13, the array will be sorted using the appropriate Arrays.sort method.
See also
Fork/Join

Related

Method optimization

Which one would be faster and why?
public void increment{
synchronized(this){
i++;
}
}
or
public synchronized void increment(){
I++;
}
Would method inlining improve the second option?
The difference is unlikely to matter or be visible but the second example could be faster. Why? In Oracle/OpenJDK, the less byte code a method uses the more it can be optimised e.g. inlined.
There is a number of thresholds for when a method could be optimised and these thresholds are based on the number of bytes of byte code. The second example uses less byte code so it is possible it could be optimise more.
One of these optimisations is escape analysis which could eliminate the use of synchronized for thread local object. This would make it much faster. The threshold for escape analysis is method which are under 150 bytes (after inlining) by default. You could see cases where the first solution makes the method just over 150 bytes and the second is just under 150 bytes.
NOTE:
don't write code on this basis, the different it too trivial to matter. Code clarity is far, far more important.
if performance does matter, using an alternative like AtomicInteger is likely to be an order of magnitude faster. (consistently, not just in rare cases)
BTW AFAIK the concurrency libraries use little assert statements, not because they are not optimised away but because they still count to the size of the byte code and indirectly slow down their use by meaning in some cases the code isn't as optimal. (this claim of mine should have a reference, but I couldn't find it)

Difference between java 8 streams and parallel streams

I wrote code using Java 8 streams and parallel streams for the same functionality with a custom collector to perform an aggregation function.
When I see CPU usage using htop, it shows all CPU cores being used for both 'streams' and 'parallel streams' version. So, it seems when list.stream() is used, it also uses all CPUs. Here, what is the precise difference between parallelStream() and stream() in terms of usage of multi-core.
Consider the following program:
import java.util.ArrayList;
import java.util.List;
public class Foo {
public static void main(String... args) {
List<Integer> list = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
list.add(i);
}
list.stream().forEach(System.out::println);
}
}
You will notice that this program will output the numbers from 0 to 999 sequentially, in the order in which they are in the list. If we change stream() to parallelStream() this is not the case anymore (at least on my computer): all number are written, but in a different order. So, apparently, parallelStream() indeed uses multiple threads.
The htop is explained by the fact that even single-threaded applications are divided over mutliple cores by most modern operating systems (parts of the same thread may run on several cores, but of course not at the same time). So if you see that a process used more than one core, this does not mean necessarily that the program uses multiple threads.
Also the performance may not improve when using multiple threads. The cost of synchronization may nihilite the gains of using multiple threads. For simple testing scenarios this is often the case. For example, in the above example, System.out is synchronized. So, effectively, only number can be written at the same time, although multiple threads are used.
adding to #Hoopje 's answer:
Before using parallelStream (), Read this:
It is multi-threaded. Just writing parallelStream() to get parallelism is almost always bad idea in java. There are some cases where it will work, but not always. There are other ways to achieve parallelism and almost always, you need to think a lot before taking a multi-thread solution .
It uses the default JVM thread pool. So, if you are doing any blocking operation such as network call, the entire java application can get stuck. Thats the biggest problem there. There are other ones with task allocation as well. A simple ExecutionService with n threads provides better performance that parallel streams.
You can also read:
Java Parallel Streams Are Bad for Your Health! | JRebel by Perforce

Thread Safety Vs Performance

I am deciding what is the best way to achieve high performance gain while achieving thread safety (synchronization) for required point.
Consider the following case. There are two entry point in system and I want to make sure there is no two or more threads updates cashAccounts and itemStore at same time. So I created a Object call Lock and use it as follows.
public class ForwardPath {
public void fdWay(){
synchronized (Lock.class){
//here I am updating both cashAccount object and
//itemStore object
}
}
}
.
public class BackWardPath {
public void bwdWay(){
synchronized (Lock.class){
//here I am updating both cashAccount object and
//itemStore object
}
}
}
But this implementation will greatly decrease performance, If both ForwardPath and BackWardPath are triggered frequently.
But in this case it is some what difficult to lock only cashAccount and itemStore because both these objects get updates several times inside both paths.
Is there a good way to achieve both performance gain and thread safety in this scenario ?
The example is far too abstract, and the little you describe leaves no alternative to synchronization in the methods.
To obtain high scalability (thats not necessarily highest performance in all situations, mind you), work is usually subdivided into units of work that are completely independent of each other (these they can be processed without any synchronization).
Lets assume a simple example, summing up numbers (purely to demonstrate the principle):
The naive solution would be to have one accumulator for the sum, and walk the numbers adding to the accumulator. Obviously, if you wanted to use multiple threads, the accumulator would need to be synchronized and become the major point of contention).
To eliminate the contention, you can partition the numbers into multiple slices - separate units of work. Each unit of work can be summed independently (one thread per unit of work, for example). To get the final sum, add up the partial sums of each unit of work. The only point where synchronization is now needed is when combining the partial results. If you had for example 10 billion numbers, and divide them into 10 units of work, you need only synchronized 10 times - instead of 10 billion times in the naive solution.
The principle is always the same here: Make sure you can do as much work as possible without synchronization, then combine the partial results to obtain the final result. Thinking on the individual operation level is to fine a granularity to lend itself well to multi threading.
Performance-gain by using Threads is an architectural question, just adding some Threads and synchronized won't do the trick and usually just screws up your code while not working any faster than before. Therefore your code example is not enough to help you on the actual problem you seem to be facing, as each threaded solution is unique to your actual code.

Concurrent HashMap iterator:How safe is it for Threading?

I used concurrent hashmap for creating a matrix. It indices ranges to 100k. I have created 40 threads. Each of the thread access those elements of matrices and modifies to that and write it back of the matrix as:
ConcurrentHashMap<Integer, ArrayList<Double>> matrix =
new ConcurrentHashMap<Integer, ArrayList<Double>>(25);
for (Entry(Integer,ArrayList<Double>)) entry: matrix.entrySet())
upDateEntriesOfValue(entry.getValue());
I did not found it thread safe. Values are frequently returned as null and my program is getting crashed. Is there any other way to make it thread safe.Or this is thread safe and i have bug in some other places. One thing is my program does not crash in single threaded mode.
The iterator is indeed thread-safe for the ConcurrentHashMap.
But what is not thread-safe in your code is the ArrayList<Double> you seem to update! Your problems might come from this data structure.
You may want to use a concurrent data structure adapted to you needs.
Using a map for a matrix is really inefficient, and the way you have used it, it won't even support sparse arrays particularly well.
I suggest you use a double[][] where you lock each row (or column if that is better) If the matrix is small enough you may be better of using only one CPU as this can save you quite a bit of overhead.
I would suggest you create no more threads than you have cores. For CPU intensive tasks, using more thread can be slower, not faster.
Matrix is 100k*50 at max
EDIT: Depending on the operation you are performing, I would try to ensure you have the shorter dimension first so you can process each long dimension in a different thread efficiently.
e.g
double[][] matrix = new double[50][100*1000];
for(int i=0;i<matrix.length;i++) {
final double[] line = matrix[i];
executorService.submit(new Runnable() {
public void run() {
synchronized(line) {
processOneLine(line);
}
}
});
}
This allows all you thread to run concurrently because they don't share any data structures. They can also access each double efficiently because they are continuous in memory and stored as efficiently as possible. i.e. 100K doubles uses about 800KB, but List<Double> uses 2800KB and each value can be randomly arranged in memory which means your cache has to work much harder.
thanks but in fact i have 80 cores in total
To uses 80 core efficiently you might want to break the longer lines in two or four so you can keep all the cores busy, or find a way to perform more than one operation at a time.
TheConcurrentHashMap will be thread safe for accesses into the Map, but the Lists served out need to be thread-safe, if multiple threads can operate on the same List instances concurrently so use a thread-safe list while modifying.
In your case working on ConcurrentHashMap is tread-safe but when thread goes to ArrayList this is not synchronized and hence multiple threads can access it simultaneously which makes it non thread-safe. either you can use synchronized block where you are performing modification in list

java threads vs java processes performance degradation

Here I would focus on custom application where I got degradation (no need for general discussion about fastness of threads against processes).
I've got MPI application on Java which solve some problem using iteration method. The schematic view to application bellow lets call it MyProcess(n), where "n" is the number of processes:
double[] myArray = new double[M*K];
for(int iter = 0;iter<iterationCount;++iter)
{
//some communication between processes
//main loop
for(M)
for(K)
{
//linear sequence of arithmetical instructions
}
//some communication between processes
}
To improve performance I've decided to use Java threads (lets call it MyThreads(n)). The code is almost the same – myArray becomes matrix, where each row contains array for appropriate thread.
double[][] myArray = new double[threadNumber][M*K];
public void run()
{
for(int iter = 0;iter<iterationCount;++iter)
{
//some synchronization primitives
//main loop
for(M)
for(K)
{
//linear sequence of arithmetical instructions
counter++;
}
// some synchronization primitives
}
}
Threads created and started using Executors.newFixedThreadPool(threadNumber).
The problem is that while for MyProcess(n) we got adequate performance(n in [1,8]), in case of MyThreads(n) performance degrades essentially(on my system by factor of n).
Hardware: Intel(R) Xeon(R) CPU X5355(2 processors, 4 cores on each)
Java version: 1.5(using d32 option).
At first I thought that got different workloads on threads, but no, variable “counter” shows, that number of iterations between different run of MyThreads(n) (n in [1,8]) are identical.
And it isn’t synchronization fault, because I have temporary comment all synchronization primitives.
Any suggestions/ideas would be appreciated.
Thanks.
There are 2 issues I see in your piece of code.
Firstly caching problem. Since you try to do this in multi thread/process I'd assume your M * K results in a large number; then when you do
double[][] myArray = new double[threadNumber][M*K];
You are essentially creating an array of double pointer with size threadNumber; each pointing to a double array of size M*K. The interesting point here is that the threadNumber count of arrays are not necessarily allocated onto the same block of memory. They are just double pointers which can be allocated anywhere inside JVM heap. As a result, when multiple threads run, you might encounter a lot of cache miss and you end up reading memory many times, eventually slow down your program.
If the above is the root cause, you can try enlarge your JVM heap size, and then do
double[] myArray = new double[threadNumber * M * K];
And have the threads operating on different segment of the same array. You should be able to see performance better.
Secondly synchronization issue. Note that double (or any primitive) array is NOT volatile. Thus your result on 1 thread isn't guaranteed to be visible to other threads. If you are using synchronization block this resolves the issue, as a side effect of synchronization is make sure visibility across threads; If not, when you are reading and writing the array, please always make sure you use Unsafe.putXXXVolatile() and Unsafe.getXXXVolatile() so that you can do volatile operations on arrays.
To take this further, Unsafe can also be used to create a continuous segment of memory which you can used to hold your data structure and achieve better performance. In your case I think 1) already do the trick.

Categories