Automatic parallelization

What is your opinion of a project that would take code and split it into threads automatically (maybe at compile time, probably at runtime)?
Take a look at the code below:
for (int i = 0; i < 100; i++)
    sum1 += rand(100);
for (int j = 0; j < 100; j++)
    sum2 += rand(100) / 2;
This kind of code can automatically get split to 2 different threads that run in parallel.
Do you think it's even possible?
I have a feeling that theoretically it's impossible (it reminds me of the halting problem) but I can't justify this thought.
Do you think it's a useful project? Is there anything like it?

This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...
It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indices (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() on a separate thread.
As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.
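The even/odd split described above can be written out by hand as a rough sketch of what such a tool would have to generate. The class name StridedLoop and its sum method are made up for illustration, and the example sidesteps the rand() state dependency by giving each thread its own generator through ThreadLocalRandom, which is an assumption and not what the original code does.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntToLongFunction;

class StridedLoop {
    // Sums body(i) for i in [0, n). Thread t handles i = t, t + threads, t + 2*threads, ...
    static long sum(int n, int threads, IntToLongFunction body) {
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                long local = 0;
                for (int i = id; i < n; i += threads) {
                    local += body.applyAsLong(i);
                }
                partial[id] = local; // each thread writes only its own slot
            });
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join(); // join makes the worker's write to partial[t] visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        // Two threads: one takes the even indices, the other the odd ones.
        long sum1 = sum(100, 2, i -> ThreadLocalRandom.current().nextInt(100));
        System.out.println(sum1);
    }
}
```

With only 100 cheap iterations, the cost of starting two threads will almost certainly outweigh any gain; the sketch only shows the shape of the transformation.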
If you're truly interested in this topic and don't mind sifting through research papers:
Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.

This is practically not possible.
The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.
While it would be possible to parallelize very simple loops, even then there's a risk involved. For example, your code above could only be parallelized if rand() is thread-safe, and many random number generation routines are not. (Java's Math.random() is synchronized for you, however.)
Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.
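To make the thread-safety concern concrete, here is a hand-written sketch of the split from the question in which each loop owns a private generator, so no rand() state is shared. The names TwoLoops, loop and runBoth are hypothetical, and seeding java.util.Random per task is an assumption made so the result is reproducible.

```java
import java.util.Random;
import java.util.concurrent.CompletableFuture;

class TwoLoops {
    // One of the two loops from the question; divisor is 1 for sum1 and 2 for sum2.
    static long loop(long seed, int divisor) {
        Random rand = new Random(seed); // private generator: nothing shared between threads
        long sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += rand.nextInt(100) / divisor;
        }
        return sum;
    }

    // Runs both loops concurrently on the common pool and returns {sum1, sum2}.
    static long[] runBoth(long seed1, long seed2) {
        CompletableFuture<Long> f1 = CompletableFuture.supplyAsync(() -> loop(seed1, 1));
        CompletableFuture<Long> f2 = CompletableFuture.supplyAsync(() -> loop(seed2, 2));
        return new long[] { f1.join(), f2.join() };
    }

    public static void main(String[] args) {
        long[] sums = runBoth(1L, 2L);
        System.out.println("sum1=" + sums[0] + " sum2=" + sums[1]);
    }
}
```

Here the programmer, not the compiler, has asserted that the two loops are independent, which is exactly the knowledge an automatic tool would have to infer.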

It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for Java code (this is equivalent to the halting problem).
Instead we need to relax our goal from the best partition into some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem, one is to forget about general code and start looking at specific types of program. If you have simple control-flow (constant bounded for-loops, limited branching....) then you can make much more head-way.
Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together then you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX / SSE style code). Getting to that stage has taken decades but if you look at compilers like Intel's then performance is starting to get pretty good.
If you move from vector instructions inside a single thread to multiple threads within a process then you have a huge increase in latency moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targeted tools available. If you can write one that works it would be very interesting to many people.
For your specific example, if you assume that rand() is a parallel version so you can call it independently from different threads then it's quite easy to see that the code can be split into two. A compiler would just need dependency analysis to see that neither loop uses data from or affects the other. So the order between them in the user-level code is a false dependency that could be split (i.e. by putting each in a separate thread).
But this isn't really how you would want to parallelise the code. It looks as if each loop iteration is dependent on the previous one, as sum1 += rand(100) is the same as sum1 = sum1 + rand(100) where the sum1 on the right-hand side is the value from the previous iteration. However, the only operation involved is addition, which is associative, so we can rewrite the sum in many different ways.
sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...
The advantage of the second is that each single addition in brackets can be computed in parallel with all of the others. Once you have 50 results they can be combined in a further 25 additions, and so on. An odd element left over at any level is simply carried up unchanged, so the total is 50+25+12+6+3+2+1 = 99 additions, about the same work as the 100 in the original, but there are only 7 sequential steps, so apart from the parallel forking/joining and communication overhead it runs about 14 times quicker. This tree of additions is called a gather operation (more commonly, a reduction) in parallel architectures and it tends to be the expensive part of a computation.
On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.
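The pairwise scheme can be checked with a small simulation. This is only a sketch: the class TreeSum is a made-up name, and the levels are executed sequentially here, whereas on parallel hardware every addition within one level could run at the same time. It also counts the levels and additions so the figures can be verified.

```java
class TreeSum {
    // Returns {sum, levels, additions} for a pairwise (tree) reduction of values.
    static long[] reduce(long[] values) {
        long[] level = values.clone();
        int n = level.length;
        int levels = 0;
        int additions = 0;
        while (n > 1) {
            int next = 0;
            // All pairs in this pass are independent of each other.
            for (int i = 0; i + 1 < n; i += 2) {
                level[next++] = level[i] + level[i + 1];
                additions++;
            }
            if (n % 2 == 1) {
                level[next++] = level[n - 1]; // odd element is carried up without an addition
            }
            n = next;
            levels++;
        }
        long sum = (n == 0) ? 0 : level[0];
        return new long[] { sum, levels, additions };
    }

    public static void main(String[] args) {
        long[] values = new long[100];
        for (int i = 0; i < values.length; i++) values[i] = i;
        long[] r = reduce(values);
        System.out.println("sum=" + r[0] + " levels=" + r[1] + " additions=" + r[2]);
    }
}
```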
In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.

Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.
That does not mean it would be useful. Consider the following:
First of all, to do it at compile-time, you have to inspect all code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simple computations.
Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies the same state into several threads, for example. This is probably a very difficult task and in many cases you will end up not being sure: two variables might in fact reference the same object.
Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why their code was not parallelizable and how it should be changed.
I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.
As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.

There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.

I've learnt that as of JDK 1.8 (Java 8), you can leverage multiple CPU cores when working with streams by using parallelStream().
However, before going to production with parallelStream() it is always better to benchmark the sequential() version against the parallel one, and then decide which is the better fit.
The reason: there are scenarios where the parallel stream will perform dramatically worse than the sequential one, for example when the operation needs to do auto-boxing or unboxing. For those scenarios it's advisable to use the Java 8 primitive streams such as IntStream, LongStream and DoubleStream.
Reference: Modern Java in Action: Manning Publications 2019
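As a small illustration of the boxed versus primitive distinction, the two methods below compute the same sum with a parallel stream. The class and method names are invented for this sketch, and it makes no claim about which variant is faster on a given machine; that still has to be measured.

```java
import java.util.stream.LongStream;

class PrimitiveVsBoxed {
    // Boxed variant: every element becomes a Long object before being reduced.
    static long boxedSum(long n) {
        return LongStream.rangeClosed(1, n).boxed().parallel().reduce(0L, Long::sum);
    }

    // Primitive variant: stays on long values throughout, no boxing.
    static long primitiveSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(boxedSum(1_000_000));
        System.out.println(primitiveSum(1_000_000));
    }
}
```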

The programming language is Java, and Java runs on a virtual machine. So shouldn't one be able to execute the code at runtime on different threads owned by the VM? Since all the memory etc. is handled by the VM, it would not cause any corruption. You could see the code as a stack of instructions, estimate their execution time, and then distribute them over an array of threads which each have an execution stack of roughly the same duration. It might be dangerous though: some graphics APIs, like OpenGL immediate mode, need to maintain order and mostly should not be threaded at all.

Related

Problems with Streams in Java 8

As per this article there are some serious flaws with the Fork-Join architecture in Java. As per my understanding, Streams in Java 8 make use of the Fork-Join framework internally. We can easily turn a stream into a parallel one by using the parallel() method. But when we submit a long-running task to a parallel stream it blocks all the threads in the pool, check this. This kind of behaviour is not acceptable for real-world applications.
My question is what are the various considerations that I should take into account before using these constructs in high-performance applications (e.g. equity analysis, stock market ticker etc.)

The considerations are similar to other uses of multiple threads.
Only use multiple threads if you know they help. The aim is not to use every core you have, but to have a program which performs to your requirements.
Don't forget multi-threading comes with an overhead, and this overhead can exceed the value you get.
Multi-threading can experience large outliers. When you test performance you should not only look at throughput (which should be better) but the distribution of your latencies (which is often worse in extreme cases)
For low latency, switch between threads as little as possible. If you can do everything in one thread that may be a good option.
For low latency, you don't want to play nice; instead you want to minimise jitter by doing things such as pinning busy-waiting threads to isolated cores. The more isolated cores you have, the fewer junk cores you have to run things like thread pools.

The streams API makes parallelism deceptively simple. As was stated before, whether using a parallel stream speeds up your application needs to be thoroughly analysed and tested in the actual runtime context. My own experience with parallel streams suggests the following (and I am sure this list is far from complete):
The cost of the operations performed on the elements of the stream versus the cost of the parallelising machinery determines the potential benefit of parallel streams. For example, finding the maximum in an array of doubles is so fast using a tight loop that the streams overhead is never worthwhile. As soon as the operations get more expensive, the balance starts to tip in favour of the parallel streams API (under ideal conditions, say, a multi-core machine dedicated to a single algorithm). I encourage you to experiment.
You need to have the time and stamina to learn the intricacies of the stream API. There are unexpected pitfalls. For example, a Spliterator can be constructed from a regular Iterator in a single statement. Under the hood, the elements produced by the iterator are first collected into an array. Depending on the number of elements produced by the Iterator, that approach becomes very, or even too, resource hungry.
While the cited article makes it seem that we are completely at the mercy of Oracle, that is not entirely true. You can write your own Spliterator that splits the input into chunks that are specific to your situation rather than relying on the default implementation. Or, you could write your own ThreadFactory (see the method ForkJoinPool.makeCommonPool).
You need to be careful not to produce deadlocks. If the tasks executed on the elements of the stream use the ForkJoinPool themselves, a deadlock may occur. You need to learn how to use the ForkJoinPool.ManagedBlocker API (which I find rather the opposite of easy to grasp). Technically you are telling the ForkJoinPool that a thread is blocking, which may lead to the creation of additional threads to keep the degree of parallelism intact. The creation of extra threads is not free, of course.
Just my five cents...
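For readers who have not seen the ManagedBlocker API mentioned above, here is a minimal sketch that wraps a blocking queue read. ForkJoinPool.ManagedBlocker and ForkJoinPool.managedBlock are the real JDK API; the QueueTaker class and its static take helper are illustrative names, and using a BlockingQueue as the blocking resource is an assumption made for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ForkJoinPool;

class QueueTaker<E> implements ForkJoinPool.ManagedBlocker {
    private final BlockingQueue<E> queue;
    private E item; // stays null until an element has been obtained

    QueueTaker(BlockingQueue<E> queue) {
        this.queue = queue;
    }

    @Override
    public boolean block() throws InterruptedException {
        if (item == null) {
            item = queue.take(); // the call that may block the worker thread
        }
        return true; // no further blocking is necessary
    }

    @Override
    public boolean isReleasable() {
        // Try a non-blocking poll first; blocking is only needed if it yields nothing.
        return item != null || (item = queue.poll()) != null;
    }

    // Takes one element while informing the pool that the caller may block.
    static <T> T take(BlockingQueue<T> queue) {
        QueueTaker<T> taker = new QueueTaker<>(queue);
        try {
            ForkJoinPool.managedBlock(taker);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return taker.item;
    }
}
```

A stream operation that would otherwise call queue.take() directly could call QueueTaker.take(queue) instead, giving the pool the chance to compensate with an extra worker while the thread is blocked.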

The point (there are actually 17) of the articles is to point out that the F/J Framework is more of a research project than a general-purpose commercial application development framework.
Criticize the object, not the man. Trying to do that is most difficult when the main problem with the framework is that the architect is a professor/scientist not an engineer/commercial developer. The PDF consolidation downloadable from the article goes more into the problem of using research standards rather than engineering standards.
Parallel streams work fine, until you try to scale them. The framework uses pull technology; the request goes into a submission queue, the thread must pull the request out of the submission queue. The Task goes back into the forking thread's deque, other threads must pull the Task out of the deque. This technique doesn't scale well. In a push technology, each Task is scattered to every thread in the system. That works much better in large scale environments.
There are many other problems with scaling as even Paul Sandoz from Oracle pointed out: For instance if you have 32 cores and are doing Stream.of(s1, s2, s3, s4).flatMap(x -> x).reduce(...) then at most you will only use 4 cores. The article points out, with downloadable software, that scaling does not work well and the parquential technique is necessary to avoid stack overflows and OOME.
Use the parallel streams. But beware of the limitations.

Code optimization to avoid branching

I just came across this article: Compute the minimum or maximum of two integers without branching
It starts with "[o]n some rare machines where branching is expensive...".
I used to think that branching is always expensive as it often forces the processor to clear and restart its execution pipeline (e.g. see Why is it faster to process a sorted array than an unsorted array?).
This leaves me with a couple of questions:
Did the writer of the article get that part wrong? Or was this article maybe written in a time before branching was an issue (I can't find a date on it).
Do modern processors have a way to complete minimal branches like the one in (x < y) ? x : y without performance degradation?
Or do all modern compilers simply implement this hack automatically? Specifically, what does Java do? Especially since its Math.min(...) function is just that ternary statement...

Did the writer of the article get that part wrong? Or was this article maybe written in a time before branching was an issue (I can't find a date on it).
The oldest comment is 5 years old, so it's no hot news. However, unpredictable branching has always been expensive, and it was 5 years ago too. In the meantime it has only got worse, as modern CPUs can do much more per cycle, so a mispredicted branch costs more work.
But in a sense, the writer is right. The majority of CPUs are not found in our PCs and servers, but in embedded devices, where the situation differs.
Do modern processors have a way to complete minimal branches like the one in (x < y) ? x : y without performance degradation?
Yes and no. AFAIK Math.max always gets translated to a conditional move, which means no branching. Your own max may or may not use one, depending on the statistics the JVM has collected.
There's no silver bullet. With predictable outcomes, branching is faster. Finding out exactly what pattern the CPU recognizes is hard. The JVM simply looks at how often a branch gets taken and uses a magic threshold of about 18%. See my own question and answer for details.
Or do all modern compilers simply implement this hack automatically? Specifically, what does Java do? Especially since its Math.min(...) function is just that ternary statement...
It's actually a compiler intrinsic. Whenever the JIT compiler sees this very method called, it handles it specially. When you copy the method, the copy gets no special treatment.
In this case, the intrinsic is not very useful, as it's something that gets heavily optimized anyway. For methods like Long#numberOfLeadingZeros, the intrinsic is essential, as the code is rather long and slow and modern CPUs can do it in a single cycle.
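For reference, one of the branch-free formulations from the kind of article the question cites can be transcribed to Java as below. This is only an illustration: the class name BranchlessMin is made up, and the trick is valid only when x - y does not overflow, so Math.min remains the sensible choice in real code.

```java
class BranchlessMin {
    // Branch-free minimum; correct only if x - y does not overflow an int.
    static int min(int x, int y) {
        int diff = x - y;
        // diff >> 31 is all ones (-1) when diff is negative and 0 otherwise,
        // so the mask selects either diff or 0 without a conditional jump.
        return y + (diff & (diff >> 31));
    }

    public static void main(String[] args) {
        System.out.println(min(3, 7));  // compare with Math.min(3, 7)
        System.out.println(min(-5, 2)); // compare with Math.min(-5, 2)
    }
}
```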

Which manner of processing two repetitive operations will be completed faster?

Generally speaking, is there a significant difference in the processing speeds of these two example segments of code, and if so, which should complete faster? Assume that processA(int) and processB(int) are void methods common to both examples.
for (int x = 0; x < 1000; x++) {
    processA(x);
    processB(x);
}
or
for (int x = 0; x < 1000; x++) {
    processA(x);
}
for (int x = 0; x < 1000; x++) {
    processB(x);
}
I'm looking to see if I can speed up one of my programs. It involves cycling through blocks of data several times and processing them in different ways. Currently, it runs a separate cycle for each processing method, meaning a lot of cycles get run in total, but each cycle does very little work. I was thinking about rewriting my code so that each cycle incorporates every processing method; in other words, far fewer cycles, but each cycle has a heavier workload.
This would be a very intensive rewrite of my program structure. So unless it would give me a significant performance boost, it won't be worth the trouble.

The first case will be slightly faster than the second case because a for loop in and of itself has an effect on performance. However, the main question you should ask yourself is this: will the effect be of significance to my program? If not, you should opt for clear and readable code.
One thing to remember in such a case is that the JVM (Java Virtual Machine) does a whole lot of optimisation, and in your case the JVM can even get rid of the for-loop and rewrite the code into 1000 successive calls to processA() and processB(). So even if you have two for loops, the JVM can get rid of both, making your program more optimal than even your first case.
To get a basic understanding of method calls, cost, and the JVM, you can read this short article:
https://plumbr.eu/blog/how-expensive-is-a-method-call-in-java

Absolutely, fusing the 2 loops will be faster. (Some compilers do this automatically as an optimization.) How much faster? That depends on how many iterations the loops are running. Unless the number of iterations is very high, you can expect that the improvement will be minimal.

The single loop case will contain fewer instructions so will run faster.
But, unless processA and processB are very quick functions, such a substantial refactoring would give you negligible performance gain.
If this is production code, you should also take care as there may be side effects. You should make alterations in the context of a unit testing framework testing the code in question. (In C++ for example x may be passed by reference and could be modified by the functions! Of course Java has no such hazard but there may be other reasons why all processA functions have to run before all processB functions and the program comments may not make this clear).
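The ordering hazard can be made visible with a tiny sketch. The LoopFusion class and the IntConsumer parameters stand in for the real processA and processB, which are not shown in the question, so everything here is hypothetical apart from the loop shapes themselves.

```java
import java.util.function.IntConsumer;

class LoopFusion {
    // Original shape: all processA calls complete before any processB call.
    static void split(int n, IntConsumer processA, IntConsumer processB) {
        for (int x = 0; x < n; x++) {
            processA.accept(x);
        }
        for (int x = 0; x < n; x++) {
            processB.accept(x);
        }
    }

    // Fused shape: processA and processB calls are interleaved.
    static void fused(int n, IntConsumer processA, IntConsumer processB) {
        for (int x = 0; x < n; x++) {
            processA.accept(x);
            processB.accept(x);
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        split(2, x -> log.append("A").append(x), x -> log.append("B").append(x));
        log.append(" vs ");
        fused(2, x -> log.append("A").append(x), x -> log.append("B").append(x));
        System.out.println(log); // the two halves show the different call orders
    }
}
```

If processB reads anything that processA writes for a later index, the two shapes are not interchangeable, which is what a unit test around the refactoring should catch.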

The first piece of code is faster, because it only has one for loop.
If, however, you need to do something AFTER processA() has been executed n times and before processB()'s loop starts, then the second option would be ideal.

Should I inline long code in a loop, or move it in a separate method?

Assume I have a loop (any while or for) like this:
loop{
A long code.
}
From the point of time complexity, should I divide this code in parts, write a function outside the loop, and call that function repeatedly?
I read something about functions very long ago, that calling a function repeatedly takes more time or memory or something like that; I don't remember it exactly. Can you also provide some good reference about things like this (time complexity, coding style)?
Can you also provide some reference book or tutorial about heap memory, overheads etc. which affects the performance of program?

The performance difference is probably very minimal in this case. I would concentrate on clarity rather than performance until you identify this portion of your code to be a serious bottleneck.
It really does depend on what kind of code you're running in the loop, however. If you're just doing a tiny mathematical operation that isn't going to take any CPU time, but you're doing it a few hundred thousand times, then inlining the calculation might make sense. Anything more expensive than that, though, and performance shouldn't be an issue.

There is an overhead of calling a function.
So if the "long code" is fast compared to this overhead (and your application cares about performance), then you should definitely avoid the overhead.
However, if the performance is not noticeably worse, it's better to make the code more readable by using a function (or, better, several).

Rule one of performance optimisation: Measure it.
Personally, I go for readable code first and then optimise it IF NECESSARY. Usually, it isn't necessary :-)
See the first line in CHAPTER 3 - Measurement Is Everything
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." - Donald Knuth
In this case, the difference in performance will probably be minimal between the two solutions, so writing clearer code is the way to do it.
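A crude way to "measure it" is sketched below. The loop body acc * 31 + i is an arbitrary stand-in for the "long code", the class name is invented, and hand-rolled timing like this is easily distorted by JIT warm-up and dead-code elimination, so a dedicated harness such as JMH is the more trustworthy option for real decisions.

```java
class InlineVsMethod {
    // The loop body extracted into a method.
    static long step(long acc, int i) {
        return acc * 31 + i;
    }

    static long withMethod(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc = step(acc, i);
        }
        return acc;
    }

    // The same computation written directly in the loop.
    static long inlined(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        for (int warm = 0; warm < 5; warm++) { // give the JIT a chance to compile both paths
            withMethod(n);
            inlined(n);
        }
        long t0 = System.nanoTime();
        long a = withMethod(n);
        long t1 = System.nanoTime();
        long b = inlined(n);
        long t2 = System.nanoTime();
        System.out.printf("method: %d us, inlined: %d us, same result: %b%n",
                (t1 - t0) / 1000, (t2 - t1) / 1000, a == b);
    }
}
```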

There really isn't a simple "tutorial" on performance; it is a very complex subject and one that even seasoned veterans often don't fully understand. Anyway, to give you more of an idea of what the overhead of "calling" a function is: basically what you are doing is "freezing" the state of your function (in Java there are no "functions" per se, they are all called methods), calling the method, then "unfreezing" where your method was before.
The "freezing" essentially consists of pushing state information (where you were in the method, what the values of the variables were, etc.) onto the stack; "unfreezing" consists of popping the saved state off the stack and updating the control structures to where they were before you called the function. Naturally memory operations are far from free, but the VM is pretty good at keeping the performance impact to an absolute minimum.
Now keep in mind Java is almost entirely heap based. The only things that really have to get pushed onto the stack are the values of pointers (small), your place in the program (again small), whatever primitives you have local to your method, and a tiny bit of control information, nothing else. Furthermore, although you cannot explicitly inline in Java (though I'm sure there are bytecode editors out there that essentially let you do that), most VMs, including the most popular HotSpot VM, will do this automatically for you. http://java.sun.com/developer/technicalArticles/Networking/HotSpot/inlining.html
So the bottom line is pretty much 0 performance impact, if you want to verify for yourself you can always run benchmarking and profiling tools, they should be able to confirm it for you.

From an execution speed point of view it shouldn't matter, and if you still believe this is a bottleneck it is easy to measure.
From a development performance perspective, it is a good idea to keep the code short. I would vote for turning the loop contents into one (or more) properly named methods.

Forget it! You can't gain any performance by doing the job of the JIT. Let JIT inline it for you. Keep the methods short for readability and also for performance, as JIT works better with short methods.
There are micro-optimizations which may help you gain some performance, but don't even think about them. I suggest the following rules:
Write clean code using appropriate objects and algorithms for readability and for performance.
In case the program is too slow, profile and identify the critical parts.
Think about improving them using better objects and algorithms.
As a last resort, you may also consider micro-optimizations.

Should I consider parallelism in statistical calculations?

We are going to implement software for various statistical analyses in Java. The main concept is to get an array of points on a graph, then iterate through it and find some results (like looking for the longest rising sequence and various indicators).
Problem: a lot of data
Problem 2: it must also work on the client's PC, not only on a server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: Is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here ...

The main reason to use parallel processing is the presence of a large amount of data, or of large computations that can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting it into parts 1..1000, 1001..2000, 2001..3000, etc., processing each part and then combining the results with *. On the other hand, you cannot split the task of computing a big Fibonacci number, since later ones depend on previous ones.
The same goes for large amounts of data. If you have collected an array of points and want to find some concrete points (bigger than some constant, max of all) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "ongoing" information (longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the user's PC has cores (same for computing clusters: do not split the task into more subtasks than there are computers in the cluster).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also search for more specific Java libraries which allow evaluating in parallel. For example, Parallel Colt implements high performance concurrent algorithms for work with big matrices, and there're lots of such libraries for many data representations.
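The factorial example can be sketched with a parallel stream, which splits the range into parts and multiplies the partial products, much like the manual 1..1000, 1001..2000 split described above. The class name ParallelFactorial is invented, and the use of the Java 8 streams API (rather than hand-managed threads or one of the libraries mentioned) is an assumption made for brevity.

```java
import java.math.BigInteger;
import java.util.stream.LongStream;

class ParallelFactorial {
    // Computes n! by reducing the range 1..n in parallel.
    // Multiplication is associative, so partial products can be combined in any grouping.
    static BigInteger of(int n) {
        return LongStream.rangeClosed(1, n)
                .parallel()
                .mapToObj(BigInteger::valueOf)
                .reduce(BigInteger.ONE, BigInteger::multiply);
    }

    public static void main(String[] args) {
        System.out.println(of(10000).bitLength() + " bits");
    }
}
```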

In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores with Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads using Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
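Put together, the approach could look like the sketch below, which finds the maximum of an array of points by giving each pool thread one contiguous chunk. The class ChunkedMax and the choice of "maximum" as the statistic are assumptions for illustration; only availableProcessors() and newFixedThreadPool() come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedMax {
    // Returns the largest value in points, or negative infinity for an empty array.
    static double max(double[] points) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (points.length + threads - 1) / threads; // ceiling division
            List<Future<Double>> parts = new ArrayList<>();
            for (int start = 0; start < points.length; start += chunk) {
                final int from = start;
                final int to = Math.min(points.length, start + chunk);
                Callable<Double> task = () -> {
                    double m = Double.NEGATIVE_INFINITY;
                    for (int i = from; i < to; i++) {
                        m = Math.max(m, points[i]);
                    }
                    return m;
                };
                parts.add(pool.submit(task));
            }
            double result = Double.NEGATIVE_INFINITY;
            for (Future<Double> part : parts) {
                result = Math.max(result, part.get()); // combine the per-chunk maxima
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a long-running application the pool would normally be created once and reused rather than built per call as it is here.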

If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and, maybe, we'll be able to provide you more help.
