Should I consider parallelism in statistic callculations? - java

we are going to implement software for various statistic analysis, in Java. The main concept is to get array of points on graph, then iterate thru it and find some results (like looking for longest rising sequence and various indicators).
Problem: lot of data
Problem2: must also work at client's PC, not only server (no specific server tuning possible)
Partial solution: do computation on background and let user stare at empty screen waiting for result :(
Question: Is there way how to increase performance of computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever usable here ...

The main point to use parallel processing is a presence of large amount of data or large computations that can be performed without each other. For example, you can count factorial of a 10000 with many threads by splitting it on parts 1..1000, 1001..2000, 2001..3000, etc., processing each part and then accumulating results with *. On the other hand, you cannot split the task of computing big Fibonacci number, since later ones depend on previous.
Same for large amounts of data. If you have collected array of points and want to find some concrete points (bigger then some constant, max of all) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "ongoing" information (longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs doesn't have many cores, and parallel computations on single core will only decrease performance, not increase. So, do not create more threads than the number of user PC's cores is (same for computing clusters: do not split the task on more subtasks than the number of computers in cluster is).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also search for more specific Java libraries which allow evaluating in parallel. For example, Parallel Colt implements high performance concurrent algorithms for work with big matrices, and there're lots of such libraries for many data representations.

In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores like Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads like Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.

If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and, maybe, we'll be able to provide you more help.

Related

Get sum of Integers using reduce() vs parallel().reduce() [duplicate]

With Java 8 and lambdas it's easy to iterate over collections as streams, and just as easy to use a parallel stream. Two examples from the docs, the second one using parallelStream:
myShapesCollection.stream()
.filter(e -> e.getColor() == Color.RED)
.forEach(e -> System.out.println(e.getName()));
myShapesCollection.parallelStream() // <-- This one uses parallel
.filter(e -> e.getColor() == Color.RED)
.forEach(e -> System.out.println(e.getName()));
As long as I don't care about the order, would it always be beneficial to use the parallel? One would think it is faster dividing the work on more cores.
Are there other considerations? When should parallel stream be used and when should the non-parallel be used?
(This question is asked to trigger a discussion about how and when to use parallel streams, not because I think always using them is a good idea.)
A parallel stream has a much higher overhead compared to a sequential one. Coordinating the threads takes a significant amount of time. I would use sequential streams by default and only consider parallel ones if
I have a massive amount of items to process (or the processing of each item takes time and is parallelizable)
I have a performance problem in the first place
I don't already run the process in a multi-thread environment (for example: in a web container, if I already have many requests to process in parallel, adding an additional layer of parallelism inside each request could have more negative than positive effects)
In your example, the performance will anyway be driven by the synchronized access to System.out.println(), and making this process parallel will have no effect, or even a negative one.
Moreover, remember that parallel streams don't magically solve all the synchronization problems. If a shared resource is used by the predicates and functions used in the process, you'll have to make sure that everything is thread-safe. In particular, side effects are things you really have to worry about if you go parallel.
In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.
The Stream API was designed to make it easy to write computations in a way that was abstracted away from how they would be executed, making switching between sequential and parallel easy.
However, just because its easy, doesn't mean its always a good idea, and in fact, it is a bad idea to just drop .parallel() all over the place simply because you can.
First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available. A parallel execution will always involve more work than a sequential one, because in addition to solving the problem, it also has to perform dispatching and coordinating of sub-tasks. The hope is that you'll be able to get to the answer faster by breaking up the work across multiple processors; whether this actually happens depends on a lot of things, including the size of your data set, how much computation you are doing on each element, the nature of the computation (specifically, does the processing of one element interact with processing of others?), the number of processors available, and the number of other tasks competing for those processors.
Further, note that parallelism also often exposes nondeterminism in the computation that is often hidden by sequential implementations; sometimes this doesn't matter, or can be mitigated by constraining the operations involved (i.e., reduction operators must be stateless and associative.)
In reality, sometimes parallelism will speed up your computation, sometimes it will not, and sometimes it will even slow it down. It is best to develop first using sequential execution and then apply parallelism where
(A) you know that there's actually benefit to increased performance and
(B) that it will actually deliver increased performance.
(A) is a business problem, not a technical one. If you are a performance expert, you'll usually be able to look at the code and determine (B), but the smart path is to measure. (And, don't even bother until you're convinced of (A); if the code is fast enough, better to apply your brain cycles elsewhere.)
The simplest performance model for parallelism is the "NQ" model, where N is the number of elements, and Q is the computation per element. In general, you need the product NQ to exceed some threshold before you start getting a performance benefit. For a low-Q problem like "add up numbers from 1 to N", you will generally see a breakeven between N=1000 and N=10000. With higher-Q problems, you'll see breakevens at lower thresholds.
But the reality is quite complicated. So until you achieve experthood, first identify when sequential processing is actually costing you something, and then measure if parallelism will help.
I watched one of the presentations of Brian Goetz (Java Language Architect & specification lead for Lambda Expressions). He explains in detail the following 4 points to consider before going for parallelization:
Splitting / decomposition costs
– Sometimes splitting is more expensive than just doing the work!
Task dispatch / management costs
– Can do a lot of work in the time it takes to hand work to another thread.
Result combination costs
– Sometimes combination involves copying lots of data. For example, adding numbers is cheap whereas merging sets is expensive.
Locality
– The elephant in the room. This is an important point which everyone may miss. You should consider cache misses, if a CPU waits for data because of cache misses then you wouldn't gain anything by parallelization. That's why array-based sources parallelize the best as the next indices (near the current index) are cached and there are fewer chances that CPU would experience a cache miss.
He also mentions a relatively simple formula to determine a chance of parallel speedup.
NQ Model:
N x Q > 10000
where,
N = number of data items
Q = amount of work per item
Other answers have already covered profiling to avoid premature optimization and overhead cost in parallel processing. This answer explains the ideal choice of data structures for parallel streaming.
As a rule, performance gains from parallelism are best on streams over ArrayList , HashMap , HashSet , and ConcurrentHashMap instances; arrays; int ranges; and long ranges. What these data structures have in common is that they can all be accurately and cheaply split into subranges of any desired sizes, which makes it easy to divide work among parallel threads. The abstraction used by the streams library to perform this task is the spliterator , which is returned by the spliterator method on Stream and Iterable.
Another important factor that all of these data structures have in common is that they provide good-to-excellent locality of reference when processed sequentially: sequential element references are stored together in memory. The objects referred to by those references may not be close to one another in memory, which reduces locality-of-reference. Locality-of-reference turns out to be critically important for parallelizing bulk operations: without it, threads spend much of their time idle, waiting for data to be transferred from memory into the processor’s cache. The data structures with the best locality of reference are primitive arrays because the data itself is stored contiguously in memory.
Source: Item #48 Use Caution When Making Streams Parallel, Effective Java 3e by Joshua Bloch
Never parallelize an infinite stream with a limit. Here is what happens:
public static void main(String[] args) {
// let's count to 1 in parallel
System.out.println(
IntStream.iterate(0, i -> i + 1)
.parallel()
.skip(1)
.findFirst()
.getAsInt());
}
Result
Exception in thread "main" java.lang.OutOfMemoryError
at ...
at java.base/java.util.stream.IntPipeline.findFirst(IntPipeline.java:528)
at InfiniteTest.main(InfiniteTest.java:24)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.stream.SpinedBuffer$OfInt.newArray(SpinedBuffer.java:750)
at ...
Same if you use .limit(...)
Explanation here:
Java 8, using .parallel in a stream causes OOM error
Similarly, don't use parallel if the stream is ordered and has much more elements than you want to process, e.g.
public static void main(String[] args) {
// let's count to 1 in parallel
System.out.println(
IntStream.range(1, 1000_000_000)
.parallel()
.skip(100)
.findFirst()
.getAsInt());
}
This may run much longer because the parallel threads may work on plenty of number ranges instead of the crucial one 0-100, causing this to take very long time.
Collection.parallelStream() is a great way to do work in parallel. However you need to keep in mind that this effectively uses a common thread pool with only a few worker threads internally (number of threads equals to the number of cpu cores by default), see ForkJoinPool.commonPool(). If some of pool's tasks are a long-running I/O-bound work then others, potentially fast, parallelStream calls will get stuck waiting for the free pool threads. This obviously leads to a requirement of fork-join tasks being non-blocking and short or, in other words, cpu-bound. For better understanding of details I strongly recommend careful reading of java.util.concurrent.ForkJoinTask javadoc, here are some relevant quotes:
The efficiency of ForkJoinTasks stems from ... their main use as computational tasks calculating pure functions or operating on purely isolated objects.
Computations should ideally avoid synchronized methods or blocks, and should minimize other blocking synchronization
Subdividable tasks should also not perform blocking I/O
These indicate the main purpose of parallelStream() tasks as short computations over isolated in-memory structures. Also recommend checking out article Common parallel stream pitfalls

Performance of Java Parallel Stream vs Sequential stream [duplicate]

With Java 8 and lambdas it's easy to iterate over collections as streams, and just as easy to use a parallel stream. Two examples from the docs, the second one using parallelStream:
myShapesCollection.stream()
.filter(e -> e.getColor() == Color.RED)
.forEach(e -> System.out.println(e.getName()));
myShapesCollection.parallelStream() // <-- This one uses parallel
.filter(e -> e.getColor() == Color.RED)
.forEach(e -> System.out.println(e.getName()));
As long as I don't care about the order, would it always be beneficial to use the parallel? One would think it is faster dividing the work on more cores.
Are there other considerations? When should parallel stream be used and when should the non-parallel be used?
(This question is asked to trigger a discussion about how and when to use parallel streams, not because I think always using them is a good idea.)
A parallel stream has a much higher overhead compared to a sequential one. Coordinating the threads takes a significant amount of time. I would use sequential streams by default and only consider parallel ones if
I have a massive amount of items to process (or the processing of each item takes time and is parallelizable)
I have a performance problem in the first place
I don't already run the process in a multi-thread environment (for example: in a web container, if I already have many requests to process in parallel, adding an additional layer of parallelism inside each request could have more negative than positive effects)
In your example, the performance will anyway be driven by the synchronized access to System.out.println(), and making this process parallel will have no effect, or even a negative one.
Moreover, remember that parallel streams don't magically solve all the synchronization problems. If a shared resource is used by the predicates and functions used in the process, you'll have to make sure that everything is thread-safe. In particular, side effects are things you really have to worry about if you go parallel.
In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.
The Stream API was designed to make it easy to write computations in a way that was abstracted away from how they would be executed, making switching between sequential and parallel easy.
However, just because its easy, doesn't mean its always a good idea, and in fact, it is a bad idea to just drop .parallel() all over the place simply because you can.
First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available. A parallel execution will always involve more work than a sequential one, because in addition to solving the problem, it also has to perform dispatching and coordinating of sub-tasks. The hope is that you'll be able to get to the answer faster by breaking up the work across multiple processors; whether this actually happens depends on a lot of things, including the size of your data set, how much computation you are doing on each element, the nature of the computation (specifically, does the processing of one element interact with processing of others?), the number of processors available, and the number of other tasks competing for those processors.
Further, note that parallelism also often exposes nondeterminism in the computation that is often hidden by sequential implementations; sometimes this doesn't matter, or can be mitigated by constraining the operations involved (i.e., reduction operators must be stateless and associative.)
In reality, sometimes parallelism will speed up your computation, sometimes it will not, and sometimes it will even slow it down. It is best to develop first using sequential execution and then apply parallelism where
(A) you know that there's actually benefit to increased performance and
(B) that it will actually deliver increased performance.
(A) is a business problem, not a technical one. If you are a performance expert, you'll usually be able to look at the code and determine (B), but the smart path is to measure. (And, don't even bother until you're convinced of (A); if the code is fast enough, better to apply your brain cycles elsewhere.)
The simplest performance model for parallelism is the "NQ" model, where N is the number of elements, and Q is the computation per element. In general, you need the product NQ to exceed some threshold before you start getting a performance benefit. For a low-Q problem like "add up numbers from 1 to N", you will generally see a breakeven between N=1000 and N=10000. With higher-Q problems, you'll see breakevens at lower thresholds.
But the reality is quite complicated. So until you achieve experthood, first identify when sequential processing is actually costing you something, and then measure if parallelism will help.
I watched one of the presentations of Brian Goetz (Java Language Architect & specification lead for Lambda Expressions). He explains in detail the following 4 points to consider before going for parallelization:
Splitting / decomposition costs
– Sometimes splitting is more expensive than just doing the work!
Task dispatch / management costs
– Can do a lot of work in the time it takes to hand work to another thread.
Result combination costs
– Sometimes combination involves copying lots of data. For example, adding numbers is cheap whereas merging sets is expensive.
Locality
– The elephant in the room. This is an important point which everyone may miss. You should consider cache misses, if a CPU waits for data because of cache misses then you wouldn't gain anything by parallelization. That's why array-based sources parallelize the best as the next indices (near the current index) are cached and there are fewer chances that CPU would experience a cache miss.
He also mentions a relatively simple formula to determine a chance of parallel speedup.
NQ Model:
N x Q > 10000
where,
N = number of data items
Q = amount of work per item
Other answers have already covered profiling to avoid premature optimization and overhead cost in parallel processing. This answer explains the ideal choice of data structures for parallel streaming.
As a rule, performance gains from parallelism are best on streams over ArrayList , HashMap , HashSet , and ConcurrentHashMap instances; arrays; int ranges; and long ranges. What these data structures have in common is that they can all be accurately and cheaply split into subranges of any desired sizes, which makes it easy to divide work among parallel threads. The abstraction used by the streams library to perform this task is the spliterator , which is returned by the spliterator method on Stream and Iterable.
Another important factor that all of these data structures have in common is that they provide good-to-excellent locality of reference when processed sequentially: sequential element references are stored together in memory. The objects referred to by those references may not be close to one another in memory, which reduces locality-of-reference. Locality-of-reference turns out to be critically important for parallelizing bulk operations: without it, threads spend much of their time idle, waiting for data to be transferred from memory into the processor’s cache. The data structures with the best locality of reference are primitive arrays because the data itself is stored contiguously in memory.
Source: Item #48 Use Caution When Making Streams Parallel, Effective Java 3e by Joshua Bloch
Never parallelize an infinite stream with a limit. Here is what happens:
public static void main(String[] args) {
// let's count to 1 in parallel
System.out.println(
IntStream.iterate(0, i -> i + 1)
.parallel()
.skip(1)
.findFirst()
.getAsInt());
}
Result
Exception in thread "main" java.lang.OutOfMemoryError
at ...
at java.base/java.util.stream.IntPipeline.findFirst(IntPipeline.java:528)
at InfiniteTest.main(InfiniteTest.java:24)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.stream.SpinedBuffer$OfInt.newArray(SpinedBuffer.java:750)
at ...
Same if you use .limit(...)
Explanation here:
Java 8, using .parallel in a stream causes OOM error
Similarly, don't use parallel if the stream is ordered and has much more elements than you want to process, e.g.
public static void main(String[] args) {
// let's count to 1 in parallel
System.out.println(
IntStream.range(1, 1000_000_000)
.parallel()
.skip(100)
.findFirst()
.getAsInt());
}
This may run much longer because the parallel threads may work on plenty of number ranges instead of the crucial one 0-100, causing this to take very long time.
Collection.parallelStream() is a great way to do work in parallel. However you need to keep in mind that this effectively uses a common thread pool with only a few worker threads internally (number of threads equals to the number of cpu cores by default), see ForkJoinPool.commonPool(). If some of pool's tasks are a long-running I/O-bound work then others, potentially fast, parallelStream calls will get stuck waiting for the free pool threads. This obviously leads to a requirement of fork-join tasks being non-blocking and short or, in other words, cpu-bound. For better understanding of details I strongly recommend careful reading of java.util.concurrent.ForkJoinTask javadoc, here are some relevant quotes:
The efficiency of ForkJoinTasks stems from ... their main use as computational tasks calculating pure functions or operating on purely isolated objects.
Computations should ideally avoid synchronized methods or blocks, and should minimize other blocking synchronization
Subdividable tasks should also not perform blocking I/O
These indicate the main purpose of parallelStream() tasks as short computations over isolated in-memory structures. Also recommend checking out article Common parallel stream pitfalls

Problems with Streams in Java 8

As per this article there are some serious flaws with Fork-Join architecture in Java. As per my understanding Streams in Java 8 make use of Fork-Join framework internally. We can easily turn a stream into parallel by using parallel() method. But when we submit a long running task to a parallel stream it blocks all the threads in the pool, check this. This kind of behaviour is not acceptable for real world applications.
My question is what are the various considerations that I should take into account before using these constructs in high-performance applications (e.g. equity analysis, stock market ticker etc.)
The considerations are similar to other uses of multiple threads.
Only use multiple threads if you know they help. The aim is not to use every core you have, but to have a program which performs to your requirements.
Don't forget multi-threading comes with an overhead, and this overhead can exceed the value you get.
Multi-threading can experience large outliers. When you test performance you should not only look at throughput (which should be better) but the distribution of your latencies (which is often worse in extreme cases)
For low latency, switch between threads as little as possible. If you can do everything in one thread that may be a good option.
For low latency, you don't want to play nice, instead you want to minimise jitter by doing things such as pinning busy waiting threads to isolated cores. The more isolated cores you have the less junk cores you have to run things like thread pools.
The streams API makes parallelism deceptively simple. As was stated before, whether using a parallel stream speeds up your application needs to be thoroughly analysed and tested in the actual runtime context. My own experience with parallel streams streams suggests the following (and I am sure this list is far from complete):
The cost of the operations performed on the elements of the stream versus the cost of the parallelising machinery determines the potential benefit of parallel streams. For example, finding the maximum in an array of doubles is so fast using a tight loop that the streams overhead is never worthwhile. As soon as the operations get more expensive, the balance starts to tip in favour of the parallel streams API - under ideal conditions, say, a multi-core machine dedicated to a single algorithm). I encourage you to experiment.
You need to have the time and stamina to learn the intrinsics of the stream API. There are unexpected pitfalls. For example, a Spliterator can be constructed from a regular Iterator in simple statement. Under the hood, the elements produced by the iterator are first collected into an array. Depending on the number of elements produced by the Iterator that approach becomes very or even too resource hungry.
While the cited article make it seem that we are completely at the mercy of Oracle, that is not entirely true. You can write your own Spliterator that splits the input into chunks that are specific to your situation rather than relying on the default implementation. Or, you could write your own ThreadFactory (see the method ForkJoinPool.makeCommonPool).
You need to be careful not to produce deadlocks. If the tasks executed on the elements of the stream use the ForkJoinPool themselves, a deadlock may occur. You need to learn how to use the ForkJoinPool.ManagedBlocker API and its use (which I find rather the opposite of easy to grasp). Technically you are telling the ForkJoinPool that a thread is blocking which may lead to the creation of additional threads to keep the degree of parallelism intact. The creation of extra threads is not free, of course.
Just my five cents...
The point (there are actually 17) of the articles is to point out that the F/J Framework is more of a research project than a general-purpose commercial application development framework.
Criticize the object, not the man. Trying to do that is most difficult when the main problem with the framework is that the architect is a professor/scientist not an engineer/commercial developer. The PDF consolidation downloadable from the article goes more into the problem of using research standards rather than engineering standards.
Parallel streams work fine, until you try to scale them. The framework uses pull technology; the request goes into a submission queue, the thread must pull the request out of the submission queue. The Task goes back into the forking thread's deque, other threads must pull the Task out of the deque. This technique doesn't scale well. In a push technology, each Task is scattered to every thread in the system. That works much better in large scale environments.
There are many other problems with scaling as even Paul Sandoz from Oracle pointed out: For instance if you have 32 cores and are doing Stream.of(s1, s2, s3, s4).flatMap(x -> x).reduce(...) then at most you will only use 4 cores. The article points out, with downloadable software, that scaling does not work well and the parquential technique is necessary to avoid stack overflows and OOME.
Use the parallel streams. But beware of the limitations.

How many threads are okay to use for tic-tac-toe using minimax?

Let's take a 5x5 tic-tac-toe as an example.
Let's say it's my AI's turn.
Then,
I make 25 moves (at each cell basically, of course, if it's a legal
move),
create a thread for each move (25 threads total (at most)),
call a minimax function on each made move,
then when all results come from each thread,
compare the scores and choose the move with the best score.
Here are my questions:
Is it efficient to use 25 threads? What does using 25 threads mean?
Is it 25 times faster (most likely not)? What it depends on? On the computer, of course, but how can I know how many threads are okay to use based on the computer's resources?
What happens if I use too many threads (nothing I guess...)?
Is my idea good? Thanks.
For a typical compute-bound application, a good rule of thumb is to use as many threads as you have hardware cores (or hyperthreads). Using more threads than cores won't make your application go faster. Instead, it will cause your application to use more memory than is necessary. Each thread typically has a 0.5 to 1Mbyte stack ... depending on your hardware and the Java version. If you create too many threads, the extra memory usage will result in a significant performance hit; i.e. more threads => slower program!
Another thing to consider is that Java threads are expensive to create on a typical JVM. So unless a thread does enough work (in its lifetime) there is a risk that you spend more time creating threads than you gain by using multiple cores in the computation.
Finally, you may find that the work does not spread evenly over all threads, depending on your minmax algorithm ... and the game state.
If I was trying to implement this, I'd start by implementing it as a single threaded application, and then:
benchmark it to figure out how long it takes to calculate a more when run serially,
profile it to get rid of any bottlenecks
re-benchmark to see if it is fast enough already.
If and only if it needs to go faster, I would then examine the code and (if necessary) add some monitoring to see how to break the computation down into large enough chunks to be executed in parallel.
Finally, I'd use those results to design and implement a multi-threaded version.
I'd also look at alternatives ... like using Java 7 fork/join instead of threads.
To answer your direct questions:
Is it efficient to use 25 threads?
Probably not. It would only be efficient if you had that many cores (unlikely!). And even then you are only going to get a good speedup from using lots of threads if you gain more by running things in parallel than you lose due to thread-related overheads. (In other words, it depends how effectively you use those threads.)
What does using 25 threads mean?
I assume that you mean that you have created and started 25 Threads, either explicitly or using some existing thread pool implementation.
But the bottom line is that if you have (say) 4 cores, then at most 4 of those 25 threads can be executing at one time. The other threads will be waiting ...
Is it 25 times faster (most likely not)? What it depends on? On the computer, of course, but how can I know how many threads are okay to use based on the computer's resources?
The primary factor that limits performance is the number of cores. See above.
What happens if I use too many threads (nothing I guess...)?
Too many threads means you use more memory and that makes your application run slower because of memory bandwidth competition, competition for physical memory pages, extra garbage collection. These factors are application and platform dependent, and hard to quantify; i.e. predict or measure.
Depending on the nature of your application (i.e. precisely how you implement the algorithms) too many threads could result in extra lock contention and thread context switching. That will also make your application slower.
It is a impossible to predict what would happen without seeing your actual code. But the number of cores gives you a theoretical upper bound on how much speedup is possible. If you have 4 cores, then you cannot get more than a 4-fold speedup with multi-threading.
So, the threading answers given are ok, but it seemed to me they overlooked the alpha-beta pruning feature of minimax search.
If you launch a thread for each "next move" from your current position, then having those thread talk to each other is slow and painful to write correctly. But, if they cant talk to each other, then you don't get the depth boosting that comes from alpha-beta pruning, until one level further down.
This will act against the efficiency of the result.
For general cases of improve computation time, the best case tends to be 1 thread per core, with either a simple assignment of tasks to thread if they are all similar time (eg matrix multiplication), or having a "set" of tasks, with each thread grabbing the next un-started one whenever it finishes its current task. (this has some locking tasks, but if they are small compared to resolution cost it is very effective).
So, for a 4 core system, and ~25 natural tasks you can hope for a speed up in the range of 3.5-4x. (you would do 4 in parallel ~5 times, then finish messily). But, in the minimax case you have lost the alpha-beta pruning aspect, which I understand is estimated to reduce "effective breadth" from N to about sqrt(N). For ~25 case, that means a effective branching factor of 5. this means using 4 cores and skipping pruning for the first level might actually hurt you.
So, where does the leave us?
Give up on going mutli-thread. or,
thread based on available cores. Is up to 4 times faster, while also being up to sqrt(25)==5 times slower. or,
go multi-thread, but propagate the betas across your threads. This will like require some locking code, but hopefully that wont be too costly. You will reduce the effectiveness of the alpha-beta pruning, since you will be searching sub-trees you wouldn't search in a strict left->right pass, but any thread that happens to be searching redundant areas is still little worse than having a core doing nothing. So, superfluous searches should be more than offset by additional useful work done. (but this is much harder to code then a simple task<-> thread mapping).
The real issue here may be the needing to be/find someone who really groks both alpha-beta pruning and multi-threading for speed. It doesn't strike me as code I would trust many people to write correctly.
(eg I have in my time writtin many multi-threaded programs and several minimax searches but I don't know of the top of my head if you will need to propagate betas or alphas or both between threads for the search from the top node).
As all my friends said that use as many threads as your machine has capacity.
but by adding them to you should go with improving algorithm as well.
for example in 5x5 tic tac toe both will get 12 or 13 moves. so number of posible moves are as nCr(combination equation) base 25C12 = 5,200,300. so now you have decrease number of thread now going to best selection you have best way to find best solution are only 12 (wining position) & 12 to lose worst condition all other are draw condition. so now what you can do is simply check those 12 condition from threads & leave extra combination with calculation that you need to create 12! * 12 no of threads which is very low compare to 25!.
hence your number of thread is going to decrease you can further think on it to decrease your number of thread.
when as your moves goes more & more you can go with alpha-beta pruning so that you can improve your algorithm as well.
If you are using Threads then to prevent memory wastage just use them for first calls of mini-max and then combine the result of the thread to get the output. It is a wastage if you use 25 threads or something number so big because the available cores are way less than that so what you can do is schedule only no of threads equivalent to available cores at a time on different states and combine all the results at the end.
Here is pseudo code:-
int miniMax(State,Player,depth) {
// normal minimax code
}
State ParaMiniMax(State,Player) {
int totalThreads = Runtime.getRuntime().availableProcessors());
NextStates = getNextStates(State);
while(NextStates.size()>0) {
k = totalThreads;
while(k>0 && NextStates.size>0) {
//Schedule thread with nextState. with run calling miniMax with other player
//Store (score,state) in Result List
k--;
NextStates.removeTop();
}
wait(); // waits for threads to complete
}
if(player==max) {
return(maxScore(Result).State);
}
else return(minScore(Result).State);
}
You should only use a number of threads equal to the number of cores the machine has. Scheduling tasks onto those threads is a different thing.
Consider the symmetry of your problem. There are actually only a very limited number of "unique" initial moves - the rest are the same but for reflection or rotation (therefore of identical strategic value). The unique moves for a 5x5 board are:
xxx..
.xx..
..x..
.....
.....
Or just 6 initial moves. Bam - you just reduced the complexity by >4x with no threads.
As others said, more threads than you have cores usually doesn't help in speeding up unless individual threads spend time "waiting" - for inputs, memory access, other results. It might be that six threads would be a good place to start.
Just to convince you about the symmetry, I am marking equivalent positions with the same number- see if you agree
12321
24542
35653
24542
12321
This is the same when you rotate by any multiple of 90 degrees, or reflect about diagonal or left-right, up-down.
PS I realize this is not really answering the question you asked, but I believe it very directly addresses your underlying question - "how do I efficiently solve 5x5 tic-tac-toe exhaustively". As such I won't be upset if you select a different answer but I do hope you will take my advice to heart.

Automatic parallelization

What is your opinion regarding a project that will try to take a code and split it to threads automatically(maybe compile time, probably in runtime).
Take a look at the code below:
for(int i=0;i<100;i++)
sum1 += rand(100)
for(int j=0;j<100;j++)
sum2 += rand(100)/2
This kind of code can automatically get split to 2 different threads that run in parallel.
Do you think it's even possible?
I have a feeling that theoretically it's impossible (it reminds me the halting problem) but I can't justify this thought.
Do you think it's a useful project? is there anything like it?
This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...
It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indicies (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() on a separate thread.
As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.
If you're truly interested in this topic and don't mind sifting through research papers:
Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.
This is practically not possible.
The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.
While it would be possible to parallelize very simple loops, even then, there's a risk involved. For example, your above code could only be parallelized if rand() is thread-safe - and many random number generation routines are not. (Java's Math.random() is synchronized for you - however.)
Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.
It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for java code (this is equivalent to the halting problem).
Instead we need to relax our goal from the best partition into some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem, one is to forget about general code and start looking at specific types of program. If you have simple control-flow (constant bounded for-loops, limited branching....) then you can make much more head-way.
Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together then you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX / SSE style code). Getting to that stage has taken decades but if you look at compilers like Intel's then performance is starting to get pretty good.
If you move from vector instructions inside a single thread to multiple threads within a process then you have a huge increase in latency moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targetted tools available. If you can write one that works it would be very interesting to many people.
For your specific example, if you assume that rand() is a parallel version so you can call it independently from different threads then it's quite easy to see that the code can be split into two. A compiler would convert just need dependency analysis to see that neither loop uses data from or affects the other. So the order between them in the user-level code is a false dependency that could split (i.e by putting each in a separate thread).
But this isn't really how you would want to parallelise the code. It looks as if each loop iteration is dependent on the previous as sum1 += rand(100) is the same as sum1 = sum1 + rand(100) where the sum1 on the right-hand-side is the value from the previous iteration. However the only operation involved is addition, which is associative so we rewrite the sum many different ways.
sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...
The advantage of the second is that each single addition in brackets can be computed in parallel to all of the others. Once you have 50 results then they can be combined into a further 25 additions and so on... You do more work this way 50+25+13+7+4+2+1 = 102 additions versus 100 in the original but there are only 7 sequential steps so apart from the parallel forking/joining and communication overhead it runs 14 times quicker. This tree of additions is called a gather operation in parallel architectures and it tends to be the expensive part of a computation.
On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.
In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.
Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.
That does not mean it would be useful. Consider the following:
First of all, to do it at compile-time, you have to inspect all code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simply computations.
Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies the same state into several threads, for example. This is probably a very difficult task and in many cases you will end up with not being sure - two variables might in fact reference the same object.
Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why his code was not parallelizable and how it should be changed.
I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.
As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.
There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.
I've learnt that as of JDK 1.8(Java 8), you can utilize/leverage multiple cores of your CPU in case of streams usage by using parallelStream().
However, it has been studied that before finalizing to go to production with parallelStream() it is always better to compare sequential() with parallel, by benchmarking the performance, and then decide which would be ideal.
Why?/Reason is: there could be scenarios where the parallel stream will perform dramatically worse than sequential, when the operation needs to do auto un/boxing. For those scenarios its advisable to use the Java 8 Primitive Streams such as IntStream, LongStream, DoubleStream.
Reference: Modern Java in Action: Manning Publications 2019
The Programming language is Java and Java is a virtual machine. So shouldn't one be able to execute the code at runtime on different Threads owned by the VM. Since all the Memory etc. is handled like that It whould not cause any corruption . You could see the Code as a Stack of instructions estimating execution Time and then distribute it on an Array of Threads which are each have an execution stack of roughtly the same time. It might be dangerous though some graphics like OpenGL immediate mode needs to maintain order and mostly should not be threaded at all.

Categories