Choice between brute force approach and concurrent threads - Java

I have a question concerning graphs. Consider a graph with nodes and edges, each edge having a cost. The problem is to visit all nodes so that the sum of the costs of the edges traversed is minimal (the Traveling Salesman Problem, I guess).
Which approach would you recommend: brute force by recursion, or brute force by spawning threads to concurrently travel different paths and calculate their costs?
Or do you have a better way of approaching this problem?

TSP is NP-hard (see Wikipedia); it scales horribly.
Multithreading it on a 4-core machine can make it up to 4 times faster,
which is nothing compared to the 100, 1,000 or 1,000,000 times it goes slower
when you try a slightly larger problem.
Just try it with real-sized data: it can take years to finish.
One solution is metaheuristics; there are a couple of libraries,
such as Drools Planner (open source, Java).
Take a look at its TTP example.

Recursion is simpler, and since it's brute force, multithreading is not guaranteed to get you a solution faster. But before you reinvent the wheel, check out Concorde TSP Solver:
http://www.tsp.gatech.edu/concorde/index.html
It's a free download and includes source.

I would go for the multi-threaded approach as it will do things in parallel. As an added optimization I would also keep the lowest cost of a full path in some shared variable and have each thread check against that cost: if a partial traversal already exceeds it, I would terminate that branch immediately.
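A minimal sketch of that shared-bound (branch-and-bound) idea, assuming an adjacency-cost matrix; the names (bestCost, search) are illustrative, not from the question:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative branch-and-bound sketch: the shared bound lets every thread
// prune partial paths that are already worse than the best complete tour.
class BoundedSearch {
    static final AtomicLong bestCost = new AtomicLong(Long.MAX_VALUE);

    static void search(int[][] cost, boolean[] visited, int current,
                       long costSoFar, int visitedCount) {
        if (costSoFar >= bestCost.get()) return;   // prune: can't beat the shared best
        if (visitedCount == cost.length) {
            bestCost.accumulateAndGet(costSoFar, Math::min); // publish a cheaper tour
            return;
        }
        for (int next = 0; next < cost.length; next++) {
            if (!visited[next]) {
                visited[next] = true;
                search(cost, visited, next,
                       costSoFar + cost[current][next], visitedCount + 1);
                visited[next] = false;
            }
        }
    }
}

Each root branch could then be handed to a different thread; since they all share bestCost, a good tour found by one thread immediately tightens the pruning in all the others.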

Since I don't know why you're doing this nor do I know what your constraints are, I vote for single-thread recursion. It's easier.

How many threads are okay to use for tic-tac-toe using minimax?

Let's take a 5x5 tic-tac-toe as an example.
Let's say it's my AI's turn.
Then:
I make 25 moves (one at each cell, if it's a legal move),
create a thread for each move (25 threads total, at most),
call a minimax function on each made move,
then when all the results come back from the threads,
compare the scores and choose the move with the best score.
Here are my questions:
Is it efficient to use 25 threads? What does using 25 threads mean?
Is it 25 times faster (most likely not)? What does it depend on? On the computer, of course, but how can I know how many threads are okay to use based on the computer's resources?
What happens if I use too many threads (nothing, I guess...)?
Is my idea good? Thanks.
For a typical compute-bound application, a good rule of thumb is to use as many threads as you have hardware cores (or hyperthreads). Using more threads than cores won't make your application go faster. Instead, it will cause your application to use more memory than is necessary. Each thread typically has a 0.5 to 1Mbyte stack ... depending on your hardware and the Java version. If you create too many threads, the extra memory usage will result in a significant performance hit; i.e. more threads => slower program!
Another thing to consider is that Java threads are expensive to create on a typical JVM. So unless a thread does enough work (in its lifetime) there is a risk that you spend more time creating threads than you gain by using multiple cores in the computation.
Finally, you may find that the work does not spread evenly over all threads, depending on your minimax algorithm ... and the game state.
If I was trying to implement this, I'd start by implementing it as a single threaded application, and then:
benchmark it to figure out how long it takes to calculate a move when run serially,
profile it to get rid of any bottlenecks,
re-benchmark to see if it is fast enough already.
If and only if it needs to go faster, I would then examine the code and (if necessary) add some monitoring to see how to break the computation down into large enough chunks to be executed in parallel.
Finally, I'd use those results to design and implement a multi-threaded version.
I'd also look at alternatives ... like using Java 7 fork/join instead of threads.
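For instance, here is a rough negamax-style fork/join sketch; the Board interface and its methods are hypothetical placeholders for the poster's own game-state code, not an established API:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical game-state abstraction standing in for your own code.
interface Board {
    boolean isTerminal();
    int score();              // evaluation from the point of view of the side to move
    List<Board> legalMoves(); // successor positions
}

class ScoreTask extends RecursiveTask<Integer> {
    private final Board board;
    ScoreTask(Board board) { this.board = board; }

    @Override
    protected Integer compute() {
        if (board.isTerminal()) return board.score();
        List<ScoreTask> subtasks = new ArrayList<>();
        for (Board child : board.legalMoves()) {
            ScoreTask task = new ScoreTask(child);
            task.fork();                         // submit to the pool; work-stealing balances load
            subtasks.add(task);
        }
        int best = Integer.MIN_VALUE;
        for (ScoreTask task : subtasks) {
            best = Math.max(best, -task.join()); // negamax: negate the child's score
        }
        return best;
    }
}
// Usage: int best = ForkJoinPool.commonPool().invoke(new ScoreTask(start));

The pool keeps roughly one worker per core busy regardless of how many subtasks exist, which sidesteps the "25 threads" problem entirely.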
To answer your direct questions:
Is it efficient to use 25 threads?
Probably not. It would only be efficient if you had that many cores (unlikely!). And even then you are only going to get a good speedup from using lots of threads if you gain more by running things in parallel than you lose due to thread-related overheads. (In other words, it depends how effectively you use those threads.)
What does using 25 threads mean?
I assume that you mean that you have created and started 25 Threads, either explicitly or using some existing thread pool implementation.
But the bottom line is that if you have (say) 4 cores, then at most 4 of those 25 threads can be executing at one time. The other threads will be waiting ...
Is it 25 times faster (most likely not)? What does it depend on? On the computer, of course, but how can I know how many threads are okay to use based on the computer's resources?
The primary factor that limits performance is the number of cores. See above.
What happens if I use too many threads (nothing I guess...)?
Too many threads means you use more memory and that makes your application run slower because of memory bandwidth competition, competition for physical memory pages, extra garbage collection. These factors are application and platform dependent, and hard to quantify; i.e. predict or measure.
Depending on the nature of your application (i.e. precisely how you implement the algorithms) too many threads could result in extra lock contention and thread context switching. That will also make your application slower.
It is impossible to predict what would happen without seeing your actual code. But the number of cores gives you a theoretical upper bound on how much speedup is possible. If you have 4 cores, then you cannot get more than a 4-fold speedup with multi-threading.
So, the threading answers given are ok, but it seemed to me they overlooked the alpha-beta pruning feature of minimax search.
If you launch a thread for each "next move" from your current position, then having those threads talk to each other is slow and painful to write correctly. But if they can't talk to each other, then you don't get the depth boosting that comes from alpha-beta pruning until one level further down.
This will act against the efficiency of the result.
For the general case of improving computation time, the best approach tends to be 1 thread per core, with either a simple assignment of tasks to threads if they all take similar time (e.g. matrix multiplication), or a "set" of tasks, with each thread grabbing the next un-started one whenever it finishes its current task. (This has some locking cost, but if it is small compared to the resolution cost it is very effective.)
So, for a 4-core system and ~25 natural tasks, you can hope for a speedup in the range of 3.5-4x (you would do 4 in parallel ~5 times, then finish messily). But in the minimax case you have lost the alpha-beta pruning aspect, which I understand is estimated to reduce the "effective breadth" from N to about sqrt(N). For the ~25 case, that means an effective branching factor of 5, so using 4 cores and skipping pruning for the first level might actually hurt you.
So, where does that leave us?
Give up on going multi-threaded. Or,
thread based on available cores. This is up to 4 times faster, while also being up to sqrt(25) == 5 times slower. Or,
go multi-threaded, but propagate the betas across your threads. This will likely require some locking code, but hopefully that won't be too costly. You will reduce the effectiveness of the alpha-beta pruning, since you will be searching sub-trees you wouldn't search in a strict left-to-right pass, but any thread that happens to be searching redundant areas is still little worse than having a core doing nothing. So superfluous searches should be more than offset by additional useful work done. (But this is much harder to code than a simple task<->thread mapping.)
The real issue here may be the need to be, or find, someone who really groks both alpha-beta pruning and multi-threading for speed. It doesn't strike me as code I would trust many people to write correctly.
(E.g. I have in my time written many multi-threaded programs and several minimax searches, but I don't know off the top of my head whether you would need to propagate betas or alphas or both between threads for the search from the top node.)
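A bare-bones sketch of that third option; the shared atomic bound and the Move type are illustrative placeholders, and exactly which bounds to propagate is the hard part noted above:

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: root moves are scored in parallel, and a shared atomic bound lets
// each thread start its alpha-beta window from the best score any thread has
// found so far. Move and alphaBeta() stand in for your own search code.
class SharedBoundSearch {
    private final AtomicInteger sharedBest = new AtomicInteger(Integer.MIN_VALUE);

    int scoreRootMove(Move move) {
        int alpha = sharedBest.get();                  // tighter window thanks to other threads
        int score = alphaBeta(move, alpha, Integer.MAX_VALUE);
        sharedBest.accumulateAndGet(score, Math::max); // publish any improvement
        return score;
    }

    private int alphaBeta(Move move, int alpha, int beta) {
        /* your existing alpha-beta search, seeded with the shared alpha */
        return 0;
    }

    interface Move {}
}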
As all my friends said, use as many threads as your machine has capacity for.
But beyond adding threads, you should improve the algorithm as well.
For example, in 5x5 tic-tac-toe each player gets 12 or 13 moves, so the number of possible placements is the combination 25C12 = 5,200,300. But when selecting the best move, note that there are only 12 winning lines for you (5 rows, 5 columns, 2 diagonals) and 12 losing lines for the opponent; everything else is a draw. So you can simply have threads check those conditions and drop the extra combinations, which needs far fewer threads than enumerating everything up to 25!.
Hence your number of threads goes down, and you can think further about decreasing it even more.
As the game goes on, you can add alpha-beta pruning so that you improve your algorithm as well.
If you are using threads, then to prevent memory wastage just use them for the first calls of minimax and then combine the threads' results to get the output. It is wasteful to use 25 threads or some similarly big number because the number of available cores is far smaller than that; instead, schedule only as many threads as there are available cores at a time, on different states, and combine all the results at the end.
Here is pseudocode:
int miniMax(State state, Player player, int depth) {
    // normal minimax code
}

State paraMiniMax(State state, Player player) {
    int totalThreads = Runtime.getRuntime().availableProcessors();
    nextStates = getNextStates(state);
    while (nextStates.size() > 0) {
        k = totalThreads;
        while (k > 0 && nextStates.size() > 0) {
            // schedule a thread on the next state, whose run() calls miniMax with the other player
            // store (score, state) in a result list
            k--;
            nextStates.removeTop();
        }
        wait(); // wait for this batch of threads to complete
    }
    if (player == MAX) {
        return maxScore(results).state;
    } else {
        return minScore(results).state;
    }
}
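One concrete way to realize this pseudocode with the standard library is invokeAll() on a fixed-size pool; this is a sketch only, and State, Player, getNextStates() and miniMax() remain placeholders for the poster's own solver:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParaMiniMax {
    static State bestMove(State root, Player player, int depth)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<State> children = getNextStates(root);
            List<Callable<Integer>> jobs = new ArrayList<>();
            for (State s : children) {
                jobs.add(() -> miniMax(s, player.opponent(), depth - 1)); // one task per child
            }
            List<Future<Integer>> scores = pool.invokeAll(jobs); // blocks until all are done
            State best = null;
            int bestScore = (player == Player.MAX) ? Integer.MIN_VALUE : Integer.MAX_VALUE;
            for (int i = 0; i < children.size(); i++) {
                int sc = scores.get(i).get();
                if (player == Player.MAX ? sc > bestScore : sc < bestScore) {
                    bestScore = sc;
                    best = children.get(i);
                }
            }
            return best;
        } finally {
            pool.shutdown();
        }
    }

    // Placeholders standing in for the poster's single-threaded solver:
    enum Player { MAX, MIN; Player opponent() { return this == MAX ? MIN : MAX; } }
    interface State {}
    static List<State> getNextStates(State s) { return new ArrayList<>(); }
    static int miniMax(State s, Player p, int depth) { return 0; }
}

The fixed pool size gives the pseudocode's batching behaviour for free: only #cores tasks run at once, and the rest queue.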
You should only use a number of threads equal to the number of cores the machine has. Scheduling tasks onto those threads is a different thing.
Consider the symmetry of your problem. There are actually only a very limited number of "unique" initial moves - the rest are the same but for reflection or rotation (therefore of identical strategic value). The unique moves for a 5x5 board are:
xxx..
.xx..
..x..
.....
.....
Or just 6 initial moves. Bam - you just reduced the complexity by >4x with no threads.
As others said, more threads than you have cores usually doesn't help in speeding up unless individual threads spend time "waiting" - for inputs, memory access, other results. It might be that six threads would be a good place to start.
Just to convince you about the symmetry, I am marking equivalent positions with the same number- see if you agree
12321
24542
35653
24542
12321
This is the same when you rotate by any multiple of 90 degrees, or reflect about diagonal or left-right, up-down.
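To make the symmetry argument concrete, here is a hedged sketch of one way to exploit it: reduce every board to a canonical representative over the 8 rotations/reflections, so equivalent positions are searched (or cached) only once. The names are illustrative:

// Sketch of exploiting the symmetry described above: map each board to a
// canonical representative so rotated/reflected duplicates are handled once.
class Symmetry {
    static final int N = 5;

    // Returns the lexicographically smallest of the 8 rotations/reflections.
    static String canonical(char[][] board) {
        String best = null;
        char[][] b = board;
        for (int r = 0; r < 4; r++) {
            b = rotate90(b); // after 4 iterations we are back at the original orientation
            for (char[][] candidate : new char[][][]{b, mirror(b)}) {
                String key = flatten(candidate);
                if (best == null || key.compareTo(best) < 0) best = key;
            }
        }
        return best;
    }

    static char[][] rotate90(char[][] b) {   // clockwise: (i,j) -> (j, N-1-i)
        char[][] out = new char[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                out[j][N - 1 - i] = b[i][j];
        return out;
    }

    static char[][] mirror(char[][] b) {     // left-right reflection
        char[][] out = new char[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                out[i][N - 1 - j] = b[i][j];
        return out;
    }

    static String flatten(char[][] b) {
        StringBuilder sb = new StringBuilder();
        for (char[] row : b) sb.append(row);
        return sb.toString();
    }
}

A transposition table keyed by canonical(board) then collapses all 8 symmetric variants into a single entry.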
PS I realize this is not really answering the question you asked, but I believe it very directly addresses your underlying question - "how do I efficiently solve 5x5 tic-tac-toe exhaustively". As such I won't be upset if you select a different answer but I do hope you will take my advice to heart.

Should I consider parallelism in statistical calculations?

We are going to implement software for various statistical analyses, in Java. The main concept is to get an array of points on a graph, then iterate through it and find some results (like the longest rising sequence and various indicators).
Problem: lot of data
Problem2: must also work at client's PC, not only server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: Is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here...
The main reason to use parallel processing is the presence of a large amount of data, or large computations, that can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting it into the parts 1..1000, 1001..2000, 2001..3000, etc., processing each part and then accumulating the results with *. On the other hand, you cannot split the task of computing a big Fibonacci number, since later ones depend on previous ones.
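As a sketch of that splitting idea (the names and the chunking scheme are illustrative):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each task multiplies one independent sub-range; the partial products are
// accumulated with * at the end, exactly as described above.
class ParallelFactorial {
    static BigInteger factorial(int n) throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            int chunk = Math.max(1, n / cores);
            List<Future<BigInteger>> parts = new ArrayList<>();
            for (int lo = 1; lo <= n; lo += chunk) {
                final int from = lo, to = Math.min(n, lo + chunk - 1);
                parts.add(pool.submit(() -> product(from, to))); // independent sub-range
            }
            BigInteger result = BigInteger.ONE;
            for (Future<BigInteger> p : parts) result = result.multiply(p.get());
            return result;
        } finally {
            pool.shutdown();
        }
    }

    static BigInteger product(int from, int to) {
        BigInteger acc = BigInteger.ONE;
        for (int i = from; i <= to; i++) acc = acc.multiply(BigInteger.valueOf(i));
        return acc;
    }
}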
The same goes for large amounts of data. If you have collected an array of points and want to find some concrete points (bigger than some constant, the max of all) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "ongoing" information (longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the number of cores on the user's PC (the same goes for computing clusters: do not split the task into more subtasks than there are computers in the cluster).
Hadoop's MapReduce allows you to create parallel computations efficiently. You can also search for more specific Java libraries which allow evaluating in parallel. For example, Parallel Colt implements high-performance concurrent algorithms for working with big matrices, and there are lots of such libraries for many data representations.
In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores like Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads like Executors.newFixedThreadPool(numThreads) and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and, maybe, we'll be able to provide you more help.

If profiler is not the answer, what other choices do we have?

After watching the presentation "Performance Anxiety" of Joshua Bloch, I read the paper he suggested in the presentation "Evaluating the Accuracy of Java Profilers". Quoting the conclusion:
Our results are disturbing because they indicate that profiler incorrectness is pervasive—occurring in most of our seven benchmarks and in two production JVMs—and significant—all four of the state-of-the-art profilers produce incorrect profiles. Incorrect profiles can easily cause a performance analyst to spend time optimizing cold methods that will have minimal effect on performance. We show that a proof-of-concept profiler that does not use yield points for sampling does not suffer from the above problems.
The conclusion of the paper is that we cannot really believe the results of profilers. But then, what is the alternative to using profilers? Should we go back to just using our feeling to do optimization?
UPDATE: A point that seems to be missed in the discussion is the observer effect. Can we build a profiler that is really free of the observer effect?
Oh, man, where to begin?
First, I'm amazed that this is news. Second, the problem is not that profilers are bad, it is that some profilers are bad.
The authors built one that, they feel, is good, just by avoiding some of the mistakes they found in the ones they evaluated.
Mistakes are common because of some persistent myths about performance profiling.
But let's be positive.
If one wants to find opportunities for speedup, it is really very simple:
Sampling should be uncorrelated with the state of the program.
That means happening at a truly random time, regardless of whether the program is in I/O (except for user input), or in GC, or in a tight CPU loop, or whatever.
Sampling should read the function call stack,
so as to determine which statements were "active" at the time of the sample.
The reason is that every call site (point at which a function is called) has a percentage cost equal to the fraction of time it is on the stack.
(Note: the paper is concerned entirely with self-time, ignoring the massive impact of avoidable function calls in large software. In fact, the reason behind the original gprof was to help find those calls.)
Reporting should show percent by line (not by function).
If a "hot" function is identified, one still has to hunt inside it for the "hot" lines of code accounting for the time. That information is in the samples! Why hide it?
An almost universal mistake (that the paper shares) is to be concerned too much with accuracy of measurement, and not enough with accuracy of location.
For example, here is a case of performance tuning
in which a series of performance problems were identified and fixed, resulting in a compounded speedup of 43 times.
It was not essential to know precisely the size of each problem before fixing it, but to know its location.
A phenomenon of performance tuning is that fixing one problem, by reducing the time, magnifies the percentages of remaining problems, so they are easier to find.
As long as any problem is found and fixed, progress is made toward the goal of finding and fixing all the problems.
It is not essential to fix them in decreasing size order, but it is essential to pinpoint them.
On the subject of statistical accuracy of measurement, if a call point is on the stack some percent of time F (like 20%), and N (like 100) random-time samples are taken, then the number of samples that show the call point is a binomial distribution, with mean = NF = 20, standard deviation = sqrt(NF(1-F)) = sqrt(16) = 4. So the percent of samples that show it will be 20% +/- 4%.
So is that accurate? Not really, but has the problem been found? Precisely.
In fact, the larger a problem is, in terms of percent, the fewer samples are needed to locate it. For example, if 3 samples are taken, and a call point shows up on 2 of them, it is highly likely to be very costly.
(Specifically, it follows a beta distribution. If you generate 4 uniform 0..1 random numbers and sort them, the distribution of the 3rd one is the distribution of cost for that call point.
Its mean is (2+1)/(3+2) = 0.6, so that is the expected savings, given those samples.)
INSERTED: And the speedup factor you get is governed by another distribution, BetaPrime, and its average is 4. So if you take 3 samples, see a problem on 2 of them, and eliminate that problem, on average you will make the program four times faster.
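Those claims are easy to sanity-check with a quick simulation; this little sketch (illustrative, not from the answer) checks the binomial one:

import java.util.Random;

// With F = 20% and N = 100 random-time samples, the hit count should
// concentrate around mean 20 with standard deviation 4, as computed above.
class SamplingDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double F = 0.20;
        int N = 100, trials = 100_000;
        double sum = 0, sumSq = 0;
        for (int t = 0; t < trials; t++) {
            int hits = 0;
            for (int i = 0; i < N; i++) if (rnd.nextDouble() < F) hits++;
            sum += hits;
            sumSq += (double) hits * hits;
        }
        double mean = sum / trials;
        double sd = Math.sqrt(sumSq / trials - mean * mean);
        System.out.printf("mean = %.2f (expect 20), sd = %.2f (expect 4)%n", mean, sd);
    }
}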
It's high time we programmers blew the cobwebs out of our heads on the subject of profiling.
Disclaimer - the paper failed to reference my article: Dunlavey, “Performance tuning with instruction-level cost derived from call-stack sampling”, ACM SIGPLAN Notices 42, 8 (August, 2007), pp. 4-8.
If I read it correctly, the paper only talks about sample-based profiling. Many profilers also do instrumentation-based profiling. It's much slower and has some other problems, but it should not suffer from the biases the paper talks about.
The conclusion of the paper is that we cannot really believe the results of profilers. But then, what is the alternative to using profilers?
No. The conclusion of the paper is that current profilers' measuring methods have specific defects. They propose a fix. The paper is quite recent. I'd expect profilers to implement this fix eventually. Until then, even a defective profiler is still much better than "feeling".
Unless you are building bleeding-edge applications that need every CPU cycle, I have found that profilers are a good way to find the 10% slowest parts of your code. As a developer, I would argue that should be all you really care about in nearly all cases.
I have experience with http://www.dynatrace.com/en/ and I can tell you it is very good at finding the low hanging fruit.
Profilers are like any other tool and they have their quirks but I would trust them over a human any day to find the hot spots in your app to look at.
If you don't trust profilers, then you can go into paranoia mode by using aspect-oriented programming, wrapping around every method in your application and then using a logger to log every method invocation.
Your application will really slow down, but at least you'll have a precise count of how many times each method is invoked. If you also want to see how long each method takes to execute, wrap every method with perf4j.
After dumping all these statistics to text files, use some tools to extract all the necessary information and then visualize it. I'd guess this will give you a pretty good overview of how slow your application is in certain places.
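Without any AOP framework, the wrap-and-log idea can be sketched by hand; this is a minimal illustrative wrapper, not the perf4j API:

import java.util.function.Supplier;

// Manual timing wrapper around a call site, logging elapsed time per invocation.
class Timed {
    static <T> T time(String label, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(label + " took " + micros + " us"); // or a real logger
        }
    }
}
// Usage: Result r = Timed.time("parseOrder", () -> parser.parse(input));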
Actually, you are better off profiling at the database level. Most enterprise databases come with the ability to show the top queries over a period of time. Start working on those queries until the top ones are down to 300 ms or less, and you will have made great progress. Profilers are useful for showing behavior of the heap and for identifying blocked threads, but I personally have never gotten much traction with the development teams on identifying hot methods or large objects.

Automatic parallelization

What is your opinion regarding a project that will try to take code and split it into threads automatically (maybe at compile time, probably at runtime)?
Take a look at the code below:
for (int i = 0; i < 100; i++)
    sum1 += rand(100);
for (int j = 0; j < 100; j++)
    sum2 += rand(100) / 2;
This kind of code could automatically be split into 2 different threads that run in parallel.
Do you think it's even possible?
I have a feeling that theoretically it's impossible (it reminds me of the halting problem), but I can't justify this thought.
Do you think it's a useful project? Is there anything like it?
This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...
It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indices (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() on a separate thread.
As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.
If you're truly interested in this topic and don't mind sifting through research papers:
Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.
This is practically not possible.
The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.
While it would be possible to parallelize very simple loops, even then, there's a risk involved. For example, your above code could only be parallelized if rand() is thread-safe - and many random number generation routines are not. (Java's Math.random() is synchronized for you - however.)
Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.
It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for Java code (this is equivalent to the halting problem).
Instead we need to relax our goal from the best partition to some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem; one is to forget about general code and start looking at specific types of program. If you have simple control flow (constant-bounded for-loops, limited branching...) then you can make much more headway.
Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together then you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX / SSE style code). Getting to that stage has taken decades but if you look at compilers like Intel's then performance is starting to get pretty good.
If you move from vector instructions inside a single thread to multiple threads within a process then you have a huge increase in latency moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targeted tools available. If you can write one that works it would be very interesting to many people.
For your specific example, if you assume that rand() is a parallel version, so you can call it independently from different threads, then it's quite easy to see that the code can be split into two. A compiler would just need dependency analysis to see that neither loop uses data from, or affects, the other. So the order between them in the user-level code is a false dependency that could be split (i.e. by putting each in a separate thread).
But this isn't really how you would want to parallelise the code. It looks as if each loop iteration is dependent on the previous one, as sum1 += rand(100) is the same as sum1 = sum1 + rand(100), where the sum1 on the right-hand side is the value from the previous iteration. However, the only operation involved is addition, which is associative, so we can rewrite the sum in many different ways.
sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...
The advantage of the second is that each single addition in brackets can be computed in parallel to all of the others. Once you have 50 results, they can be combined into a further 25 additions, and so on... You do more work this way, 50+25+13+7+4+2+1 = 102 additions versus 100 in the original, but there are only 7 sequential steps, so apart from the parallel forking/joining and communication overhead it runs 14 times quicker. This tree of additions is called a gather operation in parallel architectures, and it tends to be the expensive part of a computation.
On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.
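In Java, the streams library performs exactly this kind of pairwise reduction when asked to; a small sketch (using ThreadLocalRandom so each worker has its own generator state, sidestepping the rand() dependency discussed above):

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Parallel reduction: the framework builds the pairwise "gather" tree
// described above, which is only legal because + is associative.
class ParallelSum {
    public static void main(String[] args) {
        int sum1 = IntStream.range(0, 100)
                .parallel()
                .map(i -> ThreadLocalRandom.current().nextInt(100))
                .sum();
        System.out.println("sum1 = " + sum1);
    }
}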
In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.
Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.
That does not mean it would be useful. Consider the following:
First of all, to do it at compile time, you have to inspect all the code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simple computations.
Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies the same state into several threads, for example. This is probably a very difficult task and in many cases you will end up not being sure - two variables might in fact reference the same object.
Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why his code was not parallelizable and how it should be changed.
I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.
As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.
There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.
I've learnt that as of JDK 1.8 (Java 8), you can utilize multiple cores of your CPU when using streams, via parallelStream().
However, before finalizing a move to production with parallelStream(), it is always better to compare sequential() with parallel() by benchmarking the performance, and then decide which would be ideal.
Why? Because there could be scenarios where the parallel stream performs dramatically worse than the sequential one, such as when the operation needs to do auto un/boxing. For those scenarios it's advisable to use the Java 8 primitive streams such as IntStream, LongStream and DoubleStream.
Reference: Modern Java in Action: Manning Publications 2019
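A minimal benchmarking sketch along those lines (timings are indicative only; for real decisions use a proper harness such as JMH):

import java.util.stream.LongStream;

// Compare sequential vs. parallel on a primitive LongStream (no boxing).
class StreamCompare {
    public static void main(String[] args) {
        long n = 50_000_000L;
        long t0 = System.nanoTime();
        long s1 = LongStream.rangeClosed(1, n).sum();
        long t1 = System.nanoTime();
        long s2 = LongStream.rangeClosed(1, n).parallel().sum();
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, parallel: %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}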
The programming language is Java, and Java runs on a virtual machine, so shouldn't one be able to execute the code at runtime on different threads owned by the VM? Since all the memory etc. is handled like that, it would not cause any corruption. You could see the code as a stack of instructions, estimate execution time, and then distribute it across an array of threads each of which has an execution stack of roughly the same time. It might be dangerous, though; some graphics code, like OpenGL immediate mode, needs to maintain order and mostly should not be threaded at all.

Multi-threaded algorithm for solving sudoku?

I have a homework assignment to write a multi-threaded sudoku solver, which finds all solutions to a given puzzle. I have previously written a very fast single-threaded backtracking sudoku solver, so I don't need any help with the sudoku solving aspect.
My problem is probably related to not really grokking concurrency, but I don't see how this problem benefits from multi-threading. I don't understand how you can find different solutions to the same problem at the same time without maintaining multiple copies of the puzzle. Given this assumption (please prove it wrong), I don't see how the multi-threaded solution is any more efficient than a single-threaded.
I would appreciate it if anyone could give me some starting suggestions for the algorithm (please, no code...)
I forgot to mention, the number of threads to be used is specified as an argument to the program, so as far as I can tell it's not related to the state of the puzzle in any way...
Also, there may not be a unique solution - a valid input may be a totally empty board. I have to report min(1000, number of solutions) and display one of them (if it exists)
Pretty simple really. The basic concept is that in your backtracking solution you would branch when there was a choice. You tried one branch, backtracked and then tried the other choice.
Now, spawn a thread for each choice and try them both simultaneously. Only spawn a new thread if there are fewer than some number of threads already in the system (that would be your input argument); otherwise just use a simple (i.e. your existing) single-threaded solution. For added efficiency, get these worker threads from a thread pool.
This is in many ways a divide and conquer technique, you are using the choices as an opportunity to split the search space in half and allocate one half to each thread. Most likely one half is harder than the other meaning thread lifetimes will vary but that is what makes the optimisation interesting.
The easy way to handle the obvious synchronisation issues is to copy the current board state and pass it into each instance of your function, so it is a function argument. This copying means you don't have to worry about any shared concurrency. If your single-threaded solution used a global or member variable to store the board state, you will need a copy of this either on the stack (easy) or per thread (harder). All your function needs to return is a board state and the number of moves taken to reach it.
Each routine that invokes several threads to do work should invoke n-1 threads when there are n pieces of work, do the nth piece of work itself, and then wait on a synchronisation object until all the other threads are finished. You then evaluate their results: you have n board states, so return the one with the least number of moves.
Multi-threading is useful in any situation where a single thread has to wait for a resource and you can run another thread in the meantime. This includes a thread waiting for an I/O request or database access while another thread continues with CPU work.
Multi-threading is also useful if the individual threads can be farmed out to different CPUs (or cores), as they then run truly concurrently, although they'll generally have to share data so there'll still be some contention.
I can't see any reason why a multi-threaded Sudoku solver would be more efficient than a single-threaded one, simply because there's no waiting for resources. Everything will be done in memory.
But I remember some of the homework I did at Uni, and it was similarly useless (Fortran code to see how deep a tunnel got when you dug down at 30 degrees for one mile then 15 degrees for another mile - yes, I'm pretty old :-). The point is to show you can do it, not that it's useful.
On to the algorithm.
I wrote a single threaded solver which basically ran a series of rules in each pass to try and populate another square. A sample rule was: if row 1 only has one square free, the number is evident from all the other numbers in row 1.
There were similar rules for all rows, all columns, all 3x3 mini-grids. There were also rules which checked row/column intersects (e.g. if a given square could only contain 3 or 4 due to the row and 4 or 7 due to the column, then it was 4). There were more complex rules I won't detail here but they're basically the same way you solve it manually.
I suspect you have similar rules in your implementation (since other than brute force, I can think of no other way to solve it, and if you've used brute force, there's no hope for you :-).
What I would suggest is to allocate each rule to a thread and have them share the grid. Each thread would do its own rule and only that rule.
Update:
Jon, based on your edit:
[edit] I forgot to mention, the number of threads to be used is specified as an argument to the program, so as far as I can tell it's not related to the state of the puzzle in any way...
Also, there may not be a unique solution - a valid input may be a totally empty board. I have to report min(1000, number of solutions) and display one of them (if it exists)
It looks like your teacher doesn't want you to split based on the rules but instead on the fork-points (where multiple rules could apply).
By that I mean, at any point in the solution, if there are two or more possible moves forward, you should allocate each possibility to a separate thread (still using your rules for efficiency but concurrently checking each possibility). This would give you better concurrency (assuming threads can be run on separate CPUs/cores) since there will be no contention for the board; each thread will get its own copy.
In addition, since you're limiting the number of threads, you'll have to work some thread-pool magic to achieve this.
What I would suggest is to have a work queue and N threads. The work queue is initially empty when your main thread starts all the worker threads. Then the main thread puts the beginning puzzle state into the work queue.
The worker threads simply wait for a state to be placed on the work queue and one of them grabs it for processing. The work thread is your single-threaded solver with one small modification: when there are X possibilities to move forward (X > 1), your worker puts X-1 of those back onto the work queue then continues to process the other possibility.
So, let's say there's only one solution (true Sudoku :-). The first worker thread will whittle away at the solution without finding any forks, and that will be exactly as in your current situation.
But with two possibilities at move 27 (say, 3 or 4 could go into the top left cell), your thread will create another board with the first possibility (put 3 into that cell) and place that in the work queue. Then it would put 4 in its own copy and continue.
Another thread will pick up the board with 3 in that cell and carry on. That way, you have two threads running concurrently handling the two possibilities.
When any thread decides that its board is insoluble, it throws it away and goes back to the work queue for more work.
When any thread decides that its board is solved, it notifies the main thread, which can store it, overwriting any previous solution (first found wins), or throw it away if it already has a solution (last found wins); then the worker thread goes back to the work queue for more work. In either case, the main thread should increment a count of solutions found.
When all the threads are idle and the work queue is empty, main either will or won't have a solution. It will also have a count of solutions.
Keep in mind that all communications between workers and main thread will need to be mutexed (I'm assuming you know this based on information in your question).
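A hedged skeleton of this design (the Board methods are placeholders for the existing solver; note that the worker-exit test here is naive, and a real version needs the idle-threads-and-empty-queue termination protocol described above, plus mutexed communication with main):

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class SudokuWorkers {
    final BlockingQueue<Board> queue = new LinkedBlockingQueue<>();
    final AtomicInteger solutions = new AtomicInteger();

    void solve(Board start, int nThreads) throws InterruptedException {
        queue.put(start);
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                Board b;
                while ((b = queue.poll()) != null) { // naive termination; see note above
                    step(b);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
    }

    void step(Board b) {
        if (b.isSolved()) { solutions.incrementAndGet(); return; }
        if (b.isStuck()) return;                     // insoluble: throw the board away
        List<Board> forks = b.candidateBoards();     // X possibilities at the fork
        if (forks.isEmpty()) return;
        for (int i = 1; i < forks.size(); i++) {
            queue.add(forks.get(i));                 // X-1 go back onto the work queue
        }
        step(forks.get(0));                          // keep working on the first one
    }

    interface Board {
        boolean isSolved();
        boolean isStuck();
        List<Board> candidateBoards();
    }
}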
The idea behind multi-threading is taking advantage of having several CPUs, allowing you to make several calculations simultaneously. Of course each thread is going to need its own memory, but that's usually not a problem.
Mostly, what you want to do is divide the possible solution state into several sub-spaces which are as independent as possible (to avoid having to waste too many resources on thread creation overhead), and yet "fit" your algorithm (to actually benefit from having multiple threads).
Here is a greedy brute-force single-threaded solver:
Select next empty cell. If no more empty cells, victory!
Possible cell value = 1
Check for invalid partial solution (duplicates in row, column or 3x3 block).
If the partial solution is invalid, increment the cell value and return to step 3 (if the value passes 9, backtrack to the previous cell). Otherwise, go to step 1.
If you look at the above outline, the combination of steps 2 and 3 is an obvious candidate for multithreading. More ambitious solutions involve creating a recursive exploration that spawns tasks that are submitted to a thread pool.
EDIT to respond to this point: "I don't understand how you can find different solutions to the same problem at the same time without maintaining multiple copies of the puzzle."
You can't. That's the whole point. However, a concrete 9-thread example might make the benefits more clear:
Start with an example problem.
Find the first empty cell.
Create 9 threads, where each thread has its own copy of the problem with its own index as a candidate value in the empty cell.
Within each thread, run your original single-threaded algorithm on this thread-local modified copy of the problem.
If one of the threads finds an answer, stop all the other threads.
As you can imagine, each thread is now running a slightly smaller problem space and each thread has the potential to run on its own CPU core. With a single-threaded algorithm alone, you can't reap the benefits of a multi-core machine.
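A sketch of that 9-thread fan-out (Board and its methods are placeholders for the original single-threaded solver; invokeAny() gives "first answer wins, cancel the rest" for free, though cancellation only takes effect if the solver checks for interruption):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class NineWaySolver {
    static Board solve(Board puzzle) throws InterruptedException, ExecutionException {
        int[] cell = puzzle.firstEmptyCell();
        ExecutorService pool = Executors.newFixedThreadPool(9);
        try {
            List<Callable<Board>> tasks = new ArrayList<>();
            for (int v = 1; v <= 9; v++) {
                Board copy = puzzle.copyWith(cell[0], cell[1], v); // thread-local copy
                tasks.add(() -> {
                    Board solved = copy.solveSingleThreaded();     // your original solver
                    if (solved == null) throw new IllegalStateException("dead branch");
                    return solved;
                });
            }
            return pool.invokeAny(tasks); // first successful result; the rest are cancelled
        } finally {
            pool.shutdownNow();
        }
    }

    // Hypothetical placeholders for the poster's own code:
    interface Board {
        int[] firstEmptyCell();
        Board copyWith(int row, int col, int value);
        Board solveSingleThreaded(); // null if this branch is insoluble
    }
}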
TL;DR
Yes, a backtracking-based Sudoku solver can, depending on the puzzle, benefit considerably from parallelization! A puzzle's search space can be modeled as a tree data structure, and backtracking performs a depth-first search (DFS) of this tree, which is inherently not parallelizable. However, by combining DFS with its opposite form of tree traversal, breadth-first search (BFS), parallelism can be unlocked. This is because BFS allows multiple independent sub-trees to be discovered simultaneously, which can then be searched in parallel.
Because BFS unlocks parallelism, employing it warrants the use of a global thread-safe queue, onto/from which the discovered sub-trees can be pushed/popped by all threads, and this entails significant performance overhead compared to DFS. Hence parallelizing such a solver requires fine-tuning the amount of BFS carried out such that just enough is performed to ensure the traversal of the tree is sufficiently parallelized but not too much such that the overhead of thread communication (pushing/popping sub-trees of the queue) outweighs the speedup parallelization provides.
I parallelized a backtracking-based Sudoku solver a while back, and implemented 4 different parallel solver variants along with a sequential (single-threaded) solver. The parallel variants all combined DFS and BFS in different ways and to varying extents, and the fastest variant was on average over three times as fast as the single-threaded solver (see the graph at the bottom).
Also, to answer your question, in my implementation each thread receives a copy of the initial puzzle (once when the thread is spawned) so the required memory is slightly higher than the sequential solver - which is not uncommon when parallelizing something. But that's the only "inefficiency" as you put it: As mentioned above, if the amount of BFS is appropriately fine-tuned, the attainable speedup via parallelization greatly outweighs the parallel overhead from thread communication as well as the higher memory footprint. Also, while my solvers assumed unique solutions, extending them to handle non-proper puzzles and find all their solutions would be simple and wouldn't significantly, if at all, reduce the speedup, due to the nature of the solver's design. See the full answer below for more details.
FULL ANSWER
Whether a Sudoku solver benefits from multithreading strongly depends on its underlying algorithm. Common approaches such as constraint propagation (i.e. a rule-based method) where the puzzle is modeled as a constraint satisfaction problem, or stochastic search, don't really benefit since single-threaded solvers using these approaches are already exceptionally fast. Backtracking however, can benefit considerably most of the time (depending on the puzzle).
As you probably already know, a Sudoku's search space can be modeled as a tree data structure: The first level of the tree represents the first empty cell, the second level the second empty cell, and so on. At each level, the nodes represent the potential values of that cell given the values of their ancestor nodes. Thus searching this space can be parallelized by searching independent sub-trees at the same time. A single-threaded backtracking solver must traverse the whole tree by itself, one sub-tree after another, but a parallel solver can have each thread search a separate sub-tree in parallel.
There are multiple ways to realize this, but they're all based on the principle of combining depth-first search (DFS) with breadth-first search (BFS), which are two (opposite) forms of tree traversal. A single-threaded backtracking solver only carries out DFS the entire search, which is inherently non-parallelizable. By adding BFS into the mix however, parallelism can be unlocked. This is because BFS traverses the tree level-by-level (as opposed to branch-by-branch with DFS) and thereby finds all possible nodes on a given level before moving to the next lower level, whereas DFS takes the first possible node and completely searches its sub-tree before moving to the next possible node. As a result, BFS enables multiple independent sub-trees to be discovered right away that can then be searched by separate threads; DFS doesn't know anything about additional independent sub-trees right from the get go, because it's busy searching the first one it finds in a depth-first manner.
As is usually the case with multi-threading, parallelizing your code is tricky and initial attempts often decrease performance if you don't know exactly what you're doing. In this case, it's important to realize that BFS is much slower than DFS, so the chief concern is tweaking the amount of DFS and BFS you carry out such that just enough BFS is executed in order to unlock the ability to discover multiple sub-trees simultaneously, but also minimizing it so its slowness doesn't end up outweighing that ability. Note that BFS isn't inherently slower than DFS, it's just that the threads need access to the discovered sub-trees so they can search them. Thus BFS requires a global thread-safe data structure (e.g. a queue) onto/from which the sub-trees can be pushed/popped by separate threads, and this entails significant overhead compared to DFS which doesn't require any communication between threads. Hence parallelizing such a solver is a fine-tuning process and one wants to conduct enough BFS to provide all threads with enough sub-trees to search (i.e. achieve a good load balancing among all threads) while minimizing the communication between threads (the pushing/popping of sub-trees onto/off the queue).
I parallelized a backtracking-based Sudoku solver a while back, and implemented 4 different parallel solver variants and benchmarked them against a sequential (single-threaded) solver which I also implemented. They were all implemented in C++. The best (fastest) parallel variant's algorithm was as follows:
Start with a DFS of the puzzle's tree, but only do it down to a certain level, or search-depth
At the search-depth, conduct a BFS and push all the sub-trees discovered at that level onto the global queue
The threads then pop these sub-trees off the queue and perform a DFS on them, all the way down to the last level (the last empty cell).
The following figure (taken from my report from back then) illustrates these steps: The different color triangles represent different threads and the sub-trees they traverse. The green nodes represent allowed cell values on that level. Note the single BFS carried out at the search-depth; The sub-trees discovered at this level (yellow, purple, and red) are pushed onto the global queue, which are then traversed independently in parallel all the way down to the last level (last empty cell) of the tree.
As you can see, this implementation performs a BFS of only one level (at the search-depth). This search-depth is adjustable, and optimizing it represents the aforementioned fine-tuning process. The deeper the search-depth, the more BFS is carried out since a tree's width (# nodes on a given level) naturally increases the further down you go. Interestingly, the optimal search depth is typically at a quite shallow level (i.e. a not very deep level in the tree); This reveals that conducting even a small amount of BFS is already enough to generate ample sub-trees and provide a good load balance among all threads.
Also, thanks to the global queue, an arbitrary number of threads can be chosen. It's generally a good idea though to set the number of threads equal to the number of hardware threads (i.e. # logical cores); Choosing more typically won't further increase performance. Furthermore, it's also possible to parallelize the initial DFS performed at the start by performing a BFS of the first level of the tree (first empty cell): the sub-trees discovered at level 1 are then traversed in parallel, each thread stopping at the given search-depth. This is what's done in the figure above. This isn't strictly necessary though as the optimal search depth is typically quite shallow as mentioned above, hence a DFS down to the search-depth is still very quick even if it's single-threaded.
I thoroughly tested all solvers on 14 different Sudoku puzzles (specifically, a select set of puzzles specially designed to be difficult for a backtracking solver), and the figure below shows each solver's average time taken to solve all puzzles, for various thread counts (my laptop has four hardware threads). Parallel variant 2 isn't shown because it actually achieved significantly worse performance than the sequential solver. With parallel variant 1, the # threads was automatically determined during run-time, and depended on the puzzle (specifically, the branching factor of the first level); Hence the blue line represents its average total solving time regardless of the thread count.
All parallel solver variants combine DFS and BFS in different ways and to varying extents. When utilizing 4 threads, the fastest parallel solver (variant 4) was on average over three times as fast as the single-threaded solver!
Does it need to benefit from multithreading, or just make use of multithreading so you can learn for the assignment?
If you use a brute-force algorithm it is rather easy to split into multiple threads, and if the assignment is focused on coding threads, that may be an acceptable solution.
When you say all solutions to a given puzzle, do you mean the final one and only solution to the puzzle? Or the different ways of arriving at the one solution? I was of the understanding that by definition, a sudoku puzzle could have only one solution...
For the former, either Pax's rule based approach or Tom Leys' take on multi-threading your existing backtracking algorithm might be the way to go.
If the latter, you could implement some kind of branching algorithm which launches a new thread (with its own copy of the puzzle) for each possible move at each stage of the puzzle.
Depending on how you coded your single threaded solver, you might be able to re-use the logic. You can code a multi-threaded solver to start each thread using a different set of strategies to solve the puzzle.
Using those different strategies, your multi-threaded solver may find the total set of solutions in less time than your single threaded solver (keep in mind though that a true Sudoku puzzle only has one solution...you're not the only one who had to deal with that god awful game in class)
Some general points: I don't run processes in parallel unless 1) it is easy to divide the problem, and 2) I know I'll get a benefit from doing so - e.g. I won't hit another bottleneck. I entirely avoid sharing mutable values between threads - or minimize it. Some people are smart enough to work safely with mutexes. I'm not.
You need to find points in your algorithm that create natural branches or large units of work. Once you've identified a unit of work, you drop it in a queue for a thread to pick up. As a trivial example: 10 databases to upgrade. Start the upgrade asynchronously on all 10 servers. Wait for all to complete. I can easily avoid sharing state between threads/processes, and can easily aggregate the results.
What comes to mind for sudoku is that an efficient sudoku solution should combine 2-3 (or more) strategies that never run past a certain depth. When I do Sudoku, it's apparent that, at any given moment, different algorithms provide the solution with the least work. You could simply fire off a handful of strategies, let them investigate to a limited depth, and wait for a report. Rinse, repeat. This avoids "brute-forcing" the solution. Each algorithm has its own data space, but you combine the answers.
Sciam.com had an article on this a year or two back - it looks like it isn't public, though.
You said you used backtracking to solve the problem. What you can do is split the search space into two and hand each space to a thread; then each thread does the same until it reaches the last node. I did a solution to this, which can be found at www2.cs.uregina.ca/~hmer200a, but using a single thread; the mechanism for splitting the search space is there, using branch and bound.
A few years ago when I looked at solving sudoku, it seemed like the optimal solution used a combination of logical analysis algorithms and only fell back on brute force when necessary. This allowed the solver to find the solution very quickly, and also rank the board by difficulty if you wanted to use it to generate a new puzzle. If you took this approach you could certainly introduce some concurrency, although having the threads actually work together might be tricky.
I have an idea that's pretty fun here: do it with the Actor Model! I'd say using Erlang...
How? You start with the original board, and...
1) at the first empty cell, create 9 children, each with a different number, then commit suicide
2) each child checks whether it's invalid; if so it commits suicide, else
if there is an empty cell, go to 1)
if complete, this actor is a solution
Clearly every surviving actor is a solution to the problem =)
Just a side note. I actually implemented an optimized sudoku solver and looked into multithreading, but two things stopped me.
First, the simple overhead of starting a thread took 0.5 milliseconds, while the whole resolution took between 1 to 3 milliseconds (I used Java, other languages or environments may give different results).
Second, most problems don't require any backtracking. And those that do only need it late in the resolution of the problem, once all the game rules have been exhausted and we need to make a hypothesis.
Here's my own penny. Hope it helps.
Remember that inter-processor/inter-thread communications are expensive. Don't multi-thread unless you have to. If there isn't much work/computation to be done in other threads, you might as well just go on with a single-thread.
Try as much as possible to avoid sharing data between threads. Use them only when necessary
Take advantage of SIMD extensions wherever possible. With vector extensions you can perform calculations on multiple data in a single swoop. They can help you aplenty.
