How to profile concurrent Java code

I am practicing concurrent Java and wrote a concurrent mergesort. It works well when the number of elements is below 10,000, but beyond that it seems to take forever; I suspect that some of the threads are stuck (deadlock?). I don't have any shared resources, since I always pass and return new copies. What are some known ways to profile such code, e.g. to see which threads are stuck and how many threads have been executed?
Sharing the code for reference:-
package mergesort;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
// 18s
public class Main {
public static final ExecutorService ex = new ThreadPoolExecutor(100, 100, 5, TimeUnit.SECONDS,
new ArrayBlockingQueue<>(10000), new ThreadPoolExecutor.CallerRunsPolicy());
public static void main(String[] args) throws InterruptedException, ExecutionException {
int n = 1_000_000;
Future<int[]> T1 = ex.submit(new Callable<int[]>() {
@Override
public int[] call() throws Exception {
// TODO Auto-generated method stub
return mergesort(generate(n));
}
});
int[] ret = T1.get();
for (int i : ret) {
System.out.println(i);
}
System.out.println("done");
ex.shutdownNow();
}
public static int[] generate(int n) {
int[] nums = new int[n];
for (int i = 0; i < n; i++) {
nums[i] = (int) Math.floor(Math.random() * n);
}
return nums;
}
public static int[] mergesort(int[] nums) throws InterruptedException, ExecutionException {
final int[] B;
if (nums.length < 2) {
return nums;
}
final int[] A = new int[nums.length / 2];
if (nums.length % 2 == 0) {
B = new int[nums.length / 2];
} else {
B = new int[nums.length / 2 + 1];
}
for (int i = 0; i < nums.length; i++) {
if (i < nums.length / 2) {
A[i] = nums[i];
} else {
B[i - nums.length / 2] = nums[i];
}
}
Future<int[]> T2 = ex.submit(new Callable<int[]>() {
@Override
public int[] call() throws Exception {
// TODO Auto-generated method stub
return mergesort(B);
}
});
Future<int[]> T1 = ex.submit(new Callable<int[]>() {
@Override
public int[] call() throws Exception {
// TODO Auto-generated method stub
return mergesort(A);
}
});
Future<int[]> T3 = ex.submit(new Callable<int[]>() {
@Override
public int[] call() throws Exception {
return merge(T1.get(), T2.get());
}
});
return T3.get();
}
public static int[] merge(int[] A, int[] B) {
int[] ret = new int[A.length + B.length];
int i = 0;
int j = 0;
int k = 0;
while (i < A.length && j < B.length) {
if (A[i] < B[j]) {
ret[k] = A[i];
i++;
} else {
ret[k] = B[j];
j++;
}
k++;
}
while (j < B.length) {
ret[k] = B[j];
j++;
k++;
}
while (i < A.length) {
ret[k] = A[i];
i++;
k++;
}
return ret;
}
}
Edit:
So, using the tools I could analyze a memory dump, see the running threads, live objects, etc. But what strategies do people follow (what do they look for) when trying to understand the stack traces of a concurrent process? I.e., where do I start looking? For example, in my case I saw that all my tasks are waiting on FutureTask, but that's it. I have no idea why FutureTask is not returning. How can I move further?

Your problem is that you create Futures recursively, so a huge number of threads is needed to compute the intermediate results, and the pool may not have enough threads available. Bear in mind that most of your threads are blocked waiting for other tasks to deliver their results, so once the pool is exhausted you have threads waiting for work that can never start, because no thread is free to run it and no new thread can be created.
If you use a cached thread pool, it will work:
ExecutorService ex = Executors.newCachedThreadPool();
as such a pool is expandable.
----- EDIT -----
I also recommend using the Java 8 lambda style:
Future<int[]> f2 = ex.submit(() -> mergesort(B));
Future<int[]> f1 = ex.submit(() -> mergesort(A));
return merge(f1.get(),f2.get());
Also note that there is no point in using a Future to compute the merge, since you block on it immediately.
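The pool exhaustion described in this answer can be reproduced at a much smaller scale. The following is only a sketch under stated assumptions: the class name `StarvationDemo`, the method `outerTaskStarves` and the 500 ms timeout are made up for illustration and are not part of the question's code. A task running in a fixed pool submits a subtask to the same pool and waits for it; with a single worker thread the subtask can never start, so the wait times out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StarvationDemo {

    /** Returns true if the outer task timed out waiting for its own subtask. */
    public static boolean outerTaskStarves(int poolSize)
            throws InterruptedException, ExecutionException {
        final ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Future<Boolean> outer = pool.submit(() -> {
                // The subtask goes to the same pool the outer task is running in.
                Future<Integer> inner = pool.submit(() -> 42);
                try {
                    inner.get(500, TimeUnit.MILLISECONDS); // illustrative timeout
                    return false;                          // subtask got a thread
                } catch (TimeoutException e) {
                    return true;                           // no free thread: starved
                }
            });
            return outer.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool of 1 starves: " + outerTaskStarves(1));
        System.out.println("pool of 2 starves: " + outerTaskStarves(2));
    }
}
```

In the mergesort from the question the same thing happens without a timeout: every worker sits in `Future.get()` while the tasks it is waiting for remain in the queue.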

You can use one of the commercial profilers such as YourKit or AppDynamics; both offer trial versions. Alternatively, you can simply take a thread dump and analyze it yourself, but that is time-consuming.
I preferred AppDynamics: I had a very large application, and taking thread dumps and analyzing them manually took far too long.
See https://docs.appdynamics.com/display/PRO14S/Trace+MultiThreaded+Transactions+for+Java for how to trace a multi-threaded application with AppDynamics.
AppDynamics also shows which piece of code is causing thread contention and how much time each thread spends running or blocked. It also shows whether the CPU is the bottleneck or a thread is waiting on some shared resource.
Let me know if you need more info.
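For the "take a thread dump and analyze it yourself" route, a first pass does not need an external tool. The sketch below is an assumption-laden illustration, not a profiler: the class `PoolSnapshot` and its methods are hypothetical names, and it only uses `Thread.getAllStackTraces()` plus the counters that `ThreadPoolExecutor` already exposes. Note that `getActiveCount()` and `getCompletedTaskCount()` are documented as approximate.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSnapshot {

    /** Counts live JVM threads by state (RUNNABLE, WAITING, TIMED_WAITING, ...). */
    public static Map<Thread.State, Integer> threadStates() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    /** One-line summary of what the pool itself reports. */
    public static String describe(ThreadPoolExecutor pool) {
        return "active=" + pool.getActiveCount()
                + " queued=" + pool.getQueue().size()
                + " completed=" + pool.getCompletedTaskCount();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 5, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        CountDownLatch started = new CountDownLatch(2);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 5; i++) {
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await();          // park the worker inside the task
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        started.await();                      // both workers are now blocked
        System.out.println(describe(pool));
        System.out.println(threadStates());
        release.countDown();
        pool.shutdown();
    }
}
```

Sampling `describe` a few times while the program appears hung is a cheap heuristic: if every worker is active, the queue is not empty, and the completed count has stopped growing, the workers are probably all waiting on tasks that are still queued. A full dump from `jstack <pid>` then shows where each worker is parked.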

Related

Is my multi-threaded linear search flawed?

In the pursuit of learning I have written a multi-threaded linear search, designed to operate on an int[] array. I believe the search works as intended, however after completing it I tested it against a standard 'for loop' and was surprised to see that the 'for loop' beat my search in terms of speed every time. I've tried tinkering with the code, but cannot get the search to beat a basic 'for loop'. At the moment I am wondering the following:
Is there an obvious flaw in my code that I am not seeing?
Is my code perhaps not well optimised for CPU caches?
Is this just the overheads of multi-threading slowing down my program and so I need a larger array to reap the benefits?
Unable to work it out myself, I am hoping someone here may be able to point me in the right direction, leading to my question:
Is there an inefficiency/flaw in my code that is making it slower than a standard loop, or is this just the overheads of threading slowing it down?
The Search:
public class MLinearSearch {
private MLinearSearch() {};
public static int[] getMultithreadingPositions(int[] data, int processors) {
int pieceSize = data.length / processors;
int remainder = data.length % processors;
int curPosition = 0;
int[] results = new int[processors + 1];
for (int i = 0; i < results.length - 1; i++) {
results[i] = curPosition;
curPosition += pieceSize;
if(i < remainder) {
curPosition++;
}
}
results[results.length - 1] = data.length;
return results;
}
public static int search(int target, int[]data) {
MLinearSearch.processors = Runtime.getRuntime().availableProcessors();
MLinearSearch.foundIndex = -1;
int[] domains = MLinearSearch.getMultithreadingPositions(data, processors);
Thread[] threads = new Thread[MLinearSearch.processors];
for(int i = 0; i < MLinearSearch.processors; i++) {
MLSThread searcher = new MLSThread(target, data, domains[i], domains[(i + 1)]);
searcher.setDaemon(true);
threads[i] = searcher;
searcher.run();
}
for(Thread thread : threads) {
try {
thread.join();
} catch (InterruptedException e) {
return MLinearSearch.foundIndex;
}
}
return MLinearSearch.foundIndex;
}
private static class MLSThread extends Thread {
private MLSThread(int target, int[] data, int start, int end) {
this.counter = start;
this.dataEnd = end;
this.target = target;
this.data = data;
}
@Override
public void run() {
while(this.counter < (this.dataEnd) && MLinearSearch.foundIndex == -1) {
if(this.target == this.data[this.counter]) {
MLinearSearch.foundIndex = this.counter;
return;
}
counter++;
}
}
private int counter;
private int dataEnd;
private int target;
private int[] data;
}
private static volatile int foundIndex = -1;
private static volatile int processors;
}
Note: "getMultithreadingPositions" is normally in a separate class. I have copied the method here for simplicity.
This is how I've been testing the code. Another test (Omitted here, but in the same file & run) runs the basic for loop, which beats my multi-threaded search every time.
public class SearchingTest {
@Test
public void multiLinearTest() {
int index = MLinearSearch.search(TARGET, arrayData);
assertEquals(TARGET, arrayData[index]);
}
private static int[] getShuffledArray(int[] array) {
// https://stackoverflow.com/questions/1519736/random-shuffling-of-an-array
Random rnd = ThreadLocalRandom.current();
for (int i = array.length - 1; i > 0; i--)
{
int index = rnd.nextInt(i + 1);
int a = array[index];
array[index] = array[i];
array[i] = a;
}
return array;
}
private static final int[] arrayData = SearchingTest.getShuffledArray(IntStream.range(0, 55_000_000).toArray());
private static final int TARGET = 7;
}
The loop beating this is literally just a for loop that iterates over the same array. I would imagine for smaller arrays the for loop would win out as its simplicity allows it to get going before my multi-threaded search can initiate its threads. At the array size I am trying though I would have expected a single thread to lose out.
Note: I had to increase my heap size with the following JVM argument:
-Xmx4096m
To avoid a heap memory exception.
Thank you for any help offered.
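One detail worth checking in the search above: `search()` calls `searcher.run()` rather than `searcher.start()`. Calling `run()` directly is an ordinary method call that executes on the calling thread, so the pieces would be searched one after another, and `join()` on a thread that was never started returns immediately. The sketch below only illustrates that difference; the class `RunVersusStart` and the thread name "worker" are hypothetical and not taken from the question.

```java
public class RunVersusStart {

    /** Records the name of whichever thread actually executes run(). */
    static class Recorder extends Thread {
        volatile String executedOn;

        Recorder() {
            super("worker");
        }

        @Override
        public void run() {
            executedOn = Thread.currentThread().getName();
        }
    }

    /** Calls run() directly: a plain method call on the caller's thread. */
    public static String viaRun() {
        Recorder r = new Recorder();
        r.run();
        return r.executedOn;
    }

    /** Calls start(): the JVM creates a new thread, which then calls run(). */
    public static String viaStart() throws InterruptedException {
        Recorder r = new Recorder();
        r.start();
        r.join();
        return r.executedOn;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("run():   executed on " + viaRun());
        System.out.println("start(): executed on " + viaStart());
    }
}
```

Whether switching to `start()` is enough to beat the plain loop is a separate question; thread start-up cost and memory bandwidth may still dominate for a simple scan.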

Execute the method in parallel within the loop using Java

I have code like the below. In a loop it executes the method "process", and it runs sequentially. I want to run this method in parallel, but all the calls should have finished by the end of the loop so that I can sum the results on the next line. That is, even when running in parallel, all calls must finish before the second for loop executes.
How can I solve this with JDK 1.7 rather than JDK 1.8?
public static void main(String s[]){
int arrlen = 10;
int arr[] = new int[arrlen] ;
int t =0;
for(int i=0;i<arrlen;i++){
arr[i] = i;
t = process(arr[i]);
arr[i] = t;
}
int sum =0;
for(int i=0;i<arrlen;i++){
sum += arr[i];
}
System.out.println(sum);
}
public static int process(int arr){
return arr*2;
}
The example below might help you. I have used the fork/join framework.
For a small array like the one in your example, the conventional approach might be faster; I suspect the fork/join version would take slightly longer. But for larger arrays or more expensive processing, the fork/join framework is suitable. Even Java 8 parallel streams use the fork/join framework as their underlying base.
import java.util.concurrent.RecursiveAction;
public class ForkMultiplier extends RecursiveAction {
int[] array;
int threshold = 3;
int start;
int end;
public ForkMultiplier(int[] array,int start, int end) {
this.array = array;
this.start = start;
this.end = end;
}
protected void compute() {
if (end - start < threshold) {
computeDirectly();
} else {
int middle = (end + start) / 2;
ForkMultiplier f1= new ForkMultiplier(array, start, middle);
ForkMultiplier f2= new ForkMultiplier(array, middle, end);
invokeAll(f1, f2);
}
}
protected void computeDirectly() {
for (int i = start; i < end; i++) {
array[i] = array[i] * 2;
}
}
}
Your main class would look like this:
public static void main(String s[]){
int arrlen = 10;
int arr[] = new int[arrlen] ;
for(int i=0;i<arrlen;i++){
arr[i] = i;
}
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new ForkMultiplier(arr, 0, arr.length));
int sum =0;
for(int i=0;i<arrlen;i++){
sum += arr[i];
}
System.out.println(sum);
}
You basically need to combine Executors and Futures, which have existed since Java 1.5 (see the Java documentation).
In the following example, I've created a main class that uses a helper class acting as the processor you want to parallelize.
The main class is split into 3 steps:
Creates the process pool and executes the tasks in parallel.
Waits for all tasks to finish their work.
Collects the results from the tasks.
For didactic reasons, I've added some logging and, more importantly, a random waiting time in each process's business logic, simulating a time-consuming algorithm run by the Process class.
The maximum waiting time for each process is 2 seconds, which is also the longest step 2 can take, even if you increase the number of parallel tasks, because the pool has one thread per task (just try changing the variable totalTasks in the following code to test it).
Here is the Main class:
package com.example;
import java.util.ArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class Main
{
public static void main(String[] args) throws InterruptedException, ExecutionException
{
int totalTasks = 100;
ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(totalTasks);
System.out.println("Step 1 - Starting parallel tasks");
ArrayList<Future<Integer>> tasks = new ArrayList<Future<Integer>>();
for (int i = 0; i < totalTasks; i++) {
tasks.add(newFixedThreadPool.submit(new Process(i)));
}
long ts = System.currentTimeMillis();
System.out.println("Step 2 - Wait for processes to finish...");
boolean tasksCompleted;
do {
tasksCompleted = true;
for (Future<Integer> task : tasks) {
if (!task.isDone()) {
tasksCompleted = false;
Thread.sleep(10);
break;
}
}
} while (!tasksCompleted);
System.out.println(String.format("Step 2 - End in '%.3f' seconds", (System.currentTimeMillis() - ts) / 1000.0));
System.out.println("Step 3 - All processes finished to run, let's collect results...");
Integer sum = 0;
for (Future<Integer> task : tasks) {
sum += task.get();
}
System.out.println(String.format("Total final sum is: %d", sum));
}
}
Here is the Process class:
package com.example;
import java.util.concurrent.Callable;
public class Process implements Callable<Integer>
{
private Integer value;
public Process(Integer value)
{
this.value = value;
}
public Integer call() throws Exception
{
Long sleepTime = (long)(Math.random() * 2000);
System.out.println(String.format("Starting process with value %d, sleep time %d", this.value, sleepTime));
Thread.sleep(sleepTime);
System.out.println(String.format("Stopping process with value %d", this.value));
return value * 2;
}
}
Hope this helps.
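As a possible variation on the answer above, the polling loop in step 2 can be avoided with `ExecutorService.invokeAll`, which blocks until every submitted task has completed. This is a sketch assuming the `process` method from the question; the class name `InvokeAllSum` and the method `parallelSum` are made up for illustration. It uses anonymous classes rather than lambdas so that it compiles on JDK 1.7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InvokeAllSum {

    // Same computation as process() in the question.
    static int process(int value) {
        return value * 2;
    }

    public static int parallelSum(int arrlen)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
            for (int i = 0; i < arrlen; i++) {
                final int value = i;
                tasks.add(new Callable<Integer>() {
                    @Override
                    public Integer call() {
                        return process(value);
                    }
                });
            }
            // invokeAll returns only after every task has finished,
            // so the summing loop below never sees a pending result.
            List<Future<Integer>> results = pool.invokeAll(tasks);
            int sum = 0;
            for (Future<Integer> f : results) {
                sum += f.get();
            }
            return sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(10)); // prints 90, the same sum as the sequential loop
    }
}
```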

Using ExecutorService with a multithreaded version of Merge Sort

I am working on a homework problem where I have to create a Multithreaded version of Merge Sort. I was able to implement it, but I am not able to stop the creation of threads. I looked into using an ExecutorService to limit the creation of threads but I cannot figure out how to implement it within my current code.
Here is my current Multithreaded Merge Sort. We are required to implement a specific strategy pattern so that is where my sort() method comes from.
@Override
public int[] sort(int[] list) {
int array_size = list.length;
list = msort(list, 0, array_size-1);
return list;
}
int[] msort(int numbers[], int left, int right) {
final int mid;
final int leftRef = left;
final int rightRef = right;
final int array[] = numbers;
if (left<right) {
mid = (right + left) / 2;
//new thread
Runnable r1 = new Runnable(){
public void run(){
msort(array, leftRef, mid);
}
};
Thread t1 = new Thread(r1);
t1.start();
//new thread
Runnable r2 = new Runnable(){
public void run(){
msort(array, mid+1, rightRef);
}
};
Thread t2 = new Thread(r2);
t2.start();
//join threads back together
try {
t1.join();
t2.join();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
merge(numbers, leftRef, mid, mid+1, rightRef);
}
return numbers;
}
void merge(int numbers[], int startA, int endA, int startB, int endB) {
int finalStart = startA;
int finalEnd = endB;
int indexC = 0;
int[] listC = new int[numbers.length];
while(startA <= endA && startB <= endB){
if(numbers[startA] < numbers[startB]){
listC[indexC] = numbers[startA];
startA = startA+1;
}
else{
listC[indexC] = numbers[startB];
startB = startB +1;
}
indexC++;
}
if(startA <= endA){
for(int i = startA; i < endA; i++){
listC[indexC]= numbers[i];
indexC++;
}
}
indexC = 0;
for(int i = finalStart; i <= finalEnd; i++){
numbers[i]=listC[indexC];
indexC++;
}
}
Any pointers would be gratefully received.
Following @mcdowella's comment, I also think that the fork/join framework is your best bet if you want to limit the number of threads that run in parallel.
I know that this won't help with your homework, because you are probably not allowed to use the Java 7 fork/join framework. However, it's about learning something, isn't it? ;)
As I commented, I think your merge method is wrong. I can't pinpoint the failure, but I have rewritten it. I strongly suggest writing a test case covering all the edge cases that can occur in that merge method and, once you have verified it works, putting it back into your multithreaded code.
@lbalazscs also gave you the hint that a fork/join sort is mentioned in the Javadocs; however, I had nothing else to do, so I will show you what a solution could look like in Java 7.
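import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
public class MultithreadedMergeSort extends RecursiveAction {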
private final int[] array;
private final int begin;
private final int end;
public MultithreadedMergeSort(int[] array, int begin, int end) {
this.array = array;
this.begin = begin;
this.end = end;
}
@Override
protected void compute() {
if (end - begin < 2) {
// swap if we only have two elements
if (array[begin] > array[end]) {
int tmp = array[end];
array[end] = array[begin];
array[begin] = tmp;
}
} else {
// overflow safe method to calculate the mid
int mid = (begin + end) >>> 1;
// invoke recursive sorting action
invokeAll(new MultithreadedMergeSort(array, begin, mid),
new MultithreadedMergeSort(array, mid + 1, end));
// merge both sides
merge(array, begin, mid, end);
}
}
void merge(int[] numbers, int startA, int startB, int endB) {
int[] toReturn = new int[endB - startA + 1];
int i = 0, k = startA, j = startB + 1;
while (i < toReturn.length) {
if (numbers[k] < numbers[j]) {
toReturn[i] = numbers[k];
k++;
} else {
toReturn[i] = numbers[j];
j++;
}
i++;
// if we hit the limit of an array, copy the rest
if (j > endB) {
System.arraycopy(numbers, k, toReturn, i, startB - k + 1);
break;
}
if (k > startB) {
System.arraycopy(numbers, j, toReturn, i, endB - j + 1);
break;
}
}
System.arraycopy(toReturn, 0, numbers, startA, toReturn.length);
}
public static void main(String[] args) {
int[] toSort = { 55, 1, 12, 2, 25, 55, 56, 77 };
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new MultithreadedMergeSort(toSort, 0, toSort.length - 1));
System.out.println(Arrays.toString(toSort));
}
}
Note that the construction of your threadpool limits the number of active parallel threads to the number of cores of your processor.
ForkJoinPool pool = new ForkJoinPool();
According to its Javadoc:
Creates a ForkJoinPool with parallelism equal to
java.lang.Runtime.availableProcessors, using the default thread
factory, no UncaughtExceptionHandler, and non-async LIFO processing
mode.
Also notice how my merge method differs from yours, because I think that is your main problem. At least your sorting works if I replace your merge method with mine.
As mcdowella pointed out, the Fork/Join framework in Java 7 is exactly for tasks that can be broken into smaller pieces recursively.
Actually, the Javadoc for RecursiveAction has a merge sort as the first example :)
Also note that ForkJoinPool is an ExecutorService.
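The suggestion above to test the merge step in isolation before putting it back into threaded code could look roughly like this. It is a sketch with hypothetical names (`MergeCheck`, `mergeRange`, `mergesCorrectly`); the merge shown is a plain textbook version, not the asker's or the answerer's method, and each case is checked against `Arrays.sort`.

```java
import java.util.Arrays;

public class MergeCheck {

    /** Merges the sorted ranges a[lo..mid] and a[mid+1..hi] (inclusive) in place. */
    static void mergeRange(int[] a, int lo, int mid, int hi) {
        int[] tmp = new int[hi - lo + 1];
        int i = lo, j = mid + 1, k = 0;
        while (i <= mid && j <= hi) {
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i <= mid) tmp[k++] = a[i++];   // leftover from the left range
        while (j <= hi)  tmp[k++] = a[j++];   // leftover from the right range
        System.arraycopy(tmp, 0, a, lo, tmp.length);
    }

    /** Returns true if merging the two sorted halves matches Arrays.sort. */
    static boolean mergesCorrectly(int[] left, int[] right) {
        int[] a = new int[left.length + right.length];
        System.arraycopy(left, 0, a, 0, left.length);
        System.arraycopy(right, 0, a, left.length, right.length);
        int[] expected = a.clone();
        Arrays.sort(expected);
        mergeRange(a, 0, left.length - 1, a.length - 1);
        return Arrays.equals(a, expected);
    }

    public static void main(String[] args) {
        int[][][] cases = {
            { {1}, {2} },              // already in order
            { {2}, {1} },              // two elements, reversed
            { {1, 3, 5}, {2, 4, 6} },  // interleaved
            { {4, 5, 6}, {1, 2, 3} },  // right half entirely smaller
            { {1, 1}, {1, 1} },        // duplicates
            { {7, 8, 9}, {1} },        // uneven halves
        };
        for (int[][] c : cases) {
            System.out.println(Arrays.toString(c[0]) + " + " + Arrays.toString(c[1])
                    + " -> " + mergesCorrectly(c[0], c[1]));
        }
    }
}
```

The cases where one half runs out first (the fourth and sixth) are the ones most likely to expose a merge that forgets to copy the leftover elements.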

Unable to multi-thread a scalable method

UPDATE: To help clarify what I'm asking I have posted a little java code that gets the idea across.
A while ago I asked a question about an algorithm to break down a set of numbers. The idea was to give it a list of numbers (1,2,3,4,5) and a total (10), and it would figure out all the multiples of each number that add up to the total ('1*10' or '1*1,1*2,1*3,1*4' or '2*5', etc.). It was the first programming exercise I ever did, so it took me a while, but I got it working, and now I want to see if I can scale it. The person who answered the original question said it was scalable, but I'm confused about how to do it. The part I'm stuck on scaling is the recursive part that combines all the results (the table it refers to is not scalable, but by applying caching I am able to make it fast).
I have the following algorithm(pseudo code):
//generates table
for i = 1 to k
for z = 0 to sum:
for c = 1 to z / x_i:
if T[z - c * x_i][i - 1] is true:
set T[z][i] to true
//uses table to bring all the parts together
function RecursivelyListAllThatWork(k, sum) // Using last k variables, make sum
/* Base case: If we've assigned all the variables correctly, list this
* solution.
*/
if k == 0:
print what we have so far
return
/* Recursive step: Try all coefficients, but only if they work. */
for c = 0 to sum / x_k:
if T[sum - c * x_k][k - 1] is true:
mark the coefficient of x_k to be c
call RecursivelyListAllThatWork(k - 1, sum - c * x_k)
unmark the coefficient of x_k
I'm really at a loss as to how to thread/multiprocess the RecursivelyListAllThatWork function. I know that if I send it a smaller K (the total number of items in the list, as an int) it will process that subset, but I don't know how to handle results that combine items across subsets. For example, if the list is [1,2,3,4,5,6,7,8,9,10] and I send it K=3, then only 1,2,3 get processed, which is fine, but what if I need results that include both 1 and 10? I have tried modifying the table (variable T) so that only the subset I want is there, but that still doesn't work because, like the solution above, it handles a subset but cannot produce answers that require a wider range.
I don't need any code; I just need someone to explain conceptually how to break up this recursive step so that other cores/machines can be used.
UPDATE: I still can't figure out how to turn RecursivelyListAllThatWork into a Runnable. I know technically how to do it, but I don't understand how to change the RecursivelyListAllThatWork algorithm so that it can be run in parallel. The other parts are just here to make the example work; I only need to implement Runnable on the RecursivelyListAllThatWork method. Here's the Java code:
import java.awt.Point;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class main
{
public static void main(String[] args)
{
System.out.println("starting..");
int target_sum = 100;
int[] data = new int[] { 10, 5, 50, 20, 25, 40 };
List T = tableGeneator(target_sum, data);
List<Integer> coeff = create_coeff(data.length);
RecursivelyListAllThatWork(data.length, target_sum, T, coeff, data);
}
private static List<Integer> create_coeff(int i) {
// TODO Auto-generated method stub
Integer[] integers = new Integer[i];
Arrays.fill(integers, 0);
List<Integer> integerList = Arrays.asList(integers);
return integerList;
}
private static void RecursivelyListAllThatWork(int k, int sum, List T, List<Integer> coeff, int[] data) {
// TODO Auto-generated method stub
if (k == 0) {
//# print what we have so far
for (int i = 0; i < coeff.size(); i++) {
System.out.println(data[i] + " = " + coeff.get(i));
}
System.out.println("*******************");
return;
}
Integer x_k = data[k-1];
// Recursive step: Try all coefficients, but only if they work.
for (int c = 0; c <= sum/x_k; c++) { //the c variable caps the percent
if (T.contains(new Point((sum - c * x_k), (k-1))))
{
// mark the coefficient of x_k to be c
coeff.set((k-1), c);
RecursivelyListAllThatWork((k - 1), (sum - c * x_k), T, coeff, data);
// unmark the coefficient of x_k
coeff.set((k-1), 0);
}
}
}
public static List tableGeneator(int target_sum, int[] data) {
List T = new ArrayList();
T.add(new Point(0, 0));
float max_percent = 1;
int R = (int) (target_sum * max_percent * data.length);
for (int i = 0; i < data.length; i++)
{
for (int s = -R; s < R + 1; s++)
{
int max_value = (int) Math.abs((target_sum * max_percent)
/ data[i]);
for (int c = 0; c < max_value + 1; c++)
{
if (T.contains(new Point(s - c * data[i], i)))
{
Point p = new Point(s, i + 1);
if (!T.contains(p))
{
T.add(p);
}
}
}
}
}
return T;
}
}
The general approach to multi-threading a recursive implementation is to remove the recursion by using an explicit stack or queue (LIFO or FIFO). With such an algorithm, the number of threads is a fixed parameter (the number of cores, for instance).
To implement it, the language call stack is replaced by a stack that stores the latest context as a checkpoint whenever the tested condition ends the recursion. In your case that is either k=0 or the coeff values matching the target sum.
Once the recursion is removed, a first implementation is to run multiple threads that consume the stack, BUT access to the stack becomes a contention point because it may require synchronization.
A more scalable solution is to dedicate a stack to each thread, but this requires an initial production of contexts to fill the stacks.
I propose a mixed approach, with a first thread working recursively up to a limited maximum recursion depth: 2 for the small data set in the example, but I recommend 3 for larger ones. This first part then delegates the generated intermediate contexts to a pool of threads, which process the remaining k with a non-recursive implementation. This code is not based on the complex algorithm you use but on a rather "basic" implementation:
import java.util.Arrays;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
public class MixedParallel
{
// pre-requisite: sorted values !!
private static final int[] data = new int[] { 5, 10, 20, 25, 40, 50 };
// Context to store intermediate computation or a solution
static class Context {
int k;
int sum;
int[] coeff;
Context(int k, int sum, int[] coeff) {
this.k = k;
this.sum = sum;
this.coeff = coeff;
}
}
// Thread pool for parallel execution
private static ExecutorService executor;
// Queue to collect solutions
private static Queue<Context> solutions;
static {
final int numberOfThreads = 2;
executor =
new ThreadPoolExecutor(numberOfThreads, numberOfThreads, 1000, TimeUnit.SECONDS,
new LinkedBlockingDeque<Runnable>());
// concurrent because of multi-threaded insertions
solutions = new ConcurrentLinkedQueue<Context>();
}
public static void main(String[] args)
{
int target_sum = 100;
// result vector, init to 0
int[] coeff = new int[data.length];
Arrays.fill(coeff, 0);
mixedPartialSum(data.length - 1, target_sum, coeff);
executor.shutdown();
// System.out.println("Over. Dumping results");
while(!solutions.isEmpty()) {
Context s = solutions.poll();
printResult(s.coeff);
}
}
private static void printResult(int[] coeff) {
StringBuffer sb = new StringBuffer();
for (int i = coeff.length - 1; i >= 0; i--) {
if (coeff[i] > 0) {
sb.append(data[i]).append(" * ").append(coeff[i]).append(" ");
}
}
System.out.println(sb.append("from ").append(Thread.currentThread()));
}
private static void mixedPartialSum(int k, int sum, int[] coeff) {
int x_k = data[k];
for (int c = sum / x_k; c >= 0; c--) {
coeff[k] = c;
int[] newcoeff = Arrays.copyOf(coeff, coeff.length);
if (c * x_k == sum) {
//printResult(newcoeff);
solutions.add(new Context(0, 0, newcoeff));
continue;
} else if (k > 0) {
if (data.length - k < 2) {
mixedPartialSum(k - 1, sum - c * x_k, newcoeff);
// for loop on "c" goes on with previous coeff content
} else {
// no longer recursive. delegate to thread pool
executor.submit(new ComputePartialSum(new Context(k - 1, sum - c * x_k, newcoeff)));
}
}
}
}
static class ComputePartialSum implements Callable<Void> {
// queue with contexts to process
private Queue<Context> contexts;
ComputePartialSum(Context request) {
contexts = new ArrayDeque<Context>();
contexts.add(request);
}
public Void call() {
while(!contexts.isEmpty()) {
Context current = contexts.poll();
int x_k = data[current.k];
for (int c = current.sum / x_k; c >= 0; c--) {
current.coeff[current.k] = c;
int[] newcoeff = Arrays.copyOf(current.coeff, current.coeff.length);
if (c * x_k == current.sum) {
//printResult(newcoeff);
solutions.add(new Context(0, 0, newcoeff));
continue;
} else if (current.k > 0) {
contexts.add(new Context(current.k - 1, current.sum - c * x_k, newcoeff));
}
}
}
return null;
}
}
}
You can check which thread found each printed result and verify that all of them are involved: the main thread in recursive mode and the two threads from the pool in context-stack mode.
Now this implementation is scalable when data.length is high:
the maximum recursion depth is limited to the main thread at a low level
each thread from the pool works with its own context stack without contention with others
the parameters to tune now are numberOfThreads and maxRecursionDepth
So the answer is yes, your algorithm can be parallelized. Here is a fully recursive implementation based on your code:
import java.awt.Point;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
public class OriginalParallel
{
static final int numberOfThreads = 2;
static final int maxRecursionDepth = 3;
public static void main(String[] args)
{
int target_sum = 100;
int[] data = new int[] { 50, 40, 25, 20, 10, 5 };
List T = tableGeneator(target_sum, data);
int[] coeff = new int[data.length];
Arrays.fill(coeff, 0);
RecursivelyListAllThatWork(data.length, target_sum, T, coeff, data);
executor.shutdown();
}
private static void printResult(int[] coeff, int[] data) {
StringBuffer sb = new StringBuffer();
for (int i = coeff.length - 1; i >= 0; i--) {
if (coeff[i] > 0) {
sb.append(data[i]).append(" * ").append(coeff[i]).append(" ");
}
}
System.out.println(sb.append("from ").append(Thread.currentThread()));
}
// Thread pool for parallel execution
private static ExecutorService executor;
static {
executor =
new ThreadPoolExecutor(numberOfThreads, numberOfThreads, 1000, TimeUnit.SECONDS,
new LinkedBlockingDeque<Runnable>());
}
private static void RecursivelyListAllThatWork(int k, int sum, List T, int[] coeff, int[] data) {
if (k == 0) {
printResult(coeff, data);
return;
}
Integer x_k = data[k-1];
// Recursive step: Try all coefficients, but only if they work.
for (int c = 0; c <= sum/x_k; c++) { //the c variable caps the percent
if (T.contains(new Point((sum - c * x_k), (k-1)))) {
// mark the coefficient of x_k to be c
coeff[k-1] = c;
if (data.length - k != maxRecursionDepth) {
RecursivelyListAllThatWork((k - 1), (sum - c * x_k), T, coeff, data);
} else {
// delegate to thread pool when reaching depth 3
int[] newcoeff = Arrays.copyOf(coeff, coeff.length);
executor.submit(new RecursiveThread(k - 1, sum - c * x_k, T, newcoeff, data));
}
// unmark the coefficient of x_k
coeff[k-1] = 0;
}
}
}
static class RecursiveThread implements Callable<Void> {
int k;
int sum;
int[] coeff;
int[] data;
List T;
RecursiveThread(int k, int sum, List T, int[] coeff, int[] data) {
this.k = k;
this.sum = sum;
this.T = T;
this.coeff = coeff;
this.data = data;
System.out.println("New job for k=" + k);
}
public Void call() {
RecursivelyListAllThatWork(k, sum, T, coeff, data);
return null;
}
}
public static List tableGeneator(int target_sum, int[] data) {
List T = new ArrayList();
T.add(new Point(0, 0));
float max_percent = 1;
int R = (int) (target_sum * max_percent * data.length);
for (int i = 0; i < data.length; i++) {
for (int s = -R; s < R + 1; s++) {
int max_value = (int) Math.abs((target_sum * max_percent) / data[i]);
for (int c = 0; c < max_value + 1; c++) {
if (T.contains(new Point(s - c * data[i], i))) {
Point p = new Point(s, i + 1);
if (!T.contains(p)) {
T.add(p);
}
}
}
}
}
return T;
}
}
1) Instead of
if k == 0:
print what we have so far
return
you can check to see how many coefficients are non-zero; if that count is greater than a certain threshold (3 in your example), then just don't print it. (Hint: this would be closely related to the
mark the coefficient of x_k to be c
line.)
2) Recursive functions are generally exponential in nature, which means that as you scale higher, the runtime will grow sharply larger.
With that in mind, you can apply multithreading to both calculating the table and the recursive function.
When considering the table, think about which parts of the loop affect each other and must be done in sequence; the converse, of course, is finding which parts don't affect each other and can be run in parallel.
As for the recursive function, your best bet would probably be to apply the multithreading to the branching part.
The key to making this multithreaded is just to make sure that you don't have unnecessary global data structures, like your "marks" on the coefficients.
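For point 1, a minimal sketch of the non-zero check might look like this. The names CoefficientFilter, countNonZero, shouldPrint and the maxTerms parameter are made up for illustration; only the coeff array comes from your code.

```java
import java.util.Arrays;

class CoefficientFilter {
    // Hypothetical helper: counts how many coefficients are non-zero.
    static int countNonZero(int[] coeff) {
        int count = 0;
        for (int c : coeff) {
            if (c != 0) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: true if the solution uses at most maxTerms numbers.
    static boolean shouldPrint(int[] coeff, int maxTerms) {
        return countNonZero(coeff) <= maxTerms;
    }

    public static void main(String[] args) {
        int[] sparse = {0, 2, 0, 5, 0};
        int[] dense = {1, 2, 3, 5, 0};
        int maxTerms = 3; // assumed threshold, as in "3 in your example"
        for (int[] coeff : new int[][] {sparse, dense}) {
            if (shouldPrint(coeff, maxTerms)) {
                System.out.println(Arrays.toString(coeff));
            }
        }
    }
}
```

In the recursive function you would call something like shouldPrint(coeff, 3) at the k == 0 step. Keeping a running count next to the "mark the coefficient" line would also let you abandon a branch as soon as the count exceeds the threshold, rather than only at the end.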
Let's say you have K numbers n[0] ... n[K-1] in your table and the sum you want to reach is S. I assume below that the array n[] is sorted from smallest to largest number.
A simple enumeration algorithm is here. i is index to the list of numbers, s is the current sum already built, and cs is a list of coefficients for the numbers 0 .. i - 1:
function enumerate(i, s, cs):
if (s == S):
output_solution(cs)
else if (i == K):
return // dead end
else if ((S - s) < n[i]):
return // no solution can be found
else:
for c in 0 .. floor((S - s) / n[i]): // note: floor(...) > 0
enumerate(i + 1, s + c * n[i], append(cs, c))
To run the process:
enumerate(0, 0, make_empty_list())
Now there are no global data structures anymore, except the table n[] (constant data), and 'enumerate' also does not return anything, so you can change the recursive call to run in its own thread at will. E.g. you can spawn a new thread for a recursive enumerate() call unless you have too many threads running already, in which case you wait.
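One possible way to write this in Java is sketched below. Note that it swaps the "spawn a thread unless too many are running" idea for a ForkJoinPool, which caps the number of worker threads for you. The class name ParallelEnumerate, the forkDepth cutoff and the concurrent result queue are my own assumptions, not part of the pseudocode. As above, it assumes n[] is sorted ascending and contains only positive numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ParallelEnumerate {
    // Returns every coefficient vector cs with sum(cs[i] * n[i]) == target.
    // Branches at index < forkDepth become separate tasks; deeper ones run inline.
    static List<int[]> solve(int[] n, int target, int forkDepth) {
        ConcurrentLinkedQueue<int[]> out = new ConcurrentLinkedQueue<>();
        ForkJoinPool pool = new ForkJoinPool();
        try {
            pool.invoke(new Task(n, target, 0, 0, new int[n.length], forkDepth, out));
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(out);
    }

    static final class Task extends RecursiveAction {
        private final int[] n;
        private final int target;
        private final int i;        // index of the number being decided
        private final int s;        // sum built so far
        private final int[] cs;     // private copy of the coefficients
        private final int forkDepth;
        private final ConcurrentLinkedQueue<int[]> out;

        Task(int[] n, int target, int i, int s, int[] cs, int forkDepth,
             ConcurrentLinkedQueue<int[]> out) {
            this.n = n;
            this.target = target;
            this.i = i;
            this.s = s;
            this.cs = cs;
            this.forkDepth = forkDepth;
            this.out = out;
        }

        @Override
        protected void compute() {
            if (s == target) {
                out.add(cs);            // remaining coefficients stay 0
                return;
            }
            if (i == n.length || target - s < n[i]) {
                return;                 // dead end
            }
            List<Task> children = new ArrayList<>();
            for (int c = 0; c <= (target - s) / n[i]; c++) {
                int[] next = cs.clone(); // no shared "marks" between branches
                next[i] = c;
                Task child = new Task(n, target, i + 1, s + c * n[i], next, forkDepth, out);
                if (i < forkDepth) {
                    children.add(child); // run in parallel below
                } else {
                    child.compute();     // plain recursion
                }
            }
            invokeAll(children);         // waits for all forked branches
        }
    }

    public static void main(String[] args) {
        List<int[]> solutions = solve(new int[] {2, 3, 5}, 10, 2);
        System.out.println(solutions.size() + " solutions found");
    }
}
```

Copying cs for every branch is what makes the branches independent. The forkDepth cutoff is there because forking at every level would create far more tasks than is useful.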

Multi-threaded matrix multiplication

I've coded a multi-threaded matrix multiplication. I believe my approach is right, but I'm not 100% sure. With respect to the threads, I don't understand why I can't just run a (new MatrixThread(...)).start() instead of using an ExecutorService.
Additionally, when I benchmark the multithreaded approach versus the classical approach, the classical is much faster...
What am I doing wrong?
Matrix Class:
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class Matrix
{
private int dimension;
private int[][] template;
public Matrix(int dimension)
{
this.template = new int[dimension][dimension];
this.dimension = template.length;
}
public Matrix(int[][] array)
{
this.dimension = array.length;
this.template = array;
}
public int getMatrixDimension() { return this.dimension; }
public int[][] getArray() { return this.template; }
public void fillMatrix()
{
Random randomNumber = new Random();
for(int i = 0; i < dimension; i++)
{
for(int j = 0; j < dimension; j++)
{
template[i][j] = randomNumber.nextInt(10) + 1;
}
}
}
@Override
public String toString()
{
String retString = "";
for(int i = 0; i < this.getMatrixDimension(); i++)
{
for(int j = 0; j < this.getMatrixDimension(); j++)
{
retString += " " + this.getArray()[i][j];
}
retString += "\n";
}
return retString;
}
public static Matrix classicalMultiplication(Matrix a, Matrix b)
{
int[][] result = new int[a.dimension][b.dimension];
for(int i = 0; i < a.dimension; i++)
{
for(int j = 0; j < b.dimension; j++)
{
for(int k = 0; k < b.dimension; k++)
{
result[i][j] += a.template[i][k] * b.template[k][j];
}
}
}
return new Matrix(result);
}
public Matrix multiply(Matrix multiplier) throws InterruptedException
{
Matrix result = new Matrix(dimension);
ExecutorService es = Executors.newFixedThreadPool(dimension*dimension);
for(int currRow = 0; currRow < multiplier.dimension; currRow++)
{
for(int currCol = 0; currCol < multiplier.dimension; currCol++)
{
//(new MatrixThread(this, multiplier, currRow, currCol, result)).start();
es.execute(new MatrixThread(this, multiplier, currRow, currCol, result));
}
}
es.shutdown();
es.awaitTermination(2, TimeUnit.DAYS);
return result;
}
private class MatrixThread extends Thread
{
private Matrix a, b, result;
private int row, col;
private MatrixThread(Matrix a, Matrix b, int row, int col, Matrix result)
{
this.a = a;
this.b = b;
this.row = row;
this.col = col;
this.result = result;
}
@Override
public void run()
{
int cellResult = 0;
for (int i = 0; i < a.getMatrixDimension(); i++)
cellResult += a.template[row][i] * b.template[i][col];
result.template[row][col] = cellResult;
}
}
}
Main class:
import java.util.Scanner;
public class MatrixDriver
{
private static final Scanner kb = new Scanner(System.in);
public static void main(String[] args) throws InterruptedException
{
Matrix first, second;
long timeLastChanged,timeNow;
double elapsedTime;
System.out.print("Enter value of n (must be a power of 2):");
int n = kb.nextInt();
first = new Matrix(n);
first.fillMatrix();
second = new Matrix(n);
second.fillMatrix();
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using threads:\n" +
first.multiply(second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Threaded took "+elapsedTime+" seconds");
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using classical:\n" +
Matrix.classicalMultiplication(first,second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Classical took "+elapsedTime+" seconds");
}
}
P.S. Please let me know if any further clarification is needed.
There is a bunch of overhead involved in creating threads, even when using an ExecutorService. I suspect the reason why your multithreaded approach is so slow is that you're spending 99% of the time creating new threads and only 1%, or less, doing the actual math.
Typically, to solve this problem you'd batch a whole bunch of operations together and run those on a single thread. I'm not 100% sure how to do that in this case, but I suggest breaking your matrix into smaller chunks (say, 10 smaller matrices) and running those on threads, instead of running each cell in its own thread.
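As a rough sketch of the chunking idea, assuming square matrices stored as plain int[][] arrays: each task below computes a band of consecutive result rows instead of a single cell. The class name ChunkedMultiply and the chunks parameter are made up for illustration, and the pool is created per call only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedMultiply {
    // Multiplies two n x n matrices using at most `chunks` tasks (chunks >= 1),
    // where each task fills a contiguous band of result rows.
    static int[][] multiply(int[][] a, int[][] b, int chunks) {
        int n = a.length;
        int[][] result = new int[n][n];
        int threads = Math.min(chunks, Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int rowsPerChunk = (n + chunks - 1) / chunks; // ceiling division
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int start = 0; start < n; start += rowsPerChunk) {
                final int from = start;
                final int to = Math.min(n, start + rowsPerChunk);
                tasks.add(() -> {
                    // Tasks write to disjoint rows, so no locking is needed.
                    for (int i = from; i < to; i++) {
                        for (int j = 0; j < n; j++) {
                            int cell = 0;
                            for (int k = 0; k < n; k++) {
                                cell += a[i][k] * b[k][j];
                            }
                            result[i][j] = cell;
                        }
                    }
                    return null;
                });
            }
            // invokeAll blocks until every task is done; get() surfaces failures.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        int[][] product = multiply(a, b, 2);
        System.out.println(java.util.Arrays.deepToString(product));
    }
}
```

With chunks set to roughly the number of cores, each thread gets one sizeable piece of work, so the cost of handing out tasks is small compared to the arithmetic.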
You're creating a lot of threads. Not only is it expensive to create threads, but for a CPU bound application, you don't want more threads than you have available processors (if you do, you have to spend processing power switching between threads, which also is likely to cause cache misses which are very expensive).
It's also unnecessary to pass a Thread to execute; all it needs is a Runnable. You'll get a big performance boost by applying these changes:
Make the ExecutorService a static member, size it for the number of available processors, and give it a ThreadFactory that creates daemon threads so it doesn't keep the program running after main has finished. (It would probably be architecturally cleaner to pass it as a parameter to the method rather than keeping it as a static field; I leave that as an exercise for the reader. ☺)
private static final ExecutorService workerPool =
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(), new ThreadFactory() {
public Thread newThread(Runnable r) {
Thread t = new Thread(r);
t.setDaemon(true);
return t;
}
});
Make MatrixThread implement Runnable rather than inherit Thread. Threads are expensive to create; POJOs are very cheap. You can also make it static which makes the instances smaller (as non-static classes get an implicit reference to the enclosing object).
private static class MatrixThread implements Runnable
Because the worker pool from the first change is shared and never shut down, you can no longer use awaitTermination to make sure all tasks have finished. Instead, use the submit method, which returns a Future<?>. Collect all the Future objects in a list, and once you've submitted all the tasks, iterate over the list and call get on each one.
Your multiply method should now look something like this:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
for(int currCol = 0; currCol < multiplier.dimension; currCol++) {
Runnable worker = new MatrixThread(this, multiplier, currRow, currCol, result);
futures.add(workerPool.submit(worker));
}
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
Will it be faster than the single-threaded version? Well, on my arguably crappy box the multithreaded version is slower for values of n < 1024.
This is just scratching the surface, though. The real problem is that you create a lot of MatrixThread instances - your memory consumption is O(n²), which is a very bad sign. Moving the inner for loop into MatrixThread.run would improve performance by a factor of craploads (ideally, you don't create more tasks than you have worker threads).
Edit: As I have more pressing things to do, I couldn't resist optimizing this further. I came up with this (... horrendously ugly piece of code) that "only" creates O(n) jobs:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
Runnable worker = new MatrixThread2(this, multiplier, currRow, result);
futures.add(workerPool.submit(worker));
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
private static class MatrixThread2 implements Runnable
{
private Matrix self, mul, result;
private int row;
private MatrixThread2(Matrix a, Matrix b, int row, Matrix result)
{
this.self = a;
this.mul = b;
this.row = row;
this.result = result;
}
@Override
public void run()
{
for(int col = 0; col < mul.dimension; col++) {
int cellResult = 0;
for (int i = 0; i < self.getMatrixDimension(); i++)
cellResult += self.template[row][i] * mul.template[i][col];
result.template[row][col] = cellResult;
}
}
}
It's still not great, but basically the multi-threaded version can compute anything you'll be patient enough to wait for, and it'll do it faster than the single-threaded version.
First of all, you should use a newFixedThreadPool sized to the number of cores you have; on a quad-core, use 4. Second of all, don't create a new one for each matrix.
If you make the ExecutorService a static member variable, I get almost consistently faster execution of the threaded version at a matrix size of 512.
Also, changing MatrixThread to implement Runnable instead of extending Thread speeds up execution to the point where the threaded version is 2x as fast on my machine at size 512.
