Java multiple threads give very small performance gain


I wanted to learn parallel programming for speeding up algorithms and chose Java.
I wrote two functions for summing long integers in an array: the first simply iterates through the array, the second divides the array into parts and sums the parts in separate threads.
I expected a roughly 2x speedup using two threads, which seemed logical. However, what I got is only a 24% speedup. Moreover, using more threads, I don't get any improvement (maybe less than 1%) over two threads. I know that there should be thread creation/joining overhead, but I guess it shouldn't be that big.
Can you please explain what I'm missing or where the error in the code is?
Here is the code:
import java.util.concurrent.ThreadLocalRandom;
public class ParallelTest {
public static long sum1 (long[] num, int a, int b) {
long r = 0;
while (a < b) {
r += num[a];
++a;
}
return r;
}
public static class SumThread extends Thread {
private long num[];
private long r;
private int a, b;
public SumThread (long[] num, int a, int b) {
super();
this.num = num;
this.a = a;
this.b = b;
}
@Override
public void run () {
r = ParallelTest.sum1(num, a, b);
}
public long getSum () {
return r;
}
}
public static long sum2 (long[] num, int a, int b, int threadCnt) throws InterruptedException {
SumThread[] th = new SumThread[threadCnt];
int i = 0, c = (b - a + threadCnt - 1) / threadCnt;
for (;;) {
int a2 = a + c;
if (a2 > b) {
a2 = b;
}
th[i] = new SumThread(num, a, a2);
th[i].start();
if (a2 == b) {
break;
}
a = a2;
++i;
}
for (i = 0; i < threadCnt; ++i) {
th[i].join();
}
long r = 0;
for (i = 0; i < threadCnt; ++i) {
r += th[i].getSum();
}
return r;
}
public static void main(String[] args) throws InterruptedException {
final int N = 230000000;
long[] num = new long[N];
for (int i = 0; i < N; ++i) {
num[i] = ThreadLocalRandom.current().nextLong(1, 9999);
}
// System.out.println(Runtime.getRuntime().availableProcessors());
long timestamp = System.nanoTime();
System.out.println(sum1(num, 0, num.length));
System.out.println(System.nanoTime() - timestamp);
for (int n = 2; n <= 4; ++n) {
timestamp = System.nanoTime();
System.out.println(sum2(num, 0, num.length, n));
System.out.println(System.nanoTime() - timestamp);
}
}
}
EDIT: I have an i7 processor with 4 cores (8 threads).
The output of the code is:
1149914787860
175689196
1149914787860
149224086
1149914787860
147709988
1149914787860
138243999

The program is probably limited by main memory bandwidth with just two threads, as it's a small loop that fetches data almost as fast as RAM can supply it to the processor.

I can think of a number of reasons why you might not get as much speedup as you are expecting.
Thread creation overheads are substantial. Thread start() is an expensive operation, which entails multiple syscalls to allocate a thread stack and its "red-zone" and to then create the native thread.
The N threads will not all start at the same time. That means that the time to complete the parallel part of the computation will be approximately the end time of the last thread minus the start time of the first thread. That will be longer than the time one thread takes to do its part of the work. (By N-1 times the thread creation time ...)
The N threads are (basically) doing a serial scan of N disjoint sections of the array. This is memory bandwidth intensive, AND the way that you are scanning means that the memory caches are going to be ineffective. Therefore, there is a good chance that performance is limited by the speed and bandwidth of your system's main memory hardware.
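One way to see how much of the gap comes from thread start-up rather than memory bandwidth is to create the worker threads once, run a few warm-up rounds, and only then time the summation. The following is a minimal sketch under that assumption; the class name PooledSum, the method names, the array size and the number of warm-up rounds are illustrative and not taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PooledSum {
    /** Sums num[from, to) on the calling thread. */
    static long sumRange(long[] num, int from, int to) {
        long r = 0;
        for (int i = from; i < to; i++) {
            r += num[i];
        }
        return r;
    }

    /** Splits num into `parts` contiguous chunks and sums them on an already running pool. */
    static long sumOnPool(long[] num, int parts, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        List<Future<Long>> futures = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            // long arithmetic keeps the index computation from overflowing int
            final int from = (int) ((long) num.length * p / parts);
            final int to = (int) ((long) num.length * (p + 1) / parts);
            Callable<Long> task = () -> sumRange(num, from, to);
            futures.add(pool.submit(task));
        }
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get(); // blocks until that chunk is done
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] num = new long[4_000_000]; // assumed size, much smaller than in the question
        Arrays.fill(num, 3L);
        int threads = 4;                  // assumed thread count
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int i = 0; i < 5; i++) { // warm-up: starts the pool threads, lets the JIT compile
                sumOnPool(num, threads, pool);
            }
            long t0 = System.nanoTime();
            long total = sumOnPool(num, threads, pool);
            long elapsed = System.nanoTime() - t0;
            System.out.println(total + " in " + elapsed + " ns");
        } finally {
            pool.shutdown();
        }
    }
}
```

If the timed round is still not much faster than the single-threaded loop, thread creation is probably not the main cost and memory bandwidth is the more likely limit.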

Related

How can I best optimize a loop that will loop over a billion times?

Say I want to go through a loop a billion times. How could I optimize the loop to get my results faster?
As an example:
double randompoint;
for(long count =0; count < 1000000000; count++) {
randompoint = (Math.random() * 1) + 0; //generate a random point
if(randompoint <= .75) {
var++;
}
}
I was reading up on vectorization, but I'm not quite sure how to go about it. Any ideas?

Since Java is cross-platform, you pretty much have to rely on the JIT to vectorize. In your case it can't, since each iteration depends heavily on the previous one (due to how the RNG works).
However, there are two other major ways to improve your computation.
The first is that this work is very amenable to parallelization. The technical term is embarrassingly parallel. This means that multithreading can give a close to linear speedup over the number of cores.
The second is that Math.random() is written to be multithreading safe, which also means that it's slow because it needs to use atomic operations. This isn't helpful, so we can skip that overhead by using a non-threadsafe RNG.
I haven't written much Java since 1.5, but here's a dumb implementation:
import java.util.*;
import java.util.concurrent.*;
class Foo implements Runnable {
private long count;
private double threshold;
private long result;
public Foo(long count, double threshold) {
this.count = count;
this.threshold = threshold;
}
public void run() {
ThreadLocalRandom rand = ThreadLocalRandom.current();
for(long l=0; l<count; l++) {
if(rand.nextDouble() < threshold)
result++;
}
}
public static void main(String[] args) throws Exception {
long count = 1000000000;
double threshold = 0.75;
int cores = Runtime.getRuntime().availableProcessors();
long sum = 0;
List<Foo> list = new ArrayList<Foo>();
List<Thread> threads = new ArrayList<Thread>();
for(int i=0; i<cores; i++) {
// TODO: account for count%cores!=0
Foo t = new Foo(count/cores, threshold);
list.add(t);
Thread thread = new Thread(t);
thread.start();
threads.add(thread);
}
for(Thread t : threads) t.join();
for(Foo f : list) sum += f.result;
System.out.println(sum);
}
}
You can also optimize and inline the random generator, to avoid going via doubles. Here it is with code taken from the ThreadLocalRandom docs:
public void run() {
long seed = new Random().nextLong();
long limit = (long) ((1L<<48) * threshold);
for(long i=0; i<count; i++) {
seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
if (seed < limit) ++result;
}
}
However, the best approach is to work smarter, not harder. As the number of events increases, the probability tends towards a normal distribution. This means that for your huge range, you can randomly generate a number with such a distribution and scale it:
import java.util.Random;
class StayInSchool {
public static void main(String[] args) {
System.out.println(coinToss(1000000000, 0.75));
}
static long coinToss(long iterations, double threshold) {
double mean = threshold * iterations;
double stdDev = Math.sqrt(threshold * (1-threshold) * iterations);
double p = new Random().nextGaussian();
return (long) (p*stdDev + mean);
}
}
Here are the timings on my 4 core system (including VM startup) for these approaches:
Your baseline: 20.9s
Single threaded ThreadLocalRandom: 6.51s
Single threaded optimized random: 1.75s
Multithreaded ThreadLocalRandom: 1.67s
Multithreaded optimized random: 0.89s
Generating a gaussian: 0.14s

Not expected result with multithread programming

I'm having trouble with a multithreaded Java program.
The program splits an array of integers into slices, sums each slice in its own thread, and then adds up the slice sums.
The problem is that the computing time does not decrease as the number of threads increases (I know that beyond a certain number of threads the computation becomes slower than with fewer threads). I expect to see the execution time decrease before that limit is reached (the benefit of parallel execution). I use the variable fake in the run method to make the time "readable".
public class MainClass {
private final int MAX_THREAD = 8;
private final int ARRAY_SIZE = 1000000;
private int[] array;
private SimpleThread[] threads;
private int numThread = 1;
private int[] sum;
private int start = 0;
private int totalSum = 0;
long begin, end;
int fake;
MainClass() {
fillArray();
for(int i = 0; i < MAX_THREAD; i++) {
threads = new SimpleThread[numThread];
sum = new int[numThread];
begin = (long) System.currentTimeMillis();
for(int j = 0 ; j < numThread; j++) {
threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
threads[j].start();
start+= ARRAY_SIZE/numThread;
}
for(int k = 0; k < numThread; k++) {
try {
threads[k].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
end = (long) System.currentTimeMillis();
for(int g = 0; g < numThread; g++) {
totalSum+=sum[g];
}
System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
numThread++;
start = 0;
totalSum = 0;
}
}
public static void main(String args[]) {
new MainClass();
}
private void fillArray() {
array = new int[ARRAY_SIZE];
for(int i = 0; i < ARRAY_SIZE; i++)
array[i] = 1;
}
private class SimpleThread extends Thread{
int start;
int size;
int index;
public SimpleThread(int start, int size, int sumIndex) {
this.start = start;
this.size = size;
this.index = sumIndex;
}
public void run() {
for(int i = start; i < start+size; i++)
sum[index]+=array[i];
for(long i = 0; i < 1000000000; i++) {
fake++;
}
}
}
}
Unexpected Result Screenshot

As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation even if you have enough cores to run all threads in parallel. And your benchmarking seems to suggest [1] that that point is around about 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read / write an array that is larger than the memory cache, then many / most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this then you can find that the number of main memory requests is too much for the memory system to keep up ... and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem if you can "factor out" the thread creation overheads.
[1] I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
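To get a rough feel for the start-up overhead described above, one can time how long it takes to start and join a few threads that do nothing, next to the time for summing 1,000,000 ints on one thread. This is only a sketch: the class name ThreadStartCost and its method are made up for illustration, the numbers vary a lot between machines and runs, and a naive nanoTime loop like this is not a rigorous benchmark.

```java
import java.util.Arrays;

final class ThreadStartCost {
    /** Starts `count` threads that do nothing, joins them, and returns the elapsed nanoseconds. */
    static long startAndJoinNanos(int count) throws InterruptedException {
        Thread[] threads = new Thread[count];
        long t0 = System.nanoTime();
        for (int i = 0; i < count; i++) {
            threads[i] = new Thread(() -> { /* intentionally no work */ });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] array = new int[1_000_000]; // same size as ARRAY_SIZE in the question
        Arrays.fill(array, 1);
        for (int round = 0; round < 5; round++) { // later rounds are less affected by JIT warm-up
            long t0 = System.nanoTime();
            long sum = 0;
            for (int v : array) {
                sum += v;
            }
            long sumNanos = System.nanoTime() - t0;
            long threadNanos = startAndJoinNanos(4);
            System.out.printf("sum=%d in %d ns; start+join of 4 idle threads: %d ns%n",
                    sum, sumNanos, threadNanos);
        }
    }
}
```

Comparing the two figures in the later rounds gives an idea of whether the per-thread work is large enough to pay for the threads.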

Why is the sum sometimes wrong?
Because ARRAY_SIZE/numThread may have a fractional part (e.g. 1000000/3=333333.3333333333), which integer division rounds down. The start variable then falls short, so the sum may be less than 1000000, depending on the value of the divisor.
Why does the time taken increase as the number of threads increases?
Because in the run function of each thread you do this:
for(long i = 0; i < 1000000000; i++) {
fake++;
}
I do not understand what you mean by this sentence in your question:
I use the variable fake in run method to make time "readable".
But every thread needs to increment your fake variable 1000000000 times.
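A common way to avoid losing elements to the rounding is to compute each chunk boundary directly from the chunk index, so that the last chunk always ends at the array length. Here is a small sketch of that idea; ChunkBounds and chunkStart are hypothetical names, not part of the code in the question.

```java
final class ChunkBounds {
    /** Start index (inclusive) of chunk `index` when `length` items are split into `parts` chunks. */
    static int chunkStart(int length, int parts, int index) {
        // long arithmetic avoids int overflow of length * index for big arrays
        return (int) ((long) length * index / parts);
    }

    public static void main(String[] args) {
        int length = 1_000_000;
        int parts = 3;
        int covered = 0;
        for (int i = 0; i < parts; i++) {
            int from = chunkStart(length, parts, i);
            int to = chunkStart(length, parts, i + 1); // end of chunk i is the start of chunk i + 1
            covered += to - from;
            System.out.println("chunk " + i + ": [" + from + ", " + to + ")");
        }
        System.out.println("covered " + covered + " of " + length);
    }
}
```

With 1,000,000 elements and 3 parts the chunks are [0, 333333), [333333, 666666) and [666666, 1000000), so every element is summed exactly once.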

As a side note, for what you're trying to do there is the Fork/Join framework. It allows you to easily split tasks recursively and implements an algorithm which will distribute your workload automatically.
There is a guide available here; its example is very similar to your case, which boils down to a RecursiveTask like this:
class Adder extends RecursiveTask<Integer>
{
private int[] toAdd;
private int from;
private int to;
/** Add the numbers in the given array */
public Adder(int[] toAdd)
{
this(toAdd, 0, toAdd.length);
}
/** Add the numbers in the given array between the given indices;
internal constructor to split work */
private Adder(int[] toAdd, int fromIndex, int upToIndex)
{
this.toAdd = toAdd;
this.from = fromIndex;
this.to = upToIndex;
}
/** This is the work method */
@Override
protected Integer compute()
{
int amount = to - from;
int result = 0;
if (amount < 500)
{
// base case: add ints and return the result
for (int i = from; i < to; i++)
{
result += toAdd[i];
}
}
else
{
// array too large: split it into two parts and distribute the actual adding
int newEndIndex = from + (amount / 2);
Collection<Adder> invokeAll = invokeAll(Arrays.asList(
new Adder(toAdd, from, newEndIndex),
new Adder(toAdd, newEndIndex, to)));
for (Adder a : invokeAll)
{
result += a.join(); // invokeAll has already run the subtasks; join just fetches the result
}
}
return result;
}
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);

Starting threads is heavy and you'll only see the benefit of it on large processes that don't compete for the same resources (none of it applies here).

Java: how to optimize sum of big array

I am trying to solve a problem on Codeforces, and I get a "Time limit exceeded" judgment. The only time-consuming operation is calculating the sum of a big array. So I've tried to optimize it, but with no result.
What I want: Optimize the next function:
//array could be Integer.MAX_VALUE length
private long canocicalSum(int[] array) {
int sum = 0;
for (int i = 0; i < array.length; i++)
sum += array[i];
return sum;
}
Question1 [main]: Is it possible to optimize canonicalSum?
I've tried: to avoid operations with very big numbers. So I decided to use auxiliary data. For instance, I convert array1[100] to array2[10], where array2[i] = array1[i] + array1[i+1] + ... + array1[i+9].
private long optimizedSum(int[] array, int step) {
do {
array = sumItr(array, step);
} while (array.length != 1);
return array[0];
}
private int[] sumItr(int[] array, int step) {
int length = array.length / step + 1;
boolean needCompensation = (array.length % step == 0) ? false : true;
int aux[] = new int[length];
for (int i = 0, auxSum = 0, auxPointer = 0; i < array.length; i++) {
auxSum += array[i];
if ((i + 1) % step == 0) {
aux[auxPointer++] = auxSum;
auxSum = 0;
}
if (i == array.length - 1 && needCompensation) {
aux[auxPointer++] = auxSum;
}
}
return aux;
}
Problem: But it appears that canonicalSum is ten times faster than optimizedSum. Here is my test:
@Test
public void sum_comparison() {
final int ARRAY_SIZE = 100000000;
final int STEP = 1000;
int[] array = genRandomArray(ARRAY_SIZE);
System.out.println("Start canonical Sum");
long beg1 = System.nanoTime();
long sum1 = canocicalSum(array);
long end1 = System.nanoTime();
long time1 = end1 - beg1;
System.out.println("canon:" + TimeUnit.MILLISECONDS.convert(time1, TimeUnit.NANOSECONDS) + "milliseconds");
System.out.println("Start optimizedSum");
long beg2 = System.nanoTime();
long sum2 = optimizedSum(array, STEP);
long end2 = System.nanoTime();
long time2 = end2 - beg2;
System.out.println("custom:" + TimeUnit.MILLISECONDS.convert(time2, TimeUnit.NANOSECONDS) + "milliseconds");
assertEquals(sum1, sum2);
assertTrue(time2 <= time1);
}
private int[] genRandomArray(int size) {
int[] array = new int[size];
Random random = new Random();
for (int i = 0; i < array.length; i++) {
array[i] = random.nextInt();
}
return array;
}
Question2: Why does optimizedSum work slower than canonicalSum?

As of Java 9, vectorisation of this operation has been implemented but disabled, based on benchmarks measuring the all-in cost of the code plus its compilation. Depending on your processor, this leads to the relatively entertaining result that if you introduce artificial complications into your reduction loop, you can trigger autovectorisation and get a quicker result! So the fastest code, for now, assuming numbers small enough not to overflow, is:
public int sum(int[] data) {
int value = 0;
for (int i = 0; i < data.length; ++i) {
value += 2 * data[i];
}
return value / 2;
}
This isn't intended as a recommendation! This is more to illustrate that the speed of your code in Java is dependent on the JIT, its trade-offs, and its bugs/features in any given release. Writing cute code to optimise problems like this is at best vain and will put a shelf life on the code you write. For instance, had you manually unrolled a loop to optimise for an older version of Java, your code would be much slower in Java 8 or 9 because this decision would completely disable autovectorisation. You'd better really need that performance to do it.

Question1 [main]: Is it possible to optimize canonicalSum?
Yes, it is. But I have no idea with what factor.
Some things you can do are:
use the parallel pipelines introduced in Java 8. The processor has instructions for summing 2 (or more) arrays in parallel. This can be observed in Octave: summing two vectors with ".+" (parallel addition) or "+" is way faster than using a loop.
use multithreading. You could use a divide and conquer algorithm. Maybe like this:
divide the array into 2 or more
keep dividing recursively until you get an array with manageable size for a thread.
start computing the sum for the sub arrays (divided arrays) with separate threads.
finally add the sum generated (from all the threads) for all sub arrays together to produce final result
maybe unrolling the loop would help a bit, too. By loop unrolling I mean reducing the steps the loop will have to make by doing more operations in the loop manually.
An example from http://en.wikipedia.org/wiki/Loop_unwinding :
for (int x = 0; x < 100; x++)
{
delete(x);
}
becomes
for (int x = 0; x < 100; x+=5)
{
delete(x);
delete(x+1);
delete(x+2);
delete(x+3);
delete(x+4);
}
but, as mentioned, this must be done with caution and profiling, since the JIT could probably do this kind of optimization itself.
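For the first suggestion in the list, a parallel stream over the array might look like the sketch below. StreamSum and parallelSum are illustrative names. Note that a parallel stream splits the work across the threads of the common fork/join pool, which is a different thing from the processor's vector instructions, and that the elements are widened to long here, unlike the int accumulator in the question. Whether it beats the plain loop depends on the array size and the machine, so it should be measured.

```java
import java.util.Arrays;

final class StreamSum {
    /** Sums the array with a parallel stream, widening each element to long first. */
    static long parallelSum(int[] array) {
        return Arrays.stream(array)   // IntStream over the elements
                .asLongStream()       // widen to long so the total cannot wrap around int
                .parallel()           // split the work across the common ForkJoinPool
                .sum();
    }

    public static void main(String[] args) {
        int[] array = {Integer.MAX_VALUE, Integer.MAX_VALUE, 1};
        System.out.println(parallelSum(array)); // prints 4294967295
    }
}
```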
An implementation of mathematical operations using the multithreaded approach can be seen here.
An example implementation with the Fork/Join framework introduced in Java 7, which basically does what the divide and conquer algorithm above does, would be:
public class ForkJoinCalculator extends RecursiveTask<Double> {
public static final long THRESHOLD = 1_000_000;
private final SequentialCalculator sequentialCalculator;
private final double[] numbers;
private final int start;
private final int end;
public ForkJoinCalculator(double[] numbers, SequentialCalculator sequentialCalculator) {
this(numbers, 0, numbers.length, sequentialCalculator);
}
private ForkJoinCalculator(double[] numbers, int start, int end, SequentialCalculator sequentialCalculator) {
this.numbers = numbers;
this.start = start;
this.end = end;
this.sequentialCalculator = sequentialCalculator;
}
@Override
protected Double compute() {
int length = end - start;
if (length <= THRESHOLD) {
return sequentialCalculator.computeSequentially(numbers, start, end);
}
ForkJoinCalculator leftTask = new ForkJoinCalculator(numbers, start, start + length/2, sequentialCalculator);
leftTask.fork();
ForkJoinCalculator rightTask = new ForkJoinCalculator(numbers, start + length/2, end, sequentialCalculator);
Double rightResult = rightTask.compute();
Double leftResult = leftTask.join();
return leftResult + rightResult;
}
}
Here we develop a RecursiveTask splitting an array of doubles until
the length of a subarray doesn't go below a given threshold. At this
point the subarray is processed sequentially applying on it the
operation defined by the following interface
The interface used is this:
public interface SequentialCalculator {
double computeSequentially(double[] numbers, int start, int end);
}
And the usage example:
public static double varianceForkJoin(double[] population){
final ForkJoinPool forkJoinPool = new ForkJoinPool();
double total = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
@Override
public double computeSequentially(double[] numbers, int start, int end) {
double total = 0;
for (int i = start; i < end; i++) {
total += numbers[i];
}
return total;
}
}));
final double average = total / population.length;
double variance = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
@Override
public double computeSequentially(double[] numbers, int start, int end) {
double variance = 0;
for (int i = start; i < end; i++) {
variance += (numbers[i] - average) * (numbers[i] - average);
}
return variance;
}
}));
return variance / population.length;
}

If you want to add N numbers then the runtime is O(N). So in this aspect your canonicalSum can not be "optimized".
What you can do to reduce runtime is make the summation parallel, i.e. break the array into parts, pass them to separate threads, and in the end sum the results returned by each thread.
Update: This implies a multicore system, but there is a Java API to get the number of cores.

How to run this code faster?

import java.io.*;
import java.util.ArrayList;
public class Ristsumma {
static long numberFromFile;
static long sum1, sum2;
static long number, number2;
static long variable, variable2;
static long counter;
public static void main(String args[]) throws IOException{
try{
BufferedReader br = new BufferedReader(new FileReader("ristsis.txt"));
numberFromFile = Long.parseLong(br.readLine());
br.close();
}catch(Exception e){
e.printStackTrace();
}
variable=numberFromFile;
ArrayList<Long> numbers = new ArrayList<Long>();
while (variable > 0){
number = variable %10;
variable/=10;
numbers.add(number);
}
for (int i=0; i< numbers.size(); i++) {
sum1 += numbers.get(i);
}
ArrayList<Long> numbers2 = new ArrayList<Long>();
for(long s=1; s<numberFromFile; s++){
variable2=s;
number2=0;
sum2=0;
while (variable2 > 0){
number2 = variable2 %10;
variable2/=10;
numbers2.add(number2);
}
for (int i=0; i< numbers2.size(); i++) {
sum2 += numbers2.get(i);
}
if(sum1==sum2){
counter+=1;
}
numbers2.clear();
}
PrintWriter pw = new PrintWriter("ristval.txt", "UTF-8");
pw.println(counter);
pw.close();
}
}
So I have this code. It takes a number from a file, splits it into its digits and adds them together (for example, if the number is 123, it gives 1+2+3=6). In the second half it goes through all numbers from 1 up to the number in the file and counts how many of them give the same digit sum. If the number is 123, the sum is 6 and the answer that the code writes is 9 (because 6, 15, 24, 33, 42, 51, 60, 105, 114 also give the same answer). The code works, but my problem is that when the number from the file is, for example, 2 222 222 222, it takes almost half an hour to get the answer. How can I make this run faster?

Remove unnecessary creation of lists
You are unnecessarily creating lists
ArrayList<Long> numbers = new ArrayList<Long>();
while (variable > 0){
number = variable %10;
variable/=10;
numbers.add(number);
}
for (int i=0; i< numbers.size(); i++) {
sum1 += numbers.get(i);
}
Here you create an ArrayList just to temporarily hold Longs; you can eliminate the entire list:
while (variable > 0){
number = variable %10;
variable/=10;
sum1 += number;
}
The same for the other arraylist numbers2
Presize ArrayLists
We have already eliminated the ArrayLists, but if we hadn't, we could improve speed by presizing the lists
ArrayList<Long> numbers = new ArrayList<Long>(someGuessAsToSize);
It isn't necessary that your guess be correct; the ArrayList will still resize automatically. But if the guess is approximately correct, you will speed up the code, as the ArrayList will not have to resize periodically.
General style
You are holding lots of (what should be) method variables as fields
static long numberFromFile;
static long sum1, sum2;
static long number, number2;
static long variable, variable2;
static long counter;
This is unlikely to affect performance but is an unusual thing to do and makes the code less readable with the potential for "hidden effects"

Your problem is intriguing - it got me wondering how much faster it would run with threads.
Here is a threaded implementation that splits the task of calculating the second problem across threads. My laptop only has two cores so I have set the threads to 4.
public static void main(String[] args) throws Exception {
final long in = 222222222;
final long target = calcSum(in);
final ExecutorService executorService = Executors.newFixedThreadPool(4);
final Collection<Future<Integer>> futures = Lists.newLinkedList();
final int chunk = 100;
for (long i = in; i > 0; i -= chunk) {
futures.add(executorService.submit(new Counter(i > chunk ? i - chunk : 0, i, target)));
}
long res = 0;
for (final Future<Integer> f : futures) {
res += f.get();
}
System.out.println(res);
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.DAYS);
}
public static final class Counter implements Callable<Integer> {
private final long start;
private final long end;
private final long target;
public Counter(long start, long end, long target) {
this.start = start;
this.end = end;
this.target = target;
}
@Override
public Integer call() throws Exception {
int count = 0;
for (long i = start; i < end; ++i) {
if (calcSum(i) == target) {
++count;
}
}
return count;
}
}
public static long calcSum(long num) {
long sum = 0;
while (num > 0) {
sum += num % 10;
num /= 10;
}
return sum;
}
It calculates the solution with 222 222 222 as an input in a few seconds.
I optimised the calculation of the sum to remove all the Lists that you were using.
EDIT
I added some timing code using Stopwatch and tried with and without @Ingo's optimisation using 222222222 * 100 as the input number.
Without the optimisation the code takes 35 seconds. Changing the calc method to:
public static long calcSum(long num, final long limit) {
long sum = 0;
while (num > 0) {
sum += num % 10;
if (limit > 0 && sum > limit) {
break;
}
num /= 10;
}
return sum;
}
With the added optimisation the code takes 28 seconds.
Note that this is a highly non-scientific benchmark as I didn't warm the JIT or run multiple trials (partly because I'm lazy and partly because I'm busy).
EDIT
Fiddling with the chunk size gives fairly different results too. With a chunk of 1000 time drops to around 17 seconds.
EDIT
If you want to be really fancy you can use a ForkJoinPool:
public static void main(String[] args) throws Exception {
final long in = 222222222;
final long target = calcSum(in);
final ForkJoinPool forkJoinPool = new ForkJoinPool();
final ForkJoinTask<Integer> result = forkJoinPool.submit(new Counter(0, in, target));
System.out.println(result.get());
forkJoinPool.shutdown();
forkJoinPool.awaitTermination(1, TimeUnit.DAYS);
}
public static final class Counter extends RecursiveTask<Integer> {
private static final long THRESHOLD = 1000;
private final long start;
private final long end;
private final long target;
public Counter(long start, long end, long target) {
this.start = start;
this.end = end;
this.target = target;
}
@Override
protected Integer compute() {
if (end - start < THRESHOLD) {
return computeDirectly();
}
long mid = start + (end - start) / 2;
final Counter low = new Counter(start, mid, target);
final Counter high = new Counter(mid, end, target);
low.fork();
final int highResult = high.compute();
final int lowResult = low.join();
return highResult + lowResult;
}
private Integer computeDirectly() {
int count = 0;
for (long i = start; i < end; ++i) {
if (calcSum(i) == target) {
++count;
}
}
return count;
}
}
public static long calcSum(long num) {
long sum = 0;
while (num > 0) {
sum += num % 10;
num /= 10;
}
return sum;
}
On a different (much faster) computer this runs in under a second, as compared to 2.8 seconds for the original approach.

You spend most of the time checking numbers that are failing the test. However, as Ingo observed, if you have a number ab, then (a-1)(b+1) has the same sum as ab. Instead of checking all numbers, you can generate them:
Let's say our number is 2 222, the sum is 8.
Approach #1: bottom up
We now generate the number starting with the smallest (we pad with zeroes for reading convenience): 0008. The next one is 0017, the next are 0026, 0035, 0044, 0053, 0062, 0071, 0080, 0107 and so on. The problematic part is finding the first number that has this sum.
Approach #2: top down
We start at 2222, the next lower number is 2213, then 2204, 2150, 2141, and so on. Here you don't have the problem that you need to find the lowest number.
I don't have time to write code now, but there should be an algorithm to realize both approaches, that does not involve trying out all numbers.
For a number abc, (a)(b-1)(c+1) is the next lower number, while (a)(b+1)(c-1) is the next higher number. The only interesting/difficult thing is when you need to overflow because b==9 or c==9, or b==0, c==0. The next bigger number if b==9 is (a+1)(9)(c-1) if c>0, and (a)(8)(0) if c==0. Now go make your algorithm, these examples should be enough.
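If only the count is needed (as in the question), another way to avoid trying out all numbers is to count them instead of generating them. The sketch below uses a standard digit-by-digit counting technique, which is a different method from the bottom-up and top-down generation described above; DigitSumCount and its method names are made up for illustration, and it assumes a non-negative input.

```java
final class DigitSumCount {
    /** Digit sum of a non-negative number. */
    static int digitSum(long n) {
        int sum = 0;
        while (n > 0) {
            sum += (int) (n % 10);
            n /= 10;
        }
        return sum;
    }

    /** Counts x with 0 <= x < n whose decimal digits add up to target (n must be >= 0). */
    static long countBelow(long n, int target) {
        String digits = Long.toString(n);
        int len = digits.length();
        int maxSum = 9 * len;
        if (target < 0 || target > maxSum) {
            return 0;
        }
        // ways[k][s] = number of strings of k digits (0-9 each) whose digits sum to s
        long[][] ways = new long[len + 1][maxSum + 1];
        ways[0][0] = 1;
        for (int k = 1; k <= len; k++) {
            for (int s = 0; s <= maxSum; s++) {
                for (int d = 0; d <= 9 && d <= s; d++) {
                    ways[k][s] += ways[k - 1][s - d];
                }
            }
        }
        long count = 0;
        int prefixSum = 0; // digit sum of the leading digits of n fixed so far
        for (int i = 0; i < len; i++) {
            int limit = digits.charAt(i) - '0';
            int free = len - i - 1; // positions to the right that may hold any digit
            // Placing a smaller digit here makes the number less than n,
            // so everything to the right is unconstrained.
            for (int d = 0; d < limit; d++) {
                int rem = target - prefixSum - d;
                if (rem >= 0) {
                    count += ways[free][rem];
                }
            }
            prefixSum += limit;
        }
        return count;
    }

    public static void main(String[] args) {
        long n = 2_222_222_222L; // the large input mentioned in the question
        System.out.println(countBelow(n, digitSum(n)));
    }
}
```

For the example from the question, countBelow(123, 6) returns 9. The work grows with the number of digits rather than with the size of the number, so the large input needs no threads at all.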

Observe that you don't need to store the individual digits at all.
Instead, all you're interested in is the actual sum of the digits.
Considering this, a method like
static int diagsum(long number) { ... }
would be great. If it is easy enough, the JIT could inline it, or at least optimize better than your spaghetti code.
Then again, you could benefit from another method that stops computing the digit sum at some limit. For example, when you have
2222222222
the sum is 20, and that means that you need not compute any other sum that is greater than 20. For example:
45678993
Instead, you could just stop after you have the last 3 digits (which you get first with your division method), because 9+9+3 is 21 and this is already greater than 20.
===================================================================
Another optimization:
If you have some number:
123116
it is immediately clear that all unique permutations of those 6 digits have the same digit sum, thus
321611, 231611, ... are solutions
Then, for any pair of individual digits ab, a transformed number would contain (a+1)(b-1) and (a-1)(b+1) in the same place, as long as a+1, ... are still in the range 0..9. Apply recursively to get even more numbers.
You can then turn to numbers with less digits. Obviously, to have the same digit sum, you must combine 2 digits of the original number, if possible, for example
5412 => 912, 642, 741, 552, 561, 543
etc.
Apply the same algorithm recursively as above, until no transformations and combinations are possible.
=========
It must be said, though, that the above idea would take lots of memory, because one must maintain a Set-like data structure to take care of duplicates. However, for 987_654_321 we already get 39_541_589 results, and probably many more with even greater numbers. Thus it is questionable whether the effort to actually do it the combinatorial way is worth it.

Multi-threaded matrix multiplication

I've coded a multi-threaded matrix multiplication. I believe my approach is right, but I'm not 100% sure. In respect to the threads, I don't understand why I can't just run a (new MatrixThread(...)).start() instead of using an ExecutorService.
Additionally, when I benchmark the multithreaded approach versus the classical approach, the classical is much faster...
What am I doing wrong?
Matrix Class:
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class Matrix
{
private int dimension;
private int[][] template;
public Matrix(int dimension)
{
this.template = new int[dimension][dimension];
this.dimension = template.length;
}
public Matrix(int[][] array)
{
this.dimension = array.length;
this.template = array;
}
public int getMatrixDimension() { return this.dimension; }
public int[][] getArray() { return this.template; }
public void fillMatrix()
{
Random randomNumber = new Random();
for(int i = 0; i < dimension; i++)
{
for(int j = 0; j < dimension; j++)
{
template[i][j] = randomNumber.nextInt(10) + 1;
}
}
}
@Override
public String toString()
{
String retString = "";
for(int i = 0; i < this.getMatrixDimension(); i++)
{
for(int j = 0; j < this.getMatrixDimension(); j++)
{
retString += " " + this.getArray()[i][j];
}
retString += "\n";
}
return retString;
}
public static Matrix classicalMultiplication(Matrix a, Matrix b)
{
int[][] result = new int[a.dimension][b.dimension];
for(int i = 0; i < a.dimension; i++)
{
for(int j = 0; j < b.dimension; j++)
{
for(int k = 0; k < b.dimension; k++)
{
result[i][j] += a.template[i][k] * b.template[k][j];
}
}
}
return new Matrix(result);
}
public Matrix multiply(Matrix multiplier) throws InterruptedException
{
Matrix result = new Matrix(dimension);
ExecutorService es = Executors.newFixedThreadPool(dimension*dimension);
for(int currRow = 0; currRow < multiplier.dimension; currRow++)
{
for(int currCol = 0; currCol < multiplier.dimension; currCol++)
{
//(new MatrixThread(this, multiplier, currRow, currCol, result)).start();
es.execute(new MatrixThread(this, multiplier, currRow, currCol, result));
}
}
es.shutdown();
es.awaitTermination(2, TimeUnit.DAYS);
return result;
}
private class MatrixThread extends Thread
{
private Matrix a, b, result;
private int row, col;
private MatrixThread(Matrix a, Matrix b, int row, int col, Matrix result)
{
this.a = a;
this.b = b;
this.row = row;
this.col = col;
this.result = result;
}
@Override
public void run()
{
int cellResult = 0;
for (int i = 0; i < a.getMatrixDimension(); i++)
cellResult += a.template[row][i] * b.template[i][col];
result.template[row][col] = cellResult;
}
}
}
Main class:
import java.util.Scanner;
public class MatrixDriver
{
private static final Scanner kb = new Scanner(System.in);
public static void main(String[] args) throws InterruptedException
{
Matrix first, second;
long timeLastChanged,timeNow;
double elapsedTime;
System.out.print("Enter value of n (must be a power of 2):");
int n = kb.nextInt();
first = new Matrix(n);
first.fillMatrix();
second = new Matrix(n);
second.fillMatrix();
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using threads:\n" +
first.multiply(second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Threaded took "+elapsedTime+" seconds");
timeLastChanged = System.currentTimeMillis();
//System.out.println("Product of the two using classical:\n" +
Matrix.classicalMultiplication(first,second);
timeNow = System.currentTimeMillis();
elapsedTime = (timeNow - timeLastChanged)/1000.0;
System.out.println("Classical took "+elapsedTime+" seconds");
}
}
P.S. Please let me know if any further clarification is needed.

There is a bunch of overhead involved in creating threads, even when using an ExecutorService. I suspect the reason why your multithreaded approach is so slow is that you're spending 99% of the time creating new threads and only 1%, or less, doing the actual math.
Typically, to solve this problem you'd batch a whole bunch of operations together and run those on a single thread. I'm not 100% sure how to do that in this case, but I suggest breaking your matrix into smaller chunks (say, 10 smaller matrices) and running those on threads, instead of running each cell in its own thread.
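One possible way to do that batching, as a rough sketch: split the rows of the result into bands and give each band to one task. It works on plain `int[][]` arrays rather than the `Matrix` class from the question, assumes square matrices of equal size, and the names `ChunkedMultiply`, `multiply`, `multiplyBand` and the `chunks` parameter are invented for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one task per band of rows instead of one per cell.
final class ChunkedMultiply {

    /** Multiplies square matrices a and b using at most chunks (>= 1) tasks. */
    static int[][] multiply(int[][] a, int[][] b, int chunks) {
        final int n = a.length;
        final int[][] result = new int[n][n];
        // Rows per band, rounded up so that every row is covered.
        final int bandSize = Math.max(1, (n + chunks - 1) / chunks);
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int start = 0; start < n; start += bandSize) {
                final int from = start;
                final int to = Math.min(n, start + bandSize);
                futures.add(pool.submit(() -> multiplyBand(a, b, result, from, to)));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for the band and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for bands", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return result;
    }

    /** Computes rows from (inclusive) to to (exclusive) of a * b. */
    private static void multiplyBand(int[][] a, int[][] b, int[][] result, int from, int to) {
        final int n = a.length;
        for (int i = from; i < to; i++) {
            for (int j = 0; j < n; j++) {
                int cell = 0;
                for (int k = 0; k < n; k++) {
                    cell += a[i][k] * b[k][j];
                }
                result[i][j] = cell;
            }
        }
    }
}
```

Each task writes to its own rows only, so no locking is needed, and a call such as `ChunkedMultiply.multiply(x, y, 10)` creates 10 threads at most regardless of the matrix size. Whether 10 is a good value depends on the machine; the number of available processors is probably a more sensible choice.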

You're creating a lot of threads. Not only is it expensive to create threads, but for a CPU bound application, you don't want more threads than you have available processors (if you do, you have to spend processing power switching between threads, which also is likely to cause cache misses which are very expensive).
It's also unnecessary to send a thread to execute; all it needs is a Runnable. You'll get a big performance boost by applying these changes:
1. Make the ExecutorService a static member, size it for the current processor, and send it a ThreadFactory so it doesn't keep the program running after main has finished. (It would probably be architecturally cleaner to send it as a parameter to the method rather than keeping it as a static field; I leave that as an exercise for the reader. ☺)
private static final ExecutorService workerPool =
Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(), new ThreadFactory() {
public Thread newThread(Runnable r) {
Thread t = new Thread(r);
t.setDaemon(true);
return t;
}
});
2. Make MatrixThread implement Runnable rather than inherit Thread. Threads are expensive to create; POJOs are very cheap. You can also make it static, which makes the instances smaller (as non-static classes get an implicit reference to the enclosing object).
private static class MatrixThread implements Runnable
3. From change (1), you can no longer use awaitTermination to make sure all tasks are finished (as this worker pool is shared and never shut down). Instead, use the submit method, which returns a Future<?>. Collect all the future objects in a list, and when you've submitted all the tasks, iterate over the list and call get for each object.
Your multiply method should now look something like this:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
for(int currCol = 0; currCol < multiplier.dimension; currCol++) {
Runnable worker = new MatrixThread(this, multiplier, currRow, currCol, result);
futures.add(workerPool.submit(worker));
}
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
Will it be faster than the single-threaded version? Well, on my arguably crappy box the multithreaded version is slower for values of n < 1024.
This is just scratching the surface, though. The real problem is that you create a lot of MatrixThread instances - your memory consumption is O(n²), which is a very bad sign. Moving the inner for loop into MatrixThread.run would improve performance by a factor of craploads (ideally, you don't create more tasks than you have worker threads).
Edit: As I have more pressing things to do, I couldn't resist optimizing this further. I came up with this (... horrendously ugly piece of code) that "only" creates O(n) jobs:
public Matrix multiply(Matrix multiplier) throws InterruptedException {
Matrix result = new Matrix(dimension);
List<Future<?>> futures = new ArrayList<Future<?>>();
for(int currRow = 0; currRow < multiplier.dimension; currRow++) {
Runnable worker = new MatrixThread2(this, multiplier, currRow, result);
futures.add(workerPool.submit(worker));
}
for (Future<?> f : futures) {
try {
f.get();
} catch (ExecutionException e){
throw new RuntimeException(e); // shouldn't happen, but might do
}
}
return result;
}
private static class MatrixThread2 implements Runnable
{
private Matrix self, mul, result;
private int row;
private MatrixThread2(Matrix a, Matrix b, int row, Matrix result)
{
this.self = a;
this.mul = b;
this.row = row;
this.result = result;
}
@Override
public void run()
{
for(int col = 0; col < mul.dimension; col++) {
int cellResult = 0;
for (int i = 0; i < self.getMatrixDimension(); i++)
cellResult += self.template[row][i] * mul.template[i][col];
result.template[row][col] = cellResult;
}
}
}
It's still not great, but basically the multi-threaded version can compute anything you'll be patient enough to wait for, and it'll do it faster than the single-threaded version.

First of all, you should use a newFixedThreadPool sized to the number of cores you have; on a quad-core, use 4. Second, don't create a new one for each matrix.
If you make the ExecutorService a static member variable, I get almost consistently faster execution of the threaded version at a matrix size of 512.
Also, changing MatrixThread to implement Runnable instead of extending Thread speeds up execution further, to where the threaded version is 2x as fast on my machine at size 512.
