I'm having trouble with a multithreaded Java program.
The program splits the sum of an integer array across several threads and then adds up the partial sums of the slices.
The problem is that the computing time does not decrease as the number of threads increases (I know there is a limit beyond which adding threads makes the computation slower than with fewer threads). I expect the execution time to decrease before that limit is reached (the benefit of parallel execution). I use the variable fake in the run method to make the time "readable".
public class MainClass {
private final int MAX_THREAD = 8;
private final int ARRAY_SIZE = 1000000;
private int[] array;
private SimpleThread[] threads;
private int numThread = 1;
private int[] sum;
private int start = 0;
private int totalSum = 0;
long begin, end;
int fake;
MainClass() {
fillArray();
for(int i = 0; i < MAX_THREAD; i++) {
threads = new SimpleThread[numThread];
sum = new int[numThread];
begin = (long) System.currentTimeMillis();
for(int j = 0 ; j < numThread; j++) {
threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
threads[j].start();
start+= ARRAY_SIZE/numThread;
}
for(int k = 0; k < numThread; k++) {
try {
threads[k].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
end = (long) System.currentTimeMillis();
for(int g = 0; g < numThread; g++) {
totalSum+=sum[g];
}
System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
numThread++;
start = 0;
totalSum = 0;
}
}
public static void main(String args[]) {
new MainClass();
}
private void fillArray() {
array = new int[ARRAY_SIZE];
for(int i = 0; i < ARRAY_SIZE; i++)
array[i] = 1;
}
private class SimpleThread extends Thread{
int start;
int size;
int index;
public SimpleThread(int start, int size, int sumIndex) {
this.start = start;
this.size = size;
this.index = sumIndex;
}
public void run() {
for(int i = start; i < start+size; i++)
sum[index]+=array[i];
for(long i = 0; i < 1000000000; i++) {
fake++;
}
}
}
}
As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
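To get a feel for the relative costs on a given machine, here is a rough sketch that times summing 1,000,000 ints against starting and joining 4 threads that do nothing. The class and method names are made up for illustration, and the numbers depend heavily on the JVM, OS and hardware, so treat the output as indicative only.

```java
import java.util.Arrays;

class ThreadOverheadSketch {
    /** Plain sequential sum, the "work" being measured. */
    static long sum(int[] data) {
        long total = 0;
        for (int v : data) total += v;
        return total;
    }

    /** Starts and joins `count` threads that do nothing; returns elapsed nanoseconds. */
    static long timeEmptyThreads(int count) {
        long begin = System.nanoTime();
        Thread[] threads = new Thread[count];
        for (int i = 0; i < count; i++) {
            threads[i] = new Thread(() -> { });
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);
        // Warm up so the JIT has compiled both code paths before timing.
        for (int i = 0; i < 20; i++) {
            sum(data);
            timeEmptyThreads(4);
        }
        long begin = System.nanoTime();
        long total = sum(data);
        long sumNanos = System.nanoTime() - begin;
        long threadNanos = timeEmptyThreads(4);
        System.out.printf("sum=%d took %d us; starting 4 empty threads took %d us%n",
                total, sumNanos / 1_000, threadNanos / 1_000);
    }
}
```

A single run like this is not a rigorous benchmark; a harness such as JMH would be the more reliable choice for real measurements.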
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation, even if you have enough cores to run all threads in parallel. And your benchmarking seems to suggest¹ that that point is around 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read / write an array that is larger than the memory cache, then many / most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this then you can find that the number of main memory requests is too much for the memory system to keep up .... and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem if you can "factor out" the thread creation overheads.
1 - I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
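One way to factor out the thread creation cost is to create the threads once and reuse them for every measurement. The sketch below assumes a fixed thread pool from java.util.concurrent; the class name PooledSumSketch and the method parallelSum are invented for this example. Each task accumulates into a local variable and returns it, so the tasks do not write to shared state while they run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PooledSumSketch {
    /** Sums `data` in `parts` slices on an existing pool, so no threads are created here. */
    static long parallelSum(int[] data, int parts, ExecutorService pool) {
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            // Proportional bounds: the slices cover the whole array even if parts does not divide it.
            final int from = (int) ((long) data.length * p / parts);
            final int to = (int) ((long) data.length * (p + 1) / parts);
            tasks.add(() -> {
                long local = 0; // per-task accumulator, nothing shared
                for (int i = from; i < to; i++) local += data[i];
                return local;
            });
        }
        try {
            long total = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            for (int parts = 1; parts <= 8; parts++) {
                long begin = System.nanoTime();
                long total = parallelSum(data, parts, pool);
                long micros = (System.nanoTime() - begin) / 1_000;
                System.out.printf("parts=%d sum=%d time=%d us%n", parts, total, micros);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Even with the pool reused, an array of this size may still be too small to show a clear speedup, for the memory bandwidth reason given above.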
Why is the sum sometimes wrong?
Because ARRAY_SIZE/numThread is an integer division, any fractional part is discarded (e.g. 1000000/3 evaluates to 333333, not 333333.33...). The slices then fail to cover the last ARRAY_SIZE % numThread elements, so the sum may be less than 1000000, depending on the divisor.
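A minimal sketch of one way to fix the slicing: hand the remainder out one element at a time to the first slices, so that every element belongs to exactly one slice. The helper name bounds is hypothetical, not something from the question.

```java
import java.util.Arrays;

class SliceBoundsSketch {
    /** Returns parts + 1 boundaries; slice k covers the indices [b[k], b[k + 1]). */
    static int[] bounds(int size, int parts) {
        int[] b = new int[parts + 1];
        int base = size / parts;   // minimum slice length
        int extra = size % parts;  // this many slices get one extra element
        for (int k = 0; k < parts; k++) {
            b[k + 1] = b[k] + base + (k < extra ? 1 : 0);
        }
        return b;
    }

    public static void main(String[] args) {
        // prints [0, 333334, 666667, 1000000]
        System.out.println(Arrays.toString(bounds(1_000_000, 3)));
    }
}
```

With these bounds, thread k would sum the indices from b[k] up to but not including b[k + 1], and the last boundary always equals the array length.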
Why does the time taken increase as the number of threads increases?
Because in the run function of each thread you do this:
for(long i = 0; i < 1000000000; i++) {
fake++;
}
I do not understand what you mean by this statement in your question:
I use the variable fake in run method to make time "readable".
Whatever the intent, every thread has to increment the fake variable 1000000000 times, so the total amount of work grows with the number of threads.
As a side note, for what you're trying to do there is the Fork/Join framework. It lets you easily split tasks recursively and implements an algorithm that distributes your workload automatically.
There is a guide available here; its example is very similar to your case, which boils down to a RecursiveTask like this:
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.RecursiveTask;

class Adder extends RecursiveTask<Integer>
{
    private int[] toAdd;
    private int from;
    private int to;

    /** Add the numbers in the given array */
    public Adder(int[] toAdd)
    {
        this(toAdd, 0, toAdd.length);
    }

    /** Add the numbers in the given array between the given indices;
        internal constructor to split work */
    private Adder(int[] toAdd, int fromIndex, int upToIndex)
    {
        this.toAdd = toAdd;
        this.from = fromIndex;
        this.to = upToIndex;
    }

    /** This is the work method */
    @Override
    protected Integer compute()
    {
        int amount = to - from;
        int result = 0;
        if (amount < 500)
        {
            // base case: add ints and return the result
            for (int i = from; i < to; i++)
            {
                result += toAdd[i];
            }
        }
        else
        {
            // array too large: split it into two parts and distribute the actual adding
            int newEndIndex = from + (amount / 2);
            Collection<Adder> subtasks = invokeAll(Arrays.asList(
                    new Adder(toAdd, from, newEndIndex),
                    new Adder(toAdd, newEndIndex, to)));
            for (Adder a : subtasks)
            {
                // invokeAll has already run the subtasks; join() just fetches each result
                result += a.join();
            }
        }
        return result;
    }
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);
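If you are on Java 8 or later, a parallel stream is a shorter way to get the same kind of recursive splitting. Parallel streams run on the common ForkJoinPool by default. This is only a sketch; the class name is made up, and the sum is widened to long to avoid overflow on larger inputs.

```java
import java.util.Arrays;

class ParallelStreamSum {
    /** Sums the array with a parallel stream; splitting and joining are handled by the library. */
    static long sum(int[] data) {
        return Arrays.stream(data).asLongStream().parallel().sum();
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        Arrays.fill(data, 1);
        System.out.println(sum(data)); // prints 1000000
    }
}
```

As with the hand-written version, a million additions is little work, so do not expect this to beat a plain loop by much, if at all.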
Starting threads is heavy and you'll only see the benefit of it on large processes that don't compete for the same resources (none of it applies here).
I'm trying to modify the sequential "Sieve of Eratosthenes" algorithm in order to take advantage of multiple cores. My goal was to increase performance relative to the vanilla algorithm, but all of my attempts have been futile...
Here's what I have thus far:
public class ParallelSieve implements SieveCalculator
{
private int nThreads;
public ParallelSieve(int nThreads) {
this.nThreads = nThreads;
}
@Override
public SieveResult calculate(int ceiling) {
if (ceiling < Primes.MIN) {
return SieveResult.emptyResult();
}
ThreadSafeBitSet isComposite = new ThreadSafeBitSet(ceiling + 1);
ForkJoinPool threadPool = new ForkJoinPool(nThreads);
for (int n = Primes.MIN; n * n <= ceiling; ++n) {
if (isComposite.get(n)) {
continue;
}
int from = n * n;
int to = (ceiling / n) * n;
threadPool.invoke(new RecursivelyMarkSieve(isComposite, from, to, n));
}
threadPool.shutdown();
return new SieveResult(isComposite);
}
private class RecursivelyMarkSieve extends RecursiveAction
{
private static final int THRESHOLD = 1000;
private final ThreadSafeBitSet isComposite;
private final int from, to, step;
RecursivelyMarkSieve(ThreadSafeBitSet isComposite, int from, int to, int step) {
this.isComposite = isComposite;
this.from = from;
this.to = to;
this.step = step;
}
@Override
protected void compute() {
int workload = (to - from) / step + 1;
if (workload <= THRESHOLD) {
for (int index = from; index <= to; index += step) {
isComposite.set(index);
}
return;
}
int middle = (to - from) / (2 * step);
int leftSplit = from + middle * step;
int rightSplit = from + (middle + 1) * step;
ForkJoinTask.invokeAll(
new RecursivelyMarkSieve(isComposite, from, leftSplit, step),
new RecursivelyMarkSieve(isComposite, rightSplit, to, step)
);
}
}
}
My thought process was, once a prime value is found, we can break up the work of marking its multiples via a thread pool. I was drawn to the ForkJoinPool because I can limit the number of threads being used, and easily submit it custom, recursive tasks that break up the work of marking multiples. Still, my solution is too slow! Any suggestions?
With all prospective multi-threading solutions you have to balance the advantage to be gained by multiplying the amount of processing available against the overheads of administering the multi-threaded solution.
In particular:
There is some overhead to starting threads.
If you synchronize or use a thread-safe class (which has synchronization built in) there is the overhead of synchronization, plus the fact that while using synchronized methods you are possibly funnelling the solution back down to a single thread.
Looking at your solution, the actual logic (the compute method) has very little in it in terms of computation, but it accesses the thread-safe bit set and it starts a new thread. So the overheads will far outweigh the actual logic.
To use multi-threading effectively you need to figure out how to split up your task such that there is a significant amount of work to be done by each thread and the use of synchronized data structures is limited. You can't invoke a new thread for each integer you come across.
There is a lot online on how to parallelize the sieve of Eratosthenes, so I suggest looking at how others have tackled the problem.
The general paradigm today is "map-reduce". Split the problem-set into chunks. Process each chunk separately. Collate the results back together again. Repeat and/or recurse.
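Applied to the sieve, that idea is usually called a segmented sieve. The sketch below is one possible arrangement, not the asker's code: the base primes up to sqrt(ceiling) are found sequentially, then each segment is sieved independently into its own local array (the "map" step) and the per-segment prime counts are added together (the "reduce" step). All names and the segment size are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

class SegmentedSieveSketch {
    static final int SEGMENT = 1 << 16; // numbers per chunk; a tuning knob, picked arbitrarily

    /** Counts the primes in [2, ceiling]. */
    static long countPrimes(int ceiling) {
        if (ceiling < 2) return 0;
        int root = (int) Math.sqrt((double) ceiling);
        while ((long) (root + 1) * (root + 1) <= ceiling) root++; // guard against rounding
        while ((long) root * root > ceiling) root--;

        // Sequential part: ordinary sieve for the base primes up to sqrt(ceiling).
        boolean[] composite = new boolean[root + 1];
        List<Integer> basePrimes = new ArrayList<>();
        for (int n = 2; n <= root; n++) {
            if (composite[n]) continue;
            basePrimes.add(n);
            for (long m = (long) n * n; m <= root; m += n) composite[(int) m] = true;
        }
        int[] primes = basePrimes.stream().mapToInt(Integer::intValue).toArray();

        // Parallel part: every segment is sieved on its own, with no shared mutable state.
        int segments = (int) (((long) ceiling - 2) / SEGMENT + 1);
        return IntStream.range(0, segments).parallel()
                .mapToLong(s -> countInSegment(primes, 2L + (long) s * SEGMENT, ceiling))
                .sum();
    }

    /** Sieves [lo, min(lo + SEGMENT - 1, ceiling)] with the base primes and counts survivors. */
    static long countInSegment(int[] primes, long lo, int ceiling) {
        long hi = Math.min(lo + SEGMENT - 1, ceiling);
        boolean[] composite = new boolean[(int) (hi - lo + 1)];
        for (int p : primes) {
            // First multiple of p inside the segment, but never below p * p.
            long first = Math.max((long) p * p, ((lo + p - 1) / p) * p);
            for (long m = first; m <= hi; m += p) composite[(int) (m - lo)] = true;
        }
        long count = 0;
        for (boolean c : composite) if (!c) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000)); // prints 78498
    }
}
```

Because each task owns its array, no thread-safe bit set is needed, and each task does tens of thousands of marking operations rather than a handful.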
Say I want to go through a loop a billion times how could I optimize the loop to get my results faster?
As an example:
double randompoint;
for(long count =0; count < 1000000000; count++) {
randompoint = (Math.random() * 1) + 0; //generate a random point
if(randompoint <= .75) {
var++;
}
}
I was reading up on vectorization, but I'm not quite sure how to go about it. Any ideas?
Since Java is cross-platform, you pretty much have to rely on the JIT to vectorize. In your case it can't, since each iteration depends heavily on the previous one (due to how the RNG works).
However, there are two other major ways to improve your computation.
The first is that this work is very amenable to parallelization. The technical term is embarrassingly parallel. This means that multithreading will give a perfectly linear speedup over the number of cores.
The second is that Math.random() is written to be multithreading safe, which also means that it's slow because it needs to use atomic operations. This isn't helpful, so we can skip that overhead by using a non-threadsafe RNG.
I haven't written much Java since 1.5, but here's a dumb implementation:
import java.util.*;
import java.util.concurrent.*;
class Foo implements Runnable {
private long count;
private double threshold;
private long result;
public Foo(long count, double threshold) {
this.count = count;
this.threshold = threshold;
}
public void run() {
ThreadLocalRandom rand = ThreadLocalRandom.current();
for(long l=0; l<count; l++) {
if(rand.nextDouble() < threshold)
result++;
}
}
public static void main(String[] args) throws Exception {
long count = 1000000000;
double threshold = 0.75;
int cores = Runtime.getRuntime().availableProcessors();
long sum = 0;
List<Foo> list = new ArrayList<Foo>();
List<Thread> threads = new ArrayList<Thread>();
for(int i=0; i<cores; i++) {
// TODO: account for count%cores!=0
Foo t = new Foo(count/cores, threshold);
list.add(t);
Thread thread = new Thread(t);
thread.start();
threads.add(thread);
}
for(Thread t : threads) t.join();
for(Foo f : list) sum += f.result;
System.out.println(sum);
}
}
You can also optimize and inline the random generator, to avoid going via doubles. Here it is with code taken from the ThreadLocalRandom docs:
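public void run() {
    long seed = new Random().nextLong();
    long limit = (long) ((1L<<48) * threshold);
    // count is a long, so the loop counter must be a long as well
    for(long i=0; i<count; i++) {
        // 48-bit linear congruential step, as used by java.util.Random
        seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
        if (seed < limit) ++result;
    }
}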
However, the best approach is to work smarter, not harder. As the number of events increases, the distribution of the number of hits tends towards a normal distribution. This means that for your huge range, you can randomly generate a number with such a distribution and scale it:
import java.util.Random;
class StayInSchool {
public static void main(String[] args) {
System.out.println(coinToss(1000000000, 0.75));
}
static long coinToss(long iterations, double threshold) {
double mean = threshold * iterations;
double stdDev = Math.sqrt(threshold * (1-threshold) * iterations);
double p = new Random().nextGaussian();
return (long) (p*stdDev + mean);
}
}
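For reference, the mean and standard deviation used in coinToss come from the normal approximation to the binomial distribution. Writing n for the number of iterations and p for the threshold (symbols chosen here only for illustration):

```latex
\begin{align*}
X &\sim \mathrm{Binomial}(n, p), \qquad \mathbb{E}[X] = np, \qquad \mathrm{Var}(X) = np(1-p) \\
X &\approx np + Z\sqrt{np(1-p)}, \qquad Z \sim \mathcal{N}(0, 1) \\
n &= 10^{9},\ p = 0.75: \quad np = 7.5 \times 10^{8}, \qquad \sqrt{np(1-p)} = \sqrt{1.875 \times 10^{8}} \approx 1.37 \times 10^{4}
\end{align*}
```

So the result will almost always land within a few tens of thousands of 750,000,000, which is why a single Gaussian draw is a reasonable stand-in for a billion coin tosses.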
Here are the timings on my 4 core system (including VM startup) for these approaches:
Your baseline: 20.9s
Single threaded ThreadLocalRandom: 6.51s
Single threaded optimized random: 1.75s
Multithreaded ThreadLocalRandom: 1.67s
Multithreaded optimized random: 0.89s
Generating a gaussian: 0.14s
I wrote a small program to find the first 5 Taxicab numbers (so far only 6 are known) by checking each integer from 2 to 5E+15. The definition of Taxicab numbers is here.
However, my program took 8 minutes just to reach 3E+7. Since Taxicab(3) is in the order of 8E+7, I hesitate to let it run any further without optimizing it first.
I'm using NetBeans 8 on Ubuntu 16.10 on a HP 8560w, i7 2600qm quad core, 16GB RAM. However, Java only uses 1 core, to a maximum of 25% total CPU power, even when given Very High Priority. How do I fix this?
public class Ramanujan
{
public static void main(String[] args)
{
long limit;
//limit = 20;
limit = 500000000000000000L;
int order = 1;
for (long testCase = 2; testCase < limit; testCase++)
{
if (isTaxicab(testCase, order))
{
System.out.printf("Taxicab(%d) = %d*****************************\n",
order, testCase);
order++;
}
else
{
if (testCase%0x186a0 ==0) //Prints every 100000 iterations to track progress
{
//To track progress
System.out.printf("%d \n", testCase);
}
}
}
}
public static boolean isTaxicab(long testCase, int order)
{
int way = 0; //Number of ways that testCase can be expressed as sum of 2 cube numbers.
long i = 1;
long iUpperBound = (long) (1+Math.cbrt(testCase/2));
//If testCase = i*i*i + j*j*j AND i<=j
//then i*i*i cant be > testCase/2
//No need to test beyond that
while (i < iUpperBound)
{
if ( isSumOfTwoCubes(testCase, i) )
{
way++;
}
i++;
}
return (way >= order);
}
public static boolean isSumOfTwoCubes(long testCase,long i)
{
boolean isSum = false;
long jLowerBound = (long) Math.cbrt(testCase -i*i*i);
for (long j = jLowerBound; j < jLowerBound+2; j++)
{
long sumCubes = i*i*i + j*j*j;
if (sumCubes == testCase)
{
isSum = true;
break;
}
}
return isSum;
}
}
The program itself will only ever use one core until you parallelize it.
You need to learn how to use Threads.
Your problem is embarrassingly parallel. Parallelizing too much (i.e. creating too many threads) will be detrimental because each thread creates an overhead, so you need to be careful regarding exactly how you parallelize.
If it was up to me, I would initialize a list of worker threads where each thread effectively performs isTaxicab() and simply assign a single testCase to each worker as it becomes available.
You would want to code such that you can easily experiment with the number of workers.
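Here is a rough sketch of that idea with a configurable number of workers. It differs from the suggestion above in one respect: instead of handing out a single testCase at a time, it hands each worker a chunk of consecutive values, to keep the per-task overhead small. The class and method names are invented for this example, and ways() is a simplified stand-in for the counting loop in isTaxicab().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TaxicabSketch {
    /** Number of ways n can be written as i^3 + j^3 with 1 <= i <= j. */
    static int ways(long n) {
        int ways = 0;
        for (long i = 1; 2 * i * i * i <= n; i++) {
            long rest = n - i * i * i;
            long j = Math.round(Math.cbrt((double) rest));
            if (j >= i && j * j * j == rest) ways++;
        }
        return ways;
    }

    /** Smallest n in [from, to] with at least minWays representations, or -1 if none. */
    static long firstWithWays(long from, long to, int minWays, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // A few chunks per worker, so a slow chunk does not leave the others idle.
            long chunk = Math.max(1, (to - from + 1) / (workers * 4L));
            List<Callable<Long>> tasks = new ArrayList<>();
            for (long lo = from; lo <= to; lo += chunk) {
                final long start = lo, end = Math.min(to, lo + chunk - 1);
                tasks.add(() -> {
                    for (long n = start; n <= end; n++) {
                        if (ways(n) >= minWays) return n;
                    }
                    return Long.MAX_VALUE; // nothing found in this chunk
                });
            }
            long best = Long.MAX_VALUE;
            for (Future<Long> f : pool.invokeAll(tasks)) best = Math.min(best, f.get());
            return best == Long.MAX_VALUE ? -1 : best;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(firstWithWays(2, 100_000, 2, workers)); // prints 1729
    }
}
```

Changing the workers argument is all it takes to compare thread counts. For the full search you would call this repeatedly on successive ranges rather than on one enormous range.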
I am trying to compare sequential and concurrent matrix multiplication. The sequential version is faster every time. For example, for a 60 x 60 matrix the sequential version takes 4 ms while the concurrent one takes 277 ms. Is something wrong in my code?
concurrent:
private static void multiplyMatrixConcurent() {
result_concurent =new Matrix(rows, columns);
for (int i = 0; i < cell; i++) {
Runnable task = new MatrixMultiplicationThread(i);
Thread worker = new Thread(task);
worker.start();
}
}
private static class MatrixMultiplicationThread implements Runnable{
private int cell;
MatrixMultiplicationThread(int cell) {
this.cell=cell;
}
@Override
public void run() {
int row = cell / columns ;
int column = cell % columns;
for (int i = 0; i < rows; i++) {
double t1 = matrix.getCell(row, i);
double t2= matrix.getCell(i, column);
double temp= t1*t2;
double res = result_concurent.getCell(row, column) +temp;
result_concurent.setCell(res, row, column);
}
}
}
sequential:
private static void multiplyMatrixSequence() {
result_sequantial =new Matrix(rows, columns);
for (int i = 0; i < rows; i++) {
for (int j = 0; j <rows; j++) {
for (int k = 0; k < columns; k++) {
double t1=matrix.getCell(i,k);
double t2=matrix.getCell(k, j);
double temp= t1*t2;
double res = result_sequantial.getCell(i, j) + temp;
result_sequantial.setCell(res,i,j);
}
}
}
}
I don't see anything obviously wrong. You don't set cell to rows*columns in the concurrent startup code you posted but I assume that is an issue in the posting not the code you ran.
Threads have overhead. They have memory to allocate and require extra management of the CPU resources. If the number of threads is modest and the hardware can handle multiple threads in parallel, then you win. However, for pure cpu bound tasks, having more threads than there are processing elements is just overhead without any gain. In this case, you have 3600 threads. I'm guessing you have a processor that can handle between 2 and 8 threads at once. Your thread count dwarfs the processor's ability and so you get a slowdown.
Note that when the threads are performing blocking operations such as disk or network I/O then more threads can allow interleaving. The statements also don't apply in the GPU computing case where even memory accesses allow efficient thread interleaving.
BTW, if your goal is actually to produce a fast matrix multiply - use an existing library. These libraries are developed by people who take advantage of processor cache structures, specialized hardware instruction sets and subtle details of floating point to produce libraries that are faster and more accurate than anything a casual coder can produce.
Creating a Thread takes some time (compared to other operations it is expensive). Instead of creating a new Thread for every cell, you could use a thread pool and re-use existing (finished) threads. This reduces the time spent creating new threads. But you are still in a scenario with a very low execution time per task, where setting up the threads takes more time than running the work sequentially.
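// Requires java.util.concurrent.ExecutorService, Executors and TimeUnit.
private static void multiplyMatrixConcurent() {
    result_concurent = new Matrix(rows, columns);
    // Reuse a small fixed pool instead of creating one thread per cell.
    ExecutorService executor = Executors.newFixedThreadPool(4);
    for (int i = 0; i < cell; i++) {
        Runnable worker = new MatrixMultiplicationThread(i);
        executor.execute(worker);
    }
    executor.shutdown();
    try {
        // Wait for all cells to finish; otherwise the result and any timing are incomplete.
        executor.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}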
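To give each task a more meaningful amount of work, you could go one step further and make a whole row of the result the unit of work instead of a single cell. The following is only a sketch under assumptions: it uses plain double[][] arrays because the Matrix class from the question is not shown, and it relies on a Java 8 parallel stream to spread the rows over the available cores.

```java
import java.util.stream.IntStream;

class RowParallelMultiply {
    /** Computes a * b; each row of the result is written by exactly one task. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, inner = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        IntStream.range(0, n).parallel().forEach(i -> {
            double[] rowA = a[i], rowC = c[i];
            for (int k = 0; k < inner; k++) {
                double aik = rowA[k];
                double[] rowB = b[k];
                // Walk along rows of b and c so memory is accessed sequentially.
                for (int j = 0; j < m; j++) rowC[j] += aik * rowB[j];
            }
        });
        return c;
    }

    public static void main(String[] args) {
        double[][] c = multiply(new double[][] {{1, 2}, {3, 4}},
                                new double[][] {{5, 6}, {7, 8}});
        System.out.println(java.util.Arrays.deepToString(c)); // prints [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

No synchronization is needed because no two tasks touch the same row of the result. For a 60 x 60 matrix even this may not beat the sequential loop; the benefit should show up only for considerably larger matrices.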
I'm using the jsr166y ForkJoinPool to distribute computational tasks amongst threads. But I clearly must be doing something wrong.
My tasks seem to work flawlessly if I create the ForkJoinPool with parallelism > 1 (the default is Runtime.availableProcessors(); I've been running with 2-8 threads). But if I create the ForkJoinPool with parallelism = 1, I see deadlocks after an unpredictable number of iterations.
Yes - setting parallelism = 1 is a strange practice. In this case, I'm profiling a parallel algorithm as the thread count increases, and I want to compare the parallel version, run with a single thread, to a baseline serial implementation, so as to accurately ascertain the overhead of the parallel implementation.
Below is a simple example that illustrates the issue I'm seeing. The 'task' is a dummy iteration over a fixed array, divided recursively into 16 subtasks.
If run with THREADS = 2 (or more), it runs reliably to completion, but if run with THREADS = 1, it invariably deadlocks. After an unpredictable number of iterations, the main loop hangs in ForkJoinPool.invoke(), waiting on task.join(), and the worker thread exits.
I'm running with JDK 1.6.0_21 and 1.6.0_22 under Linux, and using a version of jsr166y downloaded a few days ago from Doug Lea's website (http://gee.cs.oswego.edu/dl/concurrency-interest/index.html)
Any suggestions for what I'm missing? Many thanks in advance.
package concurrent;
import jsr166y.ForkJoinPool;
import jsr166y.RecursiveAction;
public class TestFjDeadlock {
private final static int[] intArray = new int[256 * 1024];
private final static float[] floatArray = new float[256 * 1024];
private final static int THREADS = 1;
private final static int TASKS = 16;
private final static int ITERATIONS = 10000;
public static void main(String[] args) {
// Initialize the array
for (int i = 0; i < intArray.length; i++) {
intArray[i] = i;
}
ForkJoinPool pool = new ForkJoinPool(THREADS);
// Run through ITERATIONS loops, subdividing the iteration into TASKS F-J subtasks
for (int i = 0; i < ITERATIONS; i++) {
pool.invoke(new RecursiveIterate(0, intArray.length));
}
pool.shutdown();
}
private static class RecursiveIterate extends RecursiveAction {
final int start;
final int end;
public RecursiveIterate(final int start, final int end) {
this.start = start;
this.end = end;
}
@Override
protected void compute() {
if ((end - start) <= (intArray.length / TASKS)) {
// We've reached the subdivision limit - iterate over the arrays
for (int i = start; i < end; i += 3) {
floatArray[i] += i + intArray[i];
}
} else {
// Subdivide and start new tasks
final int mid = (start + end) >>> 1;
invokeAll(new RecursiveIterate(start, mid), new RecursiveIterate(mid, end));
}
}
}
}
Looks like a bug in the ForkJoinPool. Everything I can see in the usage of the class fits your example. The only other possibility might be one of your tasks throwing an exception and dying abnormally (although that should still be handled).