If I have a simple program that parallelizes counting the number of 1s among random integers from 0 to 9 over a large number of iterations, how do I reduce the per-thread counter (numOnes) with a sum, so that I can use the total later in my program?
This is equivalent to the reduction directive in OpenMP.
public void run() {
long work = total_iterations / threads;
long numOnes = 0;
for (long i = 0; i < work; i++) {
int randomNum = rand.nextInt(10); // the bound is exclusive, so this yields 0-9
if (randomNum == 1) {
numOnes += 1;
}
}
}
Once each thread is done executing, I want to be able to use numOnes containing the aggregate result.
In Java, you have to manage this manually. In other words: assuming that your data is partitioned, you just kick off your threads and let them do their work.
In the end, you join all those threads: only once every thread has been joined are all of them done, and you are ready to go forward and process their results.
Alternatively, you could look into more "abstract" things like ExecutorService and Futures, to get away from dealing with "bare metal" threads directly.
Of course that is pretty generic; but well, so is your question.
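To make the ExecutorService/Future suggestion a bit more concrete, here is a minimal sketch of a sum reduction. The class name OnesCounter, the method countOnes and the iteration counts are made up for illustration; each task keeps a private counter and the partial counts are added up once the futures complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical example class, not part of the question's code.
class OnesCounter {
    /** Counts how often 1 is drawn from 0..9, splitting the draws across 'threads' tasks. */
    static long countOnes(long totalIterations, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            long base = totalIterations / threads;
            long remainder = totalIterations % threads;
            for (int t = 0; t < threads; t++) {
                // The first 'remainder' tasks take one extra iteration so none are lost.
                final long work = base + (t < remainder ? 1 : 0);
                Callable<Long> task = () -> {
                    long numOnes = 0; // private to this task, so no locking is needed
                    ThreadLocalRandom rand = ThreadLocalRandom.current();
                    for (long i = 0; i < work; i++) {
                        if (rand.nextInt(10) == 1) {
                            numOnes++;
                        }
                    }
                    return numOnes;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that task is done; this is the reduction step
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("numOnes = " + countOnes(10_000_000L, 4));
    }
}
```

The value returned by countOnes plays the role of the reduced numOnes and can be used by the rest of the program like any other local variable.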
You could use streams for this.
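import java.util.Random;
public class ParallelInts
{
    public static void main(String[] args) {
        // filter + count is safe in parallel; a reduce with (sum, i) -> sum + ((i==1)?1:0)
        // is not associative and can give wrong totals when partial results are combined
        long count = new Random().ints( 1_000_000, 0, 10 ).parallel()
                .filter( i -> i == 1 )
                .count();
        System.out.println( "count = " + count );
    }
}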
Here is my code. It is just a basic introduction to parallel computing that was done in class. There was supposed to be a speedup of around 2 when adding up the numbers in a random array. All but about four people got the expected results. Note that I am using a 2020 MacBook Air with a 1.2 GHz quad-core i7.
import java.util.Arrays;
public class TestProgram
{
public static double addSequential(double[] values) //method that adds numbers in an array.
{
double sum = 0;
for(int i=0; i<values.length; i++)
sum += values[i];
return sum;
}
public static double addParallel(double[] values) //method that adds code to potentially do parallel computing.
{
int mid = values.length / 2; //calculates a mid point.
SumArrayTask left = new SumArrayTask(0, mid, values);
SumArrayTask right = new SumArrayTask(mid, values.length, values);
left.fork();
right.compute();
left.join();
return left.getResult() + right.getResult();
}
public static void main(String[] args)
{
double[] arr = new double[10];
for(int i = 0; i<arr.length; i++) //create an array with 10 RANDOM values 0-100
arr[i] = Math.floor(100*Math.random()); //Math.random picks a random # between 0-1, so we multiply by 100.
System.out.println(Arrays.toString(arr));
long start, sequentialTime, parallelTime;
start = System.nanoTime();
System.out.println("Result (sequential): " + addSequential(arr)); //Prints out all elements of array added up.
System.out.println("Time: " + (sequentialTime = System.nanoTime() - start) + " ns"); //Prints how many nanoseconds the processing takes.
start = System.nanoTime();
System.out.println("Result (parallel): " + addParallel(arr)); //Prints out all elements of array added up with parallel
System.out.println("Time: " + (parallelTime = System.nanoTime() - start) + " ns"); //Prints how many nanoseconds the parallel processing takes.
System.out.println("Speedup: " + (double) sequentialTime / parallelTime); //cast avoids truncating long division
}
}
import java.util.concurrent.RecursiveAction;
public class SumArrayTask extends RecursiveAction
{
private int start;
private int end;
private double[] data;
private double result;
public SumArrayTask(int startIndex, int endIndex, double[] arr)
{
start = startIndex;
end = endIndex;
data = arr;
}
public double getResult() //getter method for result
{
return result;
}
protected void compute()
{
double sum = 0;
for(int i = start; i<end; i++)
sum += data[i];
result = sum;
}
}
My result:
I was expecting a speedup of around 2. I've had others try and they get a completely different result with their pc's. I am completely unsure if it may have something to do with my setup or the code itself. I appreciate any help.
First of all, your way of "benchmarking" will always give misleading results:
You do I/O (System.out.println()) within the benchmarked code. This alone will take much longer than adding ten numbers.
You do not execute the code multiple times. The first executions in Java will always be slower than later ones, due to the warm-up ("learning") phase of the HotSpot JIT compiler.
Seeing that a simple "add ten doubles" task seemingly takes more than 100,000 clock cycles could already have alarmed you that your measuring must be wrong. Ten additions should not take more than maybe 100 cycles or so.
Now let's talk about parallel execution. There is a cost to creating and managing a thread (or letting the java.util.concurrent package do it for you), and this can be quite high. So, although each parallel task will probably (*) consume less time than the full loop, the management time for the threads will outweigh that by far in your case.
So, in general, only think about parallel execution for code that takes seconds, not microseconds.
(*) It's not even clear that the half-array loops will take less time than the full-array loop, as there are more variables involved, making it harder for the HotSpot compiler to do aggressive optimizations such as loop unrolling.
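As a rough illustration of the two benchmarking points above (no I/O inside the timed region, and repeated runs after a warm-up), here is a minimal hand-rolled sketch. The class SumBenchmark, the method bestTimeNanos and the warm-up and run counts are assumptions made for this example; for serious measurements a dedicated harness such as JMH is the usual recommendation.

```java
import java.util.Random;

// Hypothetical example class; the counts used in main are arbitrary.
class SumBenchmark {
    // Results are written here so the JIT cannot simply discard the summing calls.
    static volatile double sink;

    static double addSequential(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    /** Best wall-clock time in nanoseconds over 'runs' timed calls, after 'warmups' untimed ones. */
    static long bestTimeNanos(double[] values, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            sink = addSequential(values); // lets the JIT compile the hot loop first
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink = addSequential(values); // only the computation is timed, no printing
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        double[] data = new Random(42).doubles(10_000_000).toArray();
        long best = bestTimeNanos(data, 20, 20);
        System.out.println("Best of 20 runs: " + best + " ns"); // printing happens after timing
    }
}
```

Even with these precautions the numbers are only indicative, since garbage collection and other processes can still disturb individual runs.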
I'm trying to modify the sequential "Sieve of Eratosthenes" algorithm in order to take advantage of multiple cores. My goal was to increase performance relative to the vanilla algorithm, but all of my attempts have been futile...
Here's what I have thus far:
public class ParallelSieve implements SieveCalculator
{
private int nThreads;
public ParallelSieve(int nThreads) {
this.nThreads = nThreads;
}
@Override
public SieveResult calculate(int ceiling) {
if (ceiling < Primes.MIN) {
return SieveResult.emptyResult();
}
ThreadSafeBitSet isComposite = new ThreadSafeBitSet(ceiling + 1);
ForkJoinPool threadPool = new ForkJoinPool(nThreads);
for (int n = Primes.MIN; n * n <= ceiling; ++n) {
if (isComposite.get(n)) {
continue;
}
int from = n * n;
int to = (ceiling / n) * n;
threadPool.invoke(new RecursivelyMarkSieve(isComposite, from, to, n));
}
threadPool.shutdown();
return new SieveResult(isComposite);
}
private class RecursivelyMarkSieve extends RecursiveAction
{
private static final int THRESHOLD = 1000;
private final ThreadSafeBitSet isComposite;
private final int from, to, step;
RecursivelyMarkSieve(ThreadSafeBitSet isComposite, int from, int to, int step) {
this.isComposite = isComposite;
this.from = from;
this.to = to;
this.step = step;
}
@Override
protected void compute() {
int workload = (to - from) / step + 1;
if (workload <= THRESHOLD) {
for (int index = from; index <= to; index += step) {
isComposite.set(index);
}
return;
}
int middle = (to - from) / (2 * step);
int leftSplit = from + middle * step;
int rightSplit = from + (middle + 1) * step;
ForkJoinTask.invokeAll(
new RecursivelyMarkSieve(isComposite, from, leftSplit, step),
new RecursivelyMarkSieve(isComposite, rightSplit, to, step)
);
}
}
}
My thought process was, once a prime value is found, we can break up the work of marking its multiples via a thread pool. I was drawn to the ForkJoinPool because I can limit the number of threads being used, and easily submit it custom, recursive tasks that break up the work of marking multiples. Still, my solution is too slow! Any suggestions?
With all prospective multi-threading solutions you have to balance the advantage to be gained by multiplying the amount of processing available against the overheads of administering the multi-threaded solution.
In particular:
There is some overhead to starting threads.
If you synchronize or use a thread-safe class (which has synchronization built in) there is the overhead of synchronization, plus the fact that while using synchronized methods you are possibly funnelling the solution back down to a single thread.
Looking at your solution, the actual logic (the compute method) does very little computation, but it accesses the thread-safe bit set and it starts a new thread. So the overheads will far outweigh the actual logic.
To use multi-threading effectively you need to figure out how to split up your task such that there is a significant amount of work to be done by each thread and the use of synchronized data structures is limited. You can't invoke a new thread for each integer you come across.
There is a lot online on how to parallelize the sieve of Eratosthenes, so I suggest looking at how others have tackled the problem.
The general paradigm today is "map-reduce". Split the problem-set into chunks. Process each chunk separately. Collate the results back together again. Repeat and/or recurse.
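One way to apply the "split into chunks, process separately, collate" idea to the sieve is a segmented sieve. The following is only a sketch under my own assumptions (the class SegmentedSieve, its method names and the segment size are invented here, and it counts primes instead of returning a SieveResult): the primes up to sqrt(ceiling) are found sequentially, then each segment is sieved into its own private array, so no thread-safe bit set is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical example class; it does not implement the question's SieveCalculator interface.
class SegmentedSieve {
    /** Plain sequential sieve; only used for the small range up to sqrt(ceiling). */
    static List<Integer> smallPrimes(int limit) {
        boolean[] composite = new boolean[limit + 1];
        List<Integer> primes = new ArrayList<>();
        for (int n = 2; n <= limit; n++) {
            if (composite[n]) continue;
            primes.add(n);
            for (long m = (long) n * n; m <= limit; m += n) composite[(int) m] = true;
        }
        return primes;
    }

    /** Counts the primes in [lo, hi] using an array private to this call. */
    static int countPrimesInSegment(int lo, int hi, List<Integer> basePrimes) {
        boolean[] composite = new boolean[hi - lo + 1];
        for (int p : basePrimes) {
            // First multiple of p inside the segment, but never below p*p.
            long first = Math.max((long) p * p, ((lo + (long) p - 1) / p) * p);
            for (long m = first; m <= hi; m += p) composite[(int) (m - lo)] = true;
        }
        int count = 0;
        for (boolean c : composite) if (!c) count++;
        return count;
    }

    /** Map: sieve each segment independently. Reduce: add up the per-segment counts. */
    static int countPrimes(int ceiling, int segmentSize) {
        if (ceiling < 2) return 0;
        List<Integer> basePrimes = smallPrimes((int) Math.sqrt(ceiling));
        int segments = (ceiling - 2) / segmentSize + 1;
        return IntStream.range(0, segments).parallel()
                .map(s -> {
                    int lo = 2 + s * segmentSize;
                    int hi = (int) Math.min((long) lo + segmentSize - 1, ceiling);
                    return countPrimesInSegment(lo, hi, basePrimes);
                })
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000, 32_768)); // prints 78498
    }
}
```

Each segment represents a sizeable amount of work, and the only shared data is the read-only list of base primes, which addresses both overheads named above. Whether it actually beats the sequential sieve on a given machine still has to be measured.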
I'm having trouble with a multithreaded Java program.
The program splits an array of integers into slices, sums each slice in its own thread, and then adds up the slice sums.
The problem is that the computing time does not decrease as the number of threads increases. (I know there is a limit beyond which more threads make the computation slower than fewer threads.) I expected the execution time to decrease up to that limit, thanks to parallel execution. I use the variable fake in the run method to make the time "readable".
public class MainClass {
private final int MAX_THREAD = 8;
private final int ARRAY_SIZE = 1000000;
private int[] array;
private SimpleThread[] threads;
private int numThread = 1;
private int[] sum;
private int start = 0;
private int totalSum = 0;
long begin, end;
int fake;
MainClass() {
fillArray();
for(int i = 0; i < MAX_THREAD; i++) {
threads = new SimpleThread[numThread];
sum = new int[numThread];
begin = (long) System.currentTimeMillis();
for(int j = 0 ; j < numThread; j++) {
threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
threads[j].start();
start+= ARRAY_SIZE/numThread;
}
for(int k = 0; k < numThread; k++) {
try {
threads[k].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
end = (long) System.currentTimeMillis();
for(int g = 0; g < numThread; g++) {
totalSum+=sum[g];
}
System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
numThread++;
start = 0;
totalSum = 0;
}
}
public static void main(String args[]) {
new MainClass();
}
private void fillArray() {
array = new int[ARRAY_SIZE];
for(int i = 0; i < ARRAY_SIZE; i++)
array[i] = 1;
}
private class SimpleThread extends Thread{
int start;
int size;
int index;
public SimpleThread(int start, int size, int sumIndex) {
this.start = start;
this.size = size;
this.index = sumIndex;
}
public void run() {
for(int i = start; i < start+size; i++)
sum[index]+=array[i];
for(long i = 0; i < 1000000000; i++) {
fake++;
}
}
}
}
(Screenshot of the unexpected result omitted.)
As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation, even if you have enough cores to run all threads in parallel. And your benchmarking seems to suggest [1] that that point is around 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read / write an array that is larger than the memory cache, then many / most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this then you can find that the number of main memory requests is too much for the memory system to keep up .... and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem if you can "factor out" the thread creation overheads.
1 - I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
Why sum is wrong sometimes?
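Because ARRAY_SIZE/numThread is an integer division. When the array size is not a multiple of the thread count (e.g. 1000000/3 = 333333.33...), the fractional part is discarded, so each slice is rounded down and the start variable falls short. The slices then no longer cover the whole array, and the sum may be less than 1000000, depending on the value of the divisor.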
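A small sketch of one way to compute slice boundaries that always cover the whole array: the remainder of the division is handed out one element at a time to the first slices. The class ChunkBounds and the method bounds are names invented for this example.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the question's code.
class ChunkBounds {
    /** Returns parts + 1 boundaries; slice k covers the indices [b[k], b[k + 1]). */
    static int[] bounds(int length, int parts) {
        int[] b = new int[parts + 1];
        int base = length / parts;   // size every slice gets at least
        int extra = length % parts;  // leftover elements, one each for the first 'extra' slices
        for (int k = 0; k < parts; k++) {
            b[k + 1] = b[k] + base + (k < extra ? 1 : 0);
        }
        return b;
    }

    public static void main(String[] args) {
        // prints [0, 333334, 666667, 1000000]
        System.out.println(Arrays.toString(bounds(1_000_000, 3)));
    }
}
```

With boundaries like these, thread j would sum the indices from b[j] up to but not including b[j + 1], and the last boundary is always the array length.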
Why the time taken is increasing as the number of threads increases?
Because in the run function of each thread you do this:
for(long i = 0; i < 1000000000; i++) {
fake++;
}
In your question you say that you "use the variable fake in run method to make time 'readable'"; I do not understand what that means. Either way, every thread has to increment your fake variable 1000000000 times.
As a side note, for what you're trying to do there is the Fork/Join framework. It allows you to easily split tasks recursively, and it implements an algorithm that distributes your workload automatically.
There is a guide available here; its example is very similar to your case and boils down to a RecursiveTask like this:
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.RecursiveTask;
class Adder extends RecursiveTask<Integer>
{
private int[] toAdd;
private int from;
private int to;
/** Add the numbers in the given array */
public Adder(int[] toAdd)
{
this(toAdd, 0, toAdd.length);
}
/** Add the numbers in the given array between the given indices;
internal constructor to split work */
private Adder(int[] toAdd, int fromIndex, int upToIndex)
{
this.toAdd = toAdd;
this.from = fromIndex;
this.to = upToIndex;
}
/** This is the work method */
@Override
protected Integer compute()
{
int amount = to - from;
int result = 0;
if (amount < 500)
{
// base case: add ints and return the result
for (int i = from; i < to; i++)
{
result += toAdd[i];
}
}
else
{
// array too large: split it into two parts and distribute the actual adding
int newEndIndex = from + (amount / 2);
Collection<Adder> invokeAll = invokeAll(Arrays.asList(
new Adder(toAdd, from, newEndIndex),
new Adder(toAdd, newEndIndex, to)));
for (Adder a : invokeAll)
{
result += a.join(); // the subtasks have already run; join() just fetches each result
}
}
return result;
}
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);
Starting threads is heavy and you'll only see the benefit of it on large processes that don't compete for the same resources (none of it applies here).
I wrote a small program to find the first 5 Taxicab numbers (so far only 6 are known) by checking each integer from 2 to 5E+15. The definition of Taxicab numbers is here.
However, my program took 8 minutes just to reach 3E+7. Since Taxicab(3) is in the order of 8E+7, I hesitate to let it run any further without optimizing it first.
I'm using NetBeans 8 on Ubuntu 16.10 on a HP 8560w, i7 2600qm quad core, 16GB RAM. However, Java only uses 1 core, to a maximum of 25% total CPU power, even when given Very High Priority. How do I fix this?
public class Ramanujan
{
public static void main(String[] args)
{
long limit;
//limit = 20;
limit = 500000000000000000L;
int order = 1;
for (long testCase = 2; testCase < limit; testCase++)
{
if (isTaxicab(testCase, order))
{
System.out.printf("Taxicab(%d) = %d*****************************\n",
order, testCase);
order++;
}
else
{
if (testCase%0x186a0 ==0) //Prints every 100000 iterations to track progress
{
//To track progress
System.out.printf("%d \n", testCase);
}
}
}
}
public static boolean isTaxicab(long testCase, int order)
{
int way = 0; //Number of ways that testCase can be expressed as sum of 2 cube numbers.
long i = 1;
long iUpperBound = (long) (1+Math.cbrt(testCase/2));
//If testCase = i*i*i + j*j*j AND i<=j
//then i*i*i cant be > testCase/2
//No need to test beyond that
while (i < iUpperBound)
{
if ( isSumOfTwoCubes(testCase, i) )
{
way++;
}
i++;
}
return (way >= order);
}
public static boolean isSumOfTwoCubes(long testCase,long i)
{
boolean isSum = false;
long jLowerBound = (long) Math.cbrt(testCase -i*i*i);
for (long j = jLowerBound; j < jLowerBound+2; j++)
{
long sumCubes = i*i*i + j*j*j;
if (sumCubes == testCase)
{
isSum = true;
break;
}
}
return isSum;
}
}
The program itself will only ever use one core until you parallelize it.
You need to learn how to use Threads.
Your problem is embarrassingly parallel. Parallelizing too much (i.e. creating too many threads) will be detrimental because each thread creates an overhead, so you need to be careful regarding exactly how you parallelize.
If it was up to me, I would initialize a list of worker threads where each thread effectively performs isTaxicab() and simply assign a single testCase to each worker as it becomes available.
You would want to code such that you can easily experiment with the number of workers.
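Here is a minimal sketch of the worker idea with an adjustable number of workers. It deviates from the description above in one respect: each task handles a chunk of consecutive test cases instead of a single one, to keep the per-task overhead small. The class ParallelTaxicabSearch, its methods and the chunk size of 10,000 are assumptions for illustration, and the cube-root rounding is only trustworthy for moderately sized inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; it counts representations differently from the question's isTaxicab.
class ParallelTaxicabSearch {
    /** Number of ways n can be written as i^3 + j^3 with 1 <= i <= j. */
    static int waysAsSumOfTwoCubes(long n) {
        int ways = 0;
        for (long i = 1; 2 * i * i * i <= n; i++) {
            long rest = n - i * i * i;
            long j = Math.round(Math.cbrt((double) rest)); // assumption: n small enough for exact rounding
            if (j * j * j == rest) ways++;
        }
        return ways;
    }

    /** Finds all n in [from, to] with at least minWays representations, using 'workers' threads. */
    static List<Long> search(long from, long to, int minWays, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            long chunk = 10_000; // hypothetical chunk size; worth experimenting with
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (long lo = from; lo <= to; lo += chunk) {
                final long start = lo;
                final long end = Math.min(to, lo + chunk - 1);
                futures.add(pool.submit(() -> {
                    List<Long> hits = new ArrayList<>(); // private to this task
                    for (long n = start; n <= end; n++) {
                        if (waysAsSumOfTwoCubes(n) >= minWays) hits.add(n);
                    }
                    return hits;
                }));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                all.addAll(f.get()); // futures are collected in submission order, so results stay sorted
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(search(2, 100_000, 2, workers));
    }
}
```

This sketch queues every chunk up front, which is fine for small ranges but would need batching for a range as large as the one in the question.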
I have an array of size N. I want to shuffle its elements in 2 threads (or more). Each thread should work with its own part of the array.
Let's say the first thread shuffles elements from 0 to K, and the second thread shuffles elements from K to N (where 0 < K < N). So, it can look like this:
//try-catch stuff is omitted
static void shuffle(int[] array) {
Thread t1 = new ShufflingThread(array, 0, array.length / 2);
Thread t2 = new ShufflingThread(array, array.length / 2, array.length);
t1.start();
t2.start();
t1.join();
t2.join();
}
public static void main(String[] args) {
int[] array = generateBigSortedArray();
shuffle(array);
}
Are there any guarantees from the JVM that I will see the changes in the array from the main method after such shuffling?
How should I implement ShufflingThread (or how should I run it, maybe within a synchronized block or something else) in order to get such guarantees?
The join() calls are sufficient to ensure memory coherency: when t1.join() returns, the main thread "sees" whatever operations thread t1 did on the array.
Also, Java guarantees that there is no word-tearing on arrays: distinct threads may use distinct elements of the same array without needing synchronization.
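A minimal sketch of this approach, assuming a Fisher-Yates shuffle per half (the question does not say how ShufflingThread shuffles, so the class HalfShuffle and the method shuffleRange are invented here). The two threads touch disjoint index ranges and use no locks; the join() calls are what make their writes visible to the caller.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical example class standing in for the question's ShufflingThread.
class HalfShuffle {
    /** Fisher-Yates shuffle restricted to the index range [from, to). */
    static void shuffleRange(int[] a, int from, int to) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = to - 1; i > from; i--) {
            int j = rnd.nextInt(from, i + 1); // from <= j <= i
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    static void shuffle(int[] array) {
        int mid = array.length / 2;
        Thread t1 = new Thread(() -> shuffleRange(array, 0, mid));
        Thread t2 = new Thread(() -> shuffleRange(array, mid, array.length));
        t1.start();
        t2.start();
        try {
            // After join() returns, everything the thread wrote is visible here.
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] array = new int[20];
        for (int i = 0; i < array.length; i++) array[i] = i;
        shuffle(array);
        System.out.println(Arrays.toString(array));
    }
}
```

Note that shuffling the two halves independently never moves an element from one half to the other, so this is not a uniform shuffle of the whole array; it only mirrors the partitioning described in the question.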
I think this is a good exercise in thread control, where (1) a job can be broken up into several parts, (2) the parts can run independently and asynchronously, and (3) a master thread monitors the completion of all such jobs in their respective threads. All you need is for this master thread to wait() and be notify()-ed jobCount times, once each time a thread completes execution. Here is some sample code that you can compile and run. Uncomment the println()'s to see more.
Notes: [1] The JVM doesn't guarantee the order of execution of the threads. [2] You need to synchronize when your master thread accesses the big array, so that it does not see corrupted data.
public class ShufflingArray {
private int nPart = 4, // Count of jobs distributed, resource dependent
activeThreadCount, // Currently active, monitored with notify
iRay[]; // Array the threads will work on
public ShufflingArray (int[] a) {
iRay = a;
printArray (a);
}
private void printArray (int[] ia) {
for (int i = 0 ; i < ia.length ; i++)
System.out.print (" " + ((ia[i] < 10) ? " " : "") + ia[i]);
System.out.println();
}
public void shuffle () {
int startNext = 0, pLen = iRay.length / nPart; // make a bunch of parts
activeThreadCount = nPart; // set before any thread starts, so no decrement can race with an increment
for (int i = 0 ; i < nPart ; i++) {
int start = (i == 0) ? 0 : startNext,
stop = start + pLen;
startNext = stop;
if (i == (nPart-1))
stop = iRay.length;
new Thread (new ShuffleOnePart (start, stop, (i+1))).start();
}
waitOnShufflers (0); // returns when activeThreadCount == 0
printArray (iRay);
}
synchronized private void waitOnShufflers (int bump) {
if (bump == 0) {
while (activeThreadCount > 0) {
// System.out.println ("Waiting on " + activeThreadCount + " threads");
try {
wait();
} catch (InterruptedException intex) {
}}} else {
activeThreadCount += bump;
notify();
}}
public class ShuffleOnePart implements Runnable {
private int startIndex, stopIndex; // Operate on global array iRay
public ShuffleOnePart (int i, int j, int k) {
startIndex = i;
stopIndex = j;
// System.out.println ("Shuffler part #" + k);
}
// Suppose shuffling means interchanging the first and last pairs
public void run () {
int tmp = iRay[startIndex+1];
iRay[startIndex+1] = iRay[startIndex]; iRay[startIndex] = tmp;
tmp = iRay[stopIndex-1];
iRay[stopIndex-1] = iRay[stopIndex-2]; iRay[stopIndex-2] = tmp;
try { // Lets imagine it needs to do something else too
Thread.sleep (157);
} catch (InterruptedException iex) { }
waitOnShufflers (-1);
}}
public static void main (String[] args) {
int n = 25, ia[] = new int[n];
for (int i = 0 ; i < n ; i++)
ia[i] = i+1;
new ShufflingArray(ia).shuffle();
}}
Thread.start() and Thread.join() are enough to give you the happens-before relationships between the array initialisation, its hand-off to the threads and then the read back in the main method.
Actions that cause happens-before are documented here.
As mentioned elsewhere, ForkJoin is very well suited to this kind of divide-and-conquer algorithm and frees you from a lot of the book-keeping that you'd otherwise need to implement.
Another way to get consistent behaviour is to use an ExecutorService from the java.util.concurrent package with Callable tasks that each return their part of the array once the thread has finished.
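A sketch of what the ExecutorService and Callable variant could look like. The class CallableShuffle and its methods are hypothetical: every task shuffles a private copy of its slice and returns it, and the caller stitches the returned parts together in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical example class illustrating the Callable-based approach.
class CallableShuffle {
    /** Returns a Fisher-Yates shuffled copy; the input array is left untouched. */
    static int[] shuffledCopy(int[] part) {
        int[] copy = part.clone();
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = copy.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1); // 0 <= j <= i
            int tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    /** Shuffles 'parts' slices independently and returns them merged into a new array. */
    static int[] shuffleInParts(int[] array, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (int k = 0; k < parts; k++) {
                int from = (int) ((long) array.length * k / parts);
                int to = (int) ((long) array.length * (k + 1) / parts);
                int[] slice = Arrays.copyOfRange(array, from, to);
                Callable<int[]> task = () -> shuffledCopy(slice);
                futures.add(pool.submit(task));
            }
            int[] result = new int[array.length];
            int pos = 0;
            for (Future<int[]> f : futures) {
                int[] part = f.get(); // get() also makes the task's writes visible here
                System.arraycopy(part, 0, result, pos, part.length);
                pos += part.length;
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] sorted = new int[20];
        for (int i = 0; i < sorted.length; i++) sorted[i] = i;
        System.out.println(Arrays.toString(shuffleInParts(sorted, 2)));
    }
}
```

The copying costs extra memory and time compared with shuffling in place, which is the price paid for tasks that share no mutable state.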
Well, they can't BOTH be accessing the same array and if you use a lock, or a mutex or any other synchronizing mechanism, you kinda lose the power of the threads (since one will have to wait for another, either to finish the shuffling or finish a bit of the shuffling).
Why don't you just divide the array in half, give each thread its bit of the array and then merge the two arrays?