I have a program that sums the elements in a very large array. I want to parallelize this sum.
#define N some_very_large_no // say 1e12
float x[N]; // read from a file
float sum = 0.0;
int main()
{
    for (long i = 0; i < N; i++)
        sum = sum + x[i];
    return 0;
}
How can I parallelize this sum using threads (C/C++/Java, any code example is fine)? How many threads should I use for optimal performance if there are 8 cores in the machine?
EDIT: N may be really large (larger than 1e6, actually) and varies based on the size of the file I read the data from. The file is on the order of GBs.
Edit: N is changed to a large value (1e12 to 1e16)
In Java you can write
int cpus = Runtime.getRuntime().availableProcessors();
// you may want to leave some of these free for other tasks
ExecutorService service = Executors.newFixedThreadPool(cpus);
float[] floats = new float[N];
List<Future<Double>> tasks = new ArrayList<>();
int blockSize = (floats.length + cpus - 1) / cpus;
for (int i = 0; i < cpus; i++) {
    final int start = blockSize * i;
    final int end = Math.min(blockSize * (i + 1), floats.length);
    tasks.add(service.submit(new Callable<Double>() {
        public Double call() {
            double d = 0;
            for (int j = start; j < end; j++)
                d += floats[j];
            return d;
        }
    }));
}
double sum = 0;
// get() can throw InterruptedException/ExecutionException; handle or declare them
for (Future<Double> task : tasks)
    sum += task.get();
As WhozCraig mentions, it is likely that one million floats isn't enough work to need multiple threads, or you could find that your bottleneck is how fast you can load the array from main memory (a single-threaded resource). In any case, you can't assume it will be faster once you include the cost of getting the data.
You say that the array comes from a file. If you time the different parts of the program, you'll find that summing up the elements takes a negligible amount of time compared to how long it takes to read the data from disk. From Amdahl's Law it follows that there is nothing to be gained by parallelising the summing up.
If you need to improve performance, you should focus on improving the I/O throughput.
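For instance, if the floats are stored in a raw binary file, a sketch that memory-maps the file instead of parsing text might look like the following. This is only an illustration: the file name "data.bin" is a placeholder, and the byte order must match however the file was actually written.

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedSum {
    public static void main(String[] args) throws Exception {
        double sum = 0.0; // accumulate in double to limit rounding error
        // "data.bin" is a placeholder: a file of raw little-endian floats
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
             FileChannel channel = raf.getChannel()) {
            long fileSize = channel.size();
            long chunk = 1L << 30; // map at most 1 GB at a time (a single mapping is limited to 2 GB)
            for (long pos = 0; pos < fileSize; pos += chunk) {
                long len = Math.min(chunk, fileSize - pos);
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                buffer.order(ByteOrder.LITTLE_ENDIAN); // must match how the file was written
                while (buffer.remaining() >= Float.BYTES) {
                    sum += buffer.getFloat();
                }
            }
        }
        System.out.println("sum = " + sum);
    }
}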
You can use many threads (even more than there are cores), but the number of threads and the resulting performance depend on your algorithm and how the threads divide the work.
If the array length is, say, 100000, create some number of threads where each thread sums arr[x] to arr[x+limit] from its own starting offset x. You have to choose limit so that the ranges of different threads do not overlap and no element is left out.
thread creation:
pthread_t tid[COUNT];
int start_index[COUNT]; /* illustrative: one starting offset per thread (N elements, COUNT threads) */
int i = 0;
int err;
while (i < COUNT)
{
/* pass a number telling this thread from which index of arr[] it should start */
start_index[i] = i * (N / COUNT);
err = pthread_create(&(tid[i]), NULL, &doSomeThing, &start_index[i]);
if (err != 0)
printf("\ncan't create thread :[%s]", strerror(err));
else
{
//printf("\n Thread created successfully\n");
}
i++;
}
// NOW CALCULATE....
for (int i = 0; i < COUNT; i++)
{
pthread_join(tid[i], NULL);
}
void* doSomeThing(void *arg)
{
int *x = (int *) arg;
// use *x as the starting index: sum arr[*x] up to your limit, making sure
// the range does not overlap with any other thread and no element is skipped
return NULL;
}
Use a divide and conquer algorithm:
Divide the array into 2 or more parts (keep dividing recursively until each sub-array has a manageable size)
Compute the sum of each sub-array (divided array) in a separate thread
Finally, add the sums generated by all the threads for all sub-arrays together to produce the final result (a minimal sketch follows below)
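A minimal sketch of those steps using plain threads; the class name, THRESHOLD value and helper method are illustrative, not part of the answer above:

// Recursively splits the range, sums each half in its own thread,
// and combines the partial results. THRESHOLD is an arbitrary cut-off.
public class DivideAndConquerSum extends Thread {
    static final int THRESHOLD = 100_000;
    private final float[] data;
    private final int from, to;
    private double result;

    DivideAndConquerSum(float[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    public void run() {
        if (to - from <= THRESHOLD) {
            double s = 0;
            for (int i = from; i < to; i++) s += data[i];
            result = s;
        } else {
            int mid = from + (to - from) / 2;
            DivideAndConquerSum left = new DivideAndConquerSum(data, from, mid);
            DivideAndConquerSum right = new DivideAndConquerSum(data, mid, to);
            left.start();
            right.start();
            try {
                left.join();
                right.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result = left.result + right.result;
        }
    }

    public static double sum(float[] data) throws InterruptedException {
        DivideAndConquerSum root = new DivideAndConquerSum(data, 0, data.length);
        root.start();
        root.join();
        return root.result;
    }
}

In practice a thread pool or the Fork/Join framework (shown in a later answer) does this recursive splitting with far less thread-creation overhead.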
As others have said, the time-cost of reading the file is almost certainly going to be much larger than that of calculating the sum. Is it a text file or binary? If the numbers are stored as text, then the cost of reading them can be very high depending on your implementation.
You should also be careful adding a large number of floats. Because of their limited precision, small values late in the array may not contribute to the sum. Think about at least using a double to accumulate the values.
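To see why, here is a tiny, contrived illustration of the difference between a float and a double accumulator (the loop count is arbitrary):

// With a float accumulator, once the running sum reaches 2^24 (16777216),
// adding 1.0f no longer changes it; a double accumulator keeps the precision.
public class PrecisionDemo {
    public static void main(String[] args) {
        float floatSum = 0.0f;
        double doubleSum = 0.0;
        for (int i = 0; i < 20_000_000; i++) {
            floatSum += 1.0f;
            doubleSum += 1.0f;
        }
        System.out.println("float accumulator:  " + floatSum);  // prints 1.6777216E7 (stuck at 2^24)
        System.out.println("double accumulator: " + doubleSum); // prints 2.0E7
    }
}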
You can use pthreads in C to solve your problem.
Here is my code for NUM_THREADS = 5 (you can change it to suit your needs).
To build and run this code, use the following commands:
gcc -pthread test.c -o test
./test
#include<stdio.h>
#include<stdlib.h>
#include<pthread.h>
#define NUM_THREADS 5
pthread_t threads[NUM_THREADS];
pthread_mutex_t mutexsum;
int a[2500];
int sum = 0;
void *do_work(void* parms) {
long tid = (long)parms;
printf("I am thread # %ld\n ", tid);
int start, end, mysum = 0; /* mysum must be initialised before accumulating */
start = (int)tid * 500;
end = start + 500;
printf("Thread # %ld with start = %d and end = %d \n",tid,start,end);
for (int i = start; i < end; i++) {
mysum += a[i];
}
pthread_mutex_lock(&mutexsum);
printf("Thread # %ld lock and sum = %d\n",tid,sum);
sum += mysum;
pthread_mutex_unlock(&mutexsum);
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
int i = 0; int rc;
pthread_attr_t attr;
pthread_mutex_init(&mutexsum, NULL);
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
printf("Initializing array : \n");
for(i=0;i<2500;i++){
a[i]=1;
}
for (i = 0; i < NUM_THREADS; i++) {
printf("Creating thread # %d.\n", i);
rc = pthread_create(&threads[i], &attr, &do_work, (void *)(long)i);
if (rc) {
printf("Error in thread %d with rc = %d. \n", i, rc);
exit(-1);
}
}
pthread_attr_destroy(&attr);
printf("Creating threads complete. start ruun " );
for (i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], NULL);
}
printf("\n\tSum : %d", sum);
pthread_mutex_destroy(&mutexsum);
pthread_exit(NULL);
}
OpenMP supports built-in reduction. Add the -fopenmp flag when compiling.
#include <omp.h>
#define N some_very_large_no // say 1e12
float x[N]; // read from a file
int main()
{
float sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (long i = 0; i < N; i++)
sum=sum+x[i];
return 0;
}
Related
I'm kind of a beginner in parallel computing, so I'm looking at some Java code for threading:
int threadNum = 2;
int middleSum[threadNum];
int size = 2000;
int i = 0;
int sum = 0;
int array[size];
int setThreadSize = size/threadNum;
// this part will be executed by each thread with a different `idThread`
for(i=idThread*setThreadSize;i<(idThread-1)*setThreadSize;i++){
middleSum[idThread] += array[i];
}
wait(); // waiting function that should wait for all threads
// only the first thread (thread 0) will execute this part
if(idThread==0){
for(i=0;i<threadNum;i++){
sum += middleSum[i];
}
}
As far as I can see, the array is divided correctly among the threads, but I'm not sure; what do you think?
I'm having trouble with a multithreaded Java program.
The program consists of a split sum of an array of integers using multiple threads, and then the total sum of the slices.
The problem is that the computing time does not decrease as the number of threads increases (I know that there is a limit on the number of threads beyond which computing gets slower). I expect to see a decrease in execution time before that limit (the benefit of parallel execution). I use the variable fake in the run method to make the times "readable".
public class MainClass {
private final int MAX_THREAD = 8;
private final int ARRAY_SIZE = 1000000;
private int[] array;
private SimpleThread[] threads;
private int numThread = 1;
private int[] sum;
private int start = 0;
private int totalSum = 0;
long begin, end;
int fake;
MainClass() {
fillArray();
for(int i = 0; i < MAX_THREAD; i++) {
threads = new SimpleThread[numThread];
sum = new int[numThread];
begin = (long) System.currentTimeMillis();
for(int j = 0 ; j < numThread; j++) {
threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
threads[j].start();
start+= ARRAY_SIZE/numThread;
}
for(int k = 0; k < numThread; k++) {
try {
threads[k].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
end = (long) System.currentTimeMillis();
for(int g = 0; g < numThread; g++) {
totalSum+=sum[g];
}
System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
numThread++;
start = 0;
totalSum = 0;
}
}
public static void main(String args[]) {
new MainClass();
}
private void fillArray() {
array = new int[ARRAY_SIZE];
for(int i = 0; i < ARRAY_SIZE; i++)
array[i] = 1;
}
private class SimpleThread extends Thread{
int start;
int size;
int index;
public SimpleThread(int start, int size, int sumIndex) {
this.start = start;
this.size = size;
this.index = sumIndex;
}
public void run() {
for(int i = start; i < start+size; i++)
sum[index]+=array[i];
for(long i = 0; i < 1000000000; i++) {
fake++;
}
}
}
}
Unexpected Result Screenshot
As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation, even if you have enough cores to run all the threads in parallel. And your benchmarking seems to suggest [1] that that point is at around 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read / write an array that is larger than the memory cache, then many / most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this then you can find that the number of main memory requests is too much for the memory system to keep up .... and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem even if you can "factor out" the thread creation overheads
[1] I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
Why is the sum sometimes wrong?
Because ARRAY_SIZE/numThread may have a fractional part (e.g. 1000000/3 = 333333.33...), which gets rounded down, so some trailing elements are never assigned to any thread and the sum may be less than 1,000,000 depending on the divisor.
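A sketch of one way to avoid losing the remainder, reusing the SimpleThread constructor and fields from the question (the blockSize/from/to names are mine):

int blockSize = ARRAY_SIZE / numThread;
for (int j = 0; j < numThread; j++) {
    int from = j * blockSize;
    // the last thread also takes the ARRAY_SIZE % numThread leftover elements
    int to = (j == numThread - 1) ? ARRAY_SIZE : from + blockSize;
    threads[j] = new SimpleThread(from, to - from, j);
    threads[j].start();
}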
Why does the time taken increase as the number of threads increases?
Because in the run method of each thread you do this:
for(long i = 0; i < 1000000000; i++) {
fake++;
}
which I do not understand from your question:
"I use the variable fake in run method to make time 'readable'."
Whatever that means, every thread needs to increment your fake variable 1,000,000,000 times.
As a side note, for what you're trying to do there is the Fork/Join framework. It allows you to easily split tasks recursively and implements an algorithm which will distribute your workload automatically.
There is a guide available here; its example is very similar to your case, which boils down to a RecursiveTask like this:
class Adder extends RecursiveTask<Integer>
{
private int[] toAdd;
private int from;
private int to;
/** Add the numbers in the given array */
public Adder(int[] toAdd)
{
this(toAdd, 0, toAdd.length);
}
/** Add the numbers in the given array between the given indices;
internal constructor to split work */
private Adder(int[] toAdd, int fromIndex, int upToIndex)
{
this.toAdd = toAdd;
this.from = fromIndex;
this.to = upToIndex;
}
/** This is the work method */
@Override
protected Integer compute()
{
int amount = to - from;
int result = 0;
if (amount < 500)
{
// base case: add ints and return the result
for (int i = from; i < to; i++)
{
result += toAdd[i];
}
}
else
{
// array too large: split it into two parts and distribute the actual adding
int newEndIndex = from + (amount / 2);
Collection<Adder> invokeAll = invokeAll(Arrays.asList(
new Adder(toAdd, from, newEndIndex),
new Adder(toAdd, newEndIndex, to)));
for (Adder a : invokeAll)
{
result += a.join(); // invokeAll already ran the tasks; join() just retrieves each result
}
}
return result;
}
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);
Starting threads is heavy, and you'll only see the benefit of it for large tasks that don't compete for the same resources (none of which applies here).
I am trying to solve a problem on Codeforces and I get a Time limit exceeded judgment. The only time-consuming operation is calculating the sum of a big array. So I've tried to optimize it, but with no result.
What I want: Optimize the next function:
//array could be Integer.MAX_VALUE length
private long canonicalSum(int[] array) {
int sum = 0;
for (int i = 0; i < array.length; i++)
sum += array[i];
return sum;
}
Question1 [main]: Is it possible to optimize canonicalSum?
I've tried to avoid operations on very big numbers, so I decided to use auxiliary data. For instance, I convert array1[100] to array2[10], where array2[i] = array1[10*i] + array1[10*i+1] + ... + array1[10*i+9].
private long optimizedSum(int[] array, int step) {
do {
array = sumItr(array, step);
} while (array.length != 1);
return array[0];
}
private int[] sumItr(int[] array, int step) {
int length = array.length / step + 1;
boolean needCompensation = (array.length % step == 0) ? false : true;
int aux[] = new int[length];
for (int i = 0, auxSum = 0, auxPointer = 0; i < array.length; i++) {
auxSum += array[i];
if ((i + 1) % step == 0) {
aux[auxPointer++] = auxSum;
auxSum = 0;
}
if (i == array.length - 1 && needCompensation) {
aux[auxPointer++] = auxSum;
}
}
return aux;
}
Problem: But it appears that canonicalSum is ten times faster than optimizedSum. Here my test:
@Test
public void sum_comparison() {
final int ARRAY_SIZE = 100000000;
final int STEP = 1000;
int[] array = genRandomArray(ARRAY_SIZE);
System.out.println("Start canonical Sum");
long beg1 = System.nanoTime();
long sum1 = canonicalSum(array);
long end1 = System.nanoTime();
long time1 = end1 - beg1;
System.out.println("canon:" + TimeUnit.MILLISECONDS.convert(time1, TimeUnit.NANOSECONDS) + "milliseconds");
System.out.println("Start optimizedSum");
long beg2 = System.nanoTime();
long sum2 = optimizedSum(array, STEP);
long end2 = System.nanoTime();
long time2 = end2 - beg2;
System.out.println("custom:" + TimeUnit.MILLISECONDS.convert(time2, TimeUnit.NANOSECONDS) + "milliseconds");
assertEquals(sum1, sum2);
assertTrue(time2 <= time1);
}
private int[] genRandomArray(int size) {
int[] array = new int[size];
Random random = new Random();
for (int i = 0; i < array.length; i++) {
array[i] = random.nextInt();
}
return array;
}
Question2: Why does optimizedSum work more slowly than canonicalSum?
As of Java 9, vectorisation of this operation has been implemented but disabled, based on benchmarks measuring the all-in cost of the code plus its compilation. Depending on your processor, this leads to the relatively entertaining result that if you introduce artificial complications into your reduction loop, you can trigger autovectorisation and get a quicker result! So the fastest code, for now, assuming numbers small enough not to overflow, is:
public int sum(int[] data) {
int value = 0;
for (int i = 0; i < data.length; ++i) {
value += 2 * data[i];
}
return value / 2;
}
This isn't intended as a recommendation! This is more to illustrate that the speed of your code in Java is dependent on the JIT, its trade-offs, and its bugs/features in any given release. Writing cute code to optimise problems like this is at best vain and will put a shelf life on the code you write. For instance, had you manually unrolled a loop to optimise for an older version of Java, your code would be much slower in Java 8 or 9 because this decision would completely disable autovectorisation. You'd better really need that performance to do it.
Question1 [main]: Is it possible to optimize canonicalSum?
Yes, it is. But I have no idea by what factor.
Some things you can do are:
use the parallel pipelines introduced in Java 8. The processor has instructions for doing a parallel sum of two arrays (and more). This can be observed in Octave: when you sum two vectors with ".+" (parallel addition) or "+", it is way faster than using a loop. (A minimal parallel-stream sketch is shown after the loop-unrolling example below.)
use multithreading. You could use a divide and conquer algorithm. Maybe like this:
divide the array into 2 or more
keep dividing recursively until you get an array with manageable size for a thread.
start computing the sum for the sub arrays (divided arrays) with separate threads.
finally add the sum generated (from all the threads) for all sub arrays together to produce final result
maybe unrolling the loop would help a bit, too. By loop unrolling I mean reducing the number of iterations the loop has to make by doing more operations per iteration manually.
An example from http://en.wikipedia.org/wiki/Loop_unwinding :
for (int x = 0; x < 100; x++)
{
delete(x);
}
becomes
for (int x = 0; x < 100; x+=5)
{
delete(x);
delete(x+1);
delete(x+2);
delete(x+3);
delete(x+4);
}
but as mentioned this must be done with caution and with profiling, since the JIT will probably do this kind of optimization itself.
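Regarding the first point above, a minimal sketch of summing with a Java 8 parallel stream (the class name, array size and fill value are arbitrary, just for illustration):

import java.util.Arrays;

public class ParallelStreamSum {
    public static void main(String[] args) {
        int[] array = new int[100_000_000];
        Arrays.fill(array, 1);
        // The stream library splits the array into chunks, sums them on the
        // common ForkJoinPool, and combines the partial results.
        long sum = Arrays.stream(array).parallel().asLongStream().sum();
        System.out.println(sum); // 100000000
    }
}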
An implementation of mathematical operations using the multithreaded approach can be seen here.
An example implementation with the Fork/Join framework introduced in Java 7, which basically does what the divide and conquer algorithm above describes, would be:
public class ForkJoinCalculator extends RecursiveTask<Double> {
public static final long THRESHOLD = 1_000_000;
private final SequentialCalculator sequentialCalculator;
private final double[] numbers;
private final int start;
private final int end;
public ForkJoinCalculator(double[] numbers, SequentialCalculator sequentialCalculator) {
this(numbers, 0, numbers.length, sequentialCalculator);
}
private ForkJoinCalculator(double[] numbers, int start, int end, SequentialCalculator sequentialCalculator) {
this.numbers = numbers;
this.start = start;
this.end = end;
this.sequentialCalculator = sequentialCalculator;
}
@Override
protected Double compute() {
int length = end - start;
if (length <= THRESHOLD) {
return sequentialCalculator.computeSequentially(numbers, start, end);
}
ForkJoinCalculator leftTask = new ForkJoinCalculator(numbers, start, start + length/2, sequentialCalculator);
leftTask.fork();
ForkJoinCalculator rightTask = new ForkJoinCalculator(numbers, start + length/2, end, sequentialCalculator);
Double rightResult = rightTask.compute();
Double leftResult = leftTask.join();
return leftResult + rightResult;
}
}
Here we develop a RecursiveTask that splits an array of doubles until the length of a subarray falls below a given threshold. At that point the subarray is processed sequentially, applying to it the operation defined by the following interface:
public interface SequentialCalculator {
double computeSequentially(double[] numbers, int start, int end);
}
And the usage example:
public static double varianceForkJoin(double[] population){
final ForkJoinPool forkJoinPool = new ForkJoinPool();
double total = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
@Override
public double computeSequentially(double[] numbers, int start, int end) {
double total = 0;
for (int i = start; i < end; i++) {
total += numbers[i];
}
return total;
}
}));
final double average = total / population.length;
double variance = forkJoinPool.invoke(new ForkJoinCalculator(population, new SequentialCalculator() {
@Override
public double computeSequentially(double[] numbers, int start, int end) {
double variance = 0;
for (int i = start; i < end; i++) {
variance += (numbers[i] - average) * (numbers[i] - average);
}
return variance;
}
}));
return variance / population.length;
}
If you want to add N numbers then the runtime is O(N), so in that respect your canonicalSum cannot be "optimized".
What you can do to reduce the runtime is make the summation parallel: break the array into parts, pass them to separate threads, and at the end sum the results returned by each thread.
Update: This implies a multicore system, but there is a Java API to get the number of cores.
Introduction:
Using two identical mergesort algorithms, I tested the execution speed of C++ (using Visual Studio C++ 2010 Express) vs Java (using NetBeans 7.0). I conjectured that the C++ execution would be at least slightly faster, but testing revealed that the C++ execution was 4-10 times slower than the Java execution. I believe that I have enabled all the speed optimisations for C++, and I am building a release rather than a debug. Why is this speed discrepancy occurring?
Code:
Java:
public class PerformanceTest1
{
/**
* Sorts the array using a merge sort algorithm
* @param array The array to be sorted
* @return The sorted array
*/
public static void sort(double[] array)
{
if(array.length > 1)
{
int centre;
double[] left;
double[] right;
int arrayPointer = 0;
int leftPointer = 0;
int rightPointer = 0;
centre = (int)Math.floor((array.length) / 2.0);
left = new double[centre];
right = new double[array.length - centre];
System.arraycopy(array,0,left,0,left.length);
System.arraycopy(array,centre,right,0,right.length);
sort(left);
sort(right);
while((leftPointer < left.length) && (rightPointer < right.length))
{
if(left[leftPointer] <= right[rightPointer])
{
array[arrayPointer] = left[leftPointer];
leftPointer += 1;
}
else
{
array[arrayPointer] = right[rightPointer];
rightPointer += 1;
}
arrayPointer += 1;
}
if(leftPointer < left.length)
{
System.arraycopy(left,leftPointer,array,arrayPointer,array.length - arrayPointer);
}
else if(rightPointer < right.length)
{
System.arraycopy(right,rightPointer,array,arrayPointer,array.length - arrayPointer);
}
}
}
public static void main(String args[])
{
//Number of elements to sort
int arraySize = 1000000;
//Create the variables for timing
double start;
double end;
double duration;
//Build array
double[] data = new double[arraySize];
for(int i = 0;i < data.length;i += 1)
{
data[i] = Math.round(Math.random() * 10000);
}
//Run performance test
start = System.nanoTime();
sort(data);
end = System.nanoTime();
//Output performance results
duration = (end - start) / 1E9;
System.out.println("Duration: " + duration);
}
}
C++:
#include <iostream>
#include <windows.h>
using namespace std;
//Mergesort
void sort1(double *data,int size)
{
if(size > 1)
{
int centre;
double *left;
int leftSize;
double *right;
int rightSize;
int dataPointer = 0;
int leftPointer = 0;
int rightPointer = 0;
centre = (int)floor((size) / 2.0);
leftSize = centre;
left = new double[leftSize];
for(int i = 0;i < leftSize;i += 1)
{
left[i] = data[i];
}
rightSize = size - leftSize;
right = new double[rightSize];
for(int i = leftSize;i < size;i += 1)
{
right[i - leftSize] = data[i];
}
sort1(left,leftSize);
sort1(right,rightSize);
while((leftPointer < leftSize) && (rightPointer < rightSize))
{
if(left[leftPointer] <= right[rightPointer])
{
data[dataPointer] = left[leftPointer];
leftPointer += 1;
}
else
{
data[dataPointer] = right[rightPointer];
rightPointer += 1;
}
dataPointer += 1;
}
if(leftPointer < leftSize)
{
for(int i = dataPointer;i < size;i += 1)
{
data[i] = left[leftPointer++];
}
}
else if(rightPointer < rightSize)
{
for(int i = dataPointer;i < size;i += 1)
{
data[i] = right[rightPointer++];
}
}
delete left;
delete right;
}
}
void main()
{
//Number of elements to sort
int arraySize = 1000000;
//Create the variables for timing
LARGE_INTEGER start; //Starting time
LARGE_INTEGER end; //Ending time
LARGE_INTEGER freq; //Rate of time update
double duration; //end - start
QueryPerformanceFrequency(&freq); //Determinine the frequency of the performance counter (high precision system timer)
//Build array
double *temp2 = new double[arraySize];
QueryPerformanceCounter(&start);
srand((int)start.QuadPart);
for(int i = 0;i < arraySize;i += 1)
{
double randVal = rand() % 10000;
temp2[i] = randVal;
}
//Run performance test
QueryPerformanceCounter(&start);
sort1(temp2,arraySize);
QueryPerformanceCounter(&end);
delete temp2;
//Output performance test results
duration = (double)(end.QuadPart - start.QuadPart) / (double)(freq.QuadPart);
cout << "Duration: " << duration << endl;
//Dramatic pause
system("pause");
}
Observations:
For 10000 elements, the C++ execution takes roughly 4 times the amount of time as the Java execution.
For 100000 elements, the ratio is about 7:1.
For 10000000 elements, the ratio is about 10:1.
For over 10000000, the Java execution completes, but the C++ execution stalls, and I have to manually kill the process.
I think there might be a mistake in the way you ran the program. When you hit F5 in Visual C++ Express, the program runs under the debugger and will be a LOT slower. In other editions of Visual C++ 2010 (e.g. Ultimate, which I use), try hitting CTRL+F5 (i.e. Start without Debugging), or try running the executable file itself (in Express), and you'll see the difference.
I ran your program with only one modification on my machine (I added delete[] left; delete[] right; to get rid of the memory leak; otherwise it would have run out of memory in 32-bit mode). I have an i7 950. To be fair, I also passed the same array to Arrays.sort() in Java and to std::sort in C++. I used an array size of 10,000,000.
Here are the results (time in seconds):

Java code: 7.13
Java Arrays.sort: 0.93

C++ code (32 bits): 3.57
C++ std::sort (32 bits): 0.81

C++ code (64 bits): 2.77
C++ std::sort (64 bits): 0.76
So the C++ code is much faster, and even the standard library, which is highly tuned in both Java and C++, shows a slight advantage for C++.
Edit: I just realized in your original test, you run the C++ code in the debug mode. You should switch to the Release mode AND run it outside the debugger (as I explained in my post) to get a fair result.
I don't program C++ professionally (or even unprofessionally :) but I notice that you are allocating an array of doubles on the heap (double *temp2 = new double[arraySize];). This is expensive compared to Java initialisation, but more importantly it constitutes a memory leak if it is never deleted, which could explain why your C++ implementation stalls: it has basically run out of memory.
To start with did you try using std::sort (or std::stable_sort which is typically mergesort) to get a baseline performance in C++?
I can't comment on the Java code but for the C++ code:
Unlike Java, new in C++ requires manual intervention to free the memory. On every recursion you'll be leaking memory. I would suggest using std::vector instead, as it manages all the memory for you, and its iterator-range constructor will even do the copy (and is possibly optimized better than your for loop). This is almost certainly the cause of your performance difference.
You use arraycopy in Java but don't use the library facility (std::copy) in C++ although again this wouldn't matter if you used vector.
Nit: Declare and initialize your variables on the same line, at the point you first need them, not all at the top of the function.
If you're allowed to use parts of the standard library, std::merge could replace your merge algorithm.
EDIT: If you really are using, say, delete left; to clean up memory, that's probably your problem. The correct syntax would be delete [] left; to deallocate an array.
Your version was leaking so much memory that the timings were meaningless.
I am sure the time was spent thrashing the memory allocator.
Rewrite it to use standard C++ objects for memory management (std::vector) and see what happens.
Personally I would still expect the Java version to win (just), because the JIT allows machine-specific optimizations, while C++, although it can do machine-specific optimizations, will generally only do generic architecture optimizations (unless you provide the exact architecture flags).
Note: Don't forget to compile with optimizations turned on.
Just cleaning up your C++:
I have not tried to make a good merge sort, just rewritten yours in a more C++ style:
void sort1(std::vector<double>& data)
{
if(data.size() > 1)
{
std::size_t centre = data.size() / 2;
std::size_t lftSize = centre;
std::size_t rhtSize = data.size() - lftSize;
// Why are we allocating new arrays here??
// Is the whole point of the merge sort to do it in place?
// I forget, but I think you need to go look at a Knuth book.
//
std::vector<double> lft(data.begin(), data.begin() + lftSize);
std::vector<double> rht(data.begin() + lftSize, data.end());
sort1(lft);
sort1(rht);
std::size_t dataPointer = 0;
std::size_t lftPointer = 0;
std::size_t rhtPointer = 0;
while((lftPointer < lftSize) && (rhtPointer < rhtSize))
{
data[dataPointer++] = (lft[lftPointer] <= rht[rhtPointer])
? lft[lftPointer++]
: rht[rhtPointer++];
}
std::copy(lft.begin() + lftPointer, lft.end(), &data[dataPointer]);
std::copy(rht.begin() + rhtPointer, rht.end(), &data[dataPointer]);
}
}
Thinking about merge sort, I would try this:
I have not tested it, so it may not work correctly. But it is an attempt to not keep allocating huge amounts of memory to do the sort. Instead it uses a single temp area and copies the result back when the sort is done.
void mergeSort(double* begin, double* end, double* tmp)
{
if (end - begin <= 1)
{ return;
}
std::size_t size = end - begin;
double* middle = begin + (size / 2);
mergeSort(begin, middle, tmp);
mergeSort(middle, end, tmp);
double* lft = begin;
double* rht = middle;
double* dst = tmp;
while((lft < middle) && (rht < end))
{
*dst++ = (*lft < *rht)
? *lft++
: *rht++;
}
std::size_t count = dst - tmp;
memcpy(begin, tmp, sizeof(double) * count);
memcpy(begin + count, lft, sizeof(double) * (middle - lft));
memcpy(begin + count, rht, sizeof(double) * (end - rht));
}
void sort2(std::vector<double>& data)
{
double* left = &data[0];
double* right = &data[data.size()];
std::vector<double> tmp(data.size());
mergeSort(left,right, &tmp[0]);
}
A couple of things.
Java is highly optimized, and after the code has executed once the JIT compiler then executes it as native code.
Your System.arraycopy in Java is going to execute much faster than copying each element one at a time. Try replacing the element-by-element copy in C++ with a memcpy and you will see that it is much faster.
EDIT:
Look at this post: C++ performance vs. Java/C#
It is hard to tell from just looking at your code, but I would hazard a guess that the reason lies in the handling of recursion rather than the actual computations. Try using a sorting algorithm that relies on iteration instead of recursion and share the results of the performance comparison.
I don't know why Java is so much faster here.
I compared it with the built in Arrays.sort() and it was 4x faster again. (It doesn't create any objects).
Usually if there is a test where Java is much faster, it's because Java is much better at removing code which doesn't do anything.
Perhaps you could use memcpy rather than a loop at the end.
Try to make a global vector as a buffer, and try not to allocate a lot of memory.
This will run faster than your code because it uses some tricks (it uses only one buffer, and the memory is allocated when the program starts, so it will not be fragmented):
#include <cstdio>
#define N 500001
int a[N];
int x[N];
int n;
void merge (int a[], int l, int r)
{
int m = (l + r) / 2;
int i, j, k = l - 1;
for (i = l, j = m + 1; i <= m && j <= r;)
if (a[i] < a[j])
x[++k] = a[i++];
else
x[++k] = a[j++];
for (; i <= m; ++i)
x[++k] = a[i];
for (; j <= r; ++j)
x[++k] = a[j];
for (i = l; i <= r; ++i)
a[i] = x[i];
}
void mergeSort (int a[], int l, int r)
{
if (l >= r)
return;
int m = (l + r) / 2;
mergeSort (a, l, m);
mergeSort (a, m + 1, r);
merge (a, l, r);
}
int main ()
{
int i;
freopen ("algsort.in", "r", stdin);
freopen ("algsort.out", "w", stdout);
scanf ("%d\n", &n);
for (i = 1; i <= n; ++i)
scanf ("%d ", &a[i]);
mergeSort (a, 1, n);
for (i = 1; i <= n; ++i)
printf ("%d ", a[i]);
return 0;
}
I'm trying to make a decent Java program that generates the primes from 1 to N (mainly for Project Euler problems).
At the moment, my algorithm is as follows:
Initialise an array of booleans (or a bitarray if N is sufficiently large) so they're all false, and an array of ints to store the primes found.
Set an integer, s equal to the lowest prime, (ie 2)
While s is <= sqrt(N)
Set all multiples of s (starting at s^2) to true in the array/bitarray.
Find the next smallest index in the array/bitarray which is false, use that as the new value of s.
Endwhile.
Go through the array/bitarray, and for every value that is false, put the corresponding index in the primes array.
Now, I've tried skipping over numbers not of the form 6k + 1 or 6k + 5, but that only gives me a ~2x speed-up, whilst I've seen programs run orders of magnitude faster than mine (albeit with very convoluted code), such as the one here.
What can I do to improve?
Edit: Okay, here's my actual code (for N of 1E7):
int l = 10000000, n = 2, sqrt = (int) Math.sqrt(l);
boolean[] nums = new boolean[l + 1];
int[] primes = new int[664579];
while(n <= sqrt){
for(int i = 2 * n; i <= l; nums[i] = true, i += n);
for(n++; nums[n]; n++);
}
for(int i = 2, k = 0; i < nums.length; i++) if(!nums[i]) primes[k++] = i;
Runs in about 350ms on my 2.0GHz machine.
While s is <= sqrt(N)
One mistake people often make in such algorithms is not precomputing the square root.
while (s <= sqrt(N)) {
is much, much slower than
int limit = sqrt(N);
while (s <= limit) {
But generally speaking, Eiko is right in his comment. If you want people to offer low-level optimisations, you have to provide code.
Update: OK, now about your code.
You may notice that the number of iterations in your code is just a little bigger than 'l' (you can put a counter inside the first 'for' loop; it will be just 2-3 times bigger). And, obviously, the complexity of your solution can't be less than O(l) (you can't have fewer than 'l' iterations).
What can make a real difference is accessing memory effectively. Note that the guy who wrote that article tries to reduce storage size not just because he's memory-greedy: making the arrays compact lets you use the cache better and thus increases speed.
I just replaced boolean[] with int[] used as a bit array and achieved an immediate 2x speed gain (and 8x less memory). And I didn't even try to do it efficiently.
Update 2
That's easy. You just replace every assignment a[i] = true with a[i/32] |= 1 << (i%32), and every read of a[i] with (a[i/32] & (1 << (i%32))) != 0. And boolean[] a with int[] a, obviously.
From the first replacement it should be clear how it works: if f(i) is true, then there's a 1 bit in the integer a[i/32], at position i%32 (an int in Java has exactly 32 bits, as you know).
You can go further and replace i/32 with i >> 5, and i%32 with i & 31. You can also precompute 1 << j for each j between 0 and 31 in an array.
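Put together, a small sketch of that packed replacement (the class and method names are mine, not from the answer) might be:

// A sketch of the packed replacement: bit (i & 31) of bits[i >> 5] stands for a[i].
class PackedFlags {
    private final int[] bits;

    PackedFlags(int size) {
        bits = new int[(size >> 5) + 1];
    }

    void set(int i) {                      // replaces a[i] = true
        bits[i >> 5] |= 1 << (i & 31);
    }

    boolean get(int i) {                   // replaces reading a[i]
        return (bits[i >> 5] & (1 << (i & 31))) != 0;
    }
}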
But sadly, I don't think in Java you can get close to C in this. Not to mention that that guy uses many other tricky optimizations, and I agree that his code would have been worth a lot more if he had commented it.
Using a BitSet will use less memory. The Sieve algorithm is rather trivial, so you can simply "set" the bit positions on the BitSet and then iterate to determine the primes.
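For illustration, a minimal BitSet-based sieve along those lines (the class and method names are mine) might look like:

import java.util.BitSet;

public class BitSetSieve {
    // Marks composites in a BitSet, then collects the unset indices as primes.
    public static int[] primesUpTo(int n) {
        BitSet composite = new BitSet(n + 1);
        for (int i = 2; (long) i * i <= n; i++) {
            if (!composite.get(i)) {
                for (int j = i * i; j <= n; j += i) {
                    composite.set(j);
                }
            }
        }
        // every number in [2, n] is either prime or marked composite
        int[] primes = new int[Math.max(0, (n - 1) - composite.cardinality())];
        for (int i = 2, k = 0; i <= n; i++) {
            if (!composite.get(i)) {
                primes[k++] = i;
            }
        }
        return primes;
    }
}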
Did you also make the array smaller while skipping numbers not of the form 6k+1 and 6k+5?
I only tested with ignoring numbers of the form 2k and that gave me ~4x speed up (440 ms -> 120 ms):
int l = 10000000, n = 1, sqrt = (int) Math.sqrt(l);
int m = l/2;
boolean[] nums = new boolean[m + 1];
int[] primes = new int[664579];
int i, k;
while (n <= sqrt) {
int x = (n<<1)+1;
for (i = n+x; i <= m; nums[i] = true, i+=x);
for (n++; nums[n]; n++);
}
primes[0] = 2;
for (i = 1, k = 1; i < nums.length; i++) {
if (!nums[i])
primes[k++] = (i<<1)+1;
}
The following is from my Project Euler library. It's a slight variation of the Sieve of Eratosthenes... I'm not sure, but I think it's called the Euler Sieve.
1) It uses a BitSet (so 1/8th the memory)
2) It only uses the BitSet for odd numbers (another 1/2, hence 1/16th)
Note: The inner loop (for multiples) begins at "n*n" rather than "2*n", and only multiples at increments of "2*n" are crossed off, hence the speed-up.
private void beginSieve(int mLimit)
{
primeList = new BitSet(mLimit>>1);
primeList.set(0,primeList.size(),true);
int sqroot = (int) Math.sqrt(mLimit);
primeList.clear(0);
for(int num = 3; num <= sqroot; num+=2)
{
if( primeList.get(num >> 1) )
{
int inc = num << 1;
for(int factor = num * num; factor < mLimit; factor += inc)
{
//if( ((factor) & 1) == 1)
//{
primeList.clear(factor >> 1);
//}
}
}
}
}
and here's the function to check if a number is prime...
public boolean isPrime(int num)
{
if( num < maxLimit)
{
if( (num & 1) == 0)
return ( num == 2);
else
return primeList.get(num>>1);
}
return false;
}
You could do the step of "putting the corresponding index in the primes array" while you are detecting the primes, saving a pass through the array, but that's about all I can think of right now.
I wrote a simple sieve implementation recently for the fun of it using BitSet (everyone says not to, but it's the best off-the-shelf way to store huge amounts of data efficiently). The performance seems pretty good to me, but I'm still working on improving it.
public class HelloWorld {
private static int LIMIT = 2140000000;//Integer.MAX_VALUE broke things.
private static BitSet marked;
public static void main(String[] args) {
long startTime = System.nanoTime();
init();
sieve();
long estimatedTime = System.nanoTime() - startTime;
System.out.println((float)estimatedTime/1000000000); //23.835363 seconds
System.out.println(marked.size()); //1070000000 ~= 127MB
}
private static void init()
{
double size = LIMIT * 0.5 - 1;
marked = new BitSet();
marked.set(0,(int)size, true);
}
private static void sieve()
{
int i = 0;
int cur = 0;
int add = 0;
int pos = 0;
while(((i<<1)+1)*((i<<1)+1) < LIMIT)
{
pos = i;
if(marked.get(pos++))
{
cur = pos;
add = (cur<<1);
pos += add*cur + cur - 1;
while(pos < marked.length() && pos > 0)
{
marked.clear(pos++);
pos += add;
}
}
i++;
}
}
private static void readPrimes()
{
int pos = 0;
while(pos < marked.length())
{
if(marked.get(pos++))
{
System.out.print((pos<<1)+1);
System.out.print("-");
}
}
}
}
With smaller LIMITs (say 10,000,000 which took 0.077479s) we get much faster results than the OP.
I bet Java's performance is terrible when dealing with bits...
Algorithmically, the link you point out should be sufficient.
Have you tried googling, e.g. for "java prime numbers"? I did, and dug up this simple improvement:
http://www.anyexample.com/programming/java/java_prime_number_check_%28primality_test%29.xml
Surely you can find more on Google.
Here is my code for the Sieve of Eratosthenes, and this is actually the most efficient I could do:
final int MAX = 1000000;
int p[]= new int[MAX];
p[0]=p[1]=1;
int prime[] = new int[MAX/10];
prime[0]=2;
void sieve()
{
int i,j,k=1;
for(i=3;i*i<=MAX;i+=2)
{
if(p[i])
continue;
for(j=i*i;j<MAX;j+=2*i)
p[j]=1;
}
for(i=3;i<MAX;i+=2)
{
if(p[i]==0)
prime[k++]=i;
}
return;
}