My program tries to sum a range using a given number of threads so it can run in parallel, but it seems to run better with just one thread than with four (I have an 8-core CPU). This is my first time working with multithreading in Java, so maybe there is a problem in my code that makes it take longer?
My benchmarks so far (sum of the range 0-10000):
1 thread: 1350 microsecs (average)
2 threads: 1800 microsecs (average)
4 threads: 2400 microsecs (average)
8 threads: 3300 microsecs (average)
Thanks in advance!
/*
Compile: javac RangeSum.java
Execute: java RangeSum nThreads initRange finRange
*/
import java.util.ArrayList;
import java.util.concurrent.*;

public class RangeSum implements Runnable {
    private int init;
    private int end;
    private int id;

    static public int out = 0;
    Object lock = new Object();

    public synchronized static void increment(int partial) {
        out = out + partial;
    }

    public RangeSum(int init, int end) {
        this.init = init;
        this.end = end;
    } // parameters to pass in threads

    // the function called for each thread
    public void run() {
        int partial = 0;
        for (int k = this.init; k < this.end; k++) {
            partial = k + partial + 1;
        }
        increment(partial);
    } // thread: sum its id to the out variable

    public static void main(String args[]) throws InterruptedException {
        final long startTime = System.nanoTime()/1000; // start time: microsecs
        // get command line values
        int NumberOfThreads = Integer.valueOf(args[0]);
        int initRange = Integer.valueOf(args[1]);
        int finRange = Integer.valueOf(args[2]);
        //int[] out = new int[NumberOfThreads];
        // an array of threads
        ArrayList<Thread> Threads = new ArrayList<Thread>(NumberOfThreads);
        // spawn the threads / CREATE
        for (int i = 0; i < NumberOfThreads; i++) {
            int initial = i*finRange/NumberOfThreads;
            int end = (i+1)*finRange/NumberOfThreads;
            Threads.add(i, new Thread(new RangeSum(initial,end)));
            Threads.get(i).start();
        }
        // wait for the threads to finish / JOIN
        for (int i = 0; i < NumberOfThreads; i++) {
            try {
                Threads.get(i).join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        System.out.println("All threads finished!");
        System.out.println("Total range sum: " + out);
        final long endTime = System.nanoTime()/1000; // end time
        System.out.println("Time elapsed: " + (endTime - startTime));
    }
}
Your workload is entirely in-memory, non-blocking computation. As a general principle, in this kind of scenario a single thread will often complete the work faster than multiple threads: the extra threads tend to interfere with the L1/L2 CPU caches and incur additional context-switching overhead.
Specifically, with regard to your code, you initialize final long startTime = System.nanoTime()/1000; too early, so you measure the thread setup time as well as the time the threads actually take to complete. It's probably better to build your Threads list first and then:
final long startTime = ...
for (int i = 0; i < NumberOfThreads; i++) {
    Threads.get(i).start();
}
But really, in this case, the expectation that multiple threads will improve processing time is not warranted.
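For illustration, here is a minimal sketch (my own, not the poster's code) of the same main body restructured so that only start() through join() is timed; names follow the question's RangeSum class:

// Sketch only: assumes the RangeSum class from the question and that the
// surrounding method declares throws InterruptedException.
ArrayList<Thread> threads = new ArrayList<Thread>(NumberOfThreads);
for (int i = 0; i < NumberOfThreads; i++) {
    int initial = i * finRange / NumberOfThreads;
    int end = (i + 1) * finRange / NumberOfThreads;
    threads.add(new Thread(new RangeSum(initial, end))); // creation is NOT timed
}

final long startTime = System.nanoTime() / 1000;         // start the clock here
for (Thread t : threads) t.start();
for (Thread t : threads) t.join();
final long endTime = System.nanoTime() / 1000;

System.out.println("Total range sum: " + RangeSum.out);
System.out.println("Time elapsed: " + (endTime - startTime));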
I'm having trouble with a multithreaded Java program.
The program splits the sum of an array of integers across multiple threads and then totals the partial sums of the slices.
The problem is that the computing time does not decrease as the number of threads increases (I know there is a limit beyond which adding threads makes the computation slower than using fewer threads). I expected to see execution time drop before that limit is reached (the benefit of parallel execution). I use the variable fake in the run method to make the times "readable".
public class MainClass {
    private final int MAX_THREAD = 8;
    private final int ARRAY_SIZE = 1000000;
    private int[] array;
    private SimpleThread[] threads;
    private int numThread = 1;
    private int[] sum;
    private int start = 0;
    private int totalSum = 0;
    long begin, end;
    int fake;

    MainClass() {
        fillArray();
        for(int i = 0; i < MAX_THREAD; i++) {
            threads = new SimpleThread[numThread];
            sum = new int[numThread];
            begin = (long) System.currentTimeMillis();
            for(int j = 0 ; j < numThread; j++) {
                threads[j] = new SimpleThread(start, ARRAY_SIZE/numThread, j);
                threads[j].start();
                start += ARRAY_SIZE/numThread;
            }
            for(int k = 0; k < numThread; k++) {
                try {
                    threads[k].join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            end = (long) System.currentTimeMillis();
            for(int g = 0; g < numThread; g++) {
                totalSum += sum[g];
            }
            System.out.printf("Result with %d thread-- Sum = %d Time = %d\n", numThread, totalSum, end-begin);
            numThread++;
            start = 0;
            totalSum = 0;
        }
    }

    public static void main(String args[]) {
        new MainClass();
    }

    private void fillArray() {
        array = new int[ARRAY_SIZE];
        for(int i = 0; i < ARRAY_SIZE; i++)
            array[i] = 1;
    }

    private class SimpleThread extends Thread {
        int start;
        int size;
        int index;

        public SimpleThread(int start, int size, int sumIndex) {
            this.start = start;
            this.size = size;
            this.index = sumIndex;
        }

        public void run() {
            for(int i = start; i < start+size; i++)
                sum[index] += array[i];
            for(long i = 0; i < 1000000000; i++) {
                fake++;
            }
        }
    }
}
(Screenshot of the unexpected results not reproduced here.)
As a general rule, you won't get a speedup from multi-threading if the "work" performed by each thread is less than the overheads of using the threads.
One of the overheads is the cost of starting a new thread. This is surprisingly high. Each time you start a thread the JVM needs to perform syscalls to allocate the thread stack memory segment and the "red zone" memory segment, and initialize them. (The default thread stack size is typically 500KB or 1MB.) Then there are further syscalls to create the native thread and schedule it.
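To get a rough feel for that cost on your own machine, a tiny sketch like this (mine, not from the answer) can help; it simply times start() plus join() of a thread that does nothing:

// Rough sketch: measures how long it takes to start and join one idle thread.
public class ThreadStartCost {
    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        Thread t = new Thread(() -> { /* no work at all */ });
        t.start();
        t.join();
        long t1 = System.nanoTime();
        System.out.println("start()+join() of an empty thread: "
                + (t1 - t0) / 1000 + " microseconds");
    }
}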
In this example, you have 1,000,000 elements to sum and you divide this work among N threads. As N increases, the amount of work performed by each thread decreases.
It is not hard to see that the time taken to sum 1,000,000 elements is going to be less than the time needed to start 4 threads ... just based on counting the memory read and write operations. Then you need to take into account that the child threads are created one at a time by the parent thread.
If you do the analysis completely, it is clear that there is a point at which adding more threads actually slows down the computation, even if you have enough cores to run all threads in parallel. And your benchmarking seems to suggest [1] that that point is at around 2 threads.
By the way, there is a second reason why you may not get as much speedup as you expect for a benchmark like this one. The "work" that each thread is doing is basically scanning a large array. Reading and writing arrays will generate requests to the memory system. Ideally, these requests will be satisfied by the (fast) on-chip memory caches. However, if you try to read/write an array that is larger than the memory cache, then many/most of those requests turn into (slow) main memory requests. Worse still, if you have N cores all doing this, you can find that the number of main memory requests is too much for the memory system to keep up with ... and the threads slow down.
The bottom line is that multi-threading does not automatically make an application faster, and it certainly won't if you do it the wrong way.
In your example:
the amount of work per thread is too small compared with the overheads of creating and starting threads, and
memory bandwidth effects are likely to be a problem even if you can "factor out" the thread creation overheads.
[1] I don't understand the point of the "fake" computation. It probably invalidates the benchmark, though it is possible that the JIT compiler optimizes it away.
Why is the sum sometimes wrong?
Because ARRAY_SIZE/numThread may have a fractional part (e.g. 1000000/3 = 333333.33...), which gets rounded down in integer division, so the start variable skips some elements and the sum may be less than 1000000, depending on the divisor.
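A hedged sketch of one way to avoid losing the remainder (this is my suggestion, not code from the question): give the last thread whatever is left over.

// Sketch only: same SimpleThread as in the question; the last slice absorbs
// the remainder of the integer division ARRAY_SIZE / numThread.
int chunk = ARRAY_SIZE / numThread;
int start = 0;
for (int j = 0; j < numThread; j++) {
    int size = (j == numThread - 1) ? ARRAY_SIZE - start : chunk;
    threads[j] = new SimpleThread(start, size, j);
    threads[j].start();
    start += size;
}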
Why does the time taken increase as the number of threads increases?
Because in the run method of each thread you do this:
for(long i = 0; i < 1000000000; i++) {
    fake++;
}
I do not understand what you mean in your question by:
"I use the variable fake in run method to make time 'readable'."
But every thread has to increment your fake variable 1,000,000,000 times.
As a side note: for what you're trying to do there is the Fork/Join framework. It allows you to easily split tasks recursively and implements an algorithm that distributes your workload automatically.
There is a guide available here; its example is very similar to your case, which boils down to a RecursiveTask like this:
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.RecursiveTask;

class Adder extends RecursiveTask<Integer>
{
    private int[] toAdd;
    private int from;
    private int to;

    /** Add the numbers in the given array */
    public Adder(int[] toAdd)
    {
        this(toAdd, 0, toAdd.length);
    }

    /** Add the numbers in the given array between the given indices;
        internal constructor to split work */
    private Adder(int[] toAdd, int fromIndex, int upToIndex)
    {
        this.toAdd = toAdd;
        this.from = fromIndex;
        this.to = upToIndex;
    }

    /** This is the work method */
    @Override
    protected Integer compute()
    {
        int amount = to - from;
        int result = 0;
        if (amount < 500)
        {
            // base case: add ints and return the result
            for (int i = from; i < to; i++)
            {
                result += toAdd[i];
            }
        }
        else
        {
            // array too large: split it into two parts and distribute the actual adding
            int newEndIndex = from + (amount / 2);
            Collection<Adder> invokeAll = invokeAll(Arrays.asList(
                    new Adder(toAdd, from, newEndIndex),
                    new Adder(toAdd, newEndIndex, to)));
            for (Adder a : invokeAll)
            {
                result += a.invoke();
            }
        }
        return result;
    }
}
To actually run this, you can use
RecursiveTask<Integer> adder = new Adder(fillArray(ARRAY_LENGTH));
int result = ForkJoinPool.commonPool().invoke(adder);
Starting threads is heavyweight, and you'll only see a benefit from it on large chunks of work that don't compete for the same resources (neither of which applies here).
I wanted to learn parallel programming for speeding up algorithms and chose Java.
I wrote two functions for summing the long integers in an array: the first simply iterates through the array; the second divides the array into parts and sums the parts in separate threads.
I expected roughly a 2x speedup using two threads. However, what I got is only a 24% speedup. Moreover, using more threads I don't get any further improvement (maybe less than 1%) over two threads. I know that there should be some thread creation/joining overhead, but I guess it shouldn't be that big.
Can you please explain what I'm missing, or where the error in my code is?
Here is the code:
import java.util.concurrent.ThreadLocalRandom;

public class ParallelTest {
    public static long sum1(long[] num, int a, int b) {
        long r = 0;
        while (a < b) {
            r += num[a];
            ++a;
        }
        return r;
    }

    public static class SumThread extends Thread {
        private long num[];
        private long r;
        private int a, b;

        public SumThread(long[] num, int a, int b) {
            super();
            this.num = num;
            this.a = a;
            this.b = b;
        }

        @Override
        public void run() {
            r = ParallelTest.sum1(num, a, b);
        }

        public long getSum() {
            return r;
        }
    }

    public static long sum2(long[] num, int a, int b, int threadCnt) throws InterruptedException {
        SumThread[] th = new SumThread[threadCnt];
        int i = 0, c = (b - a + threadCnt - 1) / threadCnt;
        for (;;) {
            int a2 = a + c;
            if (a2 > b) {
                a2 = b;
            }
            th[i] = new SumThread(num, a, a2);
            th[i].start();
            if (a2 == b) {
                break;
            }
            a = a2;
            ++i;
        }
        for (i = 0; i < threadCnt; ++i) {
            th[i].join();
        }
        long r = 0;
        for (i = 0; i < threadCnt; ++i) {
            r += th[i].getSum();
        }
        return r;
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 230000000;
        long[] num = new long[N];
        for (int i = 0; i < N; ++i) {
            num[i] = ThreadLocalRandom.current().nextLong(1, 9999);
        }
        // System.out.println(Runtime.getRuntime().availableProcessors());
        long timestamp = System.nanoTime();
        System.out.println(sum1(num, 0, num.length));
        System.out.println(System.nanoTime() - timestamp);
        for (int n = 2; n <= 4; ++n) {
            timestamp = System.nanoTime();
            System.out.println(sum2(num, 0, num.length, n));
            System.out.println(System.nanoTime() - timestamp);
        }
    }
}
EDIT: I have an i7 processor with 4 cores (8 hardware threads).
Output given by code is:
1149914787860
175689196
1149914787860
149224086
1149914787860
147709988
1149914787860
138243999
The program is probably limited by main memory bandwidth with just two threads: the loop body is small and fetches data almost as fast as RAM can supply it to the processor.
I can think of a number of reasons why you might not get as much speedup as you are expecting.
Thread creation overheads are substantial. Thread start() is an expensive operation, entailing multiple syscalls to allocate a thread stack and its "red zone" and then create the native thread.
The N threads will not all start at the same time. That means the time to complete the parallel part of the computation will be approximately the end-time of the last thread minus the start-time of the first thread. That will be longer than the time one thread takes to do its own part of the work (by up to N-1 times the thread creation time).
The N threads are (basically) doing a serial scan of N disjoint sections of the array. This is memory-bandwidth intensive, and the way that you are scanning means that the memory caches are going to be ineffective. Therefore, there is a good chance that performance is limited by the speed and bandwidth of your system's main memory hardware.
so for my programming class we have to do the following:
Fill an integer array with 5 million integers ranging from 0-9.
Then find the number of times each number (0-9) occurs and display this.
We have to measure the time it takes to count the occurrences both single-threaded and multi-threaded. Currently I average 9.3 ms single-threaded and 8.9 ms multi-threaded with 8 threads on my 8-core CPU. Why is this?
Currently, for multithreading, I have one array filled with numbers and I calculate lower and upper bounds for each thread to count occurrences in. Here is my current attempt:
public void createThreads(int divisionSize) throws InterruptedException {
    threads = new Thread[threadCount];
    for(int i = 0; i < threads.length; i++) {
        final int lower = (i*divisionSize);
        final int upper = lower + divisionSize - 1;
        threads[i] = new Thread(new Runnable() {
            long start, end;

            @Override
            public void run() {
                start = System.nanoTime();
                for(int i = lower; i <= upper; i++) {
                    occurences[numbers[i]]++;
                }
                end = System.nanoTime();
                milliseconds += (end-start)/1000000.0;
            }
        });
        threads[i].start();
        threads[i].join();
    }
}
Could anyone shed some light? Cheers.
You are essentially doing all the work sequentially, because you immediately join each thread you create.
Move the threads[i].join() call out of the main construction loop into its own loop. While you're at it, you should probably also start all of the threads outside that loop, since starting them while new threads are still being created is not a good idea because creating threads takes time.
import java.util.Random;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicLong;

class ThreadTester {
    private final int threadCount;
    private final int numberCount;
    int[] numbers = new int[5_000_000];
    AtomicIntegerArray occurences;
    Thread[] threads;
    AtomicLong milliseconds = new AtomicLong();

    public ThreadTester(int threadCount, int numberCount) {
        this.threadCount = threadCount;
        this.numberCount = numberCount;
        occurences = new AtomicIntegerArray(numberCount);
        threads = new Thread[threadCount];
        Random r = new Random();
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = r.nextInt(numberCount);
        }
    }

    public void createThreads() throws InterruptedException {
        final int divisionSize = numbers.length / threadCount;
        for (int i = 0; i < threads.length; i++) {
            final int lower = (i * divisionSize);
            final int upper = lower + divisionSize - 1;
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    long start = System.nanoTime();
                    for (int i = lower; i <= upper; i++) {
                        occurences.addAndGet(numbers[i], 1);
                    }
                    long end = System.nanoTime();
                    milliseconds.addAndGet(end - start);
                }
            });
        }
    }

    private void startThreads() {
        for (Thread thread : threads) {
            thread.start();
        }
    }

    private void finishThreads() throws InterruptedException {
        for (Thread thread : threads) {
            thread.join();
        }
    }

    public long test() throws InterruptedException {
        createThreads();
        startThreads();
        finishThreads();
        return milliseconds.get();
    }
}
public void test() throws InterruptedException {
    for (int threads = 1; threads < 50; threads++) {
        ThreadTester tester = new ThreadTester(threads, 10);
        System.out.println("Threads=" + threads + " ns=" + tester.test());
    }
}
Note that even here the fastest solution uses one thread, but you can clearly see that an even number of threads does it quicker; I am using an i5, which has 2 cores but presents 4 hardware threads via hyper-threading.
Interestingly though, as suggested by @biziclop, if we remove all contention between the threads by giving each thread its own occurences array, we get a result much closer to what you would expect.
The other answers explored the immediate problems with your code; I'll give you a different angle, one about the design of multi-threaded programs in general.
The idea of parallel computing speeding up calculations depends on the assumption that the small bits you broke the problem up into can indeed be run in parallel, independently of each other.
And at first glance your problem is exactly like that: chop the input range up into 8 equal parts, fire up 8 threads, and off they go.
There is a catch though:
occurences[numbers[i]]++;
The occurences array is a resource shared by all threads, and therefore you must control access to it to ensure correctness: either by explicit synchronization (which is slow) or something like an AtomicIntegerArray. But the Atomic* classes are only really fast if access to them is rarely contested. And in your case access will be contested a lot, because most of what your inner loop does is incrementing the number of occurrences.
So what can you do?
The problem is caused partly by the fact that occurences is such a small structure (an array with only 10 elements, regardless of input size): the threads will continuously try to update the same elements. But you can turn that to your advantage: make all the threads keep their own separate tally, and when they have all finished, just add up their results. This adds a small, constant overhead at the end of the process but makes the calculation truly parallel.
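A minimal sketch of that idea (the names here are mine, not from the question): each worker counts into its own private int[10], and the main thread merges the tallies after joining.

// Sketch only: assumes the numbers array from the question and a caller that
// declares throws InterruptedException.
int threadCount = 8;
int[][] local = new int[threadCount][10];   // one private tally per thread
Thread[] workers = new Thread[threadCount];
int division = numbers.length / threadCount;
for (int t = 0; t < threadCount; t++) {
    final int id = t;
    final int lower = t * division;
    final int upper = (t == threadCount - 1) ? numbers.length : lower + division;
    workers[t] = new Thread(() -> {
        for (int i = lower; i < upper; i++) {
            local[id][numbers[i]]++;        // no contention: each thread owns its row
        }
    });
    workers[t].start();
}
for (Thread w : workers) w.join();
int[] occurences = new int[10];             // merge step: small, constant cost
for (int[] row : local)
    for (int d = 0; d < 10; d++)
        occurences[d] += row[d];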
The join method allows one thread to wait for the completion of another, so the second thread will start only after the first one finishes.
Join each thread only after you have started all of the threads:
public void createThreads(int divisionSize) throws InterruptedException {
    threads = new Thread[threadCount];
    for(int i = 0; i < threads.length; i++) {
        final int lower = (i*divisionSize);
        final int upper = lower + divisionSize - 1;
        threads[i] = new Thread(new Runnable() {
            long start, end;

            @Override
            public void run() {
                start = System.nanoTime();
                for(int i = lower; i <= upper; i++) {
                    occurences[numbers[i]]++;
                }
                end = System.nanoTime();
                milliseconds += (end-start)/1000000.0;
            }
        });
        threads[i].start();
    }
    for(int i = 0; i < threads.length; i++) {
        threads[i].join();
    }
}
Also, there seems to be a race condition in the code at occurences[numbers[i]]++, so most probably once you update the code and use more threads the output won't be correct. You should use an AtomicIntegerArray: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/AtomicIntegerArray.html
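For illustration, a short hedged sketch of what that change looks like (field names follow the question's code):

// Sketch only: java.util.concurrent.atomic.AtomicIntegerArray replaces the plain int[]
// so that concurrent increments are not lost.
AtomicIntegerArray occurences = new AtomicIntegerArray(10);
// inside each thread's run():
for (int i = lower; i <= upper; i++) {
    occurences.incrementAndGet(numbers[i]);
}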
Use an ExecutorService with Callable, invoke all the tasks, and then you can safely aggregate their results. Also use TimeUnit for elapsed-time manipulation (sleeping, joining, waiting, conversion, ...).
Start by defining the task with its input/output:
class Task implements Callable<Task> {
    // input
    int[] source;
    int sliceStart;
    int sliceEnd;
    // output
    int[] occurences = new int[10];
    String runner;
    long elapsed = 0;

    Task(int[] source, int sliceStart, int sliceEnd) {
        this.source = source;
        this.sliceStart = sliceStart;
        this.sliceEnd = sliceEnd;
    }

    @Override
    public Task call() {
        runner = Thread.currentThread().getName();
        long start = System.nanoTime();
        try {
            compute();
        } finally {
            elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }
        return this;
    }

    void compute() {
        for (int i = sliceStart; i < sliceEnd; i++) {
            occurences[source[i]]++;
        }
    }
}
Then let's define some variables to manage the parameters:
// Parameters
int size = 5_000_000;
int parallel = Runtime.getRuntime().availableProcessors();
int slices = parallel;
Then generate random input:
// Generated source
int[] source = new int[size];
ThreadLocalRandom random = ThreadLocalRandom.current();
for (int i = 0; i < source.length; i++) source[i] = random.nextInt(10);
Start timing the total computation and prepare the tasks:
long start = System.nanoTime();
// Prepare tasks
List<Task> tasks = new ArrayList<>(slices);
int sliceSize = source.length / slices;
for (int sliceStart = 0; sliceStart < source.length;) {
    int sliceEnd = Math.min(sliceStart + sliceSize, source.length);
    Task task = new Task(source, sliceStart, sliceEnd);
    tasks.add(task);
    sliceStart = sliceEnd;
}
Execute all the tasks on the thread pool (don't forget to shut it down!):
// Execute tasks
ExecutorService executor = Executors.newFixedThreadPool(parallel);
try {
    executor.invokeAll(tasks);
} finally {
    executor.shutdown();
}
Once the tasks have completed, just aggregate the data:
// Collect data
int[] occurences = new int[10];
for (Task task : tasks) {
    for (int i = 0; i < occurences.length; i++) {
        occurences[i] += task.occurences[i];
    }
}
Finally you can output the computation result:
// Display result
long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
System.out.printf("Computation done in %tT.%<tL%n", calendar(elapsed));
System.out.printf("Results: %s%n", Arrays.toString(occurences));
You can also output partial computations:
// Print debug output
int idxSize = (String.valueOf(size).length() * 4) / 3;
String template = "Slice[%," + idxSize + "d-%," + idxSize + "d] computed in %tT.%<tL by %s: %s%n";
for (Task task : tasks) {
    System.out.printf(template, task.sliceStart, task.sliceEnd, calendar(task.elapsed), task.runner, Arrays.toString(task.occurences));
}
Which gives on my workstation:
Computation done in 00:00:00.024
Results: [500159, 500875, 500617, 499785, 500017, 500777, 498394, 498614, 499498, 501264]
Slice[ 0-1 250 000] computed in 00:00:00.013 by pool-1-thread-1: [125339, 125580, 125338, 124888, 124751, 124608, 124463, 124351, 125023, 125659]
Slice[1 250 000-2 500 000] computed in 00:00:00.014 by pool-1-thread-2: [124766, 125423, 125111, 124756, 125201, 125695, 124266, 124405, 125083, 125294]
Slice[2 500 000-3 750 000] computed in 00:00:00.013 by pool-1-thread-3: [124903, 124756, 124934, 125640, 124954, 125452, 124556, 124816, 124737, 125252]
Slice[3 750 000-5 000 000] computed in 00:00:00.014 by pool-1-thread-4: [125151, 125116, 125234, 124501, 125111, 125022, 125109, 125042, 124655, 125059]
And the small trick to convert elapsed millis into a stopwatch-style Calendar:
static final TimeZone UTC = TimeZone.getTimeZone("UTC");

public static Calendar calendar(long millis) {
    Calendar calendar = Calendar.getInstance(UTC);
    calendar.setTimeInMillis(millis);
    return calendar;
}
I have a program that sums the elements of a very large array. I want to parallelize this sum.
#define N some_very_large_no   // say 1e12
float x[N];                    // read from a file
float sum = 0.0;

int main()
{
    for (long i = 0; i < N; i++)
        sum = sum + x[i];
}
How can I parallelize this sum using threads (C/C++/Java, any code example is fine)? How many threads should I use for optimal performance if the machine has 8 cores?
EDIT: N may be really large (larger than 1e6 actually) and varies based on the size of the file I read the data from. The file is on the order of GBs.
EDIT: N has been changed to a large value (1e12 to 1e16).
In Java you can write
int cpus = Runtime.getRuntime().availableProcessors();
// you would keep this pool around for other tasks as well
ExecutorService service = Executors.newFixedThreadPool(cpus);

float[] floats = new float[N];

List<Future<Double>> tasks = new ArrayList<>();
int blockSize = (floats.length + cpus - 1) / cpus;
for (int i = 0; i < cpus; i++) {
    final int start = blockSize * i;
    final int end = Math.min(blockSize * (i + 1), floats.length);
    tasks.add(service.submit(new Callable<Double>() {
        public Double call() {
            double d = 0;
            for (int j = start; j < end; j++)
                d += floats[j];
            return d;
        }
    }));
}
double sum = 0;
for (Future<Double> task : tasks)
    sum += task.get(); // get() throws InterruptedException/ExecutionException
As WhozCraig mentions, it is likely that one million floats isn't enough work to need multiple threads, or you could find that your bottleneck is how fast you can load the array from main memory (a single-threaded resource). In any case, you can't assume the parallel version will be faster once you include the cost of getting the data.
You say that the array comes from a file. If you time the different parts of the program, you'll find that summing up the elements takes a negligible amount of time compared to how long it takes to read the data from disk. From Amdahl's Law it follows that there is nothing to be gained by parallelising the summing up.
If you need to improve performance, you should focus on improving the I/O throughput.
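For example, if 95% of the wall-clock time goes to reading the file from disk, Amdahl's Law caps the best possible overall speedup at 1 / 0.95 ≈ 1.05x, no matter how many threads do the summing.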
You can use many threads (even more than the number of cores), but how many threads to use and what performance you get depend on your algorithm and how the threads divide the work.
If the array length is 100000, create some number of threads and have each one sum a slice arr[start] to arr[start+limit-1], choosing limit so that the slices don't overlap and no element is left out.
thread creation:
pthread_t tid[COUNT];

int i = 0;
int err;
while (i < COUNT)
{
    void *arg;
    arg = x; // pass here a number telling this thread from where in arr[] to start
    err = pthread_create(&(tid[i]), NULL, &doSomeThing, arg);
    if (err != 0)
        printf("\ncan't create thread :[%s]", strerror(err));
    else
    {
        //printf("\n Thread created successfully\n");
    }
    i++;
}

// NOW CALCULATE....
for (int i = 0; i < COUNT; i++)
{
    pthread_join(tid[i], NULL);
}

void *doSomeThing(void *arg)
{
    int *x;
    x = (int *) (arg);
    // now use x as the starting index: sum arr[*x] up to your limit,
    // which should not overlap with any other thread's slice
}
Use a divide and conquer algorithm:
Divide the array into 2 or more parts (keep dividing recursively until each part reaches a manageable size).
Compute the sum of each sub-array in a separate thread.
Finally, add the sums produced by all the threads for the sub-arrays together to get the final result (see the sketch below).
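In Java, one way to express this (a sketch of the idea, not a tuned implementation; the class name and threshold are mine) is a RecursiveTask that splits the range in half until the slice is small enough to sum directly:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class RangeSumTask extends RecursiveTask<Double> {
    private static final int THRESHOLD = 100_000; // "manageable size"; tune for your machine
    private final float[] data;
    private final int from, to;

    RangeSumTask(float[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Double compute() {
        if (to - from <= THRESHOLD) {             // base case: sum the slice directly
            double s = 0;
            for (int i = from; i < to; i++) s += data[i];
            return s;
        }
        int mid = from + (to - from) / 2;         // split in half
        RangeSumTask left = new RangeSumTask(data, from, mid);
        RangeSumTask right = new RangeSumTask(data, mid, to);
        left.fork();                              // run the left half asynchronously
        return right.compute() + left.join();     // compute the right half, then join the left
    }
}

// usage: double sum = ForkJoinPool.commonPool().invoke(new RangeSumTask(x, 0, x.length));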
As others have said, the time-cost of reading the file is almost certainly going to be much larger than that of calculating the sum. Is it a text file or binary? If the numbers are stored as text, then the cost of reading them can be very high depending on your implementation.
You should also be careful when adding a large number of floats. Because of their limited precision, small values late in the array may not contribute to the sum at all. Consider at least using a double to accumulate the values.
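A small illustration of that precision point (my own example, not from the answer): with a float accumulator the running sum stalls once it reaches 2^24, while a double accumulator stays exact for these values.

// Sketch only: adds 1.0f twenty million times with a float and a double accumulator.
public class FloatAccumulatorDemo {
    public static void main(String[] args) {
        int n = 20_000_000;
        float fsum = 0.0f;
        double dsum = 0.0;
        for (int i = 0; i < n; i++) {
            fsum += 1.0f; // stalls at 2^24 = 16,777,216: adding 1.0f no longer changes it
            dsum += 1.0f;
        }
        System.out.println("float accumulator : " + fsum); // 1.6777216E7
        System.out.println("double accumulator: " + dsum); // 2.0E7
    }
}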
You can use pthreads in C to solve your problem.
Here is my code for a small fixed example (5 threads summing a 2500-element array); you can change it to suit your needs.
To build and run this code, use the following commands:
gcc -pthread test.c -o test
./test
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define NUM_THREADS 5

pthread_t threads[NUM_THREADS];
pthread_mutex_t mutexsum;
int a[2500];
int sum = 0;

void *do_work(void *parms) {
    long tid = (long) parms;
    printf("I am thread # %ld\n", tid);
    int start, end, mysum = 0;
    start = (int) tid * 500;
    end = start + 500;
    printf("Thread # %ld with start = %d and end = %d\n", tid, start, end);
    for (int i = start; i < end; i++) {
        mysum += a[i];
    }
    pthread_mutex_lock(&mutexsum);
    printf("Thread # %ld lock and sum = %d\n", tid, sum);
    sum += mysum;
    pthread_mutex_unlock(&mutexsum);
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    int i = 0;
    int rc;
    pthread_attr_t attr;

    pthread_mutex_init(&mutexsum, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    printf("Initializing array :\n");
    for (i = 0; i < 2500; i++) {
        a[i] = 1;
    }

    for (i = 0; i < NUM_THREADS; i++) {
        printf("Creating thread # %d.\n", i);
        rc = pthread_create(&threads[i], &attr, &do_work, (void *)(long) i);
        if (rc) {
            printf("Error in thread %d with rc = %d.\n", i, rc);
            exit(-1);
        }
    }
    pthread_attr_destroy(&attr);
    printf("Creating threads complete. Starting run.\n");

    for (i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    printf("\n\tSum : %d\n", sum);

    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
OpenMP has built-in support for reduction. Add the -fopenmp flag when compiling.
#include <omp.h>

#define N some_very_large_no   // say 1e12
float x[N];                    // read from a file

int main()
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum = sum + x[i];
    return 0;
}