java - Simple calculation takes longer in multi threads than in single thread

I'm trying to understand how to take advantage of multiple threads. I wrote a simple program that increments the value of i, say, 400,000 times in two ways: a single-threaded way (0 to 400,000) and a multithreaded way (in my case, 4 threads each counting 0 to 100,000), with the number of threads equal to Runtime.getRuntime().availableProcessors().
I'm surprised by the results I measured: the single-threaded way is decidedly faster, sometimes 3 times faster. Here is my code:
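import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class Main {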
public static int LOOPS = 100000;
private static ExecutorService executor=null;
public static void main(String[] args) throws InterruptedException, ExecutionException {
int procNb = Runtime.getRuntime().availableProcessors();
long startTime;
long endTime;
executor = Executors.newFixedThreadPool(procNb);
ArrayList<Calculation> c = new ArrayList<Calculation>();
for (int i=0;i<procNb;i++){
c.add(new Calculation());
}
// Make parallel computations (4 in my case)
startTime = System.currentTimeMillis();
queryAll(c);
endTime = System.currentTimeMillis();
System.out.println("Computation time using " + procNb + " threads : " + (endTime - startTime) + "ms");
startTime = System.currentTimeMillis();
for (int i =0;i<procNb*LOOPS;i++)
{
}
endTime = System.currentTimeMillis();
System.out.println("Computation time using main thread : " + (endTime - startTime) + "ms");
executor.shutdown(); // let the JVM exit once the pool is idle
}
public static List<Integer> queryAll(List<Calculation> queries) throws InterruptedException, ExecutionException {
List<Future<Integer>> futures = executor.invokeAll(queries);
List<Integer> aggregatedResults = new ArrayList<Integer>();
for (Future<Integer> future : futures) {
aggregatedResults.add(future.get());
}
return aggregatedResults;
}
}
class Calculation implements Callable<Integer> {
@Override
public Integer call() {
int i;
for (i=0;i<Main.LOOPS;i++){
}
return i;
}
}
Console :
Computation time using 4 threads : 10ms.
Computation time using main thread : 3ms.
Could anyone explain this?

An addition probably takes about one CPU cycle, so if your CPU runs at 3 GHz, that's roughly 0.3 nanoseconds. Do it 400k times and that becomes about 120k nanoseconds, or roughly 0.1 milliseconds. So your measurement is affected more by the overhead of starting threads, thread switching, JIT compilation, etc. than by the operation you are trying to measure.
You also need to account for compiler optimisations: if you place your empty loop in a method and run that method many times, you will notice that it runs in 0 ms after some time, because the compiler determines that the loop does nothing and optimises it away completely.
I suggest you use a specialised library for micro benchmarking, such as jmh.
See also: How do I write a correct micro-benchmark in Java?
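The warm-up effect described above can be made visible with a hand-rolled harness. This is only a minimal sketch under stated assumptions: the class name WarmupSketch and the methods sumTo and timeRounds are hypothetical, and a dedicated tool such as jmh remains the better choice for real measurements. The workload returns its result and the harness consumes it, so the JIT compiler has less room to discard the loop.

```java
public class WarmupSketch {
    // Hypothetical workload: returns its result so the loop is not trivially dead code.
    public static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Runs the workload `rounds` times and returns each round's duration in nanoseconds.
    public static long[] timeRounds(int rounds, int n) {
        long[] nanos = new long[rounds];
        long sink = 0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            sink += sumTo(n);
            nanos[r] = System.nanoTime() - start;
        }
        // Consume the accumulated result so the calls above cannot be eliminated.
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink);
        }
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(20, 400_000);
        for (int r = 0; r < nanos.length; r++) {
            System.out.println("round " + r + ": " + nanos[r] + " ns");
        }
    }
}
```

Early rounds typically include interpretation and compilation cost, so only the later rounds are worth comparing, and even those should be treated with caution.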

Related

Thread.sleep() optimization for small sleep intervals

I am writing a library that involves a caller-defined temporal resolution. In the implementation, this value ends up being the interval some background thread will sleep before doing some housekeeping and going back to sleep again. I am allowing this resolution to be as small as 1 millisecond, which translates to Thread.sleep(1). My hunch is that this may be more wasteful and less precise than busy-waiting for 1 ms. If that's the case:
Should I fall back to busy-waiting for small enough (how small) time intervals?
Does anyone know if the JVM is already doing this optimization anyway and I don't need to do anything at all?

That's easy to test:
public class Test {
static int i = 0;
static long[] measurements = new long[0x100];
static void report(long value) {
measurements[i++ & 0xff] = value;
if (i > 10_000) {
for (long m : measurements) {
System.out.println(m);
}
System.exit(0);
}
}
static void sleepyWait() throws Exception {
while (true) {
long before = System.nanoTime();
Thread.sleep(1);
long now = System.nanoTime();
report(now - before);
}
}
static void busyWait() {
while (true) {
long before = System.nanoTime();
long now;
do {
now = System.nanoTime();
} while (before + 1_000_000 >= now);
report(now - before);
}
}
public static void main(String[] args) throws Exception {
busyWait();
}
}
Run on my Windows system, this shows that busyWait has microsecond accuracy, but fully uses one CPU core.
In contrast, sleepyWait causes no measurable CPU load, but only achieves millisecond accuracy (often taking as much as 2 ms to fire, rather than the 1 ms requested).
At least on Windows, this is therefore a straightforward tradeoff between accuracy and CPU use.
It's also worth noting that there are often alternatives to running a CPU at full speed obsessively checking the time. In many cases, there is some other signal you could be waiting for, and offering an API that focuses on time-based resolution may steer the users of your API in a bad direction.
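One way to wait for "some other signal" instead of polling the clock is to block on a queue with a timeout, so the housekeeping thread wakes either when an event arrives or when the interval elapses. The following is a rough sketch only: SignalWaitSketch, awaitEvent and the event string are hypothetical names, and the timeout is still subject to the same timer granularity discussed above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SignalWaitSketch {
    // Blocks until an event arrives or the timeout elapses; returns null on timeout.
    public static String awaitEvent(BlockingQueue<String> events, long timeoutMillis) {
        try {
            return events.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        events.offer("housekeeping-requested");
        System.out.println(awaitEvent(events, 1)); // an event was already queued
        System.out.println(awaitEvent(events, 1)); // nothing queued, so this times out
    }
}
```

The waiting thread consumes no CPU while blocked, and a producer can wake it immediately by offering an event.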

Multithreaded vs Asynchronous programming in a single core

If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Let's say, for example, we have to count from 1 to Integer.MAX_VALUE. In the following program on my multicore machine, the two thread final count is almost half of the single thread count. What if we ran this on a single core machine? And is there any way we could achieve the same result there?
class Demonstration {
public static void main( String args[] ) throws InterruptedException {
SumUpExample.runTest();
}
}
class SumUpExample {
long startRange;
long endRange;
long counter = 0;
static long MAX_NUM = Integer.MAX_VALUE;
public SumUpExample(long startRange, long endRange) {
this.startRange = startRange;
this.endRange = endRange;
}
public void add() {
for (long i = startRange; i <= endRange; i++) {
counter += i;
}
}
static public void twoThreads() throws InterruptedException {
long start = System.currentTimeMillis();
SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
Thread t1 = new Thread(() -> {
s1.add();
});
Thread t2 = new Thread(() -> {
s2.add();
});
t1.start();
t2.start();
t1.join();
t2.join();
long finalCount = s1.counter + s2.counter;
long end = System.currentTimeMillis();
System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
}
static public void oneThread() {
long start = System.currentTimeMillis();
SumUpExample s = new SumUpExample(1, MAX_NUM );
s.add();
long end = System.currentTimeMillis();
System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
}
public static void runTest() throws InterruptedException {
oneThread();
twoThreads();
}
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540

For a purely CPU-bound operation you are correct. Most programs (99.9999%) need to do input and output and invoke other services. Those are orders of magnitude slower than the CPU, so while a program waits for the result of an external operation, the OS can schedule and run many other processes in time slices.
Hardware multithreading helps primarily when two conditions are met:
the operations are CPU-intensive;
and they can be efficiently divided into independent subsets.
Or when you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.

In the following program for my multicore machine, the two thread final count is almost half of the single thread count.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
Your benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to correct the flaws above.
You are running on a system where hardware hyper-threading (1) is disabled (2).
Then ... I would expect two threads to take more than twice as long as the one thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyper-threads. Each hyperthread has its own sets of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?

Threads in Java - Sum of N numbers

I tried to compute the sum of N numbers using the conventional method and also using threads, to see how threads perform. I see that the conventional method runs faster than the thread-based one.
My plan is to break down the upper limit (N) into ranges, then run a thread for each range, and finally add up the sums calculated by each thread.
stats in milliseconds :
248
500000000500000000
-----same with threads------
498
500000000500000000
Here I see the approach using threads took ~500 milliseconds and the conventional method took only ~250 milliseconds.
I wanted to know if I am correctly implementing threads for this problem.
Thanks
code :
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
class MyThread implements Runnable {
private long from, to, sum; // long: the constructor takes long bounds and the sum exceeds int range
public MyThread(long from , long to) {
this.from = from;
this.to = to;
sum = 0;
}
public void run() {
for(long i=from;i<=to;i++) {
sum+=i;
}
}
public long getSum() {
return this.sum;
}
}
public class exercise {
public static void main(String args[]) {
long startTime = System.currentTimeMillis();
long sum = 0;
for(long i=1;i<=1000000000;i++) {
sum+=i;
}
long endTime = System.currentTimeMillis();
long duration = (endTime - startTime); //Total execution time in milli seconds
System.out.println(duration);
System.out.println(sum);
System.out.println("-----same with threads------");
ExecutorService executor = Executors.newFixedThreadPool(5);
MyThread one = new MyThread(1, 100000);
MyThread two = new MyThread(100001, 10000000);
MyThread three = new MyThread(10000001, 1000000000);
startTime = System.currentTimeMillis();
executor.execute(one);
executor.execute(two);
executor.execute(three);
executor.shutdown();
// Wait until all threads have finished
while (!executor.isTerminated()) {
}
endTime = System.currentTimeMillis();
System.out.println(endTime - startTime);
long thsum = one.getSum() + two.getSum() + three.getSum();
System.out.println(thsum);
}
}

It only makes sense to split the work into multiple threads when each thread is assigned the same amount of work.
In your case, the first thread does almost nothing, the second thread does almost 1% of the work, and the third thread does 99% of the work.
Therefore, you pay the overhead for running multiple threads without benefiting from the parallel execution.
Splitting the work evenly, as follows, should yield better results:
MyThread one = new MyThread(1, 333333333);
MyThread two = new MyThread(333333334, 666666667);
MyThread three = new MyThread(666666668, 1000000000);
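The even split above can be generalized to any number of chunks. The following is a minimal sketch, not the asker's code: the class EvenSplitSum and the method parallelSum are hypothetical names, and it assumes Callable tasks with Future results instead of the MyThread class, so no busy-wait on isTerminated() is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EvenSplitSum {
    // Sums from..to (inclusive) by splitting the range into `parts` near-equal chunks.
    public static long parallelSum(long from, long to, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            long total = to - from + 1;
            long base = total / parts;
            long extra = total % parts; // the first `extra` chunks get one more element
            List<Future<Long>> futures = new ArrayList<>();
            long lo = from;
            for (int p = 0; p < parts; p++) {
                long size = base + (p < extra ? 1 : 0);
                final long chunkLo = lo;
                final long chunkHi = lo + size - 1;
                futures.add(pool.submit(() -> {
                    long sum = 0;
                    for (long i = chunkLo; i <= chunkHi; i++) {
                        sum += i;
                    }
                    return sum;
                }));
                lo = chunkHi + 1;
            }
            long result = 0;
            for (Future<Long> f : futures) {
                result += f.get(); // blocks until that chunk is done
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1, 1_000_000_000L, 3));
    }
}
```

Because every chunk has nearly the same size, no single thread dominates the running time the way the third range does in the question.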

The multithread part of your example includes the time for thread creation. Thread creation is an expensive operation and I presume that it is responsible for a large share of the difference between the single thread and multithread approaches.
Your question was if you are correctly implementing the threads. Did you mean implementing the runnable tasks? If so, I wonder why you have distributed the number ranges so unevenly. The task three seems to be far bigger than the others and as a result the performance will be close to a single thread version however you choose to set up the threads.

Java factorial calculation with thread pool

I managed to calculate a factorial with two threads without a pool. I have two factorial classes, named Factorial1 and Factorial2, which extend the Thread class. Let's say I want to calculate the value of 160000!. In Factorial1's run() method I do the multiplication in a for loop from i=2 to i=80000, and in Factorial2's from i=80001 to 160000. After that, I return both values and multiply them in the main method. When I compare the execution times, it is much better (5000 milliseconds) than the non-threaded calculation (15000 milliseconds), even with only two threads.
Now I want to write cleaner and better code, because I saw the efficiency of threads in the factorial calculation. But when I use a thread pool to calculate the factorial value, the parallel calculation always takes more time than the non-threaded calculation (nearly 16000 milliseconds). My code pieces look like:
for(int i=2; i<= Calculate; i++)
{
myPool.execute(new Multiplication(result, i));
}
run() method which is in Multiplication class:
public void run()
{
s1.Mltply(s2); // s1 and s2 are instances of my Number class
// their fields holds BigInteger values
}
Mltply() method which is in Number class:
public void Mltply(int number)
{
area.lock(); // result is going wrong without lock
Number temp = new Number(number);
value = value.multiply(temp.value); // value is a BigInteger
area.unlock();
}
In my opinion this lock may kill all the advantage of using threads, because it seems that all the threads do is multiplication and nothing else. But without it, I can't even calculate the correct result. Let's say I want to calculate 10!, so thread1 calculates 10*9*8*7*6 and thread2 calculates 5*4*3*2*1. Is that the way I'm looking for? Is it even possible with a thread pool? Of course the execution time must be less than that of the normal calculation...
I appreciate all your help and suggestions.
EDIT: - My own solution to the problem -
public class MyMultiplication implements Runnable
{
public static BigInteger subResult1;
public static BigInteger subResult2;
int thread1StopsAt;
int thread2StopsAt;
long threadId;
static boolean idIsSet=false;
public MyMultiplication(BigInteger n1, int n2) // First Thread
{
MyMultiplication.subResult1 = n1;
this.thread1StopsAt = n2/2;
thread2StopsAt = n2;
}
public MyMultiplication(int n2,BigInteger n1) // Second Thread
{
MyMultiplication.subResult2 = n1;
this.thread2StopsAt = n2;
thread1StopsAt = n2/2;
}
@Override
public void run()
{
if(idIsSet==false)
{
threadId = Thread.currentThread().getId();
idIsSet=true;
}
if(Thread.currentThread().getId() == threadId)
{
for(int i=2; i<=thread1StopsAt; i++)
{
subResult1 = subResult1.multiply(BigInteger.valueOf(i));
}
}
else
{
for(int i=thread1StopsAt+1; i<= thread2StopsAt; i++)
{
subResult2 = subResult2.multiply(BigInteger.valueOf(i));
}
}
}
}
public class JavaApplication3
{
public static void main(String[] args) throws InterruptedException
{
int calculate=160000;
long start = System.nanoTime();
BigInteger num = BigInteger.valueOf(1);
for (int i = 2; i <= calculate; i++)
{
num = num.multiply(BigInteger.valueOf(i));
}
long end = System.nanoTime();
double time = (end-start)/1000000.0;
System.out.println("Without threads: \t" +
String.format("%.2f",time) + " miliseconds");
System.out.println("without threads Result: " + num);
BigInteger num1 = BigInteger.valueOf(1);
BigInteger num2 = BigInteger.valueOf(1);
ExecutorService myPool = Executors.newFixedThreadPool(2);
start = System.nanoTime();
myPool.execute(new MyMultiplication(num1,calculate));
Thread.sleep(100);
myPool.execute(new MyMultiplication(calculate,num2));
myPool.shutdown();
while(!myPool.isTerminated()) {} // waiting threads to end
end = System.nanoTime();
time = (end-start)/1000000.0;
System.out.println("With threads: \t" +String.format("%.2f",time)
+ " miliseconds");
BigInteger result =
MyMultiplication.subResult1.
multiply(MyMultiplication.subResult2);
System.out.println("With threads Result: " + result);
System.out.println(MyMultiplication.subResult1);
System.out.println(MyMultiplication.subResult2);
}
}
input : 160000!
Execution time without threads : 15000 milliseconds
Execution time with 2 threads : 4500 milliseconds
Thanks for ideas and suggestions.

You can calculate 160000! concurrently without using a lock by splitting the range into disjoint chunks, as you explained, e.g. 2..80000 and 80001..160000.
But you can also achieve this by using the Java Stream API:
IntStream.rangeClosed(1, 160000).parallel()
.mapToObj(val -> BigInteger.valueOf(val))
.reduce(BigInteger.ONE, BigInteger::multiply);
It does exactly what you are trying to do. It splits the whole range into chunks, uses a thread pool and computes the partial results. Afterwards it combines the partial results into a single result.
So why bother doing it yourself? Just practicing clean coding?
On my real 4 core machine, the computation in a for loop took 8 times longer than using a parallel stream.

Threads have to run independently to run fast. Dependencies such as locks, synchronized parts of your code, or some system calls lead to sleeping threads that are waiting to access some resource.
In your case you should minimize the time a thread spends inside the lock. Maybe I am wrong, but it seems like you create a task for each number. So for 1000! you submit about 1000 tasks. All of them try to get the lock on area and cannot calculate anything, because one thread has acquired the lock and all other threads have to wait until it is released again. So the threads effectively run serially, which is as fast as your non-threaded example plus the extra time for locking and unlocking, thread management and so on. Oh, and because of the CPU's context switching it gets even worse.
Your first attempt, splitting the factorial across two threads, is the better one. Each thread can calculate its own result, and only when they are done do the threads have to communicate with each other. So they are independent most of the time.
Now you have to generalize this solution. To reduce context switching, you only want as many threads as your CPU has cores (maybe slightly fewer because of your OS). Every thread gets a range of numbers and calculates their product. After this it locks the overall result and multiplies its own result into it.
This should improve the performance of your program.
Update: You ask for additional advice:
You said you have two classes Factorial1 and Factorial2. Probably they have their ranges hard-coded. You only need one class which takes the range as constructor arguments. This class implements Runnable, so it has a run method that multiplies all values in that range.
In your main method you can do something like this:
int n = 160_000;
int threads = 2;
ExecutorService executor = Executors.newFixedThreadPool(threads);
for (int i = 0; i < threads; i++) {
int start = i * (n/threads) + 1;
int end = (i + 1) * (n/threads) + 1; // exclusive upper bound
executor.execute(new Factorial(start, end));
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.DAYS);
Now you have calculated the result of each thread but not the overall result. This can be solved by a BigInteger that is visible to the Factorial class (like a static BigInteger result; in the main class) and a lock. In the run method of Factorial you can calculate the overall result by taking the lock and multiplying in the thread's result:
Main.lock.lock();
Main.result = Main.result.multiply(value);
Main.lock.unlock();
Some additional advice for the future: This isn't really clean, because Factorial needs to have information about your main class, so it has a dependency on it. But ExecutorService can return a Future<T> object, which can be used to receive the result of the task. Using this Future object you don't need to use locks. But this needs some extra work, so just try to get this running for now ;-)
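The Future-based variant mentioned above could look roughly like the following. This is a sketch under assumptions, not the asker's Factorial class: FutureFactorial, product and factorial are hypothetical names, and tasks are passed to submit as Callables so that each partial product comes back through a Future and no shared result or lock is needed.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureFactorial {
    // Product of all integers in from..to (inclusive); an empty range yields 1.
    public static BigInteger product(int from, int to) {
        BigInteger result = BigInteger.ONE;
        for (int i = from; i <= to; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // Computes n! by giving each of `threads` tasks its own contiguous range.
    public static BigInteger factorial(int n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = n / threads;
            List<Future<BigInteger>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                int from = t * chunk + 1;
                // The last task also takes the remainder of the range.
                int to = (t == threads - 1) ? n : (t + 1) * chunk;
                parts.add(pool.submit(() -> product(from, to)));
            }
            BigInteger result = BigInteger.ONE;
            for (Future<BigInteger> part : parts) {
                result = result.multiply(part.get()); // combined on the calling thread only
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(factorial(20, 2));
    }
}
```

Only the calling thread touches the combined result, which is why the lock from the previous snippet is no longer required.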

In addition to my Java Stream API solution, here is another solution that uses a self-managed thread pool, as you requested:
public static final int CHUNK_SIZE = 10000;
public static BigInteger fac(int max) {
ExecutorService executor = newCachedThreadPool();
try {
return rangeClosed(0, (max - 1) / CHUNK_SIZE)
.mapToObj(val -> executor.submit(() -> prod(leftBound(val), rightBound(val, max))))
.map(future -> valueOf(future))
.reduce(BigInteger.ONE, BigInteger::multiply);
} finally {
executor.shutdown();
}
}
private static int leftBound(int chunkNo) {
return chunkNo * CHUNK_SIZE + 1;
}
private static int rightBound(int chunkNo, int max) {
return Math.min((chunkNo + 1) * CHUNK_SIZE, max);
}
private static BigInteger valueOf(Future<BigInteger> future) {
try {
return future.get();
} catch (Exception e) {
throw new RuntimeException(e);
}
}
private static BigInteger prod(int min, int max) {
BigInteger res = BigInteger.valueOf(min);
for (int val = min + 1; val <= max; val++) {
res = res.multiply(BigInteger.valueOf(val));
}
return res;
}

Using parallelism in Java makes program slower (four times slower!!!)

I'm writing an implementation of the conjugate gradient method.
I use Java multithreading for matrix back-substitution.
Synchronization is done using CyclicBarrier and CountDownLatch.
Why does it take so much time to synchronize threads?
Are there other ways to do it?
Code snippet:
private void syncThreads() {
// barrier.await();
try {
barrier.await();
} catch (InterruptedException e) {
} catch (BrokenBarrierException e) {
}
}

You need to ensure that each thread spends more time doing useful work than it costs in overhead to pass a task to another thread.
Here is an example of where the overhead of passing a task to another thread far outweighs the benefits of using multiple threads.
final double[] results = new double[10*1000*1000];
{
long start = System.nanoTime();
// using a plain loop.
for(int i=0;i<results.length;i++) {
results[i] = (double) i * i;
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per square%n", (double) time / results.length);
}
{
ExecutorService ex = Executors.newFixedThreadPool(4);
long start = System.nanoTime();
// submit one tiny task per element.
for(int i=0;i<results.length;i++) {
final int i2 = i;
ex.execute(new Runnable() {
@Override
public void run() {
results[i2] = (double) i2 * i2; // cast first to avoid int overflow
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per square%n", (double) time / results.length);
}
prints
With one thread it took 1.4 ns per square
With four threads it took 715.6 ns per square
Using multiple threads is much worse.
However, increase the amount of work each thread does and
final double[] results = new double[10 * 1000 * 1000];
{
long start = System.nanoTime();
// using a plain loop.
for (int i = 0; i < results.length; i++) {
results[i] = Math.pow(i, 1.5);
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
{
int threads = 4;
ExecutorService ex = Executors.newFixedThreadPool(threads);
long start = System.nanoTime();
int blockSize = results.length / threads;
// submit one task per block.
for (int i = 0; i < threads; i++) {
final int istart = i * blockSize;
final int iend = (i + 1) * blockSize;
ex.execute(new Runnable() {
@Override
public void run() {
for (int i = istart; i < iend; i++)
results[i] = Math.pow(i, 1.5);
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
prints
With one thread it took 287.6 ns per pow 1.5
With four threads it took 77.3 ns per pow 1.5
That's an almost 4x improvement.

How many threads are being used in total? That is likely the source of your problem. Using multiple threads will only really give a performance boost if:
Each task in the thread does some sort of blocking. For example, waiting on I/O. Using multiple threads in this case enables that blocking time to be used by other threads.
Or you have multiple cores. If you have 4 cores or 4 CPUs, you can do 4 tasks simultaneously (or 4 threads).
It sounds like you are not blocking in the threads, so my guess is you are using too many threads. If, for example, you are using 10 different threads to do the work at the same time but only have 2 cores, that would likely be much slower than running all of the tasks in sequence. Generally, start with the number of threads equal to your number of cores/CPUs. Then increase the number of threads slowly, gauging the performance each time. This will help you find the optimal thread count.

Perhaps you could try to re-implement your code using fork/join from JDK 7 and see what it does?
The default creates a thread pool with exactly as many threads as you have cores in your system. If you choose the threshold for dividing your work into smaller chunks reasonably, this will probably execute much more efficiently.
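As an illustration of the fork/join idea with a threshold, here is a small sketch that sums squares rather than doing back-substitution. The names ForkJoinSumSketch, SquareSum and sumSquares are hypothetical, the THRESHOLD value is an arbitrary assumption that would need tuning, and the sketch uses the shared ForkJoinPool.commonPool() instead of a dedicated pool.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSumSketch {
    // Assumed cut-off: ranges at or below this size are summed sequentially.
    static final int THRESHOLD = 10_000;

    // Sums i*i for i in [from, to), splitting the range until it is small enough.
    static class SquareSum extends RecursiveTask<Long> {
        private final int from;
        private final int to;

        SquareSum(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (long i = from; i < to; i++) {
                    sum += i * i;
                }
                return sum;
            }
            int mid = from + (to - from) / 2;
            SquareSum left = new SquareSum(from, mid);
            SquareSum right = new SquareSum(mid, to);
            left.fork();                          // run the left half asynchronously
            return right.compute() + left.join(); // do the right half in this thread
        }
    }

    public static long sumSquares(int n) {
        return ForkJoinPool.commonPool().invoke(new SquareSum(0, n));
    }

    public static void main(String[] args) {
        System.out.println(sumSquares(1_000_000));
    }
}
```

A threshold that is too small reproduces the per-task overhead problem shown earlier, while one that is too large leaves cores idle, so it is worth measuring a few values.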

You are most likely aware of this, but in case you aren't, please read up on Amdahl's Law. It gives the relationship between expected speedup of a program by using parallelism and the sequential segments of the program.
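For reference, Amdahl's Law states that with a parallelizable fraction p of the program and n workers, the best possible speedup is 1 / ((1 - p) + p / n). A small sketch makes the consequence concrete; the class AmdahlSketch and the example fraction 0.90 are illustrative assumptions, not measurements of the code in question.

```java
public class AmdahlSketch {
    // Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction and n the number of workers.
    public static double speedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("p=0.90, n=%d -> %.2fx%n", n, speedup(0.90, n));
        }
    }
}
```

The sequential part, which includes any time spent waiting at a barrier, bounds the speedup no matter how many threads are added.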

Synchronizing across cores is much slower than in a single-core environment. See if you can limit the JVM to 1 core (see this blog post),
or you can use an ExecutorService and use invokeAll to run the parallel tasks.
