The Java tutorials say that creating a Thread is expensive. But why exactly is it expensive? What exactly is happening when a Java Thread is created that makes its creation expensive? I'm taking the statement as true, but I'm just interested in mechanics of Thread creation in JVM.
Thread lifecycle overhead. Thread creation and teardown are not free. The actual overhead varies across platforms, but thread creation takes time, introducing latency into request processing, and requires some processing activity by the JVM and OS. If requests are frequent and lightweight, as in most server applications, creating a new thread for each request can consume significant computing resources.
From Java Concurrency in Practice
By Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, Doug Lea
Print ISBN-10: 0-321-34960-1
Why is creating a Thread said to be expensive?
Because it *is* expensive.
Java thread creation is expensive because there is a fair bit of work involved:
A large block of memory has to be allocated and initialized for the thread stack.
System calls need to be made to create / register the native thread with the host OS.
Descriptors need to be created, initialized and added to JVM-internal data structures.
It is also expensive in the sense that the thread ties down resources as long as it is alive; e.g. the thread stack, any objects reachable from the stack, the JVM thread descriptors, the OS native thread descriptors.
The costs of all of these things are platform specific, but they are not cheap on any Java platform I've ever come across.
A Google search found me an old benchmark that reports a thread creation rate of ~4000 per second on a Sun Java 1.4.1 on a 2002 vintage dual processor Xeon running 2002 vintage Linux. A more modern platform will give better numbers ... and I can't comment on the methodology ... but at least it gives a ballpark for how expensive thread creation is likely to be.
Peter Lawrey's benchmarking indicates that thread creation is significantly faster these days in absolute terms, but it is unclear how much of this is due to improvements in Java and/or the OS, or simply to higher processor speeds. But his numbers still indicate a 150+ fold improvement if you use a thread pool versus creating/starting a new thread each time. (And he makes the point that this is all relative ...)
The above assumes native threads rather than green threads, but modern JVMs all use native threads for performance reasons. Green threads are possibly cheaper to create, but you pay for it in other areas.
Update: The OpenJDK Loom project aims to provide a light-weight alternative to standard Java threads, among other things. They are proposing virtual threads, which are a hybrid of native threads and green threads. In simple terms, a virtual thread is rather like a green-thread implementation that uses native threads underneath when parallel execution is required.
As of now (Jan 2021) the Project Loom work is still at the prototyping stage, with (AFAIK) no Java version targeted for the release.
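For a flavor of what that looks like, here is a minimal sketch using the virtual-thread builder API. Treat it as illustrative rather than definitive: the builder shape shown here matches later Loom builds (and what eventually shipped), and earlier prototypes used different method names.

import java.util.concurrent.atomic.AtomicInteger;

public class LoomSketch {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println("running in " + Thread.currentThread());

        // A classic platform thread: backed 1:1 by a native OS thread.
        Thread platform = new Thread(task);
        platform.start();
        platform.join();

        // A virtual thread: cheap to create; the JVM multiplexes many of
        // these over a small pool of native carrier threads.
        Thread virtual = Thread.ofVirtual().start(task);
        virtual.join();
    }
}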
I've done a bit of digging to see how a Java thread's stack really gets allocated. In the case of OpenJDK 6 on Linux, the thread stack is allocated by the call to pthread_create that creates the native thread. (The JVM does not pass pthread_create a preallocated stack.)
Then, within pthread_create the stack is allocated by a call to mmap as follows:
mmap(0, attr.__stacksize,
PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
According to man mmap, the MAP_ANONYMOUS flag causes the memory to be initialized to zero.
Thus, even though it might not be essential that new Java thread stacks are zeroed (per the JVM spec), in practice (at least with OpenJDK 6 on Linux) they are zeroed.
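Relatedly, the per-thread stack is a creation-time cost you can influence from the Java side. The long-standing Thread constructor accepts a requested stack size, which the JVM is free to treat as a hint or to ignore entirely (per its javadoc), and the -Xss option sets the default. A minimal sketch:

public class StackSizeSketch {
    public static void main(String[] args) {
        Runnable task = () -> System.out.println("stack was allocated at creation time");

        // Request specific stack sizes; the JVM may round, adjust, or ignore these.
        Thread small = new Thread(null, task, "small-stack", 64 * 1024);       // 64 KB requested
        Thread large = new Thread(null, task, "large-stack", 8 * 1024 * 1024); // 8 MB requested
        small.start();
        large.start();

        // The VM-wide default can also be set on the command line, e.g.: java -Xss512k ...
    }
}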
Others have discussed where the costs of threading come from. This answer covers why creating a thread is not that expensive compared to many operations, but is expensive relative to the task-execution alternatives, which are cheaper.
The most obvious alternative to running a task in another thread is to run the task in the same thread. This is difficult to grasp for those who assume that more threads are always better. The logic is: if the overhead of handing the task to another thread is greater than the time you save, it can be faster to perform the task in the current thread.
Another alternative is to use a thread pool. A thread pool can be more efficient for two reasons: 1) it reuses threads that have already been created; 2) you can tune/control the number of threads to ensure optimal performance.
The following program prints....
Time for a task to complete in a new Thread 71.3 us
Time for a task to complete in a thread pool 0.39 us
Time for a task to complete in the same thread 0.08 us
Time for a task to complete in a new Thread 65.4 us
Time for a task to complete in a thread pool 0.37 us
Time for a task to complete in the same thread 0.08 us
Time for a task to complete in a new Thread 61.4 us
Time for a task to complete in a thread pool 0.38 us
Time for a task to complete in the same thread 0.08 us
This is a test for a trivial task which exposes the overhead of each threading option. (This test task is the sort of task that is actually best performed in the current thread.)
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class ThreadOverheadTest {
    public static void main(String[] args) throws InterruptedException {
        // Each task posts one element; take() lets the timer wait for all tasks to finish.
        final BlockingQueue<Integer> queue = new LinkedBlockingQueue<Integer>();
        Runnable task = new Runnable() {
            @Override
            public void run() {
                queue.add(1);
            }
        };
        for (int t = 0; t < 3; t++) {
            {
                long start = System.nanoTime();
                int runs = 20000;
                for (int i = 0; i < runs; i++)
                    new Thread(task).start(); // create and start a fresh thread per task
                for (int i = 0; i < runs; i++)
                    queue.take();
                long time = System.nanoTime() - start;
                System.out.printf("Time for a task to complete in a new Thread %.1f us%n", time / runs / 1000.0);
            }
            {
                int threads = Runtime.getRuntime().availableProcessors();
                ExecutorService es = Executors.newFixedThreadPool(threads);
                long start = System.nanoTime();
                int runs = 200000;
                for (int i = 0; i < runs; i++)
                    es.execute(task); // reuse pooled threads
                for (int i = 0; i < runs; i++)
                    queue.take();
                long time = System.nanoTime() - start;
                System.out.printf("Time for a task to complete in a thread pool %.2f us%n", time / runs / 1000.0);
                es.shutdown();
            }
            {
                long start = System.nanoTime();
                int runs = 200000;
                for (int i = 0; i < runs; i++)
                    task.run(); // no hand-off at all: run inline
                for (int i = 0; i < runs; i++)
                    queue.take();
                long time = System.nanoTime() - start;
                System.out.printf("Time for a task to complete in the same thread %.2f us%n", time / runs / 1000.0);
            }
        }
    }
}
As you can see, creating a new thread only costs ~70 µs. This could be considered trivial in many, if not most, use cases. Relatively speaking, though, it is more expensive than the alternatives, and for some situations a thread pool, or not using threads at all, is a better solution.
In theory, this depends on the JVM. In practice, every thread gets a relatively large amount of stack memory (256 KB by default, I think). Additionally, threads are implemented as OS threads, so creating one involves an OS call, i.e. a trip into the kernel.
Do realize that "expensive" in computing is always very relative. Thread creation is very expensive relative to the creation of most objects, but not very expensive relative to a random harddisk seek. You don't have to avoid creating threads at all costs, but creating hundreds of them per second is not a smart move. In most cases, if your design calls for lots of threads, you should use a limited-size thread pool.
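As a hedged sketch of that last point, a bounded pool caps thread creation no matter how many tasks arrive; the pool size used here is just a common starting point, not a rule:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedPoolSketch {
    public static void main(String[] args) {
        // Pay the thread-creation cost once per pool thread, not once per task.
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < 10_000; i++) {
            pool.execute(() -> { /* handle one request */ });
        }
        pool.shutdown();
    }
}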
There are two kinds of threads:
Proper threads: these are abstractions around the underlying operating system's threading facilities. Thread creation is, therefore, as expensive as the system's -- there is always an overhead.
"Green" threads: created and scheduled by the JVM, these are cheaper, but no true parallelism occurs. These behave like threads, but are executed within a single JVM thread in the OS. They are not often used, to my knowledge.
The biggest factor I can think of in the thread-creation overhead is the stack size you have defined for your threads. The thread stack size can be passed as a parameter when launching the VM (the -Xss option).
Other than that, thread creation is mostly OS-dependent, and even VM-implementation-dependent.
Now, let me point something out: creating threads is expensive if you're planning on firing 2000 threads per second, every second of your runtime. The JVM is not designed to handle that. If you'll have a couple of stable workers that won't be fired and killed over and over, relax.
Creating a Thread requires allocating a fair amount of memory, since it needs not one, but two new stacks (one for Java code, one for native code). Using Executors/thread pools avoids this overhead by reusing threads for multiple tasks.
Obviously, the crux of the question is what 'expensive' means.
A thread needs to create a stack and initialize the stack based on the run method.
It needs to set up control/status structures, i.e. which state it is in: runnable, waiting, etc.
There's probably a good deal of synchronization around setting these things up.
Related
I need to execute my workers on several cores because I can split my task into several sub-tasks manually, so I simply need to use newFixedThreadPool. But my worker drives CPU load to 100%. If it were using all the cores productively, that would be okay, but it works only 2x faster than without the thread pool, so it seems the pool is being used without much benefit.
I tried to use plain newFixedThreadPool:
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++)
    pool.execute(new MyTask(this));
and ThreadPoolExecutor with passed queue:
LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++)
    queue.add(new MyTask(this));
ThreadPoolExecutor pool = new ThreadPoolExecutor(Runtime.getRuntime().availableProcessors(),
        Runtime.getRuntime().availableProcessors(), 0, TimeUnit.MILLISECONDS, queue);
pool.prestartAllCoreThreads();
And the result is still the same.
MyTask is trivial and written just as an example:
protected static class MyTask implements Runnable {
    @Override
    public void run() {
        int i = 0;
        while (true)
            i++;
    }
}
When I run the workers without a ThreadPoolExecutor, CPU load is normal and varies around 10-20%. So what is wrong?
A CPU-bound task is bound to take all of the CPU
You have an infinite loop that does nothing more than run calculations. That code never blocks. That is, the code never stops to wait on externalities. For example, the code never contacts a database, never writes to storage, and never makes a call over the network.
With nothing to do but calculations on the CPU, such a task is considered to be CPU-bound. Such a task will use 100% of a CPU core.
You multiply this task, to have one such task per core for every core in your machine. So all cores are pushed to 100%. Likely your machine will get hot, fans will spin, and CPU will throttle.
This seems rather obvious. Perhaps you should edit your Question to explain why you expected a different behavior.
Tip: Avoid loading all your cores with CPU-bound tasks. The OS has many processes that need to run, and other apps need to run.
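A hedged sketch of that tip: size the pool to leave a core free for the OS and other apps (how many cores to reserve is a judgment call, not a rule):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LeaveACoreFree {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Reserve one core for everything else on the machine; never go below 1.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, cores - 1));
        // ... submit CPU-bound tasks to pool ...
        pool.shutdown();
    }
}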
While it is not clear exactly what you are trying to find out, you have to keep this in mind.
A single thread will only make use of one (virtual) core at a time. It is usually backed by a single OS kernel thread (I say usually because frameworks and VMs have been doing a lot of fancy things with virtual threads and the like lately).
When the thread blocks (to do I/O or wait for something, or sleeps, or there is an explicit call to yield()), it relinquishes the CPU core to another thread.
So running a simple infinite loop on a single worker thread will only use 1 core, which is why you observe 10% to 20% load (depending on how many virtual cores your machine has).
On the other hand, if you create a thread pool of as many cores as you have, all doing the same infinite loop on separate threads, each will take one available core and use it, which will drive your CPU loading to 100%.
They are completely separate workers, so you can't expect your first worker to complete faster or anything of that sort. If anything, it will be slower, because you have exhausted all CPU resources, and the OS and even the JVM now have to compete to do their own work, like running the garbage collector.
This is a test about thread priority.
The code is from Thinking in Java p.809
import java.util.concurrent.*;

public class SimplePriorities implements Runnable {
    private int countDown = 5;
    private volatile double d; // No optimization
    private int priority;

    public SimplePriorities(int priority) {
        this.priority = priority;
    }

    public String toString() {
        return Thread.currentThread() + ": " + countDown;
    }

    public void run() {
        Thread.currentThread().setPriority(priority);
        while (true) {
            // An expensive, interruptable operation:
            for (int i = 1; i < 10000000; i++) {
                d += (Math.PI + Math.E) / (double) i;
                if (i % 1000 == 0)
                    Thread.yield();
            }
            System.out.println(this);
            if (--countDown == 0)
                return;
        }
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newCachedThreadPool();
        for (int i = 0; i < 5; i++)
            exec.execute(new SimplePriorities(Thread.MIN_PRIORITY));
        exec.execute(new SimplePriorities(Thread.MAX_PRIORITY));
        exec.shutdown();
    }
}
I wonder why I can't get a regular result like:
Thread[pool-1-thread-6,10,main]: 5
Thread[pool-1-thread-6,10,main]: 4
Thread[pool-1-thread-6,10,main]: 3
Thread[pool-1-thread-6,10,main]: 2
Thread[pool-1-thread-6,10,main]: 1
Thread[pool-1-thread-3,1,main]: 5
Thread[pool-1-thread-2,1,main]: 5
Thread[pool-1-thread-1,1,main]: 5
Thread[pool-1-thread-5,1,main]: 5
Thread[pool-1-thread-4,1,main]: 5
...
but a random result (every time I run it changes):
Thread[pool-1-thread-2,1,main]: 5
Thread[pool-1-thread-3,1,main]: 5
Thread[pool-1-thread-4,1,main]: 5
Thread[pool-1-thread-2,1,main]: 4
Thread[pool-1-thread-3,1,main]: 4
Thread[pool-1-thread-1,1,main]: 5
Thread[pool-1-thread-6,10,main]: 5
Thread[pool-1-thread-5,1,main]: 5
...
I am using an i3-2350M (2 cores / 4 threads) CPU with Windows 7 64-bit and JDK 7. Does that matter?
Java Thread priority has no effect
Thread priorities are highly OS specific, and on many operating systems they often have minimal effect. Priorities help to order the threads that are in the run queue; they will not change how often the threads are run in any major way unless you are doing a ton of CPU work in each of the threads.
Your program looks to use a lot of CPU, but unless you have fewer cores than threads, you may not see any change in output order from setting thread priorities. If there is a free CPU, then even a lower priority thread will be scheduled to run.
Also, threads are never starved. Even a lower priority thread will be given time to run quite often in a situation such as this. You should see higher priority threads given time slices more often, but that does not mean lower priority threads will wait for them to finish before running themselves.
Even if priorities do help to give one thread more CPU than the others, threaded programs are subject to race conditions, which inject a large amount of randomness into their execution. What you should see, however, is that the max priority thread is more likely to finish its countdown before the rest. If you add the priority to the println(), that should become obvious over a number of runs.
It is also important to note that System.out.println(...) is a synchronized method doing IO, which is going to dramatically affect how the threads interact, since the different threads block each other. In addition, Thread.yield() can be a no-op depending on how the OS does its thread scheduling.
but a random result (every time I run it changes):
Right. The output from a threaded program is rarely, if ever, "perfect", because by definition the threads run asynchronously. We want the output to be random because we want the threads to run in parallel, independently of each other. That is their power. If you are expecting some precise output, you should not be using threads.
Thread priority is implementation dependent. In particular, in Windows:
Thread priority isn't very meaningful when all threads are competing for CPU. (Source)
The book "Java Concurrency in Practice" also says to
Avoid the temptation to use thread priorities, since they increase platform dependence and can cause liveness problems. Most concurrent applications can use the default priority for all threads.
Thread priority does not guarantee execution order. It comes into play when resources are limited. If the system is running into constraints on memory or CPU, then higher priority threads will run first. Assuming you have sufficient system resources (which I would assume for a simple program on the hardware you posted), you will not hit any such constraints. Here is a blog post (not mine) I found that provides more information about it: Why Thread Priority Rarely Matters.
Let's keep it simple and go straight to the source ...
Every thread has a priority. When there is competition for processing resources, threads with higher priority are generally executed in preference to threads with lower priority. Such preference is not, however, a guarantee that the highest priority thread will always be running, and thread priorities cannot be used to reliably implement mutual exclusion.
from the Java Language Specification (2nd Edition) p.445.
Also ...
Although thread priorities exist in Java and many references state that the JVM will always select one of the highest priority threads for scheduling [52, 56, 89], this is currently not guaranteed by the Java language or virtual machine specifications [53, 90]. Priorities are only hints to the scheduler [127, page 227].
from Testing Concurrent Java Components (PhD Thesis, 2005) p. 62.
Reference 127, page 227 (from the excerpt above) is from Component Software: Beyond Object-Oriented Programming (by C. Szyperski), Addison Wesley, 1998.
In short, do not rely on thread priorities.
Thread priority is only a hint to the OS task scheduler. The task scheduler will only try to allocate more resources to a thread with higher priority; there are no explicit guarantees.
In fact, this is not specific to Java or the JVM. Most non-real-time OSes use thread priorities (managed or unmanaged) only in a suggestive manner.
Jeff Atwood has a good post on the topic: "Thread Priorities are Evil"
Here's the problem. If somebody begins the work that will make 'cond' true on a lower priority thread (the producer), and then the timing of the program is such that the higher priority thread that issues this spinning (the consumer) gets scheduled, the consumer will starve the producer completely. This is a classic race. And even though there's an explicit Sleep in there, issuing it doesn't allow the producer to be scheduled because it's at a lower priority. The consumer will just spin forever and unless a free CPU opens up, the producer will never produce. Oops!
As mentioned in other answers, Thread priority is more a hint than a definition of a strict rule.
That said, even if priority were respected in a strict(er) way, your test setup would not lead to the result you describe as "expected". You first create 5 threads with equally low priority and one thread with high priority.
The CPU you are using (an i3) has 4 hardware threads. So even if high priority implied that the thread runs without interruption (which is not true), the low priority threads would get 3/4 of the CPU power (given that no other task is running). That CPU power is shared among 5 threads, so each low priority thread would run at 4 × 3/4 × 1/5 = 3/5 the speed of the high priority thread. Once the high priority thread has completed, the low priority threads run at 4/5 of its speed.
You are launching the low priority threads before the high priority thread, so they start a bit earlier. I expect that in most systems "priority" is not implemented down to the nanosecond; the OS will let a thread run "a little longer" before switching to another (to reduce the cost of switching). Hence, the result depends a lot on how that switching is implemented and how big the task is. If the task is small, it could be that no switching takes place at all, and in your example all the low priority threads finish first.
These calculations assume that high and low priority are interpreted as extreme cases. In reality, priority is more like "in n out of m cases, prefer this one", with variable n and m.
You have a Thread.yield() in your code! This passes execution to another thread. If you do it too often, all threads end up getting roughly equal CPU power.
Hence, I would not expect the output you mention in your question, namely that the high priority thread finishes first and only then do the low priority threads finish.
Indeed: with the Thread.yield() line, each thread gets the same CPU time. Without the Thread.yield() line, and after increasing the number of calculations by a factor of 10 and the number of low priority threads by a factor of 10, I get the expected result, namely that the high priority thread finishes earlier by some factor (which depends on the number of hardware CPU threads).
I believe the OS is free to disregard Java thread priority.
Thread priority is a hint (which can be ignored), so that when the CPU is at 100%, the OS knows you would prefer some threads over others. The thread priority must be set before the thread is started, or it can be ignored.
On Windows you have to be Administrator to set the thread priority.
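A minimal sketch of that advice, treating priority strictly as a hint and setting it before start():

public class PriorityHintSketch {
    public static void main(String[] args) {
        Thread background = new Thread(() -> { /* bulk work */ });
        background.setPriority(Thread.MIN_PRIORITY);       // a scheduling hint, nothing more

        Thread latencySensitive = new Thread(() -> { /* urgent work */ });
        latencySensitive.setPriority(Thread.MAX_PRIORITY); // may be ignored by the OS

        background.start();
        latencySensitive.start();
    }
}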
Several things to consider:
thread priorities are evil and in most cases they should not be used: http://www.codinghorror.com/blog/2006/08/thread-priorities-are-evil.html
you explicitly yield, which probably invalidates your test
have you checked the generated bytecode? Are you certain that your volatile variable is behaving as you expect it to? Reordering may still happen for volatile variables.
Some OSes don't provide proper support for thread priorities.
You have to understand that different OSes deal with thread priorities differently. For example, Windows uses a pre-emptive, time-slice-based scheduler that gives a bigger time slice to higher priority threads, whereas some other (very old) OSes use a non-pre-emptive scheduler, where a higher priority thread executes entirely before a lower priority thread unless it goes to sleep or does an IO operation, etc.
That is why there is no guarantee that the higher priority thread executes to completion first; it in fact just gets more time than the lower priority threads, which is why your outputs are ordered the way they are.
In order to know how Windows handles multithreading please refer the below link:
https://learn.microsoft.com/en-us/windows/win32/procthread/multitasking
A high-priority thread does not stop low-priority threads until it is done. High priority does not make something run faster either; if it did, we would make everything high priority. And low priority does not make things slower, or we would never use it.
As I understand it, you are under the misconception that the rest of the system should sit idle and just let the highest-priority thread work, while the rest of the system's capacity is wasted.
I have a very simple program which uses ExecutorService.
I have set the number of threads to 4, but the time taken is the same as when it is set to 2.
Below is my code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class Test {
    private static final Logger LOGGER = Logger.getLogger("./sample1.log");

    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Throwable {
        ExecutorService service = Executors.newFixedThreadPool(4);
        Future<String> resultFirst = service.submit(new FirstRequest());
        Future<String> resultSecond = service.submit(new SecondRequest());
        Future<String> resultThird = service.submit(new ThirdRequest());
        Future<String> resultFourth = service.submit(new FourthRequest());

        String temp1 = resultSecond.get();
        temp1 = temp1.replace("Users", "UsersAppend1");
        String temp2 = resultThird.get();
        temp2 = temp2.replace("Users", "UsersAppend2");
        String temp3 = resultFourth.get();
        temp3 = temp3.replace("Users", "UsersAppend3");

        //System.out.println(resultFirst.get() + temp1 + temp2 + temp3);
        //LOGGER.info("Logger Name: " + LOGGER.getName());
        LOGGER.info(resultFirst.get() + temp1 + temp2 + temp3);

        service.shutdownNow();
        service.awaitTermination(10, TimeUnit.SECONDS);
        System.exit(0);
    }
}
Here FirstRequest, SecondRequest, ThirdRequest and FourthRequest are different classes which call another class that is common to all of them.
I have created distinct objects of the common class, so I don't think it's a case of deadlock/starvation.
You want to start here, meaning: it is actually hard to measure Java execution times in a reasonable way. Chances are that you have an over-simplified view, and thus your measurements aren't telling you anything.
And beyond that: you have to understand that "more threads" does not magically reduce overall runtime. It very much depends on what you are doing, and, for example, on how often your threads spend time waiting for IO.
Meaning: "adding" threads only helps when each thread is inactive for "longer" periods of time. When each thread is constantly burning CPU cycles at 100%, more threads do not help; on the contrary, times get worse, because the only thing you add is the overhead of setting up and switching between your tasks.
How many processors do you have? The time taken depends on the type of task you are performing, the type of resources your tasks use, etc.
Adding more threads doesn't mean your process will become faster; it just means that if there is idle processing power available, Java will try to use it.
I have set the no. of threads to 4, but the time taken is same as that set to 2.
This is a very common issue. I've seen countless cases where someone works really hard to turn a single-threaded application into a multi-threaded one, only to find that it doesn't run any faster. It can actually run slower because of thread overhead and refactoring issues.
The analogy is an overworked researcher on a project. You could divide their work and give it to 4 graduate students to work on concurrently, but if they all have to ask the researcher questions all of the time, the project isn't going to go any faster, and the lack of coordination among the 4 graduate students can make the project take even longer.
The only times that adding additional threads will definitively improve an application's throughput is when the threads are (see the sketch after this list):
completely independent – i.e. not sharing data (or not much) between threads, since shared data is where they block or have to synchronize memory
completely CPU bound – i.e. doing only data processing, not waiting on disk or network IO or other system resources
able to make use of additional system CPUs
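Here is a hedged sketch of that favorable case: fully independent, CPU-bound work, with each worker accumulating locally and writing only to its own result slot.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IndependentCpuBoundWork {
    public static void main(String[] args) throws InterruptedException {
        final int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final long[] partial = new long[threads];      // one slot per worker: no sharing
        final CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                long sum = 0;                           // purely local accumulation
                for (long i = id; i < 100_000_000L; i += threads)
                    sum += i;
                partial[id] = sum;                      // single write to this worker's slot
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        long total = 0;
        for (long p : partial) total += p;
        System.out.println("total = " + total);
    }
}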
Here FirstRequest, SecondRequest, ThirdRequest and FourthRequest are different classes which calls another class which is common to all.
Right, this is a red flag. The First/Second/... Request classes may be able to work concurrently, but if they have to call synchronized blocks in the common class, they will block each other. One of the tricky things with threaded programming is limiting the data shared between threads while still accomplishing the task. If you show more of the common class, we might be able to help you more; a hypothetical sketch of the problematic shape follows.
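Purely for illustration (these names are hypothetical, since the common class wasn't shown): if the shared class looks anything like this, the four request tasks serialize on its lock and the extra threads buy almost nothing.

// Hypothetical shape of the shared class; not the asker's actual code.
class CommonService {
    // Every caller queues up on this single lock, so the four Request
    // tasks effectively run doWork() one at a time.
    public synchronized String doWork(String input) {
        // ... expensive processing ...
        return input;
    }
}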
The other big red flag is any amount of IO. Watch out for reading and writing to disk or network. Watch also for logging or other output. It is often efficient to have one thread reading from disk, a number of threads processing what is read, and one thread writing. But even then, if the processing isn't very CPU intensive, you might see little to no speed improvement because the speed of the application is limited by the IO bandwidth of the disk device.
Web applications that spawn a thread for every request can be efficient because they handle many network-IO-bound requests simultaneously. But the real win is the code improvement that comes from each thread being able to concentrate on its own request; the thread subsystem then takes care of context switching between requests whenever a thread blocks waiting on network or disk IO.
I am implementing a multithreaded solution of the Barnes-Hut algorithm for the N-Body problem.
The main class does the following:
public void runSimulation() {
    for (int i = 0; i < numWorkers; i++) {
        new Thread(new Worker(i, this, gnumBodies, numSteps)).start();
    }
    try {
        startBarrier.await();
        stopBarrier.await();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
bh.startBarrier and bh.stopBarrier are CyclicBarriers whose barrier actions set startTime and stopTime to System.nanoTime() when reached.
The worker's run method:
public void run() {
    try {
        bh.startBarrier.await();
        for (int j = 0; j < numSteps; j++) {
            for (int i = wid; i < gnumBodies; i += bh.numWorkers) {
                bh.addForce(i);
                bh.moveBody(i);
            }
            bh.barrier.await();
        }
        bh.stopBarrier.await();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
addForce(i) walks a tree and does some calculations. It does not affect any shared variables, so no synchronization is used. O(N log N).
moveBody(i) does calculations on one element and no synchronization is used. O(N).
When bh.barrier is reached, a tree with all bodies is built up (barrier action).
Now to the problem. The runtime increases linearly with the number of threads used.
Runtimes for gnumBodies = 240, numSteps = 85000 and four cores:
1 thread = 0.763
2 threads = 0.952
3 threads = 1.261
4 threads = 1.563
Why isn't the runtime decreasing with the number of threads used?
edit: added hardware info
What hardware are you running it on? Running multiple threads has its own overhead, so it might not be worthwhile to split your task into sub-tasks that are too small.
Also, try using an ExecutorService instead of raw threads. That way you can use a thread pool instead of creating an actual thread for each task; there is no use in having more threads than your hardware can handle (see the sketch below).
It also looks to me like each thread will do the same work. Can this be the case? When creating a worker you are using the same parameters each time, apart from i.
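A sketch of that suggestion applied to runSimulation() above, reusing the question's Worker constructor as-is:

// Bound the pool by the hardware; reuse threads instead of one raw Thread per worker.
ExecutorService pool = Executors.newFixedThreadPool(
        Math.min(numWorkers, Runtime.getRuntime().availableProcessors()));
for (int i = 0; i < numWorkers; i++) {
    pool.execute(new Worker(i, this, gnumBodies, numSteps));
}
pool.shutdown(); // existing tasks run to completion; no new ones accepted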
Multithreading does not increase the execution speed unless you also have multiple CPU cores.
Threads doing math calculations can run at full speed
If you have only one CPU core, it makes no difference whether you run a calculation in one thread or in multiple threads. Running in multiple threads gives no performance benefit, and comes with the overhead of thread switching, so the total performance will actually be a little worse.
If you have multiple available CPU cores, then the threads can run physically in parallel, up to the number of cores. This means 4-8 threads may work well on today's desktop hardware.
Threads waiting for IO and getting suspended
Threads make sense if you are not doing a mathematical calculation but something that involves slow I/O, such as network, files, or databases. Instead of stalling your whole program, while one thread waits for IO another thread can use the same CPU core. This is why web servers and database solutions work with more threads than CPU cores.
Avoid unnecessary synchronization
Nevertheless, your measurement shows a synchronization mistake.
I suggest you remove all the xxBarrier.await() calls from the thread code.
I am unsure what your goal is with the xxBarriers vs. System.nanoTime(), but any unnecessary synchronization can easily result in slow performance: instead of calculating, you are waiting on the barriers.
Your workers do the same job numWorkers times, independently. The only shared object is your CyclicBarrier, and await() waits until all parties have invoked await on the barrier. As the number of workers increases, more time is spent in await().
If you have multiple cores, or if hyperthreading is available, then running multiple threads will take advantage of the underlying hardware.
If only one core is present, multi-threading can give a "perceived" benefit if your application involves at least one non-CPU-intensive activity, like interaction with a human. Humans are very slow compared to modern CPUs, so if your application needs to take multiple inputs from a human and also process them, it makes sense to separate the input handling and the calculations into two threads. By the time the human provides an input, part of the calculation can already be completed in the other thread, improving the overall time.
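A tiny sketch of that split, with a hypothetical background calculation overlapping the wait for (slow) human input:

import java.util.Scanner;

public class InputPlusCalculation {
    public static void main(String[] args) throws InterruptedException {
        Thread calc = new Thread(() -> {
            long sum = 0;
            for (long i = 0; i < 1_000_000_000L; i++)
                sum += i;                      // stand-in for the real calculation
            System.out.println("calculation done: " + sum);
        });
        calc.start();

        // The human types at human speed; the calculation overlaps this wait.
        System.out.print("Enter next input: ");
        String line = new Scanner(System.in).nextLine();
        calc.join();
        System.out.println("got input: " + line);
    }
}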
If your application must do pure calculations and hardware multi-threading support is not present, it is better to use a single thread. Your calculations are already lined up back-to-back in the pipeline and the CPU is already running at (almost) full speed, while multi-threading would add context-switching time to the calculations.
When I ran the application with a larger number of bodies and fewer steps, the application scaled as expected. So the problem was probably the overhead of the barrier(s)!
I was just running some multithreaded code on a 4-core machine in the hopes that it would be faster than on a single-core machine. Here's the idea: I have a fixed number of threads (in my case one thread per core). Every thread executes a Runnable of the form:
private static int[] data; // data shared across all threads

public void run() {
    int i = 0;
    while (i++ < 5000) {
        // do some work
        for (int j = 0; j < 10000 / numberOfThreads; j++) {
            // each thread performs calculations and reads from and
            // writes to a different part of the data array
        }
        // wait for the other threads
        barrier.await();
    }
}
On a quadcore machine, this code performs worse with 4 threads than it does with 1 thread. Even with the CyclicBarrier's overhead, I would have thought that the code should perform at least 2 times faster. Why does it run slower?
EDIT: Here's a busy wait implementation I tried. Unfortunately, it makes the program run slower on more cores (also being discussed in a separate question here):
public void run() {
    // do work
    synchronized (this) {
        if (atomicInt.decrementAndGet() == 0) {
            atomicInt.set(numberOfOperations);
            for (int i = 0; i < threads.length; i++)
                threads[i].interrupt();
        }
    }
    while (!Thread.interrupted()) {}
}
Adding more threads is not necessarily guaranteed to improve performance. There are a number of possible causes for decreased performance with additional threads:
Coarse-grained locking may overly serialize execution - that is, a lock may result in only one thread running at a time. You get all the overhead of multiple threads but none of the benefits. Try to reduce how long locks are held.
The same applies to overly frequent barriers and other synchronization structures. If the inner j loop completes quickly, you might spend most of your time in the barrier. Try to do more work between synchronization points.
If your code runs too quickly, there may be no time to migrate threads to other CPU cores. This usually isn't a problem unless you create a lot of very short-lived threads. Using thread pools, or simply giving each thread more work can help. If your threads run for more than a second or so each, this is unlikely to be a problem.
If your threads work on a lot of shared read/write data, cache line bouncing may decrease performance. That said, although this often results in performance degradation, it alone is unlikely to make things worse than the single-threaded case. Try to make sure the data each thread writes is separated from other threads' data by the size of a cache line (usually around 64 bytes). In particular, don't have output arrays laid out like [thread A, B, C, D, A, B, C, D ...]; a sketch of the fix follows.
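A hedged sketch of that layout advice, giving each thread a slot padded out to its own cache line (the 64-byte line size is assumed, not guaranteed):

public class PaddedCounters {
    static final int LINE = 8;            // 8 longs * 8 bytes = 64 bytes, a typical cache line
    final long[] slots;

    PaddedCounters(int threads) {
        slots = new long[threads * LINE]; // instead of [A, B, C, D, A, B, C, D ...]
    }

    void add(int thread, long value) {
        slots[thread * LINE] += value;    // threads never write within the same line
    }

    long get(int thread) {
        return slots[thread * LINE];
    }
}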
Since you haven't shown your code, I can't really speak in any more detail here.
You're sleeping for nanoseconds instead of milliseconds.
I changed from
Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads
to
Thread.sleep(100000 / numberOfThreads);
and got a speed-up proportional to the number of threads started just as expected.
I invented a CPU-intensive "countPrimes". Full test code available here.
I get the following speed-up on my quad-core machine:
4 threads: 1625
1 thread: 3747
(The CPU-load monitor indeed shows that 4 cores are busy in the former case, and that 1 core is busy in the latter case.)
Conclusion: you're doing comparatively small portions of work in each thread between synchronizations. The synchronization takes much, much more time than the actual CPU-intensive computation work.
(Also, if you have memory-intensive code, such as tons of array accesses in the threads, the CPU won't be the bottleneck anyway, and you won't see any speed-up by splitting the work across multiple CPUs.)
The code inside the runnable does not actually do anything.
In your specific example of 4 threads, each thread sleeps for 2.5 seconds and then waits for the others via the barrier.
So all that happens is that each thread gets on the processor to increment i and then blocks in sleep, leaving the processor available.
I do not see why the scheduler would allocate each thread to a separate core, since the threads mostly just wait.
It is fair and reasonable to expect the scheduler to use the same core and switch among the threads.
UPDATE
Just saw that you updated the post saying that some work is happening in the loop, though you do not say what.
Synchronizing across cores is much slower than synchronizing on a single core, because on a single-core machine the JVM doesn't flush the cache (a very slow operation) during each sync.
Check out this blog post.
Here is an untested SpinBarrier, but it should work.
Check whether it gives any improvement in your case: since you run the code in a loop, extra synchronization only hurts performance if you have idle cores.
Btw, I still believe you have a bug in the calculation, a memory-intensive operation. Can you tell what CPU + OS you use?
Edit: I had forgotten the version field in the first version.
import java.util.concurrent.atomic.AtomicInteger;

public class SpinBarrier {
    final int permits;
    final AtomicInteger count;
    final AtomicInteger version;

    public SpinBarrier(int count) {
        this.count = new AtomicInteger(count);
        this.permits = count;
        this.version = new AtomicInteger();
    }

    public void await() {
        // Spin until either the count reaches zero or another thread
        // has already advanced the barrier to the next generation.
        for (int c = count.decrementAndGet(), v = this.version.get();
             c != 0 && v == version.get();
             c = count.get()) {
            spinWait();
        }
        if (count.compareAndSet(0, permits)) { // only one succeeds here, the rest lose the CAS
            this.version.incrementAndGet();
        }
    }

    protected void spinWait() {
        // Hook for subclasses; an empty body means a pure busy-spin.
    }
}
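A minimal usage sketch for the class above: N threads phase-stepping on the spin barrier in place of a CyclicBarrier.

public class SpinBarrierDemo {
    public static void main(String[] args) {
        final int threads = 4;
        final SpinBarrier barrier = new SpinBarrier(threads);
        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                for (int step = 0; step < 1000; step++) {
                    // ... do this step's slice of the work ...
                    barrier.await(); // busy-wait until all threads reach this point
                }
            }).start();
        }
    }
}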