I need to run my workers on several cores. I can split my task into several independent sub-tasks manually, so I simply need to use newFixedThreadPool. But my workers drive the CPU to 100% load. If they were using all the cores productively that would be okay, but the job runs only 2x faster than without the thread pool, so it seems the pool is being used pointlessly.
I tried a plain newFixedThreadPool:
ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++)
    pool.execute(new MyTask(this));
and a ThreadPoolExecutor with a pre-filled queue:
LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++)
    queue.add(new MyTask(this));
ThreadPoolExecutor pool = new ThreadPoolExecutor(Runtime.getRuntime().availableProcessors(), Runtime.getRuntime().availableProcessors(), 0, TimeUnit.MILLISECONDS, queue);
pool.prestartAllCoreThreads();
And the result is still the same.
MyTask is trivial, written just as an example:
protected static class MyTask implements Runnable {
    @Override
    public void run() {
        int i = 0;
        while (true)
            i++;
    }
}
When I run a worker without the ThreadPoolExecutor, CPU load is normal and varies around 10-20%. So what is wrong?
A CPU-bound task is bound to take all of the CPU
You have an infinite loop that does nothing more than run calculations. That code never blocks. That is, the code never stops to wait on externalities. For example, the code never contacts a database, never writes to storage, and never makes a call over the network.
With nothing to do but calculations on the CPU, such a task is considered to be CPU-bound. Such a task will use 100% of a CPU core.
You multiply this task so there is one such task per core in your machine. So all cores are pushed to 100%. Likely your machine will get hot, fans will spin, and the CPU will throttle.
This seems rather obvious. Perhaps you should edit your Question to explain why you expected a different behavior.
Tip: Avoid loading all your cores with CPU-bound tasks. The OS has many processes that need to run, and other apps need to run.
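A minimal sketch of that tip (my own illustration, not from the question's code), leaving one core free:
int cores = Runtime.getRuntime().availableProcessors();
// Keep one core free for the OS and other apps; never drop below 1 thread.
ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, cores - 1));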
While it is not clear exactly what you are trying to find out, you have to keep this in mind.
A single thread will only make use of one (virtual) core at a time. It is usually backed by a single OS kernel thread (I say usually because frameworks and VMs have been doing a lot of fancy things with Virtual Threads and the like lately).
When the thread blocks (to do I/O or wait for something, or sleeps, or there is an explicit call to yield()), it relinquishes the CPU core to another thread.
So running a simple infinite loop on a single worker thread will only make use of one core, which is why you observe 10% to 20% load (depending on how many virtual cores your machine has).
On the other hand, if you create a thread pool of as many cores as you have, all doing the same infinite loop on separate threads, each will take one available core and use it, which will drive your CPU loading to 100%.
They are completely separate workers, so you can't expect your first worker to complete faster or anything of that sort. If anything it will be slower, because you have exhausted all CPU resources, and the OS and even the JVM must now compete to do their own work, such as running the garbage collector.
Related
I am implementing a multithreaded solution of the Barnes-Hut algorithm for the N-Body problem.
The main class does the following:
public void runSimulation() {
for(int i = 0; i < numWorkers; i++) {
new Thread(new Worker(i, this, gnumBodies, numSteps)).start();
}
try {
startBarrier.await();
stopBarrier.await();
} catch (Exception e) {e.printStackTrace();}
}
The bh.stopBarrier and bh.startBarrier are CyclicBarriers whose barrier actions set stopTime and startTime to System.nanoTime() when the barrier is reached.
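For context, the setup described might look roughly like this (a reconstruction from the description; the field names and party count are assumptions):
// Each of the numWorkers workers plus the main thread calls await(),
// so each barrier has numWorkers + 1 parties. The barrier action runs
// exactly once, in the last thread to arrive.
startBarrier = new CyclicBarrier(numWorkers + 1, () -> startTime = System.nanoTime());
stopBarrier = new CyclicBarrier(numWorkers + 1, () -> stopTime = System.nanoTime());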
The worker's run method:
public void run() {
try {
bh.startBarrier.await();
for(int j = 0; j < numSteps; j++) {
for(int i = wid; i < gnumBodies; i += bh.numWorkers) {
bh.addForce(i);
bh.moveBody(i);
}
bh.barrier.await();
}
bh.stopBarrier.await();
} catch (Exception e) {e.printStackTrace();}
}
addForce(i) walks a tree and does some calculations. It does not affect any shared variables, so no synchronization is used. O(N log N).
moveBody(i) does calculations on one element and no synchronization is used. O(N).
When bh.barrier is reached, a tree with all bodies is built up (barrier action).
Now to the problem. The runtime increases linearly with the number of threads used.
Runtimes for gnumBodies = 240, numSteps = 85000 and four cores:
1 thread = 0.763
2 threads = 0.952
3 threads = 1.261
4 threads = 1.563
Why isn't the runtime decreasing with the number of threads used?
edit: added hardware info
What hardware are you running it on? Running multiple threads has its own overhead, so it might not be worthwhile to split your task into sub-tasks that are too small.
Also, try using an ExecutorService instead of raw threads. That way you can use a thread pool instead of creating an actual thread for each task. There is no use in having more threads than your hardware can handle.
It also looks to me like each thread will do the same work. Can this be the case? When creating a worker you are using the same parameters each time, apart from i.
Multithreading does not increase the execution speed unless you also have multiple CPU cores.
Threads doing math calculations can run at full speed
If you have only one CPU core, it is all the same if you run a calculation in one thread or in multiple threads. Running in multiple threads gives no performance benefit, but comes with an overhead of thread switching, so actually the total performance will be a little worse.
If you have multiple available CPU cores, then the threads can run physically in parallel, up to the number of cores. This means 4-8 threads may work well on today's desktop hardware.
Threads waiting for IO and getting suspended
Threads make sense if you don't do mathematical calculations but instead do something that involves slow I/O such as network, files, or databases. While one thread waits for the I/O, another thread can use the same CPU core instead of the whole program stalling. This is the reason why web servers and database solutions work with more threads than CPU cores.
Avoid unnecessary synchronization
Nevertheless, your measurement shows a synchronization mistake.
I suggest you remove all the xxBarrier.await() calls from the thread code.
I am unsure what your goal is with the xxBarriers vs. System.nanoTime, but any unnecessary synchronization can easily result in slow performance: instead of calculating, you're waiting on the xxxBarriers.
Your workers do the same job numWorker times, independently.
The only shared object is your CyclicBarrier. await() waits until all parties have invoked await on this barrier. As the number of workers increases, more time is spent in await().
If you have multiple cores or if hyperthreading is available, then running multiple threads will take the benefit of underlying hardware.
If only one core is present, multithreading can give a 'perceived' benefit if your application involves at least one non-CPU-intensive activity, such as interaction with a human. Humans are very slow compared to modern-day CPUs. Hence, if your application needs to get multiple inputs from a human and also process them, it makes sense to separate the input and the calculations into two threads. By the time the human provides an input, part of the calculation can be completed in the other thread, for an overall improvement in time.
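As a rough illustration of that split (the queue, the input loop, and the process() method are assumptions, not from the answer):
BlockingQueue<String> inputs = new LinkedBlockingQueue<>();
// Thread 1: gathers slow human input.
new Thread(() -> {
    Scanner in = new Scanner(System.in);
    while (in.hasNextLine())
        inputs.add(in.nextLine());
}).start();
// Thread 2: processes inputs while the human is still typing.
new Thread(() -> {
    try {
        while (true)
            process(inputs.take()); // hypothetical processing method
    } catch (InterruptedException e) { /* shut down */ }
}).start();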
If your application must do calculations and multithreading support is not present in the hardware, it is better to use a single thread. Your calculations are already lined up in the pipeline back-to-back and the CPU is already running at (almost) max speed. Multithreading would add context-switching time, which increases the total time taken to do the calculations.
When I ran the application with a larger number of bodies and fewer steps, the application scaled as expected. So the problem was probably the overhead of the barrier(s)!
What is "Busy Spin" in multi-threaded environment?
How is it useful, and how can it be implemented in Java in a multi-threaded environment?
In what way can it be useful in improving the performance of an application?
Some of the other answers miss the real problem with busy waiting.
Unless you're talking about an application where you are concerned with conserving electrical power, then burning CPU time is not, in and of itself, a Bad Thing. It's only bad when there is some other thread or process that is ready-to-run. It's really bad when one of the ready-to-run threads is the thread that your busy-wait loop is waiting for.
That's the real issue. A normal, user-mode program running on a normal operating system has no control over which threads run on which processors, a normal operating system has no way to tell the difference between a thread that is busy waiting and a thread that is doing work, and even if the OS knew that the thread was busy-waiting, it would have no way to know what the thread was waiting for.
So, it's entirely possible for the busy waiter to wait for many milliseconds (practically an eternity), waiting for an event, while the only thread that could make the event happen sits on the sidelines (i.e., in the run queue) waiting for its turn to use a CPU.
Busy waiting is often used in systems where there is tight control over which threads run on which processors. Busy waiting can be the most efficient way to wait for an event when you know that the thread that will cause it is actually running on a different processor. That often is the case when you're writing code for the operating system itself, or when you're writing an embedded, real-time application that runs under a real-time operating system.
Kevin Walters wrote about the case where the time to wait is very short. A CPU-bound, ordinary program running on an ordinary OS may be allowed to execute millions of instructions in each time slice. So, if the program uses a spin-lock to protect a critical section consisting of just a few instructions, then it is highly unlikely that any thread will lose its time slice while it is in the critical section. That means, if thread A finds the spin-lock locked, then it is highly likely that thread B, which holds the lock, actually is running on a different CPU. That's why it can be OK to use spin-locks in an ordinary program when you know it's going to run on a multi-processor host.
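A minimal spin-lock sketch along those lines, using an atomic flag (my own example, assuming the critical section is only a few instructions):
AtomicBoolean locked = new AtomicBoolean(false);
// Acquire: spin until the CAS succeeds. On a multi-processor host the
// current owner is very likely running on another CPU right now.
while (!locked.compareAndSet(false, true)) { }
try {
    // critical section: just a few instructions
} finally {
    locked.set(false); // release
}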
Busy-waiting or spinning is a technique in which a process repeatedly checks whether a condition is true, instead of calling a wait or sleep method and releasing the CPU.
1. It is mainly useful on multicore processors, where the condition is going to become true quite quickly, i.e. within milliseconds or microseconds.
2. The advantage of not releasing the CPU is that all cached data and instructions remain unaffected; they might be lost if the thread were suspended on one core and brought back on another.
Busy spin is one of the techniques for waiting on an event without releasing the CPU. It's often done to avoid losing data in the CPU cache, which is lost if the thread is paused and resumed on some other core.
So, if you are working on a low latency system where your order processing thread currently doesn't have any order, instead of sleeping or calling wait(), you can just loop and then again check the queue for new messages. It's only beneficial if you need to wait for a very small amount of time e.g. in microseconds or nanoseconds.
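A sketch of that pattern (the Order type and the handle() method are assumed for illustration):
Queue<Order> orders = new ConcurrentLinkedQueue<>();
// ...
Order order;
while ((order = orders.poll()) == null) {
    // busy spin: keep polling instead of blocking, trading CPU for latency
}
handle(order);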
The LMAX Disruptor framework, a high-performance inter-thread messaging library, has a BusySpinWaitStrategy based on this concept: it uses a busy spin loop for EventProcessors waiting on a barrier.
A "busy spin" is constantly looping in one thread to see if the other thread has completed some work. It is a "Bad Idea" as it consumes resources as it is just waiting. The busiest of spins don't even have a sleep in them, but spin as fast as possible waiting for the work to get finished. It is less wasteful to have the waiting thread notified by the completion of the work directly and just let it sleep until then.
Note, I call this a "Bad Idea", but it is used in some cases on low-level code to minimize latency, but this is rarely (if ever) needed in Java code.
Busy spinning/waiting is normally a bad idea from a performance standpoint. In most cases, it is preferable to sleep and wait for a signal when you are ready to run than to spin. Take the scenario where there are two threads, and thread 1 is waiting for thread 2 to set a variable (say, it waits until var == true). Then, it would busy spin by just doing
while (var == false)
;
In this case, you will take up a lot of time that thread 2 can potentially be running, because when you wake up you are just executing the loop mindlessly. So, in a scenario where you are waiting for something like this to happen, it is better to let thread 2 have all control by putting yourself to sleep and having it wake you up when it is done.
BUT, in rare cases where the time you need to wait is very short, a spinlock is actually faster. This is because of the time it takes to perform the signaling functions; spinning is preferable if the time spent spinning is less than the time it would take to perform the signaling. So, in that way it may be beneficial and could actually improve performance, but this is definitely not the most frequent case.
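For comparison, a sketch of the sleep-and-signal alternative described above (assuming a shared lock object, with var written under the same lock; InterruptedException handling omitted):
// Thread 1: sleeps until notified instead of spinning.
synchronized (lock) {
    while (!var)
        lock.wait(); // releases the lock and the CPU until notified
}
// Thread 2: sets the flag and wakes the waiter.
synchronized (lock) {
    var = true;
    lock.notify();
}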
Spin waiting means you constantly wait for a condition to come true. The opposite is waiting for a signal (such as being notified via wait() and notify()).
There are two ways of waiting: semi-active (sleep / yield) and active (busy waiting).
In busy waiting a program idles actively, using special opcodes like HLT or NOP or other time-consuming operations; others just use a while loop checking for a condition coming true.
The Java framework provides the Thread.sleep, Thread.yield, and LockSupport.parkXXX() methods for a thread to hand over the CPU. Sleep waits for a specific amount of time, but always takes over a millisecond even if a nanosecond was specified. The same is true for LockSupport.parkNanos(1). Thread.yield allows a resolution of 100 ns on my example system (Win7 + mobile i5).
The problem with yield is the way it works. If the system is fully utilized, yield can take up to 800 ms in my test scenario (100 worker threads all adding up a number (a += a;) indefinitely). Since yield frees the CPU and adds the thread to the end of all threads within its priority group, yield is therefore unstable unless CPU utilization stays below a certain level.
Busy waiting will block a CPU (core) for multiple milliseconds.
The Java framework (check the Condition class implementations) uses active (busy) waiting for periods of less than 1000 ns (1 microsecond). On my system an average invocation of System.nanoTime takes 160 ns, so busy waiting is like: check the condition, spend 160 ns on nanoTime, and repeat.
So basically the concurrency framework of Java (queues etc.) spins for waits shorter than a microsecond, hitting the waiting period within a granularity of N, where N is the number of nanoseconds needed to check the time constraint, and parks for waits of one ms or longer (on my current system).
So active busy waiting increases utilization but aids the overall responsiveness of the system.
While burning CPU time, one should use special instructions that reduce the power consumption of the core executing the time-consuming operations.
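Since Java 9 such an instruction can be requested portably via Thread.onSpinWait(), which typically maps to the x86 PAUSE instruction; a minimal sketch:
while (!condition) { // 'condition' stands in for the awaited state
    Thread.onSpinWait(); // hint: lowers power use and pipeline pressure while spinning
}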
Busy spin is nothing but looping until the other thread(s) complete. E.g., you have, say, 10 threads, and you want to wait for all of them to finish before continuing:
while(ALL_THREADS_ARE_NOT_COMPLETE);
//Continue with rest of the logic
For example, in Java you can manage multiple threads with an ExecutorService:
ExecutorService executor = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
    Runnable worker = new WorkerThread("" + i);
    executor.execute(worker);
}
executor.shutdown();
// With this loop, you keep spinning until the threads have finished.
while (!executor.isTerminated());
This is a busy spin: it consumes resources because the CPU is not sitting idle but keeps running around the loop. We should have a mechanism to notify the main (parent) thread that all threads are done, so that it can continue with the rest of the task.
With the preceding example, instead of busy spinning, you can use a different mechanism to improve performance:
ExecutorService executor = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
    Runnable worker = new WorkerThread("" + i);
    executor.execute(worker);
}
executor.shutdown();
try {
    // Blocks until all tasks have completed, without burning CPU.
    executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
    log.fatal("Exception ", e);
}
I have a program that performs long-running computations, so I want to speed it up. I tried to launch 3 threads, but java.exe still occupies only 25% of CPU (so only one core is used), and that remains true even if I use .setPriority(Thread.MAX_PRIORITY); and set the priority of java.exe to realtime (24). I tried using RealtimeThread, but it seemed to work even slower. It would be perfect if each thread were allocated to one core and the total CPU usage rose to 75%, but I don't know how to do it. Here's what my code looks like right now:
Thread g1 = new MyThread(i,j);
g1.setPriority(Thread.MAX_PRIORITY);
g1.run();
Thread g2 = new MyThread(j,i);
g2.setPriority(Thread.MAX_PRIORITY);
g2.run();
Thread g3 = new MyThread(i,j);
g3.setPriority(Thread.MAX_PRIORITY);
g3.run();
if (g1.isAlive()) {
g1.join();
}
if (g2.isAlive()) {
g2.join();
}
if (g3.isAlive()) {
g3.join();
}
You aren't actually using threads.
You need to call .start(), not .run().
This has nothing to do with CPUs - you're not actually starting 3 threads, you're running everything on the main thread. To start a thread, call its start() method, not run().
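A corrected sketch of the question's code (exception handling elided):
Thread g1 = new MyThread(i, j);
g1.setPriority(Thread.MAX_PRIORITY);
g1.start(); // start() spawns a new thread; run() executes in the caller
// ... create and start g2 and g3 the same way ...
g1.join();  // join() blocks until the thread dies; the isAlive() check is unnecessary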
First, as the others suggest, you're not really using multiple threads. This is because you're calling the run() method, which ends up doing the work in the calling thread.
Now, to address the rest of your question, which I take to mean how does one maximize the efficiency of a multithreaded process. This isn't a simple question, but I'll give you the basics. (Others, feel free to chime in.)
The best way to maximize the efficiency of your process is to try to make all of the threads do about the same amount of work, and to try to keep them from blocking. That is to say, it is your job to "balance" the workload in order to make the application run efficiently.
In general, you can't assign a thread to run on a particular CPU core; that's usually the job of the OS and the CPUs themselves. The OS schedules the process (using the priorities you provide) and then the CPUs can do their own scheduling at the instruction level. Besides setting the priorities, the rest of the scheduling is completely out of your control.
EDIT: I am addicted to semicolons.
I was just running some multithreaded code on a 4-core machine in the hope that it would be faster than on a single-core machine. Here's the idea: I have a fixed number of threads (in my case one thread per core). Every thread executes a Runnable of the form:
private static int[] data; // data shared across all threads
public void run() {
int i = 0;
while (i++ < 5000) {
// do some work
for (int j = 0; j < 10000 / numberOfThreads; j++) {
// each thread performs calculations and reads from and
// writes to a different part of the data array
}
// wait for the other threads
barrier.await();
}
}
On a quadcore machine, this code performs worse with 4 threads than it does with 1 thread. Even with the CyclicBarrier's overhead, I would have thought that the code should perform at least 2 times faster. Why does it run slower?
EDIT: Here's a busy wait implementation I tried. Unfortunately, it makes the program run slower on more cores (also being discussed in a separate question here):
public void run() {
// do work
synchronized (this) {
if (atomicInt.decrementAndGet() == 0) {
atomicInt.set(numberOfOperations);
for (int i = 0; i < threads.length; i++)
threads[i].interrupt();
}
}
while (!Thread.interrupted()) {}
}
Adding more threads is not necessarily guaranteed to improve performance. There are a number of possible causes for decreased performance with additional threads:
Coarse-grained locking may overly serialize execution - that is, a lock may result in only one thread running at a time. You get all the overhead of multiple threads but none of the benefits. Try to reduce how long locks are held.
The same applies to overly frequent barriers and other synchronization structures. If the inner j loop completes quickly, you might spend most of your time in the barrier. Try to do more work between synchronization points.
If your code runs too quickly, there may be no time to migrate threads to other CPU cores. This usually isn't a problem unless you create a lot of very short-lived threads. Using thread pools, or simply giving each thread more work can help. If your threads run for more than a second or so each, this is unlikely to be a problem.
If your threads are working on a lot of shared read/write data, cache line bouncing may decrease performance. That said, although this often results in performance degradation, this alone is unlikely to result in performance worse than the single threaded case. Try to make sure the data that each thread writes is separated from other threads' data by the size of a cache line (usually around 64 bytes). In particular, don't have output arrays laid out like [thread A, B, C, D, A, B, C, D ...]
Since you haven't shown your code, I can't really speak in any more detail here.
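To illustrate the cache-line point above, a hypothetical layout (names invented for the sketch) that keeps each thread's writes on its own cache line:
// Instead of interleaving per-thread results as [A, B, C, D, A, B, C, D, ...],
// give each thread a cache-line-sized slice to write into.
static final int INTS_PER_CACHE_LINE = 16; // 16 * 4 bytes = 64 bytes
int[] results = new int[numThreads * INTS_PER_CACHE_LINE];
// Thread k writes only to results[k * INTS_PER_CACHE_LINE].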
You're sleeping nano-seconds instead of milli-seconds.
I changed from
Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads
to
Thread.sleep(100000 / numberOfThreads);
and got a speed-up proportional to the number of threads started just as expected.
I invented a CPU-intensive "countPrimes". Full test code available here.
I get the following speed-up on my quad-core machine:
4 threads: 1625
1 thread: 3747
(The CPU-load monitor indeed shows that 4 cores are busy in the former case, and that 1 core is busy in the latter.)
Conclusion: You're doing comparatively small portions of work in each thread between synchronizations. The synchronization takes much, much more time than the actual CPU-intensive computation work.
(Also, if you have memory-intensive code, such as tons of array accesses in the threads, the CPU won't be the bottleneck anyway, and you won't see any speed-up by splitting it across multiple CPUs.)
The code inside the Runnable does not actually do anything.
In your specific example of 4 threads, each thread will sleep for 2.5 seconds and then wait for the others via the barrier.
So all that is happening is that each thread gets on the processor to increment i and then blocks in sleep, leaving the processor available.
I do not see why the scheduler would allocate each thread to a separate core, since the threads mostly just wait.
It is fair and reasonable to expect it to just use the same core and switch among the threads.
UPDATE
Just saw that you updated the post saying that some work is happening in the loop. What that work is, though, you don't say.
Synchronizing across cores is much slower than synchronizing on a single core, because on a single-core machine the JVM doesn't flush the cache (a very slow operation) during each sync.
Check out this blog post.
Here is an untested SpinBarrier, but it should work.
Check whether it gives any improvement in your case. Since you run the code in a loop, the extra synchronization only hurts performance if you have cores sitting idle.
Btw, I still believe you have a bug in the calculation, a memory-intensive operation. Can you tell what CPU+OS you use?
Edit: I had initially left the version field out.
import java.util.concurrent.atomic.AtomicInteger;

public class SpinBarrier {
    final int permits;
    final AtomicInteger count;
    final AtomicInteger version;

    public SpinBarrier(int count) {
        this.count = new AtomicInteger(count);
        this.permits = count;
        this.version = new AtomicInteger();
    }

    public void await() {
        // Spin until either the count reaches zero or the barrier generation changes.
        for (int c = count.decrementAndGet(), v = this.version.get(); c != 0 && v == version.get(); c = count.get()) {
            spinWait();
        }
        if (count.compareAndSet(0, permits)) { // only one thread succeeds here; the rest lose the CAS
            this.version.incrementAndGet();
        }
    }

    protected void spinWait() {
        // Hook for subclasses, e.g. a pause/backoff while spinning.
    }
}
The Java tutorials say that creating a Thread is expensive. But why exactly is it expensive? What exactly is happening when a Java Thread is created that makes its creation expensive? I'm taking the statement as true, but I'm just interested in mechanics of Thread creation in JVM.
Thread lifecycle overhead. Thread creation and teardown are not free. The actual overhead varies across platforms, but thread creation takes time, introducing latency into request processing, and requires some processing activity by the JVM and OS. If requests are frequent and lightweight, as in most server applications, creating a new thread for each request can consume significant computing resources.
From Java Concurrency in Practice
By Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, Doug Lea
Print ISBN-10: 0-321-34960-1
Why is creating a Thread said to be expensive?
Because it >>is<< expensive.
Java thread creation is expensive because there is a fair bit of work involved:
A large block of memory has to be allocated and initialized for the thread stack.
System calls need to be made to create / register the native thread with the host OS.
Descriptors need to be created, initialized and added to JVM-internal data structures.
It is also expensive in the sense that the thread ties down resources as long as it is alive; e.g. the thread stack, any objects reachable from the stack, the JVM thread descriptors, the OS native thread descriptors.
The costs of all of these things are platform specific, but they are not cheap on any Java platform I've ever come across.
A Google search found me an old benchmark that reports a thread creation rate of ~4000 per second on a Sun Java 1.4.1 on a 2002 vintage dual processor Xeon running 2002 vintage Linux. A more modern platform will give better numbers ... and I can't comment on the methodology ... but at least it gives a ballpark for how expensive thread creation is likely to be.
Peter Lawrey's benchmarking indicates that thread creation is significantly faster these days in absolute terms, but it is unclear how much of this is due to improvements in Java and/or the OS, or simply to higher processor speeds. But his numbers still indicate a 150+ fold improvement if you use a thread pool versus creating/starting a new thread each time. (And he makes the point that this is all relative...)
The above assumes native threads rather than green threads, but modern JVMs all use native threads for performance reasons. Green threads are possibly cheaper to create, but you pay for it in other areas.
Update: The OpenJDK Loom project aims to provide a lightweight alternative to standard Java threads, among other things. They are proposing virtual threads, which are a hybrid of native threads and green threads. In simple terms, a virtual thread is rather like a green thread implementation that uses native threads underneath when parallel execution is required.
As of now (Jan 2021) the Project Loom work is still at the prototyping stage, with (AFAIK) no Java version targeted for its release.
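For illustration only: the virtual-thread API as it later stabilized (it shipped in Java 21, after this answer was written) looks roughly like this:
// Creates and starts a virtual thread. When it blocks, it releases
// its carrier (native) thread back to the scheduler.
Thread vt = Thread.startVirtualThread(() -> {
    // blocking code runs here cheaply
});
vt.join(); // (throws InterruptedException)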
I've done a bit of digging to see how a Java thread's stack really gets allocated. In the case of OpenJDK 6 on Linux, the thread stack is allocated by the call to pthread_create that creates the native thread. (The JVM does not pass pthread_create a preallocated stack.)
Then, within pthread_create the stack is allocated by a call to mmap as follows:
mmap(0, attr.__stacksize,
PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
According to man mmap, the MAP_ANONYMOUS flag causes the memory to be initialized to zero.
Thus, even though it might not be essential that new Java thread stacks are zeroed (per the JVM spec), in practice (at least with OpenJDK 6 on Linux) they are zeroed.
Others have discussed where the costs of threading come from. This answer covers why creating a thread is not that expensive compared to many operations, but relatively expensive compared to the alternatives for executing a task, which are cheaper.
The most obvious alternative to running a task in another thread is to run the task in the same thread. This is difficult to grasp for those assuming that more threads are always better. The logic is that if the overhead of adding the task to another thread is greater than the time you save, it can be faster to perform the task in the current thread.
Another alternative is to use a thread pool. A thread pool can be more efficient for two reasons. 1) it reuses threads already created. 2) you can tune/control the number of threads to ensure you have optimal performance.
The following program prints....
Time for a task to complete in a new Thread 71.3 us
Time for a task to complete in a thread pool 0.39 us
Time for a task to complete in the same thread 0.08 us
Time for a task to complete in a new Thread 65.4 us
Time for a task to complete in a thread pool 0.37 us
Time for a task to complete in the same thread 0.08 us
Time for a task to complete in a new Thread 61.4 us
Time for a task to complete in a thread pool 0.38 us
Time for a task to complete in the same thread 0.08 us
This is a test for a trivial task which exposes the overhead of each threading option. (This test task is the sort of task that is actually best performed in the current thread.)
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class ThreadOverheadTest {
    public static void main(String... args) throws InterruptedException {
        final BlockingQueue<Integer> queue = new LinkedBlockingQueue<Integer>();
        Runnable task = new Runnable() {
            @Override
            public void run() {
                queue.add(1);
            }
        };
        for (int t = 0; t < 3; t++) {
            {
                long start = System.nanoTime();
                int runs = 20000;
                for (int i = 0; i < runs; i++)
                    new Thread(task).start();
                for (int i = 0; i < runs; i++)
                    queue.take();
                long time = System.nanoTime() - start;
                System.out.printf("Time for a task to complete in a new Thread %.1f us%n", time / runs / 1000.0);
            }
            {
                int threads = Runtime.getRuntime().availableProcessors();
                ExecutorService es = Executors.newFixedThreadPool(threads);
                long start = System.nanoTime();
                int runs = 200000;
                for (int i = 0; i < runs; i++)
                    es.execute(task);
                for (int i = 0; i < runs; i++)
                    queue.take();
                long time = System.nanoTime() - start;
                System.out.printf("Time for a task to complete in a thread pool %.2f us%n", time / runs / 1000.0);
                es.shutdown();
            }
            {
                long start = System.nanoTime();
                int runs = 200000;
                for (int i = 0; i < runs; i++)
                    task.run();
                for (int i = 0; i < runs; i++)
                    queue.take();
                long time = System.nanoTime() - start;
                System.out.printf("Time for a task to complete in the same thread %.2f us%n", time / runs / 1000.0);
            }
        }
    }
}
As you can see, creating a new thread only costs ~70 µs. This could be considered trivial in many, if not most, use cases. Relatively speaking it is more expensive than the alternatives and for some situations a thread pool or not using threads at all is a better solution.
In theory, this depends on the JVM. In practice, every thread has a relatively large amount of stack memory (256 KB by default, I think). Additionally, threads are implemented as OS threads, so creating them involves an OS call, i.e. a context switch.
Do realize that "expensive" in computing is always very relative. Thread creation is very expensive relative to the creation of most objects, but not very expensive relative to a random hard-disk seek. You don't have to avoid creating threads at all costs, but creating hundreds of them per second is not a smart move. In most cases, if your design calls for lots of threads, you should use a limited-size thread pool.
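A minimal sketch of that advice (the jobs collection is a placeholder):
// A bounded pool: threads are created once and reused for every task,
// so task submission is cheap.
ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
for (Runnable job : jobs)
    pool.execute(job);
pool.shutdown();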
There are two kinds of threads:
Proper threads: these are abstractions around the underlying operating system's threading facilities. Thread creation is, therefore, as expensive as the system's -- there's always an overhead.
"Green" threads: created and scheduled by the JVM, these are cheaper, but no proper paralellism occurs. These behave like threads, but are executed within the JVM thread in the OS. They are not often used, to my knowledge.
The biggest factor I can think of in the thread-creation overhead, is the stack-size you have defined for your threads. Thread stack-size can be passed as a parameter when running the VM.
Other than that, thread creation is mostly OS-dependent, and even VM-implementation-dependent.
Now, let me point something out: creating threads is expensive if you're planning on firing 2000 threads per second, every second of your runtime. The JVM is not designed to handle that. If you'll have a couple of stable workers that won't be fired and killed over and over, relax.
Creating threads requires allocating a fair amount of memory, since a thread has to make not one but two new stacks (one for Java code, one for native code). Using Executors/thread pools can avoid the overhead, by reusing threads for multiple tasks submitted to the Executor.
Obviously the crux of the question is what 'expensive' means.
A thread needs to create a stack and initialize the stack based on the run method.
It needs to set up control status structures, i.e., what state it's in: runnable, waiting, and so on.
There's probably a good deal of synchronization around setting these things up.