Multithreaded application increases runtime with number of threads

Multithreaded application increases runtime with number of threads - java

I am implementing a multithreaded solution of the Barnes-Hut algorithm for the N-Body problem.
Main class does the following
public void runSimulation() {
for(int i = 0; i < numWorkers; i++) {
new Thread(new Worker(i, this, gnumBodies, numSteps)).start();
}
try {
startBarrier.await();
stopBarrier.await();
} catch (Exception e) {e.printStackTrace();}
}
The bh.stop- and bh.startBarrier are CyclicBarriers setting start- and stopTime to System.nanoTime(); when reached (barrier actions).
The workers run method:
public void run() {
try {
bh.startBarrier.await();
for(int j = 0; j < numSteps; j++) {
for(int i = wid; i < gnumBodies; i += bh.numWorkers) {
bh.addForce(i);
bh.moveBody(i);
}
bh.barrier.await();
}
bh.stopBarrier.await();
} catch (Exception e) {e.printStackTrace();}
}
addForce(i) goes through a tree and does some calculations. It does not effect any shared variables, so no synchronization is used. O(NlogN).
moveBody(i) does calculations on one element and no synchronization is used. O(N).
When bh.barrier is reached, a tree with all bodies is built up (barrier action).
Now to the problem. The runtime increases linearly with the number of threads used.
Runtimes for gnumBodies = 240, numSteps = 85000 and four cores:
1 thread = 0.763
2 threads = 0.952
3 threads = 1.261
4 threads = 1.563
Why isn't the runtime decreasing with the number of threads used?
edit: added hardware info

What hardware are you running it on? Running multiple threads has its overhead so it might not be worth while splitting your task into to small sub-task.
Also, try using an ExecutorService instead of thread. That way you can use a thread pool instead of creating an actual thread for each task. There is no use in having more threads that your hardware can handle.
It also look to me like each thread will do the same work. Can this be the case? when creating a worker you are using same parameters each time besides for i.

Multithreading does not increase the execution speed unless you also have multiple CPU cores.
Threads doing math calculations and can run full speed
If you have only one CPU core, it is all the same if you run a calculation in one thread or in multiple threads. Running in multiple threads gives no performance benefit, but comes with an overhead of thread switching, so actually the total performance will be a little worse.
If you have multiple available CPU cores, then the threads can run physically in parallel up to the number of cores. This means 4-8 threads may work well on nowadays desktop hardware.
Threads waiting for IO and getting suspended
Threads make sense if you don't do a mathematical calculation, but do something which involves slow I/O such as network, files, or databases. Instead of hogging the run of your program, while one thread waits for the IO, another thread can use the same CPU core. This is the reason why web servers and database solutions works with more threads than CPU cores.
Avoid unnecessary synchronization
Nevertheless, your measurement shows a synchonization mistake.
I guess you shall remove all xxbarrier.await() from the thread code.
I am unsure what is your goal with the xxBarriers vs. System nanotime, but any unnecessary synchornization can easily result slow performance. Instead of calculating, you're waiting on the xxxBarriers.

Your workers do the same job numWorker times, independently.
The only shared object is your CyclicBarrier. await() waits all parities invoke await on this barrier. With the number of workers are increasing, it spends more time on await()

If you have multiple cores or if hyperthreading is available, then running multiple threads will take the benefit of underlying hardware.
If only one core is present, multi-threading can give a 'perceived' benefit if your application involves atleast one non CPU intensive work like interaction with human. Humans are very slow compared to modern day CPUs. Hence if your application requires to get multiple inputs from human and also process them, it makes sense to do separate the input and calculations in two threads. By the time human will provide an input, part of the calculation can be completed in another thread. Thus the overall improvement in time.
If you application must do calculations and multi-threading support in hardware is not present, it is better to use single thread. Your 'calculations' are already lined up in the pipeline back-to-back and CPU will already be running at (almost) max speed. Multi-threading would require context-switching time which will increase the time taken to do the calculations.

When i ran the application with a larger number of bodies an less steps, the application scaled as expected. So the problem was probably the overhead of the barrier(s)!

Related

Multithreaded block taking more time than single threaded block (Java) [duplicate]

I know the answer is No, here is an example Why single thread is faster than multithreading in Java? .
So when processing a task in a thread is trivial, the cost of creating a thread will create more overhead than distributing the task. This is one case where a single thread will be faster than multithreading.
Questions
Are there more cases where a single thread will be faster than multithreading?
When should we decide to give up multithreading and only use a single thread to accomplish our goal?
Although the question is tagged java, it is also welcome to discuss beyond Java.
It would be great if we could have a small example to explain in the answer.

This is a very good question regarding threading and its link to the real work, meaning the available physical CPU(s) and its cores and hyperthreads.
Multiple threads might allow you to do things in parallel, if your CPU has more than one core available. So in an ideal world, e.g. calulating some primes, might be 4 times faster using 4 threads, if your CPU has 4 cores available and your algorithm work really parallel.
If you start more threads as cores are available, the thread management of your OS will spend more and more time in Thread-Switches and in such your effiency using your CPU(s) becomes worse.
If the compiler, CPU cache and/or runtime realized that you run more than one thread, accessing the same data-area in memory, is operates in a different optimization mode: As long as the compile/runtime is sure that only one thread access the data, is can avoid writing data out to extenral RAM too often and might efficently use the L1 cache of your CPU. If not: Is has to activate semaphores and also flush cached data more often from L1/L2 cache to RAM.
So my lessons learned from highly parrallel multithreading have been:
If possible use single threaded, shared-nothing processes to be more efficient
If threads are required, decouple the shared data access as much as possible
Don't try to allocate more loaded worker threads than available cores if possible
Here a small programm (javafx) to play with. It:
Allocates a byte array of 100.000.000 size, filled with random bytes
Provides a method, counting the number of bits set in this array
The method allow to count every 'nth' bytes bits
count(0,1) will count all bytes bits
count(0,4) will count the 0', 4', 8' byte bits allowing a parallel interleaved counting
Using a MacPro (4 cores) results in:
Running one thread, count(0,1) needs 1326ms to count all 399993625 bits
Running two threads, count(0,2) and count(1,2) in parallel needs 920ms
Running four threads, needs 618ms
Running eight threads, needs 631ms
Changing the way to count, e.g. incrementing a commonly shared integer (AtomicInteger or synchronized) will dramatically change the performance of many threads.
public class MulithreadingEffects extends Application {
static class ParallelProgressBar extends ProgressBar {
AtomicInteger myDoneCount = new AtomicInteger();
int myTotalCount;
Timeline myWhatcher = new Timeline(new KeyFrame(Duration.millis(10), e -> update()));
BooleanProperty running = new SimpleBooleanProperty(false);
public void update() {
setProgress(1.0*myDoneCount.get()/myTotalCount);
if (myDoneCount.get() >= myTotalCount) {
myWhatcher.stop();
myTotalCount = 0;
running.set(false);
}
}
public boolean isRunning() { return myTotalCount > 0; }
public BooleanProperty runningProperty() { return running; }
public void start(int totalCount) {
myDoneCount.set(0);
myTotalCount = totalCount;
setProgress(0.0);
myWhatcher.setCycleCount(Timeline.INDEFINITE);
myWhatcher.play();
running.set(true);
}
public void add(int n) {
myDoneCount.addAndGet(n);
}
}
int mySize = 100000000;
byte[] inData = new byte[mySize];
ParallelProgressBar globalProgressBar = new ParallelProgressBar();
BooleanProperty iamReady = new SimpleBooleanProperty(false);
AtomicInteger myCounter = new AtomicInteger(0);
void count(int start, int step) {
new Thread(""+start){
public void run() {
int count = 0;
int loops = 0;
for (int i = start; i < mySize; i+=step) {
for (int m = 0x80; m > 0; m >>=1) {
if ((inData[i] & m) > 0) count++;
}
if (loops++ > 99) {
globalProgressBar.add(loops);
loops = 0;
}
}
myCounter.addAndGet(count);
globalProgressBar.add(loops);
}
}.start();
}
void pcount(Label result, int n) {
result.setText("("+n+")");
globalProgressBar.start(mySize);
long start = System.currentTimeMillis();
myCounter.set(0);
globalProgressBar.runningProperty().addListener((p,o,v) -> {
if (!v) {
long ms = System.currentTimeMillis()-start;
result.setText(""+ms+" ms ("+myCounter.get()+")");
}
});
for (int t = 0; t < n; t++) count(t, n);
}
void testParallel(VBox box) {
HBox hbox = new HBox();
Label result = new Label("-");
for (int i : new int[]{1, 2, 4, 8}) {
Button run = new Button(""+i);
run.setOnAction( e -> {
if (globalProgressBar.isRunning()) return;
pcount(result, i);
});
hbox.getChildren().add(run);
}
hbox.getChildren().addAll(result);
box.getChildren().addAll(globalProgressBar, hbox);
}
#Override
public void start(Stage primaryStage) throws Exception {
primaryStage.setTitle("ProgressBar's");
globalProgressBar.start(mySize);
new Thread("Prepare"){
public void run() {
iamReady.set(false);
Random random = new Random();
random.setSeed(4711);
for (int i = 0; i < mySize; i++) {
inData[i] = (byte)random.nextInt(256);
globalProgressBar.add(1);
}
iamReady.set(true);
}
}.start();
VBox box = new VBox();
Scene scene = new Scene(box,400,80,Color.WHITE);
primaryStage.setScene(scene);
testParallel(box);
GUIHelper.allowImageDrag(box);
primaryStage.show();
}
public static void main(String[] args) { launch(args); }
}

As already mentionened in a comment by #Jim Mischel, you can use
Amdahl's law
to calculate this.
Amdahl's law states that the speedup gained from adding processors to solve a task is
where
N is the number of processors, and
P is the fraction of the code that can be executed in parallel (0 .. 1)
Now if T is the time it takes to execute the task on a single processor, and O is the total 'overhead' time (create and set up a second thread, communication, ...), a single thread is faster if
T < T/S(2) + O
or, after reordering, if
O/T > P/2
When the ratio Overhead / Execution Time is greater than P/2, a single thread is faster.

Not all algorithms can be processed in parallel (algorithms that are strictly sequential; where P=0 in Amdahl's law) or at least not efficiently (see P-complete). Other algorithms are more suitable for parallel execution (extreme cases are called "embarrassingly parallel").
A naive implementation of a parallel algorithm can be less efficient in terms of complexity or space compared to a similar sequential algorithm. If there is no obvious way to parallelize an algorithm so that it will get a speedup, you may need to choose another similar parallel algorithm that solves the same problem but can be more or less efficient. If you ignore thread/process creation and direct inter-process communication overhead, there can still be other limiting factors when using shared resources like IO bottlenecks or increased paging caused by higher memory consumption.
When should we decide to give up multithreading and only use a single thread to accomplish our goal?
When deciding between single and multithreading, the time needed to change the implementation and the added complexity for developers should be taken into account. If there is only small gain by using multiple threads you could argue that the increased maintenance cost that are usually caused by multi-threaded applications are not worth the speedup.

Threading is about taking advantage of idle resources to handle more work. If you have no idle resources, multi-threading has no advantages, so the overhead would actually make your overall runtime longer.
For example, if you have a collection of tasks to perform and they are CPU-intensive calculations. If you have a single CPU, multi-threading probably wouldn't speed that process up (though you never know until you test). I would expect it to slow down slightly. You are changing how the work is split up, but no changes in capacity. If you have 4 tasks to do on a single CPU, doing them serially is 1 * 4. If you do them in parallel, you'll come out to basically 4 * 1, which is the same. Plus, the overhead of merging results and context switching.
Now, if you have multiple CPU's, then running CPU-intensive tasks in multiple threads would allow you to tap unused resources, so more gets done per unit time.
Also, think about other resources. If you have 4 tasks which query a database, running them in parallel helps if the database has extra resources to handle them all. Though, you are also adding more work, which removes resources from the database server, so I probably wouldn't do that.
Now, let's say we need to make web service calls to 3 external systems and none of the calls have input dependent on each other. Doing them in parallel with multiple threads means that we don't have to wait for one to end before the other starts. It also means that running them in parallel won't negatively impact each task. This would be a great use case for multi-threading.

The overhead may be not only for creation, but for thread-intercommunications. The other thing that should be noted that synchronization of threads on a, for example, single object may lead to alike single thread execution.

Are there more cases where a single thread will be faster than multithreading?
So in a GUI application you will benefit from multithreading. At the most basic level you will be updating the front end as well as what the front end is presenting. If you're running something basic like hello world then like you showed it would be more overhead.
That question is very broad... Do you count Unit Tests as applications? If so then there are probably more applications that use single threads because any complex system will have (hopefully) at least 1 unit test. Do you count every Hello world style program as a different application or the same? If an application is deleted does it still count?
As you can see I can't give a good response other than you would have to narrow the scope of your question to get a meaningful answer. That being said this may be a statistic out there that I'm unaware of.
When should we decide to give up multithreading and only use a single thread to accomplish our goal?
When multithreading will perform 'better' by whatever metric you think is important.
Can your problem be broken into parts that can be processed simultaneously? Not in a contrived way like breaking Hello World into two threads where one thread waits on the other to print. But in a way that 2+ threads would be able to accomplish the task more efficiently than one?
Even if a task is easily parallelizable doesn't mean that it should be. I could multithread an application that trolled thousands of new sites constantly to get me my news. For me personally this would suck because it would eat my pipe up and I wouldn't be able to get my FPS in. For CNN this might be exactly what they want and will build a mainframe to accommodate it.
Can you narrow your questions?

There are 2 scenarios that can happen here :
MultiThreading on Single Core CPU :
1.1 When to use : Multithreading helps when tasks that needs parallelism are IO bound.Threads give up execution while they wait for IO and OS assign the time slice to other waiting threads. Sequential execution do not have the behavior - Multithreads will boost the performance.
1.2 When Not to Use : When the tasks are not IO bound and mere a calculation of something, you might not want to go for multi threading since thread creation and context switching will negate the gain if any. - Multithreads will have least impact.
MultiThreading in Multi Core CPU : Multi core can run as many threads as the number of core in CPU. This will surely have performance boost. But Running higher number of threads than the available cores will again introduce the thread context switching problem. - Multithreads will surely have impact.
Be aware : Also there is limit on adding/introducing number of threads in system. More context switches will negate overall gain and application slows down.

ExecutorService adding delay

I have a very simple program which uses ExecutorService.
I have set the no. of threads to 4, but the time taken is same as that set to 2.
Below is my code:
public class Test {
private static final Logger LOGGER = Logger.getLogger("./sample1.log");
#SuppressWarnings("unchecked")
public static void main(String[] args) throws Throwable {
ExecutorService service = Executors.newFixedThreadPool(4);
Future<String> resultFirst = service.submit(new FirstRequest());
Future<String> resultSecond = service.submit(new SecondRequest());
Future<String> resultThird = service.submit(new ThirdRequest());
Future<String> resultFourth = service.submit(new FourthRequest());
String temp1 = resultSecond.get();
temp1 = temp1.replace("Users", "UsersAppend1");
String temp2 = resultThird.get();
temp2 = temp2.replace("Users", "UsersAppend2");
String temp3 = resultFourth.get();
temp3 = temp3.replace("Users", "UsersAppend3");
//System.out.println(resultFirst.get() + temp1 + temp2 + temp3);
//LOGGER.info("Logger Name: "+LOGGER.getName());
LOGGER.info(resultFirst.get() + temp1 + temp2 + temp3);
service.shutdownNow();
service.awaitTermination(10, TimeUnit.SECONDS);
System.exit(0);
}
}
Here FirstRequest, SecondRequest, ThirdRequest and FourthRequest are different classes which calls another class which is common to all.
I have created distinct objects for the common class so I don't think it's a case of deadlock/Starvation.

You want to start here - meaning: it is actually hard to measure Java execution times in a reasonable way. Chances are that you have an over-simplified view; and thus your measurements aren't telling anything.
And beyond that: you have to understand that "more threads" do not magically reduce overall runtime. It very much depends on what you are doing; and how often for example your threads spend time waiting for IO.
Meaning: "adding" threads only helps when each thread is inactive for "longer" periods of time. When each thread is constantly burning CPU cycles at 100% ... then more threads do not help; to the contrary: then times get worse, because the only thing you do is add overhead for setting up and switching between your tasks.

How many processors do you have ? Time taken depends upon type of task you are performing, type of resources you are using in your tasks etc.
Adding more threads doesn't mean your process will become fast, it just mean that if there are idle processing power available, then java will try to use them.

I have set the no. of threads to 4, but the time taken is same as that set to 2.
This is a very common issue. I've heard of countless times that sometimes works really hard to turn a single-threaded application into a multi-threaded application so find that it doesn't run faster. It actually can run slower because of thread overhead and refactoring issues.
The analogy is a overworked researcher on a project. You could divide their work and give it to 4 graduate students to work concurrently but if they all have to ask the researcher questions all of the time then the project isn't going to go any faster and the lack of coordination between the 4 graduate students can make the project take even longer.
The only times adding additional threads to an application will definitively improve the throughput is when the threads are:
completely independent – i.e. not sharing data (or not a lot) between threads which is when they block or have to synchronize memory
completely CPU bound – i.e. doing only data processing and not waiting on disk or network IO or other system resources
able to make use of additional system CPUs
Here FirstRequest, SecondRequest, ThirdRequest and FourthRequest are different classes which calls another class which is common to all.
Right, this is a red flag. The First/Second/... Request classes may be able to work concurrently but if they have to call synchronized blocks in the common class then they will block each other. One of the tricky things with threaded programming is how to limit the data sharing while still accomplishing their tasks. If you show more of the common class then we might be able to help you more.
The other big red flag is any amount of IO. Watch out for reading and writing to disk or network. Watch also for logging or other output. It is often efficient to have one thread reading from disk, a number of threads processing what is read, and one thread writing. But even then, if the processing isn't very CPU intensive, you might see little to no speed improvement because the speed of the application is limited by the IO bandwidth of the disk device.
Web applications that spawn a thread for every request to be handled can be efficient because they are handling so many network IO bound requests simultaneously. But the real win is the code improvements that come from the thread being able to concentrate on its request and the thread subsystems will then take care of the context switching between requests when the thread is blocked waiting on network or disk IO.

Does multithreading always yield better performance than single threading?

I know the answer is No, here is an example Why single thread is faster than multithreading in Java? .
So when processing a task in a thread is trivial, the cost of creating a thread will create more overhead than distributing the task. This is one case where a single thread will be faster than multithreading.
Questions
Are there more cases where a single thread will be faster than multithreading?
When should we decide to give up multithreading and only use a single thread to accomplish our goal?
Although the question is tagged java, it is also welcome to discuss beyond Java.
It would be great if we could have a small example to explain in the answer.

This is a very good question regarding threading and its link to the real work, meaning the available physical CPU(s) and its cores and hyperthreads.
Multiple threads might allow you to do things in parallel, if your CPU has more than one core available. So in an ideal world, e.g. calulating some primes, might be 4 times faster using 4 threads, if your CPU has 4 cores available and your algorithm work really parallel.
If you start more threads as cores are available, the thread management of your OS will spend more and more time in Thread-Switches and in such your effiency using your CPU(s) becomes worse.
If the compiler, CPU cache and/or runtime realized that you run more than one thread, accessing the same data-area in memory, is operates in a different optimization mode: As long as the compile/runtime is sure that only one thread access the data, is can avoid writing data out to extenral RAM too often and might efficently use the L1 cache of your CPU. If not: Is has to activate semaphores and also flush cached data more often from L1/L2 cache to RAM.
So my lessons learned from highly parrallel multithreading have been:
If possible use single threaded, shared-nothing processes to be more efficient
If threads are required, decouple the shared data access as much as possible
Don't try to allocate more loaded worker threads than available cores if possible
Here a small programm (javafx) to play with. It:
Allocates a byte array of 100.000.000 size, filled with random bytes
Provides a method, counting the number of bits set in this array
The method allow to count every 'nth' bytes bits
count(0,1) will count all bytes bits
count(0,4) will count the 0', 4', 8' byte bits allowing a parallel interleaved counting
Using a MacPro (4 cores) results in:
Running one thread, count(0,1) needs 1326ms to count all 399993625 bits
Running two threads, count(0,2) and count(1,2) in parallel needs 920ms
Running four threads, needs 618ms
Running eight threads, needs 631ms
Changing the way to count, e.g. incrementing a commonly shared integer (AtomicInteger or synchronized) will dramatically change the performance of many threads.
public class MulithreadingEffects extends Application {
static class ParallelProgressBar extends ProgressBar {
AtomicInteger myDoneCount = new AtomicInteger();
int myTotalCount;
Timeline myWhatcher = new Timeline(new KeyFrame(Duration.millis(10), e -> update()));
BooleanProperty running = new SimpleBooleanProperty(false);
public void update() {
setProgress(1.0*myDoneCount.get()/myTotalCount);
if (myDoneCount.get() >= myTotalCount) {
myWhatcher.stop();
myTotalCount = 0;
running.set(false);
}
}
public boolean isRunning() { return myTotalCount > 0; }
public BooleanProperty runningProperty() { return running; }
public void start(int totalCount) {
myDoneCount.set(0);
myTotalCount = totalCount;
setProgress(0.0);
myWhatcher.setCycleCount(Timeline.INDEFINITE);
myWhatcher.play();
running.set(true);
}
public void add(int n) {
myDoneCount.addAndGet(n);
}
}
int mySize = 100000000;
byte[] inData = new byte[mySize];
ParallelProgressBar globalProgressBar = new ParallelProgressBar();
BooleanProperty iamReady = new SimpleBooleanProperty(false);
AtomicInteger myCounter = new AtomicInteger(0);
void count(int start, int step) {
new Thread(""+start){
public void run() {
int count = 0;
int loops = 0;
for (int i = start; i < mySize; i+=step) {
for (int m = 0x80; m > 0; m >>=1) {
if ((inData[i] & m) > 0) count++;
}
if (loops++ > 99) {
globalProgressBar.add(loops);
loops = 0;
}
}
myCounter.addAndGet(count);
globalProgressBar.add(loops);
}
}.start();
}
void pcount(Label result, int n) {
result.setText("("+n+")");
globalProgressBar.start(mySize);
long start = System.currentTimeMillis();
myCounter.set(0);
globalProgressBar.runningProperty().addListener((p,o,v) -> {
if (!v) {
long ms = System.currentTimeMillis()-start;
result.setText(""+ms+" ms ("+myCounter.get()+")");
}
});
for (int t = 0; t < n; t++) count(t, n);
}
void testParallel(VBox box) {
HBox hbox = new HBox();
Label result = new Label("-");
for (int i : new int[]{1, 2, 4, 8}) {
Button run = new Button(""+i);
run.setOnAction( e -> {
if (globalProgressBar.isRunning()) return;
pcount(result, i);
});
hbox.getChildren().add(run);
}
hbox.getChildren().addAll(result);
box.getChildren().addAll(globalProgressBar, hbox);
}
#Override
public void start(Stage primaryStage) throws Exception {
primaryStage.setTitle("ProgressBar's");
globalProgressBar.start(mySize);
new Thread("Prepare"){
public void run() {
iamReady.set(false);
Random random = new Random();
random.setSeed(4711);
for (int i = 0; i < mySize; i++) {
inData[i] = (byte)random.nextInt(256);
globalProgressBar.add(1);
}
iamReady.set(true);
}
}.start();
VBox box = new VBox();
Scene scene = new Scene(box,400,80,Color.WHITE);
primaryStage.setScene(scene);
testParallel(box);
GUIHelper.allowImageDrag(box);
primaryStage.show();
}
public static void main(String[] args) { launch(args); }
}

As already mentionened in a comment by #Jim Mischel, you can use
Amdahl's law
to calculate this.
Amdahl's law states that the speedup gained from adding processors to solve a task is
where
N is the number of processors, and
P is the fraction of the code that can be executed in parallel (0 .. 1)
Now if T is the time it takes to execute the task on a single processor, and O is the total 'overhead' time (create and set up a second thread, communication, ...), a single thread is faster if
T < T/S(2) + O
or, after reordering, if
O/T > P/2
When the ratio Overhead / Execution Time is greater than P/2, a single thread is faster.

Not all algorithms can be processed in parallel (algorithms that are strictly sequential; where P=0 in Amdahl's law) or at least not efficiently (see P-complete). Other algorithms are more suitable for parallel execution (extreme cases are called "embarrassingly parallel").
A naive implementation of a parallel algorithm can be less efficient in terms of complexity or space compared to a similar sequential algorithm. If there is no obvious way to parallelize an algorithm so that it will get a speedup, you may need to choose another similar parallel algorithm that solves the same problem but can be more or less efficient. If you ignore thread/process creation and direct inter-process communication overhead, there can still be other limiting factors when using shared resources like IO bottlenecks or increased paging caused by higher memory consumption.
When should we decide to give up multithreading and only use a single thread to accomplish our goal?
When deciding between single and multithreading, the time needed to change the implementation and the added complexity for developers should be taken into account. If there is only small gain by using multiple threads you could argue that the increased maintenance cost that are usually caused by multi-threaded applications are not worth the speedup.

The overhead may be not only for creation, but for thread-intercommunications. The other thing that should be noted that synchronization of threads on a, for example, single object may lead to alike single thread execution.

Are there more cases where a single thread will be faster than multithreading?
So in a GUI application you will benefit from multithreading. At the most basic level you will be updating the front end as well as what the front end is presenting. If you're running something basic like hello world then like you showed it would be more overhead.
That question is very broad... Do you count Unit Tests as applications? If so then there are probably more applications that use single threads because any complex system will have (hopefully) at least 1 unit test. Do you count every Hello world style program as a different application or the same? If an application is deleted does it still count?
As you can see I can't give a good response other than you would have to narrow the scope of your question to get a meaningful answer. That being said this may be a statistic out there that I'm unaware of.
When should we decide to give up multithreading and only use a single thread to accomplish our goal?
When multithreading will perform 'better' by whatever metric you think is important.
Can your problem be broken into parts that can be processed simultaneously? Not in a contrived way like breaking Hello World into two threads where one thread waits on the other to print. But in a way that 2+ threads would be able to accomplish the task more efficiently than one?
Even if a task is easily parallelizable doesn't mean that it should be. I could multithread an application that trolled thousands of new sites constantly to get me my news. For me personally this would suck because it would eat my pipe up and I wouldn't be able to get my FPS in. For CNN this might be exactly what they want and will build a mainframe to accommodate it.
Can you narrow your questions?

There are 2 scenarios that can happen here :
MultiThreading on Single Core CPU :
1.1 When to use : Multithreading helps when tasks that needs parallelism are IO bound.Threads give up execution while they wait for IO and OS assign the time slice to other waiting threads. Sequential execution do not have the behavior - Multithreads will boost the performance.
1.2 When Not to Use : When the tasks are not IO bound and mere a calculation of something, you might not want to go for multi threading since thread creation and context switching will negate the gain if any. - Multithreads will have least impact.
MultiThreading in Multi Core CPU : Multi core can run as many threads as the number of core in CPU. This will surely have performance boost. But Running higher number of threads than the available cores will again introduce the thread context switching problem. - Multithreads will surely have impact.
Be aware : Also there is limit on adding/introducing number of threads in system. More context switches will negate overall gain and application slows down.

Understanding Threads + Asynchronous

So I have a program that I made that needs to send a lot (like 10,000+) of GET requests to a URL and I need it to be as fast as possible. When I first created the program I just put the connections into a for loop but it was really slow because it would have to wait for each connection to complete before continuing. I wanted to make it faster so I tried using threads and it made it somewhat faster but I am still not satisfied.
I'm guessing the correct way to go about this and making it really fast is using an asynchronous connection and connecting to all of the URLs. Is this the right approach?
Also, I have been trying to understand threads and how they work but I can't seem to get it. The computer I am on has an Intel Core i7-3610QM quad-core processor. According to Intel's website for the specifications for this processor, it has 8 threads. Does this mean I can create 8 threads in a Java application and they will all run concurrently? Any more than 8 and there will be no speed increase?
What exactly does the number represent next to "Threads" in the task manager under the "Performance" tab? Currently, my task manager is showing "Threads" as over 1,000. Why is it this number and how can it even go past 8 if that's all my processor supports?
I also noticed that when I tried my program with 500 threads as a test, the number in the task manager increased by 500 but it had the same speed as if I set it to use 8 threads instead. So if the number is increasing according to the number of threads I am using in my Java application, then why is the speed the same?
Also, I have tried doing a small test with threads in Java but the output doesn't make sense to me.
Here is my Test class:
import java.text.SimpleDateFormat;
import java.util.Date;
public class Test {
private static int numThreads = 3;
private static int numLoops = 100000;
private static SimpleDateFormat dateFormat = new SimpleDateFormat("[hh:mm:ss] ");
public static void main(String[] args) throws Exception {
for (int i=1; i<=numThreads; i++) {
final int threadNum = i;
new Thread(new Runnable() {
public void run() {
System.out.println(dateFormat.format(new Date()) + "Start of thread: " + threadNum);
for (int i=0; i<numLoops; i++)
for (int j=0; j<numLoops; j++);
System.out.println(dateFormat.format(new Date()) + "End of thread: " + threadNum);
}
}).start();
Thread.sleep(2000);
}
}
}
This produces an output such as:
[09:48:51] Start of thread: 1
[09:48:53] Start of thread: 2
[09:48:55] Start of thread: 3
[09:48:55] End of thread: 3
[09:48:56] End of thread: 1
[09:48:58] End of thread: 2
Why does the third thread start and end right away while the first and second take 5 seconds each? If I add more that 3 threads, the same thing happens for all threads above 2.
Sorry if this was a long read, I had a lot of questions.
Thanks in advance.

Your processor has 8 cores, not threads. This does in fact mean that only 8 things can be running at any given moment. That doesn't mean that you are limited to only 8 threads however.
When a thread is synchronously opening a connection to a URL it will often sleep while it waits for the remote server to get back to it. While that thread is sleeping other threads can be doing work. If you have 500 threads and all 500 are sleeping then you aren't using any of the cores of your CPU.
On the flip side, if you have 500 threads and all 500 threads want to do something then they can't all run at once. To handle this scenario there is a special tool. Processors (or more likely the operating system or some combination of the two) have a scheduler which determines which threads get to be actively running on the processor at any given time. There are many different rules and sometimes random activity that controls how these schedulers work. This may explain why in the above example thread 3 always seems to finish first. Perhaps the scheduler is preferring thread 3 because it was the most recent thread to be scheduled by the main thread, it can be impossible to predict the behavior sometimes.
Now to answer your question regarding performance. If opening a connection never involved a sleep then it wouldn't matter if you were handling things synchronously or asynchronously you would not be able to get any performance gain above 8 threads. In reality, a lot of the time involved in opening a connection is spent sleeping. The difference between asynchronous and synchronous is how to handle that time spent sleeping. Theoretically you should be able to get nearly equal performance between the two.
With a multi-threaded model you simply create more threads than there are cores. When the threads hit a sleep they let the other threads do work. This can sometimes be easier to handle because you don't have to write any scheduling or interaction between the threads.
With an asynchronous model you only create a single thread per core. If that thread needs to sleep then it doesn't sleep but actually has to have code to handle switching to the next connection. For example, assume there are three steps in opening a connection (A,B,C):
while (!connectionsList.isEmpty()) {
for(Connection connection : connectionsList) {
if connection.getState() == READY_FOR_A {
connection.stepA();
//this method should return immediately and the connection
//should go into the waiting state for some time before going
//into the READY_FOR_B state
}
if connection.getState() == READY_FOR_B {
connection.stepB();
//same immediate return behavior as above
}
if connection.getState() == READY_FOR_C {
connection.stepC();
//same immediate return behavior as above
}
if connection.getState() == WAITING {
//Do nothing, skip over
}
if connection.getState() == FINISHED {
connectionsList.remove(connection);
}
}
}
Notice that at no point does the thread sleep so there is no point in having more threads than you have cores. Ultimately, whether to go with a synchronous approach or an asynchronous approach is a matter of personal preference. Only at absolute extremes will there be performance differences between the two and you will need to spend a long time profiling to get to the point where that is the bottleneck in your application.
It sounds like you're creating a lot of threads and not getting any performance gain. There could be a number of reasons for this.
It's possible that your establishing a connection isn't actually sleeping in which case I wouldn't expect to see a performance gain past 8 threads. I don't think this is likely.
It's possible that all of the threads are using some common shared resource. In this case the other threads can't work because the sleeping thread has the shared resource. Is there any object that all of the threads share? Does this object have any synchronized methods?
It's possible that you have your own synchronization. This can create the issue mentioned above.
It's possible that each thread has to do some kind of setup/allocation work that is defeating the benefit you are gaining by using multiple threads.
If I were you I would use a tool like JVisualVM to profile your application when running with some smallish number of threads (20). JVisualVM has a nice colored thread graph which will show when threads are running, blocking, or sleeping. This will help you understand the thread/core relationship as you should see that the number of running threads is less than the number of cores you have. In addition if you see a lot of blocked threads then that can help lead you to your bottleneck (if you see a lot of blocked threads use JVisualVM to create a thread dump at that point in time and see what the threads are blocked on).

Some concepts:
You can have many threads in the system, but only some of them (max 8 in your case) will be "scheduled" on the CPU at any point of time. So, you cannot get more performance than 8 threads running in parallel. In fact the performance will probably go down as you increase the number of threads, because of the work involved in creating, destroying and managing threads.
Threads can be in different states : http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
Out of those states, the RUNNABLE threads stand to get a slice of CPU time. Operating System decides assignment of CPU time to threads. In a regular system with 1000's of threads, it can be completely unpredictable when a certain thread will get CPU time and how long it will be on CPU.
About the problem you are solving:
You seem to have figured out the correct solution - making parallel asynchronous network requests. However, practically speaking starting 10000+ threads and that many network connections, at the same time, may be a strain on the system resources and it may just not work. This post has many suggestions for asynchronous I/O using Java. (Tip: Don't just look at the accepted answer)

This solution is more specific to the general problem of trying to make 10k requests as fast as possible. I would suggest that you abandon the Java HTTP libraries and use Apache's HttpClient instead. They have several suggestions for maximizing performance which may be useful. I have heard the Apache HttpClient library is just faster in general as well, lighter weight and less overhead.

Java - multithreaded code does not run faster on more cores

I was just running some multithreaded code on a 4-core machine in the hopes that it would be faster than on a single-core machine. Here's the idea: I got a fixed number of threads (in my case one thread per core). Every thread executes a Runnable of the form:
private static int[] data; // data shared across all threads
public void run() {
int i = 0;
while (i++ < 5000) {
// do some work
for (int j = 0; j < 10000 / numberOfThreads) {
// each thread performs calculations and reads from and
// writes to a different part of the data array
}
// wait for the other threads
barrier.await();
}
}
On a quadcore machine, this code performs worse with 4 threads than it does with 1 thread. Even with the CyclicBarrier's overhead, I would have thought that the code should perform at least 2 times faster. Why does it run slower?
EDIT: Here's a busy wait implementation I tried. Unfortunately, it makes the program run slower on more cores (also being discussed in a separate question here):
public void run() {
// do work
synchronized (this) {
if (atomicInt.decrementAndGet() == 0) {
atomicInt.set(numberOfOperations);
for (int i = 0; i < threads.length; i++)
threads[i].interrupt();
}
}
while (!Thread.interrupted()) {}
}

Adding more threads is not necessarily guarenteed to improve performance. There are a number of possible causes for decreased performance with additional threads:
Coarse-grained locking may overly serialize execution - that is, a lock may result in only one thread running at a time. You get all the overhead of multiple threads but none of the benefits. Try to reduce how long locks are held.
The same applies to overly frequent barriers and other synchronization structures. If the inner j loop completes quickly, you might spend most of your time in the barrier. Try to do more work between synchronization points.
If your code runs too quickly, there may be no time to migrate threads to other CPU cores. This usually isn't a problem unless you create a lot of very short-lived threads. Using thread pools, or simply giving each thread more work can help. If your threads run for more than a second or so each, this is unlikely to be a problem.
If your threads are working on a lot of shared read/write data, cache line bouncing may decrease performance. That said, although this often results in performance degradation, this alone is unlikely to result in performance worse than the single threaded case. Try to make sure the data that each thread writes is separated from other threads' data by the size of a cache line (usually around 64 bytes). In particular, don't have output arrays laid out like [thread A, B, C, D, A, B, C, D ...]
Since you haven't shown your code, I can't really speak in any more detail here.

You're sleeping nano-seconds instead of milli-seconds.
I changed from
Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads
to
Thread.sleep(100000 / numberOfThreads);
and got a speed-up proportional to the number of threads started just as expected.
I invented a CPU-intensive "countPrimes". Full test code available here.
I get the following speed-up on my quad-core machine:
4 threads: 1625
1 thread: 3747
(the CPU-load monitor indeed shows that 4 course are busy in the former case, and that 1 core is busy in the latter case.)
Conclusion: You're doing comparatively small portions of work in each thread between synchronization. The synchronization takes much much more time than the actual CPU-intensive computation work.
(Also, if you have memory intensive code, such as tons of array-accesses in the threads, the CPU won't be the bottle-neck anyway, and you won't see any speed-up by splitting it on multiple CPUs.)

The code inside runnable does not actually do anything.
In your specific example of 4 threads each thread will sleep for 2.5 seconds and wait for the others via the barier.
So all that is happening is that each thread gets on the processor to increment i and then blocks for sleep leaving processor available.
I do not see why the scheduler would alocate each thread to a separate core since all that is happening is that the threads mostly wait.
It is fair and reasonable to expect to just to use the same core and switch among threads
UPDATE
Just saw that you updated post saying that some work is happening in the loop. What is happening though you do not say.

synchronizing across cores is much slower than syncing on a single core
because on a single cored machine the JVM doesn't flush the cache (a very slow operation) during each sync
check out this blog post

Here is a not tested SpinBarrier but it should work.
Check if that may have any improvement on the case. Since you run the code in loop extra sync only hurt performance if you have the cores on idle.
Btw, I still believe you have a bug in the calc, memory intense operation. Can you tell
what CPU+OS you use.
Edit, forgot the version out.
import java.util.concurrent.atomic.AtomicInteger;
public class SpinBarrier {
final int permits;
final AtomicInteger count;
final AtomicInteger version;
public SpinBarrier(int count){
this.count = new AtomicInteger(count);
this.permits= count;
this.version = new AtomicInteger();
}
public void await(){
for (int c = count.decrementAndGet(), v = this.version.get(); c!=0 && v==version.get(); c=count.get()){
spinWait();
}
if (count.compareAndSet(0, permits)){;//only one succeeds here, the rest will lose the CAS
this.version.incrementAndGet();
}
}
protected void spinWait() {
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.