Using YourKit, I profiled an application and identified the main CPU sink. I structured the computation to parallelize it via an ExecutorService with a fixed number of threads.
On a 24-core machine, the benefit of adding threads trails off very fast above 4. So, thought I, there must be some contention or locking going on here, or I/O latency, or something.
OK, I turned on the 'Monitor Usage' feature of YourKit, and the amount of blocked time shown in the worker threads is trivial. Eyeballing the thread state chart, the worker threads are nearly all 'green' (running) as opposed to yellow (waiting) or red (blocked).
CPU profiling still shows 96% of the time in a call tree that is inside the worker threads.
So something is using up real time. Could it be scheduling overhead?
In pseudo-code, you might model this as:
loop over blobs:
    submit tasks for the blob via invokeAll on the executor
    do some single-threaded processing on the results
end loop
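In actual Java it looks something like this sketch (Blob, processPart, and the toy data are hypothetical stand-ins for my real types):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlobLoop {
    // Hypothetical stand-in for the real blob type.
    record Blob(List<Integer> parts) {}

    static long processPart(int part) {
        long x = 0;
        for (int i = 0; i < 1_000_000; i++) x += (long) part * i; // CPU-bound work
        return x;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Blob> blobs = List.of(new Blob(List.of(1, 2, 3)), new Blob(List.of(4, 5, 6)));

        for (Blob blob : blobs) {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int part : blob.parts()) {
                tasks.add(() -> processPart(part)); // ~13 tasks per blob in the real run
            }
            // invokeAll blocks until every task for this blob has finished,
            // so the single-threaded step below never overlaps the parallel step.
            List<Future<Long>> results = executor.invokeAll(tasks);
            System.out.println("blob done: " + results.size() + " results");
        }
        executor.shutdown();
    }
}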
In a test run, there are ~680 blobs and ~13 tasks/blob. So each thread (assuming four) runs about 3 tasks per blob.
Hardware: I've run tests on a small scale on my MacBook Pro, and then on a big fat Dell. hwinfo --cpu on Linux there reports 24 entries, each reading:
Intel(R) Xeon(R) CPU X5680 @ 3.33GHz
Intel's website tells me that each X5680 has 6 cores / 12 threads; I suspect I have 4 of them.
Assuming you have 4 cores with 8 logical threads each, this means you have 4 real processing units which can be shared across 32 threads. It also means that when you have 2-8 active threads on the same core, they have to compete for resources such as the CPU pipeline and the instruction and data caches.
This works best when you have many threads which have to wait for external resources like disk or network I/O. If you have CPU-intensive processes, you may find that one thread per core uses all the CPU power you have.
I have written a library which supports allocating threads to cores on Linux and Windows. If you have Solaris it may be easy to port, as it supports JNI POSIX calls and JNA calls.
https://github.com/peter-lawrey/Java-Thread-Affinity
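Usage, as far as I recall from the project's README, is roughly the following; double-check the current API against the repository:

import net.openhft.affinity.AffinityLock;

public class AffinityDemo {
    public static void main(String[] args) {
        // Pin the current thread to a free core so the OS scheduler cannot
        // migrate it between cores. (API as recalled from the README;
        // verify against the repo before relying on it.)
        try (AffinityLock lock = AffinityLock.acquireLock()) {
            long x = 0;
            for (long i = 0; i < 1_000_000_000L; i++) x += i; // CPU-bound work
            System.out.println(x);
        }
    }
}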
It's most likely not contention, though it's hard to say without more details. Profiling results can be misleading because Java reports threads as RUNNABLE when they're blocked on disk or network I/O, and YourKit still counts that as CPU time.
Your best bet is to turn on CPU profiling and drill into what's taking the time in the worker threads. If it ends up mostly in java.io classes, you've still got disk or network latency.
You have not completely parallelized the processing: if you don't submit the next blob's tasks until the results of the previous blob have been processed, the single-threaded step serializes the pipeline.
If you can, try it this way:
for each blob {
    create a runnable for the blob processing; name it blobProcessor;
    create a runnable for the blob's results; name it resultsProcessor;
    submit blobProcessor;
    when blobProcessor finishes, have it submit resultsProcessor (so the loop never blocks);
}
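A sketch of that overlap using CompletableFuture (processBlob and processResults are hypothetical stand-ins, and the per-blob fan-out into ~13 tasks is elided for brevity): each blob's parallel work is chained to its own serial result step, so the loop can submit the next blob immediately.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OverlappedPipeline {
    // Hypothetical stand-ins for the real per-blob work:
    static long processBlob(int blob) { return blob * 1_000L; }
    static void processResults(long r) { System.out.println("result: " + r); }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        // A dedicated thread keeps the serial result step off the workers.
        ExecutorService resultThread = Executors.newSingleThreadExecutor();

        List<Integer> blobs = List.of(1, 2, 3, 4);
        CompletableFuture<?>[] pending = blobs.stream()
                .map(blob -> CompletableFuture
                        .supplyAsync(() -> processBlob(blob), workers)  // parallel part
                        .thenAcceptAsync(OverlappedPipeline::processResults,
                                         resultThread))                 // serial part
                .toArray(CompletableFuture[]::new);

        CompletableFuture.allOf(pending).join(); // wait for every blob's pipeline
        workers.shutdown();
        resultThread.shutdown();
    }
}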
Also, please take a look at JetLang, which provides threadless concurrency using fibers.
Related
Say I have a processor which reports # cores = 4, # threads = 4, without Hyper-Threading support.
Does that mean I can run 4 simultaneous programs/processes (since a core is capable of running only one thread)?
Or does that mean I can run 4 x 4 = 16 programs/processes simultaneously?
From my digging, without Hyper-Threading there is only 1 thread (process) per core. Correct me if I am wrong.
A thread differs from a process: a process can have many threads. A thread is a sequence of instructions executed in a certain order. A logical core can execute one such sequence at a time. The operating system distributes all the threads across the available logical cores, and if there are more threads than cores, threads are placed in a queue and each core switches between them very quickly.
It will look like all the threads run simultaneously, when actually the OS distributes CPU time among them.
Having multiple cores means fewer concurrent threads are placed on any single core; less switching between threads = greater speed.
Hyper-Threading creates 2 logical cores on 1 physical core and makes switching between threads much faster.
That's basically correct, with the obvious qualifier that most operating systems let you execute far more tasks simultaneously than there are cores or threads, which they accomplish by interleaving the execution of instructions.
A system with hyperthreading generally has twice as many hardware threads as physical cores.
The term thread generally describes an operating-system concept that has the potential to execute independently of other threads. Whether it does so depends on whether it is stuck waiting for some event (disk or screen I/O, a message queue), and on whether there are enough physical CPUs (hyperthreaded or not) to allow it to run alongside other non-waiting threads.
Hyperthreading is a CPU vendor term for a single core that can multiplex its attention between two computations. The easy way to think about a hyperthreaded core is as two real CPUs, both slightly slower than what the manufacturer says the core can actually do.
Basically this is up to the OS. A thread is a high-level construct holding an instruction pointer, and the OS places a thread's execution on a suitable logical processor. So with 4 cores you can execute 4 instruction streams in parallel, whereas a thread simply contains information about which instructions to execute and where those instructions live in memory.
An application normally uses a single process during execution, and the OS switches between processes to give all processes "equal" processor time. When an application deploys multiple threads, the process allocates more than one slot for execution but shares memory between the threads.
Normally one distinguishes between concurrent and parallel execution: parallel execution is when you physically execute instructions on more than one logical processor, while concurrent execution is the frequent switching of a single logical processor giving the appearance of parallel execution.
I just had a quick question about how processors and threads work. According to my current understanding, a core can only perform 1 process at a time. But we are able to create a thread pool (let's say 30) larger than the number of cores that we possess (let's say 4) and have them run concurrently. How is this possible if we only have 4 cores? I am also able to run my 30-thread program on my local computer and still perform other activities such as watching movies or browsing the internet.
I have read somewhere that scheduling of threads occurs, and that sort of gives the illusion that these 30 threads are running concurrently on the 4 cores. Is this true, and if so, can someone explain how this works and also recommend some good reading on it?
Thanks in advance for the help.
Processes vs Threads
In days of old, each process had precisely one thread of execution, so processes were scheduled onto cores directly (and in those old days, there was almost always only one core to schedule onto). However, in operating systems that support threading (which is almost all modern OSes), it is threads, not processes, that are scheduled. So for the rest of this discussion we will talk exclusively about threads, and you should understand that each running process has one or more threads of execution.
Parallelism vs Concurrency
When two threads are running in parallel, they are both running at the same time. For example, if we have two threads, A and B, then their parallel execution would look like this:
CPU 1: A ------------------------->
CPU 2: B ------------------------->
When two threads are running concurrently, their execution overlaps. Overlapping can happen in one of two ways: either the threads are executing at the same time (i.e. in parallel, as above), or their executions are being interleaved on the processor, like so:
CPU 1: A -----------> B ----------> A -----------> B ---------->
So, for our purposes, parallelism can be thought of as a special case of concurrency.*
Scheduling
But we are able to create a thread pool (let's say 30) larger than the number of cores that we possess (let's say 4) and have them run concurrently. How is this possible if we only have 4 cores?
In this case, they can run concurrently because the CPU scheduler is giving each one of those 30 threads some share of CPU time. Some threads will be running in parallel (if you have 4 cores, then 4 threads will be running in parallel at any one time), but all 30 threads will be running concurrently. The reason you can then go play games or browse the web is that those new threads are added to the scheduler's run queue and also given a share of CPU time.
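As a small demonstration of this (a sketch; the task sizes are arbitrary): submit 30 CPU-bound tasks to a 30-thread pool on a 4-core machine. Only about 4 execute at any instant, yet all 30 complete, because the scheduler interleaves them.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThirtyThreadsDemo {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("Logical cores: " + Runtime.getRuntime().availableProcessors());

        ExecutorService pool = Executors.newFixedThreadPool(30);
        CountDownLatch done = new CountDownLatch(30);
        for (int i = 0; i < 30; i++) {
            final int id = i;
            pool.submit(() -> {
                // Burn some CPU. Only ~4 of these run at any instant on a
                // 4-core box, but the OS scheduler gives every thread a
                // time slice, so all 30 make progress "concurrently".
                long x = 0;
                for (long j = 0; j < 100_000_000L; j++) x += j;
                System.out.println("task " + id + " finished (" + x + ")");
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
    }
}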
Logical vs Physical Cores
According to my current understanding, a core can only perform 1 process at a time
This is not quite true. Due to very clever hardware design and pipelining that would take much too long to go into here (plus I don't understand it), it is possible for one physical core to be executing two completely different threads of execution at the same time. Chew over that sentence a bit if you need to -- it still blows my mind.
This amazing feat is called simultaneous multi-threading (or, popularly, Hyper-Threading, although that is a proprietary name for a specific vendor's instance of the technology). Thus we have physical cores, which are the actual hardware CPU cores, and logical cores, which are the cores the operating system tells software it has available for use. Logical cores are essentially an abstraction. In typical modern Intel CPUs, each physical core acts as two logical cores.
can someone explain how this works and also recommend some good reading on this?
I would recommend Operating System Concepts if you really want to understand how processes, threads, and scheduling all work together.
*The precise meanings of the terms parallel and concurrent are hotly debated, even here on our very own Stack Overflow. What one means by these terms depends a lot on the application domain.
Java does not perform thread scheduling; it leaves scheduling to the operating system.
For computationally intensive tasks, it is recommended to have a thread pool size equal to the number of cores available. For I/O-bound tasks, you should have a larger number of threads. There are many other variations, for example when both types of task are present and need CPU time slices.
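A sketch of that rule of thumb (the I/O multiplier here is an arbitrary illustration, not a universal constant):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // CPU-bound work: one thread per logical core.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

        // I/O-bound work: threads spend most of their time blocked,
        // so a larger pool keeps the cores busy while some threads wait.
        ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

        System.out.println("cores = " + cores);
        cpuPool.shutdown();
        ioPool.shutdown();
    }
}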
a core can only perform 1 process at a time
Yes, but cores can multitask and create the illusion that they are processing more than one process at a time.
How is this possible if we only have 4 cores? I am also able to run my 30-thread program on my local computer and also continue to perform other activities on my computer
This is possible due to multitasking (which is concurrency). Let's say you started 30 threads and the OS is also running 50 threads; all 80 threads will share the 4 CPU cores by getting CPU time slices one by one (one thread per core at a time). This means on average each core will run 80/4 = 20 threads concurrently, and you will feel as if all the threads/processes are running at the same time.
can someone explain how this works
All of this happens at the OS level. If you are a programmer, you should not need to worry about it. But if you are a student of operating systems, pick any OS book and learn about multi-threading at the OS level in detail, or find a good research paper for depth. One thing you should know: each OS handles these things in a different way (but the general concepts are the same).
There are some languages, like Erlang, which use green threads (or green processes); these give them the ability to map and schedule threads on their own, bypassing the OS scheduler. So do some research on green threads as well if you are interested.
Note: you can also research actors, which are another abstraction over threads. Languages like Erlang and Scala use actors to accomplish tasks. One thread can host hundreds of actors; each actor can perform a different task (similar to threads in Java).
This is a very vast and active research topic and there are many things to learn.
In short, your understanding of a core is correct: a core can execute 1 thread at a time.
However, your program doesn't really run all 30 threads at once. Of those 30 threads, only 4 are running at any instant, and the other 26 are waiting. The CPU scheduler gives each thread a slice of time to run on a core, so all the threads take turns running.
A common misconception:
Having more threads will make my program run faster.
FALSE: Having more threads will NOT always make your program run faster. It just means the CPU has to do more switching; in fact, having too many threads will make your program run slower because of the overhead of switching between all the different threads.
What I'm wondering about (and what documentation I can find is not very helpful in figuring out) is what happens to a CPU core when the thread executing on it hands control to a hardware device (disk controller, network interface, ...) to do work the CPU/core cannot help with. Does that core become available for executing other threads, or does it just stall and wait (even if there are other threads with CPU work to do that are available for scheduling)?
The oft-given advice of "as many threads as cores" seems to suggest the latter.
That's out of Java's control. The scheduling is done by the OS and is therefore outside the scope of the JVM.
It's very likely that the core is reclaimed by the OS while the thread is waiting for the I/O to complete.
The simple advice of "one thread per core/processor" is for CPU-intensive operations. If you know that most of the time you're waiting for I/O, then you can create more threads than there are cores.
Also note that enabled Hyper-Threading counts towards the number of available processors, so a quad-core processor with Hyper-Threading enabled will be reported as having 8 available processors (see also this question).
I have a general question:
My program will just go on processing something which does not require user input or system resources (like a printer, etc.), meaning my program will not wait for any resource except CPU time.
The same program (let us say, job) may be initiated by multiple users.
In this case, is it worthwhile to run each job in its own thread (so that each user gets the feeling that his job is executed without delay), or is it better to run the jobs sequentially?
The issue with running them as separate threads is that too many threads running simultaneously force CPU utilization over 100%.
Please advise. Assume that the user does not see his job's progress and is not worried about when his job finishes. But at the same time, I want to keep the CPU busy running the jobs.
If you don't care how long a process takes, or the length of time it takes is acceptable, then using one thread is likely to be the simplest solution. For example, many GUI applications only use one event handling thread.
If you want to keep all your CPUs busy you can start a number of busy loops to max out all the CPUs.
What you usually want is to reduce latency or improve throughput by using more CPUs. Unless one of these is a goal, using more CPUs won't help you.
If the thread is genuinely purely CPU-bound, then it doesn't make sense to create more threads than there are cores (or virtual cores) available to process them. So on a quad-core machine, create no more than four threads (and probably only three, as your process isn't the only thing going on on the machine). On a quad-core machine with hyper-threading (two virtual threads per core), you might create six or seven. Creating too many additional threads (say, hundreds) causes unnecessary context-switching, which can be expensive if you really overdo it.
The converse is that on a multi-core machine, a single thread can only run on one core. So on a quad-core machine, running the jobs sequentially on a single thread will only utilize 25% of the CPU capacity.
So: Run the jobs in parallel up to the number of available cores, and sequentially (on each core) beyond that.
Big caveat: Your mileage may vary. There are lots of inputs to this equation, including what else is going on on the machine, and particularly whether the jobs really are CPU-bound (as opposed to system-bound, e.g., CPU and I/O subsystem and such).
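A fixed-size thread pool gives you exactly that behavior; a sketch, with the job body as a stand-in:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobRunner {
    public static void main(String[] args) throws InterruptedException {
        // One thread per core, minus one to leave room for the rest of the system.
        int workers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int job = 0; job < 100; job++) {
            final int id = job;
            // Jobs beyond 'workers' simply queue and run as threads free up --
            // parallel up to the number of cores, sequential beyond that.
            pool.submit(() -> runJob(id));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void runJob(int id) {
        long x = 0;
        for (int i = 0; i < 50_000_000; i++) x += i; // stand-in CPU-bound job
        System.out.println("job " + id + " done (" + x + ")");
    }
}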
I guess your program needs memory access. Memory access may be slow, and you really want to keep the processor busy during that time. A common solution for limiting the number of threads running at the same time is to use a thread pool.
In this case, is it worthwhile to run each job in its own thread (so that each user gets the feeling that his job is executed without delay), or is it better to run the jobs sequentially?
It depends highly on the job. If it is interactive then running it immediately would give a better interface to the user. If the speed of the response is not an issue then maybe you don't want to incur the complexity costs of writing a multi-threaded program.
The issue with running them as separate threads is that too many threads running simultaneously force CPU utilization over 100%.
I wouldn't worry about this. One of the reasons why we use multiple threads is that we can make use of multiple processors to actually get the job done faster. In this case, depending on the OS, you can actually see more than 100% load for the process if you are using more than a full CPU -- this is expected. Also, if the CPU goes over 100%, the operating system will handle it fine unless you are worried that your application will be taking cycles away from a more important application.
I am currently developing an application which can work in a multi-threaded mode. While testing on my local machine (Intel Core i5) I used 4 threads. But now I want to release the code for intense (regression) testing, so is there any hard rule by which we can decide the number of threads to create for processing?
I am not using any web or app server; instead I have written my own logic to receive requests and process them. During processing, I receive the request on the main thread and then submit the call to an ExecutorService, where I need to decide the number of threads; each thread processes the request and is capable of returning a response.
I need to configure an optimal number of threads. I am deploying the application on a 16-core Linux machine with 40GB of memory.
Thanks
The maximum number of threads for an application cannot be derived from some well-defined formula; it depends on the nature of your tasks and your target environment.
If your tasks are CPU-intensive and you spawn too many threads, performance will degrade because most of the time will be spent in context switching.
For compute-intensive tasks a general formula is Ncpus + 1. You can determine the number of CPUs using Runtime.getRuntime().availableProcessors().
If your tasks are I/O-intensive, then you can usually use a much larger number of threads: because the threads spend so much of their time blocked, only a fraction of them compete for the CPU at any moment.
Taking these two cases into account, you should estimate the compute-time vs. waiting-time ratio via a profiler or a similar tool.
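One widely cited way to turn that estimate into a pool size is the heuristic from Java Concurrency in Practice (Goetz et al.): threads = Ncpu x Ucpu x (1 + W/C), where Ucpu is the target CPU utilization (0..1) and W/C is the measured wait-to-compute ratio. A sketch with illustrative numbers:

public class GoetzSizing {
    public static void main(String[] args) {
        // Pool-size heuristic from "Java Concurrency in Practice":
        //   threads = Ncpu * Ucpu * (1 + waitTime / computeTime)
        // The ratio below is an example value; measure yours with a profiler.
        int ncpu = Runtime.getRuntime().availableProcessors();
        double targetUtilization = 1.0;   // use the whole machine
        double waitToComputeRatio = 0.5;  // illustrative, not measured

        int threads = (int) Math.ceil(ncpu * targetUtilization * (1 + waitToComputeRatio));
        System.out.println("suggested pool size: " + threads);
    }
}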
You can try your benchmarks with various pool sizes until you find the optimum for your case.
In theory, the optimal number of threads equals the number of cores in your machine.
In practice, many operations spend time waiting for memory, I/O, network, or disk.
Try executing a single thread first. If the CPU core load is only 25%, the thread is waiting about 75% of the time, so you can try creating roughly (4 x the number of cores in your machine) threads.
Note that increasing the number of threads will affect the time each thread waits for network/disk/memory/I/O, so it is somewhat more complex than that.
The best thing you can do is benchmark: measure how long it takes to complete 1,000,000 simulated requests with different numbers of threads.
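A throwaway harness for that measurement might look like the following; the simulated request body is a placeholder, and you may want to scale the request count down if memory is tight:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadCountBenchmark {
    // Placeholder for a real simulated request.
    static void simulatedRequest() {
        long x = 0;
        for (int i = 0; i < 10_000; i++) x += i;
        if (x < 0) System.out.println(x); // keep the JIT from eliding the loop
    }

    static long timeRun(int threads, int requests) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Callable<Void>> tasks = new ArrayList<>(requests);
        for (int i = 0; i < requests; i++) {
            tasks.add(() -> { simulatedRequest(); return null; });
        }
        long start = System.nanoTime();
        pool.invokeAll(tasks);              // run all requests, wait for completion
        long elapsed = System.nanoTime() - start;
        pool.shutdown();
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        int requests = 1_000_000;
        for (int threads : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            timeRun(threads, requests / 10);   // warm-up
            long ns = timeRun(threads, requests);
            System.out.printf("%2d threads: %d ms%n", threads, ns / 1_000_000);
        }
    }
}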
It depends on how CPU-intensive your tasks are. Still, you can assign one task to one core, so at the least you can create as many threads as there are cores. That said, things may slow down depending on:
your code doing lots of I/O,
lots of network I/O,
other CPU-intensive tasks running on the machine.
If you create too many threads, lots of time will be wasted in context switching. Unless you can arrive at a better number from benchmarks based on your own tests, go with threads = number of cores.