Java multithreading performance scaling

can you explain this nonsense to me?
i have a method that basically fills up an array with mathematical operations. there's no I/O involved or anything. now, this method takes about 50 seconds to run, and the code is perfectly scalable (theoretically 100%), so i split it up into 4 threads, wait for them to complete, and reassemble the 4 arrays. now, i run the program on a quad core processor, expecting it to take about 15 seconds, and it actually takes 58 seconds. that's right: it takes longer! i see the cpu working 100%, and i know that each thread does 1/4 of the calculations, and creating threads and reassembling the arrays take about 1-2 ms in total.
what's causing such loss of performance? what the hell is the cpu doing all that time?
CODE: http://pastebin.com/cFUgiysw

Threads don't work that way.
Threads are still part of the same process (depending on the OS), so in terms of the operating system - CPU time will be scheduled the same for 4 threads in 1 process as it is for 1 thread in 1 process.
Also, with such a small number of values, you won't see the scalability in the midst of the overhead. Re-assembling the arrays in Java will be costly.
Check out things like "context switching overhead" - things like that always mess you up when you try to map theory to practice :P
I would stick to the single-threaded way :)
~ Dan
http://en.wikipedia.org/wiki/Context_switch

A lot depends on what you are doing and how you are dividing the work. There are many possible causes for this problem.
The most likely cause is that a single thread is already using all the bandwidth of the bus between your CPU and main memory. This can happen if your data set is larger than your CPU cache, especially if you have some random access behaviour. You could consider reusing the original array, rather than taking multiple copies, to reduce cache churn.
Your locking overhead is greater than the performance gain. I suspect you have used very coarse locking, so this shouldn't be an issue.
Starting and stopping threads takes too long. As your code runs for multiple seconds, I doubt this too.
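The suggestion to reuse one array instead of reassembling per-thread copies could look roughly like the sketch below. The class name, the compute() stand-in and the chunking scheme are assumptions for illustration; the asker's real code is only available behind the pastebin link.

```java
public class SharedArrayFill {
    // Hypothetical stand-in for the "mathematical operations" in the question.
    static double compute(int i) {
        return Math.sqrt(i) + i * 0.5;
    }

    // Fills one shared array. Each worker writes only its own contiguous slice,
    // so no locking and no per-thread result arrays are needed.
    // Assumes threads >= 1.
    static double[] fill(int size, int threads) {
        double[] out = new double[size];
        Thread[] workers = new Thread[threads];
        int chunk = (size + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(size, t * chunk);
            final int to = Math.min(size, from + chunk);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    out[i] = compute(i);
                }
            });
            workers[t].start();
        }
        try {
            // join() guarantees the workers' writes are visible to this thread.
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] result = fill(1_000_000, 4);
        System.out.println("filled " + result.length + " values");
    }
}
```

Contiguous slices keep each thread's writes on its own cache lines except at the slice boundaries, which is the point of avoiding the extra copies.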

There is a cost associated with starting new threads. I don't think it should be up to 8 seconds, but it depends on what threads you are using. Some threads need to create a copy of the data they are handling in order to be thread safe, and that can take some time. This cost is commonly referred to as overhead. If part of the work cannot run in parallel, for instance because it reads the same file or needs access to a shared resource, the threads might need to wait on each other. This can take some time, and under suboptimal conditions it can take more time than serial execution. My tip is to check for these non-parallelizable parts and remove them from the threaded part if possible. Also try using a lower number of threads; 4 threads for 4 CPUs is not always optimal.
Hope it helps.

Unless you are constantly creating and killing threads, the thread overhead shouldn't be a problem. Four threads running simultaneously is no big deal for the scheduler.
As Peter Lawrey suggested the memory bandwidth could be the problem. Your 50-second code is running on a java engine and they both compete for the available memory bandwidth. The java engine needs memory bandwidth to execute your code and your code needs it to do its calculations.
You write "perfectly scalable" which would be the case if your code was compiled. Since it runs on a java engine this is not the case. So the 16% increase in overall time could be seen as the difference between the smoothness of one thread vs the chaos of four colliding over memory accesses.

Related

Why is increasing the number of threads useless?

I've been looking for a way to increase the speed of processing messages received from a RabbitMQ queue. The only way I've found is to have more than one thread doing the same thing: receiving and processing. And this paid off. After I created 4 threads the speed quadrupled. As I have an 8-core processor, I decided to increase the number of threads to 8. But this gives no performance increase. YourKit shows that only 50% of the CPU is used. Somebody might say that my app is lightweight, and so it is, but it doesn't do more work than it does now no matter how much more work I give it. Why doesn't this work?
There are many different issues that can constrain the maximum speed of some application on a given system. For example, it can be limited by memory bandwidth, by Amdahl's Law effects (time needed for non-parallel code, including synchronized blocks), I/O bandwidth, and cache space.
If you want further improvement you need to do some measurements and profiling to find where the time is going, and then work on that.
The short (and not particularly helpful) answer is "overheads and bottlenecks".
For instance:
Creating threads in Java is relatively expensive. If the amount of work done by a thread isn't large, the overhead of creating the thread can out-weigh the benefits.
Context switching between threads is relatively expensive, especially when you take account of memory-related overheads such as cache misses, TLB misses. (These overheads actually hit when a native thread is assigned to a core. If the OS can somehow keep a native thread on a single core continuously (i.e. with no other threads on the same core), then it can use spinlocking ... and avoid the context switch. But the more Java threads you have, the less likely it is that the OS can do this.)
The threads may be spending a large proportion of their time waiting for I/O to complete. The I/O system's throughput or the speed / latency of some external service can be a bottleneck.
You may have contention over data structures; e.g. threads requiring exclusive access to safely read or update (say) a shared Map. If threads regularly need to wait for others to release locks, then you have a bottleneck.
Your computation may be dominated by the costs of "feeding" the threads. For example, if there is a single master thread that hands out "work" to worker threads, then the master thread's activities could be the bottleneck; i.e. it may not be able to provide enough work to keep the workers busy.
Since your tags imply that you are using a message queue, it is possible that that is the bottleneck, especially if the messages are big or the "work" done on each one is relatively small.
(Using a separate message queue service is liable to increase context switches, add I/O latency, add protocol overheads and so on. It's not an automatic route to performance improvement for small-scale systems.)
It is also possible that you have "hyperthreaded" cores not real cores, or that the operating system is stopping your JVM from using all cores.
If CPU or waiting for I/O is your bottleneck, adding independent threads can make a big difference.
If a shared resource is the bottleneck, e.g. your L3 cache, your network adapter or your kernel, adding threads won't help because CPU is not the problem. In fact it can often make it worse by adding overhead.
"my app is lightweight"
In which case CPU is unlikely to be your issue and you are doing well to see a speed up with more than 1 CPU. Most likely you are speeding up CPU used by RabbitMQ. Ideally it should be more efficient and this shouldn't really help much. IMHO, more efficient messaging solutions don't gain much by multiple CPUs as they will not be bottlenecked on CPU.
One way or another, you're only using 4 cores. There's a lot that can stop you from doubling your performance by doubling your threads, but from your 4 thread success you've gotten past all that. I'm guessing there's a bug in your code to set off 8 threads and it's only firing up 4. (Even with hyperthreading, you're going to get some improvement. Even with every possible problem, you're going to get some improvement.) Otherwise, I'll go with T.J.Crowder and Stephen C: I don't think you really have 8 cores.
I'd try using different numbers of threads: 3, 5, 6. See what changes. I think you'll stumble on the problem soon enough.
To be fair to Java: if you write thread-safe code and avoid bottlenecks, it handles threads really well, as you've noticed going from one thread to 4. I've always found the overhead costs to be trivial.
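Trying different thread counts, as suggested above, can be done with a small harness along these lines. This is only a sketch: the work() method is a made-up CPU-bound stand-in for processing one message, and the task count and thread counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadSweep {
    // Hypothetical CPU-bound unit of work standing in for "process one message".
    static long work(int seed) {
        long acc = seed;
        for (int i = 0; i < 200_000; i++) {
            acc = acc * 6364136223846793005L + 1442695040888963407L;
        }
        return acc;
    }

    // Runs `tasks` units of work on a pool of `threads` threads and returns a
    // checksum, so the result is used and can be compared across runs.
    static long run(int threads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> jobs = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int seed = i;
                jobs.add(() -> work(seed));
            }
            long checksum = 0;
            for (Future<Long> f : pool.invokeAll(jobs)) {
                checksum += f.get();
            }
            return checksum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        run(4, 400); // warm-up run so the JIT has compiled work() before timing
        for (int threads : new int[] {1, 2, 3, 4, 5, 6, 8}) {
            long start = System.nanoTime();
            run(threads, 400);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: " + ms + " ms");
        }
    }
}
```

If the timings stop improving at 4 threads for pure CPU work like this, the limit is probably in the machine or the JVM's view of it. If they keep improving here but not in the real application, the limit is more likely in the application or the message queue.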
Your application does not have linear speedup, and therefore, it does not have good scalability.
To keep increasing the number of threads, you need to ensure that the amount of data being handled grows accordingly. For a fixed amount of data, increasing the number of threads (and/or cores) will yield diminishing returns at some point, since the overhead of creating threads will outweigh the threads' compute time.
Make sure to look up the following:
Amdahl's law
Gustafson's law
Gustafson's law is a great counterpoint to Amdahl's law, so I highly recommend understanding that article.
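For a quick feel for the two laws, here is a small sketch that evaluates the textbook formulas. The parameter names p (parallel fraction) and n (number of threads) and the example value p = 0.9 are assumptions chosen for illustration, not measurements from the question.

```java
public class ScalingLaws {
    // Amdahl's law: speedup for a FIXED problem size, where p is the fraction
    // of the single-threaded run time that can be parallelized.
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    // Gustafson's law: scaled speedup when the problem size GROWS with n,
    // where p is the parallel fraction of the run time on n threads.
    static double gustafson(double p, int n) {
        return (1.0 - p) + p * n;
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed parallel fraction
        for (int n : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("n=%d  Amdahl=%.2f  Gustafson=%.2f%n",
                    n, amdahl(p, n), gustafson(p, n));
        }
    }
}
```

With a fixed workload, Amdahl's formula can never exceed 1 / (1 - p) however many threads are added, which is why the answer above recommends growing the data along with the thread count.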

How to improve the performance if a parallel Java program is memory bound?

I wrote a parallel java program. It works typically:
It takes a String input as input;
Then input is cut into String inputs[numThreads] evenly;
Each inputs[i] is assigned to thread_i to process, and generates results[i];
After all the worker threads finish, the main thread merges the results[i] into result.
The performance data on a 10-core (physical cores) machine is as below.
Threads#    1     2     4     8     10
Time(ms)    78    41    28    21    21
Note:
the JVM warm-up time has been eliminated (first 50 runs).
the time doesn't include threads starting/joining time.
It seems the memory bandwidth becomes the bottleneck when there are more than 8 threads.
In this case, how to further improve the performance? Is there any design issues in my parallel Java program?
To examine the cause of this scalability issue, I inserted a (meaningless computation) loop into the process(inputs[i]) method. Here is the new data:
Threads#    1        10
Time(ms)    41000    4330
The new data shows good scalability for 10 threads, which in turn suggests that the original (without the meaningless loop) has a memory issue, so that its scalability is limited to 8 threads.
But is there any way to circumvent this issue, like pre-loading the data into each core's local cache, or loading in batches?
I find it unlikely that you have a memory bandwidth issue here. It is more likely that your run times are so short that, as they approach 0, you are mostly timing the thread startup/shutdown or the HotSpot compiler's optimization cycles. Timing information from a Java task that runs this briefly is close to worthless. The HotSpot compiler and other optimizations that run initially often dominate the CPU usage early in a class's life. Our production applications stabilize only after minutes of live service operation.
If you can significantly increase your run times by adding more input data or by calculating the same result over and over you may get a better idea about what the optimal thread numbers are.
Edit:
Now that you have added timings for 1 and 10 threads over a longer period, it looks to me that you are not bound by anything since the timing seems to be fairly linear -- with some thread overhead. 41000/10 = 4100 versus 4330 for 10 threads.
Pretty good demonstration of what threading can do to a CPU bound application. :-)
How many logical cores do you have?
Consider - imagine you had one core and a hundred threads. The work to be done is the same, it cannot be distributed over multiple cores, but now you have a great deal of thread switching overhead.
Now imagine you have say four cores and four threads. Assuming no other bottlenecks, compute time is quartered.
Now imagine you have four cores and eight threads. Your compute time will be approximately quartered, but you'll have added some thread swapping overhead.
Be aware of hyperthreading and that it may help or hinder you, depending on the nature of the compute task.
I'd say your losses are down to switching threads. You have more threads than cores, and none need to block for slower processes, so they are getting switched in, doing a bit of work and then getting switched out to switch another one in. Switching threads is an expensive process. Given the nature of what you appear to be doing, I would have instinctively restricted the number of threads to 8 (leaving two cores for the OS), and your performance numbers appear to bear me out.

Thread.sleep and BufferedReader.readLine use the most cpu cycles in my java tcp server. Why?

Good evening,
I'm developing a java tcp server for communication between clients.
At this point i'm load testing the developed server.
This morning i got my hands on a profiler (yourkit) and started looking for problem spots in my server.
I now have 480 clients sending messages to the server every 500 msec. The server forwards every received message to 6 clients.
The server is now using about 8% of my cpu, when being on constant load.
My question is about the Java functions that use the most CPU cycles.
The Java function that uses the most CPU cycles is, strangely, "Thread.sleep", followed by "BufferedReader.readLine".
Both of these functions seem to block the current thread while waiting for something (sleep waits for a few msec, readline waits for data).
Can somebody explain why these 2 functions take up that many CPU cycles? I was also wondering if there are alternative approaches that use fewer CPU cycles.
Kind regards,
T. Akhayo
sleep() and readLine() can use a lot of CPU as they both result in system calls which can context switch. It is also possible that the timing for these methods is not accurate for this reason (it may be an overestimate).
A way to reduce the overhead of context switches/sleep() is to have fewer threads and avoid needing to use sleep (e.g. use a ScheduledExecutorService). readLine() overhead can be reduced by using NIO, but it is likely to add some complexity.
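As a rough sketch of the ScheduledExecutorService idea, the code below runs many periodic jobs on a handful of pooled threads instead of giving each job its own thread that loops over Thread.sleep. The class name, the runClients helper and its parameters are hypothetical, and the counter increment is only a placeholder for real work such as sending a message.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PeriodicClients {
    // Schedules `clients` periodic jobs on `poolThreads` threads, waits until
    // at least `totalRuns` job executions have happened, then stops the pool.
    // Returns the number of executions that were counted.
    static long runClients(int clients, int poolThreads, long periodMillis, int totalRuns) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(poolThreads);
        AtomicLong runs = new AtomicLong();
        CountDownLatch done = new CountDownLatch(totalRuns);
        for (int c = 0; c < clients; c++) {
            scheduler.scheduleAtFixedRate(() -> {
                runs.incrementAndGet(); // placeholder for the periodic work
                done.countDown();
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // 480 periodic jobs every 500 ms, served by 4 threads rather than 480.
        System.out.println(runClients(480, 4, 500, 960) + " executions");
    }
}
```

The scheduler keeps the pending jobs in a delay queue, so only the few pool threads ever sleep and wake, rather than one thread per job.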
Sleeping shouldn't be an issue, unless you're having a bunch of threads sleep for short periods of time (100-150ms is 'short' when you have 480 threads running a loop that just sleeps and does something trivial).
The readLine call should be using next to nothing when it's not actually reading something, except when you first call it. But like you said, it blocks, and it shouldn't be using a noticeable amount of CPU unless you have small windows where it blocks. CPU usage isn't that much unless you're reading tons of data, or initially calling the method.
So, your loops are too tight, and you're receiving too many messages too quickly, which is ultimately causing 'tons' of context switching, and processing. I'd suggest using a NIO framework (like Netty) if you're not comfortable enough with NIO to use it on your own.
Also, 8% CPU isn't that much for 480 clients that send 2 messages per second.
Here is a program in which sleep uses almost 100% of the cpu cycles given to the application:
for (int i = 0; i < bigNumber; i++) {
    Thread.sleep(someTime);
}
Why? Because it doesn't use very many actual cpu cycles at all,
and of the ones it does use, nearly all of them are spent entering and leaving sleep.
Does that mean it's a real problem? Of course not.
That's the problem with profilers that only look at CPU time.
You need a sampler that samples on wall-clock time, not CPU time.
It should sample the stack, not just the program counter.
It should show you by line of code (not by function) the fraction of stack samples containing that line.
The usual objection to sampling on wall-clock time is that the measurements will be inaccurate due to sharing the machine with other processes.
But that doesn't matter, because to find time drains does not require precision of measurement.
It requires precision of location.
What you are looking for is precise code locations, and call sites, that are on the stack a healthy fraction of actual time, as determined by stack sampling that's uncorrelated with the state of the program.
Competition with other processes does not change the fraction of time that call sites are on the stack by a large enough amount to result in missing the problems.

Java Threads memory explosion

I am fairly new with concurrent programming and I am learning it.
I am implementing a quick sort in Java JDK 7 (Fork Join API) to sort a list of objects (100K).
While using this recursive piece of code without concurrency, I observe no memory explosion; everything is fine.
I then added the code to use it on multiple cores (by extending the class RecursiveAction), and the memory usage jumped very high until it reached its limit. By doing some profiling I observed a high creation rate of threads, which I think is to be expected.
But is a Java thread by itself much more memory-demanding, or am I missing something here?
Quicksort must require a lot of threads, but not many more than it requires regular objects.
Should I stop creating RecursiveAction Threads when i meet a threshold and then just switch to a sequential piece of code (no more threads)?
Thank you very much.
Java threads usually take 256k/512k (depending on the OS, JDK version, etc.) of stack space alone, by default.
You're wasting huge resources and speed if you're running more threads than you have processors/cores for a CPU intensive process such as doing quicksort, so try to not run more threads than you have cores.
Yes, switching over to sequential code is a good idea when the unit of work is in the region of ca. 10,000-100,000 operations. This is just a rule of thumb. So, for quick sort, I'd drop out to sequential execution when the size to be sorted is less than say 10-20,000 elements, depending upon the complexity of the comparison operation.
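A minimal sketch of that cut-off is shown below. It sorts an int[] rather than the asker's list of objects to keep the example short, and the class name, the THRESHOLD value and the Hoare-style partition are assumptions rather than the asker's code.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelQuickSort extends RecursiveAction {
    // Rule-of-thumb cut-off from the answer above; tune by measurement.
    static final int THRESHOLD = 10_000;

    private final int[] a;
    private final int lo; // inclusive
    private final int hi; // inclusive

    ParallelQuickSort(int[] a, int lo, int hi) {
        this.a = a;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < THRESHOLD) {
            // Small range: sort sequentially instead of forking more tasks.
            Arrays.sort(a, lo, hi + 1);
            return;
        }
        int p = partition(a, lo, hi);
        // Fork both halves as tasks; the pool's worker threads pick them up.
        invokeAll(new ParallelQuickSort(a, lo, p), new ParallelQuickSort(a, p + 1, hi));
    }

    // Hoare partition around the middle element; requires lo < hi.
    // Afterwards every element in [lo, j] is <= every element in [j + 1, hi].
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1;
        int j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) {
                return j;
            }
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    static void sort(int[] a) {
        if (a.length < 2) {
            return;
        }
        ForkJoinPool pool = new ForkJoinPool(); // one worker per available processor
        try {
            pool.invoke(new ParallelQuickSort(a, 0, a.length - 1));
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that RecursiveAction instances are tasks, not threads: the pool runs them on a fixed set of workers, so the number of threads stays bounded however deep the recursion goes.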
What's the size of the ForkJoinPool - usually it's set to create the same number of threads as processors, so you shouldn't be seeing too many threads. If you've manually set the parallelism to be high (say, in the hundreds or thousands) then you will see high (virtual) memory use, since each thread allocates space for the stack (256K by default on 32-bit windows and linux.)
As a rule of thumb for a CPU bound computation, once your number of threads exceeds the number of available cores, adding more threads is not going to speed things up. In fact, it will probably slow you down due to the overheads of creating the threads, the resources tied down by each thread (e.g. the thread stacks), and the cost of synchronizing.
Indeed, even if you had an infinite number of cores, it would not be worth creating threads to do small tasks. Even with thread pools and other clever tricks, if the amount of work to be done in a task is too small the overheads of using a thread will exceed any savings. (It is difficult to predict exactly where that threshold is, and it certainly depends on the nature of the task as well as platform-related factors.)
I changed my code and so far I have better results. I invoke the main task in the ForkJoinPool, and in the tasks I don't create more threads if there are already a lot more active threads than available cores in the ForkJoinPool.
I don't synchronize through the join() method. As a result, a parent thread will die as soon as it has created its offspring. In the main function that invoked the root task, I wait for the tasks to be completed, i.e. until there are no more active threads. It seems to work fine, as the memory stays normal and I gained a lot of time over the same piece of code executed sequentially.
I am going to learn more.
Thank you all !

Threads per Processor

In Java, is there a programmatic way to find out how many concurrent threads are supported by a CPU?
Update
To clarify, I'm not trying to hammer the CPU with threads and I am aware of Runtime.getRuntime().availableProcessors() function, which provides me part of the information I'm looking for.
I want to find out if there's a way to automatically tune the size of thread pool so that:
if I'm running on a 1-year old server, I get 2 threads (1 thread per CPU x an arbitrary multiplier of 2)
if I switch to an Intel i7 quad core two years from now (which supports 2 threads per core), I get 16 threads (2 logical threads per CPU x 4 CPUs x the arbitrary multiplier of 2).
if, instead, I use an eight-core UltraSPARC T2 server (which supports 8 threads per core), I get 128 threads (8 threads per CPU x 8 CPUs x the arbitrary multiplier of 2)
if I deploy the same software on a cluster of 30 different machines, potentially purchased at different years, I don't need to read the CPU specs and set configuration options for every single one of them.
Runtime.availableProcessors returns the number of logical processors (i.e. hardware threads) not physical cores. See CR 5048379.
A single non-hyperthreading CPU core can always run one thread. You can spawn lots of threads and the CPU will switch between them.
The best number depends on the task. If it is a task that will take lots of CPU power and not require any I/O (like calculating pi, prime numbers, etc.) then 1 thread per CPU will probably be best. If the task is more I/O bound, like processing information from disk, then you will probably get better performance by having more than one thread per CPU. In this case the disk access can take place while the CPU is processing information from a previous disk read.
I suggest you do some testing of how performance in your situation scales with number of threads per CPU core and decide based on that. Then, when your application runs, it can check availableProcessors() and decide how many threads it should spawn.
Hyperthreading will make the single core appear to the operating system and all applications, including availableProcessors(), as 2 CPUs, so if your application can use hyperthreading you will get the benefit. If not, then performance will suffer slightly but probably not enough to make the extra effort in catering for it worth while.
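Sizing the pool at startup from availableProcessors() and a measured multiplier might look like the sketch below. The class and method names are hypothetical; the multiplier is the value you would determine by testing, as described above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // Threads to use for a given number of logical processors and a
    // workload-specific multiplier (about 1.0 for CPU-bound work, higher
    // for I/O-bound work). Never returns less than 1.
    static int poolSize(int logicalProcessors, double multiplier) {
        return Math.max(1, (int) Math.round(logicalProcessors * multiplier));
    }

    // Builds a fixed pool sized for the machine the JVM is running on.
    // availableProcessors() already counts hardware threads, so hyperthreaded
    // cores are included without reading any CPU specs.
    static ExecutorService newTunedPool(double multiplier) {
        int cpus = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(poolSize(cpus, multiplier));
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println(cpus + " logical processors -> pool of " + poolSize(cpus, 2.0));
        ExecutorService pool = newTunedPool(2.0);
        pool.shutdown();
    }
}
```

Reading the multiplier from a configuration file instead of hard-coding it keeps the same binary usable across a mixed cluster.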
There is no standard way to get the number of supported threads per CPU core within Java. Your best bet is to get a Java CPUID utility that gives you the processor information, and then match it against a table you'll have to generate that gives you the threads per core that the processor manages without a "real" context switch.
Each processor, or processor core, can do exactly 1 thing at a time. With hyperthreading, things get a little different, but for the most part that still remains true, which is why my HT machine at work almost never goes above 50%, and even when it's at 100%, it's not processing twice as much at once.
You'll probably just have to do some testing on common architectures you plan to deploy on to determine how many threads you want to run on each CPU. Just using 1 thread may be too slow if you're waiting for a lot of I/O. Running a lot of threads will slow things down as the processor will have to switch threads more often, which can be quite costly. I'm not sure if there is any hard-coded limit to how many threads you can run, but I guarantee that your app would probably come to a crawl from too much thread switching before you reached any kind of hard limit. Ultimately, you should just leave it as an option in the configuration file, so that you can easily tune your app to whatever processor you're running it on.
A CPU does not normally pose a limit on the number of threads, and I don't think Java itself has a limit on the number of native (kernel) threads it will spawn.
There is a method availableProcessors() in the Runtime class. Is that what you're looking for?
Basics:
Application loaded into memory is a process. A process has at least 1 thread. If you want, you can create as many threads as you want in a process (theoretically). So number of threads depends upon you and the algorithms you use.
If you use thread pools, that means thread pool manages the number of threads because creating a thread consumes resources. Thread pools recycle threads. This means many logical threads can run inside one physical thread one after one.
You don't have to consider the number of threads, it's managed by the thread pool algorithms. Thread pools choose different algorithms for servers and desktop machines (OSes).
Edit1:
You can use explicit threads if you think thread pool doesn't use the resources you have. You can manage the number of threads explicitly in that case.
This is a function of the VM, not the CPU. It has to do with the amount of heap consumed per thread. When you run out of space on the heap, you're done. As with other posters, I suspect your app becomes unusable before this point if you exceed the heap space because of thread count.
See this discussion.
