Suppose I have a large batch of memory-bound tasks that are quite independent of one another. To make things concrete, let's say I can allocate 30GB for the heap and that each task requires on average about 3GB of memory at its peak, but with some variability both over time and from task to task. A few tasks here and there might even require 6GB.
In this case, it seems more efficient to try to run 10 (or arguably even more) tasks concurrently, and if / when we bump into the memory limit have the task wait, much the same as we do with other shared resources like I/O, specific memory addresses (which are accessed through locks), etc.
Is it possible to do this in Java? More generally:
What's the best way to handle memory-bound task scheduling in Java?
Some Related Questions and "Close Misses"
This question asks whether it's possible to have threads in Java wait for memory instead of throwing an OutOfMemoryError, but the answers seem to focus on why this is a bad idea to begin with - perhaps because the question suggests the number of threads is unreasonable. Also, I guess treating all memory requests as equal can lead to deadlocks. So I want to emphasize that here we are talking about only about 10 tasks, and the desire to "max out" the memory usage seems like a very natural one. I do not mind wrapping my tasks in some suitable logic that will distinguish their memory requests as having lower priority. I can even accept a solution where I need to identify the class whose instances are filling up the memory and maybe add some suitable counter - but I'd prefer a platform-independent solution that works "out of the box", if there is one.
This question also asks about scheduling memory-bound tasks but seems to presuppose a specific solution framework.
The problem is that within a single JVM you have very little control on how much memory a single thread is going to use; unless you make use of offheap (e.g. using Unsafe or direct memory as AnatolyG already mentioned). If you have huge array allocations, you could also control these. But we need to know more about the data-structures that consume the most memory.
But if you have arbitrary object graphs that you don't have much control over, perhaps it is smarter to model the problem using multiple processes. You have 1 intake controller process and then a bunch of worker processes. For each process you can configure the maximum amount of heap its JVM is allowed to use.
Bumping into memory limits at the OS level can be a huge PITA because it can lead to swapping, and this will make all the threads in the system slow. Or even worse, the OOM killer steps in. Make sure you set vm.swappiness to a very low value to prevent premature swapping.
Do you know up front how much memory a process is going to consume? If so, then you could keep track of the maximum amount of memory being consumed in the system and don't allow for new tasks in the system before tasks have completed.
If you don't know the memory limits up front, then you could assume each task will use the maximum, but this can lead to under-utilization of memory.
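The "keep track of the memory being consumed and hold back new tasks" idea can be sketched with a counting semaphore whose permits stand for megabytes of a heap budget. This is only a sketch under assumptions: the class name `MemoryBudget`, its methods, and the per-task estimate are hypothetical, and the JVM does not enforce the budget, so it is only as good as the estimates the caller supplies.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: admits a task only while its declared estimate fits the budget. */
final class MemoryBudget {
    private final int totalMb;
    private final Semaphore permits; // one permit = 1 MB of the assumed budget

    MemoryBudget(int totalMb) {
        this.totalMb = totalMb;
        this.permits = new Semaphore(totalMb, true); // fair, so large requests are not starved
    }

    /** Waits until estimateMb is free, runs the task, then gives the reservation back. */
    <T> T run(int estimateMb, Supplier<T> task) {
        if (estimateMb > totalMb) {
            throw new IllegalArgumentException("estimate exceeds the whole budget");
        }
        // The whole reservation is taken in one step, so a task never holds part of it while waiting.
        permits.acquireUninterruptibly(estimateMb); // uninterruptible only to keep the sketch short
        try {
            return task.get();
        } finally {
            permits.release(estimateMb);
        }
    }

    /** Megabytes of the budget that are currently unreserved. */
    int availableMb() {
        return permits.availablePermits();
    }
}
```

With the numbers from the question, this would be something like `new MemoryBudget(30_000)` with typical tasks declaring 3,000 and the occasional large one declaring 6,000. Declaring the peak rather than the average trades some utilization for safety, which is the under-utilization mentioned above.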
Related
(The specifics for this question are for a mod for Minecraft. In general, the question deals with resizing a threadpool based on system load and CPU availability).
I am coming from an Objective C background, and Apple's libdispatch (Grand Central Dispatch) for thread scheduling.
The immediate concern I have is trying to reduce the size of the threadpool when a CMS tenured collection is running. The program in question (Minecraft) only works well with CMS collections. A much less immediate concern, but still of interest, is reducing the threadpool size when other programs are demanding significant CPU (specifically, either a screen recorder or a Twitch stream).
In Java, I have just found out about (deep breath):
Executors, which provide access to thread pools (both fixed size, and adjustable size), with cached thread existence (to avoid the overhead of constantly re-creating new threads, or to avoid the worry of coding threads to pause and resume based on workload),
Executor (no s), which is the generic interface for saying "Now it is time to execute this runnable()",
ExecutorService, which manages the threadpools according to Executor,
ThreadPoolExecutor, which is what actually manages the thread pool, and has the ability to say "This is the maximum number of threads to use".
Under normal operation, about 5 times a second, there will be 50 high priority, and 400 low priority operations submitted to the thread pool per user on the server. This is for high-powered machines.
What I want to do is:
Work with less-powerful machines. So, if a computer only has 2 cores, and the main program (two primary threads, plus some minor assistant threads) is already maxing out the CPU, these background tasks will be competing with the main program and the garbage collector. In this case, I don't want to reduce the number of background threads (it will probably stay at 2), but I do want to reduce how much work I schedule. So this is just "How do I detect when the work-load is going up". I suspect that this is just a case of watching the size of the work queue of the pool I get from Executors.newCachedThreadPool().
But the first problem: I can't find anything to return the size of the work queue! ThreadPoolExecutor() can return the queue, and I can ask that for a size, but newCachedThreadPool() only returns an ExecutorService, which doesn't let me query for size (or rather, I don't see how to).
If I have "enough cores", I want to tell the pool to use more threads. Ideally, enough to keep CPU usage near max. Most of the tasks that I want to run are CPU bound (disk I/O will be the exception, not the rule; concurrency blocking will also be rare). But I don't want to heavily over-schedule threads. How do I determine "enough threads" without going way over the available cores?
If, for example, screen recording (or streaming) activates, CPU core usage by other programs will go up, and then I want to reduce the number of threads; as the number of threads goes down, and the queue backlog goes up, I can reduce the number of tasks I add to the queue. But I have no idea how to detect this.
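On the work-queue question above, one option is to construct the `ThreadPoolExecutor` directly and ask its queue for a size. The sketch below assumes a fixed-size pool with an unbounded `LinkedBlockingQueue`; the class and method names (`BacklogProbe`, `newInspectablePool`, `backlog`, `demo`) are made up for illustration. As far as I know, the cached pool in OpenJDK hands tasks off through a `SynchronousQueue`, which has no capacity, so its queue size would not be a useful signal even if you cast the returned `ExecutorService`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BacklogProbe {
    /** Builds a fixed-size pool whose waiting tasks can be counted. */
    static ThreadPoolExecutor newInspectablePool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    /** Tasks accepted but not yet started; a snapshot that can be stale under load. */
    static int backlog(ThreadPoolExecutor pool) {
        return pool.getQueue().size();
    }

    /** Demo: the single worker is held busy, so later submissions wait in the queue. */
    static int demo() {
        ThreadPoolExecutor pool = newInspectablePool(1);
        CountDownLatch gate = new CountDownLatch(1);
        Runnable waitForGate = () -> {
            try {
                gate.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int i = 0; i < 4; i++) {
            pool.execute(waitForGate); // the first goes straight to the worker, the rest queue up
        }
        int queued = backlog(pool);
        gate.countDown(); // release the worker so the queue drains
        pool.shutdown();
        return queued;
    }

    public static void main(String[] args) {
        System.out.println("queued tasks: " + demo());
    }
}
```

A producer could poll `backlog(pool)` before submitting low-priority work and skip a round when the number is above some threshold; `pool.getActiveCount()` gives a similar approximate view of busy threads.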
I think that the best advice I / we can give is to not try to "micro-manage" the number of threads in the thread pools. Set it to a sensible size that is proportional to the number of physical cores ... and leave it. By all means provide some static tuning parameters (e.g. in config files), but don't try to make the system tune itself dynamically. (IMO, the chances that dynamic tuning will work better than static are ... pretty slim.)
For "real-time" streaming, your success is going to depend on the overall load and the scheduler's ability to prioritize more than the number of threads. However it is a fact that standard Java SE on a standard OS is not suited to hard real-time, so your real-time performance is liable to deteriorate badly if you push the envelope.
I've been looking for a way to increase the speed of processing messages received from a RabbitMQ queue. The only way I've found is to have more than one thread doing the same work: receiving and processing. This gave me some gain: after I created 4 threads, the speed quadrupled. As I have an 8-core processor, I decided to increase the number of threads to 8, but this gives no further performance increase. YourKit shows that only 50% of the CPU is used. Somebody could say that my app is lightweight, and so it is, but it can't do more work than it does no matter how much more work I give it. Why doesn't this work?
There are many different issues that can constrain the maximum speed of some application on a given system. For example, it can be limited by memory bandwidth, by Amdahl's Law effects (time needed for non-parallel code, including synchronized blocks), I/O bandwidth, and cache space.
If you want further improvement you need to do some measurements and profiling to find where the time is going, and then work on that.
The short (and not particularly helpful) answer is "overheads and bottlenecks".
For instance:
Creating threads in Java is relatively expensive. If the amount of work done by a thread isn't large, the overhead of creating the thread can out-weigh the benefits.
Context switching between threads is relatively expensive, especially when you take account of memory-related overheads such as cache misses, TLB misses. (These overheads actually hit when a native thread is assigned to a core. If the OS can somehow keep a native thread on a single core continuously (i.e. with no other threads on the same core), then it can use spinlocking ... and avoid the context switch. But the more Java threads you have, the less likely it is that the OS can do this.)
The threads may be spending a large proportion of their time waiting for I/O to complete. The I/O system's throughput or the speed / latency of some external service can be a bottleneck.
You may have contention over data structures; e.g. threads requiring exclusive access to safely read or update (say) a shared Map. If threads regularly need to wait for others to release locks, then you have a bottleneck.
Your computation may be dominated by the costs of "feeding" the threads. For example, if there is a single master thread that hands out "work" to worker threads, then the master thread's activities could be the bottleneck; i.e. it may not be able to provide enough work to keep the workers busy.
Since your tags imply that you are using a message queue, it is possible that that is the bottleneck, especially if the messages are big or the "work" done on each one is relatively small.
(Using a separate message queue service is liable to increase context switches, add I/O latency, add protocol overheads and so on. It's not an automatic route to performance improvement for small-scale systems.)
It is also possible that you have "hyperthreaded" cores not real cores, or that the operating system is stopping your JVM from using all cores.
If CPU or waiting for IO is your bottle neck, adding independent threads can make a big difference.
If a shared resource is the bottleneck, e.g. your L3 cache, your network adapter, or your kernel, adding threads won't help because CPU is not the problem. In fact it can often make things worse by adding overhead.
my app is lightweight
In which case CPU is unlikely to be your issue and you are doing well to see a speed up with more than 1 CPU. Most likely you are speeding up CPU used by RabbitMQ. Ideally it should be more efficient and this shouldn't really help much. IMHO, more efficient messaging solutions don't gain much by multiple CPUs as they will not be bottlenecked on CPU.
One way or another, you're only using 4 cores. There's a lot that can stop you from doubling your performance by doubling your threads, but from your 4 thread success you've gotten past all that. I'm guessing there's a bug in your code to set off 8 threads and it's only firing up 4. (Even with hyperthreading, you're going to get some improvement. Even with every possible problem, you're going to get some improvement.) Otherwise, I'll go with T.J.Crowder and Stephen C: I don't think you really have 8 cores.
I'd try using different numbers of threads: 3, 5, 6. See what changes. I think you'll stumble on the problem soon enough.
To be fair to Java: if you write thread-safe code and avoid bottlenecks, it handles threads really well, as you've noticed going from one thread to 4. I've always found the overhead costs to be trivial.
Your application does not have linear speedup, and therefore, it does not have good scalability.
In order to keep increasing the number of threads, you need to ensure that the amount of data being handled grows accordingly. For a fixed amount of data, increasing the number of threads (and/or cores) will hit diminishing returns at some point, since the overhead of creating threads will outweigh each thread's compute time.
Make sure to look up the following links:
Amdahl's law
Gustafson's law
Gustafson's law is a great counterpoint to Amdahl's law, so I highly recommend understanding that article.
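To make the two laws concrete, here is a small sketch that evaluates both formulas. The class and method names are just for illustration, and the 90% parallel fraction is an assumed figure, not something measured from the application in the question. Amdahl's law gives the speedup for a fixed problem size as 1 / ((1 - p) + p / n); Gustafson's law gives the scaled speedup as (1 - p) + p * n when the problem grows with the number of workers.

```java
final class ScalingLaws {
    /** Amdahl: fixed problem size; p is the parallel fraction (0..1), n the number of workers. */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    /** Gustafson: problem size grows with n; p is the parallel fraction of the scaled run. */
    static double gustafson(double p, int n) {
        return (1.0 - p) + p * n;
    }

    public static void main(String[] args) {
        double p = 0.9; // assumed parallel fraction
        for (int n : new int[] {1, 2, 4, 8}) {
            System.out.printf("n=%d  amdahl=%.2f  gustafson=%.2f%n",
                    n, amdahl(p, n), gustafson(p, n));
        }
    }
}
```

With p = 0.9 and 8 workers, Amdahl's formula gives roughly 4.7 while Gustafson's gives 7.3, which is the "fixed data versus growing data" distinction made above.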
can you explain this nonsense to me?
i have a method that basically fills up an array with mathematical operations. there's no I/O involved or anything. now, this method takes about 50 seconds to run, and the code is perfectly scalable (theoretically 100%), so i split it up into 4 threads, wait for them to complete, and reassemble the 4 arrays. now, i run the program on a quad core processor, expecting it to take about 15 seconds, and it actually takes 58 seconds. that's right: it takes longer! i see the cpu working 100%, and i know that each thread does 1/4 of the calculations, and creating threads and reassembling the arrays take about 1-2 ms in total.
what's causing such loss of performance? what the hell is the cpu doing all that time?
CODE: http://pastebin.com/cFUgiysw
Threads don't work that way.
Threads are still part of the same process (depending on the OS), so in terms of the operating system - CPU time will be scheduled the same for 4 threads in 1 process as it is for 1 thread in 1 process.
Also, with such a small number of values, you won't see the scalability in the midst of the overhead. Re-assembling the arrays in java will be costly.
Check out things like "Context switching overhead" - things like that always mess you up when you try to map theory to practise :P
I would stick to the single-threaded way :)
~ Dan
http://en.wikipedia.org/wiki/Context_switch
A lot depends on what you are doing and how you are dividing the work. There are many possible causes for this problem.
The most likely cause is, you are using all the bandwidth of your CPU to main memory bus with one thread. This can happen if your data set is larger than your CPU cache. esp if you have some random access behaviour. You could consider trying to reuse the original array, rather than taking multiple copies to reduce cache churn.
Your locking overhead is greater than the performance gain. I suspect you have used very coarse locking, so this shouldn't be an issue.
Starting and stopping threads takes too long. As your code runs for many seconds, I doubt this too.
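The suggestion to reuse one array rather than taking multiple copies could look roughly like the sketch below: every thread writes into its own index range of a single shared array, so nothing has to be reassembled afterwards. The `compute` method is a hypothetical stand-in for the real per-element maths (the code in the pastebin link is not reproduced here), and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

final class SliceFill {
    /** Hypothetical stand-in for the per-element calculation in the question. */
    static double compute(int i) {
        return Math.sqrt(i) + i * 0.5;
    }

    /** Fills one shared array; each thread owns a disjoint index range, so no locks or merging. */
    static double[] fillParallel(int length, int threads) {
        double[] out = new double[length];
        List<Thread> workers = new ArrayList<>();
        int chunk = (length + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(length, t * chunk);
            final int to = Math.min(length, from + chunk);
            Thread worker = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    out[i] = compute(i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join also makes the worker's writes visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] result = fillParallel(1_000_000, Runtime.getRuntime().availableProcessors());
        System.out.println("filled " + result.length + " elements");
    }
}
```

Whether this actually helps depends on where the time goes; if the loop is limited by memory bandwidth rather than arithmetic, contiguous slices at least keep each thread's writes local instead of adding a copy step.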
There is a cost associated with starting new threads. I don't think it should be as much as 8 seconds, but it depends on what threads you are using. Some need to create a copy of the data you are handling in order to be thread safe, and that can take some time. This cost is commonly referred to as overhead. If part of the work cannot run in parallel, for instance because it reads the same file or needs access to a shared resource, the threads may have to wait on each other; this takes time and, under suboptimal conditions, can take longer than serial execution. My tip is to look for these serialized sections and remove them from the threaded part if possible. Also try a lower number of threads; 4 threads for 4 CPUs is not always optimal.
Hope it helps.
Unless you are constantly creating and killing threads, the thread overhead shouldn't be a problem. Four threads running simultaneously is no big deal for the scheduler.
As Peter Lawrey suggested the memory bandwidth could be the problem. Your 50-second code is running on a java engine and they both compete for the available memory bandwidth. The java engine needs memory bandwidth to execute your code and your code needs it to do its calculations.
You write "perfectly scalable" which would be the case if your code was compiled. Since it runs on a java engine this is not the case. So the 16% increase in overall time could be seen as the difference between the smoothness of one thread vs the chaos of four colliding over memory accesses.
I want to determine the max number of threads I am able to create for my sort algorithm. I want to use java.lang.Runtime for that.
I want to count the current thread amount and stop creating new threads when the limit is reached.
The max number of threads for a JVM is generally somewhere in the thousands. If you're using multiple threads to optimize a computational algorithm, you don't really want more than the number of processors in the system it runs on. Use Runtime.getRuntime().availableProcessors() to find out.
I'd suggest you look at the problem from a slightly different angle...
what I mean is that if you're asking
"what's the maximum number of threads I can create and use for task x?"
and you're asking that because you think the more threads you use the better, you could instead ask
"how many threads would I need to complete x most efficiently?"
It's very unlikely that task x will perform better with thousands of threads on a dual core machine, for example. As a general guide for CPU bound tasks (which a sort algorithm probably is), the optimal number of threads is
threads = number of CPUs + 1
see How to find out the optimal amount of threads?
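As a minimal sketch of that guideline, the pool below is sized from `Runtime.getRuntime().availableProcessors()`. The class and method names are invented for the example, and the "+ 1" is the rule of thumb quoted above rather than a measured optimum. Note that `availableProcessors()` reports logical processors, so hyperthreaded cores are counted individually.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PoolSizing {
    /** The "number of CPUs + 1" guideline for CPU-bound work. */
    static int cpuBoundThreads() {
        return Runtime.getRuntime().availableProcessors() + 1;
    }

    public static void main(String[] args) {
        int threads = cpuBoundThreads();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println("pool size: " + threads);
        pool.shutdown(); // no tasks submitted in this sketch
    }
}
```

Submitting the sort's sub-tasks to a fixed pool like this caps the thread count without having to count threads by hand.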
The actual maximum number of threads depends on the OS and JVM, along with how much memory is configured for use within the JVM. Creating a new thread takes some OS memory but not Java heap memory. This means that if you create thousands of threads, you might run out of memory which you can't bump up using -Xmx. In fact, the less Java heap memory you allocate, the more native memory is available and so the more threads you can create.
See this article and the comment to get a feel for how you can work out the max with pen and paper.
Having said all that, if you want to use an unlimited number of threads in your application, you can use `Executors.newCachedThreadPool()`, which will create a pool of threads, creating as many as are needed. I don't recommend using this type of pool for everyday usage, but it's related to your original question.
Hope that helps.
You can see how many threads you have by attaching visualvm to your process. I recommend version 1.3.2 with all the plugins downloaded.
I'd also recommend using Spring and its Executor pools. It's a great way to be able to configure the size of the thread pool.
I am fairly new with concurrent programming and I am learning it.
I am implementing a quick sort in Java JDK 7 (Fork Join API) to sort a list of objects (100K).
When I run this recursive piece of code without concurrency, I observe no memory explosion; everything is fine.
I just added the code to use it on multiple cores (by extending the class RecursiveAction) and then the memory usage jumped very high until it reached its limits. By doing some profiling I observe a high creation rate of threads, which I think is to be expected.
But is a Java Thread by itself much more memory demanding, or am I missing something here?
Quicksort must require a lot of threads, but not many more than regular objects.
Should I stop creating RecursiveAction threads when I reach a threshold and then just switch to a sequential piece of code (no more threads)?
Thank you very much.
Java threads usually take 256k/512k (depending on OS, JDK version, ...) of stack space alone, by default.
You're wasting huge resources and speed if you're running more threads than you have processors/cores for a CPU intensive process such as doing quicksort, so try to not run more threads than you have cores.
Yes, switching over to sequential code is a good idea when the unit of work is in the region of ca. 10,000-100,000 operations. This is just a rule of thumb. So, for quick sort, I'd drop out to sequential execution when the size to be sorted is less than say 10-20,000 elements, depending upon the complexity of the comparison operation.
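The cut-over to sequential code might be sketched like this with `RecursiveAction`. It is not the asker's code: it sorts an `int[]` rather than a list of objects to keep the example short, the class name is invented, and the 10,000-element threshold is simply the rule of thumb above, to be tuned by measurement.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

final class ParallelQuickSort extends RecursiveAction {
    static final int THRESHOLD = 10_000; // assumed cut-off below which we stop forking
    private static final ForkJoinPool POOL = new ForkJoinPool(); // parallelism defaults to core count

    private final int[] a;
    private final int lo; // inclusive
    private final int hi; // inclusive

    ParallelQuickSort(int[] a, int lo, int hi) {
        this.a = a;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < THRESHOLD) {
            Arrays.sort(a, lo, hi + 1); // small range: sort sequentially, create no more tasks
            return;
        }
        int p = partition(a, lo, hi);
        invokeAll(new ParallelQuickSort(a, lo, p), new ParallelQuickSort(a, p + 1, hi));
    }

    /** Hoare partition around the middle element; returns the last index of the left part. */
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1;
        int j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) {
                return j;
            }
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    /** Sorts the array in place using the shared pool. */
    static void sort(int[] a) {
        if (a.length > 1) {
            POOL.invoke(new ParallelQuickSort(a, 0, a.length - 1));
        }
    }
}
```

The sub-tasks here are lightweight `RecursiveAction` objects scheduled onto the pool's fixed set of worker threads, so forking a task does not by itself create a thread.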
What's the size of the ForkJoinPool? Usually it's set to create the same number of threads as processors, so you shouldn't be seeing too many threads. If you've manually set the parallelism to be high (say, in the hundreds or thousands) then you will see high (virtual) memory use, since each thread allocates space for the stack (256K by default on 32-bit Windows and Linux).
As a rule of thumb for a CPU bound computation, once your number of threads exceeds the number of available cores, adding more threads is not going to speed things up. In fact, it will probably slow you down due to the overheads of creating the threads, the resources tied down by each thread (e.g. the thread stacks), and the cost of synchronizing.
Indeed, even if you had an infinite number of cores, it would not be worth creating threads to do small tasks. Even with thread pools and other clever tricks, if the amount of work to be done in a task is too small the overheads of using a thread will exceed any savings. (It is difficult to predict exactly where that threshold is, and it certainly depends on the nature of the task as well as platform-related factors.)
I changed my code and so far I have better results. I invoke the main task in the ForkJoinPool, and within the tasks I don't create more threads if there are already many more active threads than available cores in the ForkJoinPool.
I don't synchronize through the join() method. As a result, a parent thread dies as soon as it has created its offspring. In the main function that invoked the root task, I wait for the tasks to complete, i.e. until there are no more active threads. It seems to work fine: memory stays normal and I gained a lot of time over the same piece of code executed sequentially.
I am going to learn more.
Thank you all !