We have a computationally demanding Java program (scientific research) that is designed to be single-threaded. However, when executed, it loads much more than one CPU core (we noticed this the hard way: the cluster job scheduler killed our program because it loaded more cores than requested). We encountered this weird phenomenon on both Linux (Debian, Ubuntu) and Windows (7).
I understand that the JVM adds several background threads (garbage collector and so on), so even a single-threaded program can load more than one core, but I doubt that these background threads could load another full core or two.
I would appreciate any idea of what may be causing this. Thanks for any hints. Feel free to ask for details, though I can't post the code here (first, it's quite a lot of code; second, it is still under research and I cannot publish anything yet).
Let me first give you my condolences for having to run your program in an environment where someone has found it more intellectually fulfilling to kill jobs attempting to use more than one core, than to restrict jobs to using just one core. But let's move on with the question.
When I pause a random single-threaded Java program and look at my debugger's thread listing, there are about half a dozen threads in there. That's just how the JVM works. There is at least one thread for garbage collection, another thread for running finalizers, and various others whose purpose I don't even know. We lost the game of knowing precisely what is going on in our machines a couple of decades ago.
There may be options that you could use to tell the JVM to reduce its use of threads, for example to run garbage-collection in the same thread as your program, but I don't know them by heart, so you would need to look them up, and frankly, I doubt that it would make much difference. There will always be threads that you have no control over.
So, it seems like you are going to have to configure your own job to not use more than one core. I have done it at work, with some success, but today is Saturday, so I do not have access to the script files that I used, so I am going to try and help with whatever I remember.
The concepts you are looking for are "processor (CPU) affinity" and "NUMA".
Under Windows, the start command (built into cmd.exe) allows you to specify which logical CPUs (in other words, cores) to run your process on. start /affinity 1 myapp will run myapp restricted to the first logical CPU; the argument is a hexadecimal affinity mask, so 1 selects CPU 0, 3 selects CPUs 0 and 1, and so on.
Under Linux there are at least a couple of different commands that allow you to launch a process on a limited subset of cores. One that I know of is taskset and another is numactl.
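For example (a sketch only; "yourprogram.jar" is a placeholder for however you actually launch the job, and you should check your distribution's man pages), either of the following would pin the whole JVM, background threads included, to a single core:
taskset -c 0 java -jar yourprogram.jar
numactl --physcpubind=0 java -jar yourprogram.jar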
There is a set of JVM parameters you could play with; an example command line follows this list. For Java 7 and earlier:
-XX:ParallelGCThreads=n Sets the number of threads used during parallel phases of the garbage collectors.
-XX:ConcGCThreads=n Sets the number of threads that concurrent garbage collectors will use.
For Java 8 there are other options, which depend on the OS. You can see them for Windows here. Some you may find helpful:
-XX:CICompilerCount=threads Sets the number of compiler threads to use for compilation
-XX:ConcGCThreads=threads Sets the number of threads used for concurrent GC. The default value depends on the number of CPUs available to the JVM (!possible cause of your problem!)
-XX:ParallelGCThreads=threads Sets the number of threads used for parallel garbage collection in the young and old generations. The default value depends on the number of CPUs available to the JVM
-XX:+UseParNewGC Enables the use of parallel threads for collection in the young generation. By default, this option is disabled (but it could be enabled by other options).
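To illustrate, a launch command that dials the GC helpers down to one thread might look like the line below. This is a sketch only: "yourprogram.jar" is a placeholder, the flags' availability and defaults depend on your JVM version and chosen collector, and -XX:+UseSerialGC is an alternative that selects a single-threaded collector outright.
java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -jar yourprogram.jar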
If you provide additional info, the answers will be more helpful and informative.
Related
I have a program which runs (all day) tasks in parallel (no I/O in the tasks to be executed), so I have used Executors.newFixedThreadPool(poolSize) to implement it.
Initially I set the poolSize to Runtime.getRuntime().availableProcessors(), but I was a bit worried about using all the available cores, since there are other processes running on the same PC (32 cores).
In particular, I have ten other JVMs running the same program (on different input data), so I'm a bit worried that there might be a lot of overhead in terms of threads switching amongst the available cores, which could slow down the overall calculations.
How shall I decide the size of the pool for each program / JVM?
Also, in my PC, there are other processes running all the time (Antivirus, Backup, etc.). Shall I take into account these as well?
Any advice is going to be dependent upon your particular circumstances. 10 JVMs on 32 cores would suggest 3 threads each (ignoring garbage collection threads, timer tasks etc...)
You also have other tasks running. The scheduler will ensure they're running, but do they have to be responsive? More responsive than the JVM? If you're running Linux/Unix then you can also make use of prioritisation (via nice) to ensure particular processes don't hog the CPU.
Finally you're running 10 JVMs. Will that cause paging ? If so, that will be slow and you may be better off running fewer JVMs in order to avoid consuming so much memory.
Just make sure that your key variables are exposed and configurable, and measure various scenarios in order to find the optimal one.
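As a minimal sketch of "exposed and configurable" (the -DpoolSize property name is made up for illustration, not something your program already has), the pool size could be read from a system property and default to the core count:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConfigurablePool {
    public static void main(String[] args) {
        // default to the core count, but allow e.g. -DpoolSize=3 to override it
        int defaultSize = Runtime.getRuntime().availableProcessors();
        int poolSize = Integer.getInteger("poolSize", defaultSize);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // ... submit your long-running tasks here ...
        pool.shutdown();
    }
}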
How shall I decide the size of the pool for each program / JVM?
You want the number of threads which will get you close to 99% utilisation and no more.
The simplest way to balance the work is to have the process running once, processing multiple files concurrently and using just one thread pool. You can set up your process as a service if you need to submit files via the command line.
If this is impossible for some reason, you will need to guesstimate how much the thread pools should be shrunk by. Try running one process and look at the utilisation. If one process is at, say, 40%, then I suspect ten processes are over-utilised by a factor of 4, i.e. you might reduce the pool size by a factor of 4.
Unfortunately, this is a hard thing to know, as programs don't typically know what else is or might be going on on the same box.
the "easy" way out is to make the pool size configurable. this allows the user who controls the program/box to decide how many threads to allocate to your program (presumably using their knowledge of the general workload of the box).
a more complex solution would be to attempt to programmatically determine the current workload of the box and choose the pool size appropriately from that. the efficacy of this solution depends on how accurately you can determine the workload and potentially adapt as it changes over time.
Try grepping the processes, check top/task manager and performance monitors to verify if this implementation is actually affecting your machine.
This article seems to contain interesting info about what you are trying to implement:
http://www.ibm.com/developerworks/library/j-jtp0730/index.html
I have a general question:
My program will just go on processing something that does not require user input or system resources (like a printer, etc.), meaning my program will not wait for any resource except CPU time.
The same program (let us say job) may be initiated by multiple users.
In this case, is it worthwhile to run each job in its own thread (so that each user gets the feeling that his job is executed without delay), or is it better to run the jobs sequentially?
The issue with running the jobs as separate threads is that too many threads running simultaneously force the CPU utilization over 100%.
Please advise. Assume that the user does not see his job's progress and is not worried about when his job finishes. But at the same time, I want to keep the CPU busy running the jobs.
If you don't care how long a process takes, or the length of time it takes is acceptable, then using one thread is likely to be the simplest solution. For example, many GUI applications only use one event handling thread.
If you want to keep all your CPUs busy you can start a number of busy loops to max out all the CPUs.
What you usually want is to reduce latency, or improve throughput, by using more CPUs. Unless this is a goal, using more CPUs won't help you.
If the thread is genuinely purely CPU-bound, then it doesn't make sense to create more threads than there are cores (or virtual cores) available to process them. So on a quad-core machine, create no more than four threads (and probably only three, as your process isn't the only thing going on on the machine). On a quad-core machine with hyper-threading (two virtual threads per core), you might create six or seven. Creating too many additional threads (say, hundreds) causes unnecessary context-switching, which can be expensive if you really overdo it.
The converse is that on a multi-core machine, a single thread can only run on one core. So on a quad-core machine, running the jobs sequentially on a single thread will only utilize 25% of the CPU capacity.
So: Run the jobs in parallel up to the number of available cores, and sequentially (on each core) beyond that.
Big caveat: Your mileage may vary. There are lots of inputs to this equation, including what else is going on on the machine, and particularly whether the jobs really are CPU-bound (as opposed to system-bound, e.g., CPU and I/O subsystem and such).
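A rough sketch of that advice (the job itself is a stand-in busy loop, not anything from the question): a fixed pool sized to the core count runs jobs in parallel up to that limit, and any further jobs simply wait in the pool's queue, i.e. they are effectively serialized beyond the core count.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class JobRunner {
    public static void main(String[] args) {
        // leave one core for the OS and other processes
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, cores - 1));

        for (int job = 0; job < 100; job++) {       // e.g. 100 user-submitted jobs
            final int id = job;
            pool.submit(() -> runCpuBoundJob(id));  // excess jobs queue up and run later
        }
        pool.shutdown();
    }

    private static void runCpuBoundJob(int id) {
        long acc = 0;                               // placeholder CPU-bound work
        for (int i = 0; i < 50_000_000; i++) acc += i ^ id;
    }
}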
I guess your program needs memory access. Memory access may be slow, and you really want to keep the processor busy during that time. A common solution to limit the number of threads running at the same time is to use a thread pool.
In this case, is it worthwhile to run each job in its own thread (so that each user gets the feeling that his job is executed without delay), or is it better to run the jobs sequentially?
It depends highly on the job. If it is interactive then running it immediately would give a better interface to the user. If the speed of the response is not an issue then maybe you don't want to incur the complexity costs of writing a multi-threaded program.
The issue with running the jobs as separate threads is that too many threads running simultaneously force the CPU utilization over 100%.
I wouldn't worry about this. One of the reasons why we use multiple threads is that we can make use of multiple processors to actually get the job done faster. In this case, depending on the OS, you can actually see more than 100% load for the process if you are using more than a full CPU -- this is expected. Also, if the CPU goes over 100%, the operating system will handle it fine unless you are worried that your application will be taking cycles away from a more important application.
I wrote a very simple single-threaded Java application that simply iterates (a few times) over a list of Integers and calculates the sum. When I run this on my Linux machine (Intel X5677 3.46GHz quad-core), it takes the program about 5 seconds to finish. Same time if I restrict the JVM to two specific cores using taskset (which was quite expected, as the application is single-threaded and the CPU load is < 0.1% on all cores). However, when I restrict the JVM to a single core, the program suddenly executes extremely slowly and takes 350+ seconds to finish. I could understand if it were only marginally slower when restricted to a single core, as the JVM is running a few other threads in addition to the main thread, but I can't understand this extreme difference. I ran the same program on an old laptop with a single core, and it executes in about 15 seconds. Does anyone understand what is going on here, or has anyone successfully restricted a JVM to a single core on a multicore system without experiencing something like this?
Btw, I tried this with both HotSpot 1.6.0_26-b03 and 1.7.0-b147 – same problem.
Many thanks
Yes, this seems counter-intuitive, but the simple solution would be to not do it. Let the JVM use 2 cores.
FWIW, my theory is that the JVM is looking at the number of cores that the operating system is reporting, assuming that it will be able to use all of them, and tuning itself based on that assumption. But the fact that you've pinned the JVM to a single core is making that tuning pessimal.
One possibility is that the JVM has turned on spin-locking. That is a strategy where a thread that can't immediately acquire a lock will "spin" (repeatedly testing the lock) for a period, rather than immediately rescheduling. This can work well if you've got multiple cores and the locks are held for a short time, but if there is only one core available then spinlocking is an anti-optimization.
(If this is the real cause of the problem, I believe there is a JVM option you can set to turn off spinlocks.)
This would be normal behaviour if you have two or more threads with an interdependence on each other. Imagine a program where two threads ping-pong messages or data between them. When they are both running, this can take 10 - 100 ns per ping-pong. When they have to context-switch to run, it can take 10 - 100 microseconds each. A 1000x increase wouldn't surprise me.
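A small self-contained sketch of that ping-pong pattern (not the asker's program): two threads hand a token back and forth through SynchronousQueue handoffs. Run normally, each round trip costs on the order of nanoseconds to microseconds; pinned to a single core, every handoff forces a context switch and the per-round cost grows dramatically.
import java.util.concurrent.SynchronousQueue;

public class PingPong {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();
        int rounds = 100_000;

        Thread echo = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    pong.put(ping.take());          // bounce the token straight back
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        echo.start();

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ping.put(i);                            // send the token ...
            pong.take();                            // ... and wait for it to come back
        }
        long elapsed = System.nanoTime() - start;
        echo.join();
        System.out.printf("%.1f ns per ping-pong%n", (double) elapsed / rounds);
    }
}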
If you want to limit the program to one core, you may have to re-write portions of it so it's designed to run efficiently on one core.
We've been talking about threads in my operating system class a lot lately and one question has come to my mind.
Since Go (and Java) uses user-space threads instead of kernel threads, doesn't that mean that you can't effectively take advantage of multiple cores, since the OS only allocates CPU time to the process and not to the threads themselves?
This seems to confirm the fact that you can't
Wikipedia also seems to think so
What makes you think Go uses User-space threads?
It doesn't. It uses OS-threads and can take advantage of multiple cores.
You might be puzzled by the fact that by default Go only uses 1 thread to run your program. If you start two goroutines, they run in one thread. But if one goroutine blocks for I/O, Go creates a second thread and continues to run the other goroutine on the new thread.
If you really want to unlock the full multi-core power just use the GOMAXPROCS() function.
runtime.GOMAXPROCS(4) // somewhere in main
Now your program would use 4 OS threads (instead of 1) and would be able to fully use, e.g., a 4-core system.
Most recent versions of Java use OS threads, although there is not necessarily a one-to-one mapping with Java threads. Java clearly does work quite nicely across many hardware threads.
I presume that by "user-space threads" you mean (for example) Go's goroutines.
It is true that using goroutines for concurrency is less efficient than designing (by hand and by scientific calculations) a special-purpose algorithm for assigning work units to OS threads.
However: Every Go program is situated in an environment and is designed to solve a particular problem. A new goroutine can be started for each request that the environment is making to the Go program. If the environment is making concurrent requests to the Go program, a Go program using goroutines might be able to run faster than a serial program even if the Go program is using just 1 OS thread. The reason why goroutines might be able to process requests with greater speed (even when using just 1 OS thread) is that the Go program will automatically switch from goroutine A to goroutine B when the part of environment which is associated with A is momentarily unable to respond.
But yes, it is true that using goroutines and automatically assigning them to multiple OS threads is less efficient than designing (by hand and by scientific calculations) a special-purpose algorithm for assigning work units to OS threads.
Short version is in the title.
Long version:
I am working on a program for scientific optimization using Java. The workload of the program can be divided into parallel and serial phases -- parallel phases meaning that highly parallelizable work is being performed. To speed up the program (it runs for hours/days) I create a number of threads equal to the number of CPU cores on the machine I'm using -- typically 4 or 8 -- and divide the work between them. I then start these threads and join() them before proceeding to a serial phase.
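For reference, a minimal sketch of that parallel-phase/serial-phase structure using an ExecutorService instead of raw threads (the work unit computeSlice is a placeholder, not the asker's code):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PhasedComputation {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // parallel phase: one CPU-bound work unit per core
        List<Callable<Double>> parallelPhase = new ArrayList<>();
        for (int i = 0; i < cores; i++) {
            final int slice = i;
            parallelPhase.add(() -> computeSlice(slice));
        }

        double total = 0;
        for (Future<Double> f : pool.invokeAll(parallelPhase)) {
            total += f.get();                       // the implicit "join"
        }

        // serial phase continues here on the main thread
        System.out.println(total);
        pool.shutdown();
    }

    private static double computeSlice(int slice) {
        double sum = 0;                             // placeholder numeric work
        for (int i = 0; i < 10_000_000; i++) sum += Math.sqrt(i + slice);
        return sum;
    }
}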
So far so good. What's bothering me is that the CPU utilization and speedup of the parallel phases is nowhere near the "theoretical maximum" -- e.g. if I have 4 cores, I expect to see somewhere between 350-400% "utilization" (as reported by top) but instead it bounces around between 180 and about 310. Using only a single thread, I get 100% CPU utilization.
The only reasons I know of for threads not to run at full speed are:
- blocking due to I/O
- blocking due to synchronization
No I/O whatsoever is going on in my parallel threads, nor any synchronization -- the only data structures shared by the threads are read-only, and are either basic types or (non-concurrent) collections. So I'm looking for other explanations. One possibility would be that several threads are repeatedly blocking for garbage collection, but that would only seem to make sense in a situation with memory pressure, and I am allocating well above the required maximum heap space.
Any suggestions would be appreciated.
Update: Just in case anyone is curious, after some more investigation I tweaked the code for general performance and am seeing better utilization, even though nothing I changed has to do with synchronization. However, some of the changes should have resulted in fewer new heap allocations; in particular, I got rid of some use of iterators and temporary boxed numbers (the CERN "Colt" library for high-performance Java computing was useful here: it provides collections like IntArrayList, DoubleArrayList, etc. for basic types). So I think garbage collection was probably the culprit.
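To illustrate the boxing point (a sketch assuming the Colt jar is on the classpath; the numbers are arbitrary): Colt's primitive IntArrayList stores ints directly, so filling and summing it does not create the temporary Integer objects that an ArrayList<Integer> would, which in turn lowers garbage-collection pressure.
import cern.colt.list.IntArrayList;

public class BoxingDemo {
    public static void main(String[] args) {
        IntArrayList values = new IntArrayList();
        for (int i = 0; i < 1_000_000; i++) {
            values.add(i);                          // stores a primitive int, no boxing
        }
        long sum = 0;
        for (int i = 0; i < values.size(); i++) {
            sum += values.get(i);                   // no Integer unboxing on read
        }
        System.out.println(sum);
    }
}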
All graphics operations in Swing run on a single thread. If they are rendering to the screen they will effectively be contending for access to this thread.
If you are running on Windows, all graphics operations run on a single thread no matter what. Other operating systems have similar limitations.
It's actually fairly difficult to get the proper granularity of threaded workers sometimes, and sometimes it's easy to make them too big or too small, which will typically give you less than 100% usage of all cores.
If you're not rendering much GUI, the most likely culprit is that you're contending more than you think for some shared resource. This is easily seen with profiler tools like JProfiler. Some VMs, like BEA's JRockit, can even tell you this straight out of the box.
This is one of those places where you don't want to act on guesswork. Get a profiler!
First of all, GC will not happen only "in situations with memory pressure", but at any time the JVM sees fit (unpredictably, as far as I know).
Second, if your threads allocate memory in the heap (you mention they use Collections, so I guess they do), you can never be sure whether that memory is currently in RAM or on a virtual memory page (the OS decides), and thus access to "memory" may generate blocking I/O!
Finally, as suggested in a prior answer, you may find it useful to check what happens by using a profiler (or even JMX monitoring might give some hints there).
I believe it will be difficult to get further hints on your problem unless you provide more concrete (code) information.
Firstly, I assume you're not doing any other significant work on the box. If you are, that's clearly going to mess with things.
It does sound very odd if you're really not sharing anything. Can you give us more idea of what the code is really doing?
What happens if you run n copies of the program as different Java processes, with each only using a single thread? If that uses each CPU completely, then at least we know that it can't be a problem with the OS. Speaking of the OS, which one is this running on, and which JVM? If you can try different JVMs and different OSes, the results might give you a hint as to what's wrong.
Also an important point: which hardware do you use?
E.g., 4-8 cores could mean you are working on one of Sun's Niagara CPUs. Despite having 4-8 cores, they have fewer FPUs. When computing scientific stuff, it can happen that the FPU is the bottleneck.
You are trying to use the full CPU capacity for your calculations, but the OS itself uses resources as well. So be aware that the OS will block some of your execution in order to satisfy its own needs.
You are doing synchronization at some level.
Perhaps only in the memory allocation system, including garbage collection. While the JVM vendor has worked to keep blocking in these areas to a minimum, they can't reduce it to zero. Perhaps something about your application is pushing at a weak point in this area.
The accepted wisdom is "don't build your own memory reclaiming pool, let the GC work for you". This is true most of the time but not in at least one piece of code I maintain (proven with profiling). Perhaps you need to rework your Object allocation in some major way.
Try the latency analyzer that comes with JRockit Mission Control. It will show you what the CPU is doing when it's not doing anything, if the application is waiting for file I/O, TLA-fetches, object allocations, thread suspension, JVM-locks, gc-pauses etc. You can also see transitions, e.g. when one thread wakes up another. The overhead is negligible, 1% or so.
See this blog for more info. The tool is free to use for development and you can download it here