Parallelization: What causes Java threads to block other than synchronization & I/O?

Short version is in the title.
Long version:
I am working on a program for scientific optimization using Java. The workload of the program can be divided into parallel and serial phases -- parallel phases meaning that highly parallelizable work is being performed. To speed up the program (it runs for hours/days) I create a number of threads equal to the number of CPU cores on the machine I'm using -- typically 4 or 8 -- and divide the work between them. I then start these threads and join() them before proceeding to a serial phase.
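In sketch form, a parallel phase looks roughly like this (computeChunk is a placeholder standing in for the real per-thread work):
void runParallelPhase() throws InterruptedException {
    int cores = Runtime.getRuntime().availableProcessors();
    Thread[] workers = new Thread[cores];
    for (int i = 0; i < cores; i++) {
        final int chunk = i;
        workers[i] = new Thread(new Runnable() {
            public void run() {
                computeChunk(chunk); // placeholder for one slice of the parallel work
            }
        });
        workers[i].start();
    }
    for (Thread t : workers) {
        t.join(); // barrier: all workers finish before the serial phase begins
    }
}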
So far so good. What's bothering me is that the CPU utilization and speedup of the parallel phases are nowhere near the "theoretical maximum" -- e.g. if I have 4 cores, I expect to see somewhere between 350-400% "utilization" (as reported by top), but instead it bounces around between 180% and about 310%. Using only a single thread, I get 100% CPU utilization.
The only reasons I know of for threads not to run at full speed are:
-blocking due to I/O
-blocking due to synchronization
No I/O whatsoever is going on in my parallel threads, nor any synchronization -- the only data structures shared by the threads are read-only, and are either basic types or (non-concurrent) collections. So I'm looking for other explanations. One possibility would be that several threads are repeatedly blocking for garbage collection, but that would only seem to make sense in a situation with memory pressure, and I am allocating well above the required maximum heap space.
Any suggestions would be appreciated.
Update: Just in case anyone is curious, after some more investigation I tweaked the code for general performance and am seeing better utilization, even though nothing I changed has to do with synchronization. However, some of the changes should have resulted in fewer new heap allocations; in particular, I got rid of some use of iterators and temporary boxed numbers. (The CERN "Colt" library for high-performance Java computing was useful here: it provides collections like IntArrayList, DoubleArrayList etc. for primitive types.) So I think garbage collection was probably the culprit.
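To illustrate, the change was essentially this kind of swap (Colt's primitive-backed lists live in cern.colt.list):
// Before: each add() boxes the value into a java.lang.Double on the heap
java.util.List<Double> boxed = new java.util.ArrayList<Double>();
boxed.add(3.14);

// After: Colt stores raw doubles in a backing double[], so no per-element garbage
cern.colt.list.DoubleArrayList primitives = new cern.colt.list.DoubleArrayList();
primitives.add(3.14);
double x = primitives.get(0);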

In Swing, all graphics operations run on a single thread (the event dispatch thread). If the workers are rendering to the screen they will effectively be contending for access to this thread.
If you are running on Windows, all graphics operations run on a single thread no matter what. Other operating systems have similar limitations.
It's actually fairly difficult to get the granularity of threaded workers right: make the work units too big or too small and you will typically see less than 100% usage of all cores.
If you're not rendering much GUI, the most likely culprit is that you're contending more than you think for some shared resource. This is easy to see with profiling tools like JProfiler. Some VMs, like BEA's JRockit, can even tell you this straight out of the box.
This is one of those places where you don't want to act on guesswork. Get a profiler!

First of all, GC will not happen only "in a situation with memory pressure", but at any time the JVM sees fit (unpredictably, as far as I know).
Second, if your threads allocate memory on the heap (you mention they use Collections, so I guess they do), you can never be sure whether that memory is currently in RAM or out on a virtual memory page (the OS decides), so an ordinary "memory" access may generate blocking I/O!
Finally, as suggested in a prior answer, you may find it useful to check what happens by using a profiler (even JMX monitoring might give some hints there).
I believe it will be difficult to get further hints on your problem unless you provide more concrete (code) information.

Firstly, I assume you're not doing any other significant work on the box. If you are, that's clearly going to mess with things.
It does sound very odd if you're really not sharing anything. Can you give us more idea of what the code is really doing?
What happens if you run n copies of the program as different Java processes, with each only using a single thread? If that uses each CPU completely, then at least we know that it can't be a problem with the OS. Speaking of the OS, which one is this running on, and which JVM? If you can try different JVMs and different OSes, the results might give you a hint as to what's wrong.

Also an important point: which hardware do you use?
E.g. 4-8 cores could mean you are working on one of Sun's Niagara CPUs, which despite having 4-8 cores have fewer FPUs. When computing scientific stuff, the FPU can become the bottleneck.

You are trying to use the full CPU capacity for your calculations, but the OS itself uses resources as well. So be aware that the OS will block some of your execution in order to satisfy its own needs.

You are doing synchronization at some level.
Perhaps only in the memory allocation system, including garbage collection. While the JVM vendor has worked to keep blocking in these areas to a minimum, they can't reduce it to zero. Perhaps something about your application is pushing at a weak point in this area.
The accepted wisdom is "don't build your own memory reclaiming pool, let the GC work for you". This is true most of the time, but not in at least one piece of code I maintain (proven with profiling). Perhaps you need to rework your object allocation in some major way.
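For reference, a minimal sketch of the kind of hand-rolled pool meant here, assuming the hot path repeatedly needs scratch arrays (only worth it when profiling proves allocation is the bottleneck):
import java.util.ArrayDeque;

// Deliberately not thread-safe: give each worker thread its own pool
// so that acquire/release involve no synchronization at all.
class BufferPool {
    private final ArrayDeque<double[]> free = new ArrayDeque<double[]>();
    private final int size;

    BufferPool(int size) { this.size = size; }

    double[] acquire() {
        double[] buf = free.poll();
        return (buf != null) ? buf : new double[size]; // allocate only on a pool miss
    }

    void release(double[] buf) {
        free.push(buf); // recycle instead of letting the GC reclaim it
    }
}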

Try the latency analyzer that comes with JRockit Mission Control. It will show you what the CPU is doing when it's not doing anything, if the application is waiting for file I/O, TLA-fetches, object allocations, thread suspension, JVM-locks, gc-pauses etc. You can also see transitions, e.g. when one thread wakes up another. The overhead is negligible, 1% or so.
See this blog for more info. The tool is free to use for development, and you can download it here.

Related

Painfully Slow JVM Not Caused by Memory Leak?

I'm programming in Java using Eclipse, and after running the JVM for a couple of hours my program slows to a trickle. What's normally printed (or executed) in a few fractions of a second now takes a couple of minutes or hours.
I'm aware this is usually caused by a memory leak in the program. However, I'm under the impression that a memory leak slows the PC because the majority of CPU power goes to garbage collection. When I take a look at Task Manager I only see 22-25% of CPU being used at the moment (it has remained steady for the last couple of hours) and approx. 35% of memory free on my machine.
Could the slowing down of my program be caused by something other than a memory leak, or is it for sure a memory leak (which means I now need to take a hard look to track down the source of the leak)? And if so, why would CPU usage be relatively low?
Thanks
Sometimes this happens when you have cyclic relationships between your objects or entities. When the JVM tries to read or bind data by looping through the same set of objects, it drastically affects the performance of the JVM, and most of the time it even crashes the application. As in the previous answer, you can use jconsole to check when this happens and take action. Hope you get the idea; maybe this is not the case, it is just what came to my mind when I read your question.
Cheers!
Well, first of all, a memory leak (or any other malfunction) doesn't affect your PC or any other part of your computer unless you are referencing some external resource which is choking. To answer your question: generically speaking, while there is a possibility that the slowdown is caused by the CPU, in your case, since your program slows down gradually, there is most likely a memory leak in your code.
You could use any profiler, or jVisualVM, to monitor the memory usage and objects' state to nail down the issue.
You may be aware that a modern computer system has more than one CPU core. A single-threaded program will use only a single core, which is consistent with Task Manager reporting an overall CPU usage of 25% (1 core fully loaded, 3 cores idle = 25% of total CPU capacity used).
Garbage collection can cause slowdowns, but usually only does so if the JVM is memory-constrained. To verify whether it is garbage collection, you can use jconsole or jvisualvm (both part of the JDK) to see how much CPU time was spent doing garbage collection.
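If you prefer to check programmatically, the JDK's management API exposes the same counters that jconsole reads; a minimal sketch:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimePrinter {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Cumulative collection count and wall-clock time spent in this collector
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}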
To investigate why your program is slow, using a profiler is usually the most efficient approach.
I don't think we can say anything straightforward about this issue. You need to check the behaviour of your program using jconsole or jvisualvm, which are part of your JDK.

How to find which Finalizer is time consuming

I am working on an application whose purpose is to compute reports as fast as possible.
My application uses a big amount of memory: more than 100 GB.
Since our last release, I have noticed a big performance slowdown. My investigation shows that, during the computation, I get many garbage collections lasting between 40 and 60 seconds!
(JMC tells me that they are SerialOld, but I don't know exactly what that means.) And, of course, while the JVM is garbage collecting, the application is completely frozen.
I am now investigating the origin of these garbage collections, and this is very hard work.
I suspect that if these garbage collections are so long, it is because they are spending a lot of time in finalize() methods (I know that, among all the libraries we integrate from other teams, some use finalizers).
However, I don't know how to confirm (or refute) this hypothesis, i.e. how to find which finalizer is time-consuming.
I am looking for a good tool or even a good methodology.
Here is data collected via JVisualVM
As you can see, I always have many "Pending Finalizers" when I have a long Old Garbage collection.
What is surprising is that while I am using JVisualVM, the above graph scrolls regularly from right to left. When the Old Garbage collection is triggered, the scrolling stops (up to here it looks normal; this is stop-the-world). However, when the scrolling suddenly restarts, it does so not at the end of the Old Garbage collection but at the end of the Pending Finalizers spike.
This makes me think that the finalizers were blocking the JVM.
Does anyone have an explanation for this?
Thank you very much
Philippe
My application uses a big amount of memory: more than 100 GB.
JMC tells me that they are SerialOld, but I don't know exactly what that means
If you are using the serial collector for a 100 GB heap then long pauses are to be expected, because the serial collector is single-threaded and one core can only chomp through so much memory per unit of time.
Simply choosing any one of the multi-threaded collectors should yield lower pause times.
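For example, on a HotSpot-style JVM (flag spellings vary by vendor and version, and MyReportApp is just a placeholder for your main class):
java -Xmx100g -XX:+UseParallelOldGC MyReportApp   # multi-threaded old-generation collector
java -Xmx100g -XX:+UseG1GC MyReportApp            # or G1, which is designed for large heaps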
However, I don't know how to confirm (or refute) this hypothesis, i.e. how to find which finalizer is time-consuming.
Generally: gather more data. For GC-related things you need to enable GC logging; for time spent in Java code (be it your application or 3rd-party libraries) you need a profiler.
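On HotSpot JVMs of that generation, GC logging is enabled with flags along these lines (again, the application name is a placeholder):
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log MyReportApp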
Here is what I would do to investigate your finalizer theory.
-Start the JVM with your favorite Java profiler attached.
-Leave it running for long enough to get a full heap.
-Start the profiler.
-Trigger a garbage collection.
-Stop the profiler.
Now you can use the profiler information to figure out which (if any) finalize methods are using a large amount of time.
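Separately, if you want to watch the finalizer backlog from inside the JVM (the same number JVisualVM plots as "Pending Finalizers"), the standard management API exposes it; a minimal sketch:
import java.lang.management.ManagementFactory;

// Approximate number of objects currently queued for finalization
int pending = ManagementFactory.getMemoryMXBean().getObjectPendingFinalizationCount();
System.out.println("Objects pending finalization: " + pending);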
However, I suspect that the real problem will be a memory leak, and that your JVM is getting to the point where the heap is filling up with unreclaimable objects. That could explain the frequent "SerialOld" garbage collections.
Alternatively, this could just be a big-heap problem. 100 GB is ... big.

Java program is getting slower after running for a while

I have a Java program that is a typical machine learning algorithm, updating the values of some parameters according to some equations:
for (int iter = 0; iter < 1000; iter++) {
    // 1. Create many temporary variables and do some computations
    // 2. Update the values of the parameters
}
The computations that update the parameters are rather complex, and I have to create many temporary objects, but they are not referenced outside the loop. The code in the loop is CPU-intensive and does not access the disk. This program loads a relatively large training dataset, so I granted 10 GB of memory (-Xmx10G) to the JVM, which is much more than it requires (peaking at ~6 GB according to the "top" command or Windows Task Manager).
I tested it on several Linux machines (CentOS 6, 24 GB memory) and a Windows machine (Win7, 12 GB), all with Sun HotSpot JDK/JRE 1.8 installed. I did not specify any JVM parameters other than -Xmx. All machines are dedicated to my program.
On Windows, my program runs well: each iteration takes a very similar amount of time. However, the running time on all of the CentOS machines is weird.
It initially runs properly, but slows down dramatically (~10 times slower) at the 7th/8th iteration, and then keeps slowing down by ~10% in each iteration thereafter.
I suspect it might be caused by Java's garbage collector, so I used jconsole to monitor my program. Minor GCs happen very frequently on both machines, because the program creates many temporary variables in the loop. Furthermore, I used the "jstat -gcutil <pid> 1s" command and captured the statistics:
Centos: https://www.dropbox.com/s/ioz7ai6i1h57eoo/jstat.png?dl=0
Window: https://www.dropbox.com/s/3uxb7ltbx9kpm9l/jstat-winpng.png?dl=0
[Edited] However, the statistics on the two kinds of machines differ a lot ("S1" and "E" are the survivor-space and Eden utilization percentages reported by jstat):
"S1" on Windows jumps quickly between 0 and 50, while it stays at "0.00" on CentOS.
"E" on Windows changes very rapidly from 0 to 100. As I print the stats every second, the screenshot does not show the climb to 100. On CentOS, however, "E" increases rather slowly towards 100, then drops to 0, and increases again.
It seems the weird behaviour of my program is due to the Java GC? I am new to Java performance monitoring and do not have a good idea of how to optimize the GC parameter settings. Do you have any suggestions? Thank you very much!
I'm sorry to post this as an answer but I don't have enough score to comment.
If you think it's a GC-related issue, I'd change it to the Garbage First (G1) collector: -XX:+UseG1GC
I found this brief explanation about it:
http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-whats-new-in-java-8/
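In practice that is just one extra flag on the launch command; a hypothetical example matching the -Xmx10G setup from the question (training.jar is a placeholder):
java -Xmx10G -XX:+UseG1GC -jar training.jar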
Can you run your software under profiling? Try JProfiler, VisualVM or even the NetBeans profiler. It may help you a lot.
I noticed that you have your own encapsulation of a vector and a matrix. Maybe you are spending a lot more memory than necessary with that too. But I don't think that is the problem.
Sorry again about not contributing as a comment. (It would be more appropriate)
I would consider declaring the variables outside the loop so memory allocation is done once and GC is eliminated completely.
Giving Java (or any garbage-collecting language) too much memory has an adverse effect on performance. The live (referenced) objects become increasingly sparse in memory, resulting in more frequent fetches from main memory. Note that in the examples you've shown us, the faster Windows run is doing more quick and full GCs than Linux - but GC cycles (especially full GCs) are usually bad for performance.
If running the training set does not take an exceptionally long time, then try benchmarking at different memory allocations.
A more radical solution, but one which should have a big impact, is to eliminate (or reduce as much as possible) object creation within the loop by recycling objects in pools.
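A minimal sketch of the idea, with hypothetical names (the point is that the buffer is allocated once and overwritten each iteration instead of reallocated):
double[] scratch = new double[paramCount];   // allocated once, outside the loop
for (int iter = 0; iter < 1000; iter++) {
    computeUpdate(scratch);   // hypothetical: overwrites scratch with this iteration's values
    applyUpdate(scratch);     // hypothetical: applies the update to the parameters
}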
First, it is a common best practice to declare variables outside of loops to avoid garbage collection.
As 'Wagner Tsuchiya' said, try running a profiler if you have doubts about the GC.
If you want some tips on GC tuning, I found a nice blog post.
You could try calling System.gc() every couple of iterations to see if performance goes up or down. This may help you narrow it down to some of the previous answers' diagnoses.
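i.e. something like this inside the existing loop (purely diagnostic; System.gc() is only a hint to the JVM):
for (int iter = 0; iter < 1000; iter++) {
    // ... existing computation ...
    if (iter % 10 == 0) {
        System.gc(); // diagnostic only: check whether forced collection changes the timing
    }
}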
If the GC time is hundreds of milliseconds, as shown in your screenshot, then GC is likely not the issue here. I suggest you look into lock contention and possibly I/O using a profiler (NetBeans is great). I know you stated your program does very little I/O, but with profiling (much like debugging) you have to drop all your assumptions and go step by step.
In my experience, Java needs enough memory and 2+ CPUs. Otherwise CPU usage can be very high when the GC starts running.

Why is my multithreaded Java program not maxing out all my cores on my machine?

I have a program that starts up and creates an in-memory data model, and then creates a (command-line-specified) number of threads to run several string-checking algorithms against an input set and that data model. The work is divided amongst the threads along the input set of strings, and then each thread iterates over the same in-memory data model instance (which is never updated again, so there are no synchronization issues).
I'm running this on a Windows 2003 64-bit server with two quad-core processors, and from looking at Windows Task Manager they aren't being maxed out (nor do they look like they are being particularly taxed) when I run with 10 threads. Is this normal behaviour?
It appears that 7 threads all complete a similar amount of work in a similar amount of time, so would you recommend running with 7 threads instead?
Should I run it with more threads?...Although I assume this could be detrimental as the JVM will do more context switching between the threads.
Alternatively, should I run it with fewer threads?
Alternatively, what would be the best tool I could use to measure this?...Would a profiling tool help me out here - indeed, is one of the several profilers better at detecting bottlenecks (assuming I have one here) than the rest?
Note, the server is also running SQL Server 2005 (this may or may not be relevant), but nothing much is happening on that database when I am running my program.
Note also, the threads are only doing string matching, they aren't doing any I/O or database work or anything else they may need to wait on.
My guess would be that your app is bottlenecked on memory access, i.e. your CPU cores spend most of the time waiting for data to be read from main memory. I'm not sure how well profilers can diagnose this kind of problem (the profiler itself could influence the behaviour considerably). You could verify the guess by having your code repeat the operations it does many times on a very small data set.
If this guess is correct, the only thing you can do (other than getting a server with more memory bandwidth) is to try and increase the locality of your memory access to make better use of caches; but depending on the details of the application that may not be possible. Using more threads may in fact lead to worse performance because of cores sharing cache memory.
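A crude way to test that guess: run the same amount of arithmetic against a working set that fits in cache and one that does not. If throughput collapses for the large array, the cores are memory-starved, not compute-bound. (All sizes here are illustrative.)
public class BandwidthCheck {
    static long touch(int[] data, int reps) {
        long sum = 0;
        for (int r = 0; r < reps; r++) {
            for (int i = 0; i < data.length; i++) {
                sum += data[i]; // same number of reads in both runs below
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] small = new int[32 * 1024];          // ~128 KB: should fit in cache
        int[] large = new int[64 * 1024 * 1024];   // ~256 MB: forced to main memory
        long t0 = System.nanoTime();
        long s1 = touch(small, 2048);              // 2048 passes x 32K elements
        long t1 = System.nanoTime();
        long s2 = touch(large, 1);                 // 1 pass x 64M elements (same total reads)
        long t2 = System.nanoTime();
        // Print the sums too, so the JIT cannot discard the loops as dead code
        System.out.println("cached: " + (t1 - t0) / 1e6 + " ms, uncached: "
                + (t2 - t1) / 1e6 + " ms (" + (s1 + s2) + ")");
    }
}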
Without seeing the actual code, it's hard to give proper advice. But do make sure that the threads aren't locking on shared resources, since that would naturally prevent them all from working as efficiently as possible. Also, when you say they aren't doing any I/O, are they not reading input or writing output either? This could also be a bottleneck.
With regard to CPU-intensive threads, it is normally not beneficial to run more threads than you have actual cores, but in an uncontrolled environment like this, with other big apps running at the same time, you are probably better off simply testing your way to the optimal number of threads.

Java Random Slowdowns on Mac OS cont'd

I asked this question a few weeks ago, but I'm still having the problem and I have some new hints. The original question is here:
Java Random Slowdowns on Mac OS
Basically, I have a java application that splits a job into independent pieces and runs them in separate threads. The threads have no synchronization or shared memory items. The only resources they do share are data files on the hard disk, with each thread having an open file channel.
Most of the time it runs very fast, but occasionally it will run very slowly for no apparent reason. If I attach a CPU profiler to it, it starts running quickly again. If I take a CPU snapshot, it says it's spending most of its time in "self time" in a function that doesn't do anything except check a few (unshared, unsynchronized) booleans. I don't know how this could be accurate because, first, it makes no sense, and second, attaching the profiler seems to knock the threads out of whatever mode they're in and fixes the problem. Also, regardless of whether it runs fast or slow, it always finishes and gives the same output, and it never dips in total CPU usage (in this case ~1500%), implying that the threads aren't getting blocked.
I have tried different garbage collectors, different sizings of the parts of the memory space, writing data output to non-RAID drives, and putting all data output in threads separate from the main worker threads.
Does anyone have any idea what kind of problem this could be? Could it be the operating system (OS X 10.6.2)? I have not been able to duplicate it on a Windows machine, but I don't have one with a similar hardware configuration.
It's probably a bit late to reply, but I observed similar slowdowns using Random in threads, related to a volatile variable used within java.util.Random - see "How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?" for details. If the answer I got there is correct (and it sounds pretty reasonable to me), the slowdown might be related to the in-memory addresses of the volatile variables used within Random (have a look at the answer by user 'irreputable' to my question, which explains the problem much better than I do here).
In case you're creating the Random instances within the run() method of your threads, you could simply try turning them into instance variables and initializing them within the constructor of your Thread: this would most likely ensure that the volatile fields of your Random instances end up in 'different areas' in RAM, which do not have to be synchronized between the processor cores.
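In sketch form (the only point is that each thread constructs and owns its Random, instead of sharing one or creating it inside run()):
class Worker extends Thread {
    private final java.util.Random rnd;

    Worker(long seed) {
        rnd = new java.util.Random(seed); // one instance per thread, never shared
    }

    @Override
    public void run() {
        // No contention here: only this thread touches rnd's volatile seed field
        double r = rnd.nextDouble();
        // ... per-thread work using r ...
    }
}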
How do you know it's running slow? How do you know it runs quicker when the CPU profiler is active? If you do the entire run under the profiler, does it ever run slow? If you restrict the number of threads to one, does it ever run slow?
Actually this is an interesting problem; I'm curious to know what the problem is.
First, in your previous question you say you split the job between "multiple" processors. Are they physically multiple, as in multiple machines, or a multi-core CPU?
Second, I'm not sure if Snow Leopard has something to do with it, but we know that SL introduced a few new features for multi-processor machines. So there might be some problem with the VM on the new OS. Try another Java version; I know SL uses Java 6 by default. Try Java 5.
Third, did you try making the thread pool a little smaller? You are talking about 100 threads running at the same time. Try 20 or 40, for example, and see if it makes a difference.
Finally, I would be interested in seeing how you implemented the multi-threading solution. Small parts of the code would be good.
