We have recently been migrating a number of applications from running under Red Hat Linux with JDK 1.6.0_03 to Solaris 10u8 with JDK 1.6.0_16 (much higher-spec machines), and we have noticed what seems to be a rather pressing problem: under certain loads our JVMs get themselves into a "death spiral" and eventually run out of memory. Things to note:
this is not a case of a memory leak. These are applications which have been running just fine (in one case for over 3 years), and the out-of-memory errors are not guaranteed to occur in any given run: sometimes the applications work, sometimes they don't
this is not us moving to a 64-bit VM; we are still running 32-bit
In one case, using the latest G1 garbage collector on 1.6.0_18 seems to have solved the problem. In another, moving back to 1.6.0_03 has worked
Sometimes our apps are falling over with HotSpot SIGSEGV errors
This is affecting applications written in Java as well as Scala
The most important point is this: the behaviour manifests itself in those applications which suddenly get a deluge of data (usually via TCP). It's as if the VM decides to keep adding more data (possibly promoting it to the tenured generation) rather than running a GC on the young generation ("newspace"), until it realises that it has to do a full GC and then, despite practically everything in the VM being garbage, it somehow decides not to collect it!
It sounds crazy, but I just don't see what else it could be. How else can you explain an app which one minute falls over with a max heap of 1GB and the next works just fine, never going above 256MB, when the app is doing exactly the same thing?
So my questions are:
Has anyone else observed this kind of behaviour?
Has anyone any suggestions as to how I might debug the JVM itself (as opposed to my app)? How do I prove this is a VM issue?
Are there any VM-specialist forums out there where I can ask the VM's authors (assuming they aren't on SO)? (We have no support contract)
If this is a bug in the latest versions of the VM, how come no-one else has noticed it?
Interesting problem. Sounds like one of the garbage collectors works poorly on your particular situation.
Have you tried changing the garbage collector being used? There are a LOT of GC options, and figuring out which ones are optimal seems to be a bit of a black art, but I wonder if a basic change would work for you.
I know there is a "Server" GC that tends to work a lot better than the default ones. Are you using that?
Threaded GC (which I believe is the default) is probably the worst for your particular situation; I've noticed that it tends to be much less aggressive when the machine is busy.
One thing I've noticed, it often takes two GCs to convince Java to actually take out the trash. I think the first one tends to unlink a bunch of objects and the second actually deletes them. What you might want to do is occasionally force two garbage collections. This WILL cause a significant GC pause, but I've never seen a case where it took more than two to clean out the entire heap.
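If you want to experiment with that idea, here is a minimal sketch (purely illustrative; System.gc() is only a hint and the VM is free to ignore it, and the class name is made up for the example):

public final class ForceTwoGcs {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("used before: " + (rt.totalMemory() - rt.freeMemory()));
        System.gc();               // first collection: a hint only, the VM may ignore it
        System.runFinalization();  // give pending finalizers a chance to run in between
        System.gc();               // second collection: reclaim what the first pass made unreachable
        System.out.println("used after:  " + (rt.totalMemory() - rt.freeMemory()));
    }
}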
I have had the same issue on Solaris machines, and I solved it by decreasing the maximum size of the JVM. The 32 bit Solaris implementation apparently needs some overhead room beyond what you allocate for the JVM when doing garbage collections. So, for example, with -Xmx3580M I'd get the errors you describe, but with -Xmx3072M it would be fine.
Yes, I've observed this behavior before, and usually after countless hours of tweaking JVM parameters it starts working.
Garbage collection, especially in multithreaded situations, is nondeterministic. Defining a bug in nondeterministic code can be a challenge. But you could try DTrace if you are using Solaris, and there are a lot of JVM options for peering into HotSpot.
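For instance, before anything more exotic, turning on GC logging is usually the first step (a generic HotSpot example; the jar name and log path are just placeholders):

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log -jar yourapp.jar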
Go on Scala IRC and see if Ismael Juma is hanging around (ijuma). He's helped me before, but I think real in-depth help requires paying for it.
I think most people doing this kind of stuff accept that they either need to be JVM tuning experts, have one on staff, or hire a consultant. There are people who specialize in JVM tuning.
In order to solve these problems I think you need to be able to replicate them in a controlled environment where you can precisely duplicate runs with different tuning parameters and/or code changes. If you can't do that hiring an expert probably isn't going to do you any good, and the cheapest way out of the problem is probably buying more RAM.
What kind of OutOfMemoryError are you getting? Is the heap space exhausted, or is the problem related to any of the other memory pools (the Error usually has a message giving more details on its cause)?
If the heap is exhausted and the problem can be reproduced (it sounds as if it can), I would first of all configure the VM to produce a heap dump on OutOfMemoryErrors. You can then analyze the heap and make sure that it's not filled with objects which are still reachable through some unexpected references.
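For reference, on HotSpot that is something along these lines (the dump path is just an example):

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps ...your usual options...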
It's of course not impossible that you are running into a VM bug, but if your application is relying on implementation-specific behaviour in 1.6.0_03, it may for some reason or another end up as a memory hog when running on 1.6.0_16. Such problems may also be found if you are using some kind of server container for your application. Some developers are obviously unable to read documentation, but tend to observe the API behaviour and draw their own conclusions about how something is supposed to work. This is of course not always correct, and I've run into similar problems both with Tomcat and with JBoss (both products at least used to work only with specific VMs).
Also make sure it's not a hardware fault (try running MemTest86 or similar on the server.)
Which kind of SIGSEGV errors exactly do you encounter?
If you run a 32-bit VM, it could be what I described here: http://janvanbesien.blogspot.com/2009/08/mysterious-jvm-crashes-explained.html
Related
I am running a Java application on an Ubuntu 16.04 server. After extensive investigation I have discovered that the JVM heap size is more or less constant; at any rate, there is no increase in heap memory.
However, when I look at the server using htop, the memory consumption of the server grows at an alarming rate. I am not sure what exactly is causing this, but it's 100% originating from the Java process.
I have looked at the hprof files, but I can't really tell what I'm looking for.
I am running two libs that might be responsible, but I am not intimately familiar with them:
OrientDB (plocal)
Hazelcast
I'm not sure if either or both of these would cause a memory increase outside the JVM.
Any advice on the best plan to help identify the problem would be great.
Thanks to #the8472, #davmac, #qwwdfsad and #andrey-lomakin for your comments. I appreciate that the details provided in the question were very thin, but I was trying to avoid providing unrelated data that might lead down a rabbit hole.
I systematically tested each suggestion, and it turns out that the problem was originating from OrientDB. I can't say 100% which of the following fixed the problem (possibly both). As per #andrey-lomakin's suggestion I upgraded from 2.1.19 to 2.2-rc1. In doing this, the application's batch inserts started throwing exceptions, so I converted them all into single linear queries. Once complete, the memory leak was gone.
As a side note, in case it affects anybody else: while testing for a direct I/O leak I discovered, to my surprise, that -Djdk.nio.maxCachedBufferSize=... works with the Java(TM) SE Runtime Environment (build 1.8.0_91-b14).
One application I have to deal with regularly launches shell helpers using ProcessBuilder. For reasons untold, it still runs on a 32-bit JVM (Sun, 1.6.0.25) even though the underlying OS is 64-bit (RHEL 5.x, for what it's worth).
This application is memory-happy, so the heap size is set to its maximum of 3 GB, and the permgen is 128M.
However... At random moments, shell helpers fail to launch; not because of an OutOfMemoryError, but because of ENOMEM... The only cause I can see for this is lack of address space.
Well, sure, but at the same moment the memory is not really under pressure, and top reports that the actual memory usage of the JVM, and even its virtual set size, is not even 3 GB...
Looking at what can be seen of the code of Process, I see that the core method is called forkAndExec(), which is pretty much self-explanatory... From what I know of both syscalls, it just shouldn't fail. But it does. And not always.
Why?
Edit: it should be noted that neo4j is used. It seems to use FileChannel a lot; can that be the cause of the lack of address space?
I would decrease the heap size. The amount of heap actually used could be leaving less and less space for the forked process to run (it inherits resources from its parent)
It is highly likely that just upgrading to a 64-bit JVM would fix the problem. Can you try the 64-bit Java 6 update 30 instead, just to see if it fixes the problem? Whether it does or not, that should tell you more about what the cause is (and then you can decide if it's worth switching).
I think that you are being bitten by Linux memory overcommit killing your processes. That blog post suggests a sysctl variable that you can tune.
My Java application has started to crash regularly with a SIGSEGV and a dump of stack data and a load of information in a text file.
I have debugged C programs in gdb and I have debugged Java code from my IDE. I'm not sure how to approach C-like crashes in a running Java program.
I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code. However, I have no idea how I could even cause segfaults with Java code. There definitely is enough memory available, and when I last checked in the profiler, heap usage was around 50% with occasional spikes around 80%. Are there any startup parameters I could investigate? What is a good checklist when approaching a bug like this?
Though I'm not so far able to reliably reproduce the event, it does not seem to occur entirely at random either, so testing is not completely impossible.
ETA: Some of the gory details
(I'm looking for a general approach, since the actual problem might be very specific. Still, there's some info I already collected and that may be of some value.)
A while ago, I had similar-looking trouble after upgrading my CI server (see here for more details), but that fix (setting -XX:MaxPermSize) did not help this time.
Further investigation revealed that in the crash log files the thread marked as "current thread" is never one of mine, but either one called "VMThread" or one called "GCTaskThread". If it's the latter, it is additionally marked with the comment "(exited)"; if it's the former, the GCTaskThread is not in the list. This makes me suppose that the problem might occur around the end of a GC operation.
"I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code."
I don't think you should make that assumption. Without using JNI, you should not be able to write Java code that causes a SIGSEGV (although we know it happens). My point is, when it happens, it is either a bug in the JVM (not unheard of) or a bug in some JNI code. If you don't have any JNI in your own code, that doesn't mean that you aren't using some library that is, so look for that. When I have seen this kind of problem before, it was in an image manipulation library. If the culprit isn't in your own JNI code, you probably won't be able to 'fix' the bug, but you may still be able to work around it.
First, you should get an alternate JVM on the same platform and try to reproduce it. You can try one of these alternatives.
If you cannot reproduce it, it likely is a JVM bug. From that, you can either mandate a particular JVM or search the bug database, using what you know about how to reproduce it, and maybe get suggested workarounds. (Even if you can reproduce it, many JVM implementations are just tweaks on Oracle's Hotspot implementation, so it might still be a JVM bug.)
If you can reproduce it with an alternative JVM, the fault might be that you have some JNI bug. Look at what libraries you are using and what native calls they might be making. Sometimes there are alternative "pure Java" configurations or jar files for the same library or alternative libraries that do almost the same thing.
Good luck!
The following will almost certainly be useless unless you have native code. However, here goes.
Start the Java program in the Java debugger, with a breakpoint well before the possible SIGSEGV.
Use the ps command to obtain the process ID of the java process.
gdb /usr/lib/jvm/sun-java6/bin/java processid
Make sure that the gdb 'handle' command is set to stop on SIGSEGV.
Continue in the Java debugger from the breakpoint.
Wait for the explosion.
Use gdb to investigate.
If you've really managed to make the JVM take a sigsegv without any native code of your own, you are very unlikely to make any sense of what you will see next, and the best you can do is push a test case onto a bug report.
I found a good list at http://www.oracle.com/technetwork/java/javase/crashes-137240.html. As I'm getting the crashes during GC, I'll try switching between garbage collectors.
I tried switching between the serial and the parallel GC (the latter being the default on a 64-bit Linux server), this only changed the error message accordingly.
Reducing the max heap size from 16G to 10G after a fresh analysis in the profiler (which gave me a heap usage flattening out at 8G) did lead to a significantly lower "Virtual Memory" footprint (16G instead of 60), but I don't even know what that means, and the Internet says it doesn't matter.
Currently, the JVM is running in client mode (using the -client startup option thus overriding the default of -server). So far, there's no crash, but the performance impact seems rather large.
If you have a corefile you could try running jstack on it, which would give you something a little more comprehensible - see http://download.oracle.com/javase/6/docs/technotes/tools/share/jstack.html, although if it's a bug in the gc thread it may not be all that helpful.
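For example, assuming the core file is named core.12345 (adjust the paths to your installation):

jstack $JAVA_HOME/bin/java core.12345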
Try to check whether a crash in C code caused the Java crash. Use Valgrind to detect invalid memory accesses, and also cross-check the stack size.
I've noticed that sometimes, when memory is nearly exhausted, the GC tries to complete at any cost to performance (nearly freezing the program, sometimes for multiple minutes), rather than just throwing an OOME (OutOfMemoryError) immediately.
Is there a way to tune the GC concerning this aspect?
Slowing down the program to nearly zero-speed makes it unresponsive. In certain cases it would be better to have a response "I'm dead" rather than no response at all.
Something like what you're after is built into recent JVMs.
If you:
are using Hotspot VM from (at least) Java 6
are using the Parallel or Concurrent garbage collectors
have the option UseGCOverheadLimit enabled (it's on by default with those collectors, so more specifically if you haven't disabled it)
then you will get an OOM before actually running out of memory: if more than 98% of recent time has been spent in GC for recovery of <2% of the heap size, you'll get a preemptive OOM.
Tuning these parameters (the 98% in particular) sounds like it would be useful to you; however, as far as I'm aware, there is no way to tune those thresholds.
However, check that you qualify under the three points above; if you're not using those collectors with that flag, this may help your situation.
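For example, an illustrative launch line that satisfies those points (the overhead-limit flag is on by default here and is shown only for clarity; the heap size and class name are placeholders):

java -XX:+UseParallelGC -XX:+UseGCOverheadLimit -Xmx512m YourApp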
It's worth reading the HotSpot JVM tuning guide, which can be a big help with this stuff.
I am not aware of any way to configure the Java garbage collector in the manner you describe.
One way might be for your application to proactively monitor the amount of free memory, e.g. using Runtime.freeMemory(), and declare the "I'm dead" condition if that drops below a certain threshold and can't be rectified with a forced garbage collection cycle.
The idea is to pick the value for the threshold that's large enough for the process to never get into the situation you describe.
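A rough sketch of that idea (the threshold and the class name are made up for illustration, and System.gc() is only a hint to the VM):

public final class MemoryWatchdog {
    // Example threshold: treat the process as unhealthy if a forced GC cannot
    // bring the allocatable heap back above 64 MB.
    private static final long MIN_FREE_BYTES = 64L * 1024 * 1024;

    public static boolean looksDead() {
        if (freeHeap() >= MIN_FREE_BYTES) {
            return false;
        }
        System.gc(); // last-ditch attempt; only a hint to the VM
        return freeHeap() < MIN_FREE_BYTES;
    }

    private static long freeHeap() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used; // heap that could still be allocated
    }
}

Call looksDead() from a periodic task and report "I'm dead" (or exit) when it returns true.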
I strongly advise against this. Java trying to GC rather than immediately throwing an OutOfMemoryError makes far more sense; don't make your application fall over unless every alternative has been exhausted.
If your application is running out of memory, you should be increasing your max heap size or looking at its performance in terms of memory allocation and seeing if it can be optimised.
Some things to look at would be:
Use weak references in places where your objects would not be required if not referenced anywhere else.
Don't allocate larger objects than you need (i.e. storing a huge array of 100 objects when you are only going to need access to three of them through the array's lifecycle), or use a long datatype when you only need to store eight values.
Don't hold onto references to objects longer than you would need!
Edit: I think you misunderstand my point. If you accidentally leave a live reference to an object that no longer needs to be used, it will obviously still not be garbage collected. This has nothing to do with nulling things just in case; a typical example would be using a large object for a specific purpose, but when it goes out of scope it is not GC'd because a live reference has accidentally been left elsewhere, somewhere you don't know about, causing a leak. A typical example of this is a hashtable used for lookup, which can be solved with weak references, as the entry becomes eligible for GC when it is only weakly reachable.
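For the hashtable-lookup case specifically, a minimal sketch of the weak-reference approach (class and variable names are just for illustration):

import java.util.Map;
import java.util.WeakHashMap;

public final class WeakCacheExample {
    public static void main(String[] args) {
        // Keys are held only weakly: once the last strong reference to a key
        // disappears, the entry becomes eligible for GC along with the key.
        Map<Object, String> cache = new WeakHashMap<Object, String>();
        Object key = new Object();
        cache.put(key, "some expensive-to-compute value");
        key = null;   // drop the only strong reference to the key
        System.gc();  // hint only; after a collection the entry will eventually vanish
        System.out.println("entries left: " + cache.size());
    }
}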
Regardless, these are just general ideas off the top of my head on how to improve performance with memory allocation. The point I am trying to make is that asking how to throw an OutOfMemory error quicker, rather than letting the Java GC try its best to free up space on the heap, is not a great idea IMO. Optimize your application instead.
Well, it turns out there is a solution since Java 8 b92:
-XX:+ExitOnOutOfMemoryError
When you enable this option, the JVM exits on the first occurrence of an out-of-memory error. It can be used if you prefer restarting an instance of the JVM rather than handling out of memory errors.
-XX:+CrashOnOutOfMemoryError
If this option is enabled, when an out-of-memory error occurs, the JVM crashes and produces text and binary crash files (if core files are enabled).
A good idea is to combine one of the above options with the good old -XX:+HeapDumpOnOutOfMemoryError
I tested these options, they actually work as expected!
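For example (the dump path and jar name are just placeholders):

java -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps -jar app.jar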
Links
See the feature description
See the list of changes in that Java release
We ship Java applications that are run on Linux, AIX and HP-Ux (PA-RISC). We seem to struggle to get acceptable levels of performance on HP-Ux from applications that work just fine in the other two environments. This is true of both execution time and memory consumption.
Although I'm yet to find a definitive article on "why", I believe that measuring memory consumption using "top" is a crude approach due to things like the shared code giving misleading results. However, it's about all we have to go on with a customer site where memory consumption on HP-Ux has become an issue. It only became an issue this time when we moved from Java 1.4 to Java 1.5 (on HP-Ux 11.23 PA-RISC). By "an issue", I mean that the machine ceased to create new processes because we had exhausted all 16GB of physical memory.
By measuring "before" and "after" total "free memory" we are trying to gauge how much has been consumed by a Java application. I wrote a quick app that stores 10,000 random 64 bit strings in an ArrayList and tried this approach to measuring consumption on Linux and HP-Ux under Java 1.4 and Java 1.5.
The results:
HP Java 1.4 ~60MB
HP Java 1.5 ~150MB
Linux Java 1.4 ~24MB
Linux Java 1.5 ~16MB
Can anyone explain why these results might arise? Is this some idiosyncrasy of the way "top" measures free memory? Does Java 1.5 on HP really consume 2.5 times more memory than Java 1.4?
Thanks.
The JVMs might just have different default parameters. The heap will grow to the size that you have configured to let it. The default on the Sun VM is a certain percentage of the RAM in the machine - that's to say that Java will, by default, use more memory if you use a machine with more memory on it.
I'd be really surprised if the HP-UX VM hadn't had lots of tuning for this sort of thing by HP. I'd suggest you fiddle with the parameters on both: figure out the smallest max heap size you can use without hurting performance or throughput.
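For example (the sizes are purely illustrative; the class name is a placeholder):

java -server -Xms256m -Xmx512m YourApp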
I don't have an HP box right now to test my hypothesis. However, if I were you, I would use a profiler like JConsole (which comes with the JDK) or YourKit to measure what is happening.
However, it appears that you started measuring after you saw something amiss; so I'm NOT discounting that it's happening, just pointing you at something I'd have done in the same situation.
First, it's not clear what you measured with the "10,000 random 64-bit strings" test. You're supposed to start the application, measure its bootstrap memory footprint, and then run your test. It could easily be that Java 1.5 acquires more heap right after startup (due to heap manager settings, for instance).
Second, we do run Java apps under 1.4, 1.5 and 1.6 under HP-UX, and they don't demonstrate any special memory requirements. We have Itanium hardware, though.
Third, why do you use top? Why not just print Runtime.getRuntime().totalMemory()?
Fourth, by adding values to an ArrayList you create memory fragmentation. ArrayList has to double its internal storage now and then. Depending on GC settings and the ArrayList.ensureCapacity() implementation, the amount of non-collected memory may differ dramatically between 1.4 and 1.5.
Essentially, instead of figuring out the cause of the problem, you have run a random test that gives you no useful information. You should run a profiler on the application to figure out where the memory is going.
You might also want to look at the problem you are trying to solve... I don't imagine there are many problems that eat 16GB of memory that aren't due for a good round of optimization.
Are you launching multiple VMs? Are you reading large datasets into memory, and not discarding them quickly enough? etc etc etc.