Unfinalized objects exhausting memory - java

We're running a Jersey (1.x) based service in Tomcat on AWS in an array of ~20 instances. Periodically an instance "goes bad": over the course of about 4 hours its heap and CPU usage increase until the heap is exhausted and the CPU is pinned. At that point it gets automatically removed from the load balancer and eventually killed.
Examining heap dumps from these instances, ~95% of the memory has been used up by an instance of java.lang.ref.Finalizer, which is holding onto all sorts of stuff, but most or all of it is related to HTTPS connections (sun.net.www.protocol.https.HttpsURLConnectionImpl, sun.security.ssl.SSLSocketImpl, and various crypto objects). These are connections that we're making to an external web service using Jersey's client library. A heap dump from a "healthy" instance doesn't indicate any sort of issue.
Under relatively low load instances run for days or weeks without issue. As load increases, so does the frequency of instance failure (several per day by the time average CPU gets to ~40%).
Our JVM args are:
-XX:+UseG1GC -XX:MaxPermSize=256m -Xmx1024m -Xms1024m
I'm in the process of adding JMX logging for garbage collection metrics, but I'm not entirely clear what I should be looking for. At this point I'm primarily looking for ideas of what could kick off this sort of failure or additional targets for investigation.

Is it possibly a connection leak? I'm assuming you have checked for that?
I've had similar issues with GC bugs. Depending on your JVM version, it looks like you are using an experimental (and potentially buggy) feature. You can try disabling G1 and using the default garbage collector. Also, depending on your version, you might be hitting the GC overhead limit, where the collector bails out because it is spending too much time working out what can and can't be reclaimed. The -XX:-UseGCOverheadLimit flag might help, if available in your JVM.
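If you want to try that, a hedged example of what the flags from the question might become (dropping G1 falls back to the default parallel collector on that JVM generation):
-XX:MaxPermSize=256m -Xmx1024m -Xms1024m -XX:-UseGCOverheadLimit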

Java uses a single finalizer thread to clean up dead objects. Your machine's symptoms are consistent with a pileup of backlogged finalizations. If the finalizer thread slows down too much (because some object takes a long time to finalize), the resulting accumulation of finalizer queue entries could cause the finalizer thread to fall further and further behind the incoming objects until everything grinds to a halt.
You may find profiling useful in determining what objects are slowing the finalizer thread.
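If you want a quick, machine-readable signal of such a backlog before reaching for a profiler, the standard MemoryMXBean exposes an approximate count of objects pending finalization. A minimal sketch (the class name is just for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class FinalizerBacklogCheck {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            // A count that keeps climbing suggests the finalizer thread
            // cannot keep up with the rate of finalizable objects.
            System.out.println("objects pending finalization: "
                    + memory.getObjectPendingFinalizationCount());
            Thread.sleep(10000);
        }
    }
}

The same number is available remotely via the java.lang:type=Memory MBean, so it can also be polled over JMX from your monitoring system.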

This ultimately turned out to be caused by a JVM bug (unfortunately I've lost the link to the specific one we tracked it down to). Upgrading to a newer version of OpenJDK (we ended up with OpenJDK 1.7.0_50) solved the issue without us making any changes to our code.


How can WeakHashMap leak memory in Java?

I am running a big Minecraft (Spigot) server with a proprietary, heavily obfuscated AntiCheat plugin as well as our custom set of battle-tested plugins. The server creates and unloads new worlds many times, leaving CraftPlayer and CraftWorld objects for GC after each unload.
We are running on Java 13 (the same happens on Java 12) with G1GC and the so-called "aikar gc flags":
java -XX:+UseLargePagesInMetaspace -XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=100
-XX:TargetSurvivorRatio=90 -XX:G1NewSizePercent=50
-XX:G1MaxNewSizePercent=80 -XX:G1MixedGCLiveThresholdPercent=35
-XX:+ParallelRefProcEnabled -jar Spigot.jar
AntiCheat keeps weak references to CraftPlayers, and yet they leak; once the leak grows large enough, the server's CPU spins very heavily until it ultimately dies with an OutOfMemoryError. Manually running System.gc() does not clean these up either.
When AntiCheat is disabled, the leak is gone.
The ONLY GC-root reference in the heap dump to all of the leaked CraftWorld and CraftPlayer objects is an entry in a WeakHashMap, with the CraftPlayer as the key. (CraftPlayer and CraftWorld cross-reference each other before being normally GCed.)
There's a way you can create a leak with a WeakHashMap: the stale "expired" entries won't be deleted if you aren't actively calling get()/put(), which triggers expungeStaleEntries(). Maybe that would be the cause if AntiCheat were using something like Map<String, WeakHashMap<...>>, but that's not the case; it is using one major WeakHashMap.
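For illustration only, a minimal, generic sketch of that mechanism (nothing here is taken from the actual plugin):

import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class StaleEntryDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, byte[]> map = Collections.synchronizedMap(new WeakHashMap<Object, byte[]>());
        Object key = new Object();
        map.put(key, new byte[50 * 1024 * 1024]);  // a large value to make the effect visible
        key = null;                                // the key is now only weakly reachable
        System.gc();
        Thread.sleep(1000);                        // give the reference handler time to enqueue
        // The weak reference to the key has been cleared, but the entry (and its large
        // value) stays strongly reachable from the map's internal table until some map
        // operation runs expungeStaleEntries(), e.g.:
        System.out.println(map.size());            // touching the map lets the value be collected
    }
}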
And there's a catch: in this kind of leak, you would see that the WeakHashMap's queue is non-empty! And yet it's empty, as I checked! I also noticed the WeakHashMap is wrapped by synchronizedMap, but it doesn't seem to me like that could be the problem.
Is this really a GC bug caused by one of the flags?
This pattern repeats every time and in every heapdump:

Garbage Collection settings with OpenJDK8

I need help tuning one of our microservices.
We are running a Spring-based microservice (Spring Integration, Spring Data JPA) on a Jetty server in an OpenJDK8 container. We are also using Mesosphere as our container orchestration platform.
The application consumes messages from IBM MQ, does some processing and then stores the processed output in an Oracle DB.
We noticed that at some point on the 2nd of May the queue processing from our application stopped. Our MQ team could still see open connections against the queue, but the application was just not reading anymore. It did not die entirely, as the health check API that DCOS hits still reports it as healthy.
We use AppD for performance monitoring, and what we could see is that on the same date a garbage collection was done and from there on the application never picked up messages from the queue. The graph (not reproduced here) showed the amount of time spent doing GC on the different dates.
As part of the Java opts we use to run the application, we set:
-Xmx1024m
The Mesosphere reservation for each instance of that microservice is as shown below.
Can someone please point me in the right direction to configure the right garbage collection settings for my application?
Also, if you think that the GC is just a symptom, I'd appreciate your views on what potential flaws I should be looking for.
Cheers
Kris
You should check your code.
A GC operation triggers a STW (Stop The World) pause, which blocks all the threads created in your code, but the STW pause itself doesn't change your code's state.
However, GC can affect your program logic if you use something like System.currentTimeMillis() to control timing in your code.
A GC operation also affects non-strong references: if you use WeakReference, SoftReference, or WeakHashMap, these components may change their behavior after a full GC.
If a full GC completes and the freed memory still doesn't allow your code to allocate new objects, your code will throw an OutOfMemoryError, which interrupts execution.
I think the things you should do now are:
First, check the 'GC Cause' to determine whether the full GC happened because of a System.gc() call or an allocation failure.
Then, if the GC cause is System.gc(), you should check the non-strong references used in your code.
Finally, if the GC cause is an allocation failure, you should check your logs to determine whether an OutOfMemoryError occurred in your code; if it did, you should allocate more memory to avoid it.
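If GC logging is not enabled, a quick way to see the cause of recent collections is jstat (assuming the container ships the full JDK rather than only a JRE), for example:
jstat -gccause $PID 5000
The LGCC column reports the cause of the last GC and GCC the cause of the one currently in progress.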
As a suggestion, you SHOULD NOT keep your MQ messages in your microservice's application memory. Most of the time, the source of a GC problem is a bad practice in your code.
I don't think that garbage collection is at fault here, or that you should be attempting to fix this by tweaking GC parameters.
I think it is one of two things:
A coincidence. A correlation (for a single data point) that doesn't imply causation.
Something about the garbage collection, or the event that triggered the garbage collection, has caused something to break in your application.
For the latter, there are any number of possibilities. But one that springs to mind is that something (e.g. a request) caused an application thread to allocate a really large object. That triggered a full GC in an attempt to find space. The GC failed; i.e. there still wasn't enough space after the GC did its best. That then turned into an OOME which killed the thread.
If the (hypothetical) thread that was killed by the OOME was critical to the operation of the application, AND the rest of the application didn't "notice" it had died, then the application as a whole would break.
One clue to look for would be an OOME logged when the thread died. But it is also possible (if the application is not written / configured appropriately) for the OOME not to appear in the logs.
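If you suspect that scenario, one cheap safeguard (a generic sketch, not specific to your stack) is to install a default uncaught-exception handler at startup so that a thread killed by an OOME at least leaves a trace:

public class OomeVisibility {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            // An OutOfMemoryError that kills a worker thread will show up here
            // unless some framework-level handler swallows it first.
            System.err.println("Thread " + thread.getName() + " died with: " + throwable);
            throwable.printStackTrace();
        });
    }
}

Note that frameworks which wrap their worker runnables may catch the error before it reaches this handler, so it is a safety net rather than a guarantee.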
Regarding the AppD chart: is that time in seconds? How many full GCs do you have? Perhaps you should enable logging for the garbage collector.
Thanks for your contributions, guys. We will attempt to increase the CPU allocation from 0.5 CPU to 1.25 CPU and execute another round of NFT tests.
We tried running the command below
jmap -dump:format=b,file=$FILENAME.bin $PID
to get a heap dump, but the utility is not present on the default OpenJDK8 container.
I have just seen your comments about CPU
increase the CPU allocation from 0.5 CPU to 1.25 CPU
Please keep in mind that in order to run the parallel GC you need at least two cores. I think with your configuration you are using the serial collector, and there is no reason to use a serial garbage collector nowadays when you can leverage multiple cores. Have you considered trying at least two cores? I often use four as a minimum for my application servers in production and performance environments.
You can see more information here:
On a machine with N hardware threads where N is greater than 8, the parallel collector uses a fixed fraction of N as the number of garbage collector threads. The fraction is approximately 5/8 for large values of N. At values of N below 8, the number used is N. On selected platforms, the fraction drops to 5/16. The specific number of garbage collector threads can be adjusted with a command-line option (which is described later). On a host with one processor, the parallel collector will likely not perform as well as the serial collector because of the overhead required for parallel execution (for example, synchronization). However, when running applications with medium-sized to large-sized heaps, it generally outperforms the serial collector by a modest amount on machines with two processors, and usually performs significantly better than the serial collector when more than two processors are available.
Source: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/parallel.html
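The command-line option referred to there is -XX:ParallelGCThreads; for example (the value shown is illustrative):
-XX:ParallelGCThreads=4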
Raúl

Java: how to trace/monitor GC times for the CMS garbage collector?

I'm having trouble figuring out a way to monitor the JVM GC for memory exhaustion issues.
With the serial GC, we could just look at the full GC pause times and have a pretty good notion if the JVM was in trouble (if it took more than a few seconds, for example).
CMS seems to behave differently.
When querying lastGcInfo from the java.lang:type=GarbageCollector,name=ConcurrentMarkSweep MXBean (via JMX), the reported duration is the sum of all GC steps and is usually several seconds long. This does not indicate an issue with GC; on the contrary, I've found that too-short GC times are usually more of an indicator of trouble (which happens, for example, if the JVM goes into a CMS-concurrent-mark-start -> concurrent mode failure loop).
I've tried jstat as well, which gives the cumulative time spent garbage collecting (unsure if it's for old or newgen GC). This can be graphed, but it's not trivial to use for monitoring purposes. For example, I could parse jstat -gccause output and calculate differences over time, and trace+monitor that (e.g. amount of time spent GC'ing over the last X minutes).
I'm using the following JVM arguments for GC logging:
-Xloggc:/xxx/gc.log
-XX:+PrintGCDetails
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintReferenceGC
-XX:+PrintPromotionFailure
Parsing gc.log is also an option if nothing else is available, but the optimal solution would be to have a java-native way to get at the relevant information.
The information must be machine-readable (to send to monitoring platforms) so visual tools are not an option. I'm running a production environment with a mix of JDK 6/7/8 instances, so version-agnostic solutions are better.
Is there a simple(r) way to monitor CMS garbage collection? What indicators should I be looking at?
Fundamentally, one wants two things from the CMS concurrent collector:
the throughput of the concurrent cycle to keep up with the promotion rate, i.e. the objects surviving into the old gen per unit of time
enough room in the old generation for objects promoted during a concurrent cycle
So let's say the IHOP (initiating heap occupancy) is fixed at 70%; then you are probably approaching a problem when old-gen occupancy reaches >90% at some point. Maybe even earlier if you do some large allocations that don't fit into the young generation or outlive it (that's entirely application-specific).
Additionally, you usually want it to spend more time outside the concurrent cycle than in it, although that depends on how tightly you tune the collector; in principle you could have the concurrent cycle running almost all the time, but then you have very little throughput margin and burn a lot of CPU time on concurrent collections.
If you really, really want to avoid even the occasional full GC, then you'll need even more safety margin due to fragmentation (CMS is non-compacting). I think this can't be monitored via MX beans; you'll have to enable some CMS-specific GC logging to get fragmentation info.
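For a machine-readable, in-process feed of per-collection name, cause, and duration (rather than fragmentation data), HotSpot from JDK 7 onward also emits GC notifications through the com.sun.management API; a sketch, with the println standing in for whatever your monitoring platform ingests:

import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

public class GcNotificationTap {
    public static void install() {
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            // On HotSpot the GC beans also implement NotificationEmitter.
            NotificationEmitter emitter = (NotificationEmitter) gcBean;
            emitter.addNotificationListener(new NotificationListener() {
                public void handleNotification(Notification notification, Object handback) {
                    if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                            .equals(notification.getType())) {
                        return;
                    }
                    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                            .from((CompositeData) notification.getUserData());
                    // One record per collection: collector name, cause, and duration in ms.
                    System.out.println(info.getGcName() + " (" + info.getGcCause() + "): "
                            + info.getGcInfo().getDuration() + " ms");
                }
            }, null, null);
        }
    }
}

This won't help on the JDK 6 instances, but on 7/8 it avoids log parsing entirely.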
For viewing GC logs:
If you have already enabled GC logging, I suggest GCViewer - this is an open source tool that can be used to view GC logs and look at parameters like throughput, pause times etc.
For profiling:
I don't see a JDK version mentioned in the question. For JDK 6, I would recommend VisualVM to profile an application. For JDK 7/8 I would suggest Java Mission Control. You can find these in the JDK's bin folder. These tools can be used to see how the application performs over a period of time and during GC (you can trigger a GC via the VisualVM UI).

Access Memory Usage of JVM from within my Application?

I have a Grails/Spring application which runs in a servlet container on a web server like Tomcat. Sometimes my app crashes because the JVM reaches its maximum allowed memory (Xmx).
The error that follows is a java.lang.OutOfMemoryError because the Java heap space is full.
To prevent this error I want to check from within my app how much memory is in use and how much memory the current JVM has remaining.
How can I access these parameters from within my application?
Try to understand why the OOM is thrown instead of trying to work around it from within the application. And even if you are able to capture those values from within your application, how would you prevent the error? By calling GC explicitly? Know that
the Java Virtual Machine Specification says:
OutOfMemoryError: The Java virtual machine implementation has run out of either virtual or physical memory, and the automatic storage manager was unable to reclaim enough memory to satisfy an object creation request.
Therefore, GC is guaranteed to run before an OOM is thrown. Your application is throwing an OOME after it has just run a full garbage collection and discovered that it still doesn't have enough free heap to proceed.
This could be a memory leak, or your application may simply have a high memory requirement. If the OOM is thrown within a short span of starting the application, it usually means the application needs more memory; if your server runs fine for some time and then throws an OOM, it is most likely a memory leak.
To discover the memory leak, use the tools mentioned in the other answers. I use New Relic to monitor my application and check the frequency of GC runs.
PS Scavenge (aka minor GC, the parallel object collector) runs for the young generation only, and PS MarkSweep (aka major GC, the parallel mark-and-sweep collector) is for the old generation. When both are run, it's considered a full GC. Minor GC runs are pretty frequent; a full GC is comparatively less frequent. Note the consumption of the different heap spaces to analyze your application.
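If you do decide to read these numbers from inside the application (for monitoring rather than for control flow), the standard MXBeans are enough; a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapSnapshot {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.println("heap used=" + heap.getUsed()
                + " committed=" + heap.getCommitted()
                + " max=" + heap.getMax());
        // Per-pool breakdown (eden, survivor, old gen, ...); pool names vary by collector.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getName() + ": " + pool.getUsage());
        }
    }
}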
You can also try the following option:
If you get OOMs too often, then start Java with the correct options, get a heap dump, and analyze it with jhat or with the Eclipse Memory Analyzer (http://www.eclipse.org/mat/):
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path to dump file>
You can try the Grails Melody Plugin, which displays the info at the URL /monitoring relative to your context.
To prevent this error I want to check from within my app how much
memory is in use and how much memory the current JVM has remaining.
I think that it is not the best idea to proceed this way. It is much better to investigate what actually breaks your app and eliminate the error or put some limit in place there. There could be many different scenarios, and your app can become unpredictable. To sum up: capturing the memory level for monitoring purposes is OK (though there are many dedicated tools for that), but in my opinion depending on these values in application logic is not recommended and is bad practice.
To do this you would use a profiler to profile your application and JVM, rather than having code to monitor such metrics inside your application.
Profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or frequency and duration of function calls
Here are some good java profilers:
http://visualvm.java.net/ (Free)
http://www.ej-technologies.com/products/jprofiler/overview.html (Paid)

Is there a way to identify CMS concurrent mode failures over JMX?

So I've been trying to track down a good way to monitor when the JVM might potentially be heading towards an OOM situation. The best way that seems to work with our app is to track back-to-back concurrent mode failures through CMS. This indicates that the tenured pool is filling up faster than it can actually be cleaned up, or that very little is being reclaimed.
The JMX bean for tracking GCs has very generic information such as memory usage before/after and the like. This information has been relatively inconsistent at best. Is there a better way I can be monitoring this potential warning sign of a dying JVM?
Assuming you're using the Sun JVM, I am aware of two options:
Memory management MXBeans (the API reference starts here), which you appear to be using already, though note there are some HotSpot-specific internal ones you can get access to; see this blog for an example of how to use them.
jstat: the command reference is here; you'll want the -gccause option. You can either write a script to launch this and parse the output or, theoretically, spawn a process from the host JVM (or another one) that then reads the output stream from jstat to detect the GC causes. I don't think the cause reporting is 100% comprehensive, though. I don't know a way to get this info programmatically from Java code.
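A rough sketch of that spawn-and-parse approach (the jstat location, PID handling, and parsing are simplified):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class JstatWatcher {
    public static void main(String[] args) throws Exception {
        String pid = args[0];  // PID of the JVM to watch
        Process jstat = new ProcessBuilder("jstat", "-gccause", pid, "5000")
                .redirectErrorStream(true)
                .start();
        BufferedReader reader = new BufferedReader(new InputStreamReader(jstat.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            // The last two columns (LGCC, GCC) carry the GC causes; a real monitor
            // would parse them and alert on the causes it cares about.
            System.out.println(line);
        }
    }
}

(On JDK 7+, the com.sun.management GC notification API also exposes the GC cause programmatically, if you would rather avoid spawning jstat.)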
With the standard JRE 1.6 GC, heap utilization can trend upwards over time with the garbage collector running less and less frequently, depending on the nature of your application and your maximum specified heap size. That said, it is hard to say what is going on without more information.
A few methods to investigate further:
You could take a heap dump of your application while it is running using jmap, and then inspect the heap using jhat to see which objects are in heap at any given time.
You could also run your application with -XX:+HeapDumpOnOutOfMemoryError, which will automatically make a heap dump on the first out-of-memory error that the JVM encounters.
You could create a monitoring bean specific to your application, and create accessor methods you can hit with a remote JMX client. For example methods to return the sizes of queues and other collections that are likely places of memory utilization in your program.
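A minimal sketch of such an application-specific bean (the names and the queue are hypothetical):

// QueueStatsMBean.java
public interface QueueStatsMBean {
    int getWorkQueueSize();
}

// QueueStats.java
import java.lang.management.ManagementFactory;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.management.ObjectName;

public class QueueStats implements QueueStatsMBean {
    private final Queue<String> workQueue = new ConcurrentLinkedQueue<String>();

    public int getWorkQueueSize() {       // readable from any remote JMX client
        return workQueue.size();
    }

    public static void main(String[] args) throws Exception {
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(new QueueStats(), new ObjectName("com.example:type=QueueStats"));
        Thread.sleep(Long.MAX_VALUE);     // keep the JVM alive so the bean can be queried
    }
}

Watching an attribute like this alongside heap usage makes it easier to tell which collection is growing when the tenured pool starts to fill.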
HTH
