I need help tuning one of our Microservices.
We are running a Spring-based microservice (Spring Integration, Spring Data JPA) on a Jetty server in an OpenJDK 8 container. We are also using Mesosphere as our container orchestration platform.
The application consumes messages from IBM MQ, does some processing and then stores the processed output in an Oracle DB.
We noticed that at some point on the 2nd of May the queue processing from our application stopped. Our MQ team could still see open connections against the queue, but the application was just not reading from it anymore. It did not die completely, as the health-check API that DCOS hits still shows as healthy.
We use AppD for performance monitoring, and what we could see is that on the same date a garbage collection was done, and from that point on the application never picked up messages from the queue again. The graph above shows the amount of time spent doing GC on the different dates.
As part of the Java options we use to run the application we set
-Xmx1024m
The Mesosphere reservation for each of that Microservice is as shown below
Can someone please point me in the right direction to configure the right garbage collection settings for my application?
Also, if you think that the GC is just a symptom, I'd appreciate your views on potential flaws I should be looking for.
Cheers
Kris
You should check your code.
A GC operation will trigger a STW (Stop The World) pause which will block all the threads created in your code, but STW doesn't affect the code's run state.
However, GC will affect your code's logic if you use something such as System.currentTimeMillis to control your code's flow.
A GC operation will also affect non-strong references: if you use WeakReference, SoftReference or WeakHashMap, these components may change their behaviour after a full GC.
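For example (a minimal, self-contained sketch, not taken from the poster's code), a WeakReference whose referent has no other strong references may come back null after a collection, so any logic that assumes the referent is still there will change behaviour:

    import java.lang.ref.WeakReference;

    public class WeakRefAfterGc {
        public static void main(String[] args) {
            Object payload = new Object();
            WeakReference<Object> ref = new WeakReference<>(payload);

            payload = null;   // drop the only strong reference
            System.gc();      // request a collection; not guaranteed, but usually honoured by HotSpot

            // After a GC the referent may have been cleared, so callers must handle null.
            System.out.println("Referent after GC: " + ref.get());
        }
    }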
If a full GC operation is done and the freed memory still doesn't allow your code to allocate a new object, your code will throw an OutOfMemoryError, which will interrupt your code's execution.
I think the things you should do now are:
First, check the 'GC Cause' to determine whether the full GC happened because of a System.gc() call or an allocation failure (a sketch of reading the cause from inside the JVM follows below).
Then, if the GC cause is System.gc(), you should check the non-strong references used in your code.
Finally, if the GC cause is an allocation failure, you should check your logs to determine whether an OutOfMemoryError occurred in your code; if it did, you should allocate more memory to avoid it.
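Here is a sketch of logging the GC cause from inside the application. It relies on the com.sun.management notification API, which is HotSpot-specific but present in OpenJDK 8; the class and method names are the real ones, only the logging is illustrative:

    import com.sun.management.GarbageCollectionNotificationInfo;
    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import javax.management.NotificationEmitter;
    import javax.management.openmbean.CompositeData;

    public class GcCauseLogger {
        public static void install() {
            for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
                // On HotSpot the GC beans also implement NotificationEmitter.
                NotificationEmitter emitter = (NotificationEmitter) gcBean;
                emitter.addNotificationListener((notification, handback) -> {
                    if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                            .equals(notification.getType())) {
                        GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                                .from((CompositeData) notification.getUserData());
                        // Typical causes: "System.gc()", "Allocation Failure", "Metadata GC Threshold"
                        System.out.println(info.getGcName() + " caused by '" + info.getGcCause()
                                + "', took " + info.getGcInfo().getDuration() + " ms");
                    }
                }, null, null);
            }
        }
    }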
As a suggestion, you SHOULD NOT keep your MQ messages in your microservice application's memory. Mostly, the source of a GC problem is a bad practice in your code.
I don't think that garbage collection is at fault here, or that you should be attempting to fix this by tweaking GC parameters.
I think it is one of two things:
A coincidence. A correlation (for a single data point) that doesn't imply causation.
Something about garbage collection, or the event that triggered the garbage collection has caused something to break in your application.
For the latter, there are any number of possibilities. But one that springs to mind is that something (e.g. a request) caused an application thread to allocate a really large object. That triggered a full GC in an attempt to find space. The GC failed; i.e. there still wasn't enough space after the GC did its best. That then turned into an OOME which killed the thread.
If the (hypothetical) thread that was killed by the OOME was critical to the operation of the application, AND the rest of the application didn't "notice" it had died, then the application as a whole would break.
One clue to look for would be an OOME logged when the thread died. But it is also possible (if the application is not written / configured appropriately) for the OOME not to appear in the logs.
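One cheap way to make such a death visible (a small sketch, not specific to your application; combine it with -XX:+HeapDumpOnOutOfMemoryError) is to install a default uncaught-exception handler early in startup, so a thread killed by an OOME at least leaves a log line. Note that logging may itself fail if the heap is completely exhausted, but it often works because the dying thread's stack has been unwound:

    public class ThreadDeathLogging {
        public static void install() {
            // Logs any Throwable (including OutOfMemoryError) that silently kills a thread.
            Thread.setDefaultUncaughtExceptionHandler((thread, throwable) ->
                    System.err.println("Thread " + thread.getName() + " died with: " + throwable));
        }
    }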
Regarding the AppD chart: is that time in seconds? How many full GCs do you have? Perhaps you should enable logging for the garbage collector.
Thanks for your contributions, guys. We will be attempting to increase the CPU allocation from 0.5 CPU to 1.25 CPU, and execute another round of NFT tests.
We tried running the command below
jmap -dump:format=b,file=$FILENAME.bin $PID
to get a heap dump, but the utility is not present in the default OpenJDK 8 container.
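If jmap isn't shipped in the container, one possible workaround (on HotSpot-based JVMs such as OpenJDK 8; the MXBean below is a real but HotSpot-specific API) is to trigger the dump from inside the JVM, for example from a temporary admin endpoint. A minimal sketch:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class HeapDumper {
        public static void dump(String path) throws Exception {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // true = dump only live (reachable) objects, like jmap's "live" option
            bean.dumpHeap(path, true);
        }
    }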
I have just seen your comments about CPU
increase the CPU allocation from 0.5 CPU to 1.25 CPU
Please keep in mind that in order to execute the parallel GC you need at least two cores. I think that with your configuration you are using the serial collector, and there is no reason to use a serial garbage collector nowadays when you can leverage multiple cores. Have you considered trying at least two cores? I often use four as a minimum for my application servers in production and performance environments.
You can see more information here:
On a machine with N hardware threads where N is greater than 8, the parallel collector uses a fixed fraction of N as the number of garbage collector threads. The fraction is approximately 5/8 for large values of N. At values of N below 8, the number used is N. On selected platforms, the fraction drops to 5/16. The specific number of garbage collector threads can be adjusted with a command-line option (which is described later). On a host with one processor, the parallel collector will likely not perform as well as the serial collector because of the overhead required for parallel execution (for example, synchronization). However, when running applications with medium-sized to large-sized heaps, it generally outperforms the serial collector by a modest amount on machines with two processors, and usually performs significantly better than the serial collector when more than two processors are available.
Source: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/parallel.html
Raúl
Related
We have a Java web server that often decides to do garbage-collection while it is running a service. We would like to tell it to do garbage-collection in the idle time, while no service is running. How can we do this?
You would need to be able to find out when the web container is idle, and that is likely to depend on the web container that you are using.
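Purely as an illustration of the kind of bookkeeping this would take (a hypothetical sketch assuming the standard servlet API; the class name and idle threshold are made up, and as the rest of this answer explains, it's probably a bad idea anyway), you could count in-flight requests in a filter and only call System.gc() after a period of idleness:

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicInteger;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Hypothetical idle detector: tracks in-flight requests and lets a scheduled
    // task call System.gc() only when nothing has run for a while.
    public class IdleGcFilter implements Filter {
        private static final AtomicInteger inFlight = new AtomicInteger();
        private static volatile long lastActivity = System.currentTimeMillis();

        @Override public void init(FilterConfig config) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
                throws IOException, ServletException {
            inFlight.incrementAndGet();
            try {
                chain.doFilter(req, resp);
            } finally {
                inFlight.decrementAndGet();
                lastActivity = System.currentTimeMillis();
            }
        }

        /** Call this from a scheduled task; runs a GC only after ~30s of idleness. */
        public static void gcIfIdle() {
            if (inFlight.get() == 0 && System.currentTimeMillis() - lastActivity > 30_000) {
                System.gc();
            }
        }
    }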
But I think that this is a bad idea. The way to force the GC to run is to call System.gc(). If that does anything (!) it will typically trigger a major garbage collection, and will likely take a long time (depending on the GC algorithm). Furthermore, the manually triggered collection will happen whether you need to run the GC or not1. And any request that arrives when the GC is running will be blocked.
In general, it is better to let the JVM decide when to run the GC. If you do this, the GC will run when it is efficient to do so, and will mostly run fast young space collections.
If you are concerned with request delays / jitter caused by long GC pauses, a better strategy is to tune the GC to minimize the pause times. (But beware: the low-pause collectors have greater overheads compared to the throughput collectors. This means that if your system is already heavily loaded a lot of the time, then this is liable to increase average response times.)
Other things to consider include:
Tuning your application to reduce its rate of generating garbage.
Tuning your application to reduce its heap working-set. For example, you might reduce in-memory caching by the application.
Tuning the web container. For example, check that you don't have too many worker threads.
1 - The best time to run the GC is when there is a lot of collectable garbage. Unfortunately, it is difficult for application code to know when that is. The JVM (on the other hand) has ways to keep track of how much free space there is, and when it is a good time to collect. The answer is not always when the heap is full.
I am working on an application whose purpose is to compute reports as fast as possible.
My application uses a big amount of memory: more than 100 GB.
Since our last release, I have noticed a big performance slowdown. My investigation shows that, during the computation, I get many garbage collections lasting between 40 and 60 seconds!!!
(JMC tells me that they are SerialOld but I don't know exactly what that means) and, of course, when the JVM is garbage collecting, the application is completely frozen.
I am now investigating the origin of these garbage collections... and this is very hard work.
I suspect that, if these garbage collections are so long, it is because they are spending a lot of time in finalize() methods (I know that, among all the libraries we integrate from other teams, some of them use finalizers).
However, I don't know how to confirm (or rule out) this hypothesis, or how to find which finalizer is time consuming.
I am looking for a good tool or even a good methodology
Here is data collected via JVisualVM
As you can see, I always have many "Pending Finalizers" when I have a long Old Garbage collection.
What is surprising is that when I am using JVisualVM, the above graph scrolls regularly from right to left. When the Old Garbage collection is triggered, the scrolling stops (up to here it looks normal; this is stop-the-world). However, when the scrolling suddenly restarts, it does not restart from the end of the Old Garbage collection but from the end of the Pending Finalizers.
This leads me to think that the finalizers were blocking the JVM.
Does anyone have an explanation for this?
Thank you very much
Philippe
My application uses a big amount of memory: more than 100 GB.
JMC tells me that they are SerialOld but I don't know exactly what that means
If you are using the serial collector for a 100 GB heap then long pauses are to be expected, because the serial collector is single-threaded and one core can only chomp through so much memory per unit of time.
Simply choosing any one of the multi-threaded collectors should yield lower pause times.
However, I don't know how to confirm (or rule out) this hypothesis, or how to find which finalizer is time consuming.
Generally: gather more data. For GC-related things you need to enable GC logging; for time spent in Java code (be it your application or 3rd-party libraries) you need a profiler.
Here is what I would do to investigate your finalizer theory.
Start the JVM using your favorite Java profiler.
Leave it running for long enough to get a full heap.
Start the profiler.
Trigger garbage collection.
Stop profiler.
Now you can use the profiler information to figure out which (if any) finalize methods are using a large amount of time.
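In addition to the profiler, you can watch whether the finalizer queue is actually backing up around those collections. This sketch only uses the standard MemoryMXBean; the polling loop and interval are just for illustration:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class FinalizerBacklogMonitor {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                // Approximate number of objects currently queued for finalization.
                System.out.println("Objects pending finalization: "
                        + memory.getObjectPendingFinalizationCount());
                Thread.sleep(5_000);
            }
        }
    }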
However, I suspect that the real problem will be a memory leak, and that your JVM is getting to the point where the heap is filling up with unreclaimable objects. That could explain the frequent "SerialOld" garbage collections.
Alternatively, this could just be a big-heap problem. 100 GB is ... big.
I've got a strange problem in my Clojure app.
I'm using http-kit to write a websocket based chat application.
Clients are rendered using React as a single-page app; the first thing they do when they navigate to the home page (after signing in) is create a websocket to receive things like real-time updates and any chat messages. You can see the site here: www.csgoteamfinder.com
The problem I have is after some indeterminate amount of time, it might be 30 minutes after a restart or even 48 hours, the JVM running the chat server suddenly starts consuming all the CPU. When I inspect it with NR (New Relic) I can see that all that time is being used by the garbage collector -- at this stage I have no idea what it's doing.
I've taken a number of screenshots where you can see the effect.
You can see a number of spikes; those spikes correspond to large increases in CPU usage because of the garbage collector. To free up CPU I usually have to restart the JVM. I have been relying on receiving a CPU alert from NR in my Slack account to make sure I jump on these quickly... but I really need to get to the root of the problem.
My initial thought was that I was possibly holding onto the socket reference when the client closed it at their end, but this is not the case. I've been looking at socket count periodically and it is fairly stable.
Any ideas of where to start?
Kind regards, Jason.
It's hard to imagine what could have caused such an issue. But the first thing I would do is take a heap dump at the time of the crash. This can be enabled with the -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path_to_your_heap_dump> JVM args. As a general practice, don't increase the heap size beyond the size of the physical memory available on your server machine. In some rare cases the JVM is unable to dump the heap because the process is doomed; in such cases you can use gcore (if you're on Linux; not sure about Windows).
Once you grab the heap dump, analyse it with MAT (Eclipse Memory Analyzer). I have debugged such applications and this worked perfectly to pin down any memory-related issues. MAT allows you to dissect the heap dump in depth, so you're sure to find the cause of your memory issue, unless it is simply the case that you have allocated too little heap space.
If your program is spending a lot of CPU time in garbage collection, that means that your heap is getting full. Usually this means one of two things:
You need to allocate more heap to your program (via -Xmx).
Your program is leaking memory.
Try the former first. Allocate an insane amount of memory to your program (16GB or more, in your case, based on the graphs I'm looking at). See if you still have the same symptoms.
If the symptoms go away, then your program just needed more memory. Otherwise, you have a memory leak. In this case, you need to do some memory profiling. In the JVM, the way this is usually done is to use jmap to generate a heap dump, then use a heap dump analyser (such as jhat or VisualVM) to look at it.
(Fair disclosure: I'm the creator of a jhat fork called fasthat.)
Most likely your tenured space is filling up, triggering a full collection. At that point the GC uses all the CPUs for several seconds at a time.
To diagnose why this is happening you need to look at your rate of promotion (how much data is moving from the young generation to the tenured space); a rough way to watch this is sketched below.
I would look at increasing the young generation size to decrease the rate of promotion. You could also look at using CMS, as this has shorter pause times (though it uses more CPU).
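If you don't have GC logs handy, one rough way to watch old-gen occupancy (and get a feel for the promotion rate) from inside the JVM is to poll the old-generation memory pool. This is a sketch using the standard MemoryPoolMXBean; the pool-name matching and the sampling interval are assumptions:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class OldGenWatcher {
        public static void main(String[] args) throws InterruptedException {
            // Pool names vary by collector ("Tenured Gen", "PS Old Gen", "CMS Old Gen", "G1 Old Gen").
            MemoryPoolMXBean oldGen = ManagementFactory.getMemoryPoolMXBeans().stream()
                    .filter(p -> p.getName().contains("Old") || p.getName().contains("Tenured"))
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("No old generation pool found"));

            long previous = oldGen.getUsage().getUsed();
            while (true) {
                Thread.sleep(10_000);
                long current = oldGen.getUsage().getUsed();
                // Growth between samples roughly tracks promotion (minus anything a full GC reclaims).
                System.out.printf("Old gen used: %d MB (delta %+d MB)%n",
                        current >> 20, (current - previous) >> 20);
                previous = current;
            }
        }
    }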
Things to try in order:
Reduce the heap size
Count the number of objects of each class, and see if the numbers makes sense
Do you have big byte[] objects that live past generation 1?
Change or tune GC algorithm
Use high-availability, i.e. more than one JVM
Switch to Erlang
You have triggered a global GC. The GC time grows faster than linearly with the amount of memory, so actually reducing the heap space will trigger the global GC more often and make it faster.
You can also experiment with changing the GC algorithm. We had a system where the global GC went down from 200s (happening 1-2 times per 24 hours) to 12s. Yes, the system was at a complete standstill for 3 minutes; no, the users were not happy :-) You could try -XX:+UseConcMarkSweepGC
http://www.fasterj.com/articles/oraclecollectors1.shtml
You will always have pauses like this with the JVM and similar runtimes; it is more about how often you will get them, and how fast the global GC will be. You should make a heap dump and get the count of the different objects of each class. Most likely, you will see that you have millions of one of them; somehow, you are keeping pointers to them unnecessarily in an ever-growing cache, sessions or similar.
http://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks001.html#CIHCAEIH
You can also start using a high-availability solution with at least 2 nodes, so that when one node is busy doing GC, the other node will have to handle the total load for a time. Hopefully, you will not get the global GC on both systems at the same time.
Big binary objects like byte[] and similar are a real problem. Do you have those?
At some point, these need to be compacted by the global GC, and this is a slow operation. Many JVM-based data-processing solutions actually avoid storing all data as plain POJOs on the heap and implement heaps themselves in order to overcome this problem.
Another solution is to switch from the JVM to Erlang. Erlang is near real-time, and it gets there by not having the concept of a global GC of the whole heap. Erlang has many small heaps. You can read a little about it at
https://hamidreza-s.github.io/erlang%20garbage%20collection%20memory%20layout%20soft%20realtime/2015/08/24/erlang-garbage-collection-details-and-why-it-matters.html
Erlang is slower than the JVM, since it copies data, but the performance is much more predictable. It is difficult to have both. I have a websocket-based Erlang solution, and it really works well.
So you have run into a problem that is expected and normal for the JVM, Microsoft CLR and similar. It will get worse and more common during the next couple of years as heap sizes grow.
We're running a Jersey (1.x) based service in Tomcat on AWS in an array of ~20 instances. Periodically an instance "goes bad": over the course of about 4 hours its heap and CPU usage increase until the heap is exhausted and the CPU is pinned. At that point it gets automatically removed from the load balancer and eventually killed.
Examining heap dumps from these instances, ~95% of the memory has been used up by an instance of java.lang.ref.Finalizer which is holding onto all sorts of stuff, but most or all of it is related to HTTPS connections (sun.net.www.protocol.https.HttpsURLConnectionImpl, sun.security.ssl.SSLSocketImpl, and various crypto objects). These are connections that we're making to an external web service using Jersey's client library. A heap dump from a "healthy" instance doesn't indicate any sort of issue.
Under relatively low load instances run for days or weeks without issue. As load increases, so does the frequency of instance failure (several per day by the time average CPU gets to ~40%).
Our JVM args are:
-XX:+UseG1GC -XX:MaxPermSize=256m -Xmx1024m -Xms1024m
I'm in the process of adding JMX logging for garbage collection metrics, but I'm not entirely clear what I should be looking for. At this point I'm primarily looking for ideas of what could kick off this sort of failure or additional targets for investigation.
Is it possibly a connection leak? I'm assuming you have checked for that?
I've had similar issues with GC bugs. Depending on your JVM version, it looks like you are using an experimental (and potentially buggy) feature. You can try disabling G1 and using the default garbage collector. Also, depending on your version, you might be running into a garbage collection overhead limit, where the collector bails out and doesn't properly GC stuff because it is taking too long to calculate what can and can't be trashed. The -XX:-UseGCOverheadLimit flag might help if it is available in your JVM.
Java uses a single finalizer thread to clean up dead objects. Your machine's symptoms are consistent with a pileup of backlogged finalizations. If the finalizer thread slows down too much (because some object takes a long time to finalize), the resulting accumulation of finalizer queue entries could cause the finalizer thread to fall further and further behind the incoming objects until everything grinds to a halt.
You may find profiling useful in determining what objects are slowing the finalizer thread.
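Alongside a profiler, a very low-tech check (a sketch; the thread name "Finalizer" is HotSpot's convention) is to periodically dump the finalizer thread's stack from inside the process. If it is repeatedly parked inside the same finalize() method, that method is your bottleneck:

    import java.util.Map;

    public class FinalizerThreadSampler {
        // Periodically print what the JVM's single "Finalizer" thread is doing;
        // a stack stuck in one finalize() method points at the slow finalizer.
        public static void printFinalizerStack() {
            for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
                if ("Finalizer".equals(entry.getKey().getName())) {
                    for (StackTraceElement frame : entry.getValue()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }
    }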
This ultimately turned out to be caused by a JVM bug (unfortunately I've lost the link to the specific one we tracked it down to). Upgrading to a newer version of OpenJDK (we ended up with OpenJDK 1.7.0_50) solved the issue without us making any changes to our code.
So I've been trying to track down a good way to monitor when the JVM might potentially be heading towards an OOM situation. The best way that seems to work with our app is to track back-to-back concurrent mode failures through CMS. This indicates that the tenured pool is filling up faster than it can actually be cleaned up, or that very little is being reclaimed.
The JMX bean for tracking GCs has very generic information such as memory usage before/after and the like. This information has been relatively inconsistent at best. Is there a better way I can be monitoring this potential warning sign of a dying JVM?
Assuming you're using the Sun JVM, then I am aware of two options:
memory management MXBeans (the API reference starts here), which you appear to be using already, though note there are some HotSpot-specific internal ones you can get access to; see this blog for an example of how to use them
jstat: the command reference is here; you'll want the -gccause option. You can either write a script to launch this and parse the output or, theoretically, you could spawn a process from the host JVM (or another one) that then reads the output stream from jstat to detect the GC causes. I don't think the cause reporting is 100% comprehensive, though. I don't know a way to get this info programmatically from Java code.
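As a middle ground, you can at least poll the old-generation collector's counters from Java and alert on back-to-back full collections. This doesn't tell you the cause (concurrent mode failure vs. anything else), but it is a cheap warning sign. A sketch using the standard GarbageCollectorMXBean, where the bean-name matching and the 10-second window are assumptions:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class FullGcWatcher {
        public static void main(String[] args) throws InterruptedException {
            // With CMS the old-gen collector bean is usually named "ConcurrentMarkSweep".
            GarbageCollectorMXBean oldGc = ManagementFactory.getGarbageCollectorMXBeans().stream()
                    .filter(b -> b.getName().contains("MarkSweep") || b.getName().contains("Old"))
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("No old generation collector found"));

            long lastCount = oldGc.getCollectionCount();
            while (true) {
                Thread.sleep(10_000);
                long count = oldGc.getCollectionCount();
                if (count - lastCount > 1) {
                    // More than one old-gen collection in a 10s window: a possible warning sign.
                    System.err.println("Back-to-back old-gen collections: " + (count - lastCount)
                            + " in the last 10s, total GC time " + oldGc.getCollectionTime() + " ms");
                }
                lastCount = count;
            }
        }
    }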
With the standard JRE 1.6 GC, heap utilization can trend upwards over time, with the garbage collector running less and less frequently depending on the nature of your application and your maximum specified heap size. That said, it is hard to say what is going on without more information.
A few methods to investigate further:
You could take a heap dump of your application while it is running using jmap, and then inspect the heap using jhat to see which objects are on the heap at any given time.
You could also run your application with -XX:+HeapDumpOnOutOfMemoryError, which will automatically produce a heap dump on the first OutOfMemoryError that the JVM encounters.
You could create a monitoring bean specific to your application, and create accessor methods you can hit with a remote JMX client, for example methods to return the sizes of queues and other collections that are likely places of memory utilization in your program (a sketch follows below).
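A minimal sketch of such a bean using the standard MBean naming convention (the queue, the object name and the attribute are hypothetical, just to show the registration):

    // QueueStatsMBean.java -- the management interface (standard MBean naming convention)
    public interface QueueStatsMBean {
        int getWorkQueueSize();
    }

    // QueueStats.java -- hypothetical implementation registered with the platform MBean server
    import java.lang.management.ManagementFactory;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import javax.management.ObjectName;

    public class QueueStats implements QueueStatsMBean {
        private final Queue<String> workQueue = new ConcurrentLinkedQueue<>();

        @Override
        public int getWorkQueueSize() {
            return workQueue.size();
        }

        public static void main(String[] args) throws Exception {
            ManagementFactory.getPlatformMBeanServer()
                    .registerMBean(new QueueStats(), new ObjectName("myapp:type=QueueStats"));
            // ...the application keeps running; read WorkQueueSize with any remote JMX client
        }
    }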
HTH