We have Linux servers, each running about 200 Java microservices, using cgroups for CPU and memory isolation. One thing we have discovered is that Java tends to require a full core of CPU to perform (major) garbage collection efficiently.
However, with 200 apps and only 24 CPUs, if they all decided to GC at the same time they would obviously be throttled by the cgroups. Since typical application CPU usage is relatively small (roughly 15% of one CPU at peak), it would be nice to find a way to ensure they don't all GC at the same time.
I'm looking into how we can schedule GCs so that the microservices don't all GC at the same time, allowing us to keep running over 200 apps per host, but I was wondering if anybody had suggestions or experience on this topic before I try to reinvent the wheel.
I found that there are command-line options we could use, as well as MBeans that can actually trigger a GC, but I have read that doing so is not advised, as it interferes with the non-deterministic process Java uses to decide when to GC.
Something I'm considering is using performance metrics to monitor CPU, memory, and traffic to try to predict an upcoming GC; if multiple services were about to GC, we could perhaps trigger them one at a time. However, this might be impractical, or simply a bad idea.
We are running Java 7 and 8.
You can't schedule a GC, since it depends on the allocation rate, i.e. on the load and the application logic. Some garbage collectors try to control major GC duration and the total time consumed by GC, but not the rate. There is also no guarantee that an externally triggered GC (e.g. via an MBean) will actually run, and if it does run, it may run later than it was triggered.
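A minimal in-process sketch of the MBean route mentioned in the question, using the standard java.lang.management API. Note that the gc() call is only a request; with -XX:+DisableExplicitGC it does nothing at all, which is exactly the lack of guarantee described above:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class TriggerGc {
    /** Sum of collection counts across all registered collectors. */
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long n = gc.getCollectionCount();
            if (n > 0) total += n;   // -1 means "count not available"
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        // Effectively equivalent to System.gc(); over JMX this would be the
        // gc() operation on the java.lang:type=Memory MBean.
        ManagementFactory.getMemoryMXBean().gc();
        System.out.println("collections before=" + before
                + " after=" + totalCollections());
    }
}
```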
As others have pointed out, simultaneous GC is a very rare event under "normal" load (you can estimate its probability by gathering the average major-GC period, in seconds, from all of your apps and building a histogram). Under "heavy" load you'll likely face CPU shortage much earlier than a simultaneous GC caused by an increased allocation rate, because you would need a "lot" of "long"-living objects (depending on their size) polluting the old generation to trigger a major GC.
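The back-of-envelope estimate suggested above can be sketched as follows. Assuming each app's major GCs are independent, the expected number of apps in a major GC at any instant is N·(duration/period); the numbers used here (200 apps, a 1-second major GC every 10 minutes) are purely illustrative assumptions:

```java
public class GcOverlap {
    /**
     * Expected number of apps concurrently in major GC at a random instant,
     * assuming independent apps with mean GC period periodSec and mean GC
     * duration durationSec (a crude steady-state estimate).
     */
    public static double expectedConcurrent(int apps, double durationSec, double periodSec) {
        return apps * (durationSec / periodSec);
    }

    public static void main(String[] args) {
        // 200 apps, a 1 s major GC every 600 s on average:
        double e = expectedConcurrent(200, 1.0, 600.0);
        System.out.printf("expected concurrent major GCs: %.2f%n", e);
    }
}
```

With these numbers, on average only about a third of one GC is running at any moment, which illustrates why many overlapping major GCs is a rare event.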
Related
We are designing a new software system architecture, and I am working as the project manager.
But there is a point of disagreement within our team.
Our architect says, "System memory should be kept as small as possible, because a full GC takes a long time when it occurs (JVM)."
I am not sure about that opinion.
When setting up system memory, what level of Full GC (garbage collection) time should be reviewed?
How long will it take if a Full GC occurs in a 16 GB memory environment?
You (or rather your architect) might be worrying about something that may not be a problem for your throughput to begin with. Until Java 9, the default collector was ParallelGC, and there are dozens and dozens of applications that have never changed it and are happy with its pause times (and that collector pauses the world every time it runs). So the only real answer is: measure. Enable GC logs and look into them.
On the other hand, if you choose a concurrent collector (you should start with G1), having enough breathing room for it in the heap is crucial. It is even more important for Shenandoah and ZGC, since they do everything concurrently. Every time the GC initiates a concurrent phase, it works via so-called "barriers", which are basically interceptors for the objects in the heap. The structures used by these barriers require storage. If you squeeze that storage, the GC is not going to be happy.
In rather simple words: the more free space in the heap, the better your GC will perform.
When setting up system memory, what level of Full GC (garbage collection) time should be reviewed?
This is not the right question. If you are happy with your target response times, this does not matter. If you are not, you start analyzing GC logs to understand what is causing your delays (this is not going to be trivial, though).
How long will it take if a Full GC occurs in a 16 GB memory environment?
It depends on the collector, on the Java version, and on the type of objects to be collected; there is no easy answer. For Shenandoah and ZGC this is irrelevant, since their pause times do not depend on the size of the heap to be scanned. For G1 it is most probably going to be in the range of a "few seconds". But if you have WeakReferences and finalizers, and a Java version known to handle these poorly, collection times are going to be long.
How long will it take if a Full GC occurs in a 16 GB memory environment?
On a small heap like that, the ballpark figure is around 10 seconds, I'd guess.
But it's not what you should consider.
When setting up system memory, what level of Full GC (garbage collection) time should be reviewed?
All of them. Whenever a full GC occurs, it should be reviewed if your application is latency-critical. This is what you should consider.
A full GC is a failure.
A failure on multiple levels:
to address the memory size available to the application
to address the GC type you use
to address the types of workloads
to address graceful degradation under load
and the list goes on
A concurrent GC implicitly relies on the simple fact that it can collect faster than the application allocates.
When allocation pressure becomes overwhelming, the GC has two options: slow down allocations or stop them altogether.
And when it stops them, you know, all hell breaks loose: futures time out, clusters break apart, and engineers fear and loathe large heaps for the rest of their lives...
It's a common scenario for applications that have evolved for years, with increasing complexity and load, without an overhaul to accommodate the changing world.
It doesn't have to be this way though.
When you build a new application from the ground up, you can design it with performance and latency in mind, with scalability and graceful degradation, instead of heap size and GC times.
You can split off workloads that are not latency-critical but are memory-heavy into a different JVM and run them under good ol' ParallelGC, and it will outperform any concurrent GC in both throughput and memory overhead.
You can run latency-critical tasks under a modern, state-of-the-art GC like Shenandoah and get sub-second collection pauses on heaps of several TB, if you don't mind some ~30% memory overhead and a considerable amount of CPU overhead.
Let the application and its requirements dictate your heap size, not the engineers.
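The "collect faster than the application allocates" condition above can be turned into a rough sizing rule: while a concurrent cycle runs, the application keeps allocating, so the heap needs room for allocation-rate × cycle-duration on top of the live set. This sketch and its numbers are illustrative assumptions, not a tuning recommendation:

```java
public class ConcurrentHeadroom {
    /**
     * Rough minimum heap (MB) so a concurrent cycle can finish before
     * free space runs out: live set plus allocation during the cycle.
     */
    public static double minHeapMb(double liveSetMb, double allocMbPerSec, double cycleSec) {
        return liveSetMb + allocMbPerSec * cycleSec;
    }

    public static void main(String[] args) {
        // Assumed: 4 GB live set, 200 MB/s allocation, 10 s concurrent cycle.
        System.out.printf("need at least %.0f MB of heap%n", minHeapMb(4096, 200, 10));
    }
}
```

Real collectors also need margin for floating garbage and fragmentation, so in practice you would size well above this lower bound.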
Since JRockit is no longer available, is there any way to achieve deterministic (no more than x ms) GC pauses? I am trying G1 GC on Java 8u65, but it is non-deterministic, and I often see young-GC pauses greater than -XX:MaxGCPauseMillis, which is expected, but does not meet my requirement.
The simple answer is no. All the GCs used by HotSpot and other JVMs (like Zing from Azul, where I work) are inherently non-deterministic. You can certainly tune a GC to achieve your latency goal most of the time, and using Zing would give you much more reliable results, because it performs a compacting collection truly concurrently with the application threads (and therefore does not have stop-the-world pauses).
The problem is that, if your application suddenly hits a point where it starts allocating objects at a much higher rate or generates garbage much faster than you have tuned for, you will start seeing pauses that exceed your goal. This is simply the way GC works.
The only way to get the truly deterministic behaviour you're looking for would be to use a real-time JVM (look up the RTSJ spec), which would also require a real-time operating system underneath. The drawback is that your throughput will often suffer.
Your options are
do some tuning until G1 performs as expected
switch to another collector available in the JVM you're using, e.g. CMS
switch to a different JVM which offers collectors with stronger guarantees
optimize your application to reduce GC pressure or worst case behavior
throw more hardware at the problem (more or faster CPU cores, more RAM)
Another option could be the OpenJ9 Metronome GC.
As far as I know, it is designed for deterministic, short pauses for real-time apps. According to the documentation, the default is 10-millisecond pauses. However, it will of course need more CPU, and it is designed more for small heaps.
I never used it, so I cannot share any experience.
The release of Java 11 contains a brand new Garbage Collector, ZGC, that promises very low pause times.
The goal of this project is to create a scalable low latency garbage collector capable of handling heaps ranging from a few gigabytes to multi terabytes in size, with GC pause times not exceeding 10ms.
We have a Java web server that often decides to do garbage-collection while it is running a service. We would like to tell it to do garbage-collection in the idle time, while no service is running. How can we do this?
You would need to be able to find out when the web container is idle, and that is likely to depend on the web container that you are using.
But I think that this is a bad idea. The way to force the GC to run is to call System.gc(). If that does anything at all, it will typically trigger a major collection and will likely take a long time (depending on the GC algorithm). Furthermore, the manually triggered collection will happen whether the GC needs to run or not¹. And any request that arrives while the GC is running will be blocked.
In general, it is better to let the JVM decide when to run the GC. If you do this, the GC will run when it is efficient to do so, and will mostly run fast young space collections.
If you are concerned with request delays / jitter caused by long GC pauses, a better strategy is to tune the GC to minimize the pause times. (But beware: the low-pause collectors have greater overheads compared to the throughput collectors. This means that if your system is already heavily loaded a lot of the time, then this is liable to increase average response times.)
Other things to consider include:
Tuning your application to reduce its rate of generating garbage.
Tuning your application to reduce its heap working-set. For example, you might reduce in-memory caching by the application.
Tuning the web container. For example, check that you don't have too many worker threads.
¹ The best time to run the GC is when there is a lot of collectable garbage. Unfortunately, it is difficult for application code to know when that is. The JVM, on the other hand, has ways to keep track of how much free space there is and when it would be a good time to collect. The answer is not always "when the heap is full".
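For reference, a coarse version of that free-space view is exposed to application code through the standard MemoryMXBean; a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapView {
    /** Fraction of the currently committed heap that is in use. */
    public static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return (double) heap.getUsed() / heap.getCommitted();
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%% of committed%n", 100 * heapUsedFraction());
    }
}
```

Even with this, "used" includes garbage not yet collected, so application code still can't tell how much of it is actually reclaimable, which is the point of the footnote.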
I'm having trouble figuring out a way to monitor the JVM GC for memory exhaustion issues.
With the serial GC, we could just look at the full GC pause times and have a pretty good notion if the JVM was in trouble (if it took more than a few seconds, for example).
CMS seems to behave differently.
When querying lastGcInfo from the java.lang:type=GarbageCollector,name=ConcurrentMarkSweep MXBean (via JMX), the reported duration is the sum of all GC steps and is usually several seconds long. This does not indicate a problem with GC; on the contrary, I've found that too-short GC times are usually more of an indicator of trouble (which happens, for example, if the JVM goes into a CMS-concurrent-mark-start -> concurrent-mode-failure loop).
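For what it's worth, the same lastGcInfo data can be read in-process through the com.sun.management interface (HotSpot-specific, so this part is not vendor-agnostic); a minimal sketch:

```java
import com.sun.management.GarbageCollectorMXBean;  // HotSpot-specific subinterface
import com.sun.management.GcInfo;
import java.lang.management.ManagementFactory;

public class LastGc {
    /** Duration (ms) of the most recent collection, or -1 if none has run. */
    public static long lastGcDurationMillis() {
        System.gc();  // encourage at least one collection; only a hint
        long latest = -1;
        for (java.lang.management.GarbageCollectorMXBean b
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (b instanceof GarbageCollectorMXBean) {
                GcInfo info = ((GarbageCollectorMXBean) b).getLastGcInfo();
                if (info != null) latest = Math.max(latest, info.getDuration());
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println("last GC duration: " + lastGcDurationMillis() + " ms");
    }
}
```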
I've tried jstat as well, which gives the cumulative time spent garbage collecting (I'm unsure whether it covers old- or new-generation GC). This can be graphed, but it's not trivial to use for monitoring purposes. For example, I could parse jstat -gccause output, calculate differences over time, and trace and monitor that (e.g. the amount of time spent GC'ing over the last X minutes).
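The diff-over-time idea can also be done in-process from the standard MXBeans instead of parsing jstat output; a sketch (the windowing is kept deliberately minimal, a real monitor would sample on a timer and export the value):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeWindow {
    private long lastGcMs = -1, lastWallMs = -1;

    /** Cumulative GC time (ms) across all collectors, like jstat's GCT. */
    public static long cumulativeGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) total += t;  // -1 means "not available"
        }
        return total;
    }

    /** Fraction of wall-clock time spent in GC since the previous call. */
    public double sample() {
        long gcMs = cumulativeGcMillis();
        long wallMs = System.currentTimeMillis();
        double fraction = 0.0;
        if (lastWallMs >= 0 && wallMs > lastWallMs) {
            fraction = (double) (gcMs - lastGcMs) / (wallMs - lastWallMs);
        }
        lastGcMs = gcMs;
        lastWallMs = wallMs;
        return fraction;
    }
}
```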
I'm using the following JVM arguments for GC logging:
-Xloggc:/xxx/gc.log
-XX:+PrintGCDetails
-verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintReferenceGC
-XX:+PrintPromotionFailure
Parsing gc.log is also an option if nothing else is available, but the optimal solution would be to have a java-native way to get at the relevant information.
The information must be machine-readable (to send to monitoring platforms) so visual tools are not an option. I'm running a production environment with a mix of JDK 6/7/8 instances, so version-agnostic solutions are better.
Is there a simple(r) way to monitor CMS garbage collection? What indicators should I be looking at?
Fundamentally, one wants two things from the CMS concurrent collector:
the throughput of the concurrent cycle to keep up with the promotion rate, i.e. the objects surviving into the old gen per unit of time
enough room in the old generation for objects promoted during a concurrent cycle
So let's say the IHOP (initiating heap occupancy percentage) is fixed at 70%; then you are probably approaching a problem when occupancy reaches >90% at some point. It may be even earlier if you do some large allocations that don't fit into the young generation or outlive it (that's entirely application-specific).
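The occupancy figure to compare against your IHOP can be read from the standard MemoryPoolMXBean. Pool names differ per collector ("CMS Old Gen", "PS Old Gen", "Tenured Gen", ...), hence the substring match in this sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class OldGenOccupancy {
    /** Old-gen occupancy as a fraction of its max, or -1 if no such pool. */
    public static double oldGenFraction() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (pool.getType() == MemoryType.HEAP
                    && (name.contains("Old") || name.contains("Tenured"))) {
                MemoryUsage u = pool.getUsage();
                long max = u.getMax() > 0 ? u.getMax() : u.getCommitted();
                return max > 0 ? (double) u.getUsed() / max : 0.0;
            }
        }
        return -1;  // e.g. a single-generation collector
    }

    public static void main(String[] args) {
        System.out.printf("old gen occupancy: %.1f%%%n", 100 * oldGenFraction());
    }
}
```

An alert on this crossing, say, 90% (for an IHOP of 70%) would catch the "concurrent cycle not keeping up" situation described above.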
Additionally, you usually want the collector to spend more time outside the concurrent cycle than in it, although that depends on how tightly you tune it; in principle you could have the concurrent cycle running almost all the time, but then you have very little throughput margin and burn a lot of CPU time on concurrent collections.
If you really, really want to avoid even the occasional full GC, then you'll need even more safety margin due to fragmentation (CMS is non-compacting). I don't think this can be monitored via MX beans; you'll have to enable some CMS-specific GC logging to get fragmentation information.
For viewing GC logs:
If you have already enabled GC logging, I suggest GCViewer - this is an open source tool that can be used to view GC logs and look at parameters like throughput, pause times etc.
For profiling:
I don't see a JDK version mentioned in the question. For JDK 6, I would recommend VisualVM to profile the application. For JDK 7/8, I would suggest Mission Control. You can find these in the JDK's bin folder. These tools can be used to see how the application performs over a period of time and during GC (you can trigger a GC via the VisualVM UI).
We monitor our production JVMs and have monitoring triggers that (ideally) send warnings when a JVM runs low on heap space. However, coming up with an effective detection algorithm is quite difficult, since it is in the nature of garbage collection that the application regularly has little available memory just before the GC kicks in.
There are many ways I can think of to work around this. E.g., monitor the available space and send a warning when it becomes too low, but delay it and only trigger when the condition persists for more than a minute. So, what works for you in practice?
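The "delay and only trigger if persistent" idea can be sketched like this: the alert fires only if free heap stays below a threshold for a full window, so the normal pre-GC dips don't page anyone. The threshold and window here are illustrative, not recommendations:

```java
public class PersistentLowMemoryAlert {
    private final double minFreeFraction;
    private final long windowMillis;
    private long belowSince = -1;  // -1 = currently above the threshold

    public PersistentLowMemoryAlert(double minFreeFraction, long windowMillis) {
        this.minFreeFraction = minFreeFraction;
        this.windowMillis = windowMillis;
    }

    /** Feed one sample of free-heap fraction; returns true when the alert should fire. */
    public boolean onSample(double freeFraction, long nowMillis) {
        if (freeFraction >= minFreeFraction) {
            belowSince = -1;  // recovered (e.g. a GC ran), reset the timer
            return false;
        }
        if (belowSince < 0) belowSince = nowMillis;
        return nowMillis - belowSince >= windowMillis;
    }
}
```

In practice you would feed it samples every few seconds, computed from MemoryMXBean's heap usage (free = max - used).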
Particularly interesting:
How to detect a critical memory/heap problem that needs an immediate reaction?
How to detect a heap problem that needs a precautionary action?
What approaches work universally? E.g. without the need to adapt the triggers to certain JVM tuning parameters (or vice versa), or to force a GC at certain time intervals.
Is there any best practice that is used widely?
I have found a very effective measure of JVM memory health to be the percentage of time the JVM spends in garbage collection. A healthy, well-tuned JVM will use very little (< 1% or so) of its CPU time collecting garbage. An unhealthy JVM will "waste" much of its time keeping the heap clean, and the percentage of CPU used on collection climbs sharply in a JVM experiencing a memory leak, or one whose max heap setting is too low. As more CPU is used keeping the heap clean, less is used doing "real work"; assuming the inbound request rate doesn't slow down, it's easy to fall off a cliff where you become CPU-bound and can't get enough work done quickly enough, long before you actually get a java.lang.OutOfMemoryError.
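The statistic described above can be computed from the standard MXBeans: cumulative GC time as a fraction of JVM uptime. In practice you would diff two samples to get a recent-window figure rather than this since-startup average:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeShare {
    /** Fraction of JVM uptime spent in garbage collection since startup. */
    public static double gcFractionOfUptime() {
        long gcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) gcMs += t;  // -1 means "not available"
        }
        long upMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return upMs > 0 ? (double) gcMs / upMs : 0.0;
    }

    public static void main(String[] args) {
        System.out.printf("time in GC: %.2f%% of uptime%n", 100 * gcFractionOfUptime());
    }
}
```

An alert threshold of a few percent, sustained over a window, would correspond to the "< 1% or so is healthy" rule of thumb above.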
It's worth noting that this is really the condition you want to guard against, too. You don't actually care if the JVM uses all of its heap, so long as it can efficiently reclaim memory without getting in the way of the "real work" it needs to do. (In fact, if you're never hitting the max heap size, you may want to consider shrinking your heap.)
This statistic is provided by many modern JVMs (certainly Oracle's and IBM's, at least).
Another somewhat effective measure can be the time between full GCs. The more often you are having to perform a full GC, the more time you're spending in GC.