Controlling Java garbage collection in a real-time system

We are running an RT system in Java. It often uses relatively large heaps (100+ GB) and serves requests coming from a message queue. Each request must be handled quickly (<100 ms) to meet the SLAs.
We are experiencing serious GC-related problems: the GC often triggers a stop-the-world collection in the middle of a request (200+ ms pauses), causing the request to fail.
One of our developers, who has reasonable knowledge of GCs, spent quite some time tuning GC parameters and trying different collectors. After several days he came up with a parametrization that we jokingly call "evolved by genetic algorithm". It lowers the GC pauses, but is still far from meeting the SLA requirements.
The solution I am looking for would protect the critical parts of the code from GC and, after a request is finished, let the GC do as much work as it needs before taking the next request. Occasional pauses outside the requests would be OK, because we have several workers and a garbage-collecting worker would simply not ask for requests for a while.
I have some ideas which are silly, ugly, and most probably won't work, but hopefully they illustrate the problem:
Occasionally call Thread.sleep() in the receiving thread, praying for the GC to do some work in the meantime,
Invoke System.gc() or Runtime.gc() between requests, again hopelessly praying for it to help,
Mess up the code entirely with hacky patterns like https://stackoverflow.com/a/6915221/1137187.
The last important note is that we are a low-budget startup and commercial solutions such as Zing® are not an option for us; we are looking for a non-commercial solution.
Any ideas? We would rewrite our code entirely in C++ (at the beginning we didn't know that GC might be a problem rather than a solution), but the code base is already too large to do that.

Any ideas?
Use a different JVM? Azul claims to be able to handle such cases. Red Hat and Oracle are contributing Shenandoah and ZGC to OpenJDK, respectively, with similar goals in mind, so maybe you could try experimental builds if you don't want a commercial solution.
There are also other JVMs focused on realtime applications, but as I understand it they target harder realtime requirements on smaller systems; yours sounds more like soft realtime requirements.
Another thing you can attempt is significantly reducing object allocations (profile your application!) by using pre-allocated objects or more compact data representations where applicable. Reducing allocation pressure while keeping the new gen size the same means increased mortality rate per collection, which should speed up young collections.
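For illustration, a minimal sketch of the pre-allocation idea: workers borrow a buffer from a fixed pool per request and hand it back afterwards, so steady-state request handling allocates next to nothing. The BufferPool class and the sizes are made up for the example, not code from the question.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Pre-allocated buffer pool: all allocation happens up front, so request
// handling itself puts (almost) no pressure on the young generation.
final class BufferPool {
    private final BlockingQueue<byte[]> pool;

    BufferPool(int poolSize, int bufferSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new byte[bufferSize]);   // allocate everything at startup
        }
    }

    byte[] borrow() throws InterruptedException {
        return pool.take();                   // blocks if every buffer is in use
    }

    void release(byte[] buffer) {
        pool.offer(buffer);                   // hand the buffer back for reuse
    }
}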
Choosing hardware to maximize memory bandwidth might help too.
Invoke System.gc() or Runtime.gc() between requests, again hopelessly praying for it to help,
This might work when combined with -XX:+ExplicitGCInvokesConcurrent; otherwise it would trigger a single-threaded STW collection with CMS or G1 (I'm assuming you're using one of those). But that approach seems brittle and requires lots of tuning and monitoring.
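A rough sketch of what that could look like, with Request, takeRequest() and handle() standing in for the real message-queue client; run with -XX:+ExplicitGCInvokesConcurrent so the System.gc() hint starts a concurrent cycle rather than a full stop-the-world collection:

// Worker loop that hints the JVM to collect between requests rather than
// during them. Request, takeRequest() and handle() are placeholders.
final class Worker implements Runnable {
    private volatile boolean running = true;

    @Override
    public void run() {
        while (running) {
            Request request = takeRequest();  // blocks on the message queue
            handle(request);                  // must finish within the SLA
            System.gc();                      // do GC work now, between requests
        }
    }

    void stop() { running = false; }

    private Request takeRequest() { return new Request(); }  // stub
    private void handle(Request request) { /* real work goes here */ }

    static final class Request { }
}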

Related

How to monitor memory after major garbage collection via JMX or code

Many monitoring tools, like the otherwise fantastic JavaMelody, just monitor the current memory usage. If you want to check for memory leaks or impending out-of-memory situations, this is not particularly helpful when your application generates loads of garbage that gets collected immediately. Not perfect, but IMHO much more interesting, would be to monitor the memory usage immediately after a major garbage collection. If that's high, a crash is looming.
So: can you find out the memory usage immediately after the last major garbage collection - either from Java code or via JMX? I know there are some tools like VisualVM which do this (which is not an option for production use), and it can be written to the garbage collection log, but I'm looking for a more straightforward solution than parsing the garbage collection logfile. :-) To be clear: I'm looking for something that can easily be used in any application in production, not an expensive tool for debugging.
In case that matters: JDK 7 with -XX:+UseConcMarkSweepGC , but I am interested in general answers, too.
Information about memory available right after a GC (young or old) is available via JMX.
The garbage collector MBean has a LastGcInfo attribute, which is a composite data object that includes the memory pool sizes before and after the GC.
In addition, starting with Java 7, a JMX notification subscription can be used to receive GC events without polling.
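A minimal sketch of reading that attribute through the com.sun.management extension (HotSpot/OpenJDK-specific, so treat it as an assumption about your JVM):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Map;

import com.sun.management.GarbageCollectorMXBean;  // HotSpot-specific extension
import com.sun.management.GcInfo;

// Print the memory pool usage recorded right after the last collection of
// each garbage collector.
public class AfterGcUsage {
    public static void main(String[] args) {
        for (java.lang.management.GarbageCollectorMXBean bean
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(bean instanceof GarbageCollectorMXBean)) continue;
            GcInfo info = ((GarbageCollectorMXBean) bean).getLastGcInfo();
            if (info == null) continue;               // this collector has not run yet
            for (Map.Entry<String, MemoryUsage> e
                    : info.getMemoryUsageAfterGc().entrySet()) {
                System.out.printf("%s / %s: %d bytes used after GC%n",
                        bean.getName(), e.getKey(), e.getValue().getUsed());
            }
        }
    }
}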
You can find an example of code working with the GC MBean here.
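A rough sketch of the notification-based approach (again HotSpot/OpenJDK-specific, and not the linked example):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Map;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

import com.sun.management.GarbageCollectionNotificationInfo;

// Subscribe to GC notifications (Java 7+) and log post-collection usage
// without polling.
public class GcNotificationDemo {
    public static void main(String[] args) throws InterruptedException {
        NotificationListener listener = new NotificationListener() {
            @Override
            public void handleNotification(Notification n, Object handback) {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(n.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) n.getUserData());
                for (Map.Entry<String, MemoryUsage> e
                        : info.getGcInfo().getMemoryUsageAfterGc().entrySet()) {
                    System.out.printf("%s after %s: %d bytes used%n",
                            e.getKey(), info.getGcName(), e.getValue().getUsed());
                }
            }
        };
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
        }
        Thread.sleep(Long.MAX_VALUE);  // keep the demo process alive
    }
}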
Probably 'Dynatrace' is an option... it's a very powerful monitoring tool (not only for memory).
http://www.dynatrace.com/en/index.html
A very crude way would be to monitor the minima of Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() for some time. At least that would not require you to know intimate details about memory pools, as monitoring LastGcInfo in Alexey Ragozin's answer does. This might require you to get notifications about garbage collections.
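A crude version of that idea, with a one-second sampling interval chosen just for the example:

// Sample used heap periodically and keep the minimum seen; that minimum
// roughly approximates "usage right after the last collection".
public class UsedHeapMinimum {
    public static void main(String[] args) throws InterruptedException {
        long minUsed = Long.MAX_VALUE;
        while (true) {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory();
            minUsed = Math.min(minUsed, used);
            System.out.printf("used=%d bytes, minimum seen=%d bytes%n", used, minUsed);
            Thread.sleep(1000);   // arbitrary sampling interval
        }
    }
}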

Is garbage collection detrimental to the performance of this type of program?

I'm building a program that will live on an AWS EC2 instance and (probably) be invoked periodically via a cron job. The program will 'crawl'/'poll' specific websites that we've partnered with, index/aggregate their content, and update our database. I'm thinking Java is a perfect fit as the language to program this application in. Some members of our engineering team are concerned about the performance penalty of Java's garbage collection and are suggesting using C++.
Are these valid concerns? This is an application that will be invoked possibly once every 30 minutes via a cron job, and as long as it finishes its task within that time frame the performance is acceptable, I would assume. I'm not sure whether garbage collection would be a performance issue, since I would assume the server will have plenty of memory, and the act of tracking how many objects point to an area of memory and declaring that memory free when the count reaches 0 doesn't seem too detrimental to me.
No, your concerns are most likely unfounded.
GC can be a concern when dealing with large heaps and fragmented memory (which requires a stop-the-world compacting collection), or with medium-lived objects that are promoted to the old generation but then quickly de-referenced (which causes excessive GC, but can be fixed by resizing the ratio of new to old space).
A web crawler is very unlikely to fit either of the above profiles: you probably don't need a massive old generation and should have relatively short-lived objects (the in-memory page representation while you parse out data), which the young-generation collector handles efficiently.
We have an in-house crawler (Java) that can happily handle 2 million pages per day, including some additional post-processing per page, on commodity hardware (2 GB RAM); the main constraint is bandwidth. GC is a non-issue.
As others have mentioned, GC is rarely an issue for throughput sensitive applications (such as a crawler) but it can (if one is not careful) be an issue for latency sensitive apps (such as a trading platform).
The typical concern C++ programmers have for GC is one of latency. That is, as you run a program, periodic GCs interrupt the mutator and cause spikes in latency. Back when I used to run Java web applications for a living, I had a couple customers who would see latency spikes in the logs and complain about it — and my job was to tune the GC to minimize the impact of those spikes. There are some relatively complicated advances in GC over the years to make monstrous Java applications run with consistently low latency, and I'm impressed with the work of the engineers at Sun (now Oracle) who made that possible.
However, GC has always been very good at handling tasks with high throughput, where latency is not a concern. This includes cron jobs. Your engineers have unfounded concerns.
Note: A simple experimental GC reduced the cost of memory allocation / freeing to less than two instructions on average, which improved throughput, but this design is fairly esoteric and requires a lot of memory, which you don't have on EC2.
The simplest GCs around offer a tradeoff between large heap (high latency, high throughput) and small heap (lower latency, lower throughput). It takes some profiling to get it right for a particular application and workload, but these simple GCs are very forgiving in a large heap / high throughput / high latency configuration.
Fetching and parsing websites will take far more time than the garbage collector; its impact will probably be negligible. Moreover, automatic memory management is often more efficient at dealing with a lot of small objects (such as strings) than manual memory management via new/delete. Not to mention that garbage-collected memory is easier to use.
I don't have any hard numbers to back this up, but code that does a lot of small string manipulations (lots of small allocations and deallocations in a short period of time) should be much faster in a garbage-collected environment.
The reason is that modern GCs "re-pack" the heap on a regular basis by moving objects from an "eden" space to survivor spaces and then to a tenured object heap, and modern GCs are heavily optimized for the case where many small objects are allocated and then deallocated quickly.
For example, constructing a new string in Java (on any modern JVM) is as fast as a stack allocation in C++. By contrast, unless you're doing fancy string-pooling stuff in C++, you'll be really taxing your allocator with lots of small and quick allocations.
Plus, there are several other good reasons to consider Java for this sort of app: it has better out-of-the-box support for network protocols, which you'll need for fetching website data, and it is much more robust against the possibility of buffer overflows in the face of malicious content.
Garbage collection (GC) is fundamentally a space-time tradeoff. The more memory you have, the less time your program will need to spend performing garbage collection. As long as you have a lot of memory available relative to the maximum live size (total memory in use), the main performance hit of GC -- whole-heap collections -- should be a rare event. Java's other advantages (notably robustness, security, portability, and an excellent networking library) make this a no-brainer.
For some hard data to share with your colleagues showing that GC performs as well as malloc/free with plenty of available RAM, see:
"Quantifying the Performance of Garbage Collection vs. Explicit Memory Management", Matthew Hertz and Emery D. Berger, OOPSLA 2005.
This paper provides empirical answers to an age-old question: is garbage collection faster/slower/the same speed as malloc/free? We introduce oracular memory management, an approach that lets us measure unaltered Java programs as if they used malloc and free. The result: a good GC can match the performance of a good allocator, but it takes 5X more space. If physical memory is tight, however, conventional garbage collectors suffer an order-of-magnitude performance penalty.

Can the thread per request model be faster than non-blocking I/O?

I remember 2 or 3 years ago reading a couple of articles where people claimed that modern threading libraries were getting so good that thread-per-request servers would not only be easier to write than non-blocking servers but that they'd be faster, too. I believe this was even demonstrated in Java with a JVM that mapped Java threads to pthreads (i.e. the Java NIO overhead was more than the context-switching overhead).
But now I see all the "cutting edge" servers use asynchronous libraries (Java NIO, epoll, even Node.js). Does this mean that async won?
Not in my opinion. If both models are well implemented (this is a BIG requirement) I think that the concept of NIO should prevail.
At the heart of a computer are cores. No matter what you do, you cannot parallelize your application more than you have cores. i.e. If you have a 4 core machine, you can ONLY do 4 things at a time (I'm glossing over some details here, but that suffices for this argument).
Expanding on that thought, if you ever have more threads than cores, you have waste. That waste takes two forms. First is the overhead of the extra threads themselves. Second is the time spent switching between threads. Both are probably minor, but they are there.
Ideally, you have a single thread per core, and each of those threads is running at 100% processing speed on its core. Task switching wouldn't occur in the ideal case. Of course there is the OS, but if you take a 16-core machine and leave 2-3 threads for the OS, then the remaining 13-14 go towards your app. Those threads can switch what they are doing within your app, like when they are blocked by IO requirements, but don't have to pay that cost at the OS level. Write it right into your app.
An excellent example of this scaling is seen in SEDA http://www.eecs.harvard.edu/~mdw/proj/seda/ . It showed much better scaling under load than a standard thread-per-request model.
My personal experience is with Netty. I had a simple app. I implemented it well in both Tomcat and Netty. Then I load tested it with hundreds of concurrent requests (upwards of 800, I believe). Eventually Tomcat slowed to a crawl and exhibited extremely bursty/laggy behavior, whereas the Netty implementation simply increased its response times but kept up an incredible overall throughput.
Please note, this hinges on solid implementation. NIO is still getting better with time. We are learning how to tune our server OSes to work better with it, as well as how to implement the JVMs to better leverage the OS functionality. I don't think a winner can be declared yet, but I believe NIO will be the eventual winner, and it's doing quite well already.
Thread-per-request is faster as long as there is enough memory.
When there are too many connections, most of which are idle, NIO can save threads and therefore memory, and the system can handle a lot more users than the thread-per-connection model.
CPU is not a direct factor here. With NIO, you effectively need to implement a threading model yourself, which is unlikely to be faster than the JVM's threads.
In either choice, memory is the ultimate bottleneck. When load increases and memory used approaches the maximum, GC will be very busy, and the system often demonstrates the symptom of 100% CPU.
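To make the contrast concrete, here is a minimal single-threaded selector loop (a toy echo server; the port and buffer size are arbitrary choices for the sketch): one thread services every connection, so thousands of mostly idle clients cost no extra threads or stacks.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Single-threaded NIO echo server: one selector multiplexes all connections.
public class NioEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(1024);
        while (true) {
            selector.select();   // blocks until at least one channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int read = client.read(buffer);
                    if (read < 0) { client.close(); continue; }
                    buffer.flip();
                    client.write(buffer);   // echo the bytes straight back
                }
            }
        }
    }
}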
Some time ago I found a rather interesting presentation providing some insight into "why the old thread-per-client model is better". There are even measurements. However, I'm still thinking it through. In my opinion the best answer to this question is "it depends", because most (if not all) engineering decisions are trade-offs.
Like that presentation said - there's speed and there's scalability.
One scenario where thread-per-request will almost certainly be faster than any async solution is when you have a relatively small number of clients (e.g. <100), but each client is very high volume: for example, a realtime app where no more than 100 clients are sending/generating 500 messages a second each. The thread-per-request model will certainly be more efficient than any async event-loop solution. Async scales better when there are many requests/clients because it doesn't waste cycles waiting on many client connections, but when you have a few high-volume clients with little waiting, it's less efficient.
From what I've seen, the authors of Node and Netty both recognize that these frameworks are meant primarily to address high-volume/many-connection scalability situations, rather than being faster for a smaller number of high-volume clients.
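For comparison with the selector sketch above, a toy thread-per-connection echo server: with a small number of busy clients, each connection simply gets a dedicated thread and blocking reads (the port and buffer size are again arbitrary).

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Thread-per-connection echo server: simple blocking I/O, one thread per client.
public class ThreadPerConnectionEchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket socket = server.accept();
                new Thread(() -> echo(socket)).start();   // one thread per client
            }
        }
    }

    private static void echo(Socket socket) {
        try (Socket s = socket;
             InputStream in = s.getInputStream();
             OutputStream out = s.getOutputStream()) {
            byte[] buffer = new byte[1024];
            int read;
            while ((read = in.read(buffer)) != -1) {   // blocking read per client
                out.write(buffer, 0, read);            // echo straight back
            }
        } catch (IOException ignored) {
            // client disconnected
        }
    }
}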

Java without GC

I would like to run a Java program with garbage collection switched off. Managing memory in my own code is not so difficult.
However the program needs quite a lot of I/O.
Is there any way (short of using JNI for all I/O operations) that I could achieve this using pure Java?
Thanks
Daniel
What you are trying to achieve is frequently done in investment banking to develop low-latency real-time systems.
To avoid GC you simply need to make sure not to allocate memory after the startup and warm-up phase of your application.
As you seem to have noticed, Java NIO internally does unwanted memory allocation.
Unfortunately, you have no choice but write JNI replacements for the problematic calls.
You need at least to write a replacement for the NIO Selector.
You will have to avoid using most of the Java libraries due to similar unwanted memory allocations.
For example, you will have to avoid immutable objects like String, avoid boxing, and re-implement collections so that they preallocate enough entries for the whole lifetime of your program.
Writing Java code this way is not easy, but certainly possible.
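A tiny sketch of that style, with made-up field names: buffers are allocated once at startup, fields stay primitive, and text goes into a reused StringBuilder instead of fresh Strings.

// Reusable message object: no allocation on the hot path, only reset() calls.
final class ReusableMessage {
    long timestampMillis;                         // primitive, never boxed
    int payloadLength;
    final byte[] payload = new byte[64 * 1024];   // allocated once at startup
    final StringBuilder text = new StringBuilder(4096);

    void reset() {                                // prepare for reuse without new objects
        timestampMillis = 0L;
        payloadLength = 0;
        text.setLength(0);
    }
}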
I am developing a platform to do just so.
Managing memory in my own code is not so difficult.
It's not difficult - It's impossible. For example:
public void foo() {
    Object o = new Object();
    // free(o); // Doh! No "free" keyword in Java.
}
Without the aid of the garbage collector, how can the memory consumed by o be reclaimed?
I'm assuming from your question that you might want to avoid the sporadic pauses caused by garbage collection due to the high level of I/O being performed by your app. If this is the case, there are techniques for minimising the number of objects created (e.g. re-using objects from a pool). You could also consider enabling the Concurrent Mark Sweep Collector.
The concurrent mark sweep collector, also known as the concurrent collector or CMS, is targeted at applications that are sensitive to garbage collection pauses.
It's very hard (but not impossible) to disable GC in a JVM.
Look at the JNI "critical" functions for hints.
You can also essentially ensure you don't GC by not allocating any more objects (write a JVMTI agent that slaps you if you do, and instrument your code).
Finally, you can force a fatal OutOfMemoryError by ensuring that every object you allocate is never freed, thus when you hit -Xmx memory used, you'll fall over as GC won't be able to reclaim anything (mind you, you'll GC one or more times at this point before you fall over in a heap).
The real question is: why would you want to? What upside do you see in doing it? Is it for realtime? If so, I'd consider looking at one of the several realtime JVMs available on the market (Oracle, IBM, and others all sell them). I can't honestly think of another reason to do this while still using Java.
The only way you are going to be able to turn off garbage collection is to modify the JVM. This should be feasible with the OpenJDK 6 codebase.
However, what you will get at the end is a JVM that leaks memory like crazy, with no reasonable hope of fixing the leaks. The Java class library APIs are designed and implemented on the assumption that there is a GC taking care of memory management. This is so fundamental that any serious attempt to "fix" it would lead to a language / library that is not recognizable as Java.
If you want a non-garbage collected language, use C or C++.
Modern JVMs are so good at handling short-lived objects that any scheme you devise on your own will be slower.
This is because the objects you manage yourself will become long-lived and receive the deluxe treatment from the JVM in terms of being moved around, etc. Of course, that work is done by the garbage collector, which you want to turn off, but you can do very little without any GC.
So, before you start considering which optimization to use, establish a baseline: take the large, unoptimized program and profile it. Then make your tweaks and see if they help; you will never know if you do not have a baseline.
As other people have mentioned, you can't disable the GC. However, you can use the experimental 'Epsilon' garbage collector, which never actually performs any garbage collection. Warning: it will crash if your JVM runs out of memory (because it's not doing any garbage collection).
There's more info (including the command-line switch to use) at:
http://openjdk.java.net/jeps/318
Good luck!
Garbage collection is automated memory management in Java, so you cannot disable the GC.
Since you say, "its all about predictability not straight line speed," you should look at using a realtime Java system with deterministic garbage collection.

Java profiling - How reliable are the values it gives?

I am working on a simple text markup Java library which should be, amongst other requirements, fast.
For that purpose I did some profiling, but the results give me worse numbers than those measured when running in non-profiled mode.
So my question is: how reliable is the profiling? Does it just give an informative ratio of the time spent in methods? Does it take the JIT compiler into account, or is profiled code only interpreted? I use the NetBeans Profiler and Sun JDK 1.6.
Thanks.
When profiling, you'll always incur a performance penalty: something has to measure the start/stop times of methods and keep track of the objects on the heap (for memory profiling), so there is management overhead.
However, it will give you clear pointers to find out where bottlenecks are. I tend to look for the methods where the most cumulative time is spent and check whether optimisations can be made. It is also useful to determine whether methods are called unnecessarily.
With methods that are very small, take the profile results with a pinch of salt; sometimes the process of measuring can take more time than the method call itself and skew the results (it might appear that a small, often-called method has more of a performance impact than it really does).
Hope that helps.
Because of instrumentation, profiled code will on average run slower than non-profiled code. Measuring speed is not the purpose of profiling, however.
The profiling output will point you to bottlenecks, places that threads spend more time in, code that behaves worse than expected, or possible memory leaks.
You can use those hints to improve said methods and profile again until you are happy with the results.
A profiler will not be a solution to a coding style that is x% slower than optimal, however; you still need to spend time fine-tuning the parts of your code that are used more often than others.
I'm not surprised that you get worse results when profiling your application, as instrumenting Java code will typically slow its execution. This is nicely captured by the Wikipedia page on profiling, which mentions that instrumentation can cause changes in the performance of a program, potentially causing inaccurate results and heisenbugs (due to the observer effect: observers affect what they are observing by the mere act of observing it).
That said, if you want to measure speed, I think you're not using the right tool. Profilers are used to find bottlenecks in an application (and for that, you don't really care about the overall impact). But if you want to benchmark your library, you should use a performance-testing tool (for example, something like JMeter) that can give you an average execution time per call. You will get much better and more reliable results with the right tool.
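If you want something even simpler than a full performance-testing tool, a hand-rolled harness along these lines gives a rough average per call; markup() is just a stand-in for the library call being measured, and a dedicated tool will still be more rigorous:

// Warm the JIT up first, then measure average wall-clock time per call.
public class SimpleBenchmark {
    public static void main(String[] args) {
        long sink = 0;                                    // prevents dead-code elimination
        for (int i = 0; i < 10000; i++) {
            sink += markup("warm-up input").length();     // give the JIT time to compile
        }

        int iterations = 100000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += markup("measured input").length();
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("average: %.1f ns/call (sink=%d)%n",
                (double) elapsed / iterations, sink);
    }

    private static String markup(String input) {          // placeholder for the real library
        return "<p>" + input + "</p>";
    }
}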
Profiling should have no influence on the JIT compiler. The code inserted to profile your application, however, will slow the methods down quite a bit.
Profilers work on several different models: either they insert code to see how long and how often methods run, or they only take samples by repeatedly polling which code is currently executing.
The first will slow down your code quite a bit, while the second is not 100% accurate, as it may miss some method calls.
Profiled code is bound to run slower, as mentioned in most of the previous comments. I would say: use profiling only to measure the relative performance of various parts of your code (say, methods). Do not use the measurements from a profiler as an indicator of how the code will perform overall (unless you want a worst-case measure, in which case what you have is an overestimate).
I have found I get different results depending on which profiler I use. However, the results are often valid, just a different perspective on the problem. Something I often do when profiling CPU usage is to enable memory allocation profiling. This often gives me different CPU results (due to the increased overhead caused by the memory profiling) and can point me at some useful places to optimise.