I have to find a memory leak in a Java application. I have some experience with this but would like advice on a methodology/strategy for this. Any references and advice are welcome.
About our situation:
Heap dumps are larger than 1 GB
We have heap dumps from 5 occasions.
We don't have any test case to provoke this. It only happens in the (massive) system test environment after at least a week's usage.
The system is built on an internally developed legacy framework with so many design flaws that it is impossible to count them all.
Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.
We have done snapshot heap dumps over time and concluded that there is not a single component increasing over time; everything grows slowly.
The above points us in the direction that it is the framework's homegrown ORM system that increases its usage without limit. (This system maps objects to files?! So not really an ORM.)
Question: What is the methodology that helped you succeed in hunting down leaks in an enterprise-scale application?
It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better sort the wheat from the chaff among the zillion bits of information you are getting in your heap dumps.
Also, you can't know if something is a leak or not without knowing why the class is there in the first place.
I just spent the past couple of weeks doing exactly this, and I used an iterative process.
First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.
Rather, I relied almost solely on jmap histograms.
I imagine you're familiar with these, but for those who aren't:
jmap -histo:live <pid> > histogram.out
creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.
I was dumping out heap regularly, every 5 minutes, 24hrs a day. That may well be too granular for you, but the gist is the same.
I ran several different analyses on this data.
I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.
I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
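A minimal sketch of what such a diff step can look like in Java (this is illustrative, not my original script, and it assumes the standard jmap -histo line layout; the union/filtering step is then just set operations over these outputs):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HistogramDiff {
    // jmap -histo lines look like: "   1:   4632416  392305928  [C"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+\\d+\\s+(\\S+)\\s*$");

    static Map<String, Long> load(Path file) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                counts.merge(m.group(2), Long.parseLong(m.group(1)), Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> before = load(Paths.get(args[0]));
        Map<String, Long> after = load(Paths.get(args[1]));
        // Print the instance-count delta for every class present in the later dump.
        after.forEach((cls, count) -> {
            long delta = count - before.getOrDefault(cls, 0L);
            if (delta != 0) {
                System.out.println(delta + " " + cls);
            }
        });
    }
}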
However, some classes have some instances preserved while others are GC'd. These classes can go up and down overall, yet still leak, so they can fall out of the "always rising" category of classes.
To find these, I converted the data into a time series and loaded it into a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data and find classes that trend up, even if they aren't always on top of the charts. I used the regr_slope function, looking for classes with a positive slope.
I found this process very successful, and really efficient. The histogram files aren't insanely large, and it was easy to download them from the hosts. They weren't super expensive to run on the production system (they do force a full GC, and may block the VM for a bit). I was running this on a system with a 2 GB Java heap.
Now, all this can do is identify potentially leaking classes.
This is where understanding how the classes are used, and whether they should or should not be there, comes into play.
For example, you may find that you have a lot of Map.Entry classes, or some other system class.
Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.
Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at its source code. If your ORM is caching stuff, it's likely caching ORM classes that wrap your application classes.
Finally, another thing you can do, once you know the classes, is start up a local instance of the server with a much smaller heap and smaller dataset, and use one of the profilers against that.
In this case, you can run a unit test that affects only one (or a small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. Your leaking class should have increased by 1 (or whatever your unit of work is).
A profiler may be able to help you track the owners of that "now leaked" class.
But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.
Take a look at Eclipse Memory Analyzer. It's a great tool (and self-contained; it does not require Eclipse itself to be installed) which 1) can open up very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but EMA provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.
I've used it in the past to help hunt down suspicious leaks.
This answer expands upon Will Hartung's answer above. I applied the same process to diagnose one of my memory leaks and thought that sharing the details would save other people time.
The idea is to have postgres 'plot' time vs. memory usage of each class, draw a line that summarizes the growth and identify the objects that are growing the fastest:
^
|
s | Legend:
i | * - data point
z | -- - trend
e |
( |
b | *
y | --
t | --
e | * -- *
s | --
) | *-- *
| -- *
| -- *
--------------------------------------->
time
Convert your heap dumps (you need several) into a format that is convenient for consumption by postgres, i.e. from the heap dump histogram format:
num #instances #bytes class name
----------------------------------------------
1: 4632416 392305928 [C
2: 6509258 208296256 java.util.HashMap$Node
3: 4615599 110774376 java.lang.String
5: 16856 68812488 [B
6: 278914 67329632 [Ljava.util.HashMap$Node;
7: 1297968 62302464
...
to a CSV file with the datetime of each heap dump:
2016.09.20 17:33:40,[C,4632416,392305928
2016.09.20 17:33:40,java.util.HashMap$Node,6509258,208296256
2016.09.20 17:33:40,java.lang.String,4615599,110774376
2016.09.20 17:33:40,[B,16856,68812488
...
Using this script:
#!/usr/bin/perl
# Example invocation: convert.heap.hist.to.csv.pl -f heap.2016.09.20.17.33.40.txt -dt "2016.09.20 17:33:40" >> heap.csv
use strict;
use warnings;
use Getopt::Long;

my $file;
my $dt;
GetOptions (
    "f=s"  => \$file,
    "dt=s" => \$dt
) or die("Error in command line arguments");

open my $fh, '<', $file or die $!;
while (not eof($fh)) {
    my $line = <$fh>;
    $line =~ s/\R//g;    # remove newlines
    #  1:       4442084      369475664  [C
    my ($instances, $size, $class) = ($line =~ /^\s*\d+:\s+(\d+)\s+(\d+)\s+([\$\[\w\.]+)\s*$/);
    if ($instances) {
        print "$dt,$class,$instances,$size\n";
    }
}
close($fh);
Create a table to put the data in
CREATE TABLE heap_histogram (
histwhen timestamp without time zone NOT NULL,
class character varying NOT NULL,
instances integer NOT NULL,
bytes integer NOT NULL
);
Copy the data into your new table
\COPY heap_histogram FROM 'heap.csv' WITH DELIMITER ',' CSV ;
Run the slope query against size (number of bytes):
SELECT class, REGR_SLOPE(bytes,extract(epoch from histwhen)) as slope
FROM public.heap_histogram
GROUP BY class
HAVING REGR_SLOPE(bytes,extract(epoch from histwhen)) > 0
ORDER BY slope DESC
;
Interpret the results:
class | slope
---------------------------+----------------------
java.util.ArrayList | 71.7993806279174
java.util.HashMap | 49.0324576155785
java.lang.String | 31.7770770326123
joe.schmoe.BusinessObject | 23.2036817108056
java.lang.ThreadLocal | 20.9013528767851
The slope is bytes added per second (since epoch is measured in seconds). If you use instances instead of size, then that's the number of instances added per second.
One of my lines of code creating this joe.schmoe.BusinessObject was responsible for the memory leak: it created the object and appended it to a list without checking whether it already existed. The other growing objects were created along with the BusinessObject near the leaking code.
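For illustration, the leak followed roughly this shape (a hypothetical reconstruction; the real class and field names differ):

import java.util.ArrayList;
import java.util.List;

class BusinessObjectRegistry {
    static class BusinessObject { /* fields omitted */ }

    private final List<BusinessObject> cache = new ArrayList<>();

    // Leaky version: every call appends, so the list grows without bound.
    void register(BusinessObject bo) {
        cache.add(bo);
    }

    // Fixed version: skip objects that are already present
    // (or better, use a bounded/evicting cache instead of a plain list).
    void registerOnce(BusinessObject bo) {
        if (!cache.contains(bo)) {
            cache.add(bo);
        }
    }
}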
Can you accelerate time? I.e., can you write a dummy test client that forces it to do a week's worth of calls/requests in a few minutes or hours? Such a client is your biggest friend, and if you don't have one, write one.
We used NetBeans a while ago to analyse heap dumps. It can be a bit slow, but it was effective. Eclipse just crashed, and so did the 32-bit Windows tools.
If you have access to a 64-bit system or a Linux system with 3 GB or more, you will find it easier to analyse the heap dumps.
Do you have access to change logs and incident reports? Large scale enterprises will normally have change management and incident management teams and this may be useful in tracking down when problems started happening.
When did it start going wrong? Talk to people and try and get some history. You may get someone saying, "Yeah, it was after they fixed XYZ in patch 6.43 that we got weird stuff happening".
I've had success with IBM Heap Analyzer. It offers several views of the heap, including largest drop-off in object size, most frequently occurring objects, and objects sorted by size.
There are great tools like Eclipse MAT and Heap Hero to analyze heap dumps. However, you need to provide these tools with heap dumps captured in the correct format and at the correct point in time.
This article gives you multiple options to capture heap dumps. In my opinion, the first three are effective options to use and the others are good options to be aware of.
1. jmap
2. HeapDumpOnOutOfMemoryError
3. jcmd
4. JVisualVM
5. JMX
6. Programmatic Approach
7. IBM Administrative Console
7 Options to capture Java Heap dumps
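As a quick illustration of the programmatic approach (option 6), here is a minimal sketch using the HotSpotDiagnosticMXBean available on HotSpot JVMs (the class name and output path are arbitrary):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Writes a heap dump to filePath; if liveOnly is true, only live objects
    // are dumped (which forces a full GC first). The target file must not already exist.
    public static void dumpHeap(String filePath, boolean liveOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(filePath, liveOnly);
    }

    public static void main(String[] args) throws Exception {
        dumpHeap("/tmp/app-heap.hprof", true);
    }
}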
If it's happening after a week's usage, and your application is as byzantine as you describe, perhaps you're better off restarting it every week?
I know it's not fixing the problem, but it may be a time-effective solution. Are there time windows when you can have outages? Can you load balance and fail over one instance while keeping the second up? Perhaps you can trigger a restart when memory consumption breaches a certain limit (perhaps monitoring via JMX or similar).
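A minimal sketch of that kind of monitoring, polling the platform MemoryMXBean in-process (the 85% threshold, the one-minute period, and what you do when it fires are all assumptions to adapt):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HeapWatchdog {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            double used = (double) heap.getUsed() / heap.getMax();
            if (used > 0.85) {
                // Here you would trigger the fail-over / controlled restart.
                System.err.println("Heap above 85% of max, time to rotate this instance");
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}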
I've used jhat; it's a bit rough, but whether it's enough depends on the kind of framework you have.
Related
Regarding the dataflow model of computation, I'm doing a PoC to test a few concepts using Apache Beam with the direct runner (and the Java SDK). I'm having trouble creating a pipeline which reads a "big" CSV file (about 1.25 GB) and dumps it into an output file without any particular transformation, as in the following code (I'm mainly concerned with testing IO bottlenecks using this dataflow/beam model because that's of primary importance to me):
// Example 1: reading from and writing to a file
Pipeline pipeline = Pipeline.create();
PCollection<String> output = pipeline
    .apply(TextIO.read().from("BIG_CSV_FILE"));

output.apply(
    TextIO.write()
        .to("BIG_OUTPUT")
        .withSuffix("csv")
        .withNumShards(1));

pipeline.run();
The problem that I'm having is that only smaller files do work, but when the big file is used, no output file is being generated (but also no error/exception is shown either, which makes debugging harder).
I'm aware that on the runners page of the apache-beam project (https://beam.apache.org/documentation/runners/direct/), it is explicitly stated under the memory considerations point:
Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. You can create a small in-memory data set using a Create transform, or you can use a Read transform to work with small local or remote files.
The above suggests I'm having a memory problem (though sadly this isn't explicitly stated on the console, so I'm just left wondering). I'm also concerned by their suggestion that the dataset should fit into memory (why isn't it reading the file in parts instead of fitting the whole file/dataset into memory?).
A second consideration I'd like to add to this conversation (in case this is indeed a memory problem): how basic is the implementation of the direct runner? It isn't hard to implement code that reads from a big file in chunks and also outputs to a new file (also in chunks), so that at no point in time does memory usage become a problem (because neither file is completely loaded into memory, only the current "chunk"). Even if the direct runner is more of a prototyping runner for testing semantics, would it be too much to expect it to deal nicely with huge files? After all, this is a unified model built from the ground up to deal with streaming, where window size is arbitrary and huge data accumulation/aggregation before sinking it is a standard use case.
So, more than a question, I'd deeply appreciate your feedback/comments regarding any of these points: have you noticed IO constraints using the direct runner? Am I overlooking some aspect, or is the direct runner really so naively implemented? Have you verified that by using a proper production runner like Flink/Spark/Google Cloud Dataflow this constraint disappears?
I'll eventually test with other runners like the Flink or Spark ones, but it feels underwhelming that the direct runner (even if it is intended only for prototyping purposes) has trouble with this first test, considering the whole dataflow idea is based around ingesting, processing, grouping and distributing huge amounts of data under the umbrella of a unified batch/streaming model.
EDIT (to reflect Kenn's feedback):
Kenn, thanks for those valuable points and feedback; they have been of great help in pointing me towards relevant documentation. Following your suggestion, I found out by profiling the application that the problem is indeed a Java heap-related one (which is somehow never shown on the normal console and only seen in the profiler). Even though the file is "only" 1.25 GB in size, internal usage goes beyond 4 GB before dumping the heap, suggesting the direct runner isn't working "chunk by chunk" but is indeed loading everything into memory (as the documentation says).
Regarding your points:
1 - I believe that serialization and shuffling can very well still be achieved through a "chunk by chunk" implementation. Maybe I had a false expectation of what the direct runner should be capable of, or I didn't fully grasp its intended reach; for now I'll refrain from doing non-functional tests while using the direct runner.
2 - Regarding sharding: I believe the NumShards setting controls the parallelism (and number of output files) at the write stage (processing before that should still be fully parallel, and only at the time of writing will it use as many workers, and generate as many files, as explicitly provided). Two reasons to believe this: first, the CPU profiler always shows 8 busy "direct-runner-workers", mirroring the number of logical cores that my PC has, regardless of whether I set 1 shard or N shards. The second reason is what I understand from the documentation here (https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/WriteFiles.html):
By default, every bundle in the input PCollection will be processed by a FileBasedSink.WriteOperation, so the number of outputs will vary based on runner behavior, though at least 1 output will always be produced. The exact parallelism of the write stage can be controlled using withNumShards(int), typically used to control how many files are produced or to globally limit the number of workers connecting to an external service. However, this option can often hurt performance: it adds an additional GroupByKey to the pipeline.
One interesting thing here is that the "additional GroupByKey added to the pipeline" is kind of undesired in my use case (I only want the results in one file, without any regard for order or grouping),
so probably adding an extra step that flattens the N sharded output files after they are generated is a better approach.
3 - your suggestion for profiling was spot on, thanks.
Final edit: the direct runner is not intended for performance testing, only for prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles everything in memory.
There are a few issues or possibilities. I will answer in priority order.
The direct runner is for testing with very small data. It is engineered for maximum quality assurance, with performance not much of a priority. For example:
it randomly shuffles data to make sure you are not depending on ordering that will not exist in production
it serializes and deserializes data after each step, to make sure the data will be transmitted correctly (production runners will avoid serialization as much as possible)
it checks whether you have mutated elements in forbidden ways, which would cause data loss in production
The data you are describing is not very big, and the DirectRunner can process it eventually in normal circumstances.
You have specified numShards(1) which explicitly eliminates all parallelism. It will cause all of the data to be combined and processed in a single thread, so it will be slower than it could be, even on the DirectRunner. In general, you will want to avoid artificially limiting parallelism.
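For example (a sketch, reusing the output PCollection from the question's code and not tested against that setup): dropping the withNumShards(1) call lets the runner pick the shard count, so the write stage stays parallel.

output.apply(
    TextIO.write()
        .to("BIG_OUTPUT")
        .withSuffix("csv"));   // no withNumShards(1): runner-determined sharding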
If there is any out-of-memory error or other error preventing processing, you should see a log message. Otherwise, it will be helpful to look at profiling and CPU utilization to determine whether processing is active.
This question has been indirectly answered by Kenn Knowles above. The direct runner is not intended for performance testing, only prototyping and checking the well-formedness of the data. It doesn't have any mechanism for splitting and dividing work by partitions, and handles every dataset in memory. Performance testing should be carried out using other runners (like the Flink runner); those will provide data splitting and the type of infrastructure needed to deal with high IO bottlenecks.
UPDATE: adding to the point addressed by this question, there is a related question here: How to deal with (Apache Beam) high IO bottlenecks?
Whereas the question here revolves around figuring out whether the direct runner can deal with huge datasets (which, as established here, it cannot), the link above points to a discussion of whether production runners (like Flink/Spark/Cloud Dataflow) can deal with huge datasets natively, out of the box (the short answer is yes, but please check the link for a deeper discussion).
The problem:
I'm using UIMA Ruta (v2.3.1) in one of my projects, but now I'm facing a problem:
Memory usage exceeds explainable sizes, and I can't figure out where the problem is located, except that the class org.apache.uima.ruta.rule.RuleElementMatch takes up to 50% of the memory usage.
I call the Java API of UIMA Ruta in my project to set up the analysis engine. When I send a text of around 400 KB to this engine for analysis, around 700 MB of memory is held by the process, with no chance for the GC to free any space.
Ruta project:
The given Ruta rules are built up with REGEXP structures, but theoretically they should reduce memory usage, because there are UNMARKALL statements at specific endpoints.
Is anyone facing the same situation of high memory consumption, or are there any suggested solutions? Using the low memory profile, as UIMA itself advises, is not possible because the response time is already around 30 seconds. Increasing the max memory of the JVM is not an option.
This is probably not an answer, but here are some comments that maybe help.
As the name says, RutaRuleElementMatch stores the matches of rule elements, which is required within one RuleMatch in order to identify the information for the actions. This information can be forgotten after a RuleMatch, but sometimes it is necessary to store it. Mainly, it is stored if the analysis engine is configured for debugging (parameters debug and debugWithMatches). Then, all rule matches and rule element matches are remembered in order to create the debug annotations later. If there are many matches, this can take a lot of memory in the current implementation.
The debug config is also used in the Java API, e.g., in Ruta.select() or Ruta.matches(). To a smaller degree, the matches are also remembered for the head rules of block statements.
So, if debugging is activated, it should be deactivated in order to reduce the memory usage.
400 KB of text is quite a lot, I think. Ruta brings quite some overhead, which is required, but can also be improved/reduced. Right now, until the implementation is improved, there are some best practices for handling large documents in Ruta, i.e. for reducing memory usage.
In your use case, I would switch to a different seeder which creates only the annotations you need, and only where you need them; e.g., do you need SPACE and BREAK? Then I would refactor the rules. The example rule you mentioned in the comments is extremely inefficient and produces many RuleElementMatches. I would rather recommend using dictionary lookups where possible, e.g. with TRIE. You can also improve such a rule by restricting the match condition. In your example, this could be W or the output of some dictionary lookup.
If profiling shows that a lot of memory is used by RutaRuleElementMatch, then this can be caused by the debug config, or by inefficient rules.
If profiling shows that a lot of memory is used by RutaBasic, then it is caused by the size of the document and therefore by the number of annotations. Reducing the number of annotations helps, as less coverage information needs to be stored in the internal lists/arrays. UNMARK and UNMARKALL also help, but not to the extent one expects, at least in my use cases. There is also the parameter lowMemoryProfile, which reduces the memory usage of RutaBasic but also the runtime performance, as you mentioned. However, I suppose that your rules can be optimized a lot, so that the parameter would be an option again.
I hope this helps.
DISCLAIMER: I am a developer of UIMA Ruta
I'm trying to use stanford-ner.jar to train on a relatively large corpus (504 MB), and even though I use the options -Xms1g and -Xms1g, there are still memory issues. What's worse (I assume) is the output; when I tried to train a small model, the output looked like:
[1000][2000]numFeatures = 215032
However, the count I currently see is already up to "534700" and numFeatures is still being computed. I think there must be something wrong causing the memory issue, in that the software can't handle this many features? And I don't really understand the [1000][2000]... What do these mean? Is there a tutorial by Stanford explaining the output of the software?
My training corpus's format is like:
Google COMP
And O
Steve PER
. O
Microsoft COMP
Facebook COMP
Total MET
profix MET
. MET
Things like that: small entries that together make up this 504 MB corpus.
Can anyone point me to the problem?
Thanks!
You should probably increase the memory allocated to the program. What is the -mx value you pass to Java? -Xms sets the initial memory, whereas -mx (or -Xmx) sets the maximum memory. My guess would be that for a corpus of 500 MB this has to be a very large value -- at minimum a few tens of GB, and possibly more. On top of that, I have a bad feeling this will take very long to train.
Where did you collect such a large training corpus from? Is it possible to subsample the corpus, at least initially, and see if that trains?
I wrote a small Java program which loads data from a DB2 database using a simple JDBC call. I am using a select query to get the data and a Java statement for this purpose. I have properly closed the statement and connection objects. I am using a 64-bit JVM for compilation and for running the program.
The query returns 52 million records, each row having 24 columns, and it takes around 4 minutes to load the complete data on Unix (a multiprocessor environment). I am using a HashMap as the data structure to hold the data: Map<String, Map<String, GridTradeStatus>>. The bean GridTradeStatus is a simple getter/setter bean with 24 properties.
The memory required for the program is alarmingly high. The Java heap size goes up to 5.8-6 GB to load the complete data, while the actual used heap size remains between 4.7-4.9 GB. I know that we should not load this much data into memory, but my business requirements demand it.
The question is: when I put the whole data of my table in a flat file, it comes out at roughly ~1.2 GB. I want to know why my Java program consumes four times more memory than the data's actual size.
There is nothing surprising here (to me at least).
a.) Strings in java consume double the space compared to most common text formats (because Strings are always represented as UTF-16 in the heap). Also, String as an object has quite some overhead (String object itself, reference to the char[] it contains, hashCode etc.). For small strings the String object costs easily as much memory as the data it contains.
b.) You put stuff into a HashMap. HashMap is not exactly memory efficient. First, it uses a default load factor of 75%, which means a map with many entries also has a big bucket array. Then, each entry in the map is an object itself, which costs at least two references (key and value) plus object overhead.
In conclusion you pretty much have to expect the memory requirements to increase quite a bit. A factor of 4 is reasonable if your average data String is relatively short.
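A rough back-of-envelope illustration of where that factor comes from (all sizes are approximate assumptions for a 64-bit JVM with compressed oops and the char[]-backed String layout described above):

public class EntryOverheadEstimate {
    public static void main(String[] args) {
        long stringHeader = 24;          // String object: header, hash, reference to char[]
        long charArray    = 16 + 20 * 2; // char[] header plus 20 UTF-16 characters (ignoring padding)
        long mapNode      = 32;          // HashMap node: header, hash, key/value/next references
        long heapBytes    = stringHeader + charArray + mapNode;
        long fileBytes    = 20;          // the same 20-character value in a flat file
        System.out.println("heap ~" + heapBytes + " bytes vs file " + fileBytes + " bytes");
        // Prints roughly 112 vs 20, i.e. a 4-6x blow-up before even counting the
        // outer map and the bean holding the value.
    }
}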
If you think you cannot afford a 1:4 ratio between the size of data in a flat file and the memory necessary to load the Strings into a HashMap, you should consider not using Java but a lower-level language such as C++ or even C.
Of course there are possible optimizations:
use byte[] instead of String (about half the size)
do not use default HashMap parameters (initial size / load factor), but tweak them to meet your actual requirements (a sketch follows below).
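A sketch of that second point, pre-sizing the map for the roughly 52 million rows from the question so the bucket array never has to be rebuilt while loading (the stub bean and the capacity arithmetic are illustrative assumptions):

import java.util.HashMap;
import java.util.Map;

class PreSizedMapExample {
    // GridTradeStatus is the question's bean; a stub stands in for it here.
    static class GridTradeStatus { /* 24 getter/setter properties omitted */ }

    // Pre-size for ~52 million entries at the default 0.75 load factor so the
    // table is allocated once and never rehashed during the load.
    static Map<String, GridTradeStatus> newTradeMap() {
        int expectedRows = 52_000_000;
        return new HashMap<>((int) (expectedRows / 0.75f) + 1, 0.75f);
    }
}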
What follows is mainly experience-based opinion. I generally use 4 language levels:
a high-level scripting language (Python, Ruby, or even bash ...) when performance is not a requirement and speed of development is;
a mid-level language (Java, less frequently high-level C++) when performance matters but I also want simplicity of development and robustness (strong typing, ...);
a low-level language (low-level C++, or C) when performance is a hard requirement and I accept spending much more time writing and testing individual modules;
assembly language for the small parts where performance is critical and has been proved to be so by profiling.
IMHO you can tweak Java code to greatly reduce the memory footprint, but you risk losing a great part of what makes Java attractive by giving up its excellent string and collections support. It might be as easy, and perhaps more efficient, to code a small part of the application in C++ and use JNI to tie it all together.
I am looking for a tool that can provide VisualVM-like profiling of live objects, but in non-GUI mode.
The VisualVM functionality I am referring to is accessed by going to the "Profiler" tab and clicking "Memory",
then setting a profiling preset of "Profile object allocations and GC" for every object (all objects). This gives me exactly what I need in an auto-refreshing view, which I can filter for the class that interests me.
However, I want to be able to export the table of "live objects" to a text file for every snapshot that is taken (VisualVM refreshes every second). Obviously, pointing and clicking cannot possibly be a solution...
Anyone know of such a "command-line" profiler?
I have been looking at jmap which provides heap dumps, but it is too costly (the dump takes too long, I am just interested in the number of objects).
There is a commercial tool called YourKit but I don't know whether it can do what I need (and also seems rather expensive for the type of "one-off" usage I need it for).
If I could use VisualVM as-is, but have it append the output to a file (instead of refreshing its GUI) it'd be perfect...
I think class histograms are what you're looking for. You could collect the histograms at regular intervals, and this will show you the number of objects of each class and the space they occupy. You can then parse the text output yourself in order to:
compare two histograms to see instance allocation/deallocation
filter by a class name
monitor space occupation of class instances over time
Collect class histogram with jmap -histo $pid.