OutOfMemoryError while doing docx comparison using docx4j - java

In my application I am comparing two docx files and creating one HTML comparison file. When the files are around 150 or 170 lines there is no issue, but when I try to compare bigger files, around 200 lines or more, I get
java.lang.OutOfMemoryError: Java heap space
Can anyone please help with this?

You are running out of memory because you aren't using the Docx4jDriver class, which makes the diff problem more tractable by doing a paragraph level diff first.
Use it like so:
// 1. Get the document bodies to compare (newerPackage and olderPackage
//    are the two WordprocessingMLPackages, already loaded elsewhere)
Body newerBody = ((Document) newerPackage.getMainDocumentPart().getJaxbElement()).getBody();
Body olderBody = ((Document) olderPackage.getMainDocumentPart().getJaxbElement()).getBody();

// 2. Do the differencing
java.io.StringWriter sw = new java.io.StringWriter();
Docx4jDriver.diff(XmlUtils.marshaltoW3CDomDocument(newerBody).getDocumentElement(),
        XmlUtils.marshaltoW3CDomDocument(olderBody).getDocumentElement(),
        sw);

// 3. Get the result
String contentStr = sw.toString();
System.out.println("Result: \n\n " + contentStr);
Body newBody = (Body) org.docx4j.XmlUtils.unmarshalString(contentStr);
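For completeness, newerPackage and olderPackage would be loaded along these lines (a minimal sketch; the file paths are placeholders, not from the original question):
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

// Load the two documents being compared; replace the paths with your own files
WordprocessingMLPackage newerPackage = WordprocessingMLPackage.load(new java.io.File("newer.docx"));
WordprocessingMLPackage olderPackage = WordprocessingMLPackage.load(new java.io.File("older.docx"));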

You can make the heap space bigger with -Xmx and -Xms as VM arguments.
More about this here: Heap Size Tuning, or here: Heap size.

Try increasing the Java heap size using the command line arguments -Xmx<maximum heap size> and -Xms<minimum heap size>.
Also in your code, test that you actually have increased the heap size with the following:
long heapSize = Runtime.getRuntime().totalMemory();
System.out.println("Heap Size = " + heapSize);
Do this before calling Differencer.diff on line 117.

Try profiling your application rather than making assumptions or intelligent guesses. You can use VisualVM or JConsole, which ship with the JDK.
Also, you can take a heap dump of your application using jmap and then use either jhat or Eclipse MAT (I prefer the latter; google it) to see what's consuming the memory and look for any unusual behavior.
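If it is more convenient to capture the dump from inside the application (say, just before the diff call), the HotSpot diagnostic MBean can write one programmatically. A minimal sketch, assuming a HotSpot/OpenJDK JVM and a writable output path of your choosing:
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // true = dump only live objects (forces a GC first)
        bean.dumpHeap(path, true);
    }
}
Usage would be something like HeapDumper.dump("before-diff.hprof"); the resulting .hprof file opens in Eclipse MAT or jhat.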

Related

OutOfMemoryError with apache commons Base64 static method decodeBase64

While decoding a Base64 encoded string to a byte array (I have to do this because I have a key which operates on a byte array to decrypt), I am getting an OutOfMemoryError. What are effective ways to handle this problem? Should I chunk my input encoded String into partitions of some size and then decode each part, or is there another effective approach? Please suggest.
Code which was causing the issue.
byte[] encrypted = Base64.decodeBase64(strEncryptedEncodedData);
Stack Trace
DefaultQuartzScheduler_Worker-3
at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
at java.lang.StringCoding$StringEncoder.encode([CII)[B (StringCoding.java:300)
at java.lang.StringCoding.encode(Ljava/lang/String;[CII)[B (StringCoding.java:344)
at java.lang.String.getBytes(Ljava/lang/String;)[B (String.java:918)
at org.apache.commons.codec.binary.StringUtils.getBytesUnchecked(Ljava/lang/String;Ljava/lang/String;)[B (StringUtils.java:156)
at org.apache.commons.codec.binary.StringUtils.getBytesUtf8(Ljava/lang/String;)[B (StringUtils.java:129)
at org.apache.commons.codec.binary.BaseNCodec.decode(Ljava/lang/String;)[B (BaseNCodec.java:306)
at org.apache.commons.codec.binary.Base64.decodeBase64(Ljava/lang/String;)[B (Base64.java:669)
Eclipse Memory Analyzer memory usage (screenshot not reproduced here).
Edit1: Max allowed XMX is 1 GB.
Edit2: JDK version is "1.8.0_91".
Try to increase the max heap size for the JVM using an option like this:
-Xmx4096m
Please specify the Java version you use for this code.
There are at least 10 different types of OutOfMemoryError, as listed below, and yours might be of the "10. java.lang.OutOfMemoryError: Direct buffer memory" type. Please check your exception stack trace for this string to confirm. If you see a different type, please share it.
I verified that the "java.lang.StringCoding$StringEncoder" class from your exception trace uses java.nio.ByteBuffer and other related classes. You can check the import section at the URL below.
http://cr.openjdk.java.net/~sherman/7040220/webrev/src/share/classes/java/lang/StringCoding.java.html
Java applications can use native memory (not heap memory) for direct byte buffer operations, which are fast. Some portion of native memory is allocated to the JVM for these direct byte buffer operations. If its size is not enough, you can increase it with the VM flag -XX:MaxDirectMemorySize= (e.g. -XX:MaxDirectMemorySize=10M). Increasing heap memory with the -Xmx flag will not solve this type of OutOfMemoryError. Please try the MaxDirectMemorySize flag and see whether it solves your problem.
If you want to know more about these OutOfMemoryErrors, you can read the book Java Performance Optimization: How to avoid the 10 OutOfMemoryErrors.
1. java.lang.OutOfMemoryError: Java heap space
2. java.lang.OutOfMemoryError: Unable to create new native thread
3. java.lang.OutOfMemoryError: Permgen space
4. java.lang.OutOfMemoryError: Metaspace
5. java.lang.OutOfMemoryError: GC overhead limit exceeded
6. java.lang.OutOfMemoryError: Requested array size exceeds VM limit
7. java.lang.OutOfMemoryError: request "size" bytes for "reason". Out of swap space?
8. java.lang.OutOfMemoryError: Compressed class space
9. java.lang.OutOfMemoryError: "reason" "stack trace" (Native method)
10. java.lang.OutOfMemoryError: Direct buffer memory
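If the error really is ordinary heap exhaustion and the 1 GB -Xmx cap from Edit1 cannot be raised, another option is to avoid holding the full encoded String and the full decoded byte[] in memory at the same time by streaming the decode with commons-codec's Base64InputStream. A rough sketch, assuming the encoded data can be read from a stream (the file names here are placeholders):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.codec.binary.Base64InputStream;

public class StreamingBase64Decode {
    public static void main(String[] args) throws Exception {
        // Base64InputStream decodes on the fly, so only small chunks are in memory at once
        try (InputStream in = new Base64InputStream(new FileInputStream("encoded.txt"));
             OutputStream out = new FileOutputStream("decoded.bin")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // decrypt or otherwise process each chunk here
            }
        }
    }
}
Whether this helps depends on the decryption step: a streaming cipher (for example via CipherInputStream) can consume the chunks, whereas an API that insists on a single byte[] cannot.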

OOM error while reading 2GB xml file in java

I have been trying to read an XML file of about 2 GB. I have tried different methods to read it, but each of them gives an OutOfMemoryError. I even tried increasing the heap size to a max of 4 GB and a min of 2 GB in Eclipse, but the problem persists. How can I resolve this problem? I don't want to use any third party library.
Following is the code that I have tried so far:
String str = new String(Files.readAllBytes(Paths.get(pathname)),
StandardCharsets.UTF_8);
and
try (Scanner scanner = new Scanner(new File(pathname))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
    }
}
Each character uses at least 2 bytes, and you also need memory for processing. I would give it a lot more memory, like 24 GB, and see how much it really needs.
Note: Java 9+ has compact strings, which can reduce consumption.
A better approach is to use a SAX parser to process the file as you read it, which will use a tiny fraction of the memory.
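A minimal SAX sketch of that approach (the file path and the "record" element name are placeholders; react to whichever elements your document actually contains):
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class LargeXmlReader {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("huge.xml"), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // called for each element as it streams past; nothing is accumulated in memory
                if ("record".equals(qName)) {
                    // process one record here, then let it go
                }
            }
        });
    }
}
SAX is part of the JDK, so this also satisfies the no-third-party-library constraint.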

Java outOfMemory exception in string.split

I have a big txt file with integers in it. Each line in the file has two integer numbers separated by whitespace. The size of the file is 63 MB.
Pattern p = Pattern.compile("\\s");
try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] tokens = p.split(line);
        String s1 = new String(tokens[0]);
        String s2 = new String(tokens[1]);
        int startLabel = Integer.valueOf(s1) - 1;
        int endLabel = Integer.valueOf(s2) - 1;
        Vertex fromV = vertices.get(startLabel);
        Vertex toV = vertices.get(endLabel);
        Edge edge = new Edge(fromV, toV);
        fromV.addEdge(edge);
        toV.addEdge(edge);
        edges.add(edge);
        System.out.println("Edge from " + fromV.getLabel() + " to " + toV.getLabel());
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.String.substring(String.java:1913)
at java.lang.String.subSequence(String.java:1946)
at java.util.regex.Pattern.split(Pattern.java:1202)
at java.util.regex.Pattern.split(Pattern.java:1259)
at SCC.main(SCC.java:25)
Why am I getting this exception? How can I change my code to avoid it?
EDIT:
I've already increased the heap size to 2048m.
What is consuming it? That's what I would like to know as well.
For all I know, the JVM should only be allocating memory for the list of vertices, the set of edges, the buffer for the buffered reader and one small string "line". I don't see where this OutOfMemoryError is coming from.
I read about the String.split() method. I think it's causing a memory leak, but I don't know what I should do about it.
What you should try first is to reduce the file to something small enough that it works. That will let you gauge just how large a problem you have.
Second, your problem is definitely unrelated to String#split, since you are using it on just one line at a time. What is consuming your heap are the Vertex and Edge instances. You'll have to redesign this towards a smaller footprint, or completely overhaul your algorithm so it can work with only a part of the graph in memory and the rest on disk.
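One possible direction for that redesign, sketched under my own assumptions (vertex labels run from 0 to n-1, as in the question, and only the edge endpoints are needed): keep adjacency lists of int labels instead of Vertex and Edge objects, which removes two objects plus their bookkeeping per edge.
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of a smaller-footprint graph: adjacency lists of labels only.
class IntGraph {
    private final List<List<Integer>> adjacency;

    IntGraph(int n) {
        adjacency = new ArrayList<List<Integer>>(n);
        for (int i = 0; i < n; i++) {
            adjacency.add(new ArrayList<Integer>());
        }
    }

    void addEdge(int from, int to) {
        // store the edge once on each endpoint; no Edge object is created
        adjacency.get(from).add(to);
        adjacency.get(to).add(from);
    }

    List<Integer> neighbors(int v) {
        return adjacency.get(v);
    }
}
Boxed Integers still carry overhead; primitive int[] adjacency arrays (or a primitive-collections library) would shrink this further.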
P.S. Just a general Java note: don't write
String s1 = new String(tokens[0]);
String s2 = new String(tokens[1]);
you just need
String s1 = tokens[0];
String s2 = tokens[1];
or even just use tokens[0] directly instead of s1, since it's about as clear.
Easiest way: increase your heap size:
Add -Xmx512m -Xms512m (or even more) as arguments to the JVM.
Increase the heap memory limit, using the -Xmx JVM option.
More info here.
You are getting this exception because your program is storing too much data in the java heap.
Although your exception is showing up in the Pattern.split() method, the actual culprit could be any large memory user in your code, such as the graph you are building. Looking at what you provided, I suspect the graph data structure is storing much redundant data. You may want to research a more space-efficient graph structure.
If you are using the Sun JVM, try the JVM option -XX:+HeapDumpOnOutOfMemoryError to create a heap dump and analyze that for any heavy memory users, and use that analysis to optimize your code. See Using HeapDumpOnOutOfMemoryError parameter for heap dump for JBoss for more info.
If that's too much work for you, as others have indicated, try increasing the JVM heap space to a point where your program no longer crashes.
Whenever you get an OOM while trying to parse stuff, it means the method you are using is not scalable. Even though increasing the heap might solve the issue temporarily, it is not scalable: if tomorrow your file size increases by an order of magnitude, you would be back to square one.
I would recommend reading the file in pieces: cache x lines of the file, read off it, clear the cache and repeat the process.
You can use either Ehcache or Guava cache.
The way you parse the string could be changed.
try (Scanner scanner = new Scanner(new FileReader(filePath))) {
    while (scanner.hasNextInt()) {
        int startLabel = scanner.nextInt();
        int endLabel = scanner.nextInt();
        scanner.nextLine(); // discard the rest of the line.
        // use start and end.
    }
}
I suspect the memory consumption is actually in the data structure you build rather than how you read the data, but this should make it more obvious.

getting Java OutOfMemoryError: Java heap space error that I can't debug

I am struggling to figure out what's causing this OutOfMemoryError. Making more memory available isn't the solution, because my system doesn't have enough memory. Instead I have to figure out a way of re-writing my code.
I've simplified my code to try to isolate the error. Please take a look at the following:
File[] files = new File(args[0]).listFiles();
int filecnt = 0;
LinkedList<String> urls = new LinkedList<String>();
for (File f : files) {
    if (filecnt > 10) {
        System.exit(1);
    }
    System.out.println("Doing File " + filecnt + " of " + files.length + " :" + f.getName());
    filecnt++;
    FileReader inputStream = null;
    StringBuilder builder = new StringBuilder();
    try {
        inputStream = new FileReader(f);
        int c;
        char d;
        while ((c = inputStream.read()) != -1) {
            d = (char) c;
            builder.append(d);
        }
    }
    finally {
        if (inputStream != null) {
            inputStream.close();
        }
    }
    inputStream.close();
    String mystring = builder.toString();
    String temp[] = mystring.split("\\|NEWandrewLINE\\|");
    for (String s : temp) {
        String temp2[] = s.split("\\|NEWandrewTAB\\|");
        if (temp2.length == 22) {
            urls.add(temp2[7].trim());
        }
    }
}
I know this code is probably pretty confusing :) I have loads of text files in the directory specified in args[0]. These text files were created by me. I used |NEWandrewLINE| to indicate a new row in the text file, and |NEWandrewTAB| to indicate a new column. In this code snippet, I am trying to access the URL of each stored row (which is in the 8th column of each row). So I read in the whole text file, split the string on |NEWandrewLINE|, and then split the resulting substrings on |NEWandrewTAB|. I add the URL to the LinkedList (called "urls") with the line: urls.add(temp2[7].trim())
Now, the output of running this code is:
Doing File 0 of 973 :results1322453406319.txt
Doing File 1 of 973 :results1322464193519.txt
Doing File 2 of 973 :results1322337493419.txt
Doing File 3 of 973 :results1322347332053.txt
Doing File 4 of 973 :results1322330379488.txt
Doing File 5 of 973 :results1322369464720.txt
Doing File 6 of 973 :results1322379574296.txt
Doing File 7 of 973 :results1322346981999.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at Twitter.main(Twitter.java:86)
Where main line 86 relates to the line builder.append(d); in this example.
But the thing I don't understand is that if I comment out the line urls.add(temp2[7].trim()); I don't get any error. So the error seems to be caused by the linkedlist "urls" overfilling. But why then does the reported error relate to the StringBuilder?
Try to replace urls.add(temp2[7].trim()); with urls.add(new String(temp2[7].trim()));.
I suppose that your problem is that you are in fact storing the entire file content and not just the extracted URL field in your urls list, although that's not really obvious. It is actually an implementation specific issue with the String class, but usually String#split and String#trim return new String objects, which contain the same internal char array as the original string and only differs in their offset and length fields. Using the new String(String) constructor makes sure that you only keep the relevant part of the original data.
The linked list is using more memory each time you add a string. This means you can be left with not enough memory to build your StringBuilder.
The way to avoid this issue is to write the results to a file instead of to a List, as you don't appear to have enough memory to keep the List in memory.
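A rough sketch of that approach, streaming each |NEWandrewLINE|-delimited record with a Scanner and writing the URL straight to an output file instead of keeping it in a list (the output path, the UTF-8 charset and the assumption that each record fits comfortably in memory are mine):
import java.io.File;
import java.io.PrintWriter;
import java.util.Scanner;

public class ExtractUrls {
    public static void main(String[] args) throws Exception {
        try (PrintWriter out = new PrintWriter("urls.txt", "UTF-8")) {
            for (File f : new File(args[0]).listFiles()) {
                try (Scanner scanner = new Scanner(f, "UTF-8")) {
                    // read one record at a time instead of the whole file
                    scanner.useDelimiter("\\|NEWandrewLINE\\|");
                    while (scanner.hasNext()) {
                        String[] cols = scanner.next().split("\\|NEWandrewTAB\\|");
                        if (cols.length == 22) {
                            out.println(cols[7].trim());
                        }
                    }
                }
            }
        }
    }
}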
Because this is out of memory and not out of heap, and you have LOTS of small temporary objects, I would suggest you give your JVM a -X maximum heap size limit that fits in your RAM.
To use less memory I would use a buffered reader to pull in the entire line and save on the temporary object creation.
The simple answer is: you should not load all the URLs from the text files into memory. You are surely doing this because you want to process them in a later step. So instead of adding them to a List in memory, do the next step right away (maybe storing them in a database, or checking whether they are reachable) and then forget that URL.
How many URLs do you have? Looks like you're just storing more of them than you can handle.
As far as I can see, the linked list is the only object that is not scoped inside the loop, so cannot be collected.
For an OOM error, it doesn't really matter where it is thrown.
To check this properly, use a profiler (look at JVisualVM for a free one, and you probably already have it). You'll see which objects are in the heap. You can also have the JVM dump its memory into a file when it crashes, then analyse that file with visualvm. You should see that one thing is grabbing all of your memory. I'm suspecting it's all the URLs.
There are several experts here already, so I'll be brief about the problems:
Inappropriate use of StringBuilder:
StringBuilder builder = new StringBuilder();
try {
    inputStream = new FileReader(f);
    int c;
    char d;
    while ((c = inputStream.read()) != -1) {
        d = (char) c;
        builder.append(d);
    }
}
Java is beautiful when you process small amounts of data at a time; remember the garbage collector.
Instead, I would recommend that you read the text file one line at a time, process the line, and move on, never building a huge StringBuilder in memory just to get one String.
Imagine your text file is 1 GB in size; then you are done, mate.
Add the real processing while reading the file (as in item #1).
You don't need to close the InputStream again; the code in the finally block is good enough.
Regards
If the LinkedList eats your memory, every operation that allocates memory afterwards may fail with an OOM error. So this looks like your problem.
You're reading the files into memory. At least one file is simply too big to fit into the default JVM heap. You can allow it to use a lot more memory with an argument like -Xmx1g on the command line after java.
By the way, it is really inefficient to read a file one character at a time!
Instead of trying to split the string (which basically creates an array of substrings based on the split), thereby using more than double the memory each time you call split, you should try to do regex based matching of the start and end patterns, extract individual sub-strings one by one, and then extract the URL from each of those.
Also, if your file is large, I would suggest that you not even load all of it into memory at once: stream its contents into a buffer of manageable size and run the pattern based search on that (removing from and adding to the buffer as you progress through the file contents).
This implementation will slow the program down a bit but will use a considerably smaller amount of memory.
One major problem in your code is that you read the whole file into a StringBuilder, then convert it into a String and then split it into smaller parts. So if the file size is large you will get into trouble. As suggested by others, process the file line by line, as that should save a lot of memory.
Also you should check what the size of your list is after processing each file. If the size is very large you may want to use a different approach or increase the memory for your process via the -Xmx option.

Reading large file in Java -- Java heap space

I'm reading a large tsv file (~40G) and trying to prune it by reading it line by line and printing only certain lines to a new file. However, I keep getting the following exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
at java.lang.StringBuffer.append(StringBuffer.java:323)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at java.io.BufferedReader.readLine(BufferedReader.java:379)
Below is the main part of the code. I specified the buffer size to be 8192 just in case. Doesn't Java clear the buffer once the buffer size limit is reached? I don't see what may cause the large memory usage here. I tried to increase the heap size but it didn't make any difference (machine with 4GB RAM). I also tried flushing the output file every X lines but it didn't help either. I'm thinking maybe I need to make calls to the GC but it doesn't sound right.
Any thoughts? Thanks a lot.
BTW - I know I should call trim() only once, store it, and then use it.
Set<String> set = new HashSet<String>();
set.add("A-B");
...
...
static public void main(String[] args) throws Exception
{
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
    PrintStream output = new PrintStream(outputFile, "UTF-8");
    String line = reader.readLine();
    while (line != null) {
        String[] fields = line.split("\t");
        if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
            output.println((fields[0].trim() + "-" + fields[1].trim()));
        line = reader.readLine();
    }
    output.close();
}
Most likely, what's going on is that the file does not have line terminators, and so the reader just keeps growing its StringBuffer unbounded until it runs out of memory.
The solution would be to read a fixed number of bytes at a time, using the 'read' method of the reader, and then look for new lines (or other parsing tokens) within the smaller buffer(s).
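A rough sketch of that idea, under my own assumptions (the separator searched for is a plain '\n' here, and the 1 MB cap on buffered text is an arbitrary safety limit):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class ChunkedReader {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"))) {
            char[] buf = new char[8192];
            StringBuilder pending = new StringBuilder();
            int n;
            while ((n = reader.read(buf)) != -1) {
                pending.append(buf, 0, n);
                int nl;
                while ((nl = pending.indexOf("\n")) != -1) {
                    processRecord(pending.substring(0, nl));
                    pending.delete(0, nl + 1);
                }
                // fail fast instead of buffering forever if no terminator ever appears
                if (pending.length() > 1000000) {
                    throw new IllegalStateException("No line terminator within 1 MB of text");
                }
            }
        }
    }

    private static void processRecord(String record) {
        // hypothetical per-record handling, e.g. the set.contains(...) check from the question
    }
}
If the 1 MB limit trips, that confirms the missing-terminator theory.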
Are you certain the "lines" in the file are separated by newlines?
I have 3 theories:
The input file is not UTF-8 but some indeterminate binary format that results in extremely long lines when read as UTF-8.
The file contains some extremely long "lines" ... or no line breaks at all.
Something else is happening in code that you are not showing us; e.g. you are adding new elements to set.
To help diagnose this:
Use some tool like od (on UNIX / LINUX) to confirm that the input file really contains valid line terminators; i.e. CR, NL, or CR NL.
Use some tool to check that the file is valid UTF-8.
Add a static line counter to your code, and when the application blows up with an OOME, print out the value of the line counter.
Keep track of the longest line seen so far, and print that out as well when you get an OOME.
For the record, your slightly suboptimal use of trim will have no bearing on this issue.
One possibility is that you are running out of heap space during a garbage collection. The Hotspot JVM uses a parallel collector by default, which means that your application can possibly allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by rapidly allocating and discarding.
You can try instead using the old (pre-1.5) serial collector with the option -XX:+UseSerialGC. There are several other "extended" options that you can use to tune collection.
You might want to try moving the String[] fields declaration out of the loop, as you are creating a new array in every iteration. You could just reuse the old one, right?
