Debugging Java Out of Memory Error

I'm still a relatively new programmer, and an issue I keep having in Java is Out of Memory Errors. I don't want to increase the memory using -Xmx, because I feel that the error is due to poor programming, and I want to improve my coding rather than rely on more memory.
The work I do involves processing lots of text files, each around 1GB when compressed. The code I have here is meant to loop through a directory where new compressed text files are being dropped. It opens the second most recent text file (not the most recent, because this is still being written to), and uses the Jsoup library to parse certain fields in the text file (fields are separated with custom delimiters: "|nTa|" designates a new column and "|nLa|" designates a new row).
I feel there should be no reason for using a lot of memory. I open a file, scan through it, parse the relevant bits, write the parsed version into another file, close the file, and move onto the next file. I don't need to store the whole file in memory, and I certainly don't need to store files that have already been processed in memory.
I'm getting errors when I start parsing the second file, which suggests that the memory used for the first file is never being reclaimed by garbage collection. Please have a look at the code and see if you can spot anything I'm doing that means I'm using more memory than I should be. I want to learn how to do this right so I stop getting memory errors!
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.jsoup.Jsoup;

public class ParseHTML {

    public static int commentExtractField = 3;
    public static int contentExtractField = 4;
    public static int descriptionField = 5;

    public static void main(String[] args) throws Exception {
        File directoryCompleted = null;
        File filesCompleted[] = null;
        while (true) {
            // find second most recent file in completed directory
            directoryCompleted = new File(args[0]);
            filesCompleted = directoryCompleted.listFiles();
            if (filesCompleted.length > 1) {
                TreeMap<Long, File> timeStamps = new TreeMap<Long, File>(Collections.reverseOrder());
                for (File f : filesCompleted) {
                    timeStamps.put(getTimestamp(f), f);
                }
                File fileToProcess = null;
                int counter = 0;
                for (Long l : timeStamps.keySet()) {
                    fileToProcess = timeStamps.get(l);
                    if (counter == 1) {
                        break;
                    }
                    counter++;
                }
                // start processing file
                GZIPInputStream gzipInputStream = null;
                if (fileToProcess != null) {
                    gzipInputStream = new GZIPInputStream(new FileInputStream(fileToProcess));
                } else {
                    System.err.println("No file to process!");
                    System.exit(1);
                }
                Scanner scanner = new Scanner(gzipInputStream);
                scanner.useDelimiter("\\|nLa\\|");
                GZIPOutputStream output = new GZIPOutputStream(new FileOutputStream("parsed/" + fileToProcess.getName()));
                while (scanner.hasNext()) {
                    Scanner scanner2 = new Scanner(scanner.next());
                    scanner2.useDelimiter("\\|nTa\\|");
                    ArrayList<String> row = new ArrayList<String>();
                    while (scanner2.hasNext()) {
                        row.add(scanner2.next());
                    }
                    for (int index = 0; index < row.size(); index++) {
                        if (index == commentExtractField ||
                            index == contentExtractField ||
                            index == descriptionField) {
                            output.write(jsoupParse(row.get(index)).getBytes("UTF-8"));
                        } else {
                            output.write(row.get(index).getBytes("UTF-8"));
                        }
                        String delimiter = "";
                        if (index == row.size() - 1) {
                            delimiter = "|nLa|";
                        } else {
                            delimiter = "|nTa|";
                        }
                        output.write(delimiter.getBytes("UTF-8"));
                    }
                }
                output.finish();
                output.close();
                scanner.close();
                gzipInputStream.close();
            }
        }
    }

    public static Long getTimestamp(File f) {
        String name = f.getName();
        String removeExt = name.substring(0, name.length() - 3);
        String timestamp = removeExt.substring(7, removeExt.length());
        return Long.parseLong(timestamp);
    }

    public static String jsoupParse(String s) {
        if (s.length() == 4) {
            return s;
        } else {
            return Jsoup.parse(s).text();
        }
    }
}
How can I make sure that when I finish with objects, they are destroyed and not using any resources? For example, each time I close the GZIPInputStream, GZIPOutputStream and Scanner, how can I make sure they're completely destroyed?
For the record, the error I'm getting is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1171)
at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
at org.jsoup.parser.Parser.parse(Parser.java:24)
at org.jsoup.Jsoup.parse(Jsoup.java:44)
at ParseHTML.jsoupParse(ParseHTML.java:125)
at ParseHTML.main(ParseHTML.java:81)

I haven't spent very long analysing your code (nothing stands out), but a good general-purpose start would be to familiarise yourself with the free VisualVM tool; there are plenty of reasonable guides to its use online.
There are better commercial profilers in my opinion - JProfiler for one - but VisualVM will at the very least show you which objects/classes most memory is being assigned to, and possibly the method stack traces that caused that to happen. More simply, it shows you heap allocation over time, and you can use this to judge whether you are failing to clear something or whether it is an unavoidable spike.
I suggest this rather than looking at the specifics of your code because it is a useful diagnostic skill to have.
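If you want a snapshot at the exact moment of failure rather than a live view, the stock HotSpot JVM can also write a heap dump when the OutOfMemoryError is thrown, which VisualVM can open afterwards. A typical invocation might look like this (the dump path and program arguments are just illustrations):

java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/parsehtml.hprof ParseHTML /path/to/completed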

Update: This issue was fixed in JSoup 1.6.2
It looks to me like it's probably a bug in the JSoup parser that you're using. At present the documentation for JSoup.parse() carries the warning "BETA: if you do get an exception raised, or a bad parse-tree, please file a bug.", which suggests the authors aren't confident that it's completely safe for use in production code.
I also found several bug reports mentioning out of memory exceptions, one of which suggests that it's due to parse error objects being held statically by JSoup, and that downgrading from JSoup 1.6.1 to 1.5.2 may be a work-around.

I am wondering if your parse is failing because you have bad HTML (e.g. unclosed tags, unpaired quotes, or whatnot) being parsed? You could add output/println statements to see how far through the document you get, if anywhere. The library may never recognise the end of the document/file before running out of memory.
From the Jsoup API docs:
public static Document parse(String html) - Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parse(java.lang.String)
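For what it's worth, a crude way to see how far the parse gets is a running counter in the row loop from the question; this fragment is only a sketch, and the variable names and the interval of 1000 are illustrative:

long rowsSeen = 0;
while (scanner.hasNext()) {
    String rawRow = scanner.next();
    if (++rowsSeen % 1000 == 0) {
        System.err.println("parsed " + rowsSeen + " rows so far");
    }
    // ... existing field splitting and jsoupParse calls on rawRow ...
}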

It's a little hard to tell what's going on but two things come to my mind.
1) In some weird circumstances (depending on the input file), the following loop might load the entire file into memory:
while (scanner2.hasNext()) {
    row.add(scanner2.next());
}
2) Looking at the stack trace, it seems that jsoupParse is the problem. The line Jsoup.parse(s).text() builds a complete parse tree for s in memory, and depending on the string size (which again depends on the particular input file) this might cause the OutOfMemoryError.
Maybe a combination of the two points above is the issue. Again, it's hard to tell just by looking at the code.
Does this always happen with the same file? Did you check the input content and the custom delimiters in it?

Assuming the problem is not in the JSoup code, we can do some memory optimization. For example, the ArrayList<String> row could be stripped out: it holds every field of the current row in memory, when only one field at a time is needed for parsing. (The outer scanner.next() still materialises each full row as a String, so peak usage stays bounded by the longest row, but the extra copy in the list goes away.)
Inner loop with row removed:
// Caution! May contain obvious bugs!
while (scanner.hasNext()) {
    String scanStr = scanner.next();
    // manually count the fields to replace 'row.size()'
    int rowCount = 0;
    int offset = 0;
    while ((offset = scanStr.indexOf("|nTa|", offset)) >= 0) {
        rowCount++;
        offset += "|nTa|".length(); // skip the whole delimiter, not one char
    }
    rowCount++;
    Scanner scanner2 = new Scanner(scanStr);
    scanner2.useDelimiter("\\|nTa\\|");
    int index = 0;
    while (scanner2.hasNext()) {
        String curField = scanner2.next();
        if (index == commentExtractField
                || index == contentExtractField
                || index == descriptionField) {
            output.write(jsoupParse(curField).getBytes("UTF-8"));
        } else {
            output.write(curField.getBytes("UTF-8"));
        }
        String delimiter = "";
        if (index == rowCount - 1) {
            delimiter = "|nLa|";
        } else {
            delimiter = "|nTa|";
        }
        output.write(delimiter.getBytes("UTF-8"));
        index++; // without this, every field is treated as field 0
    }
}

Related

Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?

I am basically looking for a solution that allows me to stream the lines and replace them IN THE SAME FILE, a la Files.lines
Any mechanism in Java 8/NIO for replacing the lines of a big file without loading it in memory?
Basically, no.
Any change to a file that involves changing the number of bytes between offsets A and B can only be done by rewriting the file, or creating a new one. In either case, everything after B has to be loaded / read into memory.
This is not a Java-specific restriction. It is a consequence of the way that modern operating systems represent files, and the low-level (i.e. syscall) APIs that they provide to applications.
In the specific case where you replace one line (or sequence of lines) with a line (or sequence of lines) of exactly the same length, then you can do the replacement using either RandomAccessFile, or by mapping the file into memory. Note that the latter approach won't cause the entire file to be read into memory.
It is also possible to replace or delete lines while updating the file "in place" (changing the file length ...). See @Sergio Montoro's answer for an example. However, with an in-place update, there is a risk that the file will be corrupted if the application is interrupted. And this does involve reading and rewriting all bytes in the file after the insertion / deletion point. And that entails loading them into memory.
There was a mechanism in Java 1: RandomAccessFile; but any such in-place mechanism requires that you know the start offset of the line, and that the new line is the same length as the old one.
Otherwise you have to copy the file up to that line, substitute the new line in the output, and then continue the copy.
You certainly don't have to load the entire file into memory.
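To make the same-length case concrete, here is a minimal sketch using RandomAccessFile, assuming a single-byte encoding and that you already know the byte offset where the line starts; the offset and replacement text below are purely illustrative:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SameLengthReplace {
    // Overwrites newLine.length bytes at lineOffset. The replacement MUST be
    // exactly as long as the text it overwrites, or the file will be garbled.
    static void replaceLine(RandomAccessFile raf, long lineOffset, byte[] newLine)
            throws IOException {
        raf.seek(lineOffset);
        raf.write(newLine);
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw")) {
            // hypothetical: replace the first 5 bytes of the file
            replaceLine(raf, 0L, "HELLO".getBytes("ISO-8859-1"));
        }
    }
}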
Yes.
A FileChannel allows random read/write to any position of a file. Therefore, if you have a read ahead buffer which is long enough you can replace lines even if the new line is longer than the former one.
The following example is a toy implementation which makes two assumptions: 1st) the input file is ISO-8859-1 Unix LF encoded, and 2nd) no transformed line is ever longer than the line it replaces (there is only a one-line read-ahead buffer).
Unless you definitely cannot create a temporary file, you should benchmark this approach against the more natural stream-in / stream-out copy, because I do not know what performance a spinning drive will give you for an algorithm that constantly seeks backwards and forwards in a file.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import static java.nio.file.StandardOpenOption.*;
import java.io.IOException;

public class ReplaceInFile {

    public static void main(String args[]) throws IOException {
        Path file = Paths.get(args[0]);
        ByteBuffer writeBuffer;
        long readPos = 0L;
        long writePos;
        String line_m;
        String line_n;
        String line_t;
        FileChannel channel = FileChannel.open(file, READ, WRITE);
        channel.position(0);
        writePos = readPos;
        line_m = readLine(channel);
        do {
            readPos += line_m.length() + 1;
            channel.position(readPos);
            line_n = readLine(channel);
            line_t = transformLine(line_m) + "\n";
            // exact size: the buffer must hold only the bytes of line_t
            writeBuffer = ByteBuffer.allocate(line_t.length());
            writeBuffer.put(line_t.getBytes("ISO8859_1"));
            System.out.print("replaced line " + line_m + " with " + line_t);
            channel.position(writePos);
            writeBuffer.rewind();
            while (writeBuffer.hasRemaining()) {
                channel.write(writeBuffer);
            }
            writePos += line_t.length();
            line_m = line_n;
            // the write position must never overtake the read-ahead position
            assert writePos <= readPos;
        } while (line_m.length() > 0);
        channel.close();
        System.out.println("Done!");
    }

    public static String transformLine(String input) throws IOException {
        // un-escapes entities, so the transformed line is never longer
        return input.replace("&lt;", "<").replace("&gt;", ">");
    }

    public static String readLine(FileChannel channel) throws IOException {
        ByteBuffer readBuffer = ByteBuffer.allocate(1);
        StringBuffer line = new StringBuffer();
        do {
            int read = channel.read(readBuffer);
            if (read < 1) break;
            readBuffer.rewind();
            char c = (char) readBuffer.get();
            readBuffer.rewind();
            if (c == '\n') break;
            line.append(c);
        } while (true);
        return line.toString();
    }
}

Character Matching DNA Program

I am supposed to write a program using command line arguments to read in 3 different files: a human DNA sequence, a mouse DNA sequence, and an unknown sequence. Without using arrays, I have to compare each character and give the percent match, as well as which one it most closely matches. Here is what I have so far:
import java.io.File;
import java.io.FileInputStream;
import java.io.DataInputStream;
import java.io.*;

public class Lucas_Tilak_Hw8_DNA
{
    public static void main (String args[]) throws IOException
    {
        // First let's take in each file
        File MouseFile = new File(args[0]);
        File HumanFile = new File(args[1]);
        File UnknownFile = new File(args[2]);

        // This allows us to view individual characters
        FileInputStream m = new FileInputStream(MouseFile);
        FileInputStream h = new FileInputStream(HumanFile);
        FileInputStream u = new FileInputStream(UnknownFile);

        // This allows us to read each character one by one.
        DataInputStream mouse = new DataInputStream(m);
        DataInputStream human = new DataInputStream(h);
        DataInputStream unk = new DataInputStream(u);

        // We initialize our future numerators
        int humRight = 0;
        int mouRight = 0;

        // Now we set the counting variable
        int countChar = 0;
        for( countChar = 0; countChar < UnknownFile.length(); countChar++);
        {
            // initialize
            char unkChar = unk.readChar();
            char mouChar = mouse.readChar();
            char humChar = human.readChar();

            // add to numerator if they match
            if (unkChar == humChar)
            {
                humRight++;
            }
            if (unkChar == mouChar)
            {
                mouRight++;
            }
            // add to denominator
            countChar++;
        }

        // convert to fraction
        long mouPercent = (mouRight/countChar);
        long humPercent = (humRight/countChar);

        // print fractions
        System.out.println("Mouse Compare: " + mouPercent);
        System.out.println("Human Compare: " + humPercent);

        if (mouPercent > humPercent)
        {
            System.out.println("mouse");
        }
        else if (mouPercent < humPercent)
        {
            System.out.println("human");
        }
        else
        {
            System.out.println("identity cannot be determined");
        }
    }
}
If I put in random sequences of {G, T, C, A} for each file, it doesn't seem to compare characters: I get 0 for mouPercent and 0 for humPercent. Please help!
Several errors in your code are to blame.
Remove the ; from the end of your for() statement. Basically, you are only reading a single character from each file, and your comparison is strictly limited to that first set of characters. It's unlikely they will have any overlap.
Second error: don't use the "file length". Characters are typically encoded as more than one byte, so you're going to get inconsistent results this way. Better to query the stream to see if there are more bytes available, and stopping when you run out of bytes to read. Most Streams or Readers have an available or ready method that will let you determine if there is more to be read or not.
Third error: DataInputStream is not going to do what you expect it to do. Read the docs -- you're getting strange characters because it's always pulling 2 bytes and building a character using a modified UTF-8 scheme, which only really maps to characters written by the corresponding DataOutput implementing classes. You should research and modify your code to use BufferedReader instead, which will more naturally respect other character encodings like UTF-8, etc. which is most likely the encoding of the files you are reading in.
TL;DR? Your loop is broken, file length is a bad idea for loop terminating condition, and DataInputStream is a special unicorn, so use BufferedReader instead when dealing with characters in normal files.
Also, use floats or doubles instead of longs for your percentage variables: mouRight/countChar is integer division, which truncates any result below 1 down to 0, which is exactly what you are seeing. Cast one operand first, e.g. (double) mouRight / countChar.
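Putting those fixes together, a reshaped main loop might look like the sketch below. It assumes the sequences are plain single-byte text files and Java 7+ for try-with-resources; it is meant to show the shape of the fix, not a polished solution:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DnaCompareSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader mouse = new BufferedReader(new FileReader(args[0]));
             BufferedReader human = new BufferedReader(new FileReader(args[1]));
             BufferedReader unk = new BufferedReader(new FileReader(args[2]))) {
            int humRight = 0, mouRight = 0, countChar = 0;
            int u;
            // read() returns -1 at end of stream, so no file-length guessing
            while ((u = unk.read()) != -1) {
                int m = mouse.read();
                int h = human.read();
                if (u == h) humRight++;
                if (u == m) mouRight++;
                countChar++;
            }
            // floating-point division, so the ratios are not truncated to 0
            System.out.println("Mouse Compare: " + (double) mouRight / countChar);
            System.out.println("Human Compare: " + (double) humRight / countChar);
        }
    }
}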

Java large String returned from findWithinHorizon converted to InputStream

I have written an application which, in one of its modules, parses a huge file and saves it chunk by chunk into a database.
First of all, the following code works; my main problem is to reduce memory usage and generally increase performance.
The following snippet is a small part of the big picture, but after some YourKit profiling it is the most problematic: the lines marked with /*HERE*/ allocate huge amounts of memory.
....
Scanner fileScanner = new Scanner(file, "UTF-8");
String scannedFarm;
try {
    Pattern p = Pattern.compile("(?:^.++$(?:\\r?+\\n)?+){2,100000}+", Pattern.MULTILINE);
    String[] tableName = null;
    /*HERE*/ while ((scannedFarm = fileScanner.findWithinHorizon(p, 0)) != null) {
        boolean continuePrevStream = false;
        Scanner scanner = new Scanner(scannedFarm);
        String[] tmpTableName = scanner.nextLine().split(getSeparator());
        if (tmpTableName.length == 2) {
            tableName = tmpTableName;
        } else {
            if (tableName == null) {
                continue;
            }
            continuePrevStream = true;
        }
        scanner.close();
        /*HERE*/ InputStream is = new ByteArrayInputStream(scannedFarm.getBytes("UTF-8"));
....
It is acceptable to allocate a large amount of memory once, since the String is large (I need it to be such a large chunk); my main problem is that the same allocation happens twice as a result of getBytes.
So my question is: is there a way to transfer the findWithinHorizon result directly to an InputStream without allocating the memory twice?
Is there a more efficient way to achieve the same functionality?
Not exactly the same approach, but instead of findWithinHorizon you could try reading each line and searching for the pattern within the line context. This is sure to reduce memory pressure, as you're not buffering the whole file, as the API states:
If horizon is 0, then the horizon is ignored and this method continues
to search through the input looking for the specified pattern without
bound. In this case it may buffer all of the input searching for the
pattern.
Something like this, where linePattern stands for a per-line version of your regex (a hypothetical name, since the multi-line pattern from the question would need reworking for line-by-line matching):
while (fileScanner.hasNextLine()) {
    String line = fileScanner.nextLine();
    if (linePattern.matcher(line).find()) {
        // handle the match within this line
    }
}

Good Practices in parsing text files in Java

I want to parse a text file that represents a log. I want the parser to be robust enough to handle all errors that might occur, although I am clueless about the best practices and the errors I should account for. I will be using Java to implement this.
Sample log :
2012-07-16 10:23:40,558 - 127.0.0.1 - Paremter array[param1=1,param2=1,param3=0,] - 383
I already wrote parsing code that works as follows:
public Parser(String log) {
    this.log = log;
    this.parse();
}

public void parse() {
    String[] temp = new String[10];
    String[] temp2 = new String[10];
    temp = log.split(" - ");
    key = temp[3];
    id = Integer.parseInt(key);
    String IP = temp[1];
    String str;
    String temp3 = temp[2].substring(temp[2].indexOf("g"), temp[2].indexOf("]"));
    temp = temp3.split(",");
    str = "param1";
    boolean ordered = CheckOrder(temp);
    if (ordered) {
        for (int q = 0; q < temp.length; q++) {
            temp[q] = temp[q].substring(temp[q].indexOf("=") + 1);
        }
        if (temp[0].equals("q")) {
            param = 0;
        } else if (temp[0].equals("k")) {
            param = 1;
        } else {
            param = 2;
        }
        // Same way for all parameters
    }
}
Check the javadoc of all the methods you use, and make sure to handle all the nominal and exceptional cases:
the file doesn't exist: an exception is being thrown. Handle this exception correctly
String.indexOf() doesn't find what it looks for. It returns -1. Handle this case correctly
String.split() doesn't return an array of the length I expect. Handle this case correctly
...
Split your big method into several sub-methods, each doing only one thing.
Write unit tests to check that your methods do what they're supposed to do, with all the possible inputs.
Note that "handling things correctly" might very well mean: throw an exception because the input is incorrect, if the contract is that the logs follow a well-defined format. In this case, it's the code generating the logs that is incorrect. But it's better to have an exception telling which format you expected and which format you got instead, rather than an obscure NullPointerException or ArrayIndexOutOfBoundsException.
The above applies to any kind of code you write, and not just to file parsing.
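As a concrete illustration of the indexOf and split checks above, a hedged sketch of the field extraction might look like this; the class and variable names are made up for the example, not taken from the original code:

public class LogEntryParser {
    private static final String SEPARATOR = " - ";

    // Throws a descriptive exception instead of an obscure
    // ArrayIndexOutOfBoundsException when the line is malformed.
    public static void parse(String log) {
        String[] fields = log.split(SEPARATOR);
        if (fields.length < 4) {
            throw new IllegalArgumentException(
                    "Expected 4 fields separated by '" + SEPARATOR + "', got "
                    + fields.length + " in: " + log);
        }
        int open = fields[2].indexOf('[');
        int close = fields[2].indexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException(
                    "Malformed parameter block: " + fields[2]);
        }
        for (String param : fields[2].substring(open + 1, close).split(",")) {
            int eq = param.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Malformed parameter: " + param);
            }
            String name = param.substring(0, eq);
            String value = param.substring(eq + 1);
            // ... store name/value ...
        }
    }
}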
Side note:
String[] temp = new String[10];
temp = log.split(" - ");
What's the point of creating an array of 10 elements only to discard it right away and replace it with another array (the one returned by log.split(" - "))?

Java iteration reading & parsing

I have a log file that I am reading into a string:
public static String read(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    FileInputStream fs = new FileInputStream(path);
    InputStream in = new BufferedInputStream(fs);
    int r;
    while ((r = in.read()) != -1) {
        sb.append((char) r);
    }
    fs.close();
    in.close();
    return sb.toString();
}
Then I have a parser that iterates over the entire string once
void parse() {
    String con = read("log.txt");
    for (int i = 0; i < con.length(); i++) {
        /* parsing action */
    }
}
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just place the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading?
In C# I understand there is a yield return construct, but I'm stuck with Java.
What are my options in Java?
This is a huge waste of CPU cycles. I loop over all the content in read. Then I loop over all the content in parse. I could just place the /* parsing action */ inside the while loop in the read method, which would be fine, but I don't want to copy the same code all over the place.
It's worse than just a huge waste of CPU cycles. It's a huge waste of memory to read the entire file into a string if you're only going to use it once, and the use is looking at one character at a time moving forward, as your code indicates. And if your file is large, you'll exhaust memory.
You should parse as you read, and never have the entire file loaded into memory at once.
If the parsing action needs to be called from more than one place, make it a function and call it rather than copying the same code all over the place. Copying a single-line function call is fine.
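One way to get the separation the question asks for, without a second pass or C#-style yield, is to pass the parsing action into the read loop as a callback. This sketch assumes Java 8+ for the lambda and IntConsumer; with older Java you would declare a small single-method interface instead:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.IntConsumer;

public class StreamingParser {
    // Reading lives here; the parsing action is supplied by the caller,
    // so neither piece of code is duplicated and the file is read once,
    // one character at a time, without ever being held in memory whole.
    public static void read(String path, IntConsumer action) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int r;
            while ((r = in.read()) != -1) {
                action.accept(r);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        read("log.txt", c -> {
            /* parsing action, one character at a time */
        });
    }
}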
