So I'm using Java to do multi-way external merge sorts of large on-disk files of line-delimited tuples. Batches of tuples are read into a TreeSet, which is then dumped into an on-disk sorted batch. Once all of the data has been exhausted, these batches are merge-sorted to the output.
Currently I'm using magic numbers for figuring out how many tuples we can fit into memory. This is based on a static figure indicating roughly how many tuples fit per MB of heap space, and on how much heap space is available, obtained using:
long max = Runtime.getRuntime().maxMemory();
long used = Runtime.getRuntime().totalMemory();
long free = Runtime.getRuntime().freeMemory();
long space = free + (max - used);
However, this does not always work so well, since we may be sorting tuples of different lengths (for which the static tuples-per-MB figure might be too conservative), and I now want to use flyweight patterns to jam more in there, which may make the figure even more variable.
So I'm looking for a better way to fill the heap-space to the brim. Ideally the solution should be:
reliable (no risk of heap-space exceptions)
flexible (not based on static numbers)
efficient (e.g., not polling runtime memory estimates after every tuple)
Any ideas?
Filling the heap to the brim might be a bad idea due to garbage collector thrashing. (As the memory gets nearly full, the efficiency of garbage collection approaches zero, because the effort for a collection depends on heap size, but the amount of memory freed depends on the size of the objects identified as unreachable.)
However, if you must, can't you simply do it as follows?
for (;;) {
    long freeSpace = getFreeSpace();
    if (freeSpace < 1000000) break;
    while (freeSpace > 0) {
        treeSet.add(readRecord());
        freeSpace -= MAX_RECORD_SIZE;
    }
}
The calls to discover the free memory will be rare, so shouldn't tax performance much. For instance, if you have 1 GB heap space, and leave 1MB empty, and MAX_RECORD_SIZE is ten times average record size, getFreeSpace() will be invoked a mere log(1000) / -log(0.9) ~= 66 times.
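getFreeSpace() isn't defined above; a minimal sketch, reusing the Runtime arithmetic from the question:
// Sketch only: estimate of heap still available for allocation.
static long getFreeSpace() {
    Runtime rt = Runtime.getRuntime();
    return rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
}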
Why bother with calculating how many items you can hold? How about letting Java tell you when you've used up all your memory, catching the error, and continuing? For example,
// prepare output medium now so we don't need to worry about having enough
// memory once the treeset has been filled.
BufferedWriter writer = new BufferedWriter(new FileWriter("output"));
Set<Object> set = new TreeSet<Object>();
int linesRead = 0;
{
    BufferedReader reader = new BufferedReader(new FileReader("input"));
    try {
        String line = reader.readLine();
        while (line != null) {
            set.add(parseTuple(line));
            linesRead += 1;
            line = reader.readLine();
        }
        // end of file reached
        linesRead = -1;
    } catch (OutOfMemoryError e) {
        // while loop broken
    } finally {
        reader.close();
    }
    // since reader and line were declared in a block their resources will
    // now be released
}
// output treeset to file
for (Object o : set) {
    writer.write(o.toString());
    writer.newLine(); // keep the output line-delimited
}
writer.close();
// use linesRead to find position in file for next pass
// or continue on to next file, depending on value of linesRead
If you still have trouble with memory, just make the reader's buffer extra large so as to reserve more memory.
The default size for the buffer in a BufferedReader is 4096 bytes. So when you finish reading you will release upwards of 4 KB of memory. After this your additional memory needs will be minimal. You need enough memory to create an iterator for the set; let's be generous and assume 200 bytes. You will also need memory to store the string output of your tuples (but only temporarily). You say the tuples contain about 200 characters. Let's double that to account for separators: 400 characters, which is 800 bytes. So all you really need is an additional 1 KB, and you're fine since you've just released 4 KB.
The reason you don't need to worry about the memory used to store the string output of your tuples is that those strings are short-lived and only referenced within the output for loop. Note that the Writer will copy the contents into its buffer and then discard the string, so the next time the garbage collector runs the memory can be reclaimed.
I've checked, and an OOME in add will not leave a TreeSet in an inconsistent state: the memory allocation for a new Entry (the internal implementation for storing a key/value pair) happens before the internal representation is modified.
You can really fill the heap to the brim using direct memory writing (it does exist in Java!). It's in sun.misc.Unsafe, but isn't really recommended for use. See here for more details. I'd probably advise writing some JNI code instead, and using existing C++ algorithms.
I'll add this as an idea I was playing around with, involving using a SoftReference as a "sniffer" for low memory.
SoftReference<byte[]> sniffer = new SoftReference<byte[]>(new byte[8192]);
while (iter.hasNext()) {
    tuple = iter.next();
    treeset.add(tuple);
    if (sniffer.get() == null) {
        dump(treeset);
        treeset.clear();
        sniffer = new SoftReference<byte[]>(new byte[8192]);
    }
}
This might work well in theory, but I don't know the exact behaviour of SoftReference.
All soft references to softly-reachable objects are guaranteed to have been cleared before the virtual machine throws an OutOfMemoryError. Otherwise no constraints are placed upon the time at which a soft reference will be cleared or the order in which a set of such references to different objects will be cleared. Virtual machine implementations are, however, encouraged to bias against clearing recently-created or recently-used soft references.
Would like to hear feedback as it seems to me like an elegant solution, although behaviour might vary between VMs?
Testing on my laptop, I found that the soft reference is cleared infrequently, but sometimes it is cleared too early, so I'm thinking of combining it with meriton's answer:
SoftReference<byte[]> sniffer = new SoftReference<byte[]>(new byte[8192]);
while (iter.hasNext()) {
    tuple = iter.next();
    treeset.add(tuple);
    if (sniffer.get() == null) {
        free = MemoryManager.estimateFreeSpace();
        if (free < MIN_SAFE_MEMORY) {
            dump(treeset);
            treeset.clear();
            sniffer = new SoftReference<byte[]>(new byte[8192]);
        }
    }
}
Again, thoughts welcome!
Related
I'm writing a program that is supposed to continually push generated data into a List sensorQueue. The side effect is that I will eventually run out of memory. When that happens, I'd like to drop parts of the list, in this example the first, or older, half. I imagine that if I encounter an OutOfMemoryException, I won't be able to just use sensorQueue = sensorQueue.subList((sensorQueue.size() / 2), sensorQueue.size());, so I came here looking for an answer.
My code:
public static void pushSensorData(String sensorData) {
    try {
        sensorQueue.add(parsePacket(sensorData));
    } catch (OutOfMemoryError e) {
        System.out.println("Backlog full");
        //TODO: Cut the sensorQueue in half to make room
    }
    System.out.println(sensorQueue.size());
}
Is there an easy way to detect an impending OutOfMemoryException then?
You can use something like the code below to determine the max memory and the used memory. Using that information you can decide on the next action in your program, e.g. reduce the queue's size or drop some elements.
final int MEGABYTE = (1024*1024);
MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();
long maxMemory = heapUsage.getMax() / MEGABYTE;
long usedMemory = heapUsage.getUsed() / MEGABYTE;
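For example (a sketch only; the 80% threshold and the choice of action are assumptions, not part of the answer above):
// Sketch: act before the heap is exhausted (assumes heapUsage.getMax() is defined, i.e. not -1).
if (usedMemory > maxMemory * 80 / 100) {
    // e.g. drop the older half of the backlog
    sensorQueue.subList(0, sensorQueue.size() / 2).clear();
}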
Hope this helps!
The problem with subList is that it creates a sublist while keeping the original one in memory. However, ArrayList and other extensions of AbstractList have removeRange(int fromIndex, int toIndex), which removes elements from the current list and so doesn't require additional memory.
For other List implementations there is the similar remove(int index), which you can call multiple times for the same purpose.
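Note that removeRange is protected, so a minimal sketch of calling it would use a small subclass (the class, method, and element type names below are made up for illustration):
import java.util.ArrayList;

// Sketch: expose ArrayList's protected removeRange via a subclass.
class TrimmableArrayList<E> extends ArrayList<E> {
    public void dropRange(int fromIndex, int toIndex) {
        removeRange(fromIndex, toIndex); // removes the elements in place, no extra copy
    }
}

// Usage sketch: drop the older half of the backlog.
// TrimmableArrayList<SensorPacket> sensorQueue = new TrimmableArrayList<SensorPacket>();
// sensorQueue.dropRange(0, sensorQueue.size() / 2);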
I think your idea is severely flawed (sorry).
There is no OutOfMemoryException, there is only OutOfMemoryError! Why is that important? Because errors leave the app in an unstable state. Well, I'm not that sure about that claim in general, but it definitely holds for OutOfMemoryError, because there is no guarantee that you will be able to catch it. You can consume all of the memory within your try-catch block, and the OutOfMemoryError will be thrown somewhere in JDK code. So your catching is pointless.
And what is the reason for this anyway? How many messages do you want in the list? Say your message is 1MB and your heap is 1000MB. If we stop considering other classes, your heap size dictates that your list will contain up to 1000 messages, right? Wouldn't it be easier to set the heap sufficiently big for your desired number of messages, and specify the message count in a simpler, integral form? And if your answer is "no", then you still cannot catch OutOfMemoryError reliably, so I'd advise that your answer rather should be "yes".
If you really need to consume everything possible, then checking memory usage as a percentage, as #fabsas recommended, could be a way. But I'd go with the integral definition: it's easier to manage. Your list will contain up to N messages.
You can drop a range of elements from an ArrayList using subList:
list.subList(from, to).clear();
Where from is the first index of the range to be removed and to is the last. In your case, you can do something like:
list.subList(0, sensorQueue.size() / 2).clear();
Note that subList returns a view backed by the original list, so clearing it removes those elements from sensorQueue itself.
I have a big txt file with integers in it. Each line in the file has two integer numbers separated by whitespace. The size of the file is 63 MB.
Pattern p = Pattern.compile("\\s");
try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] tokens = p.split(line);
        String s1 = new String(tokens[0]);
        String s2 = new String(tokens[1]);
        int startLabel = Integer.valueOf(s1) - 1;
        int endLabel = Integer.valueOf(s2) - 1;
        Vertex fromV = vertices.get(startLabel);
        Vertex toV = vertices.get(endLabel);
        Edge edge = new Edge(fromV, toV);
        fromV.addEdge(edge);
        toV.addEdge(edge);
        edges.add(edge);
        System.out.println("Edge from " + fromV.getLabel() + " to " + toV.getLabel());
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.String.substring(String.java:1913)
at java.lang.String.subSequence(String.java:1946)
at java.util.regex.Pattern.split(Pattern.java:1202)
at java.util.regex.Pattern.split(Pattern.java:1259)
at SCC.main(SCC.java:25)
Why am I getting this exception? How can I change my code to avoid it?
EDIT:
I've already increased the heap size to 2048m.
What is consuming it? That's what I would want to know also.
For all I know, the JVM should allocate memory for the list of vertices, the set of edges, the buffer for the buffered reader, and one small string "line". I don't see where this OutOfMemory is coming from.
I read about the String.split() method. I think it's causing a memory leak, but I don't know what I should do about it.
What you should try first is to reduce the file to something small enough that it works. That will allow you to appraise just how large a problem you have.
Second, your problem is definitely unrelated to String#split since you are using it on just one line at a time. What is consuming your heap are the Vertex and Edge instances. You'll have to redesign this towards a smaller footprint, or completely overhaul your algorithms to be able to work with only a part of the graph in memory, the rest on the disk.
P.S. Just a general Java note: don't write
String s1 = new String(tokens[0]);
String s2 = new String(tokens[1]);
you just need
String s1 = tokens[0];
String s2 = tokens[1];
or even just use tokens[0] directly instead of s1, since it's about as clear.
Easiest way: increase your heap size:
Add -Xmx512m -Xms512m (or even more) arguments to the JVM
Increase the heap memory limit, using the -Xmx JVM option.
More info here.
You are getting this exception because your program is storing too much data in the java heap.
Although your exception is showing up in the Pattern.split() method, the actual culprit could be any large memory user in your code, such as the graph you are building. Looking at what you provided, I suspect the graph data structure is storing much redundant data. You may want to research a more space-efficient graph structure.
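For instance, here is a sketch of a more compact edge representation (not the poster's code; it assumes vertex labels are dense integers starting at 0, as in the input, and the class name is made up):
import java.util.Arrays;

// Sketch: store edges as two parallel int arrays (about 8 bytes per edge)
// instead of Vertex/Edge objects with their per-object overhead.
class CompactEdgeList {
    private int[] from = new int[1024];
    private int[] to = new int[1024];
    private int count = 0;

    void addEdge(int fromLabel, int toLabel) {
        if (count == from.length) { // grow when full
            from = Arrays.copyOf(from, count * 2);
            to = Arrays.copyOf(to, count * 2);
        }
        from[count] = fromLabel;
        to[count] = toLabel;
        count++;
    }

    int edgeCount() {
        return count;
    }
}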
If you are using the Sun JVM, try the JVM option -XX:+HeapDumpOnOutOfMemoryError to create a heap dump and analyze that for any heavy memory users, and use that analysis to optimize your code. See Using HeapDumpOnOutOfMemoryError parameter for heap dump for JBoss for more info.
If that's too much work for you, as others have indicated, try increasing the JVM heap space to a point where your program no longer crashes.
Whenever you get an OOM while trying to parse stuff, it's just that the method you are using is not scalable. Even though increasing the heap might solve the issue temporarily, it is not scalable. For example, if tomorrow your file size increases by an order of magnitude, you would be back at square one.
I would recommend trying to read the file in pieces: cache x lines of the file, process them, clear the cache, and repeat.
You can use either ehcache or guava cache.
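A rough sketch of the read-in-pieces idea (plain collections rather than an actual cache library; chunkLines and processChunk are made-up names):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: read a fixed number of lines, process them, clear, repeat.
static void readInChunks(String filePath, int chunkLines) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        List<String> chunk = new ArrayList<String>(chunkLines);
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == chunkLines) {
                processChunk(chunk); // hypothetical: build edges for these lines, flush, etc.
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            processChunk(chunk);
        }
    }
}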
The way you parse the string could be changed.
try (Scanner scanner = new Scanner(new FileReader(filePath))) {
    while (scanner.hasNextInt()) {
        int startLabel = scanner.nextInt();
        int endLabel = scanner.nextInt();
        scanner.nextLine(); // discard the rest of the line.
        // use start and end.
    }
}
I suspect the memory consumption is actually in the data structure you build rather than how you read the data, but this should make it more obvious.
I have the following Java class to read from a file containing many lines of tab-delimited strings. An example line looks like the following:
GO:0085044 GO:0085044 GO:0085044
The code reads each line and uses the split function to put the three substrings into an array, then it puts them into a two-level hash.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LCAReader {
    public static void main(String[] args) {
        Map<String, Map<String, String>> termPairLCA = new HashMap<String, Map<String, String>>();
        File ifile = new File("LCA1.txt");
        try {
            BufferedReader reader = new BufferedReader(new FileReader(ifile));
            String line = null;
            while ((line = reader.readLine()) != null) {
                String[] arr = line.split("\t");
                if (termPairLCA.containsKey(arr[0])) {
                    if (termPairLCA.get(arr[0]).containsKey(arr[1])) {
                        System.out.println("Error: Duplicate term in LCACache");
                    } else {
                        termPairLCA.get(arr[0]).put(new String(arr[1]), new String(arr[2]));
                    }
                } else {
                    Map<String, String> tempMap = new HashMap<String, String>();
                    tempMap.put(new String(arr[1]), new String(arr[2]));
                    termPairLCA.put(new String(arr[0]), tempMap);
                }
            }
            reader.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
When I ran the program, I got the following runtime error after it had been running for some time. I noticed the memory usage kept increasing.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.compile(Pattern.java:1469)
at java.util.regex.Pattern.<init>(Pattern.java:1150)
at java.util.regex.Pattern.compile(Pattern.java:840)
at java.lang.String.split(String.java:2304)
at java.lang.String.split(String.java:2346)
at LCAReader.main(LCAReader.java:17)
The input file is almost 2 GB and the machine I ran the program on has 8 GB of memory. I also tried the -Xmx4096m parameter to run the program, but that did not help. So I guess there is some memory leak in my code, but I cannot find it.
Can anyone help me on this? Thanks in advance!
There's no memory leak; you're just trying to store too much data. 2GB of text will take 4GB of RAM as Java characters; plus there's about 48 bytes per String object overhead. Assuming the text is in 100 character lines, there's about another GB, for a total of 5GB -- and we haven't even counted the Map.Entry objects yet! You'd need a Java heap of at least, conservatively, 6GB to run this program on your data, and maybe more.
There are a couple of easy things you can do to improve this. First, lose the new String() constructors -- they're useless and just make the garbage collector work harder. Strings are immutable so you never need to copy them. Second, you could use the intern pool to share duplicate strings -- this may or may not help, depending on what the data actually looks like. But you could try, for example,
tempMap.put(arr[1].intern(), arr[2].intern() );
These simple steps might help a lot.
I don't see any leak; you simply need a huge amount of memory to store your map.
There is a very good way of verifying this: make a heap dump with the option -XX:+HeapDumpOnOutOfMemoryError and import it into Eclipse Memory Analyzer, which comes in a standalone version. It can show you the biggest retained objects and the reference trees that could prevent the garbage collector from doing its job.
In addition, a profiler such as the NetBeans Profiler can give you a lot of interesting real-time information (for instance, to check the number of String and char[] instances).
Also, it is good practice to split your code into different classes, each with a different responsibility: the "two-key map" class on one side and a "parser" class on the other; it should make debugging easier...
It is definitely not a good idea to store this huge map in RAM... or you need to run a benchmark with some smaller files and extrapolate to estimate the RAM you need on your system to fit your worst case... and set Xmx to the proper value.
Why don't you use a key-value store such as Berkeley DB: simpler than a relational DB, and it should fit your need for two-level indexing exactly.
Check this post for the choice of the store: key-value store suggestion
Good luck
You probably shouldn't use String.split and store the information as pure Strings, as this generates lots of String objects on the fly.
Try a char-based approach, since your format seems rather fixed and you know the exact indices of the different data points on a line. A small sketch of that idea is shown below.
If you're a bit more into experimenting, you could try a NIO-backed approach with a memory-mapped DirectByteBuffer or a CharBuffer that is used to traverse the file. There you could just record the indices of the different data points in marker objects and only load the real String data later in the process, when needed.
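A small sketch of the char-based splitting (assumes each line has exactly two tabs, as in the example line; variable names are mine):
// Sketch: split the three tab-separated fields without String.split,
// avoiding the regex machinery and the extra temporary objects it creates.
int firstTab = line.indexOf('\t');
int secondTab = line.indexOf('\t', firstTab + 1);
String key1 = line.substring(0, firstTab);
String key2 = line.substring(firstTab + 1, secondTab);
String value = line.substring(secondTab + 1);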
Sorry I can't post code, but I have a BufferedReader with 50000000 bytes set as the buffer size. It works as you would expect for half an hour, the HDD light flashing every two minutes or so as it reads in the big chunk of data, then going quiet again as the CPU processes it. But after about half an hour (this is a very big file), the HDD starts thrashing as if it is reading one byte at a time. It is still in the same loop, and I think I checked free RAM to rule out swapping (heap size is the default).
Probably won't get any helpful answers, but worth a try.
OK I have changed heap size to 768mb and still nothing. There is plenty of free memory and java.exe is only using about 300mb.
Now I have profiled it and the heap stays at about 200MB, well below what is available. CPU stays at 50%. Yet the HDD starts thrashing like crazy. I have... no idea. I am going to rewrite the whole thing in C#, that is my solution.
Here is the code (it is just a throw-away script, not pretty):
BufferedReader s = null;
HashMap<String, Integer> allWords = new HashMap<String, Integer>();
HashSet<String> pageWords = new HashSet<String>();
long[] pageCount = new long[78592];
long pages = 0;
Scanner wordFile = new Scanner(new BufferedReader(new FileReader("allWords.txt")));
while (wordFile.hasNext()) {
    allWords.put(wordFile.next(), Integer.parseInt(wordFile.next()));
}
s = new BufferedReader(new FileReader("wikipedia/enwiki-latest-pages-articles.xml"), 50000000);
StringBuilder words = new StringBuilder();
String nextLine = null;
while ((nextLine = s.readLine()) != null) {
    if (a.matcher(nextLine).matches()) {
        continue;
    }
    else if (b.matcher(nextLine).matches()) {
        continue;
    }
    else if (c.matcher(nextLine).matches()) {
        continue;
    }
    else if (d.matcher(nextLine).matches()) {
        nextLine = s.readLine();
        if (e.matcher(nextLine).matches()) {
            if (f.matcher(s.readLine()).matches()) {
                pageWords.addAll(Arrays.asList(words.toString().toLowerCase().split("[^a-zA-Z]")));
                words.setLength(0);
                pages++;
                for (String word : pageWords) {
                    if (allWords.containsKey(word)) {
                        pageCount[allWords.get(word)]++;
                    }
                    else if (!word.isEmpty() && allWords.containsKey(word.substring(0, word.length() - 1))) {
                        pageCount[allWords.get(word.substring(0, word.length() - 1))]++;
                    }
                }
                pageWords.clear();
            }
        }
    }
    else if (g.matcher(nextLine).matches()) {
        continue;
    }
    words.append(nextLine);
    words.append(" ");
}
Have you tried removing the buffer size and trying it out with the defaults?
It may be not that the file buffering isn't working, but that your program is using up enough memory that your virtual memory system is page swapping to disk. What happens if you try with a smaller buffer size? What about larger?
I'd bet that you are running out of heap space and you are getting stuck doing back-to-back GCs. Have you profiled the app to see what is going on during that time? Also, try running with -verbose:gc to see garbage collection as it happens. You could also try starting with a larger heap, like:
-Xms1000m -Xmx1000m
That will give you 1gb of heap so if you do use that all up, it should be much later than it is currently happening.
It appears to me that if the file you are reading is very large, then the following lines could result in a large portion of the file being copied to memory via a StringBuilder. If the process' memory footprint becomes too large, you will likely swap and/or throw your garbage collector into a spin.
...
words.append(nextLine);
words.append(" ");
Hopefully this may help: http://www.velocityreviews.com/forums/t131734-bufferedreader-and-buffer-size.html
Before you assume there is something wrong with Java and reading IO, I suggest you write a simple program which just reads the file as fast as it can. You should be able to read the file at 20 MB/s or more regardless of file size with default buffering. You should be able to do this by stripping down your application to just read the file. Then you can prove to yourself how long it takes to read the file.
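Something like this would do as a baseline (a sketch; the file name is taken from the posted code):
// Sketch: measure raw line-reading speed with default buffering.
long start = System.currentTimeMillis();
long chars = 0;
BufferedReader in = new BufferedReader(
        new FileReader("wikipedia/enwiki-latest-pages-articles.xml"));
String line;
while ((line = in.readLine()) != null) {
    chars += line.length();
}
in.close();
long millis = System.currentTimeMillis() - start;
System.out.println("Read " + (chars / 1000000) + " M chars in " + (millis / 1000.0) + " s");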
You have used quite a lot of expensive operations. Perhaps you should look at how you can make your parser more efficient using a profiler. e.g.
word.substring(0, word.length() - 1)
is the same as
word
so the first if clause and the second are the same.
I have the following code, which reads in a file, appends a \r\n to the end of each line, and puts the result in a StringBuffer:
public InputStream getInputStream() throws Exception {
    StringBuffer holder = new StringBuffer();
    try {
        FileInputStream reader = new FileInputStream(inputPath);
        BufferedReader br = new BufferedReader(new InputStreamReader(reader));
        String strLine;
        //Read File Line By Line
        boolean start = true;
        while ((strLine = br.readLine()) != null) {
            if (!start)
                holder.append("\r\n");
            holder.append(strLine);
            start = false;
        }
        //Close the input stream
        reader.close();
    } catch (Throwable e) { //this is where the heap error is caught up to 2Gb
        System.err.println("Error: " + e.getMessage());
    }
    return new StringBufferInputStream(holder.toString());
}
I tried reading in a 400 MB file, and I changed the max heap space to 2 GB, yet it still gives the out-of-memory heap exception. Any ideas?
It may be to do with how the StringBuffer resizes when it reaches capacity: this involves creating a new char[] double the size of the previous one and then copying the contents across into the new array. Together with the points already made about characters in Java being stored as 2 bytes, this will definitely add to your memory usage.
To resolve this you could create a StringBuffer with sufficient capacity to begin with, given that you know the file size (and hence approximate number of characters to read in). However, be warned that the array allocation will also occur if you then attempt to convert this large StringBuffer into a String.
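For example (a sketch; it assumes a roughly one-byte-per-character input encoding, so the character count is at most the file length):
// Sketch: pre-size the buffer from the file length to avoid repeated resizing.
long fileLength = new File(inputPath).length();
StringBuffer holder = new StringBuffer((int) Math.min(fileLength, Integer.MAX_VALUE));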
Another point: You should typically favour StringBuilder over StringBuffer as the operations on it are faster.
You could consider implementing your own "CharBuffer", using for example a LinkedList of char[] to avoid expensive array allocation / copy operations. You could make this class implement CharSequence and perhaps avoid converting to a String altogether. Another suggestion for more compact representation: If you're reading in English text containing large numbers of repeated words you could read and store each word, using the String.intern() function to significantly reduce storage.
To begin with Java strings are UTF-16 (i.e. 2 bytes per character), so assuming your input file is ASCII or a similar one-byte-per-character format then holder will be ~2x the size of the input data, plus the extra \r\n per line and any additional overhead. There's ~800MB straight away, assuming a very low storage overhead in StringBuffer.
I could also believe that the contents of your file is buffered twice - once at the I/O level and once in the BufferedReader.
However, to know for sure, it's probably best to look at what's actually on the heap - use a tool like HPROF to see exactly where your memory has gone.
In terms of solving this, I suggest you process a line at a time, writing out each line after you have added the line termination. That way your memory usage should be proportional to the length of a line, instead of the entire file.
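A sketch of that line-at-a-time approach (the Writer destination and outputPath are assumptions; the point is that only one line is ever held in memory):
// Sketch: copy line by line, appending the terminator as we go,
// instead of accumulating the whole file in a StringBuffer.
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath)));
Writer out = new BufferedWriter(new FileWriter(outputPath)); // outputPath is made up
String strLine;
while ((strLine = br.readLine()) != null) {
    out.write(strLine);
    out.write("\r\n");
}
out.close();
br.close();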
It's an interesting question, but rather than stress over why Java is using so much memory, why not try a design that doesn't require your program to load the entire file into memory?
You have a number of problems here:
Unicode: characters take twice as much space in memory as on disk (assuming a 1 byte encoding)
StringBuffer resizing: could double (permanently) and triple (temporarily) the occupied memory, though this is the worst case
StringBuffer.toString() temporarily doubles the occupied memory since it makes a copy
All of these combined mean that you could require temporarily up to 8 times your file's size in RAM, i.e. 3.2G for a 400M file. Even if your machine physically has that much RAM, it has to be running a 64bit OS and JVM to actually get that much heap for the JVM.
All in all, it's simply a horrible idea to keep such a huge String in memory, and it's totally unnecessary as well: since your method returns an InputStream, all you really need is a FilterInputStream that adds the line breaks on the fly.
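A minimal sketch of that idea (hypothetical class name; written as a plain InputStream over a BufferedReader rather than a literal FilterInputStream subclass, and using the platform default charset as the original code does):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;

// Sketch: serve the file one line at a time with "\r\n" appended,
// so only a single line is ever buffered in memory.
class LineAppendingInputStream extends InputStream {
    private final BufferedReader reader;
    private byte[] current = new byte[0];
    private int pos = 0;

    LineAppendingInputStream(String path) throws IOException {
        this.reader = new BufferedReader(new FileReader(path));
    }

    @Override
    public int read() throws IOException {
        while (pos >= current.length) {
            String line = reader.readLine();
            if (line == null) {
                return -1; // end of input
            }
            current = (line + "\r\n").getBytes(); // platform default charset
            pos = 0;
        }
        return current[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}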
It's the StringBuffer. The empty constructor creates a StringBuffer with an initial capacity of 16 characters. Now if you append something and the capacity is not sufficient, it does an array copy of the internal char array into a new, larger buffer.
So in fact, with each line appended, the StringBuffer has to create a copy of the complete internal array, which nearly doubles the required memory when appending the last line. Together with the UTF-16 representation, this results in the observed memory demand.
Edit
Michael is right in saying that the internal buffer is not incremented in small portions; it roughly doubles in size each time more memory is needed. But still, in the worst case, say the buffer needs to expand capacity just with the very last append, it creates a new array twice the size of the current one, so at that moment you need roughly three times the amount of memory.
Anyway, I've learned the lesson: StringBuffer (and Builder) may cause unexpected OutOfMemory errors and I'll always initialize it with a size, at least when I have to store large Strings. Thanks for the question :)
At the last insert into the StringBuffer, you need three times the memory allocated, because the StringBuffer always expands to (size + 1) * 2 (which is already double because of Unicode). So a 400 MB file could require an allocation of 800 MB * 3 == 2.4 GB at the end of the inserts. It may be somewhat less; that depends on exactly when the threshold is reached.
The suggestion to concatenate Strings rather than using a Buffer or Builder is in order here. There will be a lot of garbage collection and object creation (so it will be slow), but a much lower memory footprint.
[At Michael's prompting, I investigated this further, and concat wouldn't help here, as it copies the char buffer, so while it wouldn't require triple, it would require double the memory at the end.]
You could continue to use the Buffer (or better yet Builder in this case) if you know the maximum size of the file and initialize the size of the Buffer on creation and you are sure this method will only get called from one thread at a time.
But really such an approach of loading such a large file into memory at once should only be done as a last resort.
I would suggest you use the OS file cache instead of copying the data into Java memory via characters and back to bytes again. If you re-read the file as required (perhaps transforming it as you go), it will be faster and very likely simpler.
You need over 2 GB because 1-byte letters use a char (2 bytes) in memory, and when your StringBuffer resizes you need double that (to copy the old array into the larger new array). The new array is typically 50% larger, so you can need up to 6x the original file size. If the performance weren't bad enough, you are using StringBuffer instead of StringBuilder, which synchronizes every call when it is clearly not needed. (This only slows you down, but uses the same amount of memory.)
Others have explained why you're running out of memory. As to how to solve this problem, I'd suggest writing a custom FilterInputStream subclass. This class would read one line at a time, append the "\r\n" characters and buffer the result. Once the line has been read by the consumer of your FilterInputStream, you'd read another line. This way you'd only ever have one line in memory at a time.
I also recommend checking out the Commons IO FileUtils class for this. Specifically: org.apache.commons.io.FileUtils#readFileToString. You can also specify the encoding if you know you are only using ASCII.
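A one-line usage sketch (variable names are mine; note this still reads the whole file into memory at once):
// Commons IO: read the whole file as a String with an explicit encoding.
String contents = FileUtils.readFileToString(new File(inputPath), "US-ASCII");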