Reading big files and performing some operations in Java

First of all, I will try to explain what I need to do.
I need to read a file whose size could be anywhere from 1 byte to 2 GB. The 2 GB maximum is because I try to use MappedByteBuffer for fast reading; maybe later I will read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I append to a StringBuilder; I then put this StringBuilder into an ArrayList.
However, I also need to do the following:
The user can specify a blockSize, which is the number of chars I have to read into the StringBuilder (basically the number of file bytes converted to chars).
Once I have collected the user-defined char count, I create a copy of the StringBuilder and put it into the ArrayList.
These steps are performed for every char read. The problem is with the StringBuilder: if the file is big (over 500 MB), I get an OutOfMemoryError.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
at java.lang.StringBuilder.<init>(StringBuilder.java:80)
at java.lang.StringBuilder.<init>(StringBuilder.java:106)
at borrows.wheeler.ReadFile.readFile(ReadFile.java:43)
Java Result: 1
I am posting my code; maybe someone could suggest improvements to it or suggest some alternatives.
public class ReadFile {
    //matrix block size
    public int blockSize = 100;
    public int charCounter = 0;

    public ArrayList readFile(File file) throws FileNotFoundException, IOException {
        FileChannel fc = new FileInputStream(file).getChannel();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, (int) fc.size());
        ArrayList characters = new ArrayList();
        int counter = 0;
        StringBuilder sb = new StringBuilder(); //blockSize-1
        while (mbb.hasRemaining()) {
            char charAscii = (char) mbb.get();
            counter++;
            charCounter++;
            if (counter == blockSize) {
                sb.append(charAscii);
                characters.add(new StringBuilder(sb)); //new StringBuilder(sb)
                sb.delete(0, sb.length());
                counter = 0;
            } else {
                sb.append(charAscii);
            }
            if (!mbb.hasRemaining()) {
                characters.add(sb);
            }
        }
        fc.close();
        return characters;
    }
}
EDIT:
I am implementing the Burrows-Wheeler transform. I have to read every file and then, based on the block size, create as many matrices as needed. I believe the Wikipedia article explains it better than I can:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

If you load large files, it's not entirely surprising that you run out of memory.
How much memory do you have? Are you on a 64-bit system with 64-bit Java? How much heap memory have you allocated (e.g. using the -Xmx setting)?
Bear in mind that you will need at least twice as much memory as the file size, because Java uses Unicode UTF-16, which uses at least 2 bytes for each character, but your input is one byte per character. So to load a 2 GB file you will need at least 4 GB allocated to the heap just for storing this text data.
Also, you need to sort out the logic in your code - you do the same sb.append(charAscii) in both the if and the else, and you test !mbb.hasRemaining() in every iteration of a while (mbb.hasRemaining()) loop.
As I asked in your previous question, do you need to store StringBuilders, or would the resulting Strings be OK? Storing Strings would save space, because StringBuilder allocates memory in big chunks (I think it doubles in size every time it runs out of space!) and so may waste a lot.
If you do have to use StringBuilders, then pre-sizing them to the value of blockSize would make the code more memory-efficient (and faster).
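A rough sketch of both suggestions (a builder pre-sized to blockSize and reused, with plain Strings stored instead of StringBuilders); the names mbb and blockSize follow the question's code, and this is not presented as a drop-in replacement:
ArrayList<String> blocks = new ArrayList<String>();
StringBuilder sb = new StringBuilder(blockSize);   // pre-sized, so it never has to grow
while (mbb.hasRemaining()) {
    sb.append((char) mbb.get());
    if (sb.length() == blockSize) {
        blocks.add(sb.toString());                 // store the immutable String, not the builder
        sb.setLength(0);                           // reuse the same builder for the next block
    }
}
if (sb.length() > 0) {
    blocks.add(sb.toString());                     // trailing partial block
}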

I try to use MappedByteBuffer for fast reading. Maybe later I will read the file in chunks in order to handle files of arbitrary size.
When I read the file, I convert its bytes (using ASCII encoding) to chars, which I append to a StringBuilder and then put this StringBuilder into an ArrayList.
This sounds more like a problem than a solution. I suggest to you that the file already is ASCII, or character data; that it could be read pretty efficiently using a BufferedReader; and that it can be processed one line at a time.
So do that. You won't get even double the speed by using a MappedByteBuffer, and everything you're doing, including the MappedByteBuffer, is consuming memory on a truly heroic scale.
If the file isn't such that it can be processed line by line, or record by record, there is something badly wrong upstream.
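A minimal sketch of that line-at-a-time approach, assuming the file really is ASCII text; process(line) is a hypothetical stand-in for whatever per-line work is needed:
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "US-ASCII"))) {
    String line;
    while ((line = br.readLine()) != null) {
        process(line);   // hypothetical handler: only one line is in memory at a time
    }
}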

Related

Reading large file in bytes by chunks with dynamic buffer size

I'm trying to read a large file by chunks and save them in an ArrayList of bytes.
My code, in short, looks like this:
public ArrayList<byte[]> packets = new ArrayList<>();

FileInputStream fis = new FileInputStream("random_text.txt");
byte[] buffer = new byte[512];
while (fis.read(buffer) > 0) {
    packets.add(buffer);
}
fis.close();
But it has a behavior that I don't know how to handle. For example, if a file contains only the words "hello world", the chunk does not need to be 512 bytes long. In fact, I want each chunk to be at most 512 bytes, not for every chunk to have exactly that size.
First of all, what you are doing is probably a bad idea. Storing a file's contents in memory like this is liable to be a waste of heap space ... and can lead to OutOfMemoryError exceptions and / or a requirement for an excessively large heap if you process large (enough) input files.
The second problem is that your code is wrong. You are repeatedly reading the data into the same byte array. Each time you do, it overwrites what was there before. So you will end up with a list containing lots of references to a single byte array ... containing just the last chunk of data that you read.
To solve the problem that you asked about1, you will need to copy the chunk that you read to a new (smaller) byte array.
Something like this:
public ArrayList<byte[]> packets = new ArrayList<>();

try (FileInputStream fis = new FileInputStream("random_text.txt")) {
    byte[] buffer = new byte[512];
    int len;
    while ((len = fis.read(buffer)) > 0) {
        packets.add(Arrays.copyOf(buffer, len));
    }
}
Note that this also deals with the second problem I mentioned. And it fixes a potential resource leak by using try-with-resources syntax to manage the closing of the input stream.
A final issue: If this is really a text file that you are reading, you probably should be using a Reader to read it, and char[] or String to hold it.
But even if you do that, there are some awkward edge cases if your text contains Unicode code points that are not in plane 0; for example, emojis. The edge cases involve code points that are represented as a surrogate pair AND the pair being split on a chunk boundary. Reading and storing the text as lines would avoid that.
1 - The issue here is not the "wasted" space. Unless you are reading and caching a large number of small files, any space wastage due to "short" chunks will be unimportant. The important issue is knowing which bytes in each byte[] are actually valid data.
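For the Reader-based, line-oriented alternative mentioned above, a rough sketch (the file name is the one from the question; UTF-8 is an assumption about the content):
List<String> lines = new ArrayList<>();
try (BufferedReader reader = Files.newBufferedReader(
        Paths.get("random_text.txt"), StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);   // each line is a complete unit, so no surrogate pair is ever split
    }
}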

Handling COMP-3 and EBCDIC conversion in Java to ASCII for large files

I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out-of-memory exception, as the amount of data handled is huge (about 5 GB). My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This results in an out-of-memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the right direction on how to handle this?
Note: the file may contain records of different lengths, hence splitting it based on a fixed record length seems not possible.
As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe, and if it is English text you can then do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    int num = in.read(buf, total, buf.length);
    while (num >= 0 && total + num < buf.length) {
        total += num;
        num = in.read(buf, total, buf.length - total);
    }
    return num;
}
If all the records are the same length, create an array of the record length and the above method will read one record at a time.
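A usage sketch for the fixed-length case, assuming the helper above is in the same class, a made-up record length of 80 bytes, a placeholder file name, and the US EBCDIC code page Cp037 for the text fields (all of these are assumptions about your data):
byte[] record = new byte[80];                       // assumed fixed record length
try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
    while (readBuffer(in, record) > 0) {
        String text = new String(record, "Cp037");  // decodes EBCDIC text fields only
        // COMP-3 (packed decimal) fields still need a dedicated decoder
    }
}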
Finally, the JRecord project has classes to read fixed-length files, etc. It can do COMP-3 conversion. Note: I am the author of JRecord.
I'm running into an out-of-memory exception, as the amount of data handled is huge (about 5 GB).
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This results in an out-of-memory exception, which I can understand
Me too.
but I can't use a file scanner either, since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case, not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record-type attributes at the start of each record, or at least in the fixed part of the record.
I'll have to split it based on an attribute record_id at a particular position (say, at the beginning of each record) that will tell me the record length.
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
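A hedged sketch of that record-at-a-time approach, assuming a one-byte record_id at the start of each record, a hypothetical lengthFor() lookup that maps it to the record body length, and a placeholder file name:
try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("mainframe.dat")))) {
    int recordId;
    while ((recordId = in.read()) != -1) {
        byte[] body = new byte[lengthFor(recordId)]; // lengthFor() is a hypothetical lookup
        in.readFully(body);                          // throws EOFException on a truncated record
        // decode EBCDIC text fields with e.g. new String(body, "Cp037");
        // COMP-3 fields need a packed-decimal decoder
    }
}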

getting Java OutOfMemoryError: Java heap space error that I can't debug

I am struggling to figure out what's causing this OutOfMemoryError. Making more memory available isn't the solution, because my system doesn't have enough memory. Instead I have to figure out a way of re-writing my code.
I've simplified my code to try to isolate the error. Please take a look at the following:
File[] files = new File(args[0]).listFiles();
int filecnt = 0;
LinkedList<String> urls = new LinkedList<String>();

for (File f : files) {
    if (filecnt > 10) {
        System.exit(1);
    }
    System.out.println("Doing File " + filecnt + " of " + files.length + " :" + f.getName());
    filecnt++;

    FileReader inputStream = null;
    StringBuilder builder = new StringBuilder();
    try {
        inputStream = new FileReader(f);
        int c;
        char d;
        while ((c = inputStream.read()) != -1) {
            d = (char) c;
            builder.append(d);
        }
    }
    finally {
        if (inputStream != null) {
            inputStream.close();
        }
    }
    inputStream.close();

    String mystring = builder.toString();
    String temp[] = mystring.split("\\|NEWandrewLINE\\|");
    for (String s : temp) {
        String temp2[] = s.split("\\|NEWandrewTAB\\|");
        if (temp2.length == 22) {
            urls.add(temp2[7].trim());
        }
    }
}
I know this code is probably pretty confusing :) I have loads of text files in the directory that is specified in args[0]. These text files were created by me. I used |NEWandrewLINE| to indicate a new row in the text file, and |NEWandrewTAB| to indicate a new column. In this code snippet, I am trying to access the URL of each stored row (which is in the 8th column of each row). So, I read in the whole text file, split the string on |NEWandrewLINE|, and then split the resulting substrings on |NEWandrewTAB|. I add the URL to the LinkedList (called "urls") with the line: urls.add(temp2[7].trim())
Now, the output of running this code is:
Doing File 0 of 973 :results1322453406319.txt
Doing File 1 of 973 :results1322464193519.txt
Doing File 2 of 973 :results1322337493419.txt
Doing File 3 of 973 :results1322347332053.txt
Doing File 4 of 973 :results1322330379488.txt
Doing File 5 of 973 :results1322369464720.txt
Doing File 6 of 973 :results1322379574296.txt
Doing File 7 of 973 :results1322346981999.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at Twitter.main(Twitter.java:86)
Where main line 86 relates to the line builder.append(d); in this example.
But the thing I don't understand is that if I comment out the line urls.add(temp2[7].trim()); I don't get any error. So the error seems to be caused by the linkedlist "urls" overfilling. But why then does the reported error relate to the StringBuilder?
Try to replace urls.add(temp2[7].trim()); with urls.add(new String(temp2[7].trim()));.
I suppose that your problem is that you are in fact storing the entire file content and not just the extracted URL field in your urls list, although that's not really obvious. It is actually an implementation-specific issue with the String class, but usually String#split and String#trim return new String objects which contain the same internal char array as the original string and only differ in their offset and length fields. Using the new String(String) constructor makes sure that you only keep the relevant part of the original data.
The linked list is using more memory each time you add a string. This means you can be left with not enough memory to build your StringBuilder.
The way to avoid this issue is to write the results to a file instead of to a List, as you don't appear to have enough memory to keep the List in memory.
Because this is "out of memory" and not "out of heap", you have LOTS of small temporary objects.
I would suggest you give your JVM an -Xmx maximum heap size limit that fits in your RAM.
To use less memory, I would use a buffered reader to pull in an entire line at a time and save on temporary object creation.
The simple answer is: you should not load all the URLs from the text files into memory. You are surely doing this because you want to process them in a next step. So instead of adding them to a List in memory, do the next step (maybe storing them in a database or checking whether they are reachable) and forget that URL.
How many URLs do you have? It looks like you're just storing more of them than you can handle.
As far as I can see, the linked list is the only object that is not scoped inside the loop, so cannot be collected.
For an OOM error, it doesn't really matter where it is thrown.
To check this properly, use a profiler (look at JVisualVM for a free one, and you probably already have it). You'll see which objects are in the heap. You can also have the JVM dump its memory into a file when it crashes, then analyse that file with visualvm. You should see that one thing is grabbing all of your memory. I'm suspecting it's all the URLs.
There are several experts here already, so I'll be brief about the problems:
Inappropriate use of StringBuilder:
StringBuilder builder = new StringBuilder();
try {
    inputStream = new FileReader(f);
    int c;
    char d;
    while ((c = inputStream.read()) != -1) {
        d = (char) c;
        builder.append(d);
    }
}
Java is beautiful when you process small amounts of data at a time; remember the garbage collector.
Instead, I would recommend that you read the (text) file one line at a time, process the line, and move on; never create a huge memory ball of a StringBuilder just to get a String.
Imagine your text file is 1 GB in size; you are done, mate.
Do the real processing while reading the file (as in item #1).
You don't need to close the InputStream again; the code in the finally block is good enough.
regards
If the LinkedList eats your memory, every subsequent operation that allocates memory may fail with an OOM error. So this looks like your problem.
You're reading the files into memory. At least one file is simply too big to fit into the default JVM heap. You can allow it to use a lot more memory with an arg like -Xmx1g on the command line after java.
By the way, it is really inefficient to read a file one character at a time!
Instead of trying to split the string (which basically creates an array of substrings based on the split), thereby using more than double the memory each time you use the split, you should try to do regex-based matching of the start and end patterns, extract individual substrings one by one, and then extract the URL from each.
Also, if your file is large, I would suggest that you not even load all of that into memory at once ... stream its contents into a buffer (of manageable size) and use the pattern-based search on that (and keep removing from / adding to the buffer as you progress through the file contents).
The implementation will slow down the program a bit but will use considerably less memory.
One major problem in your code is that you read the whole file into a StringBuilder, then convert it into a String, and then split it into smaller parts. So if the file is large you will get into trouble. As suggested by others, process the file line by line (or record by record), as that should save a lot of memory.
Also, you should check the size of your list after processing each file. If it is very large, you may want to use a different approach or increase the memory for your process via the -Xmx option.
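Since the rows in these files are delimited by |NEWandrewLINE| rather than real newlines, one hedged sketch of record-at-a-time streaming is a java.util.Scanner with a custom delimiter, so only one row (plus the growing urls list) is ever in memory; f is the File from the question's loop and UTF-8 is an assumption:
try (Scanner sc = new Scanner(f, "UTF-8")) {
    sc.useDelimiter("\\|NEWandrewLINE\\|");
    while (sc.hasNext()) {
        String[] cols = sc.next().split("\\|NEWandrewTAB\\|");
        if (cols.length == 22) {
            urls.add(new String(cols[7].trim()));   // copy, so the whole row isn't retained
        }
    }
}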

Is there an easier way to change BufferedReader to string?

Right now I have:
;; buffer->string: BufferedReader -> String
(defn buffer->string [buffer]
  (loop [line (.readLine buffer) sb (StringBuilder.)]
    (if (nil? line)
      (.toString sb)
      (recur (.readLine buffer) (.append sb line)))))
This is too slow.
Edit:
I have a BufferedReader.
When I try to do (str BufferedReader) it gives me "java.io.BufferedReader#1ce784b".
The above loop is too slow, and I run out of memory space.
(clojure.contrib.duck-streams/slurp* your-buffer) ; is what you want
Your code is slow because buffer isn't hinted.
I don't know Clojure, so I can't tell if you have some detail wrong in your code, but using StringBuffer and appending the input line by line is the correct way to do it (well, using a StringBuilder initialized to the expected final size if known would bring significant but not dramatic improvements).
If you run out of memory, then maybe the content of your BufferedReader is simply too large to fit into your memory and there is no way to have it as a single string - in that case, you'll either have to increase your heap size or find a way to process the data one small chunk at a time.
BTW, if you know the size of your input, a more efficient method would be to use a CharBuffer and fill it by using Reader.read() (you'll have to pay attention to the return value and use it in a loop).
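A hedged Java sketch of that CharBuffer idea (Java, since the underlying classes are Java anyway); reader stands for your BufferedReader, and knownSize is assumed to be an upper bound on the character count:
CharBuffer cb = CharBuffer.allocate(knownSize);     // knownSize: assumed upper bound
while (cb.hasRemaining() && reader.read(cb) != -1) {
    // keep filling until the buffer is full or the reader is exhausted
}
cb.flip();                                          // switch from writing to reading
String result = cb.toString();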
buffer.ToString()? Or in your case, maybe (.toString buffer)?
In Java you would do something like:
public String getStringFromBuffer(BufferedReader bRead) throws IOException {
    String line = null;
    StringBuffer theText = new StringBuffer();
    while ((line = bRead.readLine()) != null) {
        theText.append(line + "\n");
    }
    return theText.toString();
}
I don't know Clojure, just Java. Let's work from there.
Some points to consider:
If your target JVM version is >= 1.5, you can use StringBuilder instead of StringBuffer for a small performance improvement (no synchronization, and you don't need it). Read about it here:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/StringBuilder.html
But your big performance cost is probably on the buffer expansion. When you instantiate a StringBuffer/StringBuilder without using the constructor with the capacity argument you get a small capacity.
When starting with a small capacity (the internal buffer size) you have many expansions - every time you exceed that capacity, its internal buffer is reallocated to a new capacity, just large enough to hold the newly appended text, which means copying all previously held text to the new buffer.
This is very slow when you are appending more text to an already very large string.
If you have access to the size of the text you are reading (a file size would be an approximation) you can significantly reduce the amount of expansions.
I could also tell you to use the read() method of the BufferedReader, the one with three arguments, this one:
BufferedReader.read(char[], int, int)
You could then use one of the String class constructors that accepts a char array to convert the char buffer into a String:
String.String(char[], int, int)
...however, I suspect that the performance improvement will not be that big, especially when compared with the one of reducing how many StringBuilder expansions you'll have.
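As a rough sketch of those two calls together, using the bRead reader from the snippet above; sizeHint is an assumed estimate of the character count:
char[] chars = new char[sizeHint];                  // sizeHint: assumed character count
int filled = 0;
int n;
while (filled < chars.length
        && (n = bRead.read(chars, filled, chars.length - filled)) != -1) {
    filled += n;
}
String text = new String(chars, 0, filled);         // copies only the filled portion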
Whatever the approach, you seem to have a memory capacity problem:
In the end you will need at least twice as much memory as the whole text occupies.
Whether you use the StringBuilder/StringBuffer approach or the other one, in the end you will have to copy the text contents into the new String holding the result.
In the end you will probably need to think outside this box:
Are you sure you only have a BufferedReader as a start and a String as an end? You should provide the broader picture!
If this is the broadest you have, you will need at least a JVM instance configured with more heap, since you will probably run out of memory with any of these solutions anyway.
Use slurp to read (reasonably sized) files in.
Use spit to write them back out again.

Why does reading a file into memory take 4x the memory in Java?

I have the following code, which reads in a file, appends \r\n to the end of each line, and puts the result in a string buffer:
public InputStream getInputStream() throws Exception {
    StringBuffer holder = new StringBuffer();
    try {
        FileInputStream reader = new FileInputStream(inputPath);
        BufferedReader br = new BufferedReader(new InputStreamReader(reader));
        String strLine;
        //Read File Line By Line
        boolean start = true;
        while ((strLine = br.readLine()) != null) {
            if (!start)
                holder.append("\r\n");
            holder.append(strLine);
            start = false;
        }
        //Close the input stream
        reader.close();
    } catch (Throwable e) { //this is where the heap error is caught up to 2Gb
        System.err.println("Error: " + e.getMessage());
    }
    return new StringBufferInputStream(holder.toString());
}
I tried reading in a 400 MB file, and I changed the max heap space to 2 GB, yet it still gives the out-of-memory heap exception. Any ideas?
It may be to do with how the StringBuffer resizes when it reaches capacity - this involves creating a new char[] double the size of the previous one and then copying the contents across into the new array. Together with the points already made about characters in Java being stored as 2 bytes, this will definitely add to your memory usage.
To resolve this you could create a StringBuffer with sufficient capacity to begin with, given that you know the file size (and hence approximate number of characters to read in). However, be warned that the array allocation will also occur if you then attempt to convert this large StringBuffer into a String.
Another point: You should typically favour StringBuilder over StringBuffer as the operations on it are faster.
You could consider implementing your own "CharBuffer", using for example a LinkedList of char[] to avoid expensive array allocation / copy operations. You could make this class implement CharSequence and perhaps avoid converting to a String altogether. Another suggestion for more compact representation: If you're reading in English text containing large numbers of repeated words you could read and store each word, using the String.intern() function to significantly reduce storage.
To begin with, Java strings are UTF-16 (i.e. 2 bytes per character), so assuming your input file is ASCII or a similar one-byte-per-character format, holder will be ~2x the size of the input data, plus the extra \r\n per line and any additional overhead. There's ~800MB straight away, assuming a very low storage overhead in StringBuffer.
I could also believe that the contents of your file is buffered twice - once at the I/O level and once in the BufferedReader.
However, to know for sure, it's probably best to look at what's actually on the heap - use a tool like HPROF to see exactly where your memory has gone.
In terms of solving this, I suggest you process a line at a time, writing out each line after you have added the line termination. That way your memory usage should be proportional to the length of a line, instead of the entire file.
It's an interesting question, but rather than stress over why Java is using so much memory, why not try a design that doesn't require your program to load the entire file into memory?
You have a number of problems here:
Unicode: characters take twice as much space in memory as on disk (assuming a 1 byte encoding)
StringBuffer resizing: could double (permanently) and triple (temporarily) the occupied memory, though this is the worst case
StringBuffer.toString() temporarily doubles the occupied memory since it makes a copy
All of these combined mean that you could require temporarily up to 8 times your file's size in RAM, i.e. 3.2G for a 400M file. Even if your machine physically has that much RAM, it has to be running a 64bit OS and JVM to actually get that much heap for the JVM.
All in all, it's simply a horrible idea to keep such a huge String in memory - and it's totally unnecessary as well - since your method returns an InputStream, all you really need is a FilterInputStream that adds the line breaks on the fly.
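A minimal sketch of that idea (written as a plain InputStream subclass rather than a literal FilterInputStream; the class name is illustrative, inputPath is the field from the question, and ASCII input is assumed):
class LineBreakInputStream extends InputStream {
    private final BufferedReader reader;
    private byte[] current = new byte[0];
    private int pos = 0;
    private boolean first = true;

    LineBreakInputStream(String inputPath) throws IOException {
        this.reader = new BufferedReader(new FileReader(inputPath));
    }

    @Override
    public int read() throws IOException {
        while (pos >= current.length) {             // current chunk exhausted: fetch next line
            String line = reader.readLine();
            if (line == null) {
                return -1;                          // end of file
            }
            current = (first ? line : "\r\n" + line).getBytes("US-ASCII");
            first = false;
            pos = 0;
        }
        return current[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
This keeps at most one line in memory at a time, so memory usage is proportional to the longest line rather than the whole file.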
It's the StringBuffer. The empty constructor creates a StringBuffer with an initial capacity of 16 characters. Now if you append something and the capacity is not sufficient, it does an arraycopy of the internal char array into a new, larger buffer.
So in fact, with each line appended the StringBuffer may have to create a copy of the complete internal array, which nearly doubles the required memory when appending the last line. Together with the UTF-16 representation, this results in the observed memory demand.
Edit
Michael is right when saying that the internal buffer is not incremented in small portions - it roughly doubles in size each time you need more memory. But still, in the worst case, say the buffer needs to expand capacity just on the very last append, it creates a new array twice the size of the current one - so in that case, for a moment, you need roughly three times the amount of memory.
Anyway, I've learned the lesson: StringBuffer (and Builder) may cause unexpected OutOfMemory errors and I'll always initialize it with a size, at least when I have to store large Strings. Thanks for the question :)
At the last insert into the StringBuffer, you need three times the memory allocated, because the StringBuffer always expands to (size + 1) * 2 (which is already double because of Unicode). So a 400 MB file could require an allocation of 800 MB * 3 == 2.4 GB at the end of the inserts. It may be somewhat less; that depends on exactly when the threshold is reached.
The suggestion to concatenate Strings rather than using a Buffer or Builder is in order here. There will be a lot of garbage collection and object creation (so it will be slow), but a much lower memory footprint.
[At Michael's prompting, I investigated this further, and concat wouldn't help here, as it copies the char buffer, so while it wouldn't require triple, it would require double the memory at the end.]
You could continue to use the Buffer (or better yet Builder in this case) if you know the maximum size of the file and initialize the size of the Buffer on creation and you are sure this method will only get called from one thread at a time.
But really such an approach of loading such a large file into memory at once should only be done as a last resort.
I would suggest you use the OS file cache instead of copying the data into Java memory via characters and back to bytes again. If you re-read the file as required (perhaps transforming it as you go), it will be faster and very likely simpler.
You need over 2 GB because 1-byte letters use a char (2 bytes) in memory, and when your StringBuffer resizes you need double that (to copy the old array to the larger new array). The new array is typically 50% larger, so you can need up to 6x the original file size. As if the performance weren't bad enough, you are using StringBuffer instead of StringBuilder, which synchronizes every call when it is clearly not needed. (This only slows you down, but uses the same amount of memory.)
Others have explained why you're running out of memory. As to how to solve this problem, I'd suggest writing a custom FilterInputStream subclass. This class would read one line at a time, append the "\r\n" characters and buffer the result. Once the line has been read by the consumer of your FilterInputStream, you'd read another line. This way you'd only ever have one line in memory at a time.
I also recommend checking out the Commons IO FileUtils class for this. Specifically: org.apache.commons.io.FileUtils#readFileToString. You can also specify the encoding if you know you are only using ASCII.
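For example (a sketch assuming Commons IO is on the classpath, inputPath is the field from the question, and the input really is ASCII):
// import java.io.File;
// import org.apache.commons.io.FileUtils;
String contents = FileUtils.readFileToString(new File(inputPath), "US-ASCII");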
