Reading from a file, running out of memory - java

I was recently asked an interview question that had to do with reading from a CSV file and summing up entries in certain cells. When asked to optimize it, I couldn't answer how to deal with the case of running out of memory if we were given a CSV of, say, 100 GB.
In Java, how exactly does reading from a file work? How do we know when something is too big? How do we deal with that? I was told that you could pass around an intermediate reader object instead of trying to read the entire file?

The interviewer gave you a hint - BufferedReader. It is an efficient choice for reading a large file line by line.
Small example:
String line;
try (BufferedReader br = new BufferedReader(new FileReader("c:/test.txt"))) {
    while ((line = br.readLine()) != null) {
        // do processing
    }
}
Here is the documentation
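For the CSV case in the question, here is a minimal sketch that sums one column while streaming, so memory use stays constant regardless of file size. The file name, delimiter, column index, and the assumption that every row is numeric (no header) are all placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvColumnSum {
    public static void main(String[] args) throws IOException {
        long sum = 0;
        // Only the current line is ever held in memory.
        try (BufferedReader br = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] cells = line.split(",");        // assumed delimiter
                sum += Long.parseLong(cells[2].trim());  // assumed column index
            }
        }
        System.out.println("Total: " + sum);
    }
}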

There are several ways to read from a file in Java. Some of them involve keeping all of the file's lines (or data) in memory as you "read" the data delimited by something like a newline character (reading line by line, for example).
For large files you want to process smaller pieces at a time, using the Scanner class (or something like it that reads a specific number of bytes at a time).
Sample code:
try (FileInputStream inputStream = new FileInputStream(path);
     Scanner sc = new Scanner(inputStream, "UTF-8")) {
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
}

You can use RandomAccessFile to read the file. It may not be the best solution though.

Related

Reading a file in java using buffered reader correctly

I have the following code to read a file in Java and print out the lines. I implemented it in two ways:
using streams:
List<String> list = new ArrayList<>();
try (BufferedReader br = Files.newBufferedReader(file.toPath())) {
list = br.lines().collect(Collectors.toList());
Map<String, Long> totalCount = list.stream() .....
Using loops:
try (FileReader reader = new FileReader(file);
final BufferedReader bufferedReader = new BufferedReader(reader)) {
String line;
while ((line = bufferedReader.readLine()) != null) {.. }
I was told this is wrong, and that using BufferedReader this way misuses the language's features. Is there a better way to do this? I want to know the correct way to use the language features.
These two examples do two slightly different things. The key difference is that the first solution first reads all lines into memory, before then iterating the lines.
The second example reads the file line by line, this means that its content can be processed line by line.
First way: easy to write, read, and understand. But: as said, it reads the whole file into memory. Which isn't a problem for small files, but for really large files, this can create all kinds of issues (it takes time to read a large file, and you might run out of memory for really large files).
The remaining differences between the two approaches are more a matter of style, with one exception: br.lines() already gives you a Stream, so it makes no sense to first collect the lines into a List and then stream that List.
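A minimal sketch of streaming the lines directly, without collecting into a List first. The grouping collector here is an assumption, standing in for whatever the original totalCount computed:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LineCounts {
    public static void main(String[] args) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(Paths.get("input.txt"))) {
            // Lines are streamed lazily; only the accumulating map grows.
            Map<String, Long> totalCount = br.lines()
                    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
            System.out.println(totalCount);
        }
    }
}

This keeps the laziness of br.lines() intact, so even a file that does not fit in memory can be processed, as long as the result map stays small.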

Modify content of large file

I have extracted my tables from my database into JSON files. Now I want to read these files and remove all double quotes from them. It seemed easy, and I have tried hundreds of solutions, some of which ran into out-of-memory problems. I'm dealing with files that are more than 1 GB in size. The code you will find below has strange behaviour, and I don't understand why it returns empty files.
public void replaceDoubleQuotes(String fileName) {
    log.debug(" start formatting " + fileName + " ...");
    File firstFile = new File("C:/sqlite/db/tables/" + fileName);
    String oldContent = "";
    String newContent = "";
    BufferedReader reader = null;
    BufferedWriter writer = null;
    FileWriter writerFile = null;
    String stringQuotes = "\\\\\\\\\"";
    try {
        reader = new BufferedReader(new FileReader(firstFile));
        writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
        writer = new BufferedWriter(writerFile);
        while ((oldContent = reader.readLine()) != null) {
            newContent = oldContent.replaceAll(stringQuotes, "");
            writer.write(newContent);
        }
        writer.flush();
        writer.close();
    } catch (Exception e) {
        log.error(e);
    }
}
And when I try to use FileWriter(path, true) to write at the end of the file, the program doesn't stop growing the file until the hard disk is full. Thanks for the help.
PS: I also tried using substring and appending the new content, writing the result after the while loop, but that doesn't work either.
TL;DR
Do not read and write the same file concurrently.
The issue
Your code starts reading, and then immediately truncates the file it is reading.
reader = new BufferedReader(new FileReader(firstFile));
writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
writer = new BufferedWriter(writerFile);
The first line opens a read handle to the file.
The second line opens a write handle to the same file.
It is not obvious from the documentation of the FileWriter constructors, but when you do not use a constructor that lets you specify the append parameter, it defaults to false, meaning the file is immediately truncated if it already exists.
At this point (line 2) you have just erased the file you were about to read. So you end up with an empty file.
What about using append=true?
Well, then the file is not erased when it is opened, which is "good". So your program starts reading the first line, and outputs (to the same file) the filtered version.
So each time a line is read, another is appended.
No wonder your program never reaches the end of the file: each time it advances a line, it appends another line to process. Generally speaking, you'll never reach the end of the file (of course, if the file is a single line to begin with, you might, but that's a corner case).
The solution
Write to a temporary file, and IF (and only if) you succeed, swap the files, if you really need to.
An advantage of this solution: if for whatever reason your process crashes, the original file is untouched and you can retry later, which is usually a good thing. Your process is "repeatable".
A disadvantage: you'll need twice the space at some point. (Although you could compress the temp file to reduce that factor; but still.)
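A minimal sketch of this approach, assuming the same directory layout as the question. A plain replace stands in for the question's regex, and the swap happens only after the rewrite succeeds:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class QuoteStripper {
    public static void replaceDoubleQuotes(String fileName) throws IOException {
        Path source = Paths.get("C:/sqlite/db/tables/" + fileName);
        Path temp = Paths.get("C:/sqlite/db/tables/" + fileName + ".tmp");
        // Read the original, write filtered lines to the temp file.
        try (BufferedReader reader = Files.newBufferedReader(source);
             BufferedWriter writer = Files.newBufferedWriter(temp)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.replace("\"", ""));
                writer.newLine();
            }
        }
        // Only reached if the rewrite succeeded: swap the temp file in.
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }
}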
About out of memory issues
When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only use one line-worth of memory at a time.
Therefore it generally avoids memory usage issues (unless of course, you have a file without line breaks, in which case it makes no difference at all).
Other solutions, involving reading the whole file at once, then performing the search/replace in memory, then writing the contents back do not scale that well, so it's good you avoided this kind of computation.
Not related but important
Check out the try-with-resources syntax to properly close your resources (reader/writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is, in a finally clause).
Another thing: I'm pretty sure no Java program written by a mere mortal will beat tools like sed or awk, which are available on most Unix platforms (and others). Maybe you should check whether rolling your own in Java is worth what amounts to a shell one-liner.
@GPI already provided a great answer on why reading and writing concurrently is causing the issue you're experiencing. It's also worth noting that reading 1 GB of data into the heap at once can definitely cause an OutOfMemoryError if enough heap isn't allocated, which is likely. To solve this problem you could use an InputStream and read chunks of the file at a time, writing to another file until the process is complete, then ultimately replace the existing file with the modified one and delete the temporary. With this approach you could even use a ForkJoinTask to help, since it's such a large job.
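A minimal sketch of the chunked idea (without the ForkJoinTask part): memory use is bounded by the buffer size, and filtering the quote byte directly is safe for UTF-8, since 0x22 never occurs inside a multi-byte sequence. File names and buffer size are placeholders:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ChunkedCopy {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // 64 KB chunks: constant memory use
        try (InputStream in = Files.newInputStream(Paths.get("table.json"));
             OutputStream out = Files.newOutputStream(Paths.get("table.filtered.json"))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Compact the chunk in place, dropping every '"' byte.
                int kept = 0;
                for (int i = 0; i < read; i++) {
                    if (buffer[i] != '"') {
                        buffer[kept++] = buffer[i];
                    }
                }
                out.write(buffer, 0, kept);
            }
        }
    }
}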
Side note:
There may be a better solution than "create new file, write to new file, replace existing, delete new file".

high-level streams and low-level streams in IO Java? [duplicate]

I would like to know the specific difference between BufferedReader and FileReader.
I do know that BufferedReader is much more efficient than FileReader, but can someone please explain why (specifically and in detail)? Thanks.
First, you should understand "streaming" in Java, because all "Readers" in Java are built upon this concept.
File Streaming
File streaming is carried out by the FileInputStream object in Java.
// it reads a byte at a time and stores it in the 'byt' variable
int byt;
while ((byt = fileInputStream.read()) != -1) {
    fileOutputStream.write(byt);
}
This object reads a byte (8 bits) at a time and writes it to the given output file.
A practical application of it is working with raw binary data files, such as images or audio files (use AudioInputStream instead of FileInputStream for audio files).
On the other hand, it is very inconvenient and slow for text files, because looping through one byte at a time, doing some processing, and storing the processed byte back is tedious and time-consuming.
You also need to provide the character set of the text file, e.g. whether the characters are Latin or Chinese. Otherwise, the program would decode and encode 8 bits at a time, and you'd see weird characters printed on the screen or written to the output file (whenever a character is more than 1 byte long, i.e. non-ASCII).
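For example, wrapping the byte stream in an InputStreamReader with an explicit charset takes care of the decoding. A minimal sketch (file name assumed):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class DecodedRead {
    public static void main(String[] args) throws IOException {
        // The InputStreamReader decodes the raw bytes with the given charset,
        // so multi-byte characters come out as proper chars.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("example.txt"), StandardCharsets.UTF_8))) {
            System.out.println(br.readLine());
        }
    }
}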
File Reading
This is just a fancy way of saying "file streaming" with built-in charset handling (i.e. no need to decode the bytes yourself, as above; the platform default charset is used).
The FileReader class is specifically designed to deal with text files.
As you've seen, byte streaming is fine for raw binary data, but for text it is not so convenient.
So the Java-dudes added the FileReader class to deal specifically with text files. It reads one character at a time (which, on disk, may be one or more bytes, depending on the charset). That is a real improvement over raw FileInputStream for text!
so the streaming operation looks like this:
int c;
while ((c = fileReader.read()) != -1) {
    // some logic
}
Please note: both classes use an int variable to store the value retrieved from the input (so every char is converted to an int while fetching, and back to a char while storing).
The advantage here is that this class deals only with text, so you don't have to specify the charset and a few other properties. It provides an out-of-the-box solution for most text-processing cases, and it supports internationalization and localization.
But it is still very slow (imagine reading one character at a time and looping through it!).
Buffering streams
To tackle the problem of looping over one character at a time, the Java-dudes added another spectacular piece of functionality: creating a buffer of data before processing.
The concept is much like a user streaming a video on YouTube: the video is buffered ahead of playback to provide a smooth viewing experience. (Though the browser may keep buffering until the whole video is buffered ahead of time.) The same technique is used by the BufferedReader class.
A BufferedReader object takes a FileReader object as input, which carries all the necessary information about the text file to be read (such as the file path and charset).
BufferedReader br = new BufferedReader( new FileReader("example.txt") );
When the "read" instruction is given to the BufferedReader object, it uses the FileReader object to read the data from the file. When an instruction is given, the FileReader object reads 2 (or 4) bytes at a time and returns the data to the BufferedReader and the reader keeps doing that until it hits '\n' or '\r\n' (The end of the line symbol).
Once a line is buffered, the reader waits patiently, until the instruction to buffer the next line is given.
Meanwhile, The BufferReader object creates a special memory place (On the RAM), called "Buffer", and stores all the fetched data from the FileReader object.
// this variable points to the buffered line
String line;
// keep reading lines and print them
while ((line = br.readLine()) != null) {
    printWriter.println(line);
}
Now, instead of handing you a couple of bytes at a time, a whole line is fetched from the buffer in RAM, and when you are done processing the data, you can write the whole line back to disk. This makes the process run far faster than going character by character.
But again, why do we need to pass a FileReader object to the BufferedReader? Can't we just say "buffer this file" and let the BufferedReader take care of the rest? Wouldn't that be sweet?
Well, the BufferedReader class only knows how to create a buffer and store incoming data; it is indifferent to where that data comes from. So the same class can be used for many input streams other than text files.
That being said, when you provide a FileReader object as input, it buffers the file; in the same way, if you provide an InputStreamReader, it buffers the terminal/console input until it hits a newline. For example:
// Object that reads console inputs
InputStreamReader console = new InputStreamReader(System.in);
BufferedReader br = new BufferedReader(console);
System.out.println(br.readLine());
This way, you can read (or buffer) many kinds of streams with the same BufferedReader class, such as text files, consoles, network data, etc., and all you have to remember is:
bufferedReader.readLine();
to read whatever you've buffered.
In simple terms:
A FileReader class is a general tool to read in characters from a File. The BufferedReader class can wrap around Readers, like FileReader, to buffer the input and improve efficiency. So you wouldn't use one over the other, but both at the same time by passing the FileReader object to the BufferedReader constructor.
In detail
FileReader is used for input of character data from a disk file. The input file can be an ordinary ASCII, one byte per character text file. A Reader stream automatically translates the characters from the disk file format into the internal char format. The characters in the input file might be from other alphabets supported by the UTF format, in which case there will be up to three bytes per character. In this case, also, characters from the file are translated into char format.
As with output, it is good practice to use a buffer to improve efficiency. Use BufferedReader for this. This is the same class we've been using for keyboard input. These lines should look familiar:
BufferedReader stdin =
new BufferedReader(new InputStreamReader( System.in ));
These lines create a BufferedReader, but connect it to an input stream from the keyboard, not to a file.
Source: http://www.oopweb.com/Java/Documents/JavaNotes/Volume/chap84/ch84_3.html
BufferedReader requires a Reader, of which FileReader is one - it descends from InputStreamReader, which descends from Reader.
FileReader - read character files
BufferedReader - "Read text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines."
http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
http://docs.oracle.com/javase/7/docs/api/java/io/FileReader.html
Actually BufferedReader makes use of Readers like FileReader.
The FileReader class reads from a file, but its efficiency is low, since it retrieves one character at a time. BufferedReader takes chunks of data and stores them in a buffer, so most reads are served from the buffer instead of going back to the file one character at a time.
BufferedReader - a class you can use, in practice, as a substitute for Scanner; it reads files and other input.
FileReader - as the name suggests.

Java reading nth line

I am trying to read a specific line from a text file; however, I don't want to load the file into memory (it can get really big).
I have been looking, but every example I have found requires either reading every line (this would slow my code down, as there are over 100,000 lines) or loading the whole thing into an array and getting the correct element (the file will have a lot of lines).
An example of what I want to do:
String line = File.getLine(5);
"code is not actual code, it is made up to show the principle of what i want"
Is there a way to do this?
-----Edit-----
I have just realized this file will be written to in between reading lines (appending to the end of the file).
Is there a way to do this?
Not unless the lines are of a fixed number of bytes each, no.
You don't have to actually keep each line in memory - but you've got to read through the whole file to get to the line you want, as otherwise you won't know where to start reading.
You have to read the file line by line. Otherwise how do you know when you have gotten to line 5 (as in your example)?
Edit:
You might also want to check out RandomAccessFile, which can be helpful if you know how many bytes per line, as Jon Skeet has said.
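A sketch of the fixed-width case Jon Skeet describes: if every line is exactly the same number of bytes, the offset of line n is just arithmetic. The record length here is hypothetical:

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedWidthLine {
    // Assumes every line is exactly RECORD_LEN bytes, newline included.
    static final int RECORD_LEN = 81; // hypothetical: 80 data bytes + '\n'

    public static String getLine(String path, long n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(n * RECORD_LEN); // jump straight to line n, no scanning
            return raf.readLine();
        }
    }
}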
The easiest way to do this would be to use a BufferedReader (http://docs.oracle.com/javase/1.5.0/docs/api/java/io/BufferedReader.html), because you can specify your buffer size. You could do something like:
BufferedReader in = new BufferedReader(new FileReader("foo.in"), 1024);
in.readLine();
in.readLine();
in.readLine();
in.readLine();
String line = in.readLine();
1) read a line which the user selects,
If you only need to read a user-selected line once or infrequently (or if the file is small enough), then you just have to read the file line-by-line from the start, until you get to the selected line.
If, on the other hand you need to read a user-selected line frequently, you should build an index of line numbers and offsets. So, for example, line 42 corresponds to an offset of 2347 bytes into the file. Ideally, then, you would only read the entire file once and store the index--for example, in a map, using the line numbers as keys and offsets as values.
2) read new lines added since the last read. I plan to read the file every 10 seconds. I have got the line count and can find out the new line numbers, but I need to read those lines.
For the second point, you can simply save the current offset into the file instead of saving the current line number--but it certainly wouldn't hurt to continue building the index if it will continue to provide a significant performance benefit.
1) Use RandomAccessFile.seek(long offset) to set the file pointer to the most recently saved offset (first confirm the file is longer than that offset; if not, nothing new has been appended).
2) Use RandomAccessFile.readLine() to read a line of the file.
3) Call RandomAccessFile.getFilePointer() to get the current offset after reading the line, and optionally put(currLineNo+1, offset) into the index.
4) Repeat steps 2-3 until reaching the end of the file.
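A minimal sketch of that loop. The field names and the in-memory map are illustrative; RandomAccessFile.readLine() returns null at end of file:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class TailReader {
    private long lastOffset = 0;  // offset saved from the previous pass
    private int nextLineNo = 0;   // number of the next unread line
    private final Map<Integer, Long> index = new HashMap<>(); // line number -> byte offset

    public void readNewLines(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (raf.length() <= lastOffset) {
                return; // nothing new has been appended
            }
            raf.seek(lastOffset);                      // step 1: skip what we've seen
            String line;
            while ((line = raf.readLine()) != null) {  // step 2: read a new line
                index.put(nextLineNo++, lastOffset);
                lastOffset = raf.getFilePointer();     // step 3: remember the new offset
                System.out.println(line);              // process the new line here
            }
        }
    }
}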
But don't get too carried away with performance optimizations unless the performance is already a problem or is highly likely to be a problem.
For small files:
String line = Files.readAllLines(Paths.get("file.txt")).get(n);
For large files:
String line;
try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
line = lines.skip(n).findFirst().get();
}
Java 7:
String line;
try (BufferedReader br = new BufferedReader(new FileReader("file.txt"))) {
for (int i = 0; i < n; i++)
br.readLine();
line = br.readLine();
}
Source: Reading nth line from file
The only way to do this is to build an index of where each line is (you only need to record the end of each line). Without a way to randomly access a line based on an index from the start, you have to read every byte before that line.
BTW: Reading 100,000 lines might take only one second on a fast machine.
If performance is a big concern here and you are frequently reading random lines from a static file then you can optimize this a bit by reading through the file and building an index (basically just a long[]) of the starting offset of each line of the file.
Once you have this you know exactly where to jump to in the file and then you can read up to the next newline character to retrieve the full line.
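A minimal sketch of that optimization (file name and target line are placeholders): one pass builds the offset index, after which any line is a single seek away:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("big.txt", "r")) {
            // One pass: record the starting offset of every line.
            List<Long> offsets = new ArrayList<>();
            offsets.add(0L);
            while (raf.readLine() != null) {
                offsets.add(raf.getFilePointer());
            }
            // Random access: jump straight to line n (0-based) and read it.
            int n = 5;
            raf.seek(offsets.get(n));
            System.out.println(raf.readLine());
        }
    }
}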
Here is a snippet of some code I had which reads a file and writes every 10th line, including the first, to a new file (writer). You can always replace the try section with whatever you want to do. To change which lines are written, just change the "0" in the if statement lc.endsWith("0") to whichever ending you want to match. But if the file is being written to as you read it, this code will only work with the data contained in the file when you run it.
LineNumberReader lnr = new LineNumberReader(new FileReader(new File(file)));
lnr.skip(Long.MAX_VALUE);
int linecount = lnr.getLineNumber();
lnr.close();
// bufferedReader and writer were opened elsewhere in the original; declared
// here so the snippet compiles (outFile is a hypothetical destination path)
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
Writer writer = new FileWriter(outFile);
for (int i = 0; i <= linecount; i++) {
    // read lines
    String line = bufferedReader.readLine();
    String lc = String.valueOf(i);
    if (lc.endsWith("0")) {
        try {
            writer.append(line + "\n");
            writer.flush();
        } catch (Exception ee) {
        }
    }
}
bufferedReader.close();
writer.close();

