I have a program that loads a large file into memory, line by line, into an array: one line per index. Each line of data then needs to be processed. I also have a static AtomicInteger in the main class. I create multiple worker threads, and each worker thread gets the data it needs by calling MainClass.array[MainClass.atomicint.getAndIncrement()].
This works, but now that I am starting to use larger files, I am getting out-of-memory errors. How can I do this so that I don't run out of memory?
You could have one thread adding lines to an ArrayBlockingQueue. Because the queue is bounded, it can never grow too large, which avoids an OOME. You can then have a pool of threads reading from this queue to get the next task.
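A minimal sketch of that producer/consumer setup (the queue capacity, the POISON sentinel, and the process method are placeholders, not part of the original code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedLineFeed {

    // Distinct instance used as an end-of-input marker; compared by reference.
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws Exception {
        // Bounded queue: put() blocks when 10,000 lines are waiting, so the
        // reader can never run far ahead of the workers and exhaust the heap.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        process(line);
                    }
                    queue.put(POISON); // hand the marker on to the next worker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                queue.put(line); // blocks while the queue is full
            }
        }
        queue.put(POISON);
        pool.shutdown();
    }

    private static void process(String line) {
        // placeholder for whatever "using" a line means in your program
    }
}

Each worker re-queues the sentinel after taking it, so all of them eventually see it and exit.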
Don't load the whole file into memory. Load it line by line as needed.
Have one class that reads the lines of the file: not all of them at once, but one line at a time.
Then make your threads call this class, asking for new lines. When the class returns null, there are no more lines to be processed.
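A minimal sketch of what such a class could look like (the name LineSource is made up; readLine() already returns null at end of file, so that contract comes for free):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hands out one line per call; returns null once the file is exhausted.
// synchronized so that multiple worker threads can share a single instance.
public class LineSource implements AutoCloseable {

    private final BufferedReader reader;

    public LineSource(String path) throws IOException {
        this.reader = new BufferedReader(new FileReader(path));
    }

    public synchronized String nextLine() throws IOException {
        return reader.readLine(); // null at end of file
    }

    @Override
    public synchronized void close() throws IOException {
        reader.close();
    }
}

Each worker then simply loops: while ((line = source.nextLine()) != null) { ... }.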
Generally, loading a whole file into memory is bad practice, because your memory requirement may be arbitrarily large, especially when the file is provided by the user. What you want is to read and process it in chunks, e.g.:
try (BufferedReader br = new BufferedReader(new FileReader("somefile"))) {
    String line = br.readLine();
    while (line != null) {
        process(line);
        line = br.readLine();
    }
}
Good morning,
I am programming a tool to get some information from log files. The problem is that the files can be 100 MB, 300 MB ... 900 MB.
When I run my application on a 100-200 MB log the program works fine, but when I use a 500 MB or larger log the application throws a heap space error.
Could the problem be this line?
List<String> resultado = Files.readAllLines(Paths.get(ruta.getText()));
I think that this object consumes a lot of memory and is the cause of my problem.
Is there a better way to read the file line by line and store it in a List while consuming less memory?
Thank you very much
Is there a better way to read the file line by line and store it in a List while consuming less memory?
Not really, no.
The problem is the "store it in a list" aspect. That is the fundamental reason why you are using lots of memory. Any time you hold an entire file in memory in String form, you are going to use lots of memory¹.
The solution is NOT to store the entire file in memory at the same time. Instead, read it line by line, process each line as you read it, and then discard the lines.
For example, instead of this:
List<String> lines = Files.readAllLines(somePath);
for (String line : lines) {
    // process line
}
... do this:
try (BufferedReader br = ...) {
    String line;
    while ((line = br.readLine()) != null) {
        // process line
    }
}
¹ You may save a bit of space by storing the file contents as bytes, but with recent JVMs and their more space-efficient String representations, the saving won't be great enough to make a lot of difference.
I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON.
It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each.
I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide mapping with the CSV headers.
Currently the API controller is calling a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine, but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines.
To improve processing speed, I tried to implement multithreading with Callables, but I am not familiar with that kind of concept, so the implementation might be wrong.
Other than that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callables are responsible for consuming a large amount of heap memory.
So I actually have several questions:
#1. How could I improve the speed of the CSV reading?
#2. Is the multithread implementation with Callable correct?
#3. How could I reduce the amount of heap memory used in the process?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here is the CSV method:
public static final int NUMBER_OF_THREADS = 10;

public static List<List<String>> readCsv(InputStream inputStream) {
    List<List<String>> rowList = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
    List<Future<List<String>>> listOfFutures = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            CallableLineReader callableLineReader = new CallableLineReader(line);
            Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
            listOfFutures.add(futureCounterResult);
        }
        reader.close();
        pool.shutdown();
    } catch (Exception e) {
        // log "Error reading csv file"
    }
    for (Future<List<String>> future : listOfFutures) {
        try {
            List<String> row = future.get();
            rowList.add(row); // collect each parsed row for the caller
        } catch (ExecutionException | InterruptedException e) {
            // log "Error: CSV processing interrupted during execution"
        }
    }
    return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {

    private final String line;

    public CallableLineReader(String line) {
        this.line = line;
    }

    @Override
    public List<String> call() throws Exception {
        return Arrays.asList(line.replace("\"", "").split(","));
    }
}
I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The JakartaEE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
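For illustration, here is a rough sketch of what that streaming JSON output could look like, assuming the jakarta.json (JSON-P) API is on the classpath; writing each row as it arrives means no more than one row needs to be held in memory at a time:

import jakarta.json.Json;
import jakarta.json.stream.JsonGenerator;
import java.io.OutputStream;
import java.util.List;
import java.util.stream.Stream;

// Writes each CSV row as a JSON array of strings, one row at a time.
public static void writeCsvAsJson(Stream<List<String>> rows, OutputStream out) {
    try (JsonGenerator gen = Json.createGenerator(out)) {
        gen.writeStartArray();            // outer array of rows
        rows.forEach(row -> {
            gen.writeStartArray();        // one inner array per row
            row.forEach(gen::write);      // each field as a JSON string
            gen.writeEnd();
        });
        gen.writeEnd();
    }
}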
Instead of trying out a different approach, try to run with a profiler first and see where time is actually being spent. And use this information to change the approach.
Async-profiler is a very solid profiler (and free!) and will give you a very good impression of where time is being spent. It will also show the time spent on garbage collection, so you can easily see the ratio of CPU utilization caused by garbage collection. It also supports allocation profiling, to figure out which objects are being created (and where).
For a tutorial, see the following link.
Try using Spring Batch and see if it helps your scenario.
Ref : https://howtodoinjava.com/spring-batch/flatfileitemreader-read-csv-example/
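For a rough idea of the shape of such a reader, here is a sketch along the lines of that reference (the file path is made up, and using PassThroughFieldSetMapper to get raw FieldSet records is one way, among others, to cope with headers that vary between files):

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.core.io.FileSystemResource;

public class CsvStreamingRead {
    public static void main(String[] args) throws Exception {
        // PassThroughFieldSetMapper hands back the raw FieldSet, so no
        // per-file mapping class is needed when headers differ between files.
        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource("input.csv")); // hypothetical path
        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(new DelimitedLineTokenizer());
        lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());
        reader.setLineMapper(lineMapper);

        reader.open(new ExecutionContext());
        FieldSet row;
        while ((row = reader.read()) != null) {
            // process one record at a time; the whole file is never in memory
        }
        reader.close();
    }
}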
I have extracted my tables from my database into JSON files. Now I want to read these files and remove all the double quotes in them. It seemed easy, but I have tried hundreds of solutions, and some led me to out-of-memory problems. I'm dealing with files larger than 1 GB. The code you will find below behaves strangely, and I don't understand why it returns empty files.
public void replaceDoubleQuotes(String fileName) {
    log.debug(" start formatting " + fileName + " ...");
    File firstFile = new File("C:/sqlite/db/tables/" + fileName);
    String oldContent = "";
    String newContent = "";
    BufferedReader reader = null;
    BufferedWriter writer = null;
    FileWriter writerFile = null;
    String stringQuotes = "\\\\\\\\\"";
    try {
        reader = new BufferedReader(new FileReader(firstFile));
        writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
        writer = new BufferedWriter(writerFile);
        while ((oldContent = reader.readLine()) != null) {
            newContent = oldContent.replaceAll(stringQuotes, "");
            writer.write(newContent);
        }
        writer.flush();
        writer.close();
    } catch (Exception e) {
        log.error(e);
    }
}
And when I try to use FileWriter(path, true) to write at the end of the file, the program never stops growing the file until the hard disk is full. Thanks for your help.
PS: I also tried to use substring and append the new content, writing the substring after the while loop, but that doesn't work either.
TL;DR:
Do not read and write the same file concurrently.
The issue
Your code starts reading, and then immediately truncates the file it is reading.
reader = new BufferedReader(new FileReader(firstFile));
writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
writer = new BufferedWriter(writerFile);
The first line opens a read handle to the file.
The second line opens a write handle to the same file.
It is not very clear from the documentation of the FileWriter constructor, but when you do not use a constructor that allows you to specify the append parameter, its value defaults to false, meaning you immediately truncate the file if it already exists.
At this point (the second line) you have just erased the file you were about to read. So you end up with an empty file.
What about using append=true
Well, then the file is not truncated when the writer is opened, which is "good". So your program starts reading the first line, and outputs (to the same file) the filtered version.
So each time a line is read, another is appended.
No wonder your program never reaches the end of the file: each time it advances a line, it appends another line to process. Generally speaking, you'll never reach the end of the file (of course, if the file is a single line to begin with, you might, but that's a corner case).
The solution
Write to a temporary file, and IF (and only if) you succeed, swap the files if you really need to.
An advantage of this solution: if for whatever reason your process crashes, you'll have the original file untouched and you can retry later, which is usually a good thing. Your process is "repeatable".
A disadvantage: you'll need twice the space at some point. (Although you could compress the temp file to reduce this factor.)
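A minimal sketch of that pattern, assuming java.nio.file is available (the .tmp suffix and the simple quote-stripping replacement are placeholders for your actual logic):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class QuoteStripper {
    public static void replaceDoubleQuotes(String fileName) throws IOException {
        Path original = Paths.get("C:/sqlite/db/tables/" + fileName);
        Path temp = Paths.get(original.toString() + ".tmp"); // hypothetical temp location
        // try-with-resources closes both handles even if something fails mid-way
        try (BufferedReader reader = Files.newBufferedReader(original);
             BufferedWriter writer = Files.newBufferedWriter(temp)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.replace("\"", "")); // adapt to the exact pattern you strip
                writer.newLine();
            }
        }
        // We only get here on success, so swapping the files is now safe.
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
    }
}

Files.move can also be passed StandardCopyOption.ATOMIC_MOVE, though whether an atomic swap is supported depends on the file system.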
About out of memory issues
When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only hold one line's worth of memory at a time.
Therefore it generally avoids memory usage issues (unless, of course, you have a file without line breaks, in which case it makes no difference at all).
Other solutions, involving reading the whole file at once, then performing the search/replace in memory, then writing the contents back do not scale that well, so it's good you avoided this kind of computation.
Not related but important
Check out the try-with-resources syntax to properly close your resources (reader/writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is, in a finally clause).
Another thing: I'm pretty sure no Java program written by a mere mortal will beat tools like sed or awk, which are available on most Unix platforms (and some more). Maybe you'd want to check whether rolling your own in Java is worth what is a shell one-liner.
@GPI already provided a great answer on why reading and writing concurrently is causing the issue you're experiencing. It is also worth noting that reading 1 GB of data into the heap at once can definitely cause an OutOfMemoryError if not enough heap is allocated, which is likely. To solve this problem you could use an InputStream and read chunks of the file at a time, then write to another file until the process is complete, and ultimately replace the existing file with the modified one and delete the temporary one. With this approach you could even use a ForkJoinTask to help, since it's such a large job.
Side note: there may be a better solution than the create-a-new-file, write-to-the-new-file, replace-the-existing-one, delete-the-temporary-file sequence.
I'm writing a Java program that does the following:
1. Reads in a line from a file
2. Does some action based on that line
3. Deletes the line (or replaces it with ""), and if step 2 is not successful, writes it to a new file
4. Continues on to the next line, for all lines in the file (as opposed to removing an arbitrary line)
Currently I have:
try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
    String line;
    while ((line = br.readLine()) != null) {
        try {
            if (!do_stuff(line)) { // do_stuff returns bool based on success
                write_non_success(line);
            }
        } catch (Exception e) {
            e.printStackTrace(); // eat the exception for now, do something in the future
        }
    }
}
Obviously I'm going to need something other than a BufferedReader for this, as it can't write, but what class should I use? Also, read order doesn't matter.
This differs from this question because I want to remove all lines, as opposed to an arbitrary line number as the other OP wants, and if possible I'd like to avoid writing the temp file after every line, as my files are approximately 1 million lines.
If you do everything according to the algorithm that you describe, the content left in the original file would be the same as the content of "new file" from step #3:
If a line is processed successfully, it gets removed from the original file
If a line is not processed successfully, it gets added to the new file, and it also stays in the original file.
It is easy to see why at the end of this process the original file is the same as the "new file". All you need to do is to carry out your algorithm to the end, and then copy the new file in place of the original.
If your concern is that the process is going to crash in the middle, the situation becomes very different: now you have to write out the current state of the original file after processing each line, without writing over the original until you are sure that it is going to be in a consistent state. You can do it by reading all lines into a list, deleting the first line from the list once it has been processed, writing the content of the entire list into a temporary file, and copying it in place of the original. Obviously, this is very expensive, so it shouldn't be attempted in a tight loop. However, this approach ensures that the original file is not left in an inconsistent state, which is important when you are looking to avoid doing the same work multiple times.
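A sketch of the simple variant described above (do_stuff mirrors the placeholder from the question; the .failed suffix is made up):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class PruneFile {
    public static void pruneAll(Path inputFile) throws IOException {
        Path failures = Paths.get(inputFile.toString() + ".failed"); // hypothetical temp file
        try (BufferedReader br = Files.newBufferedReader(inputFile);
             BufferedWriter bw = Files.newBufferedWriter(failures)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (!do_stuff(line)) { // unsuccessful lines are kept
                    bw.write(line);
                    bw.newLine();
                } // successful lines are simply not written, i.e. "deleted"
            }
        }
        // The failures file is exactly what the original should now contain.
        Files.move(failures, inputFile, StandardCopyOption.REPLACE_EXISTING);
    }

    private static boolean do_stuff(String line) {
        return true; // placeholder: the question's per-line action
    }
}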
I'm reading a large TSV file (~40 GB) and trying to prune it by reading it line by line and printing only certain lines to a new file. However, I keep getting the following exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
at java.lang.StringBuffer.append(StringBuffer.java:323)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at java.io.BufferedReader.readLine(BufferedReader.java:379)
Below is the main part of the code. I specified the buffer size to be 8192 just in case. Doesn't Java clear the buffer once the buffer size limit is reached? I don't see what may cause the large memory usage here. I tried to increase the heap size but it didn't make any difference (machine with 4GB RAM). I also tried flushing the output file every X lines but it didn't help either. I'm thinking maybe I need to make calls to the GC but it doesn't sound right.
Any thoughts? Thanks a lot.
BTW - I know I should call trim() only once, store it, and then use it.
Set<String> set = new HashSet<String>();
set.add("A-B");
...
...

static public void main(String[] args) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
    PrintStream output = new PrintStream(outputFile, "UTF-8");
    String line = reader.readLine();
    while (line != null) {
        String[] fields = line.split("\t");
        if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
            output.println(fields[0].trim() + "-" + fields[1].trim());
        line = reader.readLine();
    }
    output.close();
}
Most likely, what's going on is that the file does not have line terminators, so the reader just keeps growing its StringBuffer unbounded until it runs out of memory.
The solution would be to read a fixed number of characters at a time, using the reader's read method, and then look for newlines (or other parsing tokens) within the smaller buffers.
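A sketch of that idea (the buffer size and the one-million-character cap are arbitrary; handleRecord stands in for the question's filtering logic):

import java.io.IOException;
import java.io.Reader;

public class ChunkedScanner {
    // Reads fixed-size chunks and splits on '\n' manually, so a file with no
    // line terminators can never force readLine() to buffer it all at once.
    static void processChunks(Reader reader) throws IOException {
        char[] buf = new char[8192];
        StringBuilder current = new StringBuilder();
        int n;
        while ((n = reader.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    handleRecord(current.toString());
                    current.setLength(0);
                } else if (current.length() < 1_000_000) {
                    current.append(buf[i]); // cap a single record's size as a safety net
                }
            }
        }
        if (current.length() > 0) {
            handleRecord(current.toString()); // last record without a trailing newline
        }
    }

    private static void handleRecord(String record) {
        // placeholder for the per-line filtering from the question
    }
}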
Are you certain the "lines" in the file are separated by newlines?
I have 3 theories:
The input file is not UTF-8 but some indeterminate binary format that results in extremely long lines when read as UTF-8.
The file contains some extremely long "lines" ... or no line breaks at all.
Something else is happening in code that you are not showing us; e.g. you are adding new elements to set.
To help diagnose this:
Use some tool like od (on UNIX / LINUX) to confirm that the input file really contains valid line terminators; i.e. CR, NL, or CR NL.
Use some tool to check that the file is valid UTF-8.
Add a static line counter to your code, and when the application blows up with an OOME, print out the value of the line counter.
Keep track of the longest line seen so far, and print that out as well when you get an OOME.
For the record, your slightly suboptimal use of trim will have no bearing on this issue.
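A sketch of that instrumentation applied to a read loop like the one in the question:

import java.io.BufferedReader;
import java.io.IOException;

public class LineDiagnostics {
    static void diagnose(BufferedReader reader) throws IOException {
        long lineCount = 0;
        int longest = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lineCount++;
            if (line.length() > longest) {
                longest = line.length();
                // If the OOME strikes, the last message printed points at the culprit.
                System.err.println("line " + lineCount + ": new longest = " + longest + " chars");
            }
        }
        System.err.println("total lines: " + lineCount + ", longest: " + longest);
    }
}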
One possibility is that you are running out of heap space during a garbage collection. The Hotspot JVM uses a parallel collector by default, which means that your application can possibly allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by rapidly allocating and discarding.
You can try instead using the old (pre-1.5) serial collector with the option -XX:+UseSerialGC. There are several other "extended" options that you can use to tune collection.
You might want to try moving the String[] fields declaration out of the loop, as you are creating a new array in every iteration. You can just reuse the old one, right?