Modify content of large file - java

I have extracted my tables from my database into JSON files. Now I want to read these files and remove all double quotes from them. It seems easy, and I have tried hundreds of solutions, but some of them lead to out-of-memory problems. I'm dealing with files that are more than 1 GB in size. The code below behaves strangely, and I don't understand why it returns empty files:
public void replaceDoubleQuotes(String fileName){
log.debug(" start formatting " + fileName + " ...");
File firstFile = new File ("C:/sqlite/db/tables/" + fileName);
String oldContent = "";
String newContent = "";
BufferedReader reader = null;
BufferedWriter writer = null;
FileWriter writerFile = null;
String stringQuotes = "\\\\\\\\\"";
try {
reader = new BufferedReader(new FileReader(firstFile));
writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
writer = new BufferedWriter(writerFile);
while (( oldContent = reader.readLine()) != null ){
newContent = oldContent.replaceAll(stringQuotes, "");
writer.write(newContent);
}
writer.flush();
writer.close();
} catch (Exception e) {
log.error(e);
}
}
When I try to use FileWriter(path, true) to write at the end of the file, the program keeps increasing the file size until the hard disk is full. Thanks for your help.
PS: I also tried to use substring and append the new content, then write the result after the while loop, but that doesn't work either.

TL;DR
Do not read and write the same file concurrently.
The issue
Your code starts reading, and then immediately truncates the file it is reading.
reader = new BufferedReader(new FileReader(firstFile));
writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
writer = new BufferedWriter(writerFile);
The first line opens a read handle to the file.
The second line opens a write handle to the same file.
It is not very clear if you look at the documentation of FileWriter constructor, but when you do not use a constructor that allows you to specify the append parameter, then the value is false by default, meaning, you immediately truncate the file if it already exists.
At this point (line 2) you have just erased the file you were about to read. So you end up with an empty file.
What about using append=true
Well, then the file is not erased when it is opened for writing, which is "good". So your program starts reading the first line, and outputs (to the same file) the filtered version.
So each time a line is read, another is appended.
No wonder your program will never reach the end of the file : each time it advances a line, it creates another line to process. Generally speaking, you'll never reach end of file (well of course if the file is a single line to begin with, you might but that's a corner case).
The solution
Write to a temporary file, and IF (and only IF) you succeed, then swap the files if you really need to.
An advantage of this solution: if for whatever reason your process crashes, you'll have the original file untouched and you can retry later, which is usually a good thing. Your process is "repeatable".
A disadvantage : you'll need twice the space at some point. (Although you could compress the temp file and reduce this factor but still).
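A minimal sketch of the temporary-file approach described above, assuming UTF-8 files. The class and method names (QuoteStripper, stripDoubleQuotes) are made up for illustration. Since readLine() drops the line terminator, the sketch writes one back after each line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class QuoteStripper {

    // Copies source to a temporary sibling file without double quotes,
    // then replaces source with it only if the whole copy succeeded.
    // Names and the UTF-8 charset are assumptions for this sketch.
    static void stripDoubleQuotes(Path source) throws IOException {
        Path dir = source.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "strip", ".tmp");
        try {
            try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line.replace("\"", ""));
                    writer.newLine(); // readLine() drops the line terminator
                }
            }
            // Both streams are closed here, so the swap is safe to attempt.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

Creating the temporary file in the same directory as the original keeps both on the same filesystem, which makes the final move more likely to be a cheap rename.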
About out of memory issues
When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only use one line-worth of memory at a time.
Therefore it generally avoids memory usage issues (unless of course, you have a file without line breaks, in which case it makes no difference at all).
Other solutions, involving reading the whole file at once, then performing the search/replace in memory, then writing the contents back do not scale that well, so it's good you avoided this kind of computation.
Not related but important
Check out the try with resources syntax to properly close your resources (reader / writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is : in a finally clause).
Another thing : I'm pretty sure no java program written by a mere mortal will beat tools like sed or awk that are available on most unix platforms (and some more). Maybe you'd want to check if rolling your own in java is worth what is a shell one-liner.

@GPI already provided a great answer on why reading and writing concurrently is causing the issue you're experiencing. It is also worth noting that reading 1 GB of data into the heap at once can definitely cause an OutOfMemoryError if not enough heap is allocated, which is likely. To solve this problem you could use an InputStream and read chunks of the file at a time, then write to another file until the process is completed, and ultimately replace the existing file with the modified one and delete the temporary copy. With this approach you could even use a ForkJoinTask to help, since it's such a large job.
Side note:
There may be a better solution than create new file, write to new file, replace existing, delete new file.
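The chunked reading suggested in this answer could look roughly like the sketch below. It uses a Reader rather than a raw InputStream so that it works on characters, and the names ChunkedQuoteFilter and copyWithoutQuotes are hypothetical. Because it never looks for line breaks, memory use stays bounded even for a file that is one giant line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

class ChunkedQuoteFilter {

    // Copies in to out in fixed-size chunks, dropping every double quote.
    // The 8192-char chunk size is an arbitrary choice for this sketch.
    static void copyWithoutQuotes(Reader in, Writer out) throws IOException {
        char[] buffer = new char[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            int kept = 0;
            // Compact the chunk in place, skipping the quote characters.
            for (int i = 0; i < read; i++) {
                if (buffer[i] != '"') {
                    buffer[kept++] = buffer[i];
                }
            }
            out.write(buffer, 0, kept);
        }
    }
}
```

The reader and writer would still point at two different files, for example the original and a temporary file, with the swap done only after both are closed.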

Related

How to delete all lines from a file one-by-one after reading the line?

I'm writing a java program that does the following:
Reads in a line from a file
Does some action based on that line
Delete the line (or replace it with ""), and if 2 is not successful, write it to a new file
Continue on to the next line for all lines in file (as opposed to removing an arbitrary line)
Currently I have:
try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
String line;
while ((line = br.readLine()) != null) {
try {
if (!do_stuff(line)){ //do_stuff returns bool based on success
write_non_success(line);
}
} catch (Exception e) {
e.printStackTrace(); //eat the exception for now, do something in the future
}
}
}
Obviously I'm going to need to not use a BufferedReader for this, as it can't write, but what class should I use? Also, read order doesn't matter
This differs from this question because I want to remove all lines, as opposed to an arbitrary line number as the other OP wants, and if possible I'd like to avoid writing the temp file after every line, as my files are approximately 1 million lines
If you do everything according to the algorithm that you describe, the content left in the original file would be the same as the content of "new file" from step #3:
If a line is processed successfully, it gets removed from the original file
If a line is not processed successfully, it gets added to the new file, and it also stays in the original file.
It is easy to see why at the end of this process the original file is the same as the "new file". All you need to do is to carry out your algorithm to the end, and then copy the new file in place of the original.
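A rough sketch of that idea: run through the whole file once, collect the failed lines in a new file, and only then put the new file in place of the original. The names LineQueue and processAll are invented here, and the Predicate stands in for the asker's do_stuff.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Predicate;

class LineQueue {

    // Runs action on every line of input. Lines for which it returns false
    // (or throws) are kept; afterwards input holds only those failed lines.
    static void processAll(Path input, Predicate<String> action) throws IOException {
        Path failed = Files.createTempFile(input.toAbsolutePath().getParent(), "failed", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(failed, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                boolean ok;
                try {
                    ok = action.test(line);
                } catch (RuntimeException e) {
                    ok = false; // assumption: an exception counts as a failed line
                }
                if (!ok) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
        // Only reached if the whole pass completed without an I/O error.
        Files.move(failed, input, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

If the pass is interrupted by an I/O error, the original file is left as it was and the partial temporary file remains next to it.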
If your concern is that the process is going to crash in the middle, the situation becomes very different: now you have to write out the current state of the original file after processing each line, without writing over the original until you are sure that it is going to be in a consistent state. You can do it by reading all lines into a list, deleting the first line from the list once it has been processed, writing the content of the entire list into a temporary file, and copying it in place of the original. Obviously, this is very expensive, so it shouldn't be attempted in a tight loop. However, this approach ensures that the original file is not left in an inconsistent state, which is important when you are looking to avoid doing the same work multiple times.

Java: What's the most efficient way to read relatively large txt files and store its data?

I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
while((line = br.readLine()) != null) {
seq += line;
}
br.close();
}
catch(FileNotFoundException e) { e.printStackTrace(); }
catch(IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?
One thing I notice is that you are doing seq += line. seq is probably a String? If so, remember that strings are immutable, so you are in fact creating a new String each time you append a line to it. Please use StringBuilder instead. Also, if possible, you don't want to build a string and then process it, because that way you go over the data twice. Ideally you want to process as you read, but I don't know your situation.
The main element slowing your progress is the "concatenation" of the String seq and line when you call seq += line. I use quotes for concatenation because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small. However, once the Strings become very long, e.g. hundreds of thousands of characters, memory must be allocated for a new String and the whole existing content copied into it on every append, which takes quite a bit of time. StringBuilder appends to an internal buffer that only grows occasionally, meaning you will not be creating a new object every single time.
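As a sketch of the StringBuilder suggestion, the reading loop from the question could be rewritten along these lines. The class and method names are made up; the method takes a Reader so it works with a FileReader as in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

class SequenceReader {

    // Reads all lines from source and joins them without line breaks,
    // appending to one growing buffer instead of re-creating a String.
    static String readSequence(Reader source) throws IOException {
        StringBuilder seq = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                seq.append(line);
            }
        }
        return seq.toString();
    }
}
```

It would be called as readSequence(new FileReader(file)). If the approximate file size is known, passing it to the StringBuilder constructor as the initial capacity avoids some intermediate buffer growth.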
Your problem is not that the reading takes too much time, but that the concatenating takes too much time. Just to verify this I ran your code (it didn't finish) and then simply commented out line 8 (seq += line), and it ran in under a second. You could try using seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it didn't run in under 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice). The ArrayList example ran in about 2-3 seconds with the same input file (so the content of your while loop would be list.add(line);). If you really, really want to store your entire file in a string you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
^^ This works in a matter of seconds as well. I should mention that "\Z" matches the end of the input, which is why it reads the whole thing in one swoop.

Delete specific contents of file using Regex Expression in Java

Consider that I have a data file storing rules in the following format:
//some header info
//more header info
//Rule: some_uuid_1234
rule "name"
data
data
data
end
//Rule: some_uuid_5678
rule "name2"
data
data
data
end
Now, what I would like is to be able to either read(id) or delete(id) a rule given the ID number. My question therefore is, how could I select and delete a rule (perhaps using a regex expression), and then delete this specific rule from the file, without altering anything else.
Simply replace <some_id> in your select/delete function with the actual true ID number.
//Rule: <some_id>.+?rule.+?end
NOTE: Don't forget the SingleLine option (Pattern.DOTALL in Java), so that . also matches line breaks.
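In Java, applying that pattern to a file already loaded into a String might look like the sketch below. RuleRegex and deleteRule are hypothetical names, and the trailing \R? (which swallows the line break after end) is an addition not present in the pattern above.

```java
import java.util.regex.Pattern;

class RuleRegex {

    // Returns content with the first rule block for the given id removed.
    // Pattern.quote guards against regex metacharacters in the id.
    static String deleteRule(String content, String id) {
        Pattern rule = Pattern.compile(
                "//Rule: " + Pattern.quote(id) + ".+?rule.+?end\\R?",
                Pattern.DOTALL); // DOTALL lets . match line breaks
        return rule.matcher(content).replaceFirst("");
    }
}
```

Note that the lazy .+?end stops at the first occurrence of the letters "end" anywhere after the rule keyword, so a data line containing that text would cut the match short.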
There are 2 solutions I can think of and they have varied performance, so you can choose the one that suits you best.
Index the file
You could write an inverted index for this rule file and keep it updated for any operation that modifies the file. Of course your word index will be limited to one file and the only words in it will be the unique UUIDs. You can use a RandomAccessFile to quickly read() from a given offset. The delete() operation can overwrite the target rule until it encounters the word 'end'. This solution requires more work, but you can retrieve values instantly.
Use a regex
You can alternatively read each line in the file and match it with a regex pattern that matches your rule UUID. Keep reading until you hit the 'end' of the rule and return it. A delete will involve over-writing the rule once you know the desired index. This solution is easy to write but the performance will suck. There is a lot of IO and it could become a bottleneck. (You could also load the entire file into memory and run a regex on the whole string, depending on how large the file / string is expected to be. This can get ugly real quick though.)
Whichever solution you choose you might also want to think about file level locks and how that affects CRUD operations. If this design has not been implemented yet, please consider moving the rules to a database.
I wouldn't use regular expressions to solve this particular problem - it would require loading the whole file in memory, processing it and rewriting it. That's not inherently bad, but if you have large enough files, a stream-based solution is probably better.
What you'd do is process your input file one line at a time and maintain a boolean value that:
becomes true when you find a line that matches the desired rule's declaration header.
becomes false when it's true and you find a line that matches end.
Discard all lines encountered while your boolean is set to true, write all other ones to a temporary output file (created, for example, with File#createTempFile).
For each line, if your boolean value is true, ignore it. Otherwise, write it to a temporary output file.
At the end of the process, overwrite your input file with your temporary output file using File#renameTo. Check its return value: the rename can fail, for example when the temporary directory and the input file are on different filesystems.
Note that this solution has the added advantage of never leaving the input file half-written: should an error occur in the middle of processing, the input file is still untouched. It will either be replaced entirely or not at all, which protects you against unexpected IOExceptions.
The following code demonstrates how you could implement that. It's not necessarily a perfect implementation, but it should illustrate the algorithm - lost somewhere in the middle of all that boilerplate code.
public void deleteFrom(String id, File file) throws IOException {
BufferedReader reader;
String line;
boolean inRule;
File temp;
PrintWriter writer;
reader = null;
writer = null;
try {
// Streams initialisation.
temp = File.createTempFile("delete", "rule");
writer = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp), "utf-8")));
reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "utf-8"));
inRule = false;
// For each line in the file...
while((line = reader.readLine()) != null) {
// If we're parsing the rule to delete, we're only interested in knowing when we're done.
if(inRule) {
if(line.trim().equals("end"))
inRule = false;
}
// Otherwise, look for the beginning of the targeted rule: the "//Rule: <id>" header line.
else if(line.trim().equals("//Rule: " + id))
inRule = true;
// Normal line, we want to keep it.
else
writer.println(line);
}
}
// Stream cleanup.
finally {
if(reader != null)
reader.close();
if(writer != null)
writer.close();
}
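// We're done, move the new file over the old one. renameTo reports failure through its return value.
if(!temp.renameTo(file))
throw new IOException("Could not replace " + file + " with " + temp);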
}

getting Java OutOfMemoryError: Java heap space error that I can't debug

I am struggling to figure out what's causing this OutofMemory Error. Making more memory available isn't the solution, because my system doesn't have enough memory. Instead I have to figure out a way of re-writing my code.
I've simplified my code to try to isolate the error. Please take a look at the following:
File[] files = new File(args[0]).listFiles();
int filecnt = 0;
LinkedList<String> urls = new LinkedList<String>();
for (File f : files) {
if (filecnt > 10) {
System.exit(1);
}
System.out.println("Doing File " + filecnt + " of " + files.length + " :" + f.getName());
filecnt++;
FileReader inputStream = null;
StringBuilder builder = new StringBuilder();
try {
inputStream = new FileReader(f);
int c;
char d;
while ((c = inputStream.read()) != -1) {
d = (char)c;
builder.append(d);
}
}
finally {
if (inputStream != null) {
inputStream.close();
}
}
inputStream.close();
String mystring = builder.toString();
String temp[] = mystring.split("\\|NEWandrewLINE\\|");
for (String s : temp) {
String temp2[] = s.split("\\|NEWandrewTAB\\|");
if (temp2.length == 22) {
urls.add(temp2[7].trim());
}
}
}
I know this code is probably pretty confusing :) I have loads of text files in the directory that is specified in args[0]. These text files were created by me. I used |NEWandrewLINE| to indicate a new row in the text file, and |NEWandrewTAB| to indicate a new column. In this code snippet, I am trying to access the URL of each stored row (which is in the 8th column of each row). So, I read in the whole text file. String split on |NEWandrewLINE| and then string split again on the substrings on |NEWandrewTAB|. I add the URL to the LinkedList (called "urls") with the line: urls.add(temp2[7].trim())
Now, the output of running this code is:
Doing File 0 of 973 :results1322453406319.txt
Doing File 1 of 973 :results1322464193519.txt
Doing File 2 of 973 :results1322337493419.txt
Doing File 3 of 973 :results1322347332053.txt
Doing File 4 of 973 :results1322330379488.txt
Doing File 5 of 973 :results1322369464720.txt
Doing File 6 of 973 :results1322379574296.txt
Doing File 7 of 973 :results1322346981999.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at Twitter.main(Twitter.java:86)
Where main line 86 relates to the line builder.append(d); in this example.
But the thing I don't understand is that if I comment out the line urls.add(temp2[7].trim()); I don't get any error. So the error seems to be caused by the linkedlist "urls" overfilling. But why then does the reported error relate to the StringBuilder?
Try to replace urls.add(temp2[7].trim()); with urls.add(new String(temp2[7].trim()));.
I suppose that your problem is that you are in fact storing the entire file content and not just the extracted URL field in your urls list, although that's not really obvious. It is actually an implementation-specific issue with the String class: in older JDKs (before Java 7 update 6), String#split and String#trim return new String objects which share the internal char array of the original string and only differ in their offset and length fields. Using the new String(String) constructor makes sure that you only keep the relevant part of the original data.
The linked list uses more memory each time you add a string. This means you can be left with not enough memory to build your StringBuilder.
The way to avoid this issue is to write the results to a file instead of to a List, as you don't appear to have enough memory to keep the List in memory.
Because this is
out of memory and not out of heap
you have LOTS of small temporary objects
I would suggest you give your JVM an -Xmx maximum heap size limit that fits in your RAM.
To use less memory I would use a buffered reader to pull in the entire line and save on the temporary object creation.
The simple answer is: you should not load all the URLs from the text files into memory. You are surely doing this because you want to process them in a next step. So instead of adding them to a List in memory do the next step (maybe storing in a database or check if it is reachable) and forget that URL.
How many URLS do you have? Looks like you're just storing more of them than you can handle.
As far as I can see, the linked list is the only object that is not scoped inside the loop, so cannot be collected.
For an OOM error, it doesn't really matter where it is thrown.
To check this properly, use a profiler (look at JVisualVM for a free one, and you probably already have it). You'll see which objects are in the heap. You can also have the JVM dump its memory into a file when it crashes, then analyse that file with visualvm. You should see that one thing is grabbing all of your memory. I'm suspecting it's all the URLs.
There are several experts in here already, so I'll be brief about the problems:
Inappropriate use of String Builder:
StringBuilder builder = new StringBuilder();
try {
inputStream = new FileReader(f);
int c;
char d;
while ((c = inputStream.read()) != -1) {
d = (char)c;
builder.append(d);
}
}
Java is beautiful when you process small amounts of data at a time, remember the garbage collector.
Instead, I would recommend that you read the (text) file one line at a time, process the line, and move on. Never create a huge memory ball of StringBuilder just to get a String.
Imagine your text file is 1 GB in size: you are done, mate.
Add the real process while reading the file (as in item #1)
You don't need to close the InputStream again; the code in the finally block is good enough.
regards
If the LinkedList eats your memory, every command which allocates memory afterwards may fail with an OOM error. So this looks like your problem.
You're reading the files into memory. At least one file is simply too big to fit into the default JVM heap. You can allow it to use a lot more memory with an argument like -Xmx1g on the command line after java.
By the way this is really inefficient to read a file one character at a time!
Instead of trying to split the string (which basically creates an array of substrings based on the split), thereby using more than double the memory each time you use split, you should try regex-based matching of the start and end patterns, extract the individual substrings one by one, and then extract the URL from each.
Also, if your file is large, I would suggest that you not even load all of that into memory at once ... stream its contents to a buffer (of manageable size) and use the pattern based search on that (and keep removing / adding more to the buffer as you progress through the file contents).
This implementation will slow down the program a bit but will use considerably less memory.
One major problem in your code is that you read the whole file into a StringBuilder, then convert it into a String and then split it into smaller parts. So if the file is large you will get into trouble. As suggested by others, process the file line by line, as that should save a lot of memory.
Also, you should check the size of your list after processing each file. If it is very large you may want to use a different approach or increase the memory for your process via the -Xmx option.
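Since the files in the question use |NEWandrewLINE| rather than real line breaks, reading "line by line" needs a custom delimiter. One possible sketch uses java.util.Scanner to stream one row at a time and hands each URL to a callback instead of collecting it. UrlExtractor, extractUrls and the Consumer parameter are illustrative names, not part of the original code.

```java
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class UrlExtractor {
    private static final Pattern ROW = Pattern.compile(Pattern.quote("|NEWandrewLINE|"));
    private static final Pattern COLUMN = Pattern.compile(Pattern.quote("|NEWandrewTAB|"));

    // Streams rows from source and passes the 8th column of every
    // 22-column row to sink, so only one row is held in memory at a time.
    static void extractUrls(Readable source, Consumer<String> sink) {
        try (Scanner rows = new Scanner(source).useDelimiter(ROW)) {
            while (rows.hasNext()) {
                String[] columns = COLUMN.split(rows.next());
                if (columns.length == 22) {
                    sink.accept(columns[7].trim());
                }
            }
        }
    }
}
```

A call such as extractUrls(new FileReader(f), url -> process(url)) would then do the next step per URL, where process is whatever that next step is. Be aware that Scanner swallows IOExceptions from the underlying source; they can be inspected afterwards through its ioException() method.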

Fastest Java way to remove the first/top line of a file (like a stack)

I am trying to improve an external sort implementation in java.
I have a bunch of BufferedReader objects open for temporary files. I repeatedly remove the top line from each of these files. This pushes the limits of the Java heap.
I would like a more scalable method of doing this without losing speed because of a bunch of constructor calls.
One solution is to only open files when they are needed, then read the first line and then delete it. But I am afraid that this will be significantly slower.
So, using Java libraries, what is the most efficient method of doing this?
--Edit--
For external sort, the usual method is to break a large file up into several chunk files and sort each of the chunks. Then treat the sorted files like buffers: pop the top item from each file, and the smallest of all those is the global minimum. Then continue until all items have been consumed.
http://en.wikipedia.org/wiki/External_sorting
My temporary files (buffers) are basically BufferedReader objects. The operations performed on these files are the same as stack/queue operations (peek and pop, no push needed).
I am trying to make these peek and pop operations more efficient. This is because using many BufferedReader objects takes up too much space.
I'm away from my compiler at the moment, but I think this will work. Edit: works fine.
I urge you to profile it and see. I bet the constructor calls are going to be nothing compared to the file I/O and your comparison operations.
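import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
public class FileStack {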
private File file;
private long position = 0;
private String cache = null;
public FileStack(File file) {
this.file = file;
}
public String peek() throws IOException {
if (cache != null) {
return cache;
}
BufferedReader r = new BufferedReader(new FileReader(file));
try {
r.skip(position);
cache = r.readLine();
return cache;
} finally {
r.close();
}
}
public String pop() throws IOException {
String r = peek();
if (r != null) {
// if you have \r\n line endings, you may need +2 instead of +1
// if lines could end either way, you'll need something more complicated
position += r.length() + 1;
cache = null;
}
return r;
}
}
If heap space is the main concern, use the 2nd form of the BufferedReader constructor (http://java.sun.com/j2se/1.5.0/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)) and specify a small buffer size.
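As an illustration of the small-buffer constructor in the context of the merge step, here is a sketch that keeps one BufferedReader per chunk open and uses a PriorityQueue for the peek/pop operations. ChunkMerger, Head and the bufferChars parameter are names chosen for this example, and lines are compared with plain String ordering.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.List;
import java.util.PriorityQueue;

class ChunkMerger {

    // One sorted chunk: its reader plus the current (peeked) line.
    private static final class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;

        Head(BufferedReader reader, String line) {
            this.reader = reader;
            this.line = line;
        }

        @Override
        public int compareTo(Head other) {
            return line.compareTo(other.line);
        }
    }

    // Merges already-sorted chunks into out. Each chunk gets a reader with
    // a buffer of bufferChars characters. The caller closes the readers.
    static void merge(List<? extends Reader> chunks, Writer out, int bufferChars) throws IOException {
        PriorityQueue<Head> heads = new PriorityQueue<>();
        for (Reader chunk : chunks) {
            BufferedReader reader = new BufferedReader(chunk, bufferChars);
            String first = reader.readLine();
            if (first != null) {
                heads.add(new Head(reader, first));
            }
        }
        while (!heads.isEmpty()) {
            Head smallest = heads.poll(); // pop the global minimum
            out.write(smallest.line);
            out.write('\n');
            smallest.line = smallest.reader.readLine();
            if (smallest.line != null) {
                heads.add(smallest); // re-insert with its next line
            }
        }
    }
}
```

With this layout nothing is ever deleted from the chunk files; each reader simply advances, and only one line per chunk is held in memory besides the buffers.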
I have a bunch of BufferedReader objects open for temporary files. I repeatedly remove the top line from each of these files. This pushes the limits of the Java's Heap.
This is a really surprising claim. Unless you have thousands of files open at the same time, there is no way that that should stress the heap. The default buffer size for a BufferedReader is 8192 characters (about 16 Kbytes), and there should be little extra space required. Even 1000 such buffers are only ~16 Mbytes, and that is tiny compared with a typical Java application's memory usage.
Consider the possibility that something else is causing the heap problems. For example, if your program retained references to each line that it read, THAT would lead to heap problems.
(Or maybe your notion of what is "too much space" is unrealistic.)
One solution is to only open files when they are needed, then read the first line and then delete it. But I am afraid that this will be significantly slower.
There is no doubt that it would be significantly slower! There is simply no efficient way to delete the first line from a file. Not in Java, or in any other language. Deleting characters from the beginning or middle of a file entails copying the file to a new one while skipping over the characters that need to be removed. There is no faster alternative.
