Delete specific contents of file using Regex Expression in Java - java

Consider that I have a data file storing rules in the following format:
//some header info
//more header info
//Rule: some_uuid_1234
rule "name"
data
data
data
end
//Rule: some_uuid_5678
rule "name2"
data
data
data
end
Now, what I would like is to be able to either read(id) or delete(id) a rule given its ID. My question therefore is: how could I select a specific rule (perhaps using a regular expression) and then delete it from the file without altering anything else?

Simply replace <some_id> in your select/delete function with the actual ID.
//Rule: <some_id>.+?rule.+?end
NOTE: Don't forget the SingleLine option (Pattern.DOTALL in Java), so that . also matches line breaks.
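A minimal sketch of how that pattern could be applied in Java, assuming the whole file has already been loaded into a String and that every rule is closed by a line containing only end. The class name RuleRegex and its methods are hypothetical. The pattern is slightly stricter than the one above: it anchors on line starts with MULTILINE and quotes the ID so that regex metacharacters in it are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleRegex {
    // Builds a pattern for one rule block: the "//Rule: <id>" line through the closing "end" line.
    static Pattern patternFor(String id) {
        // DOTALL lets '.' cross line breaks; MULTILINE lets ^ and $ match at line boundaries.
        return Pattern.compile(
                "^//Rule: " + Pattern.quote(id) + "\\R.*?^end$\\R?",
                Pattern.DOTALL | Pattern.MULTILINE);
    }

    // Returns the text of the rule, or null if no rule has this id.
    static String read(String content, String id) {
        Matcher m = patternFor(id).matcher(content);
        return m.find() ? m.group() : null;
    }

    // Returns the content with the rule removed; everything else is left untouched.
    static String delete(String content, String id) {
        return patternFor(id).matcher(content).replaceFirst("");
    }
}
```

The result of delete still has to be written back to the file, which is where the memory and I/O trade-offs discussed in the other answers come in.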

There are 2 solutions I can think of and they have varied performance, so you can choose the one that suits you best.
Index the file
You could write an inverted index for this rule file and keep it updated for any operation that modifies the file. Of course your word index will be limited to one file, and the only words in it will be the unique UUIDs. You can use a RandomAccessFile to quickly read() from a given offset. The delete() operation can overwrite the target rule until it encounters the word 'end'. This solution requires more work, but you can retrieve values instantly.
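As a rough sketch of the indexing idea, the class below (RuleIndex is a hypothetical name) scans the file once, remembers the byte offset of every "//Rule: <id>" line, and later seeks straight to a rule. It assumes single-byte (ASCII-compatible) content, because RandomAccessFile.readLine() converts each byte to one character, and it only covers read(); keeping the index in sync after a delete is left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

final class RuleIndex {
    private static final String MARKER = "//Rule: ";

    // Scans the file once and maps each rule id to the byte offset of its "//Rule:" line.
    static Map<String, Long> build(String path) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long lineStart = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(MARKER)) {
                    offsets.put(line.substring(MARKER.length()).trim(), lineStart);
                }
                lineStart = raf.getFilePointer(); // offset of the next line
            }
        }
        return offsets;
    }

    // Seeks to the indexed offset and reads up to and including the rule's "end" line.
    static String read(String path, Map<String, Long> offsets, String id) throws IOException {
        Long offset = offsets.get(id);
        if (offset == null) {
            return null; // unknown id
        }
        StringBuilder rule = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            String line;
            while ((line = raf.readLine()) != null) {
                rule.append(line).append('\n');
                if (line.trim().equals("end")) {
                    break;
                }
            }
        }
        return rule.toString();
    }
}
```

RandomAccessFile.readLine() is unbuffered, so building the index is slow on large files; the payoff is that each later read() touches only the bytes of one rule.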
Use a regex
You can alternatively read each line in the file and match it with a regex pattern that matches your rule UUID. Keep reading until you hit the 'end' of the rule and return it. A delete will involve over-writing the rule once you know the desired index. This solution is easy to write but the performance will suck. There is a lot of IO and it could become a bottleneck. (You could also load the entire file into memory and run a regex on the whole string, depending on how large the file / string is expected to be. This can get ugly real quick though.)
Whichever solution you choose you might also want to think about file level locks and how that affects CRUD operations. If this design has not been implemented yet, please consider moving the rules to a database.

I wouldn't use regular expressions to solve this particular problem - it would require loading the whole file in memory, processing it and rewriting it. That's not inherently bad, but if you have large enough files, a stream-based solution is probably better.
What you'd do is process your input file one line at a time and maintain a boolean value that:
becomes true when you find a line that matches the desired rule's declaration header.
becomes false when it's true and you find a line that matches end.
Discard all lines encountered while your boolean is set to true, and write all other ones to a temporary output file (created, for example, with File#createTempFile).
At the end of the process, overwrite your input file with your temporary output file using File#renameTo.
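Note that this solution has the added advantage of never leaving your input file partially written should an error occur in the middle of processing: it will either be replaced entirely or not at all, which protects you against unexpected IOExceptions. Be aware that File#renameTo is platform-dependent: it might not be atomic, it might fail if the temporary file lives on a different file system or if the destination already exists, and it reports failure through its return value, which should always be checked.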
The following code demonstrates how you could implement that. It's not necessarily a perfect implementation, but it should illustrate the algorithm - lost somewhere in the middle of all that boilerplate code.
public void deleteFrom(String id, File file) throws IOException {
    BufferedReader reader;
    String         line;
    boolean        inRule;
    File           temp;
    PrintWriter    writer;
    reader = null;
    writer = null;
    try {
        // Streams initialisation.
        temp   = File.createTempFile("delete", "rule");
        writer = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp), "utf-8")));
        reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "utf-8"));
        inRule = false;
        // For each line in the file...
        while((line = reader.readLine()) != null) {
            // If we're parsing the rule to delete, we're only interested in knowing when we're done.
            if(inRule) {
                if(line.trim().equals("end"))
                    inRule = false;
            }
            // Otherwise, look for the "//Rule: <id>" comment that opens the targeted rule.
            else if(line.trim().equals("//Rule: " + id))
                inRule = true;
            // Normal line, we want to keep it.
            else
                writer.println(line);
        }
    }
    // Stream cleanup.
    finally {
        if(reader != null)
            reader.close();
        if(writer != null)
            writer.close();
    }
    // We're done, move the new file over the old one. renameTo signals failure through its return value.
    if(!temp.renameTo(file))
        throw new IOException("Could not replace " + file + " with " + temp);
}

Related

Modify content of large file

I have extracted my tables from my database into JSON files. Now I want to read these files and remove all double quotes from them. It seems easy, and I have tried hundreds of solutions, but some of them lead to out-of-memory problems. I'm dealing with files larger than 1 GB. The code below behaves strangely, and I don't understand why it returns empty files:
public void replaceDoubleQuotes(String fileName){
log.debug(" start formatting " + fileName + " ...");
File firstFile = new File ("C:/sqlite/db/tables/" + fileName);
String oldContent = "";
String newContent = "";
BufferedReader reader = null;
BufferedWriter writer = null;
FileWriter writerFile = null;
String stringQuotes = "\\\\\\\\\"";
try {
reader = new BufferedReader(new FileReader(firstFile));
writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
writer = new BufferedWriter(writerFile);
while (( oldContent = reader.readLine()) != null ){
newContent = oldContent.replaceAll(stringQuotes, "");
writer.write(newContent);
}
writer.flush();
writer.close();
} catch (Exception e) {
log.error(e);
}
}
And when I try to use FileWriter(path, true) to write at the end of the file, the program keeps increasing the file size until the hard disk is full. Thanks for your help.
PS: I also tried to use substring and append the new content, then write it out after the while loop, but that doesn't work either.
TL;DR
Do not read and write the same file concurrently.
The issue
Your code starts reading, and then immediately truncates the file it is reading.
reader = new BufferedReader(new FileReader(firstFile));
writerFile = new FileWriter("C:/sqlite/db/tables/" + fileName);
writer = new BufferedWriter(writerFile);
The first line opens a read handle to the file.
The second line opens a write handle to the same file.
It is not very clear if you look at the documentation of FileWriter constructor, but when you do not use a constructor that allows you to specify the append parameter, then the value is false by default, meaning, you immediately truncate the file if it already exists.
At this point (line 2) you have just erased the file you were about to read. So you end up with an empty file.
What about using append=true
Well, then the file is not erased when it is opened, which is "good". So your program starts reading the first line, and outputs (to the same file) the filtered version.
So each time a line is read, another is appended.
No wonder your program will never reach the end of the file : each time it advances a line, it creates another line to process. Generally speaking, you'll never reach end of file (well of course if the file is a single line to begin with, you might but that's a corner case).
The solution
Write to a temporary file, and IF (and only IF) you succeed, then swap the files if you really need to.
An advantage of this solution: if for whatever reason your process crashes, you'll have the original file untouched and you can retry later, which is usually a good thing. Your process is "repeatable".
A disadvantage : you'll need twice the space at some point. (Although you could compress the temp file and reduce this factor but still).
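A minimal sketch of the temporary-file approach, assuming UTF-8 input and that removing every double quote character is the goal. The class and method names (QuoteStripper, stripDoubleQuotes) are hypothetical, and the temporary file is created next to the source so that the final move stays on one file system.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class QuoteStripper {
    // Streams the source line by line into a sibling temp file, then swaps it in on success.
    static void stripDoubleQuotes(Path source) throws IOException {
        Path temp = Files.createTempFile(source.toAbsolutePath().getParent(), "strip", ".tmp");
        try {
            try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line.replace("\"", ""));
                    writer.newLine(); // readLine() strips the terminator, so add it back
                }
            }
            // Only reached if the whole file was processed without an exception.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp); // does nothing after a successful move
        }
    }
}
```

Both streams are closed by try-with-resources before the move, and the original file is only replaced once the filtered copy is complete.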
About out of memory issues
When working with arbitrarily large files, the path you chose (using buffered readers and writers) is the right one, because you only use one line-worth of memory at a time.
Therefore it generally avoids memory usage issues (unless of course, you have a file without line breaks, in which case it makes no difference at all).
Other solutions, involving reading the whole file at once, then performing the search/replace in memory, then writing the contents back do not scale that well, so it's good you avoided this kind of computation.
Not related but important
Check out the try with resources syntax to properly close your resources (reader / writer). Here you forgot to close the reader, and you are not closing the writer appropriately anyway (that is : in a finally clause).
Another thing : I'm pretty sure no java program written by a mere mortal will beat tools like sed or awk that are available on most unix platforms (and some more). Maybe you'd want to check if rolling your own in java is worth what is a shell one-liner.
@GPI already provided a great answer on why reading and writing the same file concurrently causes the issue you're experiencing. It is also worth noting that reading 1 GB of data into the heap at once can definitely cause an OutOfMemoryError if not enough heap is allocated, which is likely. To solve this problem you could use an InputStream and read the file in chunks, write them to another file until the process is completed, and finally replace the existing file with the modified one and delete the temporary one. With this approach you could even use a ForkJoinTask to help, since it's such a large job.
Side note;
There may be a better solution than create new file, write to new file, replace existing, delete new file.

How to delete all lines from a file one-by-one after reading the line?

I'm writing a java program that does the following:
1. Reads in a line from a file
2. Does some action based on that line
3. Deletes the line (or replaces it with ""), and if step 2 is not successful, writes it to a new file
4. Continues on to the next line, for all lines in the file (as opposed to removing an arbitrary line)
Currently I have:
try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
    String line;
    while ((line = br.readLine()) != null) {
        try {
            if (!do_stuff(line)){ //do_stuff returns bool based on success
                write_non_success(line);
            }
        } catch (Exception e) {
            e.printStackTrace(); //eat the exception for now, do something in the future
        }
    }
}
Obviously I'm going to need to not use a BufferedReader for this, as it can't write, but what class should I use? Also, read order doesn't matter
This differs from this question because I want to remove all lines, as opposed to an arbitrary line number as the other OP wants, and if possible I'd like to avoid writing the temp file after every line, as my files are approximately 1 million lines
If you do everything according to the algorithm that you describe, the content left in the original file would be the same as the content of "new file" from step #3:
If a line is processed successfully, it gets removed from the original file
If a line is not processed successfully, it gets added to the new file, and it also stays in the original file.
It is easy to see why at the end of this process the original file is the same as the "new file". All you need to do is to carry out your algorithm to the end, and then copy the new file in place of the original.
If your concern is that the process is going to crash in the middle, the situation becomes very different: now you have to write out the current state of the original file after processing each line, without writing over the original until you are sure that it is going to be in a consistent state. You can do it by reading all lines into a list, deleting the first line from the list once it has been processed, writing the content of the entire list into a temporary file, and copying it in place of the original. Obviously, this is very expensive, so it shouldn't be attempted in a tight loop. However, this approach ensures that the original file is not left in an inconsistent state, which is important when you are looking to avoid doing the same work multiple times.

Java: What's the most efficient way to read relatively large txt files and store its data?

I was supposed to write a method that reads a DNA sequence in order to test some string matching algorithms on it.
I took some existing code I use to read text files (don't really know any others):
try {
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
while((line = br.readLine()) != null) {
seq += line;
}
br.close();
}
catch(FileNotFoundException e) { e.printStackTrace(); }
catch(IOException e) { e.printStackTrace(); }
This seems to work just fine for small text files with ~3000 characters, but it takes forever (I just cancelled it after 10 minutes) to read files containing more than 45 million characters.
Is there a more efficient way of doing this?
One thing I notice is that you are doing seq += line. seq is probably a String? If so, then you have to remember that strings are immutable, so what you are actually doing is creating a new String each time you append a line to it. Please use StringBuilder instead. Also, if possible, you don't want to build a string and then process it, because that way you go over the data twice. Ideally you want to process as you read, but I don't know your situation.
The main element slowing your progress is the "concatenation" of the String seq and line when you call seq += line. I use quotes for concatenation because in Java, Strings cannot be modified once they are created (i.e. they are immutable, as user1598503 mentioned). Initially this is not an issue, as the Strings are small. However, once the Strings become very long, i.e. hundreds of thousands of characters, memory must be allocated for a new String and the whole existing content copied into it on every append, which takes quite a bit of time. StringBuilder will allow you to do these concatenations in place, meaning you will not be creating a new object every single time.
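A short sketch of the StringBuilder version of the loop from the question. The class and method names (SequenceReader, readSequence) are hypothetical, and the method takes any Reader so it can be tried without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class SequenceReader {
    // Joins all lines from the reader into one string, dropping the line terminators.
    static String readSequence(Reader source) throws IOException {
        StringBuilder seq = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                seq.append(line); // appends into a growing buffer instead of copying the whole string
            }
        }
        return seq.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readSequence(new StringReader("ACGT\nTTGA\n")));
    }
}
```

For a real file, pass new FileReader(file) (or Files.newBufferedReader) instead of the StringReader used in main.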
Your problem is not that the reading takes too much time, but that the concatenating takes too much time. Just to verify this I ran your code (it didn't finish) and then simply commented out the seq += line statement, and it ran in under a second. You could try using seq = seq.concat(line), since it has been reported to be quite a bit faster most of the time, but I tried that too and it still didn't run in under 1-2 minutes (for a 9.6 MB input file). My solution would be to store your lines in an ArrayList (or a container of your choice), so the content of your while loop would be list.add(line);. The ArrayList example ran in about 2-3 seconds with the same input file. If you really, really want to store your entire file in a string you could do something like this (using the Scanner class):
String content = new Scanner(new File("input")).useDelimiter("\\Z").next();
^^This works in a matter of seconds as well. I should mention that "\Z" matches the end of the input, which is why using it as the delimiter reads the whole thing in one fell swoop.

How can I speed up my Java text file parser?

I am reading about 600 text files (about 400MB in total), parsing each file individually and adding all the terms to a map so I can know the frequency of each word within the 600 files.
My parser function includes the following steps (in order):
find the text between two tags, which is the relevant text to read in each file.
lowercase all the text.
string.split with multiple delimiters.
create an ArrayList with words like "aaa-aa", add them to the array produced by the split above, and discount "aaa" and "aa" from the String[] (I did this because I wanted "-" to be a delimiter, but I also wanted "aaa-aa" to be one word only, and not "aaa" and "aa").
take the String[] and map it into a Map = new HashMap ... (word, frequency).
print everything.
It is taking about 8 minutes and 48 seconds on a dual-core 2.2GHz machine with 2GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And if possible, how can I find out (in NetBeans) which functions are taking the most time to execute?
unique words found: 398752.
CODE:
File file = new File(dir);
String[] files = file.list();
for (int i = 0; i < files.length; i++) {
BufferedReader br = new BufferedReader(
new InputStreamReader(
new BufferedInputStream(
new FileInputStream(dir + files[i])), encoding));
try {
String line;
while ((line = br.readLine()) != null) {
parsedString = parseString(line); // parse the string
m = stringToMap(parsedString, m);
}
} finally {
br.close();
}
}
EDIT: Check this:
[screenshot not available]
I don't know what to conclude.
EDIT: 80% TIME USED WITH THIS FUNCTION
public String [] parseString(String sentence){
// separators; ,:;'"\/<>()[]*~^ºª+&%$ etc..
String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");
Map<String, String> o = new HashMap<String, String>(); // save the hyphened words, aaa-bbb like Map<aaa,bbb>
Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
Matcher matcher = pattern.matcher(sentence);
// Find all matches like this: ("aaa-bb or bbb-cc") and put it to map to later add this words to the original map and discount the single words "aaa-aa" like "aaa" and "aa"
for(int i=0; matcher.find(); i++){
String [] tempo = matcher.group().split("-");
o.put(tempo[0], tempo[1]);
}
//System.out.println("words: " + o);
ArrayList temp = new ArrayList();
temp.addAll(Arrays.asList(parts));
for (Map.Entry<String, String> entry : o.entrySet()) {
String key = entry.getKey();
String value = entry.getValue();
temp.add(key+"-"+value);
if(temp.indexOf(key)!=-1){
temp.remove(temp.indexOf(key));
}
if(temp.indexOf(value)!=-1){
temp.remove(temp.indexOf(value));
}
}
String []strArray = new String[temp.size()];
temp.toArray(strArray);
return strArray;
}
600 files, each file about 0.5MB
EDIT 3: The pattern is no longer compiled each time a line is read. The new images are:
[images not available]
Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.
Update after memory screenshot
Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.
Update 2 - after code added to question.
Yup - two patterns compiled on every line - the explicit one, and also the "-" in the split (much cheaper, of course). I wish they hadn't added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
Try to use a single regex that has a group that matches each word that is within tags, so a single regex could be used for your entire input and there would be no separate "split" stage.
Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize the creation of objects, for both construction cost and garbage collection cost.
Is it just the parsing that's taking so long, or is it the file reading as well?
For the file reading, you can probably speed that up by reading the files on multiple threads. But first step is to figure out whether it's the reading or the parsing that's taking all the time so you can address the right issue.
Run the code through the Netbeans profiler and find out where it is taking the most time (right mouse click on the project and select profile, make sure you do time not memory).
Nothing in the code that you have shown us is an obvious source of performance problems. The problem is likely to be something to do with the way that you are parsing the lines or extracting the words and putting them into the map. If you want more advice you need to post the code for those methods, and the code that declares / initializes the map.
My general advice would be to profile the application and see where the bottlenecks are, and use that information to figure out what needs to be optimized.
@Ed Staub's advice is also sound. Running an application with a heap that is too small can result in serious performance problems.
If you aren't already doing it, use BufferedInputStream and BufferedReader to read the files. Double-buffering like that is measurably better than using BufferedInputStream or BufferedReader alone. E.g.:
BufferedReader rdr = new BufferedReader(
new InputStreamReader(
new BufferedInputStream(
new FileInputStream(aFile)
)
/* add an encoding arg here (e.g., ', "UTF-8"') if appropriate */
)
);
If you post relevant parts of your code, there'd be a chance we could comment on how to improve the processing.
EDIT:
Based on your edit, here are a couple of suggestions:
Compile the pattern once and save it as a static variable, rather than compiling every time you call parseString.
Store the values of temp.indexOf(key) and temp.indexOf(value) when you first call them and then use the stored values instead of calling indexOf a second time.
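To illustrate the first suggestion, here is a minimal sketch with the pattern hoisted into a static final field. The class name HyphenWords is hypothetical, and the pattern is a simplified ASCII-only version of the one in the question (which also covers accented letters), used here only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenWords {
    // Compiled once when the class is loaded and reused for every line.
    private static final Pattern HYPHENATED =
            Pattern.compile("(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])");

    // Returns every hyphenated word ("aaa-bb") found in the sentence, in order of appearance.
    static List<String> hyphenated(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher matcher = HYPHENATED.matcher(sentence); // cheap: reuses the compiled pattern
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

A compiled Pattern is immutable and safe to share between threads; only the Matcher objects are per-call state.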
It looks like it's spending most of its time in regular expressions. I would first try writing the code without using a regular expression, and then using multiple threads if the process still appears to be CPU bound.
For the counter, I would look at using TObjectIntHashMap to reduce the overhead of the counter. I would use only one map, not create an array of string - counts which I then use to build another map, this could be a significant waste of time.
Precompile the pattern instead of compiling it every time through that method, and get rid of the double buffering: use new BufferedReader(new FileReader(...)).

Deleting duplicate lines in a file using Java

As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem, is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process to take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
public void stripDuplicatesFromFile(String filename) {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
Okay, most answers are a bit silly and slow since it involves adding lines to some hashset or whatever and then moving it back from that set again. Let me show the most optimal solution in pseudocode:
Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
    Read Line.
    If not(Line in hashSet)
        Add Line to hashset.
        Write Line to output.
    End If.
End While.
Free hashset.
Close input.
Close output.
Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
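The pseudocode above could be written in Java roughly as follows. This is a sketch, assuming UTF-8 text and separate input and output paths; the class and method names (Dedupe, stripDuplicates) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

final class Dedupe {
    // Copies input to output, keeping only the first occurrence of each line.
    static void stripDuplicates(Path input, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // add() returns false when the line was already in the set.
                if (seen.add(line)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}
```

Because lines are written as they are first seen, the output keeps the original order even with a plain HashSet; the set still has to hold every distinct line in memory.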
A similar approach
public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}
Something like this, perhaps:
BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);
LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.
You could use Set in the Collections library to store unique, seen values as you read the file.
Set<String> uniqueStrings = new HashSet<String>();
// read your file, looping on newline, putting each line into variable 'thisLine'
uniqueStrings.add(thisLine);
// finish read
for (String uniqueString:uniqueStrings) {
// do your processing for each unique String
// i.e. System.out.println(uniqueString);
}
If the order does not matter, the simplest way is shell scripting:
<infile sort | uniq > outfile
Try a simple HashSet that stores the lines you have already read.
Then iterate over the file.
If you come across duplicates they are simply ignored (as a Set can only contain every element once).
Read in the file, storing the line number and the line: O(n)
Sort it into alphabetical order: O(n log n)
Remove duplicates: O(n)
Sort it into its original line number order: O(n log n)
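The four steps above can be sketched in memory as follows. This is only an illustration on a List of lines (the class name SortDedupe is hypothetical); it sorts line numbers rather than copies of the lines and relies on List.sort being stable, so among equal lines the earliest one comes first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SortDedupe {
    // Removes duplicate lines, keeping the first occurrence of each in the original order.
    static List<String> dedupe(List<String> lines) {
        // Step 1: record the line number of every line.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            order.add(i);
        }
        // Step 2: sort the line numbers by line text (stable, so ties stay in line-number order).
        order.sort(Comparator.comparing((Integer i) -> lines.get(i)));
        // Step 3: one pass over the sorted order, keeping the first line of each run of equal lines.
        List<Integer> kept = new ArrayList<>();
        for (int k = 0; k < order.size(); k++) {
            if (k == 0 || !lines.get(order.get(k)).equals(lines.get(order.get(k - 1)))) {
                kept.add(order.get(k));
            }
        }
        // Step 4: sort the surviving line numbers back into their original order.
        Collections.sort(kept);
        List<String> result = new ArrayList<>();
        for (int i : kept) {
            result.add(lines.get(i));
        }
        return result;
    }
}
```

For input that fits in memory this is slower than the hash set approach; the same four steps are mainly useful when the sorting has to be done on disk.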
The Hash Set approach is OK, but you can tweak it to not have to store all the Strings in memory, but a logical pointer to the location in the file so you can go back to read the actual value only in case you need it.
Another creative approach is to append the line number to each line, sort all the lines, remove the duplicates (ignoring the last token, which is the number), and then sort the file again by the last token, stripping it out in the output.
If you could use UNIX shell commands you could do something like the following:
for(i = line 0 to end)
{
sed 's/\$i//2g' ; deletes all repeats
}
This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.
There are two scalable solutions, depending on whether the procedure should be stable or not. By scalable I mean disk-based rather than memory-based, and by stable I mean that the order of the lines after removing duplicates is the same as before. If scalability isn't an issue, then simply use memory for the same sort of method.
For the non stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.
The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.
The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge", don't bother storing the result, just the line numbers to be deleted.
Then copy the original file line by line, ignoring the line numbers you have stored above.
Does it matter in which order the lines come, and how many duplicates are you counting on seeing?
If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing) I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.
I have made two assumptions for this efficient solution:
There is a Blob equivalent of line or we can process it as binary
We can save the offset or a pointer to start of each line.
Based on these assumptions solution is:
1. Read a line and save its length as the key in the hashmap, so we have a lighter hashmap. The entry for each key is the list of all lines (offsets) having that length. Building this hashmap is O(n).
2. While mapping the offsets for each line in the hashmap, compare the line blobs with all existing entries in the list of lines (offsets) for this key length, except entries with -1 as offset. If a duplicate is found, remove both lines and save the offset -1 in those places in the list.
So consider the complexity and memory usage:
Hashmap memory, space complexity = O(n), where n is the number of lines.
Time complexity: if there are no duplicates but all lines have the same length m, and the number of lines is n, then every line lands in the same list and is compared with every earlier line, so that would be O(n²) comparisons. Since we assume we can compare blobs, m does not matter.
That was worst case.
In other cases we save on comparisons although we will have little extra space required in hashmap.
Additionally we can use mapreduce on server side to split the set and merge results later. And using length or start of line as the mapper key.
// Note: this removes duplicate space-separated tokens, not whole lines.
void deleteDuplicates(File filename) throws IOException {
    Set<String> lines = new LinkedHashSet<String>();
    String delims = " ";
    System.out.println("Reading the contents and collecting the unique tokens");
    // Read the whole file first; the reader must be closed before the file is rewritten.
    try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
        String line;
        while ((line = reader.readLine()) != null) {
            StringTokenizer str = new StringTokenizer(line.trim(), delims);
            while (str.hasMoreTokens()) {
                lines.add(str.nextToken()); // the set ignores tokens it already contains
            }
        }
    }
    // Rewrite the file once, with the unique tokens separated by spaces.
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(filename))) {
        for (String unique : lines) {
            writer.write(unique + " ");
        }
    }
    System.out.println(lines);
    System.out.println("Duplicate removal successful");
}
These answers all rely on the file being small enough to store in memory.
If it is OK to sort the file, this is an algorithm that can be used on any sized file.
You need this library: https://github.com/lemire/externalsortinginjava
I assume you start with a file fileDumpCsvFileUnsorted and you will end up with a new file fileDumpCsvFileSorted that is sorted and has no dupes.
ExternalSort.sort(fileDumpCsvFileUnsorted, fileDumpCsvFileSorted);
int numDupes = 0;
File dupesRemoved = new File(fileDumpCsvFileSorted.getAbsolutePath() + ".nodupes");
String previousLine = null;
try (FileWriter fw = new FileWriter(dupesRemoved);
BufferedWriter bw = new BufferedWriter(fw);
FileReader fr = new FileReader(fileDumpCsvFileSorted);
LineIterator lineIterator = new LineIterator(fr)
) {
while (lineIterator.hasNext()) {
String nextLine = lineIterator.nextLine();
if (StringUtils.equals(nextLine, previousLine)) {
++numDupes;
continue;
}
bw.write(String.format("%s%n", nextLine));
previousLine = nextLine;
}
}
logger.info("Removed {} dupes from {}", numDupes, fileDumpCsvFileSorted.getAbsolutePath());
FileUtils.deleteQuietly(fileDumpCsvFileSorted);
FileUtils.moveFile(dupesRemoved, fileDumpCsvFileSorted);
The file fileDumpCsvFileSorted is now created sorted with no dupes.
