Find word frequency in a large file - Java

I have a text file like this:
tom
and
jerry
went
to
america
and
england
I want to get the frequency of each word, including partial matches. That is, the word "to" is present in the word "tom", so my expected count for "to" is 2.
1 america
3 and
1 england
1 jerry
2 to
1 tom
1 went
The text file I have is around 30 GB, hence it's not possible to load all the content into memory.
So what I am doing right now is:
reading the input file using a Scanner
for each word, finding the frequency using this code:
Long wordsCount = Files.lines(Paths.get(allWordsFile))
.filter(s->s.contains(word)).count();
That is, for each word I am looping over the entire file content. Even though I am using a thread pool executor, the performance of this approach is really poor.
Is there a better way of doing this?
Are there any tools available to find the frequency of words in a large file?

Assuming there are a lot of repetitions, you could try something like this (I wrote this from scratch, so it may not compile perfectly):
File file = new File("fileLoc");
BufferedReader br = new BufferedReader(new FileReader(file));
Map<String, Integer> hm = new HashMap<>();
String name;
while ((name = br.readLine()) != null) {
    if (hm.containsKey(name)) {
        hm.replace(name, hm.get(name) + 1);
    } else {
        hm.put(name, 1);
    }
}
br.close();
EDIT: I didn't notice the partial-matches part, but you should be able to loop back through the map after reading the entire file; that way, if there's a partial match, you just combine the partial match's count with the exact match's count.
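A minimal sketch of that second pass, assuming the exact counts already sit in a map like the one above (the method name, the TreeMap used for sorted output, and the standard java.util imports are my own additions):

// Second pass (sketch): for each distinct word, sum the counts of every
// distinct word that contains it, so "to" also picks up the count of "tom".
static Map<String, Long> partialMatchCounts(Map<String, Integer> exactCounts) {
    Map<String, Long> result = new TreeMap<>();   // TreeMap only to get sorted output
    for (String word : exactCounts.keySet()) {
        long total = 0;
        for (Map.Entry<String, Integer> e : exactCounts.entrySet()) {
            if (e.getKey().contains(word)) {
                total += e.getValue();
            }
        }
        result.put(word, total);
    }
    return result;
}

This pass is O(k²) in the number of distinct words k, which is usually far smaller than the file itself, so it avoids rescanning 30 GB once per word.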

The best in terms of performance is to read the lines from the file with a BufferedReader and to store the word counters in a HashMap.

Related

What is the fastest way to compare two text files, not counting moved lines as different

I have two files which are very large, say 50,000 lines each. I need to compare these two files and identify the changes. However, the catch is that if a line is present at a different position, it should not be shown as different.
For eg, consider this
File A.txt
xxxxx
yyyyy
zzzzz
File B.txt
zzzzz
xxxx
yyyyy
So if this is the content of the files, my code should give the output as xxxx (or both xxxx and xxxxx).
Of course, the easiest way would be storing each line of the file in a List<String> and comparing it with the other List<String>. But this seems to be taking a lot of time. I have also tried using DiffUtils in Java, but it doesn't recognize lines present at different line numbers as the same. So is there any other algorithm that might help me?
In general HashSet would be the best solution, but as we are dealing with strings there are two possible solutions:
saving one file as HashSet and trying to find the lines of other file in it.
saving one file as Trie and trying to find the lines of other file in it
In this post you can find a comparison between HashSets and Tries: How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?
Probably using a Set is the easiest way:
Set<String> set1 = new HashSet<String>(FileUtils.readLines(file1));
Set<String> set2 = new HashSet<String>(FileUtils.readLines(file2));
Set<String> similars = new HashSet<String>(set1);
similars.retainAll(set2);
set1.removeAll(similars); //now set1 contains distinct lines in file1
set2.removeAll(similars); //now set2 contains distinct lines in file2
System.out.println(set1); //prints distinct lines in file1;
System.out.println(set2); //prints distinct lines in file2
You need to keep track of the case where the same record might appear more than once in the files. For example, if a record appears twice in file A and once in file B, then you need to record that as an extra record.
Since we have to keep track of the number of occurrences, you need one of:
A Multiset
A Map from record to Integer, e.g. Map<String, Integer>
With a Multiset, you can add and remove records and it will keep track of the number of times the record has been added (a Set doesn't do that - it rejects an add of a record that is already there). With the Map approach, you have to do a little bit of work so that the integer tracks the number of occurrences. Let's consider that approach (the Multiset is simpler).
With the map, when we talk about 'adding' a record, you look to see if there is an entry for that String in the Map. If there is, replace the value with value+1 for that key. If there isn't, create an entry with the value of 1. When we talk about 'removing' an entry, look for an entry for that key. If you find it, replace the value with value-1. If that reduces the value to 0, remove the entry.
Create a Map for each file.
Read a record for one of the files
Check to see if that record exists in the other Map.
If it exists in the other Map, remove that entry (see above for what that means)
If it doesn't exist, add it to the Map for this file (see above)
Repeat until end, alternating files.
The contents of the two Maps will give you the records that appeared in that file but not the other.
Doing this as we go along, rather than building the Maps up front, keeps the memory usage down, but probably doesn't have a big impact on performance.
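A compact sketch of that bookkeeping, assuming plain java.util collections; the class and method names here are illustrative, not from the answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TwoFileDiff {

    // After the loop, mapA holds records that appear only in fileA (with counts),
    // and mapB holds records that appear only in fileB.
    static void diff(String fileA, String fileB) throws IOException {
        Map<String, Integer> mapA = new HashMap<>();
        Map<String, Integer> mapB = new HashMap<>();
        try (BufferedReader a = new BufferedReader(new FileReader(fileA));
             BufferedReader b = new BufferedReader(new FileReader(fileB))) {
            String lineA, lineB;
            do {
                lineA = a.readLine();
                lineB = b.readLine();
                if (lineA != null && !removeOne(mapB, lineA)) addOne(mapA, lineA);
                if (lineB != null && !removeOne(mapA, lineB)) addOne(mapB, lineB);
            } while (lineA != null || lineB != null);
        }
        System.out.println("Only in " + fileA + ": " + mapA);
        System.out.println("Only in " + fileB + ": " + mapB);
    }

    // 'Add' a record: create the entry with 1 or bump the existing count.
    static void addOne(Map<String, Integer> map, String record) {
        map.merge(record, 1, Integer::sum);
    }

    // 'Remove' a record: decrement the count, dropping the entry when it hits 0.
    // Returns false if there was nothing to cancel against.
    static boolean removeOne(Map<String, Integer> map, String record) {
        Integer count = map.get(record);
        if (count == null) return false;
        if (count == 1) map.remove(record);
        else map.put(record, count - 1);
        return true;
    }
}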
I think this will be useful,
BufferedReader reader1 = new BufferedReader(new FileReader("C:\\file1.txt"));
BufferedReader reader2 = new BufferedReader(new FileReader("C:\\file2.txt"));
String line1 = reader1.readLine();
String line2 = reader2.readLine();
boolean areEqual = true;
int lineNum = 1;
while (line1 != null || line2 != null)
{
if(line1 == null || line2 == null)
{
areEqual = false;
break;
}
else if(! line1.equalsIgnoreCase(line2))
{
areEqual = false;
break;
}
line1 = reader1.readLine();
line2 = reader2.readLine();
lineNum++;
}
if(areEqual)
{
System.out.println("Two files have same content.");
}
else
{
System.out.println("Two files have different content. They differ at line "+lineNum);
System.out.println("File1 has "+line1+" and File2 has "+line2+" at line "+lineNum);
}
reader1.close();
reader2.close();
You could try parsing your first file first, storing all of the lines in a HashMap and then checking whether there is a mapping present for each of the lines of the second file.
This is still O(n), though.
Just use a byte comparison with BufferedReader. This will be the fastest way to compare two files. Read a byte block from one file and compare it with the byte block of the other file. First check if the file length is the same.
Or just use FileUtils.contentEquals(file1, file2); from org.apache.commons.io.FileUtils.
You can use FileUtils.contentEquals(file1, file2)
It will compare the contents of the 2 files.
Find more information here
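A minimal usage sketch, with placeholder file names (the call throws IOException):

// Compares the two files' contents (Commons IO does a cheap length check first).
boolean same = FileUtils.contentEquals(new File("A.txt"), new File("B.txt"));
System.out.println(same ? "Files have the same content" : "Files differ");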

Java while loop dramatically slows down over time after a large number of iterations

My program reads a text file line by line in a while loop. It then processes each line and extracts some information to be written in the output. Everything it does inside the while loop is O(1) except two ArrayList indexOf() method calls which I suppose are O(N). The program runs at a reasonable pace (1M lines per 100 seconds) in the beginning but over time it slows down dramatically. I have 70 M lines in the input file so the loop iterates 70 million times. In theory this should take about 2 hours but in practice it takes 13 hours. Where is the problem?
Here is the code snippet:
BufferedReader corpus = new BufferedReader(
new InputStreamReader(
new FileInputStream("MyCorpus.txt"),"UTF8"));
Writer outputFile = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("output.txt"), "UTF-8"));
List<String> words = new ArrayList<>();
//words is being updated with relevant values here
LinkedHashMap<String, Integer> DIC = new LinkedHashMap<>();
//DIC is being updated with relevant key-value pairs here
String line = "";
while ((line = corpus.readLine()) != null) {
String[] parts = line.split(" ");
if (DIC.containsKey(parts[0]) && DIC.containsKey(parts[1])) {
int firstIndexPlusOne = words.indexOf(parts[0])+ 1;
int secondIndexPlusOne = words.indexOf(parts[1]) +1;
outputFile.write(firstIndexPlusOne +" "+secondIndexPlusOne+" "+parts[2]+"\n");
} else {
notFound++;
outputFile.write("NULL\n");
}
}
outputFile.close();
I am assuming you add words to your words ArrayList as you go.
You correctly state that words.indexOf is O(N) and that is the cause of your issue. As N increases (you add words to the list) these operations take longer and longer.
To avoid this keep your list sorted and use binarySearch.
To keep it sorted use binarySearch on each word to work out where to insert it. This takes your complexity from O(n) to O(log(N)).
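A sketch of that idea, assuming words is the same ArrayList as in the question; Collections.binarySearch is used both to find the insertion point and for lookups:

// Keep 'words' sorted and use binary search for both inserts and lookups.
// binarySearch returns -(insertionPoint) - 1 when the element is absent.
static void insertSorted(List<String> words, String word) {
    int pos = Collections.binarySearch(words, word);
    if (pos < 0) {
        words.add(-pos - 1, word);   // insert at the position that keeps the list sorted
    }
}

static int indexOfSorted(List<String> words, String word) {
    return Collections.binarySearch(words, word);   // O(log N); >= 0 only if present
}

Note that inserting into an ArrayList still shifts elements, and the sorted position is not the original insertion-order index, so if those indices matter, a word-to-index map (see the HashMap suggestion further down) is the simpler fix.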
I think words is meant to collect unique words, hence use a Set.
Set<String> words = new HashSet<>();
Map<String, Integer> DIC = new HashMap<>();
Also DIC seems something like a frequency table, in which case dic.keySet() would be the same as words. A LinkedHashMap maintains an extra list to keep the entries sorted on order of insertion.
Writing the separate strings, instead of first concatenating them into a new string, is faster:
outputFile.write(firstIndexPlusOne);
outputFile.write(" ");
outputFile.write(secondIndexPlusOne);
outputFile.write(" ");
outputFile.write(parts[2]);
outputFile.write("\n");
I think one of your problems is this line:
outputFile.write(firstIndexPlusOne +" "+secondIndexPlusOne+" "+parts[2]+"\n");
Since strings are immutable, you are cluttering the memory. Also, maybe try to flush the write buffer on every turn of the loop; it may improve things a bit (my hypothesis here).
Try something like:
String line = "";
StringBuilder sb = new StringBuilder();
while ...
...
sb.append(firstIndexPlusOne);
sb.append(" ");
sb.append(secondIndexPlusOne);
sb.append(" ");
sb.append(parts[2]);
sb.append("\n");
outputFile.write(sb.toString());
sb.setLength(0);
outputFile.flush();
Also, maybe a good read: Tuning Java I/O Performance (Oracle)
If the corpus and the word list are both sorted, the linear search performed by the words.indexOf(..) call would become slower in each iteration.
Building a HashMap(..) from your word list before processing the corpus would even things out. It might be a good idea to do so for optimization, even if that is not the problem.
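A minimal sketch of that pre-built index, assuming words is fully populated before the loop starts (variable names are illustrative):

// Build the index once, before the while loop; putIfAbsent keeps the index of the
// first occurrence, mirroring what words.indexOf would have returned.
Map<String, Integer> wordIndex = new HashMap<>();
for (int i = 0; i < words.size(); i++) {
    wordIndex.putIfAbsent(words.get(i), i);
}

// Inside the loop, both lookups become O(1):
Integer first = wordIndex.get(parts[0]);
Integer second = wordIndex.get(parts[1]);
if (first != null && second != null) {
    outputFile.write((first + 1) + " " + (second + 1) + " " + parts[2] + "\n");
} else {
    notFound++;
    outputFile.write("NULL\n");
}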
Assuming that you update neither words nor DIC in your loop, obviously most of the runtime is consumed when DIC.containsKey(parts[0]) && DIC.containsKey(parts[1]) evaluates to true.
If your question is "why is it slowing down", and not "how can I speed it up", I'd suggest that you take the first 10M lines of your file, copy them into another file and duplicate them so you receive 70M lines consisting of copies of your first 10M lines. Then, execute your code. If it slows down even though the same content is examined again and again, you may check the other answers regarding string builders and such.
If you don't experience the slowing down, then obviously it depends on the actual content of your 70M file. Probably, for the remaining 60M lines of your original file, DIC.containsKey(parts[0]) && DIC.containsKey(parts[1]) evaluates to true more often and therefore the first branch is executed more often, taking more time.
In the latter case, I doubt that you can trick the I/O load by applying single writes such that a performance gain is obtained, but of course I may be very wrong there. You'd have to try. But first, I'd recommend exploring the source of the problem, which I think lies in the file content's structure. After you understand how your code performs with respect to the input given, you may try to optimize (although I would try to keep the whole string in memory and write its contents in one operation after the loop instead of performing very many small write operations).

How can I return the address of a line in a Random Access file?

I'm trying to create a random access file in Java.
I write something on a new line.
How can I return the address of that line in Java?
Also, I'm a bit confused with RAFs.
For example, I have a file that consists of the following entries in alphabetical order:
George 10 10 8
Mary 9 10 10
Nick 8 8 8
Nickolas 10 10 9
I would like to return the grades of Nickolas.
How can I declare that in a RAF?
Is there any method that can "read("Nickolas")" and return the line to me?
Thanks in advance.
Random access files usually contain binary data rather than ASCII (e.g. plain text) data. The example you are showing is ASCII.
Since the data is ASCII, it's not as easy to seek to arbitrary places in the file. In fact, the general approach to get the grades for Nickolas would be to read the file line by line, parse each line into columns, and compare the first column against "Nickolas".
For example,
BufferedReader in = new BufferedReader(new FileReader("grades.txt"));
String line = in.readLine();
while(null != line) {
String [] columns = line.split(" ");
if( columns[0].equals("Nickolas") )
System.out.println("I found the line! " + line);
line = in.readLine();
}
EDIT:
There are a number of ways to speed this up. Here are three:
Storing all data in a HashMap
If you don't have too many records, or if each record doesn't take much space, you could read them all into RAM. You can also use a HashMap to map the name of the student to their record. For example:
HashMap<String, Student> grades = new HashMap<String, Student>();
BufferedReader in = new BufferedReader(new FileReader("grades.txt"));
String line = in.readLine();
while (null != line) {
    String[] columns = line.split(" ");
    grades.put(columns[0],
        new Student(/* create Student instance from columns */));
    line = in.readLine();
}
Now, lookups will be extremely fast.
Using a Binary Search
If you have too many records to fit in RAM, you can write all of the student data to a random access (binary) file. Here, you have a couple of options: you can either make each record vary in length, or you can make each record have a fixed length. Fixed length records are easier for some kinds of searching, like binary searches.
For example, if you know each record is 100 bytes, then you know how to get to the n-th record in the binary file storing the records. Basically, skip 100*(n-1) bytes; the next 100 bytes are the n-th record.
Thus, if the records are sorted by student name, you can very easily use a binary search to find a specific student. This approach will still be fast, albeit not as fast as the RAM-based data structure.
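A sketch of that lookup, assuming a record size of 100 bytes, space-padded ASCII records with the name as the first field, and the usual java.io/java.nio.charset imports:

static final int RECORD_SIZE = 100;   // assumed fixed record length in bytes

// Jump straight to record n (0-based) and read it.
static byte[] readRecord(RandomAccessFile raf, long n) throws IOException {
    byte[] record = new byte[RECORD_SIZE];
    raf.seek(n * RECORD_SIZE);
    raf.readFully(record);
    return record;
}

// Binary search over records sorted by the leading name field.
static byte[] findByName(RandomAccessFile raf, String name) throws IOException {
    long low = 0, high = raf.length() / RECORD_SIZE - 1;
    while (low <= high) {
        long mid = (low + high) / 2;
        byte[] record = readRecord(raf, mid);
        String recordName = new String(record, StandardCharsets.US_ASCII).trim().split(" ")[0];
        int cmp = recordName.compareTo(name);
        if (cmp == 0) return record;
        if (cmp < 0) low = mid + 1;
        else high = mid - 1;
    }
    return null;   // not found
}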
Using a HashMap as an index
Yet another option is to combine the two approaches I mentioned above. Write the data to a binary file, and store the byte offsets of the records in a hash map. The hash map can use the student name as the key as before, but then stores a long integer offset to the record in the random access file. Thus, to look up a specific student, you find the byte offset using the hash map, and then "seek" to the record in the file and then read it. This last approach works even if the records vary in length.
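A sketch of that offset index, assuming one ASCII record per line in the data file; the file name is a placeholder:

// Build the index: student name -> byte offset of that record in the file.
Map<String, Long> index = new HashMap<>();
try (RandomAccessFile raf = new RandomAccessFile("grades.dat", "r")) {
    long offset = raf.getFilePointer();
    String record;
    while ((record = raf.readLine()) != null) {      // fine for ASCII, one record per line
        index.put(record.split(" ")[0], offset);
        offset = raf.getFilePointer();
    }

    // Later: seek straight to a specific student's record and read it.
    Long pos = index.get("Nickolas");
    if (pos != null) {
        raf.seek(pos);
        System.out.println(raf.readLine());
    }
}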
There is no such thing as a 'line'. There are, however, line delimiters (newline, which is '\n'). You can write a line, but that only writes the data followed by a newline. You can read a line, but again that only reads until it finds a newline character or the end of the file.
So to find line n, you have to keep reading until you've counted n-1 newline characters, and then keep reading until you find the next one (or the end of the file).
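A sketch of that counting approach with a RandomAccessFile, assuming 1-based line numbers and '\n' line endings:

// Returns the byte offset at which line n starts (1-based), by counting newlines.
static long offsetOfLine(RandomAccessFile raf, int n) throws IOException {
    raf.seek(0);
    int newlines = 0;
    int b;
    while (newlines < n - 1 && (b = raf.read()) != -1) {
        if (b == '\n') {
            newlines++;
        }
    }
    return raf.getFilePointer();
}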

How can I speed up my Java text file parser?

I am reading about 600 text files, then parsing each file individually and adding all the terms to a map so I can know the frequency of each word within the 600 files (about 400 MB).
My parser functions includes the following steps (ordered):
find text between two tags, which is the relevant text to read in each file.
lowercase all the text
string.split with multiple delimiters.
creating an ArrayList with hyphenated words like "aaa-aa", adding them to the split result above, and removing the single parts "aaa" and "aa" from the String[] (I did this because I wanted "-" to be a delimiter, but I also wanted "aaa-aa" to stay as one word, not "aaa" and "aa")
take the String[] and put the words into a Map (a new HashMap) of (word, frequency)
print everything.
It is taking me about 8 minutes and 48 seconds on a dual-core 2.2 GHz machine with 2 GB of RAM. I would like advice on how to speed this process up. Should I expect it to be this slow? And if possible, how can I know (in NetBeans) which functions are taking the most time to execute?
unique words found: 398752.
CODE:
File file = new File(dir);
String[] files = file.list();
for (int i = 0; i < files.length; i++) {
BufferedReader br = new BufferedReader(
new InputStreamReader(
new BufferedInputStream(
new FileInputStream(dir + files[i])), encoding));
try {
String line;
while ((line = br.readLine()) != null) {
parsedString = parseString(line); // parse the string
m = stringToMap(parsedString, m);
}
} finally {
br.close();
}
}
EDIT: Check this:
[profiler screenshot omitted]
I don't know what to conclude.
EDIT: 80% of the time is spent in this function:
public String [] parseString(String sentence){
// separators; ,:;'"\/<>()[]*~^ºª+&%$ etc..
String[] parts = sentence.toLowerCase().split("[,\\s\\-:\\?\\!\\«\\»\\'\\´\\`\\\"\\.\\\\\\/()<>*º;+&ª%\\[\\]~^]");
Map<String, String> o = new HashMap<String, String>(); // save the hyphened words, aaa-bbb like Map<aaa,bbb>
Pattern pattern = Pattern.compile("(?<![A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû-])[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+-[A-Za-zÁÉÍÓÚÀÃÂÊÎÔÛáéíóúàãâêîôû]+(?![A-Za-z-])");
Matcher matcher = pattern.matcher(sentence);
// Find all matches like this: ("aaa-bb or bbb-cc") and put it to map to later add this words to the original map and discount the single words "aaa-aa" like "aaa" and "aa"
for(int i=0; matcher.find(); i++){
String [] tempo = matcher.group().split("-");
o.put(tempo[0], tempo[1]);
}
//System.out.println("words: " + o);
ArrayList<String> temp = new ArrayList<String>();
temp.addAll(Arrays.asList(parts));
for (Map.Entry<String, String> entry : o.entrySet()) {
String key = entry.getKey();
String value = entry.getValue();
temp.add(key+"-"+value);
if(temp.indexOf(key)!=-1){
temp.remove(temp.indexOf(key));
}
if(temp.indexOf(value)!=-1){
temp.remove(temp.indexOf(value));
}
}
String []strArray = new String[temp.size()];
temp.toArray(strArray);
return strArray;
}
600 files, each file about 0.5MB
EDIT 3: The pattern is no longer compiled each time a line is read. (Updated profiler screenshots omitted.)
Be sure to increase your heap size, if you haven't already, using -Xmx. For this app, the impact may be striking.
The parts of your code that are likely to have the largest performance impact are the ones that are executed the most - which are the parts you haven't shown.
Update after memory screenshot
Look at all those Pattern$6 objects in the screenshot. I think you're recompiling the pattern a lot - maybe for every line. That would take a lot of time.
Update 2 - after code added to question.
Yup - two patterns compiled on every line - the explicit one, and also the "-" in the split (much cheaper, of course). I wish they hadn't added split() to String without it taking a compiled pattern as an argument. I see some other things that could be improved, but nothing else like the big compile. Just compile the pattern once, outside this function, maybe as a static class member.
Try to use a single regex that has a group matching each word within the tags - that way a single regex could be used for your entire input and there would be no separate "split" stage.
Otherwise your approach seems reasonable, although I don't understand what you mean by "get the String [] ..." - I thought you were using an ArrayList. In any event, try to minimize the creation of objects, for both construction cost and garbage collection cost.
Is it just the parsing that's taking so long, or is it the file reading as well?
For the file reading, you can probably speed that up by reading the files on multiple threads. But first step is to figure out whether it's the reading or the parsing that's taking all the time so you can address the right issue.
Run the code through the Netbeans profiler and find out where it is taking the most time (right mouse click on the project and select profile, make sure you do time not memory).
Nothing in the code that you have shown us is an obvious source of performance problems. The problem is likely to be something to do with the way that you are parsing the lines or extracting the words and putting them into the map. If you want more advice you need to post the code for those methods, and the code that declares / initializes the map.
My general advice would be to profile the application and see where the bottlenecks are, and use that information to figure out what needs to be optimized.
@Ed Staub's advice is also sound. Running an application with a heap that is too small can result in serious performance problems.
If you aren't already doing it, use BufferedInputStream and BufferedReader to read the files. Double-buffering like that is measurably better than using BufferedInputStream or BufferedReader alone. E.g.:
BufferedReader rdr = new BufferedReader(
new InputStreamReader(
new BufferedInputStream(
new FileInputStream(aFile)
)
/* add an encoding arg here (e.g., ', "UTF-8"') if appropriate */
)
);
If you post relevant parts of your code, there'd be a chance we could comment on how to improve the processing.
EDIT:
Based on your edit, here are a couple of suggestions:
Compile the pattern once and save it as a static variable, rather than compiling every time you call parseString.
Store the values of temp.indexOf(key) and temp.indexOf(value) when you first call them and then use the stored values instead of calling indexOf a second time.
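A sketch of the first suggestion; the character classes are abbreviated here for readability (the real ones in the question include the accented ranges):

// Compiled once when the class loads, not on every call to parseString.
private static final Pattern SEPARATORS =
        Pattern.compile("[,\\s\\-:?!'\"./()<>*;+&%\\[\\]~^]");
private static final Pattern HYPHENATED_WORD =
        Pattern.compile("(?<![A-Za-z-])[A-Za-z]+-[A-Za-z]+(?![A-Za-z-])");

// Inside parseString, reuse the compiled patterns:
//   String[] parts = SEPARATORS.split(sentence.toLowerCase());
//   Matcher matcher = HYPHENATED_WORD.matcher(sentence);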
It looks like it's spending most of its time in regular expressions. I would first try writing the code without using a regular expression, and then use multiple threads if the process still appears to be CPU bound.
For the counter, I would look at using TObjectIntHashMap to reduce the overhead of the counter. I would use only one map, not create an array of string counts which I then use to build another map; this could be a significant waste of time.
Precompile the pattern instead of compiling it every time through that method, and get rid of the double buffering: use new BufferedReader(new FileReader(...)).

Deleting duplicate lines in a file using Java

As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem, is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process to take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
public void stripDuplicatesFromFile(String filename) {
BufferedReader reader = new BufferedReader(new FileReader(filename));
Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
String line;
while ((line = reader.readLine()) != null) {
lines.add(line);
}
reader.close();
BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
for (String unique : lines) {
writer.write(unique);
writer.newLine();
}
writer.close();
}
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
Okay, most answers are a bit silly and slow since they involve adding lines to some hash set or whatever and then moving them back out of that set again. Let me show the most optimal solution in pseudocode:
Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
Read Line.
If not(Line in hashSet)
Add Line to hashset.
Write Line to output.
End If.
End While.
Free hashset.
Close input.
Close output.
Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
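In Java that pseudocode comes out to roughly the following sketch; the file names are placeholders:

Set<String> seen = new HashSet<>();
try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
     BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"))) {
    String line;
    while ((line = in.readLine()) != null) {
        if (seen.add(line)) {      // add() returns false when the line was already seen
            out.write(line);
            out.newLine();
        }
    }
}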
A similar approach using Commons IO:
public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}
Something like this, perhaps:
BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<>();
for (String line; (line = in.readLine()) != null;)
lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
out.println(line);
LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.
You could use Set in the Collections library to store unique, seen values as you read the file.
Set<String> uniqueStrings = new HashSet<String>();
// read your file, looping on newline, putting each line into variable 'thisLine'
uniqueStrings.add(thisLine);
// finish read
for (String uniqueString:uniqueStrings) {
// do your processing for each unique String
// i.e. System.out.println(uniqueString);
}
If the order does not matter, the simplest way is shell scripting:
<infile sort | uniq > outfile
Try a simple HashSet that stores the lines you have already read.
Then iterate over the file.
If you come across duplicates they are simply ignored (as a Set can only contain every element once).
Read in the file, storing the line number and the line: O(n)
Sort it into alphabetical order: O(n log n)
Remove duplicates: O(n)
Sort it into its original line number order: O(n log n)
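A sketch of those four steps, assuming the whole file fits in memory (it does at 40 MB) and the usual java.nio.file and java.util imports; file and variable names are placeholders:

List<String> lines = Files.readAllLines(Paths.get("input.txt"));

// Sort line numbers by the text they point at: O(n log n).
Integer[] order = new Integer[lines.size()];
for (int i = 0; i < order.length; i++) order[i] = i;
Arrays.sort(order, Comparator.comparing(lines::get));

// Keep only the first (lowest) line number of each run of equal lines: O(n).
List<Integer> keep = new ArrayList<>();
String prev = null;
for (int idx : order) {
    String text = lines.get(idx);
    if (!text.equals(prev)) keep.add(idx);
    prev = text;
}

// Restore the original line order and write out: O(n log n) + O(n).
Collections.sort(keep);
try (BufferedWriter out = Files.newBufferedWriter(Paths.get("deduped.txt"))) {
    for (int idx : keep) {
        out.write(lines.get(idx));
        out.newLine();
    }
}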
The hash set approach is OK, but you can tweak it so it does not have to store all the Strings in memory, but rather a logical pointer to the location in the file, so you can go back and read the actual value only when you need it.
Another creative approach is to append the line number to each line, sort all the lines, remove the duplicates (ignoring the last token, which should be the number), and then sort the file again by the last token, stripping it out in the output.
If you could use UNIX shell commands you could do something like the following:
for(i = line 0 to end)
{
sed 's/\$i//2g' ; deletes all repeats
}
This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.
There are two scalable solutions, where by scalable I mean disk- rather than memory-based, depending on whether the procedure should be stable or not, where by stable I mean that the order after removing duplicates is the same. If scalability isn't an issue, then simply use memory for the same sort of method.
For the non stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.
The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.
The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge", don't bother storing the result, just the line numbers to be deleted.
Then copy the original file line by line, ignoring the line numbers you have stored above.
Does it matter in which order the lines come, and how many duplicates are you counting on seeing?
If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing) I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.
I have made two assumptions for this efficient solution:
There is a Blob equivalent of each line, or we can process the lines as binary.
We can save the offset of (or a pointer to) the start of each line.
Based on these assumptions, the solution is:
1. Read a line and save its length in the hashmap as the key, so we have a lighter hashmap. Save a list of the offsets of all lines having that length as the entry for that key. Building this hashmap is O(n).
2. While mapping the offsets of each line into the hashmap, compare the line's blob with the existing entries in the list of line offsets for that length key (skipping entries already set to -1). If a duplicate is found, remove both lines and save the offset -1 in those places in the list.
So consider the complexity and memory usage:
Hashmap memory, space complexity = O(n), where n is the number of lines.
Time complexity: if there are no duplicates but all lines have equal length (say length m, with n lines), that would be O(n). Since we assume we can compare blobs, m does not matter.
That was the worst case.
In other cases we save on comparisons, although we need a little extra space in the hashmap.
Additionally, we can use MapReduce on the server side to split the set and merge the results later, using the length or the start of the line as the mapper key.
void deleteDuplicates(File filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new LinkedHashSet<String>();
    String line;
    String delims = " ";
    System.out.println("Reading the duplicate contents now and writing to file");
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        StringTokenizer str = new StringTokenizer(line, delims);
        while (str.hasMoreElements()) {
            lines.add((String) str.nextElement());
        }
    }
    reader.close();
    // Write the unique tokens back only after the whole input has been read.
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique + " ");
    }
    writer.close();
    System.out.println(lines);
    System.out.println("Duplicate removal successful");
}
These answers all rely on the file being small enough to store in memory.
If it is OK to sort the file, this is an algorithm that can be used on any sized file.
You need this library: https://github.com/lemire/externalsortinginjava
I assume you start with a file fileDumpCsvFileUnsorted and you will end up with a new file fileDumpCsvFileSorted that is sorted and has no dupes.
ExternalSort.sort(fileDumpCsvFileUnsorted, fileDumpCsvFileSorted);
int numDupes = 0;
File dupesRemoved = new File(fileDumpCsvFileSorted.getAbsolutePath() + ".nodupes");
String previousLine = null;
try (FileWriter fw = new FileWriter(dupesRemoved);
BufferedWriter bw = new BufferedWriter(fw);
FileReader fr = new FileReader(fileDumpCsvFileSorted);
LineIterator lineIterator = new LineIterator(fr)
) {
while (lineIterator.hasNext()) {
String nextLine = lineIterator.nextLine();
if (StringUtils.equals(nextLine, previousLine)) {
++numDupes;
continue;
}
bw.write(String.format("%s%n", nextLine));
previousLine = nextLine;
}
}
logger.info("Removed {} dupes from {}", numDupes, fileDumpCsvFileSorted.getAbsolutePath());
FileUtils.deleteQuietly(fileDumpCsvFileSorted);
FileUtils.moveFile(dupesRemoved, fileDumpCsvFileSorted);
The file fileDumpCsvFileSorted is now created sorted with no dupes.
