How to sort N files - Java

Following this answer -->
How do I sort very large files
I need only the merge step, applied to N already sorted files on disk.
I want to merge them into one big file. My limitation is memory: no more than K lines in memory at once (K < N), so I cannot load them all and then sort. Java preferred.
So far I have tried the code below, but I need a good way to iterate over all N files line by line (no more than K lines in memory) and write the sorted final file to disk.
public void run() {
    try {
        System.out.println(file1 + " Started Merging " + file2);
        FileReader fileReader1 = new FileReader(file1);
        FileReader fileReader2 = new FileReader(file2);
        //......TODO with N ?? ......
        FileWriter writer = new FileWriter(file3);
        BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
        BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
        String line1 = bufferedReader1.readLine();
        String line2 = bufferedReader2.readLine();
        //Merge 2 files based on which string is greater.
        while (line1 != null || line2 != null) {
            if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                writer.write(line2 + "\r\n");
                line2 = bufferedReader2.readLine();
            } else {
                writer.write(line1 + "\r\n");
                line1 = bufferedReader1.readLine();
            }
        }
        System.out.println(file1 + " Done Merging " + file2);
        new File(file1).delete();
        new File(file2).delete();
        writer.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
regards,

You can use something like this
public static void mergeFiles(String target, String... input) throws IOException {
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String,BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());
    try(FileWriter fw = new FileWriter(target)) {
        String header = null;
        for(String file: input) {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                Files.newInputStream(Paths.get(file), StandardOpenOption.DELETE_ON_CLOSE)));
            String line = br.readLine();
            if(line == null) br.close();
            else {
                if(header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for(;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if(next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    }
    catch(Throwable t) {
        for(Map.Entry<String,BufferedReader> br: lines) try {
            br.getValue().close();
        } catch(Throwable next) {
            if(t != next) t.addSuppressed(next);
        }
        throw t;
    }
}
Note that this code, unlike the code in your question, handles the header line. Like the original code, it will delete the input files. If that's not intended, you can remove the DELETE_ON_CLOSE option and simplify the entire reader construction to
BufferedReader br = new BufferedReader(new FileReader(file));
It has exactly as many lines in memory as you have files.
While in principle it is possible to hold fewer line strings in memory and re-read them when needed, it would be a performance disaster for a questionable little saving. For example, you already have N strings in memory when calling this method, simply because you have N file names.
However, if you want to reduce the number of lines held at the same time at all costs, you can simply use the method shown in your question: merge the first two files into a temporary file, merge that temporary file with the third into another temporary file, and so on, until merging the temporary file with the last input file into the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering while trying to mitigate the horrible performance of this approach.
Likewise, you can use the method shown above to merge K files into a temporary file, then merge that temporary file with the next K-1 files, and so on, until merging the temporary file with the remaining K-1 or fewer files into the final result, for a memory consumption that scales with K < N. This approach allows tuning K to a reasonable ratio to N, trading memory for speed. I think, in most practical cases, K == N will work just fine.
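For illustration, a minimal sketch of such a chunked driver loop, reusing the mergeFiles method shown above (so, with DELETE_ON_CLOSE in place, the inputs and intermediate temporary files are deleted as they are consumed). The name mergeInChunks, the temp-file handling and the assumption K >= 2 are illustrative, not part of the original answer.
public static void mergeInChunks(int k, String target, String... input) throws IOException {
    // assumes k >= 2, so each round makes progress
    List<String> remaining = new ArrayList<>(Arrays.asList(input));
    String carry = null;                          // result of the previous round, if any
    while (!remaining.isEmpty()) {
        // take up to K inputs per round (K-1 when we also carry a previous result)
        int take = Math.min(carry == null ? k : k - 1, remaining.size());
        List<String> round = new ArrayList<>();
        if (carry != null) round.add(carry);      // carried file first, so its header is reused
        round.addAll(remaining.subList(0, take));
        remaining.subList(0, take).clear();
        // the last round writes straight to the final target
        String out = remaining.isEmpty()
            ? target
            : Files.createTempFile("merge", ".txt").toString();
        mergeFiles(out, round.toArray(new String[0]));
        carry = out;
    }
}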

@Holger gave a nice answer assuming that K >= N.
You can extend it to the K < N case by using the mark(int) and reset() methods of BufferedReader.
The parameter of mark is the maximum number of characters a single line can have.
The idea is as follows:
Instead of putting all N lines into the TreeSet, you keep only K of them. Whenever you put a new line into the set and it is already 'full', you evict the smallest one from it. Additionally, you reset the reader it came from, so that when you read from it again the same line can pop up.
You have to keep track of the maximum line not kept in the TreeSet; let's call it the lower bound. Once there are no elements in the TreeSet greater than this lower bound, you scan all the files once again and repopulate the set.
I'm not sure whether this approach is optimal, but it should be OK.
Moreover, you have to be aware that BufferedReader has an internal buffer of at least the size of a single line, so that will consume a lot of your memory; perhaps it would be better to maintain the buffering on your own.
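For reference, a minimal sketch of the mark/reset mechanism this relies on; the 1 MB read-ahead limit and the shouldDefer check are placeholders, not part of the answer:
BufferedReader br = new BufferedReader(new FileReader("part1.txt"));
br.mark(1 << 20);             // remember this position; valid while we read at most ~1 MB ahead
String line = br.readLine();  // peek at the next line
if (shouldDefer(line)) {      // hypothetical check, e.g. the set is full and this line is evicted
    br.reset();               // push the line back; the next readLine() returns it again
}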

Related

External Sort GC Overhead

I am writing an external sort to sort a large 2 GB file on disk.
I first split the file into chunks that fit into memory, sort each one individually, and rewrite them back to disk. However, during this process I am getting a "GC overhead limit exceeded" exception in String.split, called from the getModel method. Below is my code.
private static List<Model> getModel(String file, long lineCount, final long readSize) {
    List<Model> modelList = new ArrayList<Model>();
    long read = 0L;
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        //Skip lineCount lines;
        for (long i = 0; i < lineCount; i++)
            br.readLine();
        String line = "";
        while ((line = br.readLine()) != null) {
            read += line.length();
            if (read > readSize)
                break;
            String[] split = line.split("\t");
            String curvature = (split.length >= 7) ? split[6] : "";
            String heading = (split.length >= 8) ? split[7] : "";
            String slope = (split.length == 9) ? split[8] : "";
            modelList.add(new Model(split[0], split[1], split[2], split[3], split[4], split[5], curvature, heading, slope));
        }
        br.close();
        return modelList;
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return null;
}

private static void split(String inputDir, String inputFile, String outputDir, final long readSize) throws IOException {
    long lineCount = 0L;
    int count = 0;
    int writeSize = 100000;
    System.out.println("Reading...");
    List<Model> curModel = getModel(inputDir + inputFile, lineCount, readSize);
    System.out.println("Reading Complete");
    while (curModel.size() > 0) {
        lineCount += curModel.size();
        System.out.println("Sorting...");
        curModel.sort(new Comparator<Model>() {
            @Override
            public int compare(Model arg0, Model arg1) {
                return arg0.compareTo(arg1);
            }
        });
        System.out.println("Sorting Complete");
        System.out.println("Writing...");
        writeFile(curModel, outputDir + inputFile + count, writeSize);
        System.out.println("Writing Complete");
        count++;
        System.out.println("Reading...");
        curModel = getModel(inputDir + inputFile, lineCount, readSize);
        System.out.println("Reading Complete");
    }
}
It makes it through one pass and sorts ~250 MB of data from the file. However, on the second pass it throws the GC overhead exception in the String.split call. I do not want to use external libraries; I want to learn this on my own. The sorting and splitting work, but I cannot understand why the GC throws the overhead exception in String.split.
I'm not sure just what is causing the exception; manipulating large strings, in particular cutting and splicing them, is a huge memory/GC issue. StringBuilder can help, but in general you may have to take more direct control over the process.
To figure out more you probably want to run a profiler with your app. There is one built into the JDK (VisualVM) that is functional. It will show you what objects Java is holding on to; because of the nature of strings, it's possible that you are holding on to a lot of redundant character array data.
Personally I'd try a completely different approach. For instance, what if you sorted the entire file in memory by loading only the first 10(?) sortable characters of each line into an array along with the file location they were read from, sorted that array, and resolved any ties by loading more (or the rest) of the lines that were identical?
If you did something like that, you should be able to seek to each line and copy it to the destination file without ever caching more than one line in memory, and while only reading through the source file twice.
I suppose you could manufacture a file that would defeat this if all the strings were identical until the last couple of characters, so if that ever became an issue you might have to be able to flush the full strings you've cached (Java's soft references can do this for you automatically; it's not particularly hard).
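A rough sketch of that prefix-index idea, under a few added assumptions: the source and destination paths are in src and dst, 10-character prefixes are enough to order the lines (ties are not resolved here), and RandomAccessFile.readLine's crude byte-to-char decoding is acceptable for the data:
List<Map.Entry<String, Long>> index = new ArrayList<>();
try (RandomAccessFile raf = new RandomAccessFile(src, "r")) {
    // pass 1: record a short sort key and the byte offset of every line
    long offset = raf.getFilePointer();
    String line;
    while ((line = raf.readLine()) != null) {
        String prefix = line.substring(0, Math.min(10, line.length()));
        index.add(new AbstractMap.SimpleEntry<>(prefix, offset));
        offset = raf.getFilePointer();
    }
    index.sort(Map.Entry.comparingByKey());   // ties would need a second look at the full lines
    // pass 2: seek back to each line in sorted order and copy it out
    try (BufferedWriter out = new BufferedWriter(new FileWriter(dst))) {
        for (Map.Entry<String, Long> entry : index) {
            raf.seek(entry.getValue());
            out.write(raf.readLine());
            out.newLine();
        }
    }
}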
Based on how I read your implementation, readSize only makes sure that you get the first block of roughly that size. You are not reading the 2nd or 3rd block, hence it's not actually a complete external sort.
read += line.length();
if (read > readSize)
    break;
String[] split = line.split("\t");
Even though you are splitting each line, you seem to be using only the first nine fields, and then checking the number of fields in each line. This means your data is not uniform.

Java 6: Copy and manipulate files

What I want to do...
I have XML files with names like SomeName999999blablabla.xml with lots of content, where almost every line contains the string "999999". I need identical XML files where 999999 is replaced by 888888, 777777, and so on, in the name and in the file's content.
The problem...
My code works fine and actually creates all the files I need, BUT there are sometimes tiny errors. Like in one line an E is "randomly" replaced by a D (it seems to always be one letter lower than what it's supposed to be, but I can't confirm that 100%). It's not a lot, maybe one or two instances in 60 files, each file being about 100 MB. But since it's XML, this is a real problem, as it is often a schema violation, which causes a crash in later processing.
I have absolutely no idea where this is coming from or how to fix it, please help.
My code so far...
private void createMandant(String mandant) throws Exception {
    String line;
    File dir = new File(TestConstants.getXmlDirectory());
    for (File file : dir.listFiles()) {
        if (file.getName().endsWith((".xml")) && file.getName().contains("999999")) {
            BufferedReader br = new BufferedReader(new FileReader(file));
            FileWriter fw = new FileWriter(file.getAbsolutePath().replace("999999", mandant));
            while ((line = br.readLine()) != null) {
                fw.write(line.replace("999999", mandant) + "\r\n");
            }
            br.close();
            fw.close();
        }
    }
}
Environment...
We are on Java 6. As mentioned before, the files are quite large: around 100 MB, several hundred thousand lines each.
It appears to be a problem with String.replace()
I have replaced it with StringBuilder:
while ((line = br.readLine()) != null) {
    index = 0;
    // fw.write(line.replace("999999", mandant) + "\r\n");
    StringBuilder builder = new StringBuilder(line);
    index = builder.indexOf("999999");
    if (index >= 0) { // indexOf returns -1 when absent; >= 0 also catches a match at the very start of the line
        fw.write(builder.replace(index, index + 6, mandant).toString() + "\r\n");
    } else {
        fw.write(line + "\r\n");
    }
}
... and now it seems to work. Two runs have already completed without any problems.
But that seems very strange. Could it really be that a heavily used function like String.replace() just randomly gets single letters wrong every few million method calls?
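One caveat worth adding (not part of the original post): builder.replace as used above rewrites only the first occurrence in each line, whereas String.replace rewrites all of them. If a line can contain 999999 more than once, a small loop over indexOf is needed, for example:
StringBuilder builder = new StringBuilder(line);
int index = builder.indexOf("999999");
while (index >= 0) {
    builder.replace(index, index + 6, mandant);
    index = builder.indexOf("999999", index + mandant.length());
}
fw.write(builder.toString() + "\r\n");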

Check if large list of words has specific length

I have a dictionary text file of around 60,000 words. I would like to read in that text file and count how many of its words have a certain length n, provided by the user. At the recommendation of my professor, I'm going to create a method that expands the array to accommodate the different n values. I know how to do that. My question is: how do I initially read the text file and determine whether each of the 60,000 words has a specific length n?
I know I have to use a loop and import the file (although I've never done a throws declaration):
Scanner inputFile = new Scanner(new File("2of12inf.txt"));
for(int i = 0; i < sizeWord; i++) {
}
But what I would normally do is use charAt(i) and check whether the word has n characters. But I can't possibly do that for 60,000 words. Suggestions?
try (BufferedReader br = new BufferedReader(new FileReader(new File("2of12inf.txt")))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process the line.
        int lineLength = line.length();
        // assuming each line contains one word, do whatever you want to with this length
    }
} catch (Exception e) {
    System.out.println("Exception caught! Should handle it accordingly: " + e);
}
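For instance, to count how many words have exactly n characters (a sketch; the variable n, the counter and the one-word-per-line assumption are illustrative):
int n = 7;        // the length supplied by the user
int count = 0;
try (BufferedReader br = new BufferedReader(new FileReader("2of12inf.txt"))) {
    String word;
    while ((word = br.readLine()) != null) {
        if (word.trim().length() == n) {
            count++;
        }
    }
}
System.out.println(count + " words have exactly " + n + " characters");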

Get the value from the file in Java

How do I fetch the value corresponding to i (if i = 5, I must get 57.05698808926067) from a text file myFile.txt? The values may continue up to 25,000.
0->37.6715587270802
1->40.02056806304368
2->351.65161070935005
3->54.74486689415533
4->86.12063488461266
5->57.05698808926067
6->0.0
7->56.078343612293374
Use this code:
// Note: bounds checking left as an exercise
public double getDoubleFromFile(final String filename, final int index)
    throws IOException
{
    final List<String> list = Files.readAllLines(Paths.get(filename),
        StandardCharsets.UTF_8);
    // each line looks like "i->value", so parse the part after the arrow
    final String line = list.get(index);
    return Double.parseDouble(line.substring(line.indexOf("->") + 2));
}
However, this slurps the whole file into memory, which is fine if you have to query it several times. Another solution:
public double getDoubleFromFile(final String filename, final int index)
    throws IOException
{
    final Path path = Paths.get(filename);
    int i = 0;
    String line;
    try (
        final BufferedReader reader = Files.newBufferedReader(path,
            StandardCharsets.UTF_8);
    ) {
        while ((line = reader.readLine()) != null) {
            if (index == i)
                // line looks like "i->value"; parse the part after the arrow
                return Double.parseDouble(line.substring(line.indexOf("->") + 2));
            i++;
        }
        return Double.NaN;
    }
}
Read the file line by line, take the characters before "->", parse them to an int and compare with i. Then take the value after "->".
You could also just read line by line, increment an index variable every time you read a line, and when the index is equal to i, take the string after "->".
The simple way is to read line by line until you get to your line:
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
int i = 0;
while ((line = br.readLine()) != null) {
    if (i == myLineNumber) {
        // process the line.
        break;
    }
    i++;
}
br.close();
If the size of each line is constant, you can use BufferedReader.skip.
If the file is small, use FileUtils.readLines (Apache Commons IO).
Choose what is best for you
Since 25,000 lines is not that huge, I would load the whole file into an array and use i as an index into the array. If, however, I had harsh constraints on memory usage, I would use a RandomAccessFile: set the position somewhere around i * average-line-length, find the next '\n', then read the index. If the index is the one I was seeking, I would read the rest of the line; otherwise I would move up if the index read is greater, or down if it is smaller, and repeat the process.
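A minimal sketch of that estimate-and-probe idea, under added assumptions: the "i->value" line format from the question, a rough averageLineLength guess, and a simple back-up step when the probe overshoots:
static double findValue(String file, int target, int averageLineLength) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        long guess = (long) target * averageLineLength;
        while (true) {
            raf.seek(Math.max(0, guess));
            if (guess > 0) raf.readLine();            // discard the partial line we landed in
            String line = raf.readLine();
            if (line == null || Integer.parseInt(line.substring(0, line.indexOf("->"))) > target) {
                guess -= 10L * averageLineLength;     // overshot (or past EOF): back up and probe again
                continue;
            }
            // at or before the target: scan forward until the index matches
            while (line != null) {
                int arrow = line.indexOf("->");
                if (Integer.parseInt(line.substring(0, arrow)) == target)
                    return Double.parseDouble(line.substring(arrow + 2));
                line = raf.readLine();
            }
            return Double.NaN;                        // not found
        }
    }
}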

Java - Comparing Lists

I have a program written in Java that reads a file, which is simply a list of strings, into a LinkedHashMap. Then it takes a second file, which consists of two columns, and for each row checks whether the right-hand term matches one of the terms from the HashMap. The problem is it's running very slowly.
Here's a code snippet; this is where it compares the second file to the HashMap terms:
String output = "";
infile = new File("2columns.txt");
try {
    in = new BufferedReader(new FileReader(infile));
} catch (FileNotFoundException e2) {
    System.out.println("2columns.txt" + " not found");
}
try {
    fw = new FileWriter("newfile.txt");
    out = new PrintWriter(fw);
    try {
        String str = in.readLine();
        while (str != null) {
            StringTokenizer strtok = new StringTokenizer(str);
            strtok.nextToken();
            String strDest = strtok.nextToken();
            System.out.println("Term = " + strDest);
            //if (uniqList.contains(strDest)) {
            if (uniqMap.get(strDest) != null) {
                output += str + "\r\n";
                System.out.println("Matched! Added: " + str);
            }
            str = in.readLine();
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    out.print(output);
I got a performance boost initially from switching from an ArrayList to the LinkedHashMap, but it's still taking a long time. What can I do to speed this up?
Your major bottleneck may be that you are recreating a StringTokenizer for every iteration of the while loop. Moving this outside the loop could help considerably. Minor speedups can be obtained by moving the String definition outside the while loop.
The biggest speedup will probably come from using a StreamTokenizer. See below for an example.
Oh, and use a HashMap instead of a LinkedHashMap, as @Doug Ayers says in the comments above. :)
And @MДΓΓ БДLL's suggestion of profiling your code is bang on. Check out this Eclipse profiling example.
Reader r = new BufferedReader(new FileReader(infile));
StreamTokenizer strtok = new StreamTokenizer(r);
String strDest = "";
while (strtok.nextToken() != StreamTokenizer.TT_EOF) {
    strDest = strtok.sval; //strtok.toString() might be safer, but slower
    strtok.nextToken();
    System.out.println("Term = " + strtok.sval);
    //if (uniqList.contains(strDest)) {
    if (uniqMap.get(strtok.sval) != null) {
        // rebuild the row from the two tokens; the str/in variables of the original loop no longer exist here
        output += strDest + " " + strtok.sval + "\r\n";
        System.out.println("Matched! Added: " + strDest + " " + strtok.sval);
    }
}
One final thought (and I'm not confident on this one) is that writing to a file may also be faster if you do it in one go at the end, i.e. store all your matches in a buffer of some sort and do the writing in one hit.
StringTokenizer is a legacy class. The recommended replacement is the String split method.
Some of the try blocks might be consolidated. You can have multiple catches for a single try.
The suggestion to use HashMap instead of LinkedHashMap is a good one. Performance for gets and puts is a smidgen faster since there is no need to maintain a list structure.
The "output" string should be a StringBuilder rather than a String. That could help a lot.
