split very large text file by max rows - java

I want to split a huge file containing strings into a set of new (smaller) files and tried to use NIO.2.
I do not want to load the whole file into memory, so I tried it with BufferedReader.
The smaller text files should be limited by the number of text rows.
The solution works; however, I want to ask if someone knows a solution with better performance using Java 8 (maybe lambdas with the stream() API?) and NIO.2:
public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    int i = 1;
    try (BufferedReader reader = Files.newBufferedReader(bigFile)) {
        String line = null;
        int lineNum = 1;
        Path splitFile = Paths.get(i + "split.txt");
        BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
        while ((line = reader.readLine()) != null) {
            if (lineNum > maxRows) {
                writer.close();
                lineNum = 1;
                i++;
                splitFile = Paths.get(i + "split.txt");
                writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            }
            writer.append(line);
            writer.newLine();
            lineNum++;
        }
        writer.close();
    }
}

Beware of the difference between the direct use of InputStreamReader/OutputStreamWriter (and their subclasses) and the Reader/Writer factory methods of Files. While the former use the system's default encoding when no explicit charset is given, the latter always default to UTF-8. So I strongly recommend always specifying the desired charset, even if it is Charset.defaultCharset() or StandardCharsets.UTF_8, to document your intention and to avoid surprises if you switch between the various ways to create a Reader or Writer.
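For illustration, a minimal sketch of the difference (the file names here are just placeholders; FileReader and FileWriter fall into the first group):
// Uses the platform's default charset when no charset is given:
Reader r1 = new InputStreamReader(new FileInputStream("in.txt"));
Writer w1 = new OutputStreamWriter(new FileOutputStream("out.txt"));
// The Files factory methods default to UTF-8 instead:
BufferedReader r2 = Files.newBufferedReader(Paths.get("in.txt"));
BufferedWriter w2 = Files.newBufferedWriter(Paths.get("out.txt"));
// Better: state the charset explicitly, whichever one you actually mean:
BufferedReader r3 = Files.newBufferedReader(Paths.get("in.txt"), StandardCharsets.UTF_8);
Writer w3 = new OutputStreamWriter(new FileOutputStream("out.txt"), Charset.defaultCharset());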
If you want to split at line boundaries, there is no way around looking into the file's contents, so you can't optimize it the way you can when merging.
If you are willing to sacrifice portability, you could try some optimizations. If you know that the charset encoding unambiguously maps '\n' to (byte)'\n', as is the case for most single-byte encodings as well as for UTF-8, you can scan for line breaks on the byte level to get the file positions for the split and avoid any data transfer from your application to the I/O system.
public void splitTextFiles(Path bigFile, int maxRows) throws IOException {
    // assumes: import static java.nio.file.StandardOpenOption.*;
    MappedByteBuffer bb;
    try (FileChannel in = FileChannel.open(bigFile, READ)) {
        bb = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
    }
    for (int start = 0, pos = 0, end = bb.remaining(), i = 1, lineNum = 1; pos < end; lineNum++) {
        while (pos < end && bb.get(pos++) != '\n');
        if (lineNum < maxRows && pos < end) continue;
        Path splitFile = Paths.get(i++ + "split.txt");
        // if you want to overwrite existing files use CREATE, TRUNCATE_EXISTING
        try (FileChannel out = FileChannel.open(splitFile, CREATE_NEW, WRITE)) {
            bb.position(start).limit(pos);
            while (bb.hasRemaining()) out.write(bb);
            bb.clear();
            start = pos;
            lineNum = 0;
        }
    }
}
The drawbacks are that it doesn't work with encodings like UTF-16 or EBCDIC and that, unlike BufferedReader.readLine(), it won't support a lone '\r' as line terminator, as used in old Mac OS 9.
Further, it only supports files smaller than 2 GB; the limit is likely even smaller on 32-bit JVMs due to the limited virtual address space. For files larger than that limit, it would be necessary to iterate over chunks of the source file and map them one after another.
These issues could be fixed, but they would raise the complexity of this approach. Given that the speed improvement is only about 15% on my machine (I didn't expect much more, as the I/O dominates here) and would shrink further as the complexity rises, I don't think it's worth it.
The bottom line is that for this task the Reader/Writer approach is sufficient, but you should take care about the charset used for the operation.
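For example, the method from the question could take the charset as a parameter so the intention is always explicit (a sketch; the behavior is otherwise unchanged):
public void splitTextFiles(Path bigFile, int maxRows, Charset cs) throws IOException {
    int i = 1;
    try (BufferedReader reader = Files.newBufferedReader(bigFile, cs)) {
        String line;
        int lineNum = 1;
        BufferedWriter writer = Files.newBufferedWriter(
                Paths.get(i + "split.txt"), cs, StandardOpenOption.CREATE);
        while ((line = reader.readLine()) != null) {
            if (lineNum > maxRows) {
                writer.close();
                lineNum = 1;
                i++;
                writer = Files.newBufferedWriter(
                        Paths.get(i + "split.txt"), cs, StandardOpenOption.CREATE);
            }
            writer.append(line);
            writer.newLine();
            lineNum++;
        }
        writer.close();
    }
}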

I made a slight modification to #nimo23's code, taking into account the option of adding a header and a footer for each of the split files. It also outputs the files into a directory with the same name as the original file, with _split appended to it. The code is below:
public static void splitTextFiles(String fileName, int maxRows, String header, String footer) throws IOException
{
    File bigFile = new File(fileName);
    int i = 1;
    String ext = fileName.substring(fileName.lastIndexOf("."));
    String fileNoExt = bigFile.getName().replace(ext, "");
    File newDir = new File(bigFile.getParent() + "\\" + fileNoExt + "_split");
    newDir.mkdirs();
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName)))
    {
        String line = null;
        int lineNum = 1;
        Path splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%03d", i) + ext);
        BufferedWriter writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
        while ((line = reader.readLine()) != null)
        {
            if (lineNum == 1)
            {
                writer.append(header);
                writer.newLine();
            }
            writer.append(line);
            writer.newLine();
            lineNum++;
            if (lineNum > maxRows)
            {
                writer.append(footer);
                writer.close();
                lineNum = 1;
                i++;
                splitFile = Paths.get(newDir.getPath() + "\\" + fileNoExt + "_" + String.format("%03d", i) + ext);
                writer = Files.newBufferedWriter(splitFile, StandardOpenOption.CREATE);
            }
        }
        if (lineNum <= maxRows) // early exit
        {
            writer.append(footer);
        }
        writer.close();
    }
    System.out.println("file '" + bigFile.getName() + "' split into " + i + " files");
}

Related

Using RandomAccessFile along with BufferedReader to speed up file read

I have to:
Read a large text file line by line.
Note down the file pointer position after every line read.
Stop the file read if the running time is greater than 30 seconds.
Resume from the last noted file pointer in a new process.
What I am doing:
Using RandomAccessFile.getFilePointer() to note the file pointer.
Wrapping the RandomAccessFile in a BufferedReader to speed up the file read process, as per this answer.
When the time exceeds 30 seconds, I stop reading the file. I then restart the process with a new RandomAccessFile and use the RandomAccessFile.seek method to move the file pointer to where I left off.
Problem:
As I am reading through a BufferedReader wrapped around the RandomAccessFile, it seems the file pointer moves far ahead in a single call to BufferedReader.readLine(). However, if I use RandomAccessFile.readLine() directly, the file pointer moves properly, step by step, in the forward direction.
Using BufferedReader as a wrapper:
RandomAccessFile randomAccessFile = new RandomAccessFile("mybigfile.txt", "r");
BufferedReader brRafReader = new BufferedReader(new FileReader(randomAccessFile.getFD()));
while((line = brRafReader.readLine()) != null) {
System.out.println(line+", Position : "+randomAccessFile.getFilePointer());
}
Output:
Line goes here, Position : 13040
Line goes here, Position : 13040
Line goes here, Position : 13040
Line goes here, Position : 13040
Using Direct RandomAccessFile.readLine
RandomAccessFile randomAccessFile = new RandomAccessFile("mybigfile.txt", "r");
while((line = randomAccessFile.readLine()) != null) {
System.out.println(line+", Position : "+randomAccessFile.getFilePointer());
}
Output: (This is as expected. File pointer moving properly with each call to readline)
Line goes here, Position : 11011
Line goes here, Position : 11089
Line goes here, Position : 12090
Line goes here, Position : 13040
Could anyone tell me what I am doing wrong here? Is there any way I can speed up the reading process using RandomAccessFile?
The reason for the observed behavior is that, as the name suggests, the BufferedReader is buffered. It reads a larger chunk of data at once (into a buffer), and returns only the relevant parts of the buffer contents - namely, the part up to the next \n line separator.
I think there are, broadly speaking, two possible approaches:
1. You could implement your own buffering logic.
2. You could use some ugly reflection hack to obtain the required buffer offset.
For 1., you would no longer use RandomAccessFile#readLine. Instead, you'd do your own buffering via
byte buffer[] = new byte[8192];
...
// In a loop:
int read = randomAccessFile.read(buffer);
// Figure out where a line break `\n` appears in the buffer,
// return the resulting lines, and take the position of the `\n`
// into account when storing the "file pointer"
As the vague comment indicates: This may be cumbersome and fiddly. You'd basically re-implement what the readLine method does in the BufferedReader class. And at this point, I don't even want to mention the headaches that different line separators or character sets could cause.
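To make that a bit more concrete, here is a rough, hypothetical sketch of approach 1. It assumes '\n'-only line separators and a single-byte or UTF-8 encoding; the class and method names are made up for illustration. On resume, you would construct it with the previously saved offset.
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Reads lines through its own buffer while tracking the exact file offset at which
// the next line starts, so that offset can be saved and later passed to seek().
class TrackingLineReader implements Closeable {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192];
    private int bufLen = 0, bufPos = 0;
    private long lineStart; // file offset of the first byte of the next unread line

    TrackingLineReader(RandomAccessFile raf, long startOffset) throws IOException {
        this.raf = raf;
        this.lineStart = startOffset;
        raf.seek(startOffset);
    }

    long getLineStart() { return lineStart; } // persist this value to resume later

    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (true) {
            if (bufPos == bufLen) { // refill the buffer
                bufLen = raf.read(buf);
                bufPos = 0;
                if (bufLen == -1) { // end of file: return the last partial line, if any
                    lineStart += line.size();
                    return line.size() == 0 ? null
                            : new String(line.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            byte b = buf[bufPos++];
            if (b == '\n') { // line complete ('\r' handling is omitted in this sketch)
                lineStart += line.size() + 1; // +1 for the '\n' itself
                return new String(line.toByteArray(), StandardCharsets.UTF_8);
            }
            line.write(b);
        }
    }

    @Override
    public void close() throws IOException { raf.close(); }
}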
For 2., you could simply access the field of the BufferedReader that stores the buffer offset. This is implemented in the example below. Of course, this is a somewhat crude solution, but mentioned and shown here as a simple alternative, depending on how "sustainable" the solution should be and how much effort you are willing to invest.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class LargeFileRead {

    public static void main(String[] args) throws Exception {
        String fileName = "myBigFile.txt";
        long before = System.nanoTime();
        List<String> result = readBuffered(fileName);
        //List<String> result = readDefault(fileName);
        long after = System.nanoTime();
        double ms = (after - before) / 1e6;
        System.out.println("Reading took " + ms + "ms "
            + "for " + result.size() + " lines");
    }

    private static List<String> readBuffered(String fileName) throws Exception {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(fileName, "r");
        BufferedReader brRafReader = new BufferedReader(
            new FileReader(randomAccessFile.getFD()));
        String line = null;
        long currentOffset = 0;
        long previousOffset = -1;
        while ((line = brRafReader.readLine()) != null) {
            long fileOffset = randomAccessFile.getFilePointer();
            if (fileOffset != previousOffset) {
                if (previousOffset != -1) {
                    currentOffset = previousOffset;
                }
                previousOffset = fileOffset;
            }
            int bufferOffset = getOffset(brRafReader);
            long realPosition = currentOffset + bufferOffset;
            System.out.println("Position : " + realPosition
                + " with FP " + randomAccessFile.getFilePointer()
                + " and offset " + bufferOffset);
            lines.add(line);
        }
        return lines;
    }

    private static int getOffset(BufferedReader bufferedReader) throws Exception {
        Field field = BufferedReader.class.getDeclaredField("nextChar");
        int result = 0;
        try {
            field.setAccessible(true);
            result = (Integer) field.get(bufferedReader);
        } finally {
            field.setAccessible(false);
        }
        return result;
    }

    private static List<String> readDefault(String fileName) throws Exception {
        List<String> lines = new ArrayList<String>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(fileName, "r");
        String line = null;
        while ((line = randomAccessFile.readLine()) != null) {
            System.out.println("Position : " + randomAccessFile.getFilePointer());
            lines.add(line);
        }
        return lines;
    }
}
(Note: The offsets may still appear to be off by 1, but this is due to the line separator not being taken into account in the position. This could be adjusted if necessary)
NOTE: This is only a sketch. The RandomAccessFile objects should be closed properly when reading is finished, but that depends on how the reading is supposed to be interrupted when the time limit is exceeded, as described in the question
BufferedReader reads a block of data from the file, 8 KB by default. Finding the line breaks in order to return the next line is done in that buffer.
I guess this is why you see a huge increment in the physical file position.
RandomAccessFile will not use a buffer when reading the next line; it reads byte after byte, which is really slow.
How is performance when you just use a BufferedReader and remember the line you need to continue from?
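If an exact byte offset is not strictly required, a hedged sketch of that idea: persist the number of lines already processed instead of a file pointer, and skip that many lines when resuming. Here readSavedLineCount, saveLineCount and handle are assumed helpers, not part of the question.
long processed = readSavedLineCount(); // assumed helper: last saved progress, 0 on first run
try (BufferedReader br = new BufferedReader(new FileReader("mybigfile.txt"))) {
    for (long i = 0; i < processed; i++) {
        if (br.readLine() == null) break; // nothing left to resume
    }
    long deadline = System.currentTimeMillis() + 30_000; // 30 second budget
    String line;
    while (System.currentTimeMillis() < deadline && (line = br.readLine()) != null) {
        handle(line); // assumed processing callback
        processed++;
    }
    saveLineCount(processed); // assumed helper: persist progress for the next run
}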

How to sort N files

Following this answer -->
How do I sort very large files
I need only the merge function on N already sorted files on disk.
I want to merge them into one big file. My limitation is memory: not more than K lines in memory (K < N), so I cannot fetch all of them and then sort; preferably in Java.
So far I have tried the code below, but I need a good way to iterate over all N files line by line (not more than K lines in memory) and store the sorted final file to disk.
public void run() {
    try {
        System.out.println(file1 + " Started Merging " + file2 );
        FileReader fileReader1 = new FileReader(file1);
        FileReader fileReader2 = new FileReader(file2);
        //......TODO with N ?? ......
        FileWriter writer = new FileWriter(file3);
        BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
        BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
        String line1 = bufferedReader1.readLine();
        String line2 = bufferedReader2.readLine();
        //Merge 2 files based on which string is greater.
        while (line1 != null || line2 != null) {
            if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                writer.write(line2 + "\r\n");
                line2 = bufferedReader2.readLine();
            } else {
                writer.write(line1 + "\r\n");
                line1 = bufferedReader1.readLine();
            }
        }
        System.out.println(file1 + " Done Merging " + file2 );
        new File(file1).delete();
        new File(file2).delete();
        writer.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
You can use something like this:
public static void mergeFiles(String target, String... input) throws IOException {
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String,BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());

    try(FileWriter fw = new FileWriter(target)) {
        String header = null;
        for(String file: input) {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                Files.newInputStream(Paths.get(file), StandardOpenOption.DELETE_ON_CLOSE)));
            String line = br.readLine();
            if(line == null) br.close();
            else {
                if(header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for(;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if(next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    }
    catch(Throwable t) {
        for(Map.Entry<String,BufferedReader> br: lines) try {
            br.getValue().close();
        } catch(Throwable next) {
            if(t != next) t.addSuppressed(next);
        }
        throw t;
    }
}
Note that this code, unlike the code in your question, handles the header line. Like the original code, it will delete the input files. If that's not intended, you can remove the DELETE_ON_CLOSE option and simplify the entire reader construction to
BufferedReader br = new BufferedReader(new FileReader(file));
It holds exactly as many lines in memory as you have files.
While it is possible, in principle, to hold fewer line strings in memory and re-read them when needed, this would be a performance disaster for a questionable little saving. For example, you already have N strings in memory when calling this method, simply because you have N file names.
However, when you want to reduce the number of lines held at the same time at all costs, you can simply use the method shown in your question: merge the first two files into a temporary file, merge that temporary file with the third into another temporary file, and so on, until merging the temporary file with the last input file yields the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering in its attempt to mitigate the horrible performance of this approach.
Likewise, you can use the method shown above to merge K files into a temporary file, then merge that temporary file with the next K-1 files, and so on, until merging the temporary file with the remaining K-1 or fewer files yields the final result, for a memory consumption scaling with K < N. This approach allows you to tune K to a reasonable ratio to N, trading memory for speed. I think that in most practical cases K == N will work just fine.
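A hypothetical driver for that last scheme, built on the mergeFiles method above (the temporary file names are made up; K must be at least 2):
static String mergeAtMostK(List<String> inputs, int k) throws IOException {
    // Merge the first K inputs, then repeatedly merge the intermediate result
    // with the next K-1 inputs, so that at most K readers are open at a time.
    int tmp = 0;
    int consumed = Math.min(k, inputs.size());
    String result = "merged_tmp" + tmp++ + ".txt";
    mergeFiles(result, inputs.subList(0, consumed).toArray(new String[0]));
    while (consumed < inputs.size()) {
        int next = Math.min(consumed + k - 1, inputs.size());
        List<String> group = new ArrayList<>();
        group.add(result); // the previous intermediate result goes first, keeping its header
        group.addAll(inputs.subList(consumed, next));
        result = "merged_tmp" + tmp++ + ".txt";
        mergeFiles(result, group.toArray(new String[0]));
        consumed = next;
    }
    return result; // path of the final merged file
}
With the DELETE_ON_CLOSE readers above, each intermediate file is removed automatically once it has been merged into the next round's output.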
@Holger gave a nice answer assuming that K >= N.
You can extend it to the K < N case by using the mark(int) and reset() methods of the BufferedInputStream.
The parameter of mark is the maximum number of bytes a single line can have.
The idea is as follows:
Instead of putting all N lines in the TreeMap, you can have only K of them. Whenever you put a new line into the set and it is already 'full', you evict the smallest one from it. Additionally, you reset the stream it came from, so that when you read it again the same data can pop up.
You have to keep track of the maximum line not kept in the TreeSet; let's call it the lower bound. Once there are no elements in the TreeSet greater than the maintained lower bound, you scan all the files once again and repopulate the set.
I'm not sure whether this approach is optimal, but it should be OK.
Moreover, you have to be aware that BufferedInputStream has an internal buffer at least the size of a single line, so that will consume a lot of your memory; perhaps it would be better to maintain the buffering on your own.

Write string to huge file

I know there are several threads about this problem, but I think my problem is a little bit different because of the size.
In my example I want to write 1.7 million lines to a text file. In the worst case there could be many more. These lines are created for the SQL loader, to load data into a table quickly, so the file can get very large because the SQL loader can handle that.
Now I want to write this big file as fast as I can. This is my current method:
BufferedWriter bw = new BufferedWriter(new FileWriter("out.txt"), 40000);
int u = profils.size() - 1;
for (int z = 0; z < u; z++) {
    for (int b = 0; b < z; b++) {
        p = getValue();
        if (!Double.isNaN(p) & p > 0.55) {
            bw.write(map.get(z) + ";" + map.get(b) + ";" + p + "\n");
        }
    }
}
bw.close();
For my 1.7 million lines I need about 20 minutes. Can I handle that faster with any method that I don't know?
FileChannel:
File out = new File("out.txt");
FileOutputStream fileOutputStream = new FileOutputStream(out, true);
FileChannel fileChannel = fileOutputStream.getChannel();
ByteBuffer byteBuffer = null;
int u = profils.size() - 1;
for (int z = 0; z < u; z++) {
    for (int b = 0; b < z; b++) {
        p = getValue();
        if (!Double.isNaN(p) & p > 0.55) {
            String str = indexToSubstID.get(z) + ";" + indexToSubstID.get(b) + ";" + p + "\n";
            byteBuffer = ByteBuffer.wrap(str.getBytes(Charset.forName("ISO-8859-1")));
            fileChannel.write(byteBuffer);
        }
    }
}
fileOutputStream.close();
FileChannel is your way to go. It is used for huge amounts of writes.
Read the API documentation here.
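The FileChannel variant in the question wraps a new ByteBuffer per line, which means one small write() per row. A sketch of how the same loop could reuse one larger buffer and write it only when it fills up (getValue(), indexToSubstID and profils are taken from the question; the buffer size is arbitrary):
try (FileChannel channel = FileChannel.open(Paths.get("out.txt"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
    ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MB, reused for all rows
    Charset cs = Charset.forName("ISO-8859-1");
    int u = profils.size() - 1;
    for (int z = 0; z < u; z++) {
        for (int b = 0; b < z; b++) {
            double p = getValue();
            if (!Double.isNaN(p) && p > 0.55) {
                byte[] row = (indexToSubstID.get(z) + ";" + indexToSubstID.get(b) + ";" + p + "\n")
                        .getBytes(cs);
                if (buffer.remaining() < row.length) { // flush the buffer before it overflows
                    buffer.flip();
                    while (buffer.hasRemaining()) channel.write(buffer);
                    buffer.clear();
                }
                buffer.put(row);
            }
        }
    }
    buffer.flip(); // write whatever is left in the buffer
    while (buffer.hasRemaining()) channel.write(buffer);
}
Whether this actually beats a BufferedWriter with a large buffer is worth measuring; the time may well be dominated by getValue() or the disk itself.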

Why am I getting an OutOfMemory exception?

I am getting an OutOfMemory exception. Why? I am using this code for logging. Is this approach correct?
Exceptions and closing of streams are handled in parent methods.
private static void writeToFile(File file, FileWriter out, String message) throws IOException {
    if (file.exists() && file.isFile()) {
        if ((file.length() + message.getBytes().length) <= FILE_MAX_SIZE_B) {
            out.write(message);
        } else {
            int cutLenght = (int) (file.length() + message.getBytes().length - FILE_MAX_SIZE_B);
            FileInputStream fileInputStream = new FileInputStream(file);
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream));
            char[] buf = new char[1024];
            int numRead = 0;
            StringBuffer text = new StringBuffer(1000);
            while ((numRead = bufferedReader.read(buf)) != -1) {
                text.append(buf, 0, numRead);
            }
            String result = new String(text).substring(cutLenght);
            result += message;
            FileWriter fileWriter = new FileWriter(file, appendToFile);
            writeToFile(file, fileWriter, result);
            bufferedReader.close();
        }
    }
}
EDIT:
I am using this method for writing my logs to a file. So, for example, in one second I can log 10 messages. I am getting the error on these lines:
while ((numRead=bufferedReader.read(buf)) != -1) {
text.append(buf,0,numRead);
}
My guess is that you are getting the OutOfMemoryError because you are reading the entire contents of the log file back into memory once it has gotten too close to its maximum size.
You could instead read and write it in smaller chunks, but that could be tricky since you have to avoid overwriting something you haven't already read.
Overall, this technique seems like a very inefficient method of maintaining the log data. Some alternative approaches off the top of my head:
(1) maintain a set of n log files, each with maximum size FILE_MAX_SIZE_B/n. When the first log fills up, open the next one for writing, and so on; when the last one fills up, go back to the first one. This way you discard some of the oldest log data each time you switch files, but not all of it, and you still maintain your overall size limit (a rough sketch follows below).
(2) rotate the data within a single file. After each write, add a marker that indicates this is the end of the log stream. When the file has reached its maximum size, just start again at the beginning, overwriting the data that is there. The marker will tell you where the latest message is.
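A minimal sketch of alternative (1), with made-up names and a fixed UTF-8 charset just for illustration:
class RotatingLog {
    private final File[] slots;
    private final long maxPerFile;
    private int current = 0;

    RotatingLog(File dir, int n, long totalMaxSize) {
        slots = new File[n];
        for (int i = 0; i < n; i++) slots[i] = new File(dir, "log." + i + ".txt");
        maxPerFile = totalMaxSize / n;
    }

    synchronized void log(String message) throws IOException {
        byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
        if (slots[current].length() + bytes.length > maxPerFile) {
            current = (current + 1) % slots.length; // move to the next slot...
            slots[current].delete();                // ...discarding its old contents
        }
        try (FileOutputStream out = new FileOutputStream(slots[current], true)) {
            out.write(bytes); // append; no old data is ever read back into memory
        }
    }
}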
Try something like this:
void appendToFile(File f, CharSequence message, Charset cs, long maximumSize) throws IOException {
    long available = maximumSize - f.length();
    if (available > 0) {
        FileOutputStream fos = new FileOutputStream(f, true);
        try {
            CharBuffer chars = CharBuffer.wrap(message);
            ByteBuffer bytes = ByteBuffer.allocate(8 * 1024); // Re-used when encoding the string
            CharsetEncoder enc = cs.newEncoder();
            CoderResult res;
            do {
                res = enc.encode(chars, bytes, true);
                bytes.flip();
                long len = Math.min(available, bytes.remaining());
                available -= len;
                fos.write(bytes.array(), bytes.position(), (int) len);
                bytes.clear();
            } while (res == CoderResult.OVERFLOW && available > 0);
        } finally {
            fos.close();
        }
    }
}
Testable with this:
File f = new File(getCacheDir(), "tmp.txt");
f.delete();
// Or whatever charset you want.
Charset cs = Charset.forName("UTF-8");
int maxlen = 2 * 1024; // For this test, 2kb
try {
    for (int i = 0; i < maxlen / 20; i++) {
        // Write 30 characters for maxlen/20 times == guaranteed overflow
        appendToFile(f, "123456789012345678901234567890", cs, maxlen);
        System.out.println("Length=" + f.length());
    }
} catch (Throwable t) {
    t.printStackTrace();
}
f.delete();
Well, you're getting OOM because you're trying to load a huge file into memory.
Did you try opening it with the append option instead?
You get the OOME because you load the whole file and then take a part of the string. Instead, do a skip on your input stream and read from there.
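A rough sketch of that suggestion, reusing the names from the question (the scratch file is my addition; it assumes a single-byte charset, since skipping is byte-based and could otherwise split a multi-byte character):
long cutLength = file.length() + message.getBytes().length - FILE_MAX_SIZE_B;
File trimmed = File.createTempFile("log", ".tmp"); // hypothetical scratch file
try (FileInputStream in = new FileInputStream(file);
     FileOutputStream out = new FileOutputStream(trimmed)) {
    long skipped = 0;
    while (skipped < cutLength) { // drop the oldest bytes without loading them
        long n = in.skip(cutLength - skipped);
        if (n <= 0) break;
        skipped += n;
    }
    byte[] buf = new byte[8192];
    int read;
    while ((read = in.read(buf)) != -1) { // copy the rest in small chunks
        out.write(buf, 0, read);
    }
    out.write(message.getBytes()); // finally append the new message
}
// Then replace the original file with the trimmed copy, e.g. via Files.move with REPLACE_EXISTING.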

Java - Scanner not scanning after a certain number of lines

I'm doing some relatively simple I/O in Java. I have a .txt file that I'm reading from using a Scanner and a .txt file I'm writing to using a BufferedWriter. Another Scanner then reads that file and another BufferedWriter then creates another .txt file. I've provided the code below just in case, but I don't know if it will help too much, as I don't think the code is the issue here. The code compiles without any errors, but it's not doing what I expect it to. For some reason, charReader will only read about half of its file, then hasNext() will return false, even though the end of the file hasn't been reached. These aren't big text files: statsReader's file is 34 KB and charReader's file is 29 KB, which is even weirder, because statsReader reads its entire file fine, and it's bigger! Also, I do have that code surrounded in a try/catch, I just didn't include it.
From what I've looked up online, this may happen with very large files, but these are quite small, so I'm pretty lost.
My OS is Windows 7 64-bit.
Scanner statsReader = new Scanner(statsFile);
BufferedWriter statsWriter = new BufferedWriter(new FileWriter(outputFile));
while (statsReader.hasNext()) {
    statsWriter.write(statsReader.next());
    name = statsReader.nextLine();
    temp = statsReader.nextLine();
    if (temp.contains("form")) {
        name += " " + temp;
        temp = statsReader.next();
    }
    statsWriter.write(name);
    statsWriter.newLine();
    statsWriter.write(temp);
    if (! (temp = statsReader.next()).equals("-"))
        statsWriter.write("/" + temp);
    statsWriter.write("\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "\t");
    statsWriter.write(statsReader.nextInt() + "");
    statsWriter.newLine();
    statsReader.nextInt();
}
Scanner charReader = new Scanner(charFile);
BufferedWriter codeWriter = new BufferedWriter(new FileWriter(codeFile));
while (charReader.hasNext()) {
    color = charReader.next();
    name = charReader.nextLine();
    name = name.replaceAll("\t", "");
    typing = pokeReader.next();
    place = charReader.nextInt();
    area = charReader.nextInt();
    def = charReader.nextInt();
    shape = charReader.nextInt();
    size = charReader.nextInt();
    spe = charReader.nextInt();
    index = typing.indexOf('/');
    if (index == -1) {
        typeOne = determineType(typing);
        typeTwo = '0';
    }
    else {
        typeOne = determineType(typing.substring(0, index));
        typeTwo = determineType(typing.substring(index+1, typing.length()));
    }
}
SSCCE:
public class Tester {
    public static void main(String[] args) {
        File statsFile = new File("stats.txt");
        File testFile = new File("test.txt");
        try {
            Scanner statsReader = new Scanner(statsFile);
            BufferedWriter statsWriter = new BufferedWriter(new FileWriter(testFile));
            while (statsReader.hasNext()) {
                statsWriter.write(statsReader.nextLine());
                statsWriter.newLine();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is a classic problem: you need to flush and close the output stream (in this case statsWriter) before reading the file.
Being buffered, it doesn't actually write to the file with every call to write. Calling flush forces it to complete any pending write operations.
Here's the javadoc for OutputStream.flush():
Flushes this output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
After you have written your file with your statsWriter, you need to call:
statsWriter.flush();
statsWriter.close();
or simply:
statsWriter.close(); // this will call flush();
This is because you are using a BufferedWriter: it does not write everything out to the file as you call the write functions, but rather in buffered chunks. When you call flush() and close(), it empties all the content it still has in its buffer out to the file, and closes the stream.
You will need to do the same for your second writer.
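For example, the SSCCE from the question with try-with-resources, which closes (and therefore flushes) both the Scanner and the writer automatically:
try (Scanner statsReader = new Scanner(statsFile);
     BufferedWriter statsWriter = new BufferedWriter(new FileWriter(testFile))) {
    while (statsReader.hasNext()) {
        statsWriter.write(statsReader.nextLine());
        statsWriter.newLine();
    }
    // on leaving this block, both resources are closed and the writer is flushed
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}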
